I’m nearing the midterm evaluation deadline for GSoC with Indic project. My promised delivarable before the mid-term is a statistical sandhi-splitter. With a few days left, I’m glad to announce that the split point identification is working quite well.
The major problems when it comes to sandhi-splitting is the many levels of agglutination observed in indic languages and that we have to take into account the context and apply different rules. This is quite easy for humans who are trained all their life and can take into account the changes coming during pronouncing and writing.
Let’s consider the naive method. Let
AXyZB is a word.
y is a character which has more chances of being a split point. We can’t be sure if
y is a split point unless we take
Z into account. But we can’t do it for all
Z using a dictionary. A trie is better, but that alone won’t do.
I’m implementing this paper, as mentioned in a previous post. The approach followed is to capture common split points and non-split points from training data. There’s no way I can if else my way out of this. So, I take a decent sample of annotated data, and do the following:
For each word in the training data, I have also stored split points. I loop over the string from start to end. If the index is annotated as a split point, it might be a common end-start pattern. Otherwise, it might be a middle of a word, not a split point. This is captured by using two tries.
Each trie node stores count of split points and non split points. Two tries are maintained - one for capturing the forward pattern and the other for the backward pattern. If the point is a split point, I add one to the count of split points on nodes along the patterns in both tries. Else, I increment the counter for non-split points along the path in both tries.
Because the first characters commonly tend to have more non-split points rather than split points, we ignore the first few counts of split points and non-splits points, we check only after something which is termed as the authors as initial_skip.
['ണ', '്', 'ട'] [0.7633587786259542, 0.7633587786259542, 0.7633587786259542] [(100, 31), (100, 31), (100, 31)] --- Final: 0.7633587786259542 --
The above is the output for Malayalam. It captures the pattern similar ‘അവനുണ്ട്’, and probability checked at ‘അവന + ുണ്ട്’.
first: എന, second: ്താണെന first: നഎ, second: ്താണെന --- # Backwards pattern: Nodes visited in order: ['എ'] Probability : [0.0] List (C_sp, C_nsp) - [(0, 80)] --- Final: 0.0 --- # Forward pattern Nodes visited in order: ['ത', 'ാ', 'ണ', 'െ', 'ന'] Probability : [0.004149377593360996, 0.05128205128205128, 0.0, 0.0, 0.0] List (C_sp, C_nsp) -[(2, 480), (2, 37), (0, 19), (0, 4), (0, 4)] --- Final: 0.011086285775082454
The above on the other hand is a non-split point pattern. And therefore, the probability of it being a split point is very low.
I have the code ready for training, I’m planning to add a feature to load and store the model, which here are the two tries. I also have to generate more data so that accuracy can be improved over time. The model once trained, can be used for inference in the sandhi-splitter module. I’m almost past the hard part. Splitting can be rule based for now, but I’m thinking of using a dictionary of root words which is feasible for the language, and implement something similar to the prediction feature in keyboards to fill in blanks, rather than identifying transforms.
The file IO and the training script is python3 for now, I might have to replace it with codecs or something for python2.7 compatibility. Still clueless as to how to write tests for these.
The necessary files can be found here: sandhi-splitter.
Instructions are provided in the README of the repo.
Instead of having an initial skip, we could assign weights to the probabilities, since counts being higher deeper in the tree is more weighted than the counts initially. Some smoothing is already enabled to capture patterns missing from training data, but its nowhere near perfect.
(Comments disabled. Email me instead.)