Summer of Code '16

Background

The community bonding period of Google Summer of Code ’16 is almost over. My proposal for a sandhi-splitter for the Indic Project has been selected. The proposal draft has been made public and is available here.

For those too lazy to read the entire proposal, here’s a summary of what I have to implement:

  1. Sandhi-splitter for Malayalam: an attempt at solving the problem of agglutination in Indic languages, using Malayalam as the demonstration language.
  2. Integration of the sandhi-splitter with the spellchecker: since no corpus can cover every agglutinated form, the spellchecker checks the split words instead, which should improve its results.
  3. REST API for libindic: expose the services offered by the platform through a REST API.

I’m grateful to Devadath Vasudevan and Litton J Kurisinkel, who suggested the idea and whose original paper I’m going to implement as the first part. My mentors for this project are Vasudev Kamath and Hrishikesh K.B.

Progress of Work

As promised during the community bonding period, I’ve come up with a skeleton app for the annotation tool. I chose to write the tool as a web application in Flask, as libindic runs on Flask and most community members are familiar with it. The current setup is very basic but supports the required functionality. The part of the interface that lets you label split points is ready. I experimented with a rule-based system to transform the words once split, only to get awful results. To quote a famous webcomic: language isn’t a formal system, it’s glorious chaos. I found plenty of ambiguities, places where you can’t make definite rules.

After a meeting with the mentors, two changes were suggested: the annotation tool should be modified to support more languages, and it should be moved to an independent repository so that the sandhi-splitter repo contains only the splitter work.

You can find the working repos here:

The annotation helper still needs a few finishing touches to enable rule selection. Since I already have a decent dataset for Malayalam, though, I can start working on the main promised deliverable while modifying the tool and creating more data in parallel.

Problems

Here is an example where rules conflict. This happens when the rules are based purely on one or two of the characters at the junction. The example is in Malayalam.

ആയിരിക്കണം = ആയി + ഇരിക്കണം 
അവനെയാണ് = അവനെ + ആണ് 
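
To make the conflict concrete, here is a minimal sketch of a hypothetical rule keyed on two characters only: "യ followed by a vowel sign marks a junction". The rule, the name `naive_split` and the small sign table are my own illustration, not code from the annotation tool. It splits the second word correctly but cuts the first one in the wrong place.

```python
# Hypothetical two-character rule, for illustration only:
# "ya + vowel sign" is treated as a junction where ya was inserted as a
# glide and the sign stands for the initial vowel of the second word.
YA = "\u0d2f"  # യ
SIGN_TO_VOWEL = {
    "\u0d3e": "\u0d06",  # ാ -> ആ
    "\u0d3f": "\u0d07",  # ി -> ഇ
}

def naive_split(word):
    """Split at the first ya that is followed by a known vowel sign."""
    for i in range(len(word) - 1):
        if word[i] == YA and word[i + 1] in SIGN_TO_VOWEL:
            left = word[:i]
            right = SIGN_TO_VOWEL[word[i + 1]] + word[i + 2:]
            return left, right
    return word, ""  # no junction found

for word in ("അവനെയാണ്", "ആയിരിക്കണം"):
    parts = [p for p in naive_split(word) if p]
    print(word, "->", " + ".join(parts))
```

The first word comes out as അവനെ + ആണ്, which is right. The second comes out as ആ + ഇരിക്കണം instead of ആയി + ഇരിക്കണം, because the same two characters appear one position before the real junction.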

Sure, the solution is to capture a few more characters around the junction and take them into account. This is better modelled statistically than with a fixed rule-based system.
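
A rough sketch of what "modelled statistically" could look like: count how often each character context is a split point in annotated data, then read off a relative frequency. This is not the method from the paper, just the simplest version of the idea. The function names, the window size `k` and the split indices (counted in code points on the joined word) are all assumptions of mine.

```python
from collections import Counter

def contexts(word, k):
    """Yield (position, (left, right)) for every interior position,
    with up to k characters of context on each side."""
    for i in range(1, len(word)):
        yield i, (word[max(0, i - k):i], word[i:i + k])

def train(annotated, k):
    """annotated: iterable of (joined_word, split_index) pairs."""
    split_counts, total_counts = Counter(), Counter()
    for word, split_at in annotated:
        for i, ctx in contexts(word, k):
            total_counts[ctx] += 1
            if i == split_at:
                split_counts[ctx] += 1
    return split_counts, total_counts

def split_probability(model, ctx):
    """Relative frequency with which this context was a split point."""
    split_counts, total_counts = model
    seen = total_counts[ctx]
    return split_counts[ctx] / seen if seen else 0.0

# Toy data: the two words from this post, with assumed split indices.
data = [("അവനെയാണ്", 4), ("ആയിരിക്കണം", 3)]
model = train(data, k=2)
for word, _ in data:
    for i, ctx in contexts(word, 2):
        print(word, i, ctx, split_probability(model, ctx))
```

With two characters on each side, the യ in അവനെയാണ് and the യ in ആയിരിക്കണം land in different contexts, so they no longer collide. Real use would need far more data and some smoothing for unseen contexts.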

Since the rule-based splitting I added to the annotation tool gave awful results, I’m thinking of offering an interface where users can choose from existing rules or add new ones, leaving the annotation of the morphophonemic transform to humans rather than to the rule-based system.
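
Here is a small sketch of how such human-chosen rules might be represented. The `Rule` fields, the two example rules and `apply_rule` are hypothetical names for illustration; the actual tool may store things differently. The point is that the annotator picks both the position and the rule, and the code only applies the transform and refuses when the rule does not fit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    surface: str     # characters found at the junction in the joined word
    left_tail: str   # what the first word ends with after splitting
    right_head: str  # what the second word starts with after splitting

# Two illustrative rules (assumed, not taken from the tool).
GLIDE = Rule("ya-glide", "\u0d2f\u0d3e", "", "\u0d06")        # യാ -> "" + ആ
MERGE = Rule("i-merge", "\u0d3f", "\u0d3f", "\u0d07")         # ി -> ി + ഇ

def apply_rule(word, index, rule):
    """Apply a human-chosen rule at a human-chosen index."""
    if not word.startswith(rule.surface, index):
        raise ValueError(f"rule {rule.name!r} does not match at index {index}")
    left = word[:index] + rule.left_tail
    right = rule.right_head + word[index + len(rule.surface):]
    return left, right

print(" + ".join(apply_rule("അവനെയാണ്", 4, GLIDE)))
print(" + ".join(apply_rule("ആയിരിക്കണം", 2, MERGE)))
```

Adding a rule is then just adding another `Rule` value, and an ambiguous case is resolved by the person annotating rather than by rule priority.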
