GSoC Final Report

sandhi-splitter

This module was written from scratch and efficiently detects the constituent words of a given agglutinated word. Identifying the split points is language-agnostic, while splitting and joining are currently rule-based.

The module is hosted at libindic/sandhi-splitter.

Documentation, with examples of training, testing, and using the API, is currently hosted on gh-pages and is available here.
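To make the rule-based splitting and joining step more concrete, here is a minimal sketch. It is not the module's actual code or API: the rule table, the function names `split_at` and `join`, and the toy transliteration (where `A` stands for a long vowel) are all assumptions made for illustration. The split position is taken as given, standing in for the language-agnostic identification step.

```python
# Hypothetical rule table (an assumption, not taken from sandhi-splitter):
# surface string at the join -> (ending of left word, beginning of right word)
SPLIT_RULES = {
    "A": ("a", "A"),  # toy rule: "a" + "A" merge into "A" at the join
}


def split_at(word, index, rules=SPLIT_RULES):
    """Split `word` at `index`, undoing a merge if a rule matches there."""
    for surface, (left_end, right_start) in rules.items():
        if word.startswith(surface, index):
            left = word[:index] + left_end
            right = right_start + word[index + len(surface):]
            return left, right
    # No rule applies at this position: fall back to a plain cut.
    return word[:index], word[index:]


def join(left, right, rules=SPLIT_RULES):
    """Join two words, applying the first rule whose context matches."""
    for surface, (left_end, right_start) in rules.items():
        if left.endswith(left_end) and right.startswith(right_start):
            stem = left[:len(left) - len(left_end)]
            return stem + surface + right[len(right_start):]
    # No rule applies: plain concatenation.
    return left + right


parts = split_at("devAlaya", 3)
rejoined = join(*parts)
```

In this sketch the same table drives both directions, so a word that was split by a rule can be rejoined to its original surface form.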

Applications

  • Enhance the root word corpus
  • Improve the spellchecker (discussed below)

spellchecker, enhanced

The next proposed addition was enhancing the spellchecker with sandhi-splitter.

The implementation inherits from the existing Malayalam spellchecker, which Balasankar C had improved to handle inflections. Adding sandhi-splitter improves the results further.

The corresponding pull request can be found here.
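The idea behind the enhanced spellchecker can be sketched as follows. This is only an illustration under stated assumptions, not the code from the pull request: `candidate_splits` is a hypothetical stand-in for the splitter that tries plain cuts at every position (no sandhi rules), and `is_correct` is a hypothetical name for the check. A word missing from the dictionary is still accepted if it splits into two known words.

```python
def candidate_splits(word):
    """Hypothetical stand-in for the splitter: yield every two-way plain cut."""
    for i in range(1, len(word)):
        yield word[:i], word[i:]


def is_correct(word, dictionary):
    """Accept a word if it is known, or if some split yields two known words."""
    if word in dictionary:
        return True
    return any(
        left in dictionary and right in dictionary
        for left, right in candidate_splits(word)
    )


dictionary = {"rain", "bow"}
known = is_correct("rain", dictionary)
compound = is_correct("rainbow", dictionary)
misspelt = is_correct("rainbw", dictionary)
```

The dictionary lookup runs first, so the splitter is consulted only for words the base check would otherwise reject.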

REST API for libindic

WSME was used to provide REST services for the libindic modules. These services are plugged into the main libindic application as a modular component through Flask blueprints.
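For readers unfamiliar with Flask blueprints, the following minimal sketch shows how a separately defined set of routes can be mounted on a main application. It uses plain Flask only and leaves WSME out; the blueprint name, the `/api` prefix, the `/modules` route, and the returned module list are all assumptions for illustration and do not reflect the actual libindic endpoints.

```python
from flask import Blueprint, Flask, jsonify

# Hypothetical blueprint holding the REST routes, defined apart from the main app.
api = Blueprint("api", __name__)


@api.route("/modules")
def list_modules():
    # Illustrative payload only; not the real libindic module registry.
    return jsonify(modules=["sandhisplitter", "spellchecker"])


def create_app():
    """Build the main application and mount the blueprint under /api."""
    app = Flask(__name__)
    app.register_blueprint(api, url_prefix="/api")
    return app


app = create_app()
```

Because the routes live in the blueprint, the REST layer can be developed and registered independently of the rest of the application.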