sandhi-splitter

Module Design

[Figure: UML diagram of the module design (_images/libindic-uml.png)]

Installation

The module supports both Python 2.7 and Python 3. To install:

git clone https://github.com/libindic/sandhi-splitter.git
cd sandhi-splitter
python setup.py install

For development purposes, you can use a virtualenv.

git clone https://github.com/libindic/sandhi-splitter.git
cd sandhi-splitter
virtualenv env
. env/bin/activate
python setup.py install

Training and Testing a Model

Two console scripts are provided for training and testing. After installation, use them to generate and test models.

Each script prints its usage when run with -h.

Training

$ sandhisplitter_train -h
    usage: sandhisplitter_train [-h] -k DEPTH -s SKIP -i TRAINFILE -o MODELFILE

    Train a model

    optional arguments:
      -h, --help            show this help message and exit
      -k DEPTH, --depth DEPTH
                            depth of the trie
      -s SKIP, --skip SKIP  initial skip
      -i TRAINFILE, --trainfile TRAINFILE
                            path to training file
      -o MODELFILE, --outputfile MODELFILE
                            path to store model

To understand the parameters DEPTH and SKIP, please refer to the paper this is based on: Sandhi Splitter for Malayalam.

TRAINFILE should contain data in the following format:

word=word1+word2+word3|loc1,loc2,loc3

Each location is the 0-based index of the character immediately before a split point. For example, takeaway splits after the "e" at index 3:

takeaway=take+away|3

See data/ss_data in the repository for more entries in this format.
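
The TRAINFILE format above can be checked with a few lines of Python. This is only a minimal sketch, not code from the module: parse_training_line and surface_segments are hypothetical helper names introduced here for illustration, and the handling of an entry with no locations is an assumption.

```python
def parse_training_line(line):
    """Parse 'word=part1+part2|loc1,loc2' into (word, parts, locations).

    Hypothetical helper, not part of sandhisplitter.
    """
    entry, _, loc_field = line.strip().partition("|")
    word, _, parts_field = entry.partition("=")
    parts = parts_field.split("+")
    # Assumption: an entry without split points has an empty location field.
    locations = [int(x) for x in loc_field.split(",") if x]
    return word, parts, locations


def surface_segments(word, locations):
    """Cut the surface word after each listed 0-based index."""
    segments, start = [], 0
    for loc in locations:
        # loc is the index of the character *before* the split point,
        # so the segment includes word[loc].
        segments.append(word[start:loc + 1])
        start = loc + 1
    segments.append(word[start:])
    return segments


word, parts, locations = parse_training_line("takeaway=take+away|3")
print(word, parts, locations)
print(surface_segments(word, locations))
```

For takeaway the surface segments happen to equal the listed parts. For real sandhi the parts can differ from the surface slices, since characters change at the join (compare the Malayalam examples under "Using the module").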

Testing

$ sandhisplitter_benchmark_model -h
    usage: sandhisplitter_benchmark_model [-h] -m MODELFILE -t TESTFILE -o OUTPUT

    Test a model

    optional arguments:
      -h, --help            show this help message and exit
      -m MODELFILE, --modelfile MODELFILE
                            path to model file
      -t TESTFILE, --testfile TESTFILE
                            path to test file
      -o OUTPUT, --output OUTPUT
                            file to store output

The MODELFILE here is the output generated by the training script. TESTFILE should follow the same syntax as the one mentioned in training.

The test prints Accuracy, Precision, Recall and True Negative Rate to the console. For more information on these measures, refer to Precision and recall (Wikipedia).
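
For reference, the four measures follow from the usual confusion-matrix counts. The sketch below uses the standard definitions; it is not taken from the benchmark script. It assumes each candidate position is treated as a binary split / no-split decision, and benchmark_metrics is a hypothetical name.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Return standard binary-classification measures from raw counts.

    tp: split points correctly predicted
    fp: split points predicted where none exists
    tn: non-split positions correctly left alone
    fn: real split points that were missed
    """
    def ratio(numerator, denominator):
        # float() keeps true division on Python 2.7 as well;
        # an empty denominator is reported as 0.0 instead of raising.
        return float(numerator) / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "true_negative_rate": ratio(tn, tn + fp),
    }


# Made-up counts, purely for illustration.
metrics = benchmark_metrics(tp=8, fp=2, tn=88, fn=2)
for name in sorted(metrics):
    print(name, round(metrics[name], 3))
```

Split points are rare compared to non-split positions, so accuracy and true negative rate tend to be high for almost any model. Precision and recall are usually the more informative pair.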

Using the module

>>> from sandhisplitter import Sandhisplitter
>>> s = Sandhisplitter()
>>> s.split('ആദ്യമെത്തി')
(['ആദ്യം', 'എത്തി'], [4])
>>> s.split('വയ്യാതെയായി')
(['വയ്യാതെ', 'ആയി'], [7])
>>> s.split('അവന്റെയാണ്')
(['അവന്റെ', '', 'ആണ്'], [5, 6])
>>> s.split('എന്നെക്കൊണ്ടുവയ്യ')
(['എന്നെക്കൊണ്ടുവയ്യ'], [])
>>> s.split('ഇന്നത്തെക്കാലത്ത്')
(['ഇന്നത്തെക്കാലത്ത്'], [])
>>> s.split('എന്തൊക്കെയോ')
(['എന്ത്', 'ഒക്കെയോ'], [3])