This post was written two years back on a website I used to report results to my advisor. I spent quite some time doing quite unnecessary work, figuring out what was wrong and why I was getting astonishingly poor results compared to those reported in the literature. The corpus page briefly hosted a filtered version from Saumitra Yadav, which I thought was a nice gesture, and which seems since to have been taken down. To show people why not to use a corpus as-is, from my hard learnings, I publish the post here, backdated, with my experiences. The post was added to this site on 28 November 2020.

Some of the writing in this document is not very kind; account for the fact that it came from an irritated research scholar who was being held to explain his poor results by a strong faculty advisor, and had to go and run a set of brute-force experiments that would never make it to publication. I have since built many resources out of the usable parts of the IITB Hindi-English corpus, and I am grateful to the people who released it; in the process, I have come to empathize with their plight as well.

# Premise

I’ve been trying to produce results with the IIT-Bombay Hindi-English parallel corpus, but for NMT the dataset doesn’t seem to give good results at all.

I’m in no way able to match the scores reported by the IIT-B teams. With some tinkering and adjustments to the training corpus, I can fine-tune a model for the test set, but I still doubt the suitability of the IIT-Bombay Hindi-English parallel corpus as a good training set for neural machine translation.

# Architecture and Framework

• Framework: OpenNMT-py
• Preprocessing:
  • Primitive tokenization (whitespace- and punctuation-based)
  • BPE
• Model architecture:
  • Encoder: BRNN, 500 units × 2 layers
  • Decoder: RNN, 500 units × 2 layers
  • Attention: Luong’s general attention
  • Decoding: beam search, beam width 5
• Training configuration:
  • 100 epochs
  • The model with the best (validation accuracy, validation perplexity) is chosen for testing
• Testing:
  • Unknowns are replaced with the source word receiving the highest attention
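The BPE step above is the standard merge-learning loop. A toy re-implementation to make it concrete (in practice an off-the-shelf BPE tool was used; `learn_bpe` and `num_merges` here are purely illustrative):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of word tokens (toy sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

On a corpus of `["low", "low", "lower"]`, the first merges learnt are the frequent prefixes `l+o` and `lo+w`.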

# Objectives

• Throughout, I’ll keep my model constant, since I’m primarily interested in how data affects training and generalization.

• To figure out what exactly is happening, see whether the individual datasets that constitute the IIT-Bombay corpus do any better on their own.

• See how well each dataset transfers to the others, and use these edge weights as a metric to combine datasets into a better training set.

# Experiments

## Individual Datasets

The following are the results on the individual corpora that constitute the IIT-B Hindi-English parallel dataset.

| Section | B-1 | B-2 | B-3 | B-4 | BLEU | Perplexity |
|---|---|---|---|---|---|---|
| gnome | 74.1 | 63.7 | 58.8 | 55.6 | 54.87 | 1.17 |
| tanzil | 59.8 | 42.2 | 37.5 | 35.7 | 33.54 | 1.07 |
| ted | 55.6 | 30.1 | 17.8 | 11.0 | 23.16 | 12.66 |
| govtweb | 49.2 | 23.2 | 12.3 | 7.3 | 14.37 | 40.01 |
| hiencorp | 45.6 | 21.3 | 12.1 | 7.7 | 14.03 | 20.41 |
| mahashabdkosh | 40.6 | 16.8 | 6.9 | 3.3 | 10.26 | 63.01 |
| books | 39.6 | 13.0 | 4.8 | 2.0 | 7.29 | 41.39 |
| judicial | 22.6 | 4.2 | 1.0 | 0.3 | 1.90 | 125.82 |
| indicparallel | 4.8 | 1.7 | 0.9 | 0.4 | 1.34 | 307.91 |
| opensubs | 32.7 | 7.9 | 2.9 | 0.8 | 1.25 | 123.40 |
| kde | 2.7 | 1.1 | 0.0 | 0.0 | 0.00 | 255.67 |
| wikihead | 9.2 | 6.7 | 2.9 | 0.0 | 0.00 | 1252.89 |
| hienwnetlinkage | 0.1 | 0.0 | 0.0 | 0.0 | 0.00 | 1485.97 |
| tatoeba | 22.7 | 4.1 | 0.2 | 0.0 | 0.00 | 199.22 |
| whole | 11.4 | 1.0 | 1.2 | 0.1 | 0.0 | 173.68 |

Some notes:

1. I’m using a train, dev, test = (0.8, 0.05, 0.15) split for each individual corpus, sampled randomly without replacement so the three sets are disjoint.
2. The “whole” row is reported on the IIT-B parallel dataset’s own train/dev/test splits.
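The split in note 1 is just a shuffle-and-slice; a minimal sketch (`split_corpus` and the fixed seed are my own illustration):

```python
import random

def split_corpus(pairs, ratios=(0.8, 0.05, 0.15), seed=0):
    """Random train/dev/test split without replacement:
    shuffle once, then slice into three disjoint chunks."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]
    return train, dev, test
```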

## Transfer across constituents

My experiments are similar to the ones performed in Six Challenges for Neural Machine Translation. I’m using only those subsets of the corpus for which the reported perplexity is low and the BLEU on the individual training set is high, since I don’t expect the others to transfer at all.

BLEU scores are reported below for transfer from the training corpus in the column to the test set of the corpus in the row.

| transfer (column → row) | books | govtweb | hiencorp | tanzil | ted |
|---|---|---|---|---|---|
| books | 7.29 | 3.92 | 5.11 | 0.15 | 2.36 |
| gnome | 0.86 | 4.17 | 39.86 | 0.0 | 4.33 |
| govtweb | 9.57 | 14.37 | 9.7 | 0.07 | 3.67 |
| hiencorp | 3.92 | 4.41 | 14.03 | 0.1 | 2.91 |
| hienwnetlinkage | 0.0 | 0.26 | 0.95 | 0.0 | 0.0 |
| indicparallel | 4.37 | 3.31 | 19.58 | 0.0 | 1.41 |
| judicial | 8.16 | 8.27 | 8.47 | 0.0 | 5.03 |
| kde | 0.45 | 12.88 | 13.51 | 0.0 | 6.89 |
| mahashabdkosh | 7.31 | 9.44 | 6.99 | 0.0 | 5.08 |
| opensubs | 3.05 | 4.01 | 6.46 | 0.0 | 11.83 |
| tanzil | 1.16 | 0.53 | 1.01 | 33.54 | 0.82 |
| tatoeba | 11.51 | 12.58 | 16.1 | 0.42 | 7.22 |
| ted | 4.79 | 6.89 | 9.48 | 0.23 | 23.16 |
| wikihead | 0.17 | 7.78 | 36.13 | 0.0 | 0.84 |
| iitb-parallel | 5.04 | 5.37 | 5.37 | 0.0 | 3.96 |
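For reference, the corpus BLEU reported throughout is clipped n-gram precision combined with a brevity penalty. A standard BLEU script produced the numbers above; this toy `corpus_bleu` only sketches the computation for a single reference per hypothesis:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis (toy sketch)."""
    num = [0] * max_n   # clipped n-gram matches, per order
    den = [0] * max_n   # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            num[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            den[n - 1] += sum(h.values())
    if min(num) == 0:
        return 0.0  # no matches at some n-gram order
    log_prec = sum(math.log(num[i] / den[i]) for i in range(max_n)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

A perfect match scores 100; a hypothesis sharing no 4-grams with its reference scores 0, which is why the tiny corpora above bottom out at 0.00 despite non-zero B-1.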

### Inferences

I’m putting forth the following takeaways from the table above.

• It looks like books, govtweb, hiencorp and ted are the useful parts of the dataset, the ones that generalize to the given dev/test sets (news crawls).
• I’ll try a few more combinations, though, to see the variation.
• Distributions distant from the test set may end up hurting a model that is trying to generalize over a more diverse corpus.
• I’m going to call tanzil, gnome, kde and the dictionaries mostly noise, and distant from the test distribution.
• hiencorp transfers well to gnome, which is unexpected. Perhaps whatever holds in the larger dataset applies to hiencorp as well.

## Refining training data based on transfer stats

I’ll now evaluate on just the IIT-B test set, using models trained on combinations of the above corpora.

### Quantitative

The percentage of the actual IIT-B training data used to obtain each BLEU score is indicated. We’re able to achieve better results with a fraction of the training set.

| combination | % of iitb-train | BLEU on iitb-test |
|---|---|---|
| govtweb + hiencorp + ted + tatoeba + indicparallel + opensubs | 24.28 | 9.62 |
| govtweb + hiencorp + ted | 23.24 | 9.94 |
| govtweb + ted + books | 21.03 | 10.22 |
| govtweb + hiencorp + ted + books + wikihead | 37.18 | 10.50 |
| govtweb + hiencorp + ted + books | 35.41 | 10.86 |
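Building each combined training set is plain concatenation of the selected constituents, with duplicate sentence pairs dropped (a sketch; `combine_corpora` is my own helper name, and the fixed iteration order is just for reproducibility):

```python
def combine_corpora(corpora):
    """corpora: dict mapping corpus name -> list of (src, tgt) pairs.
    Concatenate in a fixed (sorted) order, dropping duplicate pairs."""
    seen = set()
    combined = []
    for name in sorted(corpora):
        for pair in corpora[name]:
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined
```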

### Qualitative

A BLEU of around 10 is good enough for translations to start making sense, so at this point it may be a good idea to look at the data.
The translations from refined-4 are most likely the best. Most of them make sense; they’re not complete garbage, at the very least.

### Conclusions

I think it’s reasonable enough to make the following conclusions at this point:

• BLEU is a bad, horrible metric. At least for NMT-based approaches, we need a consensus-based metric that scores multiple hypotheses for each input sentence.

• The IITB Hindi English Corpus is a disaster. Maybe a necessary evil to some people, but I’d say it’s more trouble than it’s worth.
  • The corpus collects data from a mountain of sources, most of which are practically useless for generalizing to news crawl.
  • I’m not even sure the test set is learnable from the train set at all. When creating a benchmark dataset, the train set should at least contain enough data to learn a distribution that the test set is also drawn from. I’m fairly certain the test set contains entirely new vocabulary.
  • There is scope for releasing a new, natural dataset in place of this mess.

# Forward

## Better(?) Data Sampling

Related: Dynamic Data Selection for NMT

One thing I’ve seen a lot while training character language models is that frequent patterns are learnt quickly, and their perplexities tend to be lower.

OpenNMT-py has a bunch of metrics which quantify the above, and I believe these may be of use in identifying the trainable, “good” corpus within the training set.
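As a crude sketch of the idea, even a unigram language model fit on the training set can rank sentences by perplexity, and keeping the low-perplexity fraction is the simplest form of this selection (all names here are my own; the real scores would come from the trained NMT model):

```python
import math
from collections import Counter

def unigram_perplexities(corpus):
    """Per-sentence perplexity under an add-one-smoothed unigram LM
    fit on the corpus itself. corpus: list of token lists."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts)
    scores = []
    for sent in corpus:
        avg_lp = sum(
            math.log((counts[w] + 1) / (total + vocab)) for w in sent
        ) / max(len(sent), 1)
        scores.append(math.exp(-avg_lp))
    return scores

def keep_low_perplexity(corpus, frac=0.5):
    """Keep the lowest-perplexity fraction of sentences, in original order."""
    scores = unigram_perplexities(corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i])
    keep = set(ranked[:int(len(corpus) * frac)])
    return [corpus[i] for i in sorted(keep)]
```

Frequent patterns get low perplexity and survive the cut; rare, noisy sentences are dropped first.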