
Statistical Prediction

Statistical prediction is different from dictionary-based prediction. It is an innovation that we explain here.

We are all familiar with the dictionary-based prediction that works on our phones: automatic word completion, word suggestions and so on. These features are typically implemented as a lookup table over a finite dictionary stored on the phone.

At first sight, many people assume that the prediction model of Panini Keypad is also based on some kind of dictionary, but it is not.

Dictionary-based models suffer from a blind spot that we are all too familiar with. They frustrate you when you try to write a word that is not in the dictionary, such as the name of a person or a place, and such words are very common in personal communication sent from a phone. We were aware of this drawback from the time we were at the drawing board.

The prediction model of Panini Keypad is instead based on linguistics, or what has been called statistical above. In practice it is built by using computers to mine a large amount of text in a particular language, called a corpus (plural: corpora), to obtain character frequency and probability information, and then using that information to build the prediction model. Applying this to the world of text input was an innovation from us. The approach has worked extremely well for all our languages and has many advantages. We have since explored this model for languages from each linguistic family of the world, and in each it has worked extremely well. All the world's languages are alike in how strongly correlated their character patterns are, and that can be exploited for more efficient input mechanisms.
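To make the idea concrete, here is a minimal sketch of how character statistics mined from a corpus could drive next-character prediction. It is not the actual Panini Keypad model, whose details are not given here. The function names `build_model` and `predict_next`, the one-character context, and the tiny sample corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_model(corpus):
    """Count, for each character, which characters follow it in the corpus."""
    follow = defaultdict(Counter)
    for word in corpus.lower().split():
        padded = "^" + word  # "^" is an assumed marker for the start of a word
        for prev, nxt in zip(padded, padded[1:]):
            follow[prev][nxt] += 1
    return dict(follow)

def predict_next(model, prefix, k=3):
    """Return up to k most frequent next characters after the typed prefix."""
    last = prefix[-1].lower() if prefix else "^"
    counts = model.get(last, Counter())
    return [ch for ch, _ in counts.most_common(k)]

# Toy corpus for illustration only; a real model would be mined from a large corpus.
corpus = "the theory then there three this that"
model = build_model(corpus)
print(predict_next(model, ""))    # likely first characters of a word
print(predict_next(model, "t"))   # likely characters after "t"
print(predict_next(model, "th"))  # likely characters after "h"
```

A real model would presumably use a longer context than one character and a far larger corpus, but the principle is the same: the table stores frequencies of character sequences, not words.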

Let me explain this further. Every language has linguistic quirks that often go unrecognized. For example, Indian languages are full of occurrences of the phoneme "bh", but that letter combination is almost entirely absent from English: you can hardly come up with an English word that contains it. Similarly, take the character combination "aa". It is almost absent in English but very common in Dutch, which uses a lot of double vowels, and hence also in Afrikaans. Languages are full of such patterns, some of them discoverable only through analysis by computers, which can look at a lot of information at one time.
Let's make it even more apparent.
Suppose you were to look at a passenger list from an international flight like the following:
1. Ruzz Lavotski.
2. Abak Meheruia.
3. Tegendra Mukta.
4. Nill Coldy.
5. Hubiki Kamugara.

Which name do you think is Japanese, and which do you think is East European?

By the way, I made these names up just now, taking care that they were not the real names of anyone I had heard of before, and yet hardly anyone would get the answer wrong. That is how powerful linguistic patterns are. And that fact is open to exploitation, in this case in designing an input technology.

The interesting thing is that the linguistic patterns of a language hold across both dictionary and non-dictionary words. Places in England have names that differ from the names of places in Russia. Names of people in Kenya differ from the names of people in Oman, and these names follow the same linguistic patterns as the language itself.
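The contrast with a dictionary can be sketched in a few lines. The example below is an illustration under stated assumptions, not the product's implementation: the word list `WORDS`, the functions `dictionary_complete`, `train` and `statistical_suggest`, and the place name "lowton" are all hypothetical.

```python
from collections import Counter, defaultdict

# Toy "dictionary" of place names, assumed for illustration.
WORDS = ["london", "longford", "loughton", "luton", "lincoln"]

def dictionary_complete(prefix):
    """Dictionary lookup: only words already in the list can be completed."""
    return [w for w in WORDS if w.startswith(prefix)]

def train(words):
    """Count which character follows which across all the words."""
    follow = defaultdict(Counter)
    for w in words:
        for a, b in zip(w, w[1:]):
            follow[a][b] += 1
    return follow

def statistical_suggest(follow, prefix, k=3):
    """Suggest next characters from the statistics of the last typed character."""
    counts = follow.get(prefix[-1], Counter())
    return [c for c, _ in counts.most_common(k)]

follow = train(WORDS)

# "lowton" is a made-up name that is not in the word list.
print(dictionary_complete("lowt"))          # the dictionary has nothing to offer
print(statistical_suggest(follow, "lowt"))  # the statistics still suggest a letter
```

The dictionary lookup fails as soon as the prefix leaves the word list, while the character statistics learned from the same words keep producing suggestions, because the new name follows the same patterns.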

Advantages of statistical prediction over dictionary-based prediction

1. Seamless usability for the user, for words both within and outside the dictionary.

2. A far more lightweight implementation than a dictionary-based one, because the statistical information can be stored far more compactly than hundreds of thousands of words.

3. The statistical models can be made as large or as tiny as an implementation demands, because statistical tables follow the law of diminishing returns. One can therefore build something useful with a 10 Kb footprint that fits on a SIM card and yet contains the essential signatures of a living language, whose vocabulary is effectively unbounded.

4. Dictionaries are always inadequate, because new words are always being added to a language. No dictionary, however large, can be considered complete.

5. Statistical models are inexpensive and quick to develop. Dictionaries require in-house linguists and laborious manual cleaning and validation. Statistical models can be built as soon as you can get hold of a corpus, very quickly and on predictable schedules, as on an assembly line.
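The diminishing-returns point in item 3 can be illustrated with a small sketch. It measures how much of the observed text is covered when only the most frequent character pairs are kept. The functions `bigram_counts` and `coverage` and the toy text are assumptions for illustration, and nothing here reflects the actual storage format of the product.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs inside each word of the text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(zip(word, word[1:]))
    return counts

def coverage(counts, keep):
    """Fraction of all pair occurrences covered by the `keep` most frequent pairs."""
    total = sum(counts.values())
    kept = sum(n for _, n in counts.most_common(keep))
    return kept / total if total else 0.0

# Toy text for illustration only.
text = "the theory then there three this that"
counts = bigram_counts(text)

for keep in (1, 2, 4, 8, len(counts)):
    print(keep, "pairs kept ->", round(coverage(counts, keep), 2), "of occurrences covered")
```

Because a few pairs account for a large share of all occurrences, each additional table entry adds less coverage than the one before. This is why a table can be truncated to fit a small memory budget while keeping most of its predictive value.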

Indian languages can be considered complex in their composition rules, but they are spectacular in how correlated those complex patterns are. Hence the new model has worked very well for them.

It did involve a lot of laborious work for the employees of Luna Ergonomics, and that work has proved fruitful. This is what made it possible to build quickly for so many languages without knowing how to read them.

Panini Keypad