This project illustrates a 'next word' text prediction technique based on a simple statistical model of language. In this model, the predicted 'next word' is the word with the highest probability of occurrence conditioned on the words that precede it. Three source texts were used to estimate the conditional probabilities and to evaluate the performance of the prediction algorithm. The same sources were used to build the final demonstration project, implemented as a publicly accessible web page at data-dancer.com/langModel.
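In symbols (this notation is my formulation, not quoted from the project), given the $k$ most recent words $w_{n-k}, \ldots, w_{n-1}$, the model predicts

$$\hat{w}_n = \arg\max_{w} \; P\!\left(w \mid w_{n-k}, \ldots, w_{n-1}\right),$$

where the conditional probability is estimated from n-gram counts in the source texts, as described below.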
The algorithm predicts the most likely word to immediately follow an input word or phrase. More precisely, it returns a set of up to 10 candidate words, ranked in decreasing order of likelihood, as predicted by each source text individually and by all source texts combined. The probability of occurrence of each word in the set is displayed graphically.
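A minimal sketch of this top-10 lookup in Python, assuming the model is stored as a table mapping each prefix to its observed suffix counts (the data structure and function names here are hypothetical, not the project's actual code):

```python
from collections import Counter

def predict_next(prefix_counts, prefix, k=10):
    """Return up to k candidate next words for a prefix, most likely first.

    prefix_counts: dict mapping a prefix tuple to a Counter of suffix counts
    (structure assumed for illustration).
    """
    suffixes = prefix_counts.get(prefix)
    if not suffixes:
        return []
    total = sum(suffixes.values())
    # Relative frequency serves as the estimated conditional probability.
    return [(word, count / total) for word, count in suffixes.most_common(k)]

# Example query with a 2-word prefix and toy counts.
counts = {("of", "the"): Counter({"day": 5, "year": 3, "world": 2})}
print(predict_next(counts, ("of", "the")))
# [('day', 0.5), ('year', 0.3), ('world', 0.2)]
```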
The source texts comprise excerpts from online news sources, blogs, and Twitter posts. These three sources vary in vocabulary, writing style, and word statistics. The source texts were processed in the following way:
The texts were tokenized at word boundaries into 2-, 3-, and 4-grams on a line-by-line (i.e., sentence-by-sentence) basis. Each n-gram was split into a "prefix" (the first n-1 words of the n-gram) and a "suffix" (the last word). Finally, the relative frequency of occurrence of each suffix for a given prefix was calculated, both per document and for all three documents combined.
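As a rough illustration of this counting pipeline (the whitespace tokenizer and data structures here are simplifications for the sketch, not the project's actual implementation):

```python
from collections import Counter, defaultdict

def ngram_counts(lines, n):
    """Count suffix occurrences per (n-1)-word prefix, one line at a time."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()  # crude word-boundary tokenization
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            prefix, suffix = tuple(gram[:-1]), gram[-1]
            counts[prefix][suffix] += 1
    return counts

def relative_frequencies(counts):
    """Convert raw suffix counts to relative frequencies per prefix."""
    freqs = {}
    for prefix, suffixes in counts.items():
        total = sum(suffixes.values())
        freqs[prefix] = {w: c / total for w, c in suffixes.items()}
    return freqs

# Build 2-, 3-, and 4-gram tables for one document (toy input).
doc = ["the cat sat on the mat", "the cat ran"]
tables = {n: relative_frequencies(ngram_counts(doc, n)) for n in (2, 3, 4)}
print(tables[2][("the",)])  # {'cat': 0.666..., 'mat': 0.333...}
```

Combining the three sources amounts to running the same counting over all documents' lines together, so per-document and combined tables can share this code.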
Prediction works as follows: