Scope

This document presents the current status of the text prediction project. Some basic statistics and observations about the source documents are included, as well as some analysis of N-gram tokenization, and methods for handling certain language constructs such as money, dates, and URLs. The prediction method is described, and operating instructions are given for the application hosted here.

Source documents

This project uses the English source documents found at: Capstone Dataset.

The file statistics are shown in the following table:

Document            # Lines   Max line length      # Words   # Characters   Ave # char/wd   Ave # wd/ln
en_US.blogs.txt     899,288            40,832   37,272,578    210,160,014             5.6          44.4
en_US.news.txt    1,010,243            11,384   34,309,641    205,811,886             6.0          34.0
en_US.twitter.txt 2,360,148               140   30,341,028    167,105,338             5.5          12.9

The Twitter data reflects the requirement that tweets be no longer than 140 characters, as shown. What surprised me was that the average number of characters per word for tweets is similar to that of the other sources. I would have expected a much larger use of abbreviations; on the other hand, abbreviations are not much shorter than 4-5 letters anyway, so the effect is small.

Sampling

The documents in the corpus were sampled to reduce processing time during exploration. A random sample of 8000 lines was taken from each of the three documents, and a smaller version of the original corpus was constructed from the sampled files. The sampling code is shown here for one of the files:

# Directory containing the English source files
dirName <- 'D:/Capstone-language-analysis/Coursera-SwiftKey/final/en_US'
# Read the blog file, skipping embedded nulls
blog <- readLines(paste0(dirName,"/en_US.blogs.txt"),skipNul=TRUE,encoding='latin1')
nb <- length(blog)
sampleSize <- 8000
set.seed(1340)                            # reproducible sample
sblog <- blog[sample(1:nb,sampleSize)]    # 8000 randomly chosen lines
writeLines(sblog,"D:/Capstone-language-analysis/Coursera-SwiftKey/final/en_US_sample/en_US.blog.sample.txt")

and the smaller corpus was formed from the sampled files as shown:

library(tm)                                # Corpus, DirSource
dirName <- 'D:/Capstone-language-analysis/Coursera-SwiftKey/final/en_US_sample'
dCorpus <- Corpus(DirSource(dirName))      # corpus built from the sampled files

Data cleanup

The raw files include special sequences that can be identified as dates, times, money, email addresses, and web URLs. The text also includes numbers of various other kinds, impolite words, and punctuation marks. Finally, there are many filler words, such as articles (“a”, “an”, “the”), certain modifiers, prepositions, and pronouns, that carry little information; these are called “stop words”. We consider each of these below:

Dates, times, money, email, and web URLs

We consider these elements as “generic”, with little specific predictive value. That is, it is conceivable that a time, for instance, could follow a phrase (e.g., “8 am” after “I start work at”), but it is not clear that “8 am” should be preferred to, say, “9 am”. So while a phrase may anticipate a time, it is less likely to anticipate a specific time. All times were therefore encoded as “<time>”. The same argument applies to dates, money, email addresses, and websites, so these were identified using regular expressions and tokenized as “<date>”, “<money>”, “<email>”, and “<URL>”, respectively.
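As a rough illustration of this step, the sketch below replaces such sequences with generic tokens using gsub. The function name and the regular expressions are simplified assumptions for illustration, not the exact patterns used in the project.

# Illustrative only: the patterns below are simplified stand-ins for the
# regular expressions actually used.
encodeSpecials <- function(x) {
  x <- gsub("\\d{1,2}:\\d{2}\\s?(am|pm)?", " <time> ",  x, ignore.case=TRUE, perl=TRUE)
  x <- gsub("\\d{1,2}/\\d{1,2}/\\d{2,4}",  " <date> ",  x, perl=TRUE)
  x <- gsub("\\$\\d+(\\.\\d{2})?",         " <money> ", x, perl=TRUE)
  x <- gsub("https?://\\S+",               " <URL> ",   x, perl=TRUE)
  x <- gsub("[[:alnum:]._-]+@[[:alnum:].-]+", " <email> ", x, perl=TRUE)
  x
}
encodeSpecials("Lunch at 12:30 pm cost $8.50; details at http://example.com")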

Other numbers

All other numbers were eliminated after the dates, times, money, email, and websites were encoded, per above.

Impolite words

A list of “badwords” was obtained from several web sources and edited.

Punctuation

We identified three classes of punctuation based on their impact on prediction: “thought-stopping” punctuation, i.e., marks in the set {:;.?!}; apostrophes, which have several uses; and all other marks. “Thought-stopping” punctuation marks terminate a complete sentence. The idea is that the final word(s) in a sentence are not likely to be of value for predicting the beginning word of the following sentence; there is no continuity of sentence structure across any of these punctuation marks. These marks were therefore left in the documents during this phase of data exploration and cleanup. The last action of the cleanup phase is to separate the text documents into ‘one line = one sentence’ form prior to tokenization.

Apostrophes “’” can indicate contractions, quotes, or possessive forms of words. The following contractions were identified and expanded, thereby eliminating apostrophes used for this purpose:

Contraction   Expansion
'll           will
'd            would
n't           not
're           are
'm            am
n'            and

Of course, some of these expansions were arbitrary (for example “’d” might be more appropriately expanded as “had” in some cases). But for now that will be one of several compromises made in this project to be improved upon later.

Apostrophes used as quotation marks were eliminated entirely. Again, this was an arbitrary decision; one could argue that the beginning of a quote is similar to a “thought-stopping” punctuation mark. And again, this is another compromise to deal with later on.

Apostrophes used in a possessive form were eliminated. Hyphens between words were converted to spaces, and all other punctuation marks were eliminated.
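A minimal sketch of this cleanup step is shown below, operating on a character vector of raw lines. The helper name, substitution order, and patterns are illustrative assumptions, not the project's actual code.

# Illustrative cleanup: expand contractions, split sentences, drop other marks
cleanText <- function(x) {
  # Expand contractions (arbitrary choices, per the table above)
  x <- gsub("'ll", " will", x)
  x <- gsub("n't", " not", x)
  x <- gsub("'d",  " would", x)
  x <- gsub("'re", " are", x)
  x <- gsub("'m",  " am", x)
  x <- gsub("n'",  " and", x)
  x <- gsub("-", " ", x)                 # hyphens between words become spaces
  x <- unlist(strsplit(x, "[;:.?!]+"))   # one sentence per line
  # remaining marks (quotes, possessive apostrophes) dropped; in the full
  # pipeline the generic <...> tokens would need to be protected here
  x <- gsub("[[:punct:]]", "", x)
  trimws(x)
}
cleanText("We'll visit the shop-store; it shouldn't take long!")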

Stop words

After a little experimentation, we concluded that the stop words were in fact necessary for prediction, simply because they are so prevalent in the language. These were therefore retained.

Tokenization

The modified corpus was then tokenized into 1-gram to 6-gram tokens for the purpose of exploring language structure. The code used to create the n-grams is:

library(RWeka)    # NGramTokenizer and Weka_control
# Returns a tokenizer that produces n-grams of length min through max
Tokenizer <- function(min, max){ 
  function(x) NGramTokenizer(x, Weka_control(min = min, max = max))
}
dTDM <- TermDocumentMatrix(dCorpus, control=list(tokenize=Tokenizer(1,6)))

Given a set of k unique 1-gram tokens (words), consider the expected number of unique n-gram tokens (phrases of n words). For example, a 2-gram is constructed by prepending or appending one 1-gram to another 1-gram. In general, an (n+1)-gram token is formed by prepending or appending a 1-gram to an n-gram. If (n+1)-gram tokens were constructed from all combinations of n-grams and 1-grams in this manner, then the theoretical number of n-gram tokens would be on the order of k^n. But this is not the case. The actual number of unique n-grams in the sample documents as a function of n is shown in Figure 1, below. The structure of the language must be imposing restrictions on the formation of an (n+1)-gram from an n-gram. That is, only a small subset of 1-grams are likely to prepend or append a given n-gram, and as n gets larger, the size of that subset gets smaller.

The number of n-grams as a function of n is also given in this table, and compared with the (unstructured) theoretical maximum number of n-grams, given k 1-grams.

n   No. of n-grams   Max possible number of n-grams
1           41027    41027
2          327605    1.7 x 10^9
3          372177    6.9 x 10^13
4          355584    2.8 x 10^18
5          335065    1.2 x 10^23
6          315413    4.8 x 10^27
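For reference, the “max possible” column is simply k^n with k = 41027; a quick check in R:

k <- 41027                # number of unique 1-grams in the sample
signif(k^(1:6), 2)
# approximately: 4.1e+04  1.7e+09  6.9e+13  2.8e+18  1.2e+23  4.8e+27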

We should remember that these statistics are taken from an analysis of a relatively small sample of the source documents. Later, we will explore the impact of larger subsets while evaluating the performance of the prediction algorithms.

Word frequency

N-gram frequency is illustrated in Figure 2. Here, we grouped the tokens by n-value (i.e., 1-grams, 2-grams, etc.), then sorted on frequency in decreasing order. Code is here:

library(slam)     # row_sums for the sparse term-document matrix
Terms <- Terms(dTDM)                         # all n-gram terms in the TDM
Ngrams <- data.frame(terms=Terms,
                N=sapply(Terms,function(t)length(unlist(strsplit(t," ")))),  # words per term
                count=row_sums(dTDM),        # total occurrences across documents
                stringsAsFactors=FALSE
)

Ngrams <- Ngrams[order(Ngrams$count,decreasing=TRUE),]
Ngrams <- Ngrams[order(Ngrams$N),]

Sequential index numbers were added to the data, which were then plotted in figure 2. Code is here:

library(ggplot2)
Nmax <- 6                                  # maximum n-gram length used above
Ngrams$ind <- 1:dim(Ngrams)[1]             # sequential index for plotting

NgramCounts <- sapply(1:Nmax,function(n)sum((Ngrams$N==n)*1))    # number of tokens for each n
NgramCounts.df <- data.frame(N=1:Nmax,counts=NgramCounts)

palette(rainbow(Nmax))
g <- ggplot(Ngrams,aes(x=ind,y=count,group=N))
g <- g+geom_point(shape=21,aes(color=factor(N)))+scale_y_continuous(trans="log10")
g <- g+guides(color=guide_legend("N")) + ggtitle("Distribution of N-gram counts")
g

Figure 2 shows that the percentage of n-grams that have only one occurrence increases with n. Again, this data is from our sampled source documents, but since the sampling was uniform we expect a similar pattern to prevail in the full corpus. Figure 3 shows detail for just the 1-gram tokens.
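This proportion can be computed directly from the Ngrams data frame built above; a short sketch:

# Fraction of n-grams occurring exactly once, grouped by n
singletons <- tapply(Ngrams$count == 1, Ngrams$N, mean)
round(singletons, 2)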

Prediction

We can look at two different prediction scenarios: predicting the next word in a phrase, given a short sequence of words; and predicting the next letter in a word, given a short sequence of letters.

Predicting the next word in a phrase

Suppose we are given a word that matches the first word in some of our 2-grams. We then select all 2-grams beginning with that word, and count the number of occurrences of 2-grams for each of their unique last words. The last word in the set of 2-grams with the highest count is the most likely word to follow, and is selected. For example, suppose the given word is “shop” and the available 2-grams starting with the word “shop” are:

n-gram               No. of occurrences in the source documents
“shop store”         6
“shop supermarket”   2
“shop mall”          4

We would then suggest “store” as the most likely word to follow the word “shop”. Suppose further that there are many 2-grams starting with our given word. In the interest of economy, we would perhaps consider only the 2-grams with the three highest counts in the source documents. In the case of ties we could either select one of the candidate predictions randomly, or suggest all of them.
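Using the toy counts above, the selection might look like the following sketch (the data frame and its column names are illustrative):

# Toy example using the counts above; names are illustrative
bigrams <- data.frame(first = "shop",
                      last  = c("store", "supermarket", "mall"),
                      count = c(6, 2, 4))
candidates <- bigrams[bigrams$first == "shop", ]
head(candidates[order(candidates$count, decreasing = TRUE), "last"], 3)
# "store" "mall" "supermarket"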

This method can be generalized to the case where a phrase of two or more words is given. Consider the first n-1 words in an n-gram as a “prefix” and the final word as a “suffix”. Our prediction method is then to match a given (n-1)-word phrase to the prefixes in our set of n-grams, and suggest the suffix of the matching n-gram with the highest occurrence count as the next word.

But as was seen above, the number of n-grams becomes more and more sparse as n increases, compared with the total number of possible combinations of n words. This means that it becomes less and less likely that a given (n-1)-word phrase will match a prefix of an n-gram. In this case, we “back off”: we take the last n-2 words of the given phrase and try to match a prefix of our (n-1)-grams. We continue in this manner until we find a match or until we reach our 1-grams. In the case that a given single word matches none of our 1-grams, our only choices are to suggest nothing or to suggest the most common 1-gram (word) in our documents. Statistically, that would be the most likely word to occur, although it would not likely be perceived as an appropriate choice.
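A sketch of this back-off search is given below. It assumes a lookup table ngrams with columns prefix, suffix, and count; the function and column names are assumptions for illustration, not the project's final code.

# Back-off search sketch; 'ngrams' has prefix, suffix, and count columns
predictNext <- function(phrase, ngrams, top = 3) {
  words <- unlist(strsplit(phrase, "\\s+"))
  while (length(words) > 0) {
    hits <- ngrams[ngrams$prefix == paste(words, collapse = " "), ]
    if (nrow(hits) > 0) {
      hits <- hits[order(hits$count, decreasing = TRUE), ]
      return(head(hits$suffix, top))      # most frequent suffixes first
    }
    words <- words[-1]                    # back off: drop the first word
  }
  character(0)                            # no match, even on the last word
}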

Final implementation

The final implementation struck a balance between prediction accuracy, prediction dataset size, and application speed. To ensure all variables would fit into memory, a 20% random sample of the three source files was used. Also, the reliance on the ‘tm’ package was abandoned due to performance problems and memory usage issues. Instead, a simpler tokenizer was used. This code is credited to: zero323 git repo

Tokenization

The three texts were cleaned as above, then tokenized into 2-, 3-, and 4-gram tokens on word boundaries. N-grams that appeared only once were eliminated, to remove strange spellings and language constructs that were especially prevalent in the Twitter text. The frequency of occurrence of each n-gram in each of the three source documents, and the total frequency of occurrence, were determined. The n-grams were then parsed into a ‘prefix’ (first n-1 words) and a ‘suffix’ (nth word), then sorted on the ‘prefix’.
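A sketch of this step for a single value of n is shown below, operating on a character vector of cleaned sentences. The simple word-boundary split stands in for the zero323 tokenizer, and the function and column names are illustrative.

# Sketch for one n (here n = 3); the split below stands in for the actual tokenizer
makeNgrams <- function(sentences, n = 3) {
  grams <- unlist(lapply(strsplit(sentences, "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    sapply(1:(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  tab <- table(grams)
  tab <- tab[tab > 1]                                 # drop n-grams seen only once
  data.frame(prefix = sub(" \\S+$", "", names(tab)),  # first n-1 words
             suffix = sub("^.* ", "", names(tab)),    # last word
             count  = as.integer(tab),
             stringsAsFactors = FALSE)
}

This sketch keeps only total counts; the per-document frequencies shown in the extract below were tracked separately.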

Each prefix may appear in this data multiple times, each time with a unique suffix and occurrence frequency. The most frequently occurring suffixes (up to a maximum of 10) for a given prefix were retained for prediction, and the remaining n-grams were discarded. Here is an extract of the prediction data set at this stage, as an illustration:

                  pref       suff N 1 2 3 total
10000          a field        day 3 4 5 4    13
10001          a field         in 3 3 4 0     7
10002          a field        and 3 2 4 0     6
10003          a field       that 3 0 6 0     6
10004          a field      where 3 2 2 0     4
10005          a field       with 3 2 2 0     4
10006          a field         on 3 0 3 0     3
10007      a field day       with 4 0 2 2     4
10008     a field goal         in 4 0 2 0     2
10009     a field goal         on 4 0 2 0     2
10010       a field of     number 4 0 6 0     6
10011   a field office         in 4 0 2 0     2
10012 a field sobriety       test 4 0 3 0     3
10013     a field that   included 4 0 3 0     3
10014     a field trip         to 4 6 2 3    11
10015     a field with          a 4 2 0 0     2
10016        a fielder     choice 3 0 4 0     4
10017       a fielding      error 3 0 6 0     6
10018         a fierce     battle 3 3 2 0     5
10019         a fierce competitor 3 0 2 0     2
10020         a fierce      storm 3 0 2 0     2
10021          a fiery      crash 3 0 2 0     2

The fields in this extract (except the left-most index number) are:

  • pref: prefix (first n-1 words in the n-gram)
  • suff: suffix (last word in the n-gram)
  • N: length of n-gram (the “n” in n-gram)
  • 1, 2, 3: frequency of occurrence of the n-gram in each of the three source documents.
  • total: total frequency of occurrence of the n-gram.

Prediction

The prediction process is straightforward. The user enters an input string comprising m words. The last (up to) four words in the string are cleaned using the same cleaning algorithm as was used to clean the source texts. These are concatenated into one string, which is compared with the prefixes in the ordered prediction dataset. The suffixes corresponding to the prefix that matches the comparison string are returned. Since the data set is sorted in decreasing frequency order, the returned words are listed from most likely to least likely to follow the input phrase.

If no match occurs, then the first word in the comparison string is dropped and the remaining words are again compared with the prefixes in the prediction dataset. This process continues until a match occurs. If no match occurs even with just the last word in the comparison string, no word is predicted. The reason is that at this point the only logical word to predict would be the single word that occurs most frequently across all the source documents. Obviously, this would be the same word predicted every time, and hence useless.
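As a concrete illustration, a single lookup against the prediction data, using the pref, suff, and total fields shown in the extract above (the data set name pred is an assumption), might look like this; the back-off loop itself mirrors the predictNext sketch given earlier.

# One lookup against the prediction data; 'pred' is an assumed name for
# the data set whose fields appear in the extract above
hits <- pred[pred$pref == "a field", ]
hits$suff[order(hits$total, decreasing = TRUE)]
# "day" "in" "and" "that" "where" "with" "on"   (ties in arbitrary order)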

Operation

  1. Navigate to https://data-dancer.com/simple-text-prediction
  2. Wait a few moments for the server to initialize. You will see a bar graph appear, showing the conditional probabilities for the first (up to) 10 words predicted by the prefix that appears in the horizontal axis label. A screen shot at this stage is shown in the figures, below.
  3. Enter an English phrase in the text input box.
  4. The word most likely to follow the input phrase, taking all source documents into consideration, appears next to the text input box.
  5. The label for the horizontal axis shows the actual prefix that was used for the prediction. This prefix may be shorter than four words, reflecting the ‘back-off’ process described above.
  6. Finally, the screen shows the top words predicted by each of the source texts as well as all the words predicted by all source texts combined.

Further work

A number of interesting ideas for extending this product came up while experimenting with the application, and may be pursued later. These include:

  1. Adjustment of conditional probability by context. Key words appearing earlier in a phrase might alter the conditional probability of the word that follows the phrase. Context is expressed through relatively rare words, as opposed to ‘stop words’. So one way to proceed might be a frequent-itemset analysis of rare words in the source texts on a sentence-by-sentence basis. If one word in a frequent itemset appears in a sentence, then there is a non-zero probability that another word from the same itemset also appears in that sentence.
  2. Gender-conditioned prediction. Examine the words predicted by the phrases ‘keep her’ and ‘keep him’. The sets of words predicted for these two phrases appear qualitatively different. I ran across this by accident; it is to be explored later.
  3. Characterization of prediction accuracy by cross-validation. One possible method (a code sketch follows this list):
    • Partition the n-grams into N roughly equal size subsets.
    • Hold out one subset as the “test” subset.
    • Use the remaining N-1 subsets as the training subset, and predict the conditional word probabilities with them.
    • Now, parse the test set into prefix and suffix, as above.
    • Predict the suffix for each prefix in the test set, and compare with the actual suffix.
    • Count the number of prediction successes and failures, and express as a fraction of total number of predictions attempted.
    • Alternatively, compare the actual suffix to the top ‘p’ predicted words, simulating a situation where the text predictor returns a choice of ‘p’ words, which might be more practical. Record a success if the actual suffix appears in the set, a failure otherwise.
    • Repeat N-1 more times, holding out a different subset as the test set each time.
    • Calculate mean error (or accuracy) percentage and variance.
    • Repeat the process several times, using smaller subsamples of the training subset.
    • Compare the means and variances of all trials, and select the optimal training based on bias/variance trade-off.
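A compact sketch of this hold-out loop, reusing the predictNext helper sketched earlier and an ngrams table with prefix, suffix, and count columns (all names are assumptions):

# Hold-out loop sketch; reports one accuracy per fold
crossValidate <- function(ngrams, N = 10, p = 3) {
  folds <- sample(rep(1:N, length.out = nrow(ngrams)))    # random fold assignment
  sapply(1:N, function(k) {
    train <- ngrams[folds != k, ]
    test  <- ngrams[folds == k, ]
    hits  <- mapply(function(pre, suf) suf %in% predictNext(pre, train, top = p),
                    test$prefix, test$suffix)
    mean(hits)                                            # fraction of correct predictions
  })
}
# accs <- crossValidate(ngrams); c(mean = mean(accs), var = var(accs))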

Figures

Number_of_ngrams

Figure 1: The number of n-grams as a function of n is, for 3-grams and up, much lower than O(k^n), where k is the number of 1-grams

Dist of n-grams

Figure 2: Frequency distribution of n-grams

Dist of 1-grams

Figure 3: Frequency distribution of 1-grams

app_screen_shot

Figure 4: Application screen shot after start-up