If you have ever typed something on a smartphone, you have likely seen it attempt to predict what you’ll write next. This article explains how text predictors work and how strongly the resulting predictions depend on the input language dataset. To see this in action, we will predict tweets by four Twitter accounts: Barack Obama, Justin Timberlake, Kim Kardashian, and Lady Gaga.

To make useful predictions, a text predictor needs as much knowledge about language as possible, which is usually obtained through machine learning. We will look at a simple yet effective algorithm called k Nearest Neighbours. It works by taking the last few words you wrote, comparing them to all the groups of words seen during the training phase, and outputting the word that most often followed similar groups in the past.
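The idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual implementation: it assumes whitespace tokenisation and uses positional word overlap as the similarity measure between trigrams.

```python
from collections import Counter, defaultdict

def build_trigram_index(text):
    """Map each trigram (three consecutive words) to a Counter of the
    words that followed it in the training text."""
    words = text.split()
    index = defaultdict(Counter)
    for i in range(len(words) - 3):
        index[tuple(words[i:i + 3])][words[i + 3]] += 1
    return index

def predict_next(index, last_three, k=3):
    """Find the k stored trigrams most similar to the input (here:
    the number of positions where the words match) and return the
    word that most often followed them."""
    def similarity(trigram):
        return sum(a == b for a, b in zip(trigram, last_three))
    neighbours = sorted((t for t in index if similarity(t) > 0),
                        key=similarity, reverse=True)[:k]
    votes = Counter()
    for trigram in neighbours:
        votes.update(index[trigram])
    return votes.most_common(1)[0][0] if votes else None

index = build_trigram_index("the cat sat on the mat the cat sat on the rug")
predict_next(index, ("cat", "sat", "on"))   # an exact match: "the"
predict_next(index, ("dog", "sat", "on"))   # a near match still works
```

Note that even the unseen trigram `("dog", "sat", "on")` gets a prediction, because it is close enough to a trigram seen during training; this is exactly the benefit of the nearest-neighbour step over a plain lookup table.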

Here is an example of using k Nearest Neighbours to predict tweet text. After you choose a person, a database of trigrams (groups of three words) is built from all tweets by that account. Pick an example tweet and move the slider to various positions in the text: the demo automatically detects the last trigram before that position and searches the database for similar ones. The best matching trigrams are displayed, along with the word that most often followed them.
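The slider step above amounts to extracting the last trigram before the cursor position. A minimal sketch of that step (the function name and whitespace tokenisation are assumptions, not the demo's code):

```python
def last_trigram(text, position):
    """Return the last three complete words before a character
    position in the text; returns fewer words near the start.
    Assumes simple whitespace tokenisation."""
    words = text[:position].split()
    return tuple(words[-3:])

last_trigram("move the slider to various positions", 15)
# → ("move", "the", "slider")
```

That tuple is then used as the query for the nearest-neighbour search over the account's trigram database.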