Natural Language Processing (NLP) is a field of computer science that analyzes text and tries to understand the meaning of human language. It’s quite a challenging topic, since computers find it hard to understand what we are trying to say (although they are very good at executing commands well known to them). By utilizing established techniques, NLP analyzes text, enabling real-world applications such as automatic text summarization, sentiment analysis, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.

NLP is also becoming important in the mobile world. With the rise of conversational interfaces, extracting the correct meaning of the user’s spoken input is crucial. For this reason, there are many NLP solutions on the two most popular platforms, iOS and Android. Since iOS 5, Apple has provided the NSLinguisticTagger class, which offers a lot of natural language processing functionality in different languages. NSLinguisticTagger can be used to segment natural language text into paragraphs, sentences, or words, and to tag information about those tokens, such as part of speech, lexical class, lemma, script, and language. There’s a great presentation at this year’s WWDC about NSLinguisticTagger, which discusses the new enhancements of the class.

In this post, we will create a simple app that will list all the posts from my blog. When a post is selected, the app will open it in a web view, along with details at the bottom about the detected language of the post, as well as the most important words. We will accomplish this using the NSLinguisticTagger class and a simple implementation of the TF-IDF algorithm.



First, we will keep the posts in a local file called posts.json (taking them from a web service is out of scope for this post). Each entry stores information about the title and the url of the post. The user interface of the app would be pretty simple:
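As a rough sketch, the model behind posts.json could look like this — the struct and helper names are my assumptions, only the title and url fields come from the description above:

```swift
import Foundation

// Hypothetical model for one entry in posts.json (title + url, as described).
struct Post: Codable {
    let title: String
    let url: String
}

// Decode an array of posts from raw JSON data; returns an empty array on failure.
func loadPosts(from data: Data) -> [Post] {
    (try? JSONDecoder().decode([Post].self, from: data)) ?? []
}

// Illustrative sample data, not the blog's real posts.
let sampleJSON = """
[{ "title": "NSLinguisticTagger in iOS 11", "url": "https://example.com/post1" }]
""".data(using: .utf8)!
let posts = loadPosts(from: sampleJSON)
```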



After the posts are loaded into memory, we will get the contents of each url, strip the html tags (since we don’t want them to affect the importance of the words) and count the occurrences of a given word in each of the documents. We will use this information later in the TF-IDF algorithm for extracting the keywords. To do this, we will create a method called words(inText:url:action), which receives a text and the corresponding url as input, finds the words using the linguistic tagger, and invokes a completion handler (action) provided by the caller. We will use this helper method both for counting the words and for computing the importance of the words:
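A sketch of that helper could look like the following — the signature follows the description above, but the body is my reconstruction, not the post’s exact code:

```swift
import Foundation

// Sketch of words(inText:url:action:): enumerate the words of a text with
// NSLinguisticTagger and hand each (lemmatized) word to the caller's action.
func words(inText text: String, url: URL, action: (String, URL) -> Void) {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    // Ignore whitespace and punctuation; treat names like "Tim Cook" as one token.
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation, .joinNames]
    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        // Prefer the lemma ("doing" -> "do"); fall back to the literal token.
        let token = (text as NSString).substring(with: tokenRange)
        let word = tag?.rawValue ?? token
        action(word.lowercased(), url)
    }
}
```

The enumerateTags(in:unit:scheme:options:) overload used here requires iOS 11 (or macOS 10.13).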



Let’s see what’s happening here. We are creating an instance of the NSLinguisticTagger class, with lemma provided as a tag scheme. Lemmatisation is the process of handling different forms of a word as a single item. For example, we want to treat the verb do in all its different forms (doing, does) as one word, since we are more interested in the importance of the word’s semantics, not its syntax. Apart from lemma, there are a lot of other tag schemes that the linguistic tagger supports. For example, you can use nameType, which classifies tokens that are part of named entities. There’s also lexicalClass, which classifies tokens according to their class – part of speech for words, type of punctuation or whitespace, etc. You can also tag tokens based on their most likely language. Check the NSLinguisticTagScheme struct for more details.

After the tagger is created, we set the provided text (which in this case is the stripped html text) and define the range of the string. Then, we need to provide the options of the tagger. Here we are setting the omitWhitespace, omitPunctuation and joinNames options. The first two are for ignoring whitespace and punctuation, as their names imply. The joinNames option handles people’s names and surnames as one entry. For example, Tim Cook would be handled as one token.

When everything is set up, we call the enumerateTags(in:unit:scheme:options:) method. This method is new in iOS 11; it segments the string into tokens for the given unit and returns those ranges, along with a tag for any scheme in its array of tag schemes. The new thing here is the unit parameter (word in our case, but you can also provide sentence, paragraph or document). If you have to target iOS versions below 11, there are also enumerateTags methods, available since iOS 5, that do not take a unit and act at the word level.

Now, let’s see which action we should provide to the words method. As mentioned above, we need to count how many times each word appears in each of the documents. For this, we will use a variable called wordCountings. This is a dictionary whose keys are the words and whose values are dictionaries mapping the url of a blog post to the number of times the word appears at that url, for example:

{ "ios" : { "url1" : 1, "url2" : 5 }, "siri" : { "url1" : 2, "url2" : 0 } }

For the TF-IDF algorithm (which is explained below), we will also need to know how many words every post has. In order not to over-complicate wordCountings, we will define another dictionary, documentSizes, whose keys are the urls of the posts and whose values are the total number of words in each document. The following method fills the data in wordCountings and documentSizes:
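A minimal sketch of that counting action could look like this (the urls are shown as plain strings here for brevity; the variable names follow the description above, the rest is an assumption):

```swift
// wordCountings: word -> (url -> occurrences); documentSizes: url -> word count.
var wordCountings: [String: [String: Int]] = [:]
var documentSizes: [String: Int] = [:]

// The action passed to the words helper: bump the per-document count for the
// word and the total size of the document it came from.
func countWordsAction(word: String, url: String) {
    wordCountings[word, default: [:]][url, default: 0] += 1
    documentSizes[url, default: 0] += 1
}

// Simulated tokens for two hypothetical posts.
for w in ["ios", "siri", "ios"] { countWordsAction(word: w, url: "url1") }
for w in ["siri", "swift"]      { countWordsAction(word: w, url: "url2") }
```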



All the data is now filled, which will allow for faster computation of the keywords for a given article. The computation will happen when a post is selected. We will not store any computed results or already loaded html (i.e. implement a caching mechanism), since that would be too much for one post. That’s why, on each tap on a post, we will load the selected url again, split the html into words and then provide a different action (this time the TF-IDF algorithm) to the words method. The extractKeywordsTask does that:
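Sketched out, the task could look roughly like this — the function name follows the post, but the body (including the naive regex-based tag stripping) is my reconstruction:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLSession lives here on Linux
#endif

// Naive html tag stripping; a real app might prefer NSAttributedString's
// html parsing instead of a regex.
func stripHTMLTags(from html: String) -> String {
    html.replacingOccurrences(of: "<[^>]+>", with: " ", options: .regularExpression)
}

// Sketch of extractKeywordsTask: load the post, strip the tags, and hand the
// words to the TF-IDF action (tokenisation shown here as a simple split; the
// real code would reuse the words(inText:url:action:) helper).
func extractKeywordsTask(for url: URL,
                         tfIdfAction: @escaping (String, URL) -> Void,
                         completion: @escaping () -> Void) -> URLSessionDataTask {
    let request = URLRequest(url: url)
    return URLSession.shared.dataTask(with: request) { data, _, _ in
        guard let data = data, let html = String(data: data, encoding: .utf8) else {
            completion()
            return
        }
        let text = stripHTMLTags(from: html)
        for word in text.lowercased().split(separator: " ") {
            tfIdfAction(String(word), url)
        }
        completion()
    }
}
```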



We are creating the request, starting a session task and, in the action block, filling the result dictionary, which contains the TF-IDF value for each word. We’ve mentioned this algorithm a few times now, but we haven’t explained it. Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The first part, term frequency, is about how many times a term occurs in a document. We’ve already computed that in our wordCountings dictionary. Since there are words like “the” which appear very often but are not important to the meaning of a text, the inverse document frequency is introduced. With IDF, we count how many times a word appears in the other documents. If it appears a lot (like “the”), the weight of the word is diminished. We already have the countings for a word in all documents in wordCountings, so we only need to ignore the number of occurrences in the current document and sum the other ones. The IDF factor is then computed as the logarithm of the total number of posts divided by that sum. After the TF and IDF are computed, they are multiplied to get the total TF-IDF weight.
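The computation described above can be sketched like this — a reconstruction over the wordCountings/documentSizes structures, with a +1 in the IDF denominator added as my own guard against division by zero:

```swift
import Foundation

// TF-IDF for one word in one document, given the counting structures above.
func tfIdf(word: String,
           in url: String,
           wordCountings: [String: [String: Int]],
           documentSizes: [String: Int],
           totalDocuments: Int) -> Double {
    let countings = wordCountings[word] ?? [:]
    guard let size = documentSizes[url], size > 0 else { return 0 }
    // Term frequency: occurrences in this document, normalised by its length.
    let tf = Double(countings[url] ?? 0) / Double(size)
    // Occurrences of the word in the *other* documents.
    let countInOthers = countings.filter { $0.key != url }.reduce(0) { $0 + $1.value }
    // Inverse document frequency (+1 is an assumption, to avoid dividing by zero).
    let idf = log(Double(totalDocuments) / Double(countInOthers + 1))
    return tf * idf
}
```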



That’s all that we need to do – after the word weights are computed, we sort them to get the top 10 and pass them to the next screen. The WebViewController just presents those at the bottom, below the post:
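The sorting step can be sketched in a couple of lines (the sample weights below are made up for illustration):

```swift
// Pick the top 10 keywords from the word -> TF-IDF weight dictionary.
let weights: [String: Double] = ["swift": 0.31, "ios": 0.28, "the": -0.05]
let topKeywords = weights
    .sorted { $0.value > $1.value } // highest weight first
    .prefix(10)
    .map { $0.key }
```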



You can see that although the algorithm is pretty simple, the results are good. Of course, they are not perfect, but improving them would require more advanced algorithms. You may have noticed one more detail above the keywords in the screenshot – the detected language. This is also new in iOS 11: the linguistic tagger’s dominantLanguage property:
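Reading it is a one-liner; a minimal sketch (the sample sentence and the "unknown" fallback are my own):

```swift
import Foundation

// Ask the tagger (iOS 11+ / macOS 10.13+) for the most likely language of a text.
let tagger = NSLinguisticTagger(tagSchemes: [.language], options: 0)
tagger.string = "This is clearly an English sentence."
let language = tagger.dominantLanguage ?? "unknown"
```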



That’s all we have for today. NSLinguisticTagger is a very powerful class, with lots of possibilities for developers to make their apps smarter and more aware by utilizing NLP methods. You can check the source code of the whole app here.