Machine learning is a term you’ve probably heard before. Nowadays, it’s deeply integrated into many sectors, from spam filters to self-driving cars. There is a lot of math and statistics behind these algorithms, but today there are many great tools we can use without being a math professor.

In this blog post, I’m going to gently introduce you to supervised text classification by showing you how to build a simple language detection app using third-party libraries written in JavaScript.

Classifiers

A classifier assigns a category to a piece of data for us: “this email is spam” or “this text is in English,” for example.

To get good classifications, we need to train our classifier. We start with some data for which we already know the correct classification, which in machine learning is called a label. The classifier then uses this labeled data to infer the labels of new data. This is what’s called supervised learning.

For our language classifier, we can train it using phrases in different languages and tell it the language of each phrase. To create a spam filter, we can take emails from our mailbox and tell the classifier whether each one is spam or ham (not spam).
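To make “labeled data” concrete, here is a minimal sketch of what a training set could look like in JavaScript. The example emails, the `trainingData` array, and the `groupByLabel` helper are all made up for illustration; they are not part of any particular library.

```javascript
// Hypothetical labeled examples: each item pairs a text with its known label.
const trainingData = [
  { text: 'Win a free prize now', label: 'spam' },
  { text: 'Cheap loans, click here', label: 'spam' },
  { text: 'Meeting moved to Monday', label: 'ham' },
  { text: 'Here are the notes from lunch', label: 'ham' },
];

// Group the texts by label so we can see how many examples each class has.
function groupByLabel(examples) {
  const groups = new Map();
  for (const { text, label } of examples) {
    if (!groups.has(label)) groups.set(label, []);
    groups.get(label).push(text);
  }
  return groups;
}

const byLabel = groupByLabel(trainingData);
for (const [label, texts] of byLabel) {
  console.log(`${label}: ${texts.length} examples`);
}
```

Whatever tool does the training, the input has this shape: a piece of data plus the label we already know for it.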

In supervised learning, the quality of the training data is probably the most important thing. The better the training data, the better the accuracy we can get from the classifier.

Text Classifier: Naive Bayes

The Naive Bayes classifier is a simple but powerful classifier, and it’s used a lot to classify text. The two examples above, language and spam detection, can be done using the Naive Bayes classifier.

It’s based on Bayes’ theorem, which I won’t go deep into, but it is worth touching on why it’s called Naive. This classifier relies on a quite strong assumption, called the naive assumption: every word is treated as independent of the others (given the class), and the order of the words is not considered at all. This assumption is usually wrong, especially when we classify text.

In the phrase “Michael is looking for John,” the word “for” strongly depends on “is looking,” and if we change the order of the words, “John is looking for Michael” has a different meaning. Because of the naive assumption, however, the Naive Bayes classifier treats the two phrases in exactly the same way, both when it is trained and when it classifies: it only checks how often each word occurs and doesn’t consider word positions.
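A short sketch shows why the two phrases look identical to such a classifier. The `wordCounts` and `sameBag` helpers are hypothetical names for this illustration, and the tokenization (lowercasing and keeping runs of letters) is a simplifying assumption.

```javascript
// Count how often each word occurs, ignoring order (a "bag of words").
function wordCounts(phrase) {
  const counts = new Map();
  // Assumption: a "word" is any run of Unicode letters, compared in lowercase.
  const words = phrase.toLowerCase().match(/\p{L}+/gu) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Two bags are equal when they hold the same words with the same counts.
function sameBag(a, b) {
  if (a.size !== b.size) return false;
  for (const [word, count] of a) {
    if (b.get(word) !== count) return false;
  }
  return true;
}

const first = wordCounts('Michael is looking for John');
const second = wordCounts('John is looking for Michael');
console.log(sameBag(first, second)); // true
```

Both phrases reduce to the same five words with a count of one each, so any model that only sees these counts cannot tell them apart.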

This wrong assumption makes Naive Bayes classifiers easy to develop and, despite it, they often classify text quite well.
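To show how little code the naive assumption requires, here is a minimal from-scratch sketch of a word-count Naive Bayes classifier. It is not the API of any third-party library. The function names (`tokenize`, `train`, `classify`), the tiny spam/ham training set, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```javascript
// Assumption: a "word" is any run of Unicode letters, compared in lowercase.
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || [];
}

// Training only counts: documents per label and word occurrences per label.
function train(examples) {
  const model = {
    docs: 0,
    docCounts: new Map(),  // label -> number of training texts
    wordCounts: new Map(), // label -> Map(word -> occurrences)
    totals: new Map(),     // label -> total number of words seen
    vocabulary: new Set(), // every distinct word seen in training
  };
  for (const { text, label } of examples) {
    model.docs += 1;
    model.docCounts.set(label, (model.docCounts.get(label) || 0) + 1);
    if (!model.wordCounts.has(label)) model.wordCounts.set(label, new Map());
    const counts = model.wordCounts.get(label);
    for (const word of tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      model.totals.set(label, (model.totals.get(label) || 0) + 1);
      model.vocabulary.add(word);
    }
  }
  return model;
}

// Score each label as log P(label) + sum of log P(word | label),
// treating every word as independent, then return the best label.
function classify(model, text) {
  let best = null;
  let bestScore = -Infinity;
  for (const [label, docCount] of model.docCounts) {
    let score = Math.log(docCount / model.docs);
    const counts = model.wordCounts.get(label);
    const total = model.totals.get(label) || 0;
    for (const word of tokenize(text)) {
      // Add-one smoothing keeps unseen words from producing log(0).
      const count = counts.get(word) || 0;
      score += Math.log((count + 1) / (total + model.vocabulary.size));
    }
    if (score > bestScore) {
      best = label;
      bestScore = score;
    }
  }
  return best;
}

const model = train([
  { text: 'win a free prize now', label: 'spam' },
  { text: 'free money win big', label: 'spam' },
  { text: 'meeting notes for monday', label: 'ham' },
  { text: 'lunch on monday', label: 'ham' },
]);

console.log(classify(model, 'free prize'));     // spam
console.log(classify(model, 'monday meeting')); // ham
```

Because the score is just a sum over words, shuffling the words of the input never changes the result, which is the naive assumption at work.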

NB Language Classifier

So let’s build a language classifier using Naive Bayes, starting with just two languages, French and Italian. Once you get the hang of how it works, we can easily extend the classifier to other languages. Let’s consider two phrases per language: