You see, Portuguese is a language divided across the Atlantic. The Brazilian and Old World (Portugal plus the Portuguese-speaking African and Asian countries) varieties differ in vocabulary, grammar and pronunciation.

At the same time, the freelance Portuguese translation market in Poland is pretty narrow. I didn’t have the luxury of specializing in just one of the varieties. With both hats on my head, sometimes in the same workweek, I had to be twice as careful about keeping each translation consistent. That’s why one of my first ideas for an NLP project was a Portuguese dialect classifier.

This is no groundbreaking research paving new paths in NLP or AI. My motivation here is similar to that of my previous Medium article about a Tweetbot: to show how straightforward it is to take the first steps in natural language processing in Python. As a person with little technical background, I try to present it in as plain English as possible.

The article breaks down into three main parts:

preparing source data (Scrapy);

training the model (scikit-learn);

deploying an API with the model (Flask+Vue.js).

Preparing source data

A lecture. Could be a TED one. Photo by Samuel Pereira on Unsplash

What data did I need, and where and how could I gather it? In this case, I chose TED.com as my source. It’s a reasonably large database of conveniently transcribed lectures. What’s particularly important here is that the translating team at TED keeps a strict division between the two Portuguese dialects (which is not the case with Wikipedia, for example).

TED used to maintain an official API, but they discontinued it in 2016, so I set up a Scrapy spider to scrape the data. Most of it is boilerplate based on the Scrapy tutorial. The spider cycles through the catalog of TED talks in the specified language and follows links to individual lectures:

def parse_front(self, response):
    # collect links to individual talks from the catalog page
    talk_links = response.css("a.ga-link::attr(href)")
    links_to_follow = talk_links.extract()
    for url in links_to_follow:
        # point each link at the transcript page rather than the talk page
        url = url.replace('?language', '/transcript?language')
        yield response.follow(url=url, callback=self.parse_pages)

Then, the second parsing method scrapes the titles and texts of individual lectures and adds them to a dictionary:

def parse_pages(self, response):
    # extract and clean the talk title
    title = response.xpath('//head/title/text()').extract_first()
    title = title.strip()
    title = title.split(':')[0]
    # extract the transcript paragraphs
    talk = response.xpath('//div[@class="Grid__cell flx-s:1 p-r:4"]/p/text()').extract()
    # strip line breaks, tabs and surrounding whitespace from each paragraph
    cleaned = []
    for line in talk:
        line = line.strip()
        line = line.replace('\n', ' ')
        line = line.replace('\t', ' ')
        cleaned.append(line)
    talk = ' '.join(cleaned)
    talk = talk.replace('\n', ' ')
    ted_dict[title] = talk

Once complete, the dictionary is dumped into a CSV file. All I needed to work out myself were the appropriate XPaths. I also set a delay of half a second between downloads (without it, I was hitting the server too frequently and my queries got blocked):

class TedSpiderPt(scrapy.Spider):
    name = "ted_spider_pt"
    download_delay = 0.5
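
The dump into a CSV file isn’t shown in the snippets above. A minimal sketch of how it could be done, assuming the spider’s closed() hook is used (the file name and the exact code are illustrative, not the project’s actual ones):

import csv

def closed(self, reason):
    # write the accumulated {title: transcript} dictionary to a CSV file
    with open('ted_talks_pt.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for title, text in ted_dict.items():
            writer.writerow([title, text])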

The rest of the data preparation and the training process can be followed in the Jupyter notebook here. Below, I skim through it and highlight the crucial moments.

I scraped about 2500 TED talks for each version of Portuguese, some 12–18 thousand characters per talk. After cleaning the formatting, I label the transcriptions “0” for PT-PT (European Portuguese) and “1” for PT-BR, then fit and transform them with scikit-learn’s CountVectorizer.

An example of data prepared for CountVectorizing.
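
To give a rough idea of how such a table can be put together (the file and column names here are assumptions, not the project’s actual ones):

import pandas as pd

# load the two scraped CSVs and label them: 0 = PT-PT, 1 = PT-BR
pt = pd.read_csv('ted_talks_pt.csv', names=['title', 'text'])
br = pd.read_csv('ted_talks_br.csv', names=['title', 'text'])
pt['label'] = 0
br['label'] = 1
data = pd.concat([pt, br], ignore_index=True)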

If this article is your introduction to natural language processing: CountVectorizer transforms a set of text documents into word counts. Turning the text into numbers like this is what lets machine learning algorithms work with it at all.
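
A tiny, self-contained illustration of what that means (the two example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["eu gosto de livros", "eu gosto de filmes e de livros"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)

print(vec.vocabulary_)    # each word gets a unique index in the vocabulary
print(counts.toarray())   # one row of word counts per document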

First I split the lectures into training and testing sets, in a 2:1 proportion. The model will learn only from the training data; the testing part is used to check its performance before releasing it into the world.

Then I fit the Vectorizer with the training data: in plain English, it assigns a unique number to each word found there, creating a vocabulary. To eliminate the extreme outliers, I set the min_df parameter to 2, so words that occur only once are not taken into account. Then I transform the training set, i.e. count the occurrences of each word from the vocabulary. After CountVectorizing, each of the 59,061 words in the vocabulary is counted in each of the 3,555 talks that compose the training set.
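
A minimal sketch of these steps, continuing from the data frame above (the variable names are illustrative, not the notebook’s exact code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# hold out a third of the talks for testing
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=1/3, random_state=42)

vectorizer = CountVectorizer(min_df=2)              # ignore words that appear in fewer than two talks
X_train_counts = vectorizer.fit_transform(X_train)  # build the vocabulary and count word occurrences
print(X_train_counts.shape)                         # (3555, 59061) in the article's run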

See below for an illustration:

Tracking down a particular word.

The word “eu” (“I” in Portuguese) was indexed as “23892” in the vocabulary. In the word counts for this index, we see one lecture with a significantly higher number of “eu”s. We can trace it back to the talk transcription and… indeed, personal experiences play a major role in Chiki Sarkar’s TED talk.
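
That lookup can be reproduced roughly like this (again with the illustrative names from the sketch above):

# find the vocabulary index of "eu" and inspect its counts across the training talks
eu_index = vectorizer.vocabulary_['eu']
eu_counts = X_train_counts[:, eu_index].toarray().ravel()
print(eu_counts.argmax(), eu_counts.max())   # the training talk with the most occurrences of "eu"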

Apart from vectorizing and counting, I refrain from any other preprocessing. I do not stem the words, as I want to maintain the grammatical differences (e.g. verb conjugation differences) between both versions of the language. To give an example: reducing “[a] fazer” and “fazendo” to their stems would blur the PT/BR distinctions I want to catch.

Training the model

Photo by Jelmer Assink on Unsplash

I then train a Multinomial Naive Bayes classifier on the training set. Although this approach is fairly basic, it works surprisingly well in NLP applications such as this one.

Again, some plain English explanation: the Naive Bayes classifier basically compares how frequent a given word is in the BR and PT training sets, and thus determines whether the word suggests a more “Brazilian” or a more “Portuguese” text. When predicting, the evidence from all the words in the text is weighed together to obtain the final probability.
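
In scikit-learn, this training step is only a couple of lines (continuing with the illustrative names from the sketches above):

from sklearn.naive_bayes import MultinomialNB

# learn, for each dialect, how frequent each vocabulary word is
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)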

After training the classifier, I transform the test set using the same vectorizer as before. It is already populated with the training set vocabulary, so it only counts the words that appear in that vocabulary.

Then the vectorized test set is classified with Multinomial Naive Bayes. In this particular case, I made a little tweak to the test set: I wanted to check the model’s ability to classify short texts, so I split the lectures of the test set into smaller chunks of 200–760 characters. For comparison, the paragraph you are reading right now contains 370 characters.
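
The whole evaluation step then looks roughly like this (a sketch continuing with the illustrative names from above; the chunking here is simplified to a fixed 500-character cut):

from sklearn.metrics import accuracy_score

# split each test lecture into short chunks, keeping the lecture's label for each chunk
chunks, chunk_labels = [], []
for text, label in zip(X_test, y_test):
    for i in range(0, len(text), 500):
        piece = text[i:i + 500]
        if len(piece) >= 200:
            chunks.append(piece)
            chunk_labels.append(label)

X_chunk_counts = vectorizer.transform(chunks)   # words unseen in training are simply ignored
predictions = classifier.predict(X_chunk_counts)
print(accuracy_score(chunk_labels, predictions))

The results?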

How’s that for a score?

I achieved 86.55% accuracy on short texts, with a nearly identical proportion of misclassified PT-PT and PT-BR examples. Is it a good enough score? I’m not the most objective judge here, but I’d say it’s quite decent. Let’s look at some examples of wrongly classified phrases to help us assess the model.