It was a few months back that my friend Sam, working in Lisbon’s Data Science Academy, challenged me to build a team and take on the responsibility of creating the Natural Language Processing (NLP) component of their 6-month program.

Consisting of three weeks of learning materials, the module builds towards a one-day hackathon where participants compete to solve a challenging NLP task.

At the end of July, we hosted the final Text Classification competition at Unbabel. It was a fantastic ride, and I came away from it with a new opinion on hackathons, a better understanding of language in data science, and some different notions of the field itself.

But before delving into the most important skills to learn when entering the field of data science (with an extra focus on NLP) and the process we followed to build our hackathon, allow me to present The Academy.

The Academy

Data is one of the most highly valued assets nowadays, and companies are desperately searching for data scientists, data analysts, data engineers, data modelers, and data architects [1][2]. There are still not nearly as many candidates as there are open positions, so naturally, a lot of courses and materials have emerged claiming to be able to turn you into a data scientist.

So while you might be rolling your eyes and thinking "yet another hyped data science course" (and indeed it is yet another data science course), it’s a good one, and here’s why.

The Lisbon Data Science Academy (LDSA) is a 6-month program aiming at “helping people become entry-level data scientists by teaching introductory material”. With a group of amazing people in connected fields, it goes through the basics of data science, teaching how to gather, analyze, visualize and present data in many different contexts. It divides the workload into big groups, called specializations, each consisting of:

weekly notebooks of learning materials, complemented by graded exercises

a hackathon on the subject

The learning materials may consist of more than one notebook, but they should go over most of the skills required to learn the topic. Although the academy is paid (mostly to support operational costs associated with the hackathon), all the materials are made available to anyone, including the hackathon challenge. I find this extremely valuable, since it allows anyone to get started in the field or just access some good materials, at any time.

The community also adds huge value to the Academy. Behind it all, you have data scientists working on all sorts of things, in all sorts of companies, and you have eager students with all these different backgrounds, each bringing something new to this network. For each hackathon, they randomly shuffle the students into groups, which not only builds the ability to work with different teams but also lets them connect better with their peers.

There is one additional thing that I really want to praise about this group. LDSA has a policy to keep waste, plastic in particular, to a minimum. As such, they try to find providers that don’t rely on plastic packaging and chose, for example, a traditional coffee machine using ground coffee over capsule machines (you can argue these are recyclable, but the process is not as environmentally friendly as simply not generating that extra trash). Even though they are a small organization, the commitment they make towards this initiative is inspiring, and I do hope that each and every person coming across the academy takes this mindset into their own companies once the course is done.

Let’s move on to more practical things.

Part 1 — Getting Started with Data Science and NLP

What can you do to learn data science? And what is a data scientist, really?

Although there are a lot of possible answers, in my mind a data scientist should know the basics of all stages of data, from its extraction to all the required processing until the actual analysis phase. This can also entail creating visualizations and exposing condensed and useful information from that data.

To learn this, you need good skills in many fields, ranging from mathematics to programming, together with good analytical and presentation skills. All of these are achievable, regardless of background, and the academy aims to arm you with the following curriculum:

Initial bootcamp on basic statistics, programming, and classification

1st hackathon on Binary Classification

2nd hackathon on Time Series

3rd hackathon on Data wrangling

4th hackathon on Text classification

5th hackathon on Recommendation systems

6th hackathon on Deployment of models

The bootcamp gives an overview of useful frameworks, such as pandas, scikit-learn, matplotlib, and NumPy. It assumes a basic knowledge of the Python language, but I should mention that most of it is quite transferable to R, one of the languages of choice for some in the field.

The first units also cover the basics of statistics, from which students learn how to analyze data properly, get basic metrics out of it, do statistical inference and understand what correlation actually means — spoiler: not the same as causality.
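
To make the correlation-versus-causality point concrete, here is a small sketch using only Python's standard library. The data is entirely made up for illustration (the variable names and coefficients are my assumptions, not part of the academy's materials): ice cream sales and sunburn cases both depend on temperature, so they correlate strongly even though neither causes the other.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for a third variable zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical confounder: daily temperature in degrees Celsius.
temperature = [random.uniform(10, 35) for _ in range(500)]

# Both quantities depend on temperature plus independent noise,
# but neither depends on the other.
ice_cream_sales = [5.0 * t + random.gauss(0, 10) for t in temperature]
sunburn_cases = [0.8 * t + random.gauss(0, 2) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
r_partial = partial_correlation(ice_cream_sales, sunburn_cases, temperature)
print(f"raw correlation: {r:.2f}")
print(f"after controlling for temperature: {r_partial:.2f}")
```

The raw correlation comes out high, while the partial correlation, which accounts for temperature, should sit close to zero. The relationship between the two series is explained by the shared cause.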

To quote Huff, “The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify.” [3] With the main foundations of statistics, not only do you learn how to analyze data, but you also become more aware of what is happening behind the curtain, and are better at avoiding the pitfalls of data misrepresentation.

From the basics, it builds up from simple models, such as linear and logistic regression, to more powerful algorithms like K-nearest neighbours, Gaussian Naive Bayes, Decision Trees, and Random Forests, and even includes unsupervised learning methods like K-means for clustering problems (this post has a nice, short description of each).
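
As a rough illustration of one of these algorithms, K-nearest neighbours fits in a few lines of plain Python. The toy points and the function name `knn_predict` are made up for this example. In practice you would likely reach for a library implementation such as scikit-learn's.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data set (hypothetical): two well-separated clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.8, 1.2)))  # lands in the first cluster
print(knn_predict(points, labels, (6.2, 6.8)))  # lands in the second cluster
```

There is no training step here: the "model" is just the stored data, and all the work happens at prediction time. That makes the algorithm easy to understand but slow on large data sets.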

Hyperparameter tuning, validation, and evaluation techniques are also presented (like some of those described here), and by the end of the bootcamp, all students are able to create basic scikit-learn workflows and apply them to simple problems. The first hackathon immediately follows, challenging them to tackle a Binary Classification problem.
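
To give an idea of what such a workflow might look like, here is a minimal sketch, assuming scikit-learn is installed. The synthetic data, the choice of logistic regression, and the grid of `C` values are all my own assumptions for illustration; the academy's notebooks may structure things differently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for a real problem.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model chained together, so scaling is re-fit
# inside each cross-validation fold and never sees validation data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning: 5-fold cross-validation over regularization strength.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Final evaluation on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The important design choice is the separation of concerns: cross-validation picks the hyperparameters, and the held-out test set is used only once, at the end, to estimate how the chosen model generalizes.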

The next week is dedicated to time series, the second specialization of the academy. It goes over some particularities of working with data that involves time, introducing important concepts like trend, cyclical, seasonal, and irregular components. The students learn to identify these components and to do forecasts in this context. They also learn how to reframe these problems as non-time-series problems, skills they’ll be able to apply in the Time Series Regression hackathon.
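
One common way to treat a forecasting task as a regular supervised problem is to use lagged values as features. The sketch below is my own illustration of that idea, not code from the academy: the helper `make_lag_features` and the synthetic series (a linear trend plus a seasonal cycle of period 12) are hypothetical.

```python
import math

def make_lag_features(series, n_lags):
    """Turn a 1-D series into (features, target) rows for supervised learning."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # the n_lags values preceding time t
        target = series[t]               # the value to forecast
        rows.append((features, target))
    return rows

# Hypothetical series: linear trend plus a seasonal cycle of period 12.
series = [0.5 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(36)]

rows = make_lag_features(series, n_lags=3)
first_features, first_target = rows[0]

print(len(rows))  # 36 observations minus 3 lags leaves 33 rows
print([round(v, 2) for v in first_features], round(first_target, 2))
```

Each row can then be fed to any regressor. One caveat when evaluating such a model: the train/test split should respect time order, since shuffling would leak future values into the training set.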

By this point, they’ve seen a great many problems, but a very important piece is missing.

As someone who actually handles data, I can tell you that the biggest chunk of work in solving data science problems is just getting the data and processing it so it is ready to be analyzed. For this purpose, there is a need to process different formats — JSON, HTML, XML, XLS, CSV, TSV, among others — and handle different encodings. Other sources of data exist: databases and APIs are also a big part of it. SQL, HTTP requests and the basics of web scraping are taught to introduce the required concepts. The specialization is completed with cleaning methods to get the data in the same formats and prepare it for processing. Although this can be a painful experience, it is also extremely useful.
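
Here is a small taste of what that looks like in practice, using only the standard library. Everything in this sketch is hypothetical (the records, the field names, and the helper functions). It shows the same kind of record arriving as UTF-8 JSON and as Latin-1 semicolon-separated text, and being cleaned into one common schema.

```python
import csv
import io
import json

# Hypothetical raw inputs: the same kind of record arriving in two formats.
json_bytes = '[{"name": " João ", "age": "31"}]'.encode("utf-8")
csv_bytes = "name;age\nInês;27\n".encode("latin-1")  # note the different encoding

def load_json(raw, encoding="utf-8"):
    """Decode raw bytes and parse them as JSON."""
    return json.loads(raw.decode(encoding))

def load_csv(raw, encoding, delimiter=","):
    """Decode raw bytes and parse them as delimited text with a header row."""
    text = raw.decode(encoding)
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def clean(record):
    """Bring a record into one schema: stripped name, integer age."""
    return {"name": record["name"].strip(), "age": int(record["age"])}

raw_records = load_json(json_bytes) + load_csv(csv_bytes, "latin-1", ";")
records = [clean(r) for r in raw_records]
print(records)
```

Decoding with the wrong encoding is one of the most frequent sources of trouble: the Latin-1 bytes above would raise a `UnicodeDecodeError` if decoded as UTF-8.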

The fourth specialization, and the one that brought us here, is the Text Classification specialization. It extends many of the previous concepts, which until then had been applied only to numeric data, to text. It therefore starts with techniques used in text processing, like tokenization, regular expressions, stemming, and stopword removal, among others.
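
These steps can be sketched with nothing more than Python's `re` module. The stopword list and the `naive_stem` function below are deliberately crude stand-ins that I made up for illustration; a real project would more likely use the stopword lists and stemmers that ship with a library such as NLTK.

```python
import re

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(token):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    tokens = tokenize(text)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The students are cleaning and tokenizing raw texts."))
# → ['student', 'clean', 'tokeniz', 'raw', 'text']
```

Note that stems like "tokeniz" are not real words. That is expected: the goal of stemming is to map related word forms to the same token, not to produce readable output.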

The students learn how to extract features from raw text, which differs from the previous problems where the features were readily available. They learn how to assess the usefulness of these features through methods like the chi-squared test and perform some initial analysis on the text. The specialization also introduces high-dimensionality problems and dimensionality reduction techniques like SVD and PCA. Although neural networks are left out, it closes with word embeddings, a major concept in most NLP problems nowadays, and one that allows you to extract interesting relations like “king - man + woman = queen”.

The NLP frameworks and libraries presented included NLTK, SciPy, FastText, and some Stanford libraries. The examples shown covered topic classification and sentiment analysis, and the hackathon focused on intent classification, providing a wide overview of real text classification problems.

I should say, though, that NLP is a much bigger field than just text classification. If you want to fully understand the field, there is a vast amount of resources out there. From a more academic perspective, I’ve worked with some colleagues to pull together the following reading list for a deeper dive on NLP.

(Disclaimer: I would not advise you to start from these if you don’t have the basic mathematical or statistical background.)

The last two specializations are Recommendation Systems and Deployment. The first deals with a type of system that is present almost everywhere nowadays: think of any online platform, like Amazon, Airbnb, or even Facebook and Google, providing you with “suggestions” (a.k.a. recommendations) of what to buy, what to consume, and where to go next. The second teaches students how to deploy their models, so they can actually be put to good use. After all, students should be prepared to apply data science to the real world; this is the end goal!

So, if you want to become a data scientist, and/or these topics interest you, you can either spend all that time looking them up and trying to understand them by yourself or just follow the academy notebooks. Who knows, you might even want to enroll in the next one!

But if you can’t wait until the next signup date, there are a lot more resources that you can access in the meantime to start learning data science or just practice your skills — a decent analysis of available courses is presented here. And if you get stuck trying to solve or learn something, remember there is always a community ready to help out.