A list of beginner-friendly NLP projects—using pre-trained models

Build software with machine learning—no math required.

If you’re interested in studying machine learning from the ground up, there are plenty of great resources. Organizations like fast.ai have made it so that anyone with a vaguely technical background can learn the foundations of machine learning and train their own models.

If you’re interested in building production software with machine learning, however, there are fewer resources available to you. The infrastructure challenges of putting machine learning in production simply don’t have the same wealth of writing around them.

This article is designed to serve as a directory of software projects built on NLP (natural language processing) that anyone—even someone without ML experience—can build.

These projects are not toys, either. Every one of these projects is inspired by real software sold by real companies today.

A note before we start

Each project below will use a similar architecture:

You will implement a relevant pre-trained model.

You will deploy the model as an API.

You will connect your API to your main application.

This design pattern is referred to as realtime inference. There are a number of benefits to this approach.

First, it removes the burden of computation from your main application, offloading it to a server specifically built for serving ML models. Second, it lets you incorporate ML predictions via an API, a pattern most software developers are already familiar with. Finally, there are open source tools—like Cortex—that automate all of the infrastructure work needed to deploy these models as APIs, meaning you won’t have to burn hours figuring out how to configure AWS to serve machine learning models.

To deploy any model with Cortex, as you’ll see in the examples linked throughout this article, you’ll need to do three things:

Write a Python script for serving predictions from your model.

Write a configuration file which will define your deployment.

Run cortex deploy from your command line.
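As a sketch, the prediction-serving script is a small Python class. The `PythonPredictor` name and the `__init__`/`predict` signatures below follow Cortex's Python predictor convention at the time of writing, but treat the details as an assumption and check the Cortex docs for your version:

```python
# predictor.py -- a minimal Cortex-style predictor sketch.
# The model-loading line is a placeholder for whichever pre-trained
# model you deploy.

class PythonPredictor:
    def __init__(self, config):
        # Load your pre-trained model once, at deployment time, e.g.:
        # self.model = torch.hub.load("pytorch/fairseq", "roberta.base")
        self.model = None

    def predict(self, payload):
        # payload is the parsed JSON body of each request
        text = payload["text"]
        # Replace this stub with a real model call, e.g.
        # self.model.fill_mask(text, topk=3)
        return {"input": text, "prediction": None}
```

The configuration file (a YAML file in Cortex's case) then points at this script, and `cortex deploy` pushes both to your cluster.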

You can see all of the above in this gif:

With all of that out of the way, let’s start.

Project #1. An autocomplete feature

Autocomplete has traditionally been achieved using key-value lookups, in which an incomplete word typed by the user is compared against a dictionary and potential completions are suggested.
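That classic approach is easy to picture. A toy version (the function name and word list here are purely illustrative):

```python
def classic_autocomplete(prefix, dictionary, limit=3):
    """Traditional dictionary-based autocomplete: suggest stored words
    that begin with the user's partial input."""
    return [word for word in dictionary if word.startswith(prefix)][:limit]
```

For example, `classic_autocomplete("ca", ["cap", "car", "cat", "dog"])` returns `["cap", "car", "cat"]`. It can only ever suggest words already in the dictionary, which is exactly the limitation machine learning removes.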

With machine learning, however, autocomplete can be taken a step further. Instead of referencing a static dictionary of words or phrases, a model can be trained on real world user input to predict the most likely next phrase.

A familiar example of this is Gmail’s Smart Reply, which suggests responses to emails you receive:

Let’s look at how you can build your own version of ML-powered autocomplete.

What model should I use?

In this situation, we’re going to want to use RoBERTa.

RoBERTa is an NLP model developed at Facebook. It’s built off of Google’s famous BERT—hence the weird capitalization in RoBERTa—and improves on its predecessor’s performance by implementing a slightly different approach to training. For a little more info, you can check out this article.

The pre-trained RoBERTa, loaded through the PyTorch Hub, comes with a built in fill_mask() method that allows you to pass in a string, point to the location where RoBERTa should predict the next word/phrase (the “mask” referred to by “fill_mask”), and receive your prediction.
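That call can be sketched as follows. The `torch.hub` model name and `fill_mask`'s tuple layout follow the fairseq documentation, but verify both against your installed versions; the `top_suggestions` helper is something I've added for this sketch. The model load is guarded so it only runs when you execute the script directly:

```python
def top_suggestions(fill_mask, text, topk=3):
    """Collect just the predicted tokens from fill_mask's output, which
    fairseq's RoBERTa returns as (filled_text, score, token) tuples."""
    return [token.strip() for _, _, token in fill_mask(text, topk=topk)]

if __name__ == "__main__":
    import torch

    # Downloads the pre-trained weights on first run.
    roberta = torch.hub.load("pytorch/fairseq", "roberta.base")
    print(top_suggestions(roberta.fill_mask, "The cat sat on the <mask>."))
```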

Now, all you need to do is deploy RoBERTa as an API, and write a function on your frontend that queries your model with your user’s input.
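Connecting the frontend amounts to one HTTP call. A minimal Python client (the endpoint URL and the JSON request/response shapes are assumptions; match them to your deployment):

```python
import json
import urllib.request

def query_autocomplete(endpoint, text, opener=urllib.request.urlopen):
    """POST the user's partial input to the deployed model API and return
    the parsed JSON response. `opener` is injectable so the function can
    be exercised without a live endpoint."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with opener(request) as response:
        return json.loads(response.read())
```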

If you run into any trouble, you can deploy RoBERTa using this Cortex example.

Project #2. Customer support bot

Support bots are by no means a new concept—sites have had bots loaded with canned responses for years—but with machine learning, the entire field has taken a step forward.

In the past, a support bot might have prewritten answers to a handful of questions. If a question wasn’t worded in a way the bot recognized, or if it touched on a topic outside of or more nuanced than the prewritten responses, the bot wouldn’t work.

Now, however, ML-powered bots can parse and understand user input—not just compare it to a list of questions—and can generate answers all on their own.

Companies like Reply.ai, which builds custom support bots for businesses, are a prime example of this development. According to Reply.ai’s data, an average company can handle 40% of its inbound support requests via an ML-powered bot, like the example bot below:

Now, let’s build our own customer support bot.

What model should I use?

DialoGPT is perfect for this task.

DialoGPT is a model built by Microsoft, developed from Hugging Face’s pytorch-transformers and OpenAI’s GPT-2. The model was trained on Reddit conversations, and will return an answer to any text it is queried with.

However, Microsoft famously withheld the decoder for this model (there were concerns about the potential output of a Reddit-trained model), so you’ll have to implement your own GPT-2 decoder to translate the model’s responses into human language. Luckily, that won’t be too difficult for you—you can run the entire deployment of DialoGPT by cloning this repo.
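If you'd rather wire up the generation loop yourself, one route is Hugging Face's transformers library, which hosts DialoGPT. The model name below comes from its model card; `extract_reply` is a helper I've added for this sketch, and the model download only runs when the script is executed directly:

```python
def extract_reply(decode, prompt_length, output_ids):
    """DialoGPT's generate() returns the prompt tokens followed by the
    reply; keep only the new tokens and decode them. `decode` is the
    tokenizer's decode function."""
    return decode(output_ids[prompt_length:])

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    prompt = tokenizer.encode(
        "Does my order ship today?" + tokenizer.eos_token, return_tensors="pt"
    )
    output = model.generate(
        prompt, max_length=200, pad_token_id=tokenizer.eos_token_id
    )
    print(extract_reply(
        lambda ids: tokenizer.decode(ids, skip_special_tokens=True),
        prompt.shape[-1],
        output[0],
    ))
```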

Once you’ve deployed your DialoGPT API, you can connect it to your frontend and start fielding customer requests.

Bonus Tip: If you have trouble with DialoGPT, I’ve written another tutorial on building a chatbot using a different model, ELMo-BiDAF. You can read it here.

Project #3. Predictive Text Generator

If you’re vaguely aware of the machine learning community, you’ve heard of AI Dungeon 2. The game—which is so popular it was initially shut down because its cloud hosting cost more than $10,000/day—is a classic text adventure game, except that the story is entirely generated by GPT-2. This allows you to do anything, like, say, eat the moon:

AI Dungeon 2 was built using OpenAI’s GPT-2, and while an interactive RPG may not be the business case you’re looking for, AI Dungeon 2 demonstrates how convincing auto-generated text can be.

A great example of auto-generated text’s business implications is Deep TabNine. Deep TabNine is a product that uses machine learning to implement autocomplete within your IDE, for a variety of programming languages:

If you’re a software engineer, the idea of using ML to generate accurate, complete lines of code instantly must be thrilling.

Let’s look at how we can build our own version.

What model should I use?

For this project, you should use the behemoth itself, OpenAI’s GPT-2.

When GPT-2 was first released, it made waves for a couple of reasons. First, it was extremely powerful. Second, the team at OpenAI refrained from releasing the full pre-trained model, fearing it might be abused. This, predictably, set off a media firestorm over a potentially “too dangerous for the public” AI.

Fast forward to now, and the full model has been released, with no Skynet apocalypse reported.

Interacting with the deployed model is simple: send it a piece of text, and watch it generate more.
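A sketch using Hugging Face's text-generation pipeline (the "gpt2" checkpoint name follows the transformers model hub; `continuation` is a helper added here, and the model load is guarded so it only runs when executed directly):

```python
def continuation(prompt, generated_text):
    """Text-generation pipelines return the prompt plus its continuation;
    strip the prompt so only the newly generated text remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text

if __name__ == "__main__":
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    prompt = "In a shocking finding, scientists discovered"
    result = generator(prompt, max_length=60, num_return_sequences=1)
    print(prompt + continuation(prompt, result[0]["generated_text"]))
```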

To deploy GPT-2 with Cortex, you can use this repository.

Project #4. Language identifier

Have you ever browsed to a website in Google Chrome and seen this popup?

Have you ever wondered how Chrome identifies what language the page is in? The answer is simple. It uses a language identifier.

Language identification is notoriously tricky. Different languages share many words in common, different dialects and slang make languages harder to detect, and there is no law against using multiple languages in a web page (an English article featuring a French quote, for example).

This fuzzy task of determining which language a given body of text is written in is perfect for machine learning. Let’s look at how we’d build our own language identifier below.

What model should I use?

Facebook’s fastText.

fastText is a model that uses word embeddings to understand language. In my tutorial on deploying fastText as an API, I gave a high-level overview of what makes fastText special:

Word embeddings represent words as n-dimensional vectors of floating point numbers, in which each number represents a dimension of the word’s meaning. Using word vectors, you can “map” words according to their semantic meaning — for example, if you subtract the vector for “man” from the vector for “king” and add “woman,” you’ll end up with roughly the vector for “queen.” In other words, king — man + woman = queen.

word2vec was one of the first popular tools for producing word embeddings, and fastText is an extension of word2vec. Whereas word2vec processes individual words, fastText breaks words down into n-grams. This makes fastText capable of, among other things, better understanding obscure words. When given a rare word like “demisemiquavers,” fastText will analyze the smaller n-grams within it (“demi,” “semi,” etc.) to help find its semantic meaning, similar to how you might analyze familiar root words to understand an unfamiliar word.
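The n-gram idea is easy to see in code. The `<` and `>` boundary markers below follow the convention in the fastText paper, though fastText's internal handling differs in detail:

```python
def char_ngrams(word, n):
    """Character n-grams of a word, with '<' and '>' marking the word
    boundaries as in the fastText paper."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For instance, `char_ngrams("demi", 3)` yields `['<de', 'dem', 'emi', 'mi>']`; a rare word shares many of these fragments with familiar words, which is what lets fastText place it near them in vector space.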

Deploying fastText is fairly straightforward. You can use this repository, and if you need extra help, you can follow along with this tutorial.
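For language identification specifically, fastText publishes a pre-trained model (lid.176.bin). The snippet below assumes you've downloaded that file; `language_of` is a small helper added for the sketch, and the model load is guarded so it only runs when executed directly:

```python
def language_of(label):
    """fastText's language-ID model emits labels like '__label__en';
    strip the prefix to get the bare language code."""
    return label.replace("__label__", "", 1)

if __name__ == "__main__":
    import fasttext

    # lid.176.bin is fastText's published language-identification model;
    # download it from the fastText website before running this.
    model = fasttext.load_model("lid.176.bin")
    labels, scores = model.predict("C'est un exemple de texte en français.")
    print(language_of(labels[0]), float(scores[0]))
```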

Project #5. Media monitor

The reality of modern business is that if your users have an opinion on your product, they have massive platforms to share it on. To effectively stay on top of your brand, monitoring social media for mentions of your company can be necessary.

An example of this kind of product is Keyhole, a social analytics platform that uses machine learning to monitor social media relevant to your company:

One of the biggest challenges in building this kind of tool, however, is figuring out what constitutes an actual mention of your brand.

Say you wanted to build a service that monitors Hacker News for your brand. Scraping HN comments every day would be fairly straightforward, and searching those comments for words related to your brand would be easy as well. But, and this is the sticking point, how could you know for certain that those keywords are being used in relation to your brand?

For example, if I was monitoring for Cortex, how could I know that the word “cortex,” when used in a given comment, was referring to the open source platform and not the prefrontal cortex of a person’s brain?

This is where machine learning comes in.

What model should I use?

Flair’s SequenceTagger.

Flair is an open source NLP library built on PyTorch. Flair excels in a number of areas, particularly named entity recognition (NER), which is exactly the problem we are trying to solve.

For example, the following is taken directly from the Flair repository:

You can implement Flair with Cortex using Cortex’s Predictor API, which is the method we’ve been using for deploying all of our PyTorch models so far.
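As a sketch of the monitoring logic: the `SequenceTagger` usage below follows Flair's documented API (span attribute names vary between Flair versions, so verify against yours), while `brand_mentions` and its dict-shaped spans are simplifications I've added. The model load is guarded so it only runs when executed directly:

```python
def brand_mentions(spans, brand, wanted_tags=("ORG", "MISC")):
    """Keep only entity spans whose text matches the brand name and whose
    NER tag marks an organization or product, filtering out incidental
    uses (e.g. 'cortex' as a brain region). Each span here is a dict
    with 'text' and 'tag' keys, a simplification of Flair's Span."""
    return [
        span for span in spans
        if span["text"].lower() == brand.lower() and span["tag"] in wanted_tags
    ]

if __name__ == "__main__":
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("ner")
    sentence = Sentence("I deployed my model with Cortex on AWS.")
    tagger.predict(sentence)
    # Attribute access on spans differs across Flair versions.
    spans = [{"text": s.text, "tag": s.tag} for s in sentence.get_spans("ner")]
    print(brand_mentions(spans, "Cortex"))
```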

You don’t need a PhD to use machine learning

There is often a perception around machine learning that it is only for those with intense mathematical or theoretical CS backgrounds.

If you want to develop new model architectures or push the boundaries of machine learning, that’s probably true—you will need to understand machine learning on a theoretical level. If, however, you simply want to build software with machine learning, this barrier to entry is imaginary.

Every one of the projects I’ve listed above can be built even if you only have a rudimentary understanding of software development, and all of them mimic the functionality of real, successful products—they’re not toy projects.

If you build any of the above, let me know in the comments, and if you have any questions about deploying your models as APIs, feel free to ask them in the Cortex Gitter.