Edit: Rasa is an open source framework for building conversational AI.

I believe in most cases it makes sense for bot makers to build their own natural language parser, rather than using a third party API. There are good strategic and technical arguments for doing this, and I want to show how easily you can put something together. This post has 3 sections:

argues why you should bother shows that the simplest possible approach works OK shows something you can actually use

So what NLP do you need for a typical bot? Let’s say you’re building a service to help people find restaurants. Your users might say something like:

I’m looking for a cheap Mexican place in the centre.

To respond to this you need to do two things:

i. Understand the user’s intent: they are searching for a restaurant and not saying “hello”, “goodbye”, or “thanks”.

ii. Extract cheap, Mexican, and centre as fields you can use to run a query.

In a previous post I mentioned that tools like wit and LUIS make intent classification and entity extraction so simple that you can build a bot like this during a hackathon. I’m a huge fan of both of these tools, and the teams behind them. But they’re not the right solution in every case.

1. Three reasons to do it yourself

Firstly, if you actually want to build a business based on conversational software, it’s probably not a good policy to pass on everything your users tell you to either Facebook or Microsoft. Secondly, I just don’t believe a web API is the solution for every problem in programming. https calls are slow, and you’re always constrained by the design choices that went into the API endpoints. Libraries, on the other hand, are hackable. Thirdly, there’s a good chance you’ll be able to build something that performs better on your data & use case. Remember, the general-purpose APIs have to do well on every problem, you only have to do well on yours.

2. Word2vec + heuristics — sophistication = working code

We’ll start by building the simplest possible model to get a feel for how this works, not using any libraries (other than numpy) .

I strongly believe that the only thing you can really do in machine learning is to find a good change of variables. If that seems cryptic, I’m currently writing another post explaining what I mean by that, so check back later. The point is that if you have an effective way to represent your data, then even extremely simple algorithms will do the job.

We’re going to use word vectors, which are vectors of a few tens or hundreds of floating point numbers which to some extent capture the meaning of a word. The fact that this can be done at all, and the way in which these models are trained, are both super interesting. For our purposes this means that the hard work has already been done for us: word embeddings like word2vec or GloVe are a powerful way to represent text data. I decided to use GloVe for these experiments. You can download pre-trained models from their repo, and I took the smallest (50 dimensional) pre-trained model.

Below I’ve pasted some code, based on the python examples in the GloVe repo. All this does is load the word vectors for the whole vocabulary into memory.

Now let’s try to use these vectors for the first task: picking out Mexican as a cuisine type in the sentence “I’m looking for a cheap Mexican place in the centre”. We’ll use the simplest method possible: provide some example cuisine types, and just look for the most similar words in the sentence. We’ll loop through the words in the sentence, and pick out the ones whose average cosine similarity to the reference words is above some threshold.

Let’s try it out on an example sentence.

Amazingly, this is enough to correctly generalise, and to pick out “indian” as a cuisine type, based on its similarity to the reference words. Hence why I say that once you have a good change of variables, problems become easy.

Now on to classifying the user’s intent. We want to be able to put sentences in categories like “saying hello”, “saying thanks”, “asking for a restaurant” , “specifying a location”, “rejecting a suggestion”, etc, so that we can tell the bot’s backend which lines of code to execute. There are many ways we could combine word vectors to represent a sentence, but again we’re going to do the simplest thing possible: add them up. If you think it’s abhorrent to add together word vectors, I hear you. Read the appendix at the bottom for why this works.

We can create these bag-of-words vectors for each sentence, and again just categorise them using a simple distance criterion. Again, amazingly, this can already generalise to unseen sentences.

Neither the parsing nor classification that I’ve shown is particularly robust, so we’ll move on to something better. But I hope that I’ve shown that there’s really nothing mysterious going on, and in fact very simple approaches can already work.

3. Something you can actually use

There are lots of things we can do better. For example, converting a block of text into tokens requires more than just splitting on whitespace. One approach is to use a combination of SpaCy / textacy to clean up and parse your text, and scikit-learn to build your models. Here I’ll use a nice library called MITIE (MIT Information Extraction) which does everything we need and has python bindings.

There are two classes we can use directly. First, a text categoriser:

Second, an entity recogniser:

The MITIE library is quite sophisticated and uses spectral word embeddings rather than GloVe. The text categoriser is a straightforward SVM, whereas the entity recogniser uses a structural SVM. The repo has links to some relevant literature if you’re interested.

As you might expect, using a library like this (or a combination of SpaCy and your favourite ML lib) gives much better performance than the hacky code I posted at the start. In fact, in my experience you can very quickly beat the performance of wit or LUIS on your particular problem, purely because you can tune the parameters for your data set.

Conclusion

I hope I’ve convinced you that it’s worth having a go at creating your own NLP modules when building a bot. Please do add your ideas, suggestions, and questions below. I look forward to the discussion. If you enjoyed this, you can follow me here on medium or, better yet, on twitter.

Thanks to Alex, Kate, Norman, and Joey for reading drafts!