Word vectors

Computers only deal with numbers, and so to get a computer to analyse text data — to find topics, to translate, to summarise, and so on — you must first convert the text into numbers. A ‘word vector’ is simply a set of numbers that represents a word: the computer’s internal representation of that word.
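
For example, a toy vocabulary might be stored as a mapping from words to short lists of numbers. The values below are invented purely for illustration; real word vectors are learned from data, not chosen by hand:

```python
# A toy illustration: each word is represented by a short list of numbers.
# These values are invented for the example; real word vectors are learned
# from data, not chosen by hand.
word_vectors = {
    "pen":    [0.8, 0.3],
    "pencil": [0.7, 0.4],
    "write":  [0.6, 0.9],
}
```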

If we train a computer to predict the missing word in a sentence, giving it millions of examples to learn from, and we allow it to improve its predictions by changing the numbers allocated to each word, we find that synonyms end up being allocated numbers that are close to one another.
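
To make ‘close to one another’ concrete, here is a minimal sketch that measures how similar two vectors are using cosine similarity. The vectors are invented for the example; in a trained model, the pattern of synonyms scoring higher than unrelated pairs emerges from the data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1 when they point the same way."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical trained vectors (values invented for the example).
pen    = [0.8, 0.3]
pencil = [0.7, 0.4]
banana = [-0.5, 0.1]

print(cosine_similarity(pen, pencil))  # high: the words appear in similar contexts
print(cosine_similarity(pen, banana))  # low: their contexts rarely overlap
```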

There are lots of blog posts and tutorials out there which explain the mechanics behind the word vector training process. My aim below is to give an understanding of why the words end up in the places they do — of why synonyms end up close together.

Consider the problem of predicting the word missing from the following sentence:

“I picked up the _____ and started to write.”

What can we say about the missing word? It has to be a noun; it’s probably a thing (although it could be a metaphorical thing); but it’s most likely something to write with, or something to write on.

Assume that the computer’s representation of a word consists of 2 numbers — i.e. the word vector has 2 dimensions. We can then view each vector as the coordinates of a point on a world map — each word is located at a distinct point on this world (word?) map. At the start of the training process, the allocation of words is random, with words spread uniformly across the world.

Next, imagine that the computer makes a prediction by throwing a hypothetical dart at the map. Its prediction will be the word closest to the point where the dart lands.
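
In code, this dart-throwing setup amounts to a nearest-neighbour lookup: give every word random two-dimensional coordinates, then predict whichever word sits closest to the point where the dart lands. A minimal sketch (the vocabulary is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Start of training: every word gets random 2-d coordinates,
# spread uniformly across the 'map'.
vocabulary = ["pen", "pencil", "write", "banana", "forthwith"]
vectors = rng.uniform(-1.0, 1.0, size=(len(vocabulary), 2))

def predict(dart):
    """Return the word whose coordinates lie closest to where the dart landed."""
    distances = np.linalg.norm(vectors - np.asarray(dart), axis=1)
    return vocabulary[int(np.argmin(distances))]

# With random coordinates, the prediction is essentially arbitrary.
print(predict([0.2, -0.5]))
```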

Our hypothetical darts-playing computer has a rather shaky aim. Faced with the sentence above, it thinks ‘pen’ is the most likely word. Actually, it thinks of a set of coordinates — the word vector — which happens to be a point located over Barbados. So, the computer aims for Barbados, but the dart lands on St Lucia, some way to the north-west. Given that the initial allocation of words is random, it ends up predicting a word that has nothing to do with writing.

How can the computer improve its predictions? Clearly, if all writing-implement words were clustered around the West Indies then we might at least hit ‘pencil’ when we take aim for ‘pen’, and ‘pencil’ is probably an acceptable answer in at least some of the sentences where ‘pen’ fits.

So far, so good. But what about the following (somewhat contrived) example:

“I shall _____ a letter to him forthwith.”

Both ‘write’ and ‘pen’ look like good options, but this is going to cause us problems. If both writing-implement nouns and writing verbs are clustered together, how can we avoid predicting ‘pencil’ or ‘biro’ in this example, or verbs such as ‘write’ in the first example?

The answer is to give the computer more numbers per word — a higher-dimensional word vector. At the risk of stretching an already sketchy analogy, imagine the computer now throws two darts: one at the map, and another at a height chart with a scale running from the sea bed up to the stratosphere. This gives us the space to allow ‘things to do with writing’ to continue to be located over the West Indies. Noun-like words might be found near sea level, and verbs somewhere up in the sky, with ‘pen’ sitting between the two, perhaps closer to the ground since it’s most commonly used as a noun.
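
To make the extra dimension concrete, here is a toy sketch with invented coordinates: the first two numbers place the writing words together on the map, and the third acts as the ‘height’ that separates nouns from verbs:

```python
import numpy as np

# Invented 3-d coordinates: the (x, y) pair clusters the writing words
# together; the third number is 'height' (near 0 = noun-like, near 1 = verb-like).
vocabulary = ["pen", "pencil", "write", "scribble"]
vectors = np.array([
    [0.5, 0.5, 0.2],   # pen: mostly a noun, occasionally a verb
    [0.5, 0.6, 0.0],   # pencil: a noun
    [0.6, 0.5, 1.0],   # write: a verb
    [0.6, 0.6, 1.0],   # scribble: a verb
])

def predict(point):
    distances = np.linalg.norm(vectors - np.asarray(point), axis=1)
    return vocabulary[int(np.argmin(distances))]

# A sentence needing a noun aims low; one needing a verb aims high.
print(predict([0.5, 0.6, 0.05]))  # -> 'pencil'
print(predict([0.6, 0.5, 0.9]))   # -> 'write'
```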

The most widely used word vectors have 300 dimensions. That is, we allocate 300 numbers to each word. This provides an enormous space in which to store information about all kinds of different aspects of a word — whether it is a noun, verb, adjective, etc.; the tense of a verb; whether a noun is plural; various aspects of its meaning; and, as we shall see below, whether it is spelled correctly.
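
As a concrete illustration, pretrained 300-dimensional vectors can be downloaded and inspected with the gensim library. (The model name below is one of gensim’s bundled datasets; the download runs to over a gigabyte.)

```python
import gensim.downloader as api

# Downloads Google's pretrained News vectors on first use (over a gigabyte).
model = api.load("word2vec-google-news-300")

vector = model["pen"]
print(vector.shape)  # (300,): 300 numbers for this word

# Words whose vectors sit closest to 'pen' in the 300-dimensional space.
print(model.most_similar("pen", topn=5))
```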

Before turning to spelling mistakes, note again that the computer doesn’t know anything about words beyond the contexts in which they are found. In particular, it doesn’t know which letters are used: it has no way of knowing that ‘write’ and ‘writing’ share a common root. The two words end up in a similar part of the vector space because the contexts in which they are found overlap. Furthermore, the difference between the two word vectors will display a specific pattern, because ‘write’ is the infinitive of the verb and ‘writing’ is the present participle; and this difference will be similar across all verbs, because the contexts in which infinitives and present participles appear are themselves consistent across verbs.
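
The consistency of that pattern is what the well-known ‘vector arithmetic’ demonstrations rely on: adding the difference between ‘running’ and ‘run’ to the vector for ‘write’ should land near ‘writing’. With pretrained vectors, gensim can perform this arithmetic directly (results vary from model to model):

```python
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# 'run' is to 'running' as 'write' is to ...?
# gensim computes  write + running - run  and returns the nearest words.
print(model.most_similar(positive=["write", "running"], negative=["run"], topn=3))
# A well-trained model typically ranks 'writing' at or near the top.
```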