The first step is to build a database of tagged words. For a financial application, we would tag “gold” as a commodity. For an image application, we would probably tag “gold” as a color. Once we have a starting database, we can use it to split sentences into words, each of which is tagged with its topic. A human must then manually confirm that each tag is correct. This is known as supervised learning.
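
As a rough illustration of the lookup step, here is a minimal Python sketch. The database contents, the tag names, and the `tag_sentence` helper are all assumptions made for this example; a real system would use a much larger database and a proper tokenizer rather than splitting on spaces.

```python
# Hypothetical starter database for a financial application.
# The words and tag names are illustrative assumptions.
TAG_DATABASE = {
    "gold": "commodity",
    "prices": "financial_term",
    "rise": "event",
}


def tag_sentence(sentence, database):
    """Split a sentence into words and look up each word's tag."""
    words = sentence.lower().split()
    # Words missing from the database are marked "unknown" for human review.
    return [(word, database.get(word, "unknown")) for word in words]


tagged = tag_sentence("Gold prices rise", TAG_DATABASE)
print(tagged)
```

An image application could reuse the same function with a different database in which “gold” maps to a color.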



If the tags are correct, we continue. If a tag is incorrect, the algorithm must be adjusted so that it produces the correct tag. Supervised learning therefore involves a significant amount of manual work.
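
The review step might be sketched as follows. This is only one simple way to model it: the `apply_review` function and the shape of the `corrections` dictionary are hypothetical, and “adjusting the algorithm” is reduced here to overwriting entries in the tag database.

```python
def apply_review(tagged_words, corrections, database):
    """Update the database with human-confirmed or corrected tags.

    corrections maps a word to the tag the reviewer says is right.
    Words missing from corrections are treated as confirmed.
    Returns the number of tags the reviewer had to change.
    """
    changed = 0
    for word, predicted_tag in tagged_words:
        correct_tag = corrections.get(word, predicted_tag)
        if correct_tag != predicted_tag:
            changed += 1
        database[word] = correct_tag
    return changed


# The program guessed "color" for gold; the reviewer corrects it.
database = {"gold": "color"}
tagged = [("gold", "color"), ("prices", "unknown")]
reviewer_input = {"gold": "commodity", "prices": "financial_term"}
changed = apply_review(tagged, reviewer_input, database)
print(changed, database)
```

The count of changed tags gives a crude measure of how much manual work each batch of sentences required.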

As the program sees more sentences, it can start to make some connections. In linguistics, words that frequently occur together are called collocations. In English, for example, it is extremely common to say “gold prices”. Statistically, it is overwhelmingly likely that “gold prices” refers to the financial concept rather than an artistic way of expressing excellent pricing. As the machine sees more articles, it might start to recognize “prices rise” and “prices decline” as actionable scenarios. If “gold prices” is tagged as a financial concept, and “prices rise” is tagged as an event, the program gains some understanding of “gold prices rise”.
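
One simple way to surface candidate collocations is to count adjacent word pairs (bigrams). The sketch below assumes a tiny made-up set of headlines and a hypothetical `count_bigrams` helper; raw counts are only a starting point, and real systems usually apply a statistical association measure on top of them.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count how often each pair of adjacent words appears."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # zip pairs each word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts


# Made-up headlines for illustration only.
headlines = [
    "gold prices rise",
    "gold prices decline",
    "oil prices rise",
    "gold medal winner",
]
bigrams = count_bigrams(headlines)
print(bigrams[("gold", "prices")], bigrams[("gold", "medal")])
```

Pairs that appear far more often than others, such as “gold prices” in this toy data, are the ones worth tagging as a unit.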

Now let’s say the final part of the sentence is “on Planet X”, so we have “gold prices rise on Planet X”. “Planet” could be tagged as a location, and on the first try, the program will probably say this is a relevant story (gold prices rise in [location]). In fact, the story might be about the gold price in an online game. The human overseer would have to tag “Planet X” as an irrelevant location (at least for real-world financial news), and in the future, “Planet X” stories will be ignored. This could lead to interesting problems, though, such as when the scientific discovery of a real Planet X is worthwhile news.
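
The effect of the overseer’s decision can be sketched as a blocklist of irrelevant locations. The `IRRELEVANT_LOCATIONS` set and the `is_relevant` function are hypothetical names for this example, and simple substring matching is an assumption chosen for brevity.

```python
# Locations a human reviewer has marked as irrelevant (assumed data).
IRRELEVANT_LOCATIONS = {"planet x"}


def is_relevant(headline, irrelevant_locations):
    """Return False if the headline mentions any irrelevant location."""
    text = headline.lower()
    return not any(location in text for location in irrelevant_locations)


print(is_relevant("Gold prices rise on Planet X", IRRELEVANT_LOCATIONS))
print(is_relevant("Gold prices rise in London", IRRELEVANT_LOCATIONS))
```

A rule this blunt would also discard a genuine story about the discovery of a real Planet X, which is exactly the kind of problem described above.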

Sometimes machines make their own connections based on patterns. The program may eventually learn the pattern “in + [location]”. It may not have the town of Sarasota, New York in the original database, but it can see “government raises taxes in Sarasota, NY”, understand “raise taxes” as an action, and guess (correctly) that “Sarasota, NY” is a location based on “in” and the pattern “unknown word, comma, two-letter US state code”. This is the main goal of ML and NLP: develop the application so that it can make accurate connections on its own.
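
To make the pattern concrete, here is a hand-written version of it as a regular expression. Note the difference from the scenario above: a learning program would infer this pattern from examples, whereas this sketch hard-codes it. The `guess_location` function and the shortened list of state codes are assumptions for illustration.

```python
import re

# A few state codes for illustration; a real list would include all of them.
US_STATE_CODES = {"NY", "FL", "CA", "TX"}

# "in", then one or more capitalized words, a comma, and two capital letters.
LOCATION_PATTERN = re.compile(
    r"\bin ([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*), ([A-Z]{2})\b"
)


def guess_location(sentence):
    """Return 'Place, ST' if the sentence matches the pattern, else None."""
    match = LOCATION_PATTERN.search(sentence)
    if match and match.group(2) in US_STATE_CODES:
        return f"{match.group(1)}, {match.group(2)}"
    return None


print(guess_location("government raises taxes in Sarasota, NY"))
```

The place name itself never has to appear in the database; only the surrounding structure is checked.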

ML and NLP are not perfect, but they are statistical tools that greatly reduce the number of news stories one must sort through. The algorithm will throw out “Gold Prices Rise on Planet X” and be able to understand that “raising taxes in [unknown location]” is relevant to someone interested in government actions.