Say you have a document and you want to know whether it talks about python (a term you care about).

Document: I am a python developer.

Term: python

You want to check whether the document contains the word python. So you open the document, press Ctrl+F, and search for ‘python’. And you find it :)

Now say you have 100 such terms: [python, java, github, medium, etc.]

You open the document with a simple Python script, loop through each term, and check whether the term is present.

    text = open(document).read()
    for term in terms:
        if term in text:
            print(term)

Now say you have 100 documents. Well, you can open each document in a loop, and for each document search for every term.

    for document in documents:
        text = open(document).read()
        for term in terms:
            if term in text:
                print(term)
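The loop above can be tried end to end with a small runnable sketch. The in-memory `documents` dictionary, the term list, and the `find_terms` helper are assumptions made up for illustration, standing in for real files on disk.

```python
# Hypothetical in-memory documents standing in for files on disk.
documents = {
    "doc1": "I am a python developer.",
    "doc2": "I write javascript and host it on github.",
}
terms = ["python", "java", "github", "medium"]


def find_terms(text, terms):
    """Return the terms that occur in text as plain substrings."""
    lowered = text.lower()
    return [term for term in terms if term in lowered]


# One pass per document, one substring check per term.
results = {name: find_terms(text, terms) for name, text in documents.items()}
for name, found in results.items():
    print(name, found)
```

Note that a plain substring check reports java for doc2, even though that document only mentions javascript.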

Now say java should match Java but not javascript.

Better yet, java should match j2ee and Java both, but not java script.

(j2ee and java are synonyms, and did you notice the space in java script?)

Now it’s getting interesting. How do you do that?

We ran into this problem last year at Belong.co. We noticed that people talk about the same terms in multiple ways. Big Apple could be either a big apple or New York. Luckily for us, we had some context. When our documents talk about Python, they mean the programming language 99.99% of the time, not the animal.

But this didn’t simplify our problem. Java and j2ee are the same thing for us, but java script is not. So how do you extract this information from millions of documents?

As you can imagine, we wrote regex-based code. For 1 million documents and 2K keywords, the code took 24 hours to run. And life was good :)
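To make the idea concrete, here is a minimal sketch of what a regex-based matcher could look like. It is not the code we actually ran; the `synonyms` table and the `extract_terms` function are hypothetical, and it assumes a single combined pattern with word boundaries, with longer forms tried first so that java script is never read as java.

```python
import re

# Hypothetical synonym table: each surface form maps to one canonical term.
synonyms = {
    "java": "java",
    "j2ee": "java",
    "python": "python",
    "javascript": "javascript",
    "java script": "javascript",
}

# Longest forms first, so "java script" wins over "java" at the same position.
forms = sorted(synonyms, key=len, reverse=True)

# The lookarounds stop a form from matching inside a longer word,
# so "java" does not match inside "javascript".
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(form) for form in forms) + r")(?!\w)",
    re.IGNORECASE,
)


def extract_terms(text):
    """Return the set of canonical terms mentioned in text."""
    return {synonyms[match.group(0).lower()] for match in pattern.finditer(text)}


print(extract_terms("I know Java and J2EE."))  # {'java'}
print(extract_terms("I write java script."))   # {'javascript'}
```

A pattern like this still has to be run over every document, and it grows with every keyword and synonym you add.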