Back in 1958, Hans Peter Luhn proposed in his paper "The Automatic Creation of Literature Abstracts" that "the frequency of word occurrence in an article furnishes a useful measurement of word significance". This remains one of the most influential ideas in Information Retrieval and is used in practically every search engine, large and small, from Google and Yahoo to dedicated search solutions such as ElasticSearch and Manticore Search.

The significance of Luhn's assumption can hardly be overstated. Many later studies in Information Retrieval were built on it, even when they didn't mention it explicitly; over time it simply became axiomatic.

In this article I'll show, with very simple examples, how TF (standing for 'term frequency') and its counterpart IDF (standing for 'inverse document frequency') help find what you're looking for.

Let's take 5 documents:

1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover
2. Cats and dogs do not like each other, but my cat does like dogs
3. Walking a dog is a good start of the day
4. Not all cats like walking, but some cats do
5. All dogs like walking, but mine doesn't like. It's like so weird

and try to rank them (i.e. find the most and least relevant documents and their order) for different queries.

Weight = 1 if the word occurs, 0 otherwise

Query: 'like'.

Note that we don't know which sense of 'like' is meant in the query: 'A is like B' or 'I like something'. So what can we do to rank the documents?

The first thing that comes to mind is that if a document doesn't contain the word 'like' at all, it's probably less relevant. Thus the order could be:

1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover
2. Cats and dogs do not like each other, but my cat does like dogs
4. Not all cats like walking, but some cats do
5. All dogs like walking, but mine doesn't like. It's like so weird
3. Walking a dog is a good start of the day

We've put document #3 at the very end since it doesn't contain 'like'. The rest keep their original order since they all contain 'like'. Is the quality of such a ranking good enough? Probably not: among the first four documents there are two (#5 and #2) that seem more relevant than the others, since they provide more facts related to the query term 'like' (remember, we don't know exactly which 'like' is meant), yet they're not in the top 2. So what can we do about that?
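This binary weighting scheme can be sketched in a few lines of Python. The tokenizer below is a naive, illustrative one (lowercase, letters and apostrophes only); real engines do much more:

```python
import re

# The five example documents from the article
docs = {
    1: "My dog doesn't like all my pets, but me and my wife do love them all. "
       "A dog cannot be a cat lover",
    2: "Cats and dogs do not like each other, but my cat does like dogs",
    3: "Walking a dog is a good start of the day",
    4: "Not all cats like walking, but some cats do",
    5: "All dogs like walking, but mine doesn't like. It's like so weird",
}

def tokenize(text):
    # naive tokenizer: lowercase, keep letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def binary_weight(doc, term):
    # 1 if the term occurs in the document, 0 otherwise
    return 1 if term in tokenize(doc) else 0

# Python's sort is stable, so documents with equal weight keep their order
ranking = sorted(docs, key=lambda d: binary_weight(docs[d], "like"), reverse=True)
print(ranking)  # → [1, 2, 4, 5, 3]
```

Note how the stable sort reproduces exactly the order above: #3 drops to the bottom, everything else stays put.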

Weight = term frequency

Query: 'like'.

As said before, according to Luhn's assumption, documents containing more occurrences of a query term are likely more relevant. Let's rank the documents by their term frequencies (note that here, and in Information Retrieval in general, 'frequency' means just a count, not a count divided by something as in physics):

5. All dogs like walking, but mine doesn't like. It's like so weird | tf = 3
2. Cats and dogs do not like each other, but my cat does like dogs | tf = 2
1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover | tf = 1
4. Not all cats like walking, but some cats do | tf = 1
3. Walking a dog is a good start of the day | tf = 0

This solves the problem: documents #5 and #2 are now at the very top.
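The TF-based ranking is a one-function change on top of the binary sketch (same five documents, same naive illustrative tokenizer):

```python
import re

# The five example documents from the article
docs = {
    1: "My dog doesn't like all my pets, but me and my wife do love them all. "
       "A dog cannot be a cat lover",
    2: "Cats and dogs do not like each other, but my cat does like dogs",
    3: "Walking a dog is a good start of the day",
    4: "Not all cats like walking, but some cats do",
    5: "All dogs like walking, but mine doesn't like. It's like so weird",
}

def tokenize(text):
    # naive tokenizer: lowercase, keep letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def tf(doc, term):
    # term frequency = plain count of the term's occurrences
    return tokenize(doc).count(term)

ranking = sorted(docs, key=lambda d: tf(docs[d], "like"), reverse=True)
print(ranking)  # → [5, 2, 1, 4, 3]
```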

But let's now try another query, 'my day':

1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover | tf = 3
2. Cats and dogs do not like each other, but my cat does like dogs | tf = 1
3. Walking a dog is a good start of the day | tf = 1
4. Not all cats like walking, but some cats do | tf = 0
5. All dogs like walking, but mine doesn't like. It's like so weird | tf = 0

Documents #1 and #2 get the highest positions, but is that correct? They say nothing about 'day', only about 'my': my pets, my wife, my cat. That may matter too given the query, but most likely it doesn't. Document #3 gets a lower position, while it seems to be the most relevant one. How can we rank it higher?

Let's think about what makes 'my' different from 'day' and other words. 'Day' seems much more specific, while 'my' can apply to virtually anything, and in the documents 'my' occurs much more frequently than 'day' and other more specific terms. Here comes 'document frequency', another important part of the formula and the second cornerstone of term weighting.

In 1972 Karen Spärck Jones, a British computer scientist, said in one of her academic papers: "the exhaustivity of a document description is the number of terms it contains, and the specificity of a term is the number of documents to which it pertains".

We've already covered the former; let's now see how the document order changes if we account for the latter.

Weight = term frequency * (number of documents / document frequency)

Query: 'my day'.

'My' occurs in 2 documents, 'day' in only one, so the IDF (inverse document frequency) of 'my' is 5 (the total number of documents) / 2 = 2.5, and for 'day' it's 5 / 1 = 5.

1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover | Weight = 3 (tf of 'my') * 2.5 (idf of 'my') + 0 (tf of 'day') * 5 (its idf) = 7.5
3. Walking a dog is a good start of the day | Weight = 0 * 2.5 + 1 * 5 = 5
2. Cats and dogs do not like each other, but my cat does like dogs | Weight = 2.5
4. Not all cats like walking, but some cats do | Weight = 0
5. All dogs like walking, but mine doesn't like. It's like so weird | Weight = 0
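These calculations are easy to verify with a short Python sketch (same documents and naive tokenizer as before):

```python
import re

# The five example documents from the article
docs = {
    1: "My dog doesn't like all my pets, but me and my wife do love them all. "
       "A dog cannot be a cat lover",
    2: "Cats and dogs do not like each other, but my cat does like dogs",
    3: "Walking a dog is a good start of the day",
    4: "Not all cats like walking, but some cats do",
    5: "All dogs like walking, but mine doesn't like. It's like so weird",
}

def tokenize(text):
    # naive tokenizer: lowercase, keep letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def tf(doc, term):
    # term frequency = plain count of the term's occurrences
    return tokenize(doc).count(term)

def idf(term):
    # number of documents / number of documents containing the term
    df = sum(1 for doc in docs.values() if term in tokenize(doc))
    return len(docs) / df

def weight(doc, query):
    # sum of tf * idf over the query terms
    return sum(tf(doc, term) * idf(term) for term in query.split())

ranking = sorted(docs, key=lambda d: weight(docs[d], "my day"), reverse=True)
print(ranking)  # → [1, 3, 2, 4, 5]
```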

With the help of IDF we've moved the relevant document 'Walking a dog is a good start of the day' one position higher, but it's still not at the very top. Why? Because 'my' occurs so often that it outweighs the importance of the IDF of 'day'. How can we solve that? We need something to dampen the contribution of TF. Over the years many variants of TF normalization have been suggested and researched, from a simple log(TF) to the so-called 'term frequency saturation curve'. For simplicity, let's take the logarithmic variant.

Weight = log(1+TF) * (N/df)

The whole idea of logarithmic smoothing, compared to a linear function, is that the higher the argument, the slower it grows: it increases quickly for small x and slowly for large x. The '1 + x' in the function is needed to avoid a zero argument when the term frequency is 0, since log(0) is undefined, while log(0 + 1) = 0, which is exactly what we need for tf = 0.

Let's see if it solves the problem:

Query: 'my day'.

3. Walking a dog is a good start of the day | Weight = log(1+0) * 2.5 + log(1+1) * 5 = 0 + 3.47 = 3.47
1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover | Weight = log(1+3) * 2.5 + log(1+0) * 5 = 3.47 + 0 = 3.47
2. Cats and dogs do not like each other, but my cat does like dogs | Weight = 1.73
4. Not all cats like walking, but some cats do | Weight = 0
5. All dogs like walking, but mine doesn't like. It's like so weird | Weight = 0
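The log-smoothed variant is a one-line change to the weight function (the natural logarithm is used, matching the 3.47 values above):

```python
import re
from math import log

# The five example documents from the article
docs = {
    1: "My dog doesn't like all my pets, but me and my wife do love them all. "
       "A dog cannot be a cat lover",
    2: "Cats and dogs do not like each other, but my cat does like dogs",
    3: "Walking a dog is a good start of the day",
    4: "Not all cats like walking, but some cats do",
    5: "All dogs like walking, but mine doesn't like. It's like so weird",
}

def tokenize(text):
    # naive tokenizer: lowercase, keep letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def tf(doc, term):
    return tokenize(doc).count(term)

def idf(term):
    df = sum(1 for doc in docs.values() if term in tokenize(doc))
    return len(docs) / df

def weight(doc, query):
    # log(1 + tf) dampens the contribution of very frequent terms
    return sum(log(1 + tf(doc, term)) * idf(term) for term in query.split())

w1 = weight(docs[1], "my day")  # log(4) * 2.5 ≈ 3.47
w3 = weight(docs[3], "my day")  # log(2) * 5   ≈ 3.47
```

Interestingly, the tie is exact here: log(4) * 2.5 = 2 log(2) * 2.5 = 5 log(2) = log(2) * 5.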

As we can see, the most relevant document is finally in 1st position, but it shares it with another document with the same weight of 3.47. Is that right or wrong? It's a hard question, since we never know for sure what's relevant to the requester. There's a chance that, asking for 'my day', they would consider the second document relevant too, since it gives a number of facts for 'my', while the top document gives only one fact for 'day'. So it may well be quite fair that they get the same weight.

It's probably because of this uncertainty that search engines typically return not a single result, but some number of them: 10, 20, or so, letting the user make the final decision.

Conclusion

In this article I showed, in a nutshell, the evolution of thinking behind the TF-IDF weighting formula, which demonstrates why it works and why each of its components is important.

There's a successor of TF-IDF called BM25, which is based on the previously mentioned 'term frequency saturation curve' in particular and on more probabilistic approaches to defining the weighting formula in general (remember again, we never know what's really relevant, hence the probabilistic approach). In essence it's still TF-IDF (plus normalization by document length), but its roots are completely different. It will be covered in another article. Stay with us and subscribe to our newsletter so you don't miss it.