Note that in this example we are only using the paper's title for simplicity. In the implementation below we'll also add in the abstract to generate more accurate recommendations.

Slide 1

On the left, we define the matrix with columns as the titles and rows defined as all words encountered in the three titles. If a word appears in a title, we place a 1 next to that word. This will be the input matrix to our hashing function.

Notice our first record in the signature matrix on the right is simply the current row number where we first find a 1. See the next slide for a more detailed explanation.

Slide 2

We perform the first real row permutation. Notice the unigram words have been reordered. Again, we also have the signature matrix on the right where we'll record the result of the permutations.

In each permutation, we are recording the row number of the first word appears in the title, i.e. the row with the first 1. In this permutation, Title 1's and Title 2's first unigram are both in row 1, so a 1 is recorded in row 1 of the signature matrix under Title 1 and 2. For Title 3, the first unigram is in row 5, so 5 is recorded in the first row of the signature matrix under Title 3's column.

Slides 3-4

We are essentially doing the same operations as Slide 2, but we are creating a new row in the signature matrix for every permutation and putting the result there.

So far we've done three permutations, and we could actually use the signature matrix to calculate the similarity between paper titles. Notice how we have compressed the rows from 15 in the shingle matrix, to 3 in the signature matrix.

After the MinHash procedure, each conference paper will be represented by a MinHash signature where the number of rows is now much less than the number of rows in the original shingle matrix. This is due to us capturing the signatures by performing row permutations. You should need far fewer permutations than actual shingles.

Now if we think about all words that could appear in all paper titles and abstracts, we have potentially millions of rows. By generating signatures through row permutations, we can effectively reduce the rows from millions to hundreds without loss of the ability to calculate similarity scores.

Obviously, the more you permute the rows, the longer the signature will be. This provides us with the end goal where similar conference papers have similar signatures. In fact, you can observe the following:

$$P(h(\mbox{Set 1})) = P(h(\mbox{Set 2})) = \mbox{jaccard_sim} (\mbox{Set 1, Set 2})\mbox{,}$$

$$\mbox{ where h is the MinHash of the set}$$

In order to be more efficient, random hash functions are used instead of random permutations of rows when done in practice.