I’d try to explain LSH with help of python code and map-reduce technique.

It is said that There is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed. [Chapter 3, 3.3.3 Mining of massive datasets]

Jaccard similarity

Where a and b are sets.

J = 0 if A and B are disjoint

J = 1 if A and B are identical

example,

>>> a = {'nike', 'running', 'shoe'} >>> b = {'nike', 'black', 'running', 'shoe'} >>> c = {'nike', 'blue', 'jacket'} >>> float(len(a.intersection(b))) / len(a.union(b)) 0.75 # a and b are similar. >>> float(len(a.intersection(c))) / len(a.union(c)) 0.2 # a and c are... meh..

Minhashing

Probability of collision is higher for similar sets.

Table 1: Matrix representation of sets

keyword x a b c nike 1 1 1 1 running 2 1 1 0 shoe 3 1 1 0 black 4 0 1 0 blue 5 0 0 1 jacket 6 0 0 1

Table 2: Signature Matrix with hash values

Hash Function a b c h1(x) = x + 1 mod 6 min(2,3,4) min(2,3,4,5) min(2,0,1) h2(x) = 3x + 1 mod 6 min(4,1,4) min(4,1,4,1) min(4,4,1)

which becomes,

Table 3: Signature matrix with minhash values

Hash Function a b c h1(x) = x + 1 mod 6 2 2 0 h2(x) = 3x + 1 mod 6 1 1 1

From Table 3 We can infer that set a and b are similar.

Similarity of a and b from Table 1 is 3/4 = 0.75

From signature matrix Table 3 similarity of a and b is 2/2 = 1

The fraction from signature matrix Table 3 is just an estimate of the true jaccard similarity. on a larger set the estimates will be close.

Map-Reduce

Mapper

sample_dict.txt will have word to id mapping.

for every line in input file split text and convert to array of ids using the word to id mapping file. for every id compute minimum hash value split the array of min hash values into multiple equally sized chunks a.k.a, bands. assign id to bands and emit hash of band, band-id and doc-id



Reducer

group by band-hash and band-id to get list of similar doc-ids.

Mapper Code

# lsh_mapper.py __author__ = 'raj' import sys from random import randrange word_ids = dict() num_hashes = 10 num_per_band = 2 # a_hash and b_hash cannot be generated on the fly if running in a distributed env. they should be same across all nodes a_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)] b_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)] def min_hash_fn(a, b, sig): hashes = [((a * x) + b) % len(word_ids) for x in sig] return min(hashes) def get_min_hash_row(sig): hashes = [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)] return hashes def get_band(l, n): for i in xrange(0, len(l), n): yield frozenset(l[i:i+n]) for word, wid in map(lambda x: x.split(), open("sample_dict.txt").readlines()): word_ids[word] = int(wid) for doc_id, doc in enumerate(sys.stdin): words = doc.strip().lower().split() signature = map(lambda x: word_ids.get(x), words) signature = filter(lambda x: x is not None, signature) min_hash_row = get_min_hash_row(signature) banded = get_band(min_hash_row, num_per_band) for band_id, band in enumerate(banded): print "%d\t%d\t%d" % (band_id, hash(band), doc_id)

Reducer Code

#lsh_reducre.py __author__ = 'raj' import sys prev_band_id, prev_band_hash = None, None cluster = [] cid = 0 for line in sys.stdin: band_id, band_hash, doc_id = line.strip().split("\t", 3) if prev_band_id is None and prev_band_hash is None: prev_band_id, prev_band_hash = band_id, band_hash if prev_band_id is band_id: if prev_band_hash == band_hash: cluster.append(doc_id) else: print cid, cluster cluster = [doc_id] else: print cid, cluster cluster = [doc_id] cid += 1 prev_band_id, prev_band_hash = band_id, band_hash

In action

sample_input.txt

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) SLATE BLUE Nike Airmax Running SHoe Corduroy Shorts - Flat Front (For Men) BEIGE Nokia Lumia 721 Corduroy Shorts - Flat Front (For Men) BROWN

sample_dict.txt

& 1 (for 2 -0 3 1-14 4 12- 5 14 6 2-piece 7 721 8 airmax 9 and 10 beige 11 blue 12 brown 13 corduroy 14 corduroys 15 denim 16 doll 17 dot 18 dress 19 fashion 20 flat 21 flower 22 front 23 inch 24 jumper 25 leggings 26 lumia 27 me 28 men) 29 nike 30 nokia 31 outfit 32 piece 33 pink 34 polka 35 running 36 shirt 37 shoe 38 shorts 39 slate 40 teal 41 top 42 white 43 with 44 you 45 - 46

Command

$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py

Output

0 ['1', '2'] 0 ['0'] 0 ['5', '7'] 0 ['6'] 0 ['3'] 0 ['4'] 1 ['4'] 1 ['6'] 1 ['0'] 1 ['2'] 1 ['3', '5', '7'] 1 ['1'] 2 ['6'] 2 ['4'] 2 ['0', '1', '2'] 2 ['3', '5', '7'] 3 ['6'] 3 ['3', '5', '7'] 3 ['0', '1', '2'] 3 ['4'] 4 ['0', '1'] 4 ['3'] 4 ['5'] 4 ['4'] 4 ['2'] 4 ['7']

resolved output

band 0 ------ You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 1 ------ Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 2 ------ You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 3 ------ Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink band 4 ------ You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt

code here

13.081604 80.275183