I recently got my hands on the common-crawl URL dataset from here and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al’s log-log paper.

The algorithm operates on a stream and produces an estimate for the cardinality of the input stream. Here’s the algorithm itself:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ( ns probabilistic-counting.log-log "The LogLog algorithnm" ( :use [ incanter.core ]) ( :import [ org.apache.mahout.math MurmurHash ] [ org.apache.commons.lang3 SerializationUtils ])) ( defn rho "Number of leading zeros in the bit-representation Args: y : the number itself size : optional : the number of bytes used to represent the number. Default: 4 bytes/32 bits" ([ y ] ( rho y 32 )) ([ y size ] ( int ( - size ( Math/ceil ( / ( Math/log ( + y 1 )) ( Math/log 2 ))))))) ( defn alpha [ m ] ( if ( < 64 m ) 0.79402 ( * 2 ( Math/pow ( * ( gamma ( - ( / 1 m ))) ( / ( - 1 ( Math/pow 2 ( / 1 m ))) ( Math/log 2 ))) ( - m ))))) ( defn log-log [ xs k ] ( let [ m ( int ( Math/pow 2 k )) buckets ( make-array Integer/TYPE m )] ( do ( map # ( aset buckets % 0 ) ( range m )) ( doseq [ x xs ] ( let [ h ( int ( MurmurHash/hash ( SerializationUtils/serialize x ) 1991 )) idx ( bit-and h ( - m 1 )) val ( max ( aget buckets idx ) ( rho h ))] ( aset buckets idx val ))) ( * ( Math/pow 2 ( / ( apply + buckets ) m )) m ( alpha m )))))

And here’s a test (actual cardinality = 1,000,000):

1 2 3 ( defn demo-log-log [] ( log-log ( map # ( mod % 1000000 ) ( range 10000000 )) 10 ))

When run, we get:

user> (demo-log-log) 1023513.5806580923

which is an error of 2.3%.

Now, let us deploy it on the URL dataset. There are 2,412,755,840 URLs in the format: host_reversed/path/scheme . The following piece of code constructs the host stream and estimates the cardinality.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ( ns probabilistic-counting.demo-urls "Estimate the number of unique domains in the Common crawl dataset" ( :use [ probabilistic-counting.log-log ])) ( defn urls-stream [ url-file ] ( -> url-file clojure.java.io/reader line-seq )) ( defn hosts-stream [ url-stream ] ( map # ( first ( clojure.string/split % # "/" )) url-stream )) ( defn count-hosts [ a-hosts-stream ] ( log-log a-hosts-stream 10 )) ( defn -main [ & args ] ( let [ path ( first args ) num-urls ( when ( second args ) ( -> args second Integer/parseInt ))] ( if num-urls ( ->> path urls-stream hosts-stream ( take num-urls ) count-hosts ) ( ->> path urls-stream hosts-stream count-hosts ))))

On a portion of the dataset, these are the results:

Using standard unix tools ( uniq works here without a sort because the dataset is already sorted by hostname):

➜ probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 50000 | cut -d "/" -f 1 | uniq | wc -l 9205 ➜ probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 500000 | cut -d "/" -f 1 | uniq | wc -l 22164 ➜ probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 5000000 | cut -d "/" -f 1 | uniq | wc -l 196318 ➜ probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 50000000 | cut -d "/" -f 1 | uniq | wc -l 1525445

And log-log produces:

➜ probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 50000 9677.974613035705 ➜ probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 500000 22708.710878857888 ➜ probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 5000000 194919.74158794453 ➜ probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 5000000 1602155.9911824786

And the accuracy is really very good.

This algorithm works since:

$ \rho(x) $ computes the position of the LSB in $ x $.

The probability that $ \rho(x) $ is $ k $ is $ \frac{1}{2^k} $ (the probability of obtaining a sequence of $ k – 1 $ zeroes and a one).

Say, the real cardinality is $ n $. Thus, on $\frac{n}{2^k}$ members of this set, $ \rho(x) $ will yield a value of $ k $.

As a result, if you can drive $ 2^k $ as close to $ n $ as possible, you have a good estimation of cardinality. This is achieved by maximizing $ k $.

And thus, $ \arg\max_{x \in M} \rho(x) $ is an estimator (albeit biased) for $ \log(n) $.

This is one of those extremely sweet papers where the idea (minus the details) fits on a business card and the resulting algorithm has immense practical value.

The full source code is available in this github repository.