$\begingroup$

While we usually use large hashes, e.g. 64-bit, there are many techniques to reduce this size, e.g. to save storage and transmission.

The popular Bloom filter, instead of marking just one hash position in a bit table (indicating that an element is in our database), marks $k$ random positions ($k$ hashes). It finally requires $\lg(e)\cdot n \lg(1/p_f)$ bits for an $n$-element database and false positive probability $p_f$ - the probability that it will accidentally answer that a given element is in the database while it is not there. For example, for $p_f=10^{-6}$ we need $\approx 29$ bits/hash this way.

We can reduce it further to $n \lg(n/p_f)-\lg(n!)\approx n(\lg(1/p_f)+\lg(e))$ bits by using a single bit table (of size $n/p_f$ for $n$ elements and false positive probability $p_f$), but storing the $n$ values without information about their order - saving $\lg(n!)$ bits. It can be done e.g. by entropy coding of such a bit table, and there are also nice approximations. This way we need $\approx 22$ bits/hash for $p_f=10^{-6}$.
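As a quick sanity check of both formulas, here is a minimal Python sketch reproducing the $\approx 29$ and $\approx 22$ bits/hash figures above:

```python
from math import e, log2

pf = 1e-6  # target false positive probability

# Bloom filter: lg(e) * lg(1/pf) bits per stored element
bloom_bits = log2(e) * log2(1 / pf)

# Single compressed bit table (order information subtracted):
# lg(1/pf) + lg(e) bits per element
compressed_bits = log2(1 / pf) + log2(e)

print(f"Bloom filter:         {bloom_bits:.2f} bits/hash")       # ~28.8
print(f"compressed bit table: {compressed_bits:.2f} bits/hash")  # ~21.4
```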

But what if we don't need to worry about false positives - when we are certain that queries will only concern objects from our database, and we expect the structure to return the ID/position for a given hash?

I was able (paper, slides) to get asymptotically to $(3/2+\gamma \lg(e)) \approx 2.33275$ bits/hash for this case, and would like to ask: can it be further improved?

Specifically, to get such a low number we need to compress the minimal prefix tree (trie) obtained for all our hashes - going down only to the first node that distinguishes a given hash/leaf in our database.

Assuming the hashes are $Pr(0)=Pr(1)=1/2$ random uncorrelated bit sequences (i.i.d.), we can calculate the entropy as the number of bits required to choose how many leaves are to the left of the root ($k$ out of $n$, with probability $p_{kn}={n \choose k}/2^n$), and then recurse into both subtrees:

$$H_n=\sum_{k=0}^n p_{kn} \left( \lg\left(1/p_{kn}\right)+H_k+H_{n-k}\right)\to \approx 2.77544 n$$

We can reduce it further by not writing the direction in internal degree-1 nodes - it isn't required to distinguish elements - saving $1$ bit per degree-1 node and giving $\approx 2.33275$ bits/hash. This can be approached by compressing the tree, e.g. using the recurrence above with the two degree-1 possibilities combined into a single symbol.
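To make this concrete, here is a minimal Python sketch (function names are mine) evaluating the recurrence numerically. The $k=0$ and $k=n$ terms contain $H_n$ itself, so each step solves a small linear equation; with `merge_degree1=True` the two degree-1 cases are encoded as a single symbol, i.e. 1 bit is saved per degree-1 node:

```python
from math import log2

def binom_probs(n):
    """p[k] = C(n, k) / 2^n via the multiplicative recurrence (floats suffice here)."""
    p = [0.0] * (n + 1)
    p[0] = 0.5 ** n
    for k in range(1, n + 1):
        p[k] = p[k - 1] * (n - k + 1) / k
    return p

def bits_per_element(n_max, merge_degree1=False):
    """H_{n_max} / n_max from the recurrence in the text."""
    H = [0.0, 0.0]  # H_0 = H_1 = 0: nothing to distinguish
    for n in range(2, n_max + 1):
        p = binom_probs(n)
        # k = 1 .. n-1: proper splits, recursing into both subtrees
        s = sum(p[k] * (-log2(p[k]) + H[k] + H[n - k]) for k in range(1, n))
        # k = 0 and k = n: degree-1 node, all n hashes continue below it
        q = 2.0 * p[0]
        cost = (n - 1) if merge_degree1 else n   # merged symbol saves 1 bit
        H.append((s + q * cost) / (1.0 - q))     # solve H_n = s + q*(cost + H_n)
    return H[n_max] / n_max

print(bits_per_element(500))                      # tends toward ~2.775
print(bits_per_element(500, merge_degree1=True))  # tends toward ~2.333
```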

Some examples (from a table not reproduced here): $F_n$ denotes the false positive probability - that a random value will reach a leaf, the index $d$ means expanding to at least depth $d$, and $D$ is the average depth of all leaves.

It is a nice example of how "subtracting the order" ($\lg(n!)$ bits) can take us from an $n \lg n$ to a linear requirement - while the length of a single label has to grow with the logarithm of the database size ($D_n\approx \lg(n)+1.33275$), by compressing all labels together we can use asymptotically a constant number of bits/element.

My derivation assumed $Pr(0)=Pr(1)=1/2$ i.i.d. hash sequences - which seems the natural choice as they contain the maximal amount of information, but technically a proof is still needed that a different choice cannot improve it (and the analysis becomes much more difficult) ... can this $\approx 2.33275$ bits/object "distinguishability constant" be further reduced?

Update: let's formalize the problem: imagine you get $n$ random, potentially infinite bit sequences (hash values) - what is the minimal expected size of a structure that determines some injection from this set of hashes to $\{1,\ldots,n\}$?

Looking at this formulation, I have just got down from ~2.33 to ~1.44 bits/element; however, the current algorithm is absolutely impractical.

So while the above construction used only prefixes, it might turn out to be more efficient to use other positions of the obtained hashes. A more general approach might be:

1. Based on the obtained hashes, choose better (more distinguishing) positions and encode these positions.
2. For the chosen positions, build a minimal distinguishing tree as above.

Previously we had a trivial step 1 (just prefixes); now let's look at the approach with a trivial step 2: choose positions such that we get a complete binary tree (constant depth, assume $n=2^m$), so that we don't need to encode the tree at all.

The first position should have one half 0s and one half 1s; the probability of finding such a position is ${n \choose n/2}/2^n$. The second should have half 0s and half 1s within both of the previous subsets, and so on - finally, the bits at the positions chosen this way uniquely determine all $n$ hashes.

We can encode the distance to each successive such position - it has geometric distribution $(1-p)^k p$, with entropy $h(p)=(-p \lg(p)-(1-p)\lg(1-p))/p$.

Finally, summing the cost of writing the positions that halve all the previous subsets, we get (for $n=2^m$):

$$\sum_{i=1}^m h\left(\left({2^i \choose 2^{i-1}}/2^{(2^i)}\right)^{(2^{m-i})} \right) \to n\lg(e)$$

Numerics suggest that it quickly approaches $\lg(e)n\approx 1.4427 n$ bits (?)
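A numerical sketch of this sum (the helper names are mine; $h$ is evaluated from $L=\lg(1/p)$ because the probabilities at the first levels are astronomically small):

```python
from math import e, lgamma, log, log2

LG_E = log2(e)  # ~1.442695

def lg_binom(n, k):
    """log2 of C(n, k), via lgamma (accurate enough for this check)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def h_from_L(L):
    """h(p) for p = 2^{-L}; for tiny p use the expansion h(p) ~ lg(1/p) + lg(e)."""
    if L > 50:
        return L + LG_E
    p = 2.0 ** (-L)
    return (-p * log2(p) - (1 - p) * log2(1 - p)) / p

def total_bits(m):
    """Sum over i = 1..m of h(p_i), p_i = (C(2^i, 2^{i-1}) / 2^{2^i})^{2^{m-i}}."""
    total = 0.0
    for i in range(1, m + 1):
        # lg(1/p_i) = 2^{m-i} * (2^i - lg C(2^i, 2^{i-1}))
        L = 2 ** (m - i) * (2 ** i - lg_binom(2 ** i, 2 ** (i - 1)))
        total += h_from_L(L)
    return total

for m in (4, 8, 12, 16, 20):
    n = 2 ** m
    print(m, total_bits(m) / n)   # appears to decrease toward lg(e) ~ 1.4427
```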

While it is completely impractical - requiring testing an exponential number of positions - there might be some practical intermediate algorithms using both steps 1 and 2. It might be a valuable minimum description length lesson - of choosing features while still taking the cost of their description into account.

There remains the question of whether these 1.4427 bits/element are optimal (now I doubt it).

To enforce the previous 2.33275 bits/element, we might require that the original lexicographic order is maintained.

Update: Instead of pointing at $m$ interesting positions as above, it was suggested to me to build a tree of positions to consider, which means replacing $h(p^k)$ with $k\,h(p)$ in the above formula. While this is much closer to a practical algorithm, it has turned out to give 2.544 bits/element instead.
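Under the reading that each of the $2^{m-i}$ subsets at level $i$ encodes the distance to its position separately (so $h(p^k)$ becomes $k\,h(p)$), a quick numerical check of the 2.544 figure:

```python
from math import comb, log2

def h(p):
    """Entropy of the geometric distribution: bits to encode one position."""
    return (-p * log2(p) - (1 - p) * log2(1 - p)) / p

def bits_per_element_tree(m):
    n = 2 ** m
    # at level i there are 2^(m-i) subsets of size 2^i, each halved with
    # probability q_i = C(2^i, 2^(i-1)) / 2^(2^i)
    total = sum(2 ** (m - i) * h(comb(2 ** i, 2 ** (i - 1)) / 2 ** (2 ** i))
                for i in range(1, m + 1))
    return total / n

for m in (4, 8, 12, 16):
    print(m, bits_per_element_tree(m))   # appears to approach ~2.544
```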

To see why the previous sum approaches $\lg(e)\approx 1.44$ bits/element, imagine that we analogously search for just $m$ successive positions such that the $n=2^m$ hashes restricted to these positions are all different (realizing all $2^m$ length-$m$ possibilities). The probability of this is $p=n!/n^n\approx e^{-n}$. For small $p$ we have $h(p)\approx -\lg(p)\approx n \lg(e)$.
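For completeness, the Stirling step behind the last approximation:

$$\lg(1/p)=n\lg(n)-\lg(n!)=n\lg(e)-\tfrac{1}{2}\lg(2\pi n)+O(1/n),$$

so $h(p)/n\approx \lg(1/p)/n\to\lg(e)$.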