Binary search

You probably know this story. An Indian king summons the inventor of chess. "I really like your game and I want to reward you. What do you want as a reward?"

"Not much", replies the inventor. "Just put a single grain of rice on the first cell of the chessboard. Put two grains on the second one. Then put four on the third. Then eight on the fourth."

"Go on", says the king.

"Then 16 on the 5th. Then 32 on the 6th. Then 64 on the 7th. Then 128 on the 8th. Then 256 on the 9th. Then 512 on the 10th. Then 1024 on the 11th. Then 2048 on the 12th. Then 4096 on the 13th. Then 8192 on the 14th."

"Please, continue", says the king.

"Then 16,384 on the 15th. Then 32,768 on the 16th. Then 65,536 on the 17th. Then 131,072 on the 18th. Then 262,144 on the 19th. Then 524,288 on the 20th. Then 1,048,576 on the 21st. Then 2,097,152 on the 22nd. Then 4,194,304 on the 23rd."

"Aha."

"Then 8,388,608 on the 24th. Then 16,777,216 on the 25th. Then 33,554,432 on the 26th. Then 67,108,864 on the 27th. Then 134,217,728 on the 28th. Then 268,435,456 on the 29th. Then 536,870,912 on the 30th. Then 1,073,741,824 on the 31st."

"Ok, and then?"

"Then 2,147,483,648 on the 32nd. Then 4,294,967,296 on the 33rd. Then 8,589,934,592 on the 34th. Then 17,179,869,184 on the 35th. Then 34,359,738,368 on the 36th. Then 68,719,476,736 on the 37th. Then 137,438,953,472 on the 38th. Then 274,877,906,944 on the 39th. Then 549,755,813,888 on the 40th. Then 1,099,511,627,776 on the 41st. Then 2,199,023,255,552 on the 42nd."

"And then what?"

"Then 4,398,046,511,104 on the 43rd. Then 8,796,093,022,208 on the 44th. Then 17,592,186,044,416 on the 45th. Then 35,184,372,088,832 on the 46th. Then 70,368,744,177,664 on the 47th. Then 140,737,488,355,328 on the 48th. Then 281,474,976,710,656 on the 49th."

"I think I know where this is heading, but please, go on."

"Then 562,949,953,421,312 on the 50th. Then 1,125,899,906,842,624 on the 51st. Then 2,251,799,813,685,248 on the 52nd. Then 4,503,599,627,370,496 on the 53rd. Then 9,007,199,254,740,992 on the 54th. Then 18,014,398,509,481,984 on the 55th. Then 36,028,797,018,963,968 on the 56th."

"And?"

"And then 72,057,594,037,927,936 on the 57th. Then 144,115,188,075,855,872 on the 58th. Then 288,230,376,151,711,744 on the 59th. Then 576,460,752,303,423,488 on the 60th. Then 1,152,921,504,606,846,976 on the 61st. Then 2,305,843,009,213,693,952 on the 62nd. Then 4,611,686,018,427,387,904 on the 63rd. And 9,223,372,036,854,775,808 on the 64th."

"Ok. Listen, I really like your game but this is just unreasonable. That's 18,446,744,073,709,551,615 grains of rice total. That's about 500,000,000,000 metric tons of rice. That's more rice than will be produced worldwide from now until the moment this story is published on that neat little website with movable thingies."

"Oh," says the inventor. "I didn't realize how much rice that is."

The moral of the story is, well, there are actually two equally important lessons. First, always read your requirements to the last page. Second, by multiplying things over and over again, you get from fairly small to unreasonably large numbers very fast. The good news is, this works the other way around too.

Binary search

Now imagine you have a database with 18,446,744,073,709,551,615 entries in it. You want to find a particular entry by its key. Looking through the entries one by one will take you up to 18,446,744,073,709,551,615 steps. Even if a single step is 1 nanosecond long, the whole operation takes about 584 years in the worst case.
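If you'd like to check the arithmetic, here's a small sketch. The 1-nanosecond step comes from the text; the 365.25-day year is my own assumption for the conversion.

```python
# Total entries: 1 + 2 + 4 + ... + 2**63, the same sum as the rice on the board.
total_entries = sum(2**cell for cell in range(64))

# Assumption for illustration: a year is 365.25 days long.
NANOSECONDS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1_000_000_000

# One step per entry, one nanosecond per step.
years = total_entries / NANOSECONDS_PER_YEAR

print(total_entries)  # 18446744073709551615
print(int(years))     # 584
```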

But let's say you can split the whole array in two and, for each half, tell whether the queried entry belongs to that half or not. How many steps will it take to find an entry then?

Yes, only 64, one for each cell of the chessboard.
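You don't have to take my word for it. Here's a minimal sketch that counts the halvings; `halving_steps` is a name I made up for this illustration, and it always keeps the larger half to stay on the pessimistic side.

```python
def halving_steps(entry_count):
    """Count how many times a range must be halved until one entry is left."""
    steps = 0
    while entry_count > 1:
        # Keep the larger half, so this is the worst case.
        entry_count = (entry_count + 1) // 2
        steps += 1
    return steps

print(halving_steps(18_446_744_073_709_551_615))  # 64
print(halving_steps(640))                         # 10
```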

Ok, let's be more specific. Let's say you have an array of 640 numbers sorted in ascending order. You want to see whether the number 480 is among them.

Step 1. If the range is only 1 element wide (obviously, that's not the case on the first iteration but we'll get there), check that element. If it is 480, then the start of the range is its index in the array. If it's not, then there is no 480 in the array at all. Either way, we're done.

Step 2. Split the range in half. Now you have two sorted arrays. On the first iteration, the first contains the 320 numbers at indices 0 to 319, and the second contains the 320 numbers at indices 320 to 639. 320 is the dividing index.

Step 3. Select the correct half-range. If the element pointed to by the dividing index is greater than 480, then 480 may only be in the first half-range. It may not be there either, but we can't tell yet. If the dividing element is less than or equal to 480, then 480 may only be in the second half-range.

Step 4. Run step 1 for the new range.

Eventually, and in fewer than a dozen iterations, you'll either find the element or establish that it's not there at all.
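The four steps above can be sketched in a few lines of Python. This is just one way to write it down, not the code behind the plot; the function name and the half-open range `numbers[start:end]` are my own choices for the illustration.

```python
def binary_search(numbers, target):
    """Return an index of target in the sorted list numbers, or None."""
    if not numbers:
        return None
    start, end = 0, len(numbers)  # the current range is numbers[start:end]
    while end - start > 1:
        # Step 2: split the range; divider starts the second half-range.
        divider = (start + end) // 2
        # Step 3: keep the half-range that may still contain the target.
        if numbers[divider] > target:
            end = divider
        else:
            start = divider
        # Step 4: go around again with the new range.
    # Step 1: the range is one element wide, so just check that element.
    return start if numbers[start] == target else None

# Made-up data: 640 even numbers, so 480 sits at index 240.
numbers = list(range(0, 1280, 2))
print(binary_search(numbers, 480))  # 240
print(binary_search(numbers, 481))  # None
```

With 640 elements the first dividing index is 320, just like in step 2.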

Here's an interactive plot for this very exercise. The numbers are random but they are generated according to one of the three distributions. The shade of gray corresponds to the iteration number. You can pick a number to search by clicking or tapping anywhere on the plot.

[Interactive plot. Controls: distribution (uniform, skewed, bi-skewed) and split proportion.]

As you may have noticed, there is one more thing here to play with: the proportion in which the ranges are split. The thing is, you don't necessarily have to split the ranges into two equal halves. Yes, for a uniform distribution equal halves are the best fixed proportion, but that isn't necessarily so for every possible range and every possible distribution.
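To play with the proportion without the plot, here's a rough sketch of the same search with a configurable split. The `proportion` parameter and the function name are my own inventions for this example; 0.5 gives the usual halving.

```python
def proportional_search(numbers, target, proportion=0.5):
    """Search a sorted list, splitting each range at the given proportion.

    Returns a pair: (index or None, number of iterations).
    """
    if not numbers:
        return None, 0
    start, end = 0, len(numbers)  # the current range is numbers[start:end]
    iterations = 0
    while end - start > 1:
        iterations += 1
        divider = start + int((end - start) * proportion)
        # Keep the divider strictly inside the range so it always shrinks.
        divider = min(max(divider, start + 1), end - 1)
        if numbers[divider] > target:
            end = divider
        else:
            start = divider
    return (start if numbers[start] == target else None), iterations

# Made-up data: the numbers 0..639, so 480 sits at index 480.
numbers = list(range(640))
for proportion in (0.25, 0.5, 0.75):
    print(proportion, proportional_search(numbers, 480, proportion))
```

Any proportion strictly between 0 and 1 still finds the element; only the number of iterations changes.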

Interpolation search

While we can all agree that 64 is less than 18,446,744,073,709,551,615, it still seems that the introduction story is a little too long.

We can choose a strategy for the range subdivision that will be more effective for specific distributions. For example, using linear interpolation instead of a fixed proportion for the dividing element index works wonders for the uniform distribution.

It can be written down as:

$$ i_{\text{div}} = \frac{i_1 \, (A[i_2] - x) + i_2 \, (x - A[i_1])}{A[i_2] - A[i_1]} $$

Here $i_{\text{div}}$ is the dividing index. It's the element that should start the second sub-range after the subdivision. The pair $(i_1, i_2)$ is the pair of indices bounding the range being divided on the current iteration. The $x$ stands for the element we're looking for, and $A$ is the array to search in.
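Here's a minimal sketch of how that formula could drive a search. It is not the code behind the plot. I'm assuming integer values (so the division can be done in whole numbers), treating both indices as inclusive, and clamping the dividing index so the range always shrinks; all names are made up for the illustration.

```python
def interpolation_search(numbers, target):
    """Search a sorted list of integers using linear interpolation.

    Returns a pair: (index or None, number of iterations).
    """
    if not numbers:
        return None, 0
    first, last = 0, len(numbers) - 1  # i_1 and i_2, both inclusive
    iterations = 0
    while first < last:
        if target < numbers[first] or target > numbers[last]:
            return None, iterations  # target lies outside the range
        if numbers[first] == numbers[last]:
            break  # all values equal; avoids dividing by zero
        iterations += 1
        # The formula from the text, rounded down to a whole index.
        divider = (first * (numbers[last] - target)
                   + last * (target - numbers[first])) \
            // (numbers[last] - numbers[first])
        # Clamp so that both sub-ranges are smaller than the current one.
        divider = min(max(divider, first + 1), last)
        if numbers[divider] > target:
            last = divider - 1
        else:
            first = divider  # divider starts the second sub-range
    return (first if numbers[first] == target else None), iterations

# Made-up data: 640 evenly spaced numbers, so 480 sits at index 240.
numbers = list(range(0, 1280, 2))
print(interpolation_search(numbers, 480))  # (240, 2)
```

On evenly spaced data like this, the very first guess lands on the right index, and the second iteration only confirms it.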

Here's an interactive plot that shows how it works.

[Interactive plot. Control: distribution (uniform, skewed, bi-skewed).]

Note that while it works well on uniform(-ish) distributions, the more uneven the distribution becomes, the more iterations it takes to find the element. That's the downside of linear interpolation.
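To see the effect without the plot, here's a small sketch that only counts the iterations. The helper, the integer-only arithmetic, and both data sets are assumptions made up for this example: one list is evenly spaced, the other is made of cubes, so its values crowd near the start.

```python
def count_iterations(numbers, target):
    """Count linear-interpolation steps on a sorted list of integers."""
    first, last = 0, len(numbers) - 1  # both indices inclusive
    iterations = 0
    while first < last and numbers[first] <= target <= numbers[last]:
        if numbers[first] == numbers[last]:
            break  # all values equal; avoids dividing by zero
        iterations += 1
        span = numbers[last] - numbers[first]
        divider = (first * (numbers[last] - target)
                   + last * (target - numbers[first])) // span
        # Clamp so the range always shrinks.
        divider = min(max(divider, first + 1), last)
        if numbers[divider] > target:
            last = divider - 1
        else:
            first = divider
    return iterations

uniform = list(range(0, 1280, 2))    # evenly spaced values
skewed = [i**3 for i in range(640)]  # values crowd near the start

uniform_steps = count_iterations(uniform, 480)  # 480 sits at index 240
skewed_steps = count_iterations(skewed, 1000)   # 1000 sits at index 10
print(uniform_steps, skewed_steps)
```

On the skewed list every guess lands at the very start of the range, so the search crawls forward one element at a time and needs noticeably more iterations than on the uniform list.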

Of course, the interpolation may be non-linear too. I remember a very clever algorithm that effectively predicts the proportion for every sub-range based on an approximating polynomial of arbitrary degree. Unfortunately, I don't remember the algorithm itself. If you know what I am talking about, please let me know.