This question was once used as a Microsoft interview question to test knowledge of algorithms. I’ve re-characterized the puzzle in terms of a situation involving an Excel spreadsheet.



The puzzle

Your boss sends you a spreadsheet with a list of tasks. In one column your boss has written the task, and in the column next to it he has ranked each task in importance from 1 to 100. The list is unordered.

Your boss, however, is incompetent, and sends you a list with only 99 items, so you know one of the items is missing.

How can you quickly find out which task number is missing?

There are many ways to do this. The puzzle is to come up with an efficient algorithm.

Bonus: What if there are 2 tasks missing?


Answer to Find the Missing Numbers

A brute force method is to search the list to see whether it contains task 1, then task 2, and so on. This is obviously inefficient, as you have to keep looping through the list.
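As a rough illustration outside of Excel, here is a minimal Python sketch of the brute force idea. The function name and the sample data (a shuffled list with task 37 removed) are made up for this example.

```python
import random


def find_missing_brute_force(task_numbers, n):
    """Check each candidate 1..n against the list, one at a time."""
    for candidate in range(1, n + 1):
        # `in` on a list scans it from the start, so the list is
        # looped over again for every candidate.
        if candidate not in task_numbers:
            return candidate
    return None  # nothing is missing


# Hypothetical data: tasks 1..100 with 37 removed, in random order.
tasks = [t for t in range(1, 101) if t != 37]
random.Random(0).shuffle(tasks)
print(find_missing_brute_force(tasks, 100))  # 37
```

Each of the up to n candidates can trigger a full pass over the list, which is why the work grows roughly with the square of the list size.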

A much better method is that you sort the list from 1 to 100 and quickly scan to see which number is missing (in Excel it is very easy to sort data).
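The sort-and-scan idea can also be sketched in Python. This assumes exactly one number from 1 to n is missing; the function name is hypothetical.

```python
def find_missing_by_sorting(task_numbers):
    """Sort, then find the first position where the value is not the rank."""
    ordered = sorted(task_numbers)
    for position, value in enumerate(ordered, start=1):
        # Before the gap, the k-th smallest value equals k.
        if value != position:
            return position
    # No mismatch: the largest number is the one that is missing.
    return len(ordered) + 1


print(find_missing_by_sorting([3, 1, 5, 4]))  # 2
```

The scan itself is a single pass, so the cost is dominated by the sort.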

However, there are two problems with this method. First, it becomes impractical as the list gets bigger: you might be able to eyeball over 100 items, but you can’t really scan 1,000 items very efficiently. Second, you are limited by how quickly Excel can sort the list**. For these reasons, sorting is not the best idea.

**This is a technical point which I will elaborate on. Computer scientists quantify the amount of time an algorithm will take to run, which is called a time complexity analysis. The time is related to the size of the data input n. For instance, let’s say that I give you a list of n numbers and the job is for you to tell me the first number in the list. The time it takes only depends on how quickly you can access the first element; it doesn’t matter if the list is one number or a million numbers. So this task can be done in a constant amount of time, regardless of how big the input n is. On the other hand, let’s say I ask you to find the largest item in the list. Now clearly this task will take longer, and it will also take longer as the list gets larger. In fact, the time it takes will be directly proportional to the list size. This is an example of linear time. If I ask you to sort the list in ascending order, how long will that take? That’s actually a somewhat loaded question, as there are many ways you can sort the list (there’s a neat video of this). Regardless of the method, sorting is generally done in a time that is slightly worse than linear time. There is a shorthand for various time classifications called big O notation. Constant time is denoted as O(1), linear time is O(n), and sorting is linearithmic time O(n log n).

So where does that leave us? How can we find the missing task number quickly?

There is a very neat algorithm that will find us the solution in linear time, and it can be performed using a single formula. And if that doesn’t get you excited, I don’t know what will! (In our big O notation, this algorithm works in O(n) time and it will use a constant amount of memory, or O(1) space.)

The trick is to use a famous formula for the sum of the first n positive integers. The numbers 1 to n sum to n(n + 1)/2.

So here’s the algorithm: we will write the formula

= 100(100+1)/2 – SUM(task numbers) = 5050 – SUM(task numbers)

Why does this formula work? Let’s say the missing number is X. Then SUM(task numbers) equals the sum of the numbers from 1 to 100 excluding the missing task number. So we have SUM(task numbers) = 5,050 – X. Therefore, 5,050 – SUM(task numbers) = X, so we retrieve exactly the missing number!

For a list size of numbers 1 to n, we would have the formula

= n(n+1)/2 – SUM(task numbers)

This task will run as quickly as it takes to sum up n numbers, and that is clearly a solution in linear time. (Furthermore, you only need a constant amount of memory to sum up n numbers.)
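For readers who want to try this outside of Excel, the same formula can be sketched in a few lines of Python. The function name and sample data are made up for this example.

```python
def find_missing_by_sum(task_numbers, n):
    """Return n(n + 1)/2 minus the sum of the list."""
    expected_total = n * (n + 1) // 2  # 5050 when n = 100
    # One pass over the list and a single running total:
    # O(n) time, O(1) extra memory.
    return expected_total - sum(task_numbers)


# Hypothetical data: tasks 1..100 with 37 removed.
tasks = [t for t in range(1, 101) if t != 37]
print(find_missing_by_sum(tasks, 100))  # 37
```

The order of the list does not matter here, because addition gives the same total in any order.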

Bonus: what if 2 numbers are missing?

This is a great variation I found in a Stack Overflow discussion.

A first approach is to find the sum of the two missing numbers, using the above method, and to find the product of the two missing numbers, since we know the product of the first n numbers is n factorial. Then you’ll have 2 equations in 2 variables.

Sum of missing: = n(n + 1)/2 – SUM(list)

Product of missing: = n! / PRODUCT(list)

This is a very bad approach because the product n! gets astronomical even for a modestly sized list of 100 items. Plus, you still have to solve for the 2 variables, and that’s going to take more time and memory.
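To get a feel for the sizes involved, here is a small Python sketch. It assumes the product of the missing pair is recovered by dividing n! by the product of the list, and the missing pair (12 and 45) is made up for this example. Python integers have arbitrary precision, so the arithmetic works here, but the intermediate values are enormous.

```python
import math

n = 100
# Hypothetical data: tasks 1..100 with 12 and 45 removed.
tasks = [t for t in range(1, n + 1) if t not in (12, 45)]

# n! divided by the product of the list leaves the product of the missing pair.
product_of_missing = math.factorial(n) // math.prod(tasks)
print(product_of_missing)  # 540, which is 12 * 45

# Number of decimal digits in 100!
print(len(str(math.factorial(n))))  # 158
```

A 158-digit intermediate value is far beyond what a spreadsheet cell or a 64-bit integer can hold exactly.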

A better method is described in Data Streams: Algorithms and Applications by S. Muthukrishnan. The trick is to use another well-known formula: the sum of the squares of the first n numbers! This is n(n + 1)(2n + 1)/6.

Sum of missing: = n(n + 1)/2 – SUM(list)

Sum of squares of missing: = n(n + 1)(2n + 1)/6 – SUM(squares in list)

If the formulas return A and B, and the missing numbers are x and y, you end up needing to solve the two equations.

x + y = A

x^2 + y^2 = B

And these types of equations can be solved with relative efficiency. (If interested, read more about it in the Stack Overflow discussion.)
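One way to solve the pair of equations, sketched here in Python, is to note that (x – y)^2 = 2(x^2 + y^2) – (x + y)^2 = 2B – A^2. That gives the difference of the two numbers, and together with the sum A it gives both numbers. The function name and sample data are made up, and the sketch assumes exactly two distinct numbers from 1 to n are missing.

```python
import math


def find_two_missing(numbers, n):
    """Recover the two missing values from the sum and the sum of squares."""
    a = n * (n + 1) // 2 - sum(numbers)                                # x + y
    b = n * (n + 1) * (2 * n + 1) // 6 - sum(v * v for v in numbers)  # x^2 + y^2
    # (x - y)^2 = 2(x^2 + y^2) - (x + y)^2
    difference = math.isqrt(2 * b - a * a)
    return (a - difference) // 2, (a + difference) // 2


# Hypothetical data: tasks 1..100 with 12 and 45 removed.
tasks = [t for t in range(1, 101) if t not in (12, 45)]
print(find_two_missing(tasks, 100))  # (12, 45)
```

Here A = 57 and B = 2169, so 2B – A^2 = 1089 and the difference is 33, which gives (57 – 33)/2 = 12 and (57 + 33)/2 = 45.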

The method can be generalized to k missing numbers by also computing the sums of the cubes, fourth powers, and so on up to the kth powers, which gives k equations in k unknowns.
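Here is one possible Python sketch of the general case. It is not taken from the book or the discussion: it assumes Newton's identities are used to turn the k power sums into the coefficients of a polynomial whose roots are the missing numbers, and then simply tests every candidate from 1 to n. The function name is made up, and for simplicity the full power sums are computed with a loop rather than with closed-form formulas.

```python
def find_k_missing(numbers, n, k):
    """Recover k missing values from 1..n using power sums p_1..p_k."""
    # p[i] = sum of the i-th powers of the missing numbers.
    p = [0] * (k + 1)
    for i in range(1, k + 1):
        full = sum(t ** i for t in range(1, n + 1))
        p[i] = full - sum(v ** i for v in numbers)

    # Newton's identities: m * e_m = sum_{i=1..m} (-1)^(i-1) * e_(m-i) * p_i,
    # where e_m are the elementary symmetric polynomials of the missing numbers.
    e = [1] + [0] * k
    for m in range(1, k + 1):
        total = sum((-1) ** (i - 1) * e[m - i] * p[i] for i in range(1, m + 1))
        e[m] = total // m  # the division is exact for integer inputs

    # The missing numbers are the roots of
    # t^k - e_1 t^(k-1) + e_2 t^(k-2) - ... + (-1)^k e_k.
    def poly(t):
        return sum((-1) ** j * e[j] * t ** (k - j) for j in range(k + 1))

    return [t for t in range(1, n + 1) if poly(t) == 0]


# Hypothetical data: numbers 1..10 with 2, 5 and 9 removed.
print(find_k_missing([1, 3, 4, 6, 7, 8, 10], 10, 3))  # [2, 5, 9]
```

In this example the power sums are 16, 110 and 862, which give the polynomial t^3 – 16t^2 + 73t – 90 with roots 2, 5 and 9.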