Illustration by Julia Harrison

How to Estimate a Death Toll

To scope a disaster, do you need an entire list of names or just a single popular one?

The most prominent monuments in Battery Park, the military fortification turned public space on the southern tip of Manhattan, belong to the East Coast Memorial, which commemorates the U.S. servicemen who died at sea in the Atlantic Ocean during World War II. Like other memorials that attempt to honor massive casualties in an unostentatious manner, it is a simple arrangement of eight stone panels engraved with names.

Battery Park’s East Coast Memorial [WikiCommons]

As a compassionate human, you might wonder how one could possibly understand such tragedy. As a curious data scientist, you might wonder how one could synthesize the information. In other words, can you collapse the size of a population into a smaller, representative number?

Talk of sampling and population might stir memories of that middle school math lesson that was probably your introduction to statistical methodology. How many readers had a unit dedicated to the mark-and-recapture method of estimating population size?

The logic is pretty simple: if a biologist caught a certain number of animals in a population of unknown size, tagged and released them, and then at a later date recaptured another group of animals from the same pool, she could estimate the size of the entire population using a simple ratio. More specifically, the proportion of tagged animals within the second sample should approximate the proportion of total tagged animals within the entire population, which can now be calculated via cross multiplication.

Here’s an animation of the process that definitely wasn’t made by an amateur:

Because the second sample (“recapture”) is 1/8th tagged, and we originally tagged 4 fish, we’d estimate a total population of 32 (the actual figure is 40).
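The ratio logic above is the classic Lincoln–Petersen estimator, and it fits in a few lines. A minimal sketch (the function name is my own, not from the original):

```python
def mark_recapture_estimate(tagged, recapture_size, tagged_in_recapture):
    """Lincoln-Petersen estimator.

    The share of tagged animals in the recapture should match the share
    of tagged animals in the whole population, so:
        tagged / N ≈ tagged_in_recapture / recapture_size
    Cross-multiplying gives the population estimate N.
    """
    return tagged * recapture_size / tagged_in_recapture

# The fish example: 4 tagged, recapture of 8 fish, 1 of them tagged.
mark_recapture_estimate(4, 8, 1)  # → 32.0 (the actual population is 40)
```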

Slight modifications to the same concept could give us a way to estimate the size of a group of humans. If we know the incidence of a trait within the general population — the naturally occurring “tag” — we might be able to extrapolate the size of a sample based on the frequency of those tags.

This approach is sometimes called the Multiplier Method, and it’s a common way to approximate unseen members of a group. It’s the basis of the old housekeeping adage that if you see one cockroach there are likely hundreds hidden, and also actual studies that attempt to uncover the number of people engaged in illegal or stigmatized behavior. For example, this report leverages the mortality associated with drug use to estimate the total number of users in the EU based on fatal overdoses.

The memorial, though, is essentially just a roster with little additional information, meaning we can only subset our sample by name. So the question becomes: can we estimate a death toll simply by knowing how many people with a certain name died?

To test this theory, we first need to understand two things. First, the known characteristic only carries over if the sample is representative. If we grab a cohort of people that is fundamentally different from our general population, then whatever “tag” we’re relying on might not make the jump with us.

Second, our identifying feature must not be too rare. Otherwise, the sample is disproportionately vulnerable to chance occurrences. Here’s a demonstration of the volatility of rare events. On the left is a distribution of a thousand trials of a 1-in-1000 chance event; on the right is a distribution of a thousand trials of a 1-in-2 chance event:

If you knew the expected success rate and the successes were the only thing visible, the 1-in-2 trials would give you a pretty accurate estimate of the total number of trials: double anything between 460 and 540 and you land near the unknown figure, 1,000. The 1-in-1000 trials are much less consistent. Much of the time there are zero successes, so your guess for the total would be…zero. Meanwhile, success counts of 2 or 3 map to guesses of 2,000 or 3,000 trials, errors of 100% or more.
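You can reproduce that volatility with a quick simulation. This is a sketch of the experiment described above, not the code behind the original charts; the helper name is mine:

```python
import random

random.seed(1)

def estimate_from_successes(p, n_trials):
    """Run n_trials Bernoulli(p) trials, then try to infer n_trials
    from the observed success count alone (successes / p)."""
    successes = sum(random.random() < p for _ in range(n_trials))
    return successes / p

# 1-in-2 events: estimates cluster tightly around the true 1000.
common = [estimate_from_successes(0.5, 1000) for _ in range(1000)]

# 1-in-1000 events: estimates jump in increments of 1000,
# and frequently land on zero.
rare = [estimate_from_successes(0.001, 1000) for _ in range(1000)]
```

With the common event, every estimate falls within a few percent of 1,000; with the rare one, the estimates scatter across 0, 1,000, 2,000, and beyond.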

All this is to say we must pick the most popular name for our experiment, which you might have guessed by now is Smith. Finding World War II era surname records is difficult, but taking the midpoint between the reported 1.25% of 1880 and the current figure of 0.8% leaves us with a nice round approximation: Smith accounted for 1%, or one in every hundred, of last names in the 1940s. Which means a good rule of thumb for estimating a death toll should just be to take the number of people named Smith who died and multiply by 100.

So…does it work?

If you go down to the East Coast Memorial in Battery Park, you can count 40 sailors, 10 soldiers, and 3 coastguardsmen listed with the last name Smith, for a total of 53, which would give us an estimate of 5,300 total dead. According to Wikipedia, there are 4,609 names on the memorial, making our guess not perfect, but pretty good.
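The rule of thumb reduces to one division. A minimal sketch, assuming the ~1% Smith rate derived earlier (the names here are mine):

```python
SMITH_RATE = 0.01  # assumed share of Smiths among 1940s American surnames

def smith_estimate(smith_count):
    """Scale a count of dead Smiths up to an estimated total death toll."""
    return smith_count / SMITH_RATE

# East Coast Memorial: 40 sailors + 10 soldiers + 3 coastguardsmen named Smith
smith_estimate(40 + 10 + 3)  # → 5300.0 (actual names on the memorial: 4,609)
```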

What about some of the other disasters since then? Here’s a list of major tragedies along with their “Smiths Estimates”, which can be calculated after a little bit of Googling:

Clearly this method isn’t all that reliable. Not a single Smith died in the 1942 Cocoanut Grove Fire, and the approach can overshoot the actual toll (Jonestown) or undershoot it (Katrina).

What’s interesting, though, is what type of incident leads to better results — remember that our sample must mimic the general population in composition. The two best cases are Pearl Harbor and the USS Indianapolis sinking. The military draws members from all across geographies, classes, religions, and cultures, so their ranks probably mirror American demographics, last names included, to a decent degree.

Meanwhile, Jonestown attracted many members of the same community and even whole families, so their members were likely to clump around whatever surnames were present, and hence the overestimate. New Orleans, on the other hand, is a majority black city, whereas American Smiths are nearly three quarters white, perhaps explaining the underestimate.