

The Number



When Johns Hopkins epidemiologists set out to study the war in Iraq, they did not anticipate that their findings would be so disturbing, or so controversial.

By Dale Keiger

In April of last year, Gilbert H. Burnham and Leslie F. Roberts, A&S '92 (PhD), began finalizing plans for some new epidemiology. There was nothing notable in that; Burnham and Roberts, at the time both researchers at Johns Hopkins' Bloomberg School of Public Health, were epidemiologists. What was notable was the subject. They would not be studying the spread of HIV in sub-Saharan Africa, or incidence of cholera in Bangladeshi villages. They meant to conduct epidemiological research on the war in Iraq. They would treat the war as a public health catastrophe, and apply epidemiological methods to answer a question essential to an occupying power with the legal obligation to protect the occupied: What had happened to the Iraqi people after the 2003 U.S.-led invasion?

Their efforts produced a mortality study, their second in two years, published last October in The Lancet, Britain's premier medical journal. The study produced a number: 654,965. This was the researchers' estimate of probable "excess mortality" since the 2003 invasion — Iraqis now dead who would not be dead were it not for the war. The number was a product of the study, not its central point. But it commanded attention because it was appallingly, stupefyingly large. It was beyond anyone's previous worst imagining. It was just plain hard to believe, and in the weeks following its publication, it became an oddity of science: a single number so loud, in effect, it overwhelmed the conclusions of the research that produced it.

Newspapers the world over put the number in their headlines. Reporters tried to explain it, often bungling the job. To dismiss the research, critics seized on its implausibility, in the process frequently distorting its meaning. Political leaders dodged its implications by brushing it aside as the meaningless product of a discredited methodology. In a leading scientific journal, other scientists challenged how the study had been done.
Burnham, who is professor of epidemiology and co-director of Johns Hopkins' Center for Refugee and Disaster Response (CRDR), tried to keep attention focused on what he thought the public needed to understand. "I have one central message," he says. "That central message is that local populations, people caught up in conflict, do badly. This is not a study that says, Ain't it awful. This is a study that says, We need to do something about this."

A message lost in a number.

Public health researchers frequently quote research, done by Christer Ahlström for the Department of Peace and Conflict Research at Uppsala University in Sweden, estimating that up to 90 percent of casualties in late-20th-century wars were not fighters but civilians. Until recently, estimating these losses has been far from scientific. Conventional armies record the deaths of their own soldiers, but rarely count dead civilians. A country ravaged by war faces huge difficulties compiling its own tally. Research has demonstrated that what epidemiologists call "passive surveillance" — monitoring reports from hospitals, morgues, and government ministries — never produces an accurate, comprehensive summary of how violence affects a population. Burnham reports that with the exception of Bosnia, he and his colleagues have found no recent wars in which passive surveillance recorded more than 20 percent of the deaths later revealed by population-based studies.

Roberts began active measurement of war-zone mortality in 1992. Now a lecturer at the Mailman School of Public Health at Columbia University, he remains an adjunct faculty member at the Bloomberg School. When he decided to work on Iraq in 2004, he approached Burnham, who recalls, "He came looking for money and ideas. I had both of them." The pair of scientists devised a study in collaboration with faculty members at Al Mustansiriya University's College of Medicine in Baghdad, and Richard Garfield of the School of Nursing at Columbia University.
In September 2004, Roberts packed $20,000 in his shoes and a money belt and, lying on the floor of an SUV, slipped into Iraq from Jordan to coordinate the data gathering. Over three weeks, six trained Iraqi volunteers surveyed 7,868 people in 33 locations throughout the country. They asked each person several questions. From January 1, 2002, up to the day of the interview, how many people in your household were born? How many died? What did they die from, and when?

Based on the data, Burnham and Roberts estimated that prior to the invasion, the crude mortality rate in Iraq had been 5.0 per 1,000 persons per year. After the war commenced, the rate jumped to 12.3 per 1,000. Counting the sample's post-invasion deaths in excess of the deaths that would be expected given the pre-invasion mortality rate, then calculating figures for the whole population based on their statistical sample, they produced an estimate of how many Iraqis had died up to that point as a result of the war: 98,000.

The Lancet released the study on October 29, days before the 2004 U.S. presidential election. American newspapers took note of the unexpectedly large estimated mortality but devoted little further space to it as the last days of the campaign consumed public attention. There was some hostile press reaction. For example, Fred Kaplan, writing for the online publication Slate, which is owned by The Washington Post, called it "a useless study." There was some sniping at the authors and The Lancet for publishing it so close to the election, especially after the Associated Press quoted Roberts: "I e-mailed it on September 30 under the condition that it came out [in the journal] before the election. My motive was that if this came out during the campaign, both candidates would be forced to pledge to protect civilian lives in Iraq. I was opposed to the war and I still think that the war was a bad idea, but I think that our science has transcended our perspectives."
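The excess-mortality arithmetic works roughly like this: subtract the pre-invasion crude death rate from the post-invasion rate, then scale the difference by the population and the elapsed time. A minimal sketch follows; the rates are those reported above, but the population figure and time span are illustrative assumptions, so the output does not reproduce the published 98,000 estimate, which was derived from the sample's actual person-time and a conservative handling of outlier clusters.

```python
# Excess mortality: deaths above what the pre-war death rate predicts,
# scaled from the sample's rates to the whole population.

def excess_deaths(pre_rate, post_rate, population, years):
    """Crude rates are per 1,000 persons per year."""
    excess_rate = (post_rate - pre_rate) / 1000.0  # per person per year
    return excess_rate * population * years

# Illustrative inputs: pre-invasion rate 5.0/1,000, post-invasion
# 12.3/1,000 (from the first survey), with an ASSUMED population of
# 24 million and an ASSUMED 1.4 years of war at the time of the survey.
est = excess_deaths(5.0, 12.3, 24_000_000, 1.4)
print(round(est))  # roughly 245,000 with these illustrative inputs
```

The point of the sketch is the structure of the calculation, not the output: small shifts in a per-1,000 rate, multiplied across tens of millions of people, produce estimates in the hundreds of thousands.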
Burnham says that when he and his co-authors completed that first study, its estimate of post-invasion deaths startled them by its magnitude. He had no idea what was coming two years later.

Burnham and Roberts began pondering a second survey as soon as they had published the first. Roberts hoped someone else would do it and, as an independent source, verify their results. But in late 2005, a group from the Center for International Studies at the Massachusetts Institute of Technology approached them. They had $90,000 for research in conflict areas of the Middle East and sought advice on how to apply it. Burnham says, "We started discussing with them what you had to do. Within five minutes of starting that conversation, the MIT people said, 'Would you guys like to do it?'" The CRDR contributed additional funds, and Burnham, Roberts, and other Bloomberg School faculty set to work on a new mortality survey.

All statistical surveys attempt the same thing: to gather data from a representative subset of a population, then interpret that data to produce meaningful information about the population as a whole. The larger and more random the sample, the more confident the researchers can be that what they find in the subset holds true for the entire population. In an ideal situation, surveyors would assemble a random sample of sufficient size from a list of everyone in the country, then interview each person in the sample. Such a survey was impossible in Iraq. Burnham and Roberts did not have enough money; there was no adequate census from which to randomly draw the sample; and travel anywhere but relatively peaceful Kurdistan, in the north of the country, had become extraordinarily dangerous. The researchers needed a less costly methodology that would produce a sufficiently large and random sample but minimize travel by the Iraqi surveyors.



That dictated the same method used in the first study: a cluster survey. Originally designed to estimate vaccination coverage in developing countries, the cluster survey is cheaper, faster, and more flexible, and does not require a detailed local population list. Researchers first randomly assign clusters to populated areas. Interviewers then randomly select a start house in each assigned area and survey a cluster of surrounding households. For the Iraq study, interviewers could slip into a village or neighborhood, go door to door efficiently gathering data from houses that were all near each other, then leave before they attracted too much attention.

There are problems inherent in this methodology, what epidemiologists speak of as "design effects." For example, the experiences of people within a cluster — say, incidence of contagion, or of violence in a violent neighborhood — may be similar to an unrepresentative degree because the people in the cluster all live near each other. One way to compensate for these problems is to increase the number of clusters. But more clusters mean more interviews, more time in the field, more risk of getting killed. How much could Burnham and Roberts ask of the volunteers? They decided they could ask them to cover 50 clusters this time, up from the first study's 33. Each cluster would be 40 households instead of 30, making the sample upward of 12,000 Iraqis versus the 7,868 of the first survey.

The Hopkins scientists again would have assistance from Al Mustansiriya University in Iraq, plus three Bloomberg School researchers who had helped with the first study: Scott Zeger, Shannon Doocy, and Elizabeth Johnson. Zeger, a professor of biostatistics, worried about "recall bias."
The interviewers would be asking people to remember when a member of their household had died, and people tend to recall events as more recent than they actually were. Zeger made sure that the second survey would cover not just the months elapsed since the conclusion of the first one, but go back over the period of the first study as well. If a new sample of people produced results closely corresponding to the earlier findings, that would go a long way toward validating both studies. He also advised taking a subset of 10 clusters from the first survey and sampling them again; not the same households, but the same neighborhoods. Again, if the new survey produced similar results for those clusters, that would strengthen confidence in the data. Burnham and Roberts planned more detailed interviews and requested an additional corroborating element: This time, interviewers would ask every household reporting a death to produce a death certificate. (In the first survey, interviewers asked that of only a subset of households.) Concern for the safety of interviewers and respondents alike produced two more decisions. First, they would not record identifiers like the names and addresses of people interviewed. Burnham feared retribution if a hostile militia at a checkpoint found a record of households visited by the Iraqi survey teams. The second decision was how to choose each cluster's starting point. In the first survey, the researchers had used randomly selected GPS coordinates. Burnham and Roberts wanted to use GPS again, but interviewers from the first survey who had volunteered for the second one said no. Iraqis believe that GPS is used to direct U.S. precision bombing: Simply being caught at a checkpoint with a GPS unit in one's possession could be fatal. So the Hopkins team devised a different sampling methodology — one that would be at the center of a scientific debate once the study was published. 
From May 20 to July 10, 2006, eight Iraqi physicians — four men, four women, all trained in health surveys and community medicine, all fluent in English and Arabic — braved dangerous conditions throughout Iraq to gather the data. "During the survey, we were all holding our breath and crossing our fingers hoping that something bad wouldn't happen to the survey teams," Burnham says. "It was a great relief to hear they were all back safely."

He and Roberts stayed out of Iraq, judging the risks to be too great for them and anyone working with them. In 2004, Roberts had learned the futility of trying to disguise himself. He had dyed his hair black, donned Iraqi clothing, and avoided speaking to anyone in public. Then one day during the survey, police detained a team of interviewers. They didn't know Roberts was waiting in the car, pretending to sleep so no one would notice his blue eyes. As he waited, hoping the interviewers would be released (they were), two young boys approached the car, took one look at Roberts, and said, in good English, "Hello, mister!" So much for disguise.

Burnham and Doocy, a research associate from the Bloomberg School, flew to Jordan last August to analyze the results. They pored over every interview and double-checked every report of a death, to be sure the report had been properly translated and the data correctly entered into the database. Says Doocy, "We made a very conscious effort to take the most conservative approach. We made as few assumptions as possible, and if we made an assumption, it was an assumption that was going to lower the overall violent-death rate. So for example, if there was a death that had no cause reported, we categorized it as a non-violent death because that was the most conservative thing to do." On detailed examination, three clusters had to be dropped because their locations had been misattributed. The data from the remaining 47 clusters looked solid.
The new estimate of pre-invasion mortality, 5.5 per 1,000, corresponded to the estimate from the first survey (5.0 per 1,000). Furthermore, it was consistent with the same statistic for Syria and Jordan, nearby countries with similar demographics, and with the rate for Iraq listed in the CIA's World Factbook. Clusters sampled by both studies yielded corresponding data. The researchers analyzed the survey with a variety of statistical tools, obtaining close to the same results each time, another indication that the methodology had been sound and the data were good.

But the findings were stunning. If the estimates were correct, throughout the occupation Iraqis had been dying at an appalling rate, upward of 1,000 per day during the last year of the survey period. The crude mortality rate had risen year by year, from the pre-invasion 5.5 per 1,000 to 7.5, then to 10.9, and finally to 19.8 between June 2005 and June 2006. Of the estimated excess deaths, 92 percent were violent. The data indicated that gunshots had killed more people than air strikes, car bombs, and improvised explosive devices combined. Only 31 percent of violent deaths were attributed to coalition armed forces, a measure of the level of sectarian violence and lawlessness. Interviewers had remembered to ask for death certificates in 87 percent of all cases of reported mortality, and respondents had produced them for 92 percent of those deaths, a good check against recall bias.

On October 11, 2006, The Lancet released "Mortality After the 2003 Invasion of Iraq: A Survey." The authors noted in the article the potential for bias inherent in the sampling methodology, in the conduct of the interviews, from migrations of population within and out of Iraq, and from the possibility that entire families had been wiped out, leaving no one to report their deaths.
But they believed the study had documented what they had feared they would find — that the civilian population of Iraq was suffering a public health disaster. There were eight pages in the Lancet article, with maps, charts, tables, and many numbers. But one number stopped every reader: 654,965 Iraqi dead.

Political reaction to the study was swift and negative. U.S. President George W. Bush told reporters at a news conference, "I don't consider it a credible report," and described the research methodology as "pretty well discredited." In Britain, a spokesman for Prime Minister Tony Blair said, "The problem is they're using an extrapolation technique from a relatively small sample from an area of Iraq which isn't representative of the country as a whole. We have questioned that technique right from the beginning and we continue to do so."

Journalists around the world reported the study, and the focus was all on the number, which many could not get right. For example, the first story to appear in The New York Times bore the misleading headline, "Iraqi Dead May Total 600,000, Study Says." That number was, of course, 54,965 too low and, in the study, approximated only violent deaths. Nowhere in that first Times story does the total number of excess deaths appear. Misleading headlines appeared in The Wall Street Journal, The Los Angeles Times, and The Times of London. The latter also reported that the deaths were civilian, though the Lancet article makes clear the surveyors did not attempt to ascertain whether the dead had been civilians or combatants.

Robert Lichter and Rebecca Goldin, faculty members at George Mason University, are president and director of research, respectively, of the nonprofit Statistical Assessment Service (STATS), which studies the use of statistics in news media. Lichter was surprised by the sort of errors he found: "The study was complicated, and you would expect journalists to have problems with it. But the coverage got the simple things wrong.
You don't need to be a statistician to read plain English."






Commentators struggled with, or perhaps in some cases preferred to ignore, an important aspect of the data: what statisticians refer to as the "confidence interval." Sampling a population is inherently inexact. Statistical analysis recognizes this by producing not a single number, like the attendance figure recorded by a stadium's turnstiles, but a range of figures, the confidence interval. Thus the study in The Lancet stated that the authors were 95 percent sure that the true figure lay within a range of estimated mortality: 392,979 to 942,636 above what would have been expected had the war not occurred. Many people mistakenly assumed that any number within that range was equally likely to be the actual count. Not true. The range followed a bell-curve distribution, with the now-famous figure of 654,965 at the top of the curve, statistically the most likely count. Every other number in the range became less and less likely as one approached the extremes. So 400,000 or 900,000 deaths were possible, but far less probable than the central estimate.

The reporting on the first Iraq survey had generally failed to grasp this, too. Says Lichter, "I have found that 'second time around' coverage rarely improves on the first time. That's because so many reporters treat the LexisNexis [journalism database] files on the previous story as a template for the next one. In journalism, practice doesn't make perfect, practice makes permanent."

In The Wall Street Journal, Steven E. Moore wrote an op-ed column titled, "655,000 War Dead? A Bogus Study on Iraq Casualties." Moore, a political consultant who had worked in Iraq for Paul Bremer and the Coalition Provisional Authority, was derisive about the survey's methodology. Of the survey's 47 clusters, he wrote, "This is astonishing: I wouldn't survey a junior high school, no less an entire country, using only 47 cluster points." With so few clusters, he added, "it is highly unlikely the Johns Hopkins survey is representative of the population."
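Whatever one makes of the cluster count, the bell-curve behavior of the confidence interval can be made concrete with a small computation. This sketch makes one simplifying assumption: it treats the interval as normal, though the study's model-based interval need not be exactly symmetric (the point estimate is not quite the midpoint of the published range), so the ratio is illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

low, high = 392_979, 942_636       # the study's 95% confidence interval
center = 654_965                   # the point estimate
sigma = (high - low) / (2 * 1.96)  # spread a normal 95% interval implies

# How likely is a value at the top of the range, relative to the center?
ratio = normal_pdf(high, center, sigma) / normal_pdf(center, center, sigma)
print(round(ratio, 2))  # ~0.12: roughly 8 times less likely than 654,965
```

This is the point the coverage missed: the endpoints of the range are not alternative answers on equal footing with the central estimate, but tail values the data render far less plausible.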
Moore was wrong about methodology. As Goldin of STATS points out, the number of clusters has nothing to do with whether a sample is representative; it affects only the size of the confidence interval. And the Bloomberg School's Doocy notes that given the total population of Iraq, the study's sample size was significantly larger than is customary for public health surveys.

Burnham and Roberts defended their study as they were asked the same questions in one interview after another. The Iraqi government, and President Bush, had been citing 30,000 Iraqi deaths. The volunteer organization Iraq Body Count, which from Britain had been monitoring media reports of casualties, had tallied fewer than 50,000. An Iraqi non-governmental organization, Iraqiyun, which had not relied on passive reporting but done some field surveys of its own, had estimated 128,000 deaths from the start of the war through July 2005. How could all these numbers be so far off? Where were all the bodies?

The Hopkins researchers reminded everyone that data produced by the same methodology for the Darfur region of Sudan and for Congo had never been discredited, and were routinely cited by both the United Nations and the U.S. government. They pointed out the historical inadequacies of passive surveillance and official statistics. They reminded people that even without the war, Iraqis should have been dying at a rate of at least 120,000 per year from natural causes, yet in 2002, before the invasion, the Iraqi government had reported only 40,000 deaths. They noted the extreme violence throughout Iraq, and the great difficulties journalists faced reporting anything outside Baghdad. They stressed the scrutiny their paper had received from The Lancet's peer reviewers and editors. Burnham says he received 15 pages of comments, concerns about methodology, suggestions for better graphics and data tables, and questions the reviewers felt needed to be addressed before publication.
"This was one of the most reviewed and edited papers I've ever done," he says.

On the Internet, Iraqis debated the study's merits. Some thought it was ridiculous; they were as incredulous at the figures as everyone else had been. Others seized on the mortality estimate as one more example of the wrong inflicted on their country; for them, the number was politically useful. A dentist in Baghdad, writing under the name Zeyad A., contributed a more reasoned perspective: "I have personally witnessed dozens of people killed in my neighborhood over the last few months (15 people in the nearby vicinity of our house alone, over four months), and virtually none of them were mentioned in any media report while I was there. And that was in Baghdad, where there is the highest density of journalists and media agencies. Don't you think this is a common situation all over the country?"

Raed Jarrar, a contributor to Foreign Policy and director of the Iraq Project at Global Exchange, an international human rights organization, grew up in Iraq. He was country director of a 2003 door-to-door casualty survey, sponsored by the Campaign for Innocent Victims in Conflict (CIVIC), that counted 2,000 civilian deaths in Baghdad in roughly the first 100 days of the war. "I know how hard that [sort of survey] is," he says, and notes that his limited study recorded twice as many casualties as that period's official figure. Now based in Washington, D.C., Jarrar says, "I didn't think [the mortality figure] was exaggerated at all. I do not know of any Iraqis who did not lose relatives. I know so many people who died, and I don't know hundreds of people in Iraq, you know?"

Roberts and Burnham got some nasty mail, which they took in stride along with the political invective directed at them.
Some critics dismissed them as leftist academics disguising politics as science; that conviction was fed by Roberts' brief, unsuccessful run for New York's 24th Congressional District seat as a Democrat who favored withdrawal from Iraq and wanted Congress to rescind the president's authority to conduct the war. Burnham praises reporters who made an effort to understand the epidemiology. But he was annoyed by a story that appeared in Science, one of the world's foremost research journals.

The Lancet had sent a press release announcing the study to John Bohannon, a Vienna-based contributing correspondent for Science who happened to be writing about a pair of Oxford University physicists, Neil Johnson and Sean Gourley, who were studying statistical patterns in casualty figures from more than 10 wars. Bohannon asked Gourley what he thought of the Lancet article. Gourley, Johnson, and Michael Spagat, an economist at Royal Holloway, University of London, who is working on a book about conflict analysis, all read the article. Says Johnson, "I got hold of the Lancet paper, and as I read it, I began to feel uneasy about the Hopkins group's particular implementation of the methodology."

According to the summary that appeared in The Lancet, the Hopkins researchers had randomly selected a section from each of the study's 50 population areas. Next they randomly picked a main commercial street, then randomly selected residential streets that crossed it. In one more random process, they picked a single house from one of those cross streets as the starting point for the cluster. Johnson and his colleagues believe that violence in Iraq is concentrated on just the sort of commercial streets that were near the start points of the survey's clusters. By repeatedly sampling too near where the most violence occurred, they reasoned, the study might be fatally skewed.
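The start-point procedure, as summarized in The Lancet, can be sketched as follows. Everything here — the street names, the nesting, the counts — is a hypothetical miniature built from that summary, not the field teams' actual protocol.

```python
import random

def pick_start_house(section, rng):
    """Sketch of cluster start-point selection as summarized in The
    Lancet: random main commercial street, then a random residential
    cross street, then a random house on it. All names hypothetical."""
    main_street = rng.choice(sorted(section))
    cross_street = rng.choice(sorted(section[main_street]))
    return rng.choice(section[main_street][cross_street])

# A hypothetical randomly chosen section of one population area:
section = {
    "main street A": {
        "cross street 1": ["house 1", "house 2", "house 3"],
        "cross street 2": ["house 4", "house 5"],
    },
    "main street B": {
        "cross street 3": ["house 6", "house 7"],
    },
}
start = pick_start_house(section, random.Random(2006))
# Interviewers would then survey the neighboring households to fill
# out the cluster of 40.
```

The critics' worry is visible in the structure: every possible start house lies on a street that intersects a main commercial street, so if violence concentrates near such streets, the clusters could oversample it — unless, as Burnham later argued, the actual field procedure also reached households farther away.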
Johnson labeled the sampling problem "main-street bias," and the three Brits wrote an article about it. (At press time, Johnson, Gourley, and Spagat's paper, titled "Bias in Epidemiological Studies of Conflict Mortality," had yet to be published, but Johnson made a draft copy available to Johns Hopkins Magazine.) In the October 20, 2006, edition of Science, Bohannon published a story headlined "Iraqi Death Estimates Called Too High; Methods Faulted," in which he cited the main-street bias argument and quoted Johnson as saying, "It is almost a crime to let [the survey's procedures] go unchallenged."

Bohannon's story said of Burnham: "He also told Science he does not know exactly how the Iraqi team conducted its survey; the details about neighborhoods surveyed were destroyed 'in case they fell into the wrong hands and could increase the risks to residents.'" Burnham replies that in none of his e-mail exchanges with Bohannon did he say he didn't know how the Iraqi surveyors conducted their interviews. Nor did he say details of each household's location had been destroyed; Burnham notes that as a security precaution those details were never recorded in the first place. (In e-mail exchanges with Johns Hopkins Magazine, Bohannon responded that he was referring to addresses on scraps of paper that were used to randomly choose start-point households; those were destroyed.)

Burnham and Roberts also maintain that their British critics made false assumptions because, due to space limitations, the methodology section of the Lancet article was only a summary of the methods used. Burnham says a fuller explanation would have made plain that the researchers did sample away from main streets and cross streets, and that since every household in Iraq had an equal chance of being selected, there was no bias. "The basic problem is that we did not have the space [in The Lancet] to go into all the details on the sampling method," Burnham says.
"In retrospect, since it raised questions, we probably should have done that." Johnson does not buy that explanation. In an e-mail to Johns Hopkins Magazine, he wrote, "I'm sorry, but this is totally unconvincing." He also said, "It is crucial to understand that saying 'every household had an equal chance of being selected' is a desired outcome, not a sampling methodology. With this goal in mind, the JH team must state exactly what methods they employed to guarantee this desired outcome, and correct for any inherent bias if this desired outcome is not 100 percent achieved. We object to the claim that we made false assumptions. The only assumption we made is to assume that the authors were telling the truth about their methodology in the Lancet article. We believe the correct thing for them to do is to publish an erratum in The Lancet ASAP since scientists obviously should not leave misleading methodology lying around in published papers."

