For the past 10 years, the CIA has overtly funded the production of a publicly available data set on certain atrocities around the world that now covers the period from January 1995 to early 2014 and is still updated on a regular basis. If you work in a relevant field but didn’t know that, you’re not alone.

The data set in question is the Political Instability Task Force’s Worldwide Atrocities Dataset, which records information from several international press sources about situations in which five or more civilians are deliberately killed in the context of some wider political conflict. Each record includes information about who did what to whom, where, and when, along with a brief text description of the event, a citation for the source article(s), and, where relevant, comments from the coder. The data are updated monthly, although those updates are posted on a four-month lag (e.g., data from January become available in May).

The decision to limit collection to events involving at least five fatalities was a pragmatic one. As the data set’s codebook notes,

We attempted at one point to lower this threshold to one and the data collection demands proved completely overwhelming, as this involved assessing every murder and ambiguous accidental death reported anywhere in the world in the international media. “Five” has no underlying theoretical justification; it merely provides a threshold above which we can confidently code all of the reported events given our available resources.

For the past three years, the data set has also fudged this rule to include targeted killings that appear to have a political motive, even when only a single victim is killed. So, for example, killings of lawyers, teachers, religious leaders, election workers, and medical personnel are nearly always recorded, and these events are distinguished from ones involving five or more victims by a “Yes” in a field identifying “Targeted Assassinations” under a “Related Tactics” header.
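To make the inclusion rule concrete, here is a minimal Python sketch of how one might separate events that meet the five-death threshold from the single-victim targeted killings. The field names `deaths` and `targeted_assassination` are hypothetical stand-ins, not the actual column headers in the PITF file, and the sketch assumes that any event with five or more deaths counts under the threshold rule regardless of the flag.

```python
from typing import Dict, List

# Hypothetical field names; the real PITF column headers may differ.
DEATHS_FIELD = "deaths"
TARGETED_FIELD = "targeted_assassination"
MIN_DEATHS = 5  # threshold described in the codebook


def split_by_inclusion_rule(records: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group records by the rule under which they would have been included."""
    groups: Dict[str, List[Dict[str, str]]] = {
        "threshold": [],  # five or more reported deaths
        "targeted": [],   # below the threshold but flagged "Yes"
        "neither": [],    # should be empty if the rule was applied consistently
    }
    for rec in records:
        deaths = int(rec[DEATHS_FIELD])
        flagged = str(rec.get(TARGETED_FIELD, "")).strip().lower() == "yes"
        if deaths >= MIN_DEATHS:
            groups["threshold"].append(rec)
        elif flagged:
            groups["targeted"].append(rec)
        else:
            groups["neither"].append(rec)
    return groups
```

Keeping the two groups apart matters for trend analysis, because the targeted-killing records only exist for the more recent years and would otherwise look like a rise in the number of events.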

The data set is compiled from stories appearing in a handful of international press sources that are accessed through Factiva. It is a computer-assisted process. A Boolean keyword search is used to locate potentially relevant articles, and then human coders read those stories and code the ones that turn out to be relevant. From the beginning, the PITF data set has pulled from Reuters, Agence France-Presse, Associated Press, and the New York Times. Early in the process, BBC World Monitor and CNN were added to the roster, and All Africa was also added a few years ago to improve coverage of Africa.

The decision to restrict collection to a relatively small number of sources was also a pragmatic one. Unlike GDELT, for example, whose routine production is fully automated, the Atrocities Dataset is hand-coded by people reading news stories identified through a keyword search. With people doing the coding, the cost of broadening the search to local and web-based sources is prohibitive. The hope is eventually to automate the process, either as a standalone project or as part of a wider automated event data collection effort. As GDELT shows, though, that’s hard to do well, and that day hasn’t arrived yet.

Computer-assisted coding is far more labor intensive than fully automated coding, but it also carries some advantages. Human coders can still discern better than the best automated coding programs when numerous reports are all referring to the same event, so the PITF data set does a very good job eliminating duplicate records. Also, the “where” part of each record in the PITF data set includes geocoordinates, and its human coders can accurately resolve the location of nearly every event to at least the local administrative area, a task over which fully automated processes sometimes still stumble.

Of course, press reports only capture a fraction of all the atrocities that occur in most conflicts, and journalists writing about hard-to-cover conflicts often describe these situations with stories that summarize episodes of violence (e.g., “Since January, dozens of villagers have been killed…”). The PITF data set tries to accommodate this pattern by recording two distinct kinds of events: 1) incidents, which occur in a single place in a short period of time, usually a single day; and 2) campaigns, which involve the same perpetrator and target group but may occur in multiple places over a longer period of time, usually days but sometimes weeks or months.

The inclusion of these campaigns alongside discrete events allows the data set to capture more information, but it also requires careful attention when using the results. Most statistical applications of data sets like this one involve cross-tabulations of events or deaths at a particular level during some period of time—say, countries and months. That’s relatively easy to do with data on discrete events located in specific places and days. Here, though, researchers have to decide ahead of time if and how they are going to blend information about the two event types. There are two basic options: 1) ignore the campaigns and focus exclusively on the incidents, treating that subset of the data set like a more traditional one and ignoring the additional information; or 2) make a convenient assumption about the distribution of the incidents of which campaigns are implicitly composed and apportion them accordingly.

For example, if we are trying to count monthly deaths from atrocities at the country level, we could assume that deaths from campaigns are distributed evenly over time and assign equal fractions of those deaths to all months over which they extend. So, a campaign in which 30 people were reportedly killed in Somalia between January and March would add 10 deaths to the monthly totals for that country in each of those three months. Alternatively, we could include all of the deaths from a campaign in the month or year in which it began. Either approach takes advantage of the additional information contained in those campaign records, but there is also a risk of double counting, as some of the events recorded as incidents might be part of the violence summarized in the campaign report.
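The two apportionment rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used in any published analysis: the function names are mine, and a campaign is represented simply by a start (year, month), an end (year, month), and a reported death count.

```python
from typing import Dict, List, Tuple

YearMonth = Tuple[int, int]  # (year, month), with month in 1..12


def months_spanned(start: YearMonth, end: YearMonth) -> List[YearMonth]:
    """List every calendar month from start to end, inclusive."""
    first = start[0] * 12 + (start[1] - 1)
    last = end[0] * 12 + (end[1] - 1)
    if last < first:
        raise ValueError("campaign ends before it starts")
    return [(i // 12, i % 12 + 1) for i in range(first, last + 1)]


def apportion_evenly(start: YearMonth, end: YearMonth, deaths: float) -> Dict[YearMonth, float]:
    """Option A: spread a campaign's deaths equally over the months it covers."""
    months = months_spanned(start, end)
    share = deaths / len(months)
    return {month: share for month in months}


def assign_to_start(start: YearMonth, deaths: float) -> Dict[YearMonth, float]:
    """Option B: put all of a campaign's deaths in the month it began."""
    return {start: float(deaths)}
```

With the Somalia example, and a year picked arbitrarily for illustration, `apportion_evenly((2013, 1), (2013, 3), 30)` assigns 10.0 deaths to each of January, February, and March, while `assign_to_start((2013, 1), 30)` puts all 30 in January. Neither function does anything about the double-counting risk; that still requires checking incident records against the campaign they might belong to.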

It is also important to note that this data set does not record information about atrocities in which the United States is either the alleged perpetrator or the target (e.g., 9/11) because of legal restrictions on the activities of the CIA, which funds the data set’s production. This constraint presumably has a bigger impact on some cases, such as Iraq and Afghanistan, than others.

To provide a sense of what the data set contains and to make it easier for other researchers to use it, I wrote an R script that ingests and cross-tabulates the latest iteration of the data in country-month and country-year bins and then plots some of the results. That script is now posted on GitHub.
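For readers who do not use R, the cross-tabulation step is easy to approximate. The Python sketch below is not a port of that script; it only illustrates the binning idea, and the record keys (`country`, `year`, `month`, `deaths`) are hypothetical names rather than the data set's actual column headers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def cross_tabulate(records: Iterable[Mapping[str, object]]) -> Tuple[
    Dict[Tuple[str, int, int], int], Dict[Tuple[str, int], int]
]:
    """Sum reported deaths into country-month and country-year bins."""
    by_month: Dict[Tuple[str, int, int], int] = defaultdict(int)
    by_year: Dict[Tuple[str, int], int] = defaultdict(int)
    for rec in records:
        country = str(rec["country"])
        year = int(rec["year"])
        month = int(rec["month"])
        deaths = int(rec["deaths"])
        by_month[(country, year, month)] += deaths
        by_year[(country, year)] += deaths
    # Return plain dicts so that a missing bin raises KeyError instead of
    # silently reading as zero.
    return dict(by_month), dict(by_year)
```

Bins with no recorded events are simply absent from the result, so anyone plotting a time series from it would need to fill those months with zeros first.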

One way to see how well the data set is capturing the trends we hope it will capture is to compare the figures it produces with ones from data sets in which we already have some confidence. While I was writing this post, Colombian “data enthusiast” Miguel Olaya tweeted a pair of graphs summarizing data on massacres in that country’s long-running civil war. The data behind his graphs come from the Rutas de Conflicto project, an intensive and well-reputed effort to document as many as possible of the massacres that have occurred in Colombia since 1980. Here is a screenshot of Olaya’s graph of the annual death counts from massacres in the Rutas data set since 1995, when the PITF data pick up the story:

Now here is a graph of deaths from the incidents in the PITF data set:

Just eyeballing the two charts, the correlation looks pretty good. Both show a sharp increase in the tempo of killing in the mid-1990s; a sustained peak around 2000; a steady decline over the next several years; and a relatively low level of lethality since the mid-2000s. The annual counts from the Rutas data are two or three times larger than the ones from the PITF data during the high-intensity years, but that makes sense when we consider how much deeper a search that project has conducted. There’s also a dip in the PITF totals in 1999 and 2000 that doesn’t appear in the Rutas data, but the comparisons over the larger span hold up. All things considered, this comparison makes the PITF data look quite good, I think.
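Anyone with both sets of annual totals in hand could go beyond eyeballing and compute the correlation directly. The Python sketch below is one way to do that, assuming each series is a mapping from year to death count; the function name is mine, and I have not run this on the Rutas or PITF figures.

```python
import math
from typing import Mapping


def pearson_on_shared_years(a: Mapping[int, float], b: Mapping[int, float]) -> float:
    """Pearson correlation of two annual series over the years both cover."""
    years = sorted(set(a) & set(b))
    n = len(years)
    if n < 2:
        raise ValueError("need at least two shared years")
    xs = [float(a[y]) for y in years]
    ys = [float(b[y]) for y in years]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)
```

Because correlation ignores scale, a result near 1 would be consistent with the pattern described here, in which one source reports two or three times as many deaths as the other but the two rise and fall together.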

Of course, the insurgency in Colombia has garnered better coverage from the international press than conflicts in parts of the world that are even harder to reach or less safe for correspondents than the Colombian highlands. On a couple of recent crises in exceptionally under-covered areas, the PITF data also seem to do a decent job capturing surges in violence, but only when we include campaigns as well as incidents in the counting.

The plots below show monthly death totals from a) incidents only and b) incidents and campaigns combined in the Central African Republic since 1995 and South Sudan since its independence in mid-2011. Here, deaths from campaigns have been assigned to the month in which the campaign reportedly began. In CAR, the data set identifies the upward trend in atrocities through 2013 and into 2014, but the real surge in violence that apparently began in late 2013 is only captured when we include campaigns in the cross-tabulation (the dotted line).

The same holds in South Sudan. There, the incident-level data available so far miss the explosion of civilian killings that began in December 2013 and reportedly continues, but the combination of campaign and incident data appears to capture a larger fraction of it, along with a notable spike in July 2013 related to clashes in Jonglei State.

These examples suggest that the PITF Worldwide Atrocities Dataset is doing a good job of capturing trends over time in lethal violence against civilians, even in some of the hardest-to-cover cases. To my knowledge, though, this data set has not been widely used by researchers interested in atrocities or political violence more broadly. Probably its most prominent use to date was in the Model component of the Tech Challenge for Atrocities Prevention, a 2013 crowdsourced competition funded by USAID and Humanity United. That challenge produced some promising results, but it remains one of the few applications of this data set on a subject for which reliable data are scarce. Here’s hoping this post helps to rectify that.

Disclosure: I was employed by SAIC as research director of PITF from 2001 until 2011. During that time, I helped to develop the initial version of this data set and was involved in decisions to fund its continued production. Since 2011, however, I have not been involved in either the production of the data or decisions about its continued funding. I am part of a group that is trying to secure funding for a follow-on project to the Model part of the Tech Challenge for Atrocities Prevention, but that effort would not necessarily depend on this data set.