
This is a guest post by David Masad of Caerus Analytics. An elaborated version of this analysis is here.

*****

One of the important challenges in studying conflict is simply identifying where it happens. For more than 40 years, researchers have sought to build systematic data about episodes of conflict. Monitoring events on the ground in hundreds of countries is quite difficult, but now, thanks to the tremendous work of political scientists Philip Schrodt and Patrick Brandt and information scientist Kalev Leetaru, there is a new dataset, the Global Database of Events, Language, and Tone (GDELT), that facilitates this task:

bq. The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first “realtime social sciences earth observatory.” Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates.

But does it work? Can we remotely observe violent conflicts around the world through computer-coded media reports? Building on previous analyses by New Scientist magazine and Jay Ulfelder, I will show that the GDELT data can indeed help us do that by examining GDELT data about the ongoing Syrian civil war. In particular, I will show that the violent events identified in GDELT correlate with death tolls at the national level. I will also show that GDELT events are correlated with the future registration of refugees. This preliminary analysis suggests that GDELT does capture underlying dynamics in the Syrian civil war, although the analysis also suggests where the GDELT data may fall short.

Here is the trend in violent events within Syria through May 27, 2013 as captured by GDELT and consolidated to at most one event of any type between two specific actors at a particular location each day:
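For readers who want to try the consolidation step themselves, here is a minimal Python sketch of one way to do it. It is not the code behind the figure. The record keys, actor codes, and event code are hypothetical stand-ins for GDELT's date, actor, event-type, and location columns, and I am assuming that "at most one event of any type" means one event per type, actor pair, location, and day.

```python
from collections import Counter

# Hypothetical records. The keys and codes are illustrative stand-ins for
# GDELT's date, actor, event-type, and location columns, not the exact schema.
events = [
    {"date": "2012-07-19", "actor1": "SYRGOV", "actor2": "SYRREB",
     "event_code": "190", "location": "Aleppo"},
    {"date": "2012-07-19", "actor1": "SYRGOV", "actor2": "SYRREB",
     "event_code": "190", "location": "Aleppo"},  # duplicate report
    {"date": "2012-07-19", "actor1": "SYRGOV", "actor2": "SYRREB",
     "event_code": "190", "location": "Homs"},
    {"date": "2012-07-20", "actor1": "SYRGOV", "actor2": "SYRREB",
     "event_code": "190", "location": "Aleppo"},
]

# Assumed consolidation key: one event per type, actor pair, place, and day.
KEY_FIELDS = ("date", "actor1", "actor2", "event_code", "location")


def consolidate(records, key_fields=KEY_FIELDS):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def daily_counts(records):
    """Count records per date string."""
    return Counter(rec["date"] for rec in records)


consolidated = consolidate(events)
counts = daily_counts(consolidated)
for day in sorted(counts):
    print(day, counts[day])
```

Dropping `"event_code"` from `KEY_FIELDS` gives the stricter reading, in which two actors contribute at most one event per location per day regardless of type.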

Here is what those data look like when mapped:

First, we can compare the trend in these computer-coded events to the casualty data amassed by Syria Tracker. The Syria Tracker project compiles data on individuals killed in Syria from volunteered reports. While not the only such dataset, it has been widely used and was readily obtainable. As the graph below shows, while the number of violent events is higher than the number of reported deaths, the two trends initially move together: an increase in violent events accompanies an increase in reported deaths, and vice versa.

However, the correlation between the two data sources seems to weaken in 2012. While the level of violence remained roughly constant as measured by the fatalities reported, the number of GDELT events steadily dropped. As Jay Ulfelder suggests, this may be evidence of media fatigue: as the conflict drags on with few major developments, media interest wanes and coverage (and hence sources for GDELT to draw from) decreases. Such media fatigue has been observed in previous event data validations.

Here is more direct evidence that correlation between these two data sources has declined over time:
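One simple way to check whether a correlation is weakening over time is to compute it inside a moving window. The sketch below does this with a Pearson correlation in plain Python. The window length and the two toy series are assumptions for illustration only, not the actual GDELT or Syria Tracker counts, and this is not necessarily the method behind the figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def rolling_correlation(xs, ys, window):
    """Correlation within each consecutive window of `window` observations."""
    if len(xs) != len(ys):
        raise ValueError("series must have equal length")
    return [
        pearson(xs[i:i + window], ys[i:i + window])
        for i in range(len(xs) - window + 1)
    ]


# Toy series (made up): the two move together at first, then diverge.
event_counts = [1, 2, 3, 4, 5, 6]
death_counts = [2, 4, 6, 5, 4, 3]
rolling = rolling_correlation(event_counts, death_counts, window=3)
print([round(r, 2) for r in rolling])
```

In practice a three-observation window is far too short to be informative; the choice of window trades responsiveness against noise.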

On its face this suggests a limitation to GDELT as a tool for persistent conflict tracking. That said, it may be possible to normalize these counts and generate a less biased estimate of the underlying conflict dynamics. (We also conducted the death counts analysis at the governorate level. Please see the full analysis for details, including spatial variation.)

We can also corroborate the GDELT data by comparing them to the number of refugees registered by the United Nations. Intuitively, an increase in violence seems likely to lead to an increase in the flow of refugees from Syria into neighboring countries. UNHCR provides data on registered refugees in total, as well as by country, monthly or bimonthly. Unfortunately, the available UN data begin only in January 2012, and thus do not overlap with the first year of our data. Here are the two data sources side-by-side:

The increase in violent events appears related to refugee flows, but only at a later point in time. Thus, the correlation between violent events in prior months and refugee registration is higher:
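The lagged comparison can be sketched as follows: pair the event count from `lag` periods earlier with each period's refugee registrations, then correlate the pairs. This is an illustration under stated assumptions, not the code used for the analysis. The monthly series are made up, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def lagged_correlation(events, refugees, lag):
    """Correlate events at period t - lag with refugee registrations at t."""
    if lag < 0 or len(events) != len(refugees):
        raise ValueError("lag must be >= 0 and series must align")
    if lag == 0:
        return pearson(events, refugees)
    return pearson(events[:-lag], refugees[lag:])


def best_lag(events, refugees, max_lag):
    """Return the lag in 0..max_lag with the highest correlation."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(events, refugees, lag),
    )


# Made-up monthly series in which registrations echo events one month later.
monthly_events = [10, 30, 20, 50, 40, 60]
monthly_refugees = [0, 1, 3, 2, 5, 4]
print(best_lag(monthly_events, monthly_refugees, max_lag=3))
```

Each additional month of lag discards one observation, so with a short series like the UN data the longer lags rest on very few points and should be read cautiously.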

Of course, different refugees will take different amounts of time to arrive at the various locations, and the time to become registered will vary as well. Additional work could identify relationships between violence in particular areas and flows of refugees to the nearby borders and develop a better understanding of the actual UN registration process.

In sum, the issues with media fatigue identified here and by Jay Ulfelder should temper enthusiasm for GDELT and suggest a need for additional (statistical) tools to address this problem. That said, the high correlations between GDELT’s violent events in Syria and two separate outside measures of violence and dislocation suggest GDELT’s value as a remote sensing tool for certain kinds of violent conflicts.