A new year is about to start, and that means it’s time for me to update my coup forecasts (see here and here for the 2013 and 2012 editions, respectively). The forecasts themselves aren’t quite ready yet—I need to wait until mid-January for updates from Freedom House to arrive—but I am making some changes to my forecasting process that I thought I would go ahead and describe now, because the thinking behind them illustrates some important dilemmas and new opportunities for predictions of many kinds of political events.

When it comes time to build a predictive statistical model of some rare political event, it’s usually not the model specification that gives me headaches. For many events of interest, I think we now have a pretty good understanding of which methods and variables are likely to produce more accurate forecasts.

Instead, it’s the data, or really the lack thereof, that sets me to pulling my hair out. As I discussed in a recent post, things we’d like to include in our models fall into a few general classes in this regard:

No data exist (fuggeddaboudit)

Data exist for some historical period, but they aren’t updated (“HA-ha!”)

Data exist and are updated, but they are patchy and not missing at random (so long, some countries)

Data exist and are updated, but not until many months or even years later (Spinning Pinwheel of Death)

In the past, I’ve set aside measures that fall into the first three of those sets but gone ahead and used some from the fourth, if I thought the feature was important enough. To generate forecasts before the original sources updated, I either a) pulled forward the last observed value for each case (if the measure was slow-changing, like a country’s infant mortality rate) or b) hand-coded my own updates (if the measure was liable to change from year to year, like a country’s political regime type).
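Option (a), pulling forward the last observed value, is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the code I actually use; the function name `carry_forward` and the infant mortality numbers are made up for the example.

```python
def carry_forward(series, target_year):
    """Return the most recent non-missing value at or before target_year.

    `series` maps year -> value, with None marking years the source
    has not published yet. Returns None if nothing has been observed.
    """
    observed = {
        year: value
        for year, value in series.items()
        if value is not None and year <= target_year
    }
    if not observed:
        return None
    return observed[max(observed)]


# Hypothetical infant mortality series for one country (made-up values).
infant_mortality = {2010: 41.2, 2011: 40.1, 2012: None, 2013: None}

# The source stops at 2011, so the 2013 value is the 2011 observation.
value_for_2013 = carry_forward(infant_mortality, 2013)
```

This works tolerably for slow-moving measures precisely because the carried-forward value is unlikely to be far from the truth; for something like regime type, the same trick can be badly wrong.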

Now, though, I’ve decided to get out of the “artisanal updating” business, too, for all but the most obvious and uncontroversial things, like which countries recently joined the WTO or held national elections. I’m quitting this business, in part, because it takes a lot of time and the results may be pretty noisy. More important, though, I’m also quitting because it’s not so necessary any more, thanks to timelier updates from some data providers and the arrival of some valuable new data sets.

This commitment to more efficient updating has led me to adopt the following rules of thumb for my 2014 forecasting work:

For structural features that don’t change much from year to year (e.g., population size or infant mortality), include the feature and use the last observed value.

For variables that can change from year to year in hard-to-predict ways, only include them if the data source is updated in near-real time or, if it’s updated annually, if those updates are delivered within the first few weeks of the new year.

In all cases, only use data that are publicly available, to facilitate replication and to encourage more data sharing.
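Those three rules amount to a small decision procedure, which could be sketched roughly as follows. Everything named here is an assumption made for illustration: the `Feature` fields, the 31-day cutoff standing in for "the first few weeks of the new year," and the lag values in the examples are rough placeholders, not measured figures.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    slow_changing: bool   # structural feature that barely moves year to year
    public: bool          # data are publicly available
    update_lag_days: int  # days into the new year before last year's values appear


# Assumed stand-in for "within the first few weeks of the new year".
MAX_LAG_DAYS = 31


def decide(feature, max_lag_days=MAX_LAG_DAYS):
    """Apply the three rules of thumb, in order of precedence."""
    if not feature.public:
        return "exclude"
    if feature.slow_changing:
        return "include: carry forward last observed value"
    if feature.update_lag_days <= max_lag_days:
        return "include: use fresh update"
    return "exclude"


# Illustrative inputs only; the lags are guesses, not documented release dates.
examples = [
    Feature("infant mortality", slow_changing=True, public=True, update_lag_days=200),
    Feature("regime type, prompt source", slow_changing=False, public=True, update_lag_days=15),
    Feature("regime type, late source", slow_changing=False, public=True, update_lag_days=150),
]
decisions = {f.name: decide(f) for f in examples}
```

The ordering matters: public availability is checked first because it applies "in all cases," and the timeliness test only kicks in for features that can change in hard-to-predict ways.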

And here are some of the results of applying those rules of thumb to the list of features I’d like to include in my coup forecasting models for 2014.

Use Powell and Thyne’s list of coup events instead of Monty Marshall’s. Powell and Thyne’s list is updated throughout the year as events occur, whereas the publicly available version of Marshall’s list is only updated annually, several months after the start of the year. That wouldn’t matter so much if coups were only the dependent variable, but recent coup activity is also an important predictor, so I need the last year’s updates ASAP.

Use Freedom House’s Freedom in the World (FIW) data instead of Polity IV to measure countries’ political regime type. Polity IV offers more granular measures of political regime type than Freedom in the World, but Polity updates aren’t posted until spring or summer of the following year, usually more than a third of the way into my annual forecasting window.

Use IMF data on economic growth instead of the World Bank’s. The Bank now updates its World Development Indicators a couple of times a year, and there’s a great R package that makes it easy to download the bits you need. That’s wonderful for slow-changing structural features, but it still doesn’t get me data on economic performance as fast as I’d like. I work around that problem by using the IMF’s World Economic Outlook Database, which includes projections for years for which observed data aren’t yet available and forecasts for several years into the future.

Last but not least, use GDELT instead of UCDP/PRIO or Major Episodes of Political Violence (MEPV) to measure civil conflict. Knowing which countries have had civil unrest or violence in the recent past can help us predict coup attempts, but the major publicly available measures of these things are only updated well into the year. GDELT now represents a nice alternative. It covers the whole world, measures lots of different forms of political cooperation and conflict, and is updated daily, so country-year updates are available on January 2. GDELT’s period of observation starts in 1979, so it’s still a stretch to use it in models of super-rare events like mass-killing onsets, where the number of available examples since 1979 on which to train is still relatively small. For less-rare events like coup attempts, though, starting the analysis around 1980 is no problem. (Just don’t forget to normalize the event counts!) With some help from John Beieler, I’m already experimenting with adding annual GDELT summaries to my coup forecasting process, and I’m finding that they do improve the model’s out-of-sample predictive power.
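On that normalization point: raw GDELT tallies aren't comparable across years because the overall volume of recorded events has grown over time. One common fix, sketched below, is to divide each country-year's count of the events you care about by that country-year's count of all events. This is an assumption about how one might do it rather than a description of my exact procedure, and the country codes and counts are invented.

```python
def normalize_counts(event_counts, total_counts):
    """Convert raw event tallies to shares of all recorded events.

    Both arguments map (country, year) -> count. Country-years with no
    recorded events at all are skipped rather than treated as zero.
    """
    shares = {}
    for key, total in total_counts.items():
        if total > 0:
            shares[key] = event_counts.get(key, 0) / total
    return shares


# Made-up tallies: conflict events rise tenfold, but so does total coverage.
conflict_events = {("AAA", 1980): 10, ("AAA", 2013): 100}
all_events = {("AAA", 1980): 100, ("AAA", 2013): 1000, ("BBB", 2013): 0}

conflict_share = normalize_counts(conflict_events, all_events)
```

In this toy case the raw count jumps from 10 to 100 while the share stays at one in ten, which is exactly the kind of spurious trend that normalization is meant to remove.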

In all of the forecasting work I do, my long-term goals are 1) to make the forecasts more dynamic by updating them more frequently (e.g., monthly, weekly, or even daily instead of yearly) and 2) to automate that updating process as much as possible. The changes I’m making to my coup forecasting process for 2014 don’t directly accomplish either of these things, but they do take me a few steps in both directions. For example, once GDELT is in the mix, it’s possible to start thinking about how to switch to monthly or even daily updates that rely on a sliding window of recent GDELT tallies. And once I’ve got a coup data set that updates in near-real time, I can imagine pinging that source each day to update the counts of coup attempts in the past several years. I’m still not where I’d like to be, but I think I’m finally stepping onto a path that can carry me there.
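To make the sliding-window idea a little more concrete, here is a minimal sketch of a trailing count that could be recomputed each day from a list of event dates. The function name, the window lengths, and the dates are all hypothetical; the same logic would apply whether the events are coup attempts or GDELT records.

```python
from datetime import date, timedelta


def trailing_count(event_dates, as_of, window_days):
    """Count events in the window (as_of - window_days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in event_dates if start < d <= as_of)


# Invented event dates for a single country.
events = [date(2013, 1, 10), date(2013, 11, 20), date(2013, 12, 30)]

today = date(2013, 12, 31)
last_90_days = trailing_count(events, today, 90)
last_365_days = trailing_count(events, today, 365)
```

Rerunning this with a new `as_of` date each day would give a predictor that updates as fast as the underlying event list does.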