Zoonotic EID events as response variable

We followed the definition of an emerging infectious disease and an EID event used in ref. 4—specifically, events documented in the scientific literature denoting the first emergence of pathogen in a human population where that pathogen was classified as “emerging” due to recent spillover from an animal reservoir, a significant increase in its incidence or geographic distribution in the human population, a marked change in its pathogenicity or virulence, or other factors. In this study we focus only on EID events of wildlife origin (“wildlife zoonoses”) because these represent the majority of EID events in the most recent decade studied, are increasing significantly as a proportion of all EIDs after correcting for reporting bias, include most of the highest impact EIDs of recent decades (e.g., Ebola viruses, Nipah virus) and almost all recent pandemics (e.g., pandemic influenza viruses, SARS). Data on EID events were derived from an updated version of the database originally used by ref. 4 (Supplementary Data 1), which contained EID events ranging from 1940 to 2004 (n = 335 total, n = 145 for wildlife zoonoses (43.3% of all EIDs)). We updated the database to include EID events for wildlife zoonoses through 2008 (n = 224), following the methodology in ref. 4 so as to include only diseases reported in the peer-reviewed literature, where there is evidence that a disease is emerging for one of the reasons laid out above. In addition, we only included the first emergence of a new disease-causing agent, such that the MERS Coronavirus was included, but not reports of new strains of Ebola virus. For each EID event, data were derived from the literature, if available, for date, location (see below), pathogen genus and species, zoonotic origin and type, and associated or hypothesized drivers, following ref. 4. Location data for initial EID emergence events were variable in their geographic specificity, ranging from precise coordinates to broader regions (e.g., municipalities, counties, districts) or entire continents depending on details reported in the primary literature. A spatial polygon was created for each event that represented the most precise municipal region the EID event was known to have occurred in. All EID event polygons, regardless of precision, were included in our bootstrap resampling framework; removing those with geographic uncertainty (e.g., those with only country-level resolution) may artificially inflate the apparent certainty of our model, and our resampling scheme limits their impact to appropriate levels. Events with precise coordinates were also assigned a polygon for consistency of data format, but rather than using a municipal boundary, the event was assigned a 5 km circular buffer zone. EID polygons were subsampled for model fitting as described below. Because our model matches EID events with decadal population and land use data (described below), we restricted our analyses to decades for which covariate data exist, excluding events before 1970 and leaving n = 147 records for analysis (66% of wildlife zoonosis events).

Explanatory variables

We compiled spatial data layers for 20 predictors in four broad categories to decompose which factors are associated with zoonotic disease emergence. These reflected the most frequently hypothesized drivers of zoonotic disease emergence and included (Table 1): human presence/activity, animals/hosts, the environment, and reporting effort. Explanatory variables came from a variety of data sources, and all were rescaled or transformed to a spatial grid of 1° resolution (WGS84, c. 110 km at the equator) prior to their use in models. Full details of sources, original resolutions and rescaling are presented in Tables 1 and 2.

Table 1 List of predictor layers included in the model Full size table

Table 2 Original resolutions and extents of source data sets Full size table

“Human Activity” data were compiled and eight predictors derived based on the following rationale: (1) Population density likely influences EID risk in two discrete ways. First, as EID events are defined as diseases emerging in the human population, their frequency—before the effects of other predictors—is assumed to be proportional to population density, with the other predictors modifying the per-person risk of EID events. To represent this, we treated human population as a baseline multiplicative factor in our models36. Second, population density may affect transmission dynamics such that EID events in areas of denser population may be more likely to produce outbreaks large enough to be detected37. We used the Global Rural–Urban Mapping Project38 human population data set, which provides gridded estimates of human population every five years for 1970–2000. (2) Population change acts as a proxy for changing demands on ecosystems leading to environmental perturbation, which has been hypothesized to drive disease emergence21. We created a measure for population change by calculating the inter-decadal difference of human population per grid cell. (3) Land-use type represents largely anthropogenic influence on the landscape (as opposed to ‘land cover’ below) and has been hypothesized to play a role in disease emergence and spatial distribution19, 21, 39,40,41. We used the HYDE data set which estimates the percentage of land-use types in each grid cell of a global data set every ten years for 1900–200042 to derive predictors representing percentage of land used for cropland and percentage used for pasture. We also include the layers for Urban Land and Managed/Cultivated Vegetation from the EarthEnv data set, described below under “Environment”, in this category, as they index human impact on the environment. (4) Land-use change has been hypothesized as a key driver for disease emergence by perturbing ecosystems and bringing humans into close proximity with wildlife5, 7, 8, 21, 27. We created metrics of change for pasture and cropland by calculating the between-decade difference in values for each grid cell for cropland and pasture.

For data sets with multiple temporal layers (human population, cropland, and pasture), we included the intersection of available dates in different data sets (decades 1970–2000) and calculated inter-decadal change layers by differencing consecutive decades. All presence and absence samples drawn for each event (see below) were matched to the nearest decadal layers (years ending in 5 were rounded up) and the change layer for the decade they fell in.

“Animal/host” data were represented by two predictors: (1) Mammalian biodiversity. The diversity and prevalence in a host population of potentially zoonotic pathogens in an area is hypothesized to be a key factor in the risk of novel pathogen emergence8, 21, 43. However, spatial data on global pathogen diversity do not currently exist, and it is estimated that we have identified less than 1% of mammalian viral diversity35. Consistent with previous studies, we therefore assume that the number of available pathogens in an area is proportional to the diversity (species richness) of wildlife species4, 5, 35, 44. The overwhelming majority of emerging zoonoses have mammalian hosts45, and global biogeographic patterns of human infectious diseases is highly correlated with global patterns of mammalian diversity30. We therefore used mammal biodiversity (species richness), measured as number of mammal species per grid cell as a proxy for pathogen species richness. To do this, we used the most up to date mammal species distribution maps available, derived from species distribution ranges filtered according to species-specific habitat preferences12. These habitat suitability models reflected species preferences for land cover types, their altitudinal limits, their tolerance to human presence, and their relationship with water bodies. The full-resolution mammal biodiversity data (representing all 5291 terrestrial mammal species)12 was rescaled to the study grid by summing the number of species’ distributions that overlapped each grid cell; (2) Domestic animal density. A number of past EID events with wildlife origin have emerged through farmed or domestic animal intermediate or amplifier hosts (e.g., Hendra and Nipah virus, SARS). In addition, there is growing evidence that the global trend of intensification of livestock production increases the emergence risk of novel wildlife origin zoonoses, e.g., Nipah virus in Malaysia46, influenza viruses, and others6. We used the Gridded Livestock of the World (GLW) data set47, which contains data for poultry, goat, buffalo, cattle, sheep, and pig headcounts. We summed mammals to a single predictor (livestock mammal headcount) and retained poultry as a discrete predictor.

We analyzed eight predictors from two data sets representing “Environmental” variables: (1) Climate. Climatic factors have been repeatedly hypothesized as important in the global biogeography of human infectious diseases, including EIDs30, 48, 49. Climate may influence disease distribution through enhanced suitability for vectors of wildlife origin zoonoses (e.g., West Nile virus), more rapid vector reproduction rates and biting rates, changes in the efficiency or rates of pathogen transmission among hosts and vectors, and changes in the ability of pathogens to persist in the environment, among other factors50, 51. Climate was represented by a single layer in our study, the Global Environmental Stratification52, which uses a quantitative model to stratify the Earth’s surface into zones of similar climate on a single scalar measure, where higher values equate to warmer, wetter (more tropical) regions; (2) Land cover type: Land cover type is associated with the distribution of terrestrial mammals12 and other taxa53, potentially exposing humans present to different assemblages of viral species. It is also likely that the types of contact between wildlife and people vary with land cover type. For land cover, we used the EarthEnv data set54, which divides the Earth’s surface into 12 classes. These include different classes of natural ecosystems, urban land and cultivated vegetation (grouped with “Human Activity” above). We excluded barren areas, open water and snow/ice due to a lack of biologically plausible mechanisms for disease emergence. EarthEnv represents each class as a percentage per grid cell.

Reporting effort

The distribution of reported EID events is likely strongly influenced by an inconsistent spatial distribution of detection and reporting of disease outbreaks. Previous studies have used proxies of reporting effort such as the interpolated locations of known sampling sites (“sampling effort”)55; frequency of countries of residence for all authors of all articles in the Journal of Infectious Disease (“reporting effort”)4; and PubMed searches for keywords for each country (“reporting bias”)25. Other studies have used occurrence records for a similar class of observations as a surrogate for background sampling effort; for example, in ecology, modeling the distribution of a particular species and utilizing occurrence records from multiple other species to represent background samples56.

We adapted these approaches by deriving an index for reporting effort based on the spatial distribution of toponyms (place names) in peer-reviewed biomedical literature. We wrote a Python package, PubCrawler (see Supplementary Methods for full details), to search the full text of each of the 1,266,085 (as of April 2016) articles in the PubMed Central Open-Access Subset (PMCOAS)57 for toponyms from the GeoNames database58, which includes data on population (if appropriate), country, and geographical coordinates for each toponym. PubCrawler uses a set of heuristics, based on textual and geographic features of the identified toponyms, to minimize the number of false positives and select amongst ambiguous matches. We selected articles matching terms from the Human Disease Ontology59 and exported extracted toponyms. After excluding a further round of potentially spurious matches, place name matches were assigned a weight, normalized by article, and then summed to the study grid. To impute missing data (resulting in a number of zero-value grid cells) and smooth noise in the raw output, we fit a Poisson boosted regression tree model (using human population, accessibility, urbanized land, DALY rates, health expenditure, and GDP as predictors), and used this to represent reporting effort in our model. This approach produced a layer that adequately represented the underlying data while achieving a similar coverage of grid cells to other layers.

Statistical framework

We used boosted regression trees (BRT) to model EID occurrence26, 48, 60 and to determine how conditions varied between locations where EID events have been observed compared to areas where they have not. BRTs handle non-linear relationships and higher order interactions among many variables more robustly than many other modeling methods, and are robust to monotonic transformations of data26, 60. They fit potentially complex, non-linear relationships by aggregating the predictions of multiple simpler models, and are trained iteratively on random partitions of the data26, 60. In addition, predictive accuracy of BRTs, as determined by common validation methodologies (e.g., Area Under the Curve of the Receiver-Operator Characteristic (AUC of the ROC), True Skill Statistic (TSS)), frequently exceeds conventional linear methods26. Unlike conventional models, they do not produce confidence intervals or p-values.

Resampling regimes

We employed various resampling techniques to incorporate our measure of reporting effort56, 61, estimate the predictive power of our models, account for spatial uncertainty in EID events15, and generate empirical confidence intervals for effects representing both sampling uncertainty and spatial uncertainty62. Each time an event was sampled, one presence point and one absence point were drawn (artificially fixing overall prevalence at 0.5)15. The presence point was from the grid cells overlapped by that event’s polygon, and the absence point from all grid cells; both were weighted by reporting effort (the effect of weighting presence points by reporting effort made little difference for points with small, precisely specified occurrence polygons, and for events with high uncertainty it acted as a prior, specifying that, in the absence of other knowledge, the event was more likely detected where reporting effort was higher).

All replicate BRT models were fit using the R packages dismo and gbm26. The function gbm.step() was called with the parameters tree.complexity = 3 (governing interaction depth), learning.rate = 0.0035 (setting the “shrinkage” applied to individual trees), and n.trees = 35 (governing the initial number of trees fit, as well as the “step size” or number added at each step of the stagewise fitting process)26. These values were selected through an iterative process, starting with the default parameters, adding tree complexity, and tuning the shrinkage and step size parameters to achieve successful gradient descent consistently across resampling runs, following refs. 26, 62. With the final parameters, the BRTs composing the bootstrap model fit a mean of 1005 trees.

Our main model used a bootstrap resampling regime, which was used to fit 1000 replicate models. For each model, 147 events were drawn randomly with replacement from the set the 147 EID events of interest, and for each selected event, 1 presence and 1 absence value were drawn as described above. The fitted models were used to generate Relative Influence box plots and Partial Dependence plots with empirical 90% confidence intervals. The mean of the predictions of these models were used to generate all maps.

To compute validation statistics (described below), we conducted 100 rounds of 10-fold cross-validation15, 62. In each round, a single presence and absence sample were drawn for each event, which were assigned randomly to ten groups. Each group in turn was held out, and a model was trained on the remaining groups’ samples. The model’s predictions for the presence and absences samples of the held-out group were used to construct confusion matrices, and calculate the AUC and TSS. This process was repeated 100 times, and the median, 0.05 and 0.95 quantiles for all scores were reported.

Factoring reporting bias out

We assumed that the distribution of observed EID events was conditional on the distribution of reporting effort across the globe following56. We fit our main, “weighted” model with grid cells sampled relative to reporting effort. The model thus produced a response relative to reporting effort (Fig. 4). We multiplied this response by the value of reporting effort in each grid cell to map the index of observed EID event risk (Fig. 3a).

We produced the estimate of the risk index after factoring out reporting bias (Fig. 3b) as follows. We assumed that the optimal distribution of reporting effort for human disease events in a location is proportional to the distribution of the human population. In reality, other unmeasured factors likely affect this. However, given this assumption, we can define reporting bias as proportional to the ratio of reporting effort to the human population (Fig. 4)

Fig. 4 Heat map of weighted model response, i.e., EID risk relative to reporting effort. Value indicates the binomial probability that a grid cell sampled at that location will contain an EID event as opposed to a background sample, when drawing equal numbers of absence and background samples weighted by reporting effort (see Methods section). This layer was weighted by reporting effort to produce the “observed” EID risk index map (Fig. 3a) and by population to produce the risk index map with bias factored out (Fig. 3b) Full size image

.

$${\rm{Reporting}}\,{\rm{bias}} \propto \frac{{{\rm{Reporting}}\,{\rm{effort}}}}{{{\rm{Population}}}}$$

When bias is known, it is possible to estimate the true distribution of a phenomenon by “factoring bias out”56. In ecological studies, this generally means dividing by the measured “survey effort”, assuming that the optimal distribution of search effort is uniform across the landscape.

$${\rm{True}}\,{\rm{risk}}\,{\rm{index}}\, \propto \,\frac{{{\rm{Observed}}\,{\rm{risk}}\,{\rm{index}}}}{{{\rm{Reporting}}\,{\rm{bias}}}}$$

We posit that, in the case of human disease events, uniform search effort across a landscape is also suboptimal, and that it is safer to assume optimal reporting effort distribution would be proportional to the human population. In this case, we remove “bias” by factoring out measured reporting effort and factoring in assumed optimal effort, and obtain a hypothetical map of the true event risk index, thus:

$${\rm{True}}\,{\rm{risk}}\,{\rm{index}}\, \propto \,{\rm{Observed}}\,{\rm{risk}}\,{\rm{index}}\, \times \,\frac{{{\rm{Human}}\,{\rm{population}}}}{{{\rm{Reporting}}\,{\rm{effort}}}}$$

Model validation and performance

We used multiple tools for model validation and performance. For our bootstrap model, we calculated deviance explained using the gbm.step() function26 and also derived median and empirical 90% CIs by taking the 0.05, 0.5, and 0.95 quantiles of those values for the replicate models. Since this model is fit relative to reporting effort, percentage deviance explained is calculated relative to that variable. For the ten-fold cross-validation runs, we calculated the AUC, a threshold-independent measure of model predictive performance that is commonly used as a validation metric in species distribution modelling63. The AUC can be interpreted as “the probability that the model will rank a randomly chosen presence site higher than a randomly chosen absence site”64, or more accurately in our application, a measure of a model’s performance to discriminate EID events from random points56. Because the use of AUC has been criticized for its lack of sensitivity to absolute predicted probability and its inclusion of a priori untenable prediction thresholds13, we also calculated the True Skill Statistic (TSS)15.

Because all test statistics and figures from our main model are relative to the reporting effort measure, we also ran “unweighted” models. We expected these would score yield higher cross-validation scores, since we expected that reporting effort would be correlated both with some important predictor variables and the outcome, and weighting background samples uniformly rather than according to this variable would present a clearer contrast. To avoid bias from land area in the WGS84 grid cells, we additionally weighted our “unweighted models” by land area per grid cell. The figures from these models are presented fully in Supplementary Information.

Code availability

All data and code used to generate the models are available on GitHub (doi: 10.5281/zenodo.400978)65, as is the code used to generate the reporting effort layer (doi: 10.5281/zenodo.400977)66.

Data availability

The data sets analyzed during this study are included in this published article and its Supplementary Information Files, with the exception of EID Event shape files, which are available from the corresponding author on reasonable request.