Materials and Methods Sixty-six reports of laboratory confirmed, geolocated infection with Y. pestis in sylvan and domestic animal hosts were collected by the International Society of Infectious Diseases through the ProMED electronic surveillance system between January 1, 2000 and August 31, 2015 (http://www.promedmail.org/). Geographic coordinates for each case were obtained using Google Maps. These data, including the ProMED archive number for each report, are available via Figshare (http://dx.doi.org/10.6084/m9.figshare.1538604). For inclusion in this study, the animal cases were required to have been reported with a spatial resolution of 1 square kilometer or finer, meaning that cases reported at >1 km2 were of too coarse a resolution for consideration. In addition to the infections in animal hosts, there were 48 laboratory-confirmed plague cases in humans over the same period. However, these reports are generally only reported at the county level to protect patient privacy. Moreover, as many western US counties are large in area, the spatial resolution available for analysis using these human cases would be quite coarse, so human cases were not included in these analyses. The WorldClim—Global Climate (2014) database was the source of all climate date used in this investigation. Annual mean temperature, annual mean precipitation, mean temperature for the hottest and coldest quarters, mean precipitation for the wettest and driest quarters, and isothermality from 1950 to 2000 were each extracted as 30 arc second resolution rasters (Hijmans et al., 2005). Altitude was also extracted from this database at the same resolution. Each pixel in these rasters represents the value of the measurement for that approximately 1 km2 area on the Earth’s surface. A raster of the distribution of 9 unique land cover types was obtained from the Food and Agriculture Organization of the United Nations’ Global Land Cover—SHARE database (Latham et al., 2014). The resolution of this raster was 1 km2. The land cover types represented in this raster were as follows: bare soil, cropland, grassland, shrubland, sparse landscape, snow cover, tree cover, water, and artificial surface. Each of the 9 land cover types was extracted from this raster and the distance between each pixel and the nearest pixel of each unique land cover type was calculated to create a new distance raster, of the same extent and resolution, for each land cover type. The distance calculations were performed in the QGIS geographic information system (http://www.qgis.org/) using the proximity function for calculating raster distances. The result of this process was a separate raster of 1 km2 resolution for each of the 9 land cover types described above wherein the value of each pixel is the distance in meters between that pixel and the nearest pixel wherein the particular land cover feature is present. This allows for modelling a spectrum of proximity to different land cover types in a given space rather than the more crude approach of simply designating the space as present or absent for specific land cover (see modelling description below). The Global Biodiversity Information Facility (GBIF) was used to identify the geographic distribution of one of the most important reservoir species of Y. pestis in the enzootic transmission cycle, Peromyscus maniculatus (http://www.gbif.org/). This database contained 94,983 field records of documented and geolocated P. maniculatus individuals within the spatial extent of latitude 49.38436°N, 25.83738°N, and longitude 88.81702°W, 124.7631°W. This extent delineates the boundaries of the western US within which all previous plague occurrences, both animal and human, have been documented. This field sample of 94,983 individual deer mice was used to create a raster of the predicted occurrence of the reservoir across the western US using a Maxent model (see description of statistical methods below). Statistical analysis A maximum entropy machine learning approach was employed to classify and map the ecologic niche of plague infection in sylvan and domestic animal hosts. Because the aim of this exercise was to predict the probability of animal plague occurrence across geographic space, machine learning is an attractive analytic approach. Machine learning does not require the assumptions of a specific model form. Algorithms are constructed to form decision trees, which partition a data space based on rules that optimally identify homogeneity among predictors and a response (i.e., the presence of plague infected animals) (Elith, Leathwick Hastie, 2008). The analytic structure of these decisions trees is appealing because they are robust to outliers, predictors may be of any form, and their hierarchical structure inherently models interactions between the predictors. The Maxent machine learning algorithm, as implemented in the dismo package in R (Hijmans et al., 2014), was used to predict plague in animal hosts in the western United States. More specifically, the Maxent algorithm was used to predict the probability that the biotic and abiotic conditions in a given location were suitable for the presence of animal plague. This algorithm is based on a maximum entropy probability distribution, which holds that in the absence of boundaries to species’ dispersal, the distribution of those species will approach uniformity. This algorithm is particularly appropriate for presence-only animal plague data because it does not require the unknown information associated with animal plague absences, which were not available (Phillips, Anderson & Schapire, 2006; Franklin, 2010). Documented species occurrences are interpreted as presence points, while those points without documented occurrences are treated as the background environment. In the Maxent model employed here, presence points are represented by the locations of identified animal plague cases. As described above, the geographic extent of this analysis was latitude 49.38436°N, 25.83738°N, and longitude 88.81702°W, 124.7631°W. Maxent models consistently compare favorably with other approaches to machine learning, such as random forests or support vector machines (Duan et al., 2014). The current analysis also compared the performance of Maxent to predictive models derived from random forests and support vector machines to determine if superior performance could be replicated in our study of animal plague. For each 1 km2 geographic area within the spatial extent of the western US, the Maxent model was used (1) to predict the probability of habitat suitability of P. maniculatus, and (2) to predict the probability of epizootic plague. The first model, targeting P. maniculatus, included mean annual temperature, mean temperature during the coldest quarter, isothermality, mean precipitation during the driest quarter, mean precipitation during the wettest quarter, distances to water, grassland, shrubland, trees, and artificial surface as predictors of habitat suitability. The regularization parameter was set to 1.0, to balance between overly localized, overfit model predictions and overly generalized, broadly fit model predictions. This output raster described the ecologic niche of P. maniculatus at a resolution of 1 km2. Five-fold cross-validation was applied to the Maxent model, wherein the model was fit by first dividing the training sets into k = 5 subsets, and then cross-validating by iteratively fitting the model to 4 of the combined subsets and testing against a 5th. This was repeated such that each of the k = 5 subsets was used as a test set during one iteration. The final cross-validated model was then evaluated against a test dataset. The 5 subsets were comprised of 12,000 deer mouse observations each, with these being randomly selected from the overall pool of training observations (n = 60,000, or ∼2/3 of the total 94,983 deer mouse observations; see further description below). As a further evaluation of prediction error, the data were split into two subsets prior to beginning the analyses. The first subset was a training set (n = 60,000), which was used to fit (or “train”) the models, while the second subset (n = 34,983) was left out of the model fitting process altogether and then used to test against the model predictions. The difference in model predictions based on training and testing data provided an assessment of the model’s prediction error. The use of separate training and testing datasets reduces the typically high prediction error that attends overfitting of the data when all available data are used to train the model. In this analysis, the data were partitioned into 2/3 and 1/3 of the total observations for the training and test sets, respectively, and the area under the curve (AUC), reported as a percentage, was calculated to assess prediction error. The relative contribution is also reported for each predictor in the Maxent model. The relative contribution refers to the reduction of a loss function, which is referred to as the gain and is analogous to the residual deviance from generalized linear models, attributable to each predictor variable. The output raster of the predicted probability of P. maniculatus presence was then used as a predictor of epizootic plague. This subsequent Maxent model of epizootic plague included mean annual temperature, mean temperature during the coldest quarter, mean precipitation during the driest quarter, mean precipitation during the wettest quarter, distances to grassland, shrubland, cropland, sparse vegetation, bare soil, and artificial surface, and the predicted probability of P. maniculatus as predictors. The same 5-fold cross validation process used to model P. maniculatus probability described above was used to model epizootic plague. The animal plague data (n = 66) were also partitioned into a training set comprising 2/3 of the whole (n = 44) and a testing set comprising 1/3 of the whole (n = 22). As above, the AUC was reported to assess the prediction error for the epizootic plague model and the relative contribution was reported for each predictor variable in the model. Variable response curves were also examined to more clearly elucidate specific relationships between individual variables and predicted animal plague occurrence. Model predictions of epizootic plague probability were converted to a binary score to designate presence versus absence across a range of predicted plague occurrence. Four thresholds were thus created at 25%, 50%, 75%, and 90% predicted probability. These thresholds were used to create four separate binary rasters designating each 1 km2 space as either present or absent for epizootic plague.

Results The distribution of the Y. pestis infected animals as reported by the ProMED electronic surveillance system of the ISID is presented in Fig. 1. As expected, the distribution is limited to the western United States, but with high concentrations of reported infected animals in Colorado, New Mexico, and southern California. Central South Dakota marked the eastern boundary of reported animal plague during the 15 year surveillance period, while north-central Montana delineated the northern boundary in close proximity to the Canadian border (Hijmans et al., 2005). Figure 1: The distribution of the 66 laboratory-confirmed animal plague cases identified through the ProMED system between January 1, 2000 and August 31, 2015 in the United States. Map data: Google, ©2015 TerraMetrics. Figure 2: Rasters of model predictors. The predicted distribution of Peromyscus maniculatus (deer mouse), altitude, mean precipitation during the driest and wettest quarters, distances to land cover types, and mean annual temperature with the distribution of the 66 animal plague cases overlaid (dots). Maps of the distribution of the environmental variables with the locations of infected animals overlaid are presented in Fig. 2. Some of the environmental features clearly demonstrate greater heterogeneity than others. For example, precipitation, in both the driest and wettest quarters, and temperature were unevenly distributed across the western states, as was proximity to artificial surface, grassland, and bare soil land cover types. The map of predicted P. maniculatus in the first panel depicts the probability surface of this enzootic species across the western US. The Maxent model used to predict the species distribution identified mean annual temperature, isothermality, and distances to artificial surfaces, grassland, and shrubland as the most influential with 34.7%, 22.6%, 14.9%,9%, and 8.9% relative contribution to the loss function, respectively. The AUC for the P. maniculatus species distribution model was 76% indicating acceptable prediction error against test data. Figure 3: Predicted probability of epizootic animal plague. These risk surfaces are based on the ecologic niche of animal plague as derived from the Maxent model. The ecologic niche of epizootic plague as predicted by the Maxent model is presented in Figs. 3 and 4. The former delineates a continuous risk surface as the probability of animal plague occurrence, while the latter figure depicts animal plague presence versus absence based on four thresholds of 25%, 50%, 75%, and 90% predicted probability of animal plague occurrence. In all maps counties are included to provide a frame of reference within specific local municipalities. These maps highlight a pattern of occurrence, albeit somewhat discontinuous, predicted along the eastern edge of the front range of the American Rocky Mountains that extends from northwestern Montana down to north-central New Mexico. Additional areas of high occurrence include southwestern and northeastern California, southwestern South Dakota, central Arizona, as well as more localized areas in Oregon, Washington, southern Idaho, and northern Utah. As the threshold for designating animal plague presence increased from 25% to 90%, the range of predicted presence narrowed until it was limited primarily to the front range within the state of Colorado, with a few additional highly localized areas in Oregon, California, and New Mexico. In the Maxent model used to predict the ecologic niche described in these maps, predicted presence of P.maniculatus, altitude, distance to artificial surfaces, mean precipitation during the wettest quarter, and mean precipitation during the driest quarter were the most influential environmental variables contributing 50%, 17.2%, 8.3%, 7.2%, and 7%, respectively, to the loss function. Additionally, mean annual temperature and distances to cropland, bare soil, and sparse vegetation modestly contributed between 2%–3%, while temperature during the coldest quarter and distances to grassland and shrubland all contributed less than 1%. The AUC derived from this Maxent model was86%, with confidence bands based on k-fold cross-validation replicates between 85.9% and 86.1%, indicating acceptable prediction error against the test data. Moreover, the Maxent model performed better than either a random forests model (AUC = 83%) or a support vector machine model (AUC = 70%). Figure 4: Predicted presence of animal plague based on thresholds of 25%, 50%, 75%, and 90% probability of animal plague occurrence derived from the Maxent model. The variable response curves highlighted some interesting landscape features of habitat suitability for animal plague (Fig. S1). The relationship between altitude and predicted animal plague probability demonstrated a threshold at approximately 2,000 m, below which plague risk increased linearly from sea level and after which no further increase in risk was observed. Thresholds were also noted for precipitation. Animal plague probability increased modestly with increasing mean precipitation during the driest quarter up to 50 mm, while falling precipitously after this threshold until reaching 0% probability of plague by 100 mm. Conversely, mean precipitation during the wettest quarter was associated with a sharp increase in predicted probability from 0 mm to 100 mm of precipitation, while steadily falling back down to zero between 100 mm and 1,200 mm of precipitation. The sensitivity analysis, presented in Fig. S2, shows the probability of predicted plague occurrence at the 5 km2 predictor resolution to be very similar to that of the 1 km2 predictor resolution, including the same abiotic and biotic predictors in both models. Moreover, the AUC for this lower resolution model was 88%. Therefore, this model of plague appears to be robust to scale.

Conclusions This investigation predicted animal plague occurrence across the western US based on reported occurrences of plague in sylvan and domestic animal hosts. The distribution of P. maniculatus, presumed to be an important species in the enzootic cycle of plague, altitude, precipitation during both the driest and wettest quarters, and proximity to developed land parcels were identified as important features of habitat suitability for animal plague, and may demarcate potential areas of zoonotic transmission to humans.