No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Database

To construct the mammal–virus association database we initially extracted all viruses listed as occurring in any mammal from the International Committee on Taxonomy of Viruses database (ICTVdb), and further individually went through each virus listed in the ICTV 8th edition master list and searched the literature for mammalian hosts. All viral species names were synonymized to ICTV 8th edition, which was the global authority on viral taxonomy at the start of our data collection in 2010 (ref. 16). From 2010–15 the authors and a team of research assistants and interns at EcoHealth Alliance compiled mammal species associations for each of 586 unique viruses published in the literature between 1940–2015 initially by using the virus name and synonyms as the search keywords in the major online reference databases (Web of Science, PubMed, and Google Scholar) in addition to searching in books, reviews, and literature cited in sources we had already obtained. To narrow the search for hosts for well-researched viruses, we additionally included the terms ‘host(s)’, ‘reservoir’, ‘wildlife’, ‘animals’, ‘surveillance’, and other relevant terms to find publications related to host range. Associations were cross-checked for completeness with the Global Mammal Parasite Database for primate, carnivore and ungulate viruses, version as of Nov 2006 (GMPD, http://www.mammalparasites.org)29 and other published reviews specific to bats and rodents12,30,31. We excluded all records without species-level host information, and those where we could not track down the primary references. Records of mammal–virus associations from experimental infection studies, zoological parks or captive breeding facilities, or cell culture discoveries were excluded. Host species were defined as domestic or wild following the list of domestic animal species from the Food and Agriculture Organization (FAO)32, and we removed the black rat (Rattus rattus) and domestic mouse (Mus musculus) from the domesticated list as these two species make up their own ‘peri-domestic’ category. Host species were categorized as either occurring in human modified habitats or being hunted by humans—both estimates for human contact—according to the IUCN Red List species descriptions33.

To control for the fact that some detection methods are more reliable than others in identifying the pathogen of interest, we recorded the detection method used for each host–virus association and scored these as 0, 1, or 2 according to the reliability of detection method used. Viral isolation and PCR detection with sequence confirmation were scored as a 2 (=stringent data); and serological methods were scored as a 0 or 1, with viral or serum neutralization tests (=1), and enzyme-linked immunoassays (ELISA), antigen detection assays, and other serological assays scored as (=0). ‘Stringent data’ were analysed separately to remove potential uncertainty owing to cross-reactivity with related viruses. We exhaustively searched the literature to identify a stringent detection for each mammal–virus pair, and only included the serological finding for that pair if no molecular or viral isolation studies were available. We partitioned data and conducted separate analyses for the entire data set (0 + 1 + 2 detection quality) and the stringent data (score of 2) to reduce the noise from potential serological cross-reactivity. Full list of host–virus associations, detection methods, and associated references are provided in our data and code repository at http://doi.org/10.5281/zenodo.596810.

Our operational definition of a zoonotic virus includes any virus that was detected in humans and at least one other mammalian host in at least one primary publication, and does not imply directionality. Our complete dataset of mammalian viral associations demonstrates evidence of past or current viral infection which we believe is a reasonable proxy for measuring spillover, and our stringent dataset specifically is more robust to exclude species that may have been exposed to a given virus versus those that show some evidence for replication within the host species. Our bi-directional definition of spillover follows a proposal by the WHO that defines a zoonosis as “any disease or infection that is naturally transmissible from vertebrate animals to humans and vice-versa” (http://www.who.int/zoonoses/en/) and excludes any human pathogens that recently evolved from nonhuman pathogens (for example, HIV in primates), as per Woolhouse and Gowtage-Sequeria (2005) (ref. 1).

In order to address influence of transmission from humans to wildlife in our models, we also ran our GAM model fitting and selection procedure (see below) on a subset of data that excluded any probable ‘reverse zoonotic’ viruses. We first searched our entire dataset and removed any clear instances of transmission from humans to primates, for example, including records from zoological parks and wildlife rehabilitation centres (as previously noted). We then additionally removed several human viruses most commonly transmitted from humans back to non-human primates to create a subset of data without the most common reverse zoonotic viruses (adeno-associated virus-2; human adenovirus D; human herpesvirus 4; human metapneumovirus; human respiratory syncytial virus; measles virus; mumps virus)34,35. We present these additional analyses excluding reverse zoonoses and associated code at http://doi.org/10.5281/zenodo.596810.

Total viral richness was calculated as the number of unique ICTV-recognized viruses found in a given host species, and zoonotic viral richness was defined as the number of unique ICTV-recognized viruses in a given host species that were also detected in humans in our database.

To assess research bias for both host and virus, we searched ISI Web of Knowledge, including Web of Science and Zoological Record, and PubMed for the number of research publications for a given host or pathogen. We recorded two values for the number of research papers for a host. The first was a simple search by scientific binomial in Zoological Abstracts where we recorded the number of papers published between 1940–2013 for each host species. We also recorded the number of disease-related publications for each species using the scientific binomial AND topic keyword: disease* OR virus* OR pathogen* OR parasit*. The * operator was used in our search criteria to capture all words that begin with each term, for example, ‘parasit*’ would return hits for ‘parasite’, ‘parasites’, and ‘parasitic’. These search criteria broadly included papers that examined disease or diseases, virus or viruses, pathogen or pathogens, parasite parasites, or parasitology, for each species. Only one measure of per-host research effort was included at a time in model selection. As these metrics are highly correlated and the number of disease related citations per host outperformed the total number of publications per host in all but one model (all-data zoonoses), we decided to use disease-related publications as our per-species research effort measure for all models to improve interpretability. We also recorded the number of publications for each of 586 virus species using a keyword search by virus name in PubMed and Web of Science. Only one measure of per virus research effort was included at a time in model selection.

We used a phylogenetically corrected measure of body mass (see details below under ‘Phylogenetic signal’) as our main life history predictor variable, because it was the only one for which a nearly complete dataset existed for the species in our dataset. We used the body mass recorded in the PanTHERIA database36 for 709 species. For 3 species, we used the second choice option, body mass recorded in the AnAge database37. For 11 species, we used the third choice option of the extrapolated body mass recorded in PanTHERIA, which is based on body length or forearm length, depending on species. For 36 species, we used the average body mass for members of the genus that had a recorded body mass. We explored other life-history variables related to longevity38, reproductive success, and basal metabolic rate but these were ultimately excluded owing to the high number of missing records.

Phylogenetic signal

We address the issue of non-independence of host species traits owing to shared ancestry39 in our analyses by first quantifying the phylogenetic signal for each variable in our model using Blomberg’s K40. Blomberg’s K measures phylogenetic signal in a given trait by quantifying trait variance relative to an expectation under a Brownian motion null model of evolution using a phylogenetic tree with varying branch lengths. Blomberg’s K-values are scaled from 0 to infinity, with a value of 0 equal to no phylogenetic signal and values greater than 1 equal to strong phylogenetic signal for closely related species that share more similar trait values. While there is no clearly defined K value cut-off in which to apply phylogenetic comparative methods, non-significant value of <1, or more conservatively <0.5, are typical for traits that are phylogenetically independent. The only host variables we examined with significant K values >0.5 were host body mass, and our direct measure of phylogenetic distance to humans. While there are several tools available to control for phylogeny in multivariate analyses, for example, using phylogenetic generalized least square models (for example, PGLS)41, there is currently no modelling approach to control for phylogeny using GAMs. More importantly, a wholesale effort to control for phylogeny across all variables in our analysis was not appropriate here, as we are explicitly testing the relative importance of phylogenetic distance to humans versus other host traits including measures of human–wildlife contact to predict the proportion of zoonotic viruses for a given host species. This left body mass as the only variable in our models, excluding our direct measures of phylogenetic distance, with a significant Blomberg K value that was greater than 1. We controlled for the significant effect of shared evolutionary history using a phylogenetic eigenvector regression (PVR)42,43 on body mass. The PVR approach allowed us to remove phylogenetic signal for any phylogenetically non-independent variables and then include the corrected values back in our GAMs, while retaining predictor variables like phylogenetic distance to humans as unmodified. We calculated PVR for body mass using the R package PVR and our custom-build maximum likelihood host phylogeny using cytochrome b sequences constrained to the order-level topology of the mammalian supertree28,44. Our new variable for body mass that controls for phylogenetic signal (PVRcytb_resid) removed most of the phylogenetic signal, with K = 3.5 unadjusted, and K < 0.5 after PVR correction. Our new metric of body mass scales in the same way, with larger values equal to species with larger body mass. PVR body mass was included in our GAM model selection for the total viral richness and zoonotic virus models.

Host phylogenetic analysis and phylogenetic host breadth

We used two different mammal phylogenetic trees in our analyses and used a model selection framework to determine which best explained our observed association with zoonotic viral richness. First the mammal supertree was pruned in R (package ape, function drop.tips) to include only synonymous species for the 753 species in our database28,45. We synonymized all host species names between the mammal supertree and the host associations in our database using the IUCN Red List33. If the species was listed as ‘cattle’ it was assumed to be Bos taurus, all other records were excluded if there was ambiguity as to the scientific name for the host species. Second, a maximum likelihood cytochrome b tree was generated using the constraint of a multifurcating tree with taxa constrained to their respective orders and the order-level topology matching that of the mammal supertree6, as per this Newick tree file: (MONOTREMATA,((DIDELPHIMORPHIA,(DIPROTODONTIA,PERAMELEMORPHIA)),(PROBOSCIDEA,((PILOSA,CINGULATA),((((RODENTIA,LAGOMORPHA),(PRIMATES,SCANDENTIA)),((((CETARTIODACTYLA,PERISSODACTYLA),CARNIVORA),CHIROPTERA),EULIPOTYPHLA))))))). This generated a higher-resolution species-level mammal tree using cytochrome b data, with more reliable positioning of the higher-level taxonomic relationships than was obtained in exploratory phylogenetic analyses using cytochrome b data alone. GenBank accession numbers and cytochrome b sequence lengths for each species are provided in in our data and code repository. Cytochrome b gene fragments ranged from 143 to 1,140 bp, with >1,000 bp available for 558/665 (84%) of the taxa. Data derived from the cytochrome b tree constrained to the topology of the mammal supertree was selected as the best option in all best-fit GAMs.

Sequences were aligned using MUSCLE with default setting in Geneious R6, and checked visually for errors46. The best maximum likelihood tree with and without the constraint tree were generated using RAxML-HPC2 on XSEDE via the CIPRES Science Gateway server v.3.1 (ref. 47) using a GTR model with parsimony seed, 1.000 bootstrap replicates, and the following, specific parameters (raxmlHPC-HYBRID -s infile -n result -x 12345 -g constraint.tre -N 1000 -c 25 -p 12345 -f a -m GTRCAT).

Matrices of pairwise patristic distances between all species, including Homo sapiens, were calculated from the two phylogenies using the ‘cophenetic’ function in the R package ape45. Phylogenetic trees (Newick format for pruned supertree and cytochrome b tree) and matrices of phylogenetic distance from humans are provided in the data and code repository.

We calculated mean, median, max., min., IQR, and standard deviation (represented as generic function F in equation (1) of phylogenetic host breadth (PHB) from all known mammalian hosts for each virus using the pairwise patristic distances for each mammal–mammal association for all hosts of a given virus excluding humans, where i indexes each mammal in the database, as does j, and J represents the total mammals in the database. We aggregated these PHB values using mean, median, or maximum values at a viral species, genus and viral family level to generate higher-level taxonomic variables of host breadth per viral group. Our measure is similar to those developed by previous studies to understand parasite host specificity48,49,50, but here we create a generalizable variable to measure viral host breadth that can be aggregated at different viral taxonomic levels.

To make Extended Data Fig. 9, taxon names and terminal branches of cytochrome b tree constrained to supertree were colour-coded using residual from the best-fit zoonotic virus GAM (predicted minus observed zoonotic viral richness) for wildlife species, and plotted using the plot.phylo function in the R package ape45. Symbols (circles) at terminal taxa additionally added to better visualize residual value colours were added using willeerd.nodelabels function (http://dx.doi.org/10.5281/zenodo.10855). All marine mammals, domestic animals, and other taxa with missing data were coded as grey for missing data.

Viral richness heat map (Extended Data Fig. 2) was generated using the R package pheatmap, and the ‘complete’ hierarchical clustering algorithm to sort cells across rows and columns by similar values of viral richness. All box plots, histograms and all other figures generated in R v.3.3.0 (ref. 51). R code for primary figure generation is provided in the code repository.

GAM fitting and selection

We fit a set of generalized additive models (GAMs) that included all of our selected potential variables explaining the number of total viruses or number of zoonoses in hosts, as well as whether viruses were zoonotic (for conceptual framework and summary of each GAM see Extended Data Fig. 1; for full variable list and data sources see Supplementary Table 1). Our use of GAMs, an incorporation of smooth spline predictor functions into the generalized linear model (GLM) framework, allowed us to examine the functional form of our predictor variables (for example, Figs 2 and 4). Categorical and binary variables (for example, host order, IUCN status of hunted or not, and certain viral traits) were fit as random effects of each variable level. We used automated term selection by double penalty smoothing52 to eliminate variables from the models. This method removes variables with little to no predictive power and has been shown to be comparable or superior to comparing alternate models with and without variables. We did use the model comparison method for domestic animals, where the sample size was not sufficient for fitting all variables. In this case dropping variables by double penalty smoothing still allowed pruning the model list to eliminate redundant models. Where there were competing variables measuring the same mechanistic effect, we fit alternate GAMs using only one of each of these variables (as specified in below and in the Extended Data Fig. 1). These included phylogenetic variables, citation counts from alternate databases, and different measures of human population/host overlap. For example, to capture host phylogeny we used phylogenetic distance based on either the mammal supertree20 or a purpose-built cytochrome b constrained by the topology of the mammal supertree, but never both in the same model. For human population variables, we looked at either variables measuring overlap of species range with human-occupied areas, or human population in those areas, as area- and population-based measures were highly co-linear. For citation variables, we looked at either all citations or the number of disease-related citations for each host species, not both, and similarly citations in either PubMed or Web of Knowledge. We used a binomial GAM to analyse the 586 mammalian viruses in our database and identify viral traits that may serve as predictors of zoonotic potential. Co-linearity was not a major issue among variables included in the same model.

We inspected models within 2 AIC units of the model with the lowest AIC, and present the outputs of the best-fit and all other top models (<2 ΔAIC) in our data and code repository. In general, variable effects retained the same functional form and effect size across models within 2 ΔAIC—differences were limited to the adding or dropping of very weak, insignificant effects, or switching between highly correlated competing variables such as citation counts from different databases.

For our model of number of zoonoses per host, we used the total number of observed viruses per host as an offset, effectively fitting a model of proportion of zoonotic viruses per host. We found this variable had a coefficient near to one when it was used as a linear predictor, indicating its appropriateness as an offset.

We repeated the model selection process for all models using the more stringent set of data that used only virus identified in mammal hosts using viral isolation, PCR, or other methods of nucleic acid sequence confirmation, that is, that excluded all associations detected via serology.

All models were fit using the MGCV package for R (version 1.8-12.). We used the model with the lowest AIC to predict the number of expected zoonotic viruses for each host species, using all the data from our database that had complete observations for the best model. Our top models consistently outperform the alternatives by wide margins, as measured by AIC. We used standard methods in the R package MGCV to calculate deviance explained, which is defined as (D_null – D_model)/D_null. In this formula, D_null is the deviance (−2 × likelihood) of an intercept-only, (or, in the case of the zoonoses model, offset-only), model, while D_model is the deviance of our best-fit model.

Analyses were limited to terrestrial mammal species as defined by the IUCN Red List (marine mammals were excluded) and we ran separate analyses for wild and domestic animals. As domestic animals made up a much smaller dataset (n = 32 species) with a unique set of explanatory variables that differed from the wild species analyses, these models were fit separately. Domestic species results are also discussed separately (see Supplementary Discussion) as they are tangential to the primary findings.

Model cross-validation

We used k-fold cross-validation to evaluate goodness of fit for all models. The data was divided into ten folds, selected randomly. For each fold, the model was re-fit based on the other nine folds, and goodness of fit was assessed by conducting a nonparametric permutation test comparing the predicted values versus the real values for the kth fold, where a non-significant result indicates that predictions are unbiased. Poisson models goodness-of-fit may be compared via a parametric χ2 permutation test on deviance values, but this test is inappropriate in the case of models with low mean values, as is our case for some of our GAMs53. The k-fold cross-validation confirmed the robustness of our model predictions for wild mammals, code and outputs from these tests for each best-fit GAM are provided in Supplementary Table 2.

In addition to randomly selected k-fold cross-validation, we evaluated the robustness of our models via a non-random geographic cross-validation, code and summary document provided in our code and data repository. In order to meaningfully organize species in our dataset by geographic areas, we used the 34 zoogeographic regions for terrestrial mammals recently redefined by Holt et al.54. Using QGIS55, a mammal-specific zoogeographical shapefile provided by Holt’s group at the University of Copenhagen (http://macroecology.ku.dk/resources/wallace) was intersected (using QGIS Vector > Geoprocessing Tools > Intersect) with a shapefile of IUCN’s host ranges for all mammals in our database. Areas of these intersections were then calculated using an equal-area projection (Mollweide), and each host was assigned to only the region that contained the greatest proportion of its range. We systematically removed all observations (species) from each given zoogeographical region, re-fit the model using all observations from outside the region, then performed a non-parametric permutation test comparing the predicted values to the observed values for that region. Non-significant results indicate that model predictions are unbiased. Significant results for a given zoogeographic region suggest that there are location-specific biases that remain unexplained. This systematic zoogeographic cross-validation supported the overall robustness of our model predictions for several models, that is, all-data zoonoses, all-data total viral richness, and stringent-data total viral richness models. For these models, even though a majority of zoogeographic regions were unbiased, we still identified several zoogeographic regions that showed significant bias. Our zoogeographic cross-validation was equivocal for the stringent-data zoonoses model, with eight regions that showed evidence of bias and seven regions which showed no evidence of bias (Supplementary Table 3).

The presence of biased regions in our zoogeographic cross-validation suggested the possibility that there is a systematic bias associated with geography not captured by the predictor variables in our models. To further investigate this, we added zoogeographical region as a categorical random effect to each of our best-fit models. For three of our best-fit GAMs (all-data total viruses, stringent-data total viruses, and stringent-data zoonoses) the addition of zoogeographical region as a categorical random effect decreased the model AIC and increased the total deviance explained by 3–5%. The all-data zoonoses model, which was used to create the series of maps in the main manuscript, does not improve with the inclusion of zoogeographical region. However, the improved predictive power of models using region-specific terms is offset by the increase in degrees of freedom (that is, if we included 31 zoogeographic regions as separate terms) and, more importantly, a decreased interpretability of our models—especially when compared to the geographical variables we used, such as host area or species range overlap with human modified habitat. We opted not to include these random effects in our final GAMs in favour of keeping only variables interpretable in the context of our host trait-specific framework. Instead, we indicate areas of geographic bias directly on our spatially mapped outputs. (See ‘Calculating and visualizing missing viruses and missing zoonoses’, below.) Summaries of these models, along with changes in relative deviance explained for the other explanatory variables when zoogeographic region is added as a random effect, are provided in our code and data repository.

Spatial variables

For all the wildlife hosts we used the geographic range information obtained from the IUCN spatial database version 2015.2. Wildlife host species shapefiles needed to replicate analysis are hosted on our Amazon S3 storage (https://s3.amazonaws.com/hp3-shapefiles/Mammals_Terrestrial.zip)33. IUCN depict species’ range distributions as polygons based on the extent of occurrence (EOO), which is defined as the area contained within a minimum convex hull around species’ observations or records. This convex hull or polygon is further improved by including areas known to be suitable or by removing unsuitable or unoccupied areas based on expert knowledge. To accurately calculate the area in km2 of each host species we projected the polygons to an equal area projection (Mollweide).

We calculated various thresholds of mammal sympatry based on percentage of range overlap for each wild species in our database using IUCN shape files for all mammals globally. We define mammal sympatry as the number of mammalian species that overlap with the target species’ geographic range. We calculated mammal sympatry for each wild species in our database at six different thresholds based on the percentage area overlap with the target species geographic range, that is, the number of other wild mammal species with any (>0%), ≥ 20%, ≥ 40%, ≥ 50%, ≥ 80%, or 100% range overlap. The six different thresholds for mammal sympatry were included as competing terms in our model selection for the total viral richness models.

We derived and tested several global measures to estimate the level of human contact with each wild species in our database. To estimate the area of host geographic range covered by crops, pastures, rural and urban areas—as measures of global human contact with a given wildlife species—each species polygon was intersected (overlapped) with spatial data representing those land cover types. Additionally, we calculated the total number of people within each host geographic range using data from HYDE database56, and also separately totalled the number of people in rural and urban populations. We obtained data on the distribution of cropland, pastures, rural and urban areas also from the HYDE database56 for the years 1970, 1980, 1990, 2000 and 2005 with a spatial resolution of 5 × 5 arc minutes, equivalent to 10 km by 10 km at the equator. These datasets were created by combining information from satellite imagery and sub-national crop and pasture statistics56. In our GAMs, we used several transformations of these variables as competing proxies for human–wildlife contact: the log-transformed area of host range that overlapped each type of human-modified land cover, log-transformed human population in the host range, log-transformed human population density in the host range, and the log-ratio of urban and rural human populations in the host range. For each of these, we also included as a variable the change in value from 1970 to 2005. Human–wildlife contact variables that significantly covaried were excluded (set as competing terms) during the model selection process. The ratio of urban to rural human population was used to disentangle variables of human–wildlife contact that significantly covaried. For example, the total area of a species range that overlapped with urban and rural areas was highly correlated with the total geographic area variables we examined (for example, total area, and area in crop, pasture, rural, and urban). The ratio of urban to rural population allowed us to separate these signals and best represent this proxy of per-species human–wildlife contact. All spatial analyses were performed in R (3.3.2)51, using the following R libraries: raster57, rgdal58, and sp58.

Calculating and visualizing missing viruses and missing zoonoses

We used each respective best-fit, all-data GAM from the total viral richness and proportion zoonoses models to calculate the estimated number of viruses that would be observed if the research effort variable for each species was equal to that of the most-studied wild species in our database (Vulpes vulpes with 4,433 total publications and 1,477 disease-related publications). We used the prediction of the total virus richness GAM as the offset for the zoonoses GAM. We then calculated the missing viruses and missing zoonoses by subtracting the observed number of viruses and zoonoses from the predictions based on maximum research for each wild mammalian species.

We used geographic range maps from the IUCN spatial database (2015.2) to visualize the spatial distribution of observed host–virus associations, observed host–zoonoses associations, these associations as predicted under maximum research, and the maximum predicted minus the observed viruses, or the missing viruses and missing zoonoses (for example, Fig. 3; Extended Data Figs 3, 4, 5, 6, 7, 8; Supplementary Table 4). We also generated maps comparing species richness of all species in the IUCN database against those with viral associations in our database. For each species, the distribution range was converted to a grid system with cells 1/6 of a geographic degree (approximately 18 km × 18 km at the equator line). Each grid cell was assigned a value of one to indicate presence. We repeated this process and assigned the observed and predicted-under-maximum-effort number of zoonotic viruses to their correspondent grid cells. Viral and host species richness maps, and both the missing viruses and missing zoonoses maps were calculated by overlying individual grids. Each richness map represents the sum of all values for a given grid cell. We repeated the process for all the host species in our database and created viral and species richness maps for the following orders: Carnivora, Cetartiodactyla, Chiroptera, Primates and Rodentia. These taxa were selected because they represent 681/736 (92.5%) of wild mammal species in our database.

In the process of translating our non-spatial, species-level predictions to geographic space (that is, layered raster maps), we identified several geographic areas where our model predictions of the number of total and zoonotic viruses were systematically biased, that is, P < 0.05 (Supplementary Table 3). In order to visualize the geographic biases of our non-spatial model predictions in our maps (see above regarding zoogeographic cross-validation), we demarcate regions with significant bias with hatching. Hatched regions represent areas where model predictions of total or zoonotic viral richness deviate systematically for the collection of species in that grid cell. For each grid cell we calculated whether the bias exceeded that expected from a random sampling of hosts. This was accomplished by summing the residuals from 100,000 random draws of species in our dataset that was equal to the number of species present in that grid cell, then identifying grid cells where the observed bias was outside the middle 95% of the randomly drawn distribution. We calculated this for all mammals, and separately for each order across all grid cells. Areas with observed bias (outside of 95% of the randomly drawn distribution) are shown with hatched regions on each missing virus and missing zoonoses map.

Animal images used in figures

Animal silhouettes added to Figs 1 and 3 and Extended Data Figs 1 and 2 to visually represent each mammalian order were downloaded from PhyloPic (http://www.phylopic.org). Images used to represent the orders Chiroptera, Cingulata, Diprotodontia, Lagomorpha, Peramelemorphia and Primates were available for use under the Public Domain Dedication license. Images used to represent the orders Carnivora and Rodentia (by R. Groom), Didelphimorphia, Pilosa, and Probscidea (by S. Werning), Eulipotyphyla (by C. Rebler), Certartiodactyla and Perissodactyla (by J. A. Venter, H. H. T. Prins, D. A. Balfour & R. Slotow and vectorized by T. M. Keesey) were provided under a Creative Commons license (https://creativecommons.org/licenses/by/3.0/). We created the silhouette used to represent the order Scandentia.

Data availability

All datasets (host traits, viral traits, full list of host–virus associations and associated references, phylogenetic trees, and phylogenetic distance matrices) needed to fully replicate and evaluate these analyses are provided at http://doi.org/10.5281/zenodo.596810. The top-level README.txt file in the directory details the file structure and metadata provided.

Code availability

All R code and R package dependencies needed to fully replicate and evaluate these analyses are provided at http://doi.org/10.5281/zenodo.596810.