Genome sequencing and raw-read processing

We shotgun-sequenced the D. rotundus genome using a wing biopsy from a sample collected by the NIH through the Catoctin Wildlife and Zoo in Thurmont, Maryland, USA. The capturing method and dead preservation procedure of the specimen are unknown. The age and sex of the dead individual are unknown. Sampling permits were given to BGI for the sequencing of the specimen, originally as part of the BGI 10K genome project. Genomic DNA was extracted at the Laboratory of Genomic Diversity and was fragmented to 2–10 kilobases (kb). Sequencing libraries were constructed with insert sizes 170 bp−10 kb, according to the Illumina protocol for sequencing on Illumina HiSeq2000 following the manufacturer’s instructions. We sequenced reads of 49 bp for the long-insert-size libraries (2 kb, 5 kb and 10 kb) and 100 bp for the short-insert-size libraries (170 bp, 500 bp and 800 bp). Sequencing errors were corrected on the basis of the frequency of nucleotide strings of a given length and low-quality reads were filtered out using SOAPfilter43 as follows. (1) Remove reads with >10% uncalled nucleotides (Ns). (2) For short-insert-size libraries (<2 kb), reads were removed if the quality score of >60% bases was <7. For large-insert-size libraries (≥2 kb), reads were removed if the quality score of >80% bases was <7. (3) Adapter sequences, and duplicate or identical reads were removed. (4) Read pairs were removed if Read1 and Read2 were completely identical. (5) For short-insert-size paired-end sequences, reads with overlapping length ≥10 bp between the Read1 and Read2 were removed.

Genome assembly

We estimated the genome size of D. rotundus using Kmerfreq44 by dividing the total number of seven decamers by the peak of the seven decamer Poisson distribution. High-quality reads were assembled using SOAPdenovo43 as follows. (1) Short-insert library reads were assembled as initial contigs ignoring the sequence pair information. (2) Reads were aligned to the previously generated contig sequences. Scaffolds were constructed from short-insert-size libraries to large-insert-size libraries by weighting the paired-end relationships between pairs of contigs, with at least three read pairs required forming a connection between any two contigs. (3) Gaps in the scaffolds were closed using GapCloser43. Genome quality assessment was performed by downloading a publicly available D. rotundus transcriptome45 and aligning the transcripts to the genome using BLAT46.

Genome contiguity improvement

We prepared two Chicago libraries22 using 5 μg of high-molecular-weight DNA obtained from D. rotundus cultured cells from the San Diego Zoo collection, which were originally derived from a skin sample taken from between the shoulder blades of a D. rotundus individual. Permits for this were obtained from the San Diego Zoo Global. The capturing method and dead preservation procedure of the specimen are unknown. The age and sex of the dead individual are unknown. DNA was extracted with Qiagen Blood and Cell Midi kits according to the manufacturer’s instructions. The steps required for building the Chicago libraries were performed as described in ref. 22. The libraries were sequenced using Illumina HiSeq 2500 2 × 100 bp rapid run. Our initial D. rotundus assembly, shotgun sequence data and Chicago library sequences were used by Dovetail Genomics as input data for HiRise, as described in ref. 22. Genome assembly contiguity statistics were obtained using a minimum N track length of 1 to delimit the contig blocks within the scaffolds.

Protein-coding gene and functional annotation

Homology-based gene prediction was performed using as a reference the Ensembl gene sets of Myotis lucifugus, Pteropus alecto, Myotis davidii, horse and human. We aligned the protein sequences of the reference gene sets to the D. rotundus assembly using tblastn47 and linked the blast hits into candidate gene loci with genBlastA48. We filtered out those candidate loci with homologous block length <90% of the query length. We extracted genomic sequences of candidate gene loci, including the intronic regions and 3 kb upstream/downstream sequences. The sequences were passed to GeneWise49 to search for accurately spliced alignments. We filtered out pseudogenes containing more than one frame error for single-exon genes. Potentially pseudogenized single exons were removed if they were part of a multi-exon gene. We then aligned protein sequences of these genes against UniProt using blastp and filtered out genes without matches. We also filtered out genes that had >80% repeat regions. De novo gene prediction was performed with AUGUSTUS50 using a published common vampire bat transcriptome45 as a training data set and with masked transposable-element-related repeats. We filtered out partial and <150 bp predicted genes. Genes that aligned over 50% of their length to annotated transposable elements were filtered out. Finally, we built a non-redundant gene set with the homology-based evidence prioritized over the de novo evidence. If de novo genes were chosen in the reference gene set, we retained only those with > 30% of their length aligning against UniProt51 and that contained at least 3 exons. The integrated gene set was translated into amino-acid sequences, which were used to search the InterPro database with iprscan_4.852. We used BLAST to search the metabolic pathway database in KEGG53 and homologues in the SwissProt and TrEMBL databases in UniProt. The quality and annotation of the D. rotundus genome was assessed and compared to that of E. europaeus, R. ferrumequinum, M. brandtii, M. davidii, M. lucifugus and P. parnellii with BUSCO54.

Repeat annotation

Repeat annotation was performed on the genomes of D. rotundus, P. parnelli, M. lyra and P. vampyrus. Transposable elements were identified using RepeatMasker55 and RepeatProteinMask against the Repbase transposable element library56. We used RepeatScount, PILER-DF and RepeatModeler-1.0.555,57 to construct a de novo transposable element library, which was then used by RepeatMasker to predict repeats. We predicted tandem repeats using TRF58. LTR_Finder59 was used to detect long terminal repeats (LTRs). The Repbase-based annotations and the de novo annotations were merged.

Non-retroviral EVEs

We constructed a comprehensive library of all non-retroviral virus protein sequences available in GenBank and EMBL. We used DIAMOND60 to search these sequences against the D. rotundus genome. We extracted the matching amino-acid sequences and performed reciprocal blastp-like searches61 using DIAMOND with the selected subset of D. rotundus amino-acid sequences and the set of non-redundant protein sequences. D. rotundus genome sequences were considered of viral origin if they unambiguously matched viral proteins in the reciprocal best hits. Putative viral open reading frames were inferred from automated alignments, using exonerate62. For each putative D. rotundus viral peptide, we retrieved the function and predicted the taxonomic assignation by comparison to the best reciprocal blastp-like hit viral proteins. Phylogenetic analyses were performed by aligning the sequences using MAFFT63 and curated using AliView64. Maximum-likelihood inferences were performed on each multiple amino-acid alignment using RAxML65. Support for nodes was obtained from 100 non-parametric bootstrap iterations, and the root of maximum-likelihood trees was determined by midpoint rooting. We confirmed the presence of selected viruses by constructing double-indexed Illumina libraries66 on DNA derived from the spleen tissue of four different D. rotundus. Libraries were pooled and sequenced using 150-bp paired-end Illumina platform NextSeq 500. The raw paired-end reads were quality assessed, merged and filtered by mapping against the bacterial, human and chiropteran reference genome database with SMALT (https://www.sanger.ac.uk/resources/software/smalt/). Reads >150 bp were run through a viral assignment pipeline67 using blastx68 to search against the viral non-redundant protein database. Viral-matching reads were assigned a taxonomy with MEGAN569. Reads matching viral sequences were manually verified by reciprocal blastx analysis. Positive viral hits were defined as those with at least two different reads matching two different proteins or two different regions within the same viral protein.

ERVs

We identified flanking LTR regions using RepeatMasker70 and RMBLAST (http://www.repeatmasker.org/RMBlast.html), Tandem Repeats Finder58 and RepBase71 to the D. rotundus genome. We used blastn and tblastx with the retro-viral reference genome sequences in the RefSeq database of NCBI against the D. rotundus genome. Reads matching each of the RefSeq sequences were extracted and sorted by GI, collapsed by ID and manually verified by reciprocal blastn. Tblastx was re-run and the kept reads were mapped to the D. rotundus genome with SMALT and manually verified by reciprocal blastp. ERV sequences validation was carried out with tblastx against the Retroviridae. The retrieved ERV sequences were mapped using SMALT v0.7.6 against the genomes of D. rotundus, Mastomys natalensis, Eptesicus fuscus, M. lucifugus, M. brandtii, M. davidii, P. alecto, R. aegyptiacus and P. parnellii. Tblastx of the full genomic sequences of DrERV and DrgERV was also performed against Chiroptera. Selected genomic contigs were further analysed by reciprocal blastx against the non-redundant GenBank coding DNA sequence (CDS) translations + PDB + SwissProt + PIR + PRF database restricting the search to the Retroviridae, or against the DrgERV sequences. Retrieved sequences were aligned using MUSCLE in SeaView72. The best-fit amino-acid substitution model for each alignment was identified using jModelTest273 and trees were inferred under maximum-likelihood criteria using RAxML65.

Orthologous gene families

Using the D. rotundus genome against a range of other mammalian species, we performed clustering of orthologous genes using two strategies. (1) Identifying single-copy orthologues in the species by using the TreeFam method74. (2) Identifying 1:1 orthologues by building pair-wise orthologues between D. rotundus and the other species and using a reciprocal best hit (RBH) plus synteny approach as described in ref. 75.

dN/dS analyses

We used PAML codeml76 on the cleaned CDS alignments from the two sets of orthologous families and the corresponding phylogenetic tree as reported in ref. 77. To identify genes with accelerated evolution in the D. rotundus lineage we ran the two-ratio branch model. As a null model, we used the one-ratio model. Using these results, we performed likelihood ratio tests (LRTs) to identify genes with significant P values. To adjust for multiple testing, we used the FDR method. To identify genes with positively selected sites in D. rotundus, we also used a branch site with PAML codeml76. LRT and FDR were computed as carried out for the branch model tests.

Gene family expansion/contraction

We used CAFE78 with the results from the single-copy orthologous gene families. To obtain the dated tree required as input for CAFE, we obtained divergence times from the fossil dating records from TimeTree79. We concatenated the CDSs aligned with MAFFT and used PAML mcmctree to determine split times with the approximate likelihood calculation method. Genes with >200 copies in 1 of the species were filtered out.

Putative gene loss

We identified genes putatively lost in D. rotundus as previously described in ref. 80. The human gene set was downloaded from Ensembl and we obtained a non-redundant gene catalogue to be used as a reference by collapsing redundant genes and keeping the longest open reading frame. We used Bos taurus, Equus caballus and E. europaeus as outgroups. Genes identified as missing in D. rotundus were searched with blastp against the protein catalogues of the other available bats. Further validation on selected genes was performed by: evaluating the conservation of their syntenic regions compared to other species, and the GC content of the syntenic regions compared to the D. rotundus genome average GC content; searching them against the published D. rotundus transcriptome; and through PCR assays.

Functional characterization

We performed GO analysis using GOrilla81 as well as manual characterization using the GO annotations of the human genes downloaded from Ensembl and literature mining. We used PROVEAN82 to characterize the functional impact of the non-synonymous substitutions present only on the proteins in D. rotundus.

Species-specific PSS identification and protein modelling

We tested for positive selection and positive selected sites (PSSs) using the complete CDS alignments for the proteins FFAR1, PLXNA4, REG4, RNAS7 and TA2R3 from various species downloaded from the OrthoMaM database83. Sequences were realigned using MUSCLE84 in SeaView72 and phylogenetic analysis was performed using PhyML85. We used PAML codeml76 to test: the M1/M2 branch model constraining the D. rotundus node76; the M8a/M8 site model; and the branch-site model A constraining the D. rotundus node. The branch-site model A was evaluated under a LRT against the null hypothesis, while PSSs were scored under naïve empirical Bayes (NEB) and Bayes empirical Bayes (BEB). We performed modelling of the mentioned proteins from D. rotundus using the PyMOL Molecular Graphics System (Schrödinger, LLC). The three-dimensional protein models of the D. rotundus sequences were constructed using Phyre286 based on profile hidden Markov model analysis using the structures reported in refs 87,88,89,90.

Metagenomic data

We used faecal and anal swab samples from D. rotundus, R. aegyptiacus, R. ferrumequinum and M. gigas. The D. rotundus samples were obtained from wild D. rotundus individuals captured using mist nets and cloth bags and preserved in RNAlater according to the manufacturer’s instructions. The R. ferrumequinum samples were collected in Woodchester Mansion, Gloucestershire, UK. The M. gigas samples from Australia were collected by the consulting company Biologic in Pilbarra, Australia. The R. aegyptiacus faecal samples were collected from the Copenhagen Zoo. We have complied with all relevant ethical regulations for the collection of these samples. DNA from faecal samples was extracted using the PowerFecal DNA Isolation Kit (MoBio) with modifications to the manufacturer’s instructions. DNA from the D. rotundus milk sample was extracted using a standard commercial kit (Qiagen), following the manufacturer’s instructions. Anal swab samples were extracted using the BiOstic Bacteremia DNA Isolation Kit (MoBio) following the manufacturer’s protocol. All samples were kept at −20 °C. Libraries were built using the NEBnext DNA Library Prep Mast Mix Set for 454 (New England BioLabs) following the manufacturer’s instructions. Samples were 100-bp paired-end sequenced on the Illumina 2500 HiSeq platform.

Taxonomic and functional identification

The reads were cleaned with Trimmomatic91 and prinseq-lite92. The data sets were mapped against the closest available bat genome and only the non-mapping reads were kept. MGmapper93 was used for the taxonomic identification of invertebrates, protozoa, fungi, virus and bacteria. We kept the species identified with a coverage value higher than the first-quartile from the coverage distribution of the corresponding database and filtered out those found on the corresponding extraction blanks. Rarefaction curves from each data set were obtained using an in-house script with the MGmapper results. We then performed de novo assembly using IDBA_UD94, predicted genes using Prodigal95 and generated a non-redundant gene catalogue with usearch96. The non-redundant catalogue was searched with ublast96 against the UniProt database51 for functional and taxonomic annotation. We used DIAMOND60 to search the unmapped reads against the UniProt database keeping only the best hit for functional and taxonomic annotation. We annotated the UniProt protein identifications using KEGG orthology (KO) and eggNOG IDs.

Taxonomic and functional metagenomics comparison

We filtered on the basis of the breadth of coverage and the identifications of the extraction blanks. We removed any non-microbial hit and any taxa in which the paired reads matched different genera or only one of the reads had a hit. The counts were normalized by percentage. We identified a microbial taxonomic and functional sanguivorous core by comparing the filtered sets of the bats and keeping as core those taxa and genes identified only in the D. rotundus samples. We manually examined the taxa from the filtered taxonomic identifications, and the KO and COG annotations from the filtered non-redundant gene set catalogue. We compared the normalized abundance of taxa and functions between D. rotundus and the non-sanguivorous bats as follows. (1) Using the distributions of the different functional categories from D. rotundus and each non-sanguivorous bat species with a Wilcoxon rank-sum test. (2) Using the entire taxonomic and functional data sets, as well as down-sampling the normalized count values. Sampling values were the minimum, median and third-quartile values of the count distributions. With the resulting data sets, we calculated the Euclidean, Bray–Curtis and Jaccard distance metrics with the R package vegan97, and used the Ward hierarchical clustering method using UPGMA and Ward, and performed PCAs with prcomp and the GPA method. (3) We identified taxa and functions significantly contributing to the variation between the D. rotundus and the non-sanguivorous bat species. We examined the rotation matrix from the PCA of the normalized counts, excluding the four deepest sequenced samples, of the species and genus microbial taxonomic levels and the KEGG functional pathways. We identified the most significantly abundant D. rotundus microbial taxa as those with a significantly higher median normalized count value (P < 0.05) in D. rotundus and a median and mean normalized count value of 0 in the other 3 bat species for the first 3 principal components. We also identified significantly more abundant genes in D. rotundus by generating and annotating with KEGG a non-redundant gene set with all of the predicted genes from all of the bat samples. The reads of the samples were mapped against this bat non-redundant gene set, and a normalized count matrix was generated and used for Fisher tests on each of the functional pathways.

Life Sciences Reporting Summary

Further information on experimental design is available in the Life Sciences Reporting Summary.

Code availability

In-house scripts used for the processing of the data are available from the corresponding authors upon request.

Data availability

The NCBI BioProject accession code for the genome assembly is PRJNA414273, and the sequence reads are available at the NCBI sequence read archive (SRA) under the accession SRA619672. The BioProject code for the metagenomic sequencing data is PRJNA415003, and the reads can be accessed at SRA with the accession SRA620977.