tBLASTn with the Marburgvirus NP amino acid sequence yielded matches with low expect values (as low as 10⁻⁴⁹), indicating that the similarity is unlikely to be a chance result. We found twenty matches with expect values less than the standard "significance" value of 10⁻⁵ (see Fig. 2). The tammar wallaby (Macropus eugenii) showed the strongest similarity (49.4% identity) and also had at least 12 different strong sequence matches. The little brown bat (Myotis lucifugus) had four significant matches, while the guinea pig (Cavia porcellus), Ord's kangaroo rat (Dipodomys ordii), the common shrew (Sorex araneus), and the gray short-tailed opossum (Monodelphis domestica; Chromosome 2) each had a single matching sequence with an expect value <10⁻⁵. Another marsupial, the common brushtail possum (Trichosurus vulpecula), had six strong matches from the Expressed Sequence Tags (EST) database. All but three of these sequences (including the EST matches) had at least one apparent disruption of the open reading frame (ORF). tBLASTn with the L protein yielded one match with a low expect value (10⁻⁷⁴), in the gray short-tailed opossum (Monodelphis domestica; Chromosome 3). A tBLASTn search using the best-matching placental mammal sequence from the original NP search as a query also yielded strong matches in mammals: the pygmy hedgehog tenrec (Echinops telfairi), the mouse (Mus musculus) and the brown rat (Rattus norvegicus). The filovirus-like EST nucleotide sequences from the common brushtail possum had a BLAST match to a single region of the wallaby genome, with the longest match (DY609334) having 78% identity (9% of mismatches are gaps) over 662 bases.
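
The expect-value screen described above can be illustrated with a minimal sketch. It assumes the search results were exported in the default 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the text does not state which output format was actually used. The names `Hit`, `significant_hits` and `EXAMPLE_ROWS` are hypothetical, and the rows are made-up toy values, not real search output.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject: str             # subject sequence identifier (column 2)
    percent_identity: float  # percent identity (column 3)
    evalue: float            # expect value (column 11)


def significant_hits(rows: Iterable[str], threshold: float = 1e-5) -> List[Hit]:
    """Keep hits with an expect value below `threshold`, lowest first."""
    hits = []
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        hit = Hit(
            subject=fields[1],
            percent_identity=float(fields[2]),
            evalue=float(fields[10]),
        )
        if hit.evalue < threshold:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.evalue)


# Toy rows for illustration only (hypothetical identifiers and values).
EXAMPLE_ROWS = [
    "NP_query\tsubject_A\t49.4\t350\t160\t5\t20\t370\t1000\t2050\t1e-49\t190",
    "NP_query\tsubject_B\t31.0\t120\t80\t2\t40\t160\t500\t860\t3e-07\t55.1",
    "NP_query\tsubject_C\t28.0\t60\t42\t1\t100\t160\t10\t190\t0.8\t30.2",
]
```

Calling `significant_hits(EXAMPLE_ROWS)` keeps `subject_A` and `subject_B` and drops `subject_C`, whose expect value exceeds the 10⁻⁵ cutoff.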

Figure 2. Midpoint rooted maximum likelihood phylogram of nucleoprotein (NP) amino acid sequences from filoviruses, morbilliviruses and related mammalian genomic and EST sequences. Branches with more than two sequences and strong support (at least 90 for bootstrap or 95 for Bayesian posterior probability) have values shown above the branch (in the order of approximate likelihood ratio tests, Bayesian Posterior Probabilities, and non-parametric bootstrap values). Parentheses contain GenBank Accession numbers and are followed by the range of the sequence for nucleotide submissions. Red filled branches indicate clades of viruses (Mononegavirales), black filled branches indicate mammalian sequences, and blue filled lines indicate expressed sequence tags. Geographic origins are given in parentheses adjacent to species names. Shaded cartoons indicate outlines of species represented in the analysis.

We tested for integrated DNA-based copies of the filovirus-like sequences in the two mammals with the most copies, the tammar wallaby and the little brown bat. We designed PCR primers from mammalian genomic sequence flanking the longer BLAST matches and carried out PCR amplification of DNA extracted from specimens other than those used for the existing genome projects. Our sequence of the tammar wallaby had only a single transition difference from the genome project sequence. The sequence of the little brown bat from Minnesota (FMNH 172384) had a similarity of 96%, with four indels, compared to the contig (AAPE01196249) from the existing genome. To test for the presence of a filovirus-like DNA sequence in an additional insectivorous bat, we extracted DNA from a specimen of the big brown bat (Eptesicus fuscus). Using primers designed from the little brown bat, we again obtained a PCR product and sequence. In this case, the identity between the sequences of the two genera of bats was 87%, with 11 indels. In each case, the similarity of the new sequences obtained from DNA to the genomic sequence is consistent with an integrated filovirus-like DNA copy in these mammalian genomes.

We next carried out a phylogenetic analysis of the NP and L protein amino acid sequence alignments with Mononegavirales (paramyxovirids and filovirids) to assess the direction of the transfer. Because the L protein gene is known to be the most conserved gene in the Mononegavirales, a larger number of BLAST matches with expect values <10⁻⁵ was found between the families of Mononegavirales for the L protein than for the NP. The midpoint rooted maximum likelihood (ML) phylogram placed the potential mammalian NIRVs within the Mononegavirales, and revealed that the mammalian sequences are more closely related to filoviruses than to paramyxoviruses (Figs. 2, 3). Indeed, the L protein-like sequence from Monodelphis was more closely related on the best ML tree to Marburgvirus than to other known filoviruses (i.e., Ebolavirus) (Fig. 3). This result suggests that the most recent integration of filoviruses from our data involves South American marsupials. The NP analysis also revealed that the South American Monodelphis is more closely related to known filoviruses than to other mammalian sequences (Fig. 3 and Additional file 1: Fig. S1). Although many of the sequences are of different lengths in the NP alignment (Additional file 2: Fig. S2), it is now well known that sequences of very different lengths can be accurately placed on phylogenies [21]. However, there could be long-branch effects or alignment effects for the NP phylogeny, as the exclusion of the distantly related Morbillivirus sequences yielded the same mammalian paraphyly but increased the support values (Fig. 4). For both genes, the placement of the mammalian NIRVs with the filoviruses (i.e., within Mononegavirales) had maximum support for each measure of reliability. The placement and the strong support values for this node are consistent with the direction of transfer from viruses (Mononegavirales) to mammalian genomes.
Endogenous reverse transcriptase activity has been shown experimentally to integrate non-retroviral RNA viruses in mammals [17, 22] and may have played a role in filovirus integration. Interestingly, the coding regions that most closely flank at least five of the NIRVs of Macropus, and the separate NP-like and L-like NIRVs of Monodelphis, are truncated or disrupted non-LTR retrotransposons of the LINE-1 family. Our results represent the first case of NIRV formation in mammals with a virus that has extranuclear replication [17].

Figure 3. Midpoint rooted maximum likelihood phylogram of L protein amino acid sequences from filoviruses, Paramyxoviridae, and a South American marsupial genomic sequence. Labeling and shading details are as in Fig. 2 except that the species name and continent for the mammalian sequence are provided in the caption: Monodelphis domestica (South America).

Figure 4. Midpoint rooted maximum likelihood phylogram of nucleoprotein (NP) amino acid sequences from filoviruses and related mammalian genomic and EST sequences showing the paraphyly of mammals. Branches with more than two sequences and strong support (at least 90 for bootstrap or 95 for Bayesian posterior probability) have values shown above the branch (in the order of approximate likelihood ratio tests, Bayesian Posterior Probabilities, and non-parametric bootstrap values). Parentheses contain GenBank Accession numbers and are followed by the range of the sequence for nucleotide submissions. Red filled branches indicate clades of viruses (Mononegavirales), black filled branches indicate mammalian sequences, and blue filled lines indicate expressed sequence tags. Geographic origins are given in parentheses adjacent to species names. Shaded cartoons indicate outlines of species represented in the analysis.

The observation that most of the mammalian sequences have ORF disruptions and possess only truncated NP-like genes (Fig. 1) is also inconsistent with a transfer from mammals to virus. Only Monodelphis has more than one different filovirus-like gene (Additional file 3: Fig. S3), and these (the NP and L protein-like sequences) are on separate chromosomes. The apparent genic bias of NIRVs for the NP gene could have a biological explanation. Because of the transcription gradient in the Mononegavirales, the most common primary transcript is NP [13]. We also note that experimental expression of an N-terminal portion of the Ebolavirus NP gene (residues 1-450 in wildtype NP) that is positionally homologous to the region of NP spanned by mammalian NIRVs (residues 18-405 in wildtype NP, NP_066243) is sufficient to inhibit the formation of Ebolavirus minigenomes in a dosage-specific fashion [23]. A background transcription bias could account for the overrepresentation of NP in NIRVs, but such a bias fails to explain the N-terminal bias within the NIRVs of NP. The N-terminal bias is consistent with the experimental filoviral interference mechanism involving the N-terminus of NP.

Despite ORF disruptions, it is clear that at least some mammalian filovirus-like NIRVs of NP are expressed. In the marsupial Trichosurus, we detected six different NP-like ESTs (EC302609, DY609334, EC300968, EC310159, DY613238, EC352436) from three tissue-specific cDNA libraries: liver, spleen/lymphatic system and gonads. These tissues play an important role in the pathology and replication of filoviruses [24]. We did not detect the NIRV in the cDNA libraries made from brain, whole embryo, kidney, uterus/reproductive tract, or gut tissues. Still, non-functional pseudogenes can be transcribed by interactions with neighboring functioning loci [25]. We tested for selective maintenance of codon structure in the filovirus-like NIRVs as a further indication of function. Comparisons of rates of amino-acid changing substitutions (dN or Ka) to rates of silent substitutions (dS or Ks) do bear the signature of selective codon maintenance or purifying selection. Non-functional regions should conform to neutral expectations, where dN = dS and dN/dS = 1 [26]. For regions undergoing purifying selection, the silent substitution rate should prevail, whereby dN - dS << 0 and dN/dS << 1. The codon-based test of neutrality using the model of Kumar (which accommodates transition/transversion rate bias) indicates that silent mutations are significantly overrepresented in an alignment of filovirus-like NIRVs (dN - dS = -9.427, P < 0.001) [27]. Likewise, Bayesian calculations of site-specific Ka/Ks using evolutionary models that accommodate codon usage differences [28] reveal a prevailing pattern of values significantly less than 1 (Fig. 5). Under a model that allows purifying, neutral and positive selection (Model M8), the distribution of Ka/Ks peaks at about 0.4. For the M8 model, 67 percent of these alignment sites (and all of the M7 sites) have upper 95 percent confidence limits below 1.
While these Ka/Ks values are larger than is typical of strong purifying selection, they are markedly less than neutral expectations or even the range of Ka/Ks = 0.6 to 1.0 that is reported for disrupted transcribed pseudogenes in mammals [29]. Even though there appears to be selection for preserving codons, the tests cannot differentiate between past and present function. Moreover, the products need not be protein-based: RNA interference products can elicit codon-like selection to interact with protein-coding genes [29]. The functionality and potential role of NIRVs in the well-known resistance to filoviruses of some NIRV-containing mammals (mice and guinea pigs) will have to be addressed with experiments.
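
The site-level summary above (the share of alignment sites whose upper confidence limit falls below the neutral expectation of 1) can be sketched as follows. This is only an illustration of the bookkeeping, not the Bayesian estimation itself; `SiteEstimate`, `fraction_below_neutral` and the toy values are hypothetical and are not taken from the analysis.

```python
from typing import List, NamedTuple

NEUTRAL_RATIO = 1.0  # expected Ka/Ks when substitutions are selectively neutral


class SiteEstimate(NamedTuple):
    ratio: float  # point estimate of Ka/Ks at one alignment site
    lower: float  # lower 95 percent confidence limit
    upper: float  # upper 95 percent confidence limit


def fraction_below_neutral(sites: List[SiteEstimate]) -> float:
    """Fraction of sites whose upper confidence limit is below 1."""
    if not sites:
        raise ValueError("at least one site estimate is required")
    below = sum(1 for site in sites if site.upper < NEUTRAL_RATIO)
    return below / len(sites)


# Toy per-site estimates (hypothetical values for illustration only).
EXAMPLE_SITES = [
    SiteEstimate(ratio=0.30, lower=0.10, upper=0.70),
    SiteEstimate(ratio=0.40, lower=0.15, upper=0.90),
    SiteEstimate(ratio=0.45, lower=0.20, upper=0.95),
    SiteEstimate(ratio=0.80, lower=0.40, upper=1.60),
]
```

With these toy values, `fraction_below_neutral(EXAMPLE_SITES)` counts three of the four sites, since only the last site has an upper limit that overlaps the neutral value.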

Figure 5. Histograms of Ka/Ks values calculated from alignment sites of the filovirus-like elements in eleven species of mammals. Values are calculated using Bayesian methods and a model that accommodates neutral, positive and negative selection (M8 below), and a model that accommodates largely negative or purifying selection (M7 above). Note the better fit of the purifying selection model. Red dashed lines indicate the expected values under neutral evolution for non-functional pseudogenes, while values <<1 are consistent with purifying selection.

More than one endogenization is required to account for the paraphyly of mammals and the paraphyly of marsupials with filoviruses. The finding of a monophyletic clade for placental mammals with samples from several continents requires either a single ancient integration with several losses of NIRV signal or multiple integrations of a related virus in unrelated mammal groups (Fig. 6). A single origin for eutherian NIRVs is supported by the rarity of the process: endogenization of non-retroviral RNA viruses with extranuclear replication was previously unknown in mammals. Ancient transcribed pseudogenes >100 million years old are known from mammals [29], and the primate bornavirus integration is believed to be older than 40 million years [17]. Although many of the deeper groupings have weak support and there has been gene duplication, there are some well-supported groupings that agree with mammalian phylogeny. The strongly supported groups are the two bat genera, the genera of mouse-like rodents, and the Australian marsupials, Trichosurus and Macropus. These genera of marsupials are believed to have shared a common ancestor from 39 to 52 million years ago [30]. A clear indicator of antiquity is the syntenic genomic location of a rat and mouse filovirus-like NIRV (Fig. 7A, B). These are the same copies that have a sister group relationship (Fig. 2). It is unlikely that integration of filovirus NP genes at the same genomic position occurred independently in rats and mice. The rat-mouse orthology provides a minimum date of NIRV formation of 12 to 24 million years ago [31, 32]. Of the species with filovirus-like elements, only the rat, mouse and Monodelphis have detailed chromosomal maps, but further mapping and taxonomic sampling will permit a more robust assessment of the age of eutherian NIRVs. Still, we conclude that the association between filoviruses and mammals is likely to be tens of millions of years older than the previous estimate.
Filoviruses join bornaviruses as the only demonstrated prehistoric non-retroviral RNA viruses.

Figure 6. Summary graph showing mammalian genomes assayed for filovirus-like elements and the phylogenetic distribution of the mammals with filovirus-like elements. Red shading indicates that species with detected filovirus-like elements fall into a marsupial and a eutherian group. The platypus genome was also assayed but is not depicted here. The mammalian phylogeny is based on a composite of recent studies [30, 44, 45].

Figure 7. Chromosome maps showing synteny of regions flanking filovirus-like elements in rat and mouse genomes with a whole chromosome view (A) and a local view (B). White asterisks represent the locations of the phylogenetic sister copies of filovirus-like elements. Five synteny blocks with a reversal distance of 2 were found between CHR 16 of the mouse and CHR 11 of the rat. The filovirus-like elements are located on a reversed synteny block (purple shading). A close-up view shows the flanking gene locations and acronyms.

The eutherian orders with NIRVs of filoviruses closely match the proposed candidate reservoir groups of bats, rodents, and insectivores [1, 2] (Fig. 6). This pattern cannot be attributed to a sampling artifact of the available genome assemblies. Seven of the ten genomes (including the big brown bat) sampled from predicted reservoir orders had integrated filovirus-like elements, while only 1 of 27 from non-candidate eutherian orders had detected integrated filovirus-like elements (Fisher's exact test, two-tailed p value = 0.00003). The sole eutherian species from a non-candidate group to have a potential NIRV was the pygmy hedgehog tenrec, which is the Afrotherian small insectivore analog on the island of Madagascar. The three genome assemblies from candidate orders that lacked apparent NIRVs were those of the ground squirrel (Spermophilus tridecemlineatus), the European hedgehog (Erinaceus europaeus) and the fruit bat (Pteropus vampyrus). At present it is unclear why some small mammal groups (bats, rodents, insectivores and marsupials) appear to have an association with filoviruses. Still, the study of filovirus-like NIRVs could have predictive value for identifying filovirus reservoirs, ancestral proteins, outbreak modeling, undetected lineages of filoviruses and virulence in mammalian species. For example, the close relationship of South American and expressed Australian marsupial filovirus-like NIRVs with rapidly evolving African filoviruses now makes it more likely that the New World harbors undetected filoviruses or has acted as a source region for extant filoviruses.
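
For readers who wish to explore the contingency comparison above, a two-sided Fisher's exact test can be sketched with the Python standard library. The 2x2 layout used in the usage note (7 with and 3 without elements among candidate orders; 1 with and 26 without among non-candidate orders) is our reading of the counts in the text, and the two-sided convention (summing every table that is no more probable than the observed one) is an assumption; other software or conventions can give different values, so this sketch need not reproduce the reported p value exactly. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> Fraction:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)  # shared denominator

    def weight(x: int) -> int:
        # Unnormalized probability of a table whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)   # smallest feasible top-left cell
    highest = min(row1, col1)      # largest feasible top-left cell
    observed = weight(a)
    extreme = sum(
        weight(x)
        for x in range(lowest, highest + 1)
        if weight(x) <= observed  # exact integer comparison, no rounding
    )
    return Fraction(extreme, total_tables)
```

A call such as `float(fisher_exact_two_sided(7, 3, 1, 26))` evaluates the assumed table; integer weights and `Fraction` are used so that ties between equally probable tables are compared exactly rather than with a floating-point tolerance.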