Improved genomic annotations of conserved and novel miRNA loci across five mammals

Adapter-trimmed reads were mapped against the corresponding genome using patman38. Despite some technical variability, we observed generally high proportions of reads perfectly matching to the genome (Figs S1–S3), providing us with a robust dataset across several tissues and organisms.

For the identification of miRNA loci, we ran miRCat11 and miRDeep212 using the combined set of small RNA libraries of each species. By running both tools, we were able to generate two independent sets of putative miRNA loci per species. Predictions were then filtered based on the following two main criteria: evidence of both miRNA-3p and miRNA-5p expression in small RNA reads alignments against the predicted loci (Supplementary Files 1–5), and miRNA-like hairpin secondary structure (as predicted using the Vienna-RNA package)39. This lead to the identification of a final set (union of miRCat and miRDeep2 high confidence predictions) of 2088 loci: 1676 miRBase annotated and 412 representing novel miRNA loci (Table 1, Supplementary Tables S1–S5). Tissue specific expression plots (Supplementary Figs S4–S8) for all novel and conserved miRNAs were also generated. The number of novel miRNA loci is particularly high in the dog. The computational predictions are highly dependent on sequencing depth, and in our study more brain samples were available for the dog compared to all the other species.

Table 1 Counts of annotated miRNA loci, either belonging to a miRBase family or representing a novel gene. Full size table

Therefore, we sampled 40,680,089 reads (the total number of reads available for the horse) from the combined set of genome matching dog reads (n = 99,597,072), and recalled dog miRNA loci. Based on our filtering criteria (coverage of at least one read at both the 3′ and 5′ ends, and a total coverage of at least 10 reads), we could still confidently annotate 111 novel loci. This corresponds to 66% of the original novel annotation, and still represents the highest count across all our species.

We compared each of the five miRNA annotations with the corresponding latest Ensembl gene annotation (B. taurus UMD3.1.8, C. familiaris 3.1.86, E. caballus 2.86, S. scrofa 10.2.84, O. cuniculus 2.0.84). Unsurprisingly, we observed very high proportions of intronic and intergenic miRNAs, and very low numbers of miRNA overlapping UTR sequences. These patterns appear to be consistent across all 5 species (see Fig. 1). Analyses performed with GAT40 indicate a significantly higher than expected overlap with introns in all species (q-value < 0.001), while intergenic regions, despite containing a high number of miRNAs, are significantly underrepresented in all five genomes (q-value < 0.001). We additionally looked at the representation of different non-coding RNA classes in our dataset, in a species specific manner. When considering the set of genome matching reads, we observed the expected enrichment for miRNA sequences (Supplementary Table S6), while other non-coding RNA classes represent, altogether, 5–10% of the reads.

Figure 1 Genomic annotation for all predicted miRNA loci. Colour labelling indicate different genomic features (3′UTR, 5′UTR, intergenic regions, introns). Full size image

The fate of mammalian miRNA families

In order to generate clusters of homologous miRNA loci, we used CD-hit41 with 80% minimum identity on our set of annotated miRNAs. We thus obtained a total of 732 clusters (Supplementary Table S7), of which 432 grouped together only miRBase loci, 291 consisted of only novel (: absent in miRBase) sequences, and 9 represented mixed orthogroups. In order to limit potential biases in our annotations resulting from different genome assembly quality and sequencing depth across our species, we decided to look for evidence of sequence homology between genome assemblies. We thus aligned all annotated miRNA loci against the genome assemblies of human, mouse and all species considered in this study. This allowed for the identification of loci missing in the annotation, but showing high sequence homology to a miRNA annotated in another species, as well as evidence of synteny conservation in the surrounding region (see Materials and Methods). As an additional strategy to overcome differences in annotation and assembly quality, we looked for annotated miRNA sequences in the set of unaligned reads of horse and pig (for which the initial estimates of miRNA gain rates were surprisingly low). This analysis allowed us to further improve the presence-absence information used for the gain/loss inference.

We used this curated set to characterise the most parsimonious patterns of gain and loss of miRNA orthogroups across the phylogeny. Despite the high proportion of broadly conserved miRNA families, we observed high levels of miRNA turnover across the phylogeny, with many families being gained or lost in internal and terminal branches (Fig. 2). We observed a positive net gain rate in all terminal branches, with the only exception of the horse lineage. In this case, the virtually equal rate of gain and loss is likely due to a lack of sequencing depth, resulting in a limited number of predicted novel, horse specific miRNAs.

Figure 2 Gain and loss of miRNA clusters across the phylogenetic tree, as inferred by Dollo parsimony and synteny analyses. For each branch of the tree, the (black labelled) number of gained (+) and lost (−) orthogroups is provided. Red labelled numbers represent the branch specific net gain rate of orthogroups per million year. Full size image

Genomic sources of miRNAs

Various mechanisms can contribute to the appearance of new miRNA genes. While introns represent a crucial source of newly processed miRNA hairpins (sometimes not requiring Drosha processing, see the case of miRtrons), other miRNAs could arise by gene duplication, transcription of the opposite strand of an existing miRNA locus, or evolution from repetitive elements8. Our dataset allows us to determine the extent of the contribution of different evolutionary processes in mammals.

For the identification of miRNAs derived from repetitive elements, we used BLASTN42 to align all hairpin sequences against the Repbase (www.girinst.org/repbase) database. A bit score threshold, determined through the alignment of miRNAs against shuffled Repbase sequences (see Materials and Methods) was used in combination with other parameters to select high confidence BLAST hits. Following these conservative approach, we identified 72 novel and 45 miRBase miRNA loci showing a significant similarity with one or more Repbase sequences (Supplementary Table S8, Fig. S9). Interestingly, 49 out of the 72 novel, putatively repeat-derived miRNA loci are part of a single, large orthogroup specific to the dog: cluster 508. We performed GO:term enrichment analyses on repeat-derived miRNAs, and found significant enrichment for immunological processes (including “positive regulation of memory T cell differentiation”, GO:0043382; “positive regulation of activated T cell proliferation”, GO:0042104), as well as cognitive and behavioural (including “cognition”, GO:0050890; “behaviour”, GO:0007610, “exploration behaviour”, GO:0035640). These results might reflect an important role of these novel miRNAs in the evolution of immune response and neural expression plasticity. While we find 16 broadly conserved orthogroups, there are also 18 orthogroups appearing in terminal branches (Fig. 3), suggesting an important role for repetitive elements in the emergence of novel miRNA loci.

Figure 3 Gain and loss of repeat-derived miRNA clusters across the phylogenetic tree, as inferred by Dollo parsimony. The tree labelling is equivalent to Fig. 2. Full size image

Next, we asked the question whether we could find any case of reverse complement miRNA sequences, lying on the opposite strand of exactly the same genomic interval. Our analyses lead to the identification of at least one of such cases in every species (Table 2, Supplementary Table S9), with numbers ranging from 1 (in rabbit) to 22 coupled loci (in dog). It must be noted here that single sequencing errors in reads representing very abundant miRNAs might map to the reverse complementary strand, creating an artificial reverse complementary miRNA couple. Therefore, we looked for evidence of 1 mismatch differences in reads mapping to reverse complementary miRNA couples. We observed a single mismatch difference between the most abundant read of “X/81951230-81951286(−)_mir_cow_317” at the 5′ end, TTACAATACAACCTGATAAGT, and read TTATAATACAACCTGATAAGT (mapping to “X/81951229-81951289(+)_mir_cow_318” on the opposite strand). Given that mir_cow_317 has low expression levels (24 reads), while its reverse complementary partner mir_cow_318 is much more abundant (13,607 reads), we cannot completely rule out the possibility of 1-error reads of mir_cow_318 mapping to the opposite strand.

Table 2 The number of reverse complement miRNA gene couples and number of miRNA orthogroups containing paralogous duplicated genes. Full size table

We also investigated miRNA clusters containing multiple paralogous gene copies, and found a total of 135 orthogroups associated with duplication events (58 in tandem) in at least one species (Table 2 and Supplementary Table S10, Fig. S10). We identified both old duplication events (many duplicated loci have orthologous counterparts in other species) as well as more recent duplications, with genes arising in terminal branches (Fig. 4). Thus, duplication seems to be an important mechanism for the evolution of miRNAs in our set of species, as previously suggested by other studies in mammals17. When we compared the set of clusters with duplicated loci and with high similarity to repetitive elements, we identified 10 common clusters, 5 of which represents species specific groups.

Figure 4 Gain and loss of clusters of duplicated miRNAs across the phylogenetic tree. The tree labelling is equivalent to Fig. 2. Full size image

Additionally, we looked for miRtrons, representing intronic miRNAs which do not require Drosha processing, as the pre-miRNA is generated by intron splicing43. We were able to identify only a few putative miRtrons, with numbers ranging from 0 to 8 loci per species (Table 3, Supplementary File 6).

Table 3 The number of miRtrons and putatively repeat-derived miRNA genes. Full size table

Expression patterns across the phylogeny

Our dataset provides us with a great opportunity to investigate how the gain and loss of miRNAs relates to the observed patterns of miRNA expression across 4 different tissues. Our data suggests that young orthogroups tend to be more tissue restricted than the older, conserved ones, particularly when we consider brain and testis. Fig. 5 shows the number of gains across the phylogeny when considering only the orthogroups with evidence of expression in brain (Fig. 5A) and testis (Fig. 5B). By constructing trees where branch lengths are proportional to the fraction of tissue specific orthogroups, we can clearly visualise the higher tissue restriction of young miRNAs. This pattern is particularly evident in the case of brain tissues, and is in line with previous studies on the evolution of novel mammalian miRNAs17.

Figure 5 Expression patterns across the phylogeny for all miRNA orthogroups expressed in brain (A) and testis (B). Colour labelling indicate the number of miRNA families expressed in a specific tissue, divided in the following categories: tissue specific; expressed in the tissue + n additional ones (n = 3 indicated as “all tissues”). Branch lengths in the phylogenetic trees are proportional to the fraction of tissue specific orthogroups. Full size image

Next, we asked the question whether we can see differences in tissue specificity between novel and miRBase miRNAs. Fig. 6 shows the difference in the proportions of orthogroups expressed in a particular tissue, divided into sub-sets depending on the total number of tissues showing evidence of expression. We observed a significantly higher proportion of tissue specific families in the novel set compared to the miRBase set (z test, α = 0.05) for all four tissues except the kidney, with a particularly striking difference in the cases of brain and testis (brain: p < 10e−4, heart: p = 0.012, kidney: p = 0.051, testis: p = 10e−3). This result suggests that young miRNAs are expressed in a single or a few tissues when they first appear, and become more broadly expressed over time. Interestingly, we find that as much as 20% of novel orthogroups are restricted to the brain tissue. When we performed GO term enrichment analyses of the mRNA targets of novel, brain specific miRNA orthogroups (see Materials and Methods), results highlighted several neuronal (for example: “regulation of neuron projection development”,GO:0010975; “forebrain generation of neurons”, GO:0021872), behavioural (including “locomotory behaviour”, GO:0007626; “aggressive behaviour”, GO:0002118) and immune related processes (for instance, “negative regulation of innate immune response”, GO:0045824). Moreover, we find that the vast majority of these brain restricted orthogroups (32 out of the 55) have a species specific miRNA seed sequence (nt 2–8), potentially leading to novel regulatory interactions restricted to a particular lineage. Thus, our results suggest that the emergence of novel, tissue restricted miRNAs might play an important role in the lineage specific evolution of neuronal regulation, especially through the acquisition of novel seeds and associated targets.

Figure 6 Expression patterns of novel and miRBase loci across 4 tissues. Colour labelling indicate the number of miRNA families expressed in a specific tissue, divided in the following categories: not expressed in the tissue considered; tissue specific; expressed in the tissue + n additional ones (n = 3 indicated as “all tissues”). Full size image

The co-evolution of miRNAs and their UTR target sites

Homology analyses and Dollo parsimony provided an overview of the evolutionary patterns of miRNA families in our species. Next, we asked the question whether we can detect signatures of selection acting on the predicted target repertoires. As a result of the selective constraints acting on miRNA target sites, we would expect these loci to show increased conservation compared to the surrounding 3′UTR regions. We compared 20-way phastcons scores of the targets associated with species specific and conserved seed families (defined as groups of miRNA loci sharing exactly the same miRNA seed sequence), and observed an evident increase in scores corresponding to the targets of conserved families (Fig. 7). Additionally, the 3′UTR of genes targeted by conserved families show substantially higher conservation across all bins, as compared to those targeted by species specific seeds. These results suggest increased levels of purifying selection at the binding sites of conserved seed families.

Figure 7 Median and confidence interval of 20-way Phastcons scores, calculated across 7nt bins, centred around the predicted targets of conserved (black line), species specific (red) and combined (blue) seed families. Full size image

We next asked the question whether 3′UTR target site conservation reflects the loss of a miRNA orthogroup (or seed family) during evolution. Based on homology analyses and Dollo parsimony inference, we first identified all miRNA orthogroups which appear to be lost in a terminal branch of our phylogenetic tree (see Table 4) to test the hypothesis that the loss of a miRNA might be lead to relaxed selective constraints on the associated target sites.

Table 4 Summary of the number of lost orthogroups (also absent in the miRBase annotation) and seed families. Full size table

We tested our hypothesis of relaxation of selection by calculating pairwise target sequence similarity between the rabbit (chosen as an outgroup) and all other species.

However, we could not find any example where the species missing the miRNA family has a significantly lower conservation level in all comparisons (Supplementary Figs S11–S16). The observed lack of evidence for differential target site conservation between the species retaining and losing the miRNA orthogroup could be explained by: 1) a very recent miRNA loss; 2) a shared seed sequence between the lost and a more conserved orthogroup, leading to continued purifying selection; 3) the seed/orthogroup loss having a very weak or null effect on target site conservation; 4) the presence of false positive target predictions in our data.

Seed sharing appears to be widespread in our set of species, as clearly shown in Fig. 8. However, we also observe a considerable number of species specific seeds in the cow and the dog. Among the significantly enriched GO:term accessions (adjusted p-value < 0.05) for dog specific seed families (Fig. 9) we find: “extracellular structure organisation” (GO:0043062), “positive regulation of axon extension” (GO:0045773), “positive regulation of neuroblast proliferation” (GO:0002052), and “behaviour” (GO:0007610). The significantly enriched GO:term accessions for cow specific seed families include “lactation” (GO:0007595) and “mRNA 3′ end processing” (GO:0031124), while terms found significant only for the dog include “forebrain development” (GO:0030900) and “DNA repair” (GO:0006281). Among the accessions enriched in both the cow and dog specific seed targets we find “behaviour” (GO:0007610), “behavioural fear response” (GO:0001662) and several immune-related processes: “T-cell activation” (GO:0042110), “T cell receptor V(D)J recombination” (GO:0033153), “negative regulation of innate immune response” (GO:0045824) and additional related accessions.

Figure 8 Gain and loss of miRNA seed sequences across the phylogeny. The tree labelling is equivalent to Fig. 2. Full size image

Figure 9 GO term results for dog specific seed family GGACCGA (orthogroup 490) as summarised by Revigo62. Full size image

Next, we searched for domestication genes described in the literature44,45,46 in the target genes corresponding to significantly enriched GO accessions. 12 genes located in the candidate domestication regions identified by44 are found among the targets of dog specific seed families. This set includes genes associated with behavioural (POLR1E), immunological (TLX3) and body weight (TNKS2) phenotypes in mouse (http://www.informatics.jax.org/). When we considered the set of genes lying in the top 100 genomic regions under selection in the dog identified by45, we found 25 genes belonging to one or more significant GO term accessions for dog specific seed families. Once again, we observed genes associated with behaviour and body weight phenotypes (for instance GLRA1 and HTR2B). This overlap is significantly higher than expected (hypergeometric test, 1.24 fold enrichment, p = 0.01).

We also looked for positively selected genes47,48,49 in the set of targets of cow specific seed families belonging to a significant GO accession. We found 24 positively selected targets among the 35 common between our set of 1:1 orthologs and the gene list provided by47 (1.33 fold under-enrichment, p = 10e−4). We observed genes associated with immune system, cardiovascular and muscular phenotypes, including TNFRSF8, CREBBP and MTMR14. Only 3 targets were found in the gene set from Xu et al.48: WIF1 (Increased osteosarcoma incidence), KIT (coat/hair pigmentation) and LRIG3 (Abnormal craniofacial morphology) (2.43 fold under-enrichment, p = 2 × 10e−4). Additionally, we considered the dataset from Braud et al.49, representing genes with high miRNA binding sites divergence between B. taurus and B. primigenius. Among the top 200 scoring genes, we identified 43 targets of cow specific seed families (1.31 fold under-enrichment, p < 10e−4). Even in this case, we observed many genes involved in immunity (for example CIITA, RHOH), brain morphology/behaviour/body size (ASPA), and neurological (FOXI1) phenotypes. Interestingly, FOXI1 (nervous system phenotype) appears as a positively selected45,49 target of both dog and cow specific seed families. Altogether, we clearly highlight a trend among positively selected miRNA targets towards a few biological processes and phenotypic traits. Thus, our results suggest that species specific seed families might have played a role in domestication, by modulating the expression of genes under artificial positive selection.