While the general threat of insertional mutagenesis due to unmitigated ERV transcription and subsequent retrotransposition is minimized by epigenetic mechanisms, including DNA methylation, histone lysine methylation, and small noncoding RNAs (), recent studies have revealed that ERVs have also played a prominent role in expanding the regulatory landscape of mammalian genomes (). The “controlling element” theory that transposable elements (TEs) may participate in gene regulation was postulated over 60 years ago by Barbara McClintock () and was later expanded upon by Britten and Davidson’s gene battery hypothesis (). Genome-wide studies have indeed confirmed that species-specific ERV LTRs exert regulatory effects on genes in many cell types during development to modulate the transcriptome (). However, the molecular mechanisms whereby these heterologous sequences are converted into regulatory elements for host genes remain obscure. Here, we highlight recent studies that have advanced our understanding of how LTR sequences are exapted into species-specific cis-regulatory elements. We begin by exploring why LTR retrotransposons are particularly suitable for co-option by the host and subsequently review recent experimental evidence supporting a model of reiterative exaptation of LTRs in mammals as tissue-specific promoters or enhancers for protein-coding genes and long noncoding RNAs (lncRNAs). We conclude with a discussion of recent functional studies of the role of specific exapted LTRs in gene regulation and outstanding questions to be addressed in future studies.


Retrotransposons, which replicate via a transcription and reverse-transcription “copy-and-paste” mechanism, account for greater than 40% of the human and mouse genomes (). These parasitic sequences can be classified into two major groups. Those lacking long terminal repeats (LTRs), including long and short interspersed nuclear elements (LINEs and SINEs, respectively) and SINE variable-number tandem-repeat Alu (SVA) elements, comprise ∼30%–35% of the genome, while those with LTRs, termed endogenous retroviruses (ERVs) or LTR retrotransposons, comprise ∼8% and 10% of the human and mouse genomes, respectively () ( Figure 1 A). ERVs are the descendants of exogenous retroviruses that integrated into the genome of germ cells. Most subsequently lost the ability to exit the host cell. Thus, those ERVs that may be defective for infection but are still competent for retrotransposition expand in their host genome by vertical transmission (). In addition to their 5′ and 3′ LTRs, which are identical in sequence following reverse transcription and integration, autonomous proviral elements typically harbor several open reading frames (ORFs) that encode proteins essential for viral replication, including gag, which encodes a group-specific retroviral antigen, and pol, which encodes the reverse transcriptase ( Figure 1 B). A third ORF encodes an envelope protein (env), although the vast majority of ERVs have truncated or mutated env sequences.

Figure 1 Structure of an Intact ERV and Solo LTR and the Molecular Mechanisms of LTR Exaptation as Protein-Coding or lncRNA Promoters

(A) Schematic of non-LTR retrotransposons, which include SINEs (i.e., Alus), LINEs (i.e., L1Hs), and SVAs (in humans), and LTR retrotransposons, which include many lineage/species-specific subfamilies. Most LINE elements are truncated at the 5′ end, thus lacking the 5′ UTR promoter and TSS.

(B) Full-length ERVs have 5′ and 3′ LTRs, and an “internal” region that includes a primer-binding site (PBS) involved in priming reverse transcription and retroviral ORFs gag, pol, and a truncated or mutated env gene (Δenv). Recombination between 5′ and 3′ LTRs deletes the internal region, generating “solo” LTRs (not to scale), which consist of unique 3′ (U3) and 5′ (U5) regions and a regulatory region (R) containing the TSS (white arrow). LTRs often harbor different combinations of TFBSs (green and orange rectangles) in addition to core polymerase II promoter elements (such as TATA box, shown in red) and may also contain a splice donor (SD) site (dashed line) within the U5 region.

(C) LTR exaptation as a protein-coding gene promoter. In a developmental/tissue-specific context, particularly in cell types undergoing epigenetic reprogramming (e.g., early embryo, placenta, or germline), a hypomethylated solo LTR (pink rectangle) 5′ of a protein-coding gene (or in an intragenic region) may become exapted as a novel promoter (black circles represent DNA methylation). The process may involve base substitutions in a neighboring near-consensus TFBS (gray rectangle with dashed outline) and a near-consensus site within the LTR (green rectangle with dashed outline), which then form a positive genetic interaction with another LTR-derived TFBS (orange rectangle), a mechanism termed “epistatic capture” (). This leads to synergy in the binding of several TFs (gray, orange, and green ovals); deposition of “active” histone modifications, such as H3K4me3 and H3K9ac (green circles); and robust transcription initiation from the LTR-derived promoter. The canonical genic promoter may be DNA methylated as a consequence of such transcription. This process generates LTR-genic exon chimeric transcripts, where exon 1 is derived from the LTR and splicing occurs from the internal LTR SD (or from a cryptic SD site in the intervening genomic sequence downstream of the active LTR) to the first downstream exon with a splice-acceptor site, generally exon 2. Examples of such chimeric transcripts include Spin1 in mouse and CSF1R and B3GALT5 in human. Arrow sizes indicate relative level of transcription from each promoter.

(D) LTR exaptation as a promoter for a novel lncRNA. Through a process as in (C), a newly formed intergenic solo LTR without an SD site could initiate de novo lncRNA transcription, forming a novel lncRNA gene. An example of this is the lincRNA-RoR transcript.

Consistent with the presence of TFBSs and their propensity to evade epigenetic silencing, many ERVs and LTRs exhibit tissue-specific expression patterns, especially during embryonic and germline development (). Indeed, ERVs have likely been under selection to increase their odds of successful retrotransposition and vertical transmission and therefore exhibit high levels of transcription in the early embryo and reproductive tissues, including primordial germ cells (PGCs) and oocytes (). Thus, it is not surprising that the ERV families present in high copy number are also those competent for expression in the germline. Although H3K9me3 and/or DNA methylation play a role in silencing of ERVs in both undifferentiated and differentiated cell types, specific ERVs likely exploit global reprogramming of epigenetic states, such as during embryonic preimplantation development () or in the placenta (), to promote their expression. During these developmental stages, the LTRs that have accumulated mutations that relieve selective pressure for KRAB-ZFP-based silencing in these cell types, or are otherwise not efficiently bound by KRAB-ZFPs due to low level of expression of the relevant KRAB-ZFP or their genomic context, would come under purifying selection for beneficial regulatory effects on neighboring genes. Thus, the combination of autonomous RNA polymerase II promoter/enhancer activity conferred by intact TF binding sites and changes in the repertoire of ERVs bound by KRAB-ZFPs over evolutionary time is likely to provide a unique context for exaptation of solo LTRs for tissue-specific gene regulation.


Once all members of a particular ERV family are effectively silenced by the KRAB-ZFP/KAP1 repression system, the accumulation of inactivating mutations in replication-competent proviruses, i.e., in functional viral protein-coding regions, would over time relieve the positive selective pressure for KRAB-ZFP recognition, allowing mutations to accumulate within the relevant KRAB-ZFP gene, ultimately modifying or ablating the DNA-binding specificity of the encoded protein regardless of whether it binds in the LTR or internal region. The remaining replication-incompetent full-length proviruses and solo LTRs derived from these elements would no longer be recognized by a specific KRAB-ZFP, allowing for selection of LTRs as promoter or enhancer elements of nearby genes (). This does not exclude the possibility that there may be purifying selection of KRAB-ZFP binding sites within otherwise decaying ERV internal regions or LTRs, allowing for ERV exaptation for silencing of nearby genes ().

Many intact ERVs are targeted for transcriptional silencing by the rapidly diversifying family of Krüppel-associated box zinc-finger proteins (KRAB-ZFPs), which interact with the co-repressor KAP1 and the histone H3 lysine 9 (H3K9) methyltransferase SETDB1 (). Indeed, chromatin immunoprecipitation sequencing (ChIP-seq) analysis in mouse embryonic stem cells (ESCs) reveals that the solo LTRs of a subset of ERV families, including IAP solo LTRs, are marked by H3K9me3 (), indicating that for some ERVs, the LTR itself may be bound by specific KRAB-ZFPs. However, while the binding sites of only a few of the >300–400 KRAB-ZFPs in humans and mice have been studied, the majority characterized thus far recognize internal ERV sequences, including the primer binding site, 5′ UTR, gag, and 3′ polypurine tract regions (). Since solo LTRs lack these internal sequences, they may escape the KRAB-ZFP/KAP1 silencing machinery directed at full-length elements, facilitating their exaptation as positive regulatory elements by the host.

The presence of a conserved splice donor (SD) site within some classes of LTRs also likely contributes to the propensity for LTRs from specific families to be exapted as alternative promoters ( Figure 1 B). The consensus sequence of MaLR LTRs, for example, including the mouse transcript (MT) subtypes (see Table 1 for classification of LTRs exapted as regulatory elements discussed here), harbors a conserved SD site that is utilized in many MT-initiated chimeric transcripts in oocytes (). Similarly, a primate-specific MaLR LTR, THE1B, which harbors an intact SD site, is aberrantly reactivated in Hodgkin’s lymphoma and drives expression of CSF1R transcripts (). Alternatively, mutations within LTRs may generate novel SD sites, as is the case for the highly expressed oocyte-specific Spin1 transcript, also driven by an MT LTR (). Furthermore, at specific loci, cryptic SD sites may be present in the flanking genomic sequence downstream of a transcriptionally active LTR. Regardless, the presence of an SD site within or immediately downstream of the LTR minimizes the length of the 5′ UTR. This decreases the likelihood that the transcript will contain a cryptic start codon upstream of the canonical start codon, thus preserving the native ORF in the resulting chimeric mRNA, and may stabilize the nascent RNA, as SD sites may compete with termination signals ().
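The relationship between splicing from an LTR splice donor and the risk of upstream start codons can be illustrated with a minimal Python sketch. The sequences below and the helper `upstream_atg_positions` are hypothetical toy constructs, not taken from any real locus, and the sketch only counts ATG triplets; it ignores Kozak context and reading frame.

```python
from typing import List

def upstream_atg_positions(five_prime_utr: str) -> List[int]:
    """Return 0-based positions of every ATG triplet in a 5' UTR sequence."""
    seq = five_prime_utr.upper().replace("U", "T")  # accept RNA or DNA letters
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

# Toy sequences (assumptions for illustration only).
ltr_exon = "GCCTCAGTCCTG"        # LTR-derived exon 1, ending at the SD site
intervening = "TTATGCCAGATGGA"   # genomic sequence between the LTR and exon 2
exon2_utr = "CCGAGCTAAG"         # untranslated start of the genic exon 2

# Splicing from the LTR SD site removes the intervening sequence.
spliced_utr = ltr_exon + exon2_utr
# Without a usable SD site, the intervening sequence stays in the 5' UTR.
unspliced_utr = ltr_exon + intervening + exon2_utr

spliced_hits = upstream_atg_positions(spliced_utr)
unspliced_hits = upstream_atg_positions(unspliced_utr)
print(len(spliced_utr), spliced_hits)
print(len(unspliced_utr), unspliced_hits)
```

In this toy case the spliced 5′ UTR is shorter and carries no upstream ATG, whereas retaining the intervening sequence introduces two candidate cryptic start codons.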


Table 1 Examples of LTRs Exapted as Regulatory Elements in Human and Mouse and Their Classification

Several studies in mammals indicate that ERVs have been more frequently exapted as cis-regulatory elements relative to other TEs (). Consistent with these observations, ERVs have been reported to evolve more rapidly than other TEs, as evidenced by orthologous ERVs in humans and chimpanzees exhibiting signatures of directional selection since the human-chimp divergence ∼five million years ago (). These observations are unlikely to be explained by the integration sites of ERVs, as retrotransposons of all types are most prevalent in intergenic regions, and older LTR and LINE elements are underrepresented within 5 kb of gene promoters, perhaps due to their negative impact on expression of proximal genes and, in turn, host fitness (). Rather, the frequent co-option of ERV sequences for gene regulation may be due to the relatively high probability of recombination between the 5′ and 3′ LTRs of intact proviruses, which deletes the internal region, leaving a single or “solo” LTR at the original integration site () ( Figure 1 B). Recombination between 5′ and 3′ LTRs has generated an estimated 577,000 “solo” LTRs in the human genome, representing the vast majority of annotated ERV sequences (). Notably, both full-length intact ERVs and solo LTRs are underrepresented specifically in the sense orientation within introns, likely reflecting the generally deleterious effects of insertion of polyadenylation signals encoded by LTRs (). As LTRs harbor the regulatory regions required for proviral transcription, generally including combinations of transcription factor binding sites (TFBSs), they have the intrinsic capacity to autonomously recruit cellular TFs and in turn to maximize transcription of proviral mRNA in specific cell types. Indeed, LTR-derived TFBSs are now known to have contributed up to ∼20% of functional binding sites for many TFs in human and mouse (), including p53, OCT4, SOX2, and NANOG (). 
In contrast with the regulatory properties of LTRs, the majority of LINE1 elements are truncated at the 5′ end, which removes the regulatory region and canonical transcription start site (TSS) (Figure 1A) of these RNA polymerase II-driven elements, rendering them transcriptionally “dead on arrival” ().


In addition, as active LTR promoters exhibit the same chromatin modification patterns found at active genic promoters, including H3K4me3 and DNase I hypersensitivity, profiling of these features by NGS can also be exploited to identify candidate LTR promoters () (Figure 1C). For example, ChIP-seq for H3K4me3 on cyclic AMP and progesterone-treated human decidualized stromal cells showed that ∼31% of active promoters mapped in those cells overlap with ancient mammalian TEs, including LTRs (). Similarly, analysis of DNase I hypersensitivity data from a large panel of human embryonic, adult, and cancer cell lines revealed that up to ∼80% of LTRs are located in open chromatin regions in a cell-type-specific manner (), and intersection with ENCODE H3K4me3 ChIP-seq data revealed that a subset of these LTRs are active promoters.
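The intersection logic described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the coordinates are invented toy data, intervals are half-open `(chrom, start, end)` tuples, and the function names are hypothetical. It is not the pipeline used in the cited studies, which would operate on genome-wide peak files with dedicated interval tools.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def candidate_ltr_promoters(ltrs, h3k4me3_peaks, dhs_sites):
    """LTRs overlapping both an H3K4me3 peak and a DNase I hypersensitive site."""
    return [
        ltr for ltr in ltrs
        if any(overlaps(ltr, peak) for peak in h3k4me3_peaks)
        and any(overlaps(ltr, site) for site in dhs_sites)
    ]

# Toy annotations (assumed values for illustration only).
ltrs = [("chr1", 100, 500), ("chr1", 2000, 2400), ("chr2", 100, 500)]
h3k4me3_peaks = [("chr1", 400, 900), ("chr2", 50, 150)]
dhs_sites = [("chr1", 450, 600), ("chr1", 2100, 2200)]

candidates = candidate_ltr_promoters(ltrs, h3k4me3_peaks, dhs_sites)
fraction = len(candidates) / len(ltrs)
print(candidates, fraction)
```

Requiring both marks is a deliberately conservative choice for the sketch; an LTR in open chromatin without H3K4me3 could instead be an enhancer candidate.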


With the widespread use of next-generation sequencing (NGS) technologies and complementary development of bioinformatics tools to exploit such datasets, novel transcripts, including those expressed at relatively low levels, can now be easily identified and enumerated. Indeed, LTR promoter usage in a given cell type can now be readily inferred genome-wide from RNA sequencing (RNA-seq) data. Paired-end RNA-seq data in particular have been used to identify candidate chimeric transcripts (). RNA-seq data have also been employed for de novo transcriptome assembly to identify LTR promoter usage in an unbiased manner in the developing oocyte and to identify novel chimeric transcripts initiating in RLTR10B in mouse testis (). Recent technological advances have led to significant increases in library read depth and standard read lengths, increasing the probability of mapping unique reads within such repetitive elements and in turn the identification of chimeric transcripts showing a broad range of expression levels.
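As a rough illustration of how paired-end data can flag candidate chimeric transcripts, the following Python sketch reports read pairs in which one mate aligns within an annotated LTR and the other within a genic exon. All names, coordinates, and the `Interval` type are hypothetical; a real analysis would start from aligner output and would also need to handle multi-mapping reads, strand, and spliced alignments.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least 1 bp."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def chimeric_pairs(read_pairs, ltrs, exons):
    """Return (pair_id, ltr_name, exon_name) for pairs linking an LTR to an exon."""
    hits = []
    for pair_id, mate1, mate2 in read_pairs:
        # Try both assignments: either mate may be the one inside the LTR.
        for in_ltr, in_exon in ((mate1, mate2), (mate2, mate1)):
            ltr = next((l for l in ltrs if overlaps(in_ltr, l)), None)
            exon = next((e for e in exons if overlaps(in_exon, e)), None)
            if ltr is not None and exon is not None:
                hits.append((pair_id, ltr.name, exon.name))
                break
    return hits

# Toy annotations and alignments (assumed values for illustration only).
ltrs = [Interval("chr1", 1000, 1400, "toy_LTR")]
exons = [Interval("chr1", 5000, 5200, "toy_gene_exon2")]
read_pairs = [
    ("pair1", Interval("chr1", 1300, 1400, "r1"), Interval("chr1", 5050, 5150, "r2")),
    ("pair2", Interval("chr1", 5000, 5100, "r1"), Interval("chr1", 5100, 5200, "r2")),
    ("pair3", Interval("chr2", 1300, 1400, "r1"), Interval("chr1", 5050, 5150, "r2")),
]

hits = chimeric_pairs(read_pairs, ltrs, exons)
print(hits)
```

Only `pair1` is reported here: `pair2` lies entirely within the exon, and `pair3` has a mate on a different chromosome from the LTR.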


Early studies of the role of LTR elements as candidate genic promoters relied on single-gene analyses using methods such as 5′ rapid amplification of cDNA ends (RACE) or PCR. Subsequently, higher-throughput approaches were developed, including those based on sequence mining of EST or RefSeq databases (), or the combination of EST data with high-throughput sequencing by capped analysis of gene expression (CAGE) ().


In support of this model, a substantial number of LTRs have been reported to function as tissue-specific primary or alternative promoters in a variety of mammalian cell types, including in the early mouse embryo, placenta, human and mouse pluripotent stem cells, mouse erythroid cells, and growing mouse oocytes (). Notably, many of the LTRs that have apparently been exapted as genic promoters are not only lineage specific but also show clear differences in transcriptional activity between cell types in the given species. For example, in mouse zygote and two-cell-stage embryos, LTRs from the class III LTR retrotransposon MERVL drive expression of a cohort of stage-specific genes (), whereas MaLR and ERVK family LTRs drive expression of many mouse oocyte-specific transcripts (). Similarly, in human pluripotent stem cells, LTR7, derived from the primate-specific HERV-H, drives transcription of many pluripotency-associated lncRNAs (). Furthermore, LTR3B, LTR14B, LTR12C, MLT2A1, THE1A, and LTR5_Hs are expressed at discrete stages during the progression of human preimplantation embryo development from the zygote to the morula stage and serve as promoters for a class of previously unannotated transcripts that may serve important functions at these stages ().


LTRs as Tissue-Specific Genic Promoters

Further evidence for strong selective pressure for novel tissue-specific promoters of protein-coding genes can be inferred from the exaptation of different LTRs for orthologous genes in independent lineages. Emera and co-workers (2012) showed that MER39 and MER77 LTRs (Table 1) were independently exapted as novel promoters in primates and rodents, respectively, for the Prolactin gene, which is expressed in endometrial cells during pregnancy and is essential for normal gestation (Emera et al., 2012). In addition, different LTRs have been independently exapted as promoters for the anti-apoptotic gene NAIP. In primates, testis-specific NAIP transcripts are driven by the MER21C LTR, while in rodents, Naip is expressed in many different tissues from ORR1E or MT-C LTRs (Romanish et al., 2007). Although these are isolated cases, many other instances of exaptation may have occurred earlier in mammalian evolution, with the regulatory elements in question no longer recognizable as LTRs.

In addition to promoting gene expression, the co-option of LTRs as promoters also provides the opportunity for TF-directed repression, as evidenced by a recent study that found that KLF3 enforces transcriptional repression of ORR1A0 LTR-driven transcripts in mouse fetal and adult erythroid cells (Mak et al., 2014). Whether suppression of such ORR1A0 LTR-driven chimeric transcripts serves only to prevent aberrant genic transcription emanating from the LTR remains to be determined. Chromatin modifiers may also direct the silencing of LTR-driven genes, similar to non-TE-derived genic promoters (Isbel et al., 2015; Karimi et al., 2011; Macfarlan et al., 2011). Thus, LTR promoters are apparently as versatile as typical genic promoters, allowing for both positive and negative regulation of their cognate genes. Similarly, purifying selection for KRAB-ZFP-directed gene repression may reflect the persistence of ancient KRAB-ZFP binding sites in degenerate LTRs and/or nonrepetitive regions near genes (Wolf et al., 2015b; Friedli and Trono, 2015).

Whether LTRs functioning as genic promoters generally exhibit substantial sequence differences relative to their ancestral sequence has not been systematically addressed. However, a recent study examining Prolactin expression in the placenta, which is driven by an MER39 LTR in various primate lineages, but not in non-ape species (Emera and Wagner, 2012a), sheds some light on the role of “fine-tuning” of LTR promoters. While the ancestral MER39 LTR present in all primates and rodents possessed an intact ETS1 binding site at the time of integration, this LTR was a weak promoter in non-ape species and was replaced by the MER77 LTR as the major Prolactin promoter in mice (Emera and Wagner, 2012a). However, over millions of years of ape evolution, MER39 was gradually transformed into a strong promoter by selection for base substitutions that synergized with the ancestral ETS1 site in the LTR and consequently increased promoter strength (Emera and Wagner, 2012a). Thus, although the primordial LTR possessed a functional TFBS, it likely acted inefficiently as a promoter in the placenta and required a series of substitutions to refine its activity. This finding is consistent with previous work showing that species-specific expression of genes near TEs is positively correlated with the number of bound TFBSs in the TE, with a minimum of two bound TFBSs required to detect the correlation (Xie et al., 2010). The mechanism termed “epistatic capture” was proposed to describe the process by which a TE-derived TFBS comes under increased purifying selection as a consequence of epistatic interactions with nearby TFBSs refined by mutations over evolutionary time (Emera and Wagner, 2012b) (Figure 1C).
Notably, this mechanism also accounts for the tissue specificity of LTR exaptation into promoters/enhancers, since the positive epistatic interactions between the TE-derived ancestral and newly derived TFBSs would be expected to occur only if they enhance recruitment of the TFs relevant to expression in that tissue. After the acquisition and selection for functional TFBSs within LTRs, the accumulation of additional mutations that are nonessential for their transcriptional activity will invariably lead to their progressive divergence from the ancestral sequence ( Figure 1 C). Indeed, LTRs co-opted as regulatory elements earlier in mammalian evolution may no longer be recognizable as repeat elements using conventional bioinformatics tools, raising the possibility that many more canonical gene promoters are actually derived from ancient LTRs.