Making error-free DNA from RNA

DNA polymerase enzymes copy DNA into new strands of identical DNA. Reverse transcriptase (RT) enzymes copy RNA into DNA. Unlike many DNA polymerases, RT enzymes do not have a proofreading function that checks for errors in the newly synthesized DNA. Ellefson et al. use in vitro directed evolution and protein engineering to build an error-correcting RT from a prokaryotic DNA polymerase. The RT “xenopolymerase” shows increased fidelity compared with natural RTs and should streamline and increase the precision of transcriptomics methods. Science, this issue p. 1590

Abstract

Most reverse transcriptase (RT) enzymes belong to a single protein family of ancient evolutionary origin. These polymerases are inherently error prone, owing to their lack of a proofreading (3′-5′ exonuclease) domain. To determine whether the lack of proofreading is a historical coincidence or a functional limitation of reverse transcription, we attempted to evolve a high-fidelity, thermostable DNA polymerase to use RNA templates efficiently. The evolutionarily distinct reverse transcription xenopolymerase (RTX) actively proofreads on DNA and RNA templates, which greatly improves RT fidelity. In addition, RTX enables applications such as single-enzyme reverse transcription–polymerase chain reaction and direct RNA sequencing without complementary DNA isolation. The creation of RTX confirms that proofreading is compatible with reverse transcription.

The molecular basis for life rests on the information flow between DNA, RNA, and proteins (1). Early notions of a unidirectional central dogma were amended after the discovery of the reverse transcriptase (RT) enzyme (2, 3). The RT family has a single ancient evolutionary origin based on amino acid homology and the presence of RT across multiple domains of life (4). RTs are involved in processes such as telomere addition, mitochondrial plasmid replication, transposition, and the proliferation of retroviral genomes (5). RT is also hypothesized to have catalyzed the transition from the RNA world to the DNA world by providing an avenue to copy RNA into more stable DNA genomes (6).

The progenitor of RT is postulated to be an RNA-dependent RNA polymerase. Because RNA polymerases generally lack an error-checking 3′-5′ exonuclease domain (4, 7), proofreading activity is also not present across the RT family, resulting in low-fidelity reverse transcription and characteristic quasispecies behavior in organisms that rely upon it for replication (8). In contrast to RTs, other DNA polymerase families have evolved exquisite proofreading mechanisms to increase DNA synthesis fidelity during genome replication (9).

To determine whether the evolutionary divide between RTs and DNA polymerases is a matter of history or function, we have attempted to directly evolve a reverse transcription xenopolymerase (RTX; Fig. 1A) from an error-correcting DNA polymerase using a modified directed evolution strategy (10), reverse transcription–compartmentalized self-replication (RT-CSR) (Fig. 1B). RT-CSR enables the simultaneous screening of up to 10^9 polymerase variants for RT activity.

Fig. 1 Evolution of a synthetic family of reverse transcriptases by RT-CSR. (A) Polymerase phylogeny depicts reverse transcription xenopolymerases (RTX) as a second, evolutionarily distinct, origin of RT function. (B) Framework for the directed evolution of hyperthermostable RT using reverse transcription compartmentalized self-replication (RT-CSR). Libraries of polymerase variants are created, expressed in Escherichia coli, and in vitro compartmentalized. During emulsion PCR, primers flanking the polymerase enable self-replication but are designed with a variable number of RNA bases separating the plasmid annealing portion from the unique recovery tag.

We chose the Archaeal family-B DNA polymerases (polB) for directed evolution of the RTX as they are monomeric, hyperthermostable, highly processive, and contain proofreading domains. Attempts to rationally design these enzymes to use RNA templates have met with limited success (11, 12), and initial experiments confirmed that two common polB enzymes from Pyrococcus furiosus and Thermococcus kodakarensis (KOD) (13, 14) failed to polymerize across five template RNA bases (fig. S1). Modeling to identify mutations enabling RT activity was deemed impractical, given the extensive contacts these polymerases make with the template (>50 direct interactions). We initiated evolution using low-stringency RT-CSR (10 RNA residues) with a random library (one or two amino acid mutations per gene) of KOD polymerase variants. As polymerases were enriched, we gradually increased RT-CSR stringency with the stepwise addition of RNA bases into primers (table S1). By cycle 18, primers were entirely composed of RNA—requiring reverse transcription of 176 residues to occur every thermal cycle to maintain exponential amplification in the emulsion polymerase chain reaction (PCR).
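The logic of raising stringency as the pool is enriched can be illustrated with a toy model. The sketch below is not the authors' procedure: the variant names, starting frequencies, per-round amplification factor, and intermediate schedule values are all hypothetical (the actual per-round primer compositions are given in table S1). Only the starting stringency of 10 RNA residues and the final requirement of 176 residues are taken from the text.

```python
# Toy model of selection under increasing stringency (all values hypothetical).

def select_round(pool, rna_bases, gain=10.0):
    """Return the pool after one round of selection.

    pool maps a variant name to (frequency, capacity), where capacity is the
    longest RNA stretch (in bases) the variant is assumed to copy. Variants
    whose capacity covers the RNA segment of the primer are amplified by
    `gain`; the rest are carried over unamplified. Frequencies are then
    renormalized to sum to 1.
    """
    weighted = {
        name: freq * (gain if capacity >= rna_bases else 1.0)
        for name, (freq, capacity) in pool.items()
    }
    total = sum(weighted.values())
    return {name: (weighted[name] / total, pool[name][1]) for name in pool}


def run_selection(pool, schedule, gain=10.0):
    """Apply select_round once per entry of schedule (RNA bases per round)."""
    for rna_bases in schedule:
        pool = select_round(pool, rna_bases, gain)
    return pool


# Hypothetical starting library: mostly inactive, with rare RT-capable variants.
library = {
    "wild_type_like": (0.98, 0),
    "weak_rt": (0.019, 20),
    "strong_rt": (0.001, 200),
}
# Hypothetical schedule that starts at 10 RNA bases and ends at 176.
schedule = [10, 10, 20, 50, 100, 176]

final = run_selection(library, schedule)
for name, (freq, _) in final.items():
    print(f"{name}: {freq:.4f}")
```

In this simplified picture, a variant that tolerates only short RNA stretches is enriched in early rounds but stops being amplified once the primers demand more than it can copy, so the rarest but most capable variant ends up dominating the pool.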

Profiling of polymerases revealed one variant, B11, which contained 37 mutations. RT-CSR enriched for RT activity, and B11 was capable of reverse transcription of at least 500 base pairs; however, sequencing and testing confirmed inactivation of the proofreading domain (fig. S2). Kinetic analyses established that B11 uses both DNA and RNA templates with similar efficiencies by greatly lowering the Michaelis constant (Km) on RNA:DNA heteroduplexes. We attempted to restore proofreading by transplantation of the wild-type 3′-5′ exonuclease, which reactivated proofreading capabilities, albeit to barely detectable levels. Encouraged that minimizing extraneous mutations could restore proofreading activity, we sought to design polymerases with a minimal set of mutations.
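To see why a lower Michaelis constant matters, recall the standard Michaelis-Menten relation v = Vmax[S]/(Km + [S]): at a substrate concentration well below Km the enzyme works far below its maximal rate. The sketch below evaluates that textbook formula for two Km values. The numbers are hypothetical and chosen only for illustration; they are not the measured constants for KOD or B11.

```python
def michaelis_menten_rate(substrate, v_max, k_m):
    """Initial rate from the Michaelis-Menten equation: v_max * [S] / (K_m + [S])."""
    return v_max * substrate / (k_m + substrate)


# Hypothetical values in arbitrary, mutually consistent concentration units.
v_max = 1.0
substrate = 10.0  # fixed primer:template concentration

for label, k_m in [("high Km", 1000.0), ("low Km", 10.0)]:
    rate = michaelis_menten_rate(substrate, v_max, k_m)
    print(f"{label}: v = {rate:.3f} x v_max")
```

With the substrate concentration held fixed, lowering Km by two orders of magnitude moves the enzyme from about 1% of its maximal rate to half of it, which is the sense in which a reduced Km on RNA:DNA heteroduplexes can equalize efficiency on the two template types.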

To understand how our process reshaped KOD polymerase to use RNA templates, we deep-sequenced RT-CSR cycles to recapitulate the evolutionary path to RT activity (Fig. 2A and table S2). Mutations were identified throughout the polymerase and accumulated along the template-binding interface so as to progressively increase the length of RNA that could be accommodated. The mutated positions are hypothesized to be molecular checkpoints used to enforce strict DNA template utilization: as the template enters, near the active site, and at the nascent duplex. Given the likely importance of these regions, we used computer modeling to determine the molecular basis for RNA utilization (figs. S3 and S4).

Fig. 2 Molecular checkpoints involved in template recognition. (A) Structural heat map of mutated residues over the RT-CSR process. Conserved mutations are colored incrementally darker shades of red to indicate frequency in the polymerase pool. Amino acid residues that were mutated in more than 50% of the population are labeled. [Figure was adapted from KOD structure PDB 4K8Z.] (B) Computer modeling of KOD (gray) and RTX mutations (orange) at checkpoints responsible for DNA and RNA template recognition at R97, Y384, and E664. Free-energy changes between wild-type KOD and RTX mutations are displayed in the insets. Single-letter abbreviations for the amino acid residues are as follows: A, Ala; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; M, Met; N, Asn; R, Arg; V, Val; and Y, Tyr.
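The per-residue frequencies behind a heat map like Fig. 2A can be tallied from aligned variant sequences. The following is a minimal sketch under simplifying assumptions, not the authors' analysis pipeline: it assumes every variant has already been translated and aligned to the reference with no insertions or deletions, and the five-residue sequences are invented for illustration.

```python
from collections import Counter


def mutation_frequencies(reference, variants):
    """Fraction of variants that differ from the reference at each position.

    Assumes each variant is the same length as the reference and already
    aligned to it. Positions are 1-based; unmutated positions are omitted.
    """
    counts = Counter()
    for seq in variants:
        if len(seq) != len(reference):
            raise ValueError("variants must be aligned to the reference")
        for pos, (ref_aa, var_aa) in enumerate(zip(reference, seq), start=1):
            if ref_aa != var_aa:
                counts[pos] += 1
    n = len(variants)
    return {pos: count / n for pos, count in counts.items()}


def labeled_positions(frequencies, threshold=0.5):
    """Positions mutated in more than `threshold` of the population."""
    return sorted(pos for pos, freq in frequencies.items() if freq > threshold)


# Hypothetical five-residue reference and four pooled variants.
reference = "MRYEK"
variants = ["MRYEK", "MKHEK", "MRHEK", "MRHKK"]

freqs = mutation_frequencies(reference, variants)
print(freqs)
print(labeled_positions(freqs))
```

Here only position 3 is mutated in more than half of the toy population, so it is the only one that would be labeled under the 50% rule described in the caption.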

The first selected mutation localized near the template entry site of the polymerase at position R97 (Fig. 2B). Proximal to this site, native polB scans for uracils (typically caused by cytosine deamination) by flipping template bases into a specialized pocket to halt polymerization until the mutation can be corrected by repair machinery (15, 16). Evolved polymerases contained a variety of amino acid mutations at R97, all of which destabilize a salt bridge to the phosphate backbone that presumably regulates base flipping into the pocket.

As template residues near the active site, they encounter mutation Y384H, which prevents Y384 and Y494 from hydrogen bonding to the 2′ hydroxyl of template RNA by reorganizing a hydrogen bonding network. After polymerization, in the thumb domain, the most prevalent mutations (E664K, G711V, and E735K) promote tighter homo- and heteroduplex binding in both A- and B-form conformations. The E664K mutation alone has been shown to increase binding to RNA:DNA heteroduplexes (17). To further validate that we had established an optimized set of mutations, we fully randomized several positions and repeated the RT-CSR. In support of our modeling, many amino acid solutions were viable at position R97, but other positions (Y384 and E664) had strong preferences for particular amino acids (fig. S5).

Modeled designs of several polymerases with favorable RT mutations were synthesized and tested empirically (fig. S6). The best-performing RTX contained fewer than half the mutations found in B11, without sacrificing catalytic efficiency or Km on RNA (fig. S7). Mutations in RTX did not affect desirable properties of parental KOD polymerase. Thermostability was maintained, with optimal RT occurring at ~70°C, and consequently RTX was capable of single-enzyme RT-PCR (in which RTX performs both first-strand RT synthesis and PCR amplification). Across several RNA samples and gene loci, RTX demonstrated high processivity on RNA templates, performing RT-PCR on RNAs more than 5 kb in length (fig. S8).

Initial testing of RTX using dideoxy mismatch primers in PCR demonstrated robust proofreading activity on DNA template (fig. S9), but it was unclear whether the proofreading mechanism was compatible during reverse transcription because RNA:DNA heteroduplexes can adopt A-form helical structures (18). Primer extension reactions with a canonical matched base pair or a 3′ deoxy mismatched pair (preventing extension until terminator excision) were tested. Both wild-type KOD and RTX were capable of extending mismatched primers on DNA templates, unlike exonuclease-deficient mutants. When tested on an RNA template, KOD’s exonuclease was stimulated—actively degrading the priming oligonucleotide. In contrast, RTX could extend the mismatched primer with activity indistinguishable from that of DNA templated proofreading (Fig. 3A).

Fig. 3 RTX polymerase proofreads during reverse transcription. (A) Primer extension reactions of KOD and RTX polymerases and their proofreading-deficient counterparts (exo-), on both DNA and RNA templates. Extension reactions were performed with matched 3′ primer:templates (gray) or a 3′ deoxy mismatch (orange), which must be excised before extension can proceed. The primer is denoted by a gray arrow, extended product is in green, and exonuclease-degraded primer is in red. (B) Deep sequencing of reverse transcription reaction on HSPCB gene. The overall error rate was determined by dividing the sum of base substitutions and insertions or deletions by the total number of bases sequenced. The error profile of MMLV, RTX, and RTX exo- is shown as frequency of errors per million bases sequenced.

Given that RTX is capable of proofreading during reverse transcription, we hypothesized that it may have increased RT fidelity compared to natural polymerases. Barcoded primers used during RT of several human mRNAs allowed multiple reads of a single cDNA during deep sequencing, reducing background sequencing errors by several orders of magnitude (fig. S10) (19). Sequencing analyses revealed that the control retroviral RT [Moloney murine leukemia virus (MMLV)] had an error rate of 1.1 × 10^-4 to 4.8 × 10^-4, whereas RTX had an error rate of 3.5 × 10^-5 to 3.7 × 10^-5 (3- to 10-fold lower) (Fig. 3B). The mutational spectra of RTX favored G-to-A transitions and G-to-T transversions, which accounted for nearly half the observed mutations. Inactivating the RTX’s proofreading capabilities increased error frequency nearly threefold, providing evidence that active proofreading was occurring during RT. Inactivating the proofreading of RTX also shifted the mutational bias (Fig. 3B and table S3). Given that the barcoding error detection limit is identical to the observed error of RTX (table S3) (19), we anticipate the true error rate for RTX to be even lower than reported.
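The barcoding idea can be sketched in a few lines: reads that share a barcode derive from one cDNA, so an error seen in only one of them is attributed to sequencing, whereas an error present in the consensus is attributed to reverse transcription. The code below is a simplified illustration, not the published analysis (19). It assumes equal-length, pre-aligned reads, counts substitutions only (the rate in Fig. 3B also includes insertions and deletions), and uses invented sequences and a hypothetical minimum of three reads per barcode.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_reads=3):
    """Majority-vote consensus sequence for each barcode.

    reads is an iterable of (barcode, sequence) pairs. Sequences sharing a
    barcode are assumed to be aligned and of equal length. Barcodes with
    fewer than `min_reads` reads are dropped. Ties are broken in favor of
    the base seen first.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)
    consensus = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


def substitution_rate(reference, sequences):
    """Substitutions divided by total bases compared (indels not handled)."""
    errors = 0
    total = 0
    for seq in sequences:
        errors += sum(1 for ref, obs in zip(reference, seq) if ref != obs)
        total += len(seq)
    return errors / total if total else 0.0


# Hypothetical reference and reads.
reference = "ACGTACGT"
reads = [
    ("bc1", "ACGTACGT"), ("bc1", "ACGTACGT"), ("bc1", "ACGAACGT"),  # one read-level error
    ("bc2", "ACGTACCT"), ("bc2", "ACGTACCT"), ("bc2", "ACGTACCT"),  # error in every read
    ("bc3", "ACGTACGT"),                                            # too few reads
]

consensus = consensus_by_barcode(reads)
rate = substitution_rate(reference, consensus.values())
print(consensus)
print(rate)
```

In this toy input the isolated error under the first barcode is voted out, while the substitution shared by all reads of the second barcode survives into the consensus and is counted toward the error rate.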

RTX has the potential to streamline workflows (combining RT and PCR steps) and increase the precision of transcriptomics, reducing biases and errors introduced in the reverse transcription step of RNA-sequencing protocols (20). To demonstrate its utility, we introduced RTX into a commonly used platform for RNA sequencing. Analysis revealed nearly identical coverage and expression profiles (fig. S11), suggesting that RTX is compatible with established workflows. In addition, we developed a more streamlined protocol to directly sequence RNA. Using a traditional Sanger sequencing approach (21), we directly sequenced a GATC 5 RNA repeat (fig. S12). Direct RNA sequencing should be adaptable to single-molecule sequencing platforms, enabling high-throughput and high-fidelity sequencing of complex RNA samples by eliminating the biases created in cDNA synthesis and subsequent amplification.

The expanded template specificity of the RTX lineage may presage the ability to use entirely new chemistries in genetics. Primer extension reactions on a template containing a ribose sugar analog [2′-O-methyl (2′-OMe) DNA] indicated that RTX could copy alternative templates, but with much lower efficiency, indicating a preference for RNA substrates (fig. S13). However, RTX was still far more efficient at using 2′-OMe DNA than the parental wild-type enzyme, allowing the possibility of further optimization and, owing to 2′-OMe stability, potential therapeutic applications.

The RTX RNA reverse transcriptase function is fundamentally distinct from that of the retroelement lineage. Using RT-CSR, we have altered the substrate specificity of a high-fidelity DNA polymerase, highlighting the plasticity of highly conserved molecular machinery. Ostensibly, the mutations identified unlocked molecular checkpoints in the discrimination of DNA and RNA, but did not disrupt the proofreading capabilities of the polymerase. This was unexpected, especially given that RNA:DNA hybrid duplexes often form A-helical structures unlike DNA:DNA duplexes, and may provide insights into the transition from polymerization to editing modes of the polymerase.

Only a handful of mutations were required to impart RT activity, suggesting that the evolutionary hurdle for forming high-fidelity reverse transcription is relatively low. Nevertheless, all known retroelements use proofreading-deficient RTs, suggesting that high error rates are either a historical coincidence or an evolutionary strategy to promote diversity. Another possible explanation is that high fidelity was never required simply because RNA genomes are small as a result of their inherent instability (22). Given the plasticity of these polymerases for modified templates and the adaptability of the RT-CSR framework (as primers are simply programmed to contain modified bases), RTX evolution should be compatible with many base and sugar analogs (23–26). Combination with previously evolved XNA polymerases could enable synthesis of genomes entirely composed of artificial nucleic acids (27).

Supplementary Materials
www.sciencemag.org/content/352/6293/1590/suppl/DC1
Materials and Methods
Figs. S1 to S13
Tables S1 to S3
Sequences

Acknowledgments: We thank the University of Texas Genomic Sequencing and Analysis Facility for DNA sequencing support. This work was supported by the Defense Advanced Research Projects Agency (HR0011-12-2-0001), a National Security Science and Engineering Faculty Fellowship (FA9550-10-1-0169), NASA (NNX15AF46G), and the Welch Foundation (F-1654). J.W.E. conceived of the project. J.W.E., J.G., and R.S. performed research with contributions from H.S. and V.R.I. J.W.E., J.G., R.S., and A.D.E. wrote the manuscript. The authors declare competing financial interests. A.D.E. was a paid consultant for Enzymatics. A provisional patent has been filed on the sequence composition of the reverse transcriptase.