SARS-CoV-2 is a betacoronavirus responsible for the COVID-19 pandemic. Although the SARS-CoV-2 genome was reported recently, its transcriptomic architecture is unknown. Utilizing two complementary sequencing techniques, we present a high-resolution map of the SARS-CoV-2 transcriptome and epitranscriptome. DNA nanoball sequencing shows that the transcriptome is highly complex owing to numerous discontinuous transcription events. In addition to the canonical genomic and 9 subgenomic RNAs, SARS-CoV-2 produces transcripts encoding unknown ORFs with fusion, deletion, and/or frameshift. Using nanopore direct RNA sequencing, we further find at least 41 RNA modification sites on viral transcripts, with the most frequent motif, AAGAA. Modified RNAs have shorter poly(A) tails than unmodified RNAs, suggesting a link between the modification and the 3′ tail. Functional investigation of the unknown transcripts and RNA modifications discovered in this study will open new directions to our understanding of the life cycle and pathogenicity of SARS-CoV-2.

In this study, we combined two complementary sequencing approaches, DRS and SBS. We unambiguously mapped the sgRNAs, ORFs, and TRSs of SARS-CoV-2. Additionally, we found numerous unconventional RNA joining events that are distinct from canonical TRS-mediated polymerase jumping. We further discovered RNA modification sites and measured the poly(A) tail length of gRNAs and sgRNAs.

Deep sequencing technologies offer powerful means to investigate viral transcriptome. The “sequencing-by-synthesis (SBS)” methods such as the Illumina and MGI platforms confer high accuracy and coverage. However, they are limited by short read length (200–400 nt), so the fragmented sequences should be re-assembled computationally, during which the haplotype information is lost. More recently introduced is the nanopore-based direct RNA sequencing (DRS) approach. Although nanopore DRS is limited in sequencing accuracy, it enables long-read sequencing, which would be particularly useful for the analysis of long nested CoV transcripts. Moreover, because DRS detects RNA instead of cDNA, the RNA modification information can be obtained directly during sequencing (). Numerous RNA modifications have been found to control eukaryotic RNAs and viral RNAs (). Terminal RNA modifications such as RNA tailing also play a critical role in cellular and viral RNA regulation ().

Each coronaviral RNA contains the common 5′ “leader” sequence of ∼70 nt fused to the “body” sequence from the downstream part of the genome () ( Figure 1 ). According to the prevailing model, leader-to-body fusion occurs during negative-strand synthesis at short motifs called transcription-regulatory sequences (TRSs) that are located immediately adjacent to ORFs ( Figure 1 ). TRSs contain a conserved 6–7 nt core sequence (CS) surrounded by variable sequences. During negative-strand synthesis, RdRP pauses when it crosses a TRS in the body (TRS-B) and switches the template to the TRS in the leader (TRS-L), which results in discontinuous transcription leading to the leader-body fusion. From the fused negative-strand intermediates, positive-strand mRNAs are transcribed. The replication and transcription mechanism has been studied in other coronaviruses. However, it is unclear whether the general mechanism also applies to SARS-CoV-2 and if there are any unknown components in the SARS-CoV-2 transcriptome. For the development of diagnostic and therapeutic tools and the understanding of this new virus, it is critical to define the organization of the SARS-CoV-2 genome.

The viral genome is also used as the template for replication and transcription, which is mediated by nsp12 harboring RNA-dependent RNA polymerase (RdRP) activity (). Negative-sense RNA intermediates are generated to serve as the templates for the synthesis of positive-sense genomic RNA (gRNA) and subgenomic RNAs (sgRNAs). The gRNA is packaged by the structural proteins to assemble progeny virions. Shorter sgRNAs encode conserved structural proteins (spike protein [S], envelope protein [E], membrane protein [M], and nucleocapsid protein [N]), and several accessory proteins. SARS-CoV-2 is known to have at least six accessory proteins (3a, 6, 7a, 7b, 8, and 10) according to the current annotation (GenBank: NC_045512.2 ). However, the ORFs have not yet been experimentally verified for expression. Therefore, it is currently unclear which accessory genes are actually expressed from this compact genome.

CoVs carry the largest genomes (26–32 kb) among all RNA virus families ( Figure 1 ). Each viral transcript has a 5′-cap structure and a 3′ poly(A) tail (). Upon cell entry, the genomic RNA is translated to produce nonstructural proteins (nsps) from two open reading frames (ORFs), ORF1a and ORF1b. The ORF1a produces polypeptide 1a (pp1a, 440–500 kDa) that is cleaved into 11 nsps. The −1 ribosome frameshift occurs immediately upstream of the ORF1a stop codon, which allows continued translation of ORF1b, yielding a large polypeptide (pp1ab, 740–810 kDa) which is cleaved into 15 nsps. The proteolytic cleavage is mediated by viral proteases nsp3 and nsp5 that harbor a papain-like protease domain and a 3C-like protease domain, respectively.

From the full-length genomic RNA (29,903 nt) that also serves as an mRNA, ORF1a and ORF1b are translated. In addition to the genomic RNA, nine major subgenomic RNAs are produced. The sizes of the boxes representing small accessory proteins are bigger than the actual size of the ORF for better visualization. The black box indicates the leader sequence. Note that our data show no evidence for ORF10 expression.

Coronavirus disease 19 (COVID-19) is caused by a novel coronavirus designated as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (). Like other coronaviruses (order Nidovirales, family Coronaviridae, subfamily Coronavirinae), SARS-CoV-2 is an enveloped virus with a positive-sense, single-stranded RNA genome of ∼30 kb. SARS-CoV-2 belongs to the genus betacoronavirus, together with SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV) (with 80% and 50% homology, respectively) (). Coronaviruses (CoVs) were thought to primarily cause enzootic infections in birds and mammals. However, the recurring outbreaks of SARS, MERS, and now COVID-19 have clearly demonstrated the remarkable ability of CoVs to cross species barriers and transmit between humans ().

Results and Discussion

Kim et al., 2020 Kim J.M.

Chung Y.S.

Jo H.J.

Lee N.J.

Kim M.S.

Woo S.H.

Park S.

Kim J.W.

Kim H.M.

Han M.G. Identification of Coronavirus Isolated from a Patient in Korea with COVID-19. Figure 2 Statistics of Sequencing Data Show full caption (A) Read counts from nanopore direct RNA sequencing of total RNA from Vero cells infected with SARS-CoV-2. “Leader+” indicates the viral reads that contain the 5′ end leader sequence. “No leader” denotes the viral reads lacking the leader sequence. “Nuclear” reads match mRNAs from the nuclear chromosome while “mitochondrial” reads are derived from the mitochondrial genome. “Control” indicates quality control RNA for nanopore sequencing. (B) Genome coverage of the nanopore direct RNA sequencing data shown in (A). The stepwise reduction in coverage corresponds to the borders expected for the canonical sgRNAs. The smaller inner plot magnifies the 5′ part of the genome. (C) Read counts from DNA nanoball sequencing using MGISEQ. Total RNA from Vero cells infected with SARS-CoV-2 was used for sequencing. (D) Genome coverage of the DNA nanoball sequencing (DNB-seq) data shown in (C). See also Figure S1 To delineate the SARS-CoV-2 transcriptome, we first performed DRS runs on a MinION nanopore sequencer with total RNA extracted from Vero cells infected with SARS-CoV-2 (BetaCoV/Korea/KCDC03/2020). The virus was isolated from a patient who was diagnosed with COVID-19 on January 26, 2020, after traveling from Wuhan, China (). We obtained 879,679 reads from infected cells (corresponding to a throughput of 1.9 Gb) ( Figure 2 A). The majority (65.4%) of the reads mapped to SARS-CoV-2, indicating that viral transcripts dominate the transcriptome while the host gene expression is strongly suppressed. Although nanopore DRS has the 3′ bias due to directional sequencing from the 3′ ends of RNAs, approximately half of the viral reads still contained the 5′ leader.

The SARS-CoV-2 genome was almost fully covered, missing only 12 nt from the 5′ end due to the known inability of DRS to sequence the terminal ∼12 nt ( Figure 2 B). The longest tags (111 reads) correspond to the full-length gRNA ( Figure 2 B). The coverage of the 3′ side of the viral genome is substantially higher than that of the 5′ side, which reflects the nested sgRNAs. This is also partly due to the 3′ bias of the directional DRS technique. The common presence of the leader sequence (72 nt) in viral RNAs results in a prominent coverage peak at the 5′ end, as expected. We could also clearly detect vertical drops in the coverage, whose positions correspond to the leader-body junction in sgRNAs. All known sgRNAs are supported by DRS reads, with an exception of ORF10 (see below).

To further validate sgRNAs and their junction sites, we performed DNA nanoball sequencing (DNB-seq) based on the sequencing-by-synthesis principle and obtained 305,065,029 reads with an average insert length of 220 nt ( Figure 2 C). The results are overall consistent with the DRS data. The leader-body junctions are frequently sequenced, giving rise to a sharp peak at the 5′ end in the coverage plot ( Figure 2 D). The 3′ end exhibits a high coverage as expected for the nested transcripts.

Figure 3 Viral Subgenomic RNAs and Their Recombination Sites Show full caption (A) Frequency of discontinuous mappings in the long reads from the DNB-seq data. The color indicates the number of reads with large gaps spanning between two genomic positions (starting from a coordinate in the x axis and ending in a coordinate in the y axis). The counts were aggregated into 100-nt bins for both axes. The red asterisk on the x axis indicates the column containing the leader TRS. Please note that the leftmost column was expanded horizontally on this heatmap to improve visualization. The red dots on the sub-plot alongside the y axis denote local peaks which coincide with the 5′ end of the body of each sgRNA. (B) Transcript abundance was estimated by counting the DNBseq reads that span the junction of the corresponding RNA. (C) Top 50 sgRNAs. The asterisk indicates an ORF beginning at 27,825 that may encode the 7b protein with an N-terminal truncation of 23 amino acids. The gray bars denote minor transcripts that encode proteins with an N-terminal truncation compared with the corresponding overlapping transcript. The black bars indicate minor transcripts that encode proteins in a different reading frame from the overlapping major mRNA. (D) Canonical discontinuous transcription that is mediated by TRS-L and TRS-B. (E) TRS-L-dependent noncanonical fusion between the leader TRS and a noncanonical 3′ site in the body. (F) TRS-L-independent long-distance (>5,000 nt) fusion. (G) TRS-L-independent local joining yielding a deletion between proximal sites (20–5,000 nt distance). See also Figures S2 and S3 and Tables S2 S3 , and S4 The depth of DNB-seq allowed us to confirm and examine the junctions on an unprecedented scale for a CoV genome. We mapped the 5′ and 3′ breakpoints at the junctions and estimated the fusion frequency by counting the reads spanning the junctions ( Figure 3 A). The leader represents the most prominent 5′ site, as expected ( Figure 3 A, red asterisk on the x axis). The known TRS-Bs are detected as the top 3′ sites ( Figure 3 A, red dots on the y axis). These results confirm that SARS-CoV-2 uses the canonical TRS-mediated template-switching mechanism for discontinuous transcription to produce major sgRNAs ( Figure 3 B). Quantitative comparison of the junction-spanning reads shows that the N RNA is the most abundantly expressed transcript, followed by S, 7a, 3a, 8, M, E, 6, and 7b ( Figure 3 C).

It is important to note that ORF10 is represented by only one read in DNB data (0.000009% of viral junction-spanning reads) and that it was not supported at all by DRS data. ORF10 does not show significant homology to known proteins. Thus, ORF10 is unlikely to be expressed. The annotation of ORF10 should be reconsidered. Taken together, SARS-CoV-2 expresses nine canonical sgRNAs (S, 3a, E, M, 6, 7a, 7b, 8, and N) together with the gRNA ( Figures 1 and 3 C).

Of note, the junctions in these noncanonical transcripts are not derived from a known TRS-B. Some junctions show short sequences (3–4 nt) common between the 5′ and 3′ sites, suggesting a partial complementarity-guided template switching (“polymerase jumping”). However, the majority do not have any obvious sequences. Thus, we cannot exclude a possibility that at least some of these transcripts are generated through a different mechanism(s).

Although the noncanonical transcripts may arise from erroneous replicase activity, it remains an open question if the fusion has an active role in viral life cycle and evolution. Although individual RNA species are not abundant, the combined read numbers are often comparable to the levels of accessory transcripts. Most of the RNAs have coding potential to yield proteins. Transcripts that belong to the “TRS-L-independent distant” group encode the upstream part of ORF1a, including nsp1, truncated nsp2, and/or truncated nsp3, whose summed abundance is ∼20% of gRNA. Depending on translation efficiency, the protein products may change the stoichiometry between nsps ( Figure 3 F; Table S4 ). Another notable example is the 7b protein with an N-terminal truncation that may be produced at a level similar to the annotated full-length 7b ( Figure 3 C, asterisk). Frameshifted or deleted ORFs may also generate shorter proteins that are distinct from known viral proteins ( Figure 3 C). It will be interesting in the future to examine if these unknown ORFs are actually translated and yield functional products.

Wu et al. (2013) Wu H.Y.

Ke T.Y.

Liao W.Y.

Chang N.Y. Regulation of coronaviral poly(A) tail length during infection. Tomecki et al., 2004 Tomecki R.

Dmochowska A.

Gewartowski K.

Dziembowski A.

Stepien P.P. Identification of a novel human nuclear-encoded mitochondrial poly(A) polymerase. Tvarogová et al., 2019 Tvarogová J.

Madhugiri R.

Bylapudi G.

Ferguson L.J.

Karl N.

Ziebuhr J. Identification and Characterization of a Human Coronavirus 229E Nonstructural Protein 8-Associated RNA 3′-Terminal Adenylyltransferase Activity. Figure 4 Length of Poly(A) Tail Show full caption (A and B) Kernel density plots showing poly(A) tail length distribution of viral transcripts without (A) or with (B) a subpeak near 30 nt. Arrowheads indicate peaks at ~30 and ~45 nt. (C) Kernel density plots showing poly(A) tail length distribution of quality control RNA that has a 30-nt poly(A) tail, host mRNAs from the nuclear chromosome, or host RNAs from the mitochondrial chromosome. As nanopore DRS is based on single-molecule detection of RNA, it offers a unique opportunity to examine multiple epitranscriptomic features of individual RNA molecules. We recently developed software to measure the length of poly(A) tail from DRS data (Y. Choi and H.C., unpublished data). Using this software, we confirm that, like other CoVs, SARS-CoV-2 RNAs carry poly(A) tails ( Figures 4 A–4B). The tail of viral RNAs is 47 nt in median length. The full-length gRNA has a relatively longer tail than sgRNAs. Notably, sgRNAs have two tail populations: a minor peak at ∼30 nt and a major peak at ∼45 nt ( Figure 4 B, arrowheads).previously observed that the poly(A) tail length of bovine CoV mRNAs changes during infection: from ∼45 nt immediately after virus entry to ∼65 nt at 6–9 hours post-infection and ∼30 nt at 120–144 hours post-infection. Thus, the short tails of ∼30 nt observed in this study may represent aged RNAs that are prone to decay. Viral RNAs exhibit a homogeneous length distribution, unlike host nuclear genome-encoded mRNAs ( Figure 4 C). The distribution is similar to that of mitochondrial chromosome-encoded RNAs whose tail is generated by MTPAP (). It was recently shown that HCoV-229E nsp8 has an adenylyltransferase activity, which may extend poly(A) tail of viral RNA (). Because poly(A) tail should be constantly attacked by host deadenylases, the regulation of viral RNA tailing is likely to be important for the maintenance of genome integrity. Poly(A) tail of mRNA is also generally critical for stability control and translation through its interaction with poly(A) binding proteins (PABPs). Cytoplasmic PABPs facilitate deadenylation by the CCR4-NOT complex while blocking untimely decay by exosome and uridylation machinery. PABPs also interact with translation initiation factors to allow translation. Thus, the viral tail is likely to play multiple roles for translation, decay, and replication.

Figure 5 Frequent RNA Modification Sites Show full caption (A) Distinct ionic current signals (“squiggles”) from viral S transcript (green lines) and in vitro transcribed control (IVT, black lines) indicate RNA modification at the genomic position 29,016. (B) The ionic current signals from viral N transcript at the genomic position 29,016 (yellow lines) are similar to those from IVT control (black lines), indicating that modification is rare on the N sgRNA. (C) Kernel density estimations of ionic current distribution at A29016. Blue line shows the signal distribution in the standard model of tombo 1.5. (D) Dwell time difference supports RNA modification. The dwell time of the region 29,015–29,017 of the S RNA (right) is significantly longer than those of IVT control and N RNAs. On the contrary, the neighboring region 28,995–28,997 of IVT, N, and S RNA is indistinguishable (left). See also Figures S4 and S5 Figure S5 Detected Modified Sites in Viral RNAs, Related to Figure 5 Show full caption A, Ionic current levels near the genomic position 27,947 in viral S RNA (green lines) and IVT control RNA (black lines). B, Ionic current levels for the identical region in the viral ORF8 RNA (orange lines) and IVT control RNA (black lines). C, Kernel density plots for signal distributions at the position 27,947 in the different RNAs. The blue line shows the standard model used for modification detections without controls (“alternative base detection” and “de novo” modes) in Tombo. We, however, noticed intriguing differences in the ionic current (called “squiggles”) between negative control and viral transcripts ( Figure 5 A). At least 41 sites displayed substantial differences (over 20% frequency), indicating potential RNA modifications ( Table S5 ). Notably, some of the sites showed different frequencies depending on the sgRNA species. Figures 5 A–5C show an example that is modified more heavily on the S RNA than the N RNA, while Figures S5 A–S5C present a site that is modified more frequently on the ORF8 RNA compared with the S RNA. Moreover, the dwell time of the modified base ( Figure 5 D, right) is longer than that of the unmodified base ( Figure 5 D, left), suggesting that the modification interferes with the passing of RNA molecules through the pore.

Figure 6 Detected RNA Modifications Are Differentially Regulated Show full caption (A) Position-specific base frequency of a motif enriched in the frequently modified sites. (B) Sequence alignment of the detected modification sites with “AAGAA”-like motif. Base positions on the left hand side correspond to the genomic coordinates denoted with red arrowhead. (C) Genomic location of modification sites with the AAGAA-like motif (top row) and the others grouped by the detected nucleotide base. (D) Location and modification levels in different RNA species. (E) Kernel density plots showing poly(A) length distribution of gRNA (left) and S RNA (right). Modified viral RNAs carry shorter poly(A) tails. See also Figure S6 and Table S5 Among the 41 potential modification sites, the most frequently observed motif is AAGAA ( Figures 6 A and 6B ). The modification sites on the “AAGAA-like” motif (including AAGAA and other A/G-rich sequences) are found throughout the viral genome but particularly enriched in genomic positions 28,500–29,500 ( Figure 6 C). Long viral transcripts (gRNA, S, 3a, E, and M) are more frequently modified than shorter RNAs (6, 7a, 7b, 8, and N) ( Figure 6 D), suggesting a modification mechanism that is specific for certain RNA species.

−5 and p < 7.3 × 10−12 for ORF1ab and S, respectively; Mann-Whitney U test). These results suggest a link between the internal modification and 3′ end tail. Because poly(A) tail plays an important role in RNA turnover, it is tempting to speculate that the observed internal modification is involved in viral RNA stability control. It is also plausible that RNA modification is a mechanism to evade host immune response. The type of modification(s) is yet to be identified, although we can exclude METTL3-mediated m6A (for lack of consensus motif RRACH), ADAR-mediated deamination (for lack of A-to-G sequence change in the DNBseq data), and m1A (for lack of the evidence for RT stop). Our finding implicates a hidden layer of CoV regulation. It will be interesting in the future to identify the chemical nature, enzymology, and biological functions of the modification(s). Figure S6 Highly Modified Viral RNAs Carry Shorter Poly(A) Tails, Related to Figure 6 Show full caption Poly(A) tail length distribution of each viral transcript other than shown in Figure 6 Because DRS allows simultaneous detection of multiple features on individual molecules, we cross-examined the poly(A) tail length and internal modification sites. Interestingly, modified RNA molecules have shorter poly(A) tails than unmodified ones ( Figures 6 E and S6 ; p < 9.8 × 10and p < 7.3 × 10for ORF1ab and S, respectively; Mann-Whitney U test). These results suggest a link between the internal modification and 3′ end tail. Because poly(A) tail plays an important role in RNA turnover, it is tempting to speculate that the observed internal modification is involved in viral RNA stability control. It is also plausible that RNA modification is a mechanism to evade host immune response. The type of modification(s) is yet to be identified, although we can exclude METTL3-mediated m6A (for lack of consensus motif RRACH), ADAR-mediated deamination (for lack of A-to-G sequence change in the DNBseq data), and m1A (for lack of the evidence for RT stop). Our finding implicates a hidden layer of CoV regulation. It will be interesting in the future to identify the chemical nature, enzymology, and biological functions of the modification(s).