PhyloCSF identification of two novel ORFs in the POLG mRNA

We initially found evidence of alternate-frame translation in POLG as part of a project to identify novel coding regions using PhyloCSF [30]. We had previously developed PhyloCSF [31] (Phylogenetic Codon Substitution Frequencies) to determine whether a given nucleotide sequence is likely to represent a functional, conserved protein-coding sequence by determining the likelihood ratio of its multi-species alignment under coding and non-coding models of evolution that use precomputed substitution frequencies for every possible pair of codons, trained on whole-genome data. To find novel coding regions we had computed PhyloCSF scores for every codon in the human genome in each of six reading frames, used a hidden Markov model to find potential coding intervals, and screened out intervals overlapping known coding or pseudogenic regions in the same frame or the antisense frame, leaving us with approximately 70,000 PhyloCSF Candidate Coding Regions (PCCRs), which were then prioritized by a machine learning algorithm and the first 1000 examined by expert manual annotators.
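The likelihood-ratio idea behind the score can be sketched in a few lines. This is a minimal illustration, not the PhyloCSF implementation: it assumes the two model log-likelihoods (natural log) have already been computed elsewhere, and it assumes the score is expressed in decibans, the unit reported by the PhyloCSF software. The function name is hypothetical.

```python
import math


def log_likelihood_ratio_decibans(loglik_coding: float, loglik_noncoding: float) -> float:
    """Convert two natural-log likelihoods into a log-likelihood ratio in decibans.

    Positive values favor the coding model; negative values favor the
    non-coding model. One deciban is 10 * log10 of the likelihood ratio.
    """
    return 10.0 * (loglik_coding - loglik_noncoding) / math.log(10.0)


# A likelihood ratio of 1000 in favor of the coding model is 30 decibans.
score = log_likelihood_ratio_decibans(math.log(1000.0), 0.0)
```

On this scale, a positive score means the alignment is better explained by the coding model, and a negative score means it is better explained by the non-coding model.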

We found that a cluster of PCCRs on the minus strand of chromosome 15 lies within exons 2 and 3 of POLG (Fig. 1b). Since we had previously screened out intervals overlapping known coding regions in the same frame, this indicated possible translation in an alternative reading frame. An alignment of 58 placental mammal genomes in the frame indicated by the PhyloCSF signal (the −1 frame relative to the main ORF) showed a partial ORF roughly coinciding with the signal and ending in a well-conserved stop codon (Supplementary Figure 1) but left ambiguous where the ORF started. There are no AUG codons in this reading frame 5′ of the PhyloCSF signal in exon 2, or in any frame in exon 1, suggesting that the ORF is initiated at a non-AUG start codon. The CUG codon with hg38 coordinates chr15:89,333,807–89,333,809 is conserved in all the aligned genomes and roughly coincides with the start of the PhyloCSF signal, so we investigated it further as a plausible candidate start codon. With this start, the candidate ORF, which we refer to as ORF-Y, would create a 260-amino acid protein with a PhyloCSF score of 412.1, which is significantly higher than could be expected to arise from a non-coding region of that length (p < 1 × 10⁻⁷). We have included this translation in the GENCODE / Ensembl gene set as model ENST00000650303.1. Analysis of the sequence upstream of the CUG putative initiation codon revealed a second potential uORF, which we term ORF-Z (Supplementary Figure 2).
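The two checks described above (no AUG in the candidate frame upstream of the signal, and an ORF running from a candidate start to the first in-frame stop) can be sketched as follows. This is an illustrative sketch on a toy sequence, not the pipeline used in this study; the function names and the example sequence are hypothetical.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize(seq: str) -> str:
    """Upper-case the sequence and write it in the DNA alphabet."""
    return seq.upper().replace("U", "T")


def has_atg_in_frame(seq: str, frame_offset: int, end: int) -> bool:
    """True if an ATG codon occurs in the given frame before position `end`."""
    seq = normalize(seq)
    return any(seq[i:i + 3] == "ATG" for i in range(frame_offset, end - 2, 3))


def orf_from_start(seq: str, start: int) -> Optional[str]:
    """Return the ORF from `start` (0-based) through the first in-frame stop.

    Returns None if no in-frame stop codon is found.
    """
    seq = normalize(seq)
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None


# Toy transcript: a CTG at index 2 opens a frame that ends at TGA.
toy = "CCCTGGCTAAATGAGG"
orf = orf_from_start(toy, 2)            # "CTGGCTAAATGA"
no_atg = not has_atg_in_frame(toy, 2, 11)  # True: no ATG in that frame
```

In the toy sequence an ATG is present, but in a different frame, which mirrors the situation in which an ORF must rely on a non-AUG start codon.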

The overlapping portion of ORF-Y with the main CDS has a significantly reduced rate of synonymous substitutions in most mammals

Since translation in more than one frame can suppress synonymous substitutions, we assessed synonymous site conservation within the POLG ORF using the Synplot2 program [32]. Plots of stop codon positions in each of the three forward reading frames of the alignment were also generated (Fig. 2). In the mammalian alignment, a highly significant increase in synonymous site conservation was observed in the ORF-Y overlap region (783 nucleotides in Homo sapiens) (Fig. 2a). Enhanced synonymous site conservation in the POLG ORF disappears immediately after the ORF-Y stop codon. The presence of such a long, conserved, stop-codon-free region argues against an RNA structural element being responsible for the synonymous site conservation.

Fig. 2 Synonymous site conservation in the POLG coding region for the major vertebrate clades. Clades shown are a. mammals, b. amphibians, c. sauropsids, and d. teleost fish. In each subfigure, the top panel shows the position of 0-frame stop codons in each sequence in the alignment. The following panels show the positions of stop codons in the +1 and +2 frames. The blue dots represent stop codons and the grey regions represent alignment gaps. The bottom two panels show the synonymous site conservation analysis, with the brown line showing the ratio of the observed number of synonymous substitutions within a given window to the number expected under a null model of neutral evolution at synonymous sites, and the red line showing the corresponding p-value. The horizontal grey dashed line indicates a p = 0.05 threshold after an approximate correction for multiple testing (namely scaling by [sliding window size]/[POLG ORF length]). All subfigures use a 25-codon sliding window. The stop codon of ORF-Y in mammals is indicated with a black arrow.
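The approximate multiple-testing correction mentioned in the caption amounts to dividing the nominal threshold by the number of non-overlapping windows that fit in the ORF. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 1000-codon ORF length in the example is an illustrative round number, not the actual POLG ORF length.

```python
def corrected_threshold(alpha: float, window_codons: int, orf_codons: int) -> float:
    """Scale a nominal p-value threshold by window size / ORF length.

    This is equivalent to a Bonferroni-style correction for the number of
    non-overlapping windows (orf_codons / window_codons) in the ORF.
    """
    if window_codons <= 0 or orf_codons <= 0:
        raise ValueError("window and ORF lengths must be positive")
    return alpha * window_codons / orf_codons


# 25-codon window, hypothetical 1000-codon ORF: 0.05 * 25 / 1000 = 0.00125
threshold = corrected_threshold(0.05, 25, 1000)
```

The correction is approximate because sliding windows overlap, so neighboring tests are not independent.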

A closer look at organisms in the mammalian clade revealed that all POLG sequences contain a conserved CUG codon in ORF-Y that is in a good initiation context, except for Camelus ferus (camel), and three marsupial species: Vombatus ursinus (wombat), Phascolarctos cinereus (koala), and Monodelphis domestica (opossum). A fourth marsupial species, Sarcophilus harrisii (Tasmanian devil), has a CUG codon in the correct frame but the surrounding sequence is dissimilar to all other mammals. Furthermore, these five organisms have stop codons in the −1 frame shortly after the main ORF AUG start codon (Fig. 2a).

The disruption of ORF-Y in marsupials suggests that it became a protein-coding ORF de novo in placental mammals. This is confirmed by a 100-vertebrates codon alignment of ORF-Y, which shows that the early portion of ORF-Y is frameshifted in marsupials and platypus (Supplementary Figure 3). Furthermore, looking at the alignment in the second and third blocks, we see that there are many in-frame stop codons in marsupials and most of the non-mammal vertebrates. Finally, the synonymous substitution constraint as seen in Synplot2 analysis (Fig. 2a) appears to be restricted to placental mammals.

Ribosome profiling of POLG reveals that ORF-Y is actively translated

To verify translation of ORF-Y, we mined H. sapiens ribosome profiling data from an aggregate of studies using GWIPS-viz [33,34,35] and Trips-Viz [36]. Aggregate ribosome profiling reveals translation in the 5′-UTR at a comparable level to the beginning of the main ORF. Filtering ribo-seq data for samples treated with the initiation inhibitors lactimidomycin or harringtonine shows a comparable level of initiating ribosomes at the main ORF AUG start codon and at the upstream ORF-Y CUG codon (Fig. 3a). If ribosomes were translating both ORFs prior to the −1 frame stop codon for ORF-Y, a step-wise decrease in ribosome density after this stop codon could be apparent. Looking at an aggregate of elongation ribosome profiling studies, reads were found to peak at the −1 frame stop codon for ORF-Y (Fig. 3b). Looking at the framing of ribosomes, we see that in the region where ORF-Y and the POLG ORF overlap, the plurality of ribosomes are in frame 1, but in the nonoverlapping region of the POLG ORF, the plurality of ribosomes are in frame 2. Following this −1 frame stop codon, the number of reads per nucleotide drops by half, further indicating that a fraction of ribosomes have already terminated at ORF-Y’s stop codon (Fig. 3c).
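The framing comparison can be illustrated with a small sketch that assigns each read to a reading frame and reports the plurality frame for a region. It assumes each read has already been reduced to a single inferred P-site coordinate, and it numbers frames 0, 1, 2 relative to a chosen reference position, which may differ from the frame labels used by Trips-Viz. Function names and coordinates are hypothetical.

```python
from collections import Counter
from typing import Iterable


def frame_counts(psite_positions: Iterable[int], reference: int) -> Counter:
    """Count reads in each frame (0, 1, 2) relative to `reference`."""
    return Counter((pos - reference) % 3 for pos in psite_positions)


def plurality_frame(psite_positions: Iterable[int], reference: int) -> int:
    """Return the frame holding the largest number of reads."""
    counts = frame_counts(psite_positions, reference)
    return max(range(3), key=lambda frame: counts[frame])


# Toy region: three reads in frame 0 and one in frame 1 relative to position 10.
reads = [10, 13, 16, 11]
best = plurality_frame(reads, 10)  # 0
```

Comparing the plurality frame between an overlap region and a non-overlapping region is one simple way to see whether a second frame contributes a substantial share of the footprints.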

Fig. 3 Ribosome profiling analysis of ORF-Y. Aggregated ribosome profiling data for all studies available on GWIPS-viz (subfigure a) and Trips-Viz (subfigures b-c). a. Ribosome profiling coverage of POLG exon 2. The top panel in blue shows the aggregate of initiating ribosome profiling experiments (samples treated with harringtonine or lactimidomycin) and the bottom panel in red shows the aggregate of elongating ribosome profiling experiments. b. Ribosome profiling coverage of part of exon 3 containing the ORF-Y ‘UGA’ stop codon (box). c. Read counts by frame for the regions covering ORF-Y only, the ORF-Y/Main ORF overlap, the Main ORF only, and then all of ORF-Y and the Main ORF.

The initiation context of ORF-Y is highly favorable despite using a non-AUG start codon

The CUG putative start codon has a strong initiation context (GCCAAGCTGG) that is highly conserved, though the initiator codon is GUG in a select few sequences (Fig. 4a). Specifically, the ‘G’ in the +4 position and the ‘A’ in the −3 position are the most favorable nucleotides for these critical positions.
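The positions referred to above can be read directly off the context sequence, taking the first base of the start codon as +1. The sketch below does this for the GCCAAGCTGG context quoted in the text; the function name is hypothetical, and the "strong" flag uses the common rule of thumb (a purine at −3 and a G at +4), which is a simplification of how initiation contexts are evaluated in practice.

```python
def initiation_context(seq: str, codon_start: int) -> dict:
    """Report the -3 and +4 nucleotides around a start codon.

    `codon_start` is the 0-based index of the first base of the codon (+1).
    """
    seq = seq.upper().replace("U", "T")
    if codon_start < 3 or codon_start + 3 >= len(seq):
        raise ValueError("sequence must include positions -3 through +4")
    minus3 = seq[codon_start - 3]
    plus4 = seq[codon_start + 3]
    return {
        "codon": seq[codon_start:codon_start + 3],
        "minus3": minus3,
        "plus4": plus4,
        # Rule of thumb: purine at -3 together with G at +4.
        "strong": minus3 in "AG" and plus4 == "G",
    }


# Context from the text; the CTG codon begins at index 6.
ctx = initiation_context("GCCAAGCTGG", 6)
# {'codon': 'CTG', 'minus3': 'A', 'plus4': 'G', 'strong': True}
```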

Fig. 4 Initiation context of ORF-Y. a. Weblogo of initiation context sequences extracted from all mammalian POLG mRNA sequences that contain ORF-Y. The start codon is underlined. b. Representation of the consensus downstream RNA secondary structure for mammalian POLG mRNA sequences that contain ORF-Y. The structure was determined with RNAalifold. The arrow is pointing at the +14 nucleotide, where the ‘G’ in ‘CUG’ is nucleotide 0.

To check for additional features that could provide a favorable context for initiation, the regions in 88 mammal genomes downstream of the CUG codon were aligned and probed for RNA secondary structure (Supplementary Figure 4, Fig. 4b). RNAalifold [37] predicted a stem loop with a bulge in the middle. Conservation of this stem-loop suggests that it may play a role in the promotion of initiation at the CUG codon. The stem-loop begins at the optimal distance (14 nt) from the initiation codon for pausing the 43S pre-initiation complex over the CUG codon [21].

Proteomic evidence of active ORF-Y translation suggests that the peptide may harbor function

We next investigated proteomic evidence for translation of ORF-Y, by reanalyzing the Kim et al., 2014 draft human proteome datasets [38] and searching against a set of candidate coding regions detected by PhyloCSF including the ORF-Y protein sequence [39]. Two unique peptides (AAAAQPJGHPDAJER and AAAAAAAAAAAAAAATAASAAASAJJGGR) were found only in CD8 T-cell samples mapping unambiguously to the candidate protein sequence (Fig. 5). This could suggest that the function of ORF-Y’s protein product is linked to an immune function, since high confidence peptides were not found in other cell types; however, mass spectrometry is not guaranteed to detect all expressed proteins, so it is possible that ORF-Y is expressed in other cell types as well. The first of these peptides confirms a previous identification made in the original Kim et al. analysis, and has since been confirmed in PeptideAtlas [40] across 7 additional experiments (PAp06322239). This further supports the translation of the proposed ORF-Y into a protein that is folded stably enough to be detected, suggesting it may have function. The protein product of ORF-Y for H. sapiens is predicted to have a transmembrane domain (TMHMM prediction software [41]). However, inspection of the ORF-Y protein products for representative members of other mammalian orders reveals that this predicted transmembrane domain is not conserved (Supplementary Figure 5A). An alanine repeat expansion appears to have occurred in some species, causing the TMHMM prediction software [41] to call some of these peptides as potential transmembrane domains (Supplementary Figure 6). Taking the portion of the ORF-Y peptide corresponding to the region of strongest POLG-frame synonymous site conservation (Fig. 2; region with p < 10⁻²⁰) and inputting it into the Eukaryotic Linear Motif (ELM) prediction server [42] yielded five potential functions (Supplementary Figure 5B). One of them, a predicted tankyrase binding motif, is plausible given that tankyrases are members of the poly ADP-ribose polymerase (PARP) family, DNA methylation and repair are some of the many functions of proteins in this family, and these functions are all related to the function of the POLG protein in DNA replication [43]. Two of the five predicted motifs are cleavage sites, and the other two are localization signals.

Fig. 5 Mass spectrometry evidence for translation of ORF-Y. a. Predicted translation of human ORF-Y. The CUG initiation codon is presumed to translate to methionine. The two peptides detected by mass spectrometry are colored in blue and red. b. Spectra for the first (red) peptide. c. Spectra for the second (blue) peptide. The sequences of the fragmented ions and their abundances are shown in both b and c.

ORF-Z is highly translated and probably regulatory

Ribosome profiling indicates that translation initiation is potentially even more efficient at the AUG initiation codon of ORF-Z than at the CUG of ORF-Y or the main start codon (Fig. 6a and b, Fig. 1b). The initiation context surrounding this upstream AUG is also favorable with a G at −3 and a G at +4 (Fig. 6c). The theoretical translation of ORF-Z is only 23 amino acids in length and not highly conserved, having a negative PhyloCSF score. However, CodAlignView [44] shows that the start and stop codons for ORF-Z and its reading frame are indeed well conserved across placental mammals (Supplementary Figure 2), suggesting that translation of ORF-Z, but not the encoded peptide, could be functionally important, for example by playing a regulatory role in translation of ORF-Y and/or the POLG ORF [45]. We also examined ORF-Z and ORF-Y ribosome profiling in both Mus musculus and Rattus norvegicus (Supplementary Figure 7). We found that the ribosome footprints found in rats met the expected trend with a spike of reads at the ORF-Z and ORF-Y start codons. However, the footprints found in mouse are not what was expected. There is little translation in ORF-Y and there appears to be translation occurring 5′ of ORF-Z. There are two possible explanations. First, mice may have lost the ability to translate ORF-Y. This would leave open the question of how, mechanistically, ORF-Y could behave differently in mouse and rat, since the Kozak context is the same in both species (Supplementary Figure 2) and the nucleotides involved in the downstream secondary structure are the same, with the exception of the fifth position of the first stem (a C in mice, and a U in rats), which does not affect the folding (in both species, the C or U base pairs with a G, Supplementary Figure 4). Alternatively, it is possible that the set of ribosome profiling experiments in mice does not include the conditions needed for ORF-Y to be translated, especially since the diversity of ribosome profiling experiments available for humans is much larger than that for mice.

Fig. 6 POLG contains a further upstream ORF-Z. a. Schematic of where ORF-Z is located relative to the architecture of POLG. b. Ribosome profiling data mined from GWIPS-viz. The top panel in blue represents initiating ribosomes while the bottom panel in red represents elongating ribosomes. Arrows indicate positions of the initiation codons of all three ORFs, which exactly match peaks in initiating ribosome coverage. c. Weblogo of ORF-Z initiation contexts extracted from mammalian POLG mRNA sequences that contain ORF-Y and at least 150 nucleotides of 5′ UTR. The start codon is underlined.

ClinVar analysis reveals potentially harmful mutations in ORF-Y

Since mutations in POLG have been well documented in mitochondrial disease [7], we surveyed reported ClinVar variants within ORF-Z or ORF-Y that are synonymous or in the 5′-UTR with respect to the main ORF (Table 1). We found 41 ClinVar variants that do not change the POLG amino acid sequence but that do affect the ORF-Y peptide, and one variant that changes an ORF-Z amino acid, though this one might not be as important since ORF-Z is likely a regulatory ORF rather than a coding one. Many of these mutations are listed as benign, perhaps owing to the fact that they appeared to be synonymous. Given the evidence that ORF-Y encodes a functional protein, such mutations should be re-evaluated for their possible clinical significance.
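The reason a variant can be synonymous in the main ORF yet alter the ORF-Y peptide is that the same nucleotide falls at different codon positions in the two frames. The sketch below shows this on a toy sequence using the standard genetic code. It is illustrative only and is not the procedure used to compile Table 1; the function names, the sequence, and the variant are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_change(seq: str, pos: int, alt: str, frame_start: int) -> tuple:
    """Return (ref_aa, alt_aa) for the codon containing `pos`.

    `frame_start` is the 0-based index of any codon's first base in the
    frame of interest; `pos` is the 0-based index of the substituted base.
    """
    codon_begin = pos - (pos - frame_start) % 3
    if codon_begin < 0 or codon_begin + 3 > len(seq):
        raise ValueError("codon extends beyond the sequence")
    mutated = seq[:pos] + alt + seq[pos + 1:]
    ref_codon = seq[codon_begin:codon_begin + 3]
    alt_codon = mutated[codon_begin:codon_begin + 3]
    return CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


# Toy sequence: main frame starts at index 0, overlapping frame at index 2.
toy = "GCCCTGGAA"
main_effect = codon_change(toy, 2, "T", 0)     # ('A', 'A'): synonymous
overlap_effect = codon_change(toy, 2, "T", 2)  # ('P', 'S'): amino acid change
```

In this toy case the substitution hits the third position of a main-frame codon but the first position of a codon in the overlapping frame, so only the overlapping frame's protein changes.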