HIV-1 is commonly assumed to have nine genes. However, in 1988 a 10th gene was suggested, overlapped by the env gene, but read on the antisense strand. The corresponding protein was named AntiSense Protein (ASP). Several pieces of evidence argue in favor of ASP expression in vivo, but its function is still unknown. We performed the first evolutionary study of ASP, using a very large number of HIV-1 and SIV (simian) sequences. Our results show that ASP is specific to group M of HIV-1, which is responsible for the pandemic. Moreover, we demonstrated that evolutionary forces act to maintain the asp gene within the M sequences and showed a striking correlation of asp with the spread of the pandemic.

Recent experiments provide sound arguments in favor of the in vivo expression of the AntiSense Protein (ASP) of HIV-1. This putative protein is encoded on the antisense strand of the provirus genome and entirely overlapped by the env gene with reading frame −2. The existence of ASP was suggested in 1988, but is still controversial, and its function has yet to be determined. We used a large dataset of ∼23,000 HIV-1 and SIV sequences to study the origin, evolution, and conservation of the asp gene. We found that the ASP ORF is specific to group M of HIV-1, which is responsible for the human pandemic. Moreover, the correlation between the presence of asp and the prevalence of HIV-1 groups and M subtypes appeared to be statistically significant. We then looked for evidence of selection pressure acting on asp. Using computer simulations, we showed that the conservation of the ASP ORF in the group M could not be due to chance. Standard methods were ineffective in disentangling the two selection pressures imposed by both the Env and ASP proteins—an expected outcome with overlaps in frame −2. We thus developed a method based on careful evolutionary analysis of the presence/absence of stop codons, revealing that ASP does impose significant selection pressure. All of these results support the idea that asp is the 10th gene of HIV-1 group M and indicate a correlation with the spread of the pandemic.

It is well established that retroviruses are able to perform antisense transcription from the 3′ long terminal repeat (LTR) of their proviral genome (1, 2). In 1988, the existence of an ORF on the antisense strand of the HIV type 1 (HIV-1) genome was suggested (3). This ORF encodes the putative AntiSense Protein (ASP). The existence of this ORF and of the encoded protein was controversial for many years, but now several pieces of evidence argue in favor of its expression (see ref. 4 for an extensive review): (i) several polyadenylated antisense transcripts capable of encoding ASP have been characterized within HIV-1–infected cells (1, 5, 6); (ii) it was demonstrated that the full-length ASP protein can be expressed ex vivo from the HIV-1 3′ LTR (7); (iii) ASP has been detected in freshly infected cells (2, 8, 9); and (iv) two recent independent clinical studies have shown the in vivo expression of ASP by detecting a cell-mediated immune response against several ASP epitopes within 30% of individuals infected with subtype B viruses (10, 11) [a percentage similar to those observed with other HIV-1 proteins, e.g., Tat and Pol (12)]. Moreover, experimental results suggested that ASP could form stable aggregates, be located partially at the plasma membrane, and be associated with autophagy (4, 7, 8). Despite this accumulation of evidence, the existence of ASP is still questioned because, for example, defective ribosome products with immune response have been reported for several viruses including HIV-1 (13, 14). Elucidating the function of ASP is thus a major goal, but studying the evolutionary forces acting on ASP is also crucial.

A striking feature of the ASP ORF (and a challenge for bioinformatics analyses) is its location on the provirus genome, where it overlaps the env (envelope) gene. Overlapping genes are a common way for viruses to “compress” their genomes (15). However, because the same stretch of DNA encodes several proteins, the adaptability of each of these proteins is strongly reduced (16). Proteins encoded by overlapping genes are generally accessory proteins that play a role in viral pathogenicity or spreading (17). The ASP ORF overlaps env in frame −2: codon positions 1, 2, and 3 in env face positions 2, 1, and 3 in asp, respectively (Fig. 1B). Because the two most important positions of the env and asp codons are opposite each other, there is particularly little flexibility to encode amino acids (18).
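The position correspondence described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the `offset` label for the three antisense reading frames is a hypothetical convention introduced here and is not the frame numbering used in the paper. It enumerates which antisense codon position faces env codon positions 1, 2, and 3 in each of the three antisense frames.

```python
def facing_positions(offset):
    """Antisense codon positions (1-3) facing sense codon positions 1, 2, 3.

    `offset` (0, 1 or 2) is a hypothetical label for the three antisense
    reading frames; it is not the paper's frame numbering.
    """
    # A nucleotide at sense index n has sense codon position n % 3 + 1.
    # Because the antisense strand is read in the opposite direction,
    # its antisense codon position is (offset - n) % 3 + 1.
    return tuple((offset - n) % 3 + 1 for n in range(3))


# One mapping per antisense frame.
mappings = {offset: facing_positions(offset) for offset in range(3)}
for offset, mapping in mappings.items():
    print(offset, mapping)
```

Only one of the three mappings is (2, 1, 3), which is the arrangement the text attributes to frame −2: positions 1 and 2 of one gene face positions 2 and 1 of the other, and the third codon positions coincide.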

Fig. 1. Structure of the HIV-1 genome in the env gene region. (A) This region contains five overlapping ORFs: the env gene, exon 2 of tat, exon 2 of rev, the C-terminal extremity of vpu, and the ASP ORF. The env gene contains five variable regions (V1 to V5, reddish) and the RRE (greenish). (B) Overlapping sequences in frames −2 and +1 [start of the asp region, HXB2 (20; GenBank accession no. K03455)].

The aims of this study were to assess the presence and conservation of the ASP ORF in the HIV-1 and SIVcpz/gor (chimpanzee and gorilla) groups and subtypes and to demonstrate the selection pressure induced by ASP to confirm its importance in some of the mechanisms of the virus.

Results

HIV-1 strains are classified into four phylogenetic groups: M, N, O, and P. These four groups resulted from four separate cross-species transmission events of Simian Immunodeficiency Virus (SIV) to humans (19). Group M is the pandemic group. It is divided into nine distinct subtypes, and more than 70 circulating recombinant forms (CRFs).

The ASP ORF is entirely overlapped by the env gene (Fig. 1A), which has several overlapping ORFs on different reading frames. The env gene contains five variable regions, separated by constant regions (21), and the Rev Response Element (RRE) (22), which is a highly structured RNA element that plays a role in the export of HIV-1 mRNAs.

Data. We downloaded all available and complete HIV-1/SIVcpz/gor env sequences and data annotations from the Los Alamos HIV Sequence Database (www.hiv.lanl.gov/content/index). We also used GenBank to retrieve the original version of some of the sequences. After deleting problematic sequences, we obtained 22,992 env sequences belonging to 3,931 individuals. Codon-based multiple alignments were performed on the env gene and on frame −2 of this gene. To avoid counting closely related sequences from the same individual several times, we used two strategies: (i) we used the complete multiple alignment, but weighted the sequences so that each individual had a total weight of 1 (thus obtaining the “weighted” alignment); and (ii) we randomly selected one sequence per individual when this was required for computational reasons. Most of our results and statistics are based on weighted sequences, except where otherwise specified. Details are provided in SI Text.
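The weighting strategy can be illustrated with a minimal sketch. It assumes that the sequences of one individual share that individual's weight equally (1/n each); the exact scheme is described in SI Text and may differ. The function names and the toy data are hypothetical.

```python
from collections import Counter


def individual_weights(individual_ids):
    """Give each sequence weight 1/n, where n is the number of sequences
    from the same individual, so that every individual totals 1.
    Equal sharing within an individual is an assumption of this sketch."""
    counts = Counter(individual_ids)
    return [1.0 / counts[i] for i in individual_ids]


def weighted_fraction(flags, weights):
    """Weighted share of sequences for which a boolean flag is true."""
    total = sum(weights)
    return sum(w for f, w in zip(flags, weights) if f) / total


# Hypothetical toy data: three individuals with 3, 1 and 2 sequences.
ids = ["p1", "p1", "p1", "p2", "p3", "p3"]
has_orf = [True, True, False, True, False, False]

weights = individual_weights(ids)
fraction = weighted_fraction(has_orf, weights)
print(round(fraction, 3))
```

With such weights, an individual sampled many times contributes no more to a percentage than an individual represented by a single sequence.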

Detection of the ASP ORF. We based our analyses on the presence/absence of start and stop codons in frame −2 of the env region. The ASP ORF of the reference sequence HXB2 (20; GenBank accession no. K03455) has a length of 188 codons and is located between reference env positions 1,717 and 1,151. We thus searched all of our sequences for long DNA segments (>150 codons) with a start codon and no stop codon, read in frame −2 and located between these two reference positions. The analysis was carried out in the group M and an “out-of-M” dataset comprising all nonpandemic (N, O, P) HIV-1 and SIVcpz/gor sequences. For the sequences in group M and using the above criteria, we detected the ASP ORF for 77% of the (weighted) sequences. We clearly observed (Fig. 2) a region that is nearly free of stop codons and located between the reference positions of asp. At the beginning of the asp region, we note the presence of a stop codon for 14.5% of the sequences, located 12 codons after the start codon. However, most of these sequences (90%) belong to subtype A and its recombinants. One of the A recombinants in the asp region, namely CRF02_AG (112 individuals), has the early stop for ∼100% of the sequences, but only 7% of its sequences have the ASP ORF using our criteria. In contrast, in subtype A (240 sequences), ∼100% of sequences have the early stop, but a large percentage of them (∼90%) have an alternative start codon located 17 codons after the early stop. These sequences thus have a shorter version of the ASP ORF, but still with a length of more than 150 codons.

Fig. 2. Detection of the ASP ORF. Weighted percentages of start (blue) and stop (red) codons in frame −2, in the groups M and out-of-M. The asp region (white area) is located between the env positions 1,717 and 1,151 (HXB2 reference). The red star indicates an early stop codon that is specific to subtype A and A recombinants. This early stop codon is followed by an alternative start codon (blue star) in most of the A sequences and certain A recombinants.

In the out-of-M sequences, a number of stop codons are observed inside the asp region (Fig. 2). The start codon is not conserved (38% of sequences have a start/methionine codon), and less than 1.5% of out-of-M sequences have an ORF in the asp region with length >150 codons.
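The detection criterion (a start codon followed by more than 150 codons without a stop, read on the antisense strand) can be sketched as follows. This is an illustrative simplification rather than the pipeline used in the study: it works on a single ungapped nucleotide sequence, ignores the restriction to the HXB2 reference positions, and the `frame_offset` argument is a hypothetical label for the antisense frame to scan.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the antisense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq, frame_offset):
    """Length in codons of the longest ATG-initiated, stop-free run
    (stop codon excluded) in one frame of the antisense strand."""
    anti = reverse_complement(seq.upper())
    best = 0
    start = None  # codon index of the first ATG since the last stop
    for codon_index, i in enumerate(range(frame_offset, len(anti) - 2, 3)):
        codon = anti[i:i + 3]
        if codon in STOPS:
            start = None  # a stop closes any open ORF
            continue
        if codon == "ATG" and start is None:
            start = codon_index
        if start is not None:
            best = max(best, codon_index - start + 1)
    return best


def has_long_orf(seq, frame_offset, min_codons=150):
    """True if the scanned frame holds an ORF of more than `min_codons`."""
    return longest_orf(seq, frame_offset) > min_codons


# Hypothetical toy input: the sense strand facing a short antisense ORF.
toy_sense = reverse_complement("ATG" + "GCA" * 5 + "TAA")
print(longest_orf(toy_sense, 0))
```

Using the first ATG after each stop gives the longest possible ORF for that stop-free stretch; a later, alternative start codon (as in subtype A) is picked up automatically once an earlier stop has reset the scan.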

Recent Emergence of the ASP ORF. The contrasting results between the pandemic group M and the other groups (out-of-M) led us to study the emergence and evolution of the ASP ORF using a phylogenetic approach. For this purpose, we inferred a maximum-likelihood tree using PhyML (23) (GTR+Γ4+I model, 1,000 bootstrap replicates) on a selection of sequences extracted from our complete alignment. We used 33 reference sequences (24) of group M subtypes and CRFs (A, B, C, D, F, G, H, J, CRF01_AE, CRF02_AG), 10 randomly selected sequences from group O, and the 40 sequences of the other out-of-M groups. To complete this phylogeny, we computed statistics from the complete, weighted alignment for all groups, subtypes, and CRFs. By using the same detection criteria as above, we measured the length of the longest ORF in the asp region and the fraction of sequences that had the ASP ORF. The phylogeny (Fig. 3) clearly shows the four introductions of HIV-1 in the human population, corresponding to the four groups O, P, N, and M. The start codon corresponding to asp is present in most of the studied sequences. However, group O sequences and some exceptions (e.g., subtype H, prevalence ∼0.1%) do not have this start codon. The median length of the ORF in the asp region of the out-of-M groups increases when approaching group M: there are 66 codons in SIVcpz_Pts (from Pan troglodytes schweinfurthii), which increases to 125 codons in SIVcpz_Ptt (from Pan troglodytes troglodytes) that is closest to the group M. For group M sequences, the ASP ORF is present in 77% of the sequences with a median length of 182 codons. All of this indicates that the ASP ORF was created recently and that its emergence in HIV-1 is concomitant with the emergence of the group M. This recent de novo creation is further supported by the fact that ASP does not have any known homologs (SI Text). Interestingly, among SIVcpz_Ptt sequences, there is one sequence that possesses the ASP ORF in its entirety. 
This simian ASP has the same structural features (4) as the human ones. However, it is phylogenetically remote, and we do not have enough SIVcpz_Ptt sequences to figure out whether ASP appeared in the HIV and SIV genomes independently or, rather, if ASP first appeared in the SIVcpz_Ptt genome and was maintained when the HIV-1 group M emerged from it (SI Text and Fig. S1).

Fig. 3. Recent emergence of the ASP ORF using a phylogenetic approach. This phylogeny (* = bootstrap >80%) of the env gene contains reference sequences from HIV-1 and SIVcpz/gor groups, subtypes, and CRFs. The four distinct simian/human transmissions are indicated by red stars. For each of the sequences, we show the distribution of start codons (red triangle; black triangle for alternative start) and stop codons (black cross) in the asp region. For each group, subtype, and CRF, the table provides the median length of the ASP ORF, the fraction of sequences with the ASP ORF (length > 150 codons), and the prevalence in the human population (44).

Fig. S1. Phylogenetic analysis of the SIVcpz_Ptt sequence that has the ASP ORF. The phylogenetic tree was estimated using PhyML with a nucleic multiple alignment of the asp region, GTR+Γ6, and 100 replicates. The DQ373063 sequence that has ASP is in red; the SIVcpz_Ptt sequences without ASP are in blue; the group M reference sequences that have ASP are in black.

However, the fraction of M sequences that have the ASP ORF varies among subtypes and CRFs. The less prevalent subtypes (D, F, J, H, K, total prevalence ∼3%) have the ASP ORF for less than 45% of their sequences. As already mentioned, only a few sequences (7%) in CRF02_AG (prevalence 7.7%) have the ASP ORF. In contrast, the other prevalent subtypes and CRFs (A, B, C, G, and CRF01_AE, total prevalence = 81%) have ASP for 84% of their sequences.
We thus see a clear correlation (P value = 0.003) (Materials and Methods): prevalent M subtypes and CRFs (except CRF02_AG) have the ASP ORF for a large majority of sequences, whereas low-prevalence subtypes and nonpandemic groups (N, O, P) have the ASP ORF in a minority of sequences (none in some groups/subtypes). This correlation is confirmed when accounting for phylogenetic correlation (Bayes factor = 3.8) (Materials and Methods). The ASP ORF is present in 84% of sequences for prevalent subtypes (A, B, C, G) and CRF01_AE. This fraction is quite high and is not likely to be due to chance, as we shall see. However, 16% of sequences in these subtypes and CRF do not have the ASP ORF. This level of absence is similar to the one observed with nef [13.5% (25)], an accessory gene, and higher than the ∼5% that we found for env and pol (two obligatory genes) by scanning for the presence of stop codons [all available Los Alamos database sequences (December 2015)]. This nonnegligible fraction of stops in env and pol is explained by both sequencing errors (26) and the fact that some of the sequences are defective (27). The higher level of absence with ASP ORF is explained by the fact that asp is an accessory gene. As expected, ASP ORF was lost not only in some of the M subtypes and CRFs, but also in some of the individuals of prevalent subtypes and CRF01_AE, where 12% of individuals in our dataset do not have any sequence with the ASP ORF, whereas 81% of individuals have the ASP ORF for all of their sequences. These 12%, added to the 5% of sequencing errors and defective sequences observed with env and pol, roughly explain the 16% of asp absence in prevalent subtypes and CRF01_AE.

Conservation of the ASP ORF. Previous analyses indicate that the ASP ORF is present in a large fraction of the group M sequences. We used computer simulations to demonstrate that there is a very low probability that this is due to chance. We first estimated the probability of observing an ORF with ASP length overlapping the env gene in frame −2. For this purpose, we randomly generated sequences with the same length (856 codons) as the env gene of HXB2 and the same codon usage as HIV-1 (www.kazusa.or.jp/codon/). In this case, the probability of an ORF of length 180 in frame −2 is ∼3%. This is a low probability, but ASP is the longest overlapping ORF present in HXB2 in any reading frame (3), and one could argue that having such an ORF in the whole HIV-1 genome is quite likely. Using the same method as above, we thus generated sequences having the same length (3,239 codons) as HXB2 and searched for an ORF of length 180 in the five possible reading frames. The probability in this case is ∼19%. This is still a relatively low probability, but clearly we cannot reject that the presence of the ASP ORF at the root of HIV-1 M is merely due to chance. However, we observed the ASP ORF in 77% of our M sequences. The question then is: if we assume the ASP ORF presence at the root of HIV-1 M, would we have a significant chance of observing its presence in so many sequences at the phylogeny tips? To answer this question, we simulated the evolution of the env gene along phylogenies inferred using 350 randomly selected strains. We used PhyML and codonPhyML (28) to infer 10 such phylogenies. For each one, we performed 100 codon-based simulations using Alf (29), starting from the env gene of HXB2 at the tree root (Materials and Methods). The maximum percentage of the tip sequences where the ASP ORF was still present (across 1,000 datasets) was equal to 67%, and on average, the ASP ORF was conserved in only 42% of sequences. 
These results show that there is an extremely low probability that our observation of 77% on the conservation of the ASP ORF in the M group is due to chance, thus revealing a selection pressure that tends to conserve ASP. In the following section, we show that this selection pressure is also detected at the sequence level. We first used standard methods (evolutionary rate, nonsynonymous versus synonymous substitutions, codon usage), but none of these approaches provided a significant signal due to the specificity of frame −2 (30, 31) (SI Text). Thus, we developed a method dedicated to the frame −2.
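The first simulation step (the probability that a random coding sequence carries a long ORF in the overlapping antisense frame) can be sketched with a small Monte Carlo estimate. The sketch makes several assumptions that are not the study's settings: codons are drawn uniformly from the 61 sense codons unless `codon_weights` is supplied (the study used the HIV-1 codon usage table), a "hit" is an ATG-initiated ORF of at least `min_orf` codons, and `offset=1` selects the antisense frame in which sense codon positions 1, 2, 3 face antisense positions 2, 1, 3 for an in-frame sequence whose length is a multiple of three. The estimate it returns is therefore not expected to reproduce the percentages quoted above.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"
                if a + b + c not in STOPS]


def longest_antisense_orf(seq, offset):
    """Longest ATG-initiated, stop-free run (in codons) in one frame of
    the reverse complement of `seq`."""
    anti = seq.translate(COMPLEMENT)[::-1]
    best, run = 0, 0  # run = codons since the first ATG after the last stop
    for i in range(offset, len(anti) - 2, 3):
        codon = anti[i:i + 3]
        if codon in STOPS:
            run = 0
        elif run > 0 or codon == "ATG":
            run += 1
            best = max(best, run)
    return best


def orf_probability(n_codons, min_orf, n_trials, offset=1,
                    codon_weights=None, seed=0):
    """Fraction of random stop-free coding sequences whose antisense frame
    contains an ORF of at least `min_orf` codons."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Uniform codon usage when codon_weights is None (an assumption).
        codons = rng.choices(SENSE_CODONS, weights=codon_weights, k=n_codons)
        if longest_antisense_orf("".join(codons), offset) >= min_orf:
            hits += 1
    return hits / n_trials


# Same sequence length as the HXB2 env gene (856 codons), few trials.
estimate = orf_probability(n_codons=856, min_orf=180, n_trials=200)
print(estimate)
```

Passing a list of 61 codon frequencies as `codon_weights`, in the order of `SENSE_CODONS`, would bring the sketch closer to the codon-usage-matched sequences described in the text.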