2019-nCoV Spike Protein Does Not Include Insertions Unique to HIV-1

In a recent manuscript entitled “Uncanny Similarity of Unique Inserts in the 2019-nCoV Spike Protein to HIV-1 gp120 and Gag”, (10) Pradhan et al. presented a discovery of four novel insertions unique to 2019-nCoV spike protein ( Figure 1 ). They further concluded that these four insertions are part of the receptor binding site of 2019-nCoV and that these insertions shared “uncanny similarity” to human immunodeficiency virus 1 (HIV-1) proteins but not to other coronaviruses. These claims resulted in considerable public panic and controversy in the community, (12) even after the manuscript was withdrawn. To investigate whether the conclusions by Pradhan et al. are scientifically precise, we reanalyzed the structural location and sequence homology of the four spike protein insertions discussed therein.

Figure 1. Sequence alignment of spike proteins from 2019-nCoV (NCBI accession: QHD43416) and SARS-CoV (UniProt ID: P59594). The four “novel” insertions “GTNGTKR” (IS1), “YYHKNNKS” (IS2), “GDSSSG” (IS3), and “QTNSPRRA” (IS4) reported by Pradhan et al. are highlighted by dashed rectangles. We note that these fragments are not bona fide “insertions”; in fact, at least three of the four fragments are also shared with the bat coronavirus RaTG13 spike glycoprotein (NCBI accession: QHR63300.1), as shown in Table 1. Nevertheless, we still refer to these fragments as “insertions” in this Communication for consistency with the original report. The receptor binding domain of the spike is marked by the solid box, which corresponds to residue positions 323–545 in the above alignment. A pair of arrows immediately following IS4 indicates the protease cleavage site at which spike proteins are cut into the S1 and S2 isoforms.

Because the full-length structure of the spike protein in 2019-nCoV was not available at the beginning of this study, we used C-I-TASSER (15) to model its tertiary structure as part of our efforts in the full genome structure and function analyses of 2019-nCoV, which are available at https://zhanglab.ccmb.med.umich.edu/C-I-TASSER/2019-nCoV/ . The 2019-nCoV spike model was then assembled with the human ACE2 structure (PDB ID: 6ACJ) (22) by DEMO (18) to form a spike–ACE2 complex. In Figure 2 A, we present a cartoon superposition of the C-I-TASSER model with a recently solved spike structure, (23) where the C-I-TASSER model shows high structural similarity to the cryo-EM structure, with a TM score of 0.95. (24,25) Because the experimental structure covers only 75% of the residues in the full-length sequence, with several important residues on the receptor binding domain (RBD) of the spike protein missing, our following analysis is mainly built on the C-I-TASSER reconstructed full-length model. We note that C-I-TASSER, also known as “Zhang-Server”, was the top-ranked automated server for protein structure prediction in the Critical Assessment of protein Structure Prediction round 13 (CASP13) challenge ( http://www.predictioncenter.org/casp13/zscores_final.cgi?model_type=best&gr_type=server_only ) among all 39 servers from the community. C-I-TASSER improves our previously developed I-TASSER structure prediction protocol (26) by incorporating a deep-learning-based contact map prediction. (17,27) On all 121 CASP13 targets, the average TM score of the C-I-TASSER first model (0.674) is 8.0% higher than that of I-TASSER (0.624) and 0.15% higher than that of C-QUARK (0.673), which is our only other automated CASP13 server and was ranked in second place in CASP13.
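To make the TM score comparisons above concrete, the following minimal sketch evaluates the standard TM score formula for one fixed superposition and rechecks the relative differences quoted for the CASP13 averages. The function names and inputs are hypothetical; the sketch assumes the two structures are already superposed and that per-residue distances are available, whereas the actual TM-score program also searches for the superposition that maximizes the score.

```python
from typing import Sequence


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM score for a single, already-fixed superposition (illustrative only).

    distances: distance in Angstroms between each pair of aligned residues.
    target_length: number of residues in the reference (target) protein.
    Residues of the target that are not aligned simply contribute zero.
    """
    # Assumption: very short chains need a special-cased d0, omitted here.
    if target_length <= 21:
        raise ValueError("this sketch only handles target_length > 21")
    # Length-dependent distance scale of the TM score.
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def relative_gain(new: float, old: float) -> float:
    """Relative difference of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Average first-model TM scores on the 121 CASP13 targets, as quoted in the text.
c_i_tasser, i_tasser, c_quark = 0.674, 0.624, 0.673
print(f"vs I-TASSER: {relative_gain(c_i_tasser, i_tasser):.1f}%")
print(f"vs C-QUARK:  {relative_gain(c_i_tasser, c_quark):.2f}%")
```

Because the sum is normalized by the target length, a model that matches the reference perfectly but covers only part of it cannot reach a score of 1, which is why the superposition in Figure 2 A is restricted to residues common to both structures.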

Figure 2. Structure of the 2019-nCoV spike protein trimer. (A) Superposition between the C-I-TASSER constructed model (blue) and the experimental structure (orange, PDB ID: 6VSB), which was determined after our model was predicted. Only residues common to both structures are shown. (B) Complex structure model between human ACE2 (left, yellow) and the spike protein trimer (right, with three chains colored in magenta, cyan, and blue, respectively) constructed by C-I-TASSER. The four insertions are shown as spheres. During different stages of coronavirus infection, the spike proteins may be postprocessed (i.e., cleaved) to produce different isoforms. Therefore, the eventual spike complex might not include all residues of a full-length spike protein. Nevertheless, we construct the complex model using a full-length spike sequence to illustrate the locations of the four insertions.

As shown in Figure 2 B, all four insertions in the C-I-TASSER/DEMO structural models are located outside the RBD of the spike protein, in contrast with the original conclusion made by Pradhan et al., which stated that the insertions are located on the interface with ACE2. Here it is important to note that following ACE2 receptor binding, the spike protein can be cleaved by host proteases such as cathepsin L (CTSL) to produce the S1 and S2 isoforms to facilitate viral entry into host cells. (28,29) Because this cleavage site immediately follows insertion 4 (IS4) ( Figure 1 , arrows) along the 2019-nCoV spike protein sequence, there is a possibility that IS4 could affect the cleavage of the spike protein. Regardless, none of the insertions is directly related to receptor binding.

To investigate the viral homologues of the four insertions, we further performed a BLAST sequence search of these four insertions against the nonredundant (NR) sequence database, restricting the search results to viruses (taxid: 10239) but leaving other search parameters at default values. The top five sequence homologues (including the query itself) identified for each insertion are listed in Table 1 . In contrast with the previous claim that the four insertions are unique to 2019-nCoV and HIV-1, all four insertion fragments can be found in other viruses. In fact, an HIV-1 protein is among the top BLAST hits for only one of the four insertion fragments, whereas three of the four insertion fragments are found in bat coronavirus RaTG13. Moreover, partially due to the very short length of these insertions, which range from six to eight amino acids, the E values of the BLAST hits, a parameter that BLAST uses to assess the statistical significance of alignments and that usually needs to be <0.01 for a hit to be considered significant, (30) are all >4, except for a bat coronavirus hit for IS2. These high E values suggest that the majority of these similarities are likely to be coincidental.
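Why very short queries tend to produce high E values can be sketched with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n is the database length in residues, and S' is the bit score of the alignment. The database size and bit score used below are hypothetical placeholders, not values from the search described above, and the effective-length corrections applied by BLAST are ignored.

```python
import math


def expected_hits(query_len: int, db_len: int, bit_score: float) -> float:
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(query_len: int, db_len: int, e_cutoff: float = 0.01) -> float:
    """Bit score needed for an alignment to reach the given E value cutoff."""
    return math.log2(query_len * db_len / e_cutoff)


# Hypothetical example: an 8-residue query against a database of 2**30 residues.
m, n = 8, 2 ** 30
print(expected_hits(m, n, bit_score=30.0))   # E value at an assumed 30 bits
print(min_bit_score(m, n))                   # bits needed to reach E < 0.01
```

The score attainable by a six- to eight-residue query is bounded by its length, whereas the score required for significance grows with the logarithm of the database size, so even exact matches of such fragments can fall short of the <0.01 threshold.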

Table 1. BLAST Search Results for IS1–IS4

| IS | NCBI accession | sequence | E value | sequence identity | species |
|-----|----------------|----------|---------|-------------------|---------|
| IS1 | query | GTNGTKR | 27 | 1.00 | 2019-nCoV |
| | APC94153 | GTNGTKR | 28 | 1.00 | uncultured marine virus |
| | AFU28737 | -TNGTKR | 224 | 0.86 | human immunodeficiency virus 1 |
| | AVE17137 | GTDGTKR | 224 | 0.86 | rat astrovirus Rn/S510/Guangzhou |
| | QBX18329 | -TNGTKR | 224 | 0.86 | Streptococcus phage Javan411 |
| | QHR63300 | GTNGIKR | 643 | 0.86 | bat coronavirus RaTG13 |
| IS2 | query | YYHKNNKS | 0.13 | 1.00 | 2019-nCoV |
| | QHR63300 | YYHKNNKS | 0.13 | 1.00 | bat coronavirus RaTG13 |
| | AUL79732 | -YHKNNKS | 4.2 | 0.88 | tupanvirus deep ocean |
| | YP_007007173 | YYHKDNK- | 8.7 | 0.75 | Klebsiella phage vB_KleM_RaK2 |
| | ALS03575 | YYHKNN-- | 12 | 0.75 | gokushovirus WZ-2015a |
| IS3 | query | GDSSSG | 1004 | 1.00 | 2019-nCoV |
| | QAU19544 | GDSSSG | 1003 | 1.00 | orthohepevirus C |
| | AYV78550 | GDSSSG | 1004 | 1.00 | edafosvirus sp. |
| | QHR63300 | GDSSSG | 1004 | 1.00 | bat coronavirus RaTG13 |
| | QDP55596 | GDSSSG | 1004 | 1.00 | prokaryotic dsDNA virus sp. |
| IS4 | query | QTNSPRRA | 1.0 | 1.00 | 2019-nCoV |
| | YP_009226728 | QTNSPRR- | 8.5 | 0.88 | Staphylococcus phage SPbeta-like |
| | BAF95810 | QTNSPRRA | 35 | 1.00 | Bovine papillomavirus type 9 |
| | ARV85991 | ETNSPRR- | 106 | 0.75 | peach-associated luteovirus |
| | QDH92312 | QTNAPRKA | 142 | 0.75 | Gordonia phage Spooky |
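The sequence identities in Table 1 can be rechecked from the aligned fragments themselves. The sketch below is illustrative only: the function names are hypothetical, and it assumes that identity is the number of matching positions divided by the query length, with "-" marking an unaligned position. It also applies the <0.01 E value threshold to every non-query hit in the table.

```python
# Query fragments and non-query hits transcribed from Table 1:
# (insertion, NCBI accession, aligned hit fragment, E value).
QUERIES = {"IS1": "GTNGTKR", "IS2": "YYHKNNKS", "IS3": "GDSSSG", "IS4": "QTNSPRRA"}
HITS = [
    ("IS1", "APC94153", "GTNGTKR", 28.0),
    ("IS1", "AFU28737", "-TNGTKR", 224.0),
    ("IS1", "AVE17137", "GTDGTKR", 224.0),
    ("IS1", "QBX18329", "-TNGTKR", 224.0),
    ("IS1", "QHR63300", "GTNGIKR", 643.0),
    ("IS2", "QHR63300", "YYHKNNKS", 0.13),
    ("IS2", "AUL79732", "-YHKNNKS", 4.2),
    ("IS2", "YP_007007173", "YYHKDNK-", 8.7),
    ("IS2", "ALS03575", "YYHKNN--", 12.0),
    ("IS3", "QAU19544", "GDSSSG", 1003.0),
    ("IS3", "AYV78550", "GDSSSG", 1004.0),
    ("IS3", "QHR63300", "GDSSSG", 1004.0),
    ("IS3", "QDP55596", "GDSSSG", 1004.0),
    ("IS4", "YP_009226728", "QTNSPRR-", 8.5),
    ("IS4", "BAF95810", "QTNSPRRA", 35.0),
    ("IS4", "ARV85991", "ETNSPRR-", 106.0),
    ("IS4", "QDH92312", "QTNAPRKA", 142.0),
]


def identity(query: str, aligned_hit: str) -> float:
    """Fraction of query positions matched by the aligned hit fragment."""
    if len(query) != len(aligned_hit):
        raise ValueError("fragments must be aligned to the same length")
    matches = sum(q == h for q, h in zip(query, aligned_hit))
    return matches / len(query)


def significant(hits, cutoff: float = 0.01):
    """Hits whose E value falls below the significance cutoff."""
    return [hit for hit in hits if hit[3] < cutoff]


for insertion, accession, fragment, e_value in HITS:
    print(insertion, accession, f"{identity(QUERIES[insertion], fragment):.2f}", e_value)
print("significant hits:", significant(HITS))
```

Under this definition, no hit in the table passes the <0.01 cutoff; the lowest E value, 0.13, belongs to the RaTG13 match for IS2.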

Given that three out of the four insertion fragments are found in the bat coronavirus RaTG13, it is tempting to assume that these “insertions” may be directly inherited from bat coronaviruses. Currently, there are at least seven known human coronaviruses (2019-nCoV, SARS-CoV, MERS-CoV, HCoV-229E, HCoV-OC43, HCoV-NL63, and HCoV-HKU1), many of which, including severe acute respiratory syndrome-related coronavirus (SARS-CoV) and Middle East respiratory syndrome-related coronavirus (MERS-CoV), were shown to be transmitted from bats. (3,31−34) To further examine the evolutionary relationship between 2019-nCoV and the bat coronavirus in comparison with other human coronaviruses, we used MUSCLE to create a multiple sequence alignment (MSA), presented in Figure 3 , for all seven human coronaviruses and two bat coronaviruses, RaTG13 and RsSHC014, which have been considered to be the ancestors of 2019-nCoV and SARS-CoV, respectively. (3,31,34) Among the four “insertions” (ISs) of 2019-nCoV, IS1 differs from the bat coronavirus at only one residue, and three of its seven residues are identical to MERS-CoV. IS2 and IS3 are both identical to the bat coronavirus. For IS4, although the local sequence alignment by BLAST did not hit the bat coronavirus in Table 1 , it has a close evolutionary relation to the bat coronavirus in the MSA. In particular, the first six residues in the IS4 fragment “QTQTNSPRRA” from 2019-nCoV are identical to RaTG13, whereas the last four residues, which are absent in the bat coronavirus and SARS-CoV, have at least 50% identity to MERS-CoV and HCoV-HKU1.

Figure 3. Multiple sequence alignment for the spike proteins of seven known human coronaviruses. The 2019-nCoV (QHD43416.1), SARS-CoV (P59594), MERS-CoV (YP_009047204.1), HCoV-NL63 (YP_003767.1), HCoV-229E (NP_073551.1), HCoV-OC43 (YP_009555241.1), and HCoV-HKU1 (YP_173238.1) sequences were downloaded from the NCBI and UniProt databases. RaTG13 (QHR63300.1) and RsSHC014 (AGZ48806.1), the two bat coronaviruses thought to be the ancestors of 2019-nCoV and SARS-CoV, are also included. For brevity, only the regions near the four “insertions” are displayed in the figure.