In this short report, we set out to characterize the heterogeneity of all 2019‐nCoV genomes and proteomes available at the moment of the study, comparing them to other representative coronaviridae, specifically SARS, MERS, and BCoV. We will generate phylogenetic trees of the 2019‐nCoV cases and apply entropy‐based analyses of position‐wise variance and categorical principal component analysis (CATPCA) as an alternative method to estimate the sequence distance between all analyzed viruses.

No vaccine for 2019‐nCoV has been publicly released, but a World effort has arisen toward the characterization of the molecular determinants and evolutionary features of this novel virus. An initial comparison of 10 genomic sequences from 2019‐nCoV specimens has reported a low heterogeneity of this viruses with intersample sequence identity above 99.9%. 6 There are currently 54 2019‐nCoV complete genome sequences from the Global Initiative for Sharing all Influenza Data (Gisaid 7 ) and from GenBank, 8 plus two partial sequences obtained by the Spallanzani hospital in Rome, Italy (also from Gisaid).

Very recently, a novel beta‐CoVs coronavirus (2019‐nCoV) originating from the province of Wuhan, China, has been causally linked to severe respiratory infections in humans. At the time of writing, 14 441 cases of 2019‐nCoV‐related pneumonia cases have been reported in China, plus 118 cases from 23 other countries. There are currently 315 deaths linked to this pathogen (source: World Health Organization report, 02 February 2020). Phylogenetic relationships between Bat and Human coronaviridae have been discovered for SARS 3 and more recently also for 2019‐nCoV, 4 suggesting events of inter‐species transmissions. 5

Coronaviridae (CoVs) are the largest known single‐stranded RNA viruses. 1 They have been categorized in three groups, based on phylogenetic analyses and antigenic criteria, 2 specifically: (a) alpha‐CoVs, responsible for gastrointestinal disorders in human, dogs, pigs, and cats; (b) beta‐CoVs, including the Bat coronavirus (BCoV), the human severe acute respiratory syndrome (SARS) virus and the Middle Eastern respiratory syndrome (MERS) virus; (c) gamma‐CoVs, which infect avian species.

Pairwise protein identity and coverage were calculated using BLAST protein v2.6.0 15 with BLOSUM62 matrix and default parameters. Nucleotide sequence identity and coverage were calculated using BLAST nucleotide v2.6.0. 15

CATPCA was performed on R version 3.6.1 using the package FactoMineR. 14 Specifically, an MSA FASTA file from MUSCLE is loaded in R and converted into a categorical matrix, with genomes as rows and nucleotide coordinates as columns. Factors are defined as A, C, G, T, N, or—(gap), as described in results. Then, the FactoMineR multiple correspondence analysis algorithms is run with default parameters and custom R functions are used to plot the component values for each genome.

Phylogenetic model generation and tree visualization were done using MEGAX v 10.1.7, 11 using the Maximum Likelihood method and Tamura‐Nei model. 12 The tree structure was validated by running the analysis on 100 bootstrapped input datasets. 13

3 RESULTS

3.1 Phylogenetic analysis We collected 53 full genomic 2019‐nCoV sequences from the Gisaid database (Table S1), plus the GenBank‐deposited sequence from the Wuhan seafood market pneumonia virus isolate Wuhan‐Hu‐1 (NC_045512.2) and two partial sequences from Italian isolates (EPI_ISL_406959 and EPI_ISL_406960). To compare 2019‐nCoVs with closely related viral species, we obtained six sequences from distinct human SARS genomes from GenBank (the reference NC_004718.3, plus the genomes AY274119.3, GU553363.1, DQ182595.1, AY297028.1, and AY515512.1). We also obtained six BCoV genomic sequences (DQ022305.2, DQ648857.1, JX993987.1, KJ473816.1, MG772934.1, EPI_ISL_402131). Finally, as more distantly related beta‐CoVs we analyzed also MERS genomes from GenBank entries JX869059.2 and KT368829.1. Similarly to a previous report with 10 virus specimens,6 we detected very high conservation between the 56 analyzed 2019‐nCoV genomes, with sequence identity above 99%. We found a bat CoV genome (Gisaid EPI_ISL_402131) with 96.2% sequence identity (and query coverage above 99%) to the 2019‐nCoV reference sequence (NC_045512.2), while the previously reported closest bat CoV (bat‐SL‐CoVZC45) has a sequence similarity of 88%.6 The reference human SARS genome (NC_004718.3) appears more distant from the reference 2019‐nCoV, with sequence identity of 80.26% and query coverage of 98%. We aligned all the 70 coronavirus sequences using MUSCLE9 and inferred the evolutionary relationships between these sequences with a Tamura‐Nei Maximum Likelihood estimation12 with 100 bootstraps for model robustness estimation. The results are shown in Figure 1 as a phylogenetic tree representation. All the human 2019‐nCoV appear very similar to each other, despite the different locations of sampling. Bat coronaviridae appears to be the closet homologs. Two specific specimens gathered in 2013 and 2015 in China from the bat species Rhinolophus affinis and Rhinolophus sinicus appear to be located between the BCoV and the human 2019‐nCoV groups, supporting the notion of a zoonotic transfer from bats to humans.4 Human SARS sequences group with BCoV sequences more distantly related to 2019‐nCoV genomes. Finally, MERS genomes are the most genetically distinct amongst the other sequences. Figure 1 Open in figure viewer PowerPoint Phylogenetic tree of all the 2019‐nCov sequences available at 02‐Feb‐2020 (branches shown in blue), plus six Bat coronavirus sequences (default black, as they are split in multiple taxa), six human SARS (green) and two MERS (orange). The percentage of bootstraps supporting each branch is reported. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. MERS, Middle Eastern respiratory syndrome; SARS, severe acute respiratory syndrome A purely topological representation of a bootstrapped Maximum Likelihood tree (Figure S1) shows that 2019‐nCoV sequences are highly similar to each other, with poor support to the existence of distinct subgroups. The global multiple sequence alignment (MSA) is available as Supporting Information File S1.

3.2 Genomic divergence from other beta‐coronaviridae Given the high homogeneity between 2019‐nCoV genomes, we developed a novel method to classify genomic sequences, based on CATPCA.14 Briefly, this analysis finds the eigenvectors describing the highest variance within a categorical dataset, like ours. Our dataset derived from the MUSCLE MSA of 70 genomes and generated 32 206 positions: the categories in each coordinate could be A (Adenine), C (Cytosine), G (Guanine; T [Thymine], although being an single strand RNA virus, it would be more appropriate to use U [Uracil]), N (Nucleotide, uncertain location: very rare in this dataset, and accounting for only nine positions, or 0.0004% of all the data). Our analysis shows similar results for phylogenetic tree representations. In Figure 2A, we show the catPCA of the first components for all analyzed genomes. The MERS/non‐MERS grouping accounts for the largest variance, while SARS and SARS‐like BCoVs cluster together. While 2019‐nCoV constitute a tightly similar cluster, the two bat virus sequences MG772934.1 and EPI_ISL_402131 appear to be linking the human 2019‐nCoV to the Bat coronaviridae. Figure 2 Open in figure viewer PowerPoint Categorical principal component analysis of the projected variance in the entire coronaviridae genome dataset considered in this study (A) and in the 2019‐nCoV subset only (B). Human 2019‐nCoV are shown as blue circles, other genomes as grey diamonds A catPCA analysis on the sole 2019‐nCoV sequences highlights some internal variability (Figure 2B), with two likely outliers identified in the genome EPI_ISL_406862 (collected in Germany) and EPI_ISL_406592 (collected in Shenzhen, China).

3.3 Genomic variance estimation within 2019‐nCoV genomes Although the variability within the 2019‐nCoV genomes is very low, we set out to discover possible hotspots of hypervariability within the viral population. We analyzed the approximately 30 000 nt of multiple genome alignments performed on the 54 full 2019‐nCoV genomes. Our analysis shows that these viruses have largely the same genomic arrangement as the SARS species.17 A large gene encoding for a polyprotein (ORF1ab) at the 5′ end of the genome is followed by four major structural protein‐coding genes: S = Spike protein, E = Envelope protein, M = Membrane protein, and N = Nucleocapsid protein. There are also at least six other accessory open reading frames (ORFs) (Figure 3A). Figure 3 Open in figure viewer PowerPoint Variability within 54 2019‐nCoV full genomic sequences. A, location of major structural protein‐encoding genes (red boxes; S = Spike protein, E = Envelope protein, M = Membrane protein, N = Nucleocapsid protein) and accessory protein ORFs (blue boxes) on the meta‐genomic sequence derived from the MSA of all genomes. B, Shannon entropy values across genomic locations. The two coordinates with the highest entropy (excluding the 5′ and 3′ highly variable UnTranslated regions) are indicated. C, Zoom‐in of the MSA describing the two most variable locations in the core genome, in the ORF1ab (left) and in ORF8 (right). MSA, multiple sequence alignment For each position of the multialigned 54 2019‐nCoV, we calculated Shannon Entropy as a measure of the position variability.18 Apart from the 5′ and 3′ ends, likely nonprotein coding and less homogeneous, we identified two hotspots of hypervariability at positions 8789 and 28151 (Figure 3B,C). Position 8789 witnesses the presence of either T (U) or C, and it falls within the polyprotein gene. It causes a synonymous variation in the nucleotide triplet encoding for Serine 2839 (amino acid coordinates based on the reference genome NC_045512.2), so it is likely not to introduce phenotypical differences between the different strains. On the other hand, position 28151 falls within ORF8 and is characterized by the presence of either a C or a U. This causes a Ser/Leu change in amino acid (aa) 84, which can affect the conformation of the peptide, given that Serine is a polar amino acid, and Leucine is nonpolar. Aa84 appears to be nonconserved also across other coronaviridae (Figure 4A, black arrow). Figure 4 Open in figure viewer PowerPoint A, MSA of the coronavirus proteins encoded by ORF8 in representative sequences for the 2019‐nCoV (Wuhan), BCoVs (Bat SARS‐like and Bat SARS) and SARS viruses. B, Alignment of ORF8‐L and ORF8‐S, two isoforms varying at amino acid position 84 (indicated by a red dot). The diagram shows in larger blue the Globplot‐predicted intrinsically disordered regions in the two isoforms. C, MSA of the coronavirus Nucleocapside proteins. MSA, multiple sequence alignment We analyzed the alternative isoforms of 2019‐nCoV ORF8‐aa84 alternative isoforms, dubbed ORF8‐L (Leucine) and ORF8‐S (Serine). Unfortunately, no crystal structures of close homologs to the ORF8 protein are available for a reliable homology modeling to measure the structural impact of this amino acid substitution. The closest 3D model to 2019‐nCoV ORF8 available in Protein Data Bank19 is a short 22 amino acid stretch in the protein entry 6P65, with a nonsignificant E‐value of 0.848. We, therefore, employed de novo methods to infer the structural features of ORF8. One important effect we could detect is a significant effect of Serine in ORF8‐S in inducing structural disorder in the protein C‐terminal portion, which is not predicted to be present in the ORF8‐L (Figure 4B), using the Russell/Linding algorithm.16 Moreover, it did not escape our attention that the ORF8‐S could theoretically generate a novel phosphorylation target for the mammalian host Serine/Threonine kinases of the host organism. So, we searched for ORF8 homologous substrates in the Mammalia NCBI nr protein database, but could not find matches within the E‐value threshold of 1.