This is a brief report outlining some phylogenetic analysis of the initial genome sequences. It gives some preliminary findings for information purposes and is not intended for publication as an academic work. All the data used here were provided by the laboratories listed below through NCBI or GISAID.

Available genome data

One annotated genome has been released on GenBank by Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China:

https://www.ncbi.nlm.nih.gov/nuccore/MN908947

This was the first genome released but it has been updated a few times as resequencing was performed, particularly focusing on the start and end of the genome. It is likely that this is a reliable genome sequence but there is insufficient epidemiological information for it to be useful here (there is no exact date of sample collection and it is unclear if the sample is from the same patient as one of the other genomes).

As of 19-Jan-2020, 13 other genome sequences have been released on GISAID (http://gisaid.org/), originating from 6 different labs.

Accession       Strain                            Location                             Collection date  Lab
EPI_ISL_402132  BetaCoV/Wuhan/HBCDC-HB-01/2019    China / Hubei Province / Wuhan City  2019-12-30       [1]
EPI_ISL_402127  BetaCoV/Wuhan/WIV02/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402128  BetaCoV/Wuhan/WIV05/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402129  BetaCoV/Wuhan/WIV06/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402130  BetaCoV/Wuhan/WIV07/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402126  BetaCoV/Kanagawa/1/2020           Kanagawa Prefecture, Japan           2020-01-14       [3]
EPI_ISL_403963  BetaCoV/Nonthaburi/74/2020        Thailand / Nonthaburi Province       2020-01-13       [4]
EPI_ISL_403962  BetaCoV/Nonthaburi/61/2020        Thailand / Nonthaburi Province       2020-01-08       [4]
EPI_ISL_402120  BetaCoV/Wuhan/IVDC-HB-04/2020     China / Hubei Province / Wuhan City  2020-01-01       [5]
EPI_ISL_402119  BetaCoV/Wuhan/IVDC-HB-01/2019     China / Hubei Province / Wuhan City  2019-12-30       [5]
EPI_ISL_402121  BetaCoV/Wuhan/IVDC-HB-05/2019     China / Hubei Province / Wuhan City  2019-12-30       [5]
EPI_ISL_402124  BetaCoV/Wuhan/WIV04/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402123  BetaCoV/Wuhan/IPBCAMS-WH-01/2019  China / Hubei Province / Wuhan City  2019-12-24       [6]

[1] Wuhan Jinyintan Hospital & Hubei Provincial Center for Disease Control and Prevention, China

[2] Wuhan Jinyintan Hospital & Wuhan Institute of Virology, Chinese Academy of Sciences, China

[3] Dept. of Virology III, National Institute of Infectious Diseases, Japan

[4] Bamrasnaradura Hospital & Department of Medical Sciences, Ministry of Public Health, Thailand

[5] National Institute for Viral Disease Control and Prevention, China CDC, China

[6] Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, China

Table 1 | Available nCoV2019 genome sequences

Ten genomes are from Wuhan City in Hubei Province, China with samples collected between 24-Dec-2019 and 01-Jan-2020. Two genomes are from patients in Thailand who had recently travelled from Wuhan. One sequence is from a patient in Japan who had also travelled from Wuhan but this is a short fragment of genome (369 nucleotides long) and is not included in this analysis. One genome, ‘BetaCoV/Wuhan/IVDC-HB-04/2020’, has evidence of sequencing artefacts and is excluded from the analysis.

Phylogenetic analysis

The phylogenetic tree of the remaining 11 complete genomes is given in Figure 1. This shows that there is very limited genetic variation in the currently sampled viruses in Wuhan (3 are identical, the others have 1, 2 or 3 differences from these). This is indicative of a relatively recent common ancestor for all these viruses.
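Counts of differences like those quoted above can be obtained from an alignment by comparing each pair of genomes site by site. The following is a minimal Python sketch of that idea, not the method used to build Figure 1. The function names and the toy sequences are invented for illustration, and skipping gaps and ambiguous bases is an assumption.

```python
from itertools import combinations


def count_differences(seq_a, seq_b, ignore="-N?"):
    """Count sites where two aligned sequences differ.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_differences(alignment):
    """Return {(name_a, name_b): n_differences} for every pair of sequences."""
    return {
        (name_a, name_b): count_differences(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy alignment, invented for illustration; this is not real genome data.
toy_alignment = {
    "genome_A": "ACGTACGTAC",
    "genome_B": "ACGTACGTAC",
    "genome_C": "ACGTTCGTAC",
    "genome_D": "ACGTTCGNAA",
}

for pair, n_diff in pairwise_differences(toy_alignment).items():
    print(pair, n_diff)
```

With real data the dictionary would be filled from the aligned genomes, and identical genomes would show up as pairs with zero differences.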



Figure 1 | Maximum likelihood tree of 11 nCoV2019 genomes. Genomes in blue are from Thailand. The tree is rooted using the oldest sequence, but this is an arbitrary choice. Constructed using PhyML.

Thailand announced positive tests for two apparently independent travellers from Wuhan. They were reported as not having visited the seafood market that had been associated with some of the early cases in Wuhan and as having no epidemiological links with any of the known cases. We might therefore expect these individuals to have been infected with a random representative of the diversity of viruses circulating in Wuhan (either through exposure to a non-human source or through human-to-human transmission from other infected people). Considering how similar these two virus genomes are to the samples from Wuhan may be informative about how diverse the population of viruses is.

The two genomes sampled in Thailand are genetically identical to three of the genomes sampled from Wuhan on 30-Dec-2019. This suggests that the (very limited) diversity present amongst the sampled and sequenced Wuhan cases is representative of the overall diversity of the outbreak.

Virus      Estimated rate (×10⁻³ subst/site/year)  Reference
SARS-CoV   0.80 – 2.38                             Zhao et al. 2004 [2]
MERS-CoV   0.63 [0.14 – 1.1]                       Cotten et al. 2013 [3]
MERS-CoV   1.12 [0.88 – 1.37]                      Cotten et al. 2014 [4]
MERS-CoV   0.96 [0.83 – 1.09]                      Dudas et al. 2018 [5]
HCoV-OC43  0.43 [0.27 – 0.60]                      Vijgen et al. 2005 [6]

Table 2 | Evolutionary rate estimates of human coronaviruses

To estimate the time of the most recent common ancestor (TMRCA) of the currently sampled viruses (including the ones from Thailand), I used a Bayesian phylogenetic software package called BEAST [7]. With the available data it is not possible to estimate the rate of evolution of the virus, so I used two assumed values: 1×10⁻³ substitutions per site per year (a reasonable expected rate of evolution for an acute RNA virus) and 0.5×10⁻³. These values approximately span the range of rate estimates for other human coronaviruses shown in Table 2.

The estimated dates for the most recent common ancestor (and the 95% credible intervals) are:

Assumed rate  Estimated date of MRCA  95% interval
1×10⁻³        27-Nov-2019             06-Nov-2019 – 16-Dec-2019
0.5×10⁻³      28-Oct-2019             13-Sep-2019 – 03-Dec-2019

The 95% intervals of both estimates include the beginning of December, so both are compatible with a TMRCA at that time.
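As a rough sanity check on these dates, an assumed rate can be turned into an expected number of substitutions along a single lineage: rate x genome length x elapsed time. The Python sketch below does this arithmetic under a strict molecular clock. It is a back-of-envelope calculation, not a substitute for the BEAST analysis; the genome length of about 29,900 nucleotides and the use of 30-Dec-2019 as a representative sampling date are assumptions, and the function names are invented for illustration.

```python
from datetime import date

GENOME_LENGTH = 29_903  # assumption: approximate genome length in nucleotides


def expected_substitutions(rate, years, genome_length=GENOME_LENGTH):
    """Expected substitutions along one lineage under a strict molecular clock.

    rate is in substitutions per site per year.
    """
    return rate * genome_length * years


def years_between(earlier, later):
    """Elapsed time between two dates, in years."""
    return (later - earlier).days / 365.25


# Assumption: 30-Dec-2019 is the most common collection date in Table 1.
sampling_date = date(2019, 12, 30)

# (assumed rate, estimated date of MRCA) pairs taken from the table above.
scenarios = [
    (1e-3, date(2019, 11, 27)),
    (0.5e-3, date(2019, 10, 28)),
]

for rate, mrca_date in scenarios:
    elapsed = years_between(mrca_date, sampling_date)
    expected = expected_substitutions(rate, elapsed)
    print(
        f"rate={rate:g}  MRCA={mrca_date}  "
        f"expected substitutions per lineage={expected:.1f}"
    )
```

Under these assumptions both scenarios give on the order of two to three expected substitutions per lineage, which is the same order as the small number of differences seen among the sampled genomes.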

Observations

From the available data it is not possible to tell whether the TMRCA of the sampled cases was in a human or a non-human animal (the reservoir).

The sampled human viruses may be the result of multiple independent zoonotic introductions from a non-human animal source, of a few introductions followed by limited human-to-human transmission, or of a single introduction into the human population followed by spread. Determining which of these scenarios is more likely will depend on the assessment of other information (dates of onset, locations of likely non-human animal sources, epidemiological links between cases).

However, the phylogenetic data thus far suggests that the jump or jumps from non-human animals occurred relatively soon before the earliest identified cases. If multiple zoonotic jumps occurred, these did not come from a virus reservoir that was genetically diverse. That, in turn, would suggest that the virus had only recently become established in the direct non-human source or that the initial human patients had been exposed to a non-human animal source that had a genetically limited population of viruses. This might be the case if one or a group of infected animals had been brought into Wuhan city from elsewhere and was in a position to expose multiple individuals.

Caveats

The number of genetic differences in the genomes is close to the error rate of the sequencing process. Some of the observed differences may be artefacts of this process.

The evolutionary rates used to estimate the TMRCA are intended to represent a plausible range based on previous estimates for other human coronaviruses.

The samples from Wuhan were likely collected as part of the initial investigation of the outbreak centred on the seafood market. This may have resulted in sampling of epidemiologically linked cases that are not representative of the outbreak within the human population. However, the high degree of similarity with the two Thailand cases, and the absence of any reported link between these cases and the Wuhan cases, suggest this is not the case, for the reasons outlined above.

The date estimates for the TMRCA are averaged over many plausible phylogenetic reconstructions of the genome data, as there is insufficient information in the data to reconstruct any single time-calibrated tree.
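To illustrate what averaging over many reconstructions means in practice, the Python sketch below summarises a set of sampled TMRCA values by their mean and a 95% interval. It assumes the interval is a highest posterior density (HPD) interval, which is a common choice but is not stated above. The sample values are randomly generated for illustration and are not output from the BEAST analysis; the function name is also invented.

```python
import math
import random
import statistics


def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    # Number of samples the interval must contain (tolerant of float rounding).
    window = max(1, math.ceil(mass * n - 1e-9))
    # Slide a window of that size over the sorted values; keep the narrowest.
    best_start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[best_start], ordered[best_start + window - 1]


random.seed(1)
# Hypothetical TMRCA samples, in days before the most recent genome.
# Invented for illustration only; not values from the analysis above.
tmrca_days = [random.gammavariate(4.0, 10.0) for _ in range(5000)]

low, high = hpd_interval(tmrca_days)
print(f"posterior mean: {statistics.mean(tmrca_days):.1f} days")
print(f"95% HPD interval: {low:.1f} to {high:.1f} days")
```

Each sampled value corresponds to one plausible tree, so the summary reflects uncertainty in the tree as well as in the date.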

References