How similar is COVID-19 to previously discovered coronaviruses?

A simple comparison of composition profiles of different coronavirus genomes

With genome data for COVID-19 now publicly available from the National Center for Biotechnology Information (NCBI), I wanted to see how similar the 2019 novel coronavirus is to other coronaviruses. In this article, I compare the reference genome of COVID-19 with the reference genomes of two previously discovered coronaviruses: one human coronavirus and one bat coronavirus.

Disclaimer: This article is based on my analysis of the reference genomes available on NCBI and is intended for adaptation, learning and understanding of concepts of metagenome composition. This is not a part of any official research and results are not suitable for decision making of any sort.

For this analysis, I have considered the following three coronavirus genomes.

Image by Olga Lionart from Pixabay

Criteria considered

I have considered a few criteria to compare the composition of the selected coronaviruses.

1. Oligonucleotide composition
2. GC content

Oligonucleotide composition

An oligonucleotide is a contiguous string of a small number of nucleotides. In computational terms, we define oligonucleotides as k-mers (words of size k). In this comparison, I have considered 3-mers (also known as trimers or trinucleotides) and their composition (trinucleotide composition). There are 64 (4³) possible 3-mers; when a 3-mer and its reverse complement are counted as the same word, this leaves 32 (4³/2) distinct 3-mers. We obtain the normalised frequency of each distinct trinucleotide by counting the number of occurrences of that trinucleotide and dividing by the total number of trinucleotides. We normalise these counts so that sequences of different lengths can be compared fairly.

Normalised frequency of kᵢ = (number of occurrences of kᵢ) / (total number of k-mers)

where kᵢ is the iᵗʰ k-mer.

The oligonucleotide composition is considered to be conserved within microbial species and varies between species [1][2].
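The normalised trinucleotide frequencies described above can be sketched in a few lines of Python. This is a minimal illustration and not necessarily the code used to produce the figures in this article; the function names `canonical` and `kmer_profile` are hypothetical, and it assumes that a 3-mer and its reverse complement are merged into the lexicographically smaller of the two, which is one common way to arrive at 32 distinct 3-mers.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_profile(sequence, k=3):
    """Return normalised frequencies of canonical k-mers in a DNA sequence.

    Assumption: windows containing anything other than A, C, G or T
    (for example N) are skipped rather than counted.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1

    total = sum(counts.values())
    # List every canonical k-mer so that absent ones get a frequency of 0.
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {key: (counts[key] / total if total else 0.0) for key in keys}


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    profile = kmer_profile("ATGCAT")
    print({kmer: freq for kmer, freq in profile.items() if freq > 0})
```

For a real genome, the sequence string would come from a FASTA file downloaded from NCBI; reading that file is left out here to keep the sketch self-contained.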

GC content

GC content (or guanine-cytosine content) is the percentage of nucleotides in a sequence that are either guanine or cytosine.

GC content = (G + C) / (A + G + C + T) * 100

The GC content is also believed to vary between different species [2].
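The GC content formula above translates directly into code. The following is a small sketch; the function name `gc_content` is hypothetical, and it assumes that characters other than A, C, G and T (such as N) are left out of the denominator.

```python
def gc_content(sequence):
    """Return the percentage of bases in a DNA sequence that are G or C.

    Assumption: only A, C, G and T are counted; other symbols are ignored.
    """
    sequence = sequence.upper()
    counts = {base: sequence.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["G"] + counts["C"]) / total * 100


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    print(gc_content("ATGC"))
```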

Individual Analysis of Coronavirus Genomes

Let us first analyse the coronavirus genomes individually.

1. SARS coronavirus ZJ0301

Fig 1. Trinucleotide Composition of SARS coronavirus ZJ0301

This is the reference genome of the SARS coronavirus ZJ0301 reported from China, which was published in 2003 [3].

Publications: "Severe acute respiratory syndrome-associated coronavirus genotype and its characterization" and "Molecular biological analysis of genotyping and phylogeny of severe acute respiratory syndrome associated coronavirus"

Figure 1 shows the trinucleotide composition of the SARS coronavirus ZJ0301.

2. Bat SARS-like coronavirus isolate bat-SL-CoVZC45

Fig 2. Trinucleotide Composition of Bat SARS-like coronavirus isolate bat-SL-CoVZC45

This is the reference genome of the bat SARS-like coronavirus [4] which is considered to be very closely related to COVID-19 [5].

Publication: Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats

Figure 2 shows the trinucleotide composition of the bat SARS-like coronavirus.

3. COVID-19 (SARS-CoV-2)

Fig 3. Trinucleotide Composition of COVID-19

This is the latest reference genome published on NCBI for the 2019 novel coronavirus [5].

Figure 3 shows the trinucleotide composition of COVID-19.

Comparison of different coronaviruses

Trinucleotide composition

At first glance, Figures 1, 2 and 3 may seem to show the same patterns and to vary within the same range, but if we plot them together as shown in Figure 4, we can see some differences. The trinucleotide composition patterns of COVID-19 and bat-SL-CoV are more similar to each other than either is to that of SARS-CoV (especially for 3-mers such as AGG, ATC, ATG, CTA, CTC, GAA and GTA).
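One way to back up a visual comparison like this with numbers is to rank the 3-mers by how much their normalised frequencies differ between two genomes. The sketch below is only an illustration of that idea and was not used to produce the figures; the function name `largest_differences` is hypothetical, and the profiles in the usage example are made-up toy values, not measurements from the three genomes.

```python
def largest_differences(profile_a, profile_b, top=5):
    """Rank k-mers by absolute difference in normalised frequency.

    Each profile is a dict mapping a k-mer to its normalised frequency.
    A k-mer missing from one profile is treated as having frequency 0.
    Ties are broken alphabetically by k-mer.
    """
    kmers = set(profile_a) | set(profile_b)
    diffs = {
        kmer: abs(profile_a.get(kmer, 0.0) - profile_b.get(kmer, 0.0))
        for kmer in kmers
    }
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))[:top]


if __name__ == "__main__":
    # Toy profiles for illustration only, not real genome data.
    genome_x = {"AAA": 0.5, "ACG": 0.25, "ATG": 0.25}
    genome_y = {"AAA": 0.25, "ACG": 0.25, "ATG": 0.5}
    for kmer, diff in largest_differences(genome_x, genome_y, top=3):
        print(kmer, diff)
```

Running this on each pair of genomes would give a list of the 3-mers that separate them most, which can then be checked against the combined plot.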