Machine Learning for Biology: How Will COVID-19 Mutate Next?

Genome sequence analysis with K-Means & PCA

One thing many people overlook about viruses is that, like every other organism on Earth struggling for survival, they evolve, or mutate.

Just look at a snippet of the bat virus RNA nucleotide sequence the human virus was derived from…

AAAATCAAAGCTTGTGTTGAAGAAGTTACAACAACTCTGGAAGAAACTAAGTT

…and a snippet from the RNA nucleotide sequence of the human COVID-19 virus…

AAAATTAAGGCTTGCATTGATGAGGTTACCACAACACTGGAAGAAACTAAGTT

…clearly, the coronavirus has changed its internal structure to adapt to its new host species (to be more precise, about 20% of its internal structure has mutated), yet it has kept enough of that structure to remain recognizably related to the virus it came from.
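To make the comparison between the two snippets above concrete, here is a minimal Python sketch that counts the positions at which they differ. The names `bat_snippet`, `human_snippet`, and `count_mismatches` are my own, chosen for illustration, and the sketch assumes the two snippets are already aligned position by position (they are the same length, so a simple character-by-character comparison works here).

```python
# The two snippets quoted above, copied verbatim.
bat_snippet = "AAAATCAAAGCTTGTGTTGAAGAAGTTACAACAACTCTGGAAGAAACTAAGTT"
human_snippet = "AAAATTAAGGCTTGCATTGATGAGGTTACCACAACACTGGAAGAAACTAAGTT"


def count_mismatches(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences differ.

    This is a plain position-by-position comparison (a Hamming distance).
    It assumes the sequences are already aligned; it does not handle
    insertions or deletions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


mismatches = count_mismatches(bat_snippet, human_snippet)
percent_different = 100 * mismatches / len(bat_snippet)

print(f"{mismatches} of {len(bat_snippet)} positions differ "
      f"({percent_different:.1f}%)")
```

For these two snippets, the script reports 8 differing positions out of 53, or roughly 15%. That number describes only this short window, not the genome as a whole.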

In fact, research has shown that COVID-19 has mutated repeatedly in ways that boost its survival. In our fight to defeat the coronavirus, we need to understand not just how the virus can be destroyed, but also how it mutates and how those mutations can be addressed.

In this article, I will…

Provide a surface-level explanation of what RNA nucleotide sequences are

Use K-Means to create genome information clusters

Use PCA to visualize the clusters

…and derive insights from each of the analytics procedures we perform.