Basic Examples

Get a Dataset containing rows for each sequence:
In[1]:= Out[1]=

Return the latest date a sequence was included:
In[2]:= Out[2]=

Count the different lengths of the sequences provided, which correspond closely to the part of the viral genome that was sequenced:
In[3]:= Out[3]=

The sequence lengths fall into two categories, corresponding to more complete sequences versus specific genetic regions:
In[4]:= Out[4]=

Most of these SARS-CoV-2 samples were collected from humans, but not all:
In[5]:= Out[5]=
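The length tally and the two-category split described above can be sketched outside the notebook. This is a minimal Python illustration, not the original Wolfram Language cells. The list `lengths`, the cutoff `NEAR_COMPLETE_MIN`, and the helper names are assumptions made for the example; the values are toy numbers, not taken from the dataset.

```python
from collections import Counter

# Hypothetical sequence lengths in nucleotides (toy values, not dataset output).
lengths = [29903, 29903, 29882, 29867, 322, 287, 287, 1260]

# Illustrative cutoff (an assumption): sequences at least this long are
# treated as near-complete genomes, shorter ones as specific regions.
NEAR_COMPLETE_MIN = 25_000


def tally_lengths(values):
    """Count how many sequences share each exact length."""
    return Counter(values)


def categorize(length, cutoff=NEAR_COMPLETE_MIN):
    """Assign a length to one of the two broad categories."""
    return "near-complete" if length >= cutoff else "partial region"


length_counts = tally_lengths(lengths)
category_counts = Counter(categorize(n) for n in lengths)

print(length_counts.most_common(3))
print(dict(category_counts))
```

A fixed cutoff is the simplest way to separate the two groups; with real data, the gap between the two clusters of lengths would suggest where to place it.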

Scope & Additional Elements

Get a date histogram of collection dates:
In[6]:= Out[6]=

See a date histogram of release dates:
In[7]:= Out[7]=

See a date histogram of inclusion dates:
In[8]:= Out[8]=

Show the locations where the sequences were gathered:
In[9]:= Out[9]=

Obtain the available alignment differences with the reference sequence:
In[10]:= Out[10]=
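A date histogram amounts to binning dates and counting each bin. The following Python sketch bins by week starting on Monday, which is one possible choice and not necessarily the binning used in the original cells. The dates in `collection_dates` and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def weekly_histogram(dates):
    """Count dates per week, keyed by the Monday that starts each week."""
    counts = Counter(week_start(d) for d in dates)
    return dict(sorted(counts.items()))


# Hypothetical collection dates (toy values, not taken from the dataset).
collection_dates = [
    date(2019, 12, 30),
    date(2020, 1, 2),
    date(2020, 2, 17),
    date(2020, 2, 19),
    date(2020, 2, 23),
]

hist = weekly_histogram(collection_dates)
for start, count in hist.items():
    print(start.isoformat(), count)
```

The same function applies unchanged to release dates or inclusion dates, since only the list of dates differs.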

Visualizations

A phylogenetic tree comparison of the most common complete genomes by location shows clusters that are broadly distributed. Dropping trailing runs of adenine bases avoids arbitrary differences from varying poly(A) RNA tail lengths, which may be sequencing artifacts and shouldn’t affect viral adaptivity:
In[11]:= Out[13]=
In[14]:= Out[14]=

A similar visualization can be created for samples where more detailed geographic information is supplied. In this visualization of the most common sequences reported for US states, clusters emerge that contain interesting regional blocks, as shown in the map below:
In[15]:= Out[15]=
In[16]:= Out[16]=

Visualizing the similarity of the most common sequence by week of collection shows that nearby weeks tend to cluster together, but there is some overlap (such as between the week of Dec. 30, 2019 and the week of Feb. 17, 2020), illustrating that the virus has shown not only evolution but also significant continuity:
In[17]:= Out[18]=
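The preprocessing behind these comparisons, trimming the poly(A) tail and then picking the most common sequence per group, can be sketched in Python. This is an illustration under stated assumptions, not the original notebook code. The records, the group labels, and the helper names `strip_poly_a` and `most_common_by_group` are hypothetical, and the tree construction itself is not reproduced.

```python
from collections import Counter, defaultdict


def strip_poly_a(seq):
    """Remove the trailing run of adenines from a sequence.

    Caveat: this also removes any genuine adenines directly before the tail,
    since the two cannot be told apart from the string alone.
    """
    return seq.rstrip("Aa")


def most_common_by_group(records):
    """Map each group label to its most frequent tail-trimmed sequence."""
    groups = defaultdict(Counter)
    for group, seq in records:
        groups[group][strip_poly_a(seq).upper()] += 1
    return {g: counts.most_common(1)[0][0] for g, counts in groups.items()}


# Hypothetical (group, sequence) pairs; real genomes are about 30,000 bases.
records = [
    ("Region X", "ACGTTGCAAAAA"),
    ("Region X", "ACGTTGCAA"),
    ("Region X", "ACGTTGGAAAA"),
    ("Region Y", "ACGTTGGAAAAAAA"),
]

representatives = most_common_by_group(records)
print(representatives)
```

Without the trimming step, the first two "Region X" records would count as different sequences only because their tails differ in length. The group label can be a location, a US state, or a collection week, matching the three comparisons described above.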