Motivation

The phylogeny of the H3N2 human influenza virus shows a very distinctive structure, exhibiting a pronounced 'trunk' and short side branches that go extinct within 1-5 years. This structure is indicative of strong positive selection driving population turnover. At any given moment, there exists a virus that will, in a few years' time, take over the entire influenza virus population. Because of this fast rate of evolution and population turnover, the vaccine for influenza A (H3N2) must be updated every 1-2 years.

Here, I've shown the phylogeny for H3N2 influenza constructed from the HA1 gene region for 126 viruses sampled between 1968 and 2003. These sequences were taken from Smith et al. (2004), where they were assigned to antigenic clusters based on hemagglutination inhibition assay data. You can download the BEAST XML control file I used to construct this phylogeny here.

I've followed the standard algorithm of sorting branches according to their number of progeny branches. This sorts the trunk of the tree to the top of the y-axis and makes the trunk / side branch structure very clear. However, when I presented this work, Josephine Pemberton made the interesting point that this layout algorithm makes the choice of trunk lineage look entirely deterministic. I've also thought it strange that the y-axis doesn't have a direct interpretation. This phylogeny is a diagram, not a plot.
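As a rough sketch of what this kind of layout does, the Python below orders the children of every node by the number of tips beneath them and then hands out y positions in that order, so the largest clade at each split ends up highest. The nested-dict tree format and the names `count_tips` and `assign_y` are assumptions made for illustration; this is not the code used to draw the figure.

```python
def count_tips(node):
    """Number of tips (leaves) descending from a node."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(count_tips(child) for child in children)


def assign_y(root):
    """Return {node name: y position}, with larger clades placed higher."""
    positions = {}
    next_tip_y = 0

    def visit(node):
        nonlocal next_tip_y
        children = node.get("children", [])
        if not children:
            # Tips take the next free integer slot on the y-axis.
            y = float(next_tip_y)
            next_tip_y += 1
        else:
            # Visit small clades first so they get the lowest y values.
            ordered = sorted(children, key=count_tips)
            child_ys = [visit(child) for child in ordered]
            # Internal nodes sit at the mean y of their children.
            y = sum(child_ys) / len(child_ys)
        positions[node["name"]] = y
        return y

    visit(root)
    return positions


# Toy ladder-like tree: the side branch A splits off first,
# while the n1 -> n2 lineage plays the role of the trunk.
tree = {"name": "root", "children": [
    {"name": "n1", "children": [
        {"name": "B"},
        {"name": "n2", "children": [{"name": "C"}, {"name": "D"}]},
    ]},
    {"name": "A"},
]}
positions = assign_y(tree)
```

In the toy tree, the single-tip side branch A is pushed to the bottom and the lineage with the most descendants climbs to the top. The y position is fixed by the sort order alone, which is why it has no direct interpretation.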

Here, I've made a small attempt at plotting phylogenies with an interpretable y-axis.

Counting nucleotide differences

We can quantify evolution directly in the phylogenetic plot by counting differences between nodes and the root of the tree. Here, I've set BEAST to reconstruct the ancestral sequence at each internal node. Then, rather than using the traditional layout algorithm, I just count the number of nucleotide differences between the sequence at the root of the tree and the sequence at each node, and use that count as the node's position on the y-axis.
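The counting step itself is just a Hamming distance between aligned sequences. A minimal sketch in Python, assuming the root and node sequences have already been pulled out of the BEAST output as equal-length strings (the short sequences and node names below are made up, and every mismatching site is counted, including any gap or ambiguity characters):

```python
def count_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def differences_from_root(root_seq, node_seqs):
    """Map each node name to its difference count from the root sequence."""
    return {name: count_differences(root_seq, seq)
            for name, seq in node_seqs.items()}


# Made-up toy sequences, for illustration only.
root_seq = "ATGCATGC"
node_seqs = {
    "node1": "ATGCATGA",  # one site differs from the root
    "node2": "TTGCATGA",  # two sites differ from the root
}
y_values = differences_from_root(root_seq, node_seqs)
```

Each node is then drawn at its sampling (or inferred) date on the x-axis and its difference count on the y-axis.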

From this you can see that the rate of nucleotide evolution has been both rapid and fairly constant. As in the traditional phylogeny, you can see that very little nucleotide diversity exists at any given moment. However, you can also see that the trunk of the tree doesn't always emerge from the lineage with the most nucleotide differences. You can see this more clearly when plotting residuals.

This was made by calculating a LOESS regression and then taking residuals for each node in the tree. From this, it's clear that the trunk is not deterministically chosen based on nucleotide differences. Still, the LOESS regression shows that trunk nodes are, on average, 0.89 nucleotide differences ahead of the rest of the tree. There is some signal here.
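One way to reproduce this kind of calculation is sketched below in plain Python: a local linear fit with tricube weights (the core of LOESS, without robustness iterations), residuals for every node, and the gap between the mean residual of trunk nodes and of all other nodes. The span of 0.5, the function names, and reading "ahead" as a difference in mean residuals are all assumptions here, not details taken from the original analysis.

```python
def loess_fit(xs, ys, x0, span=0.5):
    """Local linear fit at x0 using tricube weights on the nearest points."""
    n = len(xs)
    k = min(n, max(2, int(round(span * n))))
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest point, nudged up so that
    # tied points never all receive zero weight.
    h = dists[k - 1] * (1 + 1e-9) + 1e-12
    weights = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess_residuals(xs, ys, span=0.5):
    """Observed value minus the local fit, for every point."""
    return [y - loess_fit(xs, ys, x, span) for x, y in zip(xs, ys)]


def trunk_lead(times, counts, is_trunk, span=0.5):
    """Mean residual of trunk nodes minus mean residual of other nodes."""
    res = loess_residuals(times, counts, span)
    trunk = [r for r, t in zip(res, is_trunk) if t]
    other = [r for r, t in zip(res, is_trunk) if not t]
    return sum(trunk) / len(trunk) - sum(other) / len(other)


# Toy data: at each time point, one trunk node sits one difference
# above the trend and one side-branch node sits one difference below.
times, counts, is_trunk = [], [], []
for t in range(10):
    times += [t, t]
    counts += [2 * t + 1, 2 * t - 1]
    is_trunk += [True, False]
lead = trunk_lead(times, counts, is_trunk)
```

In the toy data the trunk nodes come out about two differences ahead by construction. On the real tree the same summary would be computed from node dates, difference counts, and a trunk / side branch label for each node.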

Counting amino acid differences

Here, I've conducted a similar analysis using amino acid sequences instead of nucleotide sequences.

The results are similar, though, surprisingly, amino acid differences don't provide a better prediction of trunk lineage. In this case, trunk nodes are 0.17 amino acid differences ahead of the rest of the tree. It is especially clear that the Beijing/92 cluster (shown in blue) is significantly advanced in amino acid differences, but still dies out.
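Amino acid counts can lag behind nucleotide counts because synonymous changes alter the codon but not the protein. The sketch below shows the idea in Python, assuming in-frame coding sequences and the standard genetic code. Whether the amino acid sequences in this analysis were obtained by translating reconstructed nucleotide sequences is itself an assumption, and the nine-base sequences are made up.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    codons = (nucleotides[i:i + 3]
              for i in range(0, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Made-up toy sequences, for illustration only.
root = "ATGGCTAAA"        # translates to M A K
synonymous = "ATGGCCAAG"  # two nucleotide changes, same protein
replacement = "ATGGATAAA" # one nucleotide change, A -> D

nt_syn = count_differences(root, synonymous)
aa_syn = count_differences(translate(root), translate(synonymous))
nt_rep = count_differences(root, replacement)
aa_rep = count_differences(translate(root), translate(replacement))
```

Here the synonymous variant is two nucleotide differences from the root but zero amino acid differences, while the replacement variant is one of each.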

For this first pass, I haven't addressed one technical issue. The trees shown are a single representative sample from the MCMC chain. It would be possible (and preferable) to calculate means and credible intervals for substitution counts across the MCMC chain.
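A sketch of what that summary could look like in Python, assuming the difference count for a given virus has already been computed on each tree sampled from the chain. The function names and the sample values are hypothetical. Tips are easy to match across sampled trees, whereas internal nodes would need some rule for matching them between trees, which is not handled here.

```python
def percentile(values, q):
    """Linearly interpolated quantile of values, for q between 0 and 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def summarize(samples, mass=0.95):
    """Posterior mean and equal-tailed credible interval of the samples."""
    tail = (1 - mass) / 2
    mean = sum(samples) / len(samples)
    return mean, percentile(samples, tail), percentile(samples, 1 - tail)


# Hypothetical difference counts for one tip across eight sampled trees.
counts_for_tip = [41, 43, 42, 44, 42, 43, 45, 42]
mean, low, high = summarize(counts_for_tip)
```

The equal-tailed interval is the simplest choice; a highest posterior density interval would be an alternative for skewed counts.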