Evolution is mediated not only by random mutations over a number of generations (vertical evolution), but also through the mixture of genomic material between individuals of different lineages (horizontal evolution). The standard evolutionary representation, the phylogenetic tree, faithfully represents the former but not the latter scenario. Although many elaborations have been developed to address this issue, there is still no agreed-upon method of incorporating both vertical and horizontal evolution. Here, we present an alternative strategy based on algebraic topology to study evolution. This method extends beyond the limits of a tree to capture directly even complex horizontal exchanges between multiple parental strains, as well as uncover broader reticulate patterns, including the segregation of segments during reassortment.

The tree structure is currently the accepted paradigm to represent evolutionary relationships between organisms, species or other taxa. However, horizontal, or reticulate, genomic exchanges are pervasive in nature and confound characterization of phylogenetic trees. Drawing from algebraic topology, we present a unique evolutionary framework that comprehensively captures both clonal and reticulate evolution. We show that whereas clonal evolution can be summarized as a tree, reticulate evolution exhibits nontrivial topology of dimension greater than zero. Our method effectively characterizes clonal evolution, reassortment, and recombination in RNA viruses. Beyond detecting reticulate evolution, we succinctly recapitulate the history of complex genetic exchanges involving more than two parental strains, such as the triple reassortment of H7N9 avian influenza and the formation of circulating HIV-1 recombinants. In addition, we identify recurrent, large-scale patterns of reticulate evolution, including frequent PB2-PB1-PA-NP cosegregation during avian influenza reassortment. Finally, we bound the rate of reticulate events (i.e., 20 reassortments per year in avian influenza). Our method provides an evolutionary perspective that not only captures reticulate events precluding phylogeny, but also indicates the evolutionary scales where phylogenetic inference could be accurate.

In On the Origin of the Species in 1859, Darwin first proposed the phylogenetic tree as a structure to describe the evolution of phenotypic attributes. Since then, the advancement of modern sequencing has spurred development of a number of phylogenetic inference methods (1, 2). The tree structure effectively models vertical or clonal evolution, mediated by random mutations over multiple generations (Fig. 1A). Phylogenetic trees, however, cannot capture horizontal, or reticulate, events, which occur when distinct clades merge together to form a new hybrid lineage (Fig. 1B and SI Appendix, Fig. S1). In nature, horizontal evolution can occur through species hybridization in eukaryotes, lateral gene transfer in bacteria, recombination and reassortment in viruses, viral integration in eukaryotes, and fusion of genomes of symbiotic species (e.g., mitochondria). These horizontal genetic exchanges create incompatibilities that mischaracterize the species tree (3). Doolittle (4) argued that molecular phylogeneticists have failed to identify the “true tree of life” because the history of living organisms cannot be understood as a tree. One may wonder what other mathematical structures beyond phylogeny can capture the richness of evolutionary processes.

Linking algebraic topology to evolution. (A) A tree depicting vertical evolution. (B) A reticulate structure capturing horizontal evolution, as well. (C) A tree can be compressed into a point. (D) The same cannot be done for a reticulate structure without destroying the hole at the center.

Current techniques that detect reticulate events can be divided into phylogenetic and nonphylogenetic methodologies. Phylogenetic methods detect incongruence in the tree structure of different segments (5⇓⇓–8). Nonphylogenetic methods probe for homoplasies (shared character traits independently arising in different lineages by convergent or parallel evolution) or similar inconsistencies in sequence alignment (9⇓⇓⇓–13). Although many of these methods are designed for sensitive detection of viral recombination and bacterial lateral gene transfer, they do not provide comprehensive representation of the evolutionary process.

Perhaps phylogenetic networks exemplify the largest departure from trees, allowing multiple paths between any two leaves. These methods visualize incompatibilities of sequence patterns or tree topologies as reticulation cycles in a network (14⇓–16). Only the subfield of evolutionary networks is amenable to reticulate detection. However, major stumbling blocks abound for such methods. Although phylogenetic network structure is not necessarily unique, all current implementations produce only one network that may represent a suboptimal solution; results may depend on factors as arbitrary as the ordering of samples in the data matrix (16, 17). Moreover, many methods have impractical running times for even small datasets owing to the nondeterministic polynomial-time hard (NP-hard) problem of determining whether a tree exists in an evolutionary network (18). To address these obstacles, ad hoc methods simplify the search space of network structures: k-level, galled, tree-child, and tree-sibling networks. Although some of these methods cease to be NP-hard (19), all prioritize computational tractability over biological modeling (20). For example, galled tree networks minimize the number of inferred recombinations by ensuring that reticulation cycles share no nodes (21). This heuristic is appropriate only for low recombination rates and is not universally applicable.

Here, we propose a comprehensive and fast method of extracting large-scale patterns from genomic data that captures both vertical and horizontal evolutionary events at the same time. The structure we propose is not a tree or a network, but a set of higher-dimensional objects with well-defined topological properties. Using the branch of algebraic topology called persistent homology (throughout this paper, we refer to mathematical homology, not the notion of genetic or structural similarity), we extract robust global features from these high-dimensional complexes. Unlike phylogenetic methods that produce a single, possibly suboptimal, tree or network, persistent homology considers all topologies and their relationships across the entire parameter space of genetic distance. Through analysis of viral and simulated genomic datasets, we show how persistent homology captures fundamental evolutionary aspects not directly inferred from phylogeny. In addition to representing clonal and reticulate evolution, persistent homology can determine the rate of horizontal genomic events, complex exchanges involving more than two organisms, and statistical patterns of cosegregation (genes more likely to be exchanged as a set). We calibrate our method using viral genomes because they are richly sampled and annotated, with a wide range of reticulate events; however, we foresee broader application to bacteria, eukaryotes, and other datasets.

Results

Persistent Homology in Evolution. We propose a mathematical structure that represents both vertical and horizontal evolutionary events at once. This structure is based on the field of algebraic topology, which characterizes global properties of a geometric object that are invariant to continuous deformation, that is, stretching or bending without tearing or gluing any single part of it. These properties include such notions as connectedness (the number of distinct connected components), as well as the number of holes. We aim to apply these concepts to characterize the topology of evolution. In settings of vertical evolution, a tree can be continuously deformed into a single point or connected component (Fig. 1C). The same action cannot be performed for a reticulate structure without destroying the loops within it. The active hypothesis is that the presence of holes results directly from reticulate events (Fig. 1D). We are then interested in computing the number of holes in the evolutionary topological space, and algebraic topology mathematically formalizes these notions. In particular, holes can exist in different dimensions: a loop in one dimension, a void or cavity in two dimensions, and so on. To be precise, we are interested in holes that are “irreducible” cycles: a cycle in dimension k that does not serve as the boundary of a (k + 1)-dimensional object. We can define a topological invariant called the “homology group” H k as an algebraic structure that encompasses all holes in dimension k, and the “Betti number” b k is the count of these holes. The special case of the H 0 group addresses how many independent, unconnected components comprise a space. For our purposes, we can assume that evolution forms some topological space E. Instead of directly observing E, we observe a sample of data points in E, particularly the genomic sequences separated from each other by some genetic distance. The set of these data points and space E do not share the same topology. However, we can estimate the topology of E by defining a function B(x,ε) as the ball centered at data point x with radius of genetic distance ε. We can show that at some value of ε, the union of balls B(x,ε) for all x shares the same topology as E (SI Appendix, Fig. S2). The topology of the union of balls is difficult to compute but can be estimated by constructing a corresponding topological space called a simplicial complex (22⇓–24). In short, a simplicial complex is a set of points, lines, triangles, tetrahedra, and higher-dimensional “simplices.’’ Like any topological space, this structure can contain holes of different dimensions (SI Appendix, Fig. S3). To build the simplicial complex, one can construct a line if any pair of points is within distance ε of each other, a triangle if any triplet of points are all within ε of each other, and so forth. At some ε, the resulting simplicial complex shares the same topology as the union of balls, and that of E as well (SI Appendix, Fig. S2). However, different scales of ε create different simplicial complexes and reveal different irreducible cycles. A more comprehensive approach would consider all simplicial complexes over the entire parameter space of ε. For irreducible cycle C, we track when C exists over a filtration (subset) of simplicial complexes over a particular interval [a C ,b C ] of genetic distance ε. Here, a C and b C are the birth and death of feature C. We then perform persistent homology, which computes the homology groups of dimension k at all scales ε. This process is depicted in a barcode plot, which shows a horizontal bar between ε = [a C ,b C ] for every independent object C in the homology group. Bars persisting over a large interval of ε are unlikely to derive from noise (24). Our aim, then, is to apply persistent homology to the study of evolution. We consider a set of genomes and calculate the genetic distance between each pair of sequences. Using the pairwise distance matrix, we calculate the homology groups across all genetic distances ε in different dimensions. We can refine our original hypothesis now and assert that zero-dimensional topology provides information about vertical evolution. At a particular scale ε, for example, b 0 represents the number of different strains or subclades. However, one-dimensional topology provides information about horizontal evolution, because reticulate structures contain loops (Fig. 2). We hypothesize that even higher-dimensional homology groups H i≥2 result from multiple horizontal exchanges or complex reticulate events involving multiple parental strains. The “generator,” the set of sequences that represents a particular irreducible cycle, can describe such complex genomic mixtures, as we will see in simulations and real data (HIV and avian influenza). In sum, we propose to connect the principles of algebraic topology and evolution and provide a dictionary that translates vocabulary in both fields (Table 1). Fig. 2. Persistent homology characterizes topological features of vertical and horizontal evolution. Evolution was simulated with and without reassortment (SI Appendix, Supplementary Text). (A) A metric space of pairwise genetic distances d(i,j) can be calculated for a given population of genomic sequences g 1 ,…, g n . We visualize these data points using principal coordinate analysis (PCoA) (SI Appendix, Supplementary Text). (B) In the construction of simplicial complexes, two genomes are considered related (joined by a line) if their genetic distance is smaller than ε. Three genomes within ε of each other form a triangle, and so on (SI Appendix, Supplementary Text). From there, we calculate the homology groups at different genetic scales. In the barcode, each bar in different dimensions represents a topological feature of a filtration of simplicial complexes persisting over an interval of ε. A one-dimensional cycle (red highlight) exists at ε = [0.13, 0.16 Hamming distance] and corresponds to a reticulate event. The evolutionary scales I where b 1 = 0 are highlighted in gray. Table 1. Dictionary between persistent homology and evolutionary concepts

Topological Obstruction to Phylogeny. We can mathematically formalize the role of phylogeny within the framework of persistent homology. By definition, trees cannot contain holes; therefore, higher-dimensional irreducible cycles in a simplicial complex constructed from genomic data preclude phylogenetic construction at genetic distance ε. This intuition is proven for holes in dimension one and higher (SI Appendix, Theorem 2.1). We define an additive tree as a phylogeny where the distance between two leaves is the sum of the branch lengths connecting them. If there exists a nonzero Betti number of dimension greater than zero, then no additive tree exists that appropriately represents the genomic data. A related concept is the set of genetic distances I where nonzero topology vanishes. I reflects the evolutionary scales at which there is no evidence of reticulate exchange that can confound tree construction (Fig. 2B). We define the topological obstruction to phylogeny (TOP) to be the L-∞ norm, or maximum, of B. If TOP is nonzero, then no additive tree exists that can appropriately represent the data. Another important concept is stability: the amount of fluctuation in the results owing to statistical noise, sequencing errors, or incomplete sampling. We show that TOP is a stable measure that is bounded by the Gromov–Hausdorff distance to the additive tree (SI Appendix, Theorem 2.2). Because additive structures have vanishing higher-dimensional homology, small deviations from additivity generate only small bars in the barcode.

Topological Estimates of Recombination/Reassortment Rates. Most approaches to estimating recombination rates are based on observed variance in site differences between pairs of haplotypes (25) or maximum likelihood estimators (26). Typically, these estimators assume constant population size, panmictic populations, and constant rates. Persistent homology, however, provides a lower bound for these rates by considering independent irreducible cycles for all ε in a period. The irreducible cycle rate (ICR) then is defined as the average number of the one-dimensional irreducible cycles per unit of time, ICR = (total number of one-dimensional bars for all ε)/(time frame). The numerator of this quantity is the L-0 norm (number of higher-dimensional bars). We normalized ICR based on the time interval between the earliest 5% and most recent 5% of the sequence dataset. Simulations show that ICR is proportional to and provides a lower bound for recombination/reassortment rate (SI Appendix, Fig. S5 C and D and G and H). In this way, sequence data add a temporal dimension to our methodology, a unique aspect not shared by other applications of algebraic topology.

Detection of Simulated Reticulate Events. To evaluate sensitivity and specificity of persistent homology to capture complex evolutionary processes, we simulated four scenarios: clonal evolution, population admixture, reassortment, and homologous recombination. Each simulation represents a population of constant size evolving over generations under a Wright–Fisher model with substitution rate μ, reassortment/recombination rate r, number of reassorting segments S, and subsamples with or without ancestral sequences (SI Appendix, Supplementary Text). Simulations show that (i) nontrivial homology appears when r is nonzero, (ii) one-dimensional ICR increases proportionally to r, and (iii) multiple reassortment/recombinant events can produce 2D topology. A precise account of simulations, comparison with other methods to identify recombination, and discussion of results can be found in SI Appendix, Supplementary Text and Figs. S4–S6.

Nonrandom Reassortment Pattern in Avian Influenza. Although previous phylogenetic studies confirmed a high reassortment rate in avian influenza, none has identified a clear pattern of gene segment association (32). To determine whether any segments cosegregate more than expected by chance, we applied persistent homology to avian influenza. We first considered all pairs of concatenated segments and estimated the number of reassortments by b 1 . We then ascertained the significance of observing a number of reassortments between each pair of segments given the total estimate of reassortments in the concatenated genome (SI Appendix, Supplementary Text). Analysis of avian influenza reveals a statistically significant configuration of four cosegregating segments: polymerase basic 2 (PB2), polymerase basic 1 (PB1), polymerase acidic (PA), and nucleoprotein (NP) (Fig. 3D). Interestingly, this pattern mimics previous in vitro results that suggest that effective protein–protein interaction between the polymerase complex and the NP protein constrain reassortment (31).