We show in the Supplemental Information that an estimate for P(s→t | π) can be efficiently computed. Consider a path c, consisting of a set of states beginning at s and ending at t, constructed as follows. In the ith path step c, identify the set T(c) of all accessible states that are compatible with t (meaning that they do not lack genes that t possesses). Choose the next state cfrom T(c) according to the associated transition probabilities normalized over T(c) alone. The advantage of this sampling approach is that we can simulate trajectories guaranteed to start at s and end at t, thus avoiding wasting computational time on trajectories that do not contribute to the overall sum. An estimate of P(s→t) is then efficiently given by averaging the quantity ΠP(c→ ○ ϵ T(c)) over a set of samples, where cis the ith state in path c, ○ ϵ T(c) denotes any member of the set of states accessible from cthat are compatible with final target t, and the sampling is taken over a set of paths, constructed according to the sampling scheme above, starting at s and ending at t ( Figure S1 ). This estimate can be computed using the algorithm below, which to our knowledge has not been previously published; if this is true, we propose the name “HyperTraPS,” both standing for hypercubic transition path sampling and referring to the act of forcing trajectories toward specific points on a hypercube. We note that HyperTraPS is in a sense more comparable to the transition interface sampling or forward flux sampling approaches found in statistical physics than transition path sampling as understood in that context ().

In more detail, we work with an “evolutionary space” consisting of a set S of states comprising all possible patterns of the presence and absence of L traits under consideration, and a set of transition probabilities πbetween each pair of states s and t, determining the stochastic dynamics of evolution ( Figure 1 B). To compute the probability of observing given evolutionary transitions, we require an estimate of the probability P(s→t | π) that an evolutionary trajectory on a network π passes through a “target” state t, given that it passed through a “source” point s.

Using this general framework, we constructed an algorithm that allows us to compute how likely an evolutionary path within this modeling framework is to visit two given states. This computation is simple in the above example, but when L is higher and there are many different paths through which a given state may be visited (or avoided), it becomes prohibitively difficult to compute the probability of every appropriate path. Our approach provides an efficient way to estimate the probability of observing a given pair of states, given a particular parameter set of intensities. Briefly, it accomplishes this by sampling the set of paths that start at the first state and finish at the second, but preferentially sampling the paths that are most likely. This preferential sampling is accomplished first by ensuring that no sampled paths involve transitions to states from which the second desired state cannot be reached, and second by choosing each step on a sampled path proportionally to its intensity. The amount of bias introduced by this preferential sampling is recorded at each step in the sampled path, and it is corrected for to yield the probability required.

For example, consider a simple system restricted to L = 3 genes. The evolutionary space is illustrated in Figure 1 B, with intensities of each transition labeled — for example, in Figure 1 B (left), we expect 98% of evolutionary processes to follow the initial step 111→110. Hence, in Figure 1 B (left), a step from 110 is highly likely to be 110→100 due to that transition’s high intensity, whereas in Figure 1 B (right) — with a different set of intensities — a step from 011 is equally likely to be 011→010 or 011→001 because these transitions have equal intensity.

To explore the evolutionary processes that give rise to present-day patterns of mtDNA genes across taxa, we constructed a general model for mtDNA evolution that allowed us to amalgamate and unify the large volume of genomic data available (see Experimental Procedures ). This model includes a description of every possible pattern of the presence and absence of all 65 protein-coding mtDNA genes that we consider. These patterns are represented by binary strings of length L, where a 0 as the ith character corresponds to the ith gene being absent and a 1 at that position corresponds to that gene being present. We will use this picture to represent the transitions that make up the evolutionary history of mtDNA ( Figure 1 A). We allow evolutionary transitions between states that differ by one trait, so that individual genes are lost one by one (however, simultaneous loss of several genes can be captured by this model, and this corresponds to an equal probability that each gene is lost at a given timestep). Each transition has a given intensity, and the probability of a given transition from some state is proportional to that transition’s intensity (normalized by the sum of intensities of all transitions leading away from that state). Evolution is modeled as a walk from the state of all 1 s to the state of all 0 s, with individual transitions along the walk occurring randomly, weighted by intensity.

(E) Illustration of the distinct mitochondrial gene sets present in the source dataset, ordered vertically from highest to lowest gene content. Rows are genomes, and columns are mitochondrial genes (an example species is given in gray for each genotype). Black and white pixels represent present and absent genes, respectively. Single letters denote positions of some well-known organisms by initial letter: H, Homo sapiens; R, Reclinomonas americana; P, Plasmodium vivax; F, Fucus vesiculosus; Z, Zea mays; S, Saccharomyces cerevisiae; A, Arabidopsis thaliana; C, Caenorhabditis elegans; D, Drosophila melanogaster.

(D) This posterior distribution is then summarized by recording the probability with which a given gene is lost at a given ordering on an evolutionary pathway.

(C) MCMC is used to build a posterior distribution of parameterizations based on the associated likelihood of observed transitions, producing an ensemble of possible evolutionary landscapes.

(B) An evolutionary space defined by the presence or absence of L = 3 traits and parameterized by probabilities of transitions between these states. If our source data reveal two evolutionary transitions 111→110 and 110→100, then the parameterization on the left is more likely than that on the right because it supports evolutionary trajectories that are likely to give rise to those observations. HyperTraPS is used to calculate the associated likelihood, determining which parameterizations are accepted (perhaps the left) and which are rejected (perhaps the right).

, the number of sampled trajectories, is a parameter of the algorithm. Lower numbers will be computationally cheaper but will give a poorer sampling of possible trajectories and thus a less accurate estimate. In the Supplemental Information , we give a more detailed description of HyperTraPS and show a range of validation checks for its functionality ( Figures S2 A–S2E). To infer patterns of evolutionary dynamics with HyperTraPS, first a phylogeny is used to identify the evolutionary transitions that have occurred throughout history ( Figure 1 A and Supplemental Information ). Then HyperTraPS is used to compute the joint probability of observing these transitions given a trial parameterization π ( Figure 1 B). A Markov chain Monte Carlo (MCMC) approach is used to sample parameterizations that are associated with high likelihoods ( Figure 1 C), forming a posterior distribution over π that can be summarized and visualized ( Figure 1 D). In the Supplemental Information , we compare HyperTraPS with related approaches and demonstrate its increased flexibility and efficiency ( Figures S2 F and S2G).

Compute the probability of making a move to a t-compatible next step (for the first step, all trajectories are at the same point and the probability for each is thus the same); record this probability as α’ i .

We used the HyperTraPS algorithm within a Bayesian MCMC framework to compute the probabilities of different patterns of mtDNA gene loss, given observed changes in mtDNA across the tree of life and uninformative uniform priors on transition probabilities. Figure 3 shows a summary of these results, illustrating the probability with which a given gene is lost at a given step in time from a “full” genome containing all genes found in R. americana, to an “empty” mitochondrial genome containing no genes. The figure heuristically represents possible pathways for the evolutionary history of a “typical” mitochondrial genome in an evolving eukaryotic lineage, and it shows that the pattern of mitochondrial gene loss is remarkably structured and non-random. The inferred structure quantitatively supports intuitive observations, for example, cytochrome b (cob) is observed in almost every known mitochondrial genome, whereas the secY-independent protein translocase component tatA is only observed in R. americana. We broadly observe three classes of genes: early loss (including complex II sdh[X] genes, and many ribosomal rps[X] and rpl[X] genes); intermediates present in a variety of taxa (including plants and fungi) but lost in animals (including more subunits of the mitochondrial ribosome and some Complex V atp[X] genes); and highly retained genes (including Complex I nad[X] genes, Complex IV cox[1-3] genes, and cytochrome b cob). Different lineages have clearly experienced different specific gene loss trajectories (for example, Schizosaccharomyces pombe possesses cox2 but not cox3, and Babesia bovis possesses cox3 but not cox2; Figure 1 E), but the probabilistic trend observed in Figure 3 holds broadly (convergently) across eukaryotes.

The probability that a given gene is lost at a given time ordering in the process of mitochondrial gene loss. The flat surface in the main plot and black contour in the inset give the probability (1/L) associated with a null model in which all genes are equally likely to be lost at all times. Blue corresponds to a probability above that expected from this null model; red corresponds to a probability below this null model. Genes are ordered by mean inferred loss time.

We analyzed the complete annotated mitochondrial genomes of 2015 species from GOBASE (), identifying the presence and absence of mitochondrial genes. For a basis of comparison, we used the jakobid protozoan Reclinomonas americana. R. americana has a large mitochondrial genome () that includes orthologs of every gene found in any organism’s mitochondrial genome (with very few exceptions []). Across the 2015 species, 74 distinct combinations of genes were present ( Figure 1 E). Visible vertical clusters of retained genes include subunits of Complexes I (nad[X]), III (cox[X]), and V (atp[X]), and the small subunit of the mitochondrial ribosome (rps[X]). We then mapped the differences between each genome onto a phylogeny of all the species in our dataset ( Figure 2 ). We employ two assumptions about mtDNA evolution: first, that mitochondrial gene loss is rare, and second, that mitochondrial gene gain is negligible. These assumptions allow us to reconstruct the mitochondrial genomes at ancestral nodes in the phylogeny ( Supplemental Information ). To ensure that our approach was robust to potential errors in annotation and phylogeny construction, we repeatedly perturbed our source data and confirmed that our results were comparable across perturbed datasets ( Figure S2 E; Experimental Procedures Supplemental Information ).

Each leaf is an organism in which the set of present mitochondrial genes has been characterized. Colored pairs of nodes denote those ancestor/descendant pairings where a change in mitochondrial gene complement is inferred to have occurred, with hue denoting the number of protein-coding genes present from blue (maximum 65) to red (minimum 3). Single letters denote positions of some well-known organisms by initial letter; see Figure 1 for labels.

Mitochondrial DNA of the coral Sarcophyton glaucum contains a gene for a homologue of bacterial MutS: a possible case of gene transfer from the nucleus to the mitochondrion.

During the preparation of this report, crystal structure data for ETC Complex I became available on the PDB (), affording a valuable opportunity to independently test the prediction of our model. We computed interaction energies for Complex I subunits and found that protein products encoded by mtDNA genes also display the strongest interaction energies, consistent with the pattern we observed with other complexes ( Figure 4 ). As is the case with the other ETC complexes, the Complex I subunit with the strongest interaction energy, nad1, is one of the most commonly retained genes in mitochondrial genomes across all taxa.

We analyzed the crystal structures that have been determined for ETC Complexes II, III, and IV (see Experimental Procedures ). We found that the genes that encode the proteins with the strongest binding energy (computed using PDBePisa; see Experimental Procedures ) within each complex have invariably been retained in mitochondrial genomes more commonly than those encoding other subunits of the complex ( Figure 4 ). This relationship is apparent across the entire eukaryotic tree of life. The two genes most retained by mitochondrial genomes throughout life, cob and cox1, are the most energetically central components of their complexes (Complexes III and IV, respectively). The sdh[2-4] genes controlling Complex II and cox[2-3], which play an important role in Complex IV but are less central than cox1, represent intermediate points on this spectrum. Many organisms do not encode any Complex II subunits in mtDNA: those that do so retain the energetically most central sdh[2-4] genes and not sdh1. cox[2-3] are retained in the mitochondria of most but not all organisms. By contrast, the energetically more peripheral gene products are invariably encoded by the nucleus. This link between biophysical and evolutionary features of mitochondrial genes is consistent with the CoRR hypothesis, supporting its applicability to organellar genomes of multiple and diverse taxa across the tree of life.

Interaction energy of subunits within Complexes I-IV computed with PDBePISA (see Experimental Procedures ). Blue bars denote subunits encoded by mtDNA in at least one eukaryotic species; pink bars denote subunits always encoded in the nucleus. Inset crystal structures show the location of mtDNA-encoded subunits in blue (darker shades show higher interaction energies).

Our inferred order of gene loss shows that genes encoding distinct mitochondrial protein complexes were lost in different “phases” throughout evolutionary time. As a first step in more detailed characterization of this loss patterning, we sought to examine the order of gene loss within mitochondrial protein complexes to address a long-standing hypothesis for why some genes are retained in organelles. As described in the introduction, the CoRR hypothesis suggests that components of central importance to organelle function are preferentially encoded in the organelle genome to allow localized control of gene expression and therefore the assembly of protein complexes (). This allows individual mitochondria to adjust the stoichiometry of ETC complexes in response to demands or stresses, without affecting the regulation of other mitochondria within the same cell. We reasoned that protein subunits occupying energetically central positions within their complexes likely exert the most control over complex assembly () and thus provide a means to test this hypothesis.

GC Content and Protein Hydrophobicity Are Both Required to Predict Mitochondrial Gene Retention

We next sought to examine additional factors that predict the probability that a given gene is retained in the mitochondrial genome. To do this, we gathered data for ten properties of mitochondrial genes (or their encoded proteins) that have been hypothesized to influence the probability that a given gene is retained in the mitochondrial genome ( Table S1 ). These hypotheses can be viewed as predicting a strong link between the corresponding gene property and the propensity with which a gene is lost from mtDNA. Our characterization of the probability with which a gene is lost in a given ordering allows us to quantify the strength of these hypothesized links within a probabilistic framework.

We used a linear model approach to explore the ability of features of mitochondrial genes, or the proteins they encode, to predict the inferred patterns of gene loss shown in Figure 3 . As described in Experimental Procedures , we performed our Bayesian model selection approach with two datasets: one corresponding to genetic features in R. americana alone and one with features averaged across taxa. This approach allows us to control both for specific properties of R. americana and for cross-species variation. As shown in this section and the Supplemental Information , results from both datasets are very comparable, illustrating the robustness of our findings.

Figure 5 GC Content and Hydrophobicity Predict Mitochondrial Gene Retention Show full caption (A) Bars show posterior probabilities for individual models for mitochondrial gene loss based on gene properties in R. americana, given the inferred gene loss patterns and a prior favoring parsimonious models. Inset matrix show the posterior probability with which features are present in these models for gene loss (diagonal elements correspond to single feature, off-diagonal elements to a pair of features). Labels A–J denote model features as described in Experimental Procedures , B is GC content, and C is protein product hydrophobicity. (B) Predicted loss ordering from GC and hydrophobicity model (horizontal axis) against inferred mean loss ordering (vertical axis). (C) GC content at more synonymous (position 3) and less synonymous (average of position 1 and 2) positions in codons, for different mtDNA genes (datapoints) in R. americana, H. sapiens, and averaged across taxa. (D) Codon use and GC content in R. americana and H. sapiens. Black polygons give the proportional usage of each codon (segments) that can encode each amino acid (groups of segments) in protein-coding genes. Codon segments are shaded according to GC content (blue highest, white lowest); a null model where each codon for a given amino acid is used equally is shown in red. In R. americana but not H. sapiens, GC-rich codons are disfavored: the black usage polygon often falls below the red null model polygon in GC-rich (blue) segments. This disfavoring is quantified by considering Pearson’s r between the amount of disfavoring (the ratio of observed codon usage to codon usage under the null model) and GC content. The resultant posterior probabilities over model structure favored GC content and protein product hydrophobicity as predictors of whether or not genes are retained in mitochondrial genomes; GC content and hydrophobicity appear in 97%–98% of inferred model structures using both R. americana gene properties ( Figure 5 A) and averaged gene properties ( Supplemental Information Figures S3 A–S3C). Few other properties are so represented, with the exception of strandedness (J), which appears in 56% of inferred models using averaged gene properties (but few models using R. americana properties). Results involving uniform priors on model structure, and using an expanded set of features, were compatible with these findings ( Figures S3 A–S3C; Supplemental Information ). The absence of other features in favored models illustrates a lack of support for other retention determinants including, for example, features related to genetic code variability and mutational robustness. We did not find strong evidence that the magnitude of a gene’s expression is directly related to that gene’s retention in mtDNA ( Supplemental Information Figure S3 D), although this analysis was based on a smaller dataset limited by available expression data (see Experimental Procedures and Supplemental Information ).

Samuels, 2005 Samuels D.C. Life span is related to the free energy of mitochondrial DNA. Reyes et al., 1998 Reyes A.

Gissi C.

Pesole G.

Saccone C. Asymmetrical directional mutation pressure in the mitochondrial genome of mammals. Our analysis provides strong support for genes with high GC content relative to other genes in the same organism being preferentially retained in mtDNA. As we discuss in detail in the Supplemental Information , patterns of GC content vary dramatically between species ( Figures 5 C and 5D;), but this inter-species variability is never incompatible with the intra-species favoring of GC-rich genes. Notably, at a sequence level, we have found that protein-coding genes in mtDNA can in some species display a bias against GC content, in the sense that GC-poor codons are used more often than GC-rich codons to encode a given amino acid ( Supplemental Information Figure 5 D). This is most likely due to the strong asymmetric mutational pressure arising from the hydrolytic deamination of cytosine into uracil, which is converted into thymine (). Our results suggest a tension between this mutational drive at the sequence level (decreasing genome-wide GC content) and a selective drive retaining individual genes with higher GC content.

To better understand the net result of this tension, we compared the GC content of individual genes at more nonsynonymous (bases 1 and 2 in a codon) and more synonymous (base 3) positions. We performed this comparison both for R. americana and for GC statistics aggregated across a diverse set of organisms. We found that GC content was lower at synonymous positions than at nonsynonymous positions in R. americana, but that this difference largely disappeared in the dataset aggregated across taxa ( Figure 5 C). This difference in R. americana is consonant with a link between GC content and structural conservation at the protein level: synonymous sites can be mutated without causing deleterious changes to protein structure, and are thus susceptible to the GC-decreasing asymmetric mutation, whereas nonsynonymous sites are more constrained and GC-decreasing mutations will be selected against. In this picture, the link between GC content and gene retention could arise indirectly through another relationship: genes most retained in mtDNA being most highly conserved (perhaps due to their structural importance in protein complex assembly, as above). However, the absence of this signal in the taxa-averaged data suggests that this relationship may not hold across all the species we consider. We also tested the link between structural conservation and gene retention by using nonsynonymous, synonymous, and total GC content as different terms in our model selection process ( Supplemental Information Figure S4 ). The model selection process always favored total GC content, suggesting that the conservation link, while explanatory in R. americana, does not represent the only way in which GC content is linked to gene retention (see Discussion ).

2 = 0.51 (p ≈2 × 10−11; see The parameterization of the most common model always favored the retention of genes encoding proteins with high hydrophobicity and high GC content. In Figure 5 B, we illustrate a specific parameterization of this model, chosen to optimize the least-squares difference between model prediction and mean inferred ordering (from Figure 3 ). This parameterization shows a strong gene-by-gene correlation with the mean inferred loss ordering, with Pearson’s R= 0.51 (p ≈2 × 10; see Supplemental Information for interpretation) and Spearman’s ρ = 0.72. The model also strongly correlated with observation frequencies of mtDNA genes in the original dataset ( Figure S3 G).

Intuitively, it may be expected that the signals associated with GC content, hydrophobicity, and energetic centrality may reflect different aspects of the same fundamental feature: GC-rich codons encode hydrophobic amino acids, and hydrophobic proteins are likely to occupy “core” positions in complexes and within membranes. However, there is evidence that these features are at least somewhat decoupled. Figure S3 F shows an absence of strong correlations between the three features for the genes we consider: Complexes II and IV display a moderate correlation between scaled energetic centrality and GC content (although sample size is limited by the small number of subunits), but no relationship exists for other complexes or hydrophobicity. Additionally, our model selection process will discard redundant information; if all the predictive power associated with GC content was already present in the hydrophobicity data, GC content would not be identified as a joint factor in the selected models. Although the links above are likely true in some contexts, we cannot avoid the conclusion that independent features associated with GC content, hydrophobicity, and energetic centrality all contribute to gene loss propensity. A combination of these, as opposed to any single feature, is therefore required to account for observed patterns of mtDNA gene retention.