High-performance computing

Analyses of the genome data sets in this study were computationally intensive. Heavy computations used the “Comet” supercomputer located at the San Diego Supercomputer Center (SDSC). Each standard node is equipped with 24 Intel Haswell CPU cores and 128 GB of DDR3 memory, while each GPU node is equipped with four NVIDIA P100 GPUs, plus 28 Broadwell CPU cores and 128 GB memory. A small proportion of the computations used the “Barnacle” computer system operated by the Knight Lab, each node of which has 32 Haswell CPU cores and 256 GB DDR3 memory. Whenever possible, all CPU cores were used in a typical multi-processing task to minimize run time. Tasks that required >128 or 256 GB memory used the large-memory nodes of Comet, featuring 64 CPU cores and 1.5 TB memory per node. Benchmarking of the prototype selection algorithms and some local developments were performed on the “WarpDrv” workstation, equipped with 32 Intel Sandy Bridge CPU cores and 256 GB of DDR3 memory.

Retrieval of genome data and metadata

Microbial genomes were downloaded from the NCBI genome database (GenBank and RefSeq) as of March 7, 2017. We used and provided updates related to this work to the automated workflow RepoPhlAn (https://bitbucket.org/nsegata/repophlan, commit 03f614c) to download genomes from the NCBI server. Each genome was given a unique identifier, which was derived from the NCBI accession of the corresponding assembly, but without version number. For example, a genome with assembly accession “GCF_000123456.1” was identified as “G000123456” in this study. In cases when the same genome was present in both GenBank (accession starting with “GCA_”) and RefSeq (accession starting with “GCF_”), only the RefSeq version was kept.

Annotation and classification of marker genes

The functional annotation of the 400 PhyloPhlAn marker genes25 was performed by aligning the protein sequences of the 400 marker genes (inferred from 2887 bacterial and archeal genomes as described in Segata et al.25) against the UniRef50 database (March 2018 release) using BLASTp. The best hit for each gene was taken and queried against the UniProt database for gene and protein names. To categorize genes by function, the UniProt entries were translated into Gene Ontology (GO) terms40 with the “subset_prokaryote” tag (March 2018 release). Because not all UniProt entries have corresponding GO terms, manual curation was involved to pick the most appropriate GO terms for those cases by examining the BLAST hit table. GO terms were further translated into GO slim terms to obtain higher-level functional categories. Note that this analysis is independent from the phylogenetic analysis of the current genome data set, and the result can be used as a reference for PhyloPhlAn users.

Analyses of genome sequences and identification of marker genes

The DNA sequences of the 86,200 bacterial and archaeal genomes were subjected to the following analyses:

1. The quality scores for DNA, protein, tRNA, and rRNA were calculated following Land et al.41. 2. A MinHash sketch was built for each genome using Mash 1.1.142, with default settings (sketch size = 1000, k-mer size = 21), based on which a pairwise distance matrix was built for the entire genome pool. In brief, MinHash is a k-mer hashing technique that enables the quantification of genome-to-genome distance. It is efficient for very large genome sets, and it correlates well with the conventionally used average nucleotide identity (ANI)42. 3. Although NCBI provides genome annotations, we chose to re-annotate the genomes using a uniform protocol to ensure consistency. Specifically, open-reading frames (ORFs) were predicted using Prodigal 2.6.343, in the single-genome mode, and allowing ORFs to runoff edges of scaffolds. 4. Based on the predicted ORFs, the 400 marker genes were inferred and extracted using the phylogenomics pipeline PhyloPhlAn (commit 2c0e61a)25, in which the 400 marker genes were originally introduced. In brief, we used USEARCH v9.1.13 to align ORFs against the reference marker gene sequences (see above) at an E-value threshold of 1e-40. It then selected the highest bit score hit of each ORF. Should more than one hit per marker per genome was observed, the highest bit score hit was selected as the representative of that marker gene. 5. The completeness, contamination, and strain heterogeneity scores were computed using CheckM 1.0.744 with the default protocol (“lineage_wf”).

Prototype selection and genome sampling

Proper taxon sampling is a key prerequisite to inferring an unbiased view of organism evolution45,46. Beyond computational challenges in robust tree-building, the highly uneven distribution of known biodiversity (e.g., 40.0% of all genomes (34,507) belong to the nine most-sequenced species) requires deliberate subsampling to reduce the bias from the resulting phylogeny in representing a global view of evolution. We therefore applied the data-reduction strategy of “prototype selection”47, which subsamples genomes from the pool such that they represent the largest possible biodiversity—in terms of maximized sum of pairwise distances as defined by k-mer signatures (Supplementary Fig. 1a). We developed a heuristic (detailed in Supplementary Note 1), capable of handling the size of the current genome pool, with results comparable with or better than published alternatives (Supplementary Fig. 1b–e).

Using this algorithm and by applying multiple criteria, we downsampled the 86,200 bacterial and archaeal genomes to 11,079. The procedures are detailed below.

1. Excluded genomes with marker gene count < 100 or contamination > 10%. The marker gene count threshold 100 was chosen because it is sufficiently large to yield high resolution of the tree using ASTRAL (Supplementary Fig. 10a, c). The contamination threshold 10% is inline with the medium- and low-quality draft genome standards proposed by Bowers et al.6. Nevertheless, we did not adopt the completeness and tRNA/rRNA coverage thresholds6, because the 400 protein-coding marker genes are more relevant for phylogenetic reconstruction. 2. Included the NCBI-defined reference and representative genomes (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/). 3. Included genomes that are the only representative of each taxonomic group from phylum to genus. 4. Included genomes that are the only representative of each species without defined lineage (no classification other than species). 5. Executed the prototype selection algorithm developed in this work: “destructive_maxdist” (see Supplementary Note 1) based on the distance matrix defined by MinHash signatures, with the already included genomes as seeds, to obtain a total of 11,000 genomes. 6. For each phylum to genus, and species without classification from phylum to genus, selected one with the highest marker gene count. This added 79 genomes to the selection.

These 11,079 genomes were subjected to our phylogenetics protocol, during which further filtering was performed based on sequence alignment quality (see below). Eventually, 10,575 genomes were retained.

Impact of alternative genetic codes

We chose to uniformly apply the standard archaeal and bacterial genetic code table 11 to all genomes in order to minimize bias. Reports have shown that several lineages, such as Mycoplasma/Spiroplasma48, Hodgkinia49, and Absconditabacteria50, use alternative genetic code tables 4, 25 and others, in most of which a stop codon is repurposed to encode for an amino acid, resulting in ORF elongation. We did not incorporate alternative genetic codes, however, because there is no accurate way to associate each of the 86,200 genomes with its true genetic code. Incorrect truncation of ORFs may unnecessarily exclude genes and taxa, whereas incorrect elongation of ORFs could result in artificially long branches, because the amino acid sequence after a true stop codon is likely relaxed from selective pressure. Considering our goal of inferring phylogenetic topology and distances, we decided to only use the standard genetic code.

However, we did test the impact of using alternative genetic code on the gene and taxon sampling. We ran Prodigal 3.0.0-rc1, which automatically switches from genetic code 11 to 4 if the average ORF length is too short. This resulted in altered gene calling results in 453 out of the 86,200 genomes, of which 63 had overly short ORF lengths even when using genetic code 4. PhyloPhlAn marker gene discovery on the other 390 genomes with genetic code 4 suggested marginal increase in the extracted number of the 400 marker genes per genome (1.23 ± 5.28, mean and std. dev.). Only seven additional genomes which had <100 marker genes managed to pass this threshold (see above) after switching to genetic code 4. Therefore, omitting alternative genetic code has little impact on the inclusion of genomes.

Metric multidimensional scaling (mMDS) of genome distances

The effect of prototype selection was visualized using the mMDS technique, which renders a low-dimensional plot that minimizes the loss of information when transforming from the high-dimensional data. We performed mMDS using the “mds” function implemented in scikit-learn 0.19.251 on the genome distance matrix, using the default setting, to compute the coordinates at the top five axes. The resulting coordinates were visualized with the interactive tool Emperor52 as bundled in QIIME 2 release 2017.1253.

Protein sequence alignment and filtering

Protein sequences of each of the 400 marker gene families were aligned using UPP v2.054, a phylogeny-based and fragmentary-aware alignment tool. UPP consists of several sequentially connected modules. It first identifies suspected fragmentary sequences, then calls PASTA v1.8.055 to align the remaining sequences and build a phylogeny (backbone tree) based on them. Then it builds an ensemble of HMMs using HMMER56 based on the phylogeny. Finally, it aligns the fragmentary sequences to the HMMs and selects the one with the best match. Sequences that are 25% longer or shorter than the median sequences were considered as fragments and excluded from the backbone. More specifically, PASTA first builds a starting tree, performs a tree-based clustering of the sequences, and builds a spanning tree from these clusters. Then it calls MAFFT v7.149b57 to align the sequences in each cluster, and calls OPAL58 to merge the alignments of adjacent clusters according to the spanning tree, and finally uses transitivity to perform the subsequent merging.

To ensure the quality of the alignment, we filtered out extremely gappy sites and sequences: sites with >90% gaps were deleted from the alignments, followed by the dropping of sequences with >66% gaps.

Filtering of marker genes

To ensure the quality of the species tree built upon these marker genes59, we filtered out the genes that were not aligned reliably by UPP. As such, the marker genes with >75% gaps in the aforementioned alignments were excluded from the pool, leaving 381 marker genes in total. The threshold 75% was chosen based on the distribution pattern of per-gene alignment quality (Supplementary Fig. 2d).

Filtering of outlier taxa from gene trees

We removed suspected outliers by detecting the taxa on disproportional long branches and filtering them out from the phylogeny inferred by FastTree27. To do this, we applied TreeShrink60 v1.1.0, a method that simultaneously detects long branches on a set of gene trees by identifying a set of taxa that could be removed from each gene so that the gene trees are maximally reduced in diameter. We used FastTree 2.1.9 to infer preliminary gene trees of the 381 selected genes, then ran TreeShrink to detect outlier long branches in these trees, with the per-species test with α = 0.05 (5% false-positive tolerance). Finally, we dropped genomes that contained <100 marker genes post gene tree filtering.

Gene tree reconstruction

Gene tree topologies were reconstructed using the maximum likelihood (ML) method as implemented in the state-of-the-art phylogenetic inference program RAxML 8.2.1026. The best amino acid substitution model for each of the 381 universal marker genes was inferred using RAxML’s built-in script ProteinModelSelection.pl. Three phylogenetic trees were reconstructed for each gene family: one using a starting tree computed by the fast ML approach implemented in FastTree) and the other two using parsimony trees built with random seeds 12345 and 23456. RAxML was executed with the ML search convergence criterion (-D) and the CAT rate heterogeneity model without final optimization (-F) to reduce the execution time.

For each of the 1143 topologies (3 × 381), another RAxML run was executed to optimize branch lengths and to compute likelihood scores under the robust, but expensive Gamma rate heterogeneity model. Because of numerical instability, at least one of the RAxML runs failed for 39 of the 381 gene families. For those cases, IQ-TREE 1.6.161, an alternative and faster maximum likelihood program, was used instead to optimize branch lengths using the same model (G4). The tree with the highest likelihood score among the three runs was retained for downstream applications. In 161 gene families, this tree was from the run with the FastTree starting tree, while in the remaining gene families the best tree was from either one of the random seeds.

Species tree reconstruction by summarizing gene trees (ASTRAL)

A species tree was reconstructed by summarizing the 381 gene trees, using ASTRAL-MP62 (implementing ASTRAL-III algorithm28) 5.12.6a. This analysis was run on the Comet supercomputing cluster using 24 cores and 4 GPU acceleration. In the resulting tree, the branch lengths represent the units of coalescence. Each branch has three support values: (1) effective number of genes (EN): the number of gene trees that contain some quartets around that branch; (2) quartet score (QT): proportion of the quartets in the gene trees that support this branch; (3) local posterior probability (LPP): the probability this branch is the true branch given the set of gene trees (computed based on the quartet score and assuming incomplete lineage sorting (ILS))31.

Branch length estimation for the ASTRAL tree

The branch lengths of a summary tree generated by ASTRAL are in coalescent units and only for internal branches. In order to obtain “conventional” branch lengths, i.e., the expected number of amino acid substitutions per site, we ran IQ-TREE using the concatenated alignment (most conserved or randomly selected sites as described below) as input, the ASTRAL tree as the topological constraint, and the LG + Gamma model. Branch lengths obtained using both site categories were highly correlated (Supplementary Note 2).

Species tree reconstruction based on the concatenated alignment (CONCAT)

The alignments of the 381 marker genes were concatenated into a supermatrix. Due to the computational challenge in running classical maximum likelihood tree reconstruction on the full-scale data set, we had to downsample it to around 38 k amino acid sites. In order to explore the impact of site sampling on tree topology and branch lengths, we separately adopted two strategies for site sampling: (1) selected up to 100 most conserved sites per gene. The degree of conservation was estimated using the “trident” metric63, which is a weighted composition of three functions: symbol diversity, stereochemical diversity, and gap distribution. The PFASUM60 substitution matrix was used for computing the stereochemical diversity64. (2) randomly selected 100 sites per gene from sites with <50% gaps.

For the downsampled supermatrix, a maximum likelihood tree was first built using FastTree, with LG model for amino acid substitution and Gamma model for rate heterogeneity. Using this FastTree tree as the starting tree, plus two maximum parsimony trees generated from random seeds (12345 and 23456), we performed three independent runs using RAxML, with the LG + CAT models (PROTCATLG), with rapid hill­climbing (-f D) and without final Gamma optimization (-F). With the resulting topologies, we performed branch length optimization and likelihood score calculation using IQ-TREE, with the LG + Gamma models (LG + G4). We further performed 100 rapid bootstraps using RAxML to provide branch support values.

Species tree reconstruction based on ribosomal proteins

To test the impact of choice of marker gene set on the topology and relative distances among major taxonomic groups, we conducted a separate analysis in which the species tree was built using ribosomal protein sequences. We identified and extracted 30 ribosomal protein families using the program PhyloSift 1.0.165 with its marker database released on August 8, 2017. If more than one copy of a marker protein was detected in a genome, all copies were discarded. After this filtering, genomes with fewer than 25 marker proteins were dropped from the data set, resulting in a total of 9814 genomes out of the original 10,575. Sequences of each ribosomal protein family were aligned using UPP as described above. The resulting alignments were concatenated and subjected to RAxML tree reconstruction using the LG model for amino acid substitution66 (which is the best model for 304 out of the 381 genes based on RAxML’s model selection) and the CAT model for rate heterogeneity (PROTCATLG). The resulting tree was then fed into IQ-TREE for branch length optimization, with the Gamma model for rate heterogeneity. One hundred rapid bootstraps were executed in RAxML to provide branch support.

The same concatenated alignment was also used to estimate the branch lengths for the ASTRAL tree based on the 381 marker gene trees. Because the quality of an ASTRAL tree improves as the number of gene trees increases (Supplementary Fig. 10a, c), running ASTRAL on only 30 trees of structurally and functionally highly related genes is of limited value. Thus we decided not to run ASTRAL de novo, but only to assess the impact of ribosomal proteins on the branch lengths of the existing ASTRAL tree.

Species tree reconstruction and branch length estimation with CPR taxa excluded

We followed the same protocol as stated above to reconstructed species trees and estimate branch lengths based on the protein sequence alignments with the 1454 CPR taxa removed, leaving 9121 taxa. Only one modification was made to the main protocol in order to reduce the computational expense for reconstructing the 381 gene trees: Instead of running RAxML three times per gene and selecting one tree with the highest Gamma likelihood, we ran RAxML once per gene using the random seed 12345. The two alternative site sampling schemes: most conserved (“cons”) and randomly selected (“rand”) as demonstrated in the main result were both tested, using the same amino acid sites as in the main protocol in each scheme.

Species tree reconstruction using site heterogeneous models (PMSF)

We built alternative CONCAT trees using the posterior mean site frequency (PMSF) method38 implemented in IQ-TREE, which considers mixture classes of rates and substitution models (here the LG model) across sites. Because this method is computationally expensive, we downsampled the 10,575 taxa to 1000 (see below for the taxon downsampling strategy). ModelFinder (which is part of IQ-TREE)67 was used to select an optimal model among the empirical profile mixture models C10 to C6068, plus the site homogenous model (with Gamma rate across sites) as a control. This analysis consistently chose C60 as the optimal model for all tests. Therefore, we used the LG + C60 model for PMSF phylogenetic reconstruction. PMSF requires a guide tree, which we obtained from ModelFinder results. Computational challenge limited this analysis to at most 1000 taxa (which consumed 1.43 TB memory, close to the 1.5 TB physical memory equipped in our high-memory nodes). Branch support values were computed using the ultrafast bootstrap (UFBoot)69 method implemented in IQ-TREE. In parallel to this analysis, we performed phylogenetic inference using the Gamma model ( + G) or the FreeRate70 model (+R) on the same 1000-taxon input data for comparison purpose.

Species tree reconstruction using implicit methods

We applied two implicit strategies for inferring the evolutionary relationships among the sampled genomes. They are not based on the alignment of homologous features across multiple genomes, but instead are based on the predefined distances among genomes. Specifically, they are the Jaccard distances defined by the MinHash signature (see above), and by the presence/absence of the 400 marker genes (see above). The conventional neighbor joining (NJ) method as implemented in ClearCut 1.0.971 was used to reconstruct phylogenetic trees from the two distance matrices, respectively.

Rooting and post-manipulation of species trees

We rooted the species tree at the branch connecting the Archaea clade and Bacteria clade, according to the widely adopted hypothesis of life evolution72,73,74. The absence of Eukaryota does not impact the placement of root, since Eukaryota is considered derived, as a sister group or ingroup of Archaea in this hypothesis. We want to remind readers that this hypothesis is not without controversy75,76. The discovery and study of CPR and other divergent or transitional groups may provide materials for a second examination of this hypothesis, although this is beyond the scope of this study.

Internal nodes were flipped to follow the descending order (i.e., child nodes are sorted from less descendants to more descendants). Incremental numbers were assigned to internal node IDs in a pre-order traversal of the tree starting from the root (i.e., root = N1, LCA of Archaea = N2, LCA of Bacteria = N3, etc.). These node IDs can be used as unique identifiers in downstream analyses and applications.

Phylogeny-based downsampling of taxa

We designed a protocol to downsample taxa from the 10,575 genomes for further phylogenetic analyses. We adopted the relative evolutionary divergence (RED) metric14, as the core of our subsampling strategy. This metric allowed us to select large clades that best represent the deep phylogeny. Specifically, we calculated RED for all nodes (terminal and internal) of the ASTRAL tree (i.e., the tree shown in Fig. 1) using TreeNode functions implemented in scikit-bio 0.5.277. Nodes were selected iteratively from the low end of the RED list, with ancestral nodes (if any) of the current node dropped from the selection at each iteration, until the desired number of clades n was achieved.

Within each selected clade, four criteria were sequentially applied to the descendants until one taxon was selected: (1) contains the most marker genes; (2) contamination level is the lowest; (3) DNA quality score is the highest; (4) random selection (if there were still more than one taxon after applying the other three criteria). This protocol guaranteed the selection of n taxa, which maximize the representation of deep phylogeny.

Visualization and annotation of trees

Unique colors were assigned to selected NCBI-defined taxonomic groups above phylum, and phyla with 100 or more representatives in the sampled genomes. Colors of taxa were directly assigned based on their NCBI taxonomy assignment. Colors of clades and branches were determined based on the tax2tree decoration. The trees were rendered using iTOL v478 (unrooted or circular layouts) or FigTree 1.4.379 (rectangular layout).

Comparison of multiple trees

We used both the classical Robinson–Foulds (RF) metric80 (calculated using scikit-bio’s “compare_rfd” function) and the quartet score (calculated using ASTRAL) to quantify the topological concordance between a pair of trees. Furthermore, we used the “tip distance” (TT), calculated using scikit-bio’s “compare_tip_distances” function, to measure the correlation of the phylogenetic distances among taxa in a pair of trees. It equals (1 − r)/2, where r is the Pearson correlation efficient between the tip-to-tip distance (i.e., total length of branches connecting two tips) matrices of the two trees. Because the two trees might have different sets of taxa, we first truncated them using the “shear” function implemented in scikit-bio so that they both only contained the shared taxa. This enabled the subsequent computation of the three metrics.

For a set of multiple trees (species trees or gene trees), a matrix of the pairwise RF distance, quartet distance (1−quartet score) or tip distance was constructed, based on which subsequent statistical analyses were performed to assess the clustering pattern of trees, as stated below.

Clustering analysis of multiple trees

We used several statistical approaches to assess the clustering pattern of multiple trees based on the RF, tip or quartet distance matrices built as stated above:

1. Hierarchical clustering, using the “linkage” function implemented in SciPy 1.1.081. 2. mMDS, as detailed above. 3. Principal coordinate analysis (PCoA), performed using QIIME 2’s “pcoa” command, and visualized using Emperor. This method aims to visualize the biggest variance in a few dimensions, as compared with mMDS as explained above. 4. Permutational multivariate analysis of variance (PERMANOVA)82, performed using QIIME 2’s “beta-group-significance” command, with 999 permutations (the default setting). This method evaluates the statistical significance of grouping of trees by a certain variable such as method, site sampling and taxon sampling.

Cross-comparison of the ASTRAL and CONCAT trees

The first challenge for this comparison was that the branch support values were estimated using completely different methods (local posterior probability vs. rapid bootstrap) and so are not directly comparable. We manipulated the trees so that they have the same overall resolution: First, we collapsed the low-supported branches in the CONCAT tree (by conserved sites), using the commonly accepted bootstrap threshold: 50. This left 9595 internal nodes. Then we performed branch collapsing to the ASTRAL tree, from the low end of the range of local posterior probability (lpp), until it reached 0.68057, also leaving 9595 internal nodes.

The second challenge was that large-scale trees are difficult to align and to display. We collapsed the two trees so that they have 50 paired clades with at least 50 descendants each. For each pair of clades, the descendants are identical. The remaining tips were pruned. This operation left 7764 taxa in each tree. The sizes of the 50 chosen clades are 155.3 ± 106.9 (mean and standard deviation).

A tanglegram of the resulting collapsed trees was reconstructed using Dendroscope 3.5.983. In our case, the clades were fully aligned. The tanglegram was then rendered back-to-back without the need for displaying the connector lines.

Calculation of the relative Archaea–Bacteria distance

We calculated the phylogenetic distance (sum of branch lengths) between every pair of taxa in a tree using scikit-bio’s “tip_tip_distances” function. The pairs were divided into three groups: A–A, A–B, and B–B (A and B are abbreviations for Archaea and Bacteria). Within each group, the mean distance was calculated. Then the overall relative A–B distance was calculated as: mean (A–B)2/(mean (A–A) × mean (B–B)). Note that due to HGT and other reasons, archaeal and bacterial taxa are rarely perfectly separated in individual gene trees. Therefore, the calculated distance should be interpreted as the average evolutionary distance between archaeal and bacterial genomes, instead of the distance between the two clades.

Test for amino acid substitution saturation

We followed the principle introduced by Jeffroy et al.84 to test for the saturation. Specifically, we wanted to test whether the degree of saturation on inter-domain taxon pairs (Bacteria vs. Archaea) is larger than that on intra-domain pairs. For each domain, 100 taxa were randomly sampled for this analysis. We plotted the phylogenetic distance, i.e., the sum of branch lengths between two tips, as the x-axis, versus the Hamming distance of gap-free sites per each alignment between a pair of sequences, as the y-axis (Supplementary Fig. 19a–d). Because the three categories of taxon pairs have differential distribution on the x-axis, we further binned on the x-axis and performed comparison within each bin (Supplementary Fig. 19e, f).

Phylogenetic analysis with latest genome availability

We made several modifications to the main protocol to reduce the computational expense for this rapid test of the extended set of 10,762 (10,575 + 187) genomes: UPP was called in “insertion” mode to update the existing amino acid sequence alignments. In-house scripts were used to locate the same set of sites instead of performing de novo site sampling. Both ASTRAL and CONCAT methods were used to build species trees. For CONCAT, we used IQ-TREE in “fast” mode to build de novo species trees from concatenated alignments without using a predefined starting tree. For ASTRAL, we kept the same analysis parameters to build a species tree from the 381 gene trees, whereas the gene trees were built as follows to save computation while maintaining high quality:

First, we used the previous gene trees as topological constraints (-g) to incorporate the new taxa using RAxML. Then we used those trees as starting trees (-t) to perform de novo ML searches using RAxML. This way, we only did de novo ML search once instead of three as previously, but we argued that the generated gene trees would have comparable ML score as in the previous procedure. To test this hypothesis, we randomly selected ten genes to generate four trees each: (1) RAxML with FastTree tree as starting tree; (2) & (3) RAxML with random starting trees with two different random seeds; (4) RAxML tree generated using the described procedure. Note that the tree having highest likelihood score among (1), (2), and (3) defines the ML tree in the previous procedure. Our results showed that the gene trees generated by (4) have higher likelihood scores than the best of (1), (2), and (3) in six of ten of the tested genes. Besides, we use a χ2 test to show that the trees (4) have higher chance to be the best tree than (1), (2), and (3). In this test, the null hypothesis H 0 is that (4) has the same chance to be the best tree among the four trees. Applying the test on the ten selected genes, we rejected H 0 with p-value = 0.011.

Divergence time estimation using maximum likelihood

We used the maximum likelihood tool r8s 1.8185 to estimate the divergence times based on the species trees. Specifically, we used the Langley–Fitch (LF) method86, which assumes a universal molecular clock (substitution rate) for the entire tree, with the truncated-Newton (TN) method for optimizing the likelihoods of branch lengths87. A recent study showed that this method has comparable estimation accuracy when benchmarked against the more sophisticated Bayesian framework, but its computation is significantly faster88, thus suitable for the size of our data set. Near-zero branches were collapsed to avoid numerical errors. Ten replicates with random initial conditions were performed for each setting. In each replicate, three restarts were executed after the initial optimization with a random perturbation factor of 5%. Replicates that failed to pass the gradient check were discarded. The divergence times estimated by the run with the highest likelihood score, and the mean and standard deviation of those by all successful runs were reported.

Divergence time estimation using Bayesian inference

We used the Bayesian tool BEAST 1.10.489 to estimate divergence times. Considering the computational expense, we randomly selected 5000 amino acid sites from the full-length alignment, and downsampled the original 10,575 taxa to 100. Taxon sampling was performed using the same RED-guided protocol (see above), but was manually modified afterwards to ensure sufficient sampling around the calibration point. Two alternative molecular clock models were used: the strict clock model, or the uncorrelated relaxed clock model with a lognormal distribution (UCLD)90. The species tree was modeled using a Yule process91, with topology fixed to the ASTRAL tree. Logs of MCMC runs were examined using Tracer 1.7.192. Burn-ins were set to be at least 10% of iterations, or higher depending on the manual observation of traces. Sufficient MCMC iterations were executed to ensure that the effective sample size (ESS) of the reported parameters was at least 150.

Tree-based taxonomic curation and annotation

We used the program tax2tree (commit 99f19be)93 to curate the original NCBI taxonomy94 assignment of genomes based on the phylogenetic trees and to annotate the internal nodes of the tree using most appropriate taxonomic labels. The same program was used to curate multiple databases, such as the classical Greengenes93 and the recent GTDB14. The program took as input the species tree and the original NCBI taxonomy and inferred the most plausible taxonomic annotation at every node of the tree, as determined using an F-measure scoring system across candidate taxonomic terms. In scenarios where one term was estimated to be the best candidate for multiple, independent clades (i.e., para/polyphyly), a numeric suffix was appended to the term to indicate the grouping and order (from more descendants to less) of those clades. For example, Firmicutes_1 is the largest clade assigned to the paraphyletic phylum Firmicutes, followed by Firmicutes_2, Firmicutes_3, etc. Based on the decorated tree, correct taxonomic names were re-generated for unclassified and mis-annotated genomes. Taxonomic groups represented by only one genome in this work were back-filled post tax2tree annotation.

Assessment of cladistic properties of taxon sets

The cladistic property of a taxonomic group (or an arbitrarily defined taxon set) with reference to a species tree was evaluated using three methods:

1. The strict definition of “monophyly”: when a clade contains all genomes assigned to a single taxonomic group and no other genomes, this taxonomic group is considered monophyletic. Further, we identified “relaxed” monophyletic groups compared to the aforementioned “strict” scenario. In the “relaxed” scenario, if a clade consists of genomes assigned to a taxonomic group, and genomes without assignments at the same taxonomic rank (i.e., unclassified), this taxonomic group is still considered monophyletic. 2. tax2tree’s classification consistency score, representing the fraction of tips within that clade relative to the total number of tips in the tree which are of that taxon. Consistency = 1 is equivalent to strict monophyly. 3. The ASTRAL-computed quartet score of this taxonomic group, i.e., the fraction of quartets in the tree that supports this taxonomic group as monophyletic, i.e., separates this taxonomic group from the others. 4. An approach introduced in DiscoVista95 which evaluates and visualizes the compatibility between a given taxon set and a tree with branch support values. It computes a “support” or “rejection” degree as follows: If the taxon set constitutes a monophyletic clade in the tree, it is supported; and the support degree (green) is the support value of the branch connecting the lowest common ancestor of the clade to its parent. On the other hand, if it is not a monophyletic group in the original tree, but after contracting branches with support values below a threshold, the monophyly can no longer be rejected due to polytomy, the lowest threshold is considered the rejection degree (with a negative sign) (magenta).

Evaluation of GTDB taxonomic groups

We downloaded GTDB14 release: 86.1 from http://gtdb.ecogenomic.org/. The format of genome identifier in GTDB was matched to that of our work (e.g., GB_GCA_000123456.1 was translated into G000123456). Following the protocols described above, we evaluated the GTDB phylogeny and taxonomic units, and annotated our species trees using the GTDB taxonomy.

Statistics

Statistical analyses and plotting were performed using Python 3.6 and QIIME 2 release 2017.12. Specifically, PERMANOVA test was performed using QIIME 2’s “beta-group-significance” command. Independent or paired two-sample t test was performed using scipy 1.1.0’s “ttest_ind” and “ttest_rel” commands, respectively. Fisher’s exact test was performed using scipy’s “fisher_exact” function. Linear regressions were performed using scipy’s “linregress” function. The p-value was computed using a two-sided Wald test, in which the null hypothesis was slope = 0. Gaussian kernel density estimations were performed using seaborn 0.9.0’s “distplot” function. Hierarchical clustering was performed using scipy’s “linkage” function. Quantile–quantile (Q–Q) plot was computed using scipy’s “probplot” command. Redundancy analysis (RDA) was performed using vegan 2.4.4’s “rda” and “ordiR2step” commands. Dimension reductions were performed using mMDS implemented in scikit-learn 0.19.2, or PCoA implemented in QIIME 2 (both detailed above). Pairwise distances based on k-mer signatures and on marker gene presence/absence were computed using the Jaccard index (see above). Branch supports in the phylogenetic trees were computed using rapid bootstrap implemented in RAxML 8.2.10, and ultrafast bootstrap implemented in IQ-TREE 1.6.1, and local posterior probability implemented in ASTRAL 5.12.6a (detailed above). Robinson–Foulds (RF) distance and “tip distance” were calculated using scikit-bio 0.5.2. Quartet scores were calculated using ASTRAL.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.