Identifying the microbial taxon or taxa present in complex biological and environmental samples is one of the oldest and most frequent challenges in microbiology, from determining the etiology of an infection from a patient’s blood sample to surveying the bacteria in an environmental soil sample (). Prior to the advent of genomic sequencing technologies, identifying taxa required time-consuming sequential testing of candidates (). The application of metagenomic sequencing is transforming microbiology by directly interrogating the community composition in an unbiased manner, enabling more rapid species detection and the discovery of novel species and reducing reliance on culture-dependent approaches (). The potential application of these technologies to improve diagnostics and in public health settings has also been widely recognized (), and there is extensive ongoing work to overcome the challenges associated with clinical use of these approaches ().

Because metagenomic sequencing produces genomic data from a set of species instead of a pure species isolate, one of the primary challenges in the field is the development of computational methods for identifying all of the species contained in these samples ( Figure 1 ). There are two primary drivers of this computational challenge. First, the widespread use of high-throughput sequencing technologies that generate millions of short sequences (generally 50–200 nt) presents a computational challenge for classifying large numbers of reads in a reasonable time. BLAST (basic local alignment and search tool) is one of the most well-known and commonly used software programs for DNA search and alignment against a database of genomic sequences (). Although BLAST is one of the most sensitive metagenomics alignment methods, it is computationally intensive, making it infeasible to run on the millions of reads typically generated by metagenomic sequencing studies. Second, this challenge is compounded by the exponential growth in recent years of the number of sequenced microbial genomes, meaning that the number of comparisons that need to be performed for new sequencing reads is huge and ever increasing.

Many software tools have recently been developed to taxonomically classify metagenomic data and estimate taxon abundance profiles. For accurate analysis and interpretation of these data, it is important to understand how these different tools, broadly referred to as classifiers, work and how to determine the best approach for a given sample type, microbial kingdom, or application. This includes continually benchmarking the ensemble of tools for the best performance characteristics along multiple dimensions: classification accuracy, speed, and computational requirements. Several groups have previously benchmarked metagenomic tools (), but the continual introduction of newer tools requires ongoing evaluation to compare them against established tools. Here we review the core principles of metagenomic sequence classification methods, describe how to evaluate classifier performance, and use these approaches to benchmark 20 commonly used taxonomic classifiers. To account for database differences and updates between methods, we further compare the performance of these tools on a uniform database, which has not been considered in earlier studies. We also provide recommendations for their use and describe future directions for the expansion of this field.

Marker-based methods typically include in their reference database only a subset of gene sequences instead of whole genomes, normally specific gene families that have good discriminatory power between species. The most widely used single marker gene for bacterial metagenomics is the highly conserved 16S rRNA sequence (), although other markers are needed to identify viruses, fungi, and other microbes that do not have the 16S marker gene. Some marker-based methods, such as MetaPhlAn2, address this limitation by indexing a number of different gene families in its database to identify taxa from other microbial kingdoms (). The use of a subset of genes makes these methods quick; however, the marker sequences used can introduce a bias in the results when they are not evenly distributed among the microbial sequences of interest ().

DNA-to-DNA and DNA-to-protein tools classify sequencing reads by comparison with comprehensive genomic databases of DNA or protein sequences, respectively. DNA-to-protein tools are more computationally intensive than DNA-to-DNA tools because they need to analyze all six frames of potential DNA-to-amino acid translation, but they can be more sensitive to novel and highly variable sequences because of the lower mutation rates of amino acids compared with nucleotide sequences (). DNA-to-protein tools, however, target only the coding sequence of the genome and, therefore, will not be able to classify non-coding sequencing reads.

To generate assignments, classifiers utilize newer algorithmic approaches to ensure that classification speeds are fast enough for even large numbers of sequencing reads. To do so, most tools first seek to reduce the number of candidate hits for processing via approaches such as searching for stretches of perfect sequence matches with reference sequences (k-mers, typically around 31 nt in length) or via an FM index (full-text index in minute space) (). As a result, these methods are typically not as sensitive as BLAST but are designed to be much faster. In addition, they frequently favor more memory usage to reduce CPU usage and, thus, classification time. These tools can be divided into three groups: DNA-to-DNA classification (BLASTn-like), DNA-to-protein (BLASTx-like) classification, and marker-based classification.

Within taxonomic classifiers, a distinction can be made between taxonomic binning and taxonomic profiling. Binning approaches provide classification of individual sequence reads to reference taxa. Profilers report the relative abundances of taxa within a dataset but do not classify individual reads. However, in practice, these methods are often used interchangeably when analyzing metagenomic sequencing data. Although not generated by default, a taxonomic profile can be calculated from binning approaches by summing up the individual read classifications. Taxonomic classifiers should not be confused with a distinct class of assembly-based tools for analysis of metagenomic sequencing data that cluster contigs de novo without the aid of any reference databases, an approach known as reference-free binning (). These tools cannot taxonomically classify sequences and, thus, are not evaluated here but have recently been benchmarked elsewhere ().

A large number of tools have recently been developed that are focused on classifying large amounts of sequencing reads to known taxa with increasing speed. These taxonomic classifiers require pre-computed databases of previously sequenced microbial genetic sequences against which sequencing data are matched.

All classifier tools are distributed with pre-compiled reference databases, the composition of which can vary substantially between classifiers. This can act as a confounder when comparing classification performance across methods. These databases may use entirely different sources for sequence data, or, even when they share a common source for sequences (e.g., RefSeq), continual updates and addition of new sequences will mean databases created at different times will have different content. Most tools also allow a user to build his or her own database based on a desired set of sequences. This is a computationally intensive process, especially for comprehensive databases, but affords the user greater control over the analysis, especially when investigating rare, recently discovered, or highly diverse species. Given the complexity of both the test samples and reference databases in metagenomic classification, it is further important to perform comparisons using a uniform database to eliminate any confounding effects of differences in default database compositions.

The universe of microbial sequences is very diverse, and these resulting databases are fairly large, typically requiring 10–100 s of gigabytes. This vast search space can also result in a significant number of false positive classifications because of the large number of possible taxa against which the sequences are matched. Additionally, the large universe of presently undiscovered microbial species can result in false negative classifications simply because the genetic sequences have never been categorized in a database before. Recent efforts to expand the number of known microbial genomes have highlighted the improvement in the proportion of reads classified compared with older databases () but must be balanced with the challenges of handling larger databases.

All metagenomics classifiers require a pre-computed database based on previously sequenced microbial genetic sequences whose sheer size presents a considerable computational challenge. The most popular reference databases are RefSeq complete genomes (RefSeq CG) for microbial species as well as the BLAST nt and nr databases for high-quality nucleotide and protein sequences, respectively, from all kingdoms of life, with ∼50 and ∼200 million sequences, respectively, as of 2019. Other databases include SILVA for 16S rRNA, with ∼2 million sequences, and GenBank for a larger quantity of genomes with lower quality control standards ().

Metrics should also be tested across many datasets because classifiers may perform better or worse on certain species or sample types. Last, other features—such as classification speed, memory usage, and output format—may also influence the choice of classifier and should also be considered in any thorough evaluation.

The L2 distance should be considered as a representation of the abundance profiles. Because metagenomic abundance profiles are proportional data and not absolute data, it is important to remember that many common distance metrics (including L2 distance) are not true mathematical metrics in proportional space (). Generally, in proportional data analysis, a common method is to normalize proportions by using the centered log-ratio transform to calculate distances. However, the output of these metagenomic classifiers includes many low-abundance false positives, leading to sparse zero counts for many taxa across the different reports. The log-transform of these zero counts is undefines unless arbitrary pseudocounts are added to each taxa, which can negatively bias accurate classifiers because false positive taxa will have added counts. Another commonly used metric to compare abundance profiles is the UniFrac distance, which considers both the abundance proportion of component taxa as well as the evolutionary distance for incorrectly called taxa (). However, using this metric is complicated by the difficulty in assessing evolutionary distance between microbial species’ whole genomes.

To evaluate the accuracy of abundance profiles, we can calculate the pairwise distances between ground-truth abundances and normalized abundance counts for each identified taxon at a given taxonomic level (e.g., species or genus). For this, we calculate the L2 distance for a given dataset’s classified output as the straight-line distance between the observed and true abundance vectors ( Figure 2 ). We can also use this measure to compare abundance profiles between classifiers by instead computing L2 distances between classified abundances for pairs of classifiers. Abundance profile distance is more sensitive to accurate quantification of the highly abundant taxa present in the sample (). High numbers of very-low-abundance false positives will not dramatically affect the measure because they comprise only a small portion of the total abundance. For this reason, using such a measure alongside AUPR, which is highly sensitive to classifiers’ performance in correctly identifying low-abundance taxa, allows comprehensive evaluation of classifier performance.

In addition to considering the number of correctly identified species, it is also important to evaluate how accurately the abundance of each species or genera in the resulting classification reflects the abundance of each species in the original biological sample (“ground truth”). This is especially critical for applications such as microbiome sequencing studies, where changes in population composition can confer phenotypic effects (). Abundance can be considered either as the relative abundance of reads from each taxa (“raw”) or by inferring abundance of the number of individuals from each taxa by correcting read counts for genome size (“corrected”). Some programs incorporate a correction for genome length into abundance estimates; this calculation can also be manually performed by reweighting the read counts after classification. Here we use raw abundance profiles unless correction is performed automatically by the software, as in the case of PathSeq and Bracken.

To better assess precision and recall scores across all abundance thresholds, it is preferable to use a precision-recall curve, where each point represents the precision and recall scores at a specific abundance threshold ( Figure 2 ). By ranging the abundance threshold from 0–1.0, the area under the precision-recall curve (AUPR) outputs a single metric to aggregate precision and recall scores (). It should be noted that precision and recall focus only on the positive class of identified taxa. Performance metrics that require the calculation of false negatives, such as ROC (receiver operating characteristic) curves, are less informative in this context because false negatives are poorly defined in real-world metagenomic samples. A potential drawback of AUPR is that it is biased toward low-precision, high-recall classifiers. Classifiers that do not recall all of the ground-truth taxa are penalized with zero AUPR from the highest achieved recall to 100% recall. For classifiers that do reach 100% recall, additional false positive taxon calls do not further penalize the AUPR score.

AUPR and L2 distance are two complementary metrics that provide insight into the accuracy of a classifier’s precision-recall and abundance estimates, respectively. Considered together, they provide a readily interpretable picture of classifier performance and can be used to compare classifiers.

The metrics selected to benchmark classifiers can greatly influence their relative rankings and performance and, thus, must be carefully selected to best reflect the way these tools are used in practice. The most important metrics for metagenomic classification are precision and recall. Precision is the proportion of true positive species identified in the sample divided by the number of total species identified by the method, whereas recall is defined as the proportion of true positive species divided by the number of distinct species actually in the sample. These measures and derived metrics are commonly used across benchmarking studies (). The F1 score is the harmonic mean of recall and precision, weighting them equally in a single metric. However, because end users will often filter out taxa below a certain abundance threshold, using a single raw precision, recall, or F1 score does not provide a realistic estimate of classifier performance.

To more thoroughly test the effect of poorly characterized, divergent, and novel sequences on classification, we additionally evaluated the performance on the Critical Assessment of Metagenome Interpretation (CAMI) datasets ( Supplemental Information ). Although these datasets are also simulated based on short DNA sequencing reads, they are comprised of dramatically different taxon profiles. For these datasets, only ∼30%–40% of reads are simulated from known taxa, whereas the rest of the reads are from novel taxa, plasmids, or simulated evolved strains. The AUPR of the CAMI samples was very low at the species rank, but performance improved significantly at the genus classification rank ( Figure S4 ). This is due to the small numbers of taxa for which there is a ground truth specified at the species rank as well as the confounding effects of large quantities of sequences from unknown taxa, which is also confirmed by the low classified abundances ( Figure S5 ). Performance was also broadly similar between classifiers of each class and became even more similar when using the normalized RefSeq CG databases.

To evaluate the importance of the reference database in taxonomic classifications, we constructed databases for each method from a normalized set of microbial RefSeq CG from the archaeal, bacterial, and viral kingdoms. To create the corresponding protein-based databases, we used the protein sequences of these complete genomes. This custom-standardized database is referred to as RefSeq CG. We did not include fungal, protozoan, or other eukaryotic microbial genomes in RefSeq CG because these domains have relatively few complete genomes. Not all methods supported custom database construction ( Table 1 ), so those classifiers could not be assessed. Overall, classifiers’ AUPR scores on the RefSeq CG databases were slightly worse than on their own default databases ( Figure 3 B), with the difference being most pronounced for the protein-based methods. The protein coding subset of RefSeq CG is both smaller than the default nr protein databases and smaller than the set of whole genomes indexed by DNA classifiers. For many of the DNA classifiers, the performance between classifiers was much more similar using the RefSeq CG database than with their own databases, indicating that these methods behave fairly similarly. Inherent database differences contributed more variation to the performance differences between DNA classifiers. Centrifuge’s score improved because the RefSeq CG database was not lossily compressed, in contrast to the default compressed database.

Database limitations negatively affect a few classifiers: MetaPhlAn2 and mOTUs2, with limited marker-based databases, naturally had lower performance. Centrifuge’s default database is significantly lossily compressed (throwing away database sequences to save space), which also results in poor recall. Karp’s recommended database of microbial rRNA sequences is too limited to perform well, and it would be unfair to draw further conclusions from its much smaller database. We additionally excluded GOTTCHA because it has known L2 abundance estimation biases (). Additional benchmarking results for GOTTCHA and Karp are available in Figures S1–S6 , but otherwise they were excluded from further analysis.

Two samples had notably low AUPR scores across multiple classifiers when using default databases. The buccal (oral microbiome) simulated dataset was a consistent low-scoring outlier among the DNA classifiers. This dataset is relatively small and only contains 12 different species, of which one species, Neisseria subflava, was a contig-quality assembly. This lower-quality assembly is not included in most DNA databases but is in the nr database. The simulated ATCC Staggered dataset was a consistently poorly performing sample for the protein-based classifiers because it contains taxa at very low abundances, making false positive classifications more likely, especially when using the expansive nr database.

Most methods performed well in terms of AUPR, with scores above 0.8 at the species level ( Figure 3 A) using their default databases. For most methods, there is a gradual tradeoff between precision and recall, as expected ( Figure S2 ). Most of the performance falloff for AUPR came from a decline in precision at high recall. This is likely due to false positive taxa being classified at low abundance ahead of the lowest abundance true positive taxa. Genus-level performance tended to be slightly better than species-level classifications ( Figure S3 ). In this study, we do not consider classifications below the species level. Sequence similarity between strains and other challenges mean that it is generally preferred to use specialized approaches for accurate strain typing ().

(A) Area under the precision-recall curve (AUPR) scores for each classifier at the species level (a higher value is better). Each plot point represents the score for a (classifier, dataset combination). Classifiers are grouped and colored by their target class.

To first evaluate the precision and recall of each method, we ran all 20 classifiers using a set of 10 benchmarking datasets composed of computationally simulated reads from between 12 and 525 bacterial species ( Supplemental Information ) and generated a report of identified taxa and their corresponding abundance proportions within the sample. Each classifier was run using default parameters with the exception of adding mild filtering for MegaBLAST and Kraken ( Supplemental Information ). Output reports were then reformatted to allow comparison between classifiers at different taxonomic levels, and for each classifier report, we calculated precision and recall ( Figure S1 ), generated the precision-recall curve, and computed the AUPR ( Figure S2 ). Median AUPR was then calculated from the AUPR values of all datasets for a given classifier as described in Figure 2

Here we benchmarked 20 metagenomic classifiers to compare performance in classification precision, recall, F1, speed, and other metrics using a uniform database to eliminate any confounding effects of differences in default databases. DNA-to-DNA classifiers evaluated here were Kraken (and its add-on for more accurate abundance quantification, Bracken), Kraken2, KrakenUniq, k-SLAM, MegaBLAST, metaOthello, CLARK, CLARK-S, GOTTCHA, taxMaps, prophyle, PathSeq, Centrifuge, and Karp. DNA-to-protein classifiers evaluated were DIAMOND, Kaiju, and MMseqs2. We also evaluated the marker-based methods MetaPhlAn2 and mOTUs2. A more detailed description of each classifier’s qualitative characteristics is provided in Table 1 and Table S1 . To evaluate classifier performance controlling for database differences, a uniform database was created, when possible, based on RefSeq CG and benchmarked for each method alongside the default database. We considered the precision and recall across a range of abundance thresholds as well as overall abundance profiles as our primary benchmarking metrics.

“Custom databases” refers to the ability for the end user to create a custom database. The time and memory requirements are for a 5.7 million-read dataset with the database and input already cached in memory. Some methods (marked as “varies”) have the ability to flexibly decrease their memory usage (at the cost of a massive increase in run time).

The latest version of PathSeq now allows the user to create and specify a custom database, but this option was not available when benchmarking studies were performed; thus, it was excluded from those analyses.

a The latest version of PathSeq now allows the user to create and specify a custom database, but this option was not available when benchmarking studies were performed; thus, it was excluded from those analyses.

The AUPR and abundance profile distances only consider the classified proportion of each sample. These metrics do not assess the proportion of reads left unclassified or those classified to a very broad taxonomy node, such as “all bacteria,” which are less informative when describing the sample composition. Classifiers vary greatly in the proportion of reads they classified at the species rank ( Figure 5 ). Classified proportions further varied greatly between sample datasets with the same classifier. Protein classifiers had many more unclassified reads, as expected, because of the databases only containing coding regions of the genome. It should be noted that Bracken, although measured as having the highest proportion of reads classified, does not actually classify the reads directly. Instead, it reassigns the unclassified proportion based on probabilistic estimates of the true abundance profile from the original classified Kraken report.

To assess the similarities of the classifiers between each other, we performed hierarchical clustering using the median L2 distances calculated between pairs of classifiers across all datasets. The k-mer-based methods cluster very closely together, with MegaBLAST and prophyle falling slightly outside of the tight main cluster ( Figure 4 C). The protein classifiers cluster together, although their profiles are quite different from each other. As expected, the marker-based methods MetaPhlAn2 and mOTUs2 have abundance estimates that are very different from both DNA and protein classifiers. PathSeq is very divergent from other DNA methods. This may be because reads matching multiple taxa cannot be disambiguated and, thus, may be counted multiple times. It also internally corrects the taxon abundance by genome lengths, whereas we evaluate based on raw read abundance.

DNA-based classifiers that used long k-mers (>30 nt), such as Kraken and taxMaps, were among the best-scoring methods, with typical average L2 distances from the truth below 0.1 ( Figure 4 A). The marker-based and, to a lesser extent, protein classifiers had higher L2 distances, demonstrating that they performed less well ( Figure 4 A). Genus abundance distances were generally lower than species abundance distances ( Figure S6 ). Additionally, using the standardized database of RefSeq CG did not change abundance distances much compared with default databases ( Figure 4 B), with the notable exception of Centrifuge, which again improved dramatically when used with a database that is not lossily compressed. Notably, Bracken—a post-processing step intended to improve abundance estimates by Kraken—does provide more accurate abundances at the species level ( Figure 4 A; Figure S6 ). This step is a simple and worthwhile addition to Kraken for better abundance estimates.

We next evaluated the accuracy of the estimated abundance profiles across classifiers compared with the ground truth. For methods that did not directly output an abundance profile ( Table 1 ), this was calculated as described in the Supplemental Information . We then calculated the L2 distance between classified and true abundances of all species ( Figure 4 ) and considered the average L2 distance across all datasets for a classifier as representative of the accuracy of that classifier.

(A) Distance between the species abundance profile for each classifier compared with the true composition (a lower value is better). Each plot point represents the L2 distance for a (classifier, dataset) combination. Classifiers are grouped and colored by their target class.

Another source of inaccuracy in most experimental protocols is PCR duplication, and removal of these duplicates is a challenge across classifiers. The NTC sample contained 75% duplicate reads compared with 15% for RNA and 0.01% for DNA, as measured by cd-hit-dup (cluster database at high identity with tolerance). The high level of sequence duplication was expected for the NTC because it required a higher number of PCR amplification cycles to achieve the same DNA library concentrations as the real samples; however, this is an issue for all amplified metagenomic samples. Most metagenomic classifiers do not remove these PCR duplicates as part of their classification, but this is an important step to generate more accurate abundance profiles, especially in cases where many cycles of amplification were performed as part of the sample preparation.

Compared with the simulated samples, the real-life sequenced samples exhibited worse AUPR performance with a more significant decline for the total RNA samples ( Figure S9 ). The protein and marker-based classifiers overall had more significant performance decreases compared with the DNA classifiers. Counterintuitively, the protein-based classifiers did markedly worse on the RNA samples. However, much of the RNA content from the total RNA is not mRNA coding for proteins but, rather, contains rRNA sequences that are not translated.

Abundance estimates were more inaccurate for the sequenced DNA and RNA datasets compared with simulated DNA ( Figures 6 and S8 ). The microbiome standards initially contained an even 5% abundance for all 20 species, but the experimental protocols used to go from whole-cell material to final sequencing reads likely biased the original abundances in a reproducible way. Steps such as PCR amplification tend to bias toward certain DNA library sequences, altering the original abundance compositions (). Although the effect of PCR can be mitigated to some extent by removal of duplicate sequences, this is not standard practice in most metagenomic classifiers. Additionally, a non-template control (NTC) was prepared and sequenced from ultrapure nucleic acid-free water. The NTC classifier reports also contained high numbers of false positives, unlike what would be expected for a completely clean sample containing no genetic material ( Figure 6 ). It is important to note that the NTC sample contained a significant amount of human DNA sequences that are likely the result of contamination (∼70%), as measured by Kraken, but aside from human, most classifiers falsely classified less than 20% of the NTC abundance.

The problem of low-abundance false positives is not captured well by the AUPR and abundance distance metrics. The AUPR curve concludes after the lowest-abundance true positive is found, so further false positives do not decrease the AUPR score, whereas the total abundance sum of these false positives is small and does not have much effect on L2 distance. Compared with the simulated DNA, the sequenced DNA exhibited a similar number and growth of false positives at lower abundances ( Figure 6 ). This implies that false positive calls are mostly the result of computational false positives instead of the sample degradation or contamination from working with wet-lab reagents and samples. In contrast, sequenced RNA tended to have faster false positive growth at higher abundances compared with the DNA datasets, making it more difficult to distinguish ground truth from incorrect species. This is probably due to the heavy bias of RNA content toward the highly conserved bacterial rRNA sequences, which, by mass, are typically much more abundant than mRNA in whole cells and are very similar between different bacterial species. However, the total number of false positives was typically lower than in DNA datasets, which is probably due to a shallower sequencing depth for the RNA samples making extremely low abundance species (∼1 per million reads) stochastically disappear.

To distinguish incorrect classifications from contamination introduced during sample processing and sequencing, classifier outputs from simulated data were compared against sequencing data from real in vitro samples. We prepared and sequenced genomic DNA and total RNA from the ATCC Even, a metagenome standard containing 20 bacterial species, each at 5% abundance, and compared it with the simulated DNA from the same standard ( Supplemental Information ). For the simulated DNA, after the 20 ground truth species were recalled, additional false positive species started appearing below 0.5% abundance for a few methods, whereas most classifiers called false positives below 0.01% abundance ( Figure 6 Figure S8 ). The number of false positives ranged from tens (Bracken and MetaPhlAn2) to thousands (Centrifuge, CLARK, Kaiju, MMseqs2, and PathSeq) of distinct species, depending on the methods, although some methods leveled off at a certain abundance threshold, notably the marker-based methods, whereas others continued to call more and more false positives at lower abundances. These characteristics can be used to make an informed decision about the appropriate abundance level cutoff for post-filtering of classifications.

False positive classifications present a major challenge for the interpretation of metagenomic sequencing data, especially when considering human clinical samples (). This can originate from several sources and could be influences by experimental and analytical factors. At the classification stage, the absence of a species from the reference database can result in misclassification of reads from that species. The clearest example of this is host genomic content, especially because most default reference databases do not explicitly include the human genome. Using simulated genomic reads from the human hg38 reference genome, we found that 1%–5% of human reads were incorrectly classified for most DNA classifiers ( Figure S7 ). Notably, classifiers that index the human genome in their default database, such as KrakenUniq and Centrifuge, had negligible rates of misclassification ( Figure S7 ). The protein-based classifiers had higher misclassification rates, ranging from 5%–15% misclassified abundance, partially because of the larger number of sequences in the default nr databases. Including known or suspected host genomes in a reference database is thus a first important step to reducing false positive classifications.

Storage of the database in memory can reduce overall run time when processing multiple samples. Some methods, such as Kraken and metaOthello, processed the second sample faster after loading the database into memory for the first sample ( Figure 7 A). Even though only CLARK(-S) explicitly supports batch processing of multiple samples, consecutive back-to-back program executions mean that much of the database is still implicitly cached by the operating system and has not yet been evicted from memory. However, not all methods improved during the second run, indicating that their databases were not cache friendly ( Figure 7 A). The disparity in run time between a cached and uncached database can be up to 100×, so careful use of caching and batch processing can result in major performance improvements.

A significant portion of the classification time may reflect the time required to load the reference database into memory. This is especially true of DIAMOND and MMseqs2, which do not index the database during creation but while the classification is actually performed, resulting in fast database creation but higher classification times. All other methods index the database during creation, which steeply increases the time required to create a database to several hours ( Figure 7 C) but reduces the loading time each time a classification is performed.

The computational resources and time required for running a metagenomic classifier are varied and can be very high ( Figures 7 A and 7B ). Apart from the marker-based methods, all of the classifiers used a significant amount of memory that ranged from tens to hundreds of gigabytes. The marker-based methods also had the fastest initial classification time, using just a few minutes per million reads ( Figure 7 A). Among DNA-to-DNA classifiers, the FM index and alignment methods typically used much less memory than those relying on large databases of long k-mers; however, they had a higher run time after database caching because of the pairwise alignments they perform. Interestingly, CLARK-S has a much higher run time than CLARK, even after database caching, without significantly improving accuracy. Among the DNA-to-protein classifiers, MMseqs2 required extremely high amounts of memory and computational time, whereas Kaiju was fairly quick because of the use of an FM index. One reason for slow run times is that DNA-to-protein methods require querying six frame translations of the query DNA sequence, resulting in a significant increase in run time compared with DNA methods, which becomes even more pronounced with increasing numbers of input sequences. These methods also typically use the more comprehensive BLAST nr database as their database, which includes proteins from non-microbial sources, whereas DNA-based classifiers tend to only incorporate microbial sequences.

(C) Time taken and memory used to create the RefSeq CG database using various methods. Classifiers are sorted by increasing time taken. MMseqs2 and DIAMOND do not index the genomes during database construction but, rather, index on the fly during sample classification.

(A) Time required to process a sample containing 5.7 million reads versus a second run immediately after the first. This second run is faster for many classifiers because sample reads and database files are cached in memory. Bracken is not plotted because it requires negligible time and memory.

Discussion

Advancements in the field of metagenomic sequencing analysis have produced a suite of taxonomic classifiers that perform well across a range of datasets, meaning that other preferences or constraints should also guide method selection. The classifiers investigated here had similar performance, especially between classifiers of the same type on the same dataset. Broadly, DNA classifiers provide better precision-recall and abundance estimates than protein-based classifiers when using a uniform database for these whole-genome datasets. This lower specificity of protein classifiers on the normalized RefSeq CG database can be largely attributed to the absence of non-coding sequences from their databases. Indeed, differences in default database compositions accounted for greater performance differences between classifiers than the methods themselves. This similarity in performance means that other factors, such as the computational resources available as well as the ease of use and other particulars of each classifier, should be considered when deciding on the most appropriate tool.

The computational resource requirements of some programs may be prohibitive for some applications, so using classifiers that require lower memory incurs a tradeoff between computational cost and accuracy. Among DNA classifiers, when a server with high amounts of memory (>100 Gb) is available, Kraken and its derivative tools Bracken, KrakenUniq, and Kraken2 provide good performance metrics, are very fast on large numbers of samples as long as the database loading time is amortized, and allow the creation and usage of custom databases. CLARK is a good alternative; however, combination of Kraken and Bracken tends to have slightly more accurate abundance profiles. If high amounts of memory are not available, then MetaPhlAn2 is recommended for having very low computational requirements (<2 Gb of memory) as well as fast classification speed but does not allow custom databases. If custom databases are desired, then Centrifuge may be useful; it requires 10 s of gigabases of memory and demonstrates good performance metrics despite the shortcomings of the compressed default database. Among the protein-based classifiers, Kaiju is generally recommended for having much faster classification speed and lower memory requirements compared with the other classifiers DIAMOND and MMseqs2 without compromising performance. All three of these methods allow custom databases.

Although most of the classifiers included in this study performed well across the benchmark datasets containing known taxa, there are several areas in which more development is needed to improve these tools. This includes both simple technical changes to existing methods as well as conceptual innovations in how these methods are used that could improve the quality of the results generated by these classifiers.

McIntyre et al., 2017 McIntyre A.B.R.

Ounit R.

Afshinnekoo E.

Prill R.J.

Hénaff E.

Alexander N.

Minot S.S.

Danko D.

Foox J.

Ahsanuddin S.

et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Sczyrba et al., 2017 Sczyrba A.

Hofmann P.

Belmann P.

Koslicki D.

Janssen S.

Dröge J.

Gregor I.

Majda S.

Fiedler J.

Dahms E.

et al. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Forster et al., 2019 Forster S.C.

Kumar N.

Anonye B.O.

Almeida A.

Viciani E.

Stares M.D.

Dunn M.

Mkandawire T.T.

Zhu A.

Shao Y.

et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Almeida et al., 2019 Almeida A.

Mitchell A.L.

Boland M.

Forster S.C.

Gloor G.B.

Tarkowska A.

Lawley T.D.

Finn R.D. A new genomic blueprint of the human gut microbiota. Nayfach et al., 2019 Nayfach S.

Shi Z.J.

Seshadri R.

Pollard K.S.

Kyrpides N.C. New insights from uncultivated genomes of the global human gut microbiome. Pasolli et al., 2019 Pasolli E.

Asnicar F.

Manara S.

Zolfo M.

Karcher N.

Armanini F.

Beghini F.

Manghi P.

Tett A.

Ghensi P.

et al. Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle. Classifiers perform very well when taxa in a sample are genetically distinct from each other and genetically similar to sequences in the reference database. Although not evaluated here, previous studies have shown that classification below the species level is challenging with current taxonomic classifiers (). The dramatically poorer performance of all classifiers on the CAMI datasets compared with the International Metagenomics and Microbiome Standards Alliance (IMMSA) datasets further underlines the influence of evolutionary distance and poorly described taxa on classification performance. Relative classifier performance was largely consistent, indicating that these problems are common across existing tools. Expanding reference databases can improve classification (); however, for datasets as divergent or complex as CAMI, where only a small proportion of abundance is able to be classified at the species or genus level ( Figure S6 ), the taxa present are likely not known or very different from taxa in the reference database. De novo assembly of metagenomes may be a more appropriate strategy for analysis of such highly complex and novel metagenomic samples. These tools have recently identified thousands of novel metagenome assembled genomes in the human microbiome (

Blauwkamp et al., 2019 Blauwkamp T.A.

Thair S.

Rosen M.J.

Blair L.

Lindner M.S.

Vilfan I.D.

Kawli T.

Christians F.C.

Venkatasubrahmanyam S.

Wall G.D.

et al. Analytical and clinical validation of a microbial cell-free DNA sequencing test for infectious disease. Miller et al., 2019 Miller S.

Naccache S.N.

Samayoa E.

Messacar K.

Arevalo S.

Federman S.

Stryke D.

Pham E.

Fung B.

Bolosky W.J.

et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. One of the biggest performance challenges for many classifiers is that they often report large numbers of low-abundance false positives, lowering the precision of these estimates. Most practitioners filter these reports using a given abundance threshold. Integration of this within software packages would improve classifier precision and standardization and simplify downstream analyses. Another possibility for improving performance would be performing per-read filtering based on alignment score or e-value. Some classifiers have options that allow this but do not have default recommendations for using them. These values would depend on classifier support and would vary between classifiers and even experimental designs, which could make them difficult to use. In addition, practices such as including the mammalian and/or host genomes and potential contaminants in reference databases can help reduce misclassification because of missing reference sequences. The human genome is included in the default databases of several classifiers (Centrifuge, DIAMOND, Kaiju, KrakenUniq, MegaBLAST, and MMseqs2) and can be included in a custom-built database with software that permit this. When the host is known, pre-filtering of host-derived sequences can also be used ().

More generally, making full use of information available about sequenced reads, including incorporating sequence quality scores and PCR duplicate removal, common practices in other areas of sequence analysis, could further improve classification. Apart from Karp, most metagenomic methods do not consider sequencer quality scores for classification. Using quality scores in a probabilistic context would increase the computational cost of classification but could be a fruitful avenue for future research. Similarly, most classifiers generate profiles simply based on proportions of classified reads. Of the methods evaluated here, only Bracken takes a probabilistic approach to generate the final abundance profiles. Another common issue in low-input metagenomic sequencing is high PCR duplication rates of sequencing reads resulting from high numbers of PCR cycles. This can distort abundance estimates because certain taxa become dominant via high duplicate numbers. Some effort has been made by methods like KrakenUniq to estimate the unique nucleotide content of each taxon to better determine the duplication level and recover true abundances.

Sinha et al., 2017 Sinha R.

Stanley G.

Gulati G.S.

Ezran C.

Travaglini K.J.

Wei E.

Chan C.K.F.

Nabhan A.N.

Su T.

Morganti R.M.

et al. Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing. Gu et al., 2016 Gu W.

Crawford E.D.

O’Donovan B.D.

Wilson M.R.

Chow E.D.

Retallack H.

DeRisi J.L. Depletion of Abundant Sequences by Hybridization (DASH): using Cas9 to remove unwanted high-abundance species in sequencing libraries and molecular counting applications. Zinter et al., 2019 Zinter M.S.

Mayday M.Y.

Ryckman K.K.

Jelliffe-Pawlowski L.L.

DeRisi J.L. Towards precision quantification of contamination in metagenomic sequencing experiments. In addition to PCR duplication, there are a number of other features of the experimental preparation of samples for metagenomic sequencing that can confound classification performance. Cross-talk between multiplexed samples on Illumina flow cells (especially notable on more recent sequencers that use patterned flow cells) can generate biological false positives at low abundances (). Library molecules can bleed through multiple runs on the sample sequencers. Samples can also pick up biological contamination from microbial sequences in the laboratory environment or genetic material present in extraction and library preparation reagents. These contaminants may confound benchmarking attempts using real sequenced datasets; however, this is part of a larger challenge of metagenomic classification that needs to be addressed. Specifically, although these are “true positives” from the perspective of the classifier because they were present in the final sequenced sample, they confound conclusions and inferences drawn from metagenomic classifiers. Clean laboratory practices as well as recent innovations to experimentally remove routine contaminants () or using artificial sequences to quantify contamination () can reduce this problem but are unlikely to be able to completely remove it.

Davis et al., 2018 Davis N.M.

Proctor D.M.

Holmes S.P.

Relman D.A.

Callahan B.J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. McLaren et al., 2019 McLaren M.R.

Willis A.D.

Callahan B.J. Consistent and correctable bias in metagenomic sequencing measurements. The use of well-matched negative controls in any metagenomics study is essential to be able to identify and control contamination. Although use of negative controls has recently improved in metagenomics studies, there is currently no accepted approach for correcting for contamination using these controls (). One interesting avenue for exploration is the use of negative control samples to create blacklists of known contaminants.

McIntyre et al., 2017 McIntyre A.B.R.

Ounit R.

Afshinnekoo E.

Prill R.J.

Hénaff E.

Alexander N.

Minot S.S.

Danko D.

Foox J.

Ahsanuddin S.

et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Piro et al., 2017 Piro V.C.

Matschkowski M.

Renard B.Y. MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling. Bazinet et al., 2018 Bazinet A.L.

Ondov B.D.

Sommer D.D.

Ratnayake S. BLAST-based validation of metagenomic sequence assignments. Jiang et al., 2017 Jiang Y.

Wang J.

Xia D.

Yu G. EnSVMB: Metagenomics Fragments Classification using Ensemble SVM and BLAST. Yang et al., 2014 Yang Y.

Jiang X.-T.

Zhang T. Evaluation of a hybrid approach using UBLAST and BLASTX for metagenomic sequences annotation of specific functional genes. Thinking more broadly, one proposed strategy to address the individual shortcomings of different classifiers is to use an ensemble of classifiers or a pipeline of different classifiers (). For pipelines, typically a faster method is run first, and then a slower but more sensitive method is run on the leftover unclassified or poorly classified reads (). This is sometimes performed in a more ad hoc fashion using protein classifiers as a second-pass analysis option for DNA classifiers because it offers higher sensitivity to mutations and more distantly related proteins but trades lower specificity in response (). The downside of ensemble classifiers is a higher computational run time and a more challenging interpretation of the results.

Run time is a recurring challenge for many of these methods, and although some methods use approaches that either reduce the run time or memory use, many classifiers have run times that are dictated by the disk speed of reading their databases into memory. After loading the database, classifying each incremental sample is extremely fast for methods such as Kraken and CLARK. Not all methods support classifying multiple samples in a single execution to explicitly take advantage of database caching, but their database pages might by implicitly cached by the operating system for faster single-sample executions. Some methods, such as DIAMOND and MMseqs2, perform large in-memory sorts that spill out temporary files on disk. These methods are faster, with larger amounts of memory and fast temporary disks.

Nasko et al., 2018 Nasko D.J.

Koren S.

Phillippy A.M.

Treangen T.J. RefSeq database growth influences the accuracy of k-mer-based species identification. The rapid growth of reference databases will present a fundamental challenge to the field in coming years as this pace exceeds that of computer memories. This will likely create a number of secondary challenges to which current software will need to adjust. The growth of reference databases such as RefSeq can result in methods changing performance characteristics over time, as exemplified by the decrease in accuracy of k-mer-based methods (). Many of these newly sequenced genomes are re-sequenced microbial strains that are very similar to existing species, resulting in oversampling of some genera and species groups. NCBI’s taxonomy database is frequently changing, with taxon nodes being constantly renamed, deleted, merged into other nodes, promoted, and/or demoted from taxonomic ranks. There are some ways that software programs can mitigate taxonomy changes, such as including the full versioned taxonomy database tree for hosted pre-built metagenomic databases. However, these changes do not address the underlying challenge presented by the size and growth of these databases. Managing this data explosion is likely to be one of the biggest challenges in metagenomic sequence classification in the coming years and will require more wide-reaching changes in our approach.

The field of metagenomics is approaching a critical milestone in its trajectory. Investment in recent years has led to the development of a range of programs with good overall performance, giving users a choice of ways to analyze their data in accordance with their particular question, computational environment, target taxa, and other preferences. This is making the analysis of metagenomic sequencing more accessible than ever before. However, many taxonomic classifiers are still burdened by high numbers of false positive calls at low abundance that need to be addressed. Looking beyond this, bigger breakthroughs will be needed in many areas, including in controlling experimental sources of contamination and error and in handling the exponential growth of reference databases, to create transformational changes in metagenomic classification toward microbial detection and characterization.