Our study compares the efficacy of the two NGS strategies used for eDNA studies (amplicon vs. shotgun) over one of the largest datasets of environmental samples to date. We found the amplicon approach was far more discerning in almost all respects, contrasting with general dogma in the field and with all but one of the eleven empirical studies in Table 1 that compared these strategies. In that one exception, unlike in our study, the difference arose primarily from issues with fungal rDNA recovery and deficient databases, rather than from the systematic biases of the method12.

Our study showed only weak correlation between the two methods, indicating that although taxonomic overlap exists at both the phylum and family levels, the methods are substantially different. Fewer than half of the phyla identified from amplicons were found with shotgun, whereas almost all of the phyla recognized by the shotgun approach were also recognized by the amplicon approach. About 30% fewer families were identified from shotgun. This superior performance from amplicons comes despite the amplicon dataset having <1% of the total reads produced from shotgun. The amplicon results were also far more consistent with prior research on the biodiversity of freshwater systems (Fig. 2). In addition, Procrustes tests on the NMDS ordinations indicated only weak correlation in community composition between the two sequencing strategies.
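The asymmetry described above (few amplicon phyla recovered by shotgun, but most shotgun phyla recovered by amplicons) can be illustrated with a minimal sketch. The function name and the phylum labels are hypothetical placeholders; only the set sizes (20 amplicon phyla, nine shotgun phyla) come from this study, and the number of shared phyla (eight) is an assumption chosen purely for illustration.

```python
def overlap_fractions(taxa_a, taxa_b):
    """Return (fraction of A also found in B, fraction of B also found in A)."""
    a, b = set(taxa_a), set(taxa_b)
    if not a or not b:
        raise ValueError("both taxon lists must be non-empty")
    shared = a & b
    return len(shared) / len(a), len(shared) / len(b)


# Placeholder labels, not real phylum names.
# 20 amplicon phyla and 9 shotgun phyla; 8 shared phyla is an assumed value.
amplicon_phyla = {f"phylum_{i}" for i in range(1, 21)}
shotgun_phyla = {f"phylum_{i}" for i in range(1, 9)} | {"phylum_shotgun_only"}

amp_in_shot, shot_in_amp = overlap_fractions(amplicon_phyla, shotgun_phyla)
print(f"amplicon phyla also in shotgun: {amp_in_shot:.2f}")
print(f"shotgun phyla also in amplicon: {shot_in_amp:.2f}")
```

Reporting both fractions, rather than a single symmetric index, keeps the direction of the mismatch visible.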

The key difference between the amplicon- and shotgun-derived data in our study was taxonomic breadth and abundance, whether looking at the overall results or site by site. The lower taxon counts for shotgun sequencing appear to be due to issues inherent to the shotgun technique, as well as to database size. As genome databases are continuously improving and expanding, this problem should become less significant. New approaches in multi-enzyme and mechanical shotgun extraction and sequencing techniques may also help18. Additionally, shotgun sequencing is complicated by many reads (often the majority) mapping to unknown species, which reduces the number of taxonomically applicable reads; this issue may be more problematic in complex environments such as river basins.

The fundamental issue with the shotgun technique was that taxon richness reached an asymptote on a per-site basis at low and unpredictable levels compared to the amplicon results (Fig. 7A). While this high degree of variability can potentially be overcome by using a large number of sites (note the high variance in total predicted taxa in Supplementary Table 2 and the longer asymptote in Fig. 7B), this is not particularly helpful, as a fundamental goal of biodiversity studies is to capture the true richness and abundance of organisms at each individual site. Yet the shotgun data yielded more environmental correlates (more below). The rarefaction asymptotes of Fig. 7A indicate that further sequencing is unlikely to provide additional insight on a per-site basis, at least when using MetaPhlAn2. In contrast, some studies have shown that greater sequencing depth can be useful for detecting rare species; unfortunately, it generally comes at the cost of shorter reads that are frequently misaligned, a process that inflates both species counts and diversity estimates4, 14, 19.
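The idea of a rarefaction asymptote can be sketched as follows. This is a simplified illustration, not the procedure used for Fig. 7: it draws a single random ordering of reads and counts distinct taxa in growing prefixes, whereas standard rarefaction averages over many random subsamples. The function name, taxon labels, and read counts are hypothetical.

```python
import random


def rarefaction_curve(read_labels, depths, seed=0):
    """Count distinct taxa observed in the first `depth` reads of one random ordering."""
    rng = random.Random(seed)
    shuffled = list(read_labels)
    rng.shuffle(shuffled)
    curve = []
    for depth in depths:
        depth = min(depth, len(shuffled))  # cannot sample more reads than exist
        curve.append((depth, len(set(shuffled[:depth]))))
    return curve


# Toy site: five placeholder taxa with strongly uneven read counts.
reads = (
    ["taxon_a"] * 500
    + ["taxon_b"] * 300
    + ["taxon_c"] * 150
    + ["taxon_d"] * 45
    + ["taxon_e"] * 5
)
curve = rarefaction_curve(reads, depths=[10, 50, 100, 500, 1000])
for depth, richness in curve:
    print(depth, richness)
```

Once the observed richness stops increasing with depth, additional reads from the same site add no new taxa, which is the sense in which an asymptote implies that deeper sequencing is unlikely to help.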

Genomic databases, even for microbes, are in their infancy11, 12, 20. While the number of genomes deposited in these databases is increasing at an astonishing pace, the databases have a long way to go15. This is especially true when compared to the well-curated 16S microbial databases like RDP21, SILVA22, and Greengenes23. Database incompleteness appears to be less of a problem in studies on well-characterized systems like the human microbiome (Table 1).

By definition, all nine of the phyla recognized in the shotgun dataset have whole genome sequences in the database. On the other hand, the 20 amplicon phyla determinations use 16S rDNA sequences to make the identifications, so not all of them necessarily have sequences in the whole genome database. Indeed, only 80% of the phyla identified using the 16S amplicon approach also have whole genomes sequenced from members of those phyla (Supplementary Fig. 1), leaving an incomplete taxonomic overlap between databases. This discrepancy at the phylum level clearly entails a massive lack of resolution at finer taxonomic levels (e.g., for the families reviewed here). Missing a single phylum is disconcerting, let alone 20% of phyla.

Given the 16S vs. genome database discrepancy, many shotgun sequences are surely assigned to inappropriate taxa. These incorrect IDs are most likely close relatives of taxa that have sequenced genomes. Thus, the IDs may still have some merit based on the fact that closely related taxa generally have phylogenetically constrained traits that make them more similar (ecologically, physiologically, etc.) to one another than to more distant relatives24. However, ecological analyses using higher taxa as surrogates for species achieve variable results depending on the types of input data25. In microbial communities, functional diversity cannot be directly predicted from phylogenetic diversity. For example, while in the macroscopic world it is an accepted paradigm that an ecosystem with a low level of taxonomic richness will also have a reduced functional diversity, this does not seem to apply to microbial communities20.

Because of the putative cases of mistaken identity with shotgun sequencing, we chose not to use UniFrac or any of its derivative distances (e.g., weighted and generalized; see ref. 26) for community-level analyses. For microbial eDNA community ecology, multivariate analysts now generally favor these phylogenetically adjusted measures rather than simply considering taxa as independent entities. However, without highly accurate identifications, accounting for a specific phylogeny makes little sense: recall that fewer than half of the amplicon-recovered phyla were found with shotgun, indicating that many shotgun sequences were assigned to incorrect phyla, a phylogenetically gigantic distance.

The biases of close, but not exact, identifications are almost surely less extreme when taxa are considered as fully independent entities (i.e., not using UniFrac, but more traditional non-phylogenetic distance matrices). Considering taxa as fully independent entities is standard for community ecology of large eukaryotic organisms. Yet, despite the acceptability of both methods, it is still a notable difference that shotgun data should not, in our opinion, rely on phylogenetically accountable methods until the databases become larger and the tools more sensitive.
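As an illustration of a non-phylogenetic distance that treats taxa as fully independent entities, the sketch below computes Bray-Curtis dissimilarity between two sites. Bray-Curtis is chosen here only as a common example of such a measure; the text above does not name a specific distance, and the function name, site names, and counts are hypothetical.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two {taxon: count} profiles (0 = identical, 1 = disjoint)."""
    taxa = set(counts_a) | set(counts_b)
    total = sum(counts_a.get(t, 0) + counts_b.get(t, 0) for t in taxa)
    if total == 0:
        raise ValueError("at least one profile must contain reads")
    # Each taxon contributes independently; relatedness between taxa is ignored.
    diff = sum(abs(counts_a.get(t, 0) - counts_b.get(t, 0)) for t in taxa)
    return diff / total


# Placeholder family-level counts for two sites.
site_1 = {"family_x": 2, "family_y": 3}
site_2 = {"family_x": 1, "family_z": 4}
print(bray_curtis(site_1, site_2))
```

Because a misassigned read only moves counts from one taxon label to another, its effect on such a measure does not depend on how far apart the two labels sit in a phylogeny, unlike with UniFrac-style distances.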

Throughout our study we focused on commonly used bioinformatic pipelines. While the RDP appears to work well for amplicons, our finding that MetaPhlAn gave lower-quality results for shotgun could be called into question. However, MetaPhlAn is one of the most popular taxonomic categorizers; for instance, it was used in the Human Microbiome Project27. More importantly, it relies on clade-specific marker genes, an approach that is crucially important for accurate identifications in bacterial biodiversity studies and is common algorithmically. We believe that current practices for analyzing shotgun data that do not use clade-specific markers may be inappropriate for bacterial taxonomic identifications. Future studies should compare less conservative approaches, such as PhyloSift28.

Due to conjugation, horizontal gene transfer is rampant in bacteria. It is equally well established that there is a core set of genes across bacteria that are highly conserved and rarely transferred; this is generally referred to as the core genome29. While amplicon-derived analyses take advantage of a single gene in the core genome, shotgun relies on genes across the entire genome. Accordingly, the analytics of shotgun will inevitably lead to avoidable misidentifications if based around genes not found in the core genome. This is a major problem for biodiversity and ecology studies, as confident identifications are paramount. Future shotgun analytics can therefore benefit from limiting taxonomic identifications to sequences from the core genome or clade-specific marker genes (as done by MetaPhlAn)30, 31.

Furthermore, while our results could be confounded by the fact that we sequenced amplicons via 454 and shotgun via Illumina, the majority of studies in Table 1 that compared the amplicon procedure on 454 vs. Illumina agree that these sequencing platforms give highly similar results. Additionally, while Illumina is the dominant NGS platform, amplicon and shotgun studies generally use different Illumina platforms to meet their goals (e.g., MiSeq and HiSeq, respectively; see Table 1). Thus, we believe that our results and comparisons are valid. It is also worth noting that if either sequencer were at fault, the 454 would be the expected culprit, as it produced fewer than 1% of the reads obtained from Illumina (as expected), making our results akin to a fisherman with a single fishing rod catching more fish than a commercial trawler.

The only result that does not clearly favor either shotgun or amplicons (or is at least difficult to interpret) concerns the environmental correlates of the NMDS ordinations (Supplementary Table 1, Fig. 5), for which shotgun had more significant variables associated with certain metadata. While this could be in favor of shotgun, that is unlikely, as the input matrix was so depauperate in terms of taxon richness and evenness across taxa. More likely, this result is due to a simplified ordination space that is largely driven by clear divisions by site for a few taxa, as exemplified by the heat maps (Fig. 4). More correlates were found at the phylum level in both sequencing methods, further supporting the idea that ordinations driven by fewer taxa could be increasing the number of correlates we found. It is also worth noting that for more thoroughly researched microbial floras that have many genomes sequenced, the shotgun system may outperform the amplicon-based approaches, as it will provide useful data for a larger array of questions. This already might be the case for urban environments or the human microbiome32.

While both amplicon and shotgun sequencing methods have their own advantages for microbial studies, amplicon sequencing was clearly superior for the goals of microbial eDNA community ecology in the reviewed lakes of floodplain systems from Brazil. Further studies should strive for comparisons of even larger datasets across a greater number of habitats, as there can be major differences in the conclusions drawn depending on the type of sequencing conducted33. At this point, any large-scale study should at minimum conduct pilot comparisons between these techniques to choose the more appropriate option.