Abstract

A comprehensive knowledge of the types and ratios of microbes that inhabit the healthy human gut is necessary before any kind of pre-clinical or clinical study can be performed that attempts to alter the microbiome to treat a condition or improve therapy outcome. To address this need we present an innovative, scalable, comprehensive analysis workflow, a healthy human reference microbiome list and abundance profile (GutFeelingKB), and a novel Fecal Biome Population Report (FecalBiome) with clinical applicability. GutFeelingKB provides a list of 157 organisms (8 phyla, 18 classes, 23 orders, 38 families, 59 genera and 109 species) that forms the baseline biome and therefore can be used as healthy controls for studies related to dysbiosis. This list can be expanded to 863 organisms if closely related proteomes are considered. The incorporation of microbiome science into routine clinical practice necessitates a standard report for comparison of an individual’s microbiome to the growing knowledgebase of “normal” microbiome data. The FecalBiome and the underlying technology of GutFeelingKB address this need. The knowledgebase can be useful to regulatory agencies for the assessment of fecal transplant and other microbiome products, as it contains a list of organisms from healthy individuals. In addition to the list of organisms and their abundances, this study also generated a collection of assembled contiguous sequences (contigs) of metagenomic dark matter. In this study, metagenomic dark matter represents sequences that cannot be mapped to any known sequence but can be assembled into contigs of 10,000 nucleotides or longer. These sequences can be used to create primers to study potential novel organisms. All data are freely available from https://hive.biochemistry.gwu.edu/gfkb and NCBI’s Short Read Archive.

Citation: King CH, Desai H, Sylvetsky AC, LoTempio J, Ayanyan S, Carrie J, et al. (2019) Baseline human gut microbiota profile in healthy people and standard reporting template. PLoS ONE 14(9): e0206484. https://doi.org/10.1371/journal.pone.0206484 Editor: Ajay Goel, Beckman Research Institute, UNITED STATES Received: October 10, 2018; Accepted: August 5, 2019; Published: September 11, 2019 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. Data Availability: All GutFeelingKB data is available from https://hive.biochemistry.gwu.edu/gfkb. All Filtered-nt data is available from hive.biochemistry.gwu.edu/filterednt. All Sequence and metadata data is available from NCBI’s Short Read Archive (PRJNA428202, PRJNA487305, PRJNA43021). Funding: This project was supported in part by funds from National Science Foundation (NSF) (award number: 1546491 to RM), the NIH National Center for Advancing Translational Sciences (award number UL1TR000075 to KAC, HM, RM), and the McCormick Genomic and Proteomic Center (MGPC) at the George Washington University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction

While humanity has only begun to influence planetary-level events in the last few hundred years [1], microorganisms have shaped our planet since time immemorial [2]. It has been shown that the microbes of the ocean are as important for influencing planetary climate as the microbes of gastrointestinal (GI) tracts of cattle [3]; furthermore, new functions are continuously found for the human microbiome [4–6]. However, since the advent of germ theory and the antimicrobial revolution, microbes have been viewed as insurgents bound for eradication [7]. Hence, we have created GutFeelingKB to provide a reference for the metagenomic analysis of the human gut microbiome. In 2001, some sixty years into the antibiotic era, Joshua Lederberg coined the term ‘microbiome’ as the pendulum of opinion began to swing back to a more microbe-tolerant position [8,9]. In 2008, the US National Institutes of Health launched the Human Microbiome Project (HMP) to better understand the makeup of the community of microbes in cohabitation with humans [10,11]. This population of microorganisms brings with it a vast, diverse, and modifiable set of genomes which have proven to influence human health and disease [12,13]. Together, these organisms’ genomes comprise the metagenome, a highly versatile pool of genetic elements which now serves as a target for medical research [14]. Microbiome characterization through various analysis pipelines has advanced progressively since HMP, and this development process has catalyzed the understanding of certain roles of these microbial communities [15,16]. Although microbiomes of all body sites are important, the gut microbiome, with hundreds of prevalent species, is of major interest to a large and diverse number of researchers [17,18]. Healthy gut microbiome data and analyses are crucial for all studies of disease related to the human gut.
A Nature Microbiology issue in 2016 contained a consensus statement which outlined all federally-funded microbiome research over a three-year period [19]. The authors, on behalf of the federal government’s FastTrack Action Committee on Mapping Microbiomes (FTAC-MM), defined a microbiome as a multi-species community of microorganisms in any environment: host, habitat, or ecosystem. One of the conclusions reached by the authors was a “priority need” for higher-throughput, more accurate data acquisition, better pipelines for data analyses, and a greater ability to organize, store, access, and share/integrate data sets. At present, most studies rely on study-specific control groups and reporting mechanisms. The studies that are successful at creating clinically relevant results, such as the work by uBiome [20], are based on marker genes, so they do not shed light on the origin of the “microbial dark matter” and cannot be integrated with whole genome shotgun (WGS) sequencing studies. These problems are compounded by the fact that different bioinformatics pipelines produce different results, largely because all current pipelines use a limited number of ad hoc reference organisms to determine abundance. It has also been shown that database growth influences the accuracy of relatively faster k-mer-based species identification [21]. The final understanding of the baseline healthy microbiome can therefore be flawed because the methods are uniquely applied in each study. As such, there is a need for aggregation, validation for interoperability, and eventual standardization of methods and reporting. Currently, metagenomic analyses use nucleotide sequences from a limited set of pre-determined microorganisms or genes as a reference database, and, as such, these reference lists are not truly comprehensive.
The use of limited sets of sequence data is prevalent because it is computationally challenging to perform pairwise read alignment against the entire NCBI non-redundant nucleotide database (NCBI-nt) [22]. Algorithms have been developed that allow the use of the complete NCBI-nt, and it has been shown that using the NCBI-nt permits accurate analysis of the data with significantly fewer errors in microorganism abundance quantification [23]. To leverage this prior work on metagenomic analysis algorithms, samples from a healthy cohort of participants were collected and sequenced to specifically target healthy control data. To ensure the samples were abundant and correct enough to build a healthy reference list, we also retrieved sequences of healthy people from HMP. Furthermore, we developed an approach that generates a collection of assembled contiguous sequences (contigs) that cannot be aligned to any known sequence in NCBI-nt but are present in fecal samples from healthy individuals and are ideal for healthy-disease-microbiome correlation analysis and novel primer design. For the purposes of this study, these sequences are defined as metagenomic dark matter: sequences that cannot be mapped to any known sequence but can be assembled into contigs of 10,000 nucleotides or longer. Together, these data form our Gut Feeling Knowledge Base (GutFeelingKB). The contig nucleotide length threshold is expected to reduce the number of contigs in GutFeelingKB that are not of biological origin. Our definition is much stricter than previous definitions of metagenomic dark matter, which accept remote homology to known sequences [24]. The need to include metagenomic dark matter in comprehensive analyses of the gut microbiome matches the arguments presented by Bernard et al. in their recent manuscript on microbial dark matter, where they opine that “unraveling the microbial dark matter should be identified as a central priority for biologists” [25].
The primary aim in creating GutFeelingKB is to provide a reference knowledgebase for the metagenomic analysis of the human gut microbiome. All the organisms which were confidently observed in a healthy human gut are included. Using this knowledgebase, we designed a standard reporting template of individual microbiome data for direct comparison to GutFeelingKB. This type of report can be useful to any scientist, clinician, or patient and can enhance comparison of results from different studies.

Materials and methods

Metagenomic sampling and participant statistics

Healthy cohort selection and nutritional information. Participants for this study were recruited from the George Washington University (GW) Foggy Bottom campus area through the use of flyers and emails to GW affiliated organizations (selection criteria included in S1 Table). Samples and anthropometric measurements (included in S1 Table) were collected from healthy people at GW according to a George Washington Institutional Review Board (IRB#011605) approved protocol. At the baseline visit, participants received extensive instructions on how to record their dietary intake (including type, brand, and portion size of every food and beverage consumed on each day throughout the study period) and the time of consumption for each item. Participants then recorded their dietary intake using a seven-day food journal throughout the length of the study. Each participant provided three samples. The food journal was collected at the submission of the final sample, after which the reported 7-day dietary intakes for each subject were entered into the Nutrition Data System for Research (NDSR) [26]. NDSR produces a tabular daily nutrient profile for each day of dietary intake for each individual, which was then added as metadata to the abundance matrices (S2 Table). All participants self-reported as ‘healthy’ (participant does not have an obvious or self-declared disease state) at the start of the study and remained healthy throughout.

Sampling and sequencing. Fecal samples were collected from healthy volunteers using sterile commode containers at the Milken Institute School of Public Health at the George Washington University (GWSPH). Immediately following collection in ethanol, the fecal samples were stored in a -20° Celsius freezer for a period of up to two weeks, after which aliquots were placed in longer-term storage in a -80° Celsius ultra-freezer.
Samples were subsequently transported to the sequencing center on dry ice. DNA was extracted using the MoBio PowerFecal DNA Isolation kit. Double-stranded DNA (dsDNA) concentration and quality were assessed using NanoDrop and the Qubit dsDNA Broad Range (BR) DNA Assay Kit, respectively. DNA was diluted for library preparation using the Illumina Nextera XT Library Prep Kit, and 1 ng from each sample was fragmented and amplified using Illumina Nextera XT Index Kit primers. Amplified DNA was then cleaned using Agencourt AMPure XP beads, resuspended in buffer, and tested again for concentration, quality, and fragment size distribution on a Bioanalyzer using the Agilent High Sensitivity DNA Kit. DNA libraries were brought to the same nM concentration, pooled, and denatured with 0.2 N NaOH prior to loading on an Illumina MiSeq Reagent Kit v3 and sequencing on the Illumina MiSeq platform. Sequence data FASTQ files were uploaded to BaseSpace (https://basespace.illumina.com/home/index) for sharing and further analysis.

Sequence quality assurance. All sequence data were uploaded to the GW High-performance Integrated Virtual Environment (HIVE) [27,28]. Upon initial upload into the system, HIVE automatically conducts a series of quality assurance (QA) computations for each sequence read file and generates figures to display the results. S1 Fig is a compilation of the quality assurance computations done on one read file. Upon completion of the initial upload for each read file, the resulting quality assurance figures were inspected to ensure that the read file was of adequate quality and did not have any unusual characteristics (such as low quality scores or a disproportionate distribution of nucleotides). Reads that had an average Phred quality score of 20 or less were discarded. The nucleotide base distribution was also examined to ensure that no read files had an unusual distribution of bases or a positional quality score below the threshold of 20.
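The read-level threshold described above (discard reads with an average Phred score of 20 or less) can be illustrated with a minimal sketch. This is not the HIVE implementation, which is not described in this text; the function names `mean_phred`, `parse_fastq`, and `filter_reads` are hypothetical, and the sketch assumes four-line FASTQ records with Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, quality string); a hypothetical record layout for this sketch
FastqRecord = Tuple[str, str, str]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes unwrapped sequences)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def filter_reads(records: Iterable[FastqRecord],
                 min_mean: float = 20.0) -> List[FastqRecord]:
    """Keep reads whose mean Phred score is strictly above the threshold,
    so a read averaging exactly 20 is discarded, as stated in the text."""
    return [rec for rec in records if mean_phred(rec[2]) > min_mean]


example = [
    "@r1", "ACGT", "+", "IIII",  # 'I' encodes Q40
    "@r2", "ACGT", "+", "5555",  # '5' encodes Q20, discarded at the boundary
    "@r3", "ACGT", "+", "####",  # '#' encodes Q2
]
kept = filter_reads(parse_fastq(example))
```

In this toy input only `@r1` survives; the strict inequality is the design choice that matches "20 or less were discarded".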
S2 Fig is an aggregate of the computations across all samples.

Healthy cohort from Human Microbiome Project. In addition to the data generated from sequencing described above, additional data were downloaded and analyzed from the Human Microbiome Project (HMP) [29]. HMP sequence data and metadata are available through NCBI SRA and dbGaP. Fifty fecal metagenomic samples were randomly chosen from HMP Phase I (S1 Table) to approximately match the number of samples collected in our study. HMP subjects were screened based on the stringent criteria listed in the HMP publication, and the individuals who passed the screening were considered “healthy” subjects [11].

GW and HMP combined data. Sequence and metadata from this study are publicly available through GutFeelingKB (https://hive.biochemistry.gwu.edu/gfkb), and are also available from two NCBI-SRA BioProjects: Healthy Human Gut Metagenomics (PRJNA428202) and Effects of non-nutritive sweeteners on the composition of the human gut microbiome (PRJNA487305). For PRJNA487305, only the samples donated prior to intake of non-nutritive sweeteners were used in this study. HMP data were downloaded from the NIH Human Microbiome Project (HMP) Roadmap Project (PRJNA43021). A total of 48 samples from 16 individuals were sequenced in the GW cohort. Each sample resulted in two paired-end read files (for details see S3 Table). Sequence data from these 48 samples along with 50 samples from HMP passed sequence quality checks and were used to develop the baseline microbiota profile. For GW samples, 55.55% (± 13.46%) of the reads could not be mapped to any known sequence; for HMP samples, the figure was 48.29% (± 18.54%). There was no need for any computational filtering of human DNA, as the MoBio PowerFecal DNA Isolation kit was used for GW samples, biochemically removing any host DNA. For the HMP data, all human DNA had been computationally removed before the samples were deposited in dbGaP [11].
Sample and participant information can be seen in Table 1.

Table 1. Human Microbiome Project (HMP) and GW participant statistics. https://doi.org/10.1371/journal.pone.0206484.t001

Filtered-nt. Filtered-nt (v5.0) was created from the NCBI-nt file downloaded on May 21st, 2017. A detailed README.md and the code used can be found at https://github.com/GW-HIVE/HIVE-lab/tree/master/Filtered_nt. Both the NCBI-nt (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA) and NCBI taxonomy files (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy) were downloaded using the wget command. Filtered-nt was generated using a curated blacklist file of taxonomy IDs, selected based on terms contained in the lineage of each taxonomy entry. Taxonomy nodes with terms such as ‘unclassified’, ‘unidentified’, ‘uncultured’, ‘unspecified’, ‘unknown’, ‘vector’, ‘environmental sample’, ‘artificial sequence’, ‘other sequence’ were blacklisted. Child nodes of blacklisted nodes are also automatically removed. The filtered taxonomy list was then used to filter the NCBI-nt sequence file. Filtered-nt and the blacklisted taxonomy IDs along with node names are available for download at https://hive.biochemistry.gwu.edu/filterednt.

Metagenomic analysis pipeline

The metagenomic analysis pipeline developed here includes three software tools and one sequence database (Filtered-nt), organized into a workflow that ensures an efficient and comprehensive analysis of a large sequence space. The tools are CensuScope [30], HIVE-Hexagon [31], and IDBA-UD [32]. All software tools are integrated in the HIVE platform [27,28] and allow end-to-end analysis of metagenomic sequences.

Healthy Human gut microbiome list (GutFeelingKB). CensuScope [30] is a taxonomic profiling software that randomly extracts a user-defined number of reads and maps them to a sequence database of any size using BLAST [33]. CensuScope is rapid, accurate, and is not hindered by the size of the reference sequence database.
Because the non-redundant sequence database is growing almost exponentially, CensuScope offers a scalable approach for estimating the taxonomic composition of a microbial population. CensuScope provides a list of organisms, taxonomy identifiers, and BLAST alignments as output. A manual evaluation of the CensuScope results for each of the identified organisms was performed to verify that the “hit” represented an authentic match. “Manual evaluation” included the following criteria:

Inspection of the match count. The number of matched alignments over the entire computation (over all iterations) had to be ≥ 5 out of the total 12,500 alignment threshold set by CensuScope. Five was chosen so that there were enough individual alignments to appraise the authenticity of the matches.

Confirmation of a justifiable taxonomy assignment. Hits to sequences that lacked a clear taxonomic lineage were excluded and marked for removal from Filtered-nt.

Completeness of sequence in GutFeelingKB. Partial sequences, single proteins, or unassembled contiguous sequences were mapped to complete genomes to be included in GutFeelingKB. This is the only way to keep partial sequences from skewing organism abundance results.

Organism verification. In order to have confidence in the results, it was necessary to independently verify the biological accuracy of each “hit”. Metadata about the organism was reviewed to verify appropriateness of its presence in the human gut.

Any reference sequence and organism that satisfied these criteria was added to GutFeelingKB. To extend the usability of this list, available online databases and reference texts were used to annotate the organisms [22,34–36]. The NCBI accession numbers from the true positive CensuScope hitlist results were used to obtain the NCBI accession, the RefSeq accession, the NCBI taxonomy ID, the organism name (Scientific Name), and the genome assembly IDs.
Using the taxonomy ID, the lineage and taxonomic name were retrieved from the NCBI taxonomy database. Genome to proteome mapping was guided by Representative Proteome Groups (RPGs), a dataset that clusters similar proteomes (https://proteininformationresource.org/rps/). The RPG clusters are calculated based on co-membership in UniRef50 clusters [34] (S4 and S8 Tables). Using the taxonomy ID and the RPG, the corresponding proteome in https://www.uniprot.org/proteomes was identified. From the proteome entry, the Genome Assembly ID match between UniProt, RPG, and NCBI was verified. In most instances the proteome entry contained some descriptive text about the organism taken from a publication, as well as citations. Such information was added as organism annotation. Additional fields (Resistance to Antibiotic, Susceptibility to Antibiotic, Physical Characteristics) were populated from other sources [36]. Finally, all of the associated DOIs and PMIDs for the metadata were added to the final column. It is important to note that many bacteria are closely related and hence have large homologous regions. This can lead to species-level misidentification. Although the concept of a pan-genome or pan-proteome for closely related bacteria is well accepted [35], it is important to avoid such misidentification for known pathogens. To avoid such false positives, well-known pathogens (S5 Table) are included only if their abundance is 1% or higher and their alignments have been manually evaluated.

Bacterial abundance profile. Fig 1 provides a schematic representation of the workflow. The first step uses CensuScope (a subsampling BLAST algorithm) to identify organisms that are present in the sample. To generate a rapid and accurate taxonomic profile, 2,500 reads are used in each iteration [30] (up to five iterations). This step allows identification of organisms present in a sample.
Each of these organisms is added to GutFeelingKB if it is not already present. Next, HIVE-hexagon, a highly specific and sensitive short-read aligner [37], is used to map all of the reads in each sample to GutFeelingKB (created through the use of CensuScope) to obtain the final abundance profiles. It is important to note that the HIVE-hexagon best match parameter was used. When a read has best matches to more than one reference, this parameter maps the read to the reference that has the greatest number of matches.
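The two-step logic above (subsample reads to discover organisms, then map all reads to tabulate abundance) can be sketched as follows. This is an illustration only and is neither CensuScope nor HIVE-hexagon code: the `classify` callable is a hypothetical stand-in for a BLAST search against Filtered-nt, the function names are invented for this example, the sketch always runs all five iterations rather than "up to" five, and it assumes relative abundance is expressed as a percentage of mapped reads, which the text does not specify.

```python
import random
from collections import Counter
from typing import Callable, Dict, Optional, Sequence


def subsample_hits(reads: Sequence[str],
                   classify: Callable[[str], Optional[str]],
                   reads_per_iteration: int = 2500,
                   iterations: int = 5,
                   min_hits: int = 5,
                   seed: int = 0) -> Dict[str, int]:
    """Count organism hits over random subsamples of the reads and keep
    organisms with at least `min_hits` hits over all iterations."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    pool = list(reads)
    for _ in range(iterations):
        k = min(reads_per_iteration, len(pool))
        for read in rng.sample(pool, k):
            organism = classify(read)  # None means no acceptable alignment
            if organism is not None:
                counts[organism] += 1
    return {org: n for org, n in counts.items() if n >= min_hits}


def relative_abundance(assignments: Sequence[Optional[str]]) -> Dict[str, float]:
    """Percentage of mapped reads assigned to each organism.
    Unmapped reads (None) are excluded from the denominator (an assumption)."""
    counts = Counter(a for a in assignments if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {org: 100.0 * n / total for org, n in counts.items()}


# Toy usage with a classifier that assigns every read to one organism.
toy_reads = ["ACGT"] * 100
candidates = subsample_hits(toy_reads, lambda read: "org1",
                            reads_per_iteration=10, iterations=5)
profile = relative_abundance(["org1", "org1", "org1", "org2", None])
```

With the default settings the maximum number of alignments is 2,500 × 5 = 12,500, which matches the total quoted in the manual evaluation criteria; candidates that pass the hit-count filter would still require the manual checks described earlier before entering GutFeelingKB.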

Fig 1. Metagenomic analysis pipeline for 3 samples. Step 1: CensuScope is run for each read file against Filtered-nt. Each aligned organism that is approved by manual check is added to GutFeelingKB, which is then versioned. Step 2: For the final analysis the raw read files are mapped against GutFeelingKB organism sequences using HIVE-hexagon. Outputs are tabulated as relative abundance percentages. Unaligned reads from each sample were assembled using IDBA-UD. Contigs that were over 10,000 nucleotides long had their headers modified to include the following: sample ID, a number assigned according to length (long to short), and additional metadata about the participant. These contigs are available as a download at https://hive.biochemistry.gwu.edu/gfkb. https://doi.org/10.1371/journal.pone.0206484.g001

Metagenomic dark matter. The unaligned reads of each sample were assembled using IDBA-UD [32] and considered as metagenomic dark matter. Only the assembled contiguous sequences (contigs) longer than 10,000 nucleotides were investigated in this study. Such a large length threshold was used to ensure that the metagenomic dark matter contigs were truly of biological origin. The gut microbiome of a sample can be represented as the sum of known organisms and organisms represented by the metagenomic dark matter sequences. More specifically, the contigs that were over 10,000 nucleotides in length were tagged with the sample ID and numbered, and metadata about the participant was added to the header. These contigs are available as a download at https://hive.biochemistry.gwu.edu/gfkb for further analysis and novel primer design.

Analysis of nutritional metadata and microbial abundance

MaAsLin, an R package that employs a “multivariate statistical framework that finds associations between clinical metadata and microbial community abundance or function” [38], was used to find correlations between bacterial abundance and diet.
Intra-host variability was analyzed by evaluating the standard deviation of multiple measurements for every patient, averaged over all patients. Inter-host variability was computed as the standard deviation of the means of per-host abundance values. To estimate the degree of stability of measurements for bacterial populations in patient samples, the ratio of intra-host to inter-host variability was computed. Nutrition to organism abundance correlation was also computed using a Cosine Similarity Coefficient. The matrix of bacterial strain abundances was variance scaled and zero centered to create comparable distributions of equal variability. Categorical data (such as gender) were turned into numerical values. More specifically, in order to define correlation metrics between features and bacterial composition for the set of individuals, we used the Cosine Similarity Coefficient as defined in Formula 1. The Cosine Similarity Coefficient of correlation between bacteria (j) and feature (k) is computed as the sum product of the jth Bacteria (Bj) abundance for patient i and the kth Feature (Fk) of patient i. A Cosine Similarity of around 1 indicates a strong correlation, around -1 indicates a strong anti-correlation, and 0 indicates no correlation, with 0.7 being considered the marginal threshold for evidence of some degree of correlation [39,40].
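The variability ratio and the cosine similarity described above can be sketched in a few lines. Formula 1 is not reproduced in this text, so the sketch assumes the standard cosine form, sum over i of B_ij × F_ik divided by the product of the two vector norms, applied after zero centering and variance scaling. The function names are hypothetical, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is also an assumption.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Zero-center and variance-scale one column of values."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sd for v in values]


def cosine_similarity(b: Sequence[float], f: Sequence[float]) -> float:
    """sum_i b_i * f_i / (||b|| * ||f||); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(b, f))
    norm = math.sqrt(sum(x * x for x in b)) * math.sqrt(sum(y * y for y in f))
    return dot / norm if norm else 0.0


def variability_ratio(per_host: Dict[str, Sequence[float]]) -> float:
    """Intra-host SD (averaged over hosts) divided by the inter-host SD
    of the per-host mean abundances."""
    intra = mean(pstdev(samples) for samples in per_host.values())
    inter = pstdev([mean(samples) for samples in per_host.values()])
    return intra / inter if inter else float("inf")


# Toy usage: one organism's abundance and one dietary feature across four hosts.
abundance = standardize([1.0, 2.0, 3.0, 4.0])
feature = standardize([2.0, 4.0, 6.0, 8.0])
similarity = cosine_similarity(abundance, feature)
ratio = variability_ratio({"host1": [0.0, 2.0], "host2": [4.0, 6.0]})
```

Because both vectors are zero centered before the comparison, the cosine similarity computed this way coincides with the Pearson correlation coefficient, which is consistent with reading values near 1 and -1 as correlation and anti-correlation. A ratio below 1 from `variability_ratio` indicates that repeated samples from the same host vary less than hosts differ from one another.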

Conclusion

The metagenomic analysis workflow described in this study involves a sub-sampling-based method followed by comprehensive mapping of all of the reads to accurately determine the abundance of microorganisms. The workflow provides a comprehensive snapshot of the microbial abundance and can easily be used with any state-of-the-art NGS read mapping and assembly algorithms. The list of baseline organisms identified in the normal human gut has clinical applicability as microbiome research moves closer to the bedside. The methods, tools and data from this project can also be used by regulatory scientists to evaluate workflows related to fecal transplant. In addition to the workflow, this work lays the foundation for an expansive and modular database which can aggregate publicly available data as well as data from contributors to push towards an understanding of the baseline human microbiome. This database can serve as a reference in studies of dysbiosis and of microbiomes associated with diseases. The user-friendly FecalBiome report, which contains absolute and relative abundance information about a given sample compared to an average across the entire database, gives scientists, clinicians, and eventually patients an overview of the gut microbiome. This work has the potential to provide a significant impact on regulatory science (e.g., FDA) and standards organization (e.g., NIST) research efforts in this area. For example, GutFeelingKB can potentially allow for rapid assessment of the content of human GI replacement products and, ideally, allow for more expedient review of products. Future studies to advance evidence-based microbiome medicine can be conducted where potential patients identify which outcomes (such as depression, bloating, epilepsy, frequency of common colds, cancer, etc.) are relevant to them. For example, Apte et al. [20] identified 28 disease-related organisms which can be targeted to evaluate the health status of an individual and used to detect disease, while the FecalBiome report can be used to communicate the microbiome-related health status of an individual between a clinician and patient. Those outcomes will become endpoints in clinical trials or observational studies that demonstrate the effects of various bacteria on the human gut. This type of methodology would tie raw numbers to health states that are meaningful for the general population, ensuring that data gathered are relevant to the patient, and therefore the clinician. This could bring a new, patient-centric perspective to microbiome data use and allow for a greater scope of health data to sit atop metagenomic sequence data. If everyone uses the same set of clinically relevant endpoints, research will be easily comparable across studies and meta-analyses will become interoperable.

Acknowledgments H Zhang and Y Hu provided valuable comments. The following bioinformatics curators contributed to the GutFeelingKB organism metadata and descriptions (C Sabet: https://orcid.org/0000-0003-2299-1426, Y Chidanandan: https://orcid.org/0000-0001-5703-5667, V Simonyan: https://orcid.org/0000-0002-2577-3240, N Post: https://orcid.org/0000-0002-0457-7056, B Osborne: https://orcid.org/0000-0002-9007-8746, S Halkett: https://orcid.org/0000-0001-9721-3181, M Mazumder: https://orcid.org/0000-0003-1181-8118). This project was supported in part by funds from National Science Foundation (NSF) (award number: 1546491 to RM), the NIH National Center for Advancing Translational Sciences (award number UL1TR000075 supported KAC, HM, RM in part), and the McCormick Genomic and Proteomic Center (MGPC) at the George Washington University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.