T2D and control dataset

Raw NGS datasets of DNA obtained from fecal samples of 170 T2D patients and 174 HCs in China [3] were downloaded from the NCBI Sequence Read Archive (total data, 1.2 terabases; average sample size, 3.5 gigabases; accession numbers SRA045646 and SRA050230). To reduce SNP false positives, we trimmed abnormal bases and filtered low-quality reads as described in detail previously [12]. For each base (A, T, G and C), a mean number of base calls (f) across all sites and a standard deviation (SD) were calculated. Starting from the first site at 5′ end, the site was trimmed if the base call number for a base was beyond f ± 2 × SD. The trimming at 5′ end was terminated until encountering a site with all base call numbers within the range. Next, Trimmomatic [17] was used to remove adapters and trim low-quality bases (<Q20) at the 3′ end (parameters: -phred33 ILLUMINACLIP:adapters.fa:1:0:7 TRAILING:20 SLIDINGWINDOW:5:10 MINLEN:45 AVGQUAL:20). A total of 1.04 terabases of data remained after quality control procedures were completed. Clinical information and other characteristics of the 344 individuals included in the analysis were obtained from [3] and [18].

Bacterial reference genome determination

We first determined a set of bacterial reference genomes for samples included in this study. Metaphlan2, which is based on approximately 17,000 reference genomes [19], was employed to profile the bacterial species in each sample. Genomes of species that were identified by metaphlan2 in at least four samples were included in the reference set for the alignment analyses. The genome sequences of all of these species were downloaded from the NCBI assembly database and are listed in Additional file 1: Table S1.

Selecting species and genes with sufficient supporting reads

Clean reads were aligned to the whole reference set with Burrows-Wheeler Aligner-maximal exact match (BWA-MEM) [20] in default settings, and only unique alignments were outputted. Species with sufficient supporting reads were selected with a cutoff of ≥40% of the reference genome being covered by a ≥10× depth in at least 20 samples in both the T2D and the control groups. For genes, the cutoff was ≥80% gene sites with ≥10× depth in at least 10 samples in each group. Only the species and genes that met these criteria were subjected to subsequent SNP analyses.

SNP and intra-sample variation calling and filtering

Two tools, BCFtools [21] and VarScan2 [22], were applied to identify SNPs of the metagenome. The alignment of duplicates by BWA-MEM was first marked and filtered in Picard [23]. Then, SAMtools [24] was used to generate “mpileup” files from the SAM-formatted alignment files. The mpileup files were employed as input files for both BCFtools and VarScan2. The parameters used for BCFtools and VarScan2 were “-vmO z –V indels” and “pileup2snp min-coverage 10, p value 0.05, min-avg-qual 15,” respectively. SNPs detected by both the tools were selected. SNPs were further filtered with the requirements of a ≥0.5 mutated allele (relative to that in the reference genome) frequency, ≥4× supporting reads for the variant, and strand bias of sequencing bases less than tenfold in both BCFtools and VarScan2. Namely, when genome sites with heterozygosity were found in a sample, the major allele was used for the SNP analysis.

To address the heterozygosity or intra-sample variations, mutated allele frequencies (MuAFs) of variant sites were obtained for analysis under polyclonal scenario. The MuAFs of sites were defined as the mean values of outputs by BCFtools and VarScan2, which were highly correlated (R 2 = 0.997). To examine whether intra-sample variations affect the result, SNPs with a >0.8 MuAF were selected for a parallel SNP analysis.

Annotation of genes and SNPs

Gene ontology (GO) annotations of each genome were downloaded from Uniprot [25]. SNPs were annotated in SnpEff [26] with the -eff parameter. The genome annotation files used by SnpEff were obtained from GenBank.

Phylogenetic tree construction based on whole-genome level SNPs of B. coprocola

Based on the reference genome of B. coprocola (reference strain DSM 17136, GenBank accession GCA_000154845.1), genome regions with >20% samples not having valid coverage (≥10× depth) were discarded. If ≤20% of the samples had invalid coverage in a region, the bases in that region in those samples were labeled as “N.” Then, the nucleotides at SNP sites from the samples were extracted to generate a pileup file. Phylogenetic trees were constructed based on the whole-genome level aligned SNPs by using randomized axelerated maximum likelihood (RAxML) v8.2.9 (100 bootstrap replicates), with GTR model of nucleotide substitution, γ-distributed rates among sites, and Felsenstein correction for ascertainment bias [27, 28]. The parameters for RAxML were “-#100 –m ASC_GTRGAMMA –f a.” Rooting was undertaken by using RAxML with “-f I” option. Trees were drawn with the R package ggtree [29].

Phylogenetic tree construction of genes

The nucleotide sequences of genes were obtained by aligning reads to corresponding full-length genes of the reference genome. For a given gene, gene regions were discarded if with >20% samples not having valid coverage (≥10× depth) and the coverages of full-length genes are listed in Additional file 1: Table S4. If ≤20% of the samples had invalid coverage in a region, the bases in that region in those samples were labeled as “N.” Then, the full-length genes, except the discarded regions, were aligned. The phylogenetic trees of genes were constructed by using RAxML v8.2.9 with settings as these for whole-genome level, but without ascertainment correction. The two clusters are separated at the root of the phylogenetic tree.

Clustering of samples by mutated allele frequencies of sites

For a given samples, the sites of bacterial genome and gene with a >0.2 MuAFs were selected. For all samples, we obtained a matrix with the column denoting site positions and the rows denoting samples. If there was no variation or MuAFs <0.2, the MuAFs were set to 0. Then, hierarchical clustering and affinity propagation (AP) clustering [30] were applied to the matrix. The hclust function of stat package in R v3.3.0 was used for hierarchical clustering with parameters “method = complete,” and the output tree was drawn by ggtree. For AP clustering, APcluster in R v3.3.0 was employed with parameters “K = 2.” The clusters generated by AP clustering were visualized by using Cytoscape v3.5.0 [31]. For each node, the top five edges connecting the nodes with the largest similarity are shown.

Statistical analysis

Relative abundance of species and genome/gene densities were compared between the T2D group and control group with the Mann-Whitney test. Fisher’s exact test was performed to test for bias of SNP sites and sample enrichment inferred from the phylogenetic tree and clusters by AP clustering. Hypergeometric test was applied for one-tailed test for enrichment of biased SNP sites at gene level. The q value with Storey and Tibshirani’s method (R package qvalue v1.43.0) was applied for multiple testing correction [21] to identify species and genes with significantly biased SNP distribution.