Description of cohort and study design

The ABIS cohort has enrolled 17,055 newborn babies from Southeast Sweden born between 1 October 1997 and 1 October 1999. All mothers of babies born during this time period were invited to participate. This cohort serves as a large biobank of biological specimens obtained longitudinally from the enrolled children at birth, 1 year, 2–3 years, and 5–6 years of age. Collected samples types include blood, urine, stool and hair. In addition, parents of enrolled children completed questionnaires including information on duration of breastfeeding, antibiotic use, diet, etc. Many of the children enrolled also had their HLA genotype determined. The aim of the ABIS cohort, in part, is to identify the importance of environmental factors in autoimmune diseases (e.g. type 1 diabetes) and how genetic and environmental factors may interact in such diseases.

In the present study, we used high-throughput 16S rRNA sequencing to assess the microbiome of stool collected at 1 year of age from ABIS children. This time point was chosen due to the proximity in timing for development of T1D autoimmunity. Because HLA genotype data was available for only some of the children at the time of analysis, 403 individual 1-year stool samples were used. Associations between the 16S data and HLA genotype information for these children were made using multiple common statistical methods. Additionally, culturing methods were used to asses stool bacterial viability.

HLA genetic risk explains microbiome variation

Rarefaction curves for the observed number of amplicon sequencing variants (ASVs) and the Shannon alpha diversity index show that the chosen depth of 10,000 reads per sample was sufficient to represent the diversity of unique sequences in each sample (Fig. 1a, b). Shannon diversity was not significantly different overall among the genetic risk groups (Kruskal–Wallis, p-value = 0.4906), nor through pairwise comparisons between groups (Fig. 1c). The lack of difference in diversity between risk groups suggests that HLA does not have an effect on gut bacterial diversity. Antibiotic use (yes/no) within the first year (PERMANOVA: R2 = 0.00522; F Model = 1.0526; p-value = 0.285), duration of exclusive breastfeeding (PERMANOVA: df = 10; R2 = 0.0241; F Model = 0.97047; p-value = 0.6), mode of delivery (PERMANOVA: R2 = 0.00504; F Model = 1.015, p-value = 0.395) and gender (PERMANOVA: R2 = 0.00242; F Model = 0.97412; p-value = 0.529) did not have significant effects on gut bacterial composition at 1 year of age. Also, antibiotic use within 1 month prior to sample collection (n = 30) did not significantly impact gut bacterial composition (PERMANOVA: R2 = 0.00282; F.Model = 1.1386; p-value = 0.168). Out of the total 403 subjects analyzed, 43 were still at least partially breastfed at the time of the 1-year sample collection. Gut microbiome composition was significantly impacted by breastfeeding status at the time of stool sample collection (PERMANOVA: R2 = 0.00966; F.Model = 3.92; p-value = 0.001). Because breastfeeding status at the time of sample collection could be a confounding variable, this was corrected or accounted for in all subsequent analyses.

Fig. 1 No significant difference in microbial alpha diversity between genetic risk groups. a, b Rarefaction curves were generated based on both the number of unique amplicon sequencing variants (a) and the Shannon diversity index measure (b) for each sample. c Violin box plots and pairwise statistical comparison (Wilcoxon test) between alpha diversities among genetic risk groups. P-values for each comparison are depicted above the boxplots of the groups being compared. For statistical testing, n = 403 independent stool samples. Boxplot medians (center lines); interquartile ranges (box ranges); whisker ranges: Decreased = 2.795; 2.335–3.262; 0.210–4.155, Neutral = 2.798; 2.226–3.299; 0.345–4.156, Increased = 2.915; 2.425–3.362; 1.015–4.160, High = 2.983; 2.545–3.353; 0.871–3.851. Source data for Fig. 1c are provided in the source data file Full size image

HLA risk for T1D was able to explain significant differences in gut microbiome composition (PERMANOVA: R2 = 0.0097; F Model = 1.3088, p-value = 0.01, distance metric: binomial). Pair-wise comparisons between each group showed that all but the high vs. increased and decreased vs. neutral group comparisons were significantly different after correction for false discovery rate (FDR) (Table 1). Note that both high and increased risk groups are comprised of DR3 and/or DR4 haplotypes without protective haplotypes. Similarly, the decreased and neutral risk groups include at least one protective haplotype. Thus, similar groups in terms of haplotype makeup and risk do not significantly differ in gut bacterial composition. Additionally, the R2 values of these comparisons show a gradient trend in the amount of variation explained by risk groups at opposite ends of the risk spectrum. For example, the greatest amount of variation is explained between the high and decreased risk groups (R2 = 0.00923), followed by the high and neutral risk groups (R2 = 0.00824), then the high and increased risk groups (R2 = 0.00805). Because HLA risk explains little of the variation in the model, this suggests that the effect HLA has on the gut microbiome is modest and likely specific toward certain taxa. Also, comparisons between the highest risk subjects consistently explained the most amount of variation. This suggests that the risk-associated haplotypes together are more strongly driving compositional changes within the gut, whereas those without risk haplotypes or with at least one protective haplotype tend to be more similar in bacterial gut composition.

Table 1 Risk groups that are significantly different by PERMANOVA Full size table

The PERMANOVA test is dependent upon distances of dissimilarity between samples. Previous work has shown that the choice of distance metric can have a drastic effect on the test result20. This is because different metrics test different hypotheses. Here the binomial distance measure was used as an improved alternative to the Bray–Curtis index. Other commonly used metrics applied to 16S rRNA amplicon data include Bray–Curtis and Jaccard. Bray–Curtis is considered quantitative, because it is mainly sensitive to highly abundant species, whereas Jaccard is qualitative and considers the overlap of community members regardless of their relative abundances. The binomial distance metric is also quantitative but, unlike Bray–Curtis, includes joint absences, allowing pairs of samples missing the same ASV to appear more similar21. Differences in gut communities between risk groups were not significant using the Bray–Curtis distance (PERMANOVA: R2 = 0.00757; F Model = 1.0445; p-value = 0.43), but significant differences were observed with the Jaccard distance (PERMANOVA: R2 = 0.00893, F Model = 1.2257, p-value = 0.007). This suggests that the abundance of the bacterial ASVs observed may not be significantly influenced by HLA genetics. However, the presence or absence of certain bacteria are likely influenced by HLA as differences in community member overlap varies significantly by genetic risk.

An assumption of the PERMANOVA test is that all groups compared have similar variance (homogeneity of variance assumption), particularly in cases of uneven sampling between groups22. Beta dispersion was calculated based on genetic risk category adjusted for sampling bias, and an ANOVA was applied to test whether the average distance to the median variance was significantly different between groups. The result of the ANOVA showed that average distance to the median variance between groups was not significantly different (F value = 0.3928; p-value = 0.7583). Also, a Tukey HSD test showed that pair-wise tests of the variance assumption are not significant. Therefore, significance testing through PERMANOVA is not expected to be influenced by uneven dispersion among groups.

Amplicon sequence variants associate with HLA genetic risk

Because there are significant differences in microbiome composition between genetic risk groups, bacteria associated with genetic risk across the entire dataset were identified. Using Linear discriminant analysis Effect Size (LEfSe) after grouping ASVs at the genus level, significant associations between high risk subjects and members of the Saccharimonadaceae family were identified (Fig. 2a, b). Additionally, the family Erysipelotrichaceae was associated with subjects at increased genetic risk (Fig. 2a, c). Interestingly, two genera of the family Peptostreptococcaceae, specifically Intestinibacter and Romboutsia, were associated with neutral risk HLA (Fig. 2a, d). Differential feature plots show that the taxa described above have higher average relative abundance in their associated risk group than any other group (Fig. 2b–d). Overall, taxa found to be associated with a particular risk group through LEfSe had higher average relative abundance in their associated risk group. Because subjects at neutral risk often have either DR3 or DR4 plus at least one protective haplotype, the protective haplotypes may be important for this association, though no associations with the decreased risk group were observed in the entire dataset level with LEfSe.

Fig. 2 Specific taxa are associated with genetic risk. a LEfSe biomarker discovery results for the entire dataset (n = 403 individual stool samples). Significant taxa are displayed according to their associated risk group and their differential log 10 LDA score. b–d Differential features identified by LEfSe have higher average relative abundance with their associated risk group. b Saccharimonadaceae, c Erysipelotrichaceae, d Peptostreptococceae, (1) Decreased risk, (2) High risk, (3) Increased risk, (4) Neutral risk. Breastfeeding status at the time of sample collection was corrected for in the LEfSe model design. Source data are provided in the source data file Full size image

DESeq2 was used to find ASVs associated with genetic risk through pair-wise comparisons (Table 2). Again, Intestinibacter and Romboutsia were found to be associated with neutral risk compared with the high-risk group. Comparing the high (DR3/DR4 only) vs. decreased risk groups revealed that Intestinibacter and Romboutsia were also associated with decreased risk. This suggests that DR3 and DR4 together have a greater impact on the association of these ASVs than either haplotype alone (as in the increased risk group), and that the presence of protective haplotypes may also be important for their presence and/or abundance. Also, when comparing the increased vs. neutral groups, Bifidobacterium was associated with neutral risk with a relatively high base mean compared to other associated ASVs found by DESeq2, suggesting this ASV was more abundant across all samples compared to other identified significant ASVs. Even though Escherichia/Shigella members were shared within both the neutral and increased risk groups, they represent unique ASVs and therefore, have the potential to be unique strains. Klebisella and Veillonella ASVs were consistently associated with high risk (DR3/4) (Table 2).

Table 2 DESeq2 supports LEfSe results at the ASV level Full size table

DESeq2 was also able to identify bacterial ASVs associated with particular HLA genotypes or haplotypes (e.g., heterozygous DR3/DR4, or genotypes positive for DR3 or DR4 without protective haplotypes). Intestinibacter and Romboutsia were associated with the DR3 positive increased risk genotype over the DR3/DR4 group. However, neither ASV was associated when comparing DR4 positive with DR3 positive increased risk genotypes, suggesting that the high-risk heterozygous genotype may be exerting a selective pressure against these bacteria. Interestingly, when comparing genotypes where DR3 or DR4 positive haplotype is associated with a protective haplotype, these two ASVs are associated with the absence of DR3 and at the same time, are also not associated with DR4. Because different haplotype combinations will lead to varying degrees of antigen-binding affinity, associations between HLA genotype and bacteria within the gut are likely to be genotype-specific. This is evident in that either having or lacking a protective haplotype in combination with DR3 associates with a different group of gut flora.

Prevalence analysis reveals trends by limiting noise

Amplicon 16S data is sparse and therefore populated with numerous low or zero counts for ASVs that are rarely seen. These rare taxa could be considered noise, as they may not be relevant to the biological question because they appear in very few subjects. To limit this noise, we filtered these data using a method that considers ASV prevalence. ASVs were filtered at 5% increments for the entire dataset (Table 3). A prevalence cutoff of 45% was chosen to assess those ASVs which were present in nearly half of all individuals analyzed. Higher prevalence cutoffs, up to 75%, could be obtained for the full dataset at the cost of additional ASVs. This filtering resulted in clear separation by PCoA between genetic risk groups that could not be observed from the raw dataset, where more distinct clusters form between ASVs prevalent in 45% of each risk group (Fig. 3), regardless of whether those samples from participants still breastfeeding at the time of sample collection were included (Fig. 3a) or not (Fig. 3b). In other words, those ASVs that are present more frequently (in at least 45% of subjects in this dataset) make up distinct patterns of composition depending on HLA-driven genetic risk for T1D.

Table 3 Prevalence filtering focuses analysis on most shared ASVs Full size table

Fig. 3 Prevalence uncovers unique clusters among the most shared ASVs by risk. a, b PCoA depicting inter-subject distances based on the binomial metric at 45% prevalence, both including participants breastfed at sample collection (a) and without participants breastfed at sample collection (b). Ellipses for each risk group are calculated based on a 95% confidence for each group Full size image

Considering only those ASVs that are present in at least 45% of subjects in each risk group, 40 unique ASVs remain from the total 4450 ASVs in the raw dataset. Of the 40 total ASVs, 14 are prevalent among all risk groups, yet the other 16 are shared among 2 or 3 groups, while others still are unique to only one group (Fig. 4). Though, at the genus level, some taxa are shared, these organisms are represented by unique sequences and should therefore be considered unique organisms. In support of the differential abundance results, ASVs that are members of Intestinibacter and Romboutsia are most prevalent in the neutral and decreased HLA-risk groups. This provides further evidence that protective HLA haplotypes may be important for shaping the composition of bacteria in the human gut. Interestingly, an ASV belonging to the Bifidobacterium genus was prevalent among all risk groups. However, another Bifidobacterium ASV was prevalent just among high and increased risk subjects, while yet another was unique to just those at neutral risk. This highlights the importance of pairing ASV-level sequence resolution with prevalence, since the distinction among these unique sequences would be overshadowed by rare sequences and overlooked at the taxonomic rather than the single nucleotide variant level.

Fig. 4 Highly prevalent ASVs are either shared or unique to HLA groups. Hierarchy organization map displaying those taxa that are present in at least 45% of subjects in one or more risk groups. Taxa indicated with an asterisk (*) are represented by more than one ASV which are considered unique. Furthermore, those taxa indicated with a cross (+) represent bacteria shared at the taxonomic level but are actually unique to one or a combination of groups at the ASV level Full size image

Geographical clustering identifies bacterial hotspots

Geography explains a significant amount of variation between subjects’ gut microbiomes in ABIS (PERMANOVA: R2 = 0.03651; F Model = 1.3503; p-value = 0.001). Differences in geography can include underlying covariates such as diet and other lifestyle factors, and the effects of these factors can be exacerbated with increasing geographical distance23. Although the ABIS cohort benefits from geography confined to southeast Sweden, the significant impact that geographical factors have on subjects’ gut microbiomes is still evident through PERMANOVA. To limit this impact, towns were divided into three distinct clusters: the northeastern region (Linköping and Norrköping), the central region (Jönköping, Nässjö, Gislaved and Värnamo), and the southern region (Kalmar and Karlskrona). The towns within these regions were chosen for clustering based on their geographical proximity and because they provide the largest number of samples in the dataset, including many of those from high-risk subjects.

This geographical clustering, paired with differential abundance and prevalence analysis, allowed us to determine at which sites associations within the entire dataset were likely to have originated. LEfSe was again able to identify taxa associated with genetic risk at each cluster and overall among the clusters (Fig. 5). The LEfSe results show that among all clusters, Intestinibacter remains associated with neutral risk. Furthermore, this association was identified in the northeastern region while Romboutsia was associated with decreased risk in this region. In the southern region, Romboutsia was associated with neutral risk while Intestinibacter was associated with decreased risk. DESeq2 results also confirm that ASVs belonging to Intestinibacter and Romboutsia were associated with neutral and decreased HLA groups in the northeastern and southern regions.

Fig. 5 LEfSe identifies sources of ASVs by geographical region. Linear discriminant analysis Effect Size (LEfSe) is able to identify taxa as biomarkers associated with genetic risk groups in the three geographical regions and among all three regions. Northeastern region n = 140 individual stool samples, Southern region n = 65 individual stool samples, Central region n = 160 individual stool samples. Map data © 2019 Google, SIO, NOAA, U.S. Navy, NGA, GEBCO. Image © 2019 Landsat/Copernicus. Source data are provided in the source data file Full size image

Prevalence filtering was also applied to each geographical region separately by the same prevalence level used for the entire dataset (45%). The prevalence filtered regions show distinct patterns of separation by HLA risk group through PCoA (Fig. 6). Variation among the most prevalent ASVs by risk group is most considerable in the Southern region, though overlap with these ASVs can be seen between the high/increased and decreased/neutral risk groups, respectively. In the Northeastern region, overlap between gut bacterial communities is more consistent across all risk groups. Meanwhile, in the Central region, a high degree of overlap exists between all risk groups aside from the highest risk, which are substantially separated, indicating that the variation in community overlap among the most prevalent bacteria in this risk group is strong, regardless of breastfeeding status (Fig. 6d). In other words, each genetic risk group comprises a distinct combination of ASVs that are present in most of the participants at each risk level. This would suggest that the ASVs forming these highly prevalent clusters are those most influenced by host HLA genotype, posing a potential mechanism of selection toward those bacteria in the gut.

Fig. 6 Prevalent ASV clusters vary by HLA genotype across geography. a–d PCoA of inter-subject binomial distances for each of the three geographical clusters after applying 45% prevalence filter: a the Northeastern, b the Central, and c the Southern region. d Central region 45% prevalence after removal of samples from participants still breastfed at the time of sample collection. Ellipses are calculated based on a 95% confidence for each group Full size image

Bifidobacteria viable after storage at −80 °C for 19 years

Sample integrity is an important consideration when working with samples that have been stored for decades. Here, we assessed the viability of −80 °C frozen stool samples through isolation of non-spore forming, facultative anaerobic strains of Bifidobacterium. Members of the Bifidobacterium genus are considered relatively sensitive to long-term storage conditions, since they are non-spore forming and are sensitive to oxygen exposure. Nevertheless, strains of two Bifidobacterium species, B. breve and B. longum, were isolated from a stool sample stored at −80 °C for 19 years using media selective for the genus.