Overview of infant cohort

To characterize the development of the neonatal gut and respiratory tract microbiota, we collected rectal, nasal, and throat swabs from 82 pre- and full-term infants over the first year of life (Table 1). From the 38 pre-term infants, weekly samples were collected while hospitalized in the neonatal intensive care unit from birth until discharge, and monthly samples were collected from discharge through one year of gestationally corrected age. From the 44 full-term infants in the cohort, monthly samples were collected through the first year of life, starting at birth. Samples collected during monthly visits in which evidence of acute respiratory illness was observed were excluded from this analysis (as described in the “Methods” section). The nasal swabs were selected as the primary measure of the respiratory microbiota, with throat swabs as a secondary supplemental indicator. Because the focus of this study is on microbiota development across multiple body sites, we assayed throat samples from an unbiased random subset of 40 matched subjects distributed evenly between the pre- and full-term cohorts. Microbiota from 1079 gut, 1013 nasal, and 538 throat samples were characterized by 16S rRNA amplicon sequencing.

Microbiome community state types (CSTs) summarize archetypal states and developmental narrative

The microbiota community composition of the rectal, nasal, and throat samples was quantified by 16S rRNA amplicon sequencing. To develop tractable summary representations of typical compositional profiles, samples from each body site were independently clustered into community state types (CSTs) using Dirichlet-Multinomial mixture (DMM) models [23]. The DMM model sought to explain the operational taxonomic unit (OTU) compositional vector as a sample from a mixture of different canonical Dirichlet components. For each sample, the DMM model posterior probabilities indicated which Dirichlet component the observed vector of OTU counts most likely represented. On the basis of these probabilities, samples were assigned to clusters corresponding to CSTs which collapse the variation in microbiota composition into commonly observed archetypal states that serve as summary representations of the microbiota composition at each site (Additional file 1: Figure S1). The CST of a particular sample indicates the approximate abundance of characteristic taxa and the presence of distinct motifs in the overall community composition profile (Fig. 1 and Additional file 2: Figure S2). A series of community states observed longitudinally within a given infant therefore summarizes the microbial community development in that infant over the course of the study. Additionally, occurrence pattern of CSTs can be associated with a variety of covariates, reflecting properties of the overall phenotype of microbiota development. These attributes of the CST-based approach provide an attractive framework for summarizing and characterizing infant microbiota development and for interrogating and elucidating the relationships between phenotypes of development and components of host maturity. A robust resampling procedure (see the “Methods” section) identified 6 CSTs within the gut, 7 within the nose, and 6 within the throat (Additional file 3: Figure S3), with each CST distinguished by the variance (Additional file 2: Figure S2) and relative abundance of specific OTUs (Fig. 1). The number assigned to each CST indicates the overall frequency of occurrence at each respective site, with CST 1 being the most frequent. Based on OTU abundance and the sequence of progression of CSTs over time observed in each subject (Fig. 2), we concluded that the CSTs consistently exhibited three properties: (1) they had highly dissimilar composition between different body sites, (2) they were associated with post-menstrual age (PMA), gestational age (GA) at birth, and/or week of life (WOL), and (3) they demonstrated patterns of co-occurrence such that the observation of a specific CST at a given body site was predictive of contemporaneous CSTs at other body sites.

Fig. 1 Composition of community state types (CSTs) of the nose (NAS), gut (GUT), and throat (THR). Average composition of each CST was identified by Dirichlet-Multinomial mixture (DMM) model-based clustering. Samples are grouped by the Dirichlet component that they represent, with each component corresponding to a CST, and the average composition of all samples in each CST group is represented. The CSTs in each site are ordered based on their occurrence over time (e.g., CST 2 is the earliest gut CST). The height of each bar is equal, indicating that all total abundances are normalized to a constant sum. The number of samples in each CST is included at the top of each bar. Within each bar, different colored bands correspond to different taxa, and the height of a given band is proportional to the average relative abundance of the corresponding taxon in the given CST. The top ten most abundant taxa within each body site are identified, with the closed circle flanking each taxa name positioned in the corresponding taxa in each bar. The composition of all samples is listed in Additional file 4: Table S1 Full size image

Fig. 2 Sequence index plots indicate the progression of community state types (CSTs) over time for each subject. Subjects are stratified along the y-axis and sorted in descending order by gestational age at birth. Post-menstrual age (PMA) in weeks is indicated along the x-axis. The period of sampling for each individual is colored, with colors indicating the observed CST in a given time period. The time point of each observation is rounded down to the week in which the sample was taken, and the surrounding period of time is colored according to the CST of the sample, with color changes occurring at the midpoint between consecutive samples in which different CSTs were observed. For each subject, the black region on the left indicates the period prior to birth and the white region on the right indicates the period after the last sample was taken. In all three body sites, strong temporal structure and ordered patterns of CST progression are evident. For example, CSTs 1, 2, and 6 are overrepresented during the period prior to 40 weeks PMA in the nose, gut, and throat, respectively Full size image

Microbiota composition of community state types and occurrence through developmental time

The CSTs at all three body sites displayed a range of diversity and abundance of specific OTUs (Additional file 2: Figure S2, Additional file 4: Table S1). The six throat CSTs were the least diverse, with types 1 through 5 dominated by Streptococcus (62–85%) and type 6 by Staphylococcus (41%). The seven nasal CSTs were substantially more diverse, with Streptococcus and Corynebacterium ranging from 5% to greater than 50% abundance across all CSTs. The six CSTs of the gut were the most diverse of all three body sites and were consistently populated with Enterobacteriaceae, Veillonella, Ruminococcus, Streptococcus, Prevotella, Bacteroides, and Bifidobacterium at mean relative abundances greater than 1%. Patterns of temporal CST progression are shared across individuals, with the typical order of gut progression as CST 2,1,3,6,4,5; nasal progression as CST 1,7,4,2,3,6,5; and throat progression as CST 6,1,2,5,4,3 (Fig. 1). The majority of infants manifest most CSTs in their first year, in a similar sequence and at similar ages (Fig. 2, Additional file 5: Figure S4). While no single CST is observed in all 82 infants, common patterns of sequential CST occurrence at each body site reveal canonical temporal orderings and a tendency towards monotonic CST progression over time. Furthermore, CSTs occurring later in this temporal progression comprised relatively more diverse microbial communities within each body site. Pre- and full-term infants are initially colonized by distinct CSTs, with CSTs 1, 2, and 6 overrepresented prior to 40 weeks PMA in the nose, gut, and throat, respectively. Temporal progression after 50 weeks PMA consists of similar CSTs in both pre- and full-term infants as they mature beyond 90 weeks PMA. Notably, individual infants matched by PMA transition through CSTs at different rates, which suggests additional factors influence microbiota progression.

In all three body sites, the CSTs most frequently observed at the earliest PMA in both pre-term and full-term infants contained high levels of Staphylococcus, which decreased as the infants matured beyond 39 weeks PMA. This initial transient colonization of infant gut and respiratory microbiota by Staphylococcus has been described in other studies and suggests that this bacterium confers early developmental metabolic and immune benefits to the host [24, 25]. Continued temporal developmental progression likely reflects adaptation of the microbiota to corresponding functional changes in the infant host, including the transition of the gut microbiota from an aerobic community in early PMA to an anaerobic community with increasing abundance of Ruminococcus, Prevotella, Bacteroides, and other anaerobes beyond PMA week 46. The throat and nasal sites maintain an aerobic microbiota throughout early life, dominated after week 40 PMA by Streptococcus, with emergence and stable colonization of Corynebacterium, Alloiococcus, Moraxella, and Veillonella in the nasal sites and Veillonella and Neisseria in the throat. Although Moraxella is an opportunistic respiratory pathogen, observations from our asymptomatic infant cohort are similar to previous studies which demonstrate stable colonization of Moraxella and Alloiococcus in the nasopharynx of healthy infants [26].

Predicted functions of community state types

To identify microbiota functions over the first year of life, the functional potential of all rectal, nasal, and throat CSTs was predicted using PICRUSt [27], which infers the functional metagenome of microbial communities based on marker gene data and reference bacterial genomes. The top eight KEGG pathways with positive or negative enrichment in each CST (p value < 0.001, FDR < 10%, and linear discriminate analysis (LDA) score > 3.0) were used for evaluation of functional differences within all CSTs (Additional file 6: Figure S5). The initial CSTs in all three sites (nasal CSTs 1 and 7, throat CST 6, and gut CST 2), clustered into a functionally distinct group enriched in multiple pathways, including metabolism of lipids, purines, and pyrimidines. Synthesis of nucleotides from purines and pyrimidines in early neonates supports chromosomal replication and active microbiota growth essential for early colonization of epithelial surfaces [28, 29]. Energy metabolism in early (nasal CST 7, throat CST 6, and gut CST 2) and late (nasal CST 6, gut CST 1, and gut CST 5) CSTs was driven by the pentose phosphate pathway (PPP) with the production of NADPH used for anabolic reactions needed for synthesis of cellular molecules. As the infant matured and gained exposure to additional dietary substrates, there was an enrichment of genes for carbohydrate uptake (PTS), central carbon metabolism (glycolysis and pyruvate synthesis), and the TCA cycle with increased production of ATP as an energy source for microbiota in gut CSTs 1,3,4,5; nasal CSTS 2,3,6; and throat CST 3,4,5. The enrichment of two-component signal transduction (TCS) pathways following temporal progression from the cluster of initial CSTs (nasal CSTs 1 and 7, throat CST 6, and gut CST 2) to all subsequent CSTs suggested increases in communication within the microbiota community and between microbiota and host as a result of changes in the immediate host environment and availability of metabolites [30]. We next compared enriched functions in the initial (gut CST 2 and nasal CST 1) and later (gut CSTs 3,4,6 and nasal CSTs 2,3,6) nasal and gut CSTs to identify potential microbiota functions that distinguish these sites. Aside from a limited number of distinct pathways in gut CST 1, such as bisphenol degradation, the rectal and nasal microbiota share many functions at this level of pathway resolution. Taken together, our results suggest that similar to the longitudinal progression of CSTs, specific functional pathways emerge in the initial early life CSTs, with expansion and diversification of microbial communities in later CSTs occurring as a result of contact with environmental sources and adaptation to changes in energy substrates.

Correlations between community state type and PMA in pre- and full-term infants

In order to elucidate the relationship between the temporal components of host maturity and the progression of community types at each body site, we further examined the associations between CSTs and maturity (GA) at birth and age, which can be measured developmentally as PMA or postnatally with WOL. We first fit smoothed curves of the probability of being in a given CST against WOL and GA at birth (Fig. 3) and the probability of being in a given CST against PMA and GA at birth (Additional file 7: Figure S6), stratifying subjects by GA at birth in both cases. This revealed three distinct canonical temporal patterns of CST occurrence (see the “Methods” section). The first pattern, chronological, observed in 6/19 of the CSTs identified, was characterized by occurrence at consistent post-natal intervals and frequencies which showed no substantive difference over the first year of life between pre- and full-term infants (e.g., throat CST 3 and gut CST 1), indicating that the week of life, a proxy for development shaped by environmental exposure, drives the occurrence of these community types. For these CSTs, the occurrence probability curves in Fig. 3 for each birth stratum overlap. By contrast, the second observed occurrence pattern was idiosyncratic to maturity at birth; whereby CSTs occurred at characteristic post-natal intervals, but their frequency of occurrence was dependent on gestational age at birth and they were significantly over- or under-represented in pre-term subjects (e.g., nasal CSTs 2 and 5). In Fig. 3, these CSTs’ curve for one maturity stratum reaches a distinctly higher peak than at least one other maturity stratum. Lastly, a convergent pattern of occurrence was observed in the 9/19 of CSTs. These CSTs showed increased probabilities of occurrence at earlier post-natal ages in full-term infants, but their occurrence probabilities in pre-term infants reached parity after a post-natal interval approximately equal to their prematurity (e.g., gut CST3, throat CST1, and nasal CST4). Infants with equal CST probability tend to have similar post-menstrual ages. The convergent pattern of occurrence implicates PMA, a proxy for host innate developmental maturity, as a driving force. The curves for the most and least mature strata for these CSTs in Fig. 3 have peaks of similar height that are offset along the week of life axis.

Fig. 3 Associations between community state type membership and time. The posterior probability of membership to each CST (y-axis) is plotted over weeks of life (x-axis), estimated as a non-parametric function of week of life and gestational age at birth. The CSTs are sorted by post-menstrual age at which they achieve a maximal probability of occurrence Full size image

We then constructed a single index model of age for each CST, which fit the probability of observing the CST as a function of a pseudo time index that is learned by the model. This pseudo time is a weighted average of GA at birth and WOL (Additional file 8: Figure S7). The GA at birth and WOL weights that form the pseudo time index quantify the tradeoff between time spent pre- and post-natally, with respect to the probability of manifesting a given CST. These models confirmed the trends described above, with three basic patterns being observed (see the “Methods” section).

Associations between community types and composition across body sites

Having established a summary narrative of typical microbiota development within each body site and identified host age and developmental maturity as primary driving factors, we sought to explore the possibility of significant associations between microbiota across body sites. Given that time is a common factor driving development host-wide, our expectations were confirmed in that the co-occurrence patterns of CSTs across body sites were significantly non-independent, as assessed by a chi-squared test (p value < 0.001). To further characterize these associations, we calculated the pairwise correlations between CSTs observed at each site (Fig. 4). Again, co-occurrence patterns between sites were highly significant, suggesting that the observation of a given CST at one body site is predictive of the concurrent CSTs at other body sites. The greatest degree of CST correlation between body sites was observed among nasal CST 1, gut CST 2, and throat CSTs 1 and 6, for which all cross-body site pairs were positively correlated.

Fig. 4 Pairwise correlations between community state types (CSTs) at different body sites. CSTs on the x- and y-axes are identified by body site nasal (NAS), throat (THR), gut (REC) and type. Each cell represents the Pearson sample correlation of CST membership probability across body sites in the same individual. Red-hued cells correspond to positive correlation coefficients. CST co-occurrence is non-independent with the CST of one body site highly predictive of the CSTs of the other two body sites Full size image

Given the strong associations at all body sites between CST occurrence and infant developmental and chronological age, correlations across body sites were expected. We sought to isolate these correlations resulting from host-wide temporal influences and assess remaining associations between community composition across sites. In order to control for time and other confounding factors and identify potential associations arising from direct or indirect interactions across sites, we further assessed the associations between the CST of each body site and the microbiota composition of the other body sites using a series of linear regression models. As predictors for the abundance of each taxon in a given body site, we first used mode of delivery, GA at birth, birth season, and day of life (which was modeled with a natural spline to allow for non-linear effects), as well as a per-subject random effect to account for repeated sampling of the same individuals. After fitting these “null” models, which contained no information about the composition of other body sites, we then added additional predictive terms for the CSTs of the other body sites and refit the “full” models. Because we sought to identify the relationships across body sites that could not be explained by infant maturity alone, we called significant only those associations between taxon abundance and remote CST for which inclusion of the remote CSTs as terms in the model significantly improved its explanatory power, after multiple test corrections (FDR ≤ 0.05, see the “Methods” section). We identified significant associations across all pairs of body sites (Additional file 9: Table S2), with the most significant associations identified between CSTs of the nose and gut and taxa in the throat. Fewer associations were significant between the gut and nose. Within each body site, certain taxa were uniquely associated with the CST of only one of the other body sites, while other taxa exhibited significant associations with the CST of both of the other two body sites. Of the additional variables included in the models, only those related to maturity were highly significant, with the day of life splines having the largest number of significant associations, followed by gestational age at birth. Mode of delivery had only four significant associations, all with taxa in the gut, while birth season had no significant associations (see Additional file 10: Figure S8 and Additional file 11: Table S3).

A number of taxa had significant associations with CSTs of other body sites at which the taxa themselves were not observed, ruling out direct exchange of these bacteria as the sole explanation for the associations. These taxa included Bacteroides ovatus, Clostridium perfringens, Actinobaculum, and Faecalibacterium sp. (Additional file 12: Table S4). Instead, these associations were consistent with the presence of bacteria in one site impacting, or being impacted by, development of microbiota at another site through indirect physiological or metabolic mechanisms. In order to assess cases where taxa were present in both associated sites, including Viellonella, Prevotella, and Dorea, we tested the OTU residuals for correlation after adjusting for PMA with a spline and each subject with a random effect. There were approximately fifty shared OTUs between each pair of sites, which on average were positively correlated for each site pair. The strongest correlation between shared OTUs was between the nose and throat, followed by the throat and gut, followed by the nose and gut (Additional file 13: Figure S9).

The associations between each set of CSTs and specific taxa (Additional file 9: Table S2) was visualized in two ways. First, as a bipartite graph (Fig. 5a–c) in which each site’s most taxonomically specific significant taxa were connected to the CSTs of the distal body sites to which they had significant associations at a FDR of 5%. Edge color indicates the direction and significance of the association, either as a decrease or increase in abundance when the associated remote CST is observed. Second, as a volcano plot (Fig. 5d), this indicates the significance (F test p value) and the magnitude of the increase in explanatory power (change in R2) when the CSTs of distal body sites are added to the regression models that include as covariates mode of delivery, GA at birth, birth season, day of life (as a natural spline), and a per subject random effect, and taxon abundance as the outcome. We identified 105 taxa with significant cross-body site associations: 34 in the gut, 34 in the nose, and 37 in the throat, with some taxa being significant in multiple body sites where they occurred. In the gut, 5 taxa were significantly associated with CSTs in both the throat and nose, 12 taxa in the nose were associated with both gut and throat CSTs, and 23 taxa in the throat were associated with both nose and gut CSTs. Among taxa present in the gut, the largest numbers of associations were identified with nasal CST 1 and throat CST 2, with 22 and 7 taxa respectively, and the single most significant association was between Bacteroides ovatus and throat CST 2 (Fig. 5b, d). Notably, B. ovatus was not identified in throat samples but was present in 1% of nasal samples at a low (< 1%) abundance. Among taxa present in the nose, the largest numbers of associations were identified with gut CST 1 and throat CST 2, with 12 and 22 taxa respectively, and the single most significant association was between an OTU of Prevotella and throat CST 2 (Fig. 5a, d). Among taxa present in the throat, the largest numbers of associations were identified with nasal CST 1 and gut CST 1, with 12 and 26 taxa respectively, and the single most significant association was between Prevotella pallens and nasal CST 6 (Fig. 5c, d). In the gut, an OTU of Dorea exhibited the most significant associations, with six CSTs from the nose and throat found to be significant. In the nose, an OTU of Veillonella had the most associations, with seven CSTs from the gut and the throat. In the throat, the Lachnospiraceae family and an OTU of Veillonella had the most associations, each with six CSTs from the nose and gut.

Fig. 5 Significant associations between taxa abundance and community state type (CST) across body sites. A bipartite graph was used to visualize the associations between CSTs and taxa at a distal body site (nasal (a), gut (b),throat (c)), with significant associations at a false discovery rate (FDR) of 5%. Edges indicate significant associations with color marking the direction and significance of the effect (red: increase in abundance, blue: decrease in abundance). Color shade corresponds to the level of significance, with lighter colors being less significant. Nodes are positioned using a force-directed layout, which places taxa or CSTs with similar patterns of significant associations near each other while attempting to optimize readability and limit overlap. d Relationships between taxa abundance and CSTs was also visualized using a volcano plot, with improvement in explanatory power (R2) conferred by the inclusion of CSTs in the model on the x-axis and − log10 p values of the model improvement on the y-axis. With individual taxa in each body site as the outcome (subplots GUT, NAS, and THR), a linear regression model was fit using the with and without CSTs of the other body sites as covariates, controlling for gestational age at birth, day of life, mode of delivery, birth season, and subject-level random effects. Full models (including all CSTs) were tested against null models (excluding the CSTs of the other body sites in turn) with a series of F tests Full size image

The microbiome is canonically correlated across body sites following time and space

These observations prompted us to assess the extent to which the OTU composition of each body site over time can be explained as a function of the OTU composition of the other body sites, without the dimension reduction associated with using CSTs. We again paired microbiome samples from different body sites that were acquired at the same visit for each participant, resulting in nasal-gut, nasal-throat, and rectal-throat site pairs. We then assessed the correlation of taxa between body sites directly using canonical correlation analysis (CCA) [31], which transforms two sets of multivariate observations into a series of canonical correlates (weighted averages) that maximize the score cross-correlations, quantifying the extent to which two sets of multivariate observations are correlated. Using a cross-validation scheme that held out blocks of individuals, we found in validation data that the subspace cross-correlations varied from 0.75 (nasal-throat) to 0.6 (gut-throat), for the first canonical coordinate (CC) (Additional file 14: Figure S10). Since we previously established that many OTUs vary as a function of time, we anticipated that temporal variation was responsible for much of this correlation. As expected, after adjusting for time by regressing out PMA in each site with a 14 degree-of-freedom spline, the time-stabilized correlations were attenuated in all site pairs, but still significantly different from zero in the nasal-gut and nasal-throat pairs (Additional file 14: Figure S10).