Identification of a core iCD gene expression signature. The Crohn’s and Colitis Foundation of America–sponsored RISK study is a prospective inception cohort study, which enrolled 1,276 pediatric patients with IBD at diagnosis at 28 sites in North America between 2008 and 2012 (Supplemental Figure 1 and Supplemental Excel file 1; supplemental material available online with this article; doi:10.1172/JCI75436DS1). All patients were treatment naive, with ileal biopsies obtained during the initial diagnostic colonoscopy. Only subjects with a confirmed diagnosis of CD, UC, or non-IBD Ctl during follow-up were included (Table 1 and Supplemental Excel file 2). Our initial analyses identified a core iCD gene expression signature composed of 1,281 genes that were differentially expressed in the ilea of 2 independent iCD groups compared with Ctl (Figure 1B, Supplemental Figure 2A, and Supplemental Excel file 3). We conducted functional annotation enrichment analyses using several data sources, including gene ontology to map groups of related genes within the core iCD gene signature to upstream regulators, immune cell types, pathways, phenotypes, and biologic functions (Supplemental Figure 2, B and C, and Supplemental Excel files 4–6). These analyses provided a way to identify which known transcriptional regulators, biologic processes, and immune cell types were likely to be upregulated or downregulated within the ileum based on a given gene expression pattern. Analyses were conducted using ToppGene (20), ToppCluster (21), and Ingenuity Pathway Analysis software. The relative degree of upregulation or downregulation of a given transcriptional regulator or biologic process was provided by the activation z score obtained as an output from Ingenuity Pathway Analysis software (Supplemental Figure 2B), while P values for the specific cell type, pathway, phenotype, and biologic function associated with upregulated or downregulated genes were obtained as an output from ToppGene (Supplemental Excel files 4–6). Ileal enrichment for a given immune cell class was illustrated as shown in Supplemental Figure 2C by colored bars on the x axis, with the significance for each individual cell type within the class shown as the –log 10 (P value) on the y axis.

Table 1 RISK RNA-seq cohort clinical and demographic characteristics

The core iCD signature was enriched for genes induced by bacterial products and proinflammatory cytokine signaling, including the Th1 cytokine IFN-γ, whereas genes induced by several nuclear receptors, including HNF4α, were suppressed (Supplemental Figure 2B). Functional analyses (20) identified enrichment of innate antimicrobial responses and an unexpected profound loss of nuclear receptor-dependent lipid metabolic functions (Supplemental Figure 2B and Supplemental Excel files 4 and 5). Immune cell enrichment analysis was most significant for genes expressed by granulocytes, myeloid dendritic cells, macrophage/monocytes, and lymphoid stromal cells (Supplemental Figure 2C and Supplemental Excel file 6). Within the top 5 most upregulated genes, we noted the epithelial antimicrobial dual oxidase DUOX2 and its maturation factor DUOXA2. Within the downregulated genes, we noted the antiinflammatory HNF4α-dependent lipoprotein APOA1 as the top downregulated gene (Tables 2 and 3).

Table 2 Top upregulated differentially expressed genes within the core iCD gene signature

Table 3 Top downregulated differentially expressed genes within the core iCD gene signature

The core iCD signature contains APOA1 and DUOX2 gene coexpression signatures. Strong experimental linkage to gut inflammation (22–25), together with the high fold-change expression differences, prompted us to specifically examine the pattern of APOA1 and DUOX2 gene expression in the following clinically defined subgroups: Ctl, UC, cCD, and iCD. Patients in the cCD subgroup met diagnostic criteria for CD but lacked visible ileal inflammation on endoscopy; patients with UC also had disease limited to the colon. The cCD subgroup was further divided into those with normal ileal histopathology and those with microscopic ileal inflammation. The iCD subgroup was further divided into 2 groups based upon the presence of deep ulcers, iCD with deep ulcers [iCD-DU], and iCD without deep ulcers [iCD-noDU]) (26). Figure 1C shows progressive increased expression of DUOX2 in the ileum across the spectrum of Ctl, UC, cCD, and iCD, while suppression of APOA1 expression was specific to all forms of CD. Importantly, the 2 cCD groups showed a similar pattern of gene expression as the 2 iCD groups; this pattern was different from UC and Ctl (Figure 1C). We next performed Pearson correlation analysis to define other genes within the core 1,281 iCD signature with similar patterns of gene expression as DUOX2 or APOA1 across the predefined patient groups that therefore had high likelihood for coregulation and shared biologic function. Pearson coexpression correlation analysis (0.98<|r|<1) of the core iCD genes with either APOA1 or DUOX2 ileal expression identified 435 genes and 222 genes, respectively (Figure 1C, Supplemental Figures 3 and 4, and Supplemental Excel files 7 and 8). The heat map with specific indicated upregulated and downregulated genes further shows the expression pattern within the 2 gene coexpression signatures across the indicated patient groups (Figure 1C). Genes within the DUOX2 gene coexpression signature showed increasing or decreasing signal intensity across the spectrum of patient groups from Ctl to iCD-DU (Figure 1C), while genes within the APOA1 gene coexpression signature showed alterations that were specific to all forms of CD. Interestingly, the mature enterocyte digestive enzyme lactase (LCT) and members of the mucin family (e.g., MUC4) were included in the DUOX2 gene coexpression signature (Figure 1C and Supplemental Figures 3 and 4). By comparison, induction of the Th1 cytokine IFNG and its downstream effector chemokine (CXCL9) was included within the CD-specific APOA1 gene coexpression signature (Figure 1C and Supplemental Figures 3 and 4). Importantly, genes within the 2 gene coexpression signatures showed similar expression patterns between cCD with normal or abnormal histology and iCD-noDU and therefore were independent of overt clinical inflammation (Figure 1C).

Immunohistochemistry confirmed the expected apical localization of DUOX2 within villous enterocytes in the CD ileum, in conjunction with suppression of epithelial APOA1 and evidence of increased lipid peroxidation (Figure 2A). We next performed functional annotation enrichment analyses of the DUOX2 and APOA1 gene coexpression signatures using Ingenuity Pathway Analysis, ToppCluster (21), and Cytoscape (ref. 27 and Figure 2B). Functional annotation enrichment analyses for the DUOX2 gene coexpression signature essentially captured the key features of the NF-κB–dependent upregulated innate antimicrobial response contained within the core iCD signature and reduction of metabolic processes regulated by the glucocorticoid nuclear receptor NR3C1 (Supplemental Excel file 9 and Supplemental Figure 3C). Functional annotation enrichment analyses of the APOA1 gene coexpression signature revealed decreased expression of genes regulated by HNF4α, PPARγ, and several other nuclear receptors, with associated predicted decreases in several lipid metabolic and antioxidant functions (Supplemental Excel file 9 and Supplemental Figure 4C). Importantly, these changes occurred in association with upregulation of IFN-γ/STAT1-dependent genes involved in Th1 polarization. Collectively, these results defined 2 core DUOX2 and APOA1 gene coexpression signatures within the CD ileum, regulating both enterocyte and innate and adaptive immune functions.

Figure 2 Functional annotation enrichment analysis of the APOA1 and DUOX2 gene coexpression signatures. (A) Representative ileal DUOX2, APOA1, and 4HNE immunohistochemistry (original magnification, ×40) for Ctl (n = 3) and iCD (n = 7). Arrows indicate apical DUOX2 and intracellular APOA1 enterocyte staining and lamina propria 4HNE staining for lipid peroxidation. (B) Functional annotation enrichment analyses of the DUOX2 and APOA1 gene coexpression signatures using ToppCluster (21) and Cytoscape (27). Upregulated transcription factors and biologic functions are shown as red circles and orange boxes, respectively, while downregulated transcription factors and biologic functions are shown as blue circles and boxes, respectively.

The APOA1 coexpression genes are enriched within a CD-specific signature that is independent of clinical inflammation and may be used for patient classification. Remarkably, the majority of the 1,281 genes that comprised the core iCD gene signature were also differentially expressed in the ilea of patients with cCD compared with Ctls (1,055 of 1,281 genes, 82%) as opposed to only 18% (232 of 1,281) that were differentially expressed in the ilea of patients with UC compared with Ctls (Supplemental Figure 5A). This included upregulation of DUOX2 and downregulation of APOA1 in the cCD ileum. We then asked whether the APOA1 gene coexpression signature would be enriched within the cCD ileum and whether this genomic information could be used for patient classification. Indeed, a large portion (572 of 614; 93%) of the genes from the combined APOA1 and DUOX2 gene coexpression signatures were also differentially expressed between cCD and Ctl (Supplemental Excel file 10 and Supplemental Figure 5B). In contrast, only 15% (89 of 614) of the genes from the combined APOA1 and DUOX2 gene coexpression signatures, primarily from the DUOX2 signature, were differentially expressed between UC and Ctl (Supplemental Excel file 11 and Supplemental Figure 5B). Unsupervised hierarchical clustering analysis identified groups of biopsies with similar ileal gene expression profiles; this analysis tested whether the cCD ileal transcriptional profile for these 2 gene coexpression signatures would cluster with the iCD transcriptional profile, whereas the UC transcriptional profile would cluster with the non-IBD Ctl transcriptional profile. Results of this analysis showed that, for the APOA1 gene coexpression signature, most cCD ileal biopsies clustered together with iCD ileal biopsies, while most UC ileal biopsies clustered with non-IBD Ctls (Supplemental Figure 4, χ2 = 7.7, P = 0.005). A similar clustering was not observed when using the DUOX2 gene coexpression signature (Supplemental Figure 3, χ2 = 1.1, P = 0.3). Collectively, these data demonstrate that altered transcriptional profiles in CD were observed even in the histologically normal cCD ileum.

Arriving at the correct diagnosis in patients with IBD with colon-only involvement (i.e., differentiating cCD from UC) can be challenging with current clinical tools. Hypothesizing that diagnosis could be enhanced by genomic information, in addition to our training cCD and UC groups (cCD1 and UC1), we used an independent validation (cCD2 and UC2) cohort and tested classification of these groups using both unsupervised and supervised approaches. To further refine a CD-specific signature in the ileum, we identified differentially expressed genes (179 with fold change ≥ 2 and 93 with fold change ≥ 2.5) between the cCD and UC training cohorts (cCD1 and UC1), which were also contained in the DUOX2 or APOA1 gene coexpression signatures, whereby 82% (147 of 179) were from the APOA1 gene coexpression signature (Figure 3A and Supplemental Excel file 12). Unsupervised hierarchical clustering analysis to test for similarities in ileal gene expression between groups showed that both the training and validation cCD cohorts clustered with iCD, irrespective of microscopic inflammation, in contrast to the UC training and validation cohorts, which clustered with Ctl (Figure 3B and Supplemental Figure 5C, χ2 = 11.8, P = 0.0006 for training cohort).

Figure 3 The ileal APOA1 gene coexpression signature is specific to all forms of CD. (A) The Venn diagram shows the overlap of 179 genes differentially expressed 2-fold between patients with cCD and UC and genes within the DUOX2 and APOA1 coexpression modules. (B) A heat map of 93 of the above-mentioned 179 genes after averaging and hierarchical clustering of gene expression, with a fold change of 2.5 between the cCD and UC training cohorts for the indicated clinical subgroups. The training cohort included 17 cCD with abnormal histological features, 17 cCD with normal histology, and 45 UC. The independent validation cohort included 14 cCD with abnormal histological features, 11 cCD with normal histology, and 28 UC. The blue arrow indicates APOA1, and the red arrow indicates DUOX2. (C) The results of functional annotation enrichment analyses (IPA, Ingenuity Systems) using the DUOX2 and APOA1 gene coexpression signatures. Upregulated transcription factors and target genes are shown in orange and red, respectively, while downregulated transcription factors and target genes are shown in blue and green, respectively. Biologic functions associated with these groups of genes are also shown.

Supervised classification algorithms may be used to develop models to classify unknown patient groups (in this case to classify cCD vs. UC) based upon gene expression data that differ within a training set of known patient groups. The Support Vector Machine algorithm is a machine learning approach, which is commonly used to classify 2 patient classes based upon differential expression of biologic data and, in fact, was used recently to distinguish cCD from UC with 77% accuracy by cross-validation in a model based upon differential colon protein abundance (28). We therefore conducted a supervised classification analysis using the cCD1 gene list versus UC1 gene list for the cCD1/UC1 training cohort to develop a classification model using the Support Vector Machine algorithm in Avadis and then tested the accuracy of the model on the independent validation cohort (26 cCD2 and 28 UC2). Applying the model to the independent validation cohort resulted in accurate classification of 41 of 54 patients (76% overall accuracy). Among 17 patients classified by the model as cCD, 15 carried a clinical diagnosis of cCD (88% accuracy). However, among 37 patients classified by the model as UC, only 26 carried a clinical diagnosis of UC (70% accuracy). Of the patients with cCD misclassified as UC, this included 4 with microscopic ileal histologic involvement, 6 with normal ileal histology, and 1 whose ileal histology was not recorded. This result demonstrated a reasonable level of accuracy in using the ileal gene expression data alone to classify the 2 colon-only forms of IBD.

We next performed functional annotation enrichment analyses for the 179 genes specifically altered in CD independent of ileal inflammation (cCD vs. UC) (Figure 3C). Genes from the DUOX2 gene coexpression signature (Supplemental Excel file 12) showed decreased CYP3A4-related NR3C1 glucocorticoid receptor signaling and an increase in NF-κB–dependent innate immune activation and tissue degradation genes. Of particular interest, within the APOA1 gene coexpression signature (Supplemental Excel file 12), a marked downregulation of several HNF4α and PPARγ nuclear receptor-dependent glutathione s-transferases (GSTA1) and apolipoproteins (APOA1/4, APOB, and APOC3) with antioxidant function and an upregulation of an IFN-γ/STAT1-dependent Th1 signature was again noted. Collectively these results define a core iCD gene coexpression signature and associated biologic functions that were largely independent of overt clinical inflammation (Figure 3C).

Upstream regulators and biologic functions associated with mucosal ulceration. Endoscopic severity in CD is defined by the presence of deep ulcers (Figure 4A), and FDA approval of new therapies requires evidence of mucosal healing (8, 29). However, the host expression signature and microbial composition associated with the pathogenesis of this critical clinical parameter has not been defined in a large cohort of treatment-naive patients. We identified 345 differentially expressed genes associated with deep ulcers within the core iCD signature (Figure 4A, Supplemental Figure 6, and Supplemental Excel file 13). Analysis performed using Ingenuity Pathway Analysis software identified enrichment of genes associated with increased bacterial products and proinflammatory cytokine signaling, reduced nuclear receptor signaling, and a broad range of immune responses, including myeloid cell and lymphocyte activation (Figure 4B). Several biologic functions, including reactive oxygen species and hydrogen peroxide production, angiogenesis, and connective tissue degradation, were notably enriched within the iCD-DU gene coexpression signature (Figure 4B). We tested for immune cell–type enrichment between iCD-DU and iCD-noDU by conducting functional annotation enrichment analyses of immune cell–type gene expression using the Immunological Genome Project data series as a reference through ToppGene (20). Ileal enrichment for a given immune cell class (e.g., granulocytes) is illustrated in Figure 4C by colored bars on the x axis, with the significance for each individual cell subtype within the class shown as the –log 10 (P value) on the y axis. The most significant signal within iCD-DU was for granulocytes, followed by myeloid dendritic cells and B cells, while iCD-noDU showed a different order of enrichment, with high representations of myeloid dendritic cells, macrophages, and lymphoid stromal cells, in the absence of the pronounced granulocyte signal (Figure 4C). These data define biologic pathways and associated immune cell types specifically associated with severe mucosal ulceration.

Figure 4 Gene expression signature and biologic pathways associated with mucosal ulceration. (A) Image of ileal deep ulcers and Venn diagram showing 345 genes from the core iCD gene signature, which overlap with differentially expressed genes (Audic Claverie method with Benjamini-Hochberg FDR correction [0.05], fold change ≥ 1.5), between iCD-DU and iCD-noDU. A heat map of averaged gene expression and that after hierarchical clustering for the iCD-DU gene list for the indicated clinical subgroups. Specific upregulated (red) and downregulated (blue) genes are listed. (B) IPA functional annotation enrichment analyses to detect upstream regulators and biologic functions associated with the mucosal ulceration (iCD-DU) gene expression signature are shown. The activation z score for biologic function enrichment (P value range: 6E-8 to 3E-26) and upstream regulator enrichment (P value range: 3E-10 to 1E-56) depicts the degree of activation or suppression of a given regulator or function. (C) Immune cell–type enrichment of upregulated genes for iCD-DU (238 of 345 genes) and upregulated genes for iCD-noDU (524 of 936 genes) was determined using the Immunological Genome Project data series as a reference through ToppGene (20). Ileal enrichment for a given immune cell class (e.g., granulocytes [GN]) is illustrated by colored bars on the x axis, with the significance for each individual cell subtype within the class shown as the –log 10 (P value) on the y axis. pDC, plasmacytoid DC; MF, macrophages; MO, monocytes.

A fundamental clinical problem, as well as a difficulty in trial design in CD, is the lack of correlation between clinical disease activity indices and the severity of mucosal injury as defined by deep ulcers (8, 29). Consistent with this observation, stratifying the 180 patients with iCD by the pediatric Crohn’s disease activity index (PCDAI) identified only 43 differentially expressed genes within the core iCD signature between those with mild symptoms (PCDAI ≤ 30) and those with moderate-severe symptoms (PCDAI > 30) at diagnosis. Therefore, there was no clear association between the modest number of ileal genes associated with more severe symptoms and the much more robust gene coexpression signature associated with tissue ulceration (Supplemental Excel files 13 and 14).

Shifts in the ileal microbial community are preserved in the unaffected CD ileum. Murine studies have established a definitive requirement for bacterial colonization in the development of mucosal inflammation in the genetically susceptible host (30, 31). Functional annotation enrichment analyses of the core iCD gene signature demonstrated enrichment of genes associated with myeloid cell responses (Supplemental Excel file 6, P < 1.60E-44 to 6.74E-69), response to lipopolysaccharides (Supplemental Excel file 4, P < 1.5E-30), altered susceptibility to infection (Supplemental Excel file 4, P < 3.1E-24), and response to biotic stimulus (Supplemental Excel file 4, P < 5.03E-38). Remarkably, a significant proportion of genes whose ileal expression is known to be regulated by bacterial colonization of mice were contained within the APOA1 and DUOX2 gene coexpression signatures (Supplemental Figure 7 and ref. 15). This group included upregulated key drivers of the innate (DUOX2 and DUOXA2) and adaptive (CXCL9 and CXCL10) immune responses both in iCD and following bacterial colonization of the mouse ileum. This finding suggested that the host gene coexpression signal in CD may be due, in part, to shifts in the ileal microbial community. We therefore next defined the ileal microbial community within the patient cohort and tested for association with genes from the DUOX2 and APOA1 gene coexpression signatures.

We first performed a univariate analysis (LDA Effect Size) to test for association between ileal microbial composition and disease state in 240 subjects with CD, 163 non-IBD Ctl, and 56 subjects with UC from the RISK cohort, with no recent exposure to antibiotics (Supplemental Excel file 15). The output of this univariate analysis is illustrated by a cladogram, which demonstrates the relationship between the different bacterial taxa and disease state (Figure 5A). These results are consistent with prior studies of patients with IBD with established disease (32) and the recently published ileal, rectal, and fecal microbial community characterization of the RISK cohort (6). We observed prominent increased abundance of the Firmicutes phyla within the non-IBD Ctls that was preserved in the UC ileum but not in CD ileum, along with a shift toward Fusobacteria, Gemellaceae, and Proteobacteria expansion in the CD ileum.

Figure 5 The ileal microbial community in patients with IBD and Ctls. (A) Univariate analysis (LDA Effect Size) was performed to test for association between ileal microbial composition and disease state in 240 CD (48 cCD, 178 iCD, and 14 CD with no indicated ileal involvement), 163 non-IBD Ctl, and 56 UC subjects from the RISK cohort that had not received antibiotics prior to endoscopy. The cladogram illustrates the output of this univariate analysis by demonstrating the relationship between the different bacterial taxa and disease state. Colored nodes from the center to the periphery represent marked phylum (p), class (c), order (o), family (f), genus (g), and species (s) differences detected between groups for Ctls (green), CD (red), and UC (blue). Only phylum, class, order, and family levels are indicated on the right side of the cladogram. (B) Fold change for each taxon was calculated by dividing the mean abundance in the cases (cCD [48 patients] or iCD [178 patients]) by that of the Ctls (154 patients) and is shown for microbiota with differential abundance in CD compared with Ctl by univariate analysis. The cCD group was further subdivided to cCD with abnormal histological features (25 patients) and cCD with normal histology (18 patients).

Microbial shifts that have previously been defined in the ileal mucosa of patients with adult-onset CD with longstanding disease could be secondary to adaption to the diseased micro-environment or a primary event independent of local inflammation (5, 32). To address this fundamental pathogenic question, we next uniquely characterized the ileal microbial community in patients with CD without clinical ileal involvement (cCD), compared with those with overt ileal inflammation (iCD), for taxa previously shown to be associated with treatment-naive pediatric CD (6). As shown in Figure 5B, a very similar microbial shift was observed in the ilea between patients with iCD and cCD, irrespective of histologic involvement in cCD. This included expansion of Veillonellaceae, Pasteurellaceae, Neisseriaceae, Gemellaceae, and Fusobacteriaceae and persistent suppression of Lachnospiraceae, Bifidobacteriaceae, Clostridiales, and Erysipelotrichaceae in all forms of CD, while expansion of Enterobacteriaceae was not apparent in patients with no histological or clinical inflammation. To examine the possibility that local nearby cecal inflammation contributed to the microbial shift identified in the ileum, we stratified patients with cCD based on the presence of macroscopic cecal inflammation reported during colonoscopy (Supplemental Figure 8). Over all, there was persistent reduction in Lachnospiraceae, Bifidobacteriaceae, Clostridiales, and Erysipelotrichaceae in all forms of CD, with expansion of Veillonellaceae, Pasteurellaceae, Neisseriaceae, Gemellaceae, Fusobacteriaceae, and Enterobacteriaceae independent of cecal involvement. Collectively, these results defined a core CD microbial shift that occurred, like the APOA1 and DUOX2 gene coexpression signatures, largely in the absence of overt clinical inflammation (Figure 5B).

Ileal Firmicutes and Proteobacteria taxa abundance are associated with the APOA1 and DUOX2 gene coexpression signatures and clinical outcomes. To test for an association between the core CD host gene coexpression signatures and these specific microbial shifts, we performed multivariate analysis by linear models (MaAsLin) (5, 6) in a subgroup of patients for whom both RNA-Seq and microbial community profiling had been performed (195 CD, 50 UC, and 34 non-IBD Ctl). We specifically tested for associations between members of the ileal microbial community and representative genes from the APOA1 (APOA1, CXCL9) and DUOX2 (DUOXA2, MUC4, LCT) gene coexpression signatures, clinical group (Ctl, UC, CD), endoscopic severity (ileal deep ulcers), and clinical severity (PCDAI), while controlling for age, gender, BMI, and risk allele carriage in NOD2, FUT2, and ATG16L1. We were able to identify 70 significant associations between gene expression and microbial taxa and 34 significant associations between clinical parameters and microbial taxa (P < 0.05 and q < 0.25, Supplemental Figure 9 and Supplemental Excel file 16).

We then used a novel biplot approach (Figure 6A, ordination was rotated by CD group) to visualize the covariation among the ileal microbial community structure, clinical group, CD clinical disease severity (PCDAI), and ileal gene expression from MaAsLin. Sample and microbial feature coordinates were generated as a standard biplot, with an additional dimension of clinical and gene expression metadata. The biplot used nonmetric multidimensional scaling (NMDS) to depict the multidimensional relationship between these parameters in 2 dimensions, with the stress measurement indicating the goodness of fit of the 2-dimensional representation of the data. A stress measurement of ≤0.2 is regarded as a good fit; during the development of the biplot, 19 of 20 stress measurements were ≤0.2, demonstrating that the model was not overfitted to the data. Points were used to represent samples, labels to represent selected significant microbiota, and labeled arrows to represent clinical and molecular metadata (see Methods section for more details). In the biplot, when we used specifically significant microbial taxa associations from the multivariate analysis starting at the family level, we found that Ctl and most UC ileal samples clustered together with Bifidobacteriales, Lachnospiraceae, and Ruminococcaceae and with a higher level of APOA1 gene expression. Increasing DUOXA2 expression and increasing clinical disease activity (PCDAI score) were represented by a vector directed to the right of the biplot in association with covariation in the ileal microbial community structure. A more central CD subgroup clustered together with Firmicutes Veillonellaceae and Betaproteobacteria Neisseriaceae, whereas the remaining CD samples and a small group of UC samples exhibited covariation with Gammaproteobacteria Pasteurellaceae, Fusobacteriaceae, and Gammaproteobacteria Enterobacteriaceae, further illustrating the dysbiosis of these microbiota in the IBD ileum. Consistent with the univariate analysis, the cCD samples were distributed across the entire spectrum of the CD microbial community structure.

Figure 6 Covariation of the ileal microbial community structure with ileal gene expression and clinical subgroup and severity. (A) The biplot depicts covariation of the ileal microbial community structure with clinical group, clinical disease activity (PCDAI), and ileal gene expression using NMDS rotated by CD and based on significant associations obtained from MaAsLin. Patients with IBD and healthy Ctls are plotted as orange circles (Ctl), green circles (UC), red circles (cCD), purple circles (iCD-noDU), or purple triangles (iCD-DU). Arrows indicate the direction of covariation for APOA1 or DUOXA2 gene expression and clinical severity (PCDAI). Microbial taxa are labeled with capitalized lettering. The stress measurement tests for the goodness of fit of the biplot 2-dimensional depiction of the multidimensional data, with stress < 0.2 regarded as a good fit. (B) The bar plot graphs show effect sizes (r coefficients) on the x axes for significant associations between microbial taxa and the indicated ileal gene expression, clinical disease activity (PCDAI), and clinical groups obtained from MaAsLin, while controlling for age, gender, BMI, antibiotic exposure, and NOD2, FUT2, and ATG16L1 risk allele carriage (P < 0.05, q < 0.25 were considered significant). Representative genes are from the APOA1 (APOA1 and CXCL9) and DUOX2 (DUOXA2, MUC4, and LCT) gene coexpression signatures. The comprehensive presentation of these data with P values, q values, and taxon abundance can be found in Supplemental Excel file 16.

We next focused on the significant specific associations that were detected between ileal gene expression and microbial abundance and between clinical metadata (mucosal ulceration, clinical disease severity, and clinical group) and microbial abundance, as shown in Figure 6B and Supplemental Excel file 16. Given our multivariate approach, the r coefficients shown are the effect sizes for a given microbial taxa, which represent the additional association of that microbe’s relative abundance with a given gene’s expression or clinical metadata, accounting for the association of all other microbes measured. Because the absolute abundance of each of the microbial taxa is relatively low, it would be anticipated that the effect sizes would also be low. DUOXA2 and MUC4 expression was positively associated with taxa from the Proteobacteria phylum (q < 0.0159 and q < 0.0026, respectively) and Enterobacteriaceae family, and MUC4 and CXCL9 expression was positively associated with taxa from the Gammaproteobacteria class Pasteurellaceae family (q < 0.1345 and q < 0.176, respectively) (Figure 6B and Supplemental Excel file 16). Additionally, CXCL9 expression showed positive association with taxa from Firmicutes Veillonellaceae (q < 0.115). In contrast, DUOXA2, CXCL9, and MUC4 expression was negatively associated with taxa from the Firmicutes (q < 0.0156, q < 0.137, and q < 0.0012, respectively) and Bacteroidetes phyla (q < 0.0175, q < 0.143, and q < 0.0025, respectively) (Figure 6B and Supplemental Excel file 16).

In contrast to prior studies of patients with CD with longstanding disease, the presence of more severe mucosal injury (deep ulcers) that we have now shown to be highly associated with specific gene signatures and pathways was not associated with a significant microbial abundance shift in the multivariate analysis. However, clinical severity scoring (PCDAI) that showed no significant association with mucosal gene expression signature was specifically and significantly associated with reduction in abundance of Firmicutes (q < 0.1) and Bacteroidetes (q < 0.1) (Figure 6B) and increase of Proteobacteria taxa (q < 0.038). In addition, prominent preservation of several taxa within the Firmicutes phyla, including the Roseburia from the Lachnospiraceae family and Erysipelotrichaceae family, were observed within the UC ileum (q < 0.0025 and q < 0.14, respectively) (Figure 6B, bottom panel). By comparison, APOA1 and LCT expression was positively associated with Firmicutes (q < 0.04 and q < 0.12, respectively) and Bacteroidetes (q < 0.06 and q < 0.053, respectively) taxa, while LCT expression was also inversely associated with abundance of specific Proteobacteria taxa (q < 0.1, Figure 6B, bottom panel). Collectively these data demonstrate that microbial shifts, which were detected even in the unaffected ilea of patients with cCD, were associated with specific changes in expression of genes from the APOA1 and DUOX2 gene coexpression signatures and the spectrum of ileal mucosal inflammation and injury in CD and UC, while reduction of taxa from the Firmicutes phyla, like the APOA1 signature, was specific to CD and preserved in the ilea of patients with UC.

Finally, we asked whether these gene expression and microbial data, in combination with clinical data at diagnosis, would improve a prediction model for SSFR 6 months after diagnosis, when compared with a model based only on clinical factors and subsequent treatment exposures. Tables 4 and 5 present the results of a multivariable regression analysis to test the accuracy of models for 6-month SSFR, which, in an iterative fashion, included clinical parameters only, clinical and gene expression (DUOX2 and APOA1) parameters, or clinical, gene expression, and microbial (taxa identified by MaAsLin) parameters. As shown in Table 4, the accuracy of each of the 3 models was assessed by the area under the curve (AUC) for a receiver operator curve analysis, with the likelihood ratio test used to formally test for a difference in accuracy between the model based only on clinical parameters and the model which included clinical, gene expression, and microbial parameters. These data demonstrated that a model that included clinical, gene expression, and microbial factors was superior to a model that included only clinical factors, with the highest AUC of 0.760 and P = 0.0043 (likelihood ratio test), compared with the model based on only clinical factors.

Table 4 Characteristics of regression models to predict 6-month SSFR

Table 5 Multiple regression analysis, including clinical, gene expression, and microbial varia

The results of the multivariable regression analysis, which included clinical, gene expression, and microbial parameters, are shown in Table 5. Neither age at diagnosis nor clinical disease activity (PCDAI) alone was associated with 6-month SSFR. Among patients with mild clinical disease activity (PCDAI ≤ 30), the presence of ileal deep ulcers was associated with a higher likelihood of achieving 6-month SSFR, as evidenced by a higher odds ratio (OR), which reached statistical significance (P = 0.0029). As expected, the highest OR for achieving 6-month SSFR was associated with anti-TNF therapy. After accounting for clinical and mucosal severity and anti-TNF therapy, variable selection and the classification and regression tree analysis identified higher APOA1 expression (above the sample 80th percentile) and microbial taxa, including Lachnospiraceae Blautia and Veillonellaceae Veillonella abundance as significant prognostic factors. The relative abundance of Blautia and Veillonella interactively affected the odds of attaining 6-month SSFR: while an increased abundance of Blautia (above the sample 70th percentile) was negatively associated with the odds of achieving 6-month SSFR, its effect was abrogated by an increased abundance of Veillonella (above the sample 80th percentile). Finally, we used 10-fold cross-validation to test the overall reliability of the model and the potential for overfitting. We divided the data set into 10 subsets and used each subset as a validation set, while using the remaining 9 as a test set. Cross-validation showed that the clinical outcome prediction model using clinical, genetic, and microbial data performed reliably: the predictive power measured by the AUC on the validation sets was 0.777 on average, with SEM = 0.011. These data demonstrated that a model which includes baseline APOA1 expression and Blautia and Veillonella abundance accurately predicts 6-month clinical outcome.