Patients and mucosal biopsy samples

Mucosal biopsies were obtained from patients undergoing CRC screening and surveillance colonoscopies at the University of Minnesota (Additional file 1). The majority of CF patients receiving care at the Minnesota Cystic Fibrosis Center participate in a systematic colonoscopic CRC screening program as described previously [4]. None of the CF patients had acute infections within the preceding 3 months of the procedure, and CF patient colonoscopies were done for colon cancer screening and not acute gastrointestinal symptoms. Control samples were obtained from non-CF patients with average risk of CRC undergoing routine colonoscopic CRC screening or surveillance. Pinch biopsies, four per patient, were obtained using the Radial Jaw 4 Jumbo w/Needle 240 (length) forceps for a 3.2-mm working channel (Boston Scientific, Marlborough, MA; Catalog # M00513371) in the right colon and placed into RNAlater stabilization solution (Thermo Fisher Scientific, Waltham, MA). The protocol was approved by the University of Minnesota Institutional Review Board (IRB protocol 1408 M52889). Gene expression was analyzed by RNA-seq from a total of 33 samples obtained from 18 CF patients and 15 non-CF control participants (Additional file 2: Figure S1).

RNA extraction and sequencing

Biopsy tissue was kept in the RNAlater stabilization solution overnight at 4 °C. RNA was prepared following tissue homogenization and lysis using the TRIzol Plus RNA Purification Kit (Thermo Fisher Scientific; catalog # 2183–555) following detailed manufacturer’s instructions. Total RNA samples were converted to Illumina sequencing libraries using Illumina’s Truseq Stranded mRNA Sample Preparation Kit (Cat. # RS-122-2103). Total RNA was oligo-dT purified using oligo-dT-coated magnetic beads, fragmented, and then reverse transcribed into cDNA. The cDNA was adenylated and then ligated to dual-indexed (barcoded) adaptors and amplified using 15 cycles of PCR. Final library size distribution was validated using capillary electrophoresis and quantified using fluorimetry (PicoGreen). Indexed libraries were then normalized, pooled, and then size selected to 320 bp ± 5% using Caliper’s XT instrument. Truseq libraries are hybridized to a paired-end flow cell, and individual fragments were clonally amplified by bridge amplification on the Illumina cBot. Once clustering is complete, the flow cell is loaded on the HiSeq 2500 and sequenced using Illumina’s SBS chemistry (Additional file 2: Figure S1).

Host RNA-seq quality control, read mapping, and filtering

We performed quality check on raw sequences from all 33 samples (to assure better downstream analysis using FastQC) [18]. This helped assess any biases due to parameters such as quality of the reads, GC content, number of reads, read length, and species to which the majority of the reads mapped (Additional file 2: Figure S2). The FASTQ files for forward and reverse (R1 and R2) reads were mapped to the reference genome using kallisto [19], where an index for the transcriptomes was generated to quantify estimated read counts and TPM values. Mean distribution for the TPM values was plotted using R to filter all the transcripts below a threshold value of log2[TPM] < 0. We generated PCA plots using sleuth [20] to examine sample clusters and visualization of expression patterns for genes using bar plots (Additional file 2: Figures S3 and S4). For further analysis of outlier samples, box plots were generated using Cook’s distance and heat map clustered by condition and mutation status was generated for the top 20 expressed genes (Additional file 2: Figures S5 and S6).

Host RNA-seq differential expression and enrichment analysis

To determine differentially expressed genes between CF and healthy samples, we quantified and annotated the transcripts using DESeq2 [21]. The output from kallisto was imported into DESeq2 using the tximport package [22]. The transcripts were annotated against the ensemble database using bioMART to obtain gene symbols [23]. Transcripts below a threshold of row-sum of 1 were filtered and collapsed at a gene symbol level. Prior to differentially expressed gene analysis, the read counts were normalized and the gene-wise estimates were shrunken towards the fitted estimates represented by the red line in the dispersion plot (Additional file 2: Figure S7). The gene-wise estimates that are outliers are not shrunk and are flagged by the blue circles in the plot (Additional file 2: Figure S7). DESeq2 applies the Wald’s test on estimated counts and uses a negative binomial generalized linear model determines differentially expressed genes and the log-fold changes (Additional file 2: Figure S8). The log-fold change shrinkage (lcfshrink()) function was applied for ranking the genes and data visualization. For data smoothing, MA plots were generated before and after log2 fold shrinkage. We found no change in the MA plot (Additional file 2: Figure S9) post smoothing, as there are no large log-fold changes in the current data (log2 fold change between − 1 and 1) due to low counts. The data were further transformed, and the normalized values were extracted using regularized logarithm (rlog) to remove the dependence of variance on mean. We used the Benjamini-Hochberg method for reducing the false discovery rate (FDR) with a cutoff of 0.05 for identifying differentially expressed genes for further analysis. Enrichment analysis was done using Ingenuity Pathway Analysis (IPA, QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuitypathway-analysis). The log-fold changes, p values, and FDR values (for all the genes with FDR < 0.05) were fed into IPA for both up- and downregulated differentially expressed genes between CF and healthy samples. Disease/functional pathways and gene networks were determined based on the gene enrichment. Furthermore, we looked at how many target upstream regulators were enriched based on our list of differentially expressed genes using IPA. We found 134 targets that passed the filter (p value < 0.01) from a total of 492 targets, of which 96 were transcription regulators.

16S rRNA extraction and sequencing

Mucosal biopsies samples (~ 3 × 3 mm) from 13 CF and 12 healthy individuals were collected in 1 mL of RNAlater and stored for 24 h at 4 °C prior to freezing at − 80 °C. DNA was extracted using a MoBio PowerSoil DNA isolation kit according to the manufacturer’s instructions (QIAGEN, Carlsbad, USA). To look at the tissue-associated microbiome, the V5-V6 region of 16S rRNA gene was amplified as described by Huse et al. [24] using the following indexing primers (V5F_Nextera: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGRGG ATTAGATACCC, V6R_Nextera: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCGACRRCCATGCANCACCT). Index and flowcell adaptors were added with this step. Forward indexing primer used is - AATGATACGGCGACCACCGAGATCTACAC [i5] TCGTCGGCAGCGTC and reverse indexing primers used is - CAAGCAGAAGACGGCATACGAGAT [i7]GTCTCGTGGGCTCGG. Post two rounds of PCR, pooled, size-selected samples were denatured with NaOH, diluted to 8 pM in Illumina’s HT1 buffer, spiked with 15% PhiX, and heat denatured at 96 °C for 2 min immediately prior to loading. A MiSeq 600 cycle v3 kit was used to sequence the sample.

Gut mucosal microbiome data processing, quality assessment, and diversity analysis

We processed the FASTQ files using FastQC [18] to perform quality control on the raw sequences. We then used SHI7 [25] for trimming Nextera adaptors, stitching paired-end reads and performing quality trimming at both ends of the stitched reads until a minimum Phred score of 32 was reached. Following quality control, we obtained an average of 217,500 high-quality reads per sample (median 244,000; range 9551–373,900) with an average length of 281.9 bases and an average quality score of 37.19. These merged and filtered reads were used for closed reference operational taxonomic unit (OTU) picking and taxonomy assignment against GreenGenes database with 97% similarity level using the NINJA-OPS program [26].

To identify any potential contaminants originating from laboratory kits and reagents, we used two negative controls consisting of “blank” DNA extractions that were processed and sequenced alongside the true samples. The principal coordinates analysis (PCoA) plot of the true samples with the negative controls shows clustering by sample type (Additional file 2: Figure S10) suggesting that most sequences observed in true samples were not derived from reagent contamination. We used these sequenced negative controls for identification of contaminants by applying decontam, an R package that implements a statistical classification procedure to detect contaminants in 16S and metagenomic sequencing data and has been shown to identify contaminants across diverse studies, including those from biopsy samples [27]. We used the prevalence-based contamination identification approach that is recommended for low biomass environments, like tissue biopsy. This method computes a prevalence-based score (ranging from 0 to 1) that is used by decontam to distinguish between contaminant and non-contaminants. A small score (less than 0.5) indicates that a sequence feature is likely to be a contaminant, while higher score (greater than 0.5) indicates non-contaminants (i.e., true sequences). We plotted the distribution of prevalence-based scores assigned by decontam (Additional file 2: Figure S11) that shows that most of the OTUs in our samples were assigned high scores (> 0.5), thus suggesting non-contaminant origin. Nevertheless, in order to identify any potential contaminants, we ran decontam analysis at the default classification threshold of 0.1, and at a higher threshold of 0.2.

We performed alpha- and beta-diversity analysis in R using the vegan [28] and phyloseq [29] packages. We used resampling-based computation of alpha diversity, where the OTU table is subsampled 100 times at minimum read depth (9551 reads) across all samples and computed average richness estimate for each alpha-diversity metric (chao1, observed OTUs, and Shannon). Wilcoxon rank-sum test was used for testing the statistical significance of the associations between alpha diversity of the CF and healthy conditions. For computing beta-diversity, we first rarefied the OTU table (using vegan’s rrarefy() function) at a minimum sequence depth (i.e., 9551 reads) across the samples and then computed Bray-Curtis dissimilarity, weighted UniFrac, and unweighted UniFrac metrics. The Adonis test was used for assessing if there is significant association between the beta-diversity of the CF/healthy condition and the diversity results are plotted using the ggplot2 package in R.

Gut mucosal microbiome differential abundance and functional analysis

We performed differential abundance testing between CF and healthy conditions using the phyloseq [29] package in R. We first created a phyloseq object from the OTU table (using the phyloseq() function) and filtered this object to only include OTUs occurring in at least half of the number of samples in the condition with fewer samples (i.e., min (number of samples in CF, number of samples in Healthy)/2)) with at least 0.1% relative abundance (using the filter_taxa() function). The filtered phyloseq object was converted into a DESeqDataSet object (using phyloseq_to_deseq2()), and the DESeq() function was invoked. This performed dispersion estimations and Wald’s test for identifying differentially abundant OTUs, with their corresponding log-fold change, p value, and FDR-adjusted q values between the CF and healthy conditions. We agglomerated the OTUs at different taxonomic ranks (using the tax_glom() function) and repeated the above steps to identify differentially abundant taxa at genus, family, order, class, and phylum levels.

We also tested for associations between taxonomic abundance and mutation status of CF samples. We first categorized samples into three genotype categories: (1) Healthy: Samples with no mutations; (2) CF_df508: CF samples with homozygous delta-F508 deletion, which is associated with more severe CF condition [30]; and (3) CF_other: CF samples with df508 heterozygous deletion or other mutation status. We used DESeq2’s likelihood ratio test (LRT) to identify taxa that showed significant difference in abundance across the three categories.

We then generated the predicted functional profiles for the gut microbes using PICRUSt v1.0.0 pipeline, [31] where pathways and enzymes are assigned using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The KEGG level 3 pathways were filtered for rare pathways by only including pathways with relative abundance > 0.1% in at least half of the samples, normalized to relative abundance, and tested for association with CF/healthy conditions using non-parametric Wilcoxon rank-sum test followed by FDR adjustment.

To verify that our results were not affected by potential contaminants, we applied the prevalence-based contamination identification approach implemented in the decontam R package described above. We repeated the differential abundance analysis after removal of OTUs identified as contaminants and found the same microbes to be differentially abundant between CF and healthy samples or mutation status as those in the analysis without contamination identification. This confirmed that our results were not influenced by potential contaminants.

Integrated analysis of interactions between host gene dysregulations and changes in microbiome

For this analysis, differentially expressed genes from host and gut microbial OTUs from their respective overlapping samples were used (22 samples in total, with 12 healthy samples and 10 CF samples). We further subset differentially expressed genes between CF and healthy conditions (FDR < 0.05), specifically enriched for gastrointestinal cancer disease pathways (524 genes). Using absolute expression log ratio greater than 0.35, we obtained a representative set of both up- and downregulated genes from these pathways, leaving 250 genes for downstream analysis. The OTU table was collapsed at the genus level (or the last characterized level) and filtered for rare taxa by only including taxa with at least 0.1% relative abundance present in at least half of the number of samples in the condition with fewer samples (i.e., min (number of samples in CF, number of samples in Healthy)/2)), resulting in 35 taxa for further processing. Following this, centered log ratio transform was applied on the filtered table. We then performed correlation analysis between host gene expression data for 250 genes and gut microbiome abundance data for 35 taxa (genus level) defined above. Spearman correlation was used for this analysis as it performs better with normalized counts (gene expression) as well as compositional data (microbiome relative abundance) compared to other metrics, such as Pearson correlation [32]. We computed the Spearman rank correlation coefficients and the corresponding p values using the cor.test() function with two-sided alternative hypothesis. A total of 8750 (250 genes × 35 taxa) statistical tests were performed, and p values were corrected for multiple comparisons using the qvalue package in R [33]. Representative gene-taxa correlations were visualized using corrplots [34] in R, where the strength of the correlation is indicated by the color and size of the visualization element (square) and the significance of the correlation is indicated via asterisk. We also computed the Sparse Correlation for Compositional Data (SparCC) [35] for the taxa found significantly correlated (q value < 0.1) with the CRC genes. Pseudo p values were computed using 100 randomized sets. Significant gene-microbe correlations (q value < 0.1) and significant microbe-microbe correlations (SparCC |R| > =0.1 and p value < 0.05) were visualized as a network using Cytoscape v3.5.1 [36].

To ensure that these correlations were not influenced by any potential contaminants, we repeated the analysis after removing any contaminants identified by decontam as described above and found that the associations remained unchanged. Additionally, we also verified whether any correlated taxa coincided with known lab contaminants mentioned by Salter and colleagues [37]. We found no overlapping microbes with the list of known contaminants, except Pseudomonas. Pseudomonas was not identified as a contaminant in our decontam analysis. Interestingly, Pseudomonas aeruginosa, which is a major pathogen in cystic fibrosis lung infection [38, 39], has previously been isolated from the fecal samples of patients with cystic fibrosis [17, 40]. This suggests that the presence of Pseudomonas in our samples is not due to contamination and could be potentially attributed to the cystic fibrosis condition of our patient cohort.