Study population

A cohort study was designed to assess the impact of chemotherapy on the gastrointestinal (GI) microbiota of pediatric and adolescent patients diagnosed with acute B-cell leukemia. The study cohort consisted of 51 participants: 23 patients who were each matched with a healthy sibling (46 participants) and five patients who did not have an enrolled healthy sibling. Three subjects did not complete the study: two withdrew and one died. Subject demographics by age and gender are shown in Table 1. All study participants were enrolled at the Hyundai Cancer Institute, Children’s Hospital Orange County (CHOC Children’s), California, USA. The human subject protocol and consent forms were established and approved by the Institutional Review Boards at CHOC Children’s and the J. Craig Venter Institute (JCVI).

Stool samples were collected at the completion of each treatment stage during the patient’s stay at the hospital; these collection points are referred to as “sampling visits”. Samples marked “visit 1” were collected at the time of diagnosis, before any chemotherapy was administered, and thus provided the baseline microbiota for each patient. Patient samples were therefore collected before chemotherapy, during induction chemotherapy (given to induce a remission), during consolidation chemotherapy (given once a remission is achieved), and during maintenance therapy (chemotherapy given in lower doses to help prolong a remission). Healthy sibling controls were sampled once, at a time corresponding to the period before the patient began chemotherapy (Additional file 1: Table S1). The exception was four siblings whose samples were collected at two time points; these were excluded from the analysis.

All patients with acute lymphoblastic leukemia (ALL) enrolled in the study received antibiotic prophylaxis with sulfamethoxazole and trimethoprim during treatment and steroid prophylaxis at the induction stage. Incidental use of antibiotics and occurrence of infections in the month before each visit were recorded. Additional file 1: Table S1 provides details of the sampling visits for each patient over the period of enrollment.

Table 1 Characteristics of study subjects

Sample collection

Samples were collected following section 7.3.3 of the Human Microbiome Project (HMP) collection protocol, with no modifications. Stool specimens were then transported to CHOC Children’s for deoxyribonucleic acid (DNA) extraction [20].

DNA extraction

Bacterial DNA was extracted from the stool samples using the PowerSoil® DNA Isolation Kit (MO BIO Laboratories, Inc., catalog no: 12888), following the protocol described in Yooseph et al. [21].

Library construction and sequencing

DNA was amplified using primers that targeted the V1-V3 regions of the 16S rRNA gene [22]. These primers included the i5 and i7 adaptor sequences for Illumina MiSeq sequencing as well as unique 8 bp indices incorporated into both primers, such that each sample received its own unique barcode pair. Incorporating the adaptor and index sequences into the primers at the polymerase chain reaction (PCR) stage resulted in minimal loss of sequence data compared with previous library construction methods, which ligate the adaptors to every amplicon after amplification. This method also generated sequence reads that were all in the same 5′-3′ orientation. Amplicons were generated from approximately 100 ng of extracted DNA with Platinum Taq polymerase (ThermoFisher, catalog no: 11304-011) under the following cycling conditions: an initial denaturation at 95 °C for 5 min; 35 cycles of 95 °C for 30 s, 55 °C for 30 s, and 72 °C for 30 s; and a final extension at 72 °C for 7 min, followed by storage at 4 °C. After PCR, the amplicons from each sample were purified using the QIAquick PCR purification kit (QIAGEN, catalog no: 28104), quantified using Tecan fluorometric methods (Tecan Group, Männedorf, Switzerland), normalized, and pooled in preparation for cluster generation, followed by Illumina MiSeq sequencing in the dual-index 2 × 300 bp format (Roche, Branford, CT) following the manufacturer’s protocol.
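The dual-indexing idea described above, in which each sample is identified by its own pair of 8 bp indices, can be illustrated with a minimal Python sketch. The index sequences, sample names, and function names below are hypothetical and chosen only for illustration; they are not the indices or software used in this study.

```python
from itertools import product

# Hypothetical 8 bp index sequences (not the indices used in the study).
I5_INDICES = ["ATCGTACG", "ACTATCTG", "TAGCGAGT"]
I7_INDICES = ["AACTCTCG", "ACTATGTC"]


def assign_barcode_pairs(sample_ids, i5_indices, i7_indices):
    """Give every sample its own (i5, i7) index pair."""
    pairs = list(product(i5_indices, i7_indices))
    if len(sample_ids) > len(pairs):
        raise ValueError("not enough index combinations for the samples")
    return dict(zip(sample_ids, pairs))


def demultiplex(read_indices, sample_to_pair):
    """Map each read's observed (i5, i7) pair to a sample, or None if unknown."""
    pair_to_sample = {pair: sample for sample, pair in sample_to_pair.items()}
    return [pair_to_sample.get(pair) for pair in read_indices]


# Hypothetical sample labels: patient (P) and sibling (S) visits.
samples = ["P01_visit1", "P01_visit2", "S01_visit1", "P02_visit1"]
layout = assign_barcode_pairs(samples, I5_INDICES, I7_INDICES)

# Two observed reads: one with a known pair, one with an unexpected i5 index.
observed = [("ACTATCTG", "AACTCTCG"), ("GGGGGGGG", "AACTCTCG")]
assigned = demultiplex(observed, layout)
```

With three i5 and two i7 indices, six distinct pairs are available, so the four samples each receive a unique pair; a read whose pair matches no sample is left unassigned.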

16S rRNA sequence data processing

After primer trimming, the paired-end reads were quality trimmed using the DynamicTrim program (available in the SolexaQA suite [23]). Subsequently, mothur (v.1.34.4) [24] was used to merge overlapping forward and reverse reads into contig sequences. Chimeric sequences were identified using the UCHIME [25] implementation in mothur and removed from the downstream analysis. The resulting sequence set was clustered at three sequence identity thresholds (90, 95 and 97 %) using CD-HIT [26] to generate Operational Taxonomic Units (OTUs); OTU representatives were assigned taxonomy using the mothur implementation of the Ribosomal Database Project (RDP) Classifier [27]. Although the RDP Classifier aims to generate genus-level taxonomic assignments, not all sequences could be confidently assigned to the genus level (using a bootstrap confidence threshold of 80 %); we denote these sequences by appending the tag “unclassified” to the end of their taxonomic assignment (at one of the phylum, class, order, or family levels).
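The thresholding rule for taxonomic assignments can be sketched as follows. This is a minimal illustration, not the mothur or RDP Classifier implementation: the input format (a list of name and bootstrap-percentage pairs ordered from the highest rank down to genus), the semicolon-joined output, and the function name are all assumptions made for the example.

```python
def confident_taxonomy(lineage, threshold=80):
    """Keep ranks from the top down until bootstrap support drops below threshold.

    `lineage` is a list of (taxon_name, bootstrap_percent) tuples ordered from
    the highest rank to genus (an assumed, classifier-like input format).
    If the lineage is cut short, the tag "unclassified" is appended.
    """
    kept = []
    for name, support in lineage:
        if support < threshold:
            break
        kept.append(name)
    if len(kept) < len(lineage):
        kept.append("unclassified")
    return ";".join(kept)


# Hypothetical classifier output: genus-level support (62) is below 80.
low_genus = [
    ("Bacteria", 100), ("Firmicutes", 100), ("Clostridia", 98),
    ("Clostridiales", 97), ("Lachnospiraceae", 91), ("Blautia", 62),
]
# Hypothetical classifier output with confident genus-level support.
high_genus = low_genus[:-1] + [("Blautia", 95)]

family_level = confident_taxonomy(low_genus)
genus_level = confident_taxonomy(high_genus)
```

Here the first lineage is resolved only to the family level and carries the “unclassified” tag, while the second is retained down to the genus.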

Electronic medical records

Study-specific participant data were collected at each specimen collection time point. In addition, multiple time-stamped data points for each participant, starting from the initial visit, were gathered in the electronic medical record (EMR) at CHOC Children’s. Events recorded in the EMR were obtained for each patient, identified by an anonymized participant identification number.

16S rRNA data analysis

The Shannon diversity index was calculated from the OTUs in each sample using mothur [24] to assess the alpha diversity of the microbial communities. Calculations were performed on sub-sampled sequences to account for differences in sequencing depth across samples. To test whether the difference in microbiota diversity between patients and controls was statistically significant, we applied the Wilcoxon rank-sum test in the R statistical programming language [28] to calculate the p-value for the comparison. The difference in mean age between the patient and control groups was evaluated using the t-test. To compare average microbiota diversity between different time points, we used a paired t-test (that is, a one-sample t-test on the paired differences).
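For readers unfamiliar with the calculation, the following Python sketch shows the Shannon index, H' = −Σ p_i ln p_i, computed on a sample that has first been sub-sampled to a fixed read depth. It is an illustration only, not the mothur implementation used in the study; the natural logarithm, the OTU counts, the depth of 40 reads, and the function names are assumptions.

```python
import math
import random
from collections import Counter


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over non-zero OTU counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def subsample(otu_counts, depth, seed=0):
    """Draw `depth` reads without replacement from a sample's OTU counts."""
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return Counter(rng.sample(reads, depth))


# Hypothetical OTU counts for one sample (100 reads in total).
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 20}
rarefied = subsample(sample, depth=40)
diversity = shannon_index(rarefied.values())
```

Sub-sampling every sample to the same depth before computing the index keeps samples with more reads from appearing more diverse simply because rare OTUs are more likely to be detected at greater depth.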

The signatures associated with the patient and control groups were identified using Random Forests (RF) [29], as implemented in the randomForest package (version 4.6–10) for the R programming language and environment for statistical computing (version 3.2.2) [28]. For this analysis, OTUs (at the 97 % identity threshold) with the same taxonomic classification were combined into a single bin, thereby generating a set of taxon bins in which each bin had a unique taxonomy. The best taxonomic resolution for a taxon bin is the genus level, and some bins may have less resolved taxonomy (that is, at one of the phylum, class, order, or family levels). The sequence counts in these taxon bins were used to calculate their relative abundance in each sample, and these abundances were used as input features for the RF analysis. Gender, antibiotic use, and alpha diversity (Shannon index) were also included as input features.
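The binning step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the OTU names, taxonomy labels, feature names, and the encodings of gender and antibiotic use are all hypothetical, and the Random Forests model itself is not shown.

```python
from collections import defaultdict


def taxon_bin_abundances(otu_counts, otu_taxonomy):
    """Collapse OTU counts into taxon bins and return relative abundances."""
    bins = defaultdict(int)
    for otu, count in otu_counts.items():
        bins[otu_taxonomy[otu]] += count  # OTUs with the same taxonomy share a bin
    total = sum(bins.values())
    return {taxon: count / total for taxon, count in bins.items()}


def build_feature_row(abundances, gender, antibiotic_use, shannon):
    """Combine taxon-bin abundances with the additional per-sample features."""
    row = dict(abundances)
    row["gender"] = gender                  # hypothetical encoding, e.g. "F" or "M"
    row["antibiotic_use"] = antibiotic_use  # hypothetical encoding, True or False
    row["shannon"] = shannon
    return row


# Hypothetical taxonomy labels for three OTUs; two share the same genus.
otu_taxonomy = {
    "OTU1": "Bacteroides",
    "OTU2": "Bacteroides",
    "OTU3": "Lachnospiraceae;unclassified",
}
sample_counts = {"OTU1": 30, "OTU2": 10, "OTU3": 60}

abundances = taxon_bin_abundances(sample_counts, otu_taxonomy)
features = build_feature_row(abundances, gender="F", antibiotic_use=False, shannon=0.67)
```

In this example the two Bacteroides OTUs are merged into one bin with a relative abundance of 0.4, and the family-level bin accounts for the remaining 0.6; each sample then contributes one such feature row to the classifier input.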