Canadian Complex Chronic Disease Study Samples

The Complex Chronic Disease Study has been described elsewhere [19], and includes 25 ME/CFS subjects meeting the 2003 Canadian Consensus Definition for ME/CFS (all of whom suffered post-exertional malaise with extreme fatigue severity scores) [26], and 25 age- and sex-matched healthy controls. The protocol was approved by the University of British Columbia’s IRB (H11-01998) and all subjects gave written informed consent to participate. Upon consent, serum was collected at baseline for serological tests and stored at −20 °C until thawed for use. Samples were diluted 1:1 with reagent-grade glycerol plus 0.025% sodium azide to prevent freeze-thaw cycle damage.

Norwegian Rituximab Study Samples

A pilot study and two phase-2 studies of B cell depletion using the monoclonal anti-CD20 antibody rituximab for treatment of ME/CFS have been previously described [13,14,15]. In the present analysis, we used pretreatment sera from 25 individuals drawn from the pilot study and the KTS-2-2010 single-center, open-label, one-armed phase II study (NCT01156909) [16], in which subjects received two rituximab (500 mg/m²) infusions 2 weeks apart, followed by maintenance rituximab infusions after 3, 6, 10, and 15 months, with follow-up for 36 months. The study was approved by the Regional Ethical Committee in Norway (no. 2010/1318-4) and by the National Medicines Agency, and all subjects gave written consent to participate.

In this study, subjects improving according to the predefined criteria in the protocol were characterized as responders. For the analyses in this manuscript, we used only the pretreatment samples from subjects in the trial. The biobanked samples were aliquoted before freezing at −80 °C. Samples of 100 μl were diluted to a final concentration of 50% glycerol for transport to the testing laboratory. Samples were shipped at −20 °C, at which temperature they were kept throughout.

American Healthy Control Samples

Non-affected control samples were obtained from Clinical Testing Solutions (Tempe, AZ), a national blood testing laboratory. Samples were stored at −20 °C until use. Healthy samples were obtained from multiple locations throughout the continental US and came from blood donors who were negative for the presence of infection. We selected samples based on age (30–62 years of age) but not gender, race, or geography.

Laboratory Methods

Deidentified samples were received and kept frozen at −20 °C until use. The immunosignature arrays were synthesized and completed as described previously but used 125,000 peptides rather than 330,000 [22]. Peptides were 12 amino acids long and were composed of 16 of the 20 natural amino acids, excluding threonine, methionine, isoleucine, and cysteine. Microarray slides were blocked with 1 mM PBS, 3% bovine serum albumin, 0.05% Tween 20, 0.014% mercaptohexanol for 1 h at 25 °C in a darkened humidified chamber. Sera were then diluted in 3% bovine serum albumin, 1 mM PBS, 0.05% Tween 20 pH 7.2 to a 1:500 dilution for mouse and human sera, and allowed to bind for 1 h at 37 °C at 20 RPM rotation. Slides were washed three times for 5 min each with 1 mM Tris-buffered saline, 0.05% Tween 20 pH 7.2, followed by three washes with distilled water. Once incubation was completed, the slides were dried by centrifugation at 2400 × g for 10 min and scanned by an Innopsys (Carbonne, France) Innoscan 910 0.5 μm 2-color scanner. The images were stored as 16-bit uncompressed TIFF files, aligned using GenePix Pro 6.0 (Molecular Devices, Santa Clara, CA), and stored in a local relational database prior to analysis. Analysis was done using R (CRAN) and GeneSpring 7.3.1 (Agilent, Mt. View, CA). Serum antibodies were detected with a labeled secondary antibody carrying either Alexa Fluor 555 or Alexa Fluor 647. Secondary antibodies were incubated at a concentration of 5 nM for 1 h at room temperature. Single-color experiments were performed exclusively; dye choice depended on availability from the supplier, usually either Innova Biosciences (Cambridge, UK), Life Technologies (Madison, WI), or Jackson Labs (Bar Harbor, MA).

Data Preprocessing

For each sample, data for 122,926 peptide abundances were available, ranging in value from 0 to 65,535, where 65,535 represented the upper detection limit of the 16-bit digitizer. Samples were typically run in duplicate, and data processing included control peptide averaging as well as replicate sample testing for outliers before replicate samples were merged (Online Resource 1). In cases where one of the replicates was an outlier sample, the corresponding replicate pair was removed from the analyses (n = 8). Replicates that passed the outlier testing (n = 78) were merged by calculating the arithmetic mean of peptide abundances for each of the 122,926 peptides. Samples run as singletons were removed (n = 2), except for the six American Healthy Control group samples, all of which were run as singletons. Each sample was then median-centered by dividing each peptide abundance by the median value over all peptides for the corresponding sample, followed by a log2-transformation of the data. The median normalization and log2-transformation put the median peptide value for each of the processed samples at zero.
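
The merge and normalization steps can be illustrated with a minimal sketch. It is not the code used in the study: the function names and the five-peptide toy vectors are hypothetical, and because the text does not say how zero intensities were handled before the log2-transformation, the sketch simply rejects non-positive values.

```python
import math
from statistics import median


def merge_replicates(rep_a, rep_b):
    """Merge two replicate runs by the per-peptide arithmetic mean."""
    if len(rep_a) != len(rep_b):
        raise ValueError("replicates must cover the same peptides")
    return [(a + b) / 2 for a, b in zip(rep_a, rep_b)]


def median_center_log2(intensities):
    """Divide each abundance by the sample median, then take log2.

    Assumption: all intensities are positive. How zeros were treated
    is not stated in the text, so they are rejected here.
    """
    if any(x <= 0 for x in intensities):
        raise ValueError("non-positive intensity; handling not specified")
    sample_median = median(intensities)
    return [math.log2(x / sample_median) for x in intensities]


# Hypothetical toy data: two replicate runs of one sample, five peptides.
rep_a = [100, 200, 400, 800, 1600]
rep_b = [300, 200, 400, 800, 1600]

merged = merge_replicates(rep_a, rep_b)
normalized = median_center_log2(merged)
```

In this toy case the merged median is 400, so the peptide at the median maps to log2(1) = 0, which is why every processed sample ends up with a median of zero.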

Data Partitioning

After replicates were processed, the 84 samples from the three data sets (Canadian, Norwegian, and American) were used to create two data partitions: one for immunosignature discovery and one for immunosignature validation (Fig. 1). The data partition used in the discovery analysis (“Discovery Set”) comprised all Canadian ME/CFS (n = 22) and control (n = 21) samples. The validation data partition (“Validation Set”) included all Norwegian ME/CFS samples (n = 22), USA control samples (n = 6), and a subset of randomly selected Canadian ME/CFS (n = 6) and control (n = 7) samples, and was intended to evaluate each immunosignature’s potential to distinguish ME/CFS cases from healthy controls. Samples in the Validation Set were run in an immunoassay experiment separate from the samples in the Discovery Set. The Canadian and USA control samples were included in the Validation Set to compensate for the lack of Norwegian control samples while still allowing us to characterize the immunosignatures’ ability to distinguish cases from controls. Canadian cases were included to confirm separation of Canadian cases and controls in the Validation Set.

Discovery Analyses

We performed all analyses using scripts implemented in R version 3.3.2 [27]. In the Discovery Analysis, we derived robust candidate peptide signatures based on unsupervised and supervised univariate and multivariate analysis methods as shown in Fig. 1, including PCA, hierarchical clustering, gene shaving, elastic net, and random forest [28,29,30,31,32]. The unsupervised analyses were carried out blind to group status while the supervised analyses used group status directly in the supervising vector.

Three unsupervised and three supervised analyses were run on the full 122,926 peptide dataset to select peptides best able to discriminate ME/CFS from control samples. Amongst the unsupervised methods, we set our sparse PCA (sPCA) and sparse IPCA (sIPCA) analyses to select 100 peptide features, while the gene shaving (GS) method automatically selected features. Each of these three unsupervised methods was instructed to return ten lists (sPCA1-sPCA10, sIPCA1-sIPCA10, and GS1-GS10). For each set of ten lists, the ability to separate ME/CFS cases from controls was reviewed, and subsets of peptide lists were selected and combined into three panels of peptide features: sPCA_panel, sIPCA_panel, and GS_panel.

The supervised methods used different feature selection approaches. Robust limma (RL) used a threshold for the adjusted p value and returned all peptides at or below the threshold. Random forest (RF) used internal bootstrapping to calculate feature importance measures that indicated how classification performance of the random forest was affected when the respective peptides were excluded from the analysis, and a threshold was chosen for a minimum required “Mean Decrease Gini” value to select peptides. Elastic net (EN) used internal cross-validation for parameter estimation and automatically performed feature selection. The list of peptides selected by this method was determined by a frequency-based approach that returned all peptides that were observed in at least 10% of elastic net panels over 100 runs, where each run was based on 39 samples with two ME/CFS cases and two control samples removed at random in each run. The supervised methods returned three panels of peptide features: RL_panel, RF_panel, and EN_panel.
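
The frequency-based selection described for the elastic net can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: `frequency_select` and `top_mean_difference` are hypothetical names, the toy data are invented, and a simple "largest mean difference" ranking stands in for the elastic net fit itself. Only the resampling scheme (remove two cases and two controls at random per run, keep peptides seen in at least 10% of 100 runs) follows the text.

```python
import random
from collections import Counter


def top_mean_difference(cases, controls, k=1):
    """Stand-in selector (NOT elastic net): indices of the k peptides
    with the largest absolute difference in group means."""
    n_pep = len(cases[0])
    diffs = []
    for j in range(n_pep):
        case_mean = sum(s[j] for s in cases) / len(cases)
        ctrl_mean = sum(s[j] for s in controls) / len(controls)
        diffs.append(abs(case_mean - ctrl_mean))
    return sorted(range(n_pep), key=lambda j: diffs[j], reverse=True)[:k]


def frequency_select(cases, controls, select_fn,
                     n_runs=100, n_drop=2, min_freq=0.10, seed=0):
    """Keep peptides chosen in at least min_freq of the resampled runs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        # Each run removes n_drop cases and n_drop controls at random.
        kept_cases = rng.sample(cases, len(cases) - n_drop)
        kept_controls = rng.sample(controls, len(controls) - n_drop)
        counts.update(set(select_fn(kept_cases, kept_controls)))
    return sorted(p for p, c in counts.items() if c / n_runs >= min_freq)


# Hypothetical toy data: 6 cases, 6 controls, 4 peptides; only peptide 0
# differs systematically between the groups.
cases = [[2.0, 0.1 * i, -0.1 * i, 0.0] for i in range(6)]
controls = [[0.0, 0.1 * i, -0.1 * i, 0.0] for i in range(6)]

selected = frequency_select(cases, controls, top_mean_difference)
```

With the Discovery Set sizes given above (22 cases and 21 controls), removing two of each leaves the 39 samples per run mentioned in the text.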

The six panels (A–F) were then combined using different intersections and unions (Fig. 1) to define seven candidate peptide signatures (CPS): CPS001–CPS007. To characterize the predictive ability of these signatures, we calculated the area under the receiver-operating-characteristic curve (AUC) from signature scores defined by the mean signed log2 median-centered peptide abundance where the sign was determined by the sign of principal component 1 of the signature and known group labels [33].
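
One possible reading of this signature score is sketched below. It rests on assumptions that the text does not spell out: each peptide's sign is taken from its loading on principal component 1 of the signature, the overall orientation is then flipped if needed so that cases score higher than controls, and the AUC is computed as the fraction of case-control pairs ranked correctly. The function names and the three-peptide toy matrix are hypothetical, and PC1 is obtained by a plain power iteration to keep the example dependency-free.

```python
import math


def pc1_loadings(X, n_iter=200):
    """First principal axis of X (samples x peptides) by power iteration."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        proj = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        w = [sum(Xc[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def signature_scores(X, is_case):
    """Mean signed abundance per sample, oriented so cases score higher."""
    signs = [1.0 if w >= 0 else -1.0 for w in pc1_loadings(X)]
    scores = [sum(s * x for s, x in zip(signs, row)) / len(row) for row in X]
    case_scores = [s for s, c in zip(scores, is_case) if c]
    ctrl_scores = [s for s, c in zip(scores, is_case) if not c]
    if sum(case_scores) / len(case_scores) < sum(ctrl_scores) / len(ctrl_scores):
        scores = [-s for s in scores]  # orientation from known group labels
    return scores


def auc(scores, is_case):
    """Fraction of case-control pairs ranked correctly (ties count 0.5)."""
    case_scores = [s for s, c in zip(scores, is_case) if c]
    ctrl_scores = [s for s, c in zip(scores, is_case) if not c]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in case_scores for b in ctrl_scores)
    return wins / (len(case_scores) * len(ctrl_scores))


# Hypothetical toy signature: peptides 0 and 1 are higher in cases,
# peptide 2 is lower in cases (values are log2 median-centered).
X = [[1.0, 0.8, -1.0], [1.2, 1.0, -0.8], [0.9, 1.1, -1.1],
     [-1.0, -0.9, 1.0], [-1.1, -1.0, 0.9], [-0.8, -1.2, 1.2]]
is_case = [True, True, True, False, False, False]

scores = signature_scores(X, is_case)
toy_auc = auc(scores, is_case)
```

Signing each peptide before averaging keeps peptides that are lower in cases from cancelling those that are higher, so the toy groups separate completely.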

Validation Analyses and Signature Refinement

The seven candidate peptide signatures (CPS001–CPS007) were evaluated using the Validation Set, with the results used to select the most robust discovery signature. This signature was then further refined based on the ability of its individual peptides to separate samples from four different comparisons based on two-sample t tests (assuming unequal variances) using the limma package in R [34]. Our four comparisons were (i) 22 Canadian ME/CFS vs 21 Canadian control samples in the Discovery Set, (ii) six Canadian ME/CFS vs seven Canadian control samples in the Validation Set, (iii) 22 Norwegian ME/CFS vs seven Canadian control samples in the Validation Set, and (iv) 22 Norwegian ME/CFS vs six US control samples in the Validation Set. The final refined signature, CPS0001A, included only those peptide features whose p values for each of the four comparisons were less than 0.05. AUC values were derived from absolute values of peptide weights for principal component 1 (PC1) based on a PCA of peptide-standardized data (122,926 peptides). Higher AUC values indicated a stronger contribution of the tested signature to the separation of samples along PC1. In addition, PCA and hierarchical clustering approaches based on the proposed signature were used to cluster validation samples in a blinded fashion, without the use of group status.
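
The refinement rule, which keeps a peptide only if all four comparisons give p < 0.05, amounts to a simple filter. The sketch below assumes the per-comparison p values have already been computed (for example, by the t tests described above); the function name, peptide identifiers, and p values are invented for illustration.

```python
def refine_signature(p_values, alpha=0.05):
    """Return peptides whose p value is below alpha in every comparison.

    p_values maps a peptide identifier to its p values for comparisons
    (i) through (iv), in that order.
    """
    return [pep for pep, ps in p_values.items() if all(p < alpha for p in ps)]


# Hypothetical p values for three peptides across the four comparisons.
example_p_values = {
    "pep_A": [0.001, 0.020, 0.004, 0.030],  # passes all four
    "pep_B": [0.001, 0.200, 0.004, 0.030],  # fails comparison (ii)
    "pep_C": [0.040, 0.010, 0.050, 0.010],  # 0.050 is not below 0.05
}

refined = refine_signature(example_p_values)
```

Because the criterion is a strict inequality, a peptide with p = 0.05 in any single comparison is excluded.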