Our overall approach was as follows. We used two data sets, one as training data and one as validation data (Table 1). The training data set was used with exhaustive leave-one-out cross-validation for model selection to identify the best model from among seven models tested (Table 2). The seven models correspond to different approaches to representing immune receptor sequences. The model with highest classification accuracy by cross-validation was selected for application to the validation data set.

Table 1 Repertoire sequencing data sets used to develop and test the MS classifier. The number of patients in each study with each diagnosis is shown.

Table 2 Sequence Representations used for Model Selection. CDR3 sequences were cut into snippets of varying length and represented as DNA sequence, amino acid sequence, or Atchley factors [10]. Classification accuracy results are reported as the fraction of patients for which the model’s prediction of the diagnosis is correct.

The training data set consisted of 23 patients, 11 with RRMS and 12 with OND (2015 Study, Table 1). The validation data set consisted of 102 patients, 60 with RRMS and 42 with OND (2017 Study, Table 1). For both studies, B cell repertoires were collected and processed as described in [7]. Briefly, samples were collected from patient cerebrospinal fluid (CSF) (Fig. 1a), and VH4-containing BCR heavy chain genes were sequenced using next generation sequencing (Fig. 1b). VH4-containing heavy chains were targeted because previous studies found elevated VH4 expression in patients with RRMS [2, 7]. Sequence pre-processing was performed as described in Methods to identify complementarity determining region 3 (CDR3) sequences for input into our method.

Fig. 1 Study Overview (a) B cells are collected from patient cerebrospinal fluid. (b) DNA is extracted, and next generation sequencing is used to sequence immunoglobulin heavy chain loci expressing IGHV4 rearrangements. (c) Snippets of amino acid sequence taken from the CDR3 are converted into a set of chemical features using Atchley factors. (d) The chemical features are scored by a detector function. The detector function used in this study is the same function used in logistic regression. A positive diagnosis (for RRMS) is flagged whenever a high scoring snippet is found. Values for the weights on each Atchley factor as well as the bias term are determined by maximizing the likelihood of obtaining the correct diagnoses on a training set of patients.

Representing immune receptor sequences for statistical classification

We utilized the CDR3 sequence of each heavy chain gene because it is the somatically generated portion of the gene and the primary determinant of the antigen binding specificity encoded by the gene. To accommodate the varying length of CDR3, each CDR3 sequence was cut into snippets of equal length (i.e., k-mers). We considered snippet lengths of 2, 4, 5, 6, and 7 amino acids or codons. For each CDR3, the full set of overlapping snippets was used. We considered three different sequence representations: DNA sequence, amino acid sequence, and a representation based on Atchley factors (Fig. 1c). The five Atchley factors were derived from a set of over 50 amino acid properties by dimensionality reduction, which identified clusters of co-varying properties [10]. The five factors correspond loosely to polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. For the Atchley factor representation, each amino acid in a snippet is represented by a vector of its five Atchley factor values. We conducted model selection over seven combinations of snippet length and sequence representation (Table 2).
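The snippet extraction and encoding described above can be sketched in a few lines. This is an illustration only: the factor values below are rounded approximations for a handful of residues, and the illustrative CDR3 is not from the study data.

```python
# Sketch of cutting a CDR3 into overlapping 6-mers and encoding each one
# with Atchley factors. The values below are rounded approximations for a
# few residues; use the exact five-factor table from Atchley et al. [10].
ATCHLEY = {
    "A": (-0.59, -1.30, -0.73, 1.57, -0.15),
    "C": (-1.34, 0.47, -0.86, -1.02, -0.26),
    "R": (1.54, -0.06, 1.50, 0.44, 2.90),
    "S": (-0.23, 1.40, -4.76, 0.67, -2.65),
}

def snippets(cdr3, k=6):
    """All overlapping k-mers of a CDR3 amino acid sequence."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]

def encode(snippet):
    """Flatten the per-residue Atchley factors into one feature vector
    (length 5 * k, e.g. N = 30 for 6-mers)."""
    return [value for aa in snippet for value in ATCHLEY[aa]]

cdr3 = "CARSSASRAC"                  # illustrative CDR3 (length 10)
six_mers = snippets(cdr3)            # 10 - 6 + 1 = 5 overlapping snippets
features = [encode(s) for s in six_mers]
```

A CDR3 of length L thus yields L − k + 1 snippets, each represented by a 5k-dimensional feature vector.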

Scoring each sequence in a repertoire

Every snippet from every CDR3 sequence in a patient’s repertoire is scored by a detector function indicating if a snippet predicts RRMS. We use a logistic function because of its widespread use and simplicity, and because it models the outcome of a two-category process. The first step is to compute a biased, weighted sum of the snippet’s features, referred to as a logit.

$$ \mathrm{logit} = b_0 + W_1 \cdot f_1 + W_2 \cdot f_2 + \cdots + W_N \cdot f_N \qquad (1) $$

For the DNA and amino acid sequence representations, the values \( f_1 \) through \( f_N \) represent the snippet residues. For the Atchley factor representation, the \( f_i \) represent the five Atchley factors of each residue in the snippet; for snippets of length six, \( N = 30 \). The bias term \( b_0 \) and the weights \( W_i \) are the parameters of the model and are fit by maximum likelihood using gradient descent optimization, as described below. The same weights \( W_i \) and bias term \( b_0 \) are used for all snippets. Once the logit is computed, the value is passed through the sigmoid function to obtain a score between 0 and 1 (Additional file 1: Figure S1).

$$ \mathrm{score} = \frac{1}{1 + e^{-\mathrm{logit}}} \qquad (2) $$
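In code, scoring a single snippet (Eqs. 1 and 2) reduces to a dot product plus bias followed by the sigmoid. A minimal sketch, using arbitrary illustrative numbers rather than fitted weights:

```python
import math

def snippet_score(features, weights, bias):
    """Eq. 1: compute the biased, weighted sum (the logit);
    Eq. 2: pass the logit through the sigmoid for a score in (0, 1)."""
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))

# For a 6-mer under the Atchley representation, N = 5 * 6 = 30 features.
features = [0.1] * 30   # illustrative feature vector, not real Atchley values
weights = [0.2] * 30    # illustrative weights, not fitted values
score = snippet_score(features, weights, bias=-0.5)   # sigmoid(0.1)
```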

Aggregation of snippet scores to predict a diagnosis

A patient’s snippet scores need to be aggregated into a single value to form a diagnosis. Because only a small fraction of BCRs in a patient’s repertoire are expected to be disease related, it is necessary to capture a diagnosis even if only a few snippets have a high score. This is accomplished by assigning a positive diagnosis when even a single high scoring snippet is found (Fig. 1d). Assuming the output of the detector function represents a probability value between 0 and 1, the form of the model can be written as:

$$ P\left(\mathrm{positive\ diagnosis} \mid \mathrm{snip}_1, \mathrm{snip}_2, \mathrm{snip}_3, \dots \right) = \max\left(\mathrm{score}_1, \mathrm{score}_2, \mathrm{score}_3, \dots \right) \qquad (3) $$

A probability >0.5 indicates a positive diagnosis (RRMS), whereas a value <0.5 indicates an OND diagnosis.
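The aggregation rule of Eq. 3 is simply a maximum over snippet scores, thresholded at 0.5. A minimal sketch:

```python
def diagnose(snippet_scores, threshold=0.5):
    """Eq. 3: the repertoire-level probability is the maximum snippet
    score; a score above the threshold flags a positive (RRMS) diagnosis."""
    p = max(snippet_scores)
    return p, ("RRMS" if p > threshold else "OND")

# A single high-scoring snippet is enough for a positive diagnosis.
p, dx = diagnose([0.1, 0.2, 0.9, 0.3])
```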

Parameter fitting by gradient descent

Specific values for the weights \( W_i \) and bias term \( b_0 \) in the detector function are determined from the patient diagnoses: they are chosen to maximize the likelihood that each predicted diagnosis is correct. To search for the optimal values, gradient-based optimization techniques can be used. With these techniques, each parameter is iteratively adjusted along the gradient in the direction that increases the log-likelihood, which in turn maximizes the likelihood that each predicted diagnosis is correct. The initial value for the bias term \( b_0 \) is 0, and initial values for the weights are drawn at random according to \( W_i \sim \mathcal{N}\left(0, N_{\mathrm{features}}^{-1}\right) \). Because the Adam optimizer, a gradient descent based method, has been shown to work well on a wide range of optimization tasks, it is used here [11]. The Adam optimizer is run for 2500 iterations with a step size of 0.01; the other Adam settings are kept at their default values: \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \epsilon = 10^{-8} \).

A limitation of gradient descent based methods is that there is no guarantee of finding the globally optimal solution. Although the chosen detector function constitutes a linear model, the scores from the snippets are aggregated in a non-linear fashion, so multiple local optima could exist. To address this, \( 10^5 \) runs of Adam optimization are performed, each starting from different initial parameters \( W_i \sim \mathcal{N}\left(0, N_{\mathrm{features}}^{-1}\right) \), and the best-fit solution over all runs is used to diagnose new patients.
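The fitting procedure can be sketched end to end. The following is a self-contained numpy illustration, not the authors' implementation: it writes out the Adam update rule directly, propagates gradient only through each patient's maximum-logit snippet, and keeps the best of a few random restarts (far fewer than used in the study, for brevity). The synthetic two-patient data set exists only to make the sketch runnable.

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -30.0, 30.0)   # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(params, repertoires, labels):
    """Negative log-likelihood of the max-pooled logistic model and its
    gradient. Only each patient's maximum-logit snippet receives gradient."""
    w, b = params[:-1], params[-1]
    loss, grad = 0.0, np.zeros_like(params)
    for S, y in zip(repertoires, labels):   # S: (n_snippets, n_features)
        logits = S @ w + b
        i = int(np.argmax(logits))          # max logit <=> max score (Eq. 3)
        p = sigmoid(logits[i])
        loss -= y * np.log(p) + (1 - y) * np.log(1.0 - p)
        g = p - y                           # d loss / d logit
        grad[:-1] += g * S[i]
        grad[-1] += g
    return loss, grad

def fit_adam(repertoires, labels, n_features, rng,
             steps=2500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam run from a random initialization W_i ~ N(0, 1/n_features)."""
    params = np.append(rng.normal(0.0, n_features ** -0.5, n_features), 0.0)
    m, v = np.zeros_like(params), np.zeros_like(params)
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(params, repertoires, labels)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        params -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return loss, params

def fit_with_restarts(repertoires, labels, n_features, n_restarts=3, seed=0):
    """Keep the best-fit solution over several independently seeded runs."""
    rng = np.random.default_rng(seed)
    runs = [fit_adam(repertoires, labels, n_features, rng)
            for _ in range(n_restarts)]
    return min(runs, key=lambda r: r[0])[1]

# Tiny synthetic demo: the first repertoire (label 1) carries one
# distinctive snippet; the second (label 0) does not.
repertoires = [np.array([[1.0, 0.0], [0.2, 0.1]]),
               np.array([[0.0, 1.0], [-1.0, 0.0]])]
labels = [1, 0]
params = fit_with_restarts(repertoires, labels, n_features=2)
scores = [float(sigmoid(np.max(S @ params[:-1] + params[-1])))
          for S in repertoires]
```

Note that the loss returned by each run is the one evaluated on the last gradient step, which is sufficient for ranking restarts in a sketch like this.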

Development of the MS classifier: model selection and validation

We applied the above-described approach to our training data set of 23 patients using exhaustive leave-one-out cross-validation (Fig. 2a). Classification accuracy on the held-out patients was used to identify the best performing model from among the seven models tested (Table 2). A snippet size of 6 amino acid residues resulted in the highest classification accuracy, and categorical representations of the DNA nucleotides and amino acid residues both underperformed the Atchley factor representation. The best performing model correctly diagnosed 20 out of 23 patients (Table 2, Fig. 3a).

Fig. 2 Workflow for Model Selection and Parameter Fitting. (a) The diagram shows how training data is used to train and evaluate multiple hypotheses. The model that gives the best classification accuracy on the exhaustive 1-holdout cross-validation constitutes the lead hypothesis. (b) The diagram shows how data is used to train and test the lead hypothesis. The best performing model is refitted to all the samples in the training data, and then used to score samples from the validation data set.

Fig. 3 Classification Accuracy and Receiver Operating Characteristic (ROC) Curve. (a) Classification accuracy for the best performing model obtained via exhaustive 1-holdout cross-validation on training data. 87% of patients were correctly classified. (b) Classification accuracy of the best performing model on the validation data. 72% of patients were correctly classified. (c) The corresponding ROC curve shows true and false positive rates for different thresholds of a positive diagnosis based on the highest snippet score. The area under the curve is 0.75.

To estimate the probability of correctly classifying 20 of 23 patients by chance using our best performing model, we performed a permutation analysis. In each of 20 permutations, patient diagnoses were randomly permuted and the leave-one-out cross-validation procedure was repeated. The resulting classification accuracies are shown in Table 3. All were lower than 20 out of 23; because none of the 20 permuted labelings matched the observed accuracy, we can assign p < 0.05 to the observed result (p < 1/21 ≈ 0.048). The average accuracy over all permutations was 40.4%.
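The permutation analysis follows the standard recipe: shuffle the diagnoses, rerun the cross-validation, and compare the permuted accuracies to the observed one. A sketch, where `cross_validated_accuracy` is a hypothetical stand-in for the full leave-one-out pipeline:

```python
import random

def permutation_p_value(labels, observed_accuracy, cross_validated_accuracy,
                        n_permutations=20, seed=0):
    """Shuffle the diagnoses, rerun cross-validation on each shuffled
    labeling, and report the fraction of runs at least as accurate as the
    observed one (with the add-one correction, which counts the observed
    labeling itself)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        permuted = list(labels)
        rng.shuffle(permuted)
        if cross_validated_accuracy(permuted) >= observed_accuracy:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# With 20 permutations all scoring below the observed accuracy, the
# smallest reportable p-value is 1/21, i.e. p < 0.05.
```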

Table 3 Classification accuracies observed by 1-holdout cross-validation after permuting diagnoses in our training data set. Results reported as the fraction of samples for which the model’s prediction of the diagnosis matches the label assigned under permutation.

To determine whether the best performing model generalizes to unseen data, the full 23-patient training set was used to fit the weights and bias term, and the resulting model was applied to score a validation data set of 102 patients (Fig. 2b). The model correctly diagnosed 73 out of 102 patients, corresponding to an accuracy of 72% (Fig. 3b). The ROC curve for the validation data is shown in Fig. 3c; the area under the curve is 0.75.

Analysis of the diagnostic biochemical motif

To discern the biochemical features of snippets resulting in a positive diagnosis, we examined the weights of the best performing model with parameters fit on the full 23-patient training set. The weights reveal how each Atchley factor contributes to the score and the relative importance of each position (Fig. 4). We observe relatively large, negative weights along almost every position of the snippet for Atchley factors II and IV, indicating a high probability of an RRMS diagnosis for snippets with negative values for these two Atchley factors. In particular, we notice large negative weights for factor II for positions 1 and 5 and for factor IV for positions 1, 3, and 4. A negative value for Atchley factor II correlates with amino acid residues that appear frequently in α-helical segments. A negative value for Atchley factor IV correlates with amino acid residues less commonly used and having high heat capacity and refractivity. The weights for the other Atchley factors are position-dependent. We observe relatively large positive weights for Atchley factor I at position 1 and for Atchley factor V at position 3. We also observe relatively large negative weights for positions 1 and 3 for Atchley factor III. This indicates increased probability of an RRMS diagnosis for snippets with large, positively charged, hydrophilic residues at snippet positions 1 and 3.

Fig. 4 Illustration of the Classifier Weights. For each of the five Atchley factors, the weights for the model fit on all 23 training samples are shown for the six residue positions. Positive weight values are shown in red pointing up, and negative weight values are shown in blue pointing down. The length of the arrow corresponds to the weight’s magnitude.

We next aligned the highest scoring snippet from each patient to determine where within CDR3 the diagnostic snippet is positioned (Fig. 5). We find that the highest scoring snippets can be located anywhere along CDR3. Although the snippet sequences do not align well, patterns are observable in their Atchley factors, which are shown next to each snippet (Fig. 5). Consistent with the values of the weights, we observe a tendency toward hydrophilicity at snippet position 1, toward α-helical values at position 5, toward high heat capacity and refractivity at positions 1 through 4, and toward negative charge at position 6.

Fig. 5 The Highest Scoring Snippet from each Patient in the Training Data Set. The snippets were scored with the model trained on all 23 subjects. (Left) Location of the snippet is shown in its CDR3 sequence using yellow highlighting. (Right) The Atchley factor values are shown for each snippet in the five boxes. Each box corresponds to one Atchley factor. The columns in each box correspond to the snippet positions.

We next looked at the distribution of snippet scores in the 23 patients of our training data set (Fig. 6). Only 27 of 3259 snippets score above 0.5 (the threshold for a RRMS diagnosis), and all of these were from RRMS patient repertoires. Each RRMS patient had no more than 5 snippets that scored above the threshold.

Fig. 6 Histograms of Snippet Scores for all Snippets in the Training Data Set. The snippets were scored with the model trained on all 23 subjects. The red bars indicate the distribution of snippet scores from RRMS patients. The blue bars indicate the distribution of snippet scores from OND patients. Only a few snippets score above 0.5, which is diagnostic of RRMS.

To determine whether the rarity of high scoring snippets in our patient repertoires can be attributed to the likelihood of the corresponding DNA sequences arising by chance in V(D)J recombination junctions, we examined the DNA encodings of each snippet. For each amino acid sequence, there are many possible DNA encodings. An example of how to calculate the total number of encodings for a single snippet is shown in Fig. 7a. The distribution of the total number of ways to encode each snippet is shown for non-diagnostic snippets in Fig. 7b and for diagnostic snippets in Fig. 7c. We find that the diagnostic snippets identified by the model have significantly fewer possible encodings than non-diagnostic snippets (p = \( 7.41 \times 10^{-8} \)). Under the naïve assumption that CDR3 sequences are generated at random, the RRMS diagnostic snippets would be some of the least likely to occur.
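The encoding count illustrated in Fig. 7a is the product of each residue's codon degeneracy under the standard genetic code, which can be computed directly:

```python
from math import prod

# Codon degeneracy of each amino acid under the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def n_dna_encodings(snippet):
    """Number of distinct DNA sequences encoding an amino acid snippet."""
    return prod(DEGENERACY[aa] for aa in snippet)

# e.g. "MW" has exactly one encoding, while "LS" has 6 * 6 = 36.
```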