Abstract Diagnosis of psychiatric disorders based on brain imaging data is highly desirable in clinical applications. However, a common problem in applying machine learning algorithms is that the number of imaging data dimensions often greatly exceeds the number of available training samples. Furthermore, interpretability of the learned classifier with respect to brain function and anatomy is an important, but non-trivial issue. We propose the use of logistic regression with a least absolute shrinkage and selection operator (LASSO) to capture the most critical input features. In particular, we consider application of group LASSO to select brain areas relevant to diagnosis. An additional advantage of LASSO is its probabilistic output, which allows evaluation of diagnosis certainty. To verify our approach, we obtained semantic and phonological verbal fluency fMRI data from 31 depression patients and 31 control subjects, and compared the performances of group LASSO (gLASSO), and sparse group LASSO (sgLASSO) to those of standard LASSO (sLASSO), Support Vector Machine (SVM), and Random Forest. Over 90% classification accuracy was achieved with gLASSO, sgLASSO, as well as SVM; however, in contrast to SVM, LASSO approaches allow for identification of the most discriminative weights and estimation of prediction reliability. Semantic task data revealed contributions to the classification from left precuneus, left precentral gyrus, left inferior frontal cortex (pars triangularis), and left cerebellum (c rus1). Weights for the phonological task indicated contributions from left inferior frontal operculum, left post central gyrus, left insula, left middle frontal cortex, bilateral middle temporal cortices, bilateral precuneus, left inferior frontal cortex (pars triangularis), and left precentral gyrus. The distribution of normalized odds ratios further showed, that predictions with absolute odds ratios higher than 0.2 could be regarded as certain.

Citation: Shimizu Y, Yoshimoto J, Toki S, Takamura M, Yoshimura S, Okamoto Y, et al. (2015) Toward Probabilistic Diagnosis and Understanding of Depression Based on Functional MRI Data Analysis with Logistic Group LASSO. PLoS ONE 10(5): e0123524. https://doi.org/10.1371/journal.pone.0123524 Academic Editor: Michael Breakspear, Queensland Institute of Medical Research, AUSTRALIA Received: June 3, 2014; Accepted: March 4, 2015; Published: May 1, 2015 Copyright: © 2015 Shimizu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited Data Availability: The original data for this study is held confidential due to the involvement of patient data. It can be obtained upon request from the Department of Psychiatry and Neuroscience, Hiroshima University, Japan (primary contact: Shigeto Yamamwaki, PhD, MD, yamawaki@hiroshima-u.ac.jp). Funding: This study is the result of “Integrated Research on Neuropsychiatric Disorders” carried out under the Strategic Research Program for Brain Sciences by the Ministry of Education, Culture, Sports, Science and Technology of Japan (http://brainprogram.mext.go.jp/missionF/). All authors of this study are supported by this program. This program does not have a grant number. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Major depressive disorder (MDD) belongs to the mental, neurological, and substance-abuse diseases (MNS) currently regarded as significant challenges in global mental health [1]. Due to the complexity and variety of symptoms, MDD diagnosis requires time-consuming interviews, which rely heavily on the clinical experience of the doctor, as well as on patient cooperation. Objective diagnosis methods based on biological markers have yet to be found. The aim of this study is to corroborate development of a diagnostic method for MDD and other mental disorders by applying machine learning algorithms to functional brain imaging (fMRI) data. Recent imaging studies show that task-related brain activation of MDD patients, as well as brain activation during rest, differs significantly from that of healthy controls [2–6], thus encouraging the diagnosis of MDD from brain imaging data using statistical machine learning algorithms (see e.g. [5, 7]). This idea is supported by the emerging field of computational psychiatry, which emphasizes the integrative, explanatory role of computational ideas in neuroscience and the impact it could have on assessing mental illnesses [8]. However, a major obstacle in applying statistical machine learning methods to brain imaging data is that the dimension of input variables (voxels) considerably exceeds the number of training samples (subjects), resulting in over-fitting. Often Support Vector Machine (SVM) [7, 9–11] is used to overcome this problem. It can largely prevent over-fitting using the principle of large margin separation [12, 13]. However, a shortcoming of SVM lies in the assignment of weights to all input features. The discriminative relevance of individual features (here, neural activation in each voxel of the imaging data) is difficult to interpret. A further limitation of deterministic classification algorithms such as SVM, is that they do not provide a measure of classification reliability. This has also been pointed out in Nouretdinov et al. [14]. In consideration of the above, we propose the application of sparse logistic regression with a least absolute shrinkage and selection operator (LASSO [15, 16]). Standard LASSO limits the number of effective variables through regularization of the L1 norm of the attributed weights. The regularization penalty is controlled through a regularization parameter (λ S ). Standard LASSO finds solutions based on a minimum number of features, but allows for selection of isolated voxels. This compromises robustness with respect to local variation in inter-subject brain function. As previous functional brain imaging studies have shown, brain activity in certain brain areas such as thalamus and frontal lobe is altered in depression patients [6, 17]. We aim to extract brain regions, rather than individual voxels. For this purpose, we propose the use of group LASSO [18], which constrains groups of features. Regularization of the number of groups is thereby subject to the Euclidian norm of weights in each group. As with standard LASSO, regularization strength can be controlled by a parameter (λ G ). This approach facilitates interpretation of the learned classifier, since the remaining features inherently comprise groups. Defining voxel groups according to known functional and anatomical brain areas, we expect group LASSO to reveal brain areas critical for depression diagnosis. Feature-wise regularization and group-wise regularization can be employed at the same time, so that the number of voxel groups (brain areas), as well as the number of individual voxels, is sparsified. The algorithm hence depends on a pair of regularization parameters (λ = (λ S , λ G )) and is then referred to as sparse group LASSO. We verified the performance of group LASSO (gLASSO), and sparse group LASSO (sgLASSO) in comparison to that of standard LASSO (sLASSO) in the analysis of fMRI data obtained from depression patients and age-matched healthy control subjects. Since executive dysfunction is a neuropsychological constituent of depression [19], fMRI experiments were based on semantic and phonological verbal fluency tasks, in which depression patients are known to perform poorly [10, 20–23]. Moreover, Bom de Araujo et al. [24] have demonstrated that disease severity has a direct impact on verbal fluency, regardless of age, educational level, or gender. This fact is beneficial for unbiased population results. We also compared LASSO results to those of SVM [12, 13] and the Random Forest algorithm [25].

Materials and Methods This study was approved by the Research Ethics Committee at the Okinawa Institute of Science of Technology as well as the Research Ethics Committee of Hiroshima University (permission nr. 172). Written consent was obtained from all subjects participating in the study (approved by the Research Ethics Committee of the Okinawa Institute of Science and Technology and the Research Ethics Committee of Hiroshima University). Subjects Thirty-one drug naive, i.e. first time diagnosed, patients (age 26–63, average 38:81 ± 9:76, 16 male) with major depression disorders were recruited by the Psychiatry Department of Hiroshima University and collaborating medical institutions, based on the Mini-international neuropsychiatric interview (M.I.N.I [21]), which enables doctors to identify psychiatric disorders according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV [26]). As a control group, 31 persons (ages 23–63, average 33.45 ± 12, 15 male) with no history of mental or neurological disease, were recruited by advertisement in local newspapers. All controls underwent the same self-assessment and examinations administered to the test group. Subjects of both groups completed the Japanese version of the National Adult Reading Test [27]. Data Acquisition & Task fMRI measurements were performed at Hiroshima University on a 3T GE Signa HDx scanner with a 2D EP/GR (TR = 3s, TE = 27ms, FA = 90deg, matrix size 64 × 64 × 32, voxel size 4 × 4 × 4mm, no gap, interleaved). As mentioned above, subjects underwent two-block designed verbal fluency tasks known to pose difficulties for depressive patients. Structural T1 images were acquired after the fMRI experiments for correction of head position changes in the subsequent analysis (IRP FSPGR, TR = 6.824ms, TE = 1.9ms, FA = 20deg, FOV 256mm, matrix size 256 × 256 × 180, voxel size 1 × 1 × 1mm). Semantic Verbal Fluency. After an initial rest period of 30 seconds, a categorical word (e.g., furniture) was presented to each participant for 2500ms (Fig 1). A fixation cross was presented for the next 500ms. Subjects were asked to find a word matching the given category (e.g., table) and press a button once they had uttered the chosen word in their minds. After five consecutive trials repeating the same categorical word, five trials employing a different categorical word followed. Under control conditions, subjects were presented two words selected from a certain category (e.g., table), five times each. Subjects had to repeat each word in their minds and were again asked to press a button once they had done so. Nine seconds of blank screen indicated the end of the task and control blocks. This whole sequence was repeated three times and required approximately four minutes, during which 94 volumes of the whole brain were acquired. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 1. fMRI Semantic Verbal Fluency Task. The experiment consisted of three task and three control blocks. In the task condition, subjects were presented with a categorical word (semantic task) or syllable of the Japanese alphabet (phonological task) for 2500ms (e.g. occupation) for which they had to find a matching word, utter it in their minds and press a button. A white cross displayed for 500ms indicated the end of the trial. This was repeated five times each for two words referring to two different categories in the semantic task and two different syllables in the phonological task. In the control condition, the button press occurred after repeating the displayed word in the mind. Two different words were displayed five times each. The end of each block was indicated by 9 seconds of blank screen. https://doi.org/10.1371/journal.pone.0123524.g001 Phonological Verbal Fluency. The setting of the phonological verbal fluency task was identical to that of the semantic task. Instead of categorical groups, a syllable of the Japanese alphabet was presented. Subjects then had to think of a word beginning with that syllable and to repeat it in their minds. Subjects were asked to press a button immediately thereafter. Data Preprocessing For each subject, the first five volumes of each time series were discarded so as to allow for magnetic field equilibrium. The remaining volumes were processed with SPM8 (Wellcome Trust Centre for Neuroimaging, UCL, London), using following standard procedure: After slice timing correction, motion correction, co-registration to anatomical MRI, normalization with standard brain template, and smoothing (kernel width 8 × 8 × 8mm), a model conforming to the task was specified and the contrast between task and control conditions was evaluated. Z-scores that exceeded the absolute value of 5 were considered outliers and corresponding voxels were discarded from the dataset. Z-scores of voxels which could not be labeled by Automated Anatomical Labeling software (AAL [28]), provided by SPM8, were also excluded from the analysis. Vectorized subject volumes were assembled to a matrix of size 62 × 65280, in which a row represented brain activity Z-scores in the voxels of a given subject. Columns (voxels) that were zero for all subjects were discarded, leaving a data matrix of size 62 × 14055 (= number of subjects × number of remaining voxels). In the following, we refer to the remaining voxels as features and the resulting matrix as the feature matrix. The latter was normalized voxel-wise and served as input for all classification algorithms applied in this study, LASSO regressions as well as SVM and Random Forest algorithms. For combined semantic and phonological data, we simply concatenated both feature matrices. For group LASSO, we defined voxels belonging to the same brain area as a group. Brain areas themselves were determined and labeled using the MNI standard brain template provided by the AAL toolbox. Out of 116 brain areas defined in this atlas, fMRI data covered 105 areas. Areas not covered by data were as follows: bilateral superior frontal orbital gyri, bilateral middle frontal orbital gyri, left olfactory gyrus, left medial frontal orbital gyrus, bilateral rectus, left middle temporal pole and bilateral cerebellum (pars10). For the combined verbal fluency data we therefore arrived at 210 groups (105 brain areas for each dataset). Classification Algorithms For classification, we considered three variants of logistic regression using LASSO [15, 16], SVM [12], and Random Forest [25]. LASSO logistic regression. Let x i = (x i1 , ⋯, x id ) ∈ ℝd be the vector of values representing brain activation in each voxel of the i-th subject (hereafter referred to as feature vector, i = 1 ⋯ n, n the number of subjects). The binary label y i ∈ {±1} indicates whether the subject belongs to the control (y i = −1) or the patient group (y i = +1). Logistic regression predicts the probability of the label y from the corresponding feature vector x, and is defined as (1) where xT is the transpose of the vector x. w = (w 1 , ⋯, w d ) ∈ ℝd is a vector representing the contribution weights of each element in the vector x, and determined such that it fits the given dataset D = {(x i , y i )|i = 1, ⋯, n}. This can be achieved by minimizing the negative mean log-likelihood: (2) However, the minimizer is not well defined for d ≫ n, as is the case in our study. The Least Absolute Shrinkage and Selection Operator (LASSO) [15, 16] is a regularization technique to restrict the number of non-zero elements in the minimiser and to make the solution unique. Consider G partitions on x, then the weight vector w is determined so as to minimize (3) where ‖ w ‖ 1 = ∑ j = 1 d | w j | and ℐ g,j is the indicator function such that ℐ g,j = 1 if the j − th voxel belongs to the g-th partition and ℐ g,j = 0 otherwise. The parameters λ S , λ G ≥ 0 adjust the balance between model fitting precision (first term in Eq 3), voxel-wise sparseness (second term), and group-wise sparseness (third term). λ S > 0 and λ G > 0 imposes both voxel-wise and group-wise sparseness. We refer to this case as sparse group LASSO (sgLASSO). For λ S > 0 and λ G = 0 the above therefore yields the standard sparse LASSO (sLASSO) while λ S = 0 and λ G > 0 defines the group LASSO (gLASSO). The advantage of group LASSO is that prior knowledge of possibly related input features can be incorporated by suitably setting the indicator function ℐ g,j . We exploit this advantage to incorporate functional localization within brain regions. Input to the algorithm consists of a feature matrix formed from standardised brain activation Z-scores x i of 62 subjects, where 31 rows represent control subjects (label y i = −1) and 31 rows depressive patients (label y i = +1). 105 brain areas were considered in the analysis, i.e., G = 105 and ℐ g,j = 1 if the j-th voxel belongs to the g-th brain area. Once the weight vector w is determined, the logistic regression model can be used as a discriminant function for the binary classification (in the fashion of the most probable explanation, a special case of the maximum a-posteriori probability). For each feature vector x, the label y is determined as follows: Unless specified otherwise, we evaluated classification performance in our study based on this discriminant function. Support Vector Machine. The SVM training algorithm is a non-probabilistic binary classifier that represents samples as points in space, so that samples of different categories are separated by a margin as large as possible (large margin principle of separation [13]). Here we use parameter-free linear (hard margin) SVM [12]. Random Forest. The Random Forest algorithm is an ensemble learning method that constructs a multitude of decision trees and yields the mode of the class output by individual trees as result [25]. Performance Criteria Parameter tuning of LASSO algorithms was based on the mean log likelihood μ log L , a standard performance measure for probabilistic models: (4) For better visualization of the class probability distributions, we calculated the logarithmic odds ratio log ( P ( y = + 1 | x ) P ( y = − 1 | x ) ) and normalized it by the maximum logarithmic odds ratio of all test data. In the following, we refer to these values as normalized discriminative scores. For performance comparison of probabilistic LASSO models with non-probabilistic models such as SVM, we used the four criteria defined below. We defined patients who were correctly classified as depressive, as true positives (TP), and those incorrectly classified as healthy, as false negatives (FN). We referred to control subjects who were correctly classified as healthy, as true negatives (TN), and those incorrectly classified as depressed, as false positives (FP). Sensitivity and Specificity evaluate the probability of correct diagnosis of patients and control subjects, respectively: (5) (6) Accuracy and Fscore evaluate overall performance. The Fscore provides a more rigorous measure when the number of control subjects is larger than that of patients, in which case, true negatives can dominate the Accuracy measure. (7) (8) where P r e c i s i o n = # T P # T P + # F P , is the probability that a subject classified as a patient really is a patient and R e c a l l = # T P # T P + # F N (= Sensitivity), the probability that a patient is classified as a patient. Parameter Tuning and Performance Evaluation Classification performance of the algorithms was evaluated in a 10-fold cross-validation. In order to account for model variability, the procedure was repeated 100 times, each time pseudo-randomly shuffling sample contributions to the training and test datasets. For SVM and Random Forest, conventional 10-fold cross-validation was used. For LASSO algorithms, cross-validation was conducted in a nested manner as described below, in order to account for parameter validation (Fig 2). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 2. Nested 10-fold cross-validation. 10% of the data were defined as test dataset (TestData1) while the remaining 90% (TrainingData1) were used to evaluate the best performing regularization parameter λ opt (= λ Sopt in the case of sLASSO, = λ Gopt in the case of gLASSO, = (λ Sopt , λ Gopt ) in the case of sgLASSO) in a 10-fold cross-validation (inner 10-fold cross-validation). Parameters thus determined were then used to train TrainingData1 and to evaluate the final performance using TestData1. The same procedure was repeated ten times, altering the set of train and test data each time, so as to complete a 10-fold cross-validation (outer 10 fold cross-validation). https://doi.org/10.1371/journal.pone.0123524.g002 Nested 10-fold cross-validation. The first cross-validation (inner cross-validation) served to optimize regularization parameters, while the second cross-validation (outer cross-validation) evaluated predictive performance when using (optimal) parameters obtained in the first cross-validation to a new dataset. For sLASSO 200 logarithmically distributed values of λ S in the range of [0.01 0.2512] were evaluated, while for gLASSO we chose 100 logarithmically distributed values of λ G in the range of [0.1 0.6310]. For sgLASSO we used 200 combinations of 20 logarithmically distributed values of λ S and 10 logarithmically distributed values of λ G in the range of [10−6 0.1] and [0.0631 0.3162], respectively. The approximate range of the parameters was determined by grid search. In detail, the procedure consisted of following twelve steps: Data from controls and patients were divided into 10 non-overlapping sets with approximately equal number of subjects. One of these datasets was taken as test data (TestData1) and the rest were used as training data (TrainingData1). Step 1 was repeated for TrainingData1, i.e., it was divided into 10 separate and non-overlapping sets. One of the datasets produced in Step 3 was used as test data (TestData2) and the rest were used as training data (TrainingData2). Models were trained for a range of parameters using TrainingData2. Class probabilities for TestData2 were predicted using the resulting models. Steps 4 to 6 were repeated ten times, each time defining different datasets as TestData2 and TrainingData2. The mean log likelihood (μ logL ) was evaluated for each setting of the parameters and test dataset and the optimal parameter (λ Sopt , λ Gopt and λ opt = (λ Sopt , λ Gopt ) for sLASSO, gLASSO and sgLASSO respectively), for which μ logL is maximized was determined. A model was estimated using TrainingData1 and λ max . Class probabilities for TestData1 were predicted. Steps 2 to 9 were repeated ten times, each time choosing a different dataset from the 10 sets produced in Step1, as TestData1. Prediction performance was evaluated. Fig 2 illustrates the procedure. As for SVM and Random Forest, this procedure was repeated 100 times, each time pseudo-randomly shuffling the combination of data in Step1. All routines were programmed in Matlab (Mathworks). For SVM and Random Forest, functions provided in the Matlab optimization Toolbox were used while for LASSO algorithms we made use of the sparse learning package (SLEP) provided by Liu et al. [29].

Discussion We applied three variants of logistic regression LASSO and two deterministic classification algorithms, SVM and Random Forest, to fMRI data from semantic and phonological verbal fluency tasks to demonstrate the advantages of group LASSO for classification of fMRI data, specifically the ability to identify relevant brain regions of interest. More than 90% accuracy was achieved in distinguishing depression patients from matched healthy control subjects when applying gLASSO, sgLASSO, and SVM to the combination of the two datasets. Our results surpass classification performance in similar studies involving depression related fMRI and SVM (65% to 86% [9, 10, 14]). Similarly accurate prediction results with respect to depression-related MRI data, to the best of our knowledge have only been achieved by Mwangi et al. [31]. However, in contrast to our study, the latter authors relied on structural MRI data from patients considered to be chronically depressed. For such patients, structural brain changes are very likely [32]. Depression studies based on whole structural brain data are the most common and are reviewed in [33]. Except for the study mentioned above [31], the maximum classification accuracy between healthy controls and depression patients in other relevant studies was 70%. Considering that in general, classification accuracy for treatment responsive and non-responsive subjects is generally 80% [33], we assume that the classification success of [31] results from the homogenous, chronically depressed patient group. Kipli et al. also provide a comparative study on classification accuracy for various brain volume attributes and varying combinations of machine learning algorithms for feature selection and classification [34]. Here, the maximum classification accuracy for depression reached 85.3%. However, the evaluation of the algorithms was solely based on classification accuracy for preselected brain volumetric attributes. Sparse algorithms were not considered. sLASSO vs gLASSO/sgLASSO Since sLASSO prediction relies on mostly spatially isolated voxels (Fig 8a), activities shifted to adjacent voxels due to inter-subject variability can lead to classification errors. In contrast, gLASSO and sgLASSO estimate class label probability from values of spatially continuous voxels (Fig 8b and 8c). These voxels not only indicate brain regions of interest, thereby facilitating interpretation of results; brain activity shifted due to inter-subject variability is likely to be retrieved by neighboring weights. This argument is supported by the small number of commonly selected voxels over different runs of the cross-validation (Table 3). In the application of sLASSO to semantic and phonological task data, the number of voxels selected in at least one of the evaluated models was 450 and 478, selected from 65 and 71 brain areas, respectively. Only a single voxel was selected commonly in more than 80% of the models in the cross-validation. In contrast, the highest occurrence rate in gLASSO and sgLASSO reached 100%. More than 500 voxels were selected in more than 80% of the models for each of the algorithms and datasets. These numbers increased when the two datasets were combined (Table 3). gLASSO vs sgLASSO The relative performance of gLASSO and sgLASSO varied depending on the datasets. Both algorithms attributed the highest weights to the same brain areas and even the same locations within these brain areas (Fig 8b and 8c). While sgLASSO can be more effective in selecting truly relevant voxels, a more relaxed voxel selection with gLASSO might be advantageous when taking variability in individual brain areas and activation into account. Also, sgLASSO relies on two regularization parameters, λ S for voxel sparseness and λ G for group sparseness, which makes fine parameter tuning and optimization computationally demanding. A preferable procedure would thus be to use gLASSO to achieve good classification performance without heavy computing needs for parameter optimization and use simple thresholding of weights as a practical way to determine the most relevant voxels within selected brain areas. Alternatively, if enough test data is available, sgLASSO can be applied using only voxels of brain areas preselected with gLASSO. This is preferable when handling large datasets. gLASSO vs SVM Both gLASSO and sgLASSO achieved better accuracy and Fscore than SVM for semantic task data. Their performance was slightly, but significantly surpassed by SVM for the phonological task and combined data. However, as mentioned above, gLASSO and sgLASSO offer other benefits, which are highly advantageous in clinical practice. The major advantage is the straight-forward interpretation of weights attributed to features. SVM constructs weights by linear combination of support vector data, which results in non-zero weights for all voxels (Fig 8(d)). This makes it difficult to draw conclusions about brain areas relevant to diagnosis. In contrast, gLASSO as well as sgLASSO reveal discriminating brain areas by reducing the number of voxels to those most relevant. In the case of gLASSO this is done in a group-wise manner, while in the sgLASSO, the number of voxels within groups is reduced as well. Brain areas contributing to the prediction were in agreement with their known functions and their relationships to verbal fluency and symptoms of depression. The left precuneus, for example, is generally interpreted as a hub to the prefrontal lobe [35], a connection that affects motivation, planning, social behavior, and speech production [36]. It is also part of the default mode network, that evidence suggests is altered in depressive patients. Moreover, Krug et al. [37] found in a study with a similar verbal fluency task, that the left precuneus was more activated in healthy subjects carrying an allele variant of a specific gene, found to be overrepresented in patients suffering from bipolar disorder, MDD, or schizophrenia. Similarly, the left precentral gyrus is explicitly involved in semantic tasks [38], as well as emotion processing [14]. However, the true relation between indicated brain areas and depression remains to be investigated, especially since verbal fluency is impaired in a variety of mental diseases. On the other hand, this implies that the same experiment conducted with patients with other diseases, could reveal either differences or coincidences in the neural origin of common symptoms. The results of such experiments would be of high clinical value and could be revealed by extending the LASSO approach to a multi-class problem. Another advantage of the LASSO algorithm is its graded, probabilistic prediction. In the presented study, for example, when applying gLASSO or sgLASSO to the combined data, classification was only incorrect for subjects with an absolute normalized discriminative score lower than 0.2 (Fig 6). That means that the diagnosis of a subject as depressed can be considered certain if the normalized discriminative score is higher than 0.2. Similarly, a subject can certainty be considered diagnosed as healthy if the normalized discriminative score is lower than −0.2. Remarkably, false negative classification occurred only for patients whose condition improved significantly over the following six weeks, so we can assume that patients with normalized discriminative scores < 0.2 are likely to recover. Alternatively, we can define a separate measure of confidence based on the overlap of the two discriminative score distributions. However, these assumptions have to be confirmed using more extensive datasets. Mwangi et al. [31] as well as Nouretdinov et al. [14] have already drawn attention to the advantage of probabilistic prediction. Mwangi et al. illustrated this by comparing SVM and Relevance Vector Machine, which has an identical functional form to SVM, but uses Baysian inference to achieve probabilistic classification [39]. However, as with SVM, weights are difficult to interpret with respect to the most relevant brain areas. In addition, computation time can increase considerably with increased data size [39]. Nouretdinov et al. propose transductive conformal predictors (TCP, [14, 40]) to provide the classification model with a confidence measure. In contrast to our approach, where the confidence level of a prediction can be deduced from the overlap of discriminative score distributions, TCP provides a confidence level for each prediction based on the relative number of training samples that differ markedly from the test sample in terms of a certain non-conformity measure. However, even though all predictions with a confidence level of 95% were correct, the number of certain predictions (predictions for which the output predictor contained only one label) was 0. This type of dual output is difficult to interpret compared to discriminative score distributions provided by the LASSO method. Further, TCP is based on pre-selection of voxels via two sample t-test. This step is not necessary in the application of LASSO and prevents potentially useful information from being discarded. The present study demonstrates this effect. Feature selection with a two-sample t-test of semantic verbal fluency data would have left us with no features to which the LASSO algorithm could be applied. Similarly, results of phonological data show that LASSO picked up voxels from brain regions that were not indicated as significantly different in the two-sample t-test (frontal, post central and temporal areas). Summary In summary, we verified that gLASSO and sgLASSO are superior to sLASSO in terms of classification robustness and preferable to the commonly used SVM with respect to feature selection, i.e., identification of relevant brain areas, and probabilistic prediction. The large number of input features is successfully handled without the need of low-dimensional feature extraction. Topographic continuity of non-zero weights can be achieved by adequately grouping input features, thus elucidating discriminative brain areas. Finally, a measure of diagnosis reliability is provided by the discriminative score. We therefore found that gLASSO and sgLASSO are preferable for classification of depression-related fMRI data and identification of relevant brain areas. Further investigation of these brain areas may contribute to the establishment of new prevention and therapeutic programs. This study is the first to explicitly control for data from patients who had not previously been diagnosed with depression. Sparse classification algorithms for imaging data have independently been considered in two recent studies concerning Alzheimer’s and mild cognitive impairment [41, 42]. Liu et al. [41] raised similar concerns about standard LASSO as presented here and proposed a tree-guided approach combined with SVM to achieve spatial feature grouping. Xin et al. [42] presented a fast, scaleable, generalized, fused LASSO algorithm for the same problem, which could provide an alternative to sgLASSO. However, this remains to be investigated (Sparse group LASSO is also used in studies by Zhou et al. [43], Tsao et al. [44] and [45] but concerns temporal grouping of progressive stages rather than grouping of structural features.).

Acknowledgments This study is the result of “Integrated Research on Neuropsychiatric Disorders” carried out under the Strategic Research Program for Brain Sciences by the Ministry of Education, Culture, Sports, Science and Technology of Japan. Thanks to Dr. Steven D. Aird, technical editor of the Okinawa Institute of Science and Technology, for thorough editing and proofreading. Thanks to Dr. Masashi Sugiyama and Dr. Ryota Tomioka for technical advice on LASSO algorithms.

Author Contributions Conceived and designed the experiments: YS ST YO S. Yamawaki. Performed the experiments: MT S. Yoshimura ST. Analyzed the data: YS JY. Wrote the paper: YS JY KD.