We first sought to determine whether simply adding age or VIQ as training features would improve brain morphometric classification. When age was added to the brain morphometric features for training the classifier, AUC improved to 0.62. When VIQ was added, AUC improved to 0.66. When both age and VIQ were added, AUC improved to 0.68. In addition, site information was explicitly added to the morphometric features for training the classifier. One-hot encoding was used to represent the site information, i.e. for each scanning site, a binary feature was added with a value of one for subjects from that site and zero for all other subjects. In total, 17 binary features representing the 17 sites were added to the 538 morphometric features. After adding site information to the morphometric features for training, AUC did not improve and was the same as that obtained using only the morphometric features.
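The one-hot representation of site described above can be illustrated with a minimal sketch. The function names, site labels and toy feature values below are hypothetical and chosen only for illustration; the study's actual implementation is not described in the text.

```python
def one_hot_sites(site_labels):
    """Encode each subject's scanning site as one binary column per site."""
    site_order = sorted(set(site_labels))          # fixed column order
    column = {site: i for i, site in enumerate(site_order)}
    rows = []
    for site in site_labels:
        row = [0] * len(site_order)
        row[column[site]] = 1                      # one for the subject's own site
        rows.append(row)
    return site_order, rows


def append_site_features(morph_features, site_labels):
    """Concatenate the binary site columns to each subject's morphometric features."""
    _, encoded = one_hot_sites(site_labels)
    return [list(f) + s for f, s in zip(morph_features, encoded)]


# Toy data (hypothetical): 4 subjects, 2 morphometric features, 3 sites.
sites = ["site_A", "site_B", "site_A", "site_C"]
morph = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
augmented = append_site_features(morph, sites)
```

With 538 morphometric features and 17 sites, the same procedure would yield 555 columns per subject.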

To confirm that the increase in classification performance after sub-grouping was due to the reduced heterogeneity within the sub-groups rather than to optimization issues, we tested the RF classification model trained in one sub-group on the other sub-groups. To obtain the distribution of test scores, the classification model trained in one sub-group was tested on 200 bootstrap replications from another sub-group. Classification results are presented in S3 Fig, where each sub-plot represents a sub-group and the three data points correspond to the performance scores (when tested on that sub-group) of the models trained in the three sub-groups. The AUC scores in intra-subgroup classification were much larger than the AUC scores in inter-subgroup classification in 16 out of 18 comparisons. The AUC score of intra-subgroup classification was lower than that of inter-subgroup classification in 2 comparisons. This disparity occurred in the mid-age sub-group, where the intra-subgroup classification was close to chance (50% success) and the inter-subgroup rates were also close to chance (53% and 52% success). Moreover, the AUC scores decreased as the difference between the training and testing sub-groups increased along the variable by which the sub-groups were defined. For example, when the classification models were tested on the low-VIQ sub-group, the AUC scores were 0.75, 0.64 and 0.35 when subjects from the low, mid and high VIQ sub-groups, respectively, were used for training the models.
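The bootstrap evaluation of a model on another sub-group can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the trained classifier is abstracted as a hypothetical `score_fn` that maps a subject's features to a score, AUC is computed from pairwise comparisons, and replicates that happen to contain only one class are redrawn.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(score_fn, X, y, n_boot=200, seed=0):
    """Score a fixed, already-trained model on n_boot resamples of a test sub-group."""
    rng = random.Random(seed)
    n = len(y)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:                         # AUC needs both classes
            continue
        aucs.append(auc(yb, [score_fn(X[i]) for i in idx]))
    return aucs


# Toy test sub-group (hypothetical values); the "model" just returns the first feature.
X_test = [[0.1], [0.4], [0.35], [0.8], [0.7], [0.2]]
y_test = [0, 0, 1, 1, 1, 0]
auc_distribution = bootstrap_auc(lambda x: x[0], X_test, y_test, n_boot=200)
mean_auc = sum(auc_distribution) / len(auc_distribution)
```

Repeating this for every (training sub-group, testing sub-group) pair gives the intra- and inter-subgroup score distributions compared in the text.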

Classification performance further improved after matching the subject demographics. A high AUC of 0.92 was achieved in the low AS sub-group; see Table 4 and Fig 1B. Similarly, high AUCs of 0.81 and 0.80 were achieved for the moderate AS and low VIQ sub-groups, respectively. The AUC trends from this experiment were the same as those from the up-sampling scheme presented above, i.e. AUC decreases with AS and VIQ. The results from GBM were similar and are presented in Table 3 and S1 Fig. When separate classification models were built for the ASD subjects with each level of AS, AUC sharply decreased with AS according to both RF and GBM (RF: r = -0.86, p = 0.029*, GBM: r = -0.87, p = 0.026*); see S5 Fig.

In the above section, although the subjects were more homogeneous after sub-grouping, the distributions of other DB measures of ASD and TDC subjects in the sub-groups might be different. This raises a concern that the results from the up-sampling scheme could have been influenced by differences in the distribution of DB measures. To check that the results were not due to differing demographics, ASD and TDC subjects in each sub-group were matched on demographics. In each sub-group, the bigger class was down-sampled to match the ASD and TDC subjects on age and/or VIQ; see Table 3 for the number of subjects. Subjects were matched by age and VIQ in the sub-groups by AS and by age in the sub-groups by VIQ. For sub-groups by age, down-sampling was not performed as the number of subjects in each group was comparable and very few subjects had AS and VIQ.
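One way to down-sample the bigger class while matching demographics is greedy one-to-one nearest-neighbour matching, sketched below. The matching algorithm used in the study is not specified in the text, so this procedure, the function name and the scaling constants are all assumptions made for illustration.

```python
def match_on_covariates(small, large, scale):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    small, large: lists of covariate tuples, e.g. (age, VIQ), for the smaller
    and bigger class; len(large) must be >= len(small).
    scale: per-covariate divisor so that age and VIQ contribute comparably.
    Returns (index_in_small, index_in_large) pairs.
    """
    available = list(range(len(large)))
    pairs = []
    for i, s in enumerate(small):
        # Closest still-unmatched subject in the bigger class (scaled squared distance).
        j = min(
            available,
            key=lambda k: sum(((a - b) / sc) ** 2 for a, b, sc in zip(s, large[k], scale)),
        )
        pairs.append((i, j))
        available.remove(j)
    return pairs


# Toy data (hypothetical): covariates are (age in years, VIQ).
smaller_class = [(10.0, 100.0), (20.0, 90.0)]
bigger_class = [(9.5, 102.0), (30.0, 120.0), (21.0, 88.0), (15.0, 95.0)]
matched = match_on_covariates(smaller_class, bigger_class, scale=(5.0, 15.0))
kept_from_bigger = [j for _, j in matched]
```

Only the matched subjects from the bigger class are retained, so both classes end up with the same number of subjects and similar age and VIQ distributions.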

In summary, sub-grouping the subjects by AS, VIQ and age improved the classification rate, with the most and least improvement from sub-grouping by AS and age, respectively. The results from GBM were similar and are presented in Table 4 and S1 Fig.

In sub-groups by VIQ, AUC decreased with VIQ, with AUCs of 0.75, 0.63 and 0.62 for the low, normal and high VIQ sub-groups, respectively. In sub-groups by age, AUC was modest in the young and old sub-groups, with AUCs of 0.66 and 0.65, respectively. AUC was low (0.5) in the mid-age sub-group.

In sub-groups by AS, AUC was 0.78, 0.8 and 0.72 for the low, moderate and high sub-groups, respectively. The sample sizes in the sub-groups by AS were unequal. So, to check if the results were due to unequal sample sizes, separate classification models were built for the subjects with AS = 4–5 (#ASD/#TDC = 20/373), 6 (33/373), 7 (27/373), 8 (25/373), 9 (29/373) and 10 (22/373). The results of this experiment are presented in S5 Fig. In this experiment, where the sample sizes in the sub-groups were comparable, AUC decreased with AS according to both RF and GBM. There was a strong negative correlation (RF: r = -0.72, p = 0.1, GBM: r = -0.86, p = 0.028*) between mean AUC and mean AS of the sub-groups; see S5 Fig.

The AUC scores of the classification in the sub-groups are presented. A point represents the mean and an error bar represents one standard deviation of the AUC scores from 10 test folds. A) Smaller classes were up-sampled in each training fold to balance the number of ASD and TDC subjects. Sub-grouping improved the classification, with the most and least improvement from sub-grouping by AS and age, respectively. B) Larger classes were down-sampled, matching the demographics of the smaller classes. This scheme further improved the classification performance.

In each sub-group, the smaller class was randomly up-sampled in each training fold to balance the number of ASD and TDC subjects. The AUC scores achieved in the sub-groups are presented in Fig 1A and Table 4. In Fig 1A, a point represents the mean and an error bar represents one standard deviation of the AUC scores from 10 test folds. AUC scores and the numbers of ASD and TDC subjects are presented below the error bar.
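Random up-sampling of the smaller class within a training fold can be sketched as below. This is an illustrative sketch with hypothetical names and toy labels; it assumes sampling with replacement from the smaller class until both classes are the same size, and it is applied to the training fold only, so the test fold keeps its original class proportions.

```python
import random


def upsample_minority(X, y, seed=0):
    """Randomly duplicate subjects of the smaller class until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    target = max(len(ix) for ix in by_class.values())   # size of the bigger class
    chosen = []
    for ix in by_class.values():
        # Keep every original subject and add random repeats to reach the target.
        extra = [rng.choice(ix) for _ in range(target - len(ix))]
        chosen.extend(ix + extra)
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy training fold (hypothetical): 2 ASD subjects and 5 TDC subjects.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7]]
y_train = ["ASD", "ASD", "TDC", "TDC", "TDC", "TDC", "TDC"]
X_bal, y_bal = upsample_minority(X_train, y_train)
```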

Multivariate analysis: Important features for classification and their variability across sub-groups

The top 10 important features for classification in each sub-group with matched subjects (i.e. from section 4.3.2) are presented in Fig 2. The top features for classification across all subjects are in Fig 2D. Each feature is represented by a bar whose length is proportional to its importance for the classification. The feature importance was calculated as an average of the importance scores from 10 test folds. Before each feature, ASD vs. TDC Cohen’s d and two sample t-test significance (P<0.005** and P<0.05*) are presented. The different morphometric features are color coded and have been grouped together. Findings for the volume features reported in this study are after they were normalized by TIV.
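For reference, the effect size reported before each feature can be computed as in the sketch below. It assumes the common pooled-standard-deviation form of Cohen's d, with a positive value meaning ASD > TDC; the text does not state which variant was used, and the toy values are hypothetical.

```python
import math


def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation; positive when mean(group_a) > mean(group_b)."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    # Unbiased sample variances of each group.
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd


# Toy feature values (hypothetical) for one morphometric feature.
asd_values = [2.0, 4.0, 6.0]
tdc_values = [1.0, 3.0, 5.0]
d = cohens_d(asd_values, tdc_values)
```

Swapping the two groups flips the sign of d, which is how a change in the direction of the group difference appears across sub-groups.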

Fig 2. Important features for classification are different across sub-groups. Top 10 important features for autism spectrum disorder (ASD) vs. typically developing controls (TDC) classification in each sub-group (by AS, VIQ, age) are presented. Each feature is represented by a colored bar; the length of the bar represents the relative % importance for classification with respect to the top feature. The features have been grouped and color-coded by volume, area, thickness mean, thickness standard deviation, folding index, mean curvature and Gaussian curvature. Before each feature, Cohen’s d and two sample t-test significance (P<0.005** and P<0.05*) of ASD vs. TDC group difference are presented. The important features for classification varied across the sub-groups demonstrating the heterogeneity in ASD brain morphometry. https://doi.org/10.1371/journal.pone.0153331.g002

The important features for classification varied across the sub-groups. However, the important features were mainly from the frontal, temporal, insular, ventricular, right hippocampal and left amygdala regions. Most of the important features from RF and GBM were common; see Fig 2 and S2 Fig.

To address the concern that the arbitrary cutoff of the top 10 features might have influenced our results, the important features were also selected by another technique based on the cumulative distribution of the feature importance scores. After sorting the features in descending order of their importance scores, the scores were cumulatively added starting from the most important feature. The features required to reach 10% of the total sum of the scores were considered important, and the corresponding feature importance plot for RF is presented as S4 Fig. In addition, we relaxed our criterion for important features and used a 25% threshold; see S5 Fig. Even after using this different technique to select the important features with multiple thresholds, the top features for classification were highly dissimilar across the sub-groups.
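The cumulative-importance selection described above can be sketched as follows. The function name, feature names and scores are hypothetical; the sketch assumes that the feature whose score carries the running sum across the threshold is itself counted as important.

```python
def select_by_cumulative_importance(importances, fraction=0.10):
    """Return the top-ranked features needed to reach `fraction` of the total importance."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    threshold = fraction * sum(importances.values())
    selected, running = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running += score                 # cumulative sum from the most important feature
        if running >= threshold:
            break
    return selected


# Toy importance scores (hypothetical), total = 10.
scores = {"feat_a": 5.0, "feat_b": 3.0, "feat_c": 1.0, "feat_d": 1.0}
top_10pct = select_by_cumulative_importance(scores, fraction=0.10)
top_25pct = select_by_cumulative_importance(scores, fraction=0.25)
top_75pct = select_by_cumulative_importance(scores, fraction=0.75)
```

Unlike a fixed top-10 cutoff, the number of selected features here depends on how concentrated the importance scores are.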

The important features according to the two classifiers were similar, suggesting that the results are not influenced by model choice. To statistically verify the similarity, we performed Pearson’s correlation test between the importance scores of all features from the two classifiers. We performed the test separately in the nine sub-groups, and the correlation coefficients are reported in S1 Table. All correlation coefficients were high (r > 0.75 in 9 and r > 0.85 in 7 sub-groups) and statistically significant (p < 1E-16). The high similarity between the feature importance scores from two different classifiers supports the conclusion that the important features reported in this study are not affected by model choice and hence are robust.
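The agreement between two classifiers' importance scores can be quantified with Pearson's r, as in this minimal sketch. The importance vectors are hypothetical toy values, not results from the study, and the significance test is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy importance scores (hypothetical) for the same four features from two models.
rf_importance = [0.40, 0.30, 0.20, 0.10]
gbm_importance = [0.35, 0.30, 0.25, 0.10]
r = pearson_r(rf_importance, gbm_importance)
```

In the analysis described above, the two vectors would each hold one importance score per feature, and the test would be repeated within each of the nine sub-groups.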

To demonstrate the extent of the heterogeneity in brain abnormalities, the variability of the 13 important features with AS, VIQ and age is presented in Fig 3. The 13 features include the top feature from each sub-group (nine in total) and four important features from the classification using all the subjects.

Fig 3. Variability of the important features with autism severity (AS), verbal IQ (VIQ) and age. A total of 13 important features for classification are presented; the top feature for each sub-group (nine in total) and the four important features for all subjects. The magnitude and direction of the ASD vs. TDC group differences of the top features varied with AS, VIQ and age demonstrating the heterogeneity in brain morphometry. https://doi.org/10.1371/journal.pone.0153331.g003

Curvature and thickness based features were predominant in the sub-groups by AS; see Fig 2A. Interestingly, there were no important volume features in the low-AS sub-group. Volume features were present in the moderate and high-AS sub-groups, and many of them were from the ventricles. Thickness standard deviation of the left fusiform gyrus (red line in Fig 3) was the most important feature in the low-AS sub-group and had a very large ASD vs. TDC group difference (d = 1.75, p = 4E-16*). The group difference decreased with AS but was still high (d = 0.57, p = 0.0006) in the high-AS sub-group. Interestingly, the group difference even changed its direction with VIQ. The difference was negative (ASD < TDC) with a small effect size (d = -0.11, p = 0.7) in the low-VIQ sub-group but was positive and statistically significant with a medium effect size (d = 0.36, p = 0.01*) in the high-VIQ sub-group. Similarly, mean curvature of the inferior parietal gyrus (blue line in Fig 3) was the most important feature in the moderate-AS sub-group. It was significantly larger in ASD (d = 0.91, p = 2E-6*). The group difference decreased with AS but was still high (d = 0.51, p = 0.0006*) in the high-AS sub-group. This feature also showed a reversal in the direction of the group difference: ASD > TDC with a medium effect size (d = 0.41, p = 0.02*) in the young-age sub-group and ASD < TDC with a medium effect size (d = -0.3, p = 0.03*) in the old-age sub-group. Right choroid plexus volume (green line in Fig 3) was the most important feature in the high-AS sub-group and had a positive (ASD > TDC) group difference with a large effect size (d = 0.55, p = 9E-4*). Across all subjects, it was larger in ASD with a small effect size (d = 0.18, p = 0.02*). There was a large positive group difference (d = 0.71, p = 0.05) in the low-VIQ sub-group; however, it decreased with VIQ and was negative in the high-VIQ sub-group (d = -0.15, p = 0.3).

Folding index of the left rostral anterior cingulate gyrus (orange line in Fig 3) was the most important feature in the low-VIQ sub-group, with a small negative group difference (d = -0.13, p = 0.7). It is notable that this was the most important feature for classification even though the ASD vs. TDC group difference was very small and not statistically significant. Its importance arises in the multivariate setting, where the importance of a feature depends on its relationship with other features. For example, the group difference in the ratio of the folding indices of the left and right rostral anterior cingulate gyri was large (d = 0.7, p = 0.05). This demonstrates the advantage of MVPTs over univariate techniques: they can automatically find inter-variable relationships that are important for inter-group distinction. Similarly, volume of the right parahippocampal gyrus (black line in Fig 3) was the most important feature in the mid-VIQ sub-group, with a small negative group difference (d = -0.15, p = 0.2). It was also an important feature in classification using all subjects; see Fig 2D. Thickness standard deviation of the left inferior temporal gyrus (brown line in Fig 3) was the most important feature in the high-VIQ sub-group, where it was larger in ASD with a medium effect size (d = 0.46, p = 0.002*). However, the group difference was nearly zero in the normal-VIQ sub-group and even flipped its direction in the low-VIQ sub-group (d = -0.45, p = 0.2).

The important features across the sub-groups by age were distinct. Folding index and Gaussian curvature features from the frontal and temporal regions were predominant, and there were very few volume, thickness and area based important features in the young-age sub-group. The volume features became more dominant with increasing age; most of the important features in the old-age sub-group were volume-based. Folding index of the right insula gyrus (purple line in Fig 3) was the most important feature in the young-age sub-group, with a small positive ASD vs. TDC group difference (d = 0.23, p = 0.08). The group difference decreased with age and was nearly zero for the old-age sub-group. Volume of the mid anterior corpus callosum (dotted red line in Fig 3) was the most important feature in the mid-age sub-group where it was smaller in ASD (d = 0.35, p = 0.03*). However, in the young-age sub-group, it was larger in ASD (d = 0.15, p = 0.4). Mean curvature of the left pericalcarine gyrus (dotted blue line in Fig 3) was the most important feature in the old-age sub-group and was larger in ASD (d = 0.3, p = 0.02*) but was smaller in ASD (d = -0.38, p = 0.002*) in the old-age sub-group.

Across all subjects, Gaussian curvature of the frontal pole was the most important feature. The volume features were predominant and were mainly from the left amygdala, right parahippocampal, ventricular and temporal regions. As with other important curvature-based features, the group difference in Gaussian curvature of the frontal pole (dotted green line in Fig 3) was the largest in younger subjects (d = 0.28, p = 0.03*) and the smallest in older subjects (d = 0.02, p = 0.9). Among the ventricular volumes, left lateral ventricle volume (dotted orange line in Fig 3) was the most important across all subjects (d = 0.24, p = 0.002*). It was larger in ASD, and the group difference decreased with VIQ and increased with age. Across all subjects, all the ventricles were larger in ASD compared to TDC and the group differences were statistically significant (before correction for multiple comparisons). In general, except for the 3rd and 4th ventricles, the group difference in the ventricles decreased with VIQ; see S5 Fig. Left amygdala (dotted brown line in Fig 3) was also an important feature for classification across all subjects. It was larger in ASD in the old-age sub-group with a medium effect size (d = 0.41, p = 0.001*) but was smaller in the young-age sub-group (d = -0.15, p = 0.3). Likewise, most of the important features showed high variability with AS, VIQ and age, and some even changed the direction of the ASD vs. TDC group difference.