Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the internal representations of computational models can explain the IT representation. Here we investigate a wide range of computational model representations (37 in total), testing their categorization performance and their ability to account for the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g. HMAX, VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Combining the features of this model with appropriate weights and adding linear combinations that maximize the margin between animate and inanimate objects and between faces and other objects yielded a representation that fully explained our IT data. 
Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.

Computers cannot yet recognize objects as well as humans can. Computer vision might learn from biological vision. However, neuroscience has yet to explain how brains recognize objects and must draw from computer vision for initial computational models. To make progress with this chicken-and-egg problem, we compared 37 computational model representations to representations in biological brains. The more similar a model representation was to the high-level visual brain representation, the better the model performed at object categorization. Most models did not come close to explaining the brain representation, because they missed categorical distinctions between animates and inanimates and between faces and other objects, which are prominent in primate brains. A deep neural network model that was trained by supervision with over a million category-labeled images and represents the state of the art in computer vision came closest to explaining the brain representation. Our brains appear to impose upon the visual input certain categorical divisions that are important for successful behavior. Brains might learn these divisions through evolution and individual experience. Computer vision similarly requires learning with many labeled images so as to emphasize the right categorical divisions.

Funding: This work was funded by Cambridge Overseas Trust and Yousef Jameel Scholarship to SMKR; and by the Medical Research Council of the UK (programme MC-A060-5PR20) and a European Research Council Starting Grant (ERC-2010-StG 261352) to NK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The data have been used in previous studies, including a recent PLOS Computational Biology paper (‘A Toolbox for Representational Similarity Analysis’, Nili et al. 2014), and are already available from here: http://www.mrc-cbu.cam.ac.uk/methods-and-resources/toolboxes/ .

Copyright: © 2014 Khaligh-Razavi, Kriegeskorte. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Internal representations of the HMAX model (the C2 stage) and several computer-vision models accounted well for the EVC representation. Most of the models captured some component of the representational dissimilarity structure in IT and other visual regions. Several models clustered the human faces, which were mostly frontal and visually highly similar. However, all the unsupervised models failed to place human and animal faces, which differ greatly in visual appearance, into a single face cluster, as observed for human and monkey IT. The unsupervised models also failed to replicate IT's clear animate/inanimate division. The deep supervised convolutional network better captured the categorical divisions, but did not fully replicate the categorical clustering observed in IT. We therefore remixed the features of the deep supervised model to emphasize the major categorical divisions of IT, using maximum-margin linear discriminants. In order to construct a representation resembling IT, we combined these discriminants with the different representational stages of the deep network, weighting each discriminant and each layer of the deep network so as to best explain the IT representational geometry. The resulting IT-geometry model, when tested with cross-validation to avoid overfitting to the image set, fully explains our IT data. Our results suggest that intensive supervised training with large sets of labeled images might be necessary to model the IT representational space.

We analyzed brain responses in monkey IT (mIT; cell recording data acquired by Kiani and colleagues [6] ) and human IT (hIT; fMRI data from [7] ) for a rich set of color images of isolated objects spanning multiple animate and inanimate categories. The human fMRI measurements covered the entire ventral stream, so we also tested the models on fMRI data from the foveal confluence of early visual cortex (EVC), the lateral occipital complex (LOC), the fusiform face area (FFA), and the parahippocampal place area (PPA).

We also attempted to recombine model features, so as to construct a representation resembling IT in both its categorical divisions and within-category representational geometry. We linearly recombined the features in two ways: (a) by reweighting features (thus stretching and squeezing the representational space along its original axes) and (b) by remixing the features, creating new features as linear combinations of the original features (thus performing general affine transformations). All unsupervised and supervised training and all reweighting and remixing was based on sets of images nonoverlapping with the image set used to assess how well models accounted for IT.
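The distinction between the two kinds of linear recombination can be sketched in a few lines (a toy illustration with synthetic feature matrices; the array sizes are arbitrary and not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.standard_normal((92, 10))  # 92 stimuli x 10 model features (synthetic)

# (a) Reweighting: scale each original feature axis by a non-negative weight,
# stretching and squeezing the representational space along its original axes.
w = rng.uniform(size=10)
reweighted = features * w

# (b) Remixing: create new features as linear combinations of the originals
# (a general linear transformation of the feature space).
M = rng.standard_normal((10, 3))  # mixing matrix: 10 original -> 3 new features
remixed = features @ M
```

Reweighting preserves the identity of each feature; remixing can create entirely new feature axes, such as category discriminants.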

We also tested models that were supervised with category labels. Two of the models (GMAX and supervised HMAX) [35] were trained in a supervised fashion to distinguish animates from inanimates, using 884 training images. In addition, we tested a deep supervised convolutional neural network [41] , trained by supervision with over a million category-labeled images from ImageNet [49] .

We tested a total of 37 computational model representations. Some of the models mimic the structure of the ventral visual pathway (e.g. HMAX, VisNet, Stable model, SLF) [20] , [21] , [33] – [37] ; others are more broadly biologically motivated (e.g. Biotransform, convolutional network) [38] – [41] ; and the others are well-known computer-vision models (e.g. GIST, SIFT, PHOG, PHOW, self-similarity features, geometric blur) [42] – [48] . Some of the models use features constructed by engineers without training with natural images (e.g. GIST, SIFT, PHOG). Others were trained in an unsupervised fashion (e.g. HMAX and VisNet).

Evaluating a computational model requires a framework for relating brain representations and model representations. One approach is to directly predict the brain responses to a set of stimuli by means of the computational models. Because of its roots in the computational neuroscience of early visual areas, this approach is often referred to as receptive-field modeling. It has been successfully applied to cell-recording data, e.g. [25] , and fMRI data, e.g. [26] – [28] . Here we attempt to test complex network models whose internal representations comprise many units (ranging from 99 to 2,904,000). The brain-activity data consist of hundreds of measured brain responses. In this scenario, the linear mapping between model units and brain responses is complex (a weight matrix of size number of model units × number of brain responses). Estimating this linear map is statistically costly, requiring a combination of substantial additional data (for a separate set of stimuli) and prior assumptions (for regularizing the fit). Here we avoid these complications by testing the models in the framework of representational similarity analysis (RSA) [17] , [18] , [29] , [30] , in which brain and model representations are compared at the level of the dissimilarity structure of their response patterns. The models thus predict the dissimilarities among the stimuli in the brain representation. This approach relies on the assumption that the measured responses preserve the geometry of the neuronal representational space. The representational geometry would be conserved to high precision if the measured responses sampled random dimensions of the neuronal representational space [31] , [32] . The RSA framework enables us to test any pre-trained model directly with data from a single stimulus set.
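The core RSA comparison can be sketched as follows: compute each representation's RDM as correlation distances between response patterns, then correlate the vectorized upper triangles of the two RDMs using Kendall τA. The data below are synthetic stand-ins; a real analysis would use measured brain patterns and model features:

```python
import numpy as np

def compute_rdm(patterns):
    """RDM as 1 - Pearson correlation between response patterns.
    patterns: (n_stimuli, n_units) array."""
    return 1.0 - np.corrcoef(patterns)

def upper_tri(rdm):
    """Vectorize the upper triangle of an RDM (excluding the diagonal)."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def kendall_tau_a(a, b):
    """Kendall tau-a: (concordant - discordant) pairs / total pairs.
    Unlike tau-b, ties are not corrected for, so tau-a is conservative."""
    n = len(a)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign(a[i + 1:] - a[i]) * np.sign(b[i + 1:] - b[i]))
    return s / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
model = rng.standard_normal((92, 100))                 # 92 stimuli, 100 model units
brain = model + 0.5 * rng.standard_normal((92, 100))   # noisy synthetic 'brain'
tau = kendall_tau_a(upper_tri(compute_rdm(model)), upper_tri(compute_rdm(brain)))
```

Because the comparison operates on dissimilarities rather than raw responses, no unit-to-response mapping needs to be estimated.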

Here we investigate a wide range of computational models [24] and assess their ability to account for the representational geometry of primate IT. Our study addresses the question of how well computational models from computer vision and neuroscience can explain the IT representational geometry. In particular, we investigated whether models not specifically optimized to distinguish categories can explain IT's categorical clusters and whether models trained using supervised learning with category labels better explain the IT representational geometry.

This raises the question of whether any existing computational vision model, motivated by engineering or by neuroscientific objectives, can more fully explain the IT representation and account for the IT category clustering. IT clearly represents visual shape. However, the degree to which categorical divisions and semantic dimensions are also represented is a matter of debate [22] , [23] . If visual features constructed without any knowledge of either category boundaries or semantic dimensions reproduced the categorical clusters, then we might think of IT as a purely visual representation. To the extent that knowledge of categorical boundaries or semantic dimensions is required to build an IT-like representation, IT is better conceptualized as a visuo-semantic representation.

Previous studies have compared the representational dissimilarity matrices (RDMs) of a small number of models (mainly low-level models) with human IT and some other brain areas [7] , [17] – [19] . One of the previously tested models was the HMAX model [20] , [21] , which was designed as a model of IT taking many of its architectural parameters from the neuroscience literature. The internal representation of one variant of the HMAX model failed to fully explain the IT representational geometry [7] . In particular, the HMAX model did not account for the category clustering observed in the IT representation.

Visual object recognition is thought to rely on a high-level representation in the inferior temporal (IT) cortex, which has been intensively studied in humans and monkeys [1] – [12] . Object images that are less distinct in the IT representation are perceived as more similar by humans [10] and are more frequently confused by humans [13] and monkeys [6] . IT cortex represents object images by response patterns that cluster according to conventional categories [6] , [7] , [9] , [14] – [16] . The strongest categorical division appears to be that between animates and inanimates. Within the animates, faces and bodies form separate sub-clusters [6] , [7] , [15] .

Results

The results for the 37 model representations are presented separately for two sets of representations. The first set comprises the not-strongly-supervised representations (Figures 1–5). The second set comprises the layers of a strongly supervised deep convolutional network and an IT-like representation constructed by remixing and reweighting the features of the deep supervised model (Figures 6–10). The not-strongly-supervised set (Table 1) includes two supervised models: GMAX and Supervised HMAX (Materials and Methods). These were supervised much more weakly than the deep convolutional network, using merely hundreds of images. The deep convolutional network (Table 2) was supervised with 1.2 million category-labeled images. Note that the first set contains many independent model representations, whereas the second set contains the stages of a single deep strongly supervised object-vision model.

Figure 1. Representational dissimilarity matrices for IT and for the seven best-fitting not-strongly-supervised models. The IT RDMs (black frames) for human (A) and monkey (B) and the seven most highly correlated model RDMs (excluding the representations in the strongly supervised deep convolutional network). The model RDMs are ordered from left to right and top to bottom by their correlation with the respective IT RDM. These are the seven most highly correlated RDMs among the 27 models that were not strongly supervised and their combination model (combi27). Biologically motivated models are in black, computer-vision models are in gray. The number below each RDM is the Kendall τA correlation coefficient between the model RDM and the respective IT RDM. All correlations are statistically significant. For statistical inference, see Figure 2. For model abbreviations and RDM-correlation p values, see Table 1. For other brain ROIs (i.e. LOC, PPA, FFA, EVC) see Figure S1 and Table 1. The RDMs here are 96×96, including the four stimuli we did not have monkey data for. The corresponding rows and columns are shown in blue in the mIT RDM and were ignored in the RDM comparisons. https://doi.org/10.1371/journal.pcbi.1003915.g001

Figure 2. The not-strongly-supervised models fail to fully explain the IT data. The bars show the Kendall-τA RDM correlations between the not-strongly-supervised models and IT for human (A) and monkey (B). The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). Most models explain a small but significant portion of the variance of the IT representational geometry. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data). The upper and lower edges of the gray horizontal bar are upper- and lower-bound estimates of the maximum correlation any model can achieve given the noise. None of the not-strongly-supervised models reaches the noise ceiling. The noise ceiling could not be estimated for mIT, because the available data were from only two animals. Models with the subscript ‘UT’ are unsupervised trained models, models with the subscript ‘ST’ are supervised trained models, and those without a subscript are untrained models. Note that the supervised models included here were “weakly supervised”, i.e. with a small number (884) of category-labeled images. Biologically motivated models are set in black font, and computer-vision models are set in gray font. https://doi.org/10.1371/journal.pcbi.1003915.g002

Figure 3. IT-like categorical structure is not apparent in any of the not-strongly-supervised models. Brain and model RDMs are shown in the left columns of each panel. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure (least-squares fit). The categories modeled were animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates, and artificial inanimates. The fitted linear combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each RDM. The residual RDMs of the fits are shown in the right columns. For statistical inference, see Figure 4. https://doi.org/10.1371/journal.pcbi.1003915.g003

Figure 4. The not-strongly-supervised models are less categorical than IT. Categoricality was measured using a categoricality index (vertical axis) for each model and brain RDM. The categoricality index is defined as the proportion of RDM variance explained by the category-cluster model (Figure S5), i.e. the squared correlation between the fitted category-cluster model and the RDM it is fitted to. Bars show the categoricality index for each of the not-strongly-supervised models. The blue (gray) line shows the categoricality index for hIT (mIT). Error bars show 95%-confidence intervals of the categoricality-index estimates for the models. The 95%-confidence intervals for hIT and mIT are shown by the blue and gray shaded regions, respectively. Significant categoricality indices are marked by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Error bars are based on bootstrapping of the stimulus set, and the p values are obtained by a category-label randomization test. Significant differences between the categoricality indices of each model and hIT (inference by bootstrap resampling of the stimuli) are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 28 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the 28 models. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-free models with that of the noisy estimates for hIT and mIT (Figure S9). https://doi.org/10.1371/journal.pcbi.1003915.g004

Figure 5. Remixing and reweighting features of the not-strongly-supervised models does not explain IT. In order to build an IT-like representation, we attempted to remix the features to strengthen relevant categorical divisions. We trained three linear SVM classifiers (for animate/inanimate, face/nonface, and body/nonbody) on the combi27 features using 884 training images (separate from the set we had brain data for). RDMs for the resulting SVM decision values for the 92 images presented to humans and monkeys are shown at the top. The Kendall-τA RDM correlations with hIT and mIT are stated underneath the RDMs. The RDM correlations are low, but all three are statistically significant (p<0.05). We further attempted to create an IT-like representation as a reweighted combination of the models. We fitted one weight for each of the 27 not-strongly-supervised models, the combi27 model, and the three SVM decision values. The weights were fitted by non-negative least squares, so as to minimize the sum of squared deviations between the RDM of the weighted combination of the features and the hIT RDM. The resulting weights are shown in the second row. Error bars indicate 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. The resulting IT-geometry-supervised RDM is shown at the bottom (center) in juxtaposition to hIT (left) and mIT (right). Importantly, the RDM was obtained by cross-validation to avoid overfitting to the image set (Materials and Methods). The RDMs here are 92×92, excluding the four stimuli that we did not have monkey data for. https://doi.org/10.1371/journal.pcbi.1003915.g005
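The non-negative least-squares weight fitting can be illustrated with a simplified sketch. Here non-negative weights are fitted directly over component RDMs (vectorized upper triangles) to match a synthetic target RDM; in the actual analysis the weights apply to model features and the fit is to the hIT RDM, so this is an analogy, not the paper's exact procedure:

```python
import numpy as np
from scipy.optimize import nnls

def upper_tri(rdm):
    """Vectorize the upper triangle of an RDM (excluding the diagonal)."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

rng = np.random.default_rng(3)
n = 20
components = []
for _ in range(5):
    a = rng.uniform(size=(n, n))
    components.append((a + a.T) / 2)           # symmetric toy component RDMs

true_w = np.array([0.0, 2.0, 0.0, 1.0, 0.5])   # synthetic ground-truth weights
target = sum(w * c for w, c in zip(true_w, components))

# Fit non-negative weights so the weighted sum of component RDMs
# best matches the target RDM in the least-squares sense.
X = np.column_stack([upper_tri(c) for c in components])
w_hat, residual = nnls(X, upper_tri(target))
```

The non-negativity constraint keeps each component's contribution interpretable as a positive mixture weight.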

Figure 6. RDMs of all layers of the strongly supervised deep convolutional network. RDMs for all layers of the deep convolutional network (Krizhevsky et al. 2012) [41] are shown for the set of 96 images (L1: layer 1 to L7: layer 7). Kendall-τA RDM correlations of the models with hIT and mIT are stated underneath each RDM. All correlations are statistically significant. For inferential comparisons to IT and other regions, see Figure 7 and Table 2, respectively. https://doi.org/10.1371/journal.pcbi.1003915.g006

Figure 7. The strongly supervised deep network, with features remixed and reweighted, fully explains the IT data. The bars show the Kendall-τA RDM correlations between the layers of the strongly supervised deep convolutional network and human IT. The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). As we ascend the layers of the deep network, model RDMs explain increasing proportions of the variance of the hIT RDM. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data). The upper and lower edges of the gray horizontal bar are upper- and lower-bound estimates of the maximum correlation any model can achieve given the noise. None of the layers of the deep network reaches the noise ceiling. However, the final fully connected layers 6 and 7 come close to the ceiling. Remixing the features of layer 7 (Figure 10) using linear SVMs to strengthen the categorical divisions provides a representation composed of three discriminants (animate/inanimate, face/nonface, and body/nonbody) that reaches the noise ceiling. Reweighting the model layers and the three discriminants (see Figure 10 for details) yields a representation that explains the hIT geometry even better. A horizontal line over two bars indicates that the two models perform significantly differently (inference by bootstrap resampling of the stimulus set). Multiple testing across the many pairwise comparisons is accounted for by controlling the expected FDR at 0.05. The pairwise statistical comparisons show that the IT-geometry-supervised deep model explains IT significantly better than all other candidate representations. https://doi.org/10.1371/journal.pcbi.1003915.g007

Figure 8. IT-like categorical structure emerges across the layers of the deep supervised model, culminating in the IT-geometry-supervised layer. Descriptive category-clustering analysis as in Figure 3, but for the deep supervised network. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure. The fitted linear combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each layer of the deep supervised model. The layers show some of the categorical divisions emerging. However, remixing of the features (linear SVM readout) is required to emphasize the categorical divisions to a degree similar to IT. The final IT-geometry-supervised layer (weighted combination of layers and SVM discriminants) has a categorical structure that is very similar to IT. Overfitting to the image set was avoided by cross-validation. For statistical inference, see Figure 9. https://doi.org/10.1371/journal.pcbi.1003915.g008

Figure 9. The layers of the deep supervised model are less categorical than IT, but remixing and reweighting achieves IT-level categoricality. Bars show the categoricality index for each layer of the deep convolutional network and for the IT-geometry-supervised layer. For conventions and the definition of the categoricality index, see Figure 4. Error bars and shaded regions indicate 95%-confidence intervals. Significant categoricality indices are indicated by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Significant differences between the categoricality index of each model and the hIT categoricality index are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 9 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the internal layers of the deep convolutional network. However, the IT-geometry-supervised layer (remixed and reweighted) achieves a categoricality similar to (and not significantly different from) that of IT. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-free models with that of the noisy estimates for hIT and mIT (Figure S10). https://doi.org/10.1371/journal.pcbi.1003915.g009

Figure 10. Remixing and reweighting features of the deep supervised network achieves an IT-like representational geometry. All analyses and conventions here are analogous to Figure 5, but applied to the strongly supervised deep convolutional network rather than to the not-strongly-supervised models. Remixing the features of layer 7 by fitting linear SVMs (on a separate set of training images) for the major categorical divisions (animate/inanimate, face/nonface, and body/nonbody) helped account for the categorical clusters in IT. The Kendall-τA RDM correlations between the SVM decision values and IT (stated underneath the RDMs in the top row) are statistically significant (p<0.05). For the deep convolutional network used here, feature remixing accounted for the animate/inanimate division of IT. We attempted to create an IT-like representation as a reweighted combination of the layers of the deep network and the SVM decision values. We fitted one weight for each of the layers and one weight for each of the three decision values. The bar graph in the middle row shows the weights, with 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. As before, the weights were fitted using non-negative least squares to minimize the sum of squared deviations between the RDM of the weighted combination and the hIT RDM. The resulting IT-geometry-supervised RDM (bottom row, center) is very similar to the RDMs of hIT (left) and mIT (right). The τA RDM correlation between the fitted model and IT is about equal for monkey IT (0.40) and human IT (0.38). Both of these RDM correlations are higher than the RDM correlation between hIT and mIT, reflecting the effect of noise on the empirical RDM estimates. As in Figure 5, the fitted model RDM was obtained by cross-validation to avoid overfitting to the image set. https://doi.org/10.1371/journal.pcbi.1003915.g010
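The readout step (training a linear discriminant on model features for a categorical division, then using its continuous decision values on the test stimuli as one remixed feature) can be sketched as follows. For a dependency-free illustration, a least-squares linear discriminant stands in for the maximum-margin linear SVM used in the paper, and the features and labels are synthetic stand-ins for layer-7 features and animacy labels:

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-ins for layer-7 features of 200 labeled training images.
X_train = rng.standard_normal((200, 50))
y = np.where(X_train[:, 0] + 0.3 * rng.standard_normal(200) > 0, 1.0, -1.0)

# Least-squares linear discriminant: fit weights mapping features to +/-1
# labels (bias term appended), a simple stand-in for a linear SVM.
Xb = np.column_stack([X_train, np.ones(len(X_train))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Continuous decision values on held-out stimuli serve as one remixed feature.
X_test = rng.standard_normal((92, 50))               # e.g. the 92 test stimuli
decision_values = np.column_stack([X_test, np.ones(92)]) @ w
```

Because the discriminant is trained on a separate image set, the decision values for the test stimuli do not overfit the images for which brain data were measured.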

Most models explain a small component of the IT representational geometry

Among the not-strongly-supervised models, the seven models with the highest RDM correlations with hIT and mIT are shown in Figure 1 (for other brain regions, see Figure S1 and Table 1). Visual inspection suggests that the models capture the human-face cluster, which is also prominent in IT. However, the models do not appear to place human and animal faces in a single cluster. In addition, the inanimate objects appear less clustered in the models. All models shown in Figure 1 have small but highly significant (p<0.0001) RDM correlations with hIT and mIT (Figure 1A, 1B, respectively; for RDM correlations with other brain regions, see Figure S2 for the not-strongly-supervised models and Figure S3 for the deep supervised model representations). Most of the other not-strongly-supervised models also have significant RDM correlations (Table 1, Figure 2; inference by randomization of stimulus labels). Although often significant, all RDM correlations between not-strongly-supervised models and IT were small (Kendall τA < 0.17 for hIT; τA < 0.26 for mIT).

Combining features from multiple models improves the explanation of IT

Combining features from the not-strongly-supervised models improved the RDM correlations with IT. Model features were combined by summarizing each model representation by its first 95 principal components and then concatenating these sets of principal components. This approach ensured that each model contributed equally to the combination (same number of features and same total variance contributed). The combination of the 27 not-strongly-supervised models (combi27) has a higher RDM correlation with both hIT and mIT than any of the 27 contributing models. Second to the combi27 model, internal representations of the HMAX model have the highest RDM correlations with hIT and mIT. This might reflect the fact that the architecture and parameters of the HMAX model closely follow the literature on the primate ventral stream. In addition to the combi27, we also tested the combination of untrained models, the combination of unsupervised trained models, and the combination of weakly supervised trained models (Figure S4). The combi27 explained IT equally well or better than the other combinations of the not-strongly-supervised models. In the remaining analyses, we therefore omit the other combinations and consider the combi27 along with each individual model. Monkey IT was significantly better explained by the combi27 than by the second-best among the not-strongly-supervised models (HMAX-C2 UT; p = 0.02; inference by bootstrap resampling of the stimulus set [50], not shown). This suggests that the models are somewhat complementary in explaining the IT feature space. For hIT, the second-best model was also a version of HMAX (HMAX-all UT), but it did not explain hIT significantly worse than combi27 (p = 0.261, not shown). Model RDM correlations with mIT tended to be higher than model correlations with the hIT RDM.
For example, the dissimilarity correlation of the combi27 with mIT was 0.25, whereas with hIT it was 0.17. This difference is statistically significant (p = 0.001), suggesting that the models better explained the mIT RDM than the hIT RDM. This could be caused by a lower level of noise in the mIT RDM (estimated from cell-recording data) than in the hIT RDM (estimated from fMRI data).
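The feature-combination scheme can be sketched as follows. Each (synthetic) model representation is summarized by the stimulus scores on its first 95 principal components; scaling each model's score matrix to unit total sum of squares is one way to equalize the variance contributed, though the exact normalization is an assumption here, not a detail taken from the paper:

```python
import numpy as np

def first_pcs(X, k):
    """Project stimuli onto the first k principal components (via SVD of the
    centered data), then scale to unit total sum of squares so each model
    contributes the same total variance to the combination."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :k] * S[:k]              # stimulus scores on the top-k PCs
    return pcs / np.linalg.norm(pcs)    # equalize total variance across models

rng = np.random.default_rng(2)
# Three hypothetical model representations with different feature counts,
# all evaluated on the same 96 stimuli.
models = [rng.standard_normal((96, d)) for d in (300, 1000, 150)]
combi = np.concatenate([first_pcs(X, 95) for X in models], axis=1)
```

The concatenated representation can then be analyzed with RSA exactly like any single model.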

None of the not-strongly-supervised models fully explains the IT data

For the human data we were able to estimate a noise ceiling [30] (Materials and Methods), indicating the RDM correlation expected for the true model, given the noise in the data. None of the 28 not-strongly-supervised models reached the noise ceiling (Figure 2A). The combi27 representation came closest, but at τA = 0.17 it was far from the lower bound of the noise ceiling (τA = 0.26). This indicates that the fMRI data capture a component of the hIT representation that all the not-strongly-supervised models leave unexplained. For mIT, we could not estimate the noise ceiling because we had data from only two animals.
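The logic of the noise-ceiling estimate can be sketched as follows, using Pearson correlation on vectorized RDMs for simplicity (the paper's estimator uses Kendall τA and follows Nili et al. [30]). The upper bound correlates each subject's RDM with the group-mean RDM, which includes that subject and therefore overestimates; the lower bound uses the leave-one-subject-out mean. The subject RDMs below are synthetic:

```python
import numpy as np

def noise_ceiling(subject_rdms):
    """Noise-ceiling sketch on vectorized RDMs, shape (n_subjects, n_pairs).
    Upper bound: each subject vs the all-subject mean RDM (optimistic, since
    the subject's own data contribute to the mean). Lower bound: each subject
    vs the mean RDM of the remaining subjects (pessimistic)."""
    R = np.asarray(subject_rdms, dtype=float)
    n = R.shape[0]
    mean_all = R.mean(axis=0)
    hi = np.mean([np.corrcoef(r, mean_all)[0, 1] for r in R])
    lo = np.mean([np.corrcoef(R[i], np.delete(R, i, axis=0).mean(axis=0))[0, 1]
                  for i in range(n)])
    return lo, hi

rng = np.random.default_rng(5)
true_rdm = rng.uniform(size=190)                   # upper triangle of a 20x20 RDM
subjects = true_rdm + 0.3 * rng.standard_normal((8, 190))
lo, hi = noise_ceiling(subjects)
```

A model whose RDM correlation falls within [lo, hi] explains the data as well as could be expected of the true model, given measurement noise.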

IT is more categorical than any of the not-strongly-supervised models

The main categorical divisions observed in IT appear weak or absent in the best-fitting models (Figure 1). To measure the strength of categorical clustering in each model and brain representation, we fitted a linear model of category-cluster RDMs to each model and brain RDM (Materials and Methods, Figure S5). The fitted models (Figure 3) descriptively visualize the categorical component of each RDM, summarizing sets of within- and between-category dissimilarities by their averages. The fits for several computational models show a strong human-face cluster and a weak animate cluster. The human-face cluster is expected on the basis of the visual similarity of the human-face images (all frontal, aligned human faces of approximately the same size). The animate cluster could reflect the similar colors and more rounded shapes shared by the animate objects. However, IT in both human and monkey exhibits additional categorical clusters that are not easily accounted for in terms of visual similarity. First, the IT representation has a strong face cluster that includes human and animal faces of different species, which differ widely in shape, color, and pose. Second, the IT representation has an inanimate cluster, which includes a wide variety of natural and artificial objects and scenes of totally different visual appearance. These clusters are largely absent from the not-strongly-supervised models (Figures 3, S6, S7, S8). In order to statistically compare the overall strength of categorical divisions between IT and each of the models, we computed a categoricality index for each representation. The categoricality index is the proportion of RDM variance explained by categorical divisions. It is calculated as the squared correlation between the fitted category-cluster model (Figure S5) and the RDM it is fitted to (Figure 4). The model RDMs are noiseless.
However, the brain RDMs are affected by noise, which lowers the categoricality index. To account for the noise and make the categoricality indices comparable between models and IT, we added noise matching the noise level of hIT to the model representations (Materials and Methods). We then compared the categoricality indices of the 28 not-strongly-supervised models to that of hIT (Figure 4). Human IT has a categoricality index of 0.4. All of the not-strongly-supervised models have categoricality indices below 0.16; most of them below 0.1. Inferential comparisons show that the categoricality index is significantly higher for hIT than for any of the models (inference by bootstrap resampling of the image set). We also compared the categoricality indices between models and IT without equating the noise levels. In this analysis, the categoricality index reflects the categoricality of the models without noise. For hIT and mIT, the noise lowers the categoricality estimate. Nevertheless, the hIT categoricality index remains significantly greater than that of any of the models. For mIT, similarly, the categoricality index is significantly greater than for all but three of the models (Figure S9). We also analyzed the clustering strength separately for each of the categories (Figure S6). For animates, clustering strength was significant for a few models (Lab joint color histogram, PHOG, and HMAX-all). For human faces, significant clustering was observed for several computational models (convNet, bioTransform, dense SIFT, LBP, silhouette image, gist, geometric blur, local self-similarity descriptor, global self-similarity descriptor, stable model, HMAX-C1, and combi27). These significant category clusters reflect the visual similarity of the members of these categories.
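As an illustration of the categoricality index (a simplified sketch, not the authors' implementation: the fitted category-cluster model is approximated here by binary between-category predictor RDMs plus an intercept), the computation can be written as:

```python
import numpy as np

def category_rdm(labels):
    """Binary predictor RDM: 1 where a pair straddles the category boundary."""
    lab = np.asarray(labels)
    return (lab[:, None] != lab[None, :]).astype(float)

def categoricality_index(rdm, label_sets):
    """Proportion of RDM variance explained by categorical divisions.

    rdm: (n x n) dissimilarity matrix.
    label_sets: list of per-stimulus label vectors, one per categorical
    division (e.g. animate/inanimate, face/non-face).
    """
    n = rdm.shape[0]
    iu = np.triu_indices(n, k=1)              # unique pairs only
    y = rdm[iu]
    X = np.column_stack([np.ones_like(y)] +
                        [category_rdm(lab)[iu] for lab in label_sets])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta                          # fitted category-cluster RDM
    return np.corrcoef(fitted, y)[0, 1] ** 2   # squared correlation

# A perfectly categorical toy RDM is fully explained by the division
labels = [0, 0, 0, 1, 1, 1]
toy = 1.0 + 2.0 * category_rdm(labels)        # within = 1, between = 3
print(categoricality_index(toy, [labels]))    # ≈ 1.0
```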
Inferential comparisons of clustering strength between each of the models and hIT (Figure S7) and mIT (Figure S8) for each of the categories revealed that IT clusters animates, inanimates, and faces (including human and animal faces) significantly more strongly in both species than most of the models (blue bars in Figures S7 and S8). There are only a few cases in which a model clusters one of the categories more strongly than IT.

Remixing and reweighting of the features of the not-strongly-supervised models does not improve the explanation of the IT data

The finding that categoricality is stronger in IT than in any of the models raises the question of what the models are missing. One possibility is that the models contain all essential nonlinear features, but in proportions different from IT, thus emphasizing the features differently in the representational geometry. In that case, reweighting of the features (i.e. stretching and squeezing the representational space along its original axes) should help approximate the IT representational geometry. For example, the representation might contain a feature perfectly discriminating animates from inanimates. This single categorical feature would not be reflected strongly in the overall RDM if none of the other features emphasized this categorical division. The influence of such a feature on the overall representational geometry could be increased either by replicating the feature in the representation or by amplifying the feature values. These two alternatives are equivalent in their effects on the RDM, so we consider only the latter. Another possibility is that all essential nonlinearities are present, but the features need to be linearly recombined (i.e. performing general affine transformations) to approximate the IT representational geometry. We therefore investigated whether linear remixing and reweighting of the features of the not-strongly-supervised models could provide a better explanation of the IT representational geometry. Remixing of features. We attempted to create new features as linear combinations of the original features. The space of all linear recodings is difficult to search given limited data.
We therefore restricted this analysis to the combi27 features (which represent a combination of the not-strongly-supervised models) and attempted to find linear combinations that specifically emphasize the missing categorical divisions. In order to find such linear combinations, we trained three linear support vector machine (SVM) classifiers for body/nonbody, face/non-face, and animate/inanimate categorization. The SVMs were trained on a set of 884 labeled images of isolated objects nonoverlapping with the set of 96 images we had brain data for. We used the decision-value outputs of the classifiers as new features. The resulting single-feature RDMs (Figure 5, top; one RDM for each SVM) are not highly categorical and have only a low correlation (τ A <0.1) with the IT RDMs for human and monkey. This is consistent with the fact that the combi27 representation does not perform very well on categorization tasks (Figures 11, S11).
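A minimal sketch of this feature-construction step, using scikit-learn's LinearSVC as a stand-in for the SVM implementation and random arrays in place of the actual image features (all data here are hypothetical):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 884 labeled training images and the 96 test
# stimuli, each represented by a combi27-style feature vector.
X_train = rng.standard_normal((884, 100))
y_train = rng.integers(0, 2, 884)          # e.g. animate vs inanimate labels
X_test = rng.standard_normal((96, 100))

# Train a linear SVM on the held-out labeled set ...
svm = LinearSVC(C=1.0, dual=False).fit(X_train, y_train)

# ... and use its decision values for the 96 stimuli as a single new feature.
new_feature = svm.decision_function(X_test)    # shape: (96,)

# For a single feature, the RDM is just the matrix of absolute differences.
rdm = np.abs(new_feature[:, None] - new_feature[None, :])
print(new_feature.shape, rdm.shape)
```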


Figure 11. Animate/inanimate categorization accuracy for all models. Each dark blue bar shows the categorization accuracy of a linear SVM applied to one of the computational model representations. Categorization accuracy for each model was estimated by 12-fold crossvalidation on the 96 stimuli. To assess whether categorization accuracy was above chance level, we performed a permutation test, in which we retrained the SVMs on 10,000 random (category-orthogonalized) dichotomies among the stimuli. Light blue bars show the average model categorization accuracy for random label permutations. Categorization performance was significantly greater than chance for most models (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). The deep convolutional network model (final fully connected layer 7) has the highest animate/inanimate categorization performance (96%). The combi27 has the second highest performance (76%). https://doi.org/10.1371/journal.pcbi.1003915.g011

Feature reweighting. Combining the not-strongly-supervised models with equal weight in the combi27 representation improved the explanation of our IT data. We wanted to test whether appropriate weighting of the not-strongly-supervised models could further improve the explanation of the IT geometry. In addition to the 27 not-strongly-supervised models, we included the combi27 model and the three categorical SVM discriminants in the set of representations to be combined. We fitted one weight for each of these representations (27+1+3 = 31 weights in total), so as to best explain the hIT RDM (Figure 5, middle row). Flipping the sign of a feature (weight = −1) has no effect on the representational distances. We can thus consider only positive weights, without loss of generality.
We therefore used a non-negative least-squares fitting algorithm [51] to find the non-negative weights for the models that minimize the sum of squared deviations between the hIT RDM and the RDM of the weighted combination of models. The RDM of the weighted combination of the model features is equivalent to a weighted combination of the RDMs of the models (Materials and Methods) when squared Euclidean distance is used. We used the squared Euclidean distance for normalized representational patterns, which is equivalent to correlation distance, as used throughout this paper. We therefore applied the non-negative least-squares algorithm at the level of the RDMs. In order to avoid overestimation of the RDM correlation between the fitted model and hIT due to overfitting to the image set, we fitted the weights to random subsets of 88 of the 96 images in a crossvalidation procedure, holding out 8 images on each fold. We then estimated the representational dissimilarities for the weighted-combination model for the 8 held-out images. We repeated this procedure until the entire RDM of 96 by 96 images was estimated (Figure 5, bottom row, center). Feature reweighting and remixing did not reproduce the categorical structure observed in IT (Figure 5, bottom row). In fact, the weighted-combination model did slightly worse than combi27 at explaining hIT and mIT (τ A = 0.13 for hIT, τ A = 0.20 for mIT). The lower performance, despite the inclusion of combi27 as one of the component representations, reflects the cost of overfitting. However, since we fitted only 31 weights in the reweighting step, that cost is small. The failure to improve the explanation of the IT geometry through remixing and reweighting thus suggests that the not-strongly-supervised models are missing features important to the IT representational geometry. Different nonlinear features and more powerful supervised learning methods may be needed to fully capture the structure of the IT representation.
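A sketch of this crossvalidated non-negative reweighting at the RDM level (function and variable names are ours; the exact fold assignment and treatment of held-out pairs in the original analysis may differ):

```python
import numpy as np
from scipy.optimize import nnls

def crossvalidated_weighted_rdm(model_rdms, target_rdm, n_folds=12, seed=0):
    """Non-negative reweighting of model RDMs, crossvalidated over stimuli.

    model_rdms: (n_models x n x n) stack of model RDMs.
    target_rdm: (n x n) brain RDM to fit (e.g. hIT).
    Weights are fitted on pairs of training stimuli only; dissimilarities
    for pairs involving held-out stimuli are then predicted from those
    weights, so the returned RDM is not overfitted to the image set.
    """
    n = target_rdm.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    predicted = np.zeros((n, n))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Fit weights on upper-triangular pairs among training stimuli.
        iu = np.triu_indices(len(train), k=1)
        X = np.stack([m[np.ix_(train, train)][iu] for m in model_rdms], axis=1)
        y = target_rdm[np.ix_(train, train)][iu]
        w, _ = nnls(X, y)
        # Predict all pairs involving a held-out stimulus.
        full_pred = np.tensordot(w, model_rdms, axes=1)
        predicted[held_out, :] = full_pred[held_out, :]
        predicted[:, held_out] = full_pred[:, held_out]
    return predicted

# Hypothetical check: if the target is a non-negative mixture of the model
# RDMs, crossvalidated reweighting recovers it almost exactly.
rng = np.random.default_rng(1)
M = (lambda A: (A + A.transpose(0, 2, 1)) / 2)(rng.random((3, 24, 24)))
for m in M:
    np.fill_diagonal(m, 0.0)
target = 0.5 * M[0] + 2.0 * M[1]
pred = crossvalidated_weighted_rdm(M, target, n_folds=4, seed=1)
print(np.allclose(pred, target, atol=1e-6))
```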
We therefore next tested a deep supervised convolutional neural network [52].

A strongly supervised deep convolutional network better explains the IT data

So far, we have shown that none of the not-strongly-supervised models were able to reproduce the categorical structure present in IT. Most of these models were untrained or trained without supervision. A few of them were weakly supervised (i.e. supervised with merely 884 training images). Their failure to explain our IT data suggests that computational features trained to cluster the categories through supervised learning with many labeled images might be needed to explain the IT representational geometry. We therefore tested a deep convolutional neural network trained with 1.2 million labeled images [52], nonoverlapping with the set of 96 images used here. The model has eight layers. The RDM for each of the layers and the RDM correlations with hIT and mIT are shown in Figure 6. The deep supervised convolutional network explains the IT geometry better than any of the not-strongly-supervised models. The RDM correlation between hIT and the deep convolutional network's best-performing layer (layer 7) is τ A = 0.24. Layer 7 explains the hIT representation significantly better (p<0.05; obtained by bootstrap resampling of the stimulus set) than combi27 (τ A = 0.17), the best-performing of the not-strongly-supervised models. Monkey IT, as well, is better explained by layer 7 (τ A = 0.29) than by combi27 (τ A = 0.25), although the difference is not significant. Layer 7 is the deep network's highest continuous representational space, followed only by the readout layer (layer 8, also known as the “scores”). The readout layer is composed of 1000 features, one for each of the 1000 category labels used in training the network. The readout layer has a lower RDM correlation with hIT (τ A = 0.13) and mIT (τ A = 0.18) than layer 7.
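The layer-wise RDMs underlying these comparisons use correlation distance between activation patterns, which (as noted above) is equivalent, up to a constant factor, to squared Euclidean distance on normalized patterns. A minimal sketch for one hypothetical layer's activations:

```python
import numpy as np

def correlation_distance_rdm(features):
    """RDM of correlation distances (1 - Pearson r) between stimulus patterns.

    features: (n_stimuli x n_features) activation matrix for one model layer.
    After centering and scaling each pattern to unit norm, 1 - r equals
    half the squared Euclidean distance between the normalized patterns.
    """
    X = features - features.mean(axis=1, keepdims=True)   # center each pattern
    X /= np.linalg.norm(X, axis=1, keepdims=True)         # unit length
    return 1.0 - X @ X.T

# Hypothetical example: one layer's activations for 96 stimuli
rng = np.random.default_rng(0)
layer_acts = rng.standard_normal((96, 4096))
rdm = correlation_distance_rdm(layer_acts)
print(rdm.shape)   # (96, 96); diagonal ~0, values in [0, 2]
```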
From layer 1 to layer 7 the RDM correlation with IT rises roughly monotonically (Figure 7, Table 2) and many of the pairwise comparisons between RDM correlations for higher and lower layers are significant (Figure 7, horizontal lines at the top). Nevertheless, even the best-performing layer 7 does not reach the noise ceiling (Figure 7). Although the deep convolutional network outperforms all not-strongly-supervised models, it does not fully explain our IT data. As for the not-strongly-supervised models, we analyzed the categoricality of the layers of the deep supervised model (Figures 8, 9). All layers of the deep supervised model, including layer 7 and layer 8 (the readout layer), have significantly lower categoricality indices than hIT and mIT (Figure 9). This might reflect the fact that the stimulus set was equally divided into animates and inanimates and this division, thus, strongly influences our categoricality index. Importantly, the deep supervised network emphasizes some categorical divisions more strongly and others less strongly than IT (Figure 8). For example, layer 7 emphasizes the division between human and animal faces and the division between artificial and natural inanimate objects more strongly than IT. However, IT emphasizes the animate/inanimate and the face/body division more strongly than layer 7.

Remixing and reweighting of the deep supervised features fully explains the IT data

We have seen that the deep supervised model provides better separation of the categories than the not-strongly-supervised models and that it also better explains IT. However, it did not reach the noise ceiling. As for the not-strongly-supervised models, we therefore asked whether remixing the features linearly (by adding linear readout features emphasizing the right categorical divisions) and reweighting the different layers and readout features could provide a better model of the IT representation. The method for remixing and reweighting was exactly the same as for the not-strongly-supervised models (Figure 5). However, the linear SVM features were based on layer 7 (instead of combi27), and the reweighting involved fitting one weight for each of the layers (1–8) and one weight for each of the three linear SVM features. As before, the linear SVM features were trained for body/nonbody, face/non-face, and animate/inanimate categorization using the nonoverlapping set of 884 training images. The RDMs for the SVM readout features show strong categorical divisions (Figure 10, top row). This is consistent with the fact that the layer-7 representation performs well on categorization tasks (Figures 11, S11). As before, we used non-negative least-squares fitting to find the weighted combination of the representations that best approximates hIT. Again, we avoided overfitting to the image set by fitting the weights to random subsets of 88 of the 96 images in a crossvalidation procedure, holding out 8 images on each fold. This procedure yielded a weight for each of the eight layers of the deep network and for each of the three linear SVM readout features (11 weights in total; Figure 10, middle row; Materials and Methods). We refer to this weighted combination as the IT-geometry-supervised deep model.
Inspecting the RDM reveals the similarity of its representational geometry to hIT and mIT (Figure 10, bottom row). The model emphasizes the major categorical divisions similarly to IT (Figure 8, bottom right). In contrast to all other models, this model has a categoricality index matching mIT and not significantly different from either mIT or hIT (Figure 9). The IT-geometry-supervised deep model explains hIT better than any layer of the deep network (Figure 7, horizontal lines at the top). It has the highest RDM correlation with hIT (τ A = 0.38) and mIT (τ A = 0.4) among all model representations considered in this paper. Importantly, it falls well within the upper and lower bounds of the noise ceiling and, thus, fully explains the non-noise component of our hIT data.

Model representations more similar to IT categorize better

Figure 11 shows the animate/inanimate categorization accuracy of linear SVM classifiers taking each of the model representations as their input (for the face/body dichotomy and the artificial/natural dichotomy among inanimates, see Figure S11). The categorization accuracy for each model was estimated by 12-fold crossvalidation on the 96 stimuli (Materials and Methods). The deep convolutional network model (layer 7) has the highest animate/inanimate categorization performance (96%), and the combi27 has the second highest performance (76%). Figure 12 shows that models whose representations were more similar to IT tended to have a higher animate/inanimate categorization performance. The Pearson correlation between the IT-to-model representational similarity (τ A RDM correlation) and categorization accuracy was 0.75 for hIT and 0.68 for mIT across the 28 not-strongly-supervised model representations and the seven layers of the deep supervised model. This finding could simply reflect the fact that the categories correspond to clusters in the IT representation and any representation clustering the categories will be well-suited for categorization. Indeed, categorization performance is also predicted by the RDM correlation between a model and an animate-inanimate categorical RDM, albeit with a lower correlation coefficient (r = 0.38, not shown).
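A sketch of this crossvalidated accuracy estimate using scikit-learn (the features here are synthetic stand-ins, deliberately made separable; the original analysis applied the SVMs to the actual model representations of the 96 stimuli):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for one model's representation of the 96 stimuli
# (48 animate, 48 inanimate); the class-dependent shift makes the two
# classes linearly separable so that accuracy is well above chance.
labels = np.repeat([0, 1], 48)
features = rng.standard_normal((96, 50)) + 1.5 * labels[:, None]

# 12-fold crossvalidated accuracy of a linear SVM on this representation
cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=0)
acc = cross_val_score(LinearSVC(dual=False), features, labels, cv=cv)
print(acc.mean())
```

Chance level for this balanced dichotomy is 0.5; the paper additionally calibrated chance with a label-permutation test rather than relying on the nominal value.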


Figure 12. Model representations resembling IT afford better categorization accuracy. A model's IT-resemblance (measured by the RDM correlation between IT and model) predicts its categorization accuracy (animate/inanimate). This holds for both human-IT resemblance (top) and monkey-IT resemblance (bottom). The substantial positive correlation between IT-resemblance and categorization accuracy could reflect the categorical clustering of IT (left panels). However, the within-category RDM correlation between a model and IT also predicts model categorization accuracy (right panels). Each panel shows the least-squares fit (gray line) and the Spearman rank correlation r (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Each circle shows one of the models. Numbers indicate the model (see Table 1 for model numbering). Different layers of the deep supervised convolutional network are indicated by colored labels “L1” (layer 1) to “L7” (layer 7). The deep model's layers are color-coded from light blue to light red (from lower to higher layers). Computer vision models are shown by gray circles; biologically motivated models are shown by black circles. The transparent horizontal and vertical rectangles cover non-significant ranges along each axis. https://doi.org/10.1371/journal.pcbi.1003915.g012

In order to further assess whether it was only the category clustering that predicted categorization accuracy or something deeper about the similarity of the model representation to IT, we considered the within-category dissimilarity correlation between each model and IT as a predictor of categorization accuracy. Models that were more similar to IT in terms of their within-category representational geometry (dissimilarities among animates and dissimilarities among inanimates) also tended to have higher categorization performance (Pearson r = 0.45 for hIT, r = 0.67 for mIT; p<0.01, p<0.0001, respectively).
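The within-category analysis can be sketched as follows (synthetic RDMs stand in for the model and IT RDMs; the extraction of same-category pairs is ours, and the paper reports Pearson correlations for this analysis):

```python
import numpy as np

def within_category_dissimilarities(rdm, labels):
    """Vector of dissimilarities among same-category stimulus pairs."""
    lab = np.asarray(labels)
    i, j = np.triu_indices(rdm.shape[0], k=1)   # unique pairs only
    same = lab[i] == lab[j]
    return rdm[i[same], j[same]]

# Hypothetical example: a "model" RDM that is a noisy copy of an "IT" RDM
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 48)                  # animate / inanimate
it_rdm = rng.random((96, 96))
it_rdm = (it_rdm + it_rdm.T) / 2
model_rdm = it_rdm + 0.1 * rng.random((96, 96))

# Correlate the within-category dissimilarities of model and IT
a = within_category_dissimilarities(it_rdm, labels)
b = within_category_dissimilarities(model_rdm, labels)
r = np.corrcoef(a, b)[0, 1]
print(a.shape, r)
```

Because between-category pairs are excluded, this correlation cannot be driven by categorical clustering alone; it probes the finer-grained geometry within each category.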
These results may add to the motivation for computer vision to learn from biological vision. If computational feature spaces more similar to the IT representation yield better categorization performance within the present set of models, then it might be a good strategy for computer vision to seek to construct features even more similar to IT.

Several models using Gabor filters and other low-level features explain human early visual cortex

We could not distinguish early visual areas V1, V2, and V3, because stimuli were presented foveally in the human fMRI experiment (2.9° visual angle in diameter, centered on fixation). Instead we defined an ROI for early visual cortex (EVC), which covered the foveal confluence of these retinotopic representations. Several models using Gabor filters (SIFT, gist, PHOG, HMAX, convNet) and other features (geometric blur, local self-similarity descriptor, global self-similarity descriptor, silhouette image) explained the early visual RDM estimated from fMRI (Figures S1A, S2A). These models not only explained significant dissimilarity variance, but reached the noise ceiling, indicating that they explain the EVC representation to the extent that the noise in our data enables us to assess this. For the HMAX model (as implemented by Serre et al. [20]), we tested several internal representations. The HMAX-C2 layer had the highest RDM correlation with EVC among all models. The HMAX-C2 layer falls within the early stages (above the S1, C1, and S2 layers, and below the S2b, S3, C2b, C3, and S4 layers) of the HMAX model, and its features closely parallel the initial stages of primate visual processing. For the deep supervised model, the RDM correlations of different layers with EVC are shown in Figure S3A. Layers 2 and 3 of the model have the highest RDM correlation with EVC and reach the noise ceiling. However, their correlation with EVC is lower than that of the HMAX-C2 layer.