Theories of object recognition agree that shape is of primordial importance, but there is no consensus about how shape might be represented, and so far attempts to implement a model of shape perception that would work with realistic stimuli have largely failed. Recent studies suggest that state-of-the-art convolutional ‘deep’ neural networks (DNNs) capture important aspects of human object perception. We hypothesized that these successes might be partially related to a human-like representation of object shape. Here we demonstrate that sensitivity for shape features, characteristic of human and primate vision, emerges in DNNs when trained for generic object recognition from natural photographs. We show that these models explain human shape judgments for several benchmark behavioral and neural stimulus sets on which earlier models mostly failed. In particular, although never explicitly trained for such stimuli, DNNs develop acute sensitivity to minute variations in shape and to non-accidental properties that have long been implicated as forming the basis for object recognition. Even more strikingly, when tested with a challenging stimulus set in which shape and category membership are dissociated, the most complex model architectures capture human shape sensitivity as well as some aspects of the category structure that emerges from human judgments. As a whole, these results indicate that convolutional neural networks not only learn physically correct representations of object categories but also develop perceptually accurate representational spaces of shapes. An even more complete model of human object representations might be in sight by training deep architectures for multiple tasks, which is so characteristic of human development.

Shape plays an important role in object recognition. Despite years of research, no models of vision could account for shape understanding as found in human vision of natural images. Given recent successes of deep neural networks (DNNs) in object recognition, we hypothesized that DNNs might in fact learn to capture perceptually salient shape dimensions. Using a variety of stimulus sets, we demonstrate here that the output layers of several DNNs develop representations that relate closely to human perceptual shape judgments. Surprisingly, such sensitivity to shape develops in these models even though they were never explicitly trained for shape processing. Moreover, we show that these models also represent categorical object similarity that follows human semantic judgments, albeit to a lesser extent. Taken together, our results bring forward the exciting idea that DNNs capture not only objective dimensions of stimuli, such as their category, but also their subjective, or perceptual, aspects, such as shape and semantic similarity as judged by humans.

Funding: This work has been funded by the Belgian Science Policy (IUAP P7/11, http://www.belspo.be/iap/ ) and the European Research Council (ERC-2011-Stg-284101; http://erc.europa.eu/ ) grants awarded to HPOdB. JK is a research assistant of the Research Foundation—Flanders (FWO; http://www.fwo.be/ ) and holds a Postdoctoral Mandate from the Internal Funds KU Leuven ( http://www.kuleuven.be/research/support/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Here we put this hypothesis to the test through a few benchmark stimulus sets, which have highlighted particular aspects of human shape perception in the past. We first demonstrate that convolutional neural networks (convnets), the most common kind of DNN models in image processing, can recognize objects based upon shape also when all other cues are removed, as humans can. Moreover, we show that despite being trained solely for object categorization, higher layers of convnets develop a surprising sensitivity for shape that closely follows human perceptual shape judgments. When we dissociate shape from category membership, then abstract categorical information is available to a limited extent in these networks, suggesting that a full model of shape and category perception might require richer training regimes for convnets.

The stimulus sets on which DNNs have been tested in these previous studies allow the inference that there is a general correspondence between the representations developed within DNNs and important aspects of human object representations at the neural level. However, these stimulus sets were not designed to elucidate specific aspects of human representations. In particular, a long tradition in human psychophysics and primate physiology has pointed towards the processing of shape features as the underlying mechanism behind human object recognition (e.g., [ 12 – 15 ]). Cognitive as well as computational models of object recognition have mainly focused upon the hierarchical processing of shape (e.g., [ 1 , 16 , 17 ]). There are historical and remaining controversies about the exact nature of these shape representations, such as about the degree of viewpoint invariance and the role of structural information in the higher levels of representation (e.g., [ 18 , 19 ]). Still, all models agree on the central importance of a hierarchical processing of shape. For this reason we hypothesized that the general correspondence between DNN representations and human object representations might be related to a human-like sensitivity for shape properties in the DNNs.

Recently, however, deep neural networks (DNNs) brought tremendous excitement and hope to multiple fields of research. For the first time, a dramatic increase in performance has been observed on object and scene categorization tasks [ 7 , 8 ], quickly reaching performance levels rivaling humans [ 9 ]. More specifically in the context of object recognition, stimulus representations developed by the deep nets have been shown to account for neural recordings in monkey inferior temporal cortex and functional magnetic resonance imaging data throughout the human ventral visual pathway (e.g., [ 6 , 10 , 11 ]), suggesting that some fundamental processes, shared across different hardware, have been captured by deep nets.

Understanding how the human visual system processes visual information involves building models that would account for human-level performance on a multitude of tasks. For years, despite the best efforts, computational understanding of even the simplest everyday tasks such as object and scene recognition has been limited to toy datasets and poor model performance. For instance, the hierarchical architecture HMAX [ 1 ], once known as “the standard model” of vision [ 2 ], worked successfully on a stimulus set of paper clips and could account for some rapid categorization tasks [ 3 ] but failed to capture shape and object representations once tested more directly against representations in the visual cortex (e.g., [ 4 – 6 ]).

Despite its significance, the correlation with categorical judgments was much weaker than with shape, even after we restricted stimuli to the 23 objects present in ImageNet, meaning that the learned representations in convnets are largely based on shape and not category. In other words, categorical information is not as dominant in convnets as in humans, in agreement with [ 6 ] where deep nets were shown to account for categorical representations in humans only when categorical training was introduced on top of the outputs of convnets. (See also Discussion where we talk about the availability of information in models.)

The abundance of categorical information in convnet outputs is most strikingly illustrated in Fig 5B where a multidimensional scaling plot depicts overall stimulus similarity. A nearly perfect separation between natural and manmade objects is apparent. Note that less than a half of these objects (23 out of 54) were known to GoogLeNet, but even completely unfamiliar objects are nonetheless correctly situated. This is quite surprising given that convnets were never trained to find associations between different categories. In other words, there is no explicit reason why a convnet should learn to represent guitars and flutes similarly (the category of “musical instruments” is not known to the model). We speculate that these associations might be learned implicitly, since during training objects of the same superordinate category (“musical instruments”) might co-occur in images. Further tests would be necessary to establish the extent of such implicit learning in convnets.

First, we found that convnets represented shape fairly well, correlating with perceptual human shape judgments between .3 and .4, nearly reaching the human performance limit ( Fig 5C and 5D ). Unlike before, the effect was not specific to deep models but was also observed in HMAX and even shallow models. This observation is expected because, unlike in previous experiments, in this stimulus set physical form and perceived shape are well correlated. Instead, the purpose of this stimulus set was to investigate to what extent semantic human category judgments are captured by convnets, since here category is dissociated from shape. We found that all deep but not shallow or HMAX models captured at least some semantic structure in our stimuli ( Fig 5D and 5E ; bootstrapped related samples significance test for deep vs. shallow and deep vs. HMAX: p < .001), indicating that representations in convnets contain both shape and category information. Similar to Exp. 1, comparable correlations were observed even when the models were provided only with silhouettes of the objects (no texture), indicating that such categorical decisions appear to rely mainly on the shape contour and not internal features.

We employed this stimulus set to explore how categorical information is represented by convnets. As before, participants were asked to judge similarity among stimuli based either on their shape or on their category. Note that even for categorical judgments, participants were asked to rate categorical similarity rather than divide the stimulus set into six categories, resulting in idiosyncratic categorical judgments and consistency between humans not reaching ceiling.

(a) Stimulus set from [ 37 ] with shape and category information largely orthogonal. (Adapted with permission from [ 37 ].) (b) Multidimensional scaling plot representing object dissimilarity in the output layer of GoogLeNet. The black line indicates a clear separation between natural (brown, orange, yellow) and man-made (blue, gray, pink) objects. (c-d) A correlation between model representations and human shape (c) and category (d) judgments. Gray band indicates estimated ceiling correlation based on human performance. (e) A correlation with human shape (orange) and category (blue) judgments across the layers of HMAX models and convnets. Vertical dotted lines indicate where fully-connected layers start. In all plots error bars (or bands) represent 95% bootstrapped confidence intervals.

Typically, object shape and semantic properties are correlated, such that objects from the same category (e.g., fruits) share some shape properties as well (all have smooth roundish shape) that may distinguish them from another category (e.g., cars that have more corners), making it difficult to investigate the relative contributions of these two dimensions. To overcome these limitations, Bracci and Op de Beeck [ 37 ] recently designed a new stimulus set, comprising 54 photographs of objects, where shape and category dimensions are orthogonal to each other as much as possible ( Fig 5A ). In particular, objects from six categories have been matched in such a way that any one exemplar from a particular category would have a very similar shape to an exemplar from another category. Thus, the dissociation between shape and category is more prominent and can be meaningfully measured by asking participants to judge similarity between these objects based either on their shape or on their category. By correlating the resulting dissimilarity matrices to human neural data, Bracci and Op de Beeck [ 37 ] found that perceived shape and semantic category are represented in parallel in the visual cortex.

In the first three experiments, we demonstrated convnet sensitivity to shape properties. However, these convnets have been explicitly trained to optimize not for shape but rather category, that is, to provide a correct semantic label. Apparently, categorization is aided by developing sensitivity to shape. But is there anything beyond sensitivity to shape then that convnets develop? In other words, to what extent do these networks develop semantic representations similar to human categorical representations over and above mere shape information?

We found that all deep but not shallow or HMAX models (except for HMAX’99) showed a higher than chance performance ( Fig 4B ) with performance typically improving gradually throughout the architecture ( Fig 4C ; bootstrapped related samples significance test for deep vs. shallow, one-tailed: p < .001; deep vs. HMAX: p = .011). Moreover, deeper networks tended to perform slightly better than shallower ones, in certain layers even achieving perfect performance. Overall, there was not any clear pattern in mistakes across convnets, except for a tendency towards mistakes in the main axis curvature, that is, convnets did not seem to treat straight versus curved edges as very distinct. In contrast, humans consistently show a robust sensitivity to changes in the main axis curvature [ 31 , 36 ]. Note that humans are also not perfect at detecting NAPs as reported by [ 36 ]. Thus, we do not go further into these differences because the RBC theory and most previous behavioral and neural studies only address a general preference for NAP changes, and hence do not provide a systematic framework for interpreting the presence or absence of such preference for specific NAPs.

We evaluated model performance on this set of 22 geons ( Fig 4A ) that have been used previously in behavioral [ 31 , 32 , 36 ] and neurophysiological studies. A model’s response was counted as accurate if the response to a non-accidental stimulus was more dissimilar from the base than the metric one.
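The scoring rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `correlation_distance` and `nap_accuracy` are hypothetical, the toy feature vectors are invented, and the use of a correlation distance (1 minus Pearson r) is an assumption, since the dissimilarity measure is specified in the Methods rather than in this passage.

```python
import numpy as np

def correlation_distance(x, y):
    """Assumed dissimilarity measure: 1 - Pearson correlation of two feature vectors."""
    return 1.0 - float(np.corrcoef(x, y)[0, 1])

def nap_accuracy(triplets, distance=correlation_distance):
    """Fraction of triplets in which the non-accidental variant is farther from the base.

    `triplets` is an iterable of (nonaccidental, base, metric) feature vectors,
    e.g. the outputs of one model layer for the three stimuli of a geon triplet.
    """
    hits = [distance(nap, base) > distance(metric, base)
            for nap, base, metric in triplets]
    return float(np.mean(hits))

# Toy example with invented feature vectors (not real model outputs).
base = np.array([1.0, 2.0, 3.0, 4.0])
near = np.array([1.0, 2.0, 3.0, 5.0])   # close to the base
far = np.array([4.0, 1.0, 3.0, 2.0])    # far from the base
toy_triplets = [
    (far, base, near),   # non-accidental variant is farther: counted as accurate
    (near, base, far),   # metric variant is farther: counted as a mistake
]
toy_accuracy = nap_accuracy(toy_triplets)
```

With 22 triplets, as in the stimulus set used here, chance level under this rule is 50%, because each triplet is an independent binary comparison.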

Thus, the sensitivity for non-accidental properties presents an important and well-tested line of research where the physical size of differences between shapes is dissociated from the effect of specific shape differences on perception. We tested the sensitivity for non-accidental properties using a previously developed stimulus set of geon triplets where the metric variant is as distinct or, typically, even more distinct from the base than the non-accidental variant as measured in the metric (physical) space. Nevertheless, humans and other species report perceiving non-accidental shapes as more dissimilar from the base than the metric ones, presenting us with a perfect test case where, similar to Exp. 2, physical shape similarity is different from the perceived one.

Over the years, Biederman and others consistently found such preference to hold in a large number of studies across species [ 25 – 28 ], age groups [ 29 – 31 ], non-urban cultures [ 32 ], and even in the selectivity of inferior temporal neurons in monkeys [ 24 , 33 ]. This idea of invariants has also been shown to play an important role in scene categorization [ 34 ] and famously penetrated computer vision literature when David Lowe developed his SIFT (Scale-Invariant Feature Transform) descriptor that attempted to capture invariant features in an image [ 35 ].

(a) Examples of geons [ 24 ]. In order to measure a model’s sensitivity to changes in non-accidental properties, the model’s output is computed for a particular stimulus (middle column) and compared to the output when another variant of the same kind of stimulus is presented (right column) and when a non-accidental change in the stimulus is introduced (left column) that is physically (in the metric space) just as far from the base as the metric variant. We used 22 such triplets in total. (b) Model performance on discriminating between stimuli. For each triplet, a model’s output is counted as accurate if the non-accidental variant is more dissimilar from the base stimulus than the metric variant is from the base. Chance level (50%) is indicated by a dashed line. (c) HMAX and convnet performance on the task at different layers. Non-accidental stimuli appear to be closer to the base in the early layers, which is consistent with a conservative design of the stimuli [ 24 ]. Best performance is observed at the upper layers of convnets with a slight dip at the output layer. Vertical dotted lines indicate where fully-connected layers start. In both plots, error bars (or bands) depict 95% bootstrapped confidence intervals.

In 1987, Biederman put forward the Recognition-by-Components (RBC) theory [ 16 ], which proposed that object recognition might be based on shape properties known as non-accidental. Under natural viewing conditions, many of an object’s properties change, depending on lighting, clutter, viewpoint and so on. In order to recognize objects robustly, Biederman proposed that the visual system might utilize those properties that remain largely invariant under possible natural variations. In particular, Biederman focused on those properties of object shape that remain unchanged when the three-dimensional shape of an object is projected to the two-dimensional surface on the eye’s retina, such as curved versus straight object axis, parallel versus converging edges, and so on [ 23 ]. Importantly, RBC theory predicts that observers should notice a change in a non-accidental property more readily than an equivalent change in a metric property. Consider, for example, geons shown in Fig 4A , top row. Both the non-accidental and the metric variant differ by the same amount from the base geon (as measured by some linear metric, such as pixelwise or GaborJet difference), yet the non-accidental one appears more distinct to us.

Moreover, we observed that convnets exhibited significantly stronger correlations with behavioral similarity than the HMAX family of models (bootstrapped related samples significance test, one-tailed: p < .001). This tendency was already visible in Exp. 2a, but did not reach statistical significance yet. With this larger stimulus set that contains less pronounced perceptual dimensions, the limitations of HMAX models in shape processing and their relevance in understanding human perception became prominent, even for the most complex version HMAX-PNAS. These results corroborate earlier observations that the HMAX family of models is not sufficient for explaining categorical data dimensions [ 4 , 6 ].
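To make the idea of a one-tailed, related-samples bootstrap comparison concrete, the following is a minimal sketch and not the procedure from the Methods. It assumes that the resampled units are stimulus pairs (entries of the dissimilarity matrices), that the statistic is the difference in Pearson correlation with human judgments, and that the p-value is the share of resamples in which the first model does not beat the second. The function name `paired_bootstrap_p` and all data are hypothetical.

```python
import numpy as np

def paired_bootstrap_p(human, model_a, model_b, n_boot=2000, seed=0):
    """One-tailed bootstrap p-value for 'model_a correlates better than model_b'.

    All three inputs are 1-D arrays of dissimilarities over the same stimulus
    pairs. The same resampled indices are used for both models, which is what
    makes the comparison a related-samples one.
    """
    rng = np.random.default_rng(seed)
    human, model_a, model_b = (np.asarray(v, dtype=float)
                               for v in (human, model_a, model_b))
    n = human.size
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        r_a = np.corrcoef(human[idx], model_a[idx])[0, 1]
        r_b = np.corrcoef(human[idx], model_b[idx])[0, 1]
        diffs[i] = r_a - r_b
    return float(np.mean(diffs <= 0.0))

# Invented data: 36 stimulus pairs (the number of pairs among 9 stimuli).
rng = np.random.default_rng(1)
toy_human = rng.normal(size=36)
toy_deep = toy_human + 0.1 * rng.normal(size=36)  # tracks the human judgments
toy_shallow = rng.normal(size=36)                 # unrelated to them
toy_p = paired_bootstrap_p(toy_human, toy_deep, toy_shallow)
```

Resampling whole stimuli instead of stimulus pairs would be an equally plausible design; the sketch uses pairs only because it keeps the code short.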

As seen in Fig 3C , deep models are sensitive to differences between fonts, with the letters presented in the same font tending to be clustered in the same part of the representational space. To quantify the similarity of model representations to human judgments, as before, we asked participants to rate the similarity between all these letters, and correlated their judgments with model outputs ( Fig 3D ). We found that deep models captured perceived similarity of letters significantly better than shallow models (bootstrapped related samples significance test, one-tailed: p < .001), whereas shallow models correlated significantly better with physical form (bootstrapped related samples significance test, one-tailed: p < .001), consistent with our hypothesis that the formation of perceptually-relevant representational spaces is a general property of convnets. In fact, convnets captured all explainable variance in human data, demonstrating that in certain scenarios convnets are already sufficiently robust.

(a) Stimulus set of six letters from six constructed fonts. (b) Multidimensional scaling plots for the dissimilarity matrices of shape judgments by humans and the GoogLeNet outputs (at the top layer). Notice how humans and GoogLeNet are good at clustering the ULOG font but both struggle with Futurama. (c-d) A correlation between model outputs and physical form similarity (c) and perceived shape similarity (d) of stimuli. Whereas shallow models only capture physical shape, deep models capture perceived shape significantly better than shallow and HMAX models. Gray band indicates the estimated ceiling correlation based on human performance. (e) Correlation with physical (green) and perceived (orange) shape similarity across the layers of HMAX models and convnets. Vertical dotted lines indicate where fully-connected layers start. A preference for the perceived shape emerges in the upper layers. In all plots, error bars (or bands) indicate the 95% bootstrapped confidence intervals.

So far, we found that deeper networks generally reflect perceived shape similarity better than shallower ones. Is this effect specific to the two dimensions used in [ 5 ] or does it reflect a broader tendency for deeper nets to develop perceptually-relevant representations? In Experiment 2b, we constructed a new stimulus set where perceptually relevant dimensions were no longer explicitly defined. In particular, it was composed of six letters from six novel font families ( Fig 3A ). Even though the letters were unrecognizable (e.g., a letter ‘a’ in these fonts looked nothing like a typical ‘a’) and varied substantially within a font family, implicitly the letters in each font shared the same style (size was matched across fonts as much as possible). It should be noted though that even for human observers detecting these font families is not straightforward (note mistakes in Fig 3B ).

To quantify how well the computed stimulus dissimilarities overall corresponded to the physical form or the perceived shape dissimilarity, we correlated physical form and perceived shape dissimilarity matrices with model’s output. We found that shallow models were mostly better at capturing physical dissimilarity than the output layers of deep models ( Fig 2C ; bootstrapped related samples one-tailed significance test for shallow vs. HMAX and for shallow vs. deep: p < .001), whereas perceived shape was better captured by most deep models ( Fig 2D ; bootstrapped related samples one-tailed significance test for deep vs. HMAX: p = .010; deep vs. shallow: p = .002). (But note that, obviously, early layers of deep nets typically can reflect physical dissimilarities too.) Correlation layer-by-layer revealed that preference for shape gradually increased throughout the layers of all convnets, whereas physical similarity abruptly decreased at their upper layers ( Fig 2E ).
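Correlating two dissimilarity matrices usually means correlating their off-diagonal entries, since the matrices are symmetric and the diagonal is zero by construction. The sketch below shows this under stated assumptions: the helper names are hypothetical, Pearson correlation is an assumed choice, and the binary toy matrices are a crude stand-in for the 3 x 3 design (three physical forms crossed with three perceived shapes), not the measured matrices.

```python
import numpy as np

def upper_triangle(matrix):
    """Return the entries above the diagonal of a square matrix as a vector."""
    matrix = np.asarray(matrix, dtype=float)
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def rdm_correlation(rdm_a, rdm_b):
    """Pearson correlation between the off-diagonal entries of two matrices."""
    return float(np.corrcoef(upper_triangle(rdm_a), upper_triangle(rdm_b))[0, 1])

# Toy 3 x 3 design: stimulus i has physical form i // 3 and perceived shape i % 3.
form = np.arange(9) // 3
shape = np.arange(9) % 3
toy_physical_rdm = (form[:, None] != form[None, :]).astype(float)
toy_perceived_rdm = (shape[:, None] != shape[None, :]).astype(float)

# A hypothetical model whose outputs group stimuli purely by perceived shape.
toy_model_rdm = toy_perceived_rdm.copy()
r_perceived = rdm_correlation(toy_model_rdm, toy_perceived_rdm)
r_physical = rdm_correlation(toy_model_rdm, toy_physical_rdm)
```

In this binary toy coding the two design matrices correlate at -1/3 over the 36 stimulus pairs, so a model that matches one of them perfectly cannot also correlate positively with the other.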

In order to investigate the relative importance of the physical form and perceived shape for various computer vision models, we used this stimulus set to compute the outputs of five shallow (single layer) models (pixelwise, GaborJet, HOG, PHOG, and PHOW), three versions of HMAX of varying complexity (HMAX ‘99, HMAX-HMIN, HMAX-PNAS), and three deep convnets (CaffeNet, VGG-19, and GoogLeNet; see Methods for details). Next, we computed the dissimilarity between each pair of the nine stimuli, resulting in a 9x9 dissimilarity matrix, and we applied multidimensional scaling on these matrices. Fig 2B shows the two-dimensional arrangement derived from multidimensional scaling for CaffeNet. This procedure revealed that the representations of these stimuli in the output of deep models tended to cluster based on their perceived rather than physical similarity, comparable to human judgments and neural representations in the higher visual areas in human cortex [ 5 ].
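The two steps described above, building a pairwise dissimilarity matrix from model outputs and projecting it to two dimensions, can be sketched as follows. This is a minimal illustration under stated assumptions: the distance measure (1 minus Pearson correlation) and the use of classical (Torgerson) scaling are assumptions rather than the choices documented in the Methods, and the random features merely stand in for the outputs of a model layer.

```python
import numpy as np

def dissimilarity_matrix(features):
    """Pairwise dissimilarity between stimuli (rows), here 1 - Pearson correlation."""
    features = np.asarray(features, dtype=float)
    return 1.0 - np.corrcoef(features)

def classical_mds(dissim, n_components=2):
    """Classical (Torgerson) multidimensional scaling of a dissimilarity matrix."""
    dissim = np.asarray(dissim, dtype=float)
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    gram = -0.5 * centering @ (dissim ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean dissimilarities; clip them.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical stand-in for model outputs: 9 stimuli x 50 features.
rng = np.random.default_rng(0)
toy_outputs = rng.normal(size=(9, 50))
toy_rdm = dissimilarity_matrix(toy_outputs)   # 9 x 9, zero diagonal
toy_coords = classical_mds(toy_rdm)           # 9 x 2 coordinates for plotting
```

Non-metric scaling, as offered by common statistics packages, is another option; classical scaling is used here only because it needs nothing beyond an eigendecomposition.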

(a) Stimulus set with physical and perceived shape dimensions manipulated orthogonally [ 5 ]. (b) Multidimensional scaling plots for the dissimilarity matrices of physical form, perceived shape, and the GoogLeNet outputs (at the top layer). The separation between shapes based on their perceived rather than physical similarity is evident in the GoogLeNet outputs (for visualization purposes, indicated by the lines separating the three clusters). (c) A correlation between model outputs and the physical form similarity of stimuli. Most shallow models are capturing physical similarity reasonably well, whereas HMAX and deep models are largely less representative of the physical similarity. (d) A correlation between model outputs and the perceived shape similarity of stimuli. Here, in contrast, deep models show a tendency of capturing perceived shape better than shallow and HMAX models. Gray band indicates estimated ceiling correlation based on human performance. (e) Correlation with physical (green) and perceived (orange) shape similarity across the layers of HMAX models and convnets. A preference for the perceived shape emerges in the upper layers. Vertical dotted lines indicate where fully-connected layers start. In all plots, error bars (or bands) indicate the 95% bootstrapped confidence intervals.

First, we used a stimulus set where the physical and perceptual dimensions were specifically designed to be orthogonal. Building on their earlier stimulus set [ 22 ], Op de Beeck and colleagues [ 5 ] created a stimulus set of nine novel shapes ( Fig 2A ) that can be characterized either in terms of their overall shape envelope / aspect ratio (vertical, square, horizontal), which we refer to as a physical form, or their perceptual features (spiky, smoothie, cubie), which we refer to as a perceived shape since humans base their shape judgments on these features, as explained below. They then computed the physical form dissimilarity matrix by taking the difference in the pixels, whereas the perceptual shape dissimilarity matrix was obtained in a behavioral experiment by asking participants to judge shape similarity ([ 5 ]; see Methods for details). They reported that participants typically grouped these stimuli based on the perceived shape and not shape envelope ( Fig 2B shows the multidimensional scaling plot of these dissimilarity matrices), and also provided neural evidence for such representations in the higher shape-selective visual area known as the lateral occipital complex (LOC). The most common hierarchical model of object recognition available at that time, the original HMAX [ 1 ], did not capture perceived shape with this stimulus set.

In Experiment 2, we wanted to understand whether convolutional neural networks develop representations that capture the shape dimensions that dominate perception, the so-called “perceived” shape dimensions, rather than the physical (pixel-based) form. In most available stimulus sets these two dimensions are naturally correlated because the physical form and the perceived shape are nearly or completely identical. In order to disentangle the relative contributions of each of these dimensions, we needed stimulus sets where great care was taken to design perceptual dimensions that would differ from physical dimensions.

Fig 1D also shows that the models sometimes outperformed humans, seemingly in those situations where a model could take advantage of a limited search space (e.g., it is much easier to say there is an iron when you do not know about hats). Overall, however, despite the moderate yet successful performance on silhouettes, it is obvious from Fig 1D that there are quite a few stimuli on which the models fail but which are recognized perfectly by human observers. Common mistakes could be divided into two groups: (i) similar shape (grasshopper instead of bee), and (ii) completely wrong answers where the reason behind the model’s response is not so obvious (whistle instead of lion). We think that the former scenario further supports the idea that models base their decisions primarily on shape and are not easily distracted by the lack of other features. In either case, the errors might be remedied by model exposure to cartoons and drawings. Moreover, we do not think that these discrepancies are primarily due to the lack of recurrent processes in these models since we tried to minimize influences of possible recurrent processes during human categorization by presenting stimuli for 100 ms to human observers. It is also possible that better naturalistic training sets in general are necessary where objects would be decoupled from background. For instance, lions always appear in savannahs, so models might be putting too much weight on savannah’s features for detecting a lion, which would be a poor strategy in the case of this stimulus set. Nonetheless, even in the absence of such training, convnets generalize well to such unrealistic stimuli, demonstrating that they genuinely learn some abstract shape representations.

To investigate the consistency between human and model responses in more detail, we computed a squared Euclidean distance between the average human accuracy and a model accuracy, and normalized it to the range [0, 1], such that a consistency of .5 means that a model responded correctly where a human responded correctly and made a mistake where a human made a mistake about half of the time ( Fig 1C ; see Methods for reasoning behind this choice of consistency). Overall, the consistency was substantial and nearly reached between-human consistency for color objects for our best model (GoogLeNet). To visualize the amount of consistency, we depicted the best model’s (GoogLeNet) performance on silhouettes against human performance ( Fig 1D ). The performances are well correlated as indicated by the slope of the logistic regression being reliably above zero ( Fig 1D ; z-test on GoogLeNet: z = 2.549, p = .011; CaffeNet: z = 2.393, p = .017; VGG-19: z = 2.323, p = .020). Furthermore, we computed consistency between models and found that for each variant of the stimulus set, the models appear to respond similarly and commit similar mistakes (the between-model consistency is about .8 for each pairwise comparison), indicating that the models learn similar features.
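One way to read the consistency measure is as one minus the mean squared difference between per-stimulus human accuracy and per-stimulus model correctness. The sketch below follows that reading, which is an assumption: the exact normalization is given in the Methods, and the function name `consistency` and the toy values are hypothetical.

```python
import numpy as np

def consistency(human_accuracy, model_correct):
    """Assumed consistency: 1 - mean squared difference, which lies in [0, 1].

    `human_accuracy` holds the average human accuracy per stimulus (0 to 1);
    `model_correct` holds 1 where the model named the stimulus correctly, else 0.
    """
    human = np.asarray(human_accuracy, dtype=float)
    model = np.asarray(model_correct, dtype=float)
    if human.shape != model.shape:
        raise ValueError("inputs must cover the same stimuli")
    return 1.0 - float(np.mean((human - model) ** 2))

# Toy case: the model agrees with the (unanimous) observers on 2 of 4 stimuli.
toy_human = [1.0, 1.0, 0.0, 0.0]
toy_model = [1, 0, 0, 1]
toy_consistency = consistency(toy_human, toy_model)
```

Under this reading, a model that agrees with unanimous observers on half of the stimuli scores .5, in line with the interpretation given in the text, and identical response patterns score 1.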

As textural cues were gradually removed, convnets still performed reasonably well. In particular, switching to grayscale decreased the performance by about 15%, whereas a further decrease by 30% occurred when inner gradients were removed altogether (silhouette condition). In other words, even when an object is defined solely by its shape, convnets maintain a robust and highly above-chance performance. Notably, a similar pattern of performance was observed when humans were asked to categorize these objects, suggesting that models are responding similarly to humans but are overall less accurate (irrespective of stimulus variant).

We then presented three convnets with the stimuli and asked them to produce a single best guess of what might be depicted in the image. A correct answer was counted if the label exactly matched the actual label. We found that all deep nets exhibited a robust categorization performance on the original color stimulus set, reaching about 80–90% accuracy ( Fig 1B ), with the best model (GoogLeNet) reaching human level of performance. Given that the models have not been trained at all on abstract line drawings, we found it an impressive demonstration of convnet feature generalization.

First, we asked 30 human observers (10 per variant of the stimulus set) to choose a name of each object, presented for 100 ms, from a list of 657 options, corresponding to the actual names of these objects and their synonyms as defined by observers in [ 20 ]. Consistent with previous studies [ 15 , 21 ], participants were nearly perfect in naming color objects, slightly worse for grayscale objects, and considerably worse for silhouettes ( Fig 1B , gray bands). Moreover, we found that participants were very consistent in their responses ( Fig 1C , gray bands).

(a) Examples of stimuli from the modified Snodgrass and Vanderwart stimulus set [ 21 ]. Stimulus images courtesy of Michael J. Tarr, Center for the Neural Basis of Cognition and Department of Psychology, Carnegie Mellon University, http://www.tarrlab.org/ . (b) Human (n = 10 for each variant of the stimulus set) and convnet (CaffeNet, VGG-19, GoogLeNet) accuracy in naming objects. For each stimulus set variant, mean human performance is indicated by a gray horizontal line, with the gray surrounding band depicting 95% bootstrapped confidence intervals. Error bars on model performance also depict 95% bootstrapped confidence intervals. (c) A consistency between human and convnet naming of objects. A consistency of .5 means that about half of responses (whether correct or not) were consistent between a model and an average of humans. Error bars indicate 95% bootstrapped confidence intervals. Gray bands indicate estimated ceiling performance based on between-human consistency. (d) Correlation with human performance on the silhouette stimulus set. The x-axis depicts an average human accuracy for a particular silhouette [ 15 ] and the y-axis depicts GoogLeNet performance on the same silhouette (either correct (value 1.0) or incorrect (value 0.0)). The model’s performance is jittered on the x- and y-axis for better visibility. Dark gray bubbles indicate the model’s average performance for 11 bins of human performance (i.e., 0–5%, 5–15%, 15–25%, etc.) with the size of each bubble reflecting the number of data points per bin. The orange line shows the logistic regression fit with a 95% bootstrapped confidence interval (light orange shaded). The slope of the logistic regression is reliably different from zero.

If convnets are indeed extracting perceptually relevant shape dimensions, they should be able to utilize shape for object recognition. This ability should extend to specific stimulus formats that highlight shape and do not include many other cues, such as silhouettes. The models have been trained for object recognition with natural images, so how would they perform when all non-shape cues are removed? In order to systematically evaluate how convnet recognition performance depends on the amount of available shape and non-shape (e.g., color or texture) information, we employed the colorized version of the Snodgrass and Vanderwart stimulus set of common everyday objects [20,21]. This stimulus set consists of 260 line drawings of common objects that are easily recognizable to human observers and has been used extensively in a large number of studies (Google Scholar citations: over 4000 to [20]; over 500 to [21]). In our experiments, we used a subset of this stimulus set (see Methods), consisting of 61 objects (Fig 1A). Three variants of the stimulus set were used: original color images, grayscale images, and silhouettes.

Discussion

Here we demonstrated that convolutional neural networks are not only superior in object recognition but also reflect perceptually relevant shape dimensions. In particular, convnets demonstrated robust similarities with human shape sensitivities in three demanding stimulus sets: (i) object recognition based solely on shape (Exp. 1), (ii) correlations with perceived rather than physical shape dimensions (Exp. 2), and (iii) sensitivity to non-accidental shape properties (Exp. 3). Notably, these shape-based representations emerged without convnets being trained for shape processing or recognition and without any explicit knowledge of our stimuli.

Furthermore, we demonstrated that convnets also develop abstract, or superordinate, category representations, but to a much smaller extent than shape (Exp. 4). In particular, we found that objects belonging to the same superordinate category (e.g., bananas and oranges belong to fruits) are represented more similarly than objects from different superordinate categories (e.g., bananas and violins).

These results add to the growing literature showing that convnets reflect human visual processing [10,11]. More specifically, the correspondence between object representations in primates and convnets has been investigated in the context of object recognition. Our study adds the new information that the powerful object recognition performance of convnets is related to a human-like sensitivity for shape and, to a lesser extent, to perceptually adequate semantic spaces of the objects they learn.

We emphasize that the reported sensitivity to shape reflects the representational spaces learned in convnets, as opposed to the information that is in principle available in them. Both approaches are useful and valid for evaluating how good a particular model is at processing visual information, but they provide different kinds of information. “Available information” means that a linear classifier, such as a Support Vector Machine, can learn (with supervision) to correctly classify stimuli into predefined categories. Thus, if above-chance classification can be performed from a model’s output, it means that the model is making the information necessary to perform the task explicit (object manifolds become untangled, as described in [38]). For instance, object categorization based on pixel values does not work, but after a series of transformations in convnets, categorical information becomes available. A better model for a particular task is then the one that has task-relevant information explicitly available.
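The notion of “available information” can be illustrated with a toy sketch in which a linear classifier fails on raw inputs but succeeds after a nonlinear transformation. The sketch makes several assumptions that are not part of the study: a simple perceptron stands in for the Support Vector Machine, an XOR-like labeling stands in for object categories, and a hand-picked product feature stands in for the transformations learned by convnet layers.

```python
def train_perceptron(X, y, epochs=100):
    """Train a linear classifier (perceptron); labels are 0 or 1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else 0
            if pred != target:
                # Move the decision boundary toward the misclassified point.
                sign = 1 if target == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
                errors += 1
        if errors == 0:  # every training point is classified correctly
            break
    return w, b

def accuracy(X, y, w, b):
    """Fraction of points that the linear readout classifies correctly."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Toy "raw" inputs with an XOR labeling: not linearly separable.
raw = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

# Hypothetical nonlinear transformation: append the product of the inputs.
transformed = [(x1, x2, x1 * x2) for x1, x2 in raw]

raw_acc = accuracy(raw, labels, *train_perceptron(raw, labels))
trans_acc = accuracy(transformed, labels, *train_perceptron(transformed, labels))
print("linear readout on raw inputs:", raw_acc)
print("linear readout on transformed inputs:", trans_acc)
```

In this toy case the category information is present in both input formats, but only the transformed format makes it accessible to a linear readout.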

While this approach provides valuable information about a model’s behavior, it does not directly provide evidence as to whether it is a good model of human visual perception. In other words, the fact that a model is categorizing objects very well does not imply that it represents these categories similarly to humans. An explicit evaluation of how well a model explains human data needs to be performed. One approach is to use model outputs to predict neural data, as used in [6,10]. However, in this case model outputs are combined in a supervised fashion, so this approach also resembles the “available information” approach in that we use external knowledge to investigate the amount of information available in the model outputs.

Another approach is to compare the representational spaces in models and humans, as proposed in [39] and used in this study. By computing the similarity of stimulus representations in models and humans, we can understand whether models implicitly develop representations that match human representations. Critically, we are not asking whether a model contains the necessary information that could be used for a particular task. Rather, we are asking whether a model already represents this information in a similar way, independent of a task. To illustrate this difference, consider, for instance, the stimuli in Exp. 2a. Even a very simple shallow model, such as GaborJet, can learn to classify these nine stimuli into three perceptual categories (spikies, smoothies, and cubies) correctly because these dimensions are very simple and are easily accessible. Nonetheless, we found that representations in such simple models do not match human perception. Even though one could decode all the necessary information in stimuli from Exp. 2a to perform classification into perceptual categories, this kind of information is not made dominant in shallow models.
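The comparison of representational spaces can be sketched as follows: compute pairwise dissimilarities between stimuli from a model's features, then correlate them with human dissimilarity judgments for the same stimulus pairs. This is a minimal sketch under stated assumptions, not the procedure used in the study. The dissimilarity metric (1 minus Pearson correlation), the use of Pearson correlation for the final comparison, and all feature vectors and human values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def dissimilarities(features):
    """Pairwise dissimilarities (1 - Pearson r) for all stimulus pairs."""
    return [1 - pearson(features[i], features[j])
            for i, j in combinations(range(len(features)), 2)]

# Hypothetical model features: one vector per stimulus (4 stimuli).
model_features = [
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.1, 1.0, 0.4],
    [0.0, 0.9, 0.5],
]

# Hypothetical human dissimilarity judgments, in the same pair order:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
human_dissim = [0.1, 0.9, 0.8, 0.9, 0.7, 0.2]

model_dissim = dissimilarities(model_features)
match = pearson(model_dissim, human_dissim)
print(f"model-human representational match: r = {match:.2f}")
```

No classifier is trained at any point, so a high value indicates that the model's own similarity structure resembles the human one, rather than that the relevant information could be extracted with supervision.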

Again, both approaches are valid and meaningful, but it is important to be explicit about the implications that they bring. We emphasize that, despite only being trained on a task that has a clear correct answer (i.e., object recognition), convnets also develop representations that reflect subjective judgments of human observers. In our opinion, this difference is critical to the success of deep nets. Observe that in typical convnet training, a particular image contains a cat, a dog, a car, or something else, and there is always a single correct answer for any given image; deep nets have been shown to learn this correspondence very well. However, this is very different from what humans typically do. The human visual system is particularly adept at making stable judgments about the environment when there is no clear or single answer to a given problem. For example, most individuals would report seeing a house rather than a stack of bricks even though both answers are technically correct. As a recent blunder of the Google Photos app, which labelled black people as gorillas [40], illustrates, a machine that is incapable of arriving at perceptually congruent decisions might be unacceptable in social settings where such common sense is expected. Although more data would clearly help, it is hard if not impossible to have training data for every possible situation. Uncertainty is bound to occur in natural scenarios (e.g., due to the lack of training samples or poor visibility), so a less error-prone strategy is to make sure that machines are learning perceptually relevant dimensions that will generalize properly to unfamiliar settings. Such machines can become better than humans but, critically, when they are wrong, their mistakes will be human-like. We therefore expect that further advances in computer vision and artificial intelligence at large are critically dependent not only on improvements in benchmark performance but also on matching human perceptual judgments.

Our data also provide insights into how convnets relate to information processing across the visual cortex. First, we observed that early layers of convnets tended to relate to the physical stimulus dimensions, consistent with the known properties and models of early visual cortex [41]. Second, the output layer of convnets related to the perceived shape properties. Earlier human neuroimaging studies and monkey electrophysiological recordings revealed that these perceptual shape representations are implemented in human occipitotemporal cortex and in monkey inferotemporal cortex. Specifically, fMRI experiments with the stimulus set used in our Exp. 2a have shown that the shape-selective lateral occipital area might be mostly involved in this shape processing [5]. Moreover, the geon stimulus set used in Exp. 3 also showed disproportionate sensitivity to non-accidental properties in monkey physiological recordings [24]. Finally, the stimulus set used in Exp. 4 showed a co-existence of shape and category information in human visual cortex and the dominance of categorical representations in the most anterior parts of it [37]. Taken together, these results suggest that the shape representations in output layers of convnets relate to shape processing in higher visual areas in primates and their behavioral responses.

However, note that it is not necessarily the output layer that provides the best fit with shape representations in the primate brain. Given that the output layer is directly optimized to produce a correct category label rather than to represent shapes, it is possible that earlier layers are in fact better at capturing shape dimensions. Our results appear to be broadly consistent with this notion. However, these differences between layers seem to be small and the best intermediate layer is not consistent across experiments. Moreover, shape itself is a hierarchical concept that can be understood at multiple scales of analysis (e.g., local, global) and at multiple layers of abstraction [42], so it may not be possible to pinpoint an exact locus of shape representations either in convnets or in the human visual system. Rather, different dimensions of shape features might be distributed across multiple areas.

Causes for shape sensitivity in convnets

Our results suggest that a human-like sensitivity to shape features is quite a common property shared by different convnets, at least of the type that we tested. However, the three convnets were also very similar, since all of them were trained on the same dataset and used the same training procedure. Which convnet properties are important in developing such shape sensitivity? One critical piece of information is offered by the comparison to HMAX models. Despite a similar architecture, in most experiments we observed that overall HMAX models failed to capture shape sensitivity to the same extent as convnets. The most obvious difference lies in the depth of the architecture. There are at most four layers in HMAX models but at least eight layers in the simplest of our convnets, CaffeNet. However, HMAX’99 (which has two layers) did not seem to perform consistently worse than HMAX-PNAS (which has four layers). Another important difference is the lack of supervision during training in HMAX. As has been demonstrated before with object categorization [6], unsupervised training does not seem to be sufficiently robust, at least the way it is implemented in HMAX. Another hint that supervision might be the critical component in learning universal shape dictionaries comes from comparing our results to the outputs obtained via the Hierarchical Modular Optimization (HMO) that was recently reported to correspond well to primate neural responses [10]. For Exps. 2a and 4, where we could obtain the outputs of the HMO layer that corresponds best to monkey neural data, we found a largely similar pattern of results, despite differences in depth, training procedure, and training dataset. The only clear similarity between the tested convnets and HMO was supervised learning. Finally, part of convnet power might also be attributed to the fully-connected layers. In both CaffeNet and VGG-19, the critical preference for perceived shape emerges at the fully-connected layers. In GoogLeNet, the preference for perceptual dimensions is typically the strongest at the last layer, which is also fully-connected, though earlier layers that are not fully-connected also exhibit a robust preference for perceived shape. Other parameters, such as the naturalness of the training dataset or the task that a convnet is optimized for, might also contribute to the representations that convnets develop. In short, the tests and the models that we have included in the present paper provide a general answer to our hypotheses about shape representations in convnets, but there are many specific questions about the role of individual variables that remain to be answered.

Relation to theories of shape processing

In the literature, at least two theoretical approaches to shape processing have played an important role: image-based theories [19], which capitalize on processing image features without an explicit encoding of the relation between them, and structure-based theories [18], which emphasize the role of explicit structural relations in shape processing. Our results do not necessarily provide support for particular theories of shape processing. Of course, in their spirit convnets are closer to image-based theories since there is no explicit shape representation computed. On the other hand, in Exp. 3 we also found that convnets were sensitive to non-accidental properties even without ever being trained to use these properties. While in principle HMAX architectures can also develop sensitivity to non-accidental properties when a temporal association rule is introduced [43], the fact that such sensitivity automatically emerges in convnets when training for object categorization provides indirect support for the idea that non-accidental properties are diagnostic in defining object categories, as proposed by the RBC theory [16]. Of course, a mere sensitivity to non-accidental properties does not imply that convnets must actually utilize the object recognition scheme proposed by the RBC theory [16]. For instance, according to this theory, objects are subdivided into sets of shape primitives, known as geons, and recognized based on which geons compose that particular object, referred to as a “structural description” of the object. Finding an increased sensitivity for non-accidental properties does not necessarily imply that all these other assertions of the RBC theory are correct, and it does not by itself settle the controversy between image-based and structure-based models of object recognition.