Abstract Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects, is only transferable to new objects that share properties with the old, then the recognition system’s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions; these predictions agree with the available data. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.

Author Summary Domain-specific regions, like the fusiform face area, are a prominent feature of ventral visual cortex organization. Despite decades of interest from a large number of investigators employing diverse methods, there has been surprisingly little theoretical work on “why” the ventral stream may adopt this modular organization. In this study we propose a computational account of the role played by domain-specific regions in ventral stream function. It follows from a new theoretical analysis of the recognition problem which highlights the importance of building representations that are robust to class-specific transformations. These results provide a unifying account linking neuroimaging and neuropsychology-based ideas of domain-specific regions to the psychophysics and electrophysiology-oriented literature on view-based object recognition and invariance.

Citation: Leibo JZ, Liao Q, Anselmi F, Poggio T (2015) The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex. PLoS Comput Biol 11(10): e1004390. https://doi.org/10.1371/journal.pcbi.1004390 Editor: Nikolaus Kriegeskorte, Medical Research Council, UNITED KINGDOM Received: June 17, 2014; Accepted: May 11, 2015; Published: October 23, 2015 Copyright: © 2015 Leibo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Image data was generated automatically using the methods described in the paper. Additionally, the images used for experiments 3 and 4 are available for download from The Center For Brains, Minds, and Machines: cbmm.mit.edu. Funding: This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216. URL: http://cbmm.mit.edu/ (TP). This research was also sponsored by grants from the National Science Foundation (NSF-0640097, NSF-0827427) URL: http://www.nsf.gov/ (TP), and the Air Force Office of Scientific Research AFOSR-THRL (FA8650-05-C-7262) URL: www.afosr.af.mil (TP). Additional support was provided by the Eugene McDermott Foundation (TP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction The discovery of category-selective patches in the ventral stream—e.g., the fusiform face area (FFA)—is one of the most robust experimental findings in visual neuroscience [1–6]. It has also generated significant controversy. From a computational perspective, much of the debate hinges on the question of whether the algorithm implemented by the ventral stream requires subsystems or modules dedicated to the processing of a single class of stimuli [7, 8]. The alternative account holds that visual representations are distributed over many regions [9, 10], and the clustering of category selectivity is not, in itself, functional. Instead, it arises from the interaction of biological constraints like anatomically fixed inter-region connectivity and competitive plasticity mechanisms [11, 12] or the center-periphery organization of visual cortex [13–17]. The interaction of three factors is thought to give rise to properties of the ventral visual pathway: (1) The computational task; (2) constraints of anatomy and physiology; and (3) the statistics of the visual environment [18–22]. Differing presuppositions concerning their relative weighting lead to quite different models of the origin of category-selective regions. If the main driver is thought to be the visual environment (factor 3), then perceptual expertise-based accounts of category selective regions are attractive [23–25]. Alternatively, mechanistic models show how constraints of the neural “hardware” (factor 2) could explain category selectivity [12, 26, 27]. Contrasting with both of these, the perspective of the present paper is one in which computational factors are the main reason for the clustering of category-selective neurons. The lion’s share of computational modeling in this area has been based on factors 2 and 3. 
These models seek to explain category-selective regions as the inevitable outcome of the interaction between functional processes (typically competitive plasticity), wiring constraints (e.g., local connectivity), and assumptions about the system’s inputs [12, 26–28]. Mechanistic models of category selectivity may even be able to account for the neuropsychology [29, 30] and behavioral [31, 32] results long believed to support modularity. Another line of evidence seems to explain away the category-selective regions. The large-scale topography of object representation is reproducible across subjects [33]. For instance, the scene-selective parahippocampal place area (PPA) is consistently medial to the FFA. To explain this remarkable reproducibility, it has been proposed that the center-periphery organization of early visual areas extends to the later object-selective regions of the ventral stream [13–15, 17]. In particular, the FFA and other face-selective regions are associated with an extension of the central representation, and the PPA with the peripheral representation. Consistent with these findings, it has also been argued that real-world size is the organizing principle [16]. Larger objects, e.g., furniture, evoke more medial activation while smaller objects, e.g., a coffee mug, elicit more lateral activity. Could category-selective regions be explained as a consequence of the topography of visual cortex? Both the eccentricity [15] and real-world size [16] hypotheses correctly predict that houses and faces will be represented at opposite ends of the medial-lateral organizing axis. Since eccentricity of presentation is linked with acuity demands, the differing eccentricity profiles across object categories may be able to explain the clustering. However, such accounts offer no way of interpreting macaque results indicating multi-stage processing hierarchies [17, 34].
If clustering was a secondary effect driven by acuity demands, then it would be difficult to explain why, for instance, the macaque face-processing system consists of a hierarchy of patches that are preferentially connected with one another [35]. In macaques, there are 6 discrete face-selective regions in the ventral visual pathway, one posterior lateral face patch (PL), two middle face patches (lateral- ML and fundus- MF), and three anterior face patches, the anterior fundus (AF), anterior lateral (AL), and anterior medial (AM) patches [2, 36]. At least some of these patches are organized into a feedforward hierarchy. Visual stimulation evokes a change in the local field potential ∼ 20 ms earlier in ML/MF than in patch AM [34]. Consistent with a hierarchical organization involving information passing from ML/MF to AM via AL, electrical stimulation of ML elicits a response in AL and stimulation in AL elicits a response in AM [35]. In addition, spatial position invariance increases from ML/MF to AL, and increases further to AM [34] as expected for a feedforward processing hierarchy. The firing rates of neurons in ML/MF are most strongly modulated by face viewpoint. Further along the hierarchy, in patch AM, cells are highly selective for individual faces and collectively provide a representation of face identity that tolerates substantial changes in viewpoint [34]. Freiwald and Tsao argued that the network of face patches is functional. Response patterns of face patch neurons are consequences of the role they play in the algorithm implemented by the ventral stream. Their results suggest that the face network computes a representation of faces that is—as much as possible—invariant to 3D rotation-in-depth (viewpoint), and that this representation may underlie face identification behavior [34]. We carry out our investigation within the framework provided by a recent theory of invariant object recognition in hierarchical feedforward architectures [37]. 
It is broadly in accord with other recent perspectives on the ventral stream and the problem of object recognition [22, 38]. The full theory has implications for many outstanding questions that are not directly related to the question of domain specificity we consider here. In other work, it has been shown to yield predictions concerning the cortical magnification factor and visual crowding [39]. It has also been used to motivate novel algorithms in computer vision and speech recognition that perform competitively with the state-of-the-art on difficult benchmark tasks [40–44]. The same theory, with the additional assumption of a particular Hebbian learning rule, can be used to derive qualitative receptive field properties. The predictions include Gabor-like tuning in early stages of the visual hierarchy [45, 46] and mirror-symmetric orientation tuning curves in the penultimate stage of a face-specific hierarchy computing a view-tolerant representation (as in [34]) [46]. A full account of the new theory is outside the scope of the present work; we refer the interested reader to the references, especially [37], for details. Note that the theory only applies to the first feedforward pass of information, from the onset of the image to the arrival of its representation in IT cortex approximately 100 ms later. For a recent review of evidence that the feedforward pass computes invariant representations, see [22]. For an alternative perspective, see [11]. Note, however, that contrary to a claim in that review, position dependence is fully compatible with the class of models we consider here (including HMAX); [39, 47] explicitly model eccentricity dependence in this framework. Our account of domain specificity is motivated by the following questions: How can past visual experience be leveraged to improve future recognition of novel individuals? Is any past experience useful for improving at-a-glance recognition of any new object?
Or perhaps past experience only transfers to similar objects? Could it even be possible that past experience with certain objects actually impedes the recognition of others? The invariance hypothesis holds that the computational goal of the ventral stream is to compute a representation that is unique to each object and invariant to identity-preserving transformations. If we accept this premise, the key question becomes: Can transformations learned on one set of objects be reliably transferred to another set of objects? For many visual tasks, the variability due to transformations in a single individual’s appearance is considerably larger than the variability between individuals. These tasks have been called “subordinate level identification” tasks, to distinguish them from between-category (basic-level) tasks. Without prior knowledge of transformations, the subordinate-level task of recognizing a novel individual from a single example image is hopelessly under-constrained. The main thrust of our argument—to be developed below—is this: The ventral stream computes object representations that are invariant to transformations. Some transformations are generic; the ventral stream could learn to discount these from experience with any objects. Translation and scaling are both generic (all 2D affine transformations are). However, it is also necessary to discount many transformations that do not have this property. Many common transformations are not generic; 3D-rotation-in-depth is the primary example we consider here (see S1 Text for more examples). It is not possible to achieve a perfectly view-invariant representation from one 2D example. Out-of-plane rotation depends on information that is not available in a single image, e.g. the object’s 3D structure. Despite this, approximate invariance can still be achieved using prior knowledge of how similar objects transform. 
In this way, approximate invariance learned on some members of a visual category can facilitate the identification of unfamiliar category members. But, this transferability only goes so far. Under this account, the key factor determining which objects could be productively grouped together in a domain-specific subsystem is their transformation compatibility. We propose an operational definition that can be computed from videos of transforming objects. Then we use it to explore the question of why certain object classes get dedicated brain regions, e.g., faces and bodies, while others (apparently) do not. We used 3D graphics to generate a library of videos of objects from various categories undergoing rotations in depth. The model of visual development (or evolution) we consider is highly stylized and non-mechanistic. It is just a clustering algorithm based on our operational definition of transformation compatibility. Despite its simplicity, using the library of depth-rotation videos as inputs, the model predicts large clusters consisting entirely of faces and bodies. The other objects we tested—vehicles, chairs, and animals—ended up in a large number of small clusters, each consisting of just a few objects. This suggests a novel interpretation of the lateral occipital complex (LOC). Rather than being a “generalist” subsystem, responsible for recognizing objects from diverse categories, our results are consistent with LOC actually being a heterogeneous region that consists of a large number of domain-specific regions too small to be detected with fMRI. These considerations lead to a view of the ventral visual pathway in which category-selective regions implement a modularity of content rather than process [48, 49]. Our argument is consistent with process-based accounts, but does not require us to claim that faces are automatically processed in ways that are inapplicable to objects (e.g., gaze detection or gender detection) as claimed by [11]. 
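The clustering model described above can be made concrete with a short sketch. The paper's operational definition of transformation compatibility appears in the Methods; here the pairwise index and the greedy online clustering are our own illustrative stand-ins (all function names and the specific index are assumptions for exposition, not the paper's exact definitions): compatibility asks how stable one object's signature stays, across its own transformation, when the other object's frames serve as templates.

```python
import numpy as np

def compatibility(frames_a, frames_b):
    """Hypothetical pairwise index: use object A's video frames as templates
    and measure how stable object B's signatures are across B's own
    transformation. High values suggest A's transformation 'transfers' to B."""
    templates = frames_a / np.linalg.norm(frames_a, axis=1, keepdims=True)
    sigs = np.array([templates @ (f / np.linalg.norm(f)) for f in frames_b])
    # Mean pairwise correlation of B's signatures across its frames.
    c = np.corrcoef(sigs)
    return np.mean(c[np.triu_indices_from(c, k=1)])

def online_cluster(objects, threshold):
    """Greedy clustering: each new object joins the most compatible existing
    cluster, or founds a new cluster if none is compatible enough."""
    clusters = []
    for frames in objects:
        scores = [np.mean([compatibility(m, frames) for m in cl])
                  for cl in clusters]
        if scores and max(scores) > threshold:
            clusters[int(np.argmax(scores))].append(frames)
        else:
            clusters.append([frames])
    return clusters
```

Under this toy scheme, similarly-transforming objects (e.g., faces) would accumulate in large clusters while idiosyncratically-transforming objects found many small ones, mirroring the simulation outcome described in the text.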
Nor does it commit us to claiming there is a region that is specialized for the process of subordinate-level identification—an underlying assumption of some expertise-based models [50]. Rather, we show here that the invariance hypothesis implies an algorithmic role that could be fulfilled by the mere clustering of selectivity. Consistent with the idea of a canonical cortical microcircuit [51, 52], the computations performed in each subsystem may be quite similar to the computations performed in the others. To a first approximation, the only difference between ventral stream modules could be the object category for which they are responsible.

Discussion We explored implications of the hypothesis that achieving transformation invariance is the main goal of the ventral stream. Invariance from a single example could be achieved for group transformations in a generic way. However, for non-group transformations, only approximate invariance is possible; and even for that, it is necessary to have experience with objects that transform similarly. This implies that the optimal organization of the ventral stream is one that facilitates the transfer of invariance within—but not between—object categories. Assuming that a subsystem must reside in a localized cortical neighborhood, this could explain the function of domain-specific regions in the ventral stream’s recognition algorithm: to enable subordinate level identification of novel objects from a single example. Following on from our analysis implicating transformation compatibility as the key factor determining when invariance can be productively transferred between objects, we simulated the development of visual cortex using a clustering algorithm based on transformation compatibility. This allowed us to address the question of why faces, bodies, and words get their own dedicated regions but other object categories (apparently) do not [8]. This question has not previously been the focus of theoretical study. Despite the simplicity of our model, we showed that it robustly yields face and body clusters across a range of object frequency assumptions. We also used the model to confirm two theoretical predictions: (1) that invariance to non-group transformations is only needed for subordinate level identification; and (2) that clustering by transformation compatibility yields subsystems that improve performance beyond that of the system trained using data from all categories. These results motivate the next phase of this work: building biologically-plausible models that learn from natural video.
Such models automatically incorporate a better estimate of the natural object distribution. Variants of these models may be able to quantitatively reproduce human level performance on simultaneous multi-category subordinate level (i.e., fine-grained) visual recognition tasks and potentially find application in computer vision as well as neuroscience. In [42], we report encouraging preliminary results along these lines. Why are there domain-specific regions in later stages of the ventral stream hierarchy but not in early visual areas [2, 3]? The templates used to implement invariance to group transformations need not be changed for different object classes while the templates implementing non-group invariance are class-specific. Thus it is efficient to put the generic circuitry of the first regime in the hierarchy’s early stages, postponing the need to branch to different domain-specific regions tuned to specific object classes until later, i.e., more anterior, stages. In the macaque face-processing system, category selectivity develops in a series of steps; posterior face regions are less face selective than anterior ones [34, 83]. Additionally, there is a progression from a view-specific face representation in earlier regions to a view-tolerant representation in the most anterior region [34]. Both findings could be accounted for in a face-specific hierarchical model that increases in template size and pooling region size with each subsequent layer (e.g., [41, 42, 84, 85]). The use of large face-specific templates may be an effective way to gate the entrance to the face-specific subsystem so as to keep out spurious activations from non-faces. The algorithmic effect of large face-specific templates is to confer tolerance to clutter [41, 42]. These results are particularly interesting in light of models showing that large face templates are sufficient to explain holistic effects observed in psychophysics experiments [73, 86]. 
As stated in the introduction, properties of the ventral stream are thought to be determined by three factors: (1) computational and algorithmic constraints; (2) biological implementation constraints; and (3) the contingencies of the visual environment [18–22]. Up to now, we have stressed the contribution of factor (1) over the others. In particular, we have almost entirely ignored factor (2). We now discuss the role played by anatomical considerations in this account of ventral stream function. That the circuitry comprising a subsystem must be localized on cortex is a key assumption of this work. In principle, any HW-module could be anywhere, as long as the wiring all went to the right place. However, there are several reasons to think that the actual constraints under which the brain operates and its available information processing mechanisms favor a situation in which, at each level of the hierarchy, all the specialized circuitry for one domain is in a localized region of cortex, separate from the circuitry for other domains. Wiring length considerations are likely to play a role here [87–90]. Another possibility is that localization on cortex enables the use of neuromodulatory mechanisms that act on local neighborhoods of cortex to affect all the circuitry for a particular domain at once [91]. There are other domain-specific regions in the ventral stream besides faces and bodies; we consider several of them in light of our results here. It is possible that even more regions for less-common (or less transformation-compatible) object classes would appear with higher resolution scans. One example may be the fruit area, discovered in macaques with high-field fMRI [3]. Lateral Occipital Complex (LOC) [82]

These results imply that LOC is not really a dedicated region for general object processing. Rather, it is a heterogeneous area of cortex containing many domain-specific regions too small to be detected with the resolution of fMRI. It may also include clusters that are not dominated by one object category as we sometimes observed appearing in simulations (see Fig 4 and S1 Text). The Visual Word Form Area (VWFA) [4]

In addition to the generic transformations that apply to all objects, printed words undergo several non-generic transformations that never occur with other objects. We can read despite the large image changes occurring when a page is viewed from a different angle. Additionally, many properties of printed letters change with typeface, but our ability to read—even in novel fonts—is preserved. Reading hand-written text poses an even more severe version of the same computational problem. Thus, VWFA is well-accounted for by the invariance hypothesis. Words are frequently-viewed stimuli which undergo class-specific transformations. This account appears to be in accord with others in the literature [92, 93]. Parahippocampal Place Area (PPA) [94]

A recent study by Kornblith et al. describes properties of neurons in two macaque scene-selective regions termed the lateral and medial place patches (LPP and MPP) [95]. While homology has not been definitively established, it seems likely that these regions are homologous to the human PPA [96]. Moreover, this scene-processing network may be analogous to the face-processing hierarchy of [34]. In particular, MPP showed weaker effects of viewpoint, depth, and objects than LPP. This is suggestive of a scene-processing hierarchy that computes a representation of scene-identity that is (approximately) invariant to those factors. Any of them might be transformations for which this region is compatible in the sense of our theory. One possibility, which we considered in preliminary work, is that invariant perception of scene identity despite changes in monocular depth signals driven by traversing a scene (e.g., linear perspective) could be discounted in the same manner as face viewpoint. It is possible that putative scene-selective regions compute depth-tolerant representations. We confirmed this for the special case of long hallways differing in the placement of objects along the walls: a view-based model that pools over images of template hallways can be used to recognize novel hallways [97]. Furthermore, fast same-different judgements of scene identity tolerate substantial changes in perspective depth [97]. Of course, this raises the question: of what use would a depth-invariant scene representation be? One possibility could be to provide a landmark representation suitable for anchoring a polar coordinate system [98]. Intriguingly, [95] found that cells in the macaque scene-selective network were particularly sensitive to the presence of long straight lines—as might be expected in an intermediate stage on the way to computing perspective invariance.
Is this proposal at odds with the literature emphasizing the view-dependence of human vision when tested on subordinate level tasks with unfamiliar examples—e.g. [72, 79, 99]? We believe it is consistent with most of this literature. We merely emphasize the substantial view-tolerance achieved for certain object classes, while they emphasize the lack of complete invariance. Their emphasis was appropriate in the context of earlier debates about view-invariance [100–103], and before differences between the view-tolerance achieved on basic-level and subordinate-level tasks were fully appreciated [104–106]. The view-dependence observed in experiments with novel faces [72, 107] is consistent with the predictions of our theory. The 3D structure of faces does not vary wildly within the class, but there is still some significant variation. It is this variability in 3D structure within the class that is the source of the imperfect performance in our simulations. Many psychophysical experiments on viewpoint invariance were performed with synthetic “wire” objects defined entirely by their 3D structure e.g., [79–81]. We found that they were by far the least transformation-compatible (lowest ) objects we tested (Table 1). Thus our proposal predicts particularly weak performance on viewpoint-tolerance tasks with novel examples of these stimuli, and that is precisely what is observed [80].


Table 1. Table of transformation compatibilities. COIL-100 is a library of images of 100 common household items photographed from a range of orientations using a turntable [114]. The wire objects resemble those used in psychophysics and physiology experiments: [79–81]. They were generated according to the same protocol as in those studies. https://doi.org/10.1371/journal.pcbi.1004390.t001 Tarr and Gauthier (1998) found that learned viewpoint-dependent mechanisms could generalize across members of a homogeneous object class [106]. They tested both homogeneous block-like objects, and several other classes of more complex novel shapes. They concluded that this kind of generalization was restricted to visually similar objects. These results seem to be consistent with our proposal. Additionally, our hypothesis predicts better within-class generalization for object classes with higher . That is, transformation compatibility, not visual similarity per se, may be the factor influencing the extent of within-class generalization of learned view-tolerance. Though, in practice, the two are usually correlated and hard to disentangle. In a related experiment, Sinha and Poggio (1996) showed that the perception of an ambiguous transformation’s rigidity could be biased by experience [108]. View-based accounts of their results predict that the effect would generalize to novel objects of the same class. Since this effect can be obtained with particularly simple stimuli, it might be possible to design them so as to separate specific notions of visual similarity and transformation compatibility. In accord with our prediction that group transformations ought to be discounted earlier in the recognition process, [108] found that their effect was spared by presenting the training and test objects at different scales. Many authors have argued that seemingly domain-specific regions are actually explained by perceptual expertise [24–27, 109].
Our account is compatible with some aspects of this idea. However, it is largely agnostic about whether the sorting of object classes into subsystems takes place over the course of evolution or during an organism’s lifetime. A combination of both is also possible—e.g. as in [110]. That said, our proposal does intersect this debate in several ways. Our theory agrees with most expertise-based accounts that subordinate-level identification is the relevant task. The expertise argument has always relied quite heavily on the idea that discriminating individuals from similar distractors is somehow difficult. Our account allows greater precision: the precise component of difficulty that matters is invariance to non-group transformations. Our theory predicts a critical factor determining which objects could be productively grouped into a module, and this factor is clearly formulated and operationalized: the transformation compatibility. Under our account, domain-specific regions arise because they are needed in order to facilitate the generalization of learned transformation invariance to novel category-members. Most studies of clustering and perceptual expertise do not use this task. However, Srihasam et al. tested a version of the perceptual expertise hypothesis that could be understood in this way [111]. They trained macaques to associate reward amounts with letters and numerals (26 symbols). In each trial, a pair of symbols was displayed and the task was to pick the symbol associated with greater reward. Importantly, the 3-year training process occurred in the animal’s home cage and eye tracking was not used. Thus, the distance and angle with which the monkey subjects viewed the stimuli was not tightly controlled during training. The symbols would have projected onto their retina in many different ways. These are exactly the same transformations that we proposed are the reason for the VWFA. In accord with our prediction, Srihasam et al.
found that this training experience caused the formation of category-selective regions in the temporal lobe. Furthermore, the same regions were activated selectively irrespective of stimulus size, position, and font. Interestingly, this result only held for juvenile macaques, implying there may be a critical period for cluster formation [111]. Our main prediction is the link between transformation compatibility and domain-specific clustering. Thus one way to test whether this account of expertise-related clustering is correct could be to train monkeys to recognize individual objects of unfamiliar classes invariantly to 3D rotation in depth. The task should involve generalization from a single example view of a novel exemplar. The training procedure should involve exposure to videos of a large number of objects from each category undergoing rotations in depth. Several categories with different transformation compatibilities should be used. The prediction is that after training there will be greater clustering of selectivity for the classes with greater average transformation compatibility (higher ). Furthermore, if one could record from neurons in the category-selective clusters, the theory would predict some similar properties to the macaque face-processing hierarchy: several interconnected regions progressing from view-specificity in the earlier regions to view-tolerance in the later regions. However, unless the novel object classes actually transform like faces, the clusters produced by expertise should be parallel to the face clusters but separate from them. How should these results be understood in light of recent reports of very strong performance of “deep learning” computer vision systems employing apparently generic circuitry for object recognition tasks e.g., [62, 112]? 
We think that exhaustive greedy optimization of parameters (weights) over a large labeled data set may have found a network similar to the architecture we describe since all the basic structural elements (neurons with nonlinearities, pooling, dot products, layers) required by our theory are identical to the elements in deep learning networks. If this were true, our theory would also explain what these networks do and why they work.

Methods

Training HW-architectures

An HW-architecture refers to a feedforward hierarchical network of HW-layers. An HW-layer consists of K HW-modules arranged in parallel to one another (see Fig 1B). For an input image I, the output of an HW-layer is a vector μ(I) with K elements. If I depicts a particular object, then μ(I) is said to be the signature of that object. The parameters (weights) of the k-th HW-module are uniquely determined by its template book 𝒯_k (Eq 5). For all simulations in this paper, the output of the k-th HW-module is given by Eq (6).

We used a nonparametric method of training HW-modules that models the outcome of temporal continuity-based unsupervised learning [42, 67]. In each experiment, the training data consisted of K videos represented as sequences of frames. Each video depicted the transformation of just one object. Let G_0 be a family of transformations, e.g., a subset of the group of translations or rotations. The set of frames in the k-th video was O_{t_k} = {g t_k ∣ g ∈ G_0}. In each simulation, an HW-layer consisting of K HW-modules was constructed. The template book 𝒯_k of the k-th HW-module was chosen to be the set of frames of the k-th video, 𝒯_k = O_{t_k} (Eq 7).

Note that HW-architectures are usually trained in a layer-wise manner (e.g., [57]). That is, layer ℓ templates are encoded as "neural images" using the outputs of layer ℓ − 1. However, in this paper, all the simulations use a single HW-layer. One-layer HW-architectures are a particularly stylized abstraction of the ventral stream hierarchy. With our training procedure, they have no free parameters at all. This makes them ideal for simulations in which the aim is not to quantitatively reproduce experimental phenomena, but rather to study general principles of cortical computation that constrain all levels of the hierarchy alike.

Experiments 1 and 2: The test of transformation-tolerance from a single example view

Procedure. The training set consisted of transformation sequences of K template objects.
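The training scheme just described can be sketched in a few lines of NumPy. The template books follow Eq (7), one training video per HW-module; the max pooling over normalized dot products used for the module output is an illustrative stand-in, since the specific form of Eq (6) is not reproduced here.

```python
import numpy as np

def build_template_books(videos):
    # Eq (7): the k-th template book is the set of frames of the k-th
    # training video (each video shows one object transforming).
    return [np.stack([f.ravel() / np.linalg.norm(f) for f in frames])
            for frames in videos]

def signature(image, template_books):
    # One HW-layer: the k-th signature element is the k-th HW-module's
    # pooled response over its template book. Max pooling of normalized
    # dot products is an illustrative stand-in for Eq (6).
    x = image.ravel()
    x = x / np.linalg.norm(x)
    return np.array([float(np.max(T @ x)) for T in template_books])
```

With K training videos this yields a K-dimensional signature μ(I), with no free parameters beyond the templates themselves.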
At test time, in each trial the reference image was presented at the 0 transformation parameter (the center of the image for experiment 1, or 0° for experiment 2). In each trial, a number of query images were presented, 50% of which were targets. The signature of the reference image was computed, and its Pearson correlation with the signature of each query image was compared to an acceptance threshold; varying the threshold traced out an ROC curve. The statistic reported on the ordinate of Figs 2 and 3 is the area under the ROC curve, averaged over all choices of reference image and all resampled training and testing sets.

1. Translation experiments (Fig 2)

Stimuli. There were 100 faces and 100 random noise patterns in the dataset. For each repetition of the experiment, two disjoint sets of 30 objects were selected at random from the 100. The first was used as the template set and the second was used as the test set. Each experiment was repeated 5 times with different random choices of template and testing sets. The error bars on the ordinate of Fig 2 are ±1 standard deviation computed over the 5 repetitions.

2. Rotation in depth experiments (Fig 3)

Stimuli. All objects were rendered with perspective projection. For the rotation in depth experiments, the complete set of objects consisted of 40 untextured faces, 20 class B objects, and 20 class C objects. For each of the 20 repetitions of the experiment, 10 template objects and 10 test objects were randomly selected. The template and test sets were chosen independently and were always disjoint. Each face/object was rendered (using Blender [75]) at each orientation in 5° increments from −95° to 95°. The untextured face models were generated using FaceGen [74].

Experiment 3: Transformation compatibility, multidimensional scaling, and online clustering experiments (Figs 4 and 5)

Stimuli. Faces, bodies, vehicles, chairs, and animals. Blender was used to render images of 3D models from two sources: 1.
the Digimation archive (platinum edition), and 2. FaceGen. Each object was rendered at a range of viewpoints: −90° to 90° in increments of 5°. This procedure produced a transformation sequence for each object, i.e., a video. The full Digimation set consisted of ∼10,000 objects; however, our simulations only used textured objects from the following categories: bodies, vehicles, chairs, and animals. For each experiment, the number of objects used from each class is listed in S2 Table. A set of textured face models generated with FaceGen was added to the Digimation set. See S7 Fig for examples. In total, 23,791 images were rendered for this experiment. The complete dataset is available from cbmm.mit.edu.

Procedure. Let A_i be the i-th frame of the video of object A transforming and B_i be the i-th frame of the video of object B transforming. Define a compatibility function ψ(A, B) to quantify how similarly objects A and B transform. First, approximate the Jacobian of a transformation sequence by the "video" of difference images: J_A(i) = ∣A_i − A_{i+1}∣ for all i. Then define the pairwise transformation compatibility ψ(A, B) as in Eq (8).

Transformation compatibility can be visualized by multidimensional scaling (MDS) [113]. The input to the MDS algorithm is the pairwise similarity matrix containing the transformation compatibilities between all pairs of objects.

For the ψ-based online clustering experiments, consider a model consisting of a number of subsystems (HW-architectures). The clustering procedure was as follows. At each step a new object is learned; its newly-created HW-module is added to the subsystem with which its transformations are most compatible. If the new object's average compatibility with every existing subsystem is below a threshold, a new subsystem is created for the newly learned object. This procedure is repeated for each object.
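The compatibility computation can be made concrete with a short sketch. Since the exact form of Eq (8) is not reproduced here, we take ψ(A, B) to be the mean normalized dot product between corresponding Jacobian (difference-image) frames; this particular choice is an assumption for illustration.

```python
import numpy as np

def jacobian(frames):
    # Approximate the Jacobian of a transformation sequence by the
    # "video" of absolute difference images: J(i) = |frame_i - frame_{i+1}|.
    return [np.abs(frames[i] - frames[i + 1]) for i in range(len(frames) - 1)]

def compatibility(frames_a, frames_b):
    # Illustrative stand-in for Eq (8): the mean normalized dot product
    # between corresponding Jacobian frames of the two videos.
    sims = []
    for da, db in zip(jacobian(frames_a), jacobian(frames_b)):
        da, db = da.ravel(), db.ravel()
        denom = np.linalg.norm(da) * np.linalg.norm(db)
        sims.append(da @ db / denom if denom > 0 else 0.0)
    return float(np.mean(sims))
```

Under this choice ψ is symmetric and ψ(A, A) = 1 whenever the difference frames are nonzero; the matrix of pairwise values is what feeds the MDS visualization.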
The objects for this experiment were sampled from three different distributions: a "realistic" distribution, a uniform distribution, and a distribution biased against faces; see S2 Table for the numbers of objects used from each class. The algorithm's pseudocode is in S1 Text (Section 5.3). Fig 4 shows examples of clusters obtained by this method.

Experiment 4: Evaluating the clustered models on subordinate-level and basic-level tasks (Fig 5)

Stimuli. The stimuli were the same as in experiment 3.

Procedure. To confirm that ψ-based clustering is useful for object recognition with these images, we compared the recognition performance of the subsystems to that of the complete system trained using all available templates irrespective of their subsystem assignment. Two recognition tasks were simulated: one basic-level categorization task, view-invariant cars vs. airplanes, and one subordinate-level task, view-invariant face recognition. For the subordinate-level face recognition task, a pair of face images was given; the task was to determine whether they depicted the same person (positive) or not (negative). For basic-level categorization, a pair of car/airplane images was given; the task was to determine whether they depicted the same basic-level category, i.e., whether the two images were both cars (positive), both airplanes (positive), or one airplane and one car (negative). The classifier used for both tasks was the same as the one used for experiments 1 and 2: for each test pair, the Pearson correlation between the two signatures was compared to a threshold. The threshold was optimized on a disjoint training set.

For each cluster, an HW-architecture was trained using only the objects in that cluster. If there were K objects in the cluster, then its HW-architecture had K HW-modules. Applying Eq (7), each HW-module's template book was the set of frames from the transformation video of one of the objects in the cluster.
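The greedy ψ-based clustering procedure described above can be sketched as follows (psi is any pairwise compatibility function, and the threshold value is a free parameter; the full pseudocode is in S1 Text):

```python
def online_cluster(objects, psi, threshold):
    # Greedy online assignment: each newly learned object joins the
    # subsystem whose members transform most compatibly with it (on
    # average), or founds a new subsystem if no average compatibility
    # reaches the threshold.
    subsystems = []  # each subsystem is a list of objects (template sources)
    for obj in objects:
        if subsystems:
            avg = [sum(psi(obj, m) for m in members) / len(members)
                   for members in subsystems]
            best = max(range(len(avg)), key=avg.__getitem__)
            if avg[best] >= threshold:
                subsystems[best].append(obj)
                continue
        subsystems.append([obj])
    return subsystems
```

As noted in the text, the outcome depends on the order in which the objects are presented, so any evaluation should average over random presentation orders.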
For both tasks, in the test phase, the signature of each test image was computed with Eq (6). Since the clustering procedure depends on the order in which the objects are presented, for each of the 3 object distributions we repeated the basic-level and subordinate-level recognition tasks 5 times using different random presentation orders. The error bars in Fig 5B and 5C, and S10 Fig, convey the variability (one standard deviation) arising from presentation order.

Evaluation parameters: 60 new face objects (disjoint from the clustering set).

The data were evenly split into 5 folds, 12 objects per fold.

For each fold, 48 objects were used for threshold optimization. For the face recognition case, 12 faces were used for testing. For the basic-level case, 12 objects of each category were used for testing.

For each fold, 4000 pairs were used to learn the classification threshold θ (see the classifier description above), and 4000 pairs were used for testing.

Performance was averaged over all folds.
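The pair-comparison classifier used throughout (a Pearson correlation between two signatures compared to a threshold θ learned on training pairs) can be sketched as below. Choosing the θ that maximizes training accuracy is one plausible way to "optimize the threshold on a disjoint training set"; the paper's exact optimization is not specified here.

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation between two signature vectors.
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fit_threshold(scores, labels):
    # Pick the acceptance threshold theta that maximizes accuracy on
    # the training pairs (labels: True = same identity / same category).
    best_theta, best_acc = 0.0, -1.0
    for theta in np.unique(scores):
        acc = np.mean((scores >= theta) == labels)
        if acc > best_acc:
            best_theta, best_acc = theta, acc
    return best_theta

def same_pair(sig_a, sig_b, theta):
    # Accept the pair as "same" if the signature correlation clears theta.
    return pearson(sig_a, sig_b) >= theta
```

At test time, each of the 4000 test pairs is scored with pearson and classified against the learned θ; accuracy is then averaged over folds.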

Acknowledgments We would like to thank users Bohemix and SoylentGreen of the Blender Open Material Repository for contributing the materials used to create the images for the illumination simulations (in supplementary information). We also thank Andrei Rusu, Leyla Isik, Chris Summerfield, Winrich Freiwald, Pawan Sinha, and Nancy Kanwisher for their comments on early versions of this manuscript, and Heejung Kim for her help preparing one of the supplementary figures.

Author Contributions Conceived and designed the experiments: JZL QL FA TP. Performed the experiments: JZL QL. Analyzed the data: JZL QL. Contributed reagents/materials/analysis tools: JZL QL FA TP. Wrote the paper: JZL QL FA TP.