To generate stimuli for our recordings, we randomly drew 2,000 faces from this face space (Figure 1C). Projections of real faces onto the 50 axes were largely Gaussian, and the 2,000 faces shared a similar distribution of vector lengths with the real faces (Figure S2C). Face stimuli were presented for 150 ms (ON period), interleaved with a gray screen for 150 ms (OFF period), and the same set of 2,000 stimuli was presented to each cell three to five times. We recorded 205 cells in total from two monkeys: 51 cells from ML/MF and 64 cells from AM in monkey 1; 55 cells from ML/MF and 35 cells from AM in monkey 2.

To investigate face representation in face patches, we generated parameterized realistic face stimuli using the “active appearance model” (): for each of 200 frontal faces from an online face database (FEI face database; see STAR Methods: Generation of parameterized face stimuli), a set of landmarks was labeled by hand (Figure 1A, left). The positions of these points carry information about the shape of the face and the shape/position of internal features (Figure 1A, middle). The landmarks were then smoothly morphed to a standard template (the average landmark shape); the resulting image (Figure 1A, right) carries shape-free appearance information. In this way, we extracted a set of 200 shape descriptors and 200 appearance descriptors. To construct a realistic face space, we performed principal components analysis (PCA) on the shape and appearance descriptors separately to extract the feature dimensions accounting for the largest variability in the database, retaining the first 25 PCs for shape and the first 25 PCs for appearance (Figures 1B and S2A). This yields a 50-dimensional (50-d) face space in which every point represents a face, obtained by starting with the average face, adding the appearance transform, and then applying the shape transform to the landmarks; reconstructions of faces from the original dataset within this 50-d space strongly resemble the originals (Figure S2B). Most of the dimensions were “holistic,” involving changes in multiple parts of the face; for example, the first shape dimension involved changes in hairline, face width, and eye height. Movie S1 shows a face undergoing changes only in shape parameters and a face undergoing changes only in appearance parameters.
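To make the construction concrete, here is a minimal sketch in Python (NumPy/scikit-learn). The array names and sizes are illustrative assumptions, not the authors' code; real inputs would be the hand-labeled landmark coordinates and the template-warped images described above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative placeholder inputs (assumptions, for shape only):
# shape_descr: (200, 116) flattened landmark coordinates (58 x/y pairs)
# app_descr:   (200, n_pixels) shape-free images warped to the template
rng = np.random.default_rng(0)
shape_descr = rng.normal(size=(200, 116))
app_descr = rng.normal(size=(200, 4096))

# Separate PCAs on shape and appearance descriptors; keep 25 PCs of each.
pca_shape = PCA(n_components=25).fit(shape_descr)
pca_app = PCA(n_components=25).fit(app_descr)

def to_face_space(shape_vec, app_vec):
    """50-d face-space coordinates: 25 shape + 25 appearance PC coefficients."""
    s = pca_shape.transform(shape_vec[None, :])[0]
    a = pca_app.transform(app_vec[None, :])[0]
    return np.concatenate([s, a])

coords = to_face_space(shape_descr[0], app_descr[0])  # one face -> 50-d point

# Draw 2,000 random faces: sample each PC coefficient from a Gaussian whose
# variance matches the database variance along that PC.
scales = np.sqrt(np.concatenate([pca_shape.explained_variance_,
                                 pca_app.explained_variance_]))
stimuli = rng.normal(size=(2000, 50)) * scales
```

Sampling each coefficient with variance matched to the database is one way to obtain random faces whose projections are, like the real faces, largely Gaussian along each axis.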

(G) The first 9 Eigenface feature dimensions. Intensity was normalized so that 0 mapped to middle gray and the maximum absolute value mapped to black or white.

(C) (C1) Gray lines show the distribution of feature values of the 200 real faces along the 50 dimensions, normalized to zero mean and unit variance. The black line shows the prediction of a standard Gaussian distribution. (C2) Sorted vector lengths of the 200 real faces in the 50-d feature space. (C3) Sorted vector lengths of the 2,000 parameterized faces in the 50-d feature space.

(K) Neuronal responses as a function of feature value for the first shape dimension (top) and the first appearance dimension (bottom) for all significantly tuned cells (p < 0.01 by shift predictor).

(J) Responses of ML/MF (left) and AM (right) neurons as a function of distance along the STA dimension. The abscissa is rescaled so that the range [–1, 1] covers 98% of the stimuli.

(I) The response of a neuron in AM plotted against the distance between the stimulus and the average face along the STA axis. Error bars represent SE.

(H) Number of significantly tuned cells (p < 0.01 by shift predictor; see STAR Methods) for each of the 50 dimensions, for ML/MF and AM cells in both monkeys.

(F) Distribution of shape preference indices, quantified as the contrast between the vector lengths of the shape and appearance STA, for ML/MF and AM cells. Arrows indicate the average of each population (p = 10⁻²⁵, Student’s t test).

(E) Vector length of the STA for the 25 appearance dimensions is plotted against that for the 25 shape dimensions for all the cells recorded from middle face patches ML/MF (blue) and anterior face patch AM (red).

(D) Spike-triggered average of a face-selective neuron from anterior face patch AM. The first 25 points represent shape dimensions, and the next 25 represent appearance dimensions. The facial image corresponding to the STA is shown in the inset.

(A–C) Generation of parameterized face stimuli. (A) 58 landmark points were labeled on 200 facial images from a face database (FEI face database; an example image is shown on the left). The positions of these landmarks carry shape information about each facial image (middle). The landmarks were smoothly morphed to match the average landmark positions across the 200 faces, generating an image carrying shape-free appearance information about each face (right). (B) PCA was performed to extract the feature dimensions that account for the largest variability in the database. The first principal components for shape (B1) and appearance (B2) are shown. (C) Example face stimuli generated by randomly drawing from a face space constructed from the first 25 shape PCs and the first 25 appearance PCs.

We first localized six face patches in two monkeys with fMRI by presenting a face localizer stimulus set containing images of faces and non-face objects (). Middle face patches MF and ML and anterior face patch AM were targeted for electrophysiological recordings () (Figure S1A). Well-isolated single units were recorded while presenting 16 real faces and 80 non-face objects (same stimuli as in). Units selective for faces were selected for further recordings (Figures S1B and S1C; see STAR Methods).

(B) Neuronal responses (baseline-subtracted, averaged from 50 to 300 ms) to images of different categories recorded from the middle face patches (ML/MF, left) and the anterior face patch AM (right).

Next, we explored the shape of tuning of ML/MF and AM neurons to the shape/appearance dimensions. When responses of an example AM neuron were grouped according to the distance between the stimulus and the average face (i.e., the face at the origin of the face space) along the STA axis in the 50-d face space, we observed ramp-like tuning, with maximum and minimum responses occurring at extreme feature values (Figure 1I). Such ramp-like tuning was consistently observed along the STA dimension across the population for both AM and ML/MF (Figure 1J) and was also clear for individual dimensions (Figure 1K).

To quantify neuronal tuning to the 50 dimensions of the face space, the responses of each neuron were first used to compute a “spike-triggered average” (STA) stimulus (), i.e., the average stimulus that triggered the neuron to fire (Figure 1D). On average, each cell was significantly tuned along 6.1 feature dimensions (range [0, 17], SD = 3.8). We next compared the relative sensitivity of each neuron to shape versus appearance: a “shape preference index” was computed from the vector lengths of the STA on the shape versus appearance dimensions. We found that most ML/MF cells showed stronger tuning to shape dimensions than to appearance dimensions, while AM cells showed the opposite trend (Figures 1E–1H). Shape preference indices computed with subsets of stimuli were highly correlated (split-halves approach, correlation = 0.89 ± 0.07, n = 205 cells; see STAR Methods); hence, this distinction between the preferred axes in ML/MF and AM is robust (Figure 1G). Furthermore, this distinction is fully consistent with previous studies showing that ML/MF cells are tuned to specific face views, while AM cells code view-invariant identity (). Changes in identity would produce changes in appearance dimensions, and changes in view (within a limited range away from frontal) would be accounted for by changes in shape dimensions. Importantly, because shape dimensions encompass a much larger set of transformations than just view changes, the tuning of AM cells to appearance dimensions indicates invariance to a much larger set of articulated-shape transformations than view changes alone, consistent with the invariance of face recognition behavior to many transformations beyond view changes, such as severe distortion of face aspect ratio ().
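As a concrete illustration, a minimal sketch of the STA and the shape preference index (the mean-subtraction and normalization here are assumptions; the paper’s exact procedure is in STAR Methods):

```python
import numpy as np

def spike_triggered_average(features, rates):
    """Response-weighted average stimulus in the 50-d face space.

    features: (n_stim, 50) face-space coordinates of the stimuli (zero mean).
    rates:    (n_stim,) firing rates of one cell.
    """
    w = rates - rates.mean()            # mean-subtract so untuned cells give ~0
    return w @ features / np.abs(w).sum()

def shape_preference_index(sta):
    """Contrast between STA vector lengths on shape (first 25) and
    appearance (last 25) dimensions; +1 = pure shape, -1 = pure appearance."""
    ls = np.linalg.norm(sta[:25])
    la = np.linalg.norm(sta[25:])
    return (ls - la) / (ls + la)
```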

To quantify the overall decoding accuracy of our model, we randomly selected a number of faces from the stimulus set and compared their actual 50-d feature vectors to the reconstructed feature vector of one face in the set using Euclidean distance. Decoding accuracy decayed with an increasing number of faces but was ∼75% with 40 faces when all cells were pooled together (Figure 3B, black solid line), much higher than chance level (Figure 3B, black dashed line). Furthermore, when the number of cells was equalized, decoding accuracy rose fastest for the combined population compared to the ML/MF and AM populations alone (Figure 3C; for n = 99 cells, p < 0.01 comparing the combined population with AM and p < 0.005 comparing the combined population with ML/MF, estimated by 1,000 iterations of random sampling with replacement; see STAR Methods), consistent with the two regions carrying complementary information about shape and appearance. We also assessed decoding accuracy by measuring the subjective similarity between reconstructions and actual faces using human psychophysics and found that human subjects were significantly more likely to match the reconstructed face to the actual face than to a highly similar distractor (see STAR Methods: Human psychophysics). The fact that we can accurately decode the identity of real faces from population responses in ML/MF and AM shows that we have satisfied one essential test of a full understanding of the brain’s code for face identity.
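The identification procedure can be sketched as follows (illustrative; the exact draw-and-compare scheme is an assumption beyond what is stated above):

```python
import numpy as np

def identification_accuracy(actual, decoded, n_faces, n_iter=1000, rng=None):
    """Fraction of draws in which a decoded feature vector is closest
    (Euclidean) to its own face among n_faces randomly drawn candidates.

    actual, decoded: (n_stim, 50) true and decoded feature vectors.
    Chance level is 1 / n_faces.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_stim = actual.shape[0]
    hits = 0
    for _ in range(n_iter):
        idx = rng.choice(n_stim, size=n_faces, replace=False)
        target = rng.choice(idx)        # the face whose decode is tested
        dists = np.linalg.norm(actual[idx] - decoded[target], axis=1)
        hits += idx[np.argmin(dists)] == target
    return hits / n_iter
```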

We found that this simple linear model could predict single features very well (Figure 2B). We used the percentage of feature-value variance explained by the linear model to quantify decoding quality. Overall, the decoding quality for appearance features was better than that for shape features for AM neurons, while the opposite was true for ML/MF neurons (Figures 2C and 2D), consistent with our STA analysis (Figure 1F). By combining the predicted feature values across all 50 dimensions, we could reconstruct the face that the monkey saw. Examples of reconstructed faces are shown in Figure 3A next to the actual faces, using ML/MF data, AM data, and combined data from both patches. The reconstructions using AM data strongly resemble the actual faces the monkey saw, and the resemblance was further improved by adding ML/MF data.

(B) Decoding accuracy as a function of the number of faces, using a Euclidean distance model (black solid line). Decoding accuracy based on two alternative models, nearest neighbor in the space of population responses (gray dashed line; see STAR Methods) and the average of the 50 nearest neighbors (gray solid line), was much lower. The black dashed line represents chance level. Results based on three neuronal populations are shown separately (the black solid lines for ML/MF and AM are the same as the black solid lines for the corresponding patches in Figure 2D, except that here they are not shown with variability estimated by bootstrapping). In the left panel, boxes and error bars represent mean and SEM of subjective (human-based) decoding accuracy based on 78 human participants (see STAR Methods: Human psychophysics).

(A) Using the facial features decoded by linear regression in Figure 2, facial images could be reconstructed. Faces predicted by three neuronal populations are shown alongside the corresponding actual stimuli presented in the experiment.

If a face cell has ramp-shaped tuning to different features, this means that its response can be roughly approximated by a linear combination of the facial features, with the weighting coefficients given by the slopes of the ramp-shaped tuning functions. For a population of neurons,, whereis the vector of responses of different neurons, S is the matrix of weighting coefficients for different neurons,is the 50-d vector of face feature values, andis the offset vector. If this is true, then by simply inverting this equation, we should be able to linearly decode the facial features from the population response (). To attempt this, we took advantage of the fact that we always presented the same set of 2,000 stimuli to the monkey and used a leave-one-out approach to train and test our model. We determined the transformation from responses to feature values with linear regression using population responses of face cells in a time window from 50 to 300 ms after stimulus onset to 1,999 faces and then predicted the feature value of the remaining image ( Figure 2 A). Note that, for this decoding procedure, we used cells recorded sequentially; if the brain were to use a similar decoding approach, it would be using neurons firing simultaneously.
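A minimal sketch of this leave-one-out linear decoding in Python/NumPy (the array names and the plain least-squares solver are assumptions; the actual regression details are in STAR Methods):

```python
import numpy as np

def decode_features_loo(responses, features):
    """Leave-one-out linear decoding: predict features from responses,
    refitting the regression with each stimulus held out in turn.

    responses: (n_stim, n_cells) mean rates in the 50-300 ms window.
    features:  (n_stim, 50) face-space coordinates.
    """
    n = responses.shape[0]
    X = np.hstack([responses, np.ones((n, 1))])   # add intercept column
    pred = np.empty_like(features)
    for i in range(n):                            # hold out stimulus i
        train = np.arange(n) != i
        W, *_ = np.linalg.lstsq(X[train], features[train], rcond=None)
        pred[i] = X[i] @ W
    return pred
```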

(D) Decoding accuracy as a function of the number of faces randomly drawn from the stimulus set for three different models (see STAR Methods). For each model, a different set of features was first linearly decoded from population responses, and Euclidean distances between decoded and actual features in each feature space were then computed to determine decoding accuracy. The three sets of features are: the 50-d features of the active appearance model; the 25-d shape features; and the 25-d appearance features. Shaded regions indicate SD estimated by bootstrapping.

(A) Diagram illustrating the decoding model. To construct and test the model, we used responses of AM (n = 99) and ML/MF (n = 106) cells to 2,000 faces. Population responses to 1,999 faces were used to determine the transformation from responses to feature values by linear regression, and the feature values of the remaining image were then predicted.

Shape of Tuning along Axes Orthogonal to the STA

Figure 4. AM Neurons Display Almost Flat Tuning along Axes Orthogonal to the STA in Face Space

(A) For each neuron in AM, the STA was first computed; 2,000 random axes were then selected and orthogonalized to the STA in the 25-d space of appearance features. Tuning functions along the 300 axes accounting for the largest variability in the stimuli were averaged and fitted with a Gaussian function, $a \cdot e^{-x^2/\sigma^2} + c$. The center of the fit ($a + c$) was used to normalize the average tuning function. Red dots and error bars represent mean and SD of the population.

(B) Same as (A), but for two control models. (B1) Each simulated cell corresponds to one of the 200 real faces projected onto the 25-d face space of appearance features (the exemplar face), and its response to an arbitrary face is a decreasing linear function of the Euclidean distance between that face and the exemplar in the 25-d feature space. (B2) Each simulated cell corresponds to 81 transforms of a single identity (nine views × nine positions). For a given image, the similarity of the image to each of the transforms (defined as a decreasing linear function of the pixel-level distance between the two images) was computed, and the maximum value across all 81 transforms was taken as the response of the cell. For fairness of comparison, the response of each model cell was matched to one of the AM neurons in noise level and sparseness (for details, see STAR Methods; a comparison of sparseness between neurons and models is shown in the inset).

(C) Responses of an AM neuron to 25 parameterized faces. Firing rate was averaged in 25-ms bins. The three stimuli evoking the strongest responses are shown on the right.

(D) Responses of the cell in (C) to different faces are color coded and plotted in the 2-d space spanned by the STA axis and the axis orthogonal to the STA in the appearance feature space accounting for the largest variability in the features. Arrows indicate the three faces in (C).

(E) Same as (D), but for a non-sparse AM cell.

(F) For each cell in AM or the two models, tuning along orthogonal axes was first fitted with a Gaussian function, and the ratio between the fit at 0.67, $a \cdot e^{-0.67^2/\sigma^2} + c$, and the center, $a + c$, was computed and plotted against the sparseness of the cell. Cells in each population were further divided into three groups according to sparseness, defined as $\left(\sum_{i=1}^{N} R_i / N\right)^2 / \left(\sum_{i=1}^{N} R_i^2 / N\right)$. Solid and open circles indicate data from two different monkeys. Boxes and error bars represent mean and SE of each subgroup. The difference between AM neurons and the two models was significant at all three sparseness levels (p < 0.001, Student’s t test).
(G) Two models were used to fit face cells’ responses to the parameterized face stimuli: (1) an “axis” model, in which every face was projected onto an axis in the 50-d face space; (2) an “exemplar” model, in which the distance from one of the 2,000 faces to an exemplar face was computed (the length of the exemplar face in the 50-d space was restricted to be smaller than twice the average length of real faces). The projection or distance was then passed through a nonlinearity (a third-order polynomial) to generate a predicted response. Each parameter of the model was adjusted by gradient descent to minimize the Euclidean distance between predicted and actual firing rates. To obtain high-quality responses, we repeated 100 faces more frequently than the remaining 1,900 faces and used responses to the 100 faces to validate the model derived from the 1,900 faces.

(H) Predicted versus actual responses for one example cell using the axis model. The model explained 68% of the variance in the responses.

(I) Comparison of fit quality of the two models for 32 cells. The axis model provided significantly better fits to the actual responses (mean = 56.9%) than the exemplar model (mean = 41.7%; p < 0.001, paired t test).

(J) Trials of responses to the 100 stimuli were randomly split into two halves, and the average response across half of the trials was used to predict that of the other half. The percentage of variance explained, after Spearman-Brown correction (mean = 71.1%), is plotted against that of the axis model.

(K) A convolutional neural network (CNN) was trained to perform view-invariant face identification (Figure S7). 52 units were randomly selected from the 500 units in the final layer of the CNN and used to linearly fit the responses of face cells. Mean explained variance across 100 repetitions of random sampling is plotted against that of the axis model. The fit quality of the CNN units (mean = 30.2%, p < 0.01) was much lower than that of the axis model. Using more units leads to overfitting, further reducing the cross-validated explained variance (to 26.5% for 100 units and 17.7% for 200 units).

(L) Neuronal responses were fitted by a different “axis” model using “Eigenface” features as the dimensions of the face space (Figure S2G; see STAR Methods). PCA was performed on the original image intensities of the 2,000 faces, and the first 50 PCs were treated as the input to the axis model. The fitting procedure was the same as in (G). The fit quality of the “Eigenface” model (mean = 29.9%, p < 0.001) was much lower than that of the axis model. Using 100 PCs slightly increased the fit quality (mean = 31.1%), while using 200 PCs led to overfitting (mean = 22.8%).

See also Figures S3, S4, S5, and S7.

The model used for decoding assumes that face patch neurons linearly combine different features (an “axis model”). While simple, this code is inconsistent with prevailing notions of the function of IT cells, in particular sparse AM cells. Many models of object recognition assume an exemplar-based representation (Riesenhuber and Poggio, 1999; Valentine, 1991) (Figure 4G, right), in which recognition is mediated by units tuned to exemplars of the specific objects to be recognized. Early studies attempting to find the “optimal object” for IT cells assumed such an exemplar-based model (Tanaka, 1996). More direct support for an exemplar-based model comes from recordings in face patch AM, where a subset of cells have been found to respond extremely sparsely to only a few identities, invariant to head orientation (e.g., see Figure 1 in Freiwald and Tsao, 2010, and Movie S2).
These cells have been hypothesized to code exemplars of specific individuals, analogous to the “Jennifer Aniston” cells recorded in the human hippocampus that respond to images, letter strings, and vocalizations of one specific individual (Quiroga et al., 2005). If AM cells are in fact linearly combining different features, then geometrically an AM cell is simply taking a dot product between an incoming face and a specific direction in face space defined by the cell’s STA (Figure 4A, inset). If this is true, then each cell should have a null space within which its response does not change. This null space is simply the hyperplane orthogonal to the STA, since adding a vector in this plane does not change the value of the projection onto the STA. In contrast, if AM cells are coding exemplars of specific individuals, then the response to an incoming face should be a decreasing function of the distance of the face to the exemplar face (Riesenhuber and Poggio, 1999; Valentine, 1991).
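The null-space claim is easy to state in code. A small, purely illustrative NumPy demonstration that an axis model’s response is unchanged by adding any vector orthogonal to the STA:

```python
import numpy as np

rng = np.random.default_rng(1)
sta = rng.normal(size=50)
sta /= np.linalg.norm(sta)                 # the cell's preferred axis

def axis_response(face):
    """Axis model: the response depends only on the projection onto the STA."""
    return face @ sta

face = rng.normal(size=50)
v = rng.normal(size=50)
v -= (v @ sta) * sta                       # project v into the null space
assert np.isclose(axis_response(face), axis_response(face + 10 * v))
```

Two faces differing by a large null-space vector produce identical model responses, anticipating the “metameric” faces described below.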

Figure S3. Tuning along Single Axes Orthogonal to the STA Is Flatter for AM Neurons Than for Control Models Using Exemplars or Max Pooling, Related to Figure 4

(A) Tuning of 99 AM cells along the single axis orthogonal to the STA in the 25-d appearance feature space that accounts for the most variability. Red dots and error bars represent mean and SD.

(B) Same as (A), but for models using distance to an exemplar face to compute responses (cf. Figure 4B). Sparseness and noise levels are matched to AM cells.

(C) Same as (B), but using extreme faces as exemplars for each cell (vector length = twice the average of real faces).

(D) Same as (B), but using the max-pooling model in Figure 4B.

(E) The strength of nonlinearity, quantified by the ratio between surround and center of the Gaussian fit (cf. Figure 4F), is plotted against sparseness for AM and the three models. Boxes and error bars represent mean and SE for three sparseness levels.

(F) Same as (E), but for the absolute difference between the ratio and 1.

(G and H) Same as (E) and (F), but for all 300 random axes used in Figure 4A.

(I) Ramp-shaped tuning along the STA axis does not imply flat tuning along orthogonal axes. (I1) The axis model shows ramp-shaped tuning along the STA axis and flat tuning along orthogonal axes. (I2) Different examples with ramp-shaped tuning along the STA axis (Face axis 1). Only the leftmost example shows flat tuning along the orthogonal axis (Face axis 2).

To decide whether a cell is coding an exemplar or an axis, the critical question is: what is the shape of tuning along axes in the plane orthogonal to the STA axis? If this plane constitutes a “null space” in which all faces elicit the same response, this would indicate axis coding. Alternatively, if there is Gaussian tuning along axes within this plane, this would indicate exemplar coding. To distinguish between these two possibilities, we quantified tuning of AM cells along axes orthogonal to the STA in the 25-d appearance feature space (Figure 4A); we purposely excluded the 25-d shape feature space to avoid the possibility of shape invariance giving rise to flat tuning along the orthogonal dimension. To obtain better signal quality, we averaged tuning along multiple axes that accounted for the largest variability of the stimuli (see STAR Methods; the results also hold for single axes, see Figure S3). Surprisingly, the tuning of AM neurons was largely flat along orthogonal axes, with no clear Gaussian nonlinearity.
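A sketch of the orthogonal-axis analysis of Figure 4A (the binning scheme and fit initialization are assumptions beyond what the caption states):

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, sigma, c):
    return a * np.exp(-(x / sigma) ** 2) + c

def orthogonal_tuning(stim, rates, sta, n_axes=2000, n_keep=300,
                      n_bins=11, rng=None):
    """Average tuning along random axes orthogonalized to the STA,
    normalized by the center of a Gaussian fit (cf. Figure 4A)."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = sta / np.linalg.norm(sta)
    axes = rng.normal(size=(n_axes, stim.shape[1]))
    axes -= np.outer(axes @ u, u)                 # remove the STA component
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    # Keep the axes along which the stimuli vary the most.
    proj = stim @ axes.T
    keep = np.argsort(proj.var(axis=0))[-n_keep:]
    proj = proj[:, keep]
    # Bin projections into equal-count bins; average firing rate per bin.
    curves = np.empty((n_keep, n_bins))
    for k in range(n_keep):
        edges = np.quantile(proj[:, k], np.linspace(0, 1, n_bins + 1))
        idx = np.clip(np.searchsorted(edges, proj[:, k]) - 1, 0, n_bins - 1)
        curves[k] = [rates[idx == b].mean() for b in range(n_bins)]
    mean_curve = curves.mean(axis=0)
    x = np.linspace(-1, 1, n_bins)
    (a, sigma, c), _ = curve_fit(gauss, x, mean_curve,
                                 p0=(np.ptp(mean_curve), 0.5, mean_curve.min()))
    return x, mean_curve / (a + c)                # flat tuning -> ratio near 1
```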

To quantitatively confirm the flatness, we compared the result with several models (Figures 4B and S3B–S3D). The first model defined, for each recorded AM cell, a counterpart model “exemplar” cell that fired maximally to a specific exemplar face and whose firing rate decayed linearly as a function of the distance between an incoming face and the exemplar face. We chose the exemplar face by projecting one of the 200 real faces in the original FEI database onto the 25-d appearance feature space. The sparseness and noise of the model units were set equal to those of the actual units. As expected, the model units displayed clear bell-shaped tuning along orthogonal axes (Figure 4B1). In a second exemplar model, we implemented view invariance by a conventional max-pooling operation: each unit contained a set of templates corresponding to different views and positions of the same identity, and the response of the unit to a face was the maximum of the similarities between this face and each template (similarity was defined as a decreasing linear function of the mean absolute pixel difference between the two images). This model also exhibited a clear bell-shaped nonlinearity (Figure 4B2).
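Minimal sketches of the two control models (illustrative; the matching of sparseness and noise to AM neurons described above is omitted):

```python
import numpy as np

def exemplar_response(face, exemplar):
    """Exemplar model: response decays linearly with the Euclidean distance
    between the incoming face and the exemplar in feature space."""
    return -np.linalg.norm(face - exemplar)

def max_pool_response(image, templates):
    """View-invariant exemplar model: maximum similarity over the stored
    transforms (templates: (81, n_pixels)); similarity is a decreasing
    linear function of the mean absolute pixel difference."""
    sim = -np.mean(np.abs(templates - image), axis=1)
    return sim.max()
```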

Figure S4. Additional Analyses of Tuning along Axes Orthogonal to the STA, Related to Figure 4

(A–H) The actual face space spanned by AM cells constitutes a subspace of the 50-d feature space. One concern is that the flat tuning we observed in the orthogonal plane is due to contributions from dimensions that do not modulate any cells in the population. To address this, the actual face space encoded by AM cells was constructed by performing principal component analysis on the axes defined by the STAs of appearance-preferring AM neurons. This figure shows that tuning along axes orthogonal to the STA is flatter for AM neurons than for control models within this actual face space.

(A) To estimate the actual space encoded by AM neurons, we first computed the STA for each neuron and normalized it to a norm of 1. To avoid a non-zero mean, we pooled both the STAs and their opposites (–STA), then performed principal component analysis on the STAs (and –STAs) of all the appearance-biased AM neurons.

(B) Principal components (PCs) were used to define the axes of the new space.

(C) The eigenvalue of each PC was used to define a scaling factor for that axis: larger eigenvalues correspond to longer axes. The gray area indicates the 99% confidence interval computed by randomly shuffling the 2,000 stimuli. The STA of the randomly shuffled response was rescaled using the norm of the actual STA. The first 47 PCs are significant. For another way of estimating the dimensionality of face space, using an identification task, see Figure S6A.

(D–G) Same as Figures 4A, 4B, and 4F, but using distance and orthogonality estimated by the metrics of the new space.

(H) The absolute difference between 1 and the ratio in (G), plotted against sparseness for all three populations.

(I–L) Another concern is that exemplar cells may use a distance function weighting some dimensions more strongly than others, resulting in non-circular contour lines. Such cells would display flatter tuning along some dimensions than others. We explored this possibility by varying the aspect ratio between the weights of distance along the STA axis and along orthogonal axes.

(I–K) Tuning along axes orthogonal to the STA for three aspect ratios; same conventions as Figure 4B.

(L) Nonlinearity of tuning along orthogonal dimensions for three sparseness levels, quantified as in Figure 4F, plotted for exemplar cells using distance metrics with different aspect ratios and for AM. Tuning of AM cells is flatter than that of exemplar cells with aspect ratios as high as eight (p < 0.01, Student’s t test).

(M and N) Tuning along axes orthogonal to the STA is flatter for ML/MF neurons than for the exemplar model in the space of shape features.

(M) Same as Figure 4A, but for tuning of 106 ML/MF cells along axes orthogonal to the STA in the 25-d space of shape features.

(N) Same as Figure 4F, but for ML/MF cells.

One might worry that the flat tuning we observed in the orthogonal plane was due to contributions from dimensions that did not modulate any cells in the population; analysis of responses restricted to the actual face space spanned by the STAs of AM neurons shows that this is not the case (Figures S4A–S4H). Another concern is that cells may encode exemplars using an ellipsoidal distance metric, such that tuning is broader along some dimensions than others; analysis of model exemplar units explicitly endowed with non-circular aspect ratios rules out this possibility (Figures S4I–S4L).

Figure S5. Adaptation Plays Little Role in Shaping Responses of AM Cells, Related to Figure 4

(A) Tuning functions along the STA dimension for an example cell in AM under three adaptation conditions: all trials (blue); trials preceded by a stimulus far from the average face along the STA dimension (33% largest distances, green); trials preceded by a stimulus close to the average face along the STA dimension (33% smallest distances, red). The ratio of average distance between the far and near groups was 7.06 on average; thus, the two groups represent clearly different adaptation conditions.

(B) The average tuning function along the STA dimension for 47 AM cells under the three conditions. The response of each cell was normalized to an average of 1. Error bars represent SE. Note that we have 47 AM cells here, rather than 99, because for 52 cells the presented stimuli included not only frontal faces (e.g., profile faces), making it difficult to collect enough consecutive trials of frontal faces for this analysis.

(C) The average response to near stimuli (33% smallest distances) was not affected by the distance of the preceding stimulus to the average face (near or far).

(D) Similar to (C), but considering only trials preceded by two consecutive near or far stimuli, which should produce a stronger adaptation effect, if there is any.

(E–H) Similar to (A)–(D), but for tuning along axes orthogonal to the STA. Note that distance to the average face was computed along orthogonal axes (cf. Figure 4A). Error bars represent SD. There is no significant difference between the two groups in (C), (D), (G), and (H) (p > 0.05, paired t test).

A further potential confound is adaptation: cells in IT cortex have been reported to suppress their responses more strongly for more frequent feature values (Vogels, 2016). Our stimuli were Gaussian distributed along each axis; as a result, faces closer to the average face appeared more frequently. To rule out the possibility that our findings are specific to our stimulus conditions, we examined the extent of adaptation in the recorded cells by regrouping the responses based on the preceding stimuli. We first examined how tuning along the STA dimension is affected by the preceding stimulus. The responses of each cell were regrouped according to the distance between the immediately preceding stimulus and the average face along the STA dimension, into a far group (33% largest distances) and a near group (33% smallest distances). If adaptation played an important role, one would expect a clear difference in tuning between the two groups (for example, the center of the tuning function might be more suppressed for the near group than for the far group). However, we observed no difference in tuning between the two groups (Figures S5A–S5D). Similarly, we found that adaptation played little role in reshaping tuning along orthogonal axes (Figures S5E–S5H).
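The regrouping step can be sketched as follows (illustrative; the quantile cutoffs follow the 33% splits described above):

```python
import numpy as np

def split_by_preceding_stimulus(dists):
    """Masks for trials preceded by a 'near' or 'far' stimulus, where dists
    gives each trial's distance to the average face along the STA axis,
    in presentation order (cf. Figure S5)."""
    prev = dists[:-1]                     # distance of the preceding stimulus
    lo, hi = np.quantile(prev, [1 / 3, 2 / 3])
    near = prev <= lo                     # 33% smallest preceding distances
    far = prev >= hi                      # 33% largest preceding distances
    return near, far                      # masks over trials 1..n-1
```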

The results so far suggest that AM cells encode specific axes rather than exemplars. How can we reconcile this finding with the existence of sparse, view-invariant AM cells selective for specific exemplars? To address this, we examined the shape of tuning of AM cells as a function of sparseness. We found that, for our parameterized stimuli, some AM neurons also responded sparsely (Figure 4C shows one example). However, when we examined the tuning of these sparse neurons in a 2-d space spanned by the STA and an orthogonal axis, they showed a drastic nonlinearity along the STA but almost no tuning along the orthogonal axis (Figure 4D shows one example; for comparison, Figure 4E shows the response of a non-sparse cell). When we plotted the level of nonlinearity along the orthogonal axis against sparseness, we found that AM neurons were less tuned than the two control models, regardless of the sparseness of responses (Figures 4F and S3E–S3H). Furthermore, the lack of tuning along the orthogonal axis provides a simple explanation for the mystery of why some AM cells, even very sparse ones, respond to several faces bearing no obvious resemblance to each other: these faces are “metameric” because they differ by a large vector lying in the cell’s null space (arrows in Figure 4D).
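The sparseness measure from the Figure 4F caption, transcribed directly:

```python
import numpy as np

def sparseness(rates):
    """Sparseness as defined in the Figure 4F caption:
    (sum_i R_i / N)^2 / (sum_i R_i^2 / N).
    Values near 1 indicate dense responses; values near 0, sparse ones."""
    r = np.asarray(rates, dtype=float)
    n = r.size
    return (r.sum() / n) ** 2 / (np.square(r).sum() / n)
```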

We repeated the above analyses for cells in ML/MF and found that ML/MF cells were also tuned to single axes defined by the STA, showing flat tuning in the hyperplane orthogonal to the STA (Figures S4M and S4N). Thus, the fundamental difference between ML/MF and AM lies in the axes being encoded (shape versus shape-free appearance), not in the coding scheme.

A full model of face processing should allow both encoding and decoding of neural responses to arbitrary faces. How well does the axis model predict firing rates of cells to real faces? To address this, we fit responses of face cells with two models, an axis model and an exemplar model (Figure 4G). In the axis model, we assumed that the cell simply takes the dot product between an incoming face (described by a 50-d shape-appearance vector) and a specific axis and then passes the result through a nonlinearity. In the exemplar model, we assumed the cell computes the Euclidean distance between the face and a specific exemplar face and then passes the result through a nonlinearity. The nonlinearity allows us to account for nonlinear tuning along the STA. We fit the two models on responses to a set of 1,900 faces and then tested them on responses to a different set of 100 faces. To obtain high signal quality, the 100 faces were repeated ten times more frequently than the remaining 1,900 faces. We found that the axis model could explain up to 57% of the response variance, outperforming the exemplar model by more than 15% of explained variance (Figures 4H and 4I). We compared this to the noise ceiling of the cells, estimated by using the mean response on half of the trials to predict the mean on the other half, which yielded 72% explained variance after Spearman-Brown correction (Figure 4J). The ratio between the variance explained by the axis model and this noise ceiling is 80.0%, much higher than previously achieved (48.5%) (Yamins et al., 2014). We also trained a five-layer convolutional neural network (CNN) to perform invariant face identification and then linearly regressed the activity of AM cells on the activity of the output neurons of this network, analogous to a previous study that used output units of a CNN trained on invariant object recognition to model IT responses (Yamins et al., 2014). This could explain 30% of the variance (42.5% of the noise ceiling) (Figure 4K), significantly lower than the performance of the axis model and comparable to the results of the previous study (48.5% of noise ceiling). Furthermore, we compared the axis model with a well-known face model, the “Eigenface” model (Sirovich and Kirby, 1987; Turk and Pentland, 1991), which computes principal components of the original images rather than shape or appearance representations (STAR Methods; see also Figure S2G). In this case, 50 “Eigenface” features were used as the axes of the model. We found that the “Eigenface” model could explain 31% of the variance (Figure 4L), significantly lower than the axis model. This suggests that the correct choice of face space axes is critical for achieving a simple explanation of face cells’ responses.
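A sketch of fitting the axis model (the optimizer and initialization here are assumptions; the paper reports gradient descent with a third-order polynomial nonlinearity):

```python
import numpy as np
from scipy.optimize import minimize

def fit_axis_model(faces, rates):
    """Fit r ≈ p3(f · axis): a cubic polynomial of the projection of each
    face f onto a learned axis. Returns (axis, polynomial coefficients)."""
    n_dim = faces.shape[1]

    def loss(params):
        axis, poly = params[:n_dim], params[n_dim:]
        pred = np.polyval(poly, faces @ axis)   # 3rd-order nonlinearity
        return np.mean((pred - rates) ** 2)     # squared error to minimize

    rng = np.random.default_rng(0)
    p0 = np.concatenate([rng.normal(size=n_dim) / n_dim,
                         [0.0, 0.0, 1.0, rates.mean()]])
    res = minimize(loss, p0, method="L-BFGS-B")
    return res.x[:n_dim], res.x[n_dim:]
```

The exemplar variant would replace the projection `faces @ axis` with a distance `np.linalg.norm(faces - exemplar, axis=1)`; validating on the held-out 100 repeated faces, as described above, then compares the two fits.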