What specific features should visual neurons encode, given the infinity of real-world images and the limited number of neurons available to represent them? We investigated neuronal selectivity in monkey inferotemporal cortex via the vast hypothesis space of a generative deep neural network, avoiding assumptions about features or semantic categories. A genetic algorithm searched this space for stimuli that maximized neuronal firing. This led to the evolution of rich synthetic images of objects with complex combinations of shapes, colors, and textures, sometimes resembling animals or familiar people, other times revealing novel patterns that did not map to any clear semantic category. These results expand our conception of the dictionary of features encoded in the cortex, and the approach can potentially reveal the internal representations of any system whose input can be captured by a generative model.

We term the overall approach XDREAM (EXtending DeepDream with Real-time Evolution for Activity Maximization in real neurons) (Figure 1D). We conducted evolution experiments on IT neurons in six monkeys: two with chronic microelectrode arrays in posterior IT (PIT) (monkeys Ri and Gu), two with chronic arrays in central IT (CIT) (monkeys Jo and Y1), one with chronic arrays in both CIT and PIT (monkey Ge), and one with a recording chamber over CIT (monkey B3). Lastly, we validated the approach in a seventh monkey with a chronic array in primary visual cortex (V1) (monkey Vi).

Here, we use a novel combination of a pre-trained deep generative neural network and a genetic algorithm to let neuronal responses guide the evolution of synthetic images. Trained on more than one million images from ImageNet, the generative adversarial network learns to model the statistics of natural images without merely memorizing the training set (Figure S1), thus representing a vast and general image space constrained only by natural image statistics. We reasoned that this would be an efficient space in which to run the genetic algorithm, because the brain also learns from real-world images, so its preferred images are likely to follow natural image statistics as well. Moreover, convolutional neural networks emulate aspects of computation along the primate ventral visual stream, and this particular generative network has been used to synthesize images that strongly activate units in several convolutional neural networks, including ones not trained on ImageNet. The network takes 4,096-dimensional vectors (image codes) as input and deterministically transforms them into 256 × 256 RGB images (STAR Methods and Figure 1). A genetic algorithm then used the responses of neurons recorded in alert macaques to optimize the image codes input to this network. Each experiment started from an initial population of 40 images created from random achromatic textures (Figure 1B). We recorded responses of IT neurons (spike counts 70–200 ms after stimulus onset, minus background) while monkeys performed a passive fixation task. Images subtended 3° × 3° and covered the unit's receptive field (Figure 1C). The neuronal response to each synthetic image was used to score its image code. In each generation, images were synthesized from the top 10 image codes of the previous generation, carried over unchanged, plus 30 new codes produced by recombination and mutation of codes from the preceding generation, selected on the basis of firing rate (Figure 1D). This process was repeated for up to 250 generations over 1–3 h; session duration depended on the monkey's willingness to maintain fixation. To monitor changes in firing rate due to adaptation, and to compare synthetic-image responses with natural-image responses, we interleaved reference images that included faces, body parts, places, and simple line drawings.
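
For concreteness, the selection scheme just described can be sketched as follows. This is a minimal illustration, not the published implementation (see STAR Methods for that); the specific recombination and mutation operators and their rates (`mutation_rate`, `mutation_sd`) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_generation(codes, rates, n_keep=10, n_children=30,
                    mutation_rate=0.25, mutation_sd=0.75):
    """One genetic-algorithm step: keep the top codes, breed the rest.

    codes : (40, 4096) image codes of the current generation
    rates : (40,) background-subtracted firing rates used as scores
    """
    order = np.argsort(rates)[::-1]
    survivors = codes[order[:n_keep]]          # top 10 codes pass unchanged

    # Parent selection probabilities weighted by firing rate
    shifted = rates - rates.min()
    total = shifted.sum()
    p = shifted / total if total > 0 else np.full(len(rates), 1.0 / len(rates))

    children = []
    for _ in range(n_children):
        i, j = rng.choice(len(codes), size=2, replace=False, p=p)
        mask = rng.random(codes.shape[1]) < 0.5          # uniform recombination
        child = np.where(mask, codes[i], codes[j])
        mutated = rng.random(codes.shape[1]) < mutation_rate
        child = child + mutated * rng.normal(0.0, mutation_sd, codes.shape[1])
        children.append(child)

    return np.vstack([survivors, np.asarray(children)])  # next 40 codes
```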

(D) Experimental flow. Image codes were forwarded through the deep generative adversarial network to synthesize images presented to the monkey. Neuronal responses were used to rank image codes, which then underwent selection, recombination, and mutation to generate new image codes (for details, see STAR Methods).

Responses were the average spike count per image from 70 to 200 ms after stimulus onset, minus baseline (spike count from 1 to 60 ms after stimulus onset). The average response to these images by category is shown in Figure 4E.
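
As a minimal sketch (assuming spike times are stored in milliseconds relative to stimulus onset; the exact boundary handling is an assumption), this response measure can be computed as:

```python
import numpy as np

def image_response(spike_times_ms, n_reps):
    """Background-subtracted response for one image.

    spike_times_ms : spike times pooled across repeats, in ms from stimulus onset
    n_reps         : number of presentations of the image
    """
    t = np.asarray(spike_times_ms)
    evoked   = np.sum((t >= 70) & (t < 200)) / n_reps   # evoked window, 70-200 ms
    baseline = np.sum((t >= 1)  & (t < 60))  / n_reps   # baseline window, 1-60 ms
    return evoked - baseline
```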

To qualitatively estimate the expressiveness of the deep generative network, we selected arbitrary images in various styles and categories outside the training set of the network (first row). To find an image code that would approximately generate each target image (second row), we used either (1) backpropagation to optimize a zero-initialized image code to minimize pixel-space distance (left group; STAR Methods, Initial Generation), or (2) the CaffeNet fc6 representation of the target image, which the generator was originally trained to invert (right group). The existence of codes that produced the images in the second row, regardless of how they were found, demonstrates that the deep generative network can encode a variety of images. We then asked whether, given that these images can be approximately encoded by the generator, a genetic algorithm searching in code space (“XDREAM”) could recover them. To do so, we created dummy “neurons” whose “response” was the negative of the distance between the target image and any given image, in pixel space (left group) or CaffeNet pool5 space (right group), and used XDREAM to maximize these “neuron responses” (thereby minimizing distance to the target), just as the algorithm maximizes the firing of real neurons in electrophysiology experiments. The genetic algorithm was indeed able to find codes that produced images (third row) similar to the target images, indicating not only that the generator is expressive but also that its latent space can be searched with a genetic algorithm. Images reproduced from published work with permission are as follows: “curvature-position”; “3D shape”; “monkey face” (ILSVRC2012). Public domain artwork is as follows: “Monet,” The Bridge at Argenteuil (National Gallery of Art); “object,” “Moon jar” (The Metropolitan Museum of Art). “Neptune” (NASA) is a public domain image.
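
A minimal sketch of such a dummy “neuron” (shown here in pixel space; a CaffeNet pool5 version would first map both images through that layer before taking the distance):

```python
import numpy as np

def make_dummy_neuron(target_image):
    """A surrogate 'neuron' whose response grows as an image approaches the target.

    Returns a score function that the optimizer can maximize; the response is the
    negative Euclidean pixel distance, so maximizing it minimizes distance.
    """
    target = target_image.astype(np.float64).ravel()

    def respond(image):
        return -np.linalg.norm(image.astype(np.float64).ravel() - target)

    return respond

# Usage: plug `respond` in wherever real firing rates would score the images,
# e.g., rates = np.array([respond(img) for img in generation_images])
```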

Despite the successes with hand-picked stimuli, the field might have missed stimulus properties that better reflect the “true” tuning of cortical neurons. A series of alternative approaches has addressed this question. One approach is to start with hand-picked stimuli that elicit strong activation and systematically deform them; this approach has revealed that neurons often respond even better to distorted versions of the original stimuli. Another is spike-triggered averaging of noise stimuli, but this has not yielded useful results in higher cortical areas because it cannot capture nonlinearities. An elegant alternative is to use a genetic algorithm whereby the neuron under study guides its own stimulus selection. Connor and colleagues pioneered this approach to study selectivity in macaque V4 and IT cortex. Our method extends and complements this approach to investigate the tuning properties of inferior temporal cortex (IT) neurons in macaque monkeys.

A transformative revelation in neuroscience was the realization that visual neurons respond preferentially to some stimuli over others. Those findings opened the door to investigating neural coding for myriad stimulus attributes. A central challenge in elucidating neuronal tuning in visual cortex is the impossibility of testing all stimuli. Even for a small patch of 100 × 100 pixels, there are ∼10^3010 possible binary images, ∼10^24082 grayscale images, or ∼10^72247 8-bit color images. Using natural images reduces the problem, but it is still impossible to present a neuron with all possible natural stimuli. Investigators circumvent this formidable empirical challenge by using ad hoc hand-picked stimuli, inspired by hypotheses that particular cortical areas encode specific visual features. This approach has led to important insights through the discovery of cortical neurons that respond to stimuli with specific motion directions, color, binocular disparity, curvature, and even complex natural shapes such as hands or faces.
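
The exponents follow directly from the pixel count; a quick back-of-the-envelope check (assuming 8 bits per grayscale pixel and 3 × 8 bits per color pixel):

```python
from math import log10

pixels = 100 * 100
print(f"binary:    ~10^{pixels * log10(2):.0f}")       # 2**10000   ~ 10^3010
print(f"grayscale: ~10^{pixels * log10(256):.0f}")     # 256**10000 ~ 10^24082
print(f"8-bit RGB: ~10^{3 * pixels * log10(256):.0f}") # ~ 10^72247
```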

We recorded from one single unit and three multiunit sites (six evolution experiments total) in monkey Vi, which had a chronic microelectrode array in V1. The stimuli were centered on each receptive field (∼0.79°, measured as the square root of receptive field area) but were kept at the same size as in the IT experiments (3° × 3°). In addition to the synthetic images, we interleaved reference images of gratings (3° × 3°) at different orientations (0°, 45°, 90°, and 135°) and spatial frequencies (∼0.5, 1, and 2 cycles per degree) at 100% contrast. In all experiments, neurons showed an increase in firing rate to the synthetic images (median change 84.0 spikes per s per generation; 77.4–91.2, 25th–75th percentile) (Table S1). Thus, on average, V1 sites, like those in IT cortex, responded well to late-generation synthetic images (Table S4). To measure the distribution of orientations in the region of each synthetic image that fell within the V1 receptive field (∼0.8° × 0.8°), we performed a discrete Fourier transform analysis on the central 0.8° × 0.8° of the synthetic images and correlated the resulting spectrogram with the spectrograms expected from 16 gratings with orientations ranging from 0° to 135°. Across experiments, the mean correlation between the orientation content profile of the patch and the orientation tuning measured from the gratings was 0.59 ± 0.09 (mean ± SEM), compared with 0.01 ± 0.26 for a shuffled distribution (p ≤ 0.006 in 5 out of 6 experiments, permutation test, N = 999). Thus, V1 neurons guided the evolution of images dominated by their independently measured preferred orientation.
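
A hedged sketch of this orientation-content analysis (the binning scheme and normalization are assumptions; the published analysis may differ in detail):

```python
import numpy as np

def orientation_energy(patch, n_bins=16):
    """Oriented energy of an image patch from its 2D discrete Fourier transform.

    patch : 2-D array (e.g., the central ~0.8 deg x 0.8 deg of a synthetic image)
    Returns energy summed into n_bins orientation bins spanning 0-180 deg.
    """
    patch = patch - patch.mean()
    amp = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    h, w = patch.shape
    fy, fx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    theta = np.mod(np.degrees(np.arctan2(fy, fx)), 180.0)
    radius = np.hypot(fx, fy)
    valid = radius > 0                                   # drop the DC component
    bins = np.minimum((theta[valid] * n_bins / 180.0).astype(int), n_bins - 1)
    return np.bincount(bins, weights=amp[valid] ** 2, minlength=n_bins)

# Note: a grating's spectral energy lies perpendicular to its bars, so the same
# convention must be applied to the grating templates before correlating the
# two 16-bin profiles (e.g., with np.corrcoef).
```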

Single- and multi-units in IT successfully guided the evolution of synthetic images that were stronger stimuli for the guiding neuron than large numbers of natural images. To see whether our technique could characterize more coarsely sampled neuronal activity than a single site, we asked whether we could evolve strong stimuli for all 32 sites on an array at once. Each chronically implanted array had up to 32 visually responsive sites, spaced 400 μm apart. We conducted a series of evolution experiments in three monkeys (Ri, Gu, and Y1) guided by the average population response across the array. In all three monkeys, responses to synthetic images increased over generations relative to reference images: the median population response changes to synthetic images were 9.4 spikes per s per generation for monkey Ri (2.8–19.6, 25th–75th percentile), 30.7 for monkey Gu (17.6–48.3), and 27.8 for monkey Y1 (18.8–39.6). In these population-guided evolutions, 61%, 93%, and 99.5% of individual sites showed increases in firing rate (statistical significance was defined by fitting an exponential function to 250 resampled firing-rate-per-generation curves per site; an increase was significant if the 95% CI of the bootstrapped amplitude distribution excluded zero). Therefore, larger populations of IT neurons could successfully guide the creation of images that were, on average, strong stimuli for the population. When the populations were correlated in their natural-image preferences, the synthetic images were consistent with those evolved by individual sites in the array: for example, in monkey Ri, the population-evolved images contained shape motifs commonly found in ImageNet pictures labeled “macaques,” “wire-haired fox terrier,” and “Walker hound.” This suggests that the technique can be used with sampling techniques coarser than single-unit recordings, such as local field potentials, electrocorticography electrodes, or even functional magnetic resonance imaging.
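
In terms of the genetic-algorithm sketch shown earlier, the only change for a population-guided evolution is the scoring step. Assuming a rates matrix of shape (n_sites, n_images), each image's score is simply the mean across responsive sites:

```python
import numpy as np

def population_scores(rates, responsive_mask):
    """Mean background-subtracted rate across responsive sites, per image.

    rates           : (n_sites, n_images) array of site responses
    responsive_mask : (n_sites,) boolean array marking visually responsive sites
    """
    return rates[responsive_mask].mean(axis=0)
```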

IT neurons retain selectivity despite changes in position, size, and rotation, although it has been reported that more selective neurons are less transformation invariant. The latter observation also admits the alternative interpretation that the more optimal a stimulus is for a neuron, the less invariant the neuron's response will be, which is consistent with what we found. To compare the invariance of IT neurons to synthetic versus natural images, we presented 3 natural and 3 evolved synthetic images at different positions, sizes, and fronto-parallel rotations in two animals (monkeys Ri and Gu). The natural images were the nearest, middle, and farthest matches from ImageNet; the synthetic images were chosen from the final generation. Every image was presented at three positions relative to the fovea ((−2.0°, −2.0°), (−2.0°, 2.0°), and (0.0°, 0.0°)), three sizes (widths of 1°, 2°, and 4°), and four rotations (0°, 22°, 45°, and 80°, counterclockwise from horizontal) (Figure S6A). Invariance was defined as the similarity (correlation coefficient) of the neuron's rank order of image preferences across transformation conditions: the more similar the rank order, the higher the invariance. The rank order was better maintained across transformations for the natural images than for the synthetic images (Figures S6B and S6C). Thus, the degree of invariance of these neurons depended on the stimulus set, and the neurons were least invariant for the more optimal synthetic images. This result suggests that the degree of invariance measured for a particular neuron might not be a fixed property of that neuron but might instead depend on the effectiveness of the stimulus used to test it.
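
One reasonable reading of this invariance measure (an assumption on our part: rank-order similarity computed as Spearman's ρ, averaged over all pairs of transformation conditions) is:

```python
import numpy as np
from scipy.stats import spearmanr

def invariance(responses):
    """Rank-order stability of image preferences across transformations.

    responses : (n_conditions, n_images) array; one row per transformation
                condition, one column per image (here, 3 natural + 3 synthetic).
    Returns the mean pairwise Spearman correlation between condition rows:
    the more stable the preference ranking, the higher the invariance.
    """
    n = responses.shape[0]
    rho = [spearmanr(responses[i], responses[j]).correlation
           for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(rho))
```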

(B) Responses (background subtracted) to six images (3 reference natural images, 3 synthetic images) as a function of (top) size, (center) rotation, and (bottom) position. Each plot shows the mean response (±SEM) as a function of one transformation. Line thickness indicates transformation value.

(A) Transformations applied to natural and synthetic images evolved by PIT units (the natural images were the nearest, middle, and farthest fc6 matches to the evolved image, as defined in Figure 7). Images varied in size, rotation, and position.

To find out whether similarity in fc6 space between a neuron's evolved synthetic image and a novel natural image predicted that neuron's response to the novel image, we first performed 3 to 4 independent evolution experiments using the same (single- or multi-) unit in each of three animals. After each evolution, we took the top synthetic image from the final generation and identified the 10 nearest images in fc6 space, 10 images from the middle of the distance distribution, and the 10 farthest (most anticorrelated) images (9 of each are shown in Figure 7A). Then, during the same recording session, we presented these images to the same IT neurons and measured the responses to each group (near, middle, and far), as well as to all 40 evolved images of the last generation. Figure 7B shows that synthetic images gave the highest responses, the nearest natural images the next highest, and the middle and farthest images the lowest. To quantify this observation, we fit linear regression functions between the ordinal distance from the synthetic image (near, middle, far) and the unit's mean responses and found median negative slopes ranging from −5.7 to −21.1 spikes per s across monkeys (Table S3). Thus, distance from the evolved synthetic image in fc6 space predicted responses to novel natural images. This does not mean that fc6 space is the best model of IT response properties; rather, it shows that a neuron's evolved images can be used to predict its responses to other images. Importantly, responses to the synthetic images were the highest of all.
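
A sketch of that regression, assuming the near/middle/far groups are coded ordinally as 0, 1, 2:

```python
import numpy as np

def ordinal_distance_slope(resp_near, resp_middle, resp_far):
    """Slope of mean response vs. ordinal fc6 distance (0 = near, 1 = middle, 2 = far).

    Each argument is an array of the unit's mean responses to the images in
    that group; a negative slope means responses fall off with distance from
    the evolved image.
    """
    groups = [np.asarray(resp_near), np.asarray(resp_middle), np.asarray(resp_far)]
    x = np.concatenate([np.full(len(g), i, dtype=float) for i, g in enumerate(groups)])
    y = np.concatenate(groups)
    slope, _ = np.polyfit(x, y, 1)
    return slope
```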

(B) Responses from each unit to the last-generation evolved images compared with the nearest, intermediate, and farthest images from the evolved images in the fc6 space of AlexNet (mean ± SEM).

(A) Final-generation synthetic images from units Ri-10 and Ri-12 and the closest, intermediate, and farthest 9 images from the image set for each. For unit Ri-10 we used the 60,000-image database, and for unit Ri-12 the 100,300-image database.

We applied this image-search approach to all evolution experiments by identifying the top 100 matches to every synthetic image in fc6 space (the Pearson correlation coefficients of these images ranged from 0.30 to 0.61, median 0.36) and visualized the WordNet labels of the matching images as word clouds. In monkey Ri, whose array showed natural-image preferences for faces, the categories that best matched the synthetic images were “macaques,” “toy terrier,” and “Windsor tie” (the latter containing images of faces and bodies) (Figure S5F); in contrast, in monkey Y1, where most of the neurons in the array had shown natural-image preferences for places, the best-matching categories were “espresso maker,” “rock beauty” (a type of fish), and “whiskey jug”; by inspection, these images all contained extended contours (Figure S5G). We confirmed this matching trend by quantifying the WordNet hierarchy labels associated with every matched natural image (Table S2).

First, we focused on the evolution experiment for PIT single unit Ri-17. This cell evolved a discrete shape near the top left of the image frame, comprising a darkly outlined pink convex shape with two dark circles and a dark vertical line between them (Figure S5A). When tested with the 2,550 natural images, this neuron responded best to images of monkeys, dogs, and humans (Figure S5B). We propagated the evolved image through AlexNet along with the 100,300 ImageNet examples, ranked all the fc6 vectors by their Pearson correlation with the evolved-image vector, and identified the closest, middle, and farthest 100 matches. The synthetic image showed an average vector correlation of 0.38 with the closest images, 0.06 with the middle, and −0.14 with the farthest images. The 9 nearest ImageNet images were cats, dogs, and monkeys (Figure S5C). To visualize the common shape motifs of this image cluster, we identified the individual fc6 units most strongly activated by the synthetic image and used activation maximization (deepDreamImage.m) to generate examples of preferred shapes for those fc6 units. All the units preferred round tan/pink regions with small dark spots (Figure S5D). To rule out the possibility that these matches were merely due to an overrepresentation of animals in ImageNet, we also inspected the least correlated matches, which were indeed not animals but pictures of places, rectilinear textures, or objects with long, straight contours (Figure S5E).

(F) Word cloud and histogram showing counts of ImageNet labels of the top 150 closest ImageNet pictures to the evolved stimulus, pooled across all experiments for all 14 visually responsive sites in the array in monkey Ri.

If these evolved images are telling us something important about the tuning properties of IT neurons, then we should be able to use them to predict neurons' responses to novel images. The deep generator network had been trained to synthesize images from their encoding in layer fc6 of AlexNet (4,096 units), so we used fc6 space to find natural images similar to the evolved images. In particular, we asked whether a neuron's response to a novel image was predicted by the distance in fc6 space between the novel image and the neuron's evolved synthetic image. To do this, we calculated the activation vectors of the evolved synthetic images in AlexNet fc6 and searched for images with similar fc6 activation vectors. We used two databases: the first comprised ∼60,000 images collected in our laboratory over several years; the second comprised 100,300 images drawn from the ILSVRC2012 dataset (100 randomly sampled images from each of its 1,000 categories), plus two additional ImageNet categories, faces [ID n09618957] and macaques [ID n02487547], and 100 images of local animal-care personnel with and without personal protective garb.
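
A hedged sketch of this search, using torchvision's AlexNet as a stand-in for the CaffeNet used in the study (the preprocessing constants are the standard torchvision ImageNet ones, and the exact layer convention, e.g., pre- versus post-ReLU fc6, is an assumption):

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Build an fc6 feature extractor: everything up to the first fully connected
# layer (classifier[1] is the 9216 -> 4096 linear layer; Dropout is inert in eval mode).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
fc6 = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                          torch.nn.Flatten(), *alexnet.classifier[:2])

prep = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                  T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def embed(img: Image.Image) -> np.ndarray:
    """4,096-dimensional fc6 embedding of one image."""
    return fc6(prep(img).unsqueeze(0)).squeeze(0).numpy()

def rank_by_similarity(evolved_vec, db_vecs):
    """Rank database images by Pearson correlation with the evolved image's embedding.

    evolved_vec : (4096,) embedding of the evolved synthetic image
    db_vecs     : (n_images, 4096) embeddings of the database images
    """
    e = evolved_vec - evolved_vec.mean()
    d = db_vecs - db_vecs.mean(axis=1, keepdims=True)
    r = (d @ e) / (np.linalg.norm(d, axis=1) * np.linalg.norm(e) + 1e-12)
    return np.argsort(r)[::-1], r   # indices from nearest to farthest, plus r values
```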

IT neurons guided the evolution of images that varied from experiment to experiment but retained consistent features for any given recording site (see Figure S4 for measures of similarity within and between sites), features that bore some similarity to each neuron's preferences among natural images. Figure 6 shows the final-generation evolved images from two independent evolution experiments for IT sites in five monkeys, along with each site's top 10 natural images. In each case, a reproducible figure emerged in the part of the synthetic image corresponding to the contralateral visual field. In three monkeys (Ri, Gu, and Ge), response profiles to natural images indicated that the arrays were located in face-preferring regions; in monkey Y1, the array was in a place-preferring region; and in monkey Jo, the array was in an object-selective region. Face-selective unit Ri-23 evolved something face-like in the left (contralateral) half of the image; this is most apparent if one covers the right (ipsilateral) half of the synthetic image. The images evolved by unit Ge-7 bore some resemblance to the unit's top natural image, a familiar person wearing protective clothing. Unit Ge-15 consistently evolved a black dot just to the left of fixation on a tan background; this unit might be similar to previously described posterior-face-patch neurons that responded optimally to a single eye in the contralateral field. Monkey-face-selective unit Ge-17 evolved a tan area with two large black dots aligned horizontally and a light area below. Unit Jo-6 responded to various body parts and evolved something not inconsistent with a mammalian body; interestingly, a whole “body” in one evolution and a larger partial “body” in the other. Unit Jo-5 evolved a small black square, and unit Jo-4 something black with orange below. Unit Jo-21 consistently evolved a small dark shape in the contralateral half of the image. Scene-selective unit Y1-14 evolved rectilinear shapes in the left (contralateral) field. Additional independent evolutions for these and other units are shown in Figure S3.

(E) Accuracy of all 20 measures of similarity (2 distance measures × 10 spaces), separately for the four collections of experiments. Best accuracy for CaffeNet and ResNet-101 data is attained with the Euclidean-fc8 measure; with this measure, accuracies for ResNet-101 data and top-90% electrophysiology data are statistically indistinguishable (p = 0.59). Best accuracy for electrophysiology is attained with the angle-conv4 measure; with this measure, accuracies for electrophysiology data and for CaffeNet and ResNet-101 data, respectively, are statistically indistinguishable (p = 0.42 and 0.21 with top-90% electrophysiology data; p = 0.88 and 0.75 with all data). n within = 28; n between = 802 for all 45 electrophysiology experiments. Shaded region indicates SEM.

(D) Same as (C), but for electrophysiology data with the angle-between-vectors measure in code, pixel, and conv4 spaces. We used the same 45 IT experiments as in the main text, those showing changes in neuronal response to the evolved images significantly different from zero. Here, we excluded 5 experiments with changes in the bottom 10th percentile (the analysis with all experiments is shown in the next panel). “Within-neuron” is defined as experiments repeated for the same site in the same animal, and “between-neuron” as experiments in different animals (because sites in the same array can have correlated selectivity). n within = 19; n between = 618. Top right of each subplot: accuracy ± SEM. ∗p < 0.05; ∗∗p < 0.001; both corrected for multiple (20) comparisons.

(C) Distribution of similarity between images evolved for the same unit versus images evolved for different units, quantified by Euclidean distance in three spaces (image code, pixel, and CaffeNet layer fc8). To quantify the separation between within- and between-neuron similarities, we estimated the accuracy of simple linear classifiers (thresholds) using leave-one-out cross-validation, separately for each collection of experiments and each measure of similarity. n within = 135; n between = 300. Chance is 50% when within-neuron and between-neuron data are weighted equally; accuracy can be <50% because it is a validation accuracy. Filled bars: histogram. Solid lines: kernel density estimates of the distributions, for visual aid only. Red shading around the threshold: standard deviation of the threshold across leave-one-out splits.
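
A minimal sketch of such a threshold classifier with leave-one-out cross-validation and equal class weighting (the exact threshold-fitting procedure used in the paper may differ):

```python
import numpy as np

def loo_threshold_accuracy(within, between):
    """Leave-one-out accuracy of a 1-D threshold separating within- from
    between-neuron similarity values, with the two classes weighted equally
    (so chance is 50% regardless of class sizes)."""
    x = np.concatenate([within, between])
    y = np.concatenate([np.ones(len(within)), np.zeros(len(between))])

    def balanced_acc(pred, truth):
        return 0.5 * (pred[truth == 1].mean() + (1 - pred[truth == 0]).mean())

    hits = np.empty(len(x))
    for i in range(len(x)):                      # hold out one sample at a time
        keep = np.arange(len(x)) != i
        xt, yt = x[keep], y[keep]
        # Fit: pick the threshold and sign with the best balanced training accuracy
        _, t, s = max((balanced_acc((s * xt > s * t).astype(float), yt), t, s)
                      for t in np.unique(xt) for s in (1, -1))
        hits[i] = float((s * x[i] > s * t) == bool(y[i]))
    return 0.5 * (hits[y == 1].mean() + hits[y == 0].mean())
```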

(A and B) XDREAM-synthesized stimuli for (A) 3 CaffeNet fc8 units: “goldfish,” “ambulance,” and “loudspeaker”; and (B) the corresponding 3 ResNet-101 fc1000 units, starting from different initial populations (random draws of 40 initial codes from a bank of 1,000). Each row corresponds to a unit and each column to a random initialization. The activation for each image is noted at its top right.

Each large image shows the last-generation synthetic image from one evolution experiment for a single chronic recording site. To the right of the synthetic images are shown the top 10 images for that site from a natural image set. Red crosses indicate fixation. The arrays were in the right hemisphere of both animals. As indicated by the site numbers, some of the evolutions shown here are from the same recording sites as in Figures 3 and 6, but from independent experiments. For the central-fixation experiments, the reader is encouraged to cover the right (ipsilateral) half of the image. For unit Ge-7 (first row), unit Ge-15, and unit Ge-17, the natural images were from the set of 2,550 natural images shown interleaved with the synthetic images during the evolution experiment; for the other experiments, the natural images were from a 108-image set.

Each pair of large images shows the last-generation synthetic images from two independent evolution experiments for a single chronic recording site in five different animals. To the right of the synthetic images are the top 10 images for each neuron from a natural image set. Red crosses indicate fixation. The arrays were in the left hemisphere of monkey Jo and in the right hemisphere of all other animals. The natural images shown interleaved during each evolution were from either a 108-image reference set containing faces, bodies, places, and line segments (used for units Gu-21 and Y1-14) or the set of 2,550 natural images rank ordered for unit Ri-10 in Figure S2 (used for all other units in this figure).

We conducted 46 independent evolution experiments on single- and multi-unit sites in IT cortex in six monkeys. During almost all the evolutions, the synthetic images gradually became increasingly effective stimuli. To quantify the change in stimulus effectiveness over each experiment, we fit an exponential function to the mean firing rate per generation, separately for synthetic and reference images (as in Figure 3A). The change in firing rate to synthetic images over the course of each experiment averaged between 25 and 84 spikes per s across animals (Figure 5A); this change was significantly different from zero in 45 of 46 individual experiments (95% CI of the amplitude estimate excluding zero, bootstrap test). In contrast, responses to reference images were stable or decreased slightly across generations (the average change ranged from −11 to 9 spikes per s across animals and was significant in 15 of 46 individual experiments) (Figure 5A and Table S1). Thus, IT neurons could consistently guide the evolution of highly effective images, despite minor adaptation. Moreover, these evolved images were often more powerful stimuli than the best natural images tested, even though the synthetic images were far from naturalistic. Comparing each cell's maximum responses to natural versus evolved images, 25 of 46 experiments showed significant differences (p < 0.03, permutation test after false-discovery correction), and, in all but one case, the synthetic images evoked the greater response (Figure 5B; see Table S4 for further quantification of natural- and evolved-image responses). Figure 5C shows a histogram of response magnitudes for PIT cell Ri-10 to the top synthetic image in each of the 210 generations and to each of the 2,550 natural images (data for both synthetic and natural images collected 2 days later). Early generations are indicated by lighter gray and later ones by darker gray, making it apparent that later-generation synthetic images gave larger responses. Figure 5D illustrates one of the few experiments in which natural images evoked stronger responses than synthetic images (monkey Ge, site 7), comparing the site's responses to synthetic images against the 2,550 natural images. This site responded slightly better (by an average of four spikes per s, or 3.7% of its maximum rate) to images of an animal-care person who visits the animals daily, wearing our institution-specific protective mask and gown (see Figure S3 for additional independent evolutions from this site). Even in this case, one clear benefit of XDREAM is that, by arriving at effective stimuli independently of any hand-picked image set, it reveals the specific features of the natural image that drove the neuron's selectivity, stripped of incidental information.
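
A sketch of this quantification, with two stated assumptions: the exact exponential form is not specified here (we assume a saturating exponential), and the paper bootstrapped resampled firing-rate curves, whereas this sketch simply resamples generations:

```python
import numpy as np
from scipy.optimize import curve_fit

def expo(g, amplitude, tau, offset):
    """Saturating exponential summarizing rate change over generations."""
    return amplitude * (1.0 - np.exp(-g / tau)) + offset

def rate_change(gen_means, n_boot=500, seed=0):
    """Fit the mean rate per generation; bootstrap a 95% CI on the amplitude.

    gen_means : (n_generations,) mean background-subtracted rate per generation.
    The change counts as significant if the CI excludes zero.
    """
    gen_means = np.asarray(gen_means, dtype=float)
    g = np.arange(len(gen_means), dtype=float)
    p0 = (gen_means[-1] - gen_means[0], len(gen_means) / 3.0, gen_means[0])
    amplitude = curve_fit(expo, g, gen_means, p0=p0, maxfev=10000)[0][0]

    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.choice(len(g), size=len(g), replace=True)
        try:
            boots.append(curve_fit(expo, g[idx], gen_means[idx],
                                   p0=p0, maxfev=10000)[0][0])
        except RuntimeError:                     # occasional non-convergence
            continue
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return amplitude, (lo, hi)
```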

(D) Same color conventions as in (C), for unit Ge-7. The evolution for this neuron can be seen in the second half of Video S1.

(C) Histogram of response magnitudes to natural (green) and synthetic (gray-to-black) images for unit Ri-10 (same unit as in Figures 3 and 4). Below the histogram are shown the best and worst natural and synthetic images.

(B) Scatterplot of maximum responses to synthetic versus reference images (measured across all generations; max ± SE per bootstrap). Colors indicate animal. Circle size indicates statistical significance (large circle: p < 0.03 after false-discovery correction). The black square indicates the experiment in Figure 3.

(A) Change in response to synthetic versus reference images over generations. Each point shows the mean change in firing rate to reference versus synthetic images in one experiment (change estimated by the amplitude coefficient of an exponential function fitted to the neuron's mean response per generation; error bars represent ±SEM, per bootstrap, 500 iterations of data resampling). Solid circles indicate single units; open circles indicate multi-units.

We conducted independent evolution experiments with the same single unit on different days, and all final-generation synthetic images featured a brown object against a uniform background, topped by a smaller round pink and/or brown region containing several small dark spots; the object was centered toward the left half of the image, consistent with the recording site being in the right hemisphere (Figure 4B). The synthetic images generated on different days were similar by eye but not identical, potentially reflecting invariance of the neuron, response variability, and/or the stochastic paths the algorithm explored in the neuron's response landscape. Regardless, given that this unit was located in PIT, just anterior to the tip of the inferior occipital sulcus and thus relatively early in the visual hierarchy, it was remarkable that it repeatedly directed the evolution of images containing such complex motifs and evoking such high firing rates. Two days after the evolution experiment in Figure 3, this unit was screened with 2,550 natural images, including animals, bodies, food, faces, and line drawings, plus the top synthetic images from each generation. Among the natural images, this neuron responded best to monkey torsos and monkey faces; of the 10 natural images giving the largest responses, 5 were of the head and torso of a monkey (Figure 4C). The worst natural images were inanimate or rectilinear objects (Figures 4D and 4E).

We first show an example of an evolution experiment for one PIT single unit (Ri-10) in chronic-array monkey Ri. The synthetic images changed with each generation as the genetic algorithm optimized them according to the neuron's responses (Figure 3; first half of Video S1). At the beginning of the experiment, this unit responded more strongly to the reference images (Figure S2) than to the synthetic images, but over generations, the synthetic images evolved to become more effective stimuli (Figure 4A). To quantify the change in responses over time, we fit an exponential function to the cell's mean firing rate per generation, separately for the synthetic and the reference images (solid thick lines in Figure 4A). This neuron showed an increase of 51.5 ± 5.0 spikes per s per generation (95% confidence interval [CI]) in response to the synthetic images and a decrease of 15.5 ± 3.5 spikes per s per generation to the reference images; thus, the synthetic images became gradually more effective, despite the neuron's slight reduction in firing rate to the reference images, presumably due to adaptation.

(C–E) Selectivity of this neuron to 2,550 natural images. (C) The top 10 images from this image set for this neuron. (D) The worst 10 images from this image set for this neuron. The entire rank-ordered natural image set is shown in Figure S2. (E) Selectivity of this neuron to different image categories (mean ± SEM). The full set comprised the 2,550 natural images plus selected synthetic images: “early synthetic” is the best image from each of the first 10 generations, and “late synthetic” from the last 10. Each image response is the average over 10–12 repeated presentations. See Figure S3 for additional independent evolutions from this site.

(B) Last-generation images evolved during three independent evolution experiments; the leftmost image corresponds to the evolution in (A); the other two evolutions were carried out on the same single unit on different days. Red crosses indicate fixation. The left half of each image corresponds to the contralateral visual field for this recording site. Each image shown here is the average of the top 5 images from the final generation.

(A) Mean response to synthetic (black) and reference (green) images for every generation (spikes per s ± SEM). Solid lines show exponential fits to the responses over the experiment.

Each image is the average of the top 5 synthetic images from each generation (ordered from left to right and top to bottom). The response of this neuron in each of these generations is shown in Figure 4A.

We first validated XDREAM on units in an artificial neural network, treated as models of biological neurons. Our method generated super stimuli for units across layers in CaffeNet, a variant of AlexNet (Figure 2). The evolved images were frequently better stimuli than all of >1.4 million images tested, including the training set for CaffeNet, for all 4 layers that we examined (Figure 2C). For units in the first and last layers, the evolved stimuli matched the ground-truth best stimuli (first layer) and the category labels (last layer). XDREAM was also able to recover the preferred stimuli of units constructed to have a single preferred image (Figure S1). Importantly, well-known methods for feature visualization in artificial neural networks, such as DeepDream, rely on knowledge of the network weights, whereas our approach does not, making it uniquely applicable to neuronal recordings.
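
To illustrate why no weights are needed: the optimizer only ever consumes scalar scores, so an artificial unit can stand in for a neuron by exposing nothing but its activation. A minimal sketch, using torchvision's AlexNet as a stand-in for CaffeNet (the unit index and preprocessing assumptions are the standard ImageNet ones):

```python
import torch
import torchvision.models as models

# The network is used purely as a black box: forward passes only, no gradients.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def unit_response(images, unit=1):
    """images: (N, 3, 224, 224) preprocessed batch -> (N,) activations of one
    output unit (index 1 is ImageNet's "goldfish" class), used as the score
    that drives the same evolution loop as with biological neurons."""
    return net(images)[:, unit]
```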

(B and C) Most evolved images activated artificial units more strongly than all of >1.4 million images in the ILSVRC2012 dataset. (B) Top: the distribution of activations to ImageNet images and to evolved images for one unit in the classification layer fc8, corresponding to the “honeycomb” label. The grayscale gradient indicates the generation of the evolved images. Bottom: the best 3 ImageNet images and one evolved image, labeled with their respective activations. In this case, the evolved image activated the unit ∼1.4× as strongly as the best ImageNet image. (C) Violin plots showing the distribution of (evolved/best-in-ImageNet) activation ratios across 4 layers in CaffeNet, 100 random units per layer. White circles indicate medians; thick bars indicate the 25th and 75th percentiles.

(A) Evolved images resembled the ground-truth best images in the first layer of CaffeNet. In the ground-truth images, transparency indicates the relative contribution of each pixel to the unit's activation. Only the center 11 × 11 pixels of the evolved images are shown, matching the filter size of the units.

Discussion

A powerful approach for modeling neuronal responses has been to use stimuli sampled in a fully defined parametric space. For example, one innovative study (Yamane et al., 2008), which inspired our approach, defined response properties in IT cortex according to 3D curvature and orientation. In a more recent study, Chang and Tsao (2017) used face images parameterized along appearance and shape axes to describe and predict neuronal responses in face patches. Parametric stimulus spaces lead to quantitative neuronal models that are easier to describe than models built on the learned, vast, and latent space of a deep network. But these approaches are complementary: standard parametric models operate in a circumscribed shape space, so they might not capture the entire response variability of a neuron. Generative neural networks are powerful enough to approximate a wide range of shape configurations, even shapes diagnostic of monkey faces or bodies. If the response evoked by the best stimulus in a parametric space is lower than that evoked by a generative-network stimulus, the parametric model is overly restrictive. This matters because a neuron can be tuned to multiple aspects of a stimulus: a face-preferring neuron might show tuning for high curvature, yet curvature tuning alone is insufficient to explain its selectivity. A generative-network-based approach can therefore serve as an independent, less constrained test of parametric models. Indeed, in some instances, the evolved stimuli contained features of animal faces, bodies, and even animal-care staff known to the monkeys, consistent with theories of tuning to object categories and faces, whereas in other instances, the evolved stimuli were not identifiable objects or object parts, suggesting aspects of neuronal response properties that current theories of visual cortex have missed. Thus, our approach can also serve as an initial step in discovering novel shape configurations for subsequent parametric quantification. In sum, our approach is well positioned to corroborate, complement, contrast with, and extend the valuable lessons learned from these previous studies.