Theorists have suggested that emotions are canonical responses to situations ancestrally linked to survival. If so, then emotions may be afforded by features of the sensory environment. However, few computational models describe how combinations of stimulus features evoke different emotions. Here, we develop a convolutional neural network that accurately decodes images into 11 distinct emotion categories. We validate the model using more than 25,000 images and movies and show that image content is sufficient to predict the category and valence of human emotion ratings. In two functional magnetic resonance imaging studies, we demonstrate that patterns of human visual cortex activity encode emotion category–related model output and can decode multiple categories of emotional experience. These results suggest that rich, category-specific visual features can be reliably mapped to distinct emotions, and they are coded in distributed representations within the human visual system.

We validated this model, which we call EmoNet, in three different contexts, by predicting (i) normative emotion categories of video clips not used for training; (ii) normative emotional intensity ratings for the International Affective Picture System (IAPS), an established set of emotional images ( 26 ); and (iii) the genre of cinematic movie trailers, which are designed to manipulate emotion by presenting different visual cues ( 27 ). The model is named EmoNet, as it is based on a seminal deep neural net model of object recognition called AlexNet ( 28 ) and has been adapted to identify emotional situations rather than objects. EmoNet’s goal is to provide a plausible account of how visual information is linked to distinct types of emotional responses (here characterized as human judgments of emotion categories). We refer to these linkages between patterns across sensory inputs and emotions as visual emotion schemas. This view is based on ideas from parallel distributed processing theory, wherein schemas are solutions to constraint satisfaction problems that neural networks learn to solve ( 29 ). A hallmark of schemas is that they require relational processing and integration across multiple elements (e.g., the color, forms, objects, and agents in a scene), which is likely crucial for the visual representation of emotion categories. To test whether EmoNet can uniquely identify multiple emotion categories, we developed and applied a statistical framework for estimating the number of discriminable emotion categories. To test prediction 2, we used machine learning approaches to find patterns of brain activity in the occipital lobe [measured via functional magnetic resonance imaging (fMRI), n = 18] linked to emotion category–related output from EmoNet. To test prediction 3, in a separate fMRI study (n = 32), we verified that patterns of occipital lobe activity can decode the category of emotional responses elicited by videos and music (across five categories). 
Our results are consistent with previous research showing that different patterns of visual cortical activity are associated with different emotion categories ( 15 , 16 ), but they go beyond it to (i) rigorously test whether sensory representations are sufficient for accurate decoding and (ii) provide a computational account of how sensory inputs are transformed into distributed emotion-related codes.

To test predictions 1 and 2, we developed a convolutional neural network (CNN), whose output is a probabilistic representation of the emotion category of a picture or video, and used it to classify images into 20 different emotion categories using a large stimulus set of 2232 emotional video clips ( 25 ). Using a computational approach to characterize how raw visual inputs are transformed into multiple emotion categories, we have an explicit model of visual emotion schemas that can be systematically validated and mapped onto known features of the visual system. Because this model strips away higher-order processes such as effortful cognitive appraisal, it has unique application potential. For example, clinically, this model could help assess the effectiveness of emotion interventions at a more fundamental computational level, which would relieve our reliance on more complex and subjective outcomes such as self-reported experience.

The hypothesis that emotion schemas are embedded in sensory systems makes several predictions that have not, to our knowledge, been tested. First, models constructed from image features alone should be able to (i) predict normative ratings of emotion category made by humans and (ii) differentiate multiple emotion categories. Second, representations in these models should map onto distinct patterns of brain activity in sensory (i.e., visual) cortices. Third, sensory areas, and particularly visual cortex, should be sufficient to decode multiple emotion categories. Further, because emotion schemas can be viewed as solutions to a constraint satisfaction problem (mapping certain sensory inputs to different kinds of emotions), these sensory representations are likely distributed in nature ( 24 ). Here, we test each of these hypotheses.

The latter view is broadly compatible with appraisal theories ( 20 ) and more recent theories of emotions as constructed from multiple perceptual, mnemonic, and conceptual ingredients ( 5 ). In appraisal theories, emotion schemas ( 21 ) are canonical patterns of organism-environment interactions that afford particular emotions. High-level visual representations are thought to be an integral part of schemas because they can act as situation cues that afford particular emotional responses. For example, scenes of carnage evoke rapid responses related to disgust or horror, and later (integrating conceptual beliefs about the actors and other elements) compassion, anger, or other emotions. Scenes with attractive, scantily clad people evoke schemas related to sex; scenes with delicious food evoke schemas related to consumption; and so on. In these cases, the sensory elements of the scene do not fully determine the emotional response—other aspects are involved, including one’s personal life experiences, goals, and interoceptive states ( 21 )—but the sensory elements are sufficient to convey the schema or situation that the organism must respond to. Initial appraisals of emotion schemas (often called “system 1” appraisals) can be made rapidly ( 22 ), and in some cases, unconsciously, and unconscious emotion may drive preferences and shape learning ( 23 ). Emotion schemas are also content rich in the sense that they sharply constrain the repertoire of emotional responses afforded by a given schema. For example, horror scenes might afford fear, anger, or compassion, but other kinds of emotional responses (sadness, nurturing, and playfulness) would be ancestrally inappropriate. 
Thus, while some affective primitives (representations related to survival and well-being) are related to biologically older subcortical brain systems ( 6 , 7 ) and involve relatively little cognitive processing, canonical, category-specific emotion schemas exist and may be embedded in part in human sensory cortical systems.

There are at least two ways of interpreting this evidence. On the one hand, emotion-related activity in sensory areas could reflect a general enhancement of visual processing for relevant, novel, or attended percepts ( 17 ). Stronger sensory responses to emotionally relevant percepts can also be evolutionarily conserved [relevant in ancestral environments ( 18 )] or learned during development ( 12 ). In this case, affective stimuli evoke stronger sensory responses, but the information about emotion content (fear versus anger, sadness versus joy) is thought to be represented elsewhere. On the other hand, perceptual representations in sensory (e.g., visual) cortex could reflect the content of emotional situations in a rich way ( 19 ); specific configurations of perceptual features, ranging from low-level features (e.g., color or spatial frequency) to high-level features (e.g., the presence of faces or objects), could afford specific types, or categories, of emotional responses, including fear, anger, desire, and joy. In this case, neural codes in sensory cortices might represent information directly relevant for the nature of emotional feelings.

Emotions are thought to be canonical responses to situations ancestrally linked to survival ( 1 ) or the well-being of an organism ( 2 ). Sensory processing plays a prominent role in nearly every theoretical explanation of emotion [e.g., ( 3 , 4 , 5 )], yet neuroscientific views have historically suggested that emotion is driven by specialized brain regions, e.g., in the limbic system ( 6 ) and related subcortical circuits ( 7 ), or in some theories, in neural circuits specialized for emotion categories such as fear ( 8 ) and sadness ( 9 ). According to these long-standing views, activity in sensory cortex (e.g., visual areas V1 to V4) is thought to be antecedent to emotion but not central to emotional appraisals, feelings, or responses. However, recent theoretical developments ( 10 ) and empirical observations suggest that sensory and emotional representations may be much more intertwined than previously thought. Activity in visual cortex is enhanced by emotional stimuli ( 11 ), and single neurons learn to represent the affective significance of stimuli. For example, neurons in V1 ( 12 ), V4 ( 13 ), and inferotemporal cortex ( 14 ) selectively respond to rewarding stimuli. In addition, multivariate patterns of human brain activity that predict emotion-related outcomes often use information encoded in visual cortex ( 15 , 16 ).

( A ) Dendrogram illustrates hierarchical clustering of emotion categories that maximizes discriminability. The x axis indicates the inner squared distance between emotion categories. The dashed line shows the optimal clustering solution; cluster membership is indicated by color. ( B ) Confusion matrix for the five-cluster solution depicts the proportion of trials that are classified as belonging to each cluster (shown by the column) as a function of ground truth membership in a cluster (indicated by the row). The overall five-way accuracy is 40.54%, where chance is 20%. ( C ) Model weights indicate where increasing brain activity is associated with the prediction of each emotion category. Maps are thresholded at a voxel-wise threshold of P < 0.05 for display.

This analysis revealed that of the seven states being classified (six emotions and neutral videos), at least five distinct emotion clusters (95% CI, 5 to 7) could be reliably discriminated from one another based on occipital lobe activity (five-way classification accuracy, 40.54%; chance, 20%; see Fig. 5 ), supporting prediction 3. Full seven-way classification was 29.95% (chance, 14.3%; P = 0.002). Contentment, amusement, and neutral videos were reliably differentiated from all other emotions. States of fear and surprise were not discriminable from one another (they were confused 21.09% of the time), yet they were reliably differentiated from all other emotions. Sadness and anger were also confusable (15.5%) but were discriminable from all other emotional states. Thus, although some emotional states were similar to one another regarding occipital lobe activation, we found strong evidence for categorical coding of multiple emotions during movie inductions of specific emotions.
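
The step from seven-way to five-way classification can be sketched as merging the states that the text describes as confusable (fear with surprise, sadness with anger) and recomputing accuracy on the merged confusion matrix. The following is a minimal illustration only: the counts are invented, the cluster assignment is read off the paragraph above, and the function names `collapse` and `accuracy` are hypothetical rather than taken from the authors' analysis code.

```python
# Hypothetical seven-way confusion counts (rows = true state, columns = predicted).
# These numbers are made up for illustration and are NOT the study's data.
states = ["contentment", "amusement", "neutral", "fear", "surprise", "sadness", "anger"]
seven_way = [
    [6, 1, 1, 0, 1, 0, 1],
    [1, 6, 1, 1, 0, 1, 0],
    [1, 1, 6, 0, 1, 1, 0],
    [0, 1, 0, 4, 4, 1, 0],
    [1, 0, 0, 4, 4, 0, 1],
    [0, 1, 1, 0, 0, 5, 3],
    [1, 0, 0, 1, 0, 3, 5],
]

# Cluster assignment read from the text: fear/surprise and sadness/anger are merged.
cluster_of = {"contentment": 0, "amusement": 1, "neutral": 2,
              "fear": 3, "surprise": 3, "sadness": 4, "anger": 4}


def collapse(confusion, labels, cluster_of):
    """Sum confusion counts over all states that share a cluster."""
    k = len(set(cluster_of.values()))
    merged = [[0] * k for _ in range(k)]
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            merged[cluster_of[true_label]][cluster_of[pred_label]] += confusion[i][j]
    return merged


def accuracy(confusion):
    """Proportion of counts on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


five_way = collapse(seven_way, states, cluster_of)
acc7 = accuracy(seven_way)   # accuracy before merging
acc5 = accuracy(five_way)    # accuracy after merging confusable states
```

Merging confusable states can only move off-diagonal counts onto the diagonal, so accuracy rises while the chance level also rises (from 14.3% for seven classes to the 20% the text reports for five).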

To provide additional evidence that visual cortical representations are emotion category–specific, we tested whether visual cortical activity was sufficient to decode the category of emotional videos in an independent dataset [n = 32; see ( 15 )]. In this dataset, human individuals viewed cinematic film clips that elicited contentment, sadness, amusement, surprise, fear, and anger. We selected videos that elicited responses in one emotion category above all others for each video, complementing the previous study, whose stimuli elicited more blended emotional responses. We tested predictive accuracy in seven-way classification of emotion category based on subject-average patterns of occipital lobe activity for each condition, with eight-fold cross-validation across participants to test prediction performance in out-of-sample individuals. We then performed discriminable cluster identification ( Figs. 1 and 4 ; see Supplementary Text for details) to estimate how many distinct emotion categories out of this set are represented in visual cortex.
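
Cross-validation across participants means that all data from a held-out participant stay out of training. The sketch below shows one way to build such folds for 32 participants and eight folds; the interleaved assignment and the equal fold size of four are assumptions, since the text does not state how participants were allocated, and `participant_folds` is a hypothetical helper.

```python
def participant_folds(participant_ids, n_folds):
    """Yield (train_ids, test_ids) so that each participant is held out exactly once.

    Assignment is interleaved (an assumption); any partition into folds would do.
    """
    ids = list(participant_ids)
    for f in range(n_folds):
        test = ids[f::n_folds]
        train = [p for p in ids if p not in test]
        yield train, test


# 32 participants, eight folds: four participants held out per fold.
folds = list(participant_folds(range(1, 33), n_folds=8))
```

Because folds are defined over participants rather than over individual trials, accuracy estimates reflect generalization to out-of-sample individuals, which is the property the paragraph above emphasizes.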

In additional model comparisons, we tested whether occipital cortex was necessary and sufficient for accurate prediction of EmoNet’s emotion category representation. We compared models trained using brain activity from individual areas [i.e., V1 to V4 ( 43 ) and inferotemporal cortex ( 44 )], the entire occipital lobe ( 39 ), and the whole brain. We trained models to predict variation across images in each EmoNet emotion category unit and averaged performance across emotion categories. The whole–occipital lobe model [r = 0.2819 ± 0.0163 (SE)] and the whole-brain model [r = 0.2664 ± 0.0150 (SE)] predicted EmoNet emotion categories more strongly than models based on individual visual areas (r = 0.0703 to 0.1635; all P < 0.0001). The occipital lobe model showed marginally better performance than the whole-brain model (Δr = 0.0155; 95% CI, 0.0008 to 0.0303; P = 0.0404, paired t test), despite having nearly 100,000 fewer features available for prediction (for model comparisons across regions, see fig. S5). A post hoc, confirmatory test revealed that excluding occipital lobe activation from the whole-brain model significantly reduced performance (Δr = −0.0240; 95% CI, −0.0328 to −0.0152; P < 0.0001, paired t test), indicating that activity in the occipital lobe meaningfully contributed to predictions in the whole-brain model. Furthermore, using occipital lobe activation to decode EmoNet emotion categories (activation in layer fc8) was more accurate than decoding earlier layers (conv1 to conv5, fc6, and fc7; see fig. S6). These results provide strong support for distributed representation of visual emotion schemas within the occipital lobe and partially redundant coding of this information in other brain systems. 
Although multiple brain systems convey emotion-related information (potentially related to action tendencies, somatovisceral responses, modulation of cognitive processes, and subjective feelings), activity outside the visual system does not appear to uniquely reflect the representations learned by EmoNet and may be better aligned with nonvisual aspects of emotion. More generally, the distributed coding of emotion categories parallels other recent findings on population coding of related affective processes ( 15 , 16 ); for a review, see ( 45 ).
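
The model comparisons above rest on paired t tests of per-participant correlations. The sketch below computes the mean difference, the t statistic, and a confidence interval for paired data. The per-participant values are invented, and the critical t value is passed in by hand because the Python standard library has no t distribution; `paired_t` is a hypothetical helper, not the authors' code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b, t_crit):
    """Mean paired difference, t statistic, and CI half-width t_crit * SE."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)          # standard error of the mean difference
    ci = (d_bar - t_crit * se, d_bar + t_crit * se)
    return d_bar, d_bar / se, ci


# Hypothetical correlations for three participants (illustration only).
occipital = [0.30, 0.28, 0.32]
whole_brain = [0.29, 0.26, 0.29]
# 4.303 is the two-sided 95% critical value of t with 2 degrees of freedom.
delta, t_stat, ci = paired_t(occipital, whole_brain, t_crit=4.303)

# The reported difference follows directly from the two group means in the text.
reported_delta = round(0.2819 - 0.2664, 4)
```

With 18 participants the test has 17 degrees of freedom, for which the two-sided 95% critical value is about 2.110.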

To further test the number of discriminable emotion categories encoded in visual cortex, we constructed a confusion matrix for relationships between the visual cortical multivariate pattern responses and EmoNet emotion category units. For each study participant, we correlated the output from each of the 20 fMRI models (a vector with 112 values, one for each IAPS image) with vectors of activation across EmoNet’s 20 emotion category units (producing a 20 × 20 correlation matrix), using leave-one-subject-out cross-validation to provide an unbiased test. For each model, the EmoNet unit with the highest correlation was taken as the best-guess emotion category based on brain activity, and the resulting confusion matrix was averaged across participants. The confusion matrix is shown in Fig. 4C , with correct predictions in 20-way classification (sensitivity) shown on the diagonal and false alarms (1 – specificity) on the off-diagonal. The average sensitivity across participants was 66.67 ± 11.4% (SEM), and specificity was 97.37 ± 0.88%; thus, visual cortical activity was mapped onto EmoNet’s categories with a positive predictive value of 65.45 ± 10.4% (chance is approximately 5%). In addition, as above, we estimated the number of uniquely discriminable categories by clustering the 20 categories and searching the clustering dendrogram to determine the maximum number of clusters (minimum link distance) at which each cluster was significantly discriminable from each other one, with bootstrap resampling to estimate confidence intervals. The results showed at least 15 discriminable categories (95% CI, 15 to 17), with a pattern of confusions that was sometimes intuitive based on psychology (e.g., “empathic pain” was indistinguishable from excitement, and romance was grouped with adoration and interest with entrancement) but, in other cases, was counterintuitive (sadness grouped with awe). 
This underscores that visual cortex does not perfectly reproduce human emotional experience but, nonetheless, contains a rich, multidimensional representation of high-level, emotion-related features, in support of prediction 2.
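
Sensitivity, specificity, and positive predictive value can all be read off a confusion matrix by treating one category at a time as the positive class. The sketch below uses a made-up three-class matrix; the function name `one_vs_rest_metrics` is hypothetical, and the code assumes every class is predicted at least once so that no denominator is zero.

```python
def one_vs_rest_metrics(confusion, k):
    """Sensitivity, specificity, and PPV for class k (rows = true, columns = predicted)."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                          # true k, predicted other
    fp = sum(confusion[r][k] for r in range(n)) - tp     # true other, predicted k
    tn = sum(sum(row) for row in confusion) - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts for three categories (illustration only).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = one_vs_rest_metrics(toy, 0)

# With 20 equally likely categories, guessing is correct 1 time in 20.
chance_20_way = 1 / 20
```

The last line reproduces the chance level of approximately 5% quoted for 20-way classification.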

Visualization based on PCA reveals three important emotion-related features of the visual system. ( A ) Scatterplots depict the location of 20 emotion categories in PCA space, with colors indicating loadings onto the first three principal components (PCs) identified from 7214 voxels that retain approximately 95% of the spatial variance across categories. The color of each point is based on the component scores for each emotion (in an additive red-green-blue color space; PC 1 = red, PC 2 = green, PC 3 = blue). Error bars reflect bootstrap SE. ( B ) Visualization of group average coefficients that show mappings between voxels and principal components. Colors are from the same space as depicted in (A). Solid black lines indicate boundaries of cortical regions based on a multimodal parcellation of the cortex ( 41 ). Surface mapping and rendering were performed using the CAT12 toolbox ( 42 ). ( C ) Normalized confusion matrix shows the proportion of data that are classified into 20 emotion categories. Rows correspond to the correct category of cross-validated data, and columns correspond to predicted categories. Gray colormap indicates the proportion of predictions in the dataset, where each row sums to a value of 1. Correct predictions fall on the diagonal of the matrix; erroneous predictions comprise off-diagonal elements. Data-driven clustering of errors shows 15 groupings of emotions that are all distinguishable from one another. ( D ) Visualization of distances between emotion groupings. Dashed line indicates minimum cutoff that produces 15 discriminable categories. Dendrogram was produced using Ward’s linkage on distances based on the number of confusions displayed in (C). See Supplementary Text for a description and validation of the method.

Because EmoNet was trained on visual images, we first explored how emotion schemas might emerge from activity in the human visual system, within a mask comprising the entire occipital lobe [7214 voxels ( 39 )]. Patterns of occipital activity predicted variation in EmoNet’s emotion category units across images, with different fMRI patterns associated with different emotion categories ( Fig. 4 ; for individual maps, see fig. S4). Multiple correlations between brain-based predictions and activation in EmoNet emotion category units were tested in out-of-sample individuals using leave-one-subject-out ( 40 ) cross-validation. These correlations were positive and significant for each of the 20 EmoNet emotion categories [mean r = 0.2819 ± 0.0163 (SE) across individuals; mean effect size d = 3.00; 76.93% of the noise ceiling; P < 0.0001, permutation test; see Materials and Methods]. Performance was highest for entrancement [r = 0.4537 ± 0.0300 (SE); d = 3.559; 77.03% of the noise ceiling; P < 0.0001], sexual desire [r = 0.4508 ± 0.0308 (SE); d = 3.453; 79.01% of the noise ceiling; P < 0.0001], and romance [r = 0.3861 ± 0.0203 (SE); d = 4.476; 72.34% of the noise ceiling; P < 0.0001] and lowest for horror [r = 0.1890 ± 0.0127 (SE); d = 3.520; 60.17% of the noise ceiling; P < 0.0001], fear [r = 0.1800 ± 0.0216 (SE); d = 1.963; 59.44% of the noise ceiling; P < 0.0001], and “excitement” [r = 0.1637 ± 0.0128 (SE); d = 3.004; 65.28% of the noise ceiling; P < 0.0001].
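
A permutation test for a correlation can be sketched as follows: shuffle one variable many times, recompute the correlation each time, and report how often the shuffled value reaches the observed one. This is a generic one-sided version with invented data; the exact permutation scheme used in the study is described in its Materials and Methods and may differ, and the names `pearson_r` and `permutation_p` are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p(x, y, n_perm=999, seed=0):
    """One-sided p-value for a positive correlation, obtained by shuffling y."""
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    observed = pearson_r(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson_r(x, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical brain-based predictions and unit activations for ten images.
brain_pred = [0.1, 0.4, 0.2, 0.8, 0.5, 0.9, 0.3, 0.7, 0.6, 1.0]
unit_activation = [0.2, 0.5, 0.1, 0.9, 0.4, 1.0, 0.3, 0.6, 0.7, 0.8]
p_value = permutation_p(brain_pred, unit_activation)
```

With this estimator the smallest attainable p-value is 1 / (n_perm + 1), so resolving P < 0.0001 requires at least 10,000 permutations.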

If emotion schemas are afforded by visual scenes, then it should be possible to decode emotion category–related representations in EmoNet from activity in the human visual system. To test this hypothesis, we measured brain activity using fMRI while participants (n = 18) viewed a series of 112 affective images that varied in affective content (see Materials and Methods for details). Treating EmoNet as a model of the brain ( 38 ), we used PLS to regress patterns in EmoNet’s emotion category layer onto patterns of fMRI responses to the same images. We investigated the predictive performance, discriminability, and spatial localization of these mappings to shed light on how and where emotion-related visual scenes are encoded in the brain.

Movie genres are systematically associated with different emotion schemas: Romantic comedies were predicted by increased activation of units coding for “romance” (β̂ = 1.499; 95% CI, 1.001 to 2.257), amusement (β̂ = 1.167; 95% CI, 0.639 to 2.004), and “sadness” (β̂ = 0.743; 95% CI, 0.062 to 1.482); horror trailers were predicted by activation of interest (β̂ = 1.389; 95% CI, 0.305 to 3.413), horror (β̂ = 1.206; 95% CI, 0.301 to 3.536), and aesthetic appreciation (β̂ = 1.117; 95% CI, 0.259 to 2.814); and action trailers were predicted by activation of “anxiety” (β̂ = 1.526; 95% CI, 0.529 to 2.341), awe (β̂ = 0.769; 95% CI, 0.299 to 1.162), and “fear” (β̂ = 0.575; 95% CI, 0.094 to 1.109). As with IAPS images, EmoNet tracked canonical visual scenes that can lead to several kinds of emotional experience based on context. For instance, some horror movies in this sample included scenic shots of woodlands, which were classified as aesthetic appreciation, leading to high weights for aesthetic appreciation on horror films. While these mappings illustrate how EmoNet output alone should not be overinterpreted in terms of human feelings, they also illustrate how emotion concepts can constrain the repertoire of feelings in context. A beautiful forest or children playing can be ominous when paired with other threatening context cues (e.g., scary music), but the emotion schema is incompatible with a range of other emotions (sadness, anger, interest, sexual desire, disgust, etc.).

The results indicated that EmoNet’s frame-by-frame predictions tracked meaningful variation in emotional scenes across time ( Fig. 3A ) and that mean emotion category probabilities accurately classified the trailers ( Fig. 3, B and C , and movies S1 to S3), with a three-way classification accuracy of 71.43% (P < 0.0001, permutation test; chance, 35.7%). The average AUC for the three genres was 0.855 (Cohen’s d = 1.497; Fig. 3C ). Classification errors were made predominantly between action and horror movies (26.32%), whereas romantic comedies were not misclassified, indicating that they had the most distinct features.
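
The accuracy and chance figures above can be checked against the genre counts reported for this sample (9 romantic comedy, 9 action, and 10 horror trailers). The sketch assumes that chance is defined as always guessing the largest genre, which reproduces the reported 35.7%, and that 20 of the 28 trailers were classified correctly, which reproduces 71.43%. Both are inferences from the reported numbers rather than statements in the text.

```python
genre_counts = {"romantic comedy": 9, "action": 9, "horror": 10}
n_trailers = sum(genre_counts.values())

# Assumed definition of chance: always predict the most frequent genre.
chance = max(genre_counts.values()) / n_trailers

# Inferred count of correct classifications (20 of 28 matches the reported accuracy).
n_correct = 20
accuracy = n_correct / n_trailers

print(f"chance = {chance:.1%}, accuracy = {accuracy:.2%}")
```

Because the genres are not perfectly balanced, this baseline is slightly above the 33.3% that a uniform three-way guess would give.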

( A ) Emotion prediction for a single movie trailer. Time courses indicate model outputs on every fifth frame of the trailer for the 20 emotion categories, with example frames shown above. Conceptually related images from the public domain (CC0) are displayed instead of actual trailer content. A summary of the emotional content of the trailer is shown on the right, which is computed by averaging predictions across all analyzed frames. ( B ) PLS parameter estimates indicate which emotions lead to predictions of different movie genres. Violin plots depict the bootstrap distributions (1000 iterations) for parameter estimates differentiating each genre from all others. Error bars indicate bootstrap SE. ( C ) Receiver operating characteristic (ROC) plots depict 10-fold cross-validation performance for classification. The solid black line indicates chance performance. ( D ) t-SNE plot based on the average activation of all 20 emotions. ( E ) Confusion matrix depicting misclassification of different genres; rows indicate the ground truth label, and columns indicate predictions. The grayscale color bar shows the proportion of trailers assigned to each class. Analysis was performed on a trailer for The Proposal, ©2009 Disney.

A second test examined whether emotion categories could meaningfully be applied to dynamic stimuli such as videos. We tested EmoNet’s performance in classifying the genre of 28 randomly sampled movie trailers from romantic comedy (n = 9), action (n = 9), and horror (n = 10) genres (see Materials and Methods for sampling and selection criteria). EmoNet made emotion predictions for each movie frame (for example, see the time series in Fig. 3 ). PLS regression was used to predict movie genres from the average activation over time in EmoNet’s final emotion category layer, using one-versus-all classification ( 37 ) with 10-fold cross-validation to estimate classification accuracy in independent movie trailers.
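
The pipeline described above has two steps: average frame-level category probabilities into one vector per trailer, then score that vector with one model per genre and pick the highest score. The sketch below uses three categories instead of 20 and made-up linear weights in place of fitted PLS coefficients; `mean_over_frames` and `one_vs_all_predict` are hypothetical names.

```python
def mean_over_frames(frame_probs):
    """Average per-frame category probabilities into one summary vector."""
    n_frames = len(frame_probs)
    n_cats = len(frame_probs[0])
    return [sum(frame[c] for frame in frame_probs) / n_frames for c in range(n_cats)]


def one_vs_all_predict(summary, weights, intercepts):
    """Score each genre with its own linear model and return the best-scoring genre."""
    scores = {
        genre: intercepts[genre] + sum(w * s for w, s in zip(weights[genre], summary))
        for genre in weights
    }
    return max(scores, key=scores.get)


# Hypothetical frame-level outputs for three categories: romance, horror, anxiety.
frames = [[0.6, 0.1, 0.3], [0.8, 0.1, 0.1], [0.7, 0.1, 0.2]]

# Made-up weights standing in for fitted one-versus-all models (assumption).
weights = {
    "romantic comedy": [1.5, -0.5, -0.5],
    "horror": [-0.5, 1.2, 0.2],
    "action": [-0.5, 0.2, 1.5],
}
intercepts = {"romantic comedy": 0.0, "horror": 0.0, "action": 0.0}

summary = mean_over_frames(frames)
genre = one_vs_all_predict(summary, weights, intercepts)
```

Averaging over time discards the order of frames, so two trailers with the same mix of scenes in a different sequence receive the same summary vector and the same predicted genre.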

( A ) Depiction of the full IAPS, with picture locations determined by t-SNE of activation of the last fully connected layer of EmoNet. The color of each point indicates the emotion category with the greatest score for each image. Large circles indicate mean location for each category. Combinations of loadings on different emotion categories are used to make predictions about normative ratings of valence and arousal. ( B ) Parameter estimates indicate relationships identified using PLS regression to link the 20 emotion categories to the dimensions of valence (x axis) and arousal (y axis). Bootstrap means and SE are shown by circles and error bars. For predictions of valence, positive parameter estimates indicate increasing pleasantness, and negative parameter estimates indicate increasing unpleasantness; for predictions of arousal, positive parameter estimates indicate a relationship with increasing arousal and negative estimates indicate a relationship with decreasing arousal. *P < 0.05, **P FWE < 0.05. ( C ) Cross-validated model performance. Left and right: Normative ratings of valence and arousal, plotted against model predictions. Individual points reflect the average rating for each of 25 quantiles of the full IAPS set. Error bars indicate the SD of normative ratings (x axis; n = 47) and the SD of repeated 10-fold cross-validation estimates (y axis; n = 10). Middle: Bar plots show overall RMSE (lower values indicate better performance) for models tested on valence data (left bars, red hues) and arousal data (right bars, blue hues). Error bars indicate the SD of repeated 10-fold cross-validation. *P < 0.0001, corrected resampled t test. The full CNN model and weights for predicting valence and arousal are available at https://github.com/canlab for public use.

In addition, the categorical emotion responses in EmoNet’s representation of each image were arranged in valence-arousal space in a manner similar to the human circumplex model ( Fig. 2B ), although with some differences from human raters. Units coding for adoration (β̂ = 0.5002; 95% CI, 0.2722 to 1.0982), “aesthetic appreciation” (β̂ = 0.4508; 95% CI, 0.1174 to 0.6747), and surprise (β̂ = 0.4781; 95% CI, 0.3027 to 1.1476) were most strongly associated with positive valence across categories. Units coding for “disgust” (β̂ = −0.7377; 95% CI, −1.0365 to −0.6119), entrancement (β̂ = −0.1048; 95% CI, −0.5883 to −0.0010), and horror (β̂ = −0.3311; 95% CI, −0.7591 to −0.0584) were the most negatively valenced. The highest loadings on arousal were in units coding for awe (β̂ = 0.0285; 95% CI, 0.0009 to 0.0511) and horror (β̂ = 0.0322; 95% CI, 0.0088 to 0.0543), and the lowest-arousal categories were amusement (β̂ = −0.3189; 95% CI, −0.5567 to −0.1308), “interest” (β̂ = −0.2310; 95% CI, −0.4499 to −0.0385), and “boredom” (β̂ = −0.1605; 95% CI, −0.4380 to −0.0423). The marked similarities with the human affective circumplex demonstrate that model representations of emotion categories reliably map onto dimensions of valence and arousal. However, these findings do not indicate that the valence-arousal space is sufficient to encode the full model output; we estimate that doing so requires 17 dimensions, and the loadings in Fig. 2B do not exhibit a classic circumplex pattern. The discrepancies (e.g., surprise is generally considered high-arousal and neutral valence, and awe is typically positively valenced) highlight that the model was designed to track visual features that might serve as components of emotion but do not capture human feelings in all respects. For example, while people may typically rate awe as a positive experience, “awe-inspiring” scenes often depict high-arousal activities (e.g., extreme skiing or base jumping).

We constructed predictive models using partial least squares (PLS) regression of human valence and arousal on features from the last fully connected layer of EmoNet, which has 20 units, one for each emotion category. We analyzed the accuracy in predicting valence and arousal ratings of out-of-sample test images using 10-fold cross-validation ( 35 ), stratifying folds based on normative ratings. We also analyzed the model weights (β PLS ) mapping emotion categories to arousal and valence to construct a valence and arousal space from the activity of emotion category units in EmoNet. The models strongly predicted valence and arousal ratings for new (out-of-sample) images. The model predicted valence ratings with r = 0.88 [P < 0.0001, permutation test; root mean square error (RMSE), 0.9849] and arousal ratings with r = 0.85 (P < 0.0001; RMSE, 0.5843). A follow-up generalization test using these models to predict normative ratings on a second independent image database ( 36 )—with no model retraining—showed similar levels of performance for both valence (r = 0.83; RMSE, 1.605) and arousal (r = 0.84; RMSE, 1.696). Thus, EmoNet explained more than 60% of the variance in average human ratings of pleasantness and arousal when viewing IAPS images. This level of prediction indicates that the category-related visual features EmoNet detects are capable of explaining differences in self-reported valence and arousal in stimuli that elicit mixed emotions.
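
The claim that more than 60% of the variance is explained can be checked by squaring the reported correlations, taking the squared correlation between predicted and observed ratings as the share of variance explained (an assumption about how the figure was derived). The `rmse` helper is a hypothetical illustration of the error measure named in the text, shown on invented numbers.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root mean square error between predictions and observed ratings."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Correlations reported in the text.
reported_r = {
    "valence (IAPS)": 0.88,
    "arousal (IAPS)": 0.85,
    "valence (second database)": 0.83,
    "arousal (second database)": 0.84,
}

# Squared correlation as the proportion of variance explained.
variance_explained = {name: r ** 2 for name, r in reported_r.items()}

for name, r2 in variance_explained.items():
    print(f"{name}: r^2 = {r2:.4f}")

# Toy RMSE example with made-up ratings.
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

All four squared correlations fall between roughly 0.69 and 0.77, so the stated lower bound of 60% is conservative.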

To further test EmoNet’s generalizability, we tested it on three additional image and movie databases. A first test applied EmoNet to images in the IAPS, a widely studied set of images used to examine the influence of positive and negative affect on behavior, cognitive performance, autonomic responses, and brain activity ( 32 ). The IAPS dataset provides an interesting test because human norms for emotion intensity ratings are available and because IAPS images often elicit mixed emotions that include responses in multiple categories ( 33 ). Much of the variance in these emotion ratings is explained by a 2D model of valence (pleasant to unpleasant) and arousal (calm to activated), and emotion categories are reliably mapped into different portions of the valence-arousal space ( 34 ), often in a circumplex pattern ( 5 ). These characteristics of the IAPS allowed us to determine whether the schemas learned by EmoNet reliably inform normative human ratings of valence and arousal and whether EmoNet organizes emotions in a low-dimensional or circumplex structure similar to human ratings. These tests serve the dual functions of validating emotion schemas against human ratings of affective feelings and providing a description of the visual features that may contribute to human feelings.

To further assess the number of distinct emotion categories represented by EmoNet, we developed two additional tests of (i) dimensionality and (ii) emotion category discriminability. First, we tested the possibility that EmoNet is tracking a lower-dimensional space, such as one organized by valence and arousal, rather than a rich, category-specific representation. Principal components analysis (PCA) on model predictions in the holdout dataset indicated that many components were required to explain model predictions; 17 components were required to explain 95% of the model variance, with most components being mapped to only a single emotion [i.e., exhibiting simple structure ( 31 ); see fig. S3]. To test category discriminability, we developed a test of how many emotion categories were uniquely discriminable from each other category in EmoNet’s output ( Fig. 1E ; see Supplementary Text for details of the method). The results indicated that EmoNet differentiated 11 (95% CI, 10 to 14) distinct emotion categories from one another, supporting the sensory embedding hypothesis.
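
The dimensionality test amounts to counting how many principal components are needed before their cumulative variance reaches 95%. The sketch below performs only that counting step on a list of component variances; computing the components themselves would require an eigendecomposition, which is omitted. Both spectra are invented for illustration and `components_needed` is a hypothetical helper.

```python
def components_needed(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative variance share reaches threshold."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, v in enumerate(ordered, start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical spectra over 20 output units (NOT EmoNet's actual eigenvalues).
low_dimensional = [50.0, 45.0] + [0.1] * 18    # two dominant axes, e.g. valence and arousal
category_rich = [1.0] * 17 + [0.1] * 3         # variance spread over many components

k_low = components_needed(low_dimensional)
k_rich = components_needed(category_rich)
```

A model that only tracked valence and arousal would look like the first spectrum and need about two components, whereas a category-specific representation spreads variance as in the second.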

The visual information EmoNet used to discriminate among emotion schemas could be related to a variety of features, ranging from low-level features (e.g., color or spatial power spectra) to high-level features (e.g., the presence of objects or faces). To investigate how these features are linked to the different emotion schemas that EmoNet learned, we examined associations between these properties and 20-dimensional (20D) patterns of activation in EmoNet’s final emotion category layer for the training dataset. This analysis revealed that different color intensities, spatial power spectra, and object classes were associated with different emotion schemas in the training data (with correlations as high as r = 0.8; fig. S2). However, when applied to images from the holdout test data, these bivariate associations largely failed to discriminate among emotion schemas (except for object classes, which exhibited a top 1 accuracy of 11.4%). Thus, EmoNet characterizes different categories by combining multiple visual features that vary in their abstractness, consistent with the idea that emotion schemas are represented in a distributed code.

Crucially, EmoNet accurately discriminated multiple emotion categories in a relatively fine-grained way, although model performance varied across categories. “Craving” [AUC, 0.987; 95% confidence interval (CI), 0.980 to 0.990; d = 3.13; P < 0.0001], “sexual desire” (AUC, 0.965; 95% CI, 0.960 to 0.968; d = 2.56; P < 0.0001), “entrancement” (AUC, 0.902; 95% CI, 0.884 to 0.909; d = 1.83; P < 0.0001), and “horror” (AUC, 0.876; 95% CI, 0.872 to 0.883; d = 1.63; P < 0.0001) were the most accurately predicted categories. On the other end of the performance spectrum, “confusion” (AUC, 0.636; 95% CI, 0.621 to 0.641; d = 0.490; P < 0.0001), “awe” (AUC, 0.615; 95% CI, 0.592 to 0.629; d = 0.415; P < 0.0001), and “surprise” (AUC, 0.541; 95% CI, 0.531 to 0.560; d = 0.147; P = 0.0002) exhibited the lowest levels of performance, despite exceeding chance levels. Some emotions were highly confusable in the test data, such as “amusement,” “adoration,” and “joy,” suggesting that they have similar visual features despite being distinct from other emotions (fig. S3). Thus, visual information is sufficient for predicting some emotion schemas, particularly those that have a strong relationship with certain high-level visual categories, such as craving or sexual desire, whereas other sources of information are necessary to discriminate emotions that are conceptually abstract or depend on temporal dynamics (e.g., confusion or surprise).
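The AUC and Cohen's d values reported together above can be related through a standard conversion. The sketch below assumes an equal-variance normal model, under which d = √2 · Φ⁻¹(AUC); the source does not state how d was computed, so this is an assumption, and the function name is hypothetical. Under that assumption, the conversion gives values close to those reported (for example, about 2.56 for an AUC of 0.965).

```python
from math import sqrt
from statistics import NormalDist

def auc_to_cohens_d(auc):
    """Convert an AUC to Cohen's d, assuming two normal score
    distributions with equal variance (an assumption, not the
    authors' documented procedure)."""
    # Under this model, AUC = Phi(d / sqrt(2)), so d = sqrt(2) * Phi^-1(AUC).
    return sqrt(2) * NormalDist().inv_cdf(auc)

for category, auc in [("sexual desire", 0.965), ("horror", 0.876), ("surprise", 0.541)]:
    print(f"{category}: AUC = {auc:.3f}, d = {auc_to_cohens_d(auc):.2f}")
```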

(A) Model architecture follows that of AlexNet (five convolutional layers followed by three fully connected layers); only the last fully connected layer has been retrained to predict emotion categories. (B) Activation of artificial neurons in three convolutional layers (1, 3, and 5) and two fully connected layers (6 and 8) of the network. Scatterplots depict t-distributed stochastic neighbor embedding (t-SNE) plots of activation for a random selection of 1000 units in each layer. The first four layers come from a model developed to perform object recognition (25), and the last layer was retrained to predict emotion categories from an extensive database of video clips. (C) Examples of randomly selected images assigned to each class in holdout test data (images from videos that were not used for training the model). Pictures were not chosen to match target classes. Some examples show contextually driven prediction, e.g., an image of a sporting event is classified as empathic pain, although no physical injury is apparent. (D) Linear classification of activation in each layer of EmoNet shows increasing emotion-related information in later layers, particularly in the retrained layer fc8. Error bars indicate SEM based on binomial distribution. (E) t-SNE plot shows model predictions in test data. Colors indicate the predicted class, and circled points indicate that the ground truth label was in the top 5 predicted categories. Although t-SNE does not preserve global distances, the plot does convey local clustering of emotions such as amusement and adoration. (F) Normalized confusion matrix shows the proportion of test data that are classified into the 20 categories. Rows correspond to the correct category of test data, and columns correspond to predicted categories. Gray colormap indicates the proportion of predictions in the test dataset, where each row sums to a value of 1. Correct predictions fall on the diagonal of the matrix, whereas erroneous predictions comprise off-diagonal elements. Categories the model is biased toward predicting, such as amusement, are indicated by dark columns. Data-driven clustering of errors shows 11 groupings of emotions that are all distinguishable from one another (see Materials and Methods and fig. S3). Images were captured from videos in the database developed by Cowen and Keltner (25).
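The row normalization described for the confusion matrix can be sketched as follows. This is a minimal illustration using integer class labels; the function name and the toy labels are hypothetical and do not come from the paper.

```python
def normalized_confusion(true_labels, predicted_labels, n_classes):
    """Confusion matrix with rows = correct category, columns = predicted
    category, where each non-empty row is scaled to sum to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true][predicted] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Rows with no test items are left as zeros to avoid dividing by zero.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy example with three classes.
true_labels = [0, 0, 1, 1, 1, 2]
predicted_labels = [0, 1, 1, 1, 0, 2]
for row in normalized_confusion(true_labels, predicted_labels, 3):
    print([round(value, 2) for value in row])
```

With this layout, a column whose entries are large across many rows indicates a category the classifier is biased toward predicting.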

EmoNet (Fig. 1) was based on the popular AlexNet object recognition model, which mirrors information processing in the human ventral visual stream (30). We changed its objective from recognizing object classes to identifying the normative emotion category of more than 137,482 images extracted from videos (25), with normative emotion categories based on ratings from 853 participants. This was accomplished by retraining weights in its final fully connected layer (see Materials and Methods for details). We tested EmoNet on 24,634 images from 400 videos not included in the training set. EmoNet accurately decoded normative human ratings of emotion categories, providing support for prediction 1. The human-consensus category was among the top 5 predictions made by the model (top 5 accuracy in 20-way classification) for 62.6% of images (chance, 27.95%; P < 0.0001, permutation test); the top 1 accuracy in a 20-way classification was 23.09% (chance, 5.00%; P < 0.0001, permutation test); the average area under the receiver operating characteristic curve (AUC) across the 20 categories was 0.745 (Cohen’s d = 0.945), indicating that emotions could be discriminated from one another with large effect sizes. Model comparisons indicated that EmoNet performed better than shallower models based on the same architecture (with fewer convolutional layers, all P < 0.0001, McNemar test) and better than a deeper model that included the eight layers from AlexNet and an additional fully connected layer optimized to predict emotion categories (P < 0.0001; see model 4 in fig. S1).
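The top-k accuracy and permutation-based chance level reported above can be illustrated with a small sketch. The function names, the toy scores, and the label-shuffling scheme are assumptions for illustration; the paper's exact permutation procedure is described in its Materials and Methods and may differ.

```python
import random

def top_k_accuracy(scores, labels, k):
    """Fraction of items whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

def permutation_chance(scores, labels, k, n_permutations=1000, seed=0):
    """Mean top-k accuracy after shuffling labels, as an empirical chance level."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        total += top_k_accuracy(scores, shuffled, k)
    return total / n_permutations

# Toy scores for three items over three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
labels = [1, 0, 2]
print(top_k_accuracy(scores, labels, k=1))
print(permutation_chance(scores, labels, k=1, n_permutations=200))
```

One reason an empirical chance level can differ from the uniform value of k divided by the number of classes (5/20 = 25% for top 5) is that shuffling preserves unequal class frequencies and any bias of the model toward frequent classes.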

DISCUSSION

Our work demonstrates the intimate relationship between visual perception and emotion. Although emotions are often about specific objects, events, or situations (1), few computational accounts of emotion specify how sensory information is transformed into emotion-relevant signals. Driven by the hypothesis that emotion schemas are embedded in the human visual system, we developed a computational model (EmoNet) to classify images into 20 different emotion categories. Consistent with our prediction that image features alone are sufficient for predicting normative ratings of emotion categories determined by humans, EmoNet accurately classified images into at least 11 different emotion categories in holdout test data. Supporting our second prediction that EmoNet representations learned from visual images should map primarily onto the activity of sensory systems (as opposed to subcortical structures or limbic brain regions), distributed patterns of human occipital lobe activity were the best predictors of emotion category units in EmoNet. Last, our third prediction was supported by the observation that patterns of occipital lobe activity were sufficient for decoding at least 15 emotion categories evoked by images and at least five of seven emotional states elicited by cinematic movies. These findings shed light both on how visual processing constrains emotional responses and on how emotions are represented in the brain.

A large body of research has assumed that low-level visual information is mainly irrelevant to emotional processing and should be either controlled for or explained away, although studies have shown that neurons in early visual areas are sensitive to affective information such as reward (12). Our model provides a means to disentangle the visual properties of stimuli that are emotion relevant from those that are not and to isolate stimulus-related features [e.g., red color serving as an indicator of higher energy content in fruit (46)] from more abstract constructs (e.g., the broader concept of craving, which does not necessarily require a visual representation). Along these lines, we found some evidence that aspects of the visual field, including angle, eccentricity, and field size, are modestly associated with different emotion schemas, converging with evidence that emotions can act to broaden or focus visual perception (47). However, we found that simple visual features (or linear combinations of them) are poor discriminators of emotion categories, suggesting that EmoNet is using complex, nonlinear combinations of visual features to label images. This suggests that distributed representations that include multiple different visual features (varying in abstractness) code for different schemas. Thus, although the information EmoNet uses is certainly visual in nature, it is not reducible to a simple set of features easily labeled by humans. On the basis of our findings, it seems unlikely that a complete account of emotion will be devoid of sensory qualities that are naturally associated with emotional outcomes or those that are reliably learned through experience.

We found that human ratings of pleasantness and excitement evoked by images can be accurately modeled as a combination of emotion-specific features (e.g., a mixture of features related to disgust, horror, sadness, and fear is highly predictive of unpleasant arousing experiences). Individuals may draw from this visual information when asked to rate images. The presence of emotion-specific visual features could activate learned associations with more general feelings of valence and arousal and help guide self-report. It is possible that feelings of valence and arousal arise from integration across feature detectors or predictive coding about the causes of interoceptive events (48). Rather than being irreducible (49), these feelings may be constructed from emotionally relevant sensory information, such as the emotion-specific features we have identified here, and previous expectations of their affective significance. This observation raises the possibility that core dimensions of affective experience, such as arousal and valence, may emerge from a combination of category-specific features rather than the other way around, as is often assumed in constructivist models of emotion.

In addition to our observation that emotion-specific visual features can predict normative ratings of valence and arousal, we found that they were effective at classifying the genre of cinematic movie trailers. Moreover, the emotions that informed prediction were generally consistent with those typically associated with each genre (e.g., romantic comedies were predicted by activation of romance and amusement). This validation differed from our other two image-based assessments of EmoNet (i.e., testing on holdout videos from the database used for training and testing on IAPS images) because it examined stimuli that are not conventionally used in the laboratory but are robust elicitors of emotional experience in daily life. Beyond hinting at real-world applications of our model, integrating results across these three validation tests serves to triangulate our findings, as different methods (with different assumptions and biases) were used to produce more robust, reproducible results.

The fact that emotion category units of EmoNet were best characterized by activity spanning visual cortex (i.e., the occipital lobe) sheds light on the nature of emotion representation in the brain. There are multiple types of well-known functional specialization in the occipital lobe, with different areas selectively responding to varying spatial frequency, orientation, color, and motion, among numerous other examples (50). More recent work combining CNN models and brain measurement (30) has demonstrated that early visual areas represent features in early layers of AlexNet in an explicit manner, whereby information is directly accessible to a downstream neuron or processing unit via linear readout (51).

Although an extensive body of work has demonstrated that these mappings between visual features and the occipital lobe exist, our findings indicate that neither early layers of AlexNet nor individual visual features successfully discriminate among multiple emotion categories. These observations suggest that an alternative account is necessary to explain how emotion schemas are mapped onto the visual system. Our work provides new insight into the visual system and the nature of emotion by showing that the occipital cortex explicitly encodes representations of multiple emotion schemas and that, rather than being encoded in individual visual areas, emotion-related features are distributed across them. This distributed representation encodes complex emotion-related features that are not reducible to individual visual features. These features likely emerge through a series of nonlinear transformations, through which the visual system filters retinal input to represent different types of emotional situations, analogous to how object representations emerge in the ventral stream.

Activation of emotion schemas in visual cortex offers a rapid, possibly automatic way of triggering downstream emotional responses in the absence of deliberative or top-down conceptual processes. By harnessing the parallel and distributed architecture of the visual system, these representations could be refined through experience. Information from downstream systems via feedback projections from ventromedial prefrontal cortex or the amygdala (10) could update visual emotion schemas through learning. Sensory information from nonretinal sources, including auditory stimuli and mental imagery, can activate object-specific representations in early visual areas (52) and could similarly activate emotion-specific codes in the visual system. Thus, emotion-related activity in visual cortex is most likely not a purely bottom-up response to visual inputs or a top-down interpretation of them but is at the interface of sensory representations of the environment and previous knowledge about potential outcomes. Future work integrating computational models with recurrent feedback and brain responses to emotional images will be necessary to understand the convergence of these bottom-up and top-down signals.

Our computational framework provides a way to resolve outstanding theoretical debates in affective science. It could be used, for example, to test whether mappings between visual features and emotions are conserved across species or change throughout development in humans. On the basis of evolutionary accounts that suggest that certain basic emotions are solutions to survival challenges, mechanisms for detecting emotionally relevant events should be conserved across species. Notably, some of the most accurately predicted schemas include sexual desire and craving, which are motivational states that transcend cultures and are linked to clear evolutionary goals (i.e., to reproduce and to acquire certain nutrients). Work in the domain of object recognition has shown that representations of objects are highly similar between humans and macaques (53); an extension of the present work is to test whether the emotion representations we identified here are as well.

Our work has several limitations that can be addressed in future work. Although our goal was to focus on visual processing of emotional features, visual stimulation is not the only way in which emotions can be elicited. Information from other senses (olfactory, auditory, somatic, interoceptive, etc.), memories of past events, manipulation of motor activation, and mental imagery have all been used to evoke emotional experiences in the laboratory. EmoNet can be expanded, potentially by adding more abstract or supramodal representation of emotions and interactions among different types of sensory information. Further, although EmoNet was trained to evaluate the emotional significance of images, it was not developed to predict emotional behavior. Future work is necessary to understand whether emotion schemas constrain behavior and to determine whether they generalize to real-world scenarios [e.g., whether viewing an image of a spider activates the same schema as physically encountering one (54)]. It may also be possible to refine the model by constructing adversarial examples (55) of different schemas, i.e., images that are designed to fool EmoNet, and to evaluate their impact on human experience and behavior. Last, as our model comparisons show, EmoNet is only one model in a large space of neural networks that can explain emotion processing; comparing different models designed to achieve a common goal (e.g., detecting emotional situations from words, speech, or music or producing a specific behavioral response) may reveal the principles at the core of different emotional phenomena.

Using a combination of computational and neuroscientific tools, we have demonstrated that emotion schemas are embedded in the human visual system. By precisely specifying what makes images emotional, our modeling framework offers a new approach to understanding how visual inputs can rapidly evoke complex emotional responses. We anticipate that developing biologically inspired computational models will be a crucial next step for resolving debates about the nature of emotions [e.g., (56)] and for providing practical tools for scientific research and applied settings.