The mental contents of perception and imagery are thought to be encoded in hierarchical representations in the brain, but previous attempts to visualize perceptual contents have failed to capitalize on multiple levels of the hierarchy, leaving it challenging to reconstruct internal imagery. Recent work showed that visual cortical activity measured by functional magnetic resonance imaging (fMRI) can be decoded (translated) into the hierarchical features of a pre-trained deep neural network (DNN) for the same input image, providing a way to make use of the information from hierarchical visual features. Here, we present a novel image reconstruction method, in which the pixel values of an image are optimized to make its DNN features similar to those decoded from human brain activity at multiple layers. We found that our method was able to reliably produce reconstructions that resembled the viewed natural images. A natural image prior introduced by a deep generator neural network effectively rendered semantically meaningful details to the reconstructions. Human judgment of the reconstructions supported the effectiveness of combining multiple DNN layers to enhance the visual quality of generated images. While our model was solely trained with natural images, it successfully generalized to artificial shapes, indicating that our model was not simply matching to exemplars. The same analysis applied to mental imagery demonstrated rudimentary reconstructions of the subjective content. Our results suggest that our method can effectively combine hierarchical neural representations to reconstruct perceptual and subjective images, providing a new window into the internal contents of the brain.

Machine learning-based analysis of human functional magnetic resonance imaging (fMRI) patterns has enabled the visualization of perceptual content. However, prior work visualizing perceptual contents from brain activity has failed to combine visual information of multiple hierarchical levels. Here, we present a method for visual image reconstruction from the brain that can reveal both seen and imagined contents by capitalizing on multiple levels of visual cortical representations. We decoded brain activity into hierarchical visual features of a deep neural network (DNN), and optimized an image to make its DNN features similar to the decoded features. Our method successfully produced images perceptually similar to viewed natural images and artificial images (colored shapes and letters), even though the decoder was trained only on an independent set of natural images. It also generalized to the reconstruction of mental imagery of remembered images. Our approach allows for studying subjective contents represented in hierarchical neural representations by objectifying them into images.

Funding: This research was supported by grants from the New Energy and Industrial Technology Development Organization (NEDO), JSPS KAKENHI Grant number JP26119536, JP15H05920, JP15H05710, JP17K12771 and ImPACT Program of Council for Science, Technology and Innovation (Cabinet Office, Government of Japan). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2019 Shen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Fig 1. Overview of deep image reconstruction. The pixel values of the input image are optimized so that the DNN features of the image are similar to those decoded from fMRI activity. A deep generator network (DGN) is optionally combined with the DNN to produce natural-looking images, in which case optimization is performed in the input space of the DGN.

Here, we present a novel approach, named deep image reconstruction, to visualize perceptual content from human brain activity. This technique combines the DNN feature decoding from fMRI signals with recently developed methods for image generation from the machine learning field (Fig 1) [11]. The reconstruction algorithm starts with a given initial image and iteratively optimizes the pixel values so that the DNN features of the current image become similar to those decoded from brain activity across multiple DNN layers. The resulting optimized image is taken as the reconstruction from the brain activity. We optionally introduced a deep generator network (DGN) [12] to constrain the reconstructed images to look similar to natural images by performing optimization in the input space of the DGN.

The recent success of DNNs provides technical innovations to study the hierarchical visual processing in computational neuroscience [9]. Our recent study used DNN visual features as a proxy for the hierarchical neural representations of the human visual system and found that a brain activity pattern measured by fMRI could be decoded (translated) into the response patterns of DNN units in multiple layers representing the hierarchical visual features given the same input [10]. This finding revealed a homology between the hierarchical representations of the brain and the DNN, providing a new opportunity to utilize the information from hierarchical visual features.

While the externalization of states of the mind is a long-standing theme in science fiction, it is only recently that the advent of machine learning-based analysis of functional magnetic resonance imaging (fMRI) data has expanded its potential in the real world. Although sophisticated decoding and encoding models have been developed to render human brain activity into images or movies, the methods are essentially limited to image reconstructions with low-level image bases [1, 2], or to matching to exemplar images or movies [3, 4], failing to combine the visual features of multiple hierarchical levels. While several recent approaches have introduced deep neural networks (DNNs) for the image reconstruction task, they have failed to fully utilize hierarchical information to reconstruct visual images [5, 6]. Furthermore, whereas categorical decoding of imagery contents has been demonstrated [7, 8], the reconstruction of internally generated images has been challenging.

Results

We trained the decoders that predicted the DNN features of viewed images from fMRI activity patterns following the procedures of Horikawa & Kamitani (2017) [10]. In the present study, we used the VGG19 DNN model [13], which consisted of sixteen convolutional layers and three fully connected layers and was pre-trained with images in ImageNet [14] to classify images into 1,000 object categories (see Materials and Methods: “Deep neural network features” for details). We constructed one decoder for a single DNN unit to predict outputs of the unit. We trained decoders corresponding to all the units in all the layers (see Materials and Methods: “DNN feature decoding analysis” for details).
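The one-decoder-per-unit arrangement described above can be sketched on synthetic data. This is a minimal illustration, not the study's implementation: the regression algorithm actually used is specified in Materials and Methods, and plain L2-regularized linear regression fitted by gradient descent is swapped in here as a stand-in. The names `train_unit_decoder` and `decode`, the data sizes, and the simulated "voxel" and "unit" values are all hypothetical.

```python
import random

def train_unit_decoder(X, y, lr=0.05, n_epochs=500, l2=0.001):
    """Fit one linear decoder (weights, bias) that predicts a single DNN unit
    from a voxel pattern, using batch gradient descent on squared error.
    Assumption: ridge-style regression stands in for the study's regression."""
    n_samples, n_voxels = len(X), len(X[0])
    w, b = [0.0] * n_voxels, 0.0
    for _ in range(n_epochs):
        grad_w, grad_b = [0.0] * n_voxels, 0.0
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            for j in range(n_voxels):
                grad_w[j] += err * x[j] / n_samples
            grad_b += err / n_samples
        w = [wi - lr * (g + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decode(decoders, x):
    """Apply every per-unit decoder to one voxel pattern."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in decoders]

# Synthetic stand-ins: 60 "training images", 5 "voxels", 3 "DNN units".
rng = random.Random(0)
n_samples, n_voxels, n_units = 60, 5, 3
true_W = [[rng.gauss(0, 1) for _ in range(n_voxels)] for _ in range(n_units)]
X_train = [[rng.gauss(0, 1) for _ in range(n_voxels)] for _ in range(n_samples)]
Y_train = [[sum(w * x for w, x in zip(row, xs)) + rng.gauss(0, 0.1) for row in true_W]
           for xs in X_train]

# One decoder per unit, each trained independently on that unit's feature values.
decoders = [train_unit_decoder(X_train, [ys[u] for ys in Y_train])
            for u in range(n_units)]

# Decode a held-out pattern and compare with the noiseless simulated features.
x_test = [rng.gauss(0, 1) for _ in range(n_voxels)]
decoded = decode(decoders, x_test)
noiseless = [sum(w * x for w, x in zip(row, x_test)) for row in true_W]
```

Because each unit has its own decoder, the units can be trained independently and in parallel, which matters when decoders are built for all the units in all the layers.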

The feature decoding analysis was performed with fMRI activity patterns in visual cortex (VC) measured while subjects viewed or imagined visual images. Our experiments consisted of the training sessions in which only natural images were presented and the test sessions in which independent sets of natural images, artificial shapes, and alphabetical letters were presented. In another test session, a mental imagery task was performed. The decoders were trained using the fMRI data from the training sessions, and the trained decoders were then used to predict DNN feature values from the fMRI data of the test sessions (the accuracies are shown in S1 Fig).

Decoded features were then forwarded to the reconstruction algorithm to generate an image using variants of gradient descent optimization (see Materials and Methods: “Reconstruction from a single DNN layer” and “Reconstruction from multiple DNN layers” for details). The optimization was performed to minimize the error between multi-layer DNN features decoded from brain activity patterns and those calculated from the input image by iteratively modifying the input image. For natural image reconstructions, to improve the “naturalness” of reconstructed images, we further introduced a constraint using a deep generator network (DGN) derived from the generative adversarial network (GAN) algorithm [15], which is known to capture a latent space explaining natural images [16] (see Materials and Methods: “Natural image prior” for details).
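The optimization described above can be sketched as a toy problem. This is a sketch under strong assumptions rather than the study's algorithm: each "layer" is a fixed random linear map standing in for a VGG19 layer, the target features are computed from a hidden image instead of being decoded from fMRI, plain gradient descent replaces the optimizer variants detailed in Materials and Methods, and no DGN is used. The names `features`, `feature_loss`, `loss_gradient`, `reconstruct`, and the per-layer weights `betas` are hypothetical.

```python
import random

def features(W, image):
    """Toy 'layer': a fixed linear map standing in for one DNN layer."""
    return [sum(wij * pj for wij, pj in zip(row, image)) for row in W]

def feature_loss(layers, betas, targets, image):
    """Weighted sum over layers of squared error between the image's
    features and the target (decoded) features."""
    total = 0.0
    for W, beta, target in zip(layers, betas, targets):
        total += beta * sum((f - t) ** 2
                            for f, t in zip(features(W, image), target))
    return total

def loss_gradient(layers, betas, targets, image):
    """Gradient of feature_loss with respect to the pixel values."""
    grad = [0.0] * len(image)
    for W, beta, target in zip(layers, betas, targets):
        residual = [f - t for f, t in zip(features(W, image), target)]
        for row, r in zip(W, residual):
            for j, wij in enumerate(row):
                grad[j] += 2.0 * beta * r * wij
    return grad

def reconstruct(layers, betas, targets, initial_image, lr=0.005, n_iter=2000):
    """Iteratively modify the pixel values to reduce the multi-layer feature error."""
    image = list(initial_image)
    for _ in range(n_iter):
        grad = loss_gradient(layers, betas, targets, image)
        image = [p - lr * g for p, g in zip(image, grad)]
    return image

# Hypothetical setup: a 6-pixel "image" and two toy layers of 10 units each.
rng = random.Random(0)
n_pixels, n_units = 6, 10
layers = [[[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(n_units)]
          for _ in range(2)]
betas = [1.0, 1.0]
hidden_image = [rng.uniform(0, 1) for _ in range(n_pixels)]
targets = [features(W, hidden_image) for W in layers]  # stand-in for decoded features

initial_image = [0.0] * n_pixels
loss_before = feature_loss(layers, betas, targets, initial_image)
recon = reconstruct(layers, betas, targets, initial_image)
loss_after = feature_loss(layers, betas, targets, recon)
```

In this toy case the targets are exact, so the optimized image converges to the hidden image. With features decoded from brain activity the targets are noisy, and the optimized image can only be as faithful as the decoded features allow.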

Examples of reconstructions for natural images are shown in Fig 2 (see S2 Fig for more examples, and see S1 Movie for reconstructions through the optimization processes). The reconstructions obtained with the DGN capture the dominant structures of the objects within the images. Furthermore, fine structures reflecting semantic aspects like faces, eyes, and texture patterns were also generated in several images. Our extensive analysis of each individual subject demonstrated that the results were replicable across subjects. Moreover, the same analysis on a previously published dataset [10] yielded reconstructions qualitatively similar to those in the present study (S3 Fig).

Fig 2. Seen natural image reconstructions. The black and gray surrounding frames indicate presented and reconstructed images, respectively (reconstructed from VC activity using DNN1–8). Reconstructed images obtained through the optimization processes are shown for seen natural images. Reconstructions were constrained by the DGN. https://doi.org/10.1371/journal.pcbi.1006633.g002

To investigate the effect of the DGN, we evaluated the quality of reconstructions generated with and without it (Fig 3A and 3B; see S4 Fig for individual subjects; see Materials and Methods: “Evaluation of reconstruction quality”). While the reconstructions obtained without the DGN also reproduced rough silhouettes of dominant objects, they did not show semantically meaningful appearances (see S5 Fig for more examples; also see S6 Fig for reconstructions from different initial states, both with and without the DGN). Evaluations using pixel-wise spatial correlation and human judgment both showed broadly comparable accuracy for reconstructions with and without the DGN (pixel-wise spatial correlation, 76.1% with and 79.7% without the DGN; human judgment, 97.0% with and 96.0% without the DGN). The small differences nevertheless went in opposite directions: accuracy by pixel-wise spatial correlation was slightly higher without the DGN than with it (two-sided signed-rank test, P < 0.01), whereas accuracy by human judgment was slightly higher with the DGN (two-sided signed-rank test, P < 0.01). These results suggest that the DGN enhances the perceptual similarity of reconstructed images to target images by rendering semantically meaningful details in the reconstructions.
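An accuracy with a 50% chance level is what a two-way identification test would produce: a reconstruction counts as correct when it correlates more strongly with its own original than with one other candidate image. The sketch below assumes that procedure; the exact evaluation is defined in Materials and Methods: “Evaluation of reconstruction quality” and may differ in detail. The function names and the tiny four-pixel "images" are hypothetical and carry no data from the study.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences of pixel values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def identification_accuracy(reconstructions, originals):
    """Fraction of two-way comparisons in which a reconstruction correlates
    more with its own original than with another candidate image.
    Chance level is 0.5 because each comparison has two candidates."""
    wins = trials = 0
    for i, recon in enumerate(reconstructions):
        r_true = pearson(recon, originals[i])
        for j, other in enumerate(originals):
            if j == i:
                continue
            trials += 1
            if r_true > pearson(recon, other):
                wins += 1
    return wins / trials

# Hypothetical flattened 2x2 "images" and imperfect reconstructions of them.
originals = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
reconstructions = [[0.9, 0.1, 0.2, 0.8],
                   [0.1, 0.7, 0.9, 0.2],
                   [0.8, 0.9, 0.1, 0.3]]

accuracy = identification_accuracy(reconstructions, originals)
```

The human-judgment counterpart would replace the correlation comparison with a rater's choice between the two candidates, which is why the two measures can disagree on which reconstruction is better.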

Fig 3. Effect of the deep generator network (DGN). (A) Reconstructions with and without the DGN. The first, second, and third rows show presented images, reconstructions with the DGN, and reconstructions without the DGN, respectively (reconstructed from VC activity, DNN1–8). (B) Reconstruction quality of seen natural images (three subjects pooled, N = 150; chance level, 50%). https://doi.org/10.1371/journal.pcbi.1006633.g003

To characterize the ‘deep’ nature of our method, the effectiveness of combining multiple DNN layers was tested using both objective and subjective assessments [5, 17, 18]. For each of the 50 test natural images, reconstructed images were generated with a variable number of multiple layers (Fig 4A; DNN1 only, DNN1–2, DNN1–3, …, DNN1–8; see S7 Fig for more examples). In the objective assessment, the pixel-wise spatial correlations to the original image were compared between two combinations of DNN layers. In the subjective assessment, an independent rater was presented with an original image and a pair of reconstructed images, both from the same original image but generated with different combinations of multiple layers, and was required to indicate which of the reconstructed images looked more similar to the original image. While the objective assessment showed higher winning percentages for the earliest layer (DNN1) alone, the subjective assessment showed increasing winning percentages for a larger number of DNN layers (Fig 4B). Our additional analysis showed poor reconstruction quality from individual layers, especially from higher layers (see S8 Fig for reconstructions from individual layers). These results suggest that combining multiple levels of visual features enhances the perceptual quality of reconstructions, even at some cost to pixel-wise accuracy.
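One plausible reading of the winning percentage is the share of pairwise comparisons, against every other layer combination and over all images, that a given combination wins. The sketch below assumes that reading; the actual procedure is given in Materials and Methods: “Evaluation of reconstruction quality”. The function name `winning_percentages` and the scores are hypothetical and are not results from the study.

```python
def winning_percentages(scores):
    """scores[condition][i] is the quality of that condition's reconstruction
    of image i (for example, its pixel-wise correlation with the original).
    Returns, per condition, the percentage of pairwise comparisons it wins
    against the other conditions on the same images."""
    result = {}
    for a in scores:
        wins = trials = 0
        for b in scores:
            if a == b:
                continue
            for score_a, score_b in zip(scores[a], scores[b]):
                trials += 1
                if score_a > score_b:
                    wins += 1
        result[a] = 100.0 * wins / trials
    return result

# Made-up correlations for three images under three layer combinations.
example_scores = {
    "DNN1":   [0.6, 0.5, 0.7],
    "DNN1-2": [0.5, 0.4, 0.8],
    "DNN1-3": [0.4, 0.3, 0.5],
}
percentages = winning_percentages(example_scores)
```

For the subjective assessment, the score comparison would be replaced by the rater's choice between the two reconstructions shown for each original image.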

Fig 4. Effect of multi-level visual features. (A) Reconstructions using different combinations of DNN layers (without the DGN). The black and gray surrounding frames indicate presented and reconstructed images, respectively (reconstructed from VC activity). (B) Objective and subjective assessments of reconstructions from different combinations of DNN layers (error bars, 95% confidence interval [C.I.] across samples, N = 50; see Materials and Methods: “Evaluation of reconstruction quality” for the procedure to calculate winning percentage). https://doi.org/10.1371/journal.pcbi.1006633.g004

Given the true DNN features, instead of decoded features, as the input, the reconstruction algorithm produces almost complete reconstructions of original images (S8 Fig), indicating that the DNN feature decoding accuracy would determine the quality of reconstructed images. To further confirm this, we calculated the correlation between the feature decoding accuracy and the reconstruction quality for individual images (S9 Fig). The analyses showed positive correlations for both the objective and subjective assessments, suggesting that improving feature decoding accuracy could improve reconstruction quality.

We found that the luminance contrast of a reconstruction was often reversed (e.g., the stained-glass images in Fig 2), presumably because of the lack of (absolute) luminance information in the fMRI signals, even in the early visual areas [19]. Additional analyses revealed that the feature values of filters with high luminance contrast in the earliest DNN layer (conv1_1 in VGG19) were better decoded when they were converted to absolute values (Fig 5A and 5B), demonstrating a clear discrepancy between the fMRI and raw DNN signals. The size of this improvement demonstrates the insensitivity of fMRI signals to pixel luminance, suggesting a linear-nonlinear discrepancy between DNN and fMRI responses to pixel luminance. This discrepancy may explain the reversal of luminance observed in several reconstructed images. While this may limit the potential for reconstructions from fMRI signals, the ambiguity might be resolved by modelling DNNs so as to bridge the gap between DNN and fMRI signals. Alternatively, further emphasis on the high-level visual information in hierarchical visual features may help to resolve the ambiguity of luminance by incorporating information on semantic context.
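A toy simulation illustrates why converting features to absolute values can help in this situation. It assumes a hypothetical measurement that depends only on the magnitude of a signed filter output, not on its sign; this is a simplification for illustration, not a model of fMRI or the analysis in Fig 5. All names and values are made up.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)

# Signed outputs of a hypothetical contrast-sensitive filter.
raw_features = [rng.gauss(0, 1) for _ in range(2000)]

# Assumed measurement: sensitive to contrast magnitude only, plus noise.
measured = [abs(f) + rng.gauss(0, 0.3) for f in raw_features]

# How well does the measurement track the raw versus the absolute feature?
r_raw = pearson(measured, raw_features)
r_abs = pearson(measured, [abs(f) for f in raw_features])
```

Under this assumption the measurement tracks the absolute feature closely and the raw feature hardly at all, so a decoder trained on the raw feature cannot recover its sign. That is one way a reconstruction could end up with reversed luminance contrast.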

Fig 5. DNN feature decoding accuracy of raw and absolute features. The analysis was performed with features from the conv1_1 layer of the VGG19 model using the test natural image dataset (error bar, 95% C.I. across subjects). (A) Mean feature decoding accuracy of all units. (B) Mean feature decoding accuracy for individual filters. The feature decoding accuracies of units within the same filters were individually averaged. The filters were sorted in ascending order of the raw feature decoding accuracy averaged for individual filters. https://doi.org/10.1371/journal.pcbi.1006633.g005

To confirm that our method was not restricted to the specific image domain used for the model training, we tested whether it was possible to generalize the reconstruction to artificial images. This was challenging, because both the DNN and our decoding models were trained solely on natural images. The reconstructions of artificial shapes and alphabetical letters are shown in Fig 6A and 6B (also see S10 Fig and S2 Movie for more examples of artificial shapes, and see S11 Fig for more examples of alphabetical letters). The results show that artificial shapes were reconstructed with moderate accuracy (Fig 6C left; 70.5% by pixel-wise spatial correlation, 91.0% by human judgment; see S12 Fig for individual subjects) and that alphabetical letters were reconstructed with high accuracy (Fig 6C right; 95.6% by pixel-wise spatial correlation, 99.6% by human judgment; see S13 Fig for individual subjects). These results indicate that our model did indeed ‘reconstruct’ or ‘generate’ images from brain activity, and that it was not simply making matches to exemplars. Furthermore, the successful reconstructions of alphabetical letters demonstrate that our method can expand the possible states of visualizations, with an advance in resolution over reconstructions performed in previous studies [1, 20].

Fig 6. Seen artificial image reconstructions. The black and gray surrounding frames indicate presented and reconstructed images, respectively (VC activity, DNN1–8, without the DGN). (A) Reconstructions for seen artificial shapes. (B) Reconstructions for seen alphabetical letters. The reconstructed letters were arranged in the word “NEURON”. (C) Reconstruction quality of artificial shapes and alphabetical letters (three subjects pooled, N = 120 and 30 for artificial shapes and alphabetical letters, respectively; chance level, 50%). https://doi.org/10.1371/journal.pcbi.1006633.g006

To assess how the shapes and colors of the stimulus images were reconstructed, we evaluated the reconstruction quality of shape and color separately by comparing reconstructed images of the same colors and shapes. Analyses with different visual areas showed different trends in reconstruction quality for shapes and colors (Fig 7A; see S14 Fig for more examples). Human judgment evaluations suggested that shapes were reconstructed better from early visual areas, whereas colors were reconstructed better from the mid-level visual area V4 (Fig 7B; see S15 Fig for individual subjects; ANOVA, interaction between task type [shape vs. color] and brain areas [V1 vs. V4], P < 0.01), although the interaction effect was marginal when considering evaluations by pixel-wise spatial correlation (P = 0.06). These contrasting patterns further support the success of shape and color reconstructions and indicate that our method can be a useful tool to characterize the information content encoded in the activity patterns of individual brain areas by visualization.

Fig 7. Reconstructions of shape and color from multiple visual areas. (A) Reconstructions of artificial shapes from multiple visual areas (DNN1–8, without the DGN). The black and gray surrounding frames indicate presented and reconstructed images, respectively. (B) Reconstruction quality of shape and color for different visual areas (three subjects pooled, N = 120; chance level, 50%). https://doi.org/10.1371/journal.pcbi.1006633.g007

Finally, to explore the possibility of visually reconstructing subjective content, we performed an experiment in which participants were asked to produce mental imagery of natural and artificial images shown prior to the task session. The reconstructions generated from brain activity during mental imagery are shown in Fig 8 (see S16 Fig and S3 Movie for more examples). While the reconstruction quality varied across subjects and images, rudimentary reconstructions were obtained for some of the artificial shapes (Fig 8A and 8B for high and low accuracy images, respectively). In contrast, imagined natural images were not well reconstructed, possibly because of the difficulty of imagining complex natural images (Fig 8C; see S17 Fig for vividness scores of imagery). While the pixel-wise spatial correlation evaluations of reconstructed artificial images did not show high accuracy (Fig 8D; 51.9%; see S18 Fig for individual subjects), this may have been due to disagreements in position, color, and luminance between target and reconstructed images. Meanwhile, the human judgment evaluations showed accuracy higher than the chance level, suggesting that imagined artificial images were recognizable from the reconstructed images (Fig 8D; 83.2%; one-sided signed-rank test, P < 0.01; see S18 Fig for individual subjects). Furthermore, separate evaluations of color and shape reconstructions of artificial images suggested that shape rather than color made the major contribution to the high proportion of correct answers by human raters (Fig 8E; color, 64.8%; shape, 87.0%; two-sided signed-rank test, P < 0.01; see S19 Fig for individual subjects). Additionally, poor but sufficiently recognizable reconstructions were obtained even from brain activity patterns in the primary visual area (V1; 63.8%; three subjects pooled; one-sided signed-rank test, P < 0.01; see S20 Fig for reconstructed images and S21 Fig and S22 Fig for quantitative evaluations), possibly supporting the notion that low-level visual features are encoded in early visual cortical activity during mental imagery [21]. Taken together, these results provide evidence for the feasibility of visualizing imagined content from brain activity patterns.