$\begingroup$

I think you are succumbing to the homunculus argument, the fallacy that there is some sort of image in the brain for someone to view. There is no magical theater in your head where whatever is incident on your retina gets projected. All you have in your brain is complicated patterns of neural activity; there are no images and nothing to view. However, these patterns of activity give rise to your phenomenological experience. To fully understand this you should ask:

What are current neuronal explanations and models of 'consciousness'?

But let's try to clear up some of the conceptual difficulties with vision in particular. Your experience of the visual world is shaped by two types of input: (1) the data from your retina, and (2) data from the rest of your senses, as well as memory. Why is it obvious that not everything comes from (1)? Consider the following:

You experience a whole visual scene; there isn't a patch of nothingness somewhere in it. Yet your retina has a blind spot, so something fills in that part of your experience for you.

You experience certain faraway buildings as being farther away than nearby buildings. Yet your eyes are too close together for the difference in angle between the two images to be measurable at the resolution of your retina. How does your mind know the buildings are farther away? From parallax as you move, and from memories of how big certain objects typically are and how their apparent size scales with distance.
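To put rough numbers on the two-eyes argument, here is a back-of-the-envelope sketch. The eye separation (6.5 cm), the resolution limit (about 1 arcminute), and the example distances are all assumptions chosen for illustration, not measurements from any study, and the function names are made up for this example.

```python
import math

# Assumed values for illustration only.
EYE_SEPARATION_M = 0.065   # typical distance between the two pupils
ACUITY_ARCSEC = 60.0       # assumed resolution limit: about 1 arcminute


def vergence_angle_arcsec(distance_m: float) -> float:
    """Angle between the two eyes' lines of sight to a point straight ahead."""
    angle_rad = 2.0 * math.atan(EYE_SEPARATION_M / (2.0 * distance_m))
    return math.degrees(angle_rad) * 3600.0


def disparity_arcsec(near_m: float, far_m: float) -> float:
    """Difference in vergence angle between a nearer and a farther object."""
    return vergence_angle_arcsec(near_m) - vergence_angle_arcsec(far_m)


if __name__ == "__main__":
    # One pair of objects at arm's length, one pair of distant buildings.
    for near, far in [(0.5, 1.0), (1000.0, 2000.0)]:
        d = disparity_arcsec(near, far)
        verdict = "above" if d > ACUITY_ARCSEC else "below"
        print(f"{near} m vs {far} m: {d:.1f} arcsec ({verdict} the assumed limit)")
```

Under these assumptions the angular difference for objects within arm's reach is thousands of arcseconds, while for buildings one and two kilometres away it is only a few arcseconds. The conclusion depends on the assumed threshold, so change `ACUITY_ARCSEC` to test other assumptions.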

If we concentrate only on input (1), then all the information is there from the retina onward, and it only degrades (as part of the signal is thrown away, compressed, or corrupted by noise) on its way to V1 and beyond. However, its encoding changes to become more compatible with integration with other sensory and memory information. By the time the data has reached V1 and V2, it is in an encoding that we understand well enough to reconstruct videos of what people are seeing/experiencing. As the Gallant Lab, which ran the linked study, summarizes:

The human visual system consists of several dozen distinct cortical visual areas and sub-cortical nuclei, arranged in a network that is both hierarchical and parallel. Visual information comes into the eye and is there transduced into nerve impulses. These are sent on to the lateral geniculate nucleus and then to primary visual cortex (area V1). Area V1 is the largest single processing module in the human brain. Its function is to represent visual information in a very general form by decomposing visual stimuli into spatially localized elements. Signals leaving V1 are distributed to other visual areas, such as V2 and V3. Although the function of these higher visual areas is not fully understood, it is believed that they extract relatively more complicated information about a scene. For example, area V2 is thought to represent moderately complex features such as angles and curvature, while high-level areas are thought to represent very complex patterns such as faces. The encoding model used in our experiment was designed to describe the function of early visual areas such as V1 and V2, but was not meant to describe higher visual areas. As one might expect, the model does a good job of decoding information in early visual areas but it does not perform as well in higher areas.
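To make "encoding model" and "decoding" concrete, here is a toy sketch. It is not the Gallant Lab's model: it assumes a made-up linear encoder with random weights, tiny random "stimuli", and a decoder that simply picks, from a set of candidates, the stimulus whose predicted response best matches the observed one. All names and sizes are hypothetical.

```python
import random


def make_encoder(n_neurons, n_pixels, seed=0):
    """Hypothetical encoding model: one random weight vector per 'neuron'."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_pixels)]
            for _ in range(n_neurons)]


def encode(weights, stimulus, noise_sd=0.0, rng=None):
    """Predict each neuron's response as a weighted sum of pixel values."""
    rng = rng or random
    responses = []
    for row in weights:
        r = sum(w * s for w, s in zip(row, stimulus))
        if noise_sd > 0.0:
            r += rng.gauss(0.0, noise_sd)  # measurement noise
        responses.append(r)
    return responses


def decode(weights, observed, candidates):
    """Return the index of the candidate whose predicted response
    is closest (in squared error) to the observed response."""
    def error(stimulus):
        predicted = encode(weights, stimulus)
        return sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return min(range(len(candidates)), key=lambda i: error(candidates[i]))


if __name__ == "__main__":
    rng = random.Random(42)
    weights = make_encoder(n_neurons=20, n_pixels=8)
    candidates = [[rng.random() for _ in range(8)] for _ in range(5)]
    # Simulate a noisy 'recording' made while candidate 3 was shown.
    observed = encode(weights, candidates[3], noise_sd=0.1, rng=rng)
    print("decoded candidate:", decode(weights, observed, candidates))
```

Nothing in `observed` is a picture. It is only a list of numbers, and the "image" exists only in the candidate set the decoder is given.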

Remember, there is no video in those areas. There is just the firing of neurons, which the scientists have figured out how to decode and interpret. As the quote mentions, the higher visual areas are not well understood right now, but presumably that is where a lot of the type (2) feedback happens. Even within the moderately understood visual areas, a lot of processing is distributed. For instance, take a look at the question about face blindness:

Does the fusiform face area in patients with Prosopagnosia (face blindness) show lower activity under an fMRI?

With damage to one part of the brain (the fusiform face area), you can continue to 'see' tables and chairs perfectly well, and yet you can't properly identify or recognize faces.

Hopefully this convinces you that it doesn't make sense to look for 'the image' in the brain. Together the mind and the eye shape what you perceive and give it meaning, but it is a pseudo-question to ask where that image is finally assembled. It is not assembled. There is no image; there is only the encoding of retinal activity into higher-level firing patterns that produce in us the experience of vision and meaning.