Damn you, Peter Jackson!

Let’s end this debate once and for all. Humans can see frame rates greater than 24fps (although plenty of people on the internet will argue that they can’t). I’ll explain more in a future post if necessary, but let’s take that as read.

Once you’ve accepted that fact, the next question is: why do movies at 48fps look “video-y”, and why do movies at 24fps look “dreamy” and “cinematic”? Why do games look more realistic at 60Hz than at 30Hz?

The answer to all of this lies in two things – ocular microtremor, and center-surround receptive fields in the retina. And it predicts where the cut-off lies as well.

Holy oscillating oculomotors, Batman!

You might not know this, but your eyes are wobbling all the time, like a hummingbird on methamphetamines. They just plain jiggle in their sockets. It’s a surprise that you can see anything at all, in fact.

The question is why?

You may already know that you can only see an area of sharp focus roughly the size of a silver dollar held out at arm’s length. This is the part of your retina called the fovea, which is the nice, sharp, color-responsive part of your retina. Your brain stitches together information from this peephole into a version of the world that you actually see. It’s densely packed with color-receptive cells called cones.

Here, go read this Wikipedia article if you need to catch up on your retina knowledge. I’ll wait.

According to this paper (Physical limits of acuity and hyperacuity, Wilson S. Geisler, U Texas) from 1983, the physical limit of acuity for your eye is 6 arcseconds when looking at two parallel thin lines that are really close together (also known as vernier acuity).

Now there’s a formula which tells you the minimum separation you can possibly distinguish between two lines with a camera of a given aperture, and it’s called the Rayleigh criterion. (Rayleigh was a pretty smart physicist who liked to play with waves.)

On that page I just linked, there’s a formula which tells you the best you should be able to hope for, for a human eye, under optimal circumstances:

θ = 1.22×10⁻⁴ rad

… which is 25.16 arcseconds.
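If you want to check that number, here’s a quick back-of-the-envelope sketch in Python (the ~550nm wavelength and ~5.5mm pupil diameter are my assumed inputs – plausible values for green light and a dim-light pupil, not figures from the page linked above):

```python
import math

# Rayleigh criterion: theta = 1.22 * wavelength / aperture diameter.
# Assumed inputs: green light (~550 nm) and a ~5.5 mm pupil.
WAVELENGTH_M = 550e-9
PUPIL_DIAMETER_M = 5.5e-3

theta_rad = 1.22 * WAVELENGTH_M / PUPIL_DIAMETER_M
theta_arcsec = math.degrees(theta_rad) * 3600  # radians -> arcseconds

print(f"{theta_rad:.3e} rad = {theta_arcsec:.2f} arcseconds")
# -> 1.220e-04 rad = 25.16 arcseconds
```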

Yeah. So that’s a lot more than 6 arcseconds.

What’s more, cones themselves are 30-60 arcseconds across – between 5 and 10 times the size of the smallest gap you can see.

So that’s theoretically impossible… Or it would be if your eye was just a simple camera. But it’s not. Your retina is actually a CPU all by itself, and does a lot of processing for you. It also has some pretty specialized elements – like the design of the cones themselves.

Let’s look at a cone…

Cones are highly specialized light-receptor cells that have evolved over millennia to gather as much data as possible (in the form of light). They’re not just simple pixel-readers though – they behave directionally, and prefer to accept light hitting them head-on. This is known as the Stiles-Crawford effect.

The shape of the top of a cone cell is why they’re called cones, and the Stiles-Crawford effect is why they’re cone-shaped. If you can discard light that’s coming off-axis, then you can better determine details – possibly even discriminating diffracted images and making them less fuzzy.

If you look at the picture, the tip of the cone is about 1/3rd the diameter of the cone body. So we can take our 30-60 arcsecond measurement and divide it by 3 to get the actual fine-detail receptive field of the cone – roughly 10-20 arcseconds, give or take.

But now we have gaps in the image. If the sensors are more pin-prick like, how can they discriminate edges that are about the same width as the sensor itself?

All wiggly-jiggly…

The final piece of this puzzle is that the pattern of cones on your retina is not a fixed sensor; the sensor moves.

Ocular microtremor is a phenomenon where the muscles in your eye gently vibrate a tiny amount at roughly 83.68Hz (on average, for most people). (Dominant Frequency Content of Ocular Microtremor From Normal Subjects, 1999, Bolger, Bojanic, Sheahan, Coakley & Malone, Vision Research). It actually ranges from 70-103Hz.

No-one knows quite why your eye does this. (But I think I’ve figured it out).

If your eyes wobble at a known period, they can oscillate so that the light hitting the cones wanders across the cones themselves (each cone is 0.5-4.0µm across, and the wobble is approximately 1 to 3 photoreceptor widths – 150-2500nm – although it’s not precise). We can use temporal sampling, with a bit of post-processing, to generate a higher resolution result than you’d get from a single, fixed cone. What’s more, eyes are biological systems; something needs to compensate for the fact that the little sack of jelly in your eye wobbles whenever you move it anyway – so why not use the extra data for something?
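To make that temporal-sampling trick concrete, here’s a toy 1-D sketch (the scene, the receptor pitch, and the half-width shift are all made-up numbers for illustration, not biology):

```python
import numpy as np

# A toy 1-D "scene": a sharp edge at x = 0.503, which falls between the
# sample points of a coarse, fixed sensor.
def scene(x):
    return (x > 0.503).astype(float)

pitch = 0.01                              # spacing of our coarse "photoreceptors"
positions = np.arange(0.0, 1.0, pitch)

# Two exposures: one at rest, one shifted by half a receptor width
# (standing in for the microtremor wobble).
fixed = scene(positions)
wobbled = scene(positions + pitch / 2)

# A single exposure only localizes the edge to within one full receptor width...
i = int(np.argmax(np.diff(fixed)))
print(f"fixed:    edge between {positions[i]:.3f} and {positions[i + 1]:.3f}")

# ...but interleaving the two exposures halves that uncertainty.
combined = np.empty(2 * len(positions))
combined[0::2], combined[1::2] = fixed, wobbled
j = int(np.argmax(np.diff(combined)))
print(f"combined: edge between {j * pitch / 2:.3f} and {(j + 1) * pitch / 2:.3f}")
# fixed:    edge between 0.500 and 0.510
# combined: edge between 0.500 and 0.505
```

Same sensor, same receptor size – but by combining two shifted exposures you get twice the effective resolution, which is exactly the Super Resolution idea discussed further down.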

Tasty, tasty jelly.

So here’s the hypothesis. The ocular microtremors wiggle the retina, allowing it to sample at approximately 2x the resolution of the sensors. What do we have in the retina that could do this processing though?

Dolby 8.1 Center-Surround… er… Receptors

The receptive field of a sensory neuron is split into the center and the surround. It works like this:

…. and it’s really great for edge detection, which looks like this if you simulate it:

The cool thing is, this means that if you wobble the image, center-surround and off-center/surround cells will fire as they cross edges in the image. This gives you a nice pulse train that can be integrated along with the oscillation control signal, to extract a signal with 2x the resolution or more.
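Here’s a rough simulation of that edge response, using a difference-of-Gaussians as a stand-in for a center-surround receptive field (the kernel size and sigmas are arbitrary picks for illustration, not real ganglion-cell tuning):

```python
import numpy as np

def dog_kernel(size, sigma_center, sigma_surround):
    """1-D difference-of-Gaussians: excitatory center minus inhibitory surround."""
    x = np.arange(size) - size // 2
    center = np.exp(-x**2 / (2.0 * sigma_center**2))
    surround = np.exp(-x**2 / (2.0 * sigma_surround**2))
    return center / center.sum() - surround / surround.sum()

signal = np.zeros(100)
signal[50:] = 1.0                    # a step edge

response = np.convolve(signal, dog_kernel(21, 1.5, 4.0), mode="same")

# Flat regions cancel out (the kernel sums to zero); the cell only "fires"
# in a band a couple of pixels to either side of the edge.
edge_idx = int(np.argmax(np.abs(response)))
print(edge_idx)
```

Sweep that edge back and forth under the kernel – the wobble – and the response becomes exactly the kind of pulse train described above.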

Bonus round: The Uncanny Valley

Nature likes to re-use components, and the center-surround feature of neurons is no exception. I like to think that this is the cause of the Uncanny Valley phenomenon, where the closer to “real” you look without being 100% on the money, the more disconcerting it feels.

Here’s an example from Wired magazine:

This is a big problem for videogames, because it makes getting to photorealistic human characters really difficult. Climbing out of that valley is, in fact, a total bitch. We’ll get there eventually – but there are a lot of subtle details we need to figure out first, which are hard to identify because their processing mostly happens at a pre-verbal, subconscious level in your brain.

Wait a minute. That curve looks a lot like something you might see with a center-surround receptive field. Which looks like this:

Specifically, it’s what you might get if you combine a linear trend line (from less-real to more-real) with a center-surround response in some fashion.

Nature LOVES to reuse building blocks. So it’s quite possible that this response-curve is part of the mechanism that the brain uses to discriminate things – or at least go from gross-feature comparison to high-detail comparison.

Imagine it like this: you’ve got a bunch of cells building up a signal which says “hey, this might be a human!”. That signal grows until more specialized feature-detection mechanisms kick in and say “er, not quite” on top of that original signal. Eventually they say “Yep, that’s it!” – but in the meantime, thanks to the center-surround behavior collating the signals from lots of different gross-feature recognizers, the system barks really loudly in the zone where that cell clicks on, but before you get it right.

So maybe our “this is an X” mechanism works – at the final recognition stages – via center-surround receptive fields.

Anyway, this is a bit off topic.

Side Effects of Ocular Microtremor, and frame rate

Let’s assume that if what you’re seeing is continuously changing and noisy (like real life), your brain can pick the sparse signal out of the data very effectively. It can supersample (as we talked about above), and derive twice the data from it. In fact, the signal has to be noisy for the best results – we know that from a phenomenon known as Stochastic Resonance.
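Here’s a minimal sketch of Stochastic Resonance itself, assuming a simple threshold detector (the signal amplitude, threshold, and noise levels are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A subthreshold signal: on its own it never crosses the detector's threshold,
# so a noiseless threshold detector reports nothing at all.
t = np.linspace(0.0, 1.0, 2000)
signal = 0.8 * np.sin(2 * np.pi * 5 * t)
THRESHOLD = 1.0

def detection_quality(noise_std):
    """Correlate the thresholded (spike-like) output with the true signal."""
    spikes = (signal + rng.normal(0.0, noise_std, t.shape) > THRESHOLD).astype(float)
    if spikes.std() == 0.0:
        return 0.0  # no crossings at all -> nothing detected
    return float(np.corrcoef(spikes, signal)[0, 1])

c_none, c_some, c_lots = (detection_quality(s) for s in (0.0, 0.3, 3.0))
print(c_none, round(c_some, 2), round(c_lots, 2))
# With no noise, nothing gets through. A moderate amount of noise pushes the
# signal's peaks over the threshold, revealing it. Too much noise buries it.
```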

What’s more, if we accept that an oscillation of 83.68Hz allows us to perceive double the resolution, what happens if you show someone pictures that vary (like a movie, or a videogame) at less than half the rate of the oscillation?

We’re no longer receiving a signal that changes fast enough to allow the super-sampling operation to happen. So we’re throwing away a lot of perceived-motion data, and a lot of detail as well.

If it’s updating higher than half the rate of oscillation? As the eye wobbles around, it’ll sample more details, and can use that information to build up a better picture of the world. Even better if we’ve got a bit of film-grain noise in there (preferably via temporal anti-aliasing) to fill in the gaps.

It just so happens that half of 83.68Hz is 41.84Hz – call it roughly 41Hz. So if you’re going to have high resolution pulled properly out of an image, that image needs to be noisy (like film grain) and update at >41Hz. Like, say, The Hobbit. Or any twitch-shooter.

Less than that? Say, 24fps? Or 30fps for a game? You’re below the limit. Your eye will sample the same image twice, and won’t be able to pull out any extra spatial information from the oscillation. Everything will appear a little dreamier, and lower resolution. (Or at least, you’ll be limited to the resolution of the media that is displaying the image, rather than some theoretical stochastic limit).

Some readers of this article have suggested that this is all an artifact of motion-blur – double the frame rate, half the motion-blur, and you naturally get twice the sharpness.

It may play a part – though I’m not sure it plays a large one. For The Hobbit, the shutter was set to 1/64th of a second; for regular movies, the shutter exposes for 1/48th of a second. That’s not halving: half the motion blur of 24fps film would require an exposure time of 1/96th of a second. So I suspect that motion blur isn’t the whole story here.

The supersampling phenomenon has a name

It turns out that there’s an entire field of study in computational optics dedicated to the up-rezzing of images, known as Super Resolution. It lets you take multiple images which would normally look like the one on the left of the image below, and turn them into the image on the right:

(image from the Wikipedia article linked above)

I suspect that ocular microtremors are part of the mechanism that the brain uses to do something similar. If you’re looking at frames of video, it’ll only be able to do its job if you have noise in the signal. Fortunately, most movies still do have random, Poisson-distributed noise – in the form of film grain. (Again, this plays back into that whole Stochastic Resonance phenomenon).

What’s the upshot of all this?

For Movies…

At 48Hz, you’re going to pull more detail out of the scene than at 24Hz, both in terms of motion and spatial detail. It’s going to be more than the 2x information you’d expect just from doubling the frame rate, because you’re also going to get motion information integrated into the signal alongside the spatial information. This is why, for whip-pans and scenes with lots of motion, you’re going to get much better results with an audience at faster frame rates.

Unfortunately, you’re also going to have the audience extracting much more detail out of that scene than at 24Hz – which makes it all look fake (because they can see that, well, the set is a set), and it’ll look video-y instead of dreamy, because of the extra motion extraction which can be done when your signal changes at 40Hz and above.

The short version is: to be “cinematic”, you really need to be well under 41Hz, but above the rate where motion becomes jerky – the threshold of the phi phenomenon, or “apparent motion”, which is ~16Hz – so that the motion still looks like motion.

Ah, you might be thinking… but video is 29.97Hz (for NTSC). Why does it look video-y?

Video isn’t really 29.97Hz…

It’s actually 59.94 fields per second for broadcast video. It’s just interlaced, so you only show half of the lines from each frame, every 1/60th of a second. They don’t do this:

Snapshot –> Display Odd Lines –> Display Even Lines

… they do this:

Snapshot –> Display Odd Lines –> Snapshot –> Display Even Lines

… which is a whole different beast. (They may not even snapshot at all, depending on the camera; they may just sample the entire line as they shift it out really really fast from the CCD… so it becomes continuous – even though that may lead to rolling problems due to pixel persistence).

In other words, broadcast video updates above the ocular microtremor sampling Nyquist frequency, thanks to interlacing.

For Videogames

This is going to be trickier. Unlike film – which has nice grain, at least 4K of resolution (in reality more like 6000 ‘p’ horizontally for 35mm film and 12000 ‘p’ for IMAX), and no “pixels” per se thanks to the film grain (although digital has meant we need to recreate some of this) – we’re dealing with a medium that is resolution-limited: most games render at 1920x1080 or lower. So we can’t get around our limitations in the same way. You can see our pixels. They’re bigger. And they’re laid out in a regular grid.

So if you really want the best results, you need to do your games at 12000x6750. Especially if someone’s borrowing an IMAX theatre to play them in.

Let’s get real.

Higher resolution vs. frame rate is always going to be a tradeoff. That said, if you can do >~38-43 fps, with good simulated grain, temporal antialiasing or jitter, you’re going to get better results than without. Otherwise jaggies are going to be even more visible, because they’re always the same and in the same place for over half of the ocular microtremor period. You’ll be seeing the pixel grid more than its contents. The eye can’t temporally alias across this gap – because the image doesn’t change frequently enough.

Sure, you can change things up – add a simple noise/film grain at lower frame rates to mask the jaggies – but in some circumstances you may get better results at >43fps with 720p than at 30fps with 1080p plus jittering or temporal antialiasing. Past a certain point, the extra resolution should paper over the cracks a bit. That holds as long as you’re dealing with scenes with a lot of motion – if you’re showing mostly static scenes, have a fixed camera, or are in 2D? Use more pixels.

If you can use a higher resolution and a faster frame rate, obviously, you should go for that.

So my advice is:

Aim for a frame rate > ~43Hz. On modern TV sets, this means pretty much 60Hz.

Add temporal antialiasing, jitter or noise/film grain to mask over things and allow for more detail extraction. As long as you’re actually changing your sampling pattern per pixel, and simulating real noise – not just applying white noise – you should get better results.

If you can still afford it, go for higher resolution.
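On the temporal antialiasing/jitter point above: one common way games generate those per-frame sub-pixel offsets is a low-discrepancy Halton sequence. A minimal sketch (the 8-frame cycle and the base-2/base-3 pairing are conventional picks, not requirements):

```python
# Low-discrepancy Halton sequence: the radical inverse of `index` in `base`,
# giving well-spread values in [0, 1).
def halton(index, base):
    f, result = 1.0, 0.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

# Per-frame sub-pixel jitter for temporal antialiasing. Bases 2 and 3 give a
# well-distributed 2-D pattern; subtracting 0.5 centers it on the pixel.
def taa_jitter(frame, cycle=8):
    i = (frame % cycle) + 1        # Halton index 0 is degenerate, so start at 1
    return halton(i, 2) - 0.5, halton(i, 3) - 0.5

for frame in range(4):
    x, y = taa_jitter(frame)
    print(f"frame {frame}: ({x:+.4f}, {y:+.4f})")
```

Each frame the projection matrix gets nudged by one of these offsets, so the sample grid shifts within the pixel over the cycle – the renderer’s equivalent of the microtremor wobble.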

As a bonus, at higher frame rates you can respond more quickly to the action in the game – which is essential for twitch games, where responding to the game matters. This is mostly a side effect of lower end-to-end latency (game loops are generally locked to how fast they can present a new frame, and it’s rare for input to be decoupled from this – so a faster frame rate means lower input lag). It may also be due to being able to see changes in the game more quickly as well – after all, it’s updating twice as fast at 60Hz. If the ocular microtremors play a part, that mechanism may also allow better motion extraction from the cones.

Of course, realistically, the proof of the pudding is in the eating. The only true test is to experiment, and get a variety of people to do an A/B comparison. If more people prefer one over the other, and you have a large enough sample size? Go with the one more people like.

Backing it all up with some evidence

So it looks like this post went a wee bit viral.

In response, I guess I need to back up a few of my more tenuous science claims – hey, this is a blog post, ok? I didn’t submit it for publication in a journal, and I figured the rigorous approach would be an instant turn-off for most. Still, enough people have questioned the basis for this – so I’m going to at least bolster up the basic idea (ocular microtremor + stochastic resonance are used for hyperacuity).

So I did some digging around, and here we go – have a paper from actual scientists who did actual experiments (hey! it looks like other people got here before me… but I don’t think anyone’s tied it all back to its interaction with frame-rate before):

[PDF] Stochastic resonance in visual cortical neurons: does the eye-tremor actually improve visual acuity? – Hennig, Kerscher, Funke, Wörgötter; Neurocomputing Vol 44-46, June 2002, p.115-120

Abstract: We demonstrate with electrophysiological recordings that visual cortical cell responses to moving stimuli with very small amplitudes can be enhanced by adding a small amount of noise to the motion pattern of the stimulus. This situation mimics the micro-movements of the eye during fixation and shows that these movements could enhance the performance of the cells. In a biophysically realistic model we show in addition, that micro-movements can be used to enhance the visual resolution of the cortical cells by means of spatiotemporal integration. This mechanism could partly underlie the hyperacuity properties of the visual system. “The stimulus used in the simulations is a typical vernier stimulus, which consists of two adjoining bars with a small relative displacement of d = 7.5”. The displacement is smaller than the distance between two photoreceptors (30”) and thus cannot be resolved. Hyperacuity, though, allows the detection of displacements in the order of 4” to 10”, which so far has been attributed to the spatial sampling of the ganglion cells […] we investigated the role of micro-movements on the resolution of vernier stimuli. […] It proved to be noise in the amplitude and frequency range of the ocular microtremor that shows a strong effect on acuity” “[…] microtremor has an even stronger impact. As it’s amplitude increases, the discriminability reaches much higher values for low amplitudes.”

Given that this was a study performed on cat retinas, you might be wondering whether it still applies to humans. Well, it turns out that cats have these things called X- and Y-cells, which are retinal ganglion cells that show (for X) “brisk-sustained” and (for Y) “brisk-transient” responses.

[PDF] Sustained and transient neurones in the cat’s retina and lateral geniculate nucleus – Cleland, Dubin, Levick; J. Physiol Sep 1971; 217(2); 473-496

I won’t list all of the papers on this, but if you do a search on “cat retina X Y ganglion”, you can find a bunch of them.

These are the cells which do this kind of processing. There wasn’t much proof that they exist in humans until 2007 – here’s the layman’s summary from Science Daily from that discovery. They used a rather neat-looking sensor to do the work – pretty much jamming it right up against the retina to see how it ticked:

(Half Life 3 confirmed)

This thing is the size of a pinhead, and contains 512 electrodes, and was used to record the activity of 250 retinal cells simultaneously – of which 5-10 were the data-processing upsilon cells.

And here’s the paper:

[HTML] Identification and Characterization of a Y-Like Primate Retinal Ganglion Cell Type – Petrusca, Grivich, Sher, Field, Gauthier, Greschner, Shlens, Chichilnisky, Litke; J. Neuroscience, October 2007, 27(41): 11019-11027

If you base your conclusion on the first paper I linked to, it’s likely that the Y-Cells are at least in part responsible for this response. Could be the midget cells though; I’m not sure.

Edit: 12/22 – made some edits to clean up a few sentences that were way too woolly, and added a little new material, including links to Super Resolution. I didn’t expect this post to blow up like this – normally my blog posts don’t make a splash like this. Thanks for popping by!

Edit: 12/23 – moar papers from real, actual researchers to back it up

Some of this post is speculation – at least until experiments are performed on this. It may actually be real new science by the end of the day. I’d love to hear from any actual professionals in the field who’ve done research in this area.