How does JPEG actually work?

04/12/2006

8 minutes to read

In this article

JPEG is an image encoding designed to compress photographs and

similar images effectively, often 5 to 15 times over a raw bitmap

format. It's a lossy format that exploits properties of human vision to

eliminate information that is difficult to distinguish. You see JPEGs

all the time, and if you've seen a JPEG compressed with a low quality

level, you have a good idea of the kinds of visual artifacts that

emerge when JPEG compression is pushed to its limits. But how does it

really work, and how does this cause JPEG artifacts?

The JPEG standard defines a number of different encoding options,

and there are a number of container file formats defined other than the

usual JFIF, but I'm going to discuss a single popular JPEG encoding

method used by many applications. This encoding combines three

different completely independent methods of image compression:

downsampling the chrominance channels, quantization of the discrete

cosine transformation, and entropy coding.

Method 1: Downsampling the chrominance channels

The most obvious way to make an image file smaller is to make the

image smaller - for example, a 100×100 bitmap is 1/4 the size of a

200×200 bitmap. You can expand the image back out to the original size

when viewing it by simply upsampling it. This isn't a great compression

method, since it causes clearly visible artifacts, usually blocky or

blurry effects, especially around detailed, sharp areas.

A different approach is instead of downsampling the entire image, to

only downscale certain channels of the image. For example, the eye's

red-green sensors are more numerous and sensitive than its blue-yellow

sensors, so if we shrink just the blue channel and later re-expand it,

the effect is much less noticable. But the RGB color space is not our

only choice - images can be encoded in many color systems, and for each

of these the channels we choose to downsample will have different

visual effects. This experiment

by Eric Setton of Stanford University demonstrates the effect on image

quality of downsampling various channels in various color systems.

The results are visually obvious: we are sensitive to red and green,

but even more sensitive to intensity or luminance, the relative

brightness of pixels compared to other nearby pixels. This is expressed

by the Y component of the YCbCr system and the V component of the HSV

system. Any downsampling of this information is visually disastrous. On

the other hand, downsampling the chrominance, expressed by the

Cb and Cr components of the YCbCr system, has almost no visual effect

(Cb and Cr stand for "chrominance blue" and "chrominance red" - see a sample image broken up into YCbCr channels in the middle of this page).

For this reason, typical JPEG files convert images to the YCbCr color

space, and then downsample both the Cb and Cr channels by a factor of 2

in both width and height. Further processing of the image proceeds on

each of the three channels independently.

This is the method in JPEG encoding largely responsible for the

artifact known as "color bleeding", where along sharp edges the color

on one side can "bleed" onto the other side. This is because the

chrominance channels, which express the color of pixels, have had each

block of 4 pixels averaged into a single color, and some of these

blocks cross the sharp edge.

Method 2: Quantization of the discrete cosine transformation

Normally, we view an image as simply a grid of pixels, and if we

alter a pixel or group of pixels dramatically, the effect will be

visually obvious. A different approach to storing images is to come up

with a set of "basis" images which can be added together in various

combinations to form the image we want. JPEG encodes each 8-by-8 square

of pixels using the following basis image set of 64 images:

We assign each of these images a relative weight determining how

much influence it has on the image. This weight can be positive or

negative (when negative, the basis image is subtracted). Taken

together, these 64 weights can describe any 8×8 image at all, and they

describe the image precisely. This example at Wikipedia

shows an 8×8 image, its original pixel array, and its new 8×8 array of

weights after being converted using the discrete cosine transform

(which I'll discuss later).

But what's the point of transforming 64 pixel values into 64

weights? For one thing, the resulting data may exhibit more structure,

and so be easier to compress. For example, with blocks from typical

photographs, the basis images in the upper left above have large

weights and the ones in the lower right have small weights.

But to store the weights, which are real numbers, we have to encode

them somehow. The simplest encoding is to just convert them to integers

between -k and k by scaling and rounding them. This works, but there's a tradeoff in our choice of k:

if it's too big, the encoded image will take too much space, but if

it's too small, the rounding might visibly distort the image, since

it's effectively altering the weights.

A more effective approach is to use a different k for each

basis image. This works well because the visual difference caused by

variations in the weights of different basis images is different. The

eye can detect small variations of the weights of the images in the

upper-left, but can overlook much larger variations in the weights of

the images in the lower-right, so we can get away with smaller k and larger rounding error for these weights.

In practice, rather than deal with expensive floating-point numbers,

we round the initial weights to integers and later divide them by fixed

factors called the quantization factors using integer division.

During decoding, we multiply them by these factors again. The

upper-left images have small quantization factors, so that little

information is lost in this process, while the lower-right images have

larger factors. This example shows a commonly used matrix of quantization

factors, one for each basis image, and the result of quantizing the

sample image using them. Observe how much simpler the matrix becomes -

it now contains a large number of entries that are small or zero,

making it much easier to compress. This example

shows how the weight values expand back out during decoding - notice

that although some of the weights are now different, the visual

difference is almost undetectable.

Quantization is the primary source of JPEG artifacts. Because the images in the lower-right

tend to have the largest quantization divisors, JPEG artifacts will

tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by

altering the JPEG's "quality level", which scales its values up or down.

Quantization compresses the image in two important ways: one, it

limits the effective range of the weights, decreasing the number of

bits required to represent them. Two, many of the weights become identical or zero,

improving compression in the third step, entropy coding.

But this discussion leaves one important question unanswered: how do

we take an arbitrary square of 8×8 pixels and decompose it into a

combination of the basis images? As it turns out, we specifically chose

the images above in a way to make this easy. These images are formed by

cosine waves of various frequencies in the x and y directions, and a

transformation called the discrete cosine transform from Fourier transformation theory allows us to rapidly decompose a signal into these parts in only O(n log n)

time. Because of the importance of JPEGs, over the years very fast and

clever specialised solutions for 8×8 blocks have been created in both

software and hardware. This is also the reason we use small,

constant-sized blocks instead of processing the image as a whole: the

cosine transform step doesn't need to scale up to large bitmaps, so the

transformation as a whole can run in linear (O(n)) time.

Method 3: Entropy coding

Unlike the other steps, entropy coding is lossless. If you've

studied the simple compression techniques used in ZIP files and BMP

files, the techniques of entropy coding should come as no surprise.

We start by turning our 8×8 grid of integers into a linear sequence

of integers. The obvious row-major and column-major orders are not

ideal here, since we expect there to be many zeros in the lower-right

area. Instead, we start with the top-left corner and zig-zag in a

diagonal pattern across the matrix, going back and forth until we reach

the lower-right corner.

Next, we apply the same run-length encoding algorithm used by GIF

files to condense sequences of identical integer values. In areas where

the values are small, we can expect many such runs, particularly runs

of zeros. We use a special code to say "all the remaining values are

zero", which comes in handy.

Finally, we apply Huffman compression to the remaining sequence,

which is a standard lossless compression technique that encodes

frequently used values using less bits than infrequently used values.

Conclusion

And that's it! For each YCbCr channel, for each 8×8 block, we

perform these transformations, and we store the results in a bitstream.

This data is wrapped up in a JFIF container giving metadata about the

image, and the file goes out ready for web browsers to view, completely

unaware of all the data that has been lost. Just don't go sending JPEGs

into space and expect aliens to see them the same way.

Finally, no discussion of the JPEG encoding would be complete

without noting that JPEG is based on research ideas that are now very

old and largely supplanted. The more modern JPEG 2000 encoding

based on wavelets can be pushed much farther before visible artifacts

appear, and includes a surprisingly efficient lossless mode. Of course,

all this improvement is rather moot until common applications actually

support it.