Introduction

After reading an article called "How Google Earth Works" on the great site HowStuffWorks.com, it became apparent that the article was more of a "how cool it is" and "here’s how to use it" than a "how Google Earth [really] works."

So I thought there might be some interest, and despite some valid intellectual property concerns, here we are, explaining how at least part of Google Earth works.



Keep in mind, those IP issues are real. Keyhole (now known as Google Earth) was attacked once already with claims that they copied someone else’s inferior (IMO) technology. The suit was completely dismissed by a judge, but only after many years of pain. Still, it highlights one problem of even talking about this stuff. Anything one says could be fodder for some troll to claim he invented what you did because it "sounds similar." The judge in the Skyline v. Google case understood that "sounding similar" is not enough to prove infringement. Not all judges do.

Anyway, the solution to discussing "How Google Earth [Really] Works" is to stick to information that has already been disclosed in various forms, especially in Google’s own patents, of which there are relatively few. Fewer software patents is better for the world. But in this case, more patents would mean we could talk more openly about the technology, which, btw, was one of the original goals of patents — a trade of limited monopoly rights in exchange for a real public benefit: disclosure. But I digress…

For the more technically inclined, you may want to read these patents directly. Be warned: lawyers and technologists sometimes emulsify to form a sort of linguistic mayonnaise, a soul-deadening substance known as Patent English, or Painglish for short . If you’re brave, or masochistic, here you go:

1. Asynchronous Multilevel Texture Pipeline

2. Server for geospatially organized flat file data

There are also a few more loosely related Google patents. I don’t know why these are shouting, but perhaps because they’re very important to the field. I’ll hopefully get to these in more detail in future articles:

3. DIGITAL MAPPING SYSTEM

4. GENERATING AND SERVING TILES IN A DIGITAL MAPPING SYSTEM

5. DETERMINING ADVERTISEMENTS USING USER INTEREST INFORMATION AND MAP-BASED LOCATION INFORMATION

6. ENTITY DISPLAY PRIORITY IN A DISTRIBUTED GEOGRAPHIC INFORMATION SYSTEM (this one will be huge)

And there is this more informative technical paper from SGI (PDF) on hardware "clipmapping," which we’ll refer to later on. Michael Jones, btw, is one of the driving forces behind Google Earth, and as CTO, is still advancing the technology.

I’m going to stick closely to what’s been disclosed or is otherwise common technical knowledge. But I will hopefully explain it in a way that most humans can understand and maybe even appreciate. At least that’s my goal. You can let me know.

Big Caveat: the Google Earth code base has probably been rewritten several times since I was involved with Keyhole and perhaps even after these patents were submitted. Suffice it to say, the latest implementations may have changed significantly. And even my explanations are going to be so broad (and potentially out-dated) that no one should use this article as the basis for anything except intellectual curiosity and understanding.

Also note: we’re going to proceed in reverse, strange as it may seem, from the instant the 3D Earth is drawn on your screen, and later trace back to the time the data is served. I believe this will help explain why things are done as they are and why some other approaches don’t work nearly as well.

Part 1, The Result: Drawing a 3D Virtual Globe

There are two principal differences between Google Maps and Earth that inform how things should ideally work under the hood. The first is the difference between fixed-view (often top-down) 2D & free-perspective 3D rendering. The second is between real-time and pre-rendered graphics. These two distinctions are fading away as the products improve and converge. But they highlight important differences, even today.

What both have in common is that they begin with traditional digital photography — lots of it — basically one giant high-resolution (or multi-resolution) picture of the Earth. How they differ is largely in how they render that data.

Consider: The Earth is approximately 40,000 km around the waist. Whoever says it’s a small world is being cute. If you stored only one pixel of color data for every square kilometer of surface, a whole-earth image (flattened out in, say, a mercator projection) would be about 40,000 pixels wide and roughly half as tall. That’s far more than most 3D graphics hardware can handle today. We’re talking about an image of 800 megapixels and 2.4 gigabytes at least. Many PCs today don’t even have 2GB of main memory. And in terms of video RAM, needed to render, a typical PC has maybe 128MB, with a high-end gaming rig having upwards of 512.

And remember, this is just your basic run-of-the-mill one-kilometer-per-pixel whole-earth image. The smallest feature you could resolve with such an image is about 2 kilometers wide (thank you, Mr. Nyquist) — no buildings, rivers, roads, or people would be apparent. But for most major US cities, Google Earth deals in resolutions that can resolve objects as small as half a meter or less, at least four thousand times denser, or sixteen million times more storage than the above example.

We’re talking about images that would (and do) literally take many terabytes to store. There is no way that such a thing could ever be drawn on today’s PCs, especially not in real-time.

And yet it happens every time you run Google Earth.

Consider: In a true 3D virtual globe, you can arbitrarily tilt and rotate your view to pretty much look anywhere (except perhaps underground — and even that’s possible if we had the data). In all 3D globes, there exists some source data, typically, a really high-resolution image of the whole earth’s surface, or at least the parts for which the company bought data. That source data needs to be delivered to your monitor, mapped onto some virtual sphere or ideally onto small 3D surfaces (triangles, etc..) that mimic the real terrain, mountains, rivers and so on.

If you, as a software designer, decide not allow your view of the Earth to ever tilt or rotate, then congrats, you’ve simplified the engineering problem and can take some time off. But then you don’t have Google Earth.

Now, various schemes exist to allow one to "roam" part of this ridiculously large texture. Other mapping applications solve this in their own way, and often with significant limitations or visual artifacts. Most of them simply cut their huge Earth up into small regular tiles, perhaps arranged in a quadtree, and draw a certain number of those tiles on your screen at any given time, either in 2D (like Google Maps) or in 3D, like Microsoft’s Virtual Earth apparently does.

But the way Google Earth solved the problem was truly novel, and worthy of a software patent (and I am generally opposed to software patents). To explain it, we’ll have to build up a few core concepts. A background in digital signal theory and computer graphics never hurts, but I hope this will be easy enough that that won’t be necessary.

I’m not going to explain how 3D rendering works — that’s covered elsewhere. But I am going to focus on texture mapping and texture filtering in particular, because the details are vital to making this work. The progression from basic concepts to the more advanced texture filtering will also help you understand why things work this way, and just how amazing this technology really is. If you have the patience, here’s a very quick lesson in texture filtering.

The Basics

The problem of scaling, rotating and warping basic 2D images was solved a long time ago. The most common solution is called Bilinear Filtering. All that really means is that for each new (rotated, scaled, etc..) pixel you want to compute, you take the four "best" pixels from your source image and blend them together. It’s "bilinear" because it linearly blends two pixels at a time (along one axis), and then linearly blends those two results (along the other axis) for the final answer.

[A "linear blend," in case it’s not clear, is butt simple: take 40% of color A, and 60% of color B and add them together. The 40/60 split is variable, depending on how "important" each contributor is, as long as the total adds up to 100%.]

That functionality is built into your 3D graphics hardware such that your computer can nowadays do literally billions of these calculations per second. Don’t ask me why your favorite paint program is so slow.

The problem being addressed can be visualized pretty easily — that’s what I love about computer graphics. It turns out, whenever we map some source pixels onto different (rotated, scaled, tilted, etc…) output pixels, visual information is lost.

The problem is called "aliasing" and it occurs because we digitally sampled the original image one way, at some given frequency (aka resolution), and now we’re re-sampling that digital data in some other way that doesn’t quite match up.







1. A simple low-res (11×11 pixel) image is about to be rotated. (the grid lines are merely to delineate pixels)

3. Close up of one output pixel. Bilinear interpolation averages the "best" four source pixels for each new destination pixel (shown as black border with white dots) based on their relative importance (ideally: fractional area).



2. Each pixel in the destination grid overlaps multiple pixels from the rotating original.





4. After bilinear interpolation, the resulting rotated image has some clear (or rather blurry) issues.



Now, when we talk about output pixels and destinations, it doesn’t much matter if the destination is a bitmap in a paint program or the 3D application window that shows the Earth. Aliasing happens whenever the output pixels do not line up with the sampling interval (frequency, resolution) of the source image. And aliasing makes for poor visual results. Dealing with aliasing is about half of what texture mapping is all about. The rest is mostly memory management. And the constraints of both inform how Google Earth works.

The mission then is to minimize aliasing through cleverness and good design. The best way to do this is to get as close as possible to a 1:1 correspondence between input and output pixels, or at least to generate so many extra pixels that we can safely down-sample the output to minimize aliasing (also known as "anti-aliasing"). We often do both.

Consider: for resizing images, it only gets worse — each pixel in your destination image might correspond to hundreds of pixels of source imagery, or vice-versa. Bilinear interpolation, remember, will only pick the best four source pixels and ignore the rest. So it can therefore skip right over important pixels, like edges, shadows, or highlights. If some such pixel is picked for blending during one frame and skipped over subsequently, you’ll get an ugly "pixel-popping" or scintillation effect. I’m sure you’ve seen it in some video games. Now you know why.

Tilting images (or any 3D transformation) is even more problematic, because now we have elements of scaling and rotation, but also a great variation in pixel density across rendered surfaces. For example, in the "near" part of a scene, your nice high-res ground image might be scaled up such that the pixels look blurry. In the "far" part of the scene, your image might appear scintillated (as above) because simple 2×2 bilinear interpolation is necessarily skipping important visual details from time to time.

Copyright, Microsoft Virtual Earth

Here’s an example of where a certain kind of texture filtering causes poor results. The text labels are hardly readable (why they’re painted into the terrain image at all is another issue).

Better Filtering, Revealed

Most consumer 3D hardware already supports what’s called "tri-linear" filtering. With tri-linear and a closely coupled technique called mip-mapping, the hardware computes and stores a series of lower resolution versions of your source image or texture map. Each mip-map is automatically down-sampled by a factor of 2, repeatedly, until we reach a 1×1 pixel image whose color is the average of all source image pixels.

So, for example, if you provided the hardware with a nice 512×512 source image, it would compute and store 8 extra mip-levels for you (256, 128, 64, 32, 16, 8, 4, 2, and 1 pixel square). If you stacked those vertically, you might more easily visualize the "mip-stack" as an upside down pyramid, where each mip-level (each horizontal slice) is always 1/2 the width of the one above.

Drawing of a mip-map

pyramid (not to scale). The X depicts trilinear filtering sampling two mip-levels for an in-between pixel to reduce aliasing.

During 3D rendering, mip-mapping and tri-linear filtering take each destination pixel, pick the two most appropriate mip-levels, essentially do a bi-linear blend on both, and then blend those two results again (linearly) for the final tri-linear answer.

So for example, say the next pixel would have no aliasing if only the source image had a resolution of 47.5 pixels across. The system has stored power of two mip maps (16, 32, 64…). So the hardware will cleverly use the 64×64 and 32×32 pixel versions closest to the desired sampling of 47.5, compute a bilinear (4-sample) result for each, and then take those two results and blend them a third time.

That’s tri-linear filtering in a nutshell, and along with mip-mapping, it goes a great distance to minimizing aliasing for many common cases of 3D transformations.

Remember: so far, we’ve been talking about nice, small images, like 512×512 pixels. Our whole-earth image will need to be millions of pixels across. So one might consider making a giant mip-map of our whole-earth image, at say one meter resolution. No problem, right? But you’ll realize fairly soon that would require a mip-map pyramid 26 levels deep, where the highest resolution mip-level is some 66 million pixels across. That simply won’t fit on any 3D video card on the market, at least not in this decade.

I’m guessing Microsoft’s Virtual Earth gets around this limit by cutting their giant earth texture into many smaller distinct tiles of, say, 256 pixels square, where each gets mip-mapped individually. That approach would work to an extent, but it would be relatively slow and give some of the visual artifacts, like the blurring we see above, and a popping in and out of large square areas as you zoom in and out.

There’s one last concept about mip-maps to understand before we move on to the meat of the issue. Imagine for a moment that the pixels in the mip-map pyramid are actually color-coded as I’ve indicated above, with an entire layer colored red, another yellow, etc.. Drawing this on a tilted plane (like the Earth’s ground "plane") would then seem to "slice through" the pyramid at an interesting angle, using only those parts of the pyramid that are needed for this view.

It’s this characteristic of mip-mapping that allows Google Earth to exist, as we’ll see in a minute.



A typical tilted Google Earth image (copyright & courtesy Google).



The same view, using color to show which mip levels inform which pixels

The example on the left shows a normal 3D scene from Google Earth, as well as a rough diagram showing from where in the mip-stack a 3D hardware system might find the best source pixels, if they were so colorized.

The nearer area gets filled from the highest-resolution mip-level (red), dropping off to lower and lower resolutions as we get farther from the virtual point of view. This helps avoid the scintillation and other aliasing problems we talked about earlier, and looks quite nice. We get as close as possible to a 1:1 correspondence between source and destination, pixel for pixel, so aliasing is minimized.

Even better still, tri-linear filtering 3D graphics hardware has been improved with something called anisotropic filtering (a simple preference option in Google Earth) which is essentially the same core idea as the previous examples, but using non-square filters, beyond the basic 2×2. This is very important for visual quality, because even with fancy mip-mapping, if you tilt a textured polygon to a very oblique angle, the hardware must choose a low-resolution mip-level to avoid scintillation on the narrow axis. And that means the whole polygon is sampled at too-low a resolution, when it’s only one direction that needed to dip down to the low-res stuff. Suffice it to say, if your hardware supports anisotropic filtering, turn it on for best results. It’s worth every penny.

Now, to the meat of the issue

We still have to solve the problem of how to mip-map a texture with millions of pixels in either dimension. Universal Texture (in the Google Earth patent) solves the problem while still providing high quality texture filtering. It creates one giant multi-terabyte whole-earth virtual-texture in an extremely clever way. I can say that since I didn’t actually invent it. Chris Tanner figured out a way to do on your PC what had only ever been done on expensive graphics supercomputers with custom circuitry, called Clip Mapping (see SGI’s pdf paper, also by Chris, Michael, et al., for a lot more depth on the original hardware implementation). That technology is essentially what made Google Earth possible. And my very first job on this project was making that work over an internet connection, way back when.

So how does it actually work?

Well, instead of loading and drawing that giant whole-earth texture all at once — which is impossible on most current hardware — and instead of chopping it up into millions of tiles and thereby losing the better filtering and efficiency we want, recall from just above that we typically only ever use a narrow slice or column of our full mip-map pyramid at any given time. The angle and height of this virtual column changes quite a bit depending on our current 3D perspective. And this usage pattern is fairly straightforward for a clever algorithm to compute or infer, knowing where you are and what the application is trying to draw.

A Universal Texture is both a mip-map, plus a software emulated clip-stack, meaning it can mimic a mip-map of many more levels and greater ultimate resolution than can fit in any real hardware.



Note: though this diagram doesn’t depict it as precisely as the paper, the clip stack’s "angle" shifts around to best keep the column centered.

So this clever algorithm figures out which sections of the larger virtual texture it needs at any given time and pages only those from system memory to your graphics card’s dedicated texture memory, where it can be drawn very efficiently, even in real-time.

The main modification to basic mip-mapping, from a conceptual point of view, is that the upside down pyramid is no longer just a pyramid, but is now much, much taller, containing a clipped stack of textures, called, oddly enough, a "clip stack," perhaps 16 to 30+ levels high. Conceptually, it’s as if you had a giant mip-map pyramid that’s 16-30 levels deep and millions to billions of pixels wide, but you clipped off the sides — i.e., the parts you don’t need right now.

Imagine the Washington monument, upside down and you’ll get the idea. In fact, imagine that tower leaning this way or that, like the one in Pisa, and you’ll be even closer. The tower leans in such a way that the pixels inside the tower are what you need for rendering right now. The rest is ignored.

Each clip-level is still twice the resolution of the one "below" it, like all mip-maps, and nice quality filtering still works as before. But since the clip stack is limited to a fixed but roaming footprint, say 512×512 pixels wide (another preference in Google Earth), that means that each clip-level is both twice the effective resolution and half the coverage area of the previous. That’s exactly what we want. We get all the benefits of a giant mip-map, with only the parts relevant to any given view.

Put another way, Google Earth cleverly and progressively loads high-res information for what’s at the focal "center" of your view (the red part above), and resolution drops off by powers of two from there. As you tilt and fly and watch the land run towards the horizon, Universal Texture is optimally sending only the best and most useful levels of detail to the hardware at any given time. What isn’t needed, isn’t even touched. That’s one thing that makes it ultra-efficient.

It’s also very memory-efficient. The total texture memory for an earth-sized texture is now (assuming this 512 wide base mip-map, and say 20 extra clip-levels of data) only about 17 megabytes, not the dozens to hundreds of terabytes we were threatened with before. It’s actually doable, and worked in 1999 on 3D hardware that had only 32 MB or less. Other techniques are only now becoming possible with bigger and bigger 3D cards.

In fact, with only 20 clip-levels (plus 9 mip levels for the base pyramid), we see that 229 yields a virtual texture capable of up to 536 million pixels in either dimension. Multiply that by 1/2 vertically, gives an virtual image of a few hundred terapixels in area, or enough excess capacity to represent features as small as 0.15 meters (about 5 inches), wherever the data is available. And that’s not the actual limit. I simply picked 20 clip levels as a reasonable number. And you thought the race for more megapixels on digital cameras was challenging. Multiply that by a million and you’re in the planetary ballpack.

Fortunately, for now, Google only really has to store a few dozen terapixels of imagery. The other beauty of the system is that the highest levels of resolution need not exist everywhere for this to work. Wherever the resolution is more limited, wherever there are gaps, missing data, etc.. the system only draws what it has. If there is higher resolution data available, it is fetched and drawn too. If not, the system uses the next lower resolution version of that data (see mip-mapping above) rather than drawing a blank. That’s exactly why you can zoom into some areas and see only a big blur, where other areas are nice and crisp. It’s all about data availability, not any hard limit on the 3D rendering. If the data were available, you could see centimeter resolution in the middle of the ocean.

The key then to making this all work is that, as you roam around the 3D Earth, the system can efficiently page new texture data from your local disk cache and system memory into your graphics texture memory. (We’ll cover some of how stuff gets into your local cache next time). You’ve literally been watching that texture uploading happen without necessarily realizing it. Hopefully, now you will appreciate all the hard work that went into making this all work so smoothly — like feeding an entire planet piecewise through a straw.

Finally, there’s one other item of interest before we move on. The reason this patent emphasizes asynchronous behavior is that these texture bits take some small but cumulative time to upload to your 3D hardware continuously, and that’s time taken away from drawing 3D images in a smooth, jitter-free fashion or handling easy user input — not to mention the hardware is typically busy with its own demanding schedule.

To achieve a steady 60 frames per second on most hardware, the texture uploading is divided into small, thin slices that very quickly update graphics video memory with the source data for whatever area you’re viewing, hopefully just before you need it, but at worst, just after. What’s really clever is that the system needs only upload the smallest parts of these textures that are needed and it does it without making anyone wait. That means rendering can be smooth and the user interface can be as fluid as possible. Without this asynchronicity, forget about those nice parabolic arcs from coast to coast.

Now, other virtual globes can also virtualize the whole-earth texture, perhaps they cut it into tiles, and even use multiple power-of-two resolutions like GE does. But without the Universal Texturing component or something better, they’ll either be limited to 2D top-down rendering, or they’ll do 3D rendering with unsatisfying results, blurring, scintillation, and not nearly as good performance for streaming the data from the cache into texture memory for rendering.