Hui Ding took a job at Instagram about three years ago, just as the photo-happy social network was growing into one of the world's most popular online services.

He soon noticed that every so often, company co-founder Mike Krieger, normally a jovial character, would suddenly turn serious and intense. He would hunker over his computer keyboard and mutter "We've gotta fix this." This meant that Instagram had slowed to a crawl for users across the globe, and the cause was always the same: Justin Bieber.

Or almost always. Sometimes, it was Kim Kardashian.

Here's how it would go down: Bieber would post a photo, and so many Beliebers would "Like" it that Instagram's computers couldn't keep up. As a way of quickly pushing new stuff to their millions of users, big Internet outfits like Instagram operate what's called a memory cache inside their computer data centers, packing their most popular online content into the super-speedy memory systems of hundreds or even thousands of computer servers. Delivering data from memory is far faster than delivering it from good ol' databases sitting on good ol' hard disks. But a Bieber pic would receive so many "Likes" that the cache couldn't hold them. As the service fought to retrieve each "Like" from disk—one by one—the database would grind to a halt and Instagram would lock up.

"My thought was: 'Wow. A single celebrity's count could destroy the entire infrastructure,'" Ding says.

As for Krieger, he still remembers Bieber's digital ID in the database that underpinned Instagram, 6860189, because Bieber's account was so often the source of the latest problem. "I still know it by heart," Krieger says. "So many of the early scaling issues had to do with him hitting things we'd never hit before, so I got good at knowing it was him just by sighting his ID."

Beating the Bieber Bug

But those days are over. Last summer, about two years after it was acquired by Facebook, Instagram moved its online operation into one of Facebook's massive computer data centers. In recent months, the company has expanded into two other Facebook facilities. And as the company has expanded its infrastructure—now juggling more than 80 million photos and videos a day from more than 400 million people worldwide—it has modified its underlying software so that it can avoid the Bieber bug and so many other glitches that commonly plague enormous online services.

In tallying "Likes," for instance, Instagram now uses what it calls a "denormalized counter." What this means is that it doesn't try to keep a running "Like" count in the memory cache—and it doesn't try to tally the number by querying the individual accounts of all the people who have posted those "Likes." Since these accounts sit in databases that run on hard disk, that would take too long. Instead, the company keeps a "Like" count for each photo in a single database cell, and it accesses this single cell as needed. "That's just one disk access of about ten micro-seconds," says Instagram engineer Lisa Guo, "and it's always going to be there, wherever you need it."
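
To make the idea concrete, here is a minimal sketch of a denormalized counter. It is an illustration under stated assumptions, not Instagram's actual schema or code: the table names (`likes`, `photos`), the column `like_count`, and the helper functions are all hypothetical, and Python's built-in SQLite stands in for PostgreSQL.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE likes (
            photo_id INTEGER NOT NULL,
            user_id  INTEGER NOT NULL,
            PRIMARY KEY (photo_id, user_id)
        );
        CREATE TABLE photos (
            photo_id   INTEGER PRIMARY KEY,
            like_count INTEGER NOT NULL DEFAULT 0  -- the denormalized counter
        );
        """
    )
    return conn


def add_like(conn, photo_id, user_id):
    """Record a like and bump the per-photo counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO likes (photo_id, user_id) VALUES (?, ?)",
            (photo_id, user_id),
        )
        # Only increment when the like is new, so repeat likes don't inflate it.
        if cur.rowcount == 1:
            conn.execute(
                "UPDATE photos SET like_count = like_count + 1 WHERE photo_id = ?",
                (photo_id,),
            )


def like_count(conn, photo_id):
    """Fast path: read the single stored cell."""
    row = conn.execute(
        "SELECT like_count FROM photos WHERE photo_id = ?", (photo_id,)
    ).fetchone()
    return row[0] if row else 0


def count_likes_slowly(conn, photo_id):
    """Slow path the counter avoids: tally every individual like row."""
    return conn.execute(
        "SELECT COUNT(*) FROM likes WHERE photo_id = ?", (photo_id,)
    ).fetchone()[0]
```

The trade-off is that every new "Like" costs one extra write to keep the counter in step, but reading the total touches one row no matter how many millions of likes a photo has.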

This is one small tweak to a vast online infrastructure. But alongside so many other tweaks, it helps provide a roadmap for other businesses as they expand their online operations to more and more users. Not every online startup will have the option of expanding into multiple Facebook data centers—some of the most advanced facilities on the planet—but as time goes on, many will encounter similar growing pains, and they can benefit from many of the same tricks of the trade.

As Instagram details in a blog post published this morning, the company has moved into multiple Facebook data centers so it can keep up with the growth of the already vast community of people who use Instagram. As the service expands beyond 400 million users, it needs more computer servers. But moving into multiple data centers also means the company can better respond to disasters. If one data center goes down, another can pick up the slack.

Instagram, Everywhere

At the same time, this setup creates new challenges. Among other things, the memory cache in one data center won't necessarily match the memory cache in another. If a user comments on a photo and their account resides in a database on some machines in Oregon, for instance, this comment will show up in the Oregon cache, but not in the cache running in a Facebook data center in North Carolina. So, if other users look at the photo through the North Carolina facility, they won't see the comment.

To get around this particular problem, Instagram turned to a tool called PgQ. Dovetailing with the company's PostgreSQL database, it ensures that if a cache in one region isn't up to date, the system will visit the database where the latest information is stored. Visiting this database, which sits on hard disks, takes longer. But that's where tricks like denormalized counters can help.
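
The general pattern can be sketched in a few lines. This is a toy model under loud assumptions, not PgQ's API or Instagram's code: the `Region` class, the `write` function, and the key names are hypothetical, a Python dict stands in for the database, and a simple in-process queue per region stands in for the invalidation events a tool like PgQ would deliver.

```python
from collections import deque


class Region:
    """One data center: a local cache plus a queue of pending invalidations."""

    def __init__(self, name, database):
        self.name = name
        self.database = database  # stand-in for the authoritative database
        self.cache = {}           # stand-in for the regional memory cache
        self.pending = deque()    # invalidation events not yet applied

    def read(self, key):
        """Serve from cache when possible; on a miss, go to the database."""
        if key in self.cache:
            return self.cache[key]
        value = self.database.get(key)
        self.cache[key] = value
        return value

    def drain_invalidations(self):
        """Apply queued events by evicting the affected cache entries."""
        while self.pending:
            self.cache.pop(self.pending.popleft(), None)


def write(database, regions, key, value):
    """Update the database, then tell every region its cached copy is stale."""
    database[key] = value
    for region in regions:
        region.pending.append(key)
```

In this model, a region that has cached a photo's comment count keeps serving the old value until it drains its queue. After that, the next read misses the cache and falls through to the database, which is slower but current.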

That may seem like a lot to wrap your head around. Ensuring that you can quickly see all the stuff on Instagram that you want to see is an enormously complicated task, involving not only memory caches (based on Memcached software) and multiple databases (PostgreSQL and Cassandra) but also myriad web servers and message brokers. The point is that, with all these pieces in place across multiple data centers in disparate parts of the world, you, the user, don't have to worry about Instagram grinding to a halt. And neither does Mike Krieger. Or at least, the worries are less frequent.

Web services will always be vulnerable to natural disasters, not to mention Justin Bieber and whoever else inherits his selfie crown. Bieber doesn't seem likely to go away anytime soon. But at least there are ways of minimizing the Bieber problem.