Wednesday, September 3, 2008 at 1:27AM

Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?

Frequent Releases. A major release once a week and a minor releases every few days.

Create a Cyber Liability Group. At one time operations was distributed amongst several groups. A permanent operations group was created to isolate problems and revert problem software components back to previously known good states. The ability of a separate team to handle rollbacks speaks to a great deal of standardization and advanced tool building.

Distribute Team Across Time Zones. Split the operations team across different time zones so no one has to work the graveyard shift. Facebook has 20 people in their team located in Palo Alto, California and London, England.

Be Innovative, Not Safe. Fear of failure often shuts down the organizational brain and makes it hide behind excessive rules and regulations. A technology company should have a bias towards action and innovation. Release software. Don't stifle genius. Rely on your tools and processes to recover from problems.

Expect Problems. Software pushed to production will have problems. Expect problems, but don't let that stop you from innovating.

Roll Backward. When a problem is detected in a release the changes can either be rolled forward or backward. Rolling back is going to a previously good release. Rolling forward is fixing problems in the new release rather than rolling back. Bugs in production are fixed in production. Roll forward ends up being covered in the press, so prefer roll backs over roll forwards.

Roll Out Massive Changes Slowly. Turn on features gradually, for a few percent of users at a time. Use the slow rollout to fix problems that can only be found under real user conditions. This approach give operations and development a lot of confidence in changes.

Encourage Openness and Information Sharing. Design reviews, PR strategy, which servers to buy, etc are often open for informal debate among employees. Facebook has created an Ideas system where employees can create an Idea by category. There's a discussion tool for discussing the idea and a rating system for rating the idea. Tools are built on-top of Facebook platform so they are available to everyone.