This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war stories, and walk through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.

Even with our best efforts spent incorporating Site Up into our culture, Every Day Is Still Monday in Operations. Change is constant and we aren't perfect. Outages will still occur from time to time. When an outage does occur, we must do everything within our power to restore service as quickly as possible. No matter how difficult the circumstances, we must keep pushing forward. If the service isn't restored, the company can't make money, and without money the company will not exist much longer.

Panama

When I popped my head out of the 10g massacre, I turned my attention to Panama (I was still not at LinkedIn), a rewrite of the legacy search marketing solution. It is important to understand the magnitude of the problem.

The legacy systems had many single points of failure:

- The Advertiser databases, which had no Business Continuity Planning (BCP)
- The Data System backend pipeline, where we counted our money (also with no BCP)
- A Serving system that served ads directly out of Oracle databases (not a good use of relational database technology)
- A replication system that tried, desperately and unreliably, to move ads/listings from the Advertiser systems to the Serving systems
- An editorial system that choked on millions of changes a day, much like a Los Angeles traffic jam with an accident blocking two lanes
- A data pipeline that suffered constant failures, manual interventions, and reprocessing of data

All of this (except the Serving system) was located in a single datacenter, with wires so poorly managed that we could not get to our networking devices in any reasonable way.

I do not know why the project was called Panama, but at one low point in the project, I put together a presentation to draw parallels between our project and the Panama Canal, which many historians consider the greatest construction achievement of the 20th century.

The Panama Canal was attempted first by the French, who had successfully built the Suez Canal. They had problems funding the project, engineering a workable solution, and keeping a workforce intact. They eventually abandoned the project and nearly destroyed the economy of their home country. BTW, the French did manage to dig a third of the canal.

The Panama Project had been attempted twice before with no success, because of improper funding, lack of engineering resources, no workable plan, and lack of leadership. When I took over, there were 27 people working on the Panama project, a gross miscalculation of what it would take to build this solution. After reviewing the scope of the project and the level of attrition, I immediately booked a flight to headquarters. The purpose of the trip was to beg for help. I went to the boss and told him I needed the company's best and brightest, I needed hundreds of engineers, and I needed them now. There was no way to hire them all, so I had to pull from numerous groups. At one point, a senior leader said to me, "You are the most hated person in the company!" I thanked him and continued to recruit.

The leadership of the Panama Canal project spans people from France to the United States, from presidents to engineers to doctors. Key people played critical roles, sacrificing much to keep the project on track.

My biggest problem was that I needed to keep the legacy sites running (where billions of dollars were at risk, and we had our share of seven-figure outages) while building the new system, and then building the bridge/migration path from legacy to new. I knew I could not run the Panama project directly. I needed a person who slept, ate, drank, showered, and lived Panama 24/7. I put Mr. A in charge immediately, and informed my boss a week later. My boss told me I could not make the move. I told him I had already done so. If he wanted to undo that decision, he could fire me. This was one of many instances where I could have been fired. Note that this is one of my favorite axioms: "Go to work every day willing to be fired." The reason I love this is that it empowers me to make the best possible decisions to further the company's interests. If I am wrong, then fire me. If you disagree, then fire me.

The biggest problem with the Panama Canal was creating a workable environment and getting the workforce in place. At that point in history, no one knew that mosquitoes transmitted deadly diseases like yellow fever and malaria, which were decimating the workforce. Doctors were put on the problem and solved it, removing standing water (the breeding ground for mosquitoes) from the workplace environment.

For the Panama project, we needed to get a dedicated workforce in place, with the proper environments for development, integration, QA, staging, and production. Fortunately, we were not faced with diseases like yellow fever and malaria.

Another daunting problem with the Panama Canal was selecting the right site and getting the local government to agree to the construction. Two sites were considered, and Panama was selected. The problem at the time was that Panama belonged to Colombia. So President Roosevelt helped stage a revolution to create the country of Panama and secured the rights to build the canal. "Speak softly and carry a big stick."

In the case of the Panama Project, we needed to ensure that our base of operations could run from multiple sites in Southern and Northern California. This made it difficult to keep each subsystem aligned and integrated with the overall solution, but it was the only option available. While we did not create a country, we did need to break down the silos of the "loose confederation of warring tribes" that existed in the company.

The second most difficult problem with the Panama Canal was how to engineer a solution that would work. The great debate was whether to build a sea-level canal (like the Suez) or a set of locks that could lift the ships over the terrain. The key problem was controlling the floods in the rivers fed by the tropical rains. The raging waters caused constant mudslides that undid hours of digging on the canal itself. In the end, the Panama Canal engineers turned their worst problem (the raging waters) into the renewable energy solution for running the lock structure. They dammed the river and created a lake at the top of the canal, which perpetually fed the locks to raise and lower the ships from one lock to the next. They also used the hydroelectric power from the dam to run the "mules" that pulled the ships.

In the Panama Project, the key decisions were around how to engineer a new solution that brought additional power and insights to the advertisers, allowed for a bidding AND value model, scaled 10x beyond the existing infrastructure, ensured business continuity, and provided a mechanism to migrate seamlessly from the old legacy solution to the new system without missing a beat. The key debate centered on using the legacy system as a starting point versus largely rewriting the system from scratch. We chose the latter.

At one point near the completion of the Panama Canal project, a particularly difficult excavation in the Culebra Cut (culebra is Spanish for snake) filled in with a mudslide during another torrential downpour. The Culebra Cut manager went to the chief engineer and asked, "What do I do now?" The answer from the chief engineer was simple and to the point: "Dig."

We had many moments like this in the Panama Project. We started building critical mass around the deliverables for three main systems (Advertiser, Serving, Data). Unfortunately, we were also burdened with the failed promises of the past to deliver the next-generation systems, and we had very tight timelines. The Operating Team met every day for more than a year, remediating the schedule, removing bottlenecks, reallocating resources, and in general plowing through milestones one at a time. At one point, the chief engineer asked me to write the acceptance criteria for release of the system, knowing my strict nature with respect to quality and operability. I did what he asked, producing a long document with numerous checklists, including: functional acceptance, performance, failover/BCP, SLAs, operations, migration, launch readiness, software configuration management, environment readiness, SOX, security, exception policies, signoffs, launch sequence, and post-launch verification/review.

Unfortunately, when we were fighting to hit a launch date, I was sitting with my staff and my boss, who was slowly trying to relax a subset of the acceptance criteria. As my list was being whittled down, my anger was growing. Finally I stood up, threw my badge at the CTO, and started walking out. Quality, schedule, features: pick two. I did not want to sacrifice quality at this point. Dig.

During the Panama Canal project, the engineers were forced to build tools and systems that had never existed until that moment. Special trains and steam-powered excavation equipment ran 24 hours a day.

We also needed to build such tools. A key part of the Panama project was the migration of hundreds of thousands of advertisers from the old systems to the new without missing a beat. Since I wanted to fully appreciate the problem, I built a tiny test account and insisted that it be one of the first accounts migrated (it was actually account number 2). That night I sat with our QA team and filed four "blocker" bugs just from migrating my tiny account. I halted all migrations until those problems were sorted out. Where there is one bug, there are many. Another long night in our fight to produce a quality system that had to work out of the box. Dig.

The Panama Canal was completed in 1914, 38 years after the French conceived of the original plan. It is still a marvel of engineering and sheer will.

We completed the Panama project 17 days late, and started migrating from the old legacy systems to the new ones flawlessly.

Learning from “Panama”

We hit on every principle in the construction of the Panama system. We were uncompromising about site up, built incredible monitoring, designed/constructed/tested our code with care, carefully considered which features made sense, took calculated risks with lots of A/B testing, restored service against hard SLAs, worked hard to prove out our assumptions, communicated status on a daily basis, and worked around the clock to preserve the operation of the legacy system while migrating to the new one. It is the greatest project I have ever had the privilege to work on.

24/7

I inherited an engineer from the research team at a prominent Internet company. The first thing he said to me was that he did not want to be called after 7PM during the work-week, nor did he want to be called on weekends. I sat there for about 500 milliseconds contemplating my reply, which was, "We are done here. You will never work for me."

I like it when my people take time off, but if you work for me and you do not answer your phone, I am not a fan. In a 24/7 shop, anyone could be needed at any time to help solve a site issue, not just when it's convenient.

I am a big fan of a 24/7 NOC (network operations center), and I am also a big fan of an on-call rotation, where people are scheduled to cover every hour of every day. Having someone ready to act at any time makes a world of difference when you are trying to get ahead of an issue.
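To make "every hour of every day" concrete, here is a minimal sketch of one way an on-call rotation can be computed. The engineer names, the weekly hand-off, and the rotation start date are illustrative assumptions, not a description of our actual tooling.

```python
from datetime import datetime, timezone

# Hypothetical roster and rotation start; illustrative only.
ENGINEERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)
WEEK_SECONDS = 7 * 24 * 3600

def on_call(now: datetime) -> str:
    """Return the engineer on call at `now`, assuming a one-week rotation.

    Every instant maps to exactly one engineer, so coverage has no gaps:
    the defining property of a 24/7 rotation.
    """
    weeks_elapsed = int((now - ROTATION_START).total_seconds() // WEEK_SECONDS)
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]

# Example: who takes the page right now?
print(on_call(datetime.now(timezone.utc)))
```

The point of the mapping is that the schedule, not the incident, decides who answers: whoever the function returns is expected to pick up the phone.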

In fact, I like 24/7 so much that one of my direct reports has replaced his phone's ringtone with my nasty voice saying: "What part of 24/7 do you not understand?" Since he is German by heritage, I was willing to translate for him: "Welchen Teil von 24/7 verstehen Sie nicht?"

Learning from “24/7”

We have a three-tier support system in our shop. If we need to reach tier 3 (the software developers), there is obviously something seriously wrong that could not be resolved by the tier 1 and tier 2 teams; that's why the tiered structure was set up in the first place. An engineer at tier 3 who was unwilling to help with a problem severe enough that two other teams of engineers couldn't solve it was guaranteed to lose the job eventually: either by being fired, or by the site staying down for too long.
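As a minimal sketch of that escalation flow (the tier names and the resolution callback are hypothetical illustrations, not our actual runbook):

```python
from typing import Callable

# Hypothetical tier names, lowest tier first; illustrative only.
TIERS = [
    "tier 1 (NOC)",
    "tier 2 (operations engineering)",
    "tier 3 (software developers)",
]

def escalate(incident: str, try_resolve: Callable[[str, str], bool]) -> str:
    """Walk the tiers in order; each tier only sees incidents the tiers
    below it could not resolve. By the time tier 3 is paged, two teams
    have already failed, which is why answering that page is not optional."""
    for tier in TIERS:
        if try_resolve(tier, incident):
            return tier  # resolved at this tier
    raise RuntimeError(f"site still down, incident unresolved: {incident}")
```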

The other component here is ownership. If I am a software engineer, and I throw my bits over the operational wall to a separate team, there is little ownership and accountability when things go awry. However, if I am the tier 3 support and have to stay up until the issue is resolved, then I will think twice about shipping a pile of shit.

This is a great example of the phrase "living the dream." We get to choose the jobs we take. Once we take one, certain rules apply. To honor "site up," we have to make a true commitment to do whatever it takes, while constantly figuring out better ways to do it. Sometimes this means halting a single site incident; other times it's a herculean task to overcome years of technical debt. Either way, in Operations you must do your best and never give up!

Today's stories, "Panama" and "24/7," were both experienced by David Henke. To ask either of us a question directly, please tag us with @David Henke or @Benjamin Purgason in the comments below. We'll be checking the comments throughout business hours, so let us know what you think!