The largest Cloud Foundry implementation

“Always seek knowledge” was the crux of a talk about operating BOSH and Cloud Foundry, presented at the CF Summit in Frankfurt.

The talk was given by Michael Maximilien (aka Dr. Max), Chief Architect of PaaS Innovation at IBM Labs and Project Management Committee (PMC) Lead for CF Extensions within the Cloud Foundry Foundation community.

The formal title for Dr. Max’s session was “Lessons Learned Keeping the Largest Cloud Foundry Environment Alive and Kicking.” The title refers to IBM Bluemix, a certified Cloud Foundry implementation with more than 1 million registered users (and growing at 20,000 per month), 500,000 running apps and hundreds of services.

He distilled the lessons learned into a Top 10 list, covering each issue, its good and bad aspects, and the lessons learned along the way. He alluded several times to the size and complexity of many Cloud Foundry deployments and urged a packed room of attendees not to get discouraged.

His talk was not an introductory presentation, but rather a detailed, technical look at the day-to-day challenges of managing the gargantuan Bluemix implementation. Dr. Max hoped the lessons learned at IBM could be applied by other managers facing their own challenges with complex Cloud Foundry deployments.

Issue #10: Change

“There is a tightly controlled change request process,” he said. The downside is that the path from identifying a change request to processing it is slow, creating a bottleneck. Within a large company like IBM, there is also the issue of many people on widely distributed global teams often having to work during odd hours. The good news is that this environment “limits propagation of problematic changes,” according to Dr. Max.

Lesson learned: Use tooling to alleviate time-zone differences, but a change request process is needed to coordinate global teams.

Issue #9: Audit checklist

This is also something that slows things down, especially with manual audits. Dr. Max noted that Canary-based deployments help to audit changes.

Lesson learned: IBM created its own homegrown tool called Doctor to enable continuous monitoring and auditing of all deployments, with an actual ability to execute actions.
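The talk did not detail how Doctor works internally, but the core idea of continuous auditing, repeatedly comparing a deployment's expected state against what it actually reports, can be sketched as follows. All names, versions, and data structures here are hypothetical; a real auditor would query the BOSH Director API rather than a hard-coded table.

```python
# Illustrative sketch of continuous deployment auditing, in the spirit of
# the "Doctor" tool described in the talk. EXPECTED and the release names
# below are hypothetical example data, not IBM's actual configuration.

EXPECTED = {
    "cf-prod": {"cf-release": "250", "diego-release": "1.8.1"},
}

def audit(deployment, reported_releases):
    """Return a list of (release, expected, actual) mismatches for a deployment."""
    problems = []
    for release, expected_version in EXPECTED.get(deployment, {}).items():
        actual = reported_releases.get(release)
        if actual != expected_version:
            problems.append((release, expected_version, actual))
    return problems

# Example: one release has drifted from the expected version.
drift = audit("cf-prod", {"cf-release": "250", "diego-release": "1.7.0"})
print(drift)  # [('diego-release', '1.8.1', '1.7.0')]
```

Running such a check on a schedule, and wiring the mismatch list to remediation actions, is what turns a one-off audit checklist into continuous monitoring.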

Issue #8: Log checking and monitoring

“Because we’re such a large environment, the log rotates too quickly,” according to Dr. Max. “There is massive log data, and it’s hard to keep track of it.” A Loggregator-style log stream alleviates the problem, he said.

Lesson learned: There’s a need to introduce early usage of log parsing and tooling, such as Splunk. In IBM’s case, a homegrown tool was developed.
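The point of "early usage of log parsing" is to surface errors before high-volume logs rotate away. A minimal sketch of that idea, counting error lines per job, might look like this; the log line format is invented for illustration, and a real pipeline would feed a tool such as Splunk instead:

```python
# Minimal log-parsing sketch: count ERROR lines per job so problems surface
# before logs rotate away. The "[job] LEVEL message" format is hypothetical;
# adapt the regex to your actual components.
import re
from collections import Counter

LINE = re.compile(r"^\[(?P<job>[\w-]+)\]\s+(?P<level>INFO|WARN|ERROR)\b")

def error_counts(lines):
    """Return a Counter mapping job name to number of ERROR lines."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m.group("level") == "ERROR":
            counts[m.group("job")] += 1
    return counts

sample = [
    "[router] INFO route registered",
    "[uaa] ERROR token validation failed",
    "[uaa] ERROR token validation failed",
    "[router] WARN slow backend",
]
print(error_counts(sample))  # Counter({'uaa': 2})
```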

Issue #7: bosh-init woes

It can be hard to re-create an existing BOSH Director, he said. Frequent updates make this more of an issue. However, Dr. Max also noted that bosh-init (to which he has contributed code) is easy to use in general. He also noted that “we’re moving to a new BOSH CLI and therefore single binary, and we’ve introduced external CPIs, which are a source of growth for the Cloud Foundry movement.”

Through the CPIs, Cloud Foundry can work across multiple custom clouds. “We’re up to 20 now,” Dr. Max said, “this is a great source of strength.”

Lesson learned: The key is to have better planning for adopting new changes.

Issue #6: Custom software peril

Dr. Max noted the problems that come when large corporations inevitably develop their own custom software, particularly when they create custom stemcells or custom BOSH implementations.

“Use BOSH,” he said. “Adopt it, embrace it. If you start doing things that don’t really work with BOSH, you start breaking the abstractions and it becomes much harder to upgrade and keep the system healthy.”

Lesson learned: Don’t have your own custom stemcell or custom BOSH. Use the extension points: the CPI and the release. “Don’t try to do more than that,” he said. “Then you can survive with the rest of the system.”

Issue #5: Using PowerDNS

“Do not use PowerDNS,” he admonished. “It creates a single point of failure,” and it is also hard to remove once in place. Dr. Max noted that some infrastructure providers offer highly available DNS solutions, and that the BOSH team is developing a new solution that does not rely on PowerDNS.

Lesson learned: Think long and hard about adding any non-HA node or job into your deployment.

“We highly recommend you do not use PowerDNS, but we don’t really have a good alternative yet.” —Dr. Max, IBM

Issue #4: Security updates

“Security updates are painful but important,” he said. “You have to do them.” He noted that the Internet, whether we like it or not, “is full of evildoers.” In fact, the frequency of CVEs (Common Vulnerabilities and Exposures) forces weekly stemcell updates, he said, noting that such rolling updates can be costly. The good news is that the BOSH team is able to issue new stemcells quickly.

Lesson learned: You might need to work with your IaaS to push patches. Reloading the OS is a way to speed up new stemcell roll-out.

Issue #3: Multi-BOSH deployments

“Starting with (only) one BOSH director can lead to a bottleneck,” Dr. Max said. “Solo BOSH is easy, but be careful if you grow.” The good news, of course, is that BOSH supports multi-deployment implementations.

Lesson learned: Revisit your deployment strategy often and consider multi-BOSH deployment before you grow too large.

Issue #2: 100% expectation

“Deployments and updates are never a 100% success,” he said. “In fact, most large deployments result in a failure somewhere. But you need to be equipped to address failures and continue. Don’t stress about it.”

To the good, “BOSH is great at restarting where it fails, and deployments and updates are still usable (when there are failures),” Dr. Max noted.

Lesson learned: Trust the tool. Failures are part of large deployments and things never work perfectly the first time.

“Embrace failures…this weirdness is going to happen. You’re just going to have to adopt it as part of your culture.” —Dr. Max, IBM

Issue #1: Director DB backups

“Back up your Director DB often and a lot,” Dr. Max said. He recalled an incident at IBM in which a call he received at 4 a.m. led to two weeks of fixes. “A backup command exists, but it’s not as fast as an IaaS snapshot, so backup is very slow,” he said. He also noted that a new BOSH backup command is faster and better than in the past, and that some IaaS providers offer disk snapshots, “so, you can get your provider to do your backup for you.”

Lesson learned: Always have a backup of your Director’s DB. A dummy CPI can help with lost deploys.
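The "often and a lot" advice boils down to frequent timestamped copies plus a retention policy. Here is a generic, illustrative sketch of that pattern; it is not the BOSH backup command or an IaaS snapshot, just the shape of a scheduled backup job with pruning:

```python
# Generic "back up often and a lot" sketch: copy a database dump to a
# timestamped file and prune the oldest copies. Illustrative only; it does
# not replace the BOSH backup command or IaaS disk snapshots.
import shutil
import time
from pathlib import Path

def backup(db_dump: Path, backup_dir: Path, keep: int = 14) -> Path:
    """Copy db_dump into backup_dir with a timestamp; keep only the newest `keep`."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = backup_dir / f"{db_dump.stem}-{stamp}{db_dump.suffix}"
    shutil.copy2(db_dump, dest)
    # Prune: timestamped names sort chronologically, so drop all but the last `keep`.
    old = sorted(backup_dir.glob(f"{db_dump.stem}-*{db_dump.suffix}"))[:-keep]
    for path in old:
        path.unlink()
    return dest
```

Run from cron (or any scheduler), this keeps a rolling window of recent Director DB dumps so a 4 a.m. incident costs a restore, not two weeks of fixes.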

Want details? Watch the video!

Table of contents

What’s the issue with a tightly controlled change request process? (3:50)
What’s the issue with audits? (6:05)
What’s the issue with log checking and monitoring? (7:36)
What’s the issue with bosh-init? (9:27)
What’s the issue with custom software? (11:31)
What’s the issue with PowerDNS? (14:12)
What’s the issue with security updates? (16:05)
What’s the issue with multi-BOSH deployments? (18:14)
What’s the issue with deployment and update expectations? (19:40)
What’s the issue with Director DB backups? (21:37)
