Salesforce Database Fiasco

Current News Coverage

ZDNet - Faulty database script brings Salesforce to its knees

The Register - Salesforce database outage

GeekWire - Database error causes widespread ongoing Salesforce outage…

Service availability appears to be restored for most users, although a subset remains locked out of their data. Based on initial reporting, some details remain unclear.

Unanswered Questions

"…a database script deployment that inadvertently gave users broader data access than intended." (ZDNet reporting)

Initial reporting is light on technical detail, so many questions remain unanswered, and unanswered questions naturally invite idle speculation.

Ongoing Remediation

Firstly, service remediation appears to be at least partially manual. This conclusion is tentative, but it is based on evidence in the initial reports and the careful wording of company statements since the incident broke. The recovery has spanned a long duration, has apparently been intensively staffed, and service restoration has been fragmented and remains ongoing.

While in no way conclusive, these are signs of a potentially manual, disorganized, and unplanned-for rollback.

Response

Secondly, a primary feature of this misconfigured deployment is that all users were granted read-write access to all databases.

How long was the misconfiguration live before it was noticed? Before access was revoked?

How did Salesforce finally notice the misconfiguration? Did they have a horrified scramble after a puzzled customer reached out?

If password hashes were leaked, it would be prudent to force password resets on users. Yet, no reporting or updates from Salesforce have mentioned as much.
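The question of how the misconfiguration was noticed points at a common safeguard: continuously diffing live grants against a declared baseline, so drift pages an operator instead of waiting for a puzzled customer. A minimal sketch, using hypothetical grant tuples rather than Salesforce's actual permission model:

```python
# Toy permission-drift check. Grants are (user, database, mode) tuples;
# the data and model here are hypothetical, for illustration only.

def permission_drift(expected, actual):
    """Return grants present in the live system but absent from the baseline."""
    return set(actual) - set(expected)

baseline = {("alice", "crm", "read")}
live = {("alice", "crm", "read"),
        ("alice", "crm", "write"),   # unexpected write grant
        ("bob", "crm", "write")}     # unexpected user and mode

# Any non-empty drift set should trigger an alert immediately.
drift = permission_drift(baseline, live)
```

Run on a schedule against an export of live grants, a check like this turns "a customer noticed" into "monitoring noticed."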

Data Integrity

A detail that has been missing so far is whether, and to what extent, any affected databases saw unauthorized access or modification. Hopefully, remediation includes rolling databases back to a known good state, which would at least ensure data integrity as of a point before the incident. However, given the wording and the focus on "restoring" permissions and access, it's hard to know exactly how unauthorized access or modification is being handled.
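One way to answer the "was anything modified" question is to compare a digest of each table taken before the incident against a fresh export. A minimal sketch, assuming table exports can be rendered as rows of strings (a simplification of any real database's dump format):

```python
import hashlib

def table_digest(rows):
    """Order-independent SHA-256 digest of a table export (rows as strings)."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(row.encode("utf-8"))
        h.update(b"\x00")  # separator so row boundaries stay unambiguous
    return h.hexdigest()

# A digest captured before the incident can later confirm whether any
# rows were added, removed, or modified in the interim.
baseline = table_digest(["1,alice", "2,bob"])
```

Matching digests prove the data is byte-identical to the pre-incident snapshot; a mismatch flags the table for row-level comparison.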

Root Cause

According to reports, a misconfigured deployment script ran, which modified all customer databases to allow read-write access to all users.

What was the review process for this deployment? Was it approved?

Did this deployment go to a staging environment before production? How was that environment verified?

How was the deployment monitored?

How was the newly deployed state verified?

How was monitoring conducted on the affected infrastructure?

The Hard Truth

A mistake like this only happens when controls are missing or very weak at many levels: the organization itself, its review and deployment processes, and its monitoring and alerting capabilities. Legitimate oversights and mistakes do happen, which is why mature engineering organizations rely on a multi-layered process to catch defects throughout the SDLC.

Best Practice Solutions

While no strategy guarantees safety against defects, there are practices and processes which are proven to reduce the number of defects that ultimately reach production and impact paying customers.

For the defects that do make it through to production, there are strategies which will help with response and recovery.

Code Reviews

Code reviews are an easy, low cost, high value process to add to the SDLC. In addition to catching defects, code reviews are an excellent way to share knowledge across an organization.

Senior engineers get daily opportunities to mentor and coach their peers. It's also a great opportunity for authors to ask questions, gather feedback, and improve their implementation.

Configuration Management

Configuration management is critical to a few key defect-minimization strategies. Having fully automated deployment and rollback procedures almost requires some level of configuration management. Configuration versioning and secrets management are other important pieces to this puzzle.

Terraform by HashiCorp is one of many great options for managing infrastructure and configuration as code.
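To make the contrast with an ad-hoc script concrete, here is a hypothetical Terraform fragment expressing a database setting as versioned configuration. The resource and names are illustrative (a generic AWS RDS instance, not anything tied to Salesforce's stack); the point is that the change moves through code review and a `terraform plan` diff before anything touches production:

```hcl
# Hypothetical sketch: database configuration as reviewable, versioned code
# rather than a one-off script. Names and values are illustrative.
resource "aws_db_instance" "customer" {
  identifier          = "customer-db"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  publicly_accessible = false # a risky flip here shows up in the plan diff
}
```

Because the desired state lives in version control, rollback is largely `git revert` plus a re-apply, rather than a scramble to reconstruct what the script changed.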

Vault (also by HashiCorp) is a secret management option which has a CLI, a web interface, and integrates natively with cloud providers. It’s pretty intuitive and easy to use.

Deployment Testing

Every deployment to production should be preceded by a deployment to a similar production-like environment. That environment should be monitored during and after deployment. It should be tested rigorously with smoke tests, UI tests, or a manual QA team.
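Even a single automated smoke test against a health endpoint catches whole classes of bad deployments before customers do. A minimal sketch, with a hypothetical staging URL and response format (substitute your own):

```python
import json
import urllib.request

# Hypothetical staging endpoint; substitute your environment's URL.
STAGING_URL = "https://staging.example.com/healthz"

def healthy(status, body):
    """Pass only on HTTP 200 with an explicit {"status": "ok"} payload."""
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Non-JSON or non-object payloads count as failures, not crashes.
        return False

def smoke_test(url=STAGING_URL):
    """Hit the health endpoint; a bad status or payload fails the deploy."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return healthy(resp.status, resp.read())
```

Wired into the pipeline as a required gate, a failing `smoke_test` blocks promotion from staging to production automatically.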

Phased Deployments

Deployments should roll across clusters or partitions in phases, with each phase monitored against expectations. Rollbacks, particularly rollbacks of schema changes, can be difficult or impossible to prepare for; when that's the case, hot backups need to be ready to restore from. The deployment should also be scheduled for the lowest-use time of the week.
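The phase gate above reduces to a simple loop: deploy one partition, check monitoring against a threshold, and unwind everything touched so far if the check fails. A sketch where `deploy`, `error_rate`, and `rollback` are hypothetical hooks into your own tooling and monitoring:

```python
def phased_rollout(partitions, deploy, error_rate, rollback, threshold=0.01):
    """Deploy partition by partition; halt and roll back on elevated errors."""
    done = []
    for part in partitions:
        deploy(part)
        done.append(part)
        if error_rate(part) > threshold:
            # Unwind every partition touched so far, newest first.
            for p in reversed(done):
                rollback(p)
            return False
    return True
```

The crucial property is the blast radius: a bad deploy stops at the first unhealthy partition instead of reaching every customer database at once.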

Incident Planning, Disaster Recovery

Proper planning prevents piss-poor performance.

– Coach Hanika, high school wrestling

and

Fail to plan, plan to fail.

Both of these statements are corny, often repeated, and completely true. Incident response requires planning, organization, practice, trust, and coordination.

Designated roles reduce confusion and response time. Roles will depend on organizations and their individual requirements, but some common ones include:

incident commander

communications point person

investigators

scribes (for documenting timelines and actions taken)

With these roles and an incident response plan in place, mean time to recovery (MTTR) has the best chance of being minimized. Investigators are insulated from fly-by questions and can focus on the problem. The company and its various teams stay up to date via the communications point person. And the incident commander keeps the response focused, ensuring investigators have all the resources they need to complete the recovery.