Note: A version of this article appeared in the MSRC Blog as a three-part series, starting on June 25, 2019.

The Microsoft Security Response Center (MSRC) is an integral part of Microsoft’s Cyber Defense Operations Center (CDOC), which brings together security response experts from across the company to help protect against, detect, and respond to threats in real time. Staffed with dedicated teams 24×7, the CDOC has direct access to thousands of security professionals, data scientists, and product engineers throughout Microsoft to ensure rapid response and resolution to security threats.

This article looks at an important slice of Microsoft’s security response, namely how MSRC responds to elevated threats to customers through our Software and Services Incident Response Plan (SSIRP). We’ll discuss the anatomy of a SSIRP incident and give some recommendations for building your own security response process. Before discussing our incident response plan, we will show how we work to prevent these incidents from happening.

In the wake of multiple worms, including Blaster and Sasser, the SSIRP process was created in 2004 to formalize Microsoft’s response to elevated threats against our customers.

Established in 1998, the MSRC works to protect customers from vulnerabilities in our products and services. Our team of dedicated security professionals comes from a variety of security backgrounds across international government, industry, and academia, including the defense and intelligence communities, government CERTs, and the telecommunications, energy, technology, and computer security industries.

It was this team of responders that developed the original Security Development Lifecycle (SDL) principles that are standard across the software development industry as a comprehensive approach to developing and maintaining secure products through 12 best practices. These principles were informed by our mission to ensure that vulnerabilities in our products are disclosed according to the principles of Coordinated Vulnerability Disclosure (CVD), updates are released quickly, and our customers are protected from exploits. To do this, our strategy is to make it difficult and costly for attackers to find, exploit, and leverage vulnerabilities by:

Eliminating vulnerabilities entirely by incorporating technologies that remove the opportunity for exploitation.

Breaking exploitation techniques with defense-in-depth technologies to make it difficult for an attacker to run arbitrary code, as well as technologies that help to contain any damage to a ‘sandboxed’ environment and prevent persistence.

Using technologies such as Windows Defender and signals provided by industry partners to ensure attackers have little incentive to invest their time and resources into developing exploits they know will be defended against in a timely manner.

Figure 1: Microsoft’s vulnerability mitigation strategy

We are working to change the vulnerability economy – making it more expensive and time consuming for attackers to acquire and exploit software vulnerabilities, thus reducing the risk of exploitation for our customers. We aren’t doing it alone. Through our commitment to Coordinated Vulnerability Disclosure (CVD), we work together with security researchers and industry partners to help ensure updates are released before an issue is made public. This is a win-win for all: customers are protected, and researchers are acknowledged through formal recognition for their work as well as through our bug bounty programs. The Microsoft Active Protections Program (MAPP) also ensures member security organizations across the industry are ready to respond following a coordinated vulnerability disclosure.

This approach has led to a safer ecosystem for all. With more vulnerabilities being fixed year over year, the actual risk to customers from vulnerabilities is steadily declining and the number of known exploits is consistently trending downwards. In short, Microsoft and security researchers are identifying more vulnerabilities and, as a result, attackers are using fewer.

While we have made a lot of progress through this approach, we are also prepared to activate our incident response process when things don’t go to plan. Our SSIRP crisis model was established in the early days of the MSRC as a means of mobilizing global resources to quickly defend customers against elevated security threats such as Blaster and Sasser. Today this coordination is center-stage in the cross-industry response to Spectre and Meltdown, or the immediate mobilization in the wake of the ShadowBrokers leaks that eventually led to WannaCry. Each year, the SSIRP team responds to dozens of incidents that threaten our customers, including imminent or actual public vulnerability or exploit disclosures, attacks against customers, or security threats to Microsoft’s cloud services – such as O365, Azure, and Dynamics.

Anatomy of a SSIRP incident

SSIRP is our incident response process for responding to major threats to our customers, including exploits in the wild that are being used to attack customers (‘zero days’), threats to the security of Microsoft’s services like Azure and O365, and the public disclosure of unpatched vulnerabilities that could be used to attack customers. Many teams across the company are mobilized during this response, including the Cyber Defense Operations Center (CDOC) response teams, enterprise security response, product and service security teams, and key security technology teams like Windows Defender. These security specialists are engaged every day as rapid responders on a range of threats to our products and services, as well as our internal network. While each team is an expert in their product or service, it is through the SSIRP process and the CDOC that they join in a cross-company coordinated effort to protect customers from serious security threats.

There are five phases to almost every product or service SSIRP incident, shown below.

Figure 2: The five phases of a SSIRP incident

Watch

Microsoft keeps a continual state of watch for emerging incidents, and both internal and external partners are key players with specific insights into various parts of the Microsoft ecosystem. Together with the MSRC, ‘watch partners’ keep vigil over their areas of responsibility for signs of emerging threats.

Triage

When an issue is found, it’s triaged by our team, and if there is a high risk to customers, a SSIRP is declared. This focuses extra resources to ensure timely variant analysis, mitigation, updates to services, and the release of updates to customers. Each SSIRP is assigned a severity level that measures the potential risk to customers. The severity is intended to be a living rating that changes as the situation develops, and it also drives the level of response.

For example, earlier this year we were informed of a vulnerability in an Open Source Software (OSS) container runtime called runc that affected all Linux systems using this component. The vulnerability was an Elevation of Privilege (EoP) that could allow an attacker to gain root-level code execution on the host where they already had malicious code executing in the container. While the underlying vulnerability was not in one of our products or services, we considered it to be a significant threat to our customers and declared a ‘Severity Level 2’ SSIRP to mobilize resources for a cross-company response.

Assess

After we declare an incident, teams work to assess the extent of the issue and confirm a plan of record to protect our customers as soon as possible. This work includes representatives from engineering, communications, customer service and support, and other defenders. As well as scoping the issue, the team works to ensure customers are aware of any mitigations ahead of an update release. Coordination and collaboration with industry also happen during the assess phase, through our MAPP program and, in the case of the Spectre and Meltdown class of vulnerabilities, with other major technology companies. Assessment is a time-critical function and one that has little room for mistakes. Our mantra is “Know – don’t guess.”

Engineering/Development

With a plan of record established, the focus shifts to engineering and to ensuring there are enough resources mobilized to protect customers as soon as possible. In some cases, while a complete fix is being developed, engineering will release workarounds or add protections to Microsoft Defender, supported by communications such as security advisories, blog posts, and heads-up notices to Microsoft Active Protections Program (MAPP) partners.

Microsoft’s response to the Meltdown and Spectre vulnerabilities affecting computer chips was known internally as SSIRP Poncherello after the lead character from the TV show “CHiPS”.

At the same time, teams work on the ultimate goal: wide distribution of any security update, fixes to any affected services, and customer guidance when there are specific actions that customers need to take to protect themselves (security update guide advisories, blogs, field alerts).

Updates to services are pushed to production as soon as they are tested. Security updates to products are typically released as part of our regular Update Tuesday schedule, along with the disclosure of fixed vulnerabilities that provide insights and learnings for the industry. The predictability of a monthly Update Tuesday allows customers to schedule updates to their systems in a timely manner, while reducing the economic cost of any downtime. In some rare cases of high risk, we may determine that an immediate, ‘out-of-band’ update is necessary, such as the updates we released during the WannaCry outbreak.

During the runc vulnerability SSIRP, teams investigated all of Microsoft’s services to determine which, if any, were affected. During this investigation, Azure Moby and the Azure Kubernetes Service were identified as using runc but were not affected, since they used a statically linked version of the component that was not vulnerable. Even so, both services updated their code to include the patch provided by the code maintainer and pushed the changes to production. When the code maintainer made the vulnerability publicly known, it was given a severity rating of ‘High’ (CVSS 3.0 score: 8.6) and assigned the unique identifier CVE-2019-5736.
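The ‘High’ label follows from the numeric score under the CVSS v3.0 qualitative severity rating scale (None 0.0, Low 0.1–3.9, Medium 4.0–6.9, High 7.0–8.9, Critical 9.0–10.0). As a minimal illustration, the mapping can be sketched in a few lines of Python. The function name `cvss_v3_rating` is ours and is not part of any Microsoft or FIRST tooling.

```python
def cvss_v3_rating(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating.

    Illustrative helper only (hypothetical name); the thresholds are the
    published CVSS v3.0 qualitative rating bands.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"


# The runc vulnerability (CVE-2019-5736) was scored 8.6.
print(cvss_v3_rating(8.6))  # High
```

Note that this public CVSS rating is separate from the internal SSIRP severity level described under Triage, which measures potential risk to customers and can change as an incident develops.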

Post Incident Review

In the Post Incident Review phase – after updates have been released and services are updated – the Crisis Lead confirms with watch partners that the incident was comprehensively resolved. Crisis response teams stand down and a post-incident review is held to formally capture any lessons learned and drive improvements across the company. This is critical to any response model, as the security landscape is always changing – what worked yesterday may not be the best option for tomorrow’s incident. In the case of the runc SSIRP, there were no additional learnings to glean – the case was typical for an Open Source Software (OSS) incident, and the team used some of the best practices shared later in this article.
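For teams that track incidents in their own tooling, the five phases and the ‘living’ severity rating described above could be modeled roughly as follows. This is a sketch under stated assumptions, not a description of Microsoft’s internal systems: the `Phase` and `Incident` names, the integer severity scale, and the strictly linear phase progression are all hypothetical simplifications.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The five phases named in the article, in order."""
    WATCH = 1
    TRIAGE = 2
    ASSESS = 3
    ENGINEERING = 4
    POST_INCIDENT_REVIEW = 5


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    name: str
    severity: int  # assumed numeric level, e.g. 2 for 'Severity Level 2'
    phase: Phase = Phase.TRIAGE  # an incident is declared during triage
    history: list = field(default_factory=list)

    def update_severity(self, new_level: int, reason: str) -> None:
        """Re-rate the incident; severity is a living rating."""
        self.history.append(
            (self.phase.name, f"severity {self.severity} -> {new_level}: {reason}")
        )
        self.severity = new_level

    def advance(self) -> Phase:
        """Move to the next phase (assumes a strictly linear flow)."""
        if self.phase is Phase.POST_INCIDENT_REVIEW:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase.value + 1)
        self.history.append((self.phase.name, "entered phase"))
        return self.phase


incident = Incident(name="runc container escape", severity=2)
incident.advance()  # TRIAGE -> ASSESS
incident.advance()  # ASSESS -> ENGINEERING
incident.advance()  # ENGINEERING -> POST_INCIDENT_REVIEW
print(incident.phase.name)  # POST_INCIDENT_REVIEW
```

Keeping a history of phase changes and severity updates gives the post-incident review a timeline to work from. A real process would likely allow phases to overlap or repeat as new information arrives.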

Building your own security incident response process

As the threat landscape continues to evolve, we are always learning and adjusting our incident response approach. A post-incident review of each security incident that our teams manage provides insights into how we can evolve our services and products to be more secure, improve our response processes and respond faster, and help keep customers more secure. This ensures we are continually reviewing and updating our response processes to keep in step with an evolving security landscape.

After nearly two decades in incident response, we can share some of the best practices and learnings that have come from this experience. These apply to organizations both big and small and are relevant to all security response teams. Some of these we learned the hard way, at a time when there was no established practice across the industry and things didn’t go exactly to plan. Any organization looking to establish its own incident response plan can benefit from the best practices below:

Plan. Have a plan and a process ready before any response is needed. Refer to NIST publication Computer Security Incident Handling Guide (800-61 Rev 2) for a detailed description of what such a plan should look like. The document is intended to assist organizations in establishing computer security incident response capabilities and handling incidents efficiently and effectively.

Stakeholder support. Formalize your plan and get executive and other stakeholder support for your incident response plan. Your plan will only be as effective as they allow it to be.

Practice. Exercise your response process before you need it. ‘Tabletop’ simulations allow you to safely run through a mock incident to uncover any deficiencies in processes, assumptions, and differences of understanding across teams, and develop collateral needed to communicate effectively to both customers and executives.

Leadership. Make sure there is clear accountability for who is leading the incident response process – in SSIRP we call this the Crisis Lead. The Crisis Lead’s primary role is to lead, direct, coordinate, and adjust the response plan as needed. They need to have an in-depth knowledge of the incident response process and to be included in any side discussions that may occur regarding the incident. Your response runs the risk of being ineffective if the Crisis Lead doesn’t have all the necessary context and isn’t involved in all the discussions through the response process.

Empower. Ensure your incident response teams have the autonomy to move fast within the bounds of the approved process and understand when to seek executive approval for extraordinary action.

Communication. Communication should be coordinated within the incident response process. All communication – including executive, employee, and customer communication – should be coordinated through the Crisis Lead who is accountable for incident resolution. Without this, communications will often be incomplete or inaccurate and only serve to confuse. Clear, accurate communication builds confidence in the incident response process, maintains trust with customers, protects your brand, and is essential for fast effective response.

Collaborate. Take a holistic approach and involve teams early. In addition to engineering teams, public relations, customer communications, customer support and legal teams may need to contribute to the incident. Bringing them in early – ideally from the start – allows them to better understand context and move quickly with good judgement.

Multithread. Split your incident response into workstreams when necessary. Large or complex incident response events should be split into separate workstreams. For example, move all engineering work, along with the appropriate teams, to an Engineering workstream. Similarly, customer support, customer communications, and public relations should form a separate Communications workstream. The Crisis Lead should be part of every workstream for coordination of the response effort.

Synch. Hold regular meetings for each workstream and for the overall response effort. Separate meetings should periodically take place for each workstream, in addition to a regular all-participant meeting where each of the workstreams reports back, so that participants gain valuable context about what is happening in the overall response effort.

Learn. Undertake a Post-Incident Review. Remember that the job isn’t finished the moment the problem is mitigated and communicated to customers. Understanding the root cause of the issue and considering how response could be improved enables you to drive durable improvements to systems, technologies, and processes that drive security for your organization and customers.

When considering cybersecurity and incident response, it truly is “Better to have, and not need, than to need, and not have” (F. Kafka). Crisis handling is more efficient when stakeholders are all working from the same well-rehearsed playbook.

Interested in learning more about how Microsoft’s SSIRP process works to protect customers? If you are attending Black Hat this year, check out Eric Doerr’s presentation The Enemy Within: Modern Supply Chain Attacks, which will share more experiences from the SSIRP team about how we responded to a software supply chain attack.

Simon Pope, Director of Incident Response, Microsoft Security Response Center (MSRC)
