Impact of the MCEPSC vulnerability patch on VDI

On November 12th, 2019, Intel disclosed a new vulnerability known as MCEPSC (CVE-2018-12207). Software vendors released patches to mitigate this vulnerability, but some vendors decided to disable the mitigation by default because of a potential performance impact. Because we focus on end-user computing here at GO-EUC, we decided to investigate the performance impact on VDI workloads when the mitigation against this vulnerability is enabled.

Machine Check Error on Page Size Change (MCEPSC)

The vulnerability is named “Machine Check Error on Page Size Change” (MCEPSC). Its security impact is that an attacker who can execute code in a virtual machine could potentially crash the hypervisor. On VMware ESXi, for instance, this would result in a Purple Screen of Death (PSOD). When High Availability (HA) then restarts the compromised virtual machine on another host, it could crash that host as well, causing a cascading effect that could potentially take all the hosts in a cluster offline. VMware published a security advisory which includes the following statement:

Known Attack Vectors: A malicious actor with local access to execute code in a virtual machine may be able to trigger a purple diagnostic screen or immediate reboot of the Hypervisor hosting the virtual machine, resulting in a denial-of-service condition.
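The cascading effect described above can be pictured with a minimal sketch. It assumes that the malicious virtual machine crashes every host it lands on and that HA simply picks the first remaining host; real HA placement logic is more involved, and the host names are hypothetical.

```python
def simulate_ha_cascade(hosts, start_host):
    """Return the order in which hosts crash while HA keeps restarting the VM.

    Simplifying assumptions (not VMware's actual HA algorithm):
    - the malicious VM crashes every host it runs on
    - HA restarts the VM on the first host that is still online
    """
    online = list(hosts)
    crashed = []
    current = start_host
    while current is not None:
        # The VM triggers the machine check error and takes its host down.
        online.remove(current)
        crashed.append(current)
        # HA restarts the VM on another host, if any host is left.
        current = online[0] if online else None
    return crashed


# Hypothetical four-node cluster; the VM starts on esx02.
cluster = ["esx01", "esx02", "esx03", "esx04"]
crash_order = simulate_ha_cascade(cluster, "esx02")
print(crash_order)
```

In this simplified model every host in the cluster ends up offline, which is the worst case the advisory warns about.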

Both VMware and Microsoft have mitigated this vulnerability with a patch, but by default it’s not enabled.

VMware KB59139:

Hypervisor-Specific Mitigations for one of these vulnerabilities, identified by CVE-2018-12207, is not enabled by default.

Microsoft guidance:

Microsoft has released updates to help mitigate this vulnerability for guest Virtual Machines (VMs) but the protection is disabled by default.

Why did these vendors not enable this mitigation by default, you might wonder? Apparently, mitigating this vulnerability can have an impact on performance. VMware released a statement regarding the performance impact:

Our testing consistently showed a performance impact of 5% or less for enterprise class workloads which include, but are not limited to, Databases, mixed workload server consolidation scenarios, and virtual desktop infrastructure (VDI).

We decided to test this for ourselves.

Infrastructure & configuration

This research was carried out in the Nutanix lab environment. The host for the desktops has the following specifications:

2x CPU: Intel Xeon Gold 5220, 18 cores @ 2.20GHz

768GB Memory

The three other Nutanix nodes in the cluster provide the test host with storage through an NFS-mount.

The following parameters apply to this research:

The Hypervisor we used was vSphere 6.7 Update 3, build 15018017 (patch ESXi670-201911001 installed).

The Side Channel Aware scheduler was not enabled.

The Nutanix Controller VM (CVM) was powered off during the tests on the host with the virtual desktops. This makes it easier for the community to reproduce these tests on hardware with the same specifications.

Citrix Virtual Apps and Desktops 7 – 1909 was used for desktop provisioning and brokering.

The non-persistent desktops were created using Machine Creation Services.

The default configuration for the desktop is 2 vCPUs with 3GB memory.

All required applications for Login VSI are installed including Microsoft Office 2016 x86.

For this research two scenarios are defined:

Windows 10 version 1809 (build 17763.864), fully updated until November 12th. This is considered the default in the charts.

Windows 10 version 1809 (build 17763.864), fully updated until November 12th, with the MCEPSC patch enabled on ESXi.

Each scenario is tested according to our default testing methodology, which is described here.

Expectations and results

According to VMware, enabling the MCEPSC patch has an impact on enterprise workloads (including VDI) of 5% or less. For VDI workloads this could mean three things: 5% lower density, 5% higher (application) response times, or both.

Chart: Login VSI VSImax (higher is better)

Chart: Login VSI baseline (lower is better)

The Login VSI VSImax can be used to compare the user density between the scenarios. In our tests the Login VSI VSImax is around 8% lower with the patch enabled. The Login VSI baseline provides an indication of the difference in the overall response times. We see an around 5% higher Login VSI baseline, which means that the user needs to wait longer for applications to respond.

Based on the Login VSI VSImax, we should see a similar pattern in the CPU utilization, as the lab environment is CPU constrained.
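For readers who want to run the same comparison on their own results, the percentages above are plain relative differences between the default and the patched scenario. The sketch below uses hypothetical numbers that are only chosen for illustration; they are not our measured values, and the function name is our own.

```python
def relative_change(default, patched):
    """Percentage change of the patched scenario relative to the default one."""
    if default == 0:
        raise ValueError("default value must be non-zero")
    return (patched - default) / default * 100.0


# Hypothetical values for illustration only, not the measured results.
vsimax_default, vsimax_patched = 200, 184        # sessions, higher is better
baseline_default, baseline_patched = 800, 840    # milliseconds, lower is better

vsimax_change = relative_change(vsimax_default, vsimax_patched)
baseline_change = relative_change(baseline_default, baseline_patched)

print(f"VSImax:   {vsimax_change:+.1f}%")
print(f"Baseline: {baseline_change:+.1f}%")
```

A negative VSImax change means fewer sessions fit on the host, while a positive baseline change means slower response times.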

Charts: host CPU utilization (lower is better)

There is a 4% increase in the average CPU utilization, which is a little lower than reflected by the VSImax results. Nevertheless, this higher CPU utilization is causing the Login VSI VSImax to be lower.

Because enabling the mitigation should only affect the CPU, we didn’t expect to see any difference on the storage performance.

Charts: storage performance (lower is better)

As shown in the charts, there is no difference in storage performance worth mentioning.

Logon times are important in a VDI environment because they determine the user's first impression. Increasing logon times are likely to have a negative effect on the user experience.

Charts: logon times (lower is better)

The 19% higher average logon time is not a fair assessment of the real difference in logon times. Logon times increase at a higher rate once the CPU is at 100%, and in the test with the patch enabled the CPU reaches that point earlier, so the logon times start to increase at an earlier point in the test. Comparing the logon times of only the first 30 minutes of the test is more realistic:

Chart: logon times during the first 30 minutes of the test (lower is better)

Now the impact on logon times is only 4%, which is in line with the impact on the Login VSI baseline.
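The effect of the comparison window can be reproduced with a small sketch. It uses synthetic data, not our measurements: logon times are assumed to stay flat until the CPU saturates and then rise linearly, with the patched run saturating a few minutes earlier. The run length, saturation minutes, base times and slope are all hypothetical.

```python
from statistics import mean


def logon_times(minutes, saturation_minute, base=20.0, slope=1.5):
    """Synthetic logon time (seconds) per minute of the test.

    Flat at `base` until the CPU saturates, then rising by `slope` per minute.
    """
    return [base + slope * max(0, m - saturation_minute) for m in range(minutes)]


def pct_diff(default, patched):
    """Difference in average logon time, in percent of the default average."""
    return (mean(patched) - mean(default)) / mean(default) * 100.0


# Hypothetical runs: the patched run is slightly slower and saturates earlier.
default_run = logon_times(48, saturation_minute=40)
patched_run = logon_times(48, saturation_minute=36, base=20.8)

full_run = pct_diff(default_run, patched_run)
first_30 = pct_diff(default_run[:30], patched_run[:30])

print(f"Full run:         {full_run:+.1f}%")
print(f"First 30 minutes: {first_30:+.1f}%")
```

Because the saturated tail of the patched run is longer, the full-run average exaggerates the difference, while the first 30 minutes only reflect the underlying slowdown.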

Conclusion

Although MCEPSC did not get the same media attention as Spectre or Meltdown, it is still a vulnerability that can have a big impact on a virtualized environment. Any vulnerability that allows a virtual machine to crash the host it is running on needs further investigation, especially when software vendors don’t enable the mitigation by default for performance reasons. Maybe the CVSS score isn’t considered high enough to be a real risk? Every organization needs to determine the risk for itself; we only investigated the performance impact on VDI workloads.

In our testing with the Login VSI knowledge worker workload, we see an impact on user density (Login VSI VSImax) of around 8% and an impact on the application response times (Login VSI baseline) of around 6%. Both are slightly higher than what VMware suggests, but it always depends on the workload and the hardware.
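To put a density impact like this into perspective, it can be translated into the number of hosts needed for a given user population. The sketch below assumes a hypothetical environment of 2,000 users and 150 users per host; only the 8% comes from our results, and the calculation ignores HA spare capacity.

```python
import math


def density_after_impact(users_per_host, impact_pct):
    """Users per host after a density loss of `impact_pct` percent (rounded down)."""
    return users_per_host * (100 - impact_pct) // 100


def hosts_needed(total_users, users_per_host):
    """Minimum number of hosts to run all users, without spare capacity."""
    return math.ceil(total_users / users_per_host)


# Hypothetical environment; only the 8% impact is taken from the test results.
total_users = 2000
default_density = 150
patched_density = density_after_impact(default_density, 8)

hosts_default = hosts_needed(total_users, default_density)
hosts_patched = hosts_needed(total_users, patched_density)

print(f"Default: {default_density} users/host, {hosts_default} hosts")
print(f"Patched: {patched_density} users/host, {hosts_patched} hosts")
```

In this example the 8% lower density costs one extra host; in a smaller or less densely packed environment the difference may disappear in the rounding.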

From a security perspective, we recommend enabling the MCEPSC mitigation, because a compromised virtual machine being able to crash a host is a significant risk. From a performance perspective, on the other hand, our results argue against enabling it. In the end it comes down to a risk assessment of how likely it is that this can happen in your environment. If you decide to enable the mitigation, we recommend (as VMware also suggests in its performance statement) testing it with your own applications and workloads before deploying it in production. That way you know the impact in advance and avoid unexpected performance problems.

Share your experience in our Slack Channel.

Photo by Robert Horvick on Unsplash