This post is part 1 of a 3-part series on monitoring Azure virtual machines. Part 2 is about collecting Azure VM metrics, and Part 3 details how to monitor Azure VMs with Datadog.

Microsoft Azure is a cloud provider offering a variety of compute, storage, and application services. Azure services include platform-as-a-service (PaaS), akin to Google App Engine or Heroku, and infrastructure-as-a-service (IaaS). In the most recent Gartner “Magic Quadrant” rating of cloud IaaS providers, Azure was one of only two vendors (along with Amazon Web Services) to place in the “Leaders” category.

In this article, we focus on IaaS. In an IaaS deployment, Azure’s basic unit of compute resources is the virtual machine. Azure users can spin up general-purpose Windows or Linux (Ubuntu) VMs, as well as machine images for applications such as SQL Server or Oracle.

Whether you run Linux or Windows on Azure, you will want to monitor certain basic VM-level metrics to make sure that your servers and services are healthy. Four of the most generally relevant metric types are CPU usage, disk I/O, memory utilization and network traffic. Below we’ll briefly explore each of those metrics and explain how they can be accessed in Azure.

This article references metric terminology introduced in our Monitoring 101 series, which provides a framework for metric collection and alerting.

Users can monitor Azure with the following metrics via the Azure web portal or can access the raw data directly via the Azure diagnostics extension. Details on how to collect these metrics are available in the companion post on Azure metrics collection.

CPU usage is one of the most commonly monitored host-level metrics. Whenever an application’s performance starts to slide, one of the first metrics an operations engineer will usually check is the CPU usage on the machines running that application.

Name Description Metric type CPU percentage Percentage of time CPU utilized Resource: Utilization CPU user time Percentage of time CPU in user mode Resource: Utilization CPU privileged time Percentage of time CPU in kernel mode Resource: Utilization

CPU metrics allow you to determine not only how utilized your processors are (via CPU percentage) but also how much of that utilization is accounted for by user applications. The CPU user time metric tells you how much time the processor spent in the restricted “user” mode, in which applications run, as opposed to the privileged kernel mode, in which the processor has direct access to the system’s hardware. The CPU privileged time metric captures the latter portion of CPU activity.

Although a system in good health can run with consistently high CPU utilization, you will want to be notified if your hosts’ CPUs are nearing saturation.

Monitoring disk I/O is critical for understanding how your applications are impacting your hardware, and vice versa. For additional visibility beyond the VM-level metrics covered here, you can also collect metrics from your Azure storage accounts to determine if your storage is being throttled or has availability issues that could impact performance.

Name Description Metric type Disk read Data read from disk, per second Resource: Utilization Disk write Data written to disk, per second Resource: Utilization

Monitoring the amount of data read from disk can help you understand your application’s dependence on disk. If the application is reading from disk more often than expected, you may want to add a caching layer or switch to faster disks to relieve any bottlenecks.

Monitoring the amount of data written to disk can help you identify bottlenecks caused by I/O. If you are running a write-heavy application, you may wish to upgrade the size of your VM to increase the maximum number of IOPS (input/output operations per second).

Monitoring memory usage can help identify low-memory conditions and performance bottlenecks.

Name Description Metric type Memory available Free memory, in bytes/MB/GB Resource: Utilization Memory pages Number of pages written to or retrieved from disk, per second Resource: Saturation

Paging events occur when a program requests a page that is not available in memory and must be retrieved from disk, or when a page is written to disk to free up working memory. Excessive paging can introduce slowdowns in an application. A low level of paging can occur even when the VM is underutilized—for instance, when the virtual memory manager automatically trims a process’s working set to maintain free memory. But a sudden spike in paging can indicate that the VM needs more memory to operate efficiently.

Azure’s default metric set provides data on network traffic in and out of a VM. Depending on your OS, the network metrics may be available in bytes per second or via the number of TCP segments sent and received. Because TCP segments are limited in size to 536 bytes each, the number of segments sent and received provides a reasonable proxy for the overall volume of network traffic.

Name Description Metric type Availability Bytes transmitted Bytes sent, per second Resource: Utilization Linux VMs Bytes received Bytes received, per second Resource: Utilization Linux VMs TCP segments sent Segments sent, per second Resource: Utilization Windows VMs TCP segments received Segments received, per second Resource: Utilization Windows VMs

You may wish to generate a low-urgency alert when your network traffic nears saturation. Such an alert may not notify anyone directly but will record the event in your monitoring system in case it becomes useful for investigating a performance issue.

If your network traffic suddenly plummets, your application or network may be overloaded.

In this post we’ve explored several general-purpose metrics you should monitor to keep tabs on your Azure virtual machines. Monitoring the metric set listed below will give you a high-level view of your VMs’ health and performance:

Over time you will recognize additional, specialized metrics that are relevant to your applications. Part 2 of this series provides step-by-step instructions for collecting any metric you may need to monitor Azure.

Many thanks to reviewers from Microsoft for providing important additions and clarifications prior to publication.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.