Understanding the performance of your infrastructure is extremely important, especially when running production systems. There is nothing worse than a customer calling to say they are experiencing slowness with one of their applications while you have no idea where to start looking.

In the 2014 State of DevOps survey, one of the questions asked was “How is your organization notified of failure?”

Here was the multiple choice question asked:

Through the survey, one of the top practices that correlated with performant IT teams was Monitor system and application health:

Logging and monitoring systems make it easy to detect failures and identify the events that contributed to them. Proactive monitoring of system health based on threshold and rate-of-change warnings enables us to preemptively detect and mitigate problems.

If you want some more information about performant DevOps teams and the methods they used to test teams, I recommend the talk What We Learned from Three Years of Sciencing the Crap Out of DevOps.

Monitoring performance counters on Windows in a centrally managed way has always been tricky. In 2014 I wrote a PowerShell module called Graphite-PowerShell-Functions to send performance counters to Graphite, which turned out to be pretty popular.

Thankfully, things are getting easier. Let’s take a look at using InfluxDB to store our metrics, Telegraf to transmit the metrics and Grafana to display them.

By the end of the article, you should be able to make a dashboard that looks something like this:

📢 Want to be notified when I post more like this? Follow me on Twitter: @MattHodge 📢

Requirements

You will need a Linux machine which will host the InfluxDB and Grafana installations. I will be using Ubuntu 14.04 x64 for this.

Preparing the Ubuntu Machine

There is nothing special that needs to be performed on the Ubuntu server before installing InfluxDB or Grafana. Just make sure all the packages are up to date:

sudo apt-get update
sudo apt-get upgrade

UTC Time

The other thing I would recommend is setting the time zone of the Ubuntu server to UTC. It is a good idea to standardize on UTC as the time zone for all your metrics. InfluxDB uses UTC so stick to it.

(You can read about some of the struggles when you don’t use UTC here).
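The commands for switching the server to UTC are not shown above. Here is a minimal sketch, assuming `timedatectl` is available on your Ubuntu release (if it is not, `sudo dpkg-reconfigure tzdata` is the interactive alternative). The function names `is_utc` and `set_utc` are made up for this example:

```shell
#!/bin/sh
# Succeeds when the system's current UTC offset is zero.
# Note: zones such as Europe/London also report +0000 in winter, so treat
# this as a quick sanity check rather than proof the zone is literally UTC.
is_utc() {
  [ "$(date +%z)" = "+0000" ]
}

# Switches the system time zone to UTC.
# Needs root and assumes timedatectl exists on this machine.
set_utc() {
  sudo timedatectl set-timezone Etc/UTC
}

# Only report the current state; changing it is left as an explicit step.
if is_utc; then
  echo "Time zone offset is +0000"
else
  echo "Not on UTC; run set_utc to change it"
fi
```

The check is kept separate from the change so you can run the script safely on a machine you do not want to modify.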

Install InfluxDB

InfluxDB is an open source distributed time series database with no external dependencies. It’s useful for recording metrics, events, and performing analytics. I recommend having a read of the key concepts of InfluxDB over at their documentation page.

Let’s download and install the InfluxDB .deb package:

cd /tmp
wget https://s3.amazonaws.com/influxdb/influxdb_0.11.0-1_amd64.deb
sudo dpkg -i influxdb_0.11.0-1_amd64.deb

# Start the service
sudo service influxdb start

InfluxDB listens on 2 main ports:

TCP port 8083 is used for InfluxDB’s Admin panel

TCP port 8086 is used for client-server communication over InfluxDB’s HTTP API

Once installed, go to http://Your-Linux-Server-IP:8083 in the browser and confirm you can access the InfluxDB admin panel:
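Besides opening the admin panel in a browser, you can check the HTTP API port from a terminal. This is a small sketch that relies on InfluxDB’s `/ping` endpoint, which answers with HTTP status 204 when the server is running. The function name `influx_is_up` is made up for this example:

```shell
#!/bin/sh
# Succeeds when InfluxDB's HTTP API answers /ping with "204 No Content".
# $1 = host name or IP address of the InfluxDB server (API port 8086).
influx_is_up() {
  # -o /dev/null discards the body; -w prints only the HTTP status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$1:8086/ping")
  [ "$code" = "204" ]
}
```

For example, `influx_is_up Your-Linux-Server-IP && echo "InfluxDB is answering"` prints the message only when the API port is reachable.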

Install Grafana

Grafana is a beautiful, open source metrics dashboard and graph editor. It can read data from multiple sources, for example Graphite, Elasticsearch, OpenTSDB, as well as InfluxDB. Take a look at the Grafana live demo site to see what it can do.

First we will download and install the Grafana .deb package. You can find the latest version over at http://grafana.org/download/.

cd /tmp
wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb

# Install required packages for Grafana
sudo apt-get install -y adduser libfontconfig
sudo dpkg -i grafana_2.6.0_amd64.deb

# Start the service
sudo service grafana-server start

# Configure Grafana to start at boot time
sudo update-rc.d grafana-server defaults 95 10

Grafana’s web interface listens on TCP port 3000 by default.

Go to http://Your-Linux-Server-IP:3000 in the browser and confirm you can access the Grafana login page:

Telegraf

Telegraf is an agent written in Go for collecting metrics from the system it’s running on, or from other services, and writing them into InfluxDB or other outputs.

We will be using the win_perf_counters plugin for telegraf to collect Windows performance counters and send them over to InfluxDB. More information on the plugin can be found at the telegraf GitHub page.

Install the Telegraf Client

As the Windows agent is still in an experimental phase, head over to its GitHub page at https://github.com/influxdata/telegraf to grab the latest version.

At the time of writing the latest version could be found at http://get.influxdb.org/telegraf/telegraf-0.11.1-1_windows_amd64.zip.

Extract the zip file into a directory; I used C:\telegraf.

Inside you will see 2 files:

telegraf.exe - this is the application. It is written in Go which compiles nicely into a single .exe file

telegraf.conf - all the configuration options for telegraf

Configure Telegraf

Basic Configuration

Open the telegraf.conf file in a text editor - I would recommend one which supports TOML syntax highlighting such as Atom.

The Windows version of telegraf ships with a configuration file set up to collect some common Windows performance counters by default, so we do not need to change very much for it to work.

The first thing we will change is the collection interval. This is how often the performance counters will be read. I am setting mine to 5 seconds. This configuration option is under the [agent] section:

[agent]
  interval = "5s"

Next, under the [[outputs.influxdb]] section, we need to update the urls option to point to our InfluxDB server at http://Your-Linux-Server-IP:8086.

[[outputs.influxdb]]
  urls = ["http://Your-Linux-Server-IP:8086"]

Deciding What To Capture

As this is a Hyper-V server, I wanted to collect some Hyper-V specific metrics. I found two articles, a post by Ben Armstrong about Dynamic Memory Performance Counters with Hyper-V and Measuring Performance on Hyper-V on MSDN.

These were the parts that stuck out from the articles:

Use the following rule of thumb when measuring disk latency on the Hyper-V host operating system using the “\Logical Disk(*)\Avg. Disk sec/Read” or “\Logical Disk(*)\Avg. Disk sec/Write” performance monitor counters:

1ms to 15ms = Healthy

15ms to 25ms = Warning or Monitor

26ms or greater = Critical, performance will be adversely affected

and

My favorite performance counter is the “Average Pressure” counter under the “Hyper-V Dynamic Memory Balancer” category. This gives you a very simple view of the overall memory allocation of your system

As long as this number is under 100, you know that there is enough memory in your system to service your virtual machines. Ideally this value should be at 80 or lower. The closer this gets to 100, the closer you are to running out of memory. Once this number goes over 100 then you can pretty much guarantee that you have virtual machines that are paging in the guest operating system.

Depending on the type of server you are trying to monitor, you will want to do the same and research a few important performance counters you should be keeping an eye on.
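The two rules of thumb quoted above can be written down as small helper functions, which is handy when you later pick threshold values for a dashboard. This is only a sketch: the function names and the labels for the memory pressure bands are made up here, and the quoted guidance leaves some edges open (latency between 25 ms and 26 ms, pressure of exactly 100), so the boundaries chosen below are assumptions.

```shell
#!/bin/sh
# Classifies a disk latency value given in milliseconds.
# Assumption: anything up to 15 ms is Healthy, anything below 26 ms is
# Warning, and 26 ms or more is Critical.
classify_disk_latency_ms() {
  awk -v ms="$1" 'BEGIN {
    if (ms <= 15)      print "Healthy"
    else if (ms < 26)  print "Warning"
    else               print "Critical"
  }'
}

# Classifies the Hyper-V "Average Pressure" counter.
# Assumption: 80 or lower is Ideal, up to and including 100 is Watch,
# and anything over 100 means guests are probably paging.
classify_memory_pressure() {
  awk -v p="$1" 'BEGIN {
    if (p <= 80)       print "Ideal"
    else if (p <= 100) print "Watch"
    else               print "Paging likely"
  }'
}

classify_disk_latency_ms 12
classify_memory_pressure 95
```

Both functions use awk for the comparison so that fractional values such as 15.5 are handled as numbers rather than strings.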

Adding Additional Counters

Now that we have worked out what needs to be monitored, let’s add the counters to the configuration file.

First we will add \LogicalDisk(*)\Avg. Disk sec/Read and \LogicalDisk(*)\Avg. Disk sec/Write.

The configuration file already includes LogicalDisk monitoring, so we just need to add Avg. Disk sec/Read and Avg. Disk sec/Write to the Counters array in the LogicalDisk section of the file.

After doing this, the configuration for the LogicalDisk counters looks like this:

[[inputs.win_perf_counters.object]]
  # Disk times and queues
  ObjectName = "LogicalDisk"
  Instances = ["*"]
  # Added "Avg. Disk sec/Read" and "Avg. Disk sec/Write" to the Counters array.
  Counters = [
    "% Idle Time",
    "% Disk Time",
    "% Disk Read Time",
    "% Disk Write Time",
    "% User Time",
    "Current Disk Queue Length",
    "Avg. Disk sec/Read",
    "Avg. Disk sec/Write"
  ]
  Measurement = "win_disk"
  #IncludeTotal=false #Set to true to include _Total instance when querying for all (*).

Next, we want to add the Hyper-V Dynamic Memory Balancer counter. I wasn’t sure of its full path, so I used PowerShell to find it:

# I used ConvertTo-Json as it makes the output much easier to read.
Get-Counter -ListSet "Hyper-V Dynamic Memory Balancer" | Select-Object Paths, PathsWithInstances | ConvertTo-Json

From here I found the full counter path was \Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure (JSON escaping doubles the backslashes in the output). This was added to the configuration file:

[[inputs.win_perf_counters.object]]
  # Hyper-V memory pressure
  ObjectName = "Hyper-V Dynamic Memory Balancer"
  Instances = ["System Balancer"]
  Counters = ["Average Pressure"]
  Measurement = "hyper_v"
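A full counter path maps onto the configuration keys in a regular way: the part before the parentheses is the ObjectName, the part inside them is the instance, and the part after the last backslash is the counter. The sketch below shows that split. It is written in shell to match the Linux-side commands in this post, although you could do the same in PowerShell. The function name `counter_path_to_toml` is made up, and paths without an instance are deliberately rejected rather than guessed at.

```shell
#!/bin/sh
# Splits a counter path of the form \Object(Instance)\Counter into the three
# values the win_perf_counters config asks for, and prints them as TOML lines.
counter_path_to_toml() {
  path=${1#\\}          # drop the leading backslash
  counter=${path##*\\}  # text after the last backslash
  head=${path%\\*}      # text before the last backslash
  case $head in
    *"("*")")
      object=${head%%"("*}      # everything before the first "("
      instance=${head#*"("}     # everything after the first "("
      instance=${instance%")"}  # minus the closing ")"
      ;;
    *)
      echo "no instance found in: $1" >&2
      return 1
      ;;
  esac
  printf 'ObjectName = "%s"\n' "$object"
  printf 'Instances = ["%s"]\n' "$instance"
  printf 'Counters = ["%s"]\n' "$counter"
}

# Single quotes keep the backslashes in the path literal.
counter_path_to_toml '\Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure'
```

Run against the path above, this prints an ObjectName of Hyper-V Dynamic Memory Balancer, an Instances entry of System Balancer and a Counters entry of Average Pressure. You still need to add the Measurement name yourself.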

Save the telegraf.conf file.

To run telegraf, open a command prompt and start it with the following command:

C:\telegraf\telegraf.exe -config C:\telegraf\telegraf.conf

If all went well you should see telegraf starting to collect your metrics and send them over to InfluxDB.
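To confirm from the Ubuntu side that data is arriving, you can ask InfluxDB directly through its HTTP `/query` endpoint. This is a sketch: the function name `influx_query` is made up, and the database name `telegraf` is the agent’s default, so adjust it if you changed the output settings.

```shell
#!/bin/sh
# Runs an InfluxQL query through InfluxDB's HTTP API and prints the JSON reply.
# $1 = InfluxDB host, $2 = database name, $3 = InfluxQL query text.
influx_query() {
  # -G sends the url-encoded parameters as a GET query string.
  curl -s -G "http://$1:8086/query" \
    --data-urlencode "db=$2" \
    --data-urlencode "q=$3"
}
```

For example, `influx_query Your-Linux-Server-IP telegraf 'SHOW MEASUREMENTS'` should return a JSON document listing the measurements written so far. With the configuration used in this post you would expect names such as win_disk and hyper_v to appear after the first collection interval.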

Troubleshooting

If you get an error saying 2016/03/28 19:48:01 toml: line 1: parse error, this is because you used plain old Notepad and its line endings broke things. Use a real text editor!

Installing Telegraf as a service

If you are happy with how Telegraf is functioning, you can install it as a service so it starts itself when the system reboots. Follow the instructions here.

Viewing the Data in Grafana

Now that you have some metrics being sent into InfluxDB, you can use Grafana to view them.

Open up http://Your-Linux-Server-IP:3000 and log in using the default credentials:

Username: admin

Password: admin

Configure a Data Source

Grafana needs to have a data source added so it knows where to look for the metrics.

Click on Data Sources on the left and then Add new at the top.

Choose the type InfluxDB 0.9.x for the data source and enter the URL for InfluxDB. Keep in mind that Grafana is running on the same box as InfluxDB, so you can just use http://localhost:8086.

Keep access as proxy.

The default database for the telegraf agent is telegraf. The Grafana form will not let you save unless you enter a User and Password, so just enter something random as we have not configured any InfluxDB credentials.
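If you would rather script this step than click through the form, Grafana also exposes an HTTP API for data sources. The sketch below is hedged: it assumes your Grafana version accepts basic authentication on the API and that the data source type string for InfluxDB 0.9.x is `influxdb`, so check both against the Grafana HTTP API documentation for your release. The function names are made up for this example.

```shell
#!/bin/sh
# Builds the JSON body describing the data source configured above.
# $1 = data source name, $2 = InfluxDB URL, $3 = database name.
# Assumption: "influxdb" is the type string your Grafana version expects.
datasource_json() {
  printf '{"name":"%s","type":"influxdb","url":"%s","access":"proxy","database":"%s"}\n' \
    "$1" "$2" "$3"
}

# Posts the data source to Grafana using the default admin credentials.
# $1 = Grafana host. Assumes basic auth is enabled for the HTTP API.
add_datasource() {
  datasource_json "InfluxDB" "http://localhost:8086" "telegraf" |
    curl -s -X POST -u admin:admin \
      -H 'Content-Type: application/json' \
      -d @- "http://$1:3000/api/datasources"
}
```

Calling `add_datasource Your-Linux-Server-IP` would then create the same InfluxDB data source that the form creates. Change the admin password before using anything like this outside a lab.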

Create a Dashboard

To display our data, we will need to create a dashboard. Select Home from the top menu and click New.

Add a Graph

In the new dashboard page you will see a little green rectangle over on the left; click it and choose Add Panel > Graph.

Click on the Metrics tab, and down on the bottom right of the page is the data source dropdown. Choose the data source we added, called InfluxDB.

In the data selection section, choose From win_cpu and match the rest of the fields up to the image below to get a graph of the CPU usage.

You can read more about querying data from InfluxDB in Grafana in the Grafana docs.

Next, click on the General tab and enter a name for the graph.

Head over to the Axes & Grid tab. There are a ton of options here. As this is a graph to show CPU usage of one or more Hyper-V servers, I chose to set it up as follows:

As we are looking at the % Processor Time performance counter, set the Left Y Unit to be percent (0-100).

Set some threshold levels - these just give a nice visual representation of when you should be worried about the graph entering the danger zone.

You can also display additional values under the graph next to your metrics; in this example I enabled Min, Max and Avg.

Save the Dashboard

Click Back to dashboard and then, at the top of the page, choose the Cog icon > Settings.

Give the dashboard a name and save it - I chose Hyper-V Dashboard and entered the hyper-v tag.

Create a Table

I added a Table panel to track disk latency on the Hyper-V server:

The query that I used for this was as follows:

You will notice I used a math function and multiplied the performance counter by 1000. This counter reports its value in seconds, so multiplying by 1000 gives a value in milliseconds.
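As a quick worked example of that conversion, a raw counter value of 0.004 seconds is 4 milliseconds, which falls in the healthy band of the latency rule of thumb quoted earlier. A tiny sketch (the function name `seconds_to_ms` is made up for this example):

```shell
#!/bin/sh
# Converts a value in seconds to milliseconds, printed with one decimal place.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 1000 }'
}

seconds_to_ms 0.004   # prints 4.0
```

Doing the multiplication in the query rather than afterwards means the thresholds in the panel can be entered directly in milliseconds.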

From there I went to the Options tab and set the Unit value to milliseconds (ms) and set the thresholds that were recommended by Microsoft.

Create a Single Value Display

Finally I added a Single Value panel to track Hyper-V memory pressure.

The query that I used for this was as follows:

I then went to the Options tab and set the Postfix of the metric to be avg pressure. I also enabled Background coloring and set the Thresholds as recommended by Ben’s blog post.

Wrapping Up

InfluxDB and Telegraf provide an excellent and simple way to ship Windows performance counters off the server, and Grafana lets us display these metrics in beautiful dashboards.

Hopefully this starts you on your journey to graphing performance data for your systems.

Keep an eye out for another post shortly which will discuss some more advanced usage including using annotations on the graphs so you can correlate events in your infrastructure to system performance.