August 29, 2013

It’s been almost two months since Adam O’Byrne and I launched Kouio, a Google Reader replacement we built together. It’s been a wild ride, and we’ve both learnt a lot in a very short period of time. With Kouio we’ve had the blessing of what I call DevOps by default: the all too common scrappy start-up scenario where a single engineer (in this case, yours truly) builds the back-end application while also being responsible for every operational aspect, from provisioning servers and installing and tuning services to monitoring and everything else involved in keeping the application running smoothly. In this post I’ll talk about the monitoring aspect: the hows and whys of what has worked really well for us.

First, a little background. The user interface for Kouio primarily revolves around a single-page web app, which on the surface looks fairly simple, but there are many components hidden away behind the scenes, all working together to bring you the magic:

The public website (blog, documentation, etc)

RESTful API, which serves both the client app and third-party apps

WebSocket server which provides real-time feed updates

Cache service for session data and frequently accessed content

Primary database, which stores, among other things, the RSS feeds themselves and the tens of millions of related articles

Queueing system that manages both the pub-sub service for real-time feed updates, and the worker queues that take care of a variety of tasks, such as:

Retrieving RSS feeds

Retrieving website icons

Exporting articles to external services (Evernote, Readability, etc)

Payment processing
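The queueing side of that list is worth a closer look. Kouio’s actual worker code isn’t shown in this post, but a Redis list makes a natural worker queue: producers push work onto one end (RPUSH), workers block waiting to pop it off the other (BLPOP), and the list length doubles as a backlog gauge. A minimal sketch of the pattern — the helper names and the in-memory stand-in below are mine for illustration, not Kouio’s code; only the "kouio-feed-list" queue name comes from this post:

```python
def enqueue_feed(client, feed_id, queue="kouio-feed-list"):
    """Producer side: append a feed ID to the queue (RPUSH)."""
    client.rpush(queue, str(feed_id))

def next_feed(client, queue="kouio-feed-list", timeout=5):
    """Worker side: block until work arrives (BLPOP), or time out."""
    item = client.blpop(queue, timeout=timeout)
    return None if item is None else item[1]

class FakeRedis:
    """In-memory stand-in for redis.Redis(), so this sketch runs without
    a server; it mimics only the three calls used here."""
    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)

    def blpop(self, key, timeout=0):
        items = self.lists.get(key)
        return (key, items.pop(0)) if items else None

    def llen(self, key):
        return len(self.lists.get(key, []))

client = FakeRedis()
enqueue_feed(client, 42)
enqueue_feed(client, 43)
backlog = client.llen("kouio-feed-list")  # what a backlog gauge would read
```

With a real redis.Redis() client the same helpers apply unchanged, and that llen call is exactly the kind of number worth graphing as a pending-work gauge.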

As you can see, there are a lot of moving pieces involved in bringing together a simple-looking RSS reader. With such a variety of distinct components, an obvious problem we face is visibility: how exactly can we know what all the pieces of the system are doing at any point in time? How can we visualise the state of the entire system coherently, as a whole?

Graphite / StatsD

I decided to solve this problem using the popular software combination of Graphite and StatsD. Graphite stores time-series metrics and provides a very powerful graphing interface (built with Django, no less) for visualising and reporting on the metrics collected. StatsD, built with Node.js, then provides the service in front of it: collecting event streams over UDP and aggregating metrics at high volume before pumping them into Graphite.
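To give a feel for how lightweight this pipeline is: a StatsD "event stream" is just plain-text UDP datagrams of the form name:value|type, where the type is c for counters, ms for timers, g for gauges. A minimal client is a few lines of stdlib Python (the metric names below are made up for illustration):

```python
import socket

def send_metric(name, value, metric_type, addr=("127.0.0.1", 8125)):
    """Fire one metric at StatsD over UDP using its plain-text line
    protocol. UDP is fire-and-forget: if the collector is down, the
    application never blocks or errors."""
    payload = "%s:%s|%s" % (name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), addr)
    finally:
        sock.close()
    return payload

send_metric("kouio.feeds.fetched", 1, "c")        # a counter increment
send_metric("kouio.api.response_time", 87, "ms")  # a timing sample
```

StatsD then aggregates everything received during each flush interval (10 seconds by default) into a handful of summary values per metric before forwarding them on to Graphite.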

These tools appealed to me for several reasons. Firstly, I wanted something highly hackable that we could customise to fit our needs. StatsD and each of the Graphite components follow the Unix philosophy of doing one thing and doing it very well. Better still, they’re built with Python and JavaScript, languages Adam and I are very experienced in, so the Graphite / StatsD pair seemed a better bet for customisation than larger, all-encompassing, plug-in-driven monitoring systems such as Nagios or Munin. Secondly, these tools were built and expanded upon at scale by companies with a great open source culture: places like Orbitz, Etsy, and even Mozilla, who released django-statsd for monitoring the Firefox Add-ons Marketplace, a package we’re now also using with great results.

Collecting Metrics

With Graphite and StatsD installed, the final step was choosing the metrics to collect and working out how to get at them. Mozilla’s django-statsd package gives you a lot out of the box here. Without any configuration, it automatically adds counters and timers to many areas of Django, such as the ORM, caching, and unit tests. The really interesting integration, though, is at the view layer. Counters and timing metrics are collected for all view functions, with each metric further segmented in a ton of ways, from the application name and URL parts right down to the HTTP verbs used and status codes returned: all incredibly insightful for an application like Kouio that implements a public-facing RESTful API.
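To illustrate the shape of that view-layer integration — the wrapper and key format below are a simplified sketch of the pattern, not django-statsd’s actual implementation or metric names — each request can emit a latency timer plus a counter segmented by view name, HTTP method, and status code:

```python
import time

class RecordingClient:
    """Stand-in for statsd.StatsClient that records what was emitted,
    purely so this sketch is runnable without a StatsD server."""
    def __init__(self):
        self.emitted = []

    def timing(self, key, ms):
        self.emitted.append(("timing", key))

    def incr(self, key):
        self.emitted.append(("incr", key))

def instrument(view, client):
    """Wrap a Django-style view so every call emits a latency timer and
    a counter keyed by view name, method, and response status."""
    def wrapper(request, *args, **kwargs):
        start = time.time()
        response = view(request, *args, **kwargs)
        elapsed_ms = int((time.time() - start) * 1000)
        base = "view.%s.%s" % (view.__name__, request.method.lower())
        client.timing(base, elapsed_ms)
        client.incr("%s.%s" % (base, response.status_code))
        return response
    return wrapper

# Tiny fakes so the sketch runs standalone:
class Request:
    method = "GET"

class Response:
    status_code = 200

def feed_list(request):
    return Response()

client = RecordingClient()
feed_list = instrument(feed_list, client)
feed_list(Request())
```

Segmenting keys this way is what lets Graphite later slice latency and error rates per endpoint, per verb, and per status code.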

Monitoring the Django application was only half the picture though. I still needed to capture system-level metrics and other miscellaneous parts of our application state. I found a handful of open source projects for collecting such metrics and feeding them into StatsD and Graphite, but instead of using any of them I opted to put together a quick solution with the psutil Python library:

import os
import time

from django.conf import settings
from django.contrib.auth.models import User
from django.core.management.base import NoArgsCommand
from django.db.models import Sum
import psutil
import redis
import statsd

# The getsentry.com client
from raven.contrib.django.raven_compat.models import client as raven_client

from kouio.feeds.models import Feed, Item


statsd_client = statsd.StatsClient()
redis_client = redis.Redis()

last_disk_io = psutil.disk_io_counters()
last_net_io = psutil.net_io_counters()
time.sleep(1)


def io_change(last, current):
    return dict([(f, getattr(current, f) - getattr(last, f))
                 for f in last._fields])


while True:

    memory = psutil.phymem_usage()
    disk = psutil.disk_usage("/")
    disk_io = psutil.disk_io_counters()
    disk_io_change = io_change(last_disk_io, disk_io)
    net_io = psutil.net_io_counters()
    net_io_change = io_change(last_net_io, net_io)
    last_disk_io = disk_io
    last_net_io = net_io

    gauges = {
        "memory.used": memory.used,
        "memory.free": memory.free,
        "memory.percent": memory.percent,
        "cpu.percent": psutil.cpu_percent(),
        "load": os.getloadavg()[0],
        "disk.size.used": disk.used,
        "disk.size.free": disk.free,
        "disk.size.percent": disk.percent,
        "disk.read.bytes": disk_io_change["read_bytes"],
        "disk.read.time": disk_io_change["read_time"],
        "disk.write.bytes": disk_io_change["write_bytes"],
        "disk.write.time": disk_io_change["write_time"],
        "net.in.bytes": net_io_change["bytes_recv"],
        "net.in.errors": net_io_change["errin"],
        "net.in.dropped": net_io_change["dropin"],
        "net.out.bytes": net_io_change["bytes_sent"],
        "net.out.errors": net_io_change["errout"],
        "net.out.dropped": net_io_change["dropout"],
        "queue.pending": redis_client.llen("kouio-feed-list"),
        "totals.users": User.objects.count(),
        "totals.feeds": Feed.objects.count(),
        "totals.items": Item.objects.count(),
    }

    thresholds = {
        "memory.percent": 80,
        "disk.size.percent": 90,
        "queue.pending": 20000,
        "load": 20,
    }

    for name, value in gauges.items():
        print name, value
        statsd_client.gauge(name, value)
        threshold = thresholds.get(name, None)
        if threshold is not None and value > threshold:
            bits = (threshold, name)
            message = "Threshold of %s reached for %s" % bits
            print message
            raven_client.captureMessage(message)

    time.sleep(1)

Writing our own code here affords us full flexibility, and with the psutil library it was a trivial task. We’re also able to integrate directly with Django’s ORM and Redis to keep track of growth and other parts of the system’s state. Finally, you’ll see we implement some basic threshold monitoring. We keep this as simple as you could imagine by integrating it with Sentry, the system we use for tracking exceptions in the application. By treating breached thresholds as application exceptions, we don’t need to worry about our threshold checks spiralling out of control with millions of notifications; that’s all handled for us by Sentry.
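The de-duplication that makes this safe is worth spelling out: Sentry groups identical events into a single issue, so a threshold that stays breached for an hour produces one issue with a rising counter rather than thousands of alerts. In miniature — this shows only the idea, not anything like Sentry’s real grouping logic:

```python
def record_event(issues, message):
    """Group identical messages into one 'issue', counting repeats;
    return True only when the message is new and worth notifying about."""
    if message in issues:
        issues[message] += 1
        return False
    issues[message] = 1
    return True

issues = {}
first = record_event(issues, "Threshold of 80 reached for memory.percent")
repeat = record_event(issues, "Threshold of 80 reached for memory.percent")
```

Because our threshold messages are stable strings per metric, every repeat collapses into the same group.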

How does this all look once it’s up and running? It’s worth mentioning that installation was hardly straightforward, requiring half a dozen components sourced and built in different ways, one of the downsides of not using an off-the-shelf product. Once everything was set up, though, I really went to town on our initial dashboard, arranging and colourising every single metric I thought remotely useful:

After a couple of weeks I was able to greatly refine the dashboard, adding new metrics as I discovered them and throwing out a ton that didn’t turn out to be as useful as I’d originally thought, ending up with a far more focused dashboard:

Mission accomplished. I was then able to perform lots of different experiments around tuning our database, workers and RESTful API, while visualising the effect on the system as a whole:

Incidentally, the best performance gains involved re-working our database indexes, as well as some awful little tricks using PostgreSQL CTEs — but that’s a story for another post.

Real-time Graphs

What good is all of this if we can’t watch the graphs animate pixel by pixel as the seconds tick by? Graphite provides a really powerful API for producing graphs, but the output is still static PNG images. What I did was modify the Graphite dashboard template with the following JavaScript snippet, which iterates through the graphs and reloads them one by one in succession, producing the desired effect:

var current = -1;
var container = document.getElementsByClassName('graph-area-body')[0];
var imgs = container.getElementsByTagName('img');

setInterval(function() {
    current += 1;
    if (current >= imgs.length) {
        current = 0;
    }
    var rand = '&rand=' + Math.random();
    imgs[current].src = imgs[current].src.split('&rand=')[0] + rand;
}, 1000);
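For context, each img element the snippet cycles through is just a call to Graphite’s render API; the rand value it appends is purely a cache-buster on top of standard render parameters like target, from, width, height, and format. A sketch of building such a URL (the metric path shown is a hypothetical example):

```python
from urllib.parse import urlencode

def render_url(target, frm="-10minutes", width=400, height=250):
    """Build a Graphite /render/ URL for a PNG graph of one metric."""
    params = urlencode({
        "target": target,
        "from": frm,
        "width": width,
        "height": height,
        "format": "png",
    })
    return "/render/?" + params

url = render_url("stats.timers.kouio.api.response_time.mean")
```

Swapping in a fresh rand query parameter leaves everything before it untouched, which is exactly why the split-on-'&rand=' trick in the JavaScript above works.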
