A lot of tools available in IT/Sysadmin/Ops/DevOps are disappointing:

They don’t fit your environment. They lack features or are designed for a different sort of environment (e.g., cloud vs. hardware, Linux vs. Windows, distributed vs. centralized)

You can’t interact with them programmatically

They cost too much

They are not customizable enough, or require too much customization to get off the ground

They feel kludgy, unreliable, outdated, or like the programmers were stoned

They don’t fit with your company’s culture (e.g., Enterprise vs. Agile)

In short, a lot of stuff is too expensive, isn’t a good fit, or is simply bad software. This leaves an ops team with two options: whine about it, or create their own tools. So at Stack Exchange we build our own DevOps tools.

Status

Nick Craver’s baby, which we just call “Status”, is at first glance a monitoring dashboard, but it is essentially a collection of tools that fill various needs:

An overview of CPU, memory, and network utilization for all our servers, as well as a detailed view. Done with responsive and interactive D3 graphs as well as sparklines, it helps compensate for SolarWinds’ terrible interface.

SQL Server monitoring: SQL’s built-in clustering views are deeply flawed. If a node loses connectivity, it stops updating the status of remote nodes, so it can show everything as connected and fine even when there is no connectivity. We also get to see the most expensive queries, active queries (using whoisactive), current connections, and which DBs are on which server.

HAProxy monitoring and administration: With multiple instances of HAProxy, we needed a single view instead of HAProxy’s built-in display. This also gave us a nice web interface for taking servers out of rotation.

Redis: A nice presentation of Redis Info across all instances and all servers, plus a display that shows what is slaved to what at a quick glance.

Elasticsearch: A health overview of our clusters (as well as index and shard data).

A dashboard of all the exceptions generated by our applications.

Status is a C# / .NET app. It polls data from various sources: sometimes the system directly, and other times via Orion. There is a lot more to Status that makes it awesome, but the real accomplishment is that it lets us see the general health of our main infrastructure at a glance.
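The stale-status problem described above for SQL’s clustering views is one reason to poll each source directly and remember when it last answered. What follows is only a minimal sketch of that idea in Python, not Status’s actual code (Status is a C# app). The `NodeStatus` class, the `effective_state` function, and the 30-second threshold are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold: how old a report may be before we stop trusting it.
STALE_AFTER = timedelta(seconds=30)


@dataclass
class NodeStatus:
    name: str
    state: str             # last state the node reported, e.g. "connected"
    last_update: datetime  # when that report was received


def effective_state(status: NodeStatus, now: datetime) -> str:
    """Return the reported state, or "unknown" if the report is too old."""
    if now - status.last_update > STALE_AFTER:
        return "unknown"
    return status.state


if __name__ == "__main__":
    now = datetime(2013, 1, 1, 12, 0, 0)
    nodes = [
        NodeStatus("sql01", "connected", now - timedelta(seconds=5)),
        NodeStatus("sql02", "connected", now - timedelta(minutes=10)),
    ]
    for node in nodes:
        print(node.name, effective_state(node, now))
```

The point of the design is that a dashboard should show “unknown” rather than repeat the last good value once a source has gone quiet.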

Web Logging

If your business is creating and running websites, your web logs are gold. We use the logs generated by our load balancer, HAProxy, as our canonical web logs. In their raw text format, web logs are often not that useful (this is particularly true with over 100 million records a day). However, we parse and structure our web logs in a few different ways:

We have a C# service that Jarrod Dixon wrote that inserts them into SQL so we can query them. To query them we use an instance of Data Explorer and SQL Server Management Studio, and we also run certain lookups directly from our sites

We display real-time graphs of various log information with Realog, a system I created with Go, Redis, and NVD3.js so we could view activity live without having to write queries
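To give a feel for what “parse and structure” means here, the sketch below turns one raw HAProxy line into a dictionary of named fields. It is a rough Python illustration, not the C# service or Realog. It assumes HAProxy’s standard HTTP log format, pulls out only a handful of fields, and the sample line is the kind of example found in HAProxy’s own documentation. The names `LINE_RE` and `parse_line` are made up for this example.

```python
import re
from typing import Optional

# Matches the part of an HAProxy HTTP log line after the syslog prefix.
# Only a few fields are named; the rest are skipped over.
LINE_RE = re.compile(
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+) '
    r'\[(?P<accept_date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>[\d/+-]+) '
    r'(?P<status>\d{3}) (?P<bytes_read>\+?\d+) '
    r'.*"(?P<method>\S+) (?P<path>\S+)[^"]*"$'
)

SAMPLE = (
    'Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
    '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.search(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # The last timer in the slash-separated group is the total time in ms.
    timers = record.pop("timers").split("/")
    record["total_ms"] = int(timers[-1])
    record["status"] = int(record["status"])
    record["bytes_read"] = int(record["bytes_read"])
    return record


if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Once each line is a record like this, inserting it into a database table or pushing counters to a live graph is straightforward.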

One of the interesting things we do with our web logs is to add extra information by adding headers inside the app and stripping them from the response at HAProxy. For example, we capture how many Redis and SQL queries were involved in that request and how long they took.
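HAProxy can record selected headers with its `capture response header` directive, which writes the captured values into the log line as a brace-delimited, pipe-separated block. Here is a small Python sketch of reading such a block back into named numbers. The field names and their order are hypothetical (the post does not say which headers we actually use), and `parse_captures` is an illustrative helper, not part of any real tool.

```python
from typing import Dict, Optional, Sequence

# Hypothetical fields, listed in the order the headers would be captured.
CAPTURED = ("redis_count", "redis_ms", "sql_count", "sql_ms")


def parse_captures(
    block: str, names: Sequence[str] = CAPTURED
) -> Dict[str, Optional[int]]:
    """Turn a capture block such as "{3|12|2|45}" into named integers.

    Empty slots (header absent on that response) become None.
    """
    if not (block.startswith("{") and block.endswith("}")):
        raise ValueError(f"not a capture block: {block!r}")
    values = block[1:-1].split("|")
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return {name: int(v) if v else None for name, v in zip(names, values)}


if __name__ == "__main__":
    print(parse_captures("{3|12|2|45}"))
```

Because the values travel in the log line rather than in the response, the client never sees them, but every request in the logs carries its own Redis and SQL cost.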

Patch Dashboard

OS updates can be a bit tedious, even more so in a mixed Windows and Linux environment. Steven Murawski and George Beech created a dashboard that allows us to:

View the outstanding patches and patch count for both Linux and Windows

Trigger updates on either Linux or Windows

Schedule time frames for automatic Linux updates

What’s Next

If you want to learn more about these tools and DevOps at Stack Exchange, come see George, Nick, and Steven present “Building for Operations” at Velocity.

Keeping all this stuff to ourselves feels a bit greedy. However, for something open sourced to be very useful, it usually needs to be made a bit more generic, which takes time. We also want to build a lot more. Our inventory system, Racktables, lacks an API, so we need a new one or a way to extend it. We want to build our own monitoring system (likely on top of OpenTSDB). In order to create more and open source it, we need help, so we are looking for a full-time developer with ops experience to join our SRE team. If you are awesome and want to build awesome ops stuff and open source it, come join us!