In May 2014, Landon Linden, aka Landon McDowell, the Lab’s VP of Operations and Platform Engineering, wrote a blog post explaining why a series of issues had combined to make Second Life especially uncomfortable for many.

At the time, as many bloggers and commentators – myself included – noted, the post came as a breath of fresh air after so long without meat-and-veg communications from the Lab about what is going on with the platform and why things can go wrong.

Now Landon is back explaining how the Lab’s Ops team responds to issues within their services, the communications tools they use – and why the tools are so effective.

An Inside Look at How The Ops Team Collaborates is once again an interesting and informative piece, delving not only into the technical aspects of how the Lab responds to problems within its services, but also into the very human aspects of dealing with issues – handling emotions when tensions are high, opening the window for those not directly involved in a matter to keep an eye on what is happening so that they can make better-informed decisions on their own actions, and more.

The core of the Lab’s approach to incident communications is the use of text chat (specifically IRC) rather than any reliance on crash team meetings, the telephone and so on. Those who deal with the Lab on a technical level won’t be surprised at the use of IRC – it is a fairly strong channel of communication for the Lab in a number of areas; but what makes this post particularly interesting is the way IRC is presented and used: as a central incident and problem management tool for active issues; as a means of ensuring people can quickly get up to speed with both what has happened in a situation and what has been determined or done in trying to deal with it; as a source of post-mortem information; and as a tool for helping train new hires.

These benefits start with the sheer speed of communication that chat allows, as Landon notes:

The speed of text communication is much faster. The average adult can read about twice as fast as they can listen. This effect is amplified with chat comms being multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.

In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as leader-designating speakers or “conch shell” type protocols. In practice though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.

In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up due to the presence of scroll back. In verbal comms, responders check-out to do some work and lose track of the conversation resulting in a lot of repeating.

He also notes that not everyone is involved in a situation right from the start. Issues get escalated as they evolve, additional support may be called in, the net may be widened in the search for underlying causes, requiring additional teams to be involved, or the impact of an incident may spread. Chat, and the idea of “reading scrollback” as the Lab calls it, allows people to come on-stream for a given situation and become fully au fait with what has occurred and what is happening in a manner not always possible through voice communications and briefings, and without breaking the ongoing flow of communications and thinking on the issue.

The multiplexing capabilities of chat also mean that individuals can disengage from the main conversation and have private exchanges which, while pertinent to the issue, might otherwise derail the core conversation or even be silenced in something like a teleconference – and those engaged in such exchanges can still keep abreast of the central conversation.

For an environment like the Lab, where operations and personnel are distributed (data centres and offices located in different states / on different coasts, not everyone working from an office environment, etc.), chat has proven a powerful tool, although one that may take time to get to grips with, as Landon notes of his first exposure to it:

I … just sat there staring at the screen wondering what the hell had just happened, wondering what the hell I had gotten myself into. I thought I was a seasoned pro, but I had never ever seen an incident response go that smoothly or quickly. Panic started to set in. I was out of my league.

However, the benefits of using it far outweigh the degree of gear shifting required of Ops staff in learning the approach. As Landon states in closing his comments, “when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.”

This is another great insight into what happens inside the Lab, and as such, the post makes very worthwhile reading, whether or not you have a background in Ops support.