Transcript

Wrightson: I'm Nicky Wrightson, and I am absolutely terrible at out-of-hours support. I first went onto an out-of-hours support rota about 10 years ago. I was working for a bank, and my first call in the middle of the night, I can't actually tell you much about it, because I don't remember answering my phone. I got into work the next day, and I got a bollocking from my boss for answering my phone and not doing anything about the incident. I didn't know what he was talking about. What had actually transpired, as I worked out with the ops guys, is that I was still dreaming when I answered the call, and I thought I was taking a mission from James Bond. Needless to say, I hadn't opened my laptop and dug into the alerts. Even if I hadn't thought I was taking a mission from James Bond, I'm pretty rubbish at 3 a.m. I can barely remember my own name, so it's in my best interest not to get called.

I've recently had the pleasure of working at the FT, and I hope everybody saw Sarah [Wells]'s keynote on Monday. If you have, you might get déjà vu from some parts of this talk; I have not plagiarized her keynote in the last two days. We in fact worked on the same content platform, so some of the things might overlap a little bit. We were in a really unique place there. We had a greenfield project, and we were properly empowered to make the right decisions, so we defined a lot of our operational support model, not just the tech stack.

Since then, I've moved on to River Island, very recently, and it's in my best interest to sort this out here as well, because now the countdown is ticking until I end up being put on that rota, and I don't want to get called. So we are building these really complicated microservice systems. The illegible diagram up there is part of the content platform. The reason why I know it's only part of the content platform is that my Lucidchart skills completely and utterly failed me, so there are bits of functionality missing up there. We're building these really complicated microservices. Oh, it's got a bit quiet now.

Martin Fowler said there are some tradeoffs to microservices, and one of the tradeoffs is operational complexity. The more complex these ecosystems that we're building are, the harder it is to support them, and especially when we're deploying things so much more regularly and tiny bits here, and there, and everywhere. Also, we're empowered now. We can choose whatever tech we want, within reason, of course, but this is just an example team at River Island and the tech that's being used just on a four-person team.

How can the sort of traditional model of one operations team support a couple of teams like this, let alone 20 or 30 in a larger organization? We are empowered to run tech, yes. Traditionally we had hardware, and there were dedicated people keeping that hardware up, but we're now empowered to own the support model too, because we have much more influence than we ever had before. The cloud revolution has allowed us to own the whole process.

When I joined the FT, we were right at the start of the journey. We were just implementing microservices in this greenfield project, as I said, and we were really flaky, I mean really flaky. Our consumers added caching in front of us so that they could guard against our flakiness. Quickly though, our services were going to be used for the new ft.com that was being rolled out, so we needed to up our game and we had to agree certain service levels with the other teams. One of those was a 15-minute recovery time. That's really hard when you've got quite a lot of data behind you. If that wasn't enough, we then had to take all 150 of the microservices we had in our platform and move pretty rapidly off our hand-rolled container platform to Kubernetes, all whilst keeping that 15-minute recovery time.

I was at a party recently, and I was a great guest because I went around my old team annoying them, asking them about their recent call outs, and they'd had two call outs in the last six months. I said, "What did you need to do?" They went, "Absolutely nothing. We just reassured operations they knew what to do." It's an amazing sort of trajectory from flakiness to just purely reassurance. I would like to point out we did not develop a mythical, magical system of perfection here. We just worked out ways not to be called out of hours.

An Engineer’s Mindset

I've bundled the things that got us to that point of no calls into five groups, which I'm going to go through now. When I was initially writing this talk, I had this slide at about number three, or something. But I soon realized that without the buy-in of the engineers and the change in the engineers' mindset, it doesn't matter what I tell you about technical approaches. You might implement them, but if you don't think this is a primary concern, you're going to do nothing with them.

Charity Majors, I quote her quite a lot in this space. I'm not a sysadmin, and I don't come from a sysadmin background, but I think that she's got a really good point here. Operability should be an engineer's primary concern. They should take it as part of being done on a project; it's not an afterthought. We need to trust the teams to know how best to do this. They need to understand what they need to do, but they also need to foster relationships where needed. They're the ones that know how to support this. Traditionally, you have this kind of operational, first-line sort of team, and it's important for them to work with the development team if the developers are second or third line.

One way we saw this working is that we would have two buckets of engineers. Sarah [Wells] did mention some of this - I think it was three buckets when she mentioned it, and I did a later slot on this one - so, two buckets of engineers. The operations team could round-robin one of those buckets of engineers until somebody picked up. This was really good. It meant that we could spontaneously get drunk, go on a bike ride, go for a swim. It didn't matter, because we knew that there would be somebody else on that support rota. The caveat is, if you all go to the pub together, there can be a problem.

This was completely voluntary and compensated, and it really drove home the importance of that relationship between operations and the engineers. It was things like: we all had the operations team's phone numbers, so that we recognized the number when the phone went, and small points like that really helped this model succeed. Our support model extended into the day. We'd have two people dedicated to in-hours support, so they would triage queries, issues, and things left over from the night before. They would also drive improvements to the platform. We would see issues, and they would actually address those issues in hours. This meant that, with that big diagram I showed you earlier, people touched all different parts of it rather than just areas where they were subject-matter experts.

It also made them think about how they design their business functionality differently. Now, we need to start thinking that every error might result in that phone call waking you up. What does that error mean to you? I tell you what, people are not going to drop logs and not adapt errors when they know that there's going to be a call on the line. Also, we want to work out the ways in which we can adjust our severity. That error that I showed you before might be meaningless. It might be something that you just want to track. You certainly don't want a call from it. The example above is that we were running CoreOS Container Linux, and we wanted to know when we needed to update and do security patching. It's really critical to do security patching. I'm not going to do it at 3 a.m., though. I'm going to do it after my coffee and croissant in the morning. That would just give us a warning. It wouldn't go critical.

There are a couple of laws by Lehman and Belady, from back in the '70s and continued into the '90s. One talks about a system declining unless it's vigorously maintained. That certainly is applicable to actual software; we all know the importance of refactoring. However, it also works for the operational model, the support model. If you don't pay attention to the way that you're supporting a system, it will decline. It's closely related to one of their earlier laws, which says that the systems we are building increase in complexity unless we really work hard on reducing that complexity. Again, that holds true for the technology, the code, and the process.

Don’t Get Called for Issues That Could Have Been Caught in Office Hours

We're now thinking in a different way. We know that we need to have this as a primary concern. What's next? We don't want to get called for issues that arise during the day. Of course we don't. This is the biggest one: releases are the biggest cause of stuff that happens during the day that we could end up getting called for. But there are some simple ways to reduce the risk of releases causing that. How many people won't release code at 5:00 in the evening? Would anybody do it on a Friday afternoon at 5:00?

I think that I've got a slightly different view on this, in that sometimes, if you look at why there is a fear of 5:00 releases, it can show you some stuff about the low confidence of your team. Why does your team have that low confidence? Is there something wrong in the deployment process? Can they not verify their releases? Do they not have the tooling in place to have confidence in their release? However, this is always a balance of risk. You want to work out the risk of it breaking and how quickly you can roll it back, so it's a bit touch and go. Well, I still wouldn't do it at 5:00 on a Friday, though.

Also, deployment times. Don't underestimate the attention span of anybody. I walk into rooms and immediately forget why I walked there. When it comes to slow deployments it's even worse, because people will have a cup of tea, or will just go home, without knowing if their release actually was successful. We recently had a problem where it was taking 12 minutes, once we pushed our containers to a container repo, for us to actually go completely live with that new container. We were actually running in a sort of A/B scenario for a while, not intentionally. I could only get that down to five minutes because of some funky networking. This is not quick, and I've got a likelihood of forgetting, so a really easy way to get around that is just to hook it into some form of notification.

If you've got a slow deployment, hook your CI into Slack, or whatever you need to do. Just make sure that there's something poking you to go and check. You've got to go and check. Now, there was a brilliant talk yesterday by James about progressive delivery, and this is part of what that's about: the testing-in-production style of things, not the scary, "I'm going to throw things into production, cross my fingers, and hope for the best." It's, "Make sure that you have test accounts" - the simplest thing - so you're able to do manual tests in production.
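
As a rough sketch of what that notification hook could look like at the end of a CI job. The webhook URL and message wording here are purely illustrative; the payload shape (a JSON object with a "text" field) is what Slack's incoming webhooks accept:

```python
import json
import urllib.request

# Hypothetical webhook URL -- substitute your own Slack incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def deployment_message(service: str, version: str, status: str) -> dict:
    """Build the Slack payload that nudges somebody to go and verify."""
    return {
        "text": f"Deployment of {service} {version} finished: {status}. "
                f"Go and check it actually works!"
    }

def notify(service: str, version: str, status: str) -> None:
    """Fire-and-forget POST from the end of the CI pipeline."""
    payload = json.dumps(deployment_message(service, version, status)).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The point isn't the tooling; it's that something, anything, pokes a human to verify the release once the slow rollout has actually finished.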

You can also put your stuff behind feature flags. If you've not got confidence in your release, you could just have it on during the day and turn it off at night. That's not going to cause you a call. There's a plethora of other things that you can do, such as on the observability side, once it's out there. I don't think I'm the only person quoting Cindy here; I've heard her come up a couple of times. I really recommend her talk on testing microservices. She talks about a lot of that: canarying, and such like.
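
A minimal sketch of that "on during the day, off at night" idea, assuming a hypothetical in-process flag store keyed by flag name (real systems would use a proper feature-flag service):

```python
from datetime import datetime, time

# Hypothetical flag store: flag name -> (on_from, on_until), office hours only.
OFFICE_HOURS_FLAGS = {
    "new-checkout-flow": (time(9, 0), time(17, 30)),
}

def flag_enabled(name: str, now: datetime = None) -> bool:
    """A nervous release runs during office hours and goes dark overnight,
    so a failure can't page anyone at 3 a.m."""
    window = OFFICE_HOURS_FLAGS.get(name)
    if window is None:
        return False  # unknown flags default to off
    start, end = window
    current = (now or datetime.now()).time()
    return start <= current <= end
```

A failing code path that only ever runs between 9:00 and 17:30 is a failing code path somebody is awake to notice.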

The other thing that's going to cause you issues is the monstrous, big, 3 a.m. batch job. Everybody has had these in their history. Unfortunately, I've got a few at my present company at the moment. What you're doing here is moving the heavy lifting right to the point where you don't actually want to have a problem. You are going to get called; it is a matter of time. There's a simple way to stop this happening. Here's a really simple trading system: it gets some trades into a queue, does some stuff, and saves them off. Then at 3 a.m., it goes and gets all the day's trades, maybe aggregates them, transforms them, and sends them off to a reconciliation tool.

Now, what's happening here is that most of the work is being done at 3 a.m. You get some dodgy data, which you will. At some point, there will be something broken, or some hiccup along the line, and you might get called because they won't be able to reconcile those trades. What we could do instead is take the transformation over to real time. I've simplified this diagram somewhat; you probably don't want a single reconciliation file. Do the transformation, or maybe even the aggregation, as you get the trades in. And the slide says "orders," I'm afraid. Orders, then: do your order transformation and then send it over to reconciliation. Just take that heavy lifting to real time, then create the reconciliation file and send it over to the third party at 3 a.m. It means that the work being done out of hours is very small. So don't get called out of hours for things that you could have caught during the day.

Automate Failure

If there is a problem, we want this automated. If you can recover from failure in an automated manner, do so, because computers are better than us at doing this stuff. The tooling has come such a long way in the last five years. We've got things like Kubernetes - I keep pronouncing it wrong - and serverless that will do a lot of the heavy lifting and the recovery for you. However, to get these tools to actually play nice with your applications, you need to have a set of principles that you adhere to.

Firstly, you've got to let them terminate. These tools are going to take your stuff, stop it, and throw it somewhere else. Transactional: you don't want your data half-baked anywhere. I think that's just general good practice; it doesn't matter if it's on one of these platforms. Things need to be able to be restarted. If your service is chucked onto another VM, it needs to restart cleanly and pick its queue back up, so that you don't end up with half-baked data and you can recover easily. Idempotent: if you can replay things without adverse effects, it's a real advantage. Stateless, ideally. State is always a problem, and half of the rest of this list is really about state.
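
As a tiny illustration of the idempotency principle, assuming every event carries a unique ID (the names here are made up). An idempotent handler can be replayed after a crash or restart without applying anything twice:

```python
# Illustrative in-memory state; a real service would persist both of these.
applied_events: set = set()
balances: dict = {}

def apply_event(event: dict) -> None:
    """Apply an event exactly once, however many times it is delivered."""
    if event["id"] in applied_events:
        return  # already applied -- safe to replay after a restart
    account = event["account"]
    balances[account] = balances.get(account, 0) + event["delta"]
    applied_events.add(event["id"])
```

With this property, "just replay everything since the last checkpoint" becomes a safe, automatable recovery step rather than a data-corruption risk.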

Coming back briefly to idempotency: if you make your services idempotent, you can replay failed events, and you can even send events over and over again. If all else fails, you can go bigger than just failed events. We had a whole separate region, so you could have your platform duplicated in another region. What this allows you to do is weather the storm of larger cloud outages. This helped us in 2017 when we had the S3 outage; we were able to fail over to the EU. A word of warning, though: if you have another platform and you release code to both platforms at the same time and you've got a bug, of course, you're going to take down both platforms and be left with nothing. The other thing is that problems can infect both platforms, so be very careful of the blast radius of certain things. We learned this the hard way. As I was saying, we had a stack in the U.S. and a stack in the EU, and we had all of our services duplicated, all 150 of them, plus data stores, Kafka, a load of stuff. On a normal day, we would just geolocation-balance the load.

However, one day - we hadn't moved over to Kubernetes yet; we were working on our own hand-rolled container orchestration system - this is what we saw, and you might recognize this screenshot. Sarah also has the same one, because we worked on the same thing. What was happening is we had lost our EU cluster, and we started to lose our U.S. cluster. Something was going on. You won't believe me when I actually tell you this is the moment that one member of our team found out about this outage. He's still trying to get me to pay for another beer because it got left on the table. What had happened is that we had started to lose our U.S. cluster as well. We boldly then went, "Well, where are we going to serve traffic from? Ah, should we serve it from staging? Oh, but the same thing will happen there. Okay, what can we do?"

We bypassed every single service in our new, sparkly, expensive system, and purely used these huge platforms to route traffic to our old APIs in the data centers. It wasn't pretty, but it meant that the FT still had the capability of breaking news, and it got us through the night. We could go home and know that we could come in the following morning, fix this stuff, and find out what was going on. We hadn't got a clue at this point. What had happened is we had tried to containerize a graph database, back when we didn't know how to containerize many things, and we messed up: we hadn't set limits on how big our containers could grow. One long-running query to a graph database will just eat, and eat, and eat memory, and it was blowing VM after VM. We failed over to the U.S., and, of course, the same thing happened, and it had infected both of our platforms.

Yes, so do try to make sure you understand what your blast radius is. Nowadays, there are even more things you can do. When you're on Kubernetes, it helps you; you could have namespacing around certain areas, for example. So, automate all your recovery wherever you can. Whatever that means, replaying events, failing over, whatever it is: if it can be done without a human, let the computers do it.

Understand What Your Customers Really Care About

It's really hard to understand what your customer finds important. I was recently talking to an ex-colleague of mine, and they were writing an MVP, a minimum viable product, for a train line in Europe. It had 500 requirements. They thought every single one of those was important. So trying to tease this out of your customer is pretty tricky. For the FT, it was brand. It might be revenue; of course, now at River Island, it's definitely revenue. The stakes could be far higher if you're working in a power station or a hospital. But once you know what is important to your business and customer, you need to be able to react to issues before they find out about those issues. You want to be the ones that are alerted on failure, not have somebody knock on your door going, "Yes, we can't publish content."

This looks like a quote from Sarah. However, this is her mantra; if anybody's ever worked with Sarah, you'll hear this at least a dozen times. I found it quite hard when I first joined the FT, because, "Oh, I'm getting all these 500s. Surely I need to alert on these." And we would go, "Are you going to do anything with them?" "No." "So why alert?" You're just creating noise and alert fatigue. Also, not all of your services are equal. We used to have a little image-cleaner app back in the old hand-rolled container system. That's certainly not equal to something that takes payments, for example. "I will fix that little app once I'm in the following day, maybe the day after that."

If you can, do synthetic requests. Synthetic requests are manufactured requests pushed through your system in an automated fashion. They could even be replays of old requests. We had a kind of kitchen-sink request as our first ever synthetic request, which contained literally everything that an article could possibly contain, and you can pump those through your system all the time. What this is really good for is spiky traffic. The FT would be publishing quite a lot during the day, Monday through Friday. Saturday night: dead, the occasional publish. What you end up with at that time, without synthetic requests, is Schrödinger's platform. Is it dead, or have we just not published anything? You can't tell what state your platform is in. So we used to pump these through, and the alerts would be exactly the same for these synthetic requests as for normal requests. It also gave us a massive advantage for rolling new releases out, because we could see whether these synthetics failed against our new releases.
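
A minimal sketch of a synthetic-request pump. Here publish and was_delivered are hypothetical hooks standing in for your pipeline's real entry point and your monitoring query; the field names are illustrative too:

```python
import time
import uuid

def make_synthetic_article() -> dict:
    """The 'kitchen sink': an article exercising every field you care about."""
    return {
        "id": f"synthetic-{uuid.uuid4()}",
        "headline": "Synthetic check",
        "body": "every feature an article can possibly contain",
        "synthetic": True,  # flagged so it never reaches real readers
    }

def check_pipeline(publish, was_delivered, timeout_s: float = 30.0) -> bool:
    """Push one manufactured article through and wait for it to come out
    the other end. Alert on failure exactly as for a real publish."""
    article = make_synthetic_article()
    publish(article)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if was_delivered(article["id"]):
            return True  # platform is alive, not just quiet
        time.sleep(1.0)
    return False  # Schrödinger resolved: the platform really is broken
```

Run this on a schedule and a quiet Saturday night stops being ambiguous: either the synthetic made it through, or you have a real failure to alert on.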

I changed this slide today, you might notice. This is from the keynote this morning. Absolutely brilliant. I'm not going to talk a huge amount about tracing, considering we heard a great deal of detail this morning. But tracing is great for monitoring those critical flows, and you can take that further and actually provide alerting around it. We went down the track of having this monitoring app, though, which started life with the best of intentions. However, it went a bit wonky. Our very simple content platform: we get some content in, we transform it, we write it to the database. I've condensed that massive diagram to that; this is the simple form. And what we did is we put in this application that would monitor certain points. I've simplified some of that.

I don't know whether anybody else has a part of their system where, when you're doing, say, an estimate in a sprint, you've got a piece of work and you go, "Oh, that will be a three." Then somebody goes, "Oh, you've got to touch X, Y, Z," and suddenly the estimate skyrockets. This was us. For anything that touched that publish monitor, the estimates at least doubled. We realized that it was becoming very brittle and tightly coupled to some of our flows, and it was becoming a duplication of business logic. It had to know how our application code was structured.

We decided to move away from that, and surface events through logging. All critical events were flagged in our logs, as custom logs, so that we knew each was a monitoring event. What we would then do is chuck all those logs into Kinesis, where we could run some Streaming SQL. I've only put a tiny bit of it on the slide. It meant that we could take a time window and say whether publishes had got through in a certain time, or whether there were errors for a certain set of events. We could use it for monitoring, but we could also use the log aggregation for queries and following things up. It meant we were now logging those important events very close to the actual thing that we cared about, so when we refactored the code, the monitoring got refactored alongside it.
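
A rough Python analogue of what that windowed Streaming SQL check was doing, assuming each custom log line carries a timestamp, an event name, and a transaction ID (all names here are illustrative, not the FT's actual schema):

```python
def unfinished_publishes(logs: list, window_start: float, window_end: float) -> set:
    """Return transaction IDs that started within the window but never
    logged a completion event -- anything returned is worth an alert."""
    started, completed = set(), set()
    for line in logs:
        if not window_start <= line["ts"] <= window_end:
            continue  # outside the window we're checking
        if line["event"] == "PublishStart":
            started.add(line["tid"])
        elif line["event"] == "PublishEnd":
            completed.add(line["tid"])
    return started - completed
```

Because the events are emitted right next to the business code, the check stays honest when the application is refactored, unlike the old external monitor that duplicated the flow.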

We understand what customers want. Well, we kind of understand. Hopefully, we've identified a few flows. This is an iterative thing, by the way. Our first attempt at saying, "This is important," is very different from what it is now. Pick a starting point, it doesn't matter what the starting point is, and then take it from there.

Break Things and Practice Everything

Now that we are automating our failovers and we know what mindset we need, how do we deal with actually getting called at 3 a.m.? Well, if you answer the phone, that is. I'm not going to talk a lot about chaos engineering. Hopefully, you've seen Russ's talk before, and Crystal is coming on next, and I think she's going to talk about this, too. It's basically pulling down bits of your system and seeing how your system reacts. This can be extended, again, from your tech stack into your operational support model. We actually used it to test the interaction between our operations team and the second-line support team. The only problem is we didn't tell everybody properly, so our ops team went, "Yes, there's a problem over here with one of the deployments," and the response was, "Don't worry, we took it down." Kind of defeats the object of an actual practice run.
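
As a deliberately crude sketch of that kind of exercise. The service names and the docker CLI call are assumptions; adapt it to your own orchestrator, and, as the story above shows, tell everybody first:

```python
import random
import subprocess

# Hypothetical list of services that are fair game for a practice outage.
SERVICES = ["content-api", "image-resizer", "notifications"]

def kill_one(dry_run: bool = True) -> str:
    """Pick one service at random and stop it, then watch whether your
    alerts, your recovery automation, and your humans all do their jobs."""
    victim = random.choice(SERVICES)
    if not dry_run:
        # Assumes the services run as docker containers named after themselves.
        subprocess.run(["docker", "stop", victim], check=True)
    return victim
```

The dry_run default is the important design choice: the tool announces what it would break before anyone lets it loose for real.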

When do you go towards these kinds of chaos engineering tools? This is a kind of timeline from monolith to microservices. You normally start with a big chunk, tease some stuff out, then tease some more stuff out. Maybe that system then evolves more functionality. When do you get to the point that you're ready to do chaos engineering? There's no point if you've got a monolith, because which bit do you bring down? You've only got one thing, and you probably don't have the resilience to deal with that. But do you need to be all the way at the other end? Do you need to have finished your journey to do this? I don't think the answer is clear. I think there's a point somewhere along this line where you pick when you believe your system is resilient enough to deal with it.

But as I was saying, manual simulation of outages also helps; it's a practice mechanism. We did this quite a bit to start with, especially because we had a single point of failure. We honestly had a big, single point of failure in our system. We had all of our content coming in from our CMSs into one component, and this didn't scale well, because as soon as we scaled it, we got the order of publishes wrong, so we would overwrite newer publishes with older ones. We knew this was a problem, and we wanted to fix it, but it was quite an extensive fix.

When we initially started practicing outages, we erred on the side of caution, and we did manual take-downs of services, but not that service. Weirdly though, this single point of failure has really helped us out in the long run. Remember this slide? I was talking about multi-region failover. If we needed to make changes at that single point of failure, we'd do manual failovers. We would manually fail our entire cluster over to the U.S. and release to the EU, and vice versa. What this enabled us to do is practice that failover mechanism over, and over, and over again. And us engineers are lazy, so we want to make that command shorter, and shorter, and shorter. We want to script everything, so it's now literally so easy to completely fail over the platform, which the operations team thanked us for.

This is the kind of thing where, when you're fixing things during the day, you build confidence. You know how to do this stuff; you know how to get the platform steady, or at least holding its own. But whenever you do want to do manual interventions, make them as stupid as possible. I can't cope with doing a hell of a lot. I can't go and dig into logs at 3 a.m. I just want to know how to stabilize things. We had a job where you just pasted a load of failed content IDs in, and you could republish that content from source. This was just a Jenkins job, so the operations team didn't have to write a line of code.
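
A sketch of what that kind of "paste in the failed IDs" job might boil down to, with fetch_from_source and republish as hypothetical hooks into the platform (the real thing was a Jenkins job, not this code):

```python
def republish_failed(raw_ids: str, fetch_from_source, republish) -> list:
    """Take a pasted blob of content IDs (whitespace- or comma-separated),
    re-fetch each from the source of truth, and push it back through.
    Returns the IDs that were successfully republished."""
    ids = [i for i in raw_ids.replace(",", " ").split() if i]
    done = []
    for content_id in ids:
        content = fetch_from_source(content_id)
        if content is not None:
            republish(content)
            done.append(content_id)
    return done
```

The whole design goal is that the person running it at 3 a.m. makes exactly one decision (which IDs to paste) and writes zero lines of code.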

What's in your alerts is really important. Make them relevant only to actioning the issue. You don't want to have to go get an ID here, get a SQL connection there, look something up in a graph over here, and combine all of that knowledge at 3 a.m. to make the call. All you need to make a call on is: will a failover fix this? What events have failed? Will something need to be replayed? Or do I need to call somebody else? This is an example of one of those alerts. It's in plain English, and it's very descriptive of what's going on and what has failed. It gives a link to Splunk, so you can go and have a look at some more detail, but basically, you've got exactly what you need to action it right there.
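
As an illustration of an alert body that's actionable at 3 a.m. (the wording, the remediation advice, and the Splunk link here are all made up, not the FT's actual alert):

```python
def build_alert(service: str, failed_ids: list, splunk_url: str) -> str:
    """One message carrying everything needed to make the call: plain
    English, the failed event IDs, and a single link for digging deeper."""
    return (
        f"{service}: {len(failed_ids)} publish(es) failed to save. "
        f"A failover will not fix this; replay the failed events in hours.\n"
        f"Failed event IDs: {', '.join(failed_ids)}\n"
        f"Logs: {splunk_url}"
    )
```

Everything the responder needs, "can a failover fix it, what failed, where to look", is in the one message, so no knowledge has to be assembled from three systems half-asleep.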

This is probably my most important piece of advice: we like to fix things. We really want to fix things. Three a.m. is not that time. What you need to do is stabilize your system enough to get it limping into the next day. You need sleep, you need to be paired up with your colleagues, you need that greater information around you, and you need coffee, lots of coffee. You just need to get it to the point that it gets through the night.

I've given you all my advice on how to not get called in the middle of the night. Basically, most of my advice involves postponing everything to business hours. First off, make sure that you're actually thinking about this at every step of the development process; this is a primary concern. For intraday stuff, reduce the risk of it causing you calls: make sure your deployment is quick and a release is easy to verify. Automate as much as you possibly can, but don't just tell yourself that it's been automatically recovered, because you might want to dig into it. Understand your flows. Break things, and practice everything over and over again.

We're the ones that get called at 3 a.m. We own every aspect of this. Incorporate these practices into all parts of your development. If you're not an engineer, and you're in management, or a CEO or CTO, recognize this: you need to allow the engineers complete control over this, not just the code, but the whole model of operating it. If the engineers own it, we value it and we pay attention to it.

Questions & Answers

Participant 1: Do you know of any tool that can simulate synthetic requests that you would recommend?

Wrightson: I don't, because I think synthetic requests are very personal. We wrote our own system for that. We slightly over-engineered it to start with, but yes, we did our own system. When you're dealing with synthetic requests, it's idempotency and all of those things that you really need to understand about your system, and those requests need to be very reflective of the business needs, so they're very personal.

Participant 2: Could you elaborate more on how you decoupled your synthetic monitoring from the business logic and your APIs?

Wrightson: Rather than have a monolithic monitoring system that we had to replicate the flow in, we pulled the logging into the events. An example back in the slides: we failed to save to our graph database, so that's an event, and we had to add a custom log for it. Then, once we had done that, we pumped it into Kinesis. Admittedly, there was some business logic over on the Streaming SQL side, so there was some coupling, but it meant that we could contain it. It didn't matter if we changed approach from that Streaming SQL; we still had the event logs. Does that answer your question? You can catch me afterwards if you want more details on that.

Moderator: How do you go about when new people join the team, actually getting them up to speed on the engineers' mindset? What kind of things are done for that?

Wrightson: The simplest thing is pairing up very early on, on that in-hours support. We whinged about going on that in-hours support. We called it "ops cop." And we had a rota, and we were constantly whinging about it, because people want to do their business functionality. However, bringing new members of the team on and getting them to shadow and pair up on all of those issues built their confidence up, and they could get a bigger understanding of the platform at the same time.

Participant 3: If you get an alert, but it's not actionable, but it's definite that it has some user impact, but it's just not clear what to do and it requires some investigation - what do you do with such alerts?

Wrightson: If you're not going to action it, don't have it as an alert. You could have various other things, like a report, or go and investigate your 500 log. But if it's an alert that doesn't give you the right information to investigate it, then iterate on how you are alerting. This might be a new thing; you don't implement any of this on day one and have it stay the same. It's constantly evolving, because you don't get the same problem twice. I think Liz was telling us yesterday that there are brands of issues, but you never get the same one twice, so you have to keep evolving all of this operational side of things, so that the engineers have the power to make the right call at the right point.