2019-11-10 Scaling in the presence of errors—don't ignore them

Building a reliable, robust service often means building something that can keep working when some parts fail. A website where not every feature is available is often better than a website that's entirely offline. Doing this in a meaningful way is not obvious. The usual response is to hire more DBAs, more SREs, and even more folk in Support. Error handling, or making software that can recover from faults, often feels like the option of last resort—if ever considered in the first place. The usual response to error handling is optimism.

Unfortunately, the other choices aren't exactly clear, and are often difficult to choose between, too. If you have two services, what do you do when one of them is offline: try again later? Give up entirely? Or just ignore it and hope the problem goes away? Surprisingly, all of these can be reasonable approaches. Even ignoring problems can work out for some systems. Sort-of. You don't get to ignore errors, but sometimes recovering from an error can look very similar to ignoring it.

Imagine an orchard filled with wireless sensors for heat, light, and moisture. It makes little sense to try and resend old temperature readings upon error. It isn't the sensor's job to ensure the system works, and there's very little a sensor can do about it, too. As a result, it's pretty reasonable for a sensor to send messages with wild abandon—or in other words, fire-and-forget.

The idea behind fire-and-forget is that you don't need to save old messages when the next message overrides them, or when a missing message will not cause problems. Each message is treated as if it were the first message sent—forgetting that any attempt was made prior. Done well, fire-and-forget is like a daily meeting—if someone misses the meeting, they can turn up the next day. Done badly, fire-and-forget is akin to replacing email with shouting across the office, hoping that someone else will take notes.

It isn't that there's no error handling in a fire-and-forget client, it's that the best method of recovery is to just keep going. Unfortunately, people often misinterpret fire-and-forget to mean "avoid any error handling and hope for the best". You don't get to ignore errors. When you ignore errors, you only put off discovering them—it's not until another problem is caused that anyone even realises something has gone wrong. When you ignore errors, you waste time that could be spent recovering from them.

This is why, despite the occasional counter-example, the best thing to do when encountering an error is to give up. Stop before you make anything worse and let something else handle it. Giving up is a surprisingly reasonable approach to error handling, assuming something else will try to recover, restart, or resume the program. That's why almost every network service gets run in a loop—restarting immediately upon crashing, hoping the fault was transient. It often is. There's little point in trying to repeatedly connect to a database when the user is already mashing refresh in the browser. A unix pipeline could handle every possible bad outcome, but more often than not, running the program again makes everything work.

Although giving up is a good way to handle errors, restarting from scratch isn't always the best way to recover from them. Some pipelines work on large volumes of data, or do arduous amounts of numerical processing, and no-one is ever happy about repeating days or weeks of work.
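That run-it-in-a-loop strategy is simple enough to sketch. Here is a minimal, hypothetical supervisor in Python; the command and the backoff numbers are made up, and a real init system or process manager does considerably more:

    import subprocess
    import time

    def supervise(command, max_backoff=60):
        """Run a command forever, restarting it whenever it exits badly."""
        backoff = 1
        while True:
            started = time.monotonic()
            status = subprocess.run(command).returncode
            if status == 0:
                break  # a clean exit means we're done
            # if it ran for a while before crashing, assume the fault
            # was transient and reset the backoff
            if time.monotonic() - started > 60:
                backoff = 1
            time.sleep(backoff)  # wait a little, in case it wasn't transient
            backoff = min(backoff * 2, max_backoff)

    # supervise(["python", "my_server.py"])  # hypothetical service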
In theory, you could add error handling code, reduce the risk that the program will crash, and avoid an expensive restart, but in practice it's often easier to restructure code to carry on where it left off. In other words, give up, but save your progress to make restarting less time consuming. For a pipeline, this usually entails an awful lot of temporary files—to save the output of each subcommand, and the result of splitting the input up into smaller batches. You can even retry things automatically, but for a lot of pipelines, manual recovery is still relatively inexpensive.

For other long running processes, this usually means something like checkpoints, or sagas. Or in other words, transforming a long running process into a short running one that's run constantly, writing out the progress it makes to some file or database somewhere. Over time, every long running process will get broken up into smaller parts, as restarting from scratch becomes prohibitively expensive. A long running process is just that much more likely to run into an impossible error—full disks, no free memory, cosmic rays—and be forced to give up. Sometimes the only way to handle an error is to give up.

As a result, the best way to handle errors is to structure your program to make recovery easier. Recovery is what makes all the difference between "fire-and-forget" and "ignoring-every-error", despite their shared optimism. You can do things that look like ignoring errors, or even letting something else handle them, as long as there's a plan to recover from them. Even if it's restarting from scratch, even if it's waking someone up at night, as long as there's some plan, then you aren't ignoring the problem. Assuming the plan works, that is. You don't get to ignore errors. They're inevitably someone's problem.

If someone tells you they can ignore errors, they're telling you someone else is on-call for their software. That, or they're using a message broker. A message broker, if you're not sure, is a networked service that offers a variety of queues that other computers on the network can interact with. Usually some clients enqueue messages, and others poll for the next unread message, but they can be used in a variety of other configurations too.

Like a unix pipe, message brokers are used to get software off the ground. Similarly to using temporary files, the broker allows different parts of the pipeline to consume and produce inputs at different rates, but it doesn't easily allow replaying or restarting when errors occur. Like a unix pipe, message brokers are used in a very optimistic fashion: firing messages into the queue and moving on to the next task at hand. Somewhat like a unix pipeline, but with some notable differences. A unix pipeline blocks when full, pausing the producer until the consumer can catch up. A unix pipeline will exit if any of the subcommands exit, and return an error if the last subcommand failed. A message broker does not block the producer until the consumer can catch up.

In theory, this means transient errors or network issues between components don't bring the entire system down. In practice, the more queues you have in a pipeline, the longer it takes to find out if there's a problem. Sometimes that works out. When there's no growth, brokers act like a buffer between parts of a system, handling variance in load. They work well at slowing bursty clients down, and can provide a central point for auditing or access control.
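The save-your-progress idea is worth a small sketch before moving on. This is a minimal, hypothetical checkpointing loop; the file name and the unit of work are invented:

    import json
    import os

    CHECKPOINT = "progress.json"  # hypothetical checkpoint file

    def load_checkpoint():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["position"]
        return 0

    def save_checkpoint(position):
        # write-then-rename, so a crash mid-write can't corrupt the file
        with open(CHECKPOINT + ".tmp", "w") as f:
            json.dump({"position": position}, f)
        os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

    def process(item):
        ...  # hypothetical unit of work

    def process_all(items):
        start = load_checkpoint()  # give up, but resume from here next time
        for position in range(start, len(items)):
            process(items[position])
            save_checkpoint(position + 1)

None of this bookkeeping is glamorous, which is partly why message brokers look so tempting.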
When there is growth, queues explode regularly until some form of rate limiting appears. When more load arrives, queues are partitioned, and then repartitioned. Scaling a broker inevitably results in moving to something where the queue is bounded, or even ephemeral. The problem with optimism is that when things do go wrong, not only do you have no idea how to fix it, you don't even know what went wrong. To some extent, a message broker hides errors—programs can come and go as they please, and there's no way to tell if the other part is still reading your messages—but it can only hide errors for so long. In other words, fire-and-regret.

Although an unbounded queue is a tempting abstraction, it rarely lives up to the mythos of freeing you from having to handle errors. Unlike a unix pipeline, a message broker will always fill up your disks before giving up, and changing things to make recovery easy isn't as straightforward as adding more temporary files. Brokers can only recover from one error—a temporary network outage—so other mechanisms get brought in to compensate. Timeouts, retries, and sometimes even a second "priority" queue, because head-of-line blocking is genuinely terrible to deal with. Even then, if a worker crashes, messages can still get dropped.

Queues rarely help with recovery. They frequently impede it. Imagine a build pipeline, or a background job service, where requests are dumped into some queue with wild abandon. When something breaks, or isn't running like it's supposed to, you have no idea where to start recovery. With a background queue, you can't tell what jobs are being run right now. You can't tell if something's being retried, or has failed—but maybe you've got log files you can search through. With logs, you can see what the system was doing a few minutes ago, but you still have no idea what it might be doing right now. Even if you know the size of a queue, you'll have to check the dashboard a few minutes later—to see if the line wiggled—before you know for sure if things are probably working. Hopefully.

Making a build pipeline with queues is relatively easy, but building one that the user can cancel, or watch, involves a lot more work. As soon as you want to cancel a task, or inspect a task, you need to keep things somewhere other than a queue. Knowing what a program is up to means tracking the in-between parts, and even something as simple as running a background task can involve many states—Created, Enqueued, Processing, Complete, Failed—not just Enqueued, and a broker only handles the Enqueued part. Not very well. As soon as one queue feeds into another, an item of work can be in several different queues at once. If an item is missing from the queue, you don't know whether it's been dropped or processed; if an item is in the queue, you don't know if it's being processed, only that it will be.

A queue doesn't just hide errors, it hides state too. Recovery means knowing what state the program was in before things went wrong, and when you fire-and-forget into a queue, you give up on knowing what happens to it. Handling errors, recovering from errors, means building software that knows what state it is currently operating in. It also means structuring things to make recovery possible. That, or you give up on automated recovery of almost any kind. In some ways, I'm not arguing against fire-and-forget, or against optimism—but against optimism that prevents recovery. Not against queues, but how queues inevitably get used.
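To make the state-tracking point concrete, here is a minimal sketch of recording a job's lifecycle in a database alongside the queue. The table layout and state names are my own invention, not from any particular job framework:

    import sqlite3

    STATES = ("created", "enqueued", "processing", "complete", "failed")

    db = sqlite3.connect("jobs.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            state TEXT NOT NULL DEFAULT 'created',
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def set_state(job_id, state):
        # every transition is written down, so "what is running right now?"
        # is a query, not a guess from the queue's length
        assert state in STATES
        db.execute(
            "UPDATE jobs SET state = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?",
            (state, job_id),
        )
        db.commit()

    def run(job_id, work):
        # mark the job before and after doing the work, so a crash leaves
        # behind a row stuck in 'processing' rather than a message that
        # silently vanished
        set_state(job_id, "processing")
        try:
            work()
            set_state(job_id, "complete")
        except Exception:
            set_state(job_id, "failed")
            raise

A row stuck in 'processing' past some deadline is exactly the kind of error that stands out, rather than hiding inside a queue.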
Unfortunately, recovery is relatively easy to imagine, but not necessarily straightforward to implement. This is why some people opt to use a replicated log instead of a message broker. If you've never used a replicated log, imagine an append-only database table without a primary key, or a text file with backups, and you're close. Or imagine a message broker, but instead of enqueue and dequeue, you append to the log or read from it.

Like a queue, a replicated log can be used in a fire-and-forget fashion, with not-so-great consequences. Just like before, chaos will ensue as concepts like rate-limiting, head-of-line blocking, and the end-to-end principle are slowly contended with—if you use a replicated log like a queue, it will fail like a queue. Unlike a queue, a replicated log can aid recovery. Every consumer sees the same log entries, in the same order, so it's possible to recover by replaying the log, or by catching up on old entries. In some ways it's more like using temporary files instead of a pipeline to join things together, and the strategies for recovery overlap with temporary files, too—like partitioning the log so that restarts aren't as expensive.

Like temporary files, a replicated log can aid in recovery, but only to a certain point. A consumer will see the same messages, in the same order, but if an entry gets dropped before reaching the log, or if entries arrive in the wrong order, some, or potentially all, hell can break loose. You can't just fire-and-forget into a log, not over a network. Although a replicated log is ordered, it will preserve whatever ordering it receives. This isn't always a problem. Some logs are used to capture analytic data, or are fed into aggregators, so the impact of a few missing or out-of-order entries is relatively low—a few missing entries might as well be called high-volume random sampling and declared a non-issue. For other logs, missing entries could cause untold misery. Recovering from missing entries might involve rebuilding the entire log from scratch. If you're using a replicated log for replication, you probably care quite a lot about the order of log entries.

Like before, you can't ignore errors—you only make things expensive to recover from. Handling errors like out-of-order or missing log entries means being able to work out when they have occurred. This is more difficult than you might imagine. Take two services, a primary and a secondary, both with databases, and imagine using a replicated log to copy changes from one to the other. It doesn't seem too difficult at first. Every time the primary service makes a change to the database, it writes it to the log. The secondary reads from the log, and updates its database. If the primary service is a single process, it's pretty easy to ensure that every message is sent in the right order. When there's more than one writer, things can get rather involved.

Now, you could switch things around—write to the log first, then apply the changes to the database, or use the database's log directly—and avoid the problem altogether, but these aren't always an option. Sometimes you're forced to handle the problem of ordering the entries yourself. In other words, you'll need to order the messages before writing them to the log. You could let something else provide the order, but you'd be mistaken if you think a timestamp would help. Clocks move forwards and backwards, and this can cause all sorts of headaches.
One of the most frustrating problems with timestamps is 'doomstones': a service with a clock wandering far into the future deletes a key, issuing a deletion event with a far-future timestamp. All subsequent operations on that key get silently dropped until the deletion event is cleared. The other problem with timestamps is that if you have two entries, one after the other, you can't tell if any entries came between them. Things like Hybrid Logical Clocks, or even atomic clocks, can help to narrow down clock drift, but only so much. You can only narrow down the window of uncertainty; there's still some clock skew. Again, clocks will go forwards and backwards—timestamps are terrible for ordering things precisely.

In practice you need explicit version numbers—1, 2, 3, etc.—or a unique identifier for each version of each entry, and a link back to the record being updated, to order messages. With a version number, messages can be reordered, missing messages can be detected, and both can be recovered from, although managing and assigning those version numbers can be quite difficult in practice. Timestamps are still useful, if only for putting things in a human perspective, but without a version number, it's impossible to know what precise order things happened in—or that no steps are missing, either.

You don't get to ignore errors, but sometimes the error handling code isn't that obvious. Using version numbers, or even timestamps, falls under building a plan for recovery: building something that can continue to operate in the presence of failure. Unfortunately, building something that works when other parts fail is one of the more challenging parts of software engineering. It doesn't help that doing the same thing in the same order is so difficult that people use terms like causality and determinism to make the point sink in. You don't get to ignore errors, but no one said it was going to be easy.

Although things like replicated logs, message brokers, or even unix pipes can allow you to build prototypes—clear demonstrations of how your software works—they do not free you from the burden of handling errors. You can't avoid error handling code, not at scale. The secret to error handling at scale isn't giving up, ignoring the problem, or even trying again—it is structuring a program for recovery, making errors stand out, allowing other parts of the program to make decisions. Techniques like fail-fast, crash-only software, and process supervision, but also things like clever use of version numbers, and occasionally the odd bit of statelessness or idempotence. What these all have in common is that they're all methods of recovery.

Recovery is the secret to handling errors. Especially at scale. Giving up early so other things have a chance, continuing on so other things can catch up, restarting from a clean state to try again, saving progress so that things do not have to be repeated. That, or put it off for a while. Buy a lot of disks, hire a few SREs, and add another graph to the dashboard. The problem with scale is that you can't approach it with optimism. As the system grows, it needs redundancy, or the ability to function in the presence of partial errors and intermittent faults. Humans can only fill in so many gaps. Staff turnover is the worst form of technical debt.
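Returning to version numbers for a moment, the detection side is simple enough to sketch. Assuming every entry carries an explicit, contiguous version per record (the field names here are invented), a consumer can spot gaps and stale redeliveries as they arrive:

    class MissingEntry(Exception):
        pass

    def apply(entry):
        ...  # hypothetical: actually update the record in question

    def apply_entries(entries, last_seen):
        """Apply log entries, flagging gaps and stale deliveries.

        last_seen maps each record id to the highest version applied so far.
        """
        for entry in entries:
            expected = last_seen.get(entry["record_id"], 0) + 1
            if entry["version"] < expected:
                continue  # an old entry, redelivered: safe to skip
            if entry["version"] > expected:
                # a gap: something is missing, so stop and recover
                # (refetch, replay the log, or wake someone up)
                raise MissingEntry(entry["record_id"], expected)
            apply(entry)
            last_seen[entry["record_id"]] = entry["version"]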
Writing robust software means building systems that can exist in a state of partial failure (like incomplete output), and writing resilient software means building systems that are always in a state of recovery (like restarting)—neither comes from engineering the happy path of your software. When you ignore errors, you transform them into mysteries to solve. Something or someone else will have to handle them, and then have to recover from them—usually by hand, and almost always at great expense. The problem with avoiding error handling in code is that you're only avoiding automating it.

In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery. That, or burnout. Lots of burnout.

You don't get to ignore errors.

2019-01-08 What the hell is REST, Anyway?

Originating in a thesis, REST is an attempt to explain what makes the browser distinct from other networked applications. You might be able to imagine a few reasons why: there's tabs, there's a back button too, but what makes the browser unique is that it can be used to check email without knowing anything about POP3 or IMAP. Although every piece of software inevitably grows to check email, the browser is unique in its ability to work with lots of different services without configuration—this is what REST is all about.

HTML only has links and forms, but it's enough to build incredibly complex applications. HTTP only has GET and POST, but that's enough to know when to cache or retry things. HTTP uses URLs, so it's easy to route messages to different places too. Unlike almost every other networked application, the browser is remarkably interoperable. The thesis was an attempt to explain how that came to be, and called the resulting style REST.

REST is about having a way to describe services (HTML), to identify them (URLs), and to talk to them (HTTP), where you can cache, proxy, or reroute messages, and break up large or long requests into smaller interlinked ones too. How REST does this isn't exactly clear. The thesis breaks down the design of the web into a number of constraints—Client-Server, Stateless, Caching, Uniform Interface, Layering, and Code-on-Demand—but it is all too easy to follow them and end up with something that can't be used in a browser.

REST without a browser means little more than "I have no idea what I am doing, but I think it is better than what you are doing", or worse, "we made our API look like a database table, we don't know why". Instead of interoperable tools, we have arguments about PUT or POST, endless debates over how a URL should look, and somehow we always end up with a CRUD API and absolutely no browsing. There are some examples of browsers that don't use HTML, but many of these HTML replacements are for describing collections, and as a result most of those browsers resemble file browsing more than web browsing. It's not to say you need a back and a next button, but it should be possible for one program to work with a variety of services.

For an RPC service, you might think about a curl-like tool for sending requests to a service:

    $ rpctl http://service/ describe MyService
    methods: ...., my_method

    $ rpctl http://service/ describe MyService.my_method
    arguments: name, age

    $ rpctl http://service/ call MyService.my_method --name="James" --age=31
    Result:
        message: "Hello, James!"

You can also imagine a single command line tool for a database that might resemble kubectl:

    $ dbctl http://service/ list ModelName --where-age=23
    $ dbctl http://service/ create ModelName --name=Sam --age=23
    $ ...

Now imagine using the same command line tool for both, and using the same command line tool for every service—that's the point of REST. Almost.

    $ apictl call MyService:my_method --arg=...
    $ apictl delete MyModel --where-arg=...
    $ apictl tail MyContainers:logs --where ...
    $ apictl help MyService

You could implement a command line tool like this without going through the hassle of reading a thesis. You could download a schema in advance, or load it at runtime, and use it to create requests and parse responses, but REST is quite a bit more than being able to reflect, or describe a service at runtime.
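Here is a rough sketch of that schema-at-runtime approach. Everything in it, from the endpoint layout to the schema format, is an assumption for illustration:

    import json
    import urllib.request

    def fetch_json(url, payload=None):
        # GET when there's no payload, POST when there is
        data = json.dumps(payload).encode() if payload is not None else None
        with urllib.request.urlopen(url, data=data) as response:
            return json.load(response)

    def call(base_url, service, method, **arguments):
        # load the service's description at runtime...
        schema = fetch_json(f"{base_url}/describe/{service}")
        expected = set(schema["methods"][method]["arguments"])
        # ...use it to validate the request before sending it...
        if set(arguments) != expected:
            raise TypeError(f"{method} expects {sorted(expected)}")
        # ...and then make the call
        return fetch_json(f"{base_url}/call/{service}/{method}", arguments)

    # call("http://service", "MyService", "my_method", name="James", age=31)

Reflection gets you this far and no further: the tool can now build requests, but it can't break a slow call into smaller ones, and nothing here helps a cache or a proxy along the way.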
The REST constraints require using a common format for the contents of messages, so that the command line tool doesn't need configuring, and require sending the messages in a way that allows you to proxy, cache, or reroute them without fully understanding their contents. REST is also a way to break long or large messages up into smaller ones linked together—something far more than just learning what commands can be sent at runtime; it allows a response to explain how to fetch the next part in sequence.

To demonstrate, take an RPC service with a long running method call:

    class MyService(Service):
        @rpc()
        def long_running_call(self, args: str) -> bool:
            id = third_party.start_process(args)
            while third_party.wait(id):
                pass
            return third_party.is_success(id)

When a response is too big, you have to break it down into smaller responses. When a method is slow, you have to break it down into one method to start the process, and another method to check if it's finished.

    class MyService(Service):
        @rpc()
        def start_long_running_call(self, args: str) -> str:
            ...

        @rpc()
        def wait_for_long_running_call(self, key: str) -> bool:
            ...

In some frameworks you can use a streaming API instead, but replacing a procedure call with streaming involves adding heartbeat messages, timeouts, and recovery, so many developers opt for polling instead—breaking the single request into two, like the example above. Both approaches require changing the client and the server code, and if another method needs breaking up, you have to change all of the code again.

REST offers a different approach: return a response that describes how to fetch another request, much like an HTTP redirect. You'd handle them in a client library much like an HTTP client handles redirects, too.

    def long_running_call(self, args: str) -> Result[bool]:
        key = third_party.start_process(args)
        return Future("MyService.wait_for_long_running_call", {"key": key})

    def wait_for_long_running_call(self, key: str) -> Result[bool]:
        if not third_party.wait(key):
            return third_party.is_success(key)
        else:
            return Future("MyService.wait_for_long_running_call", {"key": key})

    def fetch(request):
        response = make_api_call(request)
        while response.kind == 'Future':
            request = make_next_request(response.method_name, response.args)
            response = make_api_call(request)
        return response

For the more operations minded, imagine I call time.sleep() inside the client, and maybe imagine the Future response has a duration inside. The neat trick is that you can change the amount the client sleeps by changing the value returned by the server.

The real point is that by allowing a response to describe the next request in sequence, we've skipped over the problems of the other two approaches—we only need to implement the code once in the client. When a different method needs breaking up, you can return a Future and get on with your life. In some ways it's as if you're returning a callback to the client, something the client knows how to run to produce a request. With Future objects, it's more like returning values for a template. This approach works for breaking up a large response into smaller ones too, like iterating through a long list of results.
Pagination often looks something like this in an RPC system:

    cursor = rpc.open_cursor()
    output = []
    while cursor:
        output.append(cursor.values)
        cursor = rpc.move_cursor(cursor.id)

Or something like this:

    start = 0
    output = []
    while True:
        out = rpc.get_values(start, batch=30)
        output.append(out)
        start += len(out)
        if len(out) < 30:
            break

The first pagination example stores the state on the server, and gives the client an id to use in subsequent requests. The second stores the state on the client, and constructs the correct request to make from that state. There are advantages and disadvantages—storing the state on the client means the server does less work, but it involves manually threading state around and makes for a much harder API to use.

Like before, REST offers a third approach. The server can return a Cursor response (much like a Future) with a set of values and a request message to send for the next chunk.

    class ValueService(Service):
        @rpc()
        def get_values(self):
            return Cursor("ValueService.get_cursor", {"start": 0, "batch": 30}, [])

        @rpc()
        def get_cursor(self, start, batch):
            ...
            return Cursor("ValueService.get_cursor", {"start": start, "batch": batch}, values)

The client can handle a Cursor response, building up a list:

    cursor = rpc.get_values()
    output = []
    while cursor:
        output.append(cursor.values)
        cursor = cursor.move_next()

It's somewhere between the two earlier examples of pagination—instead of managing the state on the server and sending back an identifier, or managing the state on the client and carefully constructing requests, the state is sent back and forth between them. As a result, the server can change details between requests! If a server wants to, it can return a Cursor with a smaller set of values, and the client will just make more requests to get all of them—but without the server having to track the state of every cursor open on the service.

This idea of linking messages together isn't just limited to long polling or pagination—if you can describe services at runtime, why not return ones with some of the arguments filled in? A Service can contain state to pass into methods, too. To demonstrate how, and why, you might do this, imagine some worker that connects to a service, processes work, and uploads the results. The first attempt at the server code might look like this:

    class WorkerApi(Service):
        def register_worker(self, name: str) -> str:
            ...
        def lock_queue(self, worker_id: str, queue_name: str) -> str:
            ...
        def take_from_queue(self, worker_id: str, queue_name, queue_lock: str):
            ...
        def upload_result(self, worker_id, queue_name, queue_lock, next, result):
            ...
        def unlock_queue(self, worker_id, queue_name, queue_lock):
            ...
        def exit_worker(self, worker_id):
            ...

Unfortunately, the client code looks much nastier:

    worker_id = rpc.register_worker(my_name)
    lock = rpc.lock_queue(worker_id, queue_name)
    while True:
        next = rpc.take_from_queue(worker_id, queue_name, lock)
        if next:
            result = process(next)
            rpc.upload_result(worker_id, queue_name, lock, next, result)
        else:
            break
    rpc.unlock_queue(worker_id, queue_name, lock)
    rpc.exit_worker(worker_id)

Each method requires a handful of parameters relating to the current session open with the service. They aren't strictly necessary—they do make debugging a system far easier—but the problem of having to chain together requests might be a little familiar. What we'd rather do is use some API where the state between requests is handled for us.
The traditional way to achieve this is to build these wrappers by hand, creating special code on the client to assemble the requests. With REST, we can define a Service that has methods like before, but also contains a little bit of state, and return it from other method calls:

    class WorkerApi(Service):
        def register(self, worker_id):
            return Lease(worker_id)

    class Lease(Service):
        worker_id: str

        @rpc()
        def lock_queue(self, name):
            ...
            return Queue(self.worker_id, name, lock)

        @rpc()
        def expire(self):
            ...

    class Queue(Service):
        name: str
        lock: str
        worker_id: str

        @rpc()
        def get_task(self):
            return Task(.., name, lock, worker_id)

        @rpc()
        def unlock(self):
            ...

    class Task(Service):
        task_id: str
        worker_id: str

        @rpc()
        def upload(self, out):
            mark_done(self.task_id, self.worker_id, out)

Instead of one service, we now have four. Instead of returning identifiers to pass back in, we return a Service with those values filled in for us. As a result, the client code looks a lot nicer—you can even add new parameters behind the scenes.

    lease = rpc.register_worker(my_name)
    queue = lease.lock_queue(queue_name)
    while True:
        next = queue.take_next()
        if next:
            next.upload_result(process(next))
        else:
            break
    queue.unlock()
    lease.expire()

Although the Future looked like a callback, returning a Service feels like returning an object. This is the power of self-description—unlike reflection, where you specify in advance every request that can be made, each response has the opportunity to define a new parameterised request. It's this navigation through several linked responses that distinguishes a regular command line tool from one that browses—and it's where REST gets its name: the passing back and forth of requests between server and client is the 'state-transfer' part of REST, and using a common Result or Cursor object is the 'representational' part.

A RESTful system is more than just these combined, though—along with a reusable browser, you have reusable proxies too. In the same way that messages describe things to the client, they describe things to any middleware between client and server: using GET, POST, and distinct URLs is what allows caches to work across services, and using a stateless protocol (HTTP) is what allows a proxy or load balancer to work so effortlessly. The trick with REST is that despite HTTP being stateless, and despite HTTP being simple, you can build complex, stateful services by threading the state invisibly between smaller messages—transferring a representation of state back and forth between client and server.

Although the point of REST is to build a browser, the point is to use self-description and state-transfer to allow heavy amounts of interoperation—not just a reusable client, but reusable proxies, caches, and load balancers too. Going back to the constraints (Client-Server, Stateless, Caching, Uniform Interface, Layering and Code-on-Demand), you might be able to see how these things fit together to achieve those goals.

The first, Client-Server, feels a little obvious, but sets the background: a server waits for requests from a client, and issues responses. The second, Stateless, is a little more confusing. If an HTTP proxy had to keep track of how requests link together, it would involve a lot more memory and processing. The point of the stateless constraint is that to a proxy, each request stands alone. The point is also that any stateful interactions should be handled by linking messages together.
Caching is the third constraint: labelling whether a response can be cached (HTTP uses headers on the response), or whether a request can be resent (using GET or POST). The fourth constraint, Uniform Interface, is the most difficult, so we'll cover it last. Layering is the fifth, and it roughly means "you can proxy it".

Code-on-Demand is the final, optional, and most overlooked constraint. The use of Cursors, Futures, or parameterised Services is a step towards it—the idea that despite using a simple means to describe services or responses, a response can define new requests to send. Code-on-Demand takes that further, and imagines passing back code, rather than templates and values, to assemble.

With the other constraints handled, it's time for Uniform Interface. Like Stateless, this constraint is more about HTTP than it is about the system atop, and it is frequently misapplied. This is the reason why people keep making database APIs and calling them RESTful, but the constraint has nothing to do with CRUD. The constraint is broken down into four ideas, and we'll take them one by one: self-descriptive messages, identification of resources, manipulation of resources through representations, and hypermedia as the engine of application state.

Self-description is at the heart of REST, and this sub-constraint fills in the gaps between the Layering, Caching, and Stateless constraints. Sort-of. It covers using GET and POST to indicate to a proxy how to handle things, and covers how responses indicate whether they can be cached, too. It also means using a content-type header. The next sub-constraint, identification, means using different URLs for different services. In the RPC examples above, it means having a common, standard way to address a service or method, as well as one with parameters.

This ties into the next sub-constraint, which is about using standard representations across services—this doesn't mean using special formats for every API request, but using the same underlying language to describe every response. In other words, the web works because everyone uses HTML. Uniformity so far isn't too difficult: use HTTP (self-description), URLs (identification) and HTML (manipulation through representations), but it's the last sub-constraint that causes most of the headaches: hypermedia as the engine of application state.

This is a fancy way of talking about how large or long requests can be broken up into interlinked messages, or how a number of smaller requests can be threaded together, passing the state from one to the next. Hypermedia refers to things like the Cursor, Future, or Service objects; application state is the details passed around as hidden arguments; and being the 'engine' means using them to tie the whole system together. Together they form the basis of the Representational State-Transfer style.

More than half of these constraints can be satisfied by just using HTTP, and the other half only really help when you're implementing a browser, but there are still a few more tricks that you can do with REST. Although a RESTful system doesn't have to offer a database-like interface, it can. Along with Service or Cursor, you could imagine Model or Rows objects to return, but you should expect a little more from a RESTful system than just create, read, update and delete. With REST, you can do things like inlining: along with returning a request to make, a server can embed the result inside. A client can skip the network call and work directly on the inlined response.
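As a sketch of how inlining might look, reusing the hypothetical Cursor from earlier, a response carries an optional embedded result alongside the request it describes:

    class Cursor:
        def __init__(self, method, args, values, inlined=None):
            self.method = method    # the next request to make
            self.args = args
            self.values = values
            self.inlined = inlined  # the next response, embedded (optional)

    def move_next(cursor, make_api_call):
        # if the server embedded the next response, skip the network call
        if cursor.inlined is not None:
            return cursor.inlined
        return make_api_call(cursor.method, cursor.args)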
A server can even make this choice at runtime, opting to embed the result if the message is small enough. Finally, with a RESTful system, you should be able to offer things in different encodings, depending on what the client asks for—even HTML. In other words, if your framework can do all of these things for you, offering a web interface isn't too much of a stretch. If you can build a reusable command line tool, generating a web interface isn't too difficult, and at least this time you don't have to implement a browser from scratch.

If you now find yourself understanding REST, I'm sorry. You're now cursed. Like a cross between the Greek myths of Cassandra and Prometheus, you will be forced to explain the ideas over and over again, to no avail. The terminology has been utterly destroyed, to the point where it has less meaning than 'Agile'.

Even so, the underlying ideas of interoperability, self-description, and interlinked requests are surprisingly useful—you can break up large or slow responses, you can browse or even parameterise services, and you can do it in a way that lets you reuse tools across services too. Ideally someone else will have done it for you, and, like with a web browser, you don't really care how RESTful it is, but how useful it is. Your framework should handle almost all of this for you, and you shouldn't have to care about the details.

If anything, REST is about exposing just enough detail—proxies and load balancers only care about the URL and GET or POST; the underlying client libraries only have to handle something like HTML, rather than a unique and special format for every service. REST is fundamentally about letting people use a service without having to know all the details ahead of time, which might be how we got into this mess in the first place.

2018-08-05 Repeat yourself, do more than one thing, and rewrite everything

If you ask a programmer for advice—a terrible idea—they might tell you something like the following: don't repeat yourself; programs should do one thing and one thing well; never rewrite your code from scratch, ever!

Following "Don't Repeat Yourself" might lead you to a function with four boolean flags, and a matrix of behaviours to carefully navigate when changing the code. Splitting things up into simple units can lead to awkward composition and struggling to coordinate cross-cutting changes. Avoiding rewrites means they're often left so late that they have no chance of succeeding.

The advice isn't inherently bad—although there is good intent, following it to the letter can create more problems than it promises to solve. Sometimes the best way to follow an adage is to do the exact opposite: embrace feature switches and constantly rewrite your code, pull things together to make coordination between them easier to manage, and repeat yourself to avoid implementing everything in one function. This advice is much harder to follow, unfortunately.

Repeat yourself to find abstractions.

"Don't Repeat Yourself" is almost a truism—if anything, the point of programming is to avoid work. No-one enjoys writing boilerplate. The more straightforward it is to write, the duller it is to summon into a text editor. Programmers are already tired of writing eight exact copies of the same code before they even have to do so. You don't need to convince programmers not to repeat themselves, but you do need to teach them how and when to avoid it.

"Don't Repeat Yourself" often gets interpreted as "don't copy-paste", or avoiding repeated code within the codebase, but the best form of avoiding repetition is avoiding reimplementing what exists elsewhere—and thankfully most of us already do! Almost every web application leans heavily on an operating system, a database, and a variety of other lumps of code to get the job done. A modern website reuses millions of lines of code without even trying.

Unfortunately, programmers love to avoid repetition, and "Don't Repeat Yourself" turns into "Always Use an Abstraction". By an abstraction, I mean two interlinked things: an idea we can think and reason about, and the way in which we model it inside our programming languages. Abstractions are a way of repeating yourself, so that you can change multiple parts of your program in one place. Abstractions allow you to manage cross-cutting changes across your system, or to share behaviours within it.

The problem with always using an abstraction is that you're preemptively guessing which parts of the codebase need to change together. "Don't Repeat Yourself" will lead to a rigid, tightly coupled mess of code. Repeating yourself is the best way to discover which abstractions, if any, you actually need. As Sandi Metz put it, "duplication is far cheaper than the wrong abstraction".

You can't really write a reusable abstraction up front. Most successful libraries or frameworks are extracted from a larger working system, rather than being created from scratch. If you haven't built something useful with your library yet, it is unlikely anyone else will. Code reuse isn't a good excuse to avoid duplicating code, and writing reusable code inside your project is often a form of preemptive optimisation. When it comes to repeating yourself inside your own project, the point isn't to be able to reuse code, but rather to make coordinated changes.
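To make the four-boolean-flags failure mode from the opening concrete, here is a contrived sketch; the functions and flags are invented for illustration:

    import html
    import json

    # one "reusable" function: four flags, sixteen combinations to reason about
    def render(value, as_json=False, pretty=False, escape=True, legacy=False):
        if as_json:
            return json.dumps(value, indent=2 if pretty else None)
        text = str(value)
        if escape:
            text = html.escape(text)
        if legacy:
            text = "<pre>" + text + "</pre>"  # does legacy apply to json? who knows
        return text

    # two repetitive functions: some duplication, but every change is local
    def render_html(value, escape=True):
        return html.escape(str(value)) if escape else str(value)

    def render_json(value, pretty=False):
        return json.dumps(value, indent=2 if pretty else None)

The duplication is the point: when the two outputs inevitably drift apart, neither function has to grow another flag.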
Use abstractions when you're sure about coupling things together, rather than for opportunistic or accidental code reuse—it's ok to repeat yourself to find out when. Repeat yourself, but don't repeat other people's hard work. Repeat yourself: duplicate to find the right abstraction first, then deduplicate to implement it.

With "Don't Repeat Yourself", some insist that it isn't about avoiding duplication of code, but about avoiding duplication of functionality or duplication of responsibility. This is more popularly known as the "Single Responsibility Principle", and it's just as easily mishandled.

Gather responsibilities to simplify interactions between them

When it comes to breaking a larger service into smaller pieces, one idea is that each piece should only do one thing within the system—do one thing, and do it well—and the hope is that by following this rule, changes and maintenance become easier. It works out well in the small: reusing variables for different purposes is an ever-present source of bugs. It's less successful elsewhere: although one class might do two things in a rather nasty way, disentangling it isn't of much benefit when you end up with two nasty classes with a far more complex mess of wiring between them.

The only real difference between pushing things together and pulling them apart is that some changes become easier to perform than others. The choice between a monolith and microservices is another example of this—the choice between developing and deploying a single service, or composing things out of smaller, independently developed services. The big difference between them is that cross-cutting change is easier in one, and local change is easier in the other. Which one works best for a team often depends more on environmental factors than on the specific changes being made.

Although a monolith can be painful when new features need to be added, and microservices can be painful when coordination is required, a monolith can run smoothly with feature flags and short-lived branches, and microservices work well when deployment is easy and heavily automated. Even a monolith can be decomposed internally into microservices, albeit in a single repository and deployed as a whole. Everything can be broken into smaller parts—the trick is knowing when it's an advantage to do so.

Modularity is more than reducing things to their smallest parts.

Invoking the 'single responsibility principle', programmers have been known to brutally decompose software into a terrifyingly large number of small interlocking pieces—a craft rarely seen outside of obscenely expensive watches, or bash. The traditional UNIX command line is a showcase of small components that each do exactly one function, and it can be a challenge to discover which one you need, and in which way to hold it, to get the job done. Piping things into awk '{print $2}' is almost a rite of passage.

Another example of the single responsibility principle is git. Although you can use git checkout to do six different things to the repository, they all use similar operations internally. Despite having singular functionality, components can be used in very different ways.

A layer of small components with no shared features creates a need for a layer above where those features overlap, and if it's absent, the user will create one, with bash aliases, scripts, or even spreadsheets to copy-paste from. Even adding this layer might not help you: git already has a notion of user-facing and automation-facing commands, and the UI is still a mess.
It's always easier to add a new flag to an existing command than it is to duplicate the command and maintain it in parallel. Similarly, functions gain boolean flags and classes gain new methods as the needs of the codebase change. In trying to avoid duplication and keep code together, we end up entangling things. Although components can be created with a single responsibility, over time their responsibilities will change and interact in new and unexpected ways. What a module is currently responsible for within a system does not necessarily correlate to how it will grow.

Modularity is about limiting the options for growth

A given module often gets changed because it is the easiest module to change, rather than because it's the best place for the change to be made. In the end, what defines a module is what pieces of the system it will never be responsible for, rather than what it is currently responsible for. When a unit has no rules about what code cannot be included, it will eventually contain larger and larger amounts of the system. This is eternally true of every module named 'util', and it is why almost everything in a Model-View-Controller system ends up in the controller.

In theory, Model-View-Controller is about three interlocking units of code: one for the database, another for the UI, and one for the glue between them. In practice, Model-View-Controller resembles a monolith with two distinct subsystems—one for the database code, another for the UI, both nestled inside the controller. The purpose of MVC isn't just to keep all the database code in one place, but also to keep it away from the frontend code. The data we have, and how we want to view it, will change over time independently of the frontend code.

Although code reuse is good and smaller components are good, they should be the result of other desired changes. Both are tradeoffs, introducing coupling through a lack of redundancy, or complexity in how things are composed. Decomposing things into smaller parts, or unifying them, is neither universally good nor bad for the codebase, and largely depends on what changes come afterwards.

In the same way that abstraction isn't about code reuse, but about coupling things for change, modularity isn't about grouping similar things together by function, but about working out how to keep things apart and limiting coordination across the codebase. This means recognising which bits are slightly more entangled than others, knowing which pieces need to talk to each other, which need to share resources, what shares responsibilities, and, most importantly, what external constraints are in place and which way they are moving. In the end, it's about optimising for those changes—and this is rarely achieved by aiming for reusable code, as sometimes handling changes means rewriting everything.

Rewrite Everything

Usually, a rewrite is only a practical option when it's the only option left. Technical debt—or code the seniors wrote that we can't be rude about—accrues until all change becomes hazardous. It is only when the system is at breaking point that a rewrite is even considered an option. Sometimes the reasons can be less dramatic: an API is being switched off, a startup has taken a beautiful journey, or there's a new fashion in town and orders from the top to chase it. Rewrites can happen to appease a programmer too—rewarding good teamwork with a solo project.

The reason rewrites are so risky in practice is that replacing one working system with another is rarely an overnight change.
We rarely understand what the previous system did—many of its properties are accidental in nature. Documentation is scarce, tests are ornamental, and interfaces are organic in nature, stubbornly locking behaviours in place. If migrating to the replacement depends on switching over everything at once, make sure you've booked a holiday during the transition, well in advance.

Successful rewrites plan for migration to and from the old system, plan to ease in the existing load, and plan to handle things being in one or both places at once. Both systems are continuously maintained until one of them can be decommissioned. A slow, careful migration is the only option that reliably works on larger systems.

To succeed, you have to start with the hard problems first—often performance related—but it can involve dealing with the most difficult customer, or the biggest customer or user of the system, too. Rewrites must be driven by triage, reducing the problem in scope to something that can be effectively improved while being guided by the larger problems at hand. If a replacement isn't doing something useful after three months, odds are it will never do anything useful. The longer it takes to run a replacement system in production, the longer it takes to find bugs. Unfortunately, migrations get pushed back in the name of feature development.

A new project has the most room for feature bloat—this is known as the second-system effect. The second system is the canonical doomed rewrite, one where numerous features are planned, not enough are implemented, and what has been written rarely works reliably. It's similar to writing a game engine without a game to guide decisions, or a framework without a product inside. The resulting code is an unconstrained mess that is barely fit for its purpose.

The reason we say "never rewrite code" is that we leave rewrites too late, demand too much, and expect them to work immediately. It's more important to never rewrite in a hurry than to never rewrite at all.

null is true, everything is permitted

The problem with following advice to the letter is that it rarely works in practice. The problem with following it at all costs is that eventually we cannot afford to do so.

It isn't "Don't Repeat Yourself", but "some redundancy is healthy, some isn't", and use abstractions when you're sure you want to couple things together. It isn't "each thing has a unique component", or any other variant of the single responsibility principle, but "decoupling parts into smaller pieces is often worth it if the interfaces between them are simple, and keep the fast-changing and tricky-to-implement bits away from each other". It's never "don't rewrite!", but "don't abandon what works". Build a plan for migration, maintain in parallel, then decommission, eventually. In high-growth situations you can probably put off decommissioning, and possibly even migrations.

When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like "Don't Repeat Yourself" are about making a tradeoff, usually one that's good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems.
In a larger system, it's much harder to understand the consequences of our design choices—in many cases the consequences are only discovered far, far too late in the process, and it is only by throwing more engineers into the pit that there is any hope of completion. In the end, we call our good decisions 'clean code' and our bad decisions 'technical debt', despite following the same rules and practices to get there.

2018-05-14 Write code that's easy to delete, and easy to debug too.

Debuggable code is code that doesn't outsmart you. Some code is a little harder to debug than other code: code with hidden behaviour, poor error handling, ambiguity, too little or too much structure, or code that's in the middle of being changed. On a large enough project, you'll eventually bump into code that you don't understand. On an old enough project, you'll discover code you forgot about writing—and if it weren't for the commit logs, you'd swear it was written by someone else.

As a project grows in size, it becomes harder to remember what each piece of code does, and harder still when the code doesn't do what it is supposed to. When it comes to changing code you don't understand, you're forced to learn about it the hard way: debugging. Writing code that's easy to debug begins with realising you won't remember anything about the code later.

Rule 0: Good code has obvious faults.

Many used-methodology salesmen have argued that the way to write understandable code is to write clean code. The problem is that "clean" is highly contextual in meaning. Clean code can be hardcoded into a system, and sometimes a dirty hack can be written in a way that's easy to turn off. Sometimes the code is clean because the filth has been pushed elsewhere. Good code isn't necessarily clean code.

Code being clean or dirty is more about how much pride, or embarrassment, the developer takes in the code, rather than how easy it has been to maintain or change. Instead of clean, we want boring code where change is obvious—I've found it easier to get people to contribute to a code base when the low-hanging fruit has been left around for others to collect. The best code might be anything you can look at and quickly learn things about:

Code that doesn't try to make an ugly problem look good, or a boring problem look interesting.

Code where the faults are obvious and the behaviour is clear, rather than code with no obvious faults and subtle behaviours.

Code that documents where it falls short of perfect, rather than aiming to be perfect.

Code with behaviour so obvious that any developer can imagine countless different ways to go about changing it.

Sometimes, code is just nasty as fuck, and any attempt to clean it up leaves you in a worse state. Writing clean code without understanding the consequences of your actions might as well be a summoning ritual for maintainable code. This is not to say that clean code is bad, but sometimes the practice of clean coding is more akin to sweeping problems under the rug. Debuggable code isn't necessarily clean, and code that's littered with checks or error handling rarely makes for pleasant reading.

Rule 1: The computer is always on fire.

The computer is on fire, and the program crashed the last time it ran. The first thing a program should do is ensure that it is starting out from a known, good, safe state before trying to get any work done. Sometimes there isn't a clean copy of the state because the user deleted it, or upgraded their computer. The program crashed the last time it ran and, rather paradoxically, the program is being run for the first time too. For example, when reading and writing program state to a file, a number of problems can happen:

The file is missing

The file is corrupt

The file is an older version, or a newer one

The last change to the file is unfinished

The filesystem was lying to you

These are not new problems, and databases have been dealing with them since the dawn of time (1970-01-01). Using something like SQLite will handle many of these problems for you, but if the program crashed the last time it ran, the code might be run with the wrong data, or in the wrong way, too. With scheduled programs, for example, you can guarantee that the following accidents will occur:

It gets run twice in the same hour because of daylight savings time.

It gets run twice because an operator forgot it had already been run.

It will miss an hour, due to the machine running out of disk, or mysterious cloud networking issues.

It will take longer than an hour to run and may delay subsequent invocations of the program.

It will be run with the wrong time of day

It will inevitably be run close to a boundary, like midnight, the end of the month, or the end of the year, and fail due to an arithmetic error.

Writing robust software begins with writing software that assumes it crashed the last time it ran, and that crashes whenever it doesn't know the right thing to do. The best thing about throwing an exception, over leaving a comment like "This Shouldn't Happen", is that when it inevitably does happen, you get a head start on debugging your code. You don't have to be able to recover from these problems either—it's enough to let the program give up and not make things any worse. Small checks that raise an exception can save weeks of tracing through logs, and a simple lock file can save hours of restoring from backup.

Code that's easy to debug is code that checks to see if things are correct before doing what was asked of it, code that makes it easy to go back to a known good state and try again, and code that has layers of defence to force errors to surface as early as possible.

Rule 2: Your program is at war with itself.

"Google's biggest DoS attacks come from ourselves—because we have really big systems—although every now and then someone will show up and try to give us a run for our money, but really we're more capable of hammering ourselves into the ground than anybody else is. This is true for all systems." — Astrid Atkinson, Engineering for the Long Game

The software always crashed the last time it ran, and now it is always out of cpu, out of memory, and out of disk too. All of the workers are hammering an empty queue, everyone is retrying a failed request that's long expired, and all of the servers have paused for garbage collection at the same time. Not only is the system broken, it is constantly trying to break itself.

Even checking if the system is actually running can be quite difficult. It can be quite easy to implement something that checks if the server is running, but not if it is handling requests. Unless you check the uptime, it is possible that the program is crashing in between every check. Health checks can trigger bugs too: I have managed to write health checks that crashed the systems they were meant to protect. On two separate occasions, three months apart.

In software, writing code to handle errors will inevitably lead to discovering more errors to handle, many of them caused by the error handling itself. Similarly, performance optimisations can often be the cause of bottlenecks in the system—making an app that's pleasant to use in one tab can make an app that's painful to use when you have twenty copies of it running. Another example is where a worker in a pipeline runs too fast, and exhausts the available memory before the next part has a chance to catch up. If you'd rather a car metaphor: traffic jams. Speeding up is what creates them, and it can be seen in the way congestion moves back through the traffic.

Optimisations can create systems that fail under high or heavy load, often in mysterious ways. In other words: the faster you make it, the harder it will be pushed, and if you don't allow your system to push back, even a little, don't be surprised if it snaps.

Back-pressure is one form of feedback within a system, and a program that is easy to debug is one where the user is involved in the feedback loop, having insight into all the behaviours of a system: the accidental, the intentional, the desired, and the unwanted too. Debuggable code is easy to inspect, where you can watch and understand the changes happening within.
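Back-pressure is easy to demonstrate in miniature. Here is a small sketch using Python's standard library, where a bounded queue makes a fast producer wait for a slow consumer instead of letting work pile up without limit; the sizes and sleeps are arbitrary:

    import queue
    import threading
    import time

    # a bounded queue: when it is full, put() blocks the producer,
    # pushing back instead of buffering without limit
    work = queue.Queue(maxsize=10)

    def producer():
        for item in range(100):
            work.put(item)  # blocks here whenever the consumer lags

    def consumer():
        while True:
            item = work.get()
            time.sleep(0.01)  # pretend this is slow
            work.task_done()

    threading.Thread(target=consumer, daemon=True).start()
    producer()
    work.join()  # wait for the consumer to drain the queue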
Rule 3: What you don’t disambiguate now, you debug later.

In other words: it should not be hard to look at the variables in your program and work out what is happening. Give or take some terrifying linear algebra subroutines, you should strive to represent your program’s state as obviously as possible. This means things like not changing your mind about what a variable does halfway through a program; if there is one obvious cardinal sin, it is using a single variable for two different purposes.

It also means carefully avoiding the semi-predicate problem: never using a single value (count) to represent a pair of values (boolean, count). That means avoiding things like returning a positive number for a result, and returning -1 when nothing matches. The reason is that it’s easy to end up in a situation where you want something like “0, but true” (and notably, Perl 5 has this exact feature), or you create code that’s hard to compose with other parts of your system (-1 might be a valid input for the next part of the program, rather than an error).

Along with using a single variable for two purposes, it can be just as bad to use a pair of variables for a single purpose—especially if they are booleans. I don’t mean that keeping a pair of numbers to store a range is bad, but using a handful of booleans to indicate what state your program is in is often a state machine in disguise. When state doesn’t flow from top to bottom, give or take the occasional loop, it’s best to give the state a variable of its own and clean the logic up. If you have a set of booleans inside an object, replace it with a variable called state and use an enum (or a string if it’s persisted somewhere). The if statements end up looking like if state == name, and stop looking like if bad_name && !alternate_option.

Even when you do make the state machine explicit, you can still mess up: sometimes code has two state machines hidden inside. I had great difficulty writing an HTTP proxy until I made each state machine explicit, tracing connection state and parsing state separately. When you merge two state machines into one, it can be hard to add new states, or to know exactly what state something is meant to be in. This is far more about creating things you won’t have to debug than about making things easy to debug. By working out the list of valid states, it’s far easier to reject the invalid ones outright, rather than accidentally letting one or two through.
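A minimal sketch of that clean-up (the connection states here are hypothetical, not from any particular codebase): the pile of booleans becomes one explicit state variable, and impossible states become easy to reject.

    from enum import Enum

    # One explicit state variable instead of juggling `connected`,
    # `closing`, and `failed` booleans that can disagree.
    class ConnState(Enum):
        CONNECTING = "connecting"
        CONNECTED = "connected"
        CLOSING = "closing"
        CLOSED = "closed"

    class Connection:
        def __init__(self):
            self.state = ConnState.CONNECTING

        def close(self):
            # Reads as `if state == name`, not `if bad_name and not alt`.
            if self.state in (ConnState.CLOSING, ConnState.CLOSED):
                raise RuntimeError(f"close() called twice, state={self.state}")
            self.state = ConnState.CLOSING
            # release sockets, flush buffers, etc., then:
            self.state = ConnState.CLOSED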
Rule 4: Accidental Behaviour is Expected Behaviour.

When you’re less than clear about what a data structure does, users fill in the gaps—any behaviour of your code, intended or accidental, will eventually be relied upon somewhere else. Many mainstream programming languages had hash tables you could iterate through, which sort-of preserved insertion order, most of the time. Some languages chose to make the hash table behave as many users expected, iterating through the keys in the order they were added, but others chose to make the hash table return keys in a different order each time it was iterated through. In the latter case, some users then complained that the behaviour wasn’t random enough. Tragically, any source of randomness in your program will eventually be used for statistical simulation purposes, or worse, cryptography, and any source of ordering will be used for sorting instead.

In a database, some identifiers carry a little bit more information than others. When creating a table, a developer can choose between different types of primary key. The correct answer is a UUID, or something that’s indistinguishable from a UUID. The problem with the other choices is that they can expose ordering information as well as identity, i.e. not just if a == b but if a <= b, and by other choices I mean auto-incrementing keys. With an auto-incrementing key, the database assigns a number to each row in the table, adding 1 when a new row is inserted. This creates an ambiguity of sorts: people do not know which part of the data is canonical. In other words: do you sort by key, or by timestamp? As with the hash tables before, people will decide the right answer for themselves. The other problem is that users can easily guess the keys of nearby records, too. Ultimately, any attempt to be smarter than a UUID will backfire: we already tried with postcodes, telephone numbers, and IP addresses, and we failed miserably each time. UUIDs might not make your code more debuggable, but less accidental behaviour tends to mean fewer accidents.

Ordering is not the only piece of information people will extract from a key: if you create database keys that are constructed from the other fields, then people will throw away the data and reconstruct it from the key instead. Now you have two problems: when a program’s state is kept in more than one place, it is all too easy for the copies to start disagreeing with each other. It’s even harder to keep them in sync if you aren’t sure which one you need to change, or which one you have changed.

Whatever you permit your users to do, they’ll implement. Writing debuggable code means thinking ahead about the ways in which it can be misused, and how other people might interact with it in general.
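A minimal sketch with sqlite3 (the orders table and its columns are hypothetical): the UUID key answers a == b and nothing else, so the ordering has to live in a column of its own, and there is only one right answer to “what do I sort by?”.

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, created_at TEXT)")

    # uuid4() is random: no order to sort by, no neighbouring keys to
    # guess, and no fields to reconstruct from the key itself.
    order_id = str(uuid.uuid4())
    db.execute("INSERT INTO orders VALUES (?, datetime('now'))", (order_id,))

    # The timestamp carries the ordering; the key carries only identity.
    print(db.execute("SELECT * FROM orders ORDER BY created_at").fetchall())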
Rule 5: Debugging is social, before it is technical.

When a software project is split over multiple components and systems, it can be considerably harder to find bugs. Once you understand how the problem occurs, you might have to co-ordinate changes across several parts in order to fix the behaviour. Fixing bugs in a larger project is less about finding the bugs, and more about convincing the other people that they’re real, or even that a fix is possible. Bugs stick around in software because no-one is entirely sure who is responsible for things. In other words, it’s harder to debug code when nothing is written down, everything must be asked in Slack, and nothing gets answered until the one person who knows logs on.

Planning, tools, process, and documentation are the ways we can fix this. Planning is how we remove the stress of being on call, the structures in place to manage incidents. Plans are how we keep customers informed, switch out people when they’ve been on call too long, and how we track problems and introduce changes to reduce future risk. Tools are the way in which we deskill work and make it accessible to others. Process is the way in which we remove control from the individual and give it to the team. The people will change, the interactions too, but the processes and tools will be carried on as the team mutates over time. It isn’t so much valuing one more than the other, but building one to support changes in the other. Process can also be used to remove control from the team, so it isn’t always good or bad, but there is always some process at work, even when it isn’t written down, and the act of documenting it is the first step to letting other people change it.

Documentation means more than text files: documentation is how you hand over responsibilities, how you bring new people up to speed, and how you communicate what’s changed to the people impacted by those changes. Writing documentation requires more empathy than writing code, and more skill too: there are no easy compiler flags or type checkers, and it’s easy to write a lot of words without documenting anything. Without documentation, how can you expect people to make informed decisions, or even consent to the consequences of using the software? Without documentation, tools, or processes, you cannot share the burden of maintenance, or even replace the people currently lumbered with the task. Making things easy to debug applies just as much to the processes around code as to the code itself, making it clear whose toes you will have to stand on to fix the code.

Code that’s easy to debug is easy to explain. A common occurrence when debugging is realising the problem the moment you explain it to someone else. The other person doesn’t even have to exist, but you do have to force yourself to start from scratch: explain the situation, the problem, the steps to reproduce it, and often that framing is enough to give us insight into the answer.

If only. Sometimes when we ask for help, we don’t ask for the right help, and I’m as guilty of this as anyone—it’s such a common affliction that it has a name: “The X-Y Problem”. How do I get the last three letters of a filename? Oh? No, I meant the file extension. We talk about problems in terms of the solutions we understand, and we talk about solutions in terms of the consequences we’re aware of. Debugging is learning the hard way about unexpected consequences and alternative solutions, and it involves one of the hardest things a programmer can ever do: admit that they got something wrong.

It wasn’t a compiler bug, after all.

2017-12-04 Psychological Safety in Operation Teams (usenix.org)

Think of a team you work with closely. How strongly do you agree with these five statements?

If I take a chance and screw up, it will be held against me.

Our team has a strong sense of culture that can be hard for new people to join.

My team is slow to offer help to people who are struggling.

Using my unique skills and talents comes second to the objectives of the team.

It’s uncomfortable to have open, honest conversations about our team’s sensitive issues.

Teams that score high on questions like these can be deemed to be “unsafe.”

2017-06-28 How do you cut a monolith in half?

It depends. The problem with distributed systems is that, no matter what the question is, the answer is inevitably “It Depends”.

When you cut a larger service apart, where you cut depends on latency, resources, and access to state, but it also depends on error handling, availability, and recovery processes. It depends, but you probably don’t want to depend on a message broker. Using a message broker to distribute work is like a cross between a load balancer and a database, with the disadvantages of both and the advantages of neither.

Message brokers, or persistent queues accessed by publish-subscribe, are a popular way to pull components apart over a network. They’re popular because they often have a low setup cost, and provide easy service discovery, but they can come at a high operational cost, depending on where you put them in your systems. In practice, a message broker is a service that transforms network errors and machine failures into filled disks. Then you add more disks. The advantage of publish-subscribe is that it isolates components from each other, but the problem is usually gluing them together.

For short-lived tasks, you want a load balancer

For short-lived tasks, publish-subscribe is a convenient way to build a system quickly, but you inevitably end up implementing a new protocol atop. You have publish-subscribe, but you really want request-response. If you want something computed, you’ll probably want to know the result.

Starting with publish-subscribe makes work assignment easy: jobs get added to the queue, and workers take turns to remove them. Unfortunately, it makes finding out what happened quite hard, and you’ll need to add another queue to send a result back.

Once you can handle success, it is time to handle the errors. The first step is often adding code to retry the request a few times. After you DDoS your system, you put in a call to sleep(). After you slowly DDoS your system, each retry waits twice as long as the previous one. (Aside: accidental synchronisation is still a problem, as waiting to retry doesn’t prevent a lot of things happening at once.) As workers fail to keep up, clients give up and retry work, but the earlier request is still waiting to be processed. The solution is to move some of the queue back to the clients, asking them to hold onto work until it has been accepted: back-pressure, or acknowledgements. Although the components interact via publish-subscribe, we’ve created a request-response protocol atop.

Now the message broker is really only doing two useful things: service discovery and load balancing. It is also doing two not-so-useful things: enqueuing requests, and persisting them. For short-lived tasks, the persistence is unnecessary: the client sticks around for as long as the work needs to be done, and handles recovery. The queuing isn’t that necessary either. Queues inevitably run in two states: full, or empty. If your queue is running full, you haven’t pushed enough work to the edges, and if it is running empty, it’s working as a slow load balancer. A mostly empty queue is still first-come-first-served, serving as a point of contention for requests. A broker often does nothing but wait for workers to poll for new messages. If your queue is meant to run empty, why wait to forward on a request? (Aside: something like random load balancing will work, but join-idle-queue is well worth your time investigating.)

For distributing short-lived tasks, you can use a message broker, but you’ll be building a load balancer, along with an ad-hoc RPC system, with extra latency.
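The retry-and-backoff dance above, as a minimal sketch (the request callable and the limits are hypothetical): exponential backoff, with jitter so a crowd of clients doesn’t accidentally synchronise and retry in lockstep.

    import random
    import time

    def retry(request, attempts=5, base=0.1, cap=10.0):
        # Call `request` until it succeeds, waiting longer each time.
        for attempt in range(attempts):
            try:
                return request()
            except IOError:
                if attempt == attempts - 1:
                    raise  # give up: let something else take responsibility
                # Double the window each time; the random jitter spreads
                # clients out so they don't all wake up at once.
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))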

For long-lived tasks, you’ll need a database

A load balancer with service discovery won’t help you with long-running tasks, work that outlives the client, or managing throughput. You’ll want persistence, but not in your message broker. For long-lived tasks, you’ll want a database instead.

Although the persistence and queueing were obstacles for short-lived tasks, the disadvantages are less obvious for long-lived tasks, but similar things can go wrong. If you care about the result of a task, you’ll want to store that it is needed somewhere other than in the persistent queue. If the task runs but fails midway, something will have to take responsibility for it, and the broker will have forgotten. This is why you want a database.

Duplicates in a queue often cause more headaches, as long-lived tasks have more opportunities to overlap. Although we’re using the broker to distribute work, we’re also using it implicitly as a mutex. To stop work from overlapping, you implement a lock atop. After it breaks a couple of times, you replace it with leases, adding timeouts. (Note: this is not why you want a database; using transactions for long-running tasks is suffering. Long-running processes are best modelled as state machines.)

When the database becomes the primary source of truth, you can handle a broker going offline, or a broker losing the contents of a queue, by backfilling from the database. As a result, you don’t need to enqueue work with the broker directly, but can mark it as required in the database, and wait for something else to handle it. Assuming that something else isn’t a human who has been paged.

A message pump can scan the database periodically and send work requests to the broker; there’s a sketch of one below. Enqueuing work in batches can be an effective way of making an expensive database call survivable. The pump responsible for enqueuing the work can also track whether it has completed, and so handle recovery or retries too. Backlog is still a problem, so you’ll want to use back-pressure to keep the queue fairly empty, and only fill it from the database when needed. Although a broker can handle temporary overload, back-pressure should mean it never has to.

At this point the message broker is really providing two things: service discovery and work assignment, but really you need a scheduler. A scheduler is what scans a database, works out which jobs need to run, and often where to run them too. A scheduler is what takes responsibility for handling errors. (Aside: writing a scheduler is hard. It is much easier to have 1000 while loops waiting for the right time than one while loop waiting for which of the 1000 is first. A scheduler can track when it last ran something, but the work can’t rely on that being the last time it ran. Idempotency isn’t just your friend, it is your saviour.)

You can use a message broker for long-lived tasks, but you’ll be building a lock manager, a database, and a scheduler, along with yet another home-brew request-response system.
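A sketch of such a pump, assuming hypothetical db and broker interfaces (the method names are made up; the shape is the point): the database stays the source of truth, and the queue only ever holds a small batch.

    import time

    def message_pump(db, broker, batch_size=20):
        # `db` and `broker` are hypothetical interfaces. Because the
        # database is the source of truth, a lost queue can always be
        # rebuilt by running this loop again.
        while True:
            # Back-pressure: only top up the queue when it runs low,
            # so the broker stays nearly empty and latency stays down.
            if broker.queue_depth() < batch_size:
                jobs = db.claim_pending(limit=batch_size)  # one batched call
                for job in jobs:
                    broker.enqueue(job)
            # Leases rather than locks: work claimed by a crashed worker
            # times out and becomes pending again.
            db.expire_stale_claims()
            time.sleep(1.0)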

Publish-Subscribe is about isolating components

The problem with running tasks over publish-subscribe is that you really want request-response. The problem with using queues to assign work is that you don’t want to wait for a worker to ask. The problem with relying on a persistent queue for recovery is that recovery must get handled elsewhere. And the problem with brokers is that nothing else makes service discovery so trivial.

Message brokers can be misused, but that isn’t to say they have no use. Brokers work well when you need to cross system boundaries. Although you want to keep queues empty between components, it is convenient to have a buffer at the edges of your system, to hide some failures from external clients. When you handle external faults at the edges, you free the insides from handling them: the inside of your system can focus on handling internal problems, of which there are many.

A broker can be used to buffer work at the edges, but it can also be used as an optimisation, to kick off work a little earlier than planned. A broker can pass on a notification that data has changed, and the system can fetch the data through another API. (Aside: if you use a broker to speed up a process, the system will grow to rely on it for performance. People use caches to speed up database calls, but there are many systems that simply do not work fast enough until the cache is warmed up and filled with data. Although you are not relying on the message broker for reliability, relying on it for performance is just as treacherous.)

Sometimes you want a load balancer, sometimes you’ll need a database, but sometimes a message broker will be a good fit. Although persistence can’t handle many errors, it is convenient if you need to restart with new code or settings, without data loss. Sometimes the error handling offered is just right. Although a persistent queue offers some protection against failure, it can’t take responsibility when things go wrong halfway through a task. To be able to recover from failure you have to stop hiding it: you must add acknowledgements, back-pressure, and error handling to get back to a working system. A persistent message queue is not bad in itself, but relying on it for recovery, and by extension correct behaviour, is fraught with peril.
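A minimal sketch of that notify-then-fetch pattern (broker, api, and store are hypothetical interfaces): the broker only carries a hint that something changed, and the data itself comes from an API that can be re-queried at any time.

    def handle_notifications(broker, api, store):
        # The notification is an optimisation, not the source of truth:
        # a lost or duplicated message only delays the fetch, because a
        # periodic scan of the API can always catch up.
        for note in broker.subscribe("orders-changed"):
            order = api.fetch_order(note["order_id"])
            store.save(order)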

Systems grow by pushing responsibilities to the edges

Performance isn’t easy either. You don’t want queues or persistence in the central or underlying layers of your system; you want them at the edges. “It’s slow” is the hardest problem to debug, and often the reason is that something is stuck in a queue.

For long- and short-lived tasks, we used back-pressure to keep the queue empty, to reduce latency. When you have several queues between you and the worker, it becomes even more important to keep the queue out of the centre of the network. We’ve spent decades on TCP congestion control to avoid exactly that, and if you’re curious, its history makes for interesting reading. Although the ends of a TCP connection were responsible for failure and retries, the routers were responsible for congestion: drop things when there is too much. The problem is that this worked until the network was saturated, and, similar to backlog in queues, when it broke, errors cascaded. The solution was similar too: back-pressure. Much like sleeping twice as long on errors, TCP sends half as many packets, before gradually increasing the amount as things improve.

Back-pressure is about pushing work to the edges, letting the ends of the conversation find stability, rather than trying to optimise all of the links in between in isolation. Congestion control is about using back-pressure to keep the queues in between as empty as possible, to keep latency down, and to increase throughput by avoiding the need to drop packets.

Pushing work to the edges is how your system scales. We have spent a lot of time and a considerable amount of money on IP multicast, but nothing has been as effective as BitTorrent. Instead of relying on smart routers to work out how to broadcast, we rely on smart clients to talk to each other.

Pushing recovery to the outer layers is how your system handles failure. In the earlier examples, we needed the client, or the scheduler, to handle the lifecycle of a task, as it outlived its time on the queue. Error recovery in the lower layers of a system is an optimisation, and you can’t push work to the centre of a network and scale. This is the end-to-end principle, and it is one of the most important ideas in system design. The end-to-end principle is why you can restart your home router when it crashes, without it having to replay all of the websites you wanted to visit before letting you ask for a new page. The browser (and your computer) is responsible for recovery, not the computers in between.

This isn’t a new idea, and Erlang/OTP owes a lot to it. OTP organises a running program into a supervision tree. Each process will often have one process above it, restarting it on failure, and above that, another supervisor to do the same. (Aside: pipelines aren’t incompatible with process supervision; one way is for each part to spawn the program that reads its output, so a failure down the chain can propagate back up to be handled correctly.) Although each program will handle some errors, the top levels of the supervision tree handle larger faults with restarts. Similarly, it’s nice if your webpage can recover from a fault, but inevitably someone will have to hit refresh.

The end-to-end principle is realising that no matter how many exceptions you handle deep down inside your program, some will leak out, and something at the outer layer has to take responsibility. Although sometimes taking responsibility is just writing things to an audit log, and message brokers are pretty good at that.
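A minimal sketch of the bottom rung of such a tree (serve is a hypothetical long-running entry point): the supervisor is just a loop that restarts the program when it crashes, backing off when the crashes come quickly.

    import time

    def supervise(serve, max_backoff=30.0):
        # Run `serve` forever, restarting it on failure: the supervisor
        # takes responsibility for the errors that leak out of it.
        backoff = 1.0
        while True:
            started = time.monotonic()
            try:
                serve()
            except Exception as e:
                print("crashed:", e)
            # A long run suggests the fault was transient, so reset the
            # backoff; quick repeated crashes restart more slowly.
            if time.monotonic() - started > 60:
                backoff = 1.0
            time.sleep(backoff)
            backoff = min(max_backoff, backoff * 2)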

Aside: But what about replicated logs?

“How do I subscribe to the topic on the message broker?”

“It’s not a message broker, it’s a replicated log.”

“OK, how do I subscribe to the replicated log?”

From ‘I believe I did, Bob’, jrecursive

Although a replicated log is often confused with a message broker, replicated logs aren’t immune from handling failure either. Although it’s good that the components are isolated from each other, they still have to be integrated into the system at large. Both offer a one-way stream for sharing, and both offer publish-subscribe-like interfaces, but the intent is wildly different. A replicated log is often about auditing, or recovery: having a central point of truth for decisions. Sometimes a replicated log is about building a pipeline with fan-in (aggregating data), or fan-out (broadcasting data), but it is always about building a system where data flows in one direction.

The easiest way to see the difference between a replicated log and a message broker is to ask an engineer to draw a diagram of how the pieces connect. If the diagram looks like a one-way system, it’s a replicated log. If almost every component talks to it, it’s a message broker. If you can draw a flow-chart, it’s a replicated log. If you take all the arrows away and you’re left with a venn diagram of ‘things that talk to each other’, it’s a message broker.

Be warned: a distributed system is something you can draw on a whiteboard pretty quickly, but it’ll take hours to explain how all the pieces interact.

You cut a monolith with a protocol

How you cut a monolith is often more about how you are cutting up responsibility within a team than about cutting it into components. It really does depend, and often more on the social aspects than the technical ones, but you are still responsible for the protocol you create.

Distributed systems are messy because of how the pieces interact over time, rather than which pieces are interacting. The complexity of a distributed system does not come from having hundreds of machines, but from having hundreds of ways for them to interact. A protocol must take into account performance, safety, stability, availability, and, most importantly, error handling.

When we talk about distributed systems, we are talking about power structures: how resources are allocated, how work is divided, how control is shared, or how order is kept across systems ostensibly built out of well-meaning but faulty components. A protocol is the rules and expectations of participants in a system, and how they are beholden to each other. A protocol defines who takes responsibility for failure. The problem with message brokers, and with queues, is that no-one does.

Using a message broker is not the end of the world, nor a sign of poor engineering. Using a message broker is a tradeoff. Use them freely knowing they work well on the edges of your system as buffers. Use them wisely knowing that the buck has to stop somewhere else. Use them cheekily to get something working.

I say don’t rely on a message broker, but I can’t point to easy off-the-shelf answers. HTTP and DNS are remarkable protocols, but I still have no good answers for service discovery. Lots of software regularly gets pushed into service way outside of its designed capabilities, and brokers are no exception. Although the bad habits around brokers and the relative ease of getting a prototype up and running lead to nasty effects at scale, you don’t need to build everything at once.

The complexity of a system lies in its protocol, not its topology, and a protocol is what you create when you cut your monolith into pieces. If modularity is about building software, protocol is about how we break it apart.

The main task of the engineering analyst is not merely to obtain “solutions” but is rather to understand the dynamic behaviour of the system in such a way that the secrets of the mechanism are revealed, and that if it is built it will have no surprises left for [them]. Other than exhaustive physical experimentations, this is the only sound basis for engineering design, and disregard of this cardinal principle has not infrequently led to disaster.

From “Analysis of Nonlinear Control Systems” by Dunstan Graham and Duane McRuer, p. 436

Protocol is the reason why ‘it depends’, and the reason why you shouldn’t depend on a message broker: you can use a message broker to glue systems together, but never use one to cut systems apart.

2016-09-16 I like this talk a lot: what modularity is, what we use it for, how modularity happens in systems, and how we can use modularity to manage change.

2016-08-22 RIP, Mathie

Last night I found out I’d lost a friend, and if you’ll be patient with my words, I’d like to reflect a little. Mathie was one of the older, weirder geeks I met. I’d escaped my home town on the edge of nowhere, and it was my first time having a peer group of adults. He’d helped everywhere: with the student-run shell server, with the local IRC server everyone collected on, a known and friendly face on the circuit.

Mathie was one of the many people behind Scottish Ruby Conference, responsible for bringing a lot of interesting people into Edinburgh, and into my life.

“Why wasn’t this talk given at Scottish Ruby Conference?”

“I fucked up.”

In front of everyone assembled at the fringe track, a collection of talks that didn’t quite make it, mathie answered honestly. It’s kinda how I’ll remember him: a bit of a fuckup. A fuckup who changed my life for the better.

Thanks mathie, I hope to pass on some of your kindness. RIP, you fucking idiot.

2016-05-20 PapersWeLove London: End-to-End Arguments In System Design

This week I gave a short talk on a paper I love: End-to-End Arguments in System Design. The talk was recorded and uploaded (but not captioned), and you can watch it here: https://skillsmatter.com/skillscasts/8200-end-to-end-arguments-in-system-design-by-saltzer-reed-and-clark

2016-05-10 A million things to do with a computer!

I gave a talk at !!con last weekend, about my favourite programming language, Scratch:

Back in 1971, Cynthia Solomon and Seymour Papert wrote “Twenty things to do with a computer”, about their experiences of teaching children to use Logo, and their ideas for the future. They were wrong: there’s a lot more than twenty. Logo’s successor, Scratch, has over thirteen million things that children and adults alike have built. Scratch is radically approachable in a way that puts every other language to shame. This talk is about the history, present, and future of Scratch: why Scratch is about ‘coding to learn’, and not about ‘learning to code’.

I had an incredible time at !!con. The live captioning was fantastic (and they’re crowdfunding a game to teach steno, too). The livestreams are up (but no captions), and my talk is 3h29m32s in on day 2.

2016-03-07 Addendum: Write code that is easy to delete, not easy to extend.

I found two translations by accident. I can’t tell if they are perfect translations, but I am thankful nonetheless.

Russian

Chinese (Many people mentioned The Wrong Abstraction, and it is worth mentioning here too.)