The continuing headaches of distributed programming

2016-08-10 by Guest author: Andy Oram, Senior Editor O'Reilly Media

Why scaling and parallelism remain hard even with new tools and languages

Despite ground-breaking advances in languages and utilities that purport to simplify parallel or distributed computing, designing a system that supports modern commerce and other high-priority applications remains one of the most difficult tasks in computing. Having dealt with such systems for too long–I wrote tests and documentation for a Unix pthreads implementation at a manufacturing firm in the 1980s, for instance, and edited O’Reilly’s PThreads Programming book in 1996–I have grown so used to seeing the same difficulties over and over that I’ve learned to look for them in every distributed system I hear about. After I was recently assigned a cluster of books on languages (Erlang, Haskell, Scala, OCaml), messaging (ZeroMQ), and DevOps/scheduling tools (Mesos, ZooKeeper), I started to wonder why there had been so little progress in the field.

I expect language and tool designers to deal with the difficulties of distributed computing. What surprises me is that users out in the field–application developers and operations personnel–have to do so as well. The complexities are not completely hidden by the tools. And I think I have identified the reason: efficient and robust distributed computing requires careful tuning of your architectural parameters. These parameters must be exposed to the programmers or operations people. At least a few of the following issues will intrude into their decisions:

How many nodes you want for each function in your application (front-end, application back-end, database, load balancing, etc.)

The memory and CPU capacity of each node

The geographic distribution of nodes and which of them share a data center

How to partition and replicate your data, the number of replicas needed, and the distribution of nodes

How many systems at each site should be alive simultaneously to handle availability and load

How many nodes constitute a quorum when replicating state or electing a leader

What consistency models (eventual, causal, strong, etc.) you want during data replication

The recovery policy in case of failures or unresponsive systems

Data maintenance policies, such as whether to cache the intermediate results of calculations or recalculate them when necessary

The demands on operations pile up fast, and a lot of configuration is required no matter how easy the tool-makers try to render it. This article discusses the design of such systems.
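To make the list above concrete, here is a minimal sketch of how a few of those parameters might be gathered into one configuration object, with the quorum size derived from the replica count. The class name, field names, and sample values are all hypothetical, and the quorum rule shown is the simple-majority rule; a real system may use a different one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical bundle of tuning parameters; not any real tool's schema."""
    nodes_per_role: dict        # e.g. {"front-end": 4, "database": 5}
    replicas: int               # copies kept of each data partition
    consistency: str            # "eventual", "causal", or "strong"
    restart_failed_nodes: bool  # recovery policy

    def quorum(self) -> int:
        """Smallest majority of the replicas."""
        return self.replicas // 2 + 1

    def tolerated_failures(self) -> int:
        """Replicas that can be lost while a majority is still reachable."""
        return self.replicas - self.quorum()


config = ClusterConfig(
    nodes_per_role={"front-end": 4, "back-end": 6, "database": 5},
    replicas=5,
    consistency="strong",
    restart_failed_nodes=True,
)
print(config.quorum(), config.tolerated_failures())  # prints "3 2"
```

Even in this toy form, every field is a decision that someone has to make for a particular deployment, which is why the tools cannot hide it.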

Languages facilitate distributed computing–but only to a degree

Erlang, Go, and several other languages tout their solutions for people writing for multiple cooperating nodes. Support for distributed computing has certainly been refined in these languages, particularly through the message-passing model, which now seems almost universal.

Good old threads and multiprocessing have been relegated to the most performance-critical applications on single systems. Applications spanning nodes can’t benefit from the data sharing that threads and processes on a single system can exploit, and even the blessings of sharing data on a single node run into headaches because of cache consistency issues–along with threads’ notorious vulnerability to hard-to-debug errors. It’s definitely convenient to use a single programming model to cover communication among both local and remote processes, so messaging wins out. Its asynchronous communication allows the same mechanisms to be used for both local and remote processes transparently. You can write code that runs on a single node and, with few or no changes, distribute the code to a cluster of nodes.

Asynchronous messaging can even be used through the Network Time Protocol (NTP) to synchronize clocks on different computers, so that when they do have to engage in shared memory activities (such as writing files to a shared disk), their views of the external world are roughly in agreement.

But processes still have to deal with backed-up queues, child processes that fail, and child processes that simply never return a result because of process failure, programming error, or network partitioning. Programmers have to work with protocols such as back pressure and policies enforced by supervisor nodes. The very existence of ZooKeeper (for implementing distributed consensus and guaranteeing agreement across application processes) demonstrates how complicated distributed computing remains.
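Two of the burdens just mentioned, backed-up queues and children that never answer, can be illustrated with a minimal sketch that uses only Python's standard `queue` module. The function names `submit` and `await_result` and the timeout values are assumptions chosen for illustration; they are not taken from any of the tools discussed here.

```python
import queue


def submit(mailbox, message, timeout=0.1):
    """Try to enqueue; report False instead of letting the backlog grow."""
    try:
        mailbox.put(message, timeout=timeout)
        return True
    except queue.Full:
        return False  # back pressure: the sender must slow down or shed load


def await_result(results, timeout=0.5):
    """Wait a bounded time for a reply; None means no answer arrived."""
    try:
        return results.get(timeout=timeout)
    except queue.Empty:
        return None  # the caller's policy decides: retry, restart, or give up


# A bounded mailbox with no consumer: the third and fourth sends are refused.
mailbox = queue.Queue(maxsize=2)
accepted = [submit(mailbox, n, timeout=0.01) for n in range(4)]
print(accepted)  # [True, True, False, False]

# A child that never replies: the wait ends after the timeout.
print(await_result(queue.Queue(), timeout=0.01))  # None
```

The sketch only detects the problem. What to do when `submit` returns False or `await_result` returns None is exactly the kind of policy the programmer still has to choose.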

It would be unreasonable to expect programming languages to deal with these issues, and in fact they don’t. The Erlang developers realized some time ago that something on top of Erlang was required to deal with distributed computing, so they developed the OTP library discussed in the O’Reilly book Designing for Scalability with Erlang/OTP. This library provides basic calls for starting and restarting processes, recognizing and propagating error messages, and similar supervisor responsibilities.
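The supervisor idea can be sketched in a few lines. This is not the OTP API, only a hypothetical Python analogue: `supervise`, `max_restarts`, and the `flaky` worker are invented names, and the restart rule (retry up to a fixed budget, then pass the error upward) is one assumed policy among many.

```python
def supervise(task, max_restarts=3):
    """Run task(attempt); restart it on failure, up to max_restarts times.

    Returns (result, restarts_used). Re-raises the last error once the
    restart budget is spent, so a parent supervisor can decide what to do.
    """
    restarts = 0
    while True:
        try:
            return task(restarts), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1


def flaky(attempt):
    # Hypothetical worker that crashes twice before succeeding.
    if attempt < 2:
        raise RuntimeError(f"crash on attempt {attempt}")
    return "done"


print(supervise(flaky))  # ('done', 2)
```

Calling `supervise(flaky, max_restarts=1)` raises the `RuntimeError` instead, which shows how a single number changes whether the application recovers or gives up.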

The distinction between what’s appropriate for language semantics and what’s appropriate in a library depends on the architectural considerations I mentioned before. Policy decisions such as the following aren’t appropriate to build into a language:

Whether to give up on a process that failed (because you don’t need its results or cannot recover from its corrupt state) or restart it

How long to wait before deciding that a node has vanished

How many nodes to run for fault-tolerance and scalability

Whether to set up a single monitor for all subprocesses, or separate monitors for your user interactions and back-end databases

Those are the sorts of decisions relegated by Erlang to the OTP library. OTP has built the functionality to deal with the issues of launching processes, joining them, and handling failures into a set of generic templates so that the programmers can concentrate more on the business logic of their particular applications. A useful overview of policy decisions in distributed systems introduces some of these topics–and even a treatment of that length couldn’t cover everything.
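One of the policy decisions listed above, how long to wait before deciding that a node has vanished, can be sketched as a heartbeat monitor with a configurable timeout. The class name, the node names, and the five-second timeout are hypothetical, and the clock is injected so the example runs without real waiting.

```python
import time


class HeartbeatMonitor:
    """Suspects a node once it has been silent longer than the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, node):
        """Record that a node reported in just now."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# A fake clock stands in for real time in this illustration.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.heartbeat("node-a")
monitor.heartbeat("node-b")
fake_now[0] = 4.0
monitor.heartbeat("node-b")   # node-b reports in again; node-a stays silent
fake_now[0] = 7.0
print(monitor.suspected())    # ['node-a']
```

A shorter timeout detects failures sooner but suspects slow nodes that are still alive; a longer one does the reverse. Neither choice is right for every application, so it belongs in configuration rather than in the language.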

But at a certain level of mission-critical application deployment, even OTP is insufficient. In order to avoid the need for a human staff person to hang around 24/7 and monitor thousands of nodes, you want automated ways to send user requests to a pool of servers, and sophisticated ways to duplicate data in such a way that it’s safe even if multiple nodes fail. These requirements call for more than a library–they need a network of cooperating separate processes.

Riak Core and Scalable Distributed Erlang are discussed in the book as extra tools to fill the gap. These tools expose the configuration details mentioned earlier in this article so that a programmer can make the tough, architecture-specific, environment-specific choices that allow an application to scale and stay highly available.

Unrelated processes

The tools I’ve discussed so far in this article, for all their diversity, take one easy way out: they assume that all processes have a common ancestor or at least were started by a single operator. Most, such as ZeroMQ and Erlang/OTP, rely on a tree structure of supervisors going back to a single root. Unrelated Erlang processes and microservices can find each other through a central registry called Gproc.

Bluetooth and other wireless technologies allow device discovery on local networks, part of the technologies for the Internet of Things. ZooKeeper is also more flexible, using a coordination and voting algorithm in the tradition of Paxos to coordinate independent peers.

A scheduling system used internally at Google allows multiple schedulers to start jobs on shared computing resources, with potentially different policies for determining when jobs of varying lengths should run. The paper does not explain how the system’s users pass it information on the expected time requirements or latencies of the jobs they submit. Sharing resources means that conflicts will occur, and schedulers must use transactions to deal with jobs that require too many resources to finish.
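The conflict between schedulers can be illustrated with a minimal sketch of an optimistic, version-checked commit. This is an assumed scheme chosen for illustration, not a description of how the Google system implements its transactions; the class `SharedCluster`, its methods, and the CPU counts are hypothetical, and the sketch is single-threaded, so a real version would need the check and the update to happen atomically.

```python
class SharedCluster:
    """Shared pool of CPUs that several schedulers claim from."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.version = 0  # incremented on every successful commit

    def snapshot(self):
        """Return the state a scheduler bases its decision on."""
        return self.version, self.free_cpus

    def commit(self, seen_version, cpus_needed):
        """Claim CPUs only if nobody else committed since the snapshot."""
        if seen_version != self.version or cpus_needed > self.free_cpus:
            return False  # conflict or shortage: the scheduler must retry
        self.free_cpus -= cpus_needed
        self.version += 1
        return True


cluster = SharedCluster(free_cpus=8)
seen_a, _ = cluster.snapshot()      # scheduler A looks at the cluster
seen_b, _ = cluster.snapshot()      # scheduler B looks at the same state
print(cluster.commit(seen_a, 6))    # True: A commits first
print(cluster.commit(seen_b, 4))    # False: B's snapshot is stale
seen_b, free = cluster.snapshot()   # B retries with fresh information
print(cluster.commit(seen_b, free)) # True: B takes the 2 CPUs that remain
```

The retry is where policy comes back in: B has to decide whether a smaller allocation is acceptable, whether to wait, or whether to abandon the job.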

Few systems try to support a totally peer-to-peer environment of different processes with different owners–a goal that was already recognized as nearly intractable when I wrote about it in 2004.

Unknown processes can “plug in” to a distributed environment if everybody is communicating over a local area network. Broadcasts on the network allow systems to discover each other and choose leaders. That’s how Apple’s Bonjour and Microsoft’s CIFS let you start up a printer and have other systems automatically communicate with it. Outside a LAN, the coordination of unrelated processes requires out-of-band communications, such as when I tweet the URL of a video I want you to view. There are also specialized servers that use DNS lookups or related methods to register and offer up unknown services to clients seeking them.

In summary, various cloud solutions and schedulers take on important burdens such as job monitoring, fair division of resources, auto-scaling, and task-sharing algorithms such as publish/subscribe, but the configuration issues mentioned at the beginning of this article still require intervention by a programmer or operations person.

Which way to turn?

I’ve been watching the distributed computing space since the 1980s. The fundamental problems don’t go away and aren’t solved by throwing more CPU or faster networks at them. Distributed computing forces computers to reflect the complexity of life itself, with many interacting factors over various periods of time. Go meditate in a forest for half an hour, and you will understand more about the problems of distributed systems (and other things) than I can convey through text.

My friend and stalwart author Francesco Cesarini reminds me that computing has made many advances over the past few decades. One, as I mentioned, is the large-scale abandonment of shared memory in favor of message-passing. Not all shared resources can be given up, though. Every production system rests on some shared data store, such as a relational database, and these data stores present their own partitioning and high-availability challenges.

The computer field has significantly advanced in understanding the trade-offs between consistency and availability, illustrated by all the discussion around CAP and all the data systems designed with these different trade-offs. Network failures are being understood and handled like other software exceptions.

Cesarini says, “Some frameworks that try to automate activities end up failing to hide complexity. They limit the trade-offs you can make, so they cater only to a subset of systems, often with very detailed requirements.”

So go ahead and pick your tool or language–you may well be picking another one two years from now. But rest comfortably in the knowledge that whatever challenges you’re struggling with, and whatever solutions you find, you’ll be reusing ideas that were thought up and probably implemented years ago.