Doing the Impossible

Exactly-once Messaging Patterns in Kafka

Exactly-once messaging is something of a holy grail in the Kafka ecosystem – widely sought-after but rarely encountered. There are a handful of systems that promise exactly-once semantics, but none of them are a general-purpose solution: they’re often too task-specific, too heavyweight, or too broken, and sometimes all three. Complicating the picture is the fact that exactly-once message delivery is, in general, impossible.

Thankfully, the picture is not so bleak. While there may be no one-size-fits-all mechanism for exactly-once delivery, there are many specific situations where it is possible to implement something with the right semantics – and better, it’s often surprisingly cheap. In this post, we’ll do this for a handful of simple streaming problems, and talk about how to extend these ideas to larger systems.

How Kafka Works

To start on a sound footing, we’ll spin quickly through some of Kafka’s abstractions and semantics. If any of this is unfamiliar, you may want to go through the Kafka design docs – particularly the section on delivery semantics – or Jay Kreps’ excellent writeup on the underlying log abstraction.

A Kafka topic is a named collection of numbered partitions, and each partition is backed by a log. A log is just an ordered series of messages, like so:

(Diagram: an input log holding messages at offsets 0 through 9.)

Each message in the log has a unique offset, starting at zero and counting up as new messages are added.

Unlike a traditional queue, consumers never ‘remove’ or ‘poll’ a message from a Kafka topic. Typically, each partition will be assigned to a specific consumer, and that consumer keeps track of its own position in the log:

(Diagram: a consumer’s cursor pointing at offset 6 of the input log.)

To avoid starting from scratch after a failure, consumers usually commit these offsets to some persistent store. (As of Kafka 0.8.1, the high-level consumer stores these in ZooKeeper, but Kafka expects to ship its own API for this in a future release.) If a consumer fails, it can retrieve its stored offset on startup and resume from there.

This offset is committed asynchronously, so it’s usually a little behind the offset of the last message processed. Even if we chose to commit the offset after every single message, the problem is not gone – there’s still a moment, just after processing completes but before the offset is persisted, where the offset in permanent storage is stale. If there’s a failure, and the consumer restores from a stale commit, it will process a small number of messages again. This is called at-least-once delivery – every message will get processed, but some may be processed more than once.
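To make that failure window concrete, here’s a minimal in-memory sketch of the commit-behind loop. A Python list stands in for the partition, and `OffsetStore` for ZooKeeper or Kafka’s offset storage – all the names here are illustrative stand-ins, not real Kafka APIs.

```python
class OffsetStore:
    """Stand-in for ZooKeeper or Kafka's offset storage."""
    def __init__(self):
        self.offset = 0

    def commit(self, offset):
        self.offset = offset

def consume(log, store, process, commit_every=100):
    # Resume from the last committed offset after a restart.
    offset = store.offset
    while offset < len(log):
        process(log[offset])
        offset += 1
        # The commit trails processing: a crash before the next commit
        # means the messages since the last one get processed again.
        if offset % commit_every == 0:
            store.commit(offset)
```

If the process dies between `process` and `commit`, the restart re-runs the uncommitted messages – at-least-once, not exactly-once.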

Of course, we almost never actually want those duplicate deliveries – we’d like every message to be processed exactly once. Unfortunately, this turns out to be impossible; in a distributed system, there’s no general-purpose way to ensure that some arbitrary operation will run exactly once for every message. There’s a pretty good discussion of this on lobste.rs , and the Kafka docs mention the same issue:

So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumer’s position with what is actually stored as output. [This can be handled] by simply letting the consumer store its offset in the same place as its output. […] As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is.

So Kafka can’t implement exactly-once semantics on its own, but Camus can – dumping a Kafka queue into a file is a very constrained sort of operation, and when you scope your problem down that small, getting exactly-once semantics is not too hard. In this post, we’ll extend these ideas to cover a wide variety of situations: tasks that both consume and produce messages, or that keep some intermediate state, or that consume from or produce to multiple partitions. In general, we’ll be looking for the smallest amount of state or coordination that works for a particular task. Since there’s no general-purpose solution, we may as well tailor our coordination to the job at hand; and by exploiting some of the specifics, we can often keep the overhead very low.

I’m also going to focus heavily on ‘stream transformations’. Most of the writing and implementations out there only deal with exactly-once state updates, which is understandable; when you’re first building a system on Kafka, you’re most interested in getting data in or pulling data out. In a developed system, though, you’re likely to have a bunch of jobs that consume data from Kafka, do some processing, and write the results back to Kafka for a downstream job. (Martin Kleppmann compares this to UNIX pipes, which feels quite apt.) For these kinds of jobs, just updating state exactly once is not enough – you also want to avoid sending duplicate messages – so we’ll take care to extend our exactly-once semantics to the messaging layer.

I don’t want to take credit for any of these ideas: most of them are collected from various sources or Kafka folklore, or are small extensions to known techniques, and I’ve tried to give credit where I can. On the other hand, I do think the sum is greater than its parts – I hope that collecting these together makes it easier to design systems with more reasonable semantics.

Exclusive Producer

Let’s consider a ‘hello world’ of stream processing: a job that takes a single input topic and copies it to an output topic, partition by partition:

(Diagram: each message in the input log copied to the same offset in the output log.)

This is about as simple as it gets. There’s no processing to do or state to manage, and every input message corresponds to exactly one output message.

An obvious implementation might use Kafka’s high-level stream consumer, taking each incoming message and publishing it to the output log. While convenient, this setup can duplicate messages if the processor fails: if the last commit was half a second before the failure, that last half-second of messages will be written twice.

(Diagram: after a failure, the task restores a stale offset and writes the last few messages to the output a second time.)

If we want to copy over each message just once, we’ll need some way to avoid these duplicate messages. This is nontrivial: if a failure happens while it’s sending messages to the broker – whether from a network error or a node crash – we can’t always tell if the broker handled the request or not, so we don’t know if it’s safe to resend the message.

In our case, though, it’s not too hard to check. When our task restarts, we can check the latest offset in the output log. If there are 8 messages in the output, it means we’ve handled 8 messages of input – and if we restart the consumer there, we’ll pick up right where we left off.

(Diagram: on recovery, the task sees 8 messages in the output and resumes from input offset 8, with no duplicates.)

This copies every input message to the output precisely once, and in this simple case, it actually takes less state than the naive version.
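Sketched in code, the recovery logic is just a length check. This assumes our task is the only writer to the output partition; the lists and the function name are illustrative stand-ins for real partitions.

```python
def copy_with_recovery(input_log, output_log):
    # With a one-to-one copy and a single writer, the output length
    # tells us exactly how many input messages were already handled.
    resume_from = len(output_log)
    for offset in range(resume_from, len(input_log)):
        output_log.append(input_log[offset])
```

However many times this restarts, and wherever the crash lands, a re-run never re-appends a message that already made it into the output.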

This is very similar to Camus’ approach, where the offset is stored alongside the state. Here, the output partition is the state – and the latest offset is updated ‘implicitly’ every time a message is written to the topic. It’s also very similar to Kafka’s ‘idempotent producer’ proposal, and the FAQ mentions a very similar approach. On its own, this ‘read your own output offset’ approach is fairly limited and brittle; but it’s also a useful tool for more complicated systems.

The key here is that our producer is the only producer writing to a particular partition – if there were multiple writers, it would be impossible to tell if it was your write that succeeded or another task’s. Having this constraint makes the system much easier to reason about; if you can set your system up with a single writer per partition, it’s almost always worthwhile.

Committing Multiple Offsets

One of the big limitations of that first approach is that it’s restricted to one-to-one mappings between input and output. Many interesting operations don’t fit this pattern: we may want to filter bad data out of the stream so that it’s not represented in the output at all, or split the incoming messages up into several messages and write all of them out.

In the general case, where an input message can produce any amount of output, things can get a bit chaotic:

(Diagram: input messages producing zero, one, or several output messages, so input and output offsets no longer line up one-to-one.)

In particular, there’s no longer a simple relationship between the offsets in the input and output log. Our simple recovery scheme no longer works – we can check the latest offset in the output log, but that doesn’t tell us what the input offset should be. The high-level consumer has exactly the opposite issue: we know what the input offset was at the time of the commit, but we can’t infer the output offset. The commit is asynchronous, so the output offset at the time of the commit is not even necessarily the output offset at the time of the failure.

To get exactly-once behaviour here, we can use a hybrid approach. Instead of committing only the input offset, we commit both the input and output offsets together. If that commit pointed right to the end of the output log, we’d be done – our current offsets would be consistent, and we could pick up where we left off. However, since we commit after processing input and sending messages, the commit’s very likely to be stale.

(Diagram: a committed pair of input and output offsets that trails behind the actual tip of the output log.)

If we restarted processing from a stale commit, some data would be duplicated in the output. If our task has written n messages since the last commit, simply restarting from that commit would write those n messages a second time. But if our task is the only producer for that partition, it’s easy to calculate the value of n… and by starting from the commit, and then dropping the first n messages of output we create, we can get ourselves back to a consistent state.

(Diagram: on recovery, the task restores the stale commit, drops the output messages that were already written, and appends only the rest.)

This has almost exactly the same performance characteristics as the existing high-level consumer. It does one extra request at the beginning to get the upcoming offset in the output stream, but after that it’s actually a little bit faster – it gets to throw away the duplicate output messages instead of writing them to the network.
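Here’s a sketch of that recovery path, again assuming this task is the only producer for the output partition. A commit is a possibly-stale `(input_offset, output_offset)` pair, and n is the number of output messages written after it; the helper names are made up for illustration.

```python
def recover_and_run(input_log, output_log, commit, transform):
    in_offset, out_offset = commit        # possibly-stale committed pair
    # n: messages our task wrote after the commit was taken.
    to_drop = len(output_log) - out_offset
    for offset in range(in_offset, len(input_log)):
        for msg in transform(input_log[offset]):
            if to_drop > 0:
                to_drop -= 1              # duplicate: already in the output
            else:
                output_log.append(msg)
```

The dropped messages are exactly the ones the pre-crash task already wrote, so the output ends up the same as if no failure had happened.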

Checkpoints and Changelogs

So far we’ve only talked about processing that’s totally stateless – each message in the input stream gets handled separately. There’s another large class of jobs where we want to accumulate some state between messages: calculating statistics, running a state machine, or some other calculation that uses past history to affect the current processing.

Again, our last technique doesn’t quite work: we can recover our position in the input from the commit, but we can’t infer the current state without replaying all the messages. You might think that we could just add the state to the commit – and, actually, you’d be completely right:

(Diagram: a commit that stores the input offset together with the processing state – here, count: 6.)

When the commit is restored, we have our processing state at that moment, and we can continue as usual.
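As a sketch, with a running count as the state – the tuple-as-commit here is an illustrative stand-in for an atomic offset-plus-state write:

```python
def run_with_checkpoints(input_log, commit, commit_every=100):
    offset, count = commit    # offset and state restored together
    while offset < len(input_log):
        count += 1            # the 'processing': a running message count
        offset += 1
        if offset % commit_every == 0:
            commit = (offset, count)  # persisted atomically as one unit
    return offset, count
```

Because the offset and the state can never disagree, restoring from any commit – fresh or stale – converges on the same final answer.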

This attaches the full state to every commit, so it’s important that the state stays small: while writing out a handful of counts is no big deal, reserializing and resending a large data structure every few seconds could quickly grow to dominate the processing time. Many stream-processing tasks can have gigabytes worth of state; at that scale, putting the full state in each commit is totally impractical.

One way to deal with this is to introduce a ‘changelog’ – every operation on the state gets logged out, and we can recover the current state just by replaying the log. Many databases use a changelog to handle replication, and the Samza framework uses changelogs to recover from node failures. When the state is large, this is much cheaper than frequently reserializing the entire state. Of course, this isn’t free: recreating state by replaying a changelog is generally slower than restoring a checkpoint. You also need to be careful that all the operations in the changelog are deterministic; if not, the recovered state could diverge arbitrarily far from the original. (This is a general concern with changelogs; for example, MySQL’s statement replication diverges when using random numbers or the current time.)

Adding exactly-once semantics on top of a changelog turns out to be fairly straightforward: instead of attaching our state to the committed offsets, like the checkpoint style does, we attach the offsets to the changelog messages. Once we restore the changelog, not only do we have a consistent state, we also have the right offsets for that state – and we can continue processing without duplicating any messages or updating the state twice.
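A sketch of recovery from such a changelog, using a simple counter as the state; the entry shape `(op, input_offset, output_offset)` is an assumption of this example, not a Samza or Kafka format:

```python
def recover(changelog):
    # Each changelog entry carries the operation plus the offsets that
    # the resulting state corresponds to.
    state, in_offset, out_offset = 0, 0, 0
    for op, in_offset, out_offset in changelog:
        state += op   # operations must be deterministic
    return state, in_offset, out_offset
```

Replaying to the tip yields the state and the matching offsets in one pass, so processing can resume without double-counting.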

Log as Coordination

So far, we’ve only been processing a single input stream at a time. In the Real World, however, it’s common to want to read from many sources at once – perhaps you want to do a stream join, or just to aggregate statistics for a bunch of streams together. The simplest instance of this is a ‘merge’ operation, where we just interleave the messages together.

(Diagram: messages from input0 and input1 interleaved into a single output log.)

To keep things snappy, we’d like our task to process messages as soon as they arrive. This introduces an element of nondeterminism: arrival order depends on network latency and many other factors, so a number of different interleavings are possible. Our last strategy involved restoring to a stale checkpoint and replaying the input – but in a nondeterministic world, it’s no longer obvious what order to replay the input in.

To compensate for this, we’ll need to introduce an additional log. This ‘merge log’ doesn’t actually contain any of the input or output data – it just records the order in which the input arrives. (“One message from input-1”, “One message from input-2”, and so on.) When a new input arrives at our task, we write it to the merge log first; once Kafka confirms the write, we can publish the data to our output stream. When we commit the offsets, we record the current offset in the merge log as well.

Given this metadata, crash recovery is straightforward. When our task restores the commit after a failure, it returns to a consistent (but stale) state. In particular, since the commit is asynchronous, our offset in the merge log is likely to be behind the tip. At this point, we can step forward in the merge log, replaying the input in exactly the same order; once we’ve reached the tip of the merge log, recovery is complete, and we can resume normal processing from there. In effect, the merge log has imposed an order on our input logs: as far as our actual processing code is concerned, it’s no different than if all the input came from a single deterministically-ordered log.
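A sketch of the replay step, assuming the merge log is just a sequence of input names and `offsets` holds each input’s position as of the restored commit:

```python
def replay(inputs, merge_log, offsets):
    """Re-read the inputs in exactly the recorded arrival order."""
    out = []
    for source in merge_log:
        out.append(inputs[source][offsets[source]])
        offsets[source] += 1
    return out
```

After the replay, `offsets` points at the tip of every input, and normal (nondeterministic) processing can resume.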

In The Log, Jay Kreps refers to this sort of thing as “squeezing the non-determinism out of the input stream.” This technique is very powerful; it’s one reason so many databases and consensus algorithms are built on top of a log. In many of those systems, this log comes with a bottleneck; all operations are totally ordered in the same log, so there’s a lot of contention for a single resource. Here, however, a single log only coordinates between a single task and a few different inputs – so we can scale horizontally without needing any global coordination.

Repartitioning and Deduplication

So far, all the techniques we’ve discussed have dealt with a small number of partitions at a time. It’s normally straightforward to extend this to entire topics: for example, to merge two topics with the same number of partitions, we can just merge each pair of partitions together.

However, there’s one case where this ‘partitionwise’ approach doesn’t work. A lot of stream processing jobs are ‘locality sensitive’: if you have a topic full of page view events, and you want to count all the visits to each individual page, it helps to have all the visits for a page in the same partition. Kafka calls this semantic partitioning, and the producer API is explicitly designed to support it.

Of course, a single topic can only be partitioned in one way – but if there are multiple consumers interested in the data, they might need their input partitioned in different ways. (If we tried to count all the page views by source IP, for example, an input stream partitioned by page URL is not very useful.) As a result, it’s quite common to find jobs that take all the messages in some input topic and repartition them into some more convenient arrangement. There’s no way to split this work up by partition: data from each input partition may end up in any output partition, and each output partition may be written to concurrently by any number of tasks. Since each output partition no longer has a single writer, it’s not possible for a failed task to just check the offset to know if its writes were successful or not, so we can’t make the producer idempotent – if a task fails, some messages will be written to the output more than once.

Thankfully, all is not lost. While these messages may be duplicated in Kafka, it’s still possible to deduplicate them on the client. The simplest fix is to add a little metadata to each message: the message’s partition and offset in the source topic. The consumers will need to remember the largest offset they’ve seen for each source partition; if an incoming message’s offset is smaller, it’s a duplicate, and it can be safely discarded.
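The consumer-side check is small. In this sketch, each message is a `(source_partition, source_offset, payload)` tuple; the metadata fields are the assumption described above:

```python
def dedupe(messages):
    high_water = {}  # largest source offset seen, per source partition
    for partition, offset, payload in messages:
        if offset <= high_water.get(partition, -1):
            continue  # already seen: a duplicate from a failed task's retry
        high_water[partition] = offset
        yield payload
```

This relies on each source partition’s offsets arriving in order, which Kafka guarantees within a partition.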

This works, but the cost is high. In all the previous sections, the output topic has been ‘pristine’, with no duplicates or extra metadata in the log. This is a really useful property: any consumer that speaks the Kafka protocol can consume that topic without any special logic, and it works seamlessly with existing tools. To make our regrouping work, we had to leak some additional metadata into the log, and any consumers will need to clean this up – implementing the exact same deduplication logic and keeping extra state. This is the stream processing equivalent of “leaking implementation details into the interface,” and it’s just as painful.

If there’s only a small number of downstream consumers, we might be happy to live with this. Otherwise, it’s straightforward to insert a second ‘cleanup’ job after the first which just copies our regrouped topic into a new output topic. This cleanup job only needs to deal with a single input and output partition at a time, so it can use the ‘idempotent producer’-style techniques to take our messy regrouped topic and write out a ‘pristine’ version. This obviously has some significant overhead, but it still scales well, and it’s sometimes worth it to make life easier for downstream systems.

Going Forward

I’m really excited about the future of Kafka. I suspect that, in the next few years, we’ll come to see the Kafka-style log as being as fundamental as distributed filesystems or key-value stores – and it’s already become the best choice for a wide variety of tasks. There are some downsides to living on the cutting edge, though: you’re missing a lot of the libraries and tooling you expect from a mature technology, and its limits and possibilities aren’t as broadly known or well-explored. The community’s working hard and fast, but some major avenues still remain mostly unexplored.