Let’s start by acknowledging that the term ‘fast’ is multi-faceted, complex, and highly ambiguous. Latency, throughput, jitter, are metrics that shape and influence one’s interpretation of the term. It is also inherently contextual: the industry and application domains in themselves set the norms and expectations around performance. Whether or not something is fast depends largely on one’s frame of reference.

Apache Kafka is optimized for throughput at the expense of latency and jitter, while preserving other desirable qualities, such as durability, strict record order, and at-least-once delivery semantics. When someone says ‘Kafka is fast’, and assuming they are at least mildly competent, you can assume they are referring to Kafka’s ability to safely accumulate and distribute a very high number of records in a short amount of time.

Historically, Kafka was born out of LinkedIn’s need to move a very large number of messages efficiently, amounting to multiple terabytes of data on an hourly basis. The individual message propagation delay was deemed of secondary importance, as was the variability of that time. After all, LinkedIn is not a financial institution that engages in high-frequency trading, nor is it an industrial control system that operates within deterministic deadlines. Kafka can be used to implement near-real-time (otherwise known as soft real-time) systems.

Note: For those unfamiliar with the term, ‘real-time’ does not mean ‘fast’, it means ‘predictable’. Specifically, real-time implies a hard upper bound, otherwise known as a deadline, on the time taken to complete an action. If the system as a whole is unable to meet this deadline each and every time, it cannot be classed as real-time. Systems that are able to perform within a probabilistic tolerance, are labelled as ‘near-real-time’. In terms of sheer throughput, real-time systems are often slower than their near-real-time or non-real-time counterparts.

There are two significant areas that Kafka draws upon for its speed, and they need to be discussed separately. The first relates to the low-level efficiency of the client and broker implementations. The second derives from the opportunistic parallelism of stream processing.

Broker performance

Log-structured persistence

Kafka utilizes a segmented, append-only log, largely limiting itself to sequential I/O for both reads and writes, which is fast across a wide variety of storage media. There is a wide misconception that disks are slow; however, the performance of storage media (particularly rotating media) is greatly dependent on access patterns. The performance of random I/O on a typical 7,200 RPM SATA disk is between three and four orders of magnitude slower when compared to sequential I/O. Furthermore, a modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. Because of this, the difference between sequential I/O and random I/O is still evident in flash and other forms of solid-state non-volatile media, although it is far less dramatic compared to rotating media.

Record batching

Sequential I/O is blazingly fast on most media types, comparable to the peak performance of network I/O. In practice, this means that a well-designed log-structured persistence layer will keep up with the network traffic. In fact, quite often the bottleneck with Kafka isn’t the disk, but the network. So in addition to the low-level batching provided by the OS, Kafka clients and brokers will accumulate multiple records in a batch — for both reading and writing — before sending them over the network. Batching of records amortizes the overhead of the network round-trip, using larger packets and improving bandwidth efficiency.

Batch compression

The impact of batching is particularly obvious when compression is enabled, as compression becomes generally more effective as the data size increases. Especially when using text-based formats such as JSON, the effects of compression can be quite pronounced, with compression ratios typically ranging from 5x to 7x. Furthermore, record batching is largely done as a client-side operation, which transfers the load onto the client and has a positive effect not only on the network bandwidth but also on the brokers’ disk I/O utilization.

Cheap consumers

Unlike traditional MQ-style brokers which remove messages at point of consumption (incurring the penalty of random I/O), Kafka doesn’t remove messages after they are consumed — instead, it independently tracks offsets at each consumer group level. The progression of offsets themselves is published on an internal Kafka topic __consumer_offsets . Again, being an append-only operation, this is fast. The contents of this topic are further reduced in the background (using Kafka’s compaction feature) to only retain the last known offsets for any given consumer group.

Compare this model with more traditional message brokers which typically offer several contrasting message distribution topologies. On one hand is the message queue — a durable transport for point-to-point messaging, with no point-to-multipoint ability. On the other hand, a pub-sub topic allows for point-to-multipoint messaging but does so at the expense of durability. Implementing a durable point-to-multipoint messaging model in a traditional MQ requires maintaining a dedicated message queue for each stateful consumer. This creates both read and write amplification. On one hand, the publisher is forced to write to multiple queues. Alternatively, a fan-out relay may consume records from one queue and write to several others, but this only defers the point of amplification. On the other hand, several consumers are generating load on the broker — being a mixture of read and write I/O, both sequential and random.

Consumers in Kafka are ‘cheap’, insofar as they don’t mutate the log files (only the producer or internal Kafka processes are permitted to do that). This means that a large number of consumers may concurrently read from the same topic without overwhelming the cluster. There is still some cost in adding a consumer, but it is mostly sequential reads with a low rate of sequential writes. So it’s fairly normal to see a single topic being shared across a diverse consumer ecosystem.

Unflushed buffered writes

Another fundamental reason for Kafka’s performance, and one that is worth exploring further: Kafka doesn’t actually call fsync when writing to the disk before acknowledging the write; the only requirement for an ACK is that the record has been written to the I/O buffer. This is a little known fact, but a crucial one: in fact, this is what actually makes Kafka perform as if it were an in-memory queue — because for all intents and purposes Kafka is a disk-backed in-memory queue (limited by the size of the buffer/pagecache).

On the flip side, this form of writing is unsafe, as the failure of a replica can lead to a data loss even though the record has seemingly been acknowledged. In other words, unlike say a relational database, acknowledging a write alone does not imply durability. What makes Kafka durable is running several in-sync replicas; even if one were to fail, the others (assuming there is more than one) will remain operational — providing that the failure is uncorrelated (i.e. multiple replicas failing simultaneously due of a common upstream failure). So the combination of a non-blocking approach to I/O with no fsync , and redundant in-sync replicas give Kafka the combination of high throughput, durability, and availability.

Client-side optimisations

Most databases, queues, and other forms of persistent middleware are designed around the notion of an all-mighty server (or a cluster of servers), and fairly thin clients that communicate with the server(s) over a well-known wire protocol. Client implementations are generally considered to be significantly simpler than the server. As a result, the server will absorb the bulk of the load — the clients merely act as interfaces between the application code and the server.

Kafka takes a different approach to client design. A significant amount of work is performed on the client before records get to the server. This includes the staging of records in an accumulator, hashing the record keys to arrive at the correct partition index, checksumming the records and the compression of the record batch. The client is aware of the cluster metadata and periodically refreshes this metadata to keep abreast of any changes to the broker topology. This lets the client make low-level forwarding decisions; rather than sending a record blindly to the cluster and relying on the latter to forward it to the appropriate broker node, a producer client will forward writes directly to partition masters. Similarly, consumer clients are able to make intelligent decisions when sourcing records, potentially using replicas that geographically closer to the client when issuing read queries. (This feature is a more recent addition to Kafka, available as of version 2.4.0.)

Zero-copy

One of the typical sources of inefficiencies is copying byte data between buffers. Kafka uses a binary message format that is shared by the producer, the broker, and the consumer parties so that data chunks can flow end-to-end without modification, even if it’s compressed. While eliminating structural differences between communicating parties is an important step, it doesn’t in itself avoid the copying of data.

Kafka solves this problem on Linux and UNIX systems by using Java’s NIO framework, specifically, the transferTo() method of a java.nio.channels.FileChannel . This method permits the transfer of bytes from a source channel to a sink channel without involving the application as a transfer intermediary. To appreciate the difference that NIO makes, consider the traditional approach where a source channel is read into a byte buffer, then written to a sink channel as two separate operations:

File.read(fileDesc, buf, len);

Socket.send(socket, buf, len);

Diagrammatically, this can be represented using the following.

Although this looks simple enough, internally, the copy operation requires four context switches between user mode and kernel mode, and the data is copied four times before the operation is complete. The diagram below outlines the context switches at each step.

Looking at this in more detail —

The initial read() causes a context switch from user mode to kernel mode. The file is read, and its contents are copied to a buffer in the kernel address space by the DMA (Direct Memory Access) engine. This is not the same buffer that was used in the code snippet. Prior to returning from read() , the kernel buffer is copied into the user-space buffer. At this point, our application can read the contents of the file. The subsequent send() will switch back into kernel mode, copying the user-space buffer into the kernel address space — this time into a different buffer associated with the destination socket. Behind the scenes, the DMA engine takes over, asynchronously copying the data from the kernel buffer to the protocol stack. The send() method does not wait for this prior to returning. The send() call returns, switching back to the user-space context.

In spite of its mode-switching inefficiencies and additional copying, the intermediate kernel buffer can actually improve performance in many cases. It can act as a read-ahead cache, asynchronously prefetching blocks, thereby front-running requests from the application. However, when the amount of requested data is significantly larger than the kernel buffer size, the kernel buffer becomes a performance bottleneck. Rather than copying the data directly, it forces the system to oscillate between user and kernel modes until all the data is transferred.

By contrast, the zero-copy approach is handled in a single operation. The snippet from the earlier example can be rewritten as a one-liner:

fileDesc.transferTo(offset, len, socket);

The zero-copy approach is illustrated below.

Under this model, the number of context switches is reduced to one. Specifically, the transferTo() method instructs the block device to read data into a read buffer by the DMA engine. This buffer is then copied another kernel buffer for staging to the socket. Finally, the socket buffer is copied to the NIC buffer by DMA.

As a result, we have reduced the number of copies from four to three, and only one of those copies involves the CPU. We have also reduced the number of context switches from four to two.

This is a massive improvement, but it’s not query zero-copy yet. The latter can be achieved as a further optimization when running Linux kernels 2.4 and later, and on network interface cards that support the gather operation. This is illustrated below.

Calling the transferTo() method causes the device to read data into a kernel read buffer by the DMA engine, as per the previous example. However, with the gather operation, there is no copying between the read buffer and the socket buffer. Instead, the NIC is given a pointer to the read buffer, along with the offset and the length, which is vacuumed up by DMA. At no point is the CPU involved in copying buffers.

Comparisons of traditional and zero-copy on file sizes ranging from a few megabytes to a gigabyte show performance gains by a factor of two to three in favor of zero-copy. But what’s more impressive, is that Kafka achieves this using a plain JVM with no native libraries or JNI code.

Avoiding the GC

The heavy use of channels, native buffers, and the page cache has one additional benefit — reducing the load on the garbage collector (GC). For example, running Kafka on a machine with 32 GB of RAM will result in 28–30 GB usable for the page cache, completely outside of the GC’s scope. The difference in throughput is minimal — in the region of several percentage points — as the throughput of a correctly-tuned GC can be quite high, especially when dealing with short-lived objects. The real gains are in the reduction of jitter; by avoiding the GC, the brokers are less likely to experience a pause that may impact the client, extending the end-to-end propagation delay of records.

To be fair, the avoidance of GC is less of a problem now, compared to what it used to be when Kafka was conceived. Modern GCs like Shenandoah and ZGC scale to huge, multi-terabyte heaps, and have tunable worst-case pause times, down to single-digit milliseconds. It is not uncommon these days to see JVM-based applications using large heap-based caches outperform off-heap designs.

Stream parallelism

The efficiency of log-structured I/O is one crucial aspect of performance, mostly affecting writes; Kafka’s treatment of parallelism in the topic structure and the consumer ecosystem is fundamental to its read performance. The combination produces an overall very high end-to-end messaging throughput. Concurrency is ingrained into its partitioning scheme and the operation of consumer groups, which is effectively a load-balancing mechanism within Kafka — distributing partition assignments approximately evenly among the individual consumer instances within the group. Compare this to a more traditional MQ: in an equivalent RabbitMQ setup, multiple concurrent consumers may read from a queue in a round-robin fashion, but in doing so they forfeit the notion of message ordering.

The partitioning mechanism also allows for the horizontal scalability of Kafka brokers. Every partition has a dedicated leader; any nontrivial topic (with multiple partitions) can, therefore, utilize the entire cluster of broker for writes. This is yet another point of distinction between Kafka and a message queue; where the latter utilizes clustering for availability, Kafka will genuinely balance the load across the brokers for availability, durability, and throughput.

The producer specifies the partition when publishing a record, assuming that you are publishing to a topic with multiple partitions. (One may have a single-partition topic, in which case this is a non-issue.) This may be accomplished either directly — by specifying a partition index, or indirectly — by way of a record key, which deterministically hashes to a consistent (i.e. same every time) partition index. Records sharing the same hash are guaranteed to occupy the same partition. Assuming a topic with multiple partitions, records with a different key will likely end up in different partitions. However, due to hash collisions, records with different hashes may also end up in the same partition. Such is the nature of hashing. If you understand how a hash table works, this is no different.

The actual processing of records is done by consumers, operating within an (optional) consumer group. Kafka guarantees that a partition may only be assigned to at most one consumer within its consumer group. (We say ‘at most’ to cover the case when all consumers are offline.) When the first consumer in a group subscribes to the topic, it will receive all partitions on that topic. When a second consumer subsequently joins, it will get approximately half of the partitions, relieving the first consumer of half of its prior load. This enables you to process an event stream in parallel, adding consumers as necessary (ideally, using an auto-scaling mechanism), providing that you have adequately partitioned your event stream.

Control of record throughput accomplished in two ways: