Thomas Schreiter (now a Data Engineer at Microsoft/Yammer) discusses his project comparing two ingestion technologies: the open-source Apache Kafka and AWS Kinesis.

Selecting an appropriate tool for the task at hand is a recurring theme in an engineer’s work. When multiple competing tools are available for the same task, picking the right one is non-trivial. In a small application, any tool might do the trick, and spending a lot of effort selecting the “right” one would be a waste of time. But for a reasonably large application, selecting the right tool can save days, weeks, or even months down the road thanks to higher data throughput, lower latency, and spared maintenance headaches.

Insight provided me with a great opportunity to compare multiple tools. For my project, I decided to take a close look at ingestion technologies, which are responsible for storing external raw data and making it available for batch or stream processing. Ingestion is commonly the first stage of a data pipeline. The first contestant was Kafka, which is open source under the Apache Software Foundation, very popular, and widely used in industry. The second contestant was Kinesis, which is proprietary to Amazon Web Services and fairly new to the game, having been released in 2013.

The contestants were evaluated on throughput, which, next to latency, is one of the most important metrics in practical applications. To set up a fair experiment, a producer script generated messages in the form of plain strings of a few dozen bytes each and sent them as fast as possible to either the Kafka cluster or the Kinesis stream, which stored the messages on their respective disks and waited for the next message. This test environment could theoretically run forever. To measure throughput, a logger attached to the producer kept a count of the messages sent. Every few seconds the throughput was calculated and stored in a MySQL database. I also built a front-end to visualize the results in a web browser. (Since the portion of the Insight Data Engineering Program in which we presented our projects at various companies is long over, the website no longer exists.)
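The measurement side of that setup can be sketched roughly like this. A minimal logger counts messages as the producer sends them and computes messages per second once each window closes; in the original setup the samples went to MySQL, so a plain list stands in for the database here, and all names are illustrative rather than taken from the project's code:

```python
import time


class ThroughputLogger:
    """Counts messages sent by the producer and reports messages/second
    for each logging window. A list stands in for the MySQL table used
    in the original project; names here are illustrative assumptions."""

    def __init__(self, window_seconds=5.0, start=None):
        self.window = window_seconds
        self.count = 0
        # Injectable start time keeps the logger testable without a real clock.
        self.window_start = time.monotonic() if start is None else start
        self.samples = []  # (timestamp, msgs_per_sec) rows

    def record(self, n=1, now=None):
        """Call once per send (or once per bulk of n messages)."""
        self.count += n
        now = time.monotonic() if now is None else now
        elapsed = now - self.window_start
        if elapsed >= self.window:
            # Close the window: store the rate and reset the counter.
            self.samples.append((now, self.count / elapsed))
            self.count = 0
            self.window_start = now
```

In the real producer loop, `record()` would be called right after each send to Kafka or Kinesis, so the logging cost stays negligible compared to the network round trip.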

So who won? The short answer is that Kafka consistently achieved a higher throughput than Kinesis. Kafka reached a throughput of 30k messages per second, whereas the throughput of Kinesis was substantially lower, but still solidly in the thousands. While Kinesis’ throughput improved when the producers were parallelized (multiple producer scripts running in parallel on one machine), it maxed out at about 20k msg/sec. I suppose Kafka achieved the higher throughput because it has been open source for much longer and has a strong community. As Kinesis is improved over time, I am sure its throughput will increase as it evolves.

Besides varying the number of parallel producers, another variable has a strong influence on throughput: both Kafka and Kinesis support bulking multiple messages and sending them together in one go. The messages are still treated as individual messages and their order is preserved, but sending them together greatly reduces the communication overhead. The graph below shows the vast impact of the bulk size on throughput. Whereas Kafka flattens out at around 200 bulked messages, the curve for Kinesis seems to keep rising even at 500 messages. Unfortunately, Kinesis is currently capped at that bulk size, so if that cap were lifted, higher throughputs would be likely. (For completeness, the numbers in the diagram above come from a setup where 500 messages were sent in bulk.)
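The bulking itself is just order-preserving grouping before the send call. A minimal sketch of such a helper (the function name and shape are my own, not either client library's API):

```python
def batch(messages, bulk_size):
    """Group messages into lists of at most bulk_size, preserving order.
    Each yielded list would become one bulk send; note that a single
    Kinesis put request is capped at 500 records."""
    bulk = []
    for msg in messages:
        bulk.append(msg)
        if len(bulk) == bulk_size:
            yield bulk  # a full bulk is ready to be sent in one request
            bulk = []
    if bulk:  # flush the final partial bulk
        yield bulk
```

Sending 500 messages this way costs one request instead of 500, which is exactly where the throughput gain in the graph comes from.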

As said before, throughput is one of the most important metrics when evaluating technologies in a Big Data pipeline. During my project, I also had the chance to look at other metrics, although with less scrutiny, so the results in the following paragraphs should be considered anecdotal rather than “hard” evidence. First, the results above focused on the producer side, i.e. sending messages to the ingestion system. Retrieving messages from the ingestion system, i.e. consuming messages, turned out to be a lot faster in Kafka’s case. For Kinesis, it depends on how many messages are bulked on the consumer side: if only a single message is read at a time, consumption tends to be very slow.
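Why single-message reads are so slow follows from a back-of-the-envelope cost model: each read request pays a fixed round-trip cost regardless of how many messages it returns. The overhead figures below are illustrative assumptions, not measurements from the project:

```python
import math


def consume_time(n_messages, bulk_size, per_request_s=0.02, per_message_s=0.0001):
    """Model total consume time as a fixed round-trip cost per request
    plus a small per-message processing cost. The two overhead values
    are assumed for illustration, not measured."""
    requests = math.ceil(n_messages / bulk_size)
    return requests * per_request_s + n_messages * per_message_s
```

Under these assumed numbers, consuming 10,000 messages one at a time costs about 201 seconds, while reading them 500 at a time costs about 1.4 seconds: the per-request overhead dominates completely at bulk size 1.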

Next is latency, another important metric, defined as the duration between sending a message to the ingestion system and consuming it on the other end. Kafka turns out to be very fast in terms of latency. While I didn’t measure the time with a stopwatch, just watching the producer in one terminal and the consumer in another showed that messages were piped through seemingly instantly. For Kinesis, the latency is visibly higher. That is because the consumer has to poll the Kinesis stream to check whether a new message is available, and the interval between consecutive pulls is forced to be at least about a tenth of a second. Since a message arrives, on average, halfway through a polling interval, this mechanism introduces a lower bound on the average latency of about a twentieth of a second. (If there were a way to circumvent this pull mechanism, i.e. if the Kinesis stream pushed a message to the consumer as soon as it received it, the latency would drop significantly. However, I am not aware that this functionality exists.)
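That half-interval lower bound is easy to verify with a small simulation of a fixed polling schedule; the function below is a sketch of the arithmetic, not of any Kinesis client code:

```python
import math


def average_poll_latency(arrival_times, poll_interval_s=0.1):
    """A message arriving at time t is picked up at the next poll tick,
    i.e. at ceil(t / interval) * interval. Averaged over uniformly
    spread arrivals, the waiting time approaches interval / 2."""
    waits = []
    for t in arrival_times:
        next_poll = math.ceil(t / poll_interval_s) * poll_interval_s
        waits.append(next_poll - t)
    return sum(waits) / len(waits)
```

For arrivals spread evenly across a tenth-of-a-second poll interval, the average wait comes out at about 0.05 seconds, matching the twentieth-of-a-second lower bound above.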

Finally, a somewhat “softer” but nonetheless relevant metric is maintainability, i.e. ease of installation and the effort needed to keep the system up and running. After a few hiccups, Kinesis was fairly easy to get running thanks to Python’s boto package. Once installed, Kinesis kept happily running and was stable. Kafka, on the other hand, caused some trouble. I could not identify the underlying cause, but Kafka broke down multiple times during my project. This led to a couple of long evenings, although luckily most issues could be fixed within hours. Still, it is quite surprising when a system crashes seemingly at random. (I want to point out that fellow Insight Fellows who used Kafka in their projects did not have these problems, so there is a good chance this is not typical for Kafka but rather the result of human error on my part.)

Now, I don’t consider this project a full-blown benchmarking study, since there are many more important variables to consider, e.g. the size of the messages, the specs of the nodes, the location of the data center, or parallelizing the producers across multiple nodes, to name just a few. However, simply taking the technologies out of the box, playing around with them, and stress-testing them in a simple yet reasonable test environment provides valuable insight into which tool to use for a given application. Given the results, in a production environment I would choose Kafka due to its high throughput. If, on the other hand, the underlying task is exploratory, then its easy setup makes Kinesis my first choice.

For more information about the implementation, the code for this project is available at: github.com/thomas-schreiter/ingestion_benchmark.