We use and love Kafka at Data Syndrome. It enables us to move processing from batch to realtime with minimal pain and complexity. However, during a recent project we learned a hard lesson about the kafka-python package that has me rethinking how to choose between open source tools. In this post we reflect on the open source decision-making process, describe two Kafka clients for Python, the issues we encountered, and the solution we’ll be using going forward.

kafka-python: the Wild West

kafka-python is the most popular Kafka client for Python. In the past we’ve used it without issue, and it appears in my book, Agile Data Science 2.0. On this project, however, a serious problem came up: when KafkaConsumer is used in the documented manner, iterating on the consumer to get messages from the queue, it frequently loses messages that did arrive in the topic. We verified this with an analysis using the console consumer.
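The gap analysis itself is straightforward: the offsets consumed from each partition should form a contiguous sequence, so any holes indicate dropped messages. Here is a minimal sketch in plain Python (the function name and sample data are hypothetical, not tied to any Kafka client):

```python
def find_offset_gaps(consumed):
    """Given (partition, offset) pairs that were actually consumed,
    return the offsets missing from each partition's sequence."""
    by_partition = {}
    for partition, offset in consumed:
        by_partition.setdefault(partition, set()).add(offset)

    gaps = {}
    for partition, offsets in by_partition.items():
        # Every offset between the min and max seen should be present
        full_range = set(range(min(offsets), max(offsets) + 1))
        missing = sorted(full_range - offsets)
        if missing:
            gaps[partition] = missing
    return gaps

# Partition 0 is missing offsets 2 and 3; partition 1 is complete
consumed = [(0, 0), (0, 1), (0, 4), (1, 0), (1, 1)]
print(find_offset_gaps(consumed))  # {0: [2, 3]}
```

Comparing the consumer’s offsets against what the console consumer sees on the same topic makes the loss easy to demonstrate.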

To give a little more detail, kafka-python and KafkaConsumer are used with an SSL secured Kafka service like Aiven Kafka, like this:

kafka_consumer = KafkaConsumer(
    topic,
    enable_auto_commit=True,
    group_id=group_id,
    bootstrap_servers=config.kafka.host,
    api_version=(0, 10),
    security_protocol='SSL',
    ssl_check_hostname=True,
    ssl_cafile=config.kafka.ca_pem,
    ssl_certfile=config.kafka.service_cert,
    ssl_keyfile=config.kafka.service_key
)

for message in kafka_consumer:
    application_message = json.loads(message.value.decode())
    ...

When used in this, the recommended manner, KafkaConsumer drops messages. There is a workaround that retains all messages, kindly given to us by the support team at Aiven, our Kafka as a Service provider. It looks like this:

while True:
    raw_messages = consumer.poll(timeout_ms=1000, max_records=5000)
    for topic_partition, messages in raw_messages.items():
        for message in messages:
            application_message = json.loads(message.value.decode())
            ...

While this workaround may work, the fact that the method in the README drops messages was more than a bit of a turn-off, so I looked for an alternative.

confluent-kafka: Corporate Support

I was pleasantly surprised to come across the confluent-kafka Python module. It is a thin wrapper around librdkafka, a Kafka library written in C that forms the basis for the Confluent Kafka clients for Go and .NET. More importantly, it is supported by Confluent. I love open source, but when the “informal community owns or supports this” model doesn’t turn out well, it is nice to have a corporate stamp on an alternative. We haven’t bought support yet, but with this package we know that someone stands behind the quality of the software, and having the option to buy commercial support is awesome.

Implementing confluent-kafka in place of kafka-python was easy. It uses a poll method, similar to the kafka-python workaround outlined above.

kafka_consumer = Consumer(
    {
        "api.version.request": True,
        "enable.auto.commit": True,
        "group.id": group_id,
        "bootstrap.servers": config.kafka.host,
        "security.protocol": "ssl",
        "ssl.ca.location": config.kafka.ca_pem,
        "ssl.certificate.location": config.kafka.service_cert,
        "ssl.key.location": config.kafka.service_key,
        "default.topic.config": {"auto.offset.reset": "smallest"}
    }
)

kafka_consumer.subscribe([topic])

# Now loop on the consumer to read messages
running = True
while running:
    message = kafka_consumer.poll()
    if message is None or message.error():
        continue
    application_message = json.loads(message.value().decode())
    ...

kafka_consumer.close()

And now we receive all our messages. This is not to say that kafka-python is a bad tool, and I’m sure the community will react to the problem and resolve the issue. I’ll be sticking with confluent-kafka from now on.
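The poll-and-decode pattern above generalizes to a small helper that can be exercised without a broker. The sketch below uses stub classes standing in for confluent-kafka’s Consumer and Message (the stub names and the early-exit-on-empty behavior are illustrative assumptions; a real consumer would keep polling):

```python
import json

class StubMessage:
    """Stand-in for confluent-kafka's Message, for illustration only."""
    def __init__(self, value):
        self._value = value
    def value(self):
        # confluent-kafka exposes the payload via a value() method
        return self._value
    def error(self):
        return None

class StubConsumer:
    """Stand-in for confluent_kafka.Consumer, backed by a plain list."""
    def __init__(self, payloads):
        self._queue = list(payloads)
    def poll(self, timeout=1.0):
        # The real poll() returns None on timeout; emulate with an empty queue
        return self._queue.pop(0) if self._queue else None
    def close(self):
        pass

def consume_all(consumer):
    """Drain the consumer using the poll-and-decode pattern shown above."""
    results = []
    while True:
        message = consumer.poll()
        if message is None:
            break  # queue drained; a production loop would poll again
        if message.error():
            continue
        results.append(json.loads(message.value().decode()))
    consumer.close()
    return results

messages = [StubMessage(json.dumps({"n": i}).encode()) for i in range(3)]
print(consume_all(StubConsumer(messages)))  # [{'n': 0}, {'n': 1}, {'n': 2}]
```

Keeping the decode step in one place like this also makes it easy to swap in proper error handling later.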

Open Source Control

Open source is powerful, but when it comes to the complexity of “big data” and NoSQL tools, it often helps to have a sizable company behind a tool, driving its development. This way you know that if it works for them, it probably has basic functionality nailed down pretty well. This could be informal, as when a company releases a project as FOSS, or formal, as when a company offers commercial support for a tool. Of course, the flip side of this is that when a company as opposed to the open source community is behind a tool, you lose control. Your voice might mean little, unless you’re a paying customer.

The ideal situation is open source governance, as with the Apache Foundation, with commercial support options available in addition. That simply doesn’t happen for most of the free software on the internet. Limiting yourself to tools with a corporate stamp of approval would be very restrictive. It may be the right choice for some shops, but not ours. I like to test tools out, and if they are small and do one thing simply, I make use of them.

Trust in Open Source

For larger tools the process of evaluation is more complex. I will look at the number of issues and contributors, and the date of the last commit. I might ask my network about a tool, sometimes on Twitter. When you pick a project from GitHub after doing a sniff check, you are trusting the community to produce good tools. This works in general, for most tools.

Trusting the community, however, can be problematic. For a particular tool, you might have no good reason to trust that particular community to produce good software. Communities vary in their goals, their experience, and how much time they dedicate to open source projects. It is important to be judicious in selecting tools, and not to let your ideals cloud your judgement.