Introduction

Apache Kafka has evolved as the defacto standard for building reliable event-based systems with ultra-high volumes. The unique, yet simple architecture has made Kafka an easy to use component which integrates well with existing enterprise architectures. At a very high level, it is a messaging platform which decouples the message producers from the message consumers while providing the reliability of message delivery to consumers at scale. Kafka has message producers which send messages (events) to kafka which kafka stores in an entity called a topic which will deliver the messages to one or more consumers in a reliable manner. There can be different types of producers and consumers depending on the use case.

Architecture

Kafka architecture is a simple yet powerful architecture which blends well within most of the existing architectures.

Kafka can run on a cluster of nodes spanning across multiple machines, multiple data centers, multiple regions. Kafka has the capability to expand across geographically distributed resources. Once Kafka cluster is set up, external applications and systems can interact with the cluster through 4 standard APIs.

Publisher API — This is used by event sources which generate events to send those events in large magnitudes to the kafka cluster. Kafka cluster stores these events in topics which are partitioned so that these topics can scale beyond a single server. These partitions are replicated across multiple nodes in case if one node goes down, the data is not going to be lost. These events are stored for a configurable duration regardless of whether they are consumed or not. During the time of publishing the events, publishers can specify the partition in which the event is going to be stored in a given topic.

Consumer API — The events stored in the topics are delivered to consumers who are subscribed to these topics. Unlike in standard pub-sub based topics, the consumer has total control over reading the messages from the topics. The consumer can control the offset value which tells the position of the message within a given partition so that consumers can read messages from any location within a given topic and a partition. Additionally, kafka comes with the concept of a consumer group through which, the consumers can balance load across multiple competing consumers similar to a queue based subscription. There can be multiple consumer groups subscribed to a given topic and each consumer group will get one copy of the message at a given time. A given consumer within a consumer group is bounded to a given partition within a topic. Hence there cannot be more consumers within a group than the number of partitions of a topic.

Connector API — Sometimes kafka needs to be integrated with existing applications and systems. As an example, if we want to read a set of log files through kafka to process the data within those files using a stream processor, kafka provides the connector API to implement reusable components called connectors. The same connector can be used for any file-based integration with kafka.

Streaming API — The events stored in kafka sometimes needs to be processed in real-time for various purposes like transformation, correlation, aggregation, etc. These sort of functionalities can be applied through a stream processor which acts on a continuous stream of records. Kafka streaming API exposes a topic as a stream of events so that any other stream processing engine can consume that and do the processing.

Advantages

Kafka has aggregated several advantages of existing messaging systems into one single solution through its unique design. Here are some of the advantages of Kafka

Ability to build cloud scale message processing platforms with fault tolerance

Message persistence and durability with a configurable retention period with control at the consumer end allows consumers the freedom to process messages at their will

Order of the messages are guaranteed within a partition which is critical in some use cases

Replication allows messages to be stored reliably so that failure of n-1 nodes within an n node cluster with replication count of n does not affect the message delivery to the consumers

Scalability of the cluster allows to deploy the system across multiple data centers for highest levels of availability and better performance

Streams API allows the events to be processed in real-time with stream processing engines

Use cases