Have you ever thought about push versus pull designs for a system, and which problems each one suits? And a related question: why did Kafka choose a pull design over a push design for its consumers?

Before looking at Kafka’s choice, that is, whether the broker should push data to the consumer or the consumer should pull it from the broker, let’s first understand both approaches, as each has its own pros and cons.

Note: Kafka was not designed with only a single consumer or downstream app per datastore (topic) in mind; many consumers are expected to subscribe to the same topic.

Push Approach:

Picture a datastore (topic) that is continuously receiving events (a producer is publishing events into the topic) at an arbitrary rate. In the push approach, the broker pushes these events down to the consumer. The responsibility, and the load, of making sure the data reaches the consumers or downstream applications therefore sits with the broker, not the consumer.

Also, the downstream application (consumer) doesn’t have to worry about latency or delay in receiving events from the broker, since the broker alone controls and monitors delivery.

But what if the app/consumer is down? The broker will have to keep retrying the push. And if the app cannot process data as fast as the broker sends it (it falls behind), it will be overwhelmed.
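To make the "overwhelmed" case concrete, here is a minimal simulation, not Kafka code. The class SlowConsumer, the function simulate_push, and all the numbers are made up for illustration; the only assumption is that a push-style consumer has a bounded buffer and a fixed processing rate.

```python
from collections import deque


class SlowConsumer:
    """Hypothetical push-style consumer with a bounded in-memory buffer."""

    def __init__(self, capacity, per_tick):
        self.buffer = deque()
        self.capacity = capacity   # max messages it can hold at once
        self.per_tick = per_tick   # messages it can process per tick
        self.processed = []
        self.rejected = 0

    def on_push(self, message):
        # The broker calls this; the consumer has no say in the rate.
        if len(self.buffer) >= self.capacity:
            self.rejected += 1     # the broker would have to retry this one
            return False
        self.buffer.append(message)
        return True

    def work(self):
        # Process at most per_tick buffered messages.
        for _ in range(min(self.per_tick, len(self.buffer))):
            self.processed.append(self.buffer.popleft())


def simulate_push(ticks, produced_per_tick, consumer):
    """Push produced_per_tick messages per tick, then let the consumer work."""
    next_id = 0
    for _ in range(ticks):
        for _ in range(produced_per_tick):
            consumer.on_push(next_id)
            next_id += 1
        consumer.work()
    return next_id  # total number of messages the broker tried to push


consumer = SlowConsumer(capacity=5, per_tick=2)
total = simulate_push(ticks=10, produced_per_tick=4, consumer=consumer)
print("pushed:", total)
print("processed:", len(consumer.processed))
print("rejected:", consumer.rejected)
```

Because the broker pushes four messages per tick while the consumer only handles two, the buffer fills within a couple of ticks and every further push beyond capacity is rejected. In a real push system each of those rejections becomes retry work for the broker.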

Now consider multiple consumers (a group) or downstream apps that all need the events. Pushing data to each of them, each with its own processing rate, becomes difficult and adds considerable work for the broker.

Pull Approach:

This is the opposite of what we just discussed. The consumer or subscriber app pulls, that is, requests from the broker/server all available messages after its current position in the log (or up to some configurable max size). If it falls behind the broker, or if the app (consumer) goes down, it catches up later. The broker no longer needs to be concerned with these consumer issues; they are handled by the downstream app itself.
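The idea of "pull everything after my current position, up to a max size" can be sketched with a plain in-memory log. This is only an illustration: Log, PullConsumer, fetch, and max_records are hypothetical names, not the Kafka client API, and the limit here counts records rather than bytes.

```python
class Log:
    """Hypothetical append-only log standing in for a topic partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def fetch(self, offset, max_records):
        # Return at most max_records records starting at offset.
        return self._records[offset:offset + max_records]


class PullConsumer:
    """Consumer that tracks its own position and pulls when it is ready."""

    def __init__(self, log, max_records):
        self.log = log
        self.max_records = max_records
        self.position = 0   # next offset to read
        self.seen = []

    def poll(self):
        batch = self.log.fetch(self.position, self.max_records)
        self.seen.extend(batch)
        self.position += len(batch)
        return batch


log = Log()
for i in range(10):
    log.append(f"event-{i}")

consumer = PullConsumer(log, max_records=4)
first = consumer.poll()          # reads the first 4 records

# The consumer "goes down" here while the producer keeps writing.
for i in range(10, 15):
    log.append(f"event-{i}")

# The consumer comes back and catches up from where it left off.
while consumer.poll():
    pass

print(consumer.position, len(consumer.seen))
```

The log never tracks whether the consumer is alive or how fast it is. The consumer's position is the only state needed to resume, which is why falling behind is the consumer's problem rather than the broker's.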

With multiple consumers or downstream apps, each one pulls data at its own pace, without affecting the performance of the other apps or of the broker.
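A small sketch of that independence, assuming each app simply keeps its own position into a shared log. The app names, batch sizes, and the helper functions fetch and poll are invented for the example.

```python
# Shared log; consumers only read from it.
log = [f"event-{i}" for i in range(12)]


def fetch(offset, max_records):
    """Return at most max_records records starting at offset."""
    return log[offset:offset + max_records]


# Each app keeps its own position and its own batch size (hypothetical values).
positions = {"fast-app": 0, "slow-app": 0}
batch_size = {"fast-app": 6, "slow-app": 2}


def poll(name):
    batch = fetch(positions[name], batch_size[name])
    positions[name] += len(batch)
    return batch


# Two polling rounds: every app pulls once per round.
for _ in range(2):
    for name in positions:
        poll(name)

# Lag = how many records each app still has to read.
lag = {name: len(log) - pos for name, pos in positions.items()}
print(positions)
print(lag)
```

The slow app's lag grows, but nothing about its state slows down the fast app or requires extra bookkeeping on the serving side beyond answering fetch requests.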

The main drawback of the pull-based approach shows up when the broker has no more data: a consumer that keeps requesting new events may end up polling in a tight loop, effectively busy-waiting for data to arrive.

To avoid this, the consumer’s pull request can be configured to block in a long poll until data arrives, and optionally until a given number of bytes is available, to ensure large transfer sizes.
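In the Kafka consumer this behaviour is tuned through the fetch.min.bytes and fetch.max.wait.ms settings. The sketch below is not Kafka code; it only imitates the long-poll idea with a threading.Condition. LongPollLog, min_records, and max_wait are hypothetical names, and the threshold counts records instead of bytes to keep the example short.

```python
import threading
import time


class LongPollLog:
    """Hypothetical log whose fetch blocks instead of returning empty at once."""

    def __init__(self):
        self._records = []
        self._cond = threading.Condition()

    def append(self, record):
        with self._cond:
            self._records.append(record)
            self._cond.notify_all()   # wake up any waiting fetch

    def fetch(self, offset, min_records=1, max_wait=1.0):
        """Block until min_records exist past offset, or max_wait seconds pass."""
        with self._cond:
            self._cond.wait_for(
                lambda: len(self._records) - offset >= min_records,
                timeout=max_wait,
            )
            # Return whatever is available, possibly nothing on timeout.
            return self._records[offset:]


log = LongPollLog()


def producer():
    time.sleep(0.1)                   # data arrives a little later
    for i in range(3):
        log.append(f"event-{i}")


t = threading.Thread(target=producer)
t.start()

# One blocking request instead of many empty polls in a tight loop.
batch = log.fetch(offset=0, min_records=3, max_wait=2.0)
t.join()

# Nothing new is coming, so this call returns empty after the timeout.
empty = log.fetch(offset=3, min_records=1, max_wait=0.05)
print(batch, empty)
```

The consumer makes a single request that sleeps until enough data exists, so it neither spins on empty responses nor receives many tiny batches. The timeout bounds how long it waits when the log is idle.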

Hopefully you now understand the reasons for choosing the pull-based approach over push.

Reference: https://kafka.apache.org/documentation/#design_pull