Kafka is a powerful distributed publish/subscribe system that helps you build complex asynchronous applications. GumGum was an early adopter of this technology and now runs hundreds of brokers across multiple clusters.

Kafka monitoring is a must

Operating a Kafka cluster is a discipline of its own (scaling clusters in and out, recovering from a dead broker, reassigning partitions across the cluster…), but if you want to build performant client applications on top of Kafka, you need to pay close attention to your consumers:

Are they able to keep up with the incoming traffic (consumer lag)?

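Kafka consumer lag indicates how far a consumer is behind the producers. In other words, it measures how far behind your consumer is compared to the latest message produced to the topic it is reading from. It is typically computed per partition as the difference between the latest offset in the partition and the offset the consumer group has committed.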
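To make the definition concrete, here is a minimal sketch of the lag arithmetic. It assumes the latest offsets and the committed offsets have already been fetched from the cluster; the function name, the sample offsets, and the choice to count a partition with no committed offset as fully unread are illustrative assumptions, not a description of any particular tool.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: latest offset minus committed offset."""
    lag = {}
    for partition, end_offset in end_offsets.items():
        # Simplifying assumption: a partition without a committed offset
        # is counted as fully unread, starting from offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two snapshots were taken at
        # slightly different times.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical snapshot for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200, 2: 650}

per_partition = consumer_lag(end_offsets, committed)
total_lag = sum(per_partition.values())
print(per_partition, total_lag)
```

Summing the per-partition values gives a single number per consumer group and topic, which is usually the easiest shape to alert or scale on.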

Is the consumer group in a stable state (consumer rebalancing)?

Rebalance/Rebalancing: the procedure that is followed by a number of distributed processes that use Kafka clients and/or the Kafka coordinator to form a common group and distribute a set of resources among the members of the group (source: Incremental Cooperative Rebalancing: Support and Policies).

In a microservice world, whether you run on VMs, ECS, or Kubernetes, you may want to adjust the number of running instances, tasks, or pods based on consumer lag, which makes Kafka lag reporting a critical piece of your infrastructure.

On a cloud provider like AWS, most auto scaling actions are triggered by a CloudWatch alarm. This means that the computed lag for a consumer must be published to the CloudWatch API as a custom metric before it can be used to adjust the size of an Auto Scaling group or an ECS service.
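As a rough sketch of that publishing step, the snippet below builds a custom metric payload and hands it to a CloudWatch client through `put_metric_data`. The namespace `Custom/Kafka`, the metric name `ConsumerLag`, the dimension names, and both helper functions are assumptions chosen for illustration; the client is passed in as a parameter so the code does not depend on how credentials or regions are configured.

```python
from typing import Any, Dict, List


def build_lag_metric(group: str, topic: str,
                     total_lag: int) -> List[Dict[str, Any]]:
    """Build the MetricData payload for one consumer group and topic."""
    return [{
        "MetricName": "ConsumerLag",  # hypothetical metric name
        "Dimensions": [
            {"Name": "ConsumerGroup", "Value": group},
            {"Name": "Topic", "Value": topic},
        ],
        "Value": float(total_lag),
        "Unit": "Count",
    }]


def publish_lag(cloudwatch: Any, group: str, topic: str, total_lag: int,
                namespace: str = "Custom/Kafka") -> None:
    """Send the lag to CloudWatch as a custom metric.

    `cloudwatch` is expected to behave like a boto3 CloudWatch client,
    i.e. to expose put_metric_data(Namespace=..., MetricData=...).
    """
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=build_lag_metric(group, topic, total_lag),
    )
```

With boto3 installed, this could be called as `publish_lag(boto3.client("cloudwatch"), "my-group", "my-topic", 300)`, where the group and topic names are placeholders. A CloudWatch alarm on the resulting metric can then trigger the scaling policy.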