A while ago we published benchmarks and sizing details from our experience of running Apache Kafka over a service mesh with the Banzai Cloud Kafka and Istio operators, orchestrated by our automated and operationalized service mesh, Backyards.

There were many reasons for such a setup (the Running Apache Kafka over Istio - benchmark post has the details), but let me recap some of our initial reasons and how we have evolved since then.

Running Kafka over Istio does not add performance overhead (quite the opposite in the case of mTLS)

Out of the box support for multiple network topologies

Resilience to network failures

Observability, plus metrics-based alerts and decisions

While these were already good enough reasons, things have changed quite fast since we published the benchmarks. The Envoy community has merged the Kafka protocol 2.0 codec, so instead of treating Kafka traffic as opaque TCP, Envoy can now understand Kafka semantics at the protocol level. While this PR was essential, some other important pieces of the puzzle were still missing, such as Envoy's Kafka protocol filter.

The Envoy community and adamkotwasinski have been working on the Kafka protocol filter for Envoy

The filter is almost ready (in Adam's fork) and you can now take it for a test ride

We built a custom Envoy version with the filter included

We automated the Kafka setup on Istio, including the custom Envoy version

Would you like to run Kafka over Istio the easy way? Try Supertubes. It is as simple as: curl https://getsupertubes.sh | sh && supertubes install -a

Kafka protocol support in Envoy 🔗︎

Envoy is a next-generation network proxy built for the cloud native era. It supports a wide variety of application protocols (ZooKeeper, MongoDB, etc.) and recently added Kafka support. The benefits of a network proxy that understands higher-level protocols are huge. In the case of Kafka, the list of benefits includes:

Out of the box tracing and monitoring within a Kafka mesh

Consumer group metrics within a Kafka mesh

Information about apps and the versions of their client libraries

Request validation

Protocol version translations

Automatic topic name conversions without having to modify the clients

Mirroring topics to other clusters (we run many hybrid Kubernetes clusters)

Functional parity across runtimes

Now let’s dig into some of the above.

Metrics and monitoring 🔗︎

The Banzai Cloud Kafka operator has always provided server-side metrics. But running in a Backyards-managed Istio service mesh also adds metrics from the Envoy sidecars, which opens up a totally new perspective. Without having to modify Kafka clients, we now have insight into the clients and how they behave. For example, it's easy to query which client is writing to a topic and what the byte rate per client is.
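To sketch what this enables, the snippet below builds a per-client byte-rate PromQL query and parses a Prometheus instant-query response. The metric and label names (`envoy_kafka_request_bytes`, `client_id`) are illustrative assumptions; the real stat names depend on the filter's configuration and your Prometheus relabeling.

```python
# Hypothetical metric and label names for illustration only; the actual
# stat names depend on the Envoy Kafka filter's stat_prefix and on how
# Prometheus scrapes the sidecars.
def byte_rate_query(topic: str) -> str:
    """PromQL: bytes/sec written to `topic`, broken down by client."""
    return (f'sum by (client_id) '
            f'(rate(envoy_kafka_request_bytes{{topic="{topic}"}}[5m]))')

def parse_rates(response: dict) -> dict:
    """Turn a Prometheus instant-query response into {client_id: bytes/sec}."""
    return {r["metric"]["client_id"]: float(r["value"][1])
            for r in response["data"]["result"]}
```

You would send the query string to Prometheus' `/api/v1/query` endpoint and feed the decoded JSON body to `parse_rates`.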

Functional parity across runtimes 🔗︎

In Kafka, the client SDK is often responsible for too many things. The historical decision behind this was to keep the brokers as lightweight and simple as possible. Kafka was initially written in Scala, but after the later shift to Java, the full-featured client SDKs are now the Java ones; non-JVM clients are missing quite a few features. With the help of Envoy, this could change in the future, because some of the client responsibilities could be shifted into the sidecar proxy. This would bring the same functionality to all clients, no matter what language they're written in.

Request validation 🔗︎

As Kafka is content agnostic, misbehaving clients can write nearly anything to the brokers. The Envoy proxy can now validate requests at the protocol level and check whether they contain all the required information (and nothing extraneous) before forwarding them to the brokers.
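To make the idea concrete, here is a toy sketch of the kind of check a protocol-aware proxy can perform on a decoded request. The field names and the check itself are hypothetical and do not reflect Envoy's actual Kafka filter internals.

```python
# Toy illustration of protocol-level request validation; the field names
# are hypothetical, not Kafka's real wire format or Envoy's implementation.
REQUIRED_PRODUCE_FIELDS = {"topic", "partition", "records"}

def validate_produce(request: dict) -> None:
    """Reject a decoded produce request that is missing required fields."""
    missing = REQUIRED_PRODUCE_FIELDS - request.keys()
    if missing:
        raise ValueError(f"produce request missing fields: {sorted(missing)}")
```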

Rewrapping old Kafka protocols 🔗︎

The Kafka client SDK is a sensitive component. We’ve seen clusters that could not be upgraded in time, because clients were using older protocol versions. The Envoy filter can unwrap messages of older versions, and translate them to the latest and greatest version at the protocol level.

Envoy protocol filter for Kafka in action 🔗︎

This is all nice and handy, but there's still a missing piece: the Envoy protocol filter for Kafka. As mentioned earlier, the Envoy community and Adam Kotwasinski are working hard to finish it. We took Adam's branch, built a custom Envoy version with the Kafka filter included, and automated a Kafka cluster setup on Istio, orchestrated by Backyards. Under the hood, the major components are:

A custom Envoy build, available in this Docker Hub repo

The Banzai Cloud Istio operator

The Banzai Cloud Kafka operator

Observability tools such as Prometheus, Jaeger and Grafana, installed by Backyards

The Backyards CLI

Install a Kafka cluster on Istio 🔗︎

The first prerequisite is to have a Kubernetes cluster.

You can create a Kubernetes cluster on five different cloud providers, or on-premise, via the free developer version of the Pipeline platform. Alternatively, you can bring your own cluster.

If you have a cluster, you can grab this experimental build of the Backyards CLI.

This is an experimental feature, so make sure you download the appropriate release.

Set the KUBECONFIG environment variable to point to your Kubernetes cluster, and run the following two commands. They will install all the components necessary to try out the Envoy Kafka protocol filter.

backyards istio install --set spec.proxy.image=banzaicloud/proxyv2:devfilter

backyards install --with-kafka-cluster

Backyards will install and configure an Istio service mesh and a Kafka cluster using the Banzai Cloud operators (Kafka and Istio). It will also configure the Envoy Kafka protocol filter through a custom resource called EnvoyFilter.
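For a rough idea of what that custom resource looks like, here is a minimal illustrative EnvoyFilter sketch that inserts a Kafka filter ahead of the TCP proxy on the broker port. The selector labels, port number, and filter names are assumptions for illustration, not the exact resource Backyards generates.

```yaml
# Illustrative sketch only; not the exact resource Backyards generates.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: kafka-protocol-filter
spec:
  workloadSelector:
    labels:
      app: kafka             # assumed broker pod label
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        listener:
          portNumber: 9092   # assumed Kafka broker port
          filterChain:
            filter:
              name: envoy.filters.network.tcp_proxy
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.network.kafka_broker
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.kafka_broker.v3.KafkaBroker
            stat_prefix: kafka
```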

If you are more of a visual type, the following diagram represents the architecture:

To see some metrics, you will need some load on your Kafka cluster. You can use your own tooling for that, or you can issue the following command, which starts a small performance tool that sends load to Kafka:

backyards kafka load

Then you can open the Grafana dashboard for the Kafka cluster:

backyards kafka dashboard

Kafka protocol filter metrics 🔗︎

The sample dashboards show information about various Kafka protocol messages. The early version of the filter already produces some of the most important metrics, like the average latency of responses, the number of failed responses, or the number of topics.

These metrics can help you keep the cluster healthy. You can set up alerts based on them that are triggered when something starts to misbehave. For example, the Produce Buffer metric can tell you when the cluster is nearing its limits and an intervention is needed.
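As an illustration, an alert on such a metric could look like the Prometheus rule below. The metric names and the threshold are placeholder assumptions, not the filter's actual stat names.

```yaml
# Illustrative Prometheus alerting rule; metric names and the 80% threshold
# are placeholder assumptions, not the filter's actual stat names.
groups:
  - name: kafka-protocol-filter
    rules:
      - alert: KafkaProduceBufferNearLimit
        expr: envoy_kafka_produce_buffer_bytes > 0.8 * envoy_kafka_produce_buffer_limit_bytes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka produce buffer is nearing its limit"
```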

You can also use these metrics to build custom logic that helps you manage the cluster. For example, you can leverage the Produce requests metric when setting up autoscaling for the Kafka cluster: passing a certain threshold of average response time could trigger an automatic upscale.

Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere.

#multicloud #hybridcloud #BanzaiCloud