Powering the Heroku Platform API: A Distributed Systems Approach Using Streams and Apache Kafka


We recently launched Apache Kafka on Heroku into beta. Just like we do with Heroku Postgres, our internal engineering teams have been using our Kafka service to power a number of our internal systems.

The Heroku platform comprises a large number of independent services. Traditionally we’ve used HTTP calls to communicate between these services. While this approach is simple to implement and easy to reason about, it has a number of drawbacks. Synchronous calls mean that the top-level request time will be gated by the slowest backend component. Also, internal API calls create tight point-to-point couplings between services that can become very brittle over time.

Asynchronous messaging has long existed as an alternative architecture for communicating between services. Instead of making RPC-style calls between systems, we introduce a message bus: to communicate, system A publishes messages to the bus, and system B consumes them whenever it wants. Because the bus retains messages for a period of time, systems A and B can communicate even if both aren’t online at the same time.
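This decoupling-in-time can be sketched with a toy in-memory bus (an illustration only — the class and topic names here are hypothetical, and a real deployment would use Kafka producers and consumers):

```python
from collections import defaultdict


class MessageBus:
    """Toy in-memory stand-in for a message bus such as Kafka."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # System A appends a message; the bus retains it for later readers.
        self._topics[topic].append(message)

    def read(self, topic, offset=0):
        # System B reads from its own offset, even if it was offline
        # when the messages were published.
        return self._topics[topic][offset:]


bus = MessageBus()
# System A publishes without knowing who (if anyone) will consume.
bus.publish("platform-events", {"resource": "app-123", "action": "created"})
bus.publish("platform-events", {"resource": "app-123", "action": "updated"})

# Later, system B comes online and catches up on everything it missed.
missed = bus.read("platform-events")
assert len(missed) == 2
```

Note that the producer never addresses a specific consumer — it only names a topic — which is what lets consumers be added or removed freely.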

Increasingly we are moving from the synchronous integration pattern to this asynchronous messaging pattern, powered by Kafka. The much looser coupling lets our services and (importantly!) our development teams operate and iterate more independently. The message stream produced by system A forms an abstract contract: as long as A continues to publish a compatible stream of messages, systems A and B can each be modified without regard to the other. Even better, the producing system doesn’t need to know anything about the consuming system(s); we can add or remove consumers at any time.

Compared to traditional message brokers, Kafka offers a number of benefits. It delivers blazing performance and scalability, handling hundreds of thousands of messages per second. Its architecture supports relatively long-term message storage, enabling consumers to read many hours back in time. Its simple log-oriented design provides good delivery guarantees without requiring a complex ack/nack protocol. Finally, its multi-node architecture offers zero-downtime operation (brokers within a cluster can be upgraded independently) and simple horizontal scalability. This makes Kafka suitable for a large range of integration and stream-processing use cases, all running against the same Kafka cluster.
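The log-plus-offsets model behind these guarantees can be sketched in a few lines (an in-memory illustration only; the class name is made up, and a real Kafka consumer would track committed offsets per partition):

```python
class Log:
    """Kafka's core abstraction, sketched in memory: an append-only log."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


log = Log()
for n in range(5):
    log.append(f"event-{n}")

# Each consumer tracks only its own position in the log --
# no per-message ack/nack protocol is needed.
positions = {"billing": 5, "metrics": 3}

# "metrics" resumes exactly where it left off...
assert log.read_from(positions["metrics"]) == ["event-3", "event-4"]

# ...and can rewind to replay retained history, without affecting "billing".
positions["metrics"] = 0
assert log.read_from(positions["metrics"]) == [f"event-{n}" for n in range(5)]
```

Because a consumer's position is just an integer it controls itself, adding a consumer never changes what the broker stores — which is also why many consumers can share one cluster cheaply.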

Our core internal service generates an abstract event stream representing all resource changes on the platform - we call this the platform event stream. We’ve built several versions of this stream: first on Postgres, then on AWS Kinesis; the latest runs on our Kafka service.

As a globally shared service, Kinesis throttles read bandwidth from any single stream. This sounds reasonable, but in practice means that adding additional consumers to a stream slows down all consumers of that stream. This resulted in a situation where we were reluctant to add additional consumers to the platform event stream. This encouraged us to re-implement the stream on Kafka. We have been very happy with the minimal resources required to serve additional consumers - a single Kafka cluster can easily serve hundreds of clients.

When we launched the Kafka version of the platform event stream, we wanted to ease the transition for the clients of our existing Kinesis stream. These clients expected an HTTP-based interface and a managed authentication system. We had expansive plans to allow lots of different clients, and we wanted both to simplify the process of creating new clients as well as be able to control stream access at a fine-grained level.

So we decided to implement a simple proxy to front our Kafka cluster. The proxy uses HTTP POST for publishing and a WebSocket channel for consuming, and implements a custom client authentication scheme. It adds a layer of manageability on top of our Kafka cluster that lets us support a large set of client use cases, and it allows us to protect the cluster inside a secure Heroku Private Space while still permitting controlled access from outside the Space.
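The fine-grained access control such a proxy enables might look roughly like the following sketch. To be clear, the grant table, client IDs, and function are hypothetical illustrations, not Heroku's actual scheme:

```python
# Hypothetical per-client grants: which topics each authenticated
# client may read from or write to through the proxy.
GRANTS = {
    "client-abc": {"read": {"platform-events"}, "write": set()},
    "client-xyz": {"read": {"platform-events"}, "write": {"platform-events"}},
}


def authorize(client_id, action, topic):
    """Return True if the client may perform `action` on `topic`."""
    grant = GRANTS.get(client_id)
    return grant is not None and topic in grant.get(action, set())


# A read-only client can consume but not publish.
assert authorize("client-abc", "read", "platform-events")
assert not authorize("client-abc", "write", "platform-events")
# Unknown clients are rejected outright.
assert not authorize("unknown", "read", "platform-events")
```

Centralizing this check in the proxy is what makes it cheap to onboard new clients: granting access is a table change, not a Kafka ACL or network change.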

The proxy exacts a performance penalty relative to the native Kafka protocol, but we’ve found that Kafka is so fast that this penalty is acceptable for our requirements. Some other teams at Heroku with “bare metal” performance needs are using their own Kafka clusters with native clients.

Despite the trade-offs, we have been very happy with the results. We have more than ten different consumers of the platform event stream, and the minimal onramp cost of connecting to the proxy (nothing more than a WebSocket client is required) is enabling that number to grow steadily. Kafka’s robust scalability means that adding another consumer of the event stream costs almost nothing in additional resources, and this has led us to create new consumers for lots of purposes we never originally envisioned.

The success of Kafka plus the WebSocket proxy has encouraged us to generalize the proxy to support additional event streams beyond our original one. We have now opened up the proxy so that other teams can register new Kafka topics hosted within our cluster, giving them the kind of zero-administration service they expect, with low cost and high scalability.

Some features that we would like to support in the future include:

A schema registry to hold definitions for all events, both for discoverability and potentially message validation

Message filtering

Public consumers. Eventually we hope to expose the event stream as a primitive to all clients of the Heroku platform API.

Confluent has some interesting open source offerings in these areas, including their own REST Proxy for Kafka and their Schema Registry for Kafka.

This asynchronous integration pattern aligns well with the broader architectural shift away from batch processing with relational databases towards real-time stream processing. Rethinking your services as event-driven stream processors offers a path towards a much more agile, scalable, and real-time system but requires thinking very differently about your systems and the tools you are using. Kafka can play a key role in enabling this new style of real-time architecture, and techniques like using an HTTP proxy are effective tools for easing experimentation and adoption.

Moving to a real-time, asynchronous architecture does require significant new ways of thinking. Push channels must be created to notify users of system state asynchronously, rather than simply relying on the HTTP request/response cycle. Rethinking the notion of “persistent state” as a “point in time snapshot” rather than “canonical source of truth” implies a very different application architecture than the ones to which most engineers are accustomed. Architecting for eventual consistency and compensating transactions requires developing new techniques, libraries, and tools.