Event time low-latency joins with Kafka Streams

This post attempts to illustrate the difficulty of performing an event-time join between two time series with a stream processing framework. It also describes one solution based on Kafka Streams 0.11.0.0.

An event-time join joins two time series while taking their timestamps into account. More precisely, for each event from the first time series, it looks up the latest event from the other that occurred before it. This blog post is based on Kafka Streams, although I found the original idea in this Flink tutorial, where the idea of an event-time join is very well explained.

Event-time joins are often required in practice. For example, given a stream of transactions and another one of customer profile updates, we might want to associate each transaction with the corresponding customer profile as it was known at the moment of the transaction. Or given a stream of traffic information and another one of weather updates, we might want to associate each traffic event with the latest weather that was known for that location at that point in time.

Note that an event-time join is not symmetric: performing an event-time join from stream 1 to stream 2 does not yield the same result as performing it from stream 2 to stream 1.

Difficulty and opportunity related to the streaming approach

If we were in a batch context, implementing an event-time join would be pretty straightforward. By batch context I mean one where "all the data is available", so that the execution of an aggregation like max(orderDate) is guaranteed to provide the last order date of the full dataset.

For example, assume we have a dataset of customer visit events and another one of orders. Both are timestamped and thus both represent time series. Suppose we want to look up, for each customer visit, the latest order performed by that customer before the visit. In batch mode, we can simply look up the latest known order before the visit (Orders.OrderDate <= CustomersVisit.VisitDate in the example below) and join that to the visit information. One possible illustration might be:

```sql
/** One batch approach to linking each customer visit to their latest order that occurred
    before it (probably not optimal, though hopefully clear enough to illustrate my purpose) */
SELECT
    CustomerOrderAsOfVisitDate.VisitId    AS visitID,
    CustomerOrderAsOfVisitDate.CustomerId AS customerId,
    Orders.OrderDate                      AS lastOrderDateBeforeVisit,
    Orders.ShipperId                      AS orderShipperId
FROM Orders
LEFT JOIN (
    -- latest order date occurring before each visit
    SELECT
        CustomersVisit.VisitId,
        CustomersVisit.CustomerId,
        max(orderDate) AS lastOrderDate
    FROM CustomersVisit
    JOIN Orders
        ON (Orders.CustomerId == CustomersVisit.CustomerId
            AND Orders.OrderDate <= CustomersVisit.VisitDate)
    GROUP BY CustomersVisit.VisitId, CustomersVisit.CustomerId
) AS CustomerOrderAsOfVisitDate
    ON (CustomerOrderAsOfVisitDate.CustomerId == Orders.CustomerId
        AND CustomerOrderAsOfVisitDate.lastOrderDate == Orders.OrderDate)
```

| visitID | customerId | lastOrderDateBeforeVisit | orderShipperId |
|---------|------------|--------------------------|----------------|
| 12 | Ana Trujillo | 1996-09-18 | 3 |
| 14 | Antonio Moreno | 1996-11-27 | 2 |
| 15 | Around | 1996-12-16 | 3 |
| 16 | Berglunds | 1996-12-16 | 3 |

A typical crux of stream processing, though, is that datasets are unbounded and theoretically infinite. This implies that at any point in time, we cannot be sure that we have received all the information necessary to compute the final version of anything. In the example above, this means that max(orderDate) only returns the latest order date observed so far: it is an aggregation whose result keeps changing as new events arrive.

Also, because of delays that can happen during data ingestion, events are typically not guaranteed to be delivered in order (see the discussion in Flink's documentation on event time vs processing time and in Spark's time handling documentation).

This limitation also applies in the case of an event-time join: any time we receive a transaction or a traffic event, we cannot in general be sure that the information we currently have concerning the user profile or weather time series is the latest that will ever be available. We could decide to wait, though how long?

This question of "how long to wait" is one key difference between stream and batch processing. In a batch approach, some data collection process is assumed to have "waited long enough" beforehand, so that at the moment of the batch execution we can consider that "all data is available". Put differently, "waiting long enough" is not a concern of the batch implementation, whereas it is a first-class concern in stream processing.

In many cases though, a nightly batch that processes the last day's data is nothing less than a manual implementation of a 24h tumbling window. Hiding the stream nature of a dataset behind nightly batches can hide too much of the complexity related to time by pretending that "all data is available". In many situations, we end up handling cases like late event arrivals or aggregations spanning more than one day (e.g. 30-day sliding trends) ourselves, which are much more natural to express with a framework that embraces the infinite time-series nature of the dataset.
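As an illustration, here is roughly what such a daily aggregation could look like when expressed directly as a 24h tumbling window with the Kafka Streams 0.11 DSL. This is only a sketch: the "orders" topic, the Order type and the per-customer count are placeholders, and the default serdes are assumed to match.

```java
import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

KStreamBuilder builder = new KStreamBuilder();

// hypothetical topic of order events, keyed by customer id (Order is a placeholder type)
KStream<String, Order> orders = builder.stream("orders");

// number of orders per customer per 24h window, continuously updated as events arrive;
// late events simply update their window for as long as it is retained
// (0.11-era API: the windowed count takes an explicit state store name)
KTable<Windowed<String>, Long> ordersPerDay = orders
        .groupByKey()
        .count(TimeWindows.of(TimeUnit.DAYS.toMillis(1)), "orders-per-day");
```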

Why not rely on Kafka Streams' event-time based processing

Kafka Streams 0.11.0.0 does not offer an out-of-the-box event-time join.

It does provide, however, a couple of handy primitives for designing stream processing based on event time, as explained in the Kafka Streams concepts documentation. As far as I understand, these features are primarily useful for time-based window aggregations and best-effort flow control.
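In particular, the event time used by these primitives is provided by a TimestampExtractor, which can be configured to read a timestamp embedded in the message payload rather than the one set by the producer or the broker. Here is a minimal sketch, assuming a hypothetical Timestamped interface exposing the payload's event time in milliseconds:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Timestamped is a placeholder for any deserialized payload type that carries its own event time
public class PayloadTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof Timestamped) {
            // use the event time carried inside the payload
            return ((Timestamped) value).eventTimeMs();
        }
        // otherwise fall back to the timestamp recorded by Kafka (producer or broker time)
        return record.timestamp();
    }
}
```

Such an extractor is then registered through the timestamp.extractor property of the StreamsConfig.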

The Kafka Streams DSL also exposes a KStream-to-KTable join, which essentially corresponds to looking up up-to-date reference data in real time. Confluent has published two excellent blog posts about it (here and here). Combined with Kafka Streams' built-in best-effort flow control, this is already quite powerful and probably exactly what we need in many cases. As a point of comparison, at the time of writing, this feature is not (yet?) available out of the box in Spark Structured Streaming (2.2.0).
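For reference, such a stream-to-table join looks roughly like this with the 0.11 DSL. The topic names and the Transaction, CustomerProfile and EnrichedTransaction types are hypothetical, both topics are assumed to be keyed by customer id, and the default serdes are assumed to match:

```java
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

KStreamBuilder builder = new KStreamBuilder();

// stream of facts and changelog of reference data, both keyed by customer id
KStream<String, Transaction> transactions = builder.stream("transactions");
KTable<String, CustomerProfile> profiles = builder.table("customer-profiles", "profiles-store");

// each transaction is enriched with the profile as currently known by the KTable,
// i.e. a processing-time lookup rather than an event-time one
KStream<String, EnrichedTransaction> enriched =
        transactions.join(profiles, (transaction, profile) -> new EnrichedTransaction(transaction, profile));
```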

This KStream-to-KTable join, however, performs the table lookup at processing time (as mentioned in the KStream::join javadoc). To fully support an event-time join of out-of-order streams, we need to manually keep buffers of both streams, as explained below.

High level approach

As mentioned in the introduction, this post is inspired by the Flink event-time join tutorial, and my solution is almost a copycat of their suggested solution.

The gist of my solution is very simple:

keep in mind that an event-time join is an asymmetric operation. Let's name the first stream the transaction stream and the one we are joining it to the dimension stream

upon receiving a dimension event, just record it in a time-bounded buffer (e.g. TTL = 1 day or so)

upon receiving a transaction event, perform a best-effort join, i.e. join it with the dimension information available at that moment

schedule an action that reviews previously joined information and emits corrected joins when necessary (a minimal sketch of these steps follows the list)
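To make those steps a bit more concrete, here is a minimal sketch of the transaction-side processing with the 0.11-era Processor API. All type names (TransactionEvent, DimensionEvent, JoinedEvent) and store names are placeholders, the processor that fills and expires the dimension buffer from the dimension stream is not shown, and the bookkeeping is deliberately simplified; the complete implementation is in the code sample further below.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class EventTimeJoiner
        implements Transformer<String, TransactionEvent, KeyValue<String, JoinedEvent>> {

    private ProcessorContext context;
    private KeyValueStore<String, DimensionEvent> dimensionBuffer; // time-bounded buffer of dimension events
    private KeyValueStore<String, JoinedEvent> emittedJoins;       // joins already emitted, kept for later review

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.dimensionBuffer = (KeyValueStore<String, DimensionEvent>) context.getStateStore("dimension-buffer");
        this.emittedJoins = (KeyValueStore<String, JoinedEvent>) context.getStateStore("emitted-joins");
        // 0.11-era API: schedule() takes a single interval (in ms) and triggers punctuate()
        // based on stream time, i.e. event time when a timestamp extractor is configured
        context.schedule(60_000L);
    }

    @Override
    public KeyValue<String, JoinedEvent> transform(String key, TransactionEvent transaction) {
        // best-effort join: use whatever dimension information is currently buffered for this key
        DimensionEvent dimensionSoFar = dimensionBuffer.get(key);
        JoinedEvent joined = new JoinedEvent(transaction, dimensionSoFar);
        emittedJoins.put(key, joined); // remember it so it can be reviewed and corrected later
        return KeyValue.pair(key, joined);
    }

    @Override
    public KeyValue<String, JoinedEvent> punctuate(long streamTime) {
        // periodically review previously emitted joins: if a late dimension event changes the
        // outcome, forward a corrected join via this.context, and drop entries older than the TTL
        // (omitted here, see the full code sample below)
        return null;
    }

    @Override
    public void close() {
    }
}
```

Such a transformer would be plugged in with KStream::transform, together with the two state stores registered on the builder.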

Here is an illustration where the transaction stream is a stream of recommendations and the dimension stream is a stream of mood events (this use case is detailed in the code sample below):