by Florian Troßbach

"Databases? Where we're going we don't need databases" – Doc Brown, 1985

Well, we’re certainly not there yet, but this article is going to introduce you to a new feature of the popular streaming platform Apache Kafka that can make a dedicated external database redundant for some use cases.

Kafka 0.10.0 introduced the “Kafka Streams” API – a new Kafka client that enables stateless and stateful processing of incoming messages, with state being stored internally where necessary. In the initial release, state could only be exposed by writing to another Kafka topic. Since Kafka 0.10.1, this internal state can be queried directly. This article introduces the API and talks about the challenges in building a distributed streaming application with interactive queries. It assumes basic knowledge of the Streams API.

Example

Let’s consider a simple example that models the tracking of visits to a web page. A topic “visitsTopic” contains Kafka messages that contain key-value pairs in the format <ip,timestamp>. So the key of the message contains an IP address of a visitor and the value contains the timestamp of that visit. This is of course a bit contrived as we would not use one topic per trackable web page, but let’s keep it simple.

For the sake of this example, we’re interested in three aspects:

how many times did a user with a given IP visit our page in total?

how often was it visited by a given IP in the last hour?

how many times per user session did an IP visit the page?

This can be achieved with the following topology:

There are three state stores:

“totalVisitCount” contains the total amount of visits per unique IP

“hourlyVisitCount” contains the number of visits in the last hour

“sessionVisitCount” contains the count per session (with a new session being started when there is no activity for more than a minute)

In Kafka 0.10.0, the only option to retrieve that data would have been to materialize it into another Kafka topic. For many use cases, this can be considered quite wasteful. Why do we have to persist the data once again in Kafka if all we want to do is answer a few very simple queries?

Interactive Queries to the Rescue

As outlined in KIP-67, interactive queries were designed to give developers access to the internal state that the Streams-API keeps anyway. This is the first bit to take away: interactive queries are not a rich Query-API built on Kafka Streams. They merely make existing internal state accessible to developers.

The state is exposed by a new method in org.apache.kafka.streams.KafkaStreams . While this client originally mainly contained the capability to start and stop streaming topologies, it has been extended in Kafka 0.10.1 and further with 0.10.2. The entrypoint into querying a local state store is the store method. Let’s look a bit more closely at its signature:

public <T> T store(String storeName,

QueryableStoreType<T> queryableStoreType)

The first parameter is easy, it takes the name of the store that we want to query – “totalVisitCount”, “hourlyVisitCount” or “sessionVisitCount” in our example. It is not the topic name! The second parameter is a bit more intriguing. It declares the type of the provided store. At this point, it’s worth taking a step back to understand what that is about. By default, the Kafka Streams high-level DSL uses RocksDB (http://rocksdb.org/) to store the internal state. This is generally pluggable by the way – you could supply your own StateStoreProvider. RocksDB mainly works in memory but may flush to disk as well. There are three standard types of RocksDB-backed state stores:

Key-Value based

Window based

Session window based (since 0.10.2)

In our example, “totalVisitCount” is an example of a key-value based state that maps an IP address to a counter. “hourlyVisitCount” is window-based – it stores the count of visits of an IP address as it occurred in a specific time window. “sessionVisitCount” is an example of a session window store. Session windows are a new feature of Kafka 0.10.2 and allow to group repeated occurrences of keys into specific windows that dynamically expand if a new record arrives within a so-called inactivity gap. Simple example: if the inactivity gap is 1 minute, a new session window would be opened if there was no new record for a key for longer than that minute. Two messages within say 20 seconds would belong to the same window.

Each store type has its specifically tailored API. A key value store enables different types of queries than window stores.

Accessing a key-value store works like this:

ReadOnlyKeyValueStore<String, Long> store = streams.store(“visitsTable”,

QueryableStoreTypes.<String, Long>keyValueStore());

An important aspect of interactive queries is in the name of the return type – they are read-only. There are no inserts, updates, deletes whatsoever. This is a good thing – Kafka topics are your only data source and underlying computations could really get messed up if you were allowed to manipulate data.

The ReadOnlyKeyValueStore interface does not contain a lot of methods. You can basically query the value of a certain key, the values of a range of keys, all keys and an approximate count of entries. Applied to our example, this store enables you to query for the total count of visits for a given IP, the count for a range of IPs, all IPs and their count and an approximate count of all unique IPs in the store.

Creating a handle to a windowed store works like this:

ReadOnlyWindowStore<String, Long> store = streams.store(“hourlyVisitCount”,

QueryableStoreTypes.<String, Long>windowStore());

This interface is even sparser as it only has one method called fetch that takes a key as well as a “from” and and a “to” timestamp.

This retrieves the aggregated results of the windows that fall into the passed timeframe. The resulting iterator contains KeyValue<Long,V> objects where the long is the starting timestamp of the window and the V the type of the value of that KTable. With regards to the example, this would retrieve the hourly counts for all visits by a given IP starting with the window that contains “timeFrom” and ending with the window containing “timeTo”.

Session windows stores are retrieved with

ReadOnlySessionStore<String, Long> store = streams.store("sessionVisitCount", QueryableStoreTypes.<String, Long>sessionStore());

The store interface is the simplest of all as it only has one fetch method that takes a key and nothing else. It retrieves the results for all existing session windows at that point in time.

So this looks easy enough. When running a single instance of the streaming application, all partitions of the topic are handled by that instance and can be queried. However, running a single instance of a consumer is not really what Kafka is about, is it? How do interactive queries work when the partitions of the source topics – and by extension the state – is distributed across instances of your streaming application?

Running your application in distributed mode

There is no beating around the bush – here be dragons. As mentioned above, interactive queries have not turned Kafka Streams into an almighty query server.

So the bad news is:

you need an additional layer that glues together your instances

you need to know which instance(s) would be responsible for a given query

you need to build it yourself

Sucks a bit, right? It’s not hard to see where this restriction is coming from, though – building an efficient generalized query facade running in a distributed mode, working for all kinds of data on Kafka is hard when all you can count on is the fact that keys and values are byte arrays containing god knows what. Another main reason for this is that Kafka Streams aims to be completely agnostic to the kind of context it is run in – it does not want to restrict you to certain frameworks. The Confluent blog argues this case very nicely.

Kafka Streams is not leaving you completely alone with that problem, though.

When you provide the properties for your streaming application, a new one is application.server . This expects a host:port pair that will be published among the instances of your application. This does not mean that the Streams API will actually open that port and listen to some sort of request. That is your responsibility and you are completely responsible for communication protocols etc. But it will communicate that endpoint to the other instances via the Kafka protocol, so if you keep your end of the bargain, you can query any instance for metadata and it will provide a comprehensive view. The following illustration demonstrates the setup:

There are two instances of the application, running on 1.2.3.4:42 and 1.2.3.5:4711. A query layer talks to those instances via a user-defined (that means you) protocol. The instances themselves need to run some kind of server that provides endpoints for that protocol. You’re completely free what to use here, there is a lot of choice in the Java ecosystem – Spring MVC, Netty, Akka, Vert.x, you name it). Initially, the query layer needs to know at least one instance by address, but that instance can – if your protocol allows it – pass on the information about the other endpoints. The query layer can ask any instance for information about the location of a given key or store.

Accessing the metadata

So how do we get this metadata on the low level? For this, we return to org.apache.kafka.streams.KafkaStreams . Apart from the method that let’s us access a store, it also provides access to metadata on varying levels. You can simply query all metadata for a streaming application. This will give you an overview of:

what instances of my application are running where (according to the “application.server” property?

what state stores are available on those instances?

what partitions of what topics are handled by an instance?

In a simple example with only one instance, this metadata looks like this (via its toString ):

The host info object contains the provided application server values, the three state store names are present and the instance handles partitions 0 and 1 of topic “visitsTopic”. If there were more instances, we’d get all metadata. That metadata is of course a snapshot of the time you call the allMetadata() method – starting or stopping instances can result in partition reassignment.

The API provides more fine-grained access as well. We can query all the metadata for a given state store, for example. This operation only returns metadata for instances where a store of that name is present. Even more specific are two methods that take the name of a store and a key (and either a Serializer for that key or a StreamPartitioner). This is a very interesting operation as it will return the single metadata for the instance that will hold the data for a key if any data exists, which of course cannot be guaranteed – we won’t know if data is there unless we execute an actual query.

Conclusion

Interactive queries are a very cool feature that just might make your database redundant one day. Kafka is not the only technology moving in that direction – Apache Flink 1.2 introduced a similar feature.

But let’s not get ahead of ourselves – these are early days for this kind of technologies. Interactive queries in Kafka are at the moment only suitable for very simple key-based queries and the need to build your own distributed query layer might put people off. But with an ever-growing Kafka community, there is some real potential. The future is not quite here yet, but interactive queries shows us what it might look like.

As an entry point for further reading,I recommend reading Confluent’s introductory post. Confluent also provides a reference implementation of a query layer.