No worries, this is not yet another post about exactly-once processing. I'd like to describe an interesting requirement that popped up in one of our projects.

Here is the situation

The event source is sending messages to a Kafka topic. Each message contains a single line, the identifier of the message. In our app it's an object with several fields, one of which is an id. It is serialised with Avro and we're using Avro-specific Serdes, but let's keep things simple. All we want to achieve is having unique identifiers in an output topic. In this post I'll discuss three solutions, depending on the characteristics of the identifiers. For reference, the project is available on GitHub.

De-duplicating sequence numbers

The input is generating identifiers in a sequence. It's guaranteed that there are no gaps between the numbers and that they are ordered. Due to technical issues, they may be redelivered starting from a previous snapshot. In other words, we may end up with the following numbers in the input topic:
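Something like this, for example (the concrete numbers are made up for illustration: after reaching 6 the source restarts from a snapshot taken at 3):

```
1 2 3 4 5 6 3 4 5 6 7 8
```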

What we want to see in the output topic are unique identifiers. Yeah, with bash you can do it in maaaaaaany ways, one of which is:
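A minimal sketch, assuming the identifiers have already been dumped to a file: awk keeps only the first occurrence of each line while preserving order.

```bash
# keep only the first occurrence of each line, preserving order
awk '!seen[$0]++' input.txt
```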

The first approach is based on a custom ValueTransformer using a state store to persist the last processed sequence number:
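A sketch of what such a transformer could look like; the class name, store name and value type are made up for this example, so the project on GitHub may differ in details:

```java
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: keeps the highest sequence number seen so far in a state store
// and forwards a value only if it advances the sequence.
public class SequenceDeduplicationTransformer implements ValueTransformer<Long, Long> {

    static final String STORE_NAME = "last-sequence-store"; // hypothetical store name

    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<String, Long>) context.getStateStore(STORE_NAME);
    }

    @Override
    public Long transform(final Long sequenceNumber) {
        final Long lastSeen = store.get("last");
        if (lastSeen == null || sequenceNumber > lastSeen) {
            store.put("last", sequenceNumber);
            return sequenceNumber;   // new number, forward it
        }
        return null;                 // already seen, dropped later by filter()
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}
```

In the topology it would be plugged in with transformValues(SequenceDeduplicationTransformer::new, STORE_NAME), followed by a filter((key, value) -> value != null) before writing to the output topic.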

The transform method stores and returns the current sequence number if it hasn't been seen yet and is higher than the last value kept in the store. Otherwise null is returned, which is filtered out later in the pipeline. This solution will only work if the input topic has one partition. For multiple partitions, there will be multiple ValueTransformer instances, each working with its local shard of the state. It may happen that the first message with the sequence number 5 is routed to the first partition, and another message with the same sequence number 5 goes to the second partition. In such a case, the 5 will be processed once by the first ValueTransformer and again by the second one. The result is that 5 ends up in the output topic twice.

De-duplicating grouped sequence numbers

In our case the sequence numbers are keyed. As a result we get multiple sequences, one per key. What we want to achieve are unique sequences for each key. One solution to this problem is to group by key and aggregate the last and the current value as a pair. In practice we used a Tuple2 from Vavr and provided a custom Serde, but to keep things simple, the pair is a String separated by a semicolon:
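Roughly along these lines; the topic names and the "0;0" seed are assumptions made for the sketch, and String default serdes are assumed to be configured for the application:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

// Sketch of the grouped de-duplication, one "previous;current" pair per key.
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> input = builder.stream("sequences-input");

input.groupByKey()
     .aggregate(
         () -> "0;0",                                    // seed pair: previous;current
         (key, value, aggregate) -> {
             final long previous = Long.parseLong(aggregate.split(";")[1]);
             final long current = Long.parseLong(value);
             return current > previous
                 ? previous + ";" + current              // sequence advanced
                 : previous + ";" + previous;            // duplicate or older value
         },
         Materialized.with(Serdes.String(), Serdes.String()))
     .toStream()
     .filter((key, pair) -> !pair.split(";")[0].equals(pair.split(";")[1]))  // drop equal pairs
     .mapValues(pair -> pair.split(";")[1])              // keep only the current number
     .to("sequences-output");
```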

If the current number is higher than the previous one, the returned value is a concatenation of both: previous;current. The pipeline later extracts the current number and puts it into the output topic. If, however, the current number is lower than or equal to the previous one, the returned pair consists of the previous value twice: previous;previous. The filter method drops all pairs whose two values are equal, so they are not put into the output topic.

One important thing to note is the cache size. This solution requires the cache to be disabled by setting cache.max.bytes.buffering, a.k.a. StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, to zero. With the cache disabled, the aggregate function emits an output for each element instead of accumulating several values before forwarding them.
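In the streams configuration that boils down to a single property, shown here next to a placeholder for the remaining settings:

```java
final Properties props = new Properties();
// ... other StreamsConfig settings (application id, bootstrap servers, serdes) ...
// disable the record cache so that aggregate() forwards a result for every input record
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
```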

De-duplicating custom values

Both solutions rely on two basic characteristics of the incoming messages: the values are numbers and there are no gaps in the sequence. How could we solve this problem if the value were a custom String?

The internet is a good source of ideas. I found this solution interesting and modified it slightly. The pipeline is built around a ValueTransformer:
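Roughly along these lines; the store name, the value type and the one-minute purge interval are assumptions for the sketch, not necessarily what the linked solution or the project uses:

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: forwards a value only if it hasn't been seen before and periodically
// purges old entries so the store does not grow indefinitely.
public class DeduplicationTransformer implements ValueTransformer<String, String> {

    static final String STORE_NAME = "deduplication-store"; // hypothetical store name

    private final long maintainDurationMs;
    private KeyValueStore<String, Long> store; // value -> wall-clock time of first occurrence

    public DeduplicationTransformer(final long maintainDurationMs) {
        this.maintainDurationMs = maintainDurationMs;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<String, Long>) context.getStateStore(STORE_NAME);
        // every minute remove entries older than the configured threshold
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, this::purgeExpired);
    }

    @Override
    public String transform(final String value) {
        if (store.get(value) != null) {
            return null;                            // duplicate, filtered out later
        }
        store.put(value, System.currentTimeMillis());
        return value;                               // first occurrence, forward it
    }

    private void purgeExpired(final long now) {
        try (KeyValueIterator<String, Long> it = store.all()) {
            while (it.hasNext()) {
                final KeyValue<String, Long> entry = it.next();
                if (now - entry.value > maintainDurationMs) {
                    store.delete(entry.key);
                }
            }
        }
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}
```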

The transform method does one simple thing: it looks up the current value in the state store and, if it is there, returns null, which is later discarded (filtered out) in the pipeline. If, however, the value is not stored yet, it is put into the state store and returned for further processing. Quite similar to the first approach, except that it works with any custom identifier instead of a sequence of numbers. The interesting part is preventing the state store from growing indefinitely. The init method sets up a scheduler which every minute runs a Punctuator responsible for removing records older than a configurable threshold called maintainDurationMs.

Also similar to the first solution, this approach won't work if there is more than one partition and the messages are not keyed. In such a case, the same identifiers may be routed to different partitions and processed independently, which will result in duplicates in the output topic. If, however, the values are UUIDs, you won't have duplicates across different keys and you're good to go with this solution.

Further ideas?