
Whenever we hear the word Kafka, we tend to think of it as a messaging system with a publisher-subscriber model that serves as a source and a sink for our streaming applications.

So we can say that Kafka is just a dumb storage system: it stores the data provided by a producer for a configurable retention period and hands it to any consumer that asks for it (from a topic, of course).

Between receiving the data from a producer and sending it to a consumer, Kafka itself does nothing with the data. To process it, we turn to other tools such as Spark or Storm that sit between producer and consumer. This means we have to build two separate clusters for our app: a Kafka cluster that stores our data, and another cluster that does the stream processing on it.

To save ourselves from this hassle, the Kafka Streams API comes to our rescue. With it, stream processing becomes part of the Kafka ecosystem: the processing logic lives in an ordinary application that uses the Kafka Streams library, so no separate processing cluster is needed. And with this tight integration, we get all the support Kafka already provides (for example, a topic partition becomes a stream partition for parallel processing).

What’s KAFKA STREAMS API?

The Kafka Streams API allows you to create real-time applications that power your core business. It is a simple yet powerful way to process data stored in Kafka, and it is a client library built on top of Kafka's standard producer and consumer clients.

A unique feature of the Kafka Streams API is that the applications you build with it are normal applications. These applications can be packaged, deployed, and monitored like any other application – there is no need to install separate processing clusters or similar special-purpose and expensive infrastructure!

Link to the image

FEATURES BRIEF

The features provided by Kafka Streams:

highly scalable, elastic, distributed and fault-tolerant application

stateful and stateless processing

event-time processing with windowing, joins, and aggregations

ready-made implementations of the most common transformation operations through the Kafka Streams DSL, or the lower-level Processor API, which allows us to define and connect custom processors

low barrier to entry: it does not take much configuration or setup to run a small-scale trial of stream processing; the rest depends on your use case

no separate cluster requirements for processing (integrated with Kafka)

employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time based windowing operations with late arrival of records

works with Kafka Connect to connect to different applications and databases

STREAMS

A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair. A stream can be interpreted either as a record stream (represented by KStream) or as a changelog stream (represented by KTable or GlobalKTable).

STREAM PROCESSOR

A stream processor is a node in the processor topology. It represents a processing step in a topology (to transform the data). A node is basically our processing logic that we want to apply on streaming data.

Link to the image

As shown in the figure, a source processor is a processor without any upstream processors, and a sink processor is a processor without any downstream processors.

PROCESSING IN KAFKA STREAMS

The aim is to process data that is consumed from Kafka and written back into Kafka. Two options are available for processing stream data:

the High-Level Kafka Streams DSL

the lower-level Processor API, which provides APIs for data processing, composable processing, and local state storage

1. HIGH-LEVEL DSL

The High-Level DSL contains already implemented, ready-to-use methods. It is built around two main abstractions: KStream and KTable (or GlobalKTable).

a). KStream

A KStream is an abstraction of a record stream, where each record is a simple key-value pair in the unbounded dataset. It provides many functional-style methods for manipulating stream data, such as:

map

mapValues

flatMap

flatMapValues

filter

It also provides joining methods for joining multiple streams and aggregation methods on stream data.
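As a minimal sketch of two of these operations, the snippet below chains filter and flatMapValues against the 0.11.0.0 API used later in this post. The topic names "LinesTopic" and "WordsTopic", the object name, and the words helper are hypothetical, and String default serdes are assumed. Explicit anonymous classes are used instead of lambdas so that the sketch does not depend on the Scala version's SAM conversion.

```scala
import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder, Predicate, ValueMapper}
import scala.collection.JavaConverters._

object KStreamOperationsSketch {

  // Pure helper (hypothetical): splits a line into lower-case words, dropping empty tokens.
  def words(line: String): List[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toList

  def buildTopology(): KStreamBuilder = {
    val builder = new KStreamBuilder

    // "LinesTopic" and "WordsTopic" are made-up topic names for illustration.
    val lines: KStream[String, String] = builder.stream("LinesTopic")

    // filter: keep only records whose value is non-null and non-blank.
    val nonEmpty: KStream[String, String] = lines.filter(new Predicate[String, String] {
      override def test(key: String, value: String): Boolean =
        value != null && value.trim.nonEmpty
    })

    // flatMapValues: one input record becomes zero or more output records with the same key.
    val wordStream: KStream[String, String] = nonEmpty.flatMapValues[String](
      new ValueMapper[String, java.lang.Iterable[String]] {
        override def apply(line: String): java.lang.Iterable[String] = words(line).asJava
      })

    // Write the resulting stream back to Kafka using the default serdes.
    wordStream.to("WordsTopic")
    builder
  }
}
```

Each operation returns a new KStream, so transformations can be chained before the result is written to an output topic.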

b). KTable or GlobalKTable

A KTable is an abstraction of a changelog stream. In this changelog, every data record is treated as an insert or an update (upsert) depending on whether its key already exists: any existing row with the same key is overwritten.
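The difference between the two interpretations can be illustrated without Kafka at all. The following plain-Scala sketch is only a model of the semantics, not the Kafka Streams API: the Record type, the function names, and the sample data are made up, and summing is just one possible aggregation over a record stream.

```scala
object StreamVsTableSketch {

  // A data record is a key-value pair (hypothetical type for this sketch).
  final case class Record(key: String, value: Int)

  // Record-stream reading: every record is an independent event,
  // so an aggregation such as a sum takes all values for a key into account.
  def asRecordStream(records: Seq[Record]): Map[String, Int] =
    records.groupBy(_.key).map { case (key, rs) => key -> rs.map(_.value).sum }

  // Changelog reading: a later record with the same key overwrites
  // the earlier one (upsert), so only the latest value per key survives.
  def asChangelog(records: Seq[Record]): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { (table, r) => table + (r.key -> r.value) }

  def main(args: Array[String]): Unit = {
    val records = Seq(Record("alice", 1), Record("bob", 5), Record("alice", 3))
    println(asRecordStream(records))
    println(asChangelog(records))
  }
}
```

For the sample records, the record-stream reading sums alice to 4, while the changelog reading keeps only her latest value, 3. The value for bob is 5 in both cases.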

2. PROCESSOR API

The low-level Processor API gives us direct access to the stream data, so we can apply our business logic to each incoming record and forward the result downstream. This is done by extending the abstract class AbstractProcessor and overriding the process method, which contains our logic. The process method is called once for every key-value pair.

Where the High-Level DSL provides ready-to-use methods in a functional style, the low-level Processor API gives you the flexibility to implement processing logic according to your needs. The trade-off is the number of lines of code you need to write for a specific scenario.
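A minimal sketch of this approach, assuming the 0.11.0.0 API and String keys and values, is shown below. The class name UpperCaseProcessor, the node names "Source", "UpperCase" and "Sink", and the topic names are hypothetical. The processor upper-cases each value, and the topology wires a source processor, the custom processor, and a sink processor together.

```scala
import org.apache.kafka.streams.processor.{AbstractProcessor, Processor, ProcessorSupplier, TopologyBuilder}

// Custom processor (hypothetical): upper-cases every value and forwards it downstream.
class UpperCaseProcessor extends AbstractProcessor[String, String] {
  // Called once for every key-value pair that reaches this node.
  override def process(key: String, value: String): Unit =
    if (value != null) context().forward(key, value.toUpperCase)
}

object ProcessorApiSketch {

  def buildTopology(): TopologyBuilder = {
    val builder = new TopologyBuilder

    // Source processor: no upstream processors, reads from the input topic.
    builder.addSource("Source", "InTopic")

    // Our processing step, attached to the source node as its parent.
    builder.addProcessor("UpperCase", new ProcessorSupplier[String, String] {
      override def get(): Processor[String, String] = new UpperCaseProcessor
    }, "Source")

    // Sink processor: no downstream processors, writes to the output topic.
    builder.addSink("Sink", "OutTopic", "UpperCase")
    builder
  }
}
```

The resulting TopologyBuilder can be passed to the KafkaStreams constructor together with the streams configuration, in the same way as the KStreamBuilder in the quickstart.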

Code In Action: Quickstart

To start working with Kafka Streams, the following dependency must be included in the sbt project:

    libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.11.0.0"

The following imports are required for the application:

    import java.util.Properties

    import org.apache.kafka.common.serialization.{Serde, Serdes}
    import org.apache.kafka.streams.KafkaStreams
    import org.apache.kafka.streams.StreamsConfig._
    import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder}

Next, we have to set up some configuration properties for Kafka Streams:

    val streamsConfiguration = new Properties()
    streamsConfiguration.put(APPLICATION_ID_CONFIG, "Streaming-QuickStart")
    streamsConfiguration.put(BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    streamsConfiguration.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
    streamsConfiguration.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)

Now we have to create an instance of KStreamBuilder, which in turn provides us with a KStream object:

val builder = new KStreamBuilder

The builder object has a stream method, which takes a topic name and returns a KStream subscribed to that topic:

    // The type annotation tells the compiler the key and value types of the stream
    val kStream: KStream[String, String] = builder.stream("InTopic")

On this ‘kStream’ object, we can use the many methods provided by the high-level DSL of Kafka Streams, such as ‘map’, ‘filter’, ‘transform’ and ‘join’, each of which gives us another KStream with that operation applied (‘process’ is also available, but it is a terminal operation that returns nothing). The last step is to send the processed data to another topic:

    // Characters of the values are converted to upper case
    val upperCaseKStream: KStream[String, String] = kStream.mapValues(_.toUpperCase)
    // Send the data to the output topic
    upperCaseKStream.to("OutTopic")

The last step is to start the streaming. For this step, we use the builder and the streaming configuration that we created:

    val stream = new KafkaStreams(builder, streamsConfiguration)
    stream.start()

So this was a simple example of the high-level DSL. To have more clarity on this, some examples are here. One scenario demonstrates the use of Kafka Streams to combine data from two streams (different topics) and send it to a single stream (topic), which is done using the High-Level DSL. Another one shows filtering of data using stateful operations on the value, using the Low-Level Processor API. Here is the link to Code Repository

Conclusion

So with Kafka Streams, we can now process stream data right alongside Kafka; no separate cluster is required just for processing. The functional operations provided by the High-Level DSL make it much easier to use, although the DSL limits the ways in which we can process the data. For those situations, the lower-level Processor API is there to implement custom logic.

I hope it was of some help. 🙂

References: