By Dori Rabin

There’s a lot of discussion online about the correct use cases for stream processing versus batch processing. In modern data platforms, we often need both methods, and sometimes the use case dictates which one to choose. When low latency is required and data needs to be processed fast, stream processing is probably the better fit. However, when we need to run complex analytics on large volumes of data that reside in different data sources (such as machine learning, billing, monthly reports, etc.), we will generally apply batch processing.

But what happens when we need to apply the same kind of analytics on different time window configurations? Imagine we have a streaming workflow that emits minute-level aggregations and we need to consolidate all those aggregations into longer time intervals, such as days, weeks or even months. If the per-minute aggregations need to be available with minimum latency, a streaming framework is the right choice. On the other hand, for aggregations on larger time-windows (day, week and month) and for historical data, we will probably need to read large volumes of data from one of our data-stores, process the data and store the results. In that case, it makes sense to use some ETL with batch analytics.

But what about all the streaming analytics code that we created in the streaming workflow? Can we reuse it in the batch ETL code? Let’s take a look.

Apache Flink for Stream Processing

The Apache Flink framework shines in the stream processing ecosystem. It started a few years ago and became GA in 2016. Today it has a very active and thriving open source community with more than 500 contributors (50 to 100 of them active) and more than 16K commits.

Apache Flink is now established as a very popular technology, used by large companies such as Alibaba, Uber, eBay, Netflix and many more. It typically sits in the backbone of a data platform, consuming data streams from sources such as Kafka or cloud services such as Amazon Kinesis and processing them in a scalable and reliable manner.

Flink excels at real-time stream processing workloads where the data is split into very small time windows or even processed one message at a time. This enables high-velocity, low-latency processing pipelines used in complex real-world use cases like Alibaba's Singles' Day real-time dashboard.

The main API Apache Flink provides for stream processing tasks is the DataStream API. A DataStream represents an immutable collection of data: it is created by adding a source such as FlinkKafkaConsumer, and can then be transformed using operators like map, flatMap, filter and others.

In the following example, we’re reading a message stream of online purchase transactions from Kafka and producing the sum of all purchases per product. Each purchase transaction is represented by the ProductPurchase class, and each product is identified by a unique productId.
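A minimal sketch of such a streaming job, assuming Flink's DataStream API with the Kafka connector on the classpath; the topic names, the ProductPurchase accessors, and the JsonMapper/DataHelper helper classes are illustrative stand-ins for the original code, not the author's exact implementation:

```java
// Sketch only: requires Flink and the Flink Kafka connector as dependencies.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new FlinkKafkaConsumer<>("purchases", new SimpleStringSchema(), props))
   // JSON -> POJO: a reusable, framework-neutral mapper (illustrative helper)
   .map(json -> JsonMapper.toProductPurchase(json))
   // group by product and sum the purchases per product
   .keyBy(ProductPurchase::getProductId)
   .reduce((a, b) -> DataHelper.sumProducts(a, b))
   // POJO -> JSON and write the per-product sums back to Kafka
   .map(p -> JsonMapper.toJson(p))
   .addSink(new FlinkKafkaProducer<>("purchase-sums", new SimpleStringSchema(), props));

env.execute("product-purchase-sums");
```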

However, as discussed in the introduction, sometimes the data resides in databases or data lakes, and we’d like to run the same processing pipeline on it that we use in the streaming case. Here, latency considerations matter less, since we’d like to process a large but finite volume of data. Naturally, the solution is a batch job that can read and process large amounts of data. For this, Flink provides support for batch data processing via the DataSet API. If we convert our code to batch processing, the new code will look something like this:
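A sketch of the batch variant using Flink's DataSet API; the file paths and the JsonMapper/DataHelper helper classes are again illustrative assumptions, and the map and reduce logic is unchanged from the streaming job:

```java
// Sketch only: requires Flink as a dependency; paths are illustrative.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

env.readTextFile("file:///data/purchases.json")
   // same reusable JSON -> POJO mapper as in the streaming job
   .map(json -> JsonMapper.toProductPurchase(json))
   // group by the POJO field and apply the same reusable reducer
   .groupBy("productId")
   .reduce((a, b) -> DataHelper.sumProducts(a, b))
   // same reusable POJO -> JSON mapper, written out as text
   .map(p -> JsonMapper.toJson(p))
   .writeAsText("file:///data/purchase-sums.json");

env.execute("product-purchase-sums-batch");
```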

Flink Batch Limitations

Although batch processing is supported as part of Apache Flink, it still has some shortcomings:

Limited choice of data sources: Flink batch supports only reading from a local filesystem. For batch jobs, however, we generally expect a framework capable of reading from many data sources, for example distributed file systems, or databases like PostgreSQL, MySQL or NoSQL databases (Cassandra/MongoDB/Elasticsearch).

Lack of fault tolerance: Flink streaming achieves reliability through its checkpoint mechanism, which does not work as expected in batch mode. A long-standing issue asking for this support (https://issues.apache.org/jira/browse/FLINK-4256) has been open for quite some time, but apparently the Flink community does not prioritize it.

Weak community adoption: Flink is not widely used for implementing batch processing. We could not find many posts, questions or troubleshooting resources regarding Flink batch issues.

The current leading framework for implementing ETL jobs for batch processing is Apache Spark. However, if you already have your stream processing code written in Apache Flink — you will likely want to reuse it. First, let’s go over a short intro to Spark and then we’ll see how you can use your Flink streaming code with Apache Spark.

Spark Batch

Apache Spark was developed at UC Berkeley's AMPLab, starting in 2009, in response to limitations in the MapReduce cluster computing paradigm. At its core are resilient distributed datasets (RDDs): immutable collections of data items distributed over a cluster of machines and processed via a series of transformations in a scalable and fault-tolerant way.

Today, Apache Spark is the go-to framework for big-data batch processing. It has a huge and very thriving community with more than 1300 contributors and more than 24K commits.

In converting our example to use Apache Spark, the code will look like this:
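A sketch of the Spark variant, assuming Spark's Java RDD API; the paths and the JsonMapper/DataHelper helpers are illustrative, and the reducer delegates to the same framework-neutral logic used in the Flink jobs:

```java
// Sketch only: requires Spark as a dependency; paths are illustrative.
SparkConf conf = new SparkConf().setAppName("product-purchase-sums");
JavaSparkContext sc = new JavaSparkContext(conf);

sc.textFile("hdfs:///data/purchases.json")
  // same reusable JSON -> POJO mapper
  .map(json -> JsonMapper.toProductPurchase(json))
  // key each purchase by productId for the aggregation
  .mapToPair(p -> new Tuple2<>(p.getProductId(), p))
  // same reusable reducer, wrapped to fit Spark's pair-RDD signature
  .reduceByKey((a, b) -> DataHelper.sumProducts(a, b))
  // same reusable POJO -> JSON mapper, written out as text
  .map(pair -> JsonMapper.toJson(pair._2()))
  .saveAsTextFile("hdfs:///data/purchase-sums");

sc.stop();
```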

Code Reuse is Achieved

As you can see, the skeleton of both main flows looks the same:

1. Read JSON from the file system (or a Kafka topic)
2. Convert the JSON to a ProductPurchase POJO (map function): the mapper function is reused
3. Run aggregations over the ProductPurchase class (sum all the purchase transactions): the reducer function is reused
4. Convert the ProductPurchase back to its JSON representation: another mapper function that can be reused
5. Write the JSON dataset/RDD back to the filesystem or a Kafka topic

The main difference noted here is that for Flink we use the class FlinkProductReducer and for Spark we use SparkProductReducer, but both are just thin wrapper classes around the same code: after converting Spark tuples to Flink tuples, the Spark reducer calls the same DataHelper.sumProducts() method that the Flink reducer calls. In general, most of the logic of a Flink/Spark job lives inside the map and reduce functions. In our case, hundreds of lines of application logic, type validations, transformations and aggregations are defined in a framework-neutral way (without framework dependencies).
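As a concrete illustration, here is a minimal, framework-neutral sketch of what such shared logic might look like; the ProductPurchase fields and the sumProducts signature are assumptions modeled on the post, not the author's actual code:

```java
// Hypothetical sketch of the framework-neutral core that both wrappers share.
// ProductPurchase's fields and sumProducts' signature are assumptions.
public class DataHelper {

    // Minimal POJO standing in for the ProductPurchase class from the post.
    public static class ProductPurchase {
        public final String productId;
        public final double amount;

        public ProductPurchase(String productId, double amount) {
            this.productId = productId;
            this.amount = amount;
        }
    }

    // Framework-neutral reducer: both FlinkProductReducer and SparkProductReducer
    // would delegate here after adapting their respective tuple types.
    public static ProductPurchase sumProducts(ProductPurchase a, ProductPurchase b) {
        // Assumes both records carry the same productId, which the surrounding
        // keyBy (Flink) or reduceByKey (Spark) guarantees.
        return new ProductPurchase(a.productId, a.amount + b.amount);
    }

    public static void main(String[] args) {
        ProductPurchase total = sumProducts(
            new ProductPurchase("p-1", 10.0),
            new ProductPurchase("p-1", 5.5));
        System.out.println(total.productId + " -> " + total.amount); // prints: p-1 -> 15.5
    }
}
```

Because this class has no Flink or Spark imports, it can be unit-tested and shared across both jobs as-is.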

Conclusions

Apache Flink and Apache Spark have given the open source community great stream processing and batch processing frameworks that are widely used today across many use cases.

We saw in this blog post an example of Flink code from a typical streaming workflow converted for reuse in a Spark batch workflow. This shows that large parts of a codebase using Flink can be shared and adapted for use in Spark, and vice versa, which allows us to leverage each framework for the parts it excels at while saving a lot of development time.