




This is a first blog post introducing Apache Beam, before going into more detail about Beam runners in the next posts.





What is Apache Beam?





To define Apache Beam, let's start with a quote:

"Batch / streaming? Never heard of either."

(Batch is nearly always a sub-part of the more general streaming problem.)





Beam is a programming model for building big data pipelines. Its particularity is that its API is unified between batch and streaming: the only difference between a batch pipeline and a streaming pipeline is that batch pipelines deal with finite data whereas streaming pipelines deal with infinite data. Batch can therefore be seen as a sub-part of the streaming problem, and Beam provides windowing features that divide infinite data into finite chunks.





Let's take a look at the Beam stack:

(Diagram: the Beam stack, showing primitive transforms and composite transforms.)







A simple Beam pipeline





Let's take a look at a first simple Beam pipeline:





This is the usual Hello World-type big data pipeline: it counts the occurrences of the different words in a text file. It reads a text file from Google Cloud Storage and writes the result of the count back to Google Cloud Storage. This is a very simple straight-line pipeline; not all pipelines have to be straight that way, but it serves as a baseline example for the rest of the blog and illustrates some key concepts of Beam:

Pipeline: the user interaction object.

PCollection: Beam's abstraction of a collection of elements spread across a cluster.

TextIO: the IO used to read from and write to text files. The reading part of the IO is in reality a Read transform, and the writing part is in reality a Write transform.

Count: a Combine transform that counts occurrences.

FlatMapElements and MapElements: composite transforms built on ParDo that are there for convenience.

The resulting DAG

When Beam executes the above pipeline code, the SDK first creates a graph that represents the pipeline. It is known as the DAG (Directed Acyclic Graph). For each transform of the pipeline code, a node of the DAG is created:

The Read node corresponds to the Read transform of TextIO (the input of the pipeline).

The Write node corresponds to the Write transform of TextIO (the output of the pipeline).

All the other nodes are a direct representation of the transforms of the pipeline. Please note that only the Read transform is a primitive transform; all the others are composite transforms.

But what about the runner?

This is when the runner enters the scene. The job of the runner is simply to translate the pipeline DAG into native pipeline code for the targeted Big Data platform. It is this code that will be executed by a Big Data cluster such as Apache Spark or Apache Flink.



You want to know more about the runner? The next article describes what a runner is and uses the Spark runner as an example.

The pipeline is written using one of the several SDKs (Java, Python or Go) that Beam provides to the user. The SDK is a set of libraries of transforms and IOs that allow the user to input data into the pipeline, do some computation on this data, and then output it to a given destination. Once the pipeline is written, the user chooses a Big Data execution engine such as Apache Flink, Apache Spark or others to run it. When the pipeline is run, the runner first translates it to native code for the chosen Big Data engine. It is this native code that is executed by the Big Data cluster.

The Beam SDK contains a set of transforms that are the building blocks of the pipelines the user writes. There are only three primitives:

ParDo: the good old flatMap that processes a collection element per element in parallel, applying to each a function called a DoFn.

GroupByKey: groups the elements that have a common key. The groups can then be processed in parallel by the next transforms downstream in the pipeline.

Read: the way to input data into the pipeline by reading an external source, which can be a batch source (like a database) or a streaming (continuously growing) source (such as a Kafka topic).

All the other transforms available in the SDK are actually implemented as composites of the three previous ones. As an example, the Combine transform of Beam, which allows doing a computation on data spread across the cluster, is implemented as a composite of these primitives. Other examples of composite transforms provided by the SDK are Count or MapElements, which we saw above. But the user can also create his own composites, and composites of composites.