Google Cloud Dataflow is a Google Cloud Platform product that helps you ingest and transform data coming from streaming or batch data sources.

At Roobits, we use Dataflow pipelines extensively to ingest events and transform them into the data our customers need.

Dataflow is also serverless and auto-scales based on the input load, an added bonus on top of the flexibility it already provides.

Dataflow essentially requires you to write the logic to be performed on incoming events from a source (which could be PubSub, Apache Kafka, or even a file!) and then deploy that logic on Google’s servers.

Dataflow allows you to write this logic in Java, Kotlin, or Python.

A very simple example of a Dataflow pipeline that takes an input paragraph and counts the words in it is as follows.
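This is a minimal sketch, closely modeled on the canonical Apache Beam WordCount example; the Cloud Storage paths are placeholders you would swap for your own:

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    // Picks up --runner, --project, etc. from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read the input text, one line at a time.
        .apply("ReadLines", TextIO.read().from("gs://<YOUR_BUCKET>/input.txt"))
        // Split each line into individual words.
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // Drop empty strings produced by consecutive separators.
        .apply("DropEmpty", Filter.by((String word) -> !word.isEmpty()))
        // Count the occurrences of each distinct word.
        .apply("CountWords", Count.perElement())
        // Format each word/count pair as a readable line.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> wc) -> wc.getKey() + ": " + wc.getValue()))
        // Write the results back out to Cloud Storage.
        .apply("WriteCounts", TextIO.write().to("gs://<YOUR_BUCKET>/wordcount/output"));

    pipeline.run();
  }
}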

While the code here might look complicated, you can refer to the Apache Beam documentation to learn more about what’s happening.

To deploy this code on your Google Cloud project, run the following:

java -jar wordcount.jar \
  --runner=DataflowRunner \
  --project=<YOUR_GCP_PROJECT_ID>
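In practice, a Dataflow deployment usually needs a couple of extra options as well. As a sketch (the region and bucket path below are placeholder values you would replace with your own):

java -jar wordcount.jar \
  --runner=DataflowRunner \
  --project=<YOUR_GCP_PROJECT_ID> \
  --region=us-central1 \
  --tempLocation=gs://<YOUR_BUCKET>/temp

Here, --tempLocation tells Dataflow which Cloud Storage bucket to use for staging and temporary files.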

While this works well, there are certain pricing concerns to keep in mind if you plan on scaling the pipeline as it is.

Let’s look at them one by one.

Reducing the disk size

By default, the disk size for a Dataflow pipeline is set to 250GB for a batch pipeline and 400GB for a streaming pipeline.

If you are processing the incoming events in memory, this is mostly a wasted resource, so I’d suggest reducing this parameter to 30GB or less (the minimum recommended value is 30GB, but we faced no issues running the pipeline at 9–10GB of Persistent Disk).

You can do so by specifying the disk size as follows while deploying your pipeline:

--diskSizeGb=30
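Putting it together with the deploy command from earlier, the full invocation would look like this:

java -jar wordcount.jar \
  --runner=DataflowRunner \
  --project=<YOUR_GCP_PROJECT_ID> \
  --diskSizeGb=30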

Now, looking at the Google Cloud Pricing Calculator, reducing this value saves us around $20 per month per worker.