When you build a data processing pipeline, your data is either Bounded (of a defined size) or Unbounded (with no defined size).

A text file containing every word in the Oxford Dictionary is an example of a Bounded data source, while the Twitter stream of all Tweets containing a specific #hashtag is an example of an Unbounded one.
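
To make the distinction concrete, here is a minimal plain-Java sketch (it does not use the Dataflow SDK). The class name, the word list and the "tweet-N" ids are hypothetical stand-ins: a fixed list plays the role of the dictionary file, and an infinite stream plays the role of the Tweet stream.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BoundedVsUnbounded {
    // Bounded: the whole data set exists up front, so its size is known.
    static List<String> boundedWords() {
        return List.of("aardvark", "abacus", "abandon");
    }

    // Unbounded: elements keep arriving and there is no "last" element.
    // Stream.iterate produces an infinite sequence of hypothetical tweet ids.
    static Stream<String> unboundedTweets() {
        return Stream.iterate(1, i -> i + 1).map(i -> "tweet-" + i);
    }

    public static void main(String[] args) {
        // The size of a bounded source can simply be asked for.
        System.out.println(boundedWords().size());

        // An unbounded source can only be consumed in finite slices.
        List<String> firstFive = unboundedTweets().limit(5).collect(Collectors.toList());
        System.out.println(firstFive);
    }
}
```

Asking the infinite stream for its total count would never return, which is roughly why a pipeline that expects to finish cannot read an unbounded source.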

Google Cloud Dataflow handles both types of data source by letting you create either a Streaming or a Batch pipeline, which handle Unbounded and Bounded data sources respectively.

In our use case, we had to process data coming from an Unbounded source but we didn’t want to process these events immediately; instead we wanted to do an operation on the collected events once every 2 minutes.

We tried using a Batch pipeline to do this job, but Dataflow cannot read unbounded sources in batch mode. This is by design: a batch pipeline expects to read a finite amount of data and to terminate once it has processed it.

When we switched to a Streaming pipeline, Dataflow expected us to operate on the events as soon as they entered the pipeline, which was not what we wanted.

How did we fix this?

Enter Sharding!

Dataflow allows developers to write all the events entering the pipeline to files so that they can be processed at a later point in time.

Sharding is supported by almost every IO transform in Dataflow; for example, I have used it with BigQueryIO and TextIO, but it is also available with KafkaIO, AvroIO and others.

Sharding the incoming events asks for 2 things:

1. The number of shards to create
2. The triggering frequency

The number of shards is essentially the number of files that Dataflow should create to write the incoming events to.

So if you set the number of shards to 10, Dataflow will create 10 files and distribute your incoming events among them.
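
As a rough illustration of what spreading events over shards means, here is a plain-Java sketch. It is not Dataflow's implementation: the ShardingSketch class and the round-robin assignment are assumptions made for the example, and each in-memory list stands in for one output file.

```java
import java.util.ArrayList;
import java.util.List;

class ShardingSketch {
    // Spread events over numShards buckets in round-robin order.
    // Each bucket stands in for one file that a sharded write would produce.
    static List<List<String>> shard(List<String> events, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < events.size(); i++) {
            shards.get(i % numShards).add(events.get(i));
        }
        return shards;
    }

    public static void main(String[] args) {
        // 25 hypothetical events spread over 10 shards.
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            events.add("event-" + i);
        }
        List<List<String>> shards = shard(events, 10);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + " holds " + shards.get(i).size() + " events");
        }
    }
}
```

With 25 events and 10 shards, the first five shards end up with 3 events each and the remaining five with 2, so no single file holds everything.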

Having a high number of shards means that if one of the files gets corrupted, the rest of the events remain safe in the other files, which reduces the risk of a single point of failure.

The Triggering frequency is essentially the time after which you want to read the above files.

It can be any duration you want, but keep in mind that a longer interval means larger files!
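
The effect of the triggering frequency on batch size can be sketched in plain Java. This is only a simulation under assumptions: the TriggeringSketch class and the arrival times are made up, and it uses java.time.Duration, whereas the Dataflow SDK takes a Joda-Time Duration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TriggeringSketch {
    // Group event arrival times (milliseconds since pipeline start) into
    // consecutive intervals of length `frequency`. Each map entry stands in
    // for one flush of the collected events.
    static Map<Long, List<Long>> batch(List<Long> arrivalMillis, Duration frequency) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : arrivalMillis) {
            long interval = t / frequency.toMillis();
            batches.computeIfAbsent(interval, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // One hypothetical event every 30 seconds for 6 minutes (12 events).
        List<Long> arrivals = new ArrayList<>();
        for (long t = 0; t < 360_000; t += 30_000) {
            arrivals.add(t);
        }
        System.out.println(batch(arrivals, Duration.ofMinutes(2)).size() + " batches at 2 minutes");
        System.out.println(batch(arrivals, Duration.ofMinutes(3)).size() + " batches at 3 minutes");
    }
}
```

The same 12 events form 3 batches of 4 at a 2 minute frequency, but only 2 batches of 6 at 3 minutes: fewer flushes, each one larger.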

And that’s it!

For our use case, we wanted to process and load the events into BigQuery once every 2 minutes, so we changed our Dataflow pipeline to:

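.apply("Write to Custom BigQuery",
    BigQueryIO.writeTableRows()
        // Number of files the buffered events are spread across
        .withNumFileShards(30)
        // How often the buffered files are loaded into BigQuery
        .withTriggeringFrequency(Duration.standardMinutes(2))
        // Use load jobs from files instead of streaming inserts
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withSchema(tableSchema)
        .to(table));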

from:

.apply("Write to Custom BigQuery",
    BigQueryIO.writeTableRows()
        .withSchema(tableSchema)
        .to(table));

All it took was three extra lines of code to convert our Streaming pipeline into one that micro-batches the incoming events at the specified interval.

Note: Dataflow stores these files in a Google Cloud Storage bucket by default, so using this approach may incur GCS storage charges.
