BigQuery storage

The first thing we need is unlimited storage. The obvious choice is Cloud Storage because it’s the cheapest, but it has the limitation that it’s not ideal for streaming data into it. The other service with virtually unlimited storage is BigQuery. Surprisingly, it has exactly the same price tag as Cloud Storage, and it has a streaming insert API. It even automatically drops the price by 50% per partition if you don’t ingest data into that partition for 90 days.

Looking at all these characteristics, I think we’ve found our ideal storage engine. It even comes with the added bonus of a nice web interface that gives you debugging power and some insight into your streaming data.

Designing a schema is straightforward: just store everything in the Pub/Sub message. Luckily BigQuery has a binary data type, so we can store any kind of message. I also prefer to store both the message timestamp and the processing timestamp, so you at least know when the message was backed up; ideally the two should be very close. I stream different topics into one table, so I’ll add the topic to the schema as well (your requirements could be different and might justify a table per topic). Also, don’t forget to back up any attributes that could be attached to your messages.
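As a sketch, such a schema could look like the JSON below (the field names are my own illustrative choices, not prescribed by anything):

```json
[
  {"name": "topic",               "type": "STRING",    "mode": "REQUIRED"},
  {"name": "payload",             "type": "BYTES",     "mode": "REQUIRED"},
  {"name": "messageTimestamp",    "type": "TIMESTAMP", "mode": "NULLABLE"},
  {"name": "processingTimestamp", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "attributes",          "type": "RECORD",    "mode": "REPEATED",
   "fields": [
     {"name": "key",   "type": "STRING", "mode": "REQUIRED"},
     {"name": "value", "type": "STRING", "mode": "NULLABLE"}
   ]}
]
```

The repeated record for attributes keeps the table generic: any message, from any topic, with any set of attributes fits in the same five columns.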

Cloud Dataflow backup pipeline

Now that we have our storage system, we need something to take the actual backup. Cloud Dataflow seems like the ideal candidate: it uses the Apache Beam model, which has the same semantics for streaming and batch. This means you can reuse the same code later when you want to read your data back from the backup in batch mode.

Cloud Dataflow is also a managed service, which helps bring down the operational cost. So, throwing all the ingredients into the mix (Cloud Pub/Sub, BigQuery and Cloud Dataflow), we now have to think about the code.

This is surprisingly easy. Reading the Pub/Sub subscription of the topic you want to back up is a no-brainer. Note that each subscription has a retention of 7 days, giving you enough time to update your pipeline. Next, transform each message into a TableRow for ingestion into BigQuery, then flatten all your subscriptions so you only have one output table. The only tricky part is that you want a partition per day, but I’ve already touched upon automatic BigQuery partitioning in an Apache Beam pipeline in a previous article, so that’s solved as well.
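The heart of that transform step is turning one Pub/Sub message into one BigQuery row. Here’s a minimal sketch in Python, assuming a table with topic, payload, both timestamps and repeated attributes (the function name and field names are my own, not from any SDK):

```python
import base64
from datetime import datetime, timezone

def message_to_row(topic, data, attributes, publish_time):
    """Turn one Pub/Sub message into a BigQuery row in dict form.

    topic:        the Pub/Sub topic the message came from
    data:         the raw message payload (bytes)
    attributes:   dict of message attributes (may be empty)
    publish_time: the message timestamp as a timezone-aware datetime
    """
    return {
        "topic": topic,
        # BYTES columns are sent base64-encoded in streaming inserts
        "payload": base64.b64encode(data).decode("ascii"),
        "messageTimestamp": publish_time.isoformat(),
        # processing time: when the backup pipeline handled the message
        "processingTimestamp": datetime.now(timezone.utc).isoformat(),
        # repeated record of key/value pairs keeps the schema generic
        "attributes": [{"key": k, "value": v} for k, v in attributes.items()],
    }
```

In the actual pipeline this logic would live inside a DoFn (or a map function in the Python SDK), with one such transform per subscription feeding into the single flattened output.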

Conclusion

That’s all there is to it. It’s one of the simplest Beam pipelines you can build, and it can handle dozens of Pub/Sub topics at the same time while streaming backups into BigQuery. Running this pipeline gives you the following benefits:

Cheapest storage for your message backup

Automatic 50% discount after 90 days of storage

7-day Pub/Sub subscription buffer

Only one Cloud Dataflow pipeline running

Maybe it’s not as powerful as the Kafka storage API, but it comes pretty close, and without the worries of running a Kafka cluster.