Cloud Data Fusion is a great addition to the growing number of data tools available on Google Cloud Platform. Using Google Cloud Data Fusion, we at ML6 can bridge the gap between code-based data transformation tools such as Google Cloud Dataflow and more traditional UI-based ETL and data integration tools.

Google Cloud Data Fusion batch and streaming pipelines are executed on Google Cloud Dataproc. Since CDAP, the open source project behind Cloud Data Fusion, has been in production for 5+ years, it is compatible with Hadoop environments on Azure, AWS and of course Google Cloud. This will be valuable for hybrid cloud scenarios.

On the 28th and 29th of March 2019, I attended the Cloud Data Fusion training by the CDAP team for EMEA partners.

These are my main highlights.

The concepts (source, sink, transform, …) are very familiar building blocks for users of ETL tools, so the learning curve to build a scalable serverless data pipeline is low.

The “Wrangler” transformation is the Swiss Army knife of Google Cloud Data Fusion. It offers an efficient, familiar visual interface to transform the data column by column. Extra functionality can be added using custom-developed “directives”.

The Scala, JavaScript and Python transformation steps let you write custom code, and each of them can output multiple records.
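To make the "multiple records" point concrete, here is a minimal sketch in plain Python. It is modeled on the `transform(record, emitter, context)` shape that such script steps typically use; treat that exact signature as an assumption. `ListEmitter` and the record fields (`id`, `tags`) are hypothetical stand-ins so the snippet runs outside a pipeline.

```python
from typing import Any, Dict, List


class ListEmitter:
    """Hypothetical stand-in for the emitter a pipeline would pass in."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def emit(self, record: Dict[str, Any]) -> None:
        # Collect every emitted record so we can inspect the output.
        self.records.append(record)


def transform(record, emitter, context=None):
    """Emit one output record per tag found in the input record."""
    for tag in record["tags"]:
        emitter.emit({"id": record["id"], "tag": tag})


emitter = ListEmitter()
transform({"id": 7, "tags": ["city", "electric"]}, emitter)
print(emitter.records)
```

One input record with two tags results in two output records, which is the pattern you need when flattening arrays or splitting composite fields.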

Excellent support for flat and nested data. Schemas are automatically detected or can be manually defined as Avro schemas.
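As an illustration of what a manually defined schema for nested data could look like, the sketch below builds an Avro record schema as a Python dictionary and serializes it to JSON. The record and field names (`Station`, `location`, `bikes_available`) are made up for this example, and `field_paths` is a small hypothetical helper that lists the flattened field names.

```python
import json

# An Avro record schema with a nested record and a nullable field.
station_schema = {
    "type": "record",
    "name": "Station",
    "fields": [
        {"name": "station_id", "type": "string"},
        {
            "name": "location",
            "type": {
                "type": "record",
                "name": "Location",
                "fields": [
                    {"name": "lat", "type": "double"},
                    {"name": "lon", "type": "double"},
                ],
            },
        },
        # A union with "null" makes the field optional.
        {"name": "bikes_available", "type": ["null", "int"]},
    ],
}


def field_paths(schema, prefix=""):
    """Return dotted paths for all leaf fields, descending into nested records."""
    paths = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "record":
            paths.extend(field_paths(field_type, name + "."))
        else:
            paths.append(name)
    return paths


print(json.dumps(station_schema, indent=2))
print(field_paths(station_schema))
```

The JSON printed by the first `print` is the form you would paste or import as a schema definition; the second shows how the nested `location` record expands into `location.lat` and `location.lon`.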

Google Cloud Data Fusion pipelines are run as Google Cloud Dataproc jobs so you don’t need to manage any infrastructure.

One of the main benefits of the data pipelines is automatic data lineage and data preview. That’s extremely important in regulated or complex data integration environments.

The data pipelines can be parameterized using macros.
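CDAP macros use a `${name}` placeholder syntax in plugin properties, with the values supplied at run time. The sketch below only illustrates that idea in plain Python; it is not how Data Fusion resolves macros internally, and the property names, bucket and arguments are hypothetical.

```python
import re

# Matches ${name} placeholders and captures the name.
MACRO = re.compile(r"\$\{([^}]+)\}")


def resolve_macros(value, arguments):
    """Recursively replace ${name} placeholders with runtime argument values."""
    if isinstance(value, str):
        # A missing argument raises KeyError, so typos fail loudly.
        return MACRO.sub(lambda match: arguments[match.group(1)], value)
    if isinstance(value, dict):
        return {key: resolve_macros(item, arguments) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_macros(item, arguments) for item in value]
    return value


# Hypothetical properties of a source stage in a pipeline definition.
stage_properties = {
    "path": "gs://${bucket}/bikes/${run_date}/*.json",
    "format": "json",
}
runtime_arguments = {"bucket": "example-bucket", "run_date": "2019-03-28"}

resolved = resolve_macros(stage_properties, runtime_arguments)
print(resolved["path"])
```

The same pipeline definition can then be reused across environments or dates by changing only the runtime arguments.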

Since all configuration is JSON-based in the background, it’s easy to import and export schemas, directives and so on.

A wide range of plug-ins, directives and predefined data pipelines are already available in the marketplace called the “Hub”.

I would like to get a better view of Google Cloud Dataproc sizing and auto-scaling.

The support for windowing and late-arriving data is more advanced in Google Cloud Dataflow/Apache Beam.

Since most data pipelines are very specific, make sure to pick the right tool for each job. Combining several tools in one workflow should be possible using Google Cloud Composer, because a Google Cloud Data Fusion hook/operator is on the roadmap.

If you want to experiment with Google Cloud Data Fusion you can launch it in GCP. The first 120 hours for the basic edition are free each month.

It’s easy to set up the CDAP sandbox on your local machine; make sure you use a 64-bit JRE. Alternatively, use the VM or the Docker container.

In an upcoming blog post, we will build a Google Cloud Data Fusion pipeline that combines bike share availability data in JSON Lines format with master data in JSON.

About ML6:

We are a team of AI experts and the fastest-growing AI company in Belgium. With offices in Ghent, Amsterdam, Berlin and London, we build and implement self-learning systems across different sectors to help our clients operate more efficiently. We do this by staying on top of research and innovation and by applying our expertise in practice. To find out more, please visit www.ml6.eu.