Check-list: selecting the right data transformation service for Google BigQuery

Which Google Cloud service can get your data ready for analytics?

Photo by Kyle Glenn on Unsplash

The good news is, when it comes to moving and transforming data for analytics and AI/ML built on top of BigQuery, Google Cloud offers a plethora of native choices. The bad news is that you have to make a choice among many options to ensure the technology will fit your needs and prove to be sustainable over time.

This article is a check-list to help you choose the best technology (or combination of technologies) that is right for you. If you’re planning to leverage BigQuery for your analytics and data science projects, it means that you will need to clean, transform, and combine data in batches and/or micro-batches. For this reason, I have ruled out Google BigQuery Data Transfer Service, which can only move data into BigQuery but can’t transform it. Also, I limited my research to only native Google Cloud data transformation services, including Cloud Dataproc, Cloud Dataflow, Cloud Data Fusion, and Cloud Dataprep. Sure, transformations can also happen in BigQuery using SQL, which is quite a powerful language (I have spent years building applications leveraging SQL and stored procedures); however, SQL does not make a sustainable solution for enterprise data transformation services for many reasons and there are also good reasons why there are so many technologies that have been invented to transform data (this would need a complete article on its own).

So what are the questions you should ask yourself for selecting the right technology?

Question 1: Are you looking to write code in a programming language for your solution?

If software engineers or developers have to develop and integrate various components to create an application that includes data transformation at scale, then you should look at Cloud Dataflow or Cloud Dataproc. These are code-based solutions that require programming skills such as Java, Python, RestAPI, or Apache Spark. These require fairly technical knowledge. With these services, you already know which transformations need to be created, and you need a development tool to code these transformations. To figure out whether to choose Cloud Dataflow or Cloud Dataproc, read this article, which provides good arguments to use one or the other.

Question 2: is your data already in Google Cloud?

If your data is not yet in Google Cloud, Cloud Data Fusion, the Extract, Transform, and Load (ETL) solution from Google built on the open-source project CDAP, provides numerous connectors to bring the data into Google Cloud and BigQuery. Alternatively, you could leverage other ETL solutions you already have in-house that can bring the data into Google Cloud. Google recommends these ETL vendors, including Matillion, Fivetran, Informatica, and many more. ETL solutions such as Cloud Data Fusion are best to develop large-scale data pipelines to move data from on-prem and cloud sources for “hydrating” a data lake or a Data Warehouse (DWH). Data engineers, ETL developers, and data architects will benefit most from Cloud Data Fusion. They will be able to develop solid real-time and batch-data transformation pipelines. Cloud Data Fusion and ETLs are excellent in building stable data pipelines to move data routinely to a DWH or a data lake. They are not meant to create pipelines that evolve often or to put in the hands of some business users.

Question 3: does your data initiative need the flexibility to adapt to business requirements often? Does your project need, in addition to data engineers, self-service data transformation by business users or less technically savvy professionals?

Data preparation is the most recent evolution of data transformation technologies, bringing these capabilities in the hands of data-driven professionals who are less technical and often reside in a line of business (in addition to the usual technical data specialists such as data engineers). These professionals know the data better than anyone else however, they may lack the technical skills to clean, combine, and enrich the data to make it usable for their needs. Typically they would know Excel, Google Sheets, and SQL. Cloud Dataprep (developed in collaboration with Trifacta) offers a grid-based, machine learning-guided visual interface to enable agile data exploration and assessment to refine, standardize, and combine data for BigQuery for analytics. Users interact with the content of data to iteratively refine and bring together data for their downstream business-driven analytics.

Bonus question for Cloud Dataprep or Cloud Data Fusion: are you leaning more toward Cloud Dataproc or Cloud Dataflow?

Cloud Data Fusion generates Cloud Dataproc code to transform the data, while Cloud Dataprep generates some Dataflow code to transform the data. Both have their advantages based on your use case and your former experience. This could also be an element to weigh in favor of a particular technology to deliver analytics on top of BigQuery.

Ultimately, you may need to combine multiple solutions to create end-to-end analytics or machine learning solutions for BigQuery. This is quite often that we see an ETL to bring the data in a Cloud DWH or Data Lake and then Data Preparation technology to expose the data to the consumers.

The following table summarizes these technologies’ characteristics to guide your choice.

Originally published at www.trifacta.com