9 Common Mistakes with Cloud Data Fusion
Bhuvanesh · Apr 17

Cloud Data Fusion makes the data engineer's life easy. It's a fully managed ETL service where we build and deploy ETL pipelines by simply dragging and dropping components, and it supports both batch and real-time streams. Cloud Data Fusion already ships with plugins and connectors for most of the GCP data services such as BigQuery, GCS, Cloud SQL, and more. Behind the scenes it uses Dataproc clusters to run all the transformations and other steps, and we can choose Spark or MapReduce on the Dataproc cluster to run our ETL pipeline. If you are going to kick-start your data pipeline with Cloud Data Fusion, you may face some errors. Here are 9 common mistakes that we make with Data Fusion.

#1 Enabling wrangler failed:

As soon as you provision the Data Fusion instance, if you try to enable the Wrangler, you may see that it fails.

Enabling failed

Cannot find wrangler-service artifact

This is actually not an issue. Once the instance is available, just wait for 5 more minutes, because some background processes are still making all the services available to use. So wait for some time and try to enable it again. The System Admin tab will show the status of all the services.

#2 JDBC Connection Failed in Private Access:

Cloud Data Fusion can work entirely with your VPC to access the resources in your local network. We can enable private access while launching the instance. But even after you enable private service access on the VPC, you'll still get connection failed error messages. Here is the MySQL database connectivity error.

Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.

Just enabling private access at the VPC level will not help. If you are launching Data Fusion for the first time in your project, we have to peer our VPC with Data Fusion's tenant project. To do this, get the tenant project ID from the Cloud Data Fusion instance details.

On the Instance details page, copy your instance’s Service Account value. The tenant project ID is the portion between the “at” symbol (@) and the following period (.). For example, if the service account value is

cloud-datafusion-management-sa@r8170c9b5e7699803-tp.iam.gserviceaccount.com

then the tenant project ID is r8170c9b5e7699803-tp.


Go to your VPC and create the peering. The detailed steps are well documented in the GCP documentation on setting up VPC network peering for Data Fusion.
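If you prefer the command line, here is a minimal sketch of creating the peering with gcloud. The project, VPC, and peering names are placeholders, and the peer network inside the tenant project is usually named after the region and instance ID, so verify the exact name against the GCP documentation for your instance.

# my-project, my-vpc, and us-central1-my-instance are placeholders for your setup
gcloud compute networks peerings create datafusion-peering \
    --project=my-project \
    --network=my-vpc \
    --peer-project=r8170c9b5e7699803-tp \
    --peer-network=us-central1-my-instance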

#3 Subnetwork does not support Private Google Access

Once you deploy Cloud Data Fusion and try to run a pipeline, you may get an error message like this.

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Subnetwork 'dataproc' does not support Private Google Access which is required for Dataproc clusters when 'internal_ip_only' is set to 'true'. Enable Private Google Access on subnetwork 'dataproc' or set 'internal_ip_only' to 'false'.

You have enabled private access while launching the Data Fusion instance, but the subnet where the Dataproc clusters are going to launch is not enabled with Private Google Access. To solve this, go to VPC → Subnets → select the subnet where your Dataproc cluster will be launched, and turn on Private Google Access.
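The same setting can also be turned on from the command line; a minimal sketch, assuming the subnet is called dataproc and lives in us-central1:

# subnet name and region are placeholders, match them to your Dataproc compute config
gcloud compute networks subnets update dataproc \
    --region=us-central1 \
    --enable-private-ip-google-access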

#4 DataProc — Bucket does not exist

For staging purposes, Dataproc clusters need a GCS bucket. If the bucket is not available, the cluster's service account should have permission to create it.

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Google Cloud Storage bucket does not exist 'dataproc-4decb821-5d69-4c0c-b3ca-355d1947da6b-us-central1'.

To solve this issue, we need to pass an existing bucket name in the Dataproc compute config. And the service account that is going to be attached to the Dataproc VMs should have full access to this bucket.
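As a rough sketch, creating a staging bucket and granting the Dataproc service account full access to it could look like the following; the bucket name, project, and service account are placeholders:

# bucket name, project, and service account below are placeholders
gsutil mb -l us-central1 gs://my-datafusion-staging-bucket
gsutil iam ch serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com:roles/storage.admin \
    gs://my-datafusion-staging-bucket

After that, set this bucket name in the Dataproc compute config of the pipeline.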

#5 Worker Nodes count:

If you want to run a single-node Dataproc cluster, then we need to set the worker node count to 0; if you have a multi-node cluster, then we need at least 2 worker nodes.

java.lang.IllegalArgumentException: Invalid config 'workerNumNodes' = 1. Worker nodes must either be zero for a single node cluster, or at least 2 for a multi node cluster.

#6 Minimum Memory:

We need at least 3.5 GB of memory on the Dataproc cluster nodes (for both master and worker nodes).

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Multiple validation errors:

- Machine type 'custom-1-1024' does not have enough memory. Minimum memory required is 3584 MB.

#7 Minimum disk space for the root volume:

The image that is going to be used by the Dataproc VMs needs at least 15 GB for the root volume.

- Requested image requires minimum boot disk size of 15 GB; requested 10 GB
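The limits from #5, #6, and #7 all come from the Dataproc compute profile, so the cleanest fix is to adjust the profile itself. If you prefer to override them per run, Data Fusion/CDAP accepts runtime arguments prefixed with system.profile.properties. A sketch of such overrides is below; workerNumNodes appears in the error above, while the memory and disk property names are assumptions you should verify against your compute profile settings.

system.profile.properties.workerNumNodes=2
system.profile.properties.workerMemoryMB=4096
system.profile.properties.workerDiskGB=50

The last two property names are my assumption of the provisioner's naming; double-check them in the compute profile UI before relying on them.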

#8 Subnetwork must have purpose=PRIVATE

While running the pipeline, you may get this error message. It happens while provisioning the Dataproc cluster.

Failed to create cluster cdap-bhuvi-af411bd8-8020-11ea-b17a-5a566bc430a2: Subnetwork must have purpose=PRIVATE.

We need to pass the VPC and subnet names in the cluster properties; if you leave these options blank, you'll get this error. You can also verify the purpose of the subnet before using it, as shown below.
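A quick sketch for checking a subnet's purpose; the subnet name and region are placeholders:

# subnet name and region are placeholders
gcloud compute networks subnets describe dataproc \
    --region=us-central1 \
    --format="value(purpose)"

For a normal VPC subnet this should print PRIVATE; reserved subnets (for example proxy-only subnets) report other purposes and can't be used for the Dataproc cluster.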

#9 Edit the existing Pipeline:

Once you deploy the pipeline, it's not possible to edit the package. But you can use two alternative methods:

Clone the current package

Export the package as a JSON file

Then you can make the changes and save it under a different name. After that, delete the old package.
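If you want to script the export instead of using the UI, here is a rough sketch using the CDAP REST API that backs Data Fusion. The instance name, region, and pipeline name are placeholders, and the endpoint path is worth double-checking against the CDAP/Data Fusion documentation:

# instance, region, and pipeline name are placeholders
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe my-instance \
    --location=us-central1 --format="value(apiEndpoint)")
export AUTH_TOKEN=$(gcloud auth print-access-token)

# fetch the deployed pipeline (app) definition as JSON
curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"

Save the JSON, edit it, and then import it back as a new pipeline under a different name.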