Cloud Data Fusion — Import custom plugins and integrate AWS Aurora with GCS

Bhuvanesh · Apr 22

Cloud Data Fusion gives you an out-of-the-box, code-free ETL experience. Once we deploy it on GCP, we can just drag and drop components and build our ETL pipeline. By default, it integrates with almost all the data services in GCP (like BigQuery, Bigtable, Pub/Sub, etc.). But if you want to integrate other data sources like MongoDB, Cassandra, or AWS Aurora (MySQL, PostgreSQL), then we have to develop our own plugins. Luckily, Google and the CDAP team maintain a repo with plugins for a lot of different data sources. We just need to build them with Maven and import the resulting JAR as a plugin. Click the link below to get the list of plugins from the repo.

Build the Aurora MySQL Plugin

For this demo, we are going to take the MySQL plugin from the database plugins repo. It's a Java package, so we have to build it with Maven.

Install JDK 8 and Maven 3:

sudo apt update

sudo apt-get install openjdk-8-jdk

sudo apt install maven

sudo update-alternatives --config java

-- pick Java 8: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
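A quick sanity check before building; the version strings below are what OpenJDK 8 and Maven 3 typically print:

java -version   # should report 1.8.x

mvn -version    # should report Maven 3.x running on Java 1.8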

Build the package:

First, we have to clone the repo.

This repo contains plugins for multiple databases. You can build all the plugins in one shot, or build only the plugin you want. Let's build the whole package. The build also runs connectivity tests, but we can skip them.
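A minimal sketch of the clone step; the URL below is the CDAP data-integrations repo on GitHub and is an assumption on my part, so verify it against the plugin list linked above:

git clone https://github.com/data-integrations/database-plugins.git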

cd database-plugins

mvn clean install -D maven.test.skip=true

mvn clean package -D maven.test.skip=true

While building, it also checks database connectivity and so on. Here we just want to build the package, so pass the -D maven.test.skip=true flag. If the build succeeds, grab the JAR and JSON files from the target directory of each database plugin directory.
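To see where the artifacts ended up, something like this from the repo root lists every built JAR and JSON (the depth value is just a guess that covers module/target/file):

find . -maxdepth 3 -path '*/target/*' \( -name '*.jar' -o -name '*.json' \)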

If you face an error like the one below,

[ERROR] Failed to execute goal on project database-commons: Could not resolve dependencies for project io.cdap.plugin:database-commons:jar:1.3.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.6 at specified path /usr/lib/jvm/java-11-openjdk-amd64/../lib/tools.jar

then Maven picked up Java 11 instead of Java 8 (note the java-11-openjdk-amd64 path in the error). Switch the default back to Java 8 with sudo update-alternatives --config java, then run the build again.

mvn clean install
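If changing the system default isn't an option, pointing JAVA_HOME at JDK 8 for the current shell works too; the path below is where openjdk-8-jdk lands on Debian/Ubuntu:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

mvn clean install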

It's OK if it complains that there is no JDBC driver path for MySQL. Now build the plugin for Aurora MySQL.

Go to the aurora-mysql-plugin directory

mvn clean package -D maven.test.skip=true

mvn clean install -D maven.test.skip=true

Our plugin is ready. We have to upload the aurora-mysql-plugin-1.3.0-SNAPSHOT.jar and aurora-mysql-plugin-1.3.0-SNAPSHOT.json files from the database-plugins/aurora-mysql-plugin/target directory to Cloud Data Fusion.

Import the Plugin:

Launch your Data Fusion instance and go to Studio. [Enable private network access]

Just click on the green plus symbol and select Plugin.

Select your JAR file and then click on Next, then select your JSON file.

Click on Finish.

Our plugin is imported into Data Fusion. Similarly, we built a few more plugins like MySQL, Cloud SQL for MySQL, MongoDB, etc.

But these are just the plugins; if you want to connect to the database, we have to provide the JDBC driver as well.

Let's see how to use this plugin. First, we have to upload the MySQL JDBC driver.

You can download it from the MySQL website. Again, click on the green plus symbol and select Driver.
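For reference, the download can look like this; the exact version and URL are assumptions, so grab the current Connector/J link from the MySQL download page:

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.19.tar.gz

tar -xzf mysql-connector-java-8.0.19.tar.gz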

Select the JAR file you extracted from the MySQL driver archive.
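If in doubt, you can confirm the driver class is actually inside the JAR (the path assumes the extracted archive from the sketch above):

unzip -l mysql-connector-java-8.0.19/mysql-connector-java-8.0.19.jar | grep 'jdbc/Driver.class'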

Click on Next and fill in the following details.

Name : mysql-driver-8.0

Class Name : com.mysql.jdbc.Driver

Version : it’ll detect automatically

Description : something and something

Then click on Finish. (Note: with Connector/J 8.x the canonical driver class is com.mysql.cj.jdbc.Driver; the legacy com.mysql.jdbc.Driver name still works but is deprecated.)

Start the integration:

Source Setup: Aurora

Make sure you have a VPN set up between GCP and AWS.

Also, enable Private Network access on Data Fusion while launching the instance.
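A quick way to prove the tunnel works is to test the Aurora port from a VM inside the GCP VPC; the endpoint below is a placeholder:

nc -zv my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com 3306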

Now click on the MySQL icon and select the properties. For the Driver name, use mysql-driver-8.0 — this is the name we gave while uploading the JDBC driver. Then fill in the rest of the details like the RDS endpoint, username, and password. For the Import Query, just give a sample query like select * from mock_data limit 10 and click on Get Schema. If all your inputs are correct, it'll run the query on the database and detect the schema.
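If Get Schema fails, it helps to rule out connectivity and credentials with the plain mysql client from a VM in the same VPC (the hostname, user, and table here are placeholders based on the example above):

mysql -h my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com -u admin -p -e 'select * from mock_data limit 10'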

Sink Setup: GCS

From the Sink section, click on GCS. Add the properties like format, GCS bucket, etc.
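Once the pipeline has run, you can confirm the output landed with gsutil; the bucket name is a placeholder:

gsutil ls gs://my-datafusion-sink-bucket/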

Our pipeline is ready; now click on Save and then Deploy. Data Fusion uses Dataproc clusters for the data pipeline, so by default it comes with a set of configurations like the number of master and worker nodes, disk space, VPC, etc. We have to change these as per our network topology.

General settings:

Region — Your Main region

Zone — Dataproc Cluster availability zone.

Network — VPC name

Subnet — Dataproc cluster’s subnet.

GCS Bucket — Bucket name (without gs://) for staging the files.

Master Node, Worker Node — just set up as per your need.
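To look up the exact network and subnet names to paste in here, the standard gcloud listing commands do the trick (the region is a placeholder):

gcloud compute networks list

gcloud compute networks subnets list --regions=us-east1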

Now let's run the pipeline. It'll take about five minutes to spin up the Dataproc cluster. Meanwhile, if you face any issues, refer to our blog about some common mistakes that we make in Data Fusion.