Several days ago, I released a new project called spark-binlog. Before this project, if you wanted to incrementally sync data from MySQL, you needed a really big pipeline, which looks like the following picture:

You must use a tool like Canal to extract the binlog from MySQL and send it to Kafka, then build a streaming application based on Flink/Spark/Storm to consume it from Kafka again. Because the real goal is to sync the MySQL table through its binlog rather than to sync the binlog itself, we also need an upsertable storage, e.g. HBase or Kudu. Kudu is great because you can query it directly; with HBase, you have to wrap it with Phoenix or export the data from HBase to an HDFS table so it can be queried by Spark/Presto/Hive. This is a really unmaintainable and time-consuming pipeline.

We hope we can query the MySQL binlog directly and export it to a storage that supports upsert. It looks like this:

To accomplish such a simple pipeline, there are two requirements:

- MySQL binlog should be a Spark data source.
- The storage should support update/delete/append.
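To see why the storage must support update and delete, consider how binlog events map onto a table. Here is a minimal sketch in plain Python (the event shapes are hypothetical, not the actual MySQL binlog format or the spark-binlog API): replaying events onto an append-only store would leave stale rows, while a keyed upsert keeps the table in sync.

```python
# A minimal sketch of applying binlog-style events to an upsertable store.
# The event dictionaries are illustrative, not the real binlog format.

def apply_event(table, event):
    """Apply one insert/update/delete event to a dict keyed by primary key."""
    kind, key, row = event["type"], event["key"], event.get("row")
    if kind in ("insert", "update"):
        table[key] = row          # upsert: create or overwrite the row
    elif kind == "delete":
        table.pop(key, None)      # deletes require delete support in storage
    return table

table = {}
events = [
    {"type": "insert", "key": 1, "row": {"id": 1, "name": "a"}},
    {"type": "update", "key": 1, "row": {"id": 1, "name": "b"}},
    {"type": "insert", "key": 2, "row": {"id": 2, "name": "c"}},
    {"type": "delete", "key": 2},
]
for e in events:
    apply_event(table, e)
# table now holds only the final state of row 1
```

An append-only sink such as a plain HDFS table cannot express the update and delete steps above, which is exactly why the pipeline ends in HBase, Kudu, or Delta Lake.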

So I created two projects:

In this post, we will focus on spark-binlog.

The following picture shows how spark-binlog works.

When you start a Spark streaming application, or submit a new streaming job to an existing Spark service, the driver does several things:

1. First, choose an executor and start a fake MySQL slave on it to pull the binlog and write it to HDFS with a WAL (Write-Ahead Log).
2. Second, start a data source server in the same executor that runs the fake MySQL slave.
3. Finally, start the real Structured Streaming job to consume the binlog from the data source server, compute the result, and save it to Delta Lake.
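The three startup steps can be sketched as a small simulation. This is plain Python with illustrative names (Wal, DataSourceServer, fake_mysql_slave are assumptions for the sketch, not the spark-binlog classes), and the "WAL" is just an in-memory list standing in for HDFS:

```python
import threading

# Hypothetical sketch of the three startup steps. Names are illustrative,
# not the actual spark-binlog implementation; the WAL stands in for HDFS.

class Wal:
    def __init__(self):
        self.entries = []
        self.lock = threading.Lock()
    def append(self, entry):
        with self.lock:
            self.entries.append(entry)
    def read_from(self, offset):
        with self.lock:
            return self.entries[offset:]

def fake_mysql_slave(wal, binlog_events):
    # Step 1: pull the binlog (simulated here) and write it ahead to the WAL.
    for ev in binlog_events:
        wal.append(ev)

class DataSourceServer:
    # Step 2: runs in the same executor as the fake slave and answers
    # pull requests by reading from the WAL.
    def __init__(self, wal):
        self.wal = wal
    def fetch(self, offset):
        return self.wal.read_from(offset)

# Step 3: the streaming job pulls batches from the data source server.
wal = Wal()
events = [f"binlog-{i}" for i in range(5)]
slave = threading.Thread(target=fake_mysql_slave, args=(wal, events))
slave.start()
slave.join()
server = DataSourceServer(wal)
batch = server.fetch(0)
```

Running the slave on an executor rather than the driver keeps the binlog traffic and the WAL writes off the driver, which only orchestrates the three steps.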

However, the difficult points are:

- The binlog connector only supports asynchronous (push-style) consumption.
- Spark only supports a pull model for data sources.

To resolve this mismatch, we introduce a WAL based on HDFS. As soon as data arrives, we write it to HDFS. When other executors want to pull data, we fetch it from the WAL. Note that both the binlog server and the WAL support relocating (re-consuming from an earlier offset). We first try to relocate from the WAL, and only then from the binlog server, because relocating from the binlog server is a heavy operation.
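The relocate policy described above can be sketched as follows. This is a plain-Python illustration under stated assumptions (WalStore, BinlogServer, and resume are hypothetical names; offsets are simple integers), not the spark-binlog code:

```python
# Sketch of the relocate (re-consume) policy: try the WAL first and fall
# back to the binlog server only when the WAL no longer holds the offset.
# All names here are illustrative, not the spark-binlog implementation.

class WalStore:
    def __init__(self, start_offset, entries):
        self.start_offset = start_offset   # oldest offset still in the WAL
        self.entries = entries
    def relocate(self, offset):
        if offset < self.start_offset:
            return None                    # already trimmed from the WAL
        return self.entries[offset - self.start_offset:]

class BinlogServer:
    def __init__(self, entries):
        self.entries = entries
    def relocate(self, offset):
        # Heavy path: would re-read the binlog from MySQL in reality.
        return self.entries[offset:]

def resume(offset, wal, binlog_server):
    """Prefer the cheap WAL path; fall back to the binlog server."""
    data = wal.relocate(offset)
    if data is not None:
        return ("wal", data)
    return ("binlog", binlog_server.relocate(offset))

all_events = [f"e{i}" for i in range(10)]
wal = WalStore(start_offset=5, entries=all_events[5:])
server = BinlogServer(all_events)

src1, _ = resume(7, wal, server)   # offset still in the WAL: cheap path
src2, _ = resume(2, wal, server)   # offset trimmed: heavy binlog path
```

The design choice is the usual one for mismatched push/pull systems: the WAL decouples the async producer from pull-based consumers, and it also acts as a cheap replay buffer so the expensive source is only re-read when strictly necessary.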