At my employer Tjek we recently decided to rebuild our event pipeline and move it to Google BigQuery to reduce complexity in the stack and to remove some services that were no longer maintainable.

BigQuery offers a bunch of nice tools for querying and visualizing our data, which would enable a number of our internal teams to work directly with the data & share it with customers without having to make a request to the engineering department.

The new service that would replace the old HTTP service was easy to write, but then came the question of moving the historic data into BigQuery.

To migrate the old data, we had to backfill our entire dataset, accumulated over the last 10 years, into BigQuery and this is the story of how it was done.

Flashback

The platform this was done on has been described in an earlier post.

The build pipeline for the above platform is also described here.

State of Flux

The old event pipeline is a beast in itself. It consisted of multiple services written in Erlang, backed by a Kafka cluster & a Citus Postgres cluster:

An HTTP server that would receive events from our clients and publish them to a dirty events topic.

A deduplication service consuming the dirty events topic and publishing the new events to a clean events topic.

A backup service consuming the clean events topic and storing the events in S3.

A database service consuming the clean events topic and inserting the events into a Postgres cluster.

This meant that the events I needed to load into BigQuery were either in S3, or still in Kafka waiting to be written to the backups.

The Enemy Within

One of the first obstacles I ran into was that BigQuery does not allow punctuation in the field names of the JSON you are loading.
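Concretely, that means a key like `user.id` has to become something like `user_id` before loading. A minimal sketch of such a rewrite, assuming a recursive walk over the decoded JSON (the helper name is mine, not the tool's):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sanitizeKeys recursively rewrites map keys so they contain no
// punctuation, replacing every disallowed rune with an underscore.
// (Hypothetical helper; the real tool's naming may differ.)
func sanitizeKeys(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			clean := strings.Map(func(r rune) rune {
				if r >= 'a' && r <= 'z' || r >= 'A' && r <= 'Z' ||
					r >= '0' && r <= '9' || r == '_' {
					return r
				}
				return '_'
			}, k)
			out[clean] = sanitizeKeys(val)
		}
		return out
	case []interface{}:
		for i := range t {
			t[i] = sanitizeKeys(t[i])
		}
		return t
	default:
		return v
	}
}

func main() {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(`{"page.view":{"user.id":1}}`), &doc); err != nil {
		panic(err)
	}
	b, _ := json.Marshal(sanitizeKeys(doc))
	fmt.Println(string(b)) // {"page_view":{"user_id":1}}
}
```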

The second problem was that the data had to be pre-sorted by destination table before being loaded.

The backups & the Kafka topic contained 11 different event types mixed together.

This meant every single line of JSON we had in both Kafka & S3 needed to be transformed before being loaded into BigQuery…

Tomorrow Is Yesterday

Remember the backup service? It reads X number of events from the Kafka topic, saves the offset it has consumed up to, then gzips it all and ships it to S3.

The current setting was 100k events per topic partition, and there were 32 partitions for the topic, meaning there would likely be around ~3M events in the clean Kafka topic waiting to be backed up.

To get the events from Kafka, I figured I would just stop our backup service to have a stable offset to start consuming from, and pass the data through the same JSON parser I used for the event backup files.

Learning curve

Downloading some gzipped files and doing something to the content doesn't sound too bad, so off I went.

At the time we had ~37k files, totalling ~800GB of gzipped JSON from a period of roughly 10 years back in time, that all needed to be transformed.

Some years ago something similar was done by a former colleague when we moved to Citus. At that time the software used required 50 copies running and almost a whole week to complete.

I wanted this to be doable within a reasonable time so I wouldn't have to wait days for the results; "Bash + AWS CLI + jq" was out of the question. It needed to be fast and scalable, processing multiple files at a time.

The backup files we had were Kafka-centred and not designed around the event format.

The Kafka topic had 32 partitions, and for each there would be a file that could contain up to 11 different event types; on average the event timestamps would span 4–6 days in each file.

000.00000000000000000000-00000000000000999999.jsonl.gz

000.00000000000001000000-00000000000001999999.jsonl.gz

000.00000000000002000000-00000000000002999999.jsonl.gz

000.00000000000003000000-00000000000003999999.jsonl.gz

000.00000000000004000000-00000000000004999999.jsonl.gz

…

031.00000000000566000000-00000000000566099999.jsonl.gz
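Each filename encodes the partition and the offset range it covers, which the tool needs in order to schedule the work. A sketch of parsing it (the struct and field names are mine, and I assume the separator is a plain hyphen):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// backupFile is the metadata encoded in a backup filename:
// <partition>.<startOffset>-<endOffset>.jsonl.gz
// (Sketch; names are hypothetical.)
type backupFile struct {
	Partition  int
	Start, End int64
}

func parseName(name string) (backupFile, error) {
	base := strings.TrimSuffix(name, ".jsonl.gz")
	parts := strings.SplitN(base, ".", 2)
	if len(parts) != 2 {
		return backupFile{}, fmt.Errorf("unexpected name: %s", name)
	}
	part, err := strconv.Atoi(parts[0])
	if err != nil {
		return backupFile{}, err
	}
	offs := strings.SplitN(parts[1], "-", 2)
	if len(offs) != 2 {
		return backupFile{}, fmt.Errorf("unexpected offsets: %s", name)
	}
	start, err := strconv.ParseInt(offs[0], 10, 64)
	if err != nil {
		return backupFile{}, err
	}
	end, err := strconv.ParseInt(offs[1], 10, 64)
	if err != nil {
		return backupFile{}, err
	}
	return backupFile{Partition: part, Start: start, End: end}, nil
}

func main() {
	bf, err := parseName("031.00000000000566000000-00000000000566099999.jsonl.gz")
	if err != nil {
		panic(err)
	}
	fmt.Println(bf.Partition, bf.Start, bf.End) // 31 566000000 566099999
}
```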

I started writing a tool in Go that would run workers to download backup files, process them in parallel, and sort the output into different folders and files depending on event type and backup file origin.

Historically the backup files had been written at 1M-event intervals, meaning there could be weeks of data per file for the early years, and opening 32 of them in parallel while trying to write the output to different files proved to be too much.

IO access would spike through the roof and everything would grind to a halt, as it was writing several thousand files at the same time when running many workers.

I figured I needed to process the files concurrently while serialising the disk writes to fewer, larger files; that should go easier on the underlying storage and let me control the resources better.

I decided on the following file format (Hive-inspired naming).

/data/type=%d/event-YYYY-MM-DD.jsonl.gz

This would produce a file per day and event type.

I created a Go package I call Multiwriter to help me do this.

You feed the Multiwriter filenames and an io.Reader that you want persisted to disk.

The Multiwriter then uses an LRU cache to hold file handles for the most recently written files. It also uses a semaphore channel to prevent more than X writes at the same time.

I also changed the input order of the files: instead of reading them numerically sorted from S3, I sorted the files into slices based on which partition they belong to, and consumed the files "horizontally" across the partitions.

000.00000000000000000000-00000000000000999999.jsonl.gz

001.00000000000000000000-00000000000000999999.jsonl.gz

002.00000000000000000000-00000000000000999999.jsonl.gz

…

031.00000000000000000000-00000000000000999999.jsonl.gz

then start over on partition 0 and increase the offset

000.00000000000001000000-00000000000001999999.jsonl.gz

001.00000000000001000000-00000000000001999999.jsonl.gz

002.00000000000001000000-00000000000001999999.jsonl.gz

…

031.00000000000001000000-00000000000001999999.jsonl.gz

…

With 32 workers I would, in theory, be replaying the Kafka stream as it was written. I would also have fewer files open at the same time, since the files being parsed contain largely the same dates, compared to traversing the whole file list "vertically".
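"Horizontal" ordering is just a different sort key: offset range first, partition second. A sketch, assuming the filename layout shown above (function name is mine):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortHorizontally orders backup filenames by offset range first and
// partition second, so the workers replay the topic roughly in the
// order it was written. The zero-padded offsets make plain string
// comparison safe.
func sortHorizontally(files []string) {
	key := func(f string) (offsets, partition string) {
		parts := strings.SplitN(strings.TrimSuffix(f, ".jsonl.gz"), ".", 2)
		return parts[1], parts[0]
	}
	sort.Slice(files, func(i, j int) bool {
		oi, pi := key(files[i])
		oj, pj := key(files[j])
		if oi != oj {
			return oi < oj
		}
		return pi < pj
	})
}

func main() {
	files := []string{
		"001.00000000000001000000-00000000000001999999.jsonl.gz",
		"000.00000000000001000000-00000000000001999999.jsonl.gz",
		"000.00000000000000000000-00000000000000999999.jsonl.gz",
	}
	sortHorizontally(files)
	for _, f := range files {
		fmt.Println(f)
	}
}
```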

Internally the flow of data in the event processor looked like this

Someone to Watch Over Me

During the first trial runs of the parser it took almost 6 hours to parse all the backups. Remember, JSON is very expensive to decode and encode, and the standard JSON package relies heavily on reflection.

I had seen packages promising to be much faster than the stock one and decided to give json-iterator a shot, as it claimed to be a 100% drop-in replacement.

It also has a "fastest" setting which does not sort the field names in the output, which can save you even more CPU cycles.

Day of Honour

The backup parser was run on a 36-core machine in AWS.

Fly baby fly

Once the files were all parsed, the next problem arose: uploading these would take "ages", as the CLI clients for uploading files to S3 only upload one file at a time.

I wrote a function in my parser that would spawn upload workers and use the AWS SDK for Go to upload 15 files at a time concurrently, cutting the transfer time down to a mere half hour.

From here it was just a matter of creating the transfer jobs in BigQuery and watching them run.

In all, 18,315,948,132 events were parsed, rewritten & uploaded in about 4 hours.

Subtracting the half hour spent uploading, the tool transformed about 87.2 million events per minute.

I hope you enjoyed reading about my little JSON adventure. If you have any questions feel free to comment! :)

Tjek is on the lookout for another dedicated DevOps engineer. If you love Cloud, Kubernetes, Linux & Docker and want to work in a flexible working environment, check out the position here

Notes

By JSON lines I am referring to JSONL, also called newline-delimited JSON (NDJSON).

Example: