“Growth is like a game of snakes and ladders. When you’re rising fast, you need the right infrastructure in place to handle it. If you don’t, you’re heading back to square one.”

Accurate, fast financial reporting and reconciliation are a crucial part of running a large consumer payments business like Azimo. Super-high fidelity is critical: millions of payments, transactions and refunds need to be accurately recorded not only in our own systems but also distributed to those of our multiple clients and partners. Get it wrong and you’re out of business.

In this article, we describe how we created a scalable, robust financial reporting system, using fully managed services running on Google Cloud Platform.

Previous Solution

Our previous financial reporting system was built around Couchbase. Couchbase is scalable and supports N1QL, a query language similar to SQL but without SQL’s flexibility. This made it difficult to build the financial reports we needed as a much larger and still rapidly scaling business.

Data ingestion into Couchbase was extremely slow, and became progressively slower as our business became more complex. As well as our internal events, we had to upload and synchronise our partners’ files and API data.

The complexity of the system made building a new financial model with existing technology impossible. The architecture of the old system looked like this:

We had a series of cron jobs loading data from different places, as well as additional jobs enriching the data after it was already loaded. Not all of these processes could be run concurrently, which slowed us down further. Several flaws had to be addressed:

Complex for the end user

Expensive infrastructure + lots of devops/engineering maintenance time

Poor scalability

Poor performance

Hard to extend

Not robust

After a short design spike it became clear that a new system was needed. Our goal was to build a system that would be:

User friendly

Cheap infrastructure + minimal devops/engineering maintenance time

Horizontally scalable

Great performance

Easily extensible

Robust

Introducing Ebenezer

We all know and love (or hate) buzzwords like “robust”, “horizontally scalable” and “user friendly”. We hear them at conferences, we read them in articles.

In this section I’m going to explain how we achieved those goals and built Ebenezer, our new system. Fans of Charles Dickens and/or Bill Murray will spot the reference to Ebenezer Scrooge, the miserly protagonist of A Christmas Carol.

A modern distributed system is like a series of building blocks. To achieve all the buzzwords mentioned above, each of the system’s blocks must work efficiently with related blocks and prevent bottlenecks from occurring.

Data Warehouse — BigQuery

BigQuery was our standout candidate for data warehousing. As well as being significantly cheaper than our previous solution, it is fully managed by Google Cloud Platform and performs superbly even on petabyte-scale data. BigQuery is also part of our core platform, so instead of dumping data once per day, we are now able to process the most crucial data types as streaming events.
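To make the streaming path concrete, here is a minimal Python sketch of wrapping an internal event as one row of a BigQuery streaming-insert (tabledata.insertAll) request. The event shape and field names are hypothetical; the row format and the `insertId` de-duplication behaviour are BigQuery’s.

```python
def to_streaming_insert(event: dict) -> dict:
    """Wrap an internal event as one row of a BigQuery streaming-insert
    (tabledata.insertAll) request. Event fields are illustrative."""
    return {
        # insertId enables BigQuery's best-effort de-duplication of retries.
        "insertId": event["event_id"],
        "json": {
            "event_id": event["event_id"],
            "type": event["type"],
            "amount_minor_units": event["amount_minor_units"],
            "occurred_at": event["occurred_at"],
        },
    }

payload = to_streaming_insert({
    "event_id": "evt-1",
    "type": "PAYMENT_CREATED",
    "amount_minor_units": 1050,
    "occurred_at": "2018-03-07T14:00:00Z",
})
```

In practice such rows are posted in the `rows` array of an insertAll call via a client library rather than assembled by hand.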

One of our key requirements was to be able to recreate Ebenezer’s BigQuery tables from raw data, thereby treating those tables as views rather than as the source of truth. This allows us to do the following without writing custom migration jobs:

Remodel the tables

Add new computations

Fix bugs

Some of our data, however, comes from either partner files (FTP) or directly from an API. We needed a solution to process data both in batch and streaming modes.

Processing Framework — Apache Beam with Google Dataflow

Our processing framework had to fulfil two primary goals:

To stream data directly into BigQuery

To process, transform and load partners’ files into BigQuery

We chose Google Dataflow with Apache Beam. When we started, Apache Beam was still in beta, but it would soon become the basis of the Dataflow 2.0 SDK. As well as excellent processing, Dataflow brings monitoring, alerting, aggregated logs and well-maintained infrastructure.

Building our pipelines in Beam allows us to batch process our partners’ files from Google Cloud Storage. This means we can reload the data whenever we want to modify, extend or fix something.
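The per-record transformation inside such a batch pipeline is ordinary code. The sketch below shows the kind of parsing a Beam DoFn might apply to one line of a partner settlement CSV before it is written to BigQuery; the file layout and field names are invented for illustration.

```python
import csv
import io
from datetime import datetime, timezone

def parse_settlement_line(line: str) -> dict:
    """Parse one line of a hypothetical partner settlement CSV into a
    BigQuery-ready row. Field names and layout are illustrative only."""
    reference, amount, currency, settled_at = next(csv.reader(io.StringIO(line)))
    return {
        "partner_reference": reference,
        # Store money as integer minor units to avoid float rounding issues.
        "amount_minor_units": int(round(float(amount) * 100)),
        "currency": currency,
        "settled_at": datetime.strptime(settled_at, "%Y-%m-%d %H:%M:%S")
                              .replace(tzinfo=timezone.utc)
                              .isoformat(),
    }

row = parse_settlement_line("AZ-123,10.50,GBP,2018-01-15 12:00:00")
```

Because the logic is a pure function of one input line, the same code runs unchanged whether the pipeline is processing today’s file or reloading a year of history.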

Our streaming data is hosted on AWS, using Apache Kafka as a message broker. This solution works for us and we didn’t want to change it.

Master Data Set — Avro on Google Cloud Storage

All our events are sent to Kafka as Avro, but we also needed a mechanism to store the events in Avro files on Google Cloud Storage (GCS). Our system creates Avro files once per hour and stores them on GCS, with data separated per topic, per day.
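The per-topic, per-day partitioning scheme is simple to express. A sketch, with a hypothetical bucket and naming convention:

```python
from datetime import datetime

GCS_BUCKET = "gs://example-events-archive"  # bucket name is hypothetical

def avro_object_path(topic: str, window_start: datetime) -> str:
    """Build the object path for one hourly Avro file, partitioned
    per topic and per day as described above."""
    return (
        f"{GCS_BUCKET}/{topic}/"
        f"{window_start:%Y-%m-%d}/"
        f"{topic}-{window_start:%Y-%m-%d-%H}.avro"
    )

path = avro_object_path("payments", datetime(2018, 3, 7, 14))
# e.g. gs://example-events-archive/payments/2018-03-07/payments-2018-03-07-14.avro
```

Keeping the day in the path makes it cheap to reload any date range later by globbing the matching prefixes.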

For this purpose we created an Apache Beam pipeline that operates on Avro Generic Records using Schema Registry. This project is open source and can be found on AzimoLabs:

This works perfectly when all your data is correct, but what about when something goes wrong, like a bug in the Kafka Producer?

To handle this scenario we created another generic job that reads the files, transforms them and writes them to another destination. Only the class containing the transformation logic needs to be written and injected via the Beam configuration. This project is also open source and can be found on AzimoLabs:

All our Beam jobs use a shared factory pattern with either KafkaIO or AvroIO. This is decided based on configuration. All other jobs remain exactly the same, which makes table recreation and the code responsible for it very simple.
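A sketch of that configuration-driven choice, using plain stand-in classes rather than the real KafkaIO/AvroIO transforms (the config keys are illustrative):

```python
from dataclasses import dataclass

# Stand-ins for Beam's KafkaIO / AvroIO reads; names and fields are
# illustrative, not the real Beam API.
@dataclass
class KafkaSource:
    bootstrap_servers: str
    topic: str

@dataclass
class AvroSource:
    file_pattern: str

def build_source(config: dict):
    """Pick the input source from configuration, so the same job can
    stream from Kafka or reload historical Avro files from GCS."""
    mode = config["mode"]
    if mode == "streaming":
        return KafkaSource(config["bootstrap_servers"], config["topic"])
    if mode == "batch":
        return AvroSource(config["file_pattern"])
    raise ValueError(f"unknown mode: {mode}")

source = build_source({
    "mode": "batch",
    "file_pattern": "gs://example-events-archive/payments/*/*.avro",
})
```

Because everything downstream of the source is shared, switching a job between live streaming and historical replay is a one-line configuration change.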

Depending on your downtime requirements, you can either recreate the table and reload the data from the Avro files (without stopping your streaming job), or create a new table with a different suffix, in which case reports still have to be repointed. Unfortunately BigQuery does not have alias functionality for tables, as Elasticsearch does for index names, which would cut downtime to about a second. Here’s hoping we’ll get this feature one day.

In the meantime, the functionality is used often by our finance team. It’s a pleasure to be able to say “no problem” when your stakeholders need new columns with data backfilled. Flexibility keeps everyone happy.

Batch Jobs Scheduling — App Engine

We needed pre- and post-processing actions for our batch processing pipelines and chose App Engine’s cron jobs functionality to achieve this. A cron task can start a background thread to do the following:

Look for new files to process

Check whether any Dataflow jobs are running

Asynchronously start a Dataflow job

Check if the last job succeeded or failed

Move processed files
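The steps above amount to a small state machine. Here is a sketch of one cron tick, with a stand-in Dataflow client (the real Dataflow API calls differ):

```python
class DataflowClient:
    """Stand-in for the real Dataflow API client, for illustration only."""
    def __init__(self):
        self.running = False
    def list_running_jobs(self):
        return ["partner-load"] if self.running else []
    def launch(self, template, files):
        self.running = True

def on_cron_tick(client, new_files, processed):
    """One tick of the App Engine cron task: skip if a job is still
    running, otherwise launch a Dataflow job for any unprocessed files."""
    if client.list_running_jobs():
        return "job already running"
    pending = [f for f in new_files if f not in processed]
    if not pending:
        return "nothing to do"
    client.launch("partner-files-template", pending)
    return f"started job for {len(pending)} file(s)"

client = DataflowClient()
first = on_cron_tick(client, ["2018-03-07/partner.csv"], set())
```

The “one job at a time” guard is what lets the tasks that cannot run concurrently coexist on a single schedule.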

Visualisation — Domo & Google Data Studio

Domo handles most of our reports that don’t need real-time data. It’s user-friendly, efficient and is already used across the company, so it was the natural choice for this project.

We also needed dashboards for real-time events, which can help us detect issues and handle treasury management. For now we’re happy with Google Data Studio, though if our needs become more complex we might reconsider.

Big (architecture) picture

A variety of small, independent services built with Spring are responsible for fetching data from our partners’ APIs and publishing it to our message brokers. Depending on the API, we fetch data with schedulers, receive webhooks, deduplicate records, and so on.
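Deduplication in those services can be as simple as tracking already-seen event ids, since partner APIs and webhooks may deliver the same event more than once. A minimal sketch (the event shape is illustrative):

```python
def deduplicate(events, seen_ids):
    """Drop events whose id has already been ingested, mutating
    seen_ids so repeats across batches are also filtered out."""
    fresh = []
    for event in events:
        if event["id"] not in seen_ids:
            seen_ids.add(event["id"])
            fresh.append(event)
    return fresh

seen = set()
first_batch = deduplicate([{"id": "a"}, {"id": "b"}, {"id": "a"}], seen)
```

In a real service the seen-id set would live in durable storage rather than in memory, so restarts don’t cause duplicates downstream.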

Below is a full view of the architecture we use:

Summary

Ebenezer has been in production for around one year. New functionality is still being added, such as new partner integrations. We’re very happy with its usability, performance and low maintenance time. If you have any questions, feel free to ask in the comments section below.