Building Abacus

To truly overcome the challenges of the initial version of our real-time event aggregation system, we decided to transform it into a modern streaming application built on Apache Flink. However, addressing every problem at once would have made the project too formidable. To scope the project better, we started by identifying which optimizations were essential for smoothly handling Black Friday 2018 traffic.

Combining all the previously outlined subsystems into one to make event ingestion idempotent was essential. Making the new aggregation system fast and isolating account-level aggregation from customer-level aggregation would also be crucial for us to triumph on Black Friday. Moving off Cassandra’s counter data type, on the other hand, would require us to redesign the API layer and reaggregate all historical data. That effort, though ideal for our long-term scalability, was not essential to accomplish before Black Friday, and achieving counter-less storage would become a more tractable goal after consolidating all the subsystems anyway. Therefore, we defined two milestones for the Abacus project.

The first milestone, which needs to be achieved before Black Friday, is to release the initial version of Abacus, which aggregates events and writes deltas to the database. This version of Abacus continues to use Cassandra’s counter data type to leverage the existing data storage and API, so no data migration or new API layer is needed. The goals of this milestone are isolating account-level and customer-level aggregation, consolidating all subsystems to make event ingestion idempotent, and improving the efficiency of the aggregation system.

The second milestone will be achieved after Black Friday. It encompasses the release of the second version of Abacus, in which we aggregate final counts in place and persist the results to Cassandra (or any other distributed database). The second milestone also includes the redesign of the API layer for querying data and the migration of all historical data. After the second milestone, we can finally move off Cassandra’s counter data type.

The triumph of milestone one

In order to solve the problem of isolation, we utilize Flink’s native Kafka connector. Kafka’s consumer group concept combines the advantages of the message queuing and publish-subscribe models. Kafka consumers belonging to the same consumer group share a group ID. These consumers divide the topic partitions as fairly among themselves as possible, and Kafka guarantees that each partition is consumed by only a single consumer from the group.

Kafka Consumer Group

Flink’s native Kafka connector creates a consumer group for each job. Therefore, simply by running two Flink jobs, we can separate customer-level and account-level aggregation at the source. We call these two Flink jobs, together with their respective sink databases, Abacus Customer and Abacus Statistic. Logically, Abacus Customer and Abacus Statistic are nearly identical; the only difference is that Abacus Customer does not have the step that checks the uniqueness of customer actions, which results in a simpler workload. Therefore, I will focus on explaining the design and implementation of the more complicated pipeline, Abacus Statistic, whose architecture is drawn below.

Milestone One Implementation of Abacus Statistic
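
To make the source setup concrete, here is a minimal sketch of how each job might attach to the same topic under its own consumer group. The broker address, topic, group IDs, and event format are placeholders, and the exact connector API varies across Flink versions.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AbacusStatisticSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder brokers
        // Each Flink job uses its own group ID, so the account-level (Abacus Statistic) and
        // customer-level (Abacus Customer) jobs each receive every event independently.
        props.setProperty("group.id", "abacus-statistic");      // "abacus-customer" in the other job

        // "customer-events" and the String schema are stand-ins for the real topic and format.
        FlinkKafkaConsumer<String> source =
                new FlinkKafkaConsumer<>("customer-events", new SimpleStringSchema(), props);

        DataStream<String> events = env.addSource(source);
        events.print();   // downstream operators (uniqueness check, windowing, sink) go here

        env.execute("abacus-statistic");
    }
}
```

Because the two jobs subscribe under different group IDs, a failure or slowdown in one pipeline does not affect the partition assignment or progress of the other.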

Check Uniqueness is a Flink RichMapFunction we maintain to track the uniqueness of a certain action. We create a log record only the first time a customer performs a given action. This record signals that the action has already been counted and that subsequent occurrences of the same action by the same customer are not unique. Events are then windowed using processing time instead of event time. When the window closes, the increments, rather than the final counts, are written to Cassandra.
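
A minimal sketch of such a uniqueness check, assuming the stream is keyed by a customer/action pair and using a hypothetical event type (the real schema and operator are not shown in this post), could rely on Flink keyed state:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;

// Hypothetical event type standing in for the real pipeline's schema.
class ActionEvent {
    String customerId;
    String action;
    boolean unique;   // true only the first time this customer performs this action
}

public class CheckUniqueness extends RichMapFunction<ActionEvent, ActionEvent> {

    // Keyed state scoped to the (customer, action) key of the stream; with the RocksDB
    // state backend this "log record" lives on local disk and is checkpointed by Flink.
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("action-seen", Boolean.class));
    }

    @Override
    public ActionEvent map(ActionEvent event) throws Exception {
        if (seen.value() == null) {
            seen.update(true);     // first occurrence: record it and count the event
            event.unique = true;
        } else {
            event.unique = false;  // repeat occurrence: flag it so it is not counted again
        }
        return event;
    }
}
```

Keying the stream by something like customerId plus the action name before this operator is what scopes the state to each customer/action pair; the exact key format here is an assumption.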

Flink supports both processing time and event time in streaming programs. When a streaming program runs on processing time, all time-based operations (like time windows) use the system clock of the machines that run the respective operators. When running on event time, a streaming application is clocked by the time at which each individual event occurred on its producing device, so the progress of time depends on the data. Event time programs must specify how to generate event time watermarks, the mechanism that signals progress in event time; you can read more details on how Flink watermarks work in this post. Optionally, event time programs can specify an allowed lateness for window operators. By default, late elements are dropped once the watermark passes the end of a window, but allowed lateness lets elements arrive up to a configured amount of time late before they are dropped.
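
For illustration, an event-time pipeline of this kind would look roughly like the sketch below. The key format, out-of-orderness bound, window size, and lateness are all hypothetical, and the watermark API has changed across Flink versions (older versions also require setting the stream time characteristic to event time on the environment).

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {

    // Events are modeled as (action key, event timestamp in millis, count of 1) for brevity.
    static DataStream<Tuple3<String, Long, Long>> windowByEventTime(
            DataStream<Tuple3<String, Long, Long>> events) {
        return events
                // Watermarks signal how far event time has progressed; here we tolerate
                // events arriving up to 30 seconds out of order (an illustrative value).
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Long>>(
                                Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(Tuple3<String, Long, Long> event) {
                                return event.f1;
                            }
                        })
                .keyBy(event -> event.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Elements later than the watermark but within this bound still update the
                // window; anything later than that is dropped.
                .allowedLateness(Time.minutes(5))
                .sum(2);   // per-key count within each event-time window
    }
}
```

Abacus does not use this path; it is shown only to contrast with the processing-time windowing described next.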

For our workload, incoming events can be heavily delayed by upstream data sources or by historical back-population during customer onboarding. Therefore, the application uses processing time for windowing and aggregating events before writing increments to Cassandra. Compared to relying on event time, using processing time also has the benefit of keeping the window state much smaller.
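
A hedged sketch of that processing-time path, again with hypothetical key format, window size, keyspace, and table names, might look like this (the sink comes from Flink’s Cassandra connector):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

public class ProcessingTimeIncrements {

    // Input: (increment, action key) pairs, e.g. (1L, "cust-42|opened-email") per unique event.
    // The key format, window size, keyspace, and table are illustrative, not Klaviyo's actual values.
    static void writeIncrements(DataStream<Tuple2<Long, String>> uniqueEvents) throws Exception {
        DataStream<Tuple2<Long, String>> increments = uniqueEvents
                .keyBy(event -> event.f1)
                // Processing-time windows close on the machine's clock, so heavily delayed events
                // simply land in a later window instead of being dropped or held in large state.
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(0);   // per-window delta for each key, not the final count

        // Cassandra applies the delta to its counter column; parameters bind in tuple-field order.
        CassandraSink.addSink(increments)
                .setQuery("UPDATE abacus.action_counts SET count = count + ? WHERE action_key = ?;")
                .setHost("cassandra-host")
                .build();
    }
}
```

Because Cassandra’s counter column applies each delta on write, the sink only has to emit per-window increments, which is exactly what milestone one preserves.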

We implemented the design above and delivered it as part of milestone one. The current version of Abacus consolidates the Check Uniques and Big Bertha subsystems of the initial event aggregation system and guarantees idempotent ingestion of events before writing to Cassandra. Also, replacing the old aggregation system with Abacus has successfully reduced the ingestion time of the whole event processing pipeline from 450 milliseconds to 80 milliseconds. Additionally, account-level and customer-level aggregations have been separated, which isolates the impact of a potential failure in either system.

Relying on Abacus, Klaviyo was able to break all data processing records on Black Friday and ensure success for our customers. Our new aggregation system smoothly processed more than a billion events on Cyber Monday alone, and it did so using only one-sixth of the original resources.

A peek into milestone two

Immediately after last Black Friday, we started development of the milestone two version of Abacus, in which we aggregate final counts instead of increments in the stream. The prudent design of the milestone one Abacus has enabled us to reuse all of the code from the first version. We only need to add one additional step, Value Reader, to our stream. The implementation looks like this:

Milestone Two Implementation of Abacus Statistic

Value Reader consists of two Flink operators: WindowManager and CassandraHydrator. WindowManager is a Flink RichMapFunction that maintains counts for different timeframes. CassandraHydrator is a Flink AsyncFunction that enables us to read specific counts from the account-level database. When an aggregate for a specific customer action flows through the pipeline, WindowManager queries local state in RocksDB to see whether it already has the count for that customer action. If it does, the application initializes the current aggregate in the stream with that count. If the count is not available locally, CassandraHydrator is invoked to initialize the aggregate by reading the count from the database asynchronously. The initialized aggregate is then reduced to final counts after being windowed using processing time. With this design, the account-level Cassandra cluster can finally be freed from the infamous counter data type, because we always write the final counts, instead of deltas, to Cassandra at the end of the stream.
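
A simplified sketch of how these two operators could fit together is below. The aggregate type, the single timeframe, the timeout, and the readCountAsync helper are all hypothetical stand-ins, and the real WindowManager also updates its state across multiple timeframes rather than only reading it.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Hypothetical aggregate flowing through the pipeline: a per-action delta plus, once known,
// the running total it should be folded into.
class Aggregate {
    String actionKey;
    long delta;
    Long baseCount;   // null until initialized from local state or Cassandra
}

// WindowManager: keyed state (RocksDB-backed when that state backend is configured)
// holds the last known count for each action key.
class WindowManager extends RichMapFunction<Aggregate, Aggregate> {
    private transient ValueState<Long> localCount;

    @Override
    public void open(Configuration parameters) {
        localCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("local-count", Long.class));
    }

    @Override
    public Aggregate map(Aggregate agg) throws Exception {
        agg.baseCount = localCount.value();   // stays null if this key has never been seen
        return agg;
    }
}

// CassandraHydrator: only aggregates that could not be initialized locally trigger an async read.
class CassandraHydrator extends RichAsyncFunction<Aggregate, Aggregate> {
    @Override
    public void asyncInvoke(Aggregate agg, ResultFuture<Aggregate> resultFuture) {
        if (agg.baseCount != null) {
            resultFuture.complete(Collections.singleton(agg));   // already hydrated locally
            return;
        }
        // readCountAsync is a stand-in for an async query against the account-level cluster.
        readCountAsync(agg.actionKey).thenAccept(count -> {
            agg.baseCount = count;
            resultFuture.complete(Collections.singleton(agg));
        });
    }

    private CompletableFuture<Long> readCountAsync(String actionKey) {
        return CompletableFuture.completedFuture(0L);   // placeholder for a real Cassandra driver call
    }
}

class ValueReader {
    static DataStream<Aggregate> attach(DataStream<Aggregate> aggregates) {
        DataStream<Aggregate> locallyInitialized =
                aggregates.keyBy(a -> a.actionKey).map(new WindowManager());
        // Async I/O keeps the pipeline from blocking on Cassandra reads; 500 ms is illustrative.
        return AsyncDataStream.unorderedWait(
                locallyInitialized, new CassandraHydrator(), 500, TimeUnit.MILLISECONDS);
    }
}
```

Running the hydration behind Flink’s AsyncDataStream keeps slow Cassandra lookups from stalling the rest of the stream, which matters because only cold keys should ever reach the database.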

Currently, in parallel with developing the next version of Abacus, we are rewriting our API layer and planning the migration of historical data to the new schema. After milestone two, we will gain an appreciable improvement on an already performant real-time event aggregation system.