Now back to the design challenges

Equipped with the capabilities of Phoenix and backed by a solid datastore in HBase, we solved the design challenges described earlier as follows:

Large Aggregations on the fly

As discussed before, our half-hourly batch job crunches raw events from the Kafka stream to generate metric records. These records are pre-aggregated into 30-minute time buckets and then written to Phoenix relational tables.

Figure 4 is a representation of one such table, used by the Analytics API call to pull a customer support agent’s daily performance analytics as mentioned before.

Figure 4 : Representation of the relational table that models the metrics data for Agent Analytics API
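For concreteness, here is a minimal sketch of what the Phoenix DDL behind such a table could look like. The table name, metric columns and exact key order are assumptions based on the description (“agentId” is later described as part of the composite primary key):

```sql
-- Sketch only: table name, metric columns and key order are assumptions.
CREATE TABLE IF NOT EXISTS agent_metrics (
    customer_name   VARCHAR NOT NULL,
    time_bucket     DATE    NOT NULL,   -- 30-minute bucket, stored in UTC
    agent_id        VARCHAR NOT NULL,
    resolved_count  BIGINT,             -- hypothetical metric column
    assigned_count  BIGINT,             -- hypothetical metric column
    CONSTRAINT pk PRIMARY KEY (customer_name, time_bucket, agent_id)
)
```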

Figure 5 represents the output dataset we expect the query for the following API call to return. The dataset contains time series metrics for each day, in the Asia/Kolkata timezone, for each of the agents.

https://../analytics/agent/?from=2018-10-02T00:00:00&to=2018-10-09T00:00:00&granularity=daily&timezone=Asia/Kolkata&limit=1000

Figure 5 : Representation of the output aggregate data required in the API response

Figure 6 shows the SQL query that does this aggregation on the fly.

Under the hood, the SQL query is translated to HBase native API calls that are sent to multiple HBase region servers, which scan and filter rows in parallel for the customer named “hscustomer” within the specified time range, while simultaneously running server-side aggregations like SUM and COUNT on the metric columns, grouped by the pivot columns “time” and “agentId”. Note that the group-by pivot “time” is the 30-minute “time_bucket” column in the UTC timezone, converted to “Asia/Kolkata” and truncated to the day.

Figure 6 : SQL query for the Agent Analytics API
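A minimal sketch of what this aggregation could look like in Phoenix SQL, reusing the hypothetical agent_metrics schema sketched above:

```sql
-- Sketch of the on-the-fly aggregation; names are assumptions.
-- from/to bounds shown in UTC for simplicity; the service would
-- translate the requested Asia/Kolkata range to UTC.
SELECT TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY') AS "time",
       agent_id,
       SUM(resolved_count) AS resolved,
       SUM(assigned_count) AS assigned
FROM   agent_metrics
WHERE  customer_name = 'hscustomer'
AND    time_bucket >= TO_DATE('2018-10-02 00:00:00')
AND    time_bucket <  TO_DATE('2018-10-09 00:00:00')
GROUP BY TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY'), agent_id
LIMIT 1000
```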

Deep pagination

Now we come to the problem of paginating API calls over massive datasets comprising millions of rows. For pagination, these APIs use a standard combination of pageSize (limit) and nextPageToken (next_key).

Server-side pagination with SQL

For paginating SQL results, a hybrid approach is used, comprising the following:

1. Row Value Constructors (RVC): RVC is SQL-92 syntax for running range queries on tables using a combination of primary key columns. The query specifies the start row and end row using the PK column combination. This allows us to fetch 1001 rows: the first 1000 rows serve the current page, and the 1001st row is deconstructed to prepare the next_key used as the start row of the next page’s query, and so on (see the sketch below).

2. LIMIT N OFFSET M: LIMIT OFFSET is again standard SQL for performing paginated queries, where the offset is used to skip M rows from the start and the limit is used to accumulate N rows for the page.

RVC works well for API queries that do not have any aggregations; however, it cannot be used for API queries that need to run aggregation functions like SUM and COUNT, as seen before. For such APIs we use the LIMIT N OFFSET M approach. It works well for pagination going a few hundred pages deep, but starts degrading quickly in deep pagination scenarios, which we solve using a custom cursor-based approach, discussed next.
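A minimal sketch of RVC pagination on a flat (non-aggregate) scan, again against the hypothetical agent_metrics table; the start-row tuple is whatever was deconstructed from the previous page’s 1001st row (the literal values here are hypothetical):

```sql
-- Sketch of RVC pagination; the RVC start-row tuple comes from the
-- decoded next_key of the previous page.
SELECT customer_name, time_bucket, agent_id, resolved_count
FROM   agent_metrics
WHERE  customer_name = 'hscustomer'
AND    (customer_name, time_bucket, agent_id)
       > ('hscustomer', TO_DATE('2018-10-05 10:30:00'), 'agent-42')
LIMIT  1001  -- 1000 rows for this page + 1 row to build the next_key
```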

So in summary, the hybrid approach to pagination works as follows:

1. RVC for queries that have flat scans or no aggregations

2. LIMIT N OFFSET M for aggregate queries with shallow pagination (the feasible depth usually depends upon the input time range of the API calls)

3. Custom Cursors for deep pagination, described next

Figure 7 : Query for the 2nd page using LIMIT OFFSET
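A sketch of what such a second-page query could look like, assuming the same hypothetical schema and a page size of 1000:

```sql
-- Sketch of a 2nd-page aggregate query using LIMIT OFFSET.
SELECT TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY') AS "time",
       agent_id,
       SUM(resolved_count) AS resolved
FROM   agent_metrics
WHERE  customer_name = 'hscustomer'
AND    time_bucket >= TO_DATE('2018-10-02 00:00:00')
AND    time_bucket <  TO_DATE('2018-10-09 00:00:00')
GROUP BY TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY'), agent_id
LIMIT 1000 OFFSET 1000  -- skip page 1's 1000 rows, return page 2
```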

3. Custom Cursors for Deep Pagination: As discussed before, LIMIT OFFSET works well for pagination that is relatively shallow. However, for scenarios where the data pull encompasses the retrieval of millions of aggregated rows, which even with a page size of 1000 spans thousands of pages, LIMIT OFFSET does not work.

It does not work because the OFFSET operation in data stores like HBase has to start scanning rows from the beginning of the scan range and keep skipping them until the offset is reached. All of this unnecessary scanning and skipping leads to wasted CPU cycles and wasted IO, resulting in slow queries that keep getting slower as we go deeper into the pages. It also puts pressure on the HBase region servers’ heap due to unwanted cache evictions and object churn.

This challenge is solved using a custom cursor-based approach where the state of a paginated query’s life cycle is stored in HBase while the pagination progresses, until it completes. For the Analytics API queries, this was done by dividing the query space into partitions and running LIMIT OFFSET queries only within the bounds of one partition.

For example, in the “Agents Analytics API” we use “agentId” to partition the query space by agents. To do this, at the commencement of the paginated API call, we start a cursor for the API request by pulling all the “agentId”s of the “customer_name” and storing them in a Phoenix HBase table as cursor state.

Now we run the LIMIT OFFSET query one “agentId” at a time until all the agents are queried, after which we close the cursor state. As “agentId” is part of the table’s composite primary key, Phoenix executes a special filter called SKIP SCAN FILTER when “agentId” is used in the WHERE clause of the query. The HBase scanner uses this hint and seeks to the next row that satisfies the filter condition of the WHERE clause, thereby skipping the huge number of rows and columns in between. The cursor state is propagated between the pages of the API calls by encoding it in the next_key token (Figures 8, 9, 10, 11).
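A minimal sketch of what the cursor-state table could look like; the actual schema is an assumption, since the only given is that the state lives in a Phoenix HBase table:

```sql
-- Sketch only: schema of the cursor-state table is an assumption.
CREATE TABLE IF NOT EXISTS api_cursor_state (
    cursor_id     VARCHAR NOT NULL,  -- opaque id embedded in next_key
    agent_ids     VARCHAR,           -- serialized list of the customer's agentIds
    current_index INTEGER,           -- position of the agentId being paginated
    created_at    TIMESTAMP,
    CONSTRAINT pk PRIMARY KEY (cursor_id)
)
```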

Figure 8: Cursor start initialises the cursor state with all agentIds; the first agentId, encoded in next_key, is returned

Figure 9: Cursor next progresses cursor state to next agentId

Figure 10: Cursor end clears the state once all agentIds have been pulled by the paginated API calls

Figure 11 : The decoded next_key token containing cursor state is used in the Phoenix SQL query; since agent_id is part of the primary key, this query uses a filter called SKIP SCAN FILTER, which makes the query run fast
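A sketch of the per-partition page query this describes; the pinned agent_id value is hypothetical and would come from the decoded next_key cursor state:

```sql
-- Sketch of the per-partition query: agent_id is pinned in the WHERE
-- clause, letting Phoenix use a skip scan over the key space.
SELECT TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY') AS "time",
       agent_id,
       SUM(resolved_count) AS resolved
FROM   agent_metrics
WHERE  customer_name = 'hscustomer'
AND    agent_id = 'agent-42'          -- current cursor position (hypothetical)
AND    time_bucket >= TO_DATE('2018-10-02 00:00:00')
AND    time_bucket <  TO_DATE('2018-10-09 00:00:00')
GROUP BY TRUNC(CONVERT_TZ(time_bucket, 'UTC', 'Asia/Kolkata'), 'DAY'), agent_id
LIMIT 1000 OFFSET 1000
```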

OLTP latencies

As discussed before, running aggregated SQL queries on such massive analytics datasets of tens of millions of rows, at low latencies of a few seconds, is a serious challenge. All of these massive row scans, filters and aggregations demand a lot of CPU and IO capacity, which is limited on a single machine.

This is where a database like Phoenix really helps: by design, it orchestrates these CPU- and IO-consuming operations in parallel on multiple region servers (assuming the large query space is distributed across multiple machines), running the filters and aggregations on the server side, close to the data, instead of on the client side, thus dramatically reducing the amount of data exchanged between the client and server. In addition, Phoenix parallelizes the running of the GROUP BY on the client by chunking up the scans based on ranges of the row key. By doing the work in parallel, Phoenix gets us the results extremely fast.

In addition to the parallel scans and parallel pushdowns (aggregations and filters), for paginated queries Phoenix also uses a TopN algorithm: it continues the iterations till each region server has accumulated N results, and finally the Phoenix query server does a merge sort on all the results to return the top N results to the client.
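For readers who want to verify this behavior, Phoenix’s EXPLAIN statement shows the physical plan. The commented plan shape below is indicative only; real output varies with Phoenix version, region layout and statistics:

```sql
-- Inspect the physical plan of an aggregate query (sketch).
EXPLAIN
SELECT agent_id, SUM(resolved_count)
FROM   agent_metrics
WHERE  customer_name = 'hscustomer'
AND    time_bucket >= TO_DATE('2018-10-02 00:00:00')
AND    time_bucket <  TO_DATE('2018-10-09 00:00:00')
GROUP BY agent_id;

-- Plan elements one would expect to see (illustrative, not real output):
--   CLIENT n-CHUNK PARALLEL n-WAY RANGE SCAN OVER AGENT_METRICS [...]
--       SERVER AGGREGATE INTO DISTINCT ROWS BY [AGENT_ID]
--   CLIENT MERGE SORT
```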

Figure 12 demonstrates this: our Analytics API service (named Sunbird) issues an SQL query to fetch the results of a page of the API call. This query is compiled by the Phoenix Query Server (PQS) and translated to HBase native RPC calls. The actual query space for the time range can be millions of rows distributed across multiple region servers. Phoenix issues parallel scans to all relevant region servers based on row key ranges. In addition to the parallel range scan, Phoenix also performs a skip scan for the supplied “agent_id” (the ID of a customer support agent), seeking the next row with this agent id in the key and skipping all rows that do not have it. The server-side coprocessors run the supplied functions to perform the transformations, truncations and aggregations until we are left with a few thousand aggregated rows. The Phoenix query server finally does a merge sort and returns the page of the top 1000 records back to the JDBC client.

Figure 12: API parallel query execution lifecycle on Phoenix HBase architecture

In our next post (part 2 of this series) we will look into Phorm, or Phoenix ORM (a custom ORM library written in Java for Phoenix JDBC), and Sunbird (an HTTP API server written in Clojure), and describe the efforts and challenges that went into designing a custom ORM library and an API server that serve massive analytics datasets.

Conclusion

The learning we took from this journey was that, when designing a data-intensive application, one of the most critical decisions to make is picking the right database system and programming paradigm. A Phoenix HBase architecture and a programming model primarily based on SQL served us well. It is also very important to design systems for scale from the outset rather than as an afterthought.

We plan on writing more posts in the future about our experience with running Phoenix and HBase in production, lessons learnt, and setting up HA and load balancing on the PQS cluster.