For many applications, query throughput (i.e. the capacity for concurrent queries) is an important consideration. Specifically, being able to scale query throughput easily makes it possible (and economical) to grow your system as your user-base increases in size.

In the last few weeks, we assembled a performance engineering team to run some benchmarks that would help us demonstrate CrateDB’s ability to scale query throughput linearly.

To do this, we simulated an application that collects time-series data (specifically, sensor readings) and stores them in a multi-tenant database.

This is a common use-case for our users. CrateDB’s capability to ingest massive amounts of data and query it in real-time is uniquely well suited for the monitoring and analysis of machine data (e.g. sensors, Internet of Things, logs, and so on) via dashboards.

In this post, I will go through the basic benchmark setup and share the high-level results with you. If you want to dig deep, the full details are available as a whitepaper: CrateDB Query Throughput Scalability Benchmark

Cluster Setup

We ran this benchmark on CrateDB 1.0.4.

We used Amazon AWS c3.2xlarge instances with the following configuration:

Intel Xeon CPU E5-2680 v2 (Ivy Bridge) with 8 cores

15 GB RAM

2x 80GB SSD storage

Data Setup

Dataset

The data set we used contains 314,496,000 records. These data points simulate sensor readings gathered over the period of one year. Each record represents a single sensor reading, and also includes data about the sensor itself. The data was partitioned by week.

Table Schema

Here’s the table schema we used for this data:

DROP TABLE IF EXISTS b.t2; CREATE ANALYZER "tree" ( TOKENIZER tree WITH ( type = 'path_hierarchy', delimiter = ':' ) ); CREATE TABLE IF NOT EXISTS b.t2 ( "uuid" STRING, "ts" TIMESTAMP, "tenant_id" INTEGER, "sensor_id" STRING, "sensor_type" STRING, "v1" INTEGER, "v2" INTEGER, "v3" FLOAT, "v4" FLOAT, "v5" BOOLEAN, "week_generated" TIMESTAMP GENERATED ALWAYS AS date_trunc('week', ts), INDEX "taxonomy" USING FULLTEXT (sensor_type) WITH (analyzer='tree') ) PARTITIONED BY ("week_generated") CLUSTERED BY ("tenant_id") INTO 156 SHARDS;

Test Queries

For this benchmark we ran three different queries. Parameters in the queries were randomly generated to ensure the queries were operating across different parts of the dataset, to emulate real-world use.

Query 1

Data from a single tenant, aggregated across multiple partitions:

SELECT min(v1) as v1_min, max(v1) as v1_max, avg(v1) as v1_avg, sum(v1) as v1_sum FROM b.t1 WHERE tenant_id = ? AND day_generated BETWEEN ? AND ?;

Query 2

Data from all tenants, aggregated within a single partition, grouped and ordered by tenant (i.e. for doing cross-tenant analytics):

SELECT count(*) AS num_docs, tenant_id, min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum, min(v2) AS v2_min, max(v2) AS v2_max, avg(v2) AS v2_avg, sum(v2) AS v2_sum, min(v3) AS v3_min, max(v3) AS v3_max, avg(v3) AS v3_avg, sum(v3) AS v3_sum, min(v4) AS v4_min, max(v4) AS v4_max, avg(v4) AS v4_avg, sum(v4) AS v4_sum FROM b.t1 WHERE day_generated = ? GROUP BY tenant_id ORDER BY tenant_id;

Query 3

Sensor lookup within taxonomy (term + children):

SELECT sensor_id, sensor_type FROM b.t1 WHERE taxonomy = ? ORDER BY v1 LIMIT 100;

Cluster and Query Scaling

As we increased the size of the CrateDB cluster, we also increased the number of simulated users.

Here’s a summary of scaling setup we used:

Nodes Users 3 30 6 30 12 60 24 120

Benchmark Results

For each cluster, a JMeter test harness simulated a number of concurrent users, each connecting to the database and executing the three test queries (in random sequence and with random query parameters).

These results show the average queries per second across 314 million rows:

3 node 6 node 12 node 24 node # Users 30 30 60 120 Query 1 3.2395 6.1410 10.2985 18.2665 Query 2 3.0398 5.6857 10.0139 17.5259 Query 3 3.1790 5.9053 10.2393 17.7649 Total 9.0698 17.0139 29.8725 52.0404

If we graph this, we get:

As you can see, query throughput increases linearly with cluster size.

Wrap Up

This benchmark demonstrates that, at least for this setup, CrateDB scales query throughput linearly with cluster size.

Linear scaling is an important feature, because it makes capacity planning a lot more predictable—and more affordable.

If you want to to know more, check out the whitepaper, or get in touch.