Last week at the Strata Data Conference in San Jose, I had the privilege of demonstrating MemSQL processing over a trillion rows per second on the latest Intel Skylake servers.

It’s well known that having an interactive response time of under a quarter of a second gives people incredible satisfaction. When you deliver response time that drops down to about a quarter of a second, results seem to be instantaneous to users.

But with large data sets and concurrency needs, giving all customers that level of speed can seem beyond reach. So developers sometimes take shortcuts, such as precomputing summary aggregates. That can lead to a rigid user experience where if you tweak your query a little, for example adding an extra grouping column, suddenly it runs orders of magnitude slower. And it also means your answers are not real time, i.e. not on the latest data.

MemSQL gives you the interactive response time your users want, on huge data sets, with concurrent access, without resorting to precomputing results.

Running at a Trillion Rows Per Second

MemSQL 6, which shipped in late 2017, contains new technology for executing single-table group-by/aggregate queries on columnstore data incredibly fast. The implementation is based on these methods: (1) operations done directly on encoded (compressed) data in the columnstore, (2) compilation of queries to machine code, (3) vectorized execution, and (4) use of Intel AVX2 single instruction, multiple data (SIMD) enhancements. When the group-by columns are encoded with dictionary, integer value, or run-length encoding, MemSQL runs a one-table group-by/aggregate at rates exceeding three billion rows per second per core at its peak. The fewer the number of groups and the simpler the aggregate functions, the faster MemSQL goes.

This incredible per-core speed gave us the idea to shoot for the trillion-rows-per-second mark. To accomplish this, with a realistic query, I wrote a data generator to build a data set that simulates stock trades on the NASDAQ. Then we talked to our partners at Intel, and they generously gave us access to servers in their lab with the latest Skylake processors. These machines have two Intel® Xeon® Platinum 8180 processors each, which have 28 cores, for a total of 56 cores per server. I created a MemSQL cluster with one aggregator node and eight leaf nodes, with one server for each node, as shown in Figure 1. This cluster had 2 * 28 * 8 = 448 total cores on the leaves — the most important number that determined the overall rows-per-second rate we could get.

Figure 1. The hardware arrangement used to break a trillion rows per second.

I installed MemSQL on this cluster with two MemSQL leaf nodes on each leaf server, with non-uniform memory access (NUMA) optimizations enabled, so each MemSQL leaf node software process would run on a dedicated Skylake chip. Then I created a database trades with one partition per core to get optimal scan performance. Once that was complete, I loaded billions of rows of data representing stock trades (really decades worth of trades) into a table called trade. The larger capitalization a stock is, the more trades it has. Here’s a tiny sample of the data:

memsql> select id, stock_symbol, shares, share_price, trade_time from trade limit 10; +-------------+--------------+-----------+-------------+----------------------------+ | id | stock_symbol | shares | share_price | trade_time | +-------------+--------------+-----------+-------------+----------------------------+ | 10183273878 | CXRX | 600.0000 | 22.0000 | 2018-03-08 08:50:57.000000 | | 10184155113 | CXRX | 700.0000 | 31.0000 | 2018-03-08 08:50:57.000000 | | 10183273871 | CXRX | 500.0000 | 8.0000 | 2018-03-08 07:50:57.000000 | | 10185917724 | CXRX | 1000.0000 | 63.0000 | 2018-03-08 06:50:57.000000 | | 10183273873 | CXRX | 1000.0000 | 74.0000 | 2018-03-08 04:50:57.000000 | | 10183273874 | CXRX | 800.0000 | 96.0000 | 2018-03-08 10:50:57.000000 | | 10183273865 | CXRX | 600.0000 | 82.0000 | 2018-03-08 06:50:57.000000 | | 10183273876 | CXRX | 600.0000 | 40.0000 | 2018-03-08 05:50:57.000000 | | 10183273877 | CXRX | 700.0000 | 47.0000 | 2018-03-08 05:50:57.000000 | | 10183273869 | CXRX | 100.0000 | 15.0000 | 2018-03-08 05:50:57.000000 | +-------------+--------------+-----------+-------------+----------------------------+

This resulted in how many rows? Check this out:

memsql> select format(count(*), 0) as c from trade; +----------------+ | c | +----------------+ | 57,756,221,440 | +----------------+

That’s right, about 57.8 billion rows!

Then I ran this query to find the top 10 most traded stocks of all time:

select stock_symbol, count(*) as c from trade group by stock_symbol order by c desc limit 10;

And got this result:

+--------------+----------+ | stock_symbol | c | +--------------+----------+ | GILD | 39321600 | | AABA | 39190528 | | KHC | 39190528 | | AMZN | 39190528 | | ASML | 39059456 | | CHTR | 39059456 | | TXN | 39059456 | | FB | 39059456 | | CELG | 39059456 | | TIG | 39059456 | +--------------+----------+ 10 rows in set (0.05 sec)

10 rows in set (0.05 sec)

Yes, this query ran in 5 one-hundredths of a second! To get an accurate scan rate, we need some more significant figures though. So I wrote a stored procedure to run it 100 times in a loop:

memsql> call profile_query(100); Query OK, 0 rows affected (4.51 sec)

This gives an average response time for the query of 0.0451 second, which yields the following scan rate:

memsql> select format(count(*)/0.0451, 0) as c from trade; +-------------------+ | c | +-------------------+ | 1,280,625,752,550 | +-------------------+

Yes, that is 1.28 trillion rows per second!

We’re scanning and processing 2.86 billion rows per second per core, through the beauty of operations on encoded data, vectorization, and SIMD. We’re actually spending some time on every row, and not precalculating the result, to achieve this.

What does it mean?

The fact that we can break the trillion-row-per-second barrier with MemSQL on 448 cores worth of Intel’s latest chips is significant in a couple of ways. First, you can get interactive response time on mammoth data sets without precalculating results. This allows more flexible interaction from users, and encourages them to explore the data. It also enables real-time analytics. Second, it allows highly concurrent access by hundreds of users on smaller data sets, with all of them getting interactive response times. Again, these users will get the benefit of analytics on the latest data, not a precalculated summary snapshot. MemSQL enables all this through standard SQL that your developers already are trained to use.

Also, we achieved this on industry-standard hardware from Intel, not a special-purpose database machine. These are machines that your organization can afford to buy, or that you can rent from your favorite cloud provider, without zeroing out your bank account.

What could you do for your users with MemSQL on today’s Intel chips? Try us out, and watch below how we achieved a trillion rows per second.

Appendix: For you hardware geeks out there

If you’re craving more details about the Skylake chips we used, here’s the output of the Linux lscpu utility run on one of the servers making up the 9 total servers in the cluster:

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 112 On-line CPU(s) list: 0-111 Thread(s) per core: 2 Core(s) per socket: 28 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz Stepping: 4 CPU MHz: 2501.000 BogoMIPS: 4993.28 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 39424K NUMA node0 CPU(s): 0-27,56-83 NUMA node1 CPU(s): 28-55,84-111

The data was stored on SSDs, though that isn’t really significant since we ran the queries warm-start, so the data was already cached in memory. The network is 10-gigabit ethernet.