Under the hood, Heap is powered by a petabyte-scale cluster of Postgres instances. Our dataset is large, and our customers run a wide variety of open-ended, ad hoc analytical queries that cover large portions of this dataset. In order to support ever larger customers and more complex queries over such a large amount of data, we are always working on making these queries faster.

We have previously published several blog posts on how we optimize our database performance. One of the biggest problems we ran into as we scaled our cluster was: how can we safely and accurately evaluate changes to our database? In this post we will look at the system we designed to answer that question.

Approaches That Didn’t Work

Our initial method for testing database changes was largely based around synthetic, isolated benchmarks. When testing out a new schema, we would copy a small amount of production data onto an isolated machine and run some example queries. We would also do some write benchmarks, in which we would take a subset of the data and measure how long it would take to insert it into our database. If everything looked good, we would start rolling out the change into production.

Unsurprisingly, this approach to benchmarking did not accurately map to production. Sometimes we would encounter a customer whose data differed from our benchmark data in some critical way, and the change would cause a regression for that customer. Other times, a change would work well in isolated benchmarking but behave completely differently when deployed under a full production workload.

For a specific example, one such change was enabling compression across our cluster by replacing our file system with ZFS. When we tested compression in isolation, everything looked great. As we rolled it out to production, we discovered that we were no longer able to ingest all incoming data in real time. After some performance investigation, we were eventually able to improve ingestion enough to keep our data compressed, but it was still a very expensive mistake: a significant amount of engineering work had to go into fixing the problem, and in the meantime we had provided a degraded experience to our customers.

Enter Shadow Prod

In order to safely and accurately test changes before we roll them out to production, we built a system we call “shadow prod”. You can think of shadow prod as a staging environment for data-layer changes. Each machine in shadow prod is a mirror of a machine in production, but with a different configuration that we want to evaluate. All reads and writes that go to the production machine are mirrored to the shadow prod machine. This means the shadow prod machine is experiencing the exact same workload as a production machine. Importantly, shadow prod is designed such that, if something were to happen to a shadow prod machine, there would be minimal impact on production.

Shadow prod gives us a place where we can test database changes in an environment exactly like production, but without any of the risks associated with making customer-visible changes. The rest of this post details how shadow prod works and how we make sure the results we get from it are accurate.

Populating a Shadow Prod Machine

To spin up a new shadow prod machine, we lean on Postgres’ backup/restore functionality. For database changes that do not require a schema change (e.g. testing different hardware or configuration settings), we can spin up a new machine with the desired configuration and restore a Postgres backup onto it. This quickly and easily gives us a new machine that is an exact copy of an existing production machine, except that it has the change we want to test. Since we are restoring from an S3 backup, there is no impact on the production system. An additional benefit of setting up our machines this way is that we test our backups every time we spin up a new shadow prod machine.

For schema-level changes, a bit more work needs to go into getting the data into the new schema, but the same idea still applies. We are able to restore a backup of an existing production machine and then transform all of the data on it into the new schema. Again, since we are restoring the machine from backup and only doing the transform on that machine, there is no impact on production.

Whenever we are testing a change, we spin up two copies of the same machine. We designate one as the control machine and the other as the experimental machine. The control machine has the same configuration as production. Setting up the experiment this way gives us a baseline against which we can compare queries on the experimental machine, and protects us from the possibility that whatever we did to populate the machines affected the resulting performance.

Mirroring Reads and Writes

Mirroring reads and writes is conceptually simple: whenever a read or write is sent to production, also send that query to any relevant machines in shadow prod. In practice, this can be tricky. We’ve developed some techniques to make this easier.

When writing data, we might need to handle different schemas between the experimental node and the rest of our databases. Instead of inserting data directly into database tables, we use a set of Postgres user-defined functions that insert the data into the underlying tables. This way, the service that inserts data into the database doesn’t need to be aware of the underlying schema. Instead, when inserting data, the service can use a fixed query that calls the necessary user-defined functions, and the functions will handle inserting the data into the database in the correct schema. This enables us to have many different schemas in shadow prod, but still have the same interface to insert data into all of them. Our ingestion code can treat a shadow prod database like any of our prod databases – all it needs to know is what tables live where.
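To make this concrete, here is a toy sketch of that interface. All names here are hypothetical, and the real system uses Postgres user-defined functions rather than Python classes; the point is that ingestion always issues the same call, and each machine maps the event into its own schema:

```python
# Toy model of a schema-agnostic insertion interface (hypothetical names).
# Each "machine" exposes the same insert_event call; internally, each one
# maps the event into whatever schema that machine happens to use.

class SingleTableSchema:
    """Production-style schema: one wide events table."""
    def __init__(self):
        self.events = []

    def insert_event(self, customer_id, event):
        self.events.append({"customer_id": customer_id, **event})

class PartitionedSchema:
    """Experimental schema: events split across child tables by type."""
    def __init__(self):
        self.tables = {"sessions": [], "pageviews": [], "other_events": []}

    def insert_event(self, customer_id, event):
        table = event.get("type")
        if table not in self.tables:
            table = "other_events"
        self.tables[table].append({"customer_id": customer_id, **event})

def ingest(machines, customer_id, event):
    # Ingestion is schema-agnostic: the same call works on every machine.
    for machine in machines:
        machine.insert_event(customer_id, event)

prod = SingleTableSchema()
shadow = PartitionedSchema()
ingest([prod, shadow], 42, {"type": "pageviews", "url": "/docs"})
```

The same idea carries over to SQL: the ingestion service runs one fixed query that calls the user-defined function, and the function body differs per schema.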

In order to mirror production reads to shadow prod, whenever a user runs a query in Heap, we generate a SQL query which we run in our main database cluster and on each of the machines in shadow prod. We only return the results of the production query, so a shadow prod outage does not affect production reads. Some experiments require custom logic to generate the correct SQL, but writing this custom logic is not too much additional work, and we will need the logic anyway if we roll the experiment out to production. For some schema changes, it is possible to completely paper over the differences in schemas by using Postgres features such as views and table inheritance.
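A minimal sketch of this read-mirroring behavior, using hypothetical connection objects, might look like:

```python
# Sketch of read mirroring (hypothetical connection interface): the query
# runs everywhere, but only production's results are returned, and a shadow
# prod failure never propagates to the caller.

def mirrored_read(prod_conn, shadow_conns, sql):
    results = prod_conn.execute(sql)   # production's result is what we return
    for conn in shadow_conns:
        try:
            conn.execute(sql)          # run for measurement only
        except Exception:
            pass                       # a shadow outage must not affect reads
    return results

class FakeConn:
    """Stand-in connection for demonstration."""
    def __init__(self, rows=None, fail=False):
        self.rows, self.fail, self.queries = rows, fail, []

    def execute(self, sql):
        self.queries.append(sql)
        if self.fail:
            raise RuntimeError("shadow machine down")
        return self.rows

prod = FakeConn(rows=[(1023,)])
shadows = [FakeConn(rows=[(1023,)]), FakeConn(fail=True)]
rows = mirrored_read(prod, shadows, "SELECT COUNT(*) FROM numbers WHERE n = 25")
```

Even with one shadow machine down, the caller still gets production's answer.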

Analyzing the Results

In order to trust the results of a shadow prod experiment, we need to make sure that performance changes between the control and experimental machine are due to the change we made and not due to some unrelated factor.

In a previous post, we discussed a tool we’ve built that automatically collects data about every query we execute. The data includes the Postgres query plan (i.e. a description of how the query executes), the number of rows generated during each part of the query execution, and even the number of blocks read from disk during each portion of the query execution. We include a unique identifier in every query we run, which enables us to correlate each query run on a shadow prod database with a corresponding query on the control machine and in production.
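A sketch of how that correlation might work, with hypothetical names (the real collector is described in the earlier post):

```python
# Sketch of query correlation via an embedded identifier (names hypothetical).
# Each query carries a unique id in a SQL comment; the collector records each
# plan keyed by (query_id, machine), so runs of the same query can be joined.
import uuid

def tag_query(sql):
    query_id = uuid.uuid4().hex
    return query_id, f"/* query_id={query_id} */ {sql}"

collected = {}  # (query_id, machine) -> plan record

def record_plan(query_id, machine, plan):
    collected[(query_id, machine)] = plan

def correlate(query_id, machine_a, machine_b):
    return collected.get((query_id, machine_a)), collected.get((query_id, machine_b))

qid, sql = tag_query("SELECT COUNT(*) FROM numbers WHERE n = 25")
record_plan(qid, "control", {"node": "Seq Scan", "ms": 631.5})
record_plan(qid, "experimental", {"node": "Index Scan", "ms": 2.3})
control, experiment = correlate(qid, "control", "experimental")
```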

Along with these performance tools, shadow prod allows us to perform a direct comparison between experimental conditions for every query we run. Using this data, we have built an extremely helpful workflow for validating the results of an experiment. We can specify in code the exact differences we expect to see between queries we run on the control machine and queries we run on the experimental machine. Specifying the expected differences in code allows us to programmatically run sanity checks of the performance data, such as the following:

- Are the query plans that occur on the control machine the same as the ones we get in production? If not, that means the control machine is different from production in an important way, which we need to resolve before continuing.

- Do the query plans we collect on the experimental machine have the changes we expected when compared to the control machine? If not, that means our optimization is not working as intended.

- Did the query plans we collect on the experimental machine have any unexpected changes? If there are changes in the query plans beyond what we expect, that means either the change does things we didn’t expect or there is some flaw in the design of our experiment.

- Did the queries we expected to improve actually improve? Did the queries we expected not to improve stay the same? Like the checks above, this helps us make sure changes in performance are attributable to the change we made, and not to an external factor.
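As a rough illustration, checks like these can be expressed over summarized plan records. The function and record shapes below are hypothetical simplifications of our tooling:

```python
# Sketch of an expected-changes check over simplified plan records. Each plan
# is summarized as an ordered list of node types, and expected_changes maps a
# control node type to the node type it should become on the experimental
# machine. Any other difference is flagged as unexpected.

def check_plans(control_nodes, experimental_nodes, expected_changes):
    problems = []
    for before, after in zip(control_nodes, experimental_nodes):
        expected = expected_changes.get(before, before)
        if after != expected:
            problems.append(f"expected {before!r} -> {expected!r}, got {after!r}")
    return problems

control = ["Aggregate", "Seq Scan"]
experimental = ["Aggregate", "Index Scan"]
# The change we expect: sequential scans become index scans.
ok = check_plans(control, experimental, {"Seq Scan": "Index Scan"})
# If the experimental plan still has a Seq Scan, the check flags it.
bad = check_plans(control, ["Aggregate", "Seq Scan"], {"Seq Scan": "Index Scan"})
```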

Here’s a simple example to give you an understanding of how this tooling works. Let’s say we have a table of 10 million random numbers between 1 and 10,000:

CREATE TABLE numbers (n) AS SELECT trunc(random() * 10000)::bigint FROM generate_series(1, 10000000) g(i);

Let’s say our users commonly run a query that counts how many times the number 25 occurs in this table:

SELECT COUNT(*) FROM numbers WHERE n = 25;

We believe creating this index on the numbers in the table will make this query faster:

CREATE INDEX ON numbers (n);

To test the impact of the index, we can set up a shadow prod experiment. We’ll set up a control machine that, like production, does not have an index on the table, and an experimental machine on which we create our new index on the table.

When we run our query on the control machine, our tooling will automatically collect information about how the query executed and emit the following information:

Aggregate (actual time=631.278..631.278 rows=1 loops=1)
  Buffers: shared hit=1137 read=43111
  ->  Seq Scan on numbers (actual time=0.181..631.101 rows=1023 loops=1)
        Filter: (n = 25)
        Rows Removed by Filter: 9998977
        Buffers: shared hit=1137 read=43111
Planning time: 0.142 ms
Execution time: 631.479 ms

In practice, the tool emits the above information in JSON. Having the information in JSON makes it easy to do programmatic analysis of the data. The information is only in the format above for display purposes.
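For illustration, here is a small sketch of working with the JSON form. The field names ("Plan", "Node Type", "Plans", "Execution Time") are the ones Postgres emits with EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON); the helper function is a hypothetical simplification of our tooling:

```python
# Walking EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) output. The structure below
# mirrors the sequential-scan plan shown above.

def node_types(plan):
    """Recursively collect every node type in a plan tree, top-down."""
    return [plan["Node Type"]] + [t for child in plan.get("Plans", [])
                                  for t in node_types(child)]

explain_output = [{
    "Plan": {
        "Node Type": "Aggregate",
        "Plans": [{"Node Type": "Seq Scan",
                   "Relation Name": "numbers",
                   "Actual Total Time": 631.101,
                   "Shared Read Blocks": 43111}],
    },
    "Planning Time": 0.142,
    "Execution Time": 631.479,
}]

plan = explain_output[0]["Plan"]
types = node_types(plan)
total_ms = explain_output[0]["Execution Time"]
```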

The key pieces of information above are the execution time and the scan type. Postgres took 631 ms to execute the query, and practically all of that time was spent performing a sequential scan over the numbers table.

If we run the same query on the experimental machine, we will get the following output:

Aggregate (actual time=1.736..1.736 rows=1 loops=1)
  Buffers: shared hit=1017
  ->  Index Scan using numbers_n_idx on numbers (actual time=0.038..1.467 rows=1023 loops=1)
        Index Cond: (n = 25)
        Buffers: shared hit=1017
Planning time: 0.110 ms
Execution time: 2.331 ms

In this case we can see the query took only 2.3 ms. Postgres used an index scan instead of a sequential scan and read far fewer blocks from disk.

If we wanted to sanity check our experiment and make sure shadow prod is working as expected, we would write code to check the following conditions:

- All sequential scans over the numbers table become index scans.

- The index scan over the numbers table is significantly faster than the sequential scan.

- The index scan over the numbers table reads significantly fewer blocks than the sequential scan.

- Besides the changes above, there are no other changes in query execution.

- The performance of other parts of queries remains the same.

With code for checking these conditions, we can automatically surface any queries that fail one of the checks. This allows us to either verify that the index is having the impact we expected or discover specific queries where it is not.

Real-World Example: Table Partitioning

We recently used shadow prod to test table partitioning. This is a database technique in which a table is split into several smaller tables that contain disjoint subsets of the data. In our case, we were splitting our single events table into three smaller tables: sessions, pageviews, and other_events. We expected this to improve performance for queries that touch a lot of sessions or a lot of pageviews, due to improved disk locality. For more information about the benefits of table partitioning, see the official Postgres documentation. When we tested table partitioning in shadow prod, we ran the following sanity checks:

- Scans over the events table on the control machine corresponded to a scan on exactly one of sessions, pageviews, or other_events.

- Other than a scan on the events table becoming a scan on one of the child tables, there were no other changes in the query plans.

- Each scan over either the sessions table or the pageviews table on the experimental machine should have read significantly fewer blocks and run significantly faster than the corresponding scan on the control machine.

- Besides scans over the events table, all other portions of the query plans should have comparable performance on both the experimental machine and the control machine.
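The first of these checks can be sketched as a small predicate over the relation names scanned on each machine (a hypothetical simplification of our tooling):

```python
# Sketch of the partition-routing check: every scan that hit the events table
# on the control machine should map to a scan on exactly one child table on
# the experimental machine. Relation names are the ones from the experiment.

CHILD_TABLES = {"sessions", "pageviews", "other_events"}

def check_partition_routing(control_relations, experimental_relations):
    events_scans = control_relations.count("events")
    child_scans = [r for r in experimental_relations if r in CHILD_TABLES]
    return events_scans == len(child_scans)

# A well-routed query: one events scan becomes one pageviews scan.
good = check_partition_routing(["events"], ["pageviews"])
# A badly-routed query: one events scan fans out to all three child tables.
bad = check_partition_routing(["events"], ["sessions", "pageviews", "other_events"])
```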

When running these checks, our tooling automatically displays the diff between an expected change in query execution and the actual change. This allows us to quickly isolate and eliminate confounding factors in our experiments. These checks also enable us to detect edge cases in the optimization we are making and fix them before we roll anything out to production.

This tooling exposed some serious flaws in our experiment that would have corrupted our results. One was an unintentional configuration difference in the machines. Our servers have local SSDs, which we use as read caches for slower EBS disks. This is done through the L2ARC feature in ZFS. The read cache happened to be disabled on the control machine, due to a machine restart, which made our experimental machine look better in relative terms. We discovered this problem when our tools flagged queries on the experimental machine that were faster than queries on the control machine even though they had the same plans.

Another issue we were able to detect was that a particular type of query was not optimized for the partitioned schema. That query wound up inefficiently reading data from all three of the sessions, pageviews, and other_events tables. Thanks to shadow prod, we were able to detect and fix this problem before we rolled anything out to production.

When we rolled out partitioning, we found the performance improvements in production matched what we expected from shadow prod, and there were no unexpected regressions. This means shadow prod actually works as a way to accurately measure the impact of changes!

The shadow prod environment has been an invaluable tool for testing database changes. With it, we are able to run tests in a production-like environment without any risk of affecting production. This allows us to be confident in changes we make when we finally put in the work to roll them out to production. It means we can go from an idea to a customer-visible performance improvement in a matter of weeks, not months, and with much less risk.

If you have any questions feel free to reach out to me on Twitter at @mmalipser. If building these kinds of systems sounds interesting to you, you should definitely apply to Heap!