Here at Remind, Amazon Redshift is a centerpiece of our data infrastructure. Event data is continuously ingested into Redshift. ETL processes are run within Redshift. Team members across the company, in a variety of roles, query Redshift. True, recent efforts have shifted some of our data processing onto other tools like Amazon Athena and EMR/Spark, but Redshift remains critical to data at Remind, and the performance of our Redshift cluster is therefore one of the primary concerns of the DataEng team.

Late last year we came across Intermix, a company that specializes in helping folks like us optimize their Redshift performance. We decided to do a two-week trial with them in January 2018.

Some months before, we had scaled our Redshift cluster from 12 to 16 nodes (ds.xlarge) as a quick-and-dirty way of improving performance, as well as to give us more headroom on disk space (every year at back-to-school time, our traffic — and thus our data volume — increases markedly). We’d since solved the disk space issue by removing one of our largest tables, and we’ve been working continuously on performance, so it was on our list of Q1 goals to scale the cluster back down to 12 nodes and we were hopeful that we could do so while maintaining an acceptable level of performance.

We decided that the Intermix trial would be a perfect time to try scaling down the cluster. The Intermix UI would give us easy visibility into our cluster’s performance before and after the downscaling. Also, one of Intermix’s key features is their “Throughput Analysis” which shows a timeseries graph of queue and execution times for each Workload Management (WLM) queue. Intermix’s co-founders, Lars Kamp and Paul Lappas, said they could help us us tune our WLM configuration to eliminate most queue wait times.

In consultation with Lars and Paul at Intermix, we decided to first scale the cluster down, watch how performance was impacted, and then try to tune the WLM config.

The downscale

We scaled Redshift from 16 to 12 nodes over the weekend, and watched performance the next few days. Intermix’s UI made it easy to see that queue times had increased markedly. Before the node reduction, queue wait times during peak periods averaged around 5 seconds. After the node reduction, we saw queue wait times of 20–35 seconds on average during those peak periods. Total query times (queue + exec) at peak hours increased from an average of 10–15 seconds before the downsize, to 35–50 seconds after the downsize.

Our WLM config: before

Before the intermix trial, we had 7 custom WLM queues. This included two different queues for adhoc queries (the default one, with a 15-minute timeout, and a special queue with no timeout that could be used for legimitately time-consuming queries), and two different queues for queries from our dashboard system (a high-concurrency, low-memory queue for most queries, and a low-concurrency, high-memory queue for queries that timed out of the first queue). We also had a dedicated queue for fast readonly queries used to check on ETL task completion, separated out from the main queue used for executing actual ETL tasks.

This setup stemmed from the principle that it’s best to separate out different workloads into different queues. The folks at Intermix believed that we had taken this principle overboard, and encouraged us to reduce the number of queues we used.

The WLM optimization process

Lars and Paul suggested that we aim to have just three custom WLM queues: one for event ingestion, one for ETL processes, and one for adhoc and dashboard queries. They also suggested that we take one of those categories at a time and try to reduce queue times for that queue to zero, while keeping an eye on disk-based queries (another metric that Intermix makes it easy to track) to make sure we weren’t starving queries for memory.

This was done in an iterative fashion, tracking changes to the WLM config in a spreadsheet which Intermix provided. Lars made suggestions to us at each step of the process, and his suggestions were generally spot-on.

We started with our event ingestion queue. We bumped the concurrency from 2 to 4, and queue wait times almost disappeared in this queue. Queries in this queue were still not going to disk, so we reduced memory from 10% to 8% and proceeded to tune our ETL queue.

We consolidated our two ETL queues into one, added the memory we’d taken from the events queue onto the ETL queue, and increased its concurrency from 4 to 7. This helped considerably, and there still wasn’t a significant increase in disk-based queries. We still saw some queueing during our peak ETL hours, which we decided to address in conjunction with the next step.

For our final step, we consolidated all four of our adhoc/dashboard queues into one, increasing total concurrency in this category from 26 to 30, and decreasing memory slightly — Lars suggested that the queries that were going to disk in this category were so big that a few percentage points of memory wouldn’t make much difference. We added the reclaimed memory to the ETL queue, increasing its concurrency from 7 up to 10. These changes brought queue times to zero in the adhoc/dashboards queue, and close to zero in the ETL queue.

It hadn’t been clear to us that consolidating the queues would be a win — we had been concerned that we would be unable to allocate enough memory to large queries, while wasting memory on smaller queries. However, it turned out that we had been worrying too much about memory and not enough about concurrency. We’d had query slots in some queues sitting idle while queries piled up in our main dashboard queue. Consolidating the queues solved that problem and allowed for greater overall concurrency. Intermix’s dashboards allowed us to quickly confirm that these changes didn’t hurt us in terms of the number of disk-based queries (or overall performance).

Our WLM config: after

We now have a much simpler WLM setup. Queue wait times are close to zero in every queue — substantially lower than the queue wait times even when we had the larger cluster. Thanks to the WLM optimization, our total query times (queue + exec) are back down to the levels we had before when running a larger cluster. Here’s a graph showing our progress:

Lessons learned

What did we learn through this process?

It’s super helpful to have a clear, easy-to-use redshift performance dashboard like the one provided by Intermix, so you can see what happens when you make changes.

More granular separation of Redshift workloads isn’t necessarily better, as it becomes too easy to waste resources when everything is split out into separate queues.

Significant performance improvements are possible with a systematic, iterative approach to WLM tuning.

It’s okay for our WLM config to have high level of overall concurrency (near Redshift’s limit of 50 slots).

The folks at Intermix really know what they’re doing, and their personalized support can be extremely valuable.

At the end of our trial, Lars made some suggestions for future work on our Redshift cluster to improve performance, reduce our maintenance burden, and plan for growth. Overall we have had a great experience with Intermix so far, and plan to use them on a continuing basis. We recommend checking them out if you are having Redshift performance issues or just want more visibility into what is happening on your cluster.