We use Amazon RDS for database servers and historically have had a practice of having smaller database server in QA and DEV to save cost. I would consider this a pretty common practice – but what I have also discovered is that it can lead to fixing non-bugs, or issues that will not arrise in production.

Let me try and describe a (~) worst case and some simple math to illustrate:

Production

– 10G of RAM

– 100 IOPS Storage Device (single hard disk)

– Workload is 100% reads

– 10K queries/second

QA

– 9G of RAM

– 100 IOPS Storage Device (single hard disk)

– Same workload (100% reads)

– Can only do 1K queries/second!

Production is serving its 10K queries/second with a 100% cache hit rate, but with the 1G less RAM, QA is only able to serve with a 99% cache hit rate. Each query touches on average 10 distinct pages. So:

10K queries * 10 pages = 100K pages/second 100K pages * 0.01 cache miss = 1OOO IOPS second required to cover misses.

.. since we only have 100 IOPS, we can only serve 1/10th of the queries!

In our case it was more like production having 68G RAM and QA having 34G. The query volume itself was also not necessarily as high, but the queries all needed to access a very large number of distinct pages – causing the same cache miss, and page churn.