How does a Cloud SQL database scale and what to know when setting one up Stepan Follow Mar 29 · 8 min read

One of the most challenging decisions when designing a new system is choosing your data storage solution. There are many aspects to consider, performance and scalability being usually the fundamental ones. I have been through this exercise many times and I always felt there’s not enough good documentation on what performance to expect from a Cloud SQL¹ database and how does it scale.

Performance Aspects

Let’s start by going over the main configuration choices that will directly influence the performance of your Cloud SQL instance.

CPU & Memory

No surprise here — more CPU and memory will give you better performance. You can choose an arbitrary number of CPU cores, starting from 1 shared core² and 0.6 GiB of memory up to 64 dedicated vCPUs and 416 GiB of memory.

It’s important to realize that the number of vCPUs will also directly influence the network performance of your instance. Network egress bandwidth starts at 1 Gbps for shared core instances and goes up to 32 Gbps³ for instances with 16+ cores. Don’t forget that network bandwidth is shared by both application and storage traffic, and if saturated, both will be competing with each other.

Note: While instance type is another important aspect influencing the performance, at the moment Cloud SQL — PostgreSQL only supports the first generation of general-purpose machine types — N1.

Disk Type and Size

Performance of SSD persistent disks⁴ scales linearly with its size until it reaches the limit imposed by your instance egress. Starting at 300/300 r/w IOPS and 4.8/4.8 MB/s throughput for a 10 GB volume, going up to 100k/30k r/w IOPS and 1200/800 MB/s throughput for a 4 TB one⁵. For Standard persistent disks the performance starts at 24/48 r/w IOPS for a 32 GB volume and tops at 7.5k/15k IOPS and 1200/400 MB/s throughput for 10 TB and larger volumes.

Note on HA and Downtime

Somewhat surprising might be that you can’t expect zero downtime of your Cloud SQL instance, not even in HA configuration⁶. HA provides data redundancy and shortens downtime in case of instance or zone failure, but it will still suffer from downtime on regular updates (these can happen up to once a week, although typically less than once a month). Downtime will also be required for scaling the instance.

There are strategies on how to mitigate this, but that’s beyond the scope of this article. Sufficient to say that Cloud SQL does not offer any way how to do that out of the box.

Topology

In order to get the best performance, you will want to provision private SQL instance and access it over the private IP address (with the public one, traffic would go over the internet and everything would be way slower). Technically the instance is provisioned in a separate VPC and then VPC peering is established between your VPC and the one managed by Google.

PostgreSQL Tuning

There are many options to further customize your Cloud SQL instance, apart from just choosing the version⁷. Both database flags and installed extensions will affect the performance as well, but for the purpose of this test all settings were kept at their defaults.

Testing Setup

The tests were executed from a client instance, sitting in the same VPC, same region, and zone, in order to minimize network impact. In this particular case, all the instances were provisioned in London ( europe-west2-a ).

The client instance was n2-highcpu-16 - notice a high number of CPUs to get high network performance. Client OS was default debian-cloud/debian-9 image, without any further performance tuning.

Tested was PostgreSQL version 11, in HA configuration. A disk size of 4 TB was chosen to avoid any throttling of the storage. The instance was provisioned as private and accessed using the private IP.

pgbench Setup

I’ve used pgbench, which is a simple benchmark tool bundled with PostgreSQL. By default, it runs scenario that is loosely based on TPC-B, and executes several SELECT, UPDATE and INSERT operations per transaction⁸.

The test would always run for five minutes, five iterations per tested configuration in total, with one initial “pre-warm” run which is not reflected in the results. Number of worker threads ( -j ) within pgbench was set to 8, scale factor ( -s ) to 100, the rest kept at their defaults. I've tried to experiments with some of these further, but apart from the settings mentioned later didn't find these to have a significant impact on the results.

Tested Configurations

In the first place, I was interested in how the database performance scales with the instance size. Tests were executed for instances starting with 1 vCPU up to instances with 64 vCPUs and memory amount equal to number_of_cpus x 4 in GB (basically this follows the standard instance sizing).

Let’s also look at how the number of parallel clients accessing the database affects performance — common scenario with multiple instances of the backend service(s). And finally, how much slower will everything be without connection pooling and with SSL on.

Code

The whole setup can be found as a code (Terraform mostly) in the stepanstipl/gcp-cloudsql-bench repository with additional details on how to reproduce the whole setup and execute the tests.

Results

General Performance and Latency

Fig. 1 shows that performance scales in a linear fashion up to 8–16 CPUs and flattens out afterward. This was pretty much expected, and it’s interesting to see how exactly the curve looks like. What surprised me a bit was that the average latency was significantly worse for smaller instance sizes. Starting at 140 ms for 2 vCPU instances and only falling below 20 ms for instances with 16 and more vCPUs.

So what is actually limiting the performance? Fig. 2 shows that smaller instances were maxing out on the CPU, but larger ones were nowhere near the CPU, nor previously described network and disk limits. This brings up questions on what are we actually testing with pgbench and how meaningful this metric is? I won’t dig deeper in this direction⁹, the answer is that in our scenario the machine spends a lot of time on context switching.

Concurrent Clients

The number of parallel connections did not have a too significant effect, as long as you stay within a reasonable range for the given instance size and TPS. We can observe that 100 for smaller, and 200 concurrent clients for instances above 10 vCPUs seemed to perform the best. Small instances do not support a higher number of connections by default (this can be changed by adjusting the max_connections option) and suffer from higher performance hit for higher numbers of concurrent connections.

The higher number of concurrent clients causes higher latency — fig. 4. Again, smaller instances suffer more.

Connection Pooling

Unsurprisingly, disabled connection pooling, when the client has to establish a new connection for each request instead of reusing existing ones, has a significant impact both on TPS throughput and latency. And as obvious as this might seem, I wanted to demonstrate how huge the difference is, as I’ve repeatedly seen this issue on production systems. Enabling connection reuse is often one of the cheapest ways to improve your application performance.

SSL

All the tests so far were performed without using SSL¹⁰, and enabling it has expected negative performance effect on both TPS and latency - fig. 6.

Data Stats

At last, let’s get some idea about the consistency of the measured data. The last graph, fig. 7, shows the standard deviation for each series of runs. Each data point representing one series of 5 runs for the given instance size and number of concurrent clients. The results seem reasonably consistent, higher variation in performance can be observed for instances with a high number of vCPUs (32, 64), while higher variation in latency can be observed for the smallest instance sizes.

Summary

Cloud SQL is a great hands-off way to run a very popular, well-performing and feature-rich database. Provision your instances with enough vCPUs and large enough disk size to get sufficient network bandwidth and disk IOPS. Also make sure that your connecting using the private IP, your application uses connection pooling, and expects occasional short downtime, even when using the HA mode. Keep in mind that PostgreSQL is not designed as a distributed system and if you require true zero downtime or near-linear scalability above what a single machine can handle, there might be more suitable database choices.

Hopefully, this will be useful as a rough guide on what to expect when considering Cloud SQL as the database option for your workload. Keep in mind it is a very generic test and I would always encourage you to test yourself, using a more specific benchmark that better reflects the behavior of your application.

Let me know about your experience with Cloud SQL or if you have any additional suggestions or questions. You can also find me at @stepanstipl. Happy life in the ☁☁☁!