Scaling a relational database isn’t easy. Scaling a relational database out to multiple replicas and regions over a network while maintaining strong consistency, without sacrificing performance, is really hard.

InfoWorld

How hard? The CAP Theorem says that you can have at most two of the following three properties: consistency, 100 percent availability, and tolerance to network partitions.

A network partition is a break that blocks all possible paths between some pair of points on the network. Partitions do happen, even if you own and control your entire WAN, so either consistency or availability has to give.

What does this have to do with performance? If you demand “external consistency,” meaning strong consistency across replicas and regions in a distributed database, then transactions have to commit everywhere. There will be significant transmission lag between regions, at best 20 milliseconds per 1,000 miles (based on the speed of light in glass). That means cross-region commits will incur delays that impact the performance of the database, because commits require locks on certain parts of the database.
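The latency floor mentioned above can be estimated from first principles. The sketch below is a back-of-envelope calculation, assuming light travels through fiber at roughly 200,000 km/s (about two-thirds of its speed in vacuum); real links add routing and switching overhead on top of this minimum.

```python
# Minimum propagation delay for cross-region round trips,
# assuming an idealized straight fiber run.

FIBER_SPEED_KM_PER_S = 200_000  # ~2/3 the speed of light in vacuum
KM_PER_MILE = 1.609

def round_trip_ms(miles: float) -> float:
    """Minimum round-trip time over a straight fiber run, in milliseconds."""
    km = miles * KM_PER_MILE
    return 2 * km / FIBER_SPEED_KM_PER_S * 1000

# A commit spanning regions 1,000 miles apart pays roughly 16 ms of
# round-trip propagation delay before any protocol overhead is added.
print(f"{round_trip_ms(1000):.1f} ms")
```

Real-world figures like the 20 milliseconds cited above include that protocol and routing overhead, which is why they exceed this physical floor.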

Cloud Spanner is Google’s “best of both worlds” database. It has the scalability of NoSQL databases while retaining SQL compatibility, relational schemas, ACID transactions, and external consistency. It comes very close to getting around the CAP Theorem.

Cloud Spanner offers to the public the internal Google Spanner service used by AdWords, Google Play, and about 5,000 other Google services and databases. There is nothing else quite like it, at least among Google’s offerings.

Google Cloud Spanner in context

Many of the database products I’ve reviewed over the last few years have addressed database scalability and performance issues in their own ways. The NoSQL databases gain performance by giving up on strong consistency: They offer “eventual consistency” in order to scale out to hundreds or thousands of servers with asynchronous replication. They additionally sacrifice SQL compatibility, relational schemas, and ACID transactions. All of these sacrifices in the name of performance and scalability make them unsuitable for some application areas, such as finance and ad tech.

Typical SQL databases, on the other hand, have limited scaling potential. You can beef up the main server’s hardware with solid-state disks, large memory, and many CPUs. You can add read replicas for clients that don’t require consistent reads (such as websites) and manually shard the database (at a high cost in developer and/or DBA time and effort). The high-performance relational databases I’ve reviewed have taken various approaches to improve on those strategies.

For example, Amazon Aurora can deliver up to five times the throughput of standard MySQL on the same CPU and RAM, due to a tight integration between the database engine and an SSD-backed, virtualized, fault-tolerant, and self-healing storage layer. Aurora can fail over to one of as many as 15 read replicas. Google Cloud SQL uses similar techniques, but has lower limits on the size of the VMs and the number of replicas.

ClustrixDB uses multiple peered nodes behind a load balancer, with automatic data slicing and rebalancing, and its query analyzer sends the compiled SQL code to the ranking replica for the data. DeepSQL is a version of MySQL with its own database engine instead of InnoDB, using highly efficient, non-traditional data structures and automatic tuning of its parameters and algorithms.

The flowchart above summarizes Google’s guidance for picking a cloud storage option. Cloud Spanner, at the lower left, is for relational OLTP workloads that need horizontal scalability.

Google Cloud Spanner is the result of nearly a decade of evolution. Spanner started out as a NoSQL key-value store offering multi-row transactions, external consistency, and transparent fail-over across datacenters. The first iteration of Spanner drew on Google’s experience with Bigtable, the key-value store that inspired HBase and Cassandra, and with Megastore, a transaction processing system layered on top of Bigtable. In subsequent versions Spanner got a strongly-typed schema system, a SQL query processor, horizontal row-range sharding, child tables co-located and interleaved with parent tables, and a number of other features that turned it into an attractive distributed online transaction processing (OLTP) database.

Possible competitors, although not necessarily directly comparable products, include CockroachDB, an open-source, horizontally scalable distributed SQL database developed by ex-Googlers who were familiar with Spanner; Azure Cosmos DB, the distributed Microsoft NoSQL service; MapR-DB, an “in-Hadoop” NoSQL database; NuoDB, a distributed SQL database with hybrid deployment options; and IBM DB2 Parallel Sysplex and Oracle RAC, both pricey enterprise SQL database clusters.

Google Cloud Spanner characteristics

At its essence, Spanner is a sharded, globally distributed, and replicated relational database that uses a Paxos algorithm for reaching a consensus among its nodes. Spanner uses two-phase commits for strong consistency, but considers Paxos groups to be the members of the transaction. Each Paxos group needs only a quorum to be available, not 100 percent of its members.
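The quorum rule described above is simple arithmetic: a Paxos group stays writable as long as a majority of its replicas can talk to each other. A minimal sketch of that rule (illustrative only, not Spanner's implementation):

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that constitutes a majority."""
    return replicas // 2 + 1

def can_commit(replicas: int, reachable: int) -> bool:
    """A Paxos group can accept writes as long as a majority is reachable."""
    return reachable >= majority_quorum(replicas)

# A 5-replica group survives the loss of any 2 replicas:
print(can_commit(5, 3))  # True
print(can_commit(5, 2))  # False
```

This is why a transaction's two-phase commit can succeed even when individual machines are down: each participant is a whole Paxos group, not a single server.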

Client queries automatically fail over between replicas without needing to retry the query. Spanner automatically re-shards and migrates data across machines as needed in response to changes in the data, the number and location of available nodes, the query load, and network partitions.

Everything Spanner does is designed to be inherently consistent, even at the expense of availability. When there is a network partition, updates and commits may be delayed if the Paxos groups lack quorums or have lost their leaders, and they may fail if the groups are unable to elect new leaders and find a quorum.

Network partitions are extremely rare for Spanner, however, as it runs on Google’s high-availability private fiber WAN. In internal use at Google, Spanner has shown better than five nines availability, i.e. better than 99.999 percent, which means less than five minutes of downtime a year. In practice, authors of Google applications that use Spanner don’t usually bother to write code to handle availability failures.

Spanner gets serializability from locks, but it gets external consistency (similar to linearizability) from the TrueTime API. Spanner’s external consistency invariant is that for any two transactions, T1 and T2 (even if on opposite sides of the globe), if T2 starts to commit after T1 finishes committing, then the timestamp for T2 is greater than the timestamp for T1.

The TrueTime API depends on GPS time standards and atomic clocks in every datacenter, and provides less than 10 milliseconds time uncertainty. Absolute transaction sequencing allows for consistent reads and backups. The TrueTime API returns an interval guaranteed to contain the correct clock time; transactions may need to wait until the uncertainty runs out and only finish a commit when sure that the commit time is in the past.
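The commit-wait rule described above can be simulated in a few lines. This is a toy model, not the TrueTime API: `tt_now`, `TTInterval`, and the 5 ms uncertainty are all assumptions for illustration, standing in for TrueTime's interval-returning clock.

```python
import time
from dataclasses import dataclass

EPSILON_S = 0.005  # assumed clock uncertainty, under Spanner's 10 ms bound

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """Stand-in for TrueTime's now(): an interval guaranteed to
    contain the true clock time."""
    t = time.monotonic()
    return TTInterval(t - EPSILON_S, t + EPSILON_S)

def commit_wait() -> float:
    """Pick a commit timestamp, then wait out the uncertainty so the
    timestamp is guaranteed to be in the past before acknowledging."""
    ts = tt_now().latest             # chosen commit timestamp
    while tt_now().earliest <= ts:   # wait until ts is definitely past
        time.sleep(0.001)
    return ts

t1 = commit_wait()
t2 = commit_wait()  # starts after t1 finishes committing
assert t2 > t1      # external consistency: later commit, later timestamp
```

The wait costs roughly twice the uncertainty per commit, which is why Google engineers TrueTime's uncertainty to be so small.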

Spanner is not only inherently consistent; it is a temporal multi-version database. The database maintains a series of consistent snapshots taken between transactions, and allows non-blocking snapshot reads from any snapshot time in the past that hasn’t yet been garbage-collected. In fact, strong (current-time) reads are executed as snapshot reads with zero staleness. The advantage of asking for a snapshot read from the past (allowing bounded or exact staleness) is that it will rarely have to block while waiting for concurrent read-write transactions to commit. Google suggests using at least a 10-second staleness bound for maximum read performance.
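The multi-version idea is easy to see in miniature. The toy store below (my own sketch, not Spanner's storage engine) keeps every timestamped version of a value, so a read at a past timestamp sees a consistent snapshot without blocking newer writes:

```python
class MultiVersionStore:
    """Toy multi-version map: every write keeps a timestamped version,
    and a read at a past timestamp sees the state as of that time."""
    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value), in write order

    def write(self, key, value, ts):
        self._versions.setdefault(key, []).append((ts, value))

    def snapshot_read(self, key, ts):
        """Return the value as of `ts`, or None if the key didn't exist yet."""
        best = None
        for vts, value in self._versions.get(key, []):
            if vts <= ts:
                best = value
            else:
                break
        return best

store = MultiVersionStore()
store.write("balance", 100, ts=1)
store.write("balance", 250, ts=5)
print(store.snapshot_read("balance", ts=3))  # 100 -- the pre-update snapshot
print(store.snapshot_read("balance", ts=9))  # 250 -- the current value
```

A stale read simply picks an older timestamp; nothing has to be locked, which is where the read-performance win comes from.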

Google Common SQL is a dialect of ANSI SQL:2011 that uses standard features such as ARRAY and row types (called STRUCT) to support nested data. One of the internal requirements was for Protocol Buffer message and enum types to be first-class types in the Common SQL dialect. That led to the choice of a UTF-8-based STRING type, which maps well to Protocol Buffers and the programming languages used heavily at Google, rather than the traditional SQL CHAR and VARCHAR types.

Common SQL is not exactly the same as any of the SQL dialects used by Postgres, MySQL, SQL Server, or Oracle, but it has a lot of overlap. Oddly, Common SQL omits DML statements such as UPDATE, although Cloud Spanner has a gRPC Mutation() API call that offers insert, update, insert_or_update, replace, and delete operations.
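The semantics of those five mutation operations are worth spelling out, since they replace familiar DML. The sketch below models them against a plain primary-key dictionary; it is an illustration of the semantics, not the real client API.

```python
class MutationTable:
    """Illustrates the semantics of Cloud Spanner's mutation operations
    on a primary-key -> row dictionary (not the real client API)."""
    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        if key in self.rows:                 # insert fails on an existing row
            raise KeyError(f"row {key!r} already exists")
        self.rows[key] = dict(row)

    def update(self, key, row):
        if key not in self.rows:             # update fails on a missing row
            raise KeyError(f"row {key!r} not found")
        self.rows[key].update(row)           # changes only the given columns

    def insert_or_update(self, key, row):
        if key in self.rows:
            self.update(key, row)
        else:
            self.insert(key, row)

    def replace(self, key, row):
        self.rows[key] = dict(row)           # overwrites the entire row

    def delete(self, key):
        self.rows.pop(key, None)             # deleting a missing row is a no-op

t = MutationTable()
t.insert(1, {"name": "Ada", "score": 10})
t.insert_or_update(1, {"score": 20})         # row exists: only score changes
print(t.rows[1])  # {'name': 'Ada', 'score': 20}
```

The upsert-style insert_or_update is the workhorse for bulk loading, since it succeeds whether or not the row already exists.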

Security is baked into Cloud Spanner with Cloud IAM and Audit Logging. In addition, as with all Google Cloud Platform storage, data encryption is offered by default.

If you do not need a strongly consistent, horizontally scalable relational database, or you must use an open source database, then Cloud Spanner might not be the best choice for you. The figure above shows Google’s flowchart for choosing from the Google Cloud Platform storage options. The typical use cases for Cloud Spanner are ad tech, financial services, global supply chain, and retail.

Understanding Google Cloud Spanner costs

Except for the Northeast Asia region, which is more expensive, Cloud Spanner costs 90 cents per node per hour, plus 30 cents per GB per month of storage. Compare that to a minimum of 1.5 cents an hour and 17 cents per GB per month of storage for Cloud SQL. In either case, you also need to pay for network egress.

A rule of thumb is that a Cloud Spanner node can handle around 2,000 writes per second and 10,000 reads per second, along with 2 TiB of storage. So, for example, suppose you have a 10 TB database with a projected transactional load that is 80 percent reads and 20 percent writes with a total rate of 100,000 I/O per second, which works out to 80K reads per second and 20K writes per second. That means you should plan for at least 10 nodes to accommodate your write rate, which will cost $.90 per node per hour x 10 nodes x 730 hours per month = $6,570 per month for nodes and $3,000 per month for storage, for a total of $9,570 per month plus network egress costs. You can avoid network egress costs by running your client loads in the same Google Cloud region as your Cloud Spanner instance.
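The sizing arithmetic above generalizes into a short estimator. The rule-of-thumb capacities and prices are the ones quoted in this section; this is a back-of-envelope sketch (treating TB and TiB as interchangeable at this precision), not official pricing guidance.

```python
import math

NODE_HOURLY_USD = 0.90          # per node per hour (outside Northeast Asia)
STORAGE_GB_MONTH_USD = 0.30     # per GB per month
HOURS_PER_MONTH = 730
WRITES_PER_NODE = 2_000         # rule-of-thumb per-node write capacity
READS_PER_NODE = 10_000         # rule-of-thumb per-node read capacity
TB_PER_NODE = 2                 # rule-of-thumb per-node storage capacity

def nodes_needed(reads_per_s, writes_per_s, data_tb):
    """Nodes required by the most demanding of the three per-node limits."""
    return max(math.ceil(writes_per_s / WRITES_PER_NODE),
               math.ceil(reads_per_s / READS_PER_NODE),
               math.ceil(data_tb / TB_PER_NODE))

def monthly_cost_usd(reads_per_s, writes_per_s, data_gb):
    nodes = nodes_needed(reads_per_s, writes_per_s, data_gb / 1000)
    node_cost = nodes * NODE_HOURLY_USD * HOURS_PER_MONTH
    storage_cost = data_gb * STORAGE_GB_MONTH_USD
    return node_cost + storage_cost

# The example above: 80K reads/s, 20K writes/s, 10 TB of data.
print(round(monthly_cost_usd(80_000, 20_000, 10_000), 2))  # 9570.0
```

Note that the write rate (10 nodes) dominates here; the read rate alone would need 8 nodes and the data size alone only 5.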

Note that Google recommends provisioning enough nodes to keep overall CPU utilization under 75 percent for optimal performance. You can update Spanner instances at any time to change the number of nodes, although you shouldn’t use fewer than three nodes in a production database, and you can’t reduce the number of nodes below the number needed to support the data size (five nodes for the 10 TB database in the example above).

While Google has many internal multi-region Cloud Spanner databases, most customers will not need that scale. Even large international businesses often keep separate transactional, operational databases for each country, while rolling up the combined data for read-only analysis in a single data warehouse, such as Teradata, Oracle Data Warehouse, or Google BigQuery.

Running Google Cloud Spanner

You can use Cloud Spanner from the GCP console or the gcloud command line. Preparation starts by enabling Cloud Spanner for the GCP project you’d like to use.

Then you need to create a Cloud Spanner instance with a node count as in the image above, create a database within the instance, and create a schema, either graphically or with a SQL DDL script.

Then you should apply IAM roles for security, and load your data. Once that is done you can get query execution plans and run your queries, still either from the console or the command line.

You can also run queries from applications using Java, Go, Node.js, PHP, Python, gRPC, or REST. Note that the JDBC driver has some restrictions: It does not support DML or DDL statements, and the Google Common SQL STRUCT is mapped to a String type by the driver. As you run your applications, you can monitor Cloud Spanner operations with Stackdriver.

The most interesting way to test Cloud Spanner would be to run and monitor a real-world application on it, especially configured with multiple terabytes of data in multiple regions, and do a variational load study. The next most interesting way to test Cloud Spanner would be to run and monitor a benchmark that has a Spanner driver, such as YCSB.

A screenshot from the Cloud Spanner keynote shows the ticket-selling demo running in three regions with 70 nodes, against an 80 terabyte database. The transaction rate is about 500K per minute.

Quizlet tested Cloud Spanner with an eye toward using it for the company’s eponymous quiz app, which had started out with a MySQL database and grown as much as it could with replicas and Memcached but without horizontal sharding, since (as I mentioned earlier) manual sharding is too complicated for mere mortals. According to Peter Bakkum, Quizlet’s platform lead, “The fundamental problem with this architecture is that it’s limited by the capacity of the master. Without horizontal sharding, all of the writes and consistent reads must happen on that single machine.”