By Daniel Gómez Ferro and Monte Zweben

Introduction

Splice Machine is a Hybrid Transactional/Analytical Processing (HTAP) database designed to modernize legacy applications. By combining aspects of a traditional RDBMS, such as ANSI SQL support and ACID (Atomicity, Consistency, Isolation, Durability) transactions, with the scalability, efficiency, and availability of in-memory analytics and machine learning, the Splice Machine data platform empowers the data-driven, intelligent applications that transform enterprises.

To power mission-critical applications, Splice Machine’s patented distributed transaction system maintains ACID properties across multiple tables, records, constraints, and indexes.

Splice Machine uses a snapshot isolation design that implements Multi-Version Concurrency Control (MVCC) to create a new version of a record every time it is updated, instead of overwriting the old data value in place. This means that users have access to the data that was available when they began reading, even if that data has been subsequently updated by a writer.

Because each transaction has its own virtual “snapshot,” transactions can execute concurrently without any locking. This leads to very high throughput and avoids troublesome deadlock conditions.

In this white paper, we are going to discuss how Splice Machine:

Implements distributed transactions with snapshot isolation

Performs hierarchical transactions

Executes transactions for both OLTP and OLAP workloads efficiently

Snapshot Isolation

Splice Machine uses timestamps to implement snapshot isolation, which requires that a transaction read only data from transactions that committed before it began. To illustrate, the graphic below shows the start and commit times of three separate transactions. Here, the third transaction (T3) sees the changes made by the first transaction (T1), because T1 committed before T3 was initiated; the second transaction (T2) does not see those changes, because it started before T1 committed.

Snapshot isolation requires a transaction to read data from other transactions only after they have committed.

Snapshot isolation also requires that we check for write-write conflicts. A write-write conflict occurs when two transactions overlap in time and both try to write to the same row in a table. In the example above, if transactions T1 and T2 write to the same row, one of them must be aborted because their lifetimes overlap, whereas T1 and T3 cannot suffer a write-write conflict.
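Both rules reduce to simple timestamp comparisons. Below is a minimal sketch in Java, using illustrative types rather than Splice Machine’s internal representation; a null commit timestamp means the transaction is still in flight.

```java
// Minimal sketch of the two snapshot isolation rules. Illustrative types,
// not Splice Machine's internal API.
record Txn(long beginTs, Long commitTs) {

    // Visibility: this transaction may read the writer's data only if
    // the writer committed before this transaction began.
    boolean canRead(Txn writer) {
        return writer.commitTs() != null && writer.commitTs() < beginTs();
    }

    // Write-write conflict: two writers conflict if their lifetimes
    // overlap, i.e. neither committed before the other began.
    boolean conflictsWith(Txn other) {
        boolean otherEndedFirst =
                other.commitTs() != null && other.commitTs() < beginTs();
        boolean thisEndedFirst =
                commitTs() != null && commitTs() < other.beginTs();
        return !(otherEndedFirst || thisEndedFirst);
    }
}
```

With the timestamps from the example above, T1 and T3 can never conflict because T1 committed before T3 began, while T1 and T2 conflict if they write the same row.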

To support these checks, Splice Machine assigns timestamps to transactions when they start and when they commit. All timestamps are unique, so no two transactions can share a start or commit timestamp. In the Splice Machine platform, a timestamp server ensures that timestamps are unique and advance monotonically. The timestamp server reserves a block of timestamps by registering it with a consensus manager (Apache ZooKeeper) so that no other concurrent instance can use the same timestamps. This matters for recovery: if a server fails and a new transaction must be created, there is still exactly one source of unique timestamps for the whole cluster. Because a whole block of timestamps is reserved at once, requests for new timestamps can be served locally rather than updating ZooKeeper state on a transaction-by-transaction basis. This provides higher transaction throughput and removes ZooKeeper as a transactional bottleneck.
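The allocation pattern can be sketched as follows. The block size is arbitrary and reserveNextBlock() is a hypothetical hook standing in for the ZooKeeper update; Splice Machine’s actual implementation differs in detail.

```java
// Sketch of block-based timestamp allocation. reserveNextBlock() is an
// assumed hook that hides the consensus-store interaction.
abstract class TimestampOracle {
    private static final long BLOCK_SIZE = 16_384; // illustrative block size

    private long next;   // next timestamp to hand out
    private long limit;  // first timestamp beyond the reserved block

    /** Durably records the new upper bound in the consensus store
     *  (e.g. ZooKeeper) and returns it, so that no other instance,
     *  including one started after a crash, can reuse these timestamps. */
    protected abstract long reserveNextBlock(long blockSize);

    synchronized long nextTimestamp() {
        if (next >= limit) {                      // block exhausted
            limit = reserveNextBlock(BLOCK_SIZE); // one cluster round trip
            next = limit - BLOCK_SIZE;
        }
        return next++;                            // served locally
    }
}
```

Because a cluster-wide round trip happens only once per block, the common case is a purely local increment.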

Splice Machine stores data in HBase. Each HBase record includes a cell with a key, a value, and a timestamp.

Structure of an HBase record in Splice Machine.

The key holds the value of the record’s primary key in the table. The value is a compact encoding of all of the record’s columns. Splice Machine uses the timestamp portion of the HBase cell to store the transaction start time, so each record version is tagged with the unique begin timestamp of the transaction that wrote it. This timestamp is served by the timestamp server rather than being the usual wall-clock HBase timestamp.
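In HBase client terms, this amounts to supplying the cell timestamp explicitly on every write instead of letting HBase default to wall-clock time. A hedged sketch, with a placeholder column family and qualifier rather than Splice Machine’s real on-disk schema:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

final class VersionedWriter {
    // Placeholder family/qualifier; not Splice Machine's real schema.
    private static final byte[] FAMILY = Bytes.toBytes("V");
    private static final byte[] QUALIFIER = Bytes.toBytes("data");

    // The cell's timestamp slot carries the transaction's begin
    // timestamp, served by the timestamp server.
    void write(Table table, byte[] rowKey, byte[] encodedRow, long beginTs)
            throws IOException {
        Put put = new Put(rowKey);
        put.addColumn(FAMILY, QUALIFIER, beginTs, encodedRow);
        table.put(put);
    }
}
```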

Recall from the earlier discussion that snapshot isolation depends on both start and commit timestamps: a transaction may only read data that committed before it began. We therefore need a mechanism to associate a commit timestamp with its start timestamp. In Splice Machine, a globally visible transaction table called SPLICE_TXN accomplishes this. This lookup table stores the mapping between start and commit timestamps, along with transaction metadata such as the type of transaction and whether it has been rolled back.

Suppose a transaction that started at timestamp 20 looks up a record. The reader first checks the begin timestamp stored with the record (5 in the case below), then consults the SPLICE_TXN table, which maps begin timestamp 5 to commit timestamp 10, confirming that the writing transaction committed. Because the write committed at 10, before the reader began at 20, the value of this record is visible.

The data table holds the transaction’s begin timestamp, and the SPLICE_TXN table holds its commit timestamp.
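Putting the pieces together, the visibility check is a lookup followed by a timestamp comparison. The TxnStore and TxnEntry types below are hypothetical abstractions over the SPLICE_TXN table:

```java
// Hypothetical abstractions over the SPLICE_TXN table; names are
// illustrative, not Splice Machine's internal API.
interface TxnStore {
    TxnEntry lookup(long beginTs); // resolve a begin timestamp to metadata
}

record TxnEntry(long beginTs, Long commitTs, boolean rolledBack) {}

final class VisibilityChecker {
    private final TxnStore store;

    VisibilityChecker(TxnStore store) {
        this.store = store;
    }

    // A cell written at writerBeginTs is visible to a reader that began
    // at readerBeginTs iff the writer committed before the reader began.
    boolean isVisible(long writerBeginTs, long readerBeginTs) {
        TxnEntry writer = store.lookup(writerBeginTs);
        if (writer == null || writer.rolledBack()) return false;
        Long commitTs = writer.commitTs(); // null means still in flight
        return commitTs != null && commitTs < readerBeginTs;
    }
}
```

For the example above, lookup(5) yields commit timestamp 10, which is earlier than the reader’s begin timestamp 20, so the record is visible.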

As an optimization that avoids the SPLICE_TXN lookup, Splice Machine also adds a record in HBase with a special value, represented in the graphic below by the word “Commit.” This happens when in-memory records are flushed to disk by the log-structured merge tree implementation, or during transaction roll-forward operations, where the latest values of a record are consolidated into a single row to avoid the overhead of MVCC resolution.

Optimization to avoid SPLICE_TXN lookup

The commit cell has the same key as the data the transaction wrote, so the two sit next to each other, since records are ordered by key. Writing the commit timestamp alongside the data itself means that the next transaction, instead of referring to the lookup table, simply reads both cells and resolves all the information locally. In this case, the new transaction at timestamp 20 can see that the data was written by transaction 5, which committed at timestamp 10. This approach greatly improves performance, as it is faster than a lookup on the SPLICE_TXN table.
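The fast path is thus a local check with a remote fallback. In this sketch, the commit cells already read alongside the data are represented as a simple map; all names are illustrative, not Splice Machine’s API:

```java
import java.util.Map;

// Fast path: resolve the commit timestamp from a co-located commit cell
// when present; fall back to the SPLICE_TXN lookup only when it is not.
final class FastVisibilityCheck {
    interface TxnLookup {                    // fallback: SPLICE_TXN table
        Long commitTimestamp(long beginTs);  // null if uncommitted/rolled back
    }

    private final TxnLookup spliceTxn;
    private final Map<Long, Long> commitCells; // beginTs -> commitTs, read
                                               // from cells next to the data

    FastVisibilityCheck(TxnLookup spliceTxn, Map<Long, Long> commitCells) {
        this.spliceTxn = spliceTxn;
        this.commitCells = commitCells;
    }

    boolean isVisible(long writerBeginTs, long readerBeginTs) {
        Long commitTs = commitCells.get(writerBeginTs);  // local, no lookup
        if (commitTs == null) {
            commitTs = spliceTxn.commitTimestamp(writerBeginTs); // remote
        }
        return commitTs != null && commitTs < readerBeginTs;
    }
}
```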

Hierarchical Transactions

Splice Machine uses hierarchical transactions to manage complex statements. For example, queries dispatched to Spark use hierarchical transactions as follows: the user transaction, which is the root, is classified as transaction 0; the statement transaction/savepoint is a child of transaction 0; and each Spark task gets its own transaction as a child of the statement transaction. Triggers and our distributed write pipeline also use hierarchical transactions. Here’s how they work. Assume the root transaction has timestamp zero and is the parent transaction. Any transaction can have child transactions, which behave like any other transaction: they can only read data written by transactions that have already committed. In addition, a child’s visibility depends on its parent: a child can read data written by another transaction only if its parent can read that data too.

Splice Machine uses hierarchies to manage complex queries.
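Expressed recursively, the visibility rule for a child transaction looks like the sketch below; the types are illustrative, not Splice Machine’s internals.

```java
// Inherited visibility: a child sees a committed write only if the write
// committed before the child began AND every ancestor can see it too.
final class ChildTxn {
    final long beginTs;
    final ChildTxn parent; // null for the root (user) transaction

    ChildTxn(long beginTs, ChildTxn parent) {
        this.beginTs = beginTs;
        this.parent = parent;
    }

    // writerCommitTs is the commit timestamp of some other transaction.
    boolean canSee(long writerCommitTs) {
        boolean committedBeforeMe = writerCommitTs < beginTs;
        boolean parentCanSee = parent == null || parent.canSee(writerCommitTs);
        return committedBeforeMe && parentCanSee;
    }
}
```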

Essentially, child transactions inherit the view of their parents and then apply the same rule to themselves. Splice Machine implements the transaction hierarchy in several ways, depending on the scenario. The first is to encapsulate execution when the user runs a command within a transaction. We don’t know in advance whether the command will succeed. For instance, a command to insert 100 rows might fail due to a primary key violation. In this case, Splice Machine does not roll back the entire transaction. (The system can also be configured to fail the entire insert if any row fails, which may be desirable depending on the use case.) We achieve this by automatically creating child transactions that drive each statement. If the command fails, we roll back the child transaction and create a new one for the next statement. In the figure below, if child transaction T11 fails, Splice Machine rolls it back and creates a new child transaction, T12, for the next statement.

If a child transaction fails, it is rolled back and a new one is created for the next statement.

We also use child transactions to implement savepoints. Whenever we create a savepoint, we encapsulate it in a child transaction. If the user decides to roll back to a particular savepoint, we roll back the corresponding set of child transactions, and work resumes in a new transaction from that point onward. Savepoints can be nested, as shown in the figure below.

Splice Machine uses hierarchy to encapsulate child transactions and to implement savepoints.
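From the application’s perspective, savepoints are just standard JDBC; the child transactions stay under the hood. A hedged example, with a placeholder connection URL and table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Savepoint;
import java.sql.Statement;

final class SavepointExample {
    static void run() throws SQLException {
        // Placeholder URL; adjust host/port/database for your deployment.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:splice://localhost:1527/splicedb");
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(false);
            stmt.executeUpdate("INSERT INTO t VALUES (1)");
            Savepoint sp = conn.setSavepoint("afterFirstInsert");
            try {
                stmt.executeUpdate("INSERT INTO t VALUES (2)");
            } catch (SQLException e) {
                // Rolls back only the child transaction(s) created after
                // the savepoint; earlier work in the transaction survives.
                conn.rollback(sp);
            }
            conn.commit(); // the first insert still commits
        }
    }
}
```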

Another important function of the hierarchy is to support distributed execution. When the user runs a distributed operation, we communicate the transaction information to all the nodes executing it. An example is the import function, which runs in Spark, reading data concurrently from different machines and inserting it directly into HBase from each task. Each task receives the transaction information and creates its own unique child transaction, which it uses to write data to HBase. If some of these tasks fail, Splice Machine can roll back the partially failed child transactions and retry the tasks with new transaction identifiers. If the failure is catastrophic, we can fail the entire operation and roll back all the transactions.
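The per-task pattern might be sketched as follows, reusing the illustrative ChildTxn type from the earlier sketch together with a hypothetical TxnManager; none of these names are Splice Machine’s API.

```java
// Each task writes under its own child transaction; a failed task's
// child is rolled back and the task retried under a fresh one.
final class TaskRunner {
    interface TxnManager {
        ChildTxn beginChild(ChildTxn parent);
        void commit(ChildTxn txn);
        void rollback(ChildTxn txn);
    }

    static void runTask(TxnManager mgr, ChildTxn statementTxn,
                        Runnable writeRows, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ChildTxn taskTxn = mgr.beginChild(statementTxn); // fresh id per try
            try {
                writeRows.run();       // write this task's rows under taskTxn
                mgr.commit(taskTxn);
                return;
            } catch (RuntimeException e) {
                mgr.rollback(taskTxn); // discards only this task's writes
            }
        }
        // Catastrophic case: the caller fails the whole operation and
        // rolls back all of its transactions.
        throw new IllegalStateException("task failed after " + maxAttempts
                + " attempts");
    }
}
```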

Acquiring start and commit timestamps can be an expensive operation for OLTP workloads, where the primary objective is low latency. We address this issue by distinguishing between persisted and in-memory transactions. All root-level transactions are persisted transactions, recorded in the transaction table. The transaction ID is broken into two parts: the least significant byte is reserved for a counter that uniquely identifies child transactions, while the rest of the ID is the value incremented by the timestamp server. A new child transaction, instead of fetching a new timestamp from the timestamp server, simply increments the counter. This allows us to uniquely identify the data written by sub-transactions, and if one is rolled back, we can record it in the transaction table at the ID level. This approach is effective for local transactions: when an OLTP transaction runs on one machine, we have access to the counter and can keep incrementing it, maintaining unique IDs for each sub-transaction.

The transaction ID is broken into two parts; the least significant byte is reserved for a counter that identifies child transactions.

When we run out of counter bits, we create new, persisted sub-transactions by getting a new timestamp from the timestamp server. In that case, the sub-transaction receives a new sequence number from the timestamp oracle, its counter restarts at zero, and its own sub-transactions keep incrementing that counter. Transactions running on a different Spark executor, where we have no access to the counter, must likewise be persisted.
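An illustrative encoding of this scheme is shown below; the exact bit layout inside Splice Machine may differ.

```java
// Transaction id = timestamp-server value in the high bits, in-memory
// sub-transaction counter in the least significant byte.
final class TxnIds {
    static final int COUNTER_BITS = 8;
    static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1; // 0xFF

    // Root (persisted) transaction: counter byte starts at zero.
    static long rootId(long serverTimestamp) {
        return serverTimestamp << COUNTER_BITS;
    }

    // In-memory sub-transaction: same timestamp, incremented counter.
    static long childId(long parentId, int counter) {
        if (counter > COUNTER_MASK) {
            // Out of counter bits: a new persisted transaction must be
            // created with a fresh timestamp from the timestamp server.
            throw new IllegalStateException("counter exhausted");
        }
        return (parentId & ~COUNTER_MASK) | counter;
    }
}
```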

Since distributed tasks and jobs already carry some overhead due to the distributed nature of execution, obtaining begin and commit timestamps for them does not have a major impact.

Conclusion

In summary, we believe the Splice Machine platform can perform both OLTP and OLAP workloads efficiently. Sub-transactions do not require new timestamps from the timestamp server, because one byte of the transaction ID serves as an in-memory counter. For OLAP and distributed queries, Splice Machine’s hierarchical design allows multiple transactions to execute simultaneously yet independently of each other, so they can start, commit, and retry independently. This lets us manage transactional workloads in a safe and consistent manner that is also very efficient, because we don’t require a centralized coordination point.

If you’d like to learn more about how Splice Machine works, watch our “under the hood” webinar for a deeper dive on the internals of our data platform.

About Splice Machine

Splice Machine is a scalable SQL database that enables companies to modernize their legacy and custom applications to be agile, data-rich, and intelligent — all without rewrites.

When you migrate Oracle and DB2 applications to Splice, you not only reduce database licensing costs but also enable the applications to add new data sources at a massive scale.

Splice Machine enables you to unify analytics and machine learning, which used to run on separate platforms, natively within the application, reducing ETL latency and infrastructure costs and enabling intelligent decision-making in the moment.