Monday, October 2, 2017 at 10:23AM

This is a guest post by Mark Travis, Performance Engineer at Ripple.

Ripple’s XRP Ledger is a blockchain-based payment network that transfers funds between any type of currency within a few seconds with average transaction costs of a fraction of a penny. The core of this peer-to-peer network is an open source C++ application called rippled. Ripple’s goal is to supplant the world’s existing legacy payment networks. As such, scalability is a continuous goal. This document describes how the rippled team has integrated performance engineering into its development processes, and how this has contributed to throughput gains of over 1000%.

Performance engineering practices deliver benefits in addition to measurable performance gains. These include the ability to report on the capabilities of the software so that users can feel confident that their needs will be met by the system. Performance engineering informs capacity planning and optimal configuration of environments to support the application. Many performance problems are caught and addressed before customers notice them. As process automation improves, each change to the software can be quickly assessed for improvement or regression. This methodology also makes better use of developer time by helping choose the most effective tasks for improving performance. Any software project serious about supporting global scale should integrate performance engineering into its development cycle.

Performance Engineering Method

The practices adopted within Ripple are likely to be applicable to any software development project--particularly for enterprise transaction processing systems. Performance engineering implements tools and processes with which the team can continuously improve the product according to objective standards.

Goal

The main goal for this type of system is to maximize throughput while maintaining acceptable latency. The XRP Ledger processes all transactions in roughly 4 second intervals. Each transaction is cryptographically signed by the sender and is distributed to each peer on the network. The consensus process involves a set of trusted nodes called “validators” which must agree upon the transactions that are applied to each new ledger. Once submitted to the network, each valid transaction is committed as soon as the validators agree upon inclusion. The main goal of performance engineering at Ripple is to measure and improve transaction processing throughput of a set of trusted validators.

The best way to achieve this goal is through the scientific method: test the environment, analyze results, formulate a hypothesis, apply a fix, and repeat.

Test Environment

Testing entails macro-benchmarking. Macro-benchmarking most closely emulates real-world user activities against the system. This is contrasted with micro-benchmarking, which only tests a very limited subset of the system. Micro-benchmarks sometimes are useful, but care must be taken to not infer any capabilities for the system as a whole based upon micro-benchmarks alone. A contrived example of this type of flawed approach would be to count how many HTTP GET / requests can be handled by an application server and then report the results as being the server’s throughput capabilities. Obviously, an application server has many other components that are potential bottlenecks. One of the characteristics of a complex system is that it is not possible to know the behavior of the complete system under realistic load. Emulating the environment and its workload is the best way to observe realistic results before they occur in the live network.

The more complex the system, the more difficult it is to completely emulate its behavior in a test environment. Environment and test improvement are ongoing performance engineering activities. Don’t despair if testing doesn’t match the live environment 100%! In spite of this, many bottlenecks will be caught before ever being encountered in production.

The XRP Ledger is a peer-to-peer network that implements the Ripple Protocol. Every peer independently processes transactions submitted to the network. Each node relies on a set of trusted peers, called “validators,” for determining the status of the official ledger. Funds are transferred by way of submitting transactions which are then applied to the ledger. The goal of the testing described in this document is to determine the maximum throughput of a set of validator nodes.

Not all rippled nodes are validators. Some of them expose a RESTful server interface to clients for submitting transactions to the network and for retrieving historical data. These are referred to as “client handlers.” The benchmark environment is comprised of both validating nodes and client handlers as are present in the live network. In addition, a dedicated host is used to generate test load.

The environments used for tests as described in this document have the following characteristics:

Each host is a bare metal x86_64 server running Linux, with the following characteristics: 1 x Intel Xeon 3+ GHz processor with 4 cores and hyperthreading enabled. 32GB RAM SSD storage Gigabit networking

All hosts for each test reside in the same facility on the same LAN.

There are 5 rippled validators on distinct hosts.

Various (4-8) rippled client handler hosts through which transactions are submitted to the network.

1 host for submitting traffic to client handlers.

The workload consists of sending payments of XRP (the crypto-currency native to the platform) between 5,000 accounts. Each second, a fixed number of transactions is submitted with a random sender giving 5 millionths of one XRP to a random receiver. This rate is sustained for twenty minutes. This workload is generated with a custom benchmark harness tool developed by Ripple. XRP is the native currency of the XRP Ledger: each transaction consumes a small amount of XRP as an anti-spam measure. Further, XRP can be used as a bridge currency to facilitate trades between any other asset types represented on the network.

Each test run is aborted by the benchmark software once either of 2 criteria fail:

Ledger stability: The Ripple ledger updates roughly every 4 seconds. The amount of latency for a specific transaction is measured by the time from submission to the time of its inclusion in a new version of the ledger. Success criteria for testing is that the ledger never takes more than 30 seconds between updates. Transaction completion: If over 50% of transactions submitted within just over a minute fail to complete, then the test aborts.

Transaction volume is sustained at each level for 20 minutes, and is ramped up iteratively until either criteria fail. Maximum sustained throughput for each test is the rate prior to that which caused the failure. For example, if the network handled a load of 1500 transactions per second for 20 minutes, but failed after the rate increased to 1600/s, then the maximum sustained throughput is recorded as 1500/s.

Profiling

In this context, profiling involves the collection of any type of statistics pertaining to the system under load. This can include resource utilization statistics such as displayed by top and iostat, custom logging messages from the application, as well as internal application statistics. A specific class of tools referred to as “profilers” is often used to provide internal application statistics, such as identifying the most-used function calls. But a broader meaning of profiling refers to collecting any performance-related facts occurring while the system is under load.

Here are some recommendations about profiling:

Gather only what you think you need as you need it. If, while analyzing data, you discover that you need more data for analysis, then set up collection and run another test.

Beware the observer effect. Many performance tools, especially profilers, degrade performance of the application--especially under high load. Data gathered under high load is the most important for diagnosing bottlenecks, but if the data collection itself significantly limits throughput, then it tends to hide the real cause.

For profiling the XRP Ledger, basic operation system utilities top and iostat run on each host in the environment during each test. To isolate bottlenecks, rippled is instrumented with code that varies from simple log messaging all the way to a custom profiler. The main benefit of custom profiling code is that it provides precise control over what is reported based on specific characteristics of the application. The drawback is the effort required to implement and maintain the custom code. This is typical of “build vs buy/download” decisions in software development.

Profiling code built into rippled has suffered bit rot because it has not been used for over a year. There are plans to revive this code in the near future so that it can help with current rippled versions. But it is referred to here to illustrate its previous use in helping to scale rippled. The specific code used for profiling rippled is here: https://github.com/mtrippled/rippled/tree/bm. It was never merged into the main line of development, and so has never been implemented into production. A major benefit of this custom profiler is that it reports the time spent in functions that are called when a specific mutex is held. This lock is referred to as the “master lock,” and it tends to be a bottleneck to scalability. As mentioned previously, each rippled peer processes all transactions independently of its peers. Internally, the ledger is a monolithic data structure: it only allows one writer at a time to modify it. As such, when hardware resources such as CPU, memory, disk I/O and network are sufficient, the work of modifying the ledger data structure tends to be a limiting factor.

The custom profiling code uses a dedicated thread to periodically output JSON-formatted trace entries to a log file. Trace instrumentation is a powerful way to isolate bottlenecks. Depending on object lifetime, Trace objects can record any number of events. Each object is instantiated with a name and an optional counter. The name and counter are recorded in the first entry. Entries are ordered by timestamp to ensure sensible output. Trace objects that stay local to a function can be created on the stack. However, if passed to other functions or if placed in containers, shared pointers make the most sense. This entails function signatures being modified for each function to be called with a trace object. Functions with some callers that do not have trace objects can be overloaded to simply pass a default-constructed shared pointer of PerfTrace. Functions to add events to the object will only do so if the object exists. Timers are a special type of trace activity comprised of two events: a beginning and end. They are associated by the name. For proper time calculation, the name ending a timer must correspond to that of the beginning of the timer. What this all allows is for any particular code path to be instrumented such that any function calls in the code path can be timed to microsecond granularity. Each specific invocation of the code path’s Trace output is then logged. This is in contrast to most off-the-shelf profilers which aggregate function call usage over time, but comes at the cost of increased development effort.

Here is sample output showing a partial PerfTrace object:

{ "events": [ { "counter": "22", "name": "masterlock", "tid": 18652, "time": "2015-11-07T09:27:59.848294Z", "type": "generic" }, { "counter": "0", "name": "modify", "tid": 18652, "time": "2015-11-07T09:27:59.848294Z", "type": "start" }, { "counter": "0", "name": "modify lock", "tid": 18652, "time": "2015-11-07T09:27:59.848296Z", "type": "start" }, { "counter": "0", "name": "instantiate OpenView", "tid": 18652, "time": "2015-11-07T09:27:59.848297Z", "type": "start" }, { "counter": "1870", "name": "instantiate OpenView", "tid": 18652, "time": "2015-11-07T09:27:59.850167Z", "type": "end" }, ...

All of the entries in the list correlate to the first entry, which is the 22nd instance in the code of the master lock being acquired. The specific thread id 18652. As the code path progresses, several functions are invoked with timers. For example, the “instantiate OpenView” function took 1870us. This particular trace object contains many more entries that correspond to instrumented activities during this particular acquisition of the master lock, but the output was truncated for brevity..

This profiling has proven to be very light-weight from a resource standpoint: it has not introduced the observer effect.

Identify Actionable Bottleneck

This is the most important phase: if nothing practical can be implemented, then no progress should be expected. The bottleneck metaphor is useful in this case: the narrowest part of a bottle (the neck) slows down the flow of fluid out of the mouth. Widening the neck so that the entire bottle becomes a cylinder will increase flow. While this is a useful metaphor, complex systems typically do not have a single component limiting performance. A more precise description of this phase is to identify a component, such as a hardware resource or function call, that if made more efficient, will increase throughput the greatest extent. This is an imprecise exercise. It involves informed guesswork.

Here are some suggestions to help make this phase successful:

Only move forward if there if the proposed fix is plausibly supported by available evidence, i.e. don’t grasp at straws. Very often, people treat this as an exercise to come up with solutions based on a very limited set of facts. This is often wasteful of time. Instead, if there are insufficient facts to support a hypothesis, then modify the test and collect more data until a more complete picture emerges.

Address the easiest things first: system resources. If throughput limits correspond with hardware resource limits, such as CPU, memory, disk IOPS, or networking throughput or latency, then add more resources until there is plenty of headroom even when the workload saturates the application. If the application scales this way indefinitely, then this becomes a capacity planning exercise. For applications that scale horizontally and make efficient use of system hardware, profiling can be used to identify how to make more efficient use of the hardware.

Transaction processing systems tend to not be limited by system resources: instead, they bottleneck on concurrent access to data. In this case, merely decreasing system resource utilization in general is not very effective. Instead, work on minimizing contention for the shared resource. With rippled, as mentioned, this shared resource is the ledger data structure protected by the master lock mutex.

Example: qalloc is good!

Here is an example of how this process was used to improve performance of the XRP Ledger. The work described took place in November, 2015. At the time, peak sustained throughput was about 500 transactions per second. There was plenty of CPU and other system resource headroom, so custom profiling described above was used to look for bottlenecks in the master lock.

Here is a report showing statistics for the functions taking the most time while holding the master lock. The test input was 500/s, which was close to the saturation point. The saturation point is where the application bottlenecks tend to become the most pronounced:

function % rate/s avg(us) median stdev min max modify 94.0% 229.5 3110 2773 2065 738 109011 copy OpenView 47.5% 229.5 1570 1481 661 388 16517 replace OpenView 30.6% 229.5 1011 880 1275 1 105618 OpenView.items_ 30.0% 229.5 992 931 392 265 6345 OpenView.txs_ 17.1% 229.5 566 525 330 117 14159 modify callback 15.7% 229.5 520 288 956 81 48320 ripple::apply 15.6% 487.0 243 127 570 12 48046 doApply 12.3% 487.0 192 103 442 0 33350

While not explicit from this report, each of the functions described are part of the call tree from the “modify” function at the top of the list. rippled applies transactions in batches while holding the lock. During the sample, there were 229.5/s transaction batches applied. Each averaged 3110us. There were 487/s individual transactions applied, which took 243us each.

Upon review, the rippled team found inefficiencies in the memory allocator used for data structures used within the most expensive functions. A new allocator (qalloc) was coded up and then tested. Here’s the profile:

function % rate/s avg(us) median stdev min max modify 89.8% 352.5 1364 1234 710 497 38636 copy OpenView 50.9% 352.5 774 697 400 261 29677 OpenView.items_ 32.8% 352.5 498 457 277 181 29285 replace OpenView 23.2% 352.5 352 312 287 1 35876 OpenView.txs_ 17.3% 352.5 262 222 194 72 4820 modify callback 15.0% 352.5 228 180 374 12 20698 ripple::apply 14.7% 498.2 158 146 64 10 1760 doApply 12.2% 498.2 131 122 53 1 1407

Success! All of the functions took less time to complete and the number of batches increased. Per-transaction rate was about the same because the input rate for the test stayed the same. There was about 33% decrease in the overall time spent modifying the ledger! Here is the resulting commit into rippled: Use qalloc for RawStateTable and OpenView::txs_.

10x+ Throughput Chronology

When performance testing began in February of 2015, the XRP Ledger sustained 80 transactions per second. Today, it’s up to 1500. The initial design of rippled was for scalability, and the underlying architecture has remained the same. Incremental improvement has been made over time as opportunities have been uncovered.

2/25/2015: 80/s

At the time, transaction signature verification was being performed within the master lock. This was turned off in a special test-only branch.

At the time, transaction signature verification was being performed within the master lock. This was turned off in a special test-only branch. 3/2/2015: 120/s

Disabling signature verification (for testing purposes only) provided a significant throughput increase. The next area for improvement discovered was that transactions were being applied to the ledger one at a time, incurring expensive operations to freeze and unfreeze the ledger. The proposed fix was to apply transactions in batches, to incur the expensive operations less frequently.

Disabling signature verification (for testing purposes only) provided a significant throughput increase. The next area for improvement discovered was that transactions were being applied to the ledger one at a time, incurring expensive operations to freeze and unfreeze the ledger. The proposed fix was to apply transactions in batches, to incur the expensive operations less frequently. 4/6/2015: 490/s

Transaction batching provided very significant performance gains--about 4-fold! Also, signature verification was moved outside of the master lock, allowing that activity to be distributed across multiple processing threads.

Transaction batching provided very significant performance gains--about 4-fold! Also, signature verification was moved outside of the master lock, allowing that activity to be distributed across multiple processing threads. 6/18/2015: 150/s

This was a significant regression! Investigation eventually revealed that the cause of the regression resulted from a change to how the ledger was modified. This change was intended to increase the efficiency of ledger modifications, and it did on a per-transaction basis. But, transaction batching as implemented previously was not implemented in the new code!

This was a significant regression! Investigation eventually revealed that the cause of the regression resulted from a change to how the ledger was modified. This change was intended to increase the efficiency of ledger modifications, and it did on a per-transaction basis. But, transaction batching as implemented previously was not implemented in the new code! 10/16/2015: 525/s

Re-implementing transaction batching brought throughput back up to slightly higher levels than previously. Investigation revealed the next bottleneck to be in submitting transactions to the network: the client handlers could not saturate the network before they bogged down.

Re-implementing transaction batching brought throughput back up to slightly higher levels than previously. Investigation revealed the next bottleneck to be in submitting transactions to the network: the client handlers could not saturate the network before they bogged down. 10/13/2016: 1100/s

The next performance test happened a year later, and throughput doubled! The most likely explanation is that coroutines were implemented in the functions that submit transactions to the network. This reduced blocking in those functions.

The next performance test happened a year later, and throughput doubled! The most likely explanation is that coroutines were implemented in the functions that submit transactions to the network. This reduced blocking in those functions. 7/12/2017: 1500/s

Performance testing had resumed but on an automated daily basis. The peak sustained is now 1500/s!

Summary

Performance engineering is about process. The heart of it is the scientific method. Reasoning in the abstract is insufficient to scale complex systems: inevitably, flaws in the system emerge and become noticeable only under load. The process described in this document can help any engineering team to get a head start on performance problems before they are noticed by users. Techniques described here are particularly useful for transaction processing systems because those tend to not scale simply by adding hardware. Ripple has adopted these techniques to help it become the most scalable payments-oriented blockchain network, and looks forward to continued improvement.

On HackerNews