The HHVM team has concluded its first ever open source performance lockdown, and we’re very excited to share the results with you. During our two week lockdown, we’ve made strides optimizing builtin functions, dynamic properties, string concatenation, and the file cache. In addition to improving HHVM, we also looked for places in the open source frameworks where we could contribute patches that would benefit all engines. Our efforts centered around maximizing requests per second (RPS) with Wordpress, Drupal 7, and MediaWiki, using our oss-performance benchmarking tool.

Summary

During lockdown we achieved a 19.4% RPS improvement for MediaWiki workloads, and a 1.8% RPS improvement for Wordpress. We demonstrated that HHVM is 55.5% faster than PHP 7 on a MediaWiki workload, 18.7% faster on a Wordpress workload, and 10.2% faster on a Drupal 7 workload. Improvements made to HHVM to better serve open source frameworks will ship with the next release. As a part of our lockdown effort a patch showing a promising performance improvement for all PHP engines was submitted to MediaWiki. The raw data, configuration settings, and summary statistics are available here.

Lockdowns are always a great opportunity to get the whole community involved, and we’re thrilled with the participation this time around. Our #hhvm and #hhvm-dev channels on Freenode were quite active and a number of contributors even chimed in on GitHub issues. Special thanks to Christian Sieber for contributing support for Drupal 8 to our benchmarking tool.

Throughout the lockdown we have made improvements to our benchmarking tools, and optimized our configuration. We’re pleased to report that these frameworks all ran successfully using the high performance RepoAuthoritative compilation mode, and file-cache we deploy with at Facebook. New tooling will make it easier to take advantage of these powerful tools in the upcoming 3.8 release.

Methodology

While no benchmark will ever perfectly capture the performance profiles of the disparate live sites it seeks to approximate, great care was taken in constructing this tool. We feel that it’s important to carefully explain the decisions we’ve made and the effects they’ve had on performance. Hopefully these notes will shed light on the numbers that we’ve shared, and help others to configure their benchmarking suites or to optimize their sites. Inquiries regarding the benchmarking methodology can be made directly on our oss-performance repository; pull requests for new features and fixes are also most welcome.

Our hardware setup featured Mac Pro computers running a minimal installation of Ubuntu 14.04 (no virtualization or other layers which would add uncontrolled variables). We chose the Mac Pro because, while being convenient to work with, they also contain components that are representative of hosting servers, including Xeon processors, an abundance of RAM, and flash storage. They are also readily available in an identical configuration (see notes for details). The configuration offered ample RAM to avoid the possibility of swapping, and an SSD to reduce the effect of latency reading from disk. The minimalistic software installation ensured that results were consistent between runs and that benchmarks were run in near isolation.

For HHVM, we chose to run in RepoAuthoritative mode, using the file cache to construct a virtual in-memory file system, and with the proxygen webserver. This setup closely mirrors the configuration we use to run Facebook, and has been highly optimized. Traffic was proxied through an nginx webserver to allow for a fair comparison with PHP running FastCGI through nginx. We chose a high thread-count, as many frameworks were I/O-bound. For a CPU-bound application, we generally recommend a thread-count no higher than twice the number of available cores.
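The thread-count rule of thumb for CPU-bound applications can be written down as a small helper. This is only an illustrative sketch: the function name `suggested_thread_cap` and its parameters are hypothetical and are not part of HHVM or the oss-performance tool.

```python
import os


def suggested_thread_cap(cores=None, per_core=2):
    """Upper bound on worker threads for a CPU-bound application.

    Applies the rule of thumb of at most two threads per available core.
    I/O-bound workloads may justify a higher count than this.
    """
    if cores is None:
        # os.cpu_count() reports logical CPUs and may return None
        # when the count cannot be determined.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return per_core * cores


# A 6-core machine gives a cap of 12 threads under this rule.
print(suggested_thread_cap(cores=6))
```

Treat the result as a ceiling to tune below, not a recommended setting; the I/O-bound frameworks benchmarked here were run with a deliberately higher count.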

Our PHP 5 and 7 setups were quite similar; we enabled the opcache, and chose settings tuned for performance. For all engines tested, error reporting and logging were disabled to limit log spew and maximize performance. In production we generally favor a sampling approach that allows for data to still be collected about warnings and notices. For benchmarking, we felt no logging was most appropriate.

The benchmarking tool performs a sanity check once the engine and webserver have started accepting traffic to ensure that the framework is sending reasonable responses on the URLs being benchmarked. We also collect a count of the various HTTP response codes, average bytes received, and failed requests. This data is manually inspected after data collection is complete to ensure that the results from the server are reasonable. Note that some of the URLs benchmarked return non-200 response codes to approximate a realistic distribution of requests to a live server. For all results presented here this sanity check information is available in the raw JSON data provided in the notes.
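As a rough illustration of the kind of summary inspected after a run, the sketch below tallies response codes and average bytes received. It is not the oss-performance implementation; the function name `summarize_responses`, the input shape, and the sample values are all assumptions made up for this example.

```python
from collections import Counter


def summarize_responses(responses):
    """Tally status codes and average body size.

    `responses` is a list of (status_code, bytes_received) pairs.
    """
    codes = Counter(status for status, _ in responses)
    total_bytes = sum(size for _, size in responses)
    average_bytes = total_bytes / len(responses) if responses else 0.0
    return {"codes": dict(codes), "average_bytes": average_bytes}


# Made-up sample data, not taken from the published results.
sample = [(200, 5120), (200, 4096), (404, 512), (301, 0)]
summary = summarize_responses(sample)
print(summary)
```

A summary like this makes it easy to spot a run where, for example, most responses were unexpectedly small or the share of error codes changed between engines.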

Each framework we benchmarked was configured with a sample dataset designed to approximate an average installation. For Wordpress and Drupal 7, bundled tools were used to construct demo sites for benchmarking. MediaWiki was benchmarked using the Barack Obama page from Wikipedia, which an engineer from the Wikimedia Foundation recommended as representative of their load. For Wordpress, the URLs queried were based on data extracted from the hhvm.com access logs. Drupal 7 query URLs mirrored similar access patterns to Wordpress, as access logs from live sites were not readily available. The MediaWiki URL list was generated to stress the Barack Obama page.

Drupal 8 was benchmarked in a manner similar to Drupal 7, including the use of bundled tooling to generate the sample site. A consensus has not developed as to whether the page cache should be used when measuring Drupal 8 performance so for our benchmarking suite we support cached and uncached Drupal 8 as separate targets. An additional setup step was performed to pre-populate Drupal 8 Twig templates so that benchmarking could be performed in RepoAuthoritative mode. This is an ahead of time optimization any sufficiently large deployment of this framework would be likely to benefit from. For both versions of Drupal a small number of additional tweaks were carried out to make the sample data more realistic. The details of these changes as well as details about the configurations of the various other frameworks are available on our oss-performance repository.

When running each engine, we sent both a set of single threaded and a set of concurrent warmup requests to each site before we began measuring performance data. We did this to allow hardware caches to warm up, PHP’s opcache to fill, and HHVM’s JIT compilation to complete. The number and duration of these warmup periods was chosen by examining performance profiles to look for steady state and monitoring the size and growth of the HHVM translation cache to ensure that compilation was largely completed. As we decided to run HHVM in RepoAuthoritative mode, a setup step was also included to build and optimize the bytecode repository using whole-program analysis, and construct the static content portion of the file-cache for efficient in-memory file system access.

Benchmarking requests were sent from 200 concurrent users, using the siege benchmarking tool. A high number of concurrent users best approximated the maximum possible RPS of a server under high load. Ideally, concurrency should be the highest number of users possible before requests begin to queue. As this was not possible with siege, a high number was selected to ensure reliable RPS data. Note that as a consequence of this decision the response time metric now includes time spent queued. A realistic load balancer would prevent such queueing from occurring on machines serving live traffic.

We discovered that, by default, MediaWiki will store a view counter in MySQL, as well as a cache of each translatable string on any given page. Any large-scale deployment of MediaWiki would need to disable these options; so we’ve specifically turned them off. We’ve confirmed that these settings are also disabled in production for Wikipedia. A patch has been submitted to MediaWiki to cache translations more efficiently. The view counters have been deprecated, and will likely be removed entirely in a future release. These particular inefficiencies were discovered by analyzing the queries sent to MySQL by MediaWiki during a single request.

In our initial testing with Drupal 7, we noticed that every request was triggering a scan of the Drupal document root, making the is_dir(), opendir(), and readdir() functions incredibly hot. It has since been brought to our attention that a known bug in Drupal was causing the extension used to generate the sample data to be treated as missing, thus triggering a complete filesystem scan for each request. Currently the only fix for this issue is to manually remove references to the uninstalled extension from the database. After patching the database, time spent accessing the filesystem drops substantially and the resulting site behaves identically to a fresh installation of Drupal 7 with manually entered data.

Results

Our results compare pre- and post-lockdown HHVM, and separately PHP 5, PHP 7, and post-lockdown HHVM. We used PHP 5.6.9, along with commits from PHP master and HHVM master (see notes below). For the pre- and post-lockdown comparison, stable releases of MediaWiki, Drupal 7, and Wordpress were used; post-lockdown numbers include a patch to MediaWiki that was written during the lockdown. The second set of comparisons uses the same stable releases, with the MediaWiki patch applied for all engines. Data for each engine and framework pair was collected in ten independent runs, and RPS was quantified.

We observed low variability between runs and have a great deal of confidence in the reproducibility of these results. The raw output of each run in JSON form has been made available in the notes below along with the batch run settings used to configure the benchmark tool. The canonical field in the output indicates whether non-standard options were passed during the test run. The only non-standard option passed during our test runs was to apply the MediaWiki patch. All passed options are available in the configuration JSON.

In the lockdown, we were able to improve MediaWiki performance by 19.4%, and Wordpress by 1.8%. Unfortunately, Drupal wins were no longer measurable once the Drupal database was patched to fix the aforementioned plugin bug. In addition, we increased RPS for simple pages by 5.2% (this was mostly a measure of the overhead incurred by a request). The lockdown offered an opportunity to evaluate our methodology, and work with the community on realistic configurations for the sample data used in our benchmarking. A number of important changes to the benchmarking tools and framework configurations were made throughout the course of the lockdown. The results of the lockdown are summarized below, normalized to pre-lockdown RPS numbers. Drupal 8 was not part of our lockdown and is therefore not measured here.

Across the board we found that HHVM performs best on applications where CPU time is maximized. In particular we’re 1.55 times faster than PHP 7 on a MediaWiki workload, 1.1 times faster on a Drupal 7 workload, and 1.19 times faster on a Wordpress workload. As none of these frameworks take advantage of the asynchronous I/O architecture available in HHVM (i.e., async), it’s not surprising that the greatest performance benefits come from the efficient execution of PHP code possible with a JIT compiler. The following figure summarizes the performance difference between PHP 5, PHP 7, and HHVM. Results were normalized to PHP 5 RPS, and Drupal 8 has been included. We benchmarked Drupal 8 with caching both enabled and disabled. In general the results for Drupal 8 were more stable with the cache disabled.

For all benchmarks we performed ten independent runs and used the mean RPS result. We also measured standard deviation (see error bars). Standard deviation was highly consistent between runs, and we’re confident that with the proper configuration and hardware these results should be easy to reproduce.
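The summary statistics described above (mean RPS over ten runs, standard deviation, and normalization to a baseline) can be sketched as follows. The RPS values are hypothetical placeholders, not our results, and the use of the sample standard deviation is an assumption; consult the published summary statistics for the actual figures.

```python
from statistics import mean, stdev


def summarize_runs(rps_runs):
    """Return (mean, sample standard deviation) of per-run RPS values."""
    return mean(rps_runs), stdev(rps_runs)


def normalize(value, baseline):
    """Express an RPS figure relative to a baseline, so baseline == 1.0."""
    return value / baseline


# Hypothetical RPS values for ten runs of two engines; not real results.
baseline_runs = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
candidate_runs = [150, 152, 148, 151, 149, 150, 153, 147, 150, 150]

baseline_mean, baseline_sd = summarize_runs(baseline_runs)
candidate_mean, candidate_sd = summarize_runs(candidate_runs)

# Normalized RPS of the candidate engine relative to the baseline engine.
relative = normalize(candidate_mean, baseline_mean)
print(relative, baseline_sd, candidate_sd)
```

A normalized value of 1.5 in this made-up example would read as "1.5 times the baseline RPS", which is the same convention used when results are normalized to PHP 5 or to pre-lockdown HHVM.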

As an exercise, we evaluated the benefits of async MySQL in the Wordpress environment. By modifying portions of Wordpress to take advantage of the async capabilities offered by Hack and HHVM, we were able to examine the potential for performance gains through async execution. In our test environment we separated the MySQL and PHP hosting to separate machines within the same datacenter to approximate a realistic Wordpress stack. The introduction of asynchronous query execution can demonstrate performance gains in both RPS and response time. We’ll be writing separately about this in the near future.

Lockdown Optimizations

By sharing the details of some select successes and failures from lockdown we’d like to provide a window into the work we’ve been doing and offer a guide to anyone considering working on performance related patches for HHVM. Lockdown issues are available on the HHVM GitHub repository and have been tagged with “lockdown.” They include our running commentary and notes about our findings as we implemented and measured them.

The issues we focused on fell broadly into several categories: builtin functions, extensions, function dispatch, memory model, JIT compilation, and framework specific patches. In addition to exploring these optimizations we also experimented with introducing async into Wordpress to measure its benefit.

The optimizations to builtin functions included get_object_vars() (#5287), implode() (#5289), function_exists() (#5288), and defined() (#5290). The first two optimizations were quite fruitful for both Wordpress and MediaWiki. We had previously predicted that function_exists() and defined() would be good targets for Drupal 7. Unfortunately defined() was not hot enough to have a measurable performance effect, and function_exists() was primarily called with non-static strings making it difficult to optimize.

We also looked at optimizing string concatenation in the JIT using a specialized bytecode for multi-string concatenations (#5304). Although this is a valuable optimization, we were unable to measure a performance benefit in any of these frameworks. The majority of multi-string concatenations were in places that retained a reference to the internally joined strings, which prevented an important feature of the optimization: reuse of consumed string buffers. A further potential JIT optimization was object destruction (#5281). Experimentation led us to the conclusion that it would not be worthwhile to explore changes to the object destruction path. Building a non-refcounted garbage-collector is an ongoing HHVM project and may offer more substantial performance benefits in this area.

Several important enhancements to builtin function dispatch were made (#5276, #5277) and several more have been planned (#5267, #5292). The optimizations now completed have allowed for faster inline dispatch to functions with variadics and improved support for static method dispatch for builtin classes. These changes have already shown measurable performance gains in Wordpress, and we expect a tangible benefit from the remaining patches once they are ready to be merged.

Currently we build using libpcre despite the availability of libpcre2. We decided to see if this new pcre library was any more performant (#5302). Unfortunately the results here were not encouraging. Further discussions with the maintainers of libpcre2 have led us to conclude that the changes were largely surrounding the API and that optimizing performance was not a goal. In the future we may explore the optional use of a different regular expression processing library for compatible expressions (though to remain feature complete we will always need to provide libpcre as a fallback).

Profiling has demonstrated that dynamic property access is a hot code-path for Wordpress, therefore we worked to optimize dynamic properties and object cloning for objects with such properties (#5285, #5286, #5287). In HHVM dynamic properties are stored in arrays, which had previously been eagerly copied during calls to clone() and get_object_vars(). By allowing copy-on-write behavior for these property arrays, similar to the behavior of regular PHP arrays, we measured performance improvements in Wordpress.
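To make the copy-on-write idea concrete, here is a toy model of a property container that shares its storage on clone and only copies on the first write. It is a conceptual sketch only: HHVM's real implementation lives in the engine itself and is not shown here, and the class name `CowProps` and the simple "shared" flag are assumptions invented for this illustration.

```python
class CowProps:
    """Toy copy-on-write container for an object's dynamic properties."""

    def __init__(self, props=None):
        self._props = props if props is not None else {}
        # True while another holder may reference the same dict.
        self._shared = False

    def clone(self):
        """Return a copy that shares storage until either side writes."""
        other = CowProps(self._props)
        self._shared = True
        other._shared = True
        return other

    def get(self, name):
        return self._props[name]

    def set(self, name, value):
        if self._shared:
            # First write after sharing: take a private copy now,
            # instead of copying eagerly at clone time.
            self._props = dict(self._props)
            self._shared = False
        self._props[name] = value


original = CowProps({"title": "Hello"})
copy = original.clone()           # no dict copy happens here
copy.set("title", "Changed")      # the copy is made on this write
print(original.get("title"), copy.get("title"))
```

The flag is deliberately conservative: a holder may occasionally copy when it is already the only owner, but it never writes into storage that another holder can still see. Clones that are only read never pay for a copy at all, which is where the saving comes from.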

Some of our largest wins have come from examining the frameworks themselves. In particular, reconfiguring MediaWiki to move translatable string caching out of MySQL showed a 14% performance win using HHVM and a 27% win when using PHP 7. We created an alternative implementation that improved RPS by an additional 22% and 5% respectively, bringing the total improvement for HHVM and PHP 7 to 39% and 33%. We’re happy to be able to contribute back to another open source project and are working to upstream these changes.
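The totals quoted above are consistent with the two gains being applied one after the other, so that they multiply. That reading is an assumption about how the totals were derived; the sketch below checks it against the HHVM figures, using a hypothetical helper named `combined_gain`.

```python
def combined_gain(first, second):
    """Compose two successive fractional RPS gains multiplicatively."""
    return (1 + first) * (1 + second) - 1


# HHVM figures from the text: 14% from the reconfiguration, then 22% more.
hhvm_total = combined_gain(0.14, 0.22)
print(f"{hhvm_total:.0%}")  # prints 39%
```

Simply adding the two percentages would give 36% for HHVM, which is why successive improvements are better compared as ratios of RPS.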

There are some notable wins that disappeared as we tuned our benchmarking methodology and patched the frameworks themselves. For MediaWiki optimizing str_replace() was a major win, but the hot-path for str_replace() was elided in the subsequent patch to fix localization caching. Likewise optimizations to the file cache showed a great deal of promise for Drupal 7 (#5284) until it was discovered that the document tree scanning behavior of Drupal was a bug, which we were able to fix with a database change.

Work continues on a number of these changes (#5264, #5269, #5270, #5299), and others have been completed post-lockdown (#5300, #5279). Of the notable post-lockdown completions, work to speed up memcpy showed major performance gains for our internal workload.

We’re all quite thrilled with the progress made during the lockdown and the lessons learned. In particular, we feel that we’ve been able to validate the importance of both JIT compilation and asynchronous execution for optimizing PHP performance. From Amdahl’s law, we know that any attempt to optimize execution efficiency of PHP will be undercut by I/O-bound applications. Async offers us the opportunity to shift latency back to the CPU by improving I/O performance, increasing the importance of efficient PHP execution through JIT compilation. By combining async and JIT compilation, we’re able to run applications such as Facebook at scale.
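The Amdahl's law argument can be made concrete with a short calculation. The sketch below assumes a request's time splits into a CPU-bound fraction that a faster engine can speed up and a remainder (such as I/O wait) that it cannot; the fractions and the 2x engine speedup are hypothetical values chosen for illustration, not measurements.

```python
def amdahl_speedup(cpu_fraction, cpu_speedup):
    """Overall speedup when only the CPU-bound fraction gets faster.

    cpu_fraction: share of request time spent executing PHP (0.0 to 1.0).
    cpu_speedup:  factor by which that share is accelerated.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


# Hypothetical workloads: the same 2x faster engine helps far more
# when a larger share of each request is spent on the CPU.
for fraction in (0.2, 0.5, 0.9):
    print(fraction, round(amdahl_speedup(fraction, 2.0), 2))
```

Under these assumptions, reducing time spent waiting on I/O raises the CPU-bound fraction, which in turn raises the payoff from more efficient PHP execution.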

Technical Notes

Frameworks used - Collected statistics came from stable releases and betas of several popular frameworks. For Drupal we measured 7.31 and 8.0.0-beta11. For MediaWiki and Wordpress, versions 1.24.0 and 4.2.0 were used, respectively. Data for Wordpress was generated using the demo-data-creator version 1.3.2. The patch developed for MediaWiki and used in cross-engine comparisons and with post-lockdown HHVM is available on the oss-performance repository. Update: this has now been merged into MediaWiki.

Hardware details - Benchmark statistics were collected on a late 2013 Mac Pro (MD878LL/A). The machine features a 6-core 3.5 GHz Xeon processor with 64 GB of RAM and a 512 GB SSD.

Siege - When benchmarking frameworks we used Siege 2.78, as all versions of Siege 3 currently have known problems. In particular, Siege 3.0.0 to 3.0.7 send incorrect Host headers to ports other than 80 and 443, and Siege 3.0.8 and 3.0.9 occasionally send bad paths due to incorrect redirect handling.

Nginx - All requests were proxied through nginx. This was required as Siege is unable to talk directly to a FastCGI server. We used nginx 1.4.6. The configuration settings for nginx are available in the oss-performance repository.

Engines Benchmarked - To compare pre- and post-lockdown performance of HHVM we measured builds of HHVM master from May 11th and May 22nd. For PHP/HHVM comparisons, post-lockdown HHVM, PHP stable release 5.6.9, and PHP 7 master from May 22nd were used.

Result Details - For all data used here we’ve made available the raw JSON output of each run of the oss-performance tool and the settings we used. We strongly recommend that anyone wishing to release results from this tool make this diagnostic data available. We are also making available the summary statistics computed across runs.

Response Time - We haven’t included response times in this report as lockdown optimizations focused on RPS. Response time is a more important metric for many smaller sites and the numbers we collected are available in the raw data. We hope to make improvements to the benchmarking tool to generate concurrency/response time and concurrency/RPS curves.

SugarCRM - There has been some interest in the SugarCRM performance of the various PHP engines. We were unable to provide these numbers as we could not find a recent version of PHP 7 that could execute SugarCRM without encountering a segmentation fault. Hopefully it will be possible to collect these numbers in the future as PHP 7 becomes more mature.
