Context

/live/code_path/

/canary/code_path/

[Wed May 01 14:36:33 2013] ... PHP Fatal error: Cannot redeclare class Common in /canary/code_path/ infrastructure/Common.php on line 9 [Wed May 01 14:36:43 2013] ... PHP Fatal error: Cannot redeclare class Common in /live/code_path/ infrastructure/Common.php on line 9

Debugging

$ ab --cookie cn=0 https://www.box.com/sync/enabled $ ab --cookie cn=99 https://www.box.com/sync/enabled

Instrumentation

/live/code_path/

/canary/code_path/

[Thu May 02 18:10:51 2013] ... PHP Fatal error: Cannot redeclare class Common in /live/code_path/ infrastructure/Common.php on line 9 [Thu May 02 18:10:51 2013] ... PHP Stack trace: [Thu May 02 18:10:51 2013] ... PHP 1. {main}() /canary/code_path/ index.php:0 [Thu May 02 18:10:51 2013] ... PHP 2. require_once() /canary/code_path/ index.php:44 [Thu May 02 18:10:51 2013] ... PHP 3. standard_application_requires_and_autoloader_setup() /canary/code_path/ infrastructure/runtime.php:2611 ... [Thu May 02 18:11:11 2013] ... PHP Fatal error: Cannot redeclare class Common in /canary/code_path/ infrastructure/Common.php on line 9 [Thu May 02 18:11:11 2013] ... PHP Stack trace: [Thu May 02 18:11:11 2013] ... PHP 1. {main}() /live/code_path/ index.php:0 [Thu May 02 18:11:11 2013] ... PHP 2. require_once() /live/code_path/ index.php:44 [Thu May 02 18:11:11 2013] ... PHP 3. standard_application_requires_and_autoloader_setup() /live/code_path/ infrastructure/runtime.php:2611

Experimentation

Solving

Epilogue

Lessons

References

Your code has a bug, but it disappears when you investigate! The bad behavior only occurs when you're not looking. Meet the Heisenbug - a bug that changes behavior when you attempt to observe it. Because of this, it's impossible to find the root cause ...or is it?Our Heisenbug surfaced during the last stage of the Canary project . The Canary project changed our daily code release process; no longer would we switch all traffic to use new code at once. Instead, an increasing percent of traffic would hit the new codebase, while the remainder would hit the old codebase. The problem: During the initial Canary interval, a fraction of requests (0.5%) resulted in a 500 Internal Server Error. The error reported by PHP was, but there was no duplicate include for this class in either codebase. There are two paths for webapp code on each frontend machine, one for Live code and the other for Canary code. To simplify, let's say Live traffic is served fromand Canary traffic is served from. We saw the error occur in both paths:Once we were convinced there was a problem, finding a reproducible case was essential. We mined logs and observed that /sync/enabled was one of the simplest requests that had returned a 500 in production. We then prepared two commands:"ab" is ApacheBench , a tool for stress testing and benchmarking webservers. The cookie "cn=0" causes a given request to hit the Canary codebase, while "cn=99" directs a request to the Live codebase (Note: doing a Canary deployment changes the switching threshold over time) We ran these two commands simultaneously with options "-n 400 -c 4". This fetched the URL 400 times, in 4 separate threads. This test case consistently reproduced the problem; approximately 2% of the 800 requests would fail with a 500. Much better than the "in-the-wild" rate of 0.5%, and reproducible under controlled conditions. Failures were seen from requests using either cookie, just like in production.A stacktrace of a Fatal error revealed that execution hadwhile importing code. Instead of importing "infrastructure/Common.php," an import would occur for "infrastructure/Common.php," or vice-versa. Later, another import of Common would occur using the other codebase. This caused the fatal error to be thrown. Here are example traces from two requests:So we instrumented the code with log statements .... and the problem went away. Not reproducible. We removed the instrumentation .... the problem was back. We removed some of the instrumentation .... and sometimes the problem was reproducible, and sometimes it wasn't. This was mind-boggling We had suspects. Maybe Apache was choosing the wrong codebase, based on the cookie. Maybe the environment wasn't correct. But a strong suspect was XCache, our PHP Opcode cacher - it was one thing both codebases shared. So, we researched XCache bug reports and documentation. We picked up clues. We found a description of a new feature regarding de-duplication in XCache when source files were identical. We also found one user reporting the same "cannot redeclare class" errors from a Drupal deployment in a shared hosting environment. While initially scratching our heads at the disappearing / reappearing nature of the bug, we were also trying to narrow it down. We stripped as much as possible out of the web app. Eventually we had a* of the system that would reproduce our problem. We increased ApacheBench threads to 10, and requests to 4000. The problem would then reproduce on about 20% of the requests. At this point, we were fairly sure a race condition was causing the problem. One of the three files in our subset (definitions.php) defined many named constants at runtime. These included an absolute path named BASE_SOURCE_PATH used to import classes. This led to an experiment: 1. Make definitions.php byte-for-byte identical in the two codebases. Problem reproduces. 2. Change this file in one codebase (whitespace, comment, or instrumentation). Problem doesn't reproduce. 3. Make the exact same change to the other file. Problem reproduces. Finally, things clicked: The definition of BASE_SOURCE_PATH was being constructed using the __FILE__ keyword in PHP. And the value for __FILE__ is; it depends on where the source file is located in the filesystem**. We had found a case where identical source files resulted in different sets of compiled Opcodes. In other words,We emailed the XCache community for help. In the meantime, we wanted to solve this issue and move forward with the Canary project. We needed to prevent XCache from caching Opcodes in such a way that they were shared between Canary and Live codebases, even when the files were byte-for-byte identical. We considered: a) Ensuring the Canary version of definitions.php was always different from the Live version. b) Refactoring the definition of BASE_SOURCE_PATH so it didn't use __FILE__. Both those solutions were really only partial fixes, as they would allow the root problem to come back and bite us later:If magic constants __FILE__ or __DIR__ were used in the future, we would inevitably see Fatals again. We started looking into the internals of XCache, specifically at how the MD5 hash of a source file and its base filename was used as a unique key in the Opcode cache. Multi-tenant sites benefit most from this cache de-duplication feature. E.g., a single host supports JoesDomain and SuesDomain, and both use similar base frameworks. Each site may use its own copy of CommonPackage, but it's possible to conserve space by only caching one set of compiled Opcodes for both PHP files. For Box, it may have saved XCache space during the Canary interval - but the correctness issue made it a no-brainer to avoid this optimization. Patching XCache was fairly straightforward -- instead of using just the base filename, we now provided the absolute file path in the hash calculation.Finally, our differing sets of Opcodes stopped colliding, allowing two codebases to run side-by-side without issues! The first site-wide Canary release occurred on May 18, 2013; the rest is history.We never intend to support our own "custom" version of open source libraries. If we do, we will forever be playing catch-up to new features, security fixes, bug fixes, etc. When we do improve something, we'd like our change to be accepted into the original source in most cases. Since our XCache fix would break the multi-tenant use case, we were careful to make a configuration option to control it, document it, and carefully explain why it was necessary. We submitted our patch, test case, and documentation to XCache's bug tracking site, and called it a night. The next day, we returned to the office to discover the author had rejected our patch, but accepted our test case, and submitted his own bugfix. His fix was elaborate, deep within the infrastructure of XCache. Nothing we could have done without weeks of study would have led us there. In short, XCache works by intercepting calls to the PHP compiler. If XCache lacks a cached entry for a file, it lets the compiler do its work, but saves a copy of the resulting Opcodes for reuse. But if XCachea cached entry for a file, it has to do some "fix-ups" on the Opcode data structure before handing it back to the PHP runtime. The runtime doesn't know it's dealing with a "shared" object - so it adds memory allocation pointers, frees them - whatever it needs to do. As XCache is an "add-on" module, it has to do the work of preparing the cached object for the runtime to play with. So the real bugfix was to apply the proper "fix-up" for the Opcode representing __FILE__. The author thanked us for providing a reproducible test case.Making a clear, reproducible, minimal test case is a great use of your time.Modifying source before fully understanding it is not a good idea. Simultaneous infrastructure changes make debugging/reasoning more difficult: - We were upgrading PHP versions during Canary project development. - Changing PHP versions had also required an upgrade of XCache version. - We had discovered that our memory allocation given to XCache was too small for one copy of our codebase - so we had recently increased it, and increased it again to allow for two codebases active at the same time. - The different environments we were testing had different combinations of XCache version, PHP version, and XCache memory allocation -- as all these changes were in various stages of transition. This created various "red herrings" we had to investigate and rule out along the way. https://cloud.box.com/s/xwg69j5kxw5ywjdp461q (minimal reproducible test case) ** http://php.net/manual/en/language.constants.predefined.php co-authored by David Schnepper