Some Context

The Sport Ngin platform is powered by about two dozen applications of various scale. The largest, oldest and most critical of these applications is called ‘Ngin’. Ngin is a Ruby on Rails application originally written on Ruby 1.8.7 and Rails 1.1. It now consists of 120k lines of code with 859 models. It has an additional 50k lines of automated test code. It depends on 173 gems directly resulting in 299 gems total. You can read about our previous effort to upgrade Ngin to Ruby 1.9.3 here and here. We recently worked through a challenging upgrade to Ruby 2.2. This post will talk about our approach to the large scale upgrade and the bumps along the road.

Red, Green, and QA

The first step was to start a branch and upgrade to Ruby 2 for local development and testing. The Rails server would not start, and there were automated test suite failures. We worked through a few bugs and updated some incompatible gems. We tried to make each patch backwards compatible with Ruby 1.9 so that we could ship them immediately to get it in the mainline branch and on production. The goal was to make the application code changes required at Ruby upgrade time as small as possible to lower the risk of the upgrade. This also allowed us to better code review and QA the compatibility changes going through our development pipeline as bite sized pieces.

Once the application was able to start and we fixed all the test failures, we created a parallel staging environment to run Ngin on Ruby 2. We ran through basic smoke tests on these experimental staging application servers. From there we merged the upgrade into the main staging branch. Our QA team ran through a full regression suite on the staging application servers. We discovered and patched 5 more bugs. This branch and server environment is used daily by the engineering team and thus anything majorly broken gets communicated in Slack pretty quickly. We patched a few more bugs uncovered from regular staging use and started preparing for experimental deployment in production.

Experimental phase

After passing QA in our pre-production environment we wanted a way to further test the upgrade without going all in. We wanted to safely try out Ruby 2 with production traffic patterns and analyze the results. We use HAProxy to load balance a pool of application servers with typically even traffic based on the number of CPUs the servers have. We spent some time extending our Chef cookbooks to allow a parallel OpsWorks stack running Ngin on Ruby 2 to be added to the HAProxy pool with custom weighting. We wrote about a similar approach for our 1.9 upgrade previously using EM Proxy. This allowed us to send small percentages of real production traffic to Ngin on Ruby 2 and analyze New Relic and log files for potential issues and unexpected changes in application behavior. Our plan was to send 2% traffic for a few hours, fix any bugs that would come up, increase it incrementally up to 50%, verify it was stable and then go all out with Ruby 2 servers in our pool. Little did we know, we would be troubleshooting a nasty issue for the next month.

Segmentation Faults and Random Errors

Once we had dipped Ngin on Ruby 2’s toes in the production waters we naturally found a handful of more issues. Ideally, we would always be able to find bugs in staging or by simulating production traffic in staging. However, finding all of the possible issues we would run into in production is next to impossible and was never our goal with pre-production testing. This is a pretty well understood state of complex systems.

Upon discovering a bug we would stop the production traffic to the experimental server, gather data to reproduce the issue, and patch the bug through the normal development workflow. Some of these bugs provided us the opportunity to improve our automated test coverage as well as our QA plans.

At this point we were feeling pretty good as we were able to run our Ruby 2 production experiment for extended periods of time without any Ruby 2 specific bugs popping up. We spent time doing analysis and comparison of metrics for Ngin on Ruby 1.9 vs 2. Everything seemed great: the servers used less CPU; memory usage increased slightly as expected; response time was lower; but wait, the error rate was double!?

A small percentage of errors is normal for Ngin due to it handling custom API and public website requests (and to be fair, we sometimes have a bug in our code). Ngin’s normal error rate is about 0.1% and we were consistently seeing 0.2% on Ruby 2 servers. The newly introduced errors were very strange. The traces pointed all over our codebase and the dependencies with exceptions like ‘undefined method on FalseClass’. The bug was obviously caused by the Ruby upgrade, but how to fix it was unclear from the stack trace and the captured request values. We checked all layers of the system for anomalies and noticed that there were a lot of segmentation fault errors in our web application logs. Comparing to the logs from Ruby 1.9 servers, it was clear they were occurring more often with the upgrade. It seemed obvious the segfaults had to be related to the weird errors and high error rate, but the segfaults didn’t quite line up with the Ruby traces in New Relic.

We felt so close, but the upgrade wasn’t stable enough to roll out to production yet. The error rate was troubling and the segmentation faults were unsettling. The real risk of the issue was realized when a malformed API call was made to another one of our services resulting in a service interruption. Error after error, we were unable to recreate it. We tried locally, in staging and on the experimental production servers. The request that had the error happen in it, would not cause it on its own. Was it a bad server? A bug in Passenger? Was the app process getting in a bad state through a chain of requests? We had lots of theories at this point, but we needed more data and testing to narrow down the issue.

Narrowing the Possibilities

At this point we broadened our perspective on what some of the causes of the issue may be. We analyzed all the differences in our two systems at every layer. Since we merged all 1.9 backward compatible changes for the Ruby 2 upgrade into the mainline branch, our actual changes that were required to happen with the Ruby version change were very small in scope. This left two differences in the systems: the Ruby version and Amazon Linux version (required for Ruby 2 support). This led us to also analyze the web server Nginx, the application server Passenger, Linux Kernel, and Linux system configuration. We created a list of changes to the system to try out in trial and error fashion to narrow down the cause, make the change, put it in the pool with 2% production traffic, and review the occurrence level of segmentation faults.

C extensions

Ruby gems with C extensions are a common cause of segmentation faults in Ruby applications. Albert Llop’s blog post was very helpful and was the closest match to the issue we were experiencing. Some gem dependencies are pure Ruby code, while others have C extensions in them which is basically embedded C code. This C code is responsible for managing its own memory, keeping track of pointers, and handling lower level exceptions. Ruby C extensions can be very beneficial as they can be much more performant at a particular task than Ruby. But with this power, comes the risk of causing lower level issues within a Ruby application. We had about a dozen gems with C extensions in Ngin. We audited each one: removed it if it was unused or barely used; reviewed open issues and its change log; upgraded it as appropriate. All of these changes made no difference to the frequency of segfaults.

Scary object_id warnings

We were suspicious of a few ActiveRecord models in Ngin that had object_id attributes. Ruby uses object_id to store the location in memory of the actual object's data. Scared that this was causing issues with Ruby internals and its objects that use ‘object_id’, we refactored the handful of impacted models in our system to use a different name. Okay, ruled that one out, it was time to move on to the next thing.

Core Dumps and GDB

Each segmentation fault generated a LOT of information in our Passenger logs. So much so that it was almost overwhelming, at least for developers not familiar with C and assembly. Even with all that information, it wasn’t that useful because it did not point to the source of the problem. The segfaults happened in random parts of the code without much of a pattern, even in parts of the code that were pure Ruby. Segfaults were pointing to our application code as well as library code from very stable and popular gems. We thought that getting some core dumps and using interactive debugging with gdb would provide more insight. We got the dumps and fired up gdb, but were not able to understand enough of what was happening to point out a problem. Stepping through the frames it seemed like the process was executing fine and pointers were valid until they weren’t. We even tried running Valgrind memory management tool on our application, but it proved to be difficult to work accurately in a Passenger spawned environment.

Passenger

We did a careful review of our Passenger application server configuration and searched its issues. There were some issues users have opened with Passenger about segmentation faults, but in nearly all the cases it was not a Passenger issue. Because Passenger catches and logs the segfault, it is a common place for people to post to about this issue. We opened an issue and posted to the forum in the chance of receiving some helpful advice. This didn’t help that much, but we had some tests we wanted to run to validate that Passenger was indeed not the issue. We first tried upgrading to the latest version, which at the time was 5.0.23. We switched from smart spawning to direct (so slow and a memory monster!). We disabled Passenger’s native support loader for extensions. Each change we tested on our experimental production server for a short period of time, but none of these Passenger changes made a bit of difference in segmentation faults. On top of that, Passenger’s innocence was supported by the fact that we have another Rails app running on the same version of Passenger with no segfaults ever. It has to be something specific to the Ngin application!

C and Ruby Stack Limits

Ngin is pretty darn big for a Ruby on Rails application. So big that sometimes default configuration values don’t work for it. We came across an internet post that mentioned that segfaults could occur from stack overflows happening. We didn’t have any evidence specifically of a stack overflow exception, but we wanted to rule it out anyway. We increased the C and Ruby stack limits on our experimental server. Nope, no change in faultiness.

Ruby Versions

The big vector of change in all this was the Ruby version. We were upgrading from Ruby 1.9 to 2.2, which is a lot of major and minor changes happening. Maybe there were changes in 2.0, 2.1 or 2.2 that were causing it? Maybe there was an outstanding bug in 2.2 that we were encountering? We searched through the issues for the Ruby language to turn up nothing of note. Then we tried our experiment with Ruby 2.1 which produced a bit less segfaults, but still a large increase over 1.9. We considered Ruby 2.3, but since OpsWorks did not support it yet, we decided to not pursue it. Also, we have other Rails applications running on Ruby 2.1 and 2.2 without issue so it seemed Ruby wasn’t the cause of the issue. Feeling defeated by segfault madness, we trudged forward to continue our trials.

C extensions again

We felt like we had ruled everything out. All the code was perfect, the dependencies were up to date, cruft was gone, and configurations were optimized. We still had a level of random errors caused by segfaults so we couldn’t ship the Ruby 2 upgrade to production. Reviewing everything we tried, we kept coming back to the same gut feeling. It had to be one of the C extensions in Ngin that was causing some sort of memory issue. We had already upgraded every C extension we could, so the only thing left was to temporarily remove them and run in a degraded state. The temporary removal of each gem varied in its implementation. Some were as simple as the entry in the Gemfile, while others required stubbing methods with a no-op module. We removed Image Science and ran 10% of production traffic for 10 minutes in a degraded state that was incapable of creating thumbnails or cropping images. Once again, no changes in segfaults. Kgio? Nope. Ruby Inline? Nada. Red Carpet? No, blarg. Oj, it was meant to be because one of the engineers was reading about oj segfault issues while watching that new TV show… nada. FFI, it has gotta be ffi because there was this post about segfaults with ffi and...no! Yajl-ruby? Nope, keep trucking, one of them HAS to be it. Libxml-ruby? YES! THE SEGFAULTS ARE GONE WITHOUT LIBXML-RUBY! Okay, calm down. Let’s figure out why it gets rid of the segfaults.

We Found It!

Upon digging further into libxml-ruby, we discovered there is a history of libxml-ruby causing segfaults when used with nokogiri. Sure enough, we have both in our project. Ngin depends directly on nokogiri and indirectly on libxml-ruby. Piecing together examples from other segfault issues on the nokogiri project, we were able to reproduce the issue.

The root cause of the issue was using nokogiri and libxml-ruby in the same application because of a globally scoped garbage collection callback on libxml made by the libxml-ruby C extension. The full details can be seen on the nokogiri issue. We also filed a libxml-ruby issue, but that project appears to be abandoned. Mike Dalessio, a nokogiri maintainer, was very responsive and cranked out a patched release candidate for us to try in just a day. We tried it out and sure enough, segfaults dropped to none.

An important point here is that this conflict existed in our application even on Ruby 1.9, but the occurrence of segfaults was pretty low. Our theory is that the garbage collection improvements made in Ruby 2+ (generational, more frequent and shorter) caused the memory corruption to be much more prevalent.

Performance

Finally, Ngin production is running on Ruby 2 and it performs great! We immediately saw a decrease in response time on our application. The error rate is more stable, due to fixing the memory corruption issue. The application servers are using slightly less CPU with the same amount of throughput. The app processes are using a bit more memory, but thanks to Passenger smart spawning and Ruby 2’s copy on write, the servers are using less memory overall to run the application.