[Wikitech-l] Fwd: HHVM deployment update

Hi everyone,

The email below was written with an internal audience in mind, but Krenair pointed out that it would be of general interest.

Rob

---------- Forwarded message ----------
From: Rob Lanphier <robla at wikimedia.org>
Date: Fri, Jul 18, 2014 at 5:20 PM
Subject: HHVM deployment update
To: WMF Engineering List, Operations Engineers

Hi everyone,

I'm writing to give you a quick update about where we are with HHVM deployment, so you know what to expect. Ori, Aaron, Tim, Giuseppe, Brett, Antoine and probably others I'm forgetting have been hard at work getting HHVM ready, and are about to make changes that may affect your work in production.

The good news is that the way it'll most affect you is that your code will run a lot faster. The bad news is that there is some risk of breakage due to the number and nature of things we're changing.

We've started referring to the stack that we're migrating to as "HAT" (HHVM, Apache 2.4, Trusty), as a nod to LAMP and as a useful tag on Gerrit changes. The point being: this is not merely a change in our PHP implementation, but a full stack upgrade that may have implications beyond just the problems that might be introduced by HHVM.

The team is deploying this one piece at a time, with the first bit of the deployment happening very soon. Here's our rough timeline:

* Now: limited deployments of production job runners to osmium, which the team only leaves on when they are monitoring it for errors
* Week of July 21: Deployment to Beta Cluster. The timing on this may slip, since it might be a surprise to a few people who are deeply affected by it (/me waves to Chris McMahon), but we think it's generally ready from an engineering perspective.
* Week of July 21: Deployment to a few job runners in production. You'll know the first job runner was deployed when you see this patch[1] get its +2. We didn't get a chance to coordinate this with Greg today, so exact timing is TBD.
* Sometime later: Deployment to test.wikipedia.org application server
* Sometime later: Deploy Varnish module allowing partial deployment to a fraction of application servers
* Sometime later: Limited deployment to a small number of application servers
* Sometime later: Ramp up deployment to more application servers until most servers use HHVM
* Sometime later: Deploy to remainder of services

How to test your extension with HHVM:
-------------------------------------

Historically, we've treated HHVM-related bugs in MediaWiki extensions as the sole responsibility of the HHVM team, because we could not reasonably expect developers to test their code on HHVM while it was still difficult to build and configure. As we head toward full deployment, however, we are going to progressively shift responsibility onto you to be proactive about testing your code with HHVM and reporting any issues you encounter. If you're not sure how to test your code with HHVM, ask! The options that are currently available to you are:

* On your machine, using MediaWiki-Vagrant (HHVM is the default PHP runtime)
* On Labs, using Labs-Vagrant (<https://wikitech.wikimedia.org/wiki/Labs-vagrant>). The Flow team is doing this; ask them how. :)
* Sometime next week: on the Beta cluster, when we switch it over to HHVM.

Use the "hiphop" keyword in Bugzilla to catch our attention.

Some things that will change, and the associated challenges:

* Lots of C++ code that is generally high-quality but doesn't have quite as many flight-hours logged in production as PHP. All that is entailed by that.
* We expect the performance profile to improve substantially, but we can't rule out the possibility that specific operations will suffer performance regressions.
* Distribution-upgrade risks: there are many utilities we rely on besides MediaWiki itself, and many of those utilities will see upgrades as well. For example, a lot of the utilities on our image scalers (e.g. imagemagick, avconv, etc.) will be upgraded.

What we're doing to mitigate/minimize risk: test, test, test. A lot of the work that's been going on has been to improve the state of our unit tests such that we can have a clean test run before deploying all the way; a task made trickier by the fact that our current codebase doesn't meet that bar[2].

That's all for now. More information about HHVM and our deployment of it can be found at https://www.mediawiki.org/wiki/HHVM . Anything that isn't there, come talk to me, and I'll turn around and ask Ori. :-)

Thanks!
Rob

[1] Puppet repo patch "jobrunner: create hhvm-only jobrunners" https://gerrit.wikimedia.org/r/#/c/147086/
[2] Tracking bugs for unit tests that fail in HHVM (and unfortunately, our current production setup too): https://bugzilla.wikimedia.org/show_bug.cgi?id=67216