Wednesday, July 13, 2016

Two years ago I trained for my private pilot license over the course of nine months. Besides being fun and exhilarating, flying requires a lot of knowledge and skill: aerodynamics, crosswind landings, stall recovery, navigation, recognizing and dealing with system failures, and so on.

Something that always amazes me whenever I explore a completely different body of knowledge is how much it bleeds into other subjects I care about. There are fundamental principles that cross domains, and seeing those principles in completely different contexts gives you a much deeper understanding of them. Aviation has been no exception – it taught me a lot about being a better programmer. I wouldn't say it taught me anything completely new, but something about risking a horrifying death in the future by not understanding these principles drove the points home in a way that just programming did not.

With my instructor Chuck Hellweg after my first solo

Do not treat systems as black boxes

I'll never forget the day I did spin training. In a spin (here's a video of one), the plane basically turns into a big rock in the sky, tumbling and rotating while plummeting towards the ground. I've heard it called a "rollercoaster without rails". My instructor got a big kick out of putting me through my first spin – afterwards he said, "I've never seen anyone curse like that before!"

It's easy to think of a plane as a "black box", where the yoke and rudder turn the plane while the throttle adds energy for airspeed or for climbing. In normal stable flight, that's basically what the inputs do. In a spin however, thinking of the plane's inputs that way will get you killed. Adding power will aggravate the spin and pitching up will prevent you from exiting the spin. Trying to roll right could roll you left instead, potentially violently. To exit a spin you must operate the plane differently: put the power to idle, use only the rudder to stop rotation, and pitch down into a near vertical dive. Then the plane is flying again and you can carefully recover from the dive.

To become a pilot you could try to memorize an arbitrary set of rules for how to fly the plane in various scenarios: in cruising flight do this, when landing do that, in a spin do this other thing. But then nothing is intuitive, and you're going to make a lot of mistakes. This is especially true when things go wrong: if you have engine problems in flight, knowing how the engine, carburetor, and fuel system work could mean the difference between a safe landing on a runway while your passengers take selfies versus a dangerous landing on a highway while your passengers desperately send texts to loved ones. You use your tools much more effectively and safely when you understand their implementation.

In software I don't think anything is treated more like a black box than the database. Programmers want to store and retrieve data using the database interface and then leave it to the ops guys to get it running robustly. However, a database's interface only tells a small part of what it means to use that database, and there's a lot of crucial information it doesn't tell you. Which queries will run efficiently? Which queries will consume large amounts of resources? What happens if there are hardware failures? If one application has a bug that triggers an accidental denial-of-service attack on the database, what will happen to other applications? Will just the one application fail, or will it trigger a larger cascading failure? If the database is distributed, how is data partitioned among nodes? What happens when there is data skew?

Understanding the answers to these questions is critical to architecting robust systems, and you need to know how the database works to answer them effectively. When you understand the mechanisms underlying the databases you use, you understand their limits and failure modes much better, and you can use this knowledge to architect better systems.

For example, one of my big motivations in developing the Lambda Architecture was to avoid the failure modes and complexity of the standard architecture of an application incrementally reading and writing into a distributed database. One example of this complexity is managing online compaction – if not done extremely carefully, cascading failure and massive outages will result (as many companies learn the hard way). Another example is eventual consistency – achieving it in a fully incremental architecture is incredibly prone to error, and the smallest mistake will lead to data corruption (the infrequent kind that's very hard to track down). I strongly believe the best way to avoid failures is to design architectures so those failure modes don't exist in the first place. The more that can go wrong, the more that will go wrong.

I recommend reading through Kyle Kingsbury's Jepsen posts to see that relationship between database robustness and complexity firsthand. A common theme throughout those posts is how often distributed databases contradict their marketing, like by silently losing massive amounts of data during partitions. For me, knowing how something works gives me confidence it will work properly, and if the implementation is too complex or confusing I proceed extremely cautiously.

There's an even bigger reason to understand how your tools work. Too many programmers treat their tools like LEGO DUPLO pieces and limit solutions to how those tools can be used and combined. But when you understand how your tools work, then you can reason in terms of what would be the ideal solution if ideal tools existed. Software becomes clay instead of LEGOs. Something I've found repeatedly in my career is that the languages, databases, and other tools we use are so far removed from ideal that the opportunities for fundamental innovation are constant. Thinking this way led me directly to creating Storm, Cascalog, and ElephantDB, all of which gave me and my team huge leverage. Even when constructing the ideal tools yourself isn't practical, knowing what would be ideal is hugely helpful when evaluating possible solutions.

I'm a big fan of Martin Thompson's term "mechanical sympathy". The kind of high performance work he's known for is absolutely impossible without a deep understanding of implementation. And performance optimization in general is much easier the more you understand the implementation of your dependencies.

If you read my last post about the limited value of computer science educations, you might think I'm contradicting myself by arguing for understanding the internals of your dependencies. After all, that's a major focus of a computer science education. My answer is I advocate for a computer science education for programmers to the same extent I advocate for an aeronautical engineering degree for pilots. The degrees are relevant, useful, and helpful, but on their own do little to make you a good programmer or a good pilot. You can learn how your tools work, oftentimes much more effectively, in the trenches as opposed to on a whiteboard. Where abstract education pays its dividends is when you push the boundaries of your field. The deep problem solving and algorithm skills I gained from my computer science education helped me greatly in that regard, and if I were a test pilot I imagine a formal aeronautical engineering education would be essential.

Correctly monitoring production systems

Aviation also helped me gain a deeper understanding of monitoring production systems. I used to think of monitoring as an afterthought, something to be added after the actual system was built. Now I consider monitoring a core piece of any system and take it into account from the start.

In aviation, you monitor the plane using a variety of instruments. The primary instruments measure altitude, airspeed, direction, and orientation using gyros and air pressure. You also have a navigation tool called VOR that can determine your direction from a fixed radio beacon (but not your distance). That you can fly safely with zero visibility solely on these primitive instruments is amazing.

Last year I did some special training in a simulator where I told the instructor to try his best to kill me. He enthusiastically put me in a variety of unusual situations that involved bad weather as well as system failures. When an engine fails on a plane it's fairly obvious, but when an instrument fails sometimes the only way to tell is by looking at the other instruments (especially when you're inside a cloud and have no outside references to use). But when the instruments are inconsistent with each other, how do you know which instrument is broken? Deciding wrong could lead to further bad decisions – like flying the plane straight into the ground while thinking you're cruising at 3500'.

In one scenario he clogged my static source port while I was climbing inside a cloud. When I saw my altimeter was frozen, I initially thought I had let the plane pitch down a bit, stopping the climb. But when I crosschecked my instruments, I saw the other instruments did not support this hypothesis. If I had acted on that first instinct by pitching up, I could have stalled the plane and put myself in a very bad (simulated) situation. Instead, I correctly diagnosed the problem and switched to the alternate static port to bring the altimeter back to life.

In another scenario, he killed my vacuum pump, causing my directional gyro to fail. I then had to navigate using only the magnetic compass. Because I understood the magnetic compass is inaccurate during acceleration, I did not overreact to the strange readings it gave me during turns. If I had remembered how the readings deviate during turns depending on which direction you're facing (which I will internalize when I do my instrument rating), I would have been able to do those turns even more precisely.

There are two lessons here. First, I was able to read the instruments correctly because of my understanding of how they compute their measurements. Second, the redundancy between the instruments allowed me to diagnose and debug any issues. The same lessons apply to software.

Deploying software to production is like flying an airplane through a cloud. You know the airplane is well-designed and well-tested, but whether it's operating properly at this particular moment can only be determined from your instruments.

Consider a basic property of a production software system: it does not lose or corrupt data. You may have thorough tests, but unexpected things happen in production. How do you know this crucial property is being maintained once the software is deployed? When I started my first job out of college, some of the code for the company's product astounded me – and not in a good way. The code consisted of a plethora of special cases to handle the various ways the database had been corrupted over time.

One of my big "aha" moments as a programmer has been embracing the philosophy of "your code is wrong!". Even if you have a fantastic test suite, bugs can and do make it to production. The implications of this are huge. When you acknowledge bugs are inevitable, then bugs that corrupt your data are inevitable as well. When one of these bugs strikes in production, you need to know as soon as possible to prevent even more corruption. Maybe the bug will trigger an exception somewhere, but you can't rely on that. Your only option is to instrument your systems. If you don't then your customers will be your data corruption instruments – and that's not good for anyone.

Here's an example of a data loss bug I had that my thorough tests failed to catch. I had a Hadoop job that would "shuffle" data to the reducers by generating random numbers for keys. It turns out that for Hadoop to be completely fault-tolerant, tasks must output the same results each time they are run. Since my code was generating keys randomly on each run, a reduce task randomly failing would cause some data loss. Needless to say, discovering this as the cause of our data loss was one of those hair-pulling "why did I ever become a programmer?" debugging sessions.
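To make the determinism requirement concrete, here is a minimal sketch in plain Python. It is a simulation, not Hadoop code: `run_map_task`, `random_key`, `deterministic_key`, and `NUM_REDUCERS` are hypothetical names invented for illustration. Hashing the record's content is one possible way to get a repeatable key, and is not necessarily how the original bug was fixed.

```python
import hashlib
import random

NUM_REDUCERS = 4  # hypothetical number of reduce partitions


def random_key(record: str) -> int:
    # Non-deterministic: a re-run of the same task can route this
    # record to a different reducer than the first attempt did.
    return random.randrange(NUM_REDUCERS)


def deterministic_key(record: str) -> int:
    # Derived only from the record's content, so every attempt of the
    # task assigns the record to the same reducer.
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_REDUCERS


def run_map_task(records, key_fn):
    """Simulate one attempt of a task: map each record to a reducer."""
    return {record: key_fn(record) for record in records}


records = [f"record-{i}" for i in range(1000)]

# Two attempts of the same task agree when keys are deterministic.
first_attempt = run_map_task(records, deterministic_key)
retry_attempt = run_map_task(records, deterministic_key)
stable = first_attempt == retry_attempt
```

With `random_key`, two attempts over the same input will almost certainly disagree about where records belong, which is the property that makes retries unsafe.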

There's a lot you can do to monitor for data corruption. One great technique is to continuously generate "dummy data" and push it through your production systems. Then you check that every stage of your processing outputs exactly the expected results. The end result is a simple instrument that either says "corrupt" or "not corrupt" – with the "corrupt" case showing the difference between the expected and actual results.
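A "dummy data" instrument for a single stage could look roughly like the sketch below. Everything here is an assumption made for illustration: the stage is modeled as a plain function from a list of records to a list of records, and `check_dummy_data`, `healthy_stage`, and `lossy_stage` are hypothetical names. In a real pipeline you would run a check like this against each stage.

```python
from collections import Counter


def check_dummy_data(stage, dummy_input, expected_output):
    """Push known input through a stage and diff the result.

    Records are compared as multisets, so dropped and duplicated
    records both show up in the report.
    """
    actual = Counter(stage(dummy_input))
    expected = Counter(expected_output)
    missing = expected - actual      # expected but not produced
    unexpected = actual - expected   # produced but not expected
    status = "corrupt" if (missing or unexpected) else "not corrupt"
    return {
        "status": status,
        "missing": dict(missing),
        "unexpected": dict(unexpected),
    }


def healthy_stage(records):
    return [r.upper() for r in records]


def lossy_stage(records):
    # Simulated bug: silently drops the last record.
    return [r.upper() for r in records[:-1]]


dummy_input = ["a", "b", "c"]
expected_output = ["A", "B", "C"]

healthy_report = check_dummy_data(healthy_stage, dummy_input, expected_output)
lossy_report = check_dummy_data(lossy_stage, dummy_input, expected_output)
```

The report carries the difference as well as the verdict, so a "corrupt" reading points directly at what went missing or appeared unexpectedly.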

Just because your "dummy data" instrument says "not corrupt" does not mean data corruption is not happening. By necessity such a technique funnels a relatively small scale of data through your systems, so if you have data corruption errors that occur infrequently you probably won't detect them. To properly understand what the "not corrupt" indicator is actually saying requires an understanding of how the "dummy data" instrument works.

Another great way to detect data corruption is through aggregates like counts. For example, in a batch processing pipeline with many stages, you oftentimes know exactly how many records are expected in the output of a stage given the number of input records. Sometimes the relationship isn't static but there's a cheap way to augment the processing code to determine how many output records there should be. Using this technique would have caught the random number bug I described before.
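As a sketch of the count-based check, assuming each stage can report its input and output record counts, something like the following would do. `verify_stage_count` and its `expected_fn` parameter are hypothetical names: `expected_fn` stands in for whatever rule relates input count to output count for a given stage.

```python
def verify_stage_count(stage_name, input_count, output_count,
                       expected_fn=lambda n: n):
    """Return None if the output count matches, else a description.

    expected_fn maps the input record count to the expected output
    record count; the default assumes a one-to-one stage.
    """
    expected = expected_fn(input_count)
    if output_count == expected:
        return None
    return (
        f"{stage_name}: expected {expected} output records "
        f"for {input_count} inputs, got {output_count}"
    )


# A one-to-one stage that kept every record.
ok = verify_stage_count("shuffle", 100, 100)

# The same stage after silently losing three records.
lost = verify_stage_count("shuffle", 100, 97)

# A stage that emits two records per input record.
doubled = verify_stage_count("explode", 50, 100, expected_fn=lambda n: 2 * n)
```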

A lot of data processing pipelines have a normalization stage, such as converting free-form location fields into structured locations (e.g. "SF" becomes "San Francisco, CA, USA"). A useful measurement is the percentage of data records that fail to normalize. If that percentage changes significantly in a short time frame, that means either 1) there's a problem with the normalizer, 2) there's a problem with the data coming in, 3) user behavior has suddenly changed, or 4) your instrument is broken.
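Measuring the failure rate can be sketched in a few lines. The lookup-table normalizer, the "NYC" entry, the function names, and the 5 percentage point threshold are all assumptions for illustration; a real normalizer and a sensible threshold depend on your data.

```python
# Toy normalizer: a lookup table standing in for real location parsing.
KNOWN_LOCATIONS = {
    "SF": "San Francisco, CA, USA",
    "NYC": "New York, NY, USA",
}


def normalize(raw):
    """Return the structured location, or None if normalization fails."""
    return KNOWN_LOCATIONS.get(raw.strip().upper())


def failure_rate(records):
    """Fraction of records that fail to normalize (0.0 for no records)."""
    if not records:
        return 0.0
    failures = sum(1 for r in records if normalize(r) is None)
    return failures / len(records)


def rate_changed(previous_rate, current_rate, threshold=0.05):
    # threshold is an assumed value: how large a shift counts as
    # "significant" is a judgment call for each pipeline.
    return abs(current_rate - previous_rate) > threshold


yesterday = ["SF", "NYC", "sf", "nyc"]
today = ["SF", "??", "n/a", "NYC"]

previous = failure_rate(yesterday)  # every record normalizes
current = failure_rate(today)       # half the records fail
alarm = rate_changed(previous, current)
```

The alarm only says the rate moved. Deciding which of the four causes is responsible still takes a human, or other instruments.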

Obviously none of these instruments are perfect. But the more checks you have, the harder it is for a corruption-causing bug to go unnoticed. Oftentimes your measurements partially overlap – this redundancy is a good thing and helps verify the instruments are functioning correctly. An example of this is measuring the overall latency of a pipeline as well as measuring the latencies of the individual components. If the latencies don't add up, you either have a bug or are missing something.
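The latency crosscheck might be sketched as below. This assumes the components run strictly one after another, so their latencies should sum to roughly the end-to-end figure. The function names, the component names, and the 5 ms tolerance are hypothetical.

```python
def unaccounted_latency(total_ms, component_ms):
    """End-to-end latency not explained by the measured components."""
    return total_ms - sum(component_ms.values())


def latencies_consistent(total_ms, component_ms, tolerance_ms=5.0):
    # tolerance_ms is an assumed allowance for measurement noise.
    return abs(unaccounted_latency(total_ms, component_ms)) <= tolerance_ms


components = {"ingest": 20.0, "normalize": 35.0, "write": 45.0}

# The components explain all but 2 ms of the total.
healthy = latencies_consistent(102.0, components)

# 80 ms is unaccounted for: a bug, or an unmeasured component.
suspicious = latencies_consistent(180.0, components)
gap = unaccounted_latency(180.0, components)
```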

Gil Tene has a great presentation on the incredibly common mistakes people make when measuring the latency of their systems. I highly recommend checking that out, and it's a great example of needing to understand how your measurements work to properly interpret them.

When I construct software now, I don't consider any part of what I build to be robust unless I can verify it with measurements in production. I've seen way too many crazy things happen in production to believe testing is sufficient for making software robust. Monitoring is critical to finding how the expectation of production behavior differs from the reality. Making the expected properties of software measurable is not always easy, and it often has major effects on software design. For this reason I account for it in all stages of the development process.

Conclusion

I've discussed some of the biggest ways aviation has improved me as a programmer: going deeper into the tools I use and thinking of instrumentation in a more holistic way. There's an even broader point here though. Becoming a pilot seems unrelated to being a programmer, yet there are surprising overlaps between the two fields. I've found every time I've pursued subjects completely unrelated to my career, there are areas of overlap that improve me in unexpected ways.

This can be captured as a general philosophy of life: consistently find ways to get outside your comfort zone and do things that challenge and frustrate you. Besides learning new things, you'll become better at everything you do because of the surprising overlaps of knowledge. The most extreme example of this for me happened three years ago when I did standup comedy for three months straight. The lessons I learned from standup have been invaluable in improving my skill as a technical speaker. And surprisingly, it's helped me become a better writer as well.

One last thing – don't let my casual references to death throughout this post scare you away from experiencing the joy of small plane flight. The only time most people hear about small planes is when they crash, so the common fear of them is just plain selection bias. If you have a few hours and an extra couple hundred bucks to spare, I highly recommend doing an intro flight and experiencing the thrill of being a pilot.