When I worked at Amazon, I came to dread question mark emails. Every couple of months, someone would email Jeff Bezos with a streaming video problem. He'd add a single '?' in the body and initiate a long string of increasingly frustrated email forwards that would somehow end in my inbox. All other activities halted until my team and I knew precisely what caused this single anecdote, how we'd prevent the problem going forward, and had summed all of that up in an extensively reviewed email response. Stressful times.





After I'd gone through this experience several times, I got snippy with my boss. I said, "This is a dumb use of our time. We have metrics that show a 99.95% playback success rate. We're chasing edge cases here." (Note: I would not say that directly to Jeff Bezos.) My boss's reply was both deep and deeply confusing: "When your data disagrees with your anecdote, trust the anecdote." No explanation, just an unsatisfying impression of Confucius.





Fast forward 2 years. I'd left Amazon to join TUNE for a lot of good reasons, question mark emails being one of those. Somehow, I found myself in a familiar situation.





One day, customers overwhelmed our overseas support team around 2 AM Pacific with reports that they couldn't log into one specific platform. Here's the weird thing: we have gobs and gobs of monitors, alerts, and performance data. All of them showed that everything was just fine at that time. Latency was low, we had plenty of capacity, and there were no large-scale networking issues. "LGTM. Uh, ask them about their wifi connection?" was our shaky answer.





After the second day of these reports, I realized that I'd been presented an anecdote that disagreed with my data. We started digging, iterating through each graph on each dependency of the platform. EC2 CPU utilization and iowait were fine, the load balancer was distributing traffic correctly, aggregate DB insertion time was within acceptable variance, memcached was memcached-ing, no sun spots, Godzilla hadn't attacked US East, and so forth. We had gone through most of this data already, but as we kept digging, we eventually found a graph that pointed to a new bottleneck in the form of large-scale but short-lived MySQL contention. We cross-referenced a specific instance of contention with previous data. When we sampled at a higher rate during that brief period of contention, we saw the latency there too. Eureka! We had identified a major problem and could now go solve it.
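The mechanics of why our dashboards looked "fine" are worth making concrete. Here's a toy sketch (invented numbers, not our actual telemetry) of how a short-lived latency spike can vanish into a coarse aggregate while a higher-rate sample makes it obvious:

```python
# Hypothetical per-second DB insertion latencies (ms) for one hour:
# a steady ~5 ms baseline with a 30-second contention spike at ~500 ms.
latencies = [5.0] * 3600
for i in range(1800, 1830):  # the brief contention window
    latencies[i] = 500.0

def mean(xs):
    return sum(xs) / len(xs)

# Coarse view: one averaged data point for the whole hour.
# The spike is diluted into "within acceptable variance."
hourly_avg = mean(latencies)  # ~9.1 ms

# Higher-rate view: one averaged data point per minute.
# The minute containing the spike is impossible to miss.
minute_avgs = [mean(latencies[i:i + 60]) for i in range(0, 3600, 60)]
worst_minute = max(minute_avgs)  # 252.5 ms
```

Thirty bad seconds out of an hour barely move the hourly average, which is exactly why the anecdote ("I can't log in") carried information our graphs didn't.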





Sometimes when you trust the anecdote, you come to find that the anecdote was flat out wrong or points to a problem you can't solve. For example, one of the video streaming issues ultimately came down to someone's satellite internet connection at their Mexican beach house being too slow to sustain 1080p. Sorry, man! I don't rule space or Mexico.





Sometimes though, the anecdote allows you to see your data with new eyes. These anecdotes lead you to tools that you've always had at your disposal and never realized, tools that allow you to identify and attack* these problems before the next swarm of angry emails. That's why you trust the anecdote.





* Identify and Attack is the name of my YouTube self-defense course.