Overfitting Entire Companies with Statistical Modeling

As statistical modeling becomes more ingrained in corporate decision making, models will increasingly be selected by the free market itself and not by the data analysts within those companies. This in turn will cause overfitting of the models and survivorship bias of the techniques that generated them.

Survivorship bias comes in a few flavors. The standard example comes from the world of finance. When companies go out of business, it throws off the long-term performance measurements of indices that contained them. So we might look at the performance of the automobile industry and say profits have grown (or shrunk) by x% a year for the last 50 years. The statistic could be calculated by looking at the 50-year history of Ford, GM, etc., but the “bias” comes in when you ignore Packard, which went out of business long ago. Only the survivors get included, which makes the statistic look rosier than it actually is because the worst performers are excluded.
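To make the arithmetic concrete, here is a minimal sketch. The firm names and growth figures are invented purely for illustration and do not describe any real company; the only point is how dropping the failed firms shifts the average.

```python
from statistics import mean

# Hypothetical average annual profit growth (%), invented for illustration.
# The third field records whether the firm is still in business today.
firms = [
    ("Firm A", 4.0, True),
    ("Firm B", 3.0, True),
    ("Firm C", 5.0, True),
    ("Firm D", -6.0, False),   # went out of business
    ("Firm E", -10.0, False),  # went out of business
]

# What you get if you only look at companies that exist today.
survivors_only = mean(growth for _, growth, alive in firms if alive)

# What the industry actually did, failures included.
all_firms = mean(growth for _, growth, _ in firms)

print(f"survivors only: {survivors_only:+.1f}% per year")
print(f"all firms:      {all_firms:+.1f}% per year")
```

In this made-up example the survivors average a healthy gain while the industry as a whole shrank slightly.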

In the book “The Black Swan,” Nassim Taleb also discusses a related scenario where there are so many stock traders on Wall Street that there have to be big winners, even if the stock market and all of the traders’ actions are random. With many millions of traders, one-in-a-million performance is bound to occur a few times by chance alone. “Stock trader” is often considered a high-paying job, but that ignores all the people who lost money and quit after a few months. Only the survivors are well paid.
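A rough simulation shows how this plays out. It assumes, purely for illustration, that every trader has a 50% chance of beating the market in any given year, with no skill involved; the function name and the numbers of traders and years are arbitrary choices, not figures from the book.

```python
import random

def count_lucky_traders(n_traders, n_years, seed=0):
    """Count traders who beat the market every single year by coin flip alone."""
    rng = random.Random(seed)
    lucky = 0
    for _ in range(n_traders):
        # Assumption: each year is an independent 50/50 coin flip.
        if all(rng.random() < 0.5 for _ in range(n_years)):
            lucky += 1
    return lucky

n_traders, n_years = 100_000, 10
lucky = count_lucky_traders(n_traders, n_years)
expected = n_traders / 2 ** n_years  # each perfect record has probability 1/2^10

print(f"traders with a perfect {n_years}-year record: {lucky}")
print(f"expected from chance alone: about {expected:.0f}")
```

Roughly a hundred traders out of a hundred thousand end up with a flawless ten-year track record, and none of them did anything but flip coins.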

The other thing about stock traders is that each one is basically a walking predictive model. They look at charts and graphs and newsfeeds all day and try to predict whether certain stocks will go up or down. The implication is that a trader can be a very good predictive model, with many decades of solid performance ahead, but if her particular model performs poorly for two consecutive years, she’s likely to be out of a job. On the other hand, a bad trader who has a very profitable year thanks to sheer luck is likely to hang around for a while, coasting on her cash cushion.

Since no one cares if stock traders are the victims of cosmic injustices, what’s the big deal? The problem is that more traditional companies are starting to live and die by the performance of their statistical models. Companies like Walmart and Procter & Gamble spend a huge amount of time and money on analyzing market data in an attempt to get the right product in front of the right customer at the right time. The only real cash cow Google has is its business of predicting which ads will get the most clicks when they are displayed on a particular web page.

Statistical modeling is a bit of an art form. In a way, building models (even lots of them) is very easy with off-the-shelf software and cloud computing. The real trick is picking which one you trust enough that you are willing to make decisions based on it.

When you see a pattern in data, it’s very difficult to tell if it reflects something that is happening in the real world, or if it’s noise in the data that looks like a pattern. It’s the difference between creating a model of the real world using data as an approximation or making a model of the data set itself.

This problem often shows up as something called overfitting, where a model is very good at predicting the historical data (which it was built with), but ends up having terrible performance in the real world. An overfit model can predict the data set it was built with but not previously unseen scenarios. I’ll spare you the details of why this happens, but the important takeaway is that the model that is likely to have good future performance is not necessarily going to be the one with the best looking historical performance. In fact, having a very good historical performance is often a sign that overfitting has occurred and future performance is likely to be poor.
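For the curious, a small sketch of the effect. It assumes a made-up world where the true relationship is a straight line plus random noise, and compares a simple line fit against a "memorizer" that just returns the answer from the closest historical data point. Both models and all the numbers are illustrative choices, not anyone's production method.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def fit_memorizer(xs, ys):
    """Overfit model: return the y of the nearest historical x, noise and all."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n, noise_sd=1.0):
    # Assumed "real world": y = 2x plus Gaussian noise.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(0)
hist_x, hist_y = make_data(rng, 500)  # historical data used to build the models
new_x, new_y = make_data(rng, 500)    # previously unseen data

models = {"line": fit_line(hist_x, hist_y),
          "memorizer": fit_memorizer(hist_x, hist_y)}
results = {name: (mse(m, hist_x, hist_y), mse(m, new_x, new_y))
           for name, m in models.items()}

for name, (hist_err, new_err) in results.items():
    print(f"{name:10s} historical error: {hist_err:.2f}   new-data error: {new_err:.2f}")
```

The memorizer has a perfect historical record because it has learned the noise along with the signal, and it pays for that on data it has not seen. The line looks worse on history and does better going forward.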

Needless to say, professional statisticians and machine learning analysts spend a lot of time trying to avoid overfitting by carefully selecting which models to trust. (All their time? Is that what the job actually is?)

So back to survivorship bias….

The free market selects winners and losers based on financial performance. At some point, the statisticians picking models for companies get overruled by the market when their company wins or loses customers. Over time, this will create greater pressure to select models that perform the best in the very near future based on the very recent past.

What’s worse, the free market itself may put the more realistic companies out of business by starving them of customers and capital during the times that the more overfit models are still performing well.

Over time, entire industries will become dominated by an overfit view of the market they sell into. When the only survivors are the ones that have optimized their operations to the recent past while ignoring low probability events that could potentially bankrupt them, survivorship bias will kick in.

It seems logically absurd that companies would actually do this, but the financial crisis of 2008 had numerous multibillion-dollar financial companies bankrupting themselves by treating low probability events the same as impossible events, and then blowing up when those events happened.

The moral of the story is that data can (and should) be used to add weight to decisions but cannot be relied on as the absolute truth. All possible events that may affect a business must be enumerated and accounted for, even when there is no relevant historical data to plug into a model to see what might happen if some never-before-seen, but possible, event occurs. Software simulations can often come in handy in these situations to get some insight into these blind spots, but other times it’s necessary to simply treat catastrophic events as if they will happen and plan accordingly.

There are plenty of people who will disagree with this conclusion. They will tell you to fit your company’s operations to the available data. They will probably cite some examples of companies that do and have so far survived.
