Last week, the sabermetric community had—well, not an argument, because the participants were generally professional and cordial to one another, but a debate about what we might expect over the rest of the season from a player who is currently enjoying a hot (or cold) streak. It all started with researcher Mitchel Lichtman (better known by his initials, MGL) posting two articles, one on hitters and one on pitchers, that made the case that we should trust the projection systems rather than expect a player’s recent performance to continue. Remember Charlie Blackmon, who was the best player in baseball for three weeks and was smart enough to make those weeks the first three weeks of the 2014 season? He’s a good example. He had never been anything special, nor was he projected for greatness this year. And in retrospect, his hot streak to start the season looks a lot like a small-sample fluke.

MGL’s methodology was reasonable. He identified hitters who had significantly outperformed their projections in April and then looked to see how well they did from May to September. He found that as a group, their subsequent performance was much closer to their projection than it was to their early-season hot streak, which held true even if he looked at longer stretches of overperformance to start the year. He found roughly the same for hitters who underperformed relative to their projections, and then found roughly the same for pitchers. His conclusion: Don’t get too wrapped up in an early season hot or cold streak. The player will most likely regress.

Dave Cameron followed with a post at FanGraphs, in which he summarized MGL’s work and briefly discussed the fact that the “trust the projection” mantra isn’t 100 percent accurate. He ended with this nugget: “Without perfect information, we’re going to be wrong on some guys. The evidence suggests the conservative path, leaning almost entirely on forecasts and putting little weight on seasonal performance, is the one that is wrong the least.” Again, this is a perfectly reasonable thing to say. The problem isn’t that Messrs. Lichtman and Cameron are wrong. What they’ve said is all factually correct and the product of good, solid thinking. And yes, most of the time, “trust the projection” will be correct. It’s a nice antidote to the hope-fueled longings of fans who swear that while everyone else is “due for a regression,” our guy has “made a significant adjustment to his game” and will sustain this .400 batting average and 60 home run pace all the way through Game 5 of the World Series (because we won’t need seven).

The problem is that we’re asking the wrong question. To understand why, we need an analogy.

Suppose that some serious disease that no one had seen before were making the rounds. Naturally, biomedical researchers and public health officials would be hard at work immediately trying to figure out what was going on and would likely try to develop a test that could pinpoint whether someone was infected with this disease. Early detection of just about anything saves lives. In an ideal world, we’d want the test to get it right every time. If a person really were infected, we’d want the test to say “yes.” If a person were disease free, we’d want the test to say “no.” It’s rare to get a test that’s 100 percent accurate, but that’s the goal.

Now, let’s say that we can reasonably assume, based on some surveillance and epidemiology data, that 10 percent of the population is infected. But which 10 percent? Ah, that’s where I would come to the rescue with my extra super-duper magical test, because I am brilliant. I would simply declare, according to my test, which is actually some stray wires taped to a cardboard box, that no one actually has the disease. And I would be right 90 percent of the time. That’s an A-minus, mom!

Oh right, that doesn’t really help the people who are infected, does it? Okay, instead I'll say "Everyone has the disease!" Now, I have accurately identified everyone who is infected, without missing anyone. I’ve gone up to an A-plus! Yeah, there are cases where that sort of “just assume everyone has the disease” model works in public health, but coming back to baseball, it’s basically like saying in March, “I know that a couple of these 750 players are going to break out this year. I just know it.” Technically, you can take credit for having “called” every breakout in baseball that year, but your original statement isn’t useful.
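The arithmetic behind those two "tests" is worth making concrete. Here's a small sketch with a hypothetical population of 1,000 people and the 10 percent infection rate assumed above; the numbers are purely illustrative:

```python
def confusion(truth, predictions):
    """Count the four signal-detection outcomes for a binary test."""
    hits = sum(t and p for t, p in zip(truth, predictions))
    misses = sum(t and not p for t, p in zip(truth, predictions))
    false_alarms = sum((not t) and p for t, p in zip(truth, predictions))
    correct_rejections = sum((not t) and (not p) for t, p in zip(truth, predictions))
    return hits, misses, false_alarms, correct_rejections

# 1,000 people, 10% truly infected
truth = [True] * 100 + [False] * 900

say_no_one = [False] * 1000   # the cardboard-box test: "no one has it"
say_everyone = [True] * 1000  # the opposite extreme: "everyone has it"

for name, preds in [("no one", say_no_one), ("everyone", say_everyone)]:
    h, m, fa, cr = confusion(truth, preds)
    accuracy = (h + cr) / len(truth)
    print(f"'{name} has it': accuracy={accuracy:.0%}, "
          f"hits={h}, misses={m}, false alarms={fa}")
```

"No one has it" scores 90 percent accuracy while detecting zero infected people; "everyone has it" catches all 100 cases while falsely flagging the other 900. Raw accuracy is hiding what we actually care about.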

In public health and statistics, we call this a signal detection problem. A signal detection problem has two parts: something that we’re looking for (the signal) and some test that tries to find it. You can visualize the problem like this:

|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     |                        |                            |
| Test Says it’s Not There |                        |                            |

Now, let’s fill in two of those empty squares:

|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     | Correct Identification |                            |
| Test Says it’s Not There |                        | Correct Rejection          |

These are the squares that you want to be in. The perfect test would sort everything into these two boxes.

Now, about the other two:

|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     |                        | False Positive             |
| Test Says it’s Not There | False Negative         |                            |

In the context of our test for breakouts, a false positive is the guy who has a hot month. He makes you all excited because this is the new breakout star, and you start talking about him to all your friends so that you can say that you were “on him” earlier than anyone else (or so that you can pretend that your team has a shot at the playoffs). But in mid-May, he turns back into a pumpkin. A false negative, on the other hand, is the one that “we” missed on. “We” all assumed that it was just a hot streak, but it turns out that he had changed.

So there are two different kinds of errors that a test can make in a signal detection problem, and in signal detection theory, there are two things that we want to know about a test to determine how good it is. One is a measure of how well the test sorts cases into the “good” boxes. This measure, called detectability (often abbreviated d’), is what you really want in a good test. The other measure is called response bias (often abbreviated with the Greek letter beta), and it tells you which type of error your test will make more often.
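Under the standard equal-variance Gaussian model, both measures fall out of just two numbers: the hit rate and the false-alarm rate. Here's a minimal sketch, with made-up rates chosen purely for illustration; d' is the difference in z-scores, and beta compares the likelihood of "signal" versus "noise" at the decision criterion:

```python
from math import exp
from statistics import NormalDist

def d_prime_and_beta(hit_rate, false_alarm_rate):
    """Equal-variance Gaussian signal detection measures."""
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa            # detectability: separation of the distributions
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)  # response bias at the criterion
    return d_prime, beta

# A test that finds 80% of true signals while crying wolf 20% of the time:
dp, b = d_prime_and_beta(0.80, 0.20)   # d' ≈ 1.68, beta = 1 (unbiased)

# Same detectability, but a more conservative criterion: fewer false
# alarms, at the cost of more misses. Beta rises above 1.
dp2, b2 = d_prime_and_beta(0.60, 0.08)  # d' ≈ 1.66, beta ≈ 2.6
```

The second call is the whole point in miniature: sliding the criterion changes beta (which errors you make) while leaving d' (how good the test actually is) essentially untouched.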

You can think of it in terms of what you might do in a case where you looked at the evidence and found that it wasn’t quite clear whether you should go with “Definitely a breakout” or “Pshaw, just a small-sample fluke.” To which one do you usually give the benefit of the doubt? That’s your response bias. Again, in public health, there are cases where it makes sense to prefer one sort of error over another, but adjusting the response bias isn’t helping you to get more cases correctly classified. It’s just adjusting what sort of errors you will make. Sometimes that’s the only thing that you can do, and it can make the test better, but it’s no substitute for better detectability.

Here’s the problem with “Always trust the projections.” It’s also the problem with “Everyone (or no one) has the disease.” We are trying to figure out whether a player who is playing above his head really is breaking out, or if it’s just acne. Going with “always trust the projection” is a way of saying “adjust your response bias toward saying ‘No breakout’ rather than working on making the test a better detector.” If we made a list of players who have exceeded expectations this year (pick whatever definition of that you want), most will probably revert to form, but some really are emerging from their chrysalis and have become beautiful butterflies. Let’s say that 10 percent of them are real breakouts (just picking a number). Saying “small-sample fluke” all the time will be correct 90 percent of the time. And only minimally useful.
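To put rough numbers on that, here's a hypothetical comparison; the 10 percent breakout rate comes from the paragraph above, and the detector's 70 percent hit rate and 10 percent false-alarm rate are invented for illustration:

```python
# Hypothetical: among 750 players, suppose 10% are genuine breakouts.
n_players = 750
real_breakouts = round(0.10 * n_players)   # 75
flukes = n_players - real_breakouts        # 675

# Rule 1: always say "small-sample fluke" (maximum response bias)
accuracy_always_fluke = flukes / n_players  # 0.90, and zero breakouts found

# Rule 2: an imperfect but genuine detector (rates invented here)
hit_rate, false_alarm_rate = 0.70, 0.10
hits = hit_rate * real_breakouts                       # ≈ 52.5 breakouts found
correct_rejections = (1 - false_alarm_rate) * flukes   # ≈ 607.5
accuracy_detector = (hits + correct_rejections) / n_players  # 0.88

print(f"always-fluke: {accuracy_always_fluke:.0%} accurate, 0 breakouts found")
print(f"detector:     {accuracy_detector:.0%} accurate, "
      f"{hits:.1f} of {real_breakouts} breakouts found")
```

The always-fluke rule "wins" on raw accuracy, 90 percent to 88 percent, while identifying none of the players a team would actually want to trade for. That's the gap between correct and useful.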

The real question that teams are concerned with is the detectability question. Suppose that a team saw a player who was starting to break out at the end of a season and could detect that yes, this one was real. At the Winter Meetings, the team’s GM would invite the breakout player’s GM out for some lemonade-fueled debauchery on the hotel mini-golf course, and somewhere over by the windmill would mention an idea for a “minor” deal. Those are the kinds of moves that a World Series team is built on. Sticking to “always trust the projection” probably does keep you from over-reacting to (and over-paying for) a two-month hot stretch, and maybe if it’s one of your own guys, you can sell high on him, but even then, you’re only getting half the benefit that you could.

In fairness to Messrs. Lichtman and Cameron (Hi guys!), I doubt either one would significantly disagree with my general point, and likely they'd be all for a method that could better detect a real breakout when it's happening (or about to happen). They’d likely agree that in a perfect world, we’d have a perfect test, but since we don’t live in a perfect world or have a perfect test, it’s better to pick the option that makes you wrong the least often. That’s perfectly sound thinking from a statistical point of view, until you look at it from the point of view of a team or anyone else who needs to be able to pick out the real breakout. Anyone can adopt "trust the projections." There’s no strategic value in it at all. Tell me when I should disregard even my own model!

Again, to be fair, MGL's projection system (and others) allows some types of new information to rewrite the projection mid-season. (For example, he specifically mentioned a pitcher who is clearly losing velocity, which would be factored into the projection.) But there's another problem: what happens when there's information that the model doesn't account for? Sure, a good model tries to take everything into account, but what happens when a scouting report comes back that says, "No really, he really has changed his whole approach and it's working for him"? We can't privilege all information like that.

Your cousin's girlfriend's brother's boss who has been a Rockies fan for 40 years (yeah, I know) isn't a reliable source of information on Charlie Blackmon. And yes, ideally, a more complete model might find a way to incorporate that sort of feedback to make the model better, but we're kidding ourselves if we think our models are that complete at this point. The problem with "trust the projection" is that you're leaving out any information that isn't fueling that projection, but that might still be important. The fact that projection systems miss on a lot of breakout guys is evidence that we're leaving out some critical information.

We should aspire to greater things than that, even though that aspiration is a mighty tall order. We have good ways of measuring what a player did on the field, and some nifty one-number catch-all stats, but very little in the way of measuring some of the more basic component skills. How good is Smith's pitch recognition? What does it mean that someone finally explained to him, in a way that he could understand, how not to chase breaking stuff low and away? How does that affect all the other variables? What does it mean that not only is his wrist actually healthy, but that he actually trusts it now? How does all that interact with the rest of his skill set? It's a harder question and a more humbling one. It's going to be messy to figure out, with fits and starts and failures and maybe some long pauses between breakthroughs. But that sort of mistrust of even the most sophisticated model is the difference between saying something that's correct and reaching the point of saying something that's useful.