I’d like to talk to you today about pitcher evaluation.

I don’t mean evaluation in the sense of determining a pitcher’s talent level, his future value or even his market value. I mean a pitcher’s past value. Or, perhaps, because value is so often misunderstood and misinterpreted, we’d be better off speaking in terms of contribution. That is, how do we determine the extent to which a player contributed to his team’s success (or failure)?

Of course, this goal of evaluating a pitcher’s past contribution isn’t new, and it hasn’t gone unaddressed. In fact, the entire purpose of WAR is to quantify a player’s value to his team. Whether with FIP or RA/9 or something in between, the various versions of WAR attempt to determine the extent of a pitcher’s contribution to the team.

But WAR doesn’t consider context. That is, it looks at a pitcher’s season as a whole, whether through the lens of FIP or something else, without taking into account the order and the relative importance of the events that it takes as input.

One simple example illustrates this point. Consider the following two hypothetical starting pitchers — Pitcher A and Pitcher B — and two of their hypothetical starts:

Pitcher     Start 1      Start 2      Total
Pitcher A   3 IP, 8 R    9 IP, 0 R    12 IP, 6.00 RA/9, 6.00 FIP
Pitcher B   6 IP, 4 R    6 IP, 4 R    12 IP, 6.00 RA/9, 6.00 FIP

(I didn’t include the FIP inputs, but assume the FIP is proportional to the RA/9 for each start)
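The run-prevention arithmetic behind the table can be checked with a few lines of Python. This is only a sketch: the helper name `ra9` and the data layout are my own, not part of any published tool.

```python
def ra9(innings: float, runs: int) -> float:
    """Runs allowed per nine innings."""
    return 9 * runs / innings

# (innings pitched, runs allowed) for each hypothetical start
starts = {
    "Pitcher A": [(3, 8), (9, 0)],
    "Pitcher B": [(6, 4), (6, 4)],
}

for name, games in starts.items():
    total_ip = sum(ip for ip, _ in games)
    total_runs = sum(runs for _, runs in games)
    per_start = [round(ra9(ip, runs), 2) for ip, runs in games]
    # Per-start RA/9 differs wildly for A, but the combined line is identical
    print(name, per_start, round(ra9(total_ip, total_runs), 2))
```

Both pitchers come out at 6.00 overall, even though Pitcher A's two starts sit at opposite extremes.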

Pitcher A and Pitcher B were exactly the same with regard to their FIP-based peripherals and their run prevention. But they arrived at the total in very different ways. Pitcher A contributed negative value to his team in his first start, giving the team very little chance to win. In the second start, he almost guaranteed his team would win. Pitcher B, on the other hand, pitched poorly in both starts. He gave his team some chance to win each time, but it wasn’t much.

We don’t have to guess which pitcher helped his team more. To evaluate each pitcher’s importance to his team while keeping each game in context, we have to shift perspectives. Instead of looking at each inning, as with RA/9, or at each plate appearance, as with FIP, we can look at each start and determine the probability of a team’s win, given the quality of the start.

FanGraphs already has one metric that evaluates players in a similar way: Win Probability Added, or WPA. WPA, though, looks at the entire game context. It includes the runs the pitcher’s team scores, in order to determine the extent to which the pitcher increased his team’s odds of winning. Obviously, the pitcher has (almost) no control over how much run support he gets; instead, we want to determine the probability that a team will win, given only the quality of the pitcher’s start.

Although the quality of a start can be determined in a number of ways, only three variables affect the team’s chances of winning: innings pitched, runs allowed and the base-out state if the pitcher leaves in the middle of an inning. The third factor can be addressed by looking at the run expectancy for the inning when the pitcher leaves, but for simplicity’s sake, let’s just consider the first two.

Results

Here, we’ll look at how often the starting pitcher’s team won, given the innings pitched and runs allowed. I pulled the data from 1993 onward (roughly when the decline in relief innings per appearance began to stabilize) to get as large a sample as possible while still representing modern reliever usage. Below is a graph representing the results:

Note that for readability and simplicity, the graph only shows starts with a whole number of innings pitched, and it’s limited to three or more innings pitched and seven or fewer runs allowed. Obviously, there are many more possibilities than just the ones above. The graph is simply intended to be a helpful visual reference for understanding a start’s value based on run prevention.

The results aren’t surprising. A nine-inning shutout almost guarantees a win for the pitcher’s team. A seven-run, three-inning start does not. Interestingly, the minimum required for a quality start (six innings and three runs) gives a team almost exactly a 50-50 shot at winning. A nine-inning, four-run start is significantly better than said minimum, though it’s not considered a quality start.

With this data, including all starts, we can evaluate starting pitching using a different approach. For each start, we take the pitcher’s innings pitched and runs allowed and then pull the corresponding winning percentage for all equivalent starts from the data above. We can either take that number as it is to mimic pitcher winning percentage, compare it to an average pitcher by subtracting .500, or compare it to a replacement-level pitcher by subtracting .420.
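As a rough illustration of that lookup-and-sum procedure, here is a minimal sketch. The function names, the tiny `history` sample and the three-start pitcher are all hypothetical; real input would be something like Retrosheet game logs. Innings are stored as outs recorded so that partial innings make clean dictionary keys.

```python
from collections import defaultdict

def build_win_table(history):
    """Map (outs_recorded, runs_allowed) to the share of such starts the team won."""
    tally = defaultdict(lambda: [0, 0])  # key -> [team wins, starts]
    for outs, runs, team_won in history:
        tally[(outs, runs)][0] += int(team_won)
        tally[(outs, runs)][1] += 1
    return {key: wins / n for key, (wins, n) in tally.items()}

def evaluate(starts, win_table, replacement_level, average_level=0.5):
    """Return (expected wins, wins above average, wins above replacement)."""
    wins = sum(win_table[start] for start in starts)
    n = len(starts)
    return wins, wins - average_level * n, wins - replacement_level * n

# Hypothetical historical log: (outs recorded, runs allowed, did the team win?)
history = [(27, 0, True), (27, 0, True), (18, 3, True), (18, 3, False), (9, 7, False)]
win_table = build_win_table(history)

# Hypothetical pitcher with three starts. The replacement baseline is passed
# explicitly; 0.42 matches the Cain and Weaver tables later in the piece.
pitcher_starts = [(27, 0), (18, 3), (9, 7)]
wins, above_avg, above_repl = evaluate(pitcher_starts, win_table, replacement_level=0.42)
print(round(wins, 2), round(above_avg, 2), round(above_repl, 2))
```

A real table would need a fallback for innings/runs combinations that never occurred in the sample; this sketch simply raises a `KeyError` in that case.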

Testing the theory: Jered Weaver and Matt Cain

A theory like the one presented above has little use or significance until we know how it actually affects player evaluation. In researching who I could compare to test the theory, I came across a fantastic case from the 2012 season: Matt Cain and Jered Weaver.

In 2012, Cain gave up 73 runs in 219.1 innings, which was good for a 3.00 RA/9. Cain never gave up more than five runs in a start, and he never pitched fewer than five innings in a start. However, he only had four starts where he didn’t allow a run to score, and seven others where he held the opponents to just one run.

Weaver was almost exactly the same as Cain as far as run prevention is concerned. He allowed 63 runs in 188.2 innings, which translated to a 3.01 RA/9. But unlike Cain, Weaver was inconsistent. He had one start in which he gave up eight runs in 3.1 innings, one in which he gave up nine runs in three innings and two other starts in which he failed to pitch past the first inning. On the other hand, he had nine starts in which he allowed no runs, and five more in which he only allowed a single run. In terms of starts with zero or one run allowed, Weaver led Cain 14 to 11.
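The two RA/9 figures are easy to reproduce, with one wrinkle: in box-score notation, ".1" and ".2" mean one-third and two-thirds of an inning, not decimals. The sketch below converts innings to outs first; the helper names are my own.

```python
def innings_to_outs(ip_text: str) -> int:
    """Convert box-score innings such as '219.1' (219 and 1/3) to outs recorded."""
    whole, _, thirds = ip_text.partition(".")
    return int(whole) * 3 + int(thirds or 0)

def ra9_from_box(ip_text: str, runs: int) -> float:
    """Runs allowed per nine innings (27 outs)."""
    return 27 * runs / innings_to_outs(ip_text)

# 2012 totals quoted in the text
cain = ra9_from_box("219.1", 73)
weaver = ra9_from_box("188.2", 63)
print(round(cain, 2), round(weaver, 2))
```

With those inputs, Cain comes out just under 3.00 and Weaver just over it, matching the rounded figures above.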

If we’re evaluating the two pitchers based solely on run prevention (ignoring park factors), they were nearly identical on a per-inning basis. But the distribution of their starts’ quality was very different. Let’s see what the above approach says about these two pitchers.

Pitcher        Wins    Above Average   Above Replacement
Matt Cain      20.12   4.12            6.68
Jered Weaver   19.36   4.36            6.76

And the same numbers on a per-start basis (important since Cain had 32 starts to Weaver’s 30):

Pitcher        Wins    Above Average   Above Replacement
Matt Cain      0.629   0.129           0.209
Jered Weaver   0.645   0.145           0.225
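For anyone who wants to check the per-start numbers, they are just the season totals divided by starts. The same arithmetic also backs out the per-start baselines that the totals imply. The dictionary layout here is only my own bookkeeping; the values are the ones tabulated above.

```python
season = {
    "Matt Cain": {"starts": 32, "wins": 20.12, "above_avg": 4.12, "above_repl": 6.68},
    "Jered Weaver": {"starts": 30, "wins": 19.36, "above_avg": 4.36, "above_repl": 6.76},
}

per_start = {}
implied = {}
for name, s in season.items():
    n = s["starts"]
    # Season totals divided by starts give the per-start rates
    per_start[name] = {k: v / n for k, v in s.items() if k != "starts"}
    # Backing out the baselines: (wins - wins above baseline) / starts
    implied[name] = {
        "average": (s["wins"] - s["above_avg"]) / n,
        "replacement": (s["wins"] - s["above_repl"]) / n,
    }
    print(name, per_start[name], implied[name])
```

Both pitchers' totals are consistent with an average baseline of .500 and a replacement baseline of about .420 per start.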

Well, that’s interesting. One would think that consistency — of which Cain was an excellent example — would be a desirable trait for a pitcher. However, using this WPA-based, support-neutral, per-start approach, Weaver — who was almost identical on a run-prevention basis — is given more credit than Cain. In fact, this approach gives him more wins above replacement than Cain, despite pitching two fewer starts.

This difference, while interesting, isn’t big. Over a season, this approach to evaluating pitching only made a difference of about a quarter of a win. The conclusion (that consistency actually hurt the Giants’ chances of winning, all else equal) is somewhat surprising, though.

If we look at all pitchers in 2012 with this approach and compare their rankings under this method to their rankings by RA/9, the largest differences confirm the above case study. Pitchers like CC Sabathia and Cliff Lee, both of whom are very consistent, were hurt most by this method; pitchers like A.J. Burnett and Johan Santana, who weren’t consistent, were helped.

Obviously, the best pitcher would be one who is perfectly consistent and amazing in every start, but this metric suggests that teams would win slightly more often if their pitchers skewed towards the extremes, rather than being reliably okay.

Future Considerations

I’ll be the first to admit there are some issues with the methodology above. I didn’t adjust for the run environment or the park, and the sample size for a few IP/RA combinations was likely not large enough to be confident in the win percentage’s accuracy. That being said, my goal here wasn’t to create a perfect metric, but to introduce a different, hypothetically better, method of evaluating a starting pitcher’s past value.

There are a number of paths on which I, or others, can take this approach in the future. The most obvious is to adjust the win probabilities for park and league, so as to more accurately represent each start’s true value.

Secondly, we can employ this approach using more sophisticated methods of pitcher evaluation. Using straight run prevention is the natural first step, but one could also use FIP, RE24 or even something like Game Score in a similar way. Once one determines the probability that an average team wins given the “score” of the start, the rest is essentially just addition.
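To make the Game Score variant concrete, here is one way it might look. `game_score` follows Bill James’s original formula; the bucketing helpers and the 10-point bin width are my own assumptions, chosen only so that each bin would hold a reasonable number of starts.

```python
def game_score(outs, hits, earned_runs, unearned_runs, walks, strikeouts):
    """Bill James's original Game Score for a single start."""
    innings_completed = outs // 3
    return (50 + outs + 2 * max(0, innings_completed - 4)
            + strikeouts - 2 * hits - 4 * earned_runs
            - 2 * unearned_runs - walks)

def bucket(score, width=10):
    """Floor a score to the bottom of its bin, e.g. 65 -> 60."""
    return score // width * width

def win_rate_by_bucket(history, width=10):
    """history: iterable of (game score, did the team win?) pairs."""
    totals = {}
    for score, team_won in history:
        key = bucket(score, width)
        wins, n = totals.get(key, (0, 0))
        totals[key] = (wins + int(team_won), n + 1)
    return {key: wins / n for key, (wins, n) in totals.items()}

# Hypothetical sample of (game score, team won?) pairs
sample = [(65, True), (62, False), (31, False)]
rates = win_rate_by_bucket(sample)
```

From there, a pitcher’s season is scored the same way as before: look up each start’s bin, sum the win rates and subtract the chosen baseline.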

The title of this piece refers to starting pitcher consistency, and that is certainly the practical implication of the approach that I offered above. However, the other purpose of this approach is philosophical. With regard to evaluating past performance, the most important unit of measurement for the starting pitcher, the fundamental one, is the start. The goal is to win, and the way to win is to pitch as well as possible in each start. It is only natural, therefore, to evaluate pitchers not in terms of their effectiveness in each plate appearance, or their effectiveness in each inning, but their effectiveness in each start. There are many ways to do so, and the above approach is almost certainly not the best. But approaching the issue of pitcher evaluation from this standpoint adds a new, and I believe better, perspective, one that I, or others, will hopefully expand on and improve in the future.

Much of the above data is courtesy of Retrosheet. It should also be noted that a similar idea to the one above was presented at Baseball Prospectus nine years ago, but none of my ideas or writing were taken from that piece.