It’s that time of year when, thanks to the euphoria of Spring Training games being underway and Opening Day only a few weeks away, analysts and fans across the country suspend all logic about the laws of time and space and make wild and crazy predictions about how the upcoming season will turn out. Never mind that baseball is virtually impossible to predict and that an injury in April could completely throw off an entire team’s or even a division’s projections; we still find ourselves trying to read the tarot leaves, casting lots and writing out our predictions for the season.

We decided to do things slightly differently at BAR. Don’t get us wrong, we still make our own projections that will probably be woefully inaccurate. In addition, however, we’ve taken a somewhat more objective approach to projecting the season. Instead of relying on our own skills (or lack thereof) at predicting the future, we decided to use player projection data to derive a projected W/L record for each team. We’ll go into the methods behind our madness down below, but first, let us present our first annual MLB season projections.

You can view the spreadsheet in Google Docs here.

Behind the Scenes

We knew we wanted to include predictions on our blog, but didn’t feel that we had any great fortune-telling ability of our own that would give the baseball world more value than the hundreds of predictions already floating around the web. Instead, we wanted to see what would happen if we took a player-centric approach to projections and combined projected player statistics with a little bit (read: “a lot”) of math.

The first step was deciding on the player projection data source. Fangraphs provides a list of several different projection sources for the upcoming year. We settled on The Bat as our source, and pulled in the projected data for the top 1,000+ pitchers and batters. While there are several different statistics we could have used, we decided for the first iteration to keep things simple. We used ERA of the pitchers to calculate runs allowed, and then the On-Base percentage and Slugging percentage of the batters to calculate runs generated (more on that below).

We then took our projected 25-man rosters for each team and estimated the number of innings for each pitcher and plate appearances for each batter. Multiplying each player’s stats across his innings pitched or plate appearances gave us Runs Allowed and Runs Scored. Finally, we added 53 runs to each team’s Runs Allowed total, because each team gave up an average of 53 unearned runs last year, and those are not captured in the pitchers’ cumulative ERAs. (In future years we may adjust this based on a team’s projected defensive ability, but we decided to stick with a straight average for the first year.)
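For the curious, the Runs Allowed arithmetic can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet itself: the pitchers and innings below are made up, and the function names are our own invention for this example.

```python
# Hypothetical staff: (projected ERA, estimated innings pitched).
# These numbers are invented for illustration, not taken from the spreadsheet.
staff = [
    (3.20, 200.0),
    (4.05, 180.0),
    (4.50, 160.0),
]

UNEARNED_RUNS_PER_TEAM = 53  # league-average figure quoted in the post


def earned_runs(era, innings):
    """ERA is earned runs per nine innings, so scale it by innings / 9."""
    return era * innings / 9.0


def team_runs_allowed(pitchers, unearned=UNEARNED_RUNS_PER_TEAM):
    """Sum each pitcher's projected earned runs, then add the flat unearned-run bump."""
    return sum(earned_runs(era, ip) for era, ip in pitchers) + unearned


if __name__ == "__main__":
    print(round(team_runs_allowed(staff), 1))
```

A real staff would cover a full season of innings; the three pitchers here are only enough to show the mechanics.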

From there, we summoned the great Bill James’s Pythagorean expectation, which takes a team’s Runs Scored (RS) and Runs Allowed (RA) and outputs a projected W/L percentage (W/L% = 1 / (1 + (RA / RS)^2)). We then rinsed and repeated across all the teams and examined the results.
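As a rough sketch, the Pythagorean expectation in its standard form, 1 / (1 + (RA / RS)^2), looks like this in Python. The function names and the rounding to whole wins over a 162-game season are our assumptions for the example; the run totals in the usage lines are hypothetical.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Bill James's Pythagorean expectation: 1 / (1 + (RA / RS) ** exponent)."""
    return 1.0 / (1.0 + (runs_allowed / runs_scored) ** exponent)


def projected_wins(runs_scored, runs_allowed, games=162):
    """Convert the expected winning percentage into whole wins (rounding is an assumption)."""
    return round(games * pythagorean_win_pct(runs_scored, runs_allowed))


if __name__ == "__main__":
    # Hypothetical team: 800 runs scored, 700 runs allowed.
    print(pythagorean_win_pct(800, 700))
    print(projected_wins(800, 700))
```

A team that scores exactly as many runs as it allows comes out at .500, which is a quick sanity check on the formula.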

Looking at the raw projections, two things jumped out at us:

We were 38 wins short. Since 2430 games are played and baseball games don’t end in ties, there need to be 2430 wins spread across the projections. Given that our W/L totals were generated simply from player statistics and the formula above, we felt that coming within about 1.6% of the total number of wins meant we weren’t too crazy (mathematically speaking; don’t worry, we are well aware of the insanity of this project).

The NL records were skewed slightly downward. We could see this both when comparing our projections to other sites and when examining the projected interleague record. Taking out the 1065 NL intraleague games, we determined that the projected interleague record was 199-101 in favor of the AL. Even though the AL has won the majority of the interleague match-ups over the past several years, its per-season record is usually closer to 160-140. Our best guess for this skew is the presence of the pitcher slot in the batting order, taking up ~1/10 of the batting slots. We guesstimated that this decrease in team batting statistics may not have been properly offset by an equal lowering of the NL pitchers’ ERAs, or that the offset was somehow obscured by our model.
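The interleague bookkeeping can be sketched as follows. Every NL-versus-NL game hands the league exactly one win and one loss, so subtracting the 1065 intraleague games from the NL’s total wins and total losses leaves the interleague record. The NL totals used below (1166 wins, 1264 losses) are simply the figures implied by the 199-101 record quoted above, and the function names are made up for this example.

```python
TEAMS_PER_LEAGUE = 15
GAMES_PER_TEAM = 162
INTERLEAGUE_GAMES = 300  # implied by the 199-101 record (199 + 101)


def total_games(teams=2 * TEAMS_PER_LEAGUE, games_per_team=GAMES_PER_TEAM):
    """Each game involves two teams, so divide total team-games by two."""
    return teams * games_per_team // 2


def intraleague_games(teams=TEAMS_PER_LEAGUE, games_per_team=GAMES_PER_TEAM,
                      interleague=INTERLEAGUE_GAMES):
    """Games played between two teams of the same league."""
    return (teams * games_per_team - interleague) // 2


def interleague_record(league_wins, league_losses, intraleague):
    """Strip out intraleague games, which add one win and one loss each."""
    return league_wins - intraleague, league_losses - intraleague


if __name__ == "__main__":
    nl_intra = intraleague_games()
    # NL totals implied by the post's numbers, not read from the spreadsheet.
    print(total_games(), nl_intra, interleague_record(1166, 1264, nl_intra))
```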

We determined that we could solve both problems by applying an adjustment to the generated NL records. Since we needed to come up with 38 wins, we added two wins to all 15 NL teams, and then a third win to the top 8 NL teams. This resulted in an equal number of wins and losses, as well as an interleague record of 169-131, much more in line with the past several years.
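Here is a minimal sketch of that adjustment. It assumes “top 8” means the eight NL teams with the most projected wins before the adjustment, which is our reading rather than something the spreadsheet spells out; the team labels and win totals are placeholders.

```python
def adjust_nl_wins(nl_wins, base_bump=2, extra_bump=1, top_n=8):
    """Add base_bump wins to every team and extra_bump more to the top_n teams.

    nl_wins maps team name -> projected wins before adjustment.
    Ranking by pre-adjustment wins is an assumption made for this sketch.
    """
    ranked = sorted(nl_wins, key=nl_wins.get, reverse=True)
    top = set(ranked[:top_n])
    return {
        team: wins + base_bump + (extra_bump if team in top else 0)
        for team, wins in nl_wins.items()
    }


if __name__ == "__main__":
    # Placeholder teams and totals, purely for illustration.
    raw = {f"NL{i:02d}": 70 + i for i in range(15)}
    adjusted = adjust_nl_wins(raw)
    print(sum(adjusted.values()) - sum(raw.values()))  # wins added across the league
```

With 15 teams, two wins apiece plus one extra for eight teams adds 30 + 8 = 38 wins, matching the shortfall.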

Calculating Runs Scored

While determining Runs Allowed from ERA (the number of earned runs a pitcher gives up per nine innings) is a fairly obvious, if overly simplistic, approach, there is no such statistic for batters. To help us out, we introduced some Artificial Intelligence (really just machine learning). I downloaded team statistics from each season going back to 1961, when the seasons were expanded to 162 games. I then threw out the three strike-shortened seasons to prevent them from skewing the data, and ran a linear regression against the data to determine the relationship between a team’s cumulative On-Base and Slugging percentages and its total number of runs scored (in case you’re curious, the formula was RS = -791.45 + 2761.65 * OBP + 1518.21 * SLG).
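To make the formula concrete, here is a small sketch that applies the regression coefficients quoted above. The plate-appearance-weighted average used to roll player projections up into a team OBP and SLG is a simplification we are assuming for the example (SLG is technically per at bat, not per plate appearance), and the lineup values are hypothetical.

```python
# Coefficients from the regression quoted in the post.
INTERCEPT = -791.45
OBP_COEF = 2761.65
SLG_COEF = 1518.21


def projected_runs_scored(team_obp, team_slg):
    """RS = -791.45 + 2761.65 * OBP + 1518.21 * SLG."""
    return INTERCEPT + OBP_COEF * team_obp + SLG_COEF * team_slg


def pa_weighted_average(values_and_pa):
    """Weight each player's rate stat by plate appearances (a simplifying assumption)."""
    total_pa = sum(pa for _, pa in values_and_pa)
    return sum(value * pa for value, pa in values_and_pa) / total_pa


if __name__ == "__main__":
    # Hypothetical batters: (OBP, SLG, projected plate appearances).
    lineup = [(0.350, 0.480, 650), (0.310, 0.400, 550), (0.300, 0.370, 400)]
    team_obp = pa_weighted_average([(obp, pa) for obp, _, pa in lineup])
    team_slg = pa_weighted_average([(slg, pa) for _, slg, pa in lineup])
    print(round(projected_runs_scored(team_obp, team_slg)))
```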

This equation is somewhat similar to the Runs Created statistic devised by Bill James. While obviously imperfect, when run against the data for the last 5 years, it came within 0.04% of the cumulative number of runs scored. We felt this was as good a metric as any for calculating runs scored, especially given that the foundation of our statistics is simply another man’s guesses, educated though they may be.

Conclusion

Before we leave, we feel like there are a few things we need to make clear.

Do we feel like this is somehow going to magically result in wildly accurate projections? Absolutely not. Baseball, probably more than any other sport, is subject to an incredible amount of variance across an insane number of factors. Players will over- and under-perform, and injuries will remove some of them for large parts of the season. As a result, our goal is less about predicting the future and more about determining the why behind a team under- or over-performing.

Unlike most predictions that only display the final W/L total, you can see every single at bat that makes up our final W/L total. Because of this, we can trace a team’s sudden rise or fall back to the over- or under-performance of its pitchers or batters, and even down to specific players. As we go through the season, we will combine our projections with actual stats and post updated projections. In addition, we’ll try to point out any specific players or areas of a team that are exceeding expectations and the impact that is having.

We recognize that we are not experts in math or fortune telling and there are probably a number of flaws in our system. We hope to surface those throughout the year and use them to improve our system for next year.

Finally, if you hate our projections, please don’t get mad at us. Blame math. If you think your team is underrepresented, feel free to look at your team’s tab in the spreadsheet, where you can see the player projection data yourself. We tried to limit the areas where we had to inject our own subjectivity into the process, outside of projecting playing time. We don’t necessarily even agree with our projections, but we think they are close enough to other projections to be useful. At the end of the year we can all have a good laugh at how far off we were.

We had a ton of fun putting these projections together and are excited to see how they fare against the unforgiving reality of the baseball season. Hope you enjoy!