In graduate school, I've had one professor for a few classes, and he always makes one particular statement which I will paraphrase: "I'm a statistician. Statisticians are concerned about variability. So, I want to know about the variability of an estimate."

In baseball statistics, this is particularly true in the area of projections. Most of the time, we'll see just one projection of what we think a player or team will do in the following year. However, being a statistician, I follow my professor's dictum here: I want to know the variability of these projections. What could we plausibly see from players and teams?

My piece on potential Manny Machado seasons was my first application of this thought to baseball data. Now, let's look at teams through this prism of variability. Specifically, let's look at expected wins.

Expected Wins Prerequisites

To begin with, for any expected wins-type estimator, we need an estimate of runs scored and runs allowed per game. To obtain one, I used an average of Baseball Prospectus' estimates and my own for runs scored and allowed. However, you can insert your own estimates (it'll increase your appreciation of how slight increases or decreases in runs scored/allowed can change records).

Now, for estimated wins. We will wind up creating 10,000 seasons' worth of data and seeing how many games a team wins based on its runs. However, in order to do this, we need an approximate distribution of both runs scored and runs allowed. In previous proofs of Bill James's Pythagorean expectation, a continuous Weibull distribution has been used. Despite this, my preference is to have a discrete distribution for both runs scored and allowed. This is really for the sake of interpretation, as you cannot score partial runs in a game. So, we need a distribution that is discrete and has countably infinite support (to allow for the remote possibility that a team scores a very, very large number of runs). Essentially, we need a counting distribution.

A Mixture Distribution For Runs

Now the distribution that most would run off to in this case is the Poisson distribution. However, this tends to place too much probability on high values and not enough on low values. With that in mind, I prefer to work with a mixture distribution of a Poisson and a discrete uniform, specifically

f(R) = π * DiscUnif{0,⌊λ⌋} + (1-π) * Pois(λ)

where λ is the average runs scored or allowed per game, the floor function ⌊λ⌋ is the largest integer less than or equal to λ (for example, ⌊4.28⌋ = 4), and π is the probability that an observation comes from the discrete uniform distribution.
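A minimal Python sketch of this mixture, using only the standard library, may make the definition concrete. The function names (`mixture_pmf`, `sample_poisson`, `sample_runs`) and the values λ = 4.28 and π = 0.25 are illustrative assumptions, not the code behind this article.

```python
import math
import random


def mixture_pmf(r, lam, pi):
    """P(R = r) for pi * DiscUnif{0..floor(lam)} + (1 - pi) * Pois(lam), r >= 0."""
    top = math.floor(lam)
    uniform = 1.0 / (top + 1) if r <= top else 0.0
    poisson = math.exp(-lam) * lam ** r / math.factorial(r)
    return pi * uniform + (1.0 - pi) * poisson


def sample_poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_runs(lam, pi, rng):
    """Draw one game's runs: uniform part with probability pi, else Poisson."""
    if rng.random() < pi:
        return rng.randint(0, math.floor(lam))  # randint includes both endpoints
    return sample_poisson(lam, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
print([sample_runs(4.28, 0.25, rng) for _ in range(10)])
# The probabilities should sum to (very nearly) 1 over a wide enough range.
print(sum(mixture_pmf(r, 4.28, 0.25) for r in range(60)))
```

One thing the sketch makes easy to check: the mean of the mixture is π⌊λ⌋/2 + (1 − π)λ, which sits a little below λ whenever π > 0.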

This will allow for a more realistic distribution of runs. I like it better than a mixture of Poisson distributions because we are able to tie this mixture directly to the mean runs scored/allowed. It is important to note that while λ changes for every team, π will be fixed across the board. This is because prior to the season we have no real idea what π would be, so we will use an estimate that comes from the previous year's data. Also, note that mean(λ_RS) = mean(λ_RA) across the league, since every run one team scores is a run another team allows. So if you create your own estimates of mean runs scored/allowed, you need to have this constraint built in. Otherwise, you may have strange results in your simulations.
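If you bring your own estimates, one simple way to enforce that constraint is to rescale both sets of λ values so their league averages meet in the middle. The sketch below assumes that approach; the function name `balance_lambdas` and the three-team league are hypothetical, and this is not necessarily how the estimates in this article were balanced.

```python
def balance_lambdas(lam_rs, lam_ra):
    """Rescale per-team run rates so league-average RS equals league-average RA.

    lam_rs and lam_ra map team name -> runs per game. Both averages are moved
    to their midpoint (an assumed choice; other adjustments are possible).
    """
    mean_rs = sum(lam_rs.values()) / len(lam_rs)
    mean_ra = sum(lam_ra.values()) / len(lam_ra)
    target = (mean_rs + mean_ra) / 2.0
    balanced_rs = {team: lam * target / mean_rs for team, lam in lam_rs.items()}
    balanced_ra = {team: lam * target / mean_ra for team, lam in lam_ra.items()}
    return balanced_rs, balanced_ra


# Hypothetical three-team league, not real projections.
rs = {"A": 4.6, "B": 4.3, "C": 4.0}
ra = {"A": 4.1, "B": 4.4, "C": 4.7}
balanced_rs, balanced_ra = balance_lambdas(rs, ra)
print(balanced_rs)
print(balanced_ra)
```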

Using the previous year's data, we can obtain the maximum likelihood estimate of π (the value of π under which our observed data are most likely). This has to be done numerically, here with the Newton-Raphson algorithm, because a closed-form solution would be very difficult, if not impossible, to calculate. The 2012 data from all teams gave an estimate of approximately 0.25 for π, which was close to the visual estimate of 0.3 that I had been using previously (from 2011 data). So, our mixture distribution turned out to be

f(R) = 0.25 * DiscUnif{0,⌊λ⌋} + 0.75 * Pois(λ)
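To show what the Newton-Raphson step looks like for π, here is a sketch under stated assumptions: each game's λ is treated as known, the log-likelihood is the sum of log(π·u + (1 − π)·p) over games (with u the uniform probability and p the Poisson probability of the observed runs), and the data are synthetic draws with a true π of 0.25 rather than the 2012 game logs. All function names and λ values are hypothetical.

```python
import math
import random


def uniform_pmf(r, lam):
    top = math.floor(lam)
    return 1.0 / (top + 1) if r <= top else 0.0


def poisson_pmf(r, lam):
    return math.exp(-lam) * lam ** r / math.factorial(r)


def sample_runs(lam, pi, rng):
    """Draw from the mixture (Poisson part uses Knuth's multiplication method)."""
    if rng.random() < pi:
        return rng.randint(0, math.floor(lam))
    limit, count, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def fit_pi(games, start=0.3, tol=1e-10, max_iter=100):
    """Newton-Raphson estimate of pi; games is a list of (runs, lam) pairs."""
    parts = [(uniform_pmf(r, lam), poisson_pmf(r, lam)) for r, lam in games]
    pi = start
    for _ in range(max_iter):
        score = 0.0      # first derivative of the log-likelihood in pi
        curvature = 0.0  # second derivative (never positive)
        for u, p in parts:
            ratio = (u - p) / (pi * u + (1.0 - pi) * p)
            score += ratio
            curvature -= ratio * ratio
        step = score / curvature
        pi = min(max(pi - step, 1e-6), 1.0 - 1e-6)  # keep pi inside (0, 1)
        if abs(step) < tol:
            break
    return pi


rng = random.Random(2012)
lams = [3.8, 4.3, 4.9]  # hypothetical per-game run rates, not real team values
games = [(sample_runs(lam, 0.25, rng), lam) for lam in lams for _ in range(162 * 20)]
pi_hat = fit_pi(games)
print(round(pi_hat, 3))  # should land near the true value of 0.25 used above
```

Because the log-likelihood is concave in π, the iteration usually settles within a handful of steps from a reasonable starting value such as the visual estimate of 0.3.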

What We Could See in 2013

Once we have a distribution for each team, creating a season is reasonably easy. Sample 162 times from the runs scored and allowed distributions, compare, and assign a win or loss to the result. I ensured that there were no ties in the sample when doing this. Then do that 10,000 times, creating 10,000 plausible seasons. From this we can get our plausible ranges of wins.
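The procedure above can be sketched in a few lines of Python. Several details here are assumptions: ties are broken by redrawing the game (the article does not say how ties were avoided), the λ values of 4.5 and 4.2 are made up, and the demo runs 2,000 seasons instead of 10,000 only to keep it quick.

```python
import math
import random
import statistics


def sample_runs(lam, pi, rng):
    """Draw from pi * DiscUnif{0..floor(lam)} + (1 - pi) * Pois(lam)."""
    if rng.random() < pi:
        return rng.randint(0, math.floor(lam))
    limit, count, product = math.exp(-lam), 0, rng.random()
    while product > limit:  # Knuth's multiplication method for the Poisson part
        count += 1
        product *= rng.random()
    return count


def simulate_season(lam_rs, lam_ra, pi, rng, games=162):
    """Return the number of wins in one simulated season."""
    wins = 0
    for _ in range(games):
        scored = sample_runs(lam_rs, pi, rng)
        allowed = sample_runs(lam_ra, pi, rng)
        while scored == allowed:  # assumption: a tied game is simply redrawn
            scored = sample_runs(lam_rs, pi, rng)
            allowed = sample_runs(lam_ra, pi, rng)
        if scored > allowed:
            wins += 1
    return wins


def simulate_wins(lam_rs, lam_ra, pi=0.25, seasons=10_000, seed=2013):
    """Return a sorted list of win totals, one per simulated season."""
    rng = random.Random(seed)
    return sorted(simulate_season(lam_rs, lam_ra, pi, rng) for _ in range(seasons))


wins = simulate_wins(4.5, 4.2, seasons=2_000)  # hypothetical team, fewer seasons for speed
low = wins[int(0.025 * len(wins))]   # roughly the 2.5th percentile
high = wins[int(0.975 * len(wins))]  # roughly the 97.5th percentile
print(statistics.median(wins), low, high)
```

The median and the two percentiles give a plausible range of wins for one team; repeating the call with each team's own λ values would fill in a whole league.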

So, what can your team expect to do in 2013 based on its run distributions? What variability is expected? The table below has the teams ranked in their division by median record. Also, the wins are created as if everyone had played a balanced schedule. The only way that schedule difficulty enters this analysis is in the estimate of λ for runs scored and runs allowed.

So, there is the potential for good and bad history here. It's a slim chance at best, but there's still a chance. Every team has a chance to finish above .500, and every team has a chance to finish below .500. Every team has hope for something. Even Cubs and Mets fans can have hope, even if it is only about a 3% or 2% hope.

Side Note: I messed around with the λ values, and it was a fun exercise. Just changing a λ by 0.1, or essentially one extra run every 10 games, can raise a team's win total by 3-4 wins.