Sabermetrics has evolved considerably over the past few decades, and it continues to benefit from the current AI obsession and the spread of cheap computing power. Yet a few long-standing statistical models and algorithms remain the baseline approach for sports analytics (perhaps the Lindy effect at play). It seems the more we [re]visit those models, the more we learn about the fundamentals of the game, and in an age of oversold skills we should be keener on fundamentals.

One such model that we are focusing on at Torneo these days (together with our evaluation of Elo ratings) is the Markov chain: a stochastic model for a sequence of events in which the probability of each outcome depends only on the state the system was in just before it.

Standard example of Markov Chains. Source: Wikipedia

The idea is simple: the probability of the inning moving to a specific state is a function of the current state only, independent of how the teams got there. If there are two outs and two runners in scoring position, the preceding events and actions are of no importance (this is why Markov chains are said to be memoryless).
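To make the memoryless idea concrete, here is a minimal sketch of a toy two-state chain. The state names and probabilities are purely illustrative assumptions (they are not estimated from any baseball data), and `next_state` and `simulate` are hypothetical helper names. The point to notice is that `next_state` receives only the current state, never the path that led to it.

```python
import random

# Illustrative probabilities only (assumed for this sketch, not real data).
# Each row lists where the chain can go from a given state; rows sum to 1.
TRANSITIONS = {
    "A": {"A": 0.6, "E": 0.4},
    "E": {"A": 0.7, "E": 0.3},
}


def next_state(current, rng):
    """Sample the next state using only the current state (the Markov property)."""
    targets = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Walk the chain for a number of steps and return the visited states."""
    rng = random.Random(seed)  # seeded so that a run is reproducible
    path = [start]
    for _ in range(steps):
        # Only path[-1] is passed on: the earlier history is never consulted.
        path.append(next_state(path[-1], rng))
    return path


if __name__ == "__main__":
    print(simulate("A", 10))
```

Replacing the two toy states with base-out states turns the same loop into an inning simulator, which is essentially how we think about the model in the rest of this piece.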

What I particularly like about Markov chains is the simplicity of their definition, the insights they provide, and the wide range of applications they support. For instance, we can analyze team performance, evaluate players' contributions to a result, assess player trade values and optimize the batting lineup (it also helped us debug some of our Retrosheet data-parsing code).

There are two key terms in Markov chains: states and transitions.

Game States

Figure 1: Non-absorbing states, numbered

Game states are all possible combinations of runners on base (0, 1, 2, 3, 12, 13, 23, and 123) and the number of outs (0, 1 and 2) at each play, totaling 8 × 3 = 24 states at any point in the inning (see Figure 1). In the simplest formulation, a single 25th state represents the end of the inning (the absorbing state, as no further transition is possible). To account for all scoring scenarios, we instead use 4 absorbing states, numbered 25 to 28, representing third-out situations with 0, 1, 2 and 3 runs, respectively.
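The state space described above can be sketched in a few lines of Python. The numbering used here is an assumption for illustration (outs vary slowest, so states 1–8 have no outs, 9–16 one out, 17–24 two outs); Figure 1 may order them differently. `build_states` and the dictionary keys are hypothetical names.

```python
from itertools import product

# Runner configurations in the order listed in the text; "0" means bases empty.
RUNNERS = ["0", "1", "2", "3", "12", "13", "23", "123"]
OUTS = [0, 1, 2]


def build_states():
    """Return a dict mapping state number (1-28) to a description of the state."""
    states = {}
    # 24 non-absorbing states: every (outs, runners) combination.
    # Assumed ordering: outs is the outer loop, runners the inner loop.
    for number, (outs, runners) in enumerate(product(OUTS, RUNNERS), start=1):
        states[number] = {"runners": runners, "outs": outs, "absorbing": False}
    # 4 absorbing states: third out recorded with 0, 1, 2 or 3 runs.
    for runs in range(4):
        states[25 + runs] = {"outs": 3, "runs": runs, "absorbing": True}
    return states


if __name__ == "__main__":
    for number, state in build_states().items():
        print(number, state)
```

Keeping the mapping in one place makes it easy to swap in a different numbering, or to add a third variable later, without touching the rest of the code.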

These choices are not absolute. We could keep adding dimensions to create more relevant states using a third or fourth variable (bases and outs are currently the only two). The more granular the states, the better our understanding, but the fewer data points we will have for each combination. For instance, we could add plays or runs to the states, but that would quickly expand the number of states and leave the observed transitions sparser than desirable (adding runs to each state, for instance, could multiply the number of states by a factor of 10–20. Let us not touch the curse of dimensionality).

Transitions

A transition is any movement of players on a plate appearance, that is, any change from one state to another. Singles, fly-outs, home runs and double plays each represent a specific transition from a previous state (1–24) to a current state (1–28). From a scoring perspective, a transition always results in 0, 1, 2 or 3 outs and 0, 1, 2, 3 or 4 runs. The figure below shows a transition matrix in terms of outs.
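As a rough sketch of how outs and runs fall out of a transition between two non-absorbing states, we can use a simple accounting identity: everyone involved before the play (runners, outs already made, plus the batter) must afterwards be on base, out, or home. The function names and the `(runners, outs)` tuple representation are assumptions made for this example. The identity only holds for plate-appearance transitions between states 1–24; for inning-ending plays the runners left on base are not part of the state, which is why the absorbing states carry the run count themselves.

```python
def runner_count(runners):
    """Number of runners in a base code such as "0", "2" or "13"."""
    return 0 if runners == "0" else len(runners)


def outs_on_transition(before, after):
    """Outs recorded on the play; each state is a (runners, outs) tuple."""
    return after[1] - before[1]


def runs_on_transition(before, after):
    """Runs scored on a plate appearance between two non-absorbing states.

    (runners + outs + 1 batter) before the play, minus (runners + outs)
    after the play, leaves the players who crossed home plate.
    """
    runners_before, outs_before = before
    runners_after, outs_after = after
    accounted_before = runner_count(runners_before) + outs_before + 1
    accounted_after = runner_count(runners_after) + outs_after
    return accounted_before - accounted_after


if __name__ == "__main__":
    # Bases loaded, two outs, home run: bases empty afterwards, still two outs.
    print(runs_on_transition(("123", 2), ("0", 2)))
    # Runner on first, no outs, double play: bases empty, two outs.
    print(outs_on_transition(("1", 0), ("0", 2)))
```

A negative result from `runs_on_transition`, or a negative out count, signals a transition that cannot happen on a single plate appearance, which is one way a check like this can help flag parsing problems in play-by-play data.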