Sometime this past March, I was talking about the Mariners with a friend. Robinson Cano came up, and my friend made the comment that Cano wasn’t that great of a hitter in 2015, because he only hit about .280 (the actual figure is .287, but that’s not the point). Setting aside advanced versus non-advanced stats, we talked a little about what sort of line a “good” hitter has, what it takes to lead the league in average, that sort of thing. He was shocked to learn that only six people in the AL hit .300 or better last year, and that .320 led the league; his impression was that the old standard of .300/.400/.500 was about average for a star hitter—in fact, only four qualified batters reached that threshold last year.

Fast-forward to about a month ago. Fellow BP writer Jeff Long and I were talking about article ideas, and he suggested that I write about what the “average” player looks like (statistically) these days, because “the steroid era screwed everyone's perception.” This was pretty much confirmed by Jeff via Twitter (note: I’m not trying to call anyone out, but I am using you all to make a point).

@JeffLongBP My insticts say .280 would be "good", but I don't even know what the league average is. — Ryan Tamanini (@rmthawk64) June 26, 2016

@JeffLongBP I would guess not. .300 is the magic number — Admiral Buzzkill (@TheOriginalBull) June 26, 2016

@JeffLongBP I don't even think I know. — August Fagerstrom (@AugustFG_) June 26, 2016

What I’ve done for today, then, is take a look at a range of offensive stats over the past 66 years to see how their respective averages and ranges have changed over time, and which direction, if any, they’re headed. Notably I excluded BP’s own True Average; this is because the average TAv is fixed at .260 every year. I also limited my sample within each season to batters with 200 PA or more.

First up: batting average.

Okay, let’s take a minute to go over what you’re seeing here, then we’ll talk about what it means. The dark line in the center of the graph is the average batting average (note: not the league’s batting average) for each year in the sample, run through a LOWESS smoothing process in R[i]. The darkest blue band, in the middle, represents the 40th to 60th percentile range; the medium blue is the 20th to 80th percentile, and the light blue is 5th to 95th. The boundaries of all regions have been LOWESS-smoothed as well. The idea is to give the reader not just an idea of the average within a statistic, but the range, as well—if you want to know if someone has a “good” batting average (or whichever statistic you’re talking about), you don’t just want to know what the average is, you also want to know how tightly distributed around that average the whole data set is. Said another way (and using made-up numbers), if someone’s hitting .302 for the year, you don’t just want to know that the average batting average is .260, you also want to know whether most batters fall within the range of .250-.270 or .220-.320.
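For readers who want to reproduce the banding approach: the original work was done in R, but the same per-season mean-plus-percentiles computation can be sketched in Python. Everything below is a toy stand-in—the random numbers replace the real 200-PA batter pool, the column names are my own, and a centered rolling mean stands in for R’s lowess() smoothing step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy stand-in for the real data: one AVG per batter with 200+ PA,
# seasons 1950-2015 (values are made up for illustration).
rows = [(year, rng.normal(0.260, 0.025))
        for year in range(1950, 2016) for _ in range(120)]
df = pd.DataFrame(rows, columns=["year", "avg"])

# Per-season percentile boundaries used for the shaded bands,
# plus the per-season average that forms the dark center line.
pcts = [0.05, 0.20, 0.40, 0.60, 0.80, 0.95]
bands = df.groupby("year")["avg"].quantile(pcts).unstack()
bands["mean"] = df.groupby("year")["avg"].mean()

# The article smooths every line with R's lowess(); a centered
# rolling mean is a crude stand-in for that smoothing step.
smoothed = bands.rolling(window=9, center=True, min_periods=1).mean()
```

Each column of `smoothed` then becomes one line on the chart: the mean as the dark center trace, and the percentile pairs as the edges of the three blue bands.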

So, what you’re seeing here for batting average is a fairly consistent range, always about 100 total points, and a generally flat average line, only ranging by about 15-20 points over the whole sample. You’re an average hitter (at least in terms of AVG) if you hit about .260, and you’re in the top 20 percent if you hit better than just under .300. The part I find most interesting about this particular graph is the shape of the 80th– and 95th-percentile lines. There are peaks (or maybe plateaus) and valleys (anti-plateaus? I don’t know the word for that), separated almost generationally. So, for example, someone who first became a baseball fan in the mid-70s would’ve been influenced by their observations into thinking slightly-higher-than-average batting averages were the norm; if that person then had a kid who became a baseball fan roughly 20-30 years later, they’d have also caught an above-normal period, which would correspondingly add to the influence on the parent (assuming they’re paying attention along with the kid). So you get families who are used to high batting averages (relatively—still only a 15-20 point swing) and families used to low ones. It even looks like it’s continuing, as all lines are trending up in the last few years. That’s an outrageously speculative take on this, but the cyclical nature of the graph jumped out at me.

On-base percentage is similarly smooth overall; the late 90s-early 2000s bump (visible in the batting average graph but not explicitly mentioned) is once again obvious, and there’s a recent-year trend upward after a bit of a valley. Average is generally around .335-.350; top 20 percent is, roughly speaking, better than .375. What I take away from this is the steep decline through the 50s and 60s leading up to the rule changes that followed the 1968 season, after which OBP almost immediately rebounded.

Next up is slugging; here, there’s been a real persistent change ever since the 90s, shifting all lines upward by at least 20 points. Recent years have even gotten as high as the 90s/2000s peak. It’s almost reached the point where you’d have to slug .500 to be in the top 20 percent. This is the first instance of what I think is a visibly skewed distribution, too, where the 80th-95th range is notably wider than the 5th-20th range.

Unsurprisingly, HR% (that is, HRs per PA) shows a lot of similarity to SLG. Amazingly, the average figure is rising even above that of the “steroid” era. I’d also like to point out the same skew as seen for SLG, the separation between average HR/PA and the median (visible in the average line’s tendency to sit above the middle of the darkest blue region, which marks the median), and the 50s/60s hump in all lines, though especially the 95th-percentile trace—which was also effectively killed off by the post-1968 rule changes.

I wish there were more to say about walk rate, but aside from a sharp descent leading up to the late 1960s (sensing a trend there?), it’s remained highly consistent.

Again, there’s nothing here in the strikeout percentage graph that hasn’t been discussed to death already. Strikeouts have been steadily increasing since always, with the exception of a short hiatus in the 1970s. If you have a conception about how often the average batter strikes out, and it’s based on a rate you knew in the past, you’re almost certainly too low.

Lastly, in the process of researching this piece I thought of another question that relates to this whole perception-of-average thing, and it turned out to be fun to answer: Which player’s current 2016 batting line best matches the league-average batting line of years past?

I found this in three different ways, and I’ll show you the results of all of them. I used the same stats as I discussed above, and threw in a batted-ball profile stat (GB%) for good measure. This excludes, then, both baserunning and defense—all I can show you is who’s the best match for past league average *batting* lines. I also limited the sample to players with 250 PA or more this year.

I used three distance measurements, two legitimate and one of my own creation (probably not legitimate). I used the ‘ecodist’ package in R to measure both the Euclidean and Mahalanobis distances between each year’s league-average line and each current player’s stat line, and then also figured the total relative error (treating the league figures as the true values and the player figures as the estimates). Euclidean distance is the same as the Pythagorean formula (the mathematical one, not the baseball one) extended to further dimensions, while the Mahalanobis distance is nearly the same but first places all values on a standard deviation scale (and therefore reflects the scale/distribution of stats, while Euclidean distance does not). The chart that follows shows all results. In case my meaning isn’t clear, here’s how to read the chart, using 1954 and relative error as the example: it would be accurate to say that “by this measure (summed relative error), the 2016 player whose batting line most resembles league average in 1954 is Joe Panik.” Some definitely do NOT pass the smell test, but then again, this article has all been about misconceptions in what average means, so maybe it’s my skepticism that’s wrong.
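The three distance measures are simple enough to sketch without the ‘ecodist’ package. The Python version below uses made-up stat lines (the stat ordering, the league row, the player rows, and the per-stat spreads are all hypothetical); with the real data you’d build the covariance for the Mahalanobis distance from the full player pool rather than an assumed diagonal.

```python
import numpy as np

# Hypothetical stat lines, ordered [AVG, OBP, SLG, HR%, BB%, K%, GB%].
# The league row stands in for one past season's average; the player
# rows stand in for 2016 batters with 250+ PA.
league = np.array([0.265, 0.330, 0.400, 0.025, 0.085, 0.150, 0.450])
players = np.array([
    [0.270, 0.335, 0.410, 0.028, 0.080, 0.160, 0.440],  # close match
    [0.230, 0.300, 0.480, 0.050, 0.120, 0.280, 0.350],  # poor match
])

def euclidean(x, y):
    # Pythagorean formula extended to n dimensions
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis(x, y, cov_inv):
    # Euclidean distance after rescaling by the data's covariance, so
    # stats on different scales (AVG vs. HR%) contribute comparably
    d = x - y
    return np.sqrt(d @ cov_inv @ d)

def total_relative_error(true_line, est_line):
    # Treat the league line as "true" and the player line as the estimate
    return np.sum(np.abs(est_line - true_line) / true_line)

# With the real player pool you'd use np.cov(pool.T); assumed per-stat
# standard deviations keep this sketch self-contained.
stds = np.array([0.025, 0.030, 0.060, 0.012, 0.030, 0.060, 0.060])
cov_inv = np.diag(1.0 / stds ** 2)

# Best match for this season, by each measure
best = int(np.argmin([euclidean(league, p) for p in players]))
```

Repeating the `argmin` step for every past season, once per distance measure, produces the three columns of the chart below.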



[i] This was done using the LOWESS function, with the utterly insane overkill of allowing up to 100,000 iterations. Full R code is available upon request, though data may or may not be.