Looking for Inefficiencies in MLB Draft Demographics | A Study

The 2015 MLB Rule 4 Draft concluded two weeks ago after three days of intense drafting, a few months of planning and years of scouting. The draft is a spectacle unmatched in scope — forty rounds that see thousands of players being drafted and many more being scouted but falling short (some sign as undrafted free agents).

Most of the draft involves players being drafted solely for the qualities they can bring to lower-level teams; they are not major league prospects. After all, organizations have to fill out multiple short-season league teams, leagues that start very soon after the draft. These players are not tracked by national prospect sites or known to even the most informed fans.

The first few rounds are more interesting and, in recent years, the first round (and the subsequent sandwich rounds) has been broadcast live on MLB Network on the Monday night as a television event. Teams expect the players they select in the first few rounds to have a major league future. They take into account the certainty of a player’s skillset being of major league quality, and the quality of a player’s perceived ceiling.

For most organizations, draft and scouting practices are based on customs that scouts have been using and preaching for decades. Organizations have area scouts who together span all baseball-playing areas of North America. Those scouts report to regional cross-checkers, who report to national cross-checkers and the front office at large. Scouting networks are intricate, often well-organized and build on concepts that have been passed down by scouts for generations. Scouting is the lifeblood of the sport.

Recently, teams have begun to aggregate their scouting data along with college and high school statistics to improve the quality of their information and the efficiency with which they can bring in talent. The analytics angle that has swept baseball is moving into the draft and for some teams the effect is quite visible.

The Houston Astros are very vocal about their hybrid scouting-analytics approach to team building. Their decision-makers always seem to have a draft plan that involves stocking up on position players with tools and a history of performance, and gaming the bonus pool to land multiple elite talents. In 2012 they took Carlos Correa first and were then able to land the sliding Lance McCullers later; both are now major league contributors. This year, they nabbed two high-performing bats in college shortstop Alex Bregman and high school outfielder Kyle Tucker, then had top high school talent Daz Cameron slip to 37 due to bonus demands. It’s expected that they’ll sign all three.

Other teams with dedicated analytics teams are surely crunching the numbers on how amateur stats translate to professional ball and figuring out how to combine that information with the scouting record. Chris Mitchell of Fangraphs/The Hardball Times has a KATOH model that assesses the likelihood of major league success given college data. He does it for hitters and pitchers and has also used the KATOH method for the minor leagues. He has found that there are significant college indicators that teams could use to improve their drafting.

This article, however, isn’t just a theory piece about how drafting is done and hypothetical ways it could improve. I wanted to do my own research on drafting trends and what has worked best in the past. So I compiled a database that includes the first 150 picks of each draft since 1990 and the eventual fWAR for each draftee through their age-29 season (designated as “WAR29”). Baseball America’s draft history tool provided the draft data and I pulled the value metric from Fangraphs. For this project, drafts through 2006 were considered; players drafted in 2006 have had a chance to reach the end of their team control.

I also have demographic data on these picks, including whether the player was drafted out of high school or college, their state/province of origin, the position they were drafted as and the team making the pick. These demographics will be contrasted in this analysis.

Players who went unsigned in the draft were not included since the exercise is meant to look at the value teams extracted from the picks; unsigned players do not generate value, and teams that fail to sign prospects often receive compensatory picks that make the cost nearly neutral. Omitting unsigned players also removed some of the “signability” quirks in the data: there are always high schoolers drafted after the first round who are first-round talents and are unlikely to sign with the team that drafts them. Data on unsigned players was pulled from The Baseball Cube, which provides such a filter. Since I had to link my draftee list to my major league value list using only names, there is likely to be a small amount of error in this information. I corrected name mismatches/duplicates as best I could and did the same for unsigned players, so any remaining error should not materially affect the results.
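As a rough illustration of that name-based linking step, here is a minimal Python sketch. It is not the code used for this study. The record layout (`name` key, a name-to-WAR29 mapping), the normalization rules, and the choice to assign 0 WAR29 to draftees with no major league value row are all assumptions made for the example.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch for ch in ascii_only.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link_by_name(draftees, war_by_name):
    """Attach WAR29 to each draftee; return (linked, ambiguous_names).

    draftees: list of dicts with a "name" key (hypothetical schema).
    war_by_name: dict mapping player name -> WAR29 (hypothetical schema).
    """
    value_lookup = {normalize_name(n): w for n, w in war_by_name.items()}
    counts = Counter(normalize_name(d["name"]) for d in draftees)
    # Names shared by several draftees cannot be resolved by name alone.
    ambiguous = sorted(key for key, c in counts.items() if c > 1)
    linked = []
    for d in draftees:
        key = normalize_name(d["name"])
        # Assumption: no matching value row means the player produced 0 WAR29.
        linked.append({**d, "war29": value_lookup.get(key, 0.0)})
    return linked, ambiguous
```

The `ambiguous` list is the part that needs manual review: any name shared by more than one draftee has to be checked by hand, which is the kind of cleanup described above.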

To start, I needed a baseline against which to compare demographics. I found the average WAR29 of each pick’s pool of signed draftees, and the fraction of players drafted at each pick who reached 3 WAR and 10 WAR. 3 WAR is meant to denote a player who became a real major league contributor, including profiles such as good fourth outfielders and middle relievers. 10 WAR players are average regulars or better: the kind of player teams envision their prospects becoming at maturity.
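The per-pick baseline described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the original workflow: the input is taken to be `(pick, war29)` pairs for signed draftees, and "reached 3 WAR" is read as WAR29 of at least 3.

```python
from collections import defaultdict


def baseline_by_pick(players):
    """Return {pick: (mean WAR29, share >= 3 WAR, share >= 10 WAR)}.

    players: iterable of (pick, war29) tuples for signed draftees only
    (hypothetical input format).
    """
    by_pick = defaultdict(list)
    for pick, war29 in players:
        by_pick[pick].append(war29)

    summary = {}
    for pick, wars in sorted(by_pick.items()):
        n = len(wars)
        summary[pick] = (
            sum(wars) / n,                  # average WAR29 at this pick
            sum(w >= 3 for w in wars) / n,  # fraction reaching 3 WAR
            sum(w >= 10 for w in wars) / n, # fraction reaching 10 WAR
        )
    return summary
```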

Here are two charts that illustrate the baseline values accrued at each draft pick by the three metrics.

Both graphs contain logarithmic fits that smooth out the numbers for our expected-value-per-pick metrics. It’s clear that the data is shaped in a way that accommodates a log fit. The expected value of a pick falls heavily after the first few selections; the best amateurs on the continent tend to have significant edges over the rest of the field. Figure 1 shows the average WAR of draft picks through the age of 29, and the fit line is -2.348*ln(pick) + 11.538, where ln is the natural logarithm. The picks near the end of the sample are assessed a negative expected WAR by this line, which runs counter to this exercise. Yes, a player taken in the fifth round would not be expected to reach major league replacement level, but in theory a team would also not play him if his talent didn’t reach replacement level. So the lowest expected value I will use is zero, and the expected-value functions will really be maximum(log_function(pick), 0).
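To make the floored fit concrete, here is a small Python sketch using the coefficients quoted above. The function and constant names are made up for the example. Solving the raw line for zero shows it turns negative between picks 136 and 137, so the floor only matters for the last handful of picks in a 150-pick sample.

```python
import math

# Coefficients of the article's fit: -2.348*ln(pick) + 11.538
SLOPE = -2.348
INTERCEPT = 11.538


def expected_war29(pick: int) -> float:
    """Baseline WAR29 for a pick: the log fit, floored at zero."""
    return max(SLOPE * math.log(pick) + INTERCEPT, 0.0)


# Pick number at which the raw (unfloored) fit crosses zero.
zero_crossing = math.exp(INTERCEPT / -SLOPE)

for pick in (1, 10, 50, 136, 137, 150):
    print(pick, round(expected_war29(pick), 3))
```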

Figure 2 shows the probability of these signed draftees becoming major league contributors. The light grey line is the portion who reach 3 WAR through age 29 and the dark grey line is the portion who reach 10 WAR. Orange is the fit for 3 WAR: -0.122*ln(pick) + 0.6548; blue is the fit for 10 WAR: -0.085*ln(pick) + 0.4218.
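The two rate fits can be written the same way. This sketch assumes the zero floor applies to the probability fits as well as to the WAR fit, which matters for the 10 WAR line because it dips below zero before pick 150. The dictionary keys and function name are hypothetical.

```python
import math

# (slope, intercept) pairs for the article's two probability fits.
RATE_FITS = {
    "3war": (-0.122, 0.6548),
    "10war": (-0.085, 0.4218),
}


def expected_rate(pick: int, threshold: str) -> float:
    """Baseline probability of reaching the WAR threshold, floored at zero."""
    slope, intercept = RATE_FITS[threshold]
    return max(slope * math.log(pick) + intercept, 0.0)


for pick in (1, 10, 50, 150):
    print(pick, round(expected_rate(pick, "3war"), 3), round(expected_rate(pick, "10war"), 3))
```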

We have our baselines! Now the analysis will consist of comparing demographics to the baseline.

First I wanted to look at college vs. high school. Conventional wisdom suggests that high schoolers generally have higher upside, since the top college players often weren’t elite prospects out of high school. The top college players are selected from a pool of lesser talent. However, college players are closer to the major leagues since they already have years of extra development. This ceiling/floor exchange is the foundation for high school/college contrasting.

For both high school and college signed draftees, I averaged the differences between actual WAR29 and baseline expected WAR29 (ΔWAR29), and the differences between whether the player reached 3 WAR and 10 WAR (counted as 1 or 0) and the baseline probability of doing so (Δ3Wins, Δ10Wins). Here are the results:

Education     N      ΔWAR29   Δ3Wins   Δ10Wins
College       1299   -0.106    0.001   -0.010
High School   1086    0.135   -0.011    0.002

Positive numbers reflect positively on the demographic. A value of 0.1 in Δ3Wins means the demographic converted draftees into productive players at an average rate of 10 percentage points more than the baseline expected rate. For example, if the average expected rate is 70% and 80% of the players reached 3 WAR, we’d have 0.1 Δ3Wins.
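Here is a minimal Python sketch of how these deltas could be computed for one demographic group, using the fit coefficients quoted earlier. The input format and function names are assumptions, and a player is counted as reaching 3 WAR when WAR29 is at least 3.

```python
import math


def expected_war29(pick):
    """Baseline WAR29 from the log fit, floored at zero."""
    return max(-2.348 * math.log(pick) + 11.538, 0.0)


def expected_3war_rate(pick):
    """Baseline probability of reaching 3 WAR, floored at zero (assumption)."""
    return max(-0.122 * math.log(pick) + 0.6548, 0.0)


def group_deltas(players):
    """players: list of (pick, war29) for one group. Returns (dWAR29, d3Wins)."""
    n = len(players)
    d_war = sum(war - expected_war29(pick) for pick, war in players) / n
    # (war >= 3) is True/False, which Python treats as 1/0 in arithmetic.
    d_3wins = sum((war >= 3) - expected_3war_rate(pick) for pick, war in players) / n
    return d_war, d_3wins
```

Δ10Wins follows the same pattern with the 10 WAR threshold and its own fit.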

High schoolers outperformed expectations by an average of 0.135 WAR through age 29, and achieved 10 WAR at a rate very slightly above average. College players were better at the 3 WAR category. As we’ll see later though, these edges are fairly small and the most accurate thing to say about this table is that major league organizations are properly pricing the college/high school distinction into their draft boards. We could probably also add a slight modifier to account for the fact that teams must wait longer for their high schoolers to develop than for college players; there is value in receiving contributions sooner and this would bring the two categories closer together.

There are differences between the two pools and teams are capturing the differences well.

Note: the weighted sums of the tables do not need to equal zero. The logarithmic models were fit to the average values at each pick for signed players, the picks have slightly different sample sizes, and holding the minimum expected values to zero also pushes the total away from zero.
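As a quick check of that note, the sample-weighted average of ΔWAR29 in the college/high school table can be computed directly from the published rows. It comes out slightly above zero (about 0.004 WAR per draftee) rather than exactly zero. The variable names are just for illustration.

```python
# (group, N, ΔWAR29) rows taken from the college/high school table.
rows = [
    ("College", 1299, -0.106),
    ("High School", 1086, 0.135),
]

total_n = sum(n for _, n, _ in rows)
weighted_mean = sum(n * delta for _, n, delta in rows) / total_n

print(total_n, round(weighted_mean, 4))
```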

Next I looked at the position that amateurs were drafted as. Going in, I expected to see that pitchers would lag behind position players, as that’s been a tenet of sabermetric thinking in recent years. Pitchers get hurt often and can have trouble developing the command and control necessary to become a major leaguer. They can be easier to scout since velocity, movement and control can be scouted regardless of the level of competition, but the road to the major leagues can be more difficult.

That is what happened. Here is the positional comparison:

Position   N     ΔWAR29   Δ3Wins   Δ10Wins
3B         126    1.78     0.031    0.026
2B          32    1.12     0.123    0.018
1B         111    0.59     0.027    0.011
C          199    0.45     0.029    0.006
SS         264    0.33    -0.005   -0.006
OF         424    0.32     0.009    0.014
RHP        876   -0.27    -0.022   -0.011
LHP        353   -0.50    -0.031   -0.032

Left-handed pitchers were especially over-drafted from 1990-2006. When drafting a left-handed pitcher, a team could expect to lose 0.50 wins through age 29 relative to the baseline; for right-handed pitchers, the loss was 0.27 wins. These draftees also became both supporting and star contributors at rates lower than expected for their draft position. These days, half a win of marginal value is worth around $3.5MM, so this edge is not trivial. Because teams need roughly as many pitchers as hitters, the samples for RHP and LHP are large (876 and 353 draftees), so I am confident in these numbers.
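The dollar figure can be extended to both pitcher groups with simple arithmetic. This sketch assumes value scales linearly per win, which is a simplification: $3.5MM per half win implies $7MM per win, so the right-hander shortfall of 0.27 wins works out to roughly $1.9MM.

```python
# The article's figure: half a win of marginal value is worth about $3.5MM.
DOLLARS_MM_PER_WIN = 3.5 / 0.5  # implies 7.0 ($MM per win)


def expected_cost_mm(delta_war29: float) -> float:
    """Dollar cost ($MM) of a negative ΔWAR29, assuming linear pricing per win."""
    return -delta_war29 * DOLLARS_MM_PER_WIN


# ΔWAR29 values from the positional table.
costs = {pos: expected_cost_mm(d) for pos, d in {"LHP": -0.50, "RHP": -0.27}.items()}
print(costs)
```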

Often, teams justify drafting pitchers highly by saying that they need to acquire power arms somehow; it’s a necessary evil. However, given that there is little frictional cost to acquiring arms later by making trades and signing free agents at market rates, this logic rings hollow. Teams should draft the players they expect will return the most value. They seem to have systematically over-drafted pitchers in the drafts studied, going against that principle.

As a group, pitchers still turn in productive careers; busted pitching prospects can often reinvent themselves as useful relievers. There is more opportunity for them to carve out marginal careers than there is for positional prospects.

Samples for individual fielding positions are smaller, but only second basemen have a sample of fewer than 100. Third basemen have historically over-performed their draft positions by 1.78 wins, a considerable amount of value. This is led by Scott Rolen (drafted 46th in 1993), David Wright (drafted 38th in 2001) and Evan Longoria (drafted 3rd in 2006); the former two produced over 40 wins more than expected through age 29.

No position was over-drafted the way pitchers were; however, shortstops returned productive careers at a rate slightly below expected. Perhaps teams put too much emphasis on the defensive value of shortstops and don’t properly adjust for the quality of other tools. Or they may see the dearth of productive shortstops in the major leagues and push the position up their draft board as a result.

After seeing a small difference between high school and college draftees overall yet a large difference between position players and pitchers, I wanted to combine the two categories. Here are draftees sorted into four categories separated by education and pitching/position playing:

Category               N     ΔWAR29   Δ3Wins   Δ10Wins
College Hitters        561    0.61     0.039    0.007
High School Hitters    595    0.49    -0.005    0.011
High School Pitchers   491   -0.13    -0.019   -0.009
College Pitchers       738   -0.47    -0.028   -0.023
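The four-way split can be sketched as a simple bucketing step. The record layout here is hypothetical, and the only rule assumed is that the RHP and LHP position labels mark pitchers while every other position counts as a hitter.

```python
from collections import defaultdict

PITCHER_POSITIONS = {"RHP", "LHP"}


def category(education: str, position: str) -> str:
    """Combine education level with a pitcher/hitter flag."""
    role = "Pitchers" if position in PITCHER_POSITIONS else "Hitters"
    return f"{education} {role}"


def mean_delta_by_category(rows):
    """rows: (education, position, delta_war29). Returns {category: (n, mean)}."""
    buckets = defaultdict(list)
    for education, position, delta in rows:
        buckets[category(education, position)].append(delta)
    return {name: (len(vals), sum(vals) / len(vals)) for name, vals in buckets.items()}
```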

We can see that college draftees account for much of the bias that teams have toward drafting pitchers. An additional query revealed that teams don’t over-draft college left-handed pitchers compared to college right-handers; the two are essentially equal relative to expectation. It is high school left-handed pitchers that they are especially over-drafting.

College hitters are under-drafted by a noteworthy amount in terms of both amassing value and having careers through 29 that surpass 3 WAR. Given this knowledge, and given statistical analysis of college statistics, which have been shown to have real predictive value, data-driven teams should be expected to draft college hitters at the top of the draft at a rate that exceeds the average team.

The last demographic I want to investigate is region of origin. I have data on states and provinces individually, but many states/provinces are sparsely drafted from and don’t have meaningful samples. So I created seven larger regions: Northeast, Southeast, Pacific, Midwest, South Central, Canada and Puerto Rico. The US regions are illustrated here: http://i.imgur.com/peJMziL.gif.

And here’s the table with results:

Region          N     ΔWAR29   Δ3Wins   Δ10Wins
Canada           15    3.25     0.156    0.177
Puerto Rico      45    1.44    -0.018    0.033
Pacific         582    0.25     0.009    0.000
Midwest         251    0.13     0.008   -0.006
South Central   363    0.11    -0.004    0.000
Southeast       824   -0.02    -0.008   -0.008
Northeast       305   -0.29    -0.038   -0.022

It looks like Canada and Puerto Rico are being under-drafted, although samples for them are very small (only 15 signed draftees for Canada and 45 for Puerto Rico). This may be due to under-exposure: scouts don’t canvass these regions as much as they do baseball hotbeds like Florida. However, the Northeast is also scouted less and it lagged the field in all three categories. None of the American regions had any real advantage in terms of creating 3 WAR and 10 WAR players. The most-scouted and most-drafted region, the American Southeast, had essentially zero bias. This suggests that teams draft best when they have more scouting looks at prospects who play against better competition; this is to be expected. As a region becomes more sparsely talented and scouted, variance reigns.

I was a little concerned after doing initial queries that some No. 1 picks would throw off the data. Many top picks were obvious generational talents entering their drafts and their performance could skew the demographics they come from. Players like Alex Rodriguez and Chipper Jones produced well in excess of the baseline top value. However, I tried each of the comparisons without the No. 1 picks as well and while the numbers were a little different, the conclusions to be made were not.
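A sensitivity check of this kind is easy to sketch: recompute a group’s average delta with and without the first overall picks and compare. The input format and function name are assumptions for the example.

```python
def mean_delta(rows, exclude_first_pick=False):
    """rows: list of (pick, delta_war29) for one demographic group."""
    kept = [delta for pick, delta in rows if not (exclude_first_pick and pick == 1)]
    return sum(kept) / len(kept)


def first_pick_sensitivity(rows):
    """Return (mean with all picks, mean without No. 1 picks)."""
    return mean_delta(rows), mean_delta(rows, exclude_first_pick=True)
```

If the two numbers have the same sign and similar size across every demographic, the top picks are not driving the conclusions.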

The most significant takeaways from the data are that:

– All pitchers perform worse than the baseline expectation, left-handed pitchers especially so.

– College hitters perform better than expected.

– Draftees from Canada and Puerto Rico have considerably outperformed expectations, although only 60 such draftees are in the sample.

– Teams appear best (most accurately tuned to value) at drafting prospects for whom they have the most information.

There you have it! For my next draft-related article, I would like to reuse my draft database to see how each organization has drafted individually, including the types of prospects they like and how those prospects perform.

(Title image via Getty Images, found at http://www.truebluela.com/2015/6/9/8750441/mlb-draft-2015-tracker-dodgers-picks)