Hello, world! I’m very excited to be taking part in this blog, and am looking forward to sharing my enthusiasm for two great things that are great together: baseball and R!

In this initial post, I’m going to introduce you to a new R package that I have been developing with Greg Matthews, a biostatistician at UMass. The package is called openWAR, because our goal is to produce a fully open-source, reproducible version of Wins Above Replacement (WAR). Our paper on the subject is on the arXiv, but even if this doesn’t interest you, you might still be interested in the package, because it can do a lot of things that aren’t related to WAR at all.

First, the package contains functions for downloading and processing the XML files that power the MLBAM GameDay web application. Carson Sievert has written a similar package called pitchRx for the PITCHf/x data, but openWAR works with the play-by-play data — not the pitch-by-pitch data. Although this data is not “free as in freedom”, it is “free as in beer.”

Installing openWAR

openWAR is not yet on CRAN, but it is on GitHub. Currently, openWAR relies on Duncan Temple Lang’s Sxslt package, which provides XSLT functionality from within R, and this leads to a particularly elegant method of transforming the raw XML files from MLBAM into nice data frames in R. Unfortunately, this package is not on CRAN either, but rather is hosted by Omega Hat. You can install it using the repos argument:

install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")

Depending on your operating system, you may need to install basic XSLT functionality, which will take place outside of R. Please see the Sxslt installation instructions for more details on how to do this.

Next, installing openWAR is best accomplished through the install_github() function in the devtools package.

require(devtools)
install_github("beanumber/openWAR")

Accessing MLBAM data

The base class in openWAR is called gameday, and it collects information about a single major league game (in principle, minor league games could be included just as easily, but right now the parsers will only download major league data). An object of class gameday can be created if you know the MLBAM ID for the game you want to investigate. How are you supposed to know this ID? We’ve written a function that will figure this out for you.

Let’s say that you want the list of games that were played on July 21st, 2013. We can ask for the list of games:

require(openWAR)
getGameIds(date = as.Date("2013-07-21"))

Retrieving data from 2013-07-21 ...
...found 15 games
 [1] "gid_2013_07_21_arimlb_sfnmlb_1" "gid_2013_07_21_atlmlb_chamlb_1" "gid_2013_07_21_balmlb_texmlb_1"
 [4] "gid_2013_07_21_chnmlb_colmlb_1" "gid_2013_07_21_clemlb_minmlb_1" "gid_2013_07_21_detmlb_kcamlb_1"
 [7] "gid_2013_07_21_lanmlb_wasmlb_1" "gid_2013_07_21_miamlb_milmlb_1" "gid_2013_07_21_nyamlb_bosmlb_1"
[10] "gid_2013_07_21_oakmlb_anamlb_1" "gid_2013_07_21_phimlb_nynmlb_1" "gid_2013_07_21_pitmlb_cinmlb_1"
[13] "gid_2013_07_21_sdnmlb_slnmlb_1" "gid_2013_07_21_seamlb_houmlb_1" "gid_2013_07_21_tbamlb_tormlb_1"
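The IDs themselves carry useful information. Judging from the IDs above (and the away_team and home_team values that appear in the data later in this post), each one seems to follow the pattern gid_YYYY_MM_DD_awaymlb_homemlb_N. Here is a minimal sketch in base R that splits an ID into its parts. Note that parseGameId is a hypothetical helper written for this illustration, not a function in openWAR, and reading the trailing number as a game number (for doubleheaders, say) is an assumption.

```r
# Two game IDs copied from the output above
ids <- c("gid_2013_07_21_phimlb_nynmlb_1", "gid_2013_07_21_nyamlb_bosmlb_1")

# Hypothetical helper (not part of openWAR): split an ID on underscores.
# Assumes the layout gid_YYYY_MM_DD_<away>mlb_<home>mlb_<number>.
parseGameId <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  data.frame(
    date     = as.Date(paste(parts[2:4], collapse = "-")),
    away     = sub("mlb$", "", parts[5]),   # drop the league suffix
    home     = sub("mlb$", "", parts[6]),
    game_num = as.integer(parts[7]),        # assumed meaning: game number
    stringsAsFactors = FALSE
  )
}

# One row per game
games <- do.call(rbind, lapply(ids, parseGameId))
print(games)
```

This is handy if you want to filter the list down to a particular team before downloading anything.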

Since Jim is a Phillies fan, let’s investigate the Mets-Phillies game that was played on that date.

gd = gameday(gameId = "gid_2013_07_21_phimlb_nynmlb_1")
summary(gd)
class(gd)

       Length Class      Mode
gameId  1     -none-     character
base    1     -none-     character
url     5     -none-     character
ds     62     data.frame list

[1] "gameday"

You can see the gd object is of class gameday, and has four components: the gameId, the base MLBAM URL, URLs for the five different XML files from which it gathers its information, and finally a data.frame that contains 62 variables for every play in the game. Let’s take a closer look at what is in this data.frame.

head(gd$ds)

  pitcherId batterId field_teamId ab_num inning   half balls strikes endOuts        event actionId
6    518774   276519          121      1      1    top     0       0       1       Flyout       NA
7    518774   276545          121      2      1    top     3       2       2    Groundout       NA
8    518774   400284          121      3      1    top     1       2       2 Hit By Pitch       NA
9    518774   502126          121      4      1    top     2       3       3    Strikeout       NA
1    424324   458913          143      5      1 bottom     1       2       1    Groundout       NA
2    424324   502517          143      6      1 bottom     0       2       2       Flyout       NA

  description                                                                        stand throws
6 Jimmy Rollins flies out to right fielder Marlon Byrd.                                  L      R
7 Michael Young grounds out, third baseman David Wright to first baseman Josh Satin.     R      R
8 Chase Utley hit by pitch.                                                              L      R
9 Domonic Brown strikes out swinging.                                                    L      R
1 Eric Young grounds out, shortstop Jimmy Rollins to first baseman Kevin Frandsen.       R      L
2 Daniel Murphy flies out softly to left fielder Domonic Brown.                          L      L

  runnerMovement                                          x      y game_type home_team home_teamId home_lg
6                                                    172.69  85.34         R       nyn         121      NL
7                                                    103.41 163.65         R       nyn         121      NL
8 [400284::1B::Hit By Pitch]                             NA     NA         R       nyn         121      NL
9 [400284:1B:2B::Passed Ball][400284:2B:::Strikeout]     NA     NA         R       nyn         121      NL
1                                                    106.43 152.61         R       nyn         121      NL
2                                                    112.45 115.46         R       nyn         121      NL

  away_team away_teamId away_lg venueId    stadium           timestamp playerId.C playerId.1B playerId.2B
6       phi         143      NL    3289 Citi Field 2013-07-21 17:11:38     407833      543744      502517
7       phi         143      NL    3289 Citi Field 2013-07-21 17:12:39     407833      543744      502517
8       phi         143      NL    3289 Citi Field 2013-07-21 17:15:05     407833      543744      502517
9       phi         143      NL    3289 Citi Field 2013-07-21 17:17:03     407833      543744      502517
1       phi         143      NL    3289 Citi Field 2013-07-21 17:21:34     456124      435623      400284
2       phi         143      NL    3289 Citi Field 2013-07-21 17:23:55     456124      435623      400284

  playerId.3B playerId.SS playerId.LF playerId.CF playerId.RF batterPos batterName pitcherName runsOnPlay
6      431151      435560      458913      501571      407781        SS    Rollins      Harvey          0
7      431151      435560      458913      501571      407781        3B   Young, M      Harvey          0
8      431151      435560      458913      501571      407781        2B      Utley      Harvey          0
9      431151      435560      458913      501571      407781        LF   Brown, D      Harvey          0
1      276545      276519      502126      460055      430321        LF   Young, E     Lee, Cl          0
2      276545      276519      502126      460055      430321        2B Murphy, Dn     Lee, Cl          0

  startOuts runsInInning runsITD runsFuture start1B start2B start3B  end1B end2B end3B outsInInning startCode
6         0            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
7         1            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
8         2            0       0          0    <NA>    <NA>    <NA> 400284  <NA>  <NA>            3         0
9         2            0       0          0  400284    <NA>    <NA>   <NA>  <NA>  <NA>            3         1
1         0            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
2         1            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0

  endCode fielderId                         gameId isPA  isAB isHit isBIP     our.x     our.y        r    theta
6       0    407781 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE 119.01855 283.65796 307.6154 1.173521
7       0    431151 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -53.88154  88.22197 103.3747 2.119083
8       1        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE FALSE FALSE FALSE        NA        NA       NA       NA
9       0        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE FALSE        NA        NA       NA       NA
1       0    276519 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -46.34461 115.77418 124.7056 1.951563
2       0    502126 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -31.32067 208.48835 210.8278 1.719909
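One thing worth noticing in the last four columns: r and theta look like the polar coordinates of the point (our.x, our.y). That is an inference from the printed numbers rather than something stated above, but it is easy to check. The sketch below copies the values from the first two rows and recomputes the distance and angle (in radians) with base R.

```r
# Values copied from the first two rows of the output above
our.x <- c(119.01855, -53.88154)
our.y <- c(283.65796, 88.22197)

# Assumed relationship: r is the Euclidean distance from the origin,
# theta is the angle in radians measured from the positive x-axis
r     <- sqrt(our.x^2 + our.y^2)
theta <- atan2(our.y, our.x)

# Compare with the r and theta columns printed above
print(cbind(r = r, theta = theta))
```

The recomputed values agree closely with the r and theta columns for those two plays, which supports the polar-coordinate reading.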

There is a great deal of information collected here — it should be comparable to Retrosheet. We can do some simple things like pull out the scoring plays:

subset(gd$ds, runsOnPlay > 0, select="description")

   description
3  Play reviewed and stands as called: David Wright homers (15) on a line drive to left center field.
4  Marlon Byrd homers (17) on a fly ball to left field.
26 Play reviewed and overturned: Juan Lagares homers (2) on a line drive to left center field. David Wright scores. Josh Satin scores.

We can also compute a linescore for the game:

require(plyr)
ddply(gd$ds, ~inning, summarise,
      PHI = sum(ifelse(half == "top", runsOnPlay, 0)),
      NYM = sum(ifelse(half == "bottom", runsOnPlay, 0)))

  inning PHI NYM
1      1   0   2
2      2   0   0
3      3   0   0
4      4   0   3
5      5   0   0
6      6   0   0
7      7   0   0
8      8   0   0
9      9   0   0
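If you would rather not load plyr, the same kind of table can be built with base R. The sketch below uses xtabs() on a small made-up data frame standing in for gd$ds (the rows are invented for illustration and do not come from the game above), so it runs without downloading anything.

```r
# Toy stand-in for gd$ds; these plays are invented for illustration
plays <- data.frame(
  inning     = c(1, 1, 1, 2, 2, 3),
  half       = c("top", "bottom", "bottom", "top", "bottom", "top"),
  runsOnPlay = c(0, 1, 1, 0, 0, 2)
)

# Sum runsOnPlay within each inning/half cell:
# rows are innings, columns are halves (ordered alphabetically)
linescore <- xtabs(runsOnPlay ~ inning + half, data = plays)
print(linescore)
```

A combination with no plays at all, such as an unplayed bottom half, simply shows up as 0.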

Or the final totals:

ddply(gd$ds, ~half, summarise, PA = sum(isPA), R = sum(runsOnPlay), H = sum(isHit))

    half PA R H
1 bottom 32 5 7
2    top 32 0 4

How about the basic pitching lines:

ddply(gd$ds, ~pitcherId, summarise,
      Name = pitcherName[1],
      BF = sum(isPA),
      IP = sum(endOuts - startOuts) / 3,
      H = sum(isHit),
      R = sum(runsOnPlay),
      BB = length(grep("Walk", event)),
      SO = length(grep("Strikeout", event)),
      HR = length(grep("Home Run", event)))

  pitcherId     Name BF IP H R BB SO HR
1    424324  Lee, Cl 26  6 7 5  1  6  3
2    425786 Atchison  7  2 1 0  0  2  0
3    449097 Papelbon  3  1 0 0  0  1  0
4    455374 Bastardo  3  1 0 0  0  2  0
5    518774   Harvey 25  7 3 0  0 10  0
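Every pitcher in this game happened to finish a whole number of innings, but because IP is computed as outs divided by 3, a pitcher who records 19 outs would show up as 6.333 rather than the 6.1 you see in a box score. Here is a small sketch of how you might convert outs to the conventional notation. formatIP is a hypothetical helper for this post, not part of openWAR.

```r
# Hypothetical helper (not part of openWAR): convert a count of outs
# into box-score innings-pitched notation, e.g. 19 outs -> "6.1"
formatIP <- function(outs) {
  paste0(outs %/% 3, ".", outs %% 3)
}

ip <- formatIP(c(18, 19, 21))
print(ip)
```

To use it in the summary above, you would compute the outs with sum(endOuts - startOuts) and pass that total to the helper instead of dividing by 3.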

This was not a bad day for Mr. Harvey. Now, you may see some discrepancies between the data that you download through openWAR and more authoritative sources. But based on our analysis, the fidelity of the data retrieved by openWAR is very good. We’ll verify this statement in a later post.

Clearly, there are a lot more interesting things that one can do with this data, but this is just a basic introduction. Next time we will explore openWAR’s ability to download multiple games’ worth of information.