Happy Earth Day

Take a look at a sample from a Retrosheet event file:

com,"call was upheld by replay" play,7,1,donaj001,00,,NP sub,dysos001,"Sam Dyson",0,0,1 play,7,1,donaj001,20,.BBX,46(1)/FO/P.3-H(UR);2-3 play,7,1,bautj002,11,FBX,HR/78/L.3-H(UR);1-H(UR);B-H(TUR) com,"ej,buehm001,P,scotd901,Coming onto the field (not on roster)" com,"$Jose Bautista stood at HP admiring his homer"

Now if I hadn’t watched 4 replays of this game already, a lot of this CSV file would be unclear to me. This format is also difficult to use in a web API or mobile app which why I was surprised when I couldn’t easily find a JSON version of the Retrosheet Database.

The JSON

GitHub Link

Download the repository, and see the Readme.md file for instructions on how to run the tool. You have to provide the Retrosheet files yourself; just extract them into the JSONRetro/data folder after you download them from retrosheet.org

Here’s a sample JSON from a different play in that game:

"play" : { "inning" : 7, "team" : 0, "batter" : "choos001", "count" : { "balls" : 1, "strikes" : 2 }, "pitches" : "FBFB", "event" : "OA/UREV.3-H(UR)(E2/TH)(NR)" }, "play" : { "inning" : 7, "team" : 0, "batter" : "choos001", "count" : { "balls" : 2, "strikes" : 2 }, "pitches" : "FBFB.FS", "event" : "K" }

I still haven’t decided on a way to parse “event” and “pitches” but that’s why the tool is open source.

The Java

I made heavy use of Apache Commons libraries to simply iterate the CSVRecords in a CSVParser and build a string that gets written to a .json file. There is a 1:1 mapping of event files to json files and the output is 3 times the size but a single json file is still typically less than 1 megabyte. That means you can easily store a large number of these in memory at one time even on a mobile device.

The tool can probably be made much quicker with some multithreading and maybe by not writing out the StringBuilder to file on every iteration. What it does do is provide 100% compliant JSON as validated by JSONLint in about a minute.

I have MLB Gameday Data, the Lahman Database in JSON, and now a JSON version of Retrosheet. This is the starting point to being able to parse every MLB game down to the pitch level with more recent years including details such as the spin rate of every pitch and a description of every play in both Spanish and English.