I’m working on a project that involves downloading an entire season’s worth of MLB Gameday data. These files include per-pitch data for every at-bat since 2007, with more detailed information recorded each season. I’ve written a tool in Java that downloads the files I need.

Here’s a sample from a Rangers @ Jays game from the 2015 ALDS:

The last three files are the Pitch F/X data; the first two hold at-bat and game metadata.

How to do it

If you have Java installed, the quick way is to download this and run it from your terminal or command line. The command takes two parameters: 1) the year to download and 2) the local directory to save the files.

java -jar Downloader.jar 2014 "c:\Users\majorsaber\data"

As best I recall, the 2014 season took about 25 minutes. If you have to cancel partway through, the tool picks up where it left off on the next run, but the one file that was downloading at that moment may be left corrupted. Multithreading the download would probably speed things up, which is why I made the code open source.
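One common way to avoid that half-written file is to download to a temporary name and only rename it once the copy has finished. The sketch below is not taken from the tool; the class name, method names and the ".part" suffix are made up for illustration, and it uses only the standard library rather than commons-io.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the actual Downloader source.
class ResumableDownload {

    /**
     * Writes the stream to a sibling ".part" file, then renames it to target.
     * Returns false without reading anything if target already exists.
     */
    static boolean saveIfMissing(InputStream in, Path target) throws IOException {
        if (Files.exists(target)) {
            return false; // finished on an earlier run, so skip it
        }
        Path parent = target.toAbsolutePath().getParent();
        Files.createDirectories(parent);
        Path part = parent.resolve(target.getFileName() + ".part");
        // A leftover .part file from a cancelled run is simply overwritten.
        Files.copy(in, part, StandardCopyOption.REPLACE_EXISTING);
        // Only a completely written file ever gets the final name.
        Files.move(part, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    /** Checks for the file first so finished downloads cost no network call. */
    static boolean download(URL source, Path target) throws IOException {
        if (Files.exists(target)) {
            return false;
        }
        try (InputStream in = source.openStream()) {
            return saveIfMissing(in, target);
        }
    }
}
```

With this pattern, cancelling mid-download leaves at worst a stray ".part" file, and the next run fetches that file again from scratch.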

[UPDATE: The tool is now multithreaded and can download an entire season in 5 minutes!]
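For the curious, here is roughly what multithreading a batch of downloads can look like with a fixed thread pool from java.util.concurrent. This is a generic sketch, not the tool's actual code; the class name, method name and the choice of pool size are assumptions.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the actual Downloader source.
class ParallelRunner {

    /** Runs every task on a fixed pool and returns how many finished without throwing. */
    static int runAll(List<Callable<Void>> tasks, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int succeeded = 0;
            // invokeAll blocks until every task has either completed or failed.
            for (Future<Void> result : pool.invokeAll(tasks)) {
                try {
                    result.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // One failed download should not stop the rest of the season.
                    System.err.println("download failed: " + e.getCause());
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task would wrap the download of one file or one game folder. Downloads spend most of their time waiting on the network, so a pool somewhat larger than the number of CPU cores is a reasonable starting point.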

How it works

GitHub Repository

The URL template for a single game’s worth of data is

http://gd2.mlb.com/components/game/mlb/year_2015/month_10/

and this is the contents of that directory. By fetching that HTML directory listing and parsing it with jsoup, I get a list of all folders whose names begin with “gid_”. I then fetch the 5 files that I need from each game’s directory using Apache Commons IO’s very useful copyURLToFile method.
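To make the link-scraping step concrete, here is a stripped-down sketch. To keep it dependency-free it swaps jsoup for a plain regular expression, and it assumes the listing is a simple index page with href="gid_.../" links. The class and method names are invented for this example. With jsoup, the equivalent would be something like selecting a[href^=gid_] on the parsed document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the actual Downloader source.
class GameDirectoryListing {

    // Captures the folder name from href="gid_.../" (assumed listing format).
    private static final Pattern GID_LINK = Pattern.compile("href=\"(gid_[^\"/]+)/?\"");

    /** Returns the names of all linked folders that begin with "gid_", in page order. */
    static List<String> gameFolders(String listingHtml) {
        List<String> folders = new ArrayList<>();
        Matcher m = GID_LINK.matcher(listingHtml);
        while (m.find()) {
            folders.add(m.group(1));
        }
        return folders;
    }

    /** Joins the listing URL, a game folder and a file path inside that folder. */
    static String fileUrl(String listingUrl, String folder, String relativeFile) {
        String base = listingUrl.endsWith("/") ? listingUrl : listingUrl + "/";
        return base + folder + "/" + relativeFile;
    }
}
```

Each URL built this way can then be handed to copyURLToFile, or to any other download routine, along with the matching local path.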

That’s it. My next task is to map the batter and pitcher IDs from the Gameday data to those in the Lahman Database and Retrosheet so I can cross-reference player stats with at-bat level data.

*bat flip*