Data is not a resource that should sit in gilded halls, behind locked doors, for only the priveliged few to access. That’s why we’ve been working on publishing these NFL data in a way that would allow anyone with an internet connection to easily access. By using BigQuery from Google Cloud Platform (GCP), these data can be public in a way never before possible.

Where Is The Data From? Is It Reliable?

The NFL operates a feed called feeds-rs that includes insane amounts of information about every play, player, coach, stadium…anything you could imagine, starting in 2009. These are the same base data that is parsed by the incredible project nflgame and its derivatives, nflgame-redux, this Kaggle project, and so forth.

The difference between these projects and our project, pynfldata, is that the data coming from feeds-rs is extensively verified and corrected before being published. For example, if there is a fumble or interception followed immediately by a score, the other projects do not include the PAT/2-point attempt in the data, but pynfldata does.

We of course greatly appreciate if you can find any issues with our published data! See the pynfldata project on how to contribute or raise issues.

You can access the data using the source code above, and use that to download the data from feeds-rs yourself. However, there is an easier way.

How Can The Data Be Accessed?

BigQuery is GCP’s hosted data warehouse in the cloud – think of it as a database that you can access from your browser from anywhere in the world. All you have to do to access BigQuery is to visit the console and follow the directions to link your Google account to a GCP account in order to get started. GCP generously offers $300 in free credit over your first year, and there’s also a free tier that allows for a surprising amount of analysis without ever linking a credit card to your account. If you want to play with the system before (or in addition to) looking at NFL data, there are a large number of public datasets, ranging from weather to traffic to bitcoin.

BigQuery arranges its data in the following hierarchy: PROJECTs contain DATASETs which contain TABLEs. We are publishing one table for now with data from drives as mentioned above, named pynfldata.drives.drives_v1​. If you have created your own project as detailed above, you should simply be able to query this table using Standard SQL:

SELECT * FROM ​`​pynfldata.drives.drives_v1`

You can then press the run button under the query window, and the query will return all 3316 games in the data. We’ll quickly discuss a few technical details about this table before returning with examples of simple queries to get you started with this data.

Note that drive_start is a yardline, by midfield bias: this means that the 50 yard line is 0, anything on your side of the field is negative from there, and anything on the opponents’ side of the field is positive. -30 is your own 20 yard line, +5 is the opponents’ 45.

Once you have queried the table once, you can pin the project/dataset in your control menu on the left. If you click the table, you can click the Details tab in the bottom panel to see more info about it. This table is both clustered and nested. For each of the 3316 game records returned in the above query, all drives for that game are nested under that game’s basic data of home_team, away_team, game_id, etc. This allows for selection of basic properties like year or team involved without having to parse all 77637 drives. It is also clustered by season_year, season_type, game_week, meaning that if you filter by these properties, you immediately reduce the data being parsed to ONLY that which meets the filters.

Now, these are not particularly important filters to worry about with a table that is only 3.32 MB in size, especially since BigQuery charges for a minimum of 10 MB per query. However, once we release data for individual plays in addition to that for drives and begin to expand the data included here, these may become important factors in your use of these tables.

If you want to programmatically access the data instead of using your browser and SQL, you can do so with client libraries, available in a variety of modern programming languages. That link also walks you through authentication so that you’ll be set up to use your own project, which is done automatically in the browser after you’ve set up a project.

You can also directly link BigQuery with Google Sheets, populating spreadsheets automatically with the contents of a table or the results of a query – see this link for details.

What Can We See With This?

All kinds of fun stuff! These data were used in the analysis for our recent field position article. But what you can see is only limited to your imagination and your usage of SQL!

The basic syntax of an SQL query is:

SELECT column1, column2, column3 FROM `pynfldata.drives.drives_v1` WHERE condition_a AND condition_b ORDER BY column2

For example, we can get the basic details for every regular season home Browns game from 2012:

SELECT home_team, away_team, season_year, season_type, game_week FROM `pynfldata.drives.drives_v1` WHERE season_year = 2012 AND season_type = 'REG' AND home_team = 'CLE' ORDER BY game_week

Row home team away team season year season type game week 1 CLE PHI 2012 REG 1 2 CLE BUF 2012 REG 3 3 CLE CIN 2012 REG 6 4 CLE SD 2012 REG 8 5 CLE BAL 2012 REG 9 6 CLE PIT 2012 REG 12 7 CLE KC 2012 REG 14 8 CLE WAS 2012 REG 15

We can see 2016 regular-season drives where the defense scored using the UNNEST() operator, which takes the nested structure of the table and flattens it so that we can access and use the values in the drives column:

SELECT home_team, away_team, season_type, game_week, d.drive_id, d.drive_pos_team, d.drive_scoring_team, d.drive_points FROM `pynfldata.drives.drives_v1`, unnest(drives) as d WHERE season_year = 2016 AND season_type = 'REG' AND d.drive_pos_team != d.drive_scoring_team ORDER BY game_week, game_id, d.drive_id LIMIT 5

Row home team away team game week drive id drive pos team drive scoring team drive points 1 PHI CLE 1 15 CLE PHI 2 2 TEN MIN 1 12 TEN MIN 6 3 TEN MIN 1 15 TEN MIN 7 4 IND DET 1 22 IND DET 2 5 BUF NYJ 2 11 NYJ BUF 7

You can also use GROUP BY statements to dig more into the data by summarizing and grouping rows together, which we won’t go into here.

What’s Next?

The next step is to add simple historical plays data to BigQuery. Then we’ll be expanding the fields available in the drives table and the plays table based on our needs for this blog, as well as anything that you request that’s available from the feeds-rs service.

If you have any feedback, please feel free to raise an issue on our github page, Facebook, Twitter, or send us an email. Let us know what cool stuff you find using this datasource!

To be notified when new posts go up or to provide feedback or suggestions, follow FSAD on Facebook or Twitter!