We only need to create this object once in our session and it’s ready to use for any queries we throw at it.

Understand the tables in the dataset

With our BigQueryHelper object for the GitHub dataset, bq_assistant, there are a few super simple read-only functions we can use to learn more about the data, including:

Listing tables: bq_assistant.list_tables()

Getting table schema: bq_assistant.table_schema("licenses")

Inspecting table rows: bq_assistant.head("licenses", num_rows=10)

These are easy ways to get a feel for the data you want to work with without expending quota.

Of course if you follow along by forking this kernel, you can also get all of this information by clicking the “Data” tab and looking at the file previews.

Estimate the size of your queries before you make them

Now for the really cool part. We’re eager to start analyzing the GitHub BigQuery dataset to uncover insights into open source software development, but we should be careful about how much data we scan in our queries. Let’s make those free 5TB count!

Fortunately, the bq_helper module gives us tools to very easily estimate the size of our queries before we get ahead of ourselves. See below:
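As a sketch of that workflow (the exact SQL here is an assumption, reconstructed from the 2,000 commit messages described below; the helper call needs the Kaggle Kernels environment, so it's shown commented out):

```python
# Hypothetical query: fetch 2,000 commit messages. BigQuery still has
# to scan the entire `message` column to satisfy it.
QUERY = """
SELECT message
FROM `bigquery-public-data.github_repos.commits`
LIMIT 2000
"""

# Estimating is free -- it does not execute the query or spend quota:
# bq_assistant.estimate_query_size(QUERY)  # returns the scan size in GB
```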

The result of `estimate_query_size`

Without needing to run the query, we now know it would “cost” nearly 18GB. Now you can see why it’s extremely helpful to know this without needing to actually expend the quota!

Note that the query obviously wouldn’t return an object that’s 17.6GB — it would only return 2000 rather short commit messages.

Let’s change the query and see what happens:
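One hedged sketch of such a change (again an assumed query, with only the LIMIT reduced):

```python
# Hypothetical variation: shrinking the LIMIT does not shrink the scan,
# because BigQuery still reads the whole `message` column.
QUERY_SMALLER = """
SELECT message
FROM `bigquery-public-data.github_repos.commits`
LIMIT 10
"""

# bq_assistant.estimate_query_size(QUERY_SMALLER)  # same ~17.6GB estimate
```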

The result of the second `estimate_query_size`

Again, we’re scanning the same 17.6GB, which is the full size of the message field. The lesson: the data you’re scanning is not equivalent to the data you expect the query to return as its result. Check out BigQuery’s best practices docs for more information.

Don’t be discouraged by this. When you do make a query request, your results will be cached by default. As long as you’re in the same interactive kernel session, you can run the same cell/query multiple times without expending multiples of the quota cost. Whew! The nice caching behavior is especially helpful because when you click “New Snapshot” to save your work, your kernel will be freshly executed from top to bottom.

Execute a query safely

Now that we’re comfortable with the huge amounts of data at our fingertips and know we can confidently estimate the size of our queries, let’s actually try fetching some data!

The query_to_pandas_safe function is another bq_helper function that makes the call to execute our query. It has two advantages over using the base BigQuery Python client library:

By default it won’t execute a query with a size estimate over 1GB

It returns a convenient pandas dataframe

Let’s test out query_to_pandas_safe on our first query. If it works correctly, the query will be canceled before executing, since we know it scans 17.6GB of data.
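A sketch of that call (the SQL is the assumed query from earlier; the helper calls need the Kaggle Kernels environment, hence the comments):

```python
# The ~17.6GB query from before.
QUERY = """
SELECT message
FROM `bigquery-public-data.github_repos.commits`
LIMIT 2000
"""

# With the default 1GB ceiling, the query is canceled and None comes
# back instead of a DataFrame:
# commit_messages = bq_assistant.query_to_pandas_safe(QUERY)

# Raising the ceiling explicitly would let it run:
# commit_messages = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=20)
```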

Our query was canceled because it would scan more than 1GB of data

As expected, the query was canceled because its estimated scan size exceeded the 1GB limit.

With these functions, we’re now safe and ready to write and execute a query that will help answer our question: What are the most popular licenses used for open source projects shared on GitHub?

Let’s use the best practices we learned with estimate_query_size and query_to_pandas_safe to query the licenses table of the GitHub BigQuery dataset:
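A sketch of what that query might look like (the SQL is an assumption based on the question being asked; the helper calls need the Kaggle Kernels environment):

```python
# Assumed SQL: count repositories per license type.
QUERY = """
SELECT license, COUNT(*) AS count
FROM `bigquery-public-data.github_repos.licenses`
GROUP BY license
ORDER BY count DESC
"""

# In a Kaggle kernel:
# bq_assistant.estimate_query_size(QUERY)  # small -- well under the 1GB ceiling
# licenses = bq_assistant.query_to_pandas_safe(QUERY)
```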

Executing this query takes seconds. And now that we’ve got a pandas dataframe of our results, we’re in very familiar data science territory.

Step Three: Visualize Most Popular GitHub Project Licenses

The results of our query (and our efforts learning BigQuery) all become real once we visualize them! I’m going to use matplotlib with seaborn to create a bar plot showing the count of GitHub projects by type of license.

Here’s the code I used to do that in my kernel:
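A minimal matplotlib sketch of such a bar plot (the counts below are placeholders, not real query results — the actual licenses DataFrame would come from the query above, and the original kernel also applied seaborn styling):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows standing in for the real query results.
licenses = pd.DataFrame({
    "license": ["mit", "apache-2.0", "gpl-2.0", "gpl-3.0"],
    "count": [4, 3, 2, 2],
})

plt.figure(figsize=(12, 6))
plt.bar(licenses["license"], licenses["count"])
plt.title("Count of GitHub projects by license")
plt.xlabel("License")
plt.ylabel("Number of repositories")
plt.tight_layout()
plt.savefig("github_licenses.png")
```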

And here’s the resulting bar plot:

This shows that MIT is by far the most popular license, well ahead of the next most common: Apache 2.0 and GPL 2.0 and 3.0. And because this is Python, I wrote more lines visualizing my data than I did retrieving query results from BigQuery. It’s amazing how simple and quick it is to analyze big data in BigQuery.

For the record, the Python kernel I wrote is Apache 2.0-licensed. :)

Conclusion

I hope you’ve enjoyed this walkthrough and learned something new. Personally, I felt confident about exploring this massive 3TB dataset knowing the functions available to me in bq_helper would prevent me from accidentally spending more quota than I intended.

Here are some additional resources for working with BigQuery datasets in Kaggle Kernels: