Connect BigQuery to Google Sheets

Following this Google blog post.

Photo by Linus Mimietz on Unsplash

Why would you do this? Why not a SQL database?

One of my friends is the president of a non-profit that is (thankfully) experiencing growing pains. A group of electrical and mechanical engineers in their early 20s hatched a plan to attach solar panels to schools in Uganda, and hard work and generous donations have fueled success that has lead to larger plans. But, alas, their data was scattered between different Drive accounts in the form of Google Sheets, and understanding their impact or visualizing their future was impossible with the data in its current state. I was lucky enough to get involved in the organization through my friend, who complained about how their data collection sucked, and they were trying to gather all their information into a database. Being an academic data scientist pursuing a Master’s in Public Policy, I couldn’t believe they had such valuable data as school-level data for student health and education before and after the installation of solar panels on schools in rural Uganda, and begged to lend a hand.

Over donuts and coffee in an apartment conference room, I witnessed firsthand the least tidy data set imaginable. Production and solar panel part tables lived in one person’s drive, school surveys and installation procedures lived in another. They did not even know where all the data was.

The first step was gathering all the data in one place. We agreed to meet in 2 weeks with all the data in one location, and I would handle the cloud integration. What they wanted was very simple: query multiple sheets using SQL syntax. I gave a slick presentation showing a prototype of a dash website with SQL database to generate reports, but that went out the window when I asked “How will the data be updated?” and they responded, “We will copy and paste to a Google Sheet.” It was immediately apparent to me that I was the only person in the room with SQL or cloud engineering knowledge, but rather than being an advantage, this was a liability to the non-profit, as it could not afford to ever lose access to the database in lieu of my absence or disengagement.

BigQuery is Amazing

Due to the fact that the database would be randomly accessed and updated via Google Sheets, I decided to hold off on the Airflow + SQL + Flask app I originally had in mind for this database project. Due to budget constraints and the reluctance of the organization to set up a complicated cloud process, I would need to build something native in Google that was cheap and easy. BigQuery and Data Studio immediately jumped out at me.

Before the next meeting, I wanted to have a prototype of what BigQuery could do for their organization, and decided to use UN Human Development Report — Subnational Indexes to build a mock database.

Create a Project

In the Google Cloud Platform, create a New Project. I have chosen “HDR-SubnatIndex” as my project name.

Create a new project.

Enable the API

Next, enable the Sheets API for your new project. Notice that my project at the top is HDR-SubnatIndex.

Enable sheets api.

Create a Dataset

Navigate to the BigQuery Console and select CREATE DATASET.

Give your dataset a name and configure expiration and encryption. Click Create Dataset.

Create a table

Find the sheet that you would like to connect to BigQuery and copy its shareable link.

Select your new dataset under your project in the bottom left, and then select CREATE TABLE.

Select Create table from: Drive, paste the entire Sheets URI into Select Drive URI, select CSV (I’ve noticed some issues with selecting Google Sheet where it does not recognize the column names), input Table Name, and in this case, we can auto detect schema and parameters.

Run a test query

The beauty of BigQuery is its ability to execute SQL queries on ANYTHING. To make sure the connection to the sheet has been properly allowed, run a test query. Here I select the top 10 rows of the dataset.

Now let’s run a slightly more advanced query that pulls subnational data for region, year, health index, income index, and education index for Uganda only.

But this is all information that could be easily gleaned from filters applied in Google Sheets, right? The benefit of this approach is that you now have a query you can save and run at any time without having to copy the dataset and recreate the filter steps. Where BigQuery truly stands out, however, is in its ability to execute JOINS between multiple Sheets.

SQL JOINS

I followed the same steps above to add a second BigQuery table from a google sheet in my google drive that contains GDP data from the World Bank. Combining these two datasets using Google Sheets and linking them would be much more difficult without the aid of BigQuery and SQL JOINS. Here I have queried both tables to create a report of each country’s yearly National Human Development Index and GDP.

This is an important comparison to make when analyzing multi-dimensional poverty, as Gross Domestic Product (GDP) does not tell the full story. The UN Human Development Report established the Human Development Index to take into account variables such as health, education, life expectancy, mean years of schooling, expected years of schooling, and population, which gives a much more dynamic understanding of a particular’s country’s needs or type of poverty experienced.

Conclusion

Your BigQuery dataset is essentially an abstraction above your Sheets that allows you to run SQL queries on multiple sources of data housed in dynamic Sheets. Now you have a query that you can save and run whenever needed or export to Data Studio by clicking Explore Data. Also, from Data Studio, you can create a report and use this same query on your BigQuery dataset as your data source for the report.

BONUS: BI Engine

Now that you have seen the power of BigQuery to collect data from google sheets and standardize reporting, let’s try to speed up that query, shall we? Google BI Engine is essentially purchasing accelerated compute resources for a project.

On the left side of the BigQuery console, select BI Engine. Enable it, if needed. Select CREATE RESERVATION. Select the region where the resource will be located. We will be using less than 1 GB of storage with our small queries.

You will now see your new BI Engine listed under BI Engines. After running the same query, we only saw a decrease in time of 2.3 seconds. But where BI Engine truly shines is in Google Data Studio, with noticeably faster updates while you play with charts to find the perfect way to visualize your analysis.

Thank you for reading!