Service accounts are a way to keep a tight leash on what your applications in Google Cloud Platform are doing. Instead of running your applications with all the permissions your user account has, you typically want the application to have extremely constrained access to your organization’s resources.

How to use a private key in a service account to access BigQuery from Pandas

Let’s say that you’d like Pandas to run a query against BigQuery. You can use the read_gbq function of Pandas (available in the pandas-gbq package):

import pandas as pd

query = """
SELECT
  year,
  COUNT(1) AS num_babies
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  year
"""
df = pd.read_gbq(query,
                 project_id='MY_PROJECT_ID',
                 dialect='standard')
print(df.head())

This will work if you run it locally (because it uses your identity, and presumably you have the ability to run queries in your project).

But if you try running the above code from a bare-bones account, it won’t work. You will get asked to go through an OAuth2 workflow to specifically authorize the application. Why? Because you don’t want an arbitrary application running up a BigQuery bill (or worse, accessing your corporate data), so you have to provide it the permissions it needs.

Let’s try it out.

1. Create GCE instance

To follow along with me, use the GCP web console and create a Google Compute Engine instance with the default access (this typically includes only the bare basics and doesn’t include access to BigQuery):

Creating a Google Compute Engine instance with restricted access

2. Create a barebones service account

From the GCP web console, create a new service account. This is the account for which you will be generating a private key. For extra security and auditing, I recommend creating a brand new service account for each application and not reusing service accounts between applications.

Go to IAM & Admin, select “Service accounts” and click on +Create Service Account. Fill out the form as follows:

Creating a barebones service account for Pandas

The roles above allow the service account to run queries and have those be billed to the project. However, the dataset owner still needs to allow the service account to view their datasets.

If you are using a non-public BigQuery dataset, give the service account the appropriate (typically just View) access to it by going to the BigQuery console and sharing the dataset with the service account’s email address. For the purposes of this tutorial, I will use a public BigQuery dataset, so we can skip this step.

3. Put the example code on the GCE instance

SSH to the GCE instance and on the command-line, type in the following commands:

sudo apt-get install -y git python-pip
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
sudo pip install pandas-gbq==0.4.1

4. Run without default credentials

Change the project id in nokey_query.py and then run it:

cd training-data-analyst/blogs/pandas-pvtkey
# EDIT nokey_query.py to set the PROJECT ID
python nokey_query.py

It will ask you to go through an OAuth2 workflow. Hit Ctrl-C to exit. You don’t want this application running with *your* credentials.

A common suggestion you will hear when you run into such an error is to run:

# BAD IDEA! DO NOT DO THIS

gcloud auth application-default login

and go through the interactive OAuth2 workflow in the shell before launching the application. Be careful about doing this: (1) the above command is a bazooka; it allows the application to do anything you can do. (2) It is not scriptable; you will have to do it every time you run the application.

The best approach is to create the Compute Engine VM, not with the bare-bones service account, but with the service account to which you have given BigQuery viewer permissions. In other words, swap steps 1 and 2, and when you create the GCE instance, use the newly created service account.

But what if you are not on GCP (so your machine isn’t created by the service account auth) or you are using a managed service such as Cloud Dataflow or Cloud ML Engine (and so the VMs are created by that service’s service account)? In that case, a better approach is to change the Pandas code as follows:

df = pd.read_gbq(query,
                 project_id='MY_PROJECT_ID',
                 private_key='path_to/privatekey.json',
                 dialect='standard',
                 verbose=False)

The application will now use the permissions that are associated with this private key.
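Rather than hardcoding the key's location in the source, one option is to resolve it at runtime. Here is a sketch of my own (the helper name and the PANDAS_GBQ_KEY environment variable are assumptions, not part of pandas-gbq):

```python
import os

def resolve_private_key(default='trainer/privatekey.json',
                        env_var='PANDAS_GBQ_KEY'):
    """Return the key path from the environment if set, else the default.

    This keeps the deployment-specific path out of the source code.
    """
    return os.environ.get(env_var, default)
```

You would then call pd.read_gbq(query, project_id='MY_PROJECT_ID', private_key=resolve_private_key(), dialect='standard') and set PANDAS_GBQ_KEY per deployment.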

5. Run without a private key

Change the project id in query.py and then run it:

cd training-data-analyst/blogs/pandas-pvtkey
# EDIT query.py to set the PROJECT ID
python query.py

It will fail, saying that the private key wasn’t found.

6. Generate a private key for the service account

Remember that you created a JSON private key when you created the service account? You need to upload it to the GCE VM. (In case you didn’t create the key file, navigate to IAM > Service Accounts and create a private key for the pandas service account. This will create a JSON file and download it to your local computer. You can also revoke keys from here.)

In the top-right of the SSH window, there is a way to upload files. Upload the generated JSON file to the GCE instance and move it into place:

mv ~/*.json trainer/privatekey.json

7. Run with the private key

python query.py

It will work now and you will get back the results of the query.

8. Putting a private key in a Python package

What if you are not using GCE, but are using a managed service (Cloud Dataflow, Cloud ML Engine, etc.) instead?

In that case, you will be submitting a Python package. Mark the private key as a resource in the setup.py using the package_data attribute:

from setuptools import setup

setup(name='trainer',
      version='1.0',
      description='Showing how to use private key',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='nobody@google.com',
      license='Apache2',
      packages=['trainer'],
      ## WARNING! Do not upload this package to PyPI
      ## BECAUSE it contains a private key
      package_data={'trainer': ['privatekey.json']},
      install_requires=[
          'pandas-gbq==0.4.1',
          'urllib3',
          'google-cloud-bigquery'
      ],
      zip_safe=False)

Then, instead of hardcoding the path to the privatekey.json file, do this:

import pkgutil
private_key = pkgutil.get_data('trainer', 'privatekey.json')

Here is a more complete example. Note that Python doesn’t have a way of marking packages as private, so you should be careful not to mistakenly publish this package in a public repository such as PyPI.
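Putting the pieces together, here is a hedged sketch of a loader (the helper name load_private_key is my own; pandas-gbq's private_key parameter accepts either a file path or the raw JSON contents):

```python
import pkgutil

def load_private_key(package='trainer', resource='privatekey.json'):
    """Return the bundled key's JSON contents as a string.

    pkgutil.get_data returns bytes, so decode before passing the
    contents to pandas-gbq's private_key argument.
    """
    data = pkgutil.get_data(package, resource)
    if data is None:
        raise IOError('No %s bundled in package %s' % (resource, package))
    return data.decode('utf-8')
```

You would then call pd.read_gbq(query, project_id='MY_PROJECT_ID', private_key=load_private_key(), dialect='standard'), and the lookup works no matter where the managed service unpacks your package.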

9. Safeguarding the private key

The private key essentially unlocks, for whoever presents it, the resources that have been made available. In this case, anyone who presents the private key will be able to make BigQuery queries. No second form of authentication (IP address, user login, hardware token, etc.) is required. So, be careful about how/where you store the private key. What I recommend:

(1) Make sure to have a .gitignore in the Python package that explicitly ignores the private key:

$ cat trainer/.gitignore
privatekey.json

(2) Use git secrets to help safeguard against key leakage

(3) Rotate your keys. See this blog for details.

(4) Make sure to not publish the Python package to any repository of Python packages, as yours contains a private key.
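The git secrets idea above can be approximated with a minimal heuristic of my own (a sketch, not part of the git secrets tool) that flags files which look like service-account keys:

```python
import json

def looks_like_service_account_key(text):
    """Heuristic: a GCP service-account key is a JSON object with a
    'type' of 'service_account' and a 'private_key' field."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    if not isinstance(data, dict):
        return False
    return data.get('type') == 'service_account' and 'private_key' in data
```

You could wire a check like this into a pre-commit hook so that commits containing such a file are refused.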

Summary

Use the JSON private_key attribute to restrict the access of your Pandas code to BigQuery.

Create a service account with barebones permissions

Share specific BigQuery datasets with the service account

Generate a private key for the service account

Upload the private key to the GCE instance or add the private key to the submittable Python package

Make sure you don’t check the private key into code or package repositories. Use a key rotator and git secrets.

Here is the example code of this article in GitHub. Happy coding!

Acknowledgment: Thanks to my colleagues Tim Swast and Grace Mollison for their help and suggestions. Any errors in the article are my own.