In the pictured architecture, BigQuery is the central data store for all product data, and user permissions data is similarly stored in what we call the “permissions table”. If any table is already in BigQuery (such as Google Marketing Platform and Google Ads data exported via the BigQuery Data Transfer Service), there is (usually) no need to move or copy the data. For external data, we automate data extraction from the product, for example via product APIs, and then ingest that data into BigQuery.

Generate the permissions table

The process for extracting the permissions data is likely to be product-specific, and may require cooperation from the product developers. PAPER assumes a strong one-way encryption of the user’s account identifier. We recommend you extract and transform the user data inside a protected enclave, and encrypt the user account identifiers before any kind of data transfer or storage.

Consider, for example, the following set of a product’s user accounts and permissions, as managed by the product itself:

The account identifier has to match the value that will be returned by the BigQuery SESSION_USER() function (“the email address of the user that is running the query”) when users later try to retrieve data. BigQuery also has standards-based cryptographic hashing functions, which you can use directly or as part of a custom transformation expression. (You could add a cryptographic salt, for example.) Keep in mind you must be able to replicate the exact transformation, both when producing the permissions table and when users run their queries.
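For instance, the following sketch (our assumption, not the actual transformation used in any product integration) lowercases an email address, strips whitespace, and hashes it with SHA-256; for the same normalized input, sha256sum produces the same hex digest as BigQuery’s TO_HEX(SHA256(...)):

```shell
# Hypothetical helper: normalize an account identifier and hash it with
# SHA-256. For an input with no internal whitespace, this matches
# BigQuery's TO_HEX(SHA256(LOWER(TRIM(email)))) on the same value.
hash_account_id() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -d '[:space:]' \
    | sha256sum \
    | cut -d ' ' -f 1
}

# The transformation is deterministic, so it can be replicated both when
# producing the permissions table and inside the filtering query:
hash_account_id 'Analyst@example.com'
```

A salted variant would prepend a secret value before hashing; the same salt must then be applied in the filtering query as well.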

For this simple example, assume the product’s account information can be queried using BigQuery Standard SQL-like syntax. The following query normalizes and encrypts the user ID, and saves the user’s list of accessible account IDs in an array, storing one record per user account:
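A sketch of such a query might look like the following; the source table and column names (product_user_accounts, user_email, account_id) are illustrative assumptions, not any product’s actual schema:

```sql
-- Illustrative sketch only: build one record per user, with an encrypted
-- user ID and the array of account IDs that user may access.
CREATE OR REPLACE TABLE `project-id.unfiltered.account_permissions` AS
SELECT
  TO_HEX(SHA256(LOWER(TRIM(user_email)))) AS encrypted_user_id,
  ARRAY_AGG(DISTINCT account_id) AS account_ids,
  CURRENT_TIMESTAMP() AS last_updated
FROM `project-id.unfiltered.product_user_accounts`
GROUP BY encrypted_user_id;
```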

Refresh the permissions data as frequently as possible (such as every 15 minutes) so permissions changes in the product will be enforced in the external queries, analytics, and visualization tools without a long delay.
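If the extract-and-load step is wrapped in a script, one simple scheduling option is a cron entry on the ingestion host (the script path and log location here are hypothetical):

```
# Hypothetical crontab entry: rebuild the permissions table every 15 minutes.
*/15 * * * * /opt/paper/refresh_permissions.sh >> /var/log/paper_refresh.log 2>&1
```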

Protect your source data

Before you save the permissions table or any unfiltered product data to a BigQuery table, make sure the IAM roles of users of your Google Project (project-id in our examples) limit most users’ permissions. The Project “Owner” and “Editor” roles (primitive roles “owner” and “editor”), and the “BigQuery Admin” role (if used), should be granted only to a handful of trusted administrators. These users will have access to the unfiltered data and permissions tables (albeit with only encrypted account identifiers).

Permissions on BigQuery data are managed at the dataset level (controlling access to a collection of tables or views). Assuming we are storing all unfiltered data and permissions tables in the same dataset (which we’ll refer to as unfiltered in our examples), here are full table specifications for notional tables to protect:

project-id.unfiltered.account_permissions

project-id.unfiltered.account_report

project-id.unfiltered.account_summary

project-id.unfiltered.account_events

We use automation scripts to ensure the security and privacy features of PAPER are always implemented correctly and consistently. In Python implementations, we’ve used the Google Cloud Client Library for Python, and for Bash implementations, we use the gcloud, gsutil, and bq command line tools.

The examples below are in Bash and assume a Linux environment similar to that of a standard Google Compute Engine (GCE) instance, which includes a recent version of Bash. Assume we’ve already authenticated with “gcloud auth login” and set the default project to “project-id”. To keep the examples short and easy to read, some error handling and checks have been removed.

First we create the unfiltered dataset. We use the lock_down_dataset.sh script to remove default permissions, and restrict access to only administrators (Project Owners and Editors).

export PROJECT_ID='project-id'

export UNFILTERED_DATASET_ID='unfiltered'

bq mk --dataset "${UNFILTERED_DATASET_ID}"

./lock_down_dataset.sh "${UNFILTERED_DATASET_ID}"

The dataset share settings should look like this:

Locked down dataset
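The lock_down_dataset.sh script itself is not shown in this article, so the following is a hedged sketch of one way to implement it: fetch the dataset’s access list with bq show, drop the default projectReaders entry so only Owners and Editors (the OWNER and WRITER dataset roles) remain, and write the result back with bq update. The JSON filtering is factored into a function so it can be exercised without calling bq:

```shell
# Sketch only -- our assumption of what lock_down_dataset.sh does.
# Keep only OWNER and WRITER entries in a dataset access list (JSON on stdin).
filter_dataset_acl() {
  python3 -c '
import json, sys
ds = json.load(sys.stdin)
ds["access"] = [a for a in ds.get("access", [])
                if a.get("role") in ("OWNER", "WRITER")]
json.dump(ds, sys.stdout)
'
}

# Wiring against a real dataset (requires an authenticated bq session):
#   bq show --format=prettyjson "${PROJECT_ID}:${UNFILTERED_DATASET_ID}" > ds.json
#   filter_dataset_acl < ds.json > ds_locked.json
#   bq update --source ds_locked.json "${PROJECT_ID}:${UNFILTERED_DATASET_ID}"
```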

Implement User-Specific Filtering in an Authorized View

We need a second dataset to hold Authorized Views to do the primary job of filtering data based on user privileges. (We call the dataset “hidden_views” because we will hide this dataset as well, in a later step.) Users granted access to this dataset will be able to query the views, even though they can’t query the unfiltered data directly.

export HIDDEN_VIEWS_DATASET_ID='hidden_views'

bq mk --dataset "${HIDDEN_VIEWS_DATASET_ID}"

./lock_down_dataset.sh "${HIDDEN_VIEWS_DATASET_ID}"

The filter_by_account_id.sql.template below shows one of the simplest queries to filter results based on a user’s list of permitted “account_ids”. We have to explicitly share the unfiltered dataset with the view itself, to “authorize” the view to query the unfiltered tables. The share_with_authorized_view.sh script automates this configuration. (Note we are not showing how the unfiltered tables were created.)
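A hedged sketch of what such a filtering template might contain (table and column names are illustrative, and the actual template would parameterize them):

```sql
-- Illustrative sketch of a filtering view body. SESSION_USER() is hashed
-- with the same transformation used to build the permissions table, and
-- rows are kept only for account IDs in that user's permissions array.
-- A freshness check against PERMISSIONS_TTL could also be applied here.
SELECT t.*
FROM `project-id.unfiltered.account_report` AS t
WHERE t.account_id IN UNNEST((
  SELECT p.account_ids
  FROM `project-id.unfiltered.account_permissions` AS p
  WHERE p.encrypted_user_id = TO_HEX(SHA256(LOWER(TRIM(SESSION_USER()))))
))
```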

export PERMISSIONS_TABLE_ID='account_permissions'

export PERMISSIONS_TTL='3 DAY'

./make_and_authorize_view account_report \
    filter_by_account_id \
    "${UNFILTERED_DATASET_ID}" \
    "${HIDDEN_VIEWS_DATASET_ID}"

At this point, the unfiltered data is protected, and you could safely grant users direct access to the hidden_views dataset. User queries would only return the results they are authorized to see. But they would also be able to see the view’s query: they would see exactly what you do to look up a user’s permissions and how you implement filtering.

Hide the Permissions Filter Logic in another Authorized View

We recommend hiding the implementation from non-privileged users by wrapping the hidden_views with a very simple, pass-through query, implemented by the following template:
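In our assumed version, the pass-through template is nothing more than a SELECT * over the hidden view (names are illustrative):

```sql
-- Illustrative sketch of a wrapper template: expose the hidden view's
-- results without revealing its filtering logic.
SELECT * FROM `project-id.hidden_views.account_report`
```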

The wrapper views are saved in a third dataset, filtered_views, named to highlight to users that their results are supposed to be filtered. We need to make the dataset, but we don’t need to lock it down, since results are filtered based on each user’s product account permissions, and any user that doesn’t have a product account will simply get no results. The steps to create the dataset and then make the wrapper view are shown below:

export FILTERED_VIEWS_DATASET_ID='filtered_views'

bq mk --dataset "${FILTERED_VIEWS_DATASET_ID}"

./make_and_authorize_view account_report \
    wrap_hidden_view \
    "${HIDDEN_VIEWS_DATASET_ID}" \
    "${FILTERED_VIEWS_DATASET_ID}"

The final configuration of datasets and views, for one table (account_report, in this example) should now look something like the view from the BigQuery console shown here. The unfiltered account_report table is mirrored by the PAPER-enabled version in filtered_views. The account_events and account_summary tables are yet to be filtered, but both the unfiltered and hidden_views datasets are (by design) restricted to privileged administrators.

Share and Use the Filtered Views

There’s one more step to make these views usable: We need to share the data with users! Not only do you need to share the filtered_views dataset, but users also need permissions to run queries, which means they need to be added, at a minimum, to the “BigQuery Job User” IAM role, in some project. (We typically grant this role in the same project that hosts the BigQuery datasets, but there are valid use cases for using a different project.) With the standard minimum permissions, your users will only see the filtered_views dataset and the wrapper view query, as shown below (as well as the full schema):

Standard user’s view of PAPER data in BigQuery
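As a sketch of the minimum grant (the user email is hypothetical; roles/bigquery.jobUser is the IAM identifier for the “BigQuery Job User” role), an administrator might run:

```
# Hypothetical user; requires an authenticated gcloud session.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member='user:analyst@example.com' \
    --role='roles/bigquery.jobUser'

# Sharing the filtered_views dataset itself can be done by editing the
# dataset's access list (bq show / bq update) or in the BigQuery UI.
```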

We also likely want to go beyond just giving users access to the data. Most of our users are not likely to write their own SQL, so we configure data analytics and visualization tools to reference the filtered_views, ensure the tools are configured to use each individual user’s credentials, and then share those tools with the same users that have access to the dataset.

We need to make sure BigQuery uses the data viewer’s credentials to filter the data to that user’s permissions, and not the document owner’s credentials. This distinction is clear in Data Studio data source documents, for example. In Data Studio, you create a data source, selecting the Google BigQuery Connector, and select the BigQuery project, dataset, and table (or view); then you can toggle between “Viewer’s Credentials” and “Owner’s Credentials”. We strongly advise using only Viewer’s Credentials for any sensitive or business-critical information.

Data Studio is great for distributing access to PAPER-protected datasets because the actual data retrieved by each user is unique to that viewer. Users can even copy view-only Data Studio reports, or data sources, and create tailored versions that use the same PAPER-protected BigQuery authorized views, so when they re-share their copies, the data is still filtered and the underlying data is still completely safe from unauthorized users.

Colab documents shared in view-only mode, and then used in “playground mode”, provide a protected data enclave similar to Data Studio. In “playground mode”, users can run Colab cells that execute BigQuery SQL, and the output is visible only to that user. Be cautious of granting edit privileges, or even of using your own Colab documents shared with others: unless you explicitly turn on “playground mode”, your results will be visible to other users that might not have the same data privileges.

Google Sheets can include Apps Script to run BigQuery SQL, but there is no concept like “playground mode”, so results are always visible. If you want to offer users Google Sheets as an option for dynamically filling a spreadsheet with permission-filtered data from a PAPER-protected dataset, consider making copies of the Apps Script-enabled Google Sheet, one for each user.

Aggregate or Automate Sharing

We’ve highlighted just some of the ways to share PAPER-protected results, but keep in mind, every time you share one of these documents, you also need to add that user to the filtered_views dataset and to the project as a BigQuery Job User.

We have two recommendations for simplifying the process of synchronizing these permissions.

1. Use Google Groups. You can use a Google Group alias in place of individual user emails, so you can grant the Group the BigQuery permissions, and share the filtered_views dataset, and all documents that use it, with that Group. Then just add users to the Google Group in one step. (We usually suggest configuring the Google Group to be used only as an access control list, by removing all other Google Group features: certainly remove the ability to join or request to join the group, but also remove the ability to post topics or send group emails.)

2. Use Google APIs to copy users from document share lists to the BigQuery dataset and project role. This is definitely more advanced and technically complex, but it does allow users to share PAPER-protected documents directly with other users. You need to register each document with the code that will monitor its share list, typically using the Google Drive API, then use the BigQuery API and Google Cloud Resource Manager API to add the document users to the filtered_views dataset and the project IAM BigQuery Job User role, respectively.

Monitor BigQuery Logs for Unintended Usage

PAPER gives your users a wider array of tools for data analysis, including direct access to run BigQuery queries, but this also introduces a potential risk, particularly when dealing with large datasets. Users might, even unintentionally, create a new dashboard configuration or query that scans too much data, which can increase BigQuery costs over time.

GCP gives you monitoring tools and APIs to identify and correct problems before incurring unexpectedly high costs. We recommend a two-stage safety net:

1. Implement alert notifications on your BigQuery logs to notify your admins if the “bytes billed” metric for a user’s query exceeds a threshold that indicates a very expensive query. Stackdriver Logging APIs can trigger Cloud Functions via Pub/Sub, based on criteria you specify. (You could also monitor for unintended use cases, such as a user using your BigQuery quota to query large datasets in other Google Cloud projects.)

2. Implement BigQuery user-level custom quotas, or investigate migrating to flat-rate pricing, as a base-level guard against runaway costs.
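As a toy sketch of the alerting stage (the threshold and the helper are our assumptions; extracting the billed-bytes figure from the audit log entry is not shown here), the core of the alert is just a comparison:

```shell
# Hypothetical alert threshold: 1 TiB billed by a single query.
BYTES_BILLED_ALERT_THRESHOLD=$((1024 * 1024 * 1024 * 1024))

# Succeeds (exit 0) when a query's billed bytes exceed the threshold,
# meaning an admin notification should be sent.
is_expensive_query() {
  [ "$1" -gt "${BYTES_BILLED_ALERT_THRESHOLD}" ]
}
```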

Advanced Features Also Supported

The example queries and scripts represent realistic but simple versions of a PAPER implementation. We have been able to extend the design to address more complex requirements. Here are two notable cases:

BigQuery supports partitioning tables by date, as well as breaking up a table into multiple tables with a common prefix (known as Wildcard Tables), and you can even combine both features to effectively create two-dimensional partitioning. For large tables approaching a terabyte or more, effective use of partitions can make a significant difference in BigQuery costs and query performance. The BigQuery engine has to be able to recognize the partitions required for a query, and there are ways to inadvertently confuse BigQuery, causing it to scan all partitions instead of the desired subset. The details are outside the scope of this article, but the good news is that all of these features still work from a PAPER filtered view.

The query in the filter_by_account_id.sql.template supports one of the simplest permission models: it filters any data, as long as the data can be tied back to an “account ID” for data ownership and access control. But we have implemented more complex filter queries. Here’s one example. Fully explaining this query is a bit out of scope for this article, but the point is to show that it is possible to test for permissions to multiple entities (advertiser and site, in this case), and also to test for a required subset of possible permissions needed to access a given type of report. We convert the permissions (an array of required permission strings) into a bitmap and use the bitmap to decide if the user has the permissions required to complete the query.

Key features and benefits, versus other approaches

PAPER gives data owners confidence their data is protected, while giving users a set of powerful tools, anchored by the full capabilities of BigQuery, to analyze and visualize data, including very large datasets. Compared to other similar solutions, PAPER has clear advantages, as shown in this figure: