Managing BigQuery from the command line can be a great time saver. Here’s how to get this set up.


Working with BigQuery can be quite involved. The Cloud Console (Google’s name for the web-based cloud management interface) is powerful, but tasks quickly become repetitive and cumbersome. Implementing API clients, Composer DAGs, and similar applications is often overkill, especially for fairly basic and incidental operations. Fortunately, Google also provides a great toolkit that includes command line clients for many of the Google Cloud APIs. These clients have the advantage that simple tasks can be performed in a simple way, while complex tasks are still possible. Commands can be copied and pasted, scripted, scheduled from a crontab, put in version control, and so on. All this is available from a VM instance in one of your GCP projects, but also from your laptop in the coffee shop around the corner.

Getting started

The first thing you need, obviously, is a command line. Linux and macOS users will know what I mean, but Windows users may not. The toolkit that we will be installing provides a Windows command line, but that shell is somewhat limited. As an alternative, WSL (Windows Subsystem for Linux) will give you a basic but quite functional Linux installation of your choice, including a proper terminal in which you can run a shell. All examples are given as they are executed in a Linux terminal running the bash shell, but they are easily translated to a different shell.

The next step is to install the Cloud SDK. This provides you with a set of tools for interacting with the Cloud Platform. Follow the instructions in the online documentation, including the step to initialize the SDK. This will let you authenticate using your GCP account, which should of course have sufficient permissions to manage BigQuery and Cloud Storage.

A little exploration

Having installed the Cloud SDK, fire up a terminal and type:
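The top-level built-in help is the place to start:

```shell
gcloud help
```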

This will show you a lot of information on the various subsystems, organized into groups, that are available in GCP. For the purpose of this article the auth group is of particular interest:
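Listing the known accounts looks like this:

```shell
gcloud auth list
```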

This shows you the account used when initializing the SDK. Add another account by entering:
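The login subcommand opens a browser-based authentication flow:

```shell
gcloud auth login
```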

and following the instructions. You can then switch between accounts.
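Switching is a matter of setting the active account; the address below is of course a placeholder for one of your own accounts:

```shell
gcloud config set account your.name@example.com
```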

Another useful option is the config group, which allows you to manage some of your basic setup, such as the active account and project. Use gcloud help config for a full listing of all available parameters. Inspect the active setup by using:
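```shell
gcloud config list
```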

Sidestepping into Cloud Storage

Cloud Storage is very convenient as a storage location for data that are to be imported into or exported from BigQuery, so let’s take a little sidestep into managing Cloud Storage from the command line. The utility for this is gsutil, which takes subcommands in a way similar to gcloud. Listing the available buckets is done like this:
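```shell
gsutil ls
```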

More detailed information is obtained by using options:
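For example, the -l option adds size and date information for the objects in a bucket (the bucket name is a placeholder):

```shell
gsutil ls -l gs://my-bucket
```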

Copying data to or from Cloud Storage is done using the cp subcommand:
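The bucket and file names below are hypothetical; cp works in both directions:

```shell
# Local file to bucket:
gsutil cp orders.json gs://my-bucket/imports/orders.json

# And back again:
gsutil cp gs://my-bucket/imports/orders.json ./orders-copy.json
```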

There is no need to create folders, since paths map onto object names in Cloud Storage. There is a lot more that can be controlled using gsutil. Check the built-in help and the online documentation for more information.

Besides using gsutil, objects on Cloud Storage can also be managed on some operating systems by mounting buckets as local filesystems using gcsfuse, as explained in its documentation.
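A minimal sketch, assuming gcsfuse is installed and the mount point exists (names are placeholders):

```shell
# Mount the bucket on a local directory:
gcsfuse my-bucket /mnt/my-bucket

# Unmount again when done (on Linux):
fusermount -u /mnt/my-bucket
```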

Managing BigQuery

Interacting with BigQuery is done using the bq command. As with other Cloud SDK tools, help is built in:
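```shell
bq help

# Help on a specific subcommand:
bq help load
```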

Let’s go through a few steps that demonstrate basic operations you are likely to perform in your day-to-day work with BigQuery. First off, we create a dataset where we can experiment without affecting important data.
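The dataset name is of course up to you; I’ll use playground in the examples below:

```shell
bq mk --dataset playground
```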

The mk subcommand is used to create various resources, including datasets and tables. Conversely, the rm subcommand is used to delete resources.

Before continuing with creating a table, we need to decide how to provide the schema of the table. For simple tables the schema can be provided directly on the command line using the --schema option, but this option may also be used to specify a schema file, allowing for more complex table definitions. As an example I'll be using the following schema, representing an order object with one or more order lines, saved in a file named order.json:
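A schema along these lines, with a repeated record for the order lines (the field names are illustrative):

```json
[
  {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "order_date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "customer_id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "order_lines", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "product_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "quantity", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "price", "type": "NUMERIC", "mode": "NULLABLE"}
  ]}
]
```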

The command for creating the table is:
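Assuming the playground dataset and the order.json schema file from above:

```shell
bq mk --table playground.orders ./order.json
```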

Note that there are many more options that help you control what to create and, at least as important, where to create it. Again, the built-in help in the bq utility is a good starting point.

Now that we have created a table, the next obvious step is to get some data into it, which is achieved using the load subcommand. This allows importing data in various formats, including CSV and JSON. As you may have noticed, I created the table with a repeated record in it, so CSV is not an option, but JSON is. Suppose I have a file with order information, where each order looks a bit like this:
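Something like this, matching the schema above (the values are made up, and the document is pretty-printed here for legibility):

```json
{
  "order_id": "ORD-000137",
  "order_date": "2020-06-01",
  "customer_id": "CUST-42",
  "order_lines": [
    {"product_id": "SKU-0001", "quantity": 2, "price": 9.99},
    {"product_id": "SKU-0007", "quantity": 1, "price": 24.5}
  ]
}
```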

Note that JSON files need to be formatted so that each record occupies just a single line, so pretty-printed JSON will not work. Assuming I have collected a set of orders in a file, one full order document per line, I can import these using the following command:
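Assuming the orders are collected in a local file named orders.json:

```shell
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  playground.orders \
  ./orders.json
```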

The import job can now be found in the Cloud Console, where it can be identified using the job id printed by the bq command. In this case the file I imported was a local file, but it could also have been a file in a Cloud Storage bucket placed there earlier using gsutil .

Taking importing just a little step further, if I had to import a set of files, I could do that in a single command using a loop:
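A sketch of such a loop, assuming the CSV files sit in a Cloud Storage bucket and a shipment table exists in the playground dataset (all names are hypothetical):

```shell
for f in $(gsutil ls gs://my-bucket/shipments/*.csv)
do
  bq load \
    --noreplace \
    --skip_leading_rows=1 \
    --source_format=CSV \
    playground.shipment \
    "$f"
done
```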

Here I’m using UNIX-style line continuation (a backslash as the last character on a line) to improve legibility. A loop over the results of a file listing command loads each CSV file into a hypothetical shipment table in the dataset we created earlier. I'm using additional options to control the import, like not replacing the existing table contents (--noreplace) and skipping the CSV headers (--skip_leading_rows=1). The result will be a list of job ids, one for each imported CSV file. Note that other shells may have a slightly different syntax for looping.

The last essential step that needs to be addressed is getting data back out of BigQuery. There are basically two scenarios I’d like to demonstrate, the first of which is exporting table contents as they are. In its simplest form, dumping table contents to file goes like:
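For the flat shipment table, for instance (the bucket name is a placeholder, and the wildcard lets BigQuery shard large exports over multiple files):

```shell
bq extract playground.shipment 'gs://my-bucket/export/shipment-*.csv'
```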

Extractions can only be done to Cloud Storage, and that is why the destination URI points to a bucket. Use gsutil to get the files to your local system, as described earlier. The default export format is CSV, which does not support nested schemas. For tables like the order table above, we need to specify a format that does support nesting, for example:
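Newline-delimited JSON handles the nested order lines:

```shell
bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  playground.orders \
  'gs://my-bucket/export/orders-*.json'
```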

It is also possible to compress the output. See bq help extract for more information.

A more involved scenario is where we want to export the results of a query. Query results cannot be exported directly; the bq command can print them, but that is not very practical for larger datasets. The process must therefore be broken up into a few steps: first we perform the query, writing the results to a table, then we extract the data, and finally we drop the table with the query results. Creating a results table can be done as follows:
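A sketch of the three steps, with a deliberately trivial query and placeholder names:

```shell
QUERY="SELECT customer_id, COUNT(1) AS num_orders
FROM playground.orders
GROUP BY customer_id"

# Step 1: run the query, writing the results to a table.
bq query \
  --nouse_legacy_sql \
  --destination_table=playground.order_stats \
  "$QUERY"

# Step 2: extract the results table to Cloud Storage.
bq extract playground.order_stats 'gs://my-bucket/export/order_stats-*.csv'

# Step 3: drop the results table.
bq rm -f -t playground.order_stats
```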

Here I’ve assigned a fairly trivial query to a variable, like you would in a script. It is common to script these procedures so that they can run unattended, and in that case the procedure can be made a lot cleaner by using temporary tables. The results of every query are stored free of charge for a little while (currently around 24 hours), and these result sets can also be used to extract the data. To do this, we need to know which table the data were written to, and we can find that information in the job that ran the query:
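The job id is printed when the query runs; the one below is made up:

```shell
bq show --format=prettyjson -j bqjob_r12345_00000001
```

Among the extensive output, look for configuration.query.destinationTable, which holds the projectId, datasetId, and tableId of the temporary table, and for status.state, which tells you whether the job finished.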

The job description contains extensive information, including the temporary destination table. Its dataset and table ids can be used to export the data:
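The temporary table lives in a hidden dataset; the ids below are illustrative stand-ins for the values found in the job description:

```shell
bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  'my-project:_342b8c0e16d8f27a.anon0b2c17e5' \
  'gs://my-bucket/export/results-*.json'
```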

A few remarks are in order. Firstly, the output should be checked for the state of the job; there is no point in continuing if the state is anything but ‘DONE’. Also, if jobs that use temporary tables run concurrently, it is important to verify that the temporary table was created by the right job. If processes become this complex, it is probably a good moment to switch to a more suitable environment than the command line, such as a proper Google API client or a full-blown workflow manager like Airflow or Composer (Google’s name for their Airflow-based service) and similar solutions.

I hope to have given you a basic idea of how to interact with BigQuery from the command line and in basic shell scripts, without immediately having to develop API clients. Being able to use a one-off import or a simple cron job to do the work can be a great timesaver.