Hello GeoHackWeek 2017 attendees. In this codelabs-style tutorial, we’ll walk through some of the public datasets available on Google Cloud Platform (GCP) and use some of the tools available for analyzing geospatial data.

Before you begin

Hopefully you already have a Google account with a GCP project set up. If not, check out the preliminary tutorial.

Once you have a GCP project with billing enabled (by starting a free trial or using a coupon, for example), you can move on to the next steps.

About these public datasets

There are several places to look for public datasets hosted by Google.

The dataset we’ll be exploring is the New York City Citi Bike trips dataset.

Browse the NYC Citi Bike Trips dataset in Google BigQuery

View the table in BigQuery by opening it in the BigQuery Web UI at https://bigquery.cloud.google.com/table/bigquery-public-data:new_york.citibike_trips?tab=schema

You’ll see which columns are in the table, as well as buttons for performing operations on it.

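Don’t run a SELECT * in BigQuery just to look at the data: it scans every column of the table, and those scanned bytes count toward your usage. Instead, use the Preview button to get a sample of rows from the table.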

Use the preview button to get a feel for the data.

Compose a BigQuery query

For our first query, let’s find out which stations are the most popular destinations.

Click the Compose query button.

We are going to use standard SQL syntax, which we need to enable in the options. Click the Show options button and uncheck the option to use legacy SQL syntax.

Use standard SQL syntax for the queries in this tutorial.

Now, enter the query text to find the most popular Citi Bike destinations.
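The query text itself is not reproduced here. As a rough sketch, a query along the following lines would count trips per destination station. The column name end_station_name is an assumption taken from the table's schema tab, so check it there before running; the alias num_trips, the LIMIT of 10, and the Python variable name are purely illustrative. The SQL is wrapped in a Python string only so that it can be reused from a notebook later.

```python
# Hypothetical sketch of a "most popular destinations" query.
# Assumes the trips table has an end_station_name column (see the schema tab).
POPULAR_DESTINATIONS_SQL = """
SELECT
  end_station_name,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY end_station_name
ORDER BY num_trips DESC
LIMIT 10
""".strip()

# Print the SQL so it can be copied into the query editor.
print(POPULAR_DESTINATIONS_SQL)
```

To run it in the web UI, paste only the SQL between the triple quotes into the query editor. The backticks around the table name are standard SQL syntax, which is why legacy SQL needs to be unchecked.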

Then click the Run query button.

In a few seconds, you should see a table of stations, sorted by how frequently they appear in the trips table.

Why does it say “1.15 GB processed”? This means that BigQuery scanned that many bytes to calculate the results of this query. The first 1 TB of query processing each month is free.
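To put those numbers in perspective, here is a back-of-the-envelope calculation of how many times a query of this size fits into the monthly free tier. It assumes 1 TB means 1024 GB (binary units) and that every run scans the full 1.15 GB; both are simplifying assumptions, and the variable names are illustrative.

```python
# Rough estimate only: assumes 1 TB = 1024 GB and no caching of results.
FREE_TIER_GB = 1024.0   # monthly free query processing, in GB
QUERY_GB = 1.15         # bytes processed by the example query, in GB

# Whole number of runs that fit inside the free tier.
runs_per_month = int(FREE_TIER_GB // QUERY_GB)
print(runs_per_month)
```

Under those assumptions the query could be run several hundred times a month before incurring charges, so experimenting with it during the tutorial is unlikely to cost anything.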

More complex queries

In another article, I use this dataset to find popular destinations for groups versus single riders.

Visualizing query results

One way to visualize the results of BigQuery queries is to use Cloud Datalab.

First, open the Google Cloud Platform console: https://console.cloud.google.com

Then, open Google Cloud Shell using the shell icon in the top-right corner.

Enter the command to create a new instance of Datalab.

datalab create geohackweek

Connect to your new instance using the Web Preview feature.

This will open a Jupyter notebook environment with some customizations for working with Google Cloud. To access Google BigQuery from Datalab, you can use the pandas.io.gbq module, which relies on the pandas-gbq package.

import pandas
# Run a standard SQL query and load the results into a DataFrame.
dataframe = pandas.io.gbq.read_gbq(
    'SELECT 1', project_id='myproj', dialect='standard')
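As a slightly more geospatial example, the sketch below fetches the busiest destination stations with their coordinates and draws them as a scatter plot. The column names (end_station_name, end_station_latitude, end_station_longitude) are assumptions based on the table schema, the function name and the LIMIT of 100 are made up for illustration, and 'myproj' stands in for your own project ID. Newer pandas releases may direct you to call the standalone pandas-gbq package instead of pandas.io.gbq.

```python
# Hypothetical sketch: plot popular destination stations by location.
# Column names are assumed from the public table's schema tab.
STATION_QUERY = """
SELECT
  end_station_name,
  end_station_latitude AS latitude,
  end_station_longitude AS longitude,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY end_station_name, end_station_latitude, end_station_longitude
ORDER BY num_trips DESC
LIMIT 100
"""


def plot_popular_stations(project_id):
    """Query BigQuery and scatter-plot stations, sized by trip count."""
    import pandas  # imported here so the query string is usable on its own

    df = pandas.io.gbq.read_gbq(
        STATION_QUERY, project_id=project_id, dialect='standard')
    # Scale marker area so the busiest station has size 200.
    sizes = df['num_trips'] / df['num_trips'].max() * 200
    ax = df.plot.scatter(x='longitude', y='latitude', s=sizes)
    ax.set_title('Most popular Citi Bike destinations')
    return ax
```

In a notebook cell, calling plot_popular_stations('myproj') should render the plot inline, provided matplotlib output is enabled in your notebook.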

Datalab also has built-in IPython magic commands for working with BigQuery and other Google Cloud APIs.

Shutting down your Datalab instance

To save on costs, you can stop your Datalab instance when you are not using it.

datalab stop geohackweek

This will shut down the instance, but won’t remove it or its storage.

To start it back up and reconnect, use the connect command.

datalab connect geohackweek

If you are done with your instance and have backed up your notebooks, you can remove it with the delete command.

datalab delete geohackweek

Additional Resources