I’ve recently had a chance to play with some of the newer tech stacks being used for Big Data and ML/AI across the major cloud platforms. Whilst there are strengths and weaknesses in all of the tools, one of the challenges I’ve become aware of is accessing scalable compute from within RStudio, a common IDE used by data scientists to produce analytics using the R and Python languages.

I’ve put this short guide together to show a clear example of just how easy it is to provision an RStudio instance on GCP and use that instance to access the scalable power of BigQuery to perform complex analytics. You might use this guide to run your own proof-of-concept or to perform ad-hoc data analysis for projects or assignments.

Oh, and the best bit? Because this is all done on the cloud, you can follow this guide from login to query in less than 30 minutes using the web-based console. You could, of course, deploy all of this programmatically, but we’ll keep it simple for now.

Let’s get started.

Create a project

A “project” is the top-level container for resources you deploy in GCP — you might be more used to hearing these called accounts or subscriptions in AWS and Azure respectively. You will need to set one up before you can deploy instances or use BigQuery.

I’m not going to cover the setup of a project — Google has done a great job of this for us:

https://cloud.google.com/resource-manager/docs/creating-managing-projects

Deploy RStudio

The Marketplace on GCP allows us to deploy ready-made images for everything from hosting websites through to running advanced machine learning algorithms. In this example, we’re going to deploy an RStudio Server Pro image, which can be found here:

https://console.cloud.google.com/marketplace/details/rstudio-launcher-public/rstudio-server-pro-for-gcp

Starting from the “Home” dashboard, we can search for “Marketplace” or access it through the menu on the left-hand side:

Searching for resources, such as the Marketplace, in the GCP portal

From there, we can search for “RStudio” in the search field in the middle of the page. This will return all related Marketplace results:

You can search the GCP marketplace for all sorts of pre-built images, such as one for RStudio

Either by following the link above, or the steps listed here, you should now be on the Marketplace page for “RStudio Server Pro for GCP”:

Each marketplace image has a page explaining a little bit about its purpose and pricing

Now, click on the big blue “Launch on Compute Engine” button. This will take you to a screen where you get to define options for the deployment of your RStudio server:

During deployment of an image, you can specify options that customise your instance, and see detailed pricing information

I recommend increasing your vCPUs to 2, using the “n1-standard-2” VM size, but otherwise, all of the default options should be sufficient. Once you are ready to deploy, scroll down the page and press “Deploy”.

We’re nearly there.

Next, you’ll be shown a page which will update you live on the provisioning of your RStudio server:

The deployment manager shows the current state of your deployments, as well as any warnings, advice or information that is relevant

Once this is complete, I recommend you read all of the information in the panel on the right-hand side — it will give you information such as the username and password for RStudio, as well as some helpful tips about how to configure and manage your instance once deployed.

That’s it — you now have a functional RStudio instance running on GCP!

You can access your RStudio instance using the “site address” shown in the deployment page

You can log in with the username and password shown back on the right-hand side of the deployment page.

Set up RStudio for BigQuery

To begin, you’re going to need to make a few small changes to enable RStudio to talk to BigQuery.

First, you will need to install the “readr” package — this helps display results returned from BigQuery. You can install it via the UI as shown, or by running the command install.packages("readr"):
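If you prefer the console, the one-time setup can be done in a couple of commands. Note that the RStudio Server Pro image may already ship with some of these, in which case the install is a no-op refresh:

```r
# One-time setup: install the packages used in this guide
install.packages("readr")      # helps display/handle results returned from BigQuery
install.packages("bigrquery")  # the BigQuery client package for R
```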

Installing packages and libraries in RStudio can be done via the console or via the UI

This may take 2–3 minutes to install, and you will see updates appear in the console window.

Once done, you need to load the “bigrquery” package by running the command library(bigrquery) , then define your GCP project ID, as shown:
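Taken together, the setup looks something like this — the project ID shown is a placeholder for your own:

```r
# Load the BigQuery client package
library(bigrquery)

# Replace with your own GCP project ID (placeholder value shown)
project_id <- "my-gcp-project"
```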

The RStudio console shows output from your actions within the IDE, such as installing packages or running commands.

You are now ready to run your first query!

Run your query

Next up, we’re going to define the query we want to run. In this example, we’re going to use a public dataset for our queries, but will move on later to running a query against private data you’ve uploaded.

First, define the SQL statement you wish to run with the following command:

sql <- "#standardSQL
SELECT year, month, day, weight_pounds
FROM `publicdata.samples.natality`
LIMIT 5;"

That’s it! You can now execute your query and get your results:
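As a sketch, executing the query with bigrquery’s query_exec() (the function referenced later in this guide) looks like this; project_id is assumed to hold the GCP project ID you defined earlier:

```r
# Execute the SQL defined above and preview the first rows.
# The "#standardSQL" directive in the query tells BigQuery to use standard SQL.
results <- query_exec(sql, project = project_id)
head(results)
```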

After being redirected to a sign-in page, you will need to copy and paste the given code into RStudio

You may be prompted to log in to your Google account to get access to BigQuery. In the example above, I’ve chosen to save my credentials and entered my authorization code, but you may choose not to do so if you are using a shared RStudio environment.
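If you are working in a shared or non-interactive environment, recent versions of bigrquery also support authenticating with a service account key instead of the browser flow — a sketch, assuming you have downloaded a key file (the path is a placeholder):

```r
# Non-interactive authentication with a service account key
# (hypothetical file path; avoids the browser sign-in prompt)
bq_auth(path = "/home/rstudio/my-service-account-key.json")
```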

Bonus: Uploading and querying custom datasets

In the example given above, we’ve used one of the many provided public datasets. However, in the real world, you likely need to use your own data, such as a web traffic report or a list of transactions. Here we will show you how to upload your own data into GCP for use with BigQuery and RStudio.

We’re going to primarily be following this guide from Google:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

First, you’ll need to create some storage. Navigate to the storage browser and click “Create bucket”:

Creating storage buckets is easy via the storage browser in the GCP portal

You can use the default settings, but to save some money and maintain data sovereignty, I’ve selected a bucket region within Australia using the regional storage class:

Whilst I’ve tweaked my settings shown, the default options will be sufficient for most users

Once created, you can upload your data — I’m using a 1000 row CSV file which contains randomly generated sales transactional-style data:

The GCS browser allows you to easily upload and manage files in your bucket

Next, we’re going to open the BigQuery UI, and after creating a dataset if we need to, we’re going to click the small “+” icon to create a new table:

You can create new datasets and tables via the BigQuery web UI — Please note the newer UI may look slightly different

Once you’ve filled in your details and let BigQuery handle the schema detection, hit “Create table”.

That’s it — your data is loaded in and accessible via BigQuery, which we can test directly via the web-based UI:

Or we can test via the RStudio instance we provisioned earlier, simply by updating our SQL query:

By updating the contents of the “sql” variable and re-executing the “query_exec” command, we can run a new query against BigQuery
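For example, re-running against the uploaded table might look like this — the dataset and table names below are placeholders for whatever you created in BigQuery:

```r
# Point the query at the custom table created above
# (project, dataset and table names are placeholders)
sql <- "#standardSQL
SELECT *
FROM `my-gcp-project.my_dataset.sales_transactions`
LIMIT 10;"

sales <- query_exec(sql, project = project_id)
head(sales)
```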

Thank you for reading — I hope this short guide helps you understand the possibilities of cloud technology and how it can help you to unlock exciting new data opportunities.

Find out more about Servian’s data and analytics capabilities, here.