BigQuery is a fully managed data warehouse solution provided by Google Cloud Platform. It is one of the hottest tools in Google’s arsenal, and the way it scales to petabytes of data while still returning results within seconds is nothing short of magical. Users write their queries in familiar SQL, and the execution is handled transparently for them.

BigQuery has two main components: 1) storage (Colossus) and 2) the query engine (Dremel). The two are connected by a very fast network layer named Jupiter, which can deliver on the order of a petabit per second of total bisection bandwidth. Colossus is a highly optimized columnar store that keeps data compressed and is built for blazing-fast retrieval of structured data. Dremel, the execution engine, converts your SQL into an execution tree and splits the computation across ‘slots’ and ‘mixers’. Slots are the leaf nodes of the tree and read data from Colossus, while the mixer nodes handle the aggregations. Read more about BigQuery’s architecture here.

Little did I know that BigQuery has an ML interface as well. It lets you create machine learning models within the comfort of SQL. Currently, BigQuery ML (BQML) supports only linear regression, binary and multi-class logistic regression, and k-means clustering.

In this post I’ll walk you through a simple binary logistic regression example in which we’ll predict whether a person has diabetes or not based on certain features.

So let’s get started.

Download the dataset

We’ll be using the `Pima Indians Diabetes` dataset provided by UCI. You can download it from Kaggle for free here. The dataset has 8 features (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) that can be used to predict whether a person has diabetes. The Outcome column is the label and denotes whether a person has diabetes or not. During training we’ll use the Outcome column for learning, and during prediction we’ll predict its value.

Loading the dataset into BigQuery

Go to Big Data > BigQuery in your console.

Create a dataset if you don’t already have one. In BigQuery, every table has to belong to a dataset.

Click the Create Dataset button and enter a name for the dataset. Let’s name it prima_indian_diabetes.

Create dataset in BigQuery

Next we’ll create a table from the CSV that we downloaded. Select the newly created dataset in the left panel and click the Create Table pill.

Create table from web console

Select Upload as the source and upload the CSV we downloaded. Give your table a name; I’ll name it train_data. Select Auto-Detect for the schema; our CSV file has headers, which will be detected automatically. Leave all the other options at their defaults and hit Create Table.

Create table web form

It will take a couple of seconds for the import to complete. Once it’s done, check the schema of the table. As you can see, all the columns were imported with the right data types.

Auto-Detected schema
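For reference, the auto-detected schema should look roughly like the DDL below. This is a sketch based on the column types in the Kaggle CSV, not the console output, so your detected types may differ slightly:

```sql
-- Approximate schema that auto-detect should produce for train_data.
CREATE TABLE `prima_indian_diabetes.train_data` (
  Pregnancies INT64,
  Glucose INT64,
  BloodPressure INT64,
  SkinThickness INT64,
  Insulin INT64,
  BMI FLOAT64,
  DiabetesPedigreeFunction FLOAT64,
  Age INT64,
  Outcome INT64
);
```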

Fit the Model

BigQuery ML can automatically split your data into training and hold-out sets to avoid overfitting, but here we’ll create our own split so we control exactly which rows are held out for evaluation.

Train Test Split

We’ll create a 75–25 train-test split so that we can evaluate the model once it’s trained. We’ll create a fingerprint of each row by concatenating all of its columns and hashing the result, and assign each row a ROW_SPLIT_ID between 0 and 3. Rows with ROW_SPLIT_ID in (0, 1, 2) will be used for training, and rows with ROW_SPLIT_ID = 3 for evaluation.
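The original query embed isn’t reproduced here, so below is a minimal sketch of the split using FARM_FINGERPRINT to hash the concatenated columns; the train_data_split table name is a placeholder of my choosing:

```sql
-- Bucket every row into one of four splits (0–3) based on a hash of its contents.
CREATE OR REPLACE TABLE `prima_indian_diabetes.train_data_split` AS
SELECT
  *,
  MOD(ABS(FARM_FINGERPRINT(CONCAT(
    CAST(Pregnancies AS STRING), CAST(Glucose AS STRING),
    CAST(BloodPressure AS STRING), CAST(SkinThickness AS STRING),
    CAST(Insulin AS STRING), CAST(BMI AS STRING),
    CAST(DiabetesPedigreeFunction AS STRING), CAST(Age AS STRING),
    CAST(Outcome AS STRING)
  ))), 4) AS ROW_SPLIT_ID
FROM `prima_indian_diabetes.train_data`;
```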

Model Fitting
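Since the original gist isn’t embedded either, here’s a sketch of the CREATE MODEL statement that trains a binary logistic regression on the three training buckets (the diabetes_model name is my placeholder):

```sql
-- Train a logistic regression on splits 0–2, using Outcome as the label.
CREATE OR REPLACE MODEL `prima_indian_diabetes.diabetes_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['Outcome']
) AS
SELECT
  * EXCEPT(ROW_SPLIT_ID)
FROM `prima_indian_diabetes.train_data_split`
WHERE ROW_SPLIT_ID IN (0, 1, 2);
```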

Paste the above query and hit run; the model fitting should start. The query will take a minute or two to complete. Once it finishes, you should see the model in your dataset, and you can check its stats by clicking on it.

Model stats

Model Evaluation

Now let’s evaluate the model on our validation split using the ML.EVALUATE function. ML.EVALUATE takes two parameters: the model and an evaluation dataset (a table or query).
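A sketch of the evaluation query, reusing the split table and model name from the sketches above:

```sql
-- Evaluate on the held-out bucket (ROW_SPLIT_ID = 3).
SELECT *
FROM ML.EVALUATE(
  MODEL `prima_indian_diabetes.diabetes_model`,
  (
    SELECT * EXCEPT(ROW_SPLIT_ID)
    FROM `prima_indian_diabetes.train_data_split`
    WHERE ROW_SPLIT_ID = 3
  )
);
```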

The above query will return the following evaluation metrics:

precision, recall, accuracy, area under the ROC curve (AUC), log loss, and F1 score.

Evaluation Result

We got an AUC of 0.86, which is quite good for the amount of tuning (or lack thereof) that we did. You can read the CREATE MODEL doc here and tune the training using the OPTIONS clause during model creation.
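As an illustration, OPTIONS lets you control things like regularization and the number of training iterations. The values below are arbitrary placeholders, not tuned settings:

```sql
-- Hypothetical re-training with a few explicit training options.
CREATE OR REPLACE MODEL `prima_indian_diabetes.diabetes_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['Outcome'],
  max_iterations = 50,  -- allow more training iterations (default is 20)
  l2_reg = 0.1,         -- apply L2 regularization
  early_stop = TRUE     -- stop once loss improvement becomes marginal
) AS
SELECT
  * EXCEPT(ROW_SPLIT_ID)
FROM `prima_indian_diabetes.train_data_split`
WHERE ROW_SPLIT_ID IN (0, 1, 2);
```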

Bonus

Prediction using the model.

We’ll use the ML.PREDICT function (which has a similar signature to ML.EVALUATE) to predict the outcome.
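A sketch of the prediction query; here it is fed the held-out rows, but any table or query with the same feature columns would work:

```sql
-- Predict the outcome for the held-out rows (label column omitted on purpose).
SELECT *
FROM ML.PREDICT(
  MODEL `prima_indian_diabetes.diabetes_model`,
  (
    SELECT Pregnancies, Glucose, BloodPressure, SkinThickness,
           Insulin, BMI, DiabetesPedigreeFunction, Age
    FROM `prima_indian_diabetes.train_data_split`
    WHERE ROW_SPLIT_ID = 3
  )
);
```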

Paste the above query and hit run. You’ll get a table with the probability of each class (0/1) and, derived from those probabilities, a predicted_Outcome column.

Predicted Output

And there you have it: a binary logistic regression model written entirely in SQL, in under 15 minutes.

If you find any bugs in the code or have any doubts, feel free to drop a comment.

Till then, happy coding :)