A few months ago I joined the TalkingData AdTracking competition on Kaggle (a competition close to my daily job at the time). After applying the usual Logistic Regression, I moved my model to XGBoost, with very good results.

Today, after attending Google Cloud Next '18 in London, I thought it was a good chance to revisit my work using the new BigQuery ML framework. Let's see what I got out of it…

Loading the data

The first step was loading the data into BigQuery. To do that, I uploaded the files to Google Cloud Storage (GCS) and ingested them from there (the files were too big for a direct upload). The operation is very simple from the Google Cloud Console for BigQuery.
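For reference, the same load can be scripted with the bq command-line tool; the bucket, dataset and table names below are placeholders:

```bash
# Load a gzipped CSV from GCS into a BigQuery table (schema auto-detected).
bq load --source_format=CSV --skip_leading_rows=1 --autodetect \
  talkingdata.train \
  gs://my-bucket/talkingdata/train.csv.gz
```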

Basics of model training

Training a model in BigQuery ML is very simple: it's just a CREATE MODEL statement wrapped around a slightly extended SELECT, as follows:
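A minimal sketch of such a statement (the dataset, table and model names are placeholders, and only the raw competition columns are used):

```sql
-- Minimal logistic regression on the raw columns; all names are placeholders.
CREATE OR REPLACE MODEL `talkingdata.model_baseline`
OPTIONS(
  model_type='logistic_reg',           -- binary classification
  input_label_cols=['is_attributed']   -- competition label: 1 if the app was downloaded
) AS
SELECT
  ip, app, device, os, channel,        -- all INT64 in the source data
  is_attributed
FROM `talkingdata.train_sample`;
```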

However, if we look at the “model schema” tab in BigQuery, we notice that all the features are numeric. The BigQuery ML documentation explains what happens to them:

Numerical variables (of type NUMERIC, FLOAT64 or INT64) are standardized by default. In addition, standardization is automatically applied during prediction.
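The per-feature statistics can also be inspected with ML.FEATURE_INFO (model name as in the sketch above):

```sql
-- Shows min, max, mean, stddev, etc. for every input feature of the model.
SELECT *
FROM ML.FEATURE_INFO(MODEL `talkingdata.model_baseline`);
```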

This is definitely not what we want: app, device, os and channel are categorical identifiers, not quantities…

Not so fast…

Features

In order to “force” BigQuery to use one-hot encoding, even if the columns are numeric, we can CAST them to STRING. This simple change brought the evaluation score from 0.0137 to 0.0103 with half the number of iterations.
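A sketch of the change, reusing the placeholder names from above:

```sql
-- Casting the categorical IDs to STRING makes BigQuery ML one-hot encode
-- them instead of standardizing them as numbers.
CREATE OR REPLACE MODEL `talkingdata.model_categorical`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_attributed']
) AS
SELECT
  CAST(app AS STRING)     AS app,
  CAST(device AS STRING)  AS device,
  CAST(os AS STRING)      AS os,
  CAST(channel AS STRING) AS channel,
  is_attributed
FROM `talkingdata.train_sample`;
```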

Extract validation set

Another important change we need to take into account is the right selection of the validation set. BigQuery ML recognizes that a TIMESTAMP column can be very important, as it defines an implicit ordering of the data. Using the data_split_method and data_split_col options, we can define which column to use to split the data.
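For instance, a sequential split on click_time could look like this (placeholder names again); per the documentation, the split column is excluded from the features automatically:

```sql
-- 'SEQ' keeps the most recent rows (by click_time) as the validation set.
CREATE OR REPLACE MODEL `talkingdata.model_seq_split`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_attributed'],
  data_split_method='SEQ',
  data_split_col='click_time'
) AS
SELECT
  CAST(app AS STRING)     AS app,
  CAST(device AS STRING)  AS device,
  CAST(os AS STRING)      AS os,
  CAST(channel AS STRING) AS channel,
  click_time,              -- used only for the split, not as a feature
  is_attributed
FROM `talkingdata.train_sample`;
```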

The validation score is now 0.0103, so we are getting better at each step.

More features? Cross-features?

Other features can be added to the model, such as the following (a sketch is shown after the list):

hour of day

app/device

os/channel
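A sketch of how these could be expressed in the SELECT feeding the model (the derived column names are my own):

```sql
SELECT
  CAST(EXTRACT(HOUR FROM click_time) AS STRING)             AS hour_of_day,  -- hour of day
  CONCAT(CAST(app AS STRING), '_', CAST(device AS STRING))  AS app_device,   -- cross-feature
  CONCAT(CAST(os AS STRING), '_', CAST(channel AS STRING))  AS os_channel,   -- cross-feature
  is_attributed
FROM `talkingdata.train_sample`;
```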

Train model

After testing that all my queries worked correctly, I loaded the whole training data and ran my final version:
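A sketch of what that final statement could look like, combining the casts, the extra features and the sequential split (all names are placeholders):

```sql
CREATE OR REPLACE MODEL `talkingdata.model_full`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_attributed'],
  data_split_method='SEQ',
  data_split_col='click_time'
) AS
SELECT
  CAST(app AS STRING)     AS app,
  CAST(device AS STRING)  AS device,
  CAST(os AS STRING)      AS os,
  CAST(channel AS STRING) AS channel,
  CAST(EXTRACT(HOUR FROM click_time) AS STRING)             AS hour_of_day,
  CONCAT(CAST(app AS STRING), '_', CAST(device AS STRING))  AS app_device,
  CONCAT(CAST(os AS STRING), '_', CAST(channel AS STRING))  AS os_channel,
  click_time,
  is_attributed
FROM `talkingdata.train`;
```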

Training took quite a while (more than 10 minutes) on the whole dataset (almost 8 GB), reaching a validation score of 0.0112.
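The same metrics can be retrieved in SQL with ML.EVALUATE, whose output for a logistic regression includes log_loss:

```sql
SELECT *
FROM ML.EVALUATE(MODEL `talkingdata.model_full`);
```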

Backfill predictions

In order to create a table with the predictions for the training dataset, we run:
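A sketch, with the same placeholder names; ML.PREDICT expects the same feature columns the model was trained on:

```sql
CREATE OR REPLACE TABLE `talkingdata.train_predictions` AS
SELECT *
FROM ML.PREDICT(
  MODEL `talkingdata.model_full`,
  (SELECT
     CAST(app AS STRING)     AS app,
     CAST(device AS STRING)  AS device,
     CAST(os AS STRING)      AS os,
     CAST(channel AS STRING) AS channel,
     CAST(EXTRACT(HOUR FROM click_time) AS STRING)             AS hour_of_day,
     CONCAT(CAST(app AS STRING), '_', CAST(device AS STRING))  AS app_device,
     CONCAT(CAST(os AS STRING), '_', CAST(channel AS STRING))  AS os_channel
   FROM `talkingdata.train`));
```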

ML.PREDICT essentially appends additional columns to the output table: these columns contain the predicted label and the associated probabilities.

How about the Kaggle score?

After creating the predictions, we can export a submission file and upload it to Kaggle in order to calculate our score.

I ran the following queries to reformat the dataset as expected by the competition rules:
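A sketch of such a query: for a logistic regression, ML.PREDICT returns a predicted_is_attributed_probs array of (label, prob) structs, from which we keep the probability of the positive class (table names are still placeholders):

```sql
CREATE OR REPLACE TABLE `talkingdata.submission` AS
SELECT
  click_id,                  -- non-feature columns are passed through by ML.PREDICT
  probs.prob AS is_attributed
FROM ML.PREDICT(
  MODEL `talkingdata.model_full`,
  (SELECT
     click_id,
     CAST(app AS STRING)     AS app,
     CAST(device AS STRING)  AS device,
     CAST(os AS STRING)      AS os,
     CAST(channel AS STRING) AS channel,
     CAST(EXTRACT(HOUR FROM click_time) AS STRING)             AS hour_of_day,
     CONCAT(CAST(app AS STRING), '_', CAST(device AS STRING))  AS app_device,
     CONCAT(CAST(os AS STRING), '_', CAST(channel AS STRING))  AS os_channel
   FROM `talkingdata.test`)),
UNNEST(predicted_is_attributed_probs) AS probs
WHERE probs.label = 1;
```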

I then exported the table to GCS to get a nicely formatted CSV file, ready for my submission.
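With the bq tool, the export could look like this (the bucket name is a placeholder):

```bash
# Export the submission table to a CSV file on GCS.
bq extract --destination_format=CSV \
  talkingdata.submission \
  gs://my-bucket/talkingdata/submission.csv
```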

I got a score of 0.9536157, which would have put me in position 2839 (at the time of writing). Considering it took roughly an hour to do this, it’s not too bad at all!

What else?

There is plenty I have left out of this very short tutorial! If you want to play with this example, these are the things I'd try next (a combined sketch follows the list):

play with regularisation (both L1 and L2 are supported);

play with learning rate;

use a different split for the data (BigQuery ML supports using an additional column to define the boundary between training and validation);
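For reference, here is how those options fit into a single OPTIONS clause; all the values are purely illustrative:

```sql
CREATE OR REPLACE MODEL `talkingdata.model_tuned`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_attributed'],
  l1_reg=0.1,                        -- L1 regularisation
  l2_reg=0.1,                        -- L2 regularisation
  learn_rate_strategy='constant',
  learn_rate=0.1,                    -- fixed learning rate
  data_split_method='CUSTOM',
  data_split_col='is_eval'           -- BOOL column: TRUE rows form the validation set
) AS
SELECT
  CAST(app AS STRING)     AS app,
  CAST(device AS STRING)  AS device,
  CAST(os AS STRING)      AS os,
  CAST(channel AS STRING) AS channel,
  click_time >= TIMESTAMP('2017-11-08') AS is_eval,  -- illustrative boundary
  is_attributed
FROM `talkingdata.train`;
```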

Enjoy BigQuery ML!