In my book (Data Science on the Google Cloud Platform), I walk through a flight-delay prediction problem and show how to address it using a variety of tools including Spark Mlib and TensorFlow. Now that BigQuery ML has been announced, I thought I’d show how to predict flight delays using BQ ML.

Predict whether this flight will arrive late. Using only SQL.

Make no mistake — you still have to collect the data, explore it, clean it up, and enrich it. Essentially all the stuff I do in Chapter 1–9. In Chapter 10, I used TensorFlow. In this article, I will use BQML.

Create Regression Model

Here’s a BigQuery query to create the model:

#standardsql

CREATE OR REPLACE MODEL flights.arrdelay OPTIONS

(model_type='linear_reg', input_label_cols=['arr_delay']) AS SELECT

arr_delay,

carrier,

origin,

dest,

dep_delay,

taxi_out,

distance

FROM

`cloud-training-demos.flights.tzcorr`

WHERE

arr_delay IS NOT NULL

Note that:

It starts with a CREATE MODEL, with the name of the model looking just like a table name. Note: ‘flights’ is the name of the dataset to store the resulting model — so, you’ll need to create an empty dataset before running the query. The options specify the algorithm — in this case, a linear regression algorithm, with arr_delay being the label Essentially, I’m pulling in the predictor and label variables in the SELECT

About 10 minutes later, the model is trained and evaluation results have been populated for each iteration:

The loss here is mean squared error, so the model converges on iteration #6 with a RMSE of about sqrt(97) = 10 minutes.

Predict using Model

The purpose of training a model is to predict with it. You can do model predictions with a SQL statement:

#standardsql

SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,

(

SELECT

carrier,

origin,

dest,

dep_delay,

taxi_out,

distance,

arr_delay AS actual_arr_delay

FROM

`cloud-training-demos.flights.tzcorr`

WHERE

arr_delay IS NOT NULL

LIMIT 10))

This results in:

As you can see, because we trained the model to predict a variable called “arr_delay”, ML.PREDICT creates a result column named predicted_arr_delay. In this case, I’m pulling 10 rows from the original table and predicting the arrival delay for those flights.

Create Classification Model

In the book, I don’t actually try to predict the arrival delay as such. Instead, I predict the probability that a flight will be more than 15 minutes late. This is a classification problem, and you can do that by changing the training query slightly:

#standardsql

CREATE OR REPLACE MODEL flights.ontime

OPTIONS

(model_type='logistic_reg', input_label_cols=['on_time']) AS SELECT

IF(arr_delay < 15, 1, 0) AS on_time,

carrier,

origin,

dest,

dep_delay,

taxi_out,

distance

FROM

`cloud-training-demos.flights.tzcorr`

WHERE

arr_delay IS NOT NULL

Here’s the evaluation results:

and an example of the predictions:

Evaluating the model

It is possible to evaluate the model on an independent dataset. I don’t have one handy, so I’ll just show you how to run the evaluation on the same dataset the model was trained on:

#standardsql

SELECT * FROM ML.EVALUATE(MODEL flights.ontime,

(

SELECT

IF(arr_delay < 15, 1, 0) AS on_time,

carrier,

origin,

dest,

dep_delay,

taxi_out,

distance

FROM

`cloud-training-demos.flights.tzcorr`

WHERE

arr_delay IS NOT NULL

))

The result:

BQML is really easy and really powerful. Enjoy!