How to do time series forecasting in BigQuery

Using an ARIMA model in BigQuery ML to do demand forecasting

Let’s carry out two-week forecasts of the number of bicycle rentals that will commence at the bicycle stations in Hyde Park, based on the past six weeks of data.

Gather training data

The first step, as with any machine learning problem, is to gather the training data and explore it. Assume that we have data on rentals until mid-June of 2015 and we’d like to predict for the rest of the month. We can gather the past six weeks of data using:

SELECT
  CAST(EXTRACT(date FROM start_date) AS TIMESTAMP) AS date
  , COUNT(*) AS numrentals
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE start_station_name LIKE '%Hyde%' -- all stations in Hyde Park
GROUP BY date
HAVING date BETWEEN '2015-05-01' AND '2015-06-15'
ORDER BY date
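For exploring the results locally, the same aggregation can be sketched in pandas. This is an illustration on a hypothetical sample of rows (the `rentals` DataFrame and its values are made up; the column names mirror the BigQuery table):

```python
import pandas as pd

# Hypothetical sample of cycle_hire rows; column names mirror the BigQuery table.
rentals = pd.DataFrame({
    'start_date': pd.to_datetime(['2015-05-01 09:10', '2015-05-01 17:45',
                                  '2015-05-02 11:00']),
    'start_station_name': ['Hyde Park Corner', 'Albert Gate, Hyde Park',
                           'Hyde Park Corner'],
})

# Same logic as the SQL: filter to Hyde Park stations, truncate each
# timestamp to its date, and count rentals per day.
hyde = rentals[rentals['start_station_name'].str.contains('Hyde')]
numrentals = (hyde.groupby(hyde['start_date'].dt.normalize())
                  .size()
                  .rename('numrentals'))
print(numrentals)
```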

Plotting this, we see a sort of weekly trend with higher rentals on weekends:

Historical data

Train ARIMA model

We can use this data to train an ARIMA model, telling BigQuery which column is the data column and which is the timestamp column:

CREATE OR REPLACE MODEL ch09eu.numrentals_forecast
OPTIONS(model_type='ARIMA',
        time_series_data_col='numrentals',
        time_series_timestamp_col='date') AS
SELECT
  CAST(EXTRACT(date FROM start_date) AS TIMESTAMP) AS date
  , COUNT(*) AS numrentals
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE start_station_name LIKE '%Hyde%' -- all stations in Hyde Park
GROUP BY date
HAVING date BETWEEN '2015-05-01' AND '2015-06-15'

Once the model is trained, we can evaluate it using ML.EVALUATE() and view the ARIMA coefficients using ML.ARIMA_COEFFICIENTS().

Forecasting

We can forecast the number of rentals for each of the next 14 days and also obtain 90% confidence bounds using:

SELECT * FROM ML.FORECAST(MODEL ch09eu.numrentals_forecast,
                          STRUCT(14 AS horizon, 0.9 AS confidence_level))
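The forecast output also includes a standard_error column, and under a normal approximation the confidence bounds correspond roughly to forecast_value ± z · standard_error. A quick sketch of that relationship, using only the Python standard library (an illustration of the statistics, not BigQuery's exact computation):

```python
from statistics import NormalDist

def interval_halfwidth(standard_error, confidence_level=0.9):
    """Half-width of a two-sided normal interval: z * standard_error."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return z * standard_error

# For a 90% interval the z-score is about 1.645, so a forecast with a
# standard error of 10 rentals gets bounds of roughly +/- 16.4 rentals.
print(round(interval_halfwidth(10, 0.9), 1))  # 16.4
```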

The history, forecast, and confidence bounds can be plotted using:

import matplotlib.pyplot as plt
import pandas as pd

def plot_historical_and_forecast(input_timeseries, forecast_output,
                                 timestamp_col_name, data_col_name):
    plt.figure(figsize=(20, 6))
    plt.plot(input_timeseries[timestamp_col_name],
             input_timeseries[data_col_name], label='Historical')
    plt.xlabel(timestamp_col_name)
    plt.ylabel(data_col_name)

    forecast_output['forecast_timestamp'] = pd.to_datetime(
        forecast_output['forecast_timestamp'])
    x_data = forecast_output['forecast_timestamp']
    y_data = forecast_output['forecast_value']
    confidence_level = forecast_output['confidence_level'].iloc[0] * 100
    low_CI = forecast_output['confidence_interval_lower_bound']
    upper_CI = forecast_output['confidence_interval_upper_bound']

    # Plot the forecast as a dashed line and label it for the legend
    plt.plot(x_data, y_data, alpha=1, label='Forecast', linestyle='--')

    # Shade the confidence interval
    plt.fill_between(x_data, low_CI, upper_CI, color='#539caf', alpha=0.4,
                     label=str(confidence_level) + '% confidence interval')

    # Display the legend
    plt.legend(loc='upper center', prop={'size': 16})

plot_historical_and_forecast(df, fcst, 'date', 'numrentals')

This yields:

Historical data and forecast data

But how well does this compare with what actually happened in the latter part of June? We can pull out the data for those days and compare it with the forecast time series:

Historical, forecast, and what actually happened.

Pretty cool, eh?

Forecasting a bunch of series

So far, I have been forecasting the overall rental volume for all the bicycle stations in Hyde Park. How do we predict the rental volume for each individual station? Use the time_series_id_col:

CREATE OR REPLACE MODEL ch09eu.numrentals_forecast
OPTIONS(model_type='ARIMA',
        time_series_data_col='numrentals',
        time_series_timestamp_col='date',
        time_series_id_col='start_station_name') AS
SELECT
  start_station_name
  , CAST(EXTRACT(date FROM start_date) AS TIMESTAMP) AS date
  , COUNT(*) AS numrentals
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
WHERE start_station_name LIKE '%Hyde%' -- all stations in Hyde Park
GROUP BY start_station_name, date
HAVING date BETWEEN '2015-01-01' AND '2015-06-15'

Note that instead of training on roughly six weeks of data (May 1 to June 15), I’m now training on a longer time period. That’s because an aggregate time series tends to be smoother and much easier to predict than the time series for an individual station, so we have to show the model a longer trend line.
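The smoothing effect of aggregation can be illustrated with simulated data: summing several independent noisy series yields a series whose relative variability is lower than any individual one. The stations, counts, and noise levels below are all made up for the illustration:

```python
import random
import statistics

random.seed(42)

# Simulate daily rentals at 5 hypothetical stations: a shared weekend
# bump (as in the Hyde Park data) plus independent station-level noise.
days = 90
stations = []
for _ in range(5):
    series = [50 + 20 * ((d % 7) >= 5) + random.gauss(0, 15)
              for d in range(days)]
    stations.append(series)

# The aggregate series sums all stations day by day.
aggregate = [sum(vals) for vals in zip(*stations)]

def cv(series):
    """Coefficient of variation: relative spread of the series."""
    return statistics.stdev(series) / statistics.mean(series)

# The aggregate is relatively smoother than an individual station,
# because the independent noise partially cancels out.
print(cv(stations[0]) > cv(aggregate))  # True
```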

Now, the model is not one ARIMA model, but a separate ARIMA model for each station name. Indeed, doing:

SELECT *
FROM ML.ARIMA_COEFFICIENTS(MODEL ch09eu.numrentals_forecast)
ORDER BY start_station_name

gives us a separate set of coefficients for each start_station_name:

Note something interesting: each station gets an ARIMA model of different complexity! Under the hood, BigQuery ML does automatic hyperparameter tuning. Although the model is called “ARIMA”, the underlying algorithm actually includes quite a few bells and whistles, including anomaly detection, holiday-effect modeling (the user needs to specify the holiday region), seasonality detection and modeling, and trend modeling. Plus, the different time series are trained in parallel.

When we do a prediction, we will get a prediction for each station and timestamp:

SELECT
  start_station_name
  , forecast_timestamp
  , forecast_value
FROM ML.FORECAST(MODEL ch09eu.numrentals_forecast,
                 STRUCT(3 AS horizon, 0.9 AS confidence_level))
ORDER BY start_station_name, forecast_timestamp

yields one time-series forecast per station:

The full notebook on GitHub has plots of the forecasts by station.
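Because the forecast comes back in long format (one row per station and timestamp), it helps to pivot it to one column per station before plotting each series. A sketch with a hypothetical two-station, two-day result (the station names match the data, but the values are made up):

```python
import pandas as pd

# Hypothetical long-format result from ML.FORECAST with time_series_id_col
# set: one row per (station, timestamp).
fcst = pd.DataFrame({
    'start_station_name': ['Hyde Park Corner', 'Hyde Park Corner',
                           'Knightsbridge', 'Knightsbridge'],
    'forecast_timestamp': pd.to_datetime(['2015-06-16', '2015-06-17',
                                          '2015-06-16', '2015-06-17']),
    'forecast_value': [410.0, 395.0, 180.0, 175.0],
})

# Pivot to one column per station so each series can be plotted on its own.
wide = fcst.pivot(index='forecast_timestamp',
                  columns='start_station_name',
                  values='forecast_value')
print(wide['Knightsbridge'].tolist())  # [180.0, 175.0]
```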

The model evaluation might help you anticipate which stations the model will perform better in: the lower the seasonality-corrected variance, the easier the series is to predict, and so the better the model should be. (You cannot use AIC here because the AICs of different time series are not comparable.)

SELECT * FROM ML.EVALUATE(MODEL ch09eu.numrentals_forecast)
ORDER BY variance DESC

Based on the results, we expect predictions at Hyde Park Corner to be the worst, and Knightsbridge to be the best:

Enjoy!

Next steps

Run through the full notebook on GitHub. To learn more about BigQuery ML, read Chapter 9 of BigQuery: The Definitive Guide. The book is periodically updated with these blog posts so that it remains, well, definitive.

Thanks to Xi Cheng and Amir Hormati for helpful suggestions.