BigQuery ML is surprisingly simple. The first object you need to learn about is the model . Similar to tables and views, models are stored in datasets.

Creating the model is a two-part statement. The first part specifies the model parameters, including the name of the dataset and the model and the type of model. The second part specifies the data to use for training and includes the label. Every column that is not the label is taken to be the feature.

Consider the following query.

The first line defines the model by specifying the dataset and the name. The second line specifies the model type as Logistic Regression. The subsequent lines provide the training data by fetching some data from a BigQuery public dataset that hosts some data from an e-commerce site. Notice that the select has a column called label which is what we are trying to predict. Also, notice that it is set to either 0 or 1 because computers like dealing with numbers. The two new things in the query are how the table name uses a wildcard and the table suffix range. These are some of the features of BigQuery and are easy to learn.

The logistic regression model gives the probability that the data falls into one class which we represent as 1. If the probability is below 50%, then the data falls into the alternate class which is represented as 0. 50% is a threshold, which you can change to suit your needs. You can learn more about it here.

One thing to bear in mind when creating a model is that the label column can’t be NULL .

The create model syntax follows:



model_name

[OPTIONS(

[AS {CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}[OPTIONS( model_option_list )][AS query_statement

The model_type can be logistic_reg , linear_reg or kmeans .

If you know a bit about Machine Learning, you may want to specify the following: l1_reg , l2_reg , max_iteration , learn_rate_strategy , learn_rate , early_stop , data_split_method , data_split_col , ls_init_learning_rate , and warm_start . These parameters are all described here.

When the model is done training, you will get a confirmation, and the model will appear in your dataset.

Result of model training

You might be interested in the training options, which are visible when you scroll down the description page.

Model Training Options

You may click on the Training button to open the tab and see something similar to the following:

Training Report

The information is also available in tabular form. On each iteration, you can see how the loss reduces for both the training data and the evaluation data. You can also see the learning rate that was used for that training iteration.

Training Report

The evaluation metrics are also available. You can find out more about precision and recall here. You can find out more about ROC here. The confusion matrix tells you the accuracy of the model, which is the percentage of predictions that it got right.