TL;DR: If you like to iterate quickly, you can build simple ML models right from BigQuery SQL. Here I used BigQuery ML to build a linear regression model for taxi fare prediction in 30 minutes while training on all 55M rows of data.

A couple of weeks ago I got an email from Kaggle (a machine learning competitions website) about the NYC Taxi Fare Prediction educational competition. The goal there is to predict the price of a taxi ride in New York. The following day Google announced BigQuery ML, which lets you train linear/logistic regression models right from SQL.

Short disclaimer: I had never used BigQuery extensively before, and I have mostly hobbyist exposure to ML. So applying basic ML and trying out new tech while solving a real-world problem sounded like a good training ground to me.

Data

The competition dataset is hosted as a public dataset on BigQuery. If you have an account on Google Cloud Platform (GCP), you can just copy the tables from the public UI into your GCP project.

What Does the Data Look Like?

The dataset for the Taxi Fare competition is a ~6GB CSV file. Each row has:

pick-up point latitude/longitude

drop-off point latitude/longitude

pick-up timestamp

number of passengers

price of the trip

That's 55,000,000+ trips across New York in total.
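To make the row layout above concrete, here is a minimal Python sketch of how one row could be parsed. Treat it as an illustration only: the column names (fare_amount, pickup_datetime and so on), the timestamp format, and the sample row are my assumptions, not something taken from the competition files, so check them against the real CSV header before reusing this.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TaxiTrip:
    """One row of the dataset. Field names are assumed, not confirmed."""
    pickup_datetime: datetime
    pickup_longitude: float
    pickup_latitude: float
    dropoff_longitude: float
    dropoff_latitude: float
    passenger_count: int
    fare_amount: float  # the price of the trip, i.e. the value to predict


def parse_trip(row: dict) -> TaxiTrip:
    # Assumed timestamp format: "YYYY-MM-DD HH:MM:SS UTC"
    ts = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S UTC")
    return TaxiTrip(
        pickup_datetime=ts.replace(tzinfo=timezone.utc),
        pickup_longitude=float(row["pickup_longitude"]),
        pickup_latitude=float(row["pickup_latitude"]),
        dropoff_longitude=float(row["dropoff_longitude"]),
        dropoff_latitude=float(row["dropoff_latitude"]),
        passenger_count=int(row["passenger_count"]),
        fare_amount=float(row["fare_amount"]),
    )


# Made-up sample row, for illustration only (not real competition data).
SAMPLE_CSV = (
    "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,passenger_count\n"
    "7.5,2012-03-04 18:30:00 UTC,-73.985,40.758,-73.968,40.785,2\n"
)

trips = [parse_trip(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(trips[0])
```

With the full 6GB file you would not want to load everything into a list like this, which is part of why keeping the data in BigQuery is attractive.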