Whisked Away by BigQuery ML

Predicting Whiskey Preferences in GCP

“I’m looking for…

…anything that is peaty, smoky like a campfire on the beach, medium to high alcohol, dried fruits, citrus, and has a subtle hint of terra firma.”

It’s easy for me to walk into my local liquor store and talk tasting notes (and joke) with the whiskey guy or gal on staff, and he or she will know what I’m looking for and make a recommendation. But can a machine learning model do the same on a minimal data set with limited inputs? That’s the question at hand.

(For those of you joining in now, thanks for coming along on this part of the story! I recommend you take a look at the other blogs in this series, but if you don’t, the summary is I have a bunch of whiskey data with my tasting preferences. Let’s see what we can learn from it.)

BigQuery ML in a nutshell

For this exercise I’ll be using Google Cloud Platform’s BigQuery ML. As the name suggests, it’s a tool that lets you create and run machine learning models directly in BigQuery.

BQML is an incredible advantage for anyone who has tons of data within BigQuery (or anywhere that can easily transfer data to BQ) because the data can stay put, with no need to define data frames. Not only that, we can manage and run our models in BigQuery as well; this means no more ML-specific virtual (or, I suppose, physical) machines to provision. And finally, we can create these models using specialized SQL commands available in BQ. This isn’t like other SQL plugins that require tons of user-defined functions, classes, and other fun things that keep us from developing real content; it works right out of the box. How cool is that?!
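To give a flavor of what those specialized SQL commands look like, here’s a minimal sketch of training and querying a linear regression model in BQML. The dataset, table, and column names (`whiskey.tastings`, `my_rating`, and so on) are hypothetical placeholders, not the actual schema from this series:

```sql
-- Train a linear regression model directly on data already in BigQuery --
-- no export step, no VM to provision. (Hypothetical table/column names.)
CREATE OR REPLACE MODEL `whiskey.rating_model`
OPTIONS (
  model_type = 'linear_reg',        -- a simple regression
  input_label_cols = ['my_rating']  -- the column we want to predict
) AS
SELECT
  price,
  age,
  abv,
  region,
  my_rating
FROM
  `whiskey.tastings`;

-- Once trained, prediction is just another query:
SELECT *
FROM ML.PREDICT(
  MODEL `whiskey.rating_model`,
  (SELECT price, age, abv, region FROM `whiskey.new_bottles`)
);
```

That’s the whole loop: one statement to train, one to predict, all in the same place the data lives.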

And you chose that why…?

Why should we care about these features? Shouldn’t I try using “a real ML framework” like TensorFlow? Well, yes… maybe. I could try to do this exercise in a “real framework” like TensorFlow. But, there are a few reasons I didn’t go down that path.

For one, I didn’t want to write another “how to do your first whatever with TensorFlow” blog. That’s been blogged to death already. Also, someone like Sara Robinson will do a much better job of producing cool, avant-garde TensorFlow content than me. BigQuery ML, here we come!

Another reason is that in previous blogs I have talked about my love of simplicity when designing systems that work together. I like things to have as few moving parts and failure points as possible, and to be as automated as possible from the outset. BigQuery ML is simple, it’s cheap for me to experiment with, and it fits my current tech stack. Using it was a no-brainer for me.

Finally, my requirements didn’t dictate anything high-horsepower. My training set is just over 100 rows (all the whiskies I tried in 2018), and I’m doing pretty simple regressions for prediction. For these reasons, I think the risk of overfitting with a more robust, highly engineered solution was high. (Comment and discuss below, if you’d like. If there is a lot of response to this, maybe I’ll build some TensorFlow models and compare them to BQML in a future post.)

Regardless of my reasons for choosing BigQueryML, let’s get down to the business of using it.

Big queries, bigger ML?

As I said above, using BigQuery ML is pretty dang easy, mostly due to the feature set of the product. However, just because it’s easy to use doesn’t mean it’s without nuance. Let’s create an example model based on my data, evaluate it, and have a little chat about it.

I’ll have what he’s having

In some earlier blogs in this series, I provided a quick rundown of what the whiskey rating data looked like. Here’s a quick reminder: