iOS and Alexa Apps

Before building and training the model, I first needed data. But I hadn’t seen any dataset on coffee drinking preferences. (If you see one in the wild, please let me know.) So I decided I’d have to collect the data myself.

In order to collect the data, I first had to build something that would let me record my preferences.

I needed to be able to record when I drink coffee, what I drink, and the weather conditions at the time. So I decided to build both an iOS app and an Alexa skill that do all of this.

The iOS app is open-sourced here and is available on the App Store as well. It’s pretty simple, and it’s still a work in progress, but I’m pretty happy with the first version. It obtains the user’s current location, looks up the current weather conditions using OpenWeatherMap, and lets the user select the type of coffee they’re having, all of which is saved to Firebase.
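To give a feel for the weather lookup the app performs, here’s a minimal Python sketch against OpenWeatherMap’s standard current-weather endpoint. The API key is a placeholder, and the exact fields recorded are my assumption based on the columns described later in this post, not the app’s actual source:

```python
# Sketch of the weather lookup, assuming OpenWeatherMap's documented
# "current weather" endpoint and response format. API_KEY is a placeholder.
API_KEY = "YOUR_OPENWEATHERMAP_KEY"

def weather_url(lat, lon, api_key=API_KEY):
    """Build the request URL for current conditions at a coordinate."""
    return ("https://api.openweathermap.org/data/2.5/weather"
            f"?lat={lat}&lon={lon}&units=imperial&appid={api_key}")

def parse_conditions(payload):
    """Pull out the fields recorded in the dataset: temp, wind, condition."""
    return {
        "temp": payload["main"]["temp"],
        "windSpeed": payload["wind"]["speed"],
        "weatherCond": payload["weather"][0]["main"],  # e.g. 'Clear', 'Rain'
    }

# A trimmed-down example of what the endpoint returns:
sample = {"main": {"temp": 41.0}, "wind": {"speed": 7.2},
          "weather": [{"main": "Rain"}]}
print(parse_conditions(sample))
```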

The Alexa app, which is also open-sourced, is available as a skill here:

The Alexa skill detects the user’s current location (I’ll have another post soon on how I did this) and lets the user say what kind of coffee they’re drinking:

‘Alexa, tell CoffeeBot I’m having Iced Coffee’

or

‘Alexa, tell CoffeeBot I’m having Hot Coffee’

It retrieves the current weather conditions at the user’s location, records the coffee preference (hot or iced), and saves everything to Firebase. The skill is available to use now!
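For the curious, here’s a hedged sketch of how a skill like this handles a request as a raw AWS Lambda function. The intent name RecordCoffeeIntent and the coffeeType slot are hypothetical names for illustration, not the actual skill’s:

```python
# Hypothetical CoffeeBot-style intent handler using the raw Alexa
# request/response JSON shapes (no ASK SDK). Intent and slot names
# are illustrative assumptions, not the real skill's.
def build_response(speech):
    """Wrap speech text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

def handler(event, context=None):
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "RecordCoffeeIntent"):
        coffee = request["intent"]["slots"]["coffeeType"]["value"]  # 'hot' or 'iced'
        # ...look up the weather and save the record to Firebase here...
        return build_response(f"Saved your {coffee} coffee. Enjoy!")
    return build_response("Tell me what coffee you're having.")
```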

Pretty cool huh?

Building the Dataset

So after building both the iOS app and the Alexa skill, I had to start recording my preferences, which I’ve done for the better part of the past year. Currently, I have more than 500 rows of data, which is a pretty good start. I generally have 2 cups a day. I also have a good friend who likes coffee almost as much as I do, and they’ve been helping out by adding their preferences via the iPhone app. I’ll have more data as other users (like yourself) download the app and save their preferences!

I’ve made the dataset available on Kaggle for anyone to view:

Let’s start analyzing this data! First, we need to download it from Firebase. I won’t go into detail on how this is done, but here’s a link to the full script:

Basically, I downloaded the Firebase data, cleaned it up, and wrote it to a CSV for processing.
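A rough sketch of what that script does, assuming the Firebase Realtime Database REST API (appending “.json” to the database URL returns the whole tree as JSON); the database URL and field names here are placeholders based on the columns described in this post:

```python
import csv

# Download step (requires network; URL is a placeholder):
#
#   import json, urllib.request
#   DB_URL = "https://your-project.firebaseio.com/coffees.json"
#   with urllib.request.urlopen(DB_URL) as resp:
#       records = json.load(resp)

def records_to_rows(records):
    """Flatten Firebase's dict of push-key -> record into a list of rows."""
    return list(records.values())

def write_csv(rows, path):
    """Write the rows to CSV, keeping only the columns used later."""
    fields = ["type", "temp", "windSpeed", "weatherCond"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```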

In order to build an accurate model, I spent more than a month just cleaning the data and deciding which features would be best for training. I went through a few iterations to see which features worked best. Eventually, this is what I decided on.

First, I dropped the columns that wouldn’t help with the prediction:
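As an illustration of that step in pandas (the column names timestamp and userId are hypothetical, since the dropped columns aren’t listed above):

```python
import pandas as pd

# Drop columns that don't help with the prediction.
# 'timestamp' and 'userId' are hypothetical examples of such columns.
data = pd.DataFrame({
    "timestamp": ["2018-01-01", "2018-06-01"],
    "userId": ["a", "b"],
    "temp": [31.4, 82.9],
    "type": [1, 0],
})
data = data.drop(columns=["timestamp", "userId"])
print(list(data.columns))  # only the useful feature and label columns remain
```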

Then, I rounded the temp and windSpeed columns to the nearest integer, which helps the model better predict new values.
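In pandas, that rounding step can look like this (sample values are made up):

```python
import pandas as pd

# Round continuous readings to the nearest integer so similar
# conditions map to the same feature value.
data = pd.DataFrame({"temp": [31.4, 82.9], "windSpeed": [3.2, 7.8]})
for col in ["temp", "windSpeed"]:
    data[col] = data[col].round().astype(int)
print(data)
```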

Next, I kept the same number of rows for both hot and iced coffee. My dataset had far more ‘hot’ samples than ‘iced’, because I started recording my data in the winter.

Balancing the classes helped reduce bias in the model and made sure both types were represented equally in the dataset.
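One way to do that balancing (a sketch, not necessarily the exact code used) is to downsample each class to the size of the smallest one:

```python
import pandas as pd

# Downsample so hot (1) and iced (0) appear equally often.
data = pd.DataFrame({"type": [1, 1, 1, 0, 0],
                     "temp": [30, 35, 40, 80, 85]})
n = data["type"].value_counts().min()  # size of the smaller class
balanced = pd.concat(
    [group.sample(n, random_state=0) for _, group in data.groupby("type")]
)
print(balanced["type"].value_counts())
```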

Finally, since the weatherCond column is a set of strings describing the current weather conditions (‘Clear’, ‘Rain’, ‘Snow’, etc.), I one-hot encoded the column for better accuracy.
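With pandas, get_dummies handles that encoding, turning each condition string into its own 0/1 column:

```python
import pandas as pd

# One-hot encode the categorical weather condition column.
data = pd.DataFrame({"weatherCond": ["Clear", "Rain", "Snow", "Clear"]})
encoded = pd.get_dummies(data, columns=["weatherCond"])
print(list(encoded.columns))
```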

The following represents the final columns I used for training:

The training process is in a Jupyter notebook, which can be found here.

I first split the data into X and y values. X contains all the features, and y is the label: 1 for hot coffee, 0 for iced coffee.

import pandas as pd

data = pd.read_csv('data.csv', index_col=0)
labels = data[['type']]

Next, I decided to try out some models and see what performs best! For now, I’m using scikit-learn for training.

First, I had to split the data into training and test sets using train_test_split:

X = data.drop('type', axis=1)  # drop the label so it isn't used as a feature
y = labels

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

The first model I tried was a Logistic Regression:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print('Logistic regression accuracy: {:.3f}'.format(accuracy_score(y_test, logreg.predict(X_test))))

Logistic regression accuracy: 0.685

Only 68%, eh? Not great. Let’s see what else I can do. How about a Linear Regression?

from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

reg = LinearRegression()
reg.fit(X_train, y_train)

# LinearRegression predicts continuous values, so threshold at 0.5
# to get class labels before scoring
preds = (reg.predict(X_test) >= 0.5).astype(int)
print('Linear regression accuracy: {:.3f}'.format(accuracy_score(y_test, preds)))

Linear regression accuracy: 0.685

Ok, that had no effect. How about a Random Forest?

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

print('Random Forest Accuracy: {:.3f}'.format(accuracy_score(y_test, rf.predict(X_test))))

Random Forest Accuracy: 0.712

That’s better!

Next, I tried SVC.

from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)

print('Support vector machine accuracy: {:.3f}'.format(accuracy_score(y_test, svc.predict(X_test))))

Support vector machine accuracy: 0.699

Just under 70%. Ok. Not terrible, but also not too good.

Since Random Forest had the best accuracy, I used this model for the app.

In order to use this model for the iOS app, I had to first save the model by pickling:

import pickle

filename = 'rf.sav'
pickle.dump(rf, open(filename, 'wb'), protocol=2)

By pickling, we save the model to our machine for later use.
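For completeness, here’s a round-trip sketch (with a toy model and made-up data, not the real dataset) showing the pickled file being written and loaded back for predictions. protocol=2 keeps the file readable from Python 2, in case the serving side needs it:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the trained model: features are (temp, windSpeed),
# label is 1 for hot coffee, 0 for iced.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit([[30, 5], [35, 3], [80, 8], [85, 2]], [1, 1, 0, 0])

# Save with protocol=2 (highest protocol Python 2 can read)...
with open("rf.sav", "wb") as f:
    pickle.dump(rf, f, protocol=2)

# ...then load it back and predict, as a server backing the app would.
with open("rf.sav", "rb") as f:
    model = pickle.load(f)

print(model.predict([[32, 4]]))  # a cold day, so we expect hot coffee (1)
```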