Step 2 — Data cleansing

A. Handling missing values

After completing the EDA process, it’s time to get our hands dirty with some basic data wrangling. Assuming we are keen on using all four fields discussed above (Age, Sex, Pclass and Embarked), we need to account for 177 missing Age values and 2 missing Embarked values.

Let’s try to impute the port of embarkation based on everything that we do know about these passengers. The easiest solution is to look at the distribution of passengers by embarkation point, which shows Southampton accounting for 72% of all passengers:

Number of passengers by port of embarkation
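The distribution itself comes from a simple aggregation. A sketch, assuming the training data lives in a hypothetical `titanic.train` table:

```sql
-- Share of passengers by port of embarkation
SELECT
  Embarked,
  COUNT(*) AS passengers,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM `titanic.train`
WHERE Embarked IS NOT NULL
GROUP BY Embarked
ORDER BY passengers DESC;
```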

Additionally, we can use the Pclass information to see if the majority of first-class ticket holders embarked from a particular port. With the help of the below query we can determine that 59% of such passengers did indeed embark from Southampton as well:
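A sketch of that query, with the `titanic.train` table name as a placeholder:

```sql
-- Port of embarkation among first-class ticket holders
SELECT
  Embarked,
  COUNT(*) AS passengers,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM `titanic.train`
WHERE Pclass = 1
  AND Embarked IS NOT NULL
GROUP BY Embarked
ORDER BY passengers DESC;
```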

Another way to accomplish this task is to see if the Cabin information can be used to fill in these missing values. After some rudimentary research we learn that the first letter of the Cabin field refers to the deck, which is naturally highly correlated with the passenger class. In our case, B happens to stand for the Promenade deck (not to be confused with the Upper Promenade). Looking at the distribution of passengers with a known port of embarkation for this deck, it’s a fair coin toss: roughly half of the folks (51.11% to be exact) started their journey in Southampton, with the remaining half coming from Cherbourg.
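That deck-level split can be reproduced with a query along these lines (table name assumed):

```sql
-- Embarkation split for passengers berthed on deck B
SELECT
  Embarked,
  COUNT(*) AS passengers,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM `titanic.train`
WHERE STARTS_WITH(Cabin, 'B')
  AND Embarked IS NOT NULL
GROUP BY Embarked;
```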

Summarizing all of the above findings, let’s take a leap of (somewhat calculated) faith and assume that the missing Embarked values should be assigned the value of S (Southampton). Dealing with the 177 missing Age values is not as trivial, partly due to the size of this sample and partly due to the absence of reliable predictors in the data set. We could perhaps look at passengers’ names to decipher their titles and safely assume that all of the married passengers (Mrs., etc.) are definitely not children; it would be harder to zoom in on their actual age, other than perhaps using mean or mode values from the data. Let’s simplify our lives and take the easy way out: let’s drop all of the records with missing Age values, in the hope that the remaining data set is large enough to feed our prediction model.
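Both decisions can be rolled into a single cleansing query; a sketch, again assuming the hypothetical `titanic.train` table:

```sql
-- Impute the two missing ports as 'S' and drop rows with no Age
SELECT
  * EXCEPT (Embarked),
  IFNULL(Embarked, 'S') AS Embarked
FROM `titanic.train`
WHERE Age IS NOT NULL;
```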

B. Preparing data for the model

Speaking of models: they don’t play very nicely with text/string data types, so we need to convert all such fields into numeric values. For example, for the Sex field we can assign a value of 1 to all males and 0 to all females.
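In BigQuery that conversion is a one-line CASE expression; for example:

```sql
-- Numeric gender flag: 1 = male, 0 = female
SELECT
  Survived,
  CASE WHEN Sex = 'male' THEN 1 ELSE 0 END AS Gender
FROM `titanic.train`;  -- placeholder table name
```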

Moving on to the Pclass field, we can recall from previous observations that first-class passengers were 1.33 times as likely to survive as second-class ticket holders. However, if we just fed the model the nominal values of 1 and 2 for the respective ticket classes, this relationship would be incorrectly interpreted as second-class passengers being twice as likely to perish as first-class ticket holders. We can avoid imposing that artificial scale by creating three fields (one for each passenger class) and flagging records accordingly. First-class ticket holders would have a value of True (read: 1) in the PClass1_Flg field, and False (read: 0) in the PClass2_Flg and PClass3_Flg columns:

We would apply a similar principle to the Age category fields and create five new fields to flag records according to the passenger age group.
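A sketch of both sets of flags; the age-group cutoffs below are purely illustrative, since the exact buckets depend on your earlier EDA:

```sql
-- One-hot style flags for passenger class and (hypothetical) age groups
SELECT
  Survived,
  Pclass = 1 AS PClass1_Flg,
  Pclass = 2 AS PClass2_Flg,
  Pclass = 3 AS PClass3_Flg,
  Age < 13              AS AgeGrp1_Flg,  -- children
  Age BETWEEN 13 AND 19 AS AgeGrp2_Flg,  -- teens
  Age BETWEEN 20 AND 39 AS AgeGrp3_Flg,
  Age BETWEEN 40 AND 59 AS AgeGrp4_Flg,
  Age >= 60             AS AgeGrp5_Flg
FROM `titanic.train`;  -- placeholder table name
```

BigQuery returns these comparisons as BOOL columns, which BigQuery ML happily consumes as features.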

Step 3 — Run our machine learning Titanic model

A. Create our model

We are finally done with all of the prep work and ready to proceed to the final phase of our project. To demonstrate how easy it is to create a model, let’s follow the BigQuery ML syntax below and run our first logistic regression model based on a single feature — Gender:

CREATE OR REPLACE MODEL `ModelName`
OPTIONS (model_type = 'logistic_reg') AS
SELECT field(s) FROM TableName
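Filled in for our single-feature case, it could look like the sketch below; the dataset and table names are placeholders:

```sql
-- Logistic regression on gender alone
CREATE OR REPLACE MODEL `titanic.survival_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  Survived AS label,  -- BigQuery ML trains against the column named "label"
  CASE WHEN Sex = 'male' THEN 1 ELSE 0 END AS Gender
FROM `titanic.train`;
```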

Who knew that it only takes a couple of lines of SQL code to create a Machine Learning model?

B. Evaluate model results

It only takes one line of code to evaluate our model, and the results look rather promising:

SELECT * FROM ML.EVALUATE(MODEL `ModelName`)

C. Run our model

The moment of truth is here: let’s run our model against the test data and, while we’re at it, create a table with the model results:

SQL Code for the table containing model predictions
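A hedged sketch of what that statement might look like, again with placeholder names:

```sql
-- Persist predictions against the test set in Kaggle's submission shape
CREATE OR REPLACE TABLE `titanic.predictions` AS
SELECT
  PassengerId,
  predicted_label AS Survived
FROM ML.PREDICT(MODEL `titanic.survival_model`,
                (SELECT * FROM `titanic.test`));
```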

Working with a single feature/column wasn’t that hard, but as we recall from our EDA, gender seems to be the best predictor of survival. It’s time to save this table as a CSV and make our Kaggle submission. This fairly effortless exercise yielded 76.55% accuracy, placing us in the top 74% of all Kaggle contestants. It doesn’t take a lot of analytics power to see that this score is actually shared with 1,481 other participants, so effectively we are in the top 60%! There is definitely room to grow from here, but not too shabby for a first try.

Kaggle ranking for the first submission

D. Experiment and optimize

Test and iterate is definitely the spirit here. We can further experiment with the model by fitting it to our needs, be it changing the learning rate, setting regularization, capping the number of iterations, etc. All of these parameters live in the OPTIONS portion of the model statement.
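For illustration, a tuned variant might look like this; the specific values are arbitrary, not recommendations:

```sql
-- Same model with a few explicit training options
CREATE OR REPLACE MODEL `titanic.survival_model_tuned`
OPTIONS (
  model_type = 'logistic_reg',
  learn_rate_strategy = 'constant',
  learn_rate = 0.1,       -- fixed learning rate
  l2_reg = 0.5,           -- L2 regularization strength
  max_iterations = 20     -- cap on training iterations
) AS
SELECT
  Survived AS label,
  Sex
FROM `titanic.train`;  -- placeholder names throughout
```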

In addition, we can try to improve the accuracy of our model by changing the features used, so let’s take that approach. I was very saddened to realize that by adding extra features (age category, passenger class, and port of embarkation) the accuracy of the model actually went down:

Another less successful Kaggle submission

We should be able to do better, though, correct? What if we consider that families might have a better chance of survival by staying together and helping each other? Luckily, we can add the number of parents/children travelling together (parch) to the number of siblings/spouses (sibsp) to tally up total family size. It helps to know that, per our earlier EDA, there are no missing values in these two fields. Note also that passengers travelling solo have 0 in both fields. Let’s build a new, more comprehensive and hopefully more accurate model:
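One way to sketch that model, with placeholder names and the family size derived on the fly:

```sql
-- Add a derived family-size feature alongside gender
CREATE OR REPLACE MODEL `titanic.survival_model_family`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  Survived AS label,
  Sex,
  SibSp + Parch + 1 AS family_size  -- +1 counts the passenger themselves
FROM `titanic.train`
WHERE Age IS NOT NULL;  -- keep the same cleansed population as before
```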

This tactic was right on the money: we marginally improved the accuracy of our submission to 77% and moved up 1,476 places, bypassing all of the folks using gender as the only predictor. We now share our position with only 583 competitors, effectively placing us in the top 54%!

I’m not certain what accuracy level is realistic to aim for (all of those 95%+ scores on the leaderboard do seem suspicious), but one thing to keep in mind is that the type of model might play a role here; perhaps a random forest could yield better results than logistic regression. Nonetheless, there is still work to do when it comes to improving our score.

If you’re looking for next steps, Google provides BigQuery ML getting-started guides for both the data analyst and the data scientist tracks. Good luck on your machine learning journey! Please share your thoughts and the methodology you used, should you be more successful in your results!

3/31/19 Edit — Thanks to a very informative comment from Felipe Hoffa, I learned that some of the code above was actually redundant; BigQuery makes things even simpler.

Replace this code:

CREATE OR REPLACE MODEL `xxx.xx`
OPTIONS (model_type='logistic_reg') AS
SELECT Survived AS label,
CASE WHEN Sex = 'male' THEN 1 ELSE 0 END AS Gender

With this statement:

CREATE OR REPLACE MODEL `xxx.xx`
OPTIONS (model_type='logistic_reg') AS
SELECT Survived AS label,
Sex

Should you want to learn more BigQuery and BigQuery ML tips and tricks, turn to Felipe’s excellent post here and all of the other articles he has shared so far.

P.S. Sorry, Felipe, old habits die hard: I swapped your example’s leading commas for my go-to trailing commas.