Data wrangling with transducers for a machine learning problem

28 Apr 2017

The transducers from the net.cgrand.xforms library are a great way to transform and analyze data in Clojure. This blog post shows how the xforms transducers can be used to do data analysis for a machine learning problem from Kaggle, a data science competition platform.

One of the competitions on Kaggle is the Titanic competition. For this competition you are given a dataset about passengers aboard the Titanic, with data such as their age and how much they paid for their ticket. In the training data you are also told if the passenger survived. The goal of the competition is to predict if a passenger survived or not for a test set of data.

This tutorial on the Kaggle site explains how to solve such a problem: which steps to take, how to analyze, transform, or create data, and how to make predictions. The tutorial uses Python to go through all these steps. In this blog we'll use Clojure instead.

Analyzing the data

| :PassengerId | :SibSp | :Fare   | :Embarked | :Sex   | :Survived | :Parch | :Pclass | :Age |
|--------------+--------+---------+-----------+--------+-----------+--------+---------+------|
| 1            | 1      | 7.25    | S         | male   | 0         | 0      | 3       | 22   |
| 2            | 1      | 71.2833 | C         | female | 1         | 0      | 1       | 38   |
| 3            | 0      | 7.925   | S         | female | 1         | 0      | 3       | 26   |
| 4            | 1      | 53.1    | S         | female | 1         | 0      | 1       | 35   |
;; etc

Example data from the Titanic dataset
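The snippets below assume `data` is a sequence of maps shaped like the rows above. As a sketch of how such a sequence could be loaded from Kaggle's train.csv, assuming the clojure.data.csv library (the parsing here is deliberately minimal; the real file needs proper coercion of numeric and missing values):

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Read a CSV file and turn each row into a map keyed by the header,
;; e.g. {:PassengerId "1", :Survived "0", ...}. Note that every value
;; is still a string at this point; fields such as :Age and :Fare
;; would have to be parsed into numbers separately.
(defn load-rows [path]
  (with-open [r (io/reader path)]
    (let [[header & rows] (doall (csv/read-csv r))
          ks (map keyword header)]
      (mapv #(zipmap ks %) rows))))

;; (def data (load-rows "data/train.csv"))
```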

Let's say you would like to find out how many people from the training data set survived. With the functions from clojure.core that looks like this:

(-> (map :Survived data) frequencies)
;;=> {0 549, 1 342}

And here's how to slice the data to see how many people survived based on their gender:

(-> (map (juxt :Sex :Survived) data) frequencies)
;;=> {["male" 0] 468,
;;    ["female" 1] 233,
;;    ["female" 0] 81,
;;    ["male" 1] 109}

(let [by-sex (group-by :Sex data)
      survived (zipmap (keys by-sex)
                       (->> by-sex
                            vals
                            (map #(-> (map :Survived %) frequencies))))]
  survived)
;;=> {"male" {0 468, 1 109},
;;    "female" {1 233, 0 81}}

(reduce (fn [acc {:keys [Sex Survived] :as row}]
          (update-in acc [Sex Survived] (fnil inc 0)))
        {}
        data)
;;=> {"male" {0 468, 1 109},
;;    "female" {1 233, 0 81}}

With the by-key transducer from xforms it looks like this:

(require '[net.cgrand.xforms :as x])

(into {}
      (x/by-key :Sex (comp (x/by-key :Survived x/count)
                           (x/into {})))
      data)
;;=> {"male" {0 468, 1 109},
;;    "female" {1 233, 0 81}}

For this counting use case there is little difference between using a transducer and the basic clojure.core functions. And the benefit of transducers being usable on different kinds of sources, such as streams, is not relevant for our use case.
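Still, because the grouping and counting logic is packaged as a transducer, the exact same xform can be reused on other sources without changes. A sketch with a core.async channel (assuming the org.clojure/core.async dependency; the channel usage is illustrative):

```clojure
(require '[clojure.core.async :as async]
         '[net.cgrand.xforms :as x])

;; The same grouping/counting xform, defined once...
(def survival-counts
  (x/by-key :Sex (comp (x/by-key :Survived x/count)
                       (x/into {}))))

;; ...works on a plain collection:
;; (into {} survival-counts data)

;; ...and, unchanged, as the transducer of a channel, so rows could be
;; counted as they stream in; the channel emits [sex counts] pairs:
;; (def rows-ch (async/chan 16 survival-counts))
```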

But when you want to get multiple results per grouping or statistics other than counting, the transducer approach with xforms starts to come out ahead:

(def xFrequencies
  (comp (x/by-key identity x/count)
        (x/into {})))

(into {}
      (x/by-key :Sex :Survived
                (x/transjuxt {:chance (comp x/avg (map double))
                              :counts xFrequencies}))
      data)
;;=> {"male" {:chance 0.1889081455805893, :counts {0 468, 1 109}},
;;   "female" {:chance 0.7420382165605096, :counts {1 233, 0 81}}}
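x/transjuxt runs several aggregations in a single pass over the data, so the same pattern extends to any combination of the xforms reducing functions. For example, a sketch of the mean and standard deviation of the Fare per passenger class, using only x/avg and x/sd as above:

```clojure
(require '[net.cgrand.xforms :as x])

;; One pass over data: group by :Pclass, and for each class compute
;; both the mean and the standard deviation of the :Fare values.
(into {}
      (x/by-key :Pclass :Fare
                (x/transjuxt {:mean (comp x/avg (map double))
                              :std-dev x/sd}))
      data)
;; => a map from :Pclass value to {:mean ..., :std-dev ...}
```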

Or for the distribution of the Age feature in the dataset:

(def ageStats
  (comp (map :Age)
        (filter identity) ;; some Age values are missing
        (x/transjuxt {:mean (comp x/avg (map double))
                      :std-dev x/sd})))

(into {} ageStats data)
;;=> {:mean 29.69911764705882, :std-dev 14.526497332334035}

(into {} (x/by-key :Sex ageStats) data)
;;=> {"male" {:mean 30.72664459161148, :std-dev 14.678200823816606},
;;   "female" {:mean 27.915708812260537, :std-dev 14.110146457544133}}

With the xforms transducers all the data analysis from the Titanic tutorial in Python is as easy to do in Clojure. So the next time you need to do some group-by type operation, you should check out the transducers from net.cgrand.xforms.

To replicate (most of) the charts from the tutorial you can use the Incanter library.
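For instance, a histogram of the Age distribution could be drawn like this (a sketch assuming the incanter dependency; the option values are illustrative):

```clojure
(require '[incanter.core :as i]
         '[incanter.charts :as charts])

;; Collect the non-missing ages and show them as a histogram window.
(let [ages (keep :Age data)]
  (i/view (charts/histogram ages
                            :nbins 20
                            :title "Age distribution"
                            :x-label "Age")))
```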

Predicting the test data

The fun part of doing machine learning is having a model to make predictions with. For this we'll use clj-ml, which is a wrapper around the Weka library.
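clj-ml expects the data as a Weka dataset. A sketch of how the maps from before could be converted, assuming the clj-ml.data namespace, a reduced feature set, and that the numeric fields have already been parsed into numbers (the attribute names and ordering here are illustrative):

```clojure
(require '[clj-ml.data :as cm-data])

;; Build a Weka dataset with a few numeric features plus a nominal
;; :survived attribute, then mark :survived (index 3) as the class
;; attribute that the classifiers should predict.
(def dataset
  (let [ds (cm-data/make-dataset "titanic"
                                 [:pclass :age :fare
                                  {:survived [:no :yes]}]
                                 (for [{:keys [Pclass Age Fare Survived]} data]
                                   [Pclass Age Fare (if (= 1 Survived) :yes :no)]))]
    (cm-data/dataset-set-class ds 3)))
```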

(require '[clj-ml.classifiers :as cm-classifiers]
         '[clojure.pprint :as pprint])

;; dataset is our data in the Weka format
(let [random-forest (doto (cm-classifiers/make-classifier :decision-tree :random-forest)
                      (.setNumTrees 100))
      trained (cm-classifiers/classifier-train random-forest dataset)]
  (let [evaluate (cm-classifiers/classifier-evaluate random-forest :cross-validation dataset 10)]
    (pprint/pprint evaluate))
  (predict test-data random-forest "data/randomforest.csv"))

;; Random Forest
;; trained: 78.563%
;; Score on Kaggle leaderboard: 0.77990

;; Other algorithms:

;; Logistic regression:
;; trained: 79.349%
;; Score on Kaggle leaderboard: 0.75120

;; Naive Bayes:
;; trained: 78.339%
;; Score on Kaggle leaderboard: 0.70813

;; KNN 3:
;; trained: 78.451%
;; Score on Kaggle leaderboard: 0.73206

;; SVM:
;; trained: 78.676%
;; Score on Kaggle leaderboard: 0.76555

;; Decision Tree:
;; trained: 81.145%
;; Score on Kaggle leaderboard: 0.77990
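predict above is a small helper, not part of clj-ml. A hypothetical sketch of what it might do, classifying each instance with clj-ml's classifier-classify and writing the submission CSV in the format Kaggle expects (passenger-ids, the use of dataset-seq, and the output shape are all assumptions):

```clojure
(require '[clj-ml.classifiers :as cm-classifiers]
         '[clj-ml.data :as cm-data]
         '[clojure.java.io :as io])

;; Hypothetical helper: classify every instance of the test dataset and
;; write PassengerId,Survived rows. passenger-ids is assumed to be the
;; seq of ids in the same order as the instances in test-data.
(defn predict [test-data classifier out-file]
  (with-open [w (io/writer out-file)]
    (.write w "PassengerId,Survived\n")
    (doseq [[id instance] (map vector passenger-ids (cm-data/dataset-seq test-data))]
      (.write w (str id ","
                     (long (cm-classifiers/classifier-classify classifier instance))
                     "\n")))))
```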

The code is on GitHub.