1. Intentions

I got my hands on Real-World Machine Learning by Brink, Richards, and Fetherolf. My summer vacation has finally arrived, so why not spend some portion of the latter on the former? The book's code is in Python. The book is formula-free; it focuses on introductory presentations of methods (no deep learning, you know 😉 ). It relies on existing libraries such as numpy and scikit-learn. So, I thought, why not try to solve a problem from that book by googling for utilities in Common Lisp and gluing them together?

The purpose of this article is to take the minutes of this endeavour. We'll find out:

- what it takes to import real-life data for machine learning into a Lisp environment;
- what the usual R/numpy-like data manipulation tricks may look like in Common Lisp;
- whether Lisp is a Good Thing(tm) for Data Science(tm).

Inventory and materials: I used SBCL (version 1.3.12) with the Emacs Slime mode, and Quicklisp as my library manager. The problem data can be downloaded from kaggle.com/c/event-recommendation-engine-challenge (requires registration).

Prerequisites: Peter Seibel’s Practical Common Lisp.

2. Looking at the data

We’ll work through the Events Recommendations example (section 5.2 of the aforementioned book). We want to predict user interest in a recommended event from the user’s age (birthyear), gender, and timezone, and from the geographical latitude and longitude of the event. The target variable interest is binary (0/1), the gender feature is categorical, and the rest are numerical. The base datasets are the train.csv, users.csv, and events.csv files.

The .csv extension stands for comma-separated values. We expect the three files to be text files, each line representing an item, with the item's features listed one by one with commas in between. So, my first decision was to write a dedicated function to parse a CSV file and to use it for reading the three datasets into RAM. I thought of having separate arrays for the variables (as in R’s data frame), wrapped up in a hash table. That way, numerical variables could be stored in numerical form rather than in printed form. I wrote my own code because I was too eager to get to the ML itself than to spend time reading the documentation of options like the CL-CSV library.

Open a new file (in Emacs); call it, say, main.lisp and put the following lines at the beginning, then press Ctrl+x Ctrl+s to save and Ctrl+c Ctrl+k to compile the file:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (declaim (optimize debug))
  (ql:quickload '("split-sequence")))

;; gzip-stream, parse-float and clrf are loaded in later steps below
(defpackage ml-events
  (:use :ql :cl :split-sequence :gzip-stream :parse-float :clrf)
  (:export main))

(in-package ml-events)

We define a new Common Lisp package to control the namespace. Since we need to split lines into substrings, we’ll use a dedicated library for this called split-sequence. We want this library to be present/loaded every time we use main.lisp, either way: compiling/loading the file in scripting mode, or experimenting at runtime in a Slime session.

Data can be missing, as we know. Missing categorical data is no problem: empty strings are the solution. But for numerical variables we need something else, so I decided to define special datatypes. Also, I don’t automate data-type recognition; instead, I pass the types of the variables to the reading function. The first line of each dataset is a header containing the variable names; we read it separately. Next, we loop over the lines with read-line; its second argument indicates that at the end of the file nil should be returned rather than an error signalled. So, here’s my CSV-reading function. Add it at the end of main.lisp, then save and recompile:

(deftype maybe-fixnum () '(or null fixnum))
(deftype maybe-float () '(or null float))

(defun read-csv (fname &optional (types nil) &aux data header)
  (setq data (make-hash-table :test 'equal))
  (with-open-file (s fname :direction :input)
    (setq header (read-line s))
    (setq header (split-sequence #\, (string-right-trim '(#\Return) header)))
    (unless types
      (setq types (make-list (length header) :initial-element t)))
    (loop :for var :in header
          :for type :in types
          :do (setf (gethash var data)
                    ;; :adjustable is required for vector-push-extend
                    (make-array 150 :element-type type
                                    :fill-pointer 0 :adjustable t)))
    (loop :for line = (read-line s nil)
          :while line
          :do (setq line (split-sequence #\, (string-right-trim '(#\Return) line)))
              (loop :for var :in header
                    :for val :in line
                    :for type :in types
                    :do (vector-push-extend
                         (case type
                           ((fixnum maybe-fixnum) (parse-integer val :junk-allowed t))
                           (otherwise val))
                         (gethash var data)))))
  data)

I assume that the dataset files are in the current working directory.

(declaim (special .train. .users.))
(setq .train. (read-csv "train.csv"
                        '(fixnum fixnum fixnum string fixnum fixnum))
      .users. (read-csv "users.csv"
                        '(maybe-fixnum string maybe-fixnum string string string maybe-fixnum)))

This declaim/setq combination was gleaned from the expansion of the defvar macro. Every time the Lisp file is compiled/loaded, the data is re-read. Now we can switch to the slime buffer (Ctrl+c Ctrl+z), type in .train. and press Enter. Then right-click on the output and select ‘Inspect’ to see the contents of the hash table.
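The distinction matters because defvar assigns only if the variable is unbound, so re-compiling the file would not re-read the data. A small illustration (the variable names *x* and *y* are my own):

```lisp
;; DEFVAR proclaims the variable special, but assigns only once:
(defvar *x* 1)
(defvar *x* 2)   ; no effect: *x* is already bound, so *x* is still 1

;; DECLAIM SPECIAL + plain SETQ re-assigns on every compile/load,
;; which is what we want for re-reading the datasets:
(declaim (special *y*))
(setq *y* 1)
(setq *y* 2)     ; *y* is now 2
```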

Two files out of the three are of moderate size. The third one, events.csv.gz, is an archive; gunzipping it results in a 1GB+ file. I wanted to save room on my hard drive, so I reached for the GZIP-STREAM library from Quicklisp. Also, this table contains floating-point numbers, so we need the PARSE-FLOAT library. So, change the third line into

(ql:quickload '("split-sequence" "parse-float" "gzip-stream")))

Add a new function:

(defun read-csv-gzip (fname &key (types nil) &aux data header)
  (setq data (make-hash-table :test 'equal))
  (with-open-gzip-file (s fname :direction :input)
    (setq header (gzip-stream::stream-read-line s))
    (setq header (split-sequence #\, (string-right-trim '(#\Return) header)))
    (unless types
      (setq types (make-list (length header) :initial-element t)))
    (loop :for var :in header
          :for type :in types
          :do (setf (gethash var data)
                    (make-array 150 :element-type type
                                    :fill-pointer 0 :adjustable t)))
    (loop :for line = (gzip-stream::stream-read-line s)
          :while line
          :do (setq line (split-sequence #\, (string-right-trim '(#\Return) line)))
              (loop :for var :in header
                    :for val :in line
                    :for type :in types
                    :do (vector-push-extend
                         (case type
                           ((fixnum maybe-fixnum) (parse-integer val :junk-allowed t))
                           ((float maybe-float) (parse-float val :junk-allowed t))
                           (otherwise val))
                         (gethash var data)))))
  data)

(declaim (special .events.))
(setq .events. (read-csv-gzip "events.csv.gz"))

Save and recompile. Oops: my SBCL went down due to lack of memory. The file contains some 3 million lines, with 110 comma-separated fields each. Increasing SBCL’s dynamic memory limit doesn’t help. My solution was to extract and record only the necessary fields. Erase read-csv-gzip, paste the following lines, and gunzip the data file into /tmp (on my system /tmp is a tmpfs stored in memory):

(defun read-csv-events (fname &aux data header)
  (setq data (make-hash-table :test 'equal))
  (with-open-file (s fname :direction :input)
    (setq header (read-line s))        ;; skip header
    (loop :for line = (read-line s nil)
          :while line
          :do (setq line (split-sequence #\, (string-right-trim '(#\Return) line)
                                         :count 10))
              (setf (gethash (parse-integer (first line)) data)
                    (cons (parse-float (eighth line) :junk-allowed t)
                          (parse-float (ninth line) :junk-allowed t)))))
  data)

(declaim (special .events.))
(setq .events. (read-csv-events "/tmp/events.csv"))

We limit the number of fields extracted by split-sequence to the first ten (the “lat” and “lng” fields are #8 and #9). Now the events dataset is processed in about half a minute. Nice!
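The :count keyword is what keeps the per-line work bounded; a small sketch of its behaviour on a toy string of my own:

```lisp
;; assumes the split-sequence library is loaded, e.g. via
;; (ql:quickload "split-sequence")

;; :count stops splitting after the given number of fields; the
;; tail of the line is never consed into substrings.
(split-sequence:split-sequence #\, "a,b,c,d,e" :count 3)
;; => ("a" "b" "c"), plus a second value: the position where
;; splitting stopped
```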

Actually, our working data is train.csv augmented with information from users.csv (gender, birthyear) and events.csv (latitudes and longitudes). The following lines create the extra columns on the fly by mapping anonymous functions over every user in the training set, storing the results in vectors, and inserting them into the .train. hash table. All the anonymous functions perform the same task: given a user id (from a train.csv row), find the user’s index in the array of user ids and return the feature at the same index in the appropriate column.

(let* ((ids (gethash "user_id" .users.))
       (by  (gethash "birthyear" .users.))
       (gen (gethash "gender" .users.))
       (tz  (gethash "timezone" .users.))
       (birthdays (map 'vector (lambda (x) (aref by (position x ids)))
                       (gethash "user" .train.)))
       (gender (map 'vector (lambda (x) (aref gen (position x ids)))
                    (gethash "user" .train.)))
       (timezone (map 'vector (lambda (x) (aref tz (position x ids)))
                      (gethash "user" .train.))))
  (setf (gethash "birthyear" .train.) birthdays)
  (setf (gethash "timezone" .train.) timezone)
  (setf (gethash "gender" .train.) gender))

(let* ((lat (map 'vector (lambda (x) (car (gethash x .events.)))
                 (gethash "event" .train.)))
       (lng (map 'vector (lambda (x) (cdr (gethash x .events.)))
                 (gethash "event" .train.))))
  (setf (gethash "lat" .train.) lat)
  (setf (gethash "lng" .train.) lng))
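Each position call scans the whole user_id vector, so this join is quadratic in the number of rows. A hedged alternative sketch that precomputes a user-id-to-index hash table (the names index and pull are my own, not from the original code):

```lisp
;; Build an index once, then join in linear time.
;; Assumes .users. and .train. have been populated as above,
;; with user ids stored as fixnums (so the default EQL test works).
(let ((index (make-hash-table)))
  (loop :for id :across (gethash "user_id" .users.)
        :for i :from 0
        :do (setf (gethash id index) i))
  (flet ((pull (column)
           (let ((src (gethash column .users.)))
             (map 'vector
                  (lambda (u) (aref src (gethash u index)))
                  (gethash "user" .train.)))))
    (setf (gethash "birthyear" .train.) (pull "birthyear")
          (gethash "gender" .train.)    (pull "gender")
          (gethash "timezone" .train.)  (pull "timezone"))))
```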

Now, we have harvested all the input data in the .train. hashtable.

3. Feature engineering

Since one of our feature variables is categorical, the book suggests turning it into a set of binary variables corresponding to different levels of the categorical variable, plus one extra variable for “Missing value”. Here’s the code:

(defun cat2num (data &key (test 'string=) (prefix nil) &aux features)
  (setq features (make-hash-table :test 'equal))
  (loop :for cat :across (remove-duplicates data :test test)
        ;; the prefixed name is only the feature's key; the
        ;; comparison below must use the raw category value
        :for key = (if prefix (concatenate 'string prefix "-" cat) cat)
        :do (setf (gethash key features)
                  (map 'vector (lambda (x) (if (funcall test x cat) 1 0))
                       data)))
  features)

(defun insert-features (table features)
  (loop :for var :being :the :hash-keys :of features
        :do (when (gethash var table)
              (error "Feature ~a already present" var))
            (setf (gethash var table) (gethash var features)))
  table)

(let ((gender (cat2num (gethash "gender" .train.) :prefix "gen" :test 'string=)))
  (insert-features .train. gender))
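A quick way to sanity-check the encoding in the REPL, on a toy vector of my own:

```lisp
;; Three levels: "male", "female", and the empty string for a
;; missing value. Assumes cat2num as defined above.
(let ((f (cat2num #("male" "female" "" "male") :prefix "gen")))
  (loop :for k :being :the :hash-keys :of f
        :do (format t "~a -> ~a~%" k (gethash k f))))
;; expect the keys gen-male, gen-female and gen-, each mapped to
;; a 0/1 indicator vector marking the rows of that level
```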

4. Building and testing a classifier

Now it’s time to get down to business. The book suggests training the model with the random forest algorithm. A quick Google search gives a link to a useful library, CL-RANDOM-FOREST. Although its README.org says the library can be downloaded and installed via Quicklisp, I failed to do so, and had to download it and save it in quicklisp’s local-projects directory instead. Loading it then failed with an error about the missing cl-online-learning library. Ohh, download that and save it in quicklisp’s local-projects too. Now, add “cl-random-forest” to the list of libraries in the third line of your file. We’re ready to select the cases without missing values and train a decision tree or a random forest on them.

(defvar *dtree*)
(defvar *train-features*)
(defvar *target*)

(setq *dtree*
      (loop :for by    :across (gethash "birthyear" .train.)
            :for lat   :across (gethash "lat" .train.)
            :for lng   :across (gethash "lng" .train.)
            :for tz    :across (gethash "timezone" .train.)
            :for gen-m :across (gethash "gen-male" .train.)
            :for gen-f :across (gethash "gen-female" .train.)
            :for gen-u :across (gethash "gen-" .train.)
            :for int   :across (gethash "interested" .train.)
            :for inv   :across (gethash "invited" .train.)
            :when (and lat lng by tz)
            :collect (list (coerce by 'double-float)
                           (coerce gen-m 'double-float)
                           (coerce gen-f 'double-float)
                           (coerce gen-u 'double-float)
                           (coerce tz 'double-float)
                           (coerce lat 'double-float)
                           (coerce lng 'double-float)
                           (coerce inv 'double-float))
              :into features
            :collect int :into target
            :finally
            (setq *train-features*
                  (make-array (list (length features) 8)
                              :element-type 'double-float
                              :initial-contents features)
                  *target*
                  (make-array (length target)
                              :element-type 'fixnum
                              :initial-contents target))
            (return (clrf:make-dtree 2 8  ;; n-class, n-dim
                                     *train-features* *target*))))

(clrf:test-dtree *dtree* *train-features* *target* :quiet-p nil)

(declaim (special *forest*))
(setq *forest* (clrf:make-forest 2 8 *train-features* *target*
                                 :n-tree 25 :bagging-ratio 1.0))
(clrf:test-forest *forest* *train-features* *target*)

…and we’re done! The random forest library lacks documentation. Initially I didn’t coerce all the input features to double-floats and got an error somewhere in the depths of the library code. I almost got stuck, but then I converted all the input data to double-floats, as you see in the code above, and it helped. The usage template was found in the file simple.lisp in the examples/ directory.

5. Conclusions

I demonstrated that this particular ML example can be walked through on an ordinary HP notebook if we carefully decide what data we actually need to import. Maybe a more sophisticated solution would use some kind of database to access a huge CSV file. I also understood that many syntactic-sugar tricks like z=(x==y) can be easily implemented by means of standard Common Lisp library functions like map, remove-duplicates, position, etc. At the same time, modern Common Lisp has a vast selection of utility and problem-solving libraries, and it can be considered a tool for data analysis.