k-Nearest Neighbor Classification Algorithm Implemented in Lisp

Description of the problem

The setup is similar to other classification problems I have implemented before: I have a finite set of points $D := \{ x_1,\ldots,x_N \}$ from a metric space $(X,d)$ which I would like to write as a disjoint union of $m$ subsets \[ D = D_1 \sqcup \cdots \sqcup D_m \] This time, I have a training set of examples $T = \{ t_1,\ldots,t_L \}$ together with a classification scheme \[ c\colon T\to \{1,\ldots,m\} \] such that $t_i\in D_{c(t_i)}$ for each $i=1,\ldots,L$. Our problem is to extend $c$ to a function $c\colon X\to \{1,\ldots,m\}$.

The algorithm

Given a point $x\in X$, we calculate its distance to each point in the training set. Then we rank the training points from closest to farthest. The majority of the class labels among the first $k$ determines which label the point $x$ is going to get. In short, we ask the training set which label $x$ should get, and the majority vote among the $k$ closest points wins.

The pseudo-code

Here is the pseudo-code for the algorithm:

Function knn
Input:  A finite set D of points to be classified
        A finite set T of training points
        A function c: T -> {1,...,m}
        A natural number k
Output: A function r: D -> {1,...,m}
Begin
  Foreach x in D do
    Let U <- {}
    Foreach t in T
      add the pair ( d(x,t) , c(t) ) to U
    Sort the pairs in U by their first components
    Count the class labels among the first k elements of U
    Let r(x) be the class with the highest number of occurrences
  End Foreach
  Return r
End

An implementation in Lisp

Let me start with some utility functions. First, some code to load the data:

(defun read-data (file)
  (with-open-file (stream file)
    (let ((result (make-array 0 :fill-pointer t :adjustable t)))
      (do ((line (read-line stream nil) (read-line stream nil)))
          ((null line) result)
        (vector-push-extend
         (map 'vector 'read-from-string (ppcre:split #\, line))
         result)))))
READ-DATA

I will start with the Wine Dataset from UCI. I will load the data and take a small random sample of size 45 (drawn with replacement) as the training set.

(defvar data (read-data "wine.csv"))
(defvar N (length data))
;; sample 45 random points from the data as the training set
(defvar train (map 'vector (lambda (x) (aref data (random N))) (make-array 45)))
TRAIN

We make the following assumption about the data: the data points are numerical vectors whose last entries are the class labels. Using this assumption, we define the distance function as follows:

;; Euclidean distance over every entry except the last (the class label)
(defun distance (x y)
  (sqrt (loop for i from 0 below (1- (length x))
              sum (expt (- (aref x i) (aref y i)) 2))))
DISTANCE

and we read the class label as follows:

(defun read-label (x)
  (aref x (1- (length x))))
READ-LABEL

OK. Now let me test part of the algorithm: given a point, calculate its distances to the training set and sort the results from smallest to largest.

;; return the training labels sorted from nearest to farthest
(defun process (x train)
  (map 'vector 'car
       (sort (map 'list (lambda (y) (cons (read-label y) (distance x y))) train)
             (lambda (u v) (< (cdr u) (cdr v))))))
PROCESS

And test it on a sample point:

(let ((x (aref data (random N))))
  (cons (read-label x) (process x train)))
(1 . #(1 1 1 1 1 1 1 2 1 2 1 1 1 1 3 3 3 3 3 1 3 2 1 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 1 2 2 2 2 2 2))

What we need now is a function that finds the most frequent class label among the first $k$ entries. Here is our classify function, which determines the class label of a point.

;; tally the labels of the k nearest training points and return the winner
(defun classify (x train k)
  (let (result (temp (process x train)))
    (dotimes (i k)
      (let ((pair (assoc (aref temp i) result :test 'equal)))
        (if pair
            (incf (cdr pair))
            (push (cons (aref temp i) 1) result))))
    (caar (sort result (lambda (i j) (> (cdr i) (cdr j)))))))
CLASSIFY

and we test it

(let ((x (aref data (random N))))
  (cons (read-label x) (classify x train 3)))
(3 . 1)

which is not successful for this particular point.

Now, we apply our function to the whole dataset to assess how successful it is. For that I will need to compare the class labels given in the data with the class labels we calculate.

;; count the occurrences of each element of the list SENT
(defun make-table (sent)
  (let (result)
    (dolist (x sent)
      (if (assoc x result :test 'equal)
          (incf (cdr (assoc x result :test 'equal)))
          (push (cons x 1) result)))
    result))
MAKE-TABLE

Here is a full run with $k$=4:

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 5) ((3 . 2) . 22) ((3 . 3) . 21) ((2 . 1) . 3) ((2 . 3) . 17)
 ((2 . 2) . 51) ((1 . 2) . 6) ((1 . 3) . 6) ((1 . 1) . 47))

Presented as a table we get

              New
           1    2    3
        1  47    6    6
  Old   2   3   51   17
        3   5   22   21
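Besides inspecting the table by eye, we can also read a plain accuracy off the pair counts. Here is a hypothetical helper (not part of the code above) that computes the fraction of correctly classified points from the ((old . new) . count) entries returned by make-table:

```lisp
;; Hypothetical helper: overall accuracy from the ((old . new) . count)
;; entries returned by MAKE-TABLE.  An entry counts as correct when the
;; original label OLD equals the computed label NEW.
(defun accuracy (table)
  (let ((correct 0) (total 0))
    (dolist (entry table (/ correct total))
      (destructuring-bind ((old . new) . count) entry
        (incf total count)
        (when (eql old new)
          (incf correct count))))))

;; For the k=4 table above this gives (47 + 51 + 21) / 178 = 119/178,
;; roughly 67%.
```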

Now, I will repeat the analysis with $k$=8:

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 8))) data))
(((3 . 3) . 20) ((3 . 2) . 28) ((2 . 1) . 2) ((2 . 3) . 11) ((2 . 2) . 58)
 ((1 . 3) . 13) ((1 . 1) . 46))

Presented as a table we get

              New
           1    2    3
        1  46    0   13
  Old   2   2   58   11
        3   0   28   20

Statistical analysis of the results

If we apply the $\chi^2$-test to these tables (a larger test statistic indicates a stronger association between the original and the computed labels) we see

> chisq.test(matrix(c(47,6,6,3,51,17,5,22,21),nrow=3,ncol=3))

        Pearson's Chi-squared test

data:  matrix(c(47, 6, 6, 3, 51, 17, 5, 22, 21), nrow = 3, ncol = 3)
X-squared = 108.0064, df = 4, p-value < 2.2e-16

> chisq.test(matrix(c(46,0,13,2,58,11,0,28,20),nrow=3,ncol=3))

        Pearson's Chi-squared test

data:  matrix(c(46, 0, 13, 2, 58, 11, 0, 28, 20), nrow = 3, ncol = 3)
X-squared = 139.2728, df = 4, p-value < 2.2e-16

The second result shows a stronger association. Of course, we can also change the size of the training sample.

(setf train (map 'vector (lambda (x) (aref data (random N))) (make-array 70)))
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 8) ((3 . 2) . 25) ((3 . 3) . 15) ((2 . 1) . 7) ((2 . 3) . 21)
 ((2 . 2) . 43) ((1 . 3) . 1) ((1 . 1) . 58))

              New
           1    2    3
        1  58    0    1
  Old   2   7   43   21
        3   8   25   15

which yields

> chisq.test(matrix(c(58,0,1,7,43,21,8,25,15),nrow=3,ncol=3))

        Pearson's Chi-squared test

data:  matrix(c(58, 0, 1, 7, 43, 21, 8, 25, 15), nrow = 3, ncol = 3)
X-squared = 120.7284, df = 4, p-value < 2.2e-16

We did not get a better result by changing the training set size.
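Instead of comparing tables for one or two values of $k$ by hand, we could also sweep over a range of $k$ values. Here is a hypothetical sketch built on the functions defined above (accuracy-for-k is a new helper, not part of the original code; like the runs above, it evaluates on the full dataset, which includes the training points):

```lisp
;; Hypothetical helper: fraction of points in DATA whose computed label
;; agrees with the stored label, for a given k.
(defun accuracy-for-k (data train k)
  (/ (count-if (lambda (x) (eql (read-label x) (classify x train k))) data)
     (length data)))

;; collect (k . accuracy) pairs for k = 1,...,10
(loop for k from 1 to 10
      collect (cons k (accuracy-for-k data train k)))
```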

Another dataset

Let me run a test on another dataset. This time I will use the Iris Dataset, again from UCI.

(setf data (read-data "iris.csv"))
(setf train (let ((N (length data)))
              (map 'vector (lambda (x) (aref data (random N))) (make-array 50))))
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 3))) data))
(((IRIS-VIRGINICA . IRIS-VERSICOLOR) . 8) ((IRIS-VIRGINICA . IRIS-VIRGINICA) . 42)
 ((IRIS-VERSICOLOR . IRIS-VIRGINICA) . 2) ((IRIS-VERSICOLOR . IRIS-VERSICOLOR) . 48)
 ((IRIS-SETOSA . IRIS-SETOSA) . 50))

Again presented in table format we see

                     New
                 Virginica   Versicolor   Setosa
      Virginica      42           8          0
 Old  Versicolor      2          48          0
      Setosa          0           0         50

Let me run another test with a different $k$ on the same dataset:

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 5))) data))
(((IRIS-VIRGINICA . IRIS-VERSICOLOR) . 15) ((IRIS-VIRGINICA . IRIS-VIRGINICA) . 35)
 ((IRIS-VERSICOLOR . IRIS-VERSICOLOR) . 50) ((IRIS-SETOSA . IRIS-SETOSA) . 50))

Again presented in table format we see

                     New
                 Virginica   Versicolor   Setosa
      Virginica      35          15          0
 Old  Versicolor      0          50          0
      Setosa          0           0         50

which is comparable to the result above, though slightly worse on the virginica class.