KNN # 3 — Coding our breast cancer classifier

Show me the code!

To start the project we need data, so let’s download the Breast Cancer Wisconsin dataset that we saw in the previous article.

The previous article explains how to download the data file.

A few notes before we start. After downloading the dataset, I made two simple modifications to it:

I removed the ID column because it doesn’t help us classify a cancer type;

I moved the class column to the last position;

After downloading it, you have two options:

— Download a dataset modified by me to facilitate the KNN development through this link;

or

— Follow the step by step and adapt the code to the dataset;

Well, let’s get started! First of all, the complete code for this tutorial is on GitHub at this link.

Once you have your dataset, create a folder called “datas” inside the project folder and put all the datasets there.

In the root folder, we’ll create a file called main.go.

I am using the VS Code editor for the development of this algorithm, however you can use any other text/code editor that you want.

Let’s define the package name and create the main function.

To start our project we need to load the dataset into memory, so let’s create a method that reads a CSV file and returns an array to us.
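A minimal sketch of such a loader, using the standard encoding/csv package, might look like this. The names parseCSV and readCSVFile and the sample rows are illustrative assumptions; the author's version in the linked repository may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// parseCSV turns CSV text into a matrix of strings, one inner slice per row.
func parseCSV(data string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(data)).ReadAll()
}

// readCSVFile reads the whole file at path and parses it as CSV.
func readCSVFile(path string) ([][]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return parseCSV(string(data))
}

func main() {
	// Made-up sample rows; with a real file you would call readCSVFile
	// with the path of a CSV stored inside the "datas" folder.
	records, err := parseCSV("7.76,24.54,B\n17.99,10.38,M\n")
	if err != nil {
		fmt.Println("could not read dataset:", err)
		return
	}
	fmt.Println(records) // [[7.76 24.54 B] [17.99 10.38 M]]
}
```

Every value stays a string at this point, which is convenient because the class column holds letters rather than numbers.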

Now our entire dataset is inside the variable “records”.

If I print the variable’s value in the terminal, we will have the following output.

fmt.Println(records)

…[7.76 24.54 47.92 181 0.05263 0.04362 0 0 0.1587 0.05884 0.3857 1.428 2.548 19.15 0.007189 0.00466 0 0 0.02676 0.002783 9.456 30.37 59.16 268.6 0.08996 0.06444 0 0 0.2871 B]]

We now have our dataset, over which we will iterate several times, applying the theory we saw in article # 1.

The next step after getting the dataset is to split it into 2 parts: the first is the training data and the second is the test data.

Training data: the already classified data that we will use to train our algorithm;

Test data: the classified data that we will use to validate our algorithm’s accuracy;

Let’s divide our dataset as follows: for each class, we take 70% of the data for training and the remaining 30% for testing.

First let’s extract the unique classes from our dataset. To do this we will use the following logic:

-> Extract from our dataset (records) the column that holds the class values;

-> Apply distinct( ) to remove the repeated values and thus get our unique classes;

[getCollum(…)]

The getCollum method receives a matrix and a column index as parameters. For each row of the matrix, we read the value at columnIndex and add it to a new array, which we return at the end.
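As a rough sketch, assuming the dataset is held as a [][]string, the column extraction described above could be written like this. The sample rows are made up for illustration.

```go
package main

import "fmt"

// getCollum returns the values found at columnIndex in every row of matrix.
func getCollum(matrix [][]string, columnIndex int) []string {
	column := make([]string, 0, len(matrix))
	for _, row := range matrix {
		column = append(column, row[columnIndex])
	}
	return column
}

func main() {
	// Made-up sample rows with the class in the last column.
	records := [][]string{
		{"7.76", "24.54", "B"},
		{"17.99", "10.38", "M"},
	}
	// The class sits in the last column, so its index is len(row) - 1.
	fmt.Println(getCollum(records, len(records[0])-1)) // [B M]
}
```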

[distinct(…)]

We pass the value returned by getCollum to the distinct( ) method, which receives an array as a parameter. For each element of this array:

-> It checks whether the element is already in the “encontred” map (dictionary);

-> If it is not there, it adds the element to the “encontred” map;

-> If it is already there, it does nothing;

In the end we return the elements collected in the map, which contain no repeated classes.
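The steps above can be sketched as follows. This version is an assumption on my part: it returns a slice in first-seen order rather than the map itself, so that the result prints as a plain list of classes.

```go
package main

import "fmt"

// distinct returns the elements of values without repetitions,
// in the order in which they first appear.
func distinct(values []string) []string {
	encontred := make(map[string]bool) // elements already seen
	unique := []string{}
	for _, v := range values {
		if !encontred[v] {
			encontred[v] = true
			unique = append(unique, v)
		}
	}
	return unique
}

func main() {
	fmt.Println(distinct([]string{"M", "M", "B", "M", "B"})) // [M B]
}
```

Keeping a separate slice avoids relying on map iteration order, which Go deliberately leaves unspecified.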

When we print the “classes” variable with fmt.Println(classes), we get an array with two values: [M, B]

Let’s iterate over these classes to get 70% of the data from each class for training and 30% for testing.

Let’s create our train and test matrices with 70% and 30% of the data;

The getValuesByClass( ) method receives an array and a class, and filters the array to keep only the rows of that class.

In the first iteration of the loop over the classes, we get the rows corresponding to a single class, for example M.

In the second iteration, the variable “values” holds the rows corresponding to class B.
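A possible shape for this filter is shown below, assuming the class label is stored in the last column of each row (as in the modified dataset). The sample rows are made up.

```go
package main

import "fmt"

// getValuesByClass keeps only the rows of matrix whose last column equals class.
func getValuesByClass(matrix [][]string, class string) [][]string {
	values := [][]string{}
	for _, row := range matrix {
		if row[len(row)-1] == class {
			values = append(values, row)
		}
	}
	return values
}

func main() {
	// Made-up sample rows with the class in the last column.
	records := [][]string{
		{"17.99", "10.38", "M"},
		{"7.76", "24.54", "B"},
		{"20.57", "17.77", "M"},
	}
	fmt.Println(len(getValuesByClass(records, "M")), len(getValuesByClass(records, "B"))) // 2 1
}
```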

After filtering only the data that belongs to a given class, we need to divide it between training data and test data. We will do it in the following code:
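One way to sketch this split is with a helper that cuts a class’s rows at a given percentage. The name splitTrainTest and the integer percentage parameter are assumptions for illustration, not the author’s exact code.

```go
package main

import "fmt"

// splitTrainTest puts the first trainPercent% of values in train and the
// rest in test. Integer arithmetic avoids floating-point rounding surprises.
func splitTrainTest(values [][]string, trainPercent int) (train, test [][]string) {
	cut := len(values) * trainPercent / 100
	// The three-index slice caps train's capacity, so a later append to it
	// cannot overwrite the rows that belong to test.
	return values[:cut:cut], values[cut:]
}

func main() {
	// Ten made-up rows of a single class.
	values := [][]string{}
	for i := 0; i < 10; i++ {
		values = append(values, []string{fmt.Sprint(i), "M"})
	}
	train, test := splitTrainTest(values, 70)
	fmt.Println(len(train), len(test)) // 7 3
}
```

This takes rows in file order; shuffling each class before cutting is a common refinement when the file might be sorted in some way.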

Once we have the test and training data inside the loop’s scope, we append them to the arrays declared outside the loop.

I know, using a loop to concatenate an array/slice isn’t the best way, but remember that I’m coding in the most didactic way that I can, so please feel free to fix it.
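For readers who want the shorter form, Go’s built-in append accepts a whole slice when it is followed by three dots, so no inner loop is needed. The helper name appendRows and the sample rows below are illustrative only.

```go
package main

import "fmt"

// appendRows adds every row of src to dst in a single call, using the
// variadic form of append instead of an explicit loop.
func appendRows(dst, src [][]string) [][]string {
	return append(dst, src...)
}

func main() {
	train := [][]string{}
	// Made-up rows standing in for the per-class training data.
	classM := [][]string{{"17.99", "M"}, {"20.57", "M"}}
	classB := [][]string{{"7.76", "B"}}
	train = appendRows(train, classM)
	train = appendRows(train, classB)
	fmt.Println(len(train)) // 3
}
```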

After separating the data, let’s test it!

Let’s classify each row in our test dataset and count how many we got right and how many we missed.
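To make the step concrete, here is a compact sketch of a KNN classifier and the hit counter, assuming Euclidean distance over the numeric columns, a majority vote among the k nearest rows, and the class label in the last column. The function names, the value k = 3, and the tiny dataset are all assumptions; the author’s implementation in the linked repository may differ.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
)

// distance returns the Euclidean distance between the feature columns of a
// and b. The last column of each row is the class label and is skipped.
func distance(a, b []string) float64 {
	sum := 0.0
	for i := 0; i < len(a)-1; i++ {
		x, _ := strconv.ParseFloat(a[i], 64) // parse errors ignored for brevity
		y, _ := strconv.ParseFloat(b[i], 64)
		sum += (x - y) * (x - y)
	}
	return math.Sqrt(sum)
}

// classify returns the majority class among the k training rows nearest to row.
func classify(train [][]string, row []string, k int) string {
	type neighbor struct {
		class string
		dist  float64
	}
	neighbors := make([]neighbor, 0, len(train))
	for _, t := range train {
		neighbors = append(neighbors, neighbor{t[len(t)-1], distance(t, row)})
	}
	sort.Slice(neighbors, func(i, j int) bool { return neighbors[i].dist < neighbors[j].dist })
	if k > len(neighbors) {
		k = len(neighbors)
	}
	votes := map[string]int{}
	best := ""
	for _, n := range neighbors[:k] {
		votes[n.class]++
		if votes[n.class] > votes[best] {
			best = n.class
		}
	}
	return best
}

func main() {
	// Made-up two-feature rows, only to exercise the functions.
	train := [][]string{
		{"1.0", "1.0", "B"}, {"1.2", "0.8", "B"},
		{"5.0", "5.1", "M"}, {"5.2", "4.9", "M"},
	}
	test := [][]string{{"1.1", "0.9", "B"}, {"5.1", "5.0", "M"}}
	hits := 0
	for _, row := range test {
		if classify(train, row, 3) == row[len(row)-1] {
			hits++
		}
	}
	fmt.Println("hits:", hits, "of", len(test)) // hits: 2 of 2
}
```

An odd k is a common choice with two classes because it rules out tied votes.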

Finally, we can print a few statistics (the labels are in Portuguese: total records, training rows, test rows, hits, and hit percentage):

fmt.Println("Total de dados: ", len(records))

fmt.Println("Total de treinamento: ", len(train))

fmt.Println("Total de testes: ", len(test))

fmt.Println("Total de acertos: ", hits)

fmt.Println("Porcentagem de acertos: ", (100 * hits / len(test)), "%")

At the end, when we run “go run main.go” in the terminal, we will get this result:

Total de dados: 569

Total de treinamento: 513

Total de testes: 513

Total de acertos: 484

Porcentagem de acertos: 94 %
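Note that 100 * hits / len(test) is integer arithmetic in Go, so the decimal part is dropped. If you want the fraction as well, a small helper along these lines (the name accuracy is my own) converts to float64 first; for 484 hits out of 513 it gives about 94.35%.

```go
package main

import "fmt"

// accuracy returns the percentage of hits over total as a float,
// or 0 when total is 0 to avoid dividing by zero.
func accuracy(hits, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(total)
}

func main() {
	// 484 hits out of 513 test rows, the figures from the run shown above.
	fmt.Printf("%.2f%%\n", accuracy(484, 513))
}
```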

And if you want to see which rows your algorithm got right and which it got wrong, we can print the classification made for each test row by uncommenting line 22 of the gist above:

//fmt.Println(“tumor: “, test[i][columnIndex], “ classificado como:

And we will have the following output:

tumor: M classificado como: B

tumor: M classificado como: M

tumor: M classificado como: M

tumor: M classificado como: M

tumor: M classificado como: M

tumor: M classificado como: M

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

tumor: B classificado como: B

Total de dados: 569

Total de treinamento: 513

Total de testes: 513

Total de acertos: 479

Porcentagem de acertos: 93 %

Thanks for reading, I hope you found it useful.