LDA on raw data (All 30 dimensions)

Alright on with the show, let’s start by defining our data:

wdbc.data <- as.matrix(wdbc[,c(3:32)])

row.names(wdbc.data) <- wdbc$id

wdbc_raw <- cbind(wdbc.data, as.numeric(wdbc$diagnosis)-1)

colnames(wdbc_raw)[31] <- "diagnosis"

This simply removes ID as a variable and stores our data as a matrix instead of a data frame, while still retaining the IDs as row names. The diagnosis is recoded as 0/1 and appended as the last column.

Now we need to define a train- / test-split so that we have some data we can test our model on:

smp_size_raw <- floor(0.75 * nrow(wdbc_raw))

train_ind_raw <- sample(nrow(wdbc_raw), size = smp_size_raw)

train_raw.df <- as.data.frame(wdbc_raw[train_ind_raw, ])

test_raw.df <- as.data.frame(wdbc_raw[-train_ind_raw, ])

This makes a 75/25 split of our data using the highly convenient sample() function in R. We then convert our matrices to data frames.
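Since sample() draws the training rows at random, the split (and everything downstream of it) changes on every run. A minimal sketch of making the split reproducible with set.seed() — the seed value and the toy matrix here are arbitrary, chosen just for illustration:

```r
# Fix the RNG seed so the split is reproducible (1234 is arbitrary)
set.seed(1234)

# A toy matrix standing in for wdbc_raw: 20 rows, 3 columns
toy <- matrix(rnorm(60), nrow = 20)

smp_size  <- floor(0.75 * nrow(toy))           # 15 rows for training
train_ind <- sample(nrow(toy), size = smp_size)

train.df <- as.data.frame(toy[train_ind, ])    # 15 training rows
test.df  <- as.data.frame(toy[-train_ind, ])   # 5 held-out rows
```

With the seed fixed, rerunning the script reproduces the same train/test partition.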

Now that our data is ready, we can use the lda() function from the MASS package in R to make our analysis; its interface is functionally identical to the lm() and glm() functions:

f <- paste(names(train_raw.df)[31], "~", paste(names(train_raw.df)[-31], collapse=" + "))

wdbc_raw.lda <- lda(as.formula(f), data = train_raw.df)

This is a little life hack to paste all the variable names together instead of writing them out manually. You can print the object 'wdbc_raw.lda' if you want to see the coefficients and group means of your LDA, but it's quite a mouthful so I won't post the output in this article.
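To see how the paste() trick works in isolation, here is a sketch on a small made-up data frame (the column names x1, x2 and y are invented for this example):

```r
# Toy data frame: two predictors and a target in the last column
df <- data.frame(x1 = 1:5, x2 = 6:10, y = c(0, 1, 0, 1, 0))

target     <- names(df)[3]    # "y"
predictors <- names(df)[-3]   # "x1" "x2"

# Glue them into a formula string, then convert it
f <- paste(target, "~", paste(predictors, collapse = " + "))
f
# "y ~ x1 + x2"
as.formula(f)  # ready to hand to lda(), lm(), glm(), ...
```

The same pattern scales to all 30 predictors without typing a single variable name.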

Now let’s make some predictions on our testing-data:

wdbc_raw.lda.predict <- predict(wdbc_raw.lda, newdata = test_raw.df)

If you want to inspect the predictions, simply call 'wdbc_raw.lda.predict$class'.
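A quick way to compare predicted and actual classes is a confusion matrix via table(). The vectors below are made-up stand-ins for wdbc_raw.lda.predict$class and test_raw.df$diagnosis:

```r
# Toy stand-ins for predicted classes and true labels (0 = benign, 1 = malignant)
predicted <- factor(c(0, 1, 1, 0, 1, 0, 0, 1))
actual    <- factor(c(0, 1, 0, 0, 1, 0, 1, 1))

# Cross-tabulate predictions against the truth
table(Predicted = predicted, Actual = actual)

# Overall accuracy: fraction of matching labels
accuracy <- mean(predicted == actual)
accuracy  # 0.75 here: 6 of 8 toy cases classified correctly
```

On the real data you would replace the two toy vectors with the model's predicted classes and the test set's diagnosis column.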

Evaluation

This is the exciting part, now we can see how well our model performed!

### CONSTRUCTING ROC AUC PLOT
# prediction() and performance() below come from the ROCR package
library(ROCR)

# Get the posteriors as a dataframe
wdbc_raw.lda.predict.posteriors <- as.data.frame(wdbc_raw.lda.predict$posterior)

# Evaluate the model

pred <- prediction(wdbc_raw.lda.predict.posteriors[,2], test_raw.df$diagnosis)

roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")

auc.train <- performance(pred, measure = "auc")

auc.train <- auc.train@y.values

# Plot
plot(roc.perf)

abline(a=0, b= 1)

text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))

And here we go, a beautiful ROC plot! Here I've simply plotted the points of interest and added a legend to explain them. The point I've plotted as the "optimal" cut-off is simply the point on our curve with the lowest Euclidean distance to the point (0,1), which represents a 100% true positive rate and a 0% false positive rate, i.e. perfect separation / prediction.
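The "lowest Euclidean distance to (0,1)" rule can be sketched in a few lines. The fpr, tpr and cutoffs vectors here are toy values invented for illustration; with ROCR you would use roc.perf@x.values[[1]], roc.perf@y.values[[1]] and roc.perf@alpha.values[[1]] instead:

```r
# Toy ROC points (made up for illustration); each index is one cut-off
fpr     <- c(0.00, 0.033, 0.10, 0.30, 1.00)
tpr     <- c(0.50, 0.9615, 0.97, 0.99, 1.00)
cutoffs <- c(0.90, 0.70, 0.50, 0.30, 0.10)

# Euclidean distance of each ROC point to the ideal corner (0, 1)
d    <- sqrt(fpr^2 + (1 - tpr)^2)
best <- which.min(d)

# The cut-off whose ROC point lies closest to (0, 1)
c(cutoff = cutoffs[best], tpr = tpr[best], fpr = fpr[best])
```

Any other criterion (e.g. maximising TPR − FPR, Youden's J) just swaps out the distance formula in the same loop-free style.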

So what does this mean? It means that depending on how we want our model to "behave", we can use different cut-offs. Do we want a 100% true positive rate at the cost of getting some false positives? Or do we want 0% false positives at the cost of a lower true positive rate? Is it worse to get diagnosed with a malignant (cancerous) tumor when it's actually benign, or is it worse to be told you're healthy when the tumor is actually malignant?

Our "optimal" point has a TPR of 96.15% and an FPR of 3.3%, which seems decent, but do we really want to tell 3.3% of healthy people that they have cancer and 3.85% of sick people that they're healthy?

Please keep in mind that your results will almost certainly differ from mine, since the sample() method used for the train- / test-split is random.

Let's take a look at LDA on PCA-transformed data and see if we get better results.