This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets and Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. This part covers how to use a pre-computed Random Forest classification model to predict relapse in new breast cancer samples. We assume that the data sets are in the correct format, as produced in part 1. The data sets used here are the 'test data' from the prior tutorial, retrieved from GSE2990. To avoid the hassle of copy-pasting every block of code, the full script can be downloaded here.

Load the necessary packages.

library(randomForest)
library(ROCR)
require(Hmisc)

Set the working directory and filenames for input/output.

setwd("/Users/ogriffit/git/biostar-tutorials/MachineLearning")
RF_model_file="RF_model"
datafile="testset_gcrma.txt"
clindatafile="testset_clindetails.txt"
outfile="testset_RFoutput.txt"
case_pred_outfile="testset_CasePredictions.txt"
ROC_pdffile="testset_ROC.pdf"
vote_dist_pdffile="testset_vote_dist.pdf"

Next we will read in the data sets (expecting tab-delimited files with a header line and row names). We also need to rearrange the clinical data so that it is in the same order as the GCRMA expression data.

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_order=order(clin_data_import[,"geo_accn"])
clindata=clin_data_import[clin_data_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
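The index-offset trick above (order a subset of columns, then add the offset back) is easy to get wrong. A minimal, self-contained sketch with a hypothetical data frame (three annotation columns followed by sample columns) shows what it does:

```r
# Toy data frame: 3 annotation columns, then sample columns in scrambled order
df=data.frame(id=1:3, probe="A", symbol="B", s3=c(9,8,7), s1=c(1,2,3), s2=c(4,5,6))

# Order only the sample columns (4..n), then add 3 to map back to full-frame indices
data_order=order(colnames(df)[4:ncol(df)])+3

# Annotation columns first, sample columns in sorted order (s1, s2, s3)
colnames(df[,c(1:3, data_order)])
```

Without the `+3`, the indices returned by order() would point at the wrong columns of the full data frame.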

Extract just the expression values from the filtered data and transpose the matrix.

predictor_data=t(rawdata[,4:length(header)])
predictor_names=as.vector((rawdata[,3])) #gene symbol
colnames(predictor_data)=predictor_names

Load the Random Forest classifier from file (the object "rf_output", which was saved previously).

load(file=RF_model_file)

Extract the predictor (gene) names from the RF model built in the previous tutorial and subset the test data to just these predictors. This is not strictly necessary, as the randomForest predict function would automatically restrict itself to these variables; it is included here for clarity.

RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]

Run the test data through the forest!

RF_predictions_responses=predict(rf_output, predictor_data, type="response")
RF_predictions_votes=predict(rf_output, predictor_data, type="vote")
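The difference between the two prediction types is worth seeing on its own. A minimal sketch on simulated data (not the tutorial's test set): type="response" returns a hard class label per case, while type="vote" returns the fraction of trees voting for each class, which is what we later use as a continuous score for the ROC curve.

```r
library(randomForest)

# Simulated toy data: 50 cases, 4 predictors, class driven by the first predictor
set.seed(1)
x=matrix(rnorm(200), ncol=4)
y=factor(ifelse(x[,1]+rnorm(50)>0, "Relapse", "NoRelapse"))
rf_toy=randomForest(x, y, ntree=100)

# New cases to predict
newx=matrix(rnorm(20), ncol=4)
predict(rf_toy, newx, type="response") #hard class labels
predict(rf_toy, newx, type="vote")     #per-class vote fractions; each row sums to 1
```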

Join predictions with clinical data and exclude rows with missing clinical data needed for subsequent steps. Then, write results to file.

clindata_plusRF=cbind(clindata,RF_predictions_responses,RF_predictions_votes)
clindata_plusRF=clindata_plusRF[which(!is.na(clindata_plusRF[,"event.rfs"])),]
write.table(clindata_plusRF,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)

As before, the following lines give an overview of the classifier's performance. This time, instead of estimating performance from out-of-bag (OOB) testing, we are measuring actual performance on an independent test set.

confusion=table(clindata_plusRF[,c("event.rfs","RF_predictions_responses")])
rownames(confusion)=c("NoRelapse","Relapse")
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100
overall_error=((confusion[1,2]+confusion[2,1])/sum(confusion))*100
overall_accuracy=((confusion[1,1]+confusion[2,2])/sum(confusion))*100
class1_error=confusion[1,2]/(confusion[1,1]+confusion[1,2])
class2_error=confusion[2,1]/(confusion[2,2]+confusion[2,1])
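To make the arithmetic concrete, here is the same set of calculations on a hand-built toy confusion table (hypothetical counts, with rows as the true class and columns as the predicted class, matching the layout above):

```r
# Toy confusion table: 60 true NoRelapse (50 correct), 40 true Relapse (35 correct)
confusion=matrix(c(50,10,5,35), nrow=2, byrow=TRUE,
                 dimnames=list(c("NoRelapse","Relapse"), c("NoRelapse","Relapse")))

sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100      #35/40 = 87.5
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100      #50/60 = 83.33
overall_accuracy=((confusion[1,1]+confusion[2,2])/sum(confusion))*100 #85/100 = 85
```

Note that sensitivity comes entirely from the Relapse row and specificity entirely from the NoRelapse row; mislabeling the rows would silently swap the two statistics.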

Next we will prepare each useful statistic for writing to an output file.

sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], colnames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], colnames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=confusion[1:2,1:2]
confusion_out=cbind(rownames(confusion_out), confusion_out)

Create variables for the known target class and predicted class probabilities.

target=clindata_plusRF[,"event.rfs"]
target[target==1]="Relapse"
target[target==0]="NoRelapse"
relapse_scores=clindata_plusRF[,"Relapse"]

Once again, we will create an ROC curve and calculate the area under it (AUC). Recall that we use the relapse/non-relapse vote fractions as the predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse vs non-relapse.

First calculate the AUC value.

pred=prediction(relapse_scores,target)
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
AUC_out=paste("AUC=",AUC,sep="")
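For intuition about what ROCR is computing: the AUC equals the probability that a randomly chosen positive (relapse) case receives a higher score than a randomly chosen negative case. A base-R sketch on hypothetical vote fractions (toy data, not the test set; assumes no tied scores):

```r
# Toy scores (vote fractions) and true labels; 1 = Relapse
scores=c(0.9,0.8,0.7,0.6,0.4,0.3,0.2,0.1)
labels=c(1,1,0,1,0,1,0,0)

pos=scores[labels==1]
neg=scores[labels==0]

# Fraction of (positive, negative) pairs where the positive case scores higher
AUC=mean(outer(pos, neg, ">"))
AUC #0.8125
```

An AUC of 0.5 corresponds to random guessing; 1.0 corresponds to perfect separation of the two classes.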

Print the results to file, along with the other statistics prepared above.

write("confusion table", file=outfile)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,misclass_1,misclass_2,AUC_out), file=outfile, append=TRUE)

Then, plot the actual ROC curve.

perf_ROC=performance(pred,"tpr","fpr")
pdf(file=ROC_pdffile)
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
dev.off()

You should now have a case prediction file, with which we will perform some survival analysis. See: Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.