This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let's download and pre-process the training data. Note: The gcrma step may require you to have as much as ~8gb of ram. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on linux or windows but package installation might differ slightly.

Set some data directories.

tempdir="/Users/ogriffit/temp/" outdir="/Users/ogriffit/git/biostar-tutorials/MachineLearning"

Install the necessary bioconductor packages (if not already installed).

We will use custom CDF packages rather than the standard CDFs. Note, if the packages below do not install successfully with the biocLite method current source files can be installed manually. Go to the latest version, choose the link to ENTREZG for the appropriate version of R, find the HGU133A row and download three source packages ('C', 'P' and 'A'). In a terminal, run R CMD INSTALL pckgfilename where pckgfilename is the name of the package file you downloaded.

biocLite("hgu133ahsentrezgcdf") biocLite("hgu133ahsentrezgprobe") biocLite("hgu133ahsentrezg.db")

Then, load all necessary libraries

library(GEOquery) library(affy) library(gcrma) library(hgu133ahsentrezgcdf) #cdfname="HGU133A_HS_ENTREZG" library(hgu133ahsentrezgprobe) library(hgu133ahsentrezg.db)

Set a data directory, download a GEO dataset, unpack and gunzip, and create a list of files for processing.

setwd(tempdir) getGEOSuppFiles("GSE2034") setwd(paste(tempdir,"GSE2034", sep="")) untar("GSE2034_RAW.tar", exdir="data") cels = list.files("data/", pattern = "CEL") sapply(paste("data", cels, sep="/"), gunzip) cels = list.files("data/", pattern = "CEL")

Create an AffyBatch object and perform GCRMA normalization.

setwd(paste(tempdir,"GSE2034/data", sep="")) raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="HGU133A_HS_ENTREZG") data.gcrma.norm=gcrma(raw.data)

Extract expression values. Then, remove Affy control probes. In this custom CDF (Version 17.1.0, ENTREZG), these are found in rows 12031 to 12098. You can check for yourself and look for probe names starting with "AFFX" using 'rownames(gcrma)'. Alternatively you could use a regex to grab just those rows that don't contain "AFFX"

gcrma=exprs(data.gcrma.norm) #gcrma=gcrma[1:12030,] gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]

Map probe set identifiers to Entrez gene symbols and IDs and then combine with raw data.

probes=row.names(gcrma) symbol = unlist(mget(probes, hgu133ahsentrezgSYMBOL)) ID = unlist(mget(probes, hgu133ahsentrezgENTREZID)) gcrma=cbind(probes,ID,symbol,gcrma)

Write GCRMA-normalized and mapped data to file.

setwd(outdir) write.table(gcrma, file = "trainset_gcrma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

Get the clinical details for this dataset.

GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,] write.table(GSE2034_clindata, "trainset_clindetails.txt", quote = FALSE, sep = "\t", row.names = FALSE)

After running all of the above commands you should have a data file (trainset_gcrma.txt) with expression estimates for 12,030 genes and 286 breast cancer samples and a second file with clinical details (trainset_clindetails.txt). To avoid copy-pasting you can download the processAffyTrainData.R script.

IMPORTANT: Obtaining the test data set follows a very similar procedure and can be achieved by downloading and running the processAffyTestData.R script. If successful, the latter will give you a second expression data file (testset_gcrma.txt) with 189 additional breast cancer samples and another clinical details file (testset_clindetails.txt). You will need both the trainset___ and testset___ data files to successfully complete subsequent tutorials.

The next tutorial will show you how to train a randomForest model for classification of relapse versus non-relapse. See Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. NOTE: this tutorial overlaps substantially with this tutorial but was modified here to produce the exact files we will need for subsequent parts of the series.