Most elementary statistical inference algorithms assume that the data can be modeled by linear parameters with a normally distributed error component. According to Vladimir Vapnik in Statistical Learning Theory (1998), this assumption is inappropriate for modern large-scale problems, and his invention of the Support Vector Machine (SVM) makes such assumptions unnecessary. There are many implementations of the algorithm, and a popular one is LIBSVM, which can be invoked in R via the e1071 package.

For demonstration purposes, we will train a regression model on the California housing prices data from the 1990 Census. The data set is called cadata and can be downloaded from the LIBSVM site.

Before training the SVM model, we pre-scale the downloaded data in a terminal with the standalone tool rpusvm-scale from RPUSVM. This creates a new data set cadata.scaled. As a good practice, we also save the scale parameters in a secondary file cadata.save for later use.

$ rpusvm-scale -x "-1:1" -y "-1:1" -s cadata.save cadata cadata.scaled

Now we can load cadata.scaled in R with the function read.svm.data in the rpudplus add-on. Since the response values in the data set are not factor levels, we have to set the argument fac to FALSE. We also save the x and y components as standalone variables for convenience.



> library(rpud)    # load rpudplus
> cadata.scaled <- read.svm.data("cadata.scaled", fac=FALSE)
> x <- cadata.scaled$x; y <- cadata.scaled$y

Then we train an SVM regression model using the function svm in e1071. As the data has been pre-scaled, we disable the scale option. The data set has about 20,000 observations, and the training takes over a minute on an AMD Phenom II X4 system.
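As a sketch, the training call might look like the following; the model variable name cadata.e1071 and the explicit type argument are illustrative assumptions, while scale=FALSE follows the text since the data was pre-scaled.

```r
# sketch of the e1071 training call on the pre-scaled data;
# scale=FALSE prevents e1071 from re-scaling, and the model
# name cadata.e1071 is an assumed convention
library(e1071)
cadata.e1071 <- svm(x, y, type="eps-regression", scale=FALSE)
```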

We can do likewise with the function rpusvm of the rpudplus add-on. The same training now takes only 6 seconds on an NVIDIA GTX 460 GPU:
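A sketch of the equivalent GPU call, assuming rpusvm mirrors the svm interface as the side-by-side comparison suggests; the model name cadata.rpusvm is the one used in Note 1.

```r
# GPU-accelerated training via the rpudplus add-on;
# the argument list is assumed to mirror e1071::svm
library(rpud)
cadata.rpusvm <- rpusvm(x, y, type="eps-regression", scale=FALSE)
```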

The models trained by the two packages are numerically equivalent, as evidenced by their respective mean square errors. For the LIBSVM model from e1071, the mean square error is about 0.0696.

This is almost identical to the mean square error from the function rpusvm in rpudplus:
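One way to compute the two mean square errors, assuming the e1071 model is stored in a variable cadata.e1071 (a hypothetical name) and the rpudplus model in cadata.rpusvm (the name that appears in Note 1):

```r
# in-sample mean square errors of the two fitted regression models
mean((predict(cadata.e1071, x) - y)^2)    # about 0.0696 per the text
mean((predict(cadata.rpusvm, x) - y)^2)
```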

Sometimes it is more effective to invoke LIBSVM directly in a terminal. Using OpenMP to parallelize LIBSVM v3.1 on an AMD Phenom II X4 CPU, training a regression model of cadata takes about 28 seconds.

$ time svm-train -s 3 -m 1000 cadata.scaled cadata.libsvm
........
optimization finished, #iter = 9677
nu = 0.590304
obj = -2232.720805, rho = -0.299943
nSV = 12216, nBSV = 12156

real    0m28.633s
user    0m28.190s
sys     0m0.390s

Using a standalone Linux tool in RPUSVM, we can run the same rpusvm code from a terminal. The training takes about 5 seconds on a GTX 460 GPU:

$ time rpusvm-train -s 3 cadata.scaled cadata.rpusvm
rpusvm-train 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
.........
Finished optimization in 9498 iterations
nu = 0.590179
obj = -2232.72, rho = -0.300649
nSV = 12218, nBSV = 12157
Total nSV = 12218

real    0m5.100s
user    0m4.940s
sys     0m0.150s

Finally, we compare their prediction speeds on cadata. Parallelized LIBSVM takes about 11 seconds on Phenom II X4:

$ time svm-predict cadata.scaled cadata.libsvm cadata.out
Mean squared error = 8500.66 (regression)
Squared correlation coefficient = 0.000325578 (regression)

real    0m11.176s
user    0m44.540s
sys     0m0.010s

The same task takes RPUSVM under 2 seconds on GTX 460:

$ time rpusvm-predict cadata.scaled cadata.rpusvm cadata.out
rpusvm-predict 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
Mean squared error = 0.0695664
Pearson correlation coefficient = 0.698953

real    0m1.631s
user    0m1.440s
sys     0m0.170s

Exercise 1

Train SVM models of larger data sets using rpusvm.

Exercise 2

Find probability estimates of the regression model of cadata by enabling the probability option in rpusvm.

Exercise 3

Perform cross-validation for the regression model of cadata by enabling the cross option in rpusvm.

Exercise 4

Search for an optimal SVM kernel and parameters for the regression model of cadata using rpusvm, following procedures similar to those explained in the text A Practical Guide to Support Vector Classification. In particular, create a contour map of the cross-validation results for selecting smaller regions for further optimization.

Note 1

Suppose we would like to perform prediction on a data file stored in LIBSVM format, say test.dat. We must first pre-scale it with the scale parameter file cadata.save, which we created earlier while preparing cadata for training.

$ rpusvm-scale -r cadata.save test.dat test.scaled

Then we load it in R with the read.svm.data function in rpudplus and apply the function predict as usual. Just make sure to restore the result to the original y-scale manually before use.

> test.scaled <- read.svm.data("test.scaled", fac=FALSE)
> pred <- predict(cadata.rpusvm, test.scaled$x)
> head(pred)
        1         2         3         4         5 
 0.592945  0.782728  0.557078  0.299014 -0.053179 
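One way to restore the original y-scale, assuming rpusvm-scale mapped the response linearly from its original range onto [-1, 1]; y.min and y.max here are hypothetical variables holding the response bounds recorded in cadata.save.

```r
# invert the linear [-1, 1] scaling of the response;
# y.min and y.max are the original response bounds from cadata.save
pred.orig <- (pred + 1) / 2 * (y.max - y.min) + y.min
```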

These data-scaling chores can be avoided with the latest rpudplus and RPUSVM. See our next tutorial for details.

Note 2

A much faster algorithm for large scale document classification without the use of a GPU is LIBLINEAR. It can process millions of records in seconds.

References

- Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
- Hsu, C.-W., Chang, C.-C., and Lin, C.-J. A Practical Guide to Support Vector Classification.
- Chang, C.-C., and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology.