Summary

For clarity, features were standardized by subtracting the mean and dividing by the standard deviation. The standardized features were then transformed into random features (a random basis) by gradient boosting of the Random Bits base learner, a 3-layer sparse neural network with random weights, and fed to a random forest classifier/regressor to obtain predictions (Fig. 1).
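For concreteness, the standardization step amounts to the following (a minimal NumPy sketch; the helper name is ours):

import numpy as np

def standardize(X):
    # Column-wise: subtract each feature's mean and divide by its standard deviation.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)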

Figure 1 The summarized process. A 3-layer sparse neural network with random weights. Z represents threshold functions.

Random Bits

Our derived feature, which serves as both basis and base learner, is called Random Bits. It is a 3-layer sparse neural network with random weights. Two parameters control the construction of the network: twist1 (the number of features connected to each hidden node) and twist2 (the number of hidden nodes).

The features connected to each hidden node are randomly assigned, and the interlayer weights are drawn from a standard normal distribution. The hidden nodes and the top node are threshold units: the threshold of each node is determined by calculating the linear summation z_i of its inputs for the ith sample and choosing a random z_i among the samples as the threshold15.
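Under our reading of this construction, a single Random Bit can be drawn as follows (a minimal NumPy sketch; apart from twist1 and twist2, all names are illustrative):

import numpy as np

def draw_random_bit(X, twist1, twist2, rng):
    # X: (n_samples, n_features) standardized data matrix.
    n, p = X.shape
    H = np.empty((n, twist2))
    for j in range(twist2):
        feats = rng.choice(p, size=twist1, replace=False)  # sparse random connections
        w = rng.standard_normal(twist1)                    # weights drawn from N(0, 1)
        z = X[:, feats] @ w                                # linear summation per sample
        H[:, j] = z > z[rng.integers(n)]                   # a random z_i as the threshold
    w_top = rng.standard_normal(twist2)
    z = H @ w_top
    return z > z[rng.integers(n)]                          # one output bit per sample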

Boosting Random Bits

To generate many Random Bits, we used a gradient boosting scheme with the following pseudocode:

For boost = 1 to B:
    residual = Y
    For step = 1 to S:
        MaxVar = 0; BestBit = NULL
        For cand = 1 to C:
            Draw a random bit, RB
            Calculate Var, the residual variance explained by RB
            if (Var > MaxVar) { MaxVar = Var; BestBit = RB }
        random_bit_pool[(boost − 1) * S + step] = BestBit
        Mean[0] = E(residual | BestBit = 0); Mean[1] = E(residual | BestBit = 1)
        residual = residual − Mean[BestBit]

The algorithm launched B independent boosting chains, each with S steps. Each chain followed the standard gradient boosting procedure, starting with a residual of Y and updating it at every step. At each step, C candidate Random Bits (C > 100) were generated and the bit that explained the largest share of the pseudo-residual variance was chosen. The Random Bits from all independent boosting chains were collected to form a large (~10,000) feature pool, stored in a compressed format requiring 1 bit per Random Bit per sample.
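A runnable sketch of this scheme, reusing draw_random_bit from above (taking the between-group variance as the "residual explained" criterion, which is one plausible reading; all names are illustrative):

import numpy as np

def boost_random_bits(X, Y, B, S, C, twist1, twist2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    pool = np.empty((B * S, n), dtype=bool)
    for boost in range(B):
        residual = np.asarray(Y, dtype=float).copy()  # each chain starts from Y
        for step in range(S):
            max_var, best_bit = 0.0, None
            for _ in range(C):
                rb = draw_random_bit(X, twist1, twist2, rng)
                if rb.all() or not rb.any():          # degenerate bit explains nothing
                    continue
                m = residual.mean()
                m0, m1 = residual[~rb].mean(), residual[rb].mean()
                var = (~rb).sum() * (m0 - m) ** 2 + rb.sum() * (m1 - m) ** 2
                if var > max_var:
                    max_var, best_bit = var, rb
            # assumes at least one informative candidate among the C draws
            pool[boost * S + step] = best_bit
            mean = np.where(best_bit, residual[best_bit].mean(),
                            residual[~best_bit].mean())
            residual = residual - mean                # residual = residual - Mean[BestBit]
    return np.packbits(pool, axis=1)                  # 1 bit per Random Bit per sample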

Random Bits Forest

The collected Random Bits are eventually fed to the Random Bits Forest, a random forest classifier/regressor slightly modified for speed: each tree was grown with a bootstrapped sample and a bootstrapped subset of bits, the number of which can be tuned by the user, and the best bit among the bootstrapped bits was chosen for each split. By making full use of the binary nature of Random Bits, through special coding and Streaming SIMD Extensions (SSE), acceleration was achieved such that the modified random forest can handle ~10,000 binary features on large datasets (N = 500,000).
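The speed-up comes from evaluating splits on bit-packed features with bitwise operations and population counts. The actual software uses special coding and SSE intrinsics; the following NumPy sketch only illustrates the idea:

import numpy as np

def count_in_node(packed_bit, packed_node_mask):
    # packed_bit: one Random Bit over all samples, bit-packed with np.packbits.
    # packed_node_mask: membership of samples in the current tree node, same packing.
    # One bitwise AND plus a popcount replaces a loop over individual samples.
    both = packed_bit & packed_node_mask
    return int(np.unpackbits(both).sum())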

Benchmarking

We benchmarked nine methods: linear regression (Linear), logistic regression (LR), k-Nearest Neighbors (kNN), neural networks (NN), support vector machines (SVM), extreme learning machines (ELM), random forests (RF), generalized boosted models (GBM) and Random Bits Forest (RBF). We used the RBF software available at http://sourceforge.net/projects/random-bits-forest/ and implemented the other eight methods with R (v3.2.1) packages: stats, RWeka (v0.4-24), nnet (v7.3-8), kernlab (v0.9-19), randomForest (v4.6-10), elmNN (v1.0) and gbm (v2.1). We evaluated each method with ten-fold cross-validation, reporting accuracy, sensitivity, specificity and AUC. For methods sensitive to parameter selection, we manually tuned the parameters to obtain the best performance; because each method was run with its best hand-picked parameters, the reported performances are directly comparable. The results of tuning the sensitive methods on the real psoriasis genome-wide association study (GWAS) dataset are provided in Supplemental Materials 1. Benchmarking was performed on a desktop PC equipped with an AMD FX-8320 CPU and 32 GB of memory. On some large-sample datasets, SVM failed to complete benchmarking within a reasonable time (1 week), so those results are left blank.
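The benchmarks themselves were run in R; for illustration, the evaluation protocol corresponds to the following Python sketch, with scikit-learn standing in for the R packages actually used:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def ten_fold_auc(X, y, seed=0):
    # X, y: NumPy arrays; returns the mean AUC over ten stratified folds.
    aucs = []
    for train, test in StratifiedKFold(10, shuffle=True, random_state=seed).split(X, y):
        clf = RandomForestClassifier(random_state=seed).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(aucs))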

Benchmarked UCI Datasets Study

We benchmarked all datasets from the UCI Machine Learning Repository19 that fulfilled the following criteria: (1) the dataset contains no missing values; (2) the dataset is in dense matrix form; (3) any classification task is binary; and (4) the dataset has clear instructions and a specified target variable.

We included 14 regression datasets (3D Road Network20, Bike Sharing21, Buzz in Social Media TomsHardware, Buzz in Social Media Twitter, Computer Hardware22, Concrete Compressive Strength23, Forest Fires24, Housing25, Istanbul Stock Exchange26, Parkinsons Telemonitoring27, Physicochemical Properties of Protein Tertiary Structure, Wine Quality28, Yacht Hydrodynamics29 and Year Prediction MSD30) and 14 classification datasets (Banknote Authentication, Blood Transfusion Service Center31, Breast Cancer Wisconsin Diagnostic32, Climate Model Simulation Crashes33, Connectionist Bench34, EEG Eye State, Fertility35, Haberman's Survival36, Hill-Valley with Noise37, Indian Liver Patient38, Ionosphere39, MAGIC Gamma Telescope40, QSAR Biodegradation41 and Skin Segmentation42).

Applications on GWAS Dataset Study

We applied each method to a psoriasis GWAS dataset43,44 to predict disease outcomes. We obtained the dataset, part of the Collaborative Association Study of Psoriasis (CASP), from the Genetic Association Information Network (GAIN) database, a partnership of the Foundation for the National Institutes of Health. The data are available at http://dbgap.ncbi.nlm.nih.gov through dbGaP accession number phs000019.v1.p1. All genotypes were filtered for data quality44. We used 1590 subjects (915 cases, 675 controls) in the general research use (GRU) group and 1133 subjects (431 cases, 702 controls) in the autoimmune disease only (ADO) group. All psoriasis cases were diagnosed by a dermatologist, and each participant's DNA was genotyped with the Perlegen 500K array. Both cases and controls signed informed consent forms, and controls (≥18 years old) had no confounding factors related to a known diagnosis of psoriasis.

We used both SNP ranking and multiple logistic regression, based on allelic association p-values, for feature selection in the training dataset, and compared the methods on both training and testing datasets. First, we trained models on the GRU dataset with different numbers of top-associated SNPs, using the robust and popular LR method to select the number of SNPs that maximized the AUC on the independent ADO (testing) dataset (Fig. 2 and Supplemental Materials 2). We then used this best number of top-associated SNPs (50) as input variables and evaluated the performance of each learning algorithm (except LR) on both the GRU (training) and the independent ADO (testing) dataset. For further detail on these 50 top-associated SNPs, their Pearson's R-squared values and odds ratios45 are provided in Supplemental Materials 3.
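For illustration, SNP ranking by allelic association p-values can be sketched as follows (a hypothetical helper, not the authors' pipeline; SciPy's chi-square contingency test stands in for the allelic association test):

import numpy as np
from scipy.stats import chi2_contingency

def top_snps(genotypes, status, k=50):
    # genotypes: (n_subjects, n_snps) minor-allele counts in {0, 1, 2};
    # status: binary case/control labels. Returns indices of the k most associated SNPs.
    n_case, n_ctrl = int((status == 1).sum()), int((status == 0).sum())
    pvals = []
    for j in range(genotypes.shape[1]):
        g = genotypes[:, j]
        case_minor = int(g[status == 1].sum())
        ctrl_minor = int(g[status == 0].sum())
        # 2x2 allele-count table: minor/major allele by case/control status
        table = [[case_minor, 2 * n_case - case_minor],
                 [ctrl_minor, 2 * n_ctrl - ctrl_minor]]
        pvals.append(chi2_contingency(table)[1])  # allelic association p-value
    return np.argsort(pvals)[:k]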

Figure 2 Maximum AUC of the independent ADO testing dataset with different numbers of markers.

To evaluate a classification method's performance on an imbalanced dataset, we used the area under the receiver operating characteristic (ROC) curve. The area under the curve (AUC) measures global classification accuracy and equals the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance46. We used the AUC as the measure of classifier performance for both the GRU (training) and ADO (testing) datasets (Table 3, Figs 3 and 4). We also calculated the 95% confidence interval (CI) of the AUC47, as well as the sensitivity, specificity and accuracy of all methods at the optimal threshold value.
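This rank interpretation gives a direct way to compute the AUC (a minimal sketch; ties between scores count half):

import numpy as np

def auc_rank(scores_pos, scores_neg):
    # Fraction of (positive, negative) pairs where the positive scores higher,
    # counting ties as one half: the Mann-Whitney formulation of the AUC.
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())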

Table 3 Psoriasis prediction performance of all methods based on the best number of SNPs.

Figure 3 ROC curves of the six best benchmarked methods on the independent ADO group of the psoriasis GWAS dataset, using the selected best number of SNPs.