Applying Principal Component Analysis

Integrate PCA into your production application

If this is your first time here, you may want to go through my previous deep dives into principal component analysis first. Take a look at my tutorial I and tutorial II.

To recap, Principal Component Analysis is a way to reduce the number of dimensions in our data set. This can make our computations faster and, when many features are correlated, may help us make better predictions as well.

Now that you have a fair idea of how PCA works, you may want to see how to implement it in your production models. Let’s see how we can do that.

Data Preprocessing

Let us first import all the dependencies that we will be using.

We will use the same data set that we have used in the previous tutorial.

Below are the shapes:

(51, 8) (55, 8) (53, 8)

Note, there are 51 observations of hitech, 55 observations of bhagyanagar and 53 observations of hudco. This is important as we will soon use this.

We will now need to combine the three data sets so that we run our calculations on all three of them. We will take only a few of the columns that we think are relevant for our analysis.
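The combining step can be sketched as follows. This is a minimal sketch, not the original code: the three arrays are random stand-ins that only reproduce the shapes reported above, and the choice of which seven of the eight columns to keep (here, everything except column 0) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the shapes reported above; the real data
# would come from the three company data sets.
hitech = rng.random((51, 8))
bhagyanagar = rng.random((55, 8))
hudco = rng.random((53, 8))

# Hypothetical column choice: keep columns 1..7, drop column 0.
selected = slice(1, 8)

# Stack the three data sets on top of each other, row-wise.
X = np.vstack([
    hitech[:, selected],
    bhagyanagar[:, selected],
    hudco[:, selected],
])

print(X.shape)  # (159, 7)
```

The row order matters here: hitech comes first, then bhagyanagar, then hudco, and the labels defined next have to follow the same order.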

[ 209.1 209.95 204.1 205.8 206.7 2935. 6.09]

We will now define the output for our data set. Let’s say that this is a classification problem where we are trying to say which of the three companies an incoming test sample belongs to. To denote the three companies we use arbitrary numeric labels (1000, 2000, 3000). The labels are spaced 1000 apart so that there is no ambiguity or overlap between the three.

Now, while defining the output, or what we will call our y, we will use the shapes that we found earlier. For example, since there are 51 observations of hitech, there will be 51 entries of 1000. The following code should capture the idea.
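One way to build such a label vector is shown below. It is a sketch that assumes the rows of X are stacked in the order hitech, bhagyanagar, hudco; the variable names are illustrative.

```python
import numpy as np

# Observation counts taken from the shapes printed earlier.
n_hitech, n_bhagyanagar, n_hudco = 51, 55, 53

# One label per observation, in the same order as the rows of X.
y = np.concatenate([
    np.full(n_hitech, 1000),
    np.full(n_bhagyanagar, 2000),
    np.full(n_hudco, 3000),
])

print(y[:5])    # [1000 1000 1000 1000 1000]
print(y.shape)  # (159,)
```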

some samples of y

[1000 1000 1000 1000 1000]

first x shape and then y shape

shape of X: (159, 7)

shape of y: (159,)

[[ 209.1 209.95 204.1 205.8 206.7 2935. 6.09]

[ 212. 214.95 208.3 208.3 209.15 5094. 10.78]]

[1000 1000]

Now, we will combine X and y. This will make it easier to break them into test and train components. We will create a function get_train_test that makes it easier for us to get the train and test samples from X and y.
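A possible shape for such a helper is sketched below. The signature, the 80/20 split and the fixed seed are assumptions; an 80% split of 159 rows does give 127 training rows, which matches the training shape printed later in this post. Because X and y are placed in one float array, the labels come back as floats (1000., 2000., 3000.).

```python
import numpy as np

def get_train_test(X, y, train_fraction=0.8, seed=0):
    """Shuffle the rows of X and y together, then split into train and test."""
    data = np.column_stack([X, y])   # y becomes the last column
    rng = np.random.default_rng(seed)
    rng.shuffle(data)                # shuffles whole rows in place
    n_train = int(len(data) * train_fraction)
    train, test = data[:n_train], data[n_train:]
    # Split the last column (labels) back off from the features.
    return train[:, :-1], train[:, -1], test[:, :-1], test[:, -1]

# Stand-in data with the same shapes as in this post.
X = np.random.default_rng(1).random((159, 7))
y = np.concatenate([np.full(51, 1000), np.full(55, 2000), np.full(53, 3000)])

X_train, y_train, X_test, y_test = get_train_test(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Shuffling the combined array keeps every feature row attached to its label, which is the main reason for joining X and y before splitting.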

[ 6.83000000e+01 6.83000000e+01 6.78000000e+01 6.79000000e+01 6.80000000e+01 1.43373000e+06 9.75450000e+02]

[ 3000. 3000. 1000. 1000. 2000.]

[ 225.4 231. 214. 220.7 220.65 126911. 283.93]

[ 1000. 1000. 1000. 1000. 3000.]

Old Model

Now that the preprocessing of the data is done, let’s come to the main story line. Let’s say that in production you have a model based on RandomForestClassifier running.
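A baseline of this kind might look like the sketch below. The data is synthetic (three well-separated groups with seven features each), and the hyperparameters are assumptions, not the settings used for the outputs in this post. The first printed array corresponds to the classifier's feature importances, one value per input column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
centers = (0.0, 5.0, 10.0)

# Synthetic stand-in for the real train and test sets.
X_train = np.vstack([rng.normal(c, 1.0, size=(40, 7)) for c in centers])
y_train = np.repeat([1000.0, 2000.0, 3000.0], 40)
X_test = np.vstack([rng.normal(c, 1.0, size=(5, 7)) for c in centers])

# Hypothetical settings; tune these for real data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

importances = clf.feature_importances_   # one weight per feature
predictions = clf.predict(X_test)

print(importances)
print("normal X_test predictions are below:")
print(predictions[:5])
```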

[ 0.05469505 0. 0.14321643 0.19745815 0.23498935 0.15445189 0.21518912]

normal X_test predictions are below:

[ 1000. 1000. 1000. 1000. 3000.]

These predictions are fairly accurate. Although the matrices in this example are pretty small, let’s say that in our ‘real’ world scenario the matrices are huge and take up a lot of computational power. We also know that a lot of the features in our data set are correlated with each other.

New Model

So let’s perform Principal Component Analysis and reduce our data set to two dimensions. But to do that we first need to scale the data set: PCA is driven by variance, so a column with very large values would otherwise dominate the components.
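The scaling step can be sketched with plain NumPy as below. It standardizes each column to zero mean and unit variance, which is the same transformation scikit-learn's StandardScaler applies; whether the original code used that class is an assumption. The data is a random stand-in whose columns deliberately sit on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: some columns in the hundreds, one in the millions.
column_scales = np.array([200, 200, 200, 200, 200, 1e6, 1e3])
X_train = rng.random((127, 7)) * column_scales
X_test = rng.random((32, 7)) * column_scales

# Learn the statistics on the training set only...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ...and apply the same shift and scale to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.shape)
print("mean is:")
print(X_train_scaled.mean())   # very close to 0
```

Reusing the training mean and standard deviation on the test set means that new production data can be transformed in exactly the same way.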

(127, 7)

mean is:

1.55855830679e-16

Note that the resulting mean is almost 0. This is just a sanity check that the scaling worked.

We will now run PCA on the scaled data and train the classifier on the two resulting components.
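The reduction and retraining step might be sketched as follows, again on synthetic data. One design choice here is an assumption rather than something taken from the original code: the PCA is fitted on the training set only, and the test set is just transformed with the components learned there.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
centers = (0.0, 5.0, 10.0)

# Synthetic stand-in data, standardized with the training statistics.
X_train = np.vstack([rng.normal(c, 1.0, size=(40, 7)) for c in centers])
y_train = np.repeat([1000.0, 2000.0, 3000.0], 40)
X_test = np.vstack([rng.normal(c, 1.0, size=(5, 7)) for c in centers])
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# Reduce seven features to two principal components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)  # learn components here
X_test_pca = pca.transform(X_test_scaled)        # reuse them here

# Train the same kind of classifier on the reduced features.
clf_pca = RandomForestClassifier(n_estimators=100, random_state=0)
clf_pca.fit(X_train_pca, y_train)
pca_predictions = clf_pca.predict(X_test_pca)

print(X_test_pca[:2])
print(pca.explained_variance_)
print("PCA predictions are below:")
print(pca_predictions[:5])
```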

trying out the prediction capabilities

[[ 2.57645802 0.05547991]

[ 2.48947141 -0.02405964]]

explained variance in test

[ 5.15421156 1.83697385]

PCA predictions are below:

[ 1000. 1000. 1000. 1000. 3000.]

Results

Let’s check how it measures up to the previous model. We can see that by getting the scores in this case.

##############################################

normal X_test predictions are below:

[ 1000. 1000. 1000. 1000. 3000.]

......................................

PCA predictions are below:

[ 1000. 1000. 1000. 1000. 3000.]

here are the real values

[ 1000. 1000. 1000. 1000. 3000.]

################################################

what is the prediction score...

normal predictive score: 1.0

pca predictive score: 0.96875

Interestingly, the score on the normal predictions is 1.0, which is probably because our data set is small. The PCA predictive score is 0.96875, or about 97%, which is pretty decent, although we can probably make it better. Comparing the first five predictions by hand, both models agree with the real values.
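For a classifier, a score like this is the fraction of test samples predicted correctly, which is also what the score method of scikit-learn classifiers returns. The sketch below computes it by hand on hypothetical arrays: 32 labels with a single deliberate mismatch. It is an illustration of the arithmetic, not the actual predictions from this post.

```python
import numpy as np

# Hypothetical ground truth: 32 test labels.
y_test = np.repeat([1000.0, 2000.0, 3000.0], [11, 11, 10])

# Hypothetical predictions: identical except for one wrong label.
pca_predictions = y_test.copy()
pca_predictions[0] = 2000.0

# Accuracy = number of correct predictions / number of samples.
score = np.mean(pca_predictions == y_test)
print(score)  # 0.96875
```

With 32 test samples, a single wrong prediction gives 31/32 = 0.96875, so on a data set this small the gap between the two models comes down to very few samples.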

In case you found this interesting and would like to talk more on this, just drop me a message @alt227Joydeep. I would be glad to discuss this further. You can also hit the like button and follow me here, on Medium.