Introduction

My primary aim was to predict the sales of an item given its Best Seller Rank on Amazon. Predicting sales also helps with other use cases, such as suggesting to sellers the best products to sell. My final aim is to provide data insights about any product: how much it will sell, as well as when, where and how.



What is Amazon Best Seller Rank?

Best Seller Rank is a ranking system provided by Amazon that is linked to the number of sales of a product. The rank is recalculated frequently. An important point to note is that Best Seller Rank is a relative ranking: on its own, a rank does not tell you how many units were sold, only how a product compares with the others in its category.

A rank of #1, therefore, means that the product has sold more than any other product in that category, on that marketplace.

This makes it relatively easy to predict the number of sales of a product if we know the sales of other products ranked close to it.
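To make the idea concrete, here is a minimal sketch of estimating sales from neighbouring ranks by linear interpolation. It is only an illustration, not the method this post ends up using; the names `RankInterpolation` and `estimateSales` and the numbers in the usage note are hypothetical.

```scala
object RankInterpolation {
  /** A known observation for one category: (best seller rank, units sold). */
  type Observation = (Int, Double)

  /**
    * Estimate sales for `rank` by linear interpolation between the nearest
    * known ranks on either side. Returns None when `rank` lies outside the
    * range of known ranks, because extrapolating would only be a guess.
    */
  def estimateSales(known: Seq[Observation], rank: Int): Option[Double] = {
    val sorted = known.sortBy(_._1)
    // An exact match needs no interpolation.
    sorted.find(_._1 == rank).map(_._2).orElse {
      for {
        lower <- sorted.filter(_._1 < rank).lastOption // nearest better-ranked product
        upper <- sorted.find(_._1 > rank)              // nearest worse-ranked product
      } yield {
        val (r0, s0) = lower
        val (r1, s1) = upper
        val t = (rank - r0).toDouble / (r1 - r0)
        s0 + t * (s1 - s0)
      }
    }
  }
}
```

For example, with made-up observations `Seq((100, 300.0), (200, 200.0), (400, 100.0))`, a product ranked 150 sits halfway between the first two, so the estimate is halfway between their sales.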





How did we get the initial sales data? I have been selling professionally on Amazon and have been tracking my own sales against ranks for all my products in various categories. Additionally, I interviewed other professional sellers to get an approximate idea of their sales.

With all the data obtained, cleaned and set up, I entered the next phase of design: choosing the best framework for predictive analysis.



Enter Spark



At the 2016 Spark Summit, Nick Heudecker asked the question "Is Apache Spark the future of data analysis?"

While there might be some truth to the above chart, I tend to believe that Spark has not reached peak hype yet. Or maybe it only seems that way from down under in Australia, and Spark has already passed its peak in Silicon Valley.





Spark has played amazingly well with our Spring Boot application and our standalone machine learning application (a command-line interface).





Our database is PostgreSQL. The Spark CLI programs run weekly: they read the PostgreSQL database over JDBC, process the data, build the learning models and save the trained model to a local path.
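As a rough sketch of the read step, the options for Spark's JDBC data source can be built like this. The host, database, table and user names in the usage comment are placeholders I made up, not our real setup, and the Spark call itself is shown only as a comment because it needs a running Spark session and database.

```scala
object JdbcSource {
  /**
    * Build the option map for Spark's JDBC data source against PostgreSQL.
    * Every value passed in the usage comment below is a placeholder.
    */
  def postgresOptions(host: String, port: Int, database: String,
                      table: String, user: String, password: String): Map[String, String] =
    Map(
      "url"      -> s"jdbc:postgresql://$host:$port/$database",
      "dbtable"  -> table,
      "user"     -> user,
      "password" -> password,
      "driver"   -> "org.postgresql.Driver"
    )
}

// With a SparkSession named `spark` in scope and the PostgreSQL JDBC driver
// on the classpath, the weekly job could load a table roughly like this:
//
//   val opts = JdbcSource.postgresOptions("localhost", 5432, "salesdb",
//                                         "bsr_sales", "spark", sys.env("DB_PASSWORD"))
//   val df = spark.read.format("jdbc").options(opts).load()
```

Keeping the options in a plain map makes it easy to check the connection settings without starting Spark.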





This trained model is then read by Spark inside the Spring Boot application to quickly make predictions or process any incoming information from web users in real time. With all the infrastructure set up, we had estimated one week to complete the linear regression work, or two weeks in the worst case if the problem turned out to need the more complex log-linear regression.

Houston, We have a problem

I expected a straightforward linear regression model of the type y = mx + b. This would have made the problem very simple, as Apache Spark has GeneralizedLinearRegression:





import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(100)
  .setRegParam(0.4)





I first plotted the data in an Excel sheet, as I had already exported it from SQL to CSV.

The relationship between Amazon Best Seller Rank and number of sales turned out to be like this chart.

So it looked like a log-linear model, and I assumed the poisson family of GeneralizedLinearRegression would be a good fit. We changed the GLM family to poisson and ran the tests a few more times; however, the mean squared error (MSE) and the root mean squared error (RMSE) were still far too large.
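For intuition about what fitting a curve like this involves, here is a small sketch in plain Scala, independent of Spark. It assumes, purely for illustration, a power-law shape (sales ≈ c · rank^k), which becomes a straight line once both rank and sales are log-transformed; that shape is my assumption for the example and is not necessarily the exact curve in our data. `LogLogFit` and its members are hypothetical names.

```scala
object LogLogFit {
  /** Fitted line: log(sales) = intercept + slope * log(rank). */
  final case class Fit(intercept: Double, slope: Double) {
    def predict(rank: Double): Double =
      math.exp(intercept + slope * math.log(rank))
  }

  /** Ordinary least squares on log-transformed (rank, sales) pairs; both must be > 0. */
  def fit(points: Seq[(Double, Double)]): Fit = {
    require(points.size >= 2, "need at least two observations")
    val xs = points.map { case (rank, _) => math.log(rank) }
    val ys = points.map { case (_, sales) => math.log(sales) }
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0, "ranks must not all be identical")
    val slope = sxy / sxx
    Fit(meanY - slope * meanX, slope)
  }
}
```

On synthetic points generated from an exact power law, `LogLogFit.fit` recovers the exponent as the slope. Real sales data is noisy, so a fit like this is only a baseline to compare other models against.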





I suspect Spark has issues in dealing with sparse data.

We have limited data on our own sales, and we don't sell products in all categories: Amazon has gated some categories, which left us unable to sell in them or make any observations of sales there.

With such a limited amount of input data, the MSE from Spark was too high, and even for our own products, whose sales we knew, the predicted sales were way off the mark.
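For reference, the two error measures mentioned above are simple to compute by hand. This is a minimal sketch with hypothetical names (`Errors`, `mse`, `rmse`), not the evaluator Spark uses internally.

```scala
object Errors {
  /** Mean squared error between actual and predicted values of equal length. */
  def mse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size,
      "inputs must be non-empty and of equal length")
    actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }.sum / actual.size
  }

  /** Root mean squared error: same units as the target (here, units sold). */
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double =
    math.sqrt(mse(actual, predicted))
}
```

Because RMSE is expressed in units sold, it can be compared directly with the known sales of a product, which is how a model can be judged "way off the mark".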



Over the next few weeks I tried every combination of regression family on Spark, and none of them gave the desired results. I had absolutely no idea how to proceed, which reminded me of this quote by Dan Ariely:





“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

My last resort was to use deep neural networks. I had completed Andrew Ng's course on Machine Learning when he first launched it, and I cannot recommend it enough. Now my aim was to use a neural network to solve this problem.

We must go Deeper.

I eventually decided to keep Spark for big data analysis but to use other libraries for deep learning.

I narrowed down my choices to two libraries





I was particularly impressed with this article, which discusses how H2O deep learning was used to predict crimes and arrests in San Francisco and Chicago. Our future use cases are similar, since we will be predicting fraudulent users and fraudulent competitors, so I decided to plunge into deep learning with H2O rather than attempting to work around the Spark ML issues.





Sparkling Water (H2O) runs within the Spark framework, so I could use the integrated framework without replacing Spark. This was a definite bonus for me. The documentation was also nicely done, and I was thoroughly impressed with H2O's web UI, Flow.

I could test my data and algorithms in the browser without writing any code.





The Web-UI provided excellent insights into the data and true to my beliefs, the Deep Learning Neural Networks provided exceptional results.





Using the optimized parameters from H2O Flow, I quickly coded the Deep Learning network in my CLI Program.

val train = result('categoryIndex, 'bsr, 'sales)

// Configure Deep Learning algorithm
val dlParams = new DeepLearningParameters()
dlParams._train = train
dlParams._response_column = 'sales
dlParams._fast_mode = false
dlParams._epochs = 30
dlParams._nfolds = 3
dlParams._distribution = DistributionFamily.gaussian

val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get

// Save the model
ModelSerializationSupport.exportH2OModel(dlModel, new File("/data/deeplearning.bin").toURI)

In the Web API (Spring Boot) application, I read this model in and used it to make predictions in real time for web users.

def startup(): Unit = {
  dlModel = ModelSerializationSupport.loadH2OModel(new File("/data/deeplearning.bin").toURI)
  println("Initialization Of BSR Deep learning Module complete")
}

def predict(categoryIndex: Int, bsr: Int): Double = {
  if (null == dlModel) {
    startup()
  }
  println("\n====> Making prediction with help of DeepLearning model\n")
  val caseClassDS = Seq(InputBSR(categoryIndex, bsr, 0)).toDS()
  val finalresult = dlModel.score(caseClassDS)('predict)
  val finaldf = asDataFrame(finalresult)(sqlContext)
  val predictedSales = finaldf.first().getDouble(0)
  println(s"For category index ${categoryIndex} and BSR ${bsr} the result is ... ${predictedSales}")
  predictedSales
}

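One caveat with the null check in `predict`: if two web requests arrive before the model has been loaded, both may try to load it. A possible alternative, sketched here under assumptions, is Scala's `lazy val`, which is initialized once on first access and is guarded against concurrent initialization. `FakeModel` and `loadModel` are hypothetical stand-ins for the H2O model and the `ModelSerializationSupport.loadH2OModel` call, so that the example runs without H2O.

```scala
object ModelHolder {
  /** Hypothetical stand-in for the H2O model; the real one would score a frame. */
  final case class FakeModel(path: String) {
    def score(categoryIndex: Int, bsr: Int): Double = 0.0 // placeholder prediction
  }

  // Counts how often the (expensive) load runs, purely to demonstrate the pattern.
  val loadCount = new java.util.concurrent.atomic.AtomicInteger(0)

  // Stand-in for reading the serialized model from disk.
  private def loadModel(path: String): FakeModel = {
    loadCount.incrementAndGet()
    FakeModel(path)
  }

  // Initialized on first access only, even when accessed from several threads.
  lazy val model: FakeModel = loadModel("/data/deeplearning.bin")

  def predict(categoryIndex: Int, bsr: Int): Double =
    model.score(categoryIndex, bsr)
}
```

Calling `ModelHolder.predict` any number of times triggers the load only once, and no explicit `startup()` call is needed.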
The end result is this:

As you can see from the screenshot, we can only predict sales of products that have a Best Seller Rank in a top-level category, since we have trained the neural network only with sales data from top-selling categories.





As we keep collecting data and our algorithm gains sufficient confidence to predict sales in lower-level categories, the app will start making predictions for more products.





This, IMHO, is the best thing about Big Data and Deep Learning. The machine never stops learning and eventually as more data is fed into it, the algorithms automatically start making better predictions.



