When Apache Spark was introduced at the Hadoop World Conference in California in July 2014, my boss Marcin Mejran, then Lead Data Scientist at HookLogic, came back to announce that we would be switching our entire Machine Learning infrastructure from Hadoop over to Spark.

So we got to work. How could we not? Our boss was actually inviting us to use bleeding edge tools; it was one of those learning opportunities exciting enough to make you wonder why you are even being paid to do your job.

Now I am not going to delve into the virtues of Spark, which have been listed endlessly all over the web along with its apparently single flaw: "needs to mature". No, I am going to spend time talking about a specific, and seemingly minor, limitation of Spark as a tool for experienced Machine Learning Engineers who want to build large-scale, end-to-end learning systems. Or perhaps this is just a long-overdue rant after many sessions of intense debugging, disguised as a technical review. I hope you will indulge me.

What prompted this post (which I had originally intended as a guide for setting up spark-notebooks) was the design and implementation of a complete anomaly-detection system from scratch, in Scala using Apache Spark.

The overarching approach I decided on was to calculate different density metrics as well as distance measures over the relevant parameters, and to encode the outputs of those models as features of a one-class SVM. Not terribly daunting.
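To make that pipeline concrete, here is a minimal sketch of the feature-encoding step. Everything in it is illustrative, not the actual system: the names (`FeatureEncoding`, `densityScore`, `encode`), the choice of Euclidean distance to a centroid, and the toy radius-based density score are all assumptions standing in for whichever distance and density models one actually uses; plain arrays stand in for MLlib vectors so the snippet runs without Spark.

```scala
// Illustrative sketch only: encode the outputs of a distance model and a
// density model as a small feature vector that a one-class SVM could consume.
object FeatureEncoding {
  // Euclidean distance between a point and a reference centroid.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // A toy density score: the fraction of points within `radius` of `p`.
  def densityScore(p: Array[Double], data: Seq[Array[Double]], radius: Double): Double =
    data.count(q => euclidean(p, q) <= radius).toDouble / data.size

  // Each point's feature vector: (distance to centroid, local density).
  def encode(p: Array[Double], data: Seq[Array[Double]],
             centroid: Array[Double], radius: Double): Array[Double] =
    Array(euclidean(p, centroid), densityScore(p, data, radius))
}
```

The point of the encoding is that the one-class SVM never sees raw parameters, only the per-model anomaly signals.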

I am a big believer in taking advantage of libraries (which is not the same thing as relying on them), even if you end up having to make lots of changes to turn them into something you want. Chances are pretty high that other people besides you have used the library and contributed to making it better than what you would have carved from scratch. Also, I like to spend my time working on interesting problems that remain unsolved. That being said, I have experience working with the MLlib libraries and expected to have to make my own changes to mold their SVM into something that best served my purposes. I also knew that MLlib was lacking in its number of available algorithms (Tanimoto distance is a relevant example).
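For reference, the Tanimoto distance mentioned above is short enough to roll by hand. The sketch below is my own illustration, not MLlib code; it uses plain arrays rather than Spark's vector types so it runs standalone.

```scala
// Tanimoto distance over real-valued vectors (a generalization of
// Jaccard distance): 1 - (a . b) / (|a|^2 + |b|^2 - a . b)
object Tanimoto {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def distance(a: Array[Double], b: Array[Double]): Double = {
    val ab = dot(a, b)
    1.0 - ab / (dot(a, a) + dot(b, b) - ab)
  }
}
```

Identical vectors come out at distance 0 and orthogonal ones at distance 1, which is the sanity check you want before wiring it into anything larger.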

I also have very extensive experience with Apache Mahout, and while I have had some really fun times with it, I feel that it has become rather too heavyweight. Even when we switched our Machine Learning operations to run on Spark, we kept using Apache Mahout, because it was deeply embedded in our system and Apache Spark's MLlib lacked functional counterparts.

But this is a new job, a new system, a clean slate. So I got down to business coding some distances from scratch. Simple enough. Or so I thought. After all, the first order of business was to do some basic linear algebra between two Vectors... you know, some dot products, some norms, etc. To my horror, I could not find any proper (i.e. anything that isn't super inefficient and hacky) support for vector operations. What?
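In practice, "hacky" meant pulling the raw values out of the vector type and writing the arithmetic yourself. A sketch of that workaround is below; it is my own illustration, with plain arrays standing in for MLlib's `Vector` (whose `toArray` is typically how you would escape it) so the snippet runs without Spark on the classpath.

```scala
// Hand-rolled replacements for the vector operations MLlib did not expose.
object VectorOps {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var acc = 0.0
    var i = 0
    while (i < a.length) { acc += a(i) * b(i); i += 1 }
    acc
  }

  // Euclidean (L2) norm via the dot product.
  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))
}
```

Trivial, yes, but the trouble is that every distance measure downstream ends up carrying copies of exactly this boilerplate.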

So I googled. As it turns out, I am not the first to be shocked by this stunner. Some golden responses: