A few years ago I was consulting at a company that had a recommender as a service (RaaS?) and wanted to upgrade it. I analyzed what they were doing and designed a test against some Open Source recommenders. Research into these started a journey that led to the development of the Correlated Cross-Occurrence Algorithm in Apache Mahout and the Universal Recommender.

The journey started with lots of what are called "cross-validation" tests, where we have a "gold standard" dataset of actual real-world ecom purchases and build models for the various recommenders to see which is best. We came to the conclusion that Mahout's "Cooccurrence" recommender beat the in-house tech AND every other recommender we tested against. The algorithms included:

MF-ALS: Matrix Factorization using Alternating Least Squares. Here we had one version in Mahout, the other from Myrrix. This algorithm is used by many of the best known recommenders but is unimodal, meaning it models only conversions – purchases.

Graph Recommender: this was based on creating a model of linked purchases and other user actions. Graph recommenders have the benefit that they model data about more than just purchases.

Cooccurrence: The venerable Cooccurrence algorithm, as applied to recommenders, is classic in Collaborative Filtering. It basically says, if you are like other people (in terms of your purchases) then you might like other things they've bought.

Mahout 0.9.0 Cooccurrence

Pros:

Predicts user buys (conversions) significantly better (by ~12%)

Works on arbitrarily big data

Cons:

No realtime data used in recs. The Mahout version requires that every user and item in the data has recs pre-calculated in a background process and stored in a db. So recs are only as up to date as the last time the data was processed.

Business rules are cumbersome and not part of the algorithm

Is not multi-modal

Back to the RaaS company. Cooccurrence had already "won" the cross-validation tests, but we had data for 100x more pageviews than buys. A Data Scientist hates wasted data, so I ran lots of experiments to see if we could tease a signal out of all this pageview data. We tried the obvious: we used the data as if it were all buys, which performed worse than throwing the data out. We tried weighting the pageviews lower than buys. Since there were 100 pageviews per buy, we made buy = 1, pageview = 0.01. This also failed to yield a benefit.

This bothered me but it started to make sense. The user intent behind a pageview is often different than that behind a buy. You click a link due to a flashy image, out of curiosity, or while doing research before a purchase. When you hit the BUY button, you are done with all that. Still, if you show me users with similar pageviews, I should be able to predict buys.

Correlated Cross-Occurrence to the Rescue

It was time to go back to first principles and dig into the math!

Cooccurrence is a simple algorithm. It uses the similarity of conversion behavior to mine recommendations from what similar people bought. The math is simple if viewed with linear algebra:

\[r = (B^tB)h\]

\(r =\) recommendations as a vector

\(B =\) matrix of all conversions (buys for our ecom data) from all users. Think of this as a table with rows keyed to user-id and columns keyed to item-ids like product-ids. The table elements are 0 for no conversion, 1 for one or more conversions.

\(B^t =\) the transpose of \(B\).

\(h =\) the user's history of conversions as a column vector.

In English this says to calculate the similarity of items based on the users that converted on them. \(B^tB\) is sometimes called a "self-join" and can be read as a table of items with a score for each other item's similarity. Just performing the math gives one measure of similarity, but other scores may be substituted for each non-zero element to yield better similarity results (see cosine similarity and LLR).

Once we have \(B^tB\) we multiply the resulting matrix by \(h\), which gives us a vector with each element representing the score of how well the item's user behavior compares with the user's history. Sort the vector by these scores and take the top n to give recommendations – et voila.
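To make this concrete, here is a minimal sketch of the calculation (assuming NumPy and a made-up 3-user, 4-item dataset; the real implementations work on sparse, distributed matrices):

```python
import numpy as np

# Toy conversion matrix B: rows are users, columns are items.
# 1 means the user bought the item at least once.
B = np.array([
    [1, 1, 0, 0],   # user 0 bought items 0 and 1
    [1, 1, 1, 0],   # user 1 bought items 0, 1, 2
    [0, 0, 1, 1],   # user 2 bought items 2 and 3
])

# The "self-join": item-to-item cooccurrence counts.
item_similarity = B.T @ B

# History vector h for a user who has bought only item 0.
h = np.array([1, 0, 0, 0])

# r = (B^t B) h : score every item against the user's history.
r = item_similarity @ h

# A common extra step: drop items already in the history, then rank.
r = np.where(h > 0, 0, r)
recommended = np.argsort(r)[::-1]
print(list(recommended[:2]))  # top-2 recommendations: [1, 2]
```

Items 1 and 2 score highest because they were bought by the same users who bought item 0.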

Comparing conversions works well as our tests have shown, but we want to ask a new question, "how do we find items that are similar in terms of some secondary behavior of the user like pageviews?" Put another way, "how do we find similar items based on a user's pageviews where they correlate with the user's buys?"

Now that we have formed the question well, we can answer \(^{1.}\):

\[r = (B^tB)h_b + (B^tP)h_p\]

\(r =\) recommendations as a vector

\(B\) and \(B^t\) = as defined above.

\(P =\) matrix of some secondary user behavior (pageviews for our ecom data) from all users. Think of this as a table of user history of the secondary behavior.

\(h_b =\) the user's history of buys (conversions) as a column vector.

\(h_p =\) the user's history of pageviews (a secondary behavior) as a column vector.
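A minimal sketch of the two-behavior score (NumPy again, with made-up toy matrices and raw counts standing in for LLR scores) shows the key property: even a user with no buys at all gets scored from pageviews alone.

```python
import numpy as np

# Toy data: 3 users x 4 items. B holds buys, P holds pageviews.
B = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])
P = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
])

# A brand-new user: no buys yet, but pageviews on items 1 and 3.
h_b = np.array([0, 0, 0, 0])
h_p = np.array([0, 1, 0, 1])

# r = (B^t B) h_b + (B^t P) h_p
r = (B.T @ B) @ h_b + (B.T @ P) @ h_p
print(r.tolist())         # [2, 2, 4, 2]
print(int(r.argmax()))    # item 2 is the top recommendation
```

The pageview term \((B^tP)h_p\) carries all the signal here; with no purchase history the first term contributes nothing.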

Comparing two Behaviors

It is one thing to compare buys to buys, but how do we compare buys to pageviews? We turn to the Log-Likelihood Ratio (LLR), which compares event frequencies. Strictly speaking, LLR is a test that tells us whether two things are uncorrelated. If two things are "not uncorrelated", in this case we can assume they are correlated. This is due to the domain of the data, which comes from actions in a web application.

Now if we replace the non-zero elements of each matrix product with LLR scores, we can keep the top n for each row based on the magnitude of the score.
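For reference, here is a Python sketch of the LLR score on a 2x2 contingency table of event counts, following Dunning's formulation (this mirrors the shape of the routine used in Mahout; the variable names are my own):

```python
import math

def xlogx(x):
    # x * ln(x), with the convention 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.
    k11: users who did both events (e.g. viewed page A and bought item B)
    k12, k21: users who did one event but not the other
    k22: users who did neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(llr(5, 5, 5, 5))    # ~0: the two events look independent
print(llr(10, 0, 0, 10))  # large: the two events are strongly correlated
```

A score near zero means the frequencies are what independence would predict; a large score is the "not uncorrelated" signal we keep in the model.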

Outside of the RaaS company, I hacked this idea into Apache Mahout as an experiment. I had to use different data for cross-validation, but once I found some, I compared Cooccurrence with two-behavior Correlated Cross-Occurrence, and CCO won handily.

So What?

This result is important not only because we can now use more data, but because that data is not a conversion. Until CCO, all Open Source recommenders required a conversion, and probably more than one, in order to make recommendations to a user! For ecom this means a lot of people cannot get recs. With CCO, anyone with behavior that is modeled will get recs, and they can be quite good. So in our example, a few pageviews and the recommender starts to make good guesses about what you are looking for.

This was a nice result, but if you follow this train of thought there is no reason we can't add other secondary behaviors – as many indicators of behavior as the app can measure. The full form of the equation is (aka The Whole Enchilada):

\[r = (B^tB)h_b + (B^tP)h_p + (B^tS)h_s + (B^tC)h_c + (B^tX)h_x ...\]

\(B =\) history of user buys

\(P =\) history of user pageviews

\(S =\) history of user search terms

\(C =\) history of user category preferences

\(X =\) etc ...
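The Whole Enchilada generalizes naturally in code. A NumPy sketch (all matrices and history vectors are made-up toy data; the secondary behaviors need not even share the item space, as the search-term matrix shows):

```python
import numpy as np

def cco_scores(B, secondaries, h_b, histories):
    # r = (B^t B) h_b + sum of (B^t M) h_m over each secondary behavior M.
    r = (B.T @ B) @ h_b
    for M, h in zip(secondaries, histories):
        r = r + (B.T @ M) @ h
    return r

# Toy data: 3 users; 4 items. B = buys, P = pageviews over the same items,
# S = searches over 2 search terms (a different "column space" than items).
B = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1]])
P = np.array([[1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]])
S = np.array([[1, 0], [0, 1], [1, 1]])

h_b = np.array([0, 0, 0, 0])   # no buys yet
h_p = np.array([0, 1, 0, 1])   # viewed items 1 and 3
h_s = np.array([1, 0])         # searched for term 0

r = cco_scores(B, [P, S], h_b, [h_p, h_s])
print(r.tolist())  # [3, 4, 5, 3]: each element scores one item
```

Each added behavior contributes another \((B^tX)h_x\) term; the scores always land in the space of items to recommend because every product starts with \(B^t\).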

Solving the Realtime Recs Problem

The Apache Mahout hack I made proved a point but still required that we calculate all recs for all people in a batch operation. But if you look at the math through opportunistic glasses you see a pattern.

\[r = (B^tB)h_b \]

is really finding the items from \(B^tB\) that are most similar to \(h_b\), and that is exactly what a Search Engine does! In other words, if we index \(B^tB\) in a Search Engine we can use \(h_b\) as a query to return an ordered set of similar items – aka recommendations. Furthermore, most modern Search Engines, and particularly those based on Apache Lucene, support multi-field queries, so each bit of user history divided into \(h_b\), \(h_p\), \(h_s\), ... can be put into a single query and \(B^tB\), \(B^tP\), \(B^tS\), ... can be put into a single index. A query using all user history performs what is essentially The Whole Enchilada.
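To make the indexing idea concrete, here is a sketch of what one indexed item and one query might look like in an Elasticsearch-style engine (the field names, item ids, and query shape are invented for illustration; the real Universal Recommender's schema may differ):

```python
# One "document" per item. Each field holds the item ids that survived the
# LLR cut in one row of a matrix product: buy_correlators comes from a row
# of B^tB, view_correlators from the same row of B^tP, and so on.
item_doc = {
    "id": "product-123",
    "buy_correlators": ["product-7", "product-42"],
    "view_correlators": ["product-9", "product-42"],
}

# The user's realtime history becomes one multi-field query. The engine's
# relevance ranking effectively computes the Whole Enchilada sum for us.
query = {
    "query": {
        "bool": {
            "should": [
                {"terms": {"buy_correlators": ["product-42"]}},   # h_b
                {"terms": {"view_correlators": ["product-9"]}},   # h_p
            ]
        }
    }
}
print(len(query["query"]["bool"]["should"]))  # one clause per behavior: 2
```

Items whose correlator fields match more of the user's history score higher, which is the sorted-score ranking from the math, done by the engine at query time.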

Given that a search engine can index all of the "model" we build periodically (\(B^tB\), \(B^tP\), \(B^tS\), ...), that we can get any user's history in realtime, and that Search Engine queries are fast since the engines are scalable – we now have a way to get realtime recs. This works without calculating all recs for all people, and it can even use realtime history.

Conclusion

Correlated Cross-Occurrence as described above allows us, in the ecom use case, to begin making good recs after a user browses a few pages or searches for something. We no longer have to wait until they buy. We don't need to wait for model building; we can use behavior in realtime. Plus, Search Engines are good at filtering and boosting, so we have a path to creating rich business rules.

Epilog

A CCO implementation was contributed to Apache Mahout, and the Apache PredictionIO "Universal Recommender" engine was written around the combination of Search with CCO. This was later moved to the Harness ML Server. Since then we have deployed the Universal Recommender in many applications, in places like ecom, video, news, social media, and even other RaaS platforms.

References