Zach Gazak is an alumnus from the September 2014 session of Insight Data Science. He holds a Ph.D. in Astronomy from the University of Hawaii and joined Insight as Program Director in December of 2014. While he was a Fellow at Insight, Zach worked with startup Embedly to develop a video recommendation system.

During my Insight Data Science Fellowship I took the opportunity to tackle a data science problem for Embedly, a company that helps websites easily embed any type of content, like videos, on any site. Using Embedly, a publisher can embed any online content into websites and apps, either through Embedly’s API or by creating a clean, responsive, and shareable “card” that makes the content easily viewable on any platform, with streamlined sharing and reposting. For example, USA Today uses Embedly to embed videos into its online articles, so when you watch a video on USA Today’s website, you are likely interacting with Embedly’s API. With a growing number of high-profile publishers signing up, the volume of video plays and other content Embedly has to serve continues to grow quickly: they currently serve about half a billion embeds every month.

With so many video views being handled each month, publishers who use Embedly would love not only to serve content through the platform but also to use it as a tool to increase engagement with their sites. I was tasked with developing a video recommendation engine that would suggest additional videos available on the publisher’s site once a user finished watching a video.

Exploring the problem

The algorithms that drive text-based recommendation engines benefit from the fact that words, sentences, and entire documents can be vectorized and stored as numerical data. At that point, the same mathematical procedures used on numerical datasets can be applied to cluster text and uncover thematic and topic information.

However, video recommendations are a unique beast.

Videos are stored as collections of pixel data that, on their own, cannot easily be mapped to a broader thematic context or set of topics. And while the tags, titles, and descriptions attached to video content carry some information, it is too sparse to provide adequate context for recommending additional videos. To compound the problem, for a site with even a moderate amount of video content, a collaborative filtering approach turns out to be difficult. The reason is that after users watch a video, say on the front page of the New York Times, there isn’t much overlap between users in terms of which video they watch next. In other words, the item-user matrix is quite sparse, which makes this class of techniques relatively ineffective for this use case.
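The sparsity problem above can be made concrete with a toy item-user matrix. The event list, user names, and sizes here are made up purely for illustration; real traffic would have vastly more videos and far lower density:

```python
# Toy illustration of the sparsity problem: with many videos and little
# overlap in which videos different users watch, the item-user matrix is
# mostly zeros, starving collaborative filtering of usable signal.
views = [("u1", "vidA"), ("u1", "vidB"),
         ("u2", "vidA"), ("u2", "vidC"),
         ("u3", "vidD"), ("u4", "vidE")]

users = sorted({u for u, _ in views})
videos = sorted({v for _, v in views})
watched = set(views)

# Dense representation of the item-user matrix (1 = user watched video).
matrix = [[int((u, v) in watched) for u in users] for v in videos]

filled = sum(sum(row) for row in matrix)
density = filled / (len(users) * len(videos))
print(f"{filled} of {len(users) * len(videos)} cells filled "
      f"(density {density:.0%})")
```

Even in this tiny example most cells are empty, and the problem only worsens as the catalog grows.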

Still, a path to a recommendation algorithm could exist: all I needed was a programmatically exploitable context, a language, that could be applied to the video content embedded through Embedly. Luckily, Embedly has leveraged the power of their product to track anonymized user interactions with embedded video content. With this data the team produces high-impact per-video analytics so that clients can understand what is successful with their embedded content, and what is not.

Developing and modeling the language of video content

By repurposing the Embedly dataset, I found a language all its own. In this language, the words are users’ anonymized identification strings, the descriptive documents are the groups of “words” attached to each video, and the frequency of a “word” in a document is dictated by that user’s interest in the document’s parent video. The approach can even encode sentiment: a user who watches very little of a video contributes negative context, while positive sentiment scales with how much of the video is watched.
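One way to sketch this document-building step is below. The event tuples, names, and the choice of an integer scale of 10 are illustrative assumptions, not Embedly's actual schema; the idea is simply that a viewer's ID appears more often in a video's document the more of that video they watched:

```python
from collections import defaultdict

# Hypothetical watch events: (user_id, video_id, fraction_watched).
# These values and the 10x repetition scale are illustrative only.
events = [
    ("u1", "vidA", 0.9),
    ("u2", "vidA", 0.8),
    ("u1", "vidB", 0.1),
    ("u3", "vidB", 1.0),
    ("u2", "vidC", 0.5),
]

def build_corpus(events, scale=10):
    """One 'document' per video; each viewer's anonymized ID becomes a
    'word', repeated in proportion to how much of the video they watched."""
    docs = defaultdict(list)
    for user, video, fraction in events:
        # Map the watch fraction onto an integer word count (0..scale);
        # users who watched almost nothing contribute almost no words.
        count = round(fraction * scale)
        docs[video].extend([user] * count)
    return dict(docs)

corpus = build_corpus(events)
print(corpus["vidC"])  # ['u2', 'u2', 'u2', 'u2', 'u2']
```

A video watched to completion thus "says" its viewer's name ten times, while a quickly abandoned one barely mentions them at all.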

The resulting corpus of documents encodes the topics of videos as the interest of the users who watch them. While those topics are entirely unknown to the computer, they can be understood and leveraged by an algorithm called Latent Dirichlet Allocation (LDA). LDA, traditionally applied to text documents, is a powerful natural language processing (NLP) tool for understanding the similarities and differences between documents. It works by assigning every word across the corpus to a number of latent topics and distributing those topics across each document: the algorithm develops a probabilistic model describing which words are associated with each topic and which topics are most representative of each document. The key to unlocking a video recommender for Embedly was transforming event-based data into text documents consumable by this NLP technique.
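A minimal sketch of this fitting step, assuming scikit-learn is available (the post does not say which LDA implementation was used). The video documents and topic count here are toy assumptions; each "word" is a viewer's anonymized ID:

```python
# Fit LDA to the user-interest "documents": each video becomes a
# probability distribution over latent topics, and videos watched by
# similar users land on similar topics.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents: each video's viewers, repeated by interest.
video_docs = {
    "vidA": "u1 u1 u1 u2 u2 u3",
    "vidB": "u1 u1 u2 u2 u2 u3",
    "vidC": "u4 u4 u5 u5 u5 u5",
}
videos = list(video_docs)

# Count "word" (user ID) frequencies per document.
vectorizer = CountVectorizer(token_pattern=r"\S+")
counts = vectorizer.fit_transform(video_docs[v] for v in videos)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)  # shape: (n_videos, n_topics)

for v, mix in zip(videos, topic_mix):
    print(v, np.round(mix, 2))
```

With shared viewers, vidA and vidB should end up with similar topic mixtures, while vidC, watched by a disjoint audience, lands elsewhere; nearby rows in `topic_mix` are natural recommendation candidates.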

In the case of our descriptive language of user interest, LDA reveals probabilistic connections between different videos with no knowledge at all of the video content itself; it relies only on the implied preferences of the users consuming that content. Using the same Embedly dataset, I weight the model’s recommendations by the popularity, user interest (quantified as the fraction of a video’s length that was watched), and user focus (a metric designed to distinguish clicking through various parts of a video from pressing play and watching it in full) of the videos to be recommended. In this way, videos that are not watched often or not watched fully, i.e. not enjoyed by users, are less likely to be recommended. In addition, a level of random chance ensures that the recommended videos are not static between the hourly model updates.
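The weighting-plus-randomness step can be sketched as weighted sampling without replacement. The candidate scores and the multiplicative weight formula below are illustrative assumptions, not Embedly's production formula:

```python
import random

# Hypothetical per-candidate scores, each in [0, 1]:
# (LDA topic similarity, popularity, user interest, user focus).
candidates = {
    "vidB": (0.9, 0.7, 0.8, 0.9),
    "vidC": (0.6, 0.9, 0.4, 0.5),
    "vidD": (0.3, 0.2, 0.1, 0.2),
}

def recommend(candidates, k=2, rng=random):
    """Sample k videos without replacement, weighting LDA similarity by
    popularity, interest, and focus. The randomness keeps recommendations
    from being static between hourly model updates."""
    pool = dict(candidates)
    picks = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [sim * pop * interest * focus
                   for sim, pop, interest, focus in pool.values()]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picks.append(choice)
        del pool[choice]  # no duplicate recommendations
    return picks

print(recommend(candidates, k=2))
```

Poorly watched videos like `vidD` get a small but nonzero weight, so they are rarely, but occasionally, surfaced.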

Once the model was completed, I used it to create weighted recommendations for any video a user visits that is part of Embedly’s embedded content. The model is now used to populate Embedly’s API with recommended videos that are based on the behavior of Embedly’s users alone. Without any knowledge of the video content, the model is able to return fairly sensible results. For example, when a user clicks on the below video:

He or she may be recommended this:

Working on this project was a great experience. I didn’t know anything about natural language processing prior to my collaboration with Embedly, but quickly found that the best way to learn about it was to build something with it. It’s also exciting to have had the opportunity to develop a product that Embedly can continue to use well after the end of our formal 3-week collaboration. As Embedly explores this method of recommendation, they will be able to tune the model parameters to optimize performance and deliver an improved experience to the many users who access Embedly’s clients’ websites.