(For a longer, less polished version of this, check out my site)

Pocket’s new recommended articles tab signals a move towards the service getting smarter about its data. Just how accurate is the recommendation system, though? In this post I want to share a bit about machine learning, media consumption, and data enrichment, and present some results that reveal a wide gap between Pocket’s recommendations and what I’d actually like to read.

Media consumption has long been a secondary research interest of mine — somewhere between networked communication, the filter bubble, and news curation apps lies an interesting place where platforms can modify our news consumption, and possibly influence my primary research interest, group opinion dynamics (particularly when it pertains to salient and divisive cultural and political topics).

Pocket has amassed a large user base, and has in turn collected a large dataset: a list of articles that have or have not been read by a given user account, plus sparse metadata on top of each article. They have also built a fairly rich REST API with OAuth authentication, which allows developers to create apps that capture user data, as in my toy demo over here.
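For a sense of what that looks like in practice, here is a minimal sketch of the kind of request the toy demo makes against Pocket’s v3 retrieve endpoint. The consumer key and access token are placeholders you would obtain through Pocket’s OAuth flow, and the response handling is deliberately simplified:

import requests

CONSUMER_KEY = "your-consumer-key"    # placeholder: issued when you register an app
ACCESS_TOKEN = "your-access-token"    # placeholder: obtained via Pocket's OAuth flow

# Ask Pocket for everything the user has saved, read or unread,
# with full article metadata attached.
response = requests.post(
    "https://getpocket.com/v3/get",
    data={
        "consumer_key": CONSUMER_KEY,
        "access_token": ACCESS_TOKEN,
        "state": "all",          # both archived (read) and unread items
        "detailType": "complete",
    },
)
articles = response.json().get("list", {})  # item_id -> article metadata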

The toy demo (pictured in a partial screen grab above) that I built off of the API allowed me to collect a modest amount of data from the platform, which I then enriched via various APIs to add more metadata to the articles people did and did not read — political sentiment inference, topics covered in the articles, sentiment of the article in general, how an article relates to the overall ecosystem of articles, and so forth. In total, I created 173 data points for every article I captured from the users who OAuthed into my toy demo.

Pocket is at an interesting point as a company — they are currently pushing large changes out to their user base in terms of a new recommendation engine. Pushing out a recommendation engine is, from my perspective at least, a big deal. First and foremost, it’s an indication to me that a company is getting serious about getting smart about their own data. It shows me that there’s some degree of sophistication and introspection about the data on hand. Second, it shows non-immediate investments — recommendation engines take time, are slightly risky, and are hard to get right. My take has always been that it shows that a company is either very, very desperate to have one more shot at another round, or is healthy enough to take a development hit like a recommendation engine. Third, I read it as a sign that the company cares about understanding just what their software actually does to the humans that employ it.

Machine learning is a field that appears, from where I stand now, to have respect for only one theoretical perspective — the theoretical perspective of information entropy. Social science data? Information entropy. Astronomical anomalies? Information entropy. The task of any machine learning algorithm is to use heuristics of information entropy to accurately assess a label variable (typically described as a vector Y) from a given dataset where columns are “features” and rows are observations (typically described as a matrix X). Machine learning algorithms climb a hierarchy from relatively straightforward interpretations (a Naive Bayes model), to mostly straightforward interpretations (a linear/logistic regression model), to “what the fuck is happening, I don’t even really get why this code works” interpretations (SVM was the one I got lost in during this first foray into ML).
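Whatever the model, the scikit-learn workflow has the same shape: hand it a feature matrix X and a label vector y, fit on one slice of the data, and score predictions on another. A toy sketch with made-up numbers, just to fix the notation:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# 200 observations, 5 features, binary labels -- stand-ins for real data.
X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = GaussianNB()                 # the "relatively straightforward" end of the hierarchy
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # fraction of held-out labels predicted correctly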

To make matters more complicated, some of the most cutting-edge work (according to the professor of this class, at least) comes from “ensemble methods”, or, essentially, just pretending a pile of machine learning algorithms constitutes a congress of learners, and letting their collective votes on individual observations count as the most likely outcome. It works, but again, this is far away from an easily interpretable model of why it worked. The trade-off for that loss of interpretability, of course, is that you can get very accurate models through this type of approach.
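The voting logic itself is almost embarrassingly simple. A minimal sketch, with toy data standing in for the real feature matrix and a tiny congress of three learners rather than the full pile used later:

import numpy as np
from scipy import stats
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the real articles-by-features matrix and read labels.
X = np.random.rand(300, 5)
y = np.random.randint(0, 2, size=300)

# Three very different learners form the "congress".
congress = [GaussianNB(), LogisticRegression(), KNeighborsClassifier(n_neighbors=5)]
predictions = np.array([m.fit(X, y).predict(X) for m in congress])

# Each observation's final label is whatever the majority of models voted for.
ensemble_prediction = stats.mode(predictions, axis=0)[0].ravel()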

So what if we throw an ensemble at Pocket? Can we get a model that accurately predicts which articles get read (the most obvious labels) from the 173 data points I’ve collected for these users and their articles?

These data points fall under several general categories, each of which is a feature category in my personalized infographic of Pocket usage:

Political Leanings is a feature category that leverages a third-party API to ingest the content of the article (the content itself is also provided by a third-party API) and infers the levels to which different political ideologies are embedded in the content.

Sentiment is a feature category that leverages a third-party API to ingest the content of the article (again provided by a third-party API) and infers the levels to which positive or negative sentiment is embedded in the content.

Punchcard compacts the times that articles are added into a single cyclical week and considers the frequency with which users add articles to Pocket for potential reading later. A screen grab of this type of data from my toy demo can be found above.

Timeline considers time as a constantly increasing quantity with various subcycles (years, months, days, hours, weekdays/non-weekdays) and considers the frequency with which users have Pocketed content over time.

Read/Unread considers the raw likelihoods of reading or not reading in the aggregate.

Word Counts considers differences between the word count of a given article and global / user-level patterns.

Read / Don’t Read considers topics that are read and not read at the global and user level with regard to the topics in the given article.

Top Terms considers something similar, but with slightly different measurements.

Sources considers the likelihood of reading an article given the news publication that created the content, at the global and user level.

Term Network considers an interrelated network of terms at the global and user level, which represents a mapping of how topics are connected through the articles they appear in. The relative strengths of the topics in the given article are then weighed against these networks. A sparse network graph of some of the top terms can be found below.
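To make the Term Network idea concrete, here is a small sketch of how such a co-occurrence network could be built, using hypothetical topic lists rather than the real enrichment output:

import itertools
import networkx as nx

# Hypothetical per-article topic lists, standing in for the enrichment API output.
article_topics = [["privacy", "facebook", "advertising"],
                  ["privacy", "encryption"],
                  ["advertising", "facebook", "elections"]]

G = nx.Graph()
for topics in article_topics:
    for a, b in itertools.combinations(topics, 2):
        # Edge weight counts how many articles a pair of topics co-occurs in.
        current = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=current + 1)

# An article's topics can then be weighed against the strength of these edges.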

Finally, some very basic metadata provided by Pocket directly is added to the observations. Altogether, the largest dataset includes 173 features, and by making some semi-realistic data restrictions, the smallest dataset includes only 67 features. Each observation has a label Y_i in {0, 1} (read or not read).

Here’s the beginning of the ensemble method — literally, we are throwing random models at this task, and random combinations of them, and just taking whichever combination has the highest accuracy when run through k-fold cross validation:

import os
import csv
import random
import itertools

import numpy as np
from scipy import stats

from sklearn import ensemble, linear_model, preprocessing
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid, NearestNeighbors
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC

# Candidate models for the ensemble search: a deliberately broad grab bag of
# linear models, SVMs with different kernels, nearest-neighbor methods, and
# tree ensembles, several of them repeated with different hyperparameters.
# (Note: newer versions of scikit-learn call Perceptron's n_iter parameter max_iter.)
models = [Perceptron(fit_intercept=False, n_iter=10, shuffle=False),
          linear_model.Ridge(alpha=0.5),
          SVC(kernel="linear", max_iter=1000),
          SVC(kernel="poly", degree=3, max_iter=1000),
          SVC(kernel="rbf", max_iter=1000),
          SVC(kernel="sigmoid", max_iter=1000),
          KNeighborsClassifier(n_neighbors=2),
          KNeighborsClassifier(n_neighbors=6),
          KNeighborsClassifier(n_neighbors=10),
          NearestCentroid(),
          RandomForestClassifier(n_estimators=2),
          RandomForestClassifier(n_estimators=10),
          RandomForestClassifier(n_estimators=18),
          RandomForestClassifier(criterion="entropy", n_estimators=2),
          RandomForestClassifier(criterion="entropy", n_estimators=10),
          RandomForestClassifier(criterion="entropy", n_estimators=18),
          AdaBoostClassifier(n_estimators=50),
          AdaBoostClassifier(n_estimators=100),
          AdaBoostClassifier(learning_rate=0.5, n_estimators=50),
          AdaBoostClassifier(learning_rate=0.5, n_estimators=100),
          LogisticRegression(random_state=1),
          RandomForestClassifier(random_state=1),
          GaussianNB(),
          ensemble.GradientBoostingClassifier(n_estimators=1000, max_leaf_nodes=4,
                                              max_depth=None, random_state=2,
                                              min_samples_split=5, learning_rate=1.0,
                                              subsample=1.0)]
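The search loop itself isn’t reproduced in full here, but its shape is roughly the following: repeatedly draw a random subset of the candidate models above, score that subset’s majority vote with k-fold cross validation, and keep whichever combination scores best. A compressed sketch of that idea, reusing the imports and models list from the block above and assuming X and y are NumPy arrays holding the feature matrix and read/unread labels built from the enriched dataset:

from sklearn.base import clone
from sklearn.model_selection import KFold

def kfold_vote_accuracy(subset, X, y, k=5):
    # Score a candidate ensemble: fit every model on the training folds,
    # majority-vote their predictions on the held-out fold, and average accuracy.
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        fold_preds = []
        for model in subset:
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            # np.rint keeps regression-style models (e.g. Ridge) voting in {0, 1}.
            fold_preds.append(np.rint(fitted.predict(X[test_idx])))
        vote = stats.mode(np.array(fold_preds), axis=0)[0].ravel()
        scores.append(np.mean(vote == y[test_idx]))
    return np.mean(scores)

best_score, best_subset = 0.0, None
for _ in range(100):                                   # each trial is a random combination
    candidate = random.sample(models, random.randint(2, 8))
    score = kfold_vote_accuracy(candidate, X, y)
    if score > best_score:
        best_score, best_subset = score, candidate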

It seems dangerous, and it kind of is. But remember, the only thing that matters in machine learning (so far as I currently understand it) is accurate prediction on testing data and ultimately real-world data — if the model works, no matter how convoluted or Byzantine it is, then it works. This is something I both love and fear about this stuff. But, for the purposes here, we can hold off on a theoretically substantive explanation for why people read the things they do and just go ahead and see if we can accurately predict the phenomenon from the data.

Finally, let’s make one more assumption — we’re trying to build something that accurately predicts reading an article, and in general, it looks like users typically Pocket more articles than they actually read (I guess we’re all a bit aspirational, eh?). Also, the recommendation engine is specifically looking to accurately predict a correct hit. So, let’s say that when we measure the fitness of any combination of algorithms, we care about false negatives twice as much as we care about false positives. In other words, we want to get things you’ll definitely read in front of you, and we can accept a few misses, since people skip a lot anyways.
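One way to encode that preference is a weighted accuracy score. The sketch below is my own rough version of such a “bent” fitness measure; only the 2:1 penalty ratio is carried over from the description above:

import numpy as np

def bent_accuracy(y_true, y_pred, fn_weight=2.0, fp_weight=1.0):
    # Penalize false negatives (articles that would have been read but weren't
    # surfaced) twice as heavily as false positives (surfaced but skipped).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    worst_case = fn_weight * np.sum(y_true == 1) + fp_weight * np.sum(y_true == 0)
    return 1.0 - (fn_weight * fn + fp_weight * fp) / worst_case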

Limiting the Data

As a final parameter of interest, let’s say that some data may be off limits. I don’t know which data points those are specifically, because I don’t work at Pocket, but let’s say that it’s easy to get aggregate, system-wide information once a week, and that it’s easy to get information for individual articles. Getting user-level data, however, is hard — it’s expensive because we have to look at a lot of each user’s articles, and we have to run many aggregations per user. So, let’s throw out all variables at the user level in one test.

In another test, let’s say that splitting the data into likelihoods conditioned on whether other articles have been read or not is a little unfair — it may not help at all, for example, when we have users who mark articles as read without ever reading them, and users who read but never mark anything as read (which of course brings with it the existential problem that the label in this dataset is itself inaccurate — maybe the front-end team has better data tracking which pages users actually visit?). So, let’s only look at values for articles with respect to the general distributions of articles, regardless of whether they have been read.

In one final test (originally thought up by Adam Jetmalani), we make a relatively obvious assumption — some feature categories, notably the Timeline and Punchcard categories, are based on the time that an article is Pocketed relative to normal user behaviors, the logic being that there may be hot times when people Pocket articles with a real intent to read them later. Of course, for recommended articles we have no information about when the article would be Pocketed by our given user (though we could potentially use aggregate information about other users Pocketing the same content). For this reason, we’ll remove those feature categories entirely in our fourth test.
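In practice each of these restrictions is just a column filter over the full feature set. A sketch of the four datasets, assuming a hypothetical CSV dump of the 173 features and a made-up naming convention where each column name is prefixed with its feature category:

import pandas as pd

df = pd.read_csv("pocket_features.csv")   # hypothetical dump of the 173-feature dataset

# Test 1: the full dataset, no restrictions.
full = df

# Test 2: drop everything computed at the individual-user level.
no_user = df.drop(columns=[c for c in df.columns if c.startswith("user_")])

# Test 3: drop features conditioned on whether other articles were read or unread.
no_read_split = df.drop(columns=[c for c in df.columns if "read_conditional" in c])

# Test 4: drop the time-Pocketed categories (Punchcard and Timeline).
no_time = df.drop(columns=[c for c in df.columns if c.startswith(("punchcard_", "timeline_"))])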

Results

The table shows some very dramatic results — the ensemble models, while they had access to dozens of candidate models, only needed about a half dozen to reach fairly high accuracy scores. Note, of course, that there could be different ensembles that predict the results better, but the optimization task was set to optimize the “Bent Train Accuracy”, which overweighted false negatives. Even then, with more time (this optimization ran for about 20 minutes), slightly more complex models with higher scores could potentially be found.

In review of this table, though, a few things are clear — regardless of data removal, the ensemble approach can still generate accuracies significantly better than a random guess. This is heartening, since it’s clear that there are certain avenues in the dataset that will generally lead towards predictability — adding more variables and tweaking those parameters in an automated fashion may reveal higher accuracies.

Most concerning, however, are the surprisingly low scores for the articles that are suggested by Pocket. One trick that some machine learning methods employ is to change the decision criterion — in this case, if the probability of an observation having a label of 1 is greater than 50%, we assign it a 1, otherwise we assign it a 0. Sometimes people move that cutoff to the average of the data, or even scan to find the optimal split point (the one where the most cases are assigned correctly). This is kind of sketchy, but it likely wouldn’t help the results anyways — the data is fairly normally distributed, save for one case, the dataset with no user-level data (which isn’t that surprising, since that’s a hard thing to predict when the only user involved for all of these recommended articles is… my account’s data).
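For completeness, here is roughly what that threshold-shifting trick looks like in code. The probabilities and labels below are toy stand-ins; in the real pipeline they would come from something like clf.predict_proba on held-out data:

import numpy as np

# Toy stand-ins for model-estimated P(read) and the true read/unread labels.
probabilities = np.random.rand(500)
y_test = np.random.randint(0, 2, size=500)

def predict_with_threshold(probabilities, threshold=0.5):
    # The default rule: label an article 1 ("will read") when P(read) >= threshold.
    return (np.asarray(probabilities) >= threshold).astype(int)

# Scan candidate cut points and keep whichever maximizes accuracy on held-out data.
thresholds = np.linspace(0.05, 0.95, 19)
best_threshold = max(
    thresholds,
    key=lambda t: np.mean(predict_with_threshold(probabilities, t) == y_test),
)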

Feature Category Removal Impacts

Which variables matter the most? To find out, we can use the following methodology for each of the datasets we are using (a sketch of the procedure follows the list):

For each column in matrix X:

1. Replace the column’s values with random values for each observation

2. Assess the k-folds for the given ensemble model and dataset pair

3. Measure ∆ accuracy (can shift +/-)
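A sketch of that procedure, assuming score_fn wraps the k-fold evaluation of the chosen ensemble on a dataset, baseline is its accuracy on the untouched data, and feature_names lists the columns of X. Shuffling each column in place, rather than drawing fresh random values, is my own simplification that keeps the column on its original scale:

import numpy as np

def delta_accuracy_by_randomizing(score_fn, X, y, feature_names, baseline):
    # For each feature, destroy its signal by shuffling the column,
    # re-score the ensemble, and record how far accuracy moved from baseline.
    rng = np.random.default_rng(0)
    deltas = {}
    for j, name in enumerate(feature_names):
        X_randomized = X.copy()
        X_randomized[:, j] = rng.permutation(X_randomized[:, j])
        deltas[name] = score_fn(X_randomized, y) - baseline   # negative = the column mattered
    return deltas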

From an information gain perspective, a random column provides no substantive value to any model. Dropping the column altogether requires us to retrain the system, and may cause more harm than help (as changes in dimensionality could have knock-on effects depending on the model).

Full Dataset, biggest ∆ change per random feature replacement

No User distinction, biggest ∆ change per random feature replacement

No Read/unread Distinction, biggest ∆ change per random feature replacement

No Time Pocketed information, biggest ∆ change per random feature replacement

We can then bundle these results into each feature category described earlier to get a sense of which categories may give us a theoretical peek at what makes people read what they do (though this is dangerous and is only a guess from a complicated mechanism that can predict accurately, but can’t tell us why).
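That bundling step is just a grouping pass over the per-column deltas. A small sketch, reusing the hypothetical column-prefix convention from earlier so each column name encodes its feature category, with deltas as produced by the sketch above:

from collections import defaultdict

# deltas maps column name -> change in accuracy after randomizing that column.
category_impact = defaultdict(float)
for column, delta in deltas.items():
    category = column.split("_")[0]          # e.g. "punchcard_monday_9am" -> "punchcard"
    category_impact[category] += delta

# Sort so the categories whose randomization hurt accuracy the most come first.
ranked = sorted(category_impact.items(), key=lambda item: item[1])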

In the four tables above, the negative scores indicate that the model was worse off when that category was removed — in only one case did the model marginally improve with the removal of a category. The interpretation of this result would seem to be that there is wide variability in which feature categories matter with respect to any generalizable insight.

Conclusion

There’s certainly more exhaustive work to be done. Specifically, the models that were assessed may not apply to Pocket in its entirety, or it could be that Pocket deliberately recommends articles that aren’t likely to be read by you (either to get you out of your bubble, if we think they’re behaving nobly, or, less nobly, to make money through sponsored articles). For now, it looks like the ensemble methods accurately predict what is going to get read. And that isn’t what’s getting recommended.