Most of you will read this article to discover the most popular blogs, but the real purpose here is to show what goes wrong with many data science projects, even ones as simple as this, and how the problems can easily be fixed. In the process, we created a new popularity score that is much more robust than the rankings used in similar articles (top bloggers, popular books, best websites, etc.). This score, based on a decay function, could be incorporated into recommendation engines.

Figure 1: Example of star-rating system

1. Introduction

The data covers almost three years' worth of DSC (Data Science Central) traffic: more than 50,000 posts and more than 6 million pageviews (almost half of them in 2014 alone), across our channels: DSC, Hadoop360, BigDataNews, and AnalyticBridge (but not AnalyticTalent).

I actually included 46 articles, but only the top 30 have a highly accurate rank. Some articles were filtered out because they belong to a cluster of similar articles (education, books, etc.). Finally, some very popular pages (in the top 5) are not included because their creation date is not available (the Visualization, Top Links, and Jobs tabs at the top of any page), or because they simply should not be listed (my own profile page, the sign-up page, the front page, etc.).

The new scoring model is described in the Scoring Engine section below. You will also find useful comments in the Interesting Insights section.

2. Top 30 DSC blogs

The number in parentheses represents the rank we would have obtained using standard methodology instead of our popularity score. The date represents when the blog was created. By looking at just these two fields, you might be able to guess what our new scoring engine is about. If not, explanations are provided in a section below.

3. Interesting Insights

These top pages represent 21% of our traffic. The front page accounts for another 9%, and the top pages that were filtered out (for various reasons; see the introduction) for a few more percent. Here are some of the highlights:

For top popular articles, pageviews peak in the first three days, but popularity remains high for many years. In short, pageview decay (over time) is very low; see figure 2 below.

Decay is very low for highly popular, generic articles that are time-insensitive, the type of article that we try to write. Unpopular articles and time-sensitive announcements decay very quickly, typically following exponential decay with a short half-life.
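The contrast between these two regimes can be sketched with a toy exponential-decay model. The decay rates below are illustrative assumptions, not values estimated from DSC data:

```python
import math

def pageviews_per_day(r, t):
    """Daily pageview velocity at day t, under exponential decay of rate r (per day)."""
    return math.exp(-r * t)

def half_life(r):
    """Days until the daily pageview rate falls to half its initial value."""
    return math.log(2) / r

# A time-sensitive announcement with a hypothetical r = 0.1/day halves in about a week,
# while a time-insensitive article with r near 0 barely decays at all.
print(round(half_life(0.1), 1))            # ~6.9 days
print(round(pageviews_per_day(0.001, 365), 2))  # ~0.69 of day-one traffic after a year
```

With r close to zero, as observed for the top DSC articles, the half-life stretches into years, which is why figure 2 shows essentially no decline.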

You will not notice any decay at all in figure 2. The reason is that decay is hidden by general traffic growth on DSC: 60% growth in the past 12 months, and 50% growth in the preceding 12 months. This general growth more than compensates for the natural decay.

Notice the change in which subjects are popular between the new material (DSC years) and the old material (AnalyticBridge years). You can detect this drastic change only if you use a sound popularity algorithm, as described in section 4.

Many of the most popular articles are new (once you adjust for the time bias, using the methodology described in section 4). This is partly because we now have a better understanding of the type of articles our readers are interested in and of how to successfully reach new users, and partly because of growth: an article posted today immediately receives more than twice the traffic it would have received on day one if posted 2.5 years ago.

We used two data sources, always a good idea in any data science project: Google Analytics, which filters out robots, and Ning page counts, which do not. On average, Ning numbers are 30% above the Google Analytics numbers, which suggests that about 30% of the traffic comes from robots (Google crawlers, etc.). Robot traffic lags human traffic by a few months on average.

Google Analytics has one drawback: it counts two versions of the same page, where only the query string in the two URLs differs, as two different pages. A bit of post-processing quickly fixes this issue. This issue explains why the discrepancy between Ning and Google Analytics is sometimes as high as 60%, as opposed to the typical 10-30% range.
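The article does not show its post-processing step, but a minimal sketch of it looks like the following: strip the query string from each URL and sum the pageviews of the resulting duplicates. The row format and example URLs are hypothetical:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def aggregate_pageviews(rows):
    """Sum pageviews of URLs that differ only by their query string.

    rows: iterable of (url, pageviews) pairs, e.g. from a Google Analytics CSV export.
    """
    totals = defaultdict(int)
    for url, views in rows:
        parts = urlsplit(url)
        # Keep the path (and host, if present); drop "?xg_source=..." style query strings.
        canonical = (parts.scheme + "://" + parts.netloc if parts.scheme else "") + parts.path
        totals[canonical] += views
    return dict(totals)

# Hypothetical export rows: three URL variants of one article.
rows = [
    ("/profiles/blogs/some-article", 8000),
    ("/profiles/blogs/some-article?xg_source=activity", 1500),
    ("/profiles/blogs/some-article?id=123", 500),
]
print(aggregate_pageviews(rows))  # {'/profiles/blogs/some-article': 10000}
```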

Many popular articles were posted in the last two weeks, but I decided not to include them (unless the pageview count is so high that they naturally appear in our list after correcting for time bias, as explained in section 4). The reason is the high pageview volatility of new articles.

We had to make some time adjustments because our Google Analytics data goes back only to 2012, while Ning goes back to 2007. Some would-be data scientists working on the same project would likely not even notice the issue, let alone fix it.

Figure 2: Pageview decay (or absence of decay!) for 4 top blogs listed above

4. New Scoring Engine

Let's say that you measure the pageview count for a specific article, and your data frame goes from t = t_0 to t = t_1. Models like this typically involve exponential decay at rate r, meaning that at time t, the pageview velocity is f(r, t) = exp(-rt). Under this model, the theoretical number of pageviews between t_0 and t_1 is simply

P = g(r, t_1) - g(r, t_0),

where g(r, t) = {1 - exp(-rt)} / r.
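These two formulas translate directly into code. The sketch below follows the definitions above, using the limit g(r, t) → t as r → 0 so that the r = 0 case used in the article works without a division by zero:

```python
import math

def g(r, t):
    """Cumulative theoretical pageviews from time 0 to t: (1 - exp(-rt)) / r."""
    if r == 0:
        return t  # limit of (1 - exp(-r*t)) / r as r -> 0
    return (1.0 - math.exp(-r * t)) / r

def theoretical_pageviews(r, t0, t1):
    """P = g(r, t1) - g(r, t0): expected pageviews between t0 and t1."""
    return g(r, t1) - g(r, t0)

# With no decay (r = 0) and t0 = 0, P is simply the elapsed time.
print(theoretical_pageviews(0, 0, 365))   # 365
# With decay, P is smaller than the elapsed time.
print(theoretical_pageviews(0.01, 0, 365) < 365)  # True
```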

If t_0 is set to zero, then g(r, t_0) = 0, regardless of r. On a different note, the issue we address here (adjusting for time bias) is known in statistics as left- and right-censored data: right-censored because we have no data about the future, and left-censored because we have no data prior to 2012.

To adjust for time bias, define the new popularity score as S = PV / P, where PV is the observed pageview count during the time period in question. When r = 0 (no noticeable decay, which is the case here) and t_0 = 0, then P = t_1, the time elapsed. Note that only two metrics are required to compute the popularity score S for a specific article: the time elapsed since the creation date, and the pageview count during the time frame in question according to Google Analytics, after aggregating identical pages with different URL query strings.
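Putting the pieces together, the score can be sketched as below. With the article's r = 0 setting, S is just pageviews per unit of time since creation; the example figures are made up for illustration:

```python
import math

def popularity_score(pageviews, days_since_creation, r=0.0):
    """S = PV / P: observed pageviews divided by the theoretical count P.

    With r = 0 (no noticeable decay, the case used in the article) and t_0 = 0,
    P reduces to the number of days since the article was created.
    """
    if r == 0:
        p = days_since_creation
    else:
        p = (1.0 - math.exp(-r * days_since_creation)) / r
    return pageviews / p

# Hypothetical numbers: a 30-day-old article with 12,000 views outscores
# a 2-year-old article with 150,000 views, despite the much lower raw count.
print(popularity_score(12_000, 30))            # 400.0 views/day
print(round(popularity_score(150_000, 730), 1))  # ~205.5 views/day
```

This is why the rank in parentheses (raw pageview ranking) in section 2 can differ sharply from the rank under S: the raw count is biased toward old articles that have simply had more time to accumulate views.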

Note: To make sure we were not missing popular articles posted recently, we collected the data over two overlapping time frames: one data set for 2012-2014, and one for 2014 alone, using CSV exports from Google Analytics. Several articles that did not show up in the 2012-2014 data set (because their raw pageview count was below our threshold of about 10,000 pageviews) actually had top scores S once adjusted for time, and could only be found using the 2014 data. Another way to eliminate this issue is to gather statistics for all articles (not just those with lots of traffic) for the whole time period. That is the automated approach; in our case it would have required writing extra pieces of code, and possibly Google API calls, to download timestamps from Ning (via web crawling) and the entire Google Analytics data set for the 50,000 articles. It was not worth the effort, especially since I allowed myself only a couple of hours to complete this project.

5. Good versus bad data science

Using the basic model with r = 0 (section 4) makes a big difference compared with traditional rankings, as you can see in our list of top articles in section 2, which is sorted according to our popularity score with r = 0 alongside the traditional rank. It allows you to detect trends in what is becoming popular over time.

This is what makes the difference between good and bad data science. Note that over-refining the model, by estimating a different r for each article, testing the exponential decay assumption, and adjusting for growth, is also bad data science: it makes the time spent on this project prohibitive, makes your model subject to over-fitting, and may jeopardize the value (ROI) of the project.

Data science is not about perfectionism; it is about delivering on time and within budget. In this case, spending one month on this project (or outsourcing it to people who work with me) would waste time that could be spent on work yielding far more value than the little incremental gain obtained by seeking perfection. Yet ignoring the decay is equally bad: it would make the whole project worthless. The data scientist must instinctively find the level of perfection needed in his or her models. Data is always imperfect anyway.

6. Next steps

One interesting project is to group pages by category and aggregate popularity scores, perhaps creating popularity scores for categories. Indeed, Nikita Nikitinsky has been working on this problem indirectly: it was his project during his data science apprenticeship (DSA). We will soon publish the results and announce his successful completion of the DSA (see applied project #3). He is the first candidate to complete the DSA, besides our intern Livan (who worked on a number of projects, including our Twitter app to detect top data scientist profiles) and the winner of the Jackknife competition.

Other potential improvements include:

Estimating r rather than using r = 0

Estimating r for each article (risk of over-fitting)

Scoring bloggers rather than blogs

Including this in our list of DSA projects

Testing the exponential decay assumption

Adjusting scores to take traffic growth into account (favoring new blogs over old ones)

Another area of research is to understand why webpage pageview counts so closely follow a Zipf distribution.
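A first step in that direction is to check how well the rank-size relationship fits. Under Zipf's law, count ∝ 1 / rank^s, so log(count) is linear in log(rank) with slope -s. The sketch below estimates s by simple least squares on synthetic data (real DSC counts are not reproduced here):

```python
import math

def zipf_exponent(counts):
    """Estimate the Zipf exponent s via least squares on log(rank) vs. log(count)."""
    counts = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return -slope  # Zipf: log(count) = log(C) - s * log(rank)

# Synthetic counts generated from an exact Zipf law with s = 1;
# the estimator recovers the exponent.
synthetic = [100_000 // rank for rank in range(1, 201)]
print(round(zipf_exponent(synthetic), 2))  # close to 1.0
```

Applied to real pageview data, a roughly straight log-log plot with slope near -1 would support the Zipf hypothesis; systematic curvature would argue against it.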

Related Links

The articles below come with detailed explanations about the (sound) methodology used to obtain the results.