These days Medium is all the rage. Blogging has taken the form of stories, with more focus on a personal touch, and Medium has become the platform for expressing your views and sharing them with a worldwide community.

I decided to do a quick analysis of the articles to see what separates the “Popular on Medium” posts from the rest. I also wanted to build a model that could predict whether a post would be featured in the popular section, and to identify which features matter most when writing a post. Another motive was to learn more about Apache Spark and machine learning models, so I did all the analysis and visualization in Spark and Zeppelin.

Exploratory Data Analysis of “General Posts” and “Popular on Medium”

The dataset consisted of around 11,940 posts published between January and November 2017, of which 1,069 (roughly 9%) were popular ones.

I preprocessed the text of the posts by removing punctuation and special characters, lemmatizing and stemming the words, and finally removing stop words. I then passed the content of the posts through a Latent Dirichlet Allocation (LDA) model to cluster them into 12 topics**. The purpose was to see what kinds of articles dominate the Medium space. Because the words are stemmed to their roots, a little creativity may be required to associate them with meaningful terms.

** The number of topics was arrived at through a little trial and error.
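The cleaning steps described above can be sketched in plain Python. This is only an illustration and not the Spark pipeline used for the analysis: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real lemmatizer and stemmer. The Latent Dirichlet Allocation step itself is not shown.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to",
              "in", "on", "for", "with", "it", "this", "that"}

# Suffixes removed by the crude stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least four characters.

    A hypothetical stand-in for a proper stemmer or lemmatizer.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and special characters, stem, remove stop words."""
    # Replace anything that is not a lowercase letter or whitespace with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = cleaned.split()
    stems = [crude_stem(token) for token in tokens]
    # Stop words are removed last, following the order described in the post.
    return [stem for stem in stems if stem not in STOP_WORDS]


print(preprocess("The posts are trending on Medium!"))
# prints ['post', 'trend', 'medium']
```

The resulting token lists are the kind of input a topic model would then be trained on.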

Words that constitute popular topics for “Popular Medium Posts”

Popular Medium Posts topics

The main ideas seen in the popular Medium posts are machine learning and data (isn’t that why we are here?), JavaScript, relationships, Apple and Google, Trump, Bitcoin, and cultural arts like photography and music.

Words that constitute popular topics for “Not Popular Medium Posts”

Non Popular Medium Posts topics

The main ideas that can be seen here are blockchain, technology, medical research, relationships, Trump politics (again), coding, food, the economy, and cultural arts like photography and music.

It looks like articles involving Trump, Bitcoin, relationships, music and data have a higher chance of making it into the Popular section.

Visualizing differences in quantifiable features of posts between Popular and Not Popular Posts

I collected the number of claps, the number of users who clapped, and the number of responses for each post. I then normalized these quantities by the time elapsed between each post’s publication and the time of data collection, so that the comparison is fair in per-unit-time terms.
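As a rough sketch of that normalization, the snippet below divides a raw count by the hours elapsed between publication and data collection. The choice of hours as the unit, the function name `per_hour`, and the sample numbers are all assumptions made for illustration; the analysis itself was done in Spark.

```python
from datetime import datetime


def per_hour(count, published_at, collected_at):
    """Return a count divided by the hours between publication and collection."""
    elapsed_hours = (collected_at - published_at).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("collection time must be after publication time")
    return count / elapsed_hours


# Hypothetical post: 960 claps gathered over exactly 48 hours.
published = datetime(2017, 11, 1, 12, 0)
collected = datetime(2017, 11, 3, 12, 0)
print(per_hour(960, published, collected))  # prints 20.0
```

Dividing by elapsed time keeps an older post from looking more successful simply because it has had longer to collect claps.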