A team of three students from IIT Kharagpur has built a Chrome extension that detects and blocks clickbait headlines.

When reading news online, how often have you been lured by clickbaits – headlines that are catchy and excite your curiosity – only to find the article in question disappointing? Repeated exposure to clickbaits can cause reader fatigue, so that actual news stories get missed. To avoid this, IIT Kharagpur researchers have developed a browser extension, “Stop Clickbait,” that can detect and block clickbaits, thereby preventing such disappointments.

Relying on the premise that good content must be embodied by “good” language, Stop Clickbait analyses a range of features in the headline, including sentence structure, and categorises the headline as “clickbait” or “non-clickbait.”

The extension achieved 93 per cent accuracy in detecting clickbaits and 89 per cent accuracy in blocking them. The paper describing this work, presented by the authors at the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), won the best student paper award.

Identifying a clickbait headline

Consider the clickbait headline “This Rugby Fan’s Super-Excited Reaction To Meeting Shane Williams Will Make You Grin Like A Fool,” and compare it with a non-clickbait one, such as, “Tata Sons Announces Key Organisational Changes” or “Editors’ Guild Condemns One-Day Ban on NDTV.” Some of the differences are apparent, but there’s more.

“We found significant difference in language used in clickbaits vs. non-clickbaits. Clickbaits use longer, well-formed English sentences with hyperbolic words, many punctuations, informal word forms, verb tenses etc. that clearly distinguish them from the non-clickbait headlines. It is this very difference in language that is harnessed to build our application,” says Abhijnan Chakraborty, the first author of the paper. Abhijnan is a third-year PhD candidate at the Complex Networks Research Group at IIT Kharagpur. He worked in collaboration with Bhargavi Paranjpe and Sourya Kakarla, who are also students at IIT Kharagpur, and Professor Niloy Ganguly.

To make this work mathematically, the team extracted the headlines from 18,513 Wikinews articles to represent the non-clickbait type, and crawled 8,069 web articles from domains that regularly publish clickbaits, such as BuzzFeed, Upworthy and Scoopwhoop. Six volunteers then went through these to sift out the clickbaits that would form the basis for the study.

Analysing this data set, they quantified the characteristics of clickbaits. One marker was the length of the headline: the average was 10 words for clickbaits, against seven for non-clickbaits. Another was the use of hyperbolic words. A third was the extensive use of common word 3-grams and 4-grams (sequences of three or four consecutive words) in clickbaits, for example, “how well do you” or “what happens when.” In non-clickbaits, some of the most common 3-grams were “dies at age” and “found guilty of.” Only 19 per cent of non-clickbait headlines contained these “most common” n-grams, as against 65 per cent of clickbait headlines.
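To make two of these markers concrete, here is a minimal Python sketch that computes a headline's word count and checks it for a common clickbait n-gram. It is an illustration only, not the team's code: the function names, the tokenising rule and the two-entry n-gram list (taken from the examples quoted above) are all assumptions.

```python
import re

# Assumption: a tiny illustrative list built from the two examples quoted in
# the article; the real study derived its n-grams from the full data set.
COMMON_CLICKBAIT_NGRAMS = {
    ("how", "well", "do", "you"),
    ("what", "happens", "when"),
}


def tokenize(headline):
    """Lowercase a headline and split it into word tokens."""
    return re.findall(r"[a-z0-9'’]+", headline.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def headline_features(headline):
    """Compute two of the markers described above for a single headline."""
    tokens = tokenize(headline)
    grams = set(ngrams(tokens, 3)) | set(ngrams(tokens, 4))
    return {
        "word_count": len(tokens),
        "has_common_ngram": bool(grams & COMMON_CLICKBAIT_NGRAMS),
    }


print(headline_features("What Happens When You Stop Eating Sugar"))
print(headline_features("Tata Sons Announces Key Organisational Changes"))
```

The first headline is a made-up example. A real system would add the remaining features, such as hyperbolic words and punctuation, in the same way.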

In this manner, they zeroed in on 14 features that would differentiate the two. They then tested three different machine learning models for classification. This is the core idea of the research: building a classifier that automates the task of identifying clickbaits. First, the models were trained using a certain proportion of news and clickbait headlines. All of them learnt the patterns or characteristics that differentiate a clickbait from a news headline. However, the underlying mathematical formulations vary across models. The support vector machine (SVM) performed the best. “93 out of 100 times its decision was the same as the ‘ground truth’ labelled by human annotators,” says Abhijnan.
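The train-then-compare-with-ground-truth loop can be sketched in a few lines of Python. This is a simplified stand-in, not the team's model: it trains a linear support vector classifier by subgradient descent on the hinge loss, uses only two hypothetical features (word count divided by 10, and a 0/1 flag for a common clickbait n-gram) rather than the 14 in the study, and runs on a handful of invented data points.

```python
def train_linear_svm(samples, labels, epochs=300, lr=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss (a basic linear SVM)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 (clickbait) or -1 (news)
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regulariser shrinks the weights slightly on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:  # inside the margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 for clickbait, -1 for news."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def accuracy(w, b, samples, labels):
    """Fraction of decisions that match the human-labelled ground truth."""
    correct = sum(predict(w, b, x) == y for x, y in zip(samples, labels))
    return correct / len(samples)


# Invented toy data: [word_count / 10, has_common_ngram]
X = [[1.0, 1.0], [1.2, 1.0], [0.9, 1.0], [1.1, 1.0],   # clickbait
     [0.7, 0.0], [0.6, 0.0], [0.5, 0.0], [0.8, 0.0]]   # news
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print("training accuracy:", accuracy(w, b, X, y))
```

Here accuracy is measured on the same toy points used for training, purely to keep the sketch short. A real evaluation would score headlines the model has not seen.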

Blocking clickbaits

A survey showed that not all readers would choose to block the same clickbaits. This led the team to develop a method for blocking the identified clickbaits based on individual choice. Using the articles a reader clicks on or chooses to block, they compared blocking based on topical similarity (indicated by the tags and keywords supplied by the site's developer) against blocking based on similarity of linguistic patterns. The second method proved to be more effective and was integrated into the browser plugin.
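One way to picture personalised blocking is sketched below in Python. It rests on assumptions not taken from the paper: a headline's "pattern" is reduced to its set of lowercase words with numbers replaced by a placeholder, similarity is measured with the Jaccard index, and a new headline is blocked when it resembles something the reader blocked more closely than anything the reader clicked. All headlines are invented.

```python
import re


def pattern(headline):
    """Crude stand-in for a linguistic pattern: the set of lowercase
    words, with every number replaced by a '<num>' placeholder."""
    tokens = re.findall(r"[a-z]+|\d+", headline.lower())
    return {"<num>" if t.isdigit() else t for t in tokens}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def should_block(headline, blocked, clicked):
    """Block when the headline is closer to the reader's blocked
    headlines than to the ones the reader chose to click."""
    p = pattern(headline)
    sim_blocked = max((jaccard(p, pattern(h)) for h in blocked), default=0.0)
    sim_clicked = max((jaccard(p, pattern(h)) for h in clicked), default=0.0)
    return sim_blocked > sim_clicked


# Hypothetical reading history for one user
blocked = ["21 Things Only Cat Owners Will Understand"]
clicked = ["Which Harry Potter Character Are You"]

print(should_block("17 Things Only Dog Owners Will Understand", blocked, clicked))
print(should_block("Which Star Wars Character Are You", blocked, clicked))
```

Because the numbers are normalised, "17 Things" and "21 Things" count as the same pattern even though the topics (dogs and cats) differ, which is the kind of match a tag-based comparison could miss.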

News outlets, however, in competing for reader attention, could be using headlines with clickbait qualities to package news. Abhijnan agrees that such news items should not be penalised. “This task is a lot harder than the current scheme we have deployed, and we are working on the extension to tackle this issue,” he says.