All the leading web browsers, except for Google Chrome, include a separate “reading mode” that extracts the main content from pages, reformats it to be more readable, and hides distractions like advertisements, comments, and even page navigation. This separate rendering mode isn’t governed by any standards, and as such it behaves differently from web browser to web browser. So what is a web developer to do to properly support this distinctly separate and non-standard rendering mode?

This article is part one in a series on web reading mode and reading mode parsers. The article is broken up into multiple parts as each part is written for a slightly different audience.

The non-standard rendering mode

The history of reading mode, a look at the different parsers we have today and how they came to be, and a small criticism of the Apache 2.0 license.

Determining the main page content

There are many approaches to content analysis and extraction, and most only work well with English content. Why is reader mode so slow to activate, anyway?

Title, author, and date metadata extraction

Visual page inspections, standard metadata, or guesswork? Everyone has their own ideas about how to best determine the metadata describing an article.

Inconsistent and bad reading experience

Encourage publishers to fix their designs, and standardize reading mode now.

What is reading mode?

Reading mode is an alternative web rendering mode where the web browser tries to strip out repeated and irrelevant content, such as page navigation, ads, and distractions. The main article is extracted, brought to the front, and displayed in a clean and consistent page design. Reading mode is increasingly useful as more and more websites seem to have abandoned any ideas of design and readability, and focus on littering their pages with as much noise as they can fit onto a single page.

You can think of reading mode as the ideal printed copy of an article you’re reading online. However, as when printing webpages, reading mode can produce odd results and leave you with a useless page.

I’ll look into how reading mode parsers extract the main content and metadata from webpages in parts two and three of this series. Before that, however, I’d like to start by establishing a sort of history of reading mode parsers and then quickly talk about parser diversity and the lack of standards for reading modes.

Readability.js and the birth of a new rendering mode

Instapaper was first on the scene in 2008 with its ability to extract text from webpages, remove page navigation, ads, and distractions, and even make regular webpages readable on the iPhone, back before the “mobile-first” mentality had set in among web developers. Instapaper also let users bookmark webpages and save them for later. However, Instapaper’s proprietary license makes it little more than a footnote in the history of web reading modes.

Arc90 was inspired by Instapaper, and launched the Readability bookmarklet (an early form of a web browser extension) a year later. Readability.js was a parser that could extract the main text of a webpage and reformat it with large and readable text, in an era where most websites still relied on tiny text to better serve visitors with smaller, older displays.

More importantly, Readability.js was released under the Apache 2.0 software license. In brief, this is a permissive license that allows anyone to take the source and build upon it, even for commercial purposes, without sharing the sources of the changes they make to the software.

A few months later, Arc90 launched Readability.com as a reading list and reading mode web service. The service was discontinued in 2016, but I’ll cover reading as a service in more detail in part four. Readability.com made improvements to Readability.js which gave web publishers more control over the appearance of their content in reading mode, but they didn’t share these changes back to the open-source project and discontinued Readability.js entirely in 2010.

Readability.js makes the leap to web browsers

Apple picked up Readability.js in 2010 (probably contributing to Arc90’s decision to stop maintaining it), bundled it, and gave it its own Reader button in the Safari 5 web browser for macOS and Windows. Safari Reader made the leap to iOS the following year. Apple has a reputation for clean and uniform designs, and it’s no wonder they preferred the neatness of Safari Reader to the busy and cluttered designs that many websites come with from their publishers.

Over the years, Apple has made changes and improvements to Readability.js in its own proprietary fork known as Apple Readability. Apple’s changes are quite extensive, and Apple Readability is probably the most capable reading mode parser available today.

In 2014, Microsoft introduced Reading View in Internet Explorer 11. This reading mode uses Microsoft’s own proprietary parser, loosely inspired by Readability.js.

Mozilla didn’t want to be left behind and introduced their own Reader View in Firefox 38 the following year. Mozilla forked the abandoned 2010 version of Readability.js and re-released it as Mozilla Readability. Mozilla Readability has received some updates but no major overhauls, and it is also licensed under the Apache 2.0 license.

Mozilla Readability’s permissive license made it the de facto reading mode parser found in everything from the Samsung Browser for Android to niche web browsers like GNOME Web, Maxthon, Vivaldi, and Yandex Browser. Maxthon has also forked Mozilla Readability as its own proprietary Maxthon Reader, with optimizations for parsing Chinese, Japanese, and Korean (CJK) content and some parsing rules inspired by Apple Readability.

Reader mode parser diversity

I’ve been talking a lot about Readability.js so far. There are quite a few other reading mode parsers, however. Here is a quick overview of web browsers and services and which parser they’re known to use:

| Vendor | Product | Parser | Environments |
|---|---|---|---|
| Mozilla | Firefox | Mozilla Readability | Desktop and Android |
| GNOME | Web | Mozilla Readability | Desktop |
| Vivaldi | Vivaldi | Mozilla Readability | Desktop |
| Yandex | Browser | Mozilla Readability | Desktop |
| Samsung | Browser | Mozilla Readability | Android |
| Apple | Safari | Safari Reader | macOS and iOS |
| Maxthon | Maxthon | Maxthon Reader | Desktop |
| Microsoft | Edge | EdgeHTML | Windows and Windows Mobile |
| Microsoft | Edge Mobile | Chrome DOM Distiller | Android |
| Google | Chrome | Chrome DOM Distiller | Android |
| Postlight | Mercury Reader | Web Reader | Web / browser extension |
| Instant Paper | Instapaper | Instaparser | Web / browser extension |
| Mozilla | Pocket | Unknown | Web / browser extension |

One notable absentee in the above table is the market leader on the desktop: Google Chrome. Google is primarily an advertisement company, and they’re not too keen on any technology that aims to hide advertisements and distractions. Chrome for Android has an optional reading mode hidden away in the accessibility menu.

No standards, little documentation

One of the big problems with reading mode is that there are no standards for how it’s supposed to work, and I can find little evidence of cooperation between the different reading mode parser vendors. More worryingly, only Microsoft has ever published any documentation about how their reading mode works and how web developers can target it; yet theirs is the parser with the least information available about it.

Ideally, web developers should be able to deliver a consistent and good user experience for their users in reading mode. However, web developers who want to support reading mode have the tedious job of testing and retesting across all the different reading modes, all of which behave differently and have different and even conflicting ideas about the way documents should be structured.

The majority of web browsers don’t set a custom URI for content displayed in reading mode, but all the market-leading browsers do, yet they’ve been unable to agree on what URI to use. An agreement on what URI to use would mean websites, or even external programs, could link directly to a page in reading mode. This is indeed possible in some browsers, but only Safari and Microsoft Edge have registered their reading mode URIs in the operating system and enabled other programs to open pages directly in their reading modes.

Reading mode is unreliable for users: even though they’ve experienced that it works on a given website before, they never know whether the button will appear in their browser or whether the whole article will be included once they click it. Clicking on the reading mode button is always a bit of a gamble.
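To make the idea of a reading mode URI a little more concrete, here is a minimal Python sketch. It assumes the `about:reader?url=` form that Firefox has used for Reader View, with the original address percent-encoded after the prefix. Treat that scheme as an assumption that may change between versions; the other browsers use different, incompatible schemes that I won’t guess at here. The function names are made up for this illustration.

```python
from urllib.parse import quote, unquote

# Assumption: the prefix Firefox has used for pages shown in Reader View.
# Other browsers use different schemes, which is the problem described above.
FIREFOX_READER_PREFIX = "about:reader?url="


def to_firefox_reader_uri(page_url):
    """Build a Reader View URI by percent-encoding the full page address."""
    return FIREFOX_READER_PREFIX + quote(page_url, safe="")


def from_firefox_reader_uri(reader_uri):
    """Recover the original page address from a Reader View URI."""
    if not reader_uri.startswith(FIREFOX_READER_PREFIX):
        raise ValueError("not a Firefox Reader View URI")
    return unquote(reader_uri[len(FIREFOX_READER_PREFIX):])


if __name__ == "__main__":
    uri = to_firefox_reader_uri("https://example.com/article?id=1")
    print(uri)
    print(from_firefox_reader_uri(uri))
```

Without an agreed-upon scheme, a website or external program would need one such function per browser, plus a way of knowing which browser will end up opening the link.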

There was almost a standard

After Readability.com went proprietary, they settled on the hNews microformat as the preferred way of parsing pages, while keeping what is now the Mozilla Readability implementation as a fallback. hNews is a structured metadata format that adds semantics for identifying titles, publication time, authors, and even the summary and the main content of an article. In other words, there was a defined standard for parsing documents and retrieving both the metadata and the main content.

The hNews microformat did gain some traction, but has since been superseded by the h-entry microformat. h-entry markup is everywhere on the web, thanks to it being included in the default themes of leading content management systems as well as being recognized as a semantic data structure by leading search engines like Bing, Google, and Yandex.

There was almost a standard in place. The only trouble was that Readability.com had stopped contributing to their open-source Readability.js repository, so these changes never made it into Apple’s or Mozilla’s forks of Readability. The web continued with the legacy Readability implementation, which isn’t oriented around microformats. When you hear someone argue against the use of the Apache 2.0 software license for open-source projects, this is the kind of thing they’re referring to.
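To show what a microformats-first parser could look like, here is a minimal Python sketch that reads the title, author, date, summary, and main content from h-entry markup. It is not how Readability.com or any browser implements it. The class names (`h-entry`, `p-name`, `p-author`, `dt-published`, `p-summary`, `e-content`) come from the h-entry microformat; the function name, the field names, and the assumption of reasonably well-formed HTML are mine.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are skipped when tracking depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# h-entry property classes mapped to illustrative (hypothetical) field names.
PROPERTIES = {
    "p-name": "title",
    "p-author": "author",
    "dt-published": "published",
    "p-summary": "summary",
    "e-content": "content",
}


class HEntryParser(HTMLParser):
    """Collects property text from the first h-entry element in a document."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []           # (field, depth) for properties being read
        self._depth = 0           # current element nesting depth
        self._entry_depth = None  # depth at which the h-entry was opened
        self._done = False        # True once the first h-entry has closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._depth += 1
        if self._done:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self._entry_depth is None:
            # Ignore everything until the h-entry root element is found.
            if "h-entry" in classes:
                self._entry_depth = self._depth
            return
        for cls in classes:
            field = PROPERTIES.get(cls)
            if field is None or field in self.fields:
                continue
            if cls == "dt-published" and attr_map.get("datetime"):
                # Prefer the machine-readable value over the visible text.
                self.fields[field] = attr_map["datetime"]
            else:
                self.fields[field] = ""
                self._open.append((field, self._depth))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Stop collecting for any property whose element closes here.
        while self._open and self._open[-1][1] == self._depth:
            self._open.pop()
        if self._entry_depth == self._depth:
            self._done = True
        self._depth -= 1

    def handle_data(self, data):
        for field, _ in self._open:
            self.fields[field] += data


def extract_h_entry(html):
    """Return a dict of h-entry fields with whitespace collapsed."""
    parser = HEntryParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split())
            for name, text in parser.fields.items()}
```

A parser built this way only needs to fall back to guesswork, as the legacy Readability implementation does, when `extract_h_entry` comes back empty or without a `content` field.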