I became interested in web syndication feeds recently. It's an old technology nobody talks about anymore, but everybody still provides them. Taking a look at Google Trends shows us how much people have lost interest in RSS since its peak popularity in 2006:

It's an old technology people seem to have lost interest in, but can we still rely on it? Are people abandoning web syndication? With the advent of CMS (eg. the huge popularity of WordPress), static website generators and publishing platforms (eg. Medium) which all provide syndication feeds by default, RSS doesn't look dead to me.

This is how I became interested in finding how many websites actually do provide at least one XML syndication feed. Builtwith has a pretty nifty "trends" section and it states that 33% of the 1 million most visited websites have an RSS feed. While this gives us a pretty good idea, let's see what we could do by ourselves.

We basically have two options here:

Actually crawling the web - a costly and lengthy process Relying on existing web crawl data

Enters CommonCrawl. CommonCrawl is a non-profit founded on the exciting project of crawling tons of web pages and releasing the obtained dataset publicly and for free. Their latest dump was published in November 2015 and contains 1.82 billion web pages, amounting to over 151TB of highly compressed HTML. I half-jokingly said this before: why not mining CommonCrawl to answer my own question?

A few weeks ago two colleagues of mine mentioned over lunch their desire to mine Common Crawl for their research, and it was a coincidence that I had been thinking about doing the same (though mostly for fun) for a few months. Wouldn't have we been able to combine our efforts into a single "Common Crawl run", I most probably wouldn't have mined this dataset all by myself only to satisfy my curiosity.

Working with CommonCrawl

As mentioned before, the latest Common Crawl is 151TB of data hosted on S3. We decided to process it directly from a few EC2 spot instances using the WDC Extraction Framework . Actually this choice was pretty obvious to us because the syndication feeds are not the only thing we wanted to extract from CommonCrawl. My colleagues were interested in extracting all structured data embedded in any HTML page which is exactly what the WDC framework was designed for. We also wanted to extract all HTML anchors pointing to a Wikipedia page. We forked the WDC framework and modified it to extract these different things all at once, thus only needing to go through the CommonCrawl dataset once.

Now, CommonCrawl dump consists of HTML web pages and we want to extract parts of it, eg. any <link rel="alternate" href="/blog/atom.xml" title="something" type="application/atom+xml"> with either RSS/Atom type attribute or rel="alternate" attribute. Although parsing HTML with regex is a bad idea , constructing a full-blown DOM tree out of each of these 1.82B pages would require far too much time and processing power.

Here's the regular expression I used to match feeds:

(<link[^>]*(?:\s(?:type=[\"']?(application\/rss\+xml|application\/atom\+xml|application\/rss|application\/atom|application\/rdf\+xml|application\/rdf|text\/rss\+xml|text\/atom\+xml|text\/rss|text\/atom|text\/rdf\+xml|text\/rdf|text\/xml|application\/xml)[\"']?|rel=[\"']?(?:alternate)[\"']?))[^>]*>)" 1



The WDC framework wrote its results to our S3 bucket in csv.gz format. For the whole run we budgeted 500USD for AWS EC2 instances. We spawned 100 c3.4xlarge EC2 instances (16 cores each) to run our modified WDC framework. Using spot instances saved us some money. It took around 30 hours to process the whole Common Crawl dump and costed less than 450USD, we were right on target.

Most interesting is of course post-processing the data, not extracting it. Here is what our results S3 bucket looks like together with a short explanation of what the result of our various extractions are:

2.3G /WDC_112015/anchors 60G /WDC_112015/feeds 54G /WDC_112015/urls 346G /WDC_112015/data 408G /WDC_112015/anchor_pages 37G /WDC_112015/stats 905G total 1

2

3

4

5

6

7



(gzipped TSV) anchors: page_url | anchor text | wikipedia_url

(gzipped TSV) feeds: page_url | link_tag | link_type

(gzipped TSV) urls: page_url

(gzipped text) data: <quintuplet>

(subject, predicate, object, page, extractor_used)

(subject, predicate, object, page, extractor_used) (gzipped json) anchor_pages: {"url": page_url, "content": full_html_page}



(gzipped TSV) stats: arcFileName | arcFilePos | detectedMimeType | hostIp | html-head-meta | html-mf-adr | html-mf-geo | html-mf-hcalendar | html-mf-hcard | html-mf-hlisting | html-mf-hrecipe | html-mf-hresume | html-mf-hreview | html-mf-species | html-mf-xfn | html-microdata | html-rdfa | html-rdfa11 | mimeType | recordLength | referencedData | timestamp | totalTriples | uri

First step was of course to sync this bucket locally and backup it to our NAS. Once this done and after having triggered security warnings at the network admins office for downloading almost 1TB at full speed, we proceeded to copy this data to DAPLAB hadoop cluster. DAPLAB is an awesome project aiming at providing a powerful data processing cluster on a freemium and premium basis for companies that cannot afford their own cluster, for researchers and scientists, etc. They also organize weekly hacking sessions to which anyone can attend and get access to the cluster for free. Our research lab has been partnering with DAPLAB since the beginning and DAPLAB infrastructure is a very nice complement to our lab's hadoop cluster. (No more advertising in this post I promise.)

We now have all the data on HDFS, let's process it. Keep in mind this blog post is about the XML syndication feeds and not how and what we did with the other things we extracted (anchors, RDF triples, etc).

[ vfelder@daplab ~ ] $ hdfs dfs -ls /data/WDC_112015/data/feeds | head -n2 Found 35669 items 1.7M 2016-02-28 15:21 /data/WDC_112015/data/feeds/ex_common-crawl_crawl-data_CC-MAIN-2015-48_segments_1448398444047.40_warc_CC-MAIN-20151124205404-00000-ip-10-71-132-137.ec2.internal.warc.gz.csv.gz 1

2

3



We have 35,669 gzipped TSV files. We don't want to work with this format and compression because gzip is slow and CSV/TSV is not ideal to query the data. Also, 35k files is a bit too much for a ~10 nodes cluster. We will convert these 35k gzipped TSV to 1,000 snappy-compressed parquet files.

Why converting 35k gunzip csv files to 1,000 snappy parquet files?

Parquet is better suited for querying (eg. using Hive) than CSV: Parquet files contain their schema, CSV don't. Parquet files store data by column, CSV is row-based. Columns of a parquet file are compressed (each column being compressed according to its data type).

is better suited for querying (eg. using Hive) than CSV: Snappy also offers great advantages, here compared to gzip: Snappy compression is orders of magnitude faster than gzip. Snappy happily trades compression against read/write speed. After all, when running a job that loads terabytes of data from HDFS on a cluster, we care more about fast read/write than about sparing a few TB.

Less files means more throughput (as long as we have more files than processing cores of course). If we assume decompressing a file takes 100ms, the decompression alone for all 35k files will take a cumulated one hour. Of course the whole process will be distributed and run in parallel but it still constitutes overhead compared to working on only 1,000 files.

We could have chosen Avro, but our schemas being very basic and our need being to query columns, Parquet made more sense.

Remember our TSV "schema" is page_url | link_tag | link_type . It's a good thing we saved the whole <link tag because I realized after the EC2 run it might be interesting to extract a few other possible attributes. I wrote a short Scala script for Spark to extract these things and go from csv.gz to 1,000 .snappy.parquet .

These 60G total / 35k .csv.gz , once converted to 1k .snappy.parquet with 4 additional columns, now take 187G. Not bad.

We could now create a Hive table like this:

CREATE EXTERNAL TABLE xi_wdc . feeds ( page STRING , type STRING , tag STRING , hostname STRING , rel STRING , href STRING , title STRING ) STORED AS PARQUET LOCATION '/user/vfelder/feeds/feedsparsed.parquet/' ; 1



EXTERNAL means the data for this table isn't moved by Hive to the tables location, it just stays where it is and Hive loads the data directly from these files. Should we drop this table the data won't be affected.

I also performed the above csv.gz to snappy.parquet conversion + hostname extraction for /data/WDC_112015/data/urls/ which contains all crawled URLs. The Scala program to do this is very similar to the one listed here and can be found here .

Hive is nice to run some queries, but Spark SQL is equally nice and generally offers better performances. It also allows me to run queries directly from the ipython notebook I'm writing this blog post from, get the queries results and rework them directly in python.

The first part of this blog post was writting in Markdown in my ipython notebook running pyspark, the rest of this post is playing with the data directly from ipython, writing pyspark.sql queries and executing them directly from ipython, running them with pyspark over the cluster.

I got this handy *sh alias I'm using to spawn a notebook in a screen:

alias ipyspark = 'IPYTHON_OPTS="notebook --no-browser --ip=localhost --port=1339" pyspark --master yarn-master --conf spark.ui.port= $( shuf -i 2000-65000 -n 1 ) --num-executors 20 --executor-cores 2 --driver-memory 16g --executor-memory 16g' 1



Let's work with our extracted feeds and all crawled URLs

First, import some spark-sql libs.

from pyspark . sql import SQLContext from pyspark . sql . functions import * def h ( n ) : """helper function to display numbers in a human-readable way""" return '{:,}' . format ( n ) 1

2

3

4

5

6



sc is spark context, it's already there because we're running ipython on pyspark.

sqlContext = SQLContext ( sc ) 1



We stored the feeds and urls as snappy compressed parquet files. All we have to do is read them as parquet, everything else is taken care of.

feeds = sqlContext . read . parquet ( '/user/vfelder/feeds/feedsparsed.parquet/' ) urls = sqlContext . read . parquet ( '/user/vfelder/urls/urlsparsed.parquet/' ) 1

2



Common Crawl said their dump had 1.82B URLs, let's check if that's also the number of web pages we used for our extraction.

{ 'total number of urls' : h ( urls . count ( ) ) } 1



{'total number of urls': '1,823,130,936'}

Nice, it seems the data matches.

Feeds and URLs files include these two columns : page and hostname , respectively the original URL of the web page and the hostname of this URL. A single web page can provide several feeds but we're only interested by the number of websites which provide at least one feed, so we take the distinct hostnames. Same with URLs: we crawled a lot of URLs but we are only interested by the number of distinct hostnames, to compare the two numbers.

distinct_feeds_count = feeds . select ( 'hostname' ) . distinct ( ) . count ( ) distinct_urls_count = urls . select ( 'hostname' ) . distinct ( ) . count ( ) { 'hostnames with at least one feed' : h ( distinct_feeds_count ) , 'hostnames processed' : h ( distinct_urls_count ) } 1

2

3

4

5

6



{'hostnames processed': '25,243,438', 'hostnames with at least one feed': '10,294,833'}

'{:0.2f}%' . format ( float ( distinct_feeds_count ) / distinct_urls_count * 100 ) 1



'40.78%'

Ok, around 40% of the crawled websites provide at least one XML feed.

But on average, how many feeds per webpage?

total_feeds_count = feeds . count ( ) total_urls_count = urls . count ( ) float ( total_feeds_count ) / total_urls_count 1

2

3



1.942656384170972

Now it would be interesting to see which standards these feeds implement.

from pyspark . sql import Column types_grouped = feeds . select ( 'type' ) . where ( col ( 'type' ) != ' ' ) . groupBy ( 'type' ) . count ( ) . orderBy ( desc ( 'count' ) ) types_grouped . show ( ) 1

2

3

4

5

6

7

8



+--------------------+---------+ | type| count| +--------------------+---------+ | application/rss+xml|877891180| |application/atom+xml|215504961| | application/xml| 20276729| | text/xml| 17086738| | application/rdf+xml| 14669025| | application/atom| 545717| | application/rdf| 425259| | application/rss| 108139| | text/rdf| 9637| | text/rss+xml| 245| | text/rss| 125| | text/rdf+xml| 14| +--------------------+---------+

Let's do some basic stats:

rss_feeds = types_grouped . filter ( 'type LIKE "%rss%"' ) . agg ( { 'count' : 'sum' } ) . collect ( ) rss_total = rss_feeds [ 0 ] . asDict ( ) . values ( ) [ 0 ] atom_feeds = types_grouped . filter ( 'type LIKE "%atom%"' ) . agg ( { 'count' : 'sum' } ) . collect ( ) atom_total = atom_feeds [ 0 ] . asDict ( ) . values ( ) [ 0 ] feeds_with_type = types_grouped . agg ( { 'count' : 'sum' } ) . collect ( ) feeds_with_type_total = feeds_with_type [ 0 ] . asDict ( ) . values ( ) [ 0 ] { 'RSS' : h ( rss_total ) , 'Atom' : h ( atom_total ) } 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21



{'Atom': '216,050,678', 'RSS': '877,999,689'}

rss_pc = float ( rss_total ) / feeds_with_type_total * 100 atom_pc = float ( atom_total ) / feeds_with_type_total * 100 { '% RSS' : '{:0.2f}%' . format ( rss_pc ) , '% Atom' : '{:0.2f}%' . format ( atom_pc ) , } 1

2

3

4

5

6

7



{'% Atom': '18.84%', '% RSS': '76.58%'}

alternate_total = feeds . select ( 'rel' , 'type' ) . where ( 'rel LIKE "%alternate%" AND (type LIKE "%rss%" OR type LIKE "%atom%")' ) . count ( ) '{:0.2f}%' . format ( float ( alternate_total ) / ( rss_total + atom_total ) * 100 ) 1

2

3

4

5



'97.09%'

That's about it, my questions have been answered!

Let's quickly recap: