Overview

In this post we will take a look at how to download and parse syndicated feeds with Python.

The Python module we will use for that is “Feedparser”.

The complete documentation can be found here.

What is RSS?

RSS stands for Rich Site Summary (it is also commonly expanded as Really Simple Syndication) and uses standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video.

An RSS document (called “feed”, “web feed”, or “channel”) includes full or summarized text, and metadata, like publishing date and author’s name. [source]

What is Feedparser?

Feedparser is a Python library that parses feeds in all known formats, including Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.3. [source]

RSS Elements

Before we install the feedparser module and start to code, let’s take a look at some of the available RSS elements.

The most commonly used elements in RSS feeds are “title”, “link”, “description”, “publication date”, and “entry ID”.

The less commonly used elements are “image”, “categories”, “enclosures”, and “cloud”.
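
To make these element names concrete, here is a minimal sketch of a hand-written RSS 2.0 document, read with Python's standard-library xml.etree.ElementTree rather than feedparser. The feed contents and the example.com URLs are invented for illustration. In RSS 2.0 markup, the “publication date” is the pubDate tag and the “entry ID” is the guid tag.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written RSS 2.0 document (all values are invented for illustration).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed used to show common RSS elements</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <description>Summary of the first post</description>
      <pubDate>Mon, 14 Oct 2013 09:00:00 GMT</pubDate>
      <guid>http://example.com/first-post</guid>
      <category>python</category>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")

# Channel-level elements describe the feed as a whole.
print(channel.findtext("title"))
print(channel.findtext("link"))

# Each <item> is one entry with its own title, link, date, and ID.
for item in channel.findall("item"):
    print(item.findtext("title"))
    print(item.findtext("link"))
    print(item.findtext("pubDate"))
    print(item.findtext("guid"))
```

Feedparser hides this XML handling and exposes the same information through a single, consistent interface, which is why the rest of this post uses it.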

Install Feedparser

To install feedparser on your computer, open your terminal and install it using “pip” (a tool for installing and managing Python packages).

sudo pip install feedparser

To verify that feedparser is installed, you can run “pip list” and look for feedparser in the output.

You can of course also start the interactive interpreter and import the feedparser module there.

If the import returns to the prompt without an error, as below, you can be sure it’s installed.

>>> import feedparser
>>>
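
The same check can be done from a script. The following is a small sketch, assuming Python 3.4 or newer, that uses the standard-library importlib to look for a module without importing it. The helper name is_installed is my own and is not part of feedparser or pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("feedparser"):
    print("feedparser is installed")
else:
    print("feedparser is missing; install it with pip")
```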

Now that we have installed the feedparser module, we can go ahead and begin to work with it.


Getting the RSS feed

You can use any RSS feed that you want. Since I like to read Reddit, I will use that for my example.

Reddit is made up of many sub-reddits. The one I am particularly interested in for now is the “Python” sub-reddit.

To get the RSS feed, just look up the URL of that sub-reddit and add “.rss” to it.

The RSS feed that we need for the Python sub-reddit would be:

http://www.reddit.com/r/python/.rss
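
If you want to follow several sub-reddits, the rule above is easy to wrap in a function. This is only a sketch of the URL pattern described in this post; the function name subreddit_feed_url is made up for illustration.

```python
def subreddit_feed_url(subreddit):
    """Build the RSS feed URL for a sub-reddit by appending ".rss" to its page URL."""
    # Tolerate surrounding whitespace or slashes, e.g. " /python/ ".
    name = subreddit.strip().strip("/")
    return "http://www.reddit.com/r/" + name + "/.rss"


print(subreddit_feed_url("python"))
```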

Using Feedparser

You start your program by importing the feedparser module.

import feedparser

Parse the feed. Pass in the URL of the RSS feed that you want.

d = feedparser.parse('http://www.reddit.com/r/python/.rss')

The channel elements are available in d.feed (remember the “RSS Elements” section above).

The items are available in d.entries, which is a list.

The items appear in the list in the same order as in the original feed, so the first item is available in d.entries[0].
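
The layout described above (channel data under “feed”, items in the “entries” list) can be sketched with a small helper. The function name describe_feed is hypothetical, and the dictionary below is a hand-made stand-in for the object that feedparser.parse() returns, so the example runs without network access. It relies only on dictionary-style access, which the parsed result also supports.

```python
def describe_feed(parsed):
    """Summarize a parsed feed: channel title, entry count, and first entry title."""
    feed = parsed.get("feed", {})
    entries = parsed.get("entries", [])
    # Not every feed has entries, so guard the [0] lookup.
    first_title = entries[0].get("title") if entries else None
    return {
        "title": feed.get("title"),
        "entry_count": len(entries),
        "first_entry_title": first_title,
    }


# Stand-in for the result of feedparser.parse(); the values are sample data.
sample_parsed = {
    "feed": {"title": "Python", "link": "http://www.reddit.com/r/Python/"},
    "entries": [
        {"title": "Functional Python made easy with a new library: Funcy"},
        {"title": "Python Packages Open Sourced"},
    ],
}

print(describe_feed(sample_parsed))
```

With a real feed you would pass the parsed object instead, for example describe_feed(d).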

Print the title of the feed

print(d['feed']['title'])
>>> Python

Print the link of the feed (feedparser resolves relative links)

print(d['feed']['link'])
>>> http://www.reddit.com/r/Python/

Print the subtitle of the feed (feedparser parses escaped HTML)

print(d.feed.subtitle)
>>> news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python

See the number of entries

print(len(d['entries']))
>>> 25

Each entry in the feed is a dictionary. Use [0] to print the first entry.

print(d['entries'][0]['title'])
>>> Functional Python made easy with a new library: Funcy

Print the link of the first entry

print(d.entries[0]['link'])
>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/

Use a for loop to print all posts and their links.

for post in d.entries:
    print(post.title + ": " + post.link)
>>> Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/
Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/python_packages_open_sourced/
PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/pyeda_0150_released/
PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/pymongo_263_released/
.....
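
Some feeds leave out a title or a link for an entry, in which case attribute access such as post.title can fail. The sketch below is one possible way to make the loop more forgiving. The function name format_entries and the placeholder strings are my own, and the sample entries are plain dictionaries standing in for d.entries.

```python
def format_entries(entries, limit=None):
    """Return one "title: link" line per entry, tolerating missing fields."""
    lines = []
    # entries[:None] is the whole list, so limit=None means "no limit".
    for post in entries[:limit]:
        title = post.get("title", "(no title)")
        link = post.get("link", "(no link)")
        lines.append("{}: {}".format(title, link))
    return lines


# Stand-in for d.entries; the last entry deliberately has no link.
sample_entries = [
    {
        "title": "PyEDA 0.15.0 Released",
        "link": "http://www.reddit.com/r/Python/comments/1oet5m/pyeda_0150_released/",
    },
    {"title": "An entry without a link"},
]

for line in format_entries(sample_entries):
    print(line)
```

With a real feed, format_entries(d.entries, limit=5) would give the first five posts.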

Report the feed type and version

print(d.version)
>>> rss20

Get full access to all HTTP headers

print(d.headers)
>>> {'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding', 'server': "'; DROP TABLE servertypes; --", 'connection': 'close', 'date': 'Mon, 14 Oct 2013 09:13:34 GMT', 'content-type': 'text/xml; charset=UTF-8'}

Just get the content-type from the header

print(d.headers.get('content-type'))
>>> text/xml; charset=UTF-8
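
The content-type value combines a media type and a character set. If you need the two parts separately, one option is to let the standard-library email package do the splitting. This is a sketch, not something feedparser requires, and the function name split_content_type is hypothetical.

```python
from email.message import Message


def split_content_type(header_value):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # The charset is None when the header does not declare one.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = split_content_type("text/xml; charset=UTF-8")
print(media_type)
print(charset)
```

With a parsed feed, you would pass d.headers.get('content-type') as the argument.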

Using feedparser is an easy and fun way to parse RSS feeds.

Sources

http://www.slideshare.net/LindseySmith1/feedparser

http://code.google.com/p/feedparser/

