Warning! In this post I use the Project Euler site as an example; however, it seems this method no longer works with that site. The PE site was updated recently and something has changed. The method described below may still work well with other sites.

Update (20111108): If you want to scrape the Project Euler site, check out Part 3 of this series.

In Part 1 we showed how to download a cookie-protected page with Python + wget. First, the cookies of a given site were extracted from Firefox's cookies.sqlite file and stored in a plain-text file called cookies.txt. Then this cookies.txt file was passed to wget, which fetched the protected page.

The solution above works, but it has some drawbacks. First, an external command (wget) is called to fetch the web page. Second, the extracted cookies must be written to a file for wget.

In this post we provide a clean, pure-Python solution: the extracted cookies are never written to the file system, and the pages are downloaded with a module from the standard library.

Step 1: extracting cookies and storing them in a cookiejar

On Guy Rutenberg's blog I found a post that explains this step. Here is my slightly refactored version:

#!/usr/bin/env python

import os
import sqlite3
import cookielib
import urllib2

COOKIE_DB = "{home}/.mozilla/firefox/cookies.sqlite".format(home=os.path.expanduser('~'))
CONTENTS = "host, path, isSecure, expiry, name, value"
COOKIEFILE = 'cookies.lwp'  # the path and filename that you want to use to save your cookies in
URL = 'http://projecteuler.net/index.php?section=statistics'

def get_cookies(host):
    # LWPCookieJar is a subclass of FileCookieJar that has useful load and save methods
    cj = cookielib.LWPCookieJar()
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cur.execute(sql)
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],
                             None, False,
                             item[0], item[0].startswith('.'), item[0].startswith('.'),
                             item[1], False,
                             item[2],
                             item[3], item[3] == "",
                             None, None, {})
        cj.set_cookie(c)
    return cj

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    for index, cookie in enumerate(cj):
        print index, ':', cookie
    #cj.save(COOKIEFILE)  # save the cookies if you want (not necessary)

if __name__ == "__main__":
    main()
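The listing above is Python 2; cookielib and urllib2 later became http.cookiejar and urllib.request in Python 3. As a rough modernized sketch of the same extraction (the function name and the parameterized query are my own choices; you still have to supply the path to your Firefox profile's cookies.sqlite):

```python
import sqlite3
import http.cookiejar

def cookies_from_sqlite(db_path, host):
    # Build an LWPCookieJar from a Firefox cookies.sqlite database.
    cj = http.cookiejar.LWPCookieJar()
    con = sqlite3.connect(db_path)
    try:
        cur = con.cursor()
        # Parameterized query instead of string formatting (avoids SQL injection).
        cur.execute("SELECT host, path, isSecure, expiry, name, value "
                    "FROM moz_cookies WHERE host LIKE ?", ('%' + host + '%',))
        for chost, path, is_secure, expiry, name, value in cur.fetchall():
            c = http.cookiejar.Cookie(
                0, name, value,             # version, name, value
                None, False,                # port, port_specified
                chost,                      # domain
                chost.startswith('.'),      # domain_specified
                chost.startswith('.'),      # domain_initial_dot
                path, False,                # path, path_specified
                bool(is_secure),            # secure
                expiry, False,              # expires, discard
                None, None, {})             # comment, comment_url, rest
            cj.set_cookie(c)
    finally:
        con.close()
    return cj
```

The http.cookiejar.Cookie constructor takes the fields positionally, exactly as in the Python 2 version; only the module names change.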

Step 2: download the protected page using the previously filled cookiejar

Now we need to download the protected page:

def get_page_with_cookies(cj):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    theurl = URL  # an example url that sets a cookie; try different urls here and see the cookie collection you can make!
    txdata = None  # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
    #params = {}
    #txdata = urllib.urlencode(params)
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent; some websites (like Google) don't like automated exploration
    req = urllib2.Request(theurl, txdata, txheaders)  # create a request object
    handle = urllib2.urlopen(req)  # and open it to return a handle on the url
    return handle.read()
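The Python 3 counterpart uses urllib.request in place of urllib2. A minimal sketch under that assumption (the helper names and the User-Agent string are my own; pass in the cookiejar built in step 1 and the URL you want):

```python
import http.cookiejar
import urllib.request

def make_opener(cj):
    # An opener that sends (and collects) cookies from the given jar.
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

def make_request(url):
    # Fake a browser User-Agent; some sites refuse obviously automated clients.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
    return urllib.request.Request(url, None, headers)

def get_page_with_cookies(cj, url):
    # Open the request through the cookie-aware opener and return the body.
    with make_opener(cj).open(make_request(url)) as handle:
        return handle.read()
```

Building a dedicated opener and passing it around (rather than calling install_opener, which changes global state) makes it easier to use several cookiejars side by side.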

See the full source code here. This code is also part of my jabbapylib library (see the “web” module). For one more example, see this project of mine, where I had to download a cookie-protected page.

Resources used

What’s next

In Part 3 we show how to use Mechanize and Splinter (two programmable browsers) to log in to a password-protected site and get the HTML source of a page.