In this tutorial, we will scrape xkcd for its brilliant comics and display ten random ones on our website with Flask. Our logic will consist of two parts:

1) HTML parsing with lxml
2) Building a web app with Flask

First, let's create a parsing script.

Parsing XKCD

The first part of any scraping operation is to study the HTML source of our target pages. So open up the page of any comic (this should do) and start reading its source. There are a few things to note:

- There are roughly 1446 xkcd comics at the time of writing.
- The information we want is inside a <div id='comic'> tag.
- This div contains an img tag, which carries src, alt and title attributes.

We are going to use the src and alt attributes of the img tag for our app. So let's get started with the parsing.



Let's fetch a random xkcd URL with requests and load the source with lxml.

    import requests, random
    from lxml import html

    # open a random xkcd comic between 1 & 1446
    source = requests.get("http://xkcd.com/%d" % random.randint(1, 1446)).text
    tree = html.fromstring(source.encode('utf-8'))
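
The one-liner above assumes the request always succeeds. As a hedged sketch, the fetch could be wrapped so that a slow or failing response raises an error instead of being handed to the parser. The helper names `comic_url` and `fetch_comic_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original script.

```python
import requests


def comic_url(number):
    """Build the page URL for one comic number."""
    return "http://xkcd.com/%d" % number


def fetch_comic_html(number, timeout=10):
    """Download one comic page, raising on network errors or HTTP 4xx/5xx."""
    response = requests.get(comic_url(number), timeout=timeout)
    response.raise_for_status()  # turn an error status into an exception
    return response.text
```

Calling `fetch_comic_html(random.randint(1, 1446))` would then return the same page text that the snippet above stores in `source`.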

Now we need to extract the actual image and the short-title for the comic from the source. To do this we will be using XPath.

XPath is used to navigate through elements and attributes in an XML document.

A nice tutorial for XPath can be found here

First off, we will need to build our selectors. Remember that our required div had an id "comic"? So, our selector starts off as

'//div[@id="comic"]'

Next we have an img tag.

'//div[@id="comic"]/img'

Next we want the src and the alt attributes within the img tag.

    # for the src
    '//div[@id="comic"]/img/@src'

    # for the alt
    '//div[@id="comic"]/img/@alt'
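
To see what these selectors return without touching the network, here is a small sketch that runs them against an inline page. The `SAMPLE_PAGE` markup is a hypothetical, simplified stand-in for a real comic page, and the example.com URL is a placeholder.

```python
from lxml import html

# Hypothetical, simplified stand-in for an xkcd comic page.
SAMPLE_PAGE = """
<html><body>
  <div id="comic">
    <img src="http://example.com/sample.png" alt="Sample Comic" title="Hover text">
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# xpath() returns a list, even when only one attribute matches.
img = tree.xpath('//div[@id="comic"]/img/@src')
alt = tree.xpath('//div[@id="comic"]/img/@alt')

# A selector that matches nothing yields an empty list, not an error.
missing = tree.xpath('//div[@id="nothing-here"]/img/@src')

print(img, alt, missing)
```

Because the results are lists, the script has to index them (`img[0]`) to get the actual strings, and an empty list is the signal that a page had no comic image.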

Now that our selectors are complete, let's add them to our Python code.

    import requests, random
    from lxml import html

    # open a random xkcd comic between 1 & 1446
    source = requests.get("http://xkcd.com/%d" % random.randint(1, 1446)).text
    tree = html.fromstring(source.encode('utf-8'))

    img = tree.xpath('//div[@id="comic"]/img/@src')
    alt = tree.xpath('//div[@id="comic"]/img/@alt')

We should add a for loop into this so that we get 10 comics instead of one and add the results to a list.

    import requests, random
    from lxml import html

    comics_list = []

    # open ten random xkcd comics between 1 & 1446
    for i in range(0, 10):
        source = requests.get("http://xkcd.com/%d" % random.randint(1, 1446)).text
        tree = html.fromstring(source.encode('utf-8'))

        img = tree.xpath('//div[@id="comic"]/img/@src')
        alt = tree.xpath('//div[@id="comic"]/img/@alt')

        comics_list.append({
            'img': img[0],
            'alt': alt[0]
        })
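
One thing to be aware of: ten independent `random.randint` calls can occasionally pick the same comic twice. If distinct comics are wanted, a possible variation (not what the script above does) is to draw all ten numbers at once with `random.sample`. The names `LATEST_COMIC`, `comic_numbers` and `comic_urls` are illustrative.

```python
import random

LATEST_COMIC = 1446  # comic count quoted in this tutorial

# random.sample never repeats a value, unlike ten separate randint calls.
comic_numbers = random.sample(range(1, LATEST_COMIC + 1), 10)

# Build the page URLs the loop would then request one by one.
comic_urls = ["http://xkcd.com/%d" % number for number in comic_numbers]
```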

So our parsing script is complete. Let's move on to the Flask Webapp.

Flask Webapp

You can find a detailed tutorial on getting started with Flask here.

First, let's create a directory for our app, create a virtualenv and install Flask, requests and lxml.

    $ mkdir randomxkcd
    $ cd randomxkcd
    $ virtualenv venv
    $ . venv/bin/activate
    $ pip install Flask requests lxml

A typical Flask App Directory structure looks like:

    /application
        >app.py
        /static
            /css
            /js
        /templates
            >index.html
        /venv
        ..........

To create a structure like this, simply run the following commands

    $ mkdir static
    $ mkdir templates
    $ touch app.py
    $ cd templates
    $ touch index.html
    $ cd ..

Now add the Flask app template to app.py

    import os
    from flask import Flask, render_template
    import requests
    import random
    from lxml import html

    app = Flask(__name__)
    app.config.update(
        DEBUG=True,
    )

    @app.route('/')
    def index():
        return 'Hello'

    if __name__ == "__main__":
        port = int(os.environ.get("PORT", 5000))
        app.run(host='0.0.0.0', port=port)

Now run app.py. If all went well, navigating to localhost:5000 should greet you with "Hello".
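
If you would rather check the route without opening a browser, Flask's built-in test client can call it in-process. The snippet below is a minimal sketch that repeats only the route from app.py; the variable names `status` and `body` are illustrative.

```python
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Hello'


# The test client sends the request to the app directly, no server needed.
with app.test_client() as client:
    response = client.get('/')
    status = response.status_code
    body = response.get_data(as_text=True)

print(status, body)
```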

Let's move the parsing logic we wrote earlier into a function parse() in app.py, and add a check so that if a page has no image, we substitute a 404 image.

    import os
    from flask import Flask, render_template
    import requests
    import random
    from lxml import html

    app = Flask(__name__)
    app.config.update(
        DEBUG=True,
    )

    @app.route('/')
    def index():
        return 'Hello'

    def parse():
        comics_list = []
        # open ten random xkcd comics between 1 & 1446
        for i in range(0, 10):
            source = requests.get("http://xkcd.com/%d" % random.randint(1, 1446)).text
            tree = html.fromstring(source.encode('utf-8'))

            img = tree.xpath('//div[@id="comic"]/img/@src')
            alt = tree.xpath('//div[@id="comic"]/img/@alt')
            try:
                img[0]
            except IndexError:
                # no image on this page: fall back to a 404 image
                img.append('http://i0.kym-cdn.com/photos/images/newsfeed/000/178/254/c86.jpg')
                alt.append("Error: Image Not Found.")

            comics_list.append({
                'img': img[0],
                'alt': alt[0]
            })
        return comics_list

    if __name__ == "__main__":
        port = int(os.environ.get("PORT", 5000))
        app.run(host='0.0.0.0', port=port)
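
The fallback can also be written without try/except. The sketch below is one possible variation, not the original code: the helpers `first_or` and `build_entry` are hypothetical names, and unlike parse() they apply the fallback to img and alt independently, so a page with an image but no alt text still gets a placeholder title.

```python
FALLBACK_IMG = 'http://i0.kym-cdn.com/photos/images/newsfeed/000/178/254/c86.jpg'
FALLBACK_ALT = "Error: Image Not Found."


def first_or(values, default):
    """Return the first item of an XPath result list, or default if it is empty."""
    return values[0] if values else default


def build_entry(img, alt):
    """Turn two XPath result lists into the dict stored in comics_list."""
    return {
        'img': first_or(img, FALLBACK_IMG),
        'alt': first_or(alt, FALLBACK_ALT),
    }
```

Inside the loop, `comics_list.append(build_entry(img, alt))` would then replace the try/except and the append that follows it.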

Now pass comics_list to the index.html template. Our Python script app.py should finally look like this:

    import os
    from flask import Flask, render_template
    import requests
    import random
    from lxml import html

    app = Flask(__name__)
    app.config.update(
        DEBUG=True,
    )

    @app.route('/')
    def index():
        # pass comics_list to index.html
        return render_template('index.html', data=parse())

    def parse():
        comics_list = []
        # open ten random xkcd comics between 1 & 1446
        for i in range(0, 10):
            source = requests.get("http://xkcd.com/%d" % random.randint(1, 1446)).text
            tree = html.fromstring(source.encode('utf-8'))

            img = tree.xpath('//div[@id="comic"]/img/@src')
            alt = tree.xpath('//div[@id="comic"]/img/@alt')
            try:
                img[0]
            except IndexError:
                # no image on this page: fall back to a 404 image
                img.append('http://i0.kym-cdn.com/photos/images/newsfeed/000/178/254/c86.jpg')
                alt.append("Error: Image Not Found.")

            comics_list.append({
                'img': img[0],
                'alt': alt[0]
            })
        return comics_list

    if __name__ == "__main__":
        port = int(os.environ.get("PORT", 5000))
        app.run(host='0.0.0.0', port=port)

Finally, we add a very simple index.html file to display the results.
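
The template only needs to loop over `data` and emit one image per comic. As a rough sketch of that idea, kept in Python so it can be run on its own, the example below uses Flask's `render_template_string` with inline markup. The markup in `INDEX_TEMPLATE` is an assumption and not the original index.html, and `fake_parse` is a hypothetical stand-in for parse() so that nothing is downloaded.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Hypothetical markup; in the app this would be the body of templates/index.html.
INDEX_TEMPLATE = """
<!DOCTYPE html>
<html>
  <head><title>Random xkcd</title></head>
  <body>
    {% for comic in data %}
      <h3>{{ comic.alt }}</h3>
      <img src="{{ comic.img }}" alt="{{ comic.alt }}">
    {% endfor %}
  </body>
</html>
"""


def fake_parse():
    """Stand-in for parse() so the sketch runs without network access."""
    return [{'img': 'http://example.com/sample.png', 'alt': 'Sample Comic'}]


@app.route('/')
def index():
    return render_template_string(INDEX_TEMPLATE, data=fake_parse())


# Render the page in-process to inspect the generated HTML.
with app.test_client() as client:
    page = client.get('/').get_data(as_text=True)
```

In the real app the same markup would be saved as templates/index.html and rendered with `render_template('index.html', data=parse())`, as app.py already does.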

Now run python app.py

And you should be greeted with something like this.

Deploying to Heroku

Now we want our Flask app to be hosted online. Heroku is a convenient choice for this. At the time of writing, it provides a free 512 MB dyno and lets us host up to 5 free apps. Our app is not really memory hungry, so 512 MB is more than enough.

First, follow the first two steps in this tutorial to set up Heroku and log yourself in on your machine. Then, in our app directory (randomxkcd), run the following commands to set up a git repo and create a Heroku app.

    git init
    heroku create

Heroku requires a requirements.txt file to see which Python modules it needs to install, and a Procfile to tell it which processes to run. To create the requirements.txt file, run

pip freeze > requirements.txt

and to create the Procfile

touch Procfile

Edit the Procfile to say

web: python app.py
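
This Procfile entry works with the app.py above because Heroku tells a web process which port to bind to through the PORT environment variable, and app.py reads that variable before calling app.run(). A small sketch of that lookup follows; the helper name `port_from_env` and the value 8080 are illustrative.

```python
import os


def port_from_env(default=5000):
    """Read the port from the PORT environment variable, else use default."""
    return int(os.environ.get("PORT", default))


# Without PORT set, as on a development machine, the default is used.
os.environ.pop("PORT", None)
local_port = port_from_env()

# Simulate the platform assigning a port through the environment.
os.environ["PORT"] = "8080"
assigned_port = port_from_env()

print(local_port, assigned_port)
```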

Now rename the heroku app to what you want with

heroku apps:rename uniquenameiwant

I chose randomxkcd so this command becomes

heroku apps:rename randomxkcd #this name is taken by me now.

Now add these files to git and push them to Heroku.

    git add -A
    git commit -m "Original Commit"
    git push heroku master

Your app will now be available at

uniquenameiwant.herokuapp.com #in my case it becomes randomxkcd.herokuapp.com

Find the Source Code on Github

Live Demo on Heroku