Knowing what type of content we’re crawling.

We began by choosing the information we wanted and could realistically extract, such as title, keywords, tags, and post length. We also manually researched the sizes of popular publications and the followings of popular writers. freeCodeCamp has great insights from the top 252 Medium stories of 2016.

How do we extract the right data from all the different parts of a blog post on a page?

However, not all sites store data the same way. Data comes in two forms: structured and unstructured. Structured content, such as RSS, JSON, and XML, can be extracted directly and represented in an ordered way (for example, as a news feed or in a spreadsheet). Unstructured content, like a Medium page, requires a two-step process: extracting data from the HTML, then turning it into structured data.
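To make the difference concrete, here is a small sketch using parsel, the selector library Scrapy is built on (the feed and markup below are made-up examples):

import json
from parsel import Selector  # pip install parsel

# Structured: a JSON feed already maps onto named fields.
feed = json.loads('{"title": "Hello", "tags": ["python", "scrapy"]}')
print(feed["title"])  # -> Hello

# Unstructured: HTML must be parsed first, then turned into structured data.
page = Selector(text='<html><h1 class="title">Hello</h1></html>')
print(page.css("h1.title::text").get())  # -> Hello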

No matter how different the layouts of blogs or publication sites look, the data falls into two categories: structured and unstructured. Now we need to choose a tool to help us build a crawler that will extract this data.

Choosing your library: don’t build from scratch.

If you want to build something quickly like I did, open source tools are great. You can choose from a range of free crawler libraries for different programming languages. Here is a list of libraries you can consider.

This time, I chose Scrapy, as it is a well-documented open source Python library. It also has great community support, so beginners can ask for help easily.

Using Scrapy to crawl data.

Our company loves open source and collaborative frameworks like Scrapy.

Install Scrapy

First, you will need to install Scrapy on your computer. You can follow the guide here to install Scrapy on different platforms such as Windows, Mac OS X, or Ubuntu.
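On most systems, the installation boils down to a single pip command (ideally run inside a virtual environment):

pip install scrapy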

Set up the start project

One of the best ways to get started is to use Scrapy's start project, since it sets up most of the configuration for you. After a successful installation, make a new directory and run the following command. (Or you can read the documentation and set everything up yourself.)

scrapy startproject tutorial
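This generates a project skeleton roughly like the following (file names can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # the folder where your spiders live
            __init__.py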

Go into the spiders folder, create a new file called stories_spider.py, and paste the script below into it:
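(The version below is a minimal sketch; the CSS selectors are assumptions about Medium's archive markup and may need adjusting.)

import scrapy

class StoriesSpider(scrapy.Spider):
    name = "stories"

    # Archive pages listing every post published in a given year.
    start_urls = [
        "https://m.oursky.com/archive/2016",
    ]

    def parse(self, response):
        # Each archive entry is one story preview; the selectors here
        # are assumptions that depend on Medium's current markup.
        for story in response.css("div.streamItem"):
            yield {
                "title": story.css("h3::text").get(),
                "url": story.css("a::attr(href)").get(),
            }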

name : identifies the Spider.

start_urls : the list of URLs you want to crawl. This list is then used by the default implementation of start_requests() to create the initial requests for your spider.

parse() : handles the response downloaded for each of the requests made.
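Once the spider is in place, you can run it from the project root and export the scraped items, for example to a JSON file (stories is the name we gave the spider above):

scrapy crawl stories -o stories.json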

For further details, you can refer to the Scrapy documentation.



In order to crawl the data from Medium, we have to figure out the URLs and the paths to the data, and put them into stories_spider.py.
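A convenient way to explore those paths interactively before committing them to the spider is Scrapy's shell (the selector below is the same assumed one from the sketch above):

scrapy shell "https://m.oursky.com/archive/2016"
>>> response.css("div.streamItem h3::text").getall()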

Study the website: URLs

Now, you need to let the crawler know which site you want to crawl data from, by passing the right URLs to your Scrapy program. Don't worry, you don't have to input every post manually. Instead, use the Archive pages, which list all the posts published within a given time period, and tell Scrapy to start from there.



Medium publications have a page called 'Archive', where you can find the blog posts published in the past few years. For example, the URL for 2016 is https://m.oursky.com/archive/2016.
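If you want several years at once, the archive URLs can be generated programmatically; here is a small sketch (the year range is an assumption, so adjust it to the years the publication actually has):

# Build start_urls for a range of archive years.
base = "https://m.oursky.com/archive"
start_urls = [f"{base}/{year}" for year in range(2014, 2017)]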