How to Scrape Data from a Website Using Python

Web scraping (sometimes loosely called data mining) is the technique of downloading the data present on a specific web page. There are hundreds of tutorials on “how to scrape data from a website using python” on the web, but I remember that the first time I searched for a good tutorial, none of them really helped me understand the simple concepts behind it.

So here I will try to explain scraping for absolute beginners. First and foremost, this article is for educational purposes only: scrape data without slamming the servers, and note that I can’t help you scrape content that you have to pay for.

Tools I’m using:

Python (I am currently using Python 3 but have also worked with 2.7)

A few libraries that have to be downloaded; I will go through them step by step

Linux (Ubuntu) at the moment, but both Windows and Mac will do just fine

First, find a website you want to scrape data from. Here, let’s take IMDb as an example.

For the test scenario, let’s say we want to grab the names of the highest-voted movies released in 2018, along with their ratings, so that we can create a watch list. The libraries required are ‘requests’ and ‘BeautifulSoup’.

To install these libraries we need ‘pip’, which is a package manager for Python. If you had not yet installed Python I don’t think you’d have read this far, so let’s move on: there are a couple of ways to install pip on your system, depending on the OS you are currently on.

Linux Users : Go to this Link

Windows Users : Go to this Link

Mac Users : Go to this Link

While testing Python code, use IPython, which makes learning easier and has an auto-complete feature on the Tab key. Install it by typing “pip install ipython” in your terminal.

Now install the required libraries, starting with requests: “pip install requests”

Also run “pip install beautifulsoup4” (this is the package that provides the bs4 module used below) and “pip install html5lib”
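If you want to confirm that the installs worked before going further, a small check like the one below can help. It is only a sketch: the function name missing_modules and the list REQUIRED_MODULES are made up for this illustration, and it uses nothing but Python’s standard importlib module. Note that it looks for import names, so BeautifulSoup shows up as bs4.

```python
import importlib.util

# Import names, not pip package names: the bs4 module is what
# "from bs4 import BeautifulSoup" loads.
REQUIRED_MODULES = ["requests", "bs4", "html5lib"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All libraries are installed.")
```

Save it to a file and run it with python; anything listed as missing needs another “pip install”.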

Now open the IPython shell by typing “ipython” in your terminal, and try importing the installed libraries

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

With both imported we are halfway there: we now have the tools required to scrape data. First, navigate to the webpage we want to scrape (here we are taking the IMDb link as reference).

Here we can see the zoomed-out version of the webpage. Before we scrape the data we need, the first thing to understand is that no matter how a website looks, every website follows the same basic structure:

<html>
  <head>
  </head>
  <body>
    All Contents
  </body>
</html>
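To see this nesting from Python, here is a minimal sketch that walks the skeleton above and prints it as an indented outline. It uses the standard library’s html.parser instead of BeautifulSoup, so that it runs with no extra installs. The names PAGE, OutlineParser and outline are made up for this illustration, and the depth counting assumes every tag is properly closed, which real pages do not always do.

```python
from html.parser import HTMLParser

# The same skeleton as above, as a Python string.
PAGE = """<html>
<head>
</head>
<body>
All Contents
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Collect an indented outline of the tags and text in a page."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open tags we are currently inside
        self.lines = []  # one outline entry per opening tag or text chunk

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between tags
            self.lines.append("  " * self.depth + text)


def outline(html):
    """Return the outline of an HTML string as a list of lines."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    print("\n".join(outline(PAGE)))
```

The text “All Contents” ends up indented under body, which is the point: everything you see on a page lives somewhere inside this tree, and scraping comes down to finding the right branch.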