Skip the build-up, go straight to the scraping part!

Recently, I had to gather some textual data for a Natural Language Processing project that rates restaurants based on their food, ambience, service and other important factors. You might think, “ooh, it’s like all the other NLP projects where sentiment analysis is performed on comments and a score is generated.” Yes, it’s similar, but there is a catch: I won’t be using comments from the general public, but rather the reviews and ratings of food bloggers.

Now I needed a dataset that met my requirements, so I started surfing the internet, which has a glut of information, but I wasn’t able to find any dataset that fit the problem at hand. There are many datasets available on the internet for various purposes, like:

Reddit Dataset: every Reddit comment that is publicly available
IMDb movie dataset: textual reviews of movies
Film Corpus: a collection of written screenplays

One can use these datasets however they want: to develop a chatbot, for sarcasm detection using NLP (for this, the Reddit dataset is a gem), or for any other application of choice. There are many other online platforms, like Kaggle, Data.world, data.gov, etc., for obtaining various datasets.

After all my futile efforts to find a relevant dataset, I decided to scrape some food blogs myself, as I had read about web scrapers and spiders earlier and had some theoretical knowledge of how they work. I started searching for example articles and videos to gain practical knowledge, and some of them helped a lot in understanding the convoluted concepts behind web scrapers. While there are plenty of theoretical tutorials, only a handful of articles apply those concepts to a practical problem. Many scraping libraries, like BeautifulSoup, Scrapy, etc., are available to help us. For me, BeautifulSoup was the easiest to understand, so I started with it.

Setting Up!

There are a handful of libraries you will need to install beforehand:

BeautifulSoup
Pandas

To install these on Linux, run the following command:

pip3 install -U beautifulsoup4 pandas
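A quick way to confirm the installation worked is to import both libraries and print their versions:

```python
# Sanity check: if these imports succeed, the installation worked.
import bs4
import pandas as pd

print("BeautifulSoup version:", bs4.__version__)
print("pandas version:", pd.__version__)
```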

I am using Jupyter Notebooks, as it is more convenient to have live output of the code, and it also offers many other features. You can work with any editor or IDE of your choice, but keep in mind that we are working with Python 3, and some of the libraries may not behave the same in Python 2.

Disclaimer: I won’t be showing the name of the website I am scraping, as there are legal implications, and there is a thin gray line between collecting information and stealing it.

Finally, let’s CODE:

This is the whole script. If you don’t want to go through the explanation, just copy and paste it and it will work.
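The embedded script itself does not survive in this copy of the article. As a stand-in, here is a minimal sketch of a scraper along the lines described: parse a page with BeautifulSoup and collect the results into a pandas DataFrame. The CSS class names (`review-card`, `rating`, `text`) and the example URL are hypothetical placeholders, not the author’s actual selectors, since the article deliberately does not name the site being scraped:

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Extract (restaurant, rating, review) rows from a page's HTML.
    The class names below are hypothetical placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="review-card"):
        rows.append({
            "restaurant": card.find("h2").get_text(strip=True),
            "rating": float(card.find("span", class_="rating").get_text(strip=True)),
            "review": card.find("p", class_="text").get_text(strip=True),
        })
    return rows

# In a real run you would fetch each page first, e.g.:
#   import requests
#   html = requests.get("https://example-food-blog.com/reviews").text
# Here we use an inline sample so the sketch runs on its own.
sample = """
<div class="review-card">
  <h2>Cafe Alpha</h2><span class="rating">4.5</span>
  <p class="text">Great ambience, slow service.</p>
</div>
"""

df = pd.DataFrame(parse_reviews(sample))
print(df)
```

The pattern is the same regardless of the site: inspect the page in your browser’s developer tools, find the tags and classes that wrap the data you want, and translate those into `find`/`find_all` calls.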

Let’s break this down: