Ten-second takeaway: Learn how to create your own dataset with simple web scraping that mines the entire Metacritic website for game reviews, using Python's Beautiful Soup and hosted for free on a Google Cloud Platform (GCP) micro instance (always-free tier).

Aspiring data scientists are often confused about the first step to take after learning the theory. There are very few places where they can apply their accrued knowledge. Sure, there are plenty of datasets available, but the free ones never give you pragmatic insight into solving actual problems, or they are sometimes too small to use for deep learning applications.

One way to get robust datasets is to pay for them or enrol in expensive courses; another is web scraping. Here I'll show you how to scrape large datasets using Python for free!

Why did I opt for web scraping and Google Cloud?

Data gets stale: When you scrape data, you potentially get the latest data on any topic. While you can get robust datasets from Kaggle, if you want to create something fresh for yourself or your company, scraping is the way to go. For example, if you want to build a price recommendation engine for shoes, you want the latest trends and prices from Amazon, not two-year-old data.

Customisable: You can tailor the code to get only the data you need from any source you want.

Why not local? With big players like Google and Amazon in the cloud computing market, it's very cheap to rent a PC for a few hours. They also give you a free tier, which is perfect for something simple like web scraping. GCP is slightly cheaper and gives you $300 in credit to start with, so I went with GCP. Also, I didn't want my IP to get blocked (heh).

Fun: Well, this is my idea of a Friday night!

In my case, I did not find a good dataset for game reviews that was fairly new. As Metacritic has the largest game repository and is updated fairly regularly, I decided to go with that.

Getting Started

All you need to do is iterate over a list of URLs, identify the containers for the data, extract the data, and store it in a CSV.
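At a high level, the flow looks like this (a sketch; the helper names are placeholders that the numbered steps below replace with real code):

for page_num in range(num_pages):
    html = fetch_page(base_url + str(page_num))   # step 3: making URL requests
    rows = extract_game_data(html)                # step 4: extracting data
    append_to_csv(rows)                           # step 5: saving data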

1. Libraries used

import urllib2
import csv
from bs4 import BeautifulSoup
import pandas as pd

urllib2: Our library for making URL requests.

csv: Library to store data in a CSV.

bs4: The Beautiful Soup library that makes extracting data from a webpage very easy.

pandas: Stores the data in a nice tabular format.
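urllib2 and csv ship with Python 2's standard library; the other two can be installed with pip:

pip install beautifulsoup4 pandas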

2. Understanding the flow of the website

Metacritic's layout is pretty simple. All the data is structured as follows:

http://www.metacritic.com/browse/games/release-date/available/pc/metascore?view=detailed&page=1

Let's break it down:

http://www.metacritic.com/browse/: the domain.

games: the subsection; it can be replaced by movies or music for other subsections.

available/pc/: this part gives the data for PC. Change this to ps4 for data on PS4 games.

metascore: this gives the rankings by Metascore; we can change this to user_rating to get the rankings by user rating.

view=detailed: the view type; we choose detailed as it contains more data, like genre and maturity rating.

page=x: the page number x. In case the page number doesn't exist, the site returns a blank template page with no data and doesn't throw an error.
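Putting these pieces together, you can assemble the URL programmatically. A small sketch (the variable names are mine, not from the original code):

platform = "pc"          # or "ps4" for PS4 games
ranking = "metascore"    # or "user_rating" for rankings by user rating
page_num = 1
url = ("http://www.metacritic.com/browse/games/release-date/available/"
       + platform + "/" + ranking + "?view=detailed&page=" + str(page_num))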

Next, we decide on the HTML elements that hold the data. For this we use the Inspect tool in Chrome: click the element-selection button, then highlight the subsection to get the HTML element and its class.

Use the inspect button at the top right (circled), then highlight the area you want the HTML element for.

The release date is in an li element with the class stat release_date.

Now that we know which elements we need, let's go ahead and extract them.

3. Making URL requests

metacritic_base = "http://www.metacritic.com/browse/games/release-date/available/pc/metascore?view=detailed&page="
hdr = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'User-Agent': "Magic Browser"}
filepath = '/Users/rra/Downloads/'

for i in range(0, 54):
    metacritic = metacritic_base + str(i)
    page = urllib2.Request(metacritic, headers=hdr)
    content = urllib2.urlopen(page).read()

Metacritic has a simple site layout with a static URL where only the page number changes from page to page.

We use urllib2.Request to build the request for the page and urllib2.urlopen to read the page data.

Tip: There are 53 pages, so I set my counter to max out at 54; however, you can simply wrap all of this in a try/except that exits on encountering an error, or stop when a page comes back empty, as sketched below.
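Since a too-large page number returns a blank template rather than an error, one way to avoid hard-coding the page count is to stop when a page contains no game containers. A minimal sketch, assuming the product_wrap divs introduced in the next section:

i = 0
while True:
    page = urllib2.Request(metacritic_base + str(i), headers=hdr)
    content = urllib2.urlopen(page).read()
    soup = BeautifulSoup(content, 'html.parser')
    if not soup.find_all('div', class_='product_wrap'):
        break  # blank template page: we've run out of results
    # ... extract and save data here ...
    i += 1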

4. Extracting data

We then parse the page content:

soup = BeautifulSoup(content, 'html.parser')
right_class = soup.find_all('div', class_='product_wrap')

for item in right_class:
    try:
        link = item.find('h3', class_="product_title").find("a")
        g = link.get('href')
    except:
        g = ''
    try:
        score = item.find("span", class_="metascore_w")
        s = score.text
    except:
        s = ''
    try:
        dt = item.find("li", class_="release_date").find("span", class_="data")
        d = dt.text
    except:
        d = ''
    try:
        rating = item.find("li", class_="stat maturity_rating").find("span", class_="data")
        r = rating.text
    except:
        r = ''
    try:
        pub = item.find("li", class_="stat publisher").find("span", class_="data")
        p = pub.text
    except:
        p = ''
    try:
        genre = item.find("li", class_="stat genre").find("span", class_="data")
        gr = genre.text
    except:
        gr = ''
    try:
        user_score = item.find("span", class_="textscore")
        u = user_score.text
    except:
        u = ''

We use BeautifulSoup(content, 'html.parser'), which does all the heavy lifting of parsing the vast amount of HTML.

As we saw in the previous section, each game's data sits in a div with the class product_wrap. So we extract all such divs and iterate over each one to get to the data. Here we store the following:

g: game name (taken from the link's href)

s: metascore

d: release date

p: publisher

r: maturity rating

u: user rating

gr: genre

Tip: HTML is unreliable, so it's better to use try/except with each extraction.
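If you'd rather not repeat that boilerplate, the pattern can be wrapped in a small helper (a hypothetical convenience function, not part of the original code):

def safe_find_text(item, tag, cls):
    # hypothetical helper: return the tag's text, or '' if it's missing
    try:
        return item.find(tag, class_=cls).text
    except AttributeError:
        return ''

s = safe_find_text(item, "span", "metascore_w")  # same as the metascore block above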

5. Saving data

# inside the loop over product_wrap divs
game = [g, s, d, r, p, gr.strip(), u]
df = pd.DataFrame([game])
with open(filepath + 'gamenames.csv', 'a') as f:
    df.to_csv(f, header=False, index=False, quoting=csv.QUOTE_NONNUMERIC, sep="|")

We use pandas to convert the list of lists into a table and then append the row to a CSV. Here we use | as the delimiter because the genre column contains commas.

6. Running the code on Google Cloud

I'm assuming you know how to set up a GCP account. If not, please follow this blog post on how. You need to create an instance as follows.

Creating a GCP instance

Once it is running, you need to install Python on it and copy the code from your local machine to the instance. Here spheric-crow is the project name and instance-2 is the instance name. You also need to specify the zone to Google.

Tip: Remember to update your Ubuntu packages once you SSH into the instance.
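On a stock Ubuntu image, that's simply:

sudo apt-get update && sudo apt-get upgrade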

gcloud compute --project "spheric-crow" ssh --zone "us-east1-b" "instance-2"
gcloud compute scp scrap.py instance-1:scrap

scp copies the file from the local machine to the GCP instance. Next, you need to install the aforementioned libraries on your instance. Then install byobu, a text-based window manager; it will keep the session intact even if your SSH connection breaks.

sudo apt-get install byobu
byobu

The byobu interface: you can create separate tabs and run multiple scripts at the same time.

Finally, run the code using the following command and you are done.

python scrap.py

Tip: You can scp the code across or simply git pull it from my GitHub account.

7. Retrieving the mined data

You can get the mined data from the cloud using scp again, and now you have a really cool dataset to play with!

gcloud compute scp instance-1:scrap/game_review.csv /Users/

You can then access the data using Python:

import pandas as pd

df = pd.read_csv("/metacritic_scrap/gamenames.csv", sep="|")
df.columns = ["game", "metascore", "release_date", "maturity_rating", "publisher", "genre", "user_rating"]

and your dataset will look like this:

See how nicely structured this is!

Tip: You can automate this too!
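For instance, a cron job on your local machine could pull the fresh CSV down every night (an illustrative sketch; the schedule and paths are placeholders):

# crontab entry (sketch): copy the CSV from the instance at 2am every day
0 2 * * * gcloud compute scp instance-1:scrap/game_review.csv /Users/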

Tips and Tricks

1. Be polite: Most sites frown upon you mining their content, as it puts a lot of pressure on their servers. Try to avoid making too many requests in a short time, and read the robots.txt file. Here are a couple of tricks to get as few errors as possible:

Sleep: Include a sleep call that delays the code by a few seconds before the next request is made. I prefer to use sleep(randint(a,b)), i.e. a random integer instead of a fixed value.

from random import randint
from time import sleep

# pause for 20-100 seconds randomly
sleep(randint(20, 100))

User Agent: This is a string that a browser or app sends to each website you visit. We use a user agent generator to trick the website into thinking that the requests are coming from different browsers. Since many institutions share the same IP, we don't usually risk getting a too-many-requests error this way. Here is a list of popular user agents you can use.

from user_agent import generate_user_agent
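You can then generate a fresh user agent for each request instead of the static "Magic Browser" header used earlier:

# swap the fixed User-Agent for a freshly generated one per request
hdr = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'User-Agent': generate_user_agent()}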

VPNs and Proxies: If you are mining a large dataset, chances are you will eventually get a too-many-requests error; in my case, around every 2,000 pages. To counter that, you can rotate a few proxies and get ~5k pages per instance run. You can get some free proxies here.
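Here is a minimal proxy-rotation sketch using urllib2's ProxyHandler (the proxy addresses are placeholders):

from random import choice

proxies = ['203.0.113.1:8080', '203.0.113.5:3128']  # placeholder addresses

# route the request through a randomly chosen proxy
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': choice(proxies)}))
content = opener.open(metacritic, timeout=30).read()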

2. Caught in a Captcha: Some websites really don't want you scraping their data, so they put a captcha in place. If it's a simple 4-5 character alphanumeric captcha, you can try to crack it using Python Tesseract and this technique. If it's the Google reCAPTCHA, you have to solve it manually each time, just like these guys.
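A tiny Python Tesseract sketch for the simple case (assumes the tesseract binary plus the pytesseract and Pillow packages are installed; real captchas usually need image cleanup first):

import pytesseract
from PIL import Image

# OCR a saved captcha image; this only works on very clean alphanumeric captchas
print(pytesseract.image_to_string(Image.open('captcha.png')))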

3. Exception Handling: HTML is very unreliable, and a site may not follow a strict pattern all the time, so the best practice is to wrap each element extraction in a try:except statement.

4. Saving periodically: Web scraping is risky business, and if you don't save data regularly you risk losing the entire dataset mined so far. A simple solution is to save data regularly to a CSV (like I do) or to SQLite (like the smart people do). An introduction to SQLite can be found here.
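For the SQLite route, a minimal sketch of the same save step (the table and column names are mine, not from the original code):

import sqlite3

conn = sqlite3.connect('games.db')
conn.execute("""CREATE TABLE IF NOT EXISTS games
                (game TEXT, metascore TEXT, release_date TEXT,
                 maturity_rating TEXT, publisher TEXT,
                 genre TEXT, user_rating TEXT)""")
# inside the scraping loop: insert one row and commit immediately,
# so a crash loses at most the current page
conn.execute("INSERT INTO games VALUES (?,?,?,?,?,?,?)",
             (g, s, d, r, p, gr.strip(), u))
conn.commit()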

What’s next?

We can use the same code to mine Metacritic for loads of other content just by changing the base URL (see the sketch after this list):

movie reviews/ratings

tv show reviews/ratings

music reviews/rating
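For example, pointing the scraper at movies is just a one-line change (a sketch; I haven't verified the exact movie browse path):

# hypothetical: point the scraper at movies instead of games
metacritic_base = "http://www.metacritic.com/browse/movies/release-date/available/metascore?view=detailed&page="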

I extended my code to mine all ~100k user reviews for ~5k PC games, and as I mentioned, I will be using them for a game recommendation engine (blog posts to follow). If you want a slice of the action, email me and I'll send you a part of the dataset!

Metacritic's site layout is pretty easy to follow and can be covered by a simple for loop, but for mining more complicated sites like Amazon, we use a browser automation tool called Puppeteer, which simulates clicks to generate the next page, and so on.

Check out this great blog by Emad Eshan on how to use Puppeteer.

The entire code for this blog can be found on my git. The entire code for the user review scraping can also be found here.

In the next blog, I’ll be doing some data wrangling and deep learning on this great dataset. Stay tuned!

This is my first post so if you liked it, do comment and clap :)