Looking closely, we can see that the element with the class “ratings-bar” contains the movie's rating. If we inspect other movies on the page, we find that they all use the same class name for their ratings. That gives us a pattern for extracting every rating on the page. Similarly, we can extract the summary, title, genre, etc.

Besides the class, you can also select a specific part of the HTML using its id, tag name, etc.

Let’s jump into the code!

BeautifulSoup allows us to extract data (more precisely, parse data) from HTML using the class name, id, tags, etc. Isn’t it Beautiful? :-D

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
# response_text -> The downloaded webpage
# lxml -> Used for processing HTML and XML pages
soup = BeautifulSoup(response_text, 'lxml')

To select content from the page we use CSS selectors. CSS selectors allow us to select classes, ids, tags, and other elements easily. The selector prefix for a class is “.” and for an ID is “#”: to select a class, we prefix “.” to the class name we want to extract, and similarly, for an ID we prefix “#”.

# As we saw, the rating's class name was "ratings-bar"
# We prefix "." since it's a class
rating_class_selector = ".ratings-bar"

# Extract all the elements with the ratings class
rating_list = soup.select(rating_class_selector)
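To make the selector syntax concrete, here is a minimal sketch run against a tiny hand-written HTML snippet. The snippet, the id "main", and the class "note" are made up for illustration, and the sketch uses Python's built-in "html.parser" instead of lxml so that it runs without any extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration only
sample_html = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <p>third</p>
</div>
"""

# "html.parser" is the parser bundled with Python (an assumption made so the sketch needs no lxml)
demo_soup = BeautifulSoup(sample_html, "html.parser")

by_class = demo_soup.select(".note")  # every element with class "note"
by_id = demo_soup.select("#main")     # the element with id "main"
by_tag = demo_soup.select("p")        # every <p> tag

print(len(by_class), len(by_id), len(by_tag))  # 2 1 3
```

In every case select() returns a list, even when only one element matches.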



This “rating_list” is a list of objects, one for each <div> element that has “ratings-bar” as its class name. We need to get the text from within each div element.

Here’s what a single rating object looks like:

<div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>10.0</strong>
    </div>
    ...
</div>

We need to get the rating value from the <strong> tag. We can find a tag using the find('tagName') method and get its text using getText().

# This list will store all the ratings
ratings = []

# Iterate through all the rating objects
for rating_object in rating_list:
    # Find the <strong> tag and get the text
    rating_text = rating_object.find('strong').getText()
    # Append the rating to the list
    ratings.append(rating_text)

print(ratings)
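One caveat: find('strong') returns None when a block contains no <strong> tag, and calling getText() on None raises an AttributeError. A defensive variant might look like the sketch below. It runs on hand-written HTML that imitates the structure shown earlier; the second block, which has no rating, is an invented case, and the built-in "html.parser" is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the page structure.
# The second block is an invented case with no <strong> tag.
sample_html = """
<div class="ratings-bar"><div><strong>8.5</strong></div></div>
<div class="ratings-bar"><div></div></div>
<div class="ratings-bar"><div><strong>7.1</strong></div></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

safe_ratings = []
for rating_object in sample_soup.select(".ratings-bar"):
    strong_tag = rating_object.find("strong")
    if strong_tag is None:
        # No rating found: keep a placeholder so the list stays aligned with the movies
        safe_ratings.append(None)
    else:
        # Convert the text to a number so it is ready for analysis
        safe_ratings.append(float(strong_tag.get_text(strip=True)))

print(safe_ratings)  # [8.5, None, 7.1]
```

Keeping a None placeholder, rather than skipping the movie, means the ratings list still lines up with any list of titles or genres extracted from the same page.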

And we are done. Similarly, you can extract the titles, summaries, and genres using the same method with the appropriate class and tag names.
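As a rough sketch of what that could look like, the example below pulls titles and genres out of a hand-written snippet. The class names "item-title" and "item-genre" are hypothetical, invented for this illustration; inspect the real page to find the actual class names before adapting it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "item-title" and "item-genre" are invented class names,
# not the ones used by any real site.
sample_html = """
<div class="item">
  <h3 class="item-title"><a href="#">Movie A</a></h3>
  <span class="item-genre"> Drama, Crime </span>
</div>
<div class="item">
  <h3 class="item-title"><a href="#">Movie B</a></h3>
  <span class="item-genre"> Comedy </span>
</div>
"""

# Built-in parser, chosen so the sketch runs without lxml
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The title text sits inside an <a> tag within each title element
titles = [h.find("a").get_text(strip=True) for h in sample_soup.select(".item-title")]

# strip=True removes the surrounding whitespace from the genre text
genres = [g.get_text(strip=True) for g in sample_soup.select(".item-genre")]

print(titles)  # ['Movie A', 'Movie B']
print(genres)  # ['Drama, Crime', 'Comedy']
```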

You can store the data in a CSV or Excel file and use it for your machine learning model.
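For the CSV case, a minimal sketch using Python's standard csv module is shown below. The two lists are example values standing in for scraped data, and the file name "movies.csv" is an arbitrary choice for this illustration.

```python
import csv

# Example values standing in for the scraped lists (invented for illustration)
titles = ["Movie A", "Movie B"]
ratings = ["8.5", "7.1"]

# "movies.csv" is an arbitrary file name chosen for this sketch
with open("movies.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["title", "rating"])    # header row
    writer.writerows(zip(titles, ratings))  # one row per movie
```

Passing newline="" when opening the file is what the csv module's documentation recommends, since it prevents blank lines from appearing between rows on some platforms.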

The full code is available on my GitHub:

Additional Readings: