Beautiful Soup is a popular Python library for extracting data from HTML and XML files. It works with the parser of your choice (for example html.parser, lxml or html5lib) and lets you navigate, search and modify the parse tree.

In this series, we will first learn the basics of Beautiful Soup, and at the end we will work on a demo project.



Installing Beautiful Soup:-

To install Beautiful Soup on a Windows machine, use the pip command below.

[code]pip install beautifulsoup4[/code]
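A quick way to confirm that the install worked is to import the package and parse a tiny snippet. This is only a sanity check; the markup string is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version of the bs4 package
print(bs4.__version__)

# Parse a one-tag document with Python's built-in parser
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If both lines print without an ImportError, the library is ready to use.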

Download an HTML document and display it with Beautiful Soup:-

To download an HTML document and view it nicely formatted, let's first import BeautifulSoup and the urllib.request module into the project.

[python]
from bs4 import BeautifulSoup
import urllib.request
[/python]

We are going to use the prettify() method to print the HTML document to the console.

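[python]
__author__ = 'WP8Dev'

from bs4 import BeautifulSoup
import urllib.request


def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    # Download the page; urlopen returns a file-like response object
    pageSource = urllib.request.urlopen(testUrl)
    # Name the parser explicitly so the result is the same on every machine
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    # prettify() returns the whole document as an indented string
    print(soupPKG.prettify())


if __name__ == "__main__":
    main()
[/python]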
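If you want to try prettify() without a network connection, the same call works on a string. The sketch below uses a made-up document (html_doc is an assumption, not the page downloaded above) and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# A small hand-written document standing in for a downloaded page
html_doc = "<html><body><h1>About us</h1><p>Hello</p></body></html>"

soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with each tag and text node on its own line
pretty = soup.prettify()
print(pretty)
```

Each nested tag ends up on its own indented line, which makes the structure of a dense page much easier to read.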




Basics of Beautiful Soup:-

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.

But you’ll only ever have to deal with about four kinds of objects:

– Tag

– NavigableString

– BeautifulSoup

– Comment
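To see the four kinds of objects side by side, here is a minimal sketch that parses one short, made-up snippet and prints the type of each piece. The markup and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# One paragraph containing plain text followed by an HTML comment
markup = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# The two children of <p>: the text node and the comment
children = list(soup.p.children)

soup_type = type(soup).__name__            # the whole document
tag_type = type(soup.p).__name__           # the <p> element
text_type = type(children[0]).__name__     # the text "Hello"
comment_type = type(children[1]).__name__  # the comment

print(soup_type, tag_type, text_type, comment_type)
```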

Tag:-

A Tag object corresponds to an XML or HTML tag in the original document:

e.g.

[python]
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag)
# <b class="boldest">Extremely bold</b>
[/python]

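Getting the attributes of a tag: you can access a tag's attributes by treating the tag like a dictionary.

A single attribute (class can hold several values, so Beautiful Soup returns it as a list, here ['boldest'])

[code]tag['class'][/code]

To get all the attributes as a dictionary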

[code]tag.attrs[/code]
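The dictionary-style access can be sketched as follows. The id attribute is added to the example tag here purely for illustration, to contrast a single-valued attribute with class.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

tag_id = tag["id"]         # single-valued attribute: a plain string
tag_class = tag["class"]   # class is multi-valued: a list
missing = tag.get("title") # get() returns None instead of raising KeyError
print(tag_id, tag_class, missing)
print(tag.attrs)           # every attribute as one dictionary

# Attributes can be changed or removed like dictionary entries
tag["id"] = "headline"
del tag["class"]
print(tag)
```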

Multi-valued attributes

[python]
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ['body', 'strikeout']
[/python]
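As a small sketch of what this means in practice: class comes back as a list, while an attribute such as id is kept as one string even if it contains a space. The id value below is an assumption added for contrast.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = css_soup.p

classes = p["class"]     # multi-valued: split on whitespace into a list
element_id = p["id"]     # not multi-valued: left as a single string

# Check for one class, or rebuild the original attribute text
has_strikeout = "strikeout" in classes
class_text = " ".join(classes)

print(classes, element_id, has_strikeout, class_text)
```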

NavigableString:-

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.

[python]
# On Python 3, str() replaces the Python 2 unicode() built-in
unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
[/python]
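The difference between a NavigableString and an ordinary string can be sketched like this, reusing the bold tag from the earlier example. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
text = soup.b.string

# A NavigableString is a str subclass that still knows where it sits in the tree
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)
parent_name = text.parent.name

# Converting with str() gives a plain string with no link back to the tree
plain = str(text)
plain_is_exact_str = type(plain) is str

print(is_navigable, is_str, parent_name, plain_is_exact_str)
```

Converting to a plain string is useful when you want to keep the text around after you are done with the parsed document.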

Comments and other special strings:-



[code]markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
print(comment)
# Hey, buddy. Want to buy a used parser?[/code]
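A Comment is a special kind of NavigableString, so it can be picked out by its type. The following is a minimal sketch, on made-up markup, of collecting every comment in a document and then stripping them out.

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Match every string node whose type is Comment
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [c.strip() for c in comments]
print(comment_texts)

# extract() removes each comment from the tree
for c in comments:
    c.extract()
print(soup)
```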

BeautifulSoup:-



The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object.

Let's build a simple program that gets all the links on the page using find_all('a').

[python]
from bs4 import BeautifulSoup
import urllib.request


def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    pageSource = urllib.request.urlopen(testUrl)
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    # print(soupPKG.prettify())
    # find_all("a") returns every <a> tag in the document
    for link in soupPKG.find_all("a"):
        print(str(link))


if __name__ == "__main__":
    main()
[/python]
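The program above prints each whole tag. Often you only want the link target, so here is a small sketch that pulls out the href values. It runs on a made-up snippet (html_doc is an assumption) so it works offline.

```python
from bs4 import BeautifulSoup

# Hand-written markup standing in for a downloaded page
html_doc = """
<ul>
  <li><a href="/about-us/">About</a></li>
  <li><a href="https://example.com/blog">Blog</a></li>
  <li><a name="top">No href here</a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# get_text() returns the visible text of each link
texts = [a.get_text(strip=True) for a in soup.find_all("a")]

print(hrefs)
print(texts)
```

Filtering with href=True avoids a KeyError on anchors that have no href attribute.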

Now that we have the basics of Beautiful Soup, in the next tutorial we will learn more about its usage and create a demo project to scrape a website.