About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>
# <html lang="en">
# <head>
# <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
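To see what params and headers amount to on the wire, here is a minimal sketch using only the standard library. It builds the equivalent request without sending it. The build_request helper is hypothetical and is not how gazpacho is implemented internally; it only illustrates that params end up in the query string and headers travel with the request.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_request(url, params=None, headers=None):
    """Hypothetical helper: assemble (but do not send) a GET request."""
    if params:
        # Query parameters are URL-encoded and appended after a '?'
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers=headers or {})

request = build_request(
    'https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'},
)
print(request.full_url)                  # https://httpbin.org/anything?foo=bar&bar=baz
print(request.get_header('User-agent'))  # gazpacho
```

Note that urllib normalizes header names with str.capitalize, which is why the lookup key is 'User-agent'.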

Soup

Use the Soup wrapper on raw html to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)
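The difference between partial and exact matching can be sketched in plain Python. The attrs_match helper below is hypothetical and deliberately simplified (substring test versus equality); it is meant to convey the idea, not to reproduce gazpacho's internal matching rules.

```python
def attrs_match(query, attrs, partial=True):
    """Hypothetical helper: does an element's attrs dict satisfy the query?"""
    for key, wanted in query.items():
        actual = attrs.get(key)
        if actual is None:
            return False
        if partial:
            # Partial: the queried value only has to appear inside the attribute
            if wanted not in actual:
                return False
        elif wanted != actual:
            # Exact: the attribute must equal the queried value
            return False
    return True

element = {'class': 'soup-bowl'}  # made-up attributes for illustration
print(attrs_match({'class': 'soup'}, element, partial=True))   # True
print(attrs_match({'class': 'soup'}, element, partial=False))  # False
```

Under this reading, a query such as {'class': 'section-'} picks up every element whose class contains that prefix, while partial=False narrows the result to the literal value.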

mode=

By default ('auto'), the return type of .find depends on how many tags match. Override the mode argument with 'first' to always get a single element, or 'all' to always get a list:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
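One way to picture the three modes is as a final step applied to the list of matches. The sketch below is a hypothetical helper, not gazpacho's source, and it assumes that 'auto' returns None for no matches, the element itself for a single match, and a list otherwise.

```python
def apply_mode(matches, mode='auto'):
    """Hypothetical helper: shape a list of matches according to mode."""
    if mode == 'all':
        return matches                        # always a list, possibly empty
    if mode == 'first':
        return matches[0] if matches else None
    if mode == 'auto':                        # assumed behaviour, see lead-in
        if not matches:
            return None
        return matches[0] if len(matches) == 1 else matches
    raise ValueError(f"unknown mode: {mode!r}")

spans = ['<span>a</span>', '<span>b</span>']  # stand-ins for matched elements
print(apply_mode(spans, 'first'))     # <span>a</span>
print(len(apply_mode(spans, 'all')))  # 2
print(apply_mode(spans[:1], 'auto'))  # <span>a</span>
```

Pinning the mode is useful when downstream code iterates over the result or calls attributes on it, since 'auto' can hand back either shape.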

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup

Support

If you use gazpacho, consider adding the badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use GitHub Issues.

For PRs, please read the CONTRIBUTING.md document.