Open Source has fueled a massive part of the technology boom we are all experiencing. Even in the world of web scraping, open source tools play a large part in helping gather data from the Internet. In this post we will walk through open source web scraping frameworks and tools that are great for crawling, scraping the web, and parsing out the data.

Here is a comparison chart showing the important features of all the best open source web scraper frameworks and tools that we will go through in this post:

| Tool | GitHub Stars | GitHub Forks | Open Issues | Last Updated | Documentation | License |
|---|---|---|---|---|---|---|
| Scrapy | 37.4K | 8.6K | 443 | April 2020 | Excellent | BSD License |
| PySpider | 14.4K | 3.5K | 250 | April 2018 | Good | Apache License 2.0 |
| MechanicalSoup | 3.5K | 310 | 22 | June 2020 | Average | MIT |
| Portia | 7.8K | 1.3K | 100 | June 2019 | Good | BSD License |
| NodeCrawler | 5.4K | 829 | 24 | Nov 2014 | Good | BSD 2-Clause |
| Apify SDK | 2.4K | 155 | 75 | June 2020 | Good | Apache License 2.0 |
| Selenium | 17.9K | 5.7K | 349 | Dec 2018 | Good | Apache License 2.0 |
| Puppeteer | 62.2K | 6.4K | 1.1K | June 2020 | Good | Apache License 2.0 |
| Heritrix | 1.8K | 659 | 33 | May 2020 | Good | Apache License 2.0 |
| Apache Nutch | 2.1K | 1.2K | 443 | June 2020 | Excellent | Apache License 2.0 |
| Jaunt | 37.4K | 8.6K | – | June 2016 | Excellent | Apache License 2.0 |
| StormCrawler | 659 | 212 | 33 | June 2020 | Good | Apache License 2.0 |
| Webscraper.io | 39 | 20 | 5 | April 2020 | Good | Webscraper.io |
| Web Harvest | – | 1 | – | Feb 2010 | Average | N/A |

These are the best open source web scraper tools available in each language or platform:

Scrapy

Scrapy is an open source web scraping framework in Python used to build web scrapers. It gives you all the tools you need to efficiently extract data from websites, process the data as you want, and store it in your preferred structure and format. One of its main advantages is that it's built on top of Twisted, an asynchronous networking framework. If you have a large web scraping project and want to make it as efficient as possible, with a lot of flexibility, then you should definitely use Scrapy.

Scrapy has a couple of handy built-in export formats such as JSON, XML, and CSV. It's built for extracting specific information from websites and lets you focus on the data extraction using CSS selectors and XPath expressions. Scraping web pages with Scrapy is much faster than with other open source tools, so it's ideal for extensive, large-scale scraping. It can also be used for a wide range of purposes, from data mining to monitoring and automated testing.

What stands out about Scrapy is its ease of use: if you are familiar with Python, you'll be up and running in just a couple of minutes.

It runs on Linux, Mac OS, and Windows systems.

Scrapy is under BSD license.

Requires Version – Python 2.7, 3.4+

Available Selectors – CSS, XPath

Available Data Formats – CSV, JSON, XML

Pros

Suitable for broad crawling

Easy setup and detailed documentation

Active Community

Cons

Since it is a full-fledged framework, it is not beginner friendly

Does not handle JavaScript

MechanicalSoup

MechanicalSoup is a Python library designed to simulate the behavior of a human using a web browser, built around the parsing library BeautifulSoup. If you need to scrape data from simple sites, or if heavy scraping is not required, MechanicalSoup is a simple and efficient choice. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms.

It's best to use MechanicalSoup when interacting, outside of a browser, with a website that doesn't provide a web service API. If the website provides a web service API, you should use that API and you don't need MechanicalSoup. If the website relies on JavaScript, then you probably need a fully-fledged browser, like Selenium. MechanicalSoup is licensed under MIT.

Requires Version – Python 3.0+

Available Selectors – CSS, XPath

Available Data Formats – CSV, JSON, XML

Pros

Preferred for fairly simple websites

Cons

Does not handle JavaScript

PySpider

PySpider is a web crawler written in Python. It supports JavaScript pages and has a distributed architecture, so you can run multiple crawlers. PySpider can store the data on a backend of your choosing such as MongoDB, MySQL, or Redis, and you can use RabbitMQ, Beanstalk, and Redis as message queues.

One of the advantages of PySpider is its easy-to-use UI, where you can edit scripts, monitor ongoing tasks, and view results. If you prefer working with a web-based user interface, PySpider is the crawler to consider. It also supports AJAX-heavy websites. To learn more about PySpider, you can check out its documentation or community resources. It's currently licensed under Apache License 2.0.

Requires Version – Python 2.6+, Python 3.3+

Available Selectors – CSS, XPath

Available Data Formats – CSV, JSON

Pros

Facilitates more comfortable and faster scraping

Powerful UI

Cons

Difficult to deploy

Portia

Portia is a visual scraping tool created by Scrapinghub that does not require any programming knowledge. If you are not a developer, it's best to go straight to Portia for your web scraping needs. You can try Portia for free without installing anything; all you need to do is sign up for an account at Scrapinghub and use their hosted version.

Making a crawler in Portia and extracting web content is very simple if you do not have programming skills. You won't need to install anything, as Portia runs in the web browser. With Portia, you can use basic point-and-click tools to annotate the data you wish to extract, and based on these annotations Portia will understand how to scrape data from similar pages. Once the pages are detected, Portia will create a sample of the structure you have created. Actions such as click, scroll, and wait are all simulated by recording and replaying user actions on a page. Portia is great for crawling Ajax-powered websites (when subscribed to Splash) and should work fine with heavy JavaScript frameworks like Backbone, Angular, and Ember. It filters the pages it visits for an efficient crawl. It's currently licensed under the BSD license.

Requirements – If you are using Linux you will need Docker installed or if you are using a Windows or Mac OS machine you will need boot2docker.

Available Selectors – CSS, XPath

Available Data Formats – CSV, JSON, XML

Pros

Defines CSS or XPath selectors

Filters the page it visits

Cons

Quite time-consuming as compared to other open source tools

Navigating websites is difficult to control. You always need to start the crawl with the target pages, otherwise Portia will visit unnecessary pages and may produce unwanted results

Apify SDK



Apify SDK is a Node.js library which, much like Scrapy, positions itself as a universal web scraping library in JavaScript, with support for Puppeteer, Cheerio, and more. With its unique features like RequestQueue and AutoscaledPool, you can start with several URLs, recursively follow links to other pages, and run the scraping tasks at the maximum capacity of the system.

Requirements – The Apify SDK requires Node.js 8 or later

Available Selectors – CSS

Available Data Formats – JSON, JSONL, CSV, XML, Excel or HTML

Pros

Supports any type of website

Best library for web crawling in Javascript we have tried so far.

Built-in support of Puppeteer

NodeCrawler

Nodecrawler is a popular web crawler for NodeJS, making it a very fast crawling solution. If you prefer coding in JavaScript, or you are dealing with a mostly JavaScript project, Nodecrawler will be the most suitable web crawler to use. Its installation is pretty simple too. For server-side DOM and HTML parsing it uses JSDOM and Cheerio, with JSDOM being the more robust option.

Requires Version – Node v4.0.0 or greater

Available Selectors – CSS, XPath

Available Data Formats – CSV, JSON, XML

Pros

Easy installation

Cons

It has no Promise support

Selenium Web Driver

When it comes to websites that use very complex and dynamic code, it's better to have all the page content rendered using a browser first. Selenium WebDriver uses a real web browser to access the website, so its activity doesn't look any different from a real person accessing information in the same way. When you load a page using WebDriver, the browser loads all the web resources and executes the JavaScript on the page. At the same time, it stores all the cookies created by websites and sends complete HTTP headers, as all browsers do. This makes it very hard to determine whether a real person is accessing the website or if it's a bot.

Although it's mostly used for testing, WebDriver can be used for scraping dynamic web pages. It is the right solution if you want to test whether a website works properly with various browsers, or for JavaScript-heavy websites. Using WebDriver makes web scraping easier, but the scraping process is much slower compared to a simple HTTP request: the browser waits until the whole page is loaded, and only then can you access the elements. Selenium has a very large and active community, which is great for beginners.

Learn More: How to scrape hotel prices using Selenium and Python

Requires Version – Python 2.7 and 3.5+; provides bindings for JavaScript, Java, C#, Ruby, and Python.

Available Selectors – CSS, XPath

Available Data Formats – Customizable

Pros

Suitable for scraping heavy Javascript websites

Large and active community

Detailed documentation, making it easy to grasp for beginners

Cons

Hard to maintain when there are any changes in the website structure

High CPU and memory usage

Puppeteer

Puppeteer is a Node library which provides a powerful but simple API that allows you to control Google’s headless Chrome browser. A headless browser means you have a browser that can send and receive requests but has no GUI. It works in the background, performing actions as instructed by an API. You can truly simulate the user experience, typing where they type and clicking where they click.

The best case to use Puppeteer for web scraping is if the information you want is generated using a combination of API data and JavaScript code. A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders a URL. Puppeteer can also be used to take screenshots of web pages as they would appear when you open a web browser. Puppeteer's API is very similar to Selenium WebDriver's, but it works only with Google Chrome, while WebDriver works with most popular browsers. Puppeteer has more active support than Selenium, so if you are working with Chrome, Puppeteer is your best option for web scraping.

Learn more: How to build a Web Scraper using Puppeteer and Node.Js

Requires Version – Node v6.4.0, Node v7.6.0 or greater

Available Selectors – CSS

Available Data Formats – JSON

Pros

With its full-featured API, it covers a majority of use cases

The best option for scraping Javascript websites on Chrome

Cons

Only available for Chrome

Supports only JSON format

Learn more: Open Source Javascript Web Scraping Tools and Frameworks

Heritrix

Heritrix is a web crawler designed for web archiving, written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls. Heritrix runs in a distributed environment. It is scalable, but not dynamically scalable: you must decide on the number of machines before you start crawling.

Requires Versions – Java 5.0+

Available Selectors – XPath, CSS

Available Data Formats – WARC/ARC files

Pros

Excellent user documentation and easy setup

Mature and stable platform.

Good performance and decent support for distributed crawls

Respects robots.txt

Supports broad and focused crawls

Cons

Not dynamically scalable

Apache Nutch

Apache Nutch is a well-established web crawler based on Apache Hadoop. As such, it operates by batches with the various aspects of web crawling done as separate steps like generating a list of URLs to fetch, parsing web pages, and updating its data structures.

Requirements – Java 8

Available Selectors – XPath, CSS

Available Data Formats – JSON, CSV, XML

Pros

Highly extensible and Flexible system

Open-source web-search software, built on Lucene Java

Dynamically scalable with Hadoop

Cons

Difficult to setup

Poor documentation

Some operations take longer as the size of the crawl grows

Jaunt

Jaunt is a Java library for web-scraping and JSON querying. The library provides a fast, headless browser. The browser provides access to the DOM and control over each HTTP Request/Response, enabling your Java programs to work with forms and tables and to control and process individual HTTP Requests/Responses.

Jaunt provides both free and paid versions. The free version is under Apache license, it can be used for personal or commercial projects, including redistributing the file.

Requirements – Java 7

Available Selectors – Jaunt has its own syntax

Available Data Formats – XML, JSON

Pros

Because it’s lightweight, it’s relatively easy to scale such as using one UserAgent per thread.

Don’t need to rely too heavily on CSS and XPath selectors

Provides high-level components for common web scraping tasks

Good for DOM level operations, when Javascript support is not required

Cons

Does not support Javascript

Free version lasts only for one month

StormCrawler

StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The framework is based on the stream processing framework Apache Storm, and all operations occur at the same time: URLs are fetched, parsed, and indexed constantly, which makes the whole crawling process more efficient.

It comes with modules for commonly used projects such as Apache Solr, Elasticsearch, MySQL, or Apache Tika and has a range of extensible functionalities to do data extraction with XPath, sitemaps, URL filtering or language identification.

Requirements – Apache Maven, Java 7

Available Selectors – XPath

Available Data Formats – JSON, CSV, XML

Pros

Appropriate for large scale recursive crawls

Suitable for Low latency web crawling

Cons

Does not support document deduplication

Webscraper.io

Web Scraper, a standalone Chrome extension, is a great web scraping tool for extracting data from dynamic web pages. Using the extension you can create a sitemap describing how the website should be traversed and what data should be extracted. With sitemaps, you can easily navigate the site the way you want, and the data can later be exported as a CSV or into CouchDB.

The advantage of webscraper.io is that you need only basic coding skills. If you aren't proficient with programming or need large volumes of data to be scraped, Webscraper.io will make the job easier for you. The extension requires Chrome 31+ and has no OS limitations. You can download and add the extension to Chrome using the link – https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en

Required Version – Chrome 31+

Available Selectors – CSS

Available Data Formats – CSV

Pros

Best Google Chrome extension for basic web scraping from websites into CSV format

Easy to install, learn and understand

Cons

It cannot be used if you have complex web scraping scenarios such as bypassing CAPTCHA, submitting forms, etc.

Web Harvest

Web Harvest is an open-source web scraping tool written in Java. It offers text and XML manipulation facilities such as regular expressions and XQuery. This web scraping tool is great for beginners yet full of features for experienced users. It can be used in three modes – as a GUI application, as a command-line utility, and from Java code. You can extract data or save it to a database without knowing how to program in Java.

Requires Version – Java 1.5+

Available Selectors – XPath

Available Data Formats – JSON, XML

Pros

Good for beginners

Can easily implement Java libraries

Cons

Versions are not updated frequently compared to other frameworks