What is a Web Scraper?

A web scraper is a script that extracts data from websites. It typically does so by sending a GET request to a web page and then parsing the HTML response to retrieve the desired content.

1. Generate a console project

Create a directory for your project:

$ mkdir hacker_news_scraper && cd hacker_news_scraper

Use the stagehand package to generate a console application:

$ pub global activate stagehand # If it's not installed

$ stagehand console-full

Add the http and html dependencies to the pubspec.yaml file:

dependencies:
  html: ^0.13.3+3
  http: ^0.12.0

The http package provides a Future-based API for making requests. The html package contains helpers to parse HTML5 strings using a DOM-inspired API. It’s a port of html5lib from Python.

And install the added dependencies:

$ pub get

Following these instructions correctly should give you the file/folder structure below:

Project structure for console application

2. Implement the script

Empty the contents of lib/hacker_news_scraper.dart, as we will start from scratch ☝️

Import our installed dependencies:

import 'dart:convert'; // Contains the JSON encoder

import 'package:http/http.dart'; // Contains a client for making API calls
import 'package:html/parser.dart'; // Contains HTML parsers to generate a Document object
import 'package:html/dom.dart'; // Contains DOM related classes for extracting data from elements

Create a function after our imports to contain our logic:

initiate() async {}

The http package contains a Client class for making HTTP calls. Create an instance and perform a GET request to the Hacker News homepage:

Future initiate() async {
  var client = Client();
  Response response = await client.get(
    'https://news.ycombinator.com'
  );

  print(response.body);
}

To test this out, go to bin/main.dart and invoke the initiate method:

import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate());
}

Run this file:

$ dart bin/main.dart

Below is an extract of the response:

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0">

...

...

<table border="0" cellpadding="0" cellspacing="0" class="itemlist">

<tr class='athing' id='18678314'>

<td align="right" valign="top" class="title"><span class="rank">1.</span></td> <td valign="top" class="votelinks"><center><a id='up_18678314' href='vote?id=18678314&how=up&goto=news'><div class='votearrow' title='upvote'></div></a></center></td><td class="title"><a href="http://vmls-book.stanford.edu/" class="storylink">Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares</a><span class="sitebit comhead"> (<a href="from?site=stanford.edu"><span class="sitestr">stanford.edu</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">

<span class="score" id="score_18678314">381 points</span> by <a href="user?id=yarapavan" class="hnuser">yarapavan</a> <span class="age"><a href="item?id=18678314">8 hours ago</a></span> <span id="unv_18678314"></span> | <a href="hide?id=18678314&goto=news">hide</a> | <a href="item?id=18678314">37 comments</a> </td></tr>

...

...

To know what to look for, we need to work out how to select the story links on the page.

Each link sits in a table cell with the class “title” and itself has the class “storylink”. This means that we can match the links with this CSS selector: td.title > a.storylink
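To try the selector without touching the network, here is a minimal sketch that runs it against an inline HTML string. The storyTitles helper and the sample markup are hypothetical, trimmed down from the extract above; only parse and querySelectorAll come from the html package.

```dart
import 'package:html/parser.dart' show parse;

/// Returns the text of every story link found in [html].
List<String> storyTitles(String html) {
  final document = parse(html);
  return document
      .querySelectorAll('td.title > a.storylink')
      .map((link) => link.text)
      .toList();
}

void main() {
  // Hypothetical fixture: one rank cell and one story cell.
  const sample = '''
<table class="itemlist">
  <tr class="athing">
    <td class="title"><span class="rank">1.</span></td>
    <td class="title"><a href="http://vmls-book.stanford.edu/" class="storylink">Introduction to Applied Linear Algebra</a></td>
  </tr>
</table>
''';

  // Only the anchor matches; the rank cell has no a.storylink child.
  print(storyTitles(sample));
}
```

The child combinator (>) keeps the match tight: an a.storylink nested deeper inside the cell, or placed outside a td.title, would be ignored.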

In lib/hacker_news_scraper.dart, rather than printing the response body in the initiate function, let’s parse the body and select our elements using the helpers from the html package:

Future initiate() async {
  var client = Client();
  Response response = await client.get(
    'https://news.ycombinator.com'
  );

  // Use html parser and query selector
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
}

At this point we have a list of Elements, where each element is an a.storylink item. The Element type provides an API similar to the DOM.

With a for-in loop we can traverse the collection:

List<Map<String, dynamic>> linkMap = [];

for (var link in links) {
  linkMap.add({
    'title': link.text,
    'href': link.attributes['href'],
  });
}

And return the JSON-encoded output:

import 'dart:convert'; // Import this library at the top of the file

Future initiate() async {
  ...
  ...
  return json.encode(linkMap);
}

Here’s the full script so far:
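Pieced together from the steps above, a sketch of it might look like this. The one deliberate difference is the Uri.parse wrapper around the URL: that is an assumption made so the same code also runs against newer releases of the http package, which expect a Uri rather than a String.

```dart
import 'dart:convert'; // Contains the JSON encoder

import 'package:http/http.dart'; // Contains a client for making API calls
import 'package:html/parser.dart'; // Contains HTML parsers
import 'package:html/dom.dart'; // Contains DOM related classes

Future<String> initiate() async {
  var client = Client();
  Response response = await client.get(
    Uri.parse('https://news.ycombinator.com'),
  );

  // Use html parser and query selector
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');

  // Collect the title and URL of each story link
  List<Map<String, dynamic>> linkMap = [];
  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}
```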

Running this should print JSON output similar to the following:

[
  {
    "title":"Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares",
    "href":"http://vmls-book.stanford.edu/"
  },
  {
    "title":"Write Your Own Virtual Machine",
    "href":"https://justinmeiners.github.io/lc3-vm/"
  },
  ...
  ...
]

3. Write the unit tests

Our tests will go in test/hacker_news_scraper_test.dart. Replace its contents with the following:

import 'dart:convert';

import 'package:test/test.dart';
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main() {
  // Our tests will go here
}

This is what our first test looks like:

void main() {
  test('calling initiate() returns a list of storylinks', () async {
    var response = await hacker_news_scraper.initiate();
    expect(response, equals('/* JSON string to match against */'));
  });
}

We need to refactor our solution slightly for our tests. As written, the test would be flaky because it makes real calls to the Hacker News website.

If Hacker News isn’t available, we do not have an internet connection, or the story listings change (and they will), our tests will fail.

Let’s refactor our initiate() function to expect a client parameter and remove the var client = Client(); declaration:

// lib/hacker_news_scraper.dart
Future initiate(BaseClient client) async {
  // var client = Client(); // <- Remove this line
  ...
}

The default client in the http package extends a BaseClient type. This is useful because the same package provides another BaseClient subclass, MockClient, which lets us mock HTTP calls in our unit tests!

Return to bin/main.dart and ensure the Client is passed in:

import 'package:http/http.dart'; // Import the package first!
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate(Client()));
}

Ok, back to our unit tests.

This is the first test that uses our MockClient:
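A minimal sketch of such a test follows. The HTML fixture and the expected values are hypothetical, chosen only to exercise the td.title > a.storylink selector, and the sketch assumes initiate() already accepts the client as refactored above.

```dart
import 'dart:convert';

import 'package:test/test.dart';
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main() {
  test('calling initiate(client) returns a list of storylinks', () async {
    // Arrange: the mocked client answers every request with this
    // hypothetical fixture, which contains a single story link.
    var client = MockClient((request) => Future(() => Response('''
      <table><tr>
        <td class="title">
          <a class="storylink" href="https://dartlang.org">Get started with Dart</a>
        </td>
      </tr></table>
    ''', 200)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert: the scraper should return the link as an encoded JSON string.
    expect(
      response,
      equals(json.encode([
        {'title': 'Get started with Dart', 'href': 'https://dartlang.org'},
      ])),
    );
  });
}
```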

The MockClient constructor takes a closure as its only parameter. This closure receives the request object, which we can inspect if needed, and is expected to return a Future that completes with a Response, which is what happens here. We return an HTML string when the call is made in our await client.get(...) method.

The Response constructor takes a second parameter, an integer representing the response status code. In this case it is 200 OK.

We then make our initiate() call, passing in the mocked client. This means that our test is now predictable and we can confidently perform assertions on the response.

The expect and equals top-level functions come as part of the test package by the Dart team. We installed this earlier on and it is listed under dev_dependencies: in our pubspec.yaml file.

We are using the json.encode() method because the operation returns an encoded JSON string.
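To make the comparison concrete, here is a small standalone sketch of what json.encode produces for a list of maps shaped like our links. The encodeLinks helper and the sample values are hypothetical.

```dart
import 'dart:convert';

/// Encodes a list of link maps the same way initiate() does.
String encodeLinks(List<Map<String, dynamic>> links) => json.encode(links);

void main() {
  var encoded = encodeLinks([
    {'title': 'Get started with Dart', 'href': 'https://dartlang.org'},
  ]);

  // json.encode emits compact JSON: no spaces or line breaks are added.
  print(encoded);
  // [{"title":"Get started with Dart","href":"https://dartlang.org"}]
}
```

Because the output is compact and keeps the keys in insertion order, building the expected value with json.encode in the test is less error-prone than typing the JSON string by hand.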

We can run this test by doing:

$ pub run test

Here’s the second test to address a failure scenario:

void main() {
  ...
  ...
  test('calling initiate(client) should silently fail', () async {
    // Arrange
    var client = MockClient((req) => Future(() => Response('Failed', 400)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert
    expect(response, equals('Failed'));
  });
}

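Execute pub run test again. This will fail, because initiate() parses the body as HTML regardless of the status code, finds no story links, and returns an encoded empty list instead of 'Failed'.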

Let’s make this pass. In our initiate() method, let’s add this condition below our GET call:

if (response.statusCode != 200) return response.body;
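Putting the refactor and the status check together, the function might look like the sketch below. Two details are assumptions rather than part of the steps above: the parameter is typed as Client, the interface that BaseClient implements, so that both Client() and MockClient can be passed without a cast under null-safe Dart, and the URL is wrapped in Uri.parse for compatibility with newer releases of the http package.

```dart
import 'dart:convert';

import 'package:http/http.dart';
import 'package:html/parser.dart';

/// Sketch of initiate() after the refactor. [client] may be a real
/// Client or a MockClient (assumption: typed as Client, not BaseClient).
Future<String> initiate(Client client) async {
  Response response = await client.get(
    Uri.parse('https://news.ycombinator.com'),
  );

  // Hand back the raw body for any non-200 response.
  if (response.statusCode != 200) return response.body;

  // Use html parser and query selector
  var document = parse(response.body);
  var links = document.querySelectorAll('td.title > a.storylink');

  // Collect the title and URL of each story link
  List<Map<String, dynamic>> linkMap = [];
  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}
```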

Run the test again. All should pass!

Passing tests. Output produced by the Dart extension for VS Code

Conclusion

To sum things up, we have built a scraping tool to pull in the latest feed from the Hacker News website using the http and html packages provided by the Dart team. We then covered our backs by writing some unit tests.

In reality, though, it may serve you better to use the Hacker News APIs for this 😄. That being said, you will still need this approach for websites that do not have an official API for traversing their content.

I hope this has been insightful, especially in the area of writing tests in Dart.

→ Get the source code

I also run a YouTube channel teaching subscribers to develop fullstack applications with Dart. Become a subscriber to receive updates when new videos are released.

And lastly, I’m almost finished with producing the free Dart course on Egghead.io. This is due for release in the New Year 🎉, so keep an eye out for that 👁️

Like, share and follow me 😍 for more content on Dart. Thanks!

Further reading