Recently I have been working on WWW::Gittip, the Perl implementation of the Gittip API. There are a number of public API calls that return JSON. Some required a little extra thinking as they were using Basic Authentication without a challenge, but there are also requests I'd like to make that don't have a public JSON end-point yet. For example, there is currently no way to fetch the list of members of a community.

Luckily there is an HTML page for each such community, for example the page of the Perl community.

So in order to make it easy for others to fetch the list of community members we need to parse the HTML page and extract the information ourselves.

This article is originally from 2014. Since then Gittip was renamed to be Gratipay and in November 2017 it was shut down. Nevertheless the technique is still useful.

If you look at the Perl community page you'll see that it shows the number of members (516 at the time of writing) and lists 3 groups of members, with 12 members in each group.

The 3 groups can have some overlap. The left-most group shows the members in the order they joined Gittip, the right-most group shows the members ordered by the sum they receive, and the middle group shows them ordered by the sum they give.

In order to show more members we can click on the "more" button at the bottom. It will send a new request with ?limit=24 attached to the end of the URL. This time we'll see up to 24 members in each one of the 3 groups.

We can edit the URL to increase the limit to the total number of members as shown in the center top of the page, but it will only show up to 100 members.

After searching a bit among the tickets on GitHub I found that the URL also accepts another parameter, offset, that tells the site how many members to skip before it starts listing them. So limit=100&offset=0 will show the first 100 members, and limit=100&offset=100 will show the second hundred members. If we increase the offset enough times, we can fetch all the members, once in each of the 3 groups.
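As a rough sketch of that paging scheme, the following helper builds the list of URLs needed to cover a given number of members. The name community_page_urls is made up for this illustration and is not part of WWW::Gittip; the URL pattern is the one of the Perl community page, and the default of 100 reflects the cap mentioned above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: build one URL per page of $limit members
# until $total members are covered.
sub community_page_urls {
    my ($name, $total, $limit) = @_;
    $limit //= 100;    # the site shows at most 100 members per request

    my @urls;
    for (my $offset = 0; $offset < $total; $offset += $limit) {
        push @urls, "https://www.gittip.com/for/$name?limit=$limit&offset=$offset";
    }
    return @urls;
}

# One line per request that would be needed for 516 members.
say for community_page_urls('perl', 516);
```

With 516 members and 100 members per request, the last URL uses offset=500.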

But how are we going to extract the actual values from the HTML files?

For the sake of the example we are going to use the get function of LWP::Simple to fetch the web pages, and HTML::TreeBuilder to parse the HTML and extract the information we need.

Extracting the number of members

As we will want to know how many pages we need to fetch, we should start by extracting the number of members. We create the $url from 3 parameters that are currently embedded in the code. A better solution might let the user supply them, or at least the name of the community.

get will fetch the HTML page.

At this point you might want to save the HTML page in a local file to make it easier to analyze manually, or you can visit the web site and use the "view source" option that browsers usually offer in the right-click menu.

In order to be able to locate the required data programmatically, we need to find the HTML construct that wraps it. Because we are looking for a very specific piece of data here I just manually searched the source for the value 516 (the number of members) and found the following snippet of HTML:

    <div class="on-community">
        <h2 class="pad-sign">Perl</h2>
        <div class="number">516</div>
        <div class="unit pad-sign">members</div>
    </div>

The value we are looking for is wrapped in a div element that has a class="number" attribute.

HTML::TreeBuilder inherits from HTML::Element which provides a very nice tool to search for elements. We can call the look_down method and provide it a name of an attribute and the value of that attribute, and in SCALAR context it will return the first element matching our definition. (In LIST context it will return all the matching elements.)

my $e = $tree->look_down('class', 'number'); will return the first element that has a class attribute with the value "number". The object returned in $e is an HTML::Element object. It has a method called as_text that will return the content of the element after stripping away all the HTML. In our case that will be the desired number.

    use strict;
    use warnings;
    use 5.010;

    use LWP::Simple qw(get);
    use HTML::TreeBuilder 5 -weak;

    my $name   = 'perl';
    my $limit  = 100;
    my $offset = 0;

    my $url  = "https://www.gittip.com/for/$name?limit=$limit&offset=$offset";
    my $html = get $url;
    die "Could not fetch $url\n" if not defined $html;    # get returns undef on failure

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # tell the parser that the document is complete

    my $e = $tree->look_down('class', 'number');
    say $e->as_text;
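To see the difference between the SCALAR and LIST context behavior of look_down without touching the network, here is a small sketch that parses an inline HTML string. The markup is based on the snippet above; the extra div holding 42 is made up so that there is more than one match.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder 5 -weak;

# Sample markup; the last div is invented for this sketch.
my $html = <<'HTML';
<div class="on-community">
  <h2 class="pad-sign">Perl</h2>
  <div class="number">516</div>
  <div class="unit pad-sign">members</div>
</div>
<div class="number">42</div>
HTML

# new_from_content parses the string and finishes the document in one call.
my $tree = HTML::TreeBuilder->new_from_content($html);

# SCALAR context: only the first matching element is returned.
my $first = $tree->look_down('class', 'number');
say $first->as_text;

# LIST context: all the matching elements are returned.
my @all = $tree->look_down('class', 'number');
say scalar @all;
say join ', ', map { $_->as_text } @all;
```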

The actual solution I used in WWW::Gittip was a bit more complex. First I used look_down to find the div element that has class="on-community", and then a second search, starting from that element, found the right element.

    my $cl    = $tree->look_down('class', 'on-community');
    my $n     = $cl->look_down('class', 'number');
    my $total = $n->as_text;
    say $total;

Probably the implementation in WWW::Gittip should be updated.
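One thing to keep in mind with either version: in SCALAR context look_down returns undef when nothing matches, so calling a method on the result will die if the layout of the page changes. A more defensive variant might look like the sketch below. The function name member_count is made up for this illustration, and this is not the code used in WWW::Gittip.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the number inside the "on-community" block,
# or undef if the expected elements are missing.
sub member_count {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $block = $tree->look_down('class', 'on-community')
        or return;
    my $number = $block->look_down('class', 'number')
        or return;

    # Accept the value only if it really is a number.
    my $text = $number->as_text;
    return $text =~ /^\s*(\d+)\s*$/ ? $1 : undef;
}

say member_count('<div class="on-community"><div class="number">516</div></div>') // 'not found';
say member_count('<p>layout changed</p>') // 'not found';
```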

Fetching the members

Looking at the web page that lists a few members of the "Perl community" we can see that there is a title "NEW MEMBERS". In order to understand the layout of the underlying HTML we look at the source of the HTML again and look for that string. I found the following HTML snippet:

    <div id="leaderboard">
      <div class="people">
        <h2>New Members</h2>
        <ul class="group">
          <li>
            <a href="/dwierenga/" class="mini-user tip" data-tip="">
              <span class="inner">
                <span class="avatar" style="background-image: url(\'https://avatars.githubusercontent.com/u/272648?s=128\')">
                </span>
                <span class="age">14 <span class="unit">hours</span></span>
                <span class="name">dwierenga</span>
              </span>
            </a>
          </li>

That looks quite clean. Searching for "TOP GIVERS" shows us that the HTML of the 3 groups is the same. They are all ul elements inside div elements that are all in a div with id="leaderboard".

We can start by locating the leaderboard using my $leaderboard = $tree->look_down('id', 'leaderboard');. We are looking for an element with an attribute id that has the value leaderboard.

Once we have found that, we fetch the child-elements of this div using the content_list method. Each such child is one of the 3 groups. We iterate over the child elements using foreach my $ch ($leaderboard->content_list) {

In every iteration we first fetch the title, which is wrapped in an h2 element. In addition to the regular attributes of the HTML elements, HTML::Element also allows us to search for pseudo-attributes. For example, it pretends as if there were an attribute called _tag whose value is the name of the tag: p, a, h1, or in this case h2.
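Putting the pieces described so far together, a minimal sketch could look like the following. It runs on a simplified inline copy of the leaderboard markup, so the member names alice and bob are made up, as is the helper name leaderboard_groups. Collecting the names through the class="name" spans is an assumption based on the HTML snippet above.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder 5 -weak;

# Simplified sample in the shape of the leaderboard; alice and bob are invented.
my $html = <<'HTML';
<div id="leaderboard">
  <div class="people">
    <h2>New Members</h2>
    <ul class="group">
      <li><a href="/dwierenga/"><span class="name">dwierenga</span></a></li>
    </ul>
  </div>
  <div class="people">
    <h2>Top Givers</h2>
    <ul class="group">
      <li><a href="/alice/"><span class="name">alice</span></a></li>
      <li><a href="/bob/"><span class="name">bob</span></a></li>
    </ul>
  </div>
</div>
HTML

# Hypothetical helper: return one hash per group with its title and member names.
sub leaderboard_groups {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $leaderboard = $tree->look_down('id', 'leaderboard') or return;

    my @groups;
    foreach my $ch ($leaderboard->content_list) {
        next if not ref $ch;    # content_list can also return plain text

        # _tag is the pseudo-attribute holding the name of the tag.
        my $h2 = $ch->look_down('_tag', 'h2') or next;

        # LIST context: all the elements with class="name" in this group.
        my @names = map { $_->as_text } $ch->look_down('class', 'name');
        push @groups, { title => $h2->as_text, names => \@names };
    }
    return @groups;
}

for my $g (leaderboard_groups($html)) {
    say "$g->{title}: @{ $g->{names} }";
}
```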