I'm making the data collected from the Putnam site available for download as a SQLite database. However, I have set the birthdate days to 01 and truncated the first names. Yes, this is public information. But this is a programming lesson, not a live mirror of the sheriff's data, and thus it will not reflect any recent updates that the official site might make.

You can still follow along with the examples using the redacted version of the database. Email me if you're interested in using the data for research purposes.
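If you want to poke at the database from Ruby, the sqlite3 gem will do. Here's a quick sketch; the filename and table name below are placeholders, since they depend on what you've downloaded:

require 'sqlite3'

# 'putnam-data.sqlite' and 'inmates' are hypothetical names -- substitute
# the actual filename and table from the downloaded database
db = SQLite3::Database.new('putnam-data.sqlite')
db.execute('SELECT * FROM inmates LIMIT 5') do |row|
  puts row.join(' | ')
end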

As in previous scraping examples, we'll start out by downloading every page of data to the hard drive to parse later.

In the case of the Putnam County jail archive, there are two steps in the downloading phase:

Scrape the history lists

The Putnam County jail archive can be accessed at this address: http://public.pcso.us/jail/history.aspx

There are two search fields, for first and last name. The query functionality is generous: enter % as a wildcard and you can return the entire archive (paginated at 25 entries per page).

At the bottom of each results page are links to the next 10 pages:

Unfortunately, the site does not have any deep links to the results. In fact, if you inspect the links with your web inspector (you did read that chapter, right?), you'll see that the links don't contain actual URLs. Instead, they contain JavaScript code:

I've highlighted the relevant part of the link, the href attribute:

<a style="color:#333333;" href="javascript:__doPostBack('MyGridView','Page%247')">7</a>

When a link is clicked, the JavaScript inside its href attribute calls a method named __doPostBack . The webpage's already-loaded JS defines this method, which triggers the process that generates the server request for the next page.

To refresh your memory: in my big picture guide to web-scraping, this JS processing is Step #2. But remember that it's not worth poring over JS files (to figure out what __doPostBack actually does) when you can instead use your web inspector to see the request sent by the browser after the JS does its magic (Step #3).
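In Ruby terms, the browser's request boils down to a handful of POST parameters. A sketch of what that looks like with RestClient ( __EVENTTARGET and __EVENTARGUMENT are the conventional ASP.NET parameter names that __doPostBack fills in, so verify them against what your own inspector shows):

require 'rest-client'

# A sketch of the pager's postback, expressed as a plain POST request
RestClient.post('http://public.pcso.us/jail/history.aspx',
  '__EVENTTARGET' => 'MyGridView',   # the first argument in the link's JavaScript
  '__EVENTARGUMENT' => 'Page$5'      # the second argument: the results page we want
)
# ...plus the rest of the form's fields, which is where the trouble starts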

Looks pretty straightforward, right? That Page$5 parameter clearly comes from the JavaScript inside a given link. Can we just iterate through every integer from 1 to whatever?

Unfortunately not. The sheriff's search form has some tricky fields to deal with. I've highlighted them in the screenshots below.

You can even see that there are <input> tags where the type is "hidden". The data values stored here are not visible in the browser-rendered page. Take special note of the hidden field with the id of __VIEWSTATE:

__VIEWSTATE becomes a sizeable parameter in the POST request:
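You can enumerate these hidden fields yourself with an HTML parser such as Nokogiri. A sketch, assuming you've already fetched the search page's HTML:

require 'nokogiri'
require 'rest-client'

html = RestClient.get('http://public.pcso.us/jail/history.aspx')
doc = Nokogiri::HTML(html)

# Gather every hidden <input> into a name => value hash
hidden_fields = {}
doc.css('input[type="hidden"]').each do |input|
  hidden_fields[input['name']] = input['value']
end

puts hidden_fields['__VIEWSTATE']   # a very long string of seeming gibberish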

Session-tracking with hidden variables inside the form

This Putnam County site uses a search form with hidden variables to keep track of the user session. Because the search form submits the user's search terms with a POST request, there's no way to link directly to the results.

Wikipedia describes this type of session-tracking:

Another form of session tracking is to use web forms with hidden fields. This technique is very similar to using URL query strings to hold the information and has many of the same advantages and drawbacks; and if the form is handled with the HTTP GET method, the fields actually become part of the URL the browser will send upon form submission. But most forms are handled with HTTP POST, which causes the form information, including the hidden fields, to be appended as extra input that is neither part of the URL, nor of a cookie.

The upshot for us is that the sheriff's webserver knows the page number we're currently accessing and uses this to limit how far we can jump ahead in the results list. If we have requested page 12 of the results, it will only let us access pages 10, 11, and 13 through 21. Submitting Page$300 as a parameter for page 300 will be rejected.

I'm not an expert on backend design. But my guess is that those gibberish values in the hidden __VIEWSTATE field are used by the server as a simple form of authentication. Each page's form has a unique value, and so the backend script can tell if you're submitting an out-of-bounds page request.

It should be easy to wrap up the form values in a hash and submit them using RestClient.post . But maybe my guess is wrong and the sheriff's website is detecting state through cookies. Rather than spend 15 minutes trying to satisfy my curiosity, I've decided to just use Mechanize, which handles all the details of acting like a browser, including properly submitting forms, clicking buttons, and consuming cookies.
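For comparison's sake, here's roughly what the Mechanize version of the search submission looks like. This is a sketch: the names of the two text fields are assumptions, so check the form's actual <input> names in your web inspector:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://public.pcso.us/jail/history.aspx')

# Mechanize carries along __VIEWSTATE, the other hidden fields, and any
# cookies automatically when it submits the form
form = page.forms.first
form['txtFirstName'] = '%'   # hypothetical field names -- check the real
form['txtLastName']  = '%'   # <input> names in your web inspector
results_page = form.submit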

I've started writing a chapter on Mechanize, though it has yet to be finished. I try to use Mechanize as a last resort because I like figuring out website operations on my own. But if time and blood-pressure levels are a consideration, go with Mechanize.

Here's the scraping code. I can't do a better job of explaining Mechanize's methods than its homepage does, so I'll leave it to you to read the documentation if you're curious about its range of functionality.

The one hack I use is to keep track of the farthest results page I've visited to prevent an infinite loop when the scraper reaches the 590th-or-so page.
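Here's a condensed sketch of that loop. The search-field names and the data-hold output directory are stand-ins, and note that the Page$ argument may appear URL-encoded as Page%24 in the raw HTML:

require 'mechanize'
require 'fileutils'

FileUtils.mkdir_p('data-hold')
agent = Mechanize.new

# Submit the wildcard search, as before
page = agent.get('http://public.pcso.us/jail/history.aspx')
form = page.forms.first
form['txtFirstName'] = '%'   # hypothetical field names
form['txtLastName']  = '%'
page = form.submit

farthest_page = 1

loop do
  # Save the page we're on before moving ahead
  File.open("data-hold/page-#{farthest_page}.html", 'w') { |f| f.write(page.body) }

  # The pager links hold arguments like Page$7 (sometimes encoded as Page%247)
  pager_numbers = page.body.scan(/Page(?:\$|%24)(\d+)/).flatten.map(&:to_i)

  # Only ever move forward; when no link points past the farthest page
  # we've visited, we've hit the end of the results
  next_page = pager_numbers.select { |n| n > farthest_page }.min
  break if next_page.nil?

  form = page.forms.first
  form['__EVENTTARGET'] = 'MyGridView'
  form['__EVENTARGUMENT'] = "Page$#{next_page}"
  page = form.submit
  farthest_page = next_page
end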