Jaunt: Java Web Scraping & Automation Package

Today I'm announcing the free Beta release of Jaunt, a Java package for web scraping and other web automation tasks. The library provides a lightweight, headless browser that makes it easier for developers to parse, traverse, search, extract, and filter HTML and XML data.

The Java ecosystem has a number of tools for parsing HTML, the best of which deal gracefully with real-world, online data, which is often dirty and unpredictably formatted. "Graceful", in this case, means not only parsing without choking, but being able to switch seamlessly between HTML and XHTML. Jaunt's parser, which handles both HTML and XML, is guaranteed to generate a parse tree for even the messiest, non-validating data.

Beyond acting as a parser and exposing the low-level DOM mechanics, Jaunt also provides high-level convenience functions. The package accommodates three levels of abstraction:

browser level

document-component level

DOM level

Although Jaunt makes it easy to traverse/search the parse tree directly, it also exposes hyperlinks, forms, etc. as components which can be clicked, manipulated, and submitted. For example, when working with the form component, individual fields can be specified on the basis of how they are labelled on the page rather than via the element's attributes or XPath.
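
To make the idea of label-based field lookup concrete, here is a minimal sketch in plain Java. It uses only the JDK's XML parser and is not Jaunt's API; the class name `LabelLookup`, the helper `fieldNameForLabel`, and the sample markup are all hypothetical, and the input is assumed to be well-formed XHTML (which real-world pages often are not).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustration only: resolves a form field from its visible label text.
// This is NOT Jaunt code; it assumes well-formed markup and <label for="...">.
public class LabelLookup {

    /** Returns the name of the input tied to the given label text, or null if none. */
    static String fieldNameForLabel(Document doc, String labelText) {
        NodeList labels = doc.getElementsByTagName("label");
        NodeList inputs = doc.getElementsByTagName("input");
        for (int i = 0; i < labels.getLength(); i++) {
            Element label = (Element) labels.item(i);
            if (!label.getTextContent().trim().equals(labelText)) continue;
            String targetId = label.getAttribute("for");
            if (targetId.isEmpty()) continue; // label is not tied to a field id
            for (int j = 0; j < inputs.getLength(); j++) {
                Element input = (Element) inputs.item(j);
                if (input.getAttribute("id").equals(targetId)) {
                    return input.getAttribute("name");
                }
            }
        }
        return null;
    }

    /** Parses a well-formed markup string into a DOM document. */
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String page =
              "<form action=\"/login\">"
            + "<label for=\"f1\">Username:</label><input id=\"f1\" name=\"usr\" type=\"text\"/>"
            + "<label for=\"f2\">Password:</label><input id=\"f2\" name=\"pwd\" type=\"password\"/>"
            + "</form>";
        Document doc = parse(page);
        System.out.println(fieldNameForLabel(doc, "Username:")); // prints usr
        System.out.println(fieldNameForLabel(doc, "Password:")); // prints pwd
    }
}
```

The caller only needs to know what a person sees on the page ("Username:"), not the `id` or `name` attributes behind it, which is the convenience the form component is meant to offer.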

The following Jaunt program visits, fills out, and submits a login form:

But Jaunt can make life even easier. Instead of writing code to navigate back and forth through a form interface for multiple submissions (e.g., search forms), the developer can automatically generate a form's request permutations. Each request represents a submission for one possible combination of inputs, so there is no need to manipulate the form inputs one at a time; the developer is free to focus on extracting data from the results pages.
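
As a rough sketch of what "request permutations" means, the plain-Java example below builds one GET URL for every combination of a form's input values. It is not Jaunt's implementation or API; the class `FormPermutations`, the method `permutations`, the `/search` action, and the field names are hypothetical and chosen only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: enumerates every combination of form input values
// as a GET request URL. This is NOT Jaunt code.
public class FormPermutations {

    /** Builds one URL per combination of the given field values (Cartesian product). */
    static List<String> permutations(String action, Map<String, List<String>> fields) {
        List<String> queries = new ArrayList<>();
        queries.add(""); // start with a single empty query string
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            List<String> extended = new ArrayList<>();
            for (String prefix : queries) {
                for (String value : field.getValue()) {
                    String pair = encode(field.getKey()) + "=" + encode(value);
                    extended.add(prefix.isEmpty() ? pair : prefix + "&" + pair);
                }
            }
            queries = extended; // each pass multiplies the count by the field's option count
        }
        List<String> urls = new ArrayList<>();
        for (String query : queries) {
            urls.add(query.isEmpty() ? action : action + "?" + query);
        }
        return urls;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8); // requires Java 10+
    }

    public static void main(String[] args) {
        // Hypothetical search form: two select inputs with two options each.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("category", List.of("books", "music"));
        fields.put("sort", List.of("price", "date"));
        for (String url : permutations("/search", fields)) {
            System.out.println(url);
        }
        // prints:
        // /search?category=books&sort=price
        // /search?category=books&sort=date
        // /search?category=music&sort=price
        // /search?category=music&sort=date
    }
}
```

With the full list of requests generated up front, a scraper can loop over them and spend its code on the results pages rather than on resetting and resubmitting the form.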

Jaunt is Beta software and is ready to be test driven. The website (http://jaunt-api.com) provides a quickstart tutorial with plenty of short, simple examples for each of Jaunt's most important features. Try it out and provide feedback for the next release!