About the univocity HTML parser

The univocity HTML parser is a complete framework with all the features you need to implement simple and complex HTML parsing projects. You will reduce your development time by 80% as the library takes care of all the heavy lifting for you.

Here is a quick rundown of what you can do with it.

Write clean, readable code

Unlike XPath or CSS selectors, its fluent matching rule API lets you specify easy-to-understand paths that locate and capture the data elements you need. You can handle very complex pages with little effort, and you will probably never have to write code to sift through HTML nodes.

For example, this table:

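    Address                   Personal   Business   Mailing
    Somewhere in the world    (x)        ( )        ( )
    Somewhere else            ( )        ( )        (x)

Each address row has one radio button per column; (x) marks the checked one.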

This can be easily parsed with two matching rules:

    // Gets the content of all cells under the "Address" column
    address.addField("address")
        .match("td")
        .underHeader("td").withExactText("Address")
        .getText();

    // Finds a checked radio button and returns the text in the header of the corresponding column
    address.addField("type")
        .match("input").attribute("type", "radio") // matches radio buttons
        .attribute("checked")                      // matches only checked radio buttons
        .getHeadingText();                         // gets the text of the first row of the table,
                                                   // at the same column where a checked radio button is found

All values are collected into rows ready for your database:

    Results<HtmlParserResult> results = parser.parse(input);
    HtmlParserResult addresses = results.get("address");

    String[] headers = addresses.getHeaders(); // [address, type]
    List<String[]> rows = addresses.getRows();
    // [Somewhere in the world, Personal]
    // [Somewhere else, Mailing]

You don’t have to stitch individual pieces of information manually or develop any complex logic to identify which value pertains to which record.
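If you prefer named access over positional arrays, the headers and rows shown above can be paired into maps. The following is a minimal sketch that uses only the JDK. The class `RowRecords` and its method `toRecords` are hypothetical helpers, not part of the univocity API, and the sample data is hard-coded from the example output rather than produced by a parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the univocity API.
final class RowRecords {

    // Pairs each value in a row with the header at the same index.
    static List<Map<String, String>> toRecords(String[] headers, List<String[]> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        for (String[] row : rows) {
            // LinkedHashMap keeps the columns in header order.
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                // A row shorter than the header list gets null for the missing columns.
                record.put(headers[i], i < row.length ? row[i] : null);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Sample data copied from the example output above.
        String[] headers = {"address", "type"};
        List<String[]> rows = Arrays.asList(
                new String[]{"Somewhere in the world", "Personal"},
                new String[]{"Somewhere else", "Mailing"});

        for (Map<String, String> record : toRecords(headers, rows)) {
            System.out.println(record.get("address") + " -> " + record.get("type"));
        }
    }
}
```

In a real project the two arguments would presumably come from `getHeaders()` and `getRows()`, as in the snippet above.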

You can also get your data records as Java beans, with the help of annotations:

    public class Company {
        @Trim
        @UpperCase
        @Parsed
        private String companyName;

        @Linked(entity = "address", type = Address.class)
        private List<Address> companyAddresses;
    }

Learn more by reading the Introduction to the univocity HTML parser.

Following links and joining the data available in multiple levels of linked pages is straightforward and automatic:

    // Opens the page referenced by the link <a href="company/123/profile">View company profile</a>
    HtmlLinkFollower profileFollower = company.addField("profileUrl")
        .match("a").withText("View company profile")
        .getAttribute("href")
        .followLink();

    // Adds fields to "company" via the link follower
    profileFollower.addField("employees")
        .match("td").classes("value")
        .precededImmediatelyBy("td").withText("Company size")
        .getOwnText();

Learn more in the Link following tutorial.
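The `href` in the example above is relative, so following it means resolving it against the address of the page it was found on. The sketch below shows that step with the JDK's `java.net.URI`. It is only an illustration of relative link resolution, not the library's own code. The class `LinkResolver` and the `example.com` page address are assumptions made for the example.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the univocity API.
final class LinkResolver {

    // Resolves an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed address of the page that contains the link.
        String page = "https://example.com/search/results?page=1";

        // Relative href taken from the example link above.
        System.out.println(resolve(page, "company/123/profile"));
        // prints https://example.com/search/company/123/profile

        // An absolute href is returned unchanged.
        System.out.println(resolve(page, "https://example.org/about"));
        // prints https://example.org/about
    }
}
```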

Built-in pagination support

Pagination is handled for you, even if you are parsing historical files stored offline.

    // Visits the next page of results in <span id="nextPage"><a href="search/results?page=2">2</a></span>
    paginator.setNextPage()
        .match("span").id("nextPage")
        .match("a").getAttribute("href");

    // Follows up to 3 extra pages of results
    paginator.setFollowCount(3);

Learn more in the Pagination tutorial.
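To make the follow count concrete: a count of 3 means the first page plus at most three more, stopping early when a page has no next-page link. The sketch below simulates that rule with an in-memory map of next-page links. It is not the library's implementation. The class `PageWalker`, its `visit` method and the sample links are hypothetical stand-ins for real pages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation for illustration; not part of the univocity API.
final class PageWalker {

    // Returns the pages visited: the first page plus at most followCount extra pages.
    static List<String> visit(String firstPage, Map<String, String> nextPageLinks, int followCount) {
        List<String> visited = new ArrayList<>();
        visited.add(firstPage);
        String current = firstPage;
        for (int followed = 0; followed < followCount; followed++) {
            String next = nextPageLinks.get(current);
            // Stop when there is no next-page link or the link loops back to a visited page.
            if (next == null || visited.contains(next)) {
                break;
            }
            visited.add(next);
            current = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Five result pages, each linking to the one after it.
        Map<String, String> links = new HashMap<>();
        for (int page = 1; page < 5; page++) {
            links.put("search/results?page=" + page, "search/results?page=" + (page + 1));
        }
        // With a follow count of 3, pages 1 to 4 are visited and page 5 is not.
        System.out.println(visit("search/results?page=1", links, 3));
    }
}
```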

And more

The univocity HTML parser comes packed with other features:

Detecting changes in web pages.

Downloads and historical data management, which lets you store and re-parse HTML copies, including paginated results and followed links.

Collecting page resources such as images, JavaScript and CSS, so you can save copies of web pages like your browser does.

And much more!

Check out the HTML parser documentation to learn all about the features you can use to build your next web page parsing project.