
Introduction

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

Contributing

Like any open source project, we rely on contributions from our community. In each of our releases you can see how community patches and bug fixes drive the project forwards. (You could even contribute a much nicer public website than this one :) So please do contribute!

If, on the other hand, you make a lot of use of HtmlCleaner but can't contribute time or code, you can make a donation instead. I (Scott Wilson) maintain HtmlCleaner releases in my spare time, so any donations are very welcome!

Support

Free support is available through our community forums. Commercial support is available from Cetis LLP - email info@cetis.org.uk for more details. If your company offers HtmlCleaner support, let me know and I'll add your details here.

Latest News

HtmlCleaner has been fixing shoddy HTML since 2006! Below are the most recent releases. For more details on any release, see the release notes.

April 29th 2020: HtmlCleaner 2.24 released! Various bug fixes, and a new experimental serializer.
September 6th 2019: HtmlCleaner 2.23 released! Various bug fixes, with thanks to our friends at Screaming Frog!
April 24th 2018: HtmlCleaner 2.22 released! Various bug fixes.
May 11th 2017: HtmlCleaner 2.21 released! Bugfix release - regression in PruneTags.
May 2nd 2017: HtmlCleaner 2.20 released! Improvements to attribute handling, extra Ant properties and more.
February 7th 2017: HtmlCleaner 2.19 released! Various bug fixes and stability improvements.
November 2nd 2016: HtmlCleaner 2.18 released! Fixed a build error affecting command line use.
October 19th 2016: HtmlCleaner 2.17 released! A small number of bug fixes in this release.
December 2nd 2015: HtmlCleaner 2.16 released! Various bug fixes in this release, and more improvements to character handling.
October 1st 2015: HtmlCleaner 2.15 released! You can now specify which tags you want to allow CDATA for, and we fixed a Unicode issue.
August 24th 2015: HtmlCleaner 2.14 released! A number of improvements to the cleaning algorithm, plus some bug fixes around new HTML 5 tags.
July 1st 2015: HtmlCleaner 2.13 released! Maintenance release fixing some recursion issues.
May 15th 2015: HtmlCleaner 2.12 released! Maintenance release to fix an issue with option tags.
May 12th 2015: HtmlCleaner 2.11 released! Adds much better HTML5 support, pipelining of HTML from stdin (and XML to stdout), and more.
October 31st 2014: HtmlCleaner 2.10 released! Various small bug fixes.

How does it work?

Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:

    <table id=table1 cellspacing=2px
        <h1>CONTENT</h1>
        <td><a href=index.html>1 -> Home Page</a>
        <td><a href=intro.html>2 -> Introduction</a>

After putting it through HtmlCleaner, XML similar to the following comes out:

    <?xml version="1.0" encoding="UTF-8"?>
    <html>
        <head />
        <body>
            <h1>CONTENT</h1>
            <table id="table1" cellspacing="2px">
                <tbody>
                    <tr>
                        <td><a href="index.html">1 -&gt; Home Page</a></td>
                        <td><a href="intro.html">2 -&gt; Introduction</a></td>
                    </tr>
                </tbody>
            </table>
        </body>
    </html>

HtmlCleaner can be used in Java code, as a command line tool or as an Ant task. It is designed to be small, independent (no runtime dependencies except JRE 1.5+), fast and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery and XSLT, structured data produced by HtmlCleaner may be consumed and handled in many other ways.

Features Summary

HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.

Serializers are responsible for outputting the tree structure as XML, HTML, DOM or JDOM.

The parsing phase relies on tag descriptions, which can be customized by the user.

HtmlCleaner's behavior can be configured through a number of parameters.

HtmlCleaner is thread safe, meaning that a single instance can clean multiple HTML sources at the same time.

HtmlCleaner can be used from Java code, from the command line or as an Ant task.

HtmlCleaner requires JRE 1.5+.
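
Because the cleaned output is well-formed XML, it can be handed to ordinary XML tooling such as XPath. The following is a minimal sketch of that idea, not part of HtmlCleaner itself: it uses only the JDK's javax.xml classes, takes the cleaned XML from the "How does it work?" example as a string literal, and pulls out the link targets. The class name CleanedXmlXPathDemo, the method extractLinks and the XPath expression are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: consumes already-cleaned XML with JDK-only APIs.
class CleanedXmlXPathDemo {

    // Cleaned XML modeled on the example output, embedded as a literal.
    // In a real program this string would come from HtmlCleaner.
    static final String CLEANED_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html><head /><body><h1>CONTENT</h1>"
        + "<table id=\"table1\" cellspacing=\"2px\"><tbody><tr>"
        + "<td><a href=\"index.html\">1 -&gt; Home Page</a></td>"
        + "<td><a href=\"intro.html\">2 -&gt; Introduction</a></td>"
        + "</tr></tbody></table></body></html>";

    // Returns the href of every link inside the table with id "table1",
    // in document order.
    static List<String> extractLinks(String xml) {
        try {
            // Parse the well-formed XML into a W3C DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // Select the href attribute nodes with an XPath expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(
                "//table[@id='table1']//a/@href", doc, XPathConstants.NODESET);

            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked exception.
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(CLEANED_XML)); // prints [index.html, intro.html]
    }
}
```

The same query would fail on the original dirty HTML, since an XML parser rejects unquoted attributes and unclosed tags; that is the gap the cleaning step is meant to close.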

