Bleach is a whitelist-based HTML sanitizer and auto-linker written in Python. It is built on html5lib, was developed for AMO and SUMO, and is released under the BSD license.

Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.

For more information on using Bleach, see the README included in the source. For more on how Bleach works, read on.

Sanitizing HTML

Bleach’s clean() function uses a slightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Any markup that is not a whitelisted tag or a valid entity is escaped; legitimate entities and whitelisted tags pass through unchanged. The default whitelist is set up for AMO.
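To make the per-tag attribute whitelist idea concrete, here is a minimal sketch. It is not Bleach's implementation: Bleach builds on html5lib, while this uses the standard library's html.parser as a stand-in so it runs without dependencies. The names ALLOWED_ATTRS, WhitelistSanitizer and sanitize are hypothetical, the whitelist is not Bleach's default, and edge cases such as comments and self-closing tags are not handled carefully.

```python
from html import escape
from html.entities import html5
from html.parser import HTMLParser

# Hypothetical per-tag attribute whitelist (not Bleach's defaults).
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "em": set(),
    "strong": set(),
}


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes; escape everything else."""

    def __init__(self, allowed):
        # Keep entities as separate events so they can be validated.
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            # Unknown tag: emit its source text as escaped, inert text.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = [(k, v) for k, v in attrs if k in self.allowed[tag]]
        rendered = "".join(
            ' %s="%s"' % (k, escape(v or "", quote=True)) for k, v in kept
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        closing = "</%s>" % tag
        self.out.append(closing if tag in self.allowed else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        # Pass valid named entities through; escape unknown ones.
        if name + ";" in html5:
            self.out.append("&%s;" % name)
        else:
            self.out.append("&amp;%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def sanitize(text, allowed=ALLOWED_ATTRS):
    parser = WhitelistSanitizer(allowed)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<em onclick="x()">hi</em> <script>alert(1)</script>'))
# prints: <em>hi</em> &lt;script&gt;alert(1)&lt;/script&gt;
```

The whitelisted em tag survives without its onclick attribute, while the script tag is escaped rather than dropped, so the reader still sees what was submitted.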

Linkifying Text

The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:

<em>http://example.com</em> (should be linkified)

<a href="http://example.com">test</a> (already linked, no need to linkify)

<a href="http://example.com">http://example.com</a> (really don’t need to linkify)

<em>http://xx.com <a href="http://example.com">http://example.com</a></em> (regular expression freak-out)

So linkify() actually uses html5lib to build a document fragment and walks it, only applying the naïve regular expression in safe locations. In pseudocode:

    tree = parseFragment(input)

    linkify_nodes(tree):
        for node in tree:
            if node is a text node:
                replace node with text nodes and links
            else if node is a link:
                if nofollow:
                    set rel="nofollow" on node
            else:
                linkify_nodes(node.childNodes)

    return string(linkify_nodes(tree))
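The same idea can be sketched in runnable Python. This is an illustration under stated assumptions, not Bleach's code: it swaps html5lib's tree walk for the standard library's event-based html.parser, tracking whether the parser is currently inside a link instead of recursing over child nodes. The names URL_RE, Linkifier and linkify_sketch are hypothetical, and the URL pattern is deliberately naive (for example, it will swallow trailing punctuation and does not handle URLs containing entities).

```python
import re
from html import escape
from html.parser import HTMLParser

# Deliberately naive URL pattern; only ever applied outside <a> elements.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class Linkifier(HTMLParser):
    """Re-emit the input, linking bare URLs found outside existing links."""

    def __init__(self, nofollow=True):
        super().__init__(convert_charrefs=False)
        self.nofollow = nofollow
        self.link_depth = 0  # greater than 0 while inside an <a> element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            if self.nofollow:
                # Replace any existing rel value with "nofollow".
                attrs = [(k, v) for k, v in attrs if k != "rel"]
                attrs.append(("rel", "nofollow"))
        rendered = "".join(
            ' %s="%s"' % (k, escape(v or "", quote=True)) for k, v in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.link_depth:
            # Already inside a link: leave the text alone.
            self.out.append(data)
        else:
            self.out.append(URL_RE.sub(self._make_link, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def _make_link(self, match):
        url = match.group(0)
        rel = ' rel="nofollow"' if self.nofollow else ""
        return '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel, url)


def linkify_sketch(text, nofollow=True):
    parser = Linkifier(nofollow)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(linkify_sketch("<em>http://example.com</em>"))
# prints: <em><a href="http://example.com" rel="nofollow">http://example.com</a></em>
```

Because the regular expression only ever sees text outside a link, the fourth case above is handled correctly: http://xx.com is wrapped in a new link, and the existing link to http://example.com only gains rel="nofollow".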