Sanitize: A whitelist-based Ruby HTML sanitizer

Wednesday December 24, 2008

This is an ancient blog post that was written many years ago. It’s archived here as a historical curiosity, and is likely to contain bad writing, bad ideas, and broken links. Please don’t assume that anything here is still accurate or represents my current opinions.

Merry Christmas, Internets! My gift to you this year is Sanitize, a whitelist-based HTML sanitizer written in Ruby. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.

Using a simple configuration syntax, you can tell Sanitize to allow certain elements, certain attributes within those elements, and even certain URL protocols within attributes that contain URLs. Any HTML elements or attributes that you don’t explicitly allow will be removed.
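To make the whitelist idea concrete, here is a rough sketch of the kind of decision such a config implies. This is not Sanitize's actual implementation: the `attribute_allowed?` helper and the sample `config` hash are hypothetical and written only for illustration, and the sketch makes the simplifying assumption that a URL with no protocol at all is rejected.

```ruby
# Hypothetical whitelist: allowed elements, allowed attributes per element,
# and allowed URL protocols per attribute.
config = {
  :elements   => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
  :protocols  => {'a' => {'href' => ['http', 'https', 'mailto']}}
}

# Hypothetical helper (not part of Sanitize): returns true only if the
# element, the attribute, and (where restricted) the URL protocol are all
# explicitly allowed by the config.
def attribute_allowed?(config, element, attribute, value)
  return false unless config[:elements].include?(element)
  return false unless (config[:attributes][element] || []).include?(attribute)

  protocols = (config[:protocols][element] || {})[attribute]
  return true if protocols.nil? # no protocol restriction on this attribute

  # Pull out the leading "scheme:" part, if any. In this sketch a value
  # without a scheme is rejected when a protocol list is present.
  scheme = value[/\A\s*([a-z][a-z0-9+.\-]*):/i, 1]
  !scheme.nil? && protocols.include?(scheme.downcase)
end

puts attribute_allowed?(config, 'a', 'href', 'http://foo.com/')     # true
puts attribute_allowed?(config, 'a', 'href', 'javascript:alert(1)') # false
puts attribute_allowed?(config, 'a', 'onclick', 'alert(1)')         # false
puts attribute_allowed?(config, 'span', 'class', 'highlight')       # true
```

The important property is that everything defaults to "no": anything not named in the config is dropped rather than passed through.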

Because it’s based on Nokogiri, a full-fledged HTML parser, rather than a bunch of fragile regular expressions, Sanitize has no trouble dealing with malformed or maliciously formed HTML. When in doubt, Sanitize always errs on the side of caution.
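As a small illustration of why regular expressions are fragile here, consider a naive filter that tries to strip script elements. The `naive_strip_scripts` method below is a hypothetical example written for this post, not code from Sanitize or any other library; it only shows one well-known way such filters can be tricked.

```ruby
# Hypothetical regex-based "sanitizer": removes anything that looks like
# a complete <script>...</script> element.
def naive_strip_scripts(html)
  html.gsub(/<script>.*?<\/script>/mi, '')
end

# Input crafted so that deleting the inner, empty script elements stitches
# the surrounding fragments together into a brand new script element.
malicious = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>'

result = naive_strip_scripts(malicious)
puts result
# => <script>alert(1)</script>
```

The filter did exactly what its pattern says, and the output is still dangerous. Approaches that work on the parsed document rather than on the raw string are generally much harder to fool this way.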

Using Sanitize is easy. First, install it:

gem install sanitize

Then call it like so:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html)
# => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in configs to tell Sanitize to allow certain attributes and elements:

Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'

Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Or, if you’d like more control over what’s allowed, you can provide your own custom configuration:

Sanitize.clean(html,
  :elements   => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
  :protocols  => {'a' => {'href' => ['http', 'https', 'mailto']}})

For more details, see the Sanitize Documentation.