Dealing with XSS is simple if you do not need to allow HTML code in the content you receive from clients. If you are only dealing with simple strings or integers, avoiding XSS is extremely easy:

Encode all user content
You can escape all HTML using either URL encoding or HTML entity encoding (covered in the next slide).

Strip all tags
The downside of encoding HTML entities is that if someone used HTML for benign purposes (e.g. to emphasize part of their message with <h1>, <b> or <i>), the escaped HTML will look ugly on the resulting page. Common practice is therefore to strip all HTML tags. Be careful, though: some HTML can be very complex, so strip the tags and then also encode the result. That way, even if something slips past the stripping, it merely looks ugly instead of leaving you vulnerable.

Cast all integers
Often you know that a piece of client input is supposed to be an integer. In that case, cast it to int -- it is much faster than stripping tags or encoding.
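The three rules above can be sketched in a few lines. This is a minimal illustration in Python (the author's own code was PHP); the function names are mine, and the tag-stripping regex is deliberately naive -- which is exactly why it is followed by encoding:

```python
import html
import re

def encode_user_content(text):
    # HTML-entity-encode everything: <b> becomes &lt;b&gt; and renders as plain text.
    return html.escape(text, quote=True)

def strip_then_encode(text):
    # Strip anything that looks like a tag first...
    stripped = re.sub(r'<[^>]*>', '', text)
    # ...then encode the result, so anything the strip missed is at worst ugly,
    # never executable.
    return html.escape(stripped, quote=True)

def to_int(value):
    # For fields that must be integers, casting is both safe and cheap.
    return int(value)
```

Note that strip-then-encode is belt and suspenders: the encoding step is what actually guarantees safety; stripping only improves how the output looks.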

If you do need to allow some HTML, you are in for a world of pain. "Filtering out bad HTML" is a very, very, very hard thing to do. Do not try it, unless you have no other choice. Especially do not try to do it with regular expressions -- they are simply not well-suited to the task. It can be done, but your code will look like a wall of ASCII soup.

The trouble with trying to "filter bad HTML" is that there are many, many ways to sneak malicious content into otherwise inconspicuous-looking HTML code. There are pitfalls like UTF-7 encoding, entity encoding, URL encoding, etc. Then there is Internet Explorer, whose goal in life appears to be to try and execute anything that may possibly be executable code.
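To make the point concrete, here is a deliberately naive blacklist filter in Python (a made-up example, not taken from any real library): it catches the textbook <script> payload but is trivially bypassed by an event-handler attribute, one of the many vectors such filters miss.

```python
import re

def naive_filter(html_in):
    # A naive blacklist: remove <script> tags and nothing else.
    return re.sub(r'(?i)</?script[^>]*>', '', html_in)

# The blacklist catches the obvious payload...
blocked = naive_filter('<script>alert(1)</script>')
# ...but an onerror handler sails straight through untouched:
bypassed = naive_filter('<img src=x onerror=alert(1)>')
```

Every vector you blacklist invites the next one you did not think of; that is why whitelist-based parsing, not regex blacklisting, is the only approach that holds up.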

There are a number of projects out there that do a fairly good job at HTML filtering. I wrote one for SquirrelMail back in 2004, but I haven't really maintained it since 2005, because I ended up rewriting it mostly from scratch in 2006 as part of McGill's website. My solution was to run HTML Tidy first, to make sure I was always dealing with valid XML, and then try to clean it up. Even so, the resulting code was over 2500 lines of PHP.

If you do end up going the route of filtering out bad HTML, don't reinvent the wheel -- use one of the available libraries. Chances are they have already done a better job than you could from scratch. Because such filters are regularly found vulnerable to one thing or another, a sensible approach is to save HTML-containing content in the database unfiltered and filter it on output, so that even if a previous version of the filter missed some exploit, the latest version will catch it. This, of course, is computationally heavy, so you will need some sort of caching mechanism (which your app probably uses anyway). Another approach is to re-filter all stored content whenever you update the filtering code, to make sure no malicious content is left over from the previous, vulnerable version of the filter.
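The filter-on-output approach with caching can be sketched as follows. This is an illustrative Python sketch, not a real library's API: FILTER_VERSION, the in-memory cache dict, and the sanitize() stand-in are all my own names, and a real app would call an actual sanitizer library and use a shared cache such as memcached.

```python
import hashlib
import html

FILTER_VERSION = 3  # bump this whenever the sanitizer changes

cache = {}  # hypothetical in-memory cache; a real app would use memcached or similar

def sanitize(raw_html):
    # Stand-in for a real filtering library; here we just entity-encode.
    return html.escape(raw_html)

def render(raw_html):
    # Key the cache on both the content and the filter version, so a filter
    # upgrade automatically invalidates every stale sanitized copy.
    key = (hashlib.sha256(raw_html.encode()).hexdigest(), FILTER_VERSION)
    if key not in cache:
        cache[key] = sanitize(raw_html)
    return cache[key]
```

Keying the cache on the filter version gives you the best of both approaches in the paragraph above: stored content stays raw, and bumping FILTER_VERSION effectively re-filters everything without a batch job.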