A lot of the support work that we do here at Anchor involves looking at websites. You could say that we’ve seen a few websites in our time. Something we come across pretty frequently is inadequate protection when it comes to handling user-submitted form data and URLs.

This might not seem like a big deal, but it has some pretty big security implications, mostly relating to cross-site scripting. These problems can enable malicious activity like leaking of private data.

The short version is that user-supplied data can never be trusted, and you need to carefully escape and format the data to make it safe for the intended use, such as printing it on a webpage.

A very simple example

Let’s say you run a site that accepts news tips from readers, something like Hacker News or Slashdot. Visitors can submit a URL and a comment, which other visitors will see. Normal people submit URLs like “http://example.com”, which you drop into the HTML and all is well.

A malicious visitor appears and decides to submit a “URL” like this:

"/><script>location="http://cheapv1agrapillz.example.com/"</script>

If you print this directly to the page inside an <a> tag, visitors’ browsers will dutifully jump away to the site selling shady meds.
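
To make the mechanics concrete, here is a minimal sketch of the unsafe pattern. The helper name render_link_unsafely() and the link text are made up for illustration; the point is only that the submitted value is concatenated into the markup untouched.

```php
<?php
// Hypothetical helper that builds a link the unsafe way: the submitted
// value is dropped straight into the href attribute with no escaping.
function render_link_unsafely(string $submitted_url): string
{
    return '<a href="' . $submitted_url . '">reader tip</a>';
}

$payload = '"/><script>location="http://cheapv1agrapillz.example.com/"</script>';

// The leading "/> closes the attribute and the tag, so the <script>
// element that follows is parsed as markup rather than as part of the URL.
echo render_link_unsafely($payload), "\n";
```

The browser never sees a URL at all here: the first two characters of the payload end the href attribute and the tag, and everything after that is treated as ordinary page content.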

What are we going to do about it?

Clearly that supposed URL shouldn’t have been used without being escaped. Once escaped it still wouldn’t be a valid URL, but it would be harmless and the page would behave as expected. This is a broad issue that appears in a number of realms, SQL, JSON and XML to name a few, but today we’re just focusing on URLs.

Ideally your programming language of choice will have a library to correctly mangle URLs and make them safe for printing, but sadly this isn’t always the case. This means you need to do it yourself.

It’s easy to get most of the way there with very little effort, but unless you’re very careful you’re likely to miss something that attackers can exploit. This isn’t a hypothetical issue either: there are automated tools that make it very easy for them.

Let’s break it down

RFC 3986 defines the structure of URLs (technically URIs, but we’re glossing over that fact). This is an example of a fully-featured HTTP URL:

http://example.com/over/there?name=ferret#nose
\__/   \_________/\_________/ \_________/ \__/
 |          |          |           |       |
scheme domain name    path       query  fragment

You won’t always be using all parts of the URL, but the thing to note is that each component needs to be handled a bit differently, which matters if user-supplied data is ever used to build a URL component.

Domain name

The term the RFC uses here is “reg-name” – basically a domain name, but IP addresses are also valid. IP addresses are fairly self-explanatory, but here’s what the spec has to say about reg-names:

reg-name = *( unreserved / pct-encoded / sub-delims )

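Real-world usage is actually more strict than this implies. In practice, hostnames are limited to alphanumeric characters, dots and hyphens. Underscores do turn up in some DNS names, but they aren’t valid in hostnames proper, so think twice before accepting them.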
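
If you need to sanity-check a user-supplied domain name, something along these lines is one way to go about it. This is a deliberately conservative sketch, not a full validator: the function name looks_like_hostname() is our own invention, it sticks to letters, digits, dots and hyphens (loosen the pattern if you really do need underscores), and it makes no attempt to handle IPv6 literals or internationalised names.

```php
<?php
// Hypothetical helper: accepts only letters, digits, dots and hyphens,
// and rejects empty labels or labels that start or end with a hyphen.
function looks_like_hostname(string $host): bool
{
    // 253 characters is the usual upper bound for a full domain name.
    if ($host === '' || strlen($host) > 253) {
        return false;
    }
    foreach (explode('.', $host) as $label) {
        // Each dot-separated label: 1 to 63 characters, no edge hyphens.
        if (!preg_match('/\A[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\z/i', $label)) {
            return false;
        }
    }
    return true;
}

var_dump(looks_like_hostname('example.com'));   // bool(true)
var_dump(looks_like_hostname('"/><script>'));   // bool(false)
```

Rejecting anything that fails the check is generally safer than trying to repair it.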

Path

The path is a hierarchy of slash-separated segments, which may be empty (indicated by a single slash). The grammar for segments is fairly straightforward, but worth noting if you plan on embedding user-supplied data in there (e.g. a username as a directory path).

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

For reference, those three character classes are defined as:

unreserved = a-zA-Z0-9 - . _ ~
sub-delims = ! $ & ' ( ) * + , ; =
pct-encoded = % symbol followed by 2 hex digits (0-9A-F)

Some of the valid characters might surprise you – various punctuation characters are perfectly legal and don’t need to be percent-encoded.

A reasonable way to handle this in PHP is to use rawurlencode() on each segment. If you need to process an existing URL, explode() on slashes, rawurlencode each segment, then implode() it back together again.

rawurlencode is quite strict and will percent-encode anything that’s not an unreserved character, plus the tilde (~) in versions of PHP older than 5.3.0. Over-encoding like this is harmless, but if any of the segments are already percent-encoded they will be encoded a second time (%20 becomes %2520), which breaks the URL.
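
The explode/encode/implode approach described above can be sketched as follows. The helper name encode_path() is hypothetical, and it assumes the path it’s given has not been percent-encoded yet.

```php
<?php
// Hypothetical helper: percent-encodes each slash-separated segment of a
// path that is NOT already percent-encoded.
function encode_path(string $path): string
{
    $segments = explode('/', $path);                  // split on slashes
    $encoded  = array_map('rawurlencode', $segments); // encode each piece
    return implode('/', $encoded);                    // glue back together
}

echo encode_path('/users/joe bloggs/photos & notes'), "\n";
// Prints: /users/joe%20bloggs/photos%20%26%20notes
```

Because the slashes are removed before encoding and restored afterwards, they survive as separators. Feeding in an already-encoded path such as /a%20b would give /a%2520b, so decode first (or skip this step) if the input might be encoded already.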

Query

This is where things get interesting, and where most trouble occurs. The query string, if present, starts at the first question-mark (recall that up until now, un-encoded question-marks haven’t been legal).

The HTTP query string is remarkably free-form and follows similar rules to path segments, but also allows slashes and question-marks.

query = *( pchar / "/" / "?" )

You can in theory use rawurlencode() on a query string and be done with it, but this ignores the very common practice of supplying GET parameters in the URL as key-value pairs. To handle this correctly, keys and values should be encoded separately, joined into “encodedKey=encodedValue” pairs, which are then glued together with ampersands (&) separating them.
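
For the curious, encoding the pairs by hand might look something like this. It’s a sketch only: build_query_manually() is a made-up name, and it assumes a flat array of scalar keys and values (no nested arrays).

```php
<?php
// Hypothetical helper: builds a query string from a flat array of
// scalar keys and values, encoding each key and value separately.
function build_query_manually(array $params): string
{
    $pairs = array();
    foreach ($params as $key => $value) {
        // Encode both sides on their own, then join them with "=".
        $pairs[] = rawurlencode((string) $key) . '=' . rawurlencode((string) $value);
    }
    // Only the separators we add ourselves are left as literal ampersands.
    return implode('&', $pairs);
}

echo build_query_manually(array('q' => 'cats & dogs', 'page' => 2)), "\n";
// Prints: q=cats%20%26%20dogs&page=2
```

The ampersand inside the value is encoded as %26, so it can’t be mistaken for a pair separator.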

This can be done manually, but it’s recommended that you use http_build_query() instead, a function added in PHP 5. It handles key-value pairs and escapes them correctly. Even this awful example doesn’t faze it:

$query_params = array();
$query_params["foo=bar"] = "bar nbva?sdf";
echo "?" . http_build_query($query_params);
// Prints: ?foo%3Dbar=bar+nbva%3Fsdf

In general, use rawurlencode() for simple query strings, and http_build_query() for key-value GET parameters.
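
Side by side, the two cases look roughly like this. The search term, parameter names and example.com paths are placeholders of ours; the separator is passed explicitly to http_build_query() so the result doesn’t depend on your php.ini settings.

```php
<?php
// A simple, free-form query string: encode the whole thing in one go.
$search = 'ferrets & stoats?';
$simple = 'http://example.com/search?' . rawurlencode($search);
echo $simple, "\n";
// Prints: http://example.com/search?ferrets%20%26%20stoats%3F

// Key-value GET parameters: let http_build_query() do the pairing.
$params = array('name' => 'ferret', 'colour' => 'black & white');
$paired = 'http://example.com/over/there?' . http_build_query($params, '', '&');
echo $paired, "\n";
// Prints: http://example.com/over/there?name=ferret&colour=black+%26+white
```

Note that http_build_query() encodes spaces as plus signs by default, while rawurlencode() uses %20. Both are understood when PHP parses GET parameters on the receiving end.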

Fragment

We’re finally at the end of the URL. If present, the fragment generally identifies anchors on the page in question, though in recent years it’s been used for all sorts of Web 2.0 navigation trickery. This is a good reason to ensure that any hashes in the query string are properly escaped, lest they trick the browser into parsing the remainder of the URL as a fragment.

Fragments are subject to the same restrictions as a plain query string.

fragment = *( pchar / "/" / "?" )
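
Here’s a small sketch of both points: a hash inside a query value gets escaped, and the fragment itself is encoded before being appended. The tag value and fragment text are invented for the example.

```php
<?php
// A "#" inside a query value must be escaped so it can't start the fragment.
$query = http_build_query(array('tag' => '#php'), '', '&');

// The fragment gets the same treatment as a plain query string.
$fragment = rawurlencode('section 2/notes');

$url = 'http://example.com/over/there?' . $query . '#' . $fragment;
echo $url, "\n";
// Prints: http://example.com/over/there?tag=%23php#section%202%2Fnotes
```

The only literal hash left in the result is the one we added ourselves. rawurlencode() also encodes the slash in the fragment, which the grammar doesn’t strictly require, but over-encoding here does no harm.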

Putting it on the page

Whew, so you’ve got your carefully escaped URL, now what? You format it as HTML on the page. It’s not quite ready for use in href attributes though: if the URL contains any ampersands, dropping it in as-is almost certainly won’t produce valid HTML.

To fix this, you can use htmlspecialchars() to encode troublesome characters and make them safe for inserting into HTML.

By default, htmlspecialchars encodes ampersands, angle brackets and double-quote (") characters. Before PHP 8.1 it leaves single-quotes (') alone unless told otherwise, so if you use single-quotes for your attribute values (e.g. href='foo') then you’ll have problems with URLs containing single-quotes. To dodge this, you can explicitly instruct htmlspecialchars to encode both single and double quotes, like so:

$href = htmlspecialchars($url, ENT_HTML401 | ENT_QUOTES);
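
As a rough end-to-end illustration, here’s a URL containing both an ampersand and a single-quote being placed into a single-quoted href. The URL and link text are placeholders, and the ENT_HTML401 constant assumes PHP 5.4 or newer.

```php
<?php
// A URL that is already valid as a URL: "'" and "&" are both legal here.
$url = "http://example.com/o'reilly?a=1&b=2";

// Escape for the HTML context, covering both kinds of quote.
$href = htmlspecialchars($url, ENT_HTML401 | ENT_QUOTES);
echo $href, "\n";
// Prints: http://example.com/o&#039;reilly?a=1&amp;b=2

// Safe to use in a single-quoted attribute now.
$link = "<a href='" . $href . "'>link</a>";
echo $link, "\n";
```

The browser decodes the entities before following the link, so the visitor still ends up at the original URL.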

Wrap up

That’s pretty much it. If your environment has library functions for handling URLs, like Drupal’s url(), do use them. Otherwise, we hope this guide has given you a better understanding of how URLs work and what the hazards are.

Like all guides, things may change over time and it’s possible there are mistakes. If that’s the case, or you just have some questions, we’d love to hear from you.