Pragmatism in URL design

I was at the 2nd Linked Data Meetup London last week. It was great, and there was a lot of consensus and momentum built, but one of the smallest remarks by Tom Scott generated the most debate (partly egged on by me).

Tom was discussing how the BBC uses "the web as its CMS" for some topics, such as using Music Brainz as a source for discography information, and Wikipedia as a source for introductions on various topic pages, such as those on the BBC Wildlife Finder about animals. As well as simply using the data though, the BBC also uses the identifiers from those sites in its URLs. For Music Brainz, these look like long strings of random letters, numbers and hyphens (the Beatles, for instance, are b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d ), which then form the end part of a URL at http://www.bbc.co.uk/music/artists (eg see the page for the Beatles). Wikipedia, though, uses 'human readable' identifiers, which look like words, with capital letters where appropriate, separated by underscores. So, for instance, the identifier for the Ursus arctos, commonly known as the 'brown bear', is simply Brown_Bear , and this too is used in the URLs for both the Wikipedia and BBC pages for that animal.
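To make the two styles of identifier concrete, here is a rough Python sketch. The base URL and the two example identifiers come from the paragraph above; the function names `artist_url` and `readable_identifier` are made up for illustration, and neither is meant to describe how the BBC, Music Brainz or Wikipedia actually build their URLs.

```python
import uuid
from urllib.parse import quote

# Base URL quoted in the post; everything else here is an illustrative assumption.
BBC_ARTISTS_BASE = "http://www.bbc.co.uk/music/artists"


def artist_url(opaque_id: str) -> str:
    """Append an opaque, UUID-shaped identifier to the artists base URL."""
    # uuid.UUID raises ValueError if the string is not a well-formed UUID,
    # so a mistyped identifier fails loudly instead of producing a bad URL.
    canonical = str(uuid.UUID(opaque_id))
    return f"{BBC_ARTISTS_BASE}/{canonical}"


def readable_identifier(title: str) -> str:
    """Turn a page title into a word-based identifier: spaces become underscores."""
    # quote() percent-encodes anything that still isn't URL-safe after the swap.
    return quote(title.strip().replace(" ", "_"))


beatles = artist_url("b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d")
bear = readable_identifier("Brown Bear")
```

The opaque form tells a reader nothing about the artist, but it never has to change when a name does; the readable form is the other way round.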

There are some good, pragmatic reasons why Music Brainz chose random strings, but Wikipedia chose words. Music Brainz has data on far more artists, albums and tracks than there are Wikipedia pages, and a lot of their data is generated far more automatically and quickly than Wikipedia's (eg by ripping CDs and uploading ID3 tags). Wikipedia pages, by contrast, are all created by hand, and there's a strong requirement to only have one page per topic, so matching page names to the URLs is both doable and helps to enforce unique page titles.

The biggest problem with URLs that use human-readable paths and identifiers, though, is that names can and do change. And if the name is tied to the URL, then when a name changes, you either leave the URL as it is (and live with the mismatch), or update the URL, with a redirect from the old one (and live with the fact that this stops all the URLs being 'permanent'). Wikipedia currently follows the second option. WordPress, by default, follows the first option (although you can manually rename URL slugs to override this if you like).

So, you can see that there are pros and cons to having human-readable vs non-human-readable URIs. Within the realm of Linked Data, URIs are really important (as they identify concepts, as well as web pages), and so the design of URLs has to be even more carefully considered. To return to the meetup last week, Tom's comment, which has prompted a fair bit of debate on this topic, was that "persistance beats human-readability", and he seemed to regret having human-readable URIs on the BBC Wildlife Finder*. I disagreed, and a debate then ensued on the relative merits of human-readability vs persistence in URI design.

There were a range of opinions on this. Chris Sizemore commented that "I'm honest, I'm not as worried abt persistance as I prob shld b. Web got by, why can't SemWeb?" Tom replied that "of course web of doc does get broken coz urls change. It's quite expensive to keep link checking (incl what doc it points to)" and then suggested that "the whole opaque vs human readable URL thing is largely religious. Experience dictates which you think is more important".

There followed a bunch of comments suggesting that it was more a pragmatic, rather than a "religious" decision, weighing up the relative costs and benefits of each approach, and Michael Smethurst pointed out that creating human-readable identifiers can be expensive.

I then suggested that "not even opaque URLs can be 100% persistent. Concepts change, merge and split", and Chris Sizemore added "we should design [for] less than 100% persistance in URIs [and] promote 'healing' mechanisms". I'm not sure that there are any real answers in this area yet - it's still something to be investigated, and something the Linked Data community will have to deal with. I suggested that "HTTP already supplies some of the healing mechanisms: redirects, 300 Multiple Choice, 410 Gone, etc" and Michael suggested that "changed concepts, merged concepts and split concepts are new concepts so need new uris...", which are two contrasting but not un-complementary methods. (Take the example where a company 'splits', by spinning out a part of its business into a new one, and keeping the rest running under the same name and brand - whether the larger part of the 'split' should retain the same URI is a decision that can either be philosophical or pragmatic).
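As a rough illustration of what those HTTP 'healing' mechanisms could look like in practice, here is a small Python sketch. It assumes a hypothetical in-memory registry that records what happened to each identifier; the `Record` class, the state names and the example identifiers are all invented for this post, not taken from any real system.

```python
from dataclasses import dataclass, field
from http import HTTPStatus


@dataclass
class Record:
    """What we know about one identifier (hypothetical structure)."""
    state: str  # one of "live", "moved", "split", "gone"
    targets: list = field(default_factory=list)  # successor identifiers, if any


def resolve(registry: dict, identifier: str) -> tuple:
    """Return (HTTP status code, list of identifiers the client could follow)."""
    record = registry.get(identifier)
    if record is None:
        return HTTPStatus.NOT_FOUND.value, []
    if record.state == "live":
        return HTTPStatus.OK.value, [identifier]
    if record.state == "moved":
        # Renamed or merged concept: a permanent redirect to the single successor.
        return HTTPStatus.MOVED_PERMANENTLY.value, record.targets[:1]
    if record.state == "split":
        # Split concept: offer every successor and let the client choose.
        return HTTPStatus.MULTIPLE_CHOICES.value, list(record.targets)
    if record.state == "gone":
        # Deliberately retired, as opposed to never having existed.
        return HTTPStatus.GONE.value, []
    raise ValueError(f"unknown state: {record.state!r}")


# Invented identifiers for a company that spins part of itself out.
registry = {
    "old-co": Record("split", ["old-co-retail", "new-co-logistics"]),
    "old-name": Record("moved", ["new-name"]),
    "new-name": Record("live"),
    "retired-thing": Record("gone"),
}
```

Note that this only covers the HTTP side. Whether the larger half of a split keeps its old identifier (Michael's point) is a modelling decision that has to be made before anything goes into the registry.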

I've come to the conclusion that URL design has to be done pragmatically, balancing lots of different factors, including readability of the identifiers, the 'hackability' of the path structure (ie the number of slashes to include), overall length, ease/costs of producing them, and overall the effect that all these things have on the ability for the URL to be 'permanent'. It may be one of the axioms of the web that URIs are opaque, and that machines "should not look at the contents of the URI string to gain other information", but there are lots of ways in which humans don't follow this principle:

URLs appear in search results (eg on Google), and users use these to help form judgements on which links to click. For this reason, Google recommends making them "intelligible to humans".

URLs get printed, and shorter ones take up less space.

URLs get typed in (occasionally), and so shorter ones are quicker to type. Characters which look similar (eg capital I and the number 1) can cause problems.

URLs get used on Twitter, where they have to be really short (otherwise URL shorteners get used).

URLs get e-mailed, and recipients might use the URL to help understand what the link is about (if it's not obvious from the e-mail).

If URLs contain a lot of structure (eg domain.com/section/subsection/article/), users will sometimes try to navigate 'upwards' by removing words from the end of a URL up to a particular slash.

Developers sometimes use URLs as an API (using REST and content negotiation), and so the path structure in URLs helps to communicate how the data is structured (eg what the types of things are, and how they are related). You can argue that developers shouldn't do this, but hey, they're human too, and it can be a handy first step to understanding.

There are probably other ways in which URLs are treated as non-opaque by humans too.
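The 'navigating upwards' habit in the list above is easy to sketch in Python. This assumes the domain.com/section/subsection/article/ example is served over http and that every parent path ends in a slash; `parent_urls` is a name made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url: str) -> list:
    """List the URLs a user would reach by trimming path segments off the end."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing segment at a time, finishing at the site root.
    for end in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:end])
        if end:
            path += "/"
        # Query string and fragment are discarded, as a user trimming by hand would.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


trail = parent_urls("http://domain.com/section/subsection/article/")
```

If any of the URLs in `trail` returns a 404, then the path structure is promising something that the site doesn't deliver, which is worth knowing before you commit to a deeply nested design.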

I'm interested in how various sites and services negotiate this minefield of options and trade-offs. Here are some examples which I think are interesting:

BBC programmes are given short ids (called 'PIDs'), which use both letters and numbers, to keep them short (compared to using just numbers), but don't use vowels (to avoid accidentally spelling out rude words).

ISBNs consist of 10 or 13 digits, but they're not completely 'opaque', as the publishers are allocated a range of numbers, and so you can tell from the ISBN who the publisher is (and there are books and services to help you do this). There's also a 'check digit', which means that if you make a mistake typing in an ISBN, a system can tell you that you've made a mistake, rather than showing you a different book.

Wikipedia URLs use the title of the page, but spaces are converted to underscores, as spaces aren't valid characters in URLs, and so would have to be escaped as '%20', which looks ugly.

Most of the URLs on Flickr start with http://www.flickr.com/photos/ , even for videos - which just goes to show that sometimes your website might change in ways that your initial URL design didn't anticipate, which you'll then need to make a decision on.

Tag URLs, as implemented on Flickr, Delicious, Last.fm, and similar sites, all use the tag name in the URL. This feels pretty natural and obvious, mostly because tags are essentially only a string of characters anyway, with no implied semantics, and so it's fairly safe to assume a 1-to-1 mapping with the URL.
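Of the examples above, the ISBN check digit is the easiest to show in code. Here is a minimal Python sketch of the standard ISBN-10 and ISBN-13 checksum rules; the function name `is_valid_isbn` is my own, and it only checks the arithmetic, not whether the number has actually been allocated to a publisher.

```python
DIGITS = "0123456789"


def is_valid_isbn(isbn: str) -> bool:
    """Check the check digit of an ISBN-10 or ISBN-13 (hyphens and spaces ignored)."""
    chars = [c for c in isbn.upper() if c not in "- "]

    if len(chars) == 10:
        # The first nine characters must be digits; the last may be X (meaning 10).
        if any(c not in DIGITS for c in chars[:9]):
            return False
        if chars[9] not in DIGITS + "X":
            return False
        values = [int(c) for c in chars[:9]]
        values.append(10 if chars[9] == "X" else int(chars[9]))
        # Weights run 10, 9, ..., 1 and the weighted sum must be divisible by 11.
        total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
        return total % 11 == 0

    if len(chars) == 13:
        if any(c not in DIGITS for c in chars):
            return False
        # Weights alternate 1, 3, 1, 3, ... and the sum must be divisible by 10.
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
        return total % 10 == 0

    return False
```

So `is_valid_isbn("0-306-40615-2")` passes, while changing the last digit to 3 fails, which is exactly the 'you've made a mistake' behaviour described above. An opaque random identifier gives you no such safety net unless you design one in.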

I might try and add to these lists as I discover other interesting examples, but for now, my message is that you have to think hard about URL design, weigh up the different options, and then be pragmatic about it...

P.S. I've just remembered that I wrote a post about 'how the BBC iPlayer broke its URLs' a couple of years ago, which might also be relevant to this discussion...

* Tom later clarified that the bit he regretted was "including the /species/ etc. in the URL", rather than re-using Wikipedia identifiers, though others have suggested there have been problems with this too.

Other blog posts on this topic:

Updates: (9:31PM) Added a couple of extra links and examples. (9:54pm) Added link to Matt's blog post. (Mar 1) Added P.S with link to an old blog post of mine.