Let us negotiate

Or: why I’m unsatisfied with almost everything available for web authoring, and what the acronym “HTTP” has to do with it all.

By Leonardo Boiko

The m17n problem

I am a Brazilian. I have friends who can understand Portuguese but not English. I like to play with the Internet, so I know a lot of people who can understand English and not Portuguese. Being a Japanese culture enthusiast, I also have some friends who can understand Japanese better than anything else.

I can write in Portuguese and (passably) in English. I also hope to be able to produce Japanese content in a few years. Since many of my friends are not technical people, I’d like to provide alternative language versions of my writings in the easiest possible manner.

If only I could send alternative language versions of content based on user preferences, I’d be set. There’s a catchy techword for that, m17n (for “multilingualization”). But I could find no web software to make this easy — not even cherished software such as the Wordpress weblog, the Ruby on Rails framework or the Gallery image management scripts.

The XHTML problem

I’m a computer guy. I can read and appreciate the technical standards that make the Internet possible. I also consider myself a poet, and as such, I crave beauty in everything I create — including webpages.

The common standard for webpages, HTML, is very messy. There’s a much nicer alternative, called XHTML, available since January 2000. Unfortunately, the most popular web browser (which also happens to be the ugliest, slowest and most unsafe of web browsers) does not care about XHTML. Indeed, it doesn’t even know what XHTML is. To deal with problems like that, the folks at the W3C made XHTML 1.0 backwards compatible.

Backwards compatibility almost always means ugliness, and XHTML 1.0 is ugly. It has useless redundancy and arbitrary restrictions to make it feel like HTML when seen by that crappy browser. There’s a 2001 standard, XHTML 1.1, which is basically 1.0 minus the ugliness. It’s XHTML I can take pleasure in writing. Nowadays every browser that properly supports XHTML 1.0 also supports 1.1, since the newer standard is mostly a subset of the older one (the only new thing is support for ruby annotations, which are very useful for Japanese). Further, there’s no reason at all to send “compatible” XHTML 1.0 to that bad browser. You gain nothing by doing so and might as well send plain old HTML.

In a nutshell: HTML sucks, and XHTML 1.0 sucks too.

I sat down and hacked an XSLT script to automatically convert XHTML 1.1 documents to HTML. Now I can write beautiful hand-crafted XHTML code and, when needed, generate HTML from it. If only there were some way of asking the browser what kind of document it prefers. But once again I could not find any projects to help me do this, not even the popular ones.

HTTP rulez

As it turns out, the two problems described above are already solved, and they were solved ten years ago. By HTTP, no less. Yep, HTTP, those four letters before every Web URL. HTTP 1.1 has a mechanism for sending content in different languages and media types, called “content negotiation”, which is very nice. It’s also completely ignored by developers. For example, it’s painful to change your language preferences in major browsers (My Mom could never do it). Further, I could find no web project taking full advantage of the possibilities offered by content negotiation. By this point I wasn’t surprised anymore.
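The mechanism really is simple. The browser sends an Accept-Language header listing the languages the user understands, each with a “q” quality value between 0 and 1; the server picks the best available match. A minimal sketch of the parsing side (the function name is my own, not from any library):

```ruby
# Parse an Accept-Language header into [tag, quality] pairs,
# ordered best-first, as defined by HTTP/1.1 content negotiation.
# A language tag with no ";q=" parameter defaults to quality 1.0.
def parse_accept_language(header)
  header.split(',').map do |part|
    tag, q = part.strip.split(';')
    quality = q ? q[/q=([\d.]+)/, 1].to_f : 1.0
    [tag, quality]
  end.sort_by { |_, quality| -quality }
end

parse_accept_language('pt-BR, en;q=0.7, ja;q=0.3')
# => [["pt-BR", 1.0], ["en", 0.7], ["ja", 0.3]]
```

A server holding Portuguese and English versions of a page would walk that list and serve the Portuguese one.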

I’m starting to think that people simply do not care about standards. They ignore that the Unicode standard solved everything about character set encodings and insist on using legacy national standards. They ignore that XHTML and XML solved everything needed for m17n and refuse to acknowledge mechanisms like xml:lang and hreflang and <link rel="alternate" hreflang="xx">. And they seem to think of HTTP as some boring old network thingy. It’s not. It’s a very powerful and elegant mechanism for transmitting information. The more I read into it, the more I am amazed at how much HTTP is underused in Wordpress or Ruby on Rails or Gallery. HTTP rulez.
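The markup side costs almost nothing to use. A hand-made illustration (the filenames are invented) of an XHTML page declaring its own language and pointing at its alternates with exactly those mechanisms:

```xml
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Let us negotiate</title>
    <!-- the same document, available in other languages -->
    <link rel="alternate" hreflang="pt" href="negotiate.pt.xhtml" />
    <link rel="alternate" hreflang="ja" href="negotiate.ja.xhtml" />
  </head>
  <body>
    <p>Inline links can carry language hints too:
      <a href="negotiate.pt.xhtml" hreflang="pt" xml:lang="pt">versão em português</a>.
    </p>
  </body>
</html>
```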

Update: the page referenced above seems to be offline frequently now, here's the google cache. Recommended reading. In case it’s lost forever, I’ll copy Tomayko's list of things web frameworks should do, which is a nice summary of the problem:

Help implement content negotiation properly (hell yes!).

Proper, smart use of HTTP caching facilities (If-Modified-Since, Expires, Cache-Control, etc.)

Make dealing with media types easy.

Make dealing with character encodings easy.

Encourage the use of standard HTTP authentication schemes (this is another underused but powerful HTTP area).

Provide a sane mechanism for attaching different behavior to different verbs for a single resource.

Help ensure that URIs stay cool.

Make dealing with transfer encodings (gzip, compress, etc.) easy.

Help you use the entire range of response status codes properly (e.g. nearly all dynamic content returns either a 200 or 500).

Help ensure GET requests are safe and idempotent.
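To make the caching point from that list concrete, here is a rough sketch of a conditional GET (the function and the hash-shaped response are my own illustration, not any framework’s API): if the client’s If-Modified-Since date is at least as recent as the resource, the server can answer 304 Not Modified and skip the body entirely.

```ruby
require 'time'  # adds Time.httpdate for parsing RFC 1123 dates

# Honour If-Modified-Since: reply 304 when the client's cached copy
# is still fresh, otherwise send the full representation with a
# Last-Modified header so the next request can be conditional.
def respond(resource_mtime, if_modified_since)
  if if_modified_since && Time.httpdate(if_modified_since) >= resource_mtime
    { status: 304, body: nil }  # client's copy is still fresh
  else
    { status: 200,
      headers: { 'Last-Modified' => resource_mtime.httpdate },
      body: 'the full representation' }
  end
end

mtime = Time.httpdate('Sat, 01 Jan 2005 12:00:00 GMT')
respond(mtime, 'Sat, 01 Jan 2005 12:00:00 GMT')[:status]  # => 304
respond(mtime, nil)[:status]                              # => 200
```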

How many of those points does your favourite PHP/Python/Ruby framework provide?

The three-layered system for content negotiation

I tried to use content negotiation through the available tools, such as Apache MultiViews. The first problem I had was that, as mentioned before, web browsers do not make it easy for you to set, say, your preferred language. I tried pointing to alternative language versions with the <link/> element, but the browsers ignore that too. I tried adding explicit <a/> links to the alternative versions, but this solution, contrary to elementary user-interface guidelines, was transitory: as soon as the user opened another page, his choice of language was lost.

I can’t change the browser’s list of preferred languages remotely. The only solution, unfortunately, seems to be to duplicate the HTTP content negotiation mechanism. We can do that using cookies. If the user clicks “Portuguese”, you not only send him to the Portuguese version of a page, you also set a cookie saying that he likes Portuguese more than English. When you retrieve a resource, you first try to see if you have a version satisfying the cookie preferences, and if not, you scan the standard HTTP language preferences. You have two content negotiation layers.
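A sketch of those two layers (the helper name and the cookie handling are my own invention, not any framework’s API): the remembered cookie preference wins, and the Accept-Language header is only consulted when the cookie doesn’t help.

```ruby
# Two-layer language negotiation: a preference stored in a cookie
# takes priority over the browser's Accept-Language header.
def pick_language(available, cookie_lang, accept_language)
  # Layer 1: an explicit choice the user made, remembered in a cookie.
  return cookie_lang if cookie_lang && available.include?(cookie_lang)

  # Layer 2: the standard HTTP header, best q-value first.
  prefs = accept_language.to_s.split(',').map do |part|
    tag, q = part.strip.split(';')
    [tag, q ? q[/q=([\d.]+)/, 1].to_f : 1.0]
  end
  prefs.sort_by { |_, q| -q }.map(&:first).find { |t| available.include?(t) }
end

pick_language(%w[pt en], 'pt', 'en;q=0.9, ja;q=0.5')  # => "pt" (cookie wins)
pick_language(%w[pt en], nil, 'ja, en;q=0.8')         # => "en" (header fallback)
```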

What happens if you do not have any of the languages the user requested? HTTP has a status code for that, “406 Not Acceptable”. But it seems to me a terrible idea to do what Apache does by default and throw a big ugly “406 Not Acceptable” error message in the user’s face. If I had written some text only in English and My Mom tried to see it, that error message would surely scare her away from the computer. The authors of the HTTP RFC probably felt like that too, since they made the standard error message a SHOULD, not a MUST. They even state that sending unacceptable content may be preferable to sending a 406 error response in some cases.

What if we take a middle road? We send the response with a 406 status code, but instead of attaching an error message, we attach some available version of the content. My Mom need not know that an error occurred. If we feel like it, we could include a little informative box saying “I could not find this page in Portuguese, so I’m sending it in English instead”.

The decision of which language is the fallback should clearly be made by the webmaster. But should the webmaster be required to write a version of every single page in this language? What if I want to write some content only in English and some only in Portuguese? The most general solution is not a fallback language but a fallback list of languages. Yet another layer of content negotiation. And thus we arrive at my three-layered content negotiation scheme.

Until now I’ve discussed the problem in terms of m17n, but the same reasoning applies to media types (the XHTML/HTML issue). It also applies to selecting the best character set encoding and transfer encoding, which I’m not worrying about right now but which are nice to keep in mind. So, in general, we add two extra layers of content negotiation preferences mimicking the four HTTP Accept headers and follow this algorithm:

First, scan the user preferences set in cookies, if any, and see if there’s an available representation satisfying them;

If not, try the same with the standard HTTP headers;

If not, find the best representation according to the site defaults and send it flagged as 406.
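The three steps above can be sketched in a few lines (everything here is my own illustration, language negotiation only; the preference lists are assumed to be already parsed and ordered best-first):

```ruby
# Three-layer content negotiation, for languages only.
# `available` lists the language tags we have representations for;
# `site_defaults` is the webmaster's ordered fallback list.
Result = Struct.new(:status, :language)

def negotiate(available, cookie_prefs, header_prefs, site_defaults)
  # Layer 1: preferences the user chose explicitly (stored in cookies).
  cookie_prefs.each { |tag| return Result.new(200, tag) if available.include?(tag) }
  # Layer 2: the browser's standard Accept-Language list.
  header_prefs.each { |tag| return Result.new(200, tag) if available.include?(tag) }
  # Layer 3: site defaults, flagged 406 so it's clear nothing acceptable matched.
  site_defaults.each { |tag| return Result.new(406, tag) if available.include?(tag) }
  Result.new(406, nil)
end

negotiate(%w[en ja], ['pt'], ['pt-BR'], %w[en pt])  # => status 406, language "en"
negotiate(%w[en pt], ['pt'], [], %w[en])            # => status 200, language "pt"
```

Note that the 406 branch still carries a full representation, exactly as argued above: the status code records the failure without putting an error page in front of the user.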

So, Boiko, have you actually done anything about it?

I do not wish to duplicate efforts. It seems to me that the most logical place to implement such a content negotiation system would be at the web server, either as an Apache module or using something like WEBrick. But I cannot install Apache modules or run custom servers at my hosting service, so I’m stuck with (Fast) CGI.

I’ve been hacking together some Ruby for the algorithm described above, with emphasis on standards compliance. It already works for serving static (non-ERB) content, but I want to improve it a little more before releasing it. (Update: sorry people, I'm a procrastinator and an overengineer. Since writing this I fell in love with Scheme — specifically, with a very m17n-aware and unixish Scheme implementation — and am in the process of rewriting everything.)

And that’s my excuse for why I’m not comfortable creating Wordpress texts or Gallery photo albums or Rails applications.

I welcome your suggestions. Discuss this at reddit, or in my blog (Portuguese, but English comments welcome).