Cleaning up the Web

These are my rough notes before I made the talk. No guarantee of accuracy, but with a transcript* should be a help.

Good morning everybody.

I saw with shock that I had been billed as talking about the relationship between the Semantic Web and the browser web. However, in keeping with a long-standing tradition, I have allowed the topic of my remarks to be influenced by the suggestions of many people and the many discussions which have taken place over the last few days. And uninfluenced by the published title. I'm happy to wax lyrical about the semantic web browser/web browser architecture in breaks set aside for the consumption of various brown liquids.

I must admit I came to this week with a certain trepidation. Finances are low all round, some people have not been able to come. There has been dissonance on the lists, not unusual but perhaps more than normal. There has been concern about the HTML5 work being done in ways which do not match the expectations people have about W3C groups, and concerns about the TAG not being connected.

On the flip side, when I talk to anyone in any of the groups, I find a personal enthusiasm and energy which takes me aback but I remember from previous meetings. Whether it is discussing HTML5 or authorization using FOAF and SSL, people are full of ideas, and anxious to make powerful systems which are easy for users.

Robert Fulghum's famous line is "Everything I need to know I learned in Kindergarten". He suggests things like

Don't hurt other people

Clean up your own mess

Flush

and so on.

We can add things more domain-specific -- in software:

Keep it Simple

Make orthogonal modules

Internet:

No kings, just rough consensus and running code.

Specify what goes on the wire and what it means

Be liberal in what you accept, but conservative in what you produce.

Web:

Every system should be a platform for future systems.

Assume a power law of overlapping communities.

No kings, just rough consensus and running code and test suites!

There are more complicated ones which the TAG tries to list in various documents and I don't go into here. Norm Walsh sent a summary of them to the www-tag@ list a few days ago.

Groups:

Listen to people. It is worth the extra time.

Travel as little as possible. It costs.

Travel as much as you need to.

Listen to people.

When you come here, use the fact that we are all here together.

If you conclude that some one or another group does not understand you, what do you do?

You listen to them.

Why? Because if your explanation didn't work for them, you didn't understand where they are coming from. You have got to get into someone's way of thinking to be able to explain anything.

So you have to listen.

You may have to interview them.

When you have listened enough, they will understand you.

One of the causes of tension which has surfaced recently has been about what to do about HTML which has markup errors.

To dramatically oversimplify, some people worry that browsers which silently fix up all kinds of errors in markup (and HTTP and MIME types and image formats) encourage a world which is more and more messy, while the HTML WG points out that browsers have to render all broken legacy web pages without a murmur to compete in the market.

Historically, this has come from an application of the Be Liberal rule in browsers. Be liberal in what you accept -- ignore unknown tags, correct syntax errors, sniff for image file types, all silently. The market-driven downward spiral of expectation of conformance to specifications is an interesting study of deployment dynamics. We realize we are not just designing systems, we are throwing potentially viral designs into a soup of users, designers and developers.

What has happened is that the need for browser vendors to be able to boast being able to render at least as many legacy pages as the competition has combined with users' natural fallibility to produce a web of pages the majority of which are in error.

So the idea which I am going to summarize again today, which I suggested at the last AC, is that this is because the reward function is wrong.

What is the reward function? For a browser, it is whether the user gets what he or she wants; for a validator (or checker of any sort), it is whether he or she is told the page is good.

Suppose there is, on the one hand (and on the X axis) a certain effort which a Web page author puts into the writing of a Web page, to eliminate various levels of error, and on the other hand (and on the Y axis) a reward given, in part, in terms of the quality of the rendered Web page on the range of clients perceived to be of interest.

In the case, shown above, of the conservative browser, the page must be completely correct or nothing is rendered. The writer who has an almost perfect page is motivated to fix it, but the writer who has a page with several errors is not, as there will be no noticeable reward for incremental improvement. It is not very surprising that the majority of Web users whose pages would have started off near the left of the graph did not make it to the right when serving their code as XHTML.

Some errors we may consider hopeless even in HTML, in that no useful recovery seems possible for them. In the case of the liberal browser (above), the reward for a hopeless page is zero, but for a page with any other level of errors, it in fact is rendered completely by the browsers. Therefore, a writer whose page is hopeless is motivated to clean it up a little bit. But the writers of pages which have other levels of error are not motivated to clean them up at all.

So while the liberal and conservative forks have very different philosophies, they share one thing: They do not motivate the writer of a Web page to progressively improve their offering.

The solution, as I see it, is to look at the motivating slope and fix it. When the user is provided with incremental rewards, then he or she will move, hopefully, up the slope.
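
To make the three slopes concrete, here is a rough sketch of the reward functions described above. It is only an illustration: the names, the HOPELESS threshold, the step size and the straight-line shape of the incremental reward are all assumptions of mine, not measurements of any browser.

```python
# Illustrative only. "effort" is the fraction of errors the author has
# removed, from 0.0 (nothing fixed) to 1.0 (completely correct).
# HOPELESS and the linear slope are assumed values, not data.

HOPELESS = 0.1  # assumed effort below which no useful recovery is possible


def conservative_reward(effort: float) -> float:
    """Strict client: nothing is rendered unless the page is fully correct."""
    return 1.0 if effort >= 1.0 else 0.0


def liberal_reward(effort: float) -> float:
    """Forgiving client: anything that is not hopeless renders completely."""
    return 1.0 if effort >= HOPELESS else 0.0


def incremental_reward(effort: float) -> float:
    """Sloped reward: every error fixed earns a little more."""
    return max(0.0, min(1.0, effort))


def motivation(reward, effort: float, step: float = 0.05) -> float:
    """Extra reward gained by putting in one more small step of effort."""
    return reward(min(1.0, effort + step)) - reward(effort)


if __name__ == "__main__":
    for name, fn in [("conservative", conservative_reward),
                     ("liberal", liberal_reward),
                     ("incremental", incremental_reward)]:
        print(name, [round(motivation(fn, e), 2) for e in (0.07, 0.5, 0.97)])
```

With these assumed numbers, an author half way along the effort axis gains nothing from a further step under either the conservative or the liberal function, and gains a little under the incremental one. That is the whole point of fixing the slope.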

Browsers could:

By default, show no errors still (Just a bit?)

Show me errors on web sites I own

Show me errors if I am interested

Show me errors if I View Source

Give me cleaned up code as the first default in show source

Give me cleaned up code as the first default in Save As

Validators could give for (random) example

Marks for how well formed it is out of 20

Marks for HTML structure out of 20

Marks for embedded CSS and JS out of 20

Marks for HTTP and MIME out of 20

Marks for links working out of 20

Total marks out of 100
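
As a sketch of how such a mark sheet might be represented, assuming the five categories above and nothing else (the class and field names are hypothetical, and how each individual mark is arrived at is deliberately left out):

```python
from dataclasses import dataclass

MAX_PER_CATEGORY = 20  # each category is marked out of 20, as listed above


@dataclass
class Marks:
    """One validator report. Field names are illustrative, not a real API."""
    well_formedness: int
    html_structure: int
    embedded_css_js: int
    http_and_mime: int
    links_working: int

    def parts(self) -> list:
        return [self.well_formedness, self.html_structure,
                self.embedded_css_js, self.http_and_mime,
                self.links_working]

    def total(self) -> int:
        """Total marks out of 100, rejecting any mark outside 0..20."""
        for mark in self.parts():
            if not 0 <= mark <= MAX_PER_CATEGORY:
                raise ValueError(f"mark {mark} is not in 0..{MAX_PER_CATEGORY}")
        return sum(self.parts())


if __name__ == "__main__":
    report = Marks(well_formedness=20, html_structure=15,
                   embedded_css_js=18, http_and_mime=10, links_working=19)
    print(f"Total: {report.total()} out of 100")
```

The useful property is that fixing one broken link or one MIME type moves the total by a mark or two, so there is always a next step worth taking.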

Servers:

In the *default configuration*

Don't assume you know what type a file with no extension has unless told;

Have by default MIME types for all W3C specs at last call or further;

Admins give server users local control of new MIME types (eg voice markup)

Give 500 errors when misconfigured with an explanation of how to fix the problem.

Automatically clean up stuff outgoing?
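
A minimal sketch of the first few server points, assuming a tiny default table and a per-user override table. The table contents, the function name and the exception are hypothetical, and no real server is configured this way as far as I know; the voice markup type in the usage example is just an example of a locally added type.

```python
import os
from typing import Dict, Optional

# Hypothetical default table. A real one would cover far more formats.
DEFAULT_TYPES: Dict[str, str] = {
    ".html": "text/html",
    ".css": "text/css",
    ".svg": "image/svg+xml",
    ".rdf": "application/rdf+xml",
}


class Misconfigured(Exception):
    """Would map to a 500 response whose body explains how to fix things."""


def content_type_for(path: str,
                     local_types: Optional[Dict[str, str]] = None) -> str:
    """Return the declared media type for a file, never a guess.

    local_types lets a server user add extensions (".vxml") or declare
    individual extensionless files by name ("README").
    """
    types = {**DEFAULT_TYPES, **(local_types or {})}
    name = os.path.basename(path)
    ext = os.path.splitext(name)[1].lower()
    if ext and ext in types:
        return types[ext]
    if name in types:
        return types[name]
    raise Misconfigured(
        f"No media type is configured for '{name}'. "
        "Add its extension or file name to the local types table."
    )


if __name__ == "__main__":
    print(content_type_for("index.html"))
    print(content_type_for("menu.vxml", {".vxml": "application/voicexml+xml"}))
    try:
        content_type_for("README")
    except Misconfigured as problem:
        print("500:", problem)
```

The design choice is that an unknown file produces an explanation instead of a guess, so the person who can fix the configuration finds out about it.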

Motivation

Why is this important? Why should users, web page designers and server admins go to the trouble of cleaning up their acts?

Because we should be building a clean platform for others to build on.

We need people to be able to design new formats for new applications which are completely new ways of using the web. They have to be able to introduce a new MIME type which will not be sniffed as something else. Well, consider the options in a few years' time. Maybe 10.

If we continue allowing messy web pages to be accepted without comment, the norm will be messy and the occasional misguided page will be even messier.

True, if the browser vendors agree to draw a line in the sand at a particular set of quirks and not add any more, then there will be no incentive for web designers to use new quirks which are not implemented in the browsers. But at the moment, I don't see this agreement.

So if we don't clean it up the web will get dirtier and dirtier. Where we now wade waist-deep in errors we have to silently correct for, we will be wading shoulder-deep in them. It will be much more difficult for people to build stuff on top of the web.

There will be so much complexity in the web as a platform that it will become a dead-end technology. Not a platform but a ceiling. No one will build great things where the web is just one component in the design.

If we do clean it up, we can make it so the basic design, and an ever-increasing proportion of new web pages, are simple.

It should be simple. A function which maps a URI onto content, or URI onto meaning in some sense. Where our ideas about how we convey content and meaning can grow over time. Where that function can be used in all kinds of new processes on all kinds of machines in all kinds of circumstances.

Where we can (say) adjust the HTTP protocol to blend into a peer-to-peer protocol when the net gets loaded or a server goes away, and the web function still works, because it is simply defined, has a clean interface, and we can re-engineer it under the hood, because applications of it don't look under the hood.

Where client-side applications which currently see the web as XMLHttpRequest can use it in all kinds of very novel ways.

In the, what, 35 years of the Internet, the infrastructure has been upgraded through a ridiculous number of orders of magnitude, and the socket interface that we code to has been unchanged. During the same time, a very dramatic amount of creativity in the uses of TCP/IP has flourished, with the WWW as just one example. That all worked because the original creators of the technology believed in making clean abstractions. The hourglass model in which a narrow constant neck separated the independent layers above and below. Modular orthogonal specifications. And in designing for unknown applications in the future. And when new people (like me) came along we were lectured on how to write a good spec which is short and clean and doesn't constrain other specs. Some of the folks who did that are here today.

It may seem bizarre to try to plan for 10 years' time. But believe me, 10 years slips by increasingly quickly. The first memo about the web was written 20 years ago in March, when some of the people here were not yet born. And it seems like yesterday because in many ways we are talking about very similar issues, but with new parameters in terms of the weight of the legacy documents out there.

But the future, my friends, is longer than the past. We should clean up the web so that the future can build on it.

To that end, in the next few days, please go out and listen to those people who don't understand you.

Thanks for listening to me.

PS: Oh, and my take on web browsing :)

You can use URIs with hashes in them for HTML anchors

You can use URIs with hashes in them in HTML/RDFa documents for things like events, people

You should never use the same URI for an anchor and anything else.

Browsers which can handle RDF and HTML equally well should Accept both at the same q

Servers serving RDF or HTML should prefer RDF if the data is the same, or HTML if the HTML page has significantly more information.
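
The last two points can be sketched roughly as follows. This is a simplification under stated assumptions: it only compares two media types, it ignores wildcards such as */*, and the function names and the html_has_more_info flag are mine, not part of any specification.

```python
from urllib.parse import urldefrag

RDF = "application/rdf+xml"
HTML = "text/html"


def parse_accept(header: str) -> dict:
    """Return {media_type: q} from an Accept header; q defaults to 1.0."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs[parts[0]] = q
    return prefs


def choose(accept_header: str, html_has_more_info: bool) -> str:
    """Pick RDF or HTML for a client, breaking ties as suggested above."""
    prefs = parse_accept(accept_header)
    q_rdf, q_html = prefs.get(RDF, 0.0), prefs.get(HTML, 0.0)
    if q_rdf != q_html:
        # The client has a real preference, so respect it.
        return RDF if q_rdf > q_html else HTML
    # Same q: RDF if the data is the same, HTML if it says significantly more.
    return HTML if html_has_more_info else RDF


def split_hash(uri: str) -> tuple:
    """Split a URI into (document URI, fragment after the hash)."""
    return tuple(urldefrag(uri))


if __name__ == "__main__":
    both = "application/rdf+xml;q=0.9, text/html;q=0.9"
    print(choose(both, html_has_more_info=False))
    print(choose(both, html_has_more_info=True))
    print(split_hash("http://example.org/people#alice"))
```

The example.org URI is made up. The point of split_hash is only that the part after the hash is handled by the client, so whether it names an anchor or a person depends on what the document says about it.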

* transcripts for TPAC 2008 are in progress.