From : Ian Hickson <ian@hixie.ch>



To : Manu Sporny <msporny@digitalbazaar.com>

Cc : WHAT-WG <whatwg@whatwg.org>, "www-archive@w3.org" <www-archive@w3.org>

Message-ID : <Pine.LNX.4.62.0808260901570.7044@hixie.dreamhostps.com>



On Mon, 25 Aug 2008, Manu Sporny wrote:
>
> Web browsers currently do not understand the meaning behind human
> statements or concepts on a web page. While this may seem academic, it
> has direct implications on website usability. If web browsers could
> understand that a particular page was describing a piece of music, a
> movie, an event, a person or a product, the browser could then help the
> user find more information about the particular item in question.

Is this something that users actually want?

How would this actually work? Personally I find that if I'm looking at a site with music tracks, say Amazon's MP3 store, I don't have any difficulty working out what the tracks are or interacting with the page. Why would I want to ask the computer to do something with the tracks?

It would be helpful if you could walk me through some examples of what UI you are envisaging in terms of "helping the user find more information". Why is Safari's "select text and then right click to search on Google" not good enough?

Have any usability studies been made to test these ideas? (For example, paper prototype usability studies?) What were the results?

> It would help automate the browsing experience.

Why does the browsing experience need automating?

> Not only would the browsing experience be improved, but search engine
> indexing quality would be better due to a spider's ability to understand
> the data on the page with more accuracy.

This I can speak to directly, since I work for a search engine and have learnt quite a bit about how it works.

I don't think more metadata is going to improve search engines. In practice, metadata is so highly gamed that it cannot be relied upon. In fact, search engines probably already "understand" pages with far more accuracy than most authors will ever be able to express.

You started by saying:

> Web browsers currently do not understand the meaning behind human
> statements or concepts on a web page.

This is true, and I even agree that fixing this problem, letting browsers understand the meaning behind human statements and concepts, would open up a giant number of potentially killer applications. I don't think "automating the browser experience" is necessarily that killer app, but let's assume that it is for the sake of argument.

You continue:

> If we are to automate the browsing experience and deliver a more usable
> web experience, we must provide a mechanism for describing, detecting
> and processing semantics.

This statement seems obvious, but actually I disagree with it. It is not the case that providing a mechanism for describing, detecting, and processing semantics is the only way to let browsers understand the meaning behind human statements or concepts on a web page. In fact, I would argue it's not even the most plausible solution.

A mechanism for describing, detecting, and processing semantics (that is, new syntax, new vocabularies, new authoring requirements) fundamentally relies on authors actually writing the information using this new syntax.

If there's anything we can learn from the Web today, however, it is that authors will reliably output garbage at the syntactic level. They misuse HTML semantics and syntax uniformly (to the point where 90%+ of pages are invalid in some way). Use of metadata mechanisms is at a pitifully low level, and when used is inaccurate (Content-Type headers for non-HTML data and character encoding declarations for all text types are both widely wrong, to the point where browsers have increasingly complex heuristics to work around the errors). Even "successful" formats for metadata publishing like hCard have woefully low penetration.

Yet, for us to automate the browsing experience by having computers understand the Web, for us to have search engines be significantly more accurate by understanding pages, the metadata has to be widespread, detailed, and reliable.
So to get this data into Web pages, we have to get past the laziness and incompetence of authors.

Furthermore, even if we could get authors to reliably put out this data widely, we would have to then find a way to deal with spammers and black hat SEOs, who would simply put inaccurate data into their pages in an attempt to game search engines and browsers.

So to get this data into Web pages, we have to get past the inherent greed and evilness of hostile authors.

As I mentioned earlier, there is another solution, one that doesn't rely on getting authors to be any more accurate or precise than they are now, one that doesn't require any effort on the part of authors, and one that can be used in conjunction with today's anti-spam tools to avoid being gamed by spammers, and potentially even to improve those tools dramatically: have the computers learn the human languages themselves.

Instead of making all the humans of the world learn a computer language, or tools for writing that computer language, have the computers learn the human language.

Not only does this not require us to solve a fundamentally unsolvable pair of problems (making humans not be lazy and making humans not be evil), but it also means that the computers would gain an understanding of all the legacy content that would otherwise never be seen by computers.

This kind of thing is already being done, for example with automated language translation where the software learns for itself how to translate text, or in search engines that extract information like byline dates and author credits, without the need for pages to have special markup, or in data clustering, where tools can examine large sets of data and sort the content into buckets based on topics without any special markup or user intervention. Similarly, developments in image processing are making huge steps, with tools that can derive depth mapping data from moving video, or that can convert a set of static 2D images to a 3D point field.
It's clear that over the coming years, this will only get better and better.

However, let's pretend for now that we can find a way to solve laziness and evilness and continue with your e-mail:

> If one understands web semantics to be an important part of the web's
> future, the question then becomes, why RDFa? Why not Microformats?
>
> While there are a number of technical merits that speak in favor of RDFa
> over Microformats (fully qualified vocabulary terms

Why is this better?

> prefix short-hand via CURIEs

This is definitely not better.

> accessibility-friendly

How is not reusing HTML semantics better than using them? With the exception of the now-resolved <time> issue, it seems like Microformats has the better accessibility story.

> unified processing rules, etc)

Microformats could certainly benefit from a more consistent parsing model, but that can be obtained without going to RDFa.

> [...] this issue really boils down to one of centralized innovation vs.
> distributed innovation.

I don't see what the syntax has to do with whether the formats are developed centrally or not. Nothing is stopping anyone from creating another Microformats-like organisation that does the same thing without going through the Microformats.org process. There could be millions of them, in fact. So long as they pick names that are suitably unique (e.g. URIs, or Java-like identifiers), or so long as they don't promote the use of their formats outside of their own site, I don't see a problem. In fact, this is happening every day, with each author making up his own class values for use on his own site.

> The Microformats community, and all communities like it, require a group
> of people to come together, collaborate and create a standard vocabulary
> to express ALL semantics.

Well, for any one person to do anything useful with the data on the Web, they have to have a core vocabulary (or set of vocabularies) that they understand.
So a set of standard vocabularies to express all the semantics that that one person is interested in is needed, yes. It doesn't have to be _all_ semantics, however. I might want to have a format for annotating Stargate analysis Web pages that me and my friends write, but so long as me and my friends agree on it, it doesn't have to involve anyone else.

> A somewhat strained analogy would be bringing in representatives from
> all of the cultures of the world and having them agree on a universal
> vocabulary.

That's pretty much exactly what Unicode did. Or what we're doing with HTML. That doesn't seem untenable; it seems quite reasonable. However, I'm not suggesting that it should be necessary.

> In short, RDFa addresses the problem of a lack of a standardized
> semantics expression mechanism in HTML family languages. RDFa not only
> enables the use cases described in the videos listed above, but all use
> cases that struggle with enabling web browsers and web spiders
> understand the context of the current page.

I'm not convinced the problem you describe can be solved in the manner you describe. It seems to rely on getting authors to do something that they have shown themselves incapable of doing over and over since the Web started. It seems like a much better solution would be to get computers to understand what humans are doing already.

Even if we ignore that, it doesn't seem like the above discussion would lead one to a set of requirements that would lead one to design a language like RDFa.

Thanks for the explanation, by the way. This is by far the most useful explanation of RDFa that I have ever seen.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'