Thursday, January 27, 2005

The Normalised Google Distance

I love the idea of a lowercase semantic web, where you use “brute force” statistical analysis of the growing corpus the world wide web offers. Brute force is dumb by definition, but ironically, I find this kind of dumb works best for Artificial Intelligence. Just think of the 20 Questions game, which (correct me if I’m wrong) simply matches patterns from previous games and still manages to look so smart.

One of the most powerful recent approaches is the so-called Google Mindshare, or Googleshare, algorithm. I use a similar idea for my Centuryshare application at FindForward.com, relating different years to a specific word to see when that word was most “popular”. Instead of googling for meta-data (of which there is relatively little, even though what exists is easily machine-readable), flat text is enough; very often, just the Google page count reveals all we need to know.

Now, as the New Scientist reports, the two scientists Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam, Netherlands, used Google to compute what they call the “Normalised Google Distance.”

Basically, this NGD quantifies the strength of the relationship between two words. For example, “speakers” and “sound” are more related than “speakers” and “elephant.” Instead of defining these relationships manually, of course, they look at the Google page count when both words are used together in a search. (“Speakers” and “sound” would return a relatively high number of result pages compared to “speakers” and “elephant.”)
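To make this concrete, here is a small sketch of the distance formula from the Vitanyi/Cilibrasi paper, which normalises the log of the joint page count against the individual page counts and the size of the index. The page counts and index size below are made-up numbers for illustration, not real Google results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from raw page counts.

    fx, fy -- number of pages containing each word on its own
    fxy    -- number of pages containing both words
    n      -- total number of pages indexed (a rough estimate)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical counts, for illustration only.
N = 8_000_000_000  # assumed index size
close = ngd(50_000_000, 120_000_000, 10_000_000, N)  # "speakers" / "sound"
far = ngd(50_000_000, 4_000_000, 60_000, N)          # "speakers" / "elephant"
print(close, far)  # the related pair gets the smaller distance
```

A smaller NGD means the words co-occur more often than their individual frequencies would suggest; a word is at distance zero from itself.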

Now, when you repeat this process of finding the NGD for a lot of words, you can build a multi-connection word map. According to the New Scientist, Paul Vitanyi says “This is automatic meaning extraction. It could well be the way to make a computer understand things and act semi-intelligently.” You can read all the details of this approach in Vitanyi and Cilibrasi’s paper “Automatic Meaning Discovery Using Google.”
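The word-map step above can be sketched by computing the pairwise distance for every word pair and sorting by closeness. Again, all page counts here are hypothetical stand-ins for what the actual search queries would return.

```python
import math
from itertools import combinations

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from raw page counts."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical single-word and pair page counts, for illustration only.
counts = {"speakers": 50_000_000, "sound": 120_000_000, "elephant": 4_000_000}
pair_counts = {
    ("speakers", "sound"): 10_000_000,
    ("speakers", "elephant"): 60_000,
    ("sound", "elephant"): 300_000,
}
N = 8_000_000_000  # assumed index size

# The "word map": every pair of words with its distance.
word_map = {
    (a, b): ngd(counts[a], counts[b], pair_counts[(a, b)], N)
    for a, b in combinations(counts, 2)
}
for pair, dist in sorted(word_map.items(), key=lambda kv: kv[1]):
    print(pair, round(dist, 3))  # closest pairs first
```

Feeding a matrix like this into a clustering algorithm is what turns raw page counts into something resembling a semantic network.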

I do believe the future of AI will evolve along these lines, but it might not be quite what we expected. After all, we may never really understand what we built, and possibly, we won’t be able to fully control it. Just imagine a scientist’s surprise if he were successful in creating real Turing test intelligence, and as soon as he posed his first question, the automaton replied: “I could answer this easily, but tell me, what’s in it for me?”



