From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

I started this is a part of the Billion Triple Challenge work, it shows the how different sites on Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, mouse over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, it looked roughly ok in Chrome though.)

The data is extracted from the BTC triples by computing the Pay-Level-Domain (PLD, essentially the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects, and if they differ, count the predicates that link them. I.e. a triple:

dbpedia:Albert_Einstein rdf:type foaf:Person.

would count as a link between http://dbpedia.org and http://xmlns.com for the rdf:type predicate. Counting all links like this gives us the top cross-domain linking predicates:

Most frequent is of course rdf:type, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it’s not really a link. You can also see a version including rdf:type. The rest of the properties are more link-like, I am not sure what is going on with the akt:has-date though, anyone?

The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: Wikipedia Clusterball. His is nicer since he has core nodes inside the ball. He points out that the “clustering” of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My “clustering” method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It’s essentially a greedy hill-climbing method, and I am sure it’s far from optimal, but it does at least something. For comparison, here is the ball on top without clustering applied.

The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses prototype. I wanted to share the code, but it’s a horrible mess, and I’d rather not spend the time to clean it up. If you want it, drop me a line., see below. The data used to generate this is available either as a download: data.txt.gz (19Mb, 10,000 host-pairs and top 500 predicates), or a subset on Many Eyes (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)

UPDATE: Richard Stirling asked for the code, so I spent 30 min cleaning it up a bit, grab it here: swball_code.tar.gz It includes the data+code needed to recreate the example above.