Just last week Lars Svensson from the Deutsche Nationalbibliothek (German National Library, aka DNB) made a big announcement: they have released their authority data as Linked Data for the world to use. What this means is that there are now unique URLs (and machine readable data at the other end of them) for:

1.8 million authors from the Personennamendatei (PND)

1.3 million corporate bodies from the Gemeinsame Körperschaftsdatei (GKD)

187,000 subject headings from the Schlagwortnormdatei (SWD)

51,000 Dewey Decimal Classification categories

The full dataset that the DNB has made available for download amounts to 38,849,113 individual statements (aka triples). Linked Data enthusiasts who are used to thinking in terms of billions of triples might not even blink when seeing these numbers. But it is important to remember that these data assets have been curated by a network of German, Austrian and Swiss libraries for close to a hundred years, as they documented (and continue to document) all known German-language publications.

The simple act of making each of these authority records URL addressable means that they can now meaningfully participate in the global information space some call the Web of Data. It’s true, the records were available as part of the DNB’s Online Catalog before they were released as Linked Data. What’s new is that the DNB has committed to using persistent URLs to identify these records, using a new host name d-nb.info in combination with their own record identifiers. This means that people can persistently link to these DNB resources in their own web applications and data. Another subtle thing, and really the heart of what the Linked Data pattern offers, is the ability to use the same URL to retrieve the record as structured metadata. The important thing about having machine readable data is that it allows other applications to easily re-purpose the information, much like libraries have done traditionally by shipping around batches of Machine Readable Cataloging (MARC) records. Here’s a practical example:

The URL http://d-nb.info/gnd/119053071 identifies the author Herta Müller, who won the Nobel Prize for Literature in 2009. If you load that URL in your web browser by clicking on it, you should see a web page (HTML) for the authority record describing Herta Müller. But if a web client requests that same URL asking for RDF, it will (via a redirect) get the same authority record as RDF. RDF is more a data model than a particular file format, so it has a variety of serializations … The server at d-nb.info returns RDF/XML, and they have made their data dumps available in N-Triples…but I’m kind of fond of the Turtle serialization, which is kind of JSON-ish and makes the RDF a bit more readable. Here is the RDF (as Turtle) for Herta Müller that the DNB makes available:

@prefix gnd: <http://d-nb.info/gnd/> .
@prefix rdaGr2: <http://RDVocab.info/ElementsGr2/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://d-nb.info/gnd/119053071>
    rdaGr2:biographicalInformation "Rumän.-dt. Schriftstellerin und Essayistin, lebt seit 1987 in Deutschland, Literaturnobelpreisträgerin 2009"@de ;
    rdaGr2:dateOfBirth "1953" ;
    rdaGr2:identifierForThePerson "(DE-588)119053071", "(DE-588c)4293331-6", "(DLC)n 86833524" ;
    rdaGr2:placeOfBirth "Nitzkydorf (Banat)"@de ;
    rdaGr2:placeOfResidence "Berlin"@de ;
    rdaGr2:professionOrOccupation <http://d-nb.info/gnd/4053311-6> ;
    gnd:countryCodeForThePerson "XA-RO" ;
    gnd:preferredNameForThePerson
        [ gnd:foreName "Herta" ; gnd:surname "Müller" ; gnd:usedRules "RAK-WB" ],
        "Müller, Herta" ;
    gnd:studyPathsOfThePerson "Germanistik, Romanistik"@de ;
    gnd:variantNameForThePerson
        [ gnd:foreName "Cherta" ; gnd:surname "Myller" ; gnd:usedRules "RAK-WB" ],
        [ gnd:personalName "Heta-Mulei" ; gnd:usedRules "RAK-WB" ],
        [ gnd:foreName "Heta" ; gnd:surname "Mulei" ; gnd:usedRules "RAK-WB" ],
        [ gnd:foreName "Herta" ; gnd:surname "Müller" ; gnd:usedRules "AACR" ],
        [ gnd:foreName "Heruta" ; gnd:surname "Myur?" ; gnd:usedRules "RAK-WB" ],
        "Heta-Mulei", "Mulei, Heta", "Müller, Herta", "Myller, Cherta", "Myur?, Heruta" ;
    owl:sameAs <http://dbpedia.org/resource/Herta_M%C3%BCller>, <http://viaf.org/viaf/12324250> ;
    foaf:page <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .

One interesting thing to note in this example is the use of the RDA Group 2 Entities vocabulary and the GND vocabulary to describe Herta Müller. RDF vocabularies are explicit ways of describing resources like people, places, topics, etc. When different things are described using the same vocabulary (or the vocabularies themselves are related together in a particular way) it becomes possible to merge the descriptions, and build software on top of them. So the DNB’s choice of RDA and GND is quite significant. Normally the URL for an RDF schema will return a description of that schema known as a Namespace Document. Namespace Documents are handy for understanding what exactly the vocabulary means, and how it might relate to other RDF vocabularies on the web. This is the case for the RDA vocabulary, but the GND vocabulary namespace doesn’t appear to be resolving to anything that describes the GND vocabulary.
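The request for RDF described earlier (same URL, different answer depending on what the client asks for) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function names are my own, and the Accept value application/rdf+xml is an assumption based on the server returning RDF/XML.

```python
import urllib.request

# Media type for RDF/XML (assumed here, since d-nb.info is said to return RDF/XML).
RDF_XML = "application/rdf+xml"


def build_rdf_request(url, media_type=RDF_XML):
    """Build a GET request that asks the server for RDF instead of HTML."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_rdf(url, media_type=RDF_XML, timeout=30):
    """Fetch url and return (final_url, content_type, body).

    urlopen follows HTTP redirects by default, so the hop from the
    identifier URL to the RDF document is handled for us.
    """
    request = build_rdf_request(url, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        return response.geturl(), content_type, response.read()


# Building the request does not touch the network.
request = build_rdf_request("http://d-nb.info/gnd/119053071")
print(request.get_method(), request.get_full_url(), request.get_header("Accept"))
```

Calling fetch_rdf("http://d-nb.info/gnd/119053071") would perform the actual request; comparing the returned final URL with the one you asked for is a quick way to see the redirect at work.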

Also really interesting in this RDF for Herta Müller are the links to Wikipedia (http://de.wikipedia.org/wiki/Herta_M%C3%BCller), VIAF (http://viaf.org/viaf/12324250) and dbpedia (http://dbpedia.org/resource/Herta_M%C3%BCller). These are important because they contextualize the DNB record for Herta Müller by relating it to other records for her, thus allowing it to be disambiguated from records describing other people named Herta Müller. Another beneficial side effect of linking your own records to others out on the Web of Data is that you enrich your own data in the process. For example, if a machine agent resolves the dbpedia URI it will get back RDF that includes 114 new assertions, some of which you can see below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Herta_M%C3%BCller>
    dbpedia-owl:birthDate "1953-08-17"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Ni%C5%A3chidorf> ;
    dbpedia-owl:spouse <http://dbpedia.org/resource/Richard_Wagner_%28novelist%29> ;
    dbpedia-owl:thumbnail <http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Herta_M%C3%BCller_2007.JPG/200px-Herta_M%C3%BCller_2007.JPG> ;
    rdfs:label "Herta Müller"@de, "Herta Müller"@en, "Herta Müller"@es, "Herta Müller"@fi,
        "Herta Müller"@fr, "Herta Müller"@it, "Herta Müller"@nl, "Herta Müller"@nn,
        "Herta Müller"@pl, "Herta Müller"@pt, "Herta Müller"@sv,
        "Мюллер, Герта"@ru, "ヘルタ・ミュラー"@ja, "赫塔·米勒"@zh ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000dc69bb>,
        <http://umbel.org/umbel/ne/wikipedia/Herta_M%C3%BCller> ;
    foaf:depiction <http://upload.wikimedia.org/wikipedia/commons/2/2c/Herta_M%C3%BCller_2007.JPG> .

So now we’ve enriched the DNB authority record with:

a thumbnail picture of Herta Müller

her name in Japanese, Chinese and Russian

her full date of birth

her place of birth

a link to a similar record for her spouse Richard Wagner

links to records for Herta Müller at Freebase (recently purchased by Google)

And that’s just a sampling of the sorts of data that dbpedia returns. Another interesting one to look at is the Virtual International Authority File (VIAF), which links together the authority records for 18 National Libraries around the world. If you resolve the VIAF URL that the DNB has linked to, you will get machine readable information for authority records from the Library of Congress, NII (Japan), Biblioteca Nacional de Portugal, National Library of Sweden, Biblioteca Nacional de España, Bibliothèque nationale de France, National Library of the Czech Republic, and of course the Deutsche Nationalbibliothek. The information for the DNB and Sweden is particularly important because it in turn links back to the records at the originating institutions: http://d-nb.info/gnd/119053071 and http://libris.kb.se/auth/218085. It might be worthwhile for the DNB to consider linking directly to their own record in VIAF (http://viaf.org/viaf/12324250/#DNB%7C119053071) instead of http://viaf.org/viaf/12324250, but that’s largely a technical matter. We’ve connected up the DNB’s notion of Herta Müller with the Royal Library of Sweden’s, just by following our nose on the World Wide Web. And this is an activity that computer software can perform as well.
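Here is a rough sketch of what that software might do once it has the data in hand. It works on plain in-memory tuples rather than fetching anything, the function name merged_description is my own invention, and it only follows owl:sameAs links one hop, so treat it as an illustration of the idea rather than a real Linked Data client.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def merged_description(triples, subject):
    """Collect (predicate, object) pairs for subject and for every
    resource it is directly declared owl:sameAs (one hop only)."""
    aliases = {subject}
    for s, p, o in triples:
        if s == subject and p == OWL_SAME_AS:
            aliases.add(o)
    return {(p, o) for s, p, o in triples if s in aliases and p != OWL_SAME_AS}


DNB = "http://d-nb.info/gnd/119053071"
DBPEDIA = "http://dbpedia.org/resource/Herta_M%C3%BCller"

# A handful of statements taken from the two records discussed in this post.
triples = [
    (DNB, "http://RDVocab.info/ElementsGr2/dateOfBirth", "1953"),
    (DNB, OWL_SAME_AS, DBPEDIA),
    (DBPEDIA, "http://dbpedia.org/ontology/birthDate", "1953-08-17"),
    (DBPEDIA, "http://dbpedia.org/ontology/spouse",
     "http://dbpedia.org/resource/Richard_Wagner_%28novelist%29"),
]

description = merged_description(triples, DNB)
for predicate, obj in sorted(description):
    print(predicate, obj)
```

Starting from the DNB URL, the merged description picks up the full birth date and the spouse link that only dbpedia knows about.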

So, it’s clear there’s a whole lot of library linking going on. I did some quick and dirty analysis of the full data dump from the DNB and found 3,569,402 links to VIAF and 40,136 links to dbpedia (the Linked Data version of Wikipedia). What remains to be done, to some extent, is leveraging this contextual information around our data in library applications: cataloging tools, metadata enrichment applications and end-user-facing discovery applications.
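For the curious, a link count like that can be approximated with a short script over the N-Triples dump. The sketch below is not the script I actually ran; it assumes one triple per line, uses a deliberately simple regular expression that only matches triples whose object is a URI, and tallies owl:sameAs targets by host name.

```python
import re
from collections import Counter
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Matches only N-Triples lines of the form: <subject> <predicate> <object> .
URI_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def count_link_hosts(lines):
    """Count owl:sameAs link targets, grouped by the host they point at."""
    hosts = Counter()
    for line in lines:
        match = URI_TRIPLE.match(line.strip())
        if match and match.group(2) == OWL_SAME_AS:
            hosts[urlparse(match.group(3)).hostname] += 1
    return hosts


SAMPLE = [
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Herta_M%C3%BCller> .',
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/12324250> .',
    '<http://d-nb.info/gnd/119053071> <http://xmlns.com/foaf/0.1/page> <http://de.wikipedia.org/wiki/Herta_M%C3%BCller> .',
    '<http://d-nb.info/gnd/119053071> <http://RDVocab.info/ElementsGr2/dateOfBirth> "1953" .',
]

hosts = count_link_hosts(SAMPLE)
for host, count in hosts.most_common():
    print(host, count)
```

Because count_link_hosts accepts any iterable of lines, an open file handle for the downloaded dump can be passed in directly, without loading the whole thing into memory.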

One challenge to building applications that use this Web of Library Data is the set of vocabularies that are used. I did some more rudimentary analysis on the full DNB data dump and came up with this count of property usage:

RDF Property    Number of Assertions
http://www.w3.org/2002/07/owl#sameAs    3,609,878
http://d-nb.info/gnd/preferredNameForThePerson    3,609,753
http://d-nb.info/gnd/usedRules    3,476,879
http://d-nb.info/gnd/variantNameForThePerson    3,327,005
http://d-nb.info/gnd/surname    3,218,840
http://d-nb.info/gnd/foreName    3,218,125
http://RDVocab.info/ElementsGr2/identifierForTheCorporateBody    2,642,185
http://RDVocab.info/ElementsGr2/identifierForThePerson    2,163,258
http://d-nb.info/gnd/preferredNameForTheCorporateBody    1,320,711
http://d-nb.info/gnd/variantNameForTheCorporateBody    1,293,751
http://RDVocab.info/ElementsGr2/biographicalInformation    1,084,183
http://RDVocab.info/ElementsGr2/professionOrOccupation    1,059,570
http://d-nb.info/gnd/publicationOfThePerson    986,418
http://RDVocab.info/ElementsGr2/dateOfBirth    971,993
http://d-nb.info/gnd/countryCodeForThePerson    823,100
http://d-nb.info/gnd/countryCodeForTheCorporateBody    759,088
http://RDVocab.info/ElementsGr2/periodOfActivityOfThePerson    539,230
http://RDVocab.info/ElementsGr2/gender    404,247
http://RDVocab.info/ElementsGr2/dateOfDeath    381,888
http://purl.org/dc/terms/identifier    337,230
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/hierarchicalSuperior    277,484
http://d-nb.info/gnd/personalName    258,214
http://d-nb.info/gnd/prefixName    233,481
http://d-nb.info/gnd/functionOfThePerson    211,045
http://d-nb.info/gnd/invalidIdentifierForThePerson    208,267
http://RDVocab.info/ElementsGr2/placeOfBirth    192,563
http://d-nb.info/gnd/qualifierName    169,284
http://www.w3.org/1999/02/22-rdf-syntax-ns#type    168,615
http://www.w3.org/2004/02/skos/core#prefLabel    163,854
http://www.w3.org/2004/02/skos/core#altLabel    143,254
http://xmlns.com/foaf/0.1/page    123,569
http://d-nb.info/gnd/invalidIdentifierForTheCorporateBody    122,999
http://www.w3.org/2004/02/skos/core#broader    118,696
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/predecessor    110,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/successor    109,819
http://www.w3.org/2004/02/skos/core#narrower    102,850
http://d-nb.info/gnd/preferredNameAcronymForTheCorporateBody    102,517
http://RDVocab.info/ElementsGr2/dateOfEstablishment    88,470
http://d-nb.info/gnd/academicTitleOfThePerson    77,763
http://RDVocab.info/ElementsGr2/placeOfResidence    70,112
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedCorporateBodyPerson    65,319
http://www.w3.org/2004/02/skos/core#closeMatch    60,893
http://xmlns.com/foaf/0.1/homepage    59,065
http://RDVocab.info/ElementsGr2/dateOfTermination    38,997
http://www.w3.org/2004/02/skos/core#definition    37,086
http://RDVocab.info/ElementsGr2/placeOfDeath    35,266
http://d-nb.info/gnd/locQualifier    35,220
http://d-nb.info/gnd/studyPathsOfThePerson    33,307
http://www.w3.org/2004/02/skos/core#related    26,971
http://RDVocab.info/ElementsGr2/nameOfTheCorporateBody    20,009
http://RDVocab.info/ElementsGr2/languageOfThePerson    13,318
http://d-nb.info/gnd/variantNameAcronymForTheCorporateBody    12,786
http://www.w3.org/2004/02/skos/core#scopeNote    11,000
http://d-nb.info/gnd/useInsteadSWD    9,572
http://d-nb.info/gnd/useInsteadNoteSWD    9,522
http://d-nb.info/gnd/countryCodeForTheSubject    7,179
http://RDVocab.info/ElementsGr2/titleOfThePerson    6,798
http://purl.org/vocab/relationship/childOf    6,554
http://purl.org/vocab/relationship/parentOf    5,895
http://purl.org/vocab/relationship/spouseOf    5,613
http://d-nb.info/gnd/successorWithoutPredecessor    5,574
http://www.w3.org/2000/01/rdf-schema#label    4,761
http://d-nb.info/gnd/useConceptsInsteadSWD    4,761
http://d-nb.info/gnd/invalidIdentifierForTheSubject    4,635
http://purl.org/vocab/relationship/siblingOf    3,891
http://metadataregistry.org/uri/schema/RDARelationshipsGR2/relatedPersonPerson    2,764
http://d-nb.info/gnd/predecessorWithoutSuccessor    1,501
http://purl.org/vocab/relationship/grandchildOf    493
http://www.w3.org/2000/01/rdf-schema#seeAlso    484
http://purl.org/vocab/relationship/grandparentOf    416
http://purl.org/dc/terms/language    266
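A tally like the one above boils down to pulling the predicate out of each N-Triples line and counting. The following is a minimal sketch under the assumption of one triple per line, not the exact script used for the numbers in the table; the regular expression only looks at the subject and predicate and ignores whatever the object is.

```python
import re
from collections import Counter

# Subject is a URI or a blank node; the predicate is always a URI.
PREDICATE = re.compile(r"^(?:<[^>]*>|_:\S+)\s+<([^>]*)>\s")


def count_properties(lines):
    """Return a Counter mapping each predicate URI to its number of uses."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = PREDICATE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = [
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/12324250> .',
    '<http://d-nb.info/gnd/119053071> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Herta_M%C3%BCller> .',
    '<http://d-nb.info/gnd/119053071> <http://RDVocab.info/ElementsGr2/dateOfBirth> "1953" .',
    '_:name1 <http://d-nb.info/gnd/surname> "Müller" .',
]

counts = count_properties(SAMPLE)
for prop, n in counts.most_common():
    print(f"{n:>9,} {prop}")
```

Blank nodes are handled because the nested name structures in the DNB data (surname, foreName, usedRules) hang off blank nodes rather than URIs.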

So we see heavy usage of the http://d-nb.info/gnd/ vocabulary, but we don’t know precisely how this vocabulary connects up with other vocabularies in use on the Web. We also see the new RDA vocabulary http://RDVocab.info/ElementsGr2 heavily used, whereas the trailblazing Royal Library of Sweden chose to lean more on the Friend of a Friend (FOAF) vocabulary. It’s very important that we see some convergence in vocabulary use, so that our distributed data is interoperable and mashable. This will undoubtedly lead to changes in what vocabularies are used, and growing pains in any applications that are dependent on the data. But I think it is worth it. I have high hopes that some of this convergence may come about as a result of meetings later this week at the Dublin Core Metadata Initiative 2010 meeting in Pittsburgh. But if it’s going to scale, we need to see this convergence going on all the time in online forums like the Linked Library Data discussion list, and via tools that allow library data managers to view the emerging web of library data.

Another niggling little problem is the need to synchronize these data sets. For example, how am I to know when the DNB has created, updated or deleted one of their authority records? I could wait for a database dump, and blow away what I knew before. But ideally there would be a mechanism to keep my own view of the DNB data synchronized. Of course there is the tried and true OAI-PMH, which VIAF is using to collect MARC records, but it is showing its age and doesn’t really fit the Linked Data pattern very well. There is the successor to OAI-PMH, OAI-ORE, which better fits more recent notions of Web Architecture and Linked Data. But there are some issues to do with very large resource maps which still need ironing out. The Dataset Dynamics group has been doing some interesting work identifying the various mechanisms for performing synchronization, with an emphasis on using Atom. Atom is a standard XML document format for describing sets of web resources. In fact OAI-ORE leverages Atom as one of the serialization formats for resource maps. But I’m personally hoping we’ll see some streamlined guidelines for publishing feeds for Linked Data that leverage Atom’s Feed Paging/Archiving for making large lists of resources available. Maybe Semantic Sitemaps (a Linked Data extension to the traditional sitemaps that the big web search engines use to stay on top of things) could help too. I imagine we’ll see a combination of these approaches, but I think it’s important to see some convergence amongst Library Linked Data publishers to help the ecosystem flourish.
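To make the Atom idea a bit more concrete, here is a sketch of how a client might read one page of a paged change feed. To be clear, the DNB does not publish such a feed as far as I know: the feed content, the example.org URLs and the timestamps below are all made up for illustration. The only real pieces are the Atom namespace and the rel="next" link relation from Atom Feed Paging.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A made-up page of a hypothetical change feed for authority records.
SAMPLE_PAGE = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hypothetical authority record changes</title>
  <id>http://example.org/changes</id>
  <updated>2010-10-20T12:00:00Z</updated>
  <link rel="self" href="http://example.org/changes?page=1"/>
  <link rel="next" href="http://example.org/changes?page=2"/>
  <entry>
    <id>http://d-nb.info/gnd/119053071</id>
    <title>Müller, Herta</title>
    <updated>2010-10-19T08:30:00Z</updated>
  </entry>
</feed>
"""


def read_page(xml_text):
    """Return ([(record_url, updated), ...], next_page_url_or_None)."""
    root = ET.fromstring(xml_text.strip())
    changes = [
        (entry.findtext(ATOM + "id"), entry.findtext(ATOM + "updated"))
        for entry in root.findall(ATOM + "entry")
    ]
    next_url = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
    return changes, next_url


changes, next_url = read_page(SAMPLE_PAGE)
print(changes)
print(next_url)
```

A synchronizing client would keep requesting the next page until there isn’t one (or until it reaches entries older than its last visit), and re-fetch the RDF for each record URL it finds along the way.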

Update: I shared some more pedantic thoughts about the d-nb.info URLs in another forum. I didn’t want these particular technical details/questions to detract from saying how important I think the DNB Linked Data release is.