Many have wondered what the semantic web and publishing can offer each other. (By "publishing" here, I mean "making content available in one medium or another, ideally to make money".) After following a lot of writing and discussions in these two worlds (and they are surprisingly separate worlds), I have a few ideas and wanted to write them up where people could comment on them.

What can the publishing world offer to the semantic web?

The less obvious win, but to me the clearest one, is what the publishing world can offer to the semantic web: the lessons learned from long practical experience with developing and applying taxonomies, such as identifying useful concepts, naming them, identifying the useful relationships between them, and mapping units of content to those concepts. Many of the if-you-build-it-they-will-come ontologies out there seem to be thrown together in the hope that someone will use them, with no examination of use cases beyond the needs of the individual developers who created them, and sometimes without even a close look at those needs. Semantic web technology gives us the standards and tools to assign descriptive terms to resources so that people (and software agents) who need those resources can identify them more easily; taxonomy professionals know the best practices for picking terms that will help the larger project meet specific goals. (For an example of this thinking, see part 1 and part 2 of the article "Creating User-Centred Taxonomies" from the FUMSI group, which is just one of the resources I've learned about since I began following the Taxonomy Community of Practice.)

What can the semantic web offer to the world of publishing?

I've heard discussions in which publishers picture machine-readable encoded semantics of content driving customers to that content, but this sounds a little pie-in-the-sky for now. (I'd be happy if someone could point me to indications that working examples of this using semantic technology are imminent.) Publishers who want more people to find their content on the web would be better off putting greater effort into basic search engine optimization, and will find solid practical advice in Jamie Lowe's SEO for Publishers presentation.

Semantic web technologies, as opposed to the grander idea of the Semantic Web itself, offer tools that can help publishers assemble and distribute their content more efficiently, and I think that this low-hanging fruit is a better place to start, if only to get a better idea of the technology's strengths and weaknesses.


More and more publishing these days is about aggregation. When so much content is available from so many places for free, we're more likely to pay money for (or put up with ads next to) content selected by people whose judgment we trust. There are many models for aggregation, ranging from print publications such as Utne Reader to grand old online services such as Nexis and Factiva to more Web 2.0-oriented approaches such as Digg and Reddit. Now more than ever, publishers know that metadata makes it easier for both publishing staff and readers to track and connect relevant content, but a problem for aggregators is that while they're happy to get metadata with the content that they collect, different content sources will send different sets of metadata.

There may be certain fields of metadata that most content chunks have in common, such as Dublin Core fields, but what can an aggregator/publisher do to take advantage of content metadata when the metadata fields for one source's articles don't line up with the fields in another source's articles? Or when the same thing happens with images?

According to traditional practice, the aggregator should put this data into a database that may be built into a CMS or set up as a standalone relational system such as Oracle, MySQL, or SQL Server. In either case, a crucial step in the setup is deciding which fields you want to track.

Let's say you define 10 fields of metadata to track. If an article arrives with 12 fields of metadata, but only 8 match fields that you've defined, you store those 8, throw out the other 4, and have 2 blanks left over. If, over time, you find yourself throwing out a particular field that more and more content providers have been including with their articles and images, you can modify your database schema or revise the customized fields in your CMS to start collecting that field from that point on, but this is rarely a quick and simple procedure, and all the values delivered to you for that field in content you've received up to then are still lost.
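To make the arithmetic concrete, here is a minimal Python sketch of fixed-schema ingestion. The field names and the sample article are made up for illustration; the only thing taken from the scenario above is the 10/12/8 split.

```python
# Hypothetical fixed schema: the 10 fields the aggregator chose to track up front.
SCHEMA_FIELDS = [
    "title", "creator", "date", "subject", "publisher",
    "language", "rights", "format", "identifier", "description",
]

def ingest(record):
    """Split an incoming record into what a fixed schema keeps, drops, and leaves blank."""
    stored = {f: record.get(f) for f in SCHEMA_FIELDS}  # missing fields become None
    dropped = {k: v for k, v in record.items() if k not in SCHEMA_FIELDS}
    blanks = [f for f, v in stored.items() if v is None]
    return stored, dropped, blanks

# A made-up article that arrives with 12 fields, 8 of which match the schema.
article = {
    "title": "Across Canada by Train",
    "creator": "A. Writer",
    "date": "2009-01-15",
    "subject": "travel",
    "publisher": "Example Press",
    "language": "en",
    "rights": "All rights reserved",
    "format": "text/html",
    "wordCount": "1200",
    "section": "Travel",
    "embargoDate": "2009-01-20",
    "byline": "By A. Writer",
}

stored, dropped, blanks = ingest(article)
kept = {f: v for f, v in stored.items() if v is not None}
print(len(kept), "kept;", len(dropped), "thrown out;", len(blanks), "left blank")
```

The four entries in dropped are gone for good in this model: changing SCHEMA_FIELDS later does nothing for records that were ingested before the change.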

The kind of technology developed to support semantic web projects offers an alternative. The RDF triples at the base of semantic technology let you store the fact that a particular resource (for example, a JPEG file) has a field with a particular name (for example, "resolution") and a particular value for that field (for example, "72dpi"). Actual resource and field names must be URLs to avoid confusion (I discussed this a bit last week); if you can do this, you can store any metadata about anything. The {resource, field name, field value} combination (more technically known as a subject/predicate/object) is called a triple, and the database managers that store them are called triplestores. Unlike relational database managers and production XML systems, the technology for working with these triples doesn't need to know about field names in advance. The flexibility that this offers lets developers fit applications around their data instead of shoehorning their data into the current application's requirements, which can put a lot of constraints on future possibilities for both the applications and the data.
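As a rough illustration of why a triplestore doesn't need field names in advance, here is a toy in-memory version in Python. It is a sketch of the data model only, not how any real triplestore is implemented, and the example.org URLs are placeholders I made up.

```python
class TripleStore:
    """A toy store of (subject, predicate, object) triples; no schema is declared anywhere."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

store = TripleStore()
image = "http://example.org/images/train.jpg"  # hypothetical resource URL

# The "resolution" field from the example above, named with a placeholder URL.
store.add(image, "http://example.org/terms/resolution", "72dpi")

# A field nobody planned for can be stored the moment it shows up.
store.add(image, "http://example.org/terms/cameraModel", "Example Cam 3000")

for subject, predicate, value in store.match(subject=image):
    print(predicate, "=", value)
```

Nothing in the store had to change to accept cameraModel; the new field name is just data, like the value that goes with it.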

This flexibility does offer the possibility that two publishers might use different field names for the same concept, as Dale Waldt described in the posting I responded to last week, but the OWL part of the semantic web technology stack can help to account for that. For example, what if two publishers use different URLs to indicate the title of an article? If one uses a term from the Adobe XMP namespace to assign an article an http://ns.adobe.com/xap/1.0/Title value of "The Trans-Siberian Railroad", and the other publisher assigns another article an http://purl.org/dc/elements/1.1/title value of "Across Canada by Train", a bit of OWL (as demonstrated in my response to Dale) can show that these terms mean the same thing so that a single query for titles retrieves both articles.
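Here is a small Python sketch of the effect, assuming the equivalence is stated with an owl:equivalentProperty triple, which is one way OWL can express it. The lookup function is a hand-rolled stand-in for what an OWL-aware query engine would do, and the article URLs are placeholders.

```python
OWL_EQUIVALENT = "http://www.w3.org/2002/07/owl#equivalentProperty"
XMP_TITLE = "http://ns.adobe.com/xap/1.0/Title"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

triples = {
    ("http://example.org/articles/1", XMP_TITLE, "The Trans-Siberian Railroad"),
    ("http://example.org/articles/2", DC_TITLE, "Across Canada by Train"),
    # The one extra statement that ties the two vocabularies together.
    (XMP_TITLE, OWL_EQUIVALENT, DC_TITLE),
}

def equivalents(prop, triples):
    """Return prop plus every property linked to it by equivalentProperty, in either direction."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != OWL_EQUIVALENT:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in found:
                    found.add(b)
                    frontier.append(b)
    return found

def values_of(prop, triples):
    """Collect the values of prop and of any property declared equivalent to it."""
    props = equivalents(prop, triples)
    return sorted(o for s, p, o in triples if p in props)

titles = values_of(DC_TITLE, triples)
print(titles)
```

Asking for either title property returns both article titles, and neither publisher had to change the metadata they send.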

If you as an aggregator feel that it would be easier for your suppliers to use a more normalized set of vocabulary terms, get them together and talk about it. This is what standards groups such as OASIS and IDEAlliance are for. (IDEAlliance's PRISM working group, whose motto is "Developing a standard XML metadata vocabulary for the publishing industry", is just such a group, and they include an RDF profile as part of their standard.)

Getting More Semantic

If I'm recommending semantic web tools to help you keep track of things such as the resolution of your digital images, you might ask "what's so semantic about that?" It's not particularly semantic, but it uses semantic web technology to track metadata that helps your staff and customers more easily find the content that they need, so it does help toward the greater goal. If you want to push this technology a little further to incorporate metadata about the semantics of the content—without spending money on software—look into OpenCalais, which analyzes content and returns a copy with RDF representations of key terms it found and information about what classes those key terms fall into (for example, that "Slumdog Millionaire" is a Movie or that "Golden Globe" is an EntertainmentAwardEvent). I played with the first release of OpenCalais to create the BlogBigPicture website, which uses this metadata to ease navigation of news about Hollywood gossip, investing, the British Premier League, world business, and U.S. politics. You can take the metadata that OpenCalais returns and store it in the triplestore of metadata about your content as easily as you can store information about the resolution of your digital images.
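To show how entity metadata can sit next to plain technical metadata, here is a Python sketch with hand-written triples. It does not call OpenCalais or reproduce its actual output format; the example.org URLs, the "mentions" property, and the class URLs are all assumptions for illustration, with only the class names Movie and EntertainmentAwardEvent taken from the examples above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # placeholder namespace, not a real vocabulary
MENTIONS = EX + "terms/mentions"

triples = {
    # Plain technical metadata about an image.
    (EX + "images/poster.jpg", EX + "terms/resolution", "72dpi"),
    # Hand-written stand-ins for entity metadata about an article.
    (EX + "articles/42", MENTIONS, EX + "entities/slumdog-millionaire"),
    (EX + "articles/42", MENTIONS, EX + "entities/golden-globe"),
    (EX + "entities/slumdog-millionaire", RDF_TYPE, EX + "classes/Movie"),
    (EX + "entities/golden-globe", RDF_TYPE, EX + "classes/EntertainmentAwardEvent"),
}

def content_mentioning(class_url, triples):
    """Find content that mentions at least one entity of the given class."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == class_url}
    return sorted({s for s, p, o in triples if p == MENTIONS and o in instances})

movie_articles = content_mentioning(EX + "classes/Movie", triples)
print(movie_articles)
```

Both kinds of metadata live in the same set of triples, so a navigation feature such as "articles that mention a movie" needs no storage design beyond what the resolution example already used.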

Don't let the grander ideas about semantics distract you too much just yet, though. Prototypes aimed at lower-hanging fruit will give you a better focus on which of the grand ideas can help your business. There's plenty of free software available to create these prototypes, and even Oracle provides support for triplestores nowadays. So, if you're interested in what semantic web technology can do for your publishing business, start thinking about some inexpensive short-term projects that will give you a better idea of the long-term possibilities.