For nearly 100 years, we have maintained a thesaurus to tag The New York Times. The thesaurus consists of more than a million terms organized into five controlled vocabularies: subjects, personal names, organizations, geographic locations and the titles of creative works (books, movies, plays, etc.).

At last week’s Semantic Technology Conference, we announced our intention to publish The New York Times thesaurus under a license that will allow the community to both use it and contribute back to it. The results will, in time, prepare The Times to enter the linked data cloud.

Evan Sandhaus

Releasing the Times thesaurus is consistent with our TimesOpen strategy. We want to facilitate access to slices of our data for those who want to include Times content in their applications. Our TimesTags API already makes available our most frequently used tags, the 27,000 that power our topics pages. But the new effort will go well beyond that. We plan to release hundreds of thousands of tags drawn from the corpus dating back to 1980 and later, in a second phase, hundreds of thousands more going back to 1851.

Our hope is that the community will link our thesaurus to existing taxonomies and, when we release the tags for the deep archive, help normalize the terms.

Many details still need to be ironed out, including the format, the hosting location and the license. At the conference, we asked the audience for input and received some good ideas. We would love to have more. Please use the comments area to ask questions, make suggestions or offer recommendations.