Since its publication in 1993, Martha Mitchell's 629-page Encyclopedia Brunoniana has served as the definitive reference work on Brown University's history. Its 668 articles document the University's buildings, departments, people, and publications. The Liber Brunoniana project uses natural language processing techniques to transform Mitchell's text into hypertext, automatically inferring over 5,000 hyperlinks between articles, sorting content into categories, and constructing pages detailing the events of each year mentioned in the encyclopedia. This article details the techniques used to construct Liber Brunoniana, a book freed from the limitations of paper.

The possibility of creating Liber Brunoniana owes itself to Brown's longstanding distribution of a basic online edition of Encyclopedia Brunoniana. We used a simple Python script to scrape the article text of this online edition. The faithful rendition of the text into mostly semantic HTML (blockquotes are enclosed in the appropriate tag, for example) eased the subsequent steps: transforming the text with rule-based natural language processing, and transforming the markup for our presentation.
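A minimal sketch of such a scraper follows, assuming requests and BeautifulSoup; the index URL is a placeholder, and the selectors are illustrative rather than the ones the real edition requires:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical location; the real online edition lives elsewhere.
INDEX_URL = "https://example.edu/encyclopedia-brunoniana/"

def scrape_index():
    """Collect candidate article links from the index page.

    Filtering out navigation links and other non-article hrefs is omitted.
    """
    soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def scrape_article(url):
    """Fetch one article page and return its title and body markup."""
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    # Assumes the article title sits in an <h1>; the mostly semantic
    # HTML means the body survives extraction largely intact.
    title = soup.find("h1").get_text(strip=True)
    return title, soup.find("body")
```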

Transformation

Presentation aside, the current online edition of Encyclopedia Brunoniana is an effective transformation of a text into HTML, but a poor case study in enriching a document with hypertext. This fault is particularly jarring since hypertext was designed not for building applications (as is now the trend), but for organizing and presenting vast collections of documents. The only navigation mechanism the current online edition provides is the index of over six hundred hyperlinks on its home page. Index-based navigation suits printed books because books let the reader browse with the mere flip of a page. When the ability to browse is removed, index-based navigation remains suitable only for readers who have a precise quarry in mind.

For insight into what an effective rendition of the encyclopedic form in hypertext entails, we looked to none other than Wikipedia. The English edition of the site effectively presents over five million articles, a volume for which a print rendition would be infeasible! While the smaller number of articles in Encyclopedia Brunoniana doesn't preclude offering a definitive index of articles, it is large enough that more expressive navigation mechanisms are useful. Liber Brunoniana borrows Wikipedia's classification of articles into categories (including the ability to classify categories themselves into categories) and inter-document navigation via wikilinks. For technical reasons, we haven't yet implemented an integrated document search, but we suspect the need for search is diminished in collections of under a thousand documents. Finally, we borrowed Wikipedia's practice of thematic meta-pages, namely year pages, which summarize the events of a given year.
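One way to assemble such year pages is to group sentences by the years they mention. The helper below is a hypothetical sketch of that approach, assuming articles are available as a dict of title to plain text; the year range is an assumption:

```python
import re
from collections import defaultdict

# Four-digit years from Brown's founding era onward; the exact range
# used by Liber Brunoniana is an assumption here.
YEAR_RE = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")

def collect_year_mentions(articles):
    """Map each year to the (article title, sentence) pairs mentioning it."""
    mentions = defaultdict(list)
    for title, text in articles.items():
        # Crude sentence splitting suffices for a first pass.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            for year in YEAR_RE.findall(sentence):
                mentions[year].append((title, sentence))
    return mentions
```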

Iterative Classification

We initially tried to apply the clustering techniques detailed in Brandon Rose's Document Clustering in Python, but the results were, at a glance, underwhelming. That we could make this sort of at-a-glance evaluation at all owed itself to the uniform structure of Martha Mitchell's article titles. For many categories, we were able to write simple rules that matched precisely the set of articles we desired. Articles about people, for example, have titles following the structure "Last, First M.I.". Likewise, articles about Brown's gates contain the string "Gate" in their titles. The number of categories that could be derived with total accuracy from the title alone was small, but the certainty and success of the process enabled us to iteratively bootstrap structured semantics onto the text.

Articles about buildings, for example, invariably contain the phrase "built in", but so do other types of articles: the article about a building's namesake often refers to the structure that memorializes that person, and non-building articles such as gates contain the phrase "built in", too. To create a category containing all articles about buildings, we therefore searched for the text "built in" only within the set of articles that had not already been placed into the "people" or "gates" categories. Similarly, to construct the sub-category Professors, we filtered for the phrase "professor of" within articles already categorized as people. By iteratively structuring the document collection, we increased the precision of classification without increasing the complexity of performing it. No surefire regular expression identifies Publications with few false positives, but we were able to keep using very general search expressions by reducing the search space with prior classifications.
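A minimal sketch of this bootstrapping, assuming articles are held in a dict mapping title to plain text; the patterns shown are illustrative reconstructions, not our exact rules:

```python
import re

def classify(articles):
    """Iteratively assign categories, narrowing the search space each pass."""
    categories = {title: set() for title in articles}

    # Pass 1: rules that are exact from the title alone.
    person_title = re.compile(r"^[A-Z][\w'-]+, [A-Z]")  # "Last, First ..."
    for title in articles:
        if person_title.match(title):
            categories[title].add("people")
        if "Gate" in title:
            categories[title].add("gates")

    # Pass 2: "built in" only marks a building once people and gates
    # have been excluded from the search space.
    for title, text in articles.items():
        if not categories[title] & {"people", "gates"} and "built in" in text:
            categories[title].add("buildings")

    # Pass 3: sub-categories refine an earlier pass, not the whole set.
    for title, text in articles.items():
        if "people" in categories[title] and "professor of" in text.lower():
            categories[title].add("professors")

    return categories
```

Each pass keeps its search expression crude because the prior passes have already shrunk the set of articles it can misfire on.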

Entity Linking

To create the same experience of exploration that makes sites like Wikipedia and TVTropes addictive to navigate, we attempted to automatically identify keywords in articles that correspond to other articles and replace them with hyperlinks. The same structural properties of Martha Mitchell's titles that enabled classification simplified entity linking, too. To identify keywords, we simply performed a case-sensitive search of each article's text for the names of other articles. This naïve technique performed surprisingly well. Applied directly to a collection like Wikipedia, such a process would flood articles with irrelevant hyperlinks (to say nothing of the problem of disambiguation), but it proves suitable for collections of documents with narrow breadth and uncommon names. The only problematic article in Encyclopedia Brunoniana was Well, whose title is also an ordinary English word.

While a regular expression search-and-replace powered the initial attempt at entity linking, a common, confounding case rendered it useless. Brown (and by extension, Encyclopedia Brunoniana) tends to honor the notable individuals in its rich history by naming buildings after them. Thus to "John Hay", add the "John Hay Library", and so on. With regular expressions alone, it is impossible to express that a hyperlink should never be nested inside another hyperlink; to express this, we enter the realm of context-free languages. Handling these cases consistently threatened to explode the task into full-blown HTML parsing. Greg Hendershott's xexpr-map procedure reduced the challenge of expressing a context-aware tree transformer to a few lines of Racket; a rough Python analogue appears at the end of this section. For all articles, we perform linkification with an identity mapping between article name and keyword. For articles about people (which we can identify with absolute certainty), we additionally linkify with various common arrangements of name components.

Excluding links introduced by categories and year pages, this process introduced over 3,000 hyperlinks between documents. Visualized, it transforms a disparate cloud of about 680 articles into a complex web that leaves few documents orphaned (counting the hyperlinks introduced by date pages, there are no orphaned pages). That's not sufficient evidence to call this a functional improvement, but by exploring Liber Brunoniana you can be the judge of that.
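The sketch below reconstructs the context-aware transformer in Python rather than Racket, assuming each article is available as an HTML fragment and that BeautifulSoup stands in for xexpr-map's tree walk; url_for is a hypothetical slug helper, not part of the real implementation:

```python
import re
from bs4 import BeautifulSoup

def url_for(title):
    """Hypothetical slug scheme; the real site's URLs may differ."""
    return "/article/" + re.sub(r"\W+", "-", title).strip("-").lower()

def linkify(html, titles):
    """Turn article-title mentions into hyperlinks, skipping any text
    that already sits inside an <a>. A Python stand-in for the Racket
    xexpr-map transformer."""
    soup = BeautifulSoup(html, "html.parser")
    # Longest titles first, so "John Hay Library" beats "John Hay".
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(titles, key=len, reverse=True))
    )
    for node in soup.find_all(string=True):
        if node.find_parent("a"):
            continue  # never nest a link inside an existing link
        parts = pattern.split(node)
        if len(parts) < 2:
            continue  # no titles mentioned in this text node
        matches = pattern.findall(node)
        pieces = []
        for i, part in enumerate(parts):
            if part:
                pieces.append(part)
            if i < len(matches):
                link = soup.new_tag("a", href=url_for(matches[i]))
                link.string = matches[i]
                pieces.append(link)
        node.replace_with(*pieces)
    return str(soup)
```

Because the transformer visits only text nodes and checks each node's ancestry, the nesting rule that regular expressions cannot express becomes a one-line guard.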