March 2020

Over the years, we have digitized and gathered quite a few books relating to medicine, but we have only now made a thematic page for this topic. Most of the books found there are in Swedish. We welcome suggestions for more works to digitize. Recently, we have added:

While the whole world is worrying about the new coronavirus, we thought that maybe we can derive some wisdom from the history of previous epidemics. The Spanish Flu of 1918 comes to mind. What has been written about it, really? Perhaps the best accounts we have are encyclopedia entries from the 1920s about influenza, such as this one in Nordisk familjebok.

February 2020

Since we boldly digitize journals and encyclopedias (having numerous contributors) 70 years after they were published, regardless of when each contributor lived, we have also scanned:

Public Domain Day is January 1st. This is when works by a new group of authors enter the public domain because copyright expires when they have been dead for 70 years. We continue to celebrate it, even though it is already February. So who died in 1949? And what have we added so far?

November 2019

From November 12 to December 21, a small banner (the one above) was seen on some of our web pages, promoting donations toward our aim of raising 25,000 SEK for the fiscal year 2019/20. The idea was that the banner would be removed as soon as the aim had been reached, to reappear next year. We have long had a "Donate" link in the header of all our web pages. Read more on our donation page.

March 2019

Redoing OCR

In the year 2000 and again in 2010 we found that OCR of fraktur (blackletter, Gothic) was too difficult and could wait. For normal print (antikva, Latin) we have used the commercial software ABBYY Finereader with great success. Since 2007 we have also increasingly imported books that have been scanned by others and often copied both scanned images and OCR text.

Around 2013 or 2014, the OCR quality for books printed in fraktur and scanned by Nasjonalbiblioteket of Norway suddenly improved radically. It seems they have used a special edition of Finereader developed by some German/Austrian project, but this was outside of our reach. Later, books in fraktur digitized by Det Kongelige Bibliotek of Denmark have also become better.

As we return to this problem in 2019, the free software Tesseract (Wikipedia, GitHub, wiki) is now in version 4.0 and a standard part of the Ubuntu Linux distribution, with support for Swedish and Danish fraktur added around 2015. The output is far from excellent and not as good as the Norwegian books, but it is much better than some others and quite useful as a starting point for manual proofreading.

Using Tesseract, we are now starting to redo OCR for some books in fraktur. The first attempt is Søren Kierkegaards Samlede Værker (15 volumes, 1920-1926), which were digitized in 2009 at the University of Toronto by the Internet Archive. From their OCR text, of terrible quality, it is apparent that they used ABBYY Finereader for Latin letters. We copied volumes 1-8 in 2014, but decided in 2015 to do our own OCR by manually training Finereader to interpret the fraktur text. This was time-consuming and painful, and the result was not very good. Now, we have copied the remaining volumes and redone OCR for all of them with Tesseract, with much better results.

In the meantime, a new edition of Søren Kierkegaards Skrifter (55 printed volumes, 2007-2013) has been published and come online at SKS.dk. There you will find all of the texts, without needing to proofread anything. However, this is not true for all the other books that we provide.

A problem is that we have no algorithm for determining which OCR text is better. The right way to determine this is to manually proofread the page and then see which OCR candidate required fewer edits to reach the desired result. But of course, when we have two OCR texts for the same page, we want to find out which is better without needing to proofread the page. And we can't just use a spell checker, because then any sequence of correctly spelled words would win, regardless of its similarity to the scanned page. So far, we only redo OCR on pages where the naked eye can immediately see that there are too many errors typical of bad fraktur OCR, for example words such as "reban" (redan) or "ogfaa" (ogsaa).
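To make the two ideas above concrete, here is a minimal Python sketch, not the procedure we actually run. The first part counts the edits needed to turn an OCR candidate into the proofread text (plain Levenshtein distance), which only helps once a page has been proofread. The second part counts known marker words of bad fraktur OCR as a rough screen for pages worth redoing. The function names and the marker list are assumptions made for illustration; the list contains only the two examples mentioned above.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def better_candidate(proofread: str, candidates: dict) -> str:
    """Return the name of the OCR candidate that needs the fewest
    edits to reach the proofread text."""
    return min(candidates,
               key=lambda name: edit_distance(candidates[name], proofread))


# Hypothetical marker list: only the two examples from the text above.
FRAKTUR_MARKERS = {"reban", "ogfaa"}


def count_marker_words(text: str, markers=FRAKTUR_MARKERS) -> int:
    """Count words that are typical of bad fraktur OCR."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for w in words if w in markers)
```

For example, `better_candidate("han har redan", {"old": "han har reban", "new": "han har redan"})` picks `"new"`. The marker count needs no proofread text, but it only catches errors that are already on the list, so it is a screen for obviously bad pages rather than a measure of which OCR text is better.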