Unlocking 100 Years Of Scientific Papers Through Machine Learning

Reference mining is fundamental to the creation of citation networks and rich, discoverable digital libraries. In recent years, a number of tools have been developed to address this need, but they are often limited by input format, infrastructure requirements and runtime performance. The most recent developments in this space have focused on reference mining PDFs from arts and humanities literature, but there’s a growing need for a fast, accurate way of extracting and parsing references from a wide range of documents and formats across the full research landscape.

The Challenge for BMJ

The BMJ has an archive extending to hundreds of thousands of articles (some dating back to the 1840s) that exist only in PDF format. At the end of 2018, BMJ wanted to mine these PDFs for references and automatically structure them in CrossRef XML format, to make them widely available to the research community as part of the Initiative for Open Citations (I4OC). The publisher needed a fast, efficient way to extract, parse and link citations from over 200,000 articles, across 29 journals, to their source documents to support the open citation initiative and make their archives more discoverable.

Digital Strategy Lead for Research at BMJ, Helen King, was looking for solutions to this problem towards the end of last year, and explains:

“We were considering a range of solutions to help us mine our archives, extract millions of references, and automatically structure them in CrossRef XML format to make them available to the community as part of the Open Citations Initiative. We had spoken to Phil Gooch, Scholarcy’s founder, and were impressed by the tool’s use of machine-learning and automated linguistic analysis to extract key facts, references and data from research papers. The other thing that really struck us about Scholarcy was the speed of its reference extraction and parsing processes, with the ability to process 1000 PDFs per minute at an accuracy of up to 95%.”

The Challenge for Scholarcy

1. Converting over 20 different reference styles to a consistent format

In January of this year, BMJ gave us access to 200,000 articles (148GB of data) in PDF format. Most of these had been published between 1960 and 1998, but some dated back to the 1840s. Although most journals had moved to the Vancouver referencing format by the late 1980s or early 1990s, earlier issues of each journal had their own referencing style, so we needed to convert up to 29 different styles into the CrossRef XML format as part of the project.
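To give a feel for the target format, here is a minimal sketch of turning one already-parsed reference into a CrossRef-style citation element. It is an illustration only, not Scholarcy's pipeline: the function name, the dictionary keys and the sample values are made up, and the element names (journal_title, author, volume, first_page, cYear, article_title) reflect our reading of the CrossRef deposit schema and should be checked against the current schema before use.

```python
import xml.etree.ElementTree as ET

# (XML element name, key in the parsed-reference dict). The dict keys are
# hypothetical; the element names are assumed from the CrossRef deposit schema.
FIELD_ORDER = [
    ("journal_title", "journal"),
    ("author", "first_author"),
    ("volume", "volume"),
    ("first_page", "first_page"),
    ("cYear", "year"),
    ("article_title", "title"),
]


def citation_element(key, ref):
    """Build a <citation> element from a dict of parsed reference fields."""
    citation = ET.Element("citation", {"key": key})
    for tag, field in FIELD_ORDER:
        value = ref.get(field)
        if value:  # fields the parser could not recover are simply omitted
            ET.SubElement(citation, tag).text = str(value)
    return citation


# Placeholder values, not a real reference.
example = {
    "journal": "Example Journal",
    "first_author": "Author A",
    "volume": "2",
    "first_page": "1015",
    "year": "1960",
    "title": "An example article title",
}
print(ET.tostring(citation_element("ref1", example), encoding="unicode"))
```

Whatever the original referencing style, every reference ends up as the same small set of fields, which is what makes the output consistent across journals and decades.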

2. OCR: What you see isn’t always what you get

Each PDF had previously been scanned as an image and run through Optical Character Recognition (OCR) software. This meant that our machine learning models had to handle a wide range of inconsistent and noisy data, along with many different referencing styles, formats and typesetting quirks. For example, in some early issues, 1 would be typeset as a serif letter ‘I’ and 0 as a small letter ‘o’.
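The digit confusions described above can be illustrated with a small rule-based sketch. This is not how Scholarcy's models work (they are machine-learned); it is a simplified example that assumes a token containing at least one real digit, and otherwise only digits, ‘I’ or ‘o’, was meant to be a number. The function name and the mapping are hypothetical, and the mapping covers only the two substitutions mentioned here.

```python
import re

# Letters that stand in for digits in the scanned text (only the two
# substitutions mentioned in the article; extend as needed).
OCR_DIGIT_FIXES = str.maketrans({"I": "1", "o": "0"})

# A whole token made of digits, 'I' and 'o' that contains at least one digit.
NUMERIC_LIKE = re.compile(r"\b(?=[Io]*\d)[\dIo]+\b")


def normalize_ocr_digits(text):
    """Replace 'I' and 'o' with 1 and 0 inside tokens that look numeric."""
    return NUMERIC_LIKE.sub(
        lambda m: m.group(0).translate(OCR_DIGIT_FIXES), text
    )


print(normalize_ocr_digits("Br Med J I96o; 2: Io15-2o"))
# prints: Br Med J 1960; 2: 1015-20
```

Tokens with no digit at all, such as a standalone ‘I’ or an ordinary word, are left untouched, which is the main safeguard against corrupting author names and titles.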

In scanned articles, the references section often looks like this on screen and in print:

But the raw OCR text extracted from the PDF looks like this:

Our machine learning models had to turn this noisy data into clean, structured references like this:

3. Older articles don’t always have a defined references section

In many articles in BMJ’s archive, particularly those written before 1960, we found that references could occur anywhere in the article. This meant that Scholarcy’s models had to be adapted to accurately extract all referenced sources in an article, regardless of their location, formatting or conformity to more recent publishing protocols.

4. Many single file PDFs contained multiple articles

A further challenge was that around 40,000 of the PDF files to be processed contained multiple articles. Again, Scholarcy’s algorithms needed to identify and accurately separate the individual articles within a single file before extracting the references for each one. This segmentation was achieved by using fuzzy matching to locate each article title within the PDF, and then reading from there until the start of the next article or the end of the file.
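The fuzzy-matching segmentation can be sketched roughly as follows, using Python's standard difflib module. This is a simplified illustration under stated assumptions rather than Scholarcy's actual code: it assumes the article titles are already known (for example from journal metadata) and listed in document order, that each title sits on a single line of the extracted text, and that a similarity of 0.8 is a reasonable cut-off. The function names and the threshold are hypothetical.

```python
from difflib import SequenceMatcher


def _similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_title_line(lines, title, start=0, threshold=0.8):
    """Return the index of the line most similar to title, or None."""
    best_idx, best_score = None, 0.0
    for i in range(start, len(lines)):
        score = _similarity(lines[i].strip(), title)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx if best_score >= threshold else None


def split_articles(text, titles, threshold=0.8):
    """Split text into one segment per title, in document order."""
    lines = text.splitlines()
    starts, cursor = [], 0
    for title in titles:
        idx = find_title_line(lines, title, cursor, threshold)
        if idx is not None:
            starts.append((title, idx))
            cursor = idx + 1  # the next title must appear later in the file
    segments = {}
    for n, (title, idx) in enumerate(starts):
        # Read until the start of the next article or the end of the file.
        end = starts[n + 1][1] if n + 1 < len(starts) else len(lines)
        segments[title] = "\n".join(lines[idx:end])
    return segments
```

Fuzzy rather than exact matching matters here because the title in the OCR text may contain recognition errors (say, "Chi1dren" for "Children") and would otherwise never be found.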

Project results

From requirements gathering and algorithm refinement through to extracting over 2 million citations as validated XML records in CrossRef, the entire project ran for 12 weeks. Publications that particularly benefited included the British Medical Journal itself (279,000 new records), Gut (177,000), Journal of Clinical Pathology (171,000) and Journal of Neurology, Neurosurgery and Psychiatry (168,000).

99.9% of the extracted records were fully valid XML. In only 0.1% of cases, the XML required some manual correction to meet CrossRef validation standards. The records were uploaded to CrossRef and are now available as open citations for anyone to reuse.

Talking about the results of the joint BMJ-Scholarcy initiative, Helen King said: