An optical microscope captures micrographs of cells and tissues, duplicated images of which can appear in the scientific literature. Credit: Mikhail Tereshchenko/TASS/Getty

Computer software can now quickly detect duplicate images across large swathes of the research literature, three scientists say.

In a paper published on 22 February on the bioRxiv preprint server1, a team led by Daniel Acuna, a machine-learning researcher at Syracuse University in New York, reports using an algorithm to crunch through hundreds of thousands of biomedical papers, searching for duplicate images. If journal editors adopted similar methods, they might be able to screen images more easily before publication — something that currently requires considerable effort and is done by only a few publications.

The work shows that it is possible to use technology to detect duplicates, says Acuna. He isn't making the algorithm public, because of the risk it could trigger false allegations. Instead he and his colleagues plan to license it to journals and research-integrity offices. Acuna says he has discussed the algorithm with Lauran Qualkenbush, director of the Office for Research Integrity at Northwestern University in Chicago, Illinois, and vice-president of the US Association of Research Integrity Officers. “It would be extremely helpful for a research-integrity office,” she says. “I am very hopeful my office will be a test site to figure out how to use Daniel’s tool this year.”

In early 2015, Acuna and two colleagues used an algorithm to extract more than 2.6 million images from the 760,000 articles then in the open-access subset of the PubMed database of biomedical literature, which is run by the US National Institutes of Health. These included micrographs of cells and tissues, and gel blots. The algorithm then zoomed in on the most feature-rich areas — where colour and greyscales vary most — to extract a characteristic digital ‘fingerprint’ of each image.
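The authors' actual fingerprinting algorithm is not public, but the idea can be illustrated with a simple perceptual hash — a minimal sketch, not the team's method. Here an "image" is just a 2-D list of grayscale values; a real pipeline would decode image files and focus on the feature-rich regions first:

```python
# Illustrative only: a tiny average-hash fingerprint, analogous in spirit to
# the approach described. Each pixel above the image's mean brightness
# contributes a 1 bit, so the fingerprint survives uniform contrast changes.

def average_hash(pixels):
    """Return a bit-string fingerprint: 1 where a pixel exceeds the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Two tiny 4x4 'images': the second is the first with contrast boosted.
img = [[10, 200, 30, 220], [15, 210, 25, 230],
       [12, 205, 35, 215], [18, 190, 28, 225]]
brighter = [[min(255, int(v * 1.3)) for v in row] for row in img]

print(hamming(average_hash(img), average_hash(brighter)))  # → 0
```

A Hamming distance of zero means the two fingerprints match despite the contrast change, which is the kind of robustness the reported system needs.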

After eliminating features such as arrows or flow-chart components, the team ended up with around 2 million images. The researchers only compared images across papers from the same first and corresponding authors, to avoid the computational load of comparing every image against every other one. But the system could pick up potential duplicates even if they had been rotated, resized or had their contrast or colours changed.

The trio then manually examined a sample of around 3,750 of the flagged images to judge whether they thought the duplicates were suspicious or potentially fraudulent. On the basis of their results, they predict that 1.5% of the papers in the database would contain suspicious images, and that 0.6% of the papers would contain fraudulent images.
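Scaled to the roughly 760,000 articles in the database, those percentages imply a substantial absolute number of papers. A back-of-the-envelope calculation using the figures reported in the article (the sample composition itself is not reproduced here):

```python
# Extrapolating the reported rates across the whole open-access subset.
total_papers = 760_000
suspicious_rate = 0.015  # 1.5% of papers predicted to contain suspicious images
fraud_rate = 0.006       # 0.6% predicted to contain fraudulent images

print(int(total_papers * suspicious_rate))  # → 11400
print(int(total_papers * fraud_rate))       # → 4560
```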

The researchers haven’t been able to benchmark the accuracy of their algorithm, says Hany Farid, a computer scientist at Dartmouth College in Hanover, New Hampshire — because there isn’t any database of known duplicate or non-duplicate scientific images against which they could test the tool. But he applauds the trio for applying existing techniques to real-world images and for working to put tools in the hands of journal editors.

Laborious process

At present, many journals check some images but relatively few have automated processes. For instance, Nature runs random spot checks on images in submitted manuscripts and also requires authors to submit unedited gel images for reference. It is currently reviewing its image-checking procedures. (Nature’s news team is editorially independent of its journal team.)

Some journals are following the lead of publications such as the Journal of Cell Biology and The EMBO Journal in manually screening most images in submitted manuscripts. But the process is time-consuming, and a routine, automated screen to streamline the process is long overdue, says Bernd Pulverer, chief editor of The EMBO Journal.

To spot image re-use across the literature, publishers would need to create a shared database of all published images against which submitted articles could be compared, says IJsbrand Jan Aalbersberg, head of research integrity at the Dutch publishing giant Elsevier.

There is a precedent for such co-operation. In 2010, scholarly publishers worked together on an industry-wide service to tackle plagiarism. Crossref, a non-profit collaboration of around 10,000 commercial and learned-society publishers, created CrossCheck, a service that collates full-text articles from its member publishers and uses the iThenticate plagiarism-detection software made by Turnitin, a company in Oakland, California. The service, since renamed Similarity Check, has helped to make screening submitted manuscripts for plagiarism routine practice in publishing.

There are currently no plans for a publisher-wide image-checking system, partly because the technologies are not yet mature, says Ed Pentz, executive director of Crossref. But Crossref is watching developments in the area with interest, he says.

Elsevier says it would support an initiative such as Similarity Check for images. Two years ago, the company set up a 3-year, €1-million (US$1.2-million) partnership with Humboldt University in Berlin to research article mining and to identify research misconduct. On 25 January, the project announced that it intends to create a database of images from retracted publications. Such a data set would provide a bank of test images for researchers developing automated screening of images in publications.