TL;DR Elsevier typesetting turns double bonds into garbage.

Those of you who follow this blog will know that I contend that publishers corrupt manuscripts and thereby destroy science.

Those of you who follow this blog will know that Elsevier publicly stated that I could not use the new “Hargreaves” law to mine articles on their web page and that I must do this through their API. Originally there were zillions of conditions, which – under our constant criticism – have gradually (but nowhere near completely) disappeared. They now allow me to mine from the web page, but insist that their XML-API gives better content.

I have consistently refused to use Elsevier’s API for legal, political and social reasons (I don’t want to sign my rights away, be monitored, have to ask permission, etc.). But I also know from at least 5 years of trying to interpret publishers’ PDFs and HTML that information is corrupted. By this I mean that what the author submits is turned into something different lexically, typographically and often semantically. (Yes, that means that by changing the way something looks, you can change its meaning.)

Anyway yesterday Chris Shillum, who was part of the team I challenged, tweeted that he would let me have a paper – in XML format – from the Elsevier API. For those who don’t know, XML is designed to hold information in a style-free form. It can be rendered by a stylesheet or program (e.g. FOP) into whatever font you like. I’m very familiar with XML having run the developers’ list with Henry Rzepa in 1997 and been co-author of the universal SAX protocol. Henry and I have developed Chemical Markup Language (CML) precisely for the purpose of chemical publishing (among many other things).



But Elsevier don’t use CML, they use typographers who know nothing about chemistry. At school you may have heard of a “double bond” (http://en.wikipedia.org/wiki/Double_bond). It’s normally represented by two lines between the atoms. We used to draw these with rapidographs, but now we type them. So every chemist in the world will type Carbon Dioxide as

O=C=O

capital-O equals capital-C equals capital-O

You can do it – nothing terrible happens. You can even search chemical databases using this. They all understand “equals”.

But that’s not good enough for Elsevier (and most of the others). It has to look “pretty”. It’s more important that a publication looks pretty than that it’s correct. And that’s one of the major ways they corrupt information. So here’s the paper that Chris Shillum sent me.

First as a PDF.



Can you see the C=O double bond in the middle? “(C=O stretching)”. It’s no longer an equals, but a special publisher-only symbol they think looks prettier. Among other things if I search for “C=O” I won’t find the double bond in the text. That’s bad enough. But what’s far worse is that this symbol has been included in their XML. And this gets transmitted to the HTML – which looks like (you can try this yourself http://www.sciencedirect.com/science/article/pii/S0014579301033130 ).



???

What’s happened??? Do you also see a square? The double bond has disappeared.

The square is Firefox saying “I have been given a character I don’t understand and the best I can do is draw a square” – sorry. Safari does the same. Do ANY of you get anything useful? I doubt it.
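The search problem mentioned above can be sketched in a few lines of Python. This is only an illustration: I am using U+E000, the first Private Use Area codepoint, as a stand-in, and I am not claiming it is the codepoint Elsevier actually uses. The variable and function names are my own.

```python
# Hypothetical stand-in: U+E000 is simply the first Private Use Area
# codepoint, NOT necessarily the character in Elsevier's XML.
PLACEHOLDER_BOND = "\ue000"

# What a chemist types, and what the same phrase looks like once the
# equals sign has been swapped for a private-use character.
typed_by_author = "(C=O stretching)"
as_published = "(C" + PLACEHOLDER_BOND + "O stretching)"


def mentions_carbonyl(text):
    """Plain substring search, as a reader or text-miner would do it."""
    return "C=O" in text


print(mentions_carbonyl(typed_by_author))  # the author's text matches
print(mentions_carbonyl(as_published))     # the substituted text does not
```

The two strings look alike in a font that happens to have a glyph for the private-use character, but to any program they are simply different text.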

That’s because Elsevier has created a special Elsevier-only method of displaying chemistry. It probably only works inside Elsevier’s back room – it won’t work in any normal browser. Here’s what has happened.

Elsevier wanted a symbol to display a double bond. “Equals”, which all the rest of the world uses, isn’t good enough. So they created their own special Elsevier-double-bond. It’s not a standard Unicode codepoint – it’s in a Private Use Area (http://en.wikipedia.org/wiki/Private_Use_Areas). This is reserved for a single organisation to use. It is not intended for unrestricted public use. In certain cases groups, with mutual agreement, have developed communities of practice. But I know of no community outside Elsevier that uses this. (BTW the XML uses 6 Elsevier-only DTDs and can only be understood by reading a 500-page manual – the chemistry is hidden somewhere at the end. This is the monstrosity that Elsevier wishes to force us to use.)
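If you want to check a document for this kind of character yourself, here is a minimal sketch using Python’s standard `unicodedata` module, which reports the general category `Co` for private-use codepoints. The function name and the sample string are mine, and U+E000 is again only a placeholder, not necessarily Elsevier’s codepoint.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, 'U+XXXX') for every private-use character in text.

    Unicode assigns the general category 'Co' to all private-use
    codepoints, both in the BMP (U+E000..U+F8FF) and in planes 15 and 16.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# Placeholder sample: U+E000 stands in for a publisher-specific symbol.
sample = "(C\ue000O stretching)"
print(private_use_chars(sample))   # [(2, 'U+E000')]
print(private_use_chars("O=C=O"))  # [] because plain equals signs are fine
```

A check like this tells you that something non-standard is present and where, but not what it was meant to be. That meaning lives only in the publisher’s private documentation.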

It’s highly dangerous. If you change a double bond to a triple bond (ethylene => acetylene) the compound can explode and blow you up. But double and triple bonds are both represented by a hollow square if you try to view Elsevier-HTML. And goodness knows what else.

So Elsevier destroys information.

Chris Shillum tells me on Twitter that it’s not a problem. But it is. Using the Private Use Area without the agreement of the community is utterly irresponsible. No one even knew that Elsevier was doing it.

Why’s it irresponsible? Because many languages use it for other purposes. See Wikipedia above. Estonian, Tibetan, Chinese … If an Elsevier-double-bond is used in these documents (e.g. an Estonian chemistry department) there will be certain corruption of both the chemistry and the Estonian. There are probably 10 million chemical compounds with double bonds and all will be corrupted.

But it’s also arrogant. “We’re Elsevier. We’re not going to work with existing DTDs (XML specifications) – we’re going to invent our own.” Who uses it outside Elsevier? “And we are going to force text-miners to use this monstrosity.”

And it’s the combined arrogance and incompetence of publishers that destroys science during the manuscript processing. I’ve been through it. I know.



