Back in October, I gave an example of how we're already using data to drive our decision-making at Trends: a network analysis, supported by very simple text mining, used to generate cluster maps of the scientific concepts most strongly associated with Trends in Biotechnology. But now I'd like to zoom out a bit, turn to the future, and speculate about how we might use analytics to understand, classify, and, yes, even write papers years from now.

Machine reading

A first critical obstacle in applying machine learning to analyze scientific publications is teaching the machines to read them as humans would. A quip I keep hearing from strategy-inclined folks at Elsevier is that machines are now our most voracious readers, and if that's not literally true yet, it certainly will be soon. But there's a wide gulf between a machine processing the words in a published article and a machine understanding the point of the article.

One of machine learning's most ubiquitous commercial successes is the recommender system, the algorithm that tells you everything from what you should binge on Netflix this weekend to what else you need to add to your cart on Amazon. And even though they're everywhere on the contemporary Internet, they're far from perfect. For every insightful success (why yes, I do think I need a screen protector to go along with my new tablet), there's the kind of error that only a machine could make (no, the fact that I just bought a giant book filled with crossword puzzles doesn't mean I need any more giant books filled with crossword puzzles). It's relatively easy to create a recommender system that relies on text similarity; the real utility of the method, and the biggest challenge in programming it, lies in finding a way to determine context similarity.
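
To see why the easy half is easy, here's a minimal sketch of a text-similarity recommender, using scikit-learn and a few invented abstracts. Everything it knows about an article is which words appear in it, which is exactly why it can't tell one giant book of crossword puzzles apart from any other giant book.

```python
# A minimal text-similarity recommender: articles become vectors of
# word weights, and "similar" just means "shares distinctive words".
# The abstracts are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Metabolic engineering of algae strains improves lipid yield.",
    "CRISPR screens reveal essential genes in bacterial metabolism.",
    "Mechanical strain alters stem cell differentiation on implants.",
]

# Weight each word by how distinctive it is across the collection.
vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Recommend the article most similar to the first one.
scores = cosine_similarity(vectors[0], vectors).ravel()
scores[0] = -1.0  # don't recommend an article to itself
print("Most similar to article 0:", scores.argmax())
```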

The branch of machine learning that can do this is called natural language processing, and it has its work cut out for it if it wants to understand academic articles. The usual problems exist: in a given context, is "set" a verb, noun, or adjective? Particular to scientific writing, though, is a whole suite of vocabularies that might or might not overlap: does "strain" refer to a material property or a subspecies of bacteria? Sometimes the answer is obvious from the topical context, if you're talking about civil engineering or epidemiology. But what about a paper on cellular biophysics? Or functionalized bone implants? The task gets harder.
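
As a toy illustration of the first problem, an off-the-shelf part-of-speech tagger has to commit to one reading of "set" in each sentence. Here's a sketch with NLTK; the sentences are my own, and the resource names vary a bit between NLTK versions.

```python
# The same surface word plays different grammatical roles,
# and the tagger has to pick one in each context.
import nltk

# Resource names differ across NLTK releases; fetch the common variants.
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

sentences = [
    "We set the incubator to 37 degrees.",     # "set" as a verb
    "The training set contained 500 papers.",  # "set" as a noun
]

for sentence in sentences:
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    print([tag for tag in tags if tag[0].lower() == "set"])
```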

As editors, what would we do with machine-read papers? One obvious application would be to create a way to automatically and objectively annotate our own publications. For instance, sometimes an article's keywords are so broad that they tell you nothing about the article, or so narrow that no interested reader would ever search for them. A machine-reading algorithm could suggest keywords simply by counting words and phrases in a document and comparing them to an expected background frequency. A more intelligent one could learn which concepts were important and align them to a standardized ontology.
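
The simple version of that algorithm fits in a few lines: score each word by how over-represented it is in the manuscript relative to some background collection, and propose the top scorers as keywords. The background rates below are invented placeholders; in practice they'd come from a large reference corpus.

```python
# Naive keyword suggestion: flag words that appear far more often in
# this document than an expected background frequency would predict.
from collections import Counter

# Per-million-word background rates; invented placeholders standing in
# for counts from a large reference corpus.
background = {"the": 60000, "cell": 300, "algae": 5, "yield": 40}

def suggest_keywords(text, top_n=3):
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(words)
    total = sum(counts.values())
    def score(word):
        observed = counts[word] / total * 1_000_000
        expected = background.get(word, 1)  # treat unseen words as rare
        return observed / expected
    return sorted(counts, key=score, reverse=True)[:top_n]

print(suggest_keywords("The algae strain doubled its lipid yield."))
```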

We could also use this approach to improve our own recommender systems, like the ones you see on ScienceDirect article pages, with the aim of suggesting papers with similar content instead of just similar words.
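
One hedged guess at how "similar content" might work: instead of comparing articles word by word, first compress each one into a mixture of topics (here with latent Dirichlet allocation) and compare the mixtures. Again the abstracts are invented, and a real system would train on far more than three documents.

```python
# Compare articles by topic mixture rather than by raw word overlap.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Algae cultivation and lipid extraction for biofuel production.",
    "Engineering cyanobacteria to boost photosynthetic fuel yields.",
    "A survey of word embeddings for biomedical text mining.",
]

counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)  # one topic mixture per article

# Articles 0 and 1 share almost no words but, ideally, share a topic.
print(cosine_similarity(topics[0:1], topics[1:2]))
```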

Article classification

When I apply my data-driven cluster framework to the articles published in TIBTECH, I manually assign every review to one or two clusters. An algorithm that did that for me would not only save me some work but could also base its decision on the actual contents of the paper, not just my perception of it. Such an algorithm could detect many uses of the word "algae" (or, better yet, the names of specific species and strains of algae) and decide it was probably a bioprocessing paper. This same analysis could easily extend to papers published in other biotech reviews journals or in the research journals that make up TIBTECH's research communities. That information, in turn, would give me a more precise and objective picture of what kinds of articles were being published across the biotech landscape and suggest what sorts of reviews to commission.
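
The crude version of that algorithm is almost embarrassingly simple to sketch. The cluster vocabularies below are invented stand-ins for the terms my real clusters are built on.

```python
# Assign an article to whichever clusters' vocabularies it uses most.
# The vocabularies are invented stand-ins for real cluster terms.
CLUSTER_TERMS = {
    "bioprocessing": {"algae", "fermentation", "bioreactor", "yield"},
    "therapeutics": {"antibody", "tumor", "clinical", "dosing"},
    "omics": {"genome", "transcriptome", "sequencing", "crispr"},
}

def assign_clusters(text, max_clusters=2):
    words = set(text.lower().split())
    hits = {name: len(words & terms)
            for name, terms in CLUSTER_TERMS.items()}
    ranked = sorted(hits, key=hits.get, reverse=True)
    return [c for c in ranked[:max_clusters] if hits[c] > 0]

print(assign_clusters("optimizing algae bioreactor yield via crispr"))
```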

Incorporating a neural network or another classification algorithm could create even more advanced versions of these review-reading programs. For instance, one could learn its own criteria for what makes something a bioprocessing paper (say, the names of certain species of bacteria appearing within some proximity of the terms "yield" or "metabolism"). Or it could produce a probability vector with the likelihood that a given article fell within each cluster: an article reviewing the applications of CRISPR to industrial biotechnology might receive similarly high probabilities of being an omics or a bioprocessing paper, while one about CRISPR antibacterials might be judged equally likely to be an omics or a therapeutics paper.
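
Here's a hedged sketch of the probability-vector idea, with scikit-learn's logistic regression standing in for a neural network; the training snippets and labels are invented.

```python
# Train a classifier on labeled abstracts, then read out a probability
# per cluster instead of a single hard label. Training data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "fed-batch fermentation yield optimization",
    "bioreactor scale-up for microbial metabolism",
    "genome sequencing of industrial strains",
    "transcriptome analysis with crispr screens",
    "antibody engineering for tumor therapy",
    "clinical dosing of engineered cell therapies",
]
labels = ["bioprocessing", "bioprocessing", "omics", "omics",
          "therapeutics", "therapeutics"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# A CRISPR-for-industrial-biotech paper may sit between two clusters.
probs = model.predict_proba(["crispr engineering of fermentation strains"])
print(dict(zip(model.classes_, probs[0].round(2))))
```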

Even more speculatively, it's possible to envision novel takes on other techniques that could produce an even more insightful classification. For instance, maybe sentiment analysis could be adapted to differentiate between basic and applied biology rather than between positive and negative emotions. Or a nearest-neighbor analysis of a machine-read article could tell an editor how similar a given article under review is to the other articles published in her journal, giving an opportunity to tweak the scope during revision. The same nearest-neighbor approach could also point an editor toward opportunities for new reviews, if there were many statistically similar research articles but no review that drew them all together.
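
The nearest-neighbor piece, at least, is easy to prototype: represent every paper as a vector (here, plain TF-IDF again) and ask which published articles sit closest to the manuscript under review. The corpus below is invented.

```python
# Find the published articles most similar to a manuscript under review.
# A dense neighborhood with no review in it could flag a commissioning gap.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

published = [
    "algae lipid production in photobioreactors",
    "metabolic flux analysis of yeast fermentation",
    "machine learning for protein structure prediction",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(published)

index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
manuscript = vectorizer.transform(["fermentation of engineered algae"])
distances, neighbors = index.kneighbors(manuscript)
print(neighbors[0], distances[0].round(2))
```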

Predicting impact

Quick: what's the single biggest factor that determines how well a review will be cited, how many people will want to read it, or how much its research community will care about it?

If you can't think of an answer, you're in good company; we as editors have absolutely no idea either. There are lots of things we like to think could be important: the quality of the writing, the completeness of the references section, the usefulness of the figures, the timeliness of the topic, the novelty of the statement. Something that all of those have in common is that they're basically impossible to quantify.

Even among more quantitative properties, answers are no easier to come by. While I don't decide who to invite to write for TIBTECH based solely on the author's h-index, it's reasonable to think that it might play a role in how much a paper is cited. What about the number of authors? The length of the paper? The number of papers cited, the age of the references, the total citations of the references, or the impact factors of the reference sources? Even some more exotic metrics like the Flesch-Kincaid score might have something to do with it. And of course, while the impact factor of the journal itself might be an important input, positive feedback suggests that it's almost certainly an important output too: TIBTECH publishes well-cited reviews, so the journal gains a strong reputation, which means more people read it, which means more people cite its articles.
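
(For the curious: the Flesch-Kincaid grade level is just a fixed formula over word, sentence, and syllable counts. The syllable counter in this sketch is a rough vowel-group heuristic, not the canonical one.)

```python
# Flesch-Kincaid grade level from word, sentence, and syllable counts.
# The vowel-group syllable counter is a rough approximation.
import re

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)

print(round(flesch_kincaid_grade("The cell divides. The cells divide."), 1))
```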

The unfortunate truth for us editors is that easy correlations between any of these numbers and how well a paper is received elude us. That's not to say the correlations don't exist; it just means that between evaluating proposals, editing manuscripts, managing reviews and revisions, attending conferences, and keeping up with the current literature, we don't necessarily have the resources it would take to structure all of these data and churn through multiple regressions until we stumble upon the correct linear combination of variables. But techniques that reduce all of these factors to a more manageable number, whether principal component analysis or something more specialized, could certainly help make more sense of this high-dimensional space.
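
A sketch of what that churning might look like, with principal component analysis compressing the metrics before a plain linear regression. Every number here is a randomly generated placeholder, not real citation data.

```python
# Compress many candidate article metrics into a few principal
# components, then regress citation counts on those components.
# All numbers are randomly generated placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Rows: articles. Columns: h-index, author count, page count,
# reference count, mean reference age, Flesch-Kincaid score, etc.
metrics = rng.normal(size=(200, 8))
citations = rng.poisson(lam=10, size=200)

model = make_pipeline(PCA(n_components=3), LinearRegression())
model.fit(metrics, citations)
print("R^2 on training data:", round(model.score(metrics, citations), 3))
```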

Machine writing?

If you're a Harry Potter fan, you may already be familiar with Harry Potter and the Portrait of What Looked Like a Large Pile of Ash, a masterpiece of predictively generated text trained on the entire Harry Potter corpus. The algorithm that wrote it is essentially the same one that suggests the next word to use when you're sending a text message: it relies on the context of nearby words to predict what you want to say next. Like any great satire, Large Pile of Ash is at first hilariously unrealistic and then makes you think. Could a machine actually write an entire novel? What about a review paper?
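
The trick underneath is a Markov-style next-word model, simple enough to sketch in a dozen lines. The training text here is a placeholder, not the actual Potter corpus.

```python
# A bare-bones next-word predictor: given the current word, sample the
# next one from the words that followed it in the training text.
import random
from collections import defaultdict

corpus = ("the wand chose the wizard and the wizard "
          "chose the castle").split()  # placeholder training text

follows = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current].append(nxt)

random.seed(1)
word, output = "the", ["the"]
for _ in range(8):
    options = follows[word] or corpus  # fall back at a dead end
    word = random.choice(options)
    output.append(word)
print(" ".join(output))
```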

I think we're far enough from that possibility that it remains tough to envision exactly how it might be accomplished. But machine-written text is already a reality for applications that rely extensively or entirely on factual statements, without any human insight or synthesis, like summaries of basketball games or reports of who won an election. In other words, a review written by artificial intelligence using current technology is exactly the kind of review that we wouldn't want to publish.
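
Those factual generators are often little more than templates filled from structured data, something like this (the game data is made up):

```python
# Template-driven "machine writing": structured facts in, prose out.
# The game data is made up.
game = {"winner": "Lakers", "loser": "Celtics",
        "winner_pts": 112, "loser_pts": 104, "top_scorer": "James"}

template = ("The {winner} beat the {loser} {winner_pts}-{loser_pts}, "
            "led by {top_scorer}.")
print(template.format(**game))
```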

And, as machine writing and reading evolve, I think that's increasingly going to represent the role of Trends authors and Trends editors: understanding the human element of what makes research exciting, relevant, and worth thinking about in a way that an algorithm cannot.