We applied two independent machine learning approaches to screening a large number of identified citations (70,365 records) for a systematic review. We first selected 2000 records at random to provide the first training set. This number was chosen arbitrarily as we could not predict how many training instances would be required. Owing to data deposition errors, only 1993 of these were useable. These were then screened by two human reviewers with previous experience with reviews of animal studies, with a third expert reviewer reconciling any differences. The resulting ML algorithms gave a score between 0 and 1. To ensure that the true sensitivity was likely to be 95% or higher, we chose as our cut-point the value for which the lower bound of the 95% confidence interval of the observed sensitivity exceeded 95% when applied to the unseen validation dataset. We then repeated this process adding a further 1000 randomly selected (996 useable) citations to the training set; and then again adding a further 3000 randomly selected (2760 useable) citations to the training set. At each stage, we assessed performance of the approaches on a validation set of unseen documents, using a number of different metrics. Next, the best performing algorithm was used to identify human errors in the training and validation sets by selecting those records with the largest discrepancy between the human decision (characterised as 0 for exclude or 1 for include) and the machine prediction (a continuous variable between 0 and 1). Performance of the approaches trained on the full 5749 records is reported here, and each of the iterations is available in Additional file 1. The error analysis was assessed using the net reclassification index, and the performance of the ML approach is compared before and after correcting the errors in human screening using AUC (Fig. 1).

Step 1: Application of ML tools to screening of a large preclinical systematic review

Training sets

We identified 70,365 potentially relevant records from PubMed and EMBASE. The search strings were composed of the animal filters devised by the Systematic Review Center for Laboratory animal Experimentation (SYRCLE) [21, 22], NOT reviews, comments, or letters AND a depression disorder string (for full search strings see [23]). The training and validation sets were chosen at random from the 70,365 records by assigning each record a random number using the RAND function in Excel and ranking them from smallest to largest. The final training set consisted of 5749 records. The final validation set consisted of the next 1251 records. The training and validation sets were screened by two independent human screeners, with any discrepancies reconciled by a third independent human screener. The human screening process used an online systematic review tool called SyRF (app.syrf.org), which randomly presents a reviewer with a record, with the title and abstract displayed. The reviewer makes a decision about the record, to include (1) or to exclude (0). A second reviewer is also presented with records but in a different random sequence. If a given record receives two ‘include’ decisions or two ‘exclude’ decisions, the screening for this record is considered complete. If reviewer 1 and reviewer 2 disagree, the record is listed for review by a third reviewer. The record then has an average inclusion score of 0.666 or 0.333. Any record with an inclusion score above 0.6 is included, those scoring less than 0.6 are excluded, and screening is considered complete. Reviewers are not aware of whether they are the first, second or third reviewer or of the decisions of the other reviewers. Datasets are available on Zenodo, as described in the “Availability of data and materials” section.
The validation set had more than 150 ‘included’ records, which should give a reasonably precise estimate of the sensitivity and specificity which would be achieved in screening other citations from the population from which the validation set was drawn.
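The allocation and reconciliation rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SyRF implementation: the function names, the `seed` argument and the strict `score > threshold` comparison are assumptions made for the example.

```python
import random


def allocate(record_ids, n_train, n_valid, seed=None):
    """Give every record a random number, rank from smallest to largest,
    then take the first n_train as training and the next n_valid as validation."""
    rng = random.Random(seed)  # seed is an assumption, used only for reproducibility
    keyed = sorted((rng.random(), rid) for rid in record_ids)
    ordered = [rid for _, rid in keyed]
    return ordered[:n_train], ordered[n_train:n_train + n_valid]


def screen(first, second, third=None, threshold=0.6):
    """Combine reviewer decisions (1 = include, 0 = exclude).

    Two concordant decisions complete the screening; a disagreement
    requires a third decision, giving a mean score of 0.666 or 0.333.
    Returns (included, inclusion_score)."""
    decisions = [first, second]
    if first != second:
        if third is None:
            raise ValueError("reviewers disagree: a third decision is needed")
        decisions.append(third)
    score = sum(decisions) / len(decisions)
    return score > threshold, score
```

For example, `screen(1, 0, third=1)` yields a score of about 0.666 and an include decision, whereas `screen(1, 0, third=0)` yields about 0.333 and an exclude decision.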

Fig. 1 Diagram of the layout of the study

Three feature sets (BoW, LDA and SVD (LSI)) were tested on SVMs, logistic regression and random forests [24]. The two algorithms described below performed best for this dataset of 70,365 records, on the broad topic of preclinical animal models of depression.

Approaches

Here, two approaches were developed independently, using different classification models and feature representations, but sharing the linear classification principles.

Approach 1

Approach 1 used a tri-gram ‘bag-of-words’ model for feature selection and implemented a linear support vector machine (SVM) with stochastic gradient descent (SGD) as supported by the scikit-learn Python library [25]. To account for the relative importance of words within a given document, and for differences in the words used between documents, we used ‘Term Frequency – Inverse Document Frequency’ (TF-IDF). This is defined as

$$ tfidf\left({w}_i,{d}_j\right)= tf\left({w}_i,{d}_j\right)\ast \frac{\left|D\right|}{\left|\left\{d:{w}_i\in d\right\}\right|} $$

The score for the ith word in the context of the jth document takes into account not only how many times the word occurred there (tf), but also how many other documents (d) from the whole corpus (D) contain it. This reduces the score for words that are common to all documents and therefore have little predictive power, so that the classifier focuses on terms which help to distinguish between documents rather than on terms which simply occur frequently [26]. We allowed n-grams; did not use stemming; and used the MySQL text indexing functionality ‘stopword’ list to remove frequently occurring words which provide little relevant information for classification purposes [27].
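To make the weighting concrete, the following sketch computes the score exactly as written in the equation above (term frequency multiplied by the ratio of corpus size to document frequency) over uni-, bi- and tri-grams. It is a minimal illustration and not the deployed code: the whitespace tokeniser, the three-document corpus and the tiny `STOPWORDS` set are assumptions, and library implementations typically apply a logarithm to the ratio, which this sketch deliberately omits to match the formula as printed.

```python
from collections import Counter

# Tiny illustrative stopword list; the study used the MySQL stopword list.
STOPWORDS = {"the", "of", "a", "in", "and"}


def ngrams(text, max_n=3):
    """Lower-case, drop stopwords, and return all n-grams up to max_n words."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(corpus, max_n=3):
    """Return one {term: weight} dict per document, with
    weight = tf(term, doc) * |D| / |{d : term in d}|."""
    counts = [Counter(ngrams(doc, max_n)) for doc in corpus]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per term
    n_docs = len(corpus)
    return [
        {term: tf * n_docs / doc_freq[term] for term, tf in c.items()}
        for c in counts
    ]


corpus = [
    "randomised controlled trial in mice",
    "randomised controlled trials reviewed",
    "mice model of depression",
]
weights = tfidf(corpus)
```

In this toy corpus, the tri-gram ‘randomised controlled trial’ occurs in one document only and so receives a higher weight than the uni-gram ‘randomised’, which occurs in two; because no stemming is applied, the plural form is kept as a separate feature.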

The support vector machine classifier with stochastic gradient descent (SGD) was chosen as it is efficient, scales well to large numbers of records, and provides an easily interpretable list of probability estimates when predicting class membership (i.e. scores for each document lying between 0 and 1). Efficiency and interpretability are important, as this classifier is already deployed in a large systematic review platform [28]: any deployed algorithm must not be too computationally demanding, and its results must be understandable to users who are not machine learning specialists. The tri-gram feature selection approach without any additional feature engineering also reflects the generalist need of deployment on a platform used in a wide range of reviews: the algorithm needs to be generalisable across disciplines and literatures, and not ‘over-fitted’ to a specific area. For example, the tri-gram ‘randomised controlled trial’ has quite different implications for classification compared with ‘randomised controlled trials’ (i.e. ‘trials’ in plural). The former might be a report of a randomised controlled trial, while the latter is often found in reports of systematic reviews of randomised trials. Stemming would remove the ‘s’ on trials and thus lose this important information. This approach aims to give the best compromise between reliable performance across a wide range of domains and the performance achievable from a workflow that has been highly tuned to a specific context.

Approach 2

Approach 2 used a regularised logistic regression model built on latent Dirichlet allocation (LDA) and singular value decomposition (SVD) features. Specifically, the document text (consisting of title and abstract) was first lemmatised with the GENIA tagger [29] and then converted into a bag-of-words representation of unigrams, which was used to create two types of features. First, the word frequencies were converted into a matrix of TF-IDF scores, which was then decomposed via a general matrix factorisation technique (SVD), as implemented in the scikit-learn library, and truncated to the first 300 dimensions. Second, an LDA model was built using the MALLET library [30], with the number of topics set to 300. As a result, each document was represented by 600 features, and an L1-regularised logistic regression model was built using the glmnet package [31] in the R statistical framework [32].

In this procedure, every document is represented with a constant, manageable number of features, irrespective of corpus or vocabulary size. As a result, we can use a relatively simple classification algorithm and expect good performance with short processing time even for very large collections. This feature is particularly useful when running the procedure numerous times in cross-validation mode for error analysis (see below).

For further details of feature generation methods and classifiers see Additional file 1. For a given unseen test instance, the logistic regression returns a score corresponding to the probability of it being relevant according to the current model. An optimal cut-off score that gives the best performance is calculated as described above.

Assessing machine learning performance

The facet of machine learning algorithm performance that would be most beneficial to this field of research is high sensitivity (see Table 1), at a level comparable to the 95% we estimate is achieved by two independent human screeners. To be confident that the sensitivity which would be achieved in the screening of other publications from the population from which the validation set was drawn would be 95% or higher, we selected the threshold for inclusion such that the lower bound of the 95% confidence interval of the observed sensitivity in the validation set exceeded 95%. A practical implication is that the larger the validation set, the more precisely sensitivity will be estimated. Once this level of sensitivity has been reached, the next priority is to maximise specificity, to reduce the number of irrelevant records included by an algorithm. Although specificity at 95% sensitivity is our goal, we also provide additional measures of performance.
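The threshold rule can be illustrated with a short sketch. It assumes that the confidence interval is the Wilson score interval without continuity correction, that a record is included when its score is at or above the cut-off, and that the highest qualifying cut-off is preferred because it retains the most specificity; the function names are hypothetical.

```python
import math


def wilson_lower(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def choose_cutoff(scores, labels, target=0.95):
    """Return the highest cut-off for which the lower confidence bound of
    sensitivity exceeds `target`, or None if no cut-off qualifies.

    scores: machine scores in [0, 1]; labels: human decisions (1 or 0)."""
    n_pos = sum(labels)
    # Lowering the cut-off can only add true positives, so the first
    # qualifying value in descending order is the highest one.
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        if wilson_lower(tp, n_pos) > target:
            return cut
    return None
```

Under these assumptions the dependence on validation set size is explicit: even with perfect observed sensitivity, the lower bound only rises above 95% once the validation set contains at least 73 included records.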

Table 1 Equations used to assess performance of machine learning algorithms

Performance metrics

Performance was assessed using sensitivity (or recall), specificity, precision, accuracy, work saved over sampling (WSS), and the positive likelihood ratio (LR+) (see Table 1), carried out in R (R version 3.4.2; [32]) using the ‘caret’ package [24]. 95% confidence intervals were calculated using the efficient-score method [33]. Cut-offs were determined manually for each approach by taking the score that gave confidence that true sensitivity was at least 95% (as described above), and the specificity at this score was calculated.
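For readers without access to Table 1, the sketch below computes these metrics from the four cells of a confusion table using their conventional definitions. These definitions are assumed to match Table 1, which is not reproduced here; in particular, WSS is taken as (TN + FN)/N − (1 − sensitivity). The analysis itself was carried out in R, so this Python function is illustrative only and assumes that both classes and both predictions are non-empty.

```python
def screening_metrics(tp, fp, tn, fn):
    """Conventional screening metrics from confusion-table counts.

    tp/fn: relevant records included/excluded by the classifier;
    tn/fp: irrelevant records excluded/included by the classifier."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / n,
        # Work saved over sampling at the observed level of recall.
        "wss": (tn + fn) / n - (1 - sensitivity),
        # Positive likelihood ratio; infinite when specificity is perfect.
        "lr_plus": sensitivity / (1 - specificity) if specificity < 1 else float("inf"),
    }


example = screening_metrics(tp=95, fp=100, tn=800, fn=5)
```

With these hypothetical counts, sensitivity is 0.95 and the classifier excludes 805 of 1000 records, so the work saved over sampling is 0.805 − 0.05 = 0.755.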

Step 2: Application of ML tools to training datasets to identify human error

Error analysis methods

The approach to error analysis was outlined in an a priori protocol, published on the CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies) website on 18 December 2016 [34]. We used non-exhaustive fivefold cross-validation to generate the machine learning scores for the set of records that were originally used to train the machine (5749 records). This involves randomly partitioning records into five equal-sized subsamples. Over five iterations, one subsample is set aside, and the remaining four subsamples are used to train the algorithm [35]. Thus, every record serves as an ‘unknown’ in one of these iterations, and has a score computed by a machine learning model for which it was not included in the training portion. These scores were used to highlight discrepancies or disagreements between the machine decision and the human decision. The documents were ranked by the machine-assigned prediction of relevance from most likely to least likely. The original human-assigned scores (either 0 or 1) were compared with this ranking, to highlight potential errors in the human decision. A single human reviewer (experienced in animal systematic reviews) manually reassessed the records starting with the most discrepant. To avoid reassessing the full 5749 record dataset, a pragmatic stopping rule was established such that if the initial human decision was correct for five consecutive records, further records were not reassessed (Fig. 2).
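The procedure can be sketched as follows. This is a schematic illustration rather than the code used in the study: `train` and `recheck` are hypothetical callables standing in for the classifier and for the human reassessment, the fold assignment and `seed` are assumptions, and discrepancy is taken as the absolute difference between the human decision and the machine score.

```python
import random


def out_of_fold_scores(records, labels, train, k=5, seed=0):
    """Score every record with a model that did not see it during training.

    train(records, labels) is a hypothetical callable returning a function
    that maps one record to a score in [0, 1]."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal random subsamples
    scores = [None] * len(records)
    for held_out in folds:
        held = set(held_out)
        rest = [i for i in range(len(records)) if i not in held]
        model = train([records[i] for i in rest], [labels[i] for i in rest])
        for i in held_out:
            scores[i] = model(records[i])
    return scores


def reassess_by_discrepancy(labels, scores, recheck, patience=5):
    """Re-examine records from most to least discrepant, stopping once the
    original decision was correct for `patience` consecutive records.

    recheck(i) is a hypothetical callable returning the reassessed
    decision (0 or 1) for record i."""
    order = sorted(range(len(labels)),
                   key=lambda i: abs(labels[i] - scores[i]), reverse=True)
    corrected = list(labels)
    run = 0
    for i in order:
        decision = recheck(i)
        if decision == labels[i]:
            run += 1
            if run == patience:
                break
        else:
            corrected[i] = decision
            run = 0  # an error resets the count of consecutive correct decisions
    return corrected
```

Ranking by absolute difference places human ‘exclude’ decisions with high machine scores and human ‘include’ decisions with low machine scores at the top of the list, which is where screening errors are most likely to be found.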

Fig. 2 Error analysis. The methodology for using cross-validation to assign ML-predicted probability scores. The ML-predicted probability scores for the records were checked against the original human inclusion decision

After the errors in the training set were investigated and corrected as described above, a second model was built on the updated training data. The outcome of the error analysis is presented as reclassification tables, with the area under the curve (AUC) and the net reclassification index (NRI) [36] used to compare the performance of the classifier built on the updated training data with that of the classifier built on the original, uncorrected training data. The following equation was used for the NRI [37]:

$$ {\mathrm{NRI}}_{\mathrm{binary}\ \mathrm{outcomes}}={\left(\mathrm{Sensitivity}+\mathrm{Specificity}\right)}_{\mathrm{second}\ \mathrm{test}}-{\left(\mathrm{Sensitivity}+\mathrm{Specificity}\right)}_{\mathrm{first}\ \mathrm{test}} $$
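As a worked illustration of this equation, the sketch below computes the NRI from two confusion tables. The counts are hypothetical and are not results from this study; the dictionary layout and function names are assumptions made for the example.

```python
def sens_plus_spec(table):
    """Sensitivity + specificity from a dict with keys tp, fp, tn, fn."""
    sensitivity = table["tp"] / (table["tp"] + table["fn"])
    specificity = table["tn"] / (table["tn"] + table["fp"])
    return sensitivity + specificity


def nri_binary(first, second):
    """NRI for binary outcomes: (sens + spec) of the second test
    minus (sens + spec) of the first test."""
    return sens_plus_spec(second) - sens_plus_spec(first)


# Hypothetical counts: classifier trained on original vs corrected labels.
original = {"tp": 90, "fn": 10, "tn": 800, "fp": 100}
corrected = {"tp": 95, "fn": 5, "tn": 850, "fp": 50}
nri = nri_binary(original, corrected)
```

Here sensitivity rises from 0.90 to 0.95 and specificity from 800/900 to 850/900, giving an NRI of approximately 0.106; a positive value indicates that the second classifier performs better overall.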

The AUC was calculated using the DeLong method in the ‘pROC’ package in R [38].

Further, we applied the same technique as above to identify human screening errors in the validation dataset. Because of the small number of records in the validation set (1251 records), it was assumed that every error would be likely to impact measured performance, and so the manual screening of the validation set involved revisiting every record where the human and machine decisions were incongruent. The number of reclassified records was noted. The inter-rater reliability of all screening decisions on the training set and validation set between reviewer 1 and reviewer 2 was analysed using the ‘Kappa.test’ function in the ‘fmsb’ package in R [39].
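For orientation, the agreement statistic can be sketched as below, assuming that the quantity reported is Cohen's kappa for two raters making binary decisions. The analysis itself used R; this Python function and the example decisions are illustrative only.

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters making binary (0/1) decisions
    on the same records, given as two equal-length sequences."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p1 = sum(rater1) / n  # proportion of 'include' decisions by rater 1
    p2 = sum(rater2) / n  # proportion of 'include' decisions by rater 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for ten records.
reviewer1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
reviewer2 = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
kappa = cohen_kappa(reviewer1, reviewer2)
```

In this example the reviewers agree on 8 of 10 records, but because most records are excluded by both, agreement expected by chance is 0.58 and kappa is approximately 0.52.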