Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. To find the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by applying the classifier with optimized parameters to a separate test set in order to estimate the classifier’s generalization performance. With limited data, this separation of test data creates a difficult trade-off between statistical power for estimating generalization performance and the amount of data available for choosing parameters and fitting the model. We propose a novel approach, termed “Cross-validation and cross-testing”, that improves this trade-off by re-using test data without biasing estimates of classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.

Funding: RV thanks the Estonian Research Council for financial support through the personal research grants program (PUT438 grant). Website of the funder: http://www.etag.ee/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

The goal of supervised machine learning, in particular classification, is to find a model that accurately assigns data to separate predefined classes. To test the generality of a learned model, the model is typically applied to independent test data, and the accuracy of the prediction informs the researcher about the quality of the classifier [1]. Finding a classifier that performs optimally according to the researcher’s objective requires a set of assumptions and also involves a trade-off in model complexity: parameters that make the model too simple lead to under-fitting, i.e. the model cannot account for the complexity of the data, while parameters that make the model too complex lead to over-fitting, i.e. the model fits noise in the data. To test different assumptions and to optimize this so-called bias-variance trade-off [2], it is quite common to divide a data set into three parts: (1) a training set and (2) a validation set, which together are used iteratively for optimizing the parameters of the chosen classifier, and (3) a separate test set to estimate the generalization performance of the final classifier.
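As an illustration of this three-way split, the following minimal sketch uses scikit-learn and a synthetic dataset; the library, the data and the split ratios are assumptions made purely for illustration and are not part of the study described here.

```python
# Minimal sketch of a three-way split into training, validation and test sets.
# scikit-learn and the synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# (3) hold out a separate test set first
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# (1) training and (2) validation sets from the remaining data
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0)
```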

Machine learning methods in life sciences are used with different objectives: At one end of the spectrum, the goal is making predictions in real-world applications and building a maximally predictive model with interpretable weights and parameters that can be used in future applications. At the other end, the goal is to make an inference about the presence of information, where even the slightest discrimination performance indicates a statistical dependence between independent and dependent variables (e.g. classes and data). In the latter approach, the interpretability of weights and parameters and their reuse is not the focus of the research and performance is commonly evaluated using statistical analyses (e.g. [3]). This latter approach is quite common in the field of neuroimaging [4] and bioinformatics [5].

Data collection can be very expensive in biological and social sciences, and over time more data-efficient methods have emerged [6]. Cross-validation is a method that makes near-optimal use of the available data by repeatedly training and testing classifiers on different subsets of the data, typically with a large training set and a small validation set in each iteration [7]. For example, in 5-fold cross-validation 80 percent of the data are used for training and 20 percent for validation, and in the next iteration a different 20 percent of the data are chosen as the validation set, and so on. This process is repeated five times until all data have served as validation data once. Cross-validation is repeated with different parameter combinations, and once the best parameters have been found, the model is trained with the chosen parameters on all data that have previously been used for cross-validation and applied to the separate test set (see Fig 1). When the goal of a researcher is to build a model that generalizes well to unseen cases and that may be used in real-life applications such as image recognition or text classification, this approach is probably the most widely used method for carrying out classification analyses, with five- to ten-fold cross-validation providing a good trade-off between over-fitting and a sufficient training set size for the classifier [8].
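The sketch below illustrates this “Cross-validation and testing” pipeline. scikit-learn, the synthetic data, the linear SVM and its regularization parameter C are all assumptions chosen only for illustration; they are not the classifiers or data used in this study.

```python
# Minimal sketch of "Cross-validation and testing" (cf. Fig 1), assuming
# scikit-learn, synthetic data and a linear SVM purely for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Split once into a cross-validation set and a separate test set.
X_cv, X_test, y_cv, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 5-fold cross-validation on the cross-validation set to choose a parameter
# (here the SVM regularization strength C, chosen only as an example).
candidate_C = [0.01, 0.1, 1.0, 10.0]
cv_accuracy = [cross_val_score(SVC(kernel="linear", C=C), X_cv, y_cv, cv=5).mean()
               for C in candidate_C]
best_C = candidate_C[int(np.argmax(cv_accuracy))]

# Refit on the whole cross-validation set with the chosen parameter and
# estimate generalization performance on the untouched test set.
final_model = SVC(kernel="linear", C=best_C).fit(X_cv, y_cv)
print(best_C, final_model.score(X_test, y_test))
```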

Fig 1. In the “Cross-validation and testing” approach, the data are divided into two separate sets (cross-validation set and test set) only once. First, different models are trained and validated with cross-validation and the best set of parameters is chosen. Prediction accuracy and statistical significance of the parameters are evaluated on the test set, after training on the cross-validation set. https://doi.org/10.1371/journal.pone.0161788.g001

One difficulty of this approach is that the test set used to validate classification performance is limited in size. While cross-validation makes good use of the training data, the estimation of the classifier’s generalization performance may still suffer from the limited size of the final test set. Increasing the size of the test set, on the other hand, would leave less data for cross-validation and thus diminish classification performance. When data are scarce or expensive to acquire, this can become a serious problem and may lead to a sub-optimal choice of classifiers and the associated parameters.

One approach that has been used to overcome this difficulty is “Nested cross-validation” [9] (see Fig 2). Here, the test set is not kept completely separate; instead, cross-validation is extended to incorporate all available data (outer cross-validation). In that way, all data serve as test data once, overcoming the aforementioned trade-off. In order to still be able to optimize parameters, the training set of each outer cross-validation iteration is again divided for a nested (inner) cross-validation, and once the best parameters for this iteration have been found, they are used to train a model on the current training data, which is then applied to the current test set. This approach is most useful for a researcher who does not need to build a model that generalizes well to unseen data, but who would like to establish whether there is a meaningful statistical dependence between the class labels and the given dataset, in other words whether the dataset contains information about the labels.
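The following sketch illustrates nested cross-validation. As before, scikit-learn, the synthetic data, the linear SVM and the parameter grid are illustrative assumptions rather than the methods used in this study.

```python
# Minimal sketch of nested cross-validation (cf. Fig 2), assuming scikit-learn
# and synthetic data purely for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
candidate_C = [0.01, 0.1, 1.0, 10.0]

outer_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_outer_test, y_outer_test = X[test_idx], y[test_idx]

    # Inner cross-validation on the current training data to pick a parameter;
    # the chosen value may differ from one outer fold to the next.
    inner_acc = [cross_val_score(SVC(kernel="linear", C=C), X_train, y_train,
                                 cv=5).mean() for C in candidate_C]
    best_C = candidate_C[int(np.argmax(inner_acc))]

    # Train with the fold-specific parameter and evaluate on the outer test fold.
    model = SVC(kernel="linear", C=best_C).fit(X_train, y_train)
    outer_scores.append(model.score(X_outer_test, y_outer_test))

print(np.mean(outer_scores))  # one accuracy, but no single parameter set to report
```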

Fig 2. In the “Nested cross-validation” approach, first (outer) cross-validation is performed to estimate predictability of the data. In each iteration, data are divided into training and test sets. Before training, another (inner) cross-validation loop is used to optimize parameters. As model weights (fitted models) and parameters are different at every partition, it is not possible to report accuracy or statistical significance for a particular set of parameters or model weights. https://doi.org/10.1371/journal.pone.0161788.g002

While a nested cross-validation procedure makes more efficient use of the data, it has some drawbacks: due to the absence of a completely separate test set, it is not possible to claim that a particular model, i.e. a particular set of weights and classifier parameters, could in the future be used to classify unseen data [2]. In addition, the chosen parameters and models may vary between cross-validation iterations, making it impossible to select one set of parameters or one model as the final choice. In other words, a separate model and a separate set of parameters are chosen in each iteration, and choosing any one of them would mean returning to a simple cross-validation and testing approach, which would annihilate the advantage gained by nested cross-validation. There are, however, cases in which the interpretation of parameters is desirable, even when the model itself is not re-used. For example, for certain applications it might be useful to report that the best parameter corresponds to linear models as opposed to quadratic ones, without the need to describe the specific model weights used by the linear models. As another example, when using neural networks as the class of machine learning algorithms, the number of layers selected during optimization, say 3, 4 or 5, may be an important choice to communicate to other researchers and may lend some interpretability to the best combination of parameters and data.

Here we describe a novel approach that improves the trade-off between training and test set size, yielding better generalization performance than “Cross-validation and testing” while, in contrast to “Nested cross-validation”, maintaining the interpretability of the chosen parameters. In the widely used “Cross-validation and testing” approach, a larger test set results in less data for choosing the best set of parameters and also less data for fitting the model. In contrast, with our newly proposed “Cross-validation and cross-testing” approach (Fig 3), a larger test set still means that less data are available for parameter selection, but it does not reduce the amount of data available for model fitting. This comes at the cost of losing the ability to generate a single predictive model for future general application. However, for many researchers it is sufficient to (1) demonstrate that information is indeed present in the dataset [4] or (2) interpret the parameters of the classifier. The latter can also guide a number of important choices for the future design of classifiers. The novel approach suggested in this work improves the trade-off by using data more efficiently for fitting the model while still making it possible to choose, interpret and report a set of parameters that may be used in the future. In brief, it is a mixture of the “Cross-validation and testing” and “Nested cross-validation” approaches, which allows re-using test data to increase the size of the training set, thereby improving classification performance. We term this method “Cross-validation and cross-testing”.

Fig 3. For cross-validation and cross-testing, data are divided into two separate sets only once: a cross-validation set and a test set. Similar to typical cross-validation, a number of iterations are carried out on the cross-validation set to choose the best parameters for the final model. Once the best combination of parameters has been chosen, the prediction accuracy and statistical significance can be evaluated on the test set with a modified cross-validation such that for each fold the original cross-validation set is repeatedly added to the training data. Due to the similarity to cross-validation, we term this approach cross-testing. While this makes it impossible to pick one final model for application to additional unseen data, the parameters that have been chosen remain interpretable. https://doi.org/10.1371/journal.pone.0161788.g003
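To make the cross-testing step concrete, the sketch below follows one plausible reading of the procedure in Fig 3. scikit-learn, the synthetic data, the linear SVM and the parameter grid are illustrative assumptions and not the authors’ implementation.

```python
# Minimal sketch of "Cross-validation and cross-testing" (cf. Fig 3), assuming
# scikit-learn, synthetic data and a linear SVM purely for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Split once into a cross-validation set and a test set.
X_cv, X_test, y_cv, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 1: ordinary cross-validation on the cross-validation set to choose a
# single, reportable parameter value (here the SVM regularization strength C).
candidate_C = [0.01, 0.1, 1.0, 10.0]
cv_acc = [cross_val_score(SVC(kernel="linear", C=C), X_cv, y_cv, cv=5).mean()
          for C in candidate_C]
best_C = candidate_C[int(np.argmax(cv_acc))]

# Step 2: cross-testing. The test set is itself split into folds; in every
# fold the whole cross-validation set is added to the training data, so the
# model is always fitted on more data than in "Cross-validation and testing".
fold_scores = []
for train_idx, eval_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X_test):
    X_fit = np.vstack([X_cv, X_test[train_idx]])
    y_fit = np.concatenate([y_cv, y_test[train_idx]])
    model = SVC(kernel="linear", C=best_C).fit(X_fit, y_fit)
    fold_scores.append(model.score(X_test[eval_idx], y_test[eval_idx]))

print(best_C, np.mean(fold_scores))  # interpretable parameter + accuracy estimate
```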

The current article is structured as follows. In the methods section we will describe the data sets and parameters used to test the applicability of the proposed approach on simulated and real data. In the first part of the results section we will then outline the novel approach of “Cross-validation and cross-testing” in detail. In the second part of the results section, we will compare all approaches using the numerical experiments described in the methods section and show that the novel approach achieves higher classification performance than the common “Cross-validation and testing” approach. By applying the novel pipeline to simulated data for which the ground truth is known, we also demonstrate that it is not biased.