Abstract The current research study is concerned with the automated differentiation of histopathological slides of colon tissue with respect to four classes (healthy tissue and cancerous tissue of grades 1, 2 or 3) through an optimized ensemble of predictors. Six distinct classifiers with prediction accuracies ranging from 87% to 95% are considered for the task. The proposed method of combining them takes into account the probabilities that the individual classifiers assign each sample to each of the four classes, optimizes weights for each technique by differential evolution and attains an accuracy that is significantly better than the individual results. Moreover, a degree of confidence is defined that allows the pathologists to separate the data into two distinct sets, one that is correctly classified with a high level of confidence and the rest, which needs their further attention. The tandem is also validated on other benchmark data sets. The proposed methodology proves to be efficient in improving the classification accuracy of each algorithm taken separately and performs reasonably well on other data sets, even with default weights. In addition, by establishing a degree of confidence the method becomes more viable for use by actual practitioners.

Citation: Lichtblau D, Stoean C (2019) Cancer diagnosis through a tandem of classifiers for digitized histopathological slides. PLoS ONE 14(1): e0209274. https://doi.org/10.1371/journal.pone.0209274
Editor: Marco Magalhaes, University of Toronto, CANADA
Received: July 19, 2018; Accepted: December 3, 2018; Published: January 16, 2019
Copyright: © 2019 Lichtblau, Stoean. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All histopathological files are available from the IMEDIATREAT database at https://doi.org/10.6084/m9.figshare.4508672.v1.
Funding: Wolfram Research, Inc. provided support in the form of salary to DL but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. CS acknowledges the support of the research grant no. 26/2014, code PN-II-PT-PCCA-2013-4-1153, entitled "Intelligent Medical Information System for the Diagnosis and Monitoring of the Treatment of Patients with Colorectal Neoplasm", financed by the Romanian Ministry of National Education (MEN) – Research, and the Executive Agency for Higher Education Research Development and Innovation Funding (UEFISCDI). CS received support in the form of salaries and research materials from the UEFISCDI. The grant was led by CS and made possible in collaboration with physicians. The specific roles of these authors are articulated in the 'author contributions' section.
Competing interests: DL is employed by Wolfram Research, Inc. and received research materials in that his employer permitted the use of his desktop computer in this work. There are no patents, products in development or marketed products associated with this research to declare. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Introduction The best chance of curing cancer currently lies in its detection at an early stage [1], [2], [3]. It is therefore advised that individuals with an increased risk of developing cancer based on their history take screening tests from an early age and repeat such tests at certain intervals. In some countries, there are recommendations that all adults take such screening tests after a certain age, depending on the cancer type [4], [5]. Also, there have been important investments worldwide in acquiring advanced microscopy hardware for hospitals. This leads to an increasing amount of histological slides that have to be analyzed. Computational approaches can support the medical professionals through autonomous learning and direct diagnosis, especially by providing a second opinion [6], [7] or even determining evidently benign cases in order to allow the human experts to concentrate on the more problematic slides [8], [9], [10]. Even outside the realm of artificial intelligence driven diagnoses, cancer identification based on digital slides is a highly variable and controversial topic, since diagnostic criteria vary across pathologists and particular case types are very challenging, eliciting high variability both within and across pathologists [11]. For purposes of this work, the labels established by the human experts are considered ground truth. The primary focus of this work is on automated grading of a collection of images of histopathological slides of colon tissue. These range from healthy through grade three (most serious). The pursued tasks are as follows. Combine the information provided by six independent classifiers previously considered for the problem [12] into an augmented approach through the practical and efficient optimization technique of differential evolution.

Validate the ensemble method on other histological image data sets.

Identify the samples that are often misclassified by the proposed hybridized algorithm.

Define a measure that allows the human expert (e.g. the pathologist) to separate the collection of data to be classified into a set of samples that can be correctly labeled with a high degree of confidence, and a complementary set of more difficult cases that require further attention from the physician.

1 Materials The image data set comes from the University Hospital of Craiova, Romania, and contains 357 images at 800x600 pixels, with 62 healthy (G0) records, 96 of the first grade (G1), 99 of the second grade (G2) and 100 of the third grade (G3). The grades for the samples were established by two pathologists who reached a consensus diagnosis. This diminishes, but does not remove, the possibility of classification errors in cases where the pathologists must distinguish difficult diagnostic categories. Examples of samples from each class can be observed in Fig 1. Based on the name of the project that put forward the data, it will be further referred to in the article as the IMEDIATREAT data set. It was initially introduced in [13] and is available for download [14].


Fig 1. Histological image samples: From left to right, normal tissue followed by cancer of grades G1, G2 and G3. https://doi.org/10.1371/journal.pone.0209274.g001 To show the generalization ability of the proposed approach, two other case studies are considered for the designed combination of classifiers. One such data set, which also refers to colon cancer, was put forward through a challenge contest called GlaS that was held at the MICCAI 2015 conference [15], [16]. The contest goal concerned the accuracy of gland segmentation, not of the actual automated diagnosis. In this respect, along with the raw histopathological images, associated images with manually segmented glands are provided. In addition, one of two diagnosis labels (benign or malignant) is associated with each slide. These can be used when automated diagnosis is desired. The data collection (further on called the GlaS data set, from the contest name) comprises 165 images with a 20x magnification level. Recently, a large data set of breast cancer histopathology images acquired from 82 patients was introduced in [9] under the name BreaKHis. There are several magnification levels, i.e. 40x, 100x, 200x and 400x. For each level in turn, the samples are separated into two classes of approximately 600 benign and 1300 malignant. In total, there are 2480 images of healthy tissue and 5429 of malignant tissue. The methods of extracting the features applied on BreaKHis are briefly described in the next section and their results will be discussed in the experiments part. To the best of our knowledge, there is no information regarding the number of pathologists that decided the grade for each slide for these two data sets. Both benchmark data sets are available for download; see their respective references for details.

2 State of the art in histological image classification Although the traditional sequence of preprocessing, segmentation, feature extraction, feature selection and classification is still preferred by many studies in image analysis, fully automated classification of cancer histological images has emerged as an alternative human-independent methodology. This means that no further intervention from the pathologists is required for pre-annotation of the regions of interest, and it therefore exempts the human experts from the additional effort of assisting the machine. Recently, more uncommon means of diagnosis have been proposed, such as studying the eye movements of the pathologists [17], but this still requires the pathologist to perform the classification task. For a very recent and broad literature review of clinical information extraction, including that from histological slides, see [18]. Based only on training pre-diagnosed samples, direct image-level classification for confirming or rejecting the presence of cancer, with additional feature design [19], has been extensively explored. The authors of [20] use a bag of features approach to obtain image representations of basal-cell skin carcinoma slides, which are next classified by support vector machines as positive or negative. The study of [21] targets prostate cancer diagnosis through a Bayesian multiresolution system to recognize cancerous regions within digital histopathology slides, with features being selected through an AdaBoost component. Some very recent attempts are highlighted in the editorial paper of [22]. A competition in mitosis detection from breast cancer histopathology images has compared eleven algorithms submitted for the task [23]. Among them, only two do not use the traditional 4-step sequence, i.e.
a multi-column max-pooling convolutional neural network [24] and a cascade learning framework with a boosting method [25], both targeting supervised pixel classification (giving the probability of belonging to the mitosis class). Another convolutional neural network for the classification of breast lesions from histopathological images is used in [26] to perform feature extraction in terms of pattern and distribution, after nuclei had been previously located. Again, texture features, extracted by co-occurrence statistics and local binary patterns, are used by a k-nearest neighbor classifier to differentiate between stroma-rich and stroma-poor slides in neuroblastoma patients [27]. First and second order image statistical parameters are given to several classification approaches to distinguish between grades of anal intraepithelial neoplasia from histological slices in [28]. More specifically, concerning the particular case of colorectal cancer slide interpretation, the study of [29] is concerned with images of the 4 cancer classes with 20000 annotated nuclei to be detected. This is done by deep convolutional neural networks with subsequent, separate classification by a neighboring ensemble predictor. On the other hand, [30] computes texture feature vectors (on images obtained from 30 colorectal surgically removed tissues) through local histograms and co-occurrence matrices and uses them for cancer labeling (normal vs. adenocarcinoma) by the quasi-supervised statistical learning method. Other feature identification approaches that are also applied to the BreaKHis data set (which will be further discussed within section 4) are next briefly presented. In [31], the local binary patterns (LBP) operator considers each pixel intensity to compute the distribution of binary patterns by making use of a radius and a number of neighbors. The occurrence histogram of the resulting LBP codes proves to be a good texture descriptor.
A recent variant of the method is presented in [32] as the completed local binary patterns (CLBP) and considers the central pixel, sign and magnitude: these bring significant improvement for rotation invariant texture classification. A descriptor, also for texture classification, that is based on quantized phase information of the discrete Fourier transform computed locally in a window at every image position is proposed in [33] and is called local phase quantization (LPQ): the histogram of the resulting image is used as the vector of features. The gray-level co-occurrence matrix (GLCM) represents a classical method [34] that is still widely used at present and assumes the calculation of the Haralick parameters. A morphological measure for cell phenotype image classification is introduced through the threshold adjacency statistics (TAS) in [35]. Thresholding is applied to the image and, on the subsequent binary image, for each white pixel, the number of adjacent pixels that are also white is counted to derive statistics that proved to be important in the classification process. A parameter-free version of TAS (PFTAS) is introduced in [36]. The class structure-based deep convolutional neural network (CSDCNN) [37] is a recent successful non-linear representation learning model that also abandons feature extraction steps. Instead, it automatically learns semantic and discriminative hierarchical features from low level to high level. Another very recent deep classifier is represented by the supervised intra-embedding method with a multilayer neural network model, followed by a CNN [38]. Except for one classifier of the proposed ensemble, all use the numerical features extracted by a CNN for classification. For none of them are separate landmark or feature considerations provided.
As the results will reveal in the following sections, the fact that one classifier has a different manner of dealing with the slides proves to be important, since it exerts a large influence over the ensemble decision. Once several well-performing classification models are constructed, the resulting probability estimations for the four cancer grades on the validation samples can be joined in an optimized way that exploits complementarities to achieve a boosted level of accuracy. A first naive attempt to reach a better performing ensemble was recently designed in [12], by simply doubling the weight of a new Fourier trig transform with principal component analysis classifier, which seemed to behave differently from the other machine learners within the classification process.

3 Methods A combination of 6 machine learning techniques (5 state-of-the-art methods and a relatively new approach) is employed for the histopathological image classification task. The images are transformed using the AlexNet CNN [39] into numerical vectors, using the pre-trained weights and without fine-tuning. Each vector consists of 1024 numerical features extracted from the CNN. The outputs of the involved approaches, in the form of test probability estimations for each outcome, are next combined to provide a further enhancement in accuracy. An optimization process by differential evolution takes into account the degree to which a classifier is wrong and penalizes the mistakes proportionally. The 5 classifiers were selected from types of machine learners exhibiting different properties, i.e. random forests (RF) [40], nearest neighbors (kNN) [41], logistic regression (LR) [42], naive Bayes (NB) [43] and support vector machines (SVM) [44]. A novel Fourier trig transform with principal component analysis (FTT+PCA) approach [45] was additionally included in the pool of methods, due to its competitive results. Each classifier is independently trained and applied to the test set. In a previous attempt [12], the weight of the last technique was doubled within the voting process for a common prediction of the chosen classifiers, which led to improved accuracy. The aim of the current work is to further and automatically enhance the classification output, as well as to provide some degree of confidence with respect to the classified samples. Information regarding how large the errors are, meaning how far the predicted classes lie from the ground truth, will also be found. Moreover, the samples that are most frequently mistaken are identified (which gives some insight into what might cause erroneous classifications).
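The per-classifier training stage described above can be sketched with scikit-learn stand-ins for the five standard classifiers, assuming pre-extracted 1024-dimensional CNN feature vectors (synthetic random data here); the FTT+PCA method and the actual AlexNet extraction are omitted, so this is an illustrative sketch rather than the paper's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 1024-dimensional CNN feature vectors
# (the paper uses 357 real slides; random data here, purely illustrative).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1024))
y_train = rng.integers(0, 4, size=200)   # grades G0..G3
X_val = rng.normal(size=(50, 1024))

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(probability=True, random_state=0),
}

# Each model is trained independently; its validation output is kept as a
# matrix of per-class probabilities rather than hard labels.
probs = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    probs[name] = clf.predict_proba(X_val)   # shape (n_val, 4)
```

Keeping the full probability matrix per classifier, rather than a hard vote, is what later allows the ensemble to exploit how each method splits its decision between the grades.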
The outputs of each of the 6 individual classifiers are considered in the form of probabilities, so instead of deriving the direct label for a test sample, the result displays four probabilities, one for each class of the problem. In this way, for the cases that are harder to discern, the methods do not provide the direct class, but more information regarding the way the decision is split between the classes. The most direct manner to reach the decision for a test sample would be to assign equal importance to each method and, for each possible grade, consider the sum or average of the probabilities reached by every single classifier. The grade with the highest value would then be the determined label for the sample to be classified. However, as some classifiers may have highly correlated results, they would dominate the labeling for most of the test samples, and this would result in an accuracy that is not significantly higher than that of the best-performing algorithm. To counteract this effect, weights are used to influence how much each classifier counts in the final decision, in the spirit of other weighted combinations of classifiers [46], [47]. The task of determining these weights is performed by an optimization metaheuristic, i.e. a Differential Evolution (DE) algorithm [48], [49]. DE proved to be a top performer in most of the competition series organized by the IEEE Congress on Evolutionary Computation (CEC), globally surpassing all the other search paradigms for single objective, constrained, dynamic, large-scale, multi-objective, and multi-modal optimization problems [50]. Additionally, personal experience [51], [52] has shown that DE is robust, readily handles integer-valued objective functions, and has few control parameters, which makes it easy to use. As will be seen, the obtained results justify this choice.
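The weighted combination over per-class probabilities just described can be illustrated as follows, with toy probability matrices standing in for the classifiers' outputs:

```python
import numpy as np

def ensemble_predict(prob_list, weights):
    """Combine per-classifier probability matrices (n_samples x n_classes)
    with one weight per classifier and return the winning class per sample."""
    combined = sum(w * p for w, p in zip(weights, prob_list))
    return np.argmax(combined, axis=1)

# Two toy classifiers over the four grades, for two test samples.
p1 = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.2, 0.5, 0.2, 0.1]])
p2 = np.array([[0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.6, 0.1]])

print(ensemble_predict([p1, p2], [1.0, 1.0]))  # equal importance -> [0 2]
print(ensemble_predict([p1, p2], [1.0, 0.2]))  # down-weight the second -> [0 1]
```

Note how the second sample's label flips once the second classifier's weight shrinks: this is exactly the lever the weight optimization pulls.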
The classification process, with particular reference to the optimization of the weights, can be summarized as follows. In the first step, the classification models are constructed on the training examples. Once the 6 models are built, their prediction results are computed for the validation samples. The labeling for a sample is in the form of four probabilities, corresponding to the four possible categories, and it is provided for each of the 6 methods. Optimal weights to balance the outputs of the 6 classifiers are determined by DE on subsets of the validation set. The best weighting is eventually applied to the model probabilities of the complementary subsets to assess the final prediction accuracy. The candidate solutions used by the DE have 6 variables that represent the weights of the same number of considered classifiers. The interest lies in relative weighting, so the weights are not constrained to sum to unity. Nevertheless, in order to still have control over the variables, boundaries are set for each of them in the interval [0, 1]. Another imposed constraint is that the overall sum of the weights should lie in [0.5, 2]. Thus, the sum of weights has precise boundaries, and situations in which all weights are very small (close to zero) or very high (close to 1) are avoided. The fitness function penalizes errors that are worse than off-by-one class, in order to diminish the number of samples that are mistaken to a larger extent. It is given in Eq (1), where w is the vector of weights, n is the number of samples in the validation set, and dc_i and ac_i are the determined and the actual classes for sample i. The determined class dc_i for a sample is the label for which the highest value is obtained when applying the weights to the probabilities of the 6 classifiers. f(w) = Σ_{i=1}^{n} (dc_i − ac_i)^2 (1)
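A minimal sketch of the weight search, under the assumptions stated above: weights bounded in [0, 1], a soft penalty keeping their sum in [0.5, 2], a quadratic error term (our reading of Eq (1)), and a standard DE/rand/1/bin scheme; the paper's actual control parameters are not reproduced here:

```python
import numpy as np

def fitness(w, prob_list, y_true):
    """Ensemble error with a quadratic penalty on the class distance, so an
    off-by-two mistake costs four times an off-by-one mistake (our reading
    of Eq (1)); a soft penalty keeps the weight sum inside [0.5, 2]."""
    s = float(np.sum(w))
    penalty = 1e3 * max(0.0, 0.5 - s) + 1e3 * max(0.0, s - 2.0)
    combined = sum(wi * p for wi, p in zip(w, prob_list))
    dc = np.argmax(combined, axis=1)
    return float(np.sum((dc - y_true) ** 2)) + penalty

def optimize_weights(obj, dim, pop_size=20, F=0.7, CR=0.9, gens=60, seed=1):
    """Compact DE/rand/1/bin; candidate weights are clipped to [0, 1].
    (Base vectors may include the current individual, a simplification.)"""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(pop_size, dim))
    fit = np.array([obj(x) for x in pop])
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), 0.0, 1.0)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True      # at least one mutated gene
            trial = np.where(cross, mutant, pop[i])
            f = obj(trial)
            if f <= fit[i]:                      # greedy one-to-one selection
                pop[i], fit[i] = trial, f
    best = int(np.argmin(fit))
    return pop[best], float(fit[best])

# Toy check with 2 classifiers: A is always right, B always wrong, so the
# optimizer should weight A above B and drive the fitness to zero.
y = np.array([0, 1, 2, 3])
prob_a = np.eye(4)[y]             # one-hot on the true class
prob_b = np.eye(4)[(y + 1) % 4]   # one-hot on a wrong class
w, f = optimize_weights(lambda w: fitness(w, [prob_a, prob_b], y), dim=2)
```

The soft constraint handling (a large additive penalty proportional to the violation) is one common way to impose the weight-sum bounds inside an unconstrained DE loop.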

4 Discussion Ensemble of the various classifiers There are two different approaches that were previously applied to this data set: the first is a multiple-step approach that performs pre-processing, image segmentation, extraction of numerical features, selection of a subset of the most discriminative characteristics and, eventually, classification on the resulting attributes. Such an approach achieved a maximum classification accuracy of 83.94% [57]. The second approach is a CNN containing 5 convolutional layers that had a performance of 91.44% [58], and was further tuned in [59], reaching 92% classification accuracy. The classification results of the combined approach on the IMEDIATREAT data set, as shown in Table 1, are significantly better than those of the previous naive design of doubling the weight of the Fourier transform approach, which led to 95.65% correct labeling [12]. Besides the confusion matrix in Table 1, Table 2 brings more insight concerning the separation between the different classes of the IMEDIATREAT data set: the slides that correspond to the healthy tissues on the one hand and the ones that correspond to G3 on the other hand are better delineated from the rest with respect to all the considered metrics, while the slides that are more difficult to distinguish come from G1 and G2. This observation is especially supported by the values of the specificity, showing that only a few images that do not have a given condition are labeled as having it. To obtain each plot in Fig 3, the DE is run 100 times, each run using a random subset of 20 of the 40 trials to optimize the weights, with the weights then tested on the remaining 20. The box plots are computed from the weights contained in the best candidate solutions. A general observation is that FTT+PCA has a relatively high weight over all data sets, except for GlaS, where it performs poorly.
In general, the boxes are relatively small, meaning that the degree of dispersion is relatively low, so the weights are close to each other. The general exception is RF (which is kept at zero only for IMEDIATREAT), where a large variety of weights leads to good results. Besides FTT+PCA, SVM and LR are the models that have high influence for most of the problems. However, it is important to notice that the weights plotted in Fig 3 are not always in direct agreement with the individual accuracies of the chosen classifiers in Table 3. In order to understand this behavior, the Pearson product-moment correlation coefficients were computed for each pair of methods by taking into account the class probability outputs on the test sets of the IMEDIATREAT data set, along with the Cohen's kappa coefficients based on the actual test accuracies. The results are illustrated in Fig 4. Analyzing it in conjunction with the first plot in Fig 3 shows that (1) methods that misclassified different test samples than the other techniques received higher weights relative to their individual performance, and (2) methods that gave labeling similar to others (not only as concerns the actual labels, but also the probabilities for each grade in turn) received lower weights. In summary, the generated weights are not always proportional to the accuracies obtained by each of the six methods, as the DE finds an optimum that takes into account the level of disagreement of each classifier with the other classifiers. For IMEDIATREAT, DE therefore again puts a very high weight on FTT+PCA, which was also empirically observed to be efficient in the earlier naive approach [12]. The same high weight can be found for it on BreaKHis at all four magnification levels. The optimized approach results on the IMEDIATREAT data substantially surpassed the initial expectations as concerns the achieved classification accuracy, boosting the correctness rate by 3 percentage points.
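The agreement analysis mentioned above can be reproduced in outline; the snippet below computes Cohen's kappa for two classifiers' hard labels on a toy example (the Pearson coefficients over class probabilities can be obtained analogously with numpy.corrcoef):

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two classifiers' hard labels."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes = np.union1d(a, b)
    po = np.mean(a == b)                                          # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in classes)  # chance agreement
    return (po - pe) / (1.0 - pe)

# Two classifiers labeling 8 slides into the four grades G0..G3.
pred_a = [0, 1, 2, 3, 0, 1, 2, 3]
pred_b = [0, 1, 2, 3, 0, 1, 1, 0]      # agrees on 6 of 8 slides
print(round(cohens_kappa(pred_a, pred_b), 3))  # -> 0.667
```

A low kappa between a classifier and the rest of the pool is precisely the kind of disagreement that, per the analysis above, earns it a weight larger than its individual accuracy alone would suggest.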
Table 4 contains the results of the proposed method for the other two public data sets, i.e. GlaS and BreaKHis, using the weights discovered by DE in the best configuration, the one illustrated with a blue circle in Fig 3. As seen in Table 4, the recent deep learning classifier CSDCNN reaches the most accurate results as compared to the other state-of-the-art methodologies. However, a distinct protocol is considered for CSDCNN, i.e. different splits and especially a larger training set. This method aside, the proposed methodology reaches the best results for 200x and 400x in the image-level results and for 200x in the patient-level computations, while in the other cases it performs close to the best. Although for the IMEDIATREAT and BreaKHis data sets FTT+PCA has the highest influence in the ensemble (for the former its weight is nearly tied at the top with that of kNN), it fails to deliver good results for GlaS on its own and accordingly was dropped by DE from the ensemble. The poor results are probably due to the very small size of the GlaS data set, as opposed to the other two. The DE ensemble approach was employed for finding weights specific to the individual classifiers that work best for each different image data set. We were interested, however, in testing how the weights discovered for the IMEDIATREAT data set would perform if transferred to the BreaKHis data set. Fig 5 illustrates the obtained results and the actual gains when the DE is used specifically for that data set. As can be noticed, the gains are significant in general, but the IMEDIATREAT weights already lead to a decent result, better than most of those delivered by the other methods illustrated in Table 4. As concerns the running times, for the 40x magnification BreaKHis data set, the training of all classifiers on 1338 images takes in total 17.7 minutes. Also as part of the training, DE is further applied to find the weights and this takes 125 seconds.
For testing the entire ensemble on one image, the time is 0.69 seconds. The tests are performed on a PC running Ubuntu Linux, with a 2.9 GHz CPU and 16 GB RAM. Degree of confidence As quantified in Fig 7, the higher the threshold T_conf is, the smaller the set of samples that are trusted. Similarly, the number of trusted failures, i.e. the samples that are put in the trusted set but about which the classifier is nevertheless wrong, is decreased, as desired. Out of 92 missed guesses of the hybridized method, 55 are generated by the images from Fig 6. The anarchic structure of G3 has a high degree of similarity especially with the last two slides. Another fact that should be considered is that the data set is obtained by cropping the slides from larger images that contain borderlines between healthy tissues and unhealthy ones of grades 1, 2 or 3, following the acknowledged procedure in [60]. By analyzing the blue lines in the two plots from Fig 7, one can notice that when T_conf is equal to 0.95, the number of trusted failures is 0, and 21.43% of the results are still labeled correctly and trusted. When T_conf is 0.85, 1 sample out of 4760 is incorrectly labeled and 65.27% (3107 samples out of 4760) are both labeled correctly and trusted to be so with respect to T_conf. The methodology would thus surely help human experts to establish the grading for histopathological images. By setting T_conf to around 0.85, the pathologists would in principle have more than half of the samples classified and could focus on the cases that are more difficult to distinguish.
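The trusted/deferred split governed by T_conf can be sketched as follows; note that the confidence measure used here (the normalized maximum of the weighted ensemble score) is an illustrative assumption, not necessarily the paper's exact definition:

```python
import numpy as np

def split_by_confidence(ensemble_scores, t_conf):
    """Return a trusted-sample mask and the predicted labels.

    Confidence is taken as the maximum of the row-normalized weighted
    ensemble scores (an illustrative choice). Samples whose confidence
    falls below t_conf are deferred to the pathologist."""
    norm = ensemble_scores / ensemble_scores.sum(axis=1, keepdims=True)
    confidence = norm.max(axis=1)
    return confidence >= t_conf, np.argmax(ensemble_scores, axis=1)

scores = np.array([[0.90, 0.05, 0.03, 0.02],   # clear-cut case
                   [0.40, 0.35, 0.15, 0.10]])  # ambiguous between G0 and G1
trusted, labels = split_by_confidence(scores, t_conf=0.85)
print(trusted.tolist(), labels.tolist())  # -> [True, False] [0, 0]
```

Raising t_conf shrinks the trusted set and, as in Fig 7, reduces trusted failures at the cost of deferring more samples to the human expert.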

5 Conclusions The current research is focused on improving the classification accuracy on the IMEDIATREAT collection. It comprises 357 histopathological slides that are separated into four different classes: healthy or cancer of grades 1, 2 or 3. Six classifiers are tested on the data set: they learn from a training set consisting of 2/3 of the entire data collection and are tested on the remaining 1/3 of samples. The classification accuracies of these six approaches are superior or similar to those of previously attempted techniques on the same data set. However, the research goes further and combines the six classifiers, establishing weights for their significance in order to further boost the accuracy. The values of the weights for the classifiers are discovered (that is, optimized) via a differential evolution approach, using a fitness function that further penalizes the errors where the classes are wrong by more than one grade. The ensemble approach is successfully applied to two other data sets that also consist of histopathological images, a small one, GlaS, and a very large one, BreaKHis. The latter can be seen as consisting of 4 large data sets at different magnification levels. The ensemble with the weights as discovered by the DE for the IMEDIATREAT data reaches very competitive results on the BreaKHis data set, but using the DE optimizer for that specific data set still boosts the accuracy results by approximately 1.5% on average. It is interesting to observe that it is not the classifiers that perform best that are selected as the most important ones by the DE optimizer. As observed in the first experiment, a classifier that reaches a classification accuracy usually below average counts the most in general, and it is also observed that it does not agree to a high extent with the other classifiers, as reflected by Cohen's kappa.
A degree of confidence is further defined to separate the classified samples into trusted (with respect to their classification) or unreliable. The classification problem dealt with here is a very delicate one, so it is perhaps more important to divide the histopathological slides to be tested into a set that the classifier can distinguish very well and a set where there are doubts. The latter subset could be regarded as in need of more human expertise. Without a doubt, the pathologists will never be replaced, but a high amount of work could be saved by establishing the cases that are very clear, or at least a strong second opinion could be afforded. This could provide pathologists with a means of prioritizing samples that need further scrutiny. The proposed approach combining the six classifiers proves to be very strong with respect to the classification accuracy, especially as compared to the individual results of the six methods. By taking into account the probabilities for each grade in turn, the DE manages to balance the strengths of each classifier against the correlations between methods. By setting a larger weight for the FTT+PCA method, the weighted ensemble often reaches the correct labeling where FTT+PCA is mistaken, by counting on the combined weight of those classifiers that are correct, and takes the correct FTT+PCA label where some other classifiers may have their best choices spread among several possibilities. This finding may have future impact on other classification problems as well.

Acknowledgments C.S. thanks Wolfram Research for the one-year license of Mathematica Standard Edition that made this research possible and simplified the collaboration with D.L. Both authors thank Ruxandra Stoean for careful proofreading and comments that led to improvements in the exposition.