Key Points

Question: Are surgical skin markings in dermoscopic images associated with the diagnostic performance of a trained and validated deep learning convolutional neural network?

Findings: In this cross-sectional study of 130 skin lesions, skin markings made with standard surgical ink markers were associated with a significant reduction in the specificity of a convolutional neural network by increasing the melanoma probability scores, consequently increasing the false-positive rate of benign nevi by approximately 40%.

Meaning: This study suggests that the use of surgical skin markers should be avoided in dermoscopic images intended for analysis by a convolutional neural network.

Abstract

Importance: Deep learning convolutional neural networks (CNNs) have shown dermatologist-level performance in the diagnosis of melanoma. Accordingly, further exploring the potential limitations of CNN technology before broadly applying it is of special interest.

Objective: To investigate the association between gentian violet surgical skin markings in dermoscopic images and the diagnostic performance of a CNN approved for use as a medical device in the European market.

Design and Setting: A cross-sectional analysis was conducted from August 1, 2018, to November 30, 2018, using a CNN architecture trained with more than 120 000 dermoscopic images of skin neoplasms and corresponding diagnoses. The association of gentian violet skin markings in dermoscopic images with the performance of the CNN was investigated in 3 image sets of 130 melanocytic lesions each (107 benign nevi, 23 melanomas).

Exposures: The same lesions were sequentially imaged with and without the application of a gentian violet surgical skin marker and then evaluated by the CNN for their probability of being a melanoma. In addition, the markings were removed by manually cropping the dermoscopic images to focus on the melanocytic lesion.

Main Outcomes and Measures: Sensitivity, specificity, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the CNN’s diagnostic classification in unmarked, marked, and cropped images.

Results: In all, 130 melanocytic lesions (107 benign nevi and 23 melanomas) were imaged. In unmarked lesions, the CNN achieved a sensitivity of 95.7% (95% CI, 79.0%-99.2%) and a specificity of 84.1% (95% CI, 76.0%-89.8%). The ROC AUC was 0.969. In marked lesions, an increase in melanoma probability scores was observed that resulted in a sensitivity of 100% (95% CI, 85.7%-100%) and a significantly reduced specificity of 45.8% (95% CI, 36.7%-55.2%; P < .001). The ROC AUC was 0.922. Cropping images led to the highest sensitivity of 100% (95% CI, 85.7%-100%), specificity of 97.2% (95% CI, 92.1%-99.0%), and ROC AUC of 0.993. Heat maps created by vanilla gradient descent backpropagation indicated that the blue markings were associated with the increased false-positive rate.

Conclusions and Relevance: This study’s findings suggest that skin markings significantly interfered with the CNN’s correct diagnosis of nevi by increasing the melanoma probability scores and consequently the false-positive rate. A predominance of skin markings in melanoma training images may have induced the CNN’s association of markings with a melanoma diagnosis. Accordingly, these findings suggest that skin markings should be avoided in dermoscopic images intended for analysis by a CNN.

Trial Registration: German Clinical Trial Register (DRKS) Identifier: DRKS00013570

Introduction

Incidence rates of malignant melanoma are increasing in many countries of the world.1 Despite much progress being made regarding public awareness, basic research, and clinical care for treating malignant melanoma, mortality rates are still high.2 Therefore, there is a continuous need for improvements in the methods for the early detection of malignant melanoma. When diagnosed early, melanoma may be cured by surgical excision, whereas the prognosis of more advanced cases is limited. In clinical routine, a high sensitivity for the detection of melanoma is of utmost importance; nevertheless, the number of excised benign nevi should be limited.3 Dermoscopy was shown to significantly improve the diagnostic sensitivity and specificity compared with that obtained by naked eye examination.4-6 Various dermoscopic features have been associated with the diagnoses of melanoma,7 and a number of simplified algorithms have been defined and validated to support dermatologists in deciding which lesions to excise.8-10

As in other fields of medicine, automated and computerized deep learning systems are emerging for the diagnosis of skin cancer.11 Deep learning is defined as a form of machine learning in which large data sets (eg, dermoscopic images) and corresponding classification labels (eg, diagnoses of nevi or melanomas) are fed into a neural network for training purposes. Within the network, which is composed of many sequential layers, input images are assessed on a pixel level for the presence of “good representations” (here, dermoscopic features) of the input classification. With an increasing number of training images, the network assembles and weights image features that are useful for differentiating nevi from melanomas. Therefore, deep learning can be described as a form of hierarchical feature learning. Deep learning convolutional neural networks (CNNs) form a subcategory of deep learning algorithms that have shown strong performance in image classification. To date, deep learning CNNs have demonstrated a diagnostic performance at the level of experienced physicians in the evaluation of medical images from the fields of dermatology,12-14 radiology,15 ophthalmology,16 and pathology.17

While a single physician with a low diagnostic performance in the detection of melanoma may cause serious harm, the effect of a broadly applied neural network with inherent “diagnostic gaps” or unknown pitfalls would be even more detrimental. In dermoscopic images, artifacts such as air bubbles, hair, or overlaid rulers have previously been reported to present some of the difficulties in automated image evaluation.11 Because suspicious lesions are often routinely marked with gentian violet surgical skin markers, our study investigated whether highlighting lesions with a skin marker may alter the evaluation scores of a computerized deep learning CNN for melanoma recognition.

Methods

This noninterventional study was approved by the ethics committee of the medical faculty of the University of Heidelberg, Heidelberg, Germany, and performed in accordance with the Declaration of Helsinki18 principles. Informed consent of patients was waived by the ethics committee because all images were acquired as part of clinical routine procedures and only deidentified data were used. The study was conducted from August 1, 2018, to November 30, 2018. A pretrained CNN architecture (Inception-v4; Google)19 was used that was additionally trained with more than 120 000 dermoscopic images and corresponding labels (Moleanalyzer-Pro; FotoFinder Systems GmbH). Details on the CNN architecture and training have been described earlier.12

For the present study, 3 image sets were created, with each including 130 melanocytic lesions (107 benign nevi and 23 melanomas). Dermoscopic images of nevi with and without skin markings were prospectively and sequentially acquired in clinical routine with a mobile digital dermatoscope attached to a smartphone (Handyscope; FotoFinder Systems GmbH). The diagnoses of benign nevi were not based on histopathologic findings but rather on the absence of any melanoma-associated clinical and dermoscopic features in combination with an uneventful follow-up over the past 2 years. Skin markings included variable dots, streaks, or circles applied with a gentian violet skin marker (Devon Surgical Skin Marker; Cardinal Health or pfm medical skin marker; pfm medical ag) to the skin adjacent to the nevi. All nevi were first imaged as unmarked lesions, after which they were marked in vivo and imaged again as marked lesions (Figure 1). Melanoma images without markings were randomly selected from the image library of the Department of Dermatology, University of Heidelberg. All melanoma cases were validated by histopathologic analysis with additional information on localization, Breslow thickness, and patient data being available. To allow for corresponding analyses of melanomas, the skin markings were digitally superimposed on the melanoma images with the use of photograph manipulation software (Photoshop CS6, version 13.0.1 x32; Adobe Inc). For a statistical comparison, 20 nevi from the test set were used to demonstrate that electronically superimposed markings provide comparable results to in vivo markings. In 20 unmarked benign nevi, the CNN’s mean melanoma probability score was 0.15 (95% CI, 0.01-0.29). Melanoma probability scores can range from 0 to 1; higher scores indicate a higher probability of the measured lesion being a melanoma. In vivo markings increased the mean score to 0.52 (95% CI, 0.31-0.74), whereas electronically superimposed markings led to a comparable mean score of 0.59 (95% CI, 0.39-0.79). The Mann-Whitney test did not reveal a significant difference between in vivo and electronically marked nevi (P = .78). Moreover, in each of the 20 nevi, the CNN classification of in vivo and electronically marked lesions showed consistent results. For more details, refer to the eMethods and eFigures 1 and 2 in the Supplement. All dermoscopic images were then cropped to reduce the background and to focus solely on the melanocytic lesions. The aforementioned steps resulted in 3 complete sets of the same 130 dermoscopic images, namely, set 1 with unmarked lesions, set 2 with marked lesions, and set 3 with cropped images.

Heat Maps

Deep learning CNNs do not provide any information about why a certain classification decision was reached. There are many different interpretability approaches that may help to more clearly visualize the information “learned” by the model.20

Heat maps were created to identify the pixels most important for the CNN’s diagnosis and thereby show how much each pixel of the image contributes to the diagnostic classification. These heat maps were derived by vanilla (meaning “basic”) gradient descent backpropagation.21
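The idea behind such gradient-based heat maps can be sketched in a few lines: the heat value of a pixel is the magnitude of the gradient of the melanoma score with respect to that pixel. The example below is only an illustration under strong assumptions. It uses a toy logistic model over a 3x3 image instead of the Inception-v4 network, and the names `melanoma_score` and `vanilla_gradient_saliency`, the image, the weights, and the bias are all hypothetical. In a real CNN the same gradient would be obtained by backpropagating through all layers with an automatic differentiation framework.

```python
import math


def melanoma_score(image, weights, bias):
    """Toy stand-in for a classifier: a logistic score over all pixels."""
    z = bias
    for weight_row, pixel_row in zip(weights, image):
        for w, x in zip(weight_row, pixel_row):
            z += w * x
    return 1.0 / (1.0 + math.exp(-z))


def vanilla_gradient_saliency(image, weights, bias):
    """Absolute gradient of the score with respect to each input pixel."""
    s = melanoma_score(image, weights, bias)
    # For this toy model, d(score)/d(x_ij) = s * (1 - s) * w_ij.
    return [[abs(s * (1.0 - s) * w) for w in row] for row in weights]


# Hypothetical 3x3 grayscale image; the top-left pixel plays the role
# of an ink marking at the periphery, the center pixel the lesion.
image = [[0.9, 0.1, 0.1],
         [0.1, 0.5, 0.1],
         [0.1, 0.1, 0.1]]

# Hypothetical weights: this toy model has learned to weight the
# corner pixel far more heavily than the lesion itself.
weights = [[4.0, 0.1, 0.1],
           [0.1, 1.0, 0.1],
           [0.1, 0.1, 0.1]]
bias = -2.0

heat_map = vanilla_gradient_saliency(image, weights, bias)
for row in heat_map:
    print(["%.3f" % value for value in row])
```

In this toy setting the largest heat value falls on the corner pixel, which mirrors the kind of pattern the study describes for blue markings, although it says nothing about what the tested CNN actually learned.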

Statistical Analysis

The primary outcome measures were sensitivity, specificity, and area under the curve (AUC) of receiver operating characteristic (ROC) curves for the diagnostic classification of lesions by the CNN. The CNN accorded a malignancy probability score between 0 and 1, and a validated a priori cutoff greater than 0.5 for the dichotomous classification of malignant vs benign lesions was applied. Descriptive statistical measures, such as frequency, mean, range, and SD, were used. Mann-Whitney tests were performed to assess the differences in the melanoma probability scores between the 3 sets of images. A 2-sample McNemar test was performed to compare the sensitivities and specificities attained by the CNN.22 Results were considered statistically significant at the P < .05 level (2-sided). All analyses were carried out using SPSS version 24 (IBM).
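As an illustration of the dichotomous classification and the paired comparison described above, the following sketch applies the stated cutoff (scores greater than 0.5 are classified as malignant) and computes sensitivity, specificity, and an exact two-sided McNemar P value. It is not the SPSS procedure used in the study: the source does not state whether an exact or a chi-square McNemar test was run, and the scores, labels, and discordant-pair counts below are hypothetical.

```python
from math import comb

CUTOFF = 0.5  # a priori cutoff from the text: scores > 0.5 count as malignant


def classify(score, cutoff=CUTOFF):
    """Return True if the lesion is classified as melanoma."""
    return score > cutoff


def sensitivity_specificity(scores, is_melanoma, cutoff=CUTOFF):
    """Compute (sensitivity, specificity) from scores and reference labels."""
    tp = fn = tn = fp = 0
    for score, truth in zip(scores, is_melanoma):
        predicted = classify(score, cutoff)
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar P value from the discordant pair counts.

    b: lesions classified correctly in condition A but not in condition B
    c: lesions classified correctly in condition B but not in condition A
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical scores and labels (True = melanoma), for illustration only.
scores = [0.10, 0.62, 0.30, 0.95, 0.50, 0.80]
labels = [False, False, False, True, False, True]

sens, spec = sensitivity_specificity(scores, labels)
print("sensitivity:", sens, "specificity:", spec)
print("exact McNemar P:", mcnemar_exact_p(8, 1))  # hypothetical counts
```

Note that a score of exactly 0.5 is classified as benign under a strict "greater than" cutoff, which is how the cutoff is worded in the text.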

Results

Characteristics of Imaged Lesions

In all, 130 melanocytic lesions (107 benign nevi and 23 melanomas) were imaged. Of the 23 imaged melanomas, 18 (78.3%) were localized on the trunk and extremities, 3 (13.0%) on the facial skin, 1 (4.3%) on the scalp, and 1 (4.3%) on the palmoplantar skin (eTable in the Supplement). Nineteen melanomas (82.6%) were invasive (mean thickness, 1 mm [range, 0.2-5.6 mm]) and 4 (17.4%) were in situ. The analysis of melanoma subtypes revealed 15 superficial spreading melanomas, 2 lentigo maligna melanomas, 1 nodular melanoma, and 1 acrolentiginous melanoma. Of the 4 in situ melanomas, 1 was classified as lentigo maligna (eTable in the Supplement). The 107 imaged benign nevi showed no clinical or dermoscopic criteria associated with the presence of melanoma and had an uneventful follow-up for at least 2 years (Figure 1).

CNN’s Melanoma Probability Scores

Box plots in Figure 2 show the distribution of the CNN melanoma probability scores for the 3 different sets of images (unmarked, marked, and cropped). Skin markings significantly increased the mean melanoma probability scores of the classifier in benign nevi from 0.16 (95% CI, 0.10-0.22) to 0.54 (95% CI, 0.46-0.62) (P < .001). Figure 3 and Figure 4 show heat maps of representative unmarked and marked nevi in which the most important pixels for the CNN’s diagnostic classifications were identified by vanilla gradient descent backpropagation.21 In nevi images that were cropped to reduce the background, the mean melanoma probability scores were significantly reduced to 0.03 (95% CI, 0-0.06) compared with those in unmarked (0.16; 95% CI, 0.10-0.22) and marked (0.54; 95% CI, 0.46-0.62) images (P < .001). In melanoma images, we also observed an increase in the mean melanoma probability scores from 0.94 (95% CI, 0.85-1.00) in unmarked images to 1.00 (95% CI, 0.99-1.00) in electronically marked images. However, as unmarked melanoma images already showed mean scores close to the maximum score of 1, the induced changes did not reach statistical significance (P = .10). Irrespective of markings or cropping, the differences in melanoma probability scores between benign nevi and melanomas remained statistically significant across all image sets. At the same time, no significant difference was observed between the melanoma probability scores of in situ melanomas vs invasive melanomas across all image sets.

CNN’s Sensitivity, Specificity, and ROC AUC

At the a priori operation point of 0.5, the sensitivity of the CNN in the unmarked image set was 95.7% (95% CI, 79.0%-99.2%) and the specificity was 84.1% (95% CI, 76.0%-89.8%). When lesions were marked, the sensitivity changed to 100% (95% CI, 85.7%-100%) and the specificity to 45.8% (95% CI, 36.7%-55.2%). In cropped images, the CNN showed a sensitivity of 100% (95% CI, 85.7%-100%) and a specificity of 97.2% (95% CI, 92.1%-99.0%). A pairwise comparison of the CNN’s sensitivities in unmarked, marked, or cropped images revealed no significant differences. A pairwise comparison of the specificities showed significant differences between unmarked and marked images (84.1%; 95% CI, 76.0%-89.8% vs 45.8%; 95% CI, 36.7%-55.2%; P < .001), unmarked and cropped images (84.1%; 95% CI, 76.0%-89.8% vs 97.2%; 95% CI, 92.1%-99.0%; P = .003), and marked and cropped images (45.8%; 95% CI, 36.7%-55.2% vs 97.2%; 95% CI, 92.1%-99.0%; P < .001).
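The reported 95% CIs can be checked with a short calculation. The sketch below computes Wilson score intervals for binomial proportions. Two points are assumptions rather than statements from the article: the text does not name the interval method (the Wilson interval simply agrees with the reported bounds to the rounding shown), and the correct-classification counts (22 of 23, 90 of 107, 49 of 107, 104 of 107) are back-calculated here from the reported percentages and sample sizes.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts back-calculated from the reported percentages (an assumption):
# (label, correctly classified lesions, total lesions, reported 95% CI)
reported = [
    ("sensitivity, unmarked", 22, 23, "79%-99.2%"),
    ("specificity, unmarked", 90, 107, "76.0%-89.8%"),
    ("sensitivity, marked", 23, 23, "85.7%-100%"),
    ("specificity, marked", 49, 107, "36.7%-55.2%"),
    ("specificity, cropped", 104, 107, "92.1%-99.0%"),
]

for label, correct, total, ci_in_text in reported:
    low, high = wilson_interval(correct, total)
    print("%s: %.1f%% (computed %.1f%%-%.1f%%, reported %s)"
          % (label, 100 * correct / total, 100 * low, 100 * high, ci_in_text))
```

Under the same back-calculated counts, the number of false-positive nevi rises from 17 of 107 (15.9%) in unmarked images to 58 of 107 (54.2%) in marked images, a difference of 38.3 percentage points.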

The ROC AUC in unmarked images was 0.969 (95% CI, 0.935-1.000), in marked images was 0.922 (95% CI, 0.871-0.973), and in cropped images was 0.993 (95% CI, 0.984-1.000). All 3 ROC curves that were calculated for the 3 image sets are depicted in Figure 5 and illustrate a significant reduction in specificity of nearly 40 percentage points (from 84.1% to 45.8%) in marked vs unmarked lesions as well as the superior performance of the CNN when using cropped lesions.
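For readers less familiar with the ROC AUC, it can be read as the probability that a randomly chosen melanoma receives a higher score than a randomly chosen nevus, with ties counted as one-half. This is the Mann-Whitney U statistic divided by the number of melanoma-nevus pairs. The following minimal sketch computes it directly from that definition; the function name `roc_auc` and the scores are hypothetical and are not data from the study.

```python
def roc_auc(scores_melanoma, scores_nevi):
    """Rank-based ROC AUC: P(melanoma score > nevus score), ties count 0.5."""
    wins = 0.0
    for m in scores_melanoma:
        for n in scores_nevi:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(scores_melanoma) * len(scores_nevi))


# Hypothetical melanoma probability scores, for illustration only.
melanoma_scores = [0.95, 0.80, 0.40]
nevus_scores = [0.10, 0.30, 0.62, 0.40]

print("ROC AUC:", roc_auc(melanoma_scores, nevus_scores))
```

Because the AUC depends only on the ranking of scores, a uniform upward shift in the scores of both nevi and melanomas would leave it unchanged, whereas specificity at a fixed cutoff of 0.5 would fall. This may help explain why the AUC in marked images changed less than the specificity did.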

Discussion

Deep learning CNNs have recently been applied to different diagnostic tasks in medical image recognition and classification (eg, ophthalmology,16 radiology,15 histopathology,17 and dermatology23). Several landmark studies compared human and machine accuracy in skin cancer detection.24,25 Two recent publications reported an expert dermatologist-level classification of dermoscopic images of benign melanocytic nevi and cutaneous melanomas,12,13 and a first deep learning CNN for classification of skin neoplasms has gained market access in Europe as a medical device (Moleanalyzer-Pro). While these achievements represent major successes, further exploring the limitations of deep learning CNNs is important before considering a broader application worldwide.

It has previously been shown that artifacts in dermoscopic images, such as dark corners (caused by viewing through the tubular lens of the dermatoscope), gel bubbles, superimposed color charts, overlaid rulers, and occluding hair, may impede image segmentation and classification by automated algorithms.11,26 Various methods have been reported for the removal of such artifacts,27,28 and strategies for preprocessing of images were described to improve the classification outcomes of CNNs.29 However, the removal of artifacts by image preprocessing may ultimately alter the original image and itself be prone to error. Therefore, a major advantage of deep learning CNNs is that the raw RGB dermoscopic image may be used as an input, thus bypassing preprocessing.30

This study investigated the possible association of surgical skin markers as artifacts in dermoscopic images with the classification outcomes by a deep learning CNN. In clinical routine, suspicious lesions are frequently marked before being excised or photographed. Our attention was drawn to this issue when evaluating dermoscopic images of benign nevi under sequential digital dermoscopy follow-up. We observed that sequentially imaged benign nevi, although largely unchanged, were frequently labeled as being malignant by the CNN when ink markers were visible at the periphery of the dermoscopic image. To systematically and prospectively investigate our observation, 3 sets of dermoscopic images (unmarked, marked, and cropped) of the same 130 melanocytic lesions were created. Our assessments of these images with the CNN showed that skin markings at the periphery of benign nevi were associated with an increase in the melanoma probability scores that increased the false-positive rate by approximately 40%. To prove that this association may be attributed solely to the dermoscopic background and not the melanocytic lesion itself, the dermoscopic images were cropped manually. This procedure reversed the negative association of skin markings with the diagnostic performance of the CNN. Overall, image preprocessing by manually cropping images led to the best diagnostic performance of the CNN, achieving a sensitivity of 100%, specificity of 97.2%, and ROC AUC of 0.993. The CNN’s specificity in the cropped images (97.2%) was significantly improved compared with that in the unmarked images (84.1%). However, cropping was done manually by experienced dermatologists, and the results may deteriorate with automated cropping by a formal preprocessing step using border segmentation algorithms.

When reviewing the open-access International Skin Imaging Collaboration database, which is a source of training images for research groups, we found that a similar percentage of melanomas (52 of 2169 [2.4%]) and nevi (214 of 9303 [2.3%]) carry skin markings. Nevertheless, it seems conceivable that either an imbalance in the distribution of skin markings in thousands of other training images that were used in the CNN tested herein or the assignment of higher weights to blue markings only in lesions with specific (though unknown) accompanying features may induce a CNN to associate skin markings with the diagnosis of melanoma. The latter hypothesis may also explain why melanoma probability scores remained almost unchanged in many marked nevi while being increased in others.

The fact that blue markings are associated with changes in melanoma probability scores while the underlying mechanisms remain unclear highlights the lack of transparency in the classification process of neural network models. Thus, although not being dependent on manmade criteria for classification has opened a new level of performance, it may impede the insights into a mechanistic understanding. The CNN tested in this study applies the melanoma probability score as a softmax output classifier. Recently, content-based image retrieval has been shown to provide results comparable to softmax classifiers.14 In this alternative approach, the CNN generates several images that are visually similar to the input image along with the corresponding diagnoses. The displayed output images are retrieved from the compiled training images based on overlapping features identified by the neural network. This strategy has been hypothesized to increase the explainability for clinicians.

There are several approaches to the problem of bias induced by skin markings. Avoiding markings in images that are intended for analysis seems the most straightforward solution for the CNN tested in our study. Avoiding markings in training images (eg, by cropping images before training) is logical with regard to future algorithms. In contrast, teaching the CNN to ignore parts of the image that may or may not be artificial skin markings appears rather difficult. Because there are many more types of artifacts in images other than blue surgical skin markers, some artifacts may still be undetected. At the same time, other parts of images may erroneously be interpreted as artifacts that preclude them from analysis by the CNN. Moreover, as stated above, automated segmentation with border detection of the lesion of interest may be another option to improve evaluation.27

Limitations

Our study has some limitations. First, benign melanocytic nevi were not excised for histologic verification, but rather were selected from patients under follow-up and showed no changes during the past 2 years. Second, dermoscopic images of melanomas were extracted from a validated database; thus, skin markings could not be added in vivo. Instead, skin markings were electronically duplicated from digital images and superimposed on the melanoma background. This procedure and its association with changes in the classification by the CNN were tested with images of 20 benign nevi. In these cases, no significant differences were found between the melanoma probability scores attained with the CNN in images with “in vivo” markings vs images with electronically superimposed markings. Third, most images included in this study were derived from fair-skinned patients residing in Germany; therefore, the findings may not be generalizable to lesions of patients with other skin types and genetic backgrounds.

Conclusions

In summary, the results of our investigation suggest that skin markings at the periphery of dermoscopic images are significantly associated with the classification results of a deep learning CNN. Melanoma probability scores of benign nevi appear to be significantly increased by markings, causing a strong increase in the false-positive rate. In clinical routine, these lesions may have been sent for unnecessary excisions. Therefore, we recommend avoiding skin markings in dermoscopic images intended for analysis by a deep learning CNN.

Article Information

Accepted for Publication: May 11, 2019.

Corresponding Author: Holger A. Haenssle, MD, Department of Dermatology, University of Heidelberg, Im Neuenheimer Feld 440, 69120 Heidelberg, Germany (holger.haenssle@med.uni-heidelberg.de).

Published Online: August 14, 2019. doi:10.1001/jamadermatol.2019.1735

Author Contributions: Drs Winkler and Haenssle had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Winkler, Fink, Haenssle.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: Winkler, Fink, Haenssle.

Critical revision of the manuscript for important intellectual content: All authors.

Statistical analysis: Winkler, Haenssle.

Administrative, technical, or material support: Fink, Toberer, Enk, Thomas, Blum, Stolz, Haenssle.

Supervision: Fink, Deinlein, Haenssle.

Conflict of Interest Disclosures: Dr Fink reported receiving travel expenses from Magnosco GmbH. Dr Haenssle reported receiving honoraria and/or travel expenses from the following companies specializing in the development of devices for skin cancer screening: SciBase AB, FotoFinder Systems GmbH, Heine Optotechnik GmbH, and Magnosco GmbH. No other disclosures were reported.