
Skin Disease Classification Using Convolutional Neural Networks

Simon Schäfer and Christian Ludwigs

Ludwig-Maximilians-Universität München

Abstract. This manuscript briefly describes the approach taken by team LMU to generate predictions for Task 3 (Disease Classification) of the 2018 ISIC Challenge. On the 2018 ISIC validation set, we achieved a balanced accuracy of 0.881.

Keywords: Skin Disease Classification · Neural Networks

1 Background & Introduction

Skin cancer has emerged as a major challenge in public health. The incidence of both melanoma and other skin cancers has been increasing over the past decades. According to the World Health Organization, over 130,000 cases of melanoma and between 2 and 3 million non-melanoma skin cancers are diagnosed globally each year.1 Incidence rates are expected to increase further, driven by continued ozone depletion and increasing recreational sun exposure.

1 http://www.who.int/uv/faq/skincancer/en/index1.html
Falling prices of cloud computing services and the growing availability of powerful graphics processing units have led to the rise of Convolutional Neural Networks (CNNs), making them more accessible and making learning from large datasets much more tractable. CNNs have proven useful in diagnostic decision-making and therefore find extensive application in medical science.
In the field of dermatology, they are applied to problems such as the classification of skin diseases. In this context, the International Skin Imaging Collaboration (ISIC) created an open-access archive of dermoscopic images of skin lesions and conducts regular skin disease detection challenges. As reported by [5], the melanoma classification results from the 2016 ISIC Challenge and a companion reader study with experienced dermatologists on a subset of images showed that deep learning algorithms classified melanoma dermoscopy images with an accuracy that exceeded that of some dermatologists. In the landmark publication titled Dermatologist-level classification of skin cancer with deep neural networks, [2] demonstrated that a Google Inception v3 CNN architecture trained on a dataset of c.130,000 skin lesion images is capable of classifying skin cancer with a level of competence comparable to dermatologists. In a recent publication, [4] used a Google Inception v4 CNN architecture trained on reportedly more than 100,000 dermoscopic images and the results of a reader study to show that the diagnostic performance of the CNN exceeded that of most dermatologists.2

2 Note: At the same time, the results suggested lower performance than dermatologists in lesion classification on the ISIC 2016 test set.
The main purpose of this manuscript is to briefly describe the approach used by team LMU to generate predictions for Task 3 of the 2018 ISIC Challenge. The task was to submit disease classification predictions for 1,512 dermoscopic images covering the following seven possible disease classes, with abbreviations in square brackets: melanoma [MEL], melanocytic nevus [NV], basal cell carcinoma [BCC], actinic keratosis / Bowen's disease (intraepithelial carcinoma) [AKIEC], benign keratosis (solar lentigo, seborrheic keratosis, lichen planus-like keratosis) [BKL], dermatofibroma [DF] and vascular lesion [VASC].

Fig. 1. 2018 ISIC Challenge Task 3 lesion classes and example images. Source:
https://challenge2018.isic-archive.com/task3/

2 Approach

Our dataset included the 23,532 mostly dermoscopic images from the ISIC archive (as reported in Table 1). As pointed out by [1], the use of additional data led to higher performance in previous challenges. We therefore supplemented the ISIC datasets with additional publicly available image datasets (e.g., MED-NODE [3], PH2 [6]) and images that are not publicly available. In total we used c.30,000 images.

Dataset      NV     MEL    BKL    BCC   AKIEC    DF   VASC   All classes
HAM10000   6705    1113   1099    514     327   115    142         10015
ISIC 2017  11861   1019    575     33       7     7     15         13517
SONIC       9251      0      0      0       0     0      0          9251
MSK-2       1004    334    130      3       2     0      0          1473
MSK-4        543    215    189      0       0     0      0           947
MSK-1        532    258     57     27       5     5     11           895
UDA-1        396    159      0      0       0     0      0           555
MSK-3        123     19     83      0       0     0      0           225
MSK-5          0      0    109      0       0     0      2           111
UDA-2         12     34      7      3       0     2      2            60
Total      18566   2132   1674    547     334   122    157         23532

Table 1. Summary of ISIC datasets used in the training, validation and test sets. This table reports the number of dermoscopic and clinical images by class from the ISIC Archive (as available on 15 July 2018) used in the training, validation and test sets of our algorithm.

C.750 (or c.7.5%) and c.700 (or c.7%) randomly selected images of the HAM10000 dataset [8] were used for the test and validation sets, respectively. While doing so, we controlled for lesion IDs to ensure that no images from the same lesion were distributed across the training, validation and test sets. In order to more robustly estimate the performance of our model on unseen data, cross-validation was employed by selecting the images for the validation set multiple times, thereby using all the different parts of the HAM10000 training set as validation sets.
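As an illustration, such a lesion-grouped split could look like the following sketch (Python with pandas and scikit-learn is our assumption; the column names image_id, lesion_id and dx follow the published HAM10000 metadata file):

# Sketch of a lesion-grouped cross-validation split (assumed tooling;
# column names match the published HAM10000 metadata file).
import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_csv("HAM10000_metadata.csv")

# GroupKFold keeps all images of one lesion_id in the same fold, so no
# lesion is split across the training and validation sets.
gkf = GroupKFold(n_splits=10)
for fold, (train_idx, val_idx) in enumerate(
        gkf.split(meta, meta["dx"], groups=meta["lesion_id"])):
    train_ids = meta.iloc[train_idx]["image_id"]
    val_ids = meta.iloc[val_idx]["image_id"]
    # ... train one model per fold and evaluate on val_ids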
A number of architectures with pretrained ImageNet weights are available, but finding a good configuration to fine-tune them to the specific task is a very time-consuming exercise. We therefore only explored a subset of these architectures for this task (e.g., VGG, Inception v4, ResNet-152 and NASNet), of which NASNet showed the best results.
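The manuscript does not specify the framework or training configuration; the following is a minimal fine-tuning sketch under our own assumptions (Keras, NASNetLarge at its default 331x331 input, seven output classes):

# Sketch: fine-tuning an ImageNet-pretrained NASNet for the seven
# lesion classes. Framework and hyperparameters are our assumptions.
from keras.applications.nasnet import NASNetLarge
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

base = NASNetLarge(weights="imagenet", include_top=False,
                   input_shape=(331, 331, 3))

# Freeze the pretrained backbone for an initial warm-up phase; layers
# can be unfrozen later for full fine-tuning.
for layer in base.layers:
    layer.trainable = False

# Replace the ImageNet classifier with a 7-class softmax head.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(7, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])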
Using random rotation and flipping during training is common practice in image classification and, given the limited amount of training data, this convention was not challenged. Additional augmentation techniques tested during training and evaluation included zooming, shearing, different color constancy algorithms and image distortions. Of these, zoom and shear turned out to be of limited use.
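A sketch of such an augmentation pipeline (parameter values are illustrative assumptions, not the paper's settings):

# Sketch: training-time augmentation with random rotation and flipping
# (the techniques kept); zoom and shear shown disabled since they were
# of limited use.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=180,    # lesions have no canonical orientation
    horizontal_flip=True,
    vertical_flip=True,
    # zoom_range=0.1,      # tested, limited benefit
    # shear_range=0.2,     # tested, limited benefit
    rescale=1.0 / 255,
)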
We further used a number of techniques to address the class imbalance problem and compensate for underrepresented classes (i.e., individual classes not containing the same number of images). Specifically, we incorporated class weights (inversely proportional to the size of each class) into the training process and applied thresholding (i.e., automatically changing class predictions) in the testing phase, based on rules derived from numerous confusion matrices.
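These two measures could be sketched as follows; the weighting formula matches the description above, while the threshold mechanics and values are illustrative, not the paper's actual rules:

# Sketch: class weights inversely proportional to class size, plus an
# illustrative thresholding rule for the test phase. The concrete
# thresholds in the paper are derived from confusion matrices.
import numpy as np

# Class sizes from Table 1 (all ISIC sources combined).
classes = ["NV", "MEL", "BKL", "BCC", "AKIEC", "DF", "VASC"]
counts = np.array([18566, 2132, 1674, 547, 334, 122, 157])

# Weight each class inversely proportional to its size.
class_weight = {i: counts.sum() / (len(classes) * c)
                for i, c in enumerate(counts)}
# e.g. model.fit(..., class_weight=class_weight)

def apply_threshold(probs, class_idx, threshold):
    """Override the argmax prediction with class_idx whenever its
    predicted probability exceeds the threshold, boosting recall for
    an underrepresented class such as DF or AKIEC."""
    preds = probs.argmax(axis=1)
    preds[probs[:, class_idx] > threshold] = class_idx
    return preds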
Since every model produced slightly different results (not least because of random augmentation during training), we used simple averaging of the predictions from multiple models based on different architectures. This yielded better results than using the single best model alone, in line with the approach taken by [7] and the findings of [1].
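A minimal sketch of this prediction averaging (the uniform weighting is as described; the function name is ours):

# Sketch: unweighted averaging of class probabilities from several
# models built on different architectures, then taking the argmax.
import numpy as np

def ensemble_predict(models, images):
    # Each model returns an (n_images, 7) array of softmax outputs;
    # the ensemble prediction is their element-wise mean.
    probs = np.mean([m.predict(images) for m in models], axis=0)
    return probs.argmax(axis=1)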

The key differences between the three submissions can be summarised as follows:

1. Described approach without thresholding
2. Described approach incl. thresholding on DF and AKIEC
3. Described approach with more aggressive thresholding on DF, AKIEC, MEL, VASC and BKL

3 Results

Figure 2 shows the performance of a single model on the ISIC Challenge 2017 test set, before those images were included in the training set.

Fig. 2. Area under the curve (AUC) of a single model on the test set of the ISIC Challenge 2017 Part 3: Lesion Classification. The MEL AUC outperforms previously published results, which include ensemble approaches. The test set consisted only of BKL, MEL and NV images; the remaining categories were defaulted to zero.

After submission of our test set predictions, the final results achieved with our approach will be published on the leaderboard and be visible to all participants of the 2018 ISIC Challenge. Our score on the 2018 ISIC validation set was a balanced accuracy of 0.881. Results on our test set were of similar magnitude.
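For reference, balanced accuracy is the unweighted mean of the per-class recalls. A minimal sketch of its computation (scikit-learn is our library choice, not the paper's):

# Sketch: balanced accuracy as macro-averaged recall over the seven
# lesion classes.
from sklearn.metrics import recall_score

def balanced_accuracy(y_true, y_pred):
    # Macro-averaged recall equals balanced multi-class accuracy.
    return recall_score(y_true, y_pred, average="macro")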

References
1. Codella, N. C. F., Gutman, D., Celebi, M. E. et al. (2017). Skin Lesion Analysis Toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (ISIC). In: CoRR abs/1710.05006. arXiv: 1710.05006. URL: http://arxiv.org/abs/1710.05006
2. Esteva, A., Kuprel, B., Novoa, R. A. et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542 (7639), 115–118.
3. Giotis, I., Molders, N., Land, S. et al. (2015). MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Systems with Applications, 42, 6578–6585.
4. Haenssle, H. A., Fink, C., Schneiderbauer, R. et al. (2018). Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, 0, 1–7.
5. Marchetti, M. A., Codella, N. C. F., Dusza, S. W. et al. (2017). Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images. J Am Acad Dermatol, 78 (2), 270–277.
6. Mendonca, T., Ferreira, P. M., Marques, J. S. et al. (2013). PH2 - A dermoscopic image database for research and benchmarking. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 5437–5440.
7. Menegola, A. et al. (2017). RECOD Titans at ISIC Challenge 2017. In: ArXiv e-prints. arXiv: 1703.04819. URL: https://arxiv.org/abs/1703.04819
8. Tschandl, P., Rosendahl, C., Kittler, H. (2018). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161. doi:10.1038/sdata.2018.161
