Experiments using deep learning for dermoscopy image analysis


Cristina Nader Vasconcelos a,*, Bárbara Nader Vasconcelos b
a Departamento de Ciência da Computação, Instituto de Computação, Universidade Federal Fluminense, Niterói 24210-346, Brazil
b Serviço de Dermatologia, Hospital Universitário Pedro Ernesto (Hupe), Universidade Estadual do Rio de Janeiro, Rio de Janeiro 20551-030, Brazil
* Corresponding author. E-mail address: crisnv@ic.uff.br (C.N. Vasconcelos).

Abstract

Skin cancer is a major public health problem, as it is the most common type of cancer and represents more than half of cancers diagnosed worldwide. Early detection influences the outcome of the disease and motivates the research presented in this paper. Recent results show that deep learning based approaches learn from data and can outperform human specialists in a set of tasks when large databases are available for training. This research investigates the scenario where the amount of data available for training is small. It obtains relevant results for the ISBI 2016 melanoma classification challenge (named Skin Lesion Analysis for Melanoma Detection) while facing the peculiarities of dealing with such a small and unbalanced biological database. To do this, it explores committees of Deep Convolutional Neural Networks (DCNNs), the augmentation of the training data set by classical image processing transforms and by deformations guided by expert knowledge about the lesion axes, and it introduces a third class aiming to improve the classifiers' distinction of the region of interest of the lesion. The experiments show that the proposed approach improves the final classifier's invariance to common melanoma variations, common skin patterns and markers, and dermatoscope capturing conditions.

Keywords: Skin lesion classification; Dermoscopy image analysis; Lesion dermoscopic feature extraction and classification; Convolutional neural networks; Data augmentation; Deep learning

1. Introduction

Skin cancer is the most common type of cancer and accounts for more than half of all cancer diagnoses. There are two basic types of skin cancer: non-melanoma and melanoma, the latter being much rarer but a much more serious disease. Melanoma is a malignancy that predominates in white adults, typically on the trunk in men and on the lower limbs in women, although it also appears in other body parts. Although more common in fair-skinned people, people with darker skin are not free of the disease. Lens and Dawes [14] observe that the incidence and overall mortality rates have increased in the last decades; therefore, melanoma represents a substantial public health problem. Up to one-fifth of patients develop metastatic disease, which may even lead to death.

Fortunately, the prognosis of patients can be considered good when melanoma is detected in the early stages. Early detection and proper excision lead to a cure rate of more than 90% in patients with low-risk melanoma. Nestle and Halpern [21] pointed out that innovative early detection programs, in combination with improved diagnostic tools and new immunologic and molecular target treatments for advanced stages of the disease, may influence the outcome of the disease in the future.

Access to digital images of skin lesions annotated with metadata can be used to educate professionals in melanoma recognition, to directly aid the diagnosis of melanoma through teledermatology and clinical decision assistance, and to support the development of automated (or semi-automated) diagnosis tools. In this direction, dermatoscopy has proved to be valuable in visualizing the morphological structures of pigmented lesions for the diagnosis of melanoma. It inspires the development of various Computer Vision techniques for the diagnosis of skin cancer, such as the works of Scharcanski and Celebi [27] and of Masood and Al-Jumaily [17], to name a few.

More recently, a great effort is being made as part of the "International Skin Imaging Collaboration: Melanoma Project" (ISIC [23]) to establish an archive annotated by experts to serve as a public resource of images for teaching and as a standard for the development of automated diagnostic systems. Currently, the ISIC Archive contains around 13,000 images associated with clinical metadata and expert annotations.

In 2016, in order to push the development of dermoscopy image analysis tools for automated diagnosis of melanoma evaluated over a common standard, the ISIC project organized a challenge during the IEEE International Symposium on Biomedical Imaging (ISBI 2016) named Skin Lesion Analysis Towards Melanoma Detection [8]. The challenge provides 900 annotated images as training data for participants to engage in three components of lesion image analysis: lesion segmentation; detection and localization of features; and disease classification, which is the focus of the present work. It also provides a separate test dataset containing 379 images so that the results of different methods can be compared.


This work investigates the use of Deep Learning techniques, more specifically Deep Convolutional Neural Networks (DCNNs) trained on the ISBI challenge data set, for the problem of melanoma classification from dermoscopic images of skin lesions. CNNs are able to learn hierarchies of invariant features and classifiers from annotated datasets. They currently represent the state-of-the-art solution for classification problems over natural images. That is the case of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), organized annually by Russakovsky et al. [26] for object localization/detection and image/scene classification from images and videos at large scale, since its 2012 edition.

The development of biomedical image classifiers has its own research peculiarities. One of the most critical difficulties for the direct application of Deep Learning techniques to biomedical problems is the fact that public databases with annotations are considerably smaller than those of natural images. It is well known that CNNs usually require a large amount of training data in order to avoid overfitting. While the ISBI challenge database used for training the proposed model contains 900 images, the ILSVRC 2012 participants received 10 million annotated images. Such a huge difference in the amount of publicly available annotated data drives the experimental hypotheses raised by the present work, which are listed below.

1.1. Hypotheses

This work investigates how far a well-known CNN topology (GoogLeNet [30]) can be trained on the ISBI 2016 melanoma classification challenge data set, since it is expected that such complex models suffer from severe overfitting when the number of training samples available is too small. The present work adopted the restriction of not using external sources of images. Although the challenge rules do not impose such a restriction, it allows the observation of the proposed methods' effects in the training of deep neural networks using small data sets. It investigates the following hypotheses:

(H1) The training of a deep classifier on biomedical images, more specifically skin lesions, can benefit from the use of models pre-trained on a huge set of natural images;

(H2) CNN performance is affected by unbalanced training data sets, but it is possible to reduce the side effect of the natural imbalance existing in biomedical datasets by artificially balancing the number of benign and malignant samples;

(H3) It is possible to improve the classifier performance by introducing a new and independent class containing visual patterns obtained from the original training samples, but not directly related to the malignant or benign prognosis of the lesion;

(H4) It is possible to increase the invariance of the CNN classifier to image capturing conditions by augmenting the training data set with classical image processing operations exploring geometric and color transforms;

(H5) It is possible to increase the invariance of the CNN classifier to biological patterns by augmenting the training data set with warped images that preserve the characteristics of the classes according to specialized (dermatological) knowledge;

(H6) It is possible to improve the overall classification performance by creating committees composed of a set of CNNs trained with different variations of the database.

1.2. Contributions

The presented methodology is based on a known CNN topology and investigates how a small and unbalanced biomedical dataset can be used to train such a model. It explores: (I) data augmentation using both classical image processing transforms and specialist knowledge; (II) transfer learning, so that the CNN weights are initialized with those obtained from a huge dataset of natural images; (III) the introduction of an additional class representing the visual patterns of regions outside the lesion, to reduce their influence on the classification decision; (IV) the analysis of the impact of the data set imbalance and a method to reduce its effect on the classifier decision; and (V) diversity in the composition of committees.

The results show that it is possible to surpass the average precision obtained by the participants of the ISBI 2016 melanoma classification challenge by exploring the hypotheses presented in Section 1.1 with the presented methodology.

1.3. Organization

This paper is organized as follows. Section 2 presents related work on skin lesion analysis and the use of deep learning for biomedical image analysis. Section 3 presents the neural networks investigated, while Section 4 details the methodology proposed to train them using a small and unbalanced data set for skin lesion classification. Section 5 presents the experiments and the results. Finally, Section 6 presents the conclusions.

2. Related work

There are several methods for the analysis of biomedical images that explicitly model features describing visual patterns, guided by expert knowledge. They are usually based on Computer Vision pipelines that contain three main steps: an image preprocessing module responsible for image enhancements, such as noise removal and illumination calibration; a module responsible for the extraction, selection and description of features, which is guided by specialized knowledge and responsible for encoding the natural invariance of the problem; and a final module, during which Machine Learning techniques are applied to train a classifier on the adopted descriptors, where Support Vector Machines (SVMs) are one of the most popular choices.

The analysis of digital images of skin lesions has been investigated for a long time in the literature. Moss et al. [19] proposed an expert system based on the analysis of texture frequencies and features obtained by the Fourier transform. In the work of Chang et al. [3], a typical pipeline is presented where initially the image is cropped and preprocessed; next, 91 image features are extracted (covering lesion shape, texture, and color); and finally, features obtained on a labeled dataset are used to train an SVM. Many other studies investigate skin lesion classification under different Computer Vision concepts for feature engineering. Two detailed reviews of such approaches can be consulted in the works of Scharcanski and Celebi [27] and of Masood and Al-Jumaily [17].

The development of a classifier based on Computer Vision pipelines can be very complex and time-consuming. The feature engineering requires domain-specific knowledge that may not be known in advance. It also requires the development of models that are robust to the existing diversity, caused for instance by capturing conditions or natural intra-class variations.

Recent enthusiasm for deep learning is driven by its characteristic of providing meta-solutions in which domain-specific knowledge is learned from data. In the case of DCNNs, they are able to learn from labeled images how to adapt to the application domain, integrating what formerly represented the image processing, feature extraction, and classifier modules. The good performance of DCNNs, however, is usually associated with the availability of a large training database. DCNNs are very prone to overfitting when trained from scratch on a small dataset, such as the ISBI melanoma classification challenge investigated in this work. The following paragraphs describe the methods found in the literature to overcome such difficulties, which inspire some of the hypotheses raised in this study.


Transfer Learning is a set of strategies commonly used to prevent overfitting, which exchange knowledge between a source and a target domain in order to overcome the deficit of training samples. Different experiments indicate that the visual features produced by CNNs are very powerful and generic, so they can be transferred from one context (where more training data is available) to another. Raina et al. [24] pointed out that pre-training a CNN on a different dataset can be adopted in supervised and unsupervised training (i.e., on both labeled and unlabeled data sets). In the work of Razavian et al. [25], the authors performed a series of experiments on the features obtained by a CNN trained for object classification on the ILSVRC data set. Extracted features were used as a generic image representation for object image classification, scene recognition, fine-grained recognition, attribute detection, and image retrieval tasks applied to a diverse set of data sets. Their results strongly suggest that CNN features should be the primary candidate in most visual recognition tasks. Following their findings, the present work investigates whether such a hypothesis is valid for the skin lesion analysis problem. Additional discussions on Transfer Learning techniques and performance can be found in the work of Pan and Yang [22] and Yosinski et al. [31].

Another approach commonly adopted in Machine Learning to overcome training sample deficits and to improve the classifier generalization ability is to produce new samples from the original training set with a data augmentation scheme. In biomedical problems, increasing the training data is crucial to deal with problems arising from the existence of rare events and from unbalanced or small data sets. That is the case of the work of Ciresan et al. [5], which took first place in the 2012 Mitosis Detection in Breast Cancer Histological Images Challenge with a 13-layer CNN architecture. They created additional training samples by performing arbitrary rotations and mirroring. As detailed in the following sections, this work investigates the use of even more extensive data augmentation, encompassing geometric and color transformations, as well as image warps guided by the skin lesion geometry.

In a different line of investigation, which they call convergent learning, the work of Li et al. [15] investigates whether DNNs trained separately learn features that converge to span similar spaces. One of their findings is particularly interesting for the methodology proposed in this work: they note that some features are learned across multiple networks, but other features are not consistently learned. Moreover, combining results from multiple networks in a committee is expected to increase classifier performance by smoothing out the effects of non-optimal solutions. Ciresan et al. [6] reduced the error in MNIST handwriting recognition from 0.40% to 0.27% by combining seven CNNs. Proposals that won the ILSVRC in its most recent editions [10,13,30] also used committees.

The present work investigates the findings of Li et al. [15] by cloning each DCNN's training five times. The present methodology also raises the hypothesis that a better classifier can be obtained by combining different classifiers, and it extends the common use of committees. It investigates the combination of CNNs trained on different partitions of the dataset and also using different data augmentation schemes, suggesting that these compositions benefit from unique features and greater diversity.

Another topic in the focus of this work is the fact that training with unbalanced data sets may severely hinder the performance of a number of classifiers, such as Decision Trees, SVMs, Multi-Layer Perceptrons (MLPs) and Boosting classifiers [9,11]. They all suffer from poor generalization, especially in the classification of minority classes. Regarding the expected behavior of CNN-based classifiers (which are a special case of MLPs), this is not a topic explored in most large-scale image classification challenges, such as CIFAR-10 and CIFAR-100 [12] and the ILSVRC [26]. These classification challenges deal with cases where the number of samples available for training is balanced (or near equilibrium) among the existing classes. Despite this, there are many situations where the number of training samples per class is imbalanced. This work investigates how the unbalanced distribution influences a classifier based on a CNN, more specifically the distribution of the skin lesion benign and malignant labels.

In order to develop the proposed methodology based on machine learning, it was necessary to choose a skin lesion database for training it. There are several skin lesion databases, but most of the publicly available ones have fewer than a thousand images, making it difficult to train a CNN from scratch using them. They can be distinguished according to the type of image adopted. The two most commonly used types are clinical images (obtained by conventional photographs) and dermatoscopic images (obtained through the use of a dermatoscope).

The use of clinical images has recently received a lot of attention due to the opportunity of developing low-cost diagnostic applications for massive screening using personal cameras, such as smartphones. Their classification is extensively explored by classical methods, such as the work of Chang et al. [3], which argues that conventional digital photographs are feasible to provide useful information for CADx applications. In parallel to the present research, Esteva et al. [7] obtained dermatologist-level performance on the classification of skin lesions using a single DCNN, trained on an annotated dataset of 129,450 clinical images. The database used is not publicly available, but their results attest to the potential of a CNN based approach capable of classifying skin cancer with a level of competence comparable to dermatologists.

The use of a dermatoscope facilitates melanoma diagnosis. It allows a clear inspection of lesions, as it cancels the skin surface reflections. In addition, since it has a magnifier (typically 10x), it allows the visualization of clinical dermoscopic features (such as Milia-like cysts, streaks and pigment networks), which are recognized by dermatologists and inform their decision to biopsy suspicious skin lesions.

The ISBI challenge for melanoma classification uses dermatoscopic images. They are publicly available and provide a common criterion for comparing different methodologies for lesion classification. Gutman et al. [8] published the numerical results of the ISBI challenge, but did not include the description of the methods that produced the presented statistics. As far as the authors of this article are aware, the methods with public descriptions to date are detailed below.

Majtner et al. [16] propose an SVM ensemble over the combination of hand-crafted features (RSurf and Local Binary Patterns) with deep learning features obtained from training an AlexNet [12] on the ISBI dataset.

Menegola et al. [18] trained a VGGNet [29] using transfer learning (the training was initialized with weights from the ImageNet dataset) and data augmentation using classical geometric transformations (scaling, rotations, and flipping). An extra source of images was used by Menegola et al. [18] to train and test the target models, containing both clinical and dermatoscopic images for more than a thousand cases.

The challenge team called CUMED [32] focused on training a deeper network. They obtained the highest accuracy on the challenge submission date using the CNN architecture considered the current state of the art for image classification: a Residual Network [10]. They propose a framework where the lesion is initially pre-segmented, so that a cropped version of the image is used for classification.


Although the cited works are also based on DCNNs, the use of transfer learning, and data augmentation by geometric transforms, the present work investigates additional strategies to improve the classifier performance. Compared to the present work, none of them directly addressed the database imbalance, nor did they use specialized knowledge and color transforms in their data augmentation strategies, nor did they investigate the influence of visual patterns from outside the skin lesion, nor did they address the diversification of the classifiers composing committees. The proposed methodology is presented next.

3. Convolutional neural networks (CNNs)

CNNs are biologically inspired variants of Multi-Layer Perceptrons (MLPs) specially designed for learning visual features. Unlike Computer Vision pipelines, they learn domain-specific features automatically from training image data sets, without the need for feature engineering. The CNN architecture is characterized by alternating convolutional and pooling layers and by the organization of their neurons in a grid.

A convolutional layer implements a set of linear filters and learns its weights from training data. Each of its neurons observes a small region of the corresponding input, simulating a local receptive field. A neuron shares its weights with its neighbors at the same depth of the grid, so that together they form a certain convolution filter. In the vicinity along the depth dimension, it observes the same receptive field as its neighbors, but they encode different filters.

Pooling layers are responsible for downsampling along the spatial dimensions of the feature map grids, adding local translation invariance to the classifier. Together with nonlinear activation functions, they add nonlinearity to the overall network decision function. They usually encode a fixed transformation that does not depend on training (i.e., they do not increase the number of learned parameters).

In the original CNN architectures, fully connected layers are added to learn a classifier over the feature hierarchy. As the name implies, in these layers each neuron is connected to all the neurons in the previous layer, just as in classic MLPs.

The development of a new CNN architecture is not the objective of this work, but rather to evaluate the extent to which CNN based solutions can be applied to the melanoma classification problem, given the peculiarities of the ISBI data set. A well-known topology, GoogLeNet [30], was chosen to implement the proposed methodology in order to clarify the contributions in a replicable framework. It is described next.

3.1. GoogLeNet

GoogLeNet is the ILSVRC 2014 image classification winner, developed by Szegedy et al. [30]. Their main contribution is the Inception module, which drastically reduced the number of parameters, creating a deeper and wider topology. The input signal entering an Inception module is processed by filters of different sizes (including 1x1 filters, which are responsible for aggregating correlated features), whereas in previous networks the convolution kernels were typically uniform within a layer.

GoogLeNet is formed by Inception modules stacked on top of each other, leading to a 22-layer deep model when only layers with learned parameters are taken into account. Along with the convolution layers, it also has nonlinear transformations such as ReLU and pooling layers. Differing from previous CNNs, the GoogLeNet architecture replaces the fully connected layers with sparse ones and spreads the classifier decision across the network (even within the convolutions).

In order to keep the network deep and combat the vanishing gradient problem, two auxiliary output branches connected to intermediate layers of the network are included during training. Their loss is added to the total network loss with a smaller weight. The auxiliary layers are discarded at inference time.

4. Methodology

This section describes the foundations behind the experimental hypotheses raised by this work.

4.1. Transfer learning

A well-known technique to prevent overfitting in CNN classifiers is to initially train them on a large auxiliary dataset and then transfer the obtained knowledge to the target domain. More specifically, the weights learned from this initial training are used as the seed for the parameters in a second round of training. This time the training is performed on the target database, aiming to fine-tune the solution to its real purpose.

The success of transfer learning depends primarily on the size and variety of the auxiliary dataset and on its similarity to the target dataset. The experiments evaluate fine-tuning starting from the weights obtained by training GoogLeNet on the ILSVRC data set for natural object classification. This choice is motivated by the size and extraordinary diversity of the images in this database.

The weights from the ILSVRC are expected to be very generic in the early layers of the CNN, as they define basic visual features such as oriented edges and color contrast filters. The open question is whether the CNN-based skin lesion classifier benefits from them, since a biological data set may or may not be similar enough to natural images to share features (or to ease their learning), especially in the middle and upper layers.

4.2. Imbalanced dataset

Studies on machine learning from unbalanced data sets state that their side effect on the classifier should be considered in relation to the complexity and nature of the specific task [9,11,20]. Even a large imbalance in the number of samples available per class may not cause problems in cases of well-separated classes or when there is a large set of training data.

The present case study does not meet these conditions. The training set is small: it contains 900 images, distributed into 727 benign lesions and 173 malignant lesions (approximately 4:1). The present research assumes that it is not a problem of easily separated classes, based on the methodology adopted by specialized analysis (dermatologists). They presume the existence of cases that cannot be diagnosed solely by the observation of images of the lesion (neither clinical nor dermoscopic); the diagnosis of these cases occurs after the biopsy of the lesions. For these reasons, the proposed method investigates the impact of the imbalanced training set on the CNN classifier and also how to minimize this effect.

Existing methods to reduce the effect of an imbalanced data set operate at both the algorithm level and the training data set level [4,20]. In the first case, the learning algorithms include a misclassification weight associated with each class, adjust the decision threshold, or even modify the existing algorithm to be more sensitive to rare classes.

This proposal follows the second group of methodologies, based on balancing the training set by some sampling strategy. Balancing through a strategy based on majority-class sub-sampling (i.e., discarding majority-class samples) is not appropriate for our case study, since the total number of samples available is already small for training a deep CNN. Hence, the proposed method investigates two strategies for oversampling the minority class: one based on adding replicas of its samples, and one based on the artificial synthesis of new samples.
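The replication-based strategy can be stated in a few lines. The sketch below is only an illustration under the assumption that training samples are handled as (image path, label) pairs; it is not the authors' implementation:

import random

def balance_by_replication(samples, minority_label, seed=0):
    # samples: list of (image_path, label) pairs.
    # Oversamples the minority class by adding exact replicas of its entries
    # until both classes contain the same number of training samples.
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    replicas = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + replicas
    rng.shuffle(balanced)
    return balanced

# e.g., balanced = balance_by_replication(train_pairs, minority_label="malignant")
# turns the 727/173 split described above into a 727/727 training list.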


4.3. Data augmentation by image processing techniques

This section and the next (4.4) describe how new training samples were synthesized. They were included in the training data set for multiple purposes: to deal with the data set imbalance (as justified above); to increase the classifier invariance to common visual patterns observed in dermatoscopy; to avoid overfitting when training a DCNN model; and to increase the diversity among the trained classifiers.

4.3.1. Geometric transformations

The classifier should not alter its results according to the orientation, scale and position of the lesion within the image. Thus, copies of each training image were rotated (by multiples of 90°), flipped and cropped to create new examples associated with the same label as the original sample.

Random cropping is very common in CNN training, as it is expected to increase the invariance to image translations and scaling. Unlike the free cropping used over natural images, the proposed method enforces the rule that the entire lesion must be preserved in the new image produced, as lesion boundaries are analyzed by the decision criteria of dermatologists. Therefore, the random crop is applied in a controlled manner, respecting a restricted area. The lesion bounding box is obtained by processing its segmentation mask (available for each training image in the ISBI data set). A fixed margin is added around the bounding box to define the protected area (Fig. 1). Finally, the lesion's original aspect ratio is enforced, as its proportion is also used by specialists in their melanoma/non-melanoma analysis.

Fig. 1. Cropping maintaining lesion aspect ratio and border integrity.
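A minimal sketch of these geometric augmentations is given below. It assumes NumPy arrays (image of shape H x W x 3, binary mask of shape H x W), a np.random.default_rng() generator and an illustrative margin value; it is a sketch of the rules described above, not the authors' code:

import numpy as np

def lesion_bbox(mask, margin=20):
    # Bounding box of the lesion in the segmentation mask, enlarged by a
    # fixed protective margin (in pixels) to define the protected area.
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    return (max(ys.min() - margin, 0), min(ys.max() + margin, h - 1),
            max(xs.min() - margin, 0), min(xs.max() + margin, w - 1))

def controlled_crop(image, mask, rng, margin=20):
    # Random crop that always contains the protected area and keeps the
    # aspect ratio of the original image.
    h, w = mask.shape
    y0, y1, x0, x1 = lesion_bbox(mask, margin)
    min_ch = max(y1 - y0 + 1, int(np.ceil((x1 - x0 + 1) * h / w)))
    ch = int(rng.integers(min_ch, h + 1))          # crop height
    cw = int(round(ch * w / h))                    # crop width, same h:w ratio
    top = int(rng.integers(max(y1 - ch + 1, 0), min(y0, h - ch) + 1))
    left = int(rng.integers(max(x1 - cw + 1, 0), min(x0, w - cw) + 1))
    return image[top:top + ch, left:left + cw]

def geometric_variants(image, mask, rng):
    # Rotations by multiples of 90 degrees, flips and one controlled crop.
    variants = [np.rot90(image, k) for k in (1, 2, 3)]
    variants += [np.fliplr(image), np.flipud(image)]
    variants.append(controlled_crop(image, mask, rng))
    return variants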
4.3.2. Color transformations

Augmenting the color variety of an image data set can be achieved by different techniques. However, in biological data sets it is critical that the synthesized colors correspond to plausible instances of the subject portrayed.

Dermoscopic images present a bias in their color distribution, covering a very specific subset of the color spectrum. It corresponds to the biological variations of skin and lesions observed in relation to the color distribution of the light source. The ISBI database images in particular contain two lighting patterns, corresponding to whitish and bluish dermatoscope lights.

A color synthesis based on the main axes of the database color distribution was applied to follow this bias and create plausible images. The axes were obtained by Principal Component Analysis (PCA). The new images were produced by adding vectors to the original image colors. The vectors were obtained as the principal components weighted by their corresponding eigenvalues and multiplied by a random factor. The random factors were sampled from a Gaussian distribution with mean μ = 0.0 and standard deviation σ = 0.2. This σ was found by validating the appearance of the synthesized images with specialists, who described images obtained with higher values of σ as unnatural. Samples created are illustrated in Fig. 2.

Fig. 2. PCA-based color processing, validated by specialists.
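This PCA-based color synthesis is closely related to the PCA color jitter used by Krizhevsky et al. [13]. A sketch, under the assumption of float RGB images in [0, 1] and a np.random.default_rng() generator, is shown below:

import numpy as np

def fit_color_pca(images):
    # Principal axes of the RGB distribution estimated over training images
    # (each image is an H x W x 3 float array in [0, 1]).
    pixels = np.concatenate([im.reshape(-1, 3) for im in images], axis=0)
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def pca_color_jitter(image, eigvals, eigvecs, rng, sigma=0.2):
    # Adds a PCA-aligned color offset to every pixel: principal components
    # weighted by their eigenvalues and by Gaussian factors with sigma = 0.2,
    # the value reported above as validated by the specialists.
    alphas = rng.normal(0.0, sigma, size=3)
    offset = eigvecs @ (alphas * eigvals)
    return np.clip(image + offset, 0.0, 1.0)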


4.4. Data augmentation based on specialist knowledge

The analysis of skin lesions by specialists is done by observing the symmetries and patterns around their main orthogonal axes. There is no requirement for domain-specific prior knowledge or hand-crafted features in order to produce CNN-based classifiers, as they learn directly from training data. However, specialized knowledge can be used to augment the available training samples while preserving the annotation. Consequently, the proposed method preserves the symmetries (or anti-symmetries) of the lesions in the artificial images produced, since they bear on the final classification of the lesion.

An additional data augmentation scheme was included to synthesize variations of lesions usually found in different individuals, or appearing with the development of the lesion. It is based on the distortion of the lesion's main axis sizes, preserving its symmetries. To obtain the lesion axes, the segmentation mask is processed so that the lesion region is approximated by an ellipse (Fig. 3(b)). The main axes of the ellipse are used to control the image distortion applied, by increasing or reducing their sizes by a random factor while keeping their directions.

A warped image is created by employing a thin plate spline mapping [28] between nine points of the original image and their positions in the new image. The four corners of the image are held fixed and included as anchor points. The center of the ellipse approximates the lesion center and is translated at random. The four end points of the ellipse axes are shifted along their original axes by random factors, altering the axis sizes (positively or negatively) but maintaining their original directions (Fig. 3).

Fig. 3. Distortion of the original image keeping the main axes of the lesion.
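The text does not specify how the ellipse is fitted; the sketch below makes the assumption that it is derived from the second-order moments of the mask, and it only builds the nine source/target control-point pairs. The point pairs can then be fed to any thin plate spline implementation (for example scipy.interpolate.RBFInterpolator with a thin-plate-spline kernel, or OpenCV's shape module) to warp the image; the scale and shift ranges are illustrative:

import numpy as np

def lesion_ellipse(mask):
    # Approximates the lesion region with an ellipse: returns the centre and
    # the two principal (semi-)axes as 2D vectors, from the mask moments.
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centre = pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts - centre, rowvar=False))
    axes = [2.0 * np.sqrt(eigvals[i]) * eigvecs[:, i] for i in (0, 1)]
    return centre, axes

def axis_distortion_points(mask, rng, max_scale=0.2, max_shift=10.0):
    # Nine (source, target) pairs for the thin plate spline warp: the image
    # corners stay fixed, the ellipse centre is translated at random, and the
    # four axis end points are moved along their own directions.
    h, w = mask.shape
    centre, (a1, a2) = lesion_ellipse(mask)
    corners = [(0.0, 0.0), (w - 1.0, 0.0), (0.0, h - 1.0), (w - 1.0, h - 1.0)]
    src = [np.array(c) for c in corners]
    dst = [np.array(c) for c in corners]
    new_centre = centre + rng.uniform(-max_shift, max_shift, size=2)
    src.append(centre)
    dst.append(new_centre)
    for axis in (a1, -a1, a2, -a2):
        factor = 1.0 + rng.uniform(-max_scale, max_scale)
        src.append(centre + axis)
        dst.append(new_centre + factor * axis)   # same direction, new length
    return np.array(src), np.array(dst)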
4.5. Visual patterns from outside the lesion

There are visual patterns in skin images that are not directly related to the classification of the lesion but represent natural variations in the area surrounding the lesion. Some of these occur as rare cases or even as a single sample in the training data set, but are recurrent in dermatoscopic images. Fig. 4 illustrates some of these patterns found in the ISBI dataset, presenting variations of the image and dermatoscope borders, colored stickers, hair, pen markers and rulers. None of these is directly associated with the malignant or benign prognosis and, therefore, should not influence the classifier's decision.

Fig. 4. Visual patterns that should not interfere in the prognosis.

The flexibility of deep CNNs allows them to learn observed patterns as inherent characteristics of classes. Patterns that are not directly related to the skin lesion, but which do not occur in a distributed manner among the different classes, may introduce an undesired bias in the learned classifier.

One approach to overcome this effect, following the CV pipelines, would be to clean the image in a preprocessing step during both training and testing, dealing with each pattern by modeling it individually (for instance, the removal of visible hair). Another approach would be to synthesize new images of both labels containing the most common out-of-lesion patterns. An example would be the use of hair simulation to amplify the originally existing variations in the base and produce new samples associated with both types of labels (malignant and benign). In practice, however, the synthesis of those patterns is not trivial, considering the need for a high degree of realism. Neither approach met the proposed methodology's requirements of simplicity and easy extension to new cases. Therefore, another approach is proposed.

An extra class was included and associated with a new, independent label. The new class was populated with random crops from regions outside the lesions, captured from both benign and malignant samples (Fig. 5). For this, the crops were obtained within the area complementary to the lesion segmentation mask, maintaining the aspect ratio of the original image. The goal is to force a distinction between patterns that should be associated with the melanoma prognosis and unrelated ones, by redesigning the problem as a ternary classifier. Thus, the new class contains images of healthy skin, rulers, hair, and other patterns found in the training base, which are similar to the originals except for the lack of the lesion itself.

Fig. 5. Extra class representing out-of-lesion patterns.
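A sketch of how such out-of-lesion crops can be sampled (NumPy arrays and a np.random.default_rng() generator assumed; the relative crop size is illustrative):

import numpy as np

def outside_lesion_crop(image, mask, rng, rel_size=0.25, max_tries=100):
    # Random crop that keeps the aspect ratio of the original image and lies
    # entirely in the area complementary to the lesion segmentation mask.
    h, w = mask.shape
    ch, cw = int(h * rel_size), int(w * rel_size)
    for _ in range(max_tries):
        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))
        window = mask[top:top + ch, left:left + cw]
        if not window.any():                      # no lesion pixel inside
            return image[top:top + ch, left:left + cw]
    return None  # no lesion-free window found for this image

# Crops collected this way, from both benign and malignant images, populate
# the third ("outside the lesion") class of the ternary classifier.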
4.6. Committees

Training a DCNN using a small dataset is especially affected by a dilemma called the bias-variance tradeoff [2], which prevents supervised learning algorithms from generalizing beyond their training set. It can be seen as the choice between very flexible models with low bias and high variance (overfitting) and relatively rigid models with high bias and low variance (underfitting). Ideally, the tradeoff is solved by increasing the number of training samples to infinity. As this is not possible in practice, the methodology proposed here increases the number and diversity of training samples using data augmentation (as presented in the previous sections), but also deals with the bias-variance dilemma through the formation of committees.

In Machine Learning, a committee is responsible for combining the responses of multiple models (in this case, multiple CNNs) trained for a given problem into a single response. A better response is expected from a committee than from its constituent experts, assuming that the models' individual errors are averaged out. The use of committees can be seen as a resolution of the bias-variance dilemma when applied to networks that individually present low bias and high variance, but combined produce a low bias and low variance model.

The proposed committees were built in three steps. First, N models were designed, including the CNN topology definition, initialization choices and training data set. Then each model was trained separately, corresponding to a single response. Finally, the responses of the models were combined into a single one. Different strategies have been investigated for each step, as detailed in Section 5.6.
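The combination rule eventually adopted (Section 5.6) is a plain average of the member outputs. A minimal sketch, with a hypothetical member interface:

import numpy as np

def committee_output(member_probabilities):
    # member_probabilities: array of shape (n_models, n_samples, n_classes)
    # holding each member's class scores; the committee response is their mean.
    return np.mean(member_probabilities, axis=0)

# e.g., scores = committee_output(np.stack([predict(m, images) for m in members]))
# where predict() stands for whatever inference routine the member models expose.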


5. Experiments and results

All the experiments performed adopted the same CNN topology (GoogLeNet [30]) and fixed hyper-parameters, in order to focus exclusively on the improvements obtained by the proposed method. The networks were trained for a maximum number of epochs ranging from 30 to 50, according to the training database size, and using a learning rate fixed at 10^-4.

The experiments presented in this section followed a 5-fold cross validation. They started with a random partition of the ISBI training set into five folds of 180 images each, maintaining the melanoma/non-melanoma ratio of ≈ 0.24. Each model was trained with the images of four of these folds; the fifth was used for validation, so that its statistics were used to choose the model weights. The five combinations are identified by the corresponding validation fold number (f_i with i = 0, ..., 4).

Although accuracy is the measure most commonly adopted in the CNN literature, it is highly affected by the imbalance of the data set. It can hide what is known as the accuracy paradox, where a high accuracy rate only reflects the distribution of the underlying classes: the classifier is very likely to predict the most popular class, regardless of the data it is asked to classify. For this reason, each trained CNN was used to produce two models, by choosing the sets of weights corresponding, respectively, to the epoch with the minimum loss and to the epoch with the maximum accuracy on the validation set. The box charts included in this section present a comparison between the two.

The analysis of a single experiment covering each investigated hypothesis is unreliable, as CNN performance is affected by the random initialization of the weights and by other factors (such as the order in which samples are presented during training). Besides, the final solution can benefit from diversity, as Li et al. [15] found that some features are learned across multiple networks but other features are not. With these assumptions, five replicas of each model were trained using the same hyperparameters and input data. Thus, a total of 50 models were evaluated for each hypothesis investigated, to portray a more stable analysis of the investigated events. The performance fluctuation among the 25 different replicas of a model is represented in the columns of the box charts presented below.
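The fold construction can be reproduced with a stratified split; a sketch using scikit-learn (the seed value is arbitrary):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_partitions(labels, seed=0):
    # Splits the 900 training images into five folds of 180, preserving the
    # melanoma/non-melanoma ratio in every fold; each model trains on four
    # folds and uses the remaining one for validation.
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(np.zeros(len(labels)), labels))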

5.1. Performance measurements

The performance measurements were taken on 379 unseen images (75 melanomas and 304 non-melanomas) from the ISBI test set. They were not used during training nor for validation (i.e., for any parameter choice or model adjustment), in order to obtain a realistic estimate of the generalization error of the proposed method.

The box charts present average precision measurements, since the ISBI challenge participants were ranked based on it alone. Even though they were not used as criteria of the challenge, in practice the sensitivity and specificity are of special interest for the specialized community, as they represent, respectively, an estimate of the probability of a positive test (melanoma) given that the patient has the disease, and of the probability of a negative test (non-melanoma) given that the patient is well. Thus, results are compared with related works using average precision, accuracy, sensitivity and specificity measurements.
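The four statistics can be computed as follows (a sketch using scikit-learn; the 0.5 decision threshold is an assumption, not a value given in the text):

import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix

def challenge_metrics(y_true, melanoma_score, threshold=0.5):
    # y_true: 1 for melanoma, 0 for non-melanoma;
    # melanoma_score: classifier confidence for the melanoma class.
    y_pred = (np.asarray(melanoma_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "average_precision": average_precision_score(y_true, melanoma_score),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),   # P(negative test | patient is well)
        "sensitivity": tp / (tp + fn),   # P(positive test | patient has melanoma)
    }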
5.2. Experiments of the transfer learning hypothesis (H1)

The first experimental hypothesis investigates the adoption of a fine-tuning procedure. The best model obtained when training a CNN from scratch reached an accuracy of 83%, but this number is misleading. The observation of its accuracy per class shows a strong bias towards the benign class (98% vs. 24%). Some trained models reached an accuracy of 100% for the benign class against only 4% for the malignant class.

These numbers hide several of the questions raised by this work, but to clarify the advantage of using a pre-trained set of weights, Fig. 6 presents a comparison of the first convolutional layer filters. As shown, features learned from scratch do not converge to semantic patterns using such a small data set, but appear as random noise, as opposed to those obtained by fine-tuning. Consequently, the subsequent experiments started from GoogLeNet's public weights pre-computed on the ILSVRC 2014 data set. This fine-tuning reduced the overfitting and the training time observed in all experiments performed.

Fig. 6. First convolutional layer learned filters.
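The text does not state which framework was used for training. Purely as an illustration of the fine-tuning setup (ImageNet-initialized GoogLeNet, a replaced three-class head and a fixed learning rate of 10^-4), a sketch with torchvision follows; the optimizer choice is an assumption, not the authors' configuration:

import torch
import torch.nn as nn
from torchvision import models

def build_finetuned_googlenet(num_classes=3, learning_rate=1e-4):
    # GoogLeNet initialised with ImageNet (ILSVRC) weights; the final fully
    # connected layer is replaced for the benign / malignant / outside-lesion
    # classes and all weights are then fine-tuned on the skin lesion data.
    model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    return model, optimizer, criterion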
5.3. Experiments of the data set balancing hypothesis (H2)

Fig. 7 presents the average precision of CNNs trained with the original training data set against an oversampling strategy based on the inclusion of identical copies of the malignant class samples. All models improved their sensitivity, and most of them improved their average precision, with such a simple strategy. These results encouraged the adoption of a balanced training data set in the subsequent experiments.

Fig. 7. Average Precision of the classifiers trained over imbalanced vs. balanced datasets.

5.4. Experiments of the independent class hypothesis (H3)

Fig. 8 presents the average precision of CNNs trained on the original classes against CNNs designed with a third class representing the surroundings of the lesions. The number of test samples mistakenly assigned to the introduced third class was negligible, which indicates that the classifier could identify such an artificial class extremely well. Thus, a binary output is obtained from the ternary classifier by normalizing the benign and malignant scores by their sum. Improvements were observed in almost all models and partitions evaluated.

Fig. 8. Average Precision distributions of the binary vs. the ternary classifier.
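This normalization step can be written directly (the class index order shown is an assumption):

import numpy as np

def ternary_to_binary(scores, benign_idx=0, malignant_idx=1):
    # Collapses the ternary output to a binary decision: the benign and
    # malignant scores are renormalized by their sum and the third
    # ("outside the lesion") score is discarded.
    scores = np.asarray(scores, dtype=float)
    pair = scores[..., [benign_idx, malignant_idx]]
    pair = pair / pair.sum(axis=-1, keepdims=True)
    return pair[..., 1]   # probability assigned to the malignant class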


5.5. Experiments of the data augmentation hypotheses (H4, H5)

The box charts of Fig. 9 present the average precision of ternary classifiers in which the malignant class had its samples oversampled by the proposed data augmentation schemes. They correspond to data augmentation experiments using the image geometry, color and lesion axis distortion, and a fourth experiment with a random mix of the three. In all the experiments, the weights obtained at the training epoch with minimum loss on the validation set induced a better generalization.

The augmentation by lesion axis distortion induced the highest observed performance gains, followed by the one based on color. This should not be interpreted simply as if one of the proposed augmentation methods were significantly better than the others. By analyzing the classification of the validation images individually, it is possible to observe that the networks established different classification partitions, as if they had actually observed different features and decision criteria. This result is in line with the proposal's initial design, which seeks diversity.

Fig. 9. Average Precision distributions of the proposed data augmentations.

5.6. Experiments of the committee formation hypothesis (H6)

The analysis of hypotheses H1 to H5 was used to pre-select the set of models to be combined in different committees in the search for the final classifier. That is, the evaluated committees were composed only of those models obtained by fine-tuning, with minimum loss on the corresponding validation set, trained on balanced sets, and modeled as ternary classifiers. Fig. 10 shows the workflow proposed to train the CNNs that compose the final classifier. Diversity was introduced in the committee composition by three training strategies: (A) using the same data set, but varying the CNN initialization and training sample order; (B) using different partitions of the same dataset; and (C) using different augmentation schemes. The following committee compositions were investigated: (a) gathering all 25 models trained from a single augmentation scheme (named SDAS); (b) gathering the 5 best models, one from each validation fold partition of a single augmentation scheme (named SDAS_BM); (c) gathering all models of all augmentation schemes (named ADAS), with and without the random-combinations database; (d) gathering the best models from each validation fold of all augmentation schemes (named ADAS_BM), with and without the random-combinations database. The composition rule adopted follows the approach of Ciresan et al. [6], where the committee output is calculated as the average of the individual outputs.

Fig. 10. Block diagram of the proposed approach for training the CNNs.

Table 1 presents the results of the 12 different committees produced by the proposed methodology, together with related work statistics. It shows that the two ADAS_BM committees proposed obtained the highest average precision of all of them, and that all twelve have a better balance between specificity and sensitivity than the classifiers of the ISBI challenge participants. They outperform a ResNet-based approach [32], recognized for producing better results than GoogLeNet in a direct comparison between the two, as well as an approach using considerably more training data [18].

Table 1. Results of the proposed committees versus the first (CUMED) and second (GTDL) best results of the ISBI challenge and other published methods (statistics from [8]).

Method                      Aver. Prec.   Accur.   Spec.   Sens.
SDAS (geometric)            0.648         0.802    0.819   0.733
SDAS (color)                0.642         0.825    0.858   0.693
SDAS (lesion axis)          0.629         0.807    0.835   0.693
SDAS (random)               0.628         0.810    0.835   0.706
SDAS_BM (geometric)         0.661         0.807    0.819   0.760
SDAS_BM (color)             0.659         0.836    0.865   0.720
SDAS_BM (lesion axis)       0.660         0.807    0.842   0.666
SDAS_BM (random)            0.634         0.791    0.819   0.680
ADAS                        0.645         0.817    0.845   0.706
ADAS (without random)       0.645         0.812    0.838   0.706
ADAS_BM                     0.669         0.825    0.845   0.746
ADAS_BM (without random)    0.669         0.810    0.842   0.680
CUMED [32]                  0.637         0.855    0.941   0.507
GTDL (unpublished method)   0.619         0.813    0.872   0.573
Majtner et al. [16]         0.473         0.826    0.898   0.533
Menegola et al. [18]        0.549         0.792    0.881   0.476

5.7. Proposal limitations: experiments with clinical images

The ADAS_BM committee (with the random augmentations) trained on the ISBI data set was applied to a set of images obtained from the public DermIS and DermQuest databases by Amelard et al. [1], in order to evaluate its performance in clinical image analysis. The tests were done by applying the images directly to the CNN classifiers, without additional preprocessing steps, to be consistent with the proposed methodology.

The images in Fig. 11 show some samples in which the committee trained on the ISBI data set does not generalize well to the visual patterns present in clinical images. These results could be predicted for a number of reasons: the illumination condition variations of the clinical images (both in direction and in color spectrum), as opposed to the uniform illumination condition of the dermatoscopes; the fact that in most clinical images, lesions and skin present some level of specular glow occlusion, which is usually eliminated by the use of dermoscopy; and the unconstrained differences in resolution, scale, and acquisition equipment, whose diversity is not covered by the ISBI data set. Besides, the fact that lesions in the ISBI data set occupy most of the image area and are of sufficient resolution to observe dermatoscopic features (such as Milia-like cysts, streaks, and pigment networks) may have induced the trained networks to look for patterns that are not visible in some of the clinical images evaluated.

Fig. 11. Limitations of the training using only dermatoscopic images: it does not generalize to the illumination conditions of clinical images, the presence of specular brightness, and unconstrained differences in resolution, scale and acquisition equipment.
Please cite this article as: C.N. Vasconcelos, B.N. Vasconcelos, Experiments using deep learning for dermoscopy image analysis, Pattern
Recognition Letters (2017), https://doi.org/10.1016/j.patrec.2017.11.005
JID: PATREC
ARTICLE IN PRESS [m5G;November 29, 2017;20:57]

C.N. Vasconcelos, B.N. Vasconcelos / Pattern Recognition Letters 000 (2017) 1–9 9

The potential of a deep learning based approach in the context of clinical imaging is clear from the results obtained by the work of Esteva et al. [7]. Thus, the presented limitations do not invalidate the proposed methodology, since it can be replicated without modifications to train a classifier of clinical images using a suitable data set containing their natural variations.

6. Conclusions

Melanoma early detection programs, in combination with improved diagnostic tools, are crucial for the outcome of the disease. In parallel, Deep Learning approaches are succeeding in several tasks. The presented methodology investigates, in several aspects, how a DCNN based classifier can be employed to deal with a small and unbalanced biomedical data set.

The contributions presented include data augmentation schemes, the inclusion of a third class representing out-of-lesion patterns, and the exploration of diversity in committee formation, which, together with the balancing of the data set, led to a classifier with the best average precision published so far for the ISBI challenge in dermatoscopic image classification.

The comparison with related works strongly validates the contributions presented, since the proposed approach surpassed those using extra sources of data or a DCNN known for its capacity to achieve better results (ResNet) than the one adopted by this work (GoogLeNet-v1). In addition, the proposed methodology led to classifiers with the best balance between specificity and sensitivity among related approaches, as it increases the maximum sensitivity published so far on this dataset from 57% to 76%.

The performance gain obtained is expected to increase further by replicating the proposal using a ResNet, but this was left as future work, as well as the extension to clinical images.

Acknowledgments

The authors would like to thank NVidia for the donation of the Titan X GPU used in this research.

References

[1] R. Amelard, J. Glaister, A. Wong, D.A. Clausi, Melanoma decision support using lighting-corrected intuitive feature models, pp. 193-219.
[2] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., 2006.
[3] W.-Y. Chang, A. Huang, C.-Y. Yang, C.-H. Lee, Y.-C. Chen, T.-Y. Wu, G.-S. Chen, Computer-aided diagnosis of skin lesions using conventional digital photography: a reliability and feasibility study, PLOS ONE 8 (11) (2013) 1-9.
[4] N.V. Chawla, Data mining for imbalanced datasets: an overview, in: The Data Mining and Knowledge Discovery Handbook, 2005, pp. 853-867.
[5] D.C. Ciresan, A. Giusti, L.M. Gambardella, J. Schmidhuber, Mitosis detection in breast cancer histology images with deep neural networks, in: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 - 16th International Conference, Proceedings, Part II, 2013, pp. 411-418.
[6] D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber, Convolutional neural network committees for handwritten character classification, in: ICDAR, IEEE Computer Society, 2011, pp. 1135-1139.
[7] A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (7639) (2017) 115-118.
[8] D. Gutman, N.C.F. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, A. Halpern, Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC), CoRR (2016). abs/1605.01397.
[9] H. He, Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications, 1st ed., Wiley-IEEE Press, 2013.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385, 2015.
[11] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intell. Data Anal. 6 (5) (2002) 429-449.
[12] A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, 2009.
[13] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, 2012, pp. 1106-1114.
[14] M. Lens, M. Dawes, Global perspectives of contemporary epidemiological trends of cutaneous malignant melanoma, Brit. J. Dermatol. 150 (2) (2004) 179-185.
[15] Y. Li, J. Yosinski, J. Clune, H. Lipson, J.E. Hopcroft, Convergent learning: do different neural networks learn the same representations? CoRR (2015). abs/1511.07543.
[16] T. Majtner, S.Y. Yayilgan, J.Y. Hardeberg, Combining deep learning and hand-crafted features for skin lesion classification, in: Sixth International Conference on Image Processing Theory, Tools and Applications, IPTA 2016, 2016, pp. 1-6.
[17] A. Masood, A.A. Al-Jumaily, Computer aided diagnostic support system for skin cancer: a review of techniques and algorithms, Int. J. Biomed. Imaging 2013 (2013) 1-22.
[18] A. Menegola, M. Fornaciali, R. Pires, F.V. Bittencourt, S.A.F. de Avila, E. Valle, Knowledge transfer for melanoma screening with deep learning, in: 14th IEEE International Symposium on Biomedical Imaging, ISBI 2017, 2017, pp. 297-300.
[19] R. Moss, W.V. Stoecker, S.-J. Lin, S. Muruganandhan, K.-F. Chu, K.M. Poneleit, C.D. Mitchell, Skin cancer recognition by computer vision, Comput. Med. Imaging Graph. 13 (1989) 31-36.
[20] Y.L. Murphey, H. Guo, L.A. Feldkamp, Neural learning from unbalanced data, Appl. Intell. 21 (2) (2004) 117-128.
[21] F.O. Nestle, A. Halpern, Melanoma, in: J.L. Bolognia, J.L. Jorizzo, R.P. Rapini (Eds.), Dermatology, vol. 2, second ed., 2008, pp. 1745-1771.
[22] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345-1359.
[23] ISIC Project, Public skin lesion image archive, http://isdis.net/isic-project/public-image-archive-of-skin-lesions/. Accessed: 2016-09-01.
[24] R. Raina, A. Battle, H. Lee, B. Packer, A.Y. Ng, Self-taught learning: transfer learning from unlabeled data, in: Proceedings of the 24th International Conference on Machine Learning, ICML '07, ACM, 2007, pp. 759-766.
[25] A.S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '14, 2014, pp. 512-519.
[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge, Int. J. Comput. Vision (IJCV) 115 (3) (2015) 211-252.
[27] J. Scharcanski, M.E. Celebi, Computer Vision Techniques for the Diagnosis of Skin Cancer, Springer Publishing Company, Incorporated, 2013.
[28] R. Sibson, G. Stone, Computation of thin-plate splines, SIAM J. Sci. Stat. Comput. 12 (6) (1991) 1304-1313.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR (2014). abs/1409.1556.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, CoRR (2014). abs/1409.4842.
[31] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Advances in Neural Information Processing Systems 27, 2014, pp. 3320-3328.
[32] L. Yu, H. Chen, Q. Dou, J. Qin, P. Heng, Automated melanoma recognition in dermoscopy images via very deep residual networks, IEEE Trans. Med. Imaging 36 (4) (2017) 994-1004.
