
Informatics in Medicine Unlocked 18 (2020) 100297


Deep learning approaches to biomedical image segmentation


Intisar Rizwan I Haque a,b,*,1, Jeremiah Neubert c,2

a University of North Dakota (UND), College of Engineering and Mines, Grand Forks, ND, USA
b National University of Sciences and Technology (NUST), Islamabad, Pakistan
c University of North Dakota, College of Engineering and Mines, Grand Forks, ND, USA

A R T I C L E  I N F O

Keywords: Image segmentation; Deep learning; Machine learning; Biomedical images

A B S T R A C T

This review covers automatic segmentation of images by means of deep learning approaches in the area of medical imaging. Current developments in machine learning, particularly related to deep learning, are proving instrumental in the identification and quantification of patterns in medical images. The pivotal point of these advancements is the capability of deep learning approaches to obtain hierarchical feature representations directly from images, which in turn eliminates the need for handcrafted features. Deep learning is rapidly becoming the state of the art for medical image processing and has resulted in performance improvements in diverse clinical applications. In this review, the basics of deep learning methods are discussed along with an overview of successful implementations involving image segmentation for different medical applications. Finally, some research issues are highlighted and the need for further improvements is pointed out.

1. Introduction

Medical imaging is an essential part of today's healthcare system for performing non-invasive diagnostic procedures. It involves the creation of visual and functional representations of the interior of the human body and its organs for clinical analysis. Its different types include X-ray based methods, such as conventional X-ray, computed tomography (CT) and mammography; molecular imaging; magnetic resonance imaging (MRI); and ultrasound (US) imaging. Apart from these medical imaging techniques, clinical images are increasingly being used to diagnose various conditions, especially those related to the skin [1].

There are two components of medical imaging: 1) image formation and reconstruction, and 2) image processing and analysis [2]. Image formation involves the set of processes through which two-dimensional (2D) images of three-dimensional (3D) objects are formed, while reconstruction relies on a set of iterative algorithms to form 2D and 3D images, typically from the projection data of an object. Image processing, on the other hand, involves the use of algorithms to enhance image properties, such as noise removal, while image analysis extracts quantitative information or a set of features from the image for object identification or classification.

The ease of image acquisition due to advancing technology has led to the generation of vast amounts of high-resolution images at very low cost. This has led to significant progress in the development of biomedical image processing algorithms, which has in turn enabled automated image analysis and evaluation algorithms that extract useful information. The basic step for such automated analysis is segmentation, which sub-divides the image into visually distinct regions that have a semantic meaning for the given problem. Each of these regions typically has uniform characteristics in terms of gray level, texture or color [3]. Clear segmentation and distinguishable regions are essential for further analysis that may involve determination of homogeneity levels of texture or layer thickness [4]. At times the image may contain multiple objects of the same class, and the segmentation process may segregate regions containing the objects of the same class while ignoring other classes; this is known as instance segmentation, as opposed to semantic segmentation, in which objects of the same class are not segregated but different classes are segregated.

* Corresponding author. College of Engineering and Mines, The University of North Dakota (UND), Grand Forks, ND, USA.
E-mail addresses: intisar.rizwanihaque@und.edu (I. Rizwan I Haque), jeremiah.neubert@und.edu (J. Neubert).
1 I. Rizwan I Haque is with the College of Engineering and Mines (CEM), The University of North Dakota (UND), Grand Forks, ND 58202, USA, and on leave from the National University of Sciences and Technology (NUST), Islamabad, Pakistan.
2 J. Neubert is with the Department of Mechanical Engineering, College of Engineering and Mines, The University of North Dakota (UND), Grand Forks, ND 58202, USA.

https://doi.org/10.1016/j.imu.2020.100297
Received 9 August 2019; Received in revised form 18 January 2020; Accepted 18 January 2020
Available online 26 January 2020
2352-9148/© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

All image segmentation techniques can be grouped into three categories: 1) manual segmentation (MS), 2) semi-automatic segmentation, and 3) fully automatic segmentation [5]. MS techniques require subject experts to first determine the region of interest (ROI) and then draw precise boundaries surrounding the ROI to correctly annotate each of the image pixels. MS is necessary as it provides the ground truth labelled images for the further development of semi-automatic and fully automatic segmentation techniques. MS is time intensive and only feasible for smaller image datasets. High-resolution images may no longer have a crisp boundary (weak contrast), so slight variations in the selection of pixels on the ROI boundary can result in a large error. Another issue with manual segmentation is that it is subjective: the approach depends on the expert's knowledge and experience, and as a result it often encounters significant inter- and intra-expert variability [6].

Semi-automatic segmentation techniques involve a small level of user interaction with automated algorithms to produce accurate segmentation results [7]. The user interaction may involve selection of an approximate initial ROI, which is subsequently used to segment the entire image, or manual checking and editing of region boundaries to reduce segmentation error. Examples of semi-automatic segmentation techniques include: 1) the seeded region growing (SRG) algorithm, which iteratively merges neighborhood pixels with similar intensity based on a user-provided initial seed point [8]; 2) the level-set based active contour model, which starts with initial boundary shapes represented by contours and iteratively alters them through shrinking or expansion operations based on the implicit level of a function, and has the advantage that it requires neither prior shape knowledge nor the initial locations of the ROI [9]; and 3) localized region-based active contour techniques, which utilize region parameters to describe the foreground and background of the image using small local regions and have the advantage of handling heterogeneous textures [10].

Fully automatic segmentation techniques do not require any user interaction. However, most such techniques are based on supervised learning approaches that require training data, e.g. shape models, atlas-based segmentation approaches, random forests and deep neural networks. Both the training data and, in the case of unsupervised learning approaches, the validation data require labelled images, which are obtained through manual segmentation, thus imposing similar constraints as mentioned earlier. Additional challenges with the automated segmentation of medical images are the large variations in shape, size, texture and, in certain cases, color of the ROI between patients, along with poor contrast between regions [11]. Noise or lack of consistency in source data acquisition may also result in wide variations in the source image data, which is often the case in real applications. For this reason, most existing approaches based on clustering techniques, watershed algorithms and machine learning (ML) share the basic problem of lacking global applicability, which limits their use to a narrow range of applications. Furthermore, the human feature engineering often utilized with machine learning approaches based on support vector machines (SVM) or neural networks (NN) is time consuming, fails to handle natural data in raw form, and typically does not adapt to new information. Deep learning approaches, on the other hand, are capable of processing natural data in raw form, thus eliminating the need for handcrafted features [3]. These approaches have been used effectively for semantic segmentation of natural images and have also found applications in biomedical image segmentation [12]. The increased usage of deep learning approaches has been facilitated by faster central processing units (CPUs) and graphics processing units (GPUs) that have greatly reduced training and execution times, by access to large sets of data, and by advances in learning algorithms [13].

The remaining parts of the review are structured as follows. In Section 2, an overview of the machine learning approach to image segmentation, the deep learning approach to image segmentation, deep learning architectures, typical approaches for implementing deep learning architectures, and performance metrics used for image segmentation is provided. In Section 3, recent studies using deep learning models for different biomedical image segmentation applications are presented. In Section 4, a discussion of the studies presented in the previous sections is given. Finally, the conclusion summarizes research trends and challenges associated with deep learning-based image segmentation and recommends directions for future work.
2. Overview of Deep Learning

2.1. Machine Learning

The typical application of a machine learning based image segmentation approach is to classify the ROI, e.g. as a diseased or healthy region. The design of such an application begins with the pre-processing stage, which may involve the use of a filter to remove noise or to enhance contrast. Following the pre-processing stage, the image is segmented using a segmentation technique such as thresholding, a clustering based approach or edge-based segmentation. After segmentation, features are extracted from the ROI based on color information, texture, contrast and size. Dominant features are then determined using feature selection techniques like principal component analysis (PCA) or statistical analysis. Subsequently, the selected features are used as input to an ML classifier such as an SVM or NN. The ML classifier uses the input feature vector along with target class labels to determine the optimal boundary that separates each class [14]. Once the ML classifier has been trained, it can be used to classify new unknown data. Typical challenges include the determination of appropriate pre-processing requirements based on raw image properties, the determination of appropriate features and the length of the feature vector, and the choice of classifier, among others.
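To make this pipeline concrete, the following is a minimal Python sketch of the stages just described. The data, the mean filter, the threshold and the features are toy illustrations (not taken from any cited study), with scikit-learn standing in for the PCA and SVM components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def preprocess(img):
    # Pre-processing: simple 3x3 mean filter as a noise-removal stand-in.
    out = img.copy()
    out[1:-1, 1:-1] = sum(
        img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    return out

def segment_roi(img):
    # Segmentation: global thresholding at the mean intensity.
    return img > img.mean()

def extract_features(img, mask):
    # Hand-crafted features: intensity statistics and relative ROI size.
    roi = img[mask]
    return np.array([roi.mean(), roi.std(), roi.min(), roi.max(), mask.mean()])

# Toy dataset: class-1 images are slightly brighter than class-0 images.
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 20)
images = [rng.random((32, 32)) + 0.3 * y for y in labels]

X = np.stack([extract_features(preprocess(im), segment_roi(preprocess(im)))
              for im in images])
pca = PCA(n_components=3).fit(X)                 # feature selection/reduction
clf = SVC(kernel="rbf").fit(pca.transform(X), labels)

# Classify new unknown data with the trained pipeline.
new_img = rng.random((32, 32)) + 0.3
feats = extract_features(preprocess(new_img), segment_roi(preprocess(new_img)))
print(clf.predict(pca.transform(feats.reshape(1, -1))))
```

In practice each stage would be tuned to the imaging modality; the point of the sketch is only the division of labor between segmentation, hand-crafted feature extraction, feature selection and classification.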
2.2. Deep Learning-based Classifier (DLC)

A DLC can process the raw image directly, which means there should be no need for pre-processing, segmentation and feature extraction. However, most deep learning approaches require image resizing due to limits on the input size, and some techniques do require intensity normalization and contrast enhancement, which may be avoided if the data augmentation techniques discussed later are employed during training. As a result, a DLC can achieve higher classification accuracy, as it avoids the errors associated with an erroneous feature vector or imprecise segmentation [14]. The comparison of the ML and DLC approaches is shown in Fig. 1. DLC based approaches have shifted the focus of research from traditional image processing techniques for feature engineering to network architecture design for obtaining optimal results. DLC networks typically have multiple hidden layers, which means that more mathematical operations are performed compared to ML based approaches, and thus the models are more computationally intensive.

Fig. 1. Change in classifier approach using typical machine learning algorithm and deep learning. Figure adapted from Ref. [14].

As seen from Fig. 1, the machine learning classifier takes the feature vector as input and outputs the object class, while the deep learning classifier takes in the image itself and outputs the object class. It may be noted that, theoretically, deep learning can be regarded as an enhancement of conventional artificial neural networks (ANN), as it consists of more layers than an ANN [13]. It is considered a type of representational learning, as each layer transforms the input data from the previous layer into a new representation at a higher, somewhat more abstract level [3]. This allows the model to learn both local relationships and inter-relationships of the whole data in a hierarchical structure. The transformation of data into a representation in each layer of a deep learning model is the result of a non-linear function. Usually, features extracted by the first layer of representation for a given image will identify the presence or absence of edges in specific orientations and locations in the image. The second layer detects patterns by recognizing the positioning of edges while ignoring minor differences in these positions, and the third layer associates these patterns into bigger combinations corresponding to fragments of similar objects, enabling succeeding layers to detect objects through these combinations [2]. This hierarchical feature representation, learned directly from the data, has led to the unprecedented success of deep learning in a range of artificial intelligence applications [13].

2.3. Deep Learning Architecture – Convolutional Neural Network (CNN)

Among the many deep learning architectures, the CNN is the most widely used, as it is very similar to the conventional NN. As opposed to a typical NN, shown in Fig. 2(a), a CNN takes an image as input and has a three-dimensional arrangement of neurons that connect to a small region of the preceding layer instead of the entire layer, as shown in Fig. 2(b). The CNN comprises layers that include convolutional layers, non-linear activation layers such as the rectified linear unit (ReLU) layer, pooling layers and fully connected layers. The convolutional layer applies a convolution operation between the pixels of the input image and a filter to obtain volumes of feature maps containing the features extracted by the filter. ReLU is a non-linear activation layer that applies the function f(x) = max(0, x) to the input values to increase non-linearity and improve the training speed. The pooling layer down-samples the input values to reduce the spatial dimensionality of the image, improving computational cost and preventing overfitting; pooling is translation invariant as its computations are based on neighboring pixels [15]. A fully connected layer is typically the last layer of a CNN and is like the hidden layers of a traditional NN in the sense that all neurons in this layer are linked to the neurons in the preceding layer. As mentioned earlier, the CNN is typically used for classification problems. In order to use a CNN for semantic segmentation, the input image is subdivided into small patches of equal size, the CNN classifies the center pixel of each patch, and the patch is then slid forward to classify the next center pixel. However, such an approach is inefficient, as the overlapping features of the sliding patches are not re-used, and the spatial information of the image is lost as the features move into the final fully connected network layers. To overcome this problem, the fully convolutional network (FCN) was proposed, in which the final fully connected layers of the CNN are changed to transposed convolutional layers, as shown in Fig. 2(c), which apply up-sampling to the low-resolution feature maps to recover the original spatial dimensions while simultaneously performing semantic segmentation [11].

Fig. 2. (a) A 2-layer neural network (one hidden layer of 4 neurons and one output layer with 3 neurons) and three inputs, (b) Convolutional Neural Network (CNN), and (c) Fully Convolutional Network (FCN).

Typically, deep neural networks are trained using the backpropagation algorithm in combination with an optimization algorithm like gradient descent. The process involves the determination of the gradient of a loss function, which is used by the optimization algorithm to update the network weights so as to minimize the value of the loss function.
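As an illustration of the FCN idea described above, the following is a minimal PyTorch sketch; the layer sizes, channel counts and two-class output are arbitrary assumptions for illustration, not the architecture of any published FCN:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional network: conv/ReLU/pool layers extract
    features, a transposed convolution up-samples back to the input
    resolution, and a 1x1 convolution maps features to per-pixel scores."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # downsample by 4 overall
        )
        self.upsample = nn.ConvTranspose2d(32, 16, kernel_size=4, stride=4)
        self.classifier = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.features(x)
        x = self.upsample(x)           # recover the original spatial size
        return self.classifier(x)      # per-pixel class scores

scores = TinyFCN()(torch.randn(1, 1, 64, 64))    # -> shape (1, 2, 64, 64)
```

Because every layer is convolutional, the same network accepts inputs of arbitrary spatial size and produces a class-score map of matching resolution, which is what removes the patch-by-patch inefficiency described above.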
2.4. Other Architectures

2.4.1. Restricted Boltzmann Machines (RBMs)
Restricted Boltzmann Machines (RBMs) are neural networks designed on the principles of energy-based models (EBMs). EBMs encode dependencies between variables by assigning a scalar energy to each configuration of the variables. Inference or prediction is made by using the values of the observed variables to determine values of the remaining variables that minimize the energy, while learning is achieved by determining an energy function that yields minimal energy for correct values of the remaining variables and larger energies for incorrect values. The loss function minimized during learning, in turn, provides a measure of the quality of the available energy functions. The RBM consists of one input layer (I1, ..., I4), one hidden layer (h1, h2), a bias vector (b1, b2) and a weight matrix (w), but does not have an output layer. A simple RBM architecture is shown in Fig. 3.

Based on the architecture shown in Fig. 3, the energy function for the RBM with inputs I_i weighted by a_i can be defined as:

E(I, h) = -\sum_i a_i I_i - \sum_j b_j h_j - \sum_{i,j} I_i h_j w_{i,j}    (1)
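The energy in Eq. (1) can be evaluated directly. The following numpy sketch mirrors the 4-input, 2-hidden-unit layout of Fig. 3 with arbitrary random parameters:

```python
import numpy as np

def rbm_energy(I, h, a, b, W):
    """Energy of a visible/hidden configuration per Eq. (1):
    E(I, h) = -sum_i a_i I_i - sum_j b_j h_j - sum_{i,j} I_i h_j w_ij."""
    return -a @ I - b @ h - I @ W @ h

# Binary states for a 4-visible / 2-hidden RBM as in Fig. 3.
rng = np.random.default_rng(0)
I = np.array([1, 0, 1, 1])        # input (visible) layer states
h = np.array([0, 1])              # hidden layer states
a = rng.normal(size=4)            # visible biases
b = rng.normal(size=2)            # hidden biases
W = rng.normal(size=(4, 2))       # weight matrix
print(rbm_energy(I, h, a, b, W))
```

Training then amounts to adjusting a, b and W so that configurations seen in the data receive low energy, as described in the next paragraph.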

Fig. 3. Restricted Boltzmann machines (RBMs).

During the training of RBMs, network parameters are determined for given inputs that minimize the energy function provided in Eq. (1). RBMs are probabilistic models: the values of the neurons in the input and hidden layers represent the state at a specific point in time, and these values indicate whether the corresponding neuron is active (state 1) or inactive (state 0).

A type of RBM achieved through stacking is known as a Deep Belief Network (DBN), in which each layer communicates with the preceding and subsequent layers. The top two layers contain undirected connections while the lower layers have directed connections, as opposed to another type of stacked RBM network known as the Deep Boltzmann Machine (DBM), which has only undirected connections. DBMs are considered to handle uncertainty better in the presence of noisy inputs.
2.4.2. Autoencoder based Deep Learning Architectures
An autoencoder neural network is an unsupervised learning algorithm that compresses the input into a latent-space representation by applying the backpropagation algorithm with target values set equal to the inputs. It comprises two parts: 1) the encoder portion of the network, which condenses the input into a latent-space representation expressed by the function h = f(x), and 2) the decoder portion, which reconstructs the input from the latent-space representation. The compression is achieved by constraining the hidden layer to have lower dimensionality than the input layer; such a network is known as undercomplete. The lower dimensionality of the hidden layer results in the network learning the most prominent features of the training data. Alternatively, a sparsity constraint can be applied to achieve similar results by keeping neurons in the hidden layer inactive most of the time. In autoencoder based deep learning approaches, the input image is downsampled to obtain a latent representation of lower dimensions, enabling the autoencoder to be trained on the condensed form of the images. The autoencoder architecture is shown in Fig. 4.

Fig. 4. Autoencoder architecture with vector and image inputs.

One of the challenges with autoencoders arises when the number of nodes in the hidden layer is greater than the number of input values. This carries the risk of the network learning the null (identity) function, in which the output simply equals the input. To resolve this issue, denoising autoencoders are used, in which the data is intentionally distorted by randomly setting around 30–50% of the input values to zero. The actual number of values set to zero depends on the data size and the number of nodes present in the network. When determining the loss function, the output is compared with the original undistorted input, thus eliminating the risk of learning the null function.
The applications of plain autoencoders as generative models are relatively limited due to discontinuities in their latent space representations. To resolve this issue, variational autoencoders were introduced. In a variational autoencoder, the output of the encoder is not a single encoded vector; rather, it outputs two encoded vectors, one a vector of means and the other a vector of standard deviations. These vectors act as the parameters of a random variable from which the output encoded vector is sampled. This allows the decoder to correctly decode the encoded values even in the presence of small variations of the same input during training. The stochastic nature of the autoencoder ensures that the latent space representation is by design continuous, thus allowing for random interpolation and sampling.
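The following PyTorch sketch illustrates the two-vector encoder output and the sampling step. All dimensions are illustrative assumptions, and the log-variance is predicted instead of the standard deviation itself, a common numerically stable parameterization:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Encoder that outputs two vectors (means and log-variances) which
    parameterize the random variable from which the latent code is sampled."""
    def __init__(self, in_dim=784, latent_dim=8):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)       # vector of means
        self.log_var = nn.Linear(128, latent_dim)  # vector of (log) variances

    def forward(self, x):
        h = self.hidden(x)
        mu, log_var = self.mu(h), self.log_var(h)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)    # reparameterization trick
        z = mu + eps * std             # sampled latent code
        return z, mu, log_var

z, mu, log_var = VAEEncoder()(torch.randn(4, 784))
```

Sampling through mu + eps * std keeps the randomness outside the learned parameters, so the encoder remains trainable by ordinary backpropagation.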

2.4.3. Sparse Coding based Deep Learning Architectures
Sparse coding is a class of unsupervised learning that determines an overcomplete set of basis vectors to represent the input data. Overcomplete means that the dimension of the latent representation is higher than that of the input. The aim is to determine a linear combination of these basis vectors corresponding to a given input. Because the network is overcomplete, an additional sparsity constraint needs to be applied to address any degeneracy. The advantage of sparse coding is that it can identify correlations between similar descriptors and capture salient properties of images [15].

2.4.4. Generative Adversarial Networks (GANs)
In GANs, the idea is to have a generator, implemented as a neural network, that models a transform function: it takes a random variable as input and, once trained, produces samples that follow the targeted distribution. Another network is simultaneously trained as a discriminator to distinguish between generated data and true data. The two networks operate as adversaries: the generator tries to maximize the final classification error between generated and true data, while the discriminator tries to minimize that same error. As a result, both networks improve after each iteration of the training process.
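A minimal sketch of one adversarial training iteration, assuming toy fully connected networks and a simple 2-D target distribution:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

real = torch.randn(32, 2) * 0.5 + 3.0   # samples from the target distribution
noise = torch.randn(32, 16)             # the random input variable

# Discriminator step: minimize classification error on real vs generated data.
fake = G(noise).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: push generated samples toward being classified as real,
# i.e. maximize the discriminator's error on generated data.
loss_g = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Repeating these two steps alternately is the adversarial game described above.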

2.4.5. Recurrent Neural Networks (RNNs)
RNNs are specifically designed to operate on series-type inputs whose size cannot be pre-determined. A series input differs from a set of independent inputs in that each value affects its neighboring values, and this relationship has to be recognized by the network. RNNs are networks whose present output is based on both the current input and learning from past values. The prior input information is part of the network and is stored in a hidden state vector, meaning the same input can produce different outputs depending on the previous inputs in the series. The network becomes recurrent when it is repeatedly applied to different values of the input series, generating a series of fixed-size output vectors; the hidden state is updated with each input. Depth can be added to RNNs by adding more hidden state layers, by adding non-linear hidden layers between the input and the hidden state layer, by adding more layers between the hidden state layer and the output layer, or by a combination of all three.
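A single recurrent update, as described above, can be sketched in a few lines of numpy (all sizes are arbitrary):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: the new hidden state depends on the current
    input and the previous hidden state, so identical inputs can yield
    different outputs depending on what came earlier in the series."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 4))        # input-to-hidden weights
W_hh = rng.normal(size=(8, 8))        # hidden-to-hidden (recurrent) weights
b_h = np.zeros(8)

h = np.zeros(8)                       # hidden state vector
for x_t in rng.normal(size=(10, 4)):  # a variable-length input series
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

The same three weight arrays are reused at every step; only the hidden state h changes, which is what makes the network recurrent rather than simply deep.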

2.5. Typical Approaches for Implementing Deep Learning Architectures

There are multiple approaches through which deep learning techniques have been implemented for image segmentation. In the first approach, the neural network is trained from scratch, which usually requires the availability of a large labelled dataset and is time-intensive, as the network must be built and trained from the ground up. In the second approach, one of the existing pre-trained CNNs can be used, such as AlexNet, which was trained to classify 1.2 million high-resolution images from the ImageNet Large Scale Visual Recognition Challenge 2010 into 1000 different classes [16]. The typical process in such an approach is to remove the last few layers of the network and replace them with new task-specific layers. The low-level features learned from millions of images in the early layers are merged with the task-specific features extracted in the final layers to implement the network for classification of new images. This provides the advantage of reduced implementation time, as only a small number of weights need to be found. Such transfer learning is typically employed with networks trained on data from ImageNet and performs better than random initialization of the weights [17].

The third approach involves the use of pre-trained CNNs for extracting features from the data and subsequently employing those features as inputs for training a traditional classifier, such as a support vector machine. The advantage of this approach is that features can be extracted automatically for large amounts of categorical data, thus eliminating the time-consuming need for human-engineered feature extraction.
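As an illustration of the second (transfer learning) approach, the following sketch assumes the torchvision implementation of AlexNet (recent torchvision API) and a hypothetical three-class target task:

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet and replace its final classification
# layer with a new task-specific head (3 target classes is an arbitrary
# example for illustration).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, 3)

# Optionally freeze the pre-trained feature extractor so that only the
# small number of new weights needs to be found during fine-tuning.
for p in net.features.parameters():
    p.requires_grad = False
```

Only the new head (and optionally the last few layers) is then trained on the task-specific dataset, which is what yields the reduced implementation time described above.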
Some of the well-known convolutional neural networks include U-Net, which was developed for biomedical image segmentation [18], and V-Net, which was developed for volumetric medical image segmentation [19]. U-Net is a type of FCN with a contraction path and an expansion path. The contraction path consists of consecutive convolutional layers and max-pooling layers and is used to extract features while limiting the size of the feature maps. The expansion path performs up-conversion and has convolutional layers to recover the size of the segmentation map, though with loss of localization information. Skip connections are used to share localization information from the contraction path to the expansion path. These are parallel connections allowing signals to propagate directly from one block of the network to another without adding any computational complexity. Finally, the convolutional layer before the output maps the feature vector to the required number of target classes in the final segmentation output. V-Net is similar to U-Net and consists of two parts: 1) a compression part and 2) a decompression part. The compression part consists of multiple stages, with each stage consisting of one to three convolutional layers. At each stage, a residual function is learned, with convolution operations performed on volumetric data based on voxels. The compression path reduces the resolution by half through convolution, similar to a pooling layer; the pooling layer itself is not used, in order to reduce memory utilization. The Parametric Rectified Linear Unit (PReLU), a learnable generalization of ReLU and Leaky ReLU, is used as the non-linear activation function. The decompression part of the network expands the spatial support of the feature maps to produce sufficient information for volumetric segmentation. Deconvolution is used to increase the size of the inputs, and residual functions are learned as in the compression part of the network. The convolutional layer before the output produces two feature map outputs of the same size as the input volume; these contain the predicted foreground and background region information. Skip connections are used in a similar manner to U-Net to forward localization information from the compression part of the network to the decompression part.
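The following PyTorch sketch illustrates the U-Net pattern described above at a single resolution level. The channel counts and depth are illustrative assumptions, far shallower than the published U-Net [18]:

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """One-level U-Net-style sketch: a contraction step, an expansion step,
    and a skip connection that forwards localization information from the
    contracting path to the expanding path by channel concatenation."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # up-conversion
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)  # map features to classes

    def forward(self, x):
        e = self.enc(x)                    # contraction-path features
        b = self.bottleneck(self.down(e))
        u = self.up(b)
        u = torch.cat([u, e], dim=1)       # skip connection
        return self.head(self.dec(u))

out = MiniUNet()(torch.randn(1, 1, 64, 64))   # -> shape (1, 2, 64, 64)
```

The concatenation in the forward pass is the skip connection: it reinjects high-resolution localization information that the pooling step would otherwise discard.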
Table 1
Definition of the abbreviations.

Category                Actual Disease          Actual No Disease
Predicted Disease       True Positive (TP)      False Positive (FP)
Predicted No Disease    False Negative (FN)     True Negative (TN)

2.6. Performance Metrics

The efficacy of an image segmentation system is evaluated using standard and well-known metrics, which enables comparison of the system with existing techniques in the literature. Selection of an appropriate evaluation metric depends on many factors related to the functionality of the system. These metrics may measure computational complexity, processing time, memory utilization and accuracy, among others [17]. Various performance metrics that can be used to assess the segmentation efficacy of deep learning models in terms of accuracy are provided below. TP, FP, FN, and TN are defined in Table 1 above.

2.6.1. Accuracy

\text{Accuracy} = \frac{\text{Correctly Predicted Pixels}}{\text{Total Number of Image Pixels}} = \frac{TP + TN}{TP + FP + FN + TN}    (2)

Accuracy, as defined above, represents the percentage of image pixels that are classified correctly. It is also known as overall pixel accuracy. It is the most basic performance metric but has the limitation of misrepresenting image segmentation performance in the case of class imbalance. Class imbalance occurs when one segmentation class dominates the other. In such a case, the higher accuracy for the dominating class will overshadow the lower accuracy of the other class, providing biased results. Therefore, the accuracy measure is recommended for evaluating segmentation performance only when there is no class imbalance.

An alternative to the above definition of accuracy is per-class accuracy, which determines the percentage of correctly labelled pixels for each class and then takes their average [20]. This measure is useful for images where class imbalance is present; in that case, both the overall and the per-class accuracy should be considered. The limitation of average per-class accuracy is the reduced confidence in the measurement for individual classes: fewer instances of a given class may result in large variance, thereby impacting the reliability of the results.

2.6.2. Precision

\text{Precision} = \frac{\text{Correctly Predicted Disease Pixels}}{\text{Total Number of Predicted Disease Pixels}} = \frac{TP}{TP + FP}    (3)

Precision, as defined above, represents the proportion of disease pixels in the automatic segmentation result that match the ground truth disease pixels. Precision is a useful measure of segmentation performance as it is sensitive to over-segmentation: over-segmentation results in low precision scores.

2.6.3. Recall

\text{Recall} = \frac{\text{Correctly Predicted Disease Pixels}}{\text{Total Number of Actual Disease Pixels}} = \frac{TP}{TP + FN}    (4)

Recall, as defined above, represents the proportion of disease pixels in the ground truth that were correctly identified by the automatic segmentation. It is sensitive to under-segmentation, which results in low recall scores.

2.6.4. F1 Measure

F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (5)


Precision and recall can be used jointly, as high values of both measures for a given segmentation result mean that the predicted segmented regions match the ground truth both in terms of location and level of detail. The F1 measure, also known as the Boundary F1 (BF) score, is the harmonic mean of precision and recall and is useful for contour or boundary matching between the predicted and ground truth segmentations. It is equivalent to the Dice Similarity Coefficient (DSC), whose alternative definition is provided below.

2.6.5. DICE Similarity Coefficient (DSC)

\text{DICE} = \frac{2 |S_{\text{Ground Truth}} \cap S_{\text{Automated}}|}{|S_{\text{Ground Truth}}| + |S_{\text{Automated}}|} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (6)

where S stands for segmentation.

As seen from the equation, DSC considers both the false alarms and the missed values in each class and is thus superior to overall pixel accuracy. DICE is also considered superior in that it not only evaluates the number of correctly labelled pixels but also reflects the accuracy of the segmentation boundaries [20]. Additionally, DICE is often used to measure the repeatability of system performance through cross-validation.

2.6.6. Jaccard Similarity Index (JSI)

\text{JSI} = \frac{|S_{\text{Ground Truth}} \cap S_{\text{Automated}}|}{|S_{\text{Ground Truth}} \cup S_{\text{Automated}}|} = \frac{TP}{TP + FP + FN}    (7)

where S stands for segmentation.

The Jaccard Similarity Index (JSI), also known as Intersection-over-Union (IoU), is defined as the ratio of the area of overlap between the predicted and ground truth segmentations to the area of their union. JSI and DSC are monotonic in one another (positively correlated), as can be seen from the equations below:
\text{JSI} = \frac{\text{DSC}}{2 - \text{DSC}}, \quad \text{DSC} = \frac{2 \cdot \text{JSI}}{1 + \text{JSI}}    (8)

The difference between JSI and DSC, as evident from the above equations, is that JSI penalizes incorrect results more than DSC. Thus, either of these two metrics can be used for segmentation validation instead of using both.
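The pixel-counting metrics of Eqs. (2)–(8) can be computed directly from a pair of binary masks. The following numpy sketch is one straightforward implementation on synthetic masks:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-wise metrics from binary predicted and ground truth masks,
    following Eqs. (2)-(7)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
    }

rng = np.random.default_rng(0)
gt = rng.random((64, 64)) > 0.5
pred = gt ^ (rng.random((64, 64)) > 0.9)   # ground truth with ~10% errors
m = segmentation_metrics(pred, gt)

# Sanity check of the monotonic relation in Eq. (8).
assert abs(m["jaccard"] - m["dice"] / (2 - m["dice"])) < 1e-12
```

The assertion makes the JSI/DSC relationship of Eq. (8) concrete: both are computed from the same counts, so either one determines the other.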
2.6.7. Modified Hausdorff Distance (MHD)
Given two sets of pixels A and B, the MHD is defined as follows:

H(A, B) = \max\left\{ \max_{a \in A} \min_{b \in B} D(a, b), \; \max_{b \in B} \min_{a \in A} D(a, b) \right\}    (9)

where D(a, b) denotes the Euclidean distance between pixels a and b.

A small value of MHD indicates greater adjacency of the two point sets, suggesting superior segmentation performance [21]. Sometimes the Kth-ranked distance is used instead, based on the Hausdorff Distance (HD), as defined below:

\text{HD}(G, S) = \max\{ h_{95}(S, G), \; h_{95}(G, S) \}    (10)

where h_{95}(S, G) = K^{95}_{s \in S} \min_{g \in G} \| g - s \| denotes the 95th-ranked (percentile) value, over s \in S, of the distance to the nearest point of G, and G represents the ground truth.
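A sketch of the distance measures of Eqs. (9)–(10), using SciPy for the pairwise distances; the point sets are toy values:

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(A, B, percentile=100):
    """Symmetric Hausdorff distance between two point sets per Eq. (9);
    percentile=95 gives the h95-based variant of Eq. (10)."""
    d = cdist(A, B)                                  # pairwise Euclidean distances
    h_ab = np.percentile(d.min(axis=1), percentile)  # directed A -> B
    h_ba = np.percentile(d.min(axis=0), percentile)  # directed B -> A
    return max(h_ab, h_ba)

# Toy boundary point sets: a diagonal line and a shifted copy of it.
A = np.argwhere(np.eye(16, dtype=bool))
B = A + np.array([2, 0])
print(hausdorff(A, B), hausdorff(A, B, percentile=95))
```

Using a percentile instead of the strict maximum makes the measure robust to a few outlier boundary points, which is why the 95th-percentile variant is common in segmentation evaluation.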
2.6.8. Absolute Volume Difference (AVD)

\text{AVD}(G, S) = \frac{|V_s - V_g|}{V_g} \times 100\%    (11)

where V_s is the volume of the segmentation result and V_g represents the volume of the ground truth.

A smaller value of AVD(G, S) implies greater segmentation accuracy [22]. AVD and MHD are sensitive to point positions and can thus evaluate segmentation accuracy more effectively when exact boundary delineation needs to be determined [23]. Another advantage of MHD and AVD is that they do not penalize low-density segmentation results as strictly as other measures do [23]. Low density means that the segmentation has accurate boundary contours but contains several tiny holes.

For more detailed reading on segmentation metrics, the following articles can be consulted [17,20,23–25].

2.7. Types of Biomedical Images

There are various types of biomedical images, depending on the method of imaging. Some of the most widely used biomedical imaging techniques are described below. The list is not exhaustive, as new imaging techniques are continually being introduced to achieve better and more timely diagnoses.

2.7.1. Clinical images
Clinical images are digital images of the patient's body and are often used to document injuries, burns or skin lesions. The automatic analysis of these images may be used to track the efficacy of treatment over time. They are widely used in dermatological and cosmetic treatments to record before and after representations of the skin or anatomical structure. The most prominent application of clinical images is the detection of the skin cancer known as melanoma.

2.7.2. X-ray imaging
X-ray imaging is the most widely used imaging technique for detecting fractures and bone dislocations. The generated image is two-dimensional. The National Institutes of Health (NIH) has provided open access to 100,000 chest x-ray images with associated data and diagnoses for improving image analysis techniques [26]. Similarly, the Massachusetts Institute of Technology (MIT) has published a dataset containing more than 350,000 chest x-rays for developing machine learning models to automatically detect 14 common illnesses such as pneumonia or punctured lung [27].

2.7.3. Computed Tomography (CT)
CT refers to a computerized imaging procedure in which x-rays are directed at the patient through 360 degrees to produce detailed cross-sectional images of the internal organs, bones, soft tissue and blood vessels of the body. The images are traditionally captured in the axial or transverse plane, perpendicular to the long axis of the body. However, these images, also known as slices, can be reformatted into multiple planes and used to generate a three-dimensional image. CT is widely used to detect cancer by localizing tumors and determining their size, and it is among the most widely tackled biomedical imaging problems. The NIH has provided open access to 32,000 CT images with associated data and diagnoses for improving lesion recognition accuracy [28].

2.7.4. Magnetic Resonance Imaging (MRI)
MRI is an imaging technique used to form images of physiological processes, organs and tissues within the body using strong magnetic fields. MRI is used to image the non-bony parts or soft tissues of the body. The main difference from CT scans is that MRI does not use the ionizing radiation of x-rays. Knee and shoulder injuries can be seen with better resolution in MRI scans compared to both x-ray and CT scans. In the brain, MRI scans can distinguish between gray and white matter, which in turn helps with detecting aneurysms and tumors. For biomedical imaging researchers, the Open Access Series of Imaging Studies (OASIS) project has gathered neuro-imaging datasets containing more than 2000 MRI sessions [29].

2.7.5. Ultrasound Imaging (US)
US imaging uses high-frequency sound waves to produce visual images of internal organs, tissues and blood flow. It is the most widely used technique for monitoring the fetus during pregnancy. It is mostly used for abdominal, vascular and thyroid scans and is typically not used for imaging bones or tissues that contain air, such as the lungs. The benefit of US is that it is fast and radiation-free.

2.7.6. Optical Coherence Tomography (OCT)
OCT is a technique that employs low-coherence light to acquire micrometer-resolution two- and three-dimensional images from within biological tissue. OCT is primarily used for diagnosing eye problems by providing a cross-sectional view of the retina, enabling the physician to distinctly see each layer. This enables layer mapping and thickness measurement, which is useful for diagnosis.

2.7.7. Microscopic Images
Microscopic medical images are used to analyze the microscopic structure of tissue. The tissue to be analyzed is usually obtained through biopsy, and sections of the tissue are then dyed with staining components to reveal details at the cellular level. Counterstains are used to provide color, visibility, and contrast in the images. These images are widely used for the detection of cancer. The features usually analyzed include the shape and size of the cells and their nuclei, and the distribution of the cells in the tissue.
2.8. Data Augmentation

The performance of deep neural networks depends on the availability of sufficient data. The problem is that, in most cases, training samples are not available in the required numbers, especially in medical imaging. In order to enhance the available datasets in the absence of additional real data, various data augmentation techniques are employed to create additional training data from the existing dataset. Data augmentation techniques apply class-preserving transformations to the image data and may include: 1) shifting image pixels horizontally or vertically without any change in the overall image dimensions (image translation); 2) horizontal or vertical flipping of image pixels by reversing the rows and columns of pixels (image flipping); 3) rotation of the image by some angle between 0 and 360 degrees (image rotation); 4) varying the brightness level of the images to train the model to account for such variations in test images; and 5) randomly zooming in or out of the image, either by the addition of new boundary pixels or through interpolation. In most of these techniques, some of the existing pixels are discarded and some new pixels are added, either through nearest-neighbor fill, duplication of boundary pixels, averaging or interpolation (a brief sketch of several of these transformations is given below).

The first four techniques discussed above are known as rigid data augmentation techniques, i.e., the shape itself remains unchanged. In the fifth technique, the ratio of horizontal to vertical scaling is maintained; if the ratios differ, the image stretches more in one direction than the other (image stretching). Alternatively, if the image is stretched in only one direction, along the diagonal axis on both ends, the image is sheared (image shearing). Another technique, known as elastic deformation, can also be used; it changes the shape of the region of interest in a way equivalent to stretching under an external force, similar to how solid objects deform under external stress that is recoverable once the stress is removed. Lastly, contrast enhancement can be applied to adjust intensity variations in the image, as medical images may have been obtained from a variety of sources.
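The rigid transformations in the list above can be sketched with plain numpy. This is a simplification: np.roll wraps pixels around rather than filling the vacated boundary, and a square image is assumed so that rotation preserves the shape:

```python
import numpy as np

def augment(img, rng):
    """Rigid, class-preserving transformations from the list above:
    translation, flipping, rotation and brightness variation."""
    out = img
    out = np.roll(out, rng.integers(-3, 4), axis=1)      # horizontal translation
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)                       # horizontal flip
    out = np.rot90(out, k=rng.integers(0, 4))            # rotation (multiples of 90 deg)
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 1)     # brightness variation
    return out

rng = np.random.default_rng(0)
image = rng.random((64, 64))
batch = np.stack([augment(image, rng) for _ in range(8)])  # 8 new variants
```

For segmentation tasks, the identical spatial transform must also be applied to the ground truth mask so that the labels remain aligned with the augmented image.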
The primary purpose of these augmentation techniques is to improve the generalizability of the deep neural network while avoiding both underfitting and overfitting of the features. These techniques are usually applied automatically during the training phase of the network. Furthermore, in most cases linear transformations are sufficient and are a safer approach to use, whereas heavy augmentations may produce variations in features that are not realistic. However, the final choice will again depend largely on the type of medical image and the properties of the region of interest. Lastly, data augmentation techniques can be used to overcome a shortage of data; however, they cannot account for all the variations that can occur in real data.

3. Survey of Deep Learning based Biomedical Image Segmentation Articles

In Table 2, a survey of recent articles involving deep learning approaches to biomedical image segmentation is provided. Only articles that utilized deep learning models for biomedical image segmentation applications were selected. The table includes the article reference; the modality, which identifies the imaging technique used for image formation or acquisition; the methodology, which identifies the deep learning architecture used for segmentation; a remarks section that briefly describes the proposed approach; and, in the last column, the performance metrics used for evaluating the proposed algorithm along with brief results.

As seen from Table 2, most of these approaches are based on either CNNs or FCNs. None of these articles utilize the transfer learning approach, while one article did use a deep learning model as a feature extractor for subsequent classification with a structured support vector machine. The modality for most of these applications was CT, MRI or US, which indicates the current trend of research as well. One reason for this is the ease of availability of image datasets through various competitions or from other public sources.


Table 2
Overview of papers using deep learning techniques for biomedical image segmentation.
Reference Modality Method Remarks Performance Metrics and Results

Badea et al. Clinical Images CNN LeNet and Network in Network (NiN) models were used for Accuracy;
(2016) [30] (LeNet and classification of burn images and for performance evaluation LeNet was able to achieve an accuracy of 75.91% and 58.01%
NiN) by comparing the classification accuracy for Skin vs Burn and for classification of Skin vs Burn and Skin vs Light Burn vs
Skin vs Light Burn vs Serious Burn. Serious Burn, respectively. NiN achieved an accuracy of
55.7% for classification of Skin vs Light Burn vs Serious Burn.
Dhungel et al. X-Ray RBMs and Deep convolution and deep belief networks which are a type DICE;
(2015) [31] (Mammogram) CNN of RBM network are utilized as functions for the conditional The Dice index of the proposed approach with all potential
random field (CRF), and structured support vector machine functions was 93% using CRF and 95% using SSVM.
(SSVM). The techniques are explored for segmenting breast
masses from mammograms.
Zhou et al. CT CNN A performance analysis of a proposed segmentation Mean Accuracy, JSI;
(2018) [12] technique was performed for multiple organ detection. The proposed approach achieved a mean JSI value of 79%
and 67% for segmentation using 3D- and 2D deep CNN,
respectively. Results are averaged for 17 types of organs.
Roth et al. CT FCN (3D U- Deep 3D FCNs was used to automatically segment abdominal DICE;
(2018) [11] Net) CT to delineate the arteries, portal vein, liver, spleen, The proposed model achieved an average DICE score
stomach, gallbladder, and pancreas in each multi-organ performance of (89.3 � 6.5) % during testing.
image.
Moeskops MRI and CTA CNN Single CNN is used to perform multiple image segmentation DICE;
et al. (2016) tasks which include six tissues in MR brain images, the The proposed approach using a single CNN trained with three
[32] pectoral muscle in MR breast images, and the coronary tasks had median percentage of voxels per scan below
arteries in cardiac CTA. CNN identifies the imaging modality, 0.0005% for all tasks when labelling a class which is alien to
anatomical structure, and tissue class. the target. This means that confusion between tasks was very
low during labelling. An example of such a scenario is
labelling of cortical gray matter in breast MR.
Dou et al. 3D CT and 3D CNN A 3D fully convolutional network combined with a 3D deep JSI, DICE, Recall, Sensitivity, Specificity;
(2017) [33] MRI supervision mechanism is proposed for segmenting the liver The proposed approach for the evaluation of blood pool in the
from 3D CT scans. Additionally, the approach is used for heart segmentation task achieved JSI and DICE score of 86.5
segmenting the whole heart and great vessels from 3D MR and 92.8 respectively. This score was the best among
images. participating teams. For myocardium segmentation, the
scores were lower, with DICE and JSI values equal to 73.9% and
59.1%, respectively.
Wang et al. MRI CNN Proposed approach performs 2D segmentation of multiple DICE;
(2018) [34] organs (placenta, fetal brain, fetal lungs, maternal kidneys) The proposed Bounding box and Image-specific Fine-tuning-
from fetal MR slices, by using two organ annotations for based Segmentation (BIFSeg) achieved the best DICE score for
training. Additionally, 3D segmentation of brain tumor core the placenta, fetal brain, fetal lungs, maternal kidneys with
(without edema) and whole-brain tumor (with edema) from values of 86.41%, 90.39%, 85.35%, and 86.33% respectively.
MR sequences is performed using annotated tumor core in
only one MR sequence for training.
Havaei et al. MRI CNN An automatic brain tumor segmentation method based on DICE, Sensitivity, Specificity;
(2017) [35] multiple variations of CNN architecture was proposed. The The TwoPathCNN achieved a DICE score, sensitivity and
real data was acquired from the 2013 brain tumor specificity of 85%, 93%, and 80% respectively. Out of the
segmentation challenge (BRATS2013), which was part of the three cascade CNN architectures, the InputCascadeCNN
MICCAI conference. The dataset consisted of three subsets for performed the best with a DICE score, sensitivity and
training, testing and the leaderboard dataset for the specificity of 88%, 89%, and 87% respectively.
competition, containing 30, 10 and 25 patient subjects,
respectively.
Ngo et al. MRI RBMs A segmentation approach is proposed for the endocardial and DICE;
(2017) [36] epicardial borders of the left ventricle (LV). The The proposed approach achieved an average DICE score of
segmentation was done from all the slices of the end-diastole 90% compared with the boosted cascade detector’s score of
(ED) and end-systole (ES) cardiac phases of an MR cine study. 86% for testing data.
The ED and ES volumes were manually selected by the user.
Chen et al. 3D MRI 3D ResNet A voxel-wise residual network (VoxResNet) built with 25 DICE, the 95th-percentile of the Hausdorff distance (HD), and
(2018) [22] layers is proposed for segmentation of key brain tissues into absolute volume difference (AVD);
white matter (WM), gray matter (GM), and cerebrospinal The DICE score results obtained for the proposed model for
fluid (CSF). GM, WM, and CSF were 86.15%, 89.46%, and 84.25%,
respectively when trained on relatively small training data.
Milletari et al. MRI and US CNN and Six variants of CNN architectures were proposed and trained DICE;
(2017) [37] V-Net with patches derived from annotated medical volumes from The best DICE results for Hough-CNN was 85% compared
(FCN) MRI and transcranial US volumes illustrating respectively 26 with V-Net resulting in 71%.
areas of the basal ganglia and the midbrain. The proposed
approach used parametric rectified linear units (PReLU) as
activation functions, where PReLU(x) = x if x ≥ 0 and
PReLU(x) = αx if x < 0.
The parameter α was learned during training. Hough voting-
based strategy was used to localize the anatomy of interest
achieving high precision, despite the very high rate of
misclassified voxels.
Xu et al. 3D US CNN Proposed approach segments ultrasound images of breast Accuracy, JSI, Precision, Recall, F1measure;
(2019) [21] into four major tissues: skin, fibro glandular tissue, mass, and Above 80% performance results were obtained on these
fatty tissue using 3 orthogonal image planes; CNN indicates metrics.
the tissue class of the centered pixel within the image blocks.
Prince et al. Endoscopic OCT CNN An approach is proposed for automatically segmenting Accuracy;
(2019) [38] endoscopic OCT images using the parallel architecture of the The proposed approach had an average relative error of
trained deep neural network. prediction of about 6.0% for five layers that included stratum
corneum, epithelium, lamina propria, muscularis mucosae,
and submucosa.
Jia et al. Microscopic FCN An FCN based approach is proposed for the image to image F1measure;
(2017) [39] Images segmentation of histopathology images under deep weak The proposed algorithm was able to achieve the best F1measure
supervision. Additionally, super-pixels instead of pixels were of 83.6% which was significantly higher than existing weakly
also used which is effective in maintaining intrinsic tissue supervised algorithms.
boundaries.
Zhao et al. Microscopic Mask R- The model performs instance detection and instance Approximate Annotation Time (AT), F1measure;
(2018) [40] Images CNN segmentation on the nuclei of HL60 cells and microglia cells. The proposed approach was faster than VoxResNet with
For the C. elegans embryo dataset, because of the scarcity of comparable performance. The average segmentation
full voxel annotation, only instance detection was performed. F1measure for the proposed approach and VoxResNet was
89.27% and 90.42% while AT was 5.5 h and 22.5 h
respectively.

US – Ultrasound Images, OCT – Optical Coherence Tomography, MRI – Magnetic Resonance Imaging, CTA – Coronary Computed Tomography Angiogram.

4. Discussion

In this section, a basic review of the deep learning approaches of some of the articles in Table 2 is provided.

Roth et al. proposed using multi-level deep convolutional networks for the segmentation of the pancreas, which is known to have very high anatomical variability. The segmentation of the pancreas is important for quantifying organ volume in diabetic patients. Eighty-two contrast-enhanced abdominal CT volumes, along with the corresponding ground truths, were used in the study. Additionally, random non-rigid deformations were applied to obtain additional data instances; the degree of deformation was selected to ensure that the warped images had variations like the real data. In the proposed bottom-up approach, image patches were first labelled. For the image patch labelling, axial, coronal and sagittal views of the patches were used to obtain a per-location probability response map. This was followed by region labelling, generating super-pixel regions with high sensitivity but low precision at different spatial scales in a zoom-out approach. A CNN was used to assign to each super-pixel a probability of being pancreatic tissue. Finally, the entire organ was detected from the abdominal CT scan using both the probability response map and the CT intensity values obtained from the super-pixel regions.

Dou et al. proposed the use of a fully convolutional network, in which all layers of the CNN are either convolutional or pooling layers, for segmentation of the liver from 3D CT scans and of the whole heart and great vessels from MR images. This approach was necessary to remove the limits on the size of the image input to the network, which existed due to the fully connected layers. The replacement of fully connected layers with convolutional layers implies that the input image can have any arbitrary size, and the classification output will be spatially arranged for the entire input image. This also eliminates the redundant computations resulting from overlapping regions in the traditional patch-based approach. The 3D deep supervision mechanism extracts feature maps from the hidden layers, upscales them through connected deconvolutional layers, and uses the SoftMax function to obtain extra dense predictions. The SoftMax function normalizes the exponentials of an input vector of real numbers and converts it into a probability distribution, such that the output values lie in the interval (0, 1) and sum to 1. The classification error of these branch outputs in comparison with the ground truth was used to adjust the network parameters of the mainstream network during training.
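The SoftMax computation itself is a one-liner; a short numpy sketch:

```python
import numpy as np

def softmax(z):
    """Normalized exponentials: maps a real vector to a probability
    distribution with values in (0, 1) that sum to 1."""
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p, p.sum())               # e.g. [0.705 0.259 0.035] 1.0
```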
Milletari et al. proposed a patch-wise multi-atlas method combining a Hough-based voting technique with a CNN for segmenting regions of the deep brain in MRI and ultrasound images. The approach can locally segment structures even if they are partially visible or corrupted by artifacts. A CNN is trained to classify foreground and background regions in the patches extracted from the voxels, and each input is linked to its feature representation extracted from the second-to-last fully connected layer. The process is repeated for the entire training data, and a dataset is generated that contains 2D, 2.5D or 3D patches of foreground regions along with the features extracted earlier. 2.5D patches contain more spatial information about neighboring pixels and are typically obtained using three orthogonal patches (XY, YZ and XZ planes), with the resulting kernels still in 2D. Additionally, a vote is collected, which is a displacement vector linking the voxel to the position of the anatomy centroid; this is the same voxel from which the patch was collected. During testing, the CNN generates a label for an unknown voxel, and the corresponding feature maps for all patches labelled as foreground are extracted. Each feature map is compared with the feature maps in the database to extract the k closest neighbors based on Euclidean distance. The votes of these neighbors and the corresponding segmentation patches are then used to perform segmentation, and the process is repeated for all foreground patches. This method was superior to voxel-wise semantic segmentation with CNNs in all the parameter settings tested, required less training data, generated better segmentation contours and eliminated the requirement for post-processing.
Chen et al. proposed a voxel-based deep learning model based on the pathway was established to analyze changes in prediction accuracy due
residual network (ResNet) termed as VoxResNet. Typically, deep to visual details of the region surrounding the pixel and contextually in
learning models form feature representation in a successive manner with terms of the patch (containing the pixel) location in the brain. The ar­
level changes as low-middle-high. A greater the number of layers that chitecture was called TwoPathCNN. The second architecture was based
are in the network means more information may be learned thus on the cascading of two CNNs. The output layer of the first CNN was
improving the discrimination capability of the network. But this does concatenated to one of the layers of the second CNN. Three different
not always happen and at times performance starts to degrade at deeper locations of the second CNN were used for concatenation: 1) input layer
layers of the network. This is known as the degradation problem and (InputCascadeCNN), 2) first hidden layer in the second CNN (Local­
ResNets have been shown to overcome this problem as the residual CascadeCNN) and 3) right before the output layer of the second CNN
learning helps network optimization become easier. This is achieved by (MFCascadeCNN). These added inputs to the second CNN were made to
using skip connections. As a result, information can spread through the analyze the impact of nearby labels on the final segmentation prediction.
whole network both in forward and backward passes. In order to address The two pathways model combined with collective training of the two
volumetric data for brain segmentation from 3D magnetic resonance convolutional pathways and with two training phases provided better
(MR) images, VoxResNet was proposed that extends 2D ResNet to 3D segmentation compared to the single path architecture. The InputCas­
ResNet, and its architecture is shown in Fig. 5. As seen from Fig., the cadeCNN, as shown in Fig. 6, outperformed the technique proposed by
proposed network consists of stacked residual modules. In the VoxRes the winner of the BRATS 2013 challenge. Despite the use of two CNNs,
module, the input and transformed features are added together with skip the segmentation time for the entire brain varied between 25 s and 3
connection. To address size variations of 3D anatomical brain structures, min, which was due to the implementation of the architecture on the
4 auxiliary classifiers C1–C4 are used for fusing multi-level contextual graphical processing unit (GPU). The segmentation was performed on
information. In the first layer of the network, information from multiple the images in a slice-by-slice manner from the axial view with the model
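Returning to the MIL formulation used by Jia et al., the bag/instance relationship can be illustrated with a toy aggregator. The generalized mean below is a common smooth surrogate for the max over instances; it is an assumption for illustration, not the exact DWS-MIL objective.

```python
import torch

def bag_probability(instance_probs: torch.Tensor, r: float = 4.0) -> torch.Tensor:
    # Generalized mean over instance (pixel or super-pixel) probabilities:
    # r = 1 gives the plain mean, while large r approaches the max, so the
    # bag (image) scores high when any instance is strongly positive.
    return instance_probs.pow(r).mean().pow(1.0 / r)

scores = torch.tensor([0.05, 0.10, 0.90])  # toy super-pixel scores in one bag
print(bag_probability(scores))             # pulled toward the strongest instance
```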


Fig. 5. VoxResNet architecture for volumetric image segmentation. Figure shows batch normalization layers (BN), rectified linear units (ReLU), and convolutional
layers N with number of channels, filter size and down-sampling stride. Figure adapted from Ref. [22].
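A minimal sketch may help make the VoxRes module shown in Fig. 5 concrete. This is an illustrative pre-activation 3D residual block, not the authors' released code; the channel count and kernel size are assumptions.

```python
# Illustrative VoxRes-style module: two (BN -> ReLU -> 3D conv) stages whose
# output is added to the input via a skip connection, so information can
# flow unchanged through the block in both forward and backward passes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxResModule(nn.Module):
    def __init__(self, channels: int = 64):  # channel count is an assumption
        super().__init__()
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # skip connection: eases optimization of deep networks

# A batch of one 64-channel feature volume of size 32^3.
x = torch.randn(1, 64, 32, 32, 32)
print(VoxResModule(64)(x).shape)  # torch.Size([1, 64, 32, 32, 32])
```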

Ngo et al. used a Deep Belief Network (DBN), a model built from stacked RBMs, and combined it with an active contour model to achieve superior performance with less training data. The proposed approach utilized two separate DBNs for the segmentation of the endocardium and epicardium, achieving high accuracy despite their similar appearance. A semi-automatic approach provided better results than the fully automatic approach due to the lack of sufficient training and testing data; larger training sets are required to design more complex DBN models. The image database used was part of the MICCAI 2009 Left Ventricle (LV) segmentation challenge and contained 15 sequences (from cardiac cine magnetic resonance) each for training, testing and online evaluation. The online dataset was originally made available on the challenge day for the assessment of the segmentation algorithms submitted by the participants; however, the entire database is now available along with the results achieved by the participants.
Badea et al. used the LeNet CNN architecture, originally created in 1998 for handwritten digit recognition on bank checks, and the Network in Network (NiN) architecture for the classification of burn images. LeNet comprises seven layers, with three convolutional layers, two pooling layers, and one fully connected layer. NiN is differentiated from LeNet by the presence of a multilayer perceptron between the two main layers of a block; the multilayer perceptron acts as a nonlinear function approximator to enhance the network's abstraction capability. The burn image database comprised 611 images from 53 pediatric patients with an image resolution of 1664 × 1248 pixels; the images were manually cropped to a final size of 230 × 240 pixels. Results indicated that the simple architecture performed well for binary classification problems, but performance degraded significantly with increasing complexity.
Zhao et al. demonstrated the use of a deep learning model for instance segmentation of 3D images that generates fully annotated data using a weakly supervised method [40]. Annotation of a large number of biomedical images remains a challenge due to the extensive time and labor required. The novelty of the proposed approach is that, although it requires a bounding box for all instances, it needs only a few full voxel annotations, achieving much faster annotation with performance on par with the approach proposed by Chen et al. [22], which requires the full set of voxel annotations. The two-step approach first detects all instances using the 3D bounding-box annotations, and all detected instances are subsequently segmented. Mask R-CNN was used for the 2D instance segmentation; it can detect objects in an image and simultaneously generate a segmentation mask for each instance. The proposed technique was tested on three biomedical datasets: nuclei of HL60 cells, microglia cells (in-house), and C. elegans developing embryos.
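Because Mask R-CNN is available off the shelf, a hedged sketch of per-instance mask prediction with a pretrained torchvision model is shown below. The COCO weights and the 0.5 score threshold are generic placeholders, not the trained network or settings from Ref. [40].

```python
import torch
import torchvision

# Generic COCO-pretrained Mask R-CNN; Ref. [40] trains its own model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 512, 512)       # stand-in for a 2D microscopy slice
with torch.no_grad():
    output = model([image])[0]        # dict with boxes, labels, scores, masks

keep = output["scores"] > 0.5         # drop low-confidence detections
masks = output["masks"][keep]         # one soft mask per detected instance
print(masks.shape)                    # (num_instances, 1, 512, 512)
```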

Fig. 6. InputCascadeCNN, where 4 × 65 × 65 refers to the number of input modalities × spatial width × spatial height. Adapted from Ref. [35].
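To make the cascaded-input idea of Fig. 6 concrete, the sketch below concatenates the first CNN's class-probability maps with the raw input channels of a second CNN. Only the four input modalities and the 65 × 65 patch size come from the figure; the layer widths and five-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_modalities, n_classes = 4, 5  # 4 MR modalities; class count is assumed

first_cnn = nn.Sequential(
    nn.Conv2d(n_modalities, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, n_classes, kernel_size=1),
)
# The second CNN sees the raw modalities plus the first CNN's probability
# maps, letting it condition on preliminary nearby labels (InputCascadeCNN).
second_cnn = nn.Sequential(
    nn.Conv2d(n_modalities + n_classes, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, n_classes, kernel_size=1),
)

x = torch.randn(1, n_modalities, 65, 65)   # one 4 x 65 x 65 input patch
coarse = first_cnn(x).softmax(dim=1)       # preliminary class probabilities
refined = second_cnn(torch.cat([x, coarse], dim=1))
print(refined.shape)                       # torch.Size([1, 5, 65, 65])
```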


5. Conclusion

This review of deep learning approaches for biomedical image segmentation has highlighted some key points. All of the studies covered were based on empirical results that demonstrated the effectiveness of the proposed approach for a given application with limited datasets. The question that remains is why deep learning approaches work for a given problem; understanding the answer is an open area of research. Many researchers are working to develop novel visual approaches to help build an intuitive understanding of the feature maps obtained from the hidden layers [15,41,42]. Additionally, many researchers do not address the generalizability of the network response when the source of the data changes, that is, the impact of a change in the data acquisition device, which may alter image characteristics such as illumination or color intensity levels. This lack of generalizability will have a negative impact on network performance.

Another problem with DLC-based networks is the need for very large imaging datasets, which in turn imposes huge storage and memory requirements along with very long training times. Reducing training time while effectively handling the storage and memory requirements of large amounts of imaging data is another area of active research. The lack of sufficiently large imaging datasets has also hindered the progress of DLC-based approaches toward biomedical applications in clinical practice. Although a great deal of imaging data is held by the healthcare industry, it is not publicly shared, either because it contains protected health information or because it is considered a proprietary asset of the organization. It is therefore imperative that efforts be made to make such data openly available, whether through grand challenge competitions or through data donations, as the long-term benefits of data sharing will far outweigh any short-term gains achieved by withholding it.

Deep learning methods have enabled unprecedented performance enhancements in a diverse set of biomedical applications, ranging from automated analysis of CT scans to segmentation of skin lesions. However, much more can be done if more labelled images are made publicly available. Manual labelling of image data by experts remains a significant challenge for generating ground truths; in their absence, more focus needs to be placed on exploring unsupervised learning approaches.

Declaration of competing interest

None declared.

Acknowledgements

None declared.

References

[1] Pal A, Chaturvedi A, Garain U, Chandra A, Chatterjee R. Severity grading of psoriatic plaques using deep CNN based multi-task learning. In: 2016 23rd international conference on pattern recognition (ICPR); 2016. p. 1478–83.
[2] Wang G. A perspective on deep imaging. IEEE Access 2016;4:8914–24.
[3] Henrique Schuindt da Silva F. Deep learning for Corpus Callosum segmentation in brain magnetic resonance images. 2018.
[4] Volkenandt T, Freitag S, Rauscher M. Machine learning powered image segmentation. Microsc Microanal 2018;24(S1):520–1.
[5] Işin A, Direkoǧlu C, Şah M. Review of MRI-based brain tumor image segmentation using deep learning methods. Procedia Comput Sci 2016;102:317–24.
[6] Millioni R, Sbrignadello S, Tura A, Iori E, Murphy E, Tessari P. The inter- and intra-operator variability in manual spot segmentation and its effect on spot quantitation in two-dimensional electrophoresis analysis. Electrophoresis 2010;31(10):1739–42.
[7] Iglesias JE. Globally optimal coupled surfaces for semi-automatic segmentation of medical images. Lect Notes Comput Sci 2017;10265:610–21.
[8] Lee TCM, Fan M. Variants of seeded region growing. IET Image Process 2014;9(6):478–85.
[9] Fan J, Wang R, Li S, Zhang C. Automated cervical cell image segmentation using level set based active contour model. In: 2012 12th international conference on control, automation, robotics and vision (ICARCV); 2012. p. 877–82.
[10] Kim YJ, Lee SH, Park CM, Kim KG. Evaluation of semi-automatic segmentation methods for persistent ground glass nodules on thin-section CT scans. Healthc Inform Res 2016;22(4):305–15.
[11] Roth HR, et al. Deep learning and its application to medical image segmentation. 2018. p. 1–6.
[12] Zhou X, et al. Performance evaluation of 2D and 3D deep learning approaches for automatic segmentation of multiple organs on CT images. In: Medical imaging 2018: computer-aided diagnosis, vol. 10575; 2018. p. 83.
[13] Shen D, Wu G, Suk H-I. Deep learning in medical image analysis. Annu Rev Biomed Eng 2017;19(1):221–48.
[14] Suzuki K. Overview of deep learning in medical imaging. Radiol Phys Technol 2017;10(3):257–73.
[15] Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS. Deep learning for visual understanding: a review. Neurocomputing 2016;187:27–48.
[16] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012:1–9.
[17] Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Martinez-Gonzalez P, Garcia-Rodriguez J. A survey on deep learning techniques for image and video semantic segmentation. Appl Soft Comput 2018;70:41–65.
[18] Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. Lect Notes Comput Sci 2015;9351:234–41.
[19] Milletari F, Navab N, Ahmadi SA. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: Proc 2016 4th international conference on 3D vision (3DV); 2016. p. 565–71.
[20] Csurka G, Larlus D, Perronnin F. What is a good evaluation measure for semantic segmentation? In: Proc British machine vision conference (BMVC); 2013.
[21] Xu Y, Wang Y, Yuan J, Cheng Q, Wang X, Carson PL. Medical breast ultrasound image segmentation by machine learning. Ultrasonics 2019;91:1–9.
[22] Chen H, Dou Q, Yu L, Qin J, Heng P-A. VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images. Neuroimage 2018;170:446–55.
[23] Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imag 2015;15(1).
[24] Costa H, Foody GM, Boyd DS. Supervised methods of image segmentation accuracy assessment in land cover mapping. Remote Sens Environ 2018;205:338–51.
[25] Dubuisson M-P, Jain AK. A modified Hausdorff distance for object matching. In: Proc 12th international conference on pattern recognition, vol. 1; 1994. p. 566–8.
[26] National Institutes of Health Clinical Center. Chest X-ray (NIHCC). 2017 [Online]. Available: https://nihcc.app.box.com/v/ChestXray-NIHCC [accessed 10-Nov-2019].
[27] MIT Laboratory for Computational Physiology. MIMIC chest X-ray database (MIMIC-CXR) [Online]. Available: https://physionet.org/content/mimic-cxr/2.0.0/ [accessed 30-Nov-2019].
[28] Yan K, Wang X, Lu L, Summers RM. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J Med Imaging 2018;5(3):1.
[29] Fotenos AF, Snyder AZ, Girton LE, Morris JC, Buckner RL. Normative estimates of cross-sectional and longitudinal brain volume decline in aging and AD. Neurology 2005;64(6):1032–9.
[30] Badea MS, Felea II, Florea LM, Vertan C. The use of deep learning in image segmentation, classification and detection. 2016. p. 1–5.
[31] Dhungel N, Carneiro G, Bradley AP. Deep learning and structured prediction for the segmentation of mass in mammograms. In: Navab N, Hornegger J, Wells W, Frangi A, editors. Medical image computing and computer-assisted intervention – MICCAI 2015. Lecture notes in computer science, vol. 9349; 2015. p. 605–12.
[32] Moeskops P, et al. Deep learning for multi-task medical image segmentation in multiple modalities. Lect Notes Comput Sci 2016;9901:478–86.
[33] Dou Q, et al. 3D deeply supervised network for automated segmentation of volumetric medical images. Med Image Anal 2017;41:40–54.
[34] Wang G, et al. Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Trans Med Imag 2018;37(7):1562–73.
[35] Havaei M, et al. Brain tumor segmentation with deep neural networks. Med Image Anal 2017;35:18–31.
[36] Ngo TA, Lu Z, Carneiro G. Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Med Image Anal 2017;35:159–71.
[37] Milletari F, et al. Hough-CNN: deep learning for segmentation of deep brain regions in MRI and ultrasound. Comput Vis Image Understand 2017;164:92–102.
[38] Prince JL, et al. Parallel deep neural networks for endoscopic OCT image segmentation. Biomed Optic Express 2019;10(3):1126.
[39] Jia Z, Huang X, Chang EIC, Xu Y. Constrained deep weak supervision for histopathology image segmentation. IEEE Trans Med Imag 2017;36(11):2376–88.
[40] Zhao Z, Yang L, Zheng H, Guldner IH, Zhang S, Chen DZ. Deep learning based instance segmentation in 3D biomedical images using weak annotation. Lect Notes Comput Sci 2018;11073:352–60.
[41] Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision (ECCV); 2014. p. 818–33.
[42] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034v2 [cs.CV]; 2014. p. 1–8.
