This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2020.3028180, IEEE Journal of Biomedical and Health Informatics
ACCURATE RETINAL VESSEL SEGMENTATION IN COLOR FUNDUS IMAGES VIA FULLY ATTENTION-BASED NETWORKS
2168-2194 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University College London. Downloaded on November 02,2020 at 01:34:34 UTC from IEEE Xplore. Restrictions apply.
networks (CNNs) are applied to retinal vessel segmentation. Although these deep learning models can produce accurate segmentation masks, retinal image segmentation still poses a variety of challenges. For instance, existing deep learning methods universally segment by means of the convolution operation, whereas the feature representation of a convolutional kernel contains only local contextual information, which limits its ability to capture long-range dependencies. Therefore, this local structure of the kernels may cause incorrect semantic understanding in the continuous vessel segmentation task. Moreover, in standard CNNs, all pixels are treated equally, which might weaken semantic feature learning and introduce redundant information. In addition, CNN-based vessel segmentation models use convolutional layers with fixed receptive fields. However, vessel structures are irregular, complex, and multi-scale. Perceiving these vessel features with fixed regions could be inadequate.

To capture long-range contextual information, some methods explore long-range dependencies through multi-scale modules [14,15,16], dilated convolutions [17,18], and cascaded convolutional layers [19]. Nevertheless, these operations exclusively fuse features from different scales or receptive fields, instead of aggregating contextual information from semantically closer regions. Information from semantically correlated regions is more valuable for identifying objects of different categories. Besides, sharing the same feature-extractor weights (kernels) at all locations can only represent similar long-range dependencies for all pixels. Therefore, attention-based methods [20,21] have been proposed to adaptively aggregate contextual information from semantically correlated regions. The attention mechanism aims not only at modeling long-range dependencies regardless of spatial distance, but also at strengthening the discriminative ability of the model to reduce misclassification caused by incorrect semantic understanding. However, attention-based methods need to produce an enormous attention map that records the similarity of every pixel pair. For example, for an image with resolution H × W, the space complexity of the non-local attention operation [22] is O((H × W) × (H × W)). Any such method suffers in practice from high computational and memory costs. Although subsampling tactics [21] are used to reduce the cost, they may hurt performance.

Inspired by the above, in this paper we propose a novel network architecture based on attention mechanisms [23] and U-Net [24] for retinal vessel segmentation. Specifically, we first introduce a self-attention mechanism to model the correlation between positions and integrate local features with global contextual information. Within the self-attention mechanism, we propose a dual-direction attention block, which uses horizontal and vertical pooling operations to produce the attention map for long-range contextual information aggregation. By fusing the information of the two directions, this attention block can also collect contextual information from all pixels. Furthermore, the space complexity of the dual-direction attention block is O(H × W). The dual-direction attention block produces higher values for pixels with stronger semantic correlation. Therefore, the attention map represents a set of pixels with similar semantics. Meanwhile, the attention map is used to update the feature maps, making the network focus more on regions beneficial for our task. For the multi-scale vessels, we apply selective kernel (SK) units [25] to obtain feature maps with adaptive receptive field sizes. The SK unit is a soft attention mechanism that updates the weights of the feature maps from different kernel sizes. Thus, we can adaptively aggregate multi-scale features in a single layer instead of stacked layers. The SK unit helps the network control the multi-scale information flow by learning global contextual information.

The main contributions of this paper are listed as follows:
(1) We propose a Fully Attention-based Network (FANet) with a self-attention mechanism, which helps the network capture long-range contextual information.
(2) We design a dual-direction attention block to learn spatial dependencies more cheaply and effectively.
(3) We integrate SK units into our network to obtain different receptive field sizes and generate multi-scale features.
(4) We use contrast-limited adaptive histogram equalization (CLAHE) as the pre-processing method to enhance the texture and contrast of the retinal fundus images.
(5) We conduct extensive experiments on three datasets, DRIVE, STARE, and CHASE_DB1, and achieve state-of-the-art performance.

The rest of this paper is organized as follows. Section II briefly describes the related methods. Section III explains our method in detail. Section IV describes the experiments. Section V presents the results. Section VI discusses the proposed model. Section VII concludes our work.

II. RELATED WORK

Semantic segmentation in medical images. Many well-designed network structures based on deep convolutional neural networks show excellent performance in semantic segmentation tasks due to their rich representation capabilities. Fully Convolutional Networks (FCN) [26] were crucial to the evolution of semantic segmentation. FCN uses an encoder-decoder structure and applies fully convolutional classification networks throughout the backbone to perceive features. Another important contribution to semantic segmentation is the skip-connection, which aggregates low-level features into high-level features to recover lost details. Motivated by FCN and the skip-connection, U-Net is proposed with a U-shaped encoder-decoder architecture, which modifies and extends the structure of FCN. U-Net is widely adopted in medical image segmentation and can effectively handle multi-scale features in medical images. However, in retinal vessel segmentation, U-Net alone cannot cope with the thin and irregular retinal vessel structure, as shown in Fig. 7, even though it aggregates multi-scale contextual information. To deal with this problem, we embed the SK unit into the U-Net, using two convolutional branches with a soft attention mechanism to further generate multi-scale information.

Attention mechanism is widely used in a range of tasks such as natural language processing [23,27,28] and computer vision [20,25,29,30]. Attention mechanism increases the discriminative ability of the network by updating the weights on
Fig. 2. Architecture of our proposed network. Our network is based on U-Net, which is specifically designed for medical image segmentation tasks. The
block A represents the dual-direction attention block, aiming to model the semantic relation map. The SK unit is used to replace standard convolution for the
perception of multi-scale features.
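
To put the space-complexity figures quoted for the attention blocks into perspective, the following sketch counts attention-map entries for a given feature-map size. It is only an order-of-magnitude illustration under two assumptions: the non-local map stores one entry per pixel pair, and the dual-direction block stores on the order of one entry per pixel (constant factors are ignored). The function name `attention_map_entries` is illustrative and does not come from the paper.

```python
def attention_map_entries(height: int, width: int) -> dict:
    """Count attention-map entries for an H x W feature map.

    Assumption: non-local attention keeps one similarity per pixel pair,
    i.e. (H*W)^2 entries, while the dual-direction block is taken to need
    on the order of H*W entries (constant factors ignored).
    """
    pixels = height * width
    return {
        "pixels": pixels,
        "non_local": pixels * pixels,
        "dual_direction": pixels,
    }


if __name__ == "__main__":
    # 64 x 64 is the training patch size used in the experiments.
    counts = attention_map_entries(64, 64)
    ratio = counts["non_local"] // counts["dual_direction"]
    print(counts, "ratio:", ratio)
```

For a 64 × 64 patch this gives 4096 pixels, so the pairwise map holds about 16.8 million entries, a factor of 4096 more than a map that is linear in the number of pixels.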
III. METHODOLOGY

In this work, we develop a Fully Attention-based Network (FANet) for retinal vessel segmentation. The procedure consists of two main steps: image pre-processing and feeding the result into the network. Pre-processing strategies are used to enhance the contrast of the original images. The network structure combines the self-attention modules and SK units with the U-Net. The network architecture is shown in Fig. 2.

Fig. 3. Original retinal images (top row), images after pre-processing (middle row), and corresponding labels (bottom row) from DRIVE, STARE, and CHASE_DB1, in that order.

DRIVE contains 40 retinal images. Each image has the same resolution of 565 × 584. This dataset is divided into a training set and a test set of 20 images each. The training set has a single manual annotation for each image, and the test set has two manual segmentation masks generated by two experts; we choose the first one as the gold standard for testing.

STARE has 20 retinal images, each with a resolution of 700 × 604. The original STARE dataset is not divided for training and testing. We split the dataset in order and take the first 10 images as the training set and the remaining 10 images as the test set. This dataset is also manually annotated by two experts. We select the masks of the first observer as the ground truth.

CHASE_DB1 includes 28 retinal images, all with a resolution of 999 × 960. These images were collected from 14 children. We use the first 14 images as the training set and the last 14 as the test set. We employ the first group of annotations as the label.

Secondly, CLAHE reduces the noise amplification problem by limiting contrast enhancement. If most pixels fall into the same gray range, the histogram peaks in these areas are relatively high, and thus the slope of the local mapping function is relatively large. In this situation, low gray values (such as original background or noise) are mapped to high gray values. CLAHE reduces this noise problem by limiting contrast enhancement. We set a maximum number of pixels ($H_{max}$) for each gray value; if the number of pixels is greater than $H_{max}$, the excess is clipped. After clipping, the pixel values are distributed more uniformly throughout the histogram. To ensure that the total histogram area remains the same as the original, the whole histogram is raised by a height L. The final improved histogram is:

$$H'(i) = \begin{cases} H(i) + L, & H(i) < H_{max} \\ H_{max} + L, & H(i) \geq H_{max} \end{cases} \tag{2}$$
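
The clipping rule in (2) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the lift L equals the total clipped excess divided by the number of bins, which is one straightforward way to keep the histogram area unchanged as the text requires. The name `clip_histogram` and the toy histogram are hypothetical.

```python
def clip_histogram(hist, h_max):
    """Clip a histogram at h_max and redistribute the excess uniformly.

    Implements H'(i) = min(H(i), h_max) + L, where L is assumed to be the
    clipped excess spread evenly over all bins so the total area is kept.
    """
    excess = sum(max(count - h_max, 0) for count in hist)
    lift = excess / len(hist)  # the height L in equation (2)
    return [min(count, h_max) + lift for count in hist]


if __name__ == "__main__":
    toy_hist = [10, 2, 0, 4]  # made-up bin counts for illustration
    clipped = clip_histogram(toy_hist, h_max=4)
    print(clipped, sum(clipped) == sum(toy_hist))
```

In the toy case the single peak of 10 is cut to the limit and its excess of 6 is shared by the four bins, so every bin rises by 1.5 while the total count stays at 16.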
Fig. 5. The architecture of the SK unit. The SK unit is used for aggregating multi-scale features. $F_c$, $F_g$, and $F_{fc}$ denote convolution, global average pooling, and the fully connected operation, respectively. The diagram legend marks element-wise summation and element-wise multiplication.
tation with fixed receptive fields and fixed kernel parameters when sliding over the feature maps, whereas the feature representations of same-category pixels may differ between regions, which could result in intra-class inconsistency. It is ineffective and inefficient to aggregate contextual information from a pre-defined fixed receptive field in visual tasks [25]. To select the receptive field sizes of the neurons adaptively, we adopt the Selective Kernel (SK) unit, which employs kernels of different sizes to produce multi-scale information. We then use a gated softmax operation to fuse the information from the multi-size convolutional kernels. Moreover, the gains generated by SK units at different stages are mutually reinforcing because they can be combined sequentially to further enhance network performance. The SK unit is illustrated in Fig. 5. The calculation steps of the SK unit can be briefly formulated as:

$X \in \mathbb{R}^{C' \times H \times W}$, as shown in (2). Afterward, we use global average pooling $F_g$ in (3) to encode the global information $S \in \mathbb{R}^{C'}$, which is designed for the attention weights. Further, a fully connected layer $F_{fc}$ in (4) with Batch Normalization and ReLU is applied to obtain low-dimensional features $S'$, reducing the cost of calculation. The scaling factor is set to 2 in our work. Then, the attention vectors are computed by two 1 × 1 convolutional layers $F_c$ in (5) and a softmax operation. The attention vectors consist of two weight vectors A and B, which indicate the adaptive weights of the multi-scale information, where A + B = 1. Finally, we multiply the attention vectors with the multi-scale feature maps to aggregate multi-scale features. The final output feature maps $O \in \mathbb{R}^{C' \times H \times W}$ are calculated as $X_1' + X_2'$, as shown in (6) and (7). To train our network more effectively, we embed the SK unit into the residual blocks [19].
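
The fusion step of the SK unit (sum the two branches, pool globally, reduce, then weight each branch by a softmax so that A + B = 1) can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the authors' code: feature maps are nested lists of shape [C][H][W], Batch Normalization is omitted, the 1 × 1 convolutions on the pooled vector are written as matrix-vector products, and the weight matrices `w_fc`, `w_a`, `w_b` are hypothetical parameters supplied by the caller.

```python
import math


def global_average_pool(x):
    """x: [C][H][W] nested lists -> list with one mean per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def linear(weights, vec):
    """Matrix-vector product; weights has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sk_fuse(x1, x2, w_fc, w_a, w_b):
    """Fuse two branch outputs x1, x2 with per-channel softmax weights."""
    # Element-wise summation of the two branches.
    fused = [[[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
             for c1, c2 in zip(x1, x2)]
    s = global_average_pool(fused)                   # global descriptor S
    s_low = [max(0.0, v) for v in linear(w_fc, s)]   # reduced S' (ReLU, no BN)
    logit_a, logit_b = linear(w_a, s_low), linear(w_b, s_low)
    out, weights = [], []
    for c in range(len(x1)):
        m = max(logit_a[c], logit_b[c])              # for numerical stability
        ea, eb = math.exp(logit_a[c] - m), math.exp(logit_b[c] - m)
        a, b = ea / (ea + eb), eb / (ea + eb)        # softmax: a + b = 1
        weights.append((a, b))
        out.append([[a * p + b * q for p, q in zip(r1, r2)]
                    for r1, r2 in zip(x1[c], x2[c])])
    return out, weights


if __name__ == "__main__":
    x1 = [[[1.0]], [[2.0]]]          # two channels, 1 x 1 spatial size
    x2 = [[[3.0]], [[4.0]]]
    w_fc = [[0.5, 0.5]]              # reduces 2 channels to 1 (scaling factor 2)
    w_a = [[1.0], [0.0]]
    w_b = [[0.0], [0.0]]
    fused_out, ab = sk_fuse(x1, x2, w_fc, w_a, w_b)
    print(fused_out, ab)
```

With equal logits a channel receives the plain average of the two branches; a larger logit for one branch shifts that channel's output toward the corresponding kernel size.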
where N is the number of training images. When the network converges, we can compute the output result O from the network weights $w_{FANet}$ and the input $x_{input}$, as shown in (11). FANet(·) is the mapping function learned by the proposed FANet.

$$O = \mathrm{FANet}(w_{FANet}, x_{input}). \tag{11}$$

Algorithm 1 The algorithm of the proposed method
Input:
    x: image data;
    y: target mask;
    t: number of iterations = 10000;
    lr: learning rate;
Output:
    O: the prediction;
Step 1: Image pre-processing;
    Extract the green channel;
    Apply contrast-limited adaptive histogram equalization;
    Crop into 64 × 64 patches to obtain the input x_input;
Step 2: Network training;
    Initialize w_FANet = 0;
    for t = 0 to 10000 do
        Compute the results y' = FANet(w_FANet, x_input);
        Compute the loss J = CE(y, y');
        Compute the gradient g = ∇J;
        Update the weights w_FANet = w_FANet − lr × g;
    end for
Compute the prediction O = FANet(w_FANet, x_input).

IV. EXPERIMENTS

A. Implementation Details

We implement our network in PyTorch [41] and train it on one TITAN Xp GPU. We employ the Adaptive Moment Estimation (Adam) optimization method with momentum 0.9 and weight decay 0.0001. Following previous work [14,15], we also adopt the poly learning rate policy, where the learning rate of each iteration is calculated by

$$\mathrm{baselr} \times \left(1 - \frac{\mathrm{iter}}{\mathrm{totaliter}}\right)^{0.9}. \tag{12}$$

The baselr is set to 0.0003. We crop each retinal fundus image into 64 × 64 patches for training. The batch size is set to 32 on all three datasets. The network is trained for 10000 iterations. We use only random flipping for data augmentation.

$$SP = \frac{TN}{TN + FP}. \tag{14}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}. \tag{15}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. Given a binary segmentation map, pixels marked by the experts as the vessel class that are classified by the network as the vessel class are counted as TP, and those misclassified as the background class are counted as FN. Pixels marked by the experts as the background class that are classified by the network as the background class are counted as TN, and those misclassified as the vessel class are counted as FP. In addition, the receiver operating characteristic (ROC) curve describes the relationship between the true positive rate (SE) and the false positive rate (1-SP) under different classification thresholds. The area under the ROC curve (AUC) is also used for quality evaluation in our work.

C. Experimental Setup

1) Experiments on the effect of pre-processing strategy: We verify the effect of image pre-processing strategies in this subsection. Pre-processing of the images is a key technique for accurate segmentation of the retinal vessels. We conduct four comparative experiments to demonstrate the effectiveness of pre-processing on the DRIVE dataset. U-Net is used as the baseline architecture with stride = 8. Firstly, we use images without any pre-processing. Secondly, grayscale transformation [42], a widely used method in the retinal vessel segmentation task, is employed as the pre-processing step. Thirdly, we use green channel extraction (GE) to enhance the contrast of the original retinal images. Finally, CLAHE is used in our experiments.

2) Experiments on the effect of dual-direction attention block: We adopt the dual-direction attention block (DAB) in our proposed network to model long-range dependencies for aggregating global contextual information. In these experiments, we use U-Net (stride = 8) as the baseline architecture and pre-processed images to study the effect of the DAB on the DRIVE dataset. We design two experiments to demonstrate the impact of integrating the DAB at different locations in our network: after the encoder module (AE) and after the decoder module (AD). We also examine how many DABs at each position give the best performance.
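
The pixel-counting rules behind the evaluation metrics can be illustrated with a short sketch. It computes SP and ACC as in (14) and (15) and takes SE to be the true positive rate TP/(TP + FN), as the text describes it. The function names and the toy masks are hypothetical, and the sketch assumes both classes occur in the ground truth so that no denominator is zero.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for flat binary masks (1 = vessel, 0 = background)."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if t and p:
            tp += 1          # vessel pixel predicted as vessel
        elif t and not p:
            fn += 1          # vessel pixel predicted as background
        elif not t and p:
            fp += 1          # background pixel predicted as vessel
        else:
            tn += 1          # background pixel predicted as background
    return tp, tn, fp, fn


def segmentation_metrics(pred, truth):
    """Return sensitivity, specificity and accuracy for binary masks."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }


if __name__ == "__main__":
    prediction = [1, 1, 0, 0, 1, 0, 0, 0]   # made-up flattened masks
    annotation = [1, 0, 0, 0, 1, 1, 0, 0]
    print(segmentation_metrics(prediction, annotation))
```

Because vessel pixels are a small minority of a fundus image, ACC and SP are dominated by the background, which is why SE is reported alongside them.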
Fig. 6. Exemplar retinal vessel segmentation results of the proposed network on the DRIVE (Row 1), STARE (Row 2), and CHASE_DB1 (Row 3) datasets. From left to right: the original images, the manual annotations, the prediction maps.
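
The poly learning-rate policy in (12) can be sketched as a small function. The defaults below are the values reported in the implementation details (base rate 0.0003, 10000 iterations, power 0.9); the function name `poly_lr` is illustrative and not taken from the authors' code.

```python
def poly_lr(iteration, base_lr=0.0003, total_iter=10000, power=0.9):
    """Poly policy: base_lr * (1 - iteration / total_iter) ** power."""
    return base_lr * (1 - iteration / total_iter) ** power


if __name__ == "__main__":
    # The rate starts at base_lr and decays smoothly to zero.
    for step in (0, 2500, 5000, 7500, 10000):
        print(step, poly_lr(step))
```

With a power below 1 the schedule decays slightly more slowly than a linear ramp in the early iterations and reaches zero exactly at the final iteration.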
the feature maps in the decoder module have richer and more accurate semantic information, which is essential for DAB integration. DABs require feature maps with strongly correlated semantic representations as input. As a result, we find that the network with two DABs in each position of the decoder module shows competitive performance. The DABs strengthen the semantic association gradually and thereby achieve better results. Considering the running time and computational cost, we set the number of DABs to 2.

TABLE II
PERFORMANCES OF DUAL-DIRECTION ATTENTION BLOCK ON DRIVE DATASET. AE: AFTER THE ENCODER MODULE. AD: AFTER THE DECODER MODULE. DAB: DUAL-DIRECTION ATTENTION BLOCK.

DAB position   DAB number   SE       SP       ACC      AUC
-              -            0.8020   0.9872   0.9757   0.9846
AE             1            0.7712   0.9897   0.9756   0.9868
AD             1            0.7937   0.9888   0.9760   0.9876
AD             2            0.8093   0.9872   0.9763   0.9883

C. Result of the experiments on SK unit

To investigate the impact of the SK unit, we conduct experiments on the DRIVE dataset for the different architectures described in Section IV. The results are shown in Table III. We make the following three observations about the ablation studies of the SK units. Firstly, SK units bring performance benefits no matter which kernel is used in the multi-scale branches. This suggests that the SK unit is helpful for retinal vessel segmentation. Aggregating multi-scale contextual information can improve the identification ability of the network, especially for retinal vessel images, which are complex and multi-scale. Secondly, as the receptive field size increases, the performance improves markedly, although a further increase of the receptive field size results in a slight decrease. A larger receptive field yields richer semantic information, which facilitates image understanding. However, the location information becomes indistinct when the kernel size is too large, and the classification label for a pixel-level task must be aligned to the corresponding coordinates in the output segmentation map. Therefore, an overly large kernel may weaken retinal vessel segmentation. This also confirms that increasing the number of parameters does not always improve the model. Finally, for the same receptive field size, using dilated convolution leads to poorer performance. This is because repeatedly using dilated convolution with the same dilation rate causes the gridding effect, which loses the continuity of the feature representation, whereas continuity is important for dense pixel-level prediction tasks.

TABLE III
COMPARISON EXPERIMENTS OF DIFFERENT SETTINGS OF SK BRANCH ON DRIVE DATASET. K1: 1 × 1 CONVOLUTION. K3D2: 3 × 3 CONVOLUTION WITH DILATION 2. K5: 5 × 5 CONVOLUTION. K7: 7 × 7 CONVOLUTION.

Methods    SE       SP       ACC      AUC
baseline   0.8020   0.9872   0.9757   0.9846
K1         0.8048   0.9886   0.9762   0.9866
K3D2       0.8089   0.9883   0.9765   0.9876
K5         0.8134   0.9881   0.9767   0.9883
K7         0.8147   0.9878   0.9765   0.9877

D. Result of the experiments on module fusion

We explore the effectiveness of module fusion in this part. The results are shown in Table IV. We find that using SK units and dual-direction attention blocks separately improves the performance of the network. We also show that fusing SK units and dual-direction attention blocks refines the results and further improves the overall segmentation accuracy. Fig. 7 shows some examples of the test images, the ground truths, and the segmentation results obtained by U-Net and our proposed network. In the rectangular regions marked with red dotted lines, our proposed model tends to produce more continuous segmentation boundaries and successfully segments micro-vessels. This shows that our proposed network reduces noise in the background while preserving the irregular and multi-scale vessel structures. These visualizations further demonstrate the importance of aggregating multi-scale contextual information and capturing long-range dependencies in retinal vessel segmentation.

TABLE IV
ABLATION ANALYSIS OF MODULE FUSION ON THE DRIVE DATASET.

U-Net   SK unit   Attention   SE       SP       ACC      AUC
✓       -         -           0.8020   0.9872   0.9757   0.9846
✓       ✓         -           0.8134   0.9881   0.9767   0.9883
✓       -         ✓           0.8093   0.9872   0.9763   0.9883
✓       ✓         ✓           0.8145   0.9883   0.9769   0.9895

E. Comparison with the Existing Methods

To further assess the segmentation results generated by the proposed network, we compare the proposed FANet with numerous current state-of-the-art methods on the DRIVE, STARE, and CHASE_DB1 datasets. Table V lists the performances of these methods. As Table V shows, the proposed FANet achieves sensitivity = (0.8145/ 0.8505/ 0.8334), specificity = (0.9883/ 0.9889/ 0.9862), accuracy = (0.9769/ 0.9797/ 0.9803), and AUC = (0.9895/ 0.9924/ 0.9912) on DRIVE, STARE, and CHASE_DB1, respectively. Our proposed FANet achieves the best SE and ACC on all three datasets. The results show that our network is able to distinguish vessel pixels from background pixels effectively and accurately. In terms of specificity, D-Net [54] shows the highest SP and achieves (0.9899/ 0.9904/ 0.9894) on the three datasets, respectively. This means that D-Net classifies the background better. However, due to the highly unbalanced pixel ratio between vessels and background, SE is more important than SP in the retinal vessel segmentation task. Although D-Net is slightly higher (0.0016/ 0.0015/ 0.0032) than our method on SP,
Fig. 7. Segmentation results of different methods. From left to right: the original images, the corresponding masks, the results of U-Net, the results of the proposed FANet.
TABLE V
COMPARISON RESULTS ON DRIVE, STARE, CHASE_DB1 DATASETS.
Fig. 8. ROC curves of the proposed FANet on these three different datasets.
the proposed FANet outperforms D-Net by (0.0306/ 0.0256/ 0.0495) in terms of sensitivity. As for AUC, the proposed model achieves the best results on the DRIVE and CHASE_DB1 datasets. D-Net is only 0.0003 higher than ours in terms of AUC on STARE. Fig. 8 shows the ROC curves of the proposed FANet on the three datasets. The closer the ROC curve is to the upper left corner, the more accurate the network is. These AUC results show that the proposed network can precisely classify vessels and background. Besides, the segmentation maps in Fig. 6 show that our method is more effective at classifying thick and thin vessels, which demonstrates that our proposed FANet can decrease noise, enhance the contrast, and aggregate multi-scale contextual information to classify irregular vessel structures.

VI. DISCUSSION

In this paper, we propose a Fully Attention-based Network (FANet) for the task of automatic retinal vessel segmentation in color fundus images. Noise, low contrast, multi-scale vessels, and irregular curved vessels are apparent challenges in retinal vessel segmentation. These challenges make accurate identification and classification of retinal blood vessel pixels very difficult.

To address noise and low-contrast structures, we adopt green channel extraction and CLAHE as the pre-processing strategy. Fig. 4 and Table I suggest that these tactics can effectively enhance the contrast of retinal fundus images. In addition, we integrate the SK unit into our network to deal with the multi-scale vessel structure. The SK unit employs two convolutional branches with different kernel sizes to produce multi-scale information and then fuses them. Table III shows that SK units bring consistent performance benefits in our experiments, which indicates that the problem of multi-scale structures is handled effectively. As for irregular curved vessels, some earlier state-of-the-art methods [14,56] have paid attention to capturing complex contextual information. By enhancing the understanding of retinal fundus image semantics, the model can better complete the task of vessel segmentation. However, these methods generally capture contextual information by increasing receptive fields, which can only capture short-range or adjacent dependencies. Although the self-attention module [21,23] has been shown to capture long-range contextual information well, its calculation occupies a large amount of GPU memory and computational resources. We address this limitation by designing a lightweight dual-direction attention block. The proposed dual-direction attention block reduces the space complexity from O((H × W) × (H × W)) to O(H × W).

[...] shows that our approach has the ability to preserve details and capture thin, multi-scale, and irregular curved vessels. It also suggests that our proposed method is adequate for retinal vessel detection, which could assist doctors in disease diagnosis and reduce the workload of human experts in clinical medicine.

Although the proposed method successfully segments the complex retinal vascular structure, a small number of thin and irregular vascular structures are still not accurately classified. We may further improve our network structure by designing a more hybrid contextual semantic module to capture a more compact and discriminative feature representation. Furthermore, we also observe overfitting during the training process. This is mainly due to the small size of the retinal vessel datasets. We advocate establishing larger and more refined retinal blood vessel datasets. With the advancement of technology, it has become practical to use high-resolution fundus cameras to obtain high-resolution retinal images (as in CHASE_DB1). Such high-resolution and high-quality retinal fundus images will greatly improve the accuracy of image recognition. We encourage further research in this direction.

VII. CONCLUSION

In this paper, we propose a Fully Attention-based Network (FANet) for retinal vessel segmentation, which adaptively selects kernels of different scales and adjusts the local semantic feature representation by using attention modules. FANet is an extension of the U-Net in which the standard convolutional layers are replaced by SK units and the lightweight dual-direction attention module is integrated. Besides, we employ contrast-limited adaptive histogram equalization to enhance the contrast and suppress the original background noise of retinal fundus images. Moreover, the comparative experiments demonstrate that the proposed model captures multi-scale and long-range contextual information, which improves the intra-class discrimination ability of the model. Compared with the baseline U-Net, FANet extracts vessels in more detail. Our proposed model achieves consistently outstanding performance on the three public datasets. In the future, a more lightweight and advanced segmentation architecture could be applied to improve the results and accelerate the process of clinical diagnosis.

REFERENCES

[1] J. J. Kanski and B. Bowling, "Clinical ophthalmology: a systematic approach," Elsevier Health Sciences, 2011.
[2] S. Chaudhuri, et al. "Detection of blood vessels in retinal images using
Table II shows the proposed dual-direction attention block two-dimensional matched filters,” IEEE transactions on medical imaging,
can better perceive irregular curved vessels by associating pp. 263-269, 1989.
[3] A. A. Mendonca, and A. Campilho. ”Segmentation of retinal blood
contextual information. Experiments on the effect of module vessels by combining the detection of centerlines and morphological
fusion have demonstrated that our method can effectively solve reconstruction,” IEEE transactions on medical imaging, VOL. 25, NO. 9,
noise and low contrast problems, capture multi-scale vessels pp. 1200-1213, 2016.
[4] Y. Wang, et al. ”Retinal vessel segmentation using multiwavelet kernels
and irregular curved vessels. and multiscale hierarchical decomposition,” Pattern Recognition, pp.
Extensive experiments are carried out on three datasets 2117-2133, 2013.
to interpret the effectiveness of our proposed approach. The [5] J. Elson, et al. ”Automated extraction and analysis of retinal blood vessels
with multi scale matched filter,” in ICICICT, pp.775-779, 2017.
performance results are shown in Table V, and the results show [6] Z. Yu, and K. Sun. ”Vessel segmentation on angiogram using morphology
that our proposed approach consistently performs well. Fig. 7 driven deformable model,” in BMEI, Vol. 2, pp. 675-678, 2010.
[7] Y. Zhang, W. Hsu, and M. Lee, "Detection of retinal blood vessels based on nonlinear projections," Journal of Signal Processing Systems, vol. 55, pp. 103-112, 2009.
[8] S. Kozerke et al., "Automatic vessel segmentation using active contours in cine phase contrast flow measurements," Journal of Magnetic Resonance Imaging, vol. 10, pp. 41-51, 1999.
[9] A. Lahiri et al., "Deep neural ensemble for retinal vessel segmentation in fundus images towards achieving label-free angiography," in EMBC, pp. 1340-1343, 2016.
[10] A. Şengür et al., "A retinal vessel detection approach using convolution neural network," in IDAP, pp. 1-4, 2017.
[11] H. Fu et al., "Deepvessel: Retinal vessel segmentation via deep learning and conditional random field," in MICCAI, pp. 132-139, 2016.
[12] Z. Yan, X. Yang, and K. Cheng, "Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation," IEEE Transactions on Biomedical Engineering, pp. 1912-1923, 2018.
[13] Y. Zhang and A. C. S. Chung, "Deep supervision with additional labels for retinal vessel segmentation task," in MICCAI, pp. 83-91, 2018.
[14] L. Chen et al., "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2017.
[15] L. Chen et al., "Rethinking Atrous Convolution for Semantic Image Segmentation," arXiv preprint arXiv:1706.05587, 2017.
[16] H. Zhao et al., "Pyramid scene parsing network," in CVPR, pp. 2881-2890, 2017.
[17] Y. Fisher, "Dilated Residual Networks," in CVPR, pp. 636-644, 2017.
[18] L. Chen et al., "Encoder-decoder with atrous separable convolution for semantic image segmentation," in ECCV, pp. 801-808, 2018.
[19] K. He et al., "Deep Residual Learning for Image Recognition," in CVPR, pp. 770-778, 2016.
[20] Z. Huang et al., "CCNet: Criss-Cross Attention for Semantic Segmentation," in ICCV, pp. 603-612, 2019.
[21] J. Fu et al., "Dual attention network for scene segmentation," in CVPR, pp. 3146-3154, 2019.
[22] X. Wang et al., "Non-local Neural Networks," in CVPR, pp. 7794-7803, 2018.
[23] A. Vaswani et al., "Attention is all you need," in NIPS, pp. 5998-6008, 2017.
[24] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in MICCAI, pp. 234-241, 2015.
[25] X. Li et al., "Selective kernel networks," in CVPR, pp. 510-519, 2019.
[26] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, pp. 3431-3440, 2015.
[27] M. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[28] P. Huang et al., "Attention-based multimodal neural machine translation," in Proceedings of the First Conference on Machine Translation, vol. 2, pp. 639-645, 2016.
[29] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, pp. 7132-7141, 2018.
[30] M. Ren and S. Z. Richard, "End-to-end instance segmentation with recurrent attention," in CVPR, pp. 6656-6664, 2017.
[31] Y. Chen et al., "A2-Nets: Double Attention Networks," in NIPS, pp. 352-361, 2018.
[32] J. Staal et al., "Ridge-based vessel segmentation in color images of the retina," IEEE Transactions on Medical Imaging, vol. 23, no. 4, pp. 501-509, 2004.
[33] A. D. Hoover, V. Kouznetsova, and M. Goldbaum, "Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response," IEEE Transactions on Medical Imaging, vol. 19, no. 3, pp. 203-210, 2000.
[34] M. M. Fraz et al., "An ensemble classification-based approach applied to retinal blood vessel segmentation," IEEE Transactions on Biomedical Engineering, vol. 59, no. 9, pp. 2538-2548, 2012.
[35] M. Hajabdollahi et al., "Low complexity convolutional neural network for vessel segmentation in portable retinal diagnostic devices," in ICIP, pp. 2785-2789, 2018.
[36] K. Zuiderveld, "Contrast limited adaptive histogram equalization," Graphics Gems IV, pp. 474-485, 1994.
[37] H. Wang and D. Suter, "Color image segmentation using global information and local homogeneity," in DICTA, 2003.
[38] H. Ding et al., "Context contrasted feature and gated multi-scale aggregation for scene segmentation," in CVPR, pp. 2393-2402, 2018.
[39] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[40] A. F. Agarap, "Deep learning using rectified linear units (relu)," arXiv preprint arXiv:1803.08375, 2018.
[41] A. Paszke et al., "Automatic differentiation in pytorch," 2017.
[42] Z. Feng, J. Yang, and L. Yao, "Patch-based fully convolutional neural network with skip connections for retinal blood vessel segmentation," in ICIP, pp. 1742-1746, 2017.
[43] B. Zhang et al., "Retinal vessel extraction by matched filter with first-order derivative of Gaussian," Comput. Biol. Med., vol. 40, no. 4, pp. 438-445, 2010.
[44] X. You et al., "Segmentation of retinal blood vessels using the radial projection and semi-supervised approach," Pattern Recog., vol. 44, no. 10, pp. 2314-2324, 2011.
[45] M. M. Fraz et al., "An approach to localize the retinal blood vessels using bit planes and centerline detection," Comput. Methods Programs Biomed., vol. 108, no. 2, pp. 600-616, 2012.
[46] S. Roychowdhury et al., "Iterative vessel segmentation of fundus images," IEEE Trans. Biomed Eng., vol. 62, no. 7, pp. 1738-1749, 2015.
[47] G. Azzopardi et al., "Trainable COSFIRE filters for vessel delineation with application to retinal images," Med. Image Anal., vol. 19, no. 1, pp. 46-57, 2015.
[48] J. Zhang et al., "Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores," IEEE Trans. Med. Imag., vol. 35, no. 12, pp. 2631-2644, 2016.
[49] Z. Yan, X. Yang, and K. Cheng, "A three-stage deep learning model for accurate retinal vessel segmentation," IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 4, pp. 1427-1436, 2018.
[50] M. Z. Alom et al., "Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation," arXiv preprint arXiv:1802.06955, 2018.
[51] Q. Jin et al., "DUNet: A deformable network for retinal vessel segmentation," Knowledge-Based Systems, vol. 178, pp. 149-162, 2019.
[52] S. Zhang et al., "Attention guided network for retinal image segmentation," in MICCAI, pp. 795-805, 2019.
[53] Z. Luo, Y. Zhang, L. Zhou, et al., "Micro-Vessel Image Segmentation Based on the AD-UNet Model," IEEE Access, vol. 7, pp. 143402-143411, 2019.
[54] Y. Jiang et al., "Retinal vessels segmentation based on dilated multi-scale convolutional neural network," IEEE Access, vol. 7, pp. 76342-76352, 2019.
[55] D. Wang et al., "Hard Attention Net for Automatic Retinal Vessel Segmentation," IEEE Journal of Biomedical and Health Informatics, pp. 7794-7803, 2018.
[56] H. Zhang et al., "Context encoding for semantic segmentation," in CVPR, pp. 7151-7160, 2018.
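
As a supplementary illustration of the space-complexity claim in Section VI, the following minimal sketch compares the number of attention-map entries implied by O((H × W) × (H × W)) and by O(H × W). It only evaluates the two asymptotic terms; constant factors and the channel dimension are ignored, and the 48 × 48 patch size, the 4-byte element size, and the function names are assumptions made for illustration, not values taken from the paper.

```python
def attention_map_elements(height, width):
    """Count attention-map entries for the two complexity terms quoted in Section VI.

    Constant factors and the channel dimension are deliberately ignored
    (assumption): only the dominant terms are evaluated.
    """
    n = height * width  # number of spatial positions, H x W
    return {
        "full_self_attention": n * n,  # O((H x W) x (H x W)): one entry per pair of positions
        "linear_in_pixels": n,         # O(H x W): one entry per position
    }


def to_mebibytes(num_elements, bytes_per_element=4):
    """Convert an element count to MiB, assuming 4-byte (float32) entries."""
    return num_elements * bytes_per_element / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical 48 x 48 feature map, chosen only to make the numbers concrete.
    counts = attention_map_elements(48, 48)
    for name, count in counts.items():
        print(f"{name}: {count} elements, {to_mebibytes(count):.4f} MiB")
```

For the hypothetical 48 × 48 map, the pairwise term gives 5,308,416 entries (20.25 MiB at 4 bytes each) against 2,304 entries for the linear term, a ratio equal to H × W itself. This is why the gap widens quickly as the input resolution grows.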
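
Section VI describes the SK unit as two convolutional branches with different kernel sizes whose outputs are fused. The sketch below illustrates only the fusion step, following the general selective-kernel idea of [25]: sum the branches, pool globally, and use a per-channel softmax across branches as soft attention weights. It is not the authors' implementation. The convolutions themselves are omitted, the inputs stand in for precomputed branch outputs, and the names `sk_fuse`, `w_reduce`, `w_a`, and `w_b` are hypothetical.

```python
import math


def global_avg_pool(feature):
    """Mean of each channel; `feature` is a list of channels, each a flat list of pixels."""
    return [sum(channel) / len(channel) for channel in feature]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sk_fuse(branch_a, branch_b, w_reduce, w_a, w_b):
    """Fuse two branch outputs with per-channel soft attention.

    branch_a, branch_b: outputs of the two (omitted) convolutional branches.
    w_reduce, w_a, w_b: hypothetical learned weights, passed in explicitly here.
    Returns the fused feature and the (a, b) attention weights per channel.
    """
    # 1. Fuse: element-wise sum of the two branches.
    summed = [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(branch_a, branch_b)]
    # 2. Squeeze: global average pooling, then a reduced descriptor with ReLU.
    descriptor = global_avg_pool(summed)
    z = [max(0.0, v) for v in matvec(w_reduce, descriptor)]
    # 3. Select: one logit per channel and branch, softmax across the two branches.
    logits_a = matvec(w_a, z)
    logits_b = matvec(w_b, z)
    fused, weights = [], []
    for ca, cb, la, lb in zip(branch_a, branch_b, logits_a, logits_b):
        m = max(la, lb)  # subtract the max for numerical stability
        ea, eb = math.exp(la - m), math.exp(lb - m)
        a = ea / (ea + eb)
        weights.append((a, 1.0 - a))
        fused.append([a * x + (1.0 - a) * y for x, y in zip(ca, cb)])
    return fused, weights
```

With equal logits for both branches, each channel is weighted 0.5 and 0.5 and the result is the plain average of the two branches. Strongly different logits make the fusion approach a hard selection of one kernel size, which is the adaptive receptive-field behavior the SK unit is meant to provide.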