
Accurate Retinal Vessel Segmentation in Color Fundus Images via Fully Attention-based Networks
Kaiqi Li*, Xingqun Qi*, Yiwen Luo, Zeyi Yao, Xiaoguang Zhou, and Muyi Sun†

Abstract—Automatic retinal vessel segmentation is important for the diagnosis and prevention of ophthalmic diseases. Existing deep learning models for retinal vessel segmentation treat each pixel equally. However, the multi-scale vessel structure is a vital factor affecting the segmentation results, especially for thin vessels. To address this gap, we propose a novel Fully Attention-based Network (FANet) built on attention mechanisms that adaptively learns rich feature representations and aggregates multi-scale information. Specifically, the framework consists of an image pre-processing procedure and a semantic segmentation network. Green channel extraction (GE) and contrast-limited adaptive histogram equalization (CLAHE) are employed as pre-processing to enhance the texture and contrast of retinal blood images. Besides, the network combines two types of attention modules with the U-Net. We propose a lightweight dual-direction attention block to model global dependencies and reduce intra-class inconsistencies, in which the weights of feature maps are updated based on the semantic correlation between pixels. The dual-direction attention block utilizes horizontal and vertical pooling operations to produce the attention map. In this way, the network aggregates global contextual information from semantically close regions or series of pixels belonging to the same object category. Meanwhile, we adopt the selective kernel (SK) unit to replace the standard convolution, obtaining multi-scale features of different receptive field sizes generated by soft attention. Furthermore, we demonstrate that the proposed model can effectively identify irregular, noisy, and multi-scale retinal vessels. Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets show that our method achieves state-of-the-art performance.

Index Terms—deep learning, retinal vessel segmentation, image processing, attention mechanism.

[Fig. 1. Three samples of the color fundus images. In these images, there are noise, low contrast features, irregular structures, and multi-scale vessels. These challenges have brought huge obstacles to retinal vessel segmentation tasks. (For unified illustration, all the images are resized to the same size.)]

I. INTRODUCTION

THE structure and features of retinal fundus images are significant elements that indicate whether a patient suffers from eye diseases such as diabetic retinopathy, hypertension, and glaucoma [1]. For extracting vessel features to analyze and detect diseases in retinal fundus images, retinal vessel segmentation is essential in clinical medicine. However, manual segmentation of retinal vessels by experts is a time-consuming and challenging task. For instance, retinal fundus images commonly have noise, low contrast, and irregular, multi-scale structures, as shown in Fig. 1. These factors result in the poor performance of vessel segmentation. Therefore, automated and standardized retinal vessel segmentation plays an important role in the diagnosis of eye diseases.

The currently proposed retinal vessel segmentation methods can be broadly categorized into two types: unsupervised methods and supervised methods. Unsupervised methods do not require manual annotations for reference. In general, the annotation of vessel images is expensive and complex because it demands considerable time, energy, and patience, especially for pixel-level annotations. Unsupervised methods effectively avoid this difficulty. There are generally two types of unsupervised methods in previous research: matched filter methods [2,3,4,5] and model-based methods [6,7,8,9]. Matched filter methods mainly design filters to extract the features of retinal blood vessels in different directions, locations, and dimensions. Model-based methods mainly use explicit models to extract the retinal vessels. These unsupervised methods usually require a hand-crafted feature extractor, which relies on rich prior knowledge and extra pre-processing tactics. The manually designed extractors often differ tremendously across different backgrounds of retinal fundus images. They can achieve excellent results when the backgrounds of retinal images are flat and simple. However, if there are diverse and complex backgrounds in the images, such as focal infections, these methods may misclassify vessels against the backgrounds. Compared with supervised methods, unsupervised methods are not acceptable for vessel segmentation due to their poor robustness and limited performance.

Supervised methods utilize precious annotations in segmentation and can achieve relatively precise results. Since the computing resources of computers are gradually increasing, a host of methods [10,11,12,13] based on convolutional neural networks (CNNs) have been applied to retinal vessel segmentation.

K. Li, X. Qi, Y. Luo, Z. Yao, X. Zhou, and M. Sun are with the School of Automation, Beijing University of Posts and Telecommunications, Beijing, China (e-mail: {Lkq1997, XingqunQi, luoyiwen1995, yaozeyi, zxg1, sunmuyi}@bupt.edu.cn).
M. Sun is also with the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China (e-mail: muyi.sun@cripac.ia.ac.cn).
* These authors contributed equally to this work.
† Corresponding author.
Manuscript submitted September 20, 2020.


Although these deep learning models can produce accurate segmentation masks, a variety of challenges remain in retinal image segmentation. For instance, the existing deep learning methods universally segment using the convolution operation, whereas the feature representation of a convolutional kernel only contains local contextual information, which is limited for capturing long-range dependencies. Therefore, the local structure of the kernels may cause incorrect semantic understanding in the continuous vessel segmentation task. Moreover, in standard CNNs, all pixels are regarded equally, which might weaken semantic feature learning and bring unnecessary redundant information. In addition, CNN-based vessel segmentation models have fixed receptive fields in their convolutional layers. However, the vessel structures are irregular, complex, and multi-scale. Perceiving these vessel features with fixed regions could be inadequate.

To capture long-range contextual information, some methods explore long-range dependencies through multi-scale modules [14,15,16], dilated convolutions [17,18], and cascaded convolutional layers [19]. Nevertheless, these operations exclusively fuse features from different scales or receptive fields, instead of aggregating contextual information from semantically close regions. Information from semantically correlated regions is more valuable for identifying objects of different categories. Besides, feature extractors (kernels) with the same weights at all locations can only represent similar long-range dependencies for all pixels. Therefore, attention-based methods [20,21] have been proposed to adaptively aggregate contextual information from semantically correlated regions. The attention mechanism aims not only to model long-range dependencies regardless of spatial distance, but also to strengthen the discriminative ability of the model and reduce misclassification caused by incorrect semantic understanding. However, attention-based methods need to produce an enormous attention map to project the similarities of each pixel pair. For an image with resolution H × W, the space complexity of the non-local attention operation [22] is O((H × W) × (H × W)). Any such method practically suffers from high computational and memory costs. Although some subsampling tactics [21] are used to reduce the cost, they may potentially hurt the performance.

Inspired by the above, in this paper we propose a novel network architecture based on attention mechanisms [23] and U-Net [24] for retinal vessel segmentation. Specifically, we first introduce a self-attention mechanism to model the correlation between positions and integrate local features with global contextual information. Within the self-attention mechanism, we propose a dual-direction attention block, which utilizes horizontal and vertical pooling operations to produce the attention map for long-range contextual information aggregation. By fusing the information of the two directions, this attention block can also collect contextual information from all pixels. Furthermore, the space complexity of the dual-direction attention block is O(H × W). The dual-direction attention block produces higher values for pixels with stronger semantic correlation. Therefore, the attention map represents a set of pixels with similar semantics. Meanwhile, the attention map is used to update the feature maps, making the network more focused on regions beneficial for our tasks. For the multi-scale vessels, we apply selective kernel (SK) units [25] to obtain feature maps with adaptive receptive field sizes. The SK unit is a soft attention mechanism that updates the weights of the feature maps from different kernel sizes. Thus, we can adaptively aggregate multi-scale features in a single layer instead of stacking multiple layers. The SK unit facilitates the network in controlling the multi-scale information flow through learning global contextual information.

The main contributions of this paper are listed as follows:
(1) We propose a Fully Attention-based Network (FANet) with a self-attention mechanism, which promotes the network to capture long-range contextual information.
(2) We design a dual-direction attention block to learn spatial dependencies more lightly and effectively.
(3) We integrate SK units into our network to obtain different receptive field sizes and generate multi-scale features.
(4) We utilize contrast-limited adaptive histogram equalization (CLAHE) as the pre-processing method to enhance the texture and contrast of the retinal fundus images.
(5) We conduct sufficient experiments on three datasets, including the DRIVE, STARE, and CHASE_DB1 datasets, and achieve state-of-the-art performance.

The rest of this paper is organized as follows. Section II introduces a brief description of the related methods. Section III specifically explains the method of our work. Section IV describes the experiments. Section V presents the results. Section VI provides discussions about the proposed model. Section VII concludes our work.

II. RELATED WORK

Semantic segmentation in medical images. Many well-designed network structures based on deep convolutional neural networks show excellent performance in semantic segmentation tasks due to their rich representation capabilities. Fully Convolutional Networks (FCN) [26] were extremely crucial for the evolution of semantic segmentation. FCN utilizes an encoder-decoder structure and applies fully convolutional classification networks in the whole backbone to perceive features. Another important contribution in semantic segmentation is the utilization of skip connections, which aggregate low-level features into high-level features to recover reduced details. Motivated by FCN and skip connections, U-Net was proposed with a U-shaped encoder-decoder architecture that modifies and extends the structure of FCN. U-Net is widely adopted in medical image segmentation and can effectively handle multi-scale features in medical images. However, in retinal vessel segmentation, U-Net is not enough to cope with the thin and irregular retinal vessel structure, as shown in Fig. 7, even though U-Net achieves multi-scale contextual information aggregation. To deal with this problem, we embed the SK unit into the U-Net by using two convolutional branches with a soft attention mechanism to further generate multi-scale information.

Attention mechanism is widely used in a range of tasks such as natural language processing [23,27,28] and computer vision [20,25,29,30]. The attention mechanism increases the discriminative ability of the network by updating the weights at each position.


[Fig. 2. Architecture of our proposed network. Our network is based on U-Net, which is specifically designed for medical image segmentation tasks. The block A represents the dual-direction attention block, aiming to model the semantic relation map. The SK unit is used to replace standard convolution for the perception of multi-scale features.]

The flexibility of the model is also improved by the dynamic weights, which are calculated from the feature maps. The attention mechanism can handle the problem of extracting region-specific information while ignoring irrelevant regions. In this work, we integrate two types of attention mechanisms: self-attention and soft attention. The self-attention mechanism was first proposed in machine translation tasks [23] and proved to be superior in natural language processing. Non-local neural networks [22] introduced self-attention into the task of image classification. DANet [21] improves self-attention in the spatial and channel dimensions for semantic segmentation, which can model long-range dependencies to strengthen the feature representation for semantic understanding. However, self-attention modules incur huge computational and memory costs, and some studies are dedicated to reducing the complexity of the module [20,31]. Inspired by the above, we propose an efficient dual-direction attention block to model global dependencies and reduce intra-class inconsistencies. The soft attention block uses global information to selectively emphasize informative features and suppress redundant information. SKNet [25] and SENet [29] apply soft attention on kernels and channels respectively, achieving remarkable performance in image classification. In this paper, we employ the SK unit to obtain multi-scale information of vessels in the baseline network.

III. METHODOLOGY

In this work, we develop a Fully Attention-based Network (FANet) for retinal vessel segmentation. The procedure consists of two main steps: image pre-processing and feeding into the network. The pre-processing strategies enhance the contrast of the original images. The network structure combines the self-attention modules and SK units with the U-Net. The network architecture is shown in Fig. 2.

A. Datasets

We utilize three public and standard datasets, DRIVE [32], STARE [33], and CHASE_DB1 [34], in our experiments to evaluate the proposed FANet. We show some original images, images after pre-processing, and corresponding labels of these datasets in Fig. 3.

[Fig. 3. Original retinal images (top row), images after pre-processing (middle row), and corresponding labels (bottom row) from DRIVE, STARE, and CHASE_DB1 sequentially.]

DRIVE contains 40 retinal images. Each image has the same resolution of 565 × 584. This dataset is divided into a training set and a test set, both including 20 images.


The training set has a single manual annotation for each image, while the test set has two manual segmentation masks generated by two experts; we choose the first one as the gold standard for testing.

STARE has 20 retinal images, and the resolution of each image is 700 × 604. The original STARE dataset is not divided for training and testing. We split the dataset in order and take the first 10 images as the training set and the remaining 10 images as the test set. This dataset is also manually annotated by two experts. We select the masks of the first observer as the ground truth.

CHASE_DB1 includes 28 retinal images, with the same resolution of 999 × 960. These images were collected from 14 children. We use the first 14 images as the training set and the last 14 as the test set. Meanwhile, we employ the first group of annotations as the labels.

[Fig. 4. The images after each pre-processing strategy. (a): original image; (b): image after green channel extraction; (c): image after CLAHE.]

B. Image Pre-processing

Due to the low contrast of retinal fundus images, we employ several image pre-processing strategies. Inspired by previous methods [12,13,35], where green channel images show higher contrast than RGB images, we extract the green channel to increase the contrast of the images and decrease the noise. Meanwhile, we utilize the Contrast-Limited Adaptive Histogram Equalization (CLAHE) method [36] to further enhance the contrast. The images after each pre-processing strategy are shown in Fig. 4.

CLAHE was originally designed for medical image analysis and has proven successful for the enhancement of low contrast images. Compared with standard histogram equalization methods, CLAHE makes two major contributions. Firstly, CLAHE enhances the contrast of all pixels more equally. With standard histogram equalization, the range of the histogram becomes wider and the gray value distribution of the image becomes more uniform after the histogram is equalized. In this way, pixels with lower contrast may not be visible when the number of pixels in a certain range is small. CLAHE adopts a sliding pane method to divide the entire image into 8 × 8 regions. Equation (1) is the local mapping function, where D_A and D_B denote the gray value before and after the conversion, and H(i) denotes the number of pixels at gray level i. CLAHE optimizes the contrast in each region, resulting in the enhancement of contrast throughout the whole image.

D_B = (255 / (8 × 8)) × Σ_{i=0}^{D_A} H(i).  (1)

Secondly, CLAHE reduces the noise amplification problem by limiting contrast enhancement. If most pixels fall into the same gray range, the histogram peaks in these areas are relatively high, and thus the slope of the local mapping function will be relatively large. In this situation, low gray values (such as the original background or noise) would be mapped to high gray values. We set a maximum number of pixels H_max for each gray value; if the number of pixels is greater than H_max, the excess is clipped. After clipping the histogram, the pixel values are distributed more uniformly throughout the histogram. To ensure that the total histogram area remains the same as the original, the histogram is raised by a height L. The final improved histogram is:

H'(i) = { H(i) + L,   H(i) < H_max
        { H_max + L,  H(i) ≥ H_max    (2)
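As a concrete illustration, the following is a minimal sketch of this two-step pre-processing with OpenCV; the 8 × 8 tile grid follows the text, while the clip limit value is our assumption, since the paper does not state it.

```python
import cv2

def preprocess_fundus(path, clip_limit=2.0):
    """Green channel extraction followed by CLAHE (Sec. III-B)."""
    bgr = cv2.imread(path)            # OpenCV loads color images as BGR
    green = bgr[:, :, 1]              # green channel: highest vessel contrast
    # CLAHE over an 8 x 8 tile grid; clipping the histogram peak limits
    # how strongly background noise can be amplified.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    return clahe.apply(green)
```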

C. Dual-Direction Attention Block

Many recent works [37,38] have elucidated that local features only represent information about a region; this is not enough to model long-range dependencies in whole images. Global dependencies are vitally important for image semantic understanding. Motivated by this fact, in this subsection we propose an efficient and effective dual-direction attention block based on self-attention mechanisms to help networks capture long-range contextual information. The dual-direction attention block encodes attention weights over the entire pixel space and captures complex relations, and it is thus fully capable of gathering global dependencies into the local feature representation. The dual-direction attention block consists of two steps: fusion and distribution. The structure diagram of our proposed dual-direction attention block is illustrated at the bottom-right of Fig. 2.

Fusion. In order to model spatial long-range dependencies, we consider the relations of all positions in the feature map. For any given feature map F ∈ R^(H×W), we feed F into two parallel branches simultaneously, each of which contains a horizontal or vertical average pooling operation. The horizontal average pooling operation captures relationships between pixels in a row; the vertical average pooling operation captures relationships between pixels in a column. To obtain the relations over the whole scene, we fuse the horizontal feature F_h ∈ R^H and the vertical feature F_v ∈ R^W using matrix multiplication to produce the crude attention map. The refined attention map A ∈ R^(H×W) then undergoes a simple scaled softmax operation to ensure that the attention weights are limited to (0, 1). The attention map assigns high attention scores to high-level semantic pixels.

Distribution. The next step after fusing features from the horizontal and vertical dimensions is to distribute them to each location of the input. We perform an element-wise multiplication between F and A to implement the attention. In this way, the feature maps gain rich contextual information and aggregate contexts selectively according to the attention map. Therefore, the proposed network can adaptively select features that are favorable to the current task and have stronger semantic representation capabilities.


[Fig. 5. The architecture of the SK unit. The SK unit is used for aggregating multi-scale features. "Fc", "Fg", "Ffc" denote convolution, global average pooling, and fully connected operations respectively.]

Finally, we multiply the output of the attention by a learnable parameter α, initialized to 0, and then perform an element-wise summation with the features F to obtain the final output. This is similar to residual learning, promoting the training process and local feature enhancement.

The dual-direction attention block can capture long-range contextual information in the horizontal and vertical directions, which is beneficial for attaining the dense contextual information needed for semantic segmentation. Each position is allowed to rebuild relationships with a variety of nearby positions. We can stack the above attention process twice or more, which is flexible and captures long-range dependencies from all pixels. The dual-direction attention block can be directly inserted into any CNN architecture at any stage and supports an end-to-end training strategy.
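To make the fusion and distribution steps concrete, below is a minimal PyTorch sketch of the block as we read it; the per-channel treatment of the map and the softmax scaling factor are our assumptions, since the text describes a single H × W map and only calls the softmax "scaled".

```python
import torch
import torch.nn as nn

class DualDirectionAttention(nn.Module):
    """Sketch of the dual-direction attention block (Sec. III-C)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # residual weight, init 0

    def forward(self, f):                  # f: (B, C, H, W)
        b, c, h, w = f.shape
        f_h = f.mean(dim=3)                # horizontal pooling -> (B, C, H)
        f_v = f.mean(dim=2)                # vertical pooling   -> (B, C, W)
        # Fusion: outer product of the two directions gives the crude map.
        a = torch.einsum('bch,bcw->bchw', f_h, f_v)
        # Scaled softmax over all H*W positions keeps weights in (0, 1).
        a = torch.softmax(a.view(b, c, -1) / (h * w) ** 0.5, dim=-1)
        a = a.view(b, c, h, w)
        # Distribution: element-wise multiplication, then residual sum.
        return self.alpha * (f * a) + f
```

Note that the map costs O(H × W) to pool and store per channel, in contrast to the O((H × W) × (H × W)) pairwise map of non-local attention.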
D. SK Unit

It is widely accepted that the receptive field sizes of feature representations should vary, which enables neurons to capture multi-scale spatial semantic information [16,18]. In addition, the standard convolution generates feature representations with fixed receptive fields and fixed kernel parameters when sliding on the feature maps, whereas the feature representation of same-category pixels may differ between regions, which can result in intra-class inconsistency. It is ineffective and inefficient to aggregate contextual information from a pre-defined fixed receptive field in visual tasks [25]. In order to select the receptive field sizes of the neurons adaptively, we adopt the Selective Kernel (SK) unit, which employs kernels of different sizes to produce multi-scale information. We then use a gated softmax operation to fuse the information from the multi-size convolutional kernels. Moreover, the gains generated by SK units at different stages are mutually reinforcing, because they can be sequentially combined to further enhance network performance. The SK unit is illustrated in Fig. 5. The calculation steps of the SK unit can be briefly formulated as:

X_1, X_2 = F_c(X).  (3)
X' = X_1 + X_2.  (4)
S = F_g(X').  (5)
S' = F_fc(S).  (6)
A, B = softmax(F_c(S')).  (7)
X_1', X_2' = A × X_1, B × X_2.  (8)
O = X_1' + X_2'.  (9)

Specifically, we first feed the input feature map X ∈ R^(C×H×W) into the block with 3 × 3 and 5 × 5 convolutional layers to generate the multi-scale information X_1 ∈ R^(C'×H×W) and X_2 ∈ R^(C'×H×W), followed by Batch Normalization [39] and ReLU [40], as shown in (3). To obtain the weights of the feature maps from the different kernels, we aggregate the multi-scale information with an element-wise sum operation and obtain the fused feature information X' ∈ R^(C'×H×W), as shown in (4). Afterward, we use global average pooling F_g in (5) to encode the global information S ∈ R^(C'), which is designed for the attention weights. Further, a fully connected layer F_fc in (6) with Batch Normalization and ReLU is applied to produce the low-dimensional features S' and reduce the computational cost. The scaling factor is set to 2 in our work. Then, the attention vectors are computed by two 1 × 1 convolutional layers F_c in (7) and a softmax operation. The attention vectors consist of two weight vectors A and B, which indicate the adaptive weights of the multi-scale information, where A + B = 1. Finally, we multiply the attention vectors with the multi-scale feature maps to aggregate the multi-scale features. The final output feature maps O ∈ R^(C'×H×W) are calculated as X_1' + X_2', as shown in (8) and (9). In order to train our network more effectively, we embed the SK unit into the residual blocks [19].
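The following PyTorch sketch mirrors Eqs. (3)-(9) under our reading; the class name and the channel-wise softmax layout are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    """Sketch of the SK unit of Sec. III-D; reduction r=2 follows the text."""
    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        def branch(k):                                  # conv + BN + ReLU
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv3, self.conv5 = branch(3), branch(5)   # Eq. (3)
        self.fc = nn.Sequential(                        # Eq. (6)
            nn.Linear(out_ch, out_ch // r, bias=False),
            nn.BatchNorm1d(out_ch // r), nn.ReLU(inplace=True))
        self.att = nn.Linear(out_ch // r, out_ch * 2)   # Eq. (7)

    def forward(self, x):
        x1, x2 = self.conv3(x), self.conv5(x)           # multi-scale branches
        s = (x1 + x2).mean(dim=(2, 3))                  # Eqs. (4)-(5): fuse + GAP
        z = self.att(self.fc(s))                        # Eqs. (6)-(7)
        a, b = z.chunk(2, dim=1)
        ab = torch.softmax(torch.stack([a, b], dim=1), dim=1)  # A + B = 1
        a = ab[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = ab[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * x1 + b * x2                          # Eqs. (8)-(9)
```

The softmax over the stacked branch logits enforces A + B = 1 per channel, so the unit interpolates between the 3 × 3 and 5 × 5 receptive fields rather than simply summing them.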


E. Overview

Given a retinal image x, we first put the image through the pre-processing step and then crop it into 64 × 64 patches to get the input x_input. These patches are fed into the proposed network. During the training procedure, the Cross-Entropy (CE) loss function is used to compute the difference between the prediction results y' and the target masks y:

CE(y, y') = −(1/N) Σ_{i=0}^{N} (y_i log(y_i') + (1 − y_i) log(1 − y_i')).  (10)

where N is the number of training images. When the network converges, we can compute the output result O through the network weights w_FANet and x_input, as shown in (11). FANet() is the mapping function learned by the proposed FANet.

O = FANet(w_FANet, x_input).  (11)

Algorithm 1 The algorithm of the proposed method
Input:
1: x: Image data;
2: y: Target mask;
3: t: The number of iterations = 10000;
4: lr: Learning rate;
Output:
5: O: The prediction;
6: Step 1: Image pre-processing;
7: Employ the green channel extraction;
8: Adopt the contrast-limited adaptive histogram equalization;
9: Clip into 64 × 64 patches, get the input x_input;
10: Step 2: Network training;
11: Initialize w_FANet = 0;
12: for t = 0 to 10000 do
13:   Compute the results y' = FANet(w_FANet, x_input);
14:   Compute the loss J = CE(y, y');
15:   Compute the gradient g = ∇J;
16:   Update the weights w_FANet = w_FANet − lr × g;
17: end for
18: Compute the prediction O = FANet(w_FANet, x_input).
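As an illustration, here is a minimal PyTorch sketch of this training loop; `FANet` and `loader` are placeholders for the model and the 64 × 64 patch sampler, we assume a sigmoid output so that binary cross-entropy matches Eq. (10), and the learning-rate schedule anticipates the poly policy of Eq. (12) below.

```python
import itertools
import torch
import torch.nn as nn

model = FANet()                                # placeholder for the network
criterion = nn.BCELoss()                       # cross-entropy of Eq. (10)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
total_iter = 10000
# Poly learning-rate policy: base_lr * (1 - iter/total_iter)^0.9, Eq. (12).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / total_iter) ** 0.9)

batches = itertools.cycle(loader)              # loader yields (patch, mask)
for it in range(total_iter):
    x_input, y = next(batches)
    y_pred = model(x_input)                    # forward pass, sigmoid output
    loss = criterion(y_pred, y)                # J = CE(y, y')
    optimizer.zero_grad()
    loss.backward()                            # g = ∇J
    optimizer.step()                           # w <- w - lr * g
    scheduler.step()
```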
IV. EXPERIMENTS

A. Implementation Details

We implement our network in PyTorch [41] and train it on one TITAN Xp GPU. We employ the Adaptive Moment Estimation (Adam) optimization method with momentum 0.9 and weight decay 0.0001. Following previous work [14,15], we also adopt the poly learning rate policy, where the learning rate of each iteration is calculated by

base_lr × (1 − iter/total_iter)^0.9.  (12)

The base_lr is set to 0.0003. We crop each retinal fundus image into 64 × 64 patches for training. The batch size is set to 32 on all three datasets. The network is trained for 10000 iterations. We only utilize random flipping for data augmentation.

B. Evaluation Metrics

The retinal vessel segmentation task classifies each pixel according to the class to which it belongs. In order to quantitatively analyze the segmentation performance of our proposed network, we choose three basic metrics for evaluation: sensitivity/recall (SE/Recall), specificity (SP), and accuracy (ACC).

SE = TP/(TP + FN).  (13)
SP = TN/(TN + FP).  (14)
ACC = (TP + TN)/(TP + TN + FP + FN).  (15)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. Given a binary segmentation map, pixels marked by the experts as the vessel class and classified by the network as the vessel class are counted as TP, while those misclassified as the background class are counted as FN. Pixels marked by the experts as the background class and classified by the network as the background class are counted as TN, while those misclassified as the vessel class are counted as FP. In addition, the receiver operating characteristic (ROC) curve describes the relationship between the true positive rate (SE) and the false positive rate (1 − SP) under different classification thresholds. The area under the ROC curve (AUC) is also used for quality evaluation in our work.
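For reference, a small sketch of how Eqs. (13)-(15) and the AUC can be computed; the helper name and the use of scikit-learn for the AUC are our choices, not part of the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(prob, mask, thresh=0.5):
    """prob: predicted vessel probabilities; mask: expert annotation."""
    p = prob.ravel() >= thresh                 # binarized prediction
    g = mask.ravel().astype(bool)              # gold-standard vessel pixels
    tp = np.sum(p & g)
    tn = np.sum(~p & ~g)
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    se = tp / (tp + fn)                        # Eq. (13), sensitivity
    sp = tn / (tn + fp)                        # Eq. (14), specificity
    acc = (tp + tn) / (tp + tn + fp + fn)      # Eq. (15), accuracy
    auc = roc_auc_score(g, prob.ravel())       # area under the ROC curve
    return se, sp, acc, auc
```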
C. Experimental Setup

1) Experiments on the effect of pre-processing strategy: We verify the effect of the image pre-processing strategies in this subsection. Pre-processing of the images is a key step for accurate segmentation of the retinal vessels. We conduct four comparative experiments on the DRIVE dataset to prove the effectiveness of the pre-processing. U-Net is used as the baseline architecture with stride = 8. Firstly, we use images without any pre-processing strategies. Secondly, grayscale transformation [42], a widely used method in the retinal vessel segmentation task, is employed as the pre-processing step. Thirdly, we use green channel extraction (GE) to enhance the contrast of the original retinal images. Finally, CLAHE is used in our experiments.

2) Experiments on the effect of dual-direction attention block: We adopt the dual-direction attention block (DAB) in our proposed network to model long-range dependencies for aggregating global contextual information. In these experiments, we use U-Net (stride = 8) as the baseline architecture and images with the pre-processing strategies to study the effect of the DAB on the DRIVE dataset. We design two experiments to demonstrate the impact of DAB integration at different locations in our network: we integrate the DAB after the encoder module (AE) and after the decoder module (AD), respectively. Meanwhile, we verify how many DABs at each position bring the best performance.


[Fig. 6. Exemplar retinal vessel segmentation results of the proposed network on the DRIVE (Row 1), STARE (Row 2), and CHASE_DB1 (Row 3) datasets. From left to right: the original images, the manual annotations, the prediction maps.]

3) Experiments on the effect of SK unit: We report ablation studies to investigate the effectiveness of the SK unit on the DRIVE dataset. We use U-Net (stride = 8) as the baseline architecture and images with the pre-processing strategies. In general, the convolutional kernel size and the dilation rate are two vital elements controlling the size of the receptive fields. The most commonly used kernel size in computer vision tasks is 3×3, so we fix one of our convolutional branches to a 3×3 filter. The other branch then has two ways to obtain a different receptive field size: using dilated convolution or changing the convolutional filter size. Therefore, we compare four different kernels in our experiments: "K1" (1×1 convolution), "K5" (5×5 convolution), "K3D2" (3×3 convolution with dilation 2), and "K7" (7×7 convolution).

4) Experiments on the effect of module fusion: To evaluate the effectiveness of each component in the proposed model, we conduct sufficient experiments on the DRIVE dataset. The baseline is U-Net (stride = 8), and the images are processed with the pre-processing strategies. The experimental design is as follows: (1) baseline U-Net; (2) only attention blocks are integrated into the baseline architecture; (3) only SK units are integrated into the baseline architecture; (4) attention blocks and SK units are jointly integrated into the baseline architecture.

5) Comparison against existing methods: We compare our network with numerous current state-of-the-art approaches on the DRIVE, STARE, and CHASE_DB1 datasets. The results of the other methods are taken from the related references. In order to further observe the segmentation results generated by the proposed network, we show some examples in Fig. 6.

TABLE I
ABLATION ANALYSIS OF PRE-PROCESSING STRATEGY ON DRIVE DATASET. GE: GREEN CHANNEL EXTRACTION. CLAHE DENOTES CONTRAST-LIMITED ADAPTIVE HISTOGRAM EQUALIZATION.

Methods              SE      SP      ACC     AUC
baseline             0.7808  0.9889  0.9711  0.9795
Transformation [42]  0.7961  0.9868  0.9718  0.9811
GE                   0.7894  0.9877  0.9721  0.9817
GE + CLAHE           0.8020  0.9872  0.9757  0.9846

V. RESULTS

A. Result of the experiments on pre-processing strategy

We set up four comparative experiments to prove the effectiveness of the pre-processing method described in Section IV; the performance is shown in Table I. Obviously, both the gray transformation (Transformation) and the green channel extraction (GE) improve the segmentation performance, and GE performs slightly better than Transformation on most evaluation indicators. Compared with the baseline, our final pre-processing strategy GE + CLAHE shows the best performance. The SE indicator increases from 0.7808 to 0.8020, an improvement of over 2 percentage points, and the ACC and AUC increase by nearly 0.5 percentage points. We choose GE + CLAHE as the pre-processing strategy in the subsequent experiments.

B. Result of the experiments on dual-direction attention block

Table II shows the performance of the networks with dual-direction attention blocks. Compared with the baseline U-Net, using DABs brings a performance improvement. Meanwhile, we find that integrating DABs into the decoder module gains better results than into the encoder module, because the feature maps in the decoder module carry richer and more accurate semantic information, which is essential for DAB integration.


DABs require feature maps with strong correlative semantic representation as input. As a result, the network with two DABs at each position of the decoder module shows the most competitive performance. The DABs strengthen the semantic association gradually so as to achieve better results. Considering the running time and computational cost, we set the number of DABs to 2.

TABLE II
PERFORMANCES OF DUAL-DIRECTION ATTENTION BLOCK ON DRIVE DATASET. AE: AFTER THE ENCODER MODULE. AD: AFTER THE DECODER MODULE. DAB: DUAL-DIRECTION ATTENTION BLOCK.

DAB position  DAB number  SE      SP      ACC     AUC
-             -           0.8020  0.9872  0.9757  0.9846
AE            1           0.7712  0.9897  0.9756  0.9868
AD            1           0.7937  0.9888  0.9760  0.9876
AD            2           0.8093  0.9872  0.9763  0.9883

C. Result of the experiments on SK unit

To investigate the impact of the SK unit, we conduct various experiments on the DRIVE dataset for the different settings described in Section IV. The results are shown in Table III. We make three observations from the ablation studies of the SK units. Firstly, SK units bring performance benefits no matter which kernel is used in the multi-scale branches. This suggests that the SK unit is helpful for retinal vessel segmentation: aggregating multi-scale contextual information improves the identification ability of the network, especially for retinal vessel images, which are complex and multi-scale. In addition, as the receptive field size increases, the performance improves significantly to a new level, though a further increase of the receptive field size results in a slight decrease. With a larger receptive field, richer semantic information is obtained, which facilitates image understanding. However, the location information becomes indistinct when the kernel size is too large, and in a pixel-level task the classification label must be aligned to the corresponding coordinates in the output segmentation map. Therefore, an overly large kernel may weaken the retinal vessel segmentation effect. This also confirms that an increased number of parameters does not always contribute to improvements of the model. Finally, for the same receptive field size, using dilated convolution leads to poorer performance. This is because dilated convolution with a uniform dilation rate frequently causes the gridding effect, which breaks the continuity of the feature representation, whereas continuity is important for dense pixel-level prediction tasks.

TABLE III
COMPARISON EXPERIMENTS OF DIFFERENT SETTINGS OF SK BRANCH ON DRIVE DATASET. K1: 1X1 CONVOLUTION. K3D2: 3X3 CONVOLUTION WITH DILATION 2. K5: 5X5 CONVOLUTION. K7: 7X7 CONVOLUTION.

Methods   SE      SP      ACC     AUC
baseline  0.8020  0.9872  0.9757  0.9846
K1        0.8048  0.9886  0.9762  0.9866
K3D2      0.8089  0.9883  0.9765  0.9876
K5        0.8134  0.9881  0.9767  0.9883
K7        0.8147  0.9878  0.9765  0.9877

D. Result of the experiments on module fusion

We explore the effectiveness of module fusion in this part. The results are shown in Table IV. We find that using SK units and dual-direction attention blocks separately improves the performance of the network. We also prove that fusing SK units and dual-direction attention blocks refines the results and further improves the overall segmentation accuracy. Fig. 7 shows some examples of the test images, the ground truths, and the segmentation results obtained using U-Net and our proposed network. In the rectangular regions marked with red-dotted lines, we can see that our proposed model obtains more continuous segmentation boundaries and successfully segments micro-vessels. It reveals that our proposed network decreases noise in the background, while the irregular and multi-scale vessel structures are well preserved. These visualizations further demonstrate the importance of aggregating multi-scale contextual information and capturing long-range dependencies in retinal vessel segmentation.

TABLE IV
ABLATION ANALYSIS OF MODULE FUSION ON THE DRIVE DATASET.

U-Net  SK unit  Attention  SE      SP      ACC     AUC
X      -        -          0.8020  0.9872  0.9757  0.9846
X      X        -          0.8134  0.9881  0.9767  0.9883
X      -        X          0.8093  0.9872  0.9763  0.9883
X      X        X          0.8145  0.9883  0.9769  0.9895

E. Comparison with the Existing Methods

In order to further assess the segmentation results generated by the proposed network, we compare the proposed FANet with numerous current state-of-the-art methods on the DRIVE, STARE, and CHASE_DB1 datasets. Table V lists the performances of these methods. From Table V, the proposed FANet achieves outstanding performance with sensitivity = (0.8145/0.8505/0.8334), specificity = (0.9883/0.9889/0.9862), accuracy = (0.9769/0.9797/0.9803), and AUC = (0.9895/0.9924/0.9912) on DRIVE, STARE, and CHASE_DB1 respectively. Our proposed FANet achieves the best SE and ACC results on all three datasets. The results reflect that our network can distinguish vessel and background pixels effectively and accurately. In terms of specificity, D-Net [54] shows the highest SP results, achieving (0.9899/0.9904/0.9894) on the three datasets respectively, which means D-Net can better classify the background. However, due to the highly unbalanced pixel ratio between vessels and background, SE is more important than SP in the retinal vessel segmentation task. Although D-Net is slightly higher than our method on SP (by 0.0016/0.0015/0.0032), the proposed FANet outperforms D-Net by (0.0306/0.0256/0.0495) in terms of sensitivity.


[Fig. 7. Segmentation results of different methods. From left to right: the original images, the corresponding masks, the results of U-Net, the results of the proposed FANet.]

TABLE V
COMPARISON RESULTS ON DRIVE, STARE, CHASE_DB1 DATASETS.

                          DRIVE                           STARE                           CHASE_DB1
Methods            Year   SE      SP      ACC     AUC     SE      SP      ACC     AUC     SE      SP      ACC     AUC
MF-FDOG [43]       2010   0.7120  0.9724  0.9382  -       0.7177  0.9753  0.9484  -       -       -       -       -
You [44]           2011   0.7410  0.9751  0.9434  -       0.7260  0.9756  0.9497  -       -       -       -       -
Fraz [45]          2012   0.7152  0.9759  0.9430  -       0.7311  0.9680  0.9442  -       -       -       -       -
Roychowdhury [46]  2015   0.7395  0.9782  0.9494  0.9672  0.7317  0.9842  0.9560  0.9673  0.7615  0.9575  0.9467  0.9623
Azzopardi [47]     2015   0.7655  0.9704  0.9442  0.9614  0.7716  0.9701  0.9497  0.9563  0.7585  0.9587  0.9387  0.9487
Zhang [48]         2016   0.7743  0.9725  0.9476  0.9636  0.7791  0.9758  0.9554  0.9748  0.7626  0.9661  0.9452  0.9606
Yan [49]           2017   0.7653  0.9818  0.9542  0.9752  0.7581  0.9846  0.9612  0.9901  0.7633  0.9809  0.9610  0.9781
R2U-Net [50]       2018   0.7792  0.9813  0.9556  0.9784  0.8298  0.9862  0.9712  0.9914  0.7756  0.9820  0.9634  0.9815
DUNet [51]         2019   0.7963  0.9800  0.9566  0.9802  -       -       -       -       0.8155  0.9752  0.9610  0.9804
AGNet [52]         2019   0.8100  0.9848  0.9692  0.9856  -       -       -       -       0.8186  0.9848  0.9743  0.9863
ADUNet [53]        2019   0.8075  0.9814  0.9693  0.9846  0.8437  0.9762  0.9684  0.9765  -       -       -       -
D-Net [54]         2019   0.7839  0.9899  0.9709  0.9864  0.8249  0.9904  0.9781  0.9927  0.7839  0.9894  0.9721  0.9866
HANet [55]         2020   0.7991  0.9813  0.9581  0.9823  0.8186  0.9844  0.9772  0.9881  0.8239  0.9813  0.9670  0.9871
proposed           2020   0.8145  0.9883  0.9769  0.9895  0.8505  0.9889  0.9797  0.9924  0.8334  0.9862  0.9803  0.9912

[Fig. 8. ROC curves of the proposed FANet on these three different datasets.]


As for AUC, the proposed model achieves the best results on the DRIVE and CHASE_DB1 datasets; D-Net is only 0.0003 higher than ours in terms of AUC on STARE. Fig. 8 shows the ROC curves of the proposed FANet on the three datasets. The closer the ROC curve is to the upper-left boundary, the more accurate the network. These AUC results show that the proposed network can precisely classify vessels and background. Besides, the segmentation maps in Fig. 6 show that our method is more effective in classifying both thick and thin vessels, demonstrating that the proposed FANet can decrease noise, enhance the contrast, and aggregate multi-scale contextual information to classify irregular vessel structures.

VI. DISCUSSION

In this paper, we propose a Fully Attention-based Network (FANet) for the task of automatic retinal vessel segmentation in color fundus images. Noise, low contrast, multi-scale vessels, and irregular curved vessels are apparent challenges in retinal vessel segmentation. These challenges bring great difficulties to the accurate identification and classification of retinal blood vessel pixels.

As for the presence of noise and low contrast structures, we adopt green channel extraction and CLAHE as the pre-processing strategy. Fig. 4 and Table I suggest that these tactics can effectively enhance the contrast of retinal fundus images. In addition, we integrate the SK unit into our network to deal with the multi-scale vessel structure. The SK unit employs two convolutional branches with different kernel sizes to produce multi-scale information and then fuses them. Table III shows that SK units bring plentiful performance benefits in our experiments. Therefore, the problem of multi-scale structures can be distinctly alleviated. As for irregular curved vessels, some earlier state-of-the-art methods [14,56] have paid attention to capturing complex contextual information. By enhancing the understanding of retinal fundus image semantics, a model can better complete the task of vessel segmentation. However, these methods generally capture contextual information by increasing receptive fields, which can only capture short-range or adjacent dependencies. Although the self-attention module [21,23] has been proven to capture long-range contextual information well, its calculation occupies huge GPU memory and computation resources. We address this limitation by designing a lightweight dual-direction attention block, which reduces the space complexity from O((H×W)×(H×W)) to O(H×W). Table II shows that the proposed dual-direction attention block can better perceive irregular curved vessels by associating contextual information. The experiments on module fusion demonstrate that our method can effectively solve the noise and low contrast problems and capture multi-scale vessels and irregular curved vessels.

Extensive experiments are carried out on the three datasets to interpret the effectiveness of our proposed approach. The performance results are shown in Table V, and they show that our proposed approach consistently performs well. Fig. 7 shows that our approach has the ability to preserve details and capture thin, multi-scale, and irregularly curved vessels. It also suggests that our proposed method is suitable for retinal vascular detection, which could assist professional doctors in disease diagnosis and reduce the workload of human experts in clinical medicine.

Although the proposed method successfully segments the complex retinal vascular structure, a small number of thin and irregular vascular structures are still not accurately classified. We may further improve our network structure by designing a more hybrid contextual semantic module to capture a more compact and discriminative feature representation. Furthermore, we also observe overfitting during the training process, mainly due to the small retinal vessel datasets. We advocate establishing larger and more refined retinal blood vessel datasets. With the advancement of technology, it has become practical to use high-resolution fundus cameras to obtain high-resolution retinal images (as in CHASE_DB1). Such high-resolution and high-quality retinal fundus images will greatly improve the accuracy of image recognition. We encourage further research in this direction.

VII. CONCLUSION

In this paper, we propose a Fully Attention-based Network (FANet) for retinal vessel segmentation, which selects different scale kernels adaptively and adjusts the local semantic feature representation using attention modules. FANet is an extension of the U-Net, with the standard convolutional layers replaced by SK units and the lightweight dual-direction attention module integrated. Besides, we employ contrast-limited adaptive histogram equalization to enhance the contrast and suppress the original background noise of the retinal fundus images. Moreover, the comparative experiments demonstrate that the proposed model captures multi-scale and long-range contextual information, which improves the intra-class discrimination ability of the model. Compared with the baseline U-Net, FANet extracts vessels with more detail. Our proposed model consistently achieves outstanding performance on the three public datasets. In the future, a more lightweight and advanced segmentation architecture could be applied to improve the results and accelerate the process of clinical diagnosis.

REFERENCES

[1] J. J. Kanski and B. Bowling, "Clinical ophthalmology: a systematic approach," Elsevier Health Sciences, 2011.
[2] S. Chaudhuri, et al. "Detection of blood vessels in retinal images using two-dimensional matched filters," IEEE Transactions on Medical Imaging, pp. 263-269, 1989.
[3] A. A. Mendonca and A. Campilho. "Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction," IEEE Transactions on Medical Imaging, vol. 25, no. 9, pp. 1200-1213, 2006.
[4] Y. Wang, et al. "Retinal vessel segmentation using multiwavelet kernels and multiscale hierarchical decomposition," Pattern Recognition, pp. 2117-2133, 2013.
[5] J. Elson, et al. "Automated extraction and analysis of retinal blood vessels with multi scale matched filter," in ICICICT, pp. 775-779, 2017.
[6] Z. Yu and K. Sun. "Vessel segmentation on angiogram using morphology driven deformable model," in BMEI, vol. 2, pp. 675-678, 2010.


[7] Y. Zhang, W. Hsu, and M. Lee. "Detection of retinal blood vessels based on nonlinear projections," Journal of Signal Processing Systems, vol. 55, pp. 103-112, 2009.
[8] S. Kozerke, et al. "Automatic vessel segmentation using active contours in cine phase contrast flow measurements," Journal of Magnetic Resonance Imaging, vol. 10, pp. 41-51, 1999.
[9] A. Lahiri, et al. "Deep neural ensemble for retinal vessel segmentation in fundus images towards achieving label-free angiography," in EMBC, pp. 1340-1343, 2016.
[10] A. Şengür, et al. "A retinal vessel detection approach using convolution neural network," in IDAP, pp. 1-4, 2017.
[11] H. Fu, et al. "Deepvessel: Retinal vessel segmentation via deep learning and conditional random field," in MICCAI, pp. 132-139, 2016.
[12] Z. Yan, X. Yang, and K. Cheng. "Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation," IEEE Transactions on Biomedical Engineering, pp. 1912-1923, 2018.
[13] Y. Zhang and A. C. S. Chung. "Deep supervision with additional labels for retinal vessel segmentation task," in MICCAI, pp. 83-91, 2018.
[14] L. Chen, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2017.
[15] L. Chen, et al. "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[16] H. Zhao, et al. "Pyramid scene parsing network," in CVPR, pp. 2881-2890, 2017.
[17] F. Yu, et al. "Dilated residual networks," in CVPR, pp. 636-644, 2017.
[18] L. Chen, et al. "Encoder-decoder with atrous separable convolution for semantic image segmentation," in ECCV, pp. 801-808, 2018.
[19] K. He, et al. "Deep residual learning for image recognition," in CVPR, pp. 770-778, 2016.
[20] Z. Huang, et al. "CCNet: Criss-cross attention for semantic segmentation," in ICCV, pp. 603-612, 2019.
[21] J. Fu, et al. "Dual attention network for scene segmentation," in CVPR, pp. 3146-3154, 2019.
[22] X. Wang, et al. "Non-local neural networks," in CVPR, pp. 7794-7803, 2018.
[23] A. Vaswani, et al. "Attention is all you need," in NIPS, pp. 5998-6008, 2017.
[24] O. Ronneberger, P. Fischer, and T. Brox. "U-net: Convolutional networks for biomedical image segmentation," in MICCAI, pp. 234-241, 2015.
[25] X. Li, et al. "Selective kernel networks," in CVPR, pp. 510-519, 2019.
[26] J. Long, E. Shelhamer, and T. Darrell. "Fully convolutional networks for semantic segmentation," in CVPR, pp. 3431-3440, 2015.
[27] M. Luong, H. Pham, and C. D. Manning. "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[28] P. Huang, et al. "Attention-based multimodal neural machine translation," in Proceedings of the First Conference on Machine Translation, vol. 2, pp. 639-645, 2016.
[29] J. Hu, L. Shen, and G. Sun. "Squeeze-and-excitation networks," in CVPR, pp. 7132-7141, 2018.
[30] M. Ren and R. S. Zemel. "End-to-end instance segmentation with recurrent attention," in CVPR, pp. 6656-6664, 2017.
[31] Y. Chen, et al. "A2-Nets: Double attention networks," in NIPS, pp. 352-361, 2018.
[32] J. Staal, et al. "Ridge-based vessel segmentation in color images of the retina," IEEE Transactions on Medical Imaging, vol. 23, no. 4, pp. 501-509, 2004.
[33] A. D. Hoover, V. Kouznetsova, and M. Goldbaum. "Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response," IEEE Transactions on Medical Imaging, vol. 19, no. 3, pp. 203-210, 2000.
[34] M. M. Fraz, et al. "An ensemble classification-based approach applied to retinal blood vessel segmentation," IEEE Transactions on Biomedical Engineering, vol. 59, no. 9, pp. 2538-2548, 2012.
[35] M. Hajabdollahi, et al. "Low complexity convolutional neural network for vessel segmentation in portable retinal diagnostic devices," in ICIP, pp. 2785-2789, 2018.
[36] K. Zuiderveld. "Contrast limited adaptive histogram equalization," Graphics Gems IV, pp. 474-485, 1994.
[37] H. Wang and D. Suter. "Color image segmentation using global information and local homogeneity," in DICTA, 2003.
[38] H. Ding, et al. "Context contrasted feature and gated multi-scale aggregation for scene segmentation," in CVPR, pp. 2393-2402, 2018.
[39] S. Ioffe and C. Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[40] A. F. Agarap. "Deep learning using rectified linear units (ReLU)," arXiv preprint arXiv:1803.08375, 2018.
[41] A. Paszke, et al. "Automatic differentiation in PyTorch," 2017.
[42] Z. Feng, J. Yang, and L. Yao. "Patch-based fully convolutional neural network with skip connections for retinal blood vessel segmentation," in ICIP, pp. 1742-1746, 2017.
[43] B. Zhang, et al. "Retinal vessel extraction by matched filter with first-order derivative of Gaussian," Comput. Biol. Med., vol. 40, no. 4, pp. 438-445, 2010.
[44] X. You, et al. "Segmentation of retinal blood vessels using the radial projection and semi-supervised approach," Pattern Recog., vol. 44, no. 10, pp. 2314-2324, 2011.
[45] M. M. Fraz, et al. "An approach to localize the retinal blood vessels using bit planes and centerline detection," Comput. Methods Programs Biomed., vol. 108, no. 2, pp. 600-616, 2012.
[46] S. Roychowdhury, et al. "Iterative vessel segmentation of fundus images," IEEE Trans. Biomed. Eng., vol. 62, no. 7, pp. 1738-1749, 2015.
[47] G. Azzopardi, et al. "Trainable COSFIRE filters for vessel delineation with application to retinal images," Med. Image Anal., vol. 19, no. 1, pp. 46-57, 2015.
[48] J. Zhang, et al. "Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores," IEEE Trans. Med. Imag., vol. 35, no. 12, pp. 2631-2644, 2016.
[49] Z. Yan, X. Yang, and K. Cheng. "A three-stage deep learning model for accurate retinal vessel segmentation," IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 4, pp. 1427-1436, 2018.
[50] M. Z. Alom, et al. "Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation," arXiv preprint arXiv:1802.06955, 2018.
[51] Q. Jin, et al. "DUNet: A deformable network for retinal vessel segmentation," Knowledge-Based Systems, vol. 178, pp. 149-162, 2019.
[52] S. Zhang, et al. "Attention guided network for retinal image segmentation," in MICCAI, pp. 795-805, 2019.
[53] Z. Luo, Y. Zhang, L. Zhou, et al. "Micro-vessel image segmentation based on the AD-UNet model," IEEE Access, vol. 7, pp. 143402-143411, 2019.
[54] Y. Jiang, et al. "Retinal vessels segmentation based on dilated multi-scale convolutional neural network," IEEE Access, vol. 7, pp. 76342-76352, 2019.
[55] D. Wang, et al. "Hard attention net for automatic retinal vessel segmentation," IEEE Journal of Biomedical and Health Informatics, 2020.
[56] H. Zhang, et al. "Context encoding for semantic segmentation," in CVPR, pp. 7151-7160, 2018.
