STRUCTURE AND INFRASTRUCTURE ENGINEERING
https://doi.org/10.1080/15732479.2020.1838559

Deep learning-based instance segmentation of cracks from shield tunnel lining images

Hongwei Huang, Shuai Zhao, Dongming Zhang and Jiayao Chen

Key Laboratory of Geotechnical and Underground Engineering of the Ministry of Education and Department of Geotechnical Engineering, Tongji University, Shanghai, China

ABSTRACT
This paper presents a deep learning (DL)-based method for the instance segmentation of cracks from shield tunnel lining images using a mask region-based convolutional neural network (Mask R-CNN) incorporated with a morphological closing operation. The Mask R-CNN herein is divided into a backbone architecture, a region proposal network (RPN), and a head architecture for specification, and the implementation details are introduced. Compared with current image processing methods, the proposed DL-based method efficiently detects cracks in an image while simultaneously generating a high-quality segmentation mask for each crack. A shield tunnel lining image dataset is established for the crack instance segmentation task; the established dataset contains a total of 1171 labelled crack instances in 761 images. The morphological closing operation was incorporated into a Mask R-CNN to form an integrated model that connects disjoint cracks that belong to one crack. Image tests were carried out among four trained models to explore the effect of the morphological closing operation, network depth, and feature pyramid network on crack segmentation performance, and a relative optimal model is found. The relative optimal model achieves a balanced accuracy of 81.94%, an F1 score of 68.68%, and an intersection over union (IoU) of 52.72% with respect to 76 test images.

ARTICLE HISTORY
Received 22 August 2019; Revised 11 May 2020; Accepted 31 July 2020

KEYWORDS
Crack; convolutional neural networks; instance segmentation; morphological closing operation; shield tunnel linings; deep learning

1. Introduction

Subway shield tunnels are often used in urban areas to relieve ground traffic pressure and have become a vital constituent of transit infrastructure. Unfortunately, subway shield tunnels deteriorate owing to aging and harsh environmental conditions. Various defects, such as cracks on the lining surface of a subway tunnel, develop owing to this deterioration (Huang, Shao, Zhang, & Wang, 2017). Cracks are the primary indicator of structural deterioration (Attard, Debono, Valentino, & Di Castro, 2018; Koch, Georgieva, Kasireddy, Akinci, & Fieguth, 2015), often reduce the bearing capacity of the tunnel structure (Liu, Bai, Yuan, & Mang, 2016), and can even cause accidents (Asakura & Kojima, 2003). If cracks are detected in time, precautionary measures can be taken to avoid greater damage to the tunnel as well as potential accidents that might otherwise take place. Therefore, manual onsite inspection is periodically conducted to inspect cracks and other defects in subway tunnels. The accuracy of this manual method is highly dependent on the inspectors' knowledge and experience, which inevitably leads to subjective evaluation of the tunnel condition. In view of this limitation, the development of computer vision (CV) techniques is quite promising for overcoming the disadvantages of human-based onsite inspections (Ai, Yuan, & Bi, 2016).

Most CV methods for crack detection in concrete tunnels involve an image processing (IP) step and a crack identification step. In previous studies, numerous researchers used IP techniques, such as top-hat transformation and median filtering, to remove background noise and enhance the image contrast, and then used threshold (TH) techniques to isolate the crack from the background of the image (Fujita, Mitani, & Hamamoto, 2006; Lee, Kim, Yi, & Kim, 2013; Miyamoto, Konno, & Brühwiler, 2007; Qi, Liu, Wu, & Zhang, 2014; Shen, Zhang, Qi, & Wu, 2015; Ukai, 2007; Zhang, Zhang, Qi, & Liu, 2014). The major difficulty in using TH methods is how to choose a suitable threshold to isolate cracks from the background of an image, because the pixel grey values of some distractors (such as segmental joints) are quite similar to those of the cracks, which inevitably affects the selection of the threshold. Hence, those methods can hardly be applied to different tunnel environments.

The identification step generally uses pattern recognition techniques and/or model-based methods to determine crack regions. With respect to model-based algorithms, Yu, Jang, and Han (2007), Paar, Caballo-Perucha, Kontrus, and Sidla (2006), and Yamaguchi and Hashimoto (2010) used similar algorithms that search crack regions starting from a seed point, which strongly relies on user input. Whether the other pixels around the seed point are labelled as crack pixels depends on a predefined threshold, and it is also difficult to choose a suitable threshold.


Figure 1. Process of crack image processing.

For crack pattern recognition algorithms, Support Vector Machine, Extreme Learning Machine, and Support Vector Data Description were used by Liu, Suandi, Ohashi, and Ejima (2002), Zhang et al. (2014), and Lin, Lin, and Wang (2017), respectively. These methods rely on the aforementioned TH techniques to obtain the ground truth of the images in the training dataset. In summary, the selection of a suitable threshold used by IP algorithms is based on prior knowledge, and the accuracy of detection using these IP algorithms is highly dependent on the complexity of the image background. The aforementioned crack classification algorithms may be prone to inaccuracy when more distractors are present in the images.

Deep learning (DL) has made great progress in computer vision in recent years. It enables a machine to be fed with raw image data and to automatically discover the features needed for classification or object detection (LeCun, Bengio, & Hinton, 2015). DL-based convolutional neural networks (CNNs) are good at processing image data and are generally used to achieve object detection, semantic segmentation, and instance segmentation (Figure 1). CNNs have proven to be effective in several civil engineering applications, namely structural defect detection (Cha, Choi, Suh, Mahmoudkhani, & Büyüköztürk, 2018; Makantasis, Protopapadakis, Doulamis, Doulamis, & Loupos, 2015; Soukup & Huber-Mörk, 2014), concrete bughole instance segmentation (Wei, Yao, Yang, & Sun, 2019), and concrete crack detection and instance segmentation (Cha, Choi, & Büyüköztürk, 2017; Kim & Cho, 2019; Ni, Zhang, & Chen, 2019; Zhang et al., 2017).

The mentioned crack detection and instance segmentation methods are used for crack images with a simpler background than shield tunnel lining images. To date, there has been little research on the application of CNNs to crack recognition from shield tunnel lining images. Xue and Li (2018) used a region-based fully convolutional network to implement crack and leakage detection; they focused on defect detection rather than obtaining the semantic information of cracks. Huang, Li, and Zhang (2018) employed a fully convolutional network (FCN) for the semantic segmentation of leakage areas and cracks. However, it is not easy to determine the position of a defect on the concrete segment owing to the removal of the image background in their approach. Hence, instance segmentation (Figure 1), which combines object detection and semantic segmentation, is proposed in this study to overcome the above limitations.

A Mask R-CNN (He, Gkioxari, Dollar, & Girshick, 2017) is a recently developed architecture for the task of instance segmentation. This method efficiently generates both a bounding box and a segmentation mask for each instance in an image, and distinguishes each crack instance by a different colour, as shown in Figure 1. Mask R-CNN has been used for concrete bughole instance segmentation (Wei et al., 2019) and moisture-mark instance segmentation (Zhao, Zhang, & Huang, 2020). However, there has been little research on the application of Mask R-CNN to crack instance segmentation in the field of tunnel engineering.

This paper presents the instance segmentation of cracks from shield tunnel lining images using a modified Mask R-CNN, in which a morphological closing operation is incorporated. An image dataset including 1171 crack instances in 761 images is first established for the crack instance segmentation task, and the properties of cracks that are suitable for automated identification are introduced. The overall Mask R-CNN architecture is then introduced through three steps, namely feature extraction, proposal generation, and crack identification. Next, the implementation of the Mask R-CNN is detailed, and the morphological closing operation is incorporated into a trained model to improve the segmentation results of cracks at the testing stage. Moreover, the balanced accuracy, IoU, and F1 score of four trained models are compared to analyse the effect of the morphological closing operation, network depth, and feature pyramids on crack segmentation. Afterwards, the different results obtained by the different Mask R-CNN models are further explained and discussed in terms of the internal mechanism. Finally, the concluding remarks are given.

2. Introduction of crack image dataset

Training a crack segmentation model relies on a large amount of labelled image data. The cracks should be labelled in the images so as to make a DL-based model learn the features of cracks automatically and efficiently. Therefore, it is crucial to collect and label crack images to establish a crack image dataset.

2.1. Crack image acquisition

The images of the subway tunnel lining are obtained using the Moving Tunnel Inspection (MTI-200a) equipment (Figure 2(a)), which was previously developed by the authors. Figure 2(b) shows the working scene in a subway tunnel; the inspection range is approximately 290°, excluding the bottom of the tunnel. The following characteristics of the MTI-200a ensure the good quality of the captured images. First, the layout of the line-scan CCD cameras guarantees that the images are taken orthogonal to the lining surface with a 0.14-m overlap to ensure better area coverage. Second, the LEDs provide sufficient lighting to guarantee the consistency of the images; the line-scan CCD cameras can recognise cracks of 0.29 mm in width on the 5.5-m-diameter subway tunnel lining surface under an illumination intensity of 5 klx. Finally, a 3–5 km/h inspection speed meets the precision demanded for the inspection. At this speed, 2–3 sections can be inspected by the equipment in approximately 1.5 h per night, which does not overburden the inspectors. Details of the MTI system are not elaborated here owing to page limits, but can be found in previous papers (Huang, Sun, Xue, & Wang, 2017; Huang et al., 2018).

Figure 2. (a) Schematic of MTI-200a and (b) picture of MTI-200a in a shield tunnel.

2.2. Crack image annotation

Each line-scan CCD camera acquired images with a resolution of 1,000 × 7,448 pixels, and the captured images were stored in the computer of the MTI-200a equipment. In order to establish labelled datasets of crack images, the cracks in the images must be labelled. Crack labelling was carried out in four steps. Firstly, images of 3,000 × 7,448 pixels were extracted from the computer by image stitching. Secondly, the images containing cracks were selected among the extracted images. Thirdly, the selected images were cropped to highlight the targeted cracks; to avoid significant distortions to image features owing to the rescaling process in training, the cropped images were set to have sizes ranging from 800 × 800 pixels to 3,000 × 3,000 pixels. Finally, the images obtained in step 3 were labelled with the LabelMe tool (Wada, 2016), as depicted in Figure 3. After image labelling and conversion, the final JavaScript Object Notation (JSON) files were generated (Zhao et al., 2020). The final annotation file contains the width and length of the minimum enclosing rectangle of each crack, the coordinates of the crack edge points, and other information.
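As an illustration of how such annotation files can be consumed, the short sketch below rasterises LabelMe-style polygon annotations into per-instance binary masks. It assumes the JSON layout commonly produced by LabelMe (a "shapes" list holding polygon "points", plus "imageHeight" and "imageWidth"); the file name and the "crack" label are hypothetical, and key names may differ between LabelMe versions.

```python
# Sketch: convert labelme-style polygon annotations into binary crack masks.
import json
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_masks(json_path, label="crack"):
    with open(json_path) as f:
        ann = json.load(f)
    h, w = ann["imageHeight"], ann["imageWidth"]
    masks = []
    for shape in ann["shapes"]:
        if shape["label"] != label:            # keep crack instances only
            continue
        canvas = Image.new("L", (w, h), 0)     # blank single-channel image
        xy = [tuple(p) for p in shape["points"]]
        ImageDraw.Draw(canvas).polygon(xy, outline=1, fill=1)
        masks.append(np.array(canvas, dtype=np.uint8))  # one mask per crack instance
    return masks

masks = polygons_to_masks("crack_001.json")    # hypothetical annotation file
print(len(masks), "crack instances")
```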
2.3. Data division

The inspection was carried out in the Shanghai subway network using the MTI-200a equipment. A total of 761 crack images containing 1171 crack instances were obtained and labelled using the methods in Section 2.2. Figure 4 displays a fraction of the labelled images. Following the research by Shahin, Maier, and Jaksa (2004), 10% of the crack image data (76 images) was used as the testing dataset, and the remaining 90% (685 images) was used as the training and validation datasets, with a training to validation proportion of 4:1. Table 1 displays general statistics of the target images.
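The division described above can be reproduced with a few lines of Python; the file names below are stand-ins, and only the 76/548/137 split taken from the paper is assumed.

```python
# Illustrative split: 10% testing, remainder divided 4:1 into training/validation.
import random

image_names = ["img_%04d.jpg" % i for i in range(761)]    # stand-in for real file names
random.seed(0)
random.shuffle(image_names)

n_test = int(round(0.10 * len(image_names)))               # 76 images
test_set = image_names[:n_test]
remaining = image_names[n_test:]                           # 685 images
n_val = len(remaining) // 5                                # 1 part in 5 -> 137 images
val_set, train_set = remaining[:n_val], remaining[n_val:]  # 137 / 548 images

print(len(train_set), len(val_set), len(test_set))         # 548 137 76
```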

Figure 3. Cracks labelled by LabelMe.

Figure 4. (a) Labelled images and (b) the corresponding ground truth of (a).

Table 1. Numbers of crack images for the instance segmentation dataset.
Testing (10%)    Training (80% of the remaining 90%)    Validation (20% of the remaining 90%)
76               548                                    137

2.4. Properties of cracks for automated identification

Generally, crack regions in an image are dark-coloured regions with lower intensity values than their background, and the contours of cracks are curved. These known characteristics allow TH techniques to be used to isolate cracks from the background of an image (Shen et al., 2015; Ukai, 2000; Zhang et al., 2014). However, some image pre-processing steps must be applied to reduce background noise before using TH methods to isolate cracks, and some thresholds used in these pre-processing steps need to be selected based on prior knowledge (Fujita et al., 2006; Miyamoto et al., 2007). These TH methods mainly consider the difference between the grey values of the crack and of distractors such as joints and pipes. Therefore, their performance may decrease significantly when the image background becomes more complex.

Deep learning (DL) enables a machine to automatically learn the features needed for detection and classification, and has proven to be robust for the detection of cracks and leakage areas (Cha et al., 2017; Xue & Li, 2018). This capability of deep learning and the gradient-grey-value characteristics at crack edges motivate the use of the DL-based Mask R-CNN to segment each crack from the image background. An image is made up of an array of pixel values. Using these pixel values, the Mask R-CNN can automatically discover the edges and texture of cracks through its shallow convolutional layers, combine edges and texture into motifs through its middle convolutional layers, and assemble motifs into parts of cracks through its deep convolutional layers (Zhao et al., 2020).

Figure 5. Overall architecture of a Mask R-CNN.

The above process is automatically accomplished through a self-learning procedure, which is considered the key advantage of DL. During training, the Mask R-CNN is shown a crack-labelled image and produces an output in the form of a vector of scores. An objective loss function is used to compute the error between the output scores and the desired pattern of scores. The Mask R-CNN automatically modifies its internal adjustable parameters (i.e. weights) along a gradient vector to reduce this error until a set of optimal weights for the Mask R-CNN is obtained and saved. After training, the desired crack obtains a higher score than unlabelled objects in the image. Therefore, if a Mask R-CNN is well trained, it can segment cracks from the image background, regardless of the distractors.

3. Framework for image segmentation for cracks

Figure 5 illustrates the Mask R-CNN structure. It comprises multiple architectures, namely a backbone architecture, a region proposal network (RPN), and a head architecture. In the following, the process is divided into three steps to detail how the Mask R-CNN accomplishes crack instance segmentation using its backbone architecture, RPN, and head architecture.

3.1. Step 1: Crack feature extraction

Crack feature extraction is the first important step. For an input crack image, the backbone architecture (e.g., ResNet-101 with FPN (Figure 5)) of the Mask R-CNN first extracts crack features and produces feature maps. These features are extracted through edge detection, motif formation, and motif assembly (as introduced in Section 2.4), and are used for crack localization in step 2 and crack instance segmentation in step 3. Residual networks (ResNets) (He, Zhang, Ren, & Sun, 2016) and the visual geometry group's (VGG) network (Simonyan & Zisserman, 2014) can be used as backbone architectures to extract crack features, because their increased convolutional layer depth substantially improves model performance.

There are two ways to use the crack features on the feature maps produced by the backbone architecture, namely to make crack predictions on a single high-level feature map or on a built feature pyramid, as displayed in Figure 6. Xue and Li (2018) and Huang et al. (2018) used VGG as the backbone architecture to generate crack feature maps and made crack predictions on the final single-scale high-level feature map. However, for small objects (e.g. cracks), making predictions on the final single-scale high-level feature map is often replaced by making predictions on a built feature pyramid; the reasons are explained in the Discussion (Section 5). A feature pyramid can be built by a feature pyramid network (FPN), as illustrated in Figures 5 and 6(b). The FPN leverages the hierarchical feature maps produced by different convolutional layers to form a feature pyramid, which is realised through a top-down pathway implementing up-sampling with a factor of 2 and lateral connections implementing 1 × 1 feature fusion convolutions (Lin, Dollar, et al., 2017).

Figure 6. (a) Single feature map and (b) feature pyramid network. Thicker blue outlines indicate semantically stronger features.

3.2. Step 2: Crack region proposal generation

An RPN is used by Ren, He, Girshick, and Sun (2017) to generate region proposals on the single-scale feature maps produced by the backbone architecture. It is built via a small FCN subnetwork with a 3 × 3 convolutional layer and two sibling 1 × 1 convolutions, as displayed in Figure 7. At each sliding-window location, the 3 × 3 convolutional layer slides on the final single-scale feature map to produce 9 anchors. The anchors have three pre-defined aspect ratios {1:2, 1:1, 2:1} and three pre-defined scales to correspond to different sizes of cracks (Figure 7). One sibling 1 × 1 convolution uses these anchors as references to perform classification, and the other sibling 1 × 1 convolution uses them to perform regression.

If a ResNet combined with an FPN is used, a feature pyramid comprising the {P2, P3, P4, P5, P6} feature maps is produced, as presented in Figure 5. As a result, the definition of multi-scale anchors on a specific feature level is unnecessary: each level of {P2, P3, P4, P5, P6} is assigned anchors of only a single scale, with aspect ratios {1:2, 1:1, 2:1}. Thus, a total of 15 anchors are generated for classification and regression. The P6 feature map is only used by the RPN to generate region proposals; it is not used by the head architecture in step 3 to extract region of interest (RoI) features.
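To make the anchor layout concrete, the sketch below enumerates the nine boxes placed at one sliding-window location (three aspect ratios crossed with three scales). The scale values of 32, 64 and 128 pixels and the area-preserving ratio convention are assumptions for illustration only; the paper does not state the exact scales used.

```python
# Sketch: the nine RPN anchors at one sliding-window location (cx, cy).
import numpy as np

def anchors_at(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)          # width and height chosen so the area stays close to s*s
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)              # 9 anchors as (x1, y1, x2, y2)

print(anchors_at(100, 100).shape)       # (9, 4)
```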
3.3. Step 3: Crack identification

The crack identification is achieved through the head architecture, which comprises an RoI align layer, two fully connected layers, and an FCN, as illustrated in Figure 5. After the above two steps, the top-N (N is the number of crack proposals) ranked crack proposals are taken as input to the RoI align layer in accordance with their class scores. For each crack proposal, a fixed-length feature vector is extracted through the RoI align layer from the corresponding level of the {P2, P3, P4, P5} feature pyramid according to the proposal size. Then, a sequence of fully connected layers uses each feature vector for classification and for more accurate bounding-box regression than in step 2. Simultaneously, a segmentation mask for each RoI is generated through an FCN subnetwork (i.e. the mask branch).

During the training process, a multi-task loss L is defined, which includes the classification loss (Lcls), the bounding-box regression loss (Lbox), and the mask loss (Lmask) (He et al., 2017). The multi-task loss L is calculated as follows (He et al., 2017):

L = Lcls + Lbox + Lmask    (1)

The value of L is commonly used as an important indicator to decide whether the training process of a Mask R-CNN model is complete. When the loss converges, the training can be regarded as complete.

4. Model implementation and experimental results

4.1. Mask R-CNN model implementation and results

The steps explained in Section 3 and shown in Figure 5 were implemented with DETECTRON (Girshick, Radosavovic, Gkioxari, Dollar, & He, 2018), Facebook AI Research's software system that implements state-of-the-art object detection algorithms. Experiments were carried out on a desktop equipped with two GeForce GTX 1080 graphics processing units (GPUs), 64 GB of random access memory (RAM), an Intel Core i7-5820K central processing unit (CPU), and the Ubuntu 16.04 operating system. The computation software environment was configured with Python 2.7.14, CUDA 8.0, CUDNN 6.0, and DETECTRON.

In the experiments, ResNets with 101 layers and 50 layers were used; they are denoted ResNet-101 and ResNet-50, respectively. A Mask R-CNN using ResNet-50 as the backbone architecture to extract crack features on the C4 feature map is denoted M-R-50-C4, and a Mask R-CNN using ResNet-50 and FPN as the backbone architecture is denoted M-R-50-FPN. Similarly, a Mask R-CNN using ResNet-101 and FPN as the backbone architecture is denoted M-R-101-FPN. The key characteristics of the three models are displayed in Table 2. Image-centric training was used for the three models, and the training automatically applies horizontal flipping of the training images for data augmentation. The momentum and weight decay were set to 0.9 and 0.0001, respectively. Through trial and error, the learning rate was set to 0.005 for the first 30,000 iterations and to 0.0005 for the next 10,000 iterations, and was decreased by a further factor of 10 at 40,000 iterations. The maximum number of iterations was set to 45,000. The training does not stop until the maximum iteration is reached, and the trained model is saved every 15,000 iterations. It can be seen in Figures 8 and 9 that both the loss and the accuracy of the three models converged after 45,000 iterations. Therefore, the M-R-50-C4, M-R-50-FPN and M-R-101-FPN models saved at the 45,000th iteration were used to conduct experiments on the testing dataset.
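The step schedule described above can be restated compactly as follows. In DETECTRON this schedule would normally be set through the solver section of a YAML configuration file; the function below is only an illustration of the schedule itself, not the configuration actually used.

```python
# Sketch: the step learning-rate schedule used for training
# (0.005 at the start, divided by 10 at 30,000 and again at 40,000 iterations).
def learning_rate(iteration, base_lr=0.005, steps=(30000, 40000), gamma=0.1):
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma                 # divide by 10 at each step boundary
    return lr

for it in (0, 29999, 30000, 39999, 40000, 44999):
    print(it, learning_rate(it))        # 0.005 -> 0.0005 -> 0.00005
```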

Figure 7. RPN sliding on a single feature map.

Table 2. Characteristics of the three Mask R-CNN models.
Model          Backbone architecture    Layer depth of backbone architecture    Feature extractor
M-R-50-C4      ResNet-50                50                                      C4 feature map
M-R-50-FPN     ResNet-50-FPN            50                                      {P2, P3, P4, P5} feature pyramid
M-R-101-FPN    ResNet-101-FPN           101                                     {P2, P3, P4, P5} feature pyramid

Figure 8. Learning curves of the training process.

Figure 9. Classification accuracy curves of the training process.

4.2. Instance segmentation of crack images

The trained Mask R-CNN models can recall the learned optimal weights and thus produce new outputs from new input data. Figure 10 presents examples of the true positive results obtained using the M-R-101-FPN model. Each crack instance in Figure 10 is localised with a bounding box, covered with a segmentation mask, and distinguished by a different colour. The score at the top-left of each bounding box indicates the possibility that the segmented object is a crack. The segmentation performance is quantitatively evaluated in Section 4.4 using the metrics defined there.

Figure 10. Two true positive results segmented by M-R-101-FPN.

Figure 11. Three false positive segmentation results (bottom) from the corresponding input raw images (top) using the M-R-101-FPN model: (a) joints' boundaries identified as a crack, (b) bolt hole's boundaries identified as a crack, and (c) distractors' boundaries identified as a crack.

Figure 11 displays three false positive results. The M-R-101-FPN model falsely recognises a part of the joints (Figure 11(a)), a part of a bolt hole (Figure 11(b)), and a part of other distractors (Figure 11(c)) as cracks. The falsely segmented parts in these cases have curved edges and lower intensity values than their neighbouring background. These shape and gradient-grey-value features are similar to those of cracks, which triggered the false recognition by the M-R-101-FPN model.

In some cases, a continuous crack in a raw image breaks into several sections (Figures 12(a) and 14(a)) when the trained Mask R-CNN models are used to recognise it. This disjoint problem of the segmented cracks may be caused by the multiple convolution operations, which can lead to the loss of features of small objects (e.g. cracks). In order to connect the disjoint segments that belong to one continuous crack, the morphological closing operation was incorporated into the Mask R-CNN to process the crack features in this study.

Figure 12. Samples of good connection results: (a) before the morphological closing operation and (b) after the morphological closing operation, with a 15 × 15 ellipse structuring element. All the disjoint regions of cracks are connected by using the morphological closing operation.

4.3. The morphological closing operation

The morphological closing operation is defined as a dilation operation followed by an erosion operation for binary images, using the same structuring element for both operations (Raid, Khedr, El-Dosuky, & Aoud, 2014). A structuring element is a kernel with a set of coordinate points, and it can have a cross, rectangle, or ellipse structure. The structure can also be customised, and it determines exactly how the objects are eroded or dilated (Gil & Kimmel, 2002). Disjoint cracks can be connected by the dilation operation using a proper structuring element, but the boundaries of the cracks are also expanded. The expanded boundaries are eroded in the following erosion operation using the same structuring element. Therefore, the disjoint cracks are connected through the morphological closing operation. Figure 13 illustrates how the disjoint cracks are connected using the morphological closing operation.

In this study, the source code of the trained M-R-101-FPN model was modified. The modified model outputs a closed polygon for each crack in an image and then assigns values of 1 to the inside of the polygons and 0 to the outside of the polygons to generate a binary image. Once the binary image is generated, the code of the morphological closing operation is called to execute the closing operation on the binary image using a 15 × 15 ellipse structuring element. This operation connects disjoint cracks that belong to one crack. Then, the modified model overlays a mask on the connected crack to output the masked image. The M-R-101-FPN model incorporated with the morphological closing operation is named the M-R-101-FPN-closing-operation model. The disjoint cracks that belong to one crack are connected using the M-R-101-FPN-closing-operation model, as presented in Figure 12(b). Even when the morphological closing operation is used, some disjoint cracks cannot be fully connected, as presented in Figure 14. The reason is that a fixed 15 × 15 ellipse structuring element is used in this study, and it cannot connect cracks that are far apart. The effect of the morphological closing operation on the segmentation performance is tested using contrast experiments in Section 4.4.
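A minimal sketch of this post-processing step, assuming OpenCV is available, is given below: the rasterised crack polygons are closed with a 15 × 15 ellipse structuring element so that nearby disjoint segments merge. The modification of the DETECTRON source code itself is not reproduced here.

```python
# Sketch: morphological closing of a binary crack mask with a 15 x 15 ellipse kernel.
import cv2
import numpy as np

def close_crack_mask(binary_mask):
    """binary_mask: uint8 array with 1 inside predicted crack polygons, 0 outside."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    # closing = dilation followed by erosion with the same structuring element
    return cv2.morphologyEx(binary_mask, cv2.MORPH_CLOSE, kernel)

# toy example: two crack fragments separated by a small gap become one region
mask = np.zeros((60, 60), dtype=np.uint8)
mask[28:32, 5:25] = 1
mask[28:32, 33:55] = 1
closed = close_crack_mask(mask)
print(mask.sum(), closed.sum())   # the closed mask also covers the bridged gap
```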

Figure 13. Process of the morphological closing operation. The red contour in the dilated and eroded images marks the original crack boundaries; the blue contour marks the expanded crack boundaries after dilation of the binary image; the green contour marks the eroded crack boundaries after erosion of the dilated image.

Figure 14. Samples of poor connection results: (a) before the morphological closing operation and (b) after the morphological closing operation, with a 15 × 15 ellipse structuring element. There are also some disjoint regions that are not connected after the morphological closing operation.

4.4. Test of Mask R-CNN models for crack segmentation

In this section, the segmentation performance of the M-R-50-C4 model, the M-R-50-FPN model, the M-R-101-FPN model and the M-R-101-FPN-closing-operation model is compared to analyse the effect of the morphological closing operation, network depth, and FPN on crack segmentation. Three metrics, i.e. the balanced accuracy, IoU, and F1 score, are applied herein to evaluate the segmentation performance. The balanced accuracy describes the average accuracy obtained on either class, while the IoU refers to the overlap rate between a predicted object and the corresponding ground truth. The F1 score measures both identification completeness and identification exactness, and considers identification completeness to be as important as identification exactness.

Table 3. Metrics of different methods for crack segmentation.


Method Balanced accuracy F1 score IoU
M-R-50-C4 0.6819 0.4567 0.3051
M-R-50-FPN 0.7935 0.6535 0.4902
M-R-101-FPN 0.7901 0.6491 0.4852
M-R-101-FPN-closing-operation 0.8194 0.6868 0.5272

The three metrics are calculated as follows (Brodersen, Ong, Stephan, & Buhmann, 2010; Zhao et al., 2020):

Balanced accuracy = [TP/(TP + FN) + TN/(TN + FP)]/2
F1 score = 2/(1/precision + 1/recall)
IoU = TP/(TP + FP + FN)

where TP is the number of crack pixels correctly identified as crack; FN is the number of crack pixels falsely identified as background; TN is the number of background pixels correctly identified as background; and FP is the number of background pixels falsely identified as crack. The precision is calculated as TP/(TP + FP) and the recall as TP/(TP + FN). TP, FN, TN, and FP are illustrated in Figure 15.

Figure 15. Intersection over union.
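The following short sketch shows how the three metrics can be computed from a predicted and a ground-truth binary mask, following the pixel counts defined above. It assumes both masks contain crack and background pixels, so that none of the denominators is zero.

```python
# Sketch: balanced accuracy, F1 score and IoU from two binary masks (1 = crack, 0 = background).
import numpy as np

def segmentation_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)             # crack pixels predicted as crack
    fn = np.sum(~pred & gt)            # crack pixels missed (predicted as background)
    tn = np.sum(~pred & ~gt)           # background pixels predicted as background
    fp = np.sum(pred & ~gt)            # background pixels predicted as crack
    balanced_acc = (tp / float(tp + fn) + tn / float(tn + fp)) / 2.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    iou = tp / float(tp + fp + fn)
    return balanced_acc, f1, iou
```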
The results obtained by the four models with respect to the 76 testing images are presented in Table 3. The IoU values of the models in Table 3 are around 0.5, except that of M-R-50-C4, and are lower than the balanced accuracy and F1 score. Cracks are characterised by narrow and long features, and most misidentified pixels lie around the edges of cracks, which makes the value of TP approximately equal to the sum of FP and FN in most cases. As a result, the mean value of IoU is around 0.5. With an IoU around 0.5, the labelled crack almost overlaps with the predicted crack, as shown in Figure 16.

Comparing the performance of the M-R-101-FPN with that of the M-R-50-FPN, it can be seen that the network depth has a small impact on crack segmentation with respect to balanced accuracy, IoU and F1 score. Comparing the performance of the M-R-50-FPN with that of the M-R-50-C4, the FPN improved the balanced accuracy by 11.16%, the F1 score by 19.68%, and the IoU by 18.51%. The use of the morphological closing operation improved the balanced accuracy from 79.01% (M-R-101-FPN model) to 81.94% (M-R-101-FPN-closing-operation model), the F1 score from 64.91% to 68.68%, and the IoU from 48.52% to 52.72%; that is, the balanced accuracy, F1 score, and IoU increased by 2.93%, 3.77%, and 4.2%, respectively. Therefore, it can be concluded that the morphological closing operation contributes to the increase of the balanced accuracy, F1 score and IoU.

In summary, the M-R-101-FPN-closing-operation model is the relative optimal model among these models. The four models were used to perform crack segmentation on GPUs, which can process images efficiently (Zhao et al., 2020). The inference time is about 0.11 s per image for each of the four trained models, and it is not affected by the image size.

5. Discussion

In this section, the differences in the segmentation performance among the four methods are further discussed. The feature hierarchy of the convolutional layers of a CNN is affected by the position of the convolutional layers; a deeper layer produces higher-level features. The shallow convolutional layers produce fine-resolution maps with low-level features needed for crack detection, whereas the deep convolutional layers produce coarse-resolution maps with high-level features. Figure 17(a) shows a raw image; Figure 17(b, c) show the low-level features from the C2 and C3 feature maps, respectively; and Figure 17(d) displays the high-level features from the C4 feature map of the M-R-50-C4 model. It can be observed that the low-level features mainly describe the detail features of edge and texture, which are important for crack detection, while the high-level features describe abstract features. Generally, a crack is a small object and its information occupies only a small portion of an image. When a CNN executes convolutions and pooling up to the last layer of the conv4_x block (Figure 5), the low-level features that are essential for crack detection may be lost. Thus, the crack features may not be adequately extracted on only the final single high-level feature map of the last layer.

An FPN solves the above problem by combining coarse-resolution, semantically strong features with fine-resolution, semantically weak features via a top-down pathway and lateral connections (Figure 5). The features of each level of the feature pyramid become semantically stronger after up-sampling through the top-down pathway and feature fusion through the lateral connections. Therefore, the feature maps with low-level features are re-used, and cracks can be better segmented.
The M-R-101-FPN model and the M-R-50-FPN model used the FPN to generate a feature pyramid and carried out crack prediction at each level of the feature pyramid (Figures 5 and 6(b)), whereas the M-R-50-C4 model carried out crack prediction only on a single-scale high-level feature map, failing to re-use the low-level features of cracks. Therefore, the performance of the M-R-101-FPN model and the M-R-50-FPN model is superior to that of the M-R-50-C4 model. Figure 18(a–c) provides a supporting example: for the same input image, Figure 18(a–c) shows the segmentation results of the M-R-50-C4 model, the M-R-50-FPN model and the M-R-101-FPN model, respectively, and the segmentation results in Figure 18(b, c) are better than that in Figure 18(a). As introduced in Section 4.3, the morphological closing operation connects the disjoint cracks; as a result, the localisation of crack boundaries is improved. Therefore, the M-R-101-FPN-closing-operation model is superior to the other three models.

Figure 16. Examples of IoU of cracks. The red colour denotes the ground truth of crack and the blue colour denotes the crack output by the M-R-101-FPN-closing-
operation model.

Figure 17. Feature maps of an input image.

Figure 18. Examples of masked images: (a), (b) and (c) are the segmentation results of the M-R-50-C4, M-R-50-FPN and M-R-101-FPN models, respectively, for the same input image.

6. Conclusions

This paper implements Mask R-CNN models for the instance segmentation of cracks from shield tunnel lining images. The implementation of the proposed method can be summarised in two steps: image dataset creation and model development. For image dataset establishment, the MTI-200a equipment was used in this study to acquire images, and the images were labelled with the LabelMe tool. In this way, an image dataset of 761 images in total, comprising a training dataset (548 images), a validation dataset (137 images), and a testing dataset (76 images), was established.

For model development, a trial-and-error approach was used to choose the training hyper-parameters for the Mask R-CNN models, and the morphological closing operation was incorporated into the M-R-101-FPN model to form an integrated model at the testing stage.

Experiments were conducted among four models to explore the effect of the morphological closing operation, network depth and feature pyramids on crack segmentation performance, and thus the relative optimal model (i.e. the M-R-101-FPN-closing-operation model) was found. The use of the FPN improved the balanced accuracy from 68.19% (M-R-50-C4 model) to 79.35% (M-R-50-FPN model), the F1 score from 45.67% to 65.35%, and the IoU from 30.51% to 49.02%. The morphological closing operation improved the balanced accuracy by 2.93%, the F1 score by 3.77%, and the IoU by 4.2%. For the 76 test images, the relative optimal model achieves a balanced accuracy of 81.94%, an F1 score of 68.68%, and an IoU of 52.72%.

However, it should be specifically noted that the image database presented here is still relatively small compared with the sample size required for universal conditions. Therefore, the presented model can only be applied to similar lining backgrounds. In the future, the database can be enlarged to contain different types of cracks so as to further improve the balanced accuracy, F1 score, IoU, and robustness of the proposed method. A Mask R-CNN model that is well trained on a large dataset can be generalised to different tunnel environments.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

The authors are grateful to Mr. Qingtong Li from Shanghai Shentong Metro Co., Ltd. for his help in acquiring the images used in this work. The financial support from the National Natural Science Foundation of China (grant Nos. 51778474, 51978516 and 52022070) and the Key Innovation Team Program of the Innovation Talents Promotion Plan by MOST of China (grant No. 2016RA4059) is gratefully acknowledged.

ORCID

Dongming Zhang http://orcid.org/0000-0001-7652-1919
References

Ai, Q., Yuan, Y., & Bi, X. (2016). Acquiring sectional profile of metro tunnels using charge-coupled device cameras. Structure and Infrastructure Engineering, 12(9), 1065–1075. doi:10.1080/15732479.2015.1076855
Asakura, T., & Kojima, Y. (2003). Tunnel maintenance in Japan. Tunnelling and Underground Space Technology, 18(2–3), 161–169. doi:10.1016/S0886-7798(03)00024-5
Attard, L., Debono, C.J., Valentino, G., & Di Castro, M. (2018). Tunnel inspection using photogrammetric techniques and image processing: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 144, 180–188. doi:10.1016/j.isprsjprs.2018.07.010
Brodersen, K.H., Ong, C.S., Stephan, K.E., & Buhmann, J.M. (2010). The balanced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition (pp. 3121–3124). doi:10.1109/ICPR.2010.764
Cha, Y.J., Choi, W., & Büyüköztürk, O. (2017). Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering, 32(5), 361–378. doi:10.1111/mice.12263
Cha, Y.J., Choi, W., Suh, G., Mahmoudkhani, S., & Büyüköztürk, O. (2018). Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Computer-Aided Civil and Infrastructure Engineering, 33(9), 731–747. doi:10.1111/mice.12334
Fujita, Y., Mitani, Y., & Hamamoto, Y. (2006). A method for crack detection on a concrete structure. In Proceedings of the 18th International Conference on Pattern Recognition (Vol. 3, pp. 901–904). doi:10.1109/ICPR.2006.98
Gil, J.Y., & Kimmel, R. (2002). Efficient dilation, erosion, opening, and closing algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1606–1617. doi:10.1109/TPAMI.2002.1114852
Girshick, R., Radosavovic, I., Gkioxari, G., Dollar, P., & He, K. (2018). Detectron. Retrieved from https://github.com/facebookresearch/Detectron
He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask R-CNN. arXiv preprint arXiv:1703.06870.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Huang, H., Shao, H., Zhang, D., & Wang, F. (2017). Deformational responses of operated shield tunnel to extreme surcharge: A case study. Structure and Infrastructure Engineering, 13(3), 345–360. doi:10.1080/15732479.2016.1170156
Huang, H., Sun, Y., Xue, Y., & Wang, F. (2017). Inspection equipment study for subway tunnel defects by grey-scale image processing. Advanced Engineering Informatics, 32, 188–201. doi:10.1016/j.aei.2017.03.003
Huang, H.W., Li, Q.T., & Zhang, D.M. (2018). Deep learning based image recognition for crack and leakage defects of metro shield tunnel. Tunnelling and Underground Space Technology, 77, 166–176. doi:10.1016/j.tust.2018.04.002
Kim, B., & Cho, S. (2019). Image-based concrete crack assessment using mask and region-based convolutional neural network. Structural Control and Health Monitoring, 26(8), e2381. doi:10.1002/stc.2381
Koch, C., Georgieva, K., Kasireddy, V., Akinci, B., & Fieguth, P. (2015). A review on computer vision based defect detection and condition assessment of concrete and asphalt civil infrastructure. Advanced Engineering Informatics, 29(2), 196–210. doi:10.1016/j.aei.2015.01.008
Lee, B.Y., Kim, Y.Y., Yi, S.-T., & Kim, J.-K. (2013). Automated image processing technique for detecting and analysing concrete surface cracks. Structure and Infrastructure Engineering, 9(6), 567–577. doi:10.1080/15732479.2011.593891
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. doi:10.1038/nature14539
Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, W.D., Lin, Y.R., & Wang, F. (2017). Crack detection based on support vector data description. In Proceedings of the 29th Chinese Control and Decision Conference (pp. 1033–1038). doi:10.1109/CCDC.2017.7978671
Liu, X., Bai, Y., Yuan, Y., & Mang, H.A. (2016). Experimental investigation of the ultimate bearing capacity of continuously jointed segmental tunnel linings. Structure and Infrastructure Engineering, 12(10), 1364–1379. doi:10.1080/15732479.2015.1117115
Liu, Z., Suandi, S.A., Ohashi, T., & Ejima, T. (2002). Tunnel crack detection and classification system based on image processing. In Proceedings of SPIE (Vol. 4664, pp. 145–152). doi:10.1117/12.460191
Makantasis, K., Protopapadakis, E., Doulamis, A., Doulamis, N., & Loupos, C. (2015). Deep convolutional neural networks for efficient vision based tunnel inspection. In Proceedings of the 11th IEEE International Conference on Intelligent Computer Communication and Processing (pp. 335–342). doi:10.1109/ICCP.2015.7312681
Miyamoto, A., Konno, M.-A., & Brühwiler, E. (2007). Automatic crack recognition system for concrete structures using image processing approach. Asian Journal of Information Technology, 6, 553–561.
Ni, F., Zhang, J., & Chen, Z. (2019). Pixel-level crack delineation in images with convolutional feature fusion. Structural Control and Health Monitoring, 26(1), e2286. doi:10.1002/stc.2286
Paar, G., Caballo-Perucha, M.d.P., Kontrus, H., & Sidla, O. (2006). Optical crack following on tunnel surfaces. In Proceedings of SPIE (Vol. 6382, p. 638207). doi:10.1117/12.685987
Qi, D., Liu, Y., Wu, X., & Zhang, Z. (2014). An algorithm to detect the crack in the tunnel based on the image processing. In Proceedings of the 10th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (pp. 860–863). doi:10.1109/IIH-MSP.2014.217
Raid, A., Khedr, W., El-Dosuky, M., & Aoud, M. (2014). Image restoration based on morphological operations. International Journal of Computer Science, Engineering and Information Technology, 4(3), 9–21. doi:10.5121/ijcseit.2014.4302
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. doi:10.1109/TPAMI.2016.2577031
Shahin, M.A., Maier, H.R., & Jaksa, M.B. (2004). Data division for developing neural networks applied to geotechnical engineering. Journal of Computing in Civil Engineering, 18(2), 105–114. doi:10.1061/(ASCE)0887-3801(2004)18:2(105)
Shen, B., Zhang, W.-Y., Qi, D.-P., & Wu, X.-Y. (2015). Wireless multimedia sensor network based subway tunnel crack detection method. International Journal of Distributed Sensor Networks, 11(6), 184639. doi:10.1155/2015/184639
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Soukup, D., & Huber-Mörk, R. (2014). Convolutional neural networks for steel surface defect detection from photometric stereo images. In International Symposium on Visual Computing (pp. 668–677).
Ukai, M. (2000). Development of image processing technique for detection of tunnel wall deformation using continuously scanned image. Quarterly Report of RTRI, 41(3), 120–126. doi:10.2219/rtriqr.41.120
Ukai, M. (2007). Advanced inspection system of tunnel wall deformation using image processing. Quarterly Report of RTRI, 48(2), 94–98. doi:10.2219/rtriqr.48.94
Wada, K. (2016). Labelme: Image polygonal annotation with Python. Retrieved from https://github.com/wkentaro/labelme
Wei, F., Yao, G., Yang, Y., & Sun, Y. (2019). Instance-level recognition and quantification for concrete surface bughole based on deep learning. Automation in Construction, 107, 102920. doi:10.1016/j.autcon.2019.102920
Xue, Y.D., & Li, Y.C. (2018). A fast detection method via region-based fully convolutional neural networks for shield tunnel lining defects. Computer-Aided Civil and Infrastructure Engineering, 33(8), 638–654. doi:10.1111/mice.12367
Yamaguchi, T., & Hashimoto, S. (2010). Fast crack detection method for large-size concrete surface images using percolation-based image processing. Machine Vision and Applications, 21(5), 797–809. doi:10.1007/s00138-009-0189-8
Yu, S.-N., Jang, J.-H., & Han, C.-S. (2007). Auto inspection system using a mobile robot for detecting concrete cracks in a tunnel. Automation in Construction, 16(3), 255–261. doi:10.1016/j.autcon.2006.05.003
Zhang, A., Wang, K.C., Li, B., Yang, E., Dai, X., Peng, Y., … Chen, C. (2017). Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Computer-Aided Civil and Infrastructure Engineering, 32(10), 805–819. doi:10.1111/mice.12297
Zhang, W., Zhang, Z., Qi, D., & Liu, Y. (2014). Automatic crack detection and classification method for subway tunnel safety monitoring. Sensors, 14(10), 19307–19328. doi:10.3390/s141019307
Zhao, S., Zhang, D.M., & Huang, H.W. (2020). Deep learning-based image instance segmentation for moisture marks of shield tunnel lining. Tunnelling and Underground Space Technology, 95, 103156. doi:10.1016/j.tust.2019.103156
