Automated Asphalt Crack Detection Using Deformable SSD

Received October 20, 2021, accepted November 1, 2021, date of publication November 8, 2021, date of current version November
12, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3125703
Automated Asphalt Highway Pavement Crack Detection Based on Deformable Single Shot Multi-Box Detector Under a Complex Environment
KUN YAN AND ZHIHUA ZHANG
Faculty of Geomatics, Lanzhou Jiaotong University, Lanzhou 730070, China
National-Local Joint Engineering Research Center of Technologies and Applications for National Geographic State Monitoring, Lanzhou 730070, China
Gansu Provincial Engineering Laboratory for National Geographic State Monitoring, Lanzhou 730070, China
Corresponding author: Zhihua Zhang (43447077@qq.com)
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFB0504201 and
2017YFB0504203; in part by the National Natural Science Foundation of China under Grant 41861059, Grant 41761082, and Grant
61862039; and in part by Lanzhou Jiaotong University (LZJTU), China, under Grant 201806.
ABSTRACT Pavement cracks severely affect highway performance, so high-precision highway pavement crack detection is important for highway maintenance. However, the asphalt highway pavement environment is complex, and cracks are harder to detect against some pavement backgrounds than against others. Interference from road markings and surface repairs further complicates the environment and thus the detection of cracks. To reduce interference, we collected many images from different highway pavement backgrounds. We also improved the single shot multi-box detector (SSD) network and proposed a novel network named deformable SSD by adding a deformable convolution to the backbone feature extraction network VGG16. We verified our model using the PASCAL VOC2007 dataset and obtained a mean average precision (mAP) 3.1% higher than that of the original SSD model. We then trained and tested the proposed model using our crack detection dataset. We calculated precision, recall, F1 score, AP, mAP, and FPS to examine the performance of our model. The mAP of all categories in the test data was 85.11% using the proposed model, which is 10.4% and 0.55% higher than that of YOLOv4 and the original SSD model, respectively. These findings show that our model outperforms YOLOv4 and the original SSD model and confirm that incorporating a deformable convolution into the SSD network can improve the model's performance. The proposed model is appropriate for detecting pavement crack categories and locations in complicated environments. It can also provide important technical support for highway maintenance.
INDEX TERMS Crack detection, deformable convolution, multi-scale, SSD.
The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021 150925
K. Yan, Z. Zhang: Automated Asphalt Highway Pavement Crack Detection Based on Deformable SSD
I. INTRODUCTION
Pavement cracks are the most common and important pavement disease. They may be caused by various factors, such as vehicle load, man-made factors, and natural factors, and they are the main manifestation of the early stage of pavement disease. Crack length ranges from millimeters to meters, and crack width ranges from 1 mm to a few centimeters. Highway pavement cracks damage the pavement structure, reduce running speed, slow down traffic, and shorten road operation times. Serious pavement cracks will also weaken the bearing capacity of the roadbed, give rise to pavement collapse and traffic accidents, affect traffic safety, and cause economic losses. Therefore, the detection of pavement cracks is important for maintaining the highway. Traditional pavement crack detection methods involve using a detection vehicle to collect pavement images, marking the cracks on the image manually, and calculating the length and width of the crack. However, as China's highway mileage is rather long, the manual detection of cracks consumes considerable manpower and time. Thus, determining how to
implement automatic and intelligent pavement crack detection is an urgent need.
In the field of computer vision, object detection is more widely used than image classification and segmentation. It has been applied to remote sensing image detection, medical image detection, industrial parts detection, power inspection, crop disease detection, and other tasks. This technology can not only identify the category of an object but also locate its position in the image. Therefore, it greatly improves work efficiency, reduces labor costs, and raises the level of automation and intelligence. The residual network (ResNet) solves the degradation problem caused by deepening network layers [1]. Therefore, deep convolutional neural networks (DCNNs) can be integrated into object detection networks as a backbone to upgrade networks such as YOLOv3 [2], YOLOv4 [3], and Faster R-CNN [4]. DCNNs promote the rapid development of object detection networks and achieve higher performance. Moreover, transformers, the main model in natural language processing, can not only extract features but also achieve multimodal fusion [5], [6]. Object detection networks based on transformer modules, such as the Swin Transformer, have been proposed and have obtained high performance [7]. These networks improve the practicability of object detection.
Pavement crack detection is a complex vision task. Its aim is to determine whether any cracks that belong to a specified category exist and to identify their category and location in the image. Highway asphalt pavement crack detection is more difficult than crack detection on concrete pavement because cracks in asphalt pavement are less obvious than those in concrete pavement, which makes feature extraction difficult. In addition, interference from other pavement objects, such as crack surface repairs and pavement markings, and interference between classes, including transverse, longitudinal, and map cracks, also increase detection difficulty. Deep learning-based approaches are intelligent and highly efficient methods for detecting pavement cracks. The rich hierarchical features of DCNNs and the end-to-end trainable network promote pixel-level semantic segmentation tasks [8]–[10]. At present, several crack detection methods have been proposed based on object detection [11], [12] and image block segmentation [13]–[15].
However, these methods have some defects. The existing pavement crack detection datasets include little interference and cannot meet the needs of crack detection under complex environments. Additionally, few datasets contain information on crack locations, so they cannot be used to locate cracks in the image. To address these challenges, we developed a highway pavement crack dataset and conducted experiments employing the original SSD and YOLOv4 models. In these experiments, the SSD model performed better than the YOLOv4 model. Specifically, this paper contributes to existing research by developing an improved object detection network that adds deformable convolution to the SSD network. We verify the proposed model on the VOC2007 dataset and our crack dataset. Our model achieves the highest mAP on pavement crack detection.
This paper is organized as follows. Section II introduces the related work. Section III presents a novel model named deformable SSD and introduces its network structure. Section IV describes how the dataset was built. Results and discussion about the crack detection are presented in detail in Section V. The crack detection results are then presented in Section VI. Finally, conclusions and future work are presented in Section VII.
II. RELATED WORK
A. OBJECT DETECTION
At present, object detection technology can be divided into one-stage and two-stage object detection. Two-stage object detection, which mainly includes RCNN, Fast RCNN, and Faster RCNN, is divided into object classification and object localization tasks, and these two tasks are completed separately [16]–[18]. One-stage object detection is mainly based on the SSD and YOLO series networks, which complete the classification and positioning tasks at the same time. The one-stage object detection network is superior to the two-stage object detection network in terms of accuracy and speed, and it has higher practical value [19]–[24]. The R-CNN network was originally proposed by Zhang et al. [25], who applied high-capacity CNNs to bottom-up region proposals in the network. The mAP of the network improved from 35.1% to 53.7%, and the network was much faster than the multi-feature, non-linear kernel SVM method. However, R-CNN was slow. Subsequently, Girshick proposed a fast R-CNN and a faster R-CNN to accelerate object detection using SPPnet [4]. The faster R-CNN promoted real-time object detection. To overcome the flaw detected in the region proposal computation, region proposal networks (RPNs) that can share full-image convolutional features with the detection network were proposed [26]. To further improve accuracy, Cao et al. introduced a rotation-invariant faster R-CNN object detection network that integrates regularization constraints into the target function of the model [27]. Feature pyramid networks (FPNs) were also proposed to achieve multi-scale object detection [28]. A Faster R-CNN integrated with FPN obtains better precision and improves the detection speed. Although these models perform well in object detection, they are slower and less accurate than one-stage object detection.
YOLO series networks belong to one-stage object detection. Redmon et al. proposed a novel YOLO network that obtained about double the mAP of other real-time detectors [29]. Subsequently, Redmon and Farhadi introduced the improved model YOLO V2 [30] by integrating batch normalization into the network to speed up the training. They also added anchor boxes and high-resolution classifiers to the network to improve the accuracy. YOLO V2 ran faster than F-RCNN with ResNet and the SSD model. Then Redmon and Farhadi proposed the YOLO V3 model, which achieves multi-scale object detection [2]. The YOLO V3 was as accurate as the SSD

model, but three times faster. However, it performed worse on medium- and larger-sized objects. More recently, the YOLO V4 and YOLO V5 models have been proposed [23], [31]. These models applied the Mish activation function and some data augmentation approaches to improve precision.
Object detection networks based on transformer modules have also performed well [32], [33]. A novel framework called the DEtection Transformer, or DETR, was proposed [34]. This method needs neither a non-maximum suppression procedure nor anchor generation, both of which encode prior knowledge. The model performance, in terms of precision and run-time, was comparable to the well-established and highly optimized Faster RCNN model. The DETR had high computation and space complexity when the attention weight was computed in the transformer module. In particular, in the encoder, the amount of calculation was proportional to the square of the number of pixels, meaning it had difficulty processing high-resolution features. To address the problem, the deformable DETR was put forward [35], which introduced a deformable attention module inspired by deformable convolution; the validity of the model was proved using the COCO dataset. The DETR and deformable DETR networks added only transformer modules. Josh Beal et al. proposed a vision transformer as a backbone, called ViT-FRCNN, for the detection task [36]. While ViT-FRCNN served as a crucial first step in a class of transformer-based models, it did not achieve state-of-the-art results and did not consider local features. Liu et al. proposed the Swin Transformer, which employs shifted windows [7] to exchange information between windows, meaning that the model can obtain semantic and local information. The Swin Transformer outperformed the YOLOv4 and DetectoRS models when tested using the COCO dataset. While the transformer-based models perform well on massive quantities of data, they tend to perform worse on datasets smaller than 100 k.
Several researchers have also proposed improved versions of the SSD model. Fu et al. introduced a Deconvolutional Single Shot Detector (DSSD) by combining a classifier (Resnet-101) with an SSD network [37]. Wang et al. proposed an improved SSD model by combining the advantages of existing target detection approaches [38]. Their model outperformed the F-RCNN and R-FCN models for small object detection. Kumar et al. used depth-wise separable convolution to improve the SSD network and achieved high performance on small object detection and real-time detection [39].
B. YOLOv4
In the field of object detection, the YOLOv4 model has high accuracy and speed. On common image datasets, such as the COCO dataset and VOC dataset, the performance of the YOLOv4 model is better than that of the SSD model. The YOLOv4 network structure mainly consists of three parts: the backbone feature extraction network CSPDarkNet53, spatial pyramid pooling (SPP), and the feature pyramid network (PANet). The Mish activation function has also been adopted [40], [41]. Through PANet, we can attain feature maps of three different scales and carry out feature fusions on different scales. The YOLOv4 object detection network has been widely applied in the autonomous driving and medicine fields [23], [40].
C. HIGHWAY PAVEMENT CRACK DETECTION
The development of image processing technology improves the level of automatic pavement crack detection. Generally, there is a significant difference between the brightness value of the pavement crack and that of the pavement background. Consequently, threshold segmentation-based methods can be employed to extract the crack features and complete the crack detection. Li and Liu proposed an approach for crack detection based on a neighboring difference histogram method [42]. They extracted the crack from the pavement background by setting a threshold. However, when the pavement background is dark, such as on wet or asphalt pavement, the difference in brightness between the cracks and the pavement background is minimal, and the threshold segmentation method performs poorly. In addition, this method requires the threshold value to be set manually, which is largely subjective. Image filtering-based approaches can also be applied to crack detection. Subirats et al. employed a continuous wavelet transform to obtain a binary image to detect cracks [43]. However, the anti-noise ability of this method is poor. Cho et al. proposed a crack width transform for crack detection [44]. Their method can better detect crack width. Although it has a strong anti-noise ability, the threshold still needs to be set manually. Premachandra et al. proposed an image-based automatic road crack detection method employing pixel variance and discriminant analysis, and the method was effective [45]. Then Chinthaka et al. used color variance distribution and discriminant analysis for road crack detection and obtained higher precision than the conventional approaches [46]. However, in their images the difference between the cracks and the pavement background was very obvious, so the crack feature was easy to extract. When the difference between the cracks and the pavement background is not obvious, the method performs worse. Machine learning is widely used in the field of image processing, and applying it to crack detection further improves the automation level of crack detection. A genetic programming and percolation model-based approach was proposed for concrete pavement crack detection [47]. The algorithm enhances anti-noise ability, accelerates detection, and improves precision, but its rate of convergence is slow. Shi et al. proposed a new pavement crack detection framework based on random structured forests named CrackForest [48]. As intricate structural features of cracks can be extracted, the framework has high detection precision and fast detection speed. However, the precision and intelligence levels of these methods are lower than those of deep learning approaches.
With the advance of DCNNs, object detection technology has become increasingly mature. CNNs can extract high-dimensional crack features and put these features into a detector to complete crack classification and location. This method greatly improves the accuracy and speed of crack
detection [49]. Bhat et al. employed a CNN to detect cracks and achieved high precision [50]. However, the dataset they used was too small, and the generalization of the model is weak. Qu et al. proposed an improved VGG16 network model to detect cracks [51]. The model clearly outperforms VGG16, U-Net, and Percolation and obtains the highest F1 score when using the CFD dataset and Cracktree200 dataset. Zhang et al. proposed a novel model called APLCNet, which uses instance segmentation to attain pixel-level crack detection [52]. The model obtained higher precision, recall, and F1 scores on the CFD dataset; however, the study mainly employed datasets with concrete pavements, on which the crack feature was more distinct. Therefore, these approaches perform well on concrete pavements, but their performance degrades on asphalt pavements. A novel model based on a feature pyramid and a hierarchical boosting network was proposed to detect pavement cracks [53]. The model integrated different levels of crack features and different kinds of crack datasets, including concrete and asphalt pavement image datasets. The model therefore has good robustness. Song et al. proposed a network named CrackSeg that employed deep multiscale convolutional features to detect pavement cracks [54]. It achieved high performance in precision, recall, F1-score, and mIOU. However, the network does not make fine divisions for the categories of the crack. Different crack types, such as transverse, longitudinal, and map cracks, cause different kinds of damage to the pavement structure, so a fine classification of the crack categories is necessary. Interference among different kinds of cracks also makes it more difficult to detect them. Song et al. divided cracks into transverse, longitudinal, alligator, and block cracks and employed a multi-scale feature attention network to detect the cracks [55]. The classification precision of transverse and longitudinal cracks was above 95%, and the classification precision of alligator and block cracks was higher than 86%. However, the model does not consider the interference of surface repairs and pavement markings. Additionally, the model cannot locate cracks in the image and compute the positioning accuracy. Feng et al. employed an improved SSD model to achieve crack classification and location, and it performed well [56]. However, they still failed to consider the interference from crack surface repairs and pavement markings. Maeda et al. provided an open road damage dataset and used some detectors to verify the validity of the data [57]; however, they failed to provide comprehensive evaluation indicators.
III. PROPOSED MODEL
A. SINGLE SHOT MULTI-BOX DETECTOR (SSD)
The SSD object detection network has a faster detection speed and higher accuracy than the RCNN, Fast RCNN, and Faster RCNN detection networks, and its training speed and detection speed are faster than those of the YOLOv4 and YOLOv5 detection networks. On balance, the SSD object detection network performs well and has been widely applied in various fields. Figure 1 shows the structure of the SSD network, whose backbone adopts the VGG16 [58] feature extraction network. Extra feature layers are added after the backbone network to achieve multi-scale detection. The sizes of these layers decrease progressively, and the convolutional module for predicting detections differs for each feature layer.
FIGURE 1. SSD network structure.
Detection predictions are produced by applying convolutional filters to each existing feature layer. A 3 ∗ 3 ∗ a convolutional kernel is the basic module for predicting the parameters of a detection on a feature layer of size m ∗ m with a channels [59]. The kernel produces either a score for a category or a shape offset for the default box coordinates. The kernel is applied at each of the m ∗ m locations, and the bounding box offset output values are measured relative to a default box position.

FIGURE 2. 5 ∗ 5 feature map.
The input image has a size of 300∗300∗3, and the network outputs feature layers at six scales: 38∗38∗512, 19∗19∗1024, 10∗10∗512, 5∗5∗256, 3∗3∗256, and 1∗1∗256, where 512, 1024, 512, and 256 denote the number of channels, that is, the number of feature maps extracted in each feature layer. A feature map cell is one grid cell of the feature map; for example, 38∗38 denotes 38∗38 cells. The feature maps of each scale are then fed into the classifier and the detector for classification and regression prediction, respectively, and the optimal prediction box is selected by the non-maximum suppression (NMS) algorithm. In addition, to improve network performance, the 38∗38 feature maps are L2 normalized, with the scale factor initialized to 20.
B. DEFAULT BOX
The default box in the SSD network is similar to the prior anchor in the Faster RCNN network. However, the default boxes are applied to feature maps in feature layers of different scales. Default boxes of six different sizes are defined in the SSD network. Each feature layer initializes some default boxes, and these default boxes are adjusted to achieve optimal classification and regression prediction (see Figure 2). In Figure 2, the dotted boxes denote the default boxes, and the red dotted box denotes the ground truth box. The numbers [4, 6, 6, 6, 4, 4] denote the number of default boxes in each feature map cell at the six scales, respectively. For instance, a feature map with 38∗38 cells and four default boxes per cell contains 38∗38∗4 = 5,776 default boxes. We need to compute the size and center of each default box. The size of the input images is S ∗ S = 300∗300. The calculation formulas of the default box are shown in Equations (1)-(5).
$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{1}$$
$$a_r \in \left\{1, 2, 3, \frac{1}{2}, \frac{1}{3}\right\} \tag{2}$$
$$w_k^a = s_k \sqrt{a_r} \tag{3}$$
$$h_k^a = s_k / \sqrt{a_r} \tag{4}$$
$$(cx, cy) = \left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right), \quad i, j \in [0, |f_k|) \tag{5}$$
where $s_k$ denotes the scale of the default boxes in the k-th feature map, and $s_{\max}$ and $s_{\min}$ are 0.9 and 0.2, respectively, meaning that the highest layer has a scale of 0.9 and the lowest layer has a scale of 0.2. Different aspect ratios, denoted as $a_r$, are imposed on the default boxes, and the width ($w_k^a$) and height ($h_k^a$) of each default box are calculated from them. Moreover, for $a_r = 1$, a default box with the scale $s'_k = \sqrt{s_k s_{k+1}}$ is added, resulting in six kinds of default boxes. The center point (cx, cy) of each cell is computed in Equation (5), where $|f_k|$ denotes the size of the k-th feature map.
C. ENCODE
In the encoding procedure, the parameters g_xy and g_wh, which describe the adjustment from the default box to the ground truth box, are computed. The formulas are shown in Equations (6) and (7).
$$g_{xy} = ([x_1, y_1] - [x_0, y_0]) / (v_0 * [h_0, w_0]) \tag{6}$$
$$g_{wh} = \log([h_1, w_1] / [h_0, w_0]) / v_1 \tag{7}$$
where $x_1, y_1$ are the center coordinates of the ground truth box; $h_1, w_1$ denote the height and width of the ground truth box; $x_0, y_0$ are the center coordinates of the default box; $h_0, w_0$ denote the height and width of the default box; and $v_0$ and $v_1$ are equal to 0.1 and 0.2, respectively.
D. DECODE
The decoding process yields a prediction box according to Equations (8) and (9).
$$[x, y] = [x_0, y_0] + \mathrm{loc}[:, :2] * v_0 * [h_0, w_0] \tag{8}$$
$$[h, w] = [h_0, w_0] * \exp(\mathrm{loc}[:, 2:] * v_1) \tag{9}$$
where x and y are the center coordinates of the prediction box, and h and w are the height and width of the prediction box.
E. LOSS FUNCTION
The loss function of the SSD network is a multi-task loss function that includes classification loss and regression loss (as shown in Equations (10)-(17)). We call a default box that contains an object a positive sample and a default box that contains no object a negative sample. Most default boxes do not include objects, which results in an imbalance between positive and negative samples. To balance them, we set the ratio of positive to negative samples to 1:3. Classification loss includes positive and negative sample loss, and multi-class cross-entropy loss is adopted. Negative samples do not need to be positioned, so regression loss includes only positive sample loss, and the smooth_L1_loss function is used.
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{10}$$
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{Cx, Cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}(l_i^m - \hat{g}_j^m) \tag{11}$$
$$\hat{g}_j^{Cx} = (g_j^{Cx} - d_i^{Cx}) / d_i^{w} \tag{12}$$
$$\hat{g}_j^{Cy} = (g_j^{Cy} - d_i^{Cy}) / d_i^{h} \tag{13}$$
$$\hat{g}_j^{w} = \log\left(\frac{g_j^{w}}{d_i^{w}}\right) \tag{14}$$

$$\hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{d_i^{h}}\right) \tag{15}$$
$$L_{conf} = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}) \tag{16}$$
$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{17}$$
where N denotes the number of matched default boxes; i and j index the prediction boxes and ground truth boxes, respectively; p is the category number; $\hat{c}_i^{p}$ denotes the probability that the i-th prediction box predicts the category p, and p = 0 is the background; $L_{conf}$ is the confidence loss; $L_{loc}$ is the localization loss; α is the weight term that balances the classification and regression loss and is set to 1; $x_{ij}^{p} = \{1, 0\}$ is an indicator that denotes whether the i-th prediction box matches the j-th ground truth box of category p; $l_i^m$ is the i-th predicted box; $(d_i^{Cx}, d_i^{Cy})$, $d_i^{w}$, and $d_i^{h}$ denote the center, width, and height of the i-th default box; and $\hat{g}_j^m$ denotes the j-th ground truth box.
F. NON-MAXIMUM SUPPRESSION (NMS)
NMS mainly solves the problem of a target being detected many times [60]. First, the box with the highest confidence is selected from all the detection boxes, and then the intersection over union (IOU) between it and each of the remaining boxes is calculated in turn. If the value is greater than a certain threshold (the overlap is too high), the box is removed. The process is repeated for the remaining detection boxes until all detection boxes are processed. We set the IOU threshold to 0.45.
G. DEFORMABLE CONVOLUTION
The addition of deformable convolutions to a CNN can improve the performance of the network [61], [62]. For example, adding deformable convolution to an object detection network that uses ResNet or CSPDarkNet as a backbone feature extraction network improves the detection performance of the network [63]. Compared with standard convolution, deformable convolution adds an offset variable at each sampling point to increase its adaptability to geometric deformation (as shown in Equations (18) and (19)). Figure 3 shows the realization process of deformable convolution. The convolution kernel of the deformable convolution can adapt to the shape characteristics of the objects and match the shape changes of the objects. The convolution region always covers the surrounding objects. Sampling points of the deformable convolution are not uniformly distributed. Instead, they are distributed in the interior of the detected object according to the shape of the detected object. Deformable convolution has a strong scale modeling ability and a larger receptive field than standard convolution.
$$y(P_0) = \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n + \Delta P_n) \tag{18}$$
$$R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\} \tag{19}$$
where $P_n$ denotes a pixel position in the convolution window, w is the weight of the convolution kernel, R represents the sampling grid of a standard convolution kernel of size 3 × 3, and $\Delta P_n$ denotes the offset of that pixel position.
FIGURE 3. Deformable convolution.
H. DEFORMABLE SSD
The original SSD employs VGG16 as the backbone feature extraction network and uses standard convolution to extract the features. We add a deformable convolution (D_Conv) behind Conv7 in the VGG16 network, and Table 1 shows its parameter settings.
TABLE 1. Deformable convolution parameter settings.
When crack images are fed into the input layer, the proposed network does the following:
(1) The input crack images are resized to 300∗300.
(2) The crack images are passed through the VGG16 network, which extracts feature maps at different scales.
(3) The default boxes are generated based on Equations (1)-(5).
(4) The offsets are predicted for the default box shapes in the cell.
(5) Per-class confidence scores are predicted for each box.
(6) Based on the IOU and NMS, the ground truth boxes are matched with the predicted boxes.
(7) The loss is computed using Equations (10)-(17).
IV. DATASET
FIGURE 4. Sample images of our dataset. (a)–(i) Denote different pavement target images.
TABLE 2. Number of objects per category.
TABLE 3. Pascal VOC2007 test detection results.
Crack Dataset: The dataset consists of highway pavement survey images from the national highway and provincial highway in Gansu Province, China. The cracks are about 3 mm wide. The data used in this study were mainly from on-board CCD cameras, which cover most pavement conditions. The pavement images have a resolution of 1688 × 1874. Training networks with large images would require a large amount of memory, thus overburdening the training process. Further, since the crack region occupies only a small part of the entire image, it is difficult to extract features and recognize cracks. Therefore, to reduce memory usage and improve the precision, we divided the original highway crack images into small blocks with a size of 562 × 562 pixels (see Figure 4). Then, the dataset was manually divided into five categories: transverse cracks (1), longitudinal cracks (2), map cracks (3), crack surface repairs (4), and pavement markings (5). Subsequently, we used LabelImg to label the images,
including the crack category and location. We also divided the dataset into three sub-datasets: the training set (18,694 pictures), the validation set (2,077 pictures), and the test set (1,483 pictures). Table 2 shows the number of objects per category.
V. RESULTS AND DISCUSSION
Each experiment in our study was performed under a Windows 10 Operating System with an Intel(R) Core(TM) i9-9900K CPU, 3.60 GHz and an NVIDIA GTX 2080Ti GPU and 12 GB of memory. To verify the applicability of the proposed model, we first tested our model on the PASCAL VOC2007 dataset. We employed pre-trained weights and transfer learning to train our model and set the initial learning rate to 0.0005 and the number of epochs to 40. One epoch means one full pass over the training set. Table 3 shows the test results. The mAP of the proposed model was 3.1% higher than that of the original SSD model, and the AP per class

improved. Fine-tuning the model over a long training process is an effective approach to avoiding overfitting and degeneration. To better train the model and obtain better results, we divided the training process into two stages and fine-tuned the training parameters in the second stage. Table 4 shows the training parameters of the different models at different stages. When training deformable SSD, we adopted the idea of transfer learning and used the weights of the trained SSD model as the initial weights of deformable SSD. Because the training process was short and the performance was not suboptimal, we did not divide the training of the proposed model into two stages. To highlight the superiority of our model, we compared it with the original SSD model and the YOLOv4 model.

We plotted the loss curves of the three models during training in Figure 5. The results show that the proposed model converges the fastest and reaches the lowest loss.

FIGURE 5. Loss curves under different models.
FIGURE 6. Map crack.

To evaluate network performance, we calculated the following indexes: precision (P), recall (R), F1, AP, mAP, and FPS. The formulas for the first five are shown in Equations (20)-(24).

$$P = \frac{TP}{TP + FP}, \tag{20}$$

$$R = \frac{TP}{TP + FN}, \tag{21}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R}, \tag{22}$$

$$AP = \sum_{k=1}^{N} P(k)\,\Delta R(k), \tag{23}$$

$$mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i, \tag{24}$$
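The indexes in Equations (20)-(24) can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the function names are hypothetical, the input numbers are invented, and ΔR(k) is assumed to be R(k) − R(k−1) with R(0) = 0, since the text only states that Δ denotes a difference.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (20)-(22): precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def average_precision(precisions, recalls):
    """Equation (23): AP = sum_k P(k) * dR(k), assuming dR(k) = R(k) - R(k-1), R(0) = 0."""
    ap = 0.0
    previous_recall = 0.0
    for p_k, r_k in zip(precisions, recalls):
        ap += p_k * (r_k - previous_recall)
        previous_recall = r_k
    return ap


def mean_average_precision(ap_per_class):
    """Equation (24): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy numbers, invented for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ap = average_precision(precisions=[1.0, 0.5, 2 / 3], recalls=[0.5, 0.5, 1.0])
m_ap = mean_average_precision([0.5, 1.0])
print(p, r, f1, ap, m_ap)
```

Here the precision and recall lists are taken at successive ranks of the score-sorted detections for one class, so each term of the sum adds the precision at rank k weighted by the recall gained at that rank.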
FIGURE 7. Map crack precision curves of three models. (a) Precision curve of the YOLOv4 model; (b) Precision curve of the
SSD model; (c) Precision curve of our model.
FIGURE 8. Map crack F1 curves of three models. (a) F1 curve of the YOLOv4 model; (b) F1 curve of the SSD model; (c) F1
curve of our model.
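Curves like those in Figures 7 and 8 come from recomputing precision and F1 while the confidence threshold is varied. The sketch below shows one way to do this under stated assumptions: the detections have already been matched to ground-truth objects (each correct detection covers a distinct object), the scores and labels are invented, and picking the threshold with the highest F1 is only one possible selection rule, not necessarily the one used in the experiments.

```python
def sweep_confidence_thresholds(scores, is_correct, num_ground_truth, thresholds):
    """Return a list of (threshold, precision, recall, f1) tuples."""
    rows = []
    for t in thresholds:
        # Keep only detections whose confidence reaches the threshold.
        kept = [ok for s, ok in zip(scores, is_correct) if s >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        # Precision is undefined with no detections; 0.0 is used here by convention.
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom > 0 else 0.0
        rows.append((t, precision, recall, f1))
    return rows


# Toy detections for one class, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
is_correct = [True, True, True, False, True, False, False]
rows = sweep_confidence_thresholds(
    scores, is_correct, num_ground_truth=5, thresholds=[0.1, 0.35, 0.65, 0.85]
)
best_threshold = max(rows, key=lambda row: row[3])[0]
for row in rows:
    print(row)
print("threshold with highest F1:", best_threshold)
```

In this toy data precision rises with the threshold while F1 rises and then falls, because raising the threshold removes false positives first and true positives later.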

FIGURE 9. Detection results. (a)–(i) Denote different objective detection results.

TABLE 4. The training parameters of different models at different stages.

In the formulas, true positive (TP) denotes a correct positioning result, and false positive (FP) denotes a wrong positioning result. False negative (FN) means that the result is not predicted correctly; for example, there is an object in the picture, but no prediction box is drawn. N denotes the number of predicted samples, Δ denotes the difference, and m is the number of categories. FPS denotes frames per second, namely the number of images detected per second, which is an important indicator for evaluating the detection speed of the model.

We calculated the indicators of the three models on the test set, as shown in Table 5. Our model obtained the highest mAP on the test set, 0.55% and 10.4% higher than those of the SSD and YOLOv4 models, respectively, and had the best performance. The YOLOv4 network had the worst performance. Because the YOLOv4 model only obtains the features of three scales,

TABLE 5. The indicators (P, R, F1, AP, FPS, and mAP) of the three models on the test set.
it has a poor effect on large-scale object detection. However, map cracks usually cover the whole image (as shown in Figure 6), which requires a large-scale object detection feature layer to obtain global features. The SSD network has feature layers at six scales, including large-scale feature layers, and is therefore suitable for map crack detection. The calculated FPS shows that the detection speed of the proposed model is faster than that of the YOLOv4 model and slightly slower than that of the SSD model. Because we used transfer learning to train our model, the training time was greatly shortened. The model needed to allocate more memory during training because of the increased model complexity after adding the deformable convolution. The results show that the proposed model and the SSD model achieved great improvements in accuracy, recall, and F1 score for map cracks, which verifies that the YOLOv4 model has poor performance on large-scale object detection. The proposed model attained the highest AP for map cracks and crack surface repairs and a higher AP for the other categories, which indicates that our model is more suitable for crack detection.

Taking map cracks as an example, we plotted the precision and F1-score curves against the confidence threshold in Figure 7 and Figure 8. Confidence is a probability that reflects the similarity between the predicted and ground-truth objects. Precision should increase as the confidence threshold increases. However, the F1-score first increased and then decreased as the confidence threshold increased. We therefore set the confidence threshold to 0.5 according to the results of our experiments.

VI. DETECTION RESULTS
Figures 9 (a)-(f) show the detection results for images of size 562 × 562, and Figure 9 (i) shows the detection result for the original image. In Figure 9, cracks are detected on different pavement types. Surface repairs and markings on the pavement are also detected, which is important for removing the interference of surface repairs and markings. The proposed model can detect cracks not only under different pavement backgrounds but also in pictures of different sizes, and it therefore has strong applicability. However, as can be seen in Figures 9 (d), (g), and (i), there are omissions at the image edge and in multi-object detection. During crack detection, we used a computer with a GPU to achieve fast detection. Therefore, a highly configured computer is necessary. Currently, our model is not suitable for detecting cracks on mobile or small hardware.

VII. CONCLUSION
We propose a novel object detection model named deformable SSD to detect asphalt highway pavement cracks in complicated environments. We draw the following conclusions: (1) Annotating the crack category and location in a crack detection dataset and using object detection networks to train crack detection models makes it possible to detect crack class and location effectively. Obtaining this information is important for achieving automated and intelligent crack detection. (2) Collecting crack images on different pavements to increase sample diversity and including surface repairs and markings in the dataset can reduce interference among

classes and improve the generalization ability of the model. (3) Improving the SSD network by adding a deformable convolution to the network can improve the model performance. Employing fine-tuning in the training process can avoid overfitting and model degeneration, and transfer learning can accelerate model convergence. Comparative experiments on VOC 2007 and our dataset showed that the proposed model outperformed the original SSD and YOLOv4 models. (4) The detection results show that our model can detect not only cracks in complicated environments but also multi-object crack images. However, it is difficult to detect cracks at the image edge. Therefore, we aim to develop approaches to enhance crack detection at the image edge.

ACKNOWLEDGMENT
The authors would like to acknowledge Gansu Province Key Laboratory of Highway Network Monitoring for providing the original highway pavement crack data.

KUN YAN received the B.Sc. degree in geomatics engineering from the Henan University of Engineering, Zhengzhou, China. He is currently pursuing the M.E. degree with the Faculty of Geomatics, Lanzhou Jiaotong University. His current research interests include the areas of 3D ground penetrating radar and highway pavement crack detection.

ZHIHUA ZHANG received the B.Sc. degree in geomatics engineering and the M.Sc. and Ph.D. degrees in geological engineering from Xi’an Mining University, Xi’an, China, in 2002, 2006, and 2010, respectively. He is currently working as a Professor with the Faculty of Geomatics, Lanzhou Jiaotong University. His current research interests include the areas of 3D geoscience simulation and photogrammetry and image recognition.
INDEX TERMS Crack detection, deformable convolution, multi-scale, SSD.
