
Deep Learning–Based Autonomous Road Condition Assessment Leveraging Inexpensive RGB and Depth Sensors and Heterogeneous Data Fusion: Pothole Detection and Quantification

Downloaded from ascelibrary.org by "Indian Institute of Technology (Indian School of Mines), Dhanbad" on 04/08/23. Copyright ASCE. For personal use only; all rights reserved.

Yu-Ting Huang1; Mohammad R. Jahanshahi, A.M.ASCE2; Fangjia Shen3; and Tarutal Ghosh Mondal4

Abstract: Poor condition of roads is a major factor in traffic accidents and damage to vehicles. A significant portion of car accidents is attributed to severe three-dimensional (3D) pavement distresses such as potholes, rutting, and raveling. Insufficient road condition assessment is responsible for the poor condition of roads. To inspect the condition of pavement surfaces more frequently and efficiently,
an inexpensive data acquisition system was developed that consists of a consumer-grade RGB-D sensor and an edge computing device that
can be mounted on vehicles and collect data while driving vehicles. The RGB-D sensor is used for collecting two-dimensional (2D) color
images and corresponding 3D depth data, and the lightweight edge computing device is used to control the RGB-D sensor and store the
collected data. An RGB-D pavement surface data set is generated. Furthermore, encoder-decoder deep convolutional neural networks
(DCNNs) consisting of one or two encoders and one decoder, trained on heterogeneous RGB-D pavement surface data, are used for pothole
segmentation. Comprehensive experiments using different depth encoding techniques and data fusion methods including data- and feature-
level fusion were performed to investigate the efficacy of defect detection using DCNNs. Experimental results demonstrate that feature-level RGB-D data fusion based on the surface normal encoding of depth data outperforms the other approaches in terms of segmentation accuracy, achieving a mean intersection over union (IoU) of 0.82 over 10-fold cross-validation, a 7.7% improvement compared with a network trained only on RGB data. In addition, this study explores the efficacy of indirectly using depth information for pothole
detection when depth data are not available. Additionally, the semantic segmentation results were utilized to quantify the severity level of the potholes, assisting in maintenance decision-making. The results from these comprehensive experiments using an RGB-D pavement surface
data set gathered through the proposed data acquisition system is a stepping stone for opportunistic data collection and processing through
crowdsourcing and the Internet of Things in future smart cities for effective road assessment. Finally, suggestions for improving the proposed system are discussed. DOI: 10.1061/JPEODX.PVENG-1194. © 2023 American Society of Civil Engineers.

1Ph.D. Candidate, Lyles School of Civil Engineering, Purdue Univ., 610 Purdue Mall, West Lafayette, IN 47907 (corresponding author). ORCID: https://orcid.org/0000-0002-9773-2122. Email: huan1152@purdue.edu
2Associate Professor, Lyles School of Civil Engineering and Elmore Family School of Electrical and Computer Engineering, Purdue Univ., 610 Purdue Mall, West Lafayette, IN 47907. Email: jahansha@purdue.edu
3Ph.D. Student, Elmore Family School of Electrical and Computer Engineering, Purdue Univ., 610 Purdue Mall, West Lafayette, IN 47907. ORCID: https://orcid.org/0000-0002-0902-5939. Email: shen449@purdue.edu
4Postdoctoral Fellow, Dept. of Civil, Architectural and Environmental Engineering, Missouri Univ. of Science and Technology, Rolla, MO 65401. Email: tg5qf@mst.edu

Note. This manuscript was submitted on March 29, 2022; approved on December 24, 2022; published online on March 10, 2023. Discussion period open until August 10, 2023; separate discussions must be submitted for individual papers. This paper is part of the Journal of Transportation Engineering, Part B: Pavements, © ASCE, ISSN 2573-5438.

Introduction

Background and Motivation

According to the ASCE report card for America's infrastructure in 2021, the overall performance grade for infrastructure is C−. In particular, the roads in the US received merely a D grade, which is at the bottom of all infrastructure categories and has not improved since 2009, indicating a poor or mediocre condition (ASCE 2021). Moreover, 21% of the highways in the US have poor pavement conditions based on the report published by the National Transportation Research Group in 2015 (TRIP 2015). These mediocre road conditions in the US are not only unacceptable but also uneconomical and hazardous to citizens. For example, driving on roads in need of repair cost US motorists $120.5 billion in extra vehicle repairs and operating costs in 2015, which is $533 per driver (TRIP 2015). Additionally, many accidents are caused by poor road conditions that are dangerous and life threatening.

A pothole is a common pavement distress that has a considerable impact on a crash or rollover accident, injuring the driver, passengers, and pedestrians. In the Federal Highway Administration (FHWA) guidelines (Miller and Bellinger 2003), potholes are any bowl-shaped holes with a minimum dimension of 150 mm. They can be classified into three categories depending on their depth: potholes less than 25 mm deep are categorized as low severity, potholes 25–50 mm deep are categorized as moderate severity, and potholes more than 50 mm deep are categorized as high severity (Miller and Bellinger 2003).

To alleviate the deterioration of pavements, it is necessary to detect and repair potholes on the road efficiently. The traditional pothole detection practices are manual and semiautomated. A crew of trained raters is sent to the field to inspect, record, and classify potholes on the road, which is time consuming and unsafe. The current road inspection procedures are not effective enough

© ASCE 04023010-1 J. Transp. Eng., Part B: Pavements

J. Transp. Eng., Part B: Pavements, 2023, 149(2): 04023010


because the roads all over the state are inspected only once a year or once every other year. An alternative method is to use a professional commercial vehicle equipped with multiple sensors such as cameras and laser sensors to collect data and survey road conditions. However, a significant disadvantage of this kind of application is the expensive equipment investment; for example, a new automated road analyzer van (ARAN) costs about $1.15 million. To this end, this study proposes an inexpensive hardware-software system based on an edge computing device and an Intel RealSense depth camera D435 (Intel RS-D435, Santa Clara, California), which can collect RGB and depth data while driving at approximately 50 km/h (30 mi/h). Instead of purchasing a specific type of vehicle, the developed inexpensive data acquisition system can be mounted on various vehicles such as city and DOT vehicles, US Postal Service or UPS vehicles, Uber cars, and volunteer vehicles. The total expense of the hardware is about $1,600, excluding the cost of the vehicle. Therefore, this inexpensive data acquisition system can be mounted on multiple vehicles to build a dense mobile sensor network that serves as an autonomous, more frequent, and quantitative condition-based as well as opportunistic assessment of roads through crowdsourcing instead of the existing schedule-based manual or semiautonomous inspections. This study also aims to automatically detect potholes using a deep learning-based segmentation approach that can further obtain the shape of detected potholes for defect quantification. With the assistance of depth data, the proposed quantification approach can estimate not only the area but also the volume of a pothole, which can be considered as a new pothole severity level indicator in the future.

One of the limitations of the current study is that it focuses only on pothole detection and quantification as a case study. Other defects, such as alligator cracking, can lead to the formation of potholes if neglected for a long period of time. While pothole detection is useful, road decision makers are more interested in multiple distresses together or an index such as the pavement condition index (PCI) to estimate the road performance. Consequently, it is important to consider the effectiveness of the proposed solution for other types of defects, particularly the less coarse defects where the depth information may not contribute much to the defect detection process. Therefore, it is recommended that further research be undertaken on various pavement distresses based on the proposed system. The proposed system can also be instrumental in monitoring the progression of less severe distresses such as alligator cracks.

In recent years, there have been several studies where RGB images and machine learning techniques, including deep learning, are used successfully for the detection of less coarse defects such as cracks (Ibragimov et al. 2022; Zhou and Song 2021; Dung 2019). This indicates that the combination of the proposed inexpensive hardware system and existing defect detection algorithms can lead to detection of a variety of different defects that can be used to estimate a road condition index, such as PCI. One advantage of the developed system is that it can be mounted on multiple cars, providing a huge amount of pavement surface images through a crowdsourcing solution to mitigate the uncertainty of PCI estimation, which is primarily dependent on manual surveys of the road. Piryonesi and El-Diraby (2021) scrutinized the effect of various pavement distresses as well as climate change on the estimation of PCI. Furthermore, there have been some efforts to introduce better pavement distress indexes. For instance, Chen et al. (2016a) utilized Long-Term Pavement Performance (LTPP) data and structural equation modeling to develop a new road condition index. To this end, using the proposed hardware system in this study, the depth information can be utilized for quantifying the identified defects, even for less coarse defects (e.g., crack width). This can potentially lead to the introduction of a more quantitative road index based only on visual data that can reflect the road performance more accurately.

Related Work

There is an extensive body of literature on the assessment of road surface conditions. Earlier studies developed various systems to collect RGB images of the pavement surface. For instance, Jo and Ryu (2015) proposed a pothole data collection system using a commercial black-box camera. Zhang and Elaksher (2012) proposed a system composed of an unmanned aerial vehicle fitted with a digital camera to detect road surface distresses. However, these studies only used two-dimensional (2D) RGB data to detect road defects, whereas some road defects such as potholes, rutting, and raveling are three-dimensional (3D) defects. To correctly classify the severity level of potholes, it is necessary to obtain depth data that provide rich 3D information. Furthermore, the depth information allows the calculation of the amount of missing material of a pothole, which can provide more insight for determining the severity level of potholes. Therefore, several studies have attempted to use 3D road data for pothole detection. Fan et al. (2019) proposed a stereo vision–based pothole detection system that utilized 2D image analysis and 3D road surface modeling techniques. Alternatively, commercial 3D pavement data tools have been used by researchers in the past. Barbarella et al. (2019) used a terrestrial laser scanner (TLS) to obtain the geometric features of pavement. Zhang et al. (2018) proposed a special vehicle with a 3D data collection system for pavement defect detection. Tsai and Chatterjee (2018) collected 3D pavement data using a sensing vehicle equipped with a data acquisition system including 2D imaging and 3D lidar. However, this equipment is relatively expensive. Therefore, a number of studies used low-cost sensors for 3D pavement data acquisition. Chen et al. (2016b) and Moazzam et al. (2013) relied on the Microsoft Kinect, which is an RGB-D sensor. However, these studies did not design their data acquisition systems as autonomous data collection systems, so the inspection efficiency is limited.

Traditionally, pothole detection has been addressed by image-processing and image-thresholding methods. In particular, Otsu's thresholding method was one of the most successful methods for image thresholding (Akagic et al. 2017; Jahanshahi et al. 2013). In recent years, there has been an increasing interest in using deep learning approaches (LeCun et al. 2015). Deep learning has demonstrated its success in computer vision areas such as classification and pattern recognition. To this end, several convolutional neural network (CNN) architectures for object detection and recognition have been utilized, including R-CNN, fast R-CNN, and faster R-CNN, which are region-based CNNs (Girshick et al. 2014; Ren et al. 2015; Girshick 2015). Many recent studies have shown that deep learning methods can successfully perform pavement defect detection tasks. Maeda et al. (2018) used smartphone cameras to take photos of road defects and trained deep learning models using those images to recognize road defects. Ibragimov et al. (2022) applied the faster R-CNN model to detect pavement cracks. Cao et al. (2020) implemented a comprehensive study on various deep learning models for road damage detection. Ukhwah et al. (2019) applied YOLO, a region-based deep convolutional neural network (DCNN) model, to detect potholes. The approaches proposed in the preceding studies are region-based methods. One disadvantage of region-based methods is the lack of a precise spatial boundary of the detected object. In contrast, semantic segmentation classifies each pixel in an image, which can be used for detection of defect boundaries. Fan et al. (2021) evaluated a set



of nine state-of-the-art CNNs for road pothole detection and observed the best mean IoU to be 75% and the lowest mean IoU to be 30.71%. Chun and Ryu (2019) proposed a fully CNN-based road surface damage detection algorithm with semisupervised learning. Zhang et al. (2017) proposed CrackNet, which is a pixel-level CNN model to detect cracks on asphalt pavement. Furthermore, several pieces of research have looked into various segmentation tasks, for instance, nuclear power plant crack detection (Chen and Jahanshahi 2017) and road crack detection (Zhang et al. 2016).

Although many researchers demonstrated success in pavement defect detection, most relied on only 2D RGB data or 3D depth data alone. However, 2D RGB and 3D depth data complement each other perfectly: RGB data provide color information, and depth data provide geometric information. In general, the performance of semantic segmentation CNN models is limited by the quality of the training data. The defect detection accuracy may be compromised due to poor illumination conditions, various background textures, and misleading nondefect objects in RGB images. To improve the accuracy, depth data have been considered an informative input in CNN-based segmentation methods. Several studies have utilized RGB-D data for indoor scene detection and observed enhanced performance (Li et al. 2016; Chang et al. 2017). Hazirbas et al. (2016) proposed a CNN model, called FuseNet, that incorporated depth into the semantic segmentation algorithm and trained it on two public RGB-D data sets (Song et al. 2015; Gupta et al. 2014).

There have been efforts to utilize crowdsourcing to detect potholes. For instance, the Get It Done San Diego app is a crowdsourcing solution where users can report potholes and connect directly to the city's work tracking system. The user has to take a picture of the defective region, which is not always feasible on crowded roads and may pose safety threats to the individuals involved in the process. Furthermore, such solutions need a human operator to look at the submitted images and decide whether that section of the road needs to be fixed or not. Another example is the Street Bump app that is operated by the City of Boston. Street Bump is a mobile app that is used by volunteers to collect road condition data while they drive. Street Bump utilizes the accelerometer and GPS sensors of the phone, uploads the data to a central server, and alerts drivers regarding the locations of unfixed bumps. This solution requires manual interaction where the users log trips individually. Another limitation of similar aforementioned crowdsourcing solutions is that they can only detect coarse defects (i.e., potholes and bumps) and, in most cases, only when the vehicle hits the bumps. Furthermore, the generated reports only provide the locations of bumps. Accurate quantification of defects is not performed to evaluate the evolution of defective regions. In this study, the feasibility of crowdsource-based data collection for quantitative measurement of pothole conditions is investigated. The proposed solution has the capability of detecting potholes even if the vehicle does not hit the pothole. Recently, the Carbin app has been a successful smartphone-based crowdsourcing solution in Boston for road roughness sensing and monitoring (Botshekan et al. 2020, 2021). However, the Carbin app cannot detect and quantify defects directly and only focuses on estimating the International Roughness Index (IRI). Because it has been shown that RGB-D sensors can be used to estimate pavement roughness as well (Mahmoudzadeh et al. 2019), the proposed solution in the current study has more potential than the aforementioned smartphone-based crowdsourcing.

Contribution

The objective of this study is to establish an inexpensive road surface data acquisition and analysis system for road condition assessment, where pothole assessment is used as a case study. Even though a number of studies in the past relied on RGB-D data for pavement defect detection, information is scarce on the effectiveness of RGB-D data fusion. This study aims to fill this research gap by establishing an inexpensive road surface RGB-D data acquisition and analysis system to put in place a generalized framework for road condition assessment. To provide a proof of concept, this study selects potholes as the target pavement distress for detection and quantification. This paper has the following main contributions:
• To enhance the autonomy of road condition assessment, an inexpensive data acquisition prototype is developed that integrates multiple hardware components, such as an edge computing device and a consumer-grade RGB-D sensor, using the Robot Operating System (ROS) to interconnect and synchronize them.
• This paper comprehensively explores the efficacy of segmentation DCNNs trained with various depth encoding and data fusion methods for defect detection. There are two main types of data fusion methods: data- and feature-level RGB-D fusion. The performance of each DCNN is evaluated based on two aspects. First, the accuracy of segmentation is observed. Second, the inference time of each DCNN on a server and an edge computing device is measured. Suggestions for the selection of the input data and a fusion method on an edge computing device are provided.
• It is shown that the fusion of depth and RGB data improves the performance and robustness of the segmentation task. Also, it is shown that feature-level data fusion is superior to data-level fusion. This study also discusses a novel technique to estimate depth data from RGB images when depth data collection is not available.
• From the segmentation results, defect quantification is implemented through the calculation of potholes' area and volume, which can be used as important indicators in the categorization and prioritization of potholes.

Data Collection and Preparation

To collect road surface data efficiently across a region, a lightweight, low-cost data acquisition system that can be mounted on a vehicle was developed. The proposed experimental setup is shown in Fig. 1. The three major instruments employed in the system are the NVIDIA (Santa Clara, California) Jetson TX2 developer kit (an edge computing device), the Intel RS-D435 RGB-D sensor, and a 1-TB portable solid-state drive (Extreme Portable SSD, Milpitas, California). The edge computing device and the Intel RS-D435 RGB-D sensor are compact and can be safely mounted on the back of vehicles. A piece of computer middleware called ROS was used to interconnect the hardware components and provide an interface for top-down control. This section introduces the characteristics of these three hardware components as well as the middleware that facilitates the data acquisition system.

Data Acquisition System

Hardware
As shown in Fig. 1, the Intel RS-D435 RGB-D sensor can capture spatially aligned RGB and depth images simultaneously. Along with an ordinary RGB camera, it has a depth module that measures the absolute distance of the physical point represented by each pixel using structured-light stereopsis, which mainly consists of a static infrared (IR) pattern projector and a pair of left and right imagers. The infrared projector projects nonvisible static IR patterns to





Fig. 1. Data collection system configuration: (a) the overall setup of the data acquisition system; (b) the Intel RS-D435 RGB-D sensor, which can be mounted on a bike rack on the trunk; and (c) a power-efficient edge computing device that controls the system. (Images by Yu-Ting Huang.)

improve depth accuracy in scenes with low texture, thereby allowing the sensor to produce trustworthy results under various lighting conditions. The Intel RS-D435 RGB-D sensor offers a wide range of resolutions and frames per second (FPS) for the operator to choose from. The depth resolution is up to 1,280 × 720, and the RGB resolution is up to 1,920 × 1,080. In this study, an FPS of 30 for both the RGB and the depth sensor was used, and the resolution was 640 × 480 with the car speed being below 30 mi/h.

In the proposed system, the Intel RS-D435 RGB-D sensor is connected to a Jetson TX2 developer kit. The Jetson TX2 is a system-on-module designed by NVIDIA that incorporates a 256-core Pascal graphics processing unit (GPU), a 6-core ARM-v8 CPU cluster, 8 GB of low-power double data rate 4 (LPDDR4) memory, and 32 GB of flash storage. It is a power-efficient device whose maximum power does not exceed 15 W. On top of the bare Jetson TX2 module, NVIDIA offers a carrier board that wraps up the module itself along with several essential computer peripherals into a developer kit, making it versatile as a mobile computer, as shown in Fig. 1.

A portable 1-TB USB SSD device was used to archive the high volume of images captured on the fly. RGB and depth images were stored in JPEG and PNG format, respectively. The resolution of the RGB and depth data was 640 × 480. The size of a three-channel JPEG image containing a fair amount of texture details is around 200 kilobytes (kB), while the size of a PNG depth image is typically 75 kB. When the images are captured at 30 FPS, the influx of data to the storage system is around 16.1 megabytes (MB)/s.

Fig. 2. Data collection path around the Purdue University campus. [(c) OpenStreetMap contributors (openstreetmap.org/copyright).]

Data Set Generation
To train and validate the semantic segmentation DCNNs, this study generated a pavement surface data set. The data were collected using the developed data acquisition system. To this end, this lightweight system was mounted on a vehicle, and the data were collected at a speed of approximately 50 km/h (30 mi/h) on local streets around the Purdue University campus (Fig. 2). The data set consists of more than 30,000 RGB-D image pairs, of which 1,344 pairs, including defective and nondefective regions, were manually selected for training and validation. Fig. 3 gives some sample images in the data set. Figs. 3(a–c) are the RGB images, where the resolution of each image is 640 × 480 pixels; Figs. 3(d–f) depict the corresponding absolute depth data that are spatially aligned with the RGB images; and Figs. 3(g–i) show the ground truths. All of the potholes in the data set are manually annotated. In addition, the data set contains pothole-like objects. For instance, as shown in Fig. 4, a manhole and circular asphalt bleeding have similar shapes or textures to potholes. These objects may potentially yield false-positive predictions by a DCNN model.

Software
In this prototype, ROS allows the hardware components to communicate with each other. ROS is open-source freeware that facilitates the programming of robots through interprocess communication (Quigley et al. 2009). In the ROS communications model, nodes publish messages to certain topics and read messages from other topics. Messages are transported using either TCP/IP or the User Datagram Protocol (UDP). Using this approach, the camera-controlling logic and the image-processing logic can be implemented as separate nodes, thereby promoting modularity. Furthermore, one performance-optimizing feature that was used for host-local communications (as opposed to network communications) is the ROS nodelet, which allows zero-copy passing of data between functional nodes.
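The node layout described above could be wired together with a ROS launch file along the following lines. This is an illustrative sketch, not the authors' configuration: the `realsense2_camera` package is the standard ROS1 driver for the RS-D435, while the `pavement_capture/recorder_node` package, node, and output path are hypothetical names standing in for the storage logic.

```xml
<launch>
  <!-- RS-D435 driver (realsense2_camera package): publishes spatially
       aligned color and depth image topics at 640 x 480 and 30 FPS -->
  <include file="$(find realsense2_camera)/launch/rs_camera.launch">
    <arg name="align_depth" value="true"/>
    <arg name="color_width" value="640"/>
    <arg name="color_height" value="480"/>
    <arg name="color_fps" value="30"/>
    <arg name="depth_fps" value="30"/>
  </include>

  <!-- Hypothetical recorder node: subscribes to the image topics and
       writes JPEG (color) and PNG (depth) files to the portable SSD -->
  <node pkg="pavement_capture" type="recorder_node" name="recorder" output="screen">
    <param name="out_dir" value="/mnt/ssd/run_001"/>
  </node>
</launch>
```

Running the driver and recorder as separate nodes (or nodelets, for zero-copy transport) follows the modularity argument made above.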





Fig. 3. Sample images in the data set: (a) a road surface without pothole; (b) a road surface with a small defect; (c) a road surface with a high-severity-
level pothole; (d–f) the corresponding aligned absolute depth data; and (g–i) manually annotated ground truth.
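The IoU figures quoted in the abstract and later sections measure the overlap between predicted masks and ground-truth annotations such as those in Figs. 3(g–i). For the binary pothole/background labeling used in this study, the metric reduces to the following sketch (the per-class averaging shown here is an assumption about how the reported mean is formed):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks (1 = pothole)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def mean_iou(pred, gt):
    """Average of the pothole IoU and the background IoU."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))
```

The background IoU term is what exposes confusion with pothole-like objects such as the manholes discussed below.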


Fig. 4. Samples of pothole-like objects that may confuse the DCNN: (a) a manhole; (b) a manhole; and (c) an asphalt patch.

Data Preprocessing

As shown in Fig. 5, the sensor is not exactly perpendicular to the road surface because it is mounted on a bike rack at the rear of the vehicle to avoid including any part of the vehicle in the captured data. Fig. 6(a) displays an example RGB image of a pothole, and the corresponding absolute depth data are presented in Fig. 6(b). As seen, the defect is not apparent in the absolute depth image because the sensor was slightly inclined during data collection. To correct this issue, the following procedure is implemented to identify and remove the baseline plane. First, the camera parameters are applied to deproject the 2D image coordinates to 3D world coordinates using the equation derived from the pinhole camera model

(X, Y, Z) = (d · ((i, j) − P)/F, d)    (1)

where (X, Y, Z) = point in the world coordinate system relative to the depth sensor; i and j = pixel coordinates of the point in the depth image; P = principal point on the image plane; F = focal length of the depth sensor; and d = depth value captured by the sensor. Next, the random sample consensus (RANSAC) (Fischler and Bolles 1981) algorithm is used to fit a plane to these points in





Fig. 5. Data acquisition system setup: (a) sensor is mounted on the bike rack, which is not perfectly perpendicular to the road, to avoid including the
car in the image; and (b) a close view of the sensor and region of interest.


Fig. 6. Sample RGB-D data: (a) an RGB pothole image; (b) the corresponding absolute depth data; and (c) the relative depth data with respect to the
road surface after subtraction of the fitted plane.
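The plane-removal pipeline of this section — Eq. (1) deprojection, RANSAC plane fitting, and subtraction of the fitted plane — can be sketched in NumPy as follows, together with the FHWA depth thresholds quoted in the introduction. The intrinsics below are placeholder values, not the RS-D435 calibration, and depths are assumed to be in millimeters.

```python
import numpy as np

# Placeholder pinhole intrinsics (assumed values, not the RS-D435 calibration)
FX = FY = 600.0        # focal length in pixels
CX, CY = 320.0, 240.0  # principal point for a 640 x 480 image

def deproject(depth):
    """Eq. (1): deproject each pixel (i, j) with depth d (mm) to a 3D point."""
    h, w = depth.shape
    j, i = np.meshgrid(np.arange(w), np.arange(h))
    x = depth * (j - CX) / FX
    y = depth * (i - CY) / FY
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def ransac_plane(points, n_iter=200, threshold=5.0, seed=0):
    """RANSAC: fit a plane n.p + d = 0 maximizing inliers within `threshold` mm."""
    rng = np.random.default_rng(seed)
    best_count, best = 0, None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ sample[0]
        count = np.sum(np.abs(points @ n + d) < threshold)
        if count > best_count:
            best_count, best = count, (n, d)
    return best

def relative_depth(depth):
    """Fit the road plane and subtract it; positive values are depressions."""
    pts = deproject(depth)
    n, d = ransac_plane(pts)
    if n[2] < 0:                 # orient the normal so deeper points are positive
        n, d = -n, -d
    return (pts @ n + d).reshape(depth.shape)

def severity(max_depth_mm):
    """FHWA pothole severity (Miller and Bellinger 2003), thresholds in mm."""
    if max_depth_mm < 25:
        return "low"
    if max_depth_mm <= 50:
        return "moderate"
    return "high"
```

Applying `relative_depth` to an inclined depth image yields the plane-corrected data of Fig. 6(c); the maximum relative depth inside a detected pothole can then be fed to `severity`.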

the 3D space. The points having a distance to the plane greater than a set threshold are treated as outliers. After the plane is fitted, it is subtracted from the original depth image. The fitted plane is an estimate of the road surface. This subtraction provides relative depth data with respect to the estimated road surface [Fig. 6(c)].

Methodology

RGB-D Defect Segmentation

To date, various methods have been developed and introduced to detect pavement defects. Many researchers have utilized RGB data for defect detection using a bounding box. However, there are drawbacks associated with the use of bounding boxes to detect defects. A bounding box does not provide the accurate dimensions and shape of the defect; therefore, the estimated geometry information of the object is not enough to accurately determine the type of defect and its severity. In this study, semantic segmentation was used to detect potholes on the pavement. Unlike object detection methods, semantic segmentation is a pixel-wise classification that can predict the contour of each object. Knowing the precise shape of the defect, its dimensions and area can be estimated, which are critical parameters for classifying the defect according to the inspection guideline. In this study, a binary semantic segmentation DCNN was trained to classify each pixel as pothole or background. The pothole class signifies a defect region, whereas all pixels outside the defect region are labeled as background. The background may occasionally include pothole-like nondefect areas such as manholes. The background intersection over union (IoU) provides insight into the DCNN's ability to distinguish between a pothole and a nonpothole object. In addition, for the purpose of pothole detection and severity level classification, depth data were adopted in this study. The advantage of using depth data is that they provide the pothole's geometric information, which complements the 2D RGB images. To this end, an encoder-decoder-based semantic segmentation neural network was used in this study, and different data fusion strategies were used to fuse grayscale or RGB data with depth data. Several experiments were carried out using various types of fusion strategies and depth encoding techniques.

Data Encoding
As shown in Fig. 7, various types of depth encoding techniques were applied in this study, including absolute depth data, raw relative depth data, locally normalized depth data, globally normalized depth data, and surface normal (SN) data. Fig. 7(a) depicts a sample of raw relative depth data without any normalization. Locally normalized depth is obtained by dividing the relative depth by the maximum relative depth in each image, so that the values vary between 0 and 1. A brighter region represents a deeper area




Fig. 7. Different types of depth encoding techniques utilized in this study: (a) relative depth data; (b) locally normalized relative depth data obtained
by dividing relative depth by the maximum relative depth in each image; (c) globally normalized relative depth data obtained by dividing relative
depth by the maximum relative depth in the entire RGB-D pavement surface data set; and (d) SN data.
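The local and global normalizations in the caption above, together with a gradient-based stand-in for the plane-fitting SN computation described in the text, can be sketched as follows (the function names and the unit pixel spacing are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def encode_depth(depth, global_max=None):
    """Return local/global normalizations of a relative depth map.

    depth: 2D array of relative depth values (larger = deeper).
    global_max: maximum relative depth over the entire data set,
                used for global normalization.
    """
    out = {"local": depth / depth.max()}  # locally normalized, in [0, 1]
    if global_max is not None:
        out["global"] = depth / global_max  # globally normalized
    return out

def surface_normals(depth):
    """Approximate a 3-channel surface-normal map from a depth map.

    This gradient-based approximation stands in for the plane-fitting
    procedure cited in the text; it assumes unit pixel spacing.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(float))
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth, dtype=float)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)  # unit-length normals
    return n
```

The three channels of the returned normal map correspond to the three components of the SN vector that form the SN encoding.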

on the road surface. A sample of locally normalized depth data is displayed in Fig. 7(b). In contrast, globally normalized depth is obtained by dividing the relative depth by the maximum relative depth in the entire RGB-D pavement surface data set (i.e., among all of the depth data in the data set). A sample of globally normalized relative depth is displayed in Fig. 7(c). A brighter area in this figure corresponds to a region with less depression on the road, whereas a darker area corresponds to a deeper region. In addition, another depth encoding method, the surface normal, is considered in this study. To find the SN at a point, all pixels need to be converted to 3D points. Next, a plane considering the adjacent points is fit at each point in the 3D point cloud. Then an SN vector can be obtained based on the fitted plane (Silberman et al. 2012; Wang et al. 2015). The three components of the SN vector constitute the three channels of the SN map. Fig. 7(d) shows the SN estimation of the depth map in Fig. 7(a).

Data Fusion

This study evaluated the effectiveness of fusing 2D grayscale or RGB images with depth data for semantic segmentation. Two approaches were considered for data fusion: data-level fusion and feature-level fusion. For data-level fusion, four-channel RGB-D input data were generated by stacking the depth data on the RGB channels. For feature-level fusion, instead of combining RGB and depth data as input to the network, RGB and depth data were fed into two individual encoder networks in the semantic segmentation DCNN.

Comprehensive experiments and comparisons were implemented to evaluate the performance of various depth encoding techniques and data fusion methods. The outcomes were compared with an existing approach based on depth data and unsupervised learning. Fifteen semantic segmentation DCNNs were implemented and trained on different types of input data, including (1) grayscale image; (2) RGB image; (3) relative depth (RD) data; (4) locally normalized depth (LND) data; (5) globally normalized depth (GND) data; (6) SN data; (7) stacked gray-RD data; (8) stacked RGB-RD data; (9) stacked RGB-LND data; (10) stacked RGB-GND data; and feature-level data fusion using (11) gray-RD; (12) RGB-RD; (13) RGB-LND; (14) RGB-GND; and (15) RGB-SN data.

Network Architecture

Two DCNN architectures were applied for data- and feature-level fusion in this study. All networks perform class-wise semantic segmentation on RGB-D images to classify each pixel into binary classes (i.e., background or defect). As shown in Fig. 8, an encoder-decoder DCNN was used for data-level fusion. The encoder network was identical to the VGG16 network architecture (Simonyan and Zisserman 2014) without fully connected layers. The decoder network upsamples the feature maps from the corresponding encoder and outputs a pixel-wise labeling result with the same resolution as the input data. For feature-level fusion, a DCNN was adopted based on the architecture proposed by Hazirbas et al. (2016). The network architecture is displayed in Fig. 9 and is an encoder-decoder-style DCNN with two encoders: an RGB input encoder branch and a depth input encoder branch. The depth encoder network was also a VGG16 network architecture without fully connected layers. In addition, five fusion layers were added after each convolution, batch normalization, and activation (CBR) layer in the RGB encoder network. These two branches in the encoder part extracted


Fig. 8. Network architecture for data-level fusion. The input can be RGB images or stacked RGB-D. The decoder part upsamples the feature maps to
original input resolution.

Fig. 9. Network architecture with two encoder networks that extract the features from RGB and depth input and apply element-wise summation to
fuse both features. The decoder network upsamples the feature maps to original input resolution.

the features from RGB and depth data and applied element-wise summation to fuse both features. The significant strength of this architecture is the fusion layer in the encoder, which can merge the 2D color attributes and the 3D spatial information of the pothole. To obtain information from both encoder branches, a fusion layer was inserted after every CBR layer that added the feature maps in the depth encoder to the feature maps in the RGB encoder.

Network Training

Each network was trained for 30 epochs using the Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba 2014), a learning rate of 0.0001, and a batch size of 2. In addition, the trainable parameters in the encoder part were fine-tuned from a pretrained VGG16 network that was trained on the ImageNet data set (Deng et al. 2009). To better evaluate the performance of each DCNN model, a repeated 5-fold cross-validation was performed (Kim 2009). At each fold, the data set was randomly split into training and validation sets: 20% of all data were used for validation and the remaining data were used for training. The 5-fold cross-validation was repeated twice, and for each cross-validation the validation and training sets were mutually exclusive. The training took place on a Linux server with Ubuntu 14.04. The server included two Intel Xeon E5-2620 v4 CPUs, 256 gigabytes (GB) of double data rate 4 (DDR4) memory, and four NVIDIA Titan X Pascal GPUs. PyTorch was used to implement the semantic segmentation networks (Paszke et al. 2017).

Estimating Depth from RGB Data

Despite all the advantages of the developed RGB-D data acquisition and analysis system, there are also some practical challenges associated with real-time depth sensing in the outdoor environment. For the Intel RS-D435 camera, the officially recommended operating temperature range lies between 0°C and 35°C. When the temperature of the sensor's projector is lower than 0°C or higher than 35°C, the laser safety mechanism in the firmware driver turns on and the quality of the depth data may decrease. Additionally, some depth sensors use infrared technology, which does not work well under direct bright sunlight. Therefore, this study investigated two approaches that can estimate depth data from RGB data or indirectly extract depth features when depth data are not collected.

Encoder-Decoder Network

A number of techniques have been developed to estimate depth from a monocular RGB image. In this study, an encoder-decoder DCNN with skip connections proposed by Alhashim and Wonka (2018) was applied for depth estimation. The network architecture is shown in Fig. 10. The encoder network is identical to DenseNet-161 (Huang et al. 2017), and the decoder network consists of a successive series of upsampling layers where the dimension of the output is identical to that of the input. The depth estimation network was pretrained on the NYU Depth v2 data set (Silberman et al. 2012) and then retrained on collected pavement surface data through transfer



learning. An example of estimated depth data is shown in Fig. 11, where Fig. 11(a) is the original RGB image, Fig. 11(b) is the corresponding depth data, and Fig. 11(c) is the estimated depth data. It can be observed that the prediction approximately captures the defective region, although the depth values do not perfectly match the actual area. To obtain the relative depth magnitude, the output depth map is normalized so that the brighter a region is, the deeper it is. In addition, a network with an identical architecture was also trained for SN prediction in this study. Fig. 11(d) shows the SN map of the RGB image in Fig. 11(a), and the corresponding estimated SN map is displayed in Fig. 11(e). With the SN estimation DCNN, it is not necessary to acquire camera parameters to obtain an SN map.

Fig. 10. Depth estimation network architecture proposed by Alhashim and Wonka (2018). The encoder network is identical to DenseNet-161. The decoder part upsamples the feature maps to the original input resolution. The dashed lines represent skip connections.

Encoder-Decoder Network through Modality Hallucination

In addition, another network architecture proposed by Hoffman et al. (2016) was applied to the RGB-D pavement data set in this research. As shown in Fig. 12, an additional hallucination encoder network fed with RGB input was added to the original two-stream encoder-decoder network. The hallucination encoder attempts to mimic the features from the depth encoder network using an RGB input during the training process. With this hallucination encoder, the depth modality can transfer side information to the RGB modality. To mimic the depth encoder, as shown in Eq. (2), a Euclidean loss is added after the last activation layer in the depth (A_DNet) and the hallucination encoder (A_hNet) networks

L_Hallucination = ||A_DNet − A_hNet||_2^2    (2)

As shown in Eq. (3), a joint loss function consisting of the cross-entropy and hallucination losses is applied to train the network

L_Total = α L_CE + β L_Hallucination    (3)

where α and β = weights to balance the cross-entropy and hallucination losses. In this study, α and β were set to 0.1 and 5, respectively, based on trial and error. During validation, only the RGB input is passed through the RGB and hallucination encoders to predict the defective region in the image.

RGB-D Defect Quantification

In this study, defect quantification can be implemented using the semantic segmentation results and depth information. According to the FHWA guidelines, a pothole's minimum width should be 150 mm, or a 150-mm-diameter circular plate should fit in an irregular-shaped pothole. In addition, there are three severity levels of potholes (i.e., low, moderate, and high) based on the pothole's maximum depth (Miller and Bellinger 2003). To quantify the dimensions of a pothole, the first step is to classify whether a defect shown in an image satisfies the pothole criterion. The boundary of the defective region can be obtained from the semantic segmentation result, and the largest inscribed circle is drawn within it. For each pixel that lies inside the semantic segmentation mask, the Euclidean distance between that pixel and the nearest pixel on the mask edge is computed. The pixel with the maximum distance is the center of the largest inscribed circle. As shown in Figs. 13(a and b), the maximum inscribed circle of each defect can be identified. If the diameter of the inscribed circle is smaller than 150 mm, the defect will not be classified as a pothole [Fig. 13(c)]. In contrast, the defect is classified as a pothole if the in-circle diameter is larger than 150 mm

Fig. 11. Sample of estimated depth images: (a) RGB image; (b) corresponding depth data; (c) estimated depth data; (d) SN map obtained from RGB image and camera parameters; and (e) estimated SN map.


Fig. 12. Hallucination network architecture with three encoder branches including RGB, hallucination, and depth. The hallucination network is
identical to the depth encoder but trained on RGB input. The decoder part upsamples the feature maps to the original input resolution.


Fig. 13. Sample defects with the boundaries and estimated maximum inscribed circles: (a) an example of the maximum inscribed circle and boundary of the semantic segmentation result; (b) another example of the maximum inscribed circle and boundary of the semantic segmentation result; (c) the diameter of the inscribed circle of this irregularly shaped defect does not exceed 150 mm, so the defect cannot be classified as a pothole; and (d) this defect can contain an inscribed circle with a diameter of 550 mm, so it can be classified as a pothole.
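The inscribed-circle test described above can be sketched with a Euclidean distance transform; the maximum of the transform inside the mask gives the radius, and its location gives the center. The `mm_per_pixel` calibration factor is a simplifying assumption standing in for the paper's back-projection of pixels onto the pavement plane:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def largest_inscribed_circle(mask, mm_per_pixel=1.0, min_diameter_mm=150.0):
    """Find the largest circle inscribed in a binary segmentation mask.

    For every pixel inside the mask, the distance transform gives the
    Euclidean distance to the nearest background pixel; the maximum is
    the radius of the largest inscribed circle and its location is the
    circle's center.
    """
    dist = distance_transform_edt(mask.astype(bool))
    center = np.unravel_index(np.argmax(dist), dist.shape)
    diameter_mm = 2.0 * dist[center] * mm_per_pixel
    # FHWA criterion: a defect qualifies as a pothole only if a
    # 150-mm-diameter circular plate fits inside it.
    return center, diameter_mm, diameter_mm >= min_diameter_mm
```

For a mask whose inscribed circle spans fewer than 150 mm, the last return value is False, matching the rejection case shown in Fig. 13(c).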



Table 1. Comparison of prior works (Jahanshahi et al. 2013; Akagic et al. 2017) and the proposed approach using different depth encoding techniques and data fusion methods on the pavement data set

Method           Input data                            Background   Defect   Mean
Prior work       Relative depth                        0.938        0.667    0.803
Proposed DCNNs   Gray                                  0.958        0.696    0.827
                 RGB                                   0.966        0.762    0.864
                 Relative depth                        0.956        0.721    0.838
                 Locally normalized relative depth     0.956        0.728    0.842
                 Globally normalized relative depth    0.955        0.724    0.840
                 Surface normal                        0.960        0.733    0.847
                 Stacked gray-RD                       0.949        0.648    0.798
                 Stacked RGB-RD                        0.965        0.773    0.869
                 Stacked RGB-LND                       0.962        0.762    0.862
                 Stacked RGB-GND                       0.965        0.769    0.867
                 Fused gray-RD                         0.966        0.771    0.869
                 Fused RGB-RD                          0.974        0.817    0.895
                 Fused RGB-LND                         0.972        0.807    0.892
                 Fused RGB-GND                         0.973        0.811    0.895
                 Fused RGB-SN                          0.974        0.821    0.898

Note: Class-wise IoU scores of both classes. Background represents the nondefective region. Defect represents the defective region. Bold text indicates the highest value in the table, which refers to the best performance.

Fig. 14. Calculation to obtain the actual surface area of each pixel on the depth image.

[Fig. 13(d)]. Next, the severity level of each pothole can be obtained utilizing the maximum depth of each pothole from the raw relative depth data. Furthermore, the volume of each pothole can be estimated using the area and the depth of each pixel. To know the actual size of each pixel, as shown in Fig. 14, the back-projection of a pixel corner is calculated by solving the intersection of the pavement surface plane and the line joining the camera center and a pixel on the image plane. For each pixel, four intersection points can be found. Then the depth value of each pixel multiplied by the area of the quadrilateral gives the small volume represented by that pixel. The total volume of each pothole is the summation of the volumes of all white pixels in the binary semantic segmentation result.

Experimental Results and Discussions

Model Performance Metrics

IoU is the metric used to evaluate the performance of the DCNNs. As shown in Eq. (4), IoU is defined as the ratio of the overlapping region between the predicted and ground-truth pixels over their union. Mean IoU is calculated by averaging the IoU of each class

IoU = (ground truth ∩ prediction) / (ground truth ∪ prediction)    (4)

To evaluate the capabilities of different DCNNs for pothole detection, the average IoU over 10 validation data sets is reported from two aspects in this study: (1) class-specific IoU (nondefective and defective regions); and (2) the mean IoU of both classes. Moreover, the inference time of each trained DCNN is also reported in this study.

The efficiency of different types of depth encoding techniques and data fusion methods was investigated in this study. Table 1 provides the class-wise IoU scores of the different depth encoding techniques and data fusion methods. In addition, the results are compared with an existing approach based on Otsu's thresholding method (Akagic et al. 2017; Jahanshahi et al. 2013). First, it is observed that the prior method (Akagic et al. 2017; Jahanshahi et al. 2013) resulted in the lowest mean IoU compared with the DCNN methods. Among the DCNN methods, the fused RGB-SN input has the best performance of all the encoding techniques. As shown in Table 1, the IoU score for the defect using fused RGB-SN input is 0.821, the background IoU is 0.974, and the mean IoU score is 0.898. As shown in Table 1, stacked RGB-D performs worse than fused RGB-D and fused RGB-SN data. These results show that the capability of detecting defects benefits from the features directly extracted from depth data. The DCNN with an extra depth encoder branch improves the detection result because the depth encoder successfully preserves features from the depth data. Moreover, no noticeable difference in performance is observed among the depth data subjected to different normalization techniques. This finding, that the boxes overlap, is also confirmed in Figs. 15(a–c). A box plot consists of a box and a set of whiskers to compare the results. The box represents the first to the third quartile, and the solid line in the box is the median. The distance between the upper and lower quartiles is the interquartile range (IQR), which is a measurement of the variability about the median. A larger IQR means the data points have more variability. The upper and lower whiskers extend to the maximum and minimum data points, excluding outliers. All box plots depicted in this study result from 10-fold cross-validation. In addition, the results shown in the box plots indicate that the performances are similar when raw relative, locally normalized, or globally normalized depth data are used, where the IQRs significantly overlap.

Another important finding is that the RGB image still plays an important role in semantic segmentation. Monochromatic images cannot replace RGB images. As shown in Table 1, the DCNN trained on grayscale images has the worst performance compared with the other DCNNs. In addition, it is observed that the DCNN trained solely on depth data has worse performance than the one trained only on RGB data. These results suggest that the RGB data provide important insights about the features. However, the variance of the performance using only RGB data is large. An illustration of this can be seen in Fig. 15(b), which indicates that the fusion of RGB and depth data is necessary to improve the model's robustness (e.g., the depth data can complement the RGB data in low-light conditions). As shown in Fig. 15(b), the defect IoU of the fused data has a smaller interquartile range than that of the RGB data. Moreover, the results of the coefficient of variation (COV) of IoU are shown in Fig. 16. The COV is defined as the ratio of the standard deviation to the mean of a population. COV is an indicator of the relative variability of the data points.
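The IoU of Eq. (4) and the COV used in the robustness comparison can be sketched as follows (the helper names are hypothetical; the population standard deviation is assumed):

```python
import numpy as np

def iou(pred, gt):
    """Eq. (4): overlap of prediction and ground truth over their union."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def cov(values):
    """Coefficient of variation: standard deviation over the mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()
```

The mean IoU reported in Table 1 is the average of `iou` evaluated on the background and defect classes, and `cov` is applied to the per-fold IoU scores from cross-validation.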



These results also confirm that it is more robust to utilize both color and depth data instead of only one. Furthermore, the DCNN trained on fused RGB-SN data has the lowest COV, which means it is the most robust.

Fig. 15. Boxplot comparison of IoU over 10-fold cross-validation using different depth encoding techniques and RGB-D fusion methods on the pavement data set: (a) background IoU; (b) defect IoU; and (c) mean IoU.

Fig. 16. COV of different depth encoding techniques. COV is an indicator of the relative variability of the data points. A lower COV value means higher robustness. The DCNN trained on fused RGB-SN has the lowest COV, representing higher robustness.

Table 2. Class-wise IoU scores of background and defect among different input data and data fusion methods

Input data and fusion method   Background   Defect   Mean
RGB                            0.966        0.762    0.864
Fused RGB-LND                  0.972        0.807    0.892
Fused RGB-estimated D          0.966        0.775    0.870
Fused RGB-SN                   0.973        0.821    0.898
Fused RGB-estimated SN         0.975        0.815    0.895
Fused RGB-HSN                  0.975        0.832    0.904

Note: Background represents the nondefective region. Defect represents the defective region. Bold text indicates the highest value in the table, which refers to the best performance.

However, Table 2 compares the IoU of the models trained on RGB fused with actual depth and estimated depth data, where fused RGB-HSN represents the DCNN with the hallucination encoder. It is observed that the fused RGB–estimated SN and fused RGB-HSN have performance comparable with the DCNN trained on fused RGB-SN. The result of the DCNN trained on fused RGB-HSN shows that the extra hallucination encoder can transfer the features learned from the depth encoder network to the hallucination network with RGB input. Figs. 17(a–c) present the overall performance distribution over 10-fold cross-validation via box plots. Fig. 17(b) shows that, although the variances of the defect IoU of fused RGB–estimated SN and fused RGB-HSN are slightly larger than that of fused RGB-SN, the variances of these fused DCNNs are lower than that of the network trained on RGB data. Although the variance of the DCNN trained on estimated depth is much larger than that of the DCNN trained on actual depth, this is an expected result because the estimated depth includes more noise and error than the actual depth data. The comparison of the COV displayed in Fig. 18 also confirms this finding. This indicates that these incidental depth data fusion methods are practical solutions for RGB-D fusion when depth sensing is not involved during data collection.

Defect Segmentation

A qualitative comparison of the segmentation results among the different types of inputs to the DCNNs is displayed in Fig. 19. The



first column in the figure displays the RGB image of the input data, and the second column depicts the ground truth. In Figs. 19(c and d), the networks trained only on the grayscale or RGB data do not successfully identify the defective region in shadow or low-light conditions. As seen in Figs. 19(e–h), the fused RGB-SN outperforms the other encoding techniques in terms of segmenting the defective regions.

Qualitative comparisons of the segmentation results using different indirect depth integration techniques are displayed in Fig. 20, where Fig. 20(a) shows the RGB image and Fig. 20(b) is the ground-truth segmentation mask. Fig. 20(c) is the output from the network trained on fused RGB–estimated depth data, Fig. 20(d) is the output from the network trained on fused RGB–estimated SN data, and Fig. 20(e) is the output from the network trained on fused RGB-HSN data.

Fig. 18. Coefficient of variation comparison of the networks trained on the actual depth and estimated depth maps. COV is an indicator of the relative variability of the data points. A lower COV value means higher robustness.
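The joint objective used to train the hallucination branch discussed above, Eqs. (2) and (3), can be sketched as follows. A minimal NumPy version is shown for clarity; in the actual training loop these would be framework tensors, and the activations are assumed to be flattened feature maps:

```python
import numpy as np

def hallucination_loss(a_depth, a_hall):
    """Eq. (2): squared Euclidean distance between the activations of the
    depth encoder (A_DNet) and the hallucination encoder (A_hNet)."""
    return float(np.sum((a_depth - a_hall) ** 2))

def joint_loss(ce_loss, a_depth, a_hall, alpha=0.1, beta=5.0):
    """Eq. (3): weighted sum of the cross-entropy and hallucination losses,
    with the paper's trial-and-error weights alpha = 0.1 and beta = 5."""
    return alpha * ce_loss + beta * hallucination_loss(a_depth, a_hall)
```

Minimizing `joint_loss` drives the hallucination encoder's activations toward those of the depth encoder, so that at validation time only the RGB input is needed.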

Computation Capability on Edge Computing Device


Table 3 lists the dimensions of the input data, the total number of network parameters, and the inference times of the DCNNs tested on a server equipped with two Intel Xeon E5-2620 v4 CPUs, 256 GB of DDR4 memory, and three NVIDIA RTX 6000 GPUs. For real-time inspection of pavement surface conditions, it is important to have a portable edge computing device on the vehicle. In this study, the NVIDIA Jetson TX2 GPU development kit was used as the edge computing platform. As discussed in "Data Collection and Preparation," the NVIDIA Jetson TX2 is an embedded chipset equipped with a GPU with 256 CUDA cores. Its capabilities have been acknowledged in drones and robots (NVIDIA 2016; Wu et al. 2019). The main concern in utilizing an edge computing device for online pavement inspection is the inference time and memory usage of the trained DCNNs with various types of depth encoding and data fusion methods.

Fig. 17. Boxplot of IoU for 10-fold cross-validation using different depth encoding techniques and data fusion methods: (a) background; (b) defect; and (c) mean. RGB represents a normal encoder-decoder network, fused RGB-SN represents the RGB and surface normal fusion network, and RGB-HSN is the network with an extra hallucination encoder that is trained on RGB-SN data but validated solely on RGB data. The RGB-SN and RGB-HSN fusion networks perform better than the network trained on only RGB data. RGB-HSN has comparable performance with RGB-SN, whereas it has a larger IQR than RGB-SN, which shows that the variability of RGB-HSN is slightly larger. The hallucination encoder can effectively transfer depth information from the depth encoder.

To this end, the inference times of the DCNNs on a server and an edge computing device were compared in this study. Fig. 21 compares the computation time on the server and the Jetson TX2. Two performance-power configurations of the Jetson TX2 were selected to compare the inference time: Max-N mode has the highest throughput, whereas Max-Q mode has the highest power efficiency. It is apparent from Fig. 21 that the inference time of



Fig. 19. Sample segmentation results from the networks trained on different depth encoding approaches and data fusion methods: (a) RGB image;
(b) ground truth of defective region; (c) output from the network trained on grayscale input data; (d) output from the network trained on RGB data;
(e) output from the network trained on stacked RGB-RD data; (f) output from the network trained on fused gray-RD data; (g) output from the network
trained on fused RGB-RD data; and (h) output from the network trained on fused RGB-SN data.



Fig. 20. Sample segmentation results from the networks trained on different depth encoding approaches and data fusion methods: (a) RGB image;
(b) ground truth of defective region; (c) output from the network trained on fused RGB-estimated D data; (d) output from the network trained on fused
RGB–estimated SN data; and (e) output from the network trained on fused RGB-HSN data.

Table 3. Input data dimension, number of parameters, and average inference time for different depth encoding and data fusion methods on a server equipped with an NVIDIA RTX 6000 GPU and on the Jetson TX2

Input data and fusion method   Input data dimension (pixels)     DCNN parameters   Inference time (s)
                                                                                   Server   Jetson TX2 (Max-N)   Jetson TX2 (Max-Q)
Gray                           640 × 480 × 1                     29,442,443        0.035    1.160                1.692
RGB                            640 × 480 × 3                     29,443,585        0.035    1.162                1.699
Stacked gray-RD                640 × 480 × 2                     29,443,009        0.036    1.162                1.698
Stacked RGB-RD                 640 × 480 × 4                     29,444,161        0.036    1.162                1.698
Fused gray-RD                  640 × 480 × 1 and 640 × 480 × 1   44,164,417        0.054    1.788                2.593
Fused RGB-RD                   640 × 480 × 3 and 640 × 480 × 1   44,165,569        0.054    1.788                2.593
Fused RGB-SN                   640 × 480 × 3 and 640 × 480 × 3   44,166,721        0.055    1.795                2.617

the Jetson TX2 is much higher than the server's. This is due to the limited computational capacity of the edge computing device. Furthermore, loading the two-stream encoder-decoder DCNN is too cumbersome for the edge computing device. Additionally, the memory requirements for the DCNNs trained on only RGB and RGB-D data are around 112 and 170 MB, respectively. Considering the trade-off between segmentation accuracy, inference time, and memory usage, it is better to deploy a DCNN that uses only RGB data as input on the edge computing device. However, the depth input significantly improves the performance of defect


Fig. 21. Comparison of inference time for one frame on the server equipped with an NVIDIA RTX 6000 GPU and on the Jetson TX2 with Max-N and Max-Q modes for different input data and data fusion methods.

Fig. 22. Comparison of the pothole volumes estimated autonomously by the proposed approach based on the fused RGB-SN DCNN versus the manually labeled data.
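The volume estimates compared in Fig. 22 sum, over the segmentation mask, each pixel's depth multiplied by its ground-surface area. A simplified sketch follows; it assumes a single calibrated per-pixel area rather than the per-pixel back-projected quadrilaterals computed in the paper:

```python
import numpy as np

def pothole_volume(depth_m, mask, pixel_area_m2):
    """Estimate pothole volume by summing per-pixel depth x surface area.

    depth_m: per-pixel depth of the defect below the pavement plane (m).
    mask: binary semantic segmentation result (nonzero = defect pixel).
    pixel_area_m2: ground area represented by one pixel (m^2); assumed
    constant here, whereas the paper back-projects each pixel's corners
    onto the pavement plane to obtain a per-pixel quadrilateral area.
    """
    return float(np.sum(depth_m * pixel_area_m2 * mask.astype(bool)))
```

Applied per pothole, this total volume quantifies the material loss used in the severity discussion.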

segmentation. To shorten the computational time on the edge device for real-time defect detection utilizing RGB-D fusion, DCNN pruning methods can be an important part of future work.

Defect Quantification

Knowing the total volume of each pothole, the loss of material can be obtained, which can be a potential metric for severity-level classification. The volume of the pothole based on different severity levels is calculated. As shown in Fig. 22, the volumes estimated from the ground-truth data and from the prediction results can be compared. It is observed that the estimated volume exhibits a close correlation with the ground truth, where the mean relative error is 5 × 10^−4 m³. Current inspection manuals can be updated in the future to benefit from volumetric defect quantification.

Conclusion and Future Work

This paper proposed an inexpensive road condition assessment data acquisition system using an Intel RS-D435 camera, a consumer-grade RGB-D sensor, and an NVIDIA Jetson TX2 computing device. This data collection system is suitable to be mounted on several vehicles. This study also investigated the performance of encoder-decoder-based semantic segmentation DCNNs using various depth encoding techniques and data fusion methods (i.e., data and feature level). It is observed that fused RGB-SN outperforms the other types of encoding techniques and data fusion methods in terms of accuracy, robustness, and processing speed. The fusion of color and depth features improves the performance of segmentation algorithms. Potholes are detected successfully even when the scene is dark (i.e., there is not enough brightness).

To accelerate the efficiency of pavement repair, it is worth installing this data acquisition system on several vehicles and collecting data in an opportunistic approach. Hence, this study also measured the inference time of different DCNN models on an NVIDIA Jetson TX2, which is an edge computing device. The result suggests that the DCNN model with solely RGB input is the most efficient considering the memory usage, inference time, and accuracy. Moreover, depth sensing may be disabled due to some

indirectly from a single RGB image. An encoder-decoder network with skip connections for depth estimation is one practical solution, while another DCNN with an extra hallucination encoder to replicate features extracted from depth data was also considered. The results show that the hallucination encoder can transfer the features from the depth encoder using an RGB input during the training process, while input depth data are not required during validation.

With an accurate semantic segmentation result, the volume of each identified pothole can be estimated autonomously. The volume of a pothole represents the amount of material loss and holds a lot of promise to become an effective criterion for deciding the severity level of potholes in the future. Although potholes were considered as a case study in this paper, other 3D defects, such as rutting and raveling, can be detected autonomously if a DCNN is trained on the appropriate training set. Furthermore, evaluating the efficiency of the proposed inexpensive RGB-D data acquisition system for crack detection is part of future work. In this study, the feasibility of using the developed inexpensive system on various vehicles was shown. This prototype can be used as a building block for crowdsourcing and Internet of Things (IoT) data generation to improve the inspection frequency and monitor the progression of defect deterioration on lane road networks.

While the results are encouraging, the proposed solution needs to be scaled up and applied to a large number of roads and multiple vehicles to form a dense mobile sensor network to obtain IoT-generated data for crowdsourcing. To this end, more improvements and modifications of the system need to be undertaken before it is launched on numerous vehicles. The authors intend to integrate GPS into the data acquisition system. As a result, the trajectory of each vehicle and the corresponding coordinates of the potholes will be recorded. Furthermore, real-time defect detection and time-based tracking for evaluating the deterioration rate of defects are part of future work. Inclusion of a commercially available wireless communication device through which the collected data will be transmitted to a central data hub automatically is another direction for future work. With this system, decision makers will have more frequent updates about the condition of a section of road, which will lead to more informed decision-making regarding inspection optimization,
environment limitations. To fully utilize the collected RGB data, maintenance prioritization, estimation of remaining useful life,
this study adopted two approaches to obtain depth information and so on.
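The surface normal (SN) encoding mentioned above converts a single-channel depth map into a three-channel image that a standard RGB encoder can consume. The paper's exact implementation is not reproduced here; the following is a common finite-difference construction (the function name is illustrative):

```python
import numpy as np

def depth_to_surface_normals(depth):
    """Encode a depth map (H x W, meters) as a 3-channel surface-normal image.

    For a surface z = f(x, y), the (unnormalized) normal direction is
    (-dz/dx, -dz/dy, 1); the gradients are approximated with finite differences.
    """
    dz_dx = np.gradient(depth, axis=1)  # derivative along image columns
    dz_dy = np.gradient(depth, axis=0)  # derivative along image rows
    normals = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth)))
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    normals /= np.maximum(norm, 1e-8)   # unit-length normals, safe division
    # Map components from [-1, 1] to [0, 255] so the result resembles an RGB image.
    return ((normals + 1.0) * 0.5 * 255.0).astype(np.uint8)

# A perfectly flat patch yields normals pointing straight at the camera.
sn = depth_to_surface_normals(np.ones((4, 4)))
```

For flat regions the gradients vanish, so the z-channel saturates while the x- and y-channels sit at mid-gray; pothole walls produce tilted normals and therefore strong color contrast for the network.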
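The two fusion levels compared in this study differ only in where the color and depth streams meet. A minimal conceptual sketch (shapes and the FuseNet-style summation are assumptions, not the authors' exact architecture):

```python
import numpy as np

def data_level_fusion(rgb, sn):
    """Data-level fusion: stack the color image and the 3-channel
    depth encoding into a single 6-channel network input."""
    return np.concatenate((rgb, sn), axis=-1)

def feature_level_fusion(rgb_feats, depth_feats):
    """Feature-level fusion: merge corresponding encoder feature maps
    element-wise (FuseNet-style summation) before the shared decoder."""
    return [r + d for r, d in zip(rgb_feats, depth_feats)]

rgb = np.zeros((480, 640, 3))
sn = np.zeros((480, 640, 3))
fused_input = data_level_fusion(rgb, sn)  # one encoder sees 6 channels

# With two encoders, each produces a feature pyramid at decreasing resolution.
rgb_feats = [np.zeros((240, 320, 64)), np.zeros((120, 160, 128))]
depth_feats = [np.zeros((240, 320, 64)), np.zeros((120, 160, 128))]
fused_feats = feature_level_fusion(rgb_feats, depth_feats)
```

Data-level fusion keeps a single encoder (cheaper on an edge device), whereas feature-level fusion lets each modality learn its own filters before merging, which is consistent with the accuracy gain reported for fused RGB-SN.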
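Inference-time measurements on an embedded device are sensitive to one-time startup costs, so a warm-up phase is standard practice. A generic harness of the kind used for such benchmarks (the function name and defaults are illustrative, not the authors' script):

```python
import time

def mean_inference_time(model_fn, sample, warmup=5, iters=50):
    """Average per-frame inference latency of model_fn, in seconds.

    Warm-up calls absorb one-time costs (memory allocation, kernel
    compilation on embedded GPUs such as the Jetson TX2) so they do
    not inflate the reported average.
    """
    for _ in range(warmup):
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(sample)
    return (time.perf_counter() - start) / iters
```

The same harness can rank the RGB-only, data-fusion, and feature-fusion variants on identical inputs, which is how a memory/latency/accuracy trade-off like the one reported here would be tabulated.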
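The volumetric quantification described above amounts to integrating the depth deviation below a reference road plane over the pixels labeled as pothole. The paper's full pipeline (plane fitting, metric calibration of the pixel footprint) is not reproduced; a minimal sketch assuming those inputs are already available:

```python
import numpy as np

def pothole_volume(depth_below_plane, mask, pixel_area):
    """Estimate pothole volume in m^3.

    depth_below_plane : H x W array, per-pixel depth below the fitted road
                        reference plane in meters (<= 0 where intact).
    mask              : H x W boolean array from the segmentation network,
                        True where a pixel is labeled pothole.
    pixel_area        : metric ground footprint of one pixel, in m^2.
    """
    deviation = np.clip(depth_below_plane, 0.0, None)  # ignore bumps above the plane
    return float(np.sum(deviation[mask]) * pixel_area)

# A 10 x 10 pixel pothole, uniformly 5 cm deep, 1 cm^2 per pixel:
dev = np.zeros((100, 100))
msk = np.zeros((100, 100), dtype=bool)
dev[40:50, 40:50] = 0.05
msk[40:50, 40:50] = True
vol = pothole_volume(dev, msk, 1e-4)  # 100 px * 0.05 m * 1e-4 m^2 = 5e-4 m^3
```

Because the mask comes from the segmentation network, volume accuracy inherits directly from segmentation accuracy, which is why the close agreement with ground-truth volumes follows from the high IoU reported earlier.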

© ASCE 04023010-16 J. Transp. Eng., Part B: Pavements

J. Transp. Eng., Part B: Pavements, 2023, 149(2): 04023010


Data Availability Statement

All data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to acknowledge Zhao Xing Lim, Da Cheng, and Xianmeng Zhang from the Elmore Family School of Electrical and Computer Engineering at Purdue University for their support during the development of the data acquisition system.
