Abstract: The poor condition of roads is a major factor in traffic accidents and damage to vehicles. A significant portion of car accidents is attributed to severe three-dimensional (3D) pavement distresses such as potholes, rutting, and raveling. Insufficient road condition assessment contributes to the poor condition of roads. To inspect the condition of pavement surfaces more frequently and efficiently, an inexpensive data acquisition system was developed that consists of a consumer-grade RGB-D sensor and an edge computing device, can be mounted on vehicles, and collects data while driving. The RGB-D sensor is used for collecting two-dimensional (2D) color images and corresponding 3D depth data, and the lightweight edge computing device is used to control the RGB-D sensor and store the collected data. An RGB-D pavement surface data set is generated. Furthermore, encoder-decoder deep convolutional neural networks (DCNNs), consisting of one or two encoders and one decoder, trained on heterogeneous RGB-D pavement surface data are used for pothole segmentation. Comprehensive experiments using different depth encoding techniques and data fusion methods, including data- and feature-level fusion, were performed to investigate the efficacy of defect detection using DCNNs. Experimental results demonstrate that feature-level RGB-D data fusion based on the surface normal encoding of depth data outperforms the other approaches in terms of segmentation accuracy, achieving a mean intersection over union (IoU) of 0.82 over 10-fold cross-validation, a 7.7% improvement compared with a network trained only on RGB data. In addition, this study explores the efficacy of indirectly using depth information for pothole detection when depth data are not available. Additionally, the semantic segmentation results were utilized to quantify the severity level of the potholes, assisting in maintenance decision-making. The results of these comprehensive experiments, using an RGB-D pavement surface data set gathered through the proposed data acquisition system, are a stepping stone for opportunistic data collection and processing through crowdsourcing and the Internet of Things in future smart cities for effective road assessment. Finally, suggestions for the improvement of the proposed system are discussed. DOI: 10.1061/JPEODX.PVENG-1194. © 2023 American Society of Civil Engineers.
surface conditions. Earlier studies developed various systems to collect RGB images of the pavement surface. For instance, Jo and Ryu (2015) proposed a pothole data collection system using a commercial black-box camera. Zhang and Elaksher (2012) proposed a system composed of an unmanned aerial vehicle fitted with a digital camera to detect road surface distresses. However, these studies only used two-dimensional (2D) RGB data to detect road defects, whereas some road defects, such as potholes, rutting, and raveling, are three-dimensional (3D) defects. To correctly classify the severity level of potholes, it is necessary to obtain depth data that provide rich 3D information. Furthermore, the depth information allows the calculation of the amount of missing material of a pothole, which can provide more insight for determining the severity level of potholes. Therefore, several studies have attempted to use 3D road data for pothole detection. Fan et al. (2019) proposed a stereo vision-based pothole detection system that utilized 2D image analysis and 3D road surface modeling techniques. Alternatively, commercial 3D pavement data tools have been used by researchers in the past. Barbarella et al. (2019) used a terrestrial laser scanner (TLS) to obtain the geometric features of pavement. Zhang et al. (2018) proposed a special vehicle with a 3D data collection system for pavement defect detection. Tsai and Chatterjee (2018) collected 3D pavement data using a sensing vehicle equipped with a data acquisition system including 2D imaging and 3D lidar. However, this equipment is relatively expensive. Therefore, a number of studies used low-cost sensors for 3D pavement data acquisition. Chen et al. (2016b) and Moazzam et al. (2013) relied on the Microsoft Kinect, which is an RGB-D sensor. However, these studies did not design their data acquisition systems as autonomous data collection systems, so the inspection efficiency is limited.

Traditionally, pothole detection has been assessed by image-processing and image-thresholding methods. In particular, Otsu's thresholding method was one of the most successful methods for image thresholding (Akagic et al. 2017; Jahanshahi et al. 2013). In recent years, there has been an increasing interest in using deep learning approaches (LeCun et al. 2015). Deep learning has demonstrated its successes in computer vision areas such as classification and pattern recognition. To this end, several convolutional neural network (CNN) architectures for object detection and recognition have been utilized, including R-CNN, fast R-CNN, and faster R-CNN, which are region-based CNNs (Girshick et al. 2014; Ren et al. 2015; Girshick 2015). Many recent studies have shown that deep learning methods can successfully perform pavement defect detection tasks. Maeda et al. (2018) used smartphone cameras to take photos of road defects and trained deep learning models using those images to recognize road defects. Ibragimov et al. (2022) applied the faster R-CNN model to detect pavement cracks. Cao et al. (2020) implemented a comprehensive study on various deep learning models for road damage detection. Ukhwah et al. (2019) applied YOLO, a region-based deep convolutional neural network (DCNN) model, to detect potholes. The approaches proposed in the preceding studies are region-based methods. One disadvantage of region-based methods is the lack of a precise spatial boundary of the detected object. In contrast, semantic segmentation classifies each pixel in an image, which can be used for the detection of defect boundaries. Fan et al. (2021) evaluated a set […] tasks, for instance, nuclear power plant crack detection (Chen and Jahanshahi 2017) and road crack detection (Zhang et al. 2016).

Although many researchers demonstrated success in pavement defect detection, most relied on 2D RGB data or 3D depth data alone. However, 2D RGB and 3D depth data complement each other: RGB data provide color information, and depth data provide geometric information. In general, the performance of semantic segmentation CNN models is limited by the quality of the training data. The defect detection accuracy may be compromised by poor illumination conditions, varied background textures, and misleading nondefect objects in RGB images. To improve the accuracy, depth data have been considered an informative input for CNN-based segmentation methods. Several studies have utilized RGB-D data for indoor scene detection and observed enhanced performance (Li et al. 2016; Chang et al. 2017). Hazirbas et al. (2016) proposed a CNN model, called FuseNet, that incorporated depth into the semantic segmentation algorithm and trained it on two public RGB-D data sets (Song et al. 2015; Gupta et al. 2014).

There have also been efforts to utilize crowdsourcing to detect potholes. For instance, the Get It Done San Diego app is a crowdsourcing solution where users can report potholes and connect directly to the city's work tracking system. The user has to take a picture of the defective region, which is not always feasible on crowded roads and may pose safety threats to the individuals involved in the process. Furthermore, such solutions need a human operator to look at the submitted images and decide whether that section of the road needs to be fixed or not. Another example is the Street Bump app operated by the City of Boston. Street Bump is a mobile app used by volunteers to collect road condition data while they drive. It utilizes the phone's accelerometer and GPS sensors, uploads the data to a central server, and alerts drivers regarding the locations of unfixed bumps. This solution requires manual interaction, where the users log trips individually. Another limitation of the aforementioned crowdsourcing solutions is that they can only detect coarse defects (i.e., potholes and bumps) and, in most cases, only when the vehicle hits them. Furthermore, the generated reports only provide the locations of bumps; accurate quantification of defects is not performed to evaluate the evolution of defective regions. In this study, the feasibility of crowdsource-based data collection for quantitative measurement of pothole conditions is investigated. The proposed solution has the capability of detecting potholes even if the vehicle does not hit the pothole. Recently, the Carbin app has been a successful smartphone-based crowdsourcing solution in Boston for road roughness sensing and monitoring (Botshekan et al. 2020, 2021). However, the Carbin app cannot detect and quantify defects directly and only focuses on estimating the International Roughness Index (IRI). Because it has been shown that RGB-D sensors can be used to estimate pavement roughness as well (Mahmoudzadeh et al. 2019), the proposed solution in the current study has more potential than the aforementioned smartphone-based crowdsourcing.

Contribution

The objective of this study is to establish an inexpensive road surface data acquisition and analysis system for road condition [assessment]. To this end, this study proposes an inexpensive hardware-software system based on an edge computing device and the Intel RealSense depth camera D435 (Intel RS-D435, Santa Clara, California), which can collect RGB and depth data while driving at approximately 50 km/h (30 mi/h). Instead of purchasing a specific type of vehicle, the developed inexpensive data acquisition system can be mounted on various vehicles, such as city and DOT vehicles, US Postal Service or UPS vehicles, Uber cars, and volunteer vehicles. The total expense of the hardware is about $1,600, excluding the cost of the vehicle. Therefore, this inexpensive data acquisition system can be mounted on multiple vehicles to build a dense mobile sensor network that serves as an autonomous, more frequent, and quantitative condition-based as well as opportunistic assessment of roads through crowdsourcing, instead of the existing schedule-based manual or semiautonomous inspections. This study also aims to automatically detect potholes using a deep learning-based segmentation approach that can further obtain the shape of detected potholes for defect quantification. With the assistance of depth data, the proposed quantification approach can estimate not only the area but also the volume of a pothole, which can be considered as a new pothole severity level indicator in the future.

One of the limitations of the current study is that it focuses only on pothole detection and quantification as a case study. Other defects, such as alligator cracking, can lead to the formation of potholes if neglected for a long period of time. While pothole detection is useful, road decision makers are more interested in multiple distresses together, or in an index such as the pavement condition index (PCI), to estimate road performance. Consequently, it is important to consider the effectiveness of the proposed solution for other types of defects, particularly the less coarse defects where the depth information may not contribute much to the defect detection process. Therefore, it is recommended that further research be undertaken on various pavement distresses based on the proposed system. The proposed system can also be instrumental in monitoring the progression of less severe distresses such as alligator cracks.

In recent years, there have been several studies where RGB images and machine learning techniques, including deep learning, have been used successfully for the detection of less coarse defects such as cracks (Ibragimov et al. 2022; Zhou and Song 2021; Dung 2019). This indicates that the combination of the proposed inexpensive hardware system and existing defect detection algorithms can lead to the detection of a variety of different defects that can be used to estimate a road condition index, such as the PCI. One advantage of the developed system is that it can be mounted on multiple cars, providing a huge amount of pavement surface images through a crowdsourcing solution to mitigate the uncertainty of PCI estimation, which is primarily dependent on manual surveys of the road. Piryonesi and El-Diraby (2021) scrutinized the effect of various pavement distresses as well as climate change on the estimation of the PCI. Furthermore, there have been some efforts to introduce better pavement distress indexes. For instance, Chen et al. (2016a) utilized Long-Term Pavement Performance (LTPP) data and structural equation modeling to develop a new road condition index. To this end, using the proposed hardware system in this study, the depth information can be utilized for quantifying the identified defects even for less coarse defects (e.g., crack width). This can potentially […] this study selects potholes as the target pavement distress for detection and quantification. This paper has the following main contributions:
• To enhance the autonomy of road condition assessment, an inexpensive data acquisition prototype is developed, integrating multiple hardware components, such as an edge computing device and a consumer-grade RGB-D sensor, interconnected and synchronized using the Robot Operating System (ROS).
• This paper comprehensively explores the efficacy of segmentation DCNNs trained with various depth encoding and data fusion methods for defect detection. There are two main types of data fusion methods: data-level and feature-level RGB-D fusion. The performance of each DCNN is evaluated based on two aspects: first, the accuracy of segmentation; second, the inference time of each DCNN on a server and on an edge computing device. Suggestions for the selection of the input data and a fusion method on an edge computing device are provided.
• It is shown that the fusion of depth and RGB data improves the performance and robustness of the segmentation task. It is also shown that feature-level data fusion is superior to data-level fusion. In addition, this study discusses a novel technique to estimate depth data from RGB images when depth data collection is not available.
• From the segmentation results, defect quantification is implemented through the calculation of potholes' area and volume, which can be used as important indicators in the categorization and prioritization of potholes.

Data Collection and Preparation

To collect road surface data efficiently across a region, a light, low-cost data acquisition system that can be mounted on a vehicle was developed. The proposed experimental setup is shown in Fig. 1. The three major instruments employed in the system are the NVIDIA (Santa Clara, California) Jetson TX2 developer kit (an edge computing device), the Intel RS-D435 RGB-D sensor, and a 1-TB portable solid-state drive (Extreme Portable SSD, Milpitas, California). The edge computing device and the Intel RS-D435 RGB-D sensor are compact and can be safely mounted on the back of vehicles. A piece of computer middleware called ROS was used to interconnect the hardware components and provide an interface for top-down control. This section introduces the characteristics of these three hardware components as well as the middleware that facilitates the data acquisition system.

Data Acquisition System

Hardware
As shown in Fig. 1, the Intel RS-D435 RGB-D sensor can capture spatially aligned RGB and depth images simultaneously. Along with an ordinary RGB camera, it has a depth module that measures the absolute distance of the physical point represented by each pixel using structured-light stereopsis, which mainly consists of a static infrared (IR) pattern projector and a pair of left and right imagers. The infrared projector projects nonvisible static IR patterns to
Fig. 1. Data collection system configuration: (a) the overall setup of the data acquisition system; (b) the Intel RS-D435 RGB-D sensor, which can be mounted on a bike rack on the trunk; and (c) a power-efficient edge computing device that controls the system. (Images by Yu-Ting Huang.)
improve depth accuracy in scenes with low texture, thereby allowing the sensor to produce trustworthy results under various lighting conditions. The Intel RS-D435 RGB-D sensor offers a wide range of resolutions and frame rates (FPS) for the operator to choose from. The depth resolution is up to 1,280 × 720, and the RGB resolution is up to 1,920 × 1,080. In this study, an FPS of 30 was used for both the RGB and the depth sensor, the resolution was 640 × 480, and the vehicle speed was below 50 km/h (30 mi/h).

In the proposed system, the Intel RS-D435 RGB-D sensor is connected to a Jetson TX2 developer kit. Jetson TX2 is a system-on-module designed by NVIDIA that incorporates a 256-core Pascal graphics processing unit (GPU), a 6-core ARM-v8 CPU cluster, 8 GB of low-power double data rate 4 (LPDDR4) memory, and 32 GB of flash storage. It is a power-efficient device whose maximum power draw does not exceed 15 W. On top of the bare Jetson TX2 module, NVIDIA offers a carrier board that wraps the module itself along with several essential computer peripherals into a developer kit, making it versatile as a mobile computer, as shown in Fig. 1.

A portable 1-TB USB SSD device was used to archive the high volume of images captured on the fly. RGB and depth images were stored in JPEG and PNG format, respectively. The resolution of the RGB and depth data was 640 × 480. The size of a three-channel JPEG image containing a fair amount of texture details is around 200 kilobytes (kB), while the size of a PNG depth image is typically 75 kB. When the images are captured at 30 FPS, the influx of data to the storage system is around 16.1 megabytes (MB)/s.

Software
In this prototype, ROS allows the hardware components to communicate with each other. ROS is an open-source freeware that facilitates the programming of robots for interprocess communication (Quigley et al. 2009). In the ROS communication model, nodes publish messages to certain topics and read messages from other topics. Messages are transported using either TCP/IP or the User Datagram Protocol (UDP). Using this approach, the camera-controlling logic and the image-processing logic can be implemented as separate nodes, thereby promoting modularity. Furthermore, one performance-optimizing feature that was used for host-local communications (as opposed to network communications) is the ROS nodelet, which allows zero-copy passing of data between functional nodes.

Fig. 2. Data collection path around the Purdue University campus. [© OpenStreetMap contributors (openstreetmap.org/copyright).]

Data Set Generation

To train and validate the semantic segmentation DCNNs, this study generated a pavement surface data set. The data were collected using the developed data acquisition system. To this end, this lightweight system was mounted on a vehicle, and the data were collected at a speed of approximately 50 km/h (30 mi/h) on local streets around the Purdue University campus (Fig. 2). The data set consists of more than 30,000 RGB-D image pairs, of which 1,344 pairs, including defective and nondefective regions, were manually selected for training and validation. Fig. 3 gives some sample images in the data set. Figs. 3(a–c) are the RGB images, where the resolution of each image is 640 × 480 pixels; Figs. 3(d–f) depict the corresponding absolute depth data that are spatially aligned with the RGB images; and Figs. 3(g–i) show the ground truths. All of the potholes in the data set are manually annotated. In addition, the data set contains pothole-like objects. For instance, as shown in Fig. 4, a manhole and circular asphalt bleeding have shapes or textures similar to potholes. These objects may potentially lead to false-positive predictions by a DCNN model.
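Because RGB frames are archived as JPEG and depth frames as PNG, any offline processing first has to pair the two modalities frame by frame. A minimal sketch of such pairing by shared filename stem (the `frame_XXXX` naming below is illustrative, not the system's actual scheme):

```python
from pathlib import Path

def pair_rgb_depth(data_dir):
    """Pair RGB (.jpg) and depth (.png) frames that share a filename stem.

    Returns a sorted list of (rgb_path, depth_path) tuples; frames missing
    either modality are skipped.
    """
    data_dir = Path(data_dir)
    rgb = {p.stem: p for p in data_dir.glob("*.jpg")}
    depth = {p.stem: p for p in data_dir.glob("*.png")}
    common = sorted(rgb.keys() & depth.keys())
    return [(rgb[s], depth[s]) for s in common]
```

Skipping unmatched stems guards against frames dropped from either stream during high-rate capture.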
Fig. 3. Sample images in the data set: (a) a road surface without a pothole; (b) a road surface with a small defect; (c) a road surface with a high-severity-level pothole; (d–f) the corresponding aligned absolute depth data; and (g–i) manually annotated ground truth.
Fig. 4. Samples of pothole-like objects that may confuse the DCNN: (a) a manhole; (b) a manhole; and (c) a patching.
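The manually annotated ground-truth masks shown above are compared against network predictions using intersection over union (IoU), the metric reported throughout this paper. For a binary pothole mask, IoU is simply the ratio of overlapping to combined foreground pixels; a minimal sketch:

```python
def binary_iou(pred, truth):
    """IoU between two binary masks given as 2D lists of 0/1.

    IoU = |pred AND truth| / |pred OR truth|; defined as 1.0 when both
    masks are empty (no pothole predicted, none present).
    """
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += p and t  # 1 only where both masks are foreground
            union += p or t   # 1 where either mask is foreground
    return inter / union if union else 1.0
```

The same formula applied to the background class gives the background IoU used later to gauge robustness against pothole-like nondefect objects.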
Data Preprocessing

As shown in Fig. 5, the sensor is not exactly perpendicular to the road surface because it is mounted on a bike rack at the rear of the vehicle to avoid including any part of the vehicle in the captured data. Fig. 6(a) displays an example RGB image of a pothole, and the corresponding absolute depth data are presented in Fig. 6(b). As seen, the defect is not apparent in the absolute depth image because the sensor was slightly inclined during data collection. To correct this issue, the following procedure is implemented to identify and remove the baseline plane. First, the camera parameters are applied to deproject the 2D image coordinates to 3D world coordinates using the equation derived from the pinhole camera model:

(X, Y, Z) = (d · ((i, j) − P) / F, d)    (1)

where (X, Y, Z) = point in the world coordinate system relative to the depth sensor; i and j = pixel coordinates of the point in the depth image; P = principal point on the image plane; F = focal length of the depth sensor; and d = depth value captured by the sensor. Next, the random sample consensus (RANSAC) (Fischler and Bolles 1981) algorithm is used to fit a plane to these points in
Fig. 5. Data acquisition system setup: (a) sensor is mounted on the bike rack, which is not perfectly perpendicular to the road, to avoid including the
car in the image; and (b) a close view of the sensor and region of interest.
Fig. 6. Sample RGB-D data: (a) an RGB pothole image; (b) the corresponding absolute depth data; and (c) the relative depth data with respect to the
road surface after subtraction of the fitted plane.
the 3D space. The points having a distance to the plane greater than a set threshold are treated as outliers. After the plane is fitted, it is subtracted from the original depth image. The fitted plane is an estimation of the road surface. This subtraction provides relative depth data with respect to the estimated road surface [Fig. 6(c)].

Methodology

RGB-D Defect Segmentation

To date, various methods have been developed and introduced to detect pavement defects. Many researchers have utilized RGB data for defect detection using a bounding box. However, there are drawbacks associated with the use of bounding boxes to detect the defect. The bounding box does not provide the accurate dimensions and shape of the defect; therefore, the estimated geometry information of the object is not enough to accurately determine the type of defect and its severity. In this study, semantic segmentation was used to detect potholes on the pavement. Unlike the object detection method, semantic segmentation is a pixel-wise classification that can predict the contour of each object. Knowing the precise shape of the defect, the dimensions and the area of the defect can be estimated, which are critical parameters for classifying the defect according to the inspection guideline. In this study, a binary semantic segmentation DCNN was trained to classify a pixel as a pothole or background. The pothole signifies a defect region, whereas all pixels outside the defect region are labeled as the background. The background may occasionally include pothole-like nondefect areas such as manholes. The background intersection over union (IoU) provides an insight into the DCNN's ability to distinguish between a pothole and a nonpothole object. In addition, for the purpose of pothole detection and severity level classification, depth data were adopted in this study. The advantage of using depth data is that they provide the pothole's geometric information, which complements 2D RGB images. To this end, an encoder-decoder-based semantic segmentation neural network was used in this study, and different data fusion strategies were used to fuse grayscale or RGB with depth data. Several experiments were carried out using various types of fusion strategies and depth encoding techniques.

Data Encoding
As shown in Fig. 7, various types of depth encoding techniques were applied in this study, including absolute depth data, raw relative depth data, locally normalized depth data, globally normalized depth data, and surface normal (SN) data. Fig. 7(a) depicts a sample of raw relative depth data without any normalization. Locally normalized depth is obtained by dividing the relative depth by the maximum relative depth in each image, where the values vary between 0 and 1. A brighter region represents a deeper area
Fig. 7. Different types of depth encoding techniques utilized in this study: (a) relative depth data; (b) locally normalized relative depth data obtained
by dividing relative depth by the maximum relative depth in each image; (c) globally normalized relative depth data obtained by dividing relative
depth by the maximum relative depth in the entire RGB-D pavement surface data set; and (d) SN data.
on the road surface. A sample of locally normalized depth data is displayed in Fig. 7(b). In contrast, globally normalized depth is obtained by dividing the relative depth by the maximum relative depth in the entire RGB-D pavement surface data set (i.e., among all of the depth data in the data set). A sample of globally normalized relative depth is displayed in Fig. 7(c). A brighter area in this figure corresponds to a region with less depression on the road, and a darker area corresponds to a deeper region. In addition, another depth encoding method, the surface normal, is considered in this study. To find the SN at a point, all pixels need to be converted to 3D points. Next, a plane considering the adjacent points is fit at each point in the 3D point cloud. Then an SN vector can be obtained based on the fitted plane (Silberman et al. 2012; Wang et al. 2015). The three components of the SN vector constitute the three channels of the SN map. Fig. 7(d) shows the SN estimation of the depth map in Fig. 7(a).

Data Fusion
This study evaluated the effectiveness of fusing 2D grayscale or RGB images with depth data for semantic segmentation. Two approaches were considered: data-level fusion and feature-level fusion. For data-level fusion, four-channel RGB-D input data were generated by stacking the depth data on the RGB channels. For feature-level fusion, instead of combining RGB and depth data as input to the network, RGB and depth data were fed into two individual encoder networks in the semantic segmentation DCNN.

Comprehensive experiments and comparisons were implemented to evaluate the performance of the various depth encoding techniques and data fusion methods. The outcomes were compared with an existing approach based on depth data and unsupervised learning. Fifteen semantic segmentation DCNNs were implemented and trained on different types of input data: (1) grayscale images; (2) RGB images; (3) relative depth (RD) data; (4) locally normalized depth (LND) data; (5) globally normalized depth (GND) data; (6) SN data; (7) stacked gray-RD data; (8) stacked RGB-RD data; (9) stacked RGB-LND data; (10) stacked RGB-GND data; and feature-level data fusion using (11) gray-RD, (12) RGB-RD, (13) RGB-LND, (14) RGB-GND, and (15) RGB-SN data.

Network Architecture
Two DCNN architectures were applied for data- and feature-level fusion in this study. All networks perform class-wise semantic segmentation on RGB-D images to classify each pixel into binary classes (i.e., background or defect). As shown in Fig. 8, an encoder-decoder DCNN was used for data-level fusion. The encoder network was identical to the VGG16 network architecture (Simonyan and Zisserman 2014) without fully connected layers. The decoder network upsamples the feature maps from the corresponding encoder and outputs a pixel-wise labeling result with the same resolution as the input data. For feature-level fusion, a DCNN was adopted based on the architecture proposed by Hazirbas et al. (2016). The network architecture is displayed in Fig. 9 and is an encoder-decoder-style DCNN with two encoder networks: an RGB input encoder branch and a depth input encoder branch. The depth encoder network was also a VGG16 network architecture without fully connected layers. In addition, five fusion layers were added after each convolution, batch normalization, and activation (CBR) layer in the RGB encoder network. The encoder network extracted feature maps from RGB and depth data. These two branches in the encoder part extracted
Fig. 8. Network architecture for data-level fusion. The input can be RGB images or stacked RGB-D. The decoder part upsamples the feature maps to
original input resolution.
Fig. 9. Network architecture with two encoder networks that extract the features from RGB and depth input and apply element-wise summation to
fuse both features. The decoder network upsamples the feature maps to original input resolution.
the feature from RGB and depth data and applied element-wise Estimating Depth from RGB Data
summation to fuse both features. The significant strength of this
Despite all the advantages of the developed RGB-D data acquisi-
architecture is the fusion layer in the encoder that can merge the
tion and analysis system, there are also some practical challenges
2D color attributes and the 3D spatial information of the pothole.
associated with real-time depth sensing in the outdoor environment.
To obtain information from both encoder branches, a fusion layer was inserted after every CBR layer that added the feature maps in the depth encoder to the feature maps in the RGB encoder.

Network Training
Each network was trained for 30 epochs using the Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba 2014), a learning rate of 0.0001, and a batch size of 2. In addition, the trainable parameters in the encoder part were fine-tuned from a pretrained VGG16 network that was trained on the ImageNet data set (Deng et al. 2009). To better evaluate the performance of each DCNN model, a repeated 5-fold cross-validation was performed (Kim 2009). At each fold, the data set was randomly split into training and validation sets: 20% of all data were used for validation and the remaining data were used for training. The 5-fold cross-validation was repeated twice. For each fold, the validation set was mutually exclusive from the training set. The training took place on a Linux server with Ubuntu 14.04. The server included two Intel Xeon E5-2620 v4 CPUs, 256 gigabytes (GB) of double data rate 4 (DDR4) memory, and four NVIDIA Titan X Pascal GPUs. PyTorch was used to implement the semantic segmentation networks (Paszke et al. 2017).

For the Intel RS-D435 camera, the officially recommended operating temperature range lies between 0°C and 35°C. When the projector temperature of the sensor is lower than 0°C or higher than 35°C, the laser safety mechanism in the firmware driver turns on and the quality of the depth data may decrease. Additionally, some depth sensors use infrared technology, which does not work well under direct bright sunlight. Therefore, this study investigated two approaches that can estimate depth data from RGB data or indirectly extract depth features when depth data are not collected.

Encoder-Decoder Network
A number of techniques have been developed to estimate depth from a monocular RGB image. In this study, an encoder-decoder DCNN with skip connections proposed by Alhashim and Wonka (2018) was applied for depth estimation. The network architecture is shown in Fig. 10. The encoder network is identical to DenseNet-161 (Huang et al. 2017), and the decoder network consists of a successive series of upsampling layers whose output dimension is identical to the input. The depth estimation network was pretrained on the NYU Depth v2 data set (Silberman et al. 2012) and then retrained on the collected pavement surface data through transfer learning.
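The decoder pattern just described — bilinear upsampling with skip connections back to encoder features, ending at the input resolution — can be illustrated with a toy PyTorch sketch. The small strided-conv encoder below is only a stand-in for the paper's DenseNet-161 encoder, and all layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Bilinear upsampling to the skip map's size, then concatenation + conv."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2)
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class TinyDepthNet(nn.Module):
    """Toy monocular depth estimator: strided-conv encoder, upsampling decoder
    with skip connections, single-channel output at the input resolution."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = UpBlock(64, 32, 32)
        self.up2 = UpBlock(32, 16, 16)
        self.head = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d = self.up2(self.up1(e3, e2), e1)
        d = F.interpolate(d, size=x.shape[2:], mode="bilinear", align_corners=False)
        return self.head(d)  # depth map with the same H x W as the input

pred = TinyDepthNet()(torch.randn(1, 3, 480, 640))
print(pred.shape)  # torch.Size([1, 1, 480, 640])
```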
Fig. 11. Sample of estimated depth images: (a) RGB image; (b) corresponding depth data; (c) estimated depth data; (d) SN map obtained from RGB
image and camera parameters; and (e) estimated SN map.
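The SN encoding referenced in Fig. 11 converts a depth map into a three-channel surface-normal image. A minimal NumPy sketch, assuming a pinhole camera model with hypothetical intrinsics (fx, fy, cx, cy): pixels are back-projected to 3D and the two image-axis tangent vectors are crossed to get the normal.

```python
import numpy as np

def surface_normals(depth, fx, fy, cx, cy):
    """Per-pixel surface normals from a depth map and pinhole intrinsics,
    encoded as a 3-channel uint8 image (each component mapped to [0, 255])."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.dstack([x, y, depth])
    # Tangent vectors along the image axes (forward differences).
    du = np.diff(pts, axis=1, append=pts[:, -1:, :])
    dv = np.diff(pts, axis=0, append=pts[-1:, :, :])
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    n = n / np.where(norm == 0, 1, norm)
    return ((n + 1) * 127.5).astype(np.uint8)

# Hypothetical example: a flat plane 1.5 m from the camera, made-up intrinsics.
depth = np.full((480, 640), 1.5, dtype=np.float32)
sn = surface_normals(depth, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
print(sn.shape)  # (480, 640, 3); interior normals point along +z
```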
Fig. 12. Hallucination network architecture with three encoder branches including RGB, hallucination, and depth. The hallucination network is
identical to the depth encoder but trained on RGB input. The decoder part upsamples the feature maps to the original input resolution.
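The hallucination branch in Fig. 12 is trained so that its RGB-driven feature maps mimic the depth branch's feature maps, letting it substitute for missing depth at test time. The toy one-layer encoders below sketch that training signal in the spirit of Hoffman et al. (2016); the actual loss placement and architecture in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_ch):
    """Toy one-layer encoder standing in for a full VGG-style branch."""
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())

depth_enc = make_encoder(1)   # trained on depth input
halluc_enc = make_encoder(3)  # same architecture, but takes RGB input

rgb = torch.randn(2, 3, 64, 64)
depth_map = torch.randn(2, 1, 64, 64)

f_depth = depth_enc(depth_map).detach()  # target features; gradient stops here
f_halluc = halluc_enc(rgb)               # "depth-like" features from RGB only
halluc_loss = F.mse_loss(f_halluc, f_depth)
halluc_loss.backward()                   # updates only the hallucination branch
print(f_halluc.shape)  # torch.Size([2, 16, 64, 64])
```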
Fig. 13. Sample defect with the boundaries and estimated maximum inscribed circle: (a) an example of the maximum inscribed circle and boundary of the semantic segmentation result; (b) another example of the maximum inscribed circle and boundary of the semantic segmentation result; (c) due to the irregular shape of this defect, the diameter of the inscribed circle does not exceed 150 mm, so it cannot be classified as a pothole; and (d) this defect can contain an inscribed circle with a diameter of 550 mm, so it can be classified as a pothole.
These results also confirm that it is more robust to utilize both color and depth data instead of only one. Furthermore, the DCNN trained on fused RGB-SN data has the lowest COV, which means it is the most robust.

Defect Segmentation
A qualitative comparison of the segmentation results among different types of input to the DCNN is displayed in Fig. 19. The first column in the figure displays the RGB image of the input data, and the second column depicts the ground truth. In Figs. 19(c and d), the networks that are trained only on the gray or RGB data do not successfully identify the defective region in shadow or low-light conditions. As seen in Figs. 19(e–h), the fused RGB-SN outperforms the other encoding techniques in terms of segmenting the defective regions.
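The feature-level fusion used by the fused networks above inserts a fusion layer after every CBR block that adds the depth-branch feature maps into the RGB branch (FuseNet-style; Hazirbas et al. 2016). A compact PyTorch sketch of this two-stream encoder; channel widths are illustrative, whereas the paper's encoders are VGG16-based.

```python
import torch
import torch.nn as nn

def cbr(in_ch, out_ch):
    """Convolution + BatchNorm + ReLU, the CBR unit of the encoder."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FusedEncoder(nn.Module):
    """Two-stream encoder with feature-level fusion: after every CBR block the
    depth-branch feature maps are added element-wise into the RGB branch."""
    def __init__(self, rgb_ch=3, depth_ch=3, widths=(64, 128, 256)):
        super().__init__()
        self.rgb_blocks = nn.ModuleList()
        self.depth_blocks = nn.ModuleList()
        c_rgb, c_d = rgb_ch, depth_ch
        for w in widths:
            self.rgb_blocks.append(cbr(c_rgb, w))
            self.depth_blocks.append(cbr(c_d, w))
            c_rgb = c_d = w
        self.pool = nn.MaxPool2d(2)

    def forward(self, rgb, depth):
        for rgb_block, depth_block in zip(self.rgb_blocks, self.depth_blocks):
            depth = depth_block(depth)
            rgb = rgb_block(rgb) + depth          # fusion layer: element-wise sum
            rgb, depth = self.pool(rgb), self.pool(depth)
        return rgb                                # fused features go to the decoder

feat = FusedEncoder()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(feat.shape)  # torch.Size([1, 256, 8, 8])
```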
Qualitative comparisons of the segmentation results using dif-
ferent indirect depth integration techniques are displayed in Fig. 20,
where Fig. 20(a) shows the RGB image and Fig. 20(b) is the
ground-truth segmentation mask. Fig. 20(c) is the output from the
network trained on fused RGB-estimated Depth data, Fig. 20(d)
is the output from the network trained on fused RGB–estimated
SN data, and Fig. 20(e) is the output from the network trained on
fused RGB-HSN data.
Fig. 19. Sample segmentation results from the networks trained on different depth encoding approaches and data fusion methods: (a) RGB image;
(b) ground truth of defective region; (c) output from the network trained on grayscale input data; (d) output from the network trained on RGB data;
(e) output from the network trained on stacked RGB-RD data; (f) output from the network trained on fused gray-RD data; (g) output from the network
trained on fused RGB-RD data; and (h) output from the network trained on fused RGB-SN data.
Fig. 20. Sample segmentation results from the networks trained on different depth encoding approaches and data fusion methods: (a) RGB image;
(b) ground truth of defective region; (c) output from the network trained on fused RGB-estimated D data; (d) output from the network trained on fused
RGB–estimated SN data; and (e) output from the network trained on fused RGB-HSN data.
Table 3. Input data dimension, number of parameters, and average inference time for different depth encoding and data fusion methods on a server equipped with an NVIDIA RTX 6000 GPU and on a Jetson TX2

Input data and fusion method | Input data dimension (pixels) | DCNN parameters | Inference time (s), server | Inference time (s), Jetson TX2 Max-N | Inference time (s), Jetson TX2 Max-Q
Gray | 640 × 480 × 1 | 29,442,443 | 0.035 | 1.160 | 1.692
RGB | 640 × 480 × 3 | 29,443,585 | 0.035 | 1.162 | 1.699
Stacked gray-RD | 640 × 480 × 2 | 29,443,009 | 0.036 | 1.162 | 1.698
Stacked RGB-RD | 640 × 480 × 4 | 29,444,161 | 0.036 | 1.162 | 1.698
Fused gray-RD | 640 × 480 × 1 and 640 × 480 × 1 | 44,164,417 | 0.054 | 1.788 | 2.593
Fused RGB-RD | 640 × 480 × 3 and 640 × 480 × 1 | 44,165,569 | 0.054 | 1.788 | 2.593
Fused RGB-SN | 640 × 480 × 3 and 640 × 480 × 3 | 44,166,721 | 0.055 | 1.795 | 2.617
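Parameter counts like those in Table 3, and the corresponding float32 weight memory, can be obtained in PyTorch with a short reduction over model.parameters(); the single conv layer below is only a hypothetical stand-in for a full DCNN.

```python
import torch.nn as nn

def param_stats(model):
    """Number of trainable parameters and their float32 size in megabytes."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n, n * 4 / 1024 ** 2  # 4 bytes per float32 weight

# Hypothetical stand-in: one 3x3 conv from RGB (3 channels) to 64 feature maps.
n, mb = param_stats(nn.Conv2d(3, 64, 3))
print(n)  # 1792 = 3*3*3*64 weights + 64 biases
```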
The inference time on the Jetson TX2 is much higher than on the server. This is due to the limited computational capacity of the edge computing device. Furthermore, loading the two-stream encoder-decoder DCNN is too cumbersome for the edge computing device. Additionally, the memory requirements for the DCNNs trained on only RGB and on RGB-D data are around 112 and 170 MB, respectively. Considering the trade-off between segmentation accuracy, inference time, and memory usage, it is better to deploy a DCNN that uses only RGB data as input on the edge computing device. However, the depth input significantly improves the performance of defect segmentation.
Fig. 21. Comparison of inference time for one frame on the server equipped with NVIDIA RTX 6000 GPU and Jetson TX2 with Max-N and Max-Q mode for different input data and data fusion methods.

Fig. 22. Comparison of the pothole volumes estimated autonomously by the proposed approach based on the fused RGB-SN DCNN versus the manually labeled data.
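Pothole volume, as compared in Fig. 22, can be approximated by fitting a road-surface plane to the pixels outside the predicted mask and integrating the depth deviation inside it. A NumPy sketch with synthetic data; the plane fit here is ordinary least squares, whereas a RANSAC fit (Fischler and Bolles 1981) would be more robust to outliers.

```python
import numpy as np

def pothole_volume(depth, mask, mm_per_pixel):
    """Approximate pothole volume: fit a plane z = a*u + b*v + c to the
    road pixels outside the defect mask, then integrate the depth deviation
    below that plane over the masked pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    A = np.column_stack([u[~mask], v[~mask], np.ones((~mask).sum())])
    coeffs, *_ = np.linalg.lstsq(A, depth[~mask], rcond=None)
    plane = coeffs[0] * u + coeffs[1] * v + coeffs[2]
    dev = np.clip(depth - plane, 0, None)       # depth beyond the road plane
    return dev[mask].sum() * mm_per_pixel ** 2  # mm^3

# Hypothetical example: flat road 1000 mm away with a 10 mm deep square pothole.
depth = np.full((100, 100), 1000.0)
mask = np.zeros_like(depth, dtype=bool)
mask[40:60, 40:60] = True
depth[mask] += 10.0                             # pothole is farther from the camera
vol = pothole_volume(depth, mask, mm_per_pixel=2.0)
print(vol)  # about 16,000 mm^3 (400 px x 10 mm x 4 mm^2 per px)
```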
The authors would like to acknowledge Zhao Xing Lim, Da Cheng, and Xianmeng Zhang from the Elmore Family School of Electrical and Computer Engineering at Purdue University for their support during the development of the data acquisition system.

References

Akagic, A., E. Buza, and S. Omanovic. 2017. "Pothole detection: An efficient vision based method using RGB color space image segmentation." In Proc., 2017 40th Int. Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 1104–1109. New York: IEEE.
Alhashim, I., and P. Wonka. 2018. "High quality monocular depth estimation via transfer learning." Preprint, submitted December 31, 2018. http://arxiv.org/abs/1812.11941.
ASCE. 2021. "2021 report card for America's infrastructure." Accessed March 1, 2022. https://infrastructurereportcard.org/.
Barbarella, M., M. R. De Blasiis, and M. Fiani. 2019. "Terrestrial laser scanner for the analysis of airport pavement geometry." Int. J. Pavement Eng. 20 (4): 466–480. https://doi.org/10.1080/10298436.2017.1309194.
Botshekan, M., et al. 2020. "Roughness-induced vehicle energy dissipation from crowdsourced smartphone measurements through random vibration theory." Data-Centric Eng. 1 (Dec): e16. https://doi.org/10.1017/dce.2020.17.
Botshekan, M., E. Asaadi, J. Roxon, F.-J. Ulm, M. Tootkaboni, and A. Louhghalam. 2021. "Smartphone-enabled road condition monitoring: From accelerations to road roughness and excess energy dissipation." Proc. R. Soc. A 477 (2246): 20200701. https://doi.org/10.1098/rspa.2020.0701.
Cao, M.-T., Q.-V. Tran, N.-M. Nguyen, and K.-T. Chang. 2020. "Survey on performance of deep learning models for detecting road damages using multiple dashcam image resources." Adv. Eng. Inf. 46 (Oct): 101182. https://doi.org/10.1016/j.aei.2020.101182.
Chang, A., A. Dai, T. Funkhouser, T. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. 2017. "Matterport3D: Learning from RGB-D data in indoor environments." Preprint, submitted September 18, 2017. http://arxiv.org/abs/1709.06158.
Chen, F.-C., and M. R. Jahanshahi. 2017. "NB-CNN: Deep learning-based crack detection using convolutional neural network and naïve Bayes data fusion." IEEE Trans. Ind. Electron. 65 (5): 4392–4400. https://doi.org/10.1109/TIE.2017.2764844.
Chen, X., Q. Dong, H. Zhu, and B. Huang. 2016a. "Development of distress condition index of asphalt pavements using LTPP data through structural equation modeling." Transp. Res. Part C Emerging Technol. 68 (Jul): 58–69. https://doi.org/10.1016/j.trc.2016.03.011.
Chen, Y. L., M. R. Jahanshahi, P. Manjunatha, W. Gan, M. Abdelbarr, S. F. Masri, B. Becerik-Gerber, and J. P. Caffrey. 2016b. "Inexpensive multimodal sensor fusion system for autonomous data acquisition of road surface conditions." IEEE Sens. J. 16 (21): 7731–7743. https://doi.org/10.1109/JSEN.2016.2602871.
Chun, C., and S.-K. Ryu. 2019. "Road surface damage detection using fully convolutional neural networks and semi-supervised learning." Sensors 19 (24): 5501. https://doi.org/10.3390/s19245501.
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. "ImageNet: A large-scale hierarchical image database." In Proc., 2009 IEEE Conf. on Computer Vision and Pattern Recognition, 248–255. New York: IEEE.
Dung, C. V. 2019. "Autonomous concrete crack detection using deep fully convolutional neural network." Autom. Constr. 99 (Mar): 52–58. https://doi.org/10.1016/j.autcon.2018.11.028.
Fischler, M. A., and R. C. Bolles. 1981. "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography." Commun. ACM 24 (6): 381–395. https://doi.org/10.1145/358669.358692.
Girshick, R. 2015. "Fast R-CNN." In Proc., IEEE Int. Conf. on Computer Vision, 1440–1448. New York: IEEE.
Girshick, R., J. Donahue, T. Darrell, and J. Malik. 2014. "Rich feature hierarchies for accurate object detection and semantic segmentation." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 580–587. New York: IEEE.
Gupta, S., R. Girshick, P. Arbeláez, and J. Malik. 2014. "Learning rich features from RGB-D images for object detection and segmentation." In Proc., European Conf. on Computer Vision, 345–360. Cham, Switzerland: Springer.
Hazirbas, C., L. Ma, C. Domokos, and D. Cremers. 2016. "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture." In Proc., Asian Conf. on Computer Vision, 213–228. Cham, Switzerland: Springer.
Hoffman, J., S. Gupta, and T. Darrell. 2016. "Learning with side information through modality hallucination." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 826–834. New York: IEEE.
Huang, G., Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. "Densely connected convolutional networks." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 4700–4708. New York: IEEE.
Ibragimov, E., H.-J. Lee, J.-J. Lee, and N. Kim. 2022. "Automated pavement distress detection using region based convolutional neural networks." Int. J. Pavement Eng. 23 (6): 1981–1992. https://doi.org/10.1080/10298436.2020.1833204.
Jahanshahi, M. R., F. Jazizadeh, S. F. Masri, and B. Becerik-Gerber. 2013. "Unsupervised approach for autonomous pavement-defect detection and quantification using an inexpensive depth sensor." J. Comput. Civ. Eng. 27 (6): 743–754. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000245.
Jo, Y., and S. Ryu. 2015. "Pothole detection system using a black-box camera." Sensors 15 (11): 29316–29331. https://doi.org/10.3390/s151129316.
Kim, J.-H. 2009. "Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap." Comput. Stat. Data Anal. 53 (11): 3735–3745. https://doi.org/10.1016/j.csda.2009.04.009.
Kingma, D. P., and J. Ba. 2014. "Adam: A method for stochastic optimization." Preprint, submitted December 22, 2014. http://arxiv.org/abs/1412.6980.
LeCun, Y., Y. Bengio, and G. Hinton. 2015. "Deep learning." Nature 521 (7553): 436–444. https://doi.org/10.1038/nature14539.
Li, Z., Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. 2016. "LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling." In Proc., European Conf. on Computer Vision, 541–557. Cham, Switzerland: Springer.
Maeda, H., Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata. 2018. "Road damage detection and classification using deep neural networks with smartphone images." Comput.-Aided Civ. Infrastruct. Eng. 33 (12): 1127–1141. https://doi.org/10.1111/mice.12387.
Mahmoudzadeh, A., A. Golroo, M. R. Jahanshahi, and S. Firoozi Yeganeh. 2019. "Estimating pavement roughness by fusing color and depth data obtained from an inexpensive RGB-D sensor." Sensors 19 (7): 1655. https://doi.org/10.3390/s19071655.
Miller, J. S., and W. Y. Bellinger. 2003. Distress identification manual for the long-term pavement performance program. Rep. No. FHWA-RD-03-031. McLean, VA: Federal Highway Administration, Office of Infrastructure.
Paszke, A., S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. "Automatic differentiation in PyTorch." In Proc., 31st Conf. on Neural Information Processing Systems. Red Hook, NY: Curran Associates.
Piryonesi, S. M., and T. El-Diraby. 2021. "Climate change impact on infrastructure: A machine learning solution for predicting pavement condition index." Constr. Build. Mater. 306 (Nov): 124905. https://doi.org/10.1016/j.conbuildmat.2021.124905.
Quigley, M., K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. 2009. "ROS: An open-source robot operating system." In Vol. 3 of Proc., ICRA Workshop on Open Source Software, 5. New York: IEEE.
Ren, S., K. He, R. Girshick, and J. Sun. 2015. "Faster R-CNN: Towards real-time object detection with region proposal networks." IEEE Trans. Pattern Anal. Mach. Intell. 39 (6): 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031.
Silberman, N., D. Hoiem, P. Kohli, and R. Fergus. 2012. "Indoor segmentation and support inference from RGBD images." In Vol. 7576 of Proc., Computer Vision—ECCV 2012. Lecture Notes in Computer Science, edited by A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-642-33715-4_54.
Simonyan, K., and A. Zisserman. 2014. "Very deep convolutional networks for large-scale image recognition." Preprint, submitted September 4, 2014. http://arxiv.org/abs/1409.1556.
Song, S., S. P. Lichtenberg, and J. Xiao. 2015. "SUN RGB-D: A RGB-D scene understanding benchmark suite." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 567–576. New York: IEEE.
TRIP. 2015. "The interstate highway system turns 60: Challenges to its ability to continue to save lives, time and money." Accessed March 1, 2022. https://infrastructureusa.org/the-interstate-highway-system-turns-60-challenges-to-its-ability-to-continue-to-save-lives-time-and-money/.
Wang, A., J. Cai, J. Lu, and T.-J. Cham. 2015. "MMSS: Multi-modal sharable and specific feature learning for RGB-D object recognition." In Proc., IEEE Int. Conf. on Computer Vision, 1125–1133. New York: IEEE.
Wu, R.-T., A. Singla, M. R. Jahanshahi, E. Bertino, B. J. Ko, and D. Verma. 2019. "Pruning deep convolutional neural networks for efficient edge computing in condition assessment of infrastructures." Comput.-Aided Civ. Infrastruct. Eng. 34 (9): 774–789. https://doi.org/10.1111/mice.12449.
Zhang, A., K. C. Wang, B. Li, E. Yang, X. Dai, Y. Peng, Y. Fei, Y. Liu, J. Q. Li, and C. Chen. 2017. "Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network." Comput.-Aided Civ. Infrastruct. Eng. 32 (10): 805–819. https://doi.org/10.1111/mice.12297.
Zhang, C., and A. Elaksher. 2012. "An unmanned aerial vehicle-based imaging system for 3D measurement of unpaved road surface distresses." Comput.-Aided Civ. Infrastruct. Eng. 27 (2): 118–129. https://doi.org/10.1111/j.1467-8667.2011.00727.x.
Zhang, D., Q. Zou, H. Lin, X. Xu, L. He, R. Gui, and Q. Li. 2018. "Automatic pavement defect detection using 3D laser profiling technology." Autom. Constr. 96 (Dec): 350–365. https://doi.org/10.1016/j.autcon.2018.09.019.
Zhang, L., F. Yang, Y. D. Zhang, and Y. J. Zhu. 2016. "Road crack detection using deep convolutional neural network." In Proc., 2016 IEEE Int. Conf. on Image Processing (ICIP), 3708–3712. New York: IEEE.
Zhou, S., and W. Song. 2021. "Crack segmentation through deep convolutional neural networks and heterogeneous image fusion." Autom. Constr. 125 (May): 103605. https://doi.org/10.1016/j.autcon.2021.103605.