
Automation in Construction 89 (2018) 58–70


Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images

Zdenek Kolar, Hainan Chen, Xiaowei Luo
Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong

A R T I C L E  I N F O

Keywords:
Computer vision
Construction safety
Guardrail detection
Convolutional neural networks
Transfer learning
VGG-16

A B S T R A C T

Safety has been a concern for the construction industry for decades. Unsafe conditions and behaviors are considered the major causes of construction accidents. The current safety inspection of conditions and behaviors relies heavily on human effort, which is limited onsite. To improve the safety performance of the industry, a more efficient approach to identifying unsafe conditions on site is required to supplement the current manual inspection practice. A promising way to supplement the current manual safety inspection is automated and intelligent monitoring/inspection through information and sensing technologies, including localization techniques, environment monitoring, image processing, etc. To assess the potential benefits of contemporary technologies for onsite safety inspection, the authors focused on real-time guardrail detection, as unprotected edges are a major cause of workers falling from heights.

In this paper, the authors developed a safety guardrail detection model based on convolutional neural networks (CNN). An augmented data set was generated by adding background images to guardrail 3D models and used as the training set. Transfer learning is utilized, and the Visual Geometry Group architecture with 16 layers (VGG-16) is adopted as the basic feature extractor for the neural network. In the CNN implementation, 4000 augmented images were used to train the proposed model, while another 2000 images collected from real construction jobsites and 2000 images from Google were used to validate the proposed model. The proposed CNN-based guardrail detection model obtained a high accuracy of 96.5%. In addition, this study indicates that synthetic images generated by augmentation technology can be used to create a large training dataset, and that CNN-based image detection algorithms are a promising approach for construction jobsite safety monitoring.

1. Introduction

As reported by the Occupational Safety and Health Administration (OSHA), approximately 40% of all fatalities at construction sites are caused by falls from heights, followed by struck-by-object, electrocution, and caught-in/between accidents [1]. More than 4500 US workers (3.4 per 100,000 full-time workers, more than 13 deaths every day) died on the job in 2015 [1]. According to OSHA's statistics, "Fall protection, construction" is at the top of the list of the most frequently violated OSHA standards [2]. Reducing accidents caused by falling from height can significantly improve the safety performance of the construction industry. To reduce those accidents, both proactive and passive approaches have been proposed for implementation on construction jobsites.

Construction safety planning and training are considered a proactive approach to improving workers' safety awareness and reducing the falling risk [3–5]. Although proper planning can reduce the falling risks on site, it cannot eliminate them. Due to workers' absent-mindedness or overconfidence, workers are still exposed to a high risk of falling. Passive fall prevention measures, such as guardrails, warning lines and fall arrest systems [6,7], act as the onsite approach for reducing the falling risk. Guardrail systems are used on many work surfaces (rooftops, scaffolds, platforms, etc.) to prevent workers from falling onto a lower level. There have been accidents in which the required guardrail systems, or parts of them, were missing [8]. In current practice, guardrail system inspection is the responsibility of the safety officers on site, who have to manually check whether the guardrail system is in place and complete. However, not all missing safety guardrail situations can be found in time, because the limited number of safety officers, who also carry other workloads, cannot cover all the areas of the jobsite at every moment. The advancement and pervasiveness of information technology enable the inspection work to be done automatically in real time [9–12]. In a real-time safety monitoring system, once the guardrail system is missing where it is required, an alert can be sent to the responsible person immediately.


⁎ Corresponding author.
E-mail address: xiaowluo@cityu.edu.hk (X. Luo).

https://doi.org/10.1016/j.autcon.2018.01.003
Received 12 May 2017; Received in revised form 2 January 2018; Accepted 9 January 2018
0926-5805/ © 2018 Elsevier B.V. All rights reserved.

To accomplish automatic checking of whether the guardrail system is set up appropriately, the first step is to detect the existence of the guardrail system correctly. Traditionally, feature-based computer vision techniques are used to process videos/images for object detection and classification [13,14]. For each individual task, a precise feature model has to be established first, and this model development relies heavily on complex statistical, mathematical and image processing theories. In addition, those traditional image processing techniques impose strict constraints on the inputs of the model, including the sizes, pixel-perfect precision, and channels of the images [15]. Therefore, the flexibility of those technologies is greatly limited. Different from the conventional image processing techniques, the deep neural network is an end-to-end process instead of a step-by-step one. The feature extraction is accomplished through neural network model training. Since the neural network is a black-box algorithm, the features can be extracted without any prior knowledge. Using CNNs with different convolutional kernels and different numbers of convolutional layers, the features of one image can be extracted at different precision levels and into multi-dimensional spaces. Therefore, once the CNN has enough kernels and layers, the whole feature space of one image can be obtained through model training without external interaction. The shortcoming of the deep neural network is that training a deep neural network model is time-consuming and computationally intensive, which means longer time and more expensive hardware are required. As the deep neural network is not sensitive to the features, models trained for other special purposes can be reused or partially integrated into a new model with the help of transfer learning. In this way, the time and hardware cost is significantly reduced, making the deep neural network more practical. In this paper, the authors propose a CNN-based guardrail detection model that integrates the core part of the VGG-16 model with a multi-layer perceptron (MLP) network, and validate it using image data collected from jobsites and the internet. The remainder of the paper is organized as follows: Section 2 summarizes related research in the computer vision and construction sensing domains; Section 3 describes the CNN and transfer learning techniques used in the study; Section 4 explains the validation datasets and results; the results are discussed in Section 5; finally, conclusions are drawn in Section 6.

2. Literature review

In this section, the authors first review the algorithms commonly used for computer vision in the construction-related area. To address the limitations of the conventional computer vision methods used in construction, the techniques for computer vision-based object detection in the computer science discipline are then reviewed. Based on the comparison of available techniques in computer vision-based object detection, CNN is considered a promising approach to address the mentioned limitations. Therefore, CNN is introduced in the third subsection.

2.1. Application of computer vision in construction

Computer vision techniques have been used in construction-related research for object detection. The studies are focused on the detection of construction workers, site machinery and progress tracking in construction [16–20]. Teizer [21] describes the status quo and challenges of computer vision in construction. Of the computer vision techniques used in construction, the histogram of oriented gradients (HOG) is one of the most widely used. Park et al. [16] use HOG and the histogram of HSV (Hue, Saturation, Value) colors as an input for a k-nearest neighbors (KNN) classifier. HOG, Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) were used for action recognition of construction workers [22]. Besides HOG, Haar Cascade [23] is another popular technique used in construction. Du et al. [24] use an approach based on Haar Cascade to detect the hard hats of workers on construction sites. Kim et al. [25] used a combination of KNN and Scale Invariant Feature Transform (SIFT) algorithms to parse a complete image from a construction site. It is quite straightforward that the extracted features are used to code one image, and then classification and clustering are conducted for image labeling. However, for traditional computer vision technologies, the features are extracted by predefined, special-purpose optimized models. Those models can only be manually developed when high-dimensional features are required. Therefore, when multi-feature models are considered simultaneously, the conventional computer vision methods (HOG, HOF, Motion Boundary Histogram (MBH), etc.) lose their advantages or may even fail to perform the designed tasks.

2.2. Techniques in computer vision-based object detection

The computer vision techniques in construction-related research derive from the computer vision disciplines. In computer vision, there have been two main approaches to object detection and classification. The first one aims to determine the correct class of a given sample based on a set of features (often handcrafted) that are specific to the given class. An example of an algorithm meeting this need is SIFT, proposed by Lowe [26]. This approach can be extended with visual bag-of-words, which aims to determine the class of a sample based on the frequencies of individual features in the image [27]. In the traditional computer vision process, object detection and classification are still subject to feature extraction. Without an appropriate feature model, it is difficult to implement accurate object detection and classification.

The second approach is designed to automatically extract a large number of image features and then use these features for image classification. Examples include CNNs, HOGs, etc. The advantage of those algorithms is their ability to self-learn from a given dataset. Since the deep neural network is an end-to-end process, it is not necessary to conduct feature extraction in advance. Along with the advancement in computer hardware and software, the deep neural network is considered the most powerful technology for processing images and solving computer vision-related problems.

With an established object detection system, visual sensors can play a useful and important role in various management tasks. Veres et al. [28] proposed a robust approach for workflow classification in industrial environments employing computer vision processing. Zhu et al. [29] investigated workforce and equipment detection and tracking on construction jobsites with video processing technology. Park et al. [30] employed computer vision-based object detection to detect and track construction workers' positions on construction sites for safety and productivity monitoring. Along with the rapid development of image processing technology, computer vision-based object detection is considered a very valuable approach for construction management.

2.3. Convolutional neural networks

Images, seen as matrices, contain a great number of values. A small image, used for example for digit classification, may consist of 28 × 28 pixels in grayscale mode. That is 784 features, each of which may take a value from the interval between 0 and 255. The size of modern-era images may, however, be far greater than that. Due to the number of features and the loss of the spatial relationship between pixels, artificial neural networks (ANNs) are not the ideal solution for image classification [31].

In order to overcome the disadvantages of standard ANNs for image classification, LeCun et al. [31] developed a technique for digit recognition for the US Post, based on CNNs. For training and validation, the authors used the Modified National Institute of Standards and Technology (MNIST) dataset [32].

The limitation of CNNs at that time was that, due to the limited computational power available, it was possible to process only small images (the resolution of the MNIST dataset is only 28 × 28 pixels). Thanks to


the continuous advancement of high-performance Graphical Processing Units (GPUs) and the availability of large datasets, these techniques have been studied extensively [33–37]. In order to monitor and increase the progress of CNN development, the Stanford Vision Lab organizes the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) every year. The contestants train their algorithms on a database called ImageNet [38]. The accuracy increased significantly (a step from approx. 75% to 85%) in 2012 with the deployment of the first CNN. Since then, a number of networks have been proposed (e.g. GoogLeNet, AlexNet, VGG-16).

Convolutional neural networks have found their way into the construction industry. Particularly, Siddula et al. [39] proposed a method for roof detection based on a Gaussian mixture model and visual features (e.g. color, texture, contrast). The authors developed an algorithm for image segmentation, labeling every pixel with a class. This solution uses CNNs.

There are many datasets for research in object recognition, such as ImageNet [38], Caltech 101 [40], PASCAL Visual Object Classes (VOC) [41], etc. However, scientists in the field of civil engineering have pointed out that there is no publicly available dataset and that they often have to collect images before they commence research [22]. This is tedious, and it also holds back progress in the construction field, as developers only collect data for a limited number of classes [25]. To tackle the problem of tasks requiring a large number of samples, such as head pose estimation, research in computer vision proposed a method that uses rendered images combined with real background images to train CNNs [42]. The authors proposed a method to generate large numbers of images from the 3D model of a human head. A similar method has been suggested in the construction industry by Soltani et al. [43]. However, this method has not been implemented and validated in the construction industry.
3. Transfer learning based guardrail detection model development

The proposed guardrail detection model is composed of two main parts: the core part of a deep neural network (VGG-16), and an MLP. Training a full-scale CNN is a long process, in which a large training data set and intensive computing resources are required. Therefore, it is impractical to conduct full-scale training for each individual goal. Thus, researchers and professionals introduced transfer learning [44] to store the knowledge gained from solving one problem and apply it to a different but related problem. In this study, the authors use the core part of a deep CNN called VGG-16 [45] to transfer the image feature knowledge stored in the VGG-16 model to the guardrail detection model. Then the MLP model is trained to process the output of the core VGG-16 based object detection. The framework of the proposed guardrail detection model is shown in Fig. 1.

Fig. 1. The overall framework of the proposed guardrail detection model.

3.1. Data augmentation

During the training phase of the model, only synthetic images were used. First, the authors developed a 3D model of the metal guardrail commonly seen on construction jobsites in Autodesk Revit. The dimensions of the guardrail are 1.8 m in length, 1.1 m in height, and 0.03 m in diameter for the rails. Noise (reflected in the model parameters) was added to simulate different guardrail specifications. In addition, as guardrails in the real world can take any color, the model was programmed to change color at random to generate guardrail 3D models with different colors. To create the images of the guardrail system for the training dataset from different viewpoints, the authors varied the camera's (focal length of 50 mm) height (0.5–2.0 m), distance from the guardrail (4–8 m), and angle to the guardrail (0 to 360°). Because a safety guardrail can only be reliably recognized when the geometry is represented as a whole, images showing the guardrail as only one post (with the rest of the geometry behind it) were excluded from the training dataset.

Background images from construction environments (obtained from Google) were selected and added to the images taken by the virtual camera, producing the synthetic images of the guardrail system. To improve the classifier's generalization capabilities, data augmentation with various transformation techniques (e.g. rotation, stretching, flipping; see Table 1 for details, and the sketch after the table) was applied to each image with a guardrail system, so that 36 augmented images (samples are shown in Fig. 2) could be generated per image. In this way, 2000 augmented images were generated.

The final training dataset of 4000 images (half positive samples and half negative samples) was fed into the proposed model for training. At the beginning of training, cross-validation on a sample of synthetic images (667 images) was conducted in order to determine the performance of the models and the influence of their parameters. The model parameters were then selected.

Table 1
Applied transformations.

Transformation    Interval
Rotation          −15 to 15°
Width shift       −15 to 15%
Height shift      −15 to 15%
Shear shift       −20 to 20%
Channel shift     −20 to 20%
Horizontal flip   –
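To make the augmentation step concrete, the following sketch shows how the transformations of Table 1 could be reproduced with Keras' ImageDataGenerator. This is an illustration rather than the authors' implementation; in particular, the mapping of the paper's percentage intervals onto Keras arguments (e.g. expressing the channel shift in pixel units) is an assumption, and the placeholder array stands in for the composited synthetic images.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,               # rotation: -15 to 15 degrees
    width_shift_range=0.15,          # width shift: -15 to 15%
    height_shift_range=0.15,         # height shift: -15 to 15%
    shear_range=0.20,                # shear shift (unit mapping assumed)
    channel_shift_range=0.20 * 255,  # channel shift, in pixel units here
    horizontal_flip=True)            # random horizontal flip

# Placeholder for the composited synthetic guardrail images, (N, 300, 300, 3).
synthetic_images = np.random.randint(0, 256, (10, 300, 300, 3)).astype('float32')

# Draw 36 augmented variants per source image, as described above.
variants = [next(augmenter.flow(img[np.newaxis], batch_size=1))[0]
            for img in synthetic_images
            for _ in range(36)]
```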
3.2. Transfer learning

Convolutional neural networks can be designed from scratch and subsequently trained on various datasets to achieve optimal performance. This approach requires a large amount of time, even when sufficient hardware resources are available. Therefore, this method is not implemented in this paper. Instead, the authors exploit the idea of transfer learning, which aims to reuse the part of a trained network that recognizes the coarse-to-medium features of an image and to add classifiers that classify the fine details of guardrails.

There are a few ways to use an existing, trained CNN for image classification. The methods differ mainly in their complexity and the time needed for training.


Fig. 2. Synthetic images of guardrail system in the training dataset.

Transfer learning uses a machine learning algorithm (e.g. a CNN) as an extractor of features, which are then fed into another classifier. Researchers may reuse an existing CNN (in this case VGG-16) and apply a different classifier that uses the features extracted from the last layer before the fully-connected layers. The reason for using VGG-16 is that it has been trained on the ImageNet database, which contains 1000 classes. That means the convolutional layers have the ability to generalize well enough to recognize the coarse features of many different objects varying in shape, color, texture, etc. Therefore, VGG-16 is likely to perform well as a feature extractor for safety guardrails.

To implement the transfer learning, the lower 19 neural network layers remain unchanged at first, and the values of the associated weights are also kept. There are 14,714,688 weight values, which can be used to represent the feature spaces of one image. Since the VGG-16 network has been validated on the open ImageNet database, the lower part of the VGG-16 network can be considered able to represent the whole feature space of objects for detection. After the transfer procedure, the lower part of the VGG-16 network is used as the foundation of the whole model, and an MLP is then constructed and integrated with the lower part of the VGG-16 model as the head of the whole model. Finally, during the model training process, the weights of the lower part of the model are frozen, and only the weight parameters of the head part of the model are adjusted through training for the specific problem. Fig. 3 illustrates the transfer learning based retraining of the head part of the model.
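A minimal sketch of this retraining setup in Keras is given below, assuming a 300 × 300 RGB input; the layer sizes of the MLP head are illustrative, not the authors' exact configuration. The frozen convolutional base carries 14,714,688 weights, matching the count quoted above.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Lower part: VGG-16 convolutional layers pre-trained on ImageNet.
base = VGG16(weights='imagenet', include_top=False, input_shape=(300, 300, 3))
base.trainable = False   # freeze the transferred weights (14,714,688 values)

# Head part: an MLP retrained for guardrail / no-guardrail classification.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),   # illustrative head size
    layers.Dense(1, activation='sigmoid'),  # binary output with sigmoid activation
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_images, train_labels, ...)  # only the head weights are updated
```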
3.3. The architecture of VGG-16

Similar to a standard ANN, the architecture of a CNN (Fig. 4) includes an input layer, a series of convolutional layers and an output layer. Each individual layer is explained below.

• Convolutional layer

Convolution is an operation in which a filter (e.g. a matrix of weights) is applied to an area of an image and the output is usually a number. These numbers are then stacked on top of each other to create a volume of information.

S(i, j) = (I ∗ K)(i, j) = \sum_m \sum_n I(m, n) K(i − m, j − n)   (1)

where i, j are the coordinates of the pixel, I is the discrete function of the image and K is the discrete function of the filter (kernel).

✓ Stride

Stride is a parameter of the convolutional layer. It is used to decrease the size of the 3D information volume: the algorithm only processes every N-th row and column. If the stride equals 1, the filter moves one pixel at a time. If smaller output volumes are desired, a stride of 2 might be chosen, which takes into account only every second row and column.

✓ Padding

When convolution is performed near the edges of an image, there is missing information outside the image. For the algorithm to process this information, it needs to add either black pixels (0,0,0) or some form of padding (e.g. copying or mirroring the information along the edges).

✓ Activation function

In order to avoid the vanishing gradient problem of the standard logistic and hyperbolic tangent functions, VGG-16 uses rectified linear units (ReLU):

f(x) = max(0, x)   (2)

For the activation of the output layer, the sigmoid function in Eq. (3) is utilized:

f(x) = 1 / (1 + e^{−cx})   (3)

where x is the input to the neuron and c is a parameter of the function.


Fig. 3. Transfer learning based MLP retraining.

• Pooling layer

Pooling is an operation in which the x and y dimensions of the 3D volume are decreased. Pooling layers remove every N-th row and column of a 3D volume. VGG-16 utilizes the MaxPooling algorithm.

• Fully connected layers

At the end of the network, the 3D volume of neurons is converted to fully connected layers, which determine the class of the image in the output layer. The activation function of the output layer is the sigmoid:

f(x) = 1 / (1 + e^{−cx})   (4)

where x is the input to the neuron and c is a parameter of the function.

The proposed model is based on a feed-forward neural network and uses backpropagation to minimize the following cost function:

J(θ) = −(1/m) \sum_{i=1}^{m} \sum_{k=1}^{K} [ y_k^{(i)} log(h_θ(x^{(i)}))_k + (1 − y_k^{(i)}) log(1 − (h_θ(x^{(i)}))_k) ]   (5)
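As an illustration of the layer operations above, the following NumPy sketch implements the windowed sum of Eq. (1) with stride and zero padding, followed by the ReLU of Eq. (2). Like most CNN libraries, it slides the unflipped kernel (cross-correlation); it mirrors the concepts only, not VGG-16's actual implementation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """2D convolution of a grayscale image with one filter (kernel)."""
    img = np.pad(image, pad, mode='constant')     # zero padding at the edges
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1        # output height
    ow = (img.shape[1] - kw) // stride + 1        # output width
    out = np.zeros((oh, ow))
    for i in range(oh):                           # slide the filter over the image
        for j in range(ow):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)    # S(i, j) as in Eq. (1)
    return out

relu = lambda x: np.maximum(0, x)                 # ReLU, Eq. (2)

image = np.random.rand(8, 8)                      # toy grayscale input
edge_kernel = np.array([[1., 0., -1.]] * 3)       # simple vertical-edge filter
feature_map = relu(conv2d(image, edge_kernel, stride=2, pad=1))
```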
3.4. Model optimization

Overfitting occurs when the algorithm achieves significantly higher accuracy on the training data than on the validation dataset. The algorithm fails to generalize and trains to the noise in the training images. It is often accompanied by extreme values of the weights in the network. To mitigate the effect of overfitting, the authors used the two techniques described below. The argument for using techniques to avoid overfitting is even stronger when there is a limited amount of training data, which happens to be our case.

Fig. 4. VGG-16 architecture.


• Dropout

Dropout is a technique described by Hinton et al. [46] and subsequently by Srivastava et al. (2014). To reduce overfitting, neurons fire with a defined probability (p) during the training phase, i.e. a neuron either sends a signal or remains silent. The following equations illustrate the principle:

training: P(x) = 1 − p   (7)
testing: P(x) = 1   (8)

where P(x) denotes the probability of a node firing a signal and p is the probability that the node remains silent.
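A short NumPy illustration of the principle in Eqs. (7) and (8) follows; the layer size is arbitrary, and the rescaling by 1/(1 − p) ("inverted dropout") is one common way to keep the expected activation unchanged, an implementation detail not spelled out in the text.

```python
import numpy as np

p = 0.2                                    # probability of staying silent
activations = np.random.rand(1, 256)       # hypothetical layer output

# Training: random binary mask, each neuron fires with P(x) = 1 - p (Eq. (7)).
mask = (np.random.rand(*activations.shape) >= p).astype(float)
train_out = activations * mask / (1 - p)   # rescale to preserve the expectation

# Testing: every neuron fires, P(x) = 1 (Eq. (8)).
test_out = activations
```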
• L2 regularization

With greater weights, there is a higher risk of overfitting. Another way to tackle this problem is L2 regularization. A regularization term is added to the cost function, which essentially penalizes the model for large weights:

R = (λ / 2m) \sum_{l=1}^{L−1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (θ_{ji}^{(l)})^2   (9)

where θ are the weights in the neural network.
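In Keras, the two techniques might be attached to the MLP head as follows. The dropout rate of 0.2 and the L2 factor of 0.01 anticipate the best values found by the grid search in Section 4.2.1; the input dimensionality of 4096 and the hidden layer size are assumptions.

```python
from tensorflow.keras import layers, models, regularizers

mlp_head = models.Sequential([
    layers.Input(shape=(4096,)),                             # assumed feature size
    layers.Dense(256, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 term of Eq. (9)
    layers.Dropout(0.2),                                     # dropout, Eqs. (7)-(8)
    layers.Dense(1, activation='sigmoid'),
])
```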
3.5. Performance evaluation

To evaluate the performance of the proposed guardrail detection model, a support vector machine (SVM) is used as the reference for the evaluation. Since the primary shortcoming of the traditional computer vision process is the limitation of feature extraction, the SVM is integrated with the core part of the VGG-16 network, and the VGG-16 network is used to obtain the image features automatically. The key idea of SVM is to map low-dimensional data into a high-dimensional space by kernel functions, so that the images can be classified into different clusters in the high-dimensional space, which is considered to yield more accurate classification results. Therefore, the SVM can be used as a representative of traditional computer vision technology. By comparison with the SVM, the performance of the proposed guardrail detection model is evaluated. The workflow of the evaluation is presented in Fig. 5. The two compared classifiers are described below (a code sketch of the SVM baseline follows the list).

• SVM

First, an SVM classifier is used on the features extracted from the first fully connected layer. The dimensionality of the feature space is 4096. For the training sample set, the authors generated the features in a batch, added the correct classes, and subsequently trained the SVM classifier. Two kernels were used, namely the linear kernel and the radial basis function (RBF) kernel. In order to find the optimal soft margin parameter C, several values on the interval between 0.001 and 100 were tested.

J(θ) = C \sum_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) \sum_{j=1}^{n} θ_j^2   (6)

• Retrained fully connected layers

Secondly, similarly to the previous paragraph, the authors used the first part of the CNN (up to, and including, convolutional block 5), and instead of adding a new classifier they aimed to reconstruct the last part of the network with the same fully connected layers as the original. The new layers had random, non-zero weights and were trained with the images, while keeping the rest of the VGG-16 net unchanged.
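A hedged sketch of the SVM baseline with scikit-learn is shown below; the feature arrays are placeholders standing in for the 4096-dimensional vectors extracted from VGG-16, and the C grid follows the interval reported in this section.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(200, 4096)     # placeholder extracted features
y_train = np.random.randint(0, 2, 200)  # 1 = guardrail, 0 = no guardrail

# Soft-margin parameter C searched on a log grid, as described above.
grid = GridSearchCV(
    SVC(kernel='linear'),
    {'C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]},
    cv=3, scoring='accuracy')
grid.fit(X_train, y_train)
best_svm = grid.best_estimator_
# The RBF kernel can be compared by passing SVC(kernel='rbf') instead.
```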
4. Model validation

4.1. Validation set

To validate the proposed algorithm's performance in classifying real-life images, the authors obtained a dataset consisting of 1000 images (samples are shown in Fig. 6) of temporary construction site fences. Images were rescaled to match the rest of the samples, i.e. 300 × 300 pixels. It can be noted that the guardrails in these 1000 images have shapes similar to those of the training set shown in Fig. 2. To validate the robustness of the proposed algorithm, images of guardrails with different appearances (samples are shown in Fig. 7) are used as a second validation dataset, to check whether the proposed model can detect the existence of guardrails of different appearances. These images were collected from Google to cover the guardrail systems used on jobsites in different countries and districts. The ratio and size of the training and validation sets are given in Table 2.

Fig. 5. The workflow of the performance validation.


Fig. 6. Images of a single guardrail type used for validation.

4.2. Results

Accuracy and F1 score were chosen as the evaluation criteria for the models and their parameters. Several other metrics are provided for information: precision measures the ability of the model to avoid false positives, recall controls for false negatives, and the area under the curve (AUC) describes the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative sample.

accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)   (10)

Fig. 7. Images of multiple guardrail types used for validation.


Table 2
Training and validation data set description.

Data set type   Data set size
Training        (a) 1000 negative samples
                (b) 1000 positive samples (real scene images composited with the 3D model of one guardrail type)
Validation      Data set 1: data set with a single guardrail type
                (a) 1000 negative samples
                (b) 1000 positive samples (images with the same guardrail type in the real scene)
                Data set 2: data set with multiple guardrail types
                (a) 1000 negative samples
                (b) 1000 positive samples (images with multiple guardrail types in the real scene)

precision = Tp / (Tp + Fp)   (11)

recall = Tp / (Tp + Fn)   (12)

F1 score = 2 · (precision · recall) / (precision + recall)   (13)

AUC = \int_{−∞}^{∞} TPR(T) FPR′(T) dT   (14)

where Tp is true positives, Fp is false positives, Tn is true negatives, and Fn is false negatives. Furthermore, TPR = Tp / (Tp + Fn) and FPR = Fp / (Fp + Tn).
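For reference, the metrics of Eqs. (10)–(14) can be computed directly with scikit-learn, as in the following sketch (the label and score arrays are placeholders):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = np.random.randint(0, 2, 1000)       # placeholder ground-truth labels
y_score = np.random.rand(1000)               # sigmoid outputs in [0, 1]
y_pred = (y_score >= 0.5).astype(int)        # thresholded predictions

accuracy = accuracy_score(y_true, y_pred)    # Eq. (10)
precision = precision_score(y_true, y_pred)  # Eq. (11)
recall = recall_score(y_true, y_pred)        # Eq. (12)
f1 = f1_score(y_true, y_pred)                # Eq. (13)
auc = roc_auc_score(y_true, y_score)         # Eq. (14), area under the ROC curve
```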
the RBF kernel. It increased from relative 0.8 to 1.0. While for RBF
4.2.1. MLP parameters evaluation
Accuracy, AUC and F1 values are calculated to evaluate the guardrail detection performance. To consider all aspects of the performance, the mean value of these three indicators (I) is used to determine the best parameters for the head part of the proposed model, since Accuracy, AUC and F1 are all positively related to the performance. The detailed Accuracy, AUC and F1 values are presented in the Appendix tables.

I = Mean(Accuracy, AUC, F1)

Fig. 8 shows the effects of the dropout values and L2 regularization parameters on the I values. Apparently, the L2 parameter with value 0.01 achieves the best performance, while the effects of the dropout values fluctuate. Through grid search, the best dropout value is 0.2 and the best L2 parameter is 0.01. Correspondingly, the Accuracy is 97%, the AUC is 99%, and the F1 value is 97%.

Fig. 8. Dropout values and L2 regularization parameters evaluation for MLP training.

Fig. 9. Parameters evaluation for different kernels of SVM training.

4.2.2. Support vector machine parameters evaluation
Kernels were tested with values between 0.01 and 1000 for the penalty parameter C of the error term. The training time for the linear kernel was in general shorter than for the RBF kernel: the linear kernel took 15–30 s while the RBF kernel took 90–300 s. Fig. 9 shows the effects of the different kernels and their parameters on the detection performance (I). The figure indicates that the linear kernel can perform much better than the RBF kernel: its I value increases from about 0.8 to 1.0, while for the RBF kernel the I values start from about 0.65 and rise to 1.0. With increasing C values, the performance remains stable. The detailed performance evaluation values are attached in the Appendix tables. Through grid search, the best C value for linear kernel training is 3.0, and the corresponding Accuracy is 97%, AUC is 99%, and F1 value is 97%.

As described in Section 3.5, to validate the performance of the proposed MLP guardrail detection model and the SVM based detection model, both are tested with two data sets: (1) Dataset 1, with only one type of guardrail, which has 2000 images in total, of which 1000 contain guardrails; and (2) Dataset 2, with guardrails of different types, which has 3422 images, of which 2422 contain different types of guardrails. 100 random tests are conducted, and for each test, 500 positive images (labeled as containing guardrails) and 500 negative images (labeled as containing no guardrails) are used. Fig. 10 illustrates the performance comparison of the proposed MLP model and the SVM model for guardrail detection.
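A sketch of how the hyper-parameter grid searches described above might be organized is shown below; build_and_evaluate is a hypothetical stand-in for training the MLP head with a given (dropout, L2) pair and scoring it on the validation split, and the candidate values follow the Appendix B tables.

```python
from itertools import product
import numpy as np

def build_and_evaluate(dropout, l2):
    """Hypothetical stand-in: train the MLP head with this (dropout, L2)
    pair and return (accuracy, auc, f1) measured on the validation split."""
    return np.random.rand(3)  # replace with real training and evaluation

dropout_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # X axis
l2_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]                # Y axis

best_I, best_params = -np.inf, None
for dropout, l2 in product(dropout_values, l2_values):
    acc, auc, f1 = build_and_evaluate(dropout, l2)
    I = np.mean([acc, auc, f1])          # combined indicator, Section 4.2.1
    if I > best_I:
        best_I, best_params = I, (dropout, l2)
# The paper reports the optimum at dropout = 0.2 and L2 = 0.01.
```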

Fig. 10. Detection accuracy comparison between MLP and SVM for the single guardrail type and multiple guardrail types tests.


Table 3
Performance comparison of SVM and retrained MLP with a single guardrail type.

                              Accuracy   F1 score   Precision   Recall
SVM            Upper bound    81.1%      78.5%      92.5%       69.0%
               Mean           78.3%      74.5%      90.1%       63.5%
               Lower bound    75.9%      71.2%      87.5%       59.6%
Retrained MLP  Upper bound    98.0%      98.0%      97.2%       99.2%
               Mean           97.0%      97.1%      96.0%       98.2%
               Lower bound    96.5%      96.5%      95.0%       97.4%

Table 4
Performance comparison of SVM and retrained MLP with multiple guardrail types.

                              Accuracy   F1 score   Precision   Recall
SVM            Upper bound    75.3%      69.7%      91.0%       57.0%
               Mean           72.2%      64.9%      87.9%       51.5%
               Lower bound    69.1%      60.3%      84.2%       46.8%
Retrained MLP  Upper bound    88.1%      87.0%      97.0%       80.6%
               Mean           86.0%      84.5%      94.9%       76.1%
               Lower bound    83.8%      81.6%      93.4%       72.0%
The results show that when images of only one guardrail type are used for testing, the accuracy of the proposed detection model is higher (by 6–10%) than with images of multiple guardrail types. This is primarily because only one type of 3D guardrail model was used to train the model. On the other hand, the results show that the MLP model is much more stable than the SVM based detection. For both dataset 1 and dataset 2, compared with the SVM-based detection model, the MLP model has higher detection accuracy and performs better on all the evaluation metrics, including precision, recall, and F1 values. When different types of guardrails are used for validation, the MLP model still achieves a detection accuracy of 86.0%, while the SVM only reaches an accuracy of 72.2%. Tables 3 and 4 show the performance of the proposed MLP model and the SVM based guardrail detection.

5. Discussion

Overall, the algorithms achieve high accuracy rates. However, the recall is lower than the precision, and the precision can be over 90% for both models. The F1 scores are close to the accuracy. The AUC confirms that the algorithm performs well for both datasets.

False negatives (Fig. 11) are mainly caused by images taken in poor lighting conditions, which is also one of the limits of methods based on visual data. To overcome this, RGB-D (Red Green Blue and Depth) cameras might be used for image capturing; however, their effective range is only approximately 3 m.

Examples of false positives (Fig. 12) suggest that the algorithm has some difficulty distinguishing regular structures (such as rebar or formwork) from guardrails. One way to address this caveat could be to use more training samples containing metal, linear objects such as scaffolding, rebar, formwork, etc.

6. Conclusion

A study was conducted to ascertain whether contemporary computer vision techniques can be used to detect construction guardrails. Two methods, based on machine learning algorithms, were selected and tested. The SVM scores lower (approx. 78.3% for detection of a single guardrail type and 72.2% for detection of multiple guardrail types), while the retrained VGG-16 scores higher (approx. 97% and 86%, respectively).

The results show that VGG-16, and more generally CNNs, are suitable for object detection tasks in safety management on construction jobsites. They have the potential to improve the situation on construction sites and contribute to a reduced number of injuries and fatalities. Solutions that aim to achieve even higher accuracy and a lower number of false positives should be sought. Special attention should be paid to varying lighting conditions on sites, as these may affect the accuracy of the algorithms dramatically.

Fig. 11. Example of false negative detection.


Fig. 12. Examples of false positive detection.

6.1. Limitations of the research

This research used a dataset with only one type of safety guardrail. Further training with datasets of more guardrail types would have to be carried out in order for the algorithm to generalize to different guardrails with high accuracy. Also, occlusion was not addressed in this research; it is assumed that the guardrail will always be visible.

It is important to understand that although the algorithms achieve high, and increasing, accuracy, it is unlikely that the accuracy will reach 100% in the near future. This might seem a limiting factor; however, it is important to realize that the main purpose of the application of these algorithms is to supplement human supervision, not to replace human inspection.

6.2. Future work

Deep CNNs offer promising results in many fields, including construction safety. The current implementation relies on the relatively memory-consuming network VGG-16 (the weights file is slightly over 500 MB). A future challenge might be to try a smaller network without losing accuracy.

Another extension of this study is its implementation for online video analysis. That would allow workers and site management to automatically detect potentially dangerous conditions.

The guardrail detection approach proposed in this paper can be combined with (as-built) Building Information Modeling (BIM) to achieve the inspection of appropriate guardrail installation in the future. BIM models can provide the locations of the open edges requiring guardrail systems, based on OSHA best practice. If the camera used to capture the jobsite images is carried by a worker, an Unmanned Aerial Vehicle, an Unmanned Ground Vehicle, or even closed-circuit televisions with localization and compass sensors, the images can be mapped to the locations in the BIM models. In this way, the extended system can automatically determine whether the guardrail system is appropriately installed.

Acknowledgment

The authors would like to acknowledge the contribution of Mr. Adam Popel, whose collection of construction site images was helpful in generating the synthetic images used in this study.

This work was jointly supported by National Science Foundation of China Grant #51408519, Research Grants Council Grant #21206415 and the Shenzhen Science and Technology Funding Programs (JCYJ20150902162946055). The conclusions herein are those of the authors and do not necessarily reflect the views of the sponsoring agencies.

Appendix A. Results of grid search – support vector machine training

Table 1
Grid search with SVM – linear kernel.

C       Accuracy   AUC     F1
0.001   0.811      0.895   0.821
0.003   0.868      0.948   0.871
0.01    0.914      0.977   0.913
0.03    0.935      0.986   0.934
0.1     0.951      0.991   0.951
0.3     0.965      0.993   0.965
1       0.968      0.995   0.968
3       0.973      0.996   0.972
10      0.972      0.996   0.972
30      0.970      0.996   0.971
100     0.971      0.996   0.971
300     0.971      0.996   0.971
1000    0.971      0.996   0.971

Table 2
Grid search with SVM - RBF kernel.

C Accuracy AUC F1

0.001 0.755 0.5 0.741


0.003 0.755 0.5 0.741
0.01 0.755 0.5 0.741
0.03 0.755 0.5 0.741
0.1 0.755 0.5 0.741
0.3 0.755 0.839 0.741
1 0.767 0.867 0.741
3 0.832 0.916 0.841
10 0.881 0.963 0.882
30 0.920 0.981 0.919
100 0.941 0.988 0.941
300 0.956 0.991 0.955
1000 0.967 0.944 0.967

Appendix B. Results of grid search – retrained MLP training

In the tables below, the dropout values are given along the X axis and the L2 regularization parameter along the Y axis.

Table 3
Grid search - accuracy.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.01 0.95 0.97 0.97 0.97 0.97 0.96 0.96 0.96 0.96 0.97
0.03 0.96 0.96 0.95 0.97 0.96 0.96 0.96 0.96 0.95 0.95
0.1 0.96 0.97 0.97 0.97 0.96 0.97 0.95 0.96 0.95 0.97
0.3 0.97 0.97 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.97
1 0.96 0.96 0.97 0.96 0.96 0.96 0.97 0.96 0.95 0.95
3 0.96 0.9 0.96 0.95 0.97 0.96 0.95 0.96 0.94 0.97
10 0.95 0.95 0.96 0.93 0.91 0.89 0.95 0.95 0.93 0.93
30 0.88 0.86 0.9 0.85 0.87 0.92 0.92 0.92 0.9 0.78
100 0.88 0.71 0.75 0.86 0.88 0.88 0.68 0.86 0.76 0.81

Table 4
Grid search - AUC.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.01 0.99 1 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.03 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.1 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.3 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
1 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
3 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
10 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
30 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.97 0.99
100 0.97 0.96 0.97 0.96 0.96 0.96 0.96 0.95 0.94 0.97


Table 5
Grid search – F1 score.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.01 0.95 0.97 0.97 0.97 0.97 0.97 0.97 0.96 0.96 0.97
0.03 0.96 0.96 0.94 0.97 0.96 0.97 0.95 0.96 0.95 0.97
0.1 0.96 0.97 0.97 0.97 0.96 0.97 0.95 0.96 0.95 0.97
0.3 0.97 0.97 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.97
1 0.96 0.96 0.97 0.96 0.96 0.96 0.97 0.96 0.95 0.95
3 0.96 0.89 0.96 0.95 0.97 0.96 0.95 0.96 0.94 0.97
10 0.95 0.94 0.96 0.94 0.97 0.96 0.95 0.96 0.94 0.97
30 0.89 0.88 0.9 0.82 0.85 0.92 0.93 0.92 0.9 0.72
100 0.88 0.77 0.67 0.84 0.89 0.88 0.52 0.88 0.8 0.84

References

[1] BLS, Fatal occupational injuries for selected events or exposures, https://www.bls.gov/news.release/cfoi.t02.htm, (2016), Accessed date: 10 October 2017.
[2] Occupational Safety and Health Administration, Top 10 Most Frequently Cited Standards, https://www.osha.gov/Top_Ten_Standards.html, (2016), Accessed date: 10 October 2017.
[3] H. Li, M. Lu, S.-C. Hsu, M. Gray, T. Huang, Proactive behavior-based safety management for construction safety improvement, Saf. Sci. 75 (2015) 107–117, http://dx.doi.org/10.1016/j.ssci.2015.01.013.
[4] H. Li, M. Lu, G. Chan, M. Skitmore, Proactive training system for safe and efficient precast installation, Autom. Constr. 49 (2015) 163–174, http://dx.doi.org/10.1016/j.autcon.2014.10.010.
[5] S. Hecker, J.A. Gambatese, Safety in design: a proactive approach to construction worker safety and health, Appl. Occup. Environ. Hyg. 18 (2003) 339–342, http://dx.doi.org/10.1080/10473220301369.
[6] J.W. Garrett, J. Teizer, Human factors analysis classification system relating to human error awareness taxonomy in construction safety, J. Constr. Eng. Manag. 135 (2009) 754–763, http://dx.doi.org/10.1061/(ASCE)CO.1943-7862.0000034.
[7] Y.H. Hung, W.W. Winchester, T.L. Smith-Jackson, B.M. Kleiner, K.L. Babski-Reeves, T.H. Mills, Identifying fall-protection training needs for residential roofing subcontractors, Appl. Ergon. 44 (2013) 372–380, http://dx.doi.org/10.1016/j.apergo.2012.09.007.
[8] Y.Y. Chow, Case studies of accidents involving work at height, Singapore, https://www.wshc.sg/files/wshc/upload/infostop/attachments/2017/IS201704210000000417/Case_Studies_of_Accidents_Involving_Working_at_Heights.pdf, (2017), Accessed date: 10 October 2017.
[9] A. Carbonari, A. Giretti, B. Naticchia, A proactive system for real-time safety management in construction sites, Autom. Constr. 20 (2011) 686–698, http://dx.doi.org/10.1016/j.autcon.2011.04.019.
[10] T. Cheng, J. Teizer, G.C. Migliaccio, U.C. Gatti, Automated task-level activity analysis through fusion of real time location sensors and worker's thoracic posture data, Autom. Constr. 29 (2013) 24–39, http://dx.doi.org/10.1016/j.autcon.2012.08.003.
[11] J. Seo, S. Han, S. Lee, H. Kim, Computer vision techniques for construction safety and health monitoring, Adv. Eng. Inform. 29 (2015) 239–251, http://dx.doi.org/10.1016/j.aei.2015.02.001.
[12] T. Cheng, J. Teizer, Real-time resource location data collection and visualization technology for construction safety and activity monitoring applications, Autom. Constr. 34 (2013) 3–15, http://dx.doi.org/10.1016/j.autcon.2012.10.017.
[13] R.J. Campbell, P.J. Flynn, A survey of free-form object representation and recognition techniques, Comput. Vis. Image Underst. 81 (2001) 166–210, http://dx.doi.org/10.1006/cviu.2000.0889.
[14] G. Cheng, J. Han, A survey on object detection in optical remote sensing images, ISPRS J. Photogramm. Remote Sens. 117 (2016) 11–28, http://dx.doi.org/10.1016/j.isprsjprs.2016.03.014.
[15] A. Mohan, C. Papageorgiou, T. Poggio, Example-based object detection in images by components, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 349–361, http://dx.doi.org/10.1109/34.917571.
[16] M. Park, E. Palinginis, I. Brilakis, Detection of construction workers in video frames for automatic initialization of vision trackers, Construction Research Congress, 2012, http://dx.doi.org/10.1061/9780784412329.095.
[17] I. Brilakis, M.W. Park, G. Jog, Automated vision tracking of project related entities, Adv. Eng. Inform. 25 (2011), http://dx.doi.org/10.1016/j.aei.2011.01.003.
[18] V. Pătrăucean, I. Armeni, M. Nahangi, J. Yeung, I. Brilakis, C. Haas, State of research in automatic as-built modelling, Adv. Eng. Inform. 29 (2015) 162–171, http://dx.doi.org/10.1016/j.aei.2015.01.001.
[19] M.W. Park, I. Brilakis, Construction worker detection in video frames for initializing vision trackers, Autom. Constr. 28 (2012) 15–25, http://dx.doi.org/10.1016/j.autcon.2012.06.001.
[20] Y. Ham, K.K. Han, J.J. Lin, M. Golparvar-Fard, Visual monitoring of civil infrastructure systems via camera-equipped unmanned aerial vehicles (UAVs): a review of related works, Vis. Eng. 4 (1) (2016), http://dx.doi.org/10.1186/s40327-015-0029-z.
[21] J. Teizer, Status quo and open challenges in vision-based sensing and tracking of temporary resources on infrastructure construction sites, Adv. Eng. Inform. 29 (2015) 225–238, http://dx.doi.org/10.1016/j.aei.2015.03.006.
[22] J. Yang, Z. Shi, Z. Wu, Vision-based action recognition of construction workers using dense trajectories, Adv. Eng. Inform. 30 (2016) 327–336, http://dx.doi.org/10.1016/j.aei.2016.04.009.
[23] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2004) 137–154, http://dx.doi.org/10.1023/B:VISI.0000013087.49260.fb.
[24] S. Du, M. Shehata, W. Badawy, Hard hat detection in video sequences based on face features, motion and color information, ICCRD2011 – 3rd International Conference on Computer Research and Development, 2011, pp. 25–29, http://dx.doi.org/10.1109/ICCRD.2011.5763846.
[25] H. Kim, K. Kim, H. Kim, Data-driven scene parsing method for recognizing construction site objects in the whole image, Autom. Constr. 71 (2016) 271–282, http://dx.doi.org/10.1016/j.autcon.2016.08.018.
[26] D.G. Lowe, Distinctive image features from scale invariant keypoints, Int. J. Comput. Vis. 60 (2004) 91–110, http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94.
[27] F.-F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, 2005, pp. 524–531, http://dx.doi.org/10.1109/CVPR.2005.16.
[28] G. Veres, H. Grabner, L. Middleton, L. Van Gool, Automatic workflow monitoring in industrial environments, Asian Conference on Computer Vision, 2010, pp. 200–213, http://dx.doi.org/10.1007/978-3-642-19315-6_16.
[29] Z. Zhu, X. Ren, Z. Chen, Integrated detection and tracking of workforce and equipment from construction jobsite videos, Autom. Constr. 81 (2017) 161–171, http://dx.doi.org/10.1016/j.autcon.2017.05.005.
[30] M.-W. Park, I. Brilakis, Continuous localization of construction workers via integration of detection and tracking, Autom. Constr. 72 (2016) 129–142, http://dx.doi.org/10.1016/j.autcon.2016.08.039.
[31] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324, http://dx.doi.org/10.1109/5.726791.
[32] Y. LeCun, C. Cortes, C.J.C. Burges, The MNIST Database of Handwritten Digits, (1998), http://yann.lecun.com/exdb/mnist/, Accessed date: 10 January 2018.
[33] M. Ghayoumi, A quick review of deep learning in facial expression, J. Commun. Comput. 14 (2017), http://dx.doi.org/10.17265/1548-7709/2017.01.004.
[34] V. John, M. Umetsu, A. Boyali, S. Mita, M. Imanishi, N. Sanma, S. Shibata, Real-time hand posture and gesture-based touchless automotive user interface using deep learning, 2017 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2017, pp. 869–874, http://dx.doi.org/10.1109/IVS.2017.7995825.
[35] S.S.M. Salehi, D. Erdogmus, A. Gholipour, Auto-context convolutional neural network (Auto-Net) for brain extraction in magnetic resonance imaging, IEEE Trans. Med. Imaging (2017), http://dx.doi.org/10.1109/TMI.2017.2721362.
[36] Y. Zhang, W. Chan, N. Jaitly, Very deep convolutional networks for end-to-end speech recognition, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 4845–4849, http://dx.doi.org/10.1109/ICASSP.2017.7953077.
[37] H.W.F. Yeung, J. Li, Y.Y. Chung, Improved performance of face recognition using CNN with constrained triplet loss layer, 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 1948–1955, http://dx.doi.org/10.1109/IJCNN.2017.7966089.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
[39] M. Siddula, F. Dai, Y. Ye, J. Fan, Unsupervised feature learning for objects of interest detection in cluttered construction roof site images, Procedia Eng. 145 (2016) 428–435, http://dx.doi.org/10.1016/j.proeng.2016.04.010.
[40] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, Conference on Computer Vision and Pattern Recognition Workshop (CVPR 2004), 2004, p. 178, http://dx.doi.org/10.1109/CVPR.2004.109.
[41] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2010) 303–338, http://dx.doi.org/10.1007/s11263-009-0275-4.
[42] X. Liu, W. Liang, Y. Wang, S. Li, M. Pei, 3D head pose estimation with convolutional neural network trained on synthetic images, 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 1289–1293, http://dx.doi.org/10.1109/ICIP.2016.7532566.
[43] M.M. Soltani, Z. Zhu, A. Hammad, Automated annotation for visual recognition of construction resources using synthetic images, Autom. Constr. 62 (2016) 14–23, http://dx.doi.org/10.1016/j.autcon.2015.10.002.
[44] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 1345–1359, http://dx.doi.org/10.1109/TKDE.2009.191.
[45] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015, pp. 1–14, arXiv:1409.1556.
[46] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv e-prints, 2012, pp. 1–18, arXiv:1207.0580.

