
Advanced Engineering Informatics 35 (2018) 56–68


Full length article

A deep learning-based method for detecting non-certified work on construction sites

Qi Fang a,b, Heng Li b,*, Xiaochun Luo b, Lieyun Ding a, Timothy M. Rose c, Wangpeng An d, Yantao Yu b

a School of Civil Engineering & Mechanics, Huazhong University of Science & Technology, Wuhan, China
b Department of Building and Real Estate, The Hong Kong Polytechnic University, Hong Kong
c School of Civil Engineering and Built Environment, Queensland University of Technology, Brisbane, Australia
d Department of Computing, The Hong Kong Polytechnic University, Hong Kong

* Corresponding author. E-mail address: bshengli@polyu.edu.hk (H. Li).

Keywords: Construction safety; Certification checking; Trade recognition; Identification; Deep learning

ABSTRACT

The construction industry is a high hazard industry. Accidents frequently occur, and some of them are closely related to workers who are not certified to carry out the specific work involved. Although workers without a trade certificate are restricted from entering construction sites, only a few ad-hoc approaches are commonly employed to check whether a worker is carrying out the work for which they are certified. This paper proposes a novel framework to check whether a site worker is working within the constraints of their certification. Our framework comprises key video clips extraction, trade recognition and worker competency evaluation. Trade recognition is a newly proposed method that analyzes the dynamic spatiotemporal relevance between workers and non-worker objects. We also improve the identification results by analyzing, comparing and matching multiple face images of each worker obtained from videos. The experimental results demonstrate the reliability and accuracy of our deep learning-based method in detecting workers who are carrying out work for which they are not certified, to facilitate safety inspection and supervision.

https://doi.org/10.1016/j.aei.2018.01.001
Received 8 August 2017; Received in revised form 4 January 2018; Accepted 8 January 2018
1474-0346/ © 2018 Elsevier Ltd. All rights reserved.

1. Introduction

The construction industry is a high hazard industry and fatal accidents continue to occur [1]. According to the United States Occupational Safety and Health Administration (OSHA), approximately 900 workers lose their lives on construction sites in the US every year [2]. Furthermore, according to the findings of the case study research conducted by the Health and Safety Executive (HSE) in the United Kingdom, inadequate knowledge, competency and safety awareness are significant underlying causes of fatal accidents [3]. From a survey of 1241 construction laborers with reportable injuries, the US Bureau of Labor Statistics (BLS) found that 26% of the injured laborers had not received any training before the injury event, and 74% of the injured had less than one year of experience [4]. Similarly, Umeokafor et al. [5] collected construction accident data over an 11-year period in Nigeria and found that untrained or inexperienced worker-related accidents accounted for 40% of all the accidents resulting from unsafe human acts.

In the United States, the OSHA officially recognizes a work certificate as evidence that a qualified construction worker is equipped with sufficient knowledge, competency and safety awareness to conduct their job [6]. It is argued that a certified worker who has been fully trained pays more attention to safety and specific site safety requirements [7]. Thus, fewer accidents occur when workers are qualified and their qualifications are appropriately certified, since they have extensive knowledge of their trade and a deep insight into the consequences of their actions [8]. This is supported by a study conducted by California state OSHA, which found that the requirement for worker certification led to an 80% decrease in fatalities from crane accidents [9].

Therefore, more recently, Mainland China, Hong Kong and the United Kingdom have introduced policies to forbid non-certified workers from doing construction work. In China, the AQSIQ [10] requires that special equipment operation staff must undertake rigorous training and pass certification examinations before operating special equipment, since the operation of special equipment can be difficult and its unsafe operation can cause serious accidents. To improve the quality and safety of construction projects, MOHURD [11] has specified that all construction workers in China must be appropriately trained and hold relevant certificates by 2020.



Similarly, the related "Designated Workers for Designated Skills" policy provision in Hong Kong specifies that only registered skilled workers of specific trade divisions are allowed to independently carry out construction work on Hong Kong construction sites [12]. The principal objective of this provision is to enhance the safety and quality of construction works by improving workers' skill levels [13]. Further, the United Kingdom's Common Minimum Standards for Construction [14] requires all workers to be registered with the Construction Skills Certification Scheme (CSCS), since CSCS cards are evidence of the individual training and qualifications required by the specific trades they carry out [15].

In response to recent policy provisions, the Hong Kong Housing Authority (HKHA) has recommended that Radio Frequency Identification (RFID) contactless access monitoring and recording systems should be used at the entrances of construction sites to prevent the entry of unauthorized persons [16]. Although this system can identify and restrict un-certified workers from entering construction sites, it remains unknown whether each worker is carrying out the appropriate work according to their certificate. This is a major problem, as construction safety statistics indicate that accidents frequently occur as a result of workers with weak safety awareness engaging in site activities they are not certified to carry out. For example, in the US, the Fatality Assessment and Control Evaluation (FACE) program [17] identified that at least 10 of 93 recent deaths resulting from electrocution involved uncertified and inadequately trained workers attempting to perform electrical work. Another study by the US Department of Labor revealed that fatal accidents frequently occur as a result of non-specialized workers operating tower cranes [18]. In this paper, we refer to 'non-certified workers' as workers who are carrying out trade work they are not certified to carry out on-site; this does not include the checking of individuals who would be restricted from entering a construction site.

The accurate checking of certified workers is an important process to ensure construction site safety. This process involves trade certificate checking to determine whether tradespersons (recognized by their specific construction activity) match their certificates when queried for identification. In previous related research, scholars have applied sensor-based [19,20] and handcrafted-feature-based (such as Histogram of Oriented Gradients (HOG) [21]) methods for site worker identification. However, the sensor-based approach has recognition problems caused by sensor loss, and HOG is a handcrafted feature acknowledged to have poor precision [22]. Although several researchers have contributed to activity recognition [23–30], few studies consider the relationship between activities and trades. Further, very few scholars have focused on solving certification-checking problems. Therefore, there is an urgent need to develop a suitable solution to automatically check worker certification.

The field of computer vision has developed rapidly [31], and vision-based approaches to automatic security monitoring have made significant progress [32]. Therefore, we propose the combination of several advanced vision-based deep learning algorithms to check the certification of workers on construction sites and prevent workers from carrying out work for which they are not certified. Once non-certified workers have been detected, they are alerted, and action will be taken to cease the activity. As such, a decrease in the level of non-certified work is expected to substantially reduce the occurrence of related accidents.

In the rest of this paper, we firstly review the previous research related to the problem of construction site workers carrying out non-certified trade work, and the latest developments in computer vision technologies addressing this problem. Secondly, we present our framework, which comprises three modules: key video clips extraction, trade recognition and worker competency judgment. Thirdly, we propose a set of rules for key video clips extraction, a new trade recognition method based on the spatiotemporal relevance between workers and non-worker objects, and an improved identification strategy based on multiple face images. Finally, we evaluate the performance of each module, discuss the causes of errors and present the knowledge contribution of this study.

2. Literature review

In recent years, methods of 'deep learning' have greatly progressed in the field of computer vision. Accordingly, computer vision-based safety behavior monitoring has also developed rapidly. This study focuses on identifying workers and matching their certified competencies with specific trades classified by activities. Because of the rapid advancements in deep learning-based object detection and tracking, and face detection and recognition, workers can be automatically identified. Since few scholars have directly investigated automated solutions to enable the recognition of workers carrying out unauthorized work, we also review recent developments in activity recognition and worker identification problems in the construction industry to support this research contribution.

2.1. Related techniques in computer vision

Detecting non-certified work on site requires several techniques in computer vision. First, object detection and tracking methods are needed to classify and locate all the objects in the images and to understand their actions from trajectories, which is the foundation of both trade recognition and identification. Second, face detection and recognition methods help verify the workers' identities to determine whether they are certified to undertake the work.

Before introducing the related algorithms, it is necessary to understand the functions and structure of Convolutional Neural Networks (CNN), which are the basic element of the deep learning methods used in computer vision [33]. The purpose of a CNN is to extract features from a source image and then use these features to classify the object in the image [34,35]. A complete CNN consists of multiple convolutional layers, rectified linear units and pooling layers, as well as a fully connected layer. The parameters of the learnable filters in these layers are fine-tuned and optimized together with the classification components to minimize the total classification error [36]. With the help of CNNs, various objects in images can be recognized automatically, which is a fundamental step for the subsequent research stages, mainly including object detection, tracking, face detection and face recognition.
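For illustration, the layer composition described above can be sketched as a minimal PyTorch model. This is an illustrative example only, not a network used in this study; the input size, channel counts and class count are assumptions.

```python
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolution -> ReLU -> pooling, repeated, then a fully connected
    classifier: the canonical CNN layout described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                         # x: (N, 3, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(1))
```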
The emergence of CNNs has led to rapid development of the object detection field [33]. Following the continuous improvement from R-CNN [34] to SPP [35] to Fast R-CNN [36], the most recent advanced algorithms in the object detection field are Faster Region-based Convolutional Neural Networks (Faster R-CNN) [37], the Single Shot Multibox Detector (SSD) [38] and You Only Look Once (YOLO) [39]. Despite the faster calculation speed of SSD and YOLO, Faster R-CNN has the highest accuracy and still allows real-time detection for our purpose. Therefore, Faster R-CNN is argued to be the most suitable object detection method for this study.

Multiple object tracking (MOT) can be viewed as the problem of associating the same detected objects across multiple frames in a video sequence [40]. Most top-ranked MOT solutions [41–43] are considered too slow, which restricts their real-time application. As a state-of-the-art online tracking algorithm, Simple Online and Real-Time Tracking (SORT) [40] is a much simpler framework that achieves favorable performance at high frame rates on the MOT challenge dataset [44]. Considering its compatibility with detection algorithms, advanced performance and short runtime, SORT is employed in our study.

Face detection is the precondition for face recognition [45]. Since computer face recognition requires close-up photographs with only the face visible for classification [46], face detection methods are needed to extract the bounding boxes of faces before face recognition can commence. Recently, in response to the challenges in developing reliable face detection methods, researchers have attempted to use generic object detection methods to solve face detection tasks, since face detection can be considered a special type of object detection task in computer vision [47]. One of the most frequently used methods is Faster R-CNN [37].


Researchers have proposed new face detection methods by extending the Faster R-CNN algorithm [48–50]. Meanwhile, other researchers have moved away from generic object recognition frameworks and developed novel, specialized deep CNN models for face detection [51–53]. All the above CNN-based face detectors achieved outstanding results on large face detection benchmarks such as [54,55].

As the input to the face recognition module, images are cropped to contain only an entire face. Face recognition has become a popular research area in computer vision and one of the most successful applications of image analysis and understanding [56]. The latest face recognition methods using CNN-based deep networks have achieved unprecedentedly high accuracy, such as the DeepFace series [57,58], the DeepID series [59–62] and 'triplet-based' loss-function improvement algorithms [63–65]. In addition to the above methods, Lu et al. [56] proposed a double deep CNN-based method for face recognition, which has advantages over all the above methods. They designed two CNNs to extract face features, inspired by DeepID2, using the triplet loss function of FaceNet and providing multi-scale features like the DeepID series. Most importantly, they applied Inception [66], the winner of the 2014 ImageNet competition, as the base network. To date, this algorithm has achieved the highest performance (99.75%) on the Labeled Faces in the Wild (LFW) dataset [67], and it is applied in this study.

2.2. Activity recognition in construction

Each construction trade involves regular atomic activities. As such, regular and specific trade activities can be identified by activity recognition. Therefore, activity recognition is the foundation of trade classification in this study. Recently, activity recognition has made significant advances.

Accelerometers are a popular sensor used for activity recognition. Joshua and Varghese [23] clustered acceleration data into several patterns and identified specific activities from these patterns by using accelerometers attached to the waist of a worker, in this case a mason. Similarly, Ryu et al. [24] classified activity by analyzing wrist-worn accelerometer data and achieved high levels of accuracy. Another commonly used sensor is the smartphone, which integrates accelerometer and gyroscope sensors [25]. Akhavian and Behzadan [25] captured body movements via smartphones and used the collected data to train machine learning algorithms for the simulation of various activity types, where activity recognition was performed by machine learning classifiers. Spatial location and posture have also been used to identify human activity. Cheng et al. [26] used the fusion of spatial-temporal data and workers' thoracic posture data to identify workers' activity types. However, there are challenges in the significant cost of sensors and the unwillingness of workers to wear sensors during work.

Compared with sensors, computer vision methods offer greater flexibility and adaptability because they do not require workers to wear extra instruments. In the development of computer vision methods to identify activity, Khosrowpour et al. [27] developed a method to automatically observe activities by classifying RGB-D images. Additionally, Yang et al. [28] used three types of descriptors, namely HOG, Histogram of Optical Flow (HOF) and Motion Boundary Histograms (MBH), to compute dense trajectories, which were then input into Support Vector Machines for classification. Yang et al. [29] focused on worker activity recognition using a dense trajectories method on video clips. Gong et al. [30] explored the potential of an emerging action analysis framework, Bag-of-Video-Feature-Words, in learning and classifying worker actions.

However, due to their limited range and vulnerability to sunlight and ferromagnetic radiation, RGB-D cameras are argued to be unsuitable for practical application [68]. Furthermore, handcrafted-feature-based (such as HOG) methods performed at low precision in the PASCAL Visual Object Classes (VOC) Challenge 2006 [69], which makes them not an ideal choice in this research. With the rapid development of deep learning, Luo [70] proposed a deep learning-based activity recognition method that greatly promoted the performance of this field. However, only the spatial relevance of objects detected in images was considered, which caused many errors by mistaking passersby for working workers. Moreover, a construction activity is a dynamic process in a video clip rather than a static state in a frame. Therefore, we improve on [70] by taking temporal information into consideration and present a new framework based on spatiotemporal relevance analysis.

2.3. Worker identification in construction

Researchers have previously applied worker identification methods in the areas of schedule and safety management. Li et al. [19] attached Bluetooth devices to hardhats for non-hardhat-use detection and to differentiate between workers, collecting abundant data to mine the association between worker characteristics and non-hardhat-use. Kelm et al. [20] attached a Radio-frequency Identification (RFID) tag to each piece of Personal Protective Equipment (PPE) for worker identification and PPE compliance. In another study, Weerasinghe et al. [71] used the Microsoft Kinect sensor to track workers. While other researchers introduced HOG+HSV [21,72] and HOG+Color [73] based detectors to detect workers in videos shot by normal cameras, Weerasinghe and Ruwanpura [74] detected workers by the colors of their hardhats and measured the productivity of the identified workers based on the assumption that different colored hardhats represent different trades. Chi and Caldas [75] proposed a real-time detection and classification method to recognize mobile heavy equipment and workers and to distinguish different types of equipment. Fang et al. [76] put forward a deep learning-based method to detect non-hardhat-use workers on construction sites.

All the studies above have taken a significant step forward in connecting specific individual identification information with schedule and safety management. However, these methods are limited in their respective ways. The potential unwillingness of workers to wear a sensor during work restricts the use of sensor-based methods, and the associated problems with the loss of sensors can seriously hinder their further application. As for vision-based methods, distinctions can only be made between different trades [74] or between different types of equipment and workers [75]. Since workers on construction sites generally wear indistinguishable clothing, protective shoes and hardhats, without face recognition all the above vision-based detection and tracking methods [21,72,73,76] cannot confirm worker identities. Therefore, there is a need to combine face recognition, detection and tracking to reliably identify non-certified workers.

3. Objective and scope

The overall objective of this paper is to propose a novel solution to address the unresolved problem of reliably identifying workers who are carrying out unauthorized work onsite. There are four aspects to consider when developing an automated detection system for this problem. Firstly, workers are detected via videos. Secondly, the certified work scopes of individual workers are identified according to their registration information. Thirdly, we determine the specific trades based on workers' activities. Finally, we identify whether there are mismatches between their registered work and the work they actually carry out. In combination, this study proposes a sound framework that comprises a combination of reliable technologies using state-of-the-art deep learning-based algorithms to solve the above problems. The practical performance and limitations of the framework are tested through experiments.

4. Methodology


Fig. 1. Overall framework of the proposed method.

In response to the objective stated previously, this paper presents a novel framework to provide a reliable check of whether workers are only carrying out work for which they are certified. Fig. 1 illustrates the overall framework of the proposed method. Firstly, key clips of videos that contain a potential trade of workers are extracted by continuous evaluation against a trigger condition and time length. Then, a trade recognition module based on spatiotemporal relevance is proposed to detect and classify construction worker trades from the key video clips. Meanwhile, workers' faces are detected and identified via an identification module, and their identity information is passed to the human resource database to check their certified trades. Finally, workers whose certificates do not match the trades they are engaged in will be identified.

4.1. Key video clips extraction

Key video clips extraction is critical for the trade recognition and identification modules. On one hand, a construction trade always consists of a series of continuous movements that are captured across multiple consecutive frames, and it is difficult to define an activity from only a single static frame. For example, a carpenter who is passing by an ironwork area might be wrongly detected as an ironworker in a static image frame. Therefore, discussing how to precisely extract a key clip of video that potentially contains a construction trade is valuable. On the other hand, the use of video clips, instead of a single frame, increases the accuracy of worker identification. As a worker's face may not appear in the camera in every frame, recognizing faces in a video clip can significantly improve identification accuracy.

This section focuses on how to extract key clips of video, including when to start and how long a clip should be. We solve this problem by employing a sifting algorithm to pre-judge worker behavior patterns (working or passing by). The activity patterns of workers are analyzed and distinguished respectively, and then we ignore the special cases in which a worker merely passes by an area. Next, the program of key video clips extraction is introduced in detail, as shown in Fig. 2.

Firstly, we trained a Faster R-CNN model to detect common objects on construction sites (including workers, materials and equipment). The Faster R-CNN [37] model is then applied to detect the above common objects in every frame of a video sequence. It returns classification information and bounding box regressions, which indicate the position of each detected object. Secondly, a SORT [40] based tracker is used to associate the same objects detected by Faster R-CNN across frames in a video sequence. The SORT based tracker approximates the displacements of the detected objects in the previous frame with a linear constant velocity model, which predicts the new positions of the bounding boxes of the objects in the current frame. Then, the assignment between the objects detected in the current frame and the predicted bounding boxes is solved optimally using the Hungarian algorithm [77]. Thus, the same objects in different frames are all detected and matched, which is the basis of the next steps.
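As an illustration of the assignment step, the following sketch matches predicted track boxes to current detections with an IoU-based cost matrix and SciPy's Hungarian solver. It is a minimal example of the matching idea rather than the exact SORT implementation; the box format and the IoU cutoff are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_tracks(predicted, detected, min_iou=0.3):
    """Assign predicted track boxes to current detections (SORT-style)."""
    if not predicted or not detected:
        return []
    cost = np.array([[1.0 - iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Keep only pairs whose overlap is good enough.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```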
The key video clips extraction program starts when a worker is detected in the video. A sifting algorithm is proposed to roughly estimate whether the worker is carrying out work that they are permitted to do. The sifting algorithm is conducted for each detected worker independently. The work of a worker can be equipment-related or material-related, depending on their working context. If area(worker) ≠ ∅ and area(equipment) ≠ ∅ are satisfied in a frame, an equipment sifting process is started. Similarly, a material sifting process will be started if the corresponding conditions are satisfied. These two processes run in parallel and can be implemented simultaneously. Once an equipment sifting process is started, the following frames will be examined one-by-one until one of them meets the requirement that area(worker) ∩ area(equipment) ≠ ∅, which marks the end of the sifting process and the beginning of the key video clips extraction program. Notably, the next sifting process will not be started again until a time interval of Ti has passed since the start of the previous sifting process (the time interval Ti can be customized by the users). Compared to equipment-related trades, the material area is larger and the spatial relevance between the material and the corresponding worker is weaker, which requires a lower relevance threshold. The material case is more susceptible to disturbance, so we need to eliminate the similar cases where workers merely pass by materials, since these are confusing for the computer. We identify passersby by their wide movement range within a small period.

We assume that a total of K frames is taken from a video clip of Tt seconds (the time length for trigger judgement), and the coordinates of the worker's bounding box center in the ith frame are annotated as (x_i, y_i). If the distance between the coordinates of an object in any frame and the corresponding coordinates in the first frame is less than a threshold (as represented in Formula (1), where λ is a predefined coefficient (as exemplified in Section 5.1) and w indicates the average width of the worker's bounding box), the worker can be considered a material-related worker instead of a passerby. As shown in Fig. 2, the sifting process is repeatedly started until the worker in the video is determined to satisfy the working condition based on Formula (1) for at least Tt seconds, and only then is the program of key-clips extraction triggered.

$$\sqrt{(x_i - x_1)^2 + (y_i - y_1)^2} \leq \lambda w, \quad i \in (1, K] \tag{1}$$
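For illustration, the passerby test of Formula (1) translates directly into code. The sketch below assumes the box centers have already been collected from the tracker; the function and variable names are ours.

```python
import math

def is_working(centers, avg_width, lam=2.0):
    """Formula (1): a worker whose box center stays within lam * avg_width
    of its position in the first frame is treated as working, not passing by."""
    x1, y1 = centers[0]
    return all(math.hypot(x - x1, y - y1) <= lam * avg_width
               for x, y in centers[1:])
```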


Fig. 2. Key video clips extraction process.

Once the trigger condition is met, a key video clip is obtained by extracting the next Tc seconds of video from the triggered frame. Tc refers to the time length of the clips needed to classify the types of trades. The conventional time length of clips in recent action recognition datasets, e.g., UCF-101 [78] and HMDB-51 [79], is around eight seconds. We follow this convention and select eight seconds as the length of atomic activity clips. Therefore, when the trigger condition is satisfied, all the image frames in the next eight seconds are extracted as a key video clip. Note that, through the above process, each worker has key video clips of themselves. Since the two sifting algorithms run in parallel, at most two key video clips of a worker can be extracted in a time interval. Next, the trade recognition and identification modules are simultaneously conducted based on the extracted key video clips of each worker in every time interval.

Table 1
Atomic activities and interaction contained in each trade.

Trade              Atomic activities                                        Interaction
Carpenter          Making formwork; installing formwork;                    Worker + Formwork
                   reinforcing formwork; dismantling formwork
                   Installing doors                                         Worker + Door
                   Installing windows                                       Worker + Window
Rebar worker       Cutting rebar; assembling rebar; placing rebar           Worker + Rebar
Concrete worker    Putting in materials; mixing concrete;                   Worker + Cement; Worker + Sand;
                   pouring concrete; repairing concrete                     Worker + Gravel; Worker + Concrete
Scaffolder         Installing scaffold; dismantling scaffold                Worker + Scaffolding
Crane driver       Operating crane                                          Worker + Crane
Bulldozer driver   Operating bulldozer                                      Worker + Bulldozer
Excavator driver   Operating excavator                                      Worker + Excavator

4.2. Trade recognition

Since every atomic activity corresponds to a specific trade, as listed in Table 1, a trade can be determined by the backward derivation of atomic activities. Construction activities involve a spatiotemporal interaction between workers and non-worker objects. Non-worker objects mainly include materials and equipment.


Therefore, construction trades can be recognized by the spatiotemporal interaction pairing between workers and non-worker objects. Fig. 3 shows the specific process.

Fig. 3. A worker's trade recognition process in a time interval.

Thus, a key clip of a worker p is annotated as Γ = {f_1, f_2, …, f_t}, where f_i = {R^i_p1, R^i_p2, …, R^i_pn} is the spatial relevance between the worker and the non-worker objects in the ith frame (assuming n non-worker objects are detected in this frame). Here, following [70], we define the spatial relevance R_pq to indicate the interaction between the worker p and the non-worker object q:

$$R_{pq} = \begin{cases} \dfrac{1}{2}\left(1 + \dfrac{\mathrm{area}(p) \cap \mathrm{area}(q)}{\min(\mathrm{area}(p), \mathrm{area}(q))}\right), & \text{if } \mathrm{area}(p) \cap \mathrm{area}(q) \neq \emptyset \\[2ex] \dfrac{\mathrm{side}(p) + \mathrm{side}(q)}{2(\mathrm{side}(p) + \mathrm{side}(q) + \mathrm{dist}(p, q))}, & \text{otherwise} \end{cases} \tag{2}$$

where p and q respectively represent the worker and a non-worker object, area(∗) returns the area of the bounding box of a worker or an object, side(∗) returns the minimum side length, and dist(∗,∗) computes the minimum distance between two bounding boxes. Note that we define side lengths and box areas in units of pixels and pixels², respectively.

From the above data, we calculate in Formula (3) the average spatiotemporal relevance of a fixed pair of the worker p and a non-worker object q across the t frames in the Tc seconds, where object matching between different frames is achieved by the SORT based tracker.

$$\bar{R}_{pq} = \frac{\sum_{i=1}^{t} R^i_{pq}}{t} \tag{3}$$

Considering that a worker can only engage in one activity at a time, we need to find the maximum spatiotemporal relevance, indicating the most probable activity the worker is engaged in. So, only if the maximum average spatiotemporal relevance between a worker and all non-worker objects is bigger than a predefined threshold τ can we consider that an interaction exists between the two. As listed in Formula (4), R_p^max refers to the spatiotemporal relevance between the worker p and a pending object r (r is the serial number of the object corresponding to the maximum in the set {R̄_p1, …, R̄_pq, …, R̄_pn}), and it also indicates that the worker p and the object r are engaged in a trade. If two key video clips of a worker are extracted in a time interval, then the larger of the two R_p^max values indicates the strongest spatiotemporal relevance and is used to determine the trade type.

$$R_p^{max} = \begin{cases} \max\limits_{i \in (1, \dots, n)} (\bar{R}_{pi}), & \max\limits_{i \in (1, \dots, n)} (\bar{R}_{pi}) \geq \tau \\ 0, & \max\limits_{i \in (1, \dots, n)} (\bar{R}_{pi}) < \tau \end{cases} \tag{4}$$

According to the relationship between trades and atomic activities represented in Table 1, the trade of the worker can then be determined.
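For illustration, Formulas (2)–(4) can be sketched as follows. The box format, the handling of objects across frames, and all names are our assumptions; this is not the authors' implementation.

```python
import math

def spatial_relevance(p, q):
    """Formula (2): spatial relevance between worker box p and object box q.
    Boxes are (x1, y1, x2, y2) in pixels."""
    ix = min(p[2], q[2]) - max(p[0], q[0])
    iy = min(p[3], q[3]) - max(p[1], q[1])
    if ix > 0 and iy > 0:  # overlapping boxes
        inter = ix * iy
        min_area = min((p[2] - p[0]) * (p[3] - p[1]),
                       (q[2] - q[0]) * (q[3] - q[1]))
        return 0.5 * (1.0 + inter / min_area)
    # Minimum distance between the two (non-overlapping) boxes.
    dx = max(p[0] - q[2], q[0] - p[2], 0.0)
    dy = max(p[1] - q[3], q[1] - p[3], 0.0)
    s = min(p[2] - p[0], p[3] - p[1]) + min(q[2] - q[0], q[3] - q[1])
    return s / (2.0 * (s + math.hypot(dx, dy)))

def trade_relevance(frames, tau):
    """Formulas (3)-(4): average R_pq per tracked object over t frames, then
    keep the maximum only if it clears the threshold tau. `frames` is a list
    of dicts mapping object id -> R_pq in one frame."""
    totals = {}
    for f in frames:
        for obj, r in f.items():
            totals[obj] = totals.get(obj, 0.0) + r
    averages = {obj: v / len(frames) for obj, v in totals.items()}
    if not averages:
        return None, 0.0
    obj, r_max = max(averages.items(), key=lambda kv: kv[1])
    return (obj, r_max) if r_max >= tau else (None, 0.0)
```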
4.3. Worker competency judgment

Worker competency judgment includes four steps. Firstly, a face detection method is applied as a precondition to extract close-up face images from the key video clips of a worker p in a time interval. Secondly, taking the tracked, detected faces of the worker as inputs, we propose an improved face recognition method to confirm the identity of the worker. Thirdly, the certification information of the identified worker is queried from the human resource database. Finally, the worker's competent trades are assessed from their certification.

Among existing face detection methods, we chose the MTCNN [45] method to detect faces for its desirable performance and fast speed. Fig. 4 represents the face extraction process using the MTCNN method. The image is firstly pre-processed to obtain candidates. Each candidate is then resized to three sizes (12 × 12, 24 × 24 and 48 × 48, unit: pixel) as inputs to the core component of the MTCNN method, which consists of three Convolutional Neural Networks (CNNs). The three unified cascaded CNNs gradually eliminate the false candidate boxes and correct their locations. Finally, frontal or profile faces are detected from the image.
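For reference, a face-extraction step in this spirit can be sketched with the open-source `mtcnn` Python package, an independent implementation of [45] and not the authors' code; the crop logic and the confidence cutoff below are our assumptions.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()

def extract_faces(frame_bgr, min_conf=0.9):
    """Detect faces in one video frame and return cropped face images."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    faces = []
    for det in detector.detect_faces(rgb):
        if det['confidence'] < min_conf:
            continue
        x, y, w, h = det['box']
        faces.append(rgb[max(0, y):y + h, max(0, x):x + w])
    return faces
```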


Fig. 4. MTCNN method based face extraction process.

Next, we use the detected faces as input and employ a face recognition method to identify individual workers. As previously mentioned, the Face Verification with Multi-Task and Multi-Scale Features Fusion [56] method was chosen for face recognition in our study. The CNN architecture of this method contains two CNNs (CNN1 and CNN2, as shown in Fig. 5). Registration photographs and faces detected in videos are input into the two models for comparison and analysis. CNN1 employs the Inception [80] architecture because it achieves a significant quality gain in the classification field. CNN2 applies a residual network [81], since it can ease the training process by addressing overfitting and slow convergence. Then, the features from the two CNNs are concatenated to achieve an improved face representation. Finally, based on the combined features of the two faces, a Joint Bayesian classifier [82] is used to judge whether the two faces (the face from the registration photograph and the detected face) belong to the same worker or not. As shown in Fig. 5, if the two faces do not match and are recognized as two different individuals by the classifier, the process continues until there is a match with a registration photo. At this point, the faces that appear in a single frame have been identified by the above methods.

Next, we improve identification accuracy by comparing and merging the recognition results of the same person across multiple frames. Since the SORT based tracker can continuously track a worker, more than one identity is likely to be recognized from the multiple face images. Supposing that, in the key video clips of a worker in a time interval, W face images have been matched to the registration photos of m potential identities, we calculate the most probable identity according to Formula (5). For a potential candidate worker named X, there are W_X face images out of W classified as X's. According to Formula (5), the candidate whose registration photo gets the highest score across the W face images is predicted as the identity match of the worker.

$$Score(X) = \sum_{j=1}^{W_X} J_{Xj} \tag{5}$$

In Formula (5), J_{Xj} represents the output of the Joint Bayesian classifier [82], which is computed by Formula (6), where H_I represents the intra-personal hypothesis that two face features f_i and f_j belong to the same individual, and H_E is the extra-personal hypothesis that the two faces are from different individuals. Specifically, suppose that f_i represents the feature vector of worker X's registration photo and f_j represents a face image extracted from the key video clips. J_{Xj} is an indicator of the probability that the extracted face belongs to X: if J_{Xj} > 0, the extracted face is classified as X's, and the greater the value of J_{Xj}, the greater the probability that the extracted face belongs to X.

$$J_{Xj} = \log \frac{P(f_i, f_j \mid H_I)}{P(f_i, f_j \mid H_E)} \tag{6}$$
Formula (6), where HI represents the intra-personal hypothesis that two competent trades. Through all the above steps, we can identify the
faces features fi and f j belong to the same individual, and HE is the competent trades of the detected workers.
extra-personal hypothesis that two faces are from different individuals. After comparing their trade activity and trade certification, the
Specially, suppose that fi represents the feature vectors of worker X’s system can then identify non-certified workers and alert them.
registration photo and f j represents a face image extracted from the key
video clips. JXj is an indicator of the probability that the extracted face 5. Experiments and results
belongs to X. If JXj > 0, then the extracted face belongs to X. The greater
the value of JXj , the greater the probability that the extracted face be- As represented in Fig. 1, all processes presented in Section 4
longs to X. Going back to Formula (5), for a potential candidate worker are integrated into the proposed non-certified workers detection


Fig. 5. The process of a worker's trade certification judgment.

As represented in Fig. 1, all the processes presented in Section 4 are integrated into the proposed non-certified work detection system. In the following section, we evaluate the performance of the system.

5.1. Preparation for experiments

In the training phase, we first collected training datasets, manually annotated them and trained the models on them. Considering there is no related public benchmark in the construction field that can be used for training and testing, we were required to collect the datasets ourselves. We collected nearly 8000 images and manually annotated them using the graphical image annotation tool LabelImg [83] to train a Faster R-CNN based model to detect typical construction objects. Additionally, existing public datasets (the WIDER FACE dataset [54] and the CelebFaces+ dataset [84]) were used to train the face detection and face recognition models, respectively.

After training, we collected testing datasets and manually annotated them. 60 videos of 120 s each (120 s refers to the detection interval Ti set by the users) were collected to evaluate the performance of our proposed method. The testing dataset was well represented and included a variety of common types of trades, as shown in Table 3. Following this, we manually annotated the trades of the workers in the videos. A total of 98 different workers appeared in the videos, among which 91 were working and the others were passersby. We also had 10 registration photographs captured from different angles for each worker as reference results. Meanwhile, the relevant parameters were determined by our tests. To eliminate passersby, we set Tt = 5 and λ = 2. That means a worker whose moving path stays inside a range with a radius of double the bounding box width over an arbitrary 5 s is considered not to be a passerby. The threshold τ refers to the demarcation point between working and non-working states. Since the spatiotemporal relevance in equipment-related trades is stronger than in material-related trades, the corresponding thresholds are set differently. Here, τ_equipment was set to 0.6 and τ_material was set to 0.25.
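Collected in one place, the parameter values reported above might be expressed as a configuration block like the following (the names are ours):

```python
# Experiment parameters reported in Section 5.1 (variable names are ours).
PARAMS = {
    "T_t": 5.0,            # seconds of sustained proximity before triggering
    "T_c": 8.0,            # key-clip length in seconds
    "T_i": 120.0,          # detection interval between sifting passes
    "lambda": 2.0,         # passerby radius, in multiples of worker box width
    "tau_equipment": 0.6,  # relevance threshold for equipment-related trades
    "tau_material": 0.25,  # relevance threshold for material-related trades
}
```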


Table 2
Certifications and their corresponding competent trades.

Certification name                                      Certification type                          Competent trades
Welder                                                  Junior, intermediate, senior,               Electric welder; gas welder
                                                        technician, senior technician
Cubic meter of earth and stone mechanical operator      Junior, intermediate, senior,               Bulldozer driver; excavator driver;
                                                        technician, senior technician               pile driver; scraper operator
Middle and small size construction machinery operator   Junior, intermediate, senior,               Windlass operator; middle and small size
                                                        technician, senior technician               construction machinery operator;
                                                                                                    pile driver; grader operator
Crane, loading and unloading machine operator           Junior, intermediate, senior,               Loading and unloading machine driver;
                                                        technician, senior technician               crane driver; forklift driver
Bricklayer                                              Junior, intermediate, senior, technician    Bricklayer; furnace maker; mason
Concrete worker                                         Junior, intermediate, senior                Roller compaction worker; spraying worker;
                                                                                                    concrete worker
Rebar worker                                            Junior, intermediate, senior, technician    Rebar worker
Scaffolder                                              Junior, intermediate, senior                Scaffolder
Waterproof worker                                       Junior, intermediate, senior, technician    Waterproof worker; asphalt processing worker

Table 3
Details about the training and testing datasets of each module.

Module                  Training data format (size)
Key clips extraction    Images captured from construction sites, annotated with common objects (8k images)
Trade recognition       /
Competency judgement    WIDER FACE dataset & CelebFaces+ dataset (590K faces)

Testing data, shared across the modules: 1. videos including a variety of trades (60 videos of 120 s, with 98 test samples); 2. registration photos (ten registration photographs captured for each of the 98 workers from different angles).

5.2. Experiment strategy and results

In the preparation phase, detailed certification information of workers was difficult to collect due to privacy concerns. However, considering there are few errors in the query process from the human resource database, we can assign the certificates of each worker without influencing the evaluation results. Thus, we assume that non-certified workers account for 10% of all onsite workers. To simulate this practical situation onsite, the 91 working workers identified in the videos were randomly divided into 10 folds, where one fold was annotated as non-certified workers (their certificates were set to be different from the work they were engaged in), while the remaining nine folds were annotated as certified workers (their certificates were set to be compatible with the work they were engaged in). The experiment was performed ten times, and a different fold was selected sequentially as the non-certified workers each time. Then, since precision and recall [85] are mandatory metrics to assess detection performance, we compared the non-certified workers detected by our method with the annotations of the testing dataset to calculate the precision and recall in each sub-experiment. Finally, the average precision and recall were calculated over the ten experiments as an indicator of the performance of our system.

Table 4
Test results of the ten experiments.

Experiment   TP   FP   FN   Precision   Recall
1            8    2    1    0.800       0.889
2            8    2    1    0.800       0.889
3            7    1    2    0.875       0.778
4            7    2    2    0.778       0.778
5            7    2    2    0.778       0.778
6            7    0    2    1.000       0.778
7            8    1    1    0.889       0.889
8            7    2    2    0.778       0.778
9            8    2    1    0.800       0.889
10           9    2    1    0.818       0.900

Table 4 presents the results of the ten experiments. The average precision and recall were 0.832 and 0.834, respectively. Live examples of trade recognition, face extraction and face recognition from the test are shown in Figs. 6 and 7.
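For illustration, the per-fold metrics in Table 4 follow the standard definitions, as in this short sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for non-certified-worker detection in one fold."""
    return tp / (tp + fp), tp / (tp + fn)

# Example: fold 1 from Table 4 (TP=8, FP=2, FN=1).
p, r = precision_recall(8, 2, 1)  # -> (0.800, 0.889)
```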

6. Discussion

In this section, we analyze the causes of errors and present the knowledge contribution of this study.

6.1. Causes of error analysis

Our non-certified work detection system consists of three modules and contains four computer vision-based technologies. Here, the causes of errors are analyzed to explain the imperfect performance (0.832 precision and 0.834 recall) of the whole system.

Firstly, insufficient video resolution impacts the performance of the computer vision-based technologies. If worker faces are not sharp enough, their features cannot be accurately analyzed. At the existing resolution, almost all objects that human eyes can discern can be detected; an enhancement in camera resolution would further mitigate this problem.

Secondly, MOT remains a challenge [40]. However, application on construction sites provides some favorable conditions for the tracking algorithm. The cameras are fixed, and workers and non-worker objects engaged in some activities are in relatively static states (the scenario of object scaling does not often occur). Moreover, the trade recognition task only requires the tracker to match the corresponding objects across successive frames, rather than achieving a perfect spatial match between the tracking trajectories and the ground truths. As we specify a large threshold (2 times the width of the worker), it is enough to cover most of the location error (general tracking algorithms account for location error in their accuracy). However, tracking errors still exist, mainly arising from lost targets and identity switches. The tracker can easily lose targets that are obstructed for a long period of time; however, the targets will be tracked again once they reappear.

Fig. 6. An example of trade recognition.

As for the problem of identity switches, only workers who are working close together and overlapping in their locations will be confused by the tracking algorithm. Since workers of the same trade generally work in the same area, such workers are very likely to be from the same trade.

Thirdly, identification performance can be affected if a worker's face rarely appears on camera or is obstructed by their hardhat due to the angle of the camera. Such limitations are common in several specific scenarios, including working in deep foundation or excavation areas, or when a worker is too close to the surveillance cameras to capture their face. These are also inherent limitations of most computer vision methods. However, increasing the level of on-site video sampling can improve this weakness. With a greater abundance of cameras distributed in various locations across the site, identification performance can be increased.

6.2. Knowledge contribution

This study has made three key contributions to knowledge. Firstly, our proposal is the first automatic method for checking the certification of workers on construction sites. Owing to technical bottlenecks and the complexity of construction activities, there had been no proposed solution to this problem until now. We applied state-of-the-art deep learning methods and addressed computer vision identification problems to develop a novel framework to check the certification of workers on construction sites and to prevent workers from carrying out non-certified trade work. The experimental results demonstrate the potential of the method. Secondly, we proposed a new trade recognition model, developed on the hypothesis that trades can be perceived as spatiotemporal interactions between workers and non-worker objects. Object detection and tracking algorithms are employed to extract the trajectories of both workers and other objects. Given the proximity over time, the spatiotemporal relevance is proposed as an indicator to predict the trades that workers are engaged in. Thirdly, multiple face images across frames were analyzed and merged to improve identification performance in videos.

7. Conclusion

Construction activities are complex, dangerous and heavily dependent on the coordination of various types of trade work. Generally, each trade has a set of specific standards and safety requirements, which specify the safety procedures and safety knowledge required in effective operations. Many countries teach safety skills to workers via training programs and examine safety knowledge via qualification and certification tests. Therefore, workers who carry out construction work without appropriate certification pose a serious threat to construction site safety. Despite the significant safety threat posed by non-certified construction activity, few researchers have directly investigated possible solutions to improve the reliability of this checking process, which indicates a major gap in the body of knowledge.

This paper proposes a novel framework for worker certification checking via video imaging based on deep learning algorithms. Firstly, we set rules to extract key video clips. Then, we demonstrate a new framework for automated trade recognition that is applicable to various common trade types. Further, we propose an integrated system to accurately check the certification information of a worker based on the latest face detection and face recognition methods. In summary, the experimental results indicate that the proposed method offers an effective and feasible solution to detect non-certified work.


Fig. 7. An example of identification.

In light of these findings, it is recommended that future research considers how to link the alert of worker non-compliance to penalty responses, to encourage on-site safety behavioral change. It is expected that future research in this area will significantly contribute to the body of knowledge on reducing unsafe behavior on construction sites, via improved monitoring and control.

Acknowledgement

The authors would like to acknowledge Hanbin Luo, Lei Zhang, Xianbiao Qi, Chengqian Li and Yachun Huang for their help. We are also thankful for the financial support of (1) the National 12th Five-Year Plan Major Scientific and Technological Issues of China (NFYPMSTI) through Grant 2015BAK33B04; (2) the Research Grants Council of Hong Kong grant entitled "Proactively Monitoring Construction Progress by Integrating 3D Laser-scanning and BIM" (PolyU 152093/14E); and (3) the National Natural Science Foundation of China (grant no. 51678265).

References

[1] Occupational Safety and Health Administration, Construction Industry. <https://www.osha.gov/doc/index.html>, 2017 (last accessed on 19 July 2017).
[2] Bureau of Labor Statistics, Fatal Occupational Injuries Counts and Rates for Selected Occupations. <https://www.bls.gov/news.release/cfoi.t03.htm>, 2017 (last accessed on 19 July 2017).
[3] Health and Safety Executive, Underlying Causes of Construction Fatal Accidents – Review and Sample Analysis of Recent Construction Fatal Accidents. <http://www.hse.gov.uk/construction/resources/phase1.pdf>, 2009 (last accessed on 19 July 2017).
[4] A. Cohen, M.J. Colligan, R. Sinclair, J. Newman, R. Schuler, Assessing Occupational Safety and Health Training, National Institutes of Health, Cincinnati, OH, 1998, pp. 1–174.
[5] N. Umeokafor, K. Evaggelinos, S. Lundy, D. Isaac, S. Allan, O. Igwegbe, K. Umeokafor, B. Umeadi, The pattern of occupational accidents, injuries, accident causal factors and intervention in Nigerian factories, Dev. Country Stud. 4 (2014) 119–127.
[6] Occupational Safety and Health Administration, Who can be a Qualified Worker. <https://www.osha.gov/Publications/cranes-qualified-rigger-factsheet.pdf>, 2010 (last accessed on 19 July 2017).
[7] D. Langford, S. Rowlinson, E. Sawacha, Safety behaviour and safety management: its influence on the attitudes of workers in the UK construction industry, Eng. Constr. Architect. Manage. 7 (2000) 133–140, http://dx.doi.org/10.1046/j.1365-232x.2000.00137.x.
[8] M. Törner, A. Pousette, Safety in construction – a comprehensive description of the characteristics of high safety standards in construction work, from the combined perspective of supervisors and experienced workers, J. Saf. Res. 40 (2009) 399–409, http://dx.doi.org/10.1016/j.jsr.2009.09.005.
[9] Cal-OSHA, State OSHA Annual Report. <https://www.dir.ca.gov/dosh/reports/State-OSHA-Annual-Report-(SOAR)-FY-2015.pdf>, 2015 (last accessed on 19 July 2017).

[10] General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China, Measures for the Supervision and Management of Personnel Operating Special Equipment. <http://www.aqsiq.gov.cn/xxgk_13386/xxgkztfl/zcfg/201210/t20121017_260314.htm>, 2005 (last accessed on 19 July 2017).
[11] Ministry of Housing and Urban-Rural Development of the People's Republic of China, Guiding Opinions on Strengthening Vocational Training of Construction Workers. <http://www.mohurd.gov.cn/zcfg/jsbwj_0/jsbwjrsjy/201503/t20150331_220595.html>, 2015 (last accessed on 19 July 2017).
[12] "Designated Workers for Designated Skills" Provision. <http://www.cic.hk/eng/main/registration_services/cwro/>, 2017 (last accessed on 19 July 2017).
[13] C.I. Council, Construction Workers Registration Ordinance, Booklet, 2017.
[14] Infrastructure and Projects Authority, The Common Minimum Standards for Construction. <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/600885/2017-03-15_Construction_Common__Minimum_Standards__final___1_.pdf>, 2017 (last accessed on 19 July 2017).
[15] About CSCS. <https://www.cscs.uk.com/about/>, 2017 (last accessed on 19 July 2017).
[16] Hong Kong Housing Authority, Site Safety Handbook. <https://www.housingauthority.gov.hk/mini-site/site-safety/common/resources/handbook/201603/HB_res_tcen.pdf>, 2017 (last accessed on 19 July 2017).
[17] Centers for Disease Control and Prevention, Maintenance Worker Electrocuted While Attempting to Change a Light Bulb in Washington State. <https://www.cdc.gov/niosh/face/stateface/wa/04WA080.html>, 2017 (last accessed on 19 July 2017).
[18] Crane Accidents, More Training Needed for Crane Operators. <https://www.craneaccidents.com/2004/03/articles/more-training-needed-for-crane-operators/>; <https://www.craneaccidents.com/2009/12/report/update/tower-crane-breaks-up-in-shenzhen-death-toll-up-to-6/>, 2017 (last accessed on 19 July 2017).
[19] H. Li, X. Li, X. Luo, J. Siebert, Investigation of the causality patterns of non-helmet use behavior of construction workers, Autom. Constr. 80 (2017) 95–103, http://dx.doi.org/10.1016/j.autcon.2017.02.006.
[20] A. Kelm, L. Laußat, A. Meins-Becker, D. Platz, M.J. Khazaee, A.M. Costin, M. Helmus, J. Teizer, Mobile passive radio frequency identification (RFID) portal for automated and rapid control of personal protective equipment (PPE) on construction sites, Autom. Constr. 36 (2013) 38–52, http://dx.doi.org/10.1016/j.autcon.2013.08.009.
[21] M.-W. Park, I. Brilakis, Continuous localization of construction workers via integration of detection and tracking, Autom. Constr. 72 (2016) 129–142, http://dx.doi.org/10.1016/j.autcon.2016.08.039.
[22] P. Ott, M. Everingham, Implicit color segmentation features for pedestrian and object detection, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 723–730.
[23] L. Joshua, K. Varghese, Accelerometer-based activity recognition in construction, J. Comput. Civ. Eng. 25 (2010) 370–379, http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000097.
[24] J. Ryu, J. Seo, M. Liu, S. Lee, C.T. Haas, Action recognition using a wristband-type activity tracker: case study of masonry work, Constr. Res. Congr. 2016 (2016) 790–799, http://dx.doi.org/10.1061/9780784479827.080.
[25] R. Akhavian, A.H. Behzadan, Smartphone-based construction workers' activity recognition and classification, Autom. Constr. 71 (2016) 198–209, http://dx.doi.org/10.1016/j.autcon.2016.08.015.
[26] T. Cheng, J. Teizer, G.C. Migliaccio, U.C. Gatti, Automated task-level activity analysis through fusion of real time location sensors and worker's thoracic posture data, Autom. Constr. 29 (2013) 24–39, http://dx.doi.org/10.1016/j.autcon.2012.08.003.
[27] A. Khosrowpour, I. Fedorov, A. Holynski, J.C. Niebles, M. Golparvar-Fard, Automated worker activity analysis in indoor environments for direct-work rate improvement from long sequences of RGB-D images, Constr. Res. Congr. 2014 (2014) 729–738, http://dx.doi.org/10.1061/9780784413517.075.
[28] J. Yang, Z. Shi, Z. Wu, Automatic recognition of construction worker activities using dense trajectories, in: Proceedings of the International Symposium on Automation and Robotics in Construction, vol. 32, Vilnius Gediminas Technical University,
[37] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inform. Process. Syst. (2015) 91–99.
[38] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37, http://dx.doi.org/10.1007/978-3-319-46448-0_2.
[39] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788, http://dx.doi.org/10.1109/CVPR.2016.91.
[40] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 3464–3468.
[41] W. Choi, Near-online multi-target tracking with aggregated local flow descriptor, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3029–3037, http://dx.doi.org/10.1109/ICCV.2015.347.
[42] C. Kim, F. Li, A. Ciptadi, J.M. Rehg, Multiple hypothesis tracking revisited, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4696–4704, http://dx.doi.org/10.1109/ICCV.2015.533.
[43] J.H. Yoon, M.-H. Yang, J. Lim, K.-J. Yoon, Bayesian multi-object tracking using motion context from multiple objects, in: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2015, pp. 33–40.
[44] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, K. Schindler, MOTChallenge 2015: Towards a Benchmark for Multi-target Tracking. Available from: <arXiv:1504.01942>.
[45] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Sign. Process. Lett. 23 (2016) 1499–1503, http://dx.doi.org/10.1109/LSP.2016.2603342.
[46] R. Ranjan, V.M. Patel, R. Chellappa, HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. Available from: <arXiv:1603.01249>.
[47] X. Sun, P. Wu, S.C. Hoi, Face Detection using Deep Learning: An Improved Faster RCNN Approach. Available from: <arXiv:1701.08289>.
[48] H. Jiang, E. Learned-Miller, Face detection with the faster R-CNN, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017, http://dx.doi.org/10.1109/FG.2017.82.
[49] Y. Li, B. Sun, T. Wu, Y. Wang, Face detection with end-to-end integration of a ConvNet and a 3D model, in: European Conference on Computer Vision, Springer, 2016, pp. 420–436.
[50] S. Wan, Z. Chen, T. Zhang, B. Zhang, K.-K. Wong, Bootstrapping Face Detection with Hard Negative Examples. Available from: <arXiv:1608.02236>.
[51] P. Hu, D. Ramanan, Finding tiny faces, in: Conference on Computer Vision and Pattern Recognition 2016, 2016.
[52] D. Triantafyllidou, A. Tefas, A fast deep convolutional neural network for face detection in big visual data, in: INNS Conference on Big Data, Springer, 2016, pp. 61–70.
[53] J. Yu, Y. Jiang, Z. Wang, Z. Cao, T. Huang, UnitBox: an advanced object detection network, in: Proceedings of the 2016 ACM on Multimedia Conference, ACM, 2016, pp. 516–520, http://dx.doi.org/10.1145/2964284.2967274.
[54] WIDER FACE: A Face Detection Benchmark. <http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/>, 2017 (last accessed on 19 July 2017).
[55] University of Massachusetts, FDDB: Face Detection Data Set and Benchmark.
[56] X. Lu, Y. Yang, W. Zhang, Q. Wang, Y. Wang, Face verification with multi-task and multi-scale feature fusion, Entropy 19 (2017) 228, http://dx.doi.org/10.3390/e19050228.
[57] C. Ding, C. Xu, D. Tao, Multi-task pose-invariant face recognition, IEEE Trans. Image Process. 24 (2015) 980–993, http://dx.doi.org/10.1109/TIP.2015.2390959.
[58] Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf, DeepFace: closing the gap to human-level performance in face verification, in: Conference on Computer Vision and Pattern Recognition 2014, 2014, pp. 1701–1708, http://dx.doi.org/10.1109/CVPR.2014.220.
[59] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, Adv. Neural Inform. Process. Syst. (2014) 1988–1996.
[60] Y. Sun, D. Liang, X. Wang, X. Tang, DeepID3: Face Recognition with Very Deep Neural Networks. Available from: <arXiv:1502.00873>.
[61] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000
Department of Construction Economics & Property, 2015, p. 1. classes, in: Conference on Computer Vision and Pattern Recognition 2014, 2014,
[29] J. Yang, Z. Shi, Z. Wu, Vision-based action recognition of construction workers pp. 1891–1898. 10.1109/CVPR.2014.244.
using dense trajectories, Adv. Eng. Inf. 30 (2016) 327–336, http://dx.doi.org/10. [62] Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective,
1016/j.aei.2016.04.009. and robust, in: Conference on Computer Vision and Pattern Recognition 2015,
[30] J. Gong, C.H. Caldas, C. Gordon, Learning and classifying actions of construction 2015, pp. 2892–2900. 10.1109/CVPR.2015.7298907.
workers and equipment using bag-of-video-feature-words and Bayesian network [63] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face re-
models, Adv. Eng. Inf. 25 (2011) 771–782, http://dx.doi.org/10.1016/j.aei.2011. cognition and clustering, in: Proceedings of the IEEE Conference on Computer
06.002. Vision and Pattern Recognition, 2015, pp. 815–823. 10.1109/CVPR.2015.7298682.
[31] H. Guo, Y. Yu, M. Skitmore, Visualization technology-based construction safety [64] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, BMVC 1 (2015) 6.
management: a review, Autom. Constr. 73 (2017) 135–144, http://dx.doi.org/10. [65] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted
1016/j.autcon.2016.10.004. structured feature embedding, in: Conference on Computer Vision and Pattern
[32] J. Seo, S. Han, S. Lee, H. Kim, Computer vision techniques for construction safety Recognition 2016, 2016, pp. 4004–4012. 10.1109/CVPR.2016.434.
and health monitoring, Adv. Eng. Inf. 29 (2015) 239–251, http://dx.doi.org/10. [66] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, inception-resnet and the
1016/j.aei.2015.02.001. impact of residual connections on learning, in: ICLR 2016 Workshop, 2016.
[33] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep con- [67] E. Learned-Miller, G.B. Huang, A. RoyChowdhury, H. Li, G. Hua, Labeled faces in
volutional neural networks, Adv. Neural Inform. Process. Syst. (2012) 1097–1105. the wild: a survey, in: Advances in Face Detection and Facial Image Analysis,
[34] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate Springer, 2016, pp. 189–248.
object detection and semantic segmentation, in: Proceedings of the IEEE Conference [68] R. Starbuck, J. Seo, S. Han, S. Lee, A stereo vision-based approach to marker-less
on Computer Vision and Pattern Recognition, 2014, pp. 580–587. motion capture for on-site kinematic modeling of construction worker tasks,
[35] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional Comput. Civ. Build. Eng. 2014 (2014) 1094–1101.
networks for visual recognition, in: European Conference on Computer Vision, [69] P. Ott, M. Everingham, implicit color segmentation features for pedestrian and
Springer, 2014, pp. 346–361. 10.1007/978-3-319-10578-9_23. object detection, in: 2009 IEEE 12th International Conference on Computer Vision,
[36] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on IEEE, 2009, pp. 723–730. 10.1109/ICCV.2009.5459238.
Computer Vision, 2015, pp. 1440–1448. [70] H.L. X. Luo, D. Cao, F. Dai, J. Seo, S. Lee, Recognizing diverse construction activities

67
Q. Fang et al. Advanced Engineering Informatics 35 (2018) 56–68

[71] I.T. Weerasinghe, J.Y. Ruwanpura, J.E. Boyd, A.F. Habib, Application of Microsoft Kinect sensor for tracking construction workers, Constr. Res. Congr. 2012 (2012) 858–867, http://dx.doi.org/10.1061/9780784412329.087.
[72] M.-W. Park, I. Brilakis, Construction worker detection in video frames for initializing vision trackers, Autom. Constr. 28 (2012) 15–25, http://dx.doi.org/10.1016/j.autcon.2012.06.001.
[73] M. Memarzadeh, M. Golparvar-Fard, J.C. Niebles, Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors, Autom. Constr. 32 (2013) 24–37, http://dx.doi.org/10.1016/j.autcon.2012.12.002.
[74] I.T. Weerasinghe, J.Y. Ruwanpura, Automated data acquisition system to assess construction worker performance, Constr. Res. Congr. 2009 (2009) 61–70, http://dx.doi.org/10.1061/41020(339)7.
[75] S. Chi, C.H. Caldas, Automated object identification using optical video cameras on construction sites, Comput.-Aided Civ. Infrastruct. Eng. 26 (2011) 368–380, http://dx.doi.org/10.1111/j.1467-8667.2010.00690.x.
[76] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T.M. Rose, W. An, Detecting non-hardhat-use by a deep learning method from far-field surveillance videos, Autom. Constr. 85 (2018) 1–9, http://dx.doi.org/10.1016/j.autcon.2017.09.018.
[77] H.W. Kuhn, The Hungarian method for the assignment problem, Nav. Res. Logist. (NRL) 2 (1955) 83–97, http://dx.doi.org/10.1002/nav.3800020109.
[78] K. Soomro, A.R. Zamir, M. Shah, UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild, Available from: <arXiv:1212.0402>.
[79] H. Kuehne, H. Jhuang, R. Stiefelhagen, T. Serre, HMDB51: a large video database for human motion recognition, in: High Performance Computing in Science and Engineering '12, Springer, 2013, pp. 571–582.
[80] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Conference on Computer Vision and Pattern Recognition 2015, 2015, pp. 1–9, http://dx.doi.org/10.1109/CVPR.2015.7298594.
[81] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Conference on Computer Vision and Pattern Recognition 2016, 2016, pp. 770–778, http://dx.doi.org/10.1109/CVPR.2016.90.
[82] D. Chen, X. Cao, L. Wang, F. Wen, J. Sun, Bayesian face revisited: a joint formulation, in: Computer Vision–ECCV 2012, 2012, pp. 566–579.
[83] LabelImg: A Graphical Image Annotation Tool. <https://github.com/tzutalin/labelImg>, 2015 (last accessed on 20 June 2017).
[84] Large-scale CelebFaces Attributes (CelebA) Dataset. <http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html>, 2017 (last accessed on 19 July 2017).
[85] D.M. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, 2011.