
Crowdsourcing Construction Activity Analysis

from Jobsite Video Streams


Kaijian Liu, S.M.ASCE 1; and Mani Golparvar-Fard, A.M.ASCE 2

Abstract: The advent of affordable jobsite cameras is reshaping the way on-site construction activities are monitored. To facilitate the
analysis of large collections of videos, research has focused on addressing the problem of manual workface assessment by recognizing
worker and equipment activities using computer-vision algorithms. Despite the explosion of these methods, the ability to automatically
recognize and understand worker and equipment activities from videos is still rather limited. The current algorithms require large-scale
annotated workface assessment video data to learn models that can deal with the high degree of intraclass variability among activity cat-
egories. To address current limitations, this study proposes crowdsourcing the task of workface assessment from jobsite video streams.
By introducing an intuitive web-based platform for massive marketplaces such as Amazon Mechanical Turk (AMT) and several automated
methods, the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current
practices of workface assessment and also provide significantly large empirical data sets together with their ground truth that can serve
as the basis for developing video-based activity recognition methods. Six extensive experiments have shown that engaging nonexperts
on AMT to annotate construction activities in jobsite videos can provide complete and detailed workface assessment results with 85% ac-
curacy. It has been demonstrated that crowdsourcing has the potential to minimize time needed for workface assessment, provides ground
truth for algorithmic developments, and most importantly allows on-site professionals to focus their time on the more important task of
root-cause analysis and performance improvements. DOI: 10.1061/(ASCE)CO.1943-7862.0001010. © 2015 American Society of Civil
Engineers.
Author keywords: Activity analysis; Construction productivity; Video-based monitoring; Workface assessment; Crowdsourcing;
Information technologies.

1 Graduate Student, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, Newmark Civil Engineering Laboratory, 205 N. Mathews Ave., Urbana, IL 61801. E-mail: kliu15@illinois.edu
2 Assistant Professor and National Center for Supercomputing Applications (NCSA) Faculty Fellow, Dept. of Civil and Environmental Engineering and Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Newmark Civil Engineering Laboratory, 205 N. Mathews Ave., Urbana, IL 61801 (corresponding author). E-mail: mgolpar@illinois.edu
Note. This manuscript was submitted on October 31, 2014; approved on March 26, 2015; published online on May 29, 2015. Discussion period open until October 29, 2015; separate discussions must be submitted for individual papers. This paper is part of the Journal of Construction Engineering and Management, © ASCE, ISSN 0733-9364/04015035(19)/$25.00.

Introduction

On-site operations are among the most important factors that influence the performance of a construction project (Gouett et al. 2011). Timely and accurate productivity information on the labor and equipment involved in on-site operations can bring an immediate awareness of specific issues to construction management. It also empowers them to take prompt corrective actions, thus avoiding costly delays. To streamline the cyclical procedure of measuring and improving direct-work rates, the time proportion of activities devoted to actual construction, the Construction Industry Institute (CII) recently proposed new procedures for conducting activity analysis (CII 2010). Activity analysis offers a plausible solution for monitoring on-site operations and supports root-cause analysis on the issues that adversely affect their productivity. Nevertheless, the current procedure for implementing activity analysis has inefficiencies that prevent a widespread adoption. The limitations are (1) the large scale of the manual on-site observations that is needed to guarantee statistically significant workface data; and (2) the necessary visual judgments of the observers that may produce erroneous data because of the over-productiveness phenomenon caused by construction workers under direct observation, the instantaneous reaction of the observers to benchmarking activity categories, the necessary distance limits to construction workers, and finally, observers' bias and fatigue (Khosrowpour et al. 2014b). Current labor-intensive processes take away time from determining the root causes of issues that affect productivity and how productivity improvements can be planned and implemented (CII 2010).

To address the limitations of manual workface assessment, a large body of research has focused on methods that lead to automation. These methods range from applications of ultra wide band (Cheng et al. 2011; Teizer et al. 2007; Giretti et al. 2009), radio frequency identification (RFID) tags (Costin et al. 2012; Zhai et al. 2009), and global positioning system (GPS) sensors (Pradhananga and Teizer 2013; Hildreth et al. 2005) to vision methods using video streams (e.g., Peddi et al. 2009; Teizer and Vela 2009; Rezazadeh Azar et al. 2012). The majority of these methods build on nonvisual sensors and track the location of the workers and equipment. However, without interpreting the activities and purely based on location information, deriving workface data is challenging (Khosrowpour et al. 2014a). For example, for drywall activities, distinguishing between direct-work and tool-time purely based on location is difficult because during these activities, the location of a worker may not necessarily change.

In contrast to location tracking sensors, the growing number of cameras on jobsites and the rich information available in site videos provide a unique opportunity for automated interpretation of productivity data. Nevertheless, computer-vision methods are not advanced enough to enable detailed assessments from these videos.



This is because the methods for detecting and tracking equipment and workers, especially when workers interact with tools (e.g., Brilakis et al. 2011; Park and Brilakis 2012; Memarzadeh et al. 2013), and the methods for interpreting activities from long sequences of videos (e.g., Gong et al. 2011; Golparvar-Fard et al. 2013) are not mature. Beyond the CII-defined activities, the taxonomy of construction activities is also not fully developed to enable visual activity recognition at the operational level. Finally, training and testing the models used in computer-vision methods for activity analysis requires large amounts of empirical data which are not yet available to the research community. In the absence of efficient video interpretation methods, tedious manual reviewing will still be required to extract productivity information from recorded videos, and this takes away time from the more important task of conducting root-cause analysis.

In this paper, a new workface assessment framework is introduced which provides an easy way to collect and interpret accurate labor and equipment activity information from jobsite videos. The idea is simple: the task of workface assessment is crowdsourced from jobsite video streams. By introducing an intuitive web-based platform on Amazon Mechanical Turk (AMT), the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of activity analysis and also provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing automated video-based activity recognition methods. Through extensive validation of various parameters of the platform, it is shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can achieve accurate workface assessment results. In the following sections, the related works are reviewed, the methods used to develop the proposed tool are introduced, and the experimental results are discussed.

Related Work

Time-lapse photography and videotaping have proven for many years to be very useful means for recording workface activities (Golparvar-Fard et al. 2009). Since the earlier work of Oglesby et al. (1989) until now, many researchers have proposed procedures, guidelines, and also manual and semiautomated methods for the interpretation of jobsite video data. Videos have the advantage of being understandable by any visually-able person, provide detailed and dependable information, and allow detailed reviews by the analysts and on-site management away from the work sites. In the next section, some of the most relevant works on the topic of video-based workface assessment are first reviewed. Then, a review of the concept of crowdsourcing is provided, followed by research on video annotation tools and existing databases for activity recognition.

Computer-Vision Methods for Video-Based Construction Activity Analysis

Over the past few years, many computer-vision methods have emerged for inferring the activities of workers and equipment from jobsite videos. A reliable method for video-based activity analysis requires two interdependent components: (1) methods for detecting and tracking resources; and (2) procedures for activity recognition. The majority of the previous works have addressed these components as two separate tasks. Brilakis et al. (2011) and Park et al. (2011) applied scale invariant feature transforms to track construction resources in both 2D and 3D scenarios. Teizer and Vela (2009), Gong and Caldas (2009), Rezazadeh Azar and McCabe (2012), and Chi and Caldas (2011) explored construction resource detection and tracking by using different visual feature representations and learning algorithms. Park and Brilakis (2012) and Memarzadeh et al. (2013) explored a more structured approach by using machine learning techniques and shape and color templates for detection and tracking. Kim and Caldas (2013) also explored joint modeling of worker actions and construction objects as the basis of construction worker activity recognition.

A few studies have also focused on end-to-end activity analysis. For example, Gong and Caldas (2009) presented a concrete bucket model trained by boosted cascades of simple features to analyze its cyclic operations. Yang et al. (2014) explored a finite-state machine model to classify tower-crane activities into concrete pouring and nonconcrete material movements. Azar and McCabe (2012) introduced a logical framework using computer vision-based techniques to study construction equipment working cycles. Instead of assuming strong priors on the relationships between activities and locations as in the previous works, Gong et al. (2011), Golparvar-Fard et al. (2013), and Escorcia et al. (2012) proposed bag-of-words models with different discriminative and generative classifiers to recognize atomic activities of construction workers and equipment. By recognizing atomic activities, the observed activities are classified in self-contained videos wherein a single resource starts an activity and ends the same activity within the video. Khosrowpour et al. (2014b) is also one of the earliest attempts to recognize the full sequence of activities for multiple workers from red-green-blue-depth (RGB-D) data. RGB-D data collected from depth sensors such as Microsoft Kinect bypass the challenges in detection and tracking and provide sequences of worker-body skeletons which can be the input to such activity recognition methods. Jaselskis et al. (2015) also proposed an approach for monitoring construction projects in the field by off-site personnel through live video streams. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities is still limited. Challenges relate to the large existing variability in the execution of construction operations and the lack of formal taxonomies for construction activities in terms of expected worker/equipment roles and sequences of activities. The complexity of the visual stimuli in activity recognition in terms of camera motion, occlusions, background clutter, and viewpoint changes is another existing challenge. Finally and most importantly, there is a lack of data sets together with ground truth for more exhaustive research on automated activity recognition methods.

Crowdsourcing

To overcome the current limitations in the standard task of activity recognition, the computer-vision community has recently initiated several projects to investigate the potential of crowdsourcing. Crowdsourcing refers to the collaborative participation of a crowd of people to help solve a specific problem and typically involves a rewarding mechanism, for example, paying for participation (Howe 2008). In recent years, the Internet has enabled crowdsourcing in a broadened and more dynamic manner. Crowdsourcing from anyone, anywhere, as needed is now common for tasks ranging from image and video processing, information gathering, and data verification to creative tasks such as coding, analytics, and product development (Wightman 2010; Yuen et al. 2011; Shingles and Trichel 2014). The wide range of business in crowdsourcing has also promoted the development of specialized platforms:
• simple, microtask-oriented crowdsourcing: Amazon Mechanical Turk and Elance;
• complicated, experience-oriented crowdsourcing: 10EQS and oDesk;



• open-ended, creative crowdsourcing: IdeaConnection and Innocentive; and
• funding, consumption, and contribution crowdsourcing: Indiegogo, Kickstarter, and Wikipedia (Shingles and Trichel 2014).
Among all platforms, the Amazon Mechanical Turk, introduced by Amazon in 2005, has gained popularity within the computer-vision community. Traditionally, computer-vision researchers had to manually create substantially large amounts of annotations (i.e., labeled training data). This was always considered a simple but labor-intensive and costly process. However, with the assistance of the AMT marketplace, researchers, known as requesters, only need to post entire annotation tasks in the form of microtasks or human intelligence tasks (HITs) and compensate online users, known as workers or annotators, with predefined payments for the rendered results. Through crowdsourcing, easy human annotation tasks which are extremely difficult or even impossible for computers to perform can be accomplished in a timely manner. This is by leveraging the AMT's massive labor market which contains hundreds of thousands of annotators who can complete 60% of HITs in 16 h and 80% of HITs in around 64 h (Ipeirotis 2010). The task can also be done at a low price of $5 per hour on the AMT. A reasonable quality can also be achieved. This is because the AMT keeps track of the annotators' performance and uses that as a screening condition to guarantee an appropriate level of quality in the results. Also, many techniques have been developed to assist requesters with getting high-quality annotations.

Sorokin and Forsyth (2008) were the first within the computer-vision community to present a data annotation framework to quickly obtain inexpensive project-specific annotations through the AMT crowdsourcing platform. The proposed framework revolutionized large-scale static image annotation (Vondrick et al. 2013). The subsequent efforts to seek the value of massive data sets of labeled images promoted the design of efficient visual annotation tools and databases. Russell et al. (2008) introduced LabelMe as a web-based annotation tool that supports dense polygon labeling on static images. Deng et al. (2009) presented a crowdsourcing image annotation platform based on ImageNet, an image database of over 11 million images. Everingham et al. (2010) described a high-quality image collection strategy for the PASCAL visual object classes (VOC) challenge.

Crowdsourcing Video Annotations

Despite the success of crowdsourcing annotations for still imagery, the dynamic nature of videos makes their annotation more challenging (Vondrick et al. 2013). Video annotation requires cost-aware and efficient methods instead of frame-by-frame labeling (Vondrick et al. 2013; Wah 2006). Earlier works such as the ViPER-GT tool (Doermann and Mihalcik 2000) gathered groundtruth video data without any intelligent method for assisting the annotation task (Di Salvo et al. 2013). Obviously, the significant number of frames in a video requires smarter ways of propagating annotations from a subset of keyframes; otherwise, a method will not be scalable. More recently, Yuen et al. (2009) introduced LabelMe Video which employs linear interpolation with a constant 3D-velocity assumption to propagate nonkeyframe spatial annotations in a video. Ali et al. (2011) presented FlowBoost, a tool that can annotate videos from a sparse set of keyframe annotations. Kavasidis et al. (2012) proposed GTTool to support automatic contour extraction, object detection, and tracking to assist annotations across video frame sequences. Vondrick et al. (2013) presented a video annotation tool, VATIC, to address the nonlinear motions in video. Beyond the spatial aspects of video annotation, researchers have recently also focused on extracting temporal information for video annotation. Gadgil et al. (2014) presented a web-based video annotation system to analyze real-time surveillance videos by annotating each video event's time interval for forensic analysis. Heilbron and Niebles (2014) introduced an automatic video retrieval system to annotate the time intervals of videos that contain the activities of interest. Kim et al. (2014) presented ToolScape for crowdsourcing temporal annotations in videos.

Despite the popularity and the benefits of crowdsourcing computer-vision tasks, directly applying it to the task of video-based construction workface assessment can be challenging. Daily site videos exhibit different numbers of workers and equipment for various operations. Workers continuously interact with different tools, and both workers and equipment exhibit changing body postures even when the same activity is being performed. These issues, beyond the typical challenges in the task of activity recognition, could negatively affect the optimal length of annotation tasks or the number of necessary keyframes for annotation. Also, because of the complexity of construction operations and the lack of formal categorization beyond CII-defined activities, crowdsourcing video-based workface assessment necessitates a taxonomy and customized data frameworks to describe construction activities. Finally, the reliability of crowdsourcing for video-based construction workface assessment has not been validated. Particularly, recruiting nonexpert annotators from a crowdsourcing marketplace such as AMT may negatively affect the quality of the assessment results. Beyond the technical challenges, detailed experiments are necessary to examine the capability of nonexperts against expert control groups and also devise strategies for improving the accuracy of crowdsourcing tasks.

Method

To address the challenges of applying crowdsourcing to the video-based construction workface assessment task, a goal was set on creating a new web-based platform. The nonexpert crowd, and specifically Amazon Mechanical Turk, is heavily relied on to conduct workface assessment. Customized user interfaces have been designed to enable construction productivity data retrieval, visualization, and crossvalidation. Exhaustive experiments are designed and conducted to fine tune the parameters of the tool, including the annotation method, annotation frequency, and video length, for crowdsourcing video-based construction workface assessment as individual tasks on the AMT. A compositional structure taxonomy has also been created for construction activities to decode complex construction operations. The performance of the expert and nonexpert annotators in detecting, tracking, and recognizing activities is also exhaustively analyzed. Applying crossvalidation methods to improve workface assessment accuracy is also investigated. Particularly, several experiments are conducted to seek the optimal fold number for the crossvalidation process. Fig. 1 shows an overview of the workflows involved in leveraging the proposed crowdsourcing platform for workface assessment. A companion video of this manuscript also shows various functionalities. In the following sections, these modules and experiments are presented in more detail.

Collecting Jobsite Videos

Because of the lack of precedent experience on how to design and validate a crowdsourcing platform specific to video-based construction workface assessment, it is necessary to use real-world jobsite videos. This allows careful examination of the platform's performance throughout the entire design and validation process. The videos chosen for collection focus on concrete placement operations which contain a range of visually complex activities and are common on almost all projects. This provides various validation scenarios for examining the application of the workface assessment tool.




Fig. 1. Workflow in the crowdsourcing workface assessment tool (images by authors)

The collection of videos follows two principles: (1) the videos should cover various jobsite and video-recording conditions; and (2) the videos should exhibit different levels of difficulty for activity annotation. In this research, activity difficulty is defined as the physical and reasoning effort that is required to complete video-based workface assessment using the workface assessment tool.

To quantify the difficulty level in video annotations, the following criteria are introduced:
• construction activity conditions: the size of the construction crew, and the frequency of changes in the sequence of activities conducted by individual craft workers;
• visibility conditions: different occlusion conditions, illumination conditions, and background clutter; and
• recording conditions: camera viewpoint, distance, and camera-motion conditions.
These criteria help determine the amount of physical effort and reasoning that is needed to perform the assessment tasks. As shown in Table 1, the collected videos exhibit a large range of changes based on these criteria. This allows the capability of conducting construction workface assessment to be thoroughly investigated under different conditions.

Real-world videos of concrete placement operations from three different construction sites were collected. For validating the platform, a total of eight jobsite videos (45 min) were chosen. Based on the criteria defined in Table 1, these videos were classified into categories as easy, normal, and hard. Fig. 2 shows example snapshots from these videos and their levels of difficulty. The collected concrete placement videos cover almost all types of direct work, such as place concrete, erect form, position rebar, etc., and substantial amounts of nondirect work, such as preparatory work, material handling, waiting, etc.

Activity Analysis User Interface

The workface assessment tool has two main interfaces for task management and workface assessment. The task management interface assists requesters—site managers and engineers, or researchers—in managing workface assessment tasks, including task publication and result retrieval. The workface assessment interface provides annotators access to complete video-based workface assessment tasks. As shown in Fig. 1, first, the requesters use the task management interface to break an entire video of construction operations into several human intelligence tasks (HITs) and publish them online or offline. The online mode allows the AMT annotators to complete HITs, whereas the offline mode makes HITs accessible only to a predefined group of users which can, for example, include expert annotators invited by the requesters. Before publishing the videos for annotation, for privacy purposes, the videos can be processed using image analysis methods to automatically blur human faces when needed. Once logged in, the annotators can then accept published HITs to generate workface assessment results using the workface assessment interface. When all HITs belonging to the same video are completed, the requesters can then retrieve, visualize, and crossvalidate the assessment results. They can also generate formal assessment reports through the task management interface.

To access the task management interface, a requester should provide a valid username and password, as shown in Fig. 3. This verification step can address copyright and privacy issues of any uploaded video by protecting authentic requesters from having the videos under their accounts shared with unwanted parties.

Table 1. Various Criteria Used to Select the Candidate Videos for Experiments
Videos | Crew size (person) | Changes in activities | Occlusion | Illumination | Background clutter | Viewpoint | Camera distance (m) | Camera motion
Easy videos | 2 | Low | Rare | Daylight | Low | Level | 3–15 | None
Easy videos | 4–5 | Low | Rare | Daylight | Normal | Level | 5–15 | None
Easy videos | 3–5 | Low | Rare | Cloudy | Low | Level | 15–25 | Rare
Normal videos | 4–7 | Medium | Medium | Daylight | Medium | Level | 3–25 | Rare
Normal videos | 6–9 | High | Rare | Daylight | Medium | Level | 5–25 | Rare
Normal videos | 8 | Low | Severe | Sunny | Varied | Level | 5–25 | Severe
Hard videos | 10–12 | Medium | Severe | Daylight | Severe | Tilted | 15–45 | Rare
Note: Crew size and changes in activities describe the activity conditions; occlusion, illumination, and background clutter describe the visibility conditions; viewpoint, camera distance, and camera motion describe the recording conditions.




Fig. 2. Videos exhibit different levels of difficulty for video annotation purposes (images by authors)

Once the verification step is passed, the requester can use the following functions associated with the task management interface:
• Video links: This function assists the requesters in managing the progress of the annotators and controlling the quality of their work on the workface assessment HITs. To do so, the HITs are presented by unique names and hyperlinks and are also classified into two categories of published and completed.
• Video upload: The tool is prototyped to simultaneously support crowdsourcing workface assessment and collecting large-scale ground-truth data sets for both academia and industry. Thus, allowing requesters to upload their own videos is integral to the design of the task management interface. To achieve this, video upload is designed, shown in Fig. 4, as a function for the requesters to upload their videos and also automatically break them into several HITs based on a desired length. This function also associates the most frequently used labels with the HITs and then publishes them online or offline. The most frequently used labels are compositional structure taxonomies that describe the activities of the workers or equipment in a video.
• Crossvalidation and accuracy: The online mode of the tool is prototyped to leverage the knowledge of the nonexpert annotators from the AMT marketplace. To avoid using inaccurate results produced by these annotators, quality assurance and control should be applied in both preassessment and postassessment steps. To do so, crossvalidation and accuracy are introduced as functions that report the accuracy of the completed assessments against the groundtruth in the postassessment step.

Fig. 3. Task management interface: (a) log in; (b) video links




Fig. 4. Task management interface: (a) video upload; (b) crossvalidation and accuracy

They also allow the requester to crossvalidate different completed assessments of the same video to achieve a more accurate assessment result. Fig. 4 presents the interface for these functions.
• Video visualization: To retrieve activity analysis information at any desired level of granularity, video visualization presents the workface assessment results in the form of annotated videos, crew-balance charts, or pie charts. The annotated videos, shown in Fig. 5, contain worker/equipment trajectories and activity type information which are both annotated on the videos. The experts can monitor specific operations and conduct root-cause analysis for hidden issues that may not easily be revealed by other visualization forms. The crew-balance charts represent the time series of worker activities. Unlike annotated videos, the time series provide a depiction of the construction operations in a concise manner. Finally, the pie charts characterize the percentage of time spent on various activity categories, based on the CII (2010) taxonomy. This method is also capable of accurately examining the hourly average percentages for the overall jobsite which can provide significant benefit to activity analysis.

The workface assessment interface is where the annotators work on the assigned HITs. It is the most important component of the tool as it is where the actual workface assessment happens. To achieve simplicity and functionality in the design of the interface, structured rules are followed. The interface consists of three main components: (1) the video player, (2) the label drop-down list, and (3) the assisting functions, shown in Fig. 6. The player, taking the most space, shows the video content that requires assessment. The drop-down list contains labels that are presented hierarchically to describe the activities of each worker or equipment. Assisting functions are located above or under the video player and label drop-down list. The main functions include
• Workface assessment function: Using this function, in each assigned HIT, the annotators can annotate the labels necessary for the activities of the workers and equipment and their body postures. This involves (1) associating a role with each new construction worker/equipment, (2) drawing bounding boxes to localize the worker/equipment in 2D video frames, and (3) selecting labels to describe their activity types, body postures, and tools. As the video proceeds, this procedure continues with updating the positions of the bounding boxes and their associated labels. To create new roles for the workers, the annotators can use the +NewResource button which brings up a list of existing roles. After selecting a role, the cursor will be activated to allow the annotator to draw a bounding box around the construction worker or equipment and pinpoint the location in 2D. Next, the annotators select labels from the drop-down list to describe the observed activities. The Play and Rewind buttons together with the video progress control bar allow the role/activity/posture labels to be updated for the observed workers and equipment in a video. Upon completion, the annotator can save the annotation results by pressing the SaveWork button.
• Assisting function: This function is designed to help annotators generate accurate workface assessment results. It consists of Introductions, +NewLabel, and Options. Introductions are provided to help the annotators understand how the assigned tasks can be performed. For efficiency and conciseness, the drop-down list only contains the most frequently used labels which allows the annotators to quickly find the labels of interest. The platform does not require a comprehensive list of labels to begin with. Rather, the annotators can use +NewLabel to insert the necessary new labels (or customize existing ones) to complete their assessments.




Fig. 5. Video visualization in the task management interface: (a) annotated video; (b) crew-balance chart; (c) detailed and CII-type pie chart; the annotations in the upper left corner of each box show the role-activity-tool-body posture of each worker (images by authors)

The Options function enables annotators to adjust the video player and fine tune the monitoring settings. These settings include different video speeds, hiding/showing the bounding boxes and labels on the videos, and enabling/disabling resizing of the bounding boxes. To keep the user interface constrained and simple, all assisting functions are displayed in separate windows which can be triggered by their corresponding functional buttons.

Compositional Structure Taxonomy for Construction Activities

Activity analysis requires a detailed description of construction worker and equipment activities. A sequence of different activities allows the analysis of the root causes of low productivity rates and also the planning and implementation of productivity improvements. However, construction activities are complex to describe. They exhibit a large amount of interclass and intraclass variability among different activities that can be associated with the roles of the workers and equipment. Because of the dynamic nature of construction operations, the temporal sequence of these activities changes frequently as well. Without a systematic description, it will be difficult to provide accurate activity analysis information. To address the current limitation, the CII (2010) proposed a new taxonomy that classifies all activities into seven categories of direct work, preparatory work, tools and equipment, material handling, waiting, travel, and personal. Although this taxonomy is generally applicable to all construction operations, it does not provide the detailed description of tasks that would be necessary for the development of visual activity-recognition algorithms. Any vision-based method requires distinct visual features for worker and equipment activities, and that could only be achieved if different labels are used to describe each group of similar visual features.

In this research, a compositional structure taxonomy is introduced to decode complex construction activities in the following format: a worker role is conducting a CII activity category in the form of a visual activity category using a tool and body posture and is visible, occluded, or outside of the video frame. As a starting point, the worker type layer contains 19 different roles for construction workers such as concrete finisher, carpenter, electrician, bricklayer, etc. The second layer is the CII activity category which describes worker activities in the form of direct and nondirect work (preparatory work, tools and equipment, material handling, waiting, travel, and personal activity). The third layer is the visual activity category introduced to provide detailed information on activities, tools, and body postures for direct work activities. Tools vary based on different types of activities; nevertheless, they can be an important visual indicator for training vision-based algorithms. Because of the large number of tools involved in any given activity, the interface is designed such that it provides illustrative images of each tool to help annotators identify them easily. Posture can also indicate that an activity has changed, and thus, it can be beneficial for the training of the vision-based algorithms.
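As a minimal illustration of how a single label following this compositional structure could be represented in software, the sketch below defines a hypothetical Python record; the field names and example values are assumptions made for illustration and do not reflect the platform's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ActivityAnnotation:
    """One label for a single worker in a single frame, organized by the
    worker role / CII category / visual activity / tool / posture /
    visibility layers described above (all field names are hypothetical)."""
    frame: int                      # frame index within the HIT video
    worker_id: int                  # identifier assigned via +NewResource
    role: str                       # e.g., "concrete finisher"
    cii_category: str               # e.g., "direct work" or "waiting"
    atomic_activity: Optional[str]  # e.g., "concrete placement" (direct work only)
    tool: Optional[str]             # e.g., "shovel"
    posture: str                    # "bending", "sitting", or "standing"
    visibility: str                 # "visible", "occluded", or "outside of video"
    bbox: List[float]               # bounding-box pixel coordinates

# Example: a concrete finisher placing concrete with a shovel, fully visible
example = ActivityAnnotation(
    frame=120, worker_id=3, role="concrete finisher",
    cii_category="direct work", atomic_activity="concrete placement",
    tool="shovel", posture="bending", visibility="visible",
    bbox=[412.0, 468.0, 205.0, 330.0])
```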




Fig. 6. Workface assessment interface (images by authors)

This visual activity category layer with its detailed representation of activity, tool, and body posture can enable the extraction of proper visual features and the devising of appropriate computer-vision methods. The method also describes nondirect work with worker body postures to enable a better synthesis of nondirect work activities. Finally, jobsite videos contain severe occlusions and background clutter that can create noise in a representative activity data set. Thus, visibility information is associated with each annotation—occluded and outside of video. Because of space limitations, Fig. 7 presents only a part of the proposed compositional structure taxonomy of construction worker activities related to concrete placement operations. A more detailed representation is available at http://activityanalysis.cee.illinois.edu/.

Using the proposed taxonomy, construction professionals can analyze both CII- and task-level activities for root-cause productivity analysis purposes. To better plan productivity improvements, the interaction between task-level activities and the construction worker's posture and tool is introduced. Not only are these interactions meaningful for the development of robust vision-based algorithms, but they also enable construction professionals to analyze the relationship between tool utilization and direct work rate, which could lead to productivity improvements through better tool utilization.

The designed platform is also very flexible. For example, for quick assessments, the requester can choose the role-CII activity structure to annotate construction videos based on the CII taxonomy of activities. For detailed workface assessment and collecting data sets for the development of computer-vision methods, the requester can choose to leverage the worker-activity-posture-tool-visibility compositional structure taxonomy of construction activities. To facilitate the annotation process and minimize the time required to find a specific activity category in a long list, the platform automatically inserts the most frequently used compositional structure taxonomy during the task publication stage. To cope with the needs of different types of construction operations, it also enables the annotators to add missing or new taxonomies on the fly.

Extrapolating Annotations from Keyframes

The dynamic nature of videos makes frame-by-frame annotations necessary but labor-intensive and costly. Crowdsourcing can reduce the human effort, time, and cost of workface assessment; nevertheless, video annotation still needs strategies to propagate assessment results from a sparse set of keyframes. Keyframes are frames in a video sequence that benchmark the start and end of a construction activity (or a change in role). In the platform, these changes need to be captured manually by the annotators. The nonkeyframes are the following frames that contain the same construction activities as the previous keyframe, although the position of the workers or equipment performing the activity may have changed. In this section, the extrapolation methods that are developed to support propagation of the annotations from the keyframes to the nonkeyframes are described. Inspired by Vondrick et al. (2013), linear and detection-based extrapolation methods are implemented. T is defined as the total number of frames, where T = time × (frames per second), and B [Eq. (1)] is defined as the 2D pixel coordinates of each annotation bounding box

$$B = [x_{\min}, x_{\max}, y_{\min}, y_{\max}] \qquad (1)$$

where x_min, y_max denote the coordinates of the upper-left corner and x_max, y_min denote the coordinates of the lower-right corner of the bounding box. B_t (0 ≤ t ≤ T) is defined as the bounding box coordinates at time t. Fig. 8 shows examples of applying the extrapolation methods to generate nonkeyframe annotations (B_t) from known keyframe annotations (B_0 and B_T).



Concrete finisher
  Direct work (body posture: bending, sitting, standing):
  • Concrete placement: bucket, scoop, shovel
  • Spreading, leveling, and smoothing concrete: bull float, concrete spreader, concrete tamper, hand float, hand screed, hand trowel, power screed, power trowel, vibrator
  • Covering or protecting concrete
  • Curing: grout pump, power sprayer
  • Molding expansion joint and edge: concrete edger, concrete groover, straightedge
  • Surfacing: broom, concrete brush
  • Cutting concrete: concrete saw, line tool
  Nondirect work: preparatory work, material handling, tool and equipment, waiting, travel, personal

Carpenter
  Direct work (body posture: bending, sitting, standing):
  • Erecting/dismantling scaffold: level, plumb rule
  • Erecting/stripping formwork: hammer, power saw
  • Erecting/dismantling temporary structure: air riveter, claw hammer
  • Installing door and finishes: chisel, hammer, sander, square, tape measure, volume pneumatic nail gun
  • Assisting concrete pouring: bucket
  Nondirect work: preparatory work, material handling, tool and equipment, waiting, travel, personal

Ironworker
  Direct work (body posture: bending, sitting, standing):
  • Positioning rebar: rebar hickey, rod bending machine
  • Tying rebar: hand tying tool, plier, power tying tool
  • Cutting rebar: power saw, metal shear, hacksaw, bar cutter
  Nondirect work: preparatory work, material handling, tool and equipment, waiting, travel, personal

Fig. 7. Compositional structure taxonomy of construction worker activities (partial view for concrete placement operations); each entry lists the worker type, its CII activity category, the atomic activities under direct work with their associated tools, and the body postures



Fig. 8. Extrapolating nonkeyframes' annotations from keyframes (images by authors)

Linear extrapolation assumes that workers and equipment that have a constant velocity in 3D will also keep their velocity unchanged in 2D. Of course, similar to Yuen et al. (2009) in LabelMe Video, a homography-preserving shape interpolation method can be applied to rectify the linear extrapolation. However, anecdotal observations indicate that most construction videos exhibit limited perspective effects and depth variation, so that the 2D velocity follows the constant 3D velocity, and thus, linear extrapolations are applied. Assuming a constant 2D velocity in both the x and y directions, if a point in the x direction is at B_0(x_min) and then at B_T(x_min), then B_t(x_min) should follow

$$B_t(x_{\min}) = (T - t) \times \frac{B_T(x_{\min}) - B_0(x_{\min})}{T - 0} \qquad (2)$$

Applying all coordinates in the keyframes' bounding boxes to Eq. (2), B_t can be calculated as

$$B_t = (T - t) \times \frac{B_T - B_0}{T} \qquad (3)$$
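A short sketch of the constant-velocity propagation behind Eqs. (2) and (3) is given below. It is an illustrative implementation only, written as a per-coordinate interpolation between the two keyframe boxes under the same constant 2D velocity assumption; the function and variable names are assumptions rather than the platform's code.

```python
import numpy as np

def extrapolate_box(b0, bT, t, T):
    """Propagate a bounding box to frame t under the constant 2D velocity
    assumption used by the linear extrapolation method.

    b0, bT : keyframe boxes [x_min, x_max, y_min, y_max] at frames 0 and T.
    Returns the estimated box B_t; at t=0 it returns b0 and at t=T it
    returns bT, moving each coordinate at a constant per-frame rate.
    """
    b0, bT = np.asarray(b0, float), np.asarray(bT, float)
    return b0 + (t / float(T)) * (bT - b0)

# Example: a worker's box drifts to the right between two keyframes
b0 = [100, 160, 50, 130]   # keyframe annotation at t = 0
bT = [140, 200, 50, 130]   # keyframe annotation at t = T = 30
print(extrapolate_box(b0, bT, t=10, T=30))  # -> [113.33, 173.33, 50., 130.]
```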
To localize the position of the workers and equipment in the nonkeyframes, the detection-based extrapolation method treats keyframe annotations as positive samples for training machine-learning classifiers in computer-vision algorithms. To build a proper classifier for detecting workers and equipment, the visual features should be properly formed. Shape-based feature descriptors, such as histograms of oriented gradients (HOG) with or without histograms of color (HOC), have gained popularity in worker and equipment detection (e.g., Park and Brilakis 2012; Memarzadeh et al. 2013). Thus, visual feature descriptors x_i consisting of HOG and HOC features are built, as shown in the following equation:

$$x_i = \begin{bmatrix} \mathrm{HOG} \\ \mathrm{HSV} \end{bmatrix} \qquad (4)$$

where HOG is computed based on Dalal and Triggs (2005), and HOC is a nine-dimensional feature containing three means and six covariances computed from the hue, saturation, and value color channels. x_i is applied to the keyframe annotations to construct positive samples and to patches automatically extracted from the keyframes' background to construct negative samples. To learn a specific visual classifier that is able to assign high scores to positive samples, the same procedure as in Memarzadeh et al. (2013) is followed, and a binary support vector machine (SVM) is introduced per type of resource, because the SVM is a commonly used discriminative classifier in the machine learning literature and performs well with carefully created groundtruth data. Each binary SVM classifier is trained by feeding all training samples (x_i, +1) and (x_i, −1) to optimize the following objective function:

$$\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$
$$\text{subject to } \; y_i [w^T \phi(x_i) + b] \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, l \qquad (5)$$

where φ(x_i) is a kernel function that maps x_i to a higher-dimensional space, and C is the penalty parameter. The trained visual classifier with weight vector w (the normal vector to the hyperplane that separates positive and negative detection features) is then used to detect construction workers and equipment for the propagation of nonkeyframe annotations.

Because of the presence of frequent occlusions and background clutter on construction sites, it is very difficult to propagate nonkeyframe annotations at a 100% level of accuracy. Therefore, the constrained tracking of Vondrick et al. (2013) is applied to reduce the error in the detection of the workers and equipment. Constrained tracking finds the best candidate from all possible detections in each frame to constitute a path with minimum cost. The minimum-score path is defined as B_{0:T} = B_0, B_1, ..., B_{T-1}, B_T, where B_0 and B_T are the manually generated keyframe annotations, and B_1, ..., B_{T-1} are automatically generated from the trained SVM classifier. The optimization problem is then defined as

$$\underset{B_{1:T-1}}{\arg\min} \; \sum_{t=1}^{T-1} U_t(B_t) + P(B_t, B_{t-1}) \qquad (6)$$

where the unary cost U_t(B_t) is defined by Eq. (7), and the pairwise cost P(B_t, B_{t-1}) is defined by Eq. (8)

$$U_t(B_t) = \min[-w \cdot \phi(B_t), \alpha_1] + \alpha_2 \, \| B_t - B_t^{\mathrm{lin}} \|^2 \qquad (7)$$

$$P(B_t, B_{t-1}) = \alpha_3 \, \| B_t - B_{t-1} \|^2 \qquad (8)$$

The unary cost U_t(B_t) scores each potential detection in a frame using the visual classifier score and the l-2 norm of the difference between the SVM-detected bounding box and the linearly extrapolated one. The SVM associates the most likely prediction with the highest score, which minimizes the cost of the most likely detection. In this research, −w · φ(B_t) is used as the score of the visual classifier. Because of the presence of occlusions, some video frames t may not contain a groundtruth detection. These frames cause false negatives with small scores to become the potential B_t. In such situations, the annotation for the nonkeyframe relies on the linear extrapolation method, and the classifier score −w · φ(B_t) is replaced with a very small (near-zero) value α_1.

The pairwise cost calculates the smoothness of the detection path for each worker/equipment. In this research, the position of the bounding box does not change if the camera motion is minimal.
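To make the detection-based extrapolation concrete, the following sketch assembles a per-resource detector in the spirit of Eqs. (4) and (5) using the off-the-shelf HOG implementation in scikit-image and the SVM in scikit-learn; the patch handling, parameters, and function names are assumptions for illustration and are not the authors' implementation.

```python
import numpy as np
from skimage.color import rgb2gray, rgb2hsv
from skimage.feature import hog
from sklearn.svm import SVC

def describe_patch(patch_rgb):
    """HOG + color-statistics descriptor for one image patch (patches are
    assumed to be resampled to a common size beforehand). The color part
    follows the nine-dimensional HOC idea: three HSV channel means plus the
    six unique entries of the HSV covariance matrix."""
    shape_feat = hog(rgb2gray(patch_rgb), orientations=9,
                     pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    hsv = rgb2hsv(patch_rgb).reshape(-1, 3)
    means = hsv.mean(axis=0)
    cov_upper = np.cov(hsv, rowvar=False)[np.triu_indices(3)]
    return np.concatenate([shape_feat, means, cov_upper])

def train_resource_detector(positive_patches, background_patches):
    """Train one binary SVM per resource type, using keyframe annotations as
    positives and automatically extracted background patches as negatives."""
    X = np.array([describe_patch(p)
                  for p in positive_patches + background_patches])
    y = np.array([1] * len(positive_patches) + [-1] * len(background_patches))
    clf = SVC(kernel="rbf", C=1.0)     # kernelized SVM in the spirit of Eq. (5)
    clf.fit(X, y)
    return clf   # clf.decision_function(x) serves as the detection score
```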

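Similarly, the minimum-cost path of Eqs. (6)-(8) can be selected with a simple Viterbi-style dynamic program over per-frame candidate detections, as sketched below; the weights and helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def min_cost_path(candidates, scores, linear_boxes, a1=0.0, a2=1.0, a3=1.0):
    """Select one candidate box per frame so that the summed unary and
    pairwise costs of Eqs. (6)-(8) are minimized.

    candidates[t]   : array (n_t, 4) of candidate boxes for frame t
    scores[t]       : array (n_t,) of SVM scores for those candidates
    linear_boxes[t] : array (4,) linearly extrapolated box for frame t
    Returns the list of selected boxes, one per frame.
    """
    T = len(candidates)
    # Unary cost per candidate: clipped classifier score plus squared distance
    # to the linear extrapolation (Eq. 7).
    unary = [np.minimum(-scores[t], a1) +
             a2 * np.sum((candidates[t] - linear_boxes[t]) ** 2, axis=1)
             for t in range(T)]

    best = [unary[0]]          # best accumulated cost ending at each candidate
    back = [np.zeros(len(candidates[0]), dtype=int)]
    for t in range(1, T):
        # Pairwise cost between all candidate pairs of frames t-1 and t (Eq. 8).
        diff = candidates[t][:, None, :] - candidates[t - 1][None, :, :]
        pair = a3 * np.sum(diff ** 2, axis=2)          # shape (n_t, n_{t-1})
        total = pair + best[t - 1][None, :]            # accumulate costs
        back.append(np.argmin(total, axis=1))
        best.append(unary[t] + np.min(total, axis=1))

    # Backtrack the minimum-cost path (Eq. 6).
    idx = int(np.argmin(best[-1]))
    path = [candidates[-1][idx]]
    for t in range(T - 1, 0, -1):
        idx = int(back[t][idx])
        path.append(candidates[t - 1][idx])
    return path[::-1]
```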


Thus, a true path should have the minimum pairwise cost among all possible candidates. This pairwise cost has been adopted as a gauge to test and select the best candidates for the path in each frame.

Annotating Multiple Workers and Equipment

Workface assessment videos always contain a few crew members and possibly equipment. Thus, an efficient annotation method is imperative to reduce time and guarantee quality for dense annotations. Three annotation methods are introduced that are categorized as one-by-one, all-at-once, and role-at-once.

To explain each method, a task is defined as annotating a video with T frames and N construction workers/equipment. Using the all-at-once method, the annotators annotate or update the labels of all N workers and equipment in frame t (0 ≤ t ≤ T) simultaneously. This requires the annotators to watch the entire video prior to conducting the annotations. Using the one-by-one method, annotators annotate or update one worker or equipment for all T frames and then rewind the video to start on the next worker or equipment until all N workers and equipment are annotated. This method requires the annotators to watch the entirety of the video N times. Finally, the role-at-once method assumes there are M different roles. It then requires the annotators to annotate or update the labels of all workers and equipment with the same type of role for all T frames and rewind the video to start the next group of workers/equipment until all M role types are annotated. This method requires watching the video M times.

Previous works show that annotators on AMT tend to select the all-at-once approach as their primary annotation method. Compared with the other methods, the all-at-once method may seem to save time by only requiring the annotator to watch the video once. However, this choice is not optimal for all conditions. For example, when annotating all workers at once, the annotators may lose track of each specific construction worker. Reviewing and correcting such labels would require additional time. In this research, focusing only on one worker/equipment may increase the familiarity of the annotator and ultimately save time during the annotation updating process. One should also note that time is not the only indicator that needs attention. The trade-off between time and accuracy could also affect the final annotation results. To quantify the time, the accuracy, and their trade-off relationship in real application scenarios, experiments are conducted to examine the most efficient annotation approach. The results are discussed in the "Experimental Results" section.

Quality Assurance and Quality Control Methods for AMT

The AMT is a marketplace of hundreds of thousands of annotators for solving microtasks (HITs) quickly and effortlessly. However, because of poor or malicious judgment of the annotators, quick assessments may lead to erroneous results. To lower the risk of obtaining low quality results, AMT annotators are classified into the following: (1) skilled annotators who possess the ability to provide accurate workface assessment results; (2) ethical annotators who are honest but may be incapable of providing results with high accuracy because of poor judgments; and (3) unethical annotators who only try to finish as many tasks as possible in a random fashion just to be able to earn money. To reject the unethical annotators and improve the accuracy of the skilled and ethical annotators, preassessment and postassessment quality control steps are designed as follows:
• Preassessment: This step is used to select skilled and ethical annotators and reject unethical annotators from the AMT marketplace by testing their performance. In the preassessment procedure, a short testing video—for which the ground truth has been previously generated—is added to the start of each HIT. At the very first time, the annotators are directed to this testing video. The platform compares these annotations with the groundtruth and reports the annotator's accuracy for the given HIT. If the accuracy is above the requester's predefined threshold, the annotator can continue to work on the actual HITs. Otherwise, the annotator is prohibited from working on any other HITs.
• Postassessment: One should note that most if not all annotators are aware of the preassessment screening tests on the AMT marketplace (Ipeirotis et al. 2010; Snow et al. 2008; Le et al. 2010). This provides an opportunity for the unethical annotators to provide a perfect testing performance to get a pass and then generate random assessments for the actual HITs. Thus, besides quality assurance, quality control is needed to examine the accuracy of the actual assessments and to correct potential errors. This is done through a repeated-labeling approach. This requires a video to be annotated multiple times by several annotators. Sheng et al. (2008) also leveraged repeated labeling to deal with noisy data for assessment quality improvement. To do so, a matching schema is defined that uses a cost matrix to find corresponding annotations across the repeated/multiple assessment results and then applies a majority voting strategy to generate the final assessment. The matching schema randomly selects one annotation result as the reference and feeds the reference and the remaining assessments to the cost equation [Eq. (9)] to generate cost matrices for matching corresponding annotations

$$\mathrm{Cost}_t(N_R, N_I) = \begin{cases} S_t^R(N_R) - S_t^I(N_I), & \text{if } \mathrm{Role}_t^{N_R} \equiv \mathrm{Role}_t^{N_I} \\ 2 \times [S_t^R(N_R) - S_t^I(N_I)], & \text{otherwise} \end{cases}, \quad t = 0, \ldots, T \qquad (9)$$

where Cost_t(N_R, N_I) is the cost between the N_R-th annotation from the reference and the N_I-th annotation from the input at frame t (0 ≤ t ≤ T); S_t^R(N_R) − S_t^I(N_I) is the area difference (bounding box overlap) between the N_R-th and the N_I-th bounding boxes; and Role_t^{N_R} and Role_t^{N_I} are the annotation labels in comparison. The calculated costs are used to constitute the cost matrix for the reference and input annotation results, which is shown in Fig. 9. Annotations from the reference and the input are grouped based on the minimum cost value in the cost matrix. Once all groups have been matched, majority voting is performed on the annotations of the same construction worker/equipment at each frame. For example, groups which have an annotation count of less than half the repeat time are first eliminated. Then, the average bounding box coordinates AVG(B_t^all) of all repeated annotations for the same construction worker/equipment are calculated. Annotation(s) whose sum of B_t − AVG(B_t^all) is greater than the defined threshold at frame t are eliminated.
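As one possible reading of Eq. (9), the sketch below computes the cost between a reference and an input annotation by treating the area-difference term as the non-overlapping area of the two bounding boxes and doubling it when the role labels disagree; this interpretation and all names are assumptions made for illustration.

```python
def box_area(b):
    """Area of a box given as [x_min, x_max, y_min, y_max]."""
    return max(0.0, b[1] - b[0]) * max(0.0, b[3] - b[2])

def overlap_area(a, b):
    """Area of the intersection of two boxes in the same format."""
    w = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    return w * h

def cost(ref, inp):
    """Cost between a reference and an input annotation at one frame, in the
    spirit of Eq. (9): the non-overlapping area of the two boxes, doubled
    when the role labels disagree."""
    area_diff = box_area(ref["bbox"]) + box_area(inp["bbox"]) \
        - 2.0 * overlap_area(ref["bbox"], inp["bbox"])
    return area_diff if ref["role"] == inp["role"] else 2.0 * area_diff

# Example: the same worker annotated by two annotators with slightly shifted boxes
ref = {"role": "carpenter", "bbox": [100, 160, 50, 130]}
inp = {"role": "carpenter", "bbox": [104, 164, 52, 132]}
print(cost(ref, inp))  # small cost -> likely the same worker
```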



Fig. 9. Cost matrix at frame t: the columns index the reference annotations (1, 2, …, N_R, N_R+1, …), the rows index the input annotations (1, 2, …, N_I, N_I+1, …), and each cell holds the corresponding cost, e.g., Cost_t(N_R, N_I)
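Continuing the consolidation step described around Fig. 9, the following sketch groups input annotations against a reference by minimum cost and then majority-votes each group into a single annotation per worker and frame; the cost function is passed in (for example, the one sketched above), and the thresholds and names are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def match_to_reference(reference, assessments, cost_fn):
    """Group annotations at one frame: every annotation from each repeated
    assessment is assigned to the reference annotation with minimum cost,
    mirroring the cost-matrix matching illustrated in Fig. 9."""
    groups = {i: [ref] for i, ref in enumerate(reference)}
    for assessment in assessments:
        for ann in assessment:
            costs = [cost_fn(ref, ann) for ref in reference]
            groups[int(np.argmin(costs))].append(ann)
    return groups

def consolidate(group, repeat_time, dist_threshold=50.0):
    """Majority-vote one matched group for a single worker/equipment at one
    frame; the elimination rules follow the text, with illustrative values."""
    if len(group) < repeat_time / 2.0:      # too few repeated labels: drop group
        return None
    boxes = np.array([a["bbox"] for a in group], dtype=float)
    avg = boxes.mean(axis=0)
    kept = [a for a, b in zip(group, boxes)
            if np.abs(b - avg).sum() <= dist_threshold]   # drop outlier boxes
    if not kept:
        return None
    kept_boxes = np.array([a["bbox"] for a in kept], dtype=float)
    majority_role = Counter(a["role"] for a in kept).most_common(1)[0][0]
    return {"role": majority_role, "bbox": kept_boxes.mean(axis=0).tolist()}
```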



sum of Bt − AVGðBall t Þ is greater than the defined threshold of 10þ nonexpert annotators—typically found on AMT—were as-
at frame t are eliminated. Finally, the average of bounding sembled together with a control group of experts [5þ professionals,
boxes’ coordinates was recalculated and the majority label who in this experiment are (1) field engineers with experience on
for each level of compositional structure taxonomy from the productivity assessments; and (2) students that took a course in pro-
rest of annotations was selected to generate the final annota- ductivity and have experience on productivity assessment]. All ex-
tion for each construction worker/equipment—the annotation pert participants in this experiment have done at least productivity
group—at frame t. assessment for more than a one-day operation. Easy, normal, and
Although repeated-labeling improves the quality of workface as- hard videos of concrete placement operations introduced earlier in
sessment, the unnecessary repeated labelings will cost extra money the paper were leveraged. With these videos, three separate experi-
and time. To save money and time, it is necessary to find an optimal ments were conducted with the annotators from the controlled
repeat time. Thus, experiments are conducted using crossvalidation group of construction experts to investigate the impact of the differ-
method to examine how the accuracy will be changed based on dif- ent annotation methods, video lengths, and annotation frequencies
ferent repeat times which are discussed in the following section. on the accuracy of the workface assessment results.
Designing Microtasks from Construction Videos

A video of a construction operation is typically several hours long. This makes assessing it much more difficult than completing a typical short-length microtask on the AMT. To make crowdsourcing feasible, an entire video is broken down into several shorter HITs. For effectiveness, the following is considered:
• The length of a HIT: The longer a HIT is, the less time the annotator needs to complete the annotation task for the entirety of the video. This is because the annotator spends less time on understanding the video content and does not need to watch a video multiple times. However, longer HITs can lead to tiredness and boredom of the annotators, which can in turn lower the accuracy and increase the annotation time. To study this trade-off, experiments were conducted and their results are presented in the "Experimental Results" section.
• Annotation frequency: Although dense labeling seems desirable for improving workface assessment accuracy, overlabeling may lead to cost and time overruns. Manually annotating a sparse set of keyframes and applying extrapolation to generate annotations for the nonkeyframes can provide the same level of accuracy with less time and cost.
• Method of stitching several HITs together for deriving final assessment results: The platform breaks an entire operation video into multiple microtask HITs. To stitch the results of these HITs and derive the final workface assessment result, a one-second overlap is placed between consecutive HITs. Then, the same method used to match repeated annotation results, shown in Eq. (9), is applied to all the overlapping annotations of the HITs, and the results are stitched together to derive the final workface assessment results.

Results and Discussion

Experiment Setup and Performance Measures

To identify the key parameters of the platform and validate the effectiveness of crowdsourcing the workface assessment task, a pool of 10+ nonexpert annotators (typically found on AMT) was assembled together with a control group of experts [5+ professionals, who in this experiment are (1) field engineers with experience in productivity assessments; and (2) students who took a course in productivity and have experience in productivity assessment]. All expert participants in this experiment have performed productivity assessment for at least a one-day operation. The easy, normal, and hard videos of concrete placement operations introduced earlier in the paper were leveraged. With these videos, three separate experiments were conducted with the annotators from the controlled group of construction experts to investigate the impact of the different annotation methods, video lengths, and annotation frequencies on the accuracy of the workface assessment results.
To validate the hypothesis that crowdsourcing video-based construction workface assessment through the AMT marketplace is a reliable approach, three additional experiments were conducted on the impact of the platform on accuracy by (1) comparing the performance of the nonexpert annotators with the controlled group of construction experts; (2) testing the linear extrapolation and detection-based extrapolation methods and presenting the performance of each extrapolation method against ground truth data generated by the expert annotators; and (3) testing the performance of the postassessment quality control procedure and exploring the best repeated-labeling times for a desirable level of accuracy by experimenting with crossvalidation on different randomly selected folds from both expert and nonexpert annotation results.
To compare experiments and choose the parameters that achieve optimal performance, two validation criteria were chosen: (1) the annotation time spent to complete each experiment; and (2) the accuracy of the workface assessment results. Because the platform applies the compositional structure taxonomy of construction activities, three separate discussions on accuracy are presented:
1. Completeness accuracy examines the annotators' capability to completely annotate all workers in a HIT video without missing one.
2. Bounding box accuracy investigates the accuracy of worker localization. In this case, 50% overlap between the experimental assessment and the ground truth is used as the acceptable threshold for accuracy.
3. Tool accuracy examines the annotators' capability to correctly label construction tools or the nondirect work categories.
In the following section, each experiment is introduced and the experimental results are reported.
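To make these three measures concrete, the sketch below shows one way they could be computed for a single HIT by comparing crowdsourced annotations against ground truth. The data structures and the IoU helper passed in are assumptions for illustration, not the platform's implementation.

```python
# Illustrative sketch: completeness, bounding box, and tool accuracy for one HIT.
def evaluate_hit(assessed, ground_truth, iou_fn, iou_threshold=0.5):
    """assessed/ground_truth: dicts mapping worker id -> {'box': ..., 'tool': ...}."""
    found = [w for w in ground_truth if w in assessed]
    completeness = len(found) / len(ground_truth)
    # A localization counts as correct when the overlap reaches the 50% threshold.
    box_hits = [w for w in found
                if iou_fn(assessed[w]["box"], ground_truth[w]["box"]) >= iou_threshold]
    bounding_box_accuracy = len(box_hits) / len(ground_truth)
    # Tool accuracy: the labeled tool (or nondirect work category) must match.
    tool_hits = [w for w in found if assessed[w]["tool"] == ground_truth[w]["tool"]]
    tool_accuracy = len(tool_hits) / len(ground_truth)
    return completeness, bounding_box_accuracy, tool_accuracy
```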



Experimental Results

Annotation Method
The experiments for choosing the best annotation method include leveraging the one-by-one, all-at-once, and role-at-once methods to annotate the easy, normal, and hard videos. The time durations (in seconds) required for each annotation method across the expert annotator group are shown in Fig. 10; the average and total annotation times are provided in Table 2.

Fig. 10. Time spent using each annotation method for easy (0-15 min), normal (15-30 min), and hard (30-45 min) videos

Table 2. Average and Total Annotation Time for Each Annotation Method (time in seconds)
Average time per video:
Method  Easy_01  Easy_02  Easy_03  Normal_01  Normal_02  Normal_03  Hard_01  Hard_02  Hard_03
AM01    550      592      534      1,491      1,317      1,359      1,307    1,084    1,254
AM02    363      836      607      1,404      1,276      1,879      871      823      804
AM03    478      922      788      1,736      1,507      1,647      1,174    908      1,576
Total time per difficulty level:
Method  Easy     Normal   Hard
AM01    8,380    20,841   18,232
AM02    9,034    22,801   12,495
AM03    10,943   24,459   18,298
Note: AM01 = one-by-one; AM02 = all-at-once; AM03 = role-at-once.

From these results, it is observed that when there is a small number of construction workers in a video, the one-by-one annotation method shows the best performance. As the number of workers increases, the all-at-once method performs best. The one-by-one annotation method requires repeating the video for each worker; thus, it takes longer to annotate the hard videos. It is also observed that when the activities of each construction worker change frequently, the one-by-one labeling method performs best. This is because the method saves time by allowing annotators to focus on one construction worker at a time, which minimizes the chance of mistakes and the need for unnecessary revisions. In contrast, a high frequency of changes in worker activities overwhelms annotators who use the all-at-once or role-at-once methods. These methods require extra time for revising activity categories, because the annotators may need to rewind the video frequently to make all annotations consistent with one another. When the worker activity categories do not change frequently, the all-at-once method shows the best performance.
The results in Table 3 show that the accuracy of the assessment results for categorizing activities does not change significantly across the different annotation methods. However, among all, the role-at-once method shows the highest accuracy in labeling the worker roles. This is because the method requires the annotators to reason about the role of the workers, which consequently increases the accuracy of labeling roles.

Table 3. Average Accuracy of Each Annotation Method
Method  Completeness  Bounding box  Role  Activity  Posture  Tool
AM01    0.95          0.98          0.83  0.80      0.97     0.84
AM02    0.95          0.98          0.88  0.71      0.97     0.83
AM03    0.97          0.98          0.99  0.80      0.96     0.82

Impact of the Video Length
To examine the relationship between video length and the speed of workface assessment, an experiment was conducted using three different video lengths of 10, 30, and 60 s. The experimental results, as shown in Fig. 11 and Table 4, indicate that increasing the video length can reduce the annotation time. It is observed that videos with a 60-s length require the least annotation time. In particular, the saving in annotation time from posting 60-s videos is significant for the easy and normal videos. For the hard videos, presenting 60-s videos does not significantly save annotation time compared to posting videos with other durations. This is because combining six 10-s videos into one larger 60-s video allows the annotators to become familiar with the video content, which in turn reduces the time needed for interpreting the content multiple times and reduces the need for redrawing the bounding boxes.
As shown in Table 5, choosing videos with different lengths does not make a significant difference in the accuracy of the assessment results. However, compared with the other video lengths, the 60-s video length exhibits lower accuracy for labeling tools (around 10% lower). This suggests that a longer video length can cause tiredness and boredom for the annotators, which lowers the accuracy compared with the annotation of shorter videos.

Frequency of Choosing Keyframes for Annotation
Annotating a sparse set of frames, as opposed to annotating frames one by one, can minimize the annotation time, but it may also negatively impact the accuracy of the assessment results. To explore the trade-off relationship between time and accuracy, three fixed annotation frequencies of three, five, and nine times per minute were experimented with. In other words, the annotators were asked to use only these prefixed times per minute for their annotations. The annotation times for each frequency are shown in Fig. 12, and the average and total time durations of the annotations are presented in Table 6. The accuracy of each frequency is provided in Table 7. From these experiments, the hypothesis that sparse annotation can save time is validated: the sparser the annotations are, the less time annotation requires. The results particularly show that this method significantly shortens the annotation time for the easy and normal videos, which exhibit a high frequency of changes in activity categories. Nevertheless, the gains are not as obvious for the hard videos, which exhibit a low frequency of change in these categories.
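To illustrate how a fixed annotation frequency such as three, five, or nine times per minute translates into concrete keyframes, the short sketch below picks evenly spaced keyframe indices for a clip. The frame rate and function name are assumptions for illustration only.

```python
# Illustrative sketch: choose keyframe indices for a fixed annotation frequency.
def keyframe_indices(clip_seconds, times_per_minute, fps=30):
    """Evenly spaced keyframes for a clip annotated 'times_per_minute' times."""
    n_keyframes = max(1, round(times_per_minute * clip_seconds / 60.0))
    total_frames = int(clip_seconds * fps)
    spacing = total_frames / n_keyframes
    return [int(i * spacing) for i in range(n_keyframes)]

# A 60-s HIT annotated three times per minute yields three keyframes;
# the remaining frames are filled in by extrapolation.
print(keyframe_indices(60, 3))   # [0, 600, 1200]
print(keyframe_indices(60, 9))   # nine keyframes per 60-s clip
```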



Although choosing a smaller number of keyframes saves annotation time, it also lowers the accuracy of the assessment results. As shown in Tables 6 and 7, the three times per minute annotation frequency has the lowest average accuracy. This, however, does not mean that requesters who prefer higher accuracies in their workface assessments should avoid lower annotation frequencies altogether. Rather, the experimental results indicate that requesters can conduct a cost-benefit analysis to weigh accuracy against the time needed for completing the workface assessment tasks. For example, the results show that for 60-s videos, the three times per minute frequency is 66% faster than the other annotation frequencies and results in only a 7% reduction in the accuracy of the workface assessments.

Fig. 11. Annotation time for easy (0-15 min), normal (15-30 min), and hard (30-45 min) videos

Fig. 12. Annotation time of each video annotation frequency: easy (0-15 min), normal (15-30 min), and hard (30-45 min) videos

Table 5. Average Accuracy in Workface Assessment Results of Each Video Length
Length (s)  Completeness  Bounding box  Role  Activity  Posture  Tool
10          0.95          0.98          0.83  0.80      0.97     0.84
30          0.90          0.95          0.88  0.81      0.97     0.83
60          0.90          0.96          0.90  0.82      0.97     0.74
Note: The bold font shows the maximum average accuracy in each category.

Expert versus Nonexpert Annotator
To validate the reliability of crowdsourcing on the AMT marketplace, an experiment was conducted to compare the annotation time and accuracy of a large pool of nonexperts (10+) against a controlled group of construction experts (5+). Fig. 13 shows the difference in annotation time between the nonexpert and expert annotators: on average, an expert is 22% faster than a nonexpert annotator.
The accuracy of annotations between the nonexpert and expert control groups was also compared, and linear regression was used to interpolate all observation points. As shown in Figs. 14 and 15, the results testify that the experts, in general, perform slightly better than the nonexpert groups on the AMT across the categories of completeness of work, bounding box, body posture, role, activity, and tool. Although the nonexpert annotators produce higher accuracy in selecting roles, as shown at the top of Fig. 15, the difference is less than 2% and is not necessarily more advantageous. Table 8 compares the accuracy of the workface assessment results between the nonexpert and expert groups; the difference is only about 3% in favor of the expert group. This encouraging result testifies that for video-based workface assessment, nonexpert annotators on the AMT have the potential to perform as well as the expert groups.

Crossvalidation
The tool applies repeated labeling as a postassessment quality control step. In this case, the requester can improve the accuracy of the assessments and also minimize the risk of collecting inaccurate data. The idea is to crossvalidate the repeated labeling results across multiple AMT annotators to generate final assessment results with potentially higher accuracies. However, unnecessary repetitions lead to additional cost and can potentially increase the assessment time (unless the annotators work simultaneously). Either way, it is necessary to explore how many rounds of labeling are needed to guarantee satisfactory workface assessment results. Experimental results are presented in Figs. 16(a-c).

Table 4. Average and Total Time Spent on Annotating Videos with Different Lengths (time in seconds)
Average time per video:
Length (s)  Easy_01  Easy_02  Easy_03  Normal_01  Normal_02  Normal_03  Hard_01  Hard_02  Hard_03
10          550      592      534      1,492      1,317      1,359      1,307    1,084    1,254
30          488      655      527      1,292      1,230      1,835      662      529      850
60          279      377      354      806        635        1,062      513      436      614
Total time per difficulty level:
Length (s)  Easy     Normal   Hard
10          8,380    20,841   18,232
30          8,349    21,784   10,227
60          5,050    12,520   7,812
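The "total time per difficulty level" rows of Table 4 are what support the reported annotation-time savings of the 60-s HITs. The short calculation below, a sketch rather than the authors' analysis script, reproduces those percentages from the table:

```python
# Total annotation time (s) per HIT length, summed over easy/normal/hard videos (Table 4).
totals = {10: 8380 + 20841 + 18232,   # 47,453 s
          30: 8349 + 21784 + 10227,   # 40,360 s
          60: 5050 + 12520 + 7812}    # 25,382 s

saving_vs_10s = 1 - totals[60] / totals[10]   # ~0.47
saving_vs_30s = 1 - totals[60] / totals[30]   # ~0.37
print(f"60-s HITs save {saving_vs_10s:.0%} vs 10-s and {saving_vs_30s:.0%} vs 30-s clips")
```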



Table 6. Average and Total Annotation Time Spent with Each Annotation Frequency (time in seconds)
Average time per video:
Frequency (times/min)  Easy_01  Easy_02  Easy_03  Normal_01  Normal_02  Normal_03  Hard_01  Hard_02  Hard_03
9                      184      410      313      588        450        608        334      260      401
5                      488      655      161      337        336        460        253      182      292
3                      83       125      118      265        189        237        243      198      273
Total time per difficulty level:
Frequency (times/min)  Easy     Normal   Hard
9                      4,586    8,229    4,975
5                      2,572    5,668    3,630
3                      1,636    3,451    3,575

Table 7. Average Accuracy of Assessment Results for Different Annotation Frequencies
Frequency (times/min)  Completeness  Bounding box  Role  Activity  Posture  Tool
9                      0.94          0.90          0.99  0.78      0.93     0.83
5                      0.87          0.85          0.95  0.74      0.94     0.86
3                      0.83          0.77          0.86  0.66      0.95     0.85
Note: The bold font shows the maximum average accuracy in each category.

Fig. 13. Annotation time difference between expert and nonexpert annotators

Fig. 14. Percent of completeness, bounding box, and posture accuracy difference between expert and nonexpert annotators

Fig. 15. Percent role, activity, and tool accuracy differences between expert and nonexpert annotators

To conduct a systematic experiment, the worst annotation result was chosen from each video category as the original annotation (onefold), and then annotations from each video category were randomly selected to constitute threefold, fourfold, fivefold, sixfold, sevenfold, and eightfold crossvalidations. For the easy, normal, and hard videos, it is observed that the increase in accuracy tends to be steady after a threefold crossvalidation. Figs. 16(a-c) show the results for the easy, normal, and hard videos, respectively. Fig. 16(c) also shows that the activity accuracy of the hard videos drops from a threefold to an eightfold crossvalidation. The accuracy drop after a sevenfold crossvalidation in Fig. 16(a) may result from unnecessary repeated labelings, which generate more erroneous data and thus cause incorrect labels to stand out. The activity accuracy drop in Fig. 16(c) may also result from unnecessary repeated labeling or from the farther camera distance, which can challenge the accuracy of the observations. Based on the average performance of each fold of crossvalidation, it was concluded that a threefold crossvalidation provides the optimal performance and that increasing the fold number increases the risk of producing erroneous data.

Linear and Detection-Based Extrapolation Method
To assist AMT annotators in generating nonkeyframe annotations, the tool enables linear and detection-based extrapolation methods. The extrapolation methods treat user-assisted keyframe annotations as input and generate nonkeyframe annotations automatically.
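A linear extrapolation of the kind mentioned above can be as simple as interpolating bounding-box coordinates between two user-annotated keyframes. The following is a minimal sketch under that assumption; the detection-based variant, which refines such estimates with a worker detector, is not shown.

```python
# Illustrative sketch: linearly interpolate bounding boxes between two keyframes.
def interpolate_boxes(frame_a, box_a, frame_b, box_b):
    """Yield (frame, box) for every frame between two annotated keyframes."""
    span = frame_b - frame_a
    for f in range(frame_a, frame_b + 1):
        t = (f - frame_a) / span
        box = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
        yield f, box

# Keyframes drawn by the annotator at frames 0 and 600 of a HIT.
for frame, box in interpolate_boxes(0, (100, 200, 180, 360), 600, (160, 210, 240, 370)):
    pass  # each nonkeyframe receives an estimated (x1, y1, x2, y2) box
```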




Fig. 16. Crossvalidation results for (a) easy; (b) normal; and (c) hard videos

Table 8. Difference in Accuracies between the Expert and Nonexpert Annotators
Category       Accuracy of an expert when compared with a nonexpert
Completeness   +0.02
Bounding box   +0.01
Role           -0.01
Activity       +0.03
Posture        0.00
Tool           +0.03

To examine the performance of each method under different video conditions and annotation frequencies, the two extrapolation methods were tested on the easy, normal, and hard videos with different annotation frequencies. Annotation frequency is indicated by the average number of clicks per frame per construction worker, which includes 0.001, 0.005, 0.01, 0.05, and 0.1. Figs. 17(a-c) present the error rates of each method against the ground-truth annotations.
Fig. 17 illustrates that an increase in annotation frequency leads to a decrease in the error rate for both extrapolation methods. It is observed that increasing the annotation frequency beyond 0.01 clicks per frame per construction worker only marginally reduces the error rate. This experiment indicates that dense annotation does not necessarily guarantee higher annotation accuracy.
As indicated by the error rates of both extrapolation methods in Figs. 17(a and b), linear extrapolation performs as well as the detection-based extrapolation method, and the difference in error rate is within 5% on average. However, it is also observed that the linear extrapolation method performs much better than the detection-based extrapolation method in Fig. 17(c). This difference is likely caused by the difficulties in extracting effective visual features and detecting workers in the cluttered construction site conditions of the hard videos.

Fig. 17. The error rates of the linear and detection-based extrapolation methods for annotating nonkeyframes for (a) easy; (b) normal; and (c) hard videos
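The error-rate curves in Fig. 17 can be thought of as the outcome of a simple evaluation loop: for each annotation frequency, extrapolate the nonkeyframe boxes and count how often they miss the ground truth, here using the same 50% overlap criterion introduced earlier. The sketch below shows that loop under assumed data structures and helper names; it is not the evaluation code used in the study.

```python
# Illustrative sketch: error rate of an extrapolation method vs. annotation frequency.
def error_rate(extrapolated, ground_truth, iou_fn, threshold=0.5):
    """Fraction of frames whose extrapolated box overlaps ground truth by less than 50%."""
    misses = sum(1 for f, box in extrapolated.items()
                 if iou_fn(box, ground_truth[f]) < threshold)
    return misses / len(extrapolated)

def sweep(frequencies, annotate_fn, extrapolate_fn, ground_truth, iou_fn):
    """annotate_fn(freq) -> keyframe boxes; extrapolate_fn -> per-frame boxes."""
    return {freq: error_rate(extrapolate_fn(annotate_fn(freq)), ground_truth, iou_fn)
            for freq in frequencies}

# e.g., sweep([0.001, 0.005, 0.01, 0.05, 0.1], ...) would trace one curve of Fig. 17.
```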



Discussion on the Proposed Method and Research Challenges

The results validate the hypothesis that crowdsourcing construction activity analysis from jobsite videos on the AMT, a marketplace with nonexpert annotators, is a reliable approach for conducting activity analysis. In addition, the platform facilitates the collection of large data sets with their ground truth that could be used for the development of computer-vision algorithms for automatic activity recognition. In particular, it is shown that expert annotators are, on average, 22% faster than nonexpert annotators in terms of their annotation time; however, the accuracy of annotation among the nonexperts is within 3% of the accuracy of the expert groups. To fine-tune the platform, the impacts of different annotation methods, different HIT video lengths, and the frequency of requiring annotations were experimented on and discussed. Based on these experimental results, the following conclusions can be made:
1. The one-by-one annotation method works best with videos that have a small number of construction workers and a high frequency of changes in activities, whereas the all-at-once annotation method works best with videos that have a high number of construction workers and a low frequency of changes in work activities.
2. Increasing the HIT video length can reduce the annotation time. For example, the 60-s videos save 47 and 37% of the annotation time compared to the 10-s and 30-s videos, respectively. It was also observed that the accuracy of the workface assessment results slightly improves with an increase in the HIT video length.
3. Manual annotation of a sparse set of video keyframes is reliable for achieving complete frame-by-frame annotations. In the most extreme case, the three times per minute annotation frequency reduces the average annotation time by 57% while dropping the accuracy of the workface assessment by only 7%. The prominent decrease in annotation time at three times per minute can also result in cost savings, because AMT charges the requester based on the time the annotator devotes to each HIT.
4. A threefold crossvalidation provides the best accuracy-cost trade-off for workface assessment; increasing the fold number beyond threefold does not raise the accuracy significantly. Also, quality assurance and control steps are important to guarantee the reliability of the assessment results. Repeated labeling can improve the accuracy of workface assessment, and the optimal performance can be achieved with a threefold setup (i.e., hiring three AMT annotators per HIT).
5. Increasing the number of keyframes (i.e., decreasing the sparsity of annotations) does not necessarily lead to a significant increase in the accuracy of 2D localization. The 2D localization accuracy tends to increase from 0.01 to 0.1 clicks per frame per construction worker; however, the increase is within 18%, and it was observed from the experiments that increasing the keyframe number beyond 0.05 clicks per frame per construction worker barely leads to any further increase in accuracy.
6. The linear extrapolation method can perform as well as the detection-based extrapolation method for the easy and normal videos. However, because of the difficulties in extracting visual features and detecting construction workers in the severe construction site conditions of the hard videos, the detection-based extrapolation method fails to compete with the linear extrapolation method.

Conclusion and Future Work

This paper presents a novel method that supports crowdsourcing construction activity analysis from jobsite video streams. The proposed method leverages human intelligence recruited from a massive crowdsourcing marketplace, AMT, together with automated vision-based detection/tracking algorithms to derive timely and reliable construction activity analysis under challenging conditions such as severe occlusion, background clutter, and camera motion. The experimental results, with an average accuracy of 85% in workface assessment tasks, show the promise of the proposed method. The comparisons conducted between nonexperts and construction experts validate the hypothesis that crowdsourcing video-based construction activity analysis through AMT nonexperts can achieve accuracy similar (or even equal) to conducting the activity analysis with construction experts.
To improve the platform, future work should focus on (1) the design of a more robust detection/tracking algorithm that can work well with sparse human input to effectively generate accurate nonkeyframe annotations; and (2) the design of a quality control method that does not require repeated labeling, to reduce requesters' cost and avoid erroneous data at the voting stage. As part of the study, a new compositional structure taxonomy for construction activities was also created that models the interactions between body posture, activities, and tools. This representation can improve detection/tracking by enhancing the propagation of manual annotations to nonkeyframes. Also, studies that focus on using the hidden Markov model to automatically infer construction activities from long sequences of jobsite videos could be beneficial to the detection/tracking and quality control steps. Learning a set of transition and emission probabilities between each pair of construction activities from a crowdsourcing platform can improve inference on the categories of subsequent activity types for each frame and, in turn, improve the quality control process.
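As a sketch of that direction, the transition probabilities of such a hidden Markov model could be estimated directly from the crowdsourced frame-by-frame activity labels. The snippet below shows only this counting step under assumed label sequences; the activity names are hypothetical, and the emission probabilities, which depend on the chosen visual features, are omitted.

```python
# Illustrative sketch: estimate an activity transition matrix from labeled sequences.
from collections import defaultdict

def transition_probabilities(label_sequences):
    """label_sequences: list of per-frame activity label lists from annotated videos."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in label_sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {a: {b: n / sum(nxts.values()) for b, n in nxts.items()}
            for a, nxts in counts.items()}

# Hypothetical crowdsourced label sequences for two short clips.
sequences = [["idle", "idle", "place_concrete", "place_concrete", "idle"],
             ["place_concrete", "place_concrete", "vibrate", "vibrate"]]
print(transition_probabilities(sequences))
```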
To facilitate frequent implementation of crowdsourced activity analysis, short-term future work involves devising workflows to assist foremen in videotaping their operations on a daily basis. This can be done by placing consumer-level cameras on tripods away from the operation of the crew such that, for the most part, the crew stays within the line of sight of the cameras. For the long term, automatically placing a relocatable network of cameras around the site and leveraging action cameras mounted on the hardhats or safety vests of the workers can address the issues of conducting data collection throughout the site as well as the line-of-sight and visibility constraints of each camera.
Meanwhile, as part of future work, the following will be considered: (1) plotting the workface assessment results over the course of a day to get a better understanding of how soon crafts are getting on their tools in the morning, where there are excessive breaks, and where and when crews quit an operation too early; and (2) integrating observations from different viewpoints. Finally, to comprehensively validate this new method for construction video-based analysis, a set of detailed crowdsourcing market investigations and experiments should be conducted, not only to test the technical parameters, but also to build a process model to assess the cost associated with crowdsourcing, the time span between publishing and retrieving tasks, and the potential risks of affecting worker privacy by outsourcing construction video annotations containing construction workers to the crowd. This platform is publicly accessible at http://activityanalysis.cee.illinois.edu. A video is also provided as a companion for better illustration of the functionalities of this platform.

Acknowledgments

The authors would like to thank Zachry Construction Corporation and Holder Construction Group for their support with data collection. The authors thank Professor Carl Haas for his very constructive feedback during the development of the workface assessment platform. The technical support of Deepak Neralla with the development of the web-based tool is appreciated. The authors also thank the members of the Real-Time and Automated Monitoring and Control (RAAMAC) lab, the graduate and undergraduate civil engineering students, and the other AMT nonexpert annotators. This work was financially supported by the University of Illinois Department of Civil and Environmental Engineering's Innovation Grant. The views and opinions expressed in this paper are those of the authors and do not represent the views of the individuals or entities mentioned above.

Supplemental Data

A video demonstration of the RAAMAC Crowdsourcing Workface Assessment Tool is available online in the ASCE Library (www.ascelibrary.org).
