Crowdsourcing Construction Activity Analysis From Jobsite Video Stream
Abstract: The advent of affordable jobsite cameras is reshaping the way on-site construction activities are monitored. To facilitate the analysis of large collections of videos, research has focused on addressing the problem of manual workface assessment by recognizing worker and equipment activities using computer-vision algorithms. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities from videos is still rather limited. The current algorithms require large-scale annotated workface assessment video data to learn models that can deal with the high degree of intraclass variability among activity categories. To address current limitations, this study proposes crowdsourcing the task of workface assessment from jobsite video streams. By introducing an intuitive web-based platform for massive marketplaces such as Amazon Mechanical Turk (AMT) and several automated methods, the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of workface assessment and also provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing video-based activity recognition methods. Six extensive experiments have shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can provide complete and detailed workface assessment results with 85% accuracy. It has been demonstrated that crowdsourcing has the potential to minimize time needed for workface assessment, provides ground truth for algorithmic developments, and most importantly allows on-site professionals to focus their time on the more important task of root-cause analysis and performance improvements. DOI: 10.1061/(ASCE)CO.1943-7862.0001010. © 2015 American Society of Civil Engineers.

Author keywords: Activity analysis; Construction productivity; Video-based monitoring; Workface assessment; Crowdsourcing; Information technologies.
efficient video interpretation methods, tedious manual reviewing will still be required to extract productivity information from recorded videos, and this takes away time from the more important task of conducting root-cause analysis.

In this paper, a new workface assessment framework is introduced which provides an easy way to collect and interpret accurate labor and equipment activity information from jobsite videos. The idea is simple: the task of workface assessment is crowdsourced from jobsite video streams. By introducing an intuitive web-based platform on Amazon Mechanical Turk (AMT), the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of activity analysis and also to provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing automated video-based activity recognition methods. Through extensive validation of various parameters of the platform, it is shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can achieve accurate workface assessment results. In the following sections, the related works are reviewed, the methods developed for the proposed tool are introduced, and experimental results are discussed.

Related Work

Time-lapse photography and videotaping have proven for many years to be very useful means for recording workface activities (Golparvar-Fard et al. 2009). Since the earlier work of Oglesby et al. (1989) until now, many researchers have proposed procedures, guidelines, and also manual and semiautomated methods for interpretation of jobsite video data. Videos have the advantage of being understandable by any visually able person, provide detailed and dependable information, and allow detailed reviews by the analysts and on-site management away from the work sites. In the next section, some of the most relevant works on the topic of video-based workface assessment are first reviewed. Then, a review of the concept of crowdsourcing is provided, followed by research on video annotation tools and existing databases for activity recognition.

Computer-Vision Methods for Video-Based Construction Activity Analysis

Over the past few years, many computer-vision methods have emerged for inferring the activities of workers and equipment from jobsite videos. A reliable method for video-based activity analysis requires two interdependent components: (1) methods for detecting and tracking resources; and (2) procedures for activity recognition. The majority of the previous works have addressed these components as two separate tasks. Brilakis et al. (2011) and Park et al. (2011) applied scale-invariant feature transforms to track construction resources in both 2D and 3D scenarios. Teizer and Vela (2009), Gong and Caldas (2009), Rezazadeh Azar and McCabe (2012), and Chi and Caldas (2011) explored construction resource detection in operations such as concrete pouring and nonconcrete material movements. Azar and McCabe (2012) introduced a logical framework using computer vision-based techniques to study construction equipment working cycles. Instead of assuming strong priors on the relationships between activities and locations as in the previous works, Gong et al. (2011), Golparvar-Fard et al. (2013), and Escorcia et al. (2012) proposed bag-of-words models with different discriminative and generative classifiers to recognize atomic activities of construction workers and equipment. By recognizing atomic activities, the observed activities are classified in self-contained videos in which a single resource starts an activity and ends the same activity within the video. Khosrowpour et al. (2014b) is also one of the earliest attempts to recognize the full sequence of activities for multiple workers from red-green-blue-depth (RGB-D) data. RGB-D data collected from depth sensors such as the Microsoft Kinect bypass the challenges in detection and tracking and provide sequences of worker-body skeletons which can be the input to such activity recognition methods. Jaselskis et al. (2015) also proposed an approach for monitoring construction projects in the field by off-site personnel through live video streams. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities is still limited. Challenges relate to the large existing variability in the execution of construction operations and the lack of formal taxonomies for construction activities in terms of expected worker/equipment roles and sequences of activities. The complexity of the visual stimuli in activity recognition, in terms of camera motion, occlusions, background clutter, and viewpoint changes, poses further challenges. Finally, and most importantly, there is a lack of data sets together with ground truth for more exhaustive research on automated activity recognition methods.

Crowdsourcing

To overcome the current limitations in the standard task of activity recognition, the computer-vision community has recently initiated several projects to investigate the potential of crowdsourcing. Crowdsourcing refers to the collaborative participation of a crowd of people to help solve a specific problem and typically involves a rewarding mechanism, for example, paying for participation (Howe 2008). In recent years, the Internet has enabled crowdsourcing in a broadened and more dynamic manner. Crowdsourcing from anyone, anywhere, as needed is now common for tasks ranging from image and video processing, information gathering, and data verification to creative tasks such as coding, analytics, and product development (Wightman 2010; Yuen et al. 2011; Shingles and Trichel 2014). The wide range of business in crowdsourcing has also promoted the development of specialized platforms:
• simple, microtask-oriented crowdsourcing: Amazon Mechanical Turk and Elance;
• complicated, experience-oriented crowdsourcing: 10EQS and oDesk;
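To make the microtask workflow on a marketplace such as AMT concrete, the sketch below assembles the request a requester might send to publish one video-annotation HIT. The task URL, reward, and wording are hypothetical placeholders; `create_hit` and its keyword arguments are the real boto3 MTurk operation, but the network call is left commented out so the sketch stays self-contained.

```python
# Sketch: publishing a video-annotation HIT to Amazon Mechanical Turk.
# The task URL, reward, and task copy below are hypothetical placeholders.

def build_hit_request(task_url: str, reward_usd: float, assignments: int) -> dict:
    """Assemble the keyword arguments for boto3's mturk.create_hit()."""
    # An ExternalQuestion embeds the requester's own annotation page in an iframe.
    question_xml = (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{task_url}</ExternalURL>"
        "<FrameHeight>800</FrameHeight>"
        "</ExternalQuestion>"
    )
    return {
        "Title": "Annotate worker activities in a short jobsite video",
        "Description": "Draw boxes around workers and label their activities.",
        "Keywords": "video, annotation, construction",
        "Reward": f"{reward_usd:.2f}",        # reward per assignment, in USD
        "MaxAssignments": assignments,        # number of annotators per clip
        "AssignmentDurationInSeconds": 3600,
        "LifetimeInSeconds": 7 * 24 * 3600,
        "Question": question_xml,
    }

request = build_hit_request("https://example.com/hit/clip_001", 0.50, 3)
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# response = mturk.create_hit(**request)
```

Publishing one HIT per short clip, with several assignments each, mirrors the paper's strategy of breaking a long video into individually crowdsourced tasks.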
intelligence tasks (HITs) and compensate online users, known as workers or annotators, with a predefined payment for the rendered results. Through crowdsourcing, easy human annotation tasks which are extremely difficult or even impossible for computers to perform can be accomplished in a timely manner. This is done by leveraging AMT's massive labor market, which contains hundreds of thousands of annotators who can complete 60% of HITs within 16 h and 80% of HITs within around 64 h (Ipeirotis 2010). The task can also be done at a low price of $5 per hour on the AMT. A reasonable quality can also be achieved, because the AMT keeps track of the annotators' performance and uses it as a screening condition to guarantee an appropriate level of quality in the results. Also, many techniques have been developed to assist requesters with getting high-quality annotations.

Sorokin and Forsyth (2008) were the first within the computer-vision community to present a data annotation framework to quickly obtain inexpensive project-specific annotations through the AMT crowdsourcing platform. The proposed framework revolutionized large-scale static image annotation (Vondrick et al. 2013). The subsequent efforts to seek the value of massive data sets of labeled images promoted the design of efficient visual annotation tools and databases. Russell et al. (2008) introduced LabelMe as a web-based annotation tool that supports dense polygon labeling on static images. Deng et al. (2009) presented a crowdsourcing image annotation platform based on ImageNet, an image database of over 11 million images. Everingham et al. (2010) described a high-quality image collection strategy for the PASCAL visual object classes (VOC) challenge.

Crowdsourcing Video Annotations

Despite the success of crowdsourcing annotations for still imagery, the dynamic nature of videos makes their annotation more challenging (Vondrick et al. 2013). Video annotation requires cost-aware and efficient methods instead of frame-by-frame labeling (Vondrick et al. 2013; Wah 2006). Earlier works such as the ViPER-GT tool (Doermann and Mihalcik 2000) gathered groundtruth video data without any intelligent method for assisting the annotation task (Di Salvo et al. 2013). Obviously, the significant number of frames in a video requires smarter ways for propagating annotations from a subset of keyframes; otherwise, a method will not be scalable. More recently, Yuen et al. (2009) introduced LabelMe Video, which employs linear interpolation with a constant 3D-velocity assumption to propagate nonkeyframe spatial annotations in a video. Ali et al. (2011) presented FlowBoost, a tool that can annotate videos from a sparse set of keyframe annotations. Kavasidis et al. (2012) proposed GTTool to support automatic contour extraction, object detection, and tracking to assist annotations across video frame sequences. Vondrick et al. (2013) presented a video annotation tool, VATIC, to address the nonlinear motions in video. Beyond spatial aspects of video annotation, recently, researchers have also focused on extracting temporal information

both workers and equipment exhibit changing body postures even when the same activity is being performed. These issues, beyond the typical challenges in the task of activity recognition, could negatively affect the optimal length of annotation tasks or the number of necessary keyframes for annotation. Also, because of the complexity of construction operations and the lack of formal categorization beyond CII-defined activities, crowdsourcing video-based workface assessment necessitates a taxonomy and customized data frameworks to describe construction activities. Finally, the reliability of crowdsourcing for video-based construction workface assessment has not been validated. In particular, recruiting nonexpert annotators from a crowdsourcing marketplace such as AMT may negatively affect the quality of the assessment results. Beyond technical challenges, detailed experiments are necessary to examine the capability of nonexperts against expert control groups and also to devise strategies for improving the accuracy of crowdsourcing tasks.

Method

To address the challenges of applying crowdsourcing to the video-based construction workface assessment task, a goal was set on creating a new web-based platform. The nonexpert crowd, and specifically Amazon Mechanical Turk, is heavily relied on to conduct workface assessment. Customized user interfaces have been designed to enable construction productivity data retrieval, visualization, and crossvalidation. Exhaustive experiments are designed and conducted to fine-tune parameters of the tool, including annotation method, annotation frequency, and video length, for crowdsourcing video-based construction workface assessment as individual tasks on the AMT. A compositional structure taxonomy has also been created for construction activities to decode complex construction operations. The performance of the expert and nonexpert annotators for detecting, tracking, and recognizing activities is also exhaustively analyzed. Applying crossvalidation methods to improve workface assessment accuracy is also investigated. In particular, several experiments are conducted to seek the optimal fold number for the crossvalidation process. Fig. 1 shows an overview of the workflows involved in leveraging the proposed crowdsourcing platform for workface assessment. A companion video of this manuscript also shows various functionalities. In the following sections, these modules and experiments are presented in more detail.

Collecting Jobsite Videos

Because of the lack of precedent experience on how to design and validate a crowdsourcing platform specific to video-based construction workface assessment, it is necessary to use real-world jobsite videos. This allows careful examination of the platform's performance throughout the entire design and validation process. Videos chosen for collection focus on concrete placement operations, which contain a range of visually complex activities and are common on almost all projects. This provides various validation
scenarios for examining the application of the workface assessment tool. The collection of videos followed two principles: (1) to guarantee that the videos cover various jobsite and video-recording conditions; and (2) that the videos exhibit different levels of difficulty for activity annotation. In this research, activity difficulty is defined as the physical and reasoning effort that is required to complete video-based workface assessment using the workface assessment tool.

To quantify the difficulty level in video annotations, the following criteria are introduced:
• construction activity conditions: the size of the construction crew, and the frequency of changes in the sequence of activities conducted by individual craft workers;
• visibility conditions: different occlusion conditions, illumination conditions, and background clutter; and
• recording conditions: camera viewpoint, distance, and camera-motion conditions.
These criteria help determine the amount of physical effort and reasoning that is needed to perform the assessment tasks. As shown in Table 1, the collected videos exhibit a large range of variation based on these criteria. This allows the capability of conducting construction workface assessment to be thoroughly investigated under different conditions.

Real-world videos of concrete placement operations from three different construction sites were collected. For validating the platform, a total of eight jobsite videos (45 min) were chosen. Based on the criteria defined in Table 1, these videos were classified into categories as easy, normal, and hard. Fig. 2 shows example snapshots from these videos and their levels of difficulty. The collected concrete placement videos cover almost all types of direct work, such as place concrete, erect form, position rebar, etc., and substantial amounts of nondirect work, such as preparatory work, material handling, waiting, etc.

Activity Analysis User Interface

The workface assessment tool has two main interfaces for task management and workface assessment. The task management interface assists requesters (site managers and engineers, or researchers) in managing workface assessment tasks, including task publication and result retrieval. The workface assessment interface provides annotators access to complete video-based workface assessment tasks. As shown in Fig. 1, first, the requesters use the task management interface to break an entire video of construction operations into several human intelligence tasks (HITs) and publish them online or offline. Online mode allows the AMT annotators to complete HITs, whereas offline mode makes HITs accessible only to a predefined group of users, which can, for example, include expert annotators invited by the requesters. Before publishing the videos for annotation, for privacy purposes, the videos can be processed using image analysis methods to automatically blur human faces when needed. Once logged in, the annotators can then accept published HITs to generate workface assessment results using the workface assessment interface. When all HITs belonging to the same video are completed, the requesters can then retrieve, visualize, and crossvalidate assessment results. They can also generate formal assessment reports through the task management interface.

To access the task management interface, a requester should provide a valid username and password, as shown in Fig. 3. This verification step can address copyright and privacy issues of
Table 1. Various Criteria Used to Select the Candidate Videos for Experiments

| Videos | Crew size (persons) | Changes in activities | Occlusion | Illumination | Background clutter | Viewpoint | Camera distance (m) | Camera motion |
|---|---|---|---|---|---|---|---|---|
| Easy videos | 2 | Low | Rare | Daylight | Low | Level | 3–15 | None |
| | 4–5 | Low | Rare | Daylight | Normal | Level | 5–15 | None |
| | 3–5 | Low | Rare | Cloudy | Low | Level | 15–25 | Rare |
| Normal videos | 4–7 | Medium | Medium | Daylight | Medium | Level | 3–25 | Rare |
| | 6–9 | High | Rare | Daylight | Medium | Level | 5–25 | Rare |
| | 8 | Low | Severe | Sunny | Varied | Level | 5–25 | Severe |
| Hard videos | 10–12 | Medium | Severe | Daylight | Severe | Tilted | 15–45 | Rare |

Note: Crew size and changes in activities are the activity conditions; occlusion, illumination, and background clutter are the visibility conditions; viewpoint, camera distance, and camera motion are the recording conditions.
Fig. 2. Videos exhibit different levels of difficulty for video annotation purposes (images by authors)
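The selection criteria in Table 1 can be captured in a small record type and a bucketing rule. The field names, level scores, and thresholds below are illustrative simplifications for this sketch, not the authors' actual classification procedure.

```python
from dataclasses import dataclass

@dataclass
class VideoCriteria:
    """One row of Table 1 (field names are illustrative)."""
    crew_size_max: int          # activity conditions
    activity_changes: str       # "Low" / "Medium" / "High"
    occlusion: str              # visibility conditions: "Rare" / "Medium" / "Severe"
    background_clutter: str
    camera_motion: str          # recording conditions: "None" / "Rare" / "Severe"

# Map the qualitative labels used in Table 1 to a 0-2 severity level.
LEVELS = {"None": 0, "Rare": 0, "Low": 0, "Normal": 1, "Medium": 1,
          "Varied": 1, "High": 2, "Severe": 2}

def difficulty(v: VideoCriteria) -> str:
    """Hypothetical rule: sum the severity levels and bucket the total."""
    score = (LEVELS[v.activity_changes] + LEVELS[v.occlusion]
             + LEVELS[v.background_clutter] + LEVELS[v.camera_motion]
             + (v.crew_size_max >= 10) * 2)   # large crews add difficulty
    return "easy" if score <= 1 else "normal" if score <= 4 else "hard"

easy = VideoCriteria(2, "Low", "Rare", "Low", "None")      # first row of Table 1
hard = VideoCriteria(12, "Medium", "Severe", "Severe", "Rare")  # last row
```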
any uploaded video, to protect authentic requesters from sharing videos under their accounts with unwanted parties. Once past the verification step, the requester can use the following functions associated with the task management interface:
• Video links: This function assists the requesters in managing the progress of the annotators and controlling the quality of their work on the workface assessment HITs. To do so, the HITs are presented by unique names and hyperlinks and are also classified into two categories, published and completed.
• Video upload: The tool is prototyped to simultaneously support crowdsourcing workface assessment and the collection of large-scale ground-truth data sets for both academia and industry. Thus, allowing requesters to upload their own videos is integral to the design of the task management interface. To achieve this, video upload is designed, shown in Fig. 4, as a function for the requesters to upload their videos and also automatically break them into several HITs based on a desired length. This function also associates the most frequently used labels with the HITs and then publishes them online or offline. The most frequently used labels are compositional structure taxonomies that describe the activities of the workers or equipment in a video.
• Crossvalidation and accuracy: The online mode for the tool is prototyped to leverage the knowledge of the nonexpert annotators from the AMT marketplace. To avoid using inaccurate results produced by these annotators, quality assurance and control should be applied at both the preassessment and postassessment steps. To do so, crossvalidation and accuracy are introduced as functions that report the accuracy of the completed assessment
Fig. 3. Task management interface: (a) log in; (b) video links
Fig. 4. Task management interface: (a) video upload; (b) crossvalidation and accuracy
against groundtruth at the postassessment step. They also allow the requester to crossvalidate different completed assessments for the same video to achieve a more accurate assessment result. Fig. 4 presents the interface for these functions.
• Video visualization: To retrieve activity analysis information at any desired level of granularity, video visualization presents the workface assessment results in the form of annotated videos, crew-balance charts, or pie charts. The annotated videos, shown in Fig. 5, contain worker/equipment trajectories and activity type information, which are both annotated on the videos. The experts can monitor specific operations and conduct root-cause analysis for hidden issues that may not easily be revealed by other visualization forms. The crew-balance charts represent the time series of worker activities. Unlike annotated videos, the time series provide a depiction of the construction operations in a concise manner. Finally, the pie charts characterize the percentage of time spent on various activity categories, based on the CII 2010 taxonomy. This method is also capable of accurately examining the hourly average percentages for the overall jobsite, which can provide significant benefit to activity analysis.

The workface assessment interface is where the annotators work on the assigned HITs. It is the most important component of the tool, as it is where the actual workface assessment happens. To achieve simplicity and functionality in the design of the interface, structured rules are followed. The interface consists of three main components: (1) video player, (2) label drop-down list, and (3) assisting functions, shown in Fig. 6. The player, taking the most space, shows the video content that requires assessment. The drop-down list contains labels that are presented hierarchically to describe the activities of each worker or equipment. Assisting functions are located above or under the video player and label drop-down list. The main functions include
• Workface assessment function: Using this function, in each assigned HIT, the annotators can annotate the labels necessary for the activities of the workers and equipment and their body posture. This involves (1) associating a role to each new construction worker/equipment, (2) drawing bounding boxes to localize the worker/equipment in 2D video frames, and (3) selecting labels to describe their activity types, body posture, and tools. As the video proceeds, this procedure continues with updating the position of the bounding boxes and their associated labels. To create new roles for the workers, the annotators can use the +NewResource button, which brings out a list of existing roles. After selecting a role, the cursor will be activated to allow the annotator to draw a bounding box around the construction worker or equipment and pinpoint the location in 2D. Next, the annotators select labels from the drop-down list to describe the observed activities. The Play and Rewind buttons, together with the video progress control bar, allow the role/activity/posture labels to be updated for the observed workers and equipment in a video. Upon completion, the annotator can save the annotation results by pressing the SaveWork button.
• Assisting function: This function is designed to help annotators generate accurate workface assessment results. It consists of Introductions, +NewLabel, and Options. Introductions are provided to help the annotators understand how the assigned tasks can be performed. For efficiency and conciseness, the drop-down list contains only the most frequently used labels, which allows the annotators to quickly find the labels of interest. The platform does not require a comprehensive list of labels to begin with. Rather, the annotators can use +NewLabel
Fig. 5. Video visualization in the task management interface: (a) annotated video; (b) crew-balance chart; (c) detailed and CII-type pie chart; the annotations in the upper left corner of each box show the role-activity-tool-body posture of each worker (images by authors)
to insert the necessary new labels (or customize existing ones) to complete their assessments. The Options function enables annotators to adjust the video player and fine-tune the monitoring settings. These settings include different video speeds, hiding/showing bounding boxes and labels on the videos, and enabling/disabling resizing of the bounding boxes. To keep the user interface constrained and simple, all assisting functions are displayed in different windows which can be triggered by their corresponding functional buttons.

Compositional Structure Taxonomy for Construction Activities

Activity analysis requires a detailed description of construction worker and equipment activities. A sequence of different activities allows the analysis of the root causes of low productivity rates and also the planning and implementation of productivity improvements. However, construction activities are complex to describe. They exhibit a large amount of interclass and intraclass variability among different activities that can be associated with the roles of the workers and equipment. Because of the dynamic nature of construction operations, the temporal sequence of these activities changes frequently as well. Without a systematic description, it will be difficult to provide accurate activity analysis information. To address the current limitation, the CII (2010) proposed a new taxonomy that classifies all activities into seven categories of direct work, preparatory work, tools and equipment, material handling, waiting, travel, and personal. Although this taxonomy is generally applicable to all construction operations, it does not provide the detailed description of tasks that would be necessary for the development of visual activity-recognition algorithms. Any vision-based method requires distinct visual features of worker and equipment activities, and that could only be achieved if different labels are used to describe each group of similar visual features.

In this research, a compositional structure taxonomy is introduced to decode complex construction activities in the following format: worker role is conducting CII activity category in the form of a visual activity category using tool, body posture and is visible, occluded, or outside of the video frame. As a starting point, worker type contains 19 different roles for construction workers such as concrete finisher, carpenter, electrician, bricklayer, etc. The second layer is the CII activity categories, which describe worker activities in the form of direct and nondirect work (preparatory work, tools and equipment, material handling, waiting, travel, and personal activity). The third layer is the visual activity category, introduced to provide detailed information on activities, tools, and body posture for direct work activities. Tools vary based on different types of activities; nevertheless, they can be an important visual indicator for training vision-based algorithms. Because of the large number of tools involved in any given activity, the interface is designed such that it provides illustrative images of each tool to help annotators identify them easily. Posture can also indicate that an activity has changed, and thus, it can be beneficial for the training of the vision-based algorithms. This visual activity category layer, with its detailed representation of activity-tool-body posture, can enable the extraction of proper visual features and the devising of appropriate computer-vision methods. The method also describes nondirect work with worker body postures to enable a better synthesis of nondirect work activities. Finally, jobsite videos contain severe occlusions and background clutter that can create noise in a representative activity data set. Thus, visibility information (occluded and outside of video) is associated with each annotation. Because of space limitations, Fig. 7 presents only a part of the proposed compositional structure taxonomy of construction worker activities related to concrete placement operations. A more detailed representation is available at http://activityanalysis.cee.illinois.edu/.

Using the proposed taxonomy, construction professionals can analyze both CII- and task-level activities for root-cause productivity analysis purposes. To better plan productivity improvements, the interaction between task-level activities and the construction worker's posture and tool is introduced. Not only are these interactions meaningful for the development of robust vision-based algorithms, but they also enable construction professionals to analyze the relationship between tool utilization and direct work rate, which could lead to productivity improvements through better tool utilization.

The designed platform is also very flexible. For example, for quick assessments, the requester can choose the role-CII activity structure to annotate construction videos based on the CII taxonomy of activities. For detailed workface assessment and collecting data sets for the development of computer-vision methods, the requester can choose to leverage the worker-activity-posture-tool-visibility compositional structure taxonomy of construction activities. To facilitate the annotation process and minimize the time required to find a specific activity category in a long list, the platform automatically inserts the most frequently used compositional structure taxonomy during the task publication stage. To cope with the needs of different types of construction operations, it also enables the annotators to add missing or new taxonomies on the fly.

Extrapolating Annotations from Keyframes

The dynamic nature of videos makes frame-by-frame annotations necessary but labor-intensive and costly. Crowdsourcing can reduce the human effort, time, and cost of workface assessment; nevertheless, video annotation still needs strategies to propagate assessment results from a sparse set of keyframes. Keyframes are frames in a video sequence that benchmark the start and end of a construction activity (or a change in role). In the platform, these changes need to be captured manually by the annotators. The nonkeyframes are the following frames that contain the same construction activities as the previous keyframe, although the position of the workers or equipment performing the activity may have changed. In this section, the extrapolation methods that are developed to support propagation of the annotations from the keyframes to the nonkeyframes are described. Inspired by Vondrick et al. (2013), linear and detection-based extrapolation methods are implemented. T is defined as the total number of frames, where T = time × (frames per second), and B [Eq. (1)] is defined as the 2D pixel coordinates of each annotation bounding box

B = [x_min, x_max, y_min, y_max]    (1)

where x_min, y_max denote the coordinates of the upper-left corner and x_max, y_min denote the coordinates of the lower-right corner of the bounding box. B_t (0 ≤ t ≤ T) is defined as the bounding box coordinates at time t. Fig. 8 shows examples of applying the extrapolation methods to generate nonkeyframe annotations (B_t) from known keyframe annotations (B_0 and B_T).

Linear extrapolation assumes that workers and equipment that have constant velocity in 3D will also keep their velocity
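The propagation of Eq. (1)-style boxes between two keyframes can be sketched in a few lines. This is a minimal 2D pixel-space interpolation for illustration only; the authors' method additionally reasons about constant velocity in 3D.

```python
# Minimal sketch: linearly interpolating a bounding box
# B = [x_min, x_max, y_min, y_max] between keyframes B0 (frame 0) and BT (frame T).

def extrapolate_linear(B0, BT, T, t):
    """Return the interpolated box at frame t (0 <= t <= T)."""
    a = t / T  # fraction of the way from the first keyframe to the second
    return [b0 + a * (bt - b0) for b0, bt in zip(B0, BT)]

B0 = [100, 150, 200, 240]   # keyframe annotation at t = 0
BT = [160, 210, 220, 260]   # keyframe annotation at t = T = 30
mid = extrapolate_linear(B0, BT, 30, 15)  # nonkeyframe box halfway in between
```

Every nonkeyframe between two annotated keyframes can be filled in this way, which is why annotators only need to mark frames where an activity (or role) starts or ends.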
[Fig. 7 (excerpt): part of the compositional structure taxonomy of construction worker activities for concrete placement operations. Visual activity categories shown include covering or protecting concrete, curing, molding expansion joint & edge, surfacing, cutting concrete, erecting/dismantling scaffold, and erecting/stripping formwork; associated tools include power screed, power trowel, grout pump, power sprayer, concrete edger, concrete groover, straightedge, broom, concrete brush, concrete saw, line tool, level, plumb rule, hammer, and power saw (B.S.S*). Nondirect work categories include preparatory work, material handling, tool and equipment, waiting, travel, and personal.]
where HOG is computed based on Dalal and Triggs (2005), and HOC is a nine-dimensional feature containing three means and six covariances computed from the hue, saturation, and value color channels. x_i is applied to keyframe annotations to construct positive samples, and patches automatically extracted from the keyframes' background are used to construct negative samples. To learn a specific visual classifier that assigns positive samples high scores, the same procedure as in Memarzadeh et al. (2013) is followed, and a binary support vector machine (SVM) is introduced per type of resource, because the SVM is a commonly used discriminative classifier in the machine learning literature and performs well with carefully created groundtruth data. Each binary SVM classifier is trained by feeding in all training samples (x_i, +1)/(x_i, −1) to optimize the following objective function: […]

[…] The unary cost U_t(B_t) calculates the cost of the potential detection in each frame from the score of the visual classifier and the l2-norm of the bounding-box difference between the SVM detection and the linear extrapolation. The SVM associates the most probable prediction with the highest score, so that the most likely detection carries the minimum cost. In this research, −w · ϕ(B_t) is used as the score of the visual classifier. Because of the presence of occlusions, some video frames t may not contain a groundtruth detection; such frames cause false negatives with small scores to become the potential B_t. In these situations, the annotations for the nonkeyframes rely on the linear extrapolation method, and the classifier score −w · ϕ(B_t) is replaced with a very small (near-zero) number α1.

The pairwise cost calculates the smoothness of the detection path for each worker/equipment. In this research, the position of […]
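As a concrete illustration of the HOC descriptor described above (three channel means plus six covariances), the following sketch computes the nine-dimensional feature from a patch's HSV pixels. The function name and the population-covariance normalization are assumptions, not the paper's exact implementation:

```python
def hoc_feature(hsv_pixels):
    """Nine-dimensional color descriptor: the three channel means plus the
    six unique entries of the 3x3 covariance matrix of (H, S, V) values
    inside a patch. A sketch of the HOC feature described in the text.
    """
    n = len(hsv_pixels)
    means = [sum(p[c] for p in hsv_pixels) / n for c in range(3)]
    covs = []
    for i in range(3):
        for j in range(i, 3):  # upper triangle of the symmetric matrix -> 6 values
            covs.append(sum((p[i] - means[i]) * (p[j] - means[j])
                            for p in hsv_pixels) / n)
    return means + covs  # 3 means + 6 covariances = 9 dimensions
```

A patch of identical pixels yields the channel values as means and all-zero covariances.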
Cost_t(N_R, N_I) = S_t^R(N_R) − S_t^I(N_I),        if Role_t^{N_R} ≡ Role_t^{N_I}
Cost_t(N_R, N_I) = 2 × [S_t^R(N_R) − S_t^I(N_I)],  otherwise,    t = 0, …, T    (9)

where Cost_t(N_R, N_I) is the cost between the N_R-th annotation from the reference and the N_I-th annotation from the input at frame t (0 ≤ t ≤ T); S_t^R(N_R) − S_t^I(N_I) is the area difference (bounding-box overlap) between the N_R-th and the N_I-th bounding boxes; and Role_t^{N_R} and Role_t^{N_I} are the annotation labels being compared. The calculated costs constitute the cost matrix between the reference and input annotation results, which is shown in Fig. 9. Annotations from the reference and input are grouped based on the minimum cost value in the cost matrix. Once all groups have been matched, majority voting is performed over the annotations of the same construction worker/equipment at each frame. For example, groups that have an annotation count of less than half of the repeat time are first eliminated. Then, the average bounding-box coordinates AVG(B_t^all) of all repeated annotations for the same construction worker/equipment are calculated. Annotation(s) whose
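Eq. (9) is straightforward to express in code; this sketch takes the precomputed area-difference term as input (names are illustrative, not from the paper):

```python
def annotation_cost(area_ref, area_inp, same_role):
    """Cost between the N_R-th reference and N_I-th input annotation at a
    frame, following Eq. (9): the bounding-box area difference S^R - S^I,
    doubled when the two annotations carry different role labels.
    """
    diff = area_ref - area_inp
    return diff if same_role else 2 * diff
```

Doubling the cost for mismatched roles penalizes grouping annotations that disagree on the worker's role.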
[Fig. 9: cost matrix at frame t between the reference annotations (columns 1, …, N_R, N_R+1) and the input annotations (rows 1, …, N_I, N_I+1), with entries Cost_t(N_R, N_I)]
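The grouping and majority-voting steps built on this cost matrix can be sketched as follows (a greedy, simplified illustration; the paper's procedure additionally averages bounding boxes, applies a distance threshold, and eliminates sparse groups):

```python
from collections import Counter

def match_and_vote(cost_matrix, input_labels):
    """Greedy sketch of the grouping/voting step.

    cost_matrix[i][r] = Cost_t between input annotation i and reference r.
    input_labels[i]   = label proposed by input annotation i.
    Each input is matched to its minimum-cost reference, and the majority
    label among inputs matched to the same reference wins.
    Returns {reference index: majority label}.
    """
    groups = {}
    for i, row in enumerate(cost_matrix):
        r = min(range(len(row)), key=lambda k: row[k])  # min-cost reference
        groups.setdefault(r, []).append(input_labels[i])
    return {r: Counter(lbls).most_common(1)[0][0]
            for r, lbls in groups.items()}
```

For example, two inputs close to reference 0 and one close to reference 1 produce one voted label per reference.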
sum of B_t − AVG(B_t^all) is greater than the defined threshold at frame t are eliminated. Finally, the average of the bounding boxes' coordinates is recalculated, and the majority label for each level of the compositional structure taxonomy is selected from the remaining annotations to generate the final annotation for each construction worker/equipment (the annotation group) at frame t.

Although repeated labeling improves the quality of workface assessment, unnecessary repeated labelings cost extra money and time. To save both, it is necessary to find an optimal repeat time. Thus, experiments using a crossvalidation method examine how the accuracy changes with different repeat times; these are discussed in the following section.

Designing Microtasks from Construction Videos

A video of a construction operation is typically several hours long. This makes assessing it much more difficult than completing a typical short-length microtask on the AMT. To make crowdsourcing feasible, an entire video is broken down into several shorter HITs. For effectiveness, the following is considered:
• The length of a HIT: The longer a HIT is, the less time the annotator needs to complete the annotation task for the entirety of the video, because the annotator spends less time understanding the video content and does not need to watch a video multiple times. However, longer HITs can lead to tiredness and boredom of the annotators, which can in turn lower the accuracy and increase the annotation time. To study this trade-off, experiments are conducted and their results are presented in the "Experimental Results" section.
• Annotation frequency: Although dense labeling seems desirable for improving workface assessment accuracy, overlabeling may lead to cost and time overrun. Manually annotating a sparse set of keyframes and applying extrapolation to generate annotations for the nonkeyframes can provide the same level of accuracy with less time and cost.
• Method of stitching several HITs together to derive the final assessment results: The platform breaks an entire operation video into multiple microtask HITs. To stitch the results of these HITs and derive the final workface assessment result, a one-second overlap is placed between the HITs. Then, the same method used to match repeated annotation results [Eq. (9)] is applied to all the overlapping annotations of the HITs, which are then stitched together to derive the final workface assessment results.

Results and Discussion

Experiment Setup and Performance Measures

To identify the key parameters of the platform and validate the effectiveness of crowdsourcing the workface assessment task, a pool of 10+ nonexpert annotators, typically found on AMT, was assembled together with a control group of experts [5+ professionals, who in this experiment are (1) field engineers with experience in productivity assessments; and (2) students who took a course in productivity and have experience in productivity assessments]. All expert participants in this experiment have performed productivity assessment for at least a one-day operation. The easy, normal, and hard videos of concrete placement operations introduced earlier in the paper were leveraged. With these videos, three separate experiments were conducted with the annotators from the controlled group of construction experts to investigate the impact of the different annotation methods, video lengths, and annotation frequencies on the accuracy of the workface assessment results.

To validate the hypothesis that crowdsourcing video-based construction workface assessment through the AMT marketplace is a reliable approach, three additional experiments were conducted on the impact of the platform on accuracy by (1) comparing the performance of nonexpert annotators with the controlled group of construction experts; (2) testing the linear and detection-based extrapolation methods and presenting the performance of each extrapolation method against ground truth data generated by the expert annotators; and (3) testing the performance of the postassessment quality control procedure and exploring the best repeated-labeling times for a desirable level of accuracy by experimenting with crossvalidation on different randomly selected folds from both the expert and nonexpert annotation results.

To compare the experiments and choose the parameters that achieve optimal performance, two validation criteria were chosen: (1) the annotation time spent to complete each experiment; and (2) the accuracy of the workface assessment results. Because the platform applies the compositional structure taxonomy of construction activities, three separate discussions on accuracy are presented:
1. Completeness accuracy examines the annotators' capability to annotate all workers in a HIT video without missing one.
2. Bounding box accuracy investigates the accuracy of worker localization. In this case, 50% overlap between the experimental assessment and the groundtruth is used as the acceptable threshold for accuracy.
3. Tool accuracy examines the annotators' capability to correctly label construction tools or the nondirect work categories.
In the following section, each experiment is introduced and the experimental results are reported.

Experimental Results

Annotation Method

The experiments for choosing the best annotation method include leveraging the one-by-one, all-at-once, and role-at-once methods to annotate the easy, normal, and hard videos. The time durations (in seconds) required for each annotation method across the expert annotator group are shown in Fig. 10. The average and total annotation times are provided in Table 2.

Fig. 10. Time spent using each annotation method for easy (0–15 min), normal (15–30 min), and hard (30–45 min) videos

From these results, it is observed that when there is a small number of construction workers in a video, the one-by-one annotation method shows the best performance. As the number of workers increases, the all-at-once method performs best. The one-by-one annotation method requires replaying the video for each worker; thus, it takes longer to annotate hard videos. It is also observed that when the activities of each construction worker change frequently, the one-by-one labeling method performs best, because it saves time by allowing annotators to focus on one construction worker at a time and minimizes the chance of mistakes and the need for unnecessary revisions. In contrast, a high frequency of changes in the workers' activities overwhelms annotators who use the all-at-once or role-at-once methods. These methods then require extra time for revising activity categories, because the annotators may need to rewind the video frequently to make all annotations consistent with one another. When the worker activity categories do not change frequently, the all-at-once method shows the best performance.

The results in Table 3 show that the accuracy of the assessment results for categorizing activities does not change significantly across the different annotation methods. However, among all methods, the role-at-once method shows the highest accuracy in labeling the worker roles. This is because this method requires the annotators to reason about the role of the workers, which consequently increases the accuracy of labeling roles.

Impact of the Video Length

To examine the relationship between video length and the speed of workface assessment, an experiment was conducted using three different video lengths (10, 30, and 60 s; the annotation times are shown in Fig. 11 and Table 4). Videos of 60-s length require the least annotation time. In particular, the saving in annotation time from posting 60-s videos is significant for the easy and normal videos. For hard videos, presenting videos of 60-s length does not significantly save annotation time compared with the other durations. This is because combining six 10-s videos into one larger 60-s video allows the annotators to become familiar with the video content, which in turn reduces the time needed for interpreting the content multiple times. It also reduces the need for redrawing the bounding boxes.

As shown in Table 5, choosing videos with different lengths does not make a significant difference in the accuracy of the assessment results. However, compared with the other video lengths, the 60-s video length exhibits lower accuracy for labeling tools (around 10% lower). This suggests that a longer video could cause tiredness and boredom for the annotators, lowering the accuracy compared with the annotation of shorter videos.

Frequency of Choosing Keyframes for Annotation

Annotating a sparse set of frames, as opposed to annotating frames one by one, can minimize the annotation time, but it may also negatively impact the accuracy of the assessment results. To explore the trade-off between time and accuracy, three fixed annotation frequencies of three, five, and nine times per minute were experimented with; in other words, the annotators were asked to use only these prefixed times per minute for their annotations. The annotation times for each frequency are shown in Fig. 12, and the average and total time durations of the annotations are presented in Table 6. The accuracy of each frequency is provided in Table 7. These experiments validate the hypothesis that sparse annotation can save time: the sparser the annotations, the less time annotation requires. The results particularly show that this method significantly shortens the annotation time for easy and normal videos, which exhibit a high frequency of changes in activity categories. Nevertheless, the gains are not as obvious for hard videos, which exhibit a low frequency of change in these categories.

Although choosing a smaller number of keyframes saves annotation time, it also lowers the accuracy of the assessment results. As shown in Tables 6 and 7, the three times per minute […]
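The 50% overlap criterion used earlier for bounding-box accuracy can be expressed as an intersection-over-union test on boxes in the [x_min, x_max, y_min, y_max] format of Eq. (1). This is a sketch of the criterion, not the authors' evaluation code; function names are assumptions:

```python
def iou(a, b):
    """Intersection over union of two boxes [xmin, xmax, ymin, ymax]."""
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # x-overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))  # y-overlap height
    inter = ix * iy
    area_a = (a[1] - a[0]) * (a[3] - a[2])
    area_b = (b[1] - b[0]) * (b[3] - b[2])
    return inter / (area_a + area_b - inter)

def box_is_accurate(annotated, groundtruth, threshold=0.5):
    """Apply the 50% overlap acceptance test for bounding-box accuracy."""
    return iou(annotated, groundtruth) >= threshold
```

This matches the common detection-evaluation convention in which a box counts as correct when its overlap with the groundtruth is at least 50%.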
Table 2. Average and Total Annotation Time for Each Annotation Method
Videos with different levels of difficulty in the annotation task
Time Methods Easy_01 Easy_02 Easy_03 Normal_01 Normal_02 Normal_03 Hard_01 Hard_02 Hard_03
Average AM01 550 592 534 1,491 1,317 1,359 1,307 1,084 1,254
AM02 363 836 607 1,404 1,276 1,879 871 823 804
AM03 478 922 788 1,736 1,507 1,647 1,174 908 1,576
Total AM01 — 8,380 — — 20,841 — — 18,232 —
AM02 — 9,034 — — 22,801 — — 12,495 —
AM03 — 10,943 — — 24,459 — — 18,298 —
Note: AM01 = one-by-one; AM02 = all-at-once; AM03 = role-at-once; time is measured in seconds.
Fig. 11. Annotation time for easy (0–15 min), normal (15–30 min),
and hard (30–45 min) videos
Table 4. Average and Total Time Spent on Annotating Videos with Different Length
Videos with different levels of difficulty in the annotation task
Time Lengths (s) Easy_01 Easy_02 Easy_03 Normal_01 Normal_02 Normal_03 Hard_01 Hard_02 Hard_03
Average 10 550 592 534 1,492 1,317 1,359 1,307 1,084 1,254
30 488 655 527 1,292 1,230 1,835 662 529 850
60 279 377 354 806 635 1,062 513 436 614
Total 10 — 8,380 — — 20,841 — — 18,232 —
30 — 8,349 — — 21,784 — — 10,227 —
60 — 5,050 — — 12,520 — — 7,812 —
Note: Time is measured in seconds.
[…] (onefold), and then annotations from each video category were randomly selected to constitute threefold, fourfold, fivefold, sixfold, sevenfold, and eightfold crossvalidations. For the easy, normal, and hard videos, it is observed that the increase in accuracy tends to be steady after a threefold crossvalidation. Figs. 16(a–c) show the results for the easy, normal, and hard videos, respectively. Fig. 16(c) also shows that the activity accuracy of the hard videos drops from the threefold to the eightfold crossvalidation. The accuracy drop after the sevenfold crossvalidation in Fig. 16(a) may result from unnecessary repeated labelings, which generate more erroneous data and thus cause the incorrect labels to stand out. The activity accuracy drop in Fig. 16(c) may also result from unnecessary repeated labeling, or from the farther camera distance, which can challenge the accuracy of the observations. Based on the average performance of each crossvalidation fold's accuracy, it was concluded that threefold crossvalidation provides the optimal performance and that increasing the fold number increases the risk of producing erroneous data.

[…] video conditions and annotation frequencies, two extrapolation methods were experimented with on the easy, normal, and hard videos with different annotation frequencies. Annotation frequency is indicated by the average clicks per frame per construction worker,
Fig. 13. Annotation time difference between expert and nonexpert annotators

Fig. 15. Percent role, activity, and tool accuracy differences between expert and nonexpert annotators
Fig. 16. Crossvalidation results for (a) easy; (b) normal; and (c) hard videos
Table 8. Difference in Accuracies between the Expert and Nonexpert Annotators

Category        Accuracy of an expert when compared with a nonexpert
Completeness    +0.02
Bounding box    +0.01
Role            −0.01
Activity        +0.03
Posture          0.00
Tool            +0.03

which includes 0.001, 0.005, 0.01, 0.05, and 0.1. Figs. 17(a–c) present the error rates of each method against the ground-truth annotations.

Fig. 17 illustrates that an increase in annotation frequency leads to a decrease in the error rate for both extrapolation methods. It is observed that increasing the annotation frequency beyond 0.01 clicks per frame per construction worker only marginally reduces the error rate. This experiment indicates that dense annotation does not necessarily guarantee higher annotation accuracy. As indicated by the error rates for both extrapolation methods in Figs. 17(a and b), linear extrapolation performs as well as the detection-based extrapolation method, with an error-rate difference within 5% on average. However, it is also observed that the linear extrapolation method performs much better than the detection-based extrapolation method in Fig. 17(c). This difference is likely caused by the difficulties of extracting effective visual features for the hard videos and of detecting workers in cluttered construction site conditions.

Discussion on the Proposed Method and Research Challenges

The results validate the hypothesis that crowdsourcing construction activity analysis from jobsite videos on the AMT, a marketplace with nonexpert annotators, is a reliable approach for conducting activity analysis. In addition, the platform facilitates the collection of large data sets with their ground truth that could be used for the development of computer-vision algorithms for automatic activity recognition. In particular, it is shown that expert annotators are, on average, 22% faster than nonexpert annotators in terms of their annotation time. However, the accuracy of annotation among the nonexperts is within 3% of the accuracy of the expert groups. To fine-tune the platform, the impact of different annotation methods, different HIT video lengths, and the frequency of required annotations were experimented on and discussed. Based on these experimental results, the following conclusions can be made:
1. The one-by-one annotation method works best with videos that have a small number of construction workers and a high frequency of changes in activities, whereas the all-at-once annotation method works best with videos that have a high number of construction workers and a low frequency of changes in work activities.
2. Increasing the HIT video length can reduce the annotation time. For example, the 60-s videos save 47 and 37% of the annotation time compared with the 10-s and 30-s videos, respectively. It was also observed that the accuracy of the workface assessment results slightly improves with an increase in the HIT video length.
3. Manual annotation of a sparse set of video keyframes is reliable for achieving complete frame-by-frame annotations. At the most extreme case, the three times per minute annotation […]
Fig. 17. The error rates of the linear and detection-based extrapolation methods for annotating nonkeyframes for (a) easy; (b) normal; and
(c) hard videos
[…] The optimal performance can be achieved with threefold repeated labeling (i.e., hiring three AMT annotators per HIT).
5. Increasing the number of keyframes may not necessarily increase the accuracy of 2D localization. The 2D localization accuracy tends to increase from 0.01 to 0.1 clicks per frame per construction worker, but the increase is within 18%, and it was observed from the experiments that increasing the keyframe number beyond 0.05 clicks per frame per construction worker barely leads to an increase in accuracy.
6. The linear extrapolation method can perform as well as the detection-based extrapolation method for the easy and normal videos. However, because of the difficulties of extracting visual features and detecting construction workers under severe construction site conditions in the hard videos, the detection-based extrapolation method fails to compete with the linear extrapolation method.

[…] the safety vest of the workers can address the issues of conducting data collection throughout the site as well as the line-of-sight and visibility of each camera.

Meanwhile, as part of future work, the following will be considered: (1) plotting the workface assessment results over the course of a day to get a better understanding of how soon crafts get on their tools in the morning, where there are excessive breaks, and where and when crews quit an operation too early; and (2) integrating observations from different viewpoints. Finally, to comprehensively validate this new method for construction video-based analysis, a set of detailed crowdsourcing market investigations and experiments should be conducted, not only to test the technical parameters, but also to build a process model to examine the cost associated with crowdsourcing, the time span between publishing and retrieving tasks, and the potential risks to worker privacy of outsourcing construction video annotations containing construction workers to the crowd. This platform is publicly accessible at http://activityanalysis.cee.illinois.edu. A video is also provided as a companion for better illustration of the functionalities of this platform.
This paper presents a novel method that supports crowdsourcing construction activity analysis from jobsite video streams. The proposed method leverages human intelligence recruited from a massive crowdsourcing marketplace, AMT, together with automated vision-based detection/tracking algorithms to derive timely and reliable construction activity analysis under challenging conditions such as severe occlusion, background clutter, and camera motion. The experimental results, with an average accuracy of 85% in workface assessment tasks, show the promise of the proposed method. The comparisons conducted between nonexperts and construction experts validate the hypothesis that crowdsourcing video-based construction activity analysis through AMT nonexperts can achieve similar (or even the same) accuracy as conducting activity analysis by construction experts.

To improve the platform, future work should focus on (1) the design of a more robust detection/tracking algorithm that can work well with sparse human input to effectively generate accurate nonkeyframe annotations; and (2) the design of a quality control method that does not require repeated labeling, to reduce requesters' cost and avoid erroneous data at the voting stage. As part of the study, a new compositional structure taxonomy for construction activities is also created that models the interactions between body posture, activities, and tools. This representation can improve detection/tracking by enhancing the propagation of manual annotations to nonkeyframes. Also, studies that focus on using the hidden Markov model to automatically infer construction activities from long sequences of jobsite videos could benefit the detection/tracking and quality control steps. Learning a set of transition […]

Acknowledgments

The authors would like to thank Zachry Construction Corporation and Holder Construction Group for their support with data collection. The authors thank Professor Carl Haas for his very constructive feedback during the development of the workface assessment platform. The technical support of Deepak Neralla in the development of the web-based tool is appreciated. The authors also thank the members of the Real-Time and Automated Monitoring and Control (RAAMAC) lab, the graduate and undergraduate civil engineering students, and the other AMT nonexpert annotators for their support. This work was financially supported by the University of Illinois Department of Civil and Environmental Engineering's Innovation Grant. The views and opinions expressed in this paper are those of the authors and do not represent the views of the individuals or entities mentioned above.

Supplemental Data

A video demonstration of the RAAMAC Crowdsourcing Workface Assessment Tool is available online in the ASCE Library (www.ascelibrary.org).

References

Ali, K., Hasler, D., and Fleuret, F. (2011). "Flowboost—Appearance learning from sparsely annotated video." 2011 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, CO, 1433–1440.
[…] Rep. No. IR252-2a, Univ. of Texas, Austin, TX.
Costin, A., Pradhananga, N., and Teizer, J. (2012). "Leveraging passive RFID technology for construction resource field mobility and status monitoring in a high-rise renovation project." Autom. Constr., 24, 1–15.
Dalal, N., and Triggs, B. (2005). "Histograms of oriented gradients for human detection." IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, IEEE, San Diego, 886–893.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). "ImageNet: A large-scale hierarchical image database." IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2009), IEEE, Miami, FL, 248–255.
Di Salvo, R., Giordano, D., and Kavasidis, I. (2013). "A crowdsourcing approach to support video annotation." Proc., Int. Workshop on Video and Image Ground Truth in Computer Vision Applications, ACM, New York, 1–6.
Doermann, D., and Mihalcik, D. (2000). "Tools and techniques for video performance evaluation." Int. Conf. on Pattern Recognition, IEEE Computer Society, Hilton Head, SC, 4167–4167.
Escorcia, V., Davila, M. A., Niebles, J. C., and Golparvar-Fard, M. (2012). "Automated vision-based recognition of construction worker actions for building interior construction operations using RGBD cameras." Construction Research Congress 2012, ASCE, Reston, VA, 879–888.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). "The PASCAL visual object classes (VOC) challenge." Int. J. Comput. Vision, 88(2), 303–338.
Gadgil, N. J., Tahboub, K., Kirsh, D., and Delp, E. J. (2014). "A web-based video annotation system for crowdsourcing surveillance videos." IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 90270A.
Giretti, A., Carbonari, A., Naticchia, B., and DeGrassi, M. (2009). "Design and first development of an automated real-time safety management system for construction sites." J. Civ. Eng. Manage., 15(4), 325–336.
Golparvar-Fard, M., Heydarian, A., and Niebles, J. C. (2013). "Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers." Adv. Eng. Inf., 27(4), 652–663.
Golparvar-Fard, M., Peña-Mora, F., Arboleda, C. A., and Lee, S. (2009). "Visualization of construction progress monitoring with 4D simulation model overlaid on time-lapsed photographs." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2009)23:6(391), 391–404.
Gong, J., and Caldas, C. H. (2009). "An intelligent video computing method for automated productivity analysis of cyclic construction operations." Proc., 2009 ASCE Int. Workshop on Computing in Civil Engineering, ASCE, Reston, VA, 64–73.
Gong, J., Caldas, C. H., and Gordon, C. (2011). "Learning and classifying actions of construction workers and equipment using bag-of-video-feature-words and Bayesian network models." Adv. Eng. Inf., 25(4), 771–782.
Gouett, M. C., Haas, C. T., Goodrum, P. M., and Caldas, C. H. (2011). "Activity analysis for direct-work rate improvement in construction." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000375, 1117–1124.
Heilbron, F. C., and Niebles, J. C. (2014). "Collecting and annotating human activities in web videos." Proc., Int. Conf. on Multimedia Retrieval, ACM, New York, 377.
Jaselskis, E., Sankar, A., Yousif, A., Clark, B., and Chinta, V. (2015). "Using telepresence for real-time monitoring of construction operations." J. Manage. Eng., 10.1061/(ASCE)ME.1943-5479.0000336, A4014011.
Kavasidis, I., Palazzo, S., Di Salvo, R., Giordano, D., and Spampinato, C. (2012). "A semi-automatic tool for detection and tracking ground truth generation in videos." Proc., 1st Int. Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications, ACM, New York, 6.
Khosrowpour, A., Fedorov, I., Holynski, A., Niebles, J. C., and Golparvar-Fard, M. (2014a). "Automated worker activity analysis in indoor environments for direct-work rate improvement from long sequences of RGB-D images." Construction Research Congress 2014: Construction in a Global Network, ASCE, Reston, VA, 729–738.
Khosrowpour, A., Niebles, J. C., and Golparvar-Fard, M. (2014b). "Vision-based workface assessment using depth images for activity analysis of interior construction operations." Autom. Constr., 48, 74–87.
Kim, J., Nguyen, P. T., Weir, S., Guo, P. J., Miller, R. C., and Gajos, K. Z. (2014). "Crowdsourcing step-by-step information extraction to enhance existing how-to videos." Proc., 32nd Annual ACM Conf. on Human Factors in Computing Systems (CHI '14), ACM, New York, 4017–4026.
Kim, J. Y., and Caldas, C. H. (2013). "Vision-based action recognition in the internal construction site using interactions between worker actions and construction objects." International Association for Automation and Robotics in Construction (IAARC), Chennai, India, 661–668.
Le, J., Edmonds, A., Hester, V., and Biewald, L. (2010). "Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution." Proc., SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, Microsoft, Geneva, 21–26.
Memarzadeh, M., Golparvar-Fard, M., and Niebles, J. C. (2013). "Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors." Autom. Constr., 32, 24–37.
Oglesby, C. H., Parker, H. W., and Howell, G. A. (1989). Productivity improvement in construction, McGraw-Hill, New York.
Park, M.-W., and Brilakis, I. (2012). "Construction worker detection in video frames for initializing vision trackers." Autom. Constr., 28, 15–25.
Park, M.-W., Koch, C., and Brilakis, I. (2011). "Three-dimensional tracking of construction resources using an on-site camera system." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000168, 541–549.
Peddi, A., Huan, L., Bai, Y., and Kim, S. (2009). "Development of human pose analyzing algorithms for the determination of construction productivity in real-time." Construction Research Congress, ASCE, Seattle, 11–20.
Pradhananga, N., and Teizer, J. (2013). "Automatic spatio-temporal analysis of construction site equipment operations using GPS data." Autom. Constr., 29, 107–122.
Rezazadeh Azar, E., Dickinson, S., and McCabe, B. (2012). "Server-customer interaction tracker: Computer vision-based system to estimate dirt-loading cycles." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000652, 785–794.
Rezazadeh Azar, E., and McCabe, B. (2012). "Automated visual recognition of dump trucks in construction videos." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000179, 769–781.
[…] 10.1061/(ASCE)CP.1943-5487.0000242, 103–112.
[…]guage tasks." Proc., Conf. on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, PA, 254–263.
Sorokin, A., and Forsyth, D. (2008). "Utility data annotation with Amazon Mechanical Turk." IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops, IEEE, Anchorage, AK, 1–8.
Teizer, J., Lao, D., and Sofer, M. (2007). "Rapid automated monitoring of construction site activities using ultra-wideband." Proc., 24th Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction (IAARC), Chennai, India, 19–21.
Yuen, J., Russell, B., Liu, C., and Torralba, A. (2009). "LabelMe video: Building a video database with human annotations." 2009 IEEE 12th Int. Conf. on Computer Vision, IEEE, Miami, FL, 1451–1458.
Yuen, M.-C., King, I., and Leung, K.-S. (2011). "A survey of crowdsourcing systems." 2011 IEEE 3rd Int. Conf. on Social Computing (SocialCom), IEEE, Boston, 766–773.
Zhai, D., Goodrum, P. M., Haas, C. T., and Caldas, C. H. (2009). "Relationship between automation and integration of construction information systems and labor productivity." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000024, 746–753.