
Tracking-based 3D human skeleton extraction from stereo video camera toward an on-site safety and ergonomic analysis

Meiyin Liu
Civil and Environmental Engineering Department, University of Michigan, Ann Arbor, Michigan, USA

SangUk Han
Department of Civil and Environmental Engineering, University of Alberta, Edmonton, Canada, and

SangHyun Lee
Civil and Environmental Engineering Department, University of Michigan, Ann Arbor, Michigan, USA

Received 17 October 2015; Revised 12 January 2016, 29 March 2016, 8 April 2016; Accepted 12 April 2016

Abstract
Purpose – As a means of data acquisition for situation awareness, computer vision-based motion capture technologies have increased the potential to observe and assess manual activities for the prevention of accidents and injuries in construction. This study thus aims to present a computationally efficient and robust method of human motion data capture for on-site motion sensing and analysis.
Design/methodology/approach – This study investigated a tracking approach to three-dimensional (3D) human skeleton extraction from stereo video streams. Instead of detecting body joints on each image, the proposed method tracks the locations of the body joints over all the successive frames by learning from the initialized body posture. The body joints corresponding to the tracked ones are then identified and matched on the image sequences from the other lens and reconstructed in a 3D space through triangulation to build 3D skeleton models. For validation, a lab test is conducted to evaluate the accuracy and working ranges of the proposed method.
Findings – Results of the test reveal that the tracking approach produces accurate outcomes at a distance, with nearly real-time computational processing, and can potentially be used for site data collection. Thus, the proposed approach has potential for various field analyses for construction workers’ safety and ergonomics.
Originality/value – Recently, motion capture technologies have rapidly been developed and studied in construction. However, existing sensing technologies are not yet readily applicable to construction environments. This study explores two smartphones as stereo cameras as a potentially suitable means of data collection in construction for its fewer operational constraints (e.g. no on-body sensor required, less sensitivity to sunlight and flexible ranges of operations).
Keywords Ergonomics, Construction safety, Stereo vision, Computer vision,
3D human skeleton extraction, Motion tracking
Paper type Research paper
Construction Innovation, Vol. 16 No. 3, 2016, pp. 348-367. © Emerald Group Publishing Limited, 1471-4175. DOI 10.1108/CI-10-2015-0054

The work presented in this paper was supported financially with a National Science Foundation Award (No. CMMI-1161123). Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Introduction
The construction industry is labor-intensive, requiring labor force as one of the major resources for production. In the USA, 11.1 million employees worked in the construction industry, which accounted for about 7 per cent of the overall US workforce in 2010 (CPWR, 2013). From a cost perspective, labor cost also often forms 33-50 per cent of the total project cost in construction (Hanna, 2001; Siriwardana and Ruwanpura, 2012). In this regard, efficient planning, monitoring and controlling of on-site employees are key to the success of construction projects. In fact, recent studies in construction have introduced the automated monitoring of construction workers by addressing the problems with existing observation methods, which are time-consuming and painstaking for human observers in their daily practice (Yang et al., 2010; Gong and Caldas, 2011a; Gong et al., 2011b; Ray and Teizer, 2012; Cheng et al., 2013; Memarzadeh et al., 2013; Han and Lee, 2013; Han et al., 2014).
One of the major techniques used for such monitoring is based on computer vision. Computer vision-based worker monitoring provides several advantages: it eliminates the need to attach a sensor or a tag to a human worker, it does not interfere with workers’ on-going tasks (Moeslund et al., 2006), and it provides rich context for behavior-related analysis (Gong and Caldas, 2010). In previous vision-based studies, various types of motion representations (e.g. color models, silhouettes, quasi-skeletons and three-dimensional [3D] skeletons) have been proposed depending upon the purpose of monitoring (e.g. productivity measurements by Gong and Caldas (2011a) and Gong et al. (2011b), unsafe action detection by Han and Lee (2013), and ergonomic analysis by Ray and Teizer (2012) and Seo et al. (2014)). In general, coarse motion representations, such as simple silhouettes of a human subject, are relatively easy and stable to capture but contain less-detailed information about body configurations and postures, in turn making it difficult to understand the context of worker behavior. On the other hand, fine motion representations – for instance, 3D human skeletal models – are predominant in construction research on vision-based workers’ safety and ergonomics because they allow for diverse types of motion analysis, such as activity recognition (Kim and Caldas, 2013; Han et al., 2014), posture analysis (Ray and Teizer, 2012) and biomechanical analysis (Seo et al., 2013). Particularly, continuous exposure to physically demanding activities (e.g. lifting, carrying materials, bending or twisting, and repetitive and forceful hand activity) is among the main causes of work-related ergonomic injuries – the leading type of non-fatal injuries [Bureau of Labor Statistics (BLS), 2014]. Taking this into account, 3D skeletal models, if available, can provide rich information on human movements (e.g. angular information for ergonomic and biomechanical analysis). This is consistent with the fact that a biomechanical analysis requires joint angles as input data (Seo et al., 2013). To ultimately automate an on-site ergonomic and biomechanical analysis, it is essential to obtain joint angles from 3D skeletons. This research focuses on 3D skeleton extraction in a cost-effective and computation-efficient manner with fewer operational constraints (e.g. a longer range and operability in outdoor environments) toward on-site applications for worker monitoring.
There are several ways to obtain the 3D skeleton models. For example, red, green,
blue plus depth (RGB-D) sensors that capture RGB images along with depth information
(e.g. Microsoft Kinect) and marker-based motion capture systems (e.g. Vicon and
Optotrak) are commonly used (Ray and Teizer, 2012; Escorcia et al., 2012; Han et al.,
2013, 2014). Yet, these commercial systems are not fully applicable to on-site motion
capture. A detailed analysis of the pros and cons of each approach regarding potential on-site application is presented in the next background section (Table I).
This paper thus proposes a stereo vision system for 3D pose estimation that tracks the positions of body joints over two-dimensional (2D) image frames and extracts 3D human skeletal models from multiple view image sequences. Utilizing both spatial and temporal information from the human figure across video frames, the 3D pose estimation can be accelerated significantly by continuously updating training images over frames and reducing the search space for the detection. To evaluate the performance of the proposed approach, a lab test is conducted in which a commercial motion capture system is used as the ground truth. The following sections review various types of motion capture systems, present technical details of the tracking-based motion capture, describe the test settings, and report and discuss the results.
Background: existing motion capture technologies
Motion capture technologies have been developed from both software and hardware (e.g. sensor) aspects. In particular, the type of sensor has a significant impact on the adoption of such technologies in construction because sensors may interfere with workers’ ongoing work. In this section, three different sensing approaches to motion capture – namely, on-body sensors, range sensors and RGB sensors – are reviewed in the text and compared in terms of pros and cons regarding their potential applicability to construction jobsites (Table I).

On-body sensors
Using sensors (e.g. inertial measurement units, magnetic sensors, goniometer etc.)
directly attached to the body parts of interest, an on-body sensor system enables reliable
and robust motion data capture. A variety of applications could take advantage of this
emerging approach, such as rehabilitation, sports science and medicine, geriatric care,
and health and fitness monitoring (Albinali et al., 2009). It has seldom been applied in the construction industry because the attached sensors would interfere with workers’ on-going work. For lab testing or education and training purposes, on-body sensors have been utilized in construction (Alwasel et al., 2013; Chen et al., 2014).
Table I. Comparison of existing human motion capture approaches adopted in construction

Technologies               Advantages                                    Disadvantages
On-body sensors            Reliable data source, free of impact from    Invasive to on-going task, complex
                           visual occlusion and illumination change     installation and operation
Range sensors
  Structured light sensor  Non-invasive to on-going work                Limited active range, indoor
                                                                        application
  Time-of-flight camera    Non-invasive, outdoor application            Low resolution
RGB sensors
  Monocular camera         Non-invasive to on-going work, outdoor       Computational complexity,
                           application, long active range, simple       interference from occlusion and
                           installation and operation                   illumination
  Stereo camera            Non-invasive to on-going work, outdoor       High computation expense,
                           application, long active range               interference from occlusion and
                                                                        illumination
Range sensors
Depth images generated by a range sensor have become popular in recent motion capture studies, as the depth information (i.e. 3D positions) may help address issues of complex human body segmentation and ambiguities between different poses under similar appearance (Ganapathi et al., 2010). Two types of range sensors, categorized according to depth measurement techniques, have been developed and utilized as follows:
(1) Structured light sensor: One range sensor, the Microsoft Kinect, has become a very popular device in human motion capture studies because of its low price and off-the-shelf applicability. A Kinect device consists of an RGB camera and a depth sensor. The depth information is computed based on the distortions of structured light, which projects a known pattern of infrared light dots onto the scene (Chen et al., 2013). This type of sensor has been limited to use in lab settings (Escorcia et al., 2012) or indoor construction sites (Khosrowpour et al., 2014).
(2) Time-of-flight camera: A time-of-flight camera measures the travel time of light signals between the light source and the reflecting object. This method is known to work in outdoor conditions because it is less sensitive to sunlight, and it can operate in a relatively wide range – up to 10 m (Chen et al., 2013). Recently, Leone et al. (2010) extracted human skeletons from 3D data clouds captured by a time-of-flight camera for fall detection, and Diraco et al. (2013) used the extracted skeleton for posture analysis.

Red, green and blue sensors
This method relies on 3D pose estimation algorithms applied to images or image sequences. Based on the number of cameras, single view and multiple view approaches have been studied:
• Monocular camera: Given the limited information available on a 2D image, the use of models, such as kinematics, shapes and appearances, has been the widely adopted approach to the reconstruction of 3D human postures. As only one camera is required for the 3D motion capture, this method possesses the potential for application in field settings for its ease of use and portability. However, 3D skeleton extraction from single view images is considerably challenging compared to other approaches (e.g. range sensors, multiple cameras). This is in part because of perspective ambiguity (Diraco et al., 2013), which generally brings with it issues of incomplete motion information (e.g. ambiguous scale) or significantly complex algorithms. Hence, this category of approaches has hardly been studied in construction research to date.
• Stereo camera: A stereo camera infers the 3D structure of a scene from two (or more) images from slightly different viewpoints (Trucco and Verri, 1998). Technically, this approach either uses an a priori pose model or an explicit 3D skeleton model to match against the images, or estimates and maps 2D image sequences into a 3D pose (Moeslund et al., 2006). In construction, Han and Lee (2013) detected the locations of human body joints on 2D images from a 3D camcorder and then computed the 3D positions of the joints using triangulation to build 3D skeletal models. However, the detection approach may involve computational issues, such as frequent outliers in pose estimation and large search spaces for detection, because the detection of body joints is performed on each frame (Moeslund and Granum, 2001).
Stereo cameras are regarded as a potentially suitable means of data collection in construction because they possess fewer operational constraints that are crucial to application on a jobsite: no marker or on-body sensor is required, and little sensitivity to sunlight allows for outdoor applications and a wide range of operations. However, previous research applying this method in construction needs to improve its computational efficiency and accuracy for field applications. This paper proposes a stereo camera-based approach with improved computational efficiency and accuracy to enhance its potential for on-site application.

Methods: tracking approach to 3D motion capture


This research aims to advance a current vision-based motion capture technique in terms of computational complexity toward the goal of automating 3D human skeleton extraction for human motion analysis. As studied in prior work (Ray and Teizer, 2012; Han et al., 2014; Seo et al., 2013), the resulting motion capture data can later be used for ergonomic and biomechanical analysis, which requires fine motion data including the angles at body joints (e.g. arms, legs, shoulders). From a theoretical perspective, motion capture generally includes pose estimation and tracking as two separate processes (Moeslund et al., 2006). Pose estimation focuses on the static aspect, detecting and estimating the location of body parts on each 2D image frame. Tracking, on the other hand, models the relations between the detected target (e.g. a body joint) in the previous frame and that in the following frames, which takes into account a motion aspect rather than static human postures. An existing study (Han and Lee, 2013) has validated the detection-based approach, which obtains a 3D human skeleton from a 2D skeleton detected in one of the stereo camera views. This existing approach conducted pose estimation on static frames independently, without considering the motion aspect between consecutive frames. To fill this knowledge gap from a technical perspective by modeling temporal correspondence, and to accelerate the computation speed of the existing approach from a practical perspective, a tracking-based approach is proposed in this paper. By utilizing the temporal constraint on body joint locations between consecutive frames, the search for a body joint location is bounded within a minimum potential area; thus, the computation speed is accelerated significantly, and prior knowledge (the training dataset required in a detection-based approach) is not needed. To improve the portability of sensing devices in data collection, the approach is validated using two common smartphones, and the results of motion capture are then compared with the ones from an accurate sensor-based motion capture system that simultaneously measures the motion, thus serving as the “ground truth”. In this validation, bone lengths and body joint rotation angles are measured through lab experiments to assess the consistency of tracking over frames and the precision of motion capture. The proposed method and experimental settings are further described in this section.
The proposed method, as shown in Figure 1, technically involves the following three processes:
(1) 2D body joints tracking;
(2) 2D skeleton matching; and
(3) 3D skeleton reconstruction.

To estimate 3D spatial information of the skeleton, the proposed approach first
initializes the positions of individual body joints in one of the stereo images and tracks
them over the following frames. Prior to recognizing the corresponding 2D skeleton in
the other stereo image, the stereo camera parameters are set and obtained through stereo
calibration. The camera parameters and feature matching between two stereo images
help identify dual 2D skeletons that characterize the projection of a common skeleton in 3D space on the stereo images. Again with the camera parameters, the corresponding body joint locations on the 2D images from the two views are triangulated to calculate the 3D positions and build 3D human skeleton models. The detailed descriptions of each process are now discussed.

Two-dimensional body joints tracking
2D body joint tracking mainly consists of:
• the initialization of trackers on individual joints;
• the tracking of body joints by searching and matching the joints on successive frames; and
• the production of resultant 2D skeleton models over all of the image sequences from one camera view.

For the tracking of an object, a tracker model that well represents the appearances of the
object needs to be carefully selected among available models, such as a single point,
multiple points, patch, contour, silhouette and so on.
Figure 1. Overview of tracking-based 3D skeleton extraction from multiple view images: individual joint trackers are initialized and tracked over frames in camera 1 to construct 2D skeletons; after calibrating the stereo cameras, the skeletons are matched in the stereo images (cameras 1 and 2) and reconstructed as a 3D skeleton
Because the location and movement of a body joint are essentially characterized by its centroid, a point tracker suitable to localize the joint centroid is selected. A point tracker is generally simple, computationally
inexpensive and less sensitive to the camera view, compared to other representations
(e.g. patch). In this study, the joint location is initialized on the first frame by detection
algorithms (Yang and Ramanan, 2011) or user input, and then its location in the
successive frame is estimated by a similarity measure between feature descriptors of
consecutive frames about the target. The appearance of the target and its nearest
region could be described using color space values, such as RGB values. However,
identical colors perceived by humans do not have the same values in an RGB space, and
the three components are also highly correlated (Paschos, 2001). In this respect, hue,
saturation and value (HSV) would be a suitable option which provides more
perceptually uniform color space values (Yilmaz et al., 2006), and thus is selected to
reduce the difference of color space values of the same point in different frames. In addition to appearance, the proposed approach utilizes a movement feature, which provides important information in tracking. Optical flow is a common and popular
feature describing horizontal and vertical pixel-wise movement of object points in
images. Although optical flow is computed based on pixel intensities across frames, it mitigates the effect of illumination changes by assuming brightness constancy of corresponding pixels between two consecutive frames (Horn and Schunck, 1981).
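For reference, the brightness constancy assumption can be written as follows; this is the standard formulation from the optical flow literature rather than an equation reproduced from this paper:

I(x, y, t) = I(x + u, \; y + v, \; t + 1)

which a first-order Taylor expansion linearizes to the optical flow constraint

I_x u + I_y v + I_t = 0

where (u, v) is the pixel displacement and I_x, I_y and I_t are the partial derivatives of the image intensity.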
In searching for the joints with the features described above, global searching for localization may cause confusion between symmetric joints because the major body joints are composed of symmetric pairs (e.g. the left and right wrists). The proposed method thus adopts a local search within the neighborhood of the joint by setting constraints on both the search region and the match area. The search region is defined by the maximum pixel displacement that the target joint could move between two consecutive frames. Given the movement velocity of a particular joint, the distance between the joint and the camera, and the frame rate of the recorded video, a proper dimension of the search region can be selected (e.g. 20 × 20 pixels) to restrict the possible location of the target in the following frame. On the other hand, the match area is defined by the region around the target, in which the features of all bounded pixels are included to measure the similarity between the tracked regions of the current and future states. The match area is considered in the proposed computational process not only to reduce the computation expense by decreasing the search space but also to minimize the impact of intensity changes of pixels outside the body joint. With the constraints of both the search and match regions, the target can be properly tracked without significant confusion from symmetric joints or irrelevant objects (including the background).
Particularly, as each body joint has a different appearance (e.g. number of pixels occupied) and movement information (e.g. velocity), each body joint can be individually tracked given a unique parameter setting. A 2D skeleton is then constructed by setting up connection “sticks” between body joints, which represent the location and orientation of body parts.
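As an illustration of this tracking step, the following minimal sketch uses OpenCV’s pyramidal Lucas-Kanade point tracker as a stand-in for the point tracker described above; the window size plays the role of the match area and the pyramid bounds the recoverable displacement. The video file name and initial joint coordinates are hypothetical, and the HSV similarity measure is omitted here (Lucas-Kanade matches on grayscale intensity).

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("camera1.mp4")          # hypothetical video file
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Joint locations initialized on the first frame (e.g. by user input).
joints = np.array([[[640.0, 360.0]], [[610.0, 480.0]]], dtype=np.float32)

lk_params = dict(
    winSize=(20, 20),   # match area around each joint, cf. the 20 x 20 pixel example
    maxLevel=2,         # pyramid levels bound the displacement that can be recovered
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))

tracks = [joints.reshape(-1, 2).copy()]
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_joints, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, joints, None, **lk_params)
    joints = new_joints          # in practice, update only where status == 1
    tracks.append(joints.reshape(-1, 2).copy())
    prev_gray = gray
cap.release()
# "tracks" holds the per-frame 2D joint locations from which the 2D
# skeleton "sticks" can be drawn.
```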

Two-dimensional skeleton matching


Stereo vision cameras from two views can be utilized to compute the depth values of
body joints on 2D images. The depth from a camera to the body joints on image
sequences is calculated by identifying pairs of corresponding body joints on two view
images as an initial step (Han and Lee, 2013).
As a result of 2D body joint tracking, 2D skeleton models are acquired in one of the stereo vision cameras. To identify the corresponding body joints on the image from the other viewpoint, feature descriptors, such as scale-invariant feature transform (SIFT) (Liu et al., 2011), speeded up robust features (SURF) (Uijlings et al., 2010; Bay et al., 2008) or correlation values around the joints, can be utilized, as they allow for computing the features (e.g. edges, corners, objects) that describe and represent the body joint and its adjacent pixels. Particularly, a transform matrix (i.e. homography) can be computed and applied to get a reasonable estimation of the potential search space for detecting the corresponding body joints when the distance between the two cameras is quite close (Figure 2) (Han and Lee, 2013).
Specifically, the projective transformation, or homography, between the left and right images can be computed by applying feature descriptors (i.e. SIFT or SURF) and matching between pairs of featured points. As the homography is computed based on matched features, the computational process is often sensitive to the updating feature mapping process (i.e. the homography parameters keep updating over iterations), and it may be difficult to empirically determine an ideal parameter setting for the feature matching process. To address this issue, the stereo images are undistorted (i.e. lens distortion corrected) or even rectified before feature matching, using the camera parameters obtained from the calibration (Figure 3). After features are matched, the 3 × 3 homography matrix (H) is initially estimated by the random sample consensus (RANSAC) algorithm (Fischler and Bolles, 1981), and the Levenberg-Marquardt algorithm (Pujol, 2007) is applied to attain the optimal solution. With the homography matrix and the joints’ image coordinates in one image (e.g. the right-eye image), the corresponding coordinates on the other image (e.g. the left-eye image) can be obtained by searching the estimated regions using the feature descriptors.
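A sketch of this step in Python/OpenCV is given below, under the assumption that SIFT features, a ratio test and RANSAC stand in for the procedure described above (OpenCV refines the RANSAC estimate on the inliers with Levenberg-Marquardt internally). The file names and the joint location are hypothetical.

```python
import cv2
import numpy as np

left = cv2.imread("left_frame.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_l, des_l = sift.detectAndCompute(left, None)
kp_r, des_r = sift.detectAndCompute(right, None)

# Match descriptors and keep unambiguous matches (Lowe's ratio test).
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des_l, des_r, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

pts_l = np.float32([kp_l[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts_r = np.float32([kp_r[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# 3x3 homography H estimated with RANSAC; x' = H * x for matched points.
H, inlier_mask = cv2.findHomography(pts_l, pts_r, cv2.RANSAC, 5.0)

# Map a tracked joint from the left image into the right image to bound
# the search region for the corresponding joint.
joint_left = np.float32([[[640.0, 360.0]]])    # hypothetical joint location
joint_right_est = cv2.perspectiveTransform(joint_left, H)
```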

Figure 2. Homography (H) between stereo images: corresponding points satisfy x′ = H · x
Figure 3. An example of image matching after rectification (cameras 1 and 2)

Three-dimensional skeleton reconstruction


Pairs of corresponding body joints (i.e. 2D skeletons) from two view images are
triangulated with camera parameters, including the relative positions between cameras,
to compute the 3D positions of joints (Han and Lee, 2013). In this process, the accuracy
of 3D pose estimation significantly relies on the correctly matched image pixels between
two view images. To minimize the error of the point matching process, the search regions
in detecting corresponding joints can be initially estimated using the homography
matrix and then refined by adding constraints from camera parameters. The optimal
matching comes from the epipolar constraint where the location of the corresponding
point is known to lie within a certain row of pixels. The particular row of pixels where
corresponding points are found is determined by the geometric transform between two
images, which can be mathematically represented by the fundamental matrix. To
compute the fundamental matrix, the normalized eight-point algorithm (Hartley and Zisserman, 2004) is applied, given the image coordinates of a set of paired corresponding points. Given a
pair of matched image points describing a common world point, the 3D point can be
located at the intersection of the two projection rays. To obtain the location information of the two ends of the triangle baseline, or the optical centers of the two cameras, the camera parameters (e.g. relative positions between cameras) are determined by a stereo camera calibration process (Zhang, 2004). While calibrating the stereo camera, the absolute spatial
structure (proximity and direction) of a set of world points is required to recover the
geometry including cameras’ optical centers, image points, and world points. In practice,
a “chessboard” is widely used to easily provide absolute spatial structure of world
points as the coordinate system could be determined within the plane of the surface.
Given the cameras’ locations, the world points’ coordinates in 3D space can be calculated by intersecting two corresponding projection rays from each camera’s optical center through the image points (Figure 4), using the singular value decomposition (SVD) method (Hartley and Zisserman, 2004).
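The linear triangulation described here can be sketched as follows; the function implements the SVD-based (direct linear transformation) method of Hartley and Zisserman (2004), and the intrinsics and baseline in the usage example are hypothetical (OpenCV’s cv2.triangulatePoints performs an equivalent computation).

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # P1, P2: 3x4 projection matrices of the calibrated cameras.
    # x1, x2: matched image points (pixels) of the same body joint.
    # Each view contributes two rows to the homogeneous system A X = 0.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point is the right singular vector of A with the smallest
    # singular value, dehomogenized.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Example: camera 1 at the origin, camera 2 translated 300 mm along x
# (hypothetical intrinsics K and extrinsics for illustration only).
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-300.0], [0.0], [0.0]])])
X = triangulate(P1, P2, np.array([700.0, 400.0]), np.array([650.0, 400.0]))
```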
Laboratory testing
A lab test is conducted to evaluate the performance of the tracking approach to 3D motion capture. Figure 5 illustrates the test setup, including the distances between the human subject and the sensors, and snapshots of the sensors and the human subject. Notably, two smartphones were used as cameras for practicality, showing that the proposed approach does not require special camera devices because two common smartphones can generate the expected results. Specifically, two iPhone 6 Plus devices are used, and the videos are captured by iOS’s built-in camera app (1080p HD video recording at 30 fps).
In the test, the motions of a human subject are simultaneously recorded with the two smartphones and a commercial marker-based motion capture system – Optotrak – which is not suitable for a field setting. Specifically, the 3D skeleton models extracted from the two smartphones are compared to the ones captured by the commercial motion capture system, regarded as the ground truth in this test.
Two common actions in construction tasks, squat lifting and walking, are selected as test cases. Squat lifting is included because overexertion in lifting causes a significant proportion of the work-related musculoskeletal disorders among construction workers (CPWR, 2013). On the other hand, walking is selected because it is one of the most common actions on a jobsite and may involve less self-occlusion than others; hence, it may allow for a fair evaluation of visual sensing. In the test, one human subject conducts 8 cycles of squat lifting and 16 cycles of walking, with all cycles of each posture taking 30 s. The identical video length for each posture aims at a fair comparison in terms of algorithm performance, and the number of cycles is thus determined by the video length. Enough repetition of postures is provided to validate the approach against instance/time variation.
Figure 4. 3D body joints triangulation (X = f(x, x′))
Figure 5. Testing setup

As performance evaluation metrics, each body part’s length and the rotation angles at body joints are computed to see if the proposed method can stably estimate 3D human poses over time. For the comparison, both motion data sets are temporally synchronized, because of the different frame rates, and spatially synchronized for the visual inspection. Both approaches produce the 3D joint locations of the human skeleton in Cartesian coordinates as output; hence, the difference between the tracked results and the ground truth is readily available for the performance evaluation. The frame rate of both cameras is 29 fps, and human motion data of 500 frames for each action (squat lifting and walking) are captured and compared.

Three-dimensional skeleton extraction


In 2D pose estimation, the initialization of joint locations on the first frame is critical for
the accurate tracking of body joints on 2D images. In this test, such localization is
manually performed on the first frame. The tracking of joints in the successive image sequences requires an appropriate configuration of the search region and match region, which, if tuned correctly, may reduce computational loads and the appearance confusion caused by symmetric body parts. The efficiency of the tuning work can be improved with
visualized parameter settings for empirical performance evaluation; Silhouette 5
(SilhouetteFX, LLC) is selected to assess the tracking approach with various parameter
settings. The search or match region is visualized with a bounding box whose corners
can be drawn over a continuous spectrum of parameter values. One can thus easily evaluate whether a parameter hinders the correctness of the joints’ localization results, test various parameters and find the optimum combination resulting in the most accurate tracking results. Once the tracking results of individual body joints are obtained, 2D skeletons (Figure 6) can be constructed based on the sequence of localized body joints and the pre-defined spatial hierarchy among them.
It is commonly acknowledged that a longer distance between the scene and the camera system has always been preferred in stereoscopy (Delon and Rougé, 2007). Meanwhile, the distance between the cameras should be limited to ensure enough overlapping area between the stereo camera views for target matching. Given the physical experimental conditions, a distance between the cameras of around 30 cm is selected. The sensitivity of the distance between the stereo cameras to the 3D reconstruction accuracy could be studied in future work. Compared to a stereo camera (e.g. a 3D camcorder), the use of two separate cameras in this test may potentially lead to higher accuracy of 3D triangulation for the longer baseline between the cameras (e.g. the 27.4 cm between cameras in this test is longer than the 35 mm in a 3D camcorder). The captured image sequences from the separated cameras will be out of synchronization if the timestamps of their starting frames are misaligned.

Figure 6. Sample results of 2D skeleton tracking
The temporal misalignment, if not calibrated, will lead to the failure of reconstructing 3D skeletons from the dual 2D ones. In this test, the two video streams are manually synchronized afterward, during data analysis, by observing common visual cues present in the frames, for instance, the clapping of the human subject before and after conducting the actions. Through this temporal synchronization of the stereo video streams, the extracted 2D skeleton on the right-eye image can infer the corresponding 2D skeleton on the left-eye image captured simultaneously.
To triangulate the dual 2D skeletons for the 3D skeleton, the stereo cameras are initially calibrated using the MATLAB stereo calibration tool (Computer Vision System Toolbox, MATLAB R2015b); snapshots of the calibration tasks are illustrated in Figure 7. Given the stereo camera parameters from calibration, the 3D skeleton can be obtained by computing the 3D locations of the 14 major body joints. Figure 8 shows a 3D skeleton generated from the testing data as an example of a qualitative result.
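For readers without MATLAB, a rough Python/OpenCV equivalent of this calibration step is sketched below. The checkerboard dimensions, square size and file patterns are assumptions for illustration, not the settings used in the experiment.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)        # assumed inner-corner grid of the checkerboard
square_mm = 25.0        # assumed square size
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_mm

obj_pts, pts1, pts2, img_size = [], [], [], None
for f1, f2 in zip(sorted(glob.glob("cam1_*.png")), sorted(glob.glob("cam2_*.png"))):
    g1 = cv2.imread(f1, cv2.IMREAD_GRAYSCALE)
    g2 = cv2.imread(f2, cv2.IMREAD_GRAYSCALE)
    ok1, c1 = cv2.findChessboardCorners(g1, pattern)
    ok2, c2 = cv2.findChessboardCorners(g2, pattern)
    if ok1 and ok2:
        obj_pts.append(objp)
        pts1.append(c1)
        pts2.append(c2)
        img_size = g1.shape[::-1]

# Per-camera intrinsics, then the rotation R and translation T of camera 2
# relative to camera 1 (the stereo baseline used for triangulation).
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, pts1, img_size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, pts2, img_size, None, None)
ret, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, pts1, pts2, K1, d1, K2, d2, img_size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```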
Performance evaluation
The performance of the tracking-based motion capture is evaluated in terms of:
• bone lengths between body joints; and
• rotation angles of body parts at the joints.

The bone lengths in each frame are measured by the tracking-based method as the Euclidean distance between major body joints (e.g. body part IDs 1 to 10 in Figure 9) and compared to the corresponding bone length in each frame measured by the Optotrak. Particularly, the measurement is compared to the result of the detection-based 3D pose estimation reported in Han and Lee (2013) to assess the two different approaches overall – namely, tracking and detection – although the data sets used in the two studies are different.
A rotation angle is a measurement widely used in commercial motion capture
systems (e.g. Vicon and Kinect) to characterize human postures (Meredith and Maddock,
2000). It is defined as the Euler rotation angle between a particular body part’s directional vector (e.g. the right forearm) at a certain time (e.g. Tx) and the vector of the identical body part at an initial time (e.g. T0) when the body configuration is initially defined (e.g. T-pose)
(Han et al., 2014).
Figure 7. Camera calibration with a checkerboard: all the corners of the checkerboard squares are automatically detected by the MATLAB stereo calibration tool
Figure 8. Qualitative experiment result: reconstructed 3D human skeleton

Figure 9. Body part indices for bone length and rotation angle validation

Specifically, rotation angles are commonly defined in a local coordinate system based on the hierarchical body structure (Figure 10(a)); that is, the rotation angle of a child body part (e.g. part ID 1 in Figure 10(a)) is determined according to the angle of its parent body part (e.g. part ID 2 in Figure 10(a)), and the errors at an end joint may accumulate along the errors of the parent body parts. For a fair evaluation of the measurements at each body joint, a rotation angle in this study is thus defined in a global coordinate system rather than in a local coordinate system, as shown in Figure 10.
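A sketch of this global-coordinate comparison is given below. The per-axis variant, which projects the directional vectors onto the plane normal to each axis, is one plausible reading of the per-axis errors reported later, not the paper’s stated formula; the joint coordinates are hypothetical.

```python
import numpy as np

def direction(joint_a, joint_b):
    # Unit directional vector of a body part from its two end joints.
    v = np.asarray(joint_b, float) - np.asarray(joint_a, float)
    return v / np.linalg.norm(v)

def angle_deg(a, b):
    # Angle between two vectors, in degrees.
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def per_axis_angles(v0, vt):
    # Drop one coordinate at a time to measure the in-plane rotation
    # about the X, Y and Z axes of the global frame (an assumption).
    keep = {"X": [1, 2], "Y": [0, 2], "Z": [0, 1]}
    return {ax: angle_deg(v0[idx], vt[idx]) for ax, idx in keep.items()}

v0 = direction([0, 0, 0], [0, -30, 0])     # right forearm at T0 (T-pose)
vt = direction([0, 0, 0], [20, -22, 5])    # same body part at time Tx
print(angle_deg(v0, vt), per_axis_angles(v0, vt))
```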
The root-mean-square errors (RMSEs) over all frames between the proposed method and the commercial motion capture system are measured as a metric for performance evaluation. The calculation is shown in equation (1), where x_exp,i denotes the ith (out of n) sample in the experiment data and x_gt,i similarly denotes that in the ground truth:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{\mathrm{exp},i} - x_{\mathrm{gt},i} \right)^{2}}    (1)
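Equation (1) amounts to the following NumPy computation, shown with hypothetical synchronized per-frame measurements (e.g. a bone length in cm) from the proposed method and from Optotrak:

```python
import numpy as np

x_exp = np.array([41.2, 40.8, 42.0, 41.5])   # proposed method, per frame
x_gt = np.array([40.0, 40.0, 40.0, 40.0])    # Optotrak ground truth
rmse = np.sqrt(np.mean((x_exp - x_gt) ** 2))
```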

Results and discussion


The means of the bone length and rotation angle measurements are compared to the ground truth measured using a commercial motion capture system, Optotrak (www.ndigital.com), whose accuracy in tracking markers’ positions is known to be less than 0.1 mm in an experimental setting.
The average error of the difference between the bone length measurement and the ground truth over each frame is calculated for every body part (Figure 9), and the results, including the comparison with the detection-based approach (Han and Lee, 2013), are reported in Table II. Overall, the proposed tracking approach achieves an accuracy of 3.8 cm relative to the ground truth and outperforms the detection approach by 2.5 cm. Among all the body parts, the ones linked to the hips (IDs 3 and 9) lead to the largest error, which may derive from the self-occlusions caused by bending in squat lifting.

Figure 10. Comparison of the definitions of rotation angles between (a) a local coordinate system and (b) a global coordinate system

Table II. Bone length difference comparison (cm) between detection- and tracking-based (proposed) approaches to skeleton extraction

Body part ID                    1    2    3    4     5     6    7    8    9    10   Mean
Detection-based average error
(Han and Lee, 2013)             3.6  5.9  6.3  10.1  10.4  5.7  7.7  8.7  3.0  –    6.3
Tracking-based average error    7.8  3.7  6.6  2.5   1.8   2.9  2.6  4.1  9.2  3.5  3.8

Note: A slightly different body configuration is studied in Han and Lee (2013) (i.e. body part ID 8 not included)
Also, even small errors in tracking a body part in the z-axial direction (i.e. from the object to the camera) may sensitively magnify the errors in 3D triangulation. Generally, the upper body parts show better performance in the bone length measurement than the lower body parts because of nearly no occlusion of the upper body parts when the subject is conducting the actions of lifting and walking.
Technically, the detection-based method requires large and comprehensive training data sets for the pose estimation because the detection is performed solely by comparing the testing image with the training images. These data sets include images of various human subjects, actions, backgrounds and so on, especially captured under various conditions (e.g. illumination and view angles). In contrast, the tracking-based method is free from the large amount of training data sets required for the detection and is less sensitive to a diversity of testing images in appearance. Furthermore, the tracking approach may be computationally efficient, as it roughly estimates the potential region of joint positions in each frame based on the localization decision made in the previous frame. This may be one of the reasons that the tracking-based method shows better performance in estimating bone lengths, which depend on the location information of body joints. From the customization perspective, this model-free approach also provides the user with flexibility in tracking certain body parts, for example, the upper body or specific body parts according to the purpose of the study (e.g. the upper limbs for ergonomic analysis).
The RMSEs of the rotation angles are also measured for the motion data sets of the two actions (i.e. squat lifting and walking), as reported in Tables III and IV. Overall, the errors in the X, Y and Z-axis rotations are 6.4, 9.0 and 21.1 degrees, respectively, for squat lifting, and 7.0, 19.7 and 11.2 degrees, respectively, for walking. Despite the extensive movements of the thighs and legs (i.e. body part IDs 1 to 4) during squat lifting, relatively small errors are found in the experiment, which suggests the robustness of the tracking method. On the other hand, larger errors are mainly caused by the forearms (i.e. body part IDs 5 and 8), which indicates the computational difficulty of tracking the forearms and hands. Han and Lee (2013) also reported that hands are the most challenging body part to detect and localize using the detection approach. Through visual inspections for the error analysis, the low frame rate of 29 fps is found to cause severe blurring around body parts that are moving fast, such as the forearms in squat lifting. This might be one of the major causes of the inaccuracy of tracking, because blurred body parts on images may lead to the incorrect estimation of search regions on the following image and the inaccurate localization of the center of the body part. As a potential solution to this issue in future studies, videos with higher frame rates and shorter exposure periods will be investigated to capture visually clear body parts instead of blurred ones.
Table III. Squat lifting: rotation angle comparison (degree) at body parts between tracking-based method and Optotrak

Body part ID   1     2    3     4     5     6     7     8     Mean  SD
X              3.5   1.3  3.7   7.6   8.7   9.6   4.9   9.2   6.4   3.2
Y              12.7  6.5  11.5  7.2   8.5   6.1   9.6   9.8   9.0   2.4
Z              5.0   3.5  6.9   12.3  59.4  17.2  22.4  42.2  21.1  20.0

Table IV. Walking: rotation angle comparison (degree) at body parts between tracking-based method and Optotrak

Body part ID   1    2     3     4   5     6     7     8     Mean  SD
X              1.6  3.2   7.0   –   15.3  6.8   5.4   9.4   7.0   4.5
Y              7.4  11.8  17.1  –   34.4  25.0  17.4  24.6  19.7  9.1
Z              6.1  11.2  5.1   –   13.6  13.6  12.2  16.8  11.2  4.2

Note: Body part ID 4 is not reported because of the significant amount of data loss by Optotrak
Conclusion
The method proposed in this paper successfully reduced the computational time to a near real-time level, thus filling a primary gap that current research on vision-based human monitoring has faced in the context of construction. The performance is tested by comparing the measured bone lengths and rotation angles at joints with the ones from a commercial motion capture system. The results reveal that the tracking approach performs the 3D pose estimation better than the detection approach in terms of bone length measurements, as well as rotation angles that were not reported in the previous study. Additionally, this method used two smartphone cameras to generate stereo video streams and perform 3D human skeleton reconstruction. Even though the utilization of smartphones for monitoring on-site workers might face some issues, the experimental results revealed that it can achieve good accuracy and holds potential for future on-site application. Notably, the proposed approach does not utilize any marker or sensor attached to the human body and has fewer operational constraints than range sensors.
The motion data that the proposed method can produce from video streams can potentially be used for behavior monitoring, ergonomic assessment and productivity analysis, as studied in prior work. The technological system can automatically extract 3D human skeletons that serve as crucial input for ergonomic and safety analysis. It can be integrated into diverse frameworks that depend on the skeleton configuration or partial information extracted from it (e.g. joint angles, anthropometry). For example, Ray and Teizer (2012) proposed an ergonomic analysis framework which utilizes 3D skeleton models for posture classification (e.g. standing, bending), pose estimation (e.g. body joint locations and angles) and rule-based ergonomic assessment. In addition, the proposed method can be integrated into the unsafe behavior detection framework (Han et al., 2014), which compares the pattern similarity of motion data extracted from motion sensors to detect unsafe actions similar to a pre-defined motion template of the action. Motion data (e.g. body joint angles) derived from the skeleton can also be used for biomechanical analysis; Seo et al. (2014) performed a posture assessment to identify body parts enduring forceful exertion by using motion data captured from a range sensor together with force data. These examples, in which motion capture data (i.e. 3D skeletal models) are used, are the potential applications to which the proposed method is readily applicable.
The proposed tracking method does not require a significant effort in collecting images to train detection algorithms; instead, the tracking approach can continuously update the changing appearances of the tracker over frames. By so doing, however, initial localizations of the trackers on the first frame, when incorrectly performed, can be a major source of errors. Also, missing the trackers in the tracking process – for instance, because of occlusions – can be another critical factor affecting the performance. Accordingly, further research is required to test the introduced method’s performance by taking videos at different view angles, especially ones causing greater self-occlusion (e.g. a side view). Integrating the detection and tracking approaches to complement the shortcomings of each might provide a plausible improvement to the performance. In addition, laboratory testing is selected to assess the performance of the proposed method before the pursuit of a scaled-up field test. However, practical issues, such as occlusions, illumination and camera views, may pose technical challenges for vision-based tracking when applied to jobsites. As the next step, the proposed method will be tested under less constrained conditions in future studies to ensure the performance in a field setting.

References
Albinali, F., Goodwin, M.S. and Intille, S.S. (2009), “Recognizing stereotypical motor movements in
the laboratory and classroom: a case study with children on the autism spectrum”,
Proceedings of the 11th International Conference on Ubiquitous Computing, ACM,
Orlando, FL, pp. 71-80.
Alwasel, A., Elrayes, K., Abdel-Rahman, E. and Haas, C. (2013), “A human body posture sensor for monitoring and diagnosing MSD risk factors”, Proceedings of the 30th International Symposium on Automation and Robotics in Construction (ISARC), Montreal, pp. 531-539.
Bay, H., Tuytelaars, T. and van Gool, L. (2008), “SURF: speeded up robust features”, Computer
Vision and Image Understanding, Vol. 110 No. 3, pp. 346-359.
Bureau of Labor Statistics (BLS) (2014), Nonfatal Occupational Injuries and Illnesses Requiring Days Away From Work, 2013, US Department of Labor, available at: www.bls.gov/news.release/osh2.nr0.htm (accessed 27 September 2015).
Chen, J., Ahn, C.R. and Han, S. (2014), “Detecting the hazards of lifting and carrying in construction
through a coupled 3D sensing and IMUs sensing system”, International Conference for
Computing in Civil and Building Engineering, ASCE, Reston, VA, pp. 1110-1117.
Chen, L., Wei, H. and Ferryman, J. (2013), “A survey of human motion analysis using depth
imagery”, Pattern Recognition Letters, Vol. 34 No. 15, pp. 1995-2006.
Cheng, T., Migliaccio, G., Teizer, J. and Gatti, U. (2013), “Data fusion of real-time location sensing
and physiological status monitoring for ergonomics analysis of construction workers”,
Journal of Computing in Civil Engineering, Vol. 27 No. 3, pp. 320-335.
CPWR (2013), The Construction Chart Book: The US Construction Industry and Its Workers, 5th
ed., CPWR – The Center for Construction Research and Training, Silver Spring.
Delon, J. and Rougé, B. (2007), “Small baseline stereo vision”, Journal of Mathematical Imaging
and Vision, Vol. 28 No. 3, pp. 209-223.
Diraco, G., Leone, A. and Siciliano, P. (2013), “Human posture recognition with a time-of-flight 3D sensor for in-home applications”, Expert Systems with Applications, Vol. 40 No. 2, pp. 744-751.
Escorcia, V., Davila, M.A., Golparvar-Fard, M. and Niebles, J.C. (2012), “Automated vision-based
recognition of construction worker actions for building interior construction operations
using RGBD cameras”, paper presented at Construction Research Congress, West
Lafayette.
Fischler, M.A. and Bolles, R.C. (1981), “Random sample consensus: a paradigm for model fitting
with applications to image analysis and automated cartography”, Communications of the
ACM, Vol. 24 No. 6, pp. 381-395.
Ganapathi, V., Plagemann, C., Koller, D. and Thrun, S. (2010), “Real time motion capture using a
single time-of-flight camera”, IEEE Conference on Computer Vision and Pattern
Recognition, San Francisco, pp. 755-762.
Gong, J. and Caldas, C.H. (2010), “Computer vision-based video interpretation model for automated
productivity analysis of construction operations”, Journal of Computing in Civil
Engineering, Vol. 24 No. 3, pp. 252-263.
Gong, J. and Caldas, C.H. (2011a), “An object recognition, tracking, and contextual reasoning-based video interpretation method for rapid productivity analysis of construction operations”, Automation in Construction, Vol. 20 No. 8, pp. 1211-1226.
Gong, J., Caldas, C.H. and Gordon, C. (2011b), “Learning and classifying actions of construction workers and equipment using Bag-of-Video-Feature-Words and Bayesian network models”, Advanced Engineering Informatics, Vol. 25 No. 4, pp. 771-782.
Han, S. and Lee, S. (2013), “A vision-based motion capture and recognition framework for behavior-based safety management”, Automation in Construction, Vol. 35 No. 1, pp. 131-141.
Han, S., Lee, S. and Peña-Mora, F. (2014), “Comparative study of motion features for
similarity-based modeling and classification of unsafe actions in construction”, Journal of
Computing in Civil Engineering, Vol. 28 No. 1.
Hanna, A.S. (2001), Quantifying the Cumulative Impact of Change Orders for Electrical and Mechanical Contractors, Construction Industry Institute (CII), University of Texas, Austin, TX.
Hartley, R.I. and Zisserman, A. (2004), Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, Cambridge.
Horn, B.K. and Schunck, B.G. (1981), “Determining optical flow”, Artificial Intelligence, Vol. 17 Nos 1/3, pp. 185-203.
Khosrowpour, A., Fedorov, I., Holynski, A. and Niebles, J.C. (2014), “Automated worker activity
analysis in indoor environments for direct-work rate improvement from long sequences of
RGB-D images”, Construction Research Congress, Atlanta, GA, pp. 729-738.
Kim, J.Y. and Caldas, C.H. (2013), “Vision-based action recognition in the internal construction site
using interactions between worker actions and construction objects”, International
Symposium on Automation and Robotics in Construction and Mining, Montréal,
pp. 661-668.
Leone, A., Diraco, G. and Siciliano, P. (2010), “An automated active vision system for fall detection
and posture analysis in ambient assisted living applications”, IEEE International
Symposium on Industrial Electronics, Bari, pp. 2301-2306.
Liu, C., Yuen, J. and Torralba, A. (2011), “SIFT flow: dense correspondence across scenes and its
applications”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33
No. 5, pp. 978-994.
Memarzadeh, M., Golparvar-Fard, M. and Niebles, J.C. (2013), “Automated 2D detection of
construction equipment and workers from site video streams using histograms of oriented
gradients and colors”, Automation in Construction, Vol. 32 No. 1, pp. 24-37.
Meredith, M. and Maddock, S. (2000), Motion Capture File Formats Explained, Department of
Computer Science, University of Sheffield, Sheffield.
Moeslund, T.B. and Granum, E. (2001), “A survey of computer vision-based human motion capture”, Computer Vision and Image Understanding, Vol. 81 No. 3, pp. 231-268.
Moeslund, T.B., Hilton, A. and Kruger, V. (2006), “A survey of advances in vision-based human
motion capture and analysis”, Computer Vision and Image Understanding, Vol. 104
Nos 2/3, pp. 90-126.
Paschos, G. (2001), “Perceptually uniform color spaces for color texture analysis: an empirical
evaluation”, IEEE Transactions on Image Processing, Vol. 10 No. 6, pp. 932-937.
Pujol, J. (2007), “The solution of nonlinear inverse problems and the Levenberg-Marquardt method”, Geophysics, Vol. 72 No. 4, p. W1.
Ray, S.J. and Teizer, J. (2012), “Real-time posture analysis of construction workers for ergonomics
training”, Construction Research Congress, West Lafayette, IN, pp. 1001-1010.
Seo, J., Han, S., Lee, S. and Armstrong, T.J. (2013), “Motion-data-driven unsafe pose identification through biomechanical analysis”, Computing in Civil Engineering – Proceedings of the 2013 ASCE International Workshop on Computing in Civil Engineering, Los Angeles, CA, pp. 693-700.
Seo, J., Starbuck, R., Han, S., Lee, S. and Armstrong, T.J. (2014), “Motion data-driven biomechanical analysis during construction tasks on sites”, Journal of Computing in Civil Engineering, Vol. 29 No. 4, doi: 10.1061/(ASCE)CP.1943-5487.
Siriwardana, C.S.A. and Ruwanpura, J.Y. (2012), “A conceptual model to develop a worker performance measurement tool to improve construction productivity”, Construction Research Congress, West Lafayette, pp. 179-188.
Trucco, E. and Verri, A. (1998), Introductory Techniques for 3-D Computer Vision, Prentice Hall
PTR, Upper Saddle River, NJ.
Uijlings, J.R.R., Smeulders, A.W.M. and Scha, R.J.H. (2010), “Real-time visual concept classification”, IEEE Transactions on Multimedia, Vol. 12 No. 7, pp. 665-681.
Yang, J., Arif, O., Vela, P.A., Teizer, J. and Shi, Z. (2010), “Tracking multiple workers on construction sites using video cameras”, Advanced Engineering Informatics, Vol. 24 No. 4, pp. 428-434.
Yang, Y. and Ramanan, D. (2011), “Articulated pose estimation using flexible mixtures of parts”, Proceedings of Computer Vision and Pattern Recognition, Colorado Springs, CO, pp. 1385-1392.
Yilmaz, A., Javed, O. and Shah, M. (2006), “Object tracking: a survey”, ACM Computing Surveys, Vol. 38 No. 4.
Zhang, Z. (2004), “Camera calibration”, in Medioni, G. and Kang, S.B. (Eds), Emerging Topics in Computer Vision, Prentice Hall Professional Technical Reference, Upper Saddle River, NJ, pp. 4-43.
Further reading
Boschman, J.S., van der Molen, H.F., Sluiter, J.K. and Frings-Dresen, M.H. (2012), “Musculoskeletal disorders among construction workers: a one-year follow-up study”, BMC Musculoskeletal Disorders, Vol. 13 No. 1, p. 196.
Bureau of Labor Statistics (BLS) (2011), National Census of Fatal Occupational Injuries in 2010
(Preliminary Results), US Department of Labor, available at: www.bls.gov/news.release/
archives/cfoi_08252011.pdf (accessed 27 September 2015).
Everett, J.G. (1999), “Overexertion injuries in construction”, Journal of Construction Engineering
and Management, Vol. 125 No. 2, pp. 109-114.
Golabchi, A., Han, S., Seo, J., Han, S., Lee, S. and Al-Hussein, M. (2015), “An automated
biomechanical simulation approach to ergonomic job analysis for workplace design”,
Journal of Construction Engineering and Management, Vol. 141 No. 8.
MATLAB R (2015), “MATLAB computer vision system toolbox release 2015b”, available at: www.
mathworks.com/products/computer-vision/whatsnew.html (accessed 15 October 2015).
Optotrak Certus (2015), available at: www.ndigital.com/msci/products/optotrak-certus/ (accessed
16 October 2015).
SilhouetteFX (2015), available at: www.silhouettefx.com/ (accessed 16 October 2015).

Corresponding author
SangHyun Lee can be contacted at: shdpm@umich.edu
