
Real-time Full-body Motion Reconstruction and Recognition for Off-the-Shelf VR Devices

Fan Jiang∗   Xubo Yang†   Lele Feng‡
Shanghai Jiao Tong University

Figure 1: Examples of our real-time motion reconstruction based on HTC Vive.

Abstract

Almost all VR multi-person applications need to reconstruct whole-body motions in real time in order to create deeper immersion. In fact, if we can ensure that the reconstructed motion is natural, we can improve wearing comfort and reduce the number of sensors at the cost of reconstruction precision, because all users in VR are blindfolded by their headsets. In this paper, we introduce a novel real-time motion reconstruction and recognition method for VR that uses only the positions and orientations of the user's head and two hands. To reconstruct natural motions from such sparse sensors, we divide the whole body into the upper body and the lower body. The upper body reconstruction algorithm is based on inverse kinematics, which is more accurate; the lower body reconstruction is based on animation blending, which needs only a small number of prepared animation clips. Meanwhile, a natural action recognition algorithm based on a neural network runs on demand to detect the target motions we have trained. We show that our method can reconstruct various full-body motions, such as walking in any direction, jogging, jumping, crouching and turning. Our method reconstructs natural human motions and can be used in almost all VR multi-person interactive applications. The sensor configuration of head plus two hands is the standard setup of well-known VR devices, so the method can be easily integrated into off-the-shelf VR hardware.

Keywords: real-time, full-body motion reconstruction, motion recognition, VR

Concepts: •Computing methodologies → Motion processing; Motion capture;

∗e-mail: jiangfan3721@sina.com
†e-mail: yangxubo@sjtu.edu.cn
‡e-mail: lelefeng1992@gmail.com

1 Introduction

With the rapid development and adoption of virtual reality (VR), many kinds of VR devices and applications have been accepted by consumers. Oculus Rift, HTC Vive and PlayStation VR are three outstanding VR devices with consumer editions, and all three are equipped with a pair of hand controllers that let users interact with virtual objects. From the sensor signals of the HMD and the two controllers, we can reconstruct the user's head and two hands in the virtual world. But the rest of the user's avatar stays invisible, because there is no tracking system to capture the user's physical motions. For this reason, other users in a VR application, who consist only of a head and two hands floating in the air, do not feel like real humans. The experience feels isolated, and the immersion is severely reduced.

Almost all VR multi-person interactive entertainment applications need to reconstruct full-body motions. In social applications, if we can make the user's body and appropriate motions visible to all users, not only does the immersion of VR presence become unlike any other medium, but costumes and avatars can also be customized more easily to show each user's distinctive individuality. Furthermore, people are used to conveying information through specific emotions or gestures, such as saying hello by waving to someone. In order to create deep immersion, it is necessary to transfer the user's physical motions onto a virtual avatar. In multiplayer games such as a VR theme park, if all of our teammates in the virtual world are composed of a floating head and two restless hands holding a gun, the players look weird and the immersion suffers. Moreover, the more avatar presence we can convey, the more optional channels teammates have to communicate. For example, if one player waves his hand downward at a teammate, the teammate receives the message to squat down.

There are many ways to capture the motions that serve as input data for reconstructing avatars in the virtual world. Each kind of motion capture technology has its own strengths and weaknesses with regard to accuracy, complexity, expressiveness and operating expense. For example, optical motion capture systems require small reflective markers to be attached to the user's body; high precision is their greatest strength, but high cost, special gear for the user and environment limitations keep them away from ordinary consumers. An alternative to marker-based optical systems is a low-cost inertial system based on IMUs. Because of their small size and weight, IMUs can easily be bound to the user's joints. However, the data obtained from IMUs is often affected by noise: drift error accumulated over long periods is a fatal flaw that cannot be corrected in absolute position, and the deviation of an IMU from its calibrated position during use can also cause relatively large errors because of the flexible movement of the user's joints. A third motion capture approach, based on image processing, attaches nothing to the user's body: it analyzes camera images and extracts the user's skeleton to reconstruct human motions, as the Kinect does, but a small working volume and low precision are its limitations.

Actually, in all VR applications users can see nothing in the real world because their eyes are covered by the HMD. In other words, users do not know what motions other users actually perform; they can only see the motions we reconstruct in the virtual world. Therefore, if we can ensure that the reconstructed motion is natural, we can improve wearing comfort and reduce the number of sensors, the capture volume and the complexity of calibration at the cost of precision. In addition, in most VR interactive applications users pay more attention to the motions of the upper limbs than to the lower body, so it is more important to reconstruct upper-body motions accurately than lower-body ones.

Based on the above, we introduce a novel real-time framework for reconstructing full-body motions and recognizing natural actions from the head mounted display and two hand controllers, which is the standard configuration of well-known VR devices. Our work makes two principal contributions:

(1) An upper body motion reconstruction algorithm based on inverse kinematics, which is accurate and fast, and a lower body motion reconstruction algorithm based on animation blending, which does not need a large animation knowledge base.

(2) A natural action recognition algorithm based on a neural network, integrated into the reconstruction algorithm above.

The paper is organized as follows. In Section 2, we discuss previous work related to our approach. In Section 3, we give an overview of the whole pipeline of our method. In Section 4, we talk about calibration and how to estimate the body forward direction. In Section 5, we describe our proposed method of full-body motion reconstruction. In Section 6, we explain the natural action recognition algorithm based on a neural network. Experiments and evaluation results are presented in Section 7. Section 8 discusses the limitations of our method and future work. We conclude in Section 9.

2 Related Work

There are many approaches to human motion capture, developed on the basis of optical, magnetic, inertial and other sensors. All of these technologies have strengths and weaknesses in terms of accuracy, complexity, capture volume and operating expense; see [Moeslund et al. 2006] for an overview.

Optical motion capture systems are widely used because of their high precision: high-resolution calibrated cameras photograph at high speed to capture the positions of reflective markers attached to the user's body. However, such systems are highly expensive because of the high-speed, synchronized and continuously running photographic devices. Moreover, the special recording environment and special gear make them difficult to popularize among general users.

In recent years, the IMU, composed of an accelerometer, a gyroscope and a magnetometer, has become popular in the motion capture field. Because IMUs are small in size and weight, they are easily bound to a user's body. Inertial motion capture systems calculate the position of each unit by fusing the three components of the IMU [Roetenberg et al. 2007]. They require no special recording environment or special gear; however, their fatal flaws are heavy noise and proneness to drift. To correct drift, several approaches have been proposed in recent years: [Schepers et al. 2010] used magnetic field sensors for this purpose, while [Vlasic et al. 2007] employed ultrasonic distance sensors to avoid drift. [Kelly et al. 2013] proposed a memory-efficient technique based on motion graphs that uses a single overhead camera for global position and orientation data along with accelerometers.

How to reconstruct visually appealing and natural human motions from the sensor data is another challenging task; in this context we summarize several approaches proposed in recent years. Data-driven methods perform well. They use a motion capture database containing many human motions as a knowledge base [Arikan and Forsyth 2002][Pullen and Bregler 2002][Ren et al. 2005][Slyper and Hodgins 2008]. During the reconstruction process, the knowledge base is searched for the best matches to the low-dimensional sensor input. However, a search that returns no hit may significantly reduce the quality of the synthesized result. The problem of absent motions can be fixed by interpolation [Kovar and Gleicher 2004][Mukai and Kuriyama 2005][Safonova and Hodgins 2007], but interpolation methods cannot create new poses that are not present in the knowledge base.

Several similar methods have been proposed to identify motions. Chai and Hodgins [Chai and Hodgins 2005] searched k-nearest poses for low-dimensional inputs in a precomputed neighbor graph. To increase the speed of motion matching in neighborhoods found through a kd-tree index structure, Krüger et al. [Krüger et al. 2010] developed a lazy neighborhood graph method. Tautges et al. [Tautges et al. 2011] used an online-capable extension (OLNG) that operates on a window of frames.

Motion reconstruction methods are composed of fast search and motion synthesis techniques; finding similar motions from low-dimensional inputs and synthesizing poses not included in the knowledge base are the main challenges. [Chai and Hodgins 2005] build a local linear model from the candidate nearest neighbors and use energy optimization to find the best match. [Liu et al. 2011] extended this method by constructing a series of local linear dynamic models online to approximate the nonlinear dynamic behaviors of human movement. [Tautges et al. 2011] showed that visually appealing human motions can be reconstructed from only four accelerometers (attached to the wrists and ankles), using a kernel-based energy minimization to synthesize the results. [Kim and Lee 2013] reduced the sensors further, showing that some full-body motions can be reconstructed from the wrist and ankle trajectories alone. [Yin and Pai 2003] used foot pressure distributions to reconstruct full-body motions; however, the pressure data had to be pre-captured and included in the knowledge base. [Riaz et al. 2015] made use of accelerometers placed on both wrists and the lower trunk, together with ground contact information derived from the lower trunk sensor signals; appealing motions were reconstructed by identifying ground contacts and combining the results with a fast database look-up.

Among all the methods mentioned above, a knowledge base of human motions is needed to obtain candidate poses for the results or as a basis for synthesis, and the quality of the synthesized motions and the runtime of the search algorithms depend more or less on its size. Almost all VR interactive applications require low latency, which means the algorithm used to reconstruct human motions must run in real time. Wearing comfort must be another consideration: the fewer sensors attached to the user, the more comfortable the user feels during the VR experience. At present, head motion is tracked by the head mounted display, and many VR devices widely accepted by consumers, such as Oculus, HTC Vive and PSVR, have two motion controllers for the user's hands. A motion reconstruction method that depends only on these controllers may therefore be the most comfortable and most easily integrated approach. In addition, in most VR interactive applications users pay more attention to upper-limb motions than to the lower body, so it is more important to reconstruct upper-body motions accurately. We therefore introduce a novel real-time framework for reconstructing full-body motion and recognizing natural actions based on the head mounted display and two hand controllers: an upper body reconstruction algorithm based on inverse kinematics, which is accurate enough, and a lower body reconstruction algorithm based on animation blending, which is visually appealing and runs in real time.

Figure 2: Overview of the full pipeline of our motion reconstruction and recognition algorithm

3 Overview

Our system transforms the sensor signals obtained from the user's head mounted display and two hand controllers in an optical tracking system, such as the Lighthouse tracking system of the HTC Vive, which we use for this project. All raw signals are translated by OpenVR into the position and orientation of each sensor in world coordinates. The position and orientation of the user's head and two hands obtained in each frame are maintained by a motion information cache, which updates the velocity magnitude, velocity direction and other useful information about the recent history of the head and two hands every frame for subsequent reconstruction and detection. The full pipeline is shown in Figure 2. The entire system consists of three major components, described below.

3.1 The estimation of body forward direction

We have no way to obtain the body forward direction directly, because all positions obtained from the tracking system are in world coordinates and the orientations of the head and two hands cannot by themselves represent the body forward direction. So we first introduce a novel method for estimating the body forward direction. Our method produces a stable estimate from the orientations of the head and two hands over the past few frames and from the relative positions of the head and two hands. The result of this part is the basis of motion reconstruction and recognition, because we need it to build the local coordinate system of the body and to transform positions and orientations from world coordinates into that local coordinate system.

3.2 Motion reconstruction algorithm

We present a new real-time full-body motion reconstruction framework. Unlike other search-based methods that depend on a knowledge base, we divide the whole body into the upper body and the lower body. The upper body motion reconstruction algorithm is based on inverse kinematics, which is more accurate, because users in VR pay more attention to the upper body. The lower body motion reconstruction algorithm is based on animation blending, which needs only a small number of animation clips.

3.3 Natural action recognition

A natural action recognition algorithm based on a neural network is the other major component of our framework. The neural network is trained offline with the back-propagation learning method on the target motions. The velocity magnitudes and directions of the user's head and two hands, maintained by the motion information cache, are fed into the trained network as input data.

4 The preparatory work

4.1 Skeleton Calibration

Our framework requires the user to hold two sensors in the hands and to have one sensor attached to the head. The sensor signals can be translated into positions and orientations in world coordinates. The VR device we use in this paper is the HTC Vive, one of the best VR experience devices at present. In the following pages, we use head to denote the head mounted display or any other sensor attached to the user's head, and hand to denote a hand controller or any other sensor held in or bound to the user's hand. The height of the ground plane is 0.

Figure 3: T pose for calibration

Our calibration process is very simple because there are few sensors. We instruct the user to take the standard T pose shown in Figure 3 and pull the triggers on both controllers at the same time. Let $p_l$, $p_r$ and $p_h$ be the world positions of the left hand, right hand and head. We assume the height of the user's virtual body is $h_{model}$ and its width is $w_{model}$. Then the calibration scales $h_{scale}$ and $w_{scale}$ are given as

\[ h_{scale} = \frac{p_h}{h_{model}} \tag{1} \]

\[ w_{scale} = \frac{|p_r - p_l|}{w_{model}} \tag{2} \]

This ensures that the framework is robust to users of different heights and skeletal sizes.
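To make the scaling step concrete, here is a minimal Python sketch of Equations (1) and (2), assuming a z-up world (the text defines world up as (0, 0, 1)) and illustrative model dimensions; the constant values and function names are ours, not the paper's.

```python
import numpy as np

# Illustrative model dimensions in meters (assumed values, not from the paper).
H_MODEL = 1.75   # h_model: height of the virtual body
W_MODEL = 1.70   # w_model: hand-to-hand width of the virtual body in the T pose

def calibrate(p_l, p_r, p_h):
    """Compute (h_scale, w_scale) of Equations (1) and (2) from the
    T-pose world positions of the left hand, right hand and head."""
    p_l, p_r, p_h = map(np.asarray, (p_l, p_r, p_h))
    h_scale = p_h[2] / H_MODEL                     # z is world up here
    w_scale = np.linalg.norm(p_r - p_l) / W_MODEL
    return h_scale, w_scale

# Example: a 1.60 m user whose hands span 1.55 m in the T pose.
print(calibrate([-0.775, 0.0, 1.35], [0.775, 0.0, 1.35], [0.0, 0.0, 1.60]))
```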

4.2 Pre-process sensors data

This section focuses on preparing, from the position and orientation of the head and two hands obtained from the sensors in each frame, the motion information used by the following algorithms. We maintain a motion information cache that retains the position and orientation of the head and two hands over the last 50 frames. From these data we can derive the velocity magnitude or direction of any sensor over a period of past frames. We refresh this information every frame; for example, the velocity magnitude over the last 10 frames represents the speed of a motion, and the velocity direction over the last 5 frames gives the direction of a motion. All information is updated before the motion reconstruction and recognition algorithms take the input they need in each frame, in order to reduce latency.
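The cache can be sketched as a small ring buffer per sensor. This is a minimal illustration under our own assumptions (per-frame timestamps, window lengths passed as arguments); the class and method names do not come from the paper's implementation.

```python
from collections import deque
import numpy as np

class MotionInfoCache:
    """Ring buffer of the last 50 frames of one sensor's pose, from
    which windowed velocities are derived every frame."""
    def __init__(self, capacity=50):
        self.frames = deque(maxlen=capacity)   # (time, position, orientation)

    def push(self, t, position, orientation):
        self.frames.append((t, np.asarray(position, dtype=float), orientation))

    def velocity(self, window=10):
        """Average velocity vector over the last `window` frames."""
        if len(self.frames) < window:
            return np.zeros(3)
        t0, p0, _ = self.frames[-window]
        t1, p1, _ = self.frames[-1]
        return (p1 - p0) / max(t1 - t0, 1e-6)

    def speed(self, window=10):
        return float(np.linalg.norm(self.velocity(window)))

    def direction(self, window=5):
        v = self.velocity(window)
        n = np.linalg.norm(v)
        return v / n if n > 1e-6 else np.zeros(3)
```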
4.3 The estimation of body forward direction

The body forward direction is almost stable in a typical VR experience, but the head and hands to which the sensors are attached are all active, so it is necessary to estimate the stable body forward direction from the unstable sensor positions and orientations.

First, there are two features that imply the body forward direction in most cases. One is the point the user keeps looking at; the other is the perpendicular of the plane spanned by the two hands and the world up vector. So we maintain two values describing these features over the past $n$ frames: the average view direction $\bar{v}$ and the average perpendicular $\bar{p}$, illustrated in Figure 4. They are updated as

\[ \bar{v} = \frac{\bar{v} + hf_{proj}/n}{|\bar{v} + hf_{proj}/n|} \tag{3} \]

\[ \bar{p} = \frac{\bar{p} + (lhp - rhp)_{proj} \times up / n}{|\bar{p} + (lhp - rhp)_{proj} \times up / n|} \tag{4} \]

where $hf$ is the head forward direction in the current frame, $lhp$ and $rhp$ are the positions of the left and right hand in the current frame, and $up$ is the world up direction $(0, 0, 1)$. Vectors with the subscript $proj$ denote the projection of the original vector onto the horizontal plane.

Figure 4: Explanatory chart for $\bar{v}$ and $\bar{p}$. (a) $\bar{v}$ indicates the average view direction over the past few frames. (b) $\bar{p}$ indicates the average perpendicular direction of the line connecting the two hands over the past few frames.

In addition, there are two angles that need attention, $a_{cross}$ and $a_{linear}$, shown in Figure 5. They are computed as

\[ hd = (lhp - rhp)_{proj} \tag{5} \]
\[ ld = (lhp - hp)_{proj} \tag{6} \]
\[ rd = (rhp - hp)_{proj} \tag{7} \]

\[ a_{cross} = \arccos\!\left(\frac{hd \cdot (ld + rd)}{|hd|\,|ld + rd|}\right) \tag{8} \]

\[ a_{linear} = \arccos\!\left(\frac{hd \cdot ld}{|hd|\,|ld|}\right) \tag{9} \]

where $hp$ is the position of the head in the current frame, and the other symbols are as above.

Figure 5: Explanatory chart for $a_{cross}$ and $a_{linear}$
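To make the update rules concrete, the sketch below implements Equations (3)-(9) in Python; the helper names, the use of degrees and the default window length $n$ are our choices rather than the paper's.

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])   # world up direction, as defined in the text

def project(v):
    """Projection of a vector onto the horizontal plane."""
    return v - np.dot(v, UP) * UP

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-9 else v

def angle(a, b):
    """Angle in degrees between two vectors (used in Equations 8 and 9)."""
    c = np.dot(normalize(a), normalize(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def update_features(v_bar, p_bar, hf, lhp, rhp, hp, n=10):
    """One running update of v_bar and p_bar (Equations 3 and 4),
    plus the angles a_cross and a_linear (Equations 5-9)."""
    v_bar = normalize(v_bar + project(hf) / n)
    p_bar = normalize(p_bar + np.cross(project(lhp - rhp), UP) / n)
    hd, ld, rd = project(lhp - rhp), project(lhp - hp), project(rhp - hp)
    a_cross = angle(hd, ld + rd)
    a_linear = angle(hd, ld)
    return v_bar, p_bar, a_cross, a_linear
```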

Figure 6: Some cases satisfying the $a_{cross}$ and $a_{linear}$ conditions. (a) Two cases with $a_{cross}$ close to 90 degrees. (b) Two cases with $a_{linear}$ close to 0 degrees.

In different situations the reliability of these reference values differs, so we split into the following cases according to the orientation of the head and the velocity magnitudes of the two hands obtained from the motion information cache discussed in the last section. We denote the estimate of the body forward direction in the current frame by $bf_{current}$ and in the last frame by $bf_{last}$.

(1) If the user's head and two hands have all been stable in recent frames, we have to confirm that $a_{cross}$ is close to 90 degrees or $a_{linear}$ is close to 0 degrees, and that there is not much difference between $\bar{v}$ and the half vector of the two hand positions relative to the head. Some cases satisfying these conditions are shown in Figure 6. If all the conditions are satisfied, $bf_{current} = \bar{v}$.

(2) If the user's head is stable and the two hands are not stable in recent frames, the positions and orientations of the two hands are invalid for the result, so we check the difference between $\bar{v}$ and $\bar{p}$. If there is not much difference, $\bar{v}$ is more reliable than $\bar{p}$ because of the stable head, so $bf_{current} = \bar{v}$; otherwise $bf_{current} = \bar{v}/2 + \bar{p}/2$.

(3) If the user's head is unstable and the two hands are stable in recent frames, the orientation of the head is invalid for the result. In most cases of this kind the user may be looking around, so we keep the body forward direction: $bf_{current} = bf_{last}$.

(4) If the user's head and two hands are all unstable in recent frames, all the information is unreliable, so we also keep the body forward direction: $bf_{current} = bf_{last}$.

At last, the estimate of this frame is blended into the actual body forward direction, denoted by $bf_{actual}$:

\[ bf_{actual} = bf_{last} \cdot (1 - factor) + bf_{current} \cdot factor \tag{10} \]

where $factor$ is the blend factor; 0.1 is an ideal value according to the visual result.
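The case split and the blend of Equation (10) then reduce to a few lines. This sketch reuses `angle` and `normalize` from the sketch above; the tolerance thresholds and the precomputed `half_vec` argument (the half vector of the two hand positions relative to the head) are assumptions of ours, since the paper gives no numeric tolerances.

```python
ANGLE_TOL = 15.0   # degrees; "close to 90 / close to 0" tolerance (assumed)
DIR_TOL = 30.0     # degrees; "not much difference" tolerance (assumed)

def estimate_body_forward(head_stable, hands_stable, v_bar, p_bar,
                          a_cross, a_linear, half_vec, bf_last, factor=0.1):
    """One frame of the body-forward estimate: choose bf_current by the
    four stability cases, then blend with Equation (10)."""
    bf_current = bf_last                       # cases (3) and (4): keep last
    if head_stable and hands_stable:
        # Case (1): geometric checks on the hand/head configuration.
        if (abs(a_cross - 90.0) < ANGLE_TOL or a_linear < ANGLE_TOL) \
                and angle(v_bar, half_vec) < DIR_TOL:
            bf_current = v_bar
    elif head_stable:
        # Case (2): hands unreliable; trust v_bar, or average it with p_bar.
        bf_current = v_bar if angle(v_bar, p_bar) < DIR_TOL \
            else normalize(0.5 * v_bar + 0.5 * p_bar)
    return normalize(bf_last * (1.0 - factor) + bf_current * factor)
```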
5 Motion Reconstruction

In this section, we discuss in detail our approach to reconstructing full-body motion in real time. First we give a brief description of the inverse kinematics method we use to reconstruct the limbs. Then we construct the upper body with the inverse kinematics integrated. At last, we discuss how to reconstruct the lower body motion based on animation blending, which needs only a small number of animation clips.

5.1 The inverse kinematics

The inverse kinematics method is very popular in animation and interactive applications [Lander and Content 1998a][Lander and Content 1998b]. In IK, the desired position and possibly orientation of the end effector are given by the user along with the initial pose vector; from this, the joint values required to attain that configuration are calculated, giving the final pose vector. In this context, we just give a brief description and some core formulas of the inverse kinematics.

To explain it, we use a simple linkage mechanism to simulate the human arm, shown in Figure 7. The joint values for a desired final position can be determined analytically by inspecting the geometry of the linkage. The link lengths are $L_1$ and $L_2$ for the first and second link, respectively. The joint angles $\theta_1$ and $\theta_2$ can then be determined by computing the distance from the base to the goal $(X, Y)$ and using the law of cosines to compute the interior angles:

\[ \theta_{\gamma} = \arccos\!\left(\frac{X}{\sqrt{X^2 + Y^2}}\right) \tag{11} \]

\[ \theta_1 = \arccos\!\left(\frac{L_1^2 + X^2 + Y^2 - L_2^2}{2 L_1 \sqrt{X^2 + Y^2}}\right) + \theta_{\gamma} \tag{12} \]

\[ \theta_2 = \arccos\!\left(\frac{L_1^2 + L_2^2 - (X^2 + Y^2)}{2 L_1 L_2}\right) \tag{13} \]

Figure 7: IK: a simple linkage mechanism to simulate the human arm

When we use inverse kinematics in 3D space, there is one more constraint, called the pole vector, that we must pay attention to in order to ensure the uniqueness of the solution. The pole vector points in the bending direction of the joint.
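A minimal Python version of this two-link solution follows. We replace Equation (11) with `arctan2` so the base angle keeps its sign when the goal lies below the base line, and we clamp the cosines so that unreachable goals fold to the nearest reachable pose; both choices are ours, not the paper's.

```python
import numpy as np

def two_link_ik(x, y, l1, l2):
    """Analytic two-link IK in the plane (Equations 11-13).
    Returns (theta1, theta2); theta2 is the interior elbow angle."""
    d2 = x * x + y * y
    d = max(np.sqrt(d2), 1e-9)
    # Clamp the cosines: goals out of reach fold to the nearest pose.
    c1 = np.clip((l1 * l1 + d2 - l2 * l2) / (2.0 * l1 * d), -1.0, 1.0)
    c2 = np.clip((l1 * l1 + l2 * l2 - d2) / (2.0 * l1 * l2), -1.0, 1.0)
    theta_g = np.arctan2(y, x)           # signed version of Equation (11)
    theta1 = np.arccos(c1) + theta_g     # Equation (12)
    theta2 = np.arccos(c2)               # Equation (13)
    return theta1, theta2

# Example: upper arm 0.30 m, forearm 0.25 m, goal at (0.4, 0.2).
print(two_link_ik(0.4, 0.2, 0.30, 0.25))
```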

5.2 The upper body reconstruction

The upper body contains three core effectors. The waist effector controls the position of the body in the world; the left and right shoulder effectors are the base points for the left and right arm, which are involved in the inverse kinematics calculation.

First, the position of the waist effector needs to be calculated. Because of the sparse sensor information about the position of the user, we estimate the position of the waist relying mostly on the user's head position. We denote the height of the waist effector by $be_{ht}$:

\[ be_{ht} = \begin{cases} h_{ht} - (m_{ht} - mu_{ht} \cdot h_{scale}) & h_{ht} \ge h_{minht} \\ h_{minht} - (m_{ht} - mu_{ht} \cdot h_{scale}) & h_{ht} < h_{minht} \end{cases} \]

where $h_{ht}$ is the height of the user's head, $m_{ht}$ is the height of the human model, $mu_{ht}$ is the distance between the head and waist of the model, and $h_{minht}$ is the minimum head height the reconstruction method accepts, which prevents the reconstructed waist from sinking below the ground.

Secondly, we need to estimate the waist effector position on the horizontal plane. Simply setting it equal to the user's head position is problematic, because the user's head is never absolutely static during a VR experience; this causes obvious drift of the virtual body even though the user does not move. In this paper, we use interpolation to reduce this unnatural drift. When the user is in the standing state, which can be concluded from the velocity magnitude of the head in the motion information cache, the interpolation is enabled to stabilize the position of the waist effector, so that the effector does not completely follow the user's head position. In other states, the interpolation is disabled and the waist effector follows the user's head.

Next, the positions of the two shoulders need to be estimated to prepare for the inverse kinematics. Here, we assume that there is no bending of the user's upper body, which means the head, left shoulder, right shoulder and waist effector lie in the same plane. Let the positions of the left and right shoulder effectors be $lse$ and $rse$, respectively, and let the position of the waist (body) effector be $be$. They are computed as

\[ me = be + (hp - be)/factor_s \tag{14} \]

\[ offset = \frac{bf \times (hp - be)}{|bf \times (hp - be)|} \cdot factor_w \cdot h_{scale} \tag{15} \]

\[ lse = me - offset \tag{16} \]

\[ rse = me + offset \tag{17} \]

where $factor_s$ is the adjustment factor of the shoulder height, $bf$ is the body forward direction obtained in the last section, and $factor_w$ is the adjustment factor of the shoulder width.

Figure 8: Crouch recognition with different view directions. (a) A bending upper body because of the downward view. (b) A straight upper body because of the forward view.

At last, we discuss a special case our framework supports: crouching. The crouch case is activated when both of the following conditions are satisfied. One condition is that the direction of the user's view is downward, as shown in Figure 8, which means the user may be looking at an object near himself on the ground. The other condition is that the user's head must have vertical acceleration relative to the ground. When the crouch case is activated, the waist effector position on the horizontal plane is frozen, and the resulting horizontal offset is recorded; the offset is involved in the calculation of the waist effector until the standing case is activated again.
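The waist and shoulder computations can be sketched directly from the formulas above; the argument names mirror the symbols, while the vector conventions and the safety guard are our assumptions.

```python
import numpy as np

def waist_height(h_ht, m_ht, mu_ht, h_scale, h_minht):
    """Piecewise waist-effector height from Section 5.2."""
    base = h_ht if h_ht >= h_minht else h_minht
    return base - (m_ht - mu_ht * h_scale)

def shoulder_effectors(be, hp, bf, factor_s, factor_w, h_scale):
    """Shoulder effector positions (lse, rse) from Equations (14)-(17)."""
    be, hp, bf = map(np.asarray, (be, hp, bf))
    me = be + (hp - be) / factor_s            # point between the shoulders
    side = np.cross(bf, hp - be)              # lateral axis of the torso
    offset = side / max(np.linalg.norm(side), 1e-9) * factor_w * h_scale
    return me - offset, me + offset           # (lse, rse)
```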

5.3 The lower body reconstruction

In this section, we discuss how to derive the lower body pose without any sensor information. This approach is based on eight animation clips of human walking in the directions front, front-left, left, back-left, back, back-right, right and front-right. We estimate the speed and direction of walking from the sensors on the head and two hands and blend these eight animations to reconstruct the lower body action.

First, if the user begins to walk forward or backward, we need to judge which foot moves first according to the swing direction of the user's hands, as shown in Figure 9. When we detect the user moving forward or backward from the velocity of the user's head maintained in the motion information cache, we check whether the swing direction of the user's hands is valid, meaning the left hand moving forward and the right hand moving backward, or the opposite, to judge the first foot. If not, we do not activate the walking case and wait for the next frame, until we get a valid swing direction or reach a waiting limit. If we cannot get a valid swing direction, the left foot first when walking forward and the right foot first when walking backward are the defaults, following human walking habits.

Figure 9: Different foot choice when starting to walk, according to the swing direction of the user's hands

Figure 10: Animation blending graph. The red point coordinate is translated from the human moving speed and direction. The eight diamonds are the prepared animations of human walking in eight directions.

Next, we use two values $(x, z)$ to control the animation blending, as shown in Figure 10. The two values $(x, z)$ represent the coordinates of a blending point, and the eight animations are placed on the edge of the unit circle centered at the origin. The moving direction obtained from the motion information cache is normalized and translated into $(x, z)$, and the moving velocity magnitude obtained from the cache is translated into the distance between the blending point and the origin. Given the coordinates of the blending point, the two nearest animations are blended according to the distances between the blending point and the animation diamonds, and the playback speed of the animation depends on the distance between the blending point and the origin. At last, the result of the animation blending is fed into the inverse kinematics calculation for the two legs, in the same way as for the arms.
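A sketch of this directional blend follows, assuming the eight clips sit at 45-degree steps on the unit circle, that angular distance stands in for the distance to each animation diamond, and that a nominal maximum speed maps the cached velocity magnitude to the blend radius; all three assumptions are ours.

```python
import numpy as np

# Eight walk clips at 45-degree steps on the unit circle, front first.
CLIP_ANGLES = np.radians(np.arange(8) * 45.0)

def blend_weights(move_dir_xz, speed, max_speed=3.0):
    """Map the cached move direction and speed to a blend point (x, z),
    then weight the two nearest clips by angular distance."""
    v = np.asarray(move_dir_xz, dtype=float)
    v = v / max(np.linalg.norm(v), 1e-6)
    point_angle = np.arctan2(v[1], v[0])
    # Wrap-around angular distance from the blend point to each clip.
    diff = np.abs((CLIP_ANGLES - point_angle + np.pi) % (2 * np.pi) - np.pi)
    a, b = np.argsort(diff)[:2]                   # two nearest clips
    wa = diff[b] / max(diff[a] + diff[b], 1e-6)   # closer clip -> larger weight
    playback_rate = min(speed / max_speed, 1.0)   # radius drives playback speed
    return {int(a): wa, int(b): 1.0 - wa}, playback_rate
```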

6 Natural Action Recognition

In recent years, neural networks have been applied to recognition tasks more and more. In this section, we briefly introduce a natural action recognition algorithm based on a neural network. The neural network is trained offline with the back-propagation learning method on the target motions. The velocity magnitudes and directions of the user's head and two hands maintained by the motion information cache are transformed to the standard body size according to the calibration scales and then sent into the offline-trained neural network as input data.

6.1 Offline training

According to the number of motions that need to be recognized by the network, we set different sizes of input and output data, numbers of hidden layers and numbers of nodes per hidden layer. The training data we record for a motion are the velocity magnitudes and directions of the hand controllers. It is recommended to record more training data for better recognition performance.

The training method we use is the back-propagation algorithm, whose pseudocode is shown in Algorithm 1. Back-propagation learning consists of two passes through the layers of the network. In the forward pass, the output of the network for an input sample is computed while all synaptic weights are kept fixed. During the backward pass, the weights are adjusted by an error-correction rule: the error signal is propagated backward through the network, and after the adjustment the output of the network has moved closer to the desired response in a statistical sense. After a number of iterations, the error becomes smaller than the threshold and the network has learned to recognize the training motions.
Algorithm 1 Detailed back-propagation process

Require: feature vectors β of the training samples; targets (desired outputs) of the training samples
Ensure: trained weights and biases

1: while Σ_i δ_i^(n_l) > threshold do
2:   for each training sample β do
3:     ForwardPropagation(β);
4:     compute the error terms δ_i^(n_l) in the output layer;
5:     for the layers l = n_l − 1, n_l − 2, ..., 2, in order, compute
         δ_i^(l) = ( Σ_j w_ji^(l) δ_j^(l+1) ) f'(z_i^(l));
6:     update the weights and biases:
         w_ij^(l) = w_ij^(l) − α a_j^(l) δ_i^(l+1)
         bias_i^(l) = bias_i^(l) − α δ_i^(l+1);
7:   end for
8: end while
9: return the trained weights and biases
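For concreteness, here is a compact NumPy sketch of the kind of network Algorithm 1 trains: one sample at a time, sigmoid activations and a squared-error signal. The layer sizes, learning rate and activation choice are our assumptions; the paper does not report them.

```python
import numpy as np

class MLPRecognizer:
    """Tiny fully connected network trained with the back-propagation
    loop of Algorithm 1 (one sample per step, sigmoid activations)."""
    def __init__(self, sizes, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = alpha
        self.W = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes, sizes[1:])]
        self.b = [np.zeros(m) for m in sizes[1:]]

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        acts = [np.asarray(x, dtype=float)]
        for W, b in zip(self.W, self.b):
            acts.append(self._sig(W @ acts[-1] + b))
        return acts

    def train_step(self, x, target):
        target = np.asarray(target, dtype=float)
        acts = self.forward(x)
        # Output-layer error term, then propagate it backward layer by layer.
        delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])
        for l in reversed(range(len(self.W))):
            grad_W = np.outer(delta, acts[l])
            delta_next = (self.W[l].T @ delta) * acts[l] * (1.0 - acts[l])
            self.W[l] -= self.alpha * grad_W
            self.b[l] -= self.alpha * delta
            delta = delta_next
        return float(np.sum((acts[-1] - target) ** 2))

# For example, MLPRecognizer([12, 16, 5]) could map 12 velocity features
# to 5 trained actions; these sizes are illustrative assumptions.
```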
6.2 Online predicting

The recognition network is connected to the framework only when needed. During the recognition process, the network fetches the input data it needs from the motion information cache, then transforms the data to the standard body size according to the calibration scales in order to adapt to different people. Finally, the network gives the predicted result for the motion. To make the results more reliable, we use a voting scheme that counts the network's predictions over a period of time; the result with the most votes is the final result of the motion recognition.
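A minimal sketch of such a voting scheme; the window length and the vote threshold are illustrative assumptions.

```python
from collections import Counter, deque

class VotingRecognizer:
    """Stabilize per-frame network predictions by majority vote
    over a sliding window of recent frames."""
    def __init__(self, window=30, min_votes=20):
        self.window = deque(maxlen=window)   # recent per-frame labels
        self.min_votes = min_votes           # assumed confidence threshold

    def update(self, frame_label):
        self.window.append(frame_label)
        label, votes = Counter(self.window).most_common(1)[0]
        return label if votes >= self.min_votes else None
```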
7 Results

We performed various experiments to verify the validity and effectiveness of our method. Our experiments are based on the HTC Vive, which has two hand controllers. All raw signals are translated into the global position and orientation of each sensor by OpenVR. The PC configuration is above the minimum VR requirement: the CPU is an Intel i5-4590 and the GPU is an Nvidia GTX 980.

7.1 Motion reconstruction result

Our method can reconstruct various full-body motions, such as walking in any direction, jogging, jumping, crouching and turning. In Figure 15, we show sample frames of some reconstruction results. In Figure 11, free walking in random directions is shown to demonstrate that our method can estimate a feasible and stable result in any walking case. In Figure 12, we show the first-foot judgement according to the swing direction of the user's hands. If the user does not swing the hands during walking, we use a default pace to switch the two feet, because we have no reference data about the feet.

The goal of our algorithm is to reconstruct visually appealing motions in real time from the sensors on the user's head and hands in VR, so the precision of the result is not our strength: we have no reference data about the lower body. In a VR experience, every user's eyes are masked by the HMD, and users can only see what we reconstruct in the virtual world; they will never notice the difference between real motions and reconstructed motions if the result is natural enough. So it is worthwhile to improve wearing comfort and reduce the number of sensors at the cost of reconstruction precision.

7.2 Motion reconstruction cost

Unlike other tracking systems, which need long wearing and calibration time, our calibration process needs only one T pose and finishes in 1 second. The controllers only need to be held in the hands, which is easy compared to devices bound to the user's whole body.

Run times for the different components of our motion reconstruction pipeline are shown in Table 1; the pipeline is real-time and can be used in a VR experience.

Table 1: Run times of our reconstruction method.

Components                   Time
Motion Information Cache     2 ms
Body direction estimation    1 ms
Upper body reconstruction    2 ms
Lower body reconstruction    2 ms
Total                        7 ms

The resources for our method include only the eight animation clips blended in the lower body reconstruction. Compared to other data-driven reconstruction methods, which need a large animation database containing thousands of animations, our method is easy enough to integrate into any VR application.

7.3 Motion recognition accuracy

We use a small interactive VR demo, shown in Figure 13, to test our motion recognition method. In the demo, we play an animal trainer who uses gestures to communicate with a tiger. The neural network has been trained for 5 actions: come here, go away, stand up, sit down and say hello to the animal. We invited three testers to perform these 5 actions in random order. The recognitions and accuracy are shown in Table 2.

Table 2: Motion recognition results.

Testers    Total actions    Recognitions    Correct
Tester 1   30               27              27
Tester 2   31               25              25
Tester 3   30               27              27

According to the results, because the voting scheme verifies the predictions of the neural network, a final result given by our method has a high probability of being correct. Some valid actions are missed by the neural network; we think the reason is that we do not have enough training data and the network is under-fitting.

Figure 11: Reconstruction results of random walking

Figure 12: The first-foot judgement according to the swing direction of the user's hands. (a) Go forward with left foot first. (b) Go forward with right foot first. (c) Go backward with left foot first. (d) Go backward with right foot first.

Figure 13: Animal Trainer VR interactive demo

Figure 14: Existing VR demos. (a) A VR shooting game. (b) A VR fitting demo.

7.4 Integration into VR applications

Our method is easy to integrate into existing VR applications. Figure 14 shows two applications to which we have added our method. One is a VR shooting game that makes a player's whole body visible to the other teammates. The other is a virtual fitting demo in which the user's costume and avatar can be easily customized to show the user's distinctive individuality; full-body reconstruction is an indispensable technology in this demo.

8 Limitations and future work

Because we reconstruct the full-body motion in real time only from the positions and orientations of the user's head and hands, which means we have to derive the result from previous information and the current sensor signals, different behaviors with similar input patterns cannot be distinguished. For example, we cannot detect stepping in place, because the swinging hands do not tell us whether there is any leg action. As another example, if the user does not swing the arms during walking, we have no way to estimate which foot goes first. But even in those cases, our method still creates feasible leg motions based on animation blending.

In the inverse kinematics, the same position and orientation of the hand may correspond to more than one solution for the elbow position. In our framework, we assume there is no bending at the user's wrists; a more accurate result could be obtained if two sensors were bound to the hand and the wrist, respectively.

The goal of our framework is to keep the number of sensors attached to the user to a minimum. In the future, we plan to develop techniques that use a pressure-sensing carpet to capture more foot action information in order to reconstruct more accurate motion.
user’s wrists. The more accurate result will be solved if two sensors GRAPH/Eurographics Symposium on Computer Animation, Eu-
are bound to the hand and the wrist, respectively. rographics Association, 1–10.
The goal of our framework is preferable to keep the number of sen- L ANDER , J., AND CONTENT, G. 1998. Making kine more flex-
sors attached to the user to a minimum. In the future, we plan to ible. Game Developer Magazine 1, 15-22, 2.
develop techniques for using a carpet of pressure sensing to capture L ANDER , J., AND CONTENT, G. 1998. Oh my god, i inverted
more foot action information in order to reconstruct more accurate kine. Game Developer Magazine 9, 9–14.
motion.
L IU , H., W EI , X., C HAI , J., H A , I., AND R HEE , T. 2011. Real-
time human motion control with a small number of inertial sen-
9 Conclusion sors. In Symposium on Interactive 3D Graphics and Games,
ACM, 133–140.
In this paper, we have presented a novel real-time motion recon-
struction and recognition method using the positions and orienta- M OESLUND , T. B., H ILTON , A., AND K R ÜGER , V. 2006. A
tions of users head and two hands. To this end, we first estimate survey of advances in vision-based human motion capture and
the body forward direction. Then we divide the whole body into analysis. Computer vision and image understanding 104, 2, 90–
the upper body and the lower body. The upper body motion re- 126.
construction algorithm based on inverse kinematics which is more
M UKAI , T., AND K URIYAMA , S. 2005. Geostatistical motion
accurate. The lower body motion reconstruction algorithm based
interpolation. In ACM Transactions on Graphics (TOG), vol. 24,
on animation blending which only need a small number of ani-
ACM, 1062–1070.
mation clips. At the same time, according to the motion informa-
tion we have collected, a natural action recognition algorithm based PARKE , F. I., AND WATERS , K. 1996. Computer Facial Anima-
on neural network will run in need to detect the target motion we tion. A. K. Peters.
have trained. We demonstrated the validity and effectiveness of the
method by reconstructing various motions, such as walking in any P ULLEN , K., AND B REGLER , C. 2002. Motion capture assisted
direction, jogging, jumping, crouching and turning. Our method animation: Texturing and synthesis. In ACM Transactions on
has the ability to reconstruct visually appealing and natural human Graphics (TOG), vol. 21, ACM, 501–508.
motions which can be used in almost all VR multi-person interac- R EN , L., S HAKHNAROVICH , G., H ODGINS , J. K., P FISTER , H.,
tive entertainment applications. The configuration of sensors we AND V IOLA , P. 2005. Learning silhouette features for control
attached on users head and two hands is very popular in VR well- of human motion. ACM Transactions on Graphics (ToG) 24, 4,
known devices, so the method can be easily integrated into existing 1303–1331.
VR devices to create deeper immersion
R IAZ , Q., TAO , G., K R ÜGER , B., AND W EBER , A. 2015. Motion
reconstruction using very few accelerometers and ground con-
Acknowledgements tacts. Graphical Models 79, 23–38.

We would like to thank Cheng Yang, Lele Feng,and Chao Yin ROETENBERG , D., L UINGE , H., AND S LYCKE , P. 2007. Moven:
for their assistance and advice during the research.This work is Full 6dof human motion tracking using miniature inertial sen-
supported in part by the National Natural Science Foundation of sors. Xsen Technologies, December.
China (No. 61173105 and 61373085) and the National High S AFONOVA , A., AND H ODGINS , J. K. 2007. Construction and
Technology Research and Development Program of China (No. optimal search of interpolated motion graphs. In ACM Transac-
2015AA016404). tions on Graphics (TOG), vol. 26, ACM, 106.
References

ARIKAN, O., AND FORSYTH, D. A. 2002. Interactive motion generation from examples. In ACM Transactions on Graphics (TOG), vol. 21, ACM, 483–490.

CHAI, J., AND HODGINS, J. K. 2005. Performance animation from low-dimensional control signals. In ACM Transactions on Graphics (TOG), vol. 24, ACM, 686–696.

KELLY, P., CONAIRE, C. Ó., O'CONNOR, N. E., AND HODGINS, J. 2013. Motion synthesis for sports using unobtrusive lightweight body-worn and environment sensing. In Computer Graphics Forum, vol. 32, Wiley Online Library, 48–60.

KIM, H., AND LEE, S.-H. 2013. Reconstructing whole-body motions with wrist trajectories. Graphical Models 75, 6, 328–345.

KOVAR, L., AND GLEICHER, M. 2004. Automated extraction and parameterization of motions in large data sets. In ACM Transactions on Graphics (TOG), vol. 23, ACM, 559–568.

KRÜGER, B., TAUTGES, J., WEBER, A., AND ZINKE, A. 2010. Fast local and global similarity searches in large motion capture databases. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Eurographics Association, 1–10.

LANDER, J., AND CONTENT, G. 1998. Making kine more flexible. Game Developer Magazine 1, 15–22, 2.

LANDER, J., AND CONTENT, G. 1998. Oh my god, I inverted kine. Game Developer Magazine 9, 9–14.

LIU, H., WEI, X., CHAI, J., HA, I., AND RHEE, T. 2011. Realtime human motion control with a small number of inertial sensors. In Symposium on Interactive 3D Graphics and Games, ACM, 133–140.

MOESLUND, T. B., HILTON, A., AND KRÜGER, V. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 2, 90–126.

MUKAI, T., AND KURIYAMA, S. 2005. Geostatistical motion interpolation. In ACM Transactions on Graphics (TOG), vol. 24, ACM, 1062–1070.

PARKE, F. I., AND WATERS, K. 1996. Computer Facial Animation. A. K. Peters.

PULLEN, K., AND BREGLER, C. 2002. Motion capture assisted animation: Texturing and synthesis. In ACM Transactions on Graphics (TOG), vol. 21, ACM, 501–508.

REN, L., SHAKHNAROVICH, G., HODGINS, J. K., PFISTER, H., AND VIOLA, P. 2005. Learning silhouette features for control of human motion. ACM Transactions on Graphics (TOG) 24, 4, 1303–1331.

RIAZ, Q., TAO, G., KRÜGER, B., AND WEBER, A. 2015. Motion reconstruction using very few accelerometers and ground contacts. Graphical Models 79, 23–38.

ROETENBERG, D., LUINGE, H., AND SLYCKE, P. 2007. Moven: Full 6DOF human motion tracking using miniature inertial sensors. Xsens Technologies, December.

SAFONOVA, A., AND HODGINS, J. K. 2007. Construction and optimal search of interpolated motion graphs. In ACM Transactions on Graphics (TOG), vol. 26, ACM, 106.

SCHEPERS, H. M., ROETENBERG, D., AND VELTINK, P. H. 2010. Ambulatory human motion tracking by fusion of inertial and magnetic sensing with adaptive actuation. Medical & Biological Engineering & Computing 48, 1, 27–37.

SLYPER, R., AND HODGINS, J. K. 2008. Action capture with accelerometers. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Eurographics Association, 193–199.

TAUTGES, J., ZINKE, A., KRÜGER, B., BAUMANN, J., WEBER, A., HELTEN, T., MÜLLER, M., SEIDEL, H.-P., AND EBERHARDT, B. 2011. Motion reconstruction using sparse accelerometer data. ACM Transactions on Graphics (TOG) 30, 3, 18.

VLASIC, D., ADELSBERGER, R., VANNUCCI, G., BARNWELL, J., GROSS, M., MATUSIK, W., AND POPOVIĆ, J. 2007. Practical motion capture in everyday surroundings. In ACM Transactions on Graphics (TOG), vol. 26, ACM, 35.

YIN, K., AND PAI, D. K. 2003. FootSee: an interactive animation system. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Eurographics Association, 329–338.
Figure 15: Reconstruction results. (a) Go forward. (b) Go forward right. (c) Go left. (d) Go backward left. (e) Go backward. (f) Crouch. (g) Crouch-walk. (h) Jump.

