
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 3, MARCH 2010

Automatic Detection and Analysis of Player Action in Moving Background Sports Video Sequences

Haojie Li, Jinhui Tang, Member, IEEE, Si Wu, Yongdong Zhang, and Shouxun Lin, Member, IEEE

Abstract—This paper presents a system for automatically detecting and analyzing complex player actions in moving background sports video sequences, aiming at action-based sports video indexing and at providing kinematic measurements for coaching assistance and performance improvement. The system works in a coarse-to-fine fashion. For an input video, at the coarse granularity level, we automatically segment the highlights, that is, the video clips containing the desired action, as summaries for general user viewing purposes; at the middle granularity level, we recognize the action types to support action-based video indexing and retrieval; and finally, at the fine granularity level, the critical kinematic parameters of the player action are obtained for sports professionals' training purposes. However, the complex and dynamic background of sports videos and the complexity of player actions bring considerable difficulty to the automatic analysis. To fulfill such a challenging task, robust algorithms are proposed in this paper, including global motion estimation with adaptive outlier filtering, object segmentation based on adaptive background construction, and automatic human body tracking. Two visual analysis tools, motion panorama and overlay composition, are also introduced. Real diving and jump game videos are used to test the proposed system and algorithms, and the extensive and encouraging experimental results show their effectiveness.

Index Terms—Action recognition, human body tracking, sports training, video analysis, video object segmentation.

Manuscript received October 9, 2008; revised May 10, 2009. First version published November 3, 2009; current version published March 5, 2010. This work was supported in part by the National Basic Research Program of China (Grant No. 2007CB311100) and the Co-building Program of the Beijing Municipal Education Commission. This paper was recommended by Associate Editor T. Fujii.
H. Li is with the School of Software, Dalian University of Technology, Dalian 116620, China.
J. Tang is with the School of Computing, National University of Singapore, 117590 Singapore (e-mail: tangjh@comp.nus.edu.sg).
S. Wu is with France Telecom Research and Development Beijing, Beijing 100080, China.
Y. Zhang and S. Lin are with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2009.2035833

I. Introduction

WITH THE EXPLOSIVE growth of digital videos in our daily life, automatic video content analysis has become a basic requirement for efficient indexing and retrieval of long video sequences. In recent years, the analysis of sports videos has attracted great attention due to its mass appeal and tremendous commercial potential. Many works have been conducted, and technologies and systems developed, for automatic or semi-automatic parsing of the structure of sports video [1], [2], semantic event detection and video summarization [3], [4], enhanced sports TV broadcasting [5], and content insertion [6]. The difficulty of video analysis lies in the semantic gap between low-level audio-visual features and high-level concepts. To bridge the gap, some mid-level representations are constructed from low-level features with clustering or classification methods [7]. However, these representations are extracted by frame-based approaches and have no direct link to high-level semantics. Moreover, they can only deduce coarse-level knowledge of the video contents for general user viewing purposes. Object behavior is another, more effective mid-level representation for video content analysis [8], [9]. At the same time, sports professionals such as coaches and players desire finer-granularity analysis to get more detailed information, such as action names, match tactics, and kinematic or biometric measurements, from videos for coaching assistance and performance improvement. For example, by automatically recognizing the actions and obtaining player body joint angles in diving game videos, coaches or players can easily retrieve and compare, qualitatively or quantitatively, the performed actions with the same ones performed by elite players in a video database, and then improve their performance in later training or competition. To these ends, this paper addresses the automatic detection, recognition, and analysis of player actions from broadcast sports game videos or videos recorded during daily training. More specifically, we focus on one sports genre where the player performs his or her action in a large arena and the camera needs to be operated with pan/tilt/zoom to keep the player in the middle of the image. Usually, one or more cameras are placed in the side view to record the entire detailed action. This covers a broad category of individual sports videos, such as diving, jump, and gymnastics videos (see Fig. 1). For such action-critical sports videos, general users would like to rapidly locate and watch the highlights, namely, the video clips containing the desired action, while sports professionals will be more interested in the performances of the players. Manual analysis of sports videos to achieve such aims is labor-intensive and time-consuming. Therefore, systems and techniques that can automatically parse long videos into browsable actions and further provide kinematic measurements for performance analysis are in demand.

In this paper, we present an integrated coarse-to-fine sports video analysis system and various robust algorithms. Diving videos are used as a case study to demonstrate the effectiveness of the system and algorithms due to the following motivations:

Fig. 1. Some individual sports actions captured with a side-view camera. From left to right: diving, long jump, and tumbling.

1) diving is among the most popular spectator sports; 2) to make the player actions clearly visible to the referees and audiences, there are always side-view shots in diving videos; and 3) the complex and moving background of diving videos and the complexity of diving actions make the automatic analysis a rather challenging task. To show the general versatility of the proposed algorithms, we also test some key algorithms on jump videos.

A. Related Work

Human action detection and recognition in videos is a hot topic in the computer vision research community, and various approaches have been proposed [10], [11]. But in the field of sports video, due to the dynamic background and the complexity of sports actions, few works exist, and most of them focus on action recognition against a simple background, e.g., in tennis or soccer videos, or on relatively simple actions. These works either made little effort to segment the player's body or did not need accurate tracking and segmentation of the players. One of the earlier works on sports action recognition was provided by Yamato et al. [12]. They proposed using discrete hidden Markov models (HMMs) to recognize image sequences of six different tennis strokes among three subjects. Their system worked in a constrained testing environment where the mesh features they used were extracted from binarized human images obtained from a pre-known background. Sullivan et al. [13] presented a method for detecting tennis players' strokes based on similarity computed by point-to-point correspondence between shapes. Although impressive results in terms of posture matching were given, the edge extraction needed a clean background. Miyamori et al. [8] analyzed tennis player behaviors based on silhouette transitions. They extracted the player silhouette images by frame differencing, so a static background was necessary. Efros et al. [14] developed a generic approach to recognizing actions in "medium field" sports video by introducing a novel motion descriptor based on noisy optical flow measurements in a spatio-temporal volume for each stabilized human figure. Similarly, Zhu et al. [9] proposed slice-based optical flow histograms as a motion descriptor to classify a player's basic actions, such as left-swing/right-swing, in low-resolution tennis video sequences. Lu et al. [15] used grids of histograms of oriented gradients to represent players and to track and recognize player actions, such as skate left/skate right and running/walking, in hockey and soccer sequences. The limitation of the above three works [9], [14], [15] is that they only presented results with relatively simple actions and simple backgrounds, such as a soccer field, tennis court, or hockey rink. Compared with these previous works, the task in this paper, to automatically detect and classify more complex actions in video captured with a moving camera, is more challenging.

Recent advances in video technology and computing power have motivated research on using computer vision and image processing techniques to analyze sports game video for tactics statistics, computer-aided coaching, or performance improvement. Compared with previous methods that require retro-reflective markers or magnetic sensors to be placed on the player's body, the video-based approach has more advantages: much lower cost, no interference with the player's performance, and the ability to analyze the rich archive of existing video clips. Among such works, Pascual et al. [16] developed a method for soccer player position tracking aimed at kinematic motion analysis, through a graph representation with four static cameras. Aided by manual tracking in some cases, their method could collect statistical measurements of each player in the game. Wang et al. [17] classified tennis games into 58 winning tactics patterns for archiving video clips and for training purposes, using the ball trajectory and a Bayesian network. To recover the trajectory and ball landing position they turned to a wide-view calibrated camera. The widely studied human body model based tracking approach has also been suggested for sports biometric analysis. As shown in [18], a 42-dimensional body model was used to track the golfer's postural information, which was then analyzed with respect to a learned ideal motion. Nevertheless, their system ran so slowly that each frame took about 25 min to process, and the initial parameters of the body model needed to be set up manually before tracking. Raquel et al. [19] incorporated strong dynamic models into the human body tracking process to recover the 3-D postural parameters of golfers from monocular golf videos. One work similar to our player action analysis was provided by Ryan et al. [20]. In their system, the authors analyzed the acrobatic gestures of several sports through modeling and characterizing acrobatic movements and image processing techniques. However, their work was less challenging than ours, since the player movements were captured with a static camera and they only recognized simple acrobatic gestures by global measurement analysis.

B. Contributions of This Paper

In this paper, we develop robust algorithms and a system for fully automatic analysis of complex actions in challenging dynamic background videos, aiming at high-level sports video indexing/retrieval and training purposes. The schematic diagram of the proposed system is illustrated in Fig. 2. For an input video, global motion parameters between adjacent frames are first estimated; these are used as the motion feature for detecting the highlights, that is, the action clips. The detected highlights are stored into a library as video summaries for the user's quick browsing. Then the player body shapes in the action clips are segmented and fed into hidden Markov models to recognize the action type, which is used to index the highlight library. With the segmented shape and recognized action type, some critical kinematic parameters of the action are automatically obtained through the kinematic analysis component, and visual analysis is conducted.

Fig. 2. Block diagram of the proposed system.

The contributions of our paper consist of the following points.
1) An integrated framework for automatic analysis of sports video in a coarse-to-fine fashion is presented, which attempts to extract semantic information at different granularities to satisfy the retrieval requirements of users ranging from general viewers to sports professionals.
2) To robustly estimate the global motion between adjacent video frames, we propose an adaptive outlier filtering strategy using Fisher linear discriminant analysis.
3) An object segmentation algorithm based on adaptive dynamic background construction is proposed, which is robust to complex and dynamic scenes.
4) The automatic kinematic analysis of player body movements is achieved by model fitting. By transferring the initial model parameters from the recognized action templates, our method avoids the manual setup of the human body model.
5) With the enabling techniques above, we present two visual tools for individual sports game training, motion panorama and overlay composition, which support the visual analysis and comparison of player performance.

II. Global Motion Estimation

Global motion (GM) refers to the motion of the background in a video sequence caused by the camera motion. Global motion estimation (GME) is the process of estimating the rotation, scaling, and translation parameters of the camera motion by comparing two different frames of a video; it is the basic technique underlying the subsequent algorithms of this paper. The methods for GME can be classified into two categories: differential methods [21], [22] and feature matching-based methods [23]. For both categories, accuracy is affected by the so-called outliers, that is, measurements whose motion is not consistent with the global motion, mainly caused by local object (such as player) motion. Some robust estimation methods have been developed to handle outliers [21]–[23]. However, these methods use either experimentally determined or manually specified thresholds to remove outliers, and thus do not adapt to other data. In this paper, we propose an adaptive outlier filtering strategy using Fisher linear discriminant analysis [24] to improve the robustness of GME.

We adopt the 6-parameter affine model [25] to describe the camera motion, since it is computationally efficient yet powerful enough to model the translation, rotation, and scaling operations of the camera when the relative depth of the object is not large, which applies to most videos including our case. Let $u_i = (x_i, y_i)^T$ be the position of a point $p$ in the current frame $I_k$, and $u_i' = (x_i', y_i')^T$ be the position of $p$ in frame $I_{k-1}$; then we can relate $u_i$ and $u_i'$ with the affine model as follows:

$$u_i = H_i A \qquad (1)$$

where $A = (a, b, c, d, e, f)^T$ is the global motion parameter vector and $H_i$ is the $2 \times 6$ matrix

$$H_i = \begin{pmatrix} x_i' & y_i' & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_i' & y_i' & 1 \end{pmatrix}. \qquad (2)$$

Assume we have $N$ feature point pairs $(u_i, u_i')$ $(i = 1, \ldots, N;\ N \ge 3)$ from frames $I_k$ and $I_{k-1}$; then we have

$$U = HA \qquad (3)$$

where $H = (H_1^T, H_2^T, \ldots, H_N^T)^T$ and $U = (u_1^T, u_2^T, \ldots, u_N^T)^T$. Thus $A$ can be obtained by solving (3) with the least squares method.

The feature point pairs are constructed as follows. We first calculate the global standard deviation, gstd, of the pixel values in frame $I_k$. Then we scan frame $I_k$ and check each $n \times n$ block. If the standard deviation, std, of a block is large enough, that is, larger than $\alpha \cdot \mathrm{gstd}$, the upper left corner of the block is selected as $u_i$; $u_i'$ is obtained by searching nearby blocks in frame $I_{k-1}$. In this way, we collect all the point pairs between $I_k$ and $I_{k-1}$.
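To make (1)–(3) concrete, the following NumPy sketch stacks the point pairs into $H$ and $U$ and solves for $A$ by least squares; the function names and array layout are ours, not the paper's.

```python
import numpy as np

def estimate_affine(points_cur, points_prev):
    """Least-squares estimate of the 6-parameter affine model A of eq. (3).

    points_cur:  (N, 2) array of positions u_i in frame I_k.
    points_prev: (N, 2) array of corresponding positions u_i' in I_{k-1}.
    Returns A = (a, b, c, d, e, f) such that u_i = H_i A.
    """
    n = len(points_cur)
    H = np.zeros((2 * n, 6))
    H[0::2, 0:2] = points_prev      # x', y' terms of the x-equation
    H[0::2, 2] = 1.0
    H[1::2, 3:5] = points_prev      # x', y' terms of the y-equation
    H[1::2, 5] = 1.0
    U = points_cur.reshape(-1)      # stacked (x_1, y_1, x_2, y_2, ...)
    A, *_ = np.linalg.lstsq(H, U, rcond=None)
    return A

def residuals(points_cur, points_prev, A):
    """Per-pair residual magnitudes |r_i| = |u_i - H_i A| (eq. (4))."""
    a, b, c, d, e, f = A
    pred_x = a * points_prev[:, 0] + b * points_prev[:, 1] + c
    pred_y = d * points_prev[:, 0] + e * points_prev[:, 1] + f
    return np.hypot(points_cur[:, 0] - pred_x, points_cur[:, 1] - pred_y)
```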

Suppose $A^*$ is the initial solution for the global motion; we calculate the residual error for each pair $(u_i, u_i')$ as

$$r_i = u_i - H_i A^*. \qquad (4)$$

Since $A^*$ is an approximate solution to the GM and the motion of outliers is not consistent with the GM, the residual errors of outliers will be larger than those of inliers. Therefore, we can separate the point pairs into inliers and outliers according to their residual errors and then use the inliers to refine $A$. Let $R = \{|r_1|, \ldots, |r_N|\}$ be the residual error set for the point pair set $\{(u_1, u_1'), \ldots, (u_N, u_N')\}$, and let $R_{in}$ and $R_{out}$ be the residual error sets for the inliers and outliers. The key problem is to find an appropriate threshold $T$ such that

$$R_{in} = \{|r_i| : |r_i| < T\}, \quad R_{out} = \{|r_i| : |r_i| \ge T\}$$
$$\text{subject to } R_{in} \cup R_{out} = R \ \text{and} \ R_{in} \cap R_{out} = \emptyset. \qquad (5)$$

Assume that the mean of $R_{in}$ is $\mu_{in}$ with probability $p_{in} = |R_{in}|/N$, and that the mean of $R_{out}$ is $\mu_{out}$ with probability $p_{out} = 1 - p_{in}$; then we can compute the between-cluster variance $\sigma_B^2$ between $R_{in}$ and $R_{out}$ as follows:

$$\sigma_B^2 = p_{out}\, p_{in}\, (\mu_{in} - \mu_{out})^2. \qquad (6)$$

According to the Fisher linear discriminant criterion [24], the optimal classification is achieved when $\sigma_B^2$ is maximized. Thus the optimal threshold $T_r$ is selected as

$$T_r = \arg\max_{T \in R} \sigma_B^2. \qquad (7)$$

After $T_r$ is found, the inlier set $R_{in}$ and outlier set $R_{out}$ are consequently determined. Then $R_{out}$ is filtered out and only $R_{in}$ is used to re-estimate $A$. Since the points in $R_{in}$ are more likely to comply with the GM, the re-estimated GM parameters are more precise. This is the basic idea of applying the Fisher criterion to filter outliers in our proposed GME method.

The complete GME algorithm is summarized as follows.

1. Select feature point pairs from $I_k$ and $I_{k-1}$ and set $S_0 = \{(u_1, u_1'), \ldots, (u_N, u_N')\}$.
2. Set $Iteration = 1$ and $S = S_0$.
3. Iterate the steps below until ($Iteration > maxIteration$) or (($T_r < ErrThresh$) and ($Q < T_r \cdot |S| \cdot 0.5$)):
   3.1 Compute the motion model $A^*$ from $S$ by solving (3).
   3.2 For each point pair $(u_i, u_i')$ in $S$, compute its residual error $r_i$ using (4); then determine the optimal separating threshold $T_r$ using (7).
   3.3 Obtain the inlier set $\Phi = \{(u_i, u_i') : |r_i| < T_r,\ (u_i, u_i') \in S\}$; calculate the total residual error $Q = \sum_{(u_i, u_i') \in \Phi} |r_i|$.
   3.4 Set $S = \Phi$, $Iteration = Iteration + 1$.
4. Given the estimated motion model parameters $A^*$, a refinement step motivated by [25] is conducted to include more inliers and produce more accurate model parameters. The refinement is done as follows. We apply $A^*$ to $S_0$ and construct a new inlier set $\Phi' = \{(u_i, u_i') : |r_i| < ErrThresh,\ (u_i, u_i') \in S_0\}$; then we compute the final model $A$ from $\Phi'$.

Fig. 3 shows two successive frames from a diving video and their difference after GM estimation and compensation. It can be seen that the outliers among the global motion vectors are correctly filtered and the background is accurately aligned by the proposed GME algorithm.

Fig. 3. Result for GME. (a)–(b) Two neighboring video frames. (c) Global motion vectors between the two frames. (d) Motion vectors after outlier filtering. (e) Image (a) aligned to (b) using the estimated GME parameters. (f) Difference image of (b) and (e). From (f) we can see that the background is accurately aligned with the proposed GME algorithm.
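The adaptive threshold of (5)–(7) amounts to an exhaustive search over the observed residual magnitudes for the split that maximizes the between-cluster variance. A minimal sketch (our own naming, assuming the residuals are already computed):

```python
import numpy as np

def fisher_threshold(abs_residuals):
    """Pick the threshold T_r maximizing the between-cluster variance
    sigma_B^2 = p_out * p_in * (mu_in - mu_out)^2 over candidates in R."""
    r = np.sort(np.asarray(abs_residuals, dtype=float))
    n = len(r)
    best_t, best_var = r[-1], -1.0
    for t in r[1:]:                 # every residual value is a candidate T
        inliers, outliers = r[r < t], r[r >= t]
        if len(inliers) == 0 or len(outliers) == 0:
            continue
        p_in = len(inliers) / n
        var_b = (1.0 - p_in) * p_in * (inliers.mean() - outliers.mean()) ** 2
        if var_b > best_var:
            best_var, best_t = var_b, t
    return best_t

# Inside step 3 of the GME loop, only the pairs below the threshold survive:
# keep = residuals(pts_cur, pts_prev, A_star) < fisher_threshold(res)
```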
these individual shots. The aim of this step is to eliminate the
influences of abrupt motion changes (see Fig. 4) on the motion
III. Highlight Detection segmentation results. For each shot, we use a 5-frame window
Highlights are the atomic entities of videos at semantic to detect the local peaks of global motion. If the peak value is
level, representing the most expressive or attracting video larger than a predefined threshold, pThresh, a possible action
segments. Highlight detection is a kind of video summarization clip is found. Then we use a sliding window with length L to
technique, which enables users to rapidly digest the contents scan the two directions from this point to determine the start
of long video sequences. In our case of action-critical sports and end point of the clip. For each direction, the scanning
videos, highlight corresponds to the clip which contains the will stop if the number of moving frames (i.e., frames whose
entire diving action, to say, from the player’s take-off to global motion values are larger than mThresh) in the window
entry to the water. In general, successful highlight detection is smaller than r ∗ L; or reaches the start or end of the shot. If
algorithms are usually domain-specific knowledge dependent the length of the detected clip is larger than 20 frames, then
[1], [3]. In this paper, we develop an effective and general it is deemed as an action clip. This process is repeated until
approach to extracting the highlight clips using the repetitive the entire shot is checked. Then proceed to the next shot.
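The following sketch illustrates the peak-and-scan procedure for one shot. It is a plausible reading of the description above rather than the authors' exact implementation, and the helper names are ours.

```python
def segment_action_clips(motion, p_thresh, m_thresh, L, r, min_len=20):
    """Per-shot action clip segmentation.
    motion: list of per-frame global-motion magnitudes for one shot."""
    n = len(motion)
    moving = [m > m_thresh for m in motion]

    def extend(peak, step):
        """Slide a length-L window outward from the peak until it holds
        fewer than r*L moving frames, or the shot boundary is reached."""
        pos = peak
        while 0 <= pos + step < n:
            if step > 0:
                window = moving[pos + step: min(n, pos + step + L)]
            else:
                window = moving[max(0, pos + step - L + 1): pos + step + 1]
            if sum(window) < r * L:
                break
            pos += step
        return pos

    clips = []
    for k in range(2, n - 2):
        is_peak = motion[k] == max(motion[k - 2:k + 3])   # 5-frame window
        if motion[k] > p_thresh and is_peak:
            start, end = extend(k, -1), extend(k, +1)
            if end - start + 1 > min_len and (not clips or start > clips[-1][1]):
                clips.append((start, end))
    return clips
```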


Fig. 4. Vertical global motion patterns of diving actions (dives) and their respective replays. The abrupt changes in the motion curve are caused by shot transitions.

After all the action clips are segmented, we apply the hierarchical agglomerative clustering method [27] to cluster the clips. The clips with low distances are grouped first, and then the clips with relatively larger distances are grouped. The process is repeated until the stop condition is reached.

The dynamic time warping (DTW) technique [28] is slightly modified to enable partial matching, and is used to compute the distance between two action clips. Let $s = (s_1, \ldots, s_M)$ and $t = (t_1, \ldots, t_N)$ be the global motion sequences of two clips; their DTW distance is defined as (suppose $M \ge N$)

$$DTW(s, t) = \begin{cases} T, & M > 3N \ \text{or}\ E_s > 3E_t \ \text{or}\ E_t > 3E_s \\ dtw(s, t), & M < 1.5N \\ \min_{i=1}^{M-N} dtw((s_i, \ldots, s_{i+N}),\ t), & M \ge 1.5N \end{cases} \qquad (8)$$

In (8), $E_s = \sum_{i=1}^{M} |s_i|$ and $E_t = \sum_{i=1}^{N} |t_i|$; $dtw(s, t)$ is the distance between $s$ and $t$ computed by the standard DTW technique [28]. $T$ is first assigned a large value; then, after the distances between all the action clip pairs are obtained, we set $T = \max\{DTW(s', t') \mid DTW(s', t') < T\}$.
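A direct transcription of (8), under our own naming; the quadratic-time dtw is the textbook recurrence:

```python
import numpy as np

def dtw(s, t):
    """Standard DTW distance between two 1-D global-motion sequences."""
    m, n = len(s), len(t)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def clip_distance(s, t, big=1e9):
    """Modified DTW of eq. (8); s is taken as the longer sequence (M >= N)."""
    if len(s) < len(t):
        s, t = t, s
    M, N = len(s), len(t)
    Es, Et = np.sum(np.abs(s)), np.sum(np.abs(t))
    if M > 3 * N or Es > 3 * Et or Et > 3 * Es:
        return big                        # clearly different clips
    if M < 1.5 * N:
        return dtw(s, t)
    # Partial matching: the best-matching window of t's length inside s.
    return min(dtw(s[i:i + N + 1], t) for i in range(M - N))
```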
It is known that determining the cluster number or the stop condition for a clustering algorithm is difficult. Here the second characteristic motivates us to add a constraint to the clustering: clips that are too near in time (e.g., dive1, replay1, and replay2 in Fig. 4, where the temporal gaps are below a threshold TG) cannot be grouped into the same cluster. This constraint eases the determination of the clustering stop condition. We first compute the average distance, dist, over all the nearby clips and then set the stop condition as: the cluster distance should not be larger than dist·2. This condition encourages similar action clips such as dive1 and dive2 to group into a cluster C1, and replay1 and replay3 to group into a cluster C2, while preventing C1 and C2 from grouping together. In practice, to ensure that nearby clips are not grouped, we simply assign a large distance to them.

After clustering, clusters with small sizes, i.e., containing fewer than N/10 clips, are filtered out, where N is the total number of detected clips in the video. Then we select the desired action cluster from the remaining clusters. As shown in Fig. 4, for an action clip A, the fewer preceding nearby action clips it has, the more likely it is a normal action. At the same time, the motion energy, i.e., the sum of motion vectors, of a normal action clip should be larger than that of a replay clip. Therefore, we consider two measurements for each clip to weight its probability of being a normal action as follows:

$$TemporalPriority(A) = \#\{B \mid T(B) < T(A),\ T(A) - T(B) < TG\} \qquad (9)$$

$$MotionEnergy(A) = \sum_{i=1}^{|A|} |A_i| \qquad (10)$$

where $A$ and $B$ are action clips, $A_i$ is the global motion of $A$, $T(X)$ is the timestamp of clip $X$, and $TG$ is a threshold, in terms of frames, indicating whether two clips are nearby. The weight of clip $A$ is the combination of the above two terms

$$(MotionEnergy(A))^{1/p} \cdot (1 + TemporalPriority(A))^{-q}. \qquad (11)$$

Then the weight of cluster $C_i$ is the average weight of all the action clips it contains. Finally, we select the desired action clip (i.e., highlight) cluster as the one with the largest weight. Undoubtedly, more complex weighting factors, such as statistical measurements of the segmented foreground objects in the clips, could be more powerful for selecting the desired action cluster. However, the presented weighting scheme has worked well on our experimental data set.
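The clip weighting of (9)–(11) can be sketched as follows; the clip representation (a dict with a start timestamp and a motion sequence) is an assumption of ours.

```python
import numpy as np

def clip_weight(clip, all_clips, TG, p=2, q=2):
    """Weight of one action clip, eqs. (9)-(11). Each clip is assumed to be
    a dict with 'start' (frame timestamp) and 'motion' (motion sequence)."""
    temporal_priority = sum(
        1 for b in all_clips
        if b['start'] < clip['start'] and clip['start'] - b['start'] < TG)
    motion_energy = float(np.sum(np.abs(clip['motion'])))       # eq. (10)
    return motion_energy ** (1.0 / p) * (1 + temporal_priority) ** (-q)

def cluster_weight(cluster, all_clips, TG):
    """Cluster weight = average weight of the clips it contains; the
    highlight cluster is the one with the largest weight."""
    return np.mean([clip_weight(c, all_clips, TG) for c in cluster])
```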


To summarize, our highlight detection method consists of three steps: 1) segment the video into action clips; 2) cluster the action clips; and 3) select the desired action cluster.

After the highlight clips are detected, they are stored into the video library to support quick browsing of the video.

IV. Player Body Shape Segmentation

For the middle- and fine-level analysis, we need to segment the player body shapes from the detected highlight clips. For diving video sequences, the non-rigid deformation of the player body, large camera motion, and cluttered background caused by referees or audiences bring great difficulty to this task. We have previously proposed an automatic object segmentation algorithm [29] that can deal with such problems. However, like other motion-based segmentation methods [25], this algorithm works poorly when the player has little motion between successive frames, which is often the case at the early stage of a diving action. To overcome this limitation, we propose an improved version of the algorithm in this paper. To be self-contained, we first introduce our previous algorithm from [29] (called Algorithm 1 hereafter) and then present the improved version (called Algorithm 2 hereafter).

A. Algorithm 1

The core of Algorithm 1 is to construct a statistical background image for each frame $I_k$ by registering multiple neighboring frames to $I_k$. The main steps of Algorithm 1 are summarized as follows.

1) Global Motion Estimation: Global motion estimation (proposed in Section II) and compensation are used to align neighboring frames to the coordinate system of the current frame.

2) Foreground Separation: If the adjacent frames are used directly to construct the background, ghost-like noise will appear in the constructed background. To eliminate such noise, foreground areas are separated from the images using three-frame differencing before the construction of the background.

3) Background Construction: For frame $I_k$, the $2L + 1$ consecutive frames $I_i$ $(i = k - L, \ldots, k + L)$ are used to construct its background image. $I_i$ is aligned to the coordinate system of $I_k$ using the estimated global motion parameters. The pixels $p_i$ in the $2L + 1$ frames corresponding to the same location $p$ of the current background are found, and an improved temporal median method is used to determine the value $B_k(p)$ of pixel $p$ in the constructed background

$$B_k(p) = \operatorname{median}_i(I_i(p_i)), \quad D_i(p_i) = 1 \qquad (12)$$

where $D_i(p_i)$ is the foreground mask of frame $I_i$ obtained from step 2).
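A compact NumPy sketch of the improved temporal median of (12), assuming the frames are already aligned by GME and the foreground masks come from the three-frame differencing of step 2):

```python
import numpy as np

def construct_background(aligned_frames, foreground_masks):
    """Per-pixel temporal median over the aligned frames, excluding
    samples flagged as foreground.
    aligned_frames:   H x W x 3 arrays warped to the current frame.
    foreground_masks: H x W boolean arrays (True = foreground)."""
    raw = np.stack([f.astype(float) for f in aligned_frames])
    masks = np.stack(foreground_masks)[..., None]       # broadcast over RGB
    samples = np.where(np.broadcast_to(masks, raw.shape), np.nan, raw)
    bg = np.nanmedian(samples, axis=0)
    # Pixels occluded in every frame fall back to the plain temporal median.
    return np.where(np.isnan(bg), np.median(raw, axis=0), bg)
```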
4) Object Segmentation: After the background of the current frame is constructed, the background subtraction method is applied and some post-processing steps are conducted to extract the moving object. These steps include: 1) a significance test [30] is used to decide the threshold for binarizing the difference image obtained by background subtraction; 2) connected component analysis is applied to the binarized image, and components smaller than 50% of the largest component are removed; and 3) the Snake model [31] is adopted to smooth each remaining component's boundary.

The merit of Algorithm 1 is that it adopts the foreground separation technique and improved temporal median filtering for robust background construction. Algorithm 1 is suitable for segmenting moving objects from dynamic background videos, but it works well only when the object has apparent motion; when the motion is slight it will fail (see Fig. 5).

Fig. 5. Segmentation results of Algorithm 1. Top row: original frames; middle row: constructed backgrounds; bottom row: segmented foreground objects.
B. Algorithm 2

The limitation of Algorithm 1 is that it does not consider the object motion between frames when selecting frames to construct the background. This leads to two problems.

1) First, it uses consecutive neighboring frames. When the object motion is small, many of the consecutive frames are similar and contribute nothing new to the background construction, but they degrade the performance of the median filter, because some background areas are occluded by the object in most of the frames.
2) Second, it uses a constant number of frames. When the object motion is large enough, fewer frames suffice for the background construction, so it costs extra computation.

In Algorithm 2 we take the object motion between frames into account. The idea is to select only frames with apparent object motion to construct the background image. There are many ways to estimate the magnitude of object motion, such as the difference of two aligned frames. In this paper, motivated by the observation that the camera is always focusing on the player to capture his/her movements, we use the global motion between frames as the measurement of object motion: when the player is moving, the camera also moves. Therefore, we can select the frames having salient global motion (larger than Th1), called key-frames, instead of consecutive neighboring frames to build the background. Meanwhile, when the cumulative global motion exceeds a given threshold Th2, no more frames are needed.

Algorithm 2 first decides all the key-frames of a clip as follows.
1) Neighboring frames with global motion larger than Th1 are selected as key-frames.
2) For the remaining frames, if a frame's cumulative motion to the nearest key-frame exceeds Th1, it is selected as a key-frame.

Then, for the current frame $I_k$, we select $L_1$ and $L_2$ consecutive neighboring key-frames from the left and right of $I_k$, respectively, to build its background, where

$$L_i = \min(L,\ \arg\min_J\ CM_i(k, J) \ge Th2) \qquad (13)$$

and $CM_i(k, J)$ is the cumulative global motion from $I_k$ to the $J$th key-frame on the left/right of $I_k$.

In Algorithm 2, the frames and the number of frames used to construct the background image for each frame are adaptive to the object motion; thus it is more robust and serves as a general object segmentation method for dynamic background videos. The results of Algorithm 2 will be reported in Section VII.
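The key-frame selection and the adaptive frame choice of (13) might look as follows. This is a sketch under our reading; gm[i] denotes the global-motion magnitude between frames i−1 and i.

```python
def select_keyframes(gm, th1):
    """Key-frame selection of Algorithm 2: salient motion, or enough
    accumulated motion since the last key-frame (gm[0] = 0)."""
    keys, cum = [0], 0.0
    for i in range(1, len(gm)):
        cum += gm[i]
        if gm[i] > th1 or cum > th1:
            keys.append(i)
            cum = 0.0
    return keys

def frames_for_background(k, keys, gm, L, th2):
    """Pick up to L key-frames on each side of frame k, stopping once the
    cumulative global motion CM(k, J) exceeds Th2 (eq. (13))."""
    left = [j for j in keys if j < k][::-1]       # nearest first
    right = [j for j in keys if j > k]
    chosen = []
    for side in (left, right):
        taken = 0
        for j in side:
            lo, hi = min(j, k), max(j, k)
            cum = sum(gm[lo + 1:hi + 1])          # cumulative motion k -> j
            chosen.append(j)
            taken += 1
            if taken >= L or cum >= th2:
                break
    return sorted(chosen)
```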


V. Action Recognition Using Hidden Markov Models

Recognizing the player actions is in demand for action-critical sports video analysis. Here action recognition serves two aims: the first is to index the highlight clips to enable action-based video retrieval, and the second is to bootstrap the human body tracking in the later kinematic analysis. For diving videos, the segmented body shapes of the player and the shape transitions represent different actions. Hu moments [32] of the shape are used as shape descriptors. For robustness, we adopt only the low-order moments, that is, the first two Hu moments (denoted by Hu1 and Hu2). However, Hu moments are rotation-invariant, and for sports such as gymnastics or diving, the same posture at different orientations must be considered a different action. To make the descriptor rotation-sensitive and to encode local shape variations, the body shape is further divided into four subshapes at the mass center (Fig. 6), and the Hu1 of each subshape is calculated. Finally, a 7-D shape descriptor is obtained, comprising the Hu1 and Hu2 of the whole body shape, the aspect ratio of the body's bounding box, and the four Hu1s of the subshapes. This compact descriptor is robust to segmentation noise and discriminative enough to distinguish different postures.

Fig. 6. Two body shapes and the distributions of their shape descriptors.
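A sketch of the 7-D descriptor using OpenCV's moment routines (cv2.moments, cv2.HuMoments); the exact normalization used by the authors is not specified, so treat this as illustrative only.

```python
import cv2
import numpy as np

def shape_descriptor(mask):
    """7-D posture descriptor: Hu1 and Hu2 of the whole body shape, the
    bounding-box aspect ratio, and Hu1 of the four subshapes obtained by
    splitting at the mass center.
    mask: binary (0/255) uint8 image of the segmented body shape."""
    m = cv2.moments(mask, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()
    cx, cy = int(m['m10'] / m['m00']), int(m['m01'] / m['m00'])
    x, y, w, h = cv2.boundingRect(mask)
    quadrants = [mask[:cy, :cx], mask[:cy, cx:],
                 mask[cy:, :cx], mask[cy:, cx:]]
    sub_hu1 = [cv2.HuMoments(cv2.moments(q, binaryImage=True)).flatten()[0]
               for q in quadrants]
    return np.array([hu[0], hu[1], w / float(h)] + sub_hu1)
```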
The HMM is adopted as the classifier since it has been shown to be a powerful tool for sequential pattern recognition [4], [12]. We use continuous hidden Markov models (CHMMs) with left-right topology. Each CHMM is trained with the Baum–Welch algorithm. Given an observation sequence, the Viterbi algorithm is used to calculate each model's output probability, and the model with the maximum probability gives the recognition result.
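A minimal classification sketch using the third-party hmmlearn package as a stand-in for the authors' CHMM implementation; the left-right transition constraint is omitted here for brevity.

```python
import numpy as np
from hmmlearn import hmm

def train_action_models(samples_per_action, n_states=25, n_mix=3):
    """Train one continuous HMM per action from lists of descriptor
    sequences (each sequence: T x 7 array of shape descriptors)."""
    models = {}
    for action, seqs in samples_per_action.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type='diag')
        model.fit(X, lengths)                # Baum-Welch training
        models[action] = model
    return models

def recognize(models, seq):
    """Score a test sequence under every model; the most likely model
    gives the action label, and Viterbi decoding gives the hidden states
    later used to initialize the body model."""
    best = max(models, key=lambda a: models[a].score(seq))
    _, states = models[best].decode(seq, algorithm='viterbi')
    return best, states
```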
VI. The Analysis of Action

Video-based analysis of player actions or tactics is becoming an important tool for sports training, but it usually requires many hours of manual work. In this section, we present methods for automatic analysis of individual player action.

A. Kinematic Analysis

Kinematic information is very useful for coaches to instruct players more scientifically. For example, for a diving action, the hip angle and knee angle in the take-off period are two critical kinematic parameters, since the extents of flexion of the hip and knee decide the height of the dive. Generally, a higher dive means a longer time in the air for the diver to hold the dive position longer and more time to prepare for entry. A long flight time normally results in an aesthetically pleasing dive and leads to a high score. However, manually obtaining such kinematic parameters is time-consuming and error-prone. In this paper, we present an automatic method, based on 2-D articulated human body model fitting, to obtain the joint angles. The automation is realized by transferring the initial model parameters from the recognized action templates.

Fitting an articulated human body model to image cues is a popular approach to obtaining posture information such as body joint angles or joint locations [11], [33]. In our case, we adopt a seven-degrees-of-freedom scaled prismatic model [11], represented as $S = (x, y, \theta, \theta_1, \theta_2, \theta_3, d)$. In $S$, $(x, y, \theta)$ are the global position (i.e., the hip joint) and rotation parameters, $(\theta_1, \theta_2, \theta_3)$ are the neck angle, hip angle, and knee angle, respectively, and $d$ is the scale parameter.

The limitations of model-based human tracking are that it relies on manual initialization of the model parameters before the tracking starts, and that it often loses track due to accumulated error. Our approach circumvents these two disadvantages through action recognition. This is done by manually labeling the 2-D joint locations of the training sample shapes for each action in the training stage. In the testing stage, when a test sequence is recognized, the corresponding hidden states are also decoded using the Viterbi algorithm, and thus the initial model parameters are transferred from the sample shapes of these states. The global position, that is, the hip joint, is located at the mass center of the segmented body shape. Once initialized, the model parameters are refined by searching with the well-known annealed particle filtering algorithm [34]. Since the initial parameters for each body shape are obtained from the recognized action templates independently, our approach does not suffer from losing track.

When matching the body model to the image, a multi-cue observation model that considers both shape and foreground area is adopted to make the matching robust and efficient. For shape, we use the distance transform [35] as the similarity measurement between image edge features and model edge features. The image edge map is first filtered by removing the edges outside the bounding rectangle of the predicted body and the edges inside the foreground area. Let $\bar{d}$ be the mean cost over all the model edges; the shape matching score for model hypothesis $x_i$ is

$$p(z_{shape} \mid x_i) = e^{-\lambda_1 \bar{d}}. \qquad (14)$$

For foreground, we first calculate the overlapping rate $r$ between the model and the foreground area, and then the foreground matching score is defined as

$$p(z_{foreground} \mid x_i) = e^{\lambda_2 r} \qquad (15)$$

where $\lambda_1$ and $\lambda_2$ are two constants, set to 10 and 1, respectively, in our experiments.

The total matching score is

$$p(z \mid x_i) = p(z_{shape} \mid x_i) \cdot p(z_{foreground} \mid x_i). \qquad (16)$$

The human body model and an example of fitting are shown in Fig. 7.
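Equations (14)–(16) can be sketched with OpenCV's distance transform as follows; the input conventions (binary uint8 masks, a list of model edge pixels) are our assumptions.

```python
import cv2
import numpy as np

def matching_score(edge_map, model_edge_points, model_mask, fg_mask,
                   lam1=10.0, lam2=1.0):
    """Multi-cue observation score of eqs. (14)-(16) for one hypothesis."""
    # Distance of every pixel to the nearest image edge: invert the edge
    # map so that edge pixels become zeros for the distance transform.
    dist = cv2.distanceTransform(255 - edge_map, cv2.DIST_L2, 5)
    d_mean = float(np.mean([dist[y, x] for x, y in model_edge_points]))
    p_shape = np.exp(-lam1 * d_mean)                       # eq. (14)

    overlap = np.logical_and(model_mask > 0, fg_mask > 0).sum()
    r = overlap / float(max((model_mask > 0).sum(), 1))    # overlapping rate
    p_fg = np.exp(lam2 * r)                                # eq. (15)
    return p_shape * p_fg                                  # eq. (16)
```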


Fig. 7. Human body model and an example of a model fitting result. (a) Human body model. (b) Test body shape. (c) Original frame. (d) Canny edge map. (e) Filtered edge map. (f) Distance transform map. (g) One initialized model. (h) Final fitted model.

B. Visual Analysis

With the enabling techniques provided above, namely global motion estimation, accurate player body segmentation, and action recognition, we present two visual sports training tools that allow coaches and players to review and compare performances in an easy way to enhance coaching strategies.

1) Motion Panorama: A motion panorama provides an efficient and compact representation of the underlying video by constructing a single image from a subset of the frames of a video. It can show the global background as well as the foreground player bodies, and it reveals the moving trajectory and action details more directly than a video. The panorama has proved to be a powerful tool for reviewing and evaluating the performance of player actions [36].

After estimating the global motion parameters, a background panorama is built first. The frames of an action clip are aligned to the coordinate system of the first frame, which is also selected as the world coordinate system of the panorama. Then, for the overlapping regions, the temporal median filtering technique is used to construct the background by selecting the median RGB value.

Using the global motion and time intervals as criteria, some key-frames are selected automatically; a manual selection function is also provided in the system. The segmented foregrounds of these key-frames are mapped into the same world coordinate system as the background panorama to form a foreground panorama. For a better visual effect, the overlapped foreground regions are fused by alpha blending.

Finally, the resulting motion panorama is created by covering the background panorama with the effective regions of the foreground panorama. An example of the panoramas for a platform diving clip is shown in Fig. 8.

Fig. 8. (a) Foreground panorama. (b) Background panorama. (c) Motion panorama.
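A sketch of the panorama construction, assuming accumulated 2 × 3 affine warps from GME and H × W × 3 frames; the alpha blending of overlapped foregrounds is omitted and key-frames are simply pasted in order.

```python
import cv2
import numpy as np

def motion_panorama(frames, affines, fg_masks, key_ids, size):
    """Warp every frame into the coordinate system of the first frame,
    median-blend the backgrounds, then paste the segmented foregrounds
    of the chosen key-frames on top.
    affines[i]: 2x3 float32 matrix mapping frame i into frame 0;
    size: (width, height) of the panorama canvas."""
    warped, masks = [], []
    for img, A, fg in zip(frames, affines, fg_masks):
        warped.append(cv2.warpAffine(img, A, size).astype(float))
        masks.append(cv2.warpAffine(fg, A, size) > 0)
    stack = np.stack(warped)
    fg_stack = np.stack(masks)[..., None].repeat(stack.shape[-1], -1)
    stack[fg_stack] = np.nan                    # exclude foreground samples
    pano = np.nanmedian(stack, axis=0)          # background panorama
    for i in key_ids:                           # foreground panorama overlay
        pano[masks[i]] = warped[i][masks[i]]
    return np.nan_to_num(pano).astype(np.uint8)
```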


VII. Experimental Results

To evaluate the effectiveness and general versatility of the proposed system and algorithms, we conduct experiments on the analysis of two kinds of sports game videos: diving videos and jump videos. We also compare the performance of our algorithms with some competitive algorithms. All the videos are digitized in MPEG-1 format.

A. Experiments on Diving Videos

In this section, diving game videos are used as a case study to demonstrate and evaluate the effectiveness of the whole system and the algorithms. The diving video set is about 8 h long and includes six platform videos and two springboard videos.

1) Global Motion Estimation: We randomly crop 20 action clips, each having ten instances, from the springboard and platform diving videos to test the performance of the proposed GME algorithm. Two state-of-the-art robust estimation methods, the RANdom SAmple Consensus (RANSAC) algorithm [38] and the Least Trimmed Squares (LTS) algorithm [39], are implemented for comparison. We also implemented improved versions of RANSAC and LTS according to [25], which we denote as i-RANSAC and i-LTS, respectively. The difference between the original and improved RANSAC and LTS is that, in the improved versions, refinement procedures are applied to each sampled subset to improve the accuracy of the motion model by iteratively covering more inliers. The thresholds are set as in [25]: ErrThresh = 1.5 and maxIteration = 50.

For evaluation we use the interframe transformation fidelity (ITF) measure [40]

$$ITF = \frac{1}{N - 1} \sum_{k=2}^{N} PSNR(I_k, I_{k-1}^{comp}) \qquad (17)$$

$$PSNR(I_k, I_{k-1}^{comp}) = 10 \log_{10} \frac{255^2}{MSE(I_k, I_{k-1}^{comp})} \qquad (18)$$

where $I_{k-1}^{comp}$ is the image aligned to $I_k$ by global motion compensation of $I_{k-1}$, and $MSE(I_k, I_{k-1}^{comp})$ is the mean square error between $I_k$ and $I_{k-1}^{comp}$.

Because there is obvious player motion in the testing clips, we manually label the bounding box of the player for each frame, and the residual pixels inside the box are excluded from the mean square error (MSE) computation. Similarly, the residuals close to the borders of the image are also excluded.
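The ITF of (17)–(18) with the player-box and border exclusion can be computed as in the following sketch (our naming):

```python
import numpy as np

def itf(frames, compensated, exclude_masks):
    """Interframe transformation fidelity, eqs. (17)-(18).
    compensated[k]: frame k-1 aligned to frame k with the estimated GM;
    exclude_masks[k]: True at pixels left out of the MSE (player box
    and image borders)."""
    psnrs = []
    for k in range(1, len(frames)):
        diff = frames[k].astype(float) - compensated[k].astype(float)
        valid = ~exclude_masks[k]
        mse = np.mean(diff[valid] ** 2)
        psnrs.append(10 * np.log10(255.0 ** 2 / mse))
    return np.mean(psnrs)
```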
The average ITFs of the compared algorithms on the 20 clips are tabulated in Table I. Here "Proposed" denotes the proposed GME method of Section II, while "Fisher" denotes the proposed GME method without the refinement step 4.

TABLE I
Average ITFs (dB) for Six GME Algorithms on Diving Video

Proposed   i-RANSAC   i-LTS   Fisher   RANSAC   LTS
44.17      44.32      43.51   43.12    43.19    42.23

It can be seen that the proposed algorithm achieves performance comparable to i-RANSAC and i-LTS: better than i-LTS and slightly worse than i-RANSAC. Another observation is that the improved versions of the three algorithms all outperform their respective original versions, showing that the refinement process, which iteratively enlarges the set of inliers so that the parameter estimation is based on more input data, results in a more accurate estimation.

For diving videos, even though there is some camera zooming, the dominant global motion is translation caused by camera tilting up/down. This fact motivates us to filter outliers progressively. The idea is to first use the translation model, i.e., the 2-parameter model, to coarsely filter out most of the outliers, and then use the affine model to finely filter out the rest. This is implemented simply by adopting the translation model for the first two iterations (i.e., Iteration = 1, 2) of step 3 in the proposed GME method. Similar implementations are applied to i-RANSAC and i-LTS. The experimental results (see Table II) validate the effectiveness of the progressive filtering strategy.

TABLE II
Average ITFs (dB) for Three GME Algorithms With Progressive Filtering on Diving Video

Proposed   i-RANSAC   i-LTS
44.50      44.52      44.26

Actually, this progressive filtering strategy for GME is practical not only for diving videos, but also for a broad category of individual sports videos, such as running, jumps, gymnastics, and so on. We will see in Section VII-B1 that for more challenging videos, the improvement is more noticeable.

2) Highlights (Action Clips) Detection: There are seven threshold parameters in the proposed highlight detection method. We conducted comprehensive tests with different values for the threshold combinations (see Table III), and the results show that most of the thresholds, such as pThresh, mThresh, TG, p, and q, have little effect on the detection performance, while others, such as L and r, have slight effects on the detection results.

TABLE III
Threshold Parameters and Their Ranges for the Detection Method

Parameter   Range
pThresh     4–7
TG          600–1200
mThresh     1.5–3.0
L           15–30
r           0.15–0.3
p           2–4
q           2–4

Parameters L and r are related to the segmentation of action clips, which determines the start and end points of a clip. When L is set to a larger value and r to a smaller value, such as 30 and 0.15, then in cases where shot cut detection fails, nearby large global motion caused by shot transitions will lead to errors in action clip segmentation. On the other hand, if L is set to a smaller value and r to a larger value, such as 15 and 0.25, some normal action clips may be separated into small incomplete clips. From the experiments on the testing video data, threshold L has more impact on the performance than r. We fix the parameters pThresh, TG, mThresh, r, p, and q as 6, 900, 2.0, 0.2, 2, and 2, respectively, and set L to 15, 20, and 25; the detection results are shown in Table IV.

We can see that the detection method is very effective and achieves high performance. The errors are partly caused by segmentation noise, i.e., over-segmentation and under-segmentation.


Some errors are attributed to variance in motion patterns. Though the desired actions have similar global motion patterns, there are still differences to some extent. For example, some kinds of springboard action need a strong bounce to produce a higher dive, while others do not. In such cases, the DTW may produce unreliable distance measurements, so more elastic matching methods need to be investigated in the future to improve the accuracy of the distance measurements. In some cases, when the normal actions are wrongly classified, the following replayed actions have a chance of being grouped into the normal action cluster, resulting in false positives.

We also compare our highlight detection method with the method proposed in [1], which used logo detection, camera motion analysis, and a grammar-based parser to detect the "dive" action. The recall and precision of [1] on our testing data are 93.1% and 92.3%, respectively. Clearly, our method achieves higher performance.
3) Player Body Segmentation: The improved algorithm, i.e., Algorithm 2, is applied to segment the player body from the action clips detected in Section III. Compared with Algorithm 1 (see Fig. 5), it is more robust and works well for the entire clip (see Fig. 10). Figs. 11 and 12 show more segmentation results of Algorithm 2 for the two types of diving: platform diving and springboard diving. The values of Th1 and Th2 are set to 4 and 50 for images of 352 × 288 resolution, and L is set to 11.

Fig. 10. Segmentation results for platform diving (the same clip as in Fig. 5). Top row: original frames; middle row: constructed backgrounds; bottom row: segmented body shapes.

Fig. 11. Segmentation results for another platform diving clip.

Fig. 12. Segmentation results for a springboard diving clip.

4) Action Recognition: Before action recognition, we first give some background knowledge of dive actions. Each dive is represented by a code of the form cn1n2n3p. The meaning of each part is as follows. c is the category number, with values from 1 to 6 representing the six types of dive: forward, backward, reverse, inward, twisting, and armstand. n1 is used only for categories 5 and 6, and encodes the diver's facing and moving directions with 1 to 4, corresponding to forward, backward, reverse, and inward, respectively; in categories 1 to 4, n1 is absorbed into c. n2 denotes how many rotations, or somersaults, are performed in the dive, in half cycles; n3 describes the number of twists, in half cycles; and p describes the diver's position in the air. The positions and their codes are: straight–A, pike–B, tuck–C, free–D. Hence, 305C denotes a reverse 2½-somersault dive in the tuck position, and 6245D is a back armstand dive with two somersaults and 2½ twists. Platform diving includes all six categories of dive, while springboard diving includes categories 1–5. In this paper, we take platform diving as the test data for action recognition.
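For indexing purposes, the dive-code convention above can be decoded mechanically. In the sketch below, the handling of the middle digit in three-digit codes follows the examples in the text (e.g., 305C, 107B) and is otherwise our own inference.

```python
def parse_dive_code(code):
    """Decode a dive code of the form c n1 n2 n3 p, e.g., '305C' or
    '6245D'; assumes well-formed codes as listed in Table V."""
    categories = {'1': 'forward', '2': 'backward', '3': 'reverse',
                  '4': 'inward', '5': 'twisting', '6': 'armstand'}
    positions = {'A': 'straight', 'B': 'pike', 'C': 'tuck', 'D': 'free'}
    digits, p = code[:-1], code[-1]
    info = {'category': categories[digits[0]], 'position': positions[p]}
    if digits[0] in '56':                 # n1: facing/moving direction
        info['direction'] = categories[digits[1]]
        info['somersaults'] = int(digits[2]) / 2.0   # n2, in half cycles
        info['twists'] = int(digits[3]) / 2.0        # n3, in half cycles
    else:                                 # categories 1-4: n1 absorbed in c
        info['somersaults'] = int(digits[-1]) / 2.0
    return info

# parse_dive_code('305C')  -> reverse, 2.5 somersaults, tuck
# parse_dive_code('6245D') -> armstand, back, 2 somersaults, 2.5 twists
```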


TABLE IV
Highlights Detection Results

Videos            Count of      L = 15              L = 20              L = 25
                  Action Clips  Recall   Precision  Recall   Precision  Recall   Precision
3 m Springboard   120           97.5%    99.0%      99.0%    100%       96.7%    99.0%
10 m Platform     396           97.7%    99.2%      98.2%    99.5%      97.4%    98.9%
Total             516           97.6%    99.2%      98.4%    99.6%      97.2%    99.0%

For all 21 actions in the six platform videos, 13 actions are selected for the experiments, since the remaining eight actions have too few samples for training. The 13 actions occupy nearly 90% of the six videos. We randomly select 2/3 of these action clips for training and the remaining 1/3 for testing, and run the experiment three times. In our experiments, the CHMM classifiers give the best results when 25 states and three Gaussians per state are used. The recognition results are summarized in Table V.

TABLE V
Action Recognition Results

No.    Action Code   Action Count   Recognized   Recognition Rate
1      107B          56             53           94.6%
2      205B          19             16           84.2%
3      207B          22             19           86.3%
4      207C          36             30           83.3%
5      305C          21             16           76.2%
6      307C          32             23           71.9%
7      405B          14             11           78.8%
8      407C          41             33           82.5%
9      5251B         12             12           100%
10     5253B         32             29           90.6%
11     6241B         18             16           88.9%
12     626B          18             15           83.3%
13     626C          21             17           80.1%
Total                342            290          84.8%

The total accuracy of action recognition is around 85%, and the results are encouraging. However, since the HMM is a generative model, its discriminative power may be depressed by alike action pairs such as 305C and 307C, or 207B and 207C. When an action is discriminative, the accuracy is higher. This is the case for 107B, 5251B, and 5253B: 107B has a distinct pattern of running on the platform, while 5251B and 5253B have distinct twisting patterns but different numbers of twists.

5) Kinematic Analysis: Human body model fitting is conducted to obtain the player's kinematic information, that is, the hip angle and knee angle in the take-off period. Four kinds of actions, 205B, 305C, 307C, and 5251B, each with ten instances, are used to test the algorithm. The main body joints (i.e., head, shoulder, hip, knee, and foot) of the training shapes of these actions in the take-off period are labeled manually with a friendly graphical user interface, and the body model parameters are then computed. When the HMM for an action is trained, we calculate the mean body model parameters for each Gaussian of the hidden states. In testing, the HMM states are decoded using the Viterbi algorithm, and thus the initial body model parameters for each frame are obtained. Finally, annealed particle filtering with 500 particles and three layers is used to refine the model parameters.

To evaluate the effect of automatic estimation of the initial model parameters from the HMM states, we also conduct tracking experiments with manual initialization of the model for the first frame (denoted as "manual" in this section). With the same settings, i.e., 500 particles and three layers, our method successfully tracks all the clips over the take-off period, while the manual method loses track after seven to ten frames for some of the clips. After increasing the particle number to 1000, the manual method tracks all the clips well. The errors in hip angle and knee angle between the manually labeled and automatically tracked results, for the proposed method with 500 particles and the manual method with 1000 particles, are tabulated in Table VI. We can see that the automatic method has smaller errors than the manual method. This shows that by automatically transferring the initial parameters for each frame from the HMM states, we not only improve the robustness of tracking, but also improve the accuracy of the tracking results.

Some model fitting examples for action 307C are illustrated in Fig. 13.

Fig. 13. Human model fitting results for action 307C.

6) Visual Analysis: The presented visual analysis tools, motion panorama and overlay composition, are very suitable for individual sports action analysis. Here we show more results for motion panorama and overlay composition in Figs. 14 and 15, respectively.

Fig. 14. Motion panoramas for two springboard diving clips.

B. Experiments on Jump Videos

To demonstrate the general versatility of the proposed algorithms, we conduct experiments on two types of jump videos: long jump and high jump videos. In this section, two algorithms, the GME algorithm and the object segmentation algorithm, are tested, because these two algorithms are the critical steps for the analysis of sports videos with camera work. The jump video set includes 21 long jump clips and 18 high jump clips.

1) Global Motion Estimation: Compared with diving videos, these two kinds of jump videos are more challenging because they have relatively higher outlier ratios (∼41.2%) than diving videos (∼32.4%), which brings more difficulty to GME.


TABLE VI
Errors for Hip Angle and Knee Angle

Automatic Method Manual Method


Mean Error Max Error Min Error Mean Error Max Error Min Error
Hip angle (degree) 1.32 5.35 0.06 1.45 5.76 0.05
Knee angle (degree) 0.88 5.71 0.04 1.07 6.33 0.07

TABLE VII
Average ITFs (db) for Three GME Algorithms on Jump Video

Proposed i-RANSAC i-LTS


43.13 43.52 41.14

TABLE VIII
Average ITFs (db) for Three GME Algorithms With Progressive
Filtering on Jump Video

Proposed i-RANSAC i-LTS


46.59 46.05 44.92

Fig. 15. Overlay composition for action 205B of two divers.

Fig. 16. GME results on long jump video. From left to right: original frames, estimated motion vectors, and difference images. Although some inlier motion vectors are missed, all selected inliers (white) belong to the background motion; consequently, the global motion is correctly estimated.

Table VII gives the ITFs for the three GME algorithms: Proposed, i-RANSAC, and i-LTS. Again, the proposed method
achieves performance competitive with that of i-RANSAC. We apply the progressive filtering strategy to the jump videos, and the comparative results are given in Table VIII. The results show that this strategy yields about 3 dB improvement in average ITF, revealing that progressive filtering is more effective for videos with higher outlier ratios. We can also notice that our proposed method outperforms i-RANSAC in this test. Intrinsically, our GME method is a process of "iteratively eliminating outliers" followed by a single pass of "increasing inliers" (i.e., the refinement step), so the refinement has very little chance of introducing false inliers. In contrast, i-RANSAC and i-LTS are processes of "random sampling" followed by "iteratively increasing inliers" (their refinement step), which means that although the refinement can improve the estimated model parameters by fitting more inliers, it may also gradually admit wrong inliers.
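The sketch below illustrates the "iteratively eliminating outliers" idea with a simple 2-D affine model fitted to block motion vectors by least squares; the drop fraction, residual threshold, and single refinement pass are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def fit_affine(pts, vecs):
    """Least-squares 2-D affine motion model: v = [x, y, 1] @ A, A is 3x2."""
    X = np.hstack([pts, np.ones((len(pts), 1))])            # (N, 3)
    A, *_ = np.linalg.lstsq(X, vecs, rcond=None)            # (3, 2)
    return A

def gme_eliminate_outliers(pts, vecs, drop_frac=0.1, n_iter=8, tol=0.5):
    """Iteratively fit on the current inlier set and drop the `drop_frac`
    fraction of vectors with the largest residuals; then one refinement
    pass re-admits vectors whose residual falls below `tol` (pixels)."""
    idx = np.arange(len(pts))
    for _ in range(n_iter):
        A = fit_affine(pts[idx], vecs[idx])
        res = np.linalg.norm(
            np.hstack([pts[idx], np.ones((len(idx), 1))]) @ A - vecs[idx],
            axis=1)
        keep = res <= np.quantile(res, 1.0 - drop_frac)
        if keep.all():
            break
        idx = idx[keep]
    # One-pass "increasing inliers": grow the set with well-fitting vectors.
    A = fit_affine(pts[idx], vecs[idx])
    res_all = np.linalg.norm(
        np.hstack([pts, np.ones((len(pts), 1))]) @ A - vecs, axis=1)
    inliers = res_all < tol
    return fit_affine(pts[inliers], vecs[inliers]), inliers
```

Because the inlier set only shrinks during the iterations and grows just once at the end, a single bad re-admission cannot propagate, which is the intuition behind the robustness argument above.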
Some results of the proposed GME method on a long jump video are shown in Fig. 16.


2) Player Body Segmentation: The proposed object segmentation algorithm is applied to the jump videos to extract player body shapes. Figs. 17 and 18 give some segmentation results on a long jump video and a high jump video, respectively. Note that in both videos, besides the player, there are also some moving spectators. The results show that our algorithm works well in the presence of other moving objects, provided their motion is slower than that of the player.

Fig. 17. Segmentation results on long jump video.
Fig. 18. Segmentation results on high jump video.
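As a sketch of why slower distractors are tolerated, the snippet below builds an adaptive background by running-average update on globally aligned frames: slow-moving objects are gradually absorbed into the background, while the fast-moving player remains a strong foreground difference. The update rate, threshold, and morphology are illustrative assumptions, not the paper's exact algorithm.

```python
import cv2
import numpy as np

def segment_player(aligned_frames, alpha=0.2, thresh=25):
    """aligned_frames: grayscale frames already warped into a common
    coordinate system by the estimated global motion. Returns one binary
    foreground mask per frame (after the first)."""
    bg = aligned_frames[0].astype(np.float32)
    masks = []
    for f in aligned_frames[1:]:
        diff = cv2.absdiff(f.astype(np.float32), bg)
        mask = (diff > thresh).astype(np.uint8) * 255
        # Remove small speckles so the dominant blob (the player) remains.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                                np.ones((3, 3), np.uint8))
        bg = (1 - alpha) * bg + alpha * f     # adaptive background update
        masks.append(mask)
    return masks
```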
VIII. Conclusion and Future Work vision-based human motion capture and analysis,” Comput. Vision Image
In this paper, we have presented a system for the automatic detection, recognition, and analysis of player actions in moving background sports video, for high-level content-based video indexing/retrieval and training purposes. All the steps are performed automatically and integrated into a comprehensive framework applicable to sports video analysis. Various key techniques and algorithms are proposed to fulfill this task.

The framework and algorithms are tested on broadcast diving videos and jump videos. The extensive experiments show the effectiveness of the proposed system. However, there is still room for improvement. In highlight detection, the caption in the video that presents the player profile while he/she is preparing to perform is an important cue for identifying an action, and we expect an improvement from integrating such information in the future. For action recognition, we are considering building a hybrid classifier that combines HMMs with main body features, such as head tracking and shape analysis, to improve the classifier's discriminating capability and to recognize actions with small training samples. Meanwhile, further work needs to be done to automatically analyze more kinematic measurements for training instruction.
Acknowledgment
The authors would like to thank the anonymous reviewers for their valuable comments that helped to improve this paper. They would also like to thank Dr. S. Tang, Dr. R. Hong, and Dr. K. Tao for their useful suggestions.
References

[1] F. Wang, J.-T. Li, Y.-D. Zhang, and S.-X. Lin, "Semantic and structural analysis of TV diving programs," J. Comput. Sci. Technol., vol. 19, no. 6, pp. 928–935, Nov. 2004.
[2] L. X. Xie, P. Xu, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with domain knowledge and hidden Markov models," Pattern Recognit. Lett., vol. 25, no. 7, pp. 767–775, May 2004.
[3] C.-C. Lien, C.-L. Chiang, and C.-H. Lee, "Scene-based event detection for baseball videos," J. Visual Commun. Image Representation, vol. 18, no. 1, pp. 1–14, Feb. 2007.
[4] G. Xu, Y.-F. Ma, H.-J. Zhang, and S.-Q. Yang, "An HMM-based framework for video semantic analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11, pp. 1422–1433, Nov. 2005.
[5] G. Pingali, Y. Jean, A. Opalach, and I. Carlbom, "LucentVision: Converting real world events into multimedia experiences," in Proc. IEEE Int. Conf. Multimedia Expo, 2000, pp. 1433–1436.
[6] X. G. Yu, N. J. Jiang, L.-F. Cheong, H. W. Leong, and X. Yan, "Automatic camera calibration of broadcast tennis video with applications to 3-D virtual content insertion and ball detection and tracking," Comput. Vision Image Understand., vol. 113, no. 5, pp. 643–652, May 2009.
[7] L.-Y. Duan, M. Xu, T.-S. Chua, Q. Tian, and C.-S. Xu, "A mid-level representation framework for semantic sports video analysis," in Proc. ACM Conf. Multimedia, 2003, pp. 33–44.
[8] H. Miyamori and S. Iisaku, "Video annotation for content-based retrieval using human behavior analysis and domain knowledge," in Proc. Automat. Face Gesture Recognition, 2000, pp. 320–325.
[9] G. Y. Zhu, Q. M. Huang, C. S. Xu, L. Xing, W. Gao, and H. Yao, "Human behavior analysis for highlight ranking in broadcast racket sports video," IEEE Trans. Multimedia, vol. 9, no. 6, pp. 1167–1182, Oct. 2007.
[10] R. Poppe, "Vision-based human motion analysis: An overview," Comput. Vision Image Understand., vol. 108, nos. 1–2, pp. 4–18, Oct. 2007.
[11] T. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Comput. Vision Image Understand., vol. 104, nos. 2–3, pp. 90–126, Nov. 2006.
[12] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. Comput. Vision Pattern Recognit., 1992, pp. 379–385.
[13] J. Sullivan and S. Carlsson, "Recognizing and tracking human action," in Proc. Eur. Conf. Comput. Vision (Lecture Notes in Computer Science 2350), 2002, pp. 629–644.
[14] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. Int. Conf. Comput. Vision, 2003, pp. 726–733.
[15] W.-L. Lu, K. Okuma, and J. J. Little, "Tracking and recognizing actions of multiple hockey players using the boosted particle filter," Image Vision Comput., vol. 27, no. 1, pp. 189–205, Jan. 2009.
[16] P. Figueroa, N. Leite, and R. Barros, "Tracking soccer players aiming their kinematical motion analysis," Comput. Vision Image Understand., vol. 101, no. 2, pp. 122–135, Feb. 2006.
[17] J. R. Wang and N. Parameswaran, "Detecting tactics patterns for archiving tennis video clips," in Proc. Int. Symp. Multimedia Softw. Eng., 2004, pp. 186–192.
[18] A. W. B. Smith and B. C. Lovell, "Visual tracking for sports applications," in Proc. Workshop Digital Image Comput., 2005, pp. 79–84.
[19] R. Urtasun, D. J. Fleet, and P. Fua, "Monocular 3-D tracking of the golf swing," in Proc. Comput. Vision Pattern Recognit., 2005, pp. 932–938.
[20] R. Cassel, C. Collet, and R. Gherbi, "Real-time acrobatic gesture analysis," in Proc. Gesture Workshop, 2005, pp. 88–99.
[21] J. Konrad and F. Dufaux, "Improved global motion estimation for N3," Digital Equipment Corporation, in Meeting of ISO/IEC/SC29/WG11, no. MPEG97/M3096, San Jose, CA, 1998.
[22] B. Qi and A. Amer, "Robust and fast global motion estimation oriented to video object segmentation," in Proc. IEEE Int. Conf. Image Process., 2005, pp. 153–156.
[23] T. Chiew, C. How, D. Bull, and C. Canagarajah, "Rapid block-based global motion estimation and its applications," in Proc. Int. Conf. Consumer Electron., 2002, pp. 228–229.
[24] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, Jan. 1979.
[25] D. Farin, "Automatic video segmentation employing object/camera modeling techniques," Ph.D. dissertation, CIP-Data Library, Tech. Univ. Eindhoven, Eindhoven, The Netherlands, 2005.
[26] T. S. Chua, H. Feng, and A. Chandrashekhara, "A unified framework for shot boundary detection via active learning," in Proc. Int. Conf. Acoust. Speech Signal Process., 2003, pp. 845–848.
[27] A. K. Jain and R. C. Dubes, "Clustering methods and algorithms," in Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988, pp. 55–142.
[28] C. S. Myers and L. R. Rabiner, "A comparative study of several dynamic time-warping algorithms for connected word recognition," Bell Syst. Tech. J., vol. 60, no. 7, pp. 1389–1409, Sep. 1981.
[29] S. Wu, S.-X. Lin, and Y.-D. Zhang, "Automatic segmentation of moving objects in video sequences based on dynamic background construction," Chin. J. Comput., vol. 28, no. 8, pp. 1386–1392, Aug. 2005.
[30] T. Aach, A. Kaup, and R. Mester, "Statistical model-based change detection in moving video," Signal Process., vol. 31, no. 2, pp. 165–180, Mar. 1993.
[31] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vision, vol. 1, no. 4, pp. 321–331, Jan. 1988.
[32] M. K. Hu, "Visual pattern recognition by moment invariants," IEEE Trans. Inform. Theory, vol. 8, no. 2, pp. 179–187, Feb. 1962.
[33] R. Green and L. Guan, "Quantifying and recognizing human movement patterns from monocular video images-part I: A new framework for modeling human motion," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 2, pp. 179–190, Feb. 2004.
[34] J. Deutscher, A. Blake, and I. Reid, "Articulated body motion capture by annealed particle filtering," in Proc. Comput. Vision Pattern Recognit., 2000, pp. 126–133.
[35] D. M. Gavrila and V. Philomin, "Real-time object detection for smart vehicles," in Proc. Int. Conf. Comput. Vision, 1999, pp. 87–93.
[36] C.-T. Hsu and Y.-C. Tsan, "Mosaics of video sequences with moving objects," Signal Process. Image Commun., vol. 19, no. 11, pp. 81–98, Nov. 2004.
[37] http://www.dartfish.com/en/sports-enhancements/sport performance software/index.htm
[38] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. Assoc. Comput. Machinery, vol. 24, no. 6, pp. 381–395, Jun. 1981.
[39] P. J. Rousseeuw and K. Van Driessen, "Computing LTS regression for large data sets," Data Mining Knowl. Discovery, vol. 12, no. 1, pp. 29–45, Jan. 2006.
[40] F. S. Rovati, D. Pau, E. Piccinelli, L. Pezzoni, and J.-M. Bard, "An innovative, high quality and search window independent motion estimation algorithm and architecture for MPEG-2 encoding," IEEE Trans. Consumer Electron., vol. 46, no. 3, pp. 697–705, Aug. 2000.


Haojie Li received the B.E. degree in computer science from Nankai University, Tianjin, China, in 1996, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007.
From 1996 to 2001, he was with Yantai Post, Shandong, China, as a Software Engineer. From 2007 to 2009, he was a Research Fellow with the School of Computing, National University of Singapore, Singapore. Since December 2009, he has been an Associate Professor with the School of Software, Dalian University of Technology, Dalian, China. His research interests include computer vision, pattern recognition, and image/video retrieval.

Jinhui Tang (S'04–M'08) received the B.E. and Ph.D. degrees in signal and information processing from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively.
Since July 2008, he has been a Research Fellow with the School of Computing, National University of Singapore, Singapore. His research interests include content-based image retrieval, video content analysis, and pattern recognition.
Dr. Tang is a recipient of the 2008 President Scholarship of the Chinese Academy of Sciences, and a co-recipient of the Best Paper Award from the Association for Computing Machinery Multimedia in 2007.

Si Wu received the B.E. and M.S. degrees from Xiangtan University, Hunan, China, both in applied mathematics, in 1998 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2005.
Since 2005, he has been a Project Manager with France Telecom Research and Development Beijing, Beijing, China. His current work focuses on multimedia value-added service research and development.

Yongdong Zhang received the B.E. and M.S. degrees from Dalian University of Technology, Dalian, China, both in signal and information processing, in 1995 and 1998, respectively, and the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002.
Since 2002, he has been an Associate Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include video coding and transcoding, video analysis and retrieval, and universal media access.

Shouxun Lin (M'99) received the Ph.D. degree in computer science from the Beijing Institute of Technology, Beijing, China, in 1998.
Since 1995, he has been with the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China, as an Associate Professor and, since 2000, as a Professor. From 2000 to 2005, he was the Vice Director of the Digital Laboratory, ICT, CAS. His research interests include video coding, video content analysis, statistical machine translation, and evaluation of computer–human interaction.
