Abstract—This paper presents a system for automatically detecting and analyzing complex player actions in moving-background sports video sequences, aiming at action-based sports video indexing and at providing kinematic measurements for coaching assistance and performance improvement. The system works in a coarse-to-fine fashion. For an input video, at the coarse granularity level, we automatically segment the highlights, that is, the video clips containing the desired action, as summaries for general user viewing purposes; at the middle granularity level, we recognize the action types to support action-based video indexing and retrieval; and finally, at the fine granularity level, the critical kinematic parameters of the player action are obtained for sports professionals' training purposes. However, the complex and dynamic background of sports videos and the complexity of player actions bring considerable difficulty to the automatic analysis. To fulfill such a challenging task, robust algorithms including global motion estimation with adaptive outlier filtering, object segmentation based on adaptive background construction, and automatic human body tracking are proposed in this paper. Two visual analysis tools, motion panorama and overlay composition, are also introduced. Real diving and jump game videos are used to test the proposed system and algorithms, and the extensive and encouraging experimental results show their effectiveness.

Index Terms—Action recognition, human body tracking, sports training, video analysis, video object segmentation.

I. Introduction

WITH THE EXPLOSIVE growth of digital videos in our daily life, automatic video content analysis has become a basic requirement for efficient indexing and retrieval of long video sequences. In recent years, the analysis of sports videos has attracted great attention due to its mass appeal and tremendous commercial potential. Many works have been conducted, and technologies and systems have been developed, for automatically or semi-automatically parsing the structure of sports video [1], [2], semantic event detection and video summarization [3], [4], enhanced sports TV broadcasting [5], and content insertion [6]. The difficulty of video analysis lies in the semantic gap between low-level audio-visual features and high-level concepts. To bridge the gap, some mid-level representations are constructed from low-level features with clustering or classification methods [7]. However, these representations are extracted by frame-based approaches and have no direct link to high-level semantics. Moreover, they can only deduce coarse-level knowledge of the video contents for general user viewing purposes. Object behavior is another, but more effective, type of mid-level representation for video content analysis [8], [9]. At the same time, sports professionals such as coaches and players desire finer-granularity analysis to obtain more detailed information, such as action names, match tactics, and kinematic or biometric measurements, from videos for coaching assistance and performance improvement. For example, by automatically recognizing the actions and obtaining player body joint angles in diving game videos, coaches or players can easily retrieve and compare, qualitatively or quantitatively, the performed actions with the same actions performed by elite players in a video database, and then improve their performance in later training or competition. To these ends, this paper addresses the automatic detection, recognition, and analysis of player actions from broadcast sports game videos or videos recorded during daily training. More specifically, we focus on one sports genre where the player performs his or her action in a large arena and the camera needs to be operated with pan/tilt/zoom to keep the player in the middle of the image. Usually, one or more cameras are placed in the side view to record the entire detailed action. This includes a broad category of individual sports videos, such as diving, jumps, gymnastics videos, and so on (see Fig. 1). For such action-critical sports videos, general users would like to rapidly locate and watch the highlights, namely, the video clips containing the desired action, while sports professionals will be more interested in the performances of the players. Manual analysis of sports videos to achieve such aims is labor-intensive and time-consuming. Therefore, systems and techniques that can automatically parse long videos into browseable actions and further provide kinematic measurements for performance analysis are in great demand.

In this paper, we present an integrated coarse-to-fine sports video analysis system and various robust algorithms. Diving videos are used as a case study to demonstrate the effectiveness of the system and algorithms due to the following motivations:

Manuscript received October 9, 2008; revised May 10, 2009. First version published November 3, 2009; current version published March 5, 2010. This work was supported in part by the National Basic Research Program of China (Grant No. 2007CB311100), and the Co-building Program of the Beijing Municipal Education Commission. This paper was recommended by Associate Editor T. Fujii.
H. Li is with the School of Software, Dalian University of Technology, Dalian 116620, China.
J. Tang is with the School of Computing, National University of Singapore, 117590 Singapore (e-mail: tangjh@comp.nus.edu.sg).
S. Wu is with France Telecom Research and Development Beijing, Beijing 100080, China.
Y. Zhang and S. Lin are with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2009.2035833
1051-8215/$26.00 © 2010 IEEE
352 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 3, MARCH 2010
LI et al.: AUTOMATIC DETECTION AND ANALYSIS OF PLAYER ACTION IN MOVING BACKGROUND SPORTS VIDEO SEQUENCES 353
video in a coarse-to-fine fashion is presented, which attempts to extract semantic information at different granularities to satisfy the retrieval requirements of both general users and sports professionals.
2) To robustly estimate the global motion between adjacent video frames, we propose an adaptive outlier filtering strategy using Fisher linear discriminant analysis.
3) An object segmentation algorithm based on adaptive dynamic background construction is proposed, which is robust to complex and dynamic scenes.
4) The automatic kinematic analysis of player body movements is achieved by model fitting. By transferring the initial model parameters from the recognized action templates, our method avoids the manual setup of the human body model.
5) With the enabling techniques above, we present two visual tools for individual sports game training: motion panorama and overlay composition, which are perceptible aids for the visual analysis and comparison of player performance.

II. Global Motion Estimation

Global motion (GM) refers to the motion of the background in a video sequence caused by the camera motion. Global motion estimation (GME) is the process of estimating the rotation, scaling, and translation parameters of the camera motion by comparing two different frames of a video; it is the basic technique underlying the subsequent algorithms of this paper. The methods for GME can be classified into two categories: differential methods [21], [22] and feature matching-based methods [23]. For methods in both categories, the accuracy is influenced by the so-called outliers, that is, measurements whose motion is not consistent with the global motion, mainly caused by local object (such as player) motion.

1, . . . , N; N ≥ 3) from frames Ik and Ik−1, then we have

U = HA    (3)

where H = (H1^T, H2^T, . . . , HN^T)^T and U = (u1^T, u2^T, . . . , uN^T)^T. Thus A can be obtained by solving (3) using the least squares method.

The feature point pairs are constructed as follows. We first calculate the global standard deviation, gstd, of pixel values in frame Ik. Then we scan frame Ik and check each n × n block. If the standard deviation, std, of a block is large enough, that is, larger than α ∗ gstd, the upper left corner of the block is selected as ui. The corresponding point ui′ is obtained by searching nearby blocks in frame Ik−1. In this way, we collect all the point pairs between Ik and Ik−1.

Suppose A∗ is the initial solution for the global motion; we calculate the residual error for each pair (ui, ui′) as

ri = ui − Hi A∗.    (4)

Since A∗ is an approximate solution to the GM and the motion of outliers is not consistent with the GM, the residual errors of outliers will be larger than those of inliers. Therefore, we can separate the point pairs into inliers and outliers according to their residual errors and then use the inliers to refine A. Let R = {|r1|, . . . , |rN|} be the residual error set for the point pair set {(u1, u1′), . . . , (uN, uN′)}, and let Rin and Rout be the residual error sets for the inliers and outliers, respectively. The key problem is to find an appropriate threshold T such that

Rin = {|ri| : |ri| < T}, Rout = {|ri| : |ri| ≥ T}

subject to Rin ∪ Rout = R and Rin ∩ Rout = ∅.    (5)

Assume that the mean of Rin is µin and its probability is pin = |Rin|/N; the mean of Rout is µout, and its probability is pout =
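The least-squares solve of (3) can be sketched in a few lines. This is only an illustration, not the authors' implementation: since the definition of Hi lies outside this excerpt, a standard 6-parameter affine model is assumed here, in which each feature point contributes one 2 × 6 block to H.

```python
import numpy as np

def estimate_affine(points_prev, points_cur):
    """Least-squares fit of a 6-parameter affine global motion model.

    Solves U = H A (eq. (3)) for A, where each point pair contributes
    a 2x6 block H_i. The exact form of H_i is not shown in this
    excerpt; a standard affine parameterization is assumed.
    """
    points_prev = np.asarray(points_prev, dtype=float)
    points_cur = np.asarray(points_cur, dtype=float)
    n = len(points_prev)
    H = np.zeros((2 * n, 6))
    U = points_cur.reshape(-1)             # stacked (x', y') coordinates
    for i, (x, y) in enumerate(points_prev):
        H[2 * i] = [x, y, 1, 0, 0, 0]      # x' = a1*x + a2*y + a3
        H[2 * i + 1] = [0, 0, 0, x, y, 1]  # y' = a4*x + a5*y + a6
    A, *_ = np.linalg.lstsq(H, U, rcond=None)
    return A
```

With N ≥ 3 non-collinear point pairs the system is (over)determined, and `np.linalg.lstsq` returns the least-squares solution directly.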
After Tr is found, the inlier set Rin and outlier set Rout are consequently determined. Then Rout is filtered out and only Rin is used to re-estimate A. Since the points in Rin are more likely to comply with the GM, the re-estimated GM parameters are more precise. This is the basic idea of applying the Fisher criterion to filter outliers in our proposed GME method.

The complete GME algorithm is summarized as follows.

1. Select feature point pairs from Ik and Ik−1 and set S0 = {(u1, u1′), . . . , (uN, uN′)}.
2. Set Iteration = 1 and S = S0.
3. Iterate the steps below until (Iteration > maxIteration) or ((Tr < ErrThresh) && (Q < Tr ∗ |S| ∗ 0.5)).
   3.1 Compute the motion model A∗ from S by solving (3).
   3.2 For each point pair (ui, ui′) in S, compute its residual error ri using (4); then determine the optimal separating threshold Tr using (7).
   3.3 Obtain the inlier set Φ = {(ui, ui′) : |ri| < Tr, (ui, ui′) ∈ S}; calculate the total residual error Q = Σ(ui,ui′)∈Φ |ri|.
   3.4 Set S = Φ and Iteration = Iteration + 1.
4. Given the estimated motion model parameters A∗, a refinement step motivated by [25] is conducted to include more inliers and produce more accurate model parameters. The refinement is done as follows. We apply A∗ to S0 and construct a new inlier set Φ′ = {(ui, ui′) : |ri| < ErrThresh, (ui, ui′) ∈ S0}; then we compute the final model A from Φ′.

Fig. 3 gives two successive frames from a diving video and their difference after GM estimation and compensation. It can be seen that the outliers among the global motion vectors are correctly filtered and the background is accurately aligned by the proposed GME algorithm.

Fig. 3. Result of GME. (a)–(b) Two neighboring video frames. (c) Global motion vectors between the two frames. (d) Motion vectors after outlier filtering. (e) Aligned image of (a) to (b) using the estimated GME parameters. (f) Difference image of (b) and (e). From (f) we can see that the background is accurately aligned with the proposed GME algorithm.

III. Highlight Detection

Highlights are the atomic entities of videos at the semantic level, representing the most expressive or attractive video segments. Highlight detection is a kind of video summarization technique, which enables users to rapidly digest the contents of long video sequences. In our case of action-critical sports videos, a highlight corresponds to the clip which contains the entire diving action, that is, from the player's take-off to entry into the water. In general, successful highlight detection algorithms are usually dependent on domain-specific knowledge [1], [3]. In this paper, we develop an effective and general approach to extracting the highlight clips using the repetitive global motion features of sports videos. The approach utilizes two characteristics of action-critical sports videos.

1) The action to be detected, which can be discriminated by its specific global motion pattern, occurs many times in the entire video.
2) Since the players perform their actions one by one, the temporal gap between two successive action clips is somewhat regular, e.g., ∼60 s. Meanwhile, there are usually some replays following the normal action.

Fig. 4 illustrates the distinct vertical global motion patterns of one type of diving, 3 m springboard diving, and their temporal distributions. It should be noted that, though replays are sometimes more interesting, here we only care about the normal action, because replays are not always regular or do not contain the complete action.

Based on the first characteristic, we segment the entire video into small clips with evident global motion and then group them into motion-consistent clusters. The action to be detected should correspond to one of these clusters. We describe the segmentation process as follows.

First, we conduct cut detection [26] and segment the entire video into shots; then we perform motion segmentation on these individual shots. The aim of this step is to eliminate the influence of abrupt motion changes (see Fig. 4) on the motion segmentation results. For each shot, we use a 5-frame window to detect the local peaks of global motion. If the peak value is larger than a predefined threshold, pThresh, a possible action clip is found. Then we use a sliding window with length L to scan in the two directions from this point to determine the start and end points of the clip. For each direction, the scanning will stop if the number of moving frames (i.e., frames whose global motion values are larger than mThresh) in the window is smaller than r ∗ L, or if it reaches the start or end of the shot. If the length of the detected clip is larger than 20 frames, it is deemed an action clip. This process is repeated until the entire shot is checked. Then we proceed to the next shot.
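The sliding-window clip segmentation described above can be sketched as follows. This is a simplified illustration, not the authors' code: pThresh, mThresh, L, and r are the paper's parameter names, but the default values and the exact peak test are our assumptions, and the input is a per-frame global motion magnitude array for a single shot.

```python
def detect_action_clips(motion, pThresh=6.0, mThresh=2.0, L=20, r=0.2, min_len=20):
    """Sketch of sliding-window action-clip segmentation for one shot.

    A frame whose motion exceeds pThresh and is a local peak (5-frame
    window) seeds a clip; the clip is grown in both directions while
    the length-L window ahead still contains at least r*L "moving"
    frames (motion > mThresh), or until the shot boundary is reached.
    """
    n = len(motion)

    def scan(start, step):
        pos = start
        while 0 <= pos + step < n:
            nxt = pos + step
            # Length-L window ahead of the scan position, in scan direction.
            if step > 0:
                window = motion[nxt:nxt + L]
            else:
                window = motion[max(0, nxt - L + 1):nxt + 1]
            if sum(1 for m in window if m > mThresh) < r * L:
                break
            pos = nxt
        return pos

    clips, i = [], 0
    while i < n:
        local = motion[max(0, i - 2):i + 3]
        if motion[i] > pThresh and motion[i] == max(local):
            start, end = scan(i, -1), scan(i, +1)
            if end - start + 1 > min_len:
                clips.append((start, end))
            i = end + 1
        else:
            i += 1
    return clips
```

The returned (start, end) frame ranges correspond to candidate action clips; clips shorter than min_len frames are discarded, mirroring the 20-frame rule in the text.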
Fig. 4. Vertical global motion patterns of diving actions (dives) and their respective replays. The abrupt changes in the motion curve are caused by shot transitions.

After all the action clips are segmented, we apply the hierarchical agglomerative clustering method [27] to cluster the clips. The clips with low distances are grouped first, and then the clips with relatively larger distances are grouped. The process is repeated until the stop condition is reached.

The dynamic time warping (DTW) technique [28] is slightly modified to enable partial matching and is used to compute the distance between two action clips. Let s = (s1, . . . , sM) and t = (t1, . . . , tN) be the global motion sequences of two clips; their DTW distance is defined as (suppose M ≥ N)

DTW(s, t) = T,  if M > 3 ∗ N or Es > 3 ∗ Et or Et > 3 ∗ Es;
            dtw(s, t),  if M < 1.5 ∗ N;
            min_{i=1,...,M−N} dtw((si, . . . , si+N), t),  if M ≥ 1.5 ∗ N.    (8)

In (8), Es = Σ_{i=1}^{M} |si| and Et = Σ_{i=1}^{N} |ti|. dtw(s, t) is the distance between s and t computed by the standard DTW technique [28]. T is first assigned a large value; then, after the distances between all the action clip pairs are obtained, we set T = max{DTW(s′, t′) | DTW(s′, t′) < T}.

It is known that determining the cluster number or the stop condition for a clustering algorithm is difficult. Here the second characteristic motivates us to add a constraint on the clustering: clips that are too near in time (e.g., dive1, replay1, and replay2 in Fig. 4, where the temporal gaps are below a threshold TG) cannot be grouped into the same cluster. This constraint eases the determination of the clustering stop condition. We first compute the average distance, dist, over all the nearby clips and then set the stop condition as: the cluster distance should not be larger than dist ∗ 2. This condition encourages similar action clips, such as dive1 and dive2, to group into cluster C1 and replay1 and replay3 to group into cluster C2, while preventing C1 and C2 from grouping together. In practice, to ensure that nearby clips are not grouped, we simply assign a large distance to them.

After clustering, clusters with small sizes, i.e., whose numbers of clips are below N/10, are filtered out, where N is the total number of detected clips in the video. Then we select the desired action cluster from the remaining clusters. As shown in Fig. 4, for an action clip A, the fewer preceding nearby action clips it has, the more likely it is a normal action. At the same time, the motion energy, i.e., the sum of motion vectors, of a normal action clip should be larger than that of a replay clip. Therefore, we consider two measurements for each clip to weight its probability of being a normal action:

TemporalPriority(A) = #{B | T(B) < T(A), T(A) − T(B) < TG}    (9)

MotionEnergy(A) = Σ_{i=1}^{|A|} |Ai|    (10)

where A and B are action clips, Ai is the global motion of A at frame i, T(X) is the timestamp of clip X, and TG is a threshold, in terms of frames, indicating whether two clips are nearby. The weight of clip A is the combination of the above two terms

(MotionEnergy(A))^{1/p} ∗ (1 + TemporalPriority(A))^{−q}.    (11)

Then the weight of cluster Ci is the average weight of all the action clips it contains. Finally, we select the desired action (i.e., highlight) cluster as the one with the largest weight. Undoubtedly, more complex weighting factors, such as statistical measurements of the segmented foreground objects in the clips, could be more powerful for selecting the desired action cluster. However, the presented weighting scheme has worked well on our experimental data set.

To summarize, our highlight detection method consists of three steps: 1) segment the video into action clips; 2) cluster the action clips; and 3) select the desired action cluster.

After the highlight clips are detected, they are stored in the video library to support quick browsing of the video.

IV. Player Body Shape Segmentation

For the middle- and fine-level analysis, we need to segment the player body shapes from the detected highlight clips. For diving video sequences, the non-rigid deformation of the player body, large camera motion, and cluttered background caused by referees or audiences bring great difficulty to this task. We have proposed an automatic object segmentation algorithm [29] that can deal with such problems. However, like other motion-based segmentation methods [25], this algorithm works poorly when the player has little motion between successive frames, which is often the case in diving videos at the early stage of a diving action. To overcome this limitation, we propose an improved version of the algorithm in this paper. To be self-contained, we first introduce our previous algorithm in [29] (called Algorithm 1 hereafter) and then present the improved version (called Algorithm 2 hereafter).
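The partial-matching DTW distance of (8) can be sketched as below. This is an illustrative implementation: dtw() is the textbook dynamic-programming distance with an absolute-difference cost, and the early-exit value T is passed in explicitly rather than estimated from all clip pairs as the paper does.

```python
def dtw(s, t):
    """Standard DTW distance between two 1-D sequences."""
    INF = float('inf')
    M, N = len(s), len(t)
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[M][N]

def modified_dtw(s, t, T=1e9):
    """Partial-matching DTW of eq. (8); assumes len(s) >= len(t)."""
    if len(s) < len(t):
        s, t = t, s
    M, N = len(s), len(t)
    Es, Et = sum(abs(x) for x in s), sum(abs(x) for x in t)
    if M > 3 * N or Es > 3 * Et or Et > 3 * Es:
        return T                     # clips too dissimilar to warp
    if M < 1.5 * N:
        return dtw(s, t)
    # Slide a length-(N+1) window of s over t and keep the best match.
    return min(dtw(s[i:i + N + 1], t) for i in range(M - N))
```

The windowed minimum is what enables partial matching: a short clip can align with the best-matching segment of a much longer one instead of being stretched over its full length.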
A. Algorithm 1

The core of Algorithm 1 is to construct a statistical background image for each frame Ik by registering multiple neighboring frames to Ik. The main steps of Algorithm 1 are summarized as follows.

1) Global Motion Estimation: Global motion estimation (proposed in Section II) and compensation are used to align neighboring frames to the coordinate system of the current frame.

2) Foreground Separation: If the adjacent frames are used directly to construct the background, ghost-like noise will appear in the constructed background. To eliminate such noise, foreground areas are separated from the images using three-frame difference before the construction of the background.

3) Background Construction: For frame Ik, the 2L + 1 consecutive frames Ii (i = k − L, . . . , k + L) are used to construct its background image. Each Ii is aligned to the coordinate system of Ik using the estimated global motion parameters. The pixels pi in the 2L + 1 frames corresponding to the same location p of the current background are found, and an improved temporal median method is used to determine the value Bk(p) of pixel p in the constructed background

Bk(p) = median_i (Ii(pi)), Di(pi) = 1    (12)

where Di(pi) is the foreground mask of frame Ii obtained from step 2).

4) Object Segmentation: After the background of the current frame is constructed, the background subtraction method is applied and some post-processing steps are conducted to extract the moving object. These steps include: 1) a significance test [30] is used to decide the threshold for binarizing the difference image obtained by background subtraction; 2) connected component analysis is applied to the binarized image, and components smaller than 50% of the largest component are removed; and 3) the Snake model [31] is adopted to smooth each remaining component's boundary.

The merit of Algorithm 1 is that it adopts the foreground separation technique and improved temporal median filtering for robust background construction. Algorithm 1 is suitable for segmenting moving objects from dynamic background videos. However, it works well only when the object has apparent motion; when the motion is slight, it will fail (see Fig. 5).

Fig. 5. Segmentation results of Algorithm 1. Top row: original frames; middle row: constructed backgrounds; bottom row: segmented foreground objects.

In Algorithm 2 we take the object motion between frames into account. The idea is to select only frames with apparent object motion to construct the background image. There are many ways to estimate the magnitude of object motion, such as the difference of two aligned frames. In this paper, motivated by the observation that the camera is always focused on the player to capture his or her movements, we use the global motion between frames as the measurement of object motion. When the player is moving, the camera also moves. Therefore, we can select the frames having salient global motion (larger than Th1), that is, key-frames, instead of consecutive neighboring frames to build the background. Meanwhile, when the cumulative global motion exceeds a given threshold Th2, no more frames are needed.

Algorithm 2 first decides all the key-frames of a clip as follows.

1) Neighboring frames with global motion larger than Th1 are selected as key-frames.
2) For the remaining frames, if a frame's cumulative motion to the nearest key-frame exceeds Th1, it is selected as a key-frame.

Then, for the current frame Ik, we select L1 and L2 consecutive neighboring key-frames from the left and right of Ik, respectively, to build its background, where

Li = min(L, argmin_J CMi(k, J) ≥ Th2)    (13)
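The temporal-median background construction of (12) can be sketched as follows. Note that this excerpt leaves the role of the mask condition Di(pi) = 1 ambiguous, so this sketch makes the common interpretation of excluding pixels marked as foreground from the median; the function name and array layout are ours.

```python
import numpy as np

def construct_background(aligned_frames, foreground_masks):
    """Temporal-median background construction (sketch of eq. (12)).

    `aligned_frames`: list of HxW arrays already warped into the
    current frame's coordinate system. `foreground_masks`: HxW boolean
    arrays, True where the three-frame difference of step 2) marked
    foreground. Each background pixel is the median over the frames in
    which that pixel was NOT marked as foreground (our interpretation
    of the mask condition in eq. (12)).
    """
    stack = np.stack(aligned_frames).astype(float)
    masks = np.stack(foreground_masks)
    stack[masks] = np.nan                 # drop foreground samples
    return np.nanmedian(stack, axis=0)    # per-pixel temporal median
```

Excluding the masked samples is what suppresses the ghost-like noise mentioned in step 2): a moving player contributes no votes to the median at the locations he or she covers.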
B. Visual Analysis

With the enabling techniques provided above, that is, global motion estimation, accurate player body segmentation, and action recognition, we present two visual sports training tools which allow coaches and players to review and compare performances in an easy way to enhance coaching strategies.

1) Motion Panorama: The motion panorama provides an efficient and compact representation of the underlying video by constructing a single image from a subset of the frames of a video. It can show the global background as well as the foreground player bodies, and it reveals the moving trajectory or action details more directly than a video. The panorama has proved to be a powerful tool for reviewing and evaluating the performance of a player action [36].

After estimating the global motion parameters, a background panorama is built first. The frames of an action clip are aligned to the coordinate system of the first frame, which is also selected as the world coordinate system of the panorama. Then, for the overlapping regions, the temporal median filtering technique is used to construct the background by selecting the median RGB value.

Using the global motion and time intervals as criteria, some key-frames are selected automatically; a manual selection function is also provided in the system. The segmented foregrounds of these key-frames are mapped to a world coordinate system, the same as that of the background panorama, to form a foreground panorama. For a better visual effect, the overlapping foreground regions are fused by alpha blending.

Finally, the resulting motion panorama is created by covering the background panorama with the effective regions of the foreground panorama. An example of the panoramas for a platform diving clip is shown in Fig. 8.

2) Overlay Composition: Overlay composition is also a video blending technique, which creates a composite video by overlaying two temporally aligned videos of the same scene [37]. It provides a visually appealing tool for coaches and players to compare actions performed by different players or by the same player at different times. This is important for sports training, since very small differences in technique can lead to very large differences in performance in competitive sports games.

Given a synchronization point of two clips and the segmented foregrounds, we compose an overlay clip by superimposing one clip's foregrounds over the other clip using the alpha blending technique. Compared with traditional overlay methods [37], which simply blend the corresponding frames of two videos, our approach has two (potential) advantages: 1) it does not have the constraint that the two clips should be of the same scene, because we have segmented the foregrounds accurately; and 2) the synchronization point can be automatically determined by temporally aligning the segmented shape sequences using a technique such as dynamic time warping. In our case, we simply take the entry point as the synchronization point to compose the overlay clips for the same action. Fig. 9 gives two example frames of an overlay clip.

Fig. 9. Overlay frames of the same action performed by two players from two clips.

VII. Experimental Results

To evaluate the effectiveness and general versatility of the proposed system and algorithms, we conduct experiments on the analysis of two kinds of sports game videos: diving videos
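The per-frame overlay composition step can be sketched as follows. This is a minimal illustration, assuming a segmented foreground mask is available for the superimposed clip; the function name and the blending weight are our assumptions, not the paper's.

```python
import numpy as np

def overlay_frames(base_frame, fg_frame, fg_mask, alpha=0.6):
    """Alpha-blend one clip's segmented foreground over another frame.

    Inside the foreground mask the two frames are mixed with weight
    `alpha`; outside the mask the base frame is kept unchanged.
    """
    base = base_frame.astype(float)
    fg = fg_frame.astype(float)
    mask = fg_mask.astype(float)
    if base.ndim == 3:                    # broadcast HxW mask over channels
        mask = mask[..., None]
    return (1 - mask) * base + mask * (alpha * fg + (1 - alpha) * base)
```

Because the blend is restricted to the segmented foreground, the two clips need not share the same scene, which is the first advantage claimed above.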
TABLE I TABLE II
Average ITFs (db) for Six GME Algorithms on Diving Video Average ITFs (db) for Three GME Algorithms
With Progressive Filtering on Diving Video
Proposed i-RANSAC i-LTS Fisher RANSAC LTS
44.17 44.32 43.51 43.12 43.19 42.23 Proposed i-RANSAC i-LTS
44.50 44.52 44.26
and jump videos. We also compare the performance of our TABLE III
algorithms with some competitive algorithms. All the videos Threshold Parameters and Their Ranges for the Detection
are digitized in a MPEG-1 format. Method
Parameter Range
A. Experiments on Diving Videos pThresh 4–7
In this section, diving game videos are used as case study TG 600–1200
mThresh 1.5–3.0
to demonstrate and evaluate the effectiveness of whole system L 15–30
and the algorithms. The diving video set is about 8 h and r 0.15–0.3
includes six platform videos and two springboard videos. p 2–4
q 2–4
1) Global Motion Estimation: We randomly crop 20 action
clips from the springboard and platform diving videos, each For diving videos, even though there exists camera zooming
having ten instances, to test the performance of proposed GME operation to some extent, the dominant global motion is trans-
algorithm. Two state-of-the-art robust estimation methods: lation caused by camera tilting up/down. This fact motivates
RANdom SAmple Consensus (RANSAC) algorithm [38] and us to progressively filter outliers. The idea is that we first use
Least Trimmed Squares (LTS) algorithm [39] are implemented the translation model, i.e., the 2-parameter model to coarsely
for comparison. We also implemented the improved versions filter out most of the outliers, then we use the affine model
for RANSAC and LTS according to [25], which we denote as to finely filter out the rest ones. This is simply implemented
i-RANSAC and i-LMS, respectively. The differences between by adopting the translation model for the first two iterations
original and improved RANSAC and LTS are that, in the (i.e., Iteration = 1, 2 ) of step 3 in the proposed GME
improved versions, refinement procedures are applied for each method. Similar implementations are applied to i-RANSAC
sampled subset to improve the accuracy of motion model, by and i-LTS. The experimental results (see Table II) validate the
iteratively covering more inliers. The thresholds are set same effectiveness of progressive filtering strategy.
to [25] as ErrThresh = 1.5; max Iteration = 50. Actually, this progressive filtering strategy for GME is not
For evaluation we use the interframe transformation fidelity (ITF) measure [40]:

\mathrm{ITF} = \frac{1}{N-1} \sum_{k=2}^{N} \mathrm{PSNR}\left(I_k, I_{k-1}^{\mathrm{comp}}\right) \quad (17)

\mathrm{PSNR}\left(I_k, I_{k-1}^{\mathrm{comp}}\right) = 10 \log \frac{255^2}{\mathrm{MSE}\left(I_k, I_{k-1}^{\mathrm{comp}}\right)} \quad (18)

where I_{k-1}^{\mathrm{comp}} is the image aligned to I_k by applying global motion compensation to I_{k-1}, and \mathrm{MSE}(I_k, I_{k-1}^{\mathrm{comp}}) is the mean square error between I_k and I_{k-1}^{\mathrm{comp}}.

Because there is obvious player motion in the testing clips, we manually label the bounding box of the player in each frame, and the residual pixels inside the box are excluded from the mean square error (MSE) computation. Similarly, residuals close to the image borders are also excluded. The average ITFs of the compared algorithms on the 20 clips are tabulated in Table I. Here "Proposed" denotes the proposed GME method of Section II, while "Fisher" denotes the proposed GME method without the refinement step 4.

It can be seen that the proposed algorithm achieves performance comparable to i-RANSAC and i-LTS: better than i-LTS and slightly worse than i-RANSAC. Another observation is that the improved versions of the three algorithms all outperform their respective original versions, showing that the refinement process, which iteratively enlarges the inlier set so that the parameter estimation is based on more input data, results in a more accurate estimation.

…only practical to diving videos, but also applicable to a broad category of individual sports videos, such as running, jumps, gymnastics, and so on. We will see in Section VII-B1 that for more challenging videos, the improvement is more noticeable.

2) Highlights (Action Clips) Detection: There are seven threshold parameters in the proposed highlight detection method. We conduct comprehensive tests with different values of the threshold combinations (see Table III), and the results show that the detection performance is quite robust to most of the thresholds, such as pThresh, mThresh, TG, p, and q, while others, such as L and r, have a slight effect on the detection results.

Parameters L and r are related to the segmentation of an action clip, which determines the start and end points of the clip. When L is set to a larger value and r to a smaller value, such as 30 and 0.15, then in cases where shot cut detection fails, the nearby large global motion caused by the shot transition will lead to errors in action clip segmentation. On the other hand, if L is set to a smaller value and r to a larger value, such as 15 and 0.25, some normal action clips may be split into small, incomplete clips. The experiments show that, for the testing video data, threshold L has more impact on the performance than r. We fix the parameters pThresh, TG, mThresh, r, p, and q at 6, 900, 2.0, 0.2, 2, and 2, respectively, and set L to 15, 20, and 25; the detection results are shown in Table IV.

We can see that the detection method is very effective and achieves high performance. The errors are partly caused by segmentation noise, i.e., over-segmentation and …
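A minimal sketch of how the ITF of (17) and (18), with the masked MSE described above, might be computed; the frame arrays and the precomputed motion-compensated frames are hypothetical inputs, and the GME step itself is not shown:

```python
import numpy as np

def psnr(frame, compensated, mask=None):
    """PSNR of Eq. (18) between frame I_k and the motion-compensated
    previous frame; pixels where `mask` is False (e.g., the player's
    labeled bounding box or the image borders) are excluded from the MSE."""
    err = (np.asarray(frame, float) - np.asarray(compensated, float)) ** 2
    if mask is not None:
        err = err[mask]
    mse = max(float(err.mean()), 1e-12)  # guard against identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

def itf(frames, compensated, masks=None):
    """ITF of Eq. (17): average PSNR between each frame I_k and
    I_{k-1}^comp. `compensated[k-1]` must already be I_{k-1} warped
    onto the coordinate system of `frames[k]` by the estimated global motion."""
    n = len(frames)
    total = 0.0
    for k in range(1, n):
        mask = None if masks is None else masks[k]
        total += psnr(frames[k], compensated[k - 1], mask)
    return total / (n - 1)
```

A higher ITF means the estimated global motion aligns consecutive frames better, which is what separates the compared GME variants in Table I.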
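The refinement behavior credited above, re-selecting inliers against the current model and re-estimating on the (typically growing) inlier set, can be illustrated on a toy 1-D fit. The linear model, fixed residual threshold, and iteration count are illustrative assumptions only, not the paper's global motion model:

```python
import numpy as np

def refine_fit(x, y, a0, b0, resid_thresh=1.0, iters=5):
    """Iterative inlier refinement: starting from a rough robust estimate
    (a0, b0) of y ~ a*x + b, repeatedly keep the points whose residual is
    below the threshold and re-fit by least squares on that inlier set."""
    a, b = a0, b0
    for _ in range(iters):
        inliers = np.abs(y - (a * x + b)) < resid_thresh
        a, b = np.polyfit(x[inliers], y[inliers], 1)
    return a, b
```

Starting from a deliberately biased initial fit, the inlier set grows across iterations and the estimate settles on the model supported by the clean majority of the data, mirroring the accuracy gain reported for the improved i-RANSAC and i-LTS variants.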
Authorized licensed use limited to: IEEE Xplore. Downloaded on May 13, 2010 at 1:465 UTC from IEEE Xplore. Restrictions apply.
360 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 3, MARCH 2010
Fig. 10. Segmentation results for platform diving (the same clip in Fig. 5). Top row: original frames; middle row: constructed backgrounds; bottom row: segmented body shapes.
Fig. 11. Segmentation results for another platform diving clip.
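The constructed-background pipeline shown in Figs. 10 and 11 can be sketched as follows. The per-pixel temporal median over globally aligned frames and the fixed change threshold are simplifying stand-ins for the paper's adaptive background construction, not its exact algorithm:

```python
import numpy as np

def build_background(aligned_frames):
    """Background construction: per-pixel temporal median of frames already
    aligned to a common coordinate system by global motion compensation."""
    stack = np.stack([np.asarray(f, float) for f in aligned_frames])
    return np.median(stack, axis=0)

def segment_player(frame, background, thresh=25.0):
    """Change detection against the constructed background: pixels whose
    absolute difference exceeds `thresh` are foreground (player) candidates."""
    return np.abs(np.asarray(frame, float) - background) > thresh
```

Because the player occupies any given pixel in only a few of the aligned frames, the median recovers the background there, and the difference image isolates the body shape.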
LI et al.: AUTOMATIC DETECTION AND ANALYSIS OF PLAYER ACTION IN MOVING BACKGROUND SPORTS VIDEO SEQUENCES 361
TABLE IV. Highlights Detection Results
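The threshold study reported in Table IV, fixing pThresh, TG, mThresh, r, p, and q at 6, 900, 2.0, 0.2, 2, and 2 while varying L over 15, 20, and 25, can be organized with a small harness like the one below. The detector interface and the half-overlap matching rule are hypothetical stand-ins, since the paper's exact scoring criterion is not restated here:

```python
# Fixed thresholds from the experiment description; only L is swept.
FIXED = {"pThresh": 6, "TG": 900, "mThresh": 2.0, "r": 0.2, "p": 2, "q": 2}

def evaluate(detected, truth):
    """Clip-level precision/recall; a detected clip matches a ground-truth
    clip when their overlap covers at least half of the truth clip."""
    def match(c, t):
        (s, e), (ts, te) = c, t
        return min(e, te) - max(s, ts) >= 0.5 * (te - ts)
    prec_hits = sum(1 for c in detected if any(match(c, t) for t in truth))
    rec_hits = sum(1 for t in truth if any(match(c, t) for c in detected))
    precision = prec_hits / len(detected) if detected else 0.0
    recall = rec_hits / len(truth) if truth else 0.0
    return precision, recall

def sweep_L(detector, video, truth):
    """Run the highlight detector with the fixed thresholds while varying L."""
    return {L: evaluate(detector(video, L=L, **FIXED), truth) for L in (15, 20, 25)}
```

Each clip is a (start_frame, end_frame) pair; swapping in the real detector reproduces the kind of per-L comparison tabulated above.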
TABLE VI. Errors for Hip Angle and Knee Angle
TABLE VII. Average ITFs (dB) for Three GME Algorithms on Jump Video
TABLE VIII. Average ITFs (dB) for Three GME Algorithms With Progressive Filtering on Jump Video
2) Player Body Segmentation: The proposed object segmentation algorithm is applied to the jump videos to extract player body shapes. Figs. 17 and 18 give some segmentation results on long jump and high jump video. Note that in both videos, besides the player, there are also some moving audience members. The results show that our algorithm works well in the presence of other moving objects, provided their movements are relatively slower than that of the player.

VIII. Conclusion and Future Work

In this paper, we have presented a system for the automatic detection, recognition, and analysis of player motion in action-critical sports video, for high-level content-based video indexing/retrieval and training purposes. All the work is performed automatically and integrated in a comprehensive framework that is applicable to sports video analysis. Various key techniques and algorithms are proposed to fulfill this work.

The framework and algorithms are tested on broadcast diving videos and jump videos. The extensive experiments show the effectiveness of the proposed system. However, there is still room for improvement. In highlight detection, the caption in the video that shows the player profile while he/she is preparing to play is an important cue for identifying an action, and we expect an improvement from integrating such information in the future. For action recognition, we are considering building a hybrid classifier that combines HMMs with main body features, such as head tracking and shape analysis, to improve the classifier's discriminating capability and to recognize actions with small training samples. Meanwhile, further work needs to be done to automatically analyze more kinematic measurements for training instruction.

Acknowledgment

The authors would like to thank the anonymous reviewers for their valuable comments that helped to improve this paper. They would also like to thank Dr. S. Tang, Dr. R. Hong, and Dr. K. Tao for their useful suggestions.

References

[1] F. Wang, J.-T. Li, Y.-D. Zhang, and S.-X. Lin, "Semantic and structural analysis of TV diving programs," J. Comput. Sci. Technol., vol. 19, no. 6, pp. 928–935, Nov. 2004.
[2] L. X. Xie, P. Xu, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with domain knowledge and hidden Markov models," Pattern Recognit. Lett., vol. 25, no. 7, pp. 767–775, May 2004.
[3] C.-C. Lien, C.-L. Chiang, and C.-H. Lee, "Scene-based event detection for baseball videos," J. Visual Commun. Image Representation, vol. 18, no. 1, pp. 1–14, Feb. 2007.
[4] G. Xu, Y.-F. Ma, H.-J. Zhang, and S.-Q. Yang, "An HMM-based framework for video semantic analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11, pp. 1422–1433, Nov. 2005.
[5] G. Pingali, Y. Jean, A. Opalach, and I. Carlbom, "LucentVision: Converting real world events into multimedia experiences," in Proc. IEEE Int. Conf. Multimedia Expo, 2000, pp. 1433–1436.
[6] X. G. Yu, N. J. Jiang, L.-F. Cheong, H. W. Leong, and X. Yan, "Automatic camera calibration of broadcast tennis video with applications to 3-D virtual content insertion and ball detection and tracking," Comput. Vision Image Understand., vol. 113, no. 5, pp. 643–652, May 2009.
[7] L.-Y. Duan, M. Xu, T.-S. Chua, Q. Tian, and C.-S. Xu, "A mid-level representation framework for semantic sports video analysis," in Proc. ACM Conf. Multimedia, 2003, pp. 33–44.
[8] H. Miyamori and S. Iisaku, "Video annotation for content-based retrieval using human behavior analysis and domain knowledge," in Proc. Automat. Face Gesture Recognit., 2000, pp. 320–325.
[9] G. Y. Zhu, Q. M. Huang, C. S. Xu, L. Xing, W. Gao, and H. Yao, "Human behavior analysis for highlight ranking in broadcast racket sports video," IEEE Trans. Multimedia, vol. 9, no. 6, pp. 1167–1182, Oct. 2007.
[10] R. Poppe, "Vision-based human motion analysis: An overview," Comput. Vision Image Understand., vol. 108, nos. 1–2, pp. 4–18, Oct. 2007.
[11] T. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Comput. Vision Image Understand., vol. 104, nos. 2–3, pp. 90–126, Nov. 2006.
[12] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. Comput. Vision Pattern Recognit., 1992, pp. 379–385.
[13] J. Sullivan and S. Carlsson, "Recognizing and tracking human action," in Proc. Eur. Conf. Comput. Vision (Lecture Notes in Computer Science 2350), 2002, pp. 629–644.
[14] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. Int. Conf. Comput. Vision, 2003, pp. 726–733.
[15] W.-L. Lu, K. Okuma, and J. J. Little, "Tracking and recognizing actions of multiple hockey players using the boosted particle filter," Image Vision Comput., vol. 27, no. 1, pp. 189–205, Jan. 2009.
[16] P. Figueroa, N. Leite, and R. Barros, "Tracking soccer players aiming their kinematical motion analysis," Comput. Vision Image Understand., vol. 101, no. 2, pp. 122–135, Feb. 2006.
[17] J. R. Wang and N. Parameswaran, "Detecting tactics patterns for archiving tennis video clips," in Proc. Int. Symp. Multimedia Softw. Eng., 2004, pp. 186–192.
[18] A. W. B. Smith and B. C. Lovell, "Visual tracking for sports applications," in Proc. Workshop Digital Image Comput., 2005, pp. 79–84.
[19] R. Urtasun, D. J. Fleet, and P. Fua, "Monocular 3-D tracking of the golf swing," in Proc. Comput. Vision Pattern Recognit., 2005, pp. 932–938.
[20] R. Cassel, C. Collet, and R. Gherbi, "Real-time acrobatic gesture analysis," in Proc. Gesture Workshop, 2005, pp. 88–99.
[21] J. Konrad and F. Dufaux, "Digital Equipment Corporation, improved global motion estimation for N3," in Meeting of ISO/IEC/SC29/WG11, no. MPEG97/M3096, San Jose, CA, 1998.
[22] B. Qi and A. Amer, "Robust and fast global motion estimation oriented to video object segmentation," in Proc. IEEE Int. Conf. Image Process., 2005, pp. 153–156.
[23] T. Chiew, C. How, D. Bull, and C. Canagarajah, "Rapid block-based global motion estimation and its applications," in Proc. Int. Conf. Consumer Electron., 2002, pp. 228–229.
[24] N. Otsu, "A threshold selection method from gray level histogram," IEEE Trans. Syst. Man Cybern., vol. 9, no. 1, pp. 62–66, Jan. 1979.
[25] D. Farin, "Automatic video segmentation employing object/camera modeling techniques," Ph.D. dissertation, CIP-Data Library, Tech. Univ. Eindhoven, Eindhoven, The Netherlands, 2005.
[26] T. S. Chua, H. Feng, and A. Chandrashekhara, "An unified framework for shot boundary detection via active learning," in Proc. Int. Conf. Acoust. Speech Signal Process., 2003, pp. 845–848.
[27] A. K. Jain and R. C. Dubes, "Clustering methods and algorithms," in Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988, pp. 55–142.
[28] C. S. Myers and L. R. Rabiner, "A comparative study of several dynamic time-warping algorithms for connected word recognition," Bell Syst. Tech. J., vol. 60, no. 7, pp. 1389–1409, Sep. 1981.
[29] S. Wu, S.-X. Lin, and Y.-D. Zhang, "Automatic segmentation of moving objects in video sequences based on dynamic background construction," Chin. J. Comput., vol. 28, no. 8, pp. 1386–1392, Aug. 2005.
[30] T. Aach, A. Kaup, and R. Mester, "Statistical model-based change detection in moving video," Signal Process., vol. 31, no. 2, pp. 165–180, Mar. 1993.
[31] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vision, vol. 1, no. 4, pp. 321–331, Jan. 1988.
[32] M. K. Hu, "Visual pattern recognition by moment invariants," IEEE Trans. Inform. Theory, vol. 8, no. 2, pp. 179–187, Feb. 1962.
[33] R. Green and L. Guan, "Quantifying and recognizing human movement patterns from monocular video images-part I: A new framework for modeling human motion," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 2, pp. 179–190, Feb. 2004.
[34] J. Deutscher, A. Blake, and I. Reid, "Articulated body motion capture by annealed particle filtering," in Proc. Comput. Vision Pattern Recognit., 2000, pp. 126–133.
[35] D. M. Gavrila and V. Philomin, "Real-time object detection for smart vehicles," in Proc. Int. Conf. Comput. Vision, 1999, pp. 87–93.
[36] C.-T. Hsu and Y.-C. Tsan, "Mosaics of video sequences with moving objects," Signal Process. Image Commun., vol. 19, no. 11, pp. 81–98, Nov. 2004.
[37] http://www.dartfish.com/en/sports-enhancements/sport performance software/index.htm
[38] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. Assoc. Comput. Machinery, vol. 24, no. 6, pp. 381–395, Jun. 1981.
[39] P. J. Rousseeuw and K. Van Driessen, "Computing LTS regression for large data sets," Data Mining Knowl. Discovery, vol. 12, no. 1, pp. 29–45, Jan. 2006.
[40] F. S. Rovati, D. Pau, E. Piccinelli, L. Pezzoni, and J.-M. Bard, "An innovative, high quality and search window independent motion estimation algorithm and architecture for MPEG-2 encoding," IEEE Trans. Consumer Electron., vol. 46, no. 3, pp. 697–705, Aug. 2000.

Haojie Li received the B.E. degree in computer science from Nankai University, Tianjin, China, in 1996, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. From 1996 to 2001, he was with Yantai Post, Shandong, China, as a Software Engineer. From 2007 to 2009, he was a Research Fellow with the School of Computing, National University of Singapore, Singapore. Since December 2009, he has been an Associate Professor with the School of Software, Dalian University of Technology, Dalian, China. His research interests include computer vision, pattern recognition, and image/video retrieval.

Si Wu received the B.E. and M.S. degrees from Xiangtan University, Hunan, China, both in applied mathematics, in 1998 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2005. Since 2005, he has been a Project Manager with France Telecom Research and Development Beijing, Beijing, China. His current work focuses on multimedia value-added service research and development.

Yongdong Zhang received the B.E. and M.S. degrees from Dalian University of Technology, Dalian, China, both in signal and information processing, in 1995 and 1998, respectively, and the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. Since 2002, he has been an Associate Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include video coding and transcoding, video analysis and retrieval, and universal media access.

Shouxun Lin (M'99) received the Ph.D. degree in computer science from the Beijing Institute of Technology, Beijing, China, in 1998. Since 1995, he has been an Associate Professor and Professor (in 2000) with the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China. From 2000 to 2005, he was the Vice Director of the Digital Laboratory, ICT, CAS. His research interests include video coding, video content analysis, statistical machine translation, and evaluation of computer–human interaction.