
Systems and Computers in Japan, Vol. 29, No. 7, 1998
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J80-D-II, No. 9, September 1997, pp. 2421–2427

Scene Cut Detection and Article Extraction in News Video Based on Clustering of DCT Features

Yasuo Ariki

Faculty of Science and Technology, Ryukoku University, Otsu, Japan 520-21

SUMMARY

This paper proposes a method that automatically extracts individual articles from news videos. Each frame of the news video is compressed using the discrete cosine transform (DCT), and scene cuts are detected based on the DCT features obtained. The conventional method of cut detection is based on the difference between adjacent frames; misdetection may occur when the intensity of part or all of the image changes. This paper describes a solution to this problem obtained by clustering the frames of the news video, based on the property that consecutive frames of the same scene are similar. The news video has a syntax structure in which it moves from the studio to the site and then returns to the studio. This structure is observed as a loop in the set of detected cut-point frames. Consequently, the studio is recognized by detecting the loop, and the article is extracted. An experiment was performed on 30 days of NHK news. A cut detection rate of 87.9% and an article extraction rate of 99.2% were obtained. An article extraction experiment was also performed on ten days of programming from three commercial TV channels, and the effectiveness of the method was demonstrated. © 1998 Scripta Technica, Syst Comp Jpn, 29(7): 50–56, 1998

Key words: Cut detection; DCT; clustering; article extraction; news video.

1. Introduction

With the development of digital and multi-channel broadcasting, it is expected that news program broadcasting will increase. This environment creates a requirement from users for digests of a particular news subject, or for only the most interesting news, among a large number of news programs. A system that satisfies such a requirement is called a news-on-demand (NOD) system. There are studies of such systems from the viewpoint of multimedia databases [1–3].

The central function of an NOD system is to extract the text, video, and voice information that characterizes each news article and to provide an index. However, since the index must be prepared for individual news articles, in the NOD system it is necessary to extract the individual news articles from the news program. Manual extraction of the news articles represents a considerable waste of effort, since a tremendous number of articles must be handled. With this situation as background, our work aims at the automatic extraction by computer of news articles from the videos of news programs.

A news video is composed of several articles, and an article is composed of several scenes. Furthermore, a scene is composed of several frames. The image at the point of change between scenes (the cut-point frame) shows an outline of the entire video. In other words, in order to extract an article, it is not necessary to process all frames: it suffices to process only the cut-point frames [2, 4]. From this perspective, we attempt to detect the cut points of a news video as the first step and then to detect articles based on the result, so that the processing time is reduced.

It is also desired to compress frames from the viewpoint of storage of news videos, since the amount of data in each frame of a news video is tremendous. If the cut points can be detected from compressed frames, this is convenient in handling news video databases. From this viewpoint, we propose a method for news video frames in which the image is compressed by the DCT (discrete cosine transform) used in JPEG, and the cut point is then detected based on the changes in the DC and AC components obtained from the DCT [5].

In most conventional methods for cut detection, the frame is divided into several blocks. The DC components of the corresponding blocks in adjacent frames are examined, and if the difference exceeds some threshold, a cut point is detected. In other words, local changes are examined [4, 6]. However, misdetection may arise when the intensity of part or all of the image changes, as in the case of a camera flash or the turning on or off of a lamp.

In order to cope with this problem, we propose to form clusters of consecutive frames, based on the property that adjacent frames are similar. The cut points can then be detected as frames that separate two clusters. Using this approach, sensitivity to changes in intensity can be avoided.

A news program has an iterative structure in which a newscaster in the studio introduces the content of an article; this is followed by several scenes of the site, and then the scene comes back to the studio. In the set of detected cut-point frames, this structure appears as a loop with the studio scene as the start. Consequently, by detecting such a loop structure, the studio scene can be identified. In this paper, news articles are extracted based on the detected studio scenes [7].

In section 2 we describe the system configuration that extracts an article, and in section 3 we discuss the input method for image data. Sections 4 and 5 describe the proposed cut detection together with an evaluation of the approach. Sections 6 and 7 discuss article extraction based on cut detection.

2. System Configuration

The system that extracts individual articles from video news is composed of three units, as shown in Fig. 1: the image data input unit, the cut detector, and the article extraction unit. The operation is outlined as follows. The image data input unit digitizes the video news and stores the data on a hard disk. The cut detector divides the video news into several scenes. The article extraction unit divides the material into articles based on information concerning the start of the scenes (the cut-point frames). Each of these processing steps is described below.

Fig. 1. System organization.

3. Image Data Input

Five-minute NHK news programs were recorded on 8-mm tape for one month. As shown in Fig. 2, the video news was sampled at a rate of 30 frames/s by an Indigo2 computer. JPEG compression of quality 75% was applied to each frame, and the result was stored on a hard disk as a movie file in SGI format. The above processing can be executed in real time, using SGI tools and dedicated hardware for JPEG compression. When the image size is 320 × 240 pixels, a memory capacity of about 200 MB is required for five minutes of news.

Fig. 2. Flow of news frame input.
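As an illustration of the input stage, the following is a minimal sketch in Python, assuming the recorded news is available as an ordinary movie file readable by OpenCV rather than the SGI-format files used here; each frame is converted to grayscale before the DCT feature extraction of section 4.

```python
import cv2  # assumption: OpenCV stands in for the SGI tools used in the paper


def iter_gray_frames(path):
    """Yield grayscale frames from a movie file, one per stored frame."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cap.release()
```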

4. Detecting Scene Cuts

4.1. Existence of scene clusters

In the same scene, adjacent frames are similar and their DCT components are likewise close. When the scene changes, the DCT components change greatly. In other words, it is expected that the frames of the same scene form a cluster when represented in terms of the DCT components.

In order to verify this idea, a preliminary experiment was carried out in which each frame of the news video was represented by its DCT components, principal component analysis was applied to the representation, and the formation of the clusters was examined in two dimensions. The news video used in the experiment was composed of 1594 frames (about 53 s), partitioned into three scenes.

Figure 3 shows the result. Clusters are clearly formed for scenes 1 and 2 in the figure. Scene 3 does not stay within a single cluster but is connected to another cluster, Scene 3′, by a curved line. This is due to the camera work. It is evident that the clusters of Scene 3 and Scene 3′ correspond to the start and the end of the scene, respectively, and that the camera work is not contained in either.

Fig. 3. Two-dimensional drawing of news image sequence by principal component analysis.

Based on the above preliminary experiment, the cuts are detected in this study by forming frame clusters instead of by detecting interframe differences as in the conventional method.
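The two-dimensional drawing of Fig. 3 can be reproduced by ordinary principal component analysis of the per-frame DCT feature vectors; a minimal sketch, assuming the features are stacked in a frames-by-dimensions matrix:

```python
import numpy as np


def pca_2d(feature_matrix):
    """Project per-frame feature vectors onto the first two principal
    components (for plots such as Fig. 3)."""
    X = feature_matrix - feature_matrix.mean(axis=0)
    # rows of vt are the principal axes of the centered data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```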

4.2. Detection of scene cuts

4.2.1. Cluster formation

As is shown in Fig. 4(a), the video news is composed of a number of frames. Each frame is separated into L × M blocks, as shown in Fig. 4(b). When the DCT is applied to each block, the power is concentrated in the lower frequencies. By extracting the K lower-order DCT components, which correspond to the lower frequencies, from each block, L × M × K feature parameters are obtained for the entire frame. These feature parameters are represented by a single point in multidimensional space.

The sets of feature parameters derived from associated scenes are expected to lie close to each other in the multidimensional space, forming a cluster. When the scene changes, the point in multidimensional space escapes from the cluster. In other words, a cut can be detected from this escape behavior.

Fig. 4. Process of cluster creation.
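As a concrete illustration of this feature extraction, the following sketch computes the per-block DCT features with NumPy and SciPy, assuming grayscale frames as two-dimensional arrays; the 40 × 40 block size and the choice of the DC component plus the first vertical and horizontal AC components follow the parameter settings reported in section 5.1.

```python
import numpy as np
from scipy.fft import dctn


def frame_features(frame, block=40, coeffs=((0, 0), (0, 1), (1, 0))):
    """Blockwise DCT features: for each block keep the DC component and the
    first vertical and horizontal AC components (K = 3 per block)."""
    h, w = frame.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            d = dctn(frame[y:y + block, x:x + block].astype(np.float64),
                     norm="ortho")
            feats.extend(d[i, j] for i, j in coeffs)
    return np.array(feats)
```

For a 320 × 240 frame this yields L × M = 48 blocks and a 144-dimensional feature vector per frame.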
4.2.2. Division of clusters

Escape from a cluster is determined by the over-number, which is defined as follows. The cluster is approximated by a normal distribution without correlation between axes, as in Fig. 4(c). Let the mean and the standard deviation along the i-th axis be m_i and σ_i, respectively. Then, the over-number is the number of DCT components x_i of the input frame for which condition (1) holds, that is, for which the deviation |x_i − m_i| exceeds a threshold determined by σ_i.

A frame for which the over-number exceeds (L × M × K)/2 is called an out-frame. When more than P out-frames occur in succession, it is decided that the point has escaped from the cluster, and this constitutes cut detection. By using this approach, cut misdetection can be eliminated for short-term variations, such as flashes or the turning on or off of lamps.

When a cut is detected, a new cluster is formed based on the frames in the following second. If a cut is not detected afterward, the mean and the variance of the cluster are updated using the new input frame.
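A minimal sketch of this clustering-based detection follows; it is not the implementation used in the experiments. The threshold of Eq. (1) is replaced by an assumed multiple theta of the per-axis standard deviation, and the cluster statistics are simply recomputed from the stored member frames rather than updated incrementally.

```python
import numpy as np


def detect_cuts(features, p=10, theta=3.0, fps=30):
    """Cluster-escape cut detection sketch.  `features` is a temporally
    ordered list of per-frame DCT feature vectors (see frame_features).
    `theta` is an assumed stand-in for the threshold of Eq. (1)."""
    dim = len(features[0])
    cuts = []
    start = 0
    while start < len(features):
        members = list(features[start:start + fps])  # seed cluster: next second
        t = start + len(members)
        run = 0
        cut_at = None
        while t < len(features):
            mean = np.mean(members, axis=0)
            std = np.std(members, axis=0) + 1e-9
            over = int(np.sum(np.abs(features[t] - mean) > theta * std))
            if over > dim // 2:                      # out-frame
                run += 1
                if run > p:                          # escape confirmed: cut detected
                    cut_at = t - run + 1             # first out-frame marks the cut
                    break
            else:
                run = 0
                members.append(features[t])          # update cluster statistics
            t += 1
        if cut_at is None:                           # reached end of video
            break
        cuts.append(cut_at)
        start = cut_at                               # next cluster starts at the cut
    return cuts
```

With P = 10 at 30 frames/s, the escape must persist for more than a third of a second before a cut is declared, which is what suppresses flashes and lamp switching.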

4.2.3. Problems and solutions in cluster formation

When a cluster is formed in a scene containing rapid movement, the normal distribution spreads due to large variations in the image. Then, the point does not escape from the cluster even if the scene changes. It should be noted that the DC component, among the DCT components, is insensitive to movement in the image although it is affected by intensity. Consequently, it is expected that misdetection can be avoided in images containing much movement, such as camera work, by increasing the weight of the DC component relative to the AC components. From this perspective, the DC and the AC components are not weighted equally in cut detection. The weight is defined so that the DC component has a larger effect on the over-number when the first AC component has a larger variance in the cluster.
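The exact weighting formula is not given here, so the following is only a hypothetical reading of the idea: the contribution of the DC axes to the over-number is scaled up by a factor derived from the variance of the first AC axes within the current cluster.

```python
import numpy as np


def weighted_over_number(x, mean, std, theta=3.0):
    """Hypothetical weighted over-number (the exact formula is not given in
    the paper).  Feature layout follows frame_features above: per block,
    index 0 is the DC term and indices 1 and 2 are the first AC terms."""
    violations = (np.abs(x - mean) > theta * std).astype(float)
    ac_var = float(np.mean(std[1::3] ** 2 + std[2::3] ** 2))
    weights = np.ones_like(std)
    # assumed scaling: DC counts more when the AC axes vary strongly
    weights[0::3] = 1.0 + ac_var / (np.mean(std ** 2) + 1e-9)
    return float(np.sum(weights * violations))
```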

5. Evaluation of Cut Detection

5.1. Evaluation experiment

An evaluation experiment was performed for the cut detection described in section 4.2. Five-minute NHK news videos for 15 days were used as the data for evaluation. The parameters were determined based on a preliminary experiment [5]. The optimal block size was 40 × 40 pixels. Consequently, the number of blocks L × M was set to 48. Three DCT components were used per block: the DC component together with the vertical and horizontal components of the first AC frequency. The feature parameters L × M × K for each frame thus span 144 (48 × 3) dimensions. The number of out-frames P was set to 10 based on the results of the preliminary experiment.

The news videos for 15 days contained 455 cut points in total. In the evaluation experiment, 55 cut points were not detected, and there were 41 misdetected cut points. Table 1 shows the results. The success rate and the match rate in Table 1 are defined as follows:

success rate = (number of correctly detected cut points) / (number of actually existing cut points)    (2)

match rate = (number of correctly detected cut points) / (total number of detected cut points)    (3)

When the success rate is high, there are fewer overlooked cut points (missed detections). When the match rate is high, there are fewer false detections of cut points. The 400 correctly detected cut points out of the 455 existing cut points correspond to a success rate of 87.9%, the cut detection rate quoted in the summary.

Table 1. Scene cut detection rate (%) (number of cut points)

5.2. Class of undetected cut points

Table 2 shows the classes of the 55 undetected cut points. The features and reasons for the undetected cut points can be summarized as follows.

Table 2. Classification of undetected cut points

(A) Cut point immediately following a scene with rapid movement

For cut points immediately after a scene with particularly rapid movement of the object or the camera, the situation is improved by weighting the DC component in proportion to the variance of the first AC component, as was previously described. However, 21 cut points remain undetected.

(B) Cut point with sliding change

This is a cut point where the preceding scene changes to the next scene with a vertical border line sliding in the horizontal direction. The change is then slow, since the cut point extends over several frames, and may be absorbed by the clustering process.

(C) Cut point immediately after a short scene of less than one second

In the proposed method, when a cut point is detected, a new cluster is formed based on the frames in the following second. Consequently, if there is a scene shorter than one second, the new cluster contains not only the features of that short scene but also the features of the next scene. This prevents detection of the cut point.

(D) Cut point with dissolving change

This is a case in which the preceding scene and the succeeding scene overlap and the scene is changed by cross-fading. It is a cut point produced by an artificial process.

(E) Cut point with zooming change

This is a cut point with an artificial process, as in (B), where the scene is changed by a circular border line gradually expanding.

6. Extraction of Articles

6.1. Method of article extraction

A news program has an iterative structure in which the newscaster introduces the content of the article in the studio; this is followed by several scenes of the site, and then the scene returns to the studio. Consequently, the studio scene can be detected based on this structure, and the news articles can be extracted based on the detected studio scene. In detecting the studio scene, the concept of the loop point is utilized. The only scenes that appear iteratively are the studio scene and particular scenes such as those that appear in sports news. Such an iterative scene is extracted as a loop point, and then the studio scene is detected.

The article extraction process can be separated into three steps, as in Fig. 5. Following the flow shown in the figure, the article extraction process is described below.

Fig. 5. Flow of news article extraction.
6.2. Loop detection

The transition of the scenes composing the news video can be represented as in Fig. 6. In the figure, each detected cut-point frame is indicated by a black square. Since similar scenes are located close together, the news video forms loops, repeatedly returning to the studio scene. The proposed method detects the cut point at the start of the loop (the loop point).

Fig. 6. Loop points and cut points.

The algorithm for loop point detection is as follows. Initially, the cut-point frame m at the head of the news video is fixed. By moving the cut-point frame n forward along the time axis, the Euclidean distance d(m, n) between the two frames is calculated as

d(m, n) = \sqrt{\sum_{i=1}^{144} (x_{mi} - x_{ni})^2}    (4)

where x_{mi} is the i-th component among the 144-dimensional DCT components of frame m.

Calculating the distance using Eq. (4), if there exists a frame n with a distance less than a certain threshold, frames m and n are identified as loop points. Then, frame n is moved further forward, and the set of cut-point frames belonging to the same loop point is determined. Next, cut-point frame m is moved forward, and another loop point is identified. It may happen, as shown in Fig. 6, that a small loop point is formed, but a long loop such as that of the studio scene is not formed.
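As an illustration, the following sketch groups the detected cut-point frames into loop points using the distance of Eq. (4); the distance threshold is an assumption of this sketch and would in practice be tuned on held-out news, as the thresholds in section 7 were.

```python
import numpy as np


def find_loop_points(cut_features, threshold):
    """Group cut-point frames that return to a visually similar scene.
    cut_features: list of 144-dim DCT feature vectors, one per detected
    cut-point frame, in temporal order.  threshold is an assumed value."""
    loops = {}          # loop-point index -> indices of matching cut frames
    assigned = set()
    for m in range(len(cut_features)):
        if m in assigned:
            continue
        members = [m]
        for n in range(m + 1, len(cut_features)):
            if n in assigned:
                continue
            d = np.linalg.norm(cut_features[m] - cut_features[n])  # Eq. (4)
            if d < threshold:
                members.append(n)
                assigned.add(n)
        if len(members) > 1:        # the scene recurs: m is a loop point
            loops[m] = members
        assigned.add(m)
    return loops
```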
6.3. Identification of studio scene

Based on the loop points determined by the loop detection process, the studio scene containing the newscaster is identified. In some special articles (such as sports news), it may happen that scenes at the same position and with the same angle are repeatedly inserted, and loops are formed at short intervals. The loop of the studio scene containing the newscaster, however, can be separated, since it continues for a long time.

Figure 7 shows the situation. There exist two long loops with loop point 1 as the start, as well as three short loops with loop point 2 as the start. The length of the portion between arrows in the figure is proportional to the number of frames in the loop. The studio scene is the start of a loop with long duration and should correspond to loop point 1. In order to identify the studio scene, it suffices to examine the average number of frames f in a loop, as defined by Eq. (5), for each loop point and to select the loop point with the maximum f.

Fig. 7. Loop points and studio scene.

f = \frac{1}{N} \sum_{i=1}^{N} n_i    (5)

In Eq. (5), N is the number of loops exiting from the considered loop point, and n_i is the number of frames in each loop.
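A sketch of this selection step, building on the hypothetical find_loop_points above: for each loop point, the loop lengths are taken as the frame-number gaps between consecutive matching cut frames, and the loop point with the largest average length (Eq. (5)) is taken as the studio scene.

```python
def select_studio_loop_point(loops, cut_frame_numbers):
    """Pick the loop point whose loops are longest on average (Eq. (5)).
    loops: mapping loop-point index -> sorted indices of cut frames that
    match it; cut_frame_numbers maps a cut index to its frame number."""
    best, best_f = None, -1.0
    for lp, members in loops.items():
        # each consecutive pair of matching cut frames bounds one loop
        lengths = [cut_frame_numbers[b] - cut_frame_numbers[a]
                   for a, b in zip(members, members[1:])]
        f = sum(lengths) / len(lengths)   # average number of frames per loop
        if f > best_f:
            best, best_f = lp, f
    return best
```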
6.4. Article extraction

When the article is extracted based only on the cut points obtained by cut detection, as in Fig. 6, the article fails to be extracted when the studio scene is not detected in the cut detection process. Consequently, when the studio scene is identified by loop detection, that scene is searched for among all frames in the news video, and then the article is extracted. By separating the identification of the studio scene from the extraction of the article, articles can be extracted from various news videos.

More precisely, the article is extracted as follows. The top frame of the studio scene, identified by the method described in the previous section, is extracted. By comparing that frame with all frames in the news video, the distances are calculated. In the calculation, each frame is separated into blocks of 8 × 8 pixels and, by calculating the difference pixelwise in each block, the absolute value of the sum is computed. The distance is smaller for similar frames. Consequently, frames with small distances are identified as frames of the studio scene, and then the articles are extracted.
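A sketch of this block-difference comparison is given below; the translated description is ambiguous as to whether the absolute value is taken of each block sum or of each pixel difference, so the per-block reading used here is an assumption.

```python
import numpy as np


def frame_distance(frame_a, frame_b, block=8):
    """Block-based distance between two grayscale frames: per 8x8 block,
    sum the pixelwise differences, take the absolute value of that sum,
    and accumulate over all blocks (an assumed reading of section 6.4)."""
    h, w = frame_a.shape
    diff = frame_a.astype(np.int32) - frame_b.astype(np.int32)
    total = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            total += abs(int(diff[y:y + block, x:x + block].sum()))
    return total
```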
As a preliminary experiment, the studio scenes were specified by manual inspection, and we then tested to what extent the system was able to determine the studio scene. Figure 8 shows the result. The horizontal axis of the figure is the frame number, and the vertical axis is the distance from the studio scene to each frame. When the value is small, it implies that the considered frame is close to the studio scene. It can be seen from the figure that the studio scene frame is clearly discriminated from the other frames.

Fig. 8. Frame distance from the estimated studio frame.

7. Evaluation Experiment for Article Extraction

An evaluation experiment was performed for article extraction. The materials used in the experiment were five-minute NHK newscasts for 30 days and the newscasts of three commercial broadcast programs (A, M, and Y). The news of each of the three commercial broadcasts lasts three minutes, and the news for ten days was recorded. The threshold in the processing was determined based on five days of NHK news. In the evaluation experiment, the identification rate of the studio scene and the article extraction rate were determined.

Table 3. Evaluation of article extraction (%)

Table 3 shows the result. In the identification of the studio scene, a 100% identification rate was obtained for the NHK news for 30 days. In extracting the articles based on the studio scene, an error of 0.8% was produced. This is due to the fact that there was a shift in the camera angle in the studio scene, which produced a two-second delay in extracting the frames relative to the actual scene. The identification rate was 90% in one of the commercial broadcasts. This is due to the fact that there are several cameras in the studio scene, which are switched from time to time. This resulted in failures of loop point detection and of studio scene identification.

8. Conclusions

In this paper, the DCT, which has been used for image compression, is applied to news video, and cuts are detected by forming scene clusters based on the DCT features obtained. The news video has a syntax structure in which the scene moves from the studio to the site and then comes back to the studio. By detecting loops based on this property, the studio scene is identified. Based on the identified studio scene, all frames of the studio scene are extracted, and the articles are then extracted.

In this study, the studio scene is identified based on the detected cut points. By comparing that scene with all frames in the news video, the articles are extracted without being much affected by the cut detection rate. For future study, we plan to construct a system that integrates these results with character recognition and speech recognition so as to retrieve the news articles.

Acknowledgments. The author is grateful for the assistance of Miss Y. Saito and Miss A. Odagiri in the data collection and experiments.

REFERENCES

1. K. Mitsui, S. Shimojo, S. Nishio, and H. Miyahara. Realization of news-on-demand system based on scenario database. Tech. Rep. I.E.I.C.E., DE96-2 (1996).
2. Y. Nakajima, H. Hori, T. Kano, and T. Shiobara. TV news retrieval based on similar image search. Tech. Rep. Image Elect., 145, pp. 17–20 (1995).
3. A. Ando and T. Imai. Broadcast program request system based on speech recognition. Tech. Rep. Aud. Vis. Com., Inf. Proc. Soc., 10-4, pp. 25–30 (1995).
4. K. Ohtsuji, Y. Tonomura, and Y. Ohniwa. Video cut detection. Tech. Rep. I.E.I.C.E., IE91-116 (1991).
5. E. Iwanari and Y. Ariki. Scene clustering and cut detection based on DCT components. Tech. Rep. I.E.I.C.E., PRU93-119 (1994).
6. A. Nagasaka and Y. Tanaka. Detection of cut change in video image. Rep. Construction of Self-Org. Inf. Database for Creative R&D Support, pp. 120–127, Sci. Tech. Agency (1992).
7. Y. Saito and Y. Ariki. Toward news video database—detection of news studio scene and article extraction. Tech. Rep. Image Elect., 95-04-04, pp. 13–16 (Nov. 1995).

AUTHOR

Yasuo Ariki (member) graduated in 1974 from the Dept. Inf., Kyoto University. He completed the Master’s Program in
1976 and doctoral program in 1979. In 1980 he was a research associate in the Dept. Inf. at Kyoto University; he became an
associate professor in 1990 and a professor in 1992. He earned his D.Eng. from Ryukoku University. From 1987 to 1990 he
was a visiting researcher at Edinburgh University. He is engaged in research on image processing and speech information
processing. He is a member of the Information Processing Society, the Acoustical Society of Japan, the Society of Artificial
Intelligence, the Image Electronics Society, and IEEE.

