Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info
Available online 3 January 2013

Keywords: Affective content; Video highlights extraction; Music detection; Video highlights evaluation

Abstract

This paper describes a novel system that uses music emotion and human face as features for automatic highlights extraction for drama video. These high-level audiovisual features are used because music evokes an emotional response from the viewer and characters express emotion on their faces. In addition, a novel scheme is developed to improve the accuracy of music emotion recognition in drama video. Specifically, emotion recognition is performed not on the input audio signal but on the noise-free music available from the album of the incidental music, with the presence of incidental music detected by an audio fingerprint technique. Besides the conventional subjective evaluation, we propose a new metric for quantitative performance evaluation of highlights extraction. Experiments conducted over four different types of drama videos demonstrate that the proposed system significantly outperforms baseline ones in terms of both subjective and objective measures.

© 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2012.03.034
emotion response from the listener [10,11]. In fact, directors of drama video often bring the viewer's emotion to a climax through the use of incidental music, and the presence of actors and actresses in a sequence of consecutive frames is often indicative of a pinnacle moment of the drama (the longer the sequence, the stronger the highlight). Therefore, we use face and music emotion to improve the accuracy of highlights extraction and incorporate these two high-level features in our system.

We also investigate the relation between highlights and two low-level visual features, shot duration and motion magnitude, which directors exploit to generate emotional effects. Specifically, short shot duration and high motion magnitude are used to highlight an action scene, whereas long shot duration and low motion magnitude are used to highlight a romantic scene [10].

In short, the contribution of this paper is three-fold. First, we present a system that integrates information from music emotion, human face, shot duration, and motion magnitude for highlights extraction of drama video (Sections 2-4). Second, we devise a novel two-stage music emotion recognition scheme and a novel adaptive audio fingerprint technique that improve the accuracy of emotion recognition for incidental music (Section 2); this matters because no existing music emotion recognition system can handle music corrupted by noise. Third, we propose a novel distance metric that can be utilized to evaluate the performance of highlights extraction in a quantitative fashion (Section 5). To the best of our knowledge, few quantitative evaluations of highlights extraction, if any, have been reported in the literature.
2. Music detection and music emotion recognition

The input drama video consists of audio and visual signals, which are processed separately. For the audio signal, the system detects the presence of incidental music and employs the well-known MIRtoolbox [12] for music emotion recognition of the incidental music. We develop a novel music emotion recognition scheme (Fig. 1) to avoid the interference of the speech signal and environmental noise that are usually blended with music in a drama video; in fact, no existing music emotion recognition system can handle noisy music. As is usually the case, we assume that the album containing all the incidental music used in a drama video, a.k.a. the original soundtrack, is available when the drama video is released. Music emotion recognition is performed on the music provided in the album, not on the input audio signal. Emotion recognition is performed on the clean data because the input audio signal may be corrupted with speech and environmental noise, which usually degrade the accuracy of emotion recognition. Furthermore, given an album, we use audio fingerprinting [13] to detect the presence of each piece of incidental music and the specific portion in which it is played. Another advantage of the audio fingerprint approach is that it is free of the time-consuming and labor-intensive labeling of training data that is typically required for a conventional machine learning approach [14].

Fig. 1. The proposed two-stage music emotion recognition scheme that avoids the interference of speech signal and environmental noise.
2.1. Adaptive music detection by audio fingerprint

As a content-based signature, an audio fingerprint has been used to characterize or identify an audio sample [13]. For example, the audio search engine developed by Wang [15,16] applies a short-time Fourier transform to an audio segment and chooses local peaks as landmarks. The differences of time and frequency values (expressed by a hash table) between landmarks of neighboring time windows constitute the fingerprint. The more hash values that match between two audio segments, the more likely the two segments originate from the same song. Since the landmarks have relatively high energy, the audio fingerprint is robust to noise. The audio fingerprint technique used in our system is fast and accurate; in fact, the search time is on the order of 5-500 ms.
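To make the landmark scheme concrete, the following Python sketch (our illustration, not the authors' implementation; the peak-picking rule, window sizes, and pairing limits are assumptions) extracts spectral peaks from a short-time Fourier transform and hashes pairs of nearby peaks, in the spirit of Wang's method [15,16]:

    import numpy as np
    from scipy.signal import stft

    def landmark_hashes(audio, sr, n_fft=1024, hop=512, fan_out=5, max_dt=32):
        # Magnitude spectrogram: frequency bins x time frames.
        _, _, Z = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        mag = np.abs(Z)
        # Landmarks: local maxima along frequency that exceed the frame mean
        # (a simple stand-in for the high-energy peak selection).
        peaks = []
        for t in range(mag.shape[1]):
            col = mag[:, t]
            for f in range(1, len(col) - 1):
                if col[f] > col[f - 1] and col[f] > col[f + 1] and col[f] > col.mean():
                    peaks.append((t, f))
        # Hash = (anchor frequency, paired frequency, time offset) for pairs
        # of landmarks in neighboring time windows.
        hashes = set()
        for i, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
                if 0 < t2 - t1 <= max_dt:
                    hashes.add((f1, f2, t2 - t1))
        return hashes

Two segments are then compared by counting matched hash values, e.g. len(hashes_a & hashes_b), which is the similarity score used below.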
The similarity of two audio segments is indicated by the number of matched hash values. In our case, one of the two audio segments is an input audio segment, and the other is a segment of an incidental music track in the album. A binary decision is then made according to the similarity score to determine whether the input audio segment is music or not. The threshold used for the binary decision is of fundamental importance because it controls the accuracy of music detection. However, since the audio signal of drama video is a blend of speech and music signals, the similarity between a pure music segment and a blended audio segment is not as high as that between two pure music segments. Such a pair of audio segments can be erroneously classified as dissimilar if a constant threshold is used. For highlights extraction, we want to detect all music segments, including those that are blended with speech.

We propose an adaptive technique that automatically determines the threshold through the use of the low short-time energy ratio r of the input audio segment. The ratio r is obtained by counting the number of frames whose short-time energy e_n is smaller than one half of the average short-time energy \bar{e} and dividing the resulting number by N, the total number of frames in a time window. Specifically,

    r = \frac{1}{2N} \sum_{n=0}^{N-1} \left[ \operatorname{sgn}\left(0.5\,\bar{e} - e_n\right) + 1 \right]    (1)

where sgn(.) is the sign operator that yields 1 for positive input and -1 for negative input, and e_n is the short-time energy of the nth frame, computed on a frame basis. If an audio signal has a high r value, it has more silence frames. In general, speech signals have more silence frames than music signals; therefore, we can discriminate speech from music based on the r value.
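Eq. (1) translates directly into code; a minimal sketch (the frame length and hop size are our assumptions, not values specified in the paper):

    import numpy as np

    def low_short_time_energy_ratio(x, frame_len=1024, hop=512):
        # Short-time energy e_n of each frame in the window.
        n_frames = 1 + max(0, (len(x) - frame_len) // hop)
        e = np.array([np.sum(x[i * hop:i * hop + frame_len].astype(float) ** 2)
                      for i in range(n_frames)])
        # sgn(0.5*mean(e) - e_n) + 1 equals 2 for low-energy frames and 0
        # otherwise, so dividing the sum by 2N gives the fraction of
        # low-energy ("silence") frames, exactly as in Eq. (1).
        return float(np.sum(np.sign(0.5 * e.mean() - e) + 1.0) / (2.0 * len(e)))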
Then the adaptive threshold is determined empirically by observing the relationship between the value of r and the number of matched hash values. The r values and their corresponding thresholds are shown in Table 1, where the range of r is partitioned evenly into five bands and a different threshold value is assigned to each band. As we can see, the threshold increases with the r value. This is desirable because when the input audio is more likely to be a speech signal, as judged by its r value, we raise the threshold to avoid possible false matches and thereby achieve better accuracy in music detection.

Table 1
Adaptive threshold associated with the low short-time energy ratio.

    r          Threshold
    0.0-0.2    3
    0.2-0.4    4
    0.4-0.6    5
    0.6-0.8    6
    0.8-1.0    7
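The music/non-music decision then combines the matched-hash count with the band-dependent threshold of Table 1; a sketch (the "at least threshold" comparison rule is our reading of the text):

    def adaptive_threshold(r):
        # Table 1: five even bands of r over [0, 1] and their thresholds.
        for upper, threshold in [(0.2, 3), (0.4, 4), (0.6, 5), (0.8, 6), (1.0, 7)]:
            if r <= upper:
                return threshold
        return 7  # guard for r marginally above 1 due to rounding

    def is_music(input_hashes, track_hashes, r):
        # Declare the input segment "music" if its matched-hash count
        # against an album track reaches the threshold for its r band.
        return len(input_hashes & track_hashes) >= adaptive_threshold(r)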
2.2. Music emotion recognition

Instead of representing emotions as discrete labels such as anger and happiness, we approach the problem from a dimensional perspective and define emotions as points in a three-dimensional space [17], the three axes of which are arousal (exciting or calming), valence (positive or negative), and dominance (a sense of control or freedom to act). This dimensional approach avoids the ambiguity and granularity issues inherent to discrete labels [18]. In addition, it allows one to intuitively treat emotion points that are far away from the origin of the 3-D emotion space as emotions of high intensity [19]. Based on this framework, we apply machine learning models to predict the emotion values of each short-time music segment with respect to the three dimensions and take the Euclidean distance between the predicted emotion point and the origin as the highlight score of the music segment.

Specifically, we apply the MIRtoolbox [12] to predict the emotion values of a music segment. The training dataset of the emotion value prediction module is composed of 110 audio sequences of film music (each sequence is annotated by 116 human subjects on average). A number of music features (timbre, harmony, rhythm, etc.) are extracted, and multivariate regression [20] is employed to learn the relationship between the music features and the emotion values; one regression model is trained for each emotion dimension. As the characteristics of film music are close to those of the music used in drama videos, we believe the MIRtoolbox is applicable to our system.
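Given the three per-segment predictions, the music highlight score is simply the Euclidean norm of the emotion point; for instance (the predicted values here are placeholders standing in for the MIRtoolbox regression outputs):

    import numpy as np

    def music_highlight_score(arousal, valence, dominance):
        # Distance of the predicted (arousal, valence, dominance) point from
        # the origin of the 3-D emotion space; farther = more intense emotion.
        return float(np.linalg.norm([arousal, valence, dominance]))

    # Example: a segment predicted at (0.8, -0.6, 0.3) scores about 1.04.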
3. Visual features extraction

It has been shown that the presence of human face in video often catches the viewer's attention and invokes emotion [9,21]. Shot duration and motion magnitude also impact the affective response of a viewer [5]. This section describes how our system uses human face, shot duration, and motion magnitude as features for highlights extraction of drama video.
3.1. Human face

The use of face as a feature for highlights extraction in our system is based on two main observations. First, the highlight scenes of a drama video often show the interactions between characters; human face therefore represents an important candidate feature for highlights extraction. Second, in the context of video highlights, the size of a face does not correlate well with its importance. For example, the two frames shown in Fig. 2 both represent highlights of a TV drama, but the sizes of the human faces are obviously quite different. Therefore, we direct our attention to the number of human faces that appear in each frame. Typically, when more faces appear in a frame, there are more interactions between the characters, and this attracts more of the viewers' attention. However, faces that are too small (for example, less than 5% of the frame size) to invoke an affective response are not considered. In our system, we adopt the algorithm described in [22] for face detection.

The highlight score of human face is obtained as follows. Each frame is initially assigned a score equal to the number of faces detected in the frame. An example sequence of initial scores is shown in Fig. 3. Then the initial score of each frame is propagated to its two neighboring frames. Finally, the initial score of each frame is summed with the initial scores of the two neighboring frames to form the final score (called the smoothed score because the summation is in essence a temporal smoothing operation).

Fig. 2. Two highlight frames of the drama Flower Shop without Rose.

Fig. 3. An illustration of the calculation of the highlight score of human faces for the drama video Flower Shop without Rose.
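In code, the face score is the per-frame face count smoothed over a three-frame window; a sketch (face detection itself, per [22], is assumed to be given, and faces below the 5% size threshold are assumed already excluded from the counts):

    import numpy as np

    def face_highlight_scores(face_counts):
        # Initial score = number of detected faces per frame.  Each score is
        # then summed with its two neighbors, i.e., a three-tap temporal
        # smoothing of the initial scores.
        c = np.asarray(face_counts, dtype=float)
        padded = np.pad(c, 1)  # zero-pad both ends of the sequence
        return padded[:-2] + padded[1:-1] + padded[2:]

    # Example: face_highlight_scores([0, 2, 1, 0]) -> [2., 3., 3., 1.]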
3.2. Shot duration

Shot duration is used as a feature in our system for highlights extraction. It has been found that short-duration shots can invoke higher arousal than other shots [5], which accounts for the fact that short-duration shots are often part of the highlight scenes of a drama video. On the other hand, directors often use long-duration shots to draw the viewer's attention to specific scenes such as romance and slow motion [10], which is why one can also find long-duration shots in a highlight scene. On the basis of these observations, we propose to detect both short- and long-duration shots in the highlights extraction process.

In our system, the strength of a shot as a candidate of the highlights is modeled by an exponential function. Let the highlight strength of a frame in the kth shot be denoted by s_k and the shot duration by n_k - p_k; then

    s_k = e^{-(n_k - p_k)}    (2)

where p_k is the index of the first frame and n_k the index of the last frame of the kth shot. As we can see, the shorter the shot duration, the higher the s_k. To obtain a high score for both short- and long-duration shots, we measure the highlight score of a shot by its distance from the average highlight strength. More precisely, the highlight score \hat{s}_k of any frame in the kth shot is obtained by

    \hat{s}_k = \left| s_k - \bar{s} \right|    (3)

where \bar{s} is the average of the s_k's of an episode.
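A sketch of Eqs. (2)-(3); note that Eq. (2) is garbled in this copy, so the exact exponential form used below is our assumption:

    import numpy as np

    def shot_duration_scores(first_frames, last_frames):
        # Shot duration n_k - p_k per shot; highlight strength decays
        # exponentially with duration (assumed form of Eq. (2)).
        durations = np.asarray(last_frames, dtype=float) - np.asarray(first_frames)
        s = np.exp(-durations)
        # Eq. (3): distance from the episode average, so that both very
        # short and very long shots receive high scores.
        return np.abs(s - s.mean())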
3.3. Motion magnitude

A motion vector contains both magnitude and orientation information. Unlike motion orientation, motion magnitude is a good indication of a highlight. Similar to the case of shot duration, a highlight scene of drama video may not strictly contain fast-motion frames; slow-motion frames can be part of the highlights as well. For example, a scene showing tears slowly trickling down an actress's face can invoke strong emotional responses from the viewers [10]. Therefore, we propose to consider both fast- and slow-motion frames in the highlights extraction process.

The detection of fast- and slow-motion frames consists of two steps. First, we compute the normalized average motion magnitude a_k over all blocks within the kth frame

    a_k = \frac{\sum_{i=1}^{I} \|\vec{v}_k(i)\|}{I \, \|\vec{V}_k\|}    (4)

where \vec{v}_k(i) is the motion vector of the ith block, I is the total number of motion vectors, and \vec{V}_k is the largest motion vector of the frame. As we can see, the average motion is normalized with respect to the largest motion vector.

Second, to signify both fast- and slow-motion frames and suppress average-motion frames, we measure the highlight score \hat{a}_k of the kth frame of a drama video by

    \hat{a}_k = \left| a_k - \bar{a} \right|    (5)

where \bar{a} is the average of the a_k's of an episode. This method turns out to be simple yet effective for sifting out video frames with either fast or slow motion.
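A sketch of Eqs. (4)-(5); the per-frame block motion vectors are assumed to come from the video decoder or a block-matching step:

    import numpy as np

    def motion_highlight_scores(frame_motion_vectors):
        # frame_motion_vectors: one (I, 2) array of block motion vectors
        # per frame.
        a = []
        for mv in frame_motion_vectors:
            norms = np.linalg.norm(np.asarray(mv, dtype=float), axis=1)
            # Eq. (4): average magnitude normalized by the largest vector
            # (a small floor avoids division by zero on static frames).
            a.append(norms.sum() / (len(norms) * max(norms.max(), 1e-9)))
        a = np.asarray(a)
        # Eq. (5): distance from the episode average, so both fast- and
        # slow-motion frames score high while average motion is suppressed.
        return np.abs(a - a.mean())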
4. Highlights extraction procedure

Fig. 4 shows the block diagram of the proposed system. Given a drama video, the system processes the input audio and video signals separately. It first extracts the audiovisual features and computes the highlight score of each feature using the methods described in Sections 2 and 3. Then it linearly combines the highlight scores to obtain the overall highlight score, denoted by H, for each second of the drama video. Finally, it extracts those video segments with a highlight score higher than a threshold. The threshold is automatically determined by the system according to the desired length of the highlight sequence.

The overall highlight score is computed as a weighted sum of the four highlight scores

    H = f_M H_M + f_F H_F + f_S H_S + f_A H_A    (6)

where H_M, H_F, H_S, and H_A are the scores of music emotion, human face, shot duration, and motion magnitude, respectively, and f_M, f_F, f_S, and f_A are the corresponding weighting factors. In our system, each of the four scores is normalized to the range [0, 1], and the weights sum to one. In our experiments, the four weights are simply set to [0.3, 0.3, 0.2, 0.2], respectively, to place more emphasis on the high-level features than on the low-level ones. Our evaluation also shows that the high-level features are indeed more useful for highlights extraction (cf. Section 6.1).
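A sketch of the fusion step of Eq. (6) and the length-driven threshold; the min-max normalization is our assumption (the paper states only that each score is normalized to [0, 1]):

    import numpy as np

    def overall_highlight_scores(h_music, h_face, h_shot, h_motion,
                                 weights=(0.3, 0.3, 0.2, 0.2)):
        # Eq. (6): weighted sum of the four per-second highlight scores,
        # each first normalized to [0, 1].
        def norm01(h):
            h = np.asarray(h, dtype=float)
            span = h.max() - h.min()
            return (h - h.min()) / span if span > 0 else np.zeros_like(h)
        f_m, f_f, f_s, f_a = weights  # the weights sum to one
        return (f_m * norm01(h_music) + f_f * norm01(h_face)
                + f_s * norm01(h_shot) + f_a * norm01(h_motion))

    def select_highlights(H, desired_seconds):
        # Pick the threshold so that the desired number of one-second
        # slots (those with the highest overall scores H) are extracted.
        threshold = np.sort(H)[-desired_seconds]
        return np.flatnonzero(H >= threshold)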
5. Distance metric for highlights

In our experiment, we evaluate the consistency between the highlight sequence generated by the proposed system and the subjective results provided by the subjects for the same drama video; high consistency means the system performs well. Each subject expressed in writing his or her view of the highlights after watching the entire drama video. More precisely, we ask the subjects to describe, in words, the highlights at the story unit level [23] (for example, "actor kneels down and proposes to actress") and to give a score to each highlight. Similarly, the score of each video segment extracted by the proposed system is the summation of H within the segment.

A story unit [23] in hierarchical video data representation is a series of shots that communicate a unified action with a common locale and time. Because it is easier for the subjects to perceive and remember a video at the level of story units [24], especially when the drama video is long, we ask them to describe the highlights of each story unit rather than to label video segments. Because the results given by the subjects are in the form of textual descriptions, they are manually interpreted and matched with the video segments extracted by the system.
6. Experimental results
The segments extracted by the baseline system get the same score because they are simply a uniform sampling of the video. The oracle system represents a lower bound on the distance achievable using the set of highlights provided by the subjects.

Fig. 7 shows the means and standard deviations of the distances between the highlights output by the three approaches and the highlights provided by the subjects for the four videos. There are two observations to be made. First, the distances of the proposed system using all four features are smaller than those obtained using only one feature; in other words, the four features are complementary for detecting highlights in drama video. Second, the distances of the proposed system are smaller than those of the baseline system, and the improvement is significant under a one-tailed t-test (p-value < 0.05) [27]; that is, the highlight sequences generated by the proposed system are more consistent with the subjective results provided by the subjects.

7. Conclusion

Video highlights extraction often requires domain-specific knowledge. Although this subject has been intensively studied, little work has focused on highlights extraction for drama video. In this paper, we use two high-level features (music emotion and human face) and two low-level features (shot duration and motion magnitude) to extract highlights based on findings of psychology and filmology studies. Using the noise-free incidental music album allows us to prevent the effect of noise on music emotion classification and thereby improve the performance of the overall system. The evaluation results illustrate that the high-level features, especially music emotion, are effective for video highlights extraction.
Acknowledgment
[22] W. Kienzle, G. Bakir, M. Franz, B. Schölkopf, Face detection—efficient and rank deficient, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2005, pp. 673–680.
[23] J.M. Boggs, D.W. Petrie, The Art of Watching Films, fifth ed., Mayfield, Mountain View, CA, 2000.
[24] Y. Rui, T.-S. Huang, S. Mehrotra, Constructing table-of-content for videos, Multimedia Syst. Special Sect. Video Libraries 7 (5) (1999) 359–368.
[25] A.G. Money, H. Agius, Video summarization: a conceptual framework and survey of the state of the art, J. Vis. Commun. Image Represent. 19 (2) (2008) 121–143.
[26] L. Herranz, J.M. Martinez, A framework for scalable summarization of video, IEEE Trans. Circ. Syst. Video Technol. 20 (9) (2010) 1265–1270.
[27] M. Hollander, D.A. Wolfe, Nonparametric Statistical Methods, John Wiley & Sons, Hoboken, NJ, 1999.

Cheng-Te Lee received the BS and MS degrees, both in Computer Science and Information Engineering, from National Taiwan University, Taipei, Taiwan, in 2008 and 2011, respectively. His research interests include content analysis of audio signals, sound source separation, and machine learning for multimedia signal analysis.