2013/14
Name:
[Xiao LU]
Programme:
[Telecommunications
Engineering with
Management]
Class:
[2010215103]
QM Student No.
[100668843]
[10212802]
Project No.
[IM_2802]
Date [12th May 2014]
Table of Contents
Abstract............................................................................................................................................... 4
Chapter 1: Introduction..................................................................................................................... 6
1.1 Introduction and Motivation................................................................................................... 6
1.1.1 What is music emotion recognition..................................................................................... 6
1.1.2 Applications and significance..............................................................................................6
1.2 Project Description................................................................................................................... 7
1.2.1 Work in this project............................................................................................................. 7
1.2.2 Achievement........................................................................................................................ 8
1.3 Structure of The Report...........................................................................................................8
Chapter 2: Background......................................................................................................................9
2.1 Emotion Model..........................................................................................................................9
2.1.1 Introduction......................................................................................................................... 9
2.1.2 Related model...................................................................................................................... 9
2.1.3 Discussion on perception of emotion................................................................................ 10
2.2 Audio Features........................................................................................................................ 11
2.2.1 What is audio feature......................................................................................................... 11
2.2.2 Introduction of some audio features.................................................................................. 11
2.3 General Approaches for Music Emotion Recognition.........................................................16
2.3.1 Emotion Classification...................................................................................................... 16
2.3.2 Emotion Regression...........................................................................................................17
2.3.3 Comparisons and Approach Choosing.............................................................................. 18
2.4 Related Work.......................................................................................................................... 18
2.4.1 Related Work..................................................................................................................... 18
2.4.2 Discussions and Approach Choosing................................................................................ 19
Chapter 3: Design and Implementation......................................................................................... 20
3.1 Ground Truth Preparation.................................................................................................... 20
3.1.1 Samples collection.............................................................................................................20
3.1.2 Emotion model choosing and labeling process................................................................. 20
3.1.3 Data processing................................................................................................................. 21
3.2 Feature Extraction..................................................................................................................21
3.2.1 Introduction of CAMEL.................................................................................................... 21
3.2.2 Introduction of VAMP plug-in...........................................................................................23
3.2.3 Modifications and extraction............................................................................................. 24
3.3 Emotion Regression................................................................................................................25
3.3.1 Machine learning tool........................................................................................................25
3.3.2 Regression System.............................................................................................................25
Chapter 4: Results and Discussion..................................................................................................27
4.1 Evaluation............................................................................................................................... 27
4.1.1 Evaluation Method............................................................................................................ 27
4.1.2 Experiments and Results................................................................................................... 28
4.2 Discussion................................................................................................................................ 32
4.2.1 Analysis and Discussion to the Experiments.....................................................................32
4.2.2 Discussion about Emotion and Labeling........................................................................... 33
Chapter 5: Conclusion and Further Work..................................................................................... 35
5.1 Summary................................................................................................................................. 35
5.2 Further Work.......................................................................................................................... 35
References..........................................................................................................................................36
Acknowledgement.............................................................................................................................37
Appendix........................................................................................................................................... 38
Risk Assessment................................................................................................................................ 52
Environmental Impact Assessment.................................................................................................53
Abstract
Emotion is often an essential element expressed in music, so it can be a useful means to categorize and organize music data, and can be widely used in applications. In this sense, automatic music emotion recognition is becoming increasingly significant for music information retrieval and other applications. The aim of this paper is to explore the relationship between audio features and the emotion in pure music, and to achieve emotion recognition in terms of emotion regression by employing machine learning techniques. A psychological three-dimensional model, the PAD (Pleasure-Arousal-Dominance) model, is introduced in the research for labeling the emotion values. For the training data, 1,217 samples of pure music are collected from the Internet and labeled with PAD emotional state values by a group of 10 people. The feature extraction plug-in programme is an improved version of a tool called CAMEL (Content-based Audio and Music Extraction Library). It is modified so that the programme can read a ".wav" file as input instead of a ".txt" file, and can also output an ".arff" file for the regression learning tool. In addition, an extra plug-in programme based on CAMEL is written with the Vamp plugin framework for a host programme in the project. The ground truth data are fed into the machine learning tool WEKA, and the emotion regression is then achieved using the SMOreg algorithm and evaluated with cross-validation. Under cross-validation, the correlation coefficients (CF) reach 52% on Pleasure, 85% on Arousal and 75% on Dominance respectively. Several experiments and results are presented and some meaningful conclusions about audio features and regression are drawn.
Chapter 1: Introduction
1.1 Introduction and Motivation
1.1.1 What is music emotion recognition
The rapid growth of artificial intelligence has brought countless possibilities to our daily life. For music, automatic recognition technology has contributed greatly to music information retrieval and other applications, and the emerging concept of big data makes it all the more necessary. One particular topic is music emotion recognition. While most other information in music can be retrieved or recognised "correctly", the emotion expressed in a piece of music can be ambiguous and hard to quantify, since emotion is difficult to model and measure, and our perception of emotion is subjective and can differ entirely from person to person. In short, research on music emotion recognition is still in its early stages, although it has drawn much more attention in recent years.
Computational systems for music emotion recognition are usually based on a model of emotion; in this paper, for example, the PAD emotional state model is employed.
Emotion recognition can be viewed as a multiclass-multilabel classification or regression problem where we try to annotate each music piece with a set of emotions[1]. Regression is a statistical technique for estimating the relationships among variables, supported by a number of regression algorithms. Emotion classification normally means automatically assigning a class or type of emotion, for example "sad", "happy" or "anxious", whereas emotion regression tries to predict an emotion value that describes a state of mood. A classification scheme can have 8, 16 or many more classes of emotion, but it still has limits and ambiguity; emotion regression can describe human emotion somewhat more exactly and accurately.
1.2 Project Description
1.2.1 Work in this project
This project focuses on pure music emotion regression, as pure music does not contain lyric information or vocal performance. Also, only low-level and some mid-level features are employed for the regression training, to explore their effectiveness. A flowchart of the project is presented in Figure 2. In general, the project contains three main parts: ground truth (training dataset) collecting and labeling, writing the feature extraction plug-in programme, and regression learning.
For data collection, 1,217 samples of pure music were collected from the Internet, all in ".wav" format. They were then annotated by a group of ten people, including the author, using a labeling system built by our group. Note that the project focuses on pure music, e.g. classical music and instrumental music, but not pop songs, which contain lyrics that may be an important factor affecting an audience's emotion. The project also uses the PAD model as a more accurate and complex model of human emotion.
Secondly, a feature extraction tool called CAMEL is used and heavily modified to satisfy the project's requirements for input and output data formats. CAMEL is not a commonly used feature extraction tool, but it can extract some time-domain features and some spectral features that other tools may not provide, and it is therefore adjusted to be more convenient to use. In addition, in this project the tool is also turned into a plug-in programme for an existing host programme using the Vamp plugin framework.
After a data processing step, the ground truth data is fed into the machine learning tool WEKA and the regression learning is carried out. Finally, the results are evaluated according to the correlation coefficient for each of the P, A and D dimensions using cross-validation, and are analysed in full detail.
1.2.2 Achievement
The project employs the machine learning and data mining toolkit Weka to regress the emotion on each of the P, A and D scales. Using the SMOreg algorithm to build a regressor with an RBF kernel and default parameters, the correlation coefficients (CF values) are 51% on Pleasure, 85% on Arousal and 75% on Dominance respectively. In addition, several experiments and analyses of the results were carried out, and several comments and conclusions have been drawn.
Chapter 2: Background
2.1 Emotion Model
2.1.1 Introduction
Though emotion is a subjective and ambiguous concept, computers need exact values and numbers to "understand" what people think and judge. Therefore, whether in emotion regression or emotion classification, it is necessary to have some means of quantifying or measuring our mood in order to organize and retrieve music. Most researchers suggest that emotion can be measured by a multi-dimensional model. This means that a state of mood can be described as a multi-dimensional vector, which can then be used for data training and information retrieval. Russell[2] and Thayer[3] have made significant contributions to building emotion descriptors into low-dimensional models, for example the Valence-Arousal model (see Figure 3).
2.1.2 Related model
One frequently used emotion model is Thayer's two-dimensional emotion model, the Valence-Arousal Model. In this model, "Valence" stands for the appraisal of polarity in emotion, such as positive and negative, and "Arousal" represents the intensity or strength of emotion. A high arousal value indicates a strong emotion such as excitement or anger. In Figure 3, there are some emotion adjectives in every quadrant, labeled by Russell. The V-A model is quite easy to understand and use; however, it has some limitations because of its low dimensionality. For example, anger and fear are both intensely negative, meaning that they have similar V-A values, but they are not actually the same mood. In this case a three-dimensional emotion model should be more adequate for measuring emotion.
Another model used in this project is the PAD (Pleasure-Arousal-Dominance) Model proposed by Mehrabian[4]. The Pleasure-Displeasure scale measures how pleasant an emotion is; the P dimension corresponds to the V dimension in the V-A model. The Arousal-Nonarousal scale measures the intensity or strength of the emotion. For example, while both anxiety and anger are negative emotions, the latter has a higher intensity, or a higher arousal state. The Dominance-Submissiveness scale represents the controlling and dominant nature of the emotion, and is an extended dimension as indicated in Figure 3. For instance, both fear and anger are unpleasant emotions, but anger is a dominant emotion while fear is a submissive emotion.
2.1.3 Discussion on perception of emotion
As stated before, people may be in different moods when listening to the same music, because of the subjectivity of aesthetics. For instance, a rock song can make rock fans excited and happy, while some other people might feel slightly frightened by it. Even two professional musicians can have different perceptions and understandings of what emotion a piece of classical music is trying to express. Besides, different cultures affect people's perception of music emotion, because any music has its cultural background. Therefore some researchers focus on the music of a specific country, such as the Chinese songs in [5].
There is also a distinction between one's perception of the emotion(s) "expressed" by music and the emotion(s) "induced" by music. The emotion a song expresses and the emotion it stimulates in the audience are not always the same. For instance, a song may be in a slightly sad mood, but because it sounds so beautiful, the audience may feel relaxed or excited. Another example occurs particularly in music written for films: when the background music is creating an evil ambience, the music itself can be angry or excited, yet it gives the audience a perception of fear and anxiety. This point should be carefully considered when labeling ground truth data with emotion values.
2.2 Audio Features
2.2.1 What is audio feature
Generally there are two categories of audio feature: time-domain features and frequency-domain features. The original information of an audio signal is its amplitude values in the time domain, and we need a Fourier Transform to convert it into the frequency domain. Audio features can also be separated into high-level and low-level features. Here high-level means that the features are closer to human perception, human semantics and musical expression, while low-level features are based on computation and analysis of a signal from the physical or acoustical aspect, such as spectral centroid, spectral variance and spectral standard deviation. A general flowchart of obtaining spectral features is shown in Figure 4.
2.2.2 Introduction of some audio features
1) Zero-Crossing Rate (ZCR)
The zero-crossing rate measures the number of times the signal crosses the zero axis, and is computed in the time domain. It is used not only in audio analysis but also contributes substantially to speech recognition. Periodic sounds tend to have a small ZCR, while noisy sounds tend to have a high one, so it can be easily computed and used to distinguish voice from noise. The definition of ZCR is:
ZCR = (1 / 2N) · Σ_{n=1}^{N−1} | sgn(x[n]) − sgn(x[n−1]) |   (1)

where

sgn(x[n]) = 1,  x[n] ≥ 0   (2)

and

sgn(x[n]) = −1,  x[n] < 0   (3)
From Figure 5 it can be inferred that when the ZCR reaches a high value there is likely to be noise, and when the ZCR is relatively low there should be speech or voice.
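As a minimal sketch of the definition above (the test signals and their lengths are illustrative, not the project's data), the ZCR can be computed as:

```python
import numpy as np

def zero_crossing_rate(x):
    """Implements ZCR = (1/2N) * sum |sgn(x[n]) - sgn(x[n-1])|,
    with sgn(x) = 1 for x >= 0 and -1 otherwise."""
    x = np.asarray(x, dtype=float)
    signs = np.where(x >= 0, 1.0, -1.0)
    return np.abs(np.diff(signs)).sum() / (2 * len(x))

# A pure tone crosses zero rarely; white noise crosses very often.
t = np.linspace(0, 1, 8000, endpoint=False)
tone = np.sin(2 * np.pi * 10 * t)       # 10 Hz tone -> low ZCR
rng = np.random.default_rng(0)
noise = rng.standard_normal(8000)       # noise -> ZCR near 0.5
print(zero_crossing_rate(tone), zero_crossing_rate(noise))
```

This matches the intuition stated above: the periodic tone yields a ZCR close to zero, while the noise yields a ZCR near one half.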
2) Spectral centroid
The spectral centroid indicates where the "center of mass" of the spectrum is. In terms of human perception it can be described as the "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights[6]. The definition of the spectral centroid is as follows:
SC = Σ_k f(k) · a(k) / Σ_k a(k)   (4)

where f(k) is the centre frequency of bin k and a(k) its magnitude.
3) Spectral slope
The spectral slope represents the rate at which the spectral amplitude decreases. It is computed by a linear regression of the spectral amplitude:
a(f) ≈ slope · f + const   (5)

where

slope = (1 / Σ_k a(k)) · [ N · Σ_k f(k)·a(k) − Σ_k f(k) · Σ_k a(k) ] / [ N · Σ_k f(k)² − (Σ_k f(k))² ]   (6)
4) Spectral roll-off
The spectral roll-off point is the frequency below which 95% of the signal energy is contained. It is somewhat correlated with the harmonic/noise cutting frequency.
The roll-off is the smallest n satisfying

Σ_{i=1}^{n} power(i) ≥ p · Σ_{i=1}^{N} power(i)   (7)

where p = 0.95.
A flowchart of extracting the spectral roll-off is shown in Figure 6, and Figure 7 provides a graphical view of the spectral features introduced above.
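The centroid and roll-off computations above can be sketched for a single analysis frame as follows (the 440 Hz test tone, frame size and sample rate are illustrative, not the project's settings):

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one spectrum frame."""
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(mag, freqs, p=0.95):
    """Smallest frequency below which fraction p of the energy lies."""
    power = mag ** 2
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, p * cumulative[-1])
    return freqs[idx]

# One 1024-sample frame of a 440 Hz tone at 44.1 kHz.
sr, n = 44100, 1024
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 440 * t) * np.hanning(n)
mag = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(n, 1 / sr)
print(spectral_centroid(mag, freqs))  # near 440 Hz
print(spectral_rolloff(mag, freqs))   # near 440 Hz
```

For a pure tone, both measures land near the tone's frequency; for brighter or noisier sounds they move upwards, matching the "brightness" interpretation above.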
5) Tristimulus features
The tristimulus values have been introduced as a descriptor for timbre[6]. They are three energy ratios allowing a fine description of the first harmonics of the spectrum, which are perceptually the most salient. In (8)-(10), a(h) denotes the energy of the h-th harmonic of the spectrum and H is the number of harmonics available in the spectrum.
T1 = a(1) / Σ_{h=1}^{H} a(h)   (8)

T2 = (a(2) + a(3) + a(4)) / Σ_{h=1}^{H} a(h)   (9)

T3 = Σ_{h=5}^{H} a(h) / Σ_{h=1}^{H} a(h)   (10)
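Assuming the harmonic amplitudes a(1)..a(H) have already been estimated (harmonic peak picking is not shown, and the amplitudes used here are hypothetical), the three ratios can be sketched as:

```python
import numpy as np

def tristimulus(harmonic_amps):
    """Tristimulus T1, T2, T3 from harmonic amplitudes a(1)..a(H):
    T1 = a(1)/total, T2 = (a(2)+a(3)+a(4))/total, T3 = rest/total."""
    a = np.asarray(harmonic_amps, dtype=float)
    total = a.sum()
    t1 = a[0] / total
    t2 = a[1:4].sum() / total
    t3 = a[4:].sum() / total
    return t1, t2, t3

# Hypothetical amplitudes of the first six harmonics of a note.
t1, t2, t3 = tristimulus([1.0, 0.5, 0.3, 0.2, 0.1, 0.05])
print(t1, t2, t3)  # three ratios summing to 1
```

Because the three values are normalised by the same total, they always sum to one, so together they describe how the energy is shared between the fundamental, the next three harmonics, and the rest.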
6) Mel-Frequency Cepstral Coefficients (MFCC)
Figure 8: Mel-frequency
MFCC is based on the known variation of the human ear's critical bandwidths with frequency[7]. In other words, it is much closer to the human auditory system, since the Mel frequency is non-linearly spaced on the frequency axis (Figure 8). The mel filter bank uses filters spaced linearly below 1000 Hz and logarithmically above 1000 Hz. The definition of the Mel frequency is:
Mel(f) = 2595 · log10(1 + f / 700)   (11)

where Mel(f) denotes the Mel frequency and f denotes the frequency in Hz.
The term "cepstrum" comes from "spectrum", and it is a useful representation for audio analysis. The steps for computing the cepstrum are shown in Figure 9. The cepstrum is defined as the inverse DFT of the log magnitude of the DFT of a signal:
c[n] = F⁻¹{ log |F{x[n]}| }   (12)

where F⁻¹ is the IDFT and F is the DFT.
Figure 10 demonstrates the whole process of extracting MFCC. The output, the Mel-scale Frequency Cepstral Coefficients, are in practice 12 or 13 DCT (Discrete Cosine Transform) coefficients.
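The mel mapping and cepstrum definitions above can be sketched as follows (the full MFCC chain with filter bank and DCT is omitted for brevity):

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def real_cepstrum(x):
    """Cepstrum: inverse DFT of the log magnitude of the DFT.
    A tiny constant guards against log(0)."""
    spectrum = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

print(hz_to_mel(1000.0))  # about 1000 mel, by construction of the scale
```

The 1000 Hz reference point is where the linear and logarithmic regions of the mel scale meet, which is why 1000 Hz maps to roughly 1000 mel.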
Group  Adjectives              Group  Adjectives
A      cheerful, gay, happy    H      dramatic, emphatic
B      fanciful, light         I      agitated, exciting
C      delicate, graceful      J      frustrated
D      dreamy, leisurely       K      mysterious, spooky
E      longing, pathetic       L      passionate
F      dark, depressing        M      bluesy
G      sacred, spiritual
Given a training set {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where xᵢ ∈ R^d, yᵢ ∈ R and i = 1, 2, ..., n, the relation between the input xᵢ and the output yᵢ can be mapped by an optimal regression function f(x) obtained through SVR training. Assuming linearity, f can be represented as the following hyperplane:

f(x) = ⟨ω, φ(x)⟩ + b   (13)

where ω ∈ Rⁿ, b ∈ R.
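The project trains its regressor with WEKA's SMOreg. Purely as an illustrative analogue (not the project's actual toolchain), scikit-learn's SVR with an RBF kernel fits the same kind of regression function, and the CF value can be computed as a correlation coefficient between predictions and labels; the data here are synthetic stand-ins for feature vectors and one PAD scale:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 5-dimensional feature vectors and labels.
X = rng.standard_normal((200, 5))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(200)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # default-style parameters
model.fit(X, y)
pred = model.predict(X)                        # in-sample, for illustration
# Correlation coefficient between predictions and labels, as in the report.
cf = np.corrcoef(pred, y)[0, 1]
print(cf)
```

A proper evaluation would predict on held-out folds rather than the training data; this sketch only shows how the regressor and the CF measure fit together.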
3.1.2 Emotion model choosing and labeling process
The samples were labeled using a simple graphical interface (Figure 12) built by our group. The group of 10 people, aged from 22 to 50, annotated the samples with emotion values according to the Self Assessment Manikins (SAM)[15]. More specifically, the samples were labeled with three integer values, Pleasure, Arousal and Dominance, each ranging from 0 to 8, so each scale is restricted to 9 specific values. In the interface, each block showing a picture of an emotional state corresponds to a value from 0 to 8, and is designed to help users better understand the emotion value of each scale. To operate the system, we first play and listen to a music sample and try to find the emotion expressed in the music; then we simply select the button for each PAD scale. In addition, the volume of all samples was kept the same, to avoid the influence that high or low volume could have on the annotator's emotion. Due to the large number of samples (1,217), the labeling process took more than two months.
3.1.3 Data processing
After all the members finished labeling the 1,217 samples of pure music, the annotated P, A and D values of each sample were recorded in an .xls file. The final emotion value on each PAD scale is the mean of the values labeled by the 10 group members; for example, the P value of a sample is the average of the P values labeled by all members.
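The averaging step can be sketched as follows (the row structure of the label sheet is hypothetical, standing in for the project's .xls file):

```python
from collections import defaultdict

def average_labels(rows):
    """rows: (sample_id, annotator, p, a, d) tuples from the label sheet.
    Returns {sample_id: (mean_p, mean_a, mean_d)} over all annotators."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for sample_id, _annotator, p, a, d in rows:
        acc = sums[sample_id]
        acc[0] += p
        acc[1] += a
        acc[2] += d
        acc[3] += 1
    return {sid: (s[0] / s[3], s[1] / s[3], s[2] / s[3])
            for sid, s in sums.items()}

rows = [("s1", "ann1", 4, 6, 5), ("s1", "ann2", 6, 8, 5)]
print(average_labels(rows))  # {'s1': (5.0, 7.0, 5.0)}
```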
As shown in Figure 13, the training data for machine learning consists of three individual files in ".arff" format, each containing the multi-dimensional feature vectors together with the emotion values of one PAD scale; this is achieved by a combination of programme and manual operations.
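A minimal sketch of the ARFF layout that WEKA expects is shown below; the relation name, attribute names and values are illustrative, not CAMEL's actual output:

```python
def write_arff(path, relation, feature_names, rows):
    """Write a minimal WEKA .arff file: numeric features plus one
    numeric emotion value (e.g. Pleasure) as the last attribute."""
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for name in feature_names:
            f.write(f"@ATTRIBUTE {name} NUMERIC\n")
        f.write("@ATTRIBUTE emotion NUMERIC\n\n@DATA\n")
        for features, label in rows:
            f.write(",".join(str(v) for v in list(features) + [label]) + "\n")

write_arff("pleasure.arff", "music_emotion",
           ["spectral_centroid", "zcr"], [([1250.5, 0.031], 5.3)])
```

One such file per PAD scale, with the scale's averaged label as the last column, matches the three-file arrangement described above.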
3.2 Feature Extraction
3.2.1 Introduction of CAMEL
CAMEL (Content-based Audio and Music Extraction Library) is a feature extraction library that can extract a set of audio features from an audio signal. A list of features in CAMEL is shown in Table 2, and the full list can be found in the appendix.
Table 2: List of Features in CAMEL

Feature                          Dimension   Domain
1  Mean                          1           Time
2  Variance                      1           Time
3  Standard Deviation            1           Time
4  Average Deviation             1           Time
5  Skewness                      1           Time
6  Kurtosis                      1           Time
7  ZCR                           1           Time
8  RMS                           1           Time
9  Non-Zero Count                1           Time
10 Spectral Centroid             1           Frequency
11 Spectral Variance             1           Frequency
12 Spectral Standard Deviation   1           Frequency
13 Spectral Average Deviation    1           Frequency
.
.
.
27 Spectral Loudness             1           Frequency
28 Spectral Sharpness            1           Frequency
29 Peak Tristimulus 1            1           Peak
30 Peak Tristimulus 2            1           Peak
31 Peak Tristimulus 3            1           Peak
32 MFCC                          13          Frequency
33 Bark                          26          Frequency
There are four types of domain used in CAMEL: the Time, Frequency, Peak and Harmonic domains. The Time domain is simply the original PCM values that represent the amplitude of the signal sampled at a specific rate (e.g. 44.1 kHz) over time. The Frequency domain is the Fourier Transform of the time domain, representing magnitude versus frequency. The Peak domain is derived from the Frequency domain, where only frequency values above a threshold are kept, with some transformation according to neighbouring values. The Harmonic domain picks only the values of the Peak domain which are harmonics: whole-number multiples (within some threshold of tolerance) of the fundamental frequency of the Time domain. The Harmonic domain is the closest to music and human auditory perception.
There are 9 time-domain features in CAMEL, which is quite special because other extraction tools such as jAudio do not have as many time-domain features. Both MFCC and Bark are multi-dimensional features: MFCC contains 13 dimensions and Bark contains 26 dimensions. So in total this tool can extract a 70-dimensional feature vector.
3.2.2 Introduction of the Vamp plug-in
A Vamp plugin is a chunk of compiled program code that carries out analysis of a digital audio signal, returning audio feature information. Vamp is a powerful plugin framework that can be used to write audio feature extraction plug-ins, and several plug-ins can be integrated together in a host programme.
As presented in Figure 14, in this project an extra plug-in based on CAMEL and written against the Vamp API was completed. It is used by a GUI host programme written by other people in the lab in order to extract audio features and apply machine learning.
Figure 15 shows the graphical interface of the beta version of the host and plug-in programme.
(More details can be found in the appendix)
3.2.3 Modifications and extraction
The modified version of the feature extraction tool is written in C++ and developed using Visual Studio as the integrated development environment. The samples from which features are extracted are all in ".wav" format, but since CAMEL could originally only read a ".txt" file containing PCM information, the first task was to modify the programme so that it can take ".wav" files directly as input. As Figure 16 shows, at this stage another library, Libsndfile, is integrated into the CAMEL programme to help read the audio files. Next, in order to apply machine learning, the extracted features need to be in a particular format, ".arff", so some modifications are made to the main function of CAMEL to make it generate an ARFF file as output. The programme was run on the server, and extracting the feature sets of the 1,217 music samples took three days. The FFT window size is 1024, applying a Hanning window and using the magnitude spectrum.
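The stated analysis settings (1024-sample frames, Hanning window, magnitude spectrum) can be sketched as follows; the non-overlapping hop is an assumption here, since CAMEL's actual hop size is not stated:

```python
import numpy as np

def magnitude_frames(signal, frame_size=1024):
    """Split a mono signal into non-overlapping 1024-sample frames,
    apply a Hanning window, and return each frame's magnitude spectrum."""
    window = np.hanning(frame_size)
    n_frames = len(signal) // frame_size
    frames = signal[: n_frames * frame_size].reshape(n_frames, frame_size)
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 44100
t = np.arange(sr) / sr                       # one second of audio
mags = magnitude_frames(np.sin(2 * np.pi * 440 * t))
print(mags.shape)  # (43, 513): 43 frames, 513 frequency bins
```

Per-frame features such as the spectral centroid would then be computed from each row of this matrix and aggregated over the whole piece.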
3.3.2 Regression System
In this regression process (Figure 17), the 1,217 audio files in ".wav" format are first put into the modified CAMEL programme. After three days all the features are extracted and saved in ".arff" files. By a combination of manual operation and programme, the labeled P, A and D values are combined with the feature set and three individual files are generated. The ".arff" files are then fed into Weka as the training data for regression learning, and the prediction results are obtained in Weka. Finally, the results are analysed and evaluated according to the correlation coefficient for each of the P, A and D dimensions using cross-validation.
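The evaluation loop can be sketched with a simple k-fold split; ordinary least squares stands in for SMOreg here, purely for illustration, and the data are synthetic:

```python
import numpy as np

def cross_val_cf(X, y, fit, predict, k=10):
    """k-fold cross-validation; returns the correlation coefficient
    between out-of-fold predictions and the true labels."""
    n = len(y)
    indices = np.arange(n)
    preds = np.empty(n)
    for fold in range(k):
        test = indices[fold::k]              # every k-th sample held out
        train = np.setdiff1d(indices, test)
        model = fit(X[train], y[train])
        preds[test] = predict(model, X[test])
    return np.corrcoef(preds, y)[0, 1]

# Toy linear model as a stand-in for the SMOreg regressor.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)
print(cross_val_cf(X, y, fit, predict))
```

Every sample is predicted exactly once by a model that never saw it, so the resulting coefficient reflects generalisation rather than memorisation, which is the point of cross-validation.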
The correlation coefficient between the predicted values X and the annotated values Y is defined as:

CF(X, Y) = cov(X, Y) / (σ_X · σ_Y)   (14)

where X and Y denote the predicted and the annotated emotion values respectively.
Emotion Space   CF Value
Pleasure        51.71%
Arousal         85.87%
Dominance       74.87%
Actual Value   Prediction   Error
5.3            5.822        0.522
7.1            5.805        -1.295
7.0            7.208        0.208
3.8            3.345        -0.455
6.7            5.145        -1.555
4.1            5.549        1.449
5.7            6.152        0.452
4.4            4.278        -0.122
5.545          5.402        -0.143
4.3            4.172        -0.128
6.182          5.162        -1.019
6.1            5.78         -0.32
3.8            4.456        0.656
5.0            4.891        -0.109
5.7            6.21         0.51
6.8            5.546        -1.254
3.5            3.371        -0.129
5.909          5.867        -0.042
5.455          4.773        -0.682
5.7            5.15         -0.55
Table 5: Selected Features for Each Dimension

In P-space               In A-space               In D-space
(1) Spectral Loudness    (1) Spectral Loudness    (1) Spectral Loudness
(2) MFCC3                (2) MFCC3                (2) MFCC3
(3) Kurtosis             (3) Kurtosis             (3) Peak Tristimulus 1
(4) Peak Tristimulus 1   (4) Peak Tristimulus 2   (4) Peak Tristimulus 3
(5) Peak Tristimulus 2   (5) Spectral Sharpness   (5) Spectral Spread
(6) Peak Tristimulus 3   (6) MFCC4
(7) ZCR
(8) MFCC9
This experiment then kept only the features selected as most correlated and ran the regression learning again. The regression result with the reduced feature set is shown in Table 6.
Table 6: Result Using Selected Features and Comparison

Emotion Space   CF Value after Selection   CF Value with All Features
Pleasure        43.47%                     51.71%
Arousal         82.29%                     85.17%
Dominance       71.67%                     74.87%
Table 7: Comparison between the jAudio and CAMEL Feature Sets

Emotion Space   CF Value of jAudio   CF Value of CAMEL
Pleasure        49.93%               51.71%
Arousal         87.71%               85.17%
Dominance       76.86%               74.87%
Table 8: Selected Features for the P-, A- and D-dimensions (Standard Deviation appears among the selected features in each dimension)
Table 9: Comparison between Songs with Lyrics and Pure Music

Emotion Space   CF Value (songs with lyrics)   CF Value (pure music)
Pleasure        41.31%                         51.71%
Arousal         81.25%                         85.17%
Dominance       65%                            74.87%
4.2 Discussion
4.2.1 Analysis and Discussion of the Experiments
In experiment 1, the final results of the regression system meet expectations. The prediction output sample was generated during cross-validation, demonstrating the real process of automatic recognition on each piece of testing data. We can also see that the regression performance on both Arousal and Dominance is good. Although the CF value in the Pleasure dimension is a bit lower than in the A and D dimensions, it should be noted that, compared to other studies which use only audio features for regression, this coefficient is quite normal. In [13], the experimental results show that, for the Pleasure dimension, lyric features contribute the most to regression performance (reaching a 62.3% CF value) whereas the audio feature set contributes the least (47.3% CF value). Overall, experiment 1 validated the viability and effectiveness of the regression system this project aims to build.
In experiment 2, compared to the performance using all the features, the result of using only the handful of selected features seems good enough in the A and D dimensions, despite an 8% reduction in the P dimension. In real applications, if only a few feature dimensions need to be extracted for recognition, the computation time can be greatly reduced.
More importantly, in Table 5 we can find some features that are significant for more than one dimension. First, Spectral Loudness and the 3rd MFCC coefficient are critical for all three dimensions. Spectral Loudness and Spectral Sharpness are computed over the Bark bands and are related to the brightness and noisiness of the music respectively. We may infer that Spectral Loudness is important because it is based on the Bark bands, which are closer to human hearing, and because brightness (and noisiness) is a very important characteristic for music expressing a state of emotion. MFCC is also related to human perception, so it is very appropriate for music emotion regression. Kurtosis is a very simple statistical time-domain feature that measures the "peakedness" of the probability distribution of a real-valued random variable. Another time-domain feature is ZCR, which contributes much to the regression of arousal. The Peak Tristimulus 1 to 3 features were introduced in Section 2.2.2, and they are very important because they describe the "timbre" of the sound, which can be closely related to emotion, especially its strength.
In experiment 3, a comparison is made between different feature sets. The results of the two sets are close: CAMEL performs better on P whereas jAudio outperforms it on the A and D dimensions. As introduced before, CAMEL provides a 70-dimensional feature vector of which most are low-level, basic acoustic features, while jAudio provides 112 dimensions including some mid-level musical features such as tempo and beat, as well as an important feature in audio signal processing called LPC. One noticeable finding is that only 4 features (16 dimensions) are shared by the two feature sets, but since they perform very similarly in the regression learning, it might be a good idea to combine the useful features of both tools to achieve a better result.
In the last experiment, the result (Table 9) shows that when audio features are used to regress
songs containing lyrics, the CF values in the P and D dimensions are much lower than those for
pure music without lyrics or singing. Both methods work well on the Arousal dimension because
the intensity of emotion in music is mostly expressed through its acoustical features rather than
through the lyrics. In conclusion, this result indicates that audio features do perform well for the
regression of pure music.
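The CF values compared above can be computed as a Pearson correlation coefficient between the predicted and annotated emotion values [18]. A minimal sketch, assuming CF is the standard Pearson r (the function name is illustrative):

```python
import math

def correlation_coefficient(predicted, annotated):
    # Pearson correlation between a regressor's predictions and the
    # human-annotated values; 1.0 means perfect linear agreement.
    n = len(predicted)
    mp = sum(predicted) / n
    ma = sum(annotated) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, annotated))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in annotated))
    return cov / (sp * sa)

# Toy example: three songs, predictions close to the annotations.
print(round(correlation_coefficient([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]), 3))  # 0.991
```

One CF value per dimension (P, A, D) would be computed over the test set to produce a table like Table 9.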
4.2.2 Discussion about Emotion and Labeling
Besides the implementation of feature extraction and the regression learning method, there remain
some issues about our perception of emotion and the labeling process that can, to some degree,
affect the result of regression.
First, as stated before, people may have different, even opposite, perceptions of the same piece of
music. This subjectivity may obscure the accuracy of the annotated emotion values, though people
will not always feel differently about the same song. Second, annotators may understand the PAD
model slightly differently, especially the Dominance scale. There may be disagreement on whether
a song conveys a sense of dominance or submissiveness, except for some obvious emotional states
such as fear and anger. Thirdly, there are some issues in the music itself: a piece of music may
contain more than one type or state of mood, especially in pure music, and sometimes two contrary
emotions appear in the same piece. One of the assumptions of this project is that there is only one
state of mood over the whole length of the music, so we use a single emotion vector to label each
sample. But as stated in [1], some research uses a time series of vectors to track changes in
emotional content over the duration of a piece, which might be more adequate. In addition, the
genre of a song may also contribute a lot to the audience's perception, but it is not considered in
this project since most of the extracted audio features are very low-level.
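The time-series labeling alternative mentioned above could be sketched as one PAD vector per fixed-length segment rather than per song. This is a hedged illustration of the idea surveyed in [1]; the class names and the 15-second window are assumptions, not a method used in this project.

```python
from dataclasses import dataclass

@dataclass
class SegmentLabel:
    start_sec: float  # segment start time within the track
    pad: tuple        # (Pleasure, Arousal, Dominance), each in [-1, 1]

def label_track(duration_sec, window_sec, annotate):
    # "annotate" maps a segment start time to a PAD triple; it could be
    # a human rating or a regressor's prediction for that window.
    labels = []
    t = 0.0
    while t < duration_sec:
        labels.append(SegmentLabel(t, annotate(t)))
        t += window_sec
    return labels

# A 45-second excerpt labeled every 15 seconds, with arousal drifting upward.
track = label_track(45.0, 15.0, lambda t: (0.2, 0.5 + t / 100.0, -0.1))
print([seg.start_sec for seg in track])  # [0.0, 15.0, 30.0]
```

Such per-segment labels would let a regressor capture mood changes that a single whole-song vector averages away.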
References
[1] Kim, Y. E., Schmidt, E. M., Migneco, R., Morton, B. G., Richardson, P., Scott, J., ... & Turnbull,
D. (2010, August). Music emotion recognition: A state of the art review. In Proc. ISMIR (pp. 255-266).
[2] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol.
39, no. 6, p. 1161, 1980.
[3] R. E. Thayer, "The Biopsychology of Mood and Arousal," New York: Oxford University Press,
1989.
[4] Mehrabian, A., "Framework for a Comprehensive Description and Measurement of Emotional
States," Genetic, Social, and General Psychology Monographs, vol. 121, pp. 339-361, 1995.
[5] Wang, X., Chen, X., Yang, D., & Wu, Y. (2011). Music Emotion Classification of Chinese Songs
based on Lyrics Using TF*IDF and Rhyme. In ISMIR (pp. 765-770).
[6] Peeters, G. (2004). A large set of audio features for sound description (similarity and description)
in the CUIDADO project. IRCAM, Paris, France.
[7] E. C. Gordon, Signal and Linear System Analysis. John Wiley & Sons Ltd., New York, USA, 1998.
[8] Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using mel
frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint
arXiv:1003.4083.
[9] T. Li and M. Ogihara, "Detecting emotion in music," in Proc. of the Intl. Conf. on Music
Information Retrieval, Baltimore, MD, October 2003.
[10] C. Cao and M. Li, "Thinkits submissions for MIREX2009 audio music classification and
similarity tasks." ISMIR, MIREX 2009.
[11] Han, B. J., Ho, S., Dannenberg, R. B., & Hwang, E. (2009). SMERS: Music emotion
recognition using support vector regression.
[12] Yang, Y. H., Lin, Y. C., Su, Y. F., & Chen, H. H. (2008). A regression approach to music
emotion recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 16(2), 448-457.
[13] Guan, D., Chen, X., & Yang, D. (2012). Music Emotion Regression Based on Multi-modal
Features. In 9th International Symposium on Computer Music Modeling and Recognition.
[14] Wang, X., Wu, Y., Chen, X., & Yang, D. (2013, October). Enhance popular music emotion
regression by importing structure information. In Signal and Information Processing Association
Annual Summit and Conference (APSIPA), 2013 Asia-Pacific (pp. 1-4). IEEE.
[15] P. Juslin and P. Luakka, "Expression, perception, and induction of musical emotions: A review
and questionnaire study of everyday listening," Journal of New Music Research, vol. 33, no. 3, p.
217, 2004.
[16] Sanden, C., Befus, C. R., & Zhang, J. Z. (2010, September). CAMEL: a lightweight framework
for content-based audio and music analysis. In Proceedings of the 5th Audio Mostly Conference: A
Conference on Interaction with Sound (p. 22). ACM.
[17] Weka: Data Mining Software in Java. Retrieved February 1, 2014 from
http://www.cs.waikato.ac.nz/ml/weka
[18] Zitao Liu (2012). Correlation Coefficient. Retrieved February 1, 2014 from
http://www.cnblogs.com/Gavin_Liu/archive/2010/09/19/1830902.html
Acknowledgement
I would like to thank my supervisor for his great help in this project and my parents for their
support. I also appreciate my group members in the lab for their helpful suggestions on my work.
Appendix
3) Table 8: Selected features for the P-, A- and D-dimensions (columns: Selected Features for P-dimension, A-dimension, D-dimension; the feature rows, statistics such as Standard Deviation and Average, were not recoverable from extraction).
Risk Assessment
(Table with columns: Description of Risk, Description of Impact, Likelihood rating, Impact rating, Preventative actions; individual rows, with ratings such as "Catastrophic" and "Serious" and actions such as "Take action if cost effective", were not recoverable from extraction.)