
Undergraduate Project Report

2013/14

[Research and Implementation of Emotion Recognition of Pure Music Based on Spectral Features]

Name:

[Xiao LU]

Programme:

[Telecommunications Engineering with Management]

Class:

[2010215103]

QM Student No.

[100668843]

BUPT Student No.

[10212802]

Project No.

[IM_2802]
Date [12th May 2014]

Table of Contents
Abstract............................................................................................................................................... 4
Chapter 1: Introduction..................................................................................................................... 6
1.1 Introduction and Motivation................................................................................................... 6
1.1.1 What is music emotion recognition..................................................................................... 6
1.1.2 Applications and significance..............................................................................................6
1.2 Project Description................................................................................................................... 7
1.2.1 Work in this project............................................................................................................. 7
1.2.2 Achievement........................................................................................................................ 8
1.3 Structure of The Report...........................................................................................................8
Chapter 2: Background......................................................................................................................9
2.1 Emotion Model..........................................................................................................................9
2.1.1 Introduction......................................................................................................................... 9
2.1.2 Related model...................................................................................................................... 9
2.1.3 Discussion on perception of emotion................................................................................ 10
2.2 Audio Features........................................................................................................................ 11
2.2.1 What is audio feature......................................................................................................... 11
2.2.2 Introduction of some audio features.................................................................................. 11
2.3 General Approaches for Music Emotion Recognition.........................................................16
2.3.1 Emotion Classification...................................................................................................... 16
2.3.2 Emotion Regression...........................................................................................................17
2.3.3 Comparisons and Approach Choosing.............................................................................. 18
2.4 Related Work.......................................................................................................................... 18
2.4.1 Related Work..................................................................................................................... 18
2.4.2 Discussions and Approach Choosing................................................................................ 19
Chapter 3: Design and Implementation......................................................................................... 20
3.1 Ground Truth Preparation.................................................................................................... 20
3.1.1 Samples collection.............................................................................................................20
3.1.2 Emotion model choosing and labeling process................................................................. 20
3.1.3 Data processing................................................................................................................. 21
3.2 Feature Extraction..................................................................................................................21
3.2.1 Introduction of CAMEL.................................................................................................... 21
3.2.2 Introduction of VAMP plug-in...........................................................................................23
3.2.3 Modifications and extraction............................................................................................. 24
3.3 Emotion Regression................................................................................................................25
3.3.1 Machine learning tool........................................................................................................25
3.3.2 Regression System.............................................................................................................25
Chapter 4: Results and Discussion..................................................................................................27

4.1 Evaluation............................................................................................................................... 27
4.1.1 Evaluation Method............................................................................................................ 27
4.1.2 Experiments and Results................................................................................................... 28
4.2 Discussion................................................................................................................................ 32
4.2.1 Analysis and Discussion to the Experiments.....................................................................32
4.2.2 Discussion about Emotion and Labeling........................................................................... 33
Chapter 5: Conclusion and Further Work..................................................................................... 35
5.1 Summary................................................................................................................................. 35
5.2 Further Work.......................................................................................................................... 35
References..........................................................................................................................................36
Acknowledgement.............................................................................................................................37
Appendix........................................................................................................................................... 38
Risk Assessment................................................................................................................................ 52
Environmental Impact Assessment.................................................................................................53

Abstract
Emotion is often an essential element expressed in music, so it can serve as a useful means of categorizing and organizing music data and can be widely used in applications. In this sense, automatic music emotion recognition is becoming more and more significant for music information retrieval and other applications. The aim of this paper is to explore the relationship between audio features and the emotion in pure music, and to achieve emotion recognition in the form of emotion regression by employing machine learning techniques. A psychological three-dimensional model, the PAD (Pleasure-Arousal-Dominance) model, is used for labeling the emotion values. For the training data, 1,217 samples of pure music were collected from the Internet and labeled with PAD emotional state values by a group of 10 people. The feature extraction programme is an improved version of a tool called CAMEL (Content-based Audio and Music Extraction Library). It is modified so that it can read a ".wav" file as input instead of a ".txt" file, and so that it outputs an ".arff" file for the regression learning tool. In addition, an extra plug-in based on CAMEL is written with the Vamp plug-in API for a host programme used in the project. The ground truth data are fed into the machine learning tool WEKA, and emotion regression is then achieved using the SMOreg algorithm and evaluated by cross-validation. With cross-validation, the correlation coefficients (CF) reach 52% on Pleasure, 85% on Arousal and 75% on Dominance respectively. Several experiments and their results are presented, and some meaningful conclusions about audio features and regression are drawn.

(Chinese translation of the Abstract)


Chapter 1: Introduction
1.1 Introduction and Motivation
1.1.1 What is music emotion recognition
The rapid growth of artificial intelligence has brought countless possibilities to real life. For music, automatic recognition technology has contributed a great deal to music information retrieval and other applications, and the recent emergence of big data makes it all the more inevitable. One special topic is music emotion recognition. While most other information about music can be retrieved or recognised "correctly", the emotion expressed in a piece of music can be ambiguous and hard to quantify, since emotion is difficult to model and measure, and our perception of emotion is subjective and can differ completely from person to person. In short, research on music emotion recognition is still in its early stages, although it has drawn much more attention in recent years.
Computational systems for music emotion recognition are usually based on a model of emotion; in this paper, for example, the PAD emotional state model is employed.
Emotion recognition can be viewed as a multiclass-multilabel classification or regression problem where we try to annotate each music piece with a set of emotions [1]. Regression is a statistical technique for estimating the relationships among variables, and many regression algorithms exist. Emotion classification normally means automatically assigning a class or type of emotion, for example "sad", "happy" or "anxious", whereas emotion regression tries to predict an emotion value that describes a state of mood. A classification scheme may have 8, 16 or many more emotion classes, but it still has limits and ambiguity; emotion regression can describe human emotion more exactly and accurately.

1.1.2 Applications and significance


With the explosion of vast and easily accessible digital music libraries over the past decade, there has been a rapid expansion of music information retrieval research towards automated systems for searching and organizing music and related data [1]. In addition, in many music recommendation systems and web radios, music emotion can be an essential criterion for users. For example, a web radio called Musicovery (http://musicovery.com/) provides a function that recommends songs according to the mood that users choose, as shown in Figure 1. (Note that the terms "mood" and "emotion" refer to the same thing in this paper.)

Figure 1 An example of an application using emotion

1.2 Project Description


1.2.1 Work in this project

Figure 2 An overview of the project

This project focuses on pure music emotion regression, as pure music contains no lyric information or vocal performance. Only low-level and some mid-level features are employed for the regression training, in order to explore their effectiveness. A flowchart of the project is presented in Figure 2. In general, the project contains three main parts: collecting and labeling the ground truth (training dataset), writing the feature extraction plug-in programme, and regression learning.
For data collection, 1,217 samples of pure music were collected from the Internet, all in ".wav" format. They were then annotated by a group of ten people, including the author, using a labeling system built by our group. Note that the project focuses on pure music, e.g. classical and instrumental music, rather than pop songs containing lyrics, which may be an important factor affecting an audience's emotion. The project also uses the PAD model as a more accurate and complete model of human emotion.
Secondly, a feature extraction tool called CAMEL is used and substantially modified to satisfy the project's requirements for input and output data formats. CAMEL is not a widely used feature extraction tool, but it can extract some time-domain features and some spectral features that other tools may not provide, so it has been adjusted to be more convenient to use. In addition, in this project the tool is also turned into a plug-in for an existing host programme using the Vamp plug-in API.
After a data processing step, the ground truth data are fed into the machine learning tool WEKA and the regression learning is carried out. Finally the results are evaluated according to the correlation coefficient for each of the P, A and D dimensions using cross-validation, and are analysed in detail.
1.2.2 Achievement
The project employs the machine learning and data mining toolkit Weka to regress the emotion on each of the P, A and D scales. Using the SMOreg algorithm to build regressors with an RBF kernel and default parameters, the correlation coefficients (CF values) are 51% on Pleasure, 85% on Arousal and 75% on Dominance respectively. In addition, several analyses and experiments on the results were carried out, and a number of conclusions have been drawn.

1.3 Structure of The Report


This report first introduces some background knowledge, including the emotion model employed, a detailed explanation of audio features, approaches to emotion recognition, and related work. Chapter 3 then presents the design and implementation of the project in full detail, including data processing, the feature extraction programme and how the project works. In Chapter 4 the results are presented and discussed. Finally, conclusions are drawn and some further work is discussed.

Chapter 2: Background
2.1 Emotion Model
2.1.1 Introduction
Although emotion is subjective and ambiguous, computers need exact values and numbers to "understand" what people think and judge. Therefore, whether in emotion regression or emotion classification, some means of quantifying or measuring mood is necessary in order to organize and retrieve music. Most researchers suggest that emotion can be measured with a multi-dimensional model; that is, a state of mood can be described as a multi-dimensional vector, which can then be used for data training and information retrieval. Russell [2] and Thayer [3] have made significant contributions to organizing emotion descriptors into low-dimensional models, for example the Valence-Arousal model (see Figure 3).
2.1.2 Related model

Figure 3 The Valence-Arousal space

One frequently used emotion model is Thayer's two-dimensional Valence-Arousal model. In this model, "Valence" stands for the appraisal of polarity in emotion, such as positive and negative, and "Arousal" represents the intensity or strength of the emotion. A high arousal value indicates a strong emotion such as excitement or anger. In Figure 3, some adjectives describing emotion, as labeled by Russell, are shown in each quadrant. The V-A model is easy to understand and use; however, it has limitations because of its low dimensionality. For example, anger and fear are both intense negative emotions and therefore have similar V-A values, yet they are not the same mood. In such cases a three-dimensional emotion model is more adequate for measuring emotion.
The other model used in this project is the PAD (Pleasure-Arousal-Dominance) model proposed by Mehrabian [4]. The Pleasure-Displeasure scale measures how pleasant an emotion is; the P dimension corresponds to the V dimension of the V-A model. The Arousal-Nonarousal scale measures the intensity or strength of the emotion: for example, while both anxiety and anger are negative emotions, the latter has a higher intensity, i.e. a higher arousal state. The Dominance-Submissiveness scale represents the controlling and dominant nature of the emotion, and is the extra dimension beyond those indicated in Figure 3. For instance, both fear and anger are unpleasant emotions, but anger is a dominant emotion while fear is a submissive one.
2.1.3 Discussion on perception of emotion
As stated before, people may experience different moods when listening to the same music, because of the subjectivity of aesthetics. For instance, a rock song may make rock fans excited and happy, while it might make other people feel slightly frightened. Even two professional musicians can have different perceptions and understandings of what emotion a piece of classical music is trying to express. Besides, cultural differences affect people's perception of music emotion, because any music has its cultural background; some researchers therefore focus on the music of a specific country, such as the Chinese songs in [5].
There is also a distinction between one's perception of the emotion(s) "expressed" by music and the emotion(s) "induced" by music: what emotion a song expresses and what it stimulates in the audience are not always the same. For instance, a song may be in a slightly sad mood, but because it sounds so beautiful, the audience may feel relaxed or even delighted. Another example occurs particularly in music written for films: when the background music is creating an evil atmosphere, the music itself may sound angry or excited, yet it gives the audience a perception of fear and anxiety. This point should be considered carefully when labeling ground truth data with emotion values.

2.2 Audio Features


2.2.1 What is audio feature
In order to characterize audio signals, many transforms and statistical calculations can be applied to a signal to represent its most distinctive attributes. The process of extracting audio features is essentially digital signal processing. For music information retrieval (MIR), it is necessary to use a set of audio features to represent an audio signal, which might otherwise be very large; in other words, to identify the critical components of the signal that are useful for recognizing its content while discarding the redundancies. In both music information retrieval and music emotion retrieval, no single dominant feature has emerged so far, so multi-dimensional feature vectors containing several feature values are usually used.

Figure 4 Flowchart of extracting spectral features

Generally there are two categories of audio feature: time-domain features and frequency-domain features. The original information of an audio signal is the sequence of amplitude values in the time domain, and the Fourier transform is used to convert it into the frequency domain. Audio features can also be divided into high-level and low-level features. Here, high-level means that the features are closer to human perception, human semantics and musical expression, while low-level features are based on computation and analysis of the signal from a physical or acoustical point of view, such as the spectral centroid, spectral variance and spectral standard deviation. A general flowchart for obtaining spectral features is shown in Figure 4.
2.2.2 Introduction of some audio features
1) Zero-Crossing Rate (ZCR)
The zero-crossing rate measures the number of times the signal crosses the zero axis, and is computed in the time domain. It is used not only in general audio analysis but also contributes a great deal to speech recognition. Periodic sounds tend to have a small ZCR, while noisy sounds tend to have a high ZCR. It is easy to compute and can be used to distinguish voice from noise. The definition of ZCR is:

$$ZCR = \frac{1}{2N}\sum_{n=1}^{N-1}\bigl|\operatorname{sgn}(x[n]) - \operatorname{sgn}(x[n-1])\bigr| \qquad (1)$$
where
$$\operatorname{sgn}(x[n]) = \begin{cases} 1, & x[n] \geq 0 \\ -1, & x[n] < 0 \end{cases} \qquad (2)$$
and
$$x[n],\; n = 0, 1, \ldots, N-1 \qquad (3)$$
denotes the time-domain signal in a frame of length $N$.

Figure 5 ZCR of an audio

From Figure 5 it can be inferred that where the ZCR is high there is likely to be noise, and where the ZCR is relatively low there is likely to be speech or voiced sound.
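For illustration, a minimal C++ sketch of such a sign-change ZCR computation over one frame is given below; the function name and the assumption of a mono frame with samples in [-1, 1] are illustrative and are not part of CAMEL.

```cpp
#include <vector>
#include <cstddef>

// Zero-crossing rate of one frame: fraction of adjacent sample pairs
// whose signs differ. Assumes a mono PCM frame in the range [-1, 1].
double zeroCrossingRate(const std::vector<float>& frame) {
    if (frame.size() < 2) return 0.0;
    std::size_t crossings = 0;
    for (std::size_t n = 1; n < frame.size(); ++n) {
        const bool prevNonNeg = frame[n - 1] >= 0.0f;
        const bool currNonNeg = frame[n] >= 0.0f;
        if (prevNonNeg != currNonNeg) ++crossings;
    }
    return static_cast<double>(crossings) / (frame.size() - 1);
}
```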
2) Spectral centroid
The spectral centroid indicates where the "center of mass" of the spectrum lies. Perceptually it can be described as the "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights [6]. The definition of the spectral centroid is as follows:

$$SC = \frac{\sum_{k} f(k)\, a(k)}{\sum_{k} a(k)} \qquad (4)$$
where $f(k)$ is the frequency of bin $k$ and $a(k)$ its spectral magnitude.

3) Spectral slope
The spectral slope represents the amount of decrease of the spectral amplitude. It is computed by a linear regression of the spectral amplitude:
$$a(f) = \mathrm{slope}\cdot f + \mathrm{const} \qquad (5)$$
where
$$\mathrm{slope} = \frac{1}{\sum_{k} a(k)}\cdot\frac{N\sum_{k} f(k)\,a(k) - \sum_{k} f(k)\sum_{k} a(k)}{N\sum_{k} f(k)^{2} - \bigl(\sum_{k} f(k)\bigr)^{2}} \qquad (6)$$
with $a(k)$ the spectral amplitude of bin $k$, $f(k)$ its frequency, and $N$ the number of bins.

4) Spectral roll-off
The spectral roll-off point is the frequency below which 95% of the signal energy is contained. It is somewhat correlated with the harmonic/noise cut-off frequency:
$$\sum_{i=1}^{f_{c}} \mathrm{power}(i) = p \sum_{i=1}^{n} \mathrm{power}(i), \qquad p = 0.95 \qquad (7)$$
where $f_{c}$ is the roll-off frequency, $\mathrm{power}(i)$ the power of spectral bin $i$, and $n$ the number of bins.

Figure 6 Process of Extracting Spectral Roll-off feature


Figure 7 Graphical depiction of statistical spectral features [1]

A flowchart of extracting the spectral roll-off is shown in Figure 6, and Figure 7 gives a graphical depiction of the spectral features introduced above.
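For illustration, the following C++ sketch computes the spectral centroid of equation (4) and the 95% roll-off point of equation (7) from a magnitude spectrum; the function names and the bin-to-frequency mapping are assumptions made for the sketch, not CAMEL's actual code.

```cpp
#include <vector>
#include <cstddef>

// Spectral centroid of a magnitude spectrum, in Hz.
// mag[k] is the magnitude of bin k; binHz is the width of one bin
// (sampleRate / fftSize).
double spectralCentroid(const std::vector<double>& mag, double binHz) {
    double weighted = 0.0, total = 0.0;
    for (std::size_t k = 0; k < mag.size(); ++k) {
        weighted += k * binHz * mag[k];
        total    += mag[k];
    }
    return total > 0.0 ? weighted / total : 0.0;
}

// Roll-off frequency: the frequency below which 95% of the spectral
// energy (here, squared magnitude) is contained.
double spectralRolloff(const std::vector<double>& mag, double binHz,
                       double fraction = 0.95) {
    double totalEnergy = 0.0;
    for (double m : mag) totalEnergy += m * m;
    double cumulative = 0.0;
    for (std::size_t k = 0; k < mag.size(); ++k) {
        cumulative += mag[k] * mag[k];
        if (cumulative >= fraction * totalEnergy)
            return k * binHz;
    }
    return mag.empty() ? 0.0 : (mag.size() - 1) * binHz;
}
```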

5) Tristimulus features
The tristimulus values were introduced as a descriptor of timbre [6]. They are three energy ratios that allow a fine description of the first harmonics of the spectrum, which are perceptually the most salient. In (8), (9) and (10), a(n) denotes the energy of the nth harmonic of the spectrum and H is the number of harmonics available in the spectrum.

$$T_{1} = \frac{a(1)}{\sum_{n=1}^{H} a(n)} \qquad (8)$$

$$T_{2} = \frac{a(2) + a(3) + a(4)}{\sum_{n=1}^{H} a(n)} \qquad (9)$$

$$T_{3} = \frac{\sum_{n=5}^{H} a(n)}{\sum_{n=1}^{H} a(n)} \qquad (10)$$
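A minimal C++ sketch of equations (8)-(10), assuming the harmonic energies have already been estimated, is shown below; the function and its simplifications are illustrative, not CAMEL's implementation.

```cpp
#include <vector>
#include <array>
#include <numeric>

// Tristimulus T1..T3 from the energies of the harmonics a[0..H-1],
// where a[0] is the fundamental. Returns {0,0,0} for fewer than
// five harmonics, to keep the sketch simple.
std::array<double, 3> tristimulus(const std::vector<double>& a) {
    std::array<double, 3> t = {0.0, 0.0, 0.0};
    if (a.size() < 5) return t;
    const double total = std::accumulate(a.begin(), a.end(), 0.0);
    if (total <= 0.0) return t;
    t[0] = a[0] / total;                                          // T1: fundamental
    t[1] = (a[1] + a[2] + a[3]) / total;                          // T2: harmonics 2-4
    t[2] = std::accumulate(a.begin() + 4, a.end(), 0.0) / total;  // T3: remaining harmonics
    return t;
}
```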


6) MFCC (Mel-scale Frequency Cepstral Coefficients)

Figure 8 Mel-frequency

MFCCs are based on the known variation of the human ear's critical bandwidth with frequency [7]. In other words, they are much closer to the human auditory system, because the Mel frequency scale is non-linearly spaced on the frequency axis (Figure 8). The MFCC filter bank uses filters spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz. The definition of the Mel frequency is:
$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (11)$$
where Mel(f) denotes the Mel frequency and f denotes the frequency in Hz.
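A direct C++ translation of (11), together with its inverse, might look as follows (illustrative helper functions, not part of CAMEL):

```cpp
#include <cmath>

// Convert a frequency in Hz to the Mel scale, as in (11).
double hzToMel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }

// Inverse mapping: Mel value back to Hz.
double melToHz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }
```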

Figure 9 Procedure of calculating Cepstrum

The term "cepstrum" comes from "spectrum" and is a useful representation for audio analysis. The steps for computing the cepstrum are shown in Figure 9. The cepstrum is defined as the inverse DFT of the log magnitude of the DFT of a signal:
$$c[n] = F^{-1}\bigl\{\log\bigl|F\{x[n]\}\bigr|\bigr\} \qquad (12)$$
where $F^{-1}$ denotes the IDFT and $F$ the DFT.


Figure 10 MFCC Block Diagram [8]

Figure 10 demonstrates the whole process of extracting MFCCs. The output Mel-scale frequency cepstral coefficients are typically the first 12 or 13 DCT (Discrete Cosine Transform) coefficients.

2.3 General Approaches for Music Emotion Recognition


2.3.1 Emotion Classification
In emotion classification, moods are classified into several classes. For example, in one of the first publications on this topic [9], 13 mood categories are defined, each described by a group of adjectives, as shown in Table 1. Another example is the 11 emotion categories based on the V-A model shown in Figure 11. The classifier can be built using SVMs (Support Vector Machines); for example, in [9] audio features related to timbre, rhythm and pitch were used to train the SVMs, and the final accuracy was 45%. In 2009, Cao and Li submitted a system that was a top performer in several categories, including mood classification (65.7%) [10]. The system uses a "super vector" of low-level audio features, and the classification is implemented with a Gaussian Super Vector followed by a Support Vector Machine (GSV-SVM).


Table 1: The Adjective Groups defined in [9]

Group   Adjectives               Group   Adjectives
A       cheerful, gay, happy     H       dramatic, emphatic
B       fanciful, light          I       agitated, exciting
C       delicate, graceful       J       frustrated
D       dreamy, leisurely        K       mysterious, spooky
E       longing, pathetic        L       passionate
F       dark, depressing         M       bluesy
G       sacred, spiritual

Figure 11 An Example of 11 Emotion Categories in VA space

2.3.2 Emotion Regression


Emotion regression aims to predict an emotion value rather than an emotion category. The output of a regression is therefore a value computed by the regression algorithm, and there is no fixed number of possible emotion states.
The basic idea of regression is to find a function that accurately approximates the target values from the input values. SVR (Support Vector Regression) is an application of SVM that finds the mapping function between input and output [11]. For a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with feature vectors $x_i$ and target values $y_i \in \mathbb{R}$, $i = 1, 2, \ldots, n$, the relation between the input $x_i$ and the output $y_i$ can be mapped by an optimal regression function $f(x)$ obtained through SVR training. Assuming linearity in the transformed space, $f$ can be represented as the hyperplane

$$f(x) = \omega \cdot \varphi(x) + b \qquad (13)$$

where $b \in \mathbb{R}$ and $\varphi$ denotes a nonlinear transformation of the input into a high-dimensional space in which $\omega$ lives. The aim of basic SVR is to compute $\omega$ and $b$.
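For reference, the standard epsilon-insensitive SVR training problem that determines $\omega$ and $b$ can be written as follows; this is the textbook formulation rather than one reproduced from [11]. Here $C$ trades off flatness against training error and $\varepsilon$ is the width of the insensitive tube:

$$\min_{\omega,\, b,\, \xi,\, \xi^{*}} \; \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{n} \bigl( \xi_i + \xi_i^{*} \bigr)
\quad \text{subject to} \quad
\begin{cases}
y_i - \omega \cdot \varphi(x_i) - b \le \varepsilon + \xi_i \\
\omega \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\
\xi_i,\; \xi_i^{*} \ge 0, \qquad i = 1, \ldots, n
\end{cases}$$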


2.3.3 Comparisons and Approach Choosing
In music emotion classification, the emotion space is modeled by a fixed number of classes. This can be a plausible approach to music emotion retrieval, but the detailed states of mood within each class are unclear, and this ambiguity may confuse users when they retrieve music by emotion. In music emotion regression, by contrast, the emotion space is viewed as continuous and each point in the space is considered a distinct emotional state. In this way, the ambiguity associated with emotion classes can be avoided. Music emotion regression is therefore considered more appropriate for music emotion retrieval, and in this paper a regression learning method is employed rather than classification.

2.4 Related Work


2.4.1 Related Work
Recent work on predicting music emotion from audio has indicated that parametric regression approaches can outperform labeled classification using equivalent features [1].
A regression approach to music emotion recognition based on the V-A model was proposed by Yang et al. in [12], which introduced the use of regression for mapping high-dimensional acoustic features to the two-dimensional space. The ground truth consists of 195 music clips labeled with V-A values. 114-dimensional features were extracted with the PsySound and Marsyas tools and then reduced to a tractable number of dimensions using PCA (Principal Component Analysis) before regression. The best performance, evaluated in terms of the $R^2$ statistic, reaches 58.3% for arousal and 28.1% for valence.


In [13], Guan et al. proposed an AdaBoost-based approach for music emotion regression. In that paper 1,687 Chinese songs were collected and labeled based on the PAD model. Multi-modal features were employed, including audio, MIDI and lyric features. The best performance was obtained with the AdaBoost.RM algorithm, with correlation coefficients for regression on P, A and D of 72%, 84% and 75% respectively. In addition, using SMOreg, the algorithm employed in this project, the correlation coefficients reached 69% on the P dimension, 82.8% on the A dimension and 72% on the D dimension.
Also employing the PAD model, Wang et al. enhanced emotion regression on pop music by considering song structure [14]. In that paper different structures of popular songs (e.g. verse and chorus) are regressed separately. The results show that structure information can help improve emotion regression: the verse is good for pleasure recognition, while the chorus is good for arousal and dominance.
2.4.2 Discussions and Approach Choosing
This project focuses on pure music, which contains no lyric information, in order to explore the relationship between acoustic (audio) features and music emotion. Much research focuses on pop songs, but lyrics undoubtedly carry a great deal of emotional information, especially about whether the emotion is positive or negative; the samples in this project are therefore all pure music rather than songs. In addition, the project employs a feature extraction tool called CAMEL which is not commonly used, but which provides some special features that might be critical and meaningful for emotion regression.


Chapter 3: Design and Implementation


3.1 Ground Truth Preparation
3.1.1 Samples collection
The 1,217 samples of pure music were downloaded from the Internet. We searched social tags (e.g. comments or tags) using a set of keywords and then selected the samples uniformly across different emotional tendencies (e.g. excitement, relaxation, sadness and anxiety). All of the audio files are in ".wav" format, PCM coded, stereo, with a sampling rate of 44.1 kHz and 16 bits per sample.
3.1.2 Emotion model choosing and labeling process
Based on the discussion in Section 2.1, the project uses the PAD model as the emotion model, as it provides a more accurate and general measurement, especially through its Dominance dimension.

Figure 12 The interface of the labeling system

The samples were labeled using a simple graphical interface (Figure 12) built by our group. A group of 10 people aged 22 to 50 annotated the samples with emotion values following the Self-Assessment Manikin (SAM) [15]. More precisely, the samples were labeled with three-dimensional integer values for Pleasure, Arousal and Dominance, each ranging from 0 to 8, so each scale is restricted to 9 specific values. In the interface, each block showing a picture of an emotional state corresponds to a value from 0 to 8, which helps users better understand the emotion value on each scale. To operate the system, the annotator first plays and listens to a music sample and tries to identify the emotion expressed in the music, and then simply selects the corresponding button on each PAD scale. In addition, we kept the volume of all samples the same, to avoid the influence that high or low volume might have on the annotator's emotion. Due to the large number of samples (1,217), the labeling process took more than two months.
3.1.3 Data processing
After all the members had finished labeling the 1,217 samples of pure music, the annotated P, A and D values of each sample were recorded in an .xls file. The final emotion value on each PAD scale is the mean of the values given by the 10 group members; for example, the P value of a sample is the average of the P values labeled by all members.

Figure 13 Data processing

As shown in Figure 13, the training data for machine learning consist of three individual files in ".arff" format, each containing the multi-dimensional feature vectors together with the emotion values of one PAD scale; these files are produced through a combination of programs and manual operations.

3.2 Feature Extraction


3.2.1 Introduction of CAMEL
The feature extraction programme written in this project is based on a tool called CAMEL (Content-based Audio and Music Extraction Library), a lightweight, open-source framework for content-based audio and music analysis [16]. The main use of CAMEL is to extract a set of audio features from an audio signal. A partial list of the features in CAMEL is shown in Table 2, and the full list can be found in the appendix.
Table 2: List of Features in CAMEL

Feature                           Dimension   Domain
1  Mean                           1           Time
2  Variance                       1           Time
3  Standard Deviation             1           Time
4  Average Deviation              1           Time
5  Skewness                       1           Time
6  Kurtosis                       1           Time
7  ZCR                            1           Time
8  RMS                            1           Time
9  Non-Zero Count                 1           Time
10 Spectral Centroid              1           Frequency
11 Spectral Variance              1           Frequency
12 Spectral Standard Deviation    1           Frequency
13 Spectral Average Deviation     1           Frequency
...
27 Spectral Loudness              1           Frequency
28 Spectral Sharpness             1           Frequency
29 Peak Tristimulus 1             1           Peak
30 Peak Tristimulus 2             1           Peak
31 Peak Tristimulus 3             1           Peak
32 MFCC                           13          Frequency
33 Bark                           26          Frequency

Four domains are used in CAMEL: the Time, Frequency, Peak and Harmonic domains. The Time domain simply consists of the original PCM values representing the amplitude of the signal sampled at a specific rate (e.g. 44.1 kHz) over time. The Frequency domain is the Fourier transform of the time domain, representing magnitude against frequency. The Peak domain is derived from the Frequency domain, keeping only the frequency values above a threshold, with some transformation according to neighbouring values. The Harmonic domain keeps only those values of the Peak domain that are harmonics, i.e. whole-number multiples (within some tolerance) of the fundamental frequency of the Time domain; it is the domain closest to music and human hearing.
There are 9 time-domain features in CAMEL, which is unusual, since other extraction tools such as jAudio do not provide as many time-domain features. Both MFCC and Bark are multi-dimensional features: MFCC contains 13 dimensions and Bark contains 26. In total the tool can therefore extract a 70-dimensional feature vector.
3.2.2 Introduction of VAMP plug-in
A Vamp plug-in is a piece of compiled program code that analyses a digital audio signal and returns audio feature information. Vamp is a plug-in API through which feature extraction plug-ins can be written, and several plug-ins can be integrated together in a host programme.

Figure 14 Vamp Plug-in in the project

As presented in Figure 14, an extra plug-in based on CAMEL was written using the Vamp API in this project. It is used by a GUI host programme, written by other people in the lab, to extract audio features and apply machine learning.


Figure 15 The Interface of the Host and Plug-in

Figure 15 shows the graphical interface of the beta version of the host and plug-in programme.
(More details can be found in the appendix)
3.2.3 Modifications and extraction

Figure 16 Modifications on the programme

The modified version of the feature extraction tool is written in C++ and developed in Visual Studio. The samples from which features are to be extracted are all in ".wav" format, but CAMEL originally could only read a ".txt" file containing the PCM information, so the first task was to modify the programme so that it could use ".wav" files directly as input. As Figure 16 shows, another library, libsndfile, is integrated into the CAMEL programme at this stage to read the audio files. Next, in order to carry out machine learning, the extracted features need to be in a particular format, the ".arff" format, so the main function of CAMEL is modified to generate an ARFF file as output. The programme was run on the server, and extracting the feature sets of the 1,217 music samples took three days. The FFT window size is 1024, a Hanning window is applied, and the magnitude spectrum is used.
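As an illustration of these two modifications, a minimal sketch of reading a ".wav" file with libsndfile and writing an ARFF header is given below. Feature extraction itself is omitted, and names such as `readWav` and `writeArffHeader` are illustrative, not CAMEL's actual API.

```cpp
#include <sndfile.h>   // libsndfile
#include <fstream>
#include <vector>
#include <string>
#include <cstddef>

// Read a mono or multi-channel .wav file into interleaved float samples.
bool readWav(const std::string& path, std::vector<float>& samples, SF_INFO& info) {
    info.format = 0;
    SNDFILE* file = sf_open(path.c_str(), SFM_READ, &info);
    if (!file) return false;
    samples.resize(static_cast<std::size_t>(info.frames) * info.channels);
    sf_read_float(file, samples.data(), static_cast<sf_count_t>(samples.size()));
    sf_close(file);
    return true;
}

// Write the header of an ARFF file: one numeric attribute per feature
// dimension plus the emotion value as the target attribute.
void writeArffHeader(std::ofstream& out, int numFeatures, const std::string& target) {
    out << "@RELATION music_emotion\n\n";
    for (int i = 0; i < numFeatures; ++i)
        out << "@ATTRIBUTE feature" << i << " NUMERIC\n";
    out << "@ATTRIBUTE " << target << " NUMERIC\n\n@DATA\n";
}
```

Each subsequent data line then holds the comma-separated feature values of one sample followed by its labeled P, A or D value.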

3.3 Emotion Regression


3.3.1 Machine learning tool
Weka [17], a machine learning toolkit, is used in the experiments. The SMOreg algorithm is used to build the regressors, with an RBF (Radial Basis Function) kernel and default parameters. The project employs the RBF kernel instead of a linear or polynomial kernel because of its flexibility.
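For reference, the RBF kernel has the standard form

$$K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^{2}\right)$$

where $x_i$ and $x_j$ are feature vectors and $\gamma$ is the kernel width parameter; in this project the parameter is left at Weka's default value.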
3.3.2 Regression System

Figure 17 The Flowchart of the Regression System

In the regression process (Figure 17), the 1,217 audio files in ".wav" format are first put through the modified CAMEL programme. After three days all the features have been extracted and saved in ".arff" files. Through a combination of manual operation and programs, the labeled P, A and D values are combined with the feature set and three individual files are generated. These ".arff" files are then fed into Weka as training data for regression learning, and the prediction results are obtained in Weka. Finally the results are analysed and evaluated according to the correlation coefficient for each of the P, A and D dimensions using cross-validation.


Chapter 4: Results and Discussion


4.1 Evaluation
4.1.1 Evaluation Method
1) Cross-validation
In k-fold cross-validation, the original samples are randomly partitioned into k subsamples. Of these, a single subsample is retained as the validation data for testing the model, and the remaining (k-1) subsamples are used as training data [18]. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The merit of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. In this project 5-fold cross-validation is employed.
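A minimal sketch of how such a 5-fold partition of sample indices can be produced is shown below; it is purely illustrative, since Weka performs the partitioning internally.

```cpp
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>
#include <cstddef>

// Split the indices 0..numSamples-1 into k shuffled folds of roughly
// equal size. folds[i] holds the indices of the i-th validation set;
// the remaining folds form the corresponding training set.
std::vector<std::vector<std::size_t>> makeFolds(std::size_t numSamples,
                                                std::size_t k,
                                                unsigned seed = 42) {
    std::vector<std::size_t> indices(numSamples);
    std::iota(indices.begin(), indices.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    std::vector<std::vector<std::size_t>> folds(k);
    for (std::size_t i = 0; i < numSamples; ++i)
        folds[i % k].push_back(indices[i]);
    return folds;
}
```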
2) Correlation Coefficient
The correlation coefficient (CF) statistic developed by Karl Pearson is commonly adopted to measure the performance of regressors, i.e. how well the regression equation represents the data. Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_{X}\,\sigma_{Y}} \qquad (14)$$

where $\rho_{X,Y}$ denotes the CF value, $\operatorname{cov}(X, Y)$ denotes the covariance of X and Y, and $\sigma_{X}$ and $\sigma_{Y}$ denote their standard deviations.
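A direct C++ implementation of (14) for two equally sized series, for example the annotated and the predicted arousal values, might look like this (an illustrative sketch, not Weka's internal code):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Pearson correlation coefficient between two equally sized series,
// e.g. the annotated and the predicted emotion values.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    if (n == 0 || y.size() != n) return 0.0;
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i) { meanX += x[i]; meanY += y[i]; }
    meanX /= n; meanY /= n;

    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - meanX, dy = y[i] - meanY;
        cov  += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    const double denom = std::sqrt(varX) * std::sqrt(varY);
    return denom > 0.0 ? cov / denom : 0.0;
}
```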


4.1.2 Experiments and Results


1) Experiment 1 - Implementation and Validation of the Regression System
The first experiment uses all the features extracted by CAMEL to carry out the emotion regression on each of the labeled P, A and D values. The feature set is 70-dimensional, including the 13-dimensional MFCC and the 26-dimensional Bark feature. After 5-fold cross-validation, the output correlation coefficients (CF) are 51.71% on the P scale, 83.81% on the A scale and 75.1% on the D scale, as Table 3 shows. A small sample of the prediction results is shown in Table 4, and more details can be found in the appendix.
Table 3: Result Using All Features in CAMEL

Emotion Space   CF Value
Pleasure        51.71%
Arousal         85.87%
Dominance       74.87%

Table 4: A Sample of Prediction Results in the A dimension

Actual Arousal Value   Prediction   Error
5.3                    5.822        0.522
7.1                    5.805        -1.295
7                      7.208        0.208
3.8                    3.345        -0.455
6.7                    5.145        -1.555
4.1                    5.549        1.449
5.7                    6.152        0.452
4.4                    4.278        -0.122
5.545                  5.402        -0.143
4.3                    4.172        -0.128
6.182                  5.162        -1.019
6.1                    5.78         -0.32
3.8                    4.456        0.656
5                      4.891        -0.109
5.7                    6.21         0.51
6.8                    5.546        -1.254
3.5                    3.371        -0.129
5.909                  5.867        -0.042
5.455                  4.773        -0.682
5.7                    5.15         -0.55

2) Experiment 2 - Feature Selection and Dimension Reduction
Next, Correlation-based Feature Subset Selection with the BestFirst search method was applied in Weka. This attribute (feature) selection method returns a small number of the most significant features, those most highly correlated with the regression target. The feature subsets selected for each of the P, A and D spaces are presented in Table 5.


Table 5: Selected Features List

In P-space                 In A-space                 In D-space
(1) Spectral Loudness      (1) Spectral Loudness      (1) Spectral Loudness
(2) MFCC3                  (2) MFCC3                  (2) MFCC3
(3) Kurtosis               (3) Kurtosis               (3) Peak Tristimulus 1
(4) Peak Tristimulus 1     (4) Peak Tristimulus 2     (4) Peak Tristimulus 3
(5) Peak Tristimulus 2     (5) Spectral Sharpness     (5) Spectral Spread
(6) Peak Tristimulus 3     (6) MFCC4
(7) ZCR
(8) MFCC9

The experiment then kept only the selected features and ran the regression learning again. The regression results with the reduced feature set are shown in Table 6.
Table 6: Result Using Selected Features and Comparison

Emotion Space   CF Value after Selection   CF Value using ALL features
Pleasure        43.47%                     51.71%
Arousal         82.29%                     85.17%
Dominance       71.67%                     74.87%


3) Experiment 3 - Comparison of Different Feature Sets
In this experiment, another set of features, extracted with a tool called jAudio, was used for training and regression on the same ground truth, i.e. the 1,217 pieces of pure music. The experiment aims to explore the relationship between audio features and pure-music emotion a step further, and to test the effectiveness of different feature sets. The regression results using the jAudio feature set, and the comparison with CAMEL, are shown in Table 7; the list of features selected by Weka is shown in Table 8. Note that the jAudio feature set has 112 dimensions, whereas the CAMEL set has 70.
Table 7: Result and Comparison using the jAudio and CAMEL feature sets

Emotion Space   CF Value of jAudio   CF Value of CAMEL
Pleasure        49.93%               51.71%
Arousal         87.71%               85.17%
Dominance       76.86%               74.87%

Table 8: Best Features of jAudio

Selected features for the P dimension: Spectral Flux Overall Standard Deviation; Strength Of Strongest Beat Overall Standard Deviation; LPC Overall Standard Deviation2; LPC Overall Standard Deviation3; LPC Overall Standard Deviation4; LPC Overall Standard Deviation5; LPC Overall Standard Deviation6; LPC Overall Standard Deviation7; LPC Overall Standard Deviation8; MFCC Overall Average7; LPC Overall Average2; LPC Overall Average3; LPC Overall Average4; LPC Overall Average5; LPC Overall Average6; LPC Overall Average7; LPC Overall Average8; Area Method of Moments Overall Average0.

Selected features for the A dimension: Fraction Of Low Energy Windows Overall Standard Deviation; MFCC Overall Standard Deviation12; LPC Overall Standard Deviation0; LPC Overall Standard Deviation1; LPC Overall Standard Deviation2; LPC Overall Standard Deviation3; Method of Moments Overall Standard Deviation0; Area Method of Moments Overall Standard Deviation0; Spectral Rolloff Point Overall Average0; LPC Overall Average0; LPC Overall Average1; LPC Overall Average2; Area Method of Moments Overall Average0.

Selected features for the D dimension: Spectral Flux Overall Standard Deviation; Root Mean Square Overall Standard Deviation; Fraction Of Low Energy Windows Overall Standard Deviation; LPC Overall Standard Deviation3; Method of Moments Overall Standard Deviation0; Spectral Flux Overall Average0; LPC Overall Average0; LPC Overall Average2; LPC Overall Average3; LPC Overall Average4; LPC Overall Average5; LPC Overall Average6; LPC Overall Average7; LPC Overall Average8; Area Method of Moments Overall Average0.

4) Experiment 4 - Comparison of Regression on Songs and Pure Music
This experiment compares the performance of the same audio feature set on pure music and on pop songs with lyrics and vocals. In this test, another ground truth of 2,000 pop songs, collected and annotated by other people in the lab, was used. These data are also labeled with PAD values and the features are extracted by CAMEL. The results and comparison are shown in Table 9.
Table 9: Result and Comparison using different ground truths

Emotion Space   CF Value of 2,000 Songs   CF Value of Pure Music
Pleasure        41.31%                    51.71%
Arousal         81.25%                    85.17%
Dominance       65%                       74.87%

4.2 Discussion
4.2.1 Analysis and Discussion to the Experiments
In Experiment 1, the final results of the regression system meet expectations. The prediction output sample was generated during cross-validation, illustrating the actual process of automatic recognition on each piece of test data. We can also see that the regression performances on Arousal and Dominance are both good. Although the CF value in the Pleasure dimension is somewhat lower than in the A and D dimensions, it should be noted that, compared with other studies that use only audio features for regression, this coefficient is quite normal. In [13], the experimental results show that, for the Pleasure dimension, lyric features contribute the most to the regression performance (reaching a 62.3% CF value) whereas the audio feature set contributes the least (47.3% CF value). Overall, Experiment 1 validates the usability and effectiveness of the regression system this project set out to build.
In Experiment 2, compared with the performance using all the features, the result of using only 5 to 7 selected features seems good enough in the A and D dimensions, despite a reduction of about 8 percentage points in the P dimension. In real applications, if only a few feature dimensions need to be extracted for recognition, a great deal of computation time can be saved.
More importantly, Table 5 reveals some features that are significant for more than one dimension. First, Spectral Loudness and the 3rd MFCC coefficient are critical for all three dimensions. Spectral Loudness and Spectral Sharpness are computed over the Bark bands and are related to the brightness and the noisiness of the music respectively. We may infer that Spectral Loudness is important because it is based on the Bark bands, which are closer to human hearing, and because brightness (and noisiness) is a very important characteristic through which music expresses a state of emotion. MFCC is also related to human perception, so it is very appropriate for music emotion regression. Kurtosis is a simple statistical time-domain feature that measures the "peakedness" of the probability distribution of a real-valued random variable. Another time-domain feature is ZCR, which contributes much to the regression of arousal. The Peak Tristimulus 1 to 3 features were introduced in Section 2.2.2; they are very important because they describe the "timbre" of the sound, which is closely related to the emotion and especially to its strength.
In Experiment 3, a comparison is made between two different feature sets, and their results are close: CAMEL performs better on P whereas jAudio performs better on the A and D dimensions. As introduced before, CAMEL has a 70-dimensional feature set, most of which are low-level, basic acoustic features, while jAudio has a 112-dimensional set that includes some mid-level musical features such as tempo and beat, as well as LPC, an important feature in audio signal processing. One noticeable finding is that only 4 features (16 dimensions) are shared by the two feature sets; since the two sets perform very similarly in the regression learning, it might be a good idea to combine the useful features of both tools to achieve a better result.
In the last experiment, the results in Table 9 show that when audio features are used for regression on songs containing lyrics, the CF values in the P and D dimensions are much lower than those for pure music without lyrics and singing. The reason both perform well in the Arousal dimension is that the intensity of the emotion in music is mostly expressed through its acoustic features rather than through the lyrics. In conclusion, this result indicates that audio features do perform well for the regression of pure music.
4.2.2 Discussion about Emotion and Labeling
Besides the implementation of the feature extraction and the regression learning method, there remain some issues concerning our perception of emotion and the labeling process that can, to some degree, affect the regression results.
First, as stated before, people may have different, even opposite, perceptions of the same piece of music. The subjectivity of perception may blur the accuracy of the annotated emotion values, even though people do not always feel differently about the same piece. Second, the annotators may have slightly different understandings of the PAD model, especially of the Dominance scale: there may be disagreement on whether a piece conveys dominance or submissiveness, except for obvious emotional states such as fear and anger. Third, there are issues in the music itself. A piece may contain more than one type or state of mood, especially in pure music, and sometimes two contrary emotions appear in the same piece. One of the assumptions of this project is that there is only one state of mood over the whole length of a piece, so a single emotion vector is used to label each sample; but as stated in [1], some research uses a time series of vectors to track changes in emotional content over the duration of a piece, which might be more adequate. In addition, the genre of a piece may also contribute a great deal to the audience's perception, but it is not considered in this project since most of the extracted audio features are very low-level.


Chapter 5: Conclusion and Further Work


5.1 Summary
This paper has presented research on the relationship between the emotion in pure music and audio features extracted by a modified tool, together with an implementation of emotion regression for pure music. A dedicated ground truth dataset of 1,217 samples of pure music was collected and labeled with PAD emotion values. The feature extraction tool CAMEL was modified into a more convenient version for the project, and also into a plug-in for a Vamp host programme. The training data, containing the extracted 70-dimensional feature set and the emotion values, were fed into the machine learning tool Weka and the regression learning was evaluated. From the analysis of the results, several conclusions were drawn: (1) the regression system works well, since its CF values meet expectations; (2) features such as Spectral Loudness, MFCC and Peak Tristimulus contribute greatly to the regression; (3) the two different feature sets from jAudio and CAMEL perform very similarly, and both sets deserve further exploration; (4) for music emotion regression, audio features perform much better on pure music than on songs with lyrics.

5.2 Further Work


First, the ground truth data could be processed further; for example, music samples that are too long or carry very complex emotions could be removed, to avoid confusion when labeling a sample with a single emotion value.
Second, the feature extraction tool, CAMEL, could be further optimized. At present it cannot extract a set of features quickly, because every feature is computed from the original data, which means the Fourier transform and window functions are applied several times to the same audio.
Third, other tools could be tried, especially those containing more musical features, which may help improve the recognition.


References
[1] Kim, Y. E., Schmidt, E. M., Migneco, R., Morton, B. G., Richardson, P., Scott, J., ... & Turnbull, D. (2010, August). Music emotion recognition: A state of the art review. In Proc. ISMIR (pp. 255-266).
[2] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[3] R. E. Thayer, The Biopsychology of Mood and Arousal. New York: Oxford University Press, 1989.
[4] Mehrabian, A., "Framework for a comprehensive description and measurement of emotional states," Genetic, Social, and General Psychology Monographs, vol. 121, pp. 339-361, 1995.
[5] Wang, X., Chen, X., Yang, D., & Wu, Y. (2011). Music emotion classification of Chinese songs based on lyrics using TF*IDF and rhyme. In Proc. ISMIR (pp. 765-770).
[6] Peeters, G. (2004). A large set of audio features for sound description (similarity and classification) in the CUIDADO project. IRCAM, Paris, France.
[7] E. C. Gordon, Signal and Linear System Analysis. John Wiley & Sons Ltd., New York, USA, 1998.
[8] Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using Mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083.
[9] T. Li and M. Ogihara, "Detecting emotion in music," in Proc. of the Intl. Conf. on Music Information Retrieval, Baltimore, MD, October 2003.
[10] C. Cao and M. Li, "Thinkit's submissions for MIREX 2009 audio music classification and similarity tasks," ISMIR, MIREX 2009.
[11] Han, B. J., Ho, S., Dannenberg, R. B., & Hwang, E. (2009). SMERS: Music emotion recognition using support vector regression.
[12] Yang, Y. H., Lin, Y. C., Su, Y. F., & Chen, H. H. (2008). A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, 16(2), 448-457.
[13] Guan, D., Chen, X., & Yang, D. (2012). Music emotion regression based on multi-modal features. In 9th International Symposium on Computer Music Modeling and Retrieval.
[14] Wang, X., Wu, Y., Chen, X., & Yang, D. (2013, October). Enhance popular music emotion regression by importing structure information. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific (pp. 1-4). IEEE.
[15] P. Juslin and P. Laukka, "Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening," Journal of New Music Research, vol. 33, no. 3, p. 217, 2004.
[16] Sanden, C., Befus, C. R., & Zhang, J. Z. (2010, September). CAMEL: A lightweight framework for content-based audio and music analysis. In Proceedings of the 5th Audio Mostly Conference: A Conference on Interaction with Sound (p. 22). ACM.
[17] Weka: Data Mining Software in Java. Retrieved February 1, 2014, from http://www.cs.waikato.ac.nz/ml/weka
[18] Zitao Liu (2012). Correlation Coefficient. Retrieved February 1, 2014, from http://www.cnblogs.com/Gavin_Liu/archive/2010/09/19/1830902.html


Acknowledgement
I would like to thank my supervisor for his great help in this project and thank my parents for their
support. Also I appreciate group members in the lab for their helpful suggestions on my work.


Appendix


1) List of CAMEL's feature set


Feature
1 Mean
2 Variance
3 Standard Deviation
4 Average Deviation
5 Skewness
6 Kurtosis
7 ZCR
8 RMS
9 Non-Zero Count
10 Spectral Centroid
11 Spectral Variance
12 Spectral Standard Deviation
13 Spectral Average Deviation
14 Spectral Skewness
15 Spectral Kurtosis
16 Spectral Irregularity K
17 Spectral Irregularity J
18 Spectral Flatness
19 Spectral Tonality
20 Spectral Min
21 Spectral Max
22 Spectral Crest
23 Spectral Slope
24 Spectral Spread
25 Spectral Rolloff
26 Spectral HPS
27 Spectral Loudness
28 Spectral Sharpness
29 Peak Tristimulus 1
30 Peak Tristimulus 2
31 Peak Tristimulus 3
32 MFCC
33 Bark


2) The Interface of Vamp host and plug-in programme.


(Note: the author only wrote the plug-in, and the host programme was written by other group
members in our lab)


3) Table 8: the full list of selected jAudio features (identical to Table 8 in Section 4.1.2, Experiment 3)


4) A demo/sample of music emotion prediction on Arousal (output by Weka)


Risk Assessment
Description of Risk                      Description of Impact   Likelihood rating   Impact rating   Preventative actions
Fail to collect the ground truth data                            Rare                Catastrophic    Take action if cost effective.
Bugs of the plug-in programme                                                        Catastrophic    Fix them urgently
The result may not be ideal                                                          Serious         Try to make the result as accurate as possible

Environmental Impact Assessment


The project was completed using computer software and programmes on my personal computer and on the server, so its environmental impact is small: there is no manufacturing cost, little waste for disposal or recycling, and some energy use from running the feature extraction tool on the server for three days.

