You are on page 1of 5

The pursuit of happiness in music: retrieving valence with high-level musical descriptors

Jose Fornari & Tuomas Eerola
Finnish Centre of Excellence in Interdisciplinary Music Research Department of Music University of Jyväskylä, Finland

Abstract. In the study of music emotions, valence is normally referred as one of the emotional dimensions that describes music appraisal of happiness, whose scale goes from sad to happy. Nevertheless, related literature shows that valence is known to be particularly difficult to be predicted by a computational model. As valence is a contextual music feature, it is assumed here that the prediction of valence requires high-level music descriptors in its model. This work shows the usage of eight high-level descriptors, previously developed by our group, to describe happiness in music. Each high-level descriptor was independently tested using the correlation of its prediction with the mean rating of valence created by thirty-five listeners over one piece of music. Finally, a model using all high-level descriptors was developed and the result of its prediction is described and compared, for the same music piece, with two other important models for the dynamic rating of music emotion.

1 Introduction
Music emotion has been studied by many researches in the field of psychology, such as the ones described in [1]. The literature mentions three main models for music emotions: 1) categorical model, originated from the work of [2], that describes music in terms of a list of basic emotions [3], 2) dimensional model, originated from the work of [4], who proposed that all emotions can be described in a Cartesian coordinate system of emotional dimensions [5], and 3) component process model, from the work of [6] that describes emotion appraised according to the situation of its occurrence and the current listener's mental (emotional) state. Computational models, for the analysis and retrieval of emotional content in music have also been studied and developed, in particular by the Music Information Retrieval (MIR) community, that also maintain a repository of publication on the field (available at the International Society for MIR: To name a few, [7] studied a computational model for musical genre classification that is similar (although simpler) to emotion retrieval. [8] provided a good example of audio feature extraction, using multivariate data analysis and behavioral validation of the features. There are several other examples of computing models for retrieving emotional re-

lated features from music, such as [9] and [10] that studied the retrieval of high-level features of music, such as tonality, in a variety of music audio files. In the study of the continuous development of music emotion, [11] used a twodimensional model to measure along time the music emotions appraised by listeners, for several music pieces. The emotion dimensions that he described are: arousal (that goes from calm to agitated) and valence (that goes from sad to happy). Then, he proposed a linear model with five acoustic descriptors to predict these dimensions in a time series analysis of each music piece. [12] applied the same listener’s mean ratings of [11], however, to develop and test a general model (meaning, one same model for all music pieces). This model was created with System Identification techniques to predict the same emotional dimensions.

2 The difficulty of predicting valence
As seeing in [11] and [12] results, these models successfully predicted the dimension of arousal, with high correlation with the ground-truth. However, the retrieval of valence has proved to be particularly difficult. This may be due to the fact that the previous models did not make extensive usage of high-level acoustic features. While low-level features account for basic temporal psychoacoustic features, such as loudness, roughness or pitch, the high-level ones account for cognitive musical features. These are contextual-based and deliver one prediction for each overall music excerpt. If this assumption is true, it is understandable the reason why valence, as a highly contextual dimension of music emotion, is poorly described by models using lowlevel descriptors. Intuitively, it was expected that valence, as the measurement of happiness in music, would be mostly related to the high-level descriptors of: key clarity (major x minor), harmonic complexity, and pulse clarity. However, as it is described further, the experimental result pointed out to other perspectives.

3 Designing high-level musical descriptors
Lately, our research group has being involved with the computational development of eight high-level musical descriptors. They are: 1) Pulse Clarity (the sensation of pulse in music). 2) Key Clarity (the sensation of a tonal center in music). 3) Harmonic complexity (the sensation of complexity delivered by the musical chords). 4) Articulation (music feature going from staccato to legato). 5) Repetition (presence of repeated musical patterns). 6) Mode (music feature going from minor to major tonalities). 7) Event Density (amount of distinctive and simultaneous musical events). 8) Brightness (sensation of how bright the music excerpt is). The design of these descriptors was done using Matlab. They all involve different techniques and approaches whose explanations are too extensive to fit in this work and will be thoroughly described in further publications. To test and improve the development of these descriptors, behavioral data was collected from thirty-three listeners that were asked to rate the same features that were

predicted by the descriptors. They rated one hundred short excerpts of music (five seconds of length each) from movie sound tracks. Their mean rating was then correlated with the descriptors predictions. After several experiments and adjustments, all descriptors presented a correlation with this ground-truth above fifty percent.

4 Building a model to predict valence
In the temporal dynamics of emotion research described in [11], Schubert created a ground-truth with data collected from thirty-five listeners that dynamically measured the emotion ratings depicted into a two-dimensional emotion plan that was then mapped in two coordinates: arousal and valence. Listener’s ratings were sampled every one second. The mean rating of these measurements, mapped into arousal and valence, created a ground-truth that was used later in [12] by Korhonen and also here, in this work. Here, the correlation between each high-level descriptor prediction and the valence mean rating from Schubert’s ground-truth was calculated. The valence mean rating utilized was the one from the music piece called “Aranjuez concerto”. This one is 2:45’ long. During the first minute the guitar performs alone (solo), then it is suddenly accompanied by a full orchestra whose intensity fades towards the end, till the guitar, once again, plays alone. For this piece, the correlation coefficient presented between the high-level descriptors and the valence mean rating are: event density: r = -59.05%, harmonic complexity: r = 43.54%, brightness: r = -40.39%, pulse clarity: r = 34.62%, repetition: r = 16.15%, articulation: r = -9.03%, key clarity: r = -7.77%, mode: r = -5.35%. Then, a multiple regression model was created with all eight descriptors. The model employs a time frame of three seconds (related to the cognitive “now time” of music) and hop-size of one second to predict the continuous development of valence. This model presented a correlation coefficient r = 0.6484, which leads to a coefficient of determination: R2 = 42%. For the same ground-truth, Schubert’s model used five music descriptors: 1) Tempo, 2) Spectral Centroid, 3) Loudness, 4) Melodic Contour and 5) Texture. The descriptors output differentiation was regarded as the model predictors. Using time series analysis, he built an ordinary least square (OLS) model for each music excerpt. Korhonen’s approach had eighteen low-level descriptors (see [12] for details) to test several models designed throughout system identification. The best general model reported in his work was an ARX (autoregressive with extra inputs). Table 1 shows the results comparison of R2 for the measurement of valence in the Aranjuez concerto, for all models.
Table 1. Emotional dimension: VALENCE. Ground-truth: Aranjuez concerto

type R2

Schubert OLS 33%

Korhonen ARX -88%

This work model Multiple Regression 42%

Event Density One Descriptor 35%

The table shows that the model reported here performed significantly better than the previous ones for this specific music piece. The last column of table 1 shows the per-

formance for the high-level descriptor “event density”, the one that presented the highest correlation with the ground-truth. This descriptor alone presents higher results than previous models. The results shown seem to suggest that high-level descriptors can be successfully used to improve the dynamic prediction of valence. The figure below depicts the comparison between the thirty-three listeners mean rating for valence in the Aranjuez concerto and the prediction given by the multiple regressive model using the eight high-level musical descriptors.

Fig. 1. Mean rating of behavioral data for valence (continuous line) and model prediction (dashed line)

5 Discussion and Conclusions
This work is part of a bigger project, called “Tuning you Brain for Music”, the BrainTuning project ( An important part of it is the study of acoustic features retrieval from music and the relation of them to specific emotional appraisals. Following this goal, the high-level descriptors (here briefly described) were designed and implemented. They were initially conceived out of the evident lack of such descriptors in the literature. In Braintuning, a fairly large number of studies for the retrieval of emotional connotations in music were investigated. As seem in previous models, for the dynamic retrieval of highly contextual emotions such as the appraisal of happiness (represented by valence), low-level descriptors are not enough, once that they do not take into consideration the contextual aspects of music. It was interesting to notice that the high-level prediction of the high-level descriptor “event density” presented the highest correlation with the valence mean rating,

while the predictions of “key clarity” and “mode” correlated very poorly. This seems to imply that, at least in this particular case, musical sensation of a major or minor tonality (represented by “mode”) or a tonal center (“key clarity”) is not related to valence, as it might be intuitively inferred. What most counted here was the amount of simultaneous musical events (event density). By “event”, it is here understood any perceivable rhythmic, melodic or harmonic pattern. This experiment chose the music piece “Aranjuez” because it was the one that previous models presented the lowest prediction rate for valence. Of course, more experiments are needed and are already planned for further studies. Nevertheless, we believed that this work may have presented an interesting prospect for the development of better high-level descriptors and models for the continuous measurement of contextual musical features such as valence.

1. Sloboda, J. A. and Juslin, P. (Eds.): Music and Emotion: Theory and Research. Oxford: Oxford University Press. ISBN 0-19-263188-8. (2001) 2. Ekman, P.: An argument for basic emotions. Cognition & Emotion, 6 (3/4): 169–200, (1992). 3. Juslin, P. N., & Laukka, P.: Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin(129), 770-814. (2003) 4. Russell, J.A.: Core affect and the psychological construction of emotion. Psychological Review Vol. 110, No. 1, 145- 172. (2003) 5. Laukka, P., Juslin, P. N., & Bresin, R.: A dimensional approach to vocal expression of emotion. Cognition and Emotion, 19, 633-653. (2005) 6. Scherer, K. R., & Zentner, K. R.: Emotional effects of music: production rules. In J. P. N. & J. A. Sloboda (Eds.), Music and emotion: Theory and research (pp. 361-392). Oxford: Oxford University Press (2001) 7. Tzanetakis, G., & Cook, P.: Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293-302. (2002) 8. Leman, M., Vermeulen, V., De Voogdt, L., Moelants, D., & Lesaffre, M.: Correlation of Gestural Musical Audio Cues. Gesture-Based Communication in Human-Computer Interaction. 5th International Gesture Workshop, GW 2003, 40-54. (2004) 9. Wu, T.-L., & Jeng, S.-K.: Automatic emotion classification of musical segments. Proceedings of the 9th International Conference on Music Perception & Cognition, Bologna, (2006) 10. Gomez, E., & Herrera, P.: Estimating The Tonality Of Polyphonic Audio Files: Cogtive Versus Machine Learning Modelling Strategies. Paper presented at the Proceedings of the 5th International ISMIR 2004 Conference, October 2004., Barcelona, Spain. (2004) 11. Schubert, E.: Measuring emotion continuously: Validity and reliability of the twodimensional emotion space. Aust. J. Psychol., vol. 51, no. 3, pp. 154–165. (1999) 12. Korhonen, M., Clausi, D., Jernigan, M.: Modeling Emotional Content of Music Using System Identification. IEEE Transactions on Systems, Man and Cybernetics. Volume: 36, Issue: 3, pages: 588- 599. (2006)