
Estimating the Perception of Complexity in Musical Harmony

Jose Fornari & Tuomas Eerola
Finnish Centre of Excellence in Interdisciplinary Music Research, Department of Music, University of Jyväskylä, Finland

The perception of complexity in musical harmony is here seen as directly related to the psychoacoustic entropy carried by the musical harmony. As such, it depends on a variety of factors, mainly related to the structure and progression of musical chords. The literature shows few examples of the computational estimation, directly from audio, of complexity in music. So far, the perception of this feature is normally rated by means of expert human listening. An efficient computational model able to automatically estimate the perception of complexity in musical harmony can be useful in a broad range of applications, such as in the fields of psychology, music therapy and music retrieval (e.g. in the large-scale search of music databases according to this feature). In this work we present an approach for the computational prediction of harmonic complexity in musical audio files and compare it with human ratings, based on a behavioral study in which thirty-three listeners rated the harmonic complexity of one hundred music excerpts. The correlation between the results of the computational model and the listeners' mean ratings is presented and discussed.



As pointed out by [1], in music, “subjective complexity reflects information content”. Information Theory relates complexity to the amount of entropy (also known as randomness) present in the output of an information source. Therefore, in music, we infer that the amount of perceptual (or psychoacoustic) entropy is related to the sensation of complexity. In the study of music complexity it is common to retrieve this feature separately for each musical component, such as melody, harmony and rhythm. Harmonic Complexity (HC), as seen here, is defined as the contextual musical feature that almost any listener can perceive, intuitively grading how complex the harmonic structure of a particular musical excerpt is. Temperley, in his studies on tension in music, although not specifically referring to the term "complexity", suggested four components that seem to be related to HC perception: 1) harmonic change rate; 2) harmonic change rate on weak beats; 3) rate of dissonant (ornamental) notes; 4) harmonic distance between consecutive chords [2]. Scheirer, in a study with short (five-second) musical excerpts, pointed out what he considered to be the most prominent acoustic descriptors of music complexity, shown here from the most to the least important: 1) coherence of spectral assignment to auditory streams; 2) variance of the number of auditory streams; 3) loudness of the strongest moment; 4) most-likely tempo ("pulse clarity"); 5) variance of time between beats [3]. The first and most important descriptor is related to spectral spreading and is therefore proportional to the amount of noise (or entropy) present in the musical excerpt.

Considering the great number of components related to its perception, one can infer that HC is not an easy feature to measure. As our experiments building ground-truth data suggested, it is also difficult for listeners to reach a common agreement upon their perception of HC. In our experiments, the listeners considered it to be related to a variety of chord features, such as: 1) note composition (from simple triads, major, minor, augmented or diminished, to complex clusters); 2) chord function (e.g. tonic, dominant, subdominant); 3) chord tensions (related to the presence of tension notes such as 7ths, 9ths, 13ths, etc.); and 4) chord progression (diatonic, modal, chromatic, atonal). As a high-level musical feature, HC is a scalar measurement that portrays the overall harmonic complexity of a musical excerpt. This perception can only be properly conveyed in musical excerpts that are longer than the cognitive "now time" of music, considered to have a duration of approximately three seconds [5]. The objective of the study presented here is to investigate a computational model able to describe the human judgment of harmonic complexity, based on the calculation of acoustic principles related to this perception. In order to do that, we carried out behavioral and computational studies using excerpts of real musical recordings, as described below.
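The link between entropy and perceived complexity can be made concrete with a small sketch. Under the simplifying assumption (ours, not part of the model described in this paper) that a chord is a probability distribution over the twelve pitch classes, its Shannon entropy grows as the notes spread over more of the chromatic scale:

```python
import math

def shannon_entropy(distribution):
    """Shannon entropy (in bits) of a normalized probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

# A chord spread over many pitch classes carries more entropy than a triad.
triad = [1/3 if i in (0, 4, 7) else 0.0 for i in range(12)]  # a major triad
cluster = [1/12] * 12                                         # chromatic cluster

print(shannon_entropy(triad))    # log2(3), about 1.585 bits
print(shannon_entropy(cluster))  # log2(12), about 3.585 bits
```

The cluster, being maximally spread, reaches the maximum entropy for twelve bins, matching the intuition that denser, noisier harmonic content sounds more complex.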

The first step was to establish a ground-truth for the HC model, given by human ratings of HC in music audio data. For that, a group of thirty-three music students was invited to listen to one hundred music excerpts and rate their HC. Each person, in isolation, listened to a randomly ordered queue of these one hundred music excerpts and rated their HC from zero (no harmonic complexity) to one (very complex harmony). All music excerpts were instrumental (without singing voice), lasted five seconds and were extracted from movie soundtracks. The behavioral data was collected with a program that we developed in the Pd (Pure Data) language. Afterwards, the data was analyzed and pruned, and the mean rating over all listeners was established as the HC ground-truth.

After the rating session, each listener was asked to post comments about the experiment. They were asked to describe how difficult it was to rate HC and on which musical features their rating was mostly based. Summarizing their opinions, they paid attention to: 1) velocity and amount of harmonic change (seven comments); 2) traditional (triadic) chords versus altered chords (seven comments); 3) predictability and sophistication of the chord progression (five comments); 4) clarity of a tonal centre; 5) the number of instruments playing together; and 6) dissonances. Overall, most of the listeners found it difficult to rate HC, especially in excerpts with high amounts of musical activity, without a clear chord structure, or in atonal and/or electro-acoustic excerpts. These comments showed how the listeners paid attention to different musical aspects in order to rate HC. This was an important piece of information for the design and improvement of our initial computational model. With this in mind, we focused on which features are the most important ones to retrieve in order to properly predict HC with our computational model. In order to see which concepts were understood by the listeners and well defined by us, we calculated the mean inter-subject correlation and looked at the use of the rating scales. There were no apparent problems with the scales, although a few deviant listeners could be identified from the correlation matrix, as Figure 1 shows.

Figure 1. Inter-subject correlation for HC listeners’ rating.

Assessing HC seemed to be a rather difficult task for the listeners, or several different strategies seemed to be in use. This was already apparent in the listeners' comments, which covered a rather large range of concepts. When calculating the means for our musical feature evaluations, we eliminated those participants whose ratings did not correlate significantly with those of the rest of the participants. Nevertheless, only three ratings were discarded in this fashion (some of the dark-blue stripes shown in the figure).
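The pruning step described above can be sketched as follows. This is a minimal illustration assuming the ratings form a listeners-by-excerpts matrix; the fixed correlation threshold is a hypothetical stand-in for the significance test we actually applied.

```python
import numpy as np

def mean_intersubject_correlation(ratings):
    """ratings: (n_listeners, n_excerpts) matrix of HC ratings in [0, 1].
    Returns the inter-subject correlation matrix and its mean off-diagonal value."""
    c = np.corrcoef(ratings)
    n = c.shape[0]
    off_diag = c[~np.eye(n, dtype=bool)]
    return c, off_diag.mean()

def prune_and_average(ratings, threshold=0.2):
    """Drop listeners whose mean correlation with the others falls below a
    (hypothetical) threshold, then average the rest into a ground-truth rating."""
    c, _ = mean_intersubject_correlation(ratings)
    n = c.shape[0]
    mean_r = (c.sum(axis=1) - 1.0) / (n - 1)  # exclude self-correlation
    keep = mean_r >= threshold                # deviant listeners fall below
    return ratings[keep].mean(axis=0)         # mean rating of retained listeners
```

With this shape of data, a listener who rates against the group trend gets a negative mean correlation and is excluded from the ground-truth mean.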

To create a computational model able to predict the complexity of musical harmony, we started by investigating principles related to the amount of complexity found in chord structures and their progressions. Chords are related to the region of the musical scale where note events happen close enough in time to be perceived as simultaneous and are therefore interpreted by human audition as chords. In audio signal processing, this corresponds to the fundamental partials that coexist in a narrow time frame of approximately fifty milliseconds and are gathered in the frequency region where musical chords mostly occur. In musical terms, this region is normally located below the melody and above the bass line. However, as the bass line also influences the interpretation of harmony, it also had to be taken into consideration in our computational model. The model starts by calculating chromagrams for these two regions (bass and harmony). A chromagram is a form of spectrogram whose partials are folded into one musical octave, with twelve bins corresponding to the notes of the chromatic scale [7]. We separate each region using band-pass digital filters in order to attenuate the effect of partials not related to the musical harmonic structure. These two chromagrams are calculated for each time frame of audio corresponding to the window size of the region's lowest frequency. Therefore, each chromagram is presented in the form of a matrix with twelve rows, each corresponding to one note of the chromatic scale, and several columns, corresponding to the number of time frames. We then tested three principles that initially seemed to be related to HC perception, derived from the studies and evidence mentioned in the introduction of this work. The first principle is what we here name auto-similarity. It measures how similar each chromagram column is to the others. This calculation returns an array whose size equals the number of columns in the chromagram; the variation of this array is inversely proportional to the auto-similarity. The rationale behind this principle is that auto-similarity may be proportional to the randomness of the chromagram, which in turn seems to be related to the chord progression, one of the studied aspects of HC perception.
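A minimal sketch of the chromagram computation described above, written in Python rather than the Matlab used for the actual model: spectral energy within a band-passed register is folded into twelve pitch-class bins per frame. The frequency bounds, frame size and hop are illustrative assumptions, not the values used in the study.

```python
import numpy as np

def chromagram(signal, sr, fmin, fmax, frame_size=4096, hop=2048):
    """Fold spectral magnitude between fmin and fmax into 12 pitch-class bins
    per frame. fmin/fmax select the 'bass' or 'harmony' register; the frame
    size should cover at least one period of fmin."""
    n_frames = max(1, 1 + (len(signal) - frame_size) // hop)
    chroma = np.zeros((12, n_frames))
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sr)
    window = np.hanning(frame_size)
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_size]
        mag = np.abs(np.fft.rfft(frame * window, n=frame_size))
        for f, m in zip(freqs, mag):
            if fmin <= f <= fmax:                              # band-pass region
                pc = int(round(12 * np.log2(f / 440.0))) % 12  # fold into one octave
                chroma[pc, t] += m
    return chroma

# Illustrative register bounds: bass roughly 60-250 Hz, harmony roughly 250-2000 Hz.
```

Pitch class 0 here corresponds to A (the 440 Hz reference); any other reference note would simply rotate the twelve bins.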
Next, these two chromagram matrices are collapsed in time, which means that their columns are summed and the resulting twelve-point arrays normalized. As the music excerpts rated in this experiment all lasted five seconds, collapsing the chromagram seemed reasonable, since this duration is near the cognitive "now time" of music; for longer audio files, however, a more sophisticated approach should be developed. In this work, the collapsed chromagram array proved to be efficient in representing the overall harmonic information for this particular short duration. We then calculate two further principles out of these two arrays, named here energy and density. Energy is the summation of the square of each array element. The idea was to investigate whether the array energy was somehow related to the chord structure. The other principle, density, accounts for the array sparsity: for each array, the smaller the number of zeros between nonzero elements, the higher its density. The intention here was to investigate how much the density of the collapsed chromagram is related to the chord structure. This came from the notion that, in terms of musical harmony, the closer the notes in a chord are, the more complex the harmony tends to be. Figure 2 shows the diagram of the computational model, depicting the calculation of these six principles of harmonic complexity (the three principles for each of the two chromagrams).
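The three principles can be sketched as follows. These are illustrations rather than the exact formulas of the model; in particular, the dot-product similarity and the nonzero-bin measure of density are our own simplifying assumptions here.

```python
import numpy as np

def collapse(chroma):
    """Collapse a (12, n_frames) chromagram in time: sum the columns and normalize."""
    v = chroma.sum(axis=1)
    return v / v.max() if v.max() > 0 else v

def auto_similarity(chroma):
    """Compare each chromagram column with the others (here via dot products);
    the variation of the resulting per-frame values is inversely proportional
    to the auto-similarity of the excerpt."""
    n = chroma.shape[1]
    sims = []
    for t in range(n):
        others = [np.dot(chroma[:, t], chroma[:, u]) for u in range(n) if u != t]
        sims.append(np.mean(others))
    return np.array(sims)

def energy(collapsed):
    """Summation of the square of each element of the collapsed chromagram."""
    return float(np.sum(collapsed ** 2))

def density(collapsed, eps=1e-6):
    """Fraction of nonzero bins: the fewer zeros between nonzero elements,
    the higher the density."""
    return float(np.count_nonzero(collapsed > eps) / collapsed.size)
```

For a collapsed chromagram holding only a triad, for example, density is 3/12, while a chromatic cluster would reach 1.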

Figure 2. The six principles for harmonic complexity prediction.

This computational model was written and simulated in Matlab, where it calculated the three principles described above for two chromagrams representing different regions of frequency, here called bass and harmony. The results of our measurements are described in the following section.

Using our model, we calculated the energy, density and auto-similarity for the bass and harmony chromagrams. This resulted in six predictions per music excerpt. We calculated these six principles for the same one hundred music excerpts that were rated by the listeners (as described in section 2). The correlation of each principle with the listeners' mean rating is shown in Table 1.

Table 1. Correlation of the principles with the ground-truth.

                      Energy    Density   Auto-similarity
  Harmony chromagram  0.2733    0.5568    0.4626
  Bass chromagram     0.3459    0.4635    0.5138

The principle that presented the highest correlation was density in the harmony chromagram, followed by auto-similarity in the bass chromagram. The principles with the least correlation were the energy of both chromagrams. These results support our initial supposition that HC is mostly related to chord structure (density) and perceptual randomness (auto-similarity). Interestingly enough, the latter is most evident in the bass chromagram, instead of the harmony one, as we had initially suspected. Although the music excerpts were selected from a broad range of music styles, all are from soundtracks and without singing voice, which narrowed down their diversity, placing them in one specific genre of music. This may eventually have contributed to this unexpected result. Further studies with a broader range of music genres may unveil different prospects.

Finally, we calculated a multiple regression model with these six principles. It presented a correlation with the ground-truth of r = 0.61. Although this is a fairly high correlation, further studies are needed to make sure that this multiple regression model is not over-fitted as a result of the large number of components (six). Nevertheless, the individual correlations of density and auto-similarity for the chromagrams of both the harmony and bass regions are, by themselves, high enough to be sound results. Figure 3 depicts the behavioral mean rating and the model prediction for the one hundred music excerpts.

Figure 3. Multiple Regression Prediction (dots) of Behavioral Data (bars). Correlation: r = 0.61.

This linear model, made with the multiple regression of the six principles, reached a coefficient of determination of R² = 0.37, thus explaining about 37% of the data. Its prediction scatter diagram, shown in Figure 4, provides the model evaluation.

Figure 4. Computational Model Evaluation.
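The multiple regression step can be sketched with ordinary least squares, where X would hold the six principle values per excerpt and y the listeners' mean ratings. This is a generic illustration, not the exact Matlab implementation used in this study.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Ordinary least squares of y (mean ratings) on the columns of X
    (the six principles). Returns the weights and the model predictions."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ w
    return w, pred

def r_squared(y, pred):
    """Coefficient of determination: fraction of rating variance explained."""
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Note that R² of the fitted model equals the square of the correlation between prediction and ground-truth, which is how r = 0.61 corresponds to R² = 0.37.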

This study introduced a novel approach for the estimation of complexity in musical harmony. The majority of the musical material used in the experiment can be categorized as belonging to one musical genre: orchestral movie soundtracks. All excerpts were instrumental and five seconds long. We suspect that this may have influenced the experimental results in at least two ways. First, the short duration of five seconds, near the cognitive now time, although sufficient to convey emotional content, is not enough to analyze two important aspects: prosody and forgetfulness. Regarding prosody, these excerpts may be compared to a still frame of musical emotional content, which is in most cases a dynamic phenomenon. The relation between the emotional prosody of music and its effects on the overall rating of harmonic complexity is still to be studied. Regarding forgetfulness, the analysis of how the natural forgetting curve is affected by musical novelty, by repeating patterns, or even by the repetition of similar but varied patterns, is also still to be studied and will lead to a more complex computational model than the one introduced here. Second, regarding the musical genre, we believe that a further study should broaden the material, taking into consideration other genres and the listeners' predilections. We also did not consider how many times each listener repeated each excerpt before rating it, which may have influenced the results. As we reported in section 2, listeners on average found it difficult to rate harmonic complexity. This may be the reason for the small mean correlation between listeners (r = 0.39) shown in Figure 1, which is significantly smaller than the correlation of the computational model with the listeners' mean rating (r = 0.61). This is due to the fact that several musical excerpts are complex in both chord structure and chord progression at the same time. A further study separating these two forms of harmonic complexity might therefore be convenient. This would involve two sets of music excerpts: one with only static chords of different complexity, and another with chords of the same structural complexity but different degrees of progression-related complexity. Nevertheless, the results shown here are promising and we hope that they can inspire further research leading to a better understanding of the perception of complexity in musical harmony.

We would like to thank the BrainTuning project (FP6-2004-NEST-PATH-028570) and the Music Cognition Group of the University of Jyväskylä.

[1] Berlyne, D. E. (1971). Aesthetics and Psychobiology. Appleton-Century-Crofts, New York.
[2] Temperley, D. (2001). The Cognition of Basic Musical Structures. MIT Press, Cambridge.
[3] Scheirer, E. D., Watson, R. B., and Vercoe, B. L. (2000). On the perceived complexity of short musical segments. In Proceedings of the 2000 International Conference on Music Perception and Cognition, Keele, UK.
[4] Amatriain, X., Herrera, P. (2000). Transmitting Audio Content as Sound Objects. In Proceedings of the AES 22nd Conference on Virtual, Synthetic, and Entertainment Audio, Helsinki.
[5] Leman, M., et al. (2005). Communicating Expressiveness and Affect in Multimodal Interactive Systems. IEEE MultiMedia, vol. 12, pp. 43-53. ISSN 1070-986X.
[6] Likert, R. (1932). A Technique for the Measurement of Attitudes. Archives of Psychology, 140, pp. 1-55.
[7] Chai, W., Vercoe, B. (2005). Detection of Key Change in Classical Piano Music. In Proceedings of the 6th International Conference on Music Information Retrieval, London.