

Jonatas Manzolli NICS - UNICAMP
The ability to develop expressive performances is one of the most important musical skills, and it is also one of the major criteria used to evaluate a musician's interpretation. Traditionally, expressive performances are evaluated only by the personal judgment of musically trained professionals, which is itself subject to personal preferences and misconceptions. We present here a new method to automatically evaluate expressive piano performances using a Psychoacoustic Fitness Function (PFF) based on three psychoacoustic measurements: a) loudness, b) pitch and c) spectrum magnitude. In this paper we use the PFF to evaluate the dynamic development of four important piano touches that are common technical tools in expressive piano performance: intensity, legato, staccato and rhythmic pulsation (or simply pulse). The method is derived from the Evolutionary Sound Synthesis method, ESSynth [6], and more specifically from the fitness function used in ESSynth's selection process. Given a set of pianists' recordings taken as the reference in expressiveness, the Target set, we used the PFF to evaluate, by comparison with the Target, the expressiveness of twelve pianists' recordings for each of the four piano touches. The results presented below encourage us to believe that this method, as it is further developed, may become an important contribution to the evaluation of expressive musical performances.

Iracele Livero NICS - UNICAMP

Jose Fornari NICS - UNICAMP

The problem of evaluating expressiveness in piano performances is a long-standing one, dating back at least to [2]. It has been addressed by several different approaches. Lately, Artificial Intelligence (AI) techniques have also been applied to it. To give a few examples, Widmer studied the use of machine learning techniques to understand how expressive music performances are produced [15], and Goebl studied the role of timing and intensity in the production and perception of melody in expressive piano performances [16]. We have also studied AI methods such as Evolutionary Computation (EC) and Neuroinformatics to simulate compositional environments and thus create new methods for sound synthesis. In [6] we presented evolutionary sound synthesis, ESSynth, a method for sound synthesis based on the Darwinian theory of natural selection. In ESSynth, waveforms play the role of individuals in a population and undergo reproduction and selection. In [7] we introduced the software Vox Populi, which uses evolutionary computation methods for algorithmic musical composition. Together with Wasserman we created Roboser, a live musical composition method based on synthetic emotions [14].

In particular, this paper builds on [6], where we presented ESSynth. In [5] the method was extended to support the manipulation of perceptually meaningful sonic features described by three psychoacoustic parameters: loudness, pitch and spectrum magnitude. In this extension, synthesized sounds were evaluated by the arithmetic mean of the three PFFs: loudness, pitch and spectrum. After studying applications of AI, and more specifically of EC methods such as ESSynth, we decided to investigate their potential for analysis (or evaluation) as well as synthesis. This work describes one such application: a method for evaluating expressive piano performances.

2. THE FOUR PIANO TOUCHES

Although a wide range of parameters could be evaluated, we chose to approach the problem of evaluating expressive piano performances through a set of four piano touches: Pulse, Legato, Staccato and Intensity. Pulse refers to the pianist's ability to keep control of the rhythm and to use it to give the performance a discursive rhythmic character. Legato is the ability to play melodies (sequential single notes) and harmonies or clusters (simultaneous notes) so that they are connected, with the intensity slopes between them lowered and made less perceptible. Staccato, in opposition to Legato, is the ability to separate in time all notes and clusters so that, in the musical discourse, they are presented as detached as possible. Intensity refers to the ability to control the variation in strength of each musical entity (note or cluster) by controlling the velocity with which the pianist's fingers hit the piano keys. In this paper we take the evaluation of a pianist in these four piano touches to be directly proportional to the evaluation of that pianist's expressive performance. The four piano touches are evaluated through the three PFFs (loudness, pitch and spectrum), whose arithmetic mean is the final expressive performance evaluation.

3. MEASURING PIANISTIC SONORITY WITH PFF

PFF is part of the ESSynth method, where it is responsible for the selection of waveforms. To explain the concept of PFF we first review the basics of ESSynth, which is built on three structures:

• B(n), the population set in its n-th generation. The first generation population is B(0). It is made of the individuals that will reproduce and be selected.
• T, the target set, made of the reference individuals.
• f, the fitness function, which selects the best individual.

The best individual in the n-th generation is $w^{*n}$: the individual belonging to B(n) that is nearest to T. For each generation a new $w^{*n}$ is sought and sent to the system output as the synthesized sound.

[Diagram blocks: B(n), Crossover, Mutation, f (Fitness Evaluation), T, Output waveforms w*1, w*2, w*3, ..., w*n]
Figure 1. Basic ESSynth diagram.

The sound segments, or waveforms, are called individuals and are represented by $w_r^n$, which denotes the r-th individual in the n-th generation population. As we see it, the most relevant psychoacoustic measures of an individual are those covering the sonic perception of intensity, frequency and partials (magnitude spectrum composition). They are, respectively, loudness, pitch and spectrum. ESSynth represents the individual's genotype with these three psychoacoustic parameters, as seen in the equation below:

$$g_r^n = \left( l_r^n(t),\; p_r^n(t),\; s_r^n(f) \right) \qquad (1)$$

The genotype $g_r^n$ can be seen as one element of the vector space G, the space of psychoacoustic curves. G is a Cartesian product of three spaces of continuous functions:

$$G = L \times P \times S \qquad (2)$$

where the function spaces are L for loudness, P for pitch and S for spectrum. The next figure depicts the correspondence between an individual and its genotype.

[Diagram: genotype $g_r$ of an individual, made of the psychoacoustic curves loudness l(t), pitch p(t) and spectrum s(f)]
Figure 2. The genotype representation of an individual.

Now we define the genotype distance between two individuals. Let $c_a$ and $c_b$ be two psychoacoustic curves of the same type (loudness, pitch or spectrum), taken respectively from the waveforms a and b. The Euclidean distance between $c_a$ and $c_b$ is given by:

$$d_c(c_a, c_b) = \sqrt{\sum_{k=1}^{N} \left( c_a(k) - c_b(k) \right)^2} \qquad (3)$$

where N denotes the number of points in each curve. Equation (3) is a psychoacoustic distance, from which the PFF evaluation of the specific psychoacoustic parameter comes. Given the genotypes $g_a = (l_a, p_a, s_a)$ and $g_b = (l_b, p_b, s_b)$, both elements of G, the distance between them can be defined as:

$$D(g_a, g_b) = \frac{1}{3} \left[ d_L(l_a, l_b) + d_P(p_a, p_b) + d_S(s_a, s_b) \right] \qquad (4)$$

Equation (4) is the genotype distance, given by the arithmetic mean of the three psychoacoustic distances. Let $G^{(n)} = \{ g_1^n, g_2^n, \ldots, g_M^n \}$ be the n-th genotype generation, associated with the population set B(n) with M individuals, and let $G^{T} = \{ g_1, g_2, \ldots, g_Q \}$ be the target genotype set, extracted from T. The distance between the sets $G^{(n)}$ and $G^{T}$ is defined as:

$$D_H(G^{(n)}, G^{T}) = \min_{\substack{1 \le a \le M \\ 1 \le j \le Q}} D(g_a^n, g_j) \qquad (5)$$

This distance identifies the best individual in the Population in comparison with the individuals in the Target set. If $D_H$ is zero, then an individual within the Population is equal to one within the Target set. The PFF is given simply by normalizing the distances of all individuals in the Population, so that they range from zero (when the individual is a clone of one belonging to T) to one (the individual most different from the individuals in T).
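The curve distance of equation (3), the genotype distance of equation (4) and the set distance $D_H$ can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the representation of each psychoacoustic curve as a list with the same number of samples, and the toy numbers are all assumptions made for the example.

```python
import math

def curve_distance(ca, cb):
    """Euclidean distance between two sampled curves of the same type (Eq. 3)."""
    if len(ca) != len(cb):
        raise ValueError("curves must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

def genotype_distance(ga, gb):
    """Arithmetic mean of the loudness, pitch and spectrum distances (Eq. 4).

    Each genotype is assumed to be a (loudness, pitch, spectrum) tuple of curves.
    """
    return sum(curve_distance(a, b) for a, b in zip(ga, gb)) / 3.0

def set_distance(population, target):
    """Minimum genotype distance over all population/target pairs (D_H).

    Returns (distance, index of the population individual that achieves it).
    """
    return min(
        (genotype_distance(g, t), i)
        for i, g in enumerate(population)
        for t in target
    )

# Toy data (hypothetical): one target genotype and two population genotypes.
target = [([1.0, 2.0], [3.0, 3.0], [0.5, 0.5])]
population = [
    ([1.0, 2.0], [3.0, 3.0], [0.5, 0.5]),  # clone of the target
    ([4.0, 6.0], [3.0, 3.0], [0.5, 0.5]),  # differs only in loudness
]

result = set_distance(population, target)
far = genotype_distance(population[1], target[0])
print(result)  # (0.0, 0): the clone is at distance zero
print(far)     # loudness distance 5 averaged over three curves, i.e. 5/3
```

In this toy case the second individual differs from the target only in its loudness curve, by (3, 4), so its loudness distance is 5 and its genotype distance is 5/3.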



We recorded twelve pianists, each performing four pre-selected piano pieces. Each piece was chosen as a good representation of one particular piano touch: Béla Bartók's “Bear Dance” for Pulse, Chopin's “Nocturne op. 9 n. 2” for Legato, Béla Bartók's “Jeering Song” for Staccato and Katunda's “Estro Africano 1” for Intensity. These recordings are available to be heard and downloaded at the link below (in MP3 audio format): All pianists recorded under the same conditions (same day, same room, same piano and same recording equipment). Three of them were professional pianists and recorded versions that we considered to embody the desired aspects of expressive piano performance for each piece. These three groups of four recordings formed the Target set, which is the reference for the PFF evaluation. The other nine groups of recordings formed the Population set, whose expressiveness was evaluated by the PFF. Following the ESSynth terminology, each recording is called an individual, so we had four Populations and four corresponding Target sets, one for each piano piece, each representing one particular piano touch. As a minor variation, we placed all twelve individuals in the Population, including the three that also belong to the Target. This was done to check that the PFF evaluation of individuals that are clones of those in the Target set is zero. We then calculated the genotype of each individual and measured its genotype distance to the Target set. For each Population we normalized the twelve measurements, which gave us the PFF evaluation. The results are shown in the next four graphics. Each one depicts four curves, from top to bottom the PFF evaluation for loudness, pitch, spectrum magnitude, and the total PFF. The individuals are digital audio files recorded at a sampling rate of 44.1 kHz, with 16 bits of resolution and one channel (mono).
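The normalization step of this procedure can be illustrated with a small Python sketch. It assumes, as one reading of the text, that each individual's raw score is its distance to the nearest Target member and that scores are divided by the largest one in the Population; the function name `pff_scores` and the numbers are hypothetical and are not taken from the recordings.

```python
def pff_scores(distance_rows):
    """Normalize raw distances to the Target set into PFF values in [0, 1].

    distance_rows[i][j] is assumed to hold the raw genotype distance between
    population individual i and target individual j.
    """
    raw = [min(row) for row in distance_rows]  # distance to the nearest target
    worst = max(raw)
    if worst == 0:
        # Every individual is a clone of some target member.
        return [0.0 for _ in raw]
    return [d / worst for d in raw]

# Hypothetical example: four individuals, two targets.
# Individuals 0 and 3 are themselves the two Target members, so each has
# a zero distance to one of the targets.
rows = [
    [0.0, 2.0],
    [1.0, 3.0],
    [4.0, 5.0],
    [2.0, 0.0],
]

scores = pff_scores(rows)
print(scores)  # [0.0, 0.25, 1.0, 0.0]
```

The two individuals that also belong to the Target come out at exactly zero, which mirrors the clone check described above, and the most distant individual is mapped to one.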

Figure 3. PFF evaluation for the “Pulse” recording group. The Target set here is made of the individuals 1, 4 and 11.

The best individual is #5. This is a recording of a pianist playing with a stronger sense of pulsation, although it is not as regular as the ones in the Target set. The most distant one, #3, plays with pulse but keeps the sustain pedal pressed, which may be why the algorithm ranked it as the most distant one. Listening to the recordings, though, we would pick #9 as the most distant one.

Figure 4. PFF evaluation for the “Legato” recording group. The Target set here is made of the individuals 1, 4 and 12.

The best individual here is #2, which is in fact a recording of a pianist playing with great legato (linking the notes as much as possible), as the pianists in the Target set do. The most distant individual is #10. This is a recording of a pianist playing staccato (detaching the notes as much as possible), which is the opposite of legato.

Figure 5. PFF evaluation for the “Staccato” recording group. The Target set here is made of the individuals 1, 4 and 11.

The algorithm picked #2 as the best individual. Listening to it and comparing it with the individuals in the Target set, we conclude that #2 is in fact the one that presents the better Staccato, although the recording has some wrong notes. In contrast, individual #10, the most distant one, is a recording of a pianist playing with the sustain pedal pressed, which makes the Staccato impossible to perceive, although this pianist played the piece with no wrong notes.

Figure 6. PFF evaluation for the “Intensity” recording group. The Target set here is made of the individuals 1, 11 and 12.

The best individual, #8, is a recording of a pianist playing with great variations of intensity (from pianissimo to fortissimo), as do the ones in the Target. The most distant one, #5, is a recording with almost no intensity variation, the only exception being the fade-out at the end.

The next table shows the best individuals (by number within the Population) according to each PFF evaluation: Dl (loudness PFF), Dp (pitch PFF), De (spectrum PFF) and D (total PFF).

Table 1. The best individuals of each group of recordings.

Group        Target set    Dl    Dp    De    D
Pulse        1, 4, 11      5     10    5     5
Legato       1, 4, 12      2     2     7     2
Staccato     1, 4, 11      9     2     5     2
Intensity    1, 11, 12     2     2     8     8

The next four tables show the PFF for each individual within the four groups of recordings. The value for the best individual in each column (the lowest value outside the Target set) is marked with an asterisk.

PULSE
Individual    Dl         Dp         De         D
1             0          0          0          0
2             0.7998     0.0568     0.6826     0.5858
3             0.8243     1.0000     0.8028     1.0000
4             0          0          0          0
5             0.6865*    0.0557     0.5374*    0.4871*
6             0.7075     0.0823     0.5737     0.5190
7             0.7102     0.0588     1.0000     0.6734
8             0.9130     0.0681     0.7705     0.6667
9             0.7412     0.0578     0.6242     0.5417
10            0.7937     0.0536*    0.7660     0.6141
11            0          0          0          0
12            1.0000     0.0699     0.8248     0.7212

LEGATO
Individual    Dl         Dp         De         D
1             0          0          0          0
2             0.5920*    0.2334*    0.4543     0.4690*
3             0.7487     0.4359     0.5599     0.6393
4             0          0          0          0
5             0.7548     0.5565     0.5282     0.6742
6             0.7691     0.3651     0.4353     0.5752
7             0.9377     0.9071     0.2542*    0.7692
8             1.0000     0.4974     1.0000     0.9153
9             0.6238     0.3853     0.6652     0.6136
10            0.8475     1.0000     0.8811     1.0000
11            0.6474     0.3541     0.6134     0.5918
12            0          0          0          0

STACCATO
Individual    Dl         Dp         De         D
1             0          0          0          0
2             0.6339     0.5758*    0.6622     0.5762*
3             0.8036     0.7478     0.7575     0.7484
4             0          0          0          0
5             0.6407     0.8733     0.3728*    0.8734
6             0.6477     0.8093     0.6239     0.8095
7             0.6906     1.0000     0.4920     1.0000
8             1.0000     0.9352     0.5778     0.9360
9             0.5352*    0.6212     0.6525     0.6214
10            0.7980     0.8112     1.0000     0.8116
11            0          0          0          0
12            0.9744     0.7194     0.8089     0.7205

INTENSITY
Individual    Dl         Dp         De         D
1             0          0          0          0
2             0.5766*    0.0446*    1.0000     1.0000
3             0.8191     0.0548     0.9781     0.9781
4             0.9496     0.2611     0.6516     0.6516
5             0.7476     1.0000     0.7935     0.7935
6             1.0000     0.0532     0.6910     0.6910
7             0.8483     0.0478     0.6868     0.6868
8             0.6413     0.0510     0.4518*    0.4518*
9             0.8428     0.0627     0.8257     0.8257
10            0.9501     0.0483     0.5738     0.5738
11            0          0          0          0
12            0          0          0          0
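As a cross-check of how Table 1 relates to the per-individual tables, the short Python sketch below recomputes the best individuals for the Pulse group. The values are the ones reported in the PULSE table; the selection rule (lowest PFF among individuals outside the Target set) and the helper name `best_individual` are our reading of the text, not code from the original study.

```python
# PFF values for the Pulse group, individuals 1..12, as reported in the PULSE table.
PULSE = {
    "Dl": [0, 0.7998, 0.8243, 0, 0.6865, 0.7075, 0.7102, 0.9130, 0.7412, 0.7937, 0, 1.0000],
    "Dp": [0, 0.0568, 1.0000, 0, 0.0557, 0.0823, 0.0588, 0.0681, 0.0578, 0.0536, 0, 0.0699],
    "De": [0, 0.6826, 0.8028, 0, 0.5374, 0.5737, 1.0000, 0.7705, 0.6242, 0.7660, 0, 0.8248],
    "D":  [0, 0.5858, 1.0000, 0, 0.4871, 0.5190, 0.6734, 0.6667, 0.5417, 0.6141, 0, 0.7212],
}
PULSE_TARGET = {1, 4, 11}  # individuals that form the Target set for Pulse

def best_individual(scores, target):
    """Return the 1-based number of the non-Target individual with the lowest PFF."""
    candidates = [
        (score, number)
        for number, score in enumerate(scores, start=1)
        if number not in target
    ]
    return min(candidates)[1]

best = {name: best_individual(column, PULSE_TARGET) for name, column in PULSE.items()}
print(best)  # {'Dl': 5, 'Dp': 10, 'De': 5, 'D': 5}
```

The result agrees with the Pulse row of Table 1.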



We have presented a method for the automatic evaluation of expressive piano performance based on the psychoacoustic measurement of four piano touches: pulse, legato, staccato and intensity. The evaluation is based on what we call the PFF evaluation: the normalized genotype distance between each individual within the Population and the Target set, the latter being the genotypes of individuals previously selected as the reference in expressiveness. The method accomplishes two important tasks: 1) it keeps human sensibility and choice in the decision loop, since a person decides which individuals belong to the Target and therefore influences the selection of the best individual in the Population; 2) the selection of the best individual is automatic and is therefore shielded from human factors that do not benefit the evaluation process, such as personal preferences, prejudices, misconceptions and biases.

As noted, the PFF evaluation comes from ESSynth's selection process. ESSynth also has another process, called reproduction, which uses genetic operators such as crossover and mutation to create new individuals that inherit sonic features of their predecessors; these new individuals are the offspring of the population individuals and the best individual. We believe that the reproduction process of ESSynth can also be applied to the expressive piano performance problem, not to evaluate but to manipulate the expressive performance. This is not a trivial task and needs to be studied thoroughly before the research can be taken in this direction. It may open a new field of research in sound synthesis, bringing about evolutionary methods that generate not just new waveforms but also new performances. This is a distinctive feature of EC methods, as they are evolutionary (dynamic in time) rather than deterministic (e.g. additive synthesis, FM synthesis and so forth).

Another aspect that deserves deeper explanation is the PFF pitch evaluation. Pitch is a psychoacoustic parameter related to the perception of the fundamental harmonic of sounds. In fact only a narrow group of sounds has a defined pitch: the melodic sounds, such as those produced by a piano. However, we developed our pitch algorithm to extract pitch from melodies (single notes played in sequence) rather than from harmony (several notes played simultaneously). We have not yet studied an extension of the algorithm to calculate harmonic pitch, but we used the melodic pitch algorithm to calculate the pitch PFF for the piano recordings, knowing that they contain both melodic and harmonic pitch information. Further study is therefore necessary to conclude whether the pitch PFF is valid for both melodic and harmonic pitch. It would be interesting to compare the PFF evaluation of audio recordings with that of MIDI data, since the latter lists the actual pitch information in terms of musical notes.

To conclude, EC methods applied to audio and music seem to us a viable way of opening a new field of investigation that may lead to the exploration of subtle perceptual sonic nuances that so far have been impossible to analyze, transform or synthesize.

[1] Author, E. ''The title of the conference paper'', Proceedings of the International Computer Music Conference, Miami, USA, 2004.

[2] Allen, F. J. 1913. Pianoforte touch. Nature 91(2278), 424-425.

[3] Askenfelt, A., Galembo, A., Cuddy, L. E. 1998. On the acoustics and psychology of piano touch and tone. Journal of the Acoustical Society of America 103(5 Pt. 2), 2873.

[4] Bresin, R., Battel, G. 2000. Articulation strategies in expressive piano performances. Journal of New Music Research 29(3), 211-224.

[5] Fornari, J. 2003. A Síntese Evolutiva de Segmentos Sonoros. PhD Dissertation, Faculty of Electrical Engineering, State University of Campinas (UNICAMP), Brazil.

[6] Manzolli, J., Maia Jr., A., Fornari, J. & Damiani, F. 2001. The evolutionary sound synthesis method. Proceedings of the Ninth ACM International Conference on Multimedia, Ottawa, Canada, 585-587. ISBN 1-58113-394-4.

[7] Moroni, A., Manzolli, J., Von Zuben, F. and Gudwin, R. 2000. Vox Populi: An Interactive Evolutionary System for Algorithmic Music Composition. San Francisco, USA: Leonardo Music Journal, MIT Press, Vol. 10, pp. 49-54.

[8] Moroni, A., Von Zuben, F. and Manzolli, J. 2002. ArTbitration. San Francisco, USA: Leonardo Music Journal, MIT Press, Vol. 11, pp. 45-55.

[9] Richerme, C. 1996. A técnica pianística. Uma abordagem científica. S. João Boa Vista: Air Musical, pp. 27-28.

[10] Repp, B. H. 1993. Some empirical observations on sound level properties of recorded piano tones. Journal of the Acoustical Society of America 93(2), 1136-1144.

[11] Repp, B. H. 1996. Patterns of note onset asynchronies in expressive piano performances. Journal of the Acoustical Society of America 100(6), 3917-3932.

[12] Shaffer, L. H. 1981. Performances of Chopin, Bach and Bartók: Studies in motor programming. Cognitive Psychology 13, 326-376.

[13] Tro, J. 1998. Micro dynamics deviation as a measure of musical quality in piano performances? In Proceedings of the 5th International Conference on Music Perception and Cognition (ICMPC5), August 26-30, edited by S. W. Yi (Western Music Research Institute, Seoul National University, Seoul, Korea).

[14] Wasserman, K. C., Eng, K., Verschure, P. F. M. J., Manzolli, J. 2003. Live Soundscape Composition Based on Synthetic Emotions. IEEE Multimedia, IEEE Computer Society, Vol. 10(4), pp. 82-90.

[15] Widmer, G. 2001. Using AI and machine learning to study expressive music performance: Project survey and first report. AI Communications 14(3), 149-162.

[16] Goebl, W. 2003. The Role of Timing and Intensity in the Production and Perception of Melody in Expressive Piano Performance. PhD Dissertation, Institut für Musikwissenschaft, Karl-Franzens University, Graz, Austria.