
This article was downloaded by: [University of Exeter]

On: 10 August 2015, At: 23:20


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: 5 Howick Place,
London, SW1P 1WG

Journal of New Music Research


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/nnmr20

Quantitative Analysis of Phrasing Strategies in Expressive Performance: Computational Methods and Analysis of Performances of Unaccompanied Bach for Solo Violin

Eric Cheng (a) & Elaine Chew (a, b)
(a) University of Southern California Viterbi School of Engineering, Los Angeles, California, USA
(b) Radcliffe Institute for Advanced Study, Harvard University, Cambridge, Massachusetts, USA
Published online: 23 Jun 2009.

To cite this article: Eric Cheng & Elaine Chew (2008) Quantitative Analysis of Phrasing Strategies in Expressive Performance:
Computational Methods and Analysis of Performances of Unaccompanied Bach for Solo Violin, Journal of New Music Research,
37:4, 325-338, DOI: 10.1080/09298210802711660

To link to this article: http://dx.doi.org/10.1080/09298210802711660


Journal of New Music Research
2008, Vol. 37, No. 4, pp. 325–338

Quantitative Analysis of Phrasing Strategies in Expressive Performance: Computational Methods and Analysis of Performances of Unaccompanied Bach for Solo Violin

Eric Cheng¹ and Elaine Chew¹,²

¹University of Southern California Viterbi School of Engineering, Los Angeles, California, USA; ²Radcliffe Institute for Advanced Study, Harvard University, Cambridge, Massachusetts, USA

Abstract

This paper presents computational methods for quantitative description and analysis of expressive performance strategies in violin performances. We present general techniques for extracting beat-level tempo and loudness data, and the Local Maximum Phrase Detection (LMPD) method. The LMPD method equates local maxima in the loudness curve with interpreted phrases, and defines measures of phrase strength (clarity), phrase volatility (standard deviation), and phrase typicality (concurrence with norm), for characterizing each phrase. The methods are developed in the context of, and applied to, eleven recorded performances of the Andante movement from Bach’s Sonata No. 2 in A minor BWV 1003 for solo violin by master violinists. For each performance, we present tempo and loudness summary statistics of the entire piece, its sections, and each phrase. In our experiments, we find that loudness is a more consistent indicator of phrasing strategies, suggesting that phrase structure may impose stricter constraints on dynamic than on tempo variation. The results of the LMPD method show that Kremer’s performance exhibits the highest, and Enescu’s the lowest, phrase volatility; Milstein’s shows the highest average phrase typicality, and Enescu’s the lowest; and, Grumiaux plays with the highest, and Menuhin the lowest, average phrase strength.

1. Introduction

No two performances of a piece of music are exactly alike. The survival of classical music performance relies on the fact that a single composition can be performed, time and again, in a multitude of ways, yet remain fresh and captivating over hundreds of years. How is it that a single composition, represented in one score, can yield seemingly infinite possibilities when brought to life in the realm of sound? A large part of what makes this possible is the ability of musicians to form unique interpretations of a piece in contrast to other ones, and to communicate these interpretations through musical prosody – the strategic variation of timing (tempo variances, agogic accents), dynamics (amplitude and loudness), articulation (note length), and timbre.

Musical prosody is essential – indeed, Palmer and Hutchins (2006) call it “obligatory” – to effective performance. In any written score, the composer indicates some, but not all, of the structural and expressive information. The incompleteness of describing music using a score, or almost any extra-musical means, gives a musician significant control over the information communicated, and the method and effectiveness of the communication.

By varying the musical parameters of time, amplitude, articulation, and timbre, musicians are able to emphasize structural groupings, such as the end of a section or the beginning of a new one, and express a variety of emotions. These communicative functions of musical prosody are perhaps best illustrated by film soundtracks, where the performed music serves to clarify and enhance the structure and mood of the visual material.

In this paper, we are concerned with the questions: do musicians have unlimited freedom to imprint their expressive visions on the pieces they perform? Or, does

Correspondence: Eric Cheng, 1394 Midvale Ave. #202, Los Angeles, CA 90024, USA. E-mail: echeng@post.harvard.edu

DOI: 10.1080/09298210802711660 © 2008 Taylor & Francis

the musical structure limit the possibilities? How do musicians vary the musical parameters in order to generate expression? Can we describe the available choices for expressive freedom? Can we systematically compare different interpretative strategies?

Our goal is to design generalizable ways to formulate quantitative and musically meaningful answers to questions such as these through computational analysis of tempo and dynamics. To this end, we describe our end-to-end analysis process, beginning with audio recordings and tempo and dynamics determination, through automatic extraction of phrases and quantitative descriptions of phrasing strategies, to comparisons of music interpretations as conveyed by the phrasings in expert violin performances.

1.1 Paper organization

The paper is organized as follows: we begin by presenting methodologies for extracting beat-level tempo and dynamic data from audio recordings, an essential first step before these expressive parameters can be analysed and interpreted. Tempo is computed from manually extracted beat onsets, while dynamics are obtained using the PEAQ (Kabal, 2002) psychoacoustic loudness model and a perceptually optimized smoothing algorithm.

The tempo and loudness extraction techniques are introduced in the context of analysing eleven performances of the Andante (third) movement from Bach’s Sonata No. 2 in A minor BWV 1003 for solo violin by master violinists, ranging from Ehnes to Szigeti. We calculate tempo and dynamic means and ranges over musical segments of varying lengths to identify trends in expressive strategies. We then inspect the tempo and loudness data, and infer that, at least for the violin performances we are studying, loudness is a more consistent indicator of phrasing strategy.

Next, we present the Local Maximum Phrase Detection (LMPD) method, briefly introduced and demonstrated in Cheng and Chew (2007). The LMPD is designed to describe higher level phrasing strategies. The LMPD finds phrases by locating the local maxima in the loudness curve; it then ascribes quantities of phrase strength (average increase/decrease in loudness), phrase volatility (standard deviation of phrase strengths in a performance), and phrase typicality (concurrence with other performances) to each phrase. These are musically motivated quantities to describe the degree of dynamic change used to indicate a phrase, the range of such variations employed, and the commonness of a particular phrase segmentation strategy.

The final part of the article presents detailed results and analyses of the application of the LMPD method to the eleven performances of Bach’s Violin Sonata No. 2. Through these results, we examine the relationship between prosody and musical phrase structure, presenting some conclusions and discussions. Next, we describe work related to each method in their order of appearance in this article.

1.2 Related work in beat and tempo extraction

Past research in the analysis of expressive strategies has tended to focus on the piano, where the advent of computer-monitored recording devices and the percussive nature of the instrument facilitate the extraction of beat-level tempo data. For example, Langner and Goebl (2003), Sundberg (2003), and Todd (1992, 1995) all used data from computer-monitored acoustic piano performances. A similar computer-monitored system for acoustic violin has yet to be developed that could facilitate the accurate detection of beat onsets.

Detecting beat is a first step toward determining the tempo. The typical beat tracking system is designed for music with a strong and regular beat, such as dance music, or music with percussion instruments. Few beat extraction algorithms are equipped to deal with expressive timing variations. Automatic beat tracking algorithms designed for expressive performances, such as those developed by Dixon (2000, 2001), still require human intervention to consistently find the beats.

Violin performances pose special challenges for beat extraction because, as bowed instruments, violins can produce soft, indeterminate onsets. Researchers have proposed general and automatic onset detection methods based on a number of acoustic features (Bello et al., 2004). Collins (2005) showed that these algorithms’ performances are markedly worse for the kinds of non-percussive soft onsets exhibited by violin sounds.

The ground truth for the evaluation of onset detection algorithms is typically created by manual annotation. Because our aim is expression analysis, for which having reliable and accurate tempo information is key, we chose to follow a method similar to that employed by Repp (1998, 1999a,b): manual onset detection using a digital waveform editor. We use Apple’s Final Cut Pro¹ to mark the beats. Similar annotation tools exist in other applications such as the Sonic Visualiser.²

¹ Apple’s Final Cut Pro – www.apple.com/finalcutpro
² Sonic Visualiser – www.sonicvisualiser.org

1.3 Related work on extraction of dynamics

Dynamics extraction poses difficulties due to the complexity of loudness perception, which depends on more than the sound pressure levels encoded in a typical audio waveform.

Repp (1999a) used waveform amplitude as a measure of loudness when analysing expressive dynamics in commercial piano recordings. He computed the root-mean-square amplitude envelope of each digitized

waveform and applied an automatic peak-picking algorithm to identify the peak amplitude following each note onset. Peak amplitudes were converted into peak sound levels in decibels (dB) and interpreted as approximate, but sufficient, measures of perceived loudness.

In order to accurately quantify perceived loudness, we consider psychoacoustic loudness models that account for the temporal and spectral contexts in which sounds are heard. We choose two existing loudness models, a single-band equivalent sound level with revised low-frequency B-weighting, Leq(RLB), model (see Soulodre & Norcross, 2003), and a MATLAB implementation of the multi-band PEAQ loudness model as described by Kabal (2002). We apply Gaussian smoothing to the output of the two models; Langner and Goebl (2003) implement a similar smoothing function in their tempo-loudness visualization application. We then systematically evaluate the loudness waveforms produced by the two models for a range of smoothing window sizes, and compare the results with human (an expert musician’s) annotation of a test piece.

Our evaluation is a step towards listener-centred evaluation of loudness models in the context of continuously sounding music. Other researchers have carried out evaluations of loudness models. For example, Soulodre and Norcross (2003) evaluated seven equivalent sound level (Leq) loudness models against subjective reference data, and found the Leq(RLB) model to perform better than other Leq models. Skovenberg and Nielsen (2004) confirmed Soulodre and Norcross’ results, although they find that two new models they propose at TC Electronics perform even better than the Leq(RLB). These evaluations are based on monophonic segments of music or speech for matching loudness. We have based our evaluation on reference measurements reported continuously over a performance of the entire Andante movement from Bach’s Violin Sonata No. 2, so as to better simulate natural human listening behaviour.

1.4 Related research on interpretation strategies

A wealth of research has been conducted on expressive music performance – see, for example, the reviews by Gabrielsson (1999, 2003) and by Widmer and Goebl (2004). Research on expressive strategies includes micro and macro analyses of timing/tempo and dynamics information; very few studies have focused on phrasing strategy, much less the examination of musical interpretation. Our study relates local timing and dynamics information to interpretation with respect to phrasing strategies.

We find tempo and dynamic strategies to be vastly different for the recordings we examine, and choose to focus on dynamics information for the extraction of phrase information. Repp (1998, 1999a) used principal components analysis to identify independent timing and dynamics strategies in a group of performances, and concluded that timing and dynamics were independently controlled within phrases. In contrast, Todd (1992) used a crescendo/decrescendo-accelerando/ritardando relationship between dynamics and timing as the basis of a model for musical expression.

In order to study phrasing strategy, we first need methods for accurately recognizing, localizing, and labelling phrases. Research in automatic extraction of phrases from expressive performances is still fairly new. In Cheng and Chew (2007), we presented the Local Maximum Phrase Detection (LMPD) algorithm that uses local maxima in the loudness curve to locate phrases. Chuan and Chew (2007) proposed a dynamic programming algorithm for extracting phrases by finding the sequence of quadratic or spline curves that best fits the tempo time series.

Prior research on expressive phrasing strategies has most frequently been local in nature. That is, studies have discussed how performers vary musical parameters within a single phrase or near a single phrase boundary (see, e.g. Gabrielsson, 1987; Cambouropoulos, 2001; Langner & Goebl, 2003). While certain local phrasing strategies – such as clarifying phrase boundaries with declines in tempo and dynamics – are well documented, relatively little is known about higher level phrasing strategies. Such strategies, such as how performers choose to segment a piece into phrases, and how their particular strategies relate to others’, are the subject of our study.

Other work on phrasing strategy has focused on synthesis rather than analysis. Analysis-by-synthesis was used by Sundberg et al. (2003) to test the validity of hypothesized rule-based models of expressive timing and dynamics. In the Expression Synthesis Project (Chew et al., 2005), which uses a driving (wheel and pedals) interface to generate expressive performances, Chew et al. (2006) used roadmaps that correspond to alternate phrasing strategies to generate performances with different musical interpretations.

We apply our quantitative phrase strategy analysis techniques to performances of Bach’s Violin Sonata No. 2 to provide musically meaningful information about the interpretations. A few other researchers have focused on the analysis of entire pieces. For example, Sapp (2007) introduced the use of scape plots to present correlation values of segments of tempo/loudness data of varying lengths, applying the technique to performances of Chopin’s Mazurkas. Dixon et al. (2002) presented an interface for visualizing expressive information over time as a continuous trajectory on a two-dimensional tempo-loudness space. Hong (2003) analysed passages from Bach’s Cello Suite in C major BWV 1009 to test the validity of the aforementioned timing/dynamics relationship proposed by Todd (1992).

2. Data extraction

We choose as our data set eleven commercially available recordings of the Andante (third) movement from Bach’s

Sonata No. 2 in A minor BWV 1003 for solo violin. We select this piece for its regular pulse and unambiguous phrase structure – qualities that simplify both data extraction and analysis. Additionally, there are no dynamic markings in the score, leaving greater room for individual expression. The score for the piece is shown in Figure 1. The phrases shown in the score are manually added for the analysis of section 3, and are not part of the original manuscript.

Fig. 1. Score with phrase annotations.

The performances in the eleven recordings were by Ehnes, Enescu, Grumiaux, Heifetz, Kremer, Menuhin, Milstein (1956, 1975), Mintz, Szeryng, and Szigeti. Table 1 lists the sources of the recordings we used in this study.

Table 1. Recordings of Bach’s Violin Sonata No. 2.

No.  Performer      Year Recorded/Released  Label (Catalog #)                UPC
1    Ehnes          2000/2000               Analekta (231478)                774204314729
2    Enescu         1949/2002               Classica D’oro (2014)            723724387225
3    Grumiaux       1961/1994               Philips (438736)                 028943873628
4    Heifetz        1952/1995               RCA (61748)                      090266174829
5    Kremer         2002/2005               ECM New Series (000506502)       028947672913
6    Menuhin        1936/2000               EMI Classics Références (67197)  724356719729
7    Milstein       1956/2001               EMI Classics (66869)             724356686922
8    Milstein       1975/1998               DG The Originals (457701)        028945770123
9    Mintz          1984/1995               DG Masters (445526)              028944552621
10   Szeryng        1967/1997               DG Double (453004)               028945300429
11   Szigeti        1955–1956/2003          Vanguard (1246)                  699675124625
12   Grand Average

Following the convention established by Repp (1998, 1999a), we created a twelfth “performer”, the grand average profile, which was the average of all eleven performances for analysis purposes. The following sub-sections present the methodologies employed to extract tempo and dynamics data in the context of these performances.
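As a rough illustration of the grand average profile, the sketch below averages beat-aligned profiles element-wise across performers. The function name `grand_average` and the sample values are hypothetical, and the sketch assumes every profile covers the same beats; how omitted repeats were handled in the study is not stated here, so that case is simply rejected.

```python
from statistics import fmean


def grand_average(profiles):
    """Average beat-aligned profiles element-wise across performers.

    Assumes all profiles cover the same beats (an assumption of this sketch).
    """
    n_beats = len(profiles[0])
    if any(len(p) != n_beats for p in profiles):
        raise ValueError("all profiles must cover the same beats")
    return [fmean(p[k] for p in profiles) for k in range(n_beats)]


# Three hypothetical performers, four beats each (made-up loudness values).
profiles = [
    [1.0, 2.0, 3.0, 2.0],
    [2.0, 3.0, 4.0, 3.0],
    [3.0, 4.0, 5.0, 4.0],
]
average = grand_average(profiles)
```

The same function could be applied to tempo profiles, since it only requires one value per beat per performer.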

2.1 Tempo
To extract beat onset times, we use the digital waveform editor in Apple’s Final Cut Pro video editing software. Final Cut Pro has a built-in marking tool allowing onsets to be marked while tapping along to the recording. Onsets are marked every eighth note. These onsets are then exported into MATLAB where tempo values for each onset are calculated as the inverse of the inter-onset interval, and smoothed using a rectangular (a moving average) smoothing window. An example tempo plot, one for Ehnes’ performance, is shown in Figure 2.

Fig. 2. Tempo plot after manual onset detection and smoothing.
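The tempo computation described above can be sketched in a few lines. This is a minimal illustration in Python rather than the MATLAB code used in the study. The function names, the window width of 3, the edge handling, and the scaling to eighth notes per minute are all assumptions of the sketch; the text only specifies the inverse of the inter-onset interval followed by rectangular smoothing.

```python
def tempo_from_onsets(onsets):
    """Return one tempo value per inter-onset interval, in eighth notes per minute."""
    tempi = []
    for prev, curr in zip(onsets, onsets[1:]):
        ioi = curr - prev  # inter-onset interval in seconds
        if ioi <= 0:
            raise ValueError("onset times must be strictly increasing")
        tempi.append(60.0 / ioi)
    return tempi


def moving_average(values, width=3):
    """Centred rectangular smoothing; the window shrinks at the edges (assumed)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up onset times in seconds, one per eighth note.
onsets = [0.0, 0.5, 1.0, 1.5, 2.5]
tempi = tempo_from_onsets(onsets)
smoothed = moving_average(tempi, width=3)
```

In this toy input the last interval is twice as long as the others, so the raw tempo halves at the end and the smoothed curve approaches that drop gradually.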
2.2 Dynamics

This section describes our process for selecting a loudness model for extracting the dynamics of an expressive performance. The methodology presented here serves as an example of how loudness models could be evaluated and compared for the purpose of analysing expressive performances.

We compare a single-band Leq(RLB) model and a MATLAB implementation of the multi-band PEAQ loudness model by Kabal (2002). For both the Leq(RLB) and PEAQ models, the original waveform is processed to obtain a loudness waveform. We sampled this waveform at the onset times and smoothed the result using a Gaussian window.

To test the models, we compare the loudness values obtained for a single recording (Milstein, 1956) against a manually plotted reference curve, created independently by one of the authors (Cheng, an expert violinist) by continuously annotating the perceived loudness of each eighth note while listening to the recording. Loudness values are generated for the entire Andante movement of Bach’s Violin Sonata No. 2.

We apply Gaussian smoothing to the output of each loudness model. To evaluate the performance of the loudness models, we compute the correlation coefficient between the reference curve and the smoothed loudness curve produced by each model. After optimizing the smoothing window parameters to maximize the correlation coefficient for each model, we then select the model with the highest maximum correlation coefficient.

The PEAQ model yields the highest correlation value. The maximum correlation for the PEAQ model is $r_{\mathrm{PEAQ}} = 0.854$, versus the corresponding value for the Leq(RLB) model, $r_{\mathrm{Leq(RLB)}} = 0.839$. The optimal half-width of the Gaussian smoothing window for the PEAQ model is 2.3 eighth notes. The optimized model output is shown with the reference curve in Figure 3.

Other comparisons are conducted, leading to similar results, using a rectangular smoothing window, and using a scaled mean square error (SMSE) measure. The presentation and discussion of these results are outside the scope of the present article; they are outlined in Cheng (2006).

3. Data analysis: summary statistics

We first present summary tempo and dynamic statistics for the entire piece, for each section in the piece, and for each manually prescribed phrase. Consider the score for the Andante movement of Bach’s Violin Sonata No. 2 as presented in Figure 1. Observe that the piece consists of two repeated sections, yielding four sections, A, A′, B, and B′, in total. The phrase boundaries, marked in the score, are determined by examining the score independently of the recordings and later confirmed as reasonable by an expert violinist.³ Sections 3.1, 3.2, and 3.3 present the tempo and dynamic means and ranges for the entire piece, each section, and each phrase, respectively.

3.1 Global means and ranges

Figures 4(a) and (b) show the tempo and dynamic means and ranges for each performer, over the entire piece. The performers are indicated by numbers; their names are listed in Table 1. Each circle in the middle of the bar indicates the performer’s mean tempo or dynamic value, and the bars span the individual’s tempo or dynamic range. The range is computed as the mean ± 2 standard deviations, which gives approximately the 95% confidence interval.

Two interesting points can be made about the two plots in Figure 4. First, note that the lower extremes of the dynamic ranges reach comparably closer to zero than those of the tempo ranges. In other words, a tempo of zero is several more standard deviations away from the tempo means than a loudness value of zero is from the dynamic means. This makes sense, since a zero-valued tempo corresponds to an infinitely long inter-onset interval; a tempo of zero signifies stasis, an infinite wait, which is musically uninteresting and precludes further continuation. While this observation may seem trivial, it does highlight a fundamental difference between tempo and dynamics: while a performer may employ a zero-valued dynamic, a zero-valued tempo may only ever be approached asymptotically. This difference no doubt has a basis in the way we perceive the two parameters: tempo perception requires a comparison of temporally spaced events, while loudness may be perceived nearly instantaneously, and without a preceding or following event.

Second, there appear to be more significant differences in the global tempo strategies in Figure 4(a) than in the dynamic strategies in Figure 4(b). For example, the mean tempo for Performer 8 (Milstein, 1975) lies completely outside the tempo range of all but two other recordings, whereas the dynamic means of all the recordings lie well within all the dynamic ranges.

The musical significance of these observations is further discussed in section 6.
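The mean and range statistics used in this section can be sketched as below. The function names and the sample values are hypothetical. The sketch uses the population standard deviation (`statistics.pstdev`); whether the study used the population or the sample form is not stated, so this is an assumption. The helper `sds_from_zero` expresses the first observation above as a number: how many standard deviations separate a mean from zero.

```python
from statistics import fmean, pstdev


def mean_and_range(values, k=2.0):
    """Return (mean, lower, upper) with the range taken as mean +/- k standard deviations."""
    centre = fmean(values)
    spread = pstdev(values)  # population SD; an assumption of this sketch
    return centre, centre - k * spread, centre + k * spread


def sds_from_zero(values):
    """Number of standard deviations between the mean and zero."""
    return fmean(values) / pstdev(values)


# Made-up beat-level tempo values for one hypothetical performer.
tempo = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
centre, lower, upper = mean_and_range(tempo)
distance = sds_from_zero(tempo)
```

Applying the same two functions to the values within one section or one phrase gives the section-level and phrase-level statistics.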

3.2 Section means and ranges

The individual means and ranges, calculated over each of the four sections, are shown in Figures 5(a) through (d). For a given performer, the four bars, from left to right, represent values calculated over sections A, A′, B, and B′, respectively. In cases where a performer acknowledged only one repeat, the absent sections have been omitted.

By examining the grand average profile, we can begin to identify potential trends in section-to-section variations. In Figure 5(a), the grand average profile suggests that performers, on average, perform the repeat occurrence of a section at a quieter mean dynamic than the first occurrence. This appears to be a weak trend, however. Ehnes, Heifetz, Szeryng, and Szigeti all violate it to varying degrees.

In Figure 5(b), each successive section tends to be played at a slower mean tempo, as if there is a section-level ritard over the course of the piece. This trend is more widely followed, with the exception of Kremer, Mintz, and Szeryng.

Fig. 3. Correlation-optimal PEAQ model curve.

³ Daphne Wang, violin DMA student at the USC Thornton School of Music.
Fig. 4. (a) Global tempo means and ranges. (b) Global dynamic means and ranges.

In Figure 5(c), the second occurrence of a section tends to be played with a wider dynamic range than the first. This appears to be a strong trend, violated only by Menuhin and Szigeti.

In Figure 5(d), the tempo range tends to grow wider with each successive section, with the final section exhibiting the widest tempo range. On average, the tempo range of the final section is 89.1% greater than the average tempo range for the previous three sections. The greatest tempo range occurs in the final section for every performer. As we shall see in the next section, the rise in tempo range in the final section is largely attributable to the final-phrase ritard.

3.3 Phrase means and ranges

Shown in Figures 6(a) through (d) are the individual means and ranges calculated over each phrase (as indicated on the score in Figure 1). Each curve represents the phrase-by-phrase data from one performance. Vertical lines divide the plots into the four sections: A, A′, B, and B′.

A striking visual feature occurs in the tempo range plot shown in Figure 6(d), where there is a pronounced spike in the tempo range in the final phrases of the B and B′ sections. Figure 6(b) shows that these spikes in the tempo range correspond to drops in mean tempo in these final phrases. Indeed, there appears to be a consistent trend towards playing the final phrase of a section with a slower mean tempo, a section-level version of the much-cited phrase-ending ritard (see Cambouropoulos, 2001; Langner & Goebl, 2003; Sundberg et al., 2003). In Figures 6(b) and (d), we also see that the previously mentioned spike in the tempo range in sections B and B′ is largely a function of the final-phrase ritard.

Similarly striking is the contrast between tempo and dynamic means in plots 6(a) and (b). While the dynamic means are tightly packed and highly similar in shape, the tempo means show wide variations. While it is possible that artifacts of the recording process constrain the dynamic strategies to be within similar ranges, similarities in phrase-to-phrase shapings are preserved through these manipulations, and can only be explained by similarities in expressive strategy.

When examining the dynamic mean and range plots in Figures 6(a) and (c), we also begin to visually identify regions of greater and lesser agreement among performers. Figure 6(a) shows that the performers follow similar dynamic strategies in the A and A′ sections, while they appear to diverge significantly in the B and B′ sections. This concurrence is apparent in the dynamic ranges as well. It is surprising how similar the first-half dynamic strategies are, given that there are no dynamic markings in the score.

One way to make sense of these regions of narrow diversity is to interpret the degree of performer agreement as indicative of the constraint placed on performers by the musical structure. Under this interpretation, the plots suggest that the musical structure is more clearly defined in the first half of the piece, forcing performers to follow similar dynamic strategies. Conversely, the diversity of second-half strategies suggests the musical structure is less well defined, leaving room for performers to employ a wider variety of strategies. Alternatively, stylistic convention could grant performers varying degrees of interpretive freedom throughout the piece though the musical structure may be equally well defined.
332 Eric Cheng and Elaine Chew
Downloaded by [University of Exeter] at 23:20 10 August 2015

Fig. 5. Section-level statistics. (a) Dynamic means. (b) Tempo means. (c) Dynamic ranges. (d) Tempo ranges.

4. The local maximum phrase detection 4.1 The case for loudness
method To devise a systematic phrase detection method, we
Our strategy for global phrasing analysis is to mathe- superimposed the author-annotated phrase boundaries
matically identify phrases within a performance and over beat-level tempo and dynamic plots, as shown for
compare the number and locations of phrases across the section A in Figures 7(a) and (b). Vertical lines
performances. In a previous paper, we introduced a denote phrase beginnings, and each trajectory represents
Local Maximum Phrase Detection (LMPD) method for performance data for one recording.
analysing global phrasing strategies (see Cheng & Chew, Observe that the loudness trajectories appear to be
2007). Here, we present the technique, and provide more consistently related to the annotated phrase
further motivation for the method, before delving into boundaries than their tempo counterparts. In particular,
the analysis results when it is applied to the eleven phrases are well characterized by a crescendo/decrescen-
performances of the Andante movement from Bach’s do arch similar to that mentioned in several past studies,
Violin Sonata No. 2. such as Gabrielsson (1987), Sundberg et al. (2003), and
Quantitative analysis of phrasing strategies in expressive performance 333
Downloaded by [University of Exeter] at 23:20 10 August 2015

Fig. 6. Phrase-level statistics. (a) Dynamic means. (b) Tempo means. (c) Dynamic ranges. (d) Tempo ranges.

Todd (1992). In some cases, phrases are characterized by two or more sub-arches, suggesting that those performers chose to further divide the annotated phrases into sub-phrases in those particular recordings.

In contrast, tempo strategies are less systematically related to the annotated phrase boundaries, with a greater diversity of trajectories. These observations, confirmed by higher average inter-performer correlations for loudness than for tempo ($r_{\text{tempo}} = 0.4862$ versus $r_{\text{loudness}} = 0.7627$), lead us to conclude that loudness is a more reliable parameter for phrase detection, in particular, that a crescendo/decrescendo arch is a reliable indicator for the occurrence of a phrase.

4.2 Local maximum phrase detection

If we assume that each phrase can be characterized by a crescendo/decrescendo arch, then each phrase should also contain a local maximum in loudness. Here we define a local maximum as any loudness value for which the two surrounding values (before and after) are smaller. The LMPD method uses this local maximum as a mathematical indicator for the existence of a phrase. The method consists of two steps: (1) record the number and locations of local loudness maxima for each performance; and, (2) interpret each local maximum as a phrase or sub-phrase. This method allows us to define

Fig. 7. Note-level trajectories for section A: (a) tempo, (b) dynamics.
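
The inter-performer correlations quoted in Section 4.1 summarise how closely trajectories such as these agree across recordings. The text does not spell out the exact computation, so the following is only a minimal sketch under one plausible assumption: the Pearson correlation is computed for every pair of recordings and the results are averaged. The function name and the toy beat-level values are hypothetical and are not the study's data.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_correlation(trajectories):
    """Average Pearson r over all pairs of rows (one row per recording).

    Assumption: every recording is sampled on the same beat grid, so all
    rows have the same length.
    """
    data = np.asarray(trajectories, dtype=float)
    pairs = combinations(range(data.shape[0]), 2)
    rs = [np.corrcoef(data[i], data[j])[0, 1] for i, j in pairs]
    return float(np.mean(rs))


# Hypothetical toy trajectories for three recordings (not measured data).
toy_loudness = [[60, 64, 70, 66, 61], [58, 63, 69, 64, 60], [62, 65, 72, 67, 62]]
toy_tempo = [[50, 52, 49, 55, 51], [48, 47, 53, 50, 54], [52, 55, 50, 49, 53]]

print(mean_pairwise_correlation(toy_loudness))
print(mean_pairwise_correlation(toy_tempo))
```

In this toy case the arch-shaped loudness rows agree far more closely than the tempo rows, which mirrors the qualitative contrast reported in the text without reproducing its figures.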

Fig. 8. Phrase strength parameters.

Fig. 9. Phrase typicality parameters for section A. Top: loudness profile of the grand average performance. Bottom: local maxima counts for each eighth note, $P_k$.
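
The quantities illustrated in Figure 8 can be sketched in a few lines of code. The sketch below finds local loudness maxima using the definition in Section 4.2 (a value larger than both neighbours), then computes a phrase strength for each peak and the standard deviation of those strengths. Several details are assumptions rather than the authors' stated procedure: the adjacent minima are located by walking downhill from each peak, the ends of the sequence are allowed to act as minima, and the population standard deviation is used. All function names and the toy loudness values are hypothetical.

```python
from statistics import pstdev


def local_maxima(loudness):
    """Indices whose value exceeds both immediate neighbours."""
    return [i for i in range(1, len(loudness) - 1)
            if loudness[i - 1] < loudness[i] > loudness[i + 1]]


def phrase_strengths(loudness):
    """Average drop from each peak to the nearest dip on either side."""
    strengths = []
    last = len(loudness) - 1
    for peak in local_maxima(loudness):
        left = peak
        while left > 0 and loudness[left - 1] < loudness[left]:
            left -= 1   # walk downhill to the preceding minimum
        right = peak
        while right < last and loudness[right + 1] < loudness[right]:
            right += 1  # walk downhill to the following minimum
        m_peak = loudness[peak]
        strengths.append(
            0.5 * ((m_peak - loudness[left]) + (m_peak - loudness[right])))
    return strengths


def phrase_volatility(loudness):
    """Population standard deviation of the phrase strengths (an assumption)."""
    return pstdev(phrase_strengths(loudness))


# Hypothetical loudness values per eighth note (not measured data).
toy = [60, 64, 70, 66, 62, 65, 68, 63, 61]
print(local_maxima(toy))       # [2, 6]
print(phrase_strengths(toy))   # [9.0, 6.5]
print(phrase_volatility(toy))  # 1.25
```

Plateaus (equal neighbouring values) produce no peak under the strict inequality; whether that matches the original analysis is not stated in the text.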
additional mathematical descriptors, described in the following sub-sections, to quantify the characteristics of phrasing strategy.

4.2.1 Phrase strength

We define the phrase strength, $S_j$, of a phrase to be equal to the average loudness difference between its local maximum and the two adjacent local minima:

$$S_j = \frac{1}{2}\left[(M_j - m_j) + (M_j - m_{j+1})\right],$$

where $M_j$ is the loudness value of the local maximum and $m_j$ and $m_{j+1}$ are the loudness values of the two adjoining local minima as shown in Figure 8. The phrase strength value quantifies the degree to which the peak loudness in a phrase stands out from its neighbouring loudness dips, the softer beginning and end of the phrase or sub-phrase. These values allow us to measure the prominence, and in a sense, the clarity of a phrase.

4.2.2 Phrase volatility

The phrase volatility, $V$, is defined to be the standard deviation of all $S_j$ values in the performance. The phrase volatility value measures how much the phrase strengths deviate from the average phrase strength. A performance exhibiting a large variability of phrase

strengths would have a high $V$, while a performance exhibiting relatively constant phrase strength would have a low $V$.

4.2.3 Phrase typicality

The phrase typicality, $T_k$, of a phrase with a local maximum at location $k$ quantifies the popularity of that location as a phrase peak. It is defined to be the proportion of other performers who also place a local maximum at the location $k$. Mathematically, the phrase typicality for a phrase with a local maximum at position $k$ is given by:

$$T_k = \frac{P_k - 1}{N - 1},$$

where $P_k$ is the total number of performers placing a local maximum at location $k$, and $N$ is the total number of performers. The greater the number of performers placing a local maximum at a particular location, the greater the typicality of a phrase with a maximum at that location. Figure 9 shows how $P_k$ varies with location in section A. The vertical dotted lines indicate local maxima.

For a given performance, if the local maxima occur at $\{k_1, k_2, \ldots, k_n\}$, then the average phrase typicality for the performance is given by the expression:

$$\bar{T} = \frac{1}{n} \sum_{k = k_1}^{k_n} T_k.$$

High average typicality would point to high concurrence with other performers in the choice of phrase peak positions in the score. Low average typicality would point to overall low agreement with other performers in phrasing choice. The average phrase typicality can thus

Fig. 10. LMPD Results. (a) Local maxima counts. (b) Average phrase strength. (c) Phrase volatility. (d) Average phrase typicality. (e)
Legend.
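
The typicality values summarised in Figure 10(d) follow from the definitions in Section 4.2.3: count how many performers place a peak at each location, convert the count to the proportion of other performers, and average over each performer's own peaks. The sketch below is a minimal illustration under stated assumptions: every performance is sampled on the same score grid, there are at least two performers, and each performance contains at least one local maximum. The function names, the performer labels "A", "B" and "C", and the loudness values are hypothetical.

```python
from collections import Counter


def local_maxima(loudness):
    """Indices whose value exceeds both immediate neighbours."""
    return [i for i in range(1, len(loudness) - 1)
            if loudness[i - 1] < loudness[i] > loudness[i + 1]]


def typicality_table(performances):
    """Proportion of other performers sharing a peak at each location k."""
    n_performers = len(performances)
    counts = Counter(k for series in performances.values()
                     for k in local_maxima(series))
    # counts[k] plays the role of P_k; n_performers plays the role of N.
    return {k: (p_k - 1) / (n_performers - 1) for k, p_k in counts.items()}


def average_typicality(performances):
    """Mean typicality over each performer's own peak locations."""
    table = typicality_table(performances)
    averages = {}
    for name, series in performances.items():
        peaks = local_maxima(series)
        averages[name] = sum(table[k] for k in peaks) / len(peaks)
    return averages


# Hypothetical loudness series on a shared grid (not measured data).
toy = {
    "A": [1, 3, 2, 4, 1],  # peaks at 1 and 3
    "B": [1, 3, 2, 1, 0],  # peak at 1
    "C": [2, 1, 3, 1, 0],  # peak at 2
}
print(typicality_table(toy))    # {1: 0.5, 3: 0.0, 2: 0.0}
print(average_typicality(toy))  # {'A': 0.25, 'B': 0.5, 'C': 0.0}
```

Performer "C", whose only peak is shared with nobody, scores zero, which is the sense in which a low average typicality indicates an unusual choice of peak positions.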

be interpreted as some measure of the uniqueness of a performer’s phrasing strategy.

5. Data analysis: LMPD results

This section describes the analysis results when the LMPD method is applied to the eleven performances (plus the grand average performance) of the Andante movement from Bach’s Violin Sonata No. 2. The numbers of local maxima per performance, referred to as local maxima counts, are shown in Figure 10(a). The local maxima counts range from 45 (Mintz) to 53 (Heifetz). The eleven performances produced only six distinct local maxima counts, suggesting there may be a finite number of musically acceptable phrasing strategies.

The average phrase strength values are shown in Figure 10(b). Grumiaux played with the highest average phrase strength while Menuhin played with the lowest. Informal listening tests support these results. It would be interesting to conduct formal listening tests to determine whether phrase strength values correlate well with human perception of phrasing clarity.

Figure 10(c) shows the phrase volatility values. Kremer stands out as the performer with by far the greatest phrase volatility. Indeed, when listening to his recording, phrase volatility proved to be the defining feature of his performance. This result highlights a need to better understand the perceptual relevance of our mathematical descriptors; it is conceivable that phrase strength, volatility, and typicality values should be assigned differing degrees of perceptual importance depending on the performer.

The phrase typicality averages are shown in Figure 10(d). Enescu had the lowest typicality ($\bar{T}_{\text{Enescu}} = 0.31$) while Milstein (1956) showed the highest ($\bar{T}_{\text{Milstein}} = 0.51$). It is less clear what the direct perceptual correlates of phrase typicality are, as a phrase is more than simply the location of its maximum loudness. One interpretation of average phrase typicality is that it serves as a measure of the perceived uniqueness of a performer’s phrasing strategy.

In music and the visual arts, being different and unique is often judged to be more desirable than being a conformist. However, Milstein, whose 1956 performance had the highest average phrase typicality score, is widely regarded as the greatest performer and interpreter of Bach’s unaccompanied sonatas, which calls into question the view that being different from others might be a desirable quality in phrasing strategy. Thus, we leave open the debate on subjective aesthetic judgment with regard to a high or low typicality score.

6. Discussion and conclusion

We have developed general methodologies for extracting beat-level tempo and for selecting between competing loudness models for extracting dynamic data, and applied these methodologies to eleven audio recordings of solo violin performances of Bach’s Violin Sonata No. 2 in A minor BWV 1003. We identified expressive trends through the use of means and ranges for contexts of varying sizes (individual phrases, sections, and entire pieces). Our analyses have uncovered significant differences between the use of tempo and dynamic variations in expressive performance strategies in master violinists’ recordings of Bach’s Violin Sonata No. 2. At the beat level, we find that the dynamic strategies are highly consistent and systematic across performers and performances, despite a complete lack of written guidance in the score, so much so that we were able to devise a systematic phrase detection algorithm, the LMPD method, based on local maxima in loudness. In contrast, we found the tempo strategies to be more diverse and to elude such a systematic description.

We should note that dynamic data from audio recordings are an imperfect representation of the performer’s expressive vision, as artifacts of the recording process and the recording engineers’ choices inevitably influence the result. However, while sound engineering may change the dynamic range and distribution, it does not erase the basic shaping of a phrase, nor the musician’s phrasing strategy. Thus, while the consistency of global dynamic means and ranges may in part be due to recording artifacts, we believe the similarity of dynamic shaping at the note-to-note and phrase-to-phrase levels to be representative of similarity in expressive strategy.

We utilized the highly systematic relationship between phrase structure and dynamic variation to analyse global phrasing strategy using the LMPD method. The LMPD method and its accompanying descriptors of phrase strength, volatility, and typicality yield phrasing strategy analyses that concur with certain characteristics of the violinists’ expressive gestures. While the mathematical descriptors provide some insight into phrasing strategy, further tests must be done to determine their perceptual significance.

If we interpret the diversity of observed expressive strategies to be an indicator of the constraints that guide them, it becomes clear that tempo and dynamics are subject to fundamentally different constraints. From a perceptual perspective, this could be explained by the fact that perception of tempo requires the comparison of temporally spaced events, while dynamics can be perceived in isolation, and almost instantaneously. From a musical perspective, the consistency of dynamic strategies suggests that musical phrase structure imposes stricter constraints on dynamic variation, constraining musicians to follow similar trajectories. It is possible that differences in how we perceive tempo and dynamics determine their expressive functions, which in turn dictate the degree of structural constraint. Thus, if tempo and dynamics highlight aspects of the musical landscape with differing temporal resolutions,

the function of dynamics as a clarifier of musical structure at fine-grain resolutions would force musicians to vary dynamics more tightly according to musical structure.

One might be tempted to hypothesize that our findings would generalize to other instruments and, to a certain degree, other pieces. After all, perceptual constraints on tempo and dynamics should be the same regardless of instrument or musical style. However, a forthcoming study of recordings of Chopin piano music by Nicholas Cook (manuscript made available through personal communication, 2008) identifies recordings in which phrases are articulated almost exclusively using tempo as well as recordings where performers use almost exclusively dynamics, suggesting that phrasing strategies could be governed more by cultural constructs than perceptual constraints. While our results indicate that violinists employ expressive strategies similar to those of pianists in the marking of phrase boundaries, further tests must be done to reveal the extent of generality of the LMPD method for performances of compositions by other composers, and for other instruments, and the situations under which the technique could be extended to analysis of expressive tempo strategies.

Acknowledgements

We thank the reviewers for their thoughtful and constructive comments and suggestions. The research has been funded in part by a Frank H. and Eva B. Buck Foundation Scholarship and a National Science Foundation (NSF) grant Career-0347988. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the Frank H. and Eva B. Buck Foundation or NSF.

References

Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M. & Sandler, S.B. (2005). A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5), 1035–1047.

Cambouropoulos, E. (2001). The local boundary detection model (LBDM) and its application in the study of expressive timing. In Proceedings of the International Computer Music Conference, Havana, Cuba, 17–22 September.

Cheng, E. (2006). Computational analysis of expression in violin performance. Masters Thesis, Hsieh Department of Electrical Engineering – Systems, University of Southern California, USA.

Cheng, E. & Chew, E. (2007). A local maximum phrase detection method and the analysis of phrasing strategies in expressive performances. In Proceedings of Mathematics and Computation in Music, the First International Conference of the Society of Mathematics and Computation in Music, Berlin, Germany, 18–20 May.

Chew, E., François, A.R.J., Liu, J. & Yang, A. (2005). ESP: a driving interface for musical expression synthesis. In Proceedings of the Conference on New Interfaces for Musical Expression, Vancouver, Canada, 26–28 May.

Chew, E., Liu, J. & François, A.R.J. (2006). ESP: roadmaps as constructed interpretations and guides to expressive performance. In Proceedings of the First Workshop on Audio and Music Computing for Multimedia, Santa Barbara, CA, 27 October, pp. 137–145.

Chuan, C.H. & Chew, E. (2007). A dynamic programming approach to the extraction of phrase boundaries from tempo variations in expressive performances. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), Vienna, Austria, 23–27 September.

Collins, N. (2005). A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions. In Proceedings of the 118th Audio Engineering Society Convention, Barcelona, Spain, 28–31 May.

Cook, N. (2008). Objective expression: analysing phrase arching in recordings of Chopin’s Mazurkas. Manuscript made available through personal communication, September.

Dixon, S. (2000). A lightweight multi-agent musical beat tracking system. In Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, Melbourne, Australia.

Dixon, S. (2001). Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30(1), 39–58.

Dixon, S., Goebl, W. & Widmer, G. (2002). The PERFORMANCE WORM: real-time visualization of expression based on Langner’s tempo-loudness animation. In Proceedings of the International Computer Music Conference, Goteborg, Sweden, 16–21 September.

Gabrielsson, A. (1987). Once again: the Theme from Mozart’s Piano Sonata in A Major (K.331). In Action and Perception in Rhythm and Music (pp. 81–103). Royal Swedish Academy of Music: Stockholm.

Gabrielsson, A. (1999). Music performance. In D. Deutsch (Ed.), The Psychology of Music (2nd ed.) (pp. 501–602). San Diego: Academic Press.

Gabrielsson, A. (2003). Music performance research at the millennium. Psychology of Music, 31(3), 221–272.

Hong, J.-L. (2003). Investigating expressive timing and dynamics in recorded cello performances. Psychology of Music, 31(3), 340–352.

Kabal, P. (2002). An examination and interpretation of ITU-R BS.1387: perceptual evaluation of audio quality. Technical Audio Paper, Department of Electrical and Computer Engineering, McGill University, May 2002.

Langner, J. & Goebl, W. (2003). Visualizing expressive performance in tempo-loudness space. Computer Music Journal, 27(4), 69–83.

Palmer, C. & Hutchins, S. (2006). What is musical prosody? In B.H. Ross (Ed.), Psychology of Learning and Motivation (Vol. 46, pp. 245–278). Amsterdam: Elsevier Press.

Repp, B.H. (1998). A microcosm of musical expression. I. Quantitative analysis of pianists’ timing in the initial measures of Chopin’s Etude in E major. Journal of the Acoustical Society of America, 104(2), 1085–1100.

Repp, B.H. (1999a). A microcosm of musical expression. II. Quantitative analysis of pianists’ dynamics in the initial measures of Chopin’s Etude in E major. Journal of the Acoustical Society of America, 105(3), 1972–1988.

Repp, B.H. (1999b). A microcosm of musical expression. III. Contributions of timing and dynamics to the aesthetic impression of pianists’ performances of the initial measures of Chopin’s Etude in E major. Journal of the Acoustical Society of America, 106(1), 469–478.

Sapp, C. (2007). Comparative analysis of multiple musical performances. In Proceedings of the International Conference on Music Information Retrieval, Vienna, Austria, 23–27 September, pp. 497–500.

Skovenborg, E. & Nielsen, S.H. (2004). Evaluation of different loudness models with music and speech material. In Proceedings of the 117th Convention of the Audio Engineering Society, San Francisco, CA, 28–31 October.

Soulodre, G.A. & Norcross, S.G. (2003). Objective measures of loudness. In Proceedings of the 115th Audio Engineering Society, New York, 10–13 October.

Sundberg, J., Friberg, A. & Bresin, R. (2003). Attempts to reproduce a pianist’s expressive timing with Director Musices performance rules. Journal of New Music Research, 32(3), 317–325.

Todd, N.P.M. (1992). The dynamics of dynamics: a model of musical expression. Journal of the Acoustical Society of America, 91(6), 3540–3550.

Todd, N.P.M. (1995). The kinematics of musical expression. Journal of the Acoustical Society of America, 97(3), 1940–1949.

Widmer, G. & Goebl, W. (2004). Computational models of expressive music performance: the state of the art. Journal of New Music Research, 33(3), 203–216.
