Abstract
The purpose of this study was to determine how evaluations of four intact performances
of a Chopin étude (Op. 25, No. 9, Butterfly), differing in quality, related to evaluations
of each performance’s sections. We presented each performance as five excerpts:
(a) opening phrase, (b) first half, (c) second half, (d) coda, and (e) intact. We then
presented 10 of the 20 excerpts a second time to assess listener reliability. Results
revealed an interaction between performer and excerpt type: Ratings of intact performances
were significantly higher than were ratings of component sections for the two best
performances. For the other two performances, however, ratings of intact excerpts
were significantly lower than ratings of component sections. Section and listener
background were nonsignificant factors. Data indicate that ratings of longer sections
(first and second halves) were more closely tied to intact performance ratings than
were ratings of shorter sections (opening phrase and coda). Correlational data revealed that
piano majors were marginally more reliable than nonpiano majors.
Keywords
adjudication, music performance, piano
1 McGill University, Montreal, QC, Canada
2 Florida State University, Tallahassee, FL, USA
Corresponding Author:
Joel Wapnick, McGill University, 555 Sherbrooke St. W., Montreal, QC H3W 2J1, Canada
Email: jwapnick@music.mcgill.ca
Wapnick and Darrow 463
long were rated higher and more reliably than were 20-s excerpts. Participants in a
study by S. Thompson, Williamon, and Valentine (2007) rated piano works by Bach
and Chopin continuously by moving a mouse along a 7-point scale. Although responses
stabilized to some degree after 15 to 20 s, they tended to drift upward afterward, par-
ticularly within the 1st min.
In the present study, we examined the effects of excerpt duration by comparing rat-
ings of intact performances of a Chopin étude with ratings of sections taken from those
performances. We also were interested in how ratings of particular sections of the étude
might relate to ratings of their corresponding intact performances. It is possible that
ratings of an intact work would be predicted better from ratings of its opening phrase
than from ratings of other sections. On the other hand, ratings of the étude’s coda
might be allied more closely to ratings of intact performances. In addition, we com-
pared ratings of intact performances with ratings of the étude’s first and second halves,
and we investigated whether the quality of an intact performance relative to other
intact performances might be important in determining how ratings of sections versus
ratings of the whole piece related to each other. Would the numerical rating of an
intact performance be close to the mean rating of the sections it comprises, or would it
be higher for the higher-rated performances and lower for the lower-rated perfor-
mances? In summary, we were interested in how the perception of the whole related to
the perception of its parts, and in whether differences in quality of the four “wholes”
employed in this study—the four intact performances—would interact with the ratings
of their corresponding sections.
Method
This study consisted of two phases. In Phase 1, evaluators listened to and rated
22 versions of Chopin’s Butterfly Étude, Op. 25, No. 9. We used these ratings to select
4 versions from the 22 for use in Phase 2, as described in a following section.
Evaluators
We employed two groups of evaluators, all volunteers. Phase 1 evaluators were
9 people: 3 piano major graduate students and 6 full-time professors in performance,
theory, musicology, and music education. All were from the same university music
school. Phase 2 evaluators were 111 undergraduate and graduate music majors from
two universities. Eighty-five were nonpiano majors and 26 were piano majors.
Six of the 9 evaluators were each given a CD to listen to at their leisure. The other 3
heard the CD at the same time in a classroom setting. Evaluators were given the following
instructions:
You are about to hear 22 different versions of Chopin’s Butterfly Étude (Op. 25,
No. 9). After you hear each one, give it a grade from 0 to 100, where 100 rep-
resents what to your mind is as fine a performance as you are likely to hear of
the piece and 0 represents a very poor effort. I realize that the sound quality on
these performances varies; please try to ignore sound quality and focus on the
quality of the performances only.
We employed a 100-point scale for this task because such a scale commonly is
used at many universities in the evaluation of music performances.
following: (a) opening phrase (mm. 1–8, mean duration = 8 s); (b) first half
(mm. 1–24, mean duration = 29 s); (c) second half (mm. 25–51, mean duration = 33 s);
(d) coda (mm. 37–51, mean duration = 18 s); and (e) intact (mm. 1–51, mean duration =
62 s). We then presented half of the 20 versions (Trials 1 through 10) again in the same
order (Trials 21 through 30) to assess evaluator reliability.
Listening sessions were held in groups, and evaluators were read the following
instructions:
The purpose of this study is to determine the minimum amount of time one
needs to judge music reliably. Accordingly, you will be hearing 30 musical
examples, all from the same piece and each lasting in duration from 8 seconds
to 1 minute. Each trial will be of either the entire performance of Chopin’s
Butterfly Étude or an excerpt from that work. Please rate these trials on a 7-point
scale by circling a number, where 1 signifies a poor performance and 7 signifies
an excellent performance. Also, and in accordance with the purpose of this
experiment, please hold off making your rating choice until after the entire trial
example is played. There are some differences in recording quality—try to
ignore them and rate just the pianists. You will have 5 seconds between trials to
rate each performance. The entire experiment will take 18 minutes.
We will begin with a complete performance of the Butterfly Étude, to give you
an idea of what the piece sounds like. Do not rate this performance. I will
announce the numbers of the trials following it as we go.
Results
Phase 1: Ratings
Ratings from Phase 1 revealed differences in grading across the 9 evaluators (Table C
in the online supplemental material; see http://jrme.sagepub.com/supplemental), which
was confirmed by a one-way repeated-measures analysis of variance, F(8, 168) = 17.46,
p < .001, η2 = .454, and post hoc LSD tests: Evaluator 1 gave significantly higher rat-
ings than all other evaluators (p < .01), Evaluators 2 and 3 gave significantly higher
ratings than all other evaluators except Evaluator 1 (p < .03), and Evaluators 4 and 5
gave significantly higher ratings than Evaluators 6, 7, 8, and 9 (p < .05).
Table 2. Phase 2: Means and Standard Deviations for Sectional and Complete Performances

               Performance 1    Performance 2    Performance 3    Performance 4
Section          M      SD        M      SD        M      SD        M      SD
Opening        5.59    1.01     4.31    1.00     4.53    0.94     3.27    1.28
First half     6.01    0.94     4.89    1.08     4.11    1.24     2.74    1.22
Second half    5.90    1.11     5.13    1.29     4.06    1.18     3.08    1.21
Coda           5.54    1.16     5.21    1.14     3.69    1.28     3.47    1.28
Intact         6.48    0.97     5.46    1.02     3.68    1.15     2.47    1.06
468 Journal of Research in Music Education 60(4)
first half and between second half and coda, because in both cases, one of the sections
was subsumed within the other. This did not turn out to be the case.
Reliability. We assessed reliability by calculating Pearson product-moment correla-
tions for the first 10 excerpts (Trials 1 through 10) and their repeated presentations
(Trials 21 through 30). For piano majors, mean r(25) = .66 (p < .01), and for nonpiano
majors, mean r(85) = .60 (p < .01). Fifty-three percent of nonpiano majors’ correlations
were significant, whereas 64% of piano majors’ correlations were significant (p < .05).
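Test–retest reliability of this kind reduces to a short calculation: correlate each listener's ratings of Trials 1 through 10 with that listener's ratings of the same excerpts on Trials 21 through 30. The sketch below implements the Pearson product-moment correlation in plain Python; the two rating vectors are hypothetical illustrations, not data from this study.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical 7-point ratings for one listener: Trials 1-10 and the
# same 10 excerpts repeated as Trials 21-30.
first_pass = [6, 5, 4, 7, 3, 5, 6, 2, 4, 5]
repeat_pass = [6, 4, 4, 6, 3, 5, 7, 3, 4, 4]

r = pearson_r(first_pass, repeat_pass)  # test-retest reliability, about .87
```

Averaging such per-listener coefficients across a group yields the mean reliabilities reported above (.66 for piano majors, .60 for nonpiano majors).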
Reliability by section. We calculated Pearson product-moment correlations individu-
ally for each of the 10 excerpts that were presented twice (Trials 1 and 21, 2 and 22,
etc.). The two correlations for each of the five section types (opening phrase, first half,
second half, coda, and intact) then were averaged separately for piano majors and
nonpiano majors. Results revealed that mean correlations varied from .27 to .38,
except for correlations of .20 and .22 for nonpiano majors when rating opening phrase and
coda, respectively. It thus seems that nonpiano majors were affected somewhat by sec-
tion length, as lower reliabilities were associated only with shorter sections. For piano
majors, however, section type did not appear to affect reliability.
Discussion
In this study, we attempted to determine how evaluations of an intact piece of music
would relate to evaluations of its component sections and to see whether such asso-
ciations would vary as a consequence of performance quality. Participants listened to
four performances of Chopin’s Butterfly Étude, differing in overall quality, each
presented in its entirety (intact) and in sections. Results indicated that for the two
higher-rated performances, the whole was greater than its parts: intact ratings were
higher than sectional ratings. For the two lower-rated performances, however, the
opposite was true: intact ratings were lower than sectional ratings. Thus for
each of the four performances, intact ratings diverged from what would have been
obtained by averaging the ratings of their component sections.
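This divergence can be checked arithmetically against the means in Table 2. The sketch below compares each performance's intact mean with the unweighted average of its four sectional means (the unweighted average is an assumption for illustration; the study did not report this aggregate).

```python
# Means from Table 2: sectional means (opening, first half, second half,
# coda) and intact means, keyed by performance number.
sections = {
    1: [5.59, 6.01, 5.90, 5.54],
    2: [4.31, 4.89, 5.13, 5.21],
    3: [4.53, 4.11, 4.06, 3.69],
    4: [3.27, 2.74, 3.08, 3.47],
}
intact = {1: 6.48, 2: 5.46, 3: 3.68, 4: 2.47}

for perf in sorted(sections):
    sectional_mean = sum(sections[perf]) / len(sections[perf])
    direction = "above" if intact[perf] > sectional_mean else "below"
    print(f"Performance {perf}: intact mean {intact[perf]:.2f} is "
          f"{direction} the sectional average {sectional_mean:.2f}")
```

For the two higher-rated performances the intact mean sits above the sectional average (6.48 vs. 5.76 and 5.46 vs. 4.89); for the two lower-rated performances it sits below (3.68 vs. 4.10 and 2.47 vs. 3.14).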
Our results are consistent with previous research showing that expert performances
are rated higher for longer excerpts than for shorter ones (Bergee, 1993, 1997; Fiske,
1975, 1979; Hewitt, 2007; Hewitt & Smith, 2004; S. Thompson et al., 2007; Wapnick
et al., 2005). They are, in fact, strikingly similar to those of Geringer and Johnson
(2007), who found that although university and professional wind band ensembles
were rated higher for longer than for shorter excerpts, lower quality high school
ensemble performances were rated lower for longer than for shorter excerpts. At pres-
ent, it is not known whether this interaction was due to absolute standards regarding
performance quality or whether it resulted from implicit comparisons made across
performances. It is possible, for example, that evaluators became more confident in
their evaluations the longer they listened to a performance. This increase in confidence
may have propelled their ratings either above or below the mean rating of the perfor-
mance’s component sections, depending on whether the evaluation was positive or
negative. Or it is possible that what is judged good depends on what else is being
judged.
Although we found no particularly strong correlations between opening phrase and
intact performance, or for coda and intact performance, we did find that correlations
with corresponding intact performances were somewhat stronger for ratings based on
the first half or the second half of the étude (mean r = .38) than for the shorter opening
and closing sections (mean r = .25). This greater predictability can be seen in Table 2:
Rating differences across the four performances are marked for first-half and second-
half sections and for intact performances. For the opening phrase, however,
Performance 3 actually was rated slightly higher than Performance 2; and for the coda,
differences in ratings between the top two performances, and between the bottom two
performances, are much narrower than differences in ratings for corresponding intact
performances.
Our examination of the relationship between opening and closing sections of a
work and its intact rating was aimed at determining whether intact ratings would have
been susceptible to either a primacy effect or a recency effect. However, the impor-
tance of an opening or closing section heard in isolation may differ from its impor-
tance when embedded within an intact performance. One way for researchers to
examine true primacy and recency effects would be to splice together a number of
different openings followed by one unvarying subsequent performance (primacy
effect) and to present one performance followed by several different endings (recency
effect).
Given that this study involved listening to only one composition, it should be
viewed as an impetus for future research rather than as a basis for generalization.
Chopin's Butterfly Étude is a brief, monothematic, fast, technically
demanding piece of music. These particular musical characteristics might have
affected ratings. It is conceivable that we might have obtained entirely different results
had we used more extended and structurally complex pieces of music. Nevertheless,
results of our study suggest that the evaluations of sections may differ systematically
from evaluations of the intact work—they may be either lower or higher than ratings
of intact works, depending on overall quality of performance.
In our study, evaluators rated excerpts globally. We might have obtained valuable
additional data had we used a ratings rubric (Ciorba & Smith, 2009; Geringer et al.,
2009; Latimer, Bergee, & Cohen, 2010) or if we had asked for ratings of specific musi-
cal aspects. However, we felt that doing so would increase the time necessary for
evaluators to rate 30 performances beyond what would have been acceptable.
It should be borne in mind that results from this study do not mean that any of the
pianists whose performances were heard by evaluators is in any sense “superior.”
Although the highest rated performance was rated higher than the second highest rated
performance, Performance 2 came from a live concert, whereas Performance 1 was
extracted from a commercial compact disc. Also, these two performances may have
benefited from superior sound recording in comparison to the other two performances.
Although none of the recordings we chose suffered from severe sonic problems, and
we instructed evaluators to ignore recording quality when assigning ratings to excerpts,
recording quality nevertheless might have affected ratings.
In Phase 1 of this study, we found that although our 9 evaluators differed in their
grading standards, they agreed substantially on which performances were better than
others (Table 1). The correlations obtained likely depended on the specifics of the
experimental situation, however. Twelve of the 22 performances were by internation-
ally recognized concert artists. Had we narrowed the spread of performance quality by
using only performances of such pianists, it is highly likely that we would have
obtained lower correlations. Also, it is possible that agreement between raters would
have been reduced had we asked less musically experienced evaluators to rate the
performances. Previous research has yielded mixed results on this point: some
studies have shown that musical experience and familiarity improve intrarater
reliability, interrater reliability, or both (Fiske, 1977a; Hewitt, 2002, 2005; Kinney, 2009; Morrison,
Montemayor, & Wiltshire, 2004; Sparks, 1990; W. Thompson et al., 1998; Towers,
1980), whereas other studies have not confirmed such results (Fiske, 1975, 1977; Heath, 1976;
Roberts, 1975; Wapnick, Flowers, Alegant, & Jasinskas, 1993). It also seems possible
that the effects of variables such as excerpt familiarity (Bergee, 2007; Kinney, 2009;
Wapnick et al., 2004) and aspects of the musical stimuli, particularly tempo (Geringer
& Johnson, 2007; LeBlanc, Colman, McCrary, Sherrill, & Malin, 1988; Wapnick
et al., 2005), might interact with listening time.
Finally, the issue of part versus whole evaluation might be considered in relation to
musical interpretation. It is possible that an artist’s overall conception of a piece of
music may be of greater importance than how accurately and expressively the notes
are played—within any particular section or perhaps throughout the entire work. It
may prove difficult for one to define musical conception operationally, but perhaps its
effects can be shown in those cases where measurements of the parts differ from
measurement of the whole.
Funding
The authors received no financial support for the research, authorship, and/or publication of this
article.
References
Bergee, M. J. (1993). A comparison of faculty, peer, and self-evaluation of applied brass jury
performances. Journal of Research in Music Education, 41, 19–27. doi:10.2307/3345476
Bergee, M. J. (1995). Primary and higher-order factors in a scale assessing concert band perfor-
mance. Bulletin of the Council for Research in Music Education, 126, 1–14.
Bergee, M. J. (1997). Relationships among faculty, peer, and self-evaluations of applied per-
formances. Journal of Research in Music Education, 45, 601–612. doi:10.2307/3345425
Hewitt, M. P., & Smith, B. P. (2004). The influence of teaching-career level and primary perfor-
mance instrument on the assessment of music performance. Journal of Research in Music
Education, 52, 314–327. doi:10.1177/002242940405200404
Kinney, D. (2009). Internal consistency of performance evaluations as a function of music
expertise and excerpt familiarity. Journal of Research in Music Education, 56, 322–337.
doi:10.1177/0022429408328934
Latimer, M. E., Bergee, M. J., & Cohen, M. L. (2010). Reliability and perceived utility of a
weighted music performance assessment rubric. Journal of Research in Music Education,
58, 168–183. doi:10.1177/0022429410369836
LeBlanc, A., Colman, J., McCrary, J., Sherrill, C., & Malin, S. (1988). Tempo preferences
of different age music listeners. Journal of Research in Music Education, 36, 156–168.
doi:10.2307/3344637
LeBlanc, A., & Sherrill, C. (1986). Effect of vocal vibrato and performer’s sex on children’s music
preference. Journal of Research in Music Education, 34, 222–237. doi:10.2307/3345258
McCrary, J. (1993). Effects of listeners’ and performers’ race on music preferences. Journal of
Research in Music Education, 41, 200–211. doi:10.2307/3345325
Morrison, S. J., Montemayor, M., & Wiltshire, E. S. (2004). The effect of a recorded model
on band students’ performance self-evaluations, achievement, and attitude. Journal of
Research in Music Education, 52, 116–129. doi:10.2307/3345434
Roberts, B. A. (1975). Judge-group differences in the rating of piano performances (MusM
thesis). University of Western Ontario, London, Canada.
Sparks, G. E. (1990). The effect of self-evaluation on musical achievement, attentiveness, and
attitudes of elementary school instrumental students (Doctoral dissertation). Louisiana State
University, Baton Rouge.
Thompson, S., Williamon, A., & Valentine, E. (2007). Time-dependent characteristics of per-
formance evaluation. Music Perception, 25, 13–29.
Thompson, W. F., Diamond, C. P., & Balkwill, L. L. (1998). The adjudication of six perfor-
mances of a Chopin étude: A study of expert knowledge. Psychology of Music, 26, 154–174.
doi:10.1177/0305735698262004
Towers, R. (1980). Age-group differences in judge reliability of solo voice performances
(Unpublished master’s thesis). University of Western Ontario, London, Canada.
Vasil, T. (1973). The effects of systematically varying selected factors on music performance
adjudication (Unpublished doctoral dissertation). University of Connecticut, Storrs.
Wagner, M. J. (1991). The effect of adjudicating three videotaped popular music performances
on a “composite critique” rating and an “overall” rating. Missouri Journal of Research in
Music Education, 28, 53–70.
Wapnick, J., Darrow, A.-A., Kovacs, J., & Dalrymple, L. (1997). Effects of physical attrac-
tiveness on evaluation of vocal performance. Journal of Research in Music Education, 45,
470–479. doi:10.2307/3345540
Wapnick, J., & Ekholm, E. (1997). Evaluation of vocal production in solo voice performance.
Journal of Voice, 11, 429–436. doi:10.1016/S0892-1997(97)80039-2
Wapnick, J., Flowers, P., Alegant, M., & Jasinskas, L. (1993). Consistency in piano performance
evaluation. Journal of Research in Music Education, 41, 282–292. doi:10.2307/3345504
Wapnick, J., Kovacs-Mazza, J., & Darrow, A.-A. (2000). Effects of performer attractive-
ness, stage behavior, and dress on evaluation of children’s piano performances. Journal of
Research in Music Education, 48, 323–336. doi:10.2307/3345367
Wapnick, J., Ryan, C., Campbell, L., Deek, P., Lemire, R., & Darrow, A.-A. (2005). Effects of
excerpt tempo and duration on musicians’ ratings of high-level piano performances. Journal
of Research in Music Education, 53, 162–176. doi:10.1177/002242940505300206
Wapnick, J., Ryan, C., Lacaille, N., & Darrow, A.-A. (2004). Effects of selected variables on
musicians’ ratings of high-level piano performances. International Journal of Music Educa-
tion, 22, 7–20. doi:10.1177/0255761404042371
Zdzinski, S. F., & Barnes, G. V. (2002). Development and validation of a string performance
rating scale. Journal of Research in Music Education, 50, 245–255. doi:10.2307/3345801
Bios
Joel Wapnick is professor of music education at McGill University. His research interests deal
with musical and nonmusical variables affecting music performance evaluation.
Alice-Ann Darrow is Irwin Cooper Professor of Music Therapy and Music Education at
Florida State University. Her research interests include teaching music to special populations
and the role of music in deaf culture.