Sectional Versus Intact Evaluations of Four Versions of a Chopin Étude

Journal of Research in Music Education, 60(4), 462–474
© 2013 National Association for Music Education
DOI: 10.1177/0022429412465318
http://jrme.sagepub.com
Joel Wapnick (McGill University) and Alice-Ann Darrow (Florida State University)

Corresponding Author: Joel Wapnick, McGill University, 555 Sherbrooke St. W., Montreal, QC H3W 2J1, Canada. Email: jwapnick@music.mcgill.ca

Abstract
The purpose of this study was to determine how evaluations of four intact performances of a Chopin étude (Op. 25, No. 9, Butterfly), differing in quality, related to evaluations of each performance’s sections. We presented each performance as five excerpts: (a) opening phrase, (b) first half, (c) second half, (d) coda, and (e) intact. We then presented 10 of the 20 excerpts a second time to assess listener reliability. Results revealed a Performer × Section interaction: Ratings of intact performances were significantly higher than were ratings of component sections for the two best performances. For the other two performances, however, ratings of intact excerpts were significantly lower than ratings of component sections. Section and listener background were nonsignificant factors. Data indicate that ratings of longer sections (first and second halves) were more closely tied to intact performance ratings than were ratings of shorter sections (opening phrase and coda). Correlational data revealed that piano majors were marginally more reliable than nonpiano majors.

Keywords
adjudication, music performance, piano

Excerpt duration recently has emerged as a variable of interest in music adjudication research (Geringer & Johnson, 2007; Geringer, Allen, MacLeod, & Scott, 2009; W. Thompson, Diamond, & Balkwill, 1998; Wapnick et al., 2005). Nevertheless, many issues concerning this variable have yet to be thoroughly addressed. One of them might be posed in the form of a question: How long is “long enough”? In this context, “long enough” is concerned neither with courtesy shown to a performer or performing group nor with aesthetic enjoyment of the music, important though these considerations may be. The concept of “long enough” is based on the assumption that there is a point in time beyond which evaluations no longer change substantially and reliability no longer improves. In competitions and auditions, judges sometimes end performances arbitrarily; they have heard enough, and thus they have accepted the assumption of “long enough,” at least tacitly. In other situations, however, such as recital performances, judges listen to the end of a performance. Here there is no assumption of “long enough”; events occurring past any point in time, even near the end of an hour-long recital, may affect evaluations.
Geringer et al. (2009) contended that efficiency in adjudication is an important concern because listening to a large number of performances is often expensive, time-consuming, and subject to error due to the potential effects of fatigue and bias. Their study concerned the selection of violinist candidates for admission to three Florida all-state orchestras. The traditional adjudication process required grading performances of a 1-min étude, an orchestral excerpt, major and minor scales, and sight-reading. The researchers found, however, that decisions based on the use of a prescreening rubric, as applied solely to the evaluation of the étude, correlated well with decisions based on the more extensive traditional approach. They were able to reduce “long enough” to 1 min per student.
With the exception of a dissertation by Vasil (1973) and a series of studies conducted by Bergee (1993, 1995, 1997, 2003), research studies in music adjudication have universally exposed listeners to excerpts shorter than 3 min. Excerpts of 1 to 3 min have been employed commonly (Ekholm, 1994; Gillespie, 1997; W. Thompson et al., 1998; Wagner, 1991; Wapnick, Darrow, Kovacs, & Dalrymple, 1997; Wapnick & Ekholm, 1997; Wapnick, Kovacs-Mazza, & Darrow, 2000; Wapnick, Ryan, Lacaille, & Darrow, 2004), and many studies used excerpts even briefer than that (Brittin, 2002; Byo & Crone, 1989; Fiske, 1975, 1977b; Geringer & Madsen, 1998; LeBlanc & Sherrill, 1986; McCrary, 1993; Zdzinski & Barnes, 2002). The use of brief excerpts is problematic only to the degree that evaluations of them might be less reliable than evaluations from longer excerpts, or that brief excerpt evaluations might not be representative of evaluations taken from longer excerpts. In at least two studies, however, it has been shown that for relatively short pieces, listening durations for sectional versus intact performances may differ considerably without substantial differences arising in either ratings or reliability. Vasil (1973) found no such differences for excerpts that were 5 min, 2.5 min, and 1.25 min long. Geringer and Johnson (2007) obtained similar results using shorter durations. They compared excerpts from wind band performances that were 12 s, 25 s, and 50 s long and found no main effect for duration. However, they did find that professional- and university-level performances were rated higher when excerpt duration was longer, and that lower ratings of lower quality high school performances were associated with longer excerpt durations. These results corroborate those of Wapnick et al. (2005), who found that for experienced listeners, excerpts from performances of high-level pianists that were 60 s long were rated higher and more reliably than were 20-s excerpts. Participants in a study by S. Thompson, Williamon, and Valentine (2007) rated piano works by Bach and Chopin continuously by moving a mouse along a 7-point scale. Although responses stabilized to some degree after 15 to 20 s, they tended to drift upward afterward, particularly within the 1st min.
In the present study, we examined the effects of excerpt duration by comparing ratings of intact performances of a Chopin étude with ratings of sections taken from those performances. We also were interested in how ratings of particular sections of the étude might relate to ratings of their corresponding intact performances. It is possible that ratings of an intact work would be predicted better from ratings of its opening phrase than from ratings of other sections. On the other hand, ratings of the étude’s coda might be allied more closely to ratings of intact performances. In addition, we compared ratings of intact performances with ratings of the étude’s first and second halves, and we investigated whether the quality of an intact performance relative to other intact performances might be important in determining how ratings of sections versus ratings of the whole piece related to each other. Would the numerical rating of an intact performance be close to the mean rating of the sections it comprises, or would it be higher for the higher-rated performances and lower for the lower-rated performances? In summary, we were interested in how the perception of the whole related to the perception of its parts, and in whether differences in quality of the four “wholes” employed in this study (the four intact performances) would interact with the ratings of their corresponding sections.

Method
This study consisted of two phases. In Phase 1, evaluators listened to and rated
22 versions of Chopin’s Butterfly Étude, Op. 25, No. 9. We used these ratings to select
4 versions from the 22 for use in Phase 2, as described in a following section.

Evaluators
We employed two groups of evaluators, all volunteers. Phase 1 evaluators were
9 people: 3 piano major graduate students and 6 full-time professors in performance,
theory, musicology, and music education. All were from the same university music
school. Phase 2 evaluators were 111 undergraduate and graduate music majors from
two universities. Eighty-five were nonpiano majors and 26 were piano majors.

Phase 1: Materials and Procedures


We recorded 22 complete performances of Chopin’s Butterfly Étude from YouTube (www.youtube.com) to an Apple iMac using Wiretap Studio (Ambrosia Software, Inc.). YouTube performances were used because they included a wide variety of skill levels. We edited them using Peak (Bias, Inc.) to equalize loudness levels and to standardize durations of silence before and after each performance.
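The article names Peak (Bias, Inc.) as the editing tool; purely as an illustration, a comparable batch step, matching average loudness and padding uniform silence, might look like the following Python sketch (pydub; the file names, the -20 dBFS target, and the 1-s pad are hypothetical, not values reported by the authors):

    # Sketch of loudness equalization and silence padding, assuming pydub.
    # Peak (Bias, Inc.) was the tool actually used; this only illustrates the idea.
    from pydub import AudioSegment

    TARGET_DBFS = -20.0   # hypothetical loudness target
    PAD_MS = 1000         # hypothetical uniform silence before and after

    def equalize(path_in: str, path_out: str) -> None:
        seg = AudioSegment.from_file(path_in)
        seg = seg.apply_gain(TARGET_DBFS - seg.dBFS)  # match average loudness
        pad = AudioSegment.silent(duration=PAD_MS)
        (pad + seg + pad).export(path_out, format="wav")

    for i in range(1, 23):  # the 22 recorded performances
        equalize(f"perf_{i:02d}.mp3", f"perf_{i:02d}_eq.wav")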

Six of the 9 evaluators each were given a CD to listen to at leisure. Three others heard the CD at the same time in a classroom setting. Evaluators were given the following instructions:

You are about to hear 22 different versions of Chopin’s Butterfly Étude (Op. 25, No. 9). After you hear each one, give it a grade from 0 to 100, where 100 represents what to your mind is as fine a performance as you are likely to hear of the piece and 0 represents a very poor effort. I realize that the sound quality on these performances varies; please try to ignore sound quality and focus on the quality of the performances only.

We employed a 100-point scale for this task because such a scale commonly is
used at many universities in the evaluation of music performances.

Phase 2: Materials and Procedures


We selected 4 performances from the 22 heard by Phase 1 evaluators (see Table A in the online supplemental material available at http://jrme.sagepub.com/supplemental) for use in Phase 2. We chose performances that differed markedly from each other in rated quality. This allowed us to determine whether differences in relative quality would affect the strength and direction of discrepancies between overall and sectional ratings for each of the 4 performances.
The four Phase 2 recordings were played by an internationally renowned concert pianist, a winner of international piano competitions, an experienced university piano teacher, and an amateur pianist (Performances 15, 11, 21, and 10, respectively, in Table A, available in the online supplemental material at http://jrme.sagepub.com/supplemental). Durations of intact performances ranged from 59 s to 65 s. It should be noted that the lowest rated of the four performances was fluid and competent and was played at a tempo appropriate for the étude. Henceforth, these performances will be referred to by number, with Performance 1 being highest rated and Performance 4 being lowest rated.
To verify that the four performances differed in rated quality, we performed six
t tests (see Table B, http://jrme.sagepub.com/supplemental). With the exception of
the t test comparing the two lower-rated performances (p = .061), the mean rating for
each performance was significantly different from mean ratings for the other three
(p < .025).
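For readers wishing to retrace this step, the six pairwise comparisons could be computed along these lines. This is a sketch only: it is shown with paired tests because the same 9 evaluators rated every version, although the article does not specify the variant used, and the ratings matrix below is a random placeholder:

    # Sketch of the six pairwise comparisons among the four selected performances.
    # Shown as paired t tests (same 9 evaluators per version); the article does
    # not state the exact variant. Ratings are placeholder data.
    from itertools import combinations
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    ratings = rng.uniform(40, 95, size=(9, 4))  # stand-in for the 9 x 4 ratings

    for a, b in combinations(range(4), 2):
        t, p = stats.ttest_rel(ratings[:, a], ratings[:, b])
        print(f"Performance {a + 1} vs {b + 1}: t = {t:.2f}, p = {p:.3f}")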
The first excerpt on the experimental CD consisted of a “model” intact version of the Chopin étude (not one of the four Phase 2 performances and not to be rated by participants). Of the 22 versions rated in Phase 1, this one had attained the second highest mean; it is the eighth performance shown in Table A (see http://jrme.sagepub.com/supplemental).
Twenty trials followed, and they consisted of a randomly ordered presentation of the five excerpts taken from the four performances. The excerpts consisted of the following: (a) opening phrase (mm. 1–8, mean duration = 8 s); (b) first half (mm. 1–24, mean duration = 29 s); (c) second half (mm. 25–51, mean duration = 33 s); (d) coda (mm. 37–51, mean duration = 18 s); and (e) intact (mm. 1–51, mean duration = 62 s). We then presented half of the 20 excerpts (Trials 1 through 10) again in the same order (Trials 21 through 30) to assess evaluator reliability.
Listening sessions were held in groups, and evaluators were read the following
instructions:

The purpose of this study is to determine the minimum amount of time one
needs to judge music reliably. Accordingly, you will be hearing 30 musical
examples, all from the same piece and each lasting in duration from 8 seconds
to 1 minute. Each trial will be of either the entire performance of Chopin’s
Butterfly Étude or an excerpt from that work. Please rate these trials on a 7-point
scale by circling a number, where 1 signifies a poor performance and 7 signifies
an excellent performance. Also, and in accordance with the purpose of this
experiment, please hold off making your rating choice until after the entire trial
example is played. There are some differences in recording quality—try to
ignore them and rate just the pianists. You will have 5 seconds between trials to
rate each performance. The entire experiment will take 18 minutes.

We will begin with a complete performance of the Butterfly Étude, to give you
an idea of what the piece sounds like. Do not rate this performance. I will
announce the numbers of the trials following it as we go.

Do you have any questions?

Results
Phase 1: Ratings

Ratings from Phase 1 revealed differences in grading across the 9 evaluators (Table C in the online supplemental material; see http://jrme.sagepub.com/supplemental), which was confirmed by a one-way repeated-measures analysis of variance, F(8, 168) = 17.46, p < .001, η2 = .454, and post hoc LSD tests: Evaluator 1 gave significantly higher ratings than all other evaluators (p < .01), Evaluators 2 and 3 gave significantly higher ratings than all other evaluators except Evaluator 1 (p < .03), and Evaluators 4 and 5 gave significantly higher ratings than Evaluators 6, 7, 8, and 9 (p < .05).
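A minimal sketch of such an analysis, assuming the Pingouin package and a long-format ratings table (the column names and placeholder data are ours, not the authors’):

    # Sketch: one-way repeated-measures ANOVA across the 9 evaluators, treating
    # the 22 performances as the repeated "subjects." Assumes pingouin; layout
    # and data are illustrative only.
    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(1)
    long = pd.DataFrame({
        "performance": np.repeat(np.arange(1, 23), 9),  # 22 performances
        "evaluator": np.tile(np.arange(1, 10), 22),     # 9 evaluators
        "rating": rng.uniform(30, 95, 198),             # placeholder ratings
    })

    aov = pg.rm_anova(data=long, dv="rating", within="evaluator",
                      subject="performance", detailed=True)
    print(aov)

    # Uncorrected pairwise follow-ups, in the spirit of the LSD tests reported:
    post = pg.pairwise_tests(data=long, dv="rating", within="evaluator",
                             subject="performance", padjust="none")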

Phase 1: Correlational Data


There appeared to be considerable consistency concerning ratings of the 22 performances, as can be seen in Table 1. Of the 36 Pearson product-moment correlations across the 9 evaluators, 32 were significant at the .01 level and the remaining 4 were significant at the .05 level. The average Pearson r was .67.

Table 1. Phase 1: Correlations Between Evaluators

Evaluator   1      2      3      4      5      6      7      8      9
1           —
2           .43*   —
3           .64**  .71**  —
4           .73**  .62**  .83**  —
5           .58**  .44*   .72**  .53**  —
6           .80**  .65**  .83**  .81**  .79**  —
7           .65**  .63**  .71**  .75**  .45*   .73**  —
8           .74**  .54**  .71**  .75**  .47*   .75**  .67**  —
9           .61**  .77**  .71**  .61**  .66**  .74**  .63**  .57**  —

*p < .05. **p < .01.
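Given a 22 × 9 table of ratings, this correlation matrix and its average could be reproduced roughly as follows (pandas/SciPy sketch; the data layout is assumed and the ratings are placeholders):

    # Sketch: pairwise Pearson correlations between the 9 evaluators across
    # the 22 performances, plus the mean off-diagonal r. Placeholder data.
    from itertools import combinations
    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(2)
    ratings = pd.DataFrame(rng.uniform(30, 95, size=(22, 9)),
                           columns=[f"ev{j}" for j in range(1, 10)])

    rmat = ratings.corr()  # 9 x 9 Pearson matrix (analog of Table 1)
    pairs = list(combinations(ratings.columns, 2))
    rs = [rmat.loc[a, b] for a, b in pairs]
    print(f"mean r over {len(pairs)} pairs = {np.mean(rs):.2f}")

    # p values for each pair, to flag the .05 / .01 levels used in Table 1:
    pvals = {(a, b): stats.pearsonr(ratings[a], ratings[b])[1] for a, b in pairs}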

Phase 2: Effects of Major, Section, and Performer


We performed a three-way mixed-design analysis of variance to determine the effects of major (piano vs. nonpiano), section (opening phrase, first half, second half, coda, and intact performance), and the four performers on ratings. Two significant effects were found: a main effect for performer, F(3, 327) = 366.66, p < .001, η2 = .770, and a Performer × Section interaction, F(12, 1308) = 18.42, p < .001, η2 = .143. Post hoc analysis of the performer effect revealed that ratings for all performers were significantly different from each other (p < .01 in all cases). The Performer × Section interaction can be interpreted by examining intact ratings as a function of performance rating (see Table 2): For Performers 1 and 2, intact ratings were higher than were the ratings for any of the sections, but for Performers 3 and 4, intact performances were rated lower than any of the sectional ratings (although for Performer 3, the mean intact rating was virtually identical to the mean rating for the coda).

No other significant factors emerged from this analysis of variance. Piano majors did not give higher or lower ratings than nonpiano majors, nor were there significant differences in ratings for the various sections of the étude.

Table 2. Phase 2: Means and Standard Deviations for Sectional and Complete Performances

              Performance 1   Performance 2   Performance 3   Performance 4
              M     SD        M     SD        M     SD        M     SD
Opening       5.59  1.01      4.31  1.00      4.53  0.94      3.27  1.28
First half    6.01  0.94      4.89  1.08      4.11  1.24      2.74  1.22
Second half   5.90  1.11      5.13  1.29      4.06  1.18      3.08  1.21
Coda          5.54  1.16      5.21  1.14      3.69  1.28      3.47  1.28
Intact        6.48  0.97      5.46  1.02      3.68  1.15      2.47  1.06
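As a sketch of the within-subjects portion of the analysis reported above, Pingouin’s rm_anova accepts two repeated factors; the full three-way mixed model, adding major as a between-subjects factor, would require more general software (e.g., statsmodels). All names and data below are illustrative:

    # Sketch: Performer (4) x Section (5) repeated-measures ANOVA over the 111
    # Phase 2 evaluators. Assumes pingouin; the between-subjects factor (piano
    # vs. nonpiano major) is omitted here.
    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(3)
    sections = ["opening", "first_half", "second_half", "coda", "intact"]
    long = pd.DataFrame([
        {"subject": s, "performer": p, "section": sec,
         "rating": rng.integers(1, 8)}  # placeholder 7-point ratings
        for s in range(111) for p in range(1, 5) for sec in sections
    ])

    aov = pg.rm_anova(data=long, dv="rating",
                      within=["performer", "section"],
                      subject="subject", detailed=True)
    print(aov)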

Phase 2: Correlational Data


Prediction of ratings from sections to the intact performance. To determine whether
intact performances were predicted more accurately from the opening phrase, from the
coda, or from the first or second halves of the étude, we computed separate Pearson r
correlations for each of the four performances. Results are shown in Table 3.
It is difficult to generalize from these data other than to note that (a) correlations among all four sections and the intact performance are higher for the highest-rated performance than for the other performances and (b) ratings of the intact performances appear to be predicted somewhat better from the larger sections (first and second half of the étude) than from the smaller sections (opening phrase and coda): The average Pearson product-moment correlation across the performances was .38 for first half–intact and second half–intact correlation coefficients averaged together, but was only .25 for opening–intact and coda–intact coefficients averaged together.
Table 3. Pearson r Correlations by Section for Each Performance

                Opening   First Half   Second Half   Coda    Intact
Performance 1
  Opening       —
  First half    .42**     —
  Second half   .25**     .30**        —
  Coda          .46**     .30**        .37**         —
  Intact        .40**     .40**        .51**         .42**   —
Performance 2
  Opening       —
  First half    .01       —
  Second half   .13       .17          —
  Coda          .00       .18          .17           —
  Intact        .24*      .46**        .39**         .23*    —
Performance 3
  Opening       —
  First half    .19*      —
  Second half   .18       .04          —
  Coda          .17       .32*         .21*          —
  Intact        .13       .40**        .10           .23*    —
Performance 4
  Opening       —
  First half    .34**     —
  Second half   .18       .31**        —
  Coda          .16       .06          .28**         —
  Intact        .34**     .48**        .32**         .04     —

*p < .05. **p < .01.

Finally, we expected to find higher-than-average correlations between opening and first half and between second half and coda because, in both cases, one of the sections was subsumed within the other. This did not turn out to be the case.
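A rough sketch of the section-to-intact computation, assuming one wide table of placeholder ratings per performance (the column names are ours, not the authors’):

    # Sketch: correlate each section's ratings with the intact rating, per
    # performance, then average the "long" vs. "short" section coefficients.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    sections = ["opening", "first_half", "second_half", "coda"]
    rows = []
    for perf in range(1, 5):
        # wide placeholder frame: 111 evaluators x (4 sections + intact)
        df = pd.DataFrame(rng.integers(1, 8, size=(111, 5)),
                          columns=sections + ["intact"])
        rs = df[sections].corrwith(df["intact"])  # Pearson r per section
        rows.append(rs.rename(f"perf{perf}"))

    table = pd.concat(rows, axis=1)  # analog of Table 3's intact row
    print("long-section mean r:",
          table.loc[["first_half", "second_half"]].mean().mean().round(2))
    print("short-section mean r:",
          table.loc[["opening", "coda"]].mean().mean().round(2))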
Reliability. We assessed reliability by calculating Pearson product-moment correlations for the first 10 excerpts (Trials 1 through 10) and their repeated presentations (Trials 21 through 30). For piano majors, mean r(25) = .66 (p < .01), and for nonpiano majors, mean r(85) = .60 (p < .01). Fifty-three percent of nonpiano majors’ correlations were significant, whereas 64% of piano majors’ correlations were significant (p < .05).
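Test-retest reliability of this kind could be computed per evaluator roughly as follows (SciPy sketch; the rating arrays are placeholders standing in for each listener’s two passes over the 10 repeated excerpts):

    # Sketch: test-retest reliability per evaluator, correlating ratings from
    # Trials 1-10 with ratings from Trials 21-30. Placeholder data only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    first_pass = rng.integers(1, 8, size=(111, 10))                 # Trials 1-10
    second_pass = first_pass + rng.integers(-1, 2, size=(111, 10))  # Trials 21-30
    second_pass = second_pass.clip(1, 7)

    rs = np.array([stats.pearsonr(a, b)[0]
                   for a, b in zip(first_pass, second_pass)])
    ps = np.array([stats.pearsonr(a, b)[1]
                   for a, b in zip(first_pass, second_pass)])
    print(f"mean r = {rs.mean():.2f}; "
          f"{(ps < .05).mean():.0%} of evaluators individually significant")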
Reliability by section. We calculated Pearson product-moment correlations individually for each of the 10 excerpts that were presented twice (Trials 1 and 21, 2 and 22, etc.). The two correlations for each of the five section types (opening phrase, first half, second half, coda, and intact) then were averaged separately for piano majors and nonpiano majors. Results revealed that mean correlations varied from .27 to .38, except for correlations of .20 and .22 for nonmajors when rating opening phrase and coda, respectively. It thus seems that nonpiano majors were affected somewhat by section length, as lower reliabilities were associated only with shorter sections. For piano majors, however, section type did not appear to affect reliability.

Discussion
In this study, we attempted to determine how evaluations of an intact piece of music would relate to evaluations of its component sections and to see whether such associations would vary as a consequence of performance quality. Participants listened to four performances of Chopin’s Butterfly Étude, differing in overall quality, each presented in its entirety (intact) and in sections. Results indicated that for the two higher rated performances, the whole was greater than its parts: Intact ratings were higher than sectional ratings. For the two lower rated performances, however, the opposite was true: Intact performances were rated lower than their sections were. Thus for each of the four performances, intact ratings diverged from what would have been obtained by averaging the ratings of their component sections.
Our results are consistent with previous research showing that expert performances are rated higher for longer excerpts than for shorter ones (Bergee, 1993, 1997; Fiske, 1975, 1979; Hewitt, 2007; Hewitt & Smith, 2004; S. Thompson et al., 2007; Wapnick et al., 2005). They are, in fact, strikingly similar to those of Geringer and Johnson (2007), who found that although university and professional wind band ensembles were rated higher for longer than for shorter excerpts, lower quality high school ensemble performances were rated lower for longer than for shorter excerpts. At present, it is not known whether this interaction was due to absolute standards regarding performance quality or whether it resulted from implicit comparisons made across performances. It is possible, for example, that evaluators became more confident in their evaluations the longer they listened to a performance. This increase in confidence may have propelled their ratings either above or below the mean rating of the performance’s component sections, depending on whether the evaluation was positive or negative. Or it is possible that what is judged good depends on what else is being judged.
Although we found no particularly strong correlations between opening phrase and intact performance, or for coda and intact performance, we did find that correlations with corresponding intact performances were somewhat stronger for ratings based on the first half or the second half of the étude (mean r = .38) than for the shorter opening and closing sections (mean r = .25). This greater predictability can be seen in Table 2: Rating differences across the four performances are marked for first-half and second-half sections and for intact performances. For the opening phrase, however, Performance 3 actually was rated slightly higher than Performance 2; and for the coda, differences in ratings between the top two performances, and between the bottom two performances, are much narrower than differences in ratings for corresponding intact performances.
Our examination of the relationship between opening and closing sections of a work and its intact rating was aimed at determining whether intact ratings would have been susceptible to either a primacy effect or a recency effect. However, the importance of an opening or closing section heard in isolation may differ from its importance when embedded within an intact performance. One way for researchers to examine true primacy and recency effects would be to splice together a number of different openings followed by one unvarying subsequent performance (primacy effect) and to present one performance followed by several different endings (recency effect).
Given that this study involved listening to only one composition, it should be viewed as an impetus for future research rather than a basis for generalization. Chopin’s Butterfly Étude is a brief, monothematic, fast, technically demanding piece of music. These particular musical characteristics might have affected ratings. It is conceivable that we might have obtained entirely different results had we used more extended and structurally complex pieces of music. Nevertheless, results of our study suggest that evaluations of sections may differ systematically from evaluations of the intact work: They may be either lower or higher than ratings of intact works, depending on overall quality of performance.
In our study, evaluators rated excerpts globally. We might have obtained valuable additional data had we used a ratings rubric (Ciorba & Smith, 2009; Geringer et al., 2009; Latimer, Bergee, & Cohen, 2010) or if we had asked for ratings of specific musical aspects. However, we felt that doing so would increase the time necessary for evaluators to rate 30 performances beyond what would have been acceptable.
It should be borne in mind that results from this study do not mean that any of the pianists whose performances were heard by evaluators is in any sense “superior.” Although the highest rated performance was rated higher than the second highest rated performance, Performance 2 came from a live concert, whereas Performance 1 was extracted from a commercial compact disc. Also, these two performances may have benefited from superior sound recording in comparison to the other two performances. Although none of the recordings we chose suffered from severe sonic problems, and we instructed evaluators to ignore recording quality when assigning ratings to excerpts, recording quality nevertheless might have affected ratings.
In Phase 1 of this study, we found that although our 9 evaluators differed in their
grading standards, they agreed substantially on which performances were better than
others (Table 1). The correlations obtained likely depended on the specifics of the
experimental situation, however. Twelve of the 22 performances were by internationally recognized concert artists. Had we narrowed the spread of performance quality by
using only performances of such pianists, it is highly likely that we would have
obtained lower correlations. Also, it is possible that agreement between raters would
have been reduced had we asked less musically experienced evaluators to rate the
performances. Previous research has yielded mixed results on this point, with some
studies showing that musical experience and familiarity improve either or both intra-
and interrater reliability (Fiske, 1977a; Hewitt, 2002, 2005; Kinney, 2009; Morrison,
Montemayer, & Wiltshire, 2004; Sparks, 1990; W. Thompson et al., 1998; Towers,
1980) and other studies not confirming such results (Fiske, 1975, 1977; Heath, 1976;
Roberts, 1975; Wapnick, Flowers, Alegant, & Jasinskas, 1993). It also seems possible
that the effects of variables such as excerpt familiarity (Bergee, 2007; Kinney, 2009;
Wapnick et al., 2004), and aspects of the musical stimuli, particularly tempo (Geringer
& Johnson, 2007; LeBlanc, Colman, McCrary, Sherrill, & Malin, 1988; Wapnick
et al., 2005), might interact with listening time.
Finally, the issue of part versus whole evaluation might be considered in relation to
musical interpretation. It is possible that an artist’s overall conception of a piece of
music may be of greater importance than how accurately and expressively the notes
are played—within any particular section or perhaps throughout the entire work. It
may prove difficult for one to define musical conception operationally, but perhaps its
effects can be shown in those cases where measurements of the parts differ from
measurement of the whole.

Declaration of Conflicting Interests


The authors declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.

Funding
The authors received no financial support for the research, authorship, and/or publication of this
article.

References

Bergee, M. J. (1993). A comparison of faculty, peer, and self-evaluation of applied brass jury performances. Journal of Research in Music Education, 41, 19–27. doi:10.2307/3345476
Bergee, M. J. (1995). Primary and higher-order factors in a scale assessing concert band performance. Bulletin of the Council for Research in Music Education, 126, 1–14.
Bergee, M. J. (1997). Relationships among faculty, peer, and self-evaluations of applied performances. Journal of Research in Music Education, 45, 601–612. doi:10.2307/3345425
Bergee, M. J. (2003). Faculty interjudge reliability of music performance evaluation. Journal of Research in Music Education, 51, 137–150. doi:10.2307/3345847
Bergee, M. J. (2007). Performer, rater, occasion, and sequence as sources of variability in music performance assessment. Journal of Research in Music Education, 55, 344–358. doi:10.1177/0022429408317515
Brittin, R. L. (2002). Instrumentalists’ assessment of solo performance with compact disc, piano, or no accompaniment. Journal of Research in Music Education, 50, 63–74. doi:10.2307/3345693
Byo, J. L., & Crone, L. J. (1989). Adjudication by non-musicians: A comparison of professional and amateur performances. Missouri Journal of Research in Music Education, 26, 60–73.
Ciorba, C. R., & Smith, N. Y. (2009). Measurement of instrumental and vocal undergraduate performance juries using a multidimensional assessment rubric. Journal of Research in Music Education, 57, 5–15. doi:10.1177/0022429409333405
Ekholm, E. (1994). The effect of guided listening on evaluation of solo vocal performance (Master’s thesis). McGill University, Montreal, Canada.
Fiske, H. E. (1975). Judge-group differences in the rating of secondary school trumpet performances. Journal of Research in Music Education, 23, 186–196. doi:10.2307/3344643
Fiske, H. E. (1977a). Relationship of selected factors in trumpet performance adjudication reliability. Journal of Research in Music Education, 25, 256–263. doi:10.2307/3345266
Fiske, H. E. (1977b). Who’s to judge? New insights into performance evaluation. Music Educators Journal, 64(4), 23–25. doi:10.2307/3395371
Fiske, H. E. (1979). Musical performance evaluation ability: Toward a model of specificity. Bulletin of the Council for Research in Music Education, 59, 27–31.
Geringer, J. M., Allen, M. L., MacLeod, R. B., & Scott, L. (2009). Using a prescreening rubric for all-state violin selection: Influences of performance and teaching experience. Update: Applications of Research in Music Education, 28, 41–46. doi:10.1177/8755123309344109
Geringer, J. M., & Johnson, C. M. (2007). Effects of excerpt duration, tempo, and performance level on musicians’ ratings of wind band performances. Journal of Research in Music Education, 55, 289–301. doi:10.1177/0022429408317366
Geringer, J. M., & Madsen, C. K. (1998). Musicians’ ratings of good versus bad vocal and string performances. Journal of Research in Music Education, 46, 522–534. doi:10.2307/3345348
Gillespie, R. (1997). Ratings of violin and viola vibrato performance in audio-only and audiovisual presentations. Journal of Research in Music Education, 45, 212–220. doi:10.2307/3345581
Heath, C. E. (1976). The effect of instruction on the consistency of ratings given in the adjudication of trumpet solo excerpts (Doctoral dissertation). Indiana University, Bloomington.
Hewitt, M. P. (2002). Self-evaluation tendencies of junior high instrumentalists. Journal of Research in Music Education, 50, 215–226. doi:10.2307/3345799
Hewitt, M. P. (2005). Self-evaluation accuracy among high school and middle school instrumentalists. Journal of Research in Music Education, 53, 148–161. doi:10.1177/002242940505300205
Hewitt, M. P. (2007). Influence of primary performance instrument and education level on music performance evaluation. Journal of Research in Music Education, 55, 18–30. doi:10.1177/002242940705500103
Hewitt, M. P., & Smith, B. P. (2004). The influence of teaching-career level and primary performance instrument on the assessment of music performance. Journal of Research in Music Education, 52, 314–327. doi:10.1177/002242940405200404
Kinney, D. (2009). Internal consistency of performance evaluations as a function of music expertise and excerpt familiarity. Journal of Research in Music Education, 56, 322–337. doi:10.1177/0022429408328934
Latimer, M. E., Bergee, M. J., & Cohen, M. L. (2010). Reliability and perceived utility of a weighted music performance assessment rubric. Journal of Research in Music Education, 58, 168–183. doi:10.1177/0022429410369836
LeBlanc, A., Colman, J., McCrary, J., Sherrill, C., & Malin, S. (1988). Tempo preferences of different age music listeners. Journal of Research in Music Education, 36, 156–168. doi:10.2307/3344637
LeBlanc, A., & Sherrill, C. (1986). Effect of vocal vibrato and performer’s sex on children’s music preference. Journal of Research in Music Education, 34, 222–237. doi:10.2307/3345258
McCrary, J. (1993). Effects of listeners’ and performers’ race on music preferences. Journal of Research in Music Education, 41, 200–211. doi:10.2307/3345325
Morrison, S. J., Montemayer, M., & Wiltshire, E. S. (2004). The effect of a recorded model on band students’ performance self-evaluations, achievement, and attitude. Journal of Research in Music Education, 52, 116–129. doi:10.2307/3345434
Roberts, B. A. (1975). Judge-group differences in the rating of piano performances (MusM thesis). University of Western Ontario, London, Canada.
Sparks, G. E. (1990). The effect of self-evaluation on musical achievement, attentiveness, and attitudes of elementary school instrumental students (Doctoral dissertation). Louisiana State University, Baton Rouge.
Thompson, S., Williamon, A., & Valentine, E. (2007). Time-dependent characteristics of performance evaluation. Music Perception, 25, 13–29.
Thompson, W. F., Diamond, C. P., & Balkwill, L. L. (1998). The adjudication of six performances of a Chopin étude: A study of expert knowledge. Psychology of Music, 26, 154–174. doi:10.1177/0305735698262004
Towers, R. (1980). Age-group differences in judge reliability of solo voice performances (Unpublished master’s thesis). University of Western Ontario, London, Canada.
Vasil, T. (1973). The effects of systematically varying selected factors on music performance adjudication (Unpublished doctoral dissertation). University of Connecticut, Storrs.
Wagner, M. J. (1991). The effect of adjudicating three videotaped popular music performances on a “composite critique” rating and an “overall” rating. Missouri Journal of Research in Music Education, 28, 53–70.
Wapnick, J., Darrow, A.-A., Kovacs, J., & Dalrymple, L. (1997). Effects of physical attractiveness on evaluation of vocal performance. Journal of Research in Music Education, 45, 470–479. doi:10.2307/3345540
Wapnick, J., & Ekholm, E. (1997). Evaluation of vocal production in solo voice performance. Journal of Voice, 11, 429–436. doi:10.1016/S0892-1997(97)80039-2
Wapnick, J., Flowers, P., Alegant, M., & Jasinskas, L. (1993). Consistency in piano performance evaluation. Journal of Research in Music Education, 41, 282–292. doi:10.2307/3345504
Wapnick, J., Kovacs-Mazza, J., & Darrow, A.-A. (2000). Effects of performer attractiveness, stage behavior, and dress on evaluation of children’s piano performances. Journal of Research in Music Education, 48, 323–336. doi:10.2307/3345367
Wapnick, J., Ryan, C., Campbell, L., Deek, P., Lemire, R., & Darrow, A.-A. (2005). Effects of excerpt tempo and duration on musicians’ ratings of high-level piano performances. Journal of Research in Music Education, 53, 162–176. doi:10.1177/002242940505300206
Wapnick, J., Ryan, C., Lacaille, N., & Darrow, A.-A. (2004). Effects of selected variables on musicians’ ratings of high-level piano performances. International Journal of Music Education, 22, 7–20. doi:10.1177/0255761404042371
Zdzinski, S. F., & Barnes, G. V. (2002). Development and validation of a string performance rating scale. Journal of Research in Music Education, 50, 245–255. doi:10.2307/3345801

Bios
Joel Wapnick is professor of music education at McGill University. His research interests deal
with musical and nonmusical variables affecting music performance evaluation.

Alice-Ann Darrow is Irwin Cooper Professor of Music Therapy and Music Education at
Florida State University. Her research interests include teaching music to special populations
and the role of music in deaf culture.

Submitted August 25, 2011; accepted May 10, 2012.
