Summary: Experts were interviewed to identify criteria for evaluation of vocal performance. A scale was then constructed and inter- and intrajudge reliability assessed. Experts listened to 19 different performances, plus 6 presented a second time. Interjudge reliability for one judge was modest, but increased dramatically as the size of the judge panel increased. The most reliable items were overall score and intonation accuracy. Diction was less reliable than other items. Intrajudge reliability was higher for overall score than for any other item. A factor analysis on the test items yielded factors labelled intrinsic quality, execution, and diction. Another factor analysis, using the experts as variables, revealed two underlying evaluative dimensions. It was found that 13 experts were primarily influenced by execution, and that 8 were mainly affected by intrinsic quality. Interjudge and intrajudge reliabilities of these two groups differed.

Key Words: Voice evaluation--Reliability--Rating scales--Voice quality--Singing.
Accepted August 27, 1996.
Address correspondence and reprint requests to Joel Wapnick, McGill University, Faculty of Music, 555 Sherbrooke St. W., Montreal, Quebec H3A 1E3, Canada.
Portions of this study were presented at the Music Educators National Conference National Convention, Cincinnati, Ohio, U.S.A., April 1994.

Although it is true that there has been considerable research on evaluation in music performance, almost none of it has focused on vocal performance. There are two possible reasons for this: (a) researchers appear to have been more interested in instrumental evaluation than in vocal evaluation, and (b) vocal research may have been conducted but not reported because sufficient reliability and validity was difficult to obtain. Indeed, vocal performance evaluation appears to be considerably more complicated than instrumental performance evaluation. Such elements as diction and transmission of the emotional meaning of lyrics have no instrumental counterpart. In addition, the importance that singers, teachers, and audiences place on timbral qualities associated with vocal production (color, resonance, etc.) may exceed that for instrumental evaluation. It is even unclear whether vocal teachers agree on the musical manifestations of certain evaluative adjectives. The potential difficulty in evaluating vocal performance thus might explain the common lore that voice teachers have difficulty in agreeing with each other in evaluative situations (1).

Boyle and Radocy (2) observed that "the measurement of musical performance is inherently subjective. Music consists of sequential aural sensations; any judgment of a musical performance is based on those sensations as they are processed by the judge's brain" (p. 171). The lack of objectivity does not mean that the achievement of a general consensus is not important. If there is no consensus, evaluation loses much of its meaning.

Jorgenson (1) examined the vocal technique literature, from eighteenth-century writings of bel canto masters to scientific studies of contemporary researchers. He concluded that there were at least two main areas of agreement concerning what constituted good vocal production: "The first is that there should be an ease in the musculature that is involved. There should be no stretching or pushing. The vocal 'noise' should not be forced. The second is that there is an acoustical 'adjustment' in the vocal tract that results in the 2,800 cycle ring" (p. 35). Such agreement, of course, does not imply that vocal teachers agree on what methods should be used to achieve good vocal production.

In reviewing the literature on voice pedagogy, Van den Berg and Vennard (3) encountered the following
J. WAPNICK AND E. EKHOLM
descriptors of desirable vocal production: freedom, lack of interference, ring, "singer's formant," intonation accuracy, resonance, timbre, color, brilliance, power, intensity, focus, body, depth, sensation of height or high placement, velvet, floating quality, mellowness, clarity and purity of vowel production, appropriateness of vibrato, efficient use of breath support, and flexibility. The descriptors they found for faulty vocal production are at least equally evocative: smothered, straining, throttled, inefficient, tense, leaky, breathy, yelling, forced, reaching, shallow, white, throaty, clutched, too covered, pinched, twangy, nasal, honky, hooty, spread, dull and diffused, and lack of intensity or focus. Van den Berg and Vennard (3) felt that there was a need for singers and singing teachers to associate their vocabularies with objectively defined standards. They were particularly concerned that such terms as the ones listed above might not be used consistently by any one person at different times and that the same term used by many people might not have the same meaning. These concerns prompted them to suggest that a standard set of recorded vocal examples be established. Such examples might then be matched to appropriate descriptive terms in the same way that scales of color and intensity are matched to color descriptors. Van den Berg and Vennard also recommended using a sonograph to analyze the harmonic structure of standardized vocal samples. Samples could then be compared with sonograms of any voice to determine which descriptive term would correctly describe the sound. Vennard worked toward a more objective terminology until his death in 1971.

Campbell (4) attempted a computer simulation of solo vocal performance adjudication. He was at least partially successful in that "the simulation effort produces scores which correlate with the criterion (the average of the judges' scores) in the same range (0.47 to 0.63) as do human judges. This is considerably higher than the interjudge correlations (0.26 to 0.42)" (p. 32). However, adjudication by computer has not become popular. There are at least two reasons for this: (a) resistance from teachers and students, and (b) difficulty in preparing large numbers of pieces for computer evaluation.

A number of researchers have used the facet-factorial approach to construct scales for evaluating instrumental performance (5-8) or for evaluating choral performance (9). Jones' (10) use of a facet-factorial approach for the construction of a scale to evaluate vocal solo performance by high school students appears to be the only such attempt of its kind. His scale consisted of 32 statements. Judges rated these statements according to their agreement or disagreement with them. Because his scale was designed specifically for high school soloists, it probably is inappropriate for the evaluation of mature singers. Moreover, many of the items in his scale focused on technical aspects of vocal production (e.g., "soft palate too low during singing"). It is not clear whether such aspects are widely accepted by voice teachers as being necessary concomitants of good singing. Instead, it is possible that experts may disagree on technical means but agree on musical ends.

PURPOSES OF THE STUDY

One purpose of this study was to interview voice experts to identify criteria on which they typically based their evaluations of solo vocal performance. These criteria were concerned solely with the sound rather than with technical or interpretive considerations. Such criteria might be useful for evaluating singers at any stage of development, regardless of the particular techniques employed, and regardless of whether the audition is visually observed or audiotaped.

A second purpose was to construct and test a rating scale, based on the interviews. The scale assessed a larger group of experts' inter- and intrajudge reliability when evaluating 19 different performances of a particular musical excerpt.

METHOD

Preparation of the vocal evaluation form

Seven experts, all experienced voice teachers at the university level, were interviewed to determine the criteria they used for evaluating vocal production in solo voice performance. Interviews lasted approximately 1 hour each. The interview format was semistructured. Experts mentioned important criteria as they came to mind, and explained their use of terms when necessary. Subjects subsequently were shown a list of criteria collected from surveys of literature on vocal production (3,11-14) and were asked to comment on the importance of each criterion shown. From these interviews, a list of test items was drawn up. Only items that all seven experts indicated were important were included. Twelve items were thus obtained: appropriate vibrato, color/warmth, diction, dynamic range, efficient breath management, evenness of registration, flexibility, freedom throughout vocal range, intensity, intonation accuracy, legato line, and resonance/ring.

A seven-point scale was used for judging performance on each of the 12 items (1 = poor; 7 = excellent). Judges also were given space to write comments concerning strengths and weaknesses of the performance. At the bottom of the page, there was a question regarding
other items (p < 0.001). Moreover, they were significantly more reliable in judging "freedom throughout vocal range" than they were in judging any of the ten items rated below it in Table 1 (p < 0.01). All other differences among items were not significant.

Interjudge reliability as a function of panel size

As reported above, the average judge-group correlation was 0.49. When these data were recalculated to show the effects of using more than one judge to predict scores of other judges, reliability increased sharply as the number of judges used to predict from increased from one to four (from 0.49 to 0.80; Table 2). Beyond four judges, reliability improved more slowly (from 0.80 to 0.90, as the panel size increased from 4 to 10).

TABLE 2. Average Pearson r coefficients for interjudge reliability as a function of judge panel size

  No. of judges per panel    Average Pearson r coefficient
  1                          .49
  2                          .67
  3                          .75
  4                          .80
  5                          .82
  6                          .85
  7                          .86
  8                          .89
  9                          .89
  10                         .90
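The paper does not report the computation behind the panel-size coefficients, but the pattern in Table 2 is close to what the classical Spearman-Brown prophecy formula predicts when applied to the single-judge average of 0.49. The sketch below is illustrative only; the function name is ours, not the authors':

```python
def spearman_brown(r_single, k):
    """Predicted reliability of a pooled panel of k judges, given the
    average single-judge correlation r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

# Starting from the reported single-judge average of 0.49, the formula
# gives roughly 0.79 for a four-judge panel and roughly 0.91 for ten
# judges, close to the 0.80 and 0.90 reported in Table 2.
panel_predictions = {k: spearman_brown(0.49, k) for k in range(1, 11)}
```

This agreement suggests the usual interpretation: pooling judges averages out idiosyncratic error, with diminishing returns beyond about four judges.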
TABLE 1. Interjudge reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Overall score                    .64
  Intonation accuracy              .64
  Freedom throughout vocal range   .55
  Appropriate vibrato              .52
  Evenness of registration         .52
  Resonance/ring                   .51
  Flexibility                      .50
  Intensity                        .48
  Dynamic range                    .48
  Legato line                      .47
  Efficient breath management      .46
  Color/warmth                     .43
  Diction                          .34

TABLE 3. Interjudge split-half reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Dynamic range                    .92
  Efficient breath management      .92
  Legato line                      .90
  Diction                          .89
  Intensity                        .88
  Overall score                    .87
  Resonance/ring                   .86
  Evenness of registration         .85
  Appropriate vibrato              .85
  Freedom throughout vocal range   .82
  Intonation accuracy              .78
  Flexibility                      .78
  Color/warmth                     .67
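Split-half coefficients of the kind shown in Table 3 are commonly computed by dividing the judge panel into two halves, averaging each half's ratings for every performance, and correlating the two mean profiles. The exact split procedure used in the study is not described in this excerpt, so the following is a generic sketch under that assumption; the function name is ours:

```python
import numpy as np

def split_half_reliability(ratings, seed=None):
    """Generic split-half reliability for a judge panel.

    ratings: (n_judges, n_performances) array of scores on one item.
    The panel is split into two random halves, each half's ratings are
    averaged per performance, and the two mean profiles are correlated.
    """
    rng = np.random.default_rng(seed)
    n_judges = ratings.shape[0]
    order = rng.permutation(n_judges)
    half_a = ratings[order[: n_judges // 2]].mean(axis=0)
    half_b = ratings[order[n_judges // 2:]].mean(axis=0)
    return np.corrcoef(half_a, half_b)[0, 1]
```

Because each half averages roughly ten judges, half-panel means are far less noisy than single-judge ratings, which is consistent with Table 3's coefficients being much higher than the single-judge values in Table 1.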
TABLE 4. Intrajudge reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Overall score                    .87
  Appropriate vibrato              .76
  Resonance/ring                   .75
  Intonation accuracy              .74
  Legato line                      .74
  Freedom throughout vocal range   .73
  Efficient breath management      .73
  Evenness of registration         .72
  Intensity                        .70
  Dynamic range                    .69
  Diction                          .69
  Flexibility                      .65
  Color/warmth                     .65

color/warmth (t = 2.16; df = 121; p < 0.05). All other comparisons were not significant.

Correlations between items

The Pearson product-moment correlation matrix shown in Table 5 revealed that all 13 items were significantly correlated with each other. "Freedom throughout vocal range" and "evenness of registration," and "freedom throughout vocal range" and "flexibility" were the two most highly correlated pairs of items (r = 0.79 and 0.77, respectively), not including correlations of test items with "overall score." The three different items in these two pairs were among the five most highly correlated items with overall score. Correlations with overall score ranged from 0.57 (diction) to 0.82 ("freedom throughout vocal range"). T-tests for differences between correlations showed that overall score was correlated significantly more highly with freedom throughout vocal range than it was with all other items except evenness of registration and resonance/ring.

Performance ratings

A one-way repeated measures analysis of variance showed that there were significant differences in experts' mean ratings for the different performances (F = 11.32; p < 0.001). This, of course, was expected, in that the performances on the tape varied widely in accomplishment. The two performances by professional recording artists were ranked first and second.

Factor analyses

A factor analysis was performed on the 12 test items (overall score was excluded). Three factors resulted: intrinsic quality (consisting of color/warmth, appropriate vibrato, resonance/ring, dynamic range, and intensity), execution (comprising flexibility, evenness of registration, freedom throughout vocal range, efficient breath management, intonation accuracy, and legato line), and diction (including diction only). Factor loadings are shown in Table 6. The three factors accounted for 76.2% of the total variance (27.8% by intrinsic quality, 37.3% by execution, and 11.1% by diction).

Another factor analysis was performed, this time using the 21 experts as variables, with their raw scores as the dependent measure. There were two main, underlying factors. Thirteen experts loaded primarily on one factor, and 8 experts loaded primarily on the other. The two factors accounted for 59.4% of the total variance.
" All items were significantly associated with all others (p < .(]5).
TABLE 6. Factor analysis loadings on test items

  Factor/test item                 Loading
  "Intrinsic quality"
    Color/warmth                   .87
    Appropriate vibrato            .69
    Resonance/ring                 .67
    Dynamic range                  .62
    Intensity                      .61
  "Execution"
    Flexibility                    .81
    Evenness of registration       .78
    Freedom throughout vocal range .77
    Efficient breath management    .75
    Intonation accuracy            .67
    Legato line                    .67
  "Diction"
    Diction                        .91

As can be seen from Table 7, the six test items that loaded most heavily on factor 1 for group 1 were "evenness of registration," "freedom throughout vocal range," "efficient breath management," intonation accuracy, flexibility, and legato line. These are the same items that loaded on execution in the initial factor analysis (Table 6). However, the six test items that loaded most heavily on factor 1 for group 2 were resonance/ring, intensity, dynamic range, color/warmth, appropriate vibrato, and intonation accuracy. These items, with the exception of intonation accuracy, loaded on intrinsic quality in the initial factor analysis. It thus appears that the two groups differed from each other in that intrinsic quality primarily influenced evaluation for one group but execution was more important in rating performances for the other.

To determine if the two groups differed in interjudge reliability, a t-test was performed on experts' mean Pearson r coefficients. The analysis showed that the reliability for group 1 was significantly higher than the reliability for group 2 (t = 6.63; df = 418; p < 0.001). Subjects in group 1 had a mean Pearson r coefficient of 0.52, whereas group 2 subjects had a mean Pearson r coefficient of 0.45. Thus experts who were primarily influenced by execution were significantly more reliable than experts who were primarily influenced by intrinsic quality. No significant correlations were found between the two groups and age, teaching experience, or adjudication experience.

The significant difference between groups in interjudge reliability might have been caused by differences in group size, as there were more experts in group 1 than in group 2. To examine this possibility, correlation matrices (each expert by every other expert) were calculated within each group. As expected, the matrices resulted in higher coefficients than those found when all experts were pooled together. Nevertheless, a t-test based upon these coefficients revealed a significant difference between the groups: group 1 experts had a significantly higher mean Pearson r coefficient (r = 0.57) than did group 2 experts (r = 0.46; t = 8.47; df = 210; p < 0.001).

An analysis of variance was carried out to determine if the two groups differed in use of the rating scale. A significant difference was found only for "legato line" (F = 13.94; df = 1,397; p = 0.001): experts in group 1 rated it significantly lower than experts in group 2. An analysis of variance also was conducted to determine if the 19 performances were rated differently by experts in
the two groups. There were significant differences for six of the performances. These were the performances that were ranked third, fifth, twelfth, thirteenth, fourteenth, and seventeenth, overall.

In addition to these differences, the groups differed in intrajudge reliability. Group 1 was more reliable in this respect than was group 2 (group 1: r = 0.73; group 2: r = 0.64). A test for the difference between independent correlations showed the difference to be significant (z = 3.39, p = 0.001).

Comments

Experts' comments dealt mostly with evaluations of test items or with techniques for improving weaknesses in elements of vocal production. However, a few of the experts expressed reservations about evaluating singers from tape recordings. They commented that small voices sound relatively bigger on tape than they would in a live situation, and that big voices conversely sound relatively smaller on tape than in a live situation. One expert returned the evaluation package incomplete, explaining that it was impossible to separate vocal production from interpretation and other musical concerns, and that "musical and emotional aspects [are] part of the technique, even in vocalises and scales." This expert also felt that "phrasing" and "vowel shaping" should have been included in the list of test items. Another expert commented on the importance of "engaging both emotion and body for beautiful sound."

Two experts criticized the short duration of the excerpts. They claimed that longer excerpts were needed to allow enough time "to form a clear picture of the [singer's] strengths and weaknesses." One of them thus found it impossible to comment on the vocal production or to rate "overall score." Another judge claimed that he was influenced by the order of performances on the tape. Finally, one judge commented on the difficulty of rating all of the performances by the same standard. He explained that "in judging voice exams, one tries to take the level of the singer into consideration." This judge also mentioned that it was unusual to have to separate "technical matters" from "musicality and presentation."

DISCUSSION

It is common lore that vocal experts often disagree with each other in their evaluations of vocal performance. To a certain extent, results from our study bear this out: intrajudge reliability was much higher than interjudge reliability, and the interjudge Pearson r coefficient for individual judges averaged only 0.49. A higher figure might have been expected, especially given the wide range in quality of the performances rated. Nevertheless, results indicated that evaluations pooled from four or more judges demonstrated considerable interjudge reliability (r > 0.80). This would appear to have implications for how vocal auditions and examinations should be assessed.

Evaluation of vocal performance may be a more formidable task than evaluation of instrumental performance. A possible explanation is that judgments of intrinsic timbral quality may be both more important and more difficult than they are for instrumental evaluation. Adding to the difficulty is the existence of over 20 different vocal classifications, from soubrette (light soprano) to serioser Bass ("serious bass"). The subtle timbral distinctions that separate a great voice from a good one may have no counterpart in instrumental evaluation.

Results from the factor analyses indicated that (a) all but one of the test items could be categorized as relating to either execution or to intrinsic quality of the voice, and (b) judges who relied primarily on the "execution" items for making evaluations had higher inter- and intrajudge reliability than judges who relied primarily on the "intrinsic quality" items. This does not necessarily imply that experts should focus on execution to raise their reliability. Judges who do this may be avoiding the more crucial, but difficult, issue of evaluating vocal quality. The results do imply, however, that even experts have difficulty in evaluating vocal quality.

Each of the 12 individual items was significantly correlated with overall score. This is in accordance with previous research (15-17) and would seem to indicate that the individual items were all important in determining an overall impression. On the other hand, the significant correlations may mean merely that once the judges made an overall impression, they then rated individual items accordingly. It should be noted, however, that all 13 items were significantly correlated with each other. It thus would appear difficult to assume causality in any direction.

There was considerable variability in both inter- and intrajudge reliability among the judges. In both analyses, however, such variability was not related to age, teaching experience, or adjudication experience. The ability to evaluate consistently thus appears to be a skill that is not necessarily acquired in the course of a normal teaching or singing career. It is unclear at present what it is related to, or even if it is a skill that is easily teachable.

Although a few judges criticized the use of audiotape, we felt that audiotape had advantages over videotape. We were concerned solely with auditory aspects of vocal performance. As many clues to a singer's technique are
visual, audiotaped performances may have helped experts focus on the end product--the sound being produced--rather than on the technique used to produce it. Furthermore, audiotaped performances may have reduced the likelihood of bias resulting from identification of the singers. Finally, the use of audiotape has a certain amount of external validity, as audiotaped vocal auditions are often used today in lieu of live auditions.

This study focused on the auditory aspects of vocal production rather than on vocal technique per se. However, the establishment of standards for judging vocal performance would seem to be important for the subsequent determination of effective and efficient techniques. It is perhaps because of the lack of such standards that there has been very little empirical research in which different vocal techniques have been compared directly with each other. Given the ubiquity and importance of vocal performance, further research in this area would appear to be warranted.

REFERENCES

1. Jorgenson D. A history of conflict. NATS Bull 1980;36:31-5.
2. Boyle J, Radocy R. Measurement and Evaluation of Musical Experiences. New York: Schirmer Books, 1987.
3. Van den Berg J, Vennard W. Toward an objective vocabulary. NATS Bull 1959;15:10-5.
4. Campbell WC. A computer simulation of musical performance adjudication. In: Experimental Research in the Psychology of Music. Vol. 7. Iowa City: University of Iowa Press, 1970:1-40.
5. Abeles H. Development and validation of a clarinet performance adjudication scale. J Res Mus Ed 1973;21:246-55.
6. Bergee M. An application of the facet-factorial approach to scale construction in a rating scale for euphonium and tuba music performance. University of Kansas, 1987. Dissertation.
7. DCamp CB. Application of the facet-factorial approach to scale construction in the development of a rating scale for high school band performance. University of Iowa, 1980. Dissertation.
8. Sagen DP. The development and validity of a university band performance rating scale. J Band Res 1983;18:1-11.
9. Cooksey JM. A facet-factorial approach to rating high school choral performance. J Res Mus Ed 1977;25:100-14.
10. Jones H. An application of the facet-factorial approach to scale construction in the development of a rating scale for high school vocal solo performance. University of Oklahoma, 1986. Dissertation.
11. Fields VA. Training the Singing Voice. New York: King's Crown Press, 1947.
12. Miller R. Quality in the singing voice. In: Lawrence V, ed. Transcripts of the Fourteenth Symposium: Care of the Professional Voice, Part II: Pedagogy and Medical. New York: The Voice Foundation, 1986:193-202.
13. Reid CL. Voice: Psyche and Soma. New York: The Joseph Patelson Music House, 1975.
14. Vennard W. Singing: The Mechanism and the Technic. New York: Carl Fischer, 1967.
15. Burnsed V, Hinkle D, King S. Performance evaluation reliability at selected concert festivals. J Band Res 1985;21:22-9.
16. Fiske HE. Judge-group differences in the rating of secondary school trumpet performances. J Res Mus Ed 1975;23:186-96.
17. Wapnick J, Rosenquist M. Preferences of undergraduate music majors for sequenced versus performed piano music. J Res Mus Ed 1991;39:152-60.