Summary: Experts were interviewed to identify criteria for evaluation of vocal performance. A scale was then constructed and inter- and intrajudge reliability assessed. Experts listened to 19 different performances, plus 6 presented a second time. Interjudge reliability for one judge was modest, but increased dramatically as the size of the judge panel increased. The most reliable items were overall score and intonation accuracy. Diction was less reliable than other items. Intrajudge reliability was higher for overall score than for any other item. A factor analysis on the test items yielded factors labelled intrinsic quality, execution, and diction. Another factor analysis, using the experts as variables, revealed two underlying evaluative dimensions. It was found that 13 experts were primarily influenced by execution, and that 8 were mainly affected by intrinsic quality. Interjudge and intrajudge reliabilities of these two groups differed.

Key Words: Voice evaluation--Reliability--Rating scales--Voice quality--Singing.
Accepted August 27, 1996.
Address correspondence and reprint requests to Joel Wapnick, McGill University, Faculty of Music, 555 Sherbrooke St. W., Montreal, Quebec H3A 1E3, Canada.
Portions of this study were presented at the Music Educators National Conference National Convention, Cincinnati, Ohio, U.S.A., April 1994.

Although it is true that there has been considerable research on evaluation in music performance, almost none of it has focused on vocal performance. There are two possible reasons for this: (a) researchers appear to have been more interested in instrumental evaluation than in vocal evaluation, and (b) vocal research may have been conducted but not reported because sufficient reliability and validity was difficult to obtain. Indeed, vocal performance evaluation appears to be considerably more complicated than instrumental performance evaluation. Such elements as diction and transmission of the emotional meaning of lyrics have no instrumental counterpart. In addition, the importance that singers, teachers, and audiences place on timbral qualities associated with vocal production (color, resonance, etc.) may exceed that for instrumental evaluation. It is even unclear whether vocal teachers agree on the musical manifestations of certain evaluative adjectives. The potential difficulty in evaluating vocal performance thus might explain the common lore that voice teachers have difficulty in agreeing with each other in evaluative situations (1).

Boyle and Radocy (2) observed that "the measurement of musical performance is inherently subjective. Music consists of sequential aural sensations; any judgment of a musical performance is based on those sensations as they are processed by the judge's brain" (p. 171). The lack of objectivity does not mean that the achievement of a general consensus is not important. If there is no consensus, evaluation loses much of its meaning.

Jorgenson (1) examined the vocal technique literature, from eighteenth-century writings of bel canto masters to scientific studies of contemporary researchers. He concluded that there were at least two main areas of agreement concerning what constituted good vocal production: "The first is that there should be an ease in the musculature that is involved. There should be no stretching or pushing. The vocal 'noise' should not be forced. The second is that there is an acoustical 'adjustment' in the vocal tract that results in the 2,800 cycle ring" (p. 35). Such agreement, of course, does not imply that vocal teachers agree on what methods should be used to achieve good vocal production.

In reviewing the literature on voice pedagogy, Van den Berg and Vennard (3) encountered the following
J. WAPNICK AND E. EKHOLM
descriptors of desirable vocal production: freedom, lack of interference, ring, "singer's formant," intonation accuracy, resonance, timbre, color, brilliance, power, intensity, focus, body, depth, sensation of height or high placement, velvet, floating quality, mellowness, clarity and purity of vowel production, appropriateness of vibrato, efficient use of breath support, and flexibility. The descriptors they found for faulty vocal production are at least equally evocative: smothered, straining, throttled, inefficient, tense, leaky, breathy, yelling, forced, reaching, shallow, white, throaty, clutched, too covered, pinched, twangy, nasal, honky, hooty, spread, dull and diffused, and lack of intensity or focus. Van den Berg and Vennard (3) felt that there was a need for singers and singing teachers to associate their vocabularies with objectively defined standards. They were particularly concerned that such terms as the ones listed above might not be used consistently by any one person at different times and that the same term used by many people might not have the same meaning. These concerns prompted them to suggest that a standard set of recorded vocal examples be established. Such examples might then be matched to appropriate descriptive terms in the same way that scales of color and intensity are matched to color descriptors. Van den Berg and Vennard also recommended using a sonograph to analyze the harmonic structure of standardized vocal samples. Samples could then be compared with sonograms of any voice to determine which descriptive term would correctly describe the sound. Vennard worked toward a more objective terminology until his death in 1971.

Campbell (4) attempted a computer simulation of solo vocal performance adjudication. He was at least partially successful in that "the simulation effort produces scores which correlate with the criterion (the average of the judges' scores) in the same range (0.47 to 0.63) as do human judges. This is considerably higher than the interjudge correlations (0.26 to 0.42)" (p. 32). However, adjudication by computer has not become popular. There are at least two reasons for this: (a) resistance from teachers and students, and (b) difficulty in preparing large numbers of pieces for computer evaluation.

A number of researchers have used the facet-factorial approach to construct scales for evaluating instrumental performance (5-8) or for evaluating choral performance (9). Jones' (10) use of a facet-factorial approach for the construction of a scale to evaluate vocal solo performance by high school students appears to be the only such attempt of its kind. His scale consisted of 32 statements. Judges rated these statements according to their agreement or disagreement with them. Because his scale was designed specifically for high school soloists, it probably is inappropriate for the evaluation of mature singers. Moreover, many of the items in his scale focused on technical aspects of vocal production (e.g., "soft palate too low during singing"). It is not clear whether such aspects are widely accepted by voice teachers as being necessary concomitants of good singing. Instead, it is possible that experts may disagree on technical means but agree on musical ends.

PURPOSES OF THE STUDY

One purpose of this study was to interview voice experts to identify criteria on which they typically based their evaluations of solo vocal performance. These criteria were concerned solely with the sound rather than with technical or interpretive considerations. Such criteria might be useful for evaluating singers at any stage of development, regardless of the particular techniques employed, and regardless of whether the audition is visually observed or audiotaped.

A second purpose was to construct and test a rating scale, based on the interviews. The scale assessed a larger group of experts' inter- and intrajudge reliability when evaluating 19 different performances of a particular musical excerpt.

METHOD

Preparation of the vocal evaluation form

Seven experts, all experienced voice teachers at the university level, were interviewed to determine the criteria they used for evaluating vocal production in solo voice performance. Interviews lasted approximately 1 hour each. The interview format was semistructured. Experts mentioned important criteria as they came to mind, and explained their use of terms when necessary. Subjects subsequently were shown a list of criteria collected from surveys of literature on vocal production (3,11-14) and were asked to comment on the importance of each criterion shown. From these interviews, a list of test items was drawn up. Only items that all seven experts indicated were important were included. Twelve items were thus obtained: appropriate vibrato, color/warmth, diction, dynamic range, efficient breath management, evenness of registration, flexibility, freedom throughout vocal range, intensity, intonation accuracy, legato line, and resonance/ring.

A seven-point scale was used for judging performance on each of the 12 items (1 = poor; 7 = excellent). Judges also were given space to write comments concerning strengths and weaknesses of the performance. At the bottom of the page, there was a question regarding
other items (p < 0.001). Moreover, they were significantly more reliable in judging "freedom throughout vocal range" than they were in judging any of the ten items rated below it in Table 1 (p < 0.01). All other differences among items were not significant.

Interjudge reliability as a function of panel size

As reported above, the average judge-group correlation was 0.49. When these data were recalculated to show the effects of using more than one judge to predict scores of other judges, reliability increased sharply as the number of judges used to predict from increased from one to four (from 0.49 to 0.80; Table 2). Beyond four judges, reliability improved more slowly (from 0.80 to 0.90, as the panel size increased from 4 to 10).

TABLE 2. Average Pearson r coefficients for interjudge reliability as a function of judge panel size

  No. of judges per panel    Average Pearson r coefficient
  1                          .49
  2                          .67
  3                          .75
  4                          .80
  5                          .82
  6                          .85
  7                          .86
  8                          .89
  9                          .89
  10                         .90
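The paper does not report the computation behind the panel-size coefficients, but the pattern in Table 2 is close to what the classical Spearman-Brown prophecy formula predicts when applied to the single-judge average of 0.49. The sketch below is illustrative only; the function name is ours, not the authors':

```python
def spearman_brown(r_single, k):
    """Predicted reliability of a pooled panel of k judges, given the
    average single-judge correlation r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

# Starting from the reported single-judge average of 0.49, the formula
# gives roughly 0.79 for a four-judge panel and roughly 0.91 for ten
# judges, close to the 0.80 and 0.90 reported in Table 2.
panel_predictions = {k: spearman_brown(0.49, k) for k in range(1, 11)}
```

This agreement suggests the usual interpretation: pooling judges averages out idiosyncratic error, with diminishing returns beyond about four judges.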
TABLE 1. Interjudge reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Overall score                    .64
  Intonation accuracy              .64
  Freedom throughout vocal range   .55
  Appropriate vibrato              .52
  Evenness of registration         .52
  Resonance/ring                   .51
  Flexibility                      .50
  Intensity                        .48
  Dynamic range                    .48
  Legato line                      .47
  Efficient breath management      .46
  Color/warmth                     .43
  Diction                          .34

TABLE 3. Interjudge split-half reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Dynamic range                    .92
  Efficient breath management      .92
  Legato line                      .90
  Diction                          .89
  Intensity                        .88
  Overall score                    .87
  Resonance/ring                   .86
  Evenness of registration         .85
  Appropriate vibrato              .85
  Freedom throughout vocal range   .82
  Intonation accuracy              .78
  Flexibility                      .78
  Color/warmth                     .67
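Split-half coefficients of the kind shown in Table 3 are commonly computed by dividing the judge panel into two halves, averaging each half's ratings for every performance, and correlating the two mean profiles. The exact split procedure used in the study is not described in this excerpt, so the following is a generic sketch under that assumption; the function name is ours:

```python
import numpy as np

def split_half_reliability(ratings, seed=None):
    """Generic split-half reliability for a judge panel.

    ratings: (n_judges, n_performances) array of scores on one item.
    The panel is split into two random halves, each half's ratings are
    averaged per performance, and the two mean profiles are correlated.
    """
    rng = np.random.default_rng(seed)
    n_judges = ratings.shape[0]
    order = rng.permutation(n_judges)
    half_a = ratings[order[: n_judges // 2]].mean(axis=0)
    half_b = ratings[order[n_judges // 2:]].mean(axis=0)
    return np.corrcoef(half_a, half_b)[0, 1]
```

Because each half averages roughly ten judges, half-panel means are far less noisy than single-judge ratings, which is consistent with Table 3's coefficients being much higher than the single-judge values in Table 1.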
TABLE 4. Intrajudge reliability for ratings of evaluation scale items

  Item                             Pearson r coefficient
  Overall score                    .87
  Appropriate vibrato              .76
  Resonance/ring                   .75
  Intonation accuracy              .74
  Legato line                      .74
  Freedom throughout vocal range   .73
  Efficient breath management      .73
  Evenness of registration         .72
  Intensity                        .70
  Dynamic range                    .69
  Diction                          .69
  Flexibility                      .65
  Color/warmth                     .65

color/warmth (t = 2.16; df = 121; p < 0.05). All other comparisons were not significant.

Correlations between items

The Pearson product-moment correlation matrix shown in Table 5 revealed that all 13 items were significantly correlated with each other. "Freedom throughout vocal range" and "evenness of registration," and "freedom throughout vocal range" and "flexibility" were the two most highly correlated pairs of items (r = 0.79 and 0.77, respectively), not including correlations of test items with "overall score." The three different items in these two pairs were among the five most highly correlated items with overall score. Correlations with overall score ranged from 0.57 (diction) to 0.82 ("freedom throughout vocal range"). T-tests for differences between correlations showed that overall score was correlated significantly more highly with freedom throughout vocal range than it was with all other items except evenness of registration and resonance/ring.

Performance ratings

A one-way repeated measures analysis of variance showed that there were significant differences in experts' mean ratings for the different performances (F = 11.32; p < 0.001). This, of course, was expected, in that the performances on the tape varied widely in accomplishment. The two performances by professional recording artists were ranked first and second.

Factor analyses

A factor analysis was performed on the 12 test items (overall score was excluded). Three factors resulted: intrinsic quality (consisting of color/warmth, appropriate vibrato, resonance/ring, dynamic range, and intensity), execution (comprising flexibility, evenness of registration, freedom throughout vocal range, efficient breath management, intonation accuracy, and legato line), and diction (including diction only). Factor loadings are shown in Table 6. The three factors accounted for 76.2% of the total variance (27.8% by intrinsic quality, 37.3% by execution, and 11.1% by diction).

Another factor analysis was performed, this time using the 21 experts as variables, with their raw scores as the dependent measure. There were two main, underlying factors. Thirteen experts loaded primarily on one factor, and 8 experts loaded primarily on the other. The two factors accounted for 59.4% of the total variance.
" All items were significantly associated with all others (p < .(]5).
TABLE 6. Factor analysis loadings on test items

  Factor/test item                 Loading
  "Intrinsic quality"
    Color/warmth                   .87
    Appropriate vibrato            .69
    Resonance/ring                 .67
    Dynamic range                  .62
    Intensity                      .61
  "Execution"
    Flexibility                    .81
    Evenness of registration       .78
    Freedom throughout vocal range .77
    Efficient breath management    .75
    Intonation accuracy            .67
    Legato line                    .67
  "Diction"
    Diction                        .91

As can be seen from Table 7, the six test items that loaded most heavily on factor 1 for group 1 were "evenness of registration," "freedom throughout vocal range," "efficient breath management," intonation accuracy, flexibility, and legato line. These are the same items that loaded on execution in the initial factor analysis (Table 6). However, the six test items that loaded most heavily on factor 1 for group 2 were resonance/ring, intensity, dynamic range, color/warmth, appropriate vibrato, and intonation accuracy. These items, with the exception of intonation accuracy, loaded on intrinsic quality in the initial factor analysis. It thus appears that the two groups differed from each other in that intrinsic quality primarily influenced evaluation for one group but execution was more important in rating performances for the other.

To determine if the two groups differed in interjudge reliability, a t-test was performed on experts' mean Pearson r coefficients. The analysis showed that the reliability for group 1 was significantly higher than the reliability for group 2 (t = 6.63; df = 418; p < 0.001). Subjects in group 1 had a mean Pearson r coefficient of 0.52, whereas group 2 subjects had a mean Pearson r coefficient of 0.45. Thus experts who were primarily influenced by execution were significantly more reliable than experts who were primarily influenced by intrinsic quality. No significant correlations were found between the two groups and age, teaching experience, or adjudication experience.

The significant difference between groups in interjudge reliability might have been caused by differences in group size, as there were more experts in group 1 than in group 2. To examine this possibility, correlation matrices (each expert by every other expert) were calculated within each group. As expected, the matrices resulted in higher coefficients than those found when all experts were pooled together. Nevertheless, a t-test based upon these coefficients revealed a significant difference between the groups: group 1 experts had a significantly higher mean Pearson r coefficient (r = 0.57) than did group 2 experts (r = 0.46; t = 8.47; df = 210; p < 0.001).

An analysis of variance was carried out to determine if the two groups differed in use of the rating scale. A significant difference was found only for "legato line" (F = 13.94; df = 1,397; p = 0.001): experts in group 1 rated it significantly lower than experts in group 2. An analysis of variance also was conducted to determine if the 19 performances were rated differently by experts in
the two groups. There were significant differences for six of the performances. These were the performances that were ranked third, fifth, twelfth, thirteenth, fourteenth, and seventeenth, overall.

In addition to these differences, the groups differed in intrajudge reliability. Group 1 was more reliable in this respect than was group 2 (group 1: r = 0.73; group 2: r = 0.64). A test for the difference between independent correlations showed the difference to be significant (z = 3.39, p = 0.001).

Comments

Experts' comments dealt mostly with evaluations of test items or with techniques for improving weaknesses in elements of vocal production. However, a few of the experts expressed reservations about evaluating singers from tape recordings. They commented that small voices sound relatively bigger on tape than they would in a live situation, and that big voices conversely sound relatively smaller on tape than in a live situation. One expert returned the evaluation package incomplete, explaining that it was impossible to separate vocal production from interpretation and other musical concerns, and that "musical and emotional aspects [are] part of the technique, even in vocalises and scales." This expert also felt that "phrasing" and "vowel shaping" should have been included in the list of test items. Another expert commented on the importance of "engaging both emotion and body for beautiful sound."

Two experts criticized the short duration of the excerpts. They claimed that longer excerpts were needed to allow enough time "to form a clear picture of the [singer's] strengths and weaknesses." One of them thus found it impossible to comment on the vocal production or to rate "overall score." Another judge claimed that he was influenced by the order of performances on the tape. Finally, one judge commented on the difficulty of rating all of the performances by the same standard. He explained that "in judging voice exams, one tries to take the level of the singer into consideration." This judge also mentioned that it was unusual to have to separate "technical matters" from "musicality and presentation."

DISCUSSION

It is common lore that vocal experts often disagree with each other in their evaluations of vocal performance. To a certain extent, results from our study bear this out: intrajudge reliability was much higher than interjudge reliability, and the interjudge Pearson r coefficient for individual judges averaged only 0.49. A higher figure might have been expected, especially given the wide range in quality of the performances rated. Nevertheless, results indicated that evaluations pooled from four or more judges demonstrated considerable interjudge reliability (r > 0.80). This would appear to have implications for how vocal auditions and examinations should be assessed.

Evaluation of vocal performance may be a more formidable task than evaluation of instrumental performance. A possible explanation is that judgments of intrinsic timbral quality may be both more important and more difficult than they are for instrumental evaluation. Adding to the difficulty is the existence of over 20 different vocal classifications, from soubrette (light soprano) to serioser Bass ("serious bass"). The subtle timbral distinctions that separate a great voice from a good one may have no counterpart in instrumental evaluation.

Results from the factor analyses indicated that (a) all but one of the test items could be categorized as relating to either execution or to intrinsic quality of the voice, and (b) judges who relied primarily on the "execution" items for making evaluations had higher inter- and intrajudge reliability than judges who relied primarily on the "intrinsic quality" items. This does not necessarily imply that experts should focus on execution to raise their reliability. Judges who do this may be avoiding the more crucial, but difficult, issue of evaluating vocal quality. The results do imply, however, that even experts have difficulty in evaluating vocal quality.

Each of the 12 individual items was significantly correlated with overall score. This is in accordance with previous research (15-17) and would seem to indicate that the individual items were all important in determining an overall impression. On the other hand, the significant correlations may mean merely that once the judges made an overall impression, they then rated individual items accordingly. It should be noted, however, that all 13 items were significantly correlated with each other. It thus would appear difficult to assume causality in any direction.

There was considerable variability in both inter- and intrajudge reliability among the judges. In both analyses, however, such variability was not related to age, teaching experience, or adjudication experience. The ability to evaluate consistently thus appears to be a skill that is not necessarily acquired in the course of a normal teaching or singing career. It is unclear at present what it is related to, or even if it is a skill that is easily teachable.

Although a few judges criticized the use of audiotape, we felt that audiotape had advantages over videotape. We were concerned solely with auditory aspects of vocal performance. As many clues to a singer's technique are
visual, audiotaped performances may have helped experts focus on the end product--the sound being produced--rather than on the technique used to produce it. Furthermore, audiotaped performances may have reduced the likelihood of bias resulting from identification of the singers. Finally, the use of audiotape has a certain amount of external validity, as audiotaped vocal auditions are often used today in lieu of live auditions.

This study focused on the auditory aspects of vocal production rather than on vocal technique per se. However, the establishment of standards for judging vocal performance would seem to be important for the subsequent determination of effective and efficient techniques. It is perhaps because of the lack of such standards that there has been very little empirical research in which different vocal techniques have been compared directly with each other. Given the ubiquity and importance of vocal performance, further research in this area would appear to be warranted.

REFERENCES

1. Jorgenson D. A history of conflict. NATS Bull 1980;36:31-5.
2. Boyle J, Radocy R. Measurement and Evaluation of Musical Experiences. New York: Schirmer Books, 1987.
3. Van den Berg J, Vennard W. Toward an objective vocabulary. NATS Bull 1959;15:10-5.
4. Campbell WC. A computer simulation of musical performance adjudication. In: Experimental Research in the Psychology of Music. Vol. 7. Iowa City: University of Iowa Press, 1970:1-40.
5. Abeles H. Development and validation of a clarinet performance adjudication scale. J Res Mus Ed 1973;21:246-55.
6. Bergee M. An application of the facet-factorial approach to scale construction in a rating scale for euphonium and tuba music performance. University of Kansas, 1987. Dissertation.
7. DCamp CB. Application of the facet-factorial approach to scale construction in the development of a rating scale for high school band performance. University of Iowa, 1980. Dissertation.
8. Sagen DP. The development and validity of a university band performance rating scale. J Band Res 1983;18:1-11.
9. Cooksey JM. A facet-factorial approach to rating high school choral performance. J Res Mus Ed 1977;25:100-14.
10. Jones H. An application of the facet-factorial approach to scale construction in the development of a rating scale for high school vocal solo performance. University of Oklahoma, 1986. Dissertation.
11. Fields VA. Training the Singing Voice. New York: King's Crown Press, 1947.
12. Miller R. Quality in the singing voice. In: Lawrence V, ed. Transcripts of the Fourteenth Symposium: Care of the Professional Voice, Part II: Pedagogy and Medical. New York: The Voice Foundation, 1986:193-202.
13. Reid CL. Voice: Psyche and Soma. New York: The Joseph Patelson Music House, 1975.
14. Vennard W. Singing: The Mechanism and the Technic. New York: Carl Fischer, 1967.
15. Burnsed V, Hinkle D, King S. Performance evaluation reliability at selected concert festivals. J Band Res 1985;21:22-9.
16. Fiske HE. Judge-group differences in the rating of secondary school trumpet performances. J Res Mus Ed 1975;23:186-96.
17. Wapnick J, Rosenquist M. Preferences of undergraduate music majors for sequenced versus performed piano music. J Res Mus Ed 1991;39:152-60.