

RESEARCH IN
LANGUAGE
TESTING

John W. Oller, Jr.


University of New Mexico

Kyle Perkins
Southern Illinois University

Newbury House Publishers, Inc. / Rowley / Massachusetts / 01969


Library of Congress Cataloging in Publication Data
Main entry under title:

Research in language testing.

Many of the contributions included were originally presented at the 1st International Conference on Frontiers in Language Proficiency and Dominance Testing held at Southern Illinois University, April 1977.
“Complementary sequel to Language in education: testing
the tests.”
Bibliography: p.
Includes index.
1. Language arts—Ability testing—Addresses, essays, lectures. 2. Language and languages—Ability testing—Addresses, essays, lectures. 3. Cloze procedure—Addresses, essays, lectures. 4. Intelligence levels—Addresses, essays, lectures. I. Oller, John W. II. Perkins, Kyle.
LB1575.8.R46 407.6 78-26643
ISBN 0-88377-131-4

Cover design by Barbara Frake.

NEWBURY HOUSE PUBLISHERS, INC.
Language Science / Language Teaching / Language Learning
ROWLEY, MASSACHUSETTS 01969

Copyright © 1980 by Newbury House Publishers, Inc. All rights reserved. No part of this book
may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system, without permission in
writing from the Publisher.

Printed in the U.S.A.


First printing: January 1980
5 4 3 2

Mary Anne
and
Jani

Contents

Preface ix
An Overview 1

Part I How many factors are there in second language skill?
1 Two mutually exclusive hypotheses about second language ability: indivisible or partially divisible competence 13
John W. Oller, Jr., and Frances Butler Hinofotis
2 Is language ability divisible or unitary? A factor analysis of 22 English language proficiency tests 24
George Scholz, Debby Hendricks, Randon Spurling, Marianne Johnson, and Lela Vandenburg
3 Separating the g factor from reading comprehension 34
Douglas E. Flahive
4 An analysis of various ESL proficiency tests 47
Kay K. Hisama
Discussion Questions 54

Part II Investigations of listening tasks
5 Listening competence: a prerequisite to communication 59
Pamela Cohelan Benson and Christine Hjelt
6 Communicative effectiveness as predicted by judgments of the severity of learner errors in dictations 66
Frank Bacheller
Discussion Questions 72

Part III Investigations of speaking tasks
7 Oral proficiency testing in an intensive English language program 77
Debby Hendricks, George Scholz, Randon Spurling, Marianne Johnson, and Lela Vandenburg
8 Rater reliability and oral proficiency evaluations 91
Karen A. Mullen
9 Accent and the evaluation of ESL oral proficiency 102
Donn R. Callaway
Discussion Questions 116

Part IV Investigations of reading tasks
10 Cloze as an alternative method of ESL placement and proficiency testing 121
Frances Butler Hinofotis
11 An alternative cloze testing procedure: multiple-choice format 129
Frances Butler Hinofotis and Becky Gerlach Snow
12 The effects of agreement/disagreement on cloze scores 134
Naomi Doerr
13 TOEFL scores in relation to standardized reading tests 142
Kyle Perkins and Keith Pharis
Discussion Questions 147

Part V Investigations of writing tasks
14 Scoring and rating essay tasks 151
Celeste M. Kaczmarek
15 Evaluating writing proficiency in ESL 160
Karen A. Mullen
16 Measures of syntactic complexity in evaluating ESL compositions 171
Douglas E. Flahive and Becky Gerlach Snow
17 Discrete point versus global scoring for cohesive devices 177
Jill Evola, Ellen Mamer, and Becky Lentz
Discussion Questions 182

Part VI Native versus nonnative performance: What's the difference?
18 We all make the same mistakes: a comparative study of native and nonnative errors in taking dictation 187
Michelle Fishman
19 Processing of indirectly conveyed meaning: assertion versus presupposition in first and second language acquisition 195
Patricia L. Carrell
20 Can ESL cloze tests be contrastively biased?—Vietnamese as a test case 208
Craig B. Wilson
Discussion Questions 215

Part VII Measuring factors supposed to contribute to success in second or foreign language learning
21 The correlation between aptitude scores and achievement measures in Japanese and German 219
Sadako O. Clarke
22 Behavioral and attitudinal correlates of progress in ESL by native speakers of Japanese 227
Mitsuhisa Murakami
23 Seven types of learner variables in relation to ESL learning 233
John W. Oller, Jr., Kyle Perkins, and Mitsuhisa Murakami
24 Integrative and instrumental motivations: in search of a measure 241
Thomas Ray Johnson and Kathy Krug
Discussion Questions 250

References 253
Appendix 261
About the Authors 306
Index 311
Preface

This is a book reporting practical research into the nature of language proficiency.
It deals primarily with second or foreign language learners—some in classroom
settings, others in more natural surroundings. Secondarily it addresses the
question of the comparability of first and second language proficiency. Is learning
the mother tongue similar to the process of adding another language? The focus is
on the discourse processing skills exhibited by language users.
This book is a complementary sequel to Language in Education: Testing the
Tests (Oller and Perkins, 1978). Whereas that volume addressed the broader issue
of language proficiency in relation to educational testing in general for both
monolingual English speakers and bilingual children, this volume concentrates
more specifically on the composition of language proficiency especially for second
language learners. Several chapters reach beyond this central question to assess,
for instance, the relatedness of first and second language learning, and a whole
section is devoted to aptitude, attitude, and behavioral variables believed to play a
role in facilitating or inhibiting language acquisition.
There is probably no other area of educational endeavor about which there
are so many conflicting opinions and where, at the same time, it is possible to do
appropriate empirical research to obtain answers by refining some hunches and
ruling out others. It seems to us that the time has come for the theorists and
researchers alike to turn attention to the explanation of how people produce and
comprehend meanings in the ordinary contexts of human experience. We find it
encouraging that one of the most persistent prods in this direction is coming from
the investigation of educational tests which have for so long been aimed at dividing
things up into long lists of supposedly unrelated components.
We cordially invite the reader to challenge, test, refine, reformulate, and
apply the findings reported here.

John W. Oller, Jr.


University of New Mexico
Kyle Perkins
Southern Illinois University

Acknowledgments

Having grown out of a joint effort of the Center for English as a Second Language
and the Department of Linguistics at Southern Illinois University during academic
1976-1977, this volume is so much the result of multiple contributors that it
would be impossible to acknowledge all of those who deserve credit for the fact
that it has at last materialized as a physical entity. Some of the prime movers for its
completion are unfortunately not represented among the co-authors. Both
Richard Daesch and Charles Parish (the co-directors of the Center for English as a
Second Language) provided much of the inspiration for the work as well as a good
deal of the hard labor. They, along with a team of other workers, helped in
preparing tests, scoring them, and coding the data.
James E. Redden, Professor of Linguistics, helped to press much of the work to completion indirectly by organizing (along with Kyle Perkins) the First International Conference on Frontiers in Language Proficiency and Dominance Testing held at SIU in April 1977. Many of the contributions included here were either
presented or discussed at that conference, and several of them appeared in the
Proceedings edited by Dr. Redden. Student workers who helped in a variety of
capacities ranging from insight to keypunching included Damian Kodgis, Joseph
Repka, William Flick, Steve Spurling, Linda Thornberg, Sally Chai, and many
others. A special debt of gratitude is due Sheila Brutton and all the other members
of the CESL faculty who generously gave class time and a good deal of leisure time
to various phases of the project. Sheila Brutton and Dick Daesch graciously helped with much of the taping necessary for some of the oral testing. Lillian Higgerson, administrative assistant to the Department of Linguistics, and Cathy Merriman, CESL's right hand, helped in too many ways to be
counted. We thank them particularly along with all the CESL staff.
Finally, the completion of the work would not have been possible except for a
CESL research and teaching grant to the first editor and co-author during
academic 1976-1977 while he was on leave without pay from the University of
New Mexico. The enthusiastic involvement of students, staff, and faculty was, to
us, an encouragement and an inspiration in days when research in higher
education often seems to require the lavish expenditure of federal tax money. We
are glad to say that in this case the costs were sustained by normal university
outlays and by pooling and coordinating existing resources of faculty, staff, and
student time.

An Overview

A fundamental question in educational research is how knowledge is acquired, transmitted, stored, retrieved, created, modified, remembered, and forgotten. We know that language enjoys a privileged part in all this because it is both the principal instrument and the main object of discourse processing. Language is the system that undergirds discourse. It is also the perceptible manifestation of discourse when it is in progress, and it is the main source of public evidence that discourse has occurred after the fact. Therefore, it is our belief that research into the nature of human discourse processing skills is crucial to the basic problems of education. Educators must be concerned with the nature of such skills, and with their acquisition, modification, and application in the course of normal human experience.
In our earlier volume on language as a factor in educational tests (Oller and Perkins, 1978), we expressed increasing disenchantment with the common tendency of educators to parse the world of experience into seemingly pigeonholed systems of knowledge; e.g., at the lower grades there is arithmetic, language
arts, social science, etc.; at the more advanced levels there are the traditional
departments, schools, colleges, etc. The reason that this tendency is lamentable is
that it seems to be based for the most part on unresearched assumptions rather
than solid empirical study. Similarly we expressed concern for the tendency of
educators to divide up skills into supposedly distinct aspects, subcomponents,
perceptual modalities, and the like. For instance, intelligence has been divided
into verbal and nonverbal aspects; language skill is segmented into vocabulary,
syntax, and phonology and also into receptive and productive repertoires,
visual/manual and auditory/articulatory skills, etc.; reading is parsed into too
many categories to mention, and so on it goes. In the earlier book we presented evidence that the commonly accepted distinctions may not be nearly so clear-cut as is
often supposed by curriculum specialists, testers, and other educators. Indeed, a
single factor of global language proficiency seems to account for the lion’s share of
variance in a wide variety of educational tests including nonverbal and verbal IQ
measures, achievement batteries, and even personality inventories and affective
measures. Furthermore, the importance of language proficiency to scores
obtained on all the aforementioned tests is overwhelmingly sustained in studies of
monolingual as well as bilingual school populations in the United States. Although
the research on this issue has just begun, the results to date are, in our view,
impressive and preponderantly in favor of the assumption that language skill
pervades every area of the school curriculum even more strongly than was ever
thought by curriculum writers or testers.


The truly startling possibility that the previous research raises is that many of the bits and pieces of skill and knowledge posited as separate components of language proficiency, intelligence, reading ability, and other aptitudes may be so thoroughly rooted in a single intelligence base that they are indistinguishable from it. Further, there may be little or no profit at all in trying to make many of the distinctions common to tests and curricula in education. It is just possible that, for instance, vocabulary and syntax (as components of language proficiency) are essentially the same, psychologically and perhaps even neurologically.
Educational tests and curricula, of course, usually reflect a very different set
of assumptions. They treat the learning of arithmetic, literature, history, science,
etc., as distinct and in some cases even unrelated endeavors. They are apt to
assume (probably incorrectly) that language ability has little or nothing to do with
the acquisition of computational skills, the development of motoric abilities, and
so on. Moreover, even the language curricula intended for the teaching of native
language skills or foreign languages are commonly based on the assumption that
several distinct skills exist (e.g., listening, speaking, reading, and writing) and that
each of these can be divided up into multiple subcomponents (e.g., vocabulary,
syntax, phonology/graphology), and into receptive and productive repertoires,
and so forth. Sometimes it is even insisted that the separate skills and components
must be taught in separate classes, often by different instructors using unrelated
curricula. In many of our schools and colleges there are separate language courses
for conversation, pronunciation, reading, composition, and so on. In many
programs separate course sequences exist for oral skills as opposed to reading and
writing, or for productive skills in contrast to receptive skills, and so on. It now
seems likely that the assumptions on which such distinctions are based have been
misguided from the outset, founded as they were on untested analytical theories
that neglected fundamental properties of human intelligence and discourse
processing skills.
The present volume extends the research base offered in the earlier volume
primarily with reference to the nature of language proficiency in foreign and
second language learners. A second language population is defined as one that
learns the target language in a setting where that language is used for everyday
communication in the surrounding community. A foreign language population, on
the other hand, is one that studies, but usually fails to really learn the language, in a
classroom context exclusively. That is, in the case of foreign language instruction
the target language is not a commonly used medium of communication outside the
classroom. The prime question is: How many components and what sort of
components are present in the language proficiency of these groups? Subsidiary
questions relate to the efficiency of tests requiring listening, speaking, reading,
and writing performances; the validity and reliability of subjective judgments of all
the foregoing sorts of performances; the relative efficiency of various placement
procedures for foreign students; the effects of attitudes and beliefs on the ability of
subjects to fill in blanks in passages of prose dealing with controversial topics; and
the relative information yield of different scoring methods for essays and other
tasks. A brief section is devoted to a comparison of native and nonnative
performances on various tasks including taking dictation, judging the truth value
of propositions, and filling in blanks in prose (cloze procedure). Finally, a
substantial section is addressed to various issues of background, aptitude,
behavior, and attitude as correlates of language proficiency.
The ramifications of the studies included are too numerous to be explored
fully in this introduction. However, it is possible to highlight some of the surprising results and perhaps to provide a sense of some remarkable trends toward
common conclusions. These trends are especially noteworthy because in most
cases the projects approached similar questions quite unintentionally from very
different vantage points and with completely independent empirical methods.
Nevertheless, in many cases similar results were obtained and similar if not
identical conclusions were supported. Some of the findings and conclusions are
regarded, therefore, as particularly secure.
For instance, evidence from many of the studies included sustains the
conclusion that a single unitary factor may underlie all (or nearly all) of the more
than sixty processing tasks investigated.
In Chap. 1, Oller and Hinofotis, for example, found that a single global factor of language proficiency accounted for no less than 65% of the total variance in several batteries of proficiency tests. Their study included measures aimed at listening comprehension, grammatical structure, reading vocabulary, reading comprehension, writing ability, oral fluency, accent, oral grammatical accuracy, oral vocabulary usage, and conversational comprehension, as well as ability to take dictation and fill in blanks in prose. In Chap. 2, Scholz, Hendricks, Spurling, Johnson, and Vandenburg have extended the question to a wide variety of other discourse processing tasks. In all, twenty-two English language proficiency tests roughly divisible into listening, speaking, reading, writing, and grammar tests were investigated. In spite of the considerable unreliability in some of the tasks because of a variety of uncontrollable factors, a single general factor accounted for over half the variance in all the tests. Furthermore, at least one of the tests in each of the five major categories (listening, speaking, etc.) shared 67% or more of its variance with a single global factor.
In Chap. 3, Flahive discusses an attempt to separate a reading ability factor from the general factor posited by Oller and Hinofotis and by Scholz et al. He, too, however, found substantial support for a single general factor. The correlations of various reading tests and language proficiency measures with Raven's Progressive Matrices (commonly believed to be a nonverbal measure of intelligence) ranged from .61 to .84. The subject sample consisted of a group of relatively advanced ESL students at Southern Illinois University (Center for English as a Second Language). Similarly, in Chap. 4, Hisama showed that a single factor could account for no less than 82.4% of the total variance in four tests: syntax (English structure), listening comprehension, reading comprehension, and a fill-in-blank procedure. Her sample consisted of 136 CESL students at SIU.
In Chap. 5, Benson and Hjelt cite evidence from several empirical studies
demonstrating the importance of listening comprehension to reading and other
skills. Bacheller (Chap. 6) argues that because the goal of the language learner is communication, the subjective evaluation of communicative effect is a sensible approach to language proficiency testing. His data show that a scale of communicative effectiveness correlated between .63 and .94 with a variety of scoring methods for listening, speaking, reading, and writing tasks. Bacheller's results thus reveal the existence of a substantial common factor underlying all the tests he examined. His data also support the proposed communicative effectiveness scale as a technique for evaluating listening skill.
Productive oral proficiency tests of a variety of sorts are discussed in Chap. 7 by Hendricks, Scholz, Spurling, Johnson, and Vandenburg. Their data support the finding of Oller and Hinofotis that skilled raters do not distinguish between commonly posited components of speaking skill (e.g., fluency, comprehension, grammar, vocabulary, and accent). The correlations of the various oral proficiency scales (of a variant of the Foreign Service Institute Oral Interview) with the same general factor of language proficiency very closely approximate the estimated reliabilities of the same scales. Therefore, if the variance that is common to all the scales were extracted from each, essentially no reliable variance would remain to be accounted for in any of the scales.
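One way to spell out this reasoning in conventional psychometric notation (our gloss; the authors give no formula here) is that the reliable variance unique to a scale is bounded by its reliability minus its communality on the general factor,

    \text{unique reliable variance} \approx r_{xx} - h^2

so when the communality h^2 approaches the reliability estimate r_{xx}, as it does for these scales, essentially nothing interpretable is left for a separate component.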
Mullen’s first paper, which is Chap. 8 on oral proficiency evaluations, reveals
strong intercorrelations among all four of the scales studied (listening, pronuncia¬
tion, fluency, and grammar) ranging from .77 to .85. Her data strongly support the
view that the four scales are all essentially measures of the same general factor we
have seen in other studies. Moreover, her findings are solidly supported by an
independent study of Callaway (Chap. 9) which employed 60 different raters judg¬
ing 16 samples of speech. His data revealed intercorrelations between various
scales of nonnativeness ranging from .7 3 to .89. When the mean ratings of speech
samples were correlated across scales, the correlations very nearly approximated
unity, ranging from .97 to .998.
Paralleling the results with scores on oral tasks, in Part IV Hinofotis (Chap. 10) and Hinofotis and Snow (Chap. 11) offer data from cloze studies showing that standard cloze tasks and multiple-choice versions are both promising measures of a rather wide range of ESL skills. Their work adds to an already substantial and rapidly growing body of data supporting the use of cloze tasks as measures of language proficiency. Hinofotis reasons that written cloze tests (for all the reasons argued previously, and on the basis of the additional data she provides) are suitable measures of overall language ability. Correlations of cloze tests scored by two methods in her study with the sum of subscores from the Comprehensive English Language Test battery used at CESL, and with the Test of English as a Foreign Language (Educational Testing Service, Princeton, N.J.), ranged from .71 to .84.
Perkins and Pharis (Chap. 13) studied the relationship of total scores on three batteries of widely used standardized reading tests (the Iowa Silent Reading Tests, the Nelson Denny Reading Test, and the McGraw-Hill Reading Test) with the Test of English as a Foreign Language for three different samples of subjects. Results revealed that two of the reading batteries were generally too difficult for the tested subjects, but in all cases the correlations with the TOEFL total score were equal to or greater than the correlations between the reading tests per se. The correlation between the TOEFL and the McGraw-Hill was a whopping .91 (N = 40).
Doerr’s study (Chap. 13) constitutes an interesting departure from the
preceding studies in both method and theory. She wanted to know whether agree¬
ment or disagreement of test takers with the subject matter in prose tests would
affect their ability to fill in blanks in those texts. She selected (as suggested by Dr.
Charles Parish) three controversial topics—the desirability of the decriminaliza¬
tion of marijuana, the possibility of abolishing capital punishment, and the immo¬
rality of aborting unwanted children. She wondered if agreement with the opinions
expressed by the author of a passage would result in higher scores while disagree¬
ment with the views of the author might result in lower scores. Previous research
by Manis and Dawes (1961) had shown a significant effect of agreement/disagree¬
ment for native speakers. However, Doerr found no significant effect for nonnative
speakers and showed a .90 correlation between the scores on texts with which
subjects indicated agreement and the scores on texts with which they indicated
disagreement. Her results suggest that the validity of the cloze procedure is rela¬
tively unaffected by the possible controversiality of selected topics. At least this
appears to be so for nonnative speakers.
Part V includes four chapters examining approaches to the measurement of writing ability. Kaczmarek (Chap. 14) offers some redeeming evidence in favor of the use of essay tasks as measures of writing ability. Furthermore, her work demonstrates that both subjective ratings and objective scores of essays are substantially correlated with a wide variety of scores on other discourse processing tasks. Mullen's second contribution (Chap. 15), on subjective evaluations of essays, demonstrates the essential unity of ratings of structure, organization, quantity, and vocabulary. Correlations among four scales aimed at the just mentioned theoretical components of subjective judgments of compositions ranged from .67 to .84.
Flahive and Snow (Chap. 16) and Evola, Mamer, and Lentz (Chap. 17) show that discrete point approaches to scoring of essays are, on the whole, considerably less promising than holistic evaluations of communicative effect. Compare, for instance, the correlations found by Flahive and Snow (see their Table 16-6, column 4) between a complexity index and a holistic evaluation with the correlations reported by Bacheller (Table 6-1). Although the data in the two studies are not strictly comparable, the greater variability among subjects in Bacheller's data (which include the whole range of abilities represented at CESL) is compensated for in part by the greater homogeneity of the test constructs posited by Flahive and Snow (that is, all their measures were supposed to be measures of essay writing ability or some component of that ability, while in Bacheller's study listening, speaking, reading, writing, and grammatical knowledge were all included). The comparison seems to suggest that holistic ratings are more informative than discrete point scoring methods. This conclusion is further substantiated by the
findings of Evola, Mamer, and Lentz. Examine, for instance, their Table 17-3, where none of the discrete point measures used accounted for more than about 10% of the variance in the essay ratings and the essay scores, while the latter two (see Kaczmarek, Chap. 14, Table 14-3) share no less than 59 and 64%, respectively, of their variances with the same common factor. Thus the holistic types of scores would seem to be about six times as informative concerning global language proficiency as the discrete point measures are. Further evidence in support of this same conclusion is offered incidentally by Johnson and Krug, Chap. 24. They found that a discrete point redundancy index was a moderate to weak predictor of scores on other measures of ESL proficiency. They found correlations ranging from .30 to .64 (see their Table 24-5).
Although the three chapters in Part VI all refer to substantially different
methods and distinct subject populations, all are concerned with the important
question of how similar or different first and second language acquisition are. Fishman (Chap. 18) boldly asserts that "we all make the same mistakes"; at least this
appears to be true for dictation tasks attempted by native and nonnative students at
Southern Illinois University. Two convincing proofs that natives and nonnatives
share a number of aspects of discourse processing are offered by Fishman. She
shows that the rank ordering of segments in a text is substantially similar for the
two subject populations (rho = .68). Two Q-type factor analyses also support her
conclusions.
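For readers who want to see what a rank-order statistic like Fishman's rho involves, here is a minimal sketch in Python; the segment rankings below are invented for illustration and are not her data.

    from scipy.stats import spearmanr

    # Hypothetical difficulty rankings of ten text segments, one from
    # native speakers and one from nonnatives (not Fishman's data).
    native_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    nonnative_ranks = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

    rho, p_value = spearmanr(native_ranks, nonnative_ranks)
    print(f"rho = {rho:.2f}, p = {p_value:.3f}")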
Carrell (Chap. 19) takes a basically different tack. Her research questions are mainly concerned with the processing of two kinds of indirectly conveyed meanings. The first type is presuppositional meaning, where a given statement or assertion requires that some other proposition be taken for granted in order for the immediate assertion to be understood. For instance, if someone says, "Close the door," unless the door is open, the statement does not make sense. We must presuppose that the door is open. A second type of indirectly conveyed meaning is what may be termed implicational. In this case, the suggested meaning is implied or entailed by the asserted meaning. For instance, if someone says "George had to leave town," this implies that in fact he did leave. It seems, in a nontechnical sense, that the difference between the two types of meanings is dependent on the implicit time sequence of the events referred to or suggested in relation to the time of utterance.
In any case, Carrell hypothesized that both natives and nonnatives would have more trouble in identifying false presuppositions (in relation to pictured information) than false implications when the problem was to judge the truth or falsehood of given statements. Half of the statements actually required false presuppositions and half of them entailed false implications. The results sustained Carrell's hypothesis and revealed a similar pattern for both natives and nonnatives, indicating a possible underlying similarity of processing strategies. This result accords well with Fishman's findings.
In Chap. 20, Wilson takes yet another approach to a comparison of native and nonnative performance on language tests. He asked whether it was possible or not to deliberately create a cloze test that would be more difficult for ESL learners who were native speakers of Vietnamese than for learners with other language backgrounds. He employed 72 subjects from 12 different language backgrounds including 9 native speakers of English. On the basis of a contrastive analysis of English and Vietnamese structures, three types of cloze tests were constructed which were expected to become progressively more difficult only for the Vietnamese subjects (37 of the 72 subjects tested). One cloze test was constructed by placing blanks at points believed to be difficult for Vietnamese learning English. Another was deliberately loaded with structures believed difficult for Vietnamese on the basis of the contrastive analysis, but deletion points were selected randomly on an every fifth word basis. A third text was doubly biased by salting it with structures believed difficult for Vietnamese and by also selecting the most difficult points for inserting blanks in the text. Another regular cloze test over a passage judged similar in difficulty level to the other three was used as a control test.
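The fixed-ratio deletion procedure mentioned here (a blank at every fifth word) is mechanical enough to sketch in code. Below is a minimal illustration in Python, not Wilson's actual materials: the passage, starting offset, and blank token are all hypothetical, and real cloze tests typically leave an unmutilated lead-in sentence or two.

    import re

    def make_cloze(text, nth=5, start=5, blank="______"):
        """Delete every nth word from text, returning the mutilated
        passage and the answer key. start is the index of the first
        deleted word, so opening words can be left intact as context."""
        words = text.split()
        key = []
        for i in range(start, len(words), nth):
            core = re.sub(r"^\W+|\W+$", "", words[i])  # leave punctuation in place
            if not core:
                continue
            key.append(core)
            words[i] = words[i].replace(core, blank, 1)
        return " ".join(words), key

    # Hypothetical passage for illustration only.
    passage = ("Language is the system that undergirds discourse and it is "
               "also the perceptible manifestation of discourse in progress.")
    mutilated, answers = make_cloze(passage)
    print(mutilated)
    print(answers)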
Contrary to the initial prediction, all the biased tests were more difficult for all the subjects. Again, in this respect, natives and nonnatives performed similarly. Language background did not appear to be a significant factor in differentiating the relative difficulties of the four tests. In sum, it is apparently possible to make a test more or less difficult on the basis of a contrastive analysis, but it seems to be difficult to do so in a way that will deliberately discriminate only against a particular language group. Again, native and nonnative performances revealed fundamental similarities.
Part VII contains four papers that address different aspects of the problem of explaining the variability in language proficiency exhibited by foreign and second language learners. Clarke (Chap. 21) presents data that may be surprising to some of the advocates of aptitude testing as a basis for predicting success in foreign language learning. Her data, and the data she cites from the manual for the Modern Language Aptitude Test, suggest that it is a weak predictor at best of foreign language attainment. Clarke is mystified by the peculiar discrepancies in the MLAT as a predictor of success in German as compared against success in Japanese. In neither case are the obtained correlations consistently strong. For instance, for the Japanese learners correlations range from .23 to .74, and for the German learners from .07 to .48.
Although more research is clearly needed, it seems to us that the unreliability
of the MLAT as a predictor of attainment may be due to a variety of factors. One of
the important possibilities that we feel should be studied is that the MLAT may
measure primarily performance on the sort of discrete point tasks frequently characteristic of discrete point foreign language teaching—a variety of teaching that
fails so completely that even its best products ordinarily cannot engage in normal
conversation or letter writing or even reading a newspaper in the target language
with reasonable fluency and comprehension. Many students of foreign languages
would be helpless if asked to make simple introductions in the language or follow a
set of directions to find an address in a city where the target language is spoken.
Therefore, if the “aptitude” that the MLAT measures is the ability to deal
with the sorts of irrelevant skills that are characteristically attended to in many ineffective foreign language programs, it should be no great surprise that its correlations with such unreliable attainments should reveal substantial randomness. Perhaps it will be possible to devise more pragmatically oriented aptitude tests that will produce more reliable predictive relationships with actual language processing tasks.
One of the most surprising results reported in any of the papers included in this volume appears in Chap. 22 by Murakami. He was interested in a number of behavioral, demographic, and attitudinal variables which he thought might be causally related to the learning of English as a second language. His subject population consisted of 30 native speakers of Japanese studying in various academic areas at Southern Illinois University. The best overall predictor of language proficiency for both cloze scores and dictation scores was the student's status at time of testing. The graduate students consistently did better than the undergraduates, who in turn did better than those enrolled in classes in English as a second language.
The surprise came with respect to two of the behavioral variables. Oddly, the number of close English-speaking friends subjects indicated having was more strongly correlated with cloze scores than with dictation scores (.39 and .28, respectively), and the number of pages subjects indicated writing in the target language each semester was more strongly correlated with the dictation score than with the cloze score (.64 and .32, respectively). This is surprising because we should expect English-speaking friends to make more of an impact on ability to perform a listening task (namely, the dictation) rather than a reading and writing task (namely, the cloze test), while the amount of writing in English each semester might be expected to relate more strongly to scores on the reading and writing task (cloze). In both cases, the findings were exactly reversed. We will leave it to the reader to explore Murakami's proposed explanation, but perhaps we should note that his results are radically opposed to the separation of skills in language teaching.
Chapter 23, jointly authored by the co-editors of this volume and Murakami, investigates seven types of variables that might be expected to be moderate causal factors in the attainment of second language proficiency. Contrary to many popular predictions, the data do not fit the traditional assumptions concerning integrative and instrumental motives. These and other variables frequently believed to contribute to success in learning a second language are no more strongly related to measures of ESL proficiency than are the three extraneous questions regarding agreement/disagreement with controversial statements about marijuana, capital punishment, and abortion (included relative to the Doerr study, Chap. 12 above). Several explanations are considered. It is not possible to rule out the alternative that the so-called attitude questions, and in fact all the rest, might be indirect though unintentional measures of language proficiency to start with. This explanation would wipe out a great deal of the empirical support claimed by attitude theorists, in spite of the fact that it leaves the theoretical arguments for
attitudes and motivations unscathed. In fact, we are inclined to believe that some of those theoretical arguments are correct, but the empirical foundation is shaky; see Oller and Perkins (1978, Chap. 5).
The last chapter in the volume, by Johnson and Krug (Chap. 24), considers the possibility of using a redundancy index as a measure of the degree of integrativeness of second language learners. It was reasoned that the tendency of second language learners to appropriately use functors such as the plural morpheme, the copula, the possessive inflection, and the third person singular habitual present marker might be taken as a measure of the learner's desire to be like valued members of the target language community. While the redundancy index did in fact prove to be a better predictor of attained language proficiency than more traditional attitude questions, it can be argued that the redundancy index is a more or less direct measure of language proficiency. Therefore, what has to be demonstrated is that it is also a measure of an "integrative" motive. The latter relationship cannot be clearly inferred from the Johnson and Krug study. While a modest correlation of .32 (significant at p < .05) was found with one of the so-called "integrative" reasons for learning English, the redundancy index was also correlated at .23 (p < .10) with an instrumental motive (see Table 24-1). However, it failed to correlate significantly with any of the other integrative or instrumental motives.
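As a rough sketch of the general idea behind such an index (not Johnson and Krug's actual scoring procedure, which their chapter specifies), one can compute the proportion of obligatory functor contexts in which the learner supplied the functor; the coding scheme and sample below are hypothetical.

    def redundancy_index(contexts):
        """Proportion of obligatory functor contexts (plural -s, copula,
        possessive, third person -s, etc.) in which the functor was
        actually supplied in a hand-coded learner sample."""
        if not contexts:
            return 0.0
        return sum(1 for _, supplied in contexts if supplied) / len(contexts)

    # Hypothetical hand-coded sample: the learner supplied 3 of 4 functors.
    sample = [("plural -s", True), ("copula", True),
              ("possessive 's", False), ("third person -s", True)]
    print(redundancy_index(sample))  # 0.75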
We conclude our introduction by noting that all the research recorded here points unmistakably to the fact that language skills of all the traditionally posited sorts are fundamentally related. This appears to be true for natives and nonnatives alike. It seems that all human beings rather naturally attend to meaning in comprehending and producing discourse, and they are either incapable of or at least not good at attending to much of anything else.
Part I

How Many Factors Are There in Second Language Skill?
How can the differences in performance on language tests be explained? To what factors or components of human ability can they be attributed? Are they due primarily to a single source of variability? Or are they attributable to multiple sources? If multiple sources exist, can their existence be demonstrated by testing techniques? Is reading ability a source of variance that can be distinguished clearly from verbal intelligence? Nonverbal intelligence? Is knowledge of words a different sort of knowledge than the ability to form grammatical sequences of them? Can multiple aspects of speaking ability such as fluency, vocabulary, and accent be judged separately or evaluated independently? What sorts of tests produce the maximum information yield concerning individual differences in language proficiency? These and other questions are discussed in Chaps. 1 through 4.
Chapter 1

Two Mutually Exclusive Hypotheses about Second Language Ability:
Indivisible or Partially Divisible Competence¹

John W. Oller, Jr., and Frances Butler Hinofotis

Two hypotheses proposed to explain the variance in second language tests are investigated. Hypothesis 1 (H1) claims that language skill is separable into components related either to linguistically defined categories (e.g., phonology, syntax, and lexicon) or the traditionally recognized skills (i.e., listening, speaking, reading, and writing). Although tests of the presumed separable components are believed to produce substantial overlapping variances, it is assumed in H1 that tests aimed at a certain component (e.g., listening skill or vocabulary knowledge) should also produce some meaningful variance that is unique to that component (i.e., not overlapping with variances of tests aimed at other components). Another possibility (H2) is that second language ability may be a more unitary factor such that once the common variance on a variety of language tasks is explained, essentially no meaningful unique variance attributable to separate components will remain. Previous studies have provided rather convincing support for H2, though it seems to be the less obvious of the two alternatives. Data from 159 Iranian subjects at the University of Tehran, Iran, who took a cloze test, a dictation, and the five subparts of the Test of English as a Foreign Language also support H2 in this report. However, when an oral interview task is included, the picture is less clear. Data from 106 foreign students (from mixed language backgrounds) at the Center for English as a Second Language at Southern Illinois University suggest the possibility of unique variances associated with components of grammatical knowledge (e.g., syntax versus phonology or vocabulary).

One of the empirical methods for investigating the composition of mental ability is to examine the pattern of intercorrelations between tests that purport to measure different aspects of that mental ability. Factor analysis is one of the statistical procedures for studying the tendency of measures to produce meaningful variances, that is, variances which are either unique to a particular test or common to two or more tests. All factoring methods aim to simplify the data available in a correlation matrix—the main question is how many factors and what sorts are required to explain essentially all of the variance in a given matrix. By variance we mean the algebraic quantity used in statistics to characterize the dispersion of scores about a mean score for a certain population of subjects on a certain test or battery of tests. By correlation we mean a similar quantity used to characterize the degree of overlap in variance, or the tendency for scores on separate tests to covary proportionately about their respective means.
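In the standard notation these definitions assume (our gloss; the text gives no formulas), for scores x and y from N subjects,

    s_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N s_x s_y}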
The particular question investigated here is whether there is any unique variance associated with certain language processing tasks. For instance, is there any unique variance associated with tests that purport to measure vocabulary knowledge, as opposed to tests that purport to measure, say, syntactic knowledge? Or is there any unique variance associated with, say, listening comprehension as opposed to speaking ability, as judged by tests with those respective labels? In short, can language skill be partitioned into meaningful components which can be tested separately? Or, viewed the other way around, does variance in the performance of different language tasks support the componential theory of language competence?
Two mutually exclusive hypotheses have been offered. First there is what we will refer to as the divisible competence hypothesis: it has been argued by many linguists and pedagogues that language proficiency can be divided into separate components and separate skills or aspects of them. The components usually singled out include phonology, syntax, and lexicon, and the skills listening, speaking, reading, and writing. Some have argued further that it is necessary to distinguish between receptive and productive repertoires (that is, listening/speaking versus reading/writing). It was even contended by Lado (1961) that the grammatical components posited for one skill or modality may be different from those functional in a different skill or modality. In a similar vein, Clark (1972) spoke of separate "grammars" for speaking and listening.
A second major hypothesis is that language proficiency may be functionally rather unitary. The components of language competence, whatever they may be, may function more or less similarly in any language-based task. If this were the case, high correlations would be expected between valid language tests of all sorts. Seemingly contradictory results, such as the fact that listening comprehension usually exceeds speaking proficiency in either first or second language speakers, would have to be explained on some basis other than the postulation of separate grammars or components of competence. For instance, one might appeal to the load on attention and short-term memory that is exerted by different language-processing tasks. It may require more mental energy to speak than to listen, or to write than to read, and so forth.
If the variance associated with language tests which are aimed at separate components or skills were substantially overlapping (that is, if the tests were strongly correlated), the unitary competence hypothesis would be sustained. If unique variances could be shown to be associated with tests aimed at separate skills and/or at separate components, some version of the divisible competence hypothesis would be sustained.
The unitary competence hypothesis is reminiscent of Spearman's general factor of intelligence. In fact, Spearman, about the turn of the century, invented factor-analytic techniques explicitly for the purpose of testing for just such a general factor of intelligence. Oddly, though, the construct of intelligence remains very poorly understood. A widely cited reference book, Assessment of Children's Intelligence (Sattler, 1974), contains nearly a thousand pages of text but humbly confesses early in the preface that "it does not present a thorough discussion of the nature of intelligence" (p. viii). Nevertheless, Spearman's general factor, according to a prominent Berkeley theorist, stands "like a Rock of Gibraltar in psychometrics, defying any attempt to construct a test of complex problem-solving which excludes it" (Jensen, 1972, p. 195). Our understanding of the nature of language competence has advanced more substantially than have theories of intelligence. We know much more explicitly what we mean by "language proficiency" than we know about the meaning of "intelligence." Indeed, it has recently been empirically demonstrated that variance in language proficiency accounts for the main portion of variance in tests of "intelligence" (Stump, 1978). This result raises serious doubts about the bold assertion of Sattler (1974) that there are many valid measures of "intelligence" (p. viii).
Because Spearman’s argument for a general factor of intelligence is similar to
the unitary language competence hypothesis, it is possible to apply to the language
question a statistical method devised as a test for a general factor of intelligence.
Nunnally (1967) shows that if a general factor exists and is common to a variety of
tests, the products of factor loadings on that general factor must predict the simple
zero-order correlations between the tests. That is, if a general (or unitary) factor
exists, the product of the loadings of any two tests on g (the general factor) will
equal the raw correlation between those same tests. This follows from the general
fact that for any factor matrix, the sum of products of loadings of any variables A
and B on the respective factors must equal the correlation between/! and B at least
to the extent that the factor matrix exhausts the covariances in the original co¬
relation matrix. Therefore, the goodness of fit of the unitary hypothesis can be
tested directly by factoring a variety of language tests to a principal components
solution and then testing for a general factor by using the loadings on the first
principal component to predict the original correlation matrix.
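In symbols, the complete-factorization fact just cited is r_{AB} = \sum_k a_{Ak} a_{Bk}, so a one-factor fit predicts \hat{r}_{AB} = g_A g_B. As a rough computational illustration (a minimal sketch, not the authors' routine, which used iterated communality estimates), the loadings can be taken from the first principal component of a correlation matrix and the residuals inspected; the three-test matrix below is hypothetical.

    import numpy as np

    def g_factor_residuals(R):
        """Loadings on g from the first principal component of the
        correlation matrix R (first eigenvector scaled by the square
        root of its eigenvalue), and the residual matrix R - g g^T.
        Near-zero off-diagonal residuals support a unitary factor."""
        eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
        loadings = np.sqrt(eigvals[-1]) * np.abs(eigvecs[:, -1])
        return loadings, R - np.outer(loadings, loadings)

    # Hypothetical three-test correlation matrix for illustration.
    R = np.array([[1.00, 0.69, 0.56],
                  [0.69, 1.00, 0.64],
                  [0.56, 0.64, 1.00]])
    g, residuals = g_factor_residuals(R)
    print(np.round(g, 2))
    print(np.round(residuals, 2))  # examine the off-diagonal entries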
In this study, the above-mentioned statistical test was applied to three sets of
data. The first sample of data was from a population of 159 Iranians who took the
five subparts of the Test of English as a Foreign Language (ETS) plus a cloze test
and a dictation. Subjects were students at the University of Tehran in Iran. All the
tests were administered with the help of the university and the American Field
Service in 1972 and 1973. Results are presented in Tables 1-1 to 1-3.

Table 1-1 Principal Components Solution (with Iterations) for the Five Subtests of the Test of English as a Foreign Language, a Cloze Test, and a Dictation (N = 159 Iranian Subjects)

Test                                   Loadings on g factor*   h²
Listening Comprehension                .87                     .76
English Structure                      .82                     .67
Vocabulary                             .67                     .45
Reading Ability                        .73                     .53
Writing Ability                        .78                     .61
Cloze (any appropriate word scoring)   .87                     .76
Dictation                              .76                     .58
Eigenvalue                             4.36

*Accounts for 100% of the total variance in the factor matrix (using an iterative procedure with communality estimates in the diagonal less than unity).

Table 1-2 Correlation Matrix (above the Diagonal) and Predicted Correlations Derived from Respective Products of Loadings on g (below the Diagonal)

Test                        1     2     3     4     5     6     7
1 Listening Comprehension         .69   .56   .64   .68   .76   .69
2 English Structure         .71         .64   .57   .65   .68   .63
3 Vocabulary                .58   .55         .49   .60   .51   .47
4 Reading Ability           .64   .60   .49         .58   .65   .53
5 Writing Ability           .68   .64   .52   .57         .67   .52
6 Cloze                     .76   .71   .58   .64   .68         .75
7 Dictation                 .66   .62   .51   .55   .59   .66

Table 1-3 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .026, Range = .08): Observed r minus Product of Loadings on g

Test                        1     2     3     4     5     6     7
1 Listening Comprehension        -.02  -.02   .00   .00   .00   .03
2 English Structure                    -.08  -.03   .01  -.03   .01
3 Vocabulary                                  .00   .07  -.01  -.04
4 Reading Ability                                   .01   .01  -.02
5 Writing Ability                                        -.01  -.07
6 Cloze                                                        -.08
7 Dictation
Table 1-1 gives the loadings on a general factor as well as the squares of those loadings. The loadings, of course, may be interpreted as simple product moment correlations between the various test scores and the hypothetical variable which may be taken as an empirical estimate of a unitary language proficiency factor. It is in fact a linear combination of the original variables. The squared loadings indicate the proportion of variance overlap between the hypothetical factor defined by the principal components analysis and any particular test variable. For instance, the Listening Comprehension subtest of the TOEFL correlates at .87 with the hypothetical g factor, thus accounting for .76 (or 76%) of the variance in g; or, alternatively, we may say that g accounts for 76% of the total variance in the Listening Comprehension section of the TOEFL.
The next step is to determine how well the general factor or unitary competence hypothesis accounts for the observed correlations between the various subtests used in the study. In other words, once the variance that can be attributed to g is partialed out, how much variance will remain? Will it be necessary to posit other factors in addition to g, or will the g factor suffice to explain essentially all of the nonerror variance?
Table 1-2 presents in the upper half the correlations between test scores, and in the lower half the predicted correlations based on the respective products of loadings on g. Table 1-3 then presents the residuals—that is, what is left over after the products of loadings on g are subtracted (that is, partialed out). For instance, the product of the loadings of Listening Comprehension and English Structure on the g factor is .71, while the actual correlation between the Listening Comprehension test and the English Structure test is .69. This leaves a residual of -.02. Proceeding in similar fashion for all variables, it soon becomes apparent from Table 1-3 that once the g factor is partialed out, practically no variance remains to be explained.
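Written out with the loadings from Table 1-1, the arithmetic for this pair of tests is

    \hat{r}_{12} = g_1 g_2 = .87 \times .82 \approx .71, \qquad e_{12} = r_{12} - \hat{r}_{12} = .69 - .71 = -.02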
Allowing for even a small percentage of error variance attributable to the
unreliability and less than perfect validity of each of the various measures,
essentially no variance is left once g is removed. This is noteworthy for several
reasons. In spite of the fact that there are two tasks that require listening—namely,
the Dictation and the Listening Comprehension subsection of the TOEFL—no
separate listening factor emerges. Similarly, in spite of the fact that there are
several tests that require reading comprehension, vocabulary, and structure, no
unique factors are needed to account for the variance in those tests, and neither do
they produce any unique variances that can be associated with anything different
from what is measured by the Cloze test or the Dictation.
A second set of data comes from foreign students at Southern Illinois
University. No productive oral task was included in the immediately foregoing
study or in the earlier work with the UCLA ESL Placement Exam (Oller, 1976).
Hinofotis (1976), however, collected data from 106 subjects at SIU using the FSI
oral interview with its five subscales along with a cloze test, and the three subparts
of the Placement Examination used there by the Center for English as a Second
Language. (See also Hinofotis, Chap. 10, this volume.) Results parallel to the ones
given in Tables 1-1 through 1-3 above are presented in Tables 1-4 through 1-6 for
the latter group of subjects and for the respective set of tests.

Table 1-4 Principal Components Solution (with Iterations) for the FSI Oral Interview Scales, the SIU CESL Placement Subtests, and a Cloze Test (N = 106 Subjects from Mixed Language Backgrounds at SIU)

Test                           Loadings on g factor*   h²
Cloze                          .81                     .66
FSI Accent                     .72                     .52
FSI Grammar                    .89                     .79
FSI Vocabulary                 .87                     .76
FSI Fluency                    .87                     .76
FSI Comprehension              .86                     .74
CESL Listening Comprehension   .78                     .61
CESL Structure                 .69                     .48
CESL Reading                   .76                     .58
Eigenvalue                     5.90

*Accounts for 87% of the total variance.

Table 1-5 Correlation Matrix (above Diagonal) and Predicted Correlations Derived from Respective Products of Loadings on g (below Diagonal)

Test                           1     2     3     4     5     6     7     8     9
1 Cloze                              .51   .62   .55   .58   .58   .74   .69   .80
2 FSI Accent                   .58         .67   .65   .66   .68   .48   .55   .48
3 FSI Grammar                  .72   .64         .87   .85   .82   .64   .59   .53
4 FSI Vocabulary               .70   .63   .77         .85   .84   .60   .48   .55
5 FSI Fluency                  .70   .63   .77   .76         .83   .63   .48   .51
6 FSI Comprehension            .70   .62   .77   .75   .75         .58   .49   .53
7 CESL Listening Comprehension .70   .56   .69   .68   .68   .67         .61   .74
8 CESL Structure               .56   .50   .61   .60   .60   .59   .54         .63
9 CESL Reading                 .62   .54   .68   .66   .66   .65   .59   .52

Table 1-6 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .091, Range = .17): Observed r minus Product of Loadings on g

Test                           1     2     3     4     5     6     7     8     9
1 Cloze                              .07  -.10  -.15  -.12  -.12   .04   .13   .18
2 FSI Accent                               .03   .02   .05   .04  -.08  -.05  -.06
3 FSI Grammar                                    .10   .08   .04  -.05  -.02  -.15
4 FSI Vocabulary                                      -.10  -.16  -.08  -.12  -.11
5 FSI Fluency                                               .08  -.05  -.12  -.15
6 FSI Comprehension                                               .09  -.10  -.12
7 CESL Listening Comprehension                                         .07   .15
8 CESL Structure                                                             -.11
9 CESL Reading

Table 1-7 Varimax Rotated Factor Solution (with Iterations) for a Cloze Test, the Five Subscales of the FSI Oral Interview, and the Three Subtests of the CESL Placement Examination (N = 106 Subjects at SIU)

Test                           Factor 1   Factor 2   h²*
Cloze                          .34        .84        .83
FSI Accent                     .63        .38        .54
FSI Grammar                    .84        .40        .87
FSI Vocabulary                 .86        .33        .85
FSI Fluency                    .86        .34        .86
FSI Comprehension              .84        .34        .83
CESL Listening Comprehension   .42        .71        .68
CESL Structure                 .34        .67        .57
CESL Reading Comprehension     .28        .84        .79
Eigenvalue                     6.82

*Factors 1 and 2 account for 56 and 44%, respectively, of the total variance in the factor matrix.

The first of the two factors in the principal components analysis accounts for 87% of the total variance in the factor matrix and receives no loading less than .69 from any single test. The residuals in Table 1-6 are never as high as .20 and are always small in proportion to the observed correlations and the respective products of factor loadings.
The existence of a substantial general factor seems to be demonstrated, though the possibility remains that there is some unique variance that is associated with the FSI Oral Interview which is not also associated with the other tests used. A two-factor explanation is supported by a varimax rotated orthogonal solution derived from the principal components analysis. The orthogonal solution is displayed in Table 1-7. The heaviest loadings on Factor 1 in Table 1-7 are from the subscales of the FSI Oral Interview while the heaviest loadings on Factor 2 are from the cloze and CESL placement subtests. An oblique two-factor solution (not displayed), however, revealed a .71 correlation between two similarly differentiated factors. Hence the evidence for clearly distinct variance associated with a speaking factor is not completely convincing, but neither can it be ruled out. By comparing the eigenvalue associated with the two-factor solution in Table 1-7 with the eigenvalue associated with the one-factor solution in Table 1-4, it is possible to form an impression of the advantage gained by the two-factor solution over the one-factor—about 13% of the variance in the two-factor solution is not accounted for by the g factor.
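
Spelled out with the eigenvalues just cited, the comparison is (our restatement):

\[ \frac{6.82 - 5.90}{6.82} \approx .13 \]

that is, roughly 13% of the common variance extracted by the two-factor solution lies outside the g factor.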
A third and final set of data comes from 51 of the above-mentioned subjects who also took the TOEFL. The data from these subjects with the five TOEFL subtests included are given in Tables 1-8 to 1-10. In this case, the g factor accounts for only .65 of the total variance in the principal components matrix, while two additional factors are required to account for the remaining .35. The absolute mean of the residuals is .155 and has a range of .36, which is considerably larger than for either of the two previous populations.

Table 1-8 Principal Components Solution (with Iterations) for a Cloze Test, the Five Subscales on the FSI Oral Interview, the Three Subtests of the SIU CESL Placement Test, and the Five Subtests of the TOEFL (N = 51 Subjects from Mixed Native Language Backgrounds)

Test                               Loadings on g factor*    h2
Cloze .80 .64
FSI Accent .29 .08
FSI Grammar .68 .46
FSI Vocabulary .66 .44
FSI Fluency .64 .41
FSI Comprehension .65 .42
CESL Listening Comprehension .76 .58
CESL Structure .45 .20
CESL Reading Comprehension .58 .34
TOEFL Listening Comprehension .67 .45
TOEFL English Structure .73 .53
TOEFL Vocabulary .57 .32
TOEFL Reading Ability .78 .61
TOEFL Writing Ability .68 .46
Eigenvalue 5.94

*Accounts for 65% of the total variance in the factor matrix.

However, there is considerably less variance in the latter population on all tests. This is because the procedure for selecting the subjects to take the TOEFL eliminated roughly the bottom half of the distribution—i.e., no subject who placed below the middle of the distribution also took the TOEFL. Hence the correlations in Table 1-9 for the 51 subjects are depressed as compared with the correlations in Table 1-5 for the full 106 subjects. For instance, whereas in Table 1-5 hardly any of the correlations are below .5, in Table 1-9 many are below .3.
Table 1-11 gives a varimax rotated solution for the 51 subjects over the 14 tests, indicating three orthogonal factors which may tentatively be labeled “reading/graphic” (Factor 1, with .39 of the variance), “oral interview” (Factor 2, with .38 of the variance), and “listening” (Factor 3, with .23 of the variance). The total eigenvalue for these three factors is 9.20 as compared with 5.94 in Table 1-8. Hence the three-factor solution accounts for 35% more variance than the single-factor solution.
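
Again in terms of the reported eigenvalues (our restatement):

\[ \frac{9.20 - 5.94}{9.20} \approx .35 \]

so about 35% of the variance extracted by the three-factor solution is not captured by the single factor.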
[Table 1-9 Correlation Matrix (above Diagonal) and Predicted Correlations Derived from Products of Loadings on g (below Diagonal) for the Fourteen Tests (N = 51): the values are not legible in this reproduction.]

[Table 1-10 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .155, Range = .36) for the Fourteen Tests (N = 51): the values are not legible in this reproduction.]

Table 1-11 Varimax Rotated Solution (with Iterations) for a Cloze Test, the Five Subscales of the FSI Oral Interview, the Three Subtests of the SIU CESL Placement Examination, and the Five Subtests of the TOEFL (N = 51)

Test                               Factor 1    Factor 2    Factor 3    h2*

Cloze .84 .12 .34 .84


FSI Accent .03 .64 -.15 .43
FSI Grammar .20 .89 .12 .84
FSI Vocabulary .18 .81 .19 .73
FSI Fluency .00 .87 .34 .88
FSI Comprehension .17 .82 .19 .73
CESL Listening Comprehension .41 .21 .75 .77
CESL Structure .59 .07 .02 .35
CESL Reading Comprehension .56 -.01 .42 .49
TOEFL Listening Comprehension .29 .19 .75 .68
TOEFL English Structure .61 .17 .45 .60
TOEFL Vocabulary .64 .12 .14 .44
TOEFL Reading Ability .85 .19 .21 .80
TOEFL Writing Ability .62 .07 .44 .57
Eigenvalue 9.20

*Factors 1, 2, and 3 account for 39, 38, and 23%, respectively, of the total variance in the matrix.

The results of this last analysis demonstrate the existence of a substantial g factor but do not rule out the possibility of unique variances associated with subtests aimed at separate skills (though unique variances associated with separate components of skills are ruled out). Further research will be necessary to determine whether Factor 1 in Table 1-7 and Factor 2 in Table 1-11 indeed constitute a “speaking” factor in the most general sense—i.e., whether such factors will have variances in common with other tests aimed at speaking ability (e.g., oral cloze, reading aloud, sentence repetition) which are not also common to tasks relying on other skills. Similarly, further research will be required to see if other tests that require listening comprehension will load on a factor such as Factor 3 in Table 1-11, which is actually distinct from the possible speaking and graphic factors.
Considering the results of all three sets of data, the notion of separate components of structure, vocabulary, and phonology finds no support. There is substantial evidence that the five subscales on the FSI Oral Interview, for instance, are equivalent. The choice between the unitary competence hypothesis and the possibility of separate skills is less clear. There is some evidence to suggest that (excluding the oral interview data) if the data represent the whole range of subject variability, the unitary competence hypothesis may be the best explanation, but if the variability is somewhat less, a moderate version of a separate skills hypothesis would be preferred. Regarding the oral interview data, there seems to be some unique variance associated either with a separate speaking factor or with a consistency factor—the tendency of judges simply to rate subjects similarly on all of the FSI scales. Certainly there is substantial evidence that a general factor exists which accounts for .65 or more of the total variance in the several batteries of tests investigated.

Note

1. A version of this paper was presented at the winter meeting of the Linguistic Society of America in
Philadelphia, Dec. 30, 1976.
Chapter 2

Is Language Ability Divisible or Unitary?
A Factor Analysis of 22 English Language Proficiency Tests

George Scholz, Debby Hendricks, Randon Spurling,
Marianne Johnson, and Lela Vandenburg

This paper reports a partial replication of Oller and Hinofotis (Chap. 1). Two hypotheses concerning the nature of language skill are evaluated. The divisible competence hypothesis considers language proficiency a composite of various skills and components. It is related to the discrete point approach to teaching and testing. Although each of the components and/or skills is expected to have a certain amount of overlapping variance with other components and/or skills, it is assumed that tests of each separable skill or component will have a certain amount of unique variance not associated with tests designed for the other language components or skills. The second hypothesis, which claims that language ability is essentially unitary, predicts that once the common variance in a variety of language tasks has been explained, there will be no leftover unique variances which can be attributed to separate skills or components. A factor analysis of the data from 182 subjects who participated in an experimental English language testing project at the Center for English as a Second Language at Southern Illinois University in the spring of 1977 tends to support the unitary competence hypothesis.

Two mutually exclusive hypotheses have been offered to explain the nature of language competence. The first has been called the divisible competence hypothesis (Chap. 1). According to this hypothesis, language ability can be separated into a number of relatively independent parts. Thus linguists traditionally divide language ability into the areas of phonology, syntax, morphology, semantics, and so on. This schema is reflected in discrete point theories of language testing, where it is assumed that an efficient language test must be aimed at only one skill (i.e., oral, reading, or writing) or only one modality of a skill (listening versus speaking).
The appropriateness of the unitary competence hypothesis can be deter¬
mined by factoring a number of language tests to a principal components solution.
Then the existence of a general factor can be tested by using the products of the
first factor to predict the original correlation matrix. If the difference between the
factor loading products and correlations is relatively small, this would indicate that
one general factor accounts for the majority of the common variance among all the
language tests.
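
In our notation (not the chapter's): if a_i is the loading of test i on the first principal component, the predicted correlation and the residual for each pair of tests i and j are

\[ \hat{r}_{ij} = a_i a_j, \qquad e_{ij} = r_{ij} - a_i a_j, \]

and the unitary competence hypothesis is supported to the extent that the residuals e_ij are uniformly small.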

Method
Subjects. All 182 students enrolled at the Center for English as a Second Language (Southern Illinois University) during the second term of the spring semester of 1977 were tested. The largest language backgrounds represented were Farsi, Arabic, Spanish, and Japanese. Only subjects actually enrolled in CESL classes were tested. Those who passed the normal placement tests at sufficiently high levels to be exempted, of course, were not included in the testing.

Procedure. Twenty-two test scores were obtained. Since nearly all the testing had to be done during class time and because the courses at CESL are limited to about 15 students each (spaced over five levels ranging from beginning to advanced), only 27 of the students completed all the tests. This was largely due to absences over the several weeks of testing. The smallest number of subjects completing any pair of tests, however, was 65 and the largest was 165.
The test data in all cases were keypunched, and two factoring procedures were applied—a principal components analysis was followed by a varimax rotation. The main issue was to find the factor solution that best explained the maximum amount of variance in the data. More specifically, the problem was to choose between the multiple-factor solution (the varimax rotation) and the single-factor solution (the first factor of the principal components analysis). Choosing the latter solution would eliminate the divisible competence hypothesis, and choosing the former would eliminate the unitary competence hypothesis.
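
As a concrete illustration of the two factoring steps, here is a minimal sketch in Python. It is ours, not the study's: it assumes the unrotated loading matrix from the principal components step is available as a numpy array, and the varimax update shown is the standard SVD-based one, which need not match the SPSS routine actually used in the project:

    import numpy as np

    def varimax(loadings, max_iter=100, tol=1e-6):
        # Kaiser's varimax criterion, maximized by iterated SVD updates
        L = np.asarray(loadings, dtype=float)
        p, k = L.shape
        R = np.eye(k)                      # accumulated orthogonal rotation
        crit = 0.0
        for _ in range(max_iter):
            Lr = L @ R
            u, s, vt = np.linalg.svd(
                L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
            R = u @ vt
            if s.sum() < crit * (1 + tol):  # stop when the criterion stops improving
                break
            crit = s.sum()
        return L @ R

    # Variance explained by the single first factor vs. the rotated factors:
    # (loadings[:, 0] ** 2).sum() versus (varimax(loadings) ** 2).sum(axis=0)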
In order to maximize the possibility of the multiple-factor result, multiple tasks requiring listening, speaking, reading, writing, and grammatical decisions were included among the tests. Here we concentrated primarily on the differentiation of skills rather than components of skills because of the previous results of Oller and Hinofotis which tended to show that the frequently posited components of phonology, vocabulary, and syntax probably do not constitute distinct sources of test variance. We thought perhaps it would be possible to sort out the skills of listening, speaking, reading, and writing, or possibly some combination of subsets of these skills such as listening/speaking versus reading/writing or perhaps listening/reading versus speaking/writing.

Tests.* The measures used will be discussed briefly here as they fall into the categories of listening, speaking, reading, writing, and grammar tasks. Excluding the scales from the Foreign Service Institute Oral Interview, two of the subtests from Harris’s Comprehensive English Language Test, and the CESL Reading Test, all the tests used are reproduced in the Appendix at the end of this volume. (Some of them are also used in several of the subsequent chapters—especially Chaps. 6, 7, 12, 14, 17, 23, and 24.) All that is given here is a brief description of each measure.

*Asterisked items appear in the Appendix at the end of this volume.

1. CELT Listening Comprehension. The CESL Placement Test contains two subtests from Harris’s Comprehensive English Language Test (CELT). The Listening Comprehension section from this test battery was the first of five listening tasks used in this study. (See also tests 14 and 21 below.)
*2. Listening Cloze (Open-Ended). Students heard each of three paragraphs two times. The first time they were instructed to just listen. On the second reading, item numbers were read at the beginning of each segment of text containing a deletion and the deleted word was replaced on the tape by the word “blank,” which was read with the same intonation as was originally used for the deleted word. Subjects were supplied with a numbered answer sheet and were asked to write down the missing words. They were not supplied, however, with any written version of the text. Here, and in nearly all the subsequent experimental tests, two or more passages ranging from relatively easy to difficult were used in order to maximize variability among the whole range of student abilities represented at CESL.
*3. Listening Cloze (Multiple-Choice Format). This series of tests was exactly parallel to the open-ended Listening Cloze except that the student was given an answer sheet, in this case containing multiple-choice alternatives for each blank in the recorded texts.
*4. Multiple-Choice Listening Comprehension. Students listened to three
taped passages, and after each they were asked to answer comprehension
questions by selecting the best choice from a field of four alternatives.

*5. Dictation. A standard dictation was administered and scored in the traditional way (one point for each correct word). For a recent discussion of the technique, see Oller (1979, Chap. 10), also his references.

6 to 10. Oral Interview. Oral Interviews were conducted following the guidelines of the Foreign Service Institute Oral Interview Procedure. Although the interviewers lacked FSI certification, they followed the procedure used by Educational Testing Service to train Peace Corps language proficiency testers in the FSI technique. See Chap. 7 for further discussion of the procedures followed. Five evaluative scales were used—Accent (6), Grammar (7), Vocabulary (8), Fluency (9), and Comprehension (10). Interviews were conducted by six teams consisting of two raters each. In all, it was possible to interview only 70 of the total 182 subjects because of scheduling conflicts and attrition. The interviews were conducted throughout the period of testing between February and April 1977.





*11. Repetition (Elicited Imitation). Each of three taped texts was first read in its entirety and then repeated in briefer segments while students were asked to repeat in pauses provided. Responses were taped at the CESL/SIU language laboratory and later were scored by exact and acceptable word methods (see Chap. 7).

*12. Oral Cloze Test (with Spoken Responses). Each of five passages was first read without deletions, then read again with every seventh word deleted. Students were asked to provide the missing words at pause points. Responses were taped and scored by exact and acceptable word methods.

*13. Reading Aloud. Students were asked to read three short paragraphs
aloud. Responses were taped and scored by the exact word method.

14. CESL Reading Test. This test is a modified version of the Science Research Associates Reading for Understanding Placement Test (Thurstone, 1963). It is used as part of the CESL placement battery (also including tests 1 and 21 of the tests described in this section).
Table 2-1 Principal Components Solution (without Iterations) for the Five Subtests of the FSI Oral Interview, the Three Subtests of the CESL Placement Test, and the Eighteen Subtests of the CESL Testing Project (N > 65, < 162)

Test                               Loadings on g factor*    h2
CELT Listening Comprehension .64 .41
Listening Cloze (Open-Ended) .65 .42
Listening Cloze (Multiple-Choice Format) .38 .14
Multiple-Choice Listening Comprehension .46 .21
Dictation .74 .55
Oral Interview—Accent .51 .26
Oral Interview—Grammar .83 .69
Oral Interview—Vocabulary .77 .59
Oral Interview—Fluency .75 .56
Oral Interview—Comprehension .84 .71
Repetition .71 .50
Oral Cloze (Spoken Responses) .75 .56
Reading Aloud .67 .45
CESL Reading Test .70 .49
Multiple-Choice Reading Match .74 .55
Standard Cloze .77 .59
Essay Ratings (by Teachers) .77 .59
Essay Score .74 .55
Multiple-Choice Writing .85 .72
Recall Rating .80 .64
CELT Structure .67 .45
Grammar (Parish Test) .82 .67

*Accounts for 51.4% of the variance in the total factor matrix.



Table 2-2 Correlation Matrix (above the Diagonal) and Predicted Correlations from Respective Products of Loadings on g (below the Diagonal)

Test 1 2 3 4 5 6 7

1 CELT Listening Comprehension .34 .29 .26 .48 .33 .43

2 Listening Cloze (Open-Ended) .42 .13 .11 .56 .42 .62

3 Listening Cloze (Multiple Choice Format) .24 .24 .77 .24 .06 .39

4 Multiple-Choice Listening Comprehension .29 .30 .17 .29 .02 .45

5 Dictation .47 .48 .28 .34 .25 .38

6 Oral Interview—Accent .33 .33 .19 .23 .38 .46


7 Oral Interview—Grammar .53 .54 .32 .38 .61 .42
8 Oral Interview—Vocabulary .49 .50 .29 .35 .57 .39 .64
9 Oral Interview—Fluency .48 .49 .29 .35 .56 .38 .62
10 Oral Interview—Comprehension .54 .52 .32 .39 .62 .43 .70
11 Repetition .45 .46 .27 .33 .53 .36 .59
12 Oral Cloze (Spoken Responses) .48 .49 .29 .35 .56 .38 .62
13 Reading Aloud .43 .44 .25 .31 .50 .34 .56
14 CESL Reading Test .45 .46 .27 .32 .52 .36 .58
15 Multiple-Choice Reading Match .47 .48 .28 .34 .55 .38 .61
16 Standard Cloze .49 .50 .29 .35 .57 .39 .64
17 Essay Ratings (by Teachers) .49 .50 .29 .35 .57 .39 .64
18 Essay Score .47 .48 .28 .34 .55 .38 .61
19 Multiple-Choice Writing .54 .55 .32 .39 .63 .43 .71
20 Recall Rating .51 .52 .30 .37 .59 .41 .66
21 CELT Structure .43 .44 .25 .31 .50 .34 .56
22 Grammar (Parish Test) .52 .53 .31 .38 .61 .42 .68


*15. Multiple-Choice Reading Match Test. Students were given multiple-choice tests over three passages of prose. Items were inserted in the texts requiring subjects to substitute an appropriate synonym or paraphrase for an underlined word, phrase, or clause.

*16. Standard Cloze. This score was based on the sum of scores on eight
cloze tests in a traditional format. Six of the passages used here are discussed in
greater detail in Chap. 12. The remaining texts are given in the Appendix.

17. Essay Rating (by Teachers). The students wrote an essay about an
accident, who was involved, who was at fault, etc. It was scored on a six-point scale
discussed in Chap. 14.

18. Essay Score. The above-mentioned essays were rescored on a more or less objective basis described in detail in Chap. 14.

*19. Multiple-Choice Writing Task. This task required three different performances on three multiple-choice tests over three passages each (nine in all).

Table 2-2 (continued)

8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
.48 .41 .46 .43 .58 .36 .61 .37 .33 .46 .43 .52 .49 .53 .45
.46 .61 .68 .57 .58 .41 .32 .40 .52 .38 .37 .54 .31 .24 .44
.36 .23 .23 .26 .41 .12 .30 .08 .14 .09 .28 .22 .15 .48 .26
.37 .25 .36 .34 .38 .27 .50 .20 .23 .25 .33 .26 .23 .47 .36
.39 .50 .50 .51 .61 .41 .39 .66 .73 .53 .66 .70 .51 .48 .69
.39 .54 .44 .42 .43 .34 .25 .17 .38 .30 .39 .38 .28 .32 .37
.89 .78 .87 .54 .50 .52 .79 .60 .51 .64 .47 .63 .64 .47 .57
.77 .84 .50 .47 .45 .46 .52 .39 .57 .50 .61 .61 .44 .52
.58 .82 .55 .50 .47 .37 .50 .43 .54 .52 .58 .52 .29 .49
.65 .63 .62 .58 .51 .40 .66 .50 .61 .62 .69 .61 .34 .59
.55 .53 .60 .68 .51 .37 .41 .39 .43 .50 .60 .52 .32 .56
.58 .56 .63 .53 .41 .45 .48 .49 .54 .62 .57 .49 .49 .59
.52 .50 .56 .48 .50 .56 .50 .58 .47 .38 .54 .62 .37 .57
.54 .53 .59 .50 .53 .47 .49 .51 .60 .41 .61 .63 .70 .62
.57 .56 .62 .53 .56 .50 .52 .73 .61 .55 .69 .62 .38 .64
.59 .58 .65 .55 .58 .52 .54 .57 .61 .61 .73 .71 .56 .74
.59 .58 .65 .55 .58 .52 .54 .57 .59 .65 .67 .69 .54 .60
.57 .56 .62 .53 .56 .50 .52 .55 .57 .57 .66 .54 .43 .55
.65 .64 .71 .60 .64 .57 .60 .63 .66 .65 .63 .63 .59 .68
.62 .60 .67 .57 .60 .54 .56 .59 .62 .62 .59 .68 .59 .70
.52 .50 .56 .48 .50 .45 .47 .50 .52 .52 .50 .57 .54 .65
.63 .62 .69 .58 .62 .55 .57 .61 .63 .63 .61 .70 .66 .55

The first subtest required selecting an appropriate word, phrase, or clause to form a continuation of the text. The second set of passages required editing of errors, and the third required correct ordering of words, phrases, or clauses.

*20. Rating of Recall Task. Three short paragraphs were each displayed for
one minute, using an overhead projector. Students were instructed to write down
what they had read, paying special attention to the meaning. Responses were
subjectively rated on a six-point scale (see Chap. 14 for amplification).

21. CELT Structure. This test is aimed at assessing knowledge of grammar in a fairly traditional sense of structural linguistics. It is a subtest on Harris’s Comprehensive English Language Test and is used by CESL as part of the placement battery.

*22. Grammar (Parish Test). A 127-item fill-in-blank test focusing on syntactically significant structures. This test can be construed as a discrete point grammar test in a modified cloze format. For detailed discussion see Oller (1979, Appendix).

Table 2-3 Residual Matrix with g Loadings Partialed Out (Mean of Absolute
Values = .082, SD = .067): Observed r minus Product of Loadings on g

Test 1 2 3 4 5 6 7

1 CELT Listening Comprehension -.08 .05 -.03 .01 .00 -.10
2 Listening Cloze (Open-Ended) -.11 -.19 .08 .09 .08
3 Listening Cloze (Multiple-Choice Format) .60 -.04 -.13 .07
4 Multiple-Choice Listening Comprehension -.05 -.25 .07
5 Dictation -.13 -.23
6 Oral Interview—Accent .04
7 Oral Interview—Grammar
8 Oral Interview—Vocabulary
9 Oral Interview—Fluency
10 Oral Interview—Comprehension
11 Repetition
12 Oral Cloze (Spoken Responses)
13 Reading Aloud
14 CESL Reading Test
15 Multiple-Choice Reading Match
16 Standard Cloze
17 Essay Ratings (by Teachers)
18 Essay Score
19 Multiple-Choice Writing
20 Recall Rating
21 CELT Structure
22 Grammar (Parish Test)

Results and Discussion


The results are presented in Tables 2-1 and 2-3. The principal component displayed in Table 2-1 reveals that 51.4% of the total variance can be accounted for by the g factor.1 In other words, slightly over half of the variance associated with all 22 tests can be attributed to a single general factor. Along with the loadings of each subtest on g, Table 2-1 also gives the squares of these loadings (h2). Each loading, of course, is the correlation between the test in question and the hypothetical unitary factor underlying the entire group of tests. Thus the Oral Interview Comprehension scale correlates at .84 with g. The squared loadings (h2) indicate the amount of common variance which exists between each subtest and the hypothetical general factor. Accordingly, the Oral Interview Comprehension scale accounts for 71% of the variance in g, or conversely g accounts for 71% of the variance of the Oral Interview Comprehension scale.
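
In symbols (our restatement): with a_i the loading of test i on g,

\[ h_i^2 = a_i^2, \qquad \text{e.g., } (.84)^2 \approx .71 \]

for the Oral Interview Comprehension scale.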
It can be seen that nearly all the tests share a good deal of variance with the
general factor. Exceptions are the Multiple-Choice Listening Comprehension and
the Listening Cloze (Multiple-Choice Format) scores (with correlations of .46 and

Table 2-3 (continued)

8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

-.01 -.07 -.08 -.02 .10 -.07 .16 -.10 -.16 -.03 -.04 -.02 -.02 .10 -.07
-.04 .12 .16 .11 .09 -.03 -.14 -.08 .02 -.12 -.11 -.01 -.15 -.20 -.09
.07 -.06 .01 -.01 .12 -.13 .03 -.20 -.15 -.20 .00 -.10 -.15 .23 -.05
.02 -.10 -.03 .01 .03 -.04 .18 -.14 -.12 -.10 -.01 -.13 -.08 .16 -.02
-.18 -.06 -.12 -.02 .05 .09 -.13 .11 .16 -.04 .11 .07 -.08 -.02 .08
.00 .16 .01 .06 .05 .00 -.11 -.21 -.01 -.09 .01 -.05 -.03 -.02 -.05
.25 .16 .17 -.05 -.12 -.04 -.09 -.01 -.13 .00 -.14 -.08 -.02 -.09 -.11
.19 .19 -.05 -.11 -.07 -.08 -.05 -.20 -.02 -.07 -.04 -.01 -.08 -.11
.00 .02 -.06 -.03 -.16 -.06 -.15 -.04 -.04 -.06 -.08 -.21 -.13
.02 -.05 .03 -.19 .04 -.15 -.04 .00 -.02 -.06 -.22 -.10
.15 .03 -.13 -.12 -.16 -.12 -.03 .00 -.05 -.16 -.02
-.09 -.08 -.08 -.09 -.04 .06 -.07 -.11 -.01 -.03
.09 .00 .06 -.05 -.12 -.03 .08 -.08 .02
-.03 -.03 .06 -.11 .01 .07 .23 .05
.16 -.06 .00 .06 .03 -.12 .03
.02 .04 .07 .09 .04 .11
-.08 .02 .07 .02 -.03
.03 -.05 -.07 -.06
-.05 .02 -.02
.05 .04
.10

.36, and h2’s of .21 and .14, respectively), and the FSI Oral Interview Accent scale (with a correlation of .51, h2 = .26).
Following the method recommended by Nunnally (1967) and applied by Oller and Hinofotis (Chap. 1), Table 2-2 presents the actual correlations between subtests above the diagonal, while below the diagonal the predicted correlations, based on the products of the tests’ g loadings, are given. The Residual Coefficient Matrix in Table 2-3 presents the residual that remains of each correlation after the products of the loadings on g are subtracted from the original correlations. Thus the product of the loadings for the Essay Rating and the CESL Reading Test is .54, while the actual correlation between these two tests is .60. A positive residual of .06 remains after the variance which can be attributed to g has been removed. This operation is performed for all the products of loadings on g.
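
Using the loadings in Table 2-1, the worked example reads (our restatement):

\[ \hat{r} = .77 \times .70 \approx .54, \qquad e = .60 - .54 = .06. \]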
An examination of Table 2-3 reveals that relatively small residuals remain, once the g factor has been extracted. The most notable exception is the residual for the Listening Cloze (in multiple-choice format) and the Multiple-Choice Listening Comprehension subtests. However, these high residuals may be due to unreliability in these tests.

Table 2-4 Varimax Rotated Solution (without Iterations) for the Five Subscales of the FSI Oral Interview, the Three Subtests of the CESL Placement Test, and the Eighteen Subtests of the CESL Testing Project (N = 65 to 162 subjects)*

Test Factor 1 Factor 2 Factor 3 Factor 4

CELT Listening Comprehension .46 .56
Listening Cloze (Open-Ended) .42 .44 .56
Listening Cloze (Multiple-Choice)
Multiple-Choice Listening Comprehension
Dictation .84
Oral Interview—Accent .72
Oral Interview—Grammar .85
Oral Interview—Vocabulary .83
Oral Interview—Fluency .76
Oral Interview—Comprehension .39 .81
Repetition .38 .34 .58
Oral Cloze (Spoken Responses) .46 .62
Reading Aloud .34 .38 .49
CESL Reading .80
Multiple-Choice Reading Match .74 .42
Standard Cloze .77 .41
Essay Ratings .50 .43 .50
Essay Score .63
Multiple-Choice Writing .63 .37 .41
Recall Rating .43 .42 .63
CELT Structure .74
Grammar (Parish Test) .60 .51

*Only factor loadings above .32 (p < .05, with 65 df) are reported. The significant loadings on all four factors account for 57% of the total variance in all the tests.

They were administered near the end of the testing period when student and staff fatigue was near its limits.
Allowing a certain degree of error variance due to unreliability in each test, there is surprisingly little residual variance once the g factor has been accounted for. Despite the fact that the tests can be grouped into the traditional areas of listening, speaking, reading, and writing, separate factors for these skill areas do not emerge. The results fail to reveal any unique variance which can be associated with separate language skills or modalities. The large amount of overlapping variance between the tests tends to support the unitary competence hypothesis.
Another way of factoring the data is to apply a varimax rotation to the principal components solution. This technique sets up uncorrelated reference vectors and differentiates the mathematical factors thus defined by maximizing the loading of each contributing test on one and only one factor. Hence, if the divisible competence hypothesis were correct, we might expect listening, speaking, reading, and writing tasks to load on different factors, or we might expect phonological, syntax, and vocabulary measures to load on separate factors. As shown in

Table 2-4, all the scales of the FSI type Oral Interview, with the exception of the accent rating, load heavily on Factor 2. The three subparts of the CESL Placement Test load mostly on Factor 4. However, the eighteen experimental tests are scattered over all four factors in no discernible pattern which could be associated with the divisible competence hypothesis. For instance, Factor 1 receives significant loadings from two listening tasks (Listening Cloze, Open-Ended, .42; Dictation .84), four speaking tasks (Oral Interview Comprehension .39, Repetition .38, Oral Cloze .46, Reading Aloud .34), two reading tasks (Multiple-Choice Reading Match .74, Standard Cloze .77), all four writing scores (Essay Ratings .50, Essay Score .63, Multiple-Choice Writing .63, and Recall Rating .43), and finally from Parish’s Grammar Test (.60). The remaining three factors shown in Table 2-4 are similarly complex. The significant loadings on the four factors taken altogether account for 57% of all the variance in all the tests. This represents only a 6% gain over the single-factor solution.
In conclusion, no clear pattern appears in which tests are grouped according to the posited skills of listening, speaking, reading, and writing, or components of phonology, lexicon, or grammar. The data seem to fit best with the unitary competence hypothesis, and the divisible competence hypothesis is thus rejected.

Note

1. Editors’ note: The factor solutions presented in Tables 2-1 and 2-4 are based on pairwise deletion of missing cases (Nie, Hull, Jenkins, Steinbrenner, and Bent, 1975). They differ somewhat, therefore, from those presented in Oller 1979 (Appendix, Tables 1 and 2), which were based on listwise deletion of missing cases. However, the unitary factor solution is preferred in both cases. For further discussion, see Oller (1979), especially the section of the Appendix entitled “The Carbondale Project.”
Chapter 3

Separating the g Factor from Reading Comprehension

Douglas E. Flahive

This study examines the relationship between scores on a nonverbal IQ test, three reading tests, and the Test of English as a Foreign Language. Subjects were nonnative speakers of English at CESL/SIU. The nonverbal IQ measure employed was Raven’s Progressive Matrices Test. One reading test was of the traditional multiple-choice type, the second a paraphrase task, and the third a cloze test. The TOEFL and nonverbal IQ scores were used to predict reading comprehension scores on the three reading tasks. The nonverbal IQ scores accounted for nearly the same amount of variance as TOEFL scores when the traditional multiple-choice reading test was used as the criterion variable. Considerably less variance was accounted for in predicting paraphrase and cloze test scores. Results suggest that traditional multiple-choice reading tests are not simply tests of language proficiency but are also tests of nonverbal intelligence.

This chapter addresses topics that have been focal points of extensive educational research since the beginning of this century: intelligence and language comprehension. Despite this extensive research, little agreement exists among researchers concerning even the basic definitions of the terms. Is intelligence a static construct measured by performance on standardized IQ tests as Jensen claims, or is it a dynamic construct as Piaget and his followers claim? And what about language comprehension? What is it, and how is it measured? If one reads and attempts to synthesize the numerous studies that have addressed the second question, the frustrating conclusion is that there is no universally accepted definition of language comprehension, nor is there a universally valid technique for assessing language comprehension.


While attempts to define and measure language comprehension have been numerous and varied, a persistent question runs through many of the efforts. The question concerns the role of intelligence in language comprehension. Based on the pioneering work of Davis (1944), many of the same factors associated with reading comprehension are also known to be associated with intelligence. Carroll (1972), in summarizing nearly three decades of research into questions of language comprehension, concludes that reading comprehension is dependent on two factors: grammatical knowledge and inferential ability. In this study, an attempt is made to determine how much of the variability in the reading scores by nonnative speakers of English can be predicted on the basis of scores on a nonverbal IQ test.

Subjects. Twenty students enrolled in a semi-intensive English class were administered five tests over a one-week period. The students represented seven different native language backgrounds. Their TOEFL scores ranged from 437 to 568.

Materials. The Raven’s Progressive Matrices (Standard Form) was selected as the intelligence measure for two major reasons: first, it has proved to be a reliable measure of intelligence in a wide variety of non-English-speaking settings;1 and second, the test is language-free. Using a language-free test of intelligence seemed to be more practical than trying to find and administer comparable intelligence tests in the several native languages of the subjects. The test requires the examinee to look at a pattern which has a portion missing and select from among eight choices the one which best completes the larger pattern. The test contains five sections with twelve items in each section. The items become progressively more difficult as the subject advances through each section. As an added control to make sure that the test results were relatively free from language bias and that the subjects fully understood the task requirements, directions were given in the native languages of the subjects. The test was group-administered, and the subjects were allowed 45 minutes to complete it. The maximum possible score is 60. The Test of English as a Foreign Language was selected in order to measure overall English proficiency. This examination, in multiple-choice machine-scorable form, has five parts—Listening Comprehension, English Structure, Vocabulary, Reading Comprehension, and Writing Ability. Only the total score was used in this study.
In addition to the TOEFL and the Raven’s, three reading comprehension tests were administered: the paragraph comprehension portion of the McGraw-Hill Basic Skills System Reading Test (Raygor, 1970); a paraphrase recognition test developed by K. Perkins and C. Yorio; and a cloze test developed by the author. The McGraw-Hill Test is designed to measure a subject’s ability to make inferences, to pick out main thoughts and supporting ideas, and to discover organizational patterns in paragraphs and essays. It contains a total of 10 passages—5 long (approximately 250 words each) and 5 short (approximately 75 words each) followed by 30 multiple-choice questions. Subjects were allowed 40 minutes. The Perkins-Yorio test (which appears below as Appendix 3A) is a 50-item paraphrase identification test with items designed to measure comprehension of specific grammatical structures. Perkins (1976) gives further details concerning the rationale for this test. Subjects were allowed 30 minutes to complete it. The third reading comprehension task, a cloze test (Appendix 3B), was a passage of just over 400 words taken from an introductory economics text. It was modified slightly to keep the text at an appropriate difficulty level. After two lead sentences, every 7th word was deleted for a total of 50 blanks. The appropriate-word scoring method was used and subjects were allowed 30 minutes.

Table 3-1 Means and Standard Deviations for a Nonverbal Intelligence Measure (Raven’s), the Test of English as a Foreign Language, and Three Reading Tasks (N = 20)

Test                            Mean      SD
Raven’s Progressive Matrices    49.90     6.34
TOEFL                          509.40    38.80
McGraw-Hill reading test        15.50     4.28
Perkins-Yorio test              41.20     4.63
Cloze test                      18.85     7.44
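
The deletion procedure just described is mechanical enough to sketch in code. The following minimal Python sketch is ours: the function name and its naive sentence handling are illustrative, not the procedure actually used to prepare the test materials:

    def make_cloze(text, nth=7, lead_sentences=2):
        # Keep the lead-in sentences intact, then delete every nth word,
        # replacing it with a numbered blank and recording the answer key.
        sentences = text.split(". ")
        lead = ". ".join(sentences[:lead_sentences]) + ". "
        words = ". ".join(sentences[lead_sentences:]).split()
        blanks, out = [], []
        for i, word in enumerate(words, start=1):
            if i % nth == 0:
                blanks.append(word)
                out.append("___%d___" % len(blanks))
            else:
                out.append(word)
        return lead + " ".join(out), blanks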

Results and Discussion


The means and standard deviations for each of the five tests are presented in Table 3-1. The mean score on the nonverbal IQ test places this group just above the 75th percentile of the reference group upon which the test was normed (Raven, 1960). One can reasonably conclude that the subjects were above average in intelligence. The reader is free to interpret the degree of English language proficiency which is represented by the TOEFL score.* Given the range of scores, the only firm conclusion that can be drawn is that the subjects ranged from highly to marginally proficient in overall English skills. This interpretation is confirmed by

Table 3-2 Intercorrelations of the Nonverbal Intelligence Measure (Raven’s), TOEFL, and Three Reading Tasks (N = 20)

                      Raven’s   TOEFL   McGraw-Hill test   Perkins-Yorio test   Cloze
Raven’s                          .61        .84                 .68              .61
TOEFL                                       .59                 .84              .75
McGraw-Hill test                                                .67              .65
Perkins-Yorio test                                                               .62
Cloze

p < .01 for all correlations.

*Editors’ note: TOEFL scores are converted to a standardized scale with a mean of 500 and a standard deviation of 100 based on scores of the original reference population tested in 1964 (ETS, 1973). However, the average score for all candidates tested between October 1977 and June 1978 was 492 with a standard deviation of 80.

the results of the McGraw-Hill test, which ranks these students from the 9th to the 85th percentile when compared against a reference group of college freshmen and sophomores. In fact, the mean of the latter group of native speakers of English falls at the 22nd percentile of Raven’s reference group. Performance on the Perkins-Yorio test was only slightly lower than scores achieved by college freshmen who were native speakers of English.
The correlations among the five measures are seen in Table 3-2. The fact that all the measures are highly correlated is not surprising, since three of the tests are purportedly testing the same skill, reading. Nor are the correlations between the Perkins-Yorio test and the TOEFL and the cloze and TOEFL surprising. They are all measures of language proficiency. What is surprising is the high correlation between the Raven’s and the McGraw-Hill test. Over 70% of the variance in the McGraw-Hill test scores is accounted for by scores on the nonverbal intelligence test. To contrast the power of the Raven’s as a predictor of scores on the three reading tests with that of the TOEFL as predictor of the reading scores, three separate

Table 3-3 Results of Regression Analyses Using Raven’s and TOEFL to Predict Reading Scores (N = 20)

                                                    R2
Raven’s and TOEFL to predict McGraw-Hill           .716
Raven’s alone to predict McGraw-Hill               .707
Raven’s and TOEFL to predict Perkins-Yorio         .754
Raven’s alone to predict Perkins-Yorio             .457
Raven’s and TOEFL to predict Cloze                 .598
Raven’s alone to predict Cloze                     .371

p < .01 for all R2 values.

regression analyses were run with the Raven’s and the TOEFL as predictors in the full models and the Raven’s alone in the restricted models. Results of these analyses are given in Table 3-3. Almost as much variance in the McGraw-Hill test was accounted for by the Raven’s alone as by the Raven’s and TOEFL combined. However, in attempting to predict scores on the Perkins-Yorio and the cloze, significantly less variance was accounted for by the Raven’s without the TOEFL. From the description of the tests given above, the reasons for these results seem rather clear. Neither the cloze nor the Perkins-Yorio requires the complex reasoning involved in the McGraw-Hill multiple-choice reading test.
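
The full-versus-restricted comparison can be made concrete with a short sketch (ours; the variable names raven, toefl, and mcgraw stand for hypothetical arrays of the 20 subjects' scores):

    import numpy as np

    def r_squared(X, y):
        # R2 from an ordinary least-squares fit of y on X (with an intercept)
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        residuals = y - X1 @ beta
        return 1 - residuals.var() / y.var()

    # Full model: both predictors; restricted model: Raven's alone.
    # full = r_squared(np.column_stack([raven, toefl]), mcgraw)  # .716 in Table 3-3
    # restricted = r_squared(raven[:, None], mcgraw)             # .707 in Table 3-3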

Conclusion

With the exception of the relatively high correlation between reading scores and
the nonverbal test of intelligence, the results of this study are, for the most part, not

unexpected. It is reasonable to conclude that reading subtests found on widely used measures of nonnative language ability are not simply tests of reading ability. They are also tests of intelligence. If the multiple-choice format for reading comprehension tests which is found on many second language tests including the Michigan Test of English Language Proficiency and the TOEFL also assesses intelligence, the validity of such instruments must be questioned. In a recent study, Johnson (1977) reports that when the TOEFL was administered to native speakers, subscores for listening comprehension, grammar, and vocabulary were close to their maximum. Scores on the reading subtest were significantly lower. From this and other evidence (Genesee, 1976), it should be clear that traditional tests of reading comprehension are testing not only language proficiency but also intelligence.
While it would be rash to suggest that all traditional multiple-choice reading
tests are simply disguised intelligence tests, the findings reported in this study
suggest that additional studies be undertaken to determine more precisely the role
of intelligence in reading comprehension. In addition, the findings also suggest
that the role of intelligence in other areas of language processing and in other
techniques of language testing be examined to provide teachers and researchers
alike with a clearer picture of where language proficiency ends and intelligence
begins.

Note

1. The author of the Raven’s claims that it is not a true test of intelligence but rather a “test of observation and clear thinking.” Then he goes on to say that it correlates at the .86 level with the Terman-Merrill scale and has a g saturation of .82.

Appendix 3A
Perkins-Yorio Test
1. The athlete got plenty of rest after the race.
a. The athlete received a lot of money after the race.
b. Someone gave the athlete an important gift after the race.
c. The athlete enjoyed plenty of relaxation after the race.
d. The athlete became very famous after the race.

2. John should have bought the car.


a. John bought the car.
b. John didn’t buy the car.
c. John is going to buy the car.
d. John should buy the car.

3. Tom was told to turn the lights out before closing the lab.
a. Somebody told Tom to turn the lights out before closing the lab.
b. Tom decided to turn the lights out before closing the lab.
c. Turning the lights out before closing the lab was Tom’s idea.
d. Tom had told somebody to turn the lights out before closing the lab.

4. It’s important to remember that every foreign student should be registered with the government by January 31 of each year.
a. It’s not necessary for foreign students to ever register with the government.
b. All foreign students must register with the government at the beginning of each year, and that’s important.
c. Foreign students should remember that registration with the government is not important.
d. Foreign students may or may not register with the government by January 31 of each year.

5. The famous singer ended her recital with a folk song.


a. The first song the famous singer performed was a folk song.
b. The famous singer began her recital with a folk song.
c. The famous singer started her recital with a folk song.
d. The famous singer finished her recital with a folk song.

6. The man took my mother’s bag and ran away.


a. The man gave my mother a bag.
b. My mother gave the man a bag.
c. My mother and the man ran away.
d. My mother had a bag and the man took it.

7. Alice has been a secretary for five years.


a. Alice was a secretary.
b. Alice used to be a secretary.
c. Alice is a secretary.
d. Alice was never a secretary.

8. If they had had more time, Bob and David would have visited some of their
friends in Chicago.
a. Bob and David didn’t visit their friends.
b. Bob and David visited all of their friends.
c. Bob and David visited some of their friends.
d. Bob and David don’t have any friends in Chicago.

9. The man promised to tell the truth to the committee.


a. The man said that he would tell no lies to the committee.
b. The man said he wouldn’t talk to the committee.
c. The man didn’t promise to tell what actually happened.
d. The man said he would tell the committee the truth.

10. Sam gave John the book his brother bought him several years ago. Who has
the book?
a. John. c. Sam’s brother.
b. Sam. d. John’s brother.

11. The new textbook has been written by one of the teachers at the Institute.
a. The Institute wrote the new textbook.
b. No teacher at the Institute has ever written a textbook.
c. There is a new textbook and a teacher at the Institute wrote it.
d. Teachers at the Institute have never written textbooks.

12. That Carlos and Juan got 80 on the proficiency test without studying very
much surprised the rest of the class.
a. Nobody was surprised that Carlos and Juan got 80 on the proficiency
test because they had studied a lot.
b. Carlos and Juan were surprised that they got 80 on the proficiency test
because they hadn’t studied very hard.
c. Although Carlos and Juan studied very hard, they were surprised that the rest of the class got 80 on the proficiency test.
d. Everybody in the class was surprised that Carlos and Juan got 80 on the
proficiency test because they had not studied very hard.

13. Our visit to the country was very restful because we had been working very hard.
a. We were never able to relax in the country.
b. Our visit to the country was bad for our nerves.
c. We were very relaxed after our visit to the country because we didn’t
have to work there.
d. We were very tired in the country because we had to work there.

14. The house near the shore of the lake was destroyed by the storm last night.
a. The storm destroyed the shore of the lake last night.
b. The storm destroyed the house near the lake last night.
c. The lake destroyed the house near the shore last night.
d. There was a storm last night and it destroyed the shore of the lake.

15. The judge determined that the report contained an obvious untruth.
a. The judge showed that the material in the report was clearly correct.
b. The judge didn’t prove that the report was true.
c. The judge showed the error that the report contained.
d. Everybody knew that the report was full of lies except the judge.

16. “Had Tom known they were coming, would he have waited for them?” asked
Mary.
a. Tom knew they were coming.
b. Tom didn’t know they were coming.
c. Mary is asking if Tom knew they were coming.
d. Tom is asking if they were coming.

17. Tom met Jane at his best friend’s brother’s house. Where did Tom meet Jane?
a. at Tom’s friend’s house.
b. at Jane’s house.
c. at the house of a brother of Tom’s friend.
d. at Tom’s brother's house.

18. The mailman carried the bag which contained the letters.
a. The mailman had a bag; there were no letters in it.
b. The mailman carried the letters in a bag.
c. The mailman carried an empty bag.
d. The mailman carried the letters in his hand.

19. Tom did his homework and Mary did too.


a. Tom did his homework but Mary didn’t.
b. Tom did not do his homework but Mary did.
c. Tom did his homework and Mary did hers.
d. Tom didn’t do his homework and Mary didn’t either.

20. “Never will I repeat a thing like that,” said Alice to her parents.
a. Alice is making a promise to her parents.
b. Alice is asking her parents a question.
c. Alice’s parents are asking her a question.
d. Alice’s parents are making her a promise.

21. I didn’t know that Mac hadn’t been killed after all.
a. Mac was killed but I didn’t know it.
b. Mac wasn’t killed and I knew it.
c. I knew that Mac was dead.
d. I didn’t know that Mac was alive.

22. Visiting relatives can be boring.


a. Relatives who are visiting are interesting.
b. Relatives who are visiting are not boring.
c. To visit relatives is enjoyable.
d. To visit relatives is not enjoyable.

23. “Could you open the door, please.”


a. I know that something will happen in the future.
b. I am asking if you were able to open the door.
c. I know that something happened in the past.
d. I am asking you to open the door.

24. Tom said that he doesn’t beat his wife anymore.


a. He admitted that he had beaten his wife in the past.
b. He said that he had never beaten his wife.
c. He said that he still beats his wife.
d. He didn’t say that he’d stopped beating his wife.

25. “Barry goes to the movies every night; we seldom do.”


a. Barry seldom goes to the movies.
b. Barry and we go to the movies every night.
c. Barry and we seldom go to the movies.
d. We do not go to the movies very often.

26. My friend’s sister lives in Birmingham, which is a suburb of Detroit.


a. My friend lives in a suburb of Detroit.
b. I have a sister who lives in Birmingham.
c. My friend has a sister who lives in Birmingham.
d. My sister has a friend who lives in Birmingham.

27. The man who won the contest in 1971 married the girl who won the contest
in 1972.
a. The man won the contest in 1972 and then married the girl.
b. The girl won the contest in 1972 and then married the man.
c. The man married the girl and she won the contest in 1971.
d. The man won the contest in 1971; the girl won the contest in 1972, and
now they are married.

28. The joke wasn’t very funny; yet, the audience laughed unendingly.
a. There was no end to the audience’s laughter.
b. The audience didn’t laugh at the end of the joke, because it wasn’t very
funny.
c. The audience stopped laughing when they realized the joke wasn't very
funny.
d. The audience started laughing before the bad joke ended.

29. Paul asked his brother’s advice because he couldn’t decide which car to buy.
a. Paul had difficulty deciding which car to buy.
b. Paul knew which car to buy.
c. Paul asked his brother to buy the car.
d. Paul’s brother bought a car following his advice.

30. I saw my father’s business associate’s calling-card lying on the floor.


a. I saw my father and his business associate lying on the floor.
b. I saw my father’s associate lying on the floor.
c. I saw my father’s calling-card lying on the floor.
d. I saw my father’s associate’s card lying on the floor.

31. Tom hadn’t intended on staying up so late but his friends didn’t leave until
midnight.
a. Tom wanted to stay up until midnight.
b. Tom wanted to go to bed early.
c. Tom didn’t want to go to bed early.
d. Tom’s friends left early so he went to bed.

32. “Henry must have arrived already.”


a. I conclude that Henry has arrived.
b. Henry should arrive.
c. Henry had the obligation to arrive.
d. It is impossible for Henry to have arrived.

33. The police found Alice and Tom in my father’s car.
a. My father found Alice and Tom in his car.
b. My father has a car and the police found Alice and Tom in it.
c. My father found Alice and Tom in a police car.
d. My father’s car was found by the police.

34. Bill is too tired to talk to.


a. We can’t talk to Bill because we are too tired to talk.
b. Bill can’t talk to us because we are tired.
c. We can’t talk to Bill because he is too tired.
d. We are too tired to talk.

35. The salesman’s untruthfulness surprised us.


a. The salesman was truthful and that didn’t surprise us.
b. The salesman was a liar and that didn’t surprise us.
c. The salesman’s instructions were to tell the truth and he did.
d. We were surprised at the salesman’s lies.

36. Jim was said to have been seen leaving the scene of the crime accompanied
by a blonde woman.
a. Somebody said that Jim had seen the blonde woman leaving the scene of
the crime.
b. Jim and a blonde woman said that they had seen someone leaving the
scene of the crime.
c. A blonde woman said that she had seen Jim accompanied by someone
leaving the scene of the crime.
d. Somebody said that they had seen Jim and a blonde woman leaving the
scene of the crime.

37. Neither her daughter nor her son was at home when Mrs. Wilson returned.
Who was at home?
a. Mrs. Wilson’s son. c. No one.
b. Mrs. Wilson’s daughter. d. Both son and daughter.

38. The rain seemed endless during the spring.


a. The rain continued to fall on through the spring.
b. The rain stopped falling soon after the spring began.
c. There was no rain during the spring because we had an unusual dry
season.
d. It rained only once during the spring.

39. The John Smith who introduced the speaker who received the award is not
the same John Smith who received the first award ever given two years ago.
a. John Smith received the first award two years ago and he introduced the
speaker.
b. John Smith introduced the speaker, the speaker received the first award
two years ago.
c. John Smith introduced the speaker and another John Smith received
the first award two years ago.
d. The speaker introduced John Smith and he had received the award two
years ago.

40. Although John hates vegetables, he ate the beans because Mary made them.
a. John likes beans.
b. John likes Mary.
c. John doesn’t like Mary.
d. John likes vegetables, except for beans.

41. We will never take a test again without studying.


a. We promise never to take a test again without studying.
b. In the future we will never study for a test.
c. We are going to take a test but we are not going to study.
d. We won’t take a test again if we have to study.

42. I can’t pay my bill for three more weeks.


a. I’ll never be able to pay my bill.
b. I paid my bill three weeks ago.
c. I’ll be able to pay my bill in three weeks.
d. I will not pay my bill.

43. His father is said to have been a genius.


a. A genius said that he had been his father.
b. His father told people that he was a genius.
c. People said his father was a genius.
d. His father said that he had been a genius.

44. You got a C but you can do a lot better.


a. You can do better than I can do.
b. You can do better than you actually did.
c. You can do better than I did.
d. You can’t do any better than you did.

45. Flying airplanes can be dangerous.


a. Airplanes that are flying can be dangerous.
b. Airplanes that are flying are safe.
c. To fly airplanes is safe.
d. To fly airplanes is not dangerous at all.

46. If I knew the answer I would tell her.


a. I don’t know the answer.
b. I will know the answer.
c. I knew the answer.
d. I have known the answer for a long time.

47. Tom didn’t manage to close the door.


a. The door was open and it is now closed.
b. The door was open and it still is.
c. The door was closed and it is now open.
d. The door was closed and it remained closed.

48. I saw my mother’s sister’s dog. Who owned the dog?


a. My aunt.
b. My mother.
c. Me.
d. My sister.

49. That important steps should be taken to solve the pollution problems that
affect our cities is clear to everyone.
a. Nobody thinks that the pollution problems that affect our cities are
really serious.
b. Everybody thinks it is clear that something ought to be done to stop
pollution in our cities.
c. The pollution problems that affect our cities are not very serious, and
everybody thinks that’s very clear.
d. Not everyone is sure that there is a pollution problem in the cities.

50. The lion’s constant restlessness fascinates the crowd.


a. The lion rests all the time.
b. The lion never rests.
c. The lion is never restless.
d. The crowd enjoys seeing the lion resting.

Appendix 3B
Cloze Test
The Capital Crisis

Economists have long agreed that the basis of the capitalistic system is, quite
simply, privately owned capital. Capital is simply money that comes from savings
and is used for investment. Now, however, economists have begun to __1__ the
causes and implications of a unique __2__. For the first time in more __3__ half
a century, the United States, __4__ some other industrial economies, faces a
shortage __5__ the capital that is needed to __6__ new goods, profits, and jobs.
The shortage __7__ capital poses a threat to __8__ survival of the country’s
economic system. __9__ immediately, it raises doubts about the __10__ ability to
recover from the latest recession __11__ push unemployment down to about 5%
__12__ the work force and keep it __13__ that level. Ultimately the formation of
__14__ is the formation of jobs. __15__ down of the pace at which __16__
nation saves its earnings and invests __17__ resources results in reduced
economic activity __18__ fewer jobs. Yet the amount of __19__ devoted to invest¬
ment in the United States is __20__ that found in other major industrialized
__21__. Furthermore, the need for additional goods __22__ and technology has
never been greater. In the __23__ few years, enormous amounts of money __24__
be required to develop alternative sources __25__ energy such as nuclear power,
improve pollution __26__, erect more efficient factories, and __27__ rebuilding
the decaying cities.

But __28__ the funds be available? Although economists __29__ been debat¬
ing the question for some __30__, the issue has only recently attracted the __31__
of the federal government. Some government __32__ have begun to recognize that
the current __33__ of capital could cause an economic __34__ far more serious
than ever before __35__ the history of the country.

Basically, __36__ are two ways of approaching the __37__. Some liberal
economists advocate a system __38__ would encourage investors to invest their
money __39__ high priority projects such __40__ energy development. These
economists say that with such __41__ system available capital could be __42__
more effectively than it is today. Conservative economists, __43__, maintain that
such a system is unworkable and __44__ seriously harm the capitalistic system.
__45__ solution to the problem is __46__ lower the tax on profits earned from
__47__ investments. In this way they hope to __48__ investment in all areas of the
economy. __49__ this time it is impossible to __50__ which of the two approaches
will work. The one fact that is certain is that some type of action is necessary soon if
an economic crisis is to be averted.
Chapter 4

An Analysis of
Various ESL Proficiency Tests

Kay K. Hisama

Two different methods of analysis were applied to the results of four ESL
proficiency tests. The profile method showed that the tests produced
somewhat different score patterns in relation to proficiency levels and native
language backgrounds of the subjects tested. On the other hand, a principal
components analysis revealed a single source of variance across all the tests
in spite of their different formats. It was concluded that score pattern
differences across the different tests may be partly attributable to minimal
sampling “biases” and unreliabilities in the tests. Nevertheless, all four tests
investigated reveal substantial reliability and validity as measures of a single
global proficiency factor.

Testing is an essential part of instruction. The most popular use of standard¬
ized tests is to group or place students for instructional purposes. There are three
major English proficiency tests for nonnative speakers on the market. The two
most widely used tests are the Test of English as a Foreign Language (TOEFL) and
the Michigan Test of English Language Proficiency (often referred to as the Michi¬
gan test). A recent addition is the Comprehensive English Language Test (CELT).
All these tests consist of several subtests. In the case of the TOEFL, there are
five—Listening Comprehension, English Structure, Vocabulary, Reading Com¬
prehension, and Writing.
The term “battery” is conventionally applied to such a set of separate tests to
be administered to a group of individuals. It should be recognized, however, that
even though each subtest is separately named and different subscores are usually
reported, they are usually combined into a total score for the purpose of admis¬
sion, placement, or other decisions. Nevertheless, the practice of using multiple
subtests is believed to be a sound one in the light of much previous research. For


example, the results of several tests are likely to give a better prognosis of college
achievement than any single test. The most obvious rationale for multiple
measurement is that any human ability is complex and thus requires a wide range
of item samples. It is believed, therefore, that a single set of test items or a single
test may fail to measure a very complex human ability such as language profi¬
ciency.
In the case of tests of English as a second language, however, there seems to
be a large amount of overlap in the information given by the subtests. Such overlap
is demonstrated by the rather high intercorrelations usually observed among them.
In the Seventh Mental Measurements Yearbook, for example, Chase (1972) noted
the substantial intercorrelations among the subtests of the TOEFL. Two possible
explanations were given: (1) there is an obvious overlap in format from test to test,
or (2) many common behaviors are required by the various subtests in the battery.
The same explanations can be offered for the overlap in variance on the parts of the
other two tests mentioned above, the CELT and the Michigan test.
These empirical facts as well as recent theoretical developments in language
and communication have generated much controversy among test specialists as to
how ESL proficiency should be measured. In response to the lead of Carroll
(1961), a substantial controversy has developed in terms of the contrast between
the discrete point and the integrative approaches to language testing. Certainly,
there is room for more empirical study. This investigation attempts to shed some
light on the multiple measurement of ESL proficiency based on empirical data ob¬
tained from the Center for English as a Second Language (CESL) at Southern
Illinois University, Carbondale.

Method
Subjects. A total of 136 nonnative students at CESL served as subjects. They
were tested during three separate six-week terms in October and November 1975
and January 1976. Two criteria for selection were imposed: subjects had to be new
entering students at CESL, and they had to be recent arrivals to the United States
with no more than one month in residence at the time of testing. The test scores of
subjects who met these criteria were retained for analysis.
Language groups included: Arabic (N = 29), Farsi-Persian (N = 49), Spanish
(N = 27), African (N = 10), and Asian (N = 18). The latter group included speakers
of Japanese, Chinese, Vietnamese, and Thai. The remaining three subjects were
all from different language backgrounds.

Test Instruments. Four tests were administered upon entrance to CESL: the
first and second tests were the Structure and Listening subtests of the CELT; the
third was a remedial reading test entitled Reading for Understanding Placement
Test (RFUPT); and the fourth, designed by the author, was a modified cloze test
referred to as the New Cloze Test (NCT).
There were several contrasts between the tests. For one, the first three were
in a multiple-choice format. Examinees were required to choose one of four

alternatives in response to each test question. However, the fourth test was an
open-ended fill-in-the-blank cloze procedure. A second contrast was that the two
CELT tests were specially developed and validated with reference to nonnative
speakers of English while the RFUPT was constructed with reference to a
remedial reading program designed for students at intermediate elementary
school levels through college.* The NCT, of course, was developed specifically for
testing the nonnative speakers at CESL.1 Third, the dominant register of language
used for the two CELT tests was conversational. These tests contained many
colloquial expressions which foreign students might not have encountered
previously. On the other hand, the register for the RFUPT was appropriate to
written materials likely to be used at school. The bulk of the questions in the
RFUPT required that the subjects have a good conceptual understanding of
various school subjects such as general science and language arts or social studies.
Fourth, the total number of items for each of the CELT tests and the RFUPT
differed considerably, ranging from a low of 50 for the Listening test to a high of
100 for the RFUPT. Accordingly, the timing of the tests varied. The Listening test
required all examinees to proceed at the same speed while the RFUPT and the
Structure test had fixed time limits. These differences in the manner of timing, in
the total number of test items, together with differences in difficulty level of each
test, tended to change the effects of guessing from test to test. Finally, the NCT
substantially differed from the other three tests in that there was little or no chance
of subjects making correct random responses. For the multiple-choice tests the
chance of a correct guess is one in four, but on the open-ended cloze test the
odds against a correct guess are much stronger.
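To make the contrast concrete, the raw score a subject could expect from blind guessing alone is easy to compute. The following is a back-of-the-envelope sketch in Python (not from the chapter), assuming independent items and four equally attractive alternatives per multiple-choice item; the item counts are those reported later in Table 4-1.

    # Expected raw score from blind guessing: one chance in four per
    # multiple-choice item, and essentially zero for the open-ended cloze.
    items = {"Structure": 75, "Listening": 50, "RFUPT": 100}
    for test, n in items.items():
        print(test, n / 4)  # e.g., Structure: 18.75 raw points by chance
    print("NCT (open-ended cloze): about 0")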

Method of Analysis. The scores of the four tests were analyzed by applying
two different techniques: a profile method and factor analysis. A profile is a
graphic summary of the results of multiple measurement. The principal advan¬
tage of the profile is its simplicity—it presents test results in a nontechnical way.
The method is applicable, however, only to the extent that individual test scores
are reliable. Fortunately, all tests used in this study demonstrated substantial relia¬
bility and concurrent validity, as can easily be inferred from the correlations and
the factor analysis discussed below. (For additional statistical data, see Hisama,
1977a.)
Since the four tests contained different numbers of items, the raw scores are
not directly comparable. Therefore, they were transformed to standard scores by
the following formula:

T = 50 + 10 (X - X̄) / Sx

where T = standard score; X = raw score; X̄ = mean of the raw scores obtained by
the group; Sx = standard deviation of the raw scores.
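As a quick illustration, the transformation can be computed as in the following minimal Python sketch (not part of the original study; the function name and the use of the population standard deviation are our assumptions).

    import numpy as np

    def t_scores(raw_scores):
        """Transform raw scores to T-scores (mean 50, standard deviation 10)."""
        x = np.asarray(raw_scores, dtype=float)
        # np.std defaults to the population SD; pass ddof=1 for the sample SD.
        return 50 + 10 * (x - x.mean()) / x.std()

    # A raw score one SD above the group mean maps to T = 60.
    print(t_scores([31, 17, 45, 31]))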

*Editors’ note: This is the same test referred to by Scholz et al. in Chap. 2 as the CESL Reading Test. It
is a modified version of the Science Research Associates Reading Improvement Placement Test.

Figure 4-1. Profiles of Different Language Proficiency Levels Based on Standard Scores.

By transforming each score in this way, all tests were converted to a common
scale of measurement with a mean of 50 and a standard deviation of 10. By using
T-scores in the construction of profiles, direct comparisons may be made among
the various patterns of variability thus displayed. The T-scores on the four tests
used in this study are given in profile form, first by level of placement at CESL and
then by language group.
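A profile of this kind is straightforward to generate. The following Python sketch (not from the study) plots mean T-scores for three levels on the four tests; the numeric values are invented placeholders, not Hisama's data.

    import matplotlib.pyplot as plt

    tests = ["Structure", "Listening", "RFUPT", "NCT"]
    # Hypothetical mean T-scores for three placement levels (placeholders only).
    levels = {
        "Level 1": [42, 44, 40, 41],
        "Level 3": [50, 49, 51, 50],
        "Level 5": [58, 56, 60, 59],
    }

    for label, means in levels.items():
        plt.plot(tests, means, marker="o", label=label)
    plt.axhline(50, linestyle="--", linewidth=0.5)  # the common T-score mean
    plt.ylabel("Mean T-score")
    plt.legend()
    plt.show()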
Subsequently, the interrelationships among the four tests are explored by
using a factoring method. Factor analysis is not a single method. It subsumes a
wide variety of procedures. The most important characteristic of factor analysis is
its data-reduction capability. Given an array of correlation coefficients (a correla¬
tion matrix) for a set of variables, factor analysis techniques enable us to see
whether some underlying pattern of relationships exists such that the data may be
“reduced” to a smaller set of factors (or components) that may be taken as source
variables accounting for the observed interrelations in the data. The method used
for this study was the principal component solution. Factors were extracted from
the correlation matrix with unities in the main diagonal.
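For readers who wish to reproduce this kind of solution, the following minimal Python sketch (not part of the original study; numpy is assumed) extracts the principal components from the four-test correlation matrix reported later in Table 4-2.

    import numpy as np

    # Correlation matrix for Structure, Listening, RFUPT, and NCT
    # (the values reported in Table 4-2), with unities in the main diagonal.
    R = np.array([
        [1.000, 0.723, 0.769, 0.789],
        [0.723, 1.000, 0.727, 0.731],
        [0.769, 0.727, 1.000, 0.848],
        [0.789, 0.731, 0.848, 1.000],
    ])

    # Principal components are the eigenvectors of R.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]  # largest component first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Loadings on the first component (g) = eigenvector * sqrt(eigenvalue).
    g = np.abs(eigenvectors[:, 0]) * np.sqrt(eigenvalues[0])  # sign is arbitrary
    print(eigenvalues[0] / R.shape[0])  # proportion of variance, about .82
    print(np.round(g, 2))               # close to Table 4-3: .90 .87 .92 .93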

Results and Discussion


Profiles of ESL Test Scores by Proficiency Levels. Figure 4-1 shows the mean
standard scores obtained by seven defined proficiency levels. Normally six
proficiency levels exist at CESL. Students are initially placed in courses by an
examination procedure and then they progress through the courses until they are
exempted from further study. Levels 1 to 5 are designed to take students from
minimal skill to fluency. Level 6 offers study preparatory to graduate work, and
Level 7 is a category used here to represent students who have graduated from
Level 6. Observed score differences indicate the spread that exists between levels
which may be due to either the initial placement process or the instructional
effects of courses (or both). The standard scores on individual tests can be com¬
pared in terms of their relative deviation across levels. The test that maximizes the
differentiation across levels may be considered to be the most sensitive to the
differences that actually exist across the levels. In Fig. 4-1, the spread produced
by the RFUPT is greater than that generated by any of the other tests. However, it
is important to note that the RFUPT score is also used in the definition of and
setting up of levels in the first place, along with the Listening and Structure por¬
tions of the CELT. Therefore, the fact that the NCT produces nearly as much
spread as the RFUPT is evidence of its concurrent validity.

Table 4-1 Means, Standard Deviations, and Total Numbers of Test Items
Used in the Factor Analysis (N = 136)

Test         Mean (%)         SD (%)          Items
Structure    31.57 (42.09)    15.05 (20.06)    75
Listening    19.33 (38.66)     8.43 (16.85)    50
RFUPT        44.54 (44.54)    16.36 (16.36)   100
NCT          22.96 (46.85)    12.84 (26.21)    49

Profile of ESL Test Scores by Native Language. Figure 4-2 shows the pro¬
files based on mean T-scores obtained on the four tests by five different language
groups. A host of factors may account for observed group differences: availability
of college education in their native countries, availability of scholarships or
private funds, geographical and sociocultural distance from English-speaking
societies and political entities, political and military relationships, or a combina¬
tion of these and other factors. Rather than speculate on these questions, let us
simply examine the pattern of variability as we did in Fig. 4-1. Here it is the NCT
that seems to afford the greatest discrimination across groups. The Structure test is

almost as good a discriminator, followed by the RFUPT and the Listening test,
both of which produce considerably less spread among groups.

Table 4-2 Correlation Coefficients among the Four Tests (N = 136)

             Structure    Listening    RFUPT    NCT
Structure                   .723        .769    .789
Listening                               .727    .731
RFUPT                                           .848
NCT

Factor Analysis of the Four Tests. Means and standard deviations are given in
Table 4-1, followed by correlations in Table 4-2 and the principal component
factor solution in Table 4-3. The correlation coefficients reveal that the NCT
accounts for the greatest amount of the total variance in all the tests. This can be
seen from the fact that the correlation of the NCT with each other test is always the
highest correlation in the matrix (Table 4-2). Also, in the factor analysis, the NCT
produces the strongest loading on g, followed closely by the RFUPT. However, the
loadings of Listening and Structure on g are also quite strong. Indeed, the g factor
accounts for over 82% of the variance in all four tests. The remaining variance is
too small to be subjected to further analysis (the eigenvalue of the second compo¬
nent extracted is less than 1). The nearly identical factor loadings of the NCT and
RFUPT are noteworthy inasmuch as the formats and reference populations for the
two tests were radically different. This seems to indicate that common mental
processing strategies are required by all four of the ESL tests examined here in
spite of the fact that they appear to be quite different at the surface. An important
task, then, will be to characterize these common strategies as demonstrated by
speakers of English as a second language.

Table 4-3 Factor Matrix Based on Principal Component Analysis*

Test         g      h²
Structure   .90    .81
Listening   .87    .76
RFUPT       .92    .85
NCT         .93    .86

*Values for the second factor are not reported because its eigenvalue is less than 1.
Score pattern differences among the different subgroups may be largely due
to unreliabilities or to sampling “biases” in the tests—the number of items, the
effect of random guessing, the register of test items, accent of the speakers on the
listening test, and so on. For this reason alone, multiple measures of ESL profi¬
ciency may be defended by arguing that their use will tend to cancel out or nullify
such biases. However, there is a large element of luck here in selecting the
measures to be used unless serious research is conducted to justify the choices and
to deliberately minimize the unreliability and biases.

Note

1. The NCT is being prepared for publication. Interested parties may write the author at Depart¬
ment of Special Education, Southern Illinois University, Carbondale, Ill. 62901, for more
information.
Part I Discussion Questions

1. What possible theoretical formulations would fit the unitary competence
hypothesis? Is such an empirical finding inconsonant with the notion that
language skill can ultimately be broken down into many thousands, perhaps
many millions, of semidiscrete bits of knowledge which are intricately inter¬
related? What about the notion of multiple components, or multiple aspects,
modalities, or functions? Could they exist in theory but not be measurable in
practice (at present)? Discuss then the relationship of test constructs (what
the tests are supposed to measure) in contrast with theoretical constructs.
Must they ever coincide?

2. If Spearman’s g was supposedly a factor of general intelligence, what is the g
factor that appears in at least three of the chapters in Part I? Is it distinct from
Spearman’s g? If not, then what is the true relationship between language
proficiency and intelligence? Are they distinct theoretical constructs which
are not practically distinguishable in test variances (at least not yet)? Or,
alternatively, are they merely distinct test constructs which turned out to be
incorrect? Or something else? (See also Oiler and Perkins, 1978).

3. If the two-factor solution for the various tests used by Oiler and Hinofotis in
their second study (see Table 1-7) were correct, how would we explain the
three-factor solution in their third study (see Table 1-11) or the four-factor
solution of Scholz et al. (Table 2-4)? In other words, if the separable vari¬
ances were reliable, why would they not tend to sort out similarly on differ¬
ent occasions if the same type of analysis is applied?

4. Just as an exercise in analytical interpretation of factor analyses, try to inter¬
pret the factors mentioned in Question 3. In other words, try to explain them
in terms of their patterns of correlations (or factor loadings) that derive from
the various contributing tests. Try to find an explanation that will be consis¬
tent with all of the several varimax rotated solutions. Remember that the
mathematically defined factors in each separate solution (e.g., factors 1 and
2 in Table 1-7) are orthogonal to each other, that is, they are uncorrelated
with each other. Therefore, each factor must be explained more or less as a
separate entity.

5. Discuss the advantages and disadvantages of the single-factor solutions
(Tables 1-1, 1-4, 1-8, 2-1, and 4-3) in comparison with the multifactor solu¬
tions. Bear in mind that the solution which explains the greatest amount of
variance in the most theoretically satisfying manner is to be preferred.


Therefore, we should seek a solution that is consistent (i.e., one that contains
no self-contradictions), exhaustive (i.e., one that explains all the data that
need to be explained), and simple (i.e., one that is as uncomplicated as pos¬
sible).

6. Now, consider the curricular implications of the preferences you have
elected. Compare the multifactor approach against the single-factor idea.
What would be done in the classroom in one case that would not be done in
the other?

7. When Flahive says that there is no “universally accepted definition of lan¬
guage comprehension, nor is there a universally valid technique for assess¬
ing language comprehension,” is this not also true of just about any theoreti¬
cal object of interest? For instance, is there a universally accepted definition
of the electron? Of cancer? Of history? Of motivation? Further, consider the
implication that universal agreement is some kind of prerequisite for practi¬
cal decisions. Can you think of any cases where agreement is irrelevant to
the nature of things? How about the value of pi? Or the composition of moon
rocks? Antimatter? The force that binds atoms together? Is language profi¬
ciency a fundamentally different sort of construct?

8. When significant correlations are found between all the various language
tests and the so-called nonverbal measure of intelligence (see Table 3-2,
where the correlations in question range from .61 to .84), how is it that the
validity of the language tests is questioned rather than the validity of the so-
called nonverbal intelligence test? Fully 71% of the variance in the McGraw-
Hill reading test is also present in the Raven’s Progressive Matrices. No less
than 37% of the variance in the so-called nonverbal IQ test is common to
every one of the language measures used (the TOEFL, Perkins-Yorio, and
the cloze). Further, this is true for a group of nonnative speakers of English.
How then is it possible to reason that the Raven’s test is pure while only the
McGraw-Hill reading test is confounded by an extraneous variable, namely,
nonverbal intelligence? Would it not be just as reasonable to assume that the
Raven’s is impure, i.e., actually a measure of language proficiency? Indeed,
is the case for the latter claim not stronger than that for the former? Notice,
for instance, that the correlations between all the language tests are as we
might expect; it is the relationship of all of them to the nonverbal IQ test
which demands explanation.
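As a check on the arithmetic in this question: the shared variance between two tests is simply the square of their correlation, so r = .84 gives .71 and r = .61 gives .37. A two-line Python sketch:

    # Shared variance between two tests is the square of their correlation.
    for r in (0.84, 0.61):
        print(r, round(r ** 2, 2))  # .84 -> .71 and .61 -> .37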

9. Note also that the correlation between the McGraw-Hill reading test and the
nonverbal IQ test is equal to the correlation between the Perkins-Yorio and
the TOEFL (Table 3-2). Further, these are the highest correlations in the
entire matrix (nine points higher than the next in line). The Perkins-Yorio
correlation with TOEFL would normally be interpreted to mean that they
are measuring the same thing (namely, ESL proficiency). Is it fair then to
take a completely different approach to the explanation of the correlation of

identical magnitude between the McGraw-Hill reading test and the


Raven’s? While no factor analysis is offered in Chap. 3, it seems certain that a
single-factor solution would explain the vast majority of the reliable variance
in all the tests including the nonverbal IQ measure. (The enterprising reader
may want to run such an analysis. It is possible to do so on the basis of the
data in Table 3-2).

10. Compare the results of Flahive in Chap. 3 with those of Stump (in Oiler and
Perkins, 1978). Stump’s results showed that a widely used, group-adminis¬
tered IQ test (containing both verbal and nonverbal sections) generated little
(if any) reliable variance that was not also present in cloze and dictation
tasks. Stump used native speakers of English at the 4th and 7th grades
(about 100 subjects at each level). How do these results square with
Flahive’s? Moreover, what does the comparison imply for the relation
between first and second language learning? (Stump used natives, Flahive
nonnatives). See also Part VI in this volume. The chapters there rely on task
performance of natives and nonnatives as viewed through types of errors,
patterns of difficulty of various structures, and the like.

11. If the relationship among skills were perfectly similar across levels at CESL
(Southern Illinois University), and across language groups as displayed in
the profile analyses of Hisama, what would the displays look like?* Are there
any reasons not to expect skills to be quite similar across language groups or
levels at CESL? What factors would be expected to contribute to differ¬
ences? What factors might contribute to the leveling of differences?

12. Compare the correlation matrix of Hisama in Table 4-2 with that of Flahive
in Table 3-2. If the labels were not given for column and row headings, how
would one know that different tests had been employed to derive the two
tables? In other words, is the pattern of test relationships sufficient to show
that a nonverbal IQ test was used by Flahive but not by Hisama? If the reader
does compute the principal components solution for the data in Table 3-2
(see Question 9), the factor solutions may be compared.

*Actually, there is more than one possibility, but a simple theoretical ideal would be for them to fall
into a pattern of perfectly straight parallel lines. However, owing to practical imperfections in tests and
test performances, we should expect only approximately straight and approximately parallel lines.
Curvilinear solutions are theoretically possible but do not seem likely on the basis of Hisama’s data.
Part II

Investigations of Listening Tasks

What role does practice in listening play in the development of ability to
speak, read, and write a target language? In what order should tasks be
presented in language classroom situations? Sequentially? Simultaneously?
If sequentially, in what sequence? What theoretical arguments and practical
empirical data can be brought to bear? Is it possible to make reliable
subjective judgments about the listening ability of second language learners?
Are subjective evaluations of responses to a dictation task as reliable and
valid as objective scores based on the more traditional word-by-word
scoring technique? How do such subjective judgments relate to test scores
on other tasks aimed at listening, speaking, reading, writing, and grammati¬
cal decisions? These questions, and others, are dealt with in the two chapters
which follow as Part II.
Chapter 5

Listening Competence:
A Prerequisite to Communication1

Pamela Cohelan Benson and Christine Hjelt

Three different views concerning the introduction of the traditionally
recognized skills in second or foreign language study are explored. It is
concluded that the development of a solid understanding of the meaning of
heard utterances is prerequisite to practice in the productive manipulation
of utterances in communication. A model of language teaching that
progresses in small cycles (as does the Asher approach) from an initial phase
of listening with understanding to a second phase of possibly concurrent
practice in speaking, reading, and writing is proposed. Evidence is offered
from a wide variety of sources in support of the view that practice in
listening-with-comprehension is foundational to the development of a
thoroughgoing all-round competence (or expectancy grammar) in the target
language.

How many teachers of English as a Second Language actually have solid
confidence in their methods for teaching students who have had no exposure to the
target language? How many teachers who work with beginning students are
entirely comfortable with the sequence of material offered through the text they
happen to be using? Further, are there any teachers who, in evaluating the first fifty
class hours, have not asked themselves if another approach might have given their
students a firmer foundation in the target language? In considering these ques¬
tions, it may be helpful to look at several hypotheses concerning the sequence of
tasks in second or foreign language learning.
A popular hypothesis has been the view that oral skills should be introduced
first—that language skills are developed linearly, beginning with listening/speak-
ing and proceeding to reading/writing. According to this view, oral production is
often emphasized from lesson one. For instance, Paulston and Bruder (1975), by


the very lesson format which they propose, seem to stress the need for immediate
oral production, beginning with tightly controlled mechanical drills:

The focus of the lesson is on grammar and structural pattern drills; we are not con¬
cerned here with the teaching of pronunciation, listening comprehension, reading or
writing. It may well be argued that it is impossible to learn only discrete items in a
language lesson and that a lesson of grammar also involves elements of pronunciation
and listening comprehension. This is undoubtedly so, and there is always incidental
learning taking place. The teacher, however, in making up his plans, must focus on
teaching one thing at a time so that he can concentrate on reaching a predetermined
degree of student proficiency, know what to correct (and just as important, what not to
correct), and aid the students with prepared explanations (p. 23).

Another hypothesis which the prospective teacher of beginning students
might consider is that the acquisition of skills is a kind of conglomerative or
multiple parallel sort of process; therefore, tasks requiring all four of the tradition¬
ally recognized skills can be introduced simultaneously, with each skill reinforc¬
ing the others. Donaldson (1971) seems to have this sort of view in mind when he
gives an explanation of how the cognitive-code approach to language teaching
differs from the grammar-translation approach:

It [the cognitive-code version] should be a four-skills approach, but not in the manner of
audio-lingual habit theory. The four skills should be practiced simultaneously after the
presentation of explicit grammatical rules. Practice of all skills—but practice based
upon study and analysis—is a prime objective (p. 132).

A third hypothesis which deserves consideration is that language learning is a
more orderly integrative process initially requiring contextual decoding of the
meanings of new utterances before meaningful and creative encoding can take
place. It is this hypothesis which we wish to focus on. Along with certain theoreti¬
cal arguments, we will consider several sources of data which provide supportive
empirical evidence for this third position.
The late Valerian Postovsky (1974) challenged the assumption commonly
associated with the audio-lingual method that intensive oral practice in the target
language at the beginning stages of instruction will result in faster acquisition of
the language. He proposed that the production of speech is “an end result of
complex and mostly covert processes which constitute linguistic competence” (p.
229). He reasoned that in acquiring the ability to decode, the language learner
must develop recognition knowledge, while to encode he must develop retrieval
knowledge. Postovsky asserted that time is better spent, in the initial phases of a
language program, on developing the student’s capacity to decode. He credited
Susan Ervin-Tripp (1970) as the source of the processing model he offered in
making his case. She reasons that

the evidence from natural learning suggests that manifest speech is largely secondary.
That is, as long as the learner orients to speech, interprets it and learns the form or
arrangement that represents the meaning, he learns speech as fast as someone speaking
[i.e., someone practicing speech] (p. 339).

Postovsky considered it to be a great handicap for students in audio-lingual
programs to hear each other speaking error-ridden varieties of the target language
from the outset of instruction. While they might develop a fluency among
themselves in their classroom varieties of the target language, they would no doubt
experience difficulty in trying to understand a native speaker in a more natural
context.
To investigate the effects of delay in oral practice at the beginning of language
study, Postovsky conducted an experiment in the Russian Language Department
of the Defense Language Institute in Monterey, Calif. Sixty-one matched pairs of
students were divided between experimental and control groups for a twelve-week
experiment. For the first four weeks of the program the experimental group
delayed oral practice in Russian. They were introduced to the Cyrillic alphabet
and were given some pronunciation practice in the first three days to enable them
to write their responses to the aurally presented material. The control group
followed the regular program in Russian which emphasized intensive oral
practice. After the first three days, the control group was also introduced to the
Cyrillic alphabet. Each group used the same materials and had the same number of
contact hours. At the end of the initial four weeks, the experimental and control
groups were merged into the regular program and at six weeks and twelve weeks
examinations were given. All four skills were tested—listening, reading, writing,
and speaking. The mean scores of the experimental group were measurably higher
across all the skills at both the sixth and the twelfth week. The most interesting
results were the higher speaking scores of the experimental group, who had far less
speaking practice yet outperformed the control group on the tests (Postovsky,
1974, p. 235). The speaking scores were broken up into parts in an attempt to
determine where the greatest differences were. It was concluded that the two com¬
ponent scores which contributed most to the total speaking scores were Control of
Grammar and Reading Aloud.
Postovsky readily admitted that his study could not be taken as conclusive
evidence in support of a particular theory of second language acquisition;
however, it does give us some data to consider and raises some interesting
questions. Is it not possible that a clearer conceptualization of the target language
occurs when the beginning student is spared the imperfect language generated by
himself and his classmates? If the student gives his initial concentration to material
presented aurally by a native speaker and if the student writes his response to that
material, is it not possible that his perception of auditory input will be
strengthened?
Another study which produced data supporting the case for positive transfer
of listening comprehension to speaking, reading, and writing tasks was done by
Asher (1969). Following his method (the total physical response method),
students listen to commands in the target language and imitate the teacher as he
responds to the orders. Orders begin with simple one-word imperatives such as
“stand” or “jump” and become increasingly complex, e.g., “Run to the table, put

down the paper, and sit on the chair!” (Asher, 1969, p. 5). Students imitate the
physical execution of the commands as demonstrated by the instructor. Asher
found that students working under his method achieved a high degree of “listen¬
ing fluency” which transferred directly to speaking as well as reading and writing.
In one test, Asher demonstrated that when students were required to do both
listening and speaking, their comprehension of the target language was decreased
(Asher, 1969, p. 13).
Asher (1974) reports on a Spanish teaching experiment with twenty-seven
college students who had no prior knowledge of Spanish. This group was divided
in half, and each section met with an instructor for three hours one evening per
week for two semesters. The students followed Asher’s method by simply sitting
around the instructor and responding physically to his commands. As each student
felt able, he or she would volunteer to respond physically (not verbally) without
the instructor’s modeling of the required bodily movements. Students progressed
from one-word commands to commands like, “When Henry runs to the blackboard
and draws a funny picture of Molly, Molly will throw her purse at Henry” (Asher,
1974, p. 27). After ten hours of instruction they were “invited but not pressured”
to change roles with the instructor and give commands for the others. Still later the
students produced skits and worked on role-playing situations. Reading and writ¬
ing were not formally dealt with. If the students requested it, the instructor might
write a new vocabulary item on the board at the end of the class, but this casual
procedure required only a few minutes of class time.
After 45 hours of instruction, of which 70 percent was spent on listening
comprehension, 20 percent on speaking, and 10 percent on reading and writing
tasks (with no homework assignments at all!), the experimental group was tested
against the three control groups. One group consisted of high school students who
had taken one year of Spanish, a second group consisted of college students just
finishing their first semester of Spanish, and a third group was made up of college
students who were just completing their second semester of Spanish. Measured
against the group of high school students with approximately 200 hours of class
work on a test of listening and reading comprehension, the experimental group
with only 45 hours of training had a mean score of 16.63, while the high school
group had a mean score of 14.63. On similar tests, the experimental group also
scored significantly higher on the listening comprehension task than either of the
college control groups (Asher, 1974, p. 28).
At the end of 90 hours of instruction using Asher’s method, the experimental
group took the Pimsleur Spanish Proficiency Tests, Form C. This test was
designed for students who had completed 150 hours of intensive audio-lingual
training. The experimental group performed beyond the 50th percentile in most
skills (Asher, 1974, p. 30).
In evaluating this study, Asher notes that perhaps his most important finding
was the extent to which listening comprehension transferred to other skills. On the
success of his method he writes:

When language input is organized to synchronize with the student’s body movement,
the second language can be internalized in chunks rather than word by word. The
chunking phenomenon means more rapid assimilation of a cognitive map about the lin¬
guistic code of the target language (pp. 30-31).

The findings of both Asher and Postovsky challenge the first hypothesis men¬
tioned at the beginning of the chapter, which claims that speaking performance
must be emphasized in the initial phases of language instruction. Indeed
Postovsky’s work demonstrates that an immediate emphasis on speaking may even
hinder the learner’s capacity to process (decode) second language data. The
second hypothesis, which states that language learning is an integrative process
and that all language skills can therefore be introduced simultaneously with each
skill reinforcing the others, must also be questioned since both studies presented
skills sequentially, with listening skills preceding speaking skills.
Postovsky’s somewhat modest findings show an improvement in all skills
when speech is delayed four weeks in what is otherwise a fairly traditional
behaviorist approach to language learning. The strengthening of listening skills
definitely seemed to benefit his students.
It is Asher’s findings, however, that provide the more dramatic evidence. By
allowing processing prior to speech, the subjects in Asher’s experimental group
were able to develop a listening competence more quickly than with traditional
classroom methods. In this approach, Asher delays speaking only ten instructional
hours. At this point students are encouraged to assume the teacher’s speaking role
for the language which has been conceptualized and is ready for production. It is
also important to note that the Pimsleur test, which assesses the full range of lan¬
guage skills, shows unusual competence in reading and writing skills even though
little direct instruction was given in these areas.
It would seem that both these studies of foreign or second language learning
support the third hypothesis—namely, that language learning is a highly organized
integrative process initially requiring contextual decoding of the meanings of new
utterances before meaningful and creative encoding can take place.
Support for this hypothesis can also be found in two additional studies using
subjects from radically different backgrounds. The first project, by Naiman
(1974), employed 112 first and second graders in a Canadian bilingual school. He
compared a comprehension task (translating from the target language to the native
language) with a production task, elicited oral imitation in the target language.
In all five syntactic structures used in the experiment, performance on the
comprehension task exceeded performance on the imitation task (as cited by
Swain, Dumas, and Naiman, 1974), indicating that comprehension may well be
prerequisite to production skills.
The second study (Sticht, 1972) used 96 men in the U.S. Army. In Sticht’s
experiment, learning strategies necessary to reading ability were also shown to be
necessary to listening comprehension. He used a test consisting of three brief
prose passages at the 6.5, 7.5, and 14.5 grade levels as judged by the Flesch reada¬
bility scale. The passages were presented alternately as listening and reading tests
to 40 men classified on the basis of an IQ test as having Low Mental Aptitude
(LMA) and to 56 men grouped as having Average Mental Aptitude (AMA) (Sticht,
1972, p. 287). The results for both groups showed a high interrelatedness between
listening scores and reading scores. The most striking contrast was at the 7.5 level
among the LMA subjects. At this level, mean reading scores were in the 52nd per¬
centile. Although the differences at the other levels were smaller, the mean
listening scores were, with only one exception, equal to or greater than the mean
reading scores (Sticht, 1972, p. 288). This supported Sticht’s hypothesis that
“developmentally, skill in learning by listening precedes and actually forms the
basis for the acquisition of skill in learning by reading” (p. 286).
On the basis of all the foregoing studies, we can assert with confidence that
the development of listening ability demands the integration of phonological,
grammatical, and lexical data into a relatively unitary competence or expectancy
grammar (Oiler, 1974). Listening skill therefore seems to be an essential prerequi¬
site to oral communication and appears to be tightly integrated with the other tra¬
ditionally recognized language skills, both receptive and productive.
A final source of evidence for the hypothesis which we are positing is the
evidence for the statistical correlation between listening comprehension tasks and
tests requiring speaking, reading, writing, and specific grammatical decisions. For
evidence of this sort, the reader is referred especially to the earlier chapters of this
volume.
To cite a few specific instances, Irvine, Atai, and Oiler (1975) found that the
Listening Comprehension subtest of the TOEFL and scores on a cloze test were
correlated at .69; Listening Comprehension was correlated with dictation at .76
for a group of 159 Iranians. The correlation between the cloze and the dictation
scores was .75. These three correlations were higher than any other correlations
among the TOEFL subtests. A factor analysis over the same data reported by Oiler
and Hinofotis (Chap. 1) showed that all the reliable variance in all of the TOEFL
subtests plus the cloze and dictation was accounted for by a single factor (substan¬
tially common to all the tests). See Table 1-1.
Further evidence comes from Johansson (1973b), who also found a high
correlation between a multiple-choice listening comprehension task and dicta¬
tion. In a test administered to 26 students at Lund University in Sweden, Johans¬
son found a correlation of .83 between listening comprehension and dictation.
This correlation was higher than any of the other correlations between the subtests
(grammar, vocabulary, and dictation with noise, pp. 108-109).*

"■Editor’s note: also relevant are the results of Scholz et al. (Chap. 2), especially the tests numbered 1 to
5 (Table 2-1). Furthermore, a factor analysis of the 16 separate subscores on listening tasks from the
Carbondale project (Oiler, 1979, Appendix) revealed a single general component accounting for 40%
of the variance in all the tasks. Since the estimated reliabilities on the same tasks averaged about
J~A (or .63), it is concluded that the single-factor solution explains nearly all the reliable variance in
all the tasks.

Two important findings seem to emerge from these studies. First, when
listening practice precedes work on oral skills, the development of an appropriate
expectancy grammar seems to be enhanced. Our personal preference is a cyclical
learning model in which comprehension skills precede production skills in small
learning cycles—as in the Asher approach, for instance. This is not to suggest,
however, that learning can take place only through acoustic channels. In fact, in
the studies of Postovsky and Asher, learning was apparently enhanced by physical
response—either bodily action or writing. Studies with deaf children also show
that hearing can be bypassed altogether if other sensory receptors are involved.
However, we suggest that deliberately bypassing listening, or failing to give ade¬
quate practice in listening is probably less efficient and is likely to be a frustrating
way to teach a language.
The second important finding is the reassertion of listening comprehension
as an integrative or global skill which, as its name implies, entails comprehension
or conceptualization and organization of new language data. Only when this “com¬
prehending” process is functional can the learner begin to manipulate his new lan¬
guage in meaningful and creative ways. The same process expands the expectancy
grammar in the new language and thus affects development in all areas of language
learning.
Our concern in this chapter has been with the early stages of language learn¬
ing. Most teachers would accept the proposition that listening comprehension is an
important aspect of language learning not only in the beginning but in all stages of
development. Since it is an important skill, testing procedures may be weighted
toward aural proficiency. However, teaching listening comprehension, particu¬
larly to beginners, is a process that is not well conceived in most curricula. The
tendency is to rely on pattern drills, contrived dialogues, or grammatical presenta¬
tions in the textbook and to simply assume that comprehension will follow. Often it
doesn’t.
There are two areas of failure in these early stages. First, when beginners are
asked to use words and structures too soon, they are forced to say what they do not
know how to process in the target language because the target language vocabu¬
lary and structures are incomplete in their developing grammatical systems.
Second, and perhaps this is the most critical factor, the context provided is often
insufficient, and the beginning student cannot possibly succeed in conceptualiz¬
ing the sense of utterances of the new language. In many of the approaches which
we have been arguing against, the conceptualization of the relation between utter¬
ance and extralinguistic context seems to be thought of as a kind of vague eventual
goal, perhaps the ultimate product of the learning cycle. In our view it becomes the
crucial prerequisite—the best foundation upon which the language learning
process can be established.

Note

1. Editor’s note: this paper has appeared in a slightly different form in the Modern Language Journal
62, March 1978, pp. 85-90. It is reprinted here by permission.
Chapter 6

Communicative Effectiveness as Predicted
by Judgments of the Severity
of Learner Errors in Dictations

Frank Bacheller

A scale of communicative effectiveness (SCE) is proposed as a basis for the
subjective evaluation of renditions of segments of prose presented to
learners in the form of a dictation. It is demonstrated that the proposed SCE
is approximately as reliable as the more traditional scoring method. The
correlation with that method was .94. Correlations with selected other tests
requiring listening, speaking, reading, and writing ranged from .63 to .77.
Item discrimination indices for the SCE on the 33 segments of dictation
ranged from .31 to .77. Considering the brevity of the segments, these
indices are remarkably high, indicating substantial reliability and internal
consistency of the judgments. It is suggested that such a scale might well be
adapted to other pragmatic language testing techniques (e.g., composition
and elicited imitation).

The Goal of the Language Learner Is Communication. Anyone who has been
in a situation where he had to use a language other than his native language for
everyday purposes probably found that when speaking, emphasis was placed on
getting ideas across more than on grammatical details, and also when listening, the
focus was on the overall meaning rather than on trying to catch every single sound,
morpheme, or word. Whether listening, speaking, reading, or writing, the problem
is relating the surface forms of utterances to extralinguistic context in systematic
ways.
Richards (1971) points out that, when placed in a situation where he must
communicate, the learner controls the language to suit his intentions. For
example, he may simplify the syntax or find a circumlocution to express some¬
thing he hasn’t learned to say in a more idiomatic way.
How well a learner communicates depends on his ability to supply enough
information in the linguistic surface form to enable an audience to relate that form

to extralinguistic context—that is, to understand the meaning. It also depends on
how well the learner himself is able to extract information from the surface forms
of the utterances of others by relating them to extralinguistic context.
A Basis for Judging Errors. Acts of communication are affected by some types
of errors more than others. Enkvist (1973) says a learner’s errors should be judged
according to the degree to which they interfere with communication. Clues as to
what kinds of errors cause the most interference are given by two studies in which
native speakers were asked to interpret sentences containing learner errors. In one
experiment, Olsson (1973) learned that sentences with what she defined as
“syntactical errors” could be more easily interpreted than those with “semantic
errors.” These results agree with the fact that cloze items that delete content words
are generally harder than those that delete function words (Aborn, Rubenstein,
and Sterling, 1959).
Burt (1975) assessed the comprehensibility of sentences containing errors at
different levels of language organization and found that errors which affected
overall sentence organization, such as wrong word order, missing, wrong, or mis¬
placed sentence connectors, and use of words in contextually inappropriate situa¬
tions, hindered communication more than those that affected single constituents.
One reason for differences in the way errors affect comprehensibility has to
do with the organizational levels of language (Miller and Johnson-Laird, 1976).
When going from lower levels to higher levels of language organization, contextual
constraints increase in strength and effect with each step. At higher levels of organ¬
ization (say the paragraph level versus the syllable level—see Oiler, 1972) greater
redundancy results in greater understanding. It follows then that an error at a high
level of organization will interfere with communication more severely than one at
a lower level because the high-order error will cause a greater reduction in redun¬
dancy.
In normal discourse processing, the language user has his attention focused
on communication and seems to depend more on high-level organizational con¬
straints and less on lower-order constraints. Greater economy is achieved because
the higher-order constraints carry a richer informational payload. The redun¬
dancy of the linguistic sequence at all levels of organization enables one to plan for
or anticipate what comes next. When the stream of forms is related to extralin¬
guistic context, a rich source of redundancy can be utilized in relation to what is
probable or plausible in specific situations (Oiler, 1974; Frederiksen, 1975a,
1975b; Miller and Johnson-Laird, 1976).

Proficiency and Error Severity—An Experiment


Concerning the relation between errors and proficiency, several questions arise:
(1) Can errors be rated for severity in a reliable way? (2) Do more advanced
learners tend to make less serious errors? and (3) Is error severity a sensitive index
of the communicative effectiveness of learners?
In an attempt to answer these questions, dictations written by 112 of the ESL
students who participated in the language testing project at the Center for English
as a Second Language (CESL) at Southern Illinois University (see Chap. 2) were
graded for error severity. The scale given in Table 6-1 was used. It is intended to
be a measure of the learner’s ability to capture meaning in rendering surface form
of segments of text dictated between pauses.

Table 6-1 A Scale of Communicative Effectiveness

Points  Meaning of student’s answer           Surface form of student’s written answer

0       None of the intended meaning of the   No written response, or if there is a
        segment is captured                   response, it does not contain even one
                                              recognizable word of the dictated
                                              segment
1       Overall meaning of the dictated       At least one word of the segment is
        segment is missed                     reproduced
2       Meaning is somewhat distorted         Sequence may not be complete. There
                                              may be missing forms and/or intrusions
3       Subject apparently understood the     Same as 2, but less severe errors are
        meaning of the segment                made or fewer of them
4       Meaning of the segment is captured    Synonymous substitutes for words in
        in its entirety                       the segment may have been used and/
                                              or there may have been spelling errors
                                              indicating phonological difficulties
5       Meaning understood                    Surface form of the material is repro-
                                              duced exactly as dictated (except for
                                              trivial spelling errors leaving pronunci-
                                              ation unchanged)
Dictation was chosen for error elicitation because it clearly shows how well
the learner is able to grasp the meaning of what is said and how well he is able to
reconstruct the surface form used to express that meaning. A highly proficient
learner, for example, will catch the entire meaning of a dictated text and will be
able to write it down quite accurately. Even if the surface form is changed slightly,
the meaning will probably be almost perfectly preserved. A less proficient learner,
however, may catch the meaning carried by the surface forms, but not the surface
forms themselves. This is exemplified by the learner who wrote “by neighbor who
is jeweler” for “by a neighborhood jeweler” (from a dictation used to pilot the
SCE, cited by Oiler, 1973b).
A learner with still less proficiency may relate the wrong meaning to the
sequence (but perhaps a meaning that can be expressed with a similar “sounding”
sequence). An example of such an error is “he had his exam and he had dis¬
covered his field” instead of “he had his eyes examined and he had his cavities
filled.” Finally, a learner with very little ability may not be able to correctly infer
any meaning at all and will merely try to write down the syllable shapes that he
hears. This almost always results in distorted forms and/or sequences that are idio-
syncratically segmented.

Methods. The dictations used in this study consisted of three tape-recorded passages. These are given in the Appendix at the end of this volume at entry 5. Each passage was presented three times. The first time, the student was told to listen, the second time to write, and the third time to make corrections (see the complete directions in the Appendix). The second time, pauses were inserted between phrases to allow time for writing. Marks of punctuation were given on the second pass. For purposes of grading with the SCE, each segment between pauses (marked by slashes in the Appendix) was considered to be an "item."
This resulted in 33 items. Each item was then graded on the SCE and total scores for each student on each passage were correlated with scores obtained by grading the same dictations in the conventional way (one point for each correct word in the sequence dictated). Severity scores were also correlated with scores obtained on other tasks included in the CESL testing project.

Table 6-2 Correlations of SCE Scores with Several Global Tests Given in the CESL Testing Project during the Spring of 1977

Test                               r*      N
5. Dictation (total score)†       .94     112
   (Dictation A)                  .91     112
   (Dictation B)                  .86     112
   (Dictation C)                  .88     112
10. Oral Interview Comprehension  .65      44
11. Repetition†                   .72      46
16. Standard Cloze†               .77      98
18. Essay Score                   .68     106
20. Recall Rating†                .65     106
22. Grammar (Parish test)†        .74     104
FSI Style Oral Interview‡         .63      44

*All correlations reported are significant at p < .001. For these calculations, pairwise deletion of missing cases was used. Hence N varies from 44 to 112.
†Numbers here refer to the order of entries in the Appendix at the end of this volume. Unnumbered items do not appear in the Appendix.
‡This score was based on a composite of the five FSI scales described in greater detail by Hendricks et al. in Chap. 7; also see results of Scholz et al. in Chap. 2.
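The comparison behind Table 6-2 is easy to express computationally. The sketch below is not from the original study; the score arrays are hypothetical stand-ins, but the steps (total the 0-5 ratings over 33 segments per student, then compute a Pearson correlation with the conventional word-count scores) are the ones just described:

```python
# A minimal sketch of the SCE versus conventional scoring comparison.
# The data are hypothetical; in the study each of 112 students received
# an SCE rating (0-5) on each of 33 dictated segments, plus a conventional
# score (one point per correct word) on the same passages.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 112, 33

# sce[i, j] = SCE rating (0-5) of student i on segment j
sce = rng.integers(0, 6, size=(n_students, n_items))
sce_totals = sce.sum(axis=1)

# Conventional scores: simulated here as noisily related to the totals
conventional = sce_totals * 2.5 + rng.normal(0, 10, n_students)

# Pearson correlation between the two scoring methods
r = np.corrcoef(sce_totals, conventional)[0, 1]
print(f"r between SCE and conventional scoring = {r:.2f}")
```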
In addition to the above, the number of spelling errors made per item by each student was correlated with other scores. It was decided not to penalize for spelling errors because (1) spelling errors don't show up in oral production, (2) native speakers often make spelling errors, and (3) other researchers (notably Johansson, 1973b; Whitaker, 1976; and Oller, 1979) recommend not counting spelling errors. Oller (1979) cites empirical evidence from a study at UCLA showing that spelling errors in dictations tend to be uncorrelated or even negatively correlated with conventional scores on the same dictations as well as a variety of other language processing tasks. Examples of mere spelling errors are vaccene (vaccine), developped (developed), and includ (include).

Table 6-3 Sample Errors as Classified on the SCE (Original Dictated Segments in Italics)

Level 1
could be a contribution of great merit—could be a coatripution of married
spilled hot coffee at breakfast—spell hot coffee breakfast
for so many children to suffer—four seventy children to supper, for some children to suppert
tripped and fell down the stairs—trip and down stairs, to downstairs
We learn by what we reject—we learn by reject, we learn both we jack
as well as by what we accept—as well as my accept
Now there is no longer any need—now there is no longer
on the nonappreciation of literature—a leadership, on the appreciate
All in all, it was a bad day to get out of bed—all of the day a bad day is get up
it doesn't pay to get up—it doesn't paid get up, it doesn't lake get up

Level 2
I have always believed—I never always believed, I never believe youth
One day I woke up—Monday, I will come
Some days—Sunday, sun days
All in all, it was a bad day to get out of bed—all of them was bad days, all of us have one bad day when we got up
as well as by what we accept—as well as what I decide to accept
spilled hot coffee at breakfast—spilled up hot coffee and breakfast, for breakfast we have a cup of coffee
that every college curriculum—that every college
it doesn't pay to get up—be doesn't try to get up
the student would decide—the students decided
it used to strike—I used to strike

Level 3
tripped and fell down the stairs—tup and fell down stair
about four million children a year—about four million child a year
All in all, it was a bad day to get out of bed—all of it was a bad day to get up
in 1963 a vaccine was developed—in 1963 a vaccine is developed
the student would decide—the student will decide
One day I woke up—one day I wake up
Now there is no longer any need—there is know longer any need
and the sun was shining—and the sun shined

Level 4
It used to strike—it use to strike
I have always believed—I always believed
Birds were singing—beards were singing
that every college curriculum—that every coleege corecloum

Results and Discussion


The SCE was shown to be surprisingly reliable and valid. The conventional scores
correlated with the subjectively determined SCE scores at .94. For each dictation
passage separately, SCE and conventional scores correlated at .91, .86, and .88,
respectively. SCE scores also correlated strongly, as shown in Table 6-2, with
Repetition (.72), Standard Cloze (.77), and Grammar (Parish test, .74). Item

analysis supported scale reliability and validity by showing that 23 of the 33 items
correlated with the total severity score (a kind of discrimination index) at .60 or
above, and of those 23 items, 11 correlated at .70 or above. The highest correla¬
tion achieved was .77, and the lowest was .31. The mean was .62.
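The item analysis amounts to correlating each segment's severity ratings with the total severity score. A minimal sketch, again with hypothetical data standing in for the actual protocols (real data produced the item-total correlations reported above):

```python
# Sketch of the item analysis: correlate each item's SCE ratings with the
# total severity score (a kind of discrimination index). Hypothetical data.
import numpy as np

rng = np.random.default_rng(1)
sce = rng.integers(0, 6, size=(112, 33))   # students x items
totals = sce.sum(axis=1)

item_total_r = np.array(
    [np.corrcoef(sce[:, j], totals)[0, 1] for j in range(sce.shape[1])]
)
print("items with r >= .60:", int((item_total_r >= .60).sum()))
print("items with r >= .70:", int((item_total_r >= .70).sum()))
print(f"mean item-total r = {item_total_r.mean():.2f}")
```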
All of the foregoing supports the notion that the more proficient a learner is, the less severe his errors are and the higher his SCE scores. This corroborates results obtained by Olsson (1973), who found that a more proficient group of learners made fewer severe errors than did a less proficient group.
Since the degree of error severity is strongly correlated with proficiency, the SCE is shown to be a sensitive index of the overall communicative effectiveness of second language learners. Further evidence of this sensitivity is found by looking at how the scale actually judges learner errors and seeing whether those judgments agree with native speaker intuitions. Table 6-3 gives samples of actual errors as classified on the scale.
Correlations between spelling errors, dictation, and other global proficiency
scores showed no consistent relationship. For the most part they were very low, or
even negative in some cases. The mean correlation was .08, and the range was from
-.02 to .26. These correlations indicate that spelling errors are unrelated to overall ESL proficiency. Thus, as others have recommended, it seems best to avoid
counting spelling errors when scoring dictations in either the conventional way or
with the SCE.
In conclusion, this research suggests that the SCE is a reliable measure of language proficiency; that severity of errors is inversely related to the degree of proficiency or communicative effectiveness; and that the SCE is a valid index of the latter. An important implication is that subjective scoring methods may be just about as reliable as more quantitative objective methods. This may be true not only in scoring dictations but possibly also for other pragmatic tasks such as elicited imitation and composition. It seems that it would be a relatively straightforward matter to generalize the SCE (with modifications) to many other pragmatic tasks.
An implication for teaching is that languages might be better learned if students were systematically and consistently pointed to the meaning of what is being said. This would rule out, it seems, much discrete point teaching that emphasizes surface details while casually forgetting about meaning.
Part II Discussion Questions

1. What sequence of skills is implied or possibly overtly planned into the materials you use as a teacher or those you may have been exposed to as a student? Is a particular skill or modality of processing afforded a privileged status?

2. Consider the relationship that Benson and Hjelt posit in Chap. 5 between listening and reading. Would you expect good readers to be good listeners on the whole and poor readers to be poor listeners? What about the reverse? In other words, should good listeners tend to be good readers, and poor ones poor readers? Why so? Does improvement of skill in reading necessarily carry over into listening and vice versa? Reflect on Sticht's findings. Better yet, read the original study for yourself and reevaluate the conclusions of Benson and Hjelt.

3. In what ways might practicing speaking too early tend to reduce the efficiency of learning? Consider the learner's forming of an utterance as a step-by-step process. What steps must be executed in speaking that are not necessary to auditory decoding? What steps (if any) are necessary to listening with comprehension that may not be necessary to producing sensible utterances orally?

4. Ervin-Tripp is cited as saying that listening with comprehension may have as much impact on speaking ability as actual practice in speaking. How is this possible? Does it not go against the grain of much language teaching dogma? What are the empirical foundations of views opposing her claim?

5. Asher's method of teaching foreign languages has sometimes been criticized as overly simplistic. It has sometimes been argued that beginning with command forms must eventually lead to a dead end. Just how much can be taught using only commands? How can the bridge be made to other common forms of utterance (questions, declaratives, exclamations, contrary to fact conditionals, hypotheticals, past and future references, imperfective versus perfective statements relative to past events, etc.)? (The solution to the difficulty, contrary to much naysaying, is not particularly difficult. Furthermore, Asher's results prove that the transition can be made.)

6. In what way does Asher's method encourage students to map utterances pragmatically onto extralinguistic contexts? Discuss analytically, in step-by-step fashion, the phases that a listener goes through in executing a complex imperative, e.g., "Stand up. Go across the room, and draw a square on the blackboard." In what sense must the learner map the command onto the stream of experience? Consider the delicate sequencing of motor activities in relation to the syntactic ordering of elements within the utterance offered by the instructor.

7. Do you agree with the stress Bacheller (Chap. 6) places on the global communicative effect of utterances rather than the bits and pieces that go to make up their surface form? Reflect on your own experience while watching television or listening to the news on the radio or simply having a conversation. Do you attend to the morphemes? How many times, for instance, did the plural morpheme occur in Question 6? What does all this imply for teaching procedures that deliberately exclude meaning while forcing the learner to attend meticulously to surface form? Can you think of ways of getting students to attend to surface form without losing cognizance of overall communicative intention and effects? Can you conceive of methods of teaching languages that require the learner to manipulate surface forms from the phoneme to full-fledged discourse without ever disregarding meaning? (What we are suggesting here cannot be done, incidentally, with the usual pattern drill methods, but it can be and is done with a variety of other methods.)

8. Consider the effects of errors at different levels of structure in the linguistic hierarchy. For instance, if deleted, which is harder to replace in the following sentence: a single letter, a phoneme, the plural ending on books, the word some, the phrase some books printed in the United States, or the entire sentence? It is difficult to obtain some books printed in the United States in many school libraries in this country, but overseas the problem is much more severe. Similarly, what sorts of distortions are the most difficult to compensate for? Suppose we substituted an n for every m that appears in the foregoing text, or consider replacing the word obtain with the word detain (where a single morpheme is distorted) or put the word pleasant in place of severe?

9. How would you expect spelling to be related to language processing scores in general? How do you explain Bacheller's findings concerning spelling errors?
Part III

Investigations of Speaking Tasks

A principal question dealt with extensively in the three chapters in this part is whether or not speaking proficiency can be broken down into smaller components or separable contributing characteristics. Can judges reliably differentiate pronunciation, fluency, vocabulary, grammar, pleasantness, nativeness, intelligibility, or comprehension as aspects of speaking skill? Can they reliably distinguish some subset of these possible dimensions of speech? What sorts of oral testing procedures seem to provide the greatest yield of individual differences (variance) among learners at various levels of development? How reliable are the subjective evaluations of interviews, speech samples, or other oral production performances? Are naive judges as good as trained judges? Do they use different criteria? How good are judges of speech samples at guessing the native language background of nonnative speakers of English? Does knowledge of the language supposed to be the source of the accent help the judge to guess the source language correctly? These are just some of the important questions dealt with.
Chapter 7

Oral Proficiency Testing in an Intensive English Language Program

Debby Hendricks, George Scholz, Randon Spurling, Marianne Johnson, and Lela Vandenburg

Several oral testing techniques are investigated. A variant of the FSI oral
interview technique is compared against three pragmatic testing methods
that are somewhat simpler to apply. All the oral testing techniques studied
are correlated with the CESL Placement battery. Reliability indices for the
various subscales on the FSI Oral Interview are reported. Scores on the
various oral testing techniques (twenty-seven separate scores in all) are
factor analyzed to see if oral language proficiency can be broken down into
component parts. The results seem to support a single-factor solution. A
multiple-factor alternative is examined but discarded as probably unreliable. In both factor solutions, the FSI scales of Accent, Grammar, Vocabulary, Fluency, and Comprehension all load on a common factor, suggesting that they are unitary. There is no evidence to support the claim that separate aspects of oral proficiency can be clearly distinguished.

The need for developing efficient, valid methods of testing oral proficiency has long been recognized. The most obvious approach to oral testing, and the one presumed to be most valid, is the oral interview. An interview provides a very direct method of challenging someone to speak; and it offers a realistic situation in which to assess overall oral mastery of a particular language. Alternate methods, especially less direct approaches, have been criticized for not really putting genuine language ability to the test. However, this may be more the fault of the influence of the typical discrete point orientation of language testers in recent years than a real difficulty in devising feasible alternatives to oral interview.
The discrete point approach to language testing fails to provide either a contextualized sociolinguistic setting or a sufficiently detailed sample of the language.


As Spolsky et al. (1972) have indicated, discrete point tests may either be based on a sample of some inventory of linguistic elements of a particular language, or be derived from some notion of functional needs. However, if we look beyond the level of words, the total number of elements in any language is not finite or well defined; therefore, sampling techniques can scarcely be applied at all, and never very systematically. Further, a selection based on some idea of functional necessity can never be quite certain that a specific element or set of elements is in fact really necessary. The natural redundancy of any language allows one to use it without having fully mastered every structure of the language. In fact, the creative and nonrepetitive nature of language systems in use makes it impossible for anyone to "know" all the structures in the sense of separate items of knowledge. Finally, neither of the commonly used methods of discrete point item selection seems likely to provide an adequate assessment of a person's ability to function in a realistic language setting.
To date, the most widely used technique for evaluating oral proficiency is the Foreign Service Institute (FSI) Oral Interview (Spolsky et al., 1972; Wilds, 1975; Oller, 1979). The FSI interview has high reliability, and because it takes place in a more or less natural setting, it is believed to be a good determinant of a person's true language competence. Its major drawback is that it is time-consuming and expensive to administer and score. Altogether, about a half hour is required on the average for administering and scoring each interview (Wilds, 1975).
One alternative to oral interviewing would be to develop other pragmatic tests of oral proficiency which (though they may be less direct than the oral interview) are not tied to discrete point methods of test construction, scoring, and interpretation. It has been suggested that reduced redundancy tests, such as cloze tests, may provide valid replacements for the oral interview (Clark, 1975). Oller and Hinofotis (Chap. 1) found correlations ranging from .51 to .62 between a cloze test and the various subscales of the FSI oral interview. This encourages us to believe that some type of cloze test (written or oral) might be refined to provide some of the information available through the more expensive and time-consuming oral interview techniques.
This chapter evaluates four possible procedures for testing oral proficiency. Three pragmatic speaking tasks, repetition (elicited imitation), oral cloze, and reading aloud, are used along with a traditional FSI-type oral interview. The interview technique used here, however, does not fully meet FSI requirements—in particular, the teams of examiners did not include a trained linguist. Results show that this fact apparently did not reduce reliability, but for this reason we refer to the technique here as an FSI type of oral interview. The results of all four proficiency tests were intercorrelated and examined in relation to the factor results of Scholz et al. (Chap. 2). The experimental tests were also correlated with the scores of the Center for English as a Second Language (CESL) placement test consisting of tests 1, 14, and 21 described in Chap. 2.
In addition to the factoring already reported in Chap. 2, further factoring was done (both a principal component solution and a varimax rotation) to determine which of the specifically oral tasks and which scores based on those tasks generated the most meaningful variance in relation to the other oral tasks.

Experiment
Subjects. Seventy of the 182 students enrolled at all levels of proficiency at the Center for English as a Second Language at Southern Illinois University at Carbondale served as subjects. Since all 182 CESL students were invited to participate, the ones who actually elected to do so were probably not completely representative of the whole group. In fact, they tended to be students who were more confident about their skill in English. Nevertheless, their overall level of performance was not particularly high on the FSI scales. The mean for Accent was 1.68 (out of a possible 4 points); for Grammar it was 12.69 (possible, 36); for Vocabulary it was 8.98 (possible, 24); for Fluency, 5.58 (12); and for Comprehension, 10.38 (23). Thus on the scale in Appendix 7C this group would get a mean proficiency rating of 1+ (based on a score of 39.31), and according to Appendix 7A, which gives rough verbal descriptions of the five proficiency levels, they would be able to do little more than fulfill minimum travel needs.
Test Materials. The CESL Placement Test, comprising the three parts described in Chap. 2, was used for some of the correlations reported below. Subtests were aimed at Listening Comprehension, Structure, and Reading. The five proficiency levels of the FSI Oral Interview are defined roughly in Appendix 7A (taken from the Manual for Peace Corps Language Testers prepared by Educational Testing Service). The five subscales are similarly described in Appendix 7B. Finally, the recommended weighting and conversion tables for interpreting scores on the five scales to assign a proficiency level rating are given as Appendix 7C.
In cases where the subject had very little or no command of the language, the interview time was about 15 minutes or even less, but normally the interviews lasted between 20 and 30 minutes. Each exchange was tape-recorded and subjects were rated by two interviewers. In the weightings assigned to the various subscales, Grammar received the heaviest weight, followed by Vocabulary, Comprehension, Fluency, and Accent, respectively, as shown in Appendix 7C.
Interviewers. Prospective FSI interviewers normally must participate in a training program which for Peace Corps Language Testers required four to five days (ETS, 1970). On the surface, the FSI Oral Interview appears to be a normal, everyday conversation, but it is supposed to be "a specialized procedure which uses the relatively brief testing period to explore many aspects of the student's language competence in order to place him in one of the categories (levels) described" (ETS, 1970, p. 11). In order to reasonably duplicate the FSI requirements, two teams of two interviewers each were made up from a pool of one instructor and three graduate assistants at CESL. Before collection of the speech data, all interviewers read the Manual for Peace Corps Language Testers (ETS, 1970) and listened to the fifteen training tapes of sample FSI Oral Interviews provided with the manual. A list of possible interview topics was given to each team (see Appendix 7D).
Pragmatic Speaking Tests. All three of the pragmatic tests, repetition, oral cloze, and reading aloud, contained texts intended to range in difficulty from easy to difficult. All passages were approximately seventy words in length. Each easy text was selected from a 4th to 5th grade reader, the intermediate texts were taken from a junior high school reader, and the difficult passages were excerpted from a 12th grade reader. The texts of all three pragmatic tasks are given as entries 11, 12, and 13 in the Appendix at the end of this volume.
Procedure. Subjects were interviewed within a six-week period in the spring of 1977. Oral interviews were scheduled during the subjects' free time. During the experiment all interviews were tape-recorded. To establish interrater reliability, the first twelve subjects scored by each interviewing team were rated again by the other team, producing twenty-four interviews rated by both teams.
The pragmatic speaking tests (repetition, oral cloze, and reading aloud) were administered at the CESL/SIU Language Laboratory during the same six-week period as the oral interviews. (For detailed instructions provided to the subjects, see the Appendix entries 11, 12, and 13.) For the pragmatic tasks each subject was individually tape-recorded. Actual testing time was approximately 40 minutes, with an additional 15 minutes for seating and laboratory adjustments.
Scoring. The oral interviews were scored by the usual FSI procedures. Both the repetition and oral cloze tests were scored by exact and acceptable word methods. The reading aloud tests were scored in three ways: a first score was arrived at by counting exact word renditions (one point for each word); a second score was derived by counting all the appropriate words that were included in subjects' responses but not in the original text; and a third score was simply the amount of time (in seconds) required for each reading. Gross mispronunciations were counted as errors in all cases. To obtain an overall score for the reading aloud tasks, the number of correct words in each passage (exact plus acceptable) was divided by the number of seconds required to complete each passage. This was the score used by Scholz et al. in their factor analyses (see Tables 2-1 and 2-4).
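Since each reading aloud passage yields three subscores and a composite, a small sketch may make the bookkeeping concrete. The function name and the sample numbers below are hypothetical; the composite formula is the one just defined:

```python
# Sketch of the reading-aloud scoring described above (names hypothetical).
def score_reading_aloud(exact_words: int, acceptable_words: int,
                        seconds: float) -> dict:
    """Return the three subscores and the composite used in the factor analyses.

    exact_words      -- words rendered exactly as in the text
    acceptable_words -- appropriate words not in the original text
    seconds          -- time required to read the passage
    """
    overall = (exact_words + acceptable_words) / seconds
    return {"exact": exact_words, "acceptable": acceptable_words,
            "time": seconds, "overall": overall}

# e.g., a seventy-word passage read in 50 seconds with 60 exact and
# 4 acceptable words yields an overall score of (60 + 4) / 50 = 1.28
print(score_reading_aloud(60, 4, 50.0))
```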

Results and Discussion

As Table 7-1 indicates, the correlation between the two interviewing teams for the overall interview score was .90. The only low correlation between the ratings of the teams was for the Accent subscale (.43). However, the high intercorrelations for the other scales show that the two interview teams agreed substantially in their assessments of oral proficiency.

Table 7-1 Interrater Reliabilities for the Subscales of the FSI Oral Interview (N = 24)

Measure (Group 1 rating vs. Group 2 rating)     r
1 Oral Interview—Accent                        .43
2 Oral Interview—Grammar                       .83
3 Oral Interview—Vocabulary                    .85
4 Oral Interview—Fluency                       .80
5 Oral Interview—Comprehension                 .89
6 Oral Interview—Total Score                   .91
7 Converted Total Score                        .85
8 Overall Scores                               .90

Table 7-2 indicates that the pragmatic task which correlates most highly with the overall FSI score is Repetition (.70); Oral Cloze follows (.63), and then the overall Reading Aloud score (.51). Each of these correlations is significant at the .001 level. In terms of common variance, the Repetition task accounts for 49% of the variance in the overall FSI rating; the Oral Cloze has a common variance of 40% with the overall FSI score, and the Reading Aloud 26%.
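These common variance figures are simply the squares of the correlations just reported:

```latex
r^2_{\text{Repetition}} = (.70)^2 = .49, \qquad
r^2_{\text{Oral Cloze}} = (.63)^2 \approx .40, \qquad
r^2_{\text{Reading Aloud}} = (.51)^2 \approx .26 .
```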
These facts can probably best be accounted for on the basis of the respective reliabilities of the various measures. Repetition appears to be a promising measure of oral proficiency. It correlates significantly (p < .001) with all the subscales of the Oral Interview, and with the other pragmatic tasks. In addition, Repetition may be useful as a diagnostic test. Subjects tended to make consistent grammatical errors which could provide invaluable diagnostic data in defining the learner's interlanguage system (cf. Selinker, 1972). Oral Cloze also looks promising. However, we believe its utility could possibly be enhanced by altering the format to require phrases expressing the next idea in the text rather than single words as in the written cloze formats. Reading Aloud is the least effective measure, but this may be due to the scoring method used. We return to the latter problem below in the discussion of the factor analyses in Tables 7-4 and 7-5. There is a hint in Table 7-2 that the Reading Aloud task measures Fluency (r = .91), Comprehension (r = .79), and to a lesser extent Vocabulary (r = .44) more than whatever is measured by the other scales. However, as we will see below, this interpretation does not fit the factor results.

Table 7-2 Correlations between the Subscores on the Four Oral Proficiency Tests (N > 62, < 70)*

Measures                             1     2     3     4     5     6     7     8     9
1 Oral Interview—Accent           1.00   .43   .40   .32   .37   .47   .41   .37   .18
2 Oral Interview—Grammar                1.00   .89   .23   .53   .81   .44   .32   .06
3 Oral Interview—Vocabulary                   1.00   .77   .84   .93   .56   .45   .44
4 Oral Interview—Fluency                            1.00   .89   .67   .66   .72   .91
5 Oral Interview—Comprehension                            1.00   .86   .75   .75   .79
6 Oral Interview (Overall Rating)                               1.00   .70   .63   .51
7 Repetition                                                          1.00   .80   .62
8 Oral Cloze                                                                1.00   .67
9 Reading Aloud                                                                   1.00

*p < .05 for all correlations at or above .25.

Table 7-3 Correlations between the Subscores of the Four Oral Proficiency Tests and the Subtests of the CESL Placement Battery (N > 62, < 70)*

Oral Proficiency Tests             CESL Reading   CESL Structure   CESL Listening   Total
Oral Interview—Accent                   .25            .34              .29          .34
Oral Interview—Grammar                  .49            .47              .43          .53
Oral Interview—Vocabulary               .45            .42              .47          .51
Oral Interview—Fluency                  .10            .01              .08          .07
Oral Interview—Comprehension            .23            .15              .25          .23
Oral Interview (Overall Rating)         .35            .32              .37          .39
Repetition                              .31            .25              .35          .34
Oral Cloze                              .31            .31              .39          .39
Reading Aloud                           .13            .01              .03          .05

*p < .05 for all correlations above .25.
Table 7-3 shows the relationship between the oral proficiency tests and the
CESL Placement battery. The CESL battery consistently correlates best with the
Oral Interview Grammar and Vocabulary scores. They share a maximum common
variance of about 28%. On the whole, the Oral Interview is roughly equivalent to
Repetition and Oral Cloze as a predictor of the CESL Placement battery.
The factor analysis results reported in Table 7-4 are excerpted from Oller (1979, Appendix Table 5). Here it should be remembered that only the FSI scales are really full-blown testing procedures. The other scores included are actually subscores from the Repetition, Oral Cloze, and Reading Aloud tasks defined above. Nevertheless, in spite of the reduced reliability that should be expected when scores are based on such shortened subtests, the loadings on a common principal factor (g) are nearly all significant at the .01 level. Only five of the subscores included fail to load significantly on g. At least one of the subscores in each set accounts for .45 or more of the variance in the common factor.
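A principal components extraction of this kind is straightforward to reproduce. In the sketch below the data matrix is a hypothetical stand-in for the actual 64 subjects by 27 scores; loadings on g are computed, in the usual way, as the first eigenvector of the correlation matrix scaled by the square root of its eigenvalue:

```python
# Sketch of a principal components (g-factor) extraction over a
# subjects-by-scores matrix. The data are hypothetical stand-ins for
# the 64 x 27 matrix of speaking scores analyzed here.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 27))             # subjects x scores (hypothetical)

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each score
R = np.corrcoef(Z, rowvar=False)          # 27 x 27 correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
g_val, g_vec = eigvals[-1], eigvecs[:, -1]

loadings_on_g = g_vec * np.sqrt(g_val)    # loading = eigenvector * sqrt(eigenvalue)
squared = loadings_on_g ** 2              # variance each score shares with g
print(f"first eigenvalue = {g_val:.2f}")
print(f"proportion of total variance = {g_val / R.shape[0]:.2f}")
```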
Table 7-4 Principal Components Analysis over Twenty-seven Speaking Scores (N = 64)*

Scores                           Loadings on g   Squared loadings
Oral Interview—Accent                 .59              .35
Oral Interview—Grammar                .83              .69
Oral Interview—Vocabulary             .79              .63
Oral Interview—Fluency                .80              .64
Oral Interview—Comprehension          .87              .76
FSI Oral Proficiency Level            .87              .76
Repetition Exact A                    .71              .50
Repetition Acceptable A               .07              .00
Repetition Exact B                    .87              .76
Repetition Acceptable B               .29              .08
Repetition Exact C                    .39              .15
Repetition Acceptable C              -.06              .00
Oral Cloze Exact A                    .54              .29
Oral Cloze Acceptable A               .66              .44
Oral Cloze Exact B                    .59              .35
Oral Cloze Acceptable B               .43              .18
Oral Cloze Exact C                    .38              .14
Oral Cloze Acceptable C               .67              .45
Reading Aloud Time A                 -.65              .42
Reading Aloud Exact A                 .39              .15
Reading Aloud Acceptable A           -.10              .01
Reading Aloud Time B                 -.65              .42
Reading Aloud Exact B                 .70              .49
Reading Aloud Acceptable B           -.36              .13
Reading Aloud Time C                 -.54              .29
Reading Aloud Exact C                 .52              .27
Reading Aloud Acceptable C           -.17              .03

Eigenvalue = 9.39

*Loadings above .33 are significant at p < .01, those above .25 at p < .05; in this case and in Table 7-5, the deletion of missing cases was listwise (Nie et al., 1975), and therefore the subject population was the same for each and every test score.

The fact that the Acceptable word scores for Repetition tasks A, B, and C load inconsistently (as well as lightly or insignificantly) on g is easily understood. If subjects rendered the texts correctly by the exact word criterion (i.e., a verbatim repetition) there was little likelihood of their getting any additional points for Acceptable words not in the original text. The same explanation applies even more dramatically to the Reading Aloud Acceptable scores, which all tend to load negatively on g. This is due to the fact that the more the subject tends to render the texts exactly, the less opportunity they have for creative innovations that would gain points under the Acceptable word scoring method. The Oral Cloze tasks, however, were sufficiently challenging that Acceptable scores account for about as much of the variance in g as do the Exact scores. As would be expected, the Reading Aloud Time scores are negatively related to the global oral proficiency factor. The more proficient subjects required less time; hence the longer the time the lower the proficiency.
Moreover, by studying the loadings on g for the Reading Aloud scores, it is apparent that by including the Acceptable scores in the computation of the overall Reading Aloud scores, we necessarily depressed the effectiveness of Reading Aloud as a measure of global proficiency. This fact probably explains (at least in part) the results obtained in Tables 7-2 and 7-3 where the Reading Aloud overall score (defined as the quantity, Exact plus Acceptable score, divided by Time) performed less well than Repetition and Oral Cloze in relation to the FSI and CESL subscores. All the loadings on g in Table 2-1 of Repetition, Oral Cloze, and Reading Aloud could probably be improved by taking advantage of what can be learned from Table 7-4 concerning optimum scoring methods.

Table 7-5 Varimax Rotated Solution over Twenty-seven Speaking Scores (N = 64)

Scores (significant loadings on Factors 1 through 7, followed by the communality h²):
Oral Interview—Accent: .48; h² = .23
Oral Interview—Grammar: .89; h² = .79
Oral Interview—Vocabulary: .90; h² = .81
Oral Interview—Fluency: .83; h² = .69
Oral Interview—Comprehension: .87; h² = .76
FSI Oral Proficiency Level: .92; h² = .85
Repetition Exact A: .45, .46; h² = .41
Repetition Acceptable A: .73; h² = .53
Repetition Exact B: .48, .50; h² = .48
Repetition Acceptable B: .49, .46; h² = .45
Repetition Exact C: .69; h² = .48
Repetition Acceptable C: .58; h² = .34
Oral Cloze Exact A: .49; h² = .24
Oral Cloze Acceptable A: .75; h² = .56
Oral Cloze Exact B: .52, .45; h² = .47
Oral Cloze Acceptable B: .86; h² = .74
Oral Cloze Exact C: .72; h² = .52
Oral Cloze Acceptable C: .56; h² = .33
Reading Aloud Time A: -.82; h² = .67
Reading Aloud Exact A: .61, .44, .45; h² = .77
Reading Aloud Acceptable A: .87; h² = .76
Reading Aloud Time B: -.84; h² = .71
Reading Aloud Exact B: .63; h² = .40
Reading Aloud Acceptable B: .72; h² = .52
Reading Aloud Time C: -.82; h² = .67
Reading Aloud Exact C: .45, .56; h² = .51
Reading Aloud Acceptable C: .73; h² = .53

Eigenvalue = 15.22

In Table 7-5 a varimax rotated solution is presented over the same data. While the single-factor solution accounts for only 35% of the total variance in all the scores, the rotated solution (with seven significant factors) accounts for an additional 21% (for a total of 56% of the variance); even so, the rotated solution is hardly parsimonious in interpretation.
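The variance percentages follow directly from the eigenvalues over the twenty-seven scores:

```latex
\frac{9.39}{27} \approx .35 \qquad \text{versus} \qquad \frac{15.22}{27} \approx .56 ,
```

so the seven rotated factors pick up the additional 21% just mentioned.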
All the FSI scores load on Factor 1 (Table 7-5). We might have expected the Fluency scale to load on a factor common to the Time scores for the Reading Aloud tasks. However, the Time scores all load by themselves on Factor 3. The Repetition scores scatter their variances over five of the seven orthogonal (i.e., uncorrelated) factors. In particular, the Exact scores tend to load on Factor 2 along with the Acceptable scores for the Oral Cloze tasks, while the Acceptable scores for the Repetition tasks fall out on Factor 6. These clusters of loadings are probably due more to unreliabilities in the tasks (probably because of their brevity) than to reliable differences in the nature of the processing skills exercised.
Clearly, there is no evidence that the several FSI scales are measuring different skills or components of oral proficiency. Further, in view of the fact that the loadings on the varimax rotated factors are scarcely any higher in Table 7-5 than in Table 2-1 (which offers a single-factor solution for an even wider range of tasks), it is assumed that the FSI scales are unitary and that they measure the same basic skill that underlies performance on the three oral pragmatic tasks (especially Repetition scored by the Exact word method, Oral Cloze scored by the appropriate word method, and Reading Aloud scored simply in terms of the number of seconds it takes subjects to complete the task).
Finally, in view of the estimated reliabilities of the various scores, it seems
reasonable to conclude that the unitary factor solution presented in Table 7-4
explains the bulk of the reliable variance in all the tasks considered and that the
additional variance accounted for in the multiple-factor solution shown in Table
7-5 is indeed unreliable variance. The latter claim will require further empirical
substantiation, but in the meantime it seems safe to say that Repetition, Oral
Cloze, and Reading Aloud all offer some promise as substitutes for oral interview
testing.

Appendix 7A
The Five Levels of Overall Proficiency of the FSI Oral Interview

Level 1
Able to satisfy routine travel needs and minimum courtesy requirements. The
student can answer questions on topics very familiar to him; within the scope of his
very limited language experience he can understand simple questions and
statements, allowing for slowed speech, repetition, or paraphrase. His speaking
vocabulary is inadequate to express anything but the most elementary needs;
errors in pronunciation and grammar are frequent, but he can be understood by a
native speaker used to dealing with foreigners attempting to speak his language.

Level 2
Able to satisfy routine social demands and limited work requirements. The student can handle with confidence but not with facility most social situations including introductions and casual conversations about current events, as well as work, family, and autobiographical information. He can handle limited work requirements, needing help in handling any complications or difficulties. He can get the gist of most conversations on nontechnical subjects (i.e., topics which require no specialized knowledge) and has a speaking vocabulary sufficient to express himself with some circumlocutions. His accent, though often quite faulty, is intelligible. He can handle elementary constructions quite accurately but does not have thorough or confident control of the grammar.

Level 3
Able to speak the language with sufficient structural accuracy and vocabulary to participate effectively in most formal and informal conversations on practical, social, and professional topics. The student can discuss particular interests and special fields of competence with reasonable ease. His comprehension is quite complete for a normal rate of speech. His vocabulary is broad enough that he rarely has to grope for a word. His accent may be obviously foreign. His control of grammar is good, and his errors never interfere with understanding and rarely disturb the native speaker.

Level 4
Able to use the language fluently and accurately on all levels normally pertinent to professional needs. The student can understand and participate in any conversation within the range of his experience with a high degree of fluency and precision of vocabulary. He would rarely be taken for a native speaker, but he can respond appropriately even in unfamiliar situations. His errors of pronunciation and grammar are quite rare, and he can handle informal interpreting from and into the language.
Level 5
Speaking proficiency equivalent to that of an educated native speaker. The
student has complete fluency in the language such that his speech on all levels is
fully accepted by educated native speakers in all its features, including breadth of
vocabulary and idiom, colloquialisms, and pertinent cultural references.

Appendix 7B
Proficiency Descriptions
Accent
1. Pronunciation frequently unintelligible.
2. Frequent gross errors and a very heavy accent make understanding difficult,
require frequent repetition.

3. "Foreign accent" requires concentrated listening, and mispronunciations lead to occasional misunderstanding and apparent error in grammar or vocabulary.
4. Marked "foreign accent" and occasional mispronunciations which do not interfere with understanding.
5. No conspicuous mispronunciations, but would not be taken for a native speaker.
6. Native pronunciation, with no trace of "foreign accent."

Grammar
1. Grammar almost entirely inaccurate except in stock phrases.
2. Constant errors showing control of very few major patterns and frequently
preventing communication.
3. Frequent errors showing some major patterns uncontrolled and causing
occasional irritation and misunderstanding.
4. Occasional errors showing imperfect control of some patterns but no weak¬
ness that causes misunderstanding.
5. Few errors, with few patterns of failure.
6. No more than two errors during the interview.

Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas (time, food, transportation, family, etc.).
3. Choice of words sometimes inaccurate; limitations of vocabulary prevent discussion of some common professional and social topics.
4. Professional vocabulary adequate to discuss special interests; general vocabulary permits discussion of any nontechnical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocabulary adequate to cope with complex practical problems and varied social situations.
6. Vocabulary apparently as accurate and extensive as that of an educated native speaker.

Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left incomplete.
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing and groping for words.
5. Speech is effortless and smooth, but perceptibly nonnative in speed and evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker's.

Comprehension
1. Understands too little for the simplest kind of conversation.
2. Understands only slow, simple speech on common social and touristic topics; requires constant repetition and rephrasing.
3. Understands careful, simplified speech directed to him, but requires occasional repetition or rephrasing.
4. Understands quite well normal educated speech directed to him, but requires occasional repetition or rephrasing.
5. Understands everything in normal educated conversation except for very colloquial or low-frequency items, or exceptionally rapid or slurred speech.
6. Understands everything in very formal and colloquial speech to be expected of an educated native speaker.

Appendix 7C
FSI Score Weighting Table

Proficiency description    1    2    3    4    5    6
Accent                     0    1    2    2    3    4
Grammar                    6   12   18   24   30   36
Vocabulary                 4    8   12   16   20   24
Fluency                    2    4    6    8   10   12
Comprehension              4    8   12   15   19   23

Total

Conversion Table

Total score (from weighting table)    Level
16-25                                 0+
26-32                                 1
33-42                                 1+
43-52                                 2
53-62                                 2+
63-72                                 3
73-82                                 3+
83-92                                 4
93-99                                 4+
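The weighting and conversion amount to two table lookups. A sketch in Python (the function name is ours; the weights and cutoffs are those printed above):

```python
# Sketch of the FSI weighting and conversion in Appendix 7C
# (function name hypothetical; weights and cutoffs reproduced from the tables).
WEIGHTS = {                # subscale -> weighted points for ratings 1..6
    "Accent":        (0, 1, 2, 2, 3, 4),
    "Grammar":       (6, 12, 18, 24, 30, 36),
    "Vocabulary":    (4, 8, 12, 16, 20, 24),
    "Fluency":       (2, 4, 6, 8, 10, 12),
    "Comprehension": (4, 8, 12, 15, 19, 23),
}
LEVELS = [(16, "0+"), (26, "1"), (33, "1+"), (43, "2"), (53, "2+"),
          (63, "3"), (73, "3+"), (83, "4"), (93, "4+")]

def fsi_level(ratings: dict) -> tuple:
    """ratings maps each subscale to a 1-6 rating; returns (total, level)."""
    total = sum(WEIGHTS[scale][r - 1] for scale, r in ratings.items())
    level = None                     # totals below 16 fall below the 0+ range
    for cutoff, name in LEVELS:
        if total >= cutoff:
            level = name
    return total, level

# e.g., ratings of 2 on every subscale give a total of 33, i.e., level 1+
print(fsi_level({"Accent": 2, "Grammar": 2, "Vocabulary": 2,
                 "Fluency": 2, "Comprehension": 2}))
```

On this conversion the subjects' mean weighted score of 39.31, reported earlier, likewise falls in the 33-42 band and hence at level 1+.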
Appendix 7D
Interview Topics
Present Tense
1. Your usual day
2. Your hobby
3. Your present job
4. Your pet economy
5. The thing you dislike most
6. Your home, apartment, or room
7. Your home town
8. Saturday afternoon in your home town
9. A holiday
10. Your favorite animal
11. The problems of an only child
12. Your big (or little) brother (or sister)
13. Your father's favorite sayings
14. Why you are in school
15. Your favorite subject (teacher, classmate) at school
16. What education means to you
17. Are examinations necessary?
18. Life in a college dormitory
19. Differences between high school and college
20. The importance of sports in college
21. Your best friend
22. Your worst enemy
23. An interesting person you know
24. The strangest person you know
25. Traffic problems
26. The best kind of vacation
27. Why you like (don’t like) television
28. What makes a good movie?
29. Small towns versus large towns

Past Tense
1. A frightening experience
2. Your most embarrassing moment
3. Your biggest surprise
4. What you did last weekend
5. An untrue story you told
6. Your most interesting trip
7. Your first long trip
8. An important event in your life
9. One time when you were misunderstood
10. Why you decided to come to this school or city

11. A folk tale


12. A movie or play you enjoyed (or didn’t enjoy)

Future Tense

1. The world in the year 2000


2. What will probably happen in the next 6 months
3. Things you intend to do
4. What you will probably do tomorrow
5. Your plans for a vacation
6. Your plans for next weekend (next year)
7. The plans you have for your children or grandchildren

Should or Imperative
1. How to be a good tourist
2. Travel tips
3. How to bake a cake, a pie (recipes and instructions)
4. Should married women work outside the home?

Conditional
1. If you had a million dollars
2. If you had three wishes
3. If you governed the world
4. If you had not come to this school
5. If you knew you had only two weeks to live
6. How would you teach English?
7. The changes you would make in this city
8. If you were the last person alive

Direct and Indirect Speech

1. A conversation you had this morning


2. A conversation you overheard
Chapter 8

Rater Reliability and Oral Proficiency Evaluations

Karen A. Mullen

This investigation attempted to determine (1) whether judges working independently are apt to reach similar conclusions regarding the oral proficiency of nonnative speakers of English; (2) whether the ratings of one pair of judges are similar to the ratings of other pairs of judges; (3) whether four subscales aimed at supposedly different aspects of proficiency actually contribute nonredundant information. Ninety-eight nonnative speakers of English were divided into six groups. Five interviewers were paired to form six teams. Each group of subjects was rated on five scales—listening comprehension, pronunciation, fluency, grammar, and overall proficiency—by one of the six teams of interviewers. The results show that ratings assigned by one judge in a pair are significantly different from ratings assigned by the other in some cases. However, the reliability coefficients are above .70 for all but two pairs of judges. Hence an average of the ratings by two judges is a better measure of proficiency than a rating by a single judge. The results also show that on all scales except grammar, the six groups were rated similarly. The four subscales were given approximately equal weight, and all taken together (rather than singly or in pairs or triplets) best predict overall oral proficiency. The overall scale shows a high reliability across all groups, and since it appears to be a composite of the four subscales, it is considered the best measure of oral proficiency.

Testing speaking proficiency has been of special interest in the field of second language learning, and it has generally been recognized that the best way to test for oral proficiency is to have a subject speak. However, the issue of what and how to test has remained. Five basic components of speaking skill have been proposed by Harris (1969): pronunciation, grammar, vocabulary, fluency, and auditory comprehension. Of these five, fluency has been said to be the easiest to assess since it focuses only on pauses, backtracking, and fillers in the speech flow. On the other hand, pronunciation has been said to be difficult to judge since criteria for judgment may vary from one person to another: some listeners may be able to readily decode foreign accents; others may be judging comprehensibility rather than phonemic or allophonic accuracy. Moreover, three of these components (grammar, vocabulary, and auditory comprehension) can presumably be tested without asking a subject to speak. However, evidence of a formal knowledge of grammar and vocabulary does not guarantee that it will be applied in actual speech production.
The question of how to test for speaking proficiency has usually been decided in favor of the interview. It has certain advantages: it can be conducted rather quickly, it most resembles real-life speaking situations, and it can be adjusted up or down depending upon the speaker's demonstrated proficiency. It, however, has been open to the criticism that the measures derived from such a test tend to have rather low reliability. Some have suggested that a tolerable degree of reliability can be achieved if behavioral statements are used as a standard for each scalar judgment, if judges are trained for their task, and if at least two judgments are pooled for each interview.
The purpose of this chapter is to report a study designed to determine if experienced ESL teachers, working in pairs, can reach the same judgments regarding the oral proficiency of nonnative speakers of English, i.e., to determine the degree of reliability of such judgments. In addition, the question of whether different sets of judges will rate the same subjects differently is also posed. Finally, the study was designed to determine the relative weight given to each component category in predicting the overall proficiency score. Specifically, the hypotheses were:

1. The ratings assigned by judge 1 in a specific category of oral proficiency are not significantly different from those of judge 2, when judge 1 and judge 2 are rating the same subject.
2. The ratings assigned by one pair of judges to a group of speakers are not significantly different from those assigned by other pairs of judges to other groups of speakers.
3. Each scale is differentially weighted in the assessment of a subject's overall proficiency.

Method
To test hypothesis 1, a single-factor experimental design having repeated measures was chosen. The F-statistic based upon the mean square* between judges divided by the mean square of the judge-subject interaction was computed to test the hypothesis of no significant differences between judges.1 Reliability coefficients (unbiased) were calculated based upon the number of subjects in the sample, the number of judges (2), the mean square between subjects, and the mean square within subjects (Winer, 1971, p. 287). To test hypothesis 2, a two-factor experimental design having repeated measures on one factor was chosen. The F-statistic based upon the mean square between groups divided by the mean square of subjects within groups was computed to test the hypothesis of no significant differences among groups.2 To test hypothesis 3, the regression coefficients of a multiple regression equation were calculated and an F-test for a coefficient of zero was performed.3

*Editors' note: For the non-statistically-trained reader, this chapter is a bit more technical than most of the others in this volume (also, see Chap. 15). Throughout this paper, the term mean square refers to a quantity obtained by adding up squares of deviations from some average score (or quantity) and dividing by the number of scores entering the computation. It is always an index of some sort of variance.
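In computational terms, the reliability estimate rests on two mean squares. The sketch below uses hypothetical ratings and one standard form of the reliability of pooled ratings, r = (MS between subjects - MS within subjects) / MS between subjects; we assume this is the form intended by the reference to Winer (1971):

```python
# Sketch of the mean squares and reliability coefficient described above,
# for n subjects each rated by k = 2 judges. The reliability formula is one
# standard form for the reliability of the mean of k ratings (an assumption;
# see Winer, 1971, for the exact derivation used in the study).
import numpy as np

def rating_reliability(ratings: np.ndarray) -> float:
    """ratings: n_subjects x k_judges array of numerical ratings."""
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)

    ms_between = k * ((subj_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))

    return (ms_between - ms_within) / ms_between

# Two judges rating five subjects on a 1-9 scale (hypothetical data)
demo = np.array([[3, 4], [7, 6], [5, 5], [2, 3], [8, 9]])
print(f"reliability of pooled ratings = {rating_reliability(demo):.2f}")
```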

Judges. Five judges participated in this study. They were randomly paired to
form six groups. All judges were graduate students in linguistics. They had
completed courses in phonetics, syntactic and phonological analysis, and TESL
methodology. They all had taught ESL for at least one year. All had been
instructed on how to use the rating form and guidelines, and all had participated in
such interviewing before. None of the judges had previously met the subjects they
interviewed.
Subjects. Ninety-eight subjects were referred to the University of Iowa Department of Linguistics for a proficiency evaluation by either the foreign admissions officer, the foreign student advisers, or the student's academic adviser. Most of the subjects were new to the university and were referred because their TOEFL scores were below 550 (an arbitrary cutoff point). Some were evaluated because the foreign student advisers had noted a lack of facility in spoken English although their TOEFL scores were not below 550. The purpose of the evaluation was to determine whether additional work in English and a reduced academic program might be recommended.
Procedure. Judges were required to rate speakers on five scales: listening comprehension, pronunciation, fluency, control over English structure, and overall speaking proficiency. These scales were labeled vertically on a rating form. Beside each of the five scales there was a double horizontal line equally divided into five contiguous compartments labeled from left to right: poor, fair, good, above average, excellent. The judges were instructed to put an X in the box best characterizing the speaker's proficiency with regard to each of the five scales or to put an X on the line between boxes if the subject's proficiency seemed to be between the two labeled areas. A set of guidelines for deciding what level of proficiency to assign was explained to the judges beforehand and was available for reference after each interview.
Each pair of judges was instructed to ask the subject a few questions in order to get the interview started—what the subject's name was, where he was from, how long he had been in the United States, where he had studied, what his major field of interest was, etc. These questions were intended to put the subject at ease as quickly as possible. The interviewers were instructed to begin by speaking at a normal rate and to slow down later if it became apparent that the subject could not follow. In lieu of grossly distorting their speech, the interviewers rephrased their questions by simplifying the sentence structure and vocabulary.
If it became evident that the subject was able to do well in a rather informal type of interview, the judges were prepared to shift the questioning to a level more like that which subjects would encounter in an informal academic setting. They would then ask such questions as: What interests you most about your field of study? What are some important questions yet to be answered? What special subfields exist? What are the relationships between them? The intent was to simulate a conversation, and the next question in a given sequence, of course, tended to follow from what had preceded. One of the aims was to vary the questioning from subject to subject so that later subjects would not know what questions to expect.
Raters were instructed to contribute to the conversation occasionally but to control their speaking in such a way that it could be used to assess the subject's listening comprehension. They were encouraged to give the subject every opportunity to demonstrate his speaking proficiency. After each interview, lasting approximately fifteen minutes, each judge evaluated the subject without consulting the other. Evaluations made by the individual judges were later converted to a numerical value of 1=poor, 2=between poor and fair, 3=fair, 4=between fair and good, 5=good, 6=between good and above average, 7=above average, 8=between above average and excellent, and 9=excellent.

Results and Discussion


Table 8-1 shows the results of a two-factor analysis of variance having repeated measures on one factor. A separate analysis of variance is reported for each of the five scales of speaking proficiency. The F-statistic for differences between groups is significant at the .05 level for listening, pronunciation, and overall speaking proficiency. It is significant at the .01 level for grammar. It is not significant for fluency. The F-statistic for differences between judges is significant at the .01 level for all scales.
Table 8-2 shows the results of a single-factor analysis of variance having repeated measures for each of the six groups and for each of the five rating scales. The F-statistic for a difference between judges' listening ratings is significant at the .05 level for three pairs of judges (groups 4, 5, and 6), between pronunciation ratings for two pairs (groups 1 and 2), between fluency ratings for one pair (group 2), between grammar ratings for three pairs (groups 1, 2, and 3), and between overall ratings for two pairs (groups 2 and 3). For every pair of judges, there was a significant difference in ratings on at least one scale. Additionally, one pair of judges (group 2) showed a significant difference in ratings on all but one scale. Two pairs (groups 5 and 6) showed a significant difference on just one scale, that of listening comprehension. At the .01 level, there is no significant difference between judges across rating scales for three sets of judges (groups 3, 5, and 6). Group 2 shows a significant difference between judges on every scale except listening.

Table 8-1 An Analysis of Variance of Performance Scores of Six Groups of Subjects Rated by Different Pairs of Judges on Five Scales of Speaking Proficiency

                                            Listening                Pronunciation
Source of variation           df      SS      MS      F         SS      MS      F
Between groups                 5    58.00   11.60   2.45*     38.54    7.71   2.78*
Between subjects within       92   435.98    4.74            255.41    2.78
  groups (pooled)
Between judges within          6    18.53    3.09   4.85†      9.31    1.55   3.38†
  groups (pooled)
Subjects × judges (pooled)    92    58.47     .64             42.19     .46

                                            Fluency                  Grammar
Source of variation           df      SS      MS      F         SS      MS      F
Between groups                 5    23.88    4.78   1.13      51.58   10.31   4.48†
Between subjects within       92   388.00    4.22            212.09    2.31
  groups (pooled)
Between judges within          6    18.74    3.12   5.55†      7.26    1.21  30.32†
  groups (pooled)
Subjects × judges (pooled)    92    51.76     .56             36.74     .04

                                            Overall
Source of variation           df      SS      MS      F
Between groups                 5    46.64    9.33   2.95*
Between subjects within       92   291.04    3.16
  groups (pooled)
Between judges within          6     7.68    1.3    4.15†
  groups (pooled)
Subjects × judges (pooled)    92    28.32     .31

*Significant at p < .05. †Significant at p < .01.

Table 8-3 shows the differences in judges’ mean ratings for each group. It is
apparent from the graphs that some lines rise more sharply than others and that the
distance between the lowest and the highest line is great enough to result in a
significant difference in groups at the .05 level. Group 2 stands out as the most
deviant with regard to differences between judges and with regard to mean ratings
when compared to other groups.

[Table 8-2 Analyses of Variance with Repeated Measures on Judges for Each of the
Six Groups on Each of the Five Rating Scales. The table is printed sideways in the
original and is not recoverable from this scan; its F-statistics are reproduced in
Table 8-4. *Significant at p < .05. †Significant at p < .01.]

Table 8-4 shows the reliability coefficients (unbiased) for each pair of judges
for all scales of oral proficiency. The coefficients range from a low of .43 to a high
of .93. Table 8-4 also shows the relationship between the F-statistic and the relia¬
bility coefficients. Some cases show a significant difference in judges’ ratings and a
high reliability coefficient (for example, listening ratings in group 4). The reverse
is also apparent (for example, fluency ratings in group 3). In addition, some cases
show no significant difference in judges’ ratings and a high reliability coefficient
(for example, listening ratings in group 1) as well as the reverse (for example,
grammar ratings in group 2). Since the reliability coefficient is a measure of the
degree to which the average error measurement is zero and indicates how good an
average of the two judges’ ratings is as an estimate of the subjects’ true rating, in
cases where there is a significant difference in judges’ ratings and a high reliabil¬
ity (above .70), one may conclude that an average of the two ratings is a better
estimate of the subject’s true rating than either by itself. In cases where there is a
high reliability and no significant difference in judges’ ratings, one may conclude
that either of the two ratings is as good an estimate of the subject’s true rating as an
average is. Where there is a low reliability and a significant difference in judges’
ratings, we may conclude that an average of the two judges’ ratings is not a good
estimate of the true rating. Likewise, where there is a low reliability and no signifi¬
cant difference in judges’ ratings, it is evident that the average error of measure-

Table 8-3 Means, Standard Deviations, and Plots of Means of Performance of Six
Groups Rated by Different Pairs of Judges on Five Scales of Speaking Proficiency

Group n Listening Pronunciation Fluency


Judge 1 Judge 2 Judge 1 Judge 2 Judge 1 Judge 2

Mean SD Mean SD Mean SD Mean SD Mean SD Mean SD

1 15 6.06 1.75 6.06 1.98 5.46 1.55 6.13 1.35 5.73 1.79 6.20 1.47
2 17 7.47 1.46 7.76 1.09 6.41 1.37 7.11 1.26 6.11 1.49 7.41 1.12
3 10 6.10 2.02 6.70 .82 5.30 1.25 5.70 .82 5.60 1.17 5.70 1.41
4 17 5.52 1.69 6.52 2.00 5.41 1.37 5.47 1.50 5.64 1.65 6.00 1.83
5 25 6.04 1.45 6.60 1.75 5.68 1.21 5.72 1.02 5.72 1.51 6.08 1.77
6 14 6.28 1.58 7.00 1.56 5.50 1.22 5.85 1.09 6.07 1.32 6.21 1.42

Group n Grammar Overall


Judge 1 Judge 2 Judge 1 Judge 2

Mean SD Mean SD Mean SD Mean SD

1 15 5.46 .99 6.00 .92 5.73 1.48 6.06 1.33


2 17 6.76 .90 7.35 .86 6.76 1.09 7.41 .93
3 10 5.50 1.26 6.10 .73 5.40 1.26 6.00 .81
4 17 5.52 1.50 5.58 1.12 5.58 1.66 5.82 1.50
5 25 5.64 1.31 5.80 1.19 5.68 1.34 5.96 1.24
6 14 5.85 1.16 5.92 1.49 5.92 1.43 6.00 1.30

Table 8-3 (continued)

[Plots of the judges’ mean ratings by group; only the panel titles Listening,
Pronunciation, and Fluency survive in this scan.]

ment is sufficient to reduce the extent to which an average of the two ratings is a
good estimate of the subject’s true rating. The overall scale appears to provide the
most uniform reliability coefficients across groups of subjects. Since a significant
difference in judges’ ratings is evident, it is best to use an average of the two judges’
ratings as the best estimate.
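
The chapter does not print the formula behind these coefficients. One standard ANOVA-based estimate, the intraclass reliability of the average of k judges' ratings (Ebel's formula), is sketched below in Python as a plausible reconstruction, not necessarily Mullen's exact computation:

    import numpy as np

    def average_rating_reliability(ratings):
        # ratings: an (n subjects) x (k judges) array of numeric ratings.
        n, k = ratings.shape
        grand = ratings.mean()
        ss_subjects = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
        ss_judges = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
        ss_total = ((ratings - grand) ** 2).sum()
        ms_subjects = ss_subjects / (n - 1)
        ms_resid = (ss_total - ss_subjects - ss_judges) / ((n - 1) * (k - 1))
        # Reliability of the mean of the k judges' ratings.
        return (ms_subjects - ms_resid) / ms_subjects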

As shown in Table 8-5, the correlation between scales is high. However, a


stepwise multiple regression analysis reveals that as the variables are added one by
one to the equation for predicting the overall proficiency rating, each addition to

Table 8-4 A Comparison of the F-statistics from Analyses of Variance (from
Table 8-2) and Reliability Coefficients of Judge Pairs

Group  n    Listening       Pronunciation   Fluency         Grammar         Overall
            F        r2     F        r2     F        r2     F        r2     F        r2
1      15   .00      .93    4.83*    .72    1.78     .75    10.42†   .76    2.50     .88
2      17   1.74     .82    11.76†   .79    39.51†   .57    9.30†    .57    19.36†   .77
3      10   1.23     .43    2.25     .74    .05      .50    7.36*    .74    7.36*    .75
4      17   15.11†   .82    .11      .92    2.85     .92    .03      .66    1.66     .93
5      25   6.24*    .82    .03      .67    3.27     .88    .88      .86    2.24     .83
6      14   5.51*    .77    2.52     .81    .32      .85    .10      .88    .13      .92

*Significant at p < .05. †Significant at p < .01.

the equation significantly reduces the amount of unpredicted variance. Table 8-6
shows these results. The beta coefficients in the multiple regression equation
which are used to predict the overall rating from the four other scales indicate that
if the listening rating were increased by one unit and the other ratings remained
constant, the expected change in the overall proficiency rating would be .23. If the
grammar rating were increased by one (and the other ratings remained constant),
the change in the overall proficiency rating would be .29. Parallel conclusions may

Table 8-5 Correlations of Ratings on the Five Scales of Speaking Proficiency

Scale 2 3 4 5

1 Listening .79 .85 .79 .90


2 Pronunciation .77 .78 .88
3 Fluency .80 .89
4 Grammar .89
5 Overall Proficiency

be reached for the influence of an increase in the pronunciation or fluency ratings


on the overall proficiency rating. If all four ratings were increased by one unit, the
change in the overall proficiency rating would be very close to one unit as well.
This indicates that each of the scales is weighted approximately equally in the
determination of the overall proficiency rating. The F-statistic for B in Table 8-6
indicates that the hypothesis of no weight given to any one of the variables is to be

Table 8-6 A Stepwise Regression Analysis with the Overall Proficiency Scale as the
Dependent Variable

Source df SS MS F R2 Beta SE (B) F (for B)

Regression 1 300.060 300.060 790.78* .803


Listening 1 300.060 300.060 790.78* .90 .025 790.78*
Residual 194 73.612 .379

Regression 2 332.739 166.369 784.40* .890


Listening 1 300.060 .51 .031 168.16*
Grammar 1 32.674 32.674 154.07* .49 .043 154.07*
Residual 193 40.934 .212

Regression 3 343.842 114.610 737.40* .920


Listening 1 300.060 .36 .030 91.99*
Grammar 1 32.674 .36 .040 96.38*
Pronunciation 1 11.091 11.091 71.37* .31 .038 71.37*
Residual 192 29.841 .155

Regression 4 348.994 87.248 675.25* .933


Listening 1 300.060 .23 .032 34.52*
Grammar 1 32.674 .29 .038 67.81*
Pronunciation 1 11.091 .28 .035 67.60*
Fluency 1 5.162 5.162 39.95* .24 .034 39.95*
Residual 191 24.678 .129

*Significant at p < .01.

rejected in each case. All are about equally important contributors to the
prediction of scores on the overall proficiency scale.
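
Since the raw ratings are not reproduced in the chapter, the final step of Table 8-6 can only be illustrated. The sketch below (Python/NumPy; the data are simulated, with the reported betas as the true weights) recovers the coefficients and R2 by ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(0)   # hypothetical data, for illustration only
    listening, grammar, pron, fluency = rng.normal(5, 1.5, size=(4, 196))
    overall = (.23 * listening + .29 * grammar + .28 * pron
               + .24 * fluency + rng.normal(0, .3, 196))

    # Ordinary least squares for the four-predictor equation.
    X = np.column_stack([np.ones(196), listening, grammar, pron, fluency])
    b, *_ = np.linalg.lstsq(X, overall, rcond=None)
    resid = overall - X @ b
    r2 = 1 - (resid ** 2).sum() / ((overall - overall.mean()) ** 2).sum()
    # b[1:] comes back near (.23, .29, .28, .24), and r2 is high.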

Conclusions
Two of the hypotheses outlined at the beginning of the chapter are rejected by the
results of the analysis of variance on performance ratings by pairs of judges in each
of the six groups. Ratings assigned by one judge are significantly different from
ratings assigned by the other in six out of thirty cells at the .01 level and eleven out
of thirty cells at the .05 level. However, the reliability coefficients for four out of
the six significant-difference cases at the .01 level are above .70, and the relia¬
bility coefficients for nine out of the eleven significant-difference cases at the .05
level are above .70. Therefore, an average of the two judges’ ratings will serve as a
good estimate of the true rating. This means that it is best if there is more than one
judge per subject.
There is a significant difference at the .05 level in the ratings assigned by one
pair of judges to a group of speakers when compared to the ratings assigned by
another pair of judges to another group of speakers. This may be due to a
difference in judgments made by the raters or a difference in the groups them¬
selves. Since the design of the experiment did not control for homogeneity of

groups, it is impossible to determine the source of this difference. However, at the


.01 level, there is no significant difference in assigned ratings among the groups on
four out of the five scales. We may conclude that the ratings assigned by the judges
on the listening, pronunciation, fluency, and overall proficiency scales do not
differ from group to group and that with regard to these scales and these raters, the
six groups were fairly homogeneous. For the grammar scale, the significant differ¬
ence in groups may be due to a real difference in performance in grammar among
the groups or to a difference in the application of the scale from one set of judges to
another.
Each of the four subscales contributes significantly to the determination of
the overall proficiency rating. In addition, using all four scales to predict the over¬
all rating yields significantly less unexplained variance in the overall scale than
does using the three best scales or the two best scales or the best single scale.
Moreover, the scales are given approximately equal weight in the prediction equa¬
tion and the relationship between the four scales and the overall scale is very
nearly perfectly linear. Since the overall scale shows high reliability coefficients
across all groups and since it appears to be a composite of the other four scales, it is
the best scale of measurement for oral proficiency.

Additional Editors’ Notes

1. In other words, if the variance across judges is small in relation to the variance across subjects (and
if the F-ratio is therefore high), it is assumed that the judges agree on where the differences between
subjects lie. This would be an indication of interrater reliability.

2. In other words, if the differences (variance) among subjects in the same group are about equal to or
greater than the differences (variance) between subjects in different groups, it will be assumed that the
ratings of different pairs of judges do not differ much across groups. If this is so, and if the groups are
really similar in ability to start with, this finding would also support the conclusion that the ratings of
judges (in this case pairs of them) are reliable.
3. Here the question is how much each of the four separate scales of listening comprehension,
pronunciation, fluency, and structure contributes to the explanation of variance in overall speaking
proficiency.
Chapter 9

Accent and the Evaluation
of ESL Oral Proficiency1

Donn R. Callaway

This study investigated the reactions of naive native raters and ESL teachers
to various samples of differently accented speech. Two panels of judges, 35
naive natives and 35 experienced teachers, evaluated 15 samples of
nonnative speech from three source language backgrounds (Arabic, Persian,
and Spanish). Evaluations for both groups were in terms of bipolar scales of
intelligibility, pleasantness, acceptability, nativeness, and overall profi¬
ciency of the speaker. In addition, the ability of judges to guess the speaker’s
native language background was studied. Results indicated that both groups
can distinguish degrees of proficiency with substantial reliability, that they
cannot distinguish the posited components of speech, and that some judges
are fairly good at guessing language background of ESL learners based on
samples of their speech.

The evaluation of oral proficiency in a second language has always been a


formidable problem for the second language teacher. The teacher must accurately
evaluate the learner’s ability “to receive or transmit information in the test lan¬
guage for some pragmatically useful purpose” (Clark, 1975, 10). However, in
order to evaluate the student’s oral proficiency, the teacher must rely on highly
subjective judgments of the student’s output.
Studies of evaluative reactions to speech samples by Lambert, Hodgson,
Gardner, and Fillenbaum (1960), Labov (1966), and others have supported the
notion that a linguistically naive listener can and does make critical evaluations of
the speaker’s personality, social class, or ethnicity. These studies have used varia¬
tions either between dialects or between languages as the independent variables,
while the dependent variables have consisted of general personality, social class
rating, and individual personality traits; they have tended to ignore supposedly


separable components or characteristics of speech. For this reason, Williams


(1970, p. 473) called for research

to link whatever language and speech features serve as salient cues in this judgmental
process with whatever kinds of evaluation or stereotypes are of interest to us in the
behavior of listeners.

He showed that social class and judgment ratings can be predicted from the
presence, absence, or strength of certain language features, among them length of
pauses and verb constructions (p. 477). However, since his study was limited to
the effect of black dialect on black and white elementary school teachers and did
not include other dialects and other evaluators, further research is needed.
As a result of the concern of sociolinguists that accented speech can cause
alienation and discrimination in educational and occupational opportunities
(Ortego, 1970, Ryan, 1973), most studies have dealt only with the language varie¬
ties of a few rather large minority groups: French Canadians (Lambert et al., 1960,
Anisfeld and Lambert, 1964, Webster and Kramer, 1968), Black Americans
(Harms, 1963, Shuy, Baratz, and Wolfram, 1969, Tucker and Lambert, 1969,
Williams, 1970, Williams, Whitehead, and Miller, 1971), Mexican Americans
(Ortego, 1970, Williams, Whitehead, and Miller, 1971, Ryan, 1973), and British
regionals (Strongman and Woosley, 1967, Giles, 1972). Richards (1971, p. 21)
points out that any “deviancy from grammatical or phonological norms of a speech
community elicits evaluational reactions that may classify a person unfavorably.”
If this divergence from the standard language has an effect on the social relation¬
ships of these minorities, it could have an even more pronounced effect on the
degree of success attained by the second language learner, not to mention his like¬
lihood of being socially accepted.
In oral evaluation, a general assumption is that any native speaker can assess
the proficiency of a nonnative. In fact, a generally used measure, the American
Language Institute Oral Rating Form, describes a speaker of minimum profi¬
ciency as having “pronunciation... virtually unintelligible to ‘the man in the
street’ [my italics]” (1962). However, usually, it is the second language teacher,
rather than “the man in the street,” who makes the evaluation. Is it safe to assume
that there is no significant difference between a trained rater and an untrained
one? Cartier (1968, p. 21) does not think it is. He says that judgments of profi¬
ciency “are made by the wrong people, they are made by sophisticated language
instructors who have become quite skilled at understanding heavily dialectal
English rather than the student’s eventual instructors, classmates, and job super¬
visors.” He seems to be implying that naive judges might be better. According to
Jakobovits (1970, p. 85), however, naive judges are apt to attribute too much
importance to “accent, pronunciation, and fluency” and too little to the weightier
matters of grammar and vocabulary.
A number of research studies have examined the effect of accent in
bidialectal and bilingual speech. These projects have dealt with the reliability of
judgments, the ability of judges to specify certain speech characteristics, degree of
accentedness, and overall proficiency.

In 1973, Gorosch compared oral EFL proficiency evaluations in Sweden by


teachers who were nonnative speakers of English, and nonteachers who were
native speakers of English. His evaluators rated six Swedish EFL learners by
noting each mistake in pronunciation and assessing overall intelligibility on a five-
point scale. His data indicated that both groups tended to separate pronunciation
from intelligibility and that the evaluations of the nonteachers were unpredict¬
able. He concluded (p. 151) that there are “considerable differences between
assessments produced by teachers and those produced by non-teachers.”
Contrary to Gorosch’s findings, a study by Brennan, Ryan, and Dawson
(1975) demonstrated that native speakers could give reliable judgments. Seventy-
two naive listeners judged the degree of accentedness in eight samples of English
as spoken by native speakers of Spanish. In addition to high reliability, the results
indicated that the judges agreed on what degree of accentedness should be associ¬
ated with a particular level of proficiency. However, it also showed that the
subjects during an informal question period were unable to say clearly what
features of speech enabled them to arrive at their judgments.
Giles (1970) attempted to arrange three supposed speech characteristics
hierarchically: pleasantness of voice, intelligibility of accent, and prestige value of
the accent. His subjects, who were adolescents from South Wales and southwest
England, listened to and rated 13 British regional dialects and foreign accents on
the three characteristics mentioned above. His data showed that although the
subjects were able to identify the individual accents, the three characteristics were
apparently indistinguishable. He concluded (p. 219) that these characteristics
were at best “three variants of one evaluative dimension.”
Further research conducted by Galvan, Pierce, and Underwood (1975)
examined the speech of Mexican-American bilinguals in terms of 10 personality
traits and 10 speech characteristics. Five recorded samples were evaluated by 92
American undergraduates. The analysis showed that the raters generally evaluated
speakers more positively than negatively, that the evaluations became more nega¬
tive as accentedness increased, and that very little of the variance across samples
could be predicted on the basis of academic background of the listeners. Regard¬
ing speech characteristics (relaxed, appropriate, natural, standard, graceful, care¬
ful, understandable, good English, active, and smooth), the authors speculated
that not only could the characteristics be reduced to a few factors but from these
factors specific points on a continuum of accentedness could be established (p.
15).
The previously mentioned studies all had one thing in common: they dealt
with the speech of fairly established groups, whether these groups were bidialectal
or bilingual. A study by Palmer (1973), however, dealt with a transient group of
ESL learners. Palmer produced a preliminary report of the subjective evaluations
of ESL learners by naive listeners. Eighteen students from Georgetown University
listened to 36 speakers from four different language backgrounds (Lingala, Arabic,
Spanish, and Vietnamese). Each judge evaluated speakers on a five-point scale

over three tasks: reading, story retelling, and narration. The judges showed sub¬
stantial reliability across language backgrounds and across tasks. However, they
were not very good at identifying the source language background of the speaker.
It would seem from Palmer’s results that particular foreign accents, e.g., Spanish,
may not be as easily recognized as is popularly believed.
This study attempts to address a number of remaining questions. Here, as in
Palmer’s study, the focus is on learners of ESL, but unlike Palmer’s study, this one
compares interrater reliability among naive raters with interrater reliability among
experienced ESL teachers. Further, evaluations by both groups are validated
against the independent placement of the speakers in one of five proficiency
levels, via a separate testing procedure. Whereas some of the cited studies were
concerned with the effect of a speaker’s accent on the listener’s judgment (with¬
out knowing or caring what characteristics of speech contributed to that evalua¬
tion), this experiment will attempt to separate accentedness into distinguishable
dimensions.
In particular, the following questions are addressed:

1. How much agreement is there among judges on the evaluation of samples of


nonnative speech?
2. Is there any difference between naive native raters and experienced ESL
teachers in the evaluations of foreign speech?
3. Will evaluations of nonnative speech samples correlate with the independ¬
ent placement of the speakers in an intensive English program; that is, do the
ratings have concurrent validity?
4. Are there distinguishable dimensions for the ratings of accentedness in this
study?
5. How accurately can the judges identify the source language background of
each speaker?

Method
Speech Samples. During the fall semester of 1975, the experimenter asked
instructors from each of the five proficiency levels at the Center for English as a
Second Language (CESL), Southern Illinois University, to recommend students
of average speaking ability from Arabic-, Persian-, and Spanish-speaking language
backgrounds who would be willing to participate in a short recording session. Two
native speakers who were graduate students in the Department of Linguistics,
Southern Illinois University, were also taped. Originally 25 ESL students, along
with the two American students, were recorded in a laboratory setting reading one
of twelve 100-word passages in English. (The paragraphs are given in Appendix
9A.) Each speaker was allowed to practice the passage twice before he was
recorded. Eleven tape samples were eliminated because of technical recording
problems or to avoid including too many samples from the same source language
background. Three nonnatives from each of the five levels of CESL (15 ESL

students) were finally selected along with one of the American students to form
exactly 16 criterion samples of speech.*
Questionnaire: Scales of Accentedness and Overall Proficiency. The scales
were constructed in a semantic differential form similar to the scales used by
Lambert et al., Galvan et al., Palmer, and others. The first four scales consisted of
four pairs of bipolar adjectival descriptors (“not very intelligible” to “intelligible,”
“unpleasant” to “pleasant,” “unacceptable” to “acceptable,” and “nonnative” to
“native”). In addition, there was an overall proficiency scale (OPS). Each of these
scales was in a six-point Likert-type format. Also, a multiple-choice question
about the language background of the speaker was included (see Appendix 9B).

Raters. The tape and questionnaire were administered to 70 raters (Rs). Half
of them were enrolled in undergraduate English composition courses and had
neither linguistic nor teaching experience (the naive group), while the other half
were instructors or teaching assistants in ESL (the experienced group). Before
taking the test, each R completed a form on biographical data. The naive Rs were
tested during their regular class meetings, while the teachers were tested either
individually or in small groups.

Rating Procedure. Each sample was rated by the naive native Rs and the
experienced ESL teachers on each of the six scales. The order of the six scales was
the same for all samples of speech. The order of the 16 speakers, however, was
randomized for the first tape and was given in reverse order on a second tape.
Detailed instructions for the use of the protocols were presented orally with an
example using a male Spanish speaker. Form A was used with 15 of the naive Rs
and 20 of the ESL teachers. In sum, each of the Rs heard the same directions,
either orally or taped, and the same example, and they listened to one of the two
tapes of the 16 speakers. They evaluated each speaker on each of the six scales.

Results
Rater Agreement. In order to see whether or not the raters agreed in their evalua¬
tions of accentedness (see Question 1 above), the Rs were treated as variables in a
Q-type factor analysis. Normally, of course, the variables input to a factor analysis
are test scores, scales, or other measures. In this case, the raters were treated as
variables and the various accent scales were treated as subfiles (each containing
16 cases, i.e., the 16 speech samples). Owing to computer space limitations, only
60 Rs could be included. Therefore, five were excluded from the naive group and
five from the group of experienced ESL teachers on a random basis. In the first
factor analysis, the first four scales were treated together without distinction. We
will return shortly to the justification for this. (See the discussion of Question 4,
below.) In the principal components analysis, Factor 1 accounted for 48% of the
variance among the individual Rs. All of the Rs loaded positively on this factor and

*Editors’ note: These speech samples were collected prior to the major testing project discussed in
other chapters (especially 2, 6, 7, 14, 17, 18, 23, and 24).

Table 9-1 The First Factor from a Q-Type Principal Components Analysis of the
Four Scales of Accentedness (Intelligibility, Pleasantness, Acceptability, and
Nativeness)

Naive        Loadings on    Experienced   Loadings on
raters       Factor 1       raters        Factor 1

6* .862 36 .548
7 .500 37 .811
8 .466 38 .691
9 .840 39 .505
10 .849 40 .810
11 .805 41 .778
12 .718 42 .754
13 .729 43 .833
14 .664 44 .666
15 .767 45 .748
16 .580 46 .700
17 .698 47 .664
18 .424 48 .693
19 .721 49 .778
20 .589 50 .476
21 .539 51 .596
22 .791 52 .809
23 .373 53 .739
24 .492 54 .791
25 .694 55 .762
26 .777 56 .804
27 .525 57 .703
28 .444 58 .812
29 .655 59 .741
30 .719 60 .714
31 .687 61 .785
32 .703 62 .681
33 .406 63 .837
34 .547 64 .734
35 .616 65* .714

*The first and the last five raters were eliminated so as not to exceed
the space limitation of the SPSS factor program (PA1, Nie, Hull,
Jenkins, Steinbrenner, and Bent, 1975, pp. 479-480).

above .36 (Table 9-1). Fifty-six Rs showed a correlation of greater than .50 with
this factor, while 12 of these 56 loaded at .80 or higher. The overall mean loading
was .69. The mean loading for the naive Rs was .645, and for the 30 ESL teachers
was .735. On a similar principal components analysis for the OPS, the first factor
accounted for 56% of the total variance. The average loading for the naive Rs was
.676; for the ESL teachers, it was .816 (Table 9-2). From these analyses, it can be
concluded that there is very substantial agreement among the Rs, regardless of
whether they are naive or experienced.
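
The Q-type analysis amounts to transposing the usual layout so that raters, not scales, are the variables, and extracting the first principal component of the inter-rater correlation matrix. A minimal sketch (Python/NumPy; the 16 × 60 shape follows the text, the data do not):

    import numpy as np

    def first_factor_loadings(ratings):
        # ratings: 16 speech samples (rows) by 60 raters (columns).
        corr = np.corrcoef(ratings, rowvar=False)   # 60 x 60 inter-rater matrix
        vals, vecs = np.linalg.eigh(corr)           # eigenvalues in ascending order
        loadings = vecs[:, -1] * np.sqrt(vals[-1])  # loadings on Factor 1
        # vals[-1] / vals.sum() gives the proportion of variance explained.
        return loadings * np.sign(loadings.sum())   # orient loadings positively
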

Table 9-2 The First Factor from a Q-Type Principal Components Analysis of the
Overall Proficiency Scale

Naive        Loadings on    Experienced   Loadings on
raters       Factor 1       raters        Factor 1

6* .952 36 .620
7 .673 37 .731
8 .516 38 .729
9 .797 39 .834
10 .895 40 .863
11 .727 41 .715
12 .848 42 .717
13 .778 43 .914
14 .758 44 .777
15 .830 45 .827
16 .846 46 .804
17 .660 47 .783
18 .361 48 .824
19 .698 49 .850
20 .598 50 .881
21 .642 51 .633
22 .915 52 .873
23 .394 53 .723
24 .437 54 .842
25 .917 55 .792
26 .658 56 .808
27 .615 57 .796
28 .357 58 .841
29 .784 59 .762
30 .570 60 .786
31 .766 61 .801
32 .729 62 .810
33 .326 63 .760
34 .613 64 .828
35 .654 65* .708

*See the footnote to Table 9-1.

Difference in Reliability. The second question was whether a significant


difference in reliability existed between naive Rs and experienced ESL teachers.
The answer to this question can be deduced directly from the loadings of the previ¬
ously defined factors. We can simply contrast the average loading of the naive Rs
with the average loading of the ESL teachers. The contrast between the naive and
the experienced Rs is not significant (p > .05) for the four scales lumped together,
but the contrast for the OPS is significant at the .05 level. Therefore, the
experienced Rs appear to be somewhat more reliable, although it should be noted
that both groups are surprisingly reliable on the whole (this can be inferred from
the strength of the loadings on the two principal factors defined above).

Table 9-3 Intercorrelations among the Scales of Accentedness for Ratings of the
16 Speech Samples by 70 Independent Raters

Scales 1 2 3 4 5

1 Intelligibility 1.000 .787 .880 .783 .889


2 Pleasantness 1.000 .786 .727 .790
3 Acceptability 1.000 .792 .892
4 Nativeness 1.000 .802
5 Overall Proficiency 1.000

Table 9-4 Correlations of the Scales of Accentedness for Mean Ratings of 16
Speech Samples Averaged over the 70 Raters

Scales 1 2 3 4 5

1 Intelligibility 1.000 .996 .998 .971 .997


2 Pleasantness 1.000 .994 .970 .993
3 Acceptability 1.000 .974 .978
4 Nativeness 1.000 .978
5 Overall Proficiency 1.000

Concurrent Validity. Question 3 concerned the concurrent validity of the


accent ratings. A convenient criterion was the speaker’s placement at CESL. The
overall correlation between the OPS and the various placement levels was .66. For
the five Arabic speech samples, separately, the correlation was .87. For the five
Spanish speech samples, it was .71, and for the five Persians, it was only .45. Obvi¬
ously, the overall ratings were better predictors of CESL placement level for the
Arab subgroup. This is probably due to a greater range of abilities among the Arab
samples than in the Spanish and Persian samples, and a correspondingly greater
range among the Spanish than among the Persian. In other words, if more variance
exists and if the explanatory variable(s) is (are) appropriate, more variance will be
explained.
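
The point is the familiar restriction-of-range effect, which a few lines of simulation make visible (Python; all numbers are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    ability = rng.uniform(0, 5, 500)           # wide range of true ability
    rating = ability + rng.normal(0, 1, 500)   # noisy proficiency rating
    r_wide = np.corrcoef(ability, rating)[0, 1]     # about .8
    narrow = (ability > 1.5) & (ability < 3.5)      # restrict the range
    r_narrow = np.corrcoef(ability[narrow],
                           rating[narrow])[0, 1]    # noticeably lower
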
Dimensions of Accentedness. Question 4 asks whether the several scales are
actually sources of unique variance with respect to ratings of accentedness or
overall proficiency. Table 9-3 shows the intercorrelations between the four scales
and the OPS with the 70 Rs times the 16 speakers as input data. Each scale
appears to be measuring the same thing as each other scale. When mean scores
(Table 9-4) are input for the speakers on each scale, the correlation across the
scales approximates unity—that is, they are almost perfect, never less than .97.
Thus we may conclude that the scales, in this study, for practical purposes are
unitary.
Identification of Language Backgrounds. The fifth question asked how
accurately judges would be able to identify each speaker’s language background.
The experienced Rs correctly identified the source language backgrounds 73% of
the time, while the naive Rs could do so only 30% of the time. Identification of the
individual language backgrounds by the experienced Rs was best for the Spanish
speakers, then the Persians, and then the Arabs. For the naive Rs, the order was
Spanish, Arabs, and Persians (Table 9-5). Of these three languages, only Spanish
was studied by more than one R (6 naive Rs and 20 experienced Rs). For the naive
Rs it is interesting to note that those who had not had Spanish identified the
speakers with 16% greater accuracy than the naive judges who had studied
Spanish (Table 9-6).

Table 9-5 The Percentages of Correctly Identified Language Backgrounds

Source language     By experienced   By naive
of speaker          raters           raters
Arabic              66.3             31.4
Spanish             84.6             32.6
Persian             68.6             26.9
Combined            73.0             30.0
In looking at the ratings of the experienced Rs, we discover that with or without
studying Spanish, Rs could identify the speakers equally well. It seems, therefore,
that the ability to correctly identify the language background of a speaker may be
due more to mere contact with speakers of the language in question than to formal
study of the language supposed to be the source of the accent.

Table 9-6 The Percentages of Correctly Identified Spanish Speakers

                                     By experienced   By naive
                                     raters           raters
Those who had studied Spanish        85               20
Those who had not studied Spanish    84               36

Discussion
With little or no research basis, Cartier speculated that teachers are not the most
reliable judges of oral proficiency. With equally little empirical study, Jakobovits
suggested that naive natives are also not the best judges. Obviously, both Cartier
and Jakobovits cannot be right, but neither recommended appropriate research to
test their claims. With unabashed certainty, language testers have categorized oral
performance into the separate components of accent, grammar, vocabulary,
fluency, and comprehension (Valette, 1967, 1977, Harris, 1969, Clark, 1972,
Heaton, 1975, Davies, 1977). These categories have become the sanctioned
criteria for the evaluation of oral proficiency by teachers. For the most part, their
empirical necessity has gone unquestioned.
The data from this experiment dispute both the speculation of Cartier and the
opposite claim of Jakobovits; for if both claims were true, we would expect to find
little reliability in the evaluations of either of the two groups of raters studied here.
However, the results indicate that both groups can distinguish degrees of profi¬
ciency with substantial reliability, although the teachers are somewhat more
reliable than the naive judges (but contrary to what some might have hoped, the
latter fact does not support Jakobovits’ claim; it refutes his claim because of the

demonstrated unity of the various scales of accentedness and the overall profi¬
ciency rating). Apparently, all the raters tended to make holistic unidimensional
evaluations, rather than multidimensional judgments, i.e., separate evaluations of
the presumed components. The unity of the scales suggests that dividing oral per¬
formance into components is superfluous at best, and artifactual at worst. Accord¬
ing to the available empirical evidence, a listener apparently does not and perhaps
cannot componentialize the characteristics of speech. Rather it would appear that
overall comprehensibility is what motivates the evaluation.
The inability of a naive judge to identify the source language of accented
speech substantiates Palmer’s findings (1973). The data also show that ESL
teachers (the experienced group) are quite successful (73%) in identifying source
language backgrounds. Further, the data indicate, contrary to Palmer (1973), that
accents are substantially distinctive.
More research is required before it will be possible to relate degrees of
accentedness to points on a well-defined, although subjective, continuum. In
addition, experiments should be conducted to see if the ability to identify a
speaker’s first language affects the reliability and validity of a given judge’s evalua¬
tions. Since studies have shown that speech characteristics may affect personality
assessments, the converse relationship between personality assessments and
speech characteristics may also affect the reliability and validity of oral profi¬
ciency evaluations and should also be investigated.
A final area of research would be the reliability of evaluations of accentedness
by nonnatives whose speech also reveals varying degrees of accentedness.
This study used only native speakers of English as raters: what if we wanted to
generalize to nonnatives from different language backgrounds? Would Arabs, for
example, tend to rate Arabic speakers as reliably as native speakers of Spanish or
Persian? Would they be more lenient? Therefore, own-accentedness and native
language are factors that might well be studied in relation to the reliability of
nonnatives as evaluators of nonnative speech.

Note

1. This is an expanded version of a paper presented at The First International Conference on


Frontiers in Language Proficiency and Dominance Testing at Southern Illinois University, Carbon¬
dale, Ill., on Apr. 22, 1977. Another version of this report served as the author’s master’s thesis at the
Department of Linguistics, Southern Illinois University. It has been revised for inclusion in this
volume.

Appendix 9A
[The texts used in this study were selected from several sources. Each passage was
rewritten to conform to a 100-word limit on length. Texts 1, 2, 3, and 4 were each
read by two different speakers. The remaining texts were read by only one speaker
each.]

1. Ben got off the bus and then the bus drove away. He forgot about the
tickets because it was raining. The road was wet and there was a very big hole in his
shoe. Then a second bus stopped and he got on. This time there was a seat. He paid
a dime for his ticket and then shut his eyes. When he opened them again, the bus
was past the theater. He rang the bell and the bus stopped suddenly. It was still
raining as he walked back to the theater and went in through the door. He saw
many photographs of the actors just before he saw the stage. (Adapted from Baird,
Broughton, Cartwright, and Roberts, 1972, p. 44.)
2. I hope to learn several foreign languages but English is the one I want to
study first. To begin with, I hope to get a good position with one of the big compa¬
nies in the capital and it will be an advantage for me to have an understanding of
English. If my work should ever require my traveling outside of the country, it
would be helpful if I knew English. It is used in carrying on business in almost
every part of the world. My brothers and sisters, already skillful in English, are
eager to practice it with me; so I will have many opportunities when I am ready to
speak English. (Adapted from Van Syoc and Van Syoc, 1971, p. 89.)
3. On Saturday mornings the big public library opens at half past nine. A lot
of the people go into the library on Saturday because this is the time when they go
shopping, they take their books into the library, and go home with new ones. Susan
and Mary, the two girl librarians, were standing behind the desk. They took the
books from the people who came in and gave them their tickets. It was a warm
Saturday, and a lot of people were in the streets and in the stores, and many were
coming into the library too. (Adapted from Baird, Broughton, Cartwright, and
Roberts, 1972, p. 47.)
4. As was expected, the favorites had gotten well out in front with the
remaining horses grouped together some way behind. On a dangerous bend, three
of the horses leading the group fell, throwing the riders into great confusion. As the
race progressed, the track became full of horses without riders. Toward the end,
there were only three horses left. College Joy and Sweet Seventeen were still lead¬
ing the race with an unknown horse far behind. The crowd was very disappointed
when on the last jump in the race, the riders of both favorites failed to keep in the
saddle. The crowd cheered and applauded as the unknown horse crossed the
finishing line. (Adapted from Alexander, 1974, p. 60.)
5. Moving the pilot aside, the man took his seat and listened carefully to the
urgent instructions that were being sent by radio from the airport below. The plane
was now dangerously close to the ground, but it soon began to climb. The man had
to circle the airport several times in order to become familiar with the controls.

The terrible moment came when he had to land the plane. Following the
instructions, the man guided the plane toward the airfield. It shook violently as it
touched the ground and then moved rapidly across the field, but after a long run it
stopped safely. (Adapted from Alexander, 1974, p. 61.)
6. The following Sunday we stayed at home, even though it was a fine day.
About noon a large and very expensive car stopped outside our house. We were
astonished when we saw several people preparing to have a picnic in our small
garden. Father got very angry and went out to ask them what they thought they
were doing. You can imagine his surprise when he recognized the man who had
taken our address the week before. Both men burst out laughing and Father wel¬
comed the strangers into the house. In time, we became friends, but we had
learned a lesson we have never forgotten. (Adapted from Alexander, 1974, p. 63.)
7. It was a very dark and stormy night. Two men were walking slowly down
the road. Snow was covering the ground and a cold wind was blowing. They
noticed a light behind some trees and soon arrived at a house. A poor old man
immediately invited them into a clean room. He seemed a strange fellow, but he
spoke kindly and offered them milk and fresh fruit. The men remained there until
morning. Then the man led them to the nearest town, but he would not accept any
money for his help. (Adapted from Alexander, 1974, p. 18.)
8. Science has told us so much about the moon that it is fairly easy to
imagine what it would be like to go there. It is certainly not a friendly place. As
there is no air or water, there can be no life of any kind. Also for mile after mile,
there are only flat plains of dust with mountains around them. Above, the sun and
stars shine in a black sky. The moon is very silent. But beyond the horizon, our
earth is shining more brightly than the stars. It looks like an immense ball, colored
blue, green, and brown. (Adapted from Alexander, 1974, p. 35.)
9. The store was empty and very peaceful. We sat down in the main hall and
listened to the rain beating against the windows. Suddenly there was a loud noise at
the door. Then a large party of boys were led in by a teacher. The poor man was try¬
ing to keep them quiet, but they were not paying any attention to him. The boys ran
here and there. The teacher explained that the boys were rather excited. But the
noise proved too much for us; so we decided to leave. After all, the boys had more
right to be in the store than we did. (Adapted from Alexander, 1974, p. 54.)
10. Driving along a highway one dark night, Tom suddenly had a flat tire.
Even worse, he discovered that he did not have a spare tire in the back of his car.
Tom waved to passing cars and trucks, but not one of them stopped. At last, he
waved to a car like his own. To his surprise, the car actually stopped and a well-
dressed woman got out. The woman offered him her spare tire, but Tom had never
changed a tire in his life. So she set to work at once and changed the tire in a few
minutes while Tom looked on. (Adapted from Alexander, 1974, p. 27.)
11. Dan found the school work easy. He read widely both at school and in
the branch library. After the third year of high school, he left to take a job with a
glass firm. Art work had always been a major interest, and he did so well with the

firm that he was promised rapid advancement. But then the depression came, the
business failed, and Dan was without a job. At first, he went out looking for a job
and continued his art work at home, but when all his efforts brought no results, he
stopped looking for work and even lost interest in art. (Adapted from Whyte, 1955,
pp. 8-9.)
12. Tony came into the club to talk the situation over with John. He was try¬
ing to get transportation, he said, but even if he could arrange it in the next few
minutes it was so late that the boys would miss a large part of the evening. If any¬
one wanted his money back or a ticket for the next football game, he could have it.
John explained the situation to the boys and then said that he thought it would be
better if we went another time. Tony agreed. He said that John could collect the
tickets later. (Adapted from Whyte, 1955, p. 182.)

Appendix 9B
Questionnaire
Name ____________________  Sex  F  M
State or country ____________________  ESL Experience ____ yrs. ____ mos.
Native language ____________________
Other languages ____________________
Age ____

In this experiment, you will rate how well some nonnative speakers read a short prose
passage. In addition, you are asked to identify their native language.

EXAMPLE
NVI (Not very intelligible) 1 2 3 4 5 6 VI (Very intelligible)
UNP (Not pleasant) 1 2 3 4 5 6 P (Pleasant)
UNA (Not acceptable) 1 2 3 4 5 6 A (Acceptable)
NN (Nonnative) 1 2 3 4 5 6 N (Native)
OPS 1 2 3 4 5 6
Language background Ar. Sp. Pr. Am. X
(Ar.—Arabic, Sp.—Spanish, Pr.—Persian, Am.—American, X—Unknown)

1. The speaker is unintelligible to a native speaker.


2. The speaker has a very heavy accent and makes frequent gross errors.
3. The speaker’s accent requires concentrated listening. His mispronunciations lead to
occasional misunderstandings.
4. The foreign accent is evident and occasional mispronunciations occur, but these do not
interfere with understanding.
5. There are no consistent mispronunciations, but because of occasional deviations the
speaker would not be taken for a native.
6. The speaker has native pronunciation with no trace of a foreign accent.

Name

1. NVI 1 2 3 4 5 6 VI
UNP 1 2 3 4 5 6 P
UNA 1 2 3 4 5 6 A
NN 1 2 3 4 5 6 N
OPS 1 2 3 4 5 6
LgB Ar. Sp. Pr. Am. X

16. NVI 1 2 3 4 5 6 VI
UNP 1 2 3 4 5 6 P
UNA 1 2 3 4 5 6 A
NN 1 2 3 4 5 6 N
OPS 1 2 3 4 5 6
LgB Ar. Sp. Pr. Am. X
Part III Discussion Questions

1. In what sense is the oral interview technique discussed by Hendricks et al.


and by Mullen a more direct measure of speaking proficiency (or overall lan¬
guage skill) than, say, an elicited imitation task (e.g., the repetition procedure
used by Hendricks et al.)? Is the repetition task a more or less direct method
of testing speaking ability, than, say, an oral cloze task? A written cloze task?
A question-answer drill? What are the relevant parameters of the judgments
you must make in order to differentiate direct and indirect testing pro¬
cedures?

2. Why might direct tests be considered to have superior validity (even before
any experimental or empirical studies are done)? Are the criteria associated
with judging the directness of a measure more important than its reliability?
How do those criteria relate to other considerations that enter into the empir¬
ically demonstrable validity of a proposed test?

3. What implications do you see in the finding that multiple dimensions of


speech ratings are difficult if not impossible to differentiate? What implica¬
tions does this have for fluency drills? Vocabulary exercises? Grammar prac¬
tice? Pronunciation courses? What about the curriculum in general? Are
there possible implications for the improvement of “oral language skills” in
native English speakers? Consider also the relationship demonstrated earlier
in Chap. 2 especially, between oral language processing and other language
processing tasks.

4. What kinds of motivations can you posit for the inclusion of topics to elicit
certain tenses and structures in instructions to teams of oral interviewers?
See Appendix 7D.

5. Consider the factor analysis shown in Table 7-5. Try to find a simple expla¬
nation for each of the loadings recorded. Remember that the factors
displayed are mathematically uncorrelated (the technical term is orthog¬
onal).

6. Read through the descriptions of FSI proficiency levels in Appendix 7A.


What basis could be proposed to explain the association of such verbal
descriptions with the numerical values given in tables of Appendix 7C? For
instance, what does 4+ mean in terms of a rough verbal description? Can
you suggest any method other than the one used in Appendix 7A for the defi¬
nition of levels of proficiency? What practical procedures could be followed
to maximize the reliability of such definitions in actual interview ratings?


7. Mullen’s method of investigating interrater reliability seems to suggest that


differences in calibration across raters may contribute to unreliability. If this
is so, how can we explain the fact that judges often differ significantly in
terms of the scores they assign to subjects but still show high levels of relia¬
bility? In other cases, judges agree substantially in their evaluations and
show low levels of reliability. How can these facts be explained? (Clue: What
is the correlation between scores of 1 and 98, 2 and 99, 3 and 100, assigned
by two judges to the same three subjects? Also consider the correlation
between scores assigned by two judges who rate the same three subjects as 1
and 1, 1 and 1, and 1 and 1. Answer: The correlation for the first set is perfect,
1.00, and for the latter set of scores it is 0.)

8. On the basis of the correlations displayed in Table 8-5, compute the average
variance overlap among scales. This can be done by computing the squares
of the correlations displayed in the table (the squares can easily be displayed
in the same table in the lower half of the matrix, below the diagonal). Divide
the sum of the squares by the number of entries. The result is the average
variance overlap across all five scales. (There is a simpler way to do the
computation. Can you see how? Consider the method used in Question 9.) A
computational sketch for Questions 8 through 10 follows Question 13.

9. Now, to estimate how much of the reliable variance on the average is com¬
mon variance, average the reliabilities displayed in Table 8-4 for each of the
scales separately. Then average the reliabilities across all the scales. Com¬
pare the square of the average reliability against the square of the average
common variance. If they are approximately equal, it can be concluded that
nearly all the reliable variance is common variance. In other words, to the
extent that any single scale is reliably measuring anything, the other scales
are measuring the same thing. If you subtract the square of the average relia¬
bility from the square of the average variance overlap, the result will be a
good estimate of the average variance in all the scales that is unique to each
one.

10. Now, compare the magnitude of the squares of correlations appearing in


column 5 of Table 8-5. For instance, the correlation of the Listening scale
with the Overall Proficiency scale is .90; the square of that quantity is .81.
The latter value represents a good estimate of the variance overlap between
Listening and Overall Proficiency. Compute (if you have not already done
so) the squares of each of the other correlations between Overall Proficiency
and the remaining three scales. Which scale accounts for the largest amount
of variance in the Overall Proficiency scale? How much difference is there
between the scale that accounts for the greatest amount of variance in Overall
Proficiency and the one that accounts for the least variance in Overall Profi¬
ciency? What conclusions can you draw from the comparison concerning the
differentiation of the scales Listening, Pronunciation, Fluency, and Gram¬
mar as contributors to the Overall Proficiency ratings? (See the sketch
following Question 13.)

11. Are there some questions that cannot be answered except on the basis of
untested opinion? If so, what is the difference between empirically vulner-
able and nonexperimental issues?

12. Can you think of any explanation for the fact that judges who lack knowledge
of the source language underlying an accent should be better at identifying
the source language for the accent than judges who have studied the source
language in question (see Table 9-6, in particular the figures given for naive
raters)?

13. Compare the correlations reported by Callaway in Table 9-3 with those
reported in Table 8-5 by Mullen. What differences or similarities do you see?
Consider the nature of the scalar judgments required of raters as well as the
pattern of relationships between scales.
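
For readers who want to verify the arithmetic called for in Questions 8 through 10, the computations can be run directly on the printed values. A minimal Python sketch (the numbers are transcribed from Tables 8-4 and 8-5; nothing else is assumed):

    import numpy as np

    # Question 8: average variance overlap among the five scales,
    # using the ten inter-scale correlations printed in Table 8-5.
    r = np.array([.79, .85, .79, .90, .77, .78, .88, .80, .89, .89])
    avg_overlap = (r ** 2).mean()                    # about .70

    # Question 9: average interrater reliabilities, from Table 8-4
    # (rows are groups 1-6, columns the five rating scales).
    rel = np.array([[.93, .72, .75, .76, .88],
                    [.82, .79, .57, .57, .77],
                    [.43, .74, .50, .74, .75],
                    [.82, .92, .92, .66, .93],
                    [.82, .67, .88, .86, .83],
                    [.77, .81, .85, .88, .92]])
    per_scale_reliability = rel.mean(axis=0)         # each scale separately
    avg_reliability = rel.mean()                     # about .78 overall

    # Question 10: variance overlap of each subscale with the Overall
    # Proficiency scale (squares of column 5 of Table 8-5).
    overlap_with_overall = {"Listening": .90 ** 2,      # .81
                            "Pronunciation": .88 ** 2,  # .77
                            "Fluency": .89 ** 2,        # .79
                            "Grammar": .89 ** 2}        # .79
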
Part IV

Investigations of Reading Tasks

Is cloze procedure a suitable basis for ESL placement testing? Can it be


substituted for the more complicated and expensive procedures now used in
most colleges and universities? What about the format? Is it possible to make
multiple-choice (objectively scorable) cloze tests which would be even
more economical and convenient to score than the open-ended variety?
What about the content of cloze tests? Do examinee attitudes toward the
material in the text interfere significantly with filling in the blanks? Suppose
highly controversial material is selected, will scores be less meaningful than
if neutral material had been chosen? Finally, what is the relationship
between scores on English tests (such as the TOEFL) which are intended for
nonnative speakers and standardized reading tests intended for native
speakers of English? Is it possible to say what the reading levels of foreign
students are in terms of grade-level equivalences for native English-
speaking populations? These and other questions are dealt with in the four
chapters of this part.
Chapter 10

Cloze as an Alternative Method of ESL
Placement and Proficiency Testing1

Frances Butler Hinofotis

Can a cloze test be substituted for more complicated ESL testing proce¬
dures without significant loss of information? Which scoring method for the
cloze test will yield the most information? A cloze test in open-ended form
was administered to over 100 incoming foreign students at the Center for
English as a Second Language (CESL) at Southern Illinois University
during the summer of 1976. The Test of English as a Foreign Language
(TOEFL) and the placement examination used at CESL were the written
criterion measures against which the cloze test was evaluated. It was scored
twice: first, responses corresponding exactly to the deleted words were
counted as correct (cloze-exact), and second, responses that were grammati¬
cal and contextually appropriate were counted as correct (cloze-accept¬
able). The data indicate that cloze testing may indeed be a viable alternative
procedure for placement and proficiency testing. It appears, however, that
the two cloze scoring methods are not equally reliable. Apparently the
cloze-acceptable method yields a more accurate assessment of the student’s
ESL ability than the cloze-exact method, particularly at more advanced
levels. This study supports previous research in language testing which
suggests that the cloze procedure is a useful evaluative tool for ESL
specialists.

The formidable problem of evaluating the competence levels of learners of
English as a second language is a central issue at the many language centers in the
United States and abroad. Placement and progress examinations are given
periodically in order to determine student proficiency. If the learner is a rank
beginner, there is no question about the appropriate level of placement in a language
center, but for a student who arrives at a center with some control of English, the

problem is more complex. The examination procedures that are used to determine
overall ESL proficiency levels are, for the most part, quite involved. The tests are
most often written and are usually composed of a number of subtests, each one
supposedly assessing a different facet of language ability. In general, a composite
score on such an examination is taken as an indicator of the examinee’s overall
ESL proficiency.
The various tests now employed seem to provide a reasonably accurate
indication of English language ability. However, because of the length of the tests
and problems of scheduling, it is often necessary to set aside two or three days at
the beginning of each term of instruction for the purpose of screening new students.
If testing time could be reduced and if comparable results could be
obtained, the considerable time and effort saved could be put to better use in the
classroom.
A testing method originally used as a readability index has recently been
looked at seriously as a possible measure of second language proficiency. The
“cloze” procedure, as the method was called by its originator Wilson Taylor,
involves deleting every nth word from a prose passage and asking the person tested
to supply the missing words in the blanks. The procedure is justified on the
assumption that a person who is either a native speaker of the language or a
reasonably proficient nonnative speaker should be able to anticipate what words belong
in the blanks given the contextual clues of the passage. A cloze test is easy to
administer and can be scored quickly. Studies to date indicate high correlations
between cloze test results and total scores on established ESL proficiency
measures. The work of Oiler at UCLA and the University of New Mexico, Stubbs and
Tucker in Beirut, Krzyzanowski with secondary students in Poland, Hisama at
Southern Illinois University, and others has shown positive correlations in the
.70s and .80s between cloze test scores and total or composite scores on a variety
of ESL examinations. On the basis of such findings, the possibility of streamlining
both placement procedures at language centers and ESL proficiency evaluation in
general is suggested. The present study was conducted to confirm and extend
earlier results which seem to indicate that cloze testing can be used in lieu of more
complicated testing procedures for measuring overall ESL proficiency.
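The mechanical part of the procedure is easy to automate. What follows is a minimal sketch of every-nth-word deletion as just described; the sample passage, the deletion rate, and the number of context words left intact are illustrative assumptions, not the materials of any study reported in this volume.

# Minimal sketch of every-nth-word cloze construction (Python).
# The context window, deletion rate, and blank cap are illustrative.

def make_cloze(text, nth=7, max_blanks=50, context_words=12):
    """Replace every nth word with a numbered blank, leaving an
    initial stretch of words intact to provide context."""
    words = text.split()
    mutilated, key = [], []
    for i, word in enumerate(words):
        due = i >= context_words and (i - context_words) % nth == 0
        if due and len(key) < max_blanks:
            key.append(word)                      # the deleted word
            mutilated.append("(%d)_____" % len(key))
        else:
            mutilated.append(word)
    return " ".join(mutilated), key

passage, answers = make_cloze("Travelers who wish to cross the "
                              "United States can choose among "
                              "planes, trains, buses, and cars...")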
The two research questions addressed in this paper are: (1) Can a cloze test
be substituted for more complicated ESL testing procedures with substantial
savings of time and effort and without significant loss of information? and
(2) Which cloze test scoring method provides the most information about ESL
ability?

Method
Subjects. Foreign students studying ESL at the Center for English as a Second
Language (henceforth CESL) at Southern Illinois University comprised the
population sampled for this study. The subjects included 107 incoming foreign
students from a variety of native language backgrounds and with varying degrees of
competence in English. The subjects were the students who arrived at CESL for
two consecutive six-week terms during the summer of 1976.
Testing. Two ESL proficiency tests, the Test of English as a Foreign Language
(TOEFL, Educational Testing Service) and the CESL Placement battery,
were the criterion measures against which the cloze test was evaluated. Since both
of these test batteries are described in earlier chapters (the TOEFL in Chap. 1 and
the CESL Placement in Chap. 2), their content is not described further here.
The CESL Placement examination is given to all incoming foreign students.
On the basis of the CESL scores, it is determined which of the students will be
given the TOEFL. Because of the greater difficulty of the TOEFL, only those
students scoring within the intermediate and advanced ranges on the CESL
Placement are given the TOEFL.
The cloze test constructed for this study was selected on the basis of the
results of pretesting with both native speakers of English and students at each of
the six proficiency levels at CESL. (See Chap. 4 for a description of the proficiency
levels.) The passage chosen was 427 words in length. Every seventh word
was deleted up to a total of fifty blanks. The passage was adapted from an
intermediate ESL text. It was about different forms of transportation for long-distance
traveling, a familiar topic to foreign students studying in the United States. As is
customary, a few sentences were left intact at the beginning and at the end of the
passage to provide context. The length of the blanks was kept uniform throughout.
Students were allowed thirty minutes to complete the test, which was scored
twice: first, responses corresponding exactly to the deleted words were counted as
correct (cloze-exact), and second, responses that were grammatical and contextually
appropriate were counted as correct (cloze-acceptable). Previous research
with cloze testing does not indicate definitely which scoring procedure is preferable.
Using the exact-word scoring method eliminates the element of subjectivity
and should therefore yield a more reliable score. In addition, tests can be scored
more quickly using the exact-word method, and this is an important consideration
when large numbers of students are being tested. However, whether there is
substantial loss of information by substituting the exact-word method for the
acceptable-word method is not clear. For this reason, the question concerning
scoring procedure was raised.
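For concreteness, the two scoring methods can be sketched as follows. The acceptable-alternatives table here stands in for the human judgment of grammatical and contextual appropriateness and is purely illustrative.

def score_exact(responses, key):
    """Exact-word method: only the deleted word counts as correct."""
    return sum(r.strip().lower() == k.lower()
               for r, k in zip(responses, key))

def score_acceptable(responses, key, alternatives):
    """Acceptable-word method: the deleted word or any response
    judged grammatical and contextually appropriate counts.
    `alternatives` maps blank index (0-based) to a set of words;
    in practice this judgment is made by a scorer, not a table."""
    correct = 0
    for i, (r, k) in enumerate(zip(responses, key)):
        ok = {k.lower()} | {a.lower() for a in alternatives.get(i, set())}
        correct += r.strip().lower() in ok
    return correct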

Results and Discussion


Summary test statistics (means and standard deviations) were computed for all
measures. Reliability coefficients computed by the Kuder-Richardson formula 20
were obtained as well. These are presented in Table 10-1. The means reveal the
relative difficulty of the cloze, CESL Placement, and TOEFL. The overall mean
for CESL Placement was 50.8, which indicates placement at level three, which is

Table 10-1 Summary Test Statistics

                                 Possible                              Reliability
Measure                          score        Mean    SD      N        coefficient KR 20
Cloze-exact                      50           11.9    2.08    107      .61
Cloze-acceptable                 50           15.3    7.30    107      .85
Total CESL Placement             300/3 = 100  50.8    16.23   107
CESL Listening Comprehension     100          50.4    18.50   107      .70-.87
CESL Structure                   100          50.4    20.80   107      .79-.92
CESL Reading                     100          51.3    16.01   107
Total TOEFL                      ca. 700      422.1   56.06   52       .965
TOEFL Listening Comprehension    70           43.7    6.88    52       .899
TOEFL Structure                  70           44.8    6.94    52       .864
TOEFL Vocabulary                 70           39.8    7.27    52       .892
TOEFL Reading                    70           42.9    7.52    52       .841
TOEFL Writing                    70           39.8    6.92    52       .855

Table 10-2 Correlations for Cloze Tests with Total and Subtests
of the Criterion Measures

                                 Cloze-exact   Cloze-acceptable   N
Total CESL Placement             .80           .84                107
CESL Listening Comprehension     .71           .73                107
CESL Structure                   .63           .69                107
CESL Reading                     .80           .80                107
Total TOEFL                      .71           .79                52
TOEFL Listening Comprehension    .47           .51                52
TOEFL Structure                  .51           .58                52
TOEFL Vocabulary                 .59           .62                52
TOEFL Reading                    .68           .77                52
TOEFL Writing                    .55           .64                52

an intermediate level. Thirty-one of the 107 subjects tested actually were placed
in classes at that level. The reliability for the CELT subtests on CESL Placement
is quite good. The range of the reliability coefficient on the Listening Comprehension
test is .70 to .87, and the range for the Structure test is .79 to .92. No reliability
statistics are available for the Reading for Understanding (CESL Reading) test, but
based on the observed reliability coefficients for the other CESL subtests, a
conservative estimate of the reliability of the CESL Total score would be about .80.
The mean for the total TOEFL was 422.1, which falls about one standard
deviation below the average for the reference population which took that test in
1964. It should be pointed out that the statistics for the TOEFL scores were
computed for only 52 subjects (since only those who placed at level three or above
were given the TOEFL), while the computations for the other tests included all
107 subjects. The reliability for the total TOEFL is .965.
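The Kuder-Richardson formula 20 coefficients reported in Table 10-1 can be computed directly from a subjects-by-items matrix of right/wrong scores. A minimal sketch (the matrix argument is a placeholder, not the study's data):

def kr20(item_matrix):
    """KR 20 reliability: (k/(k-1)) * (1 - sum(p*q) / total variance),
    where p is each item's facility and the variance is computed
    over the subjects' total scores."""
    n = len(item_matrix)                 # subjects
    k = len(item_matrix[0])              # items
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n   # item facility
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)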
To check the validity of the cloze scores against the established criterion
measures, simple correlations were computed for all the scores on the TOEFL and

CESL Placement with the cloze-exact and cloze-acceptable scores. Table 10-2
provides these figures.
The cloze scores and total scores on the TOEFL and CESL Placement tests
were strongly correlated regardless of the method used to score the cloze test.
Cloze-exact correlated with the total TOEFL at .71 and with the total score on the
CESL Placement battery at .80. Cloze-acceptable correlated with the same two
measures at .79 and .84, respectively. The correlations were all significant beyond
the .005 probability level, and indicate substantial variance overlap among the
tests. Indeed, it seems safe to conclude that probably all the reliable variance in
the cloze scores is present in the total scores on both the proficiency batteries.
However, the converse is not true. Some of the reliable variance in the TOEFL is
not generated by the cloze scores. The best explanation for this is probably the
difficulty of the cloze text, which would tend to depress its overall variance and
hence its reliable variance. This is a problem that can be corrected by lengthening
the cloze test and/or making it easier. Therefore, it is concluded that cloze testing
may indeed be a viable alternative procedure for ESL placement and proficiency
testing.
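The validity check behind Table 10-2 is a Pearson product-moment correlation; squaring r gives the variance overlap referred to throughout this chapter. A minimal sketch, with placeholder score lists in place of the actual data:

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

cloze_acceptable = [12, 18, 25, 9, 30, 22]       # illustrative only
toefl_total = [410, 455, 500, 390, 540, 470]     # illustrative only
r = pearson(cloze_acceptable, toefl_total)
overlap = r ** 2     # shared variance; .79 ** 2 is about 62% in Table 10-2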
Previous research with cloze testing has shown that cloze tests tend to correlate
best with those subtests of ESL proficiency measures that are highly
integrative in nature—subtests aimed at reading comprehension, listening skill,
and dictation. It is suggested that this trend may indicate that cloze testing is an
effective procedure for measuring overall ESL proficiency rather than some more
narrowly defined facet of language ability. This claim is supported repeatedly in
this volume (especially in Chaps. 1, 2, 4, 7, 13, 20, and also see Oiler and Perkins,
1978). In the present study, there were moderate to high correlations between
cloze test scores and scores on all the subtests. The highest correlations were with
the subtests of the CESL Placement battery. The correlation of cloze-exact with
the CESL reading test was as high as with the total CESL Placement score. Cloze-exact
correlated with both the total CESL Placement and its Reading subtest at
.80. These correlations are both at about the maximum level that could be
expected in view of the estimated reliabilities of the tests. Taken together with the
results of Hisama (Chap. 4; see especially her factor results), these findings
indicate that the CESL Reading subtest may be a satisfactory placement test by
itself. Of course, if sufficient reliability is obtained, the same could be said for
nearly any one of the subtests studied. The advantage of the cloze procedure is the
relative ease of test construction.
It is interesting that both the cloze-exact scores and the cloze-acceptable
scores correlated more highly with the reading subtest of the TOEFL and CESL
Placement examination than with any of the other subtests. This is not surprising
in light of earlier work with native speakers where cloze tests have consistently
been found to be highly reliable and valid measures of reading ability. Indeed, the
high correlations between the cloze test scores and reading subtests of the TOEFL
and CESL Placement examination could be a function of method. That is, cloze
tests may correlate highly with reading tests because basically the same method of

measurement is used. However, this explanation would not fit recent findings
showing that cloze scores are also excellent predictors of variance in nonverbal IQ
tests (see Chap. 3 and Stump, 1978). Nor does it fit the fact that cloze scores are
known to be very good predictors of achievement scores on a wide range of other
tests (Stump, 1978, and Streiff, 1978). In terms of validity, the cloze task and the
reading subtests all involve reading performance. Nevertheless, taking a reading
comprehension test is an integrative task, and therefore, the conclusion of previous
research that cloze tests tend to correlate highly with tests that are integrative
in nature is also supported.
Which scoring method for the cloze test yields the most accurate information
about the student’s ESL ability? As one would expect, the mean was higher on
the cloze tests for the acceptable-word scoring method than for the exact-word
method. The means were 15.3 (cloze-acceptable) and 11.9 (cloze-exact) out of a
possible 50. The ranges were 0 to 45 for the acceptable-word scoring method and
0 to 32 for the exact-word scoring method. With the acceptable-word method, the
average percent of correct responses was 31% and with the exact-word method
24%. While the scores obtained by the acceptable-word method were higher, the
performance pattern on the test was virtually the same. That is, if the subjects were
rank ordered according to both sets of scores, there would be few differences in
rankings. Furthermore, the cloze-exact responses and the cloze-acceptable
responses correlated at .97 (p < .001), indicating a nearly total overlap in variance
(94%). These data suggest that there is little appreciable difference in information
provided by the different scoring methods.
However, a rather different conclusion is suggested by the standard deviations
and the reliability coefficients. The standard deviations for the two scoring
methods differed by several points: 2.08 for exact-word and 7.3 for acceptable-word.
If we square these values in order to compare them, the acceptable-word
scoring method generates about 13 times as much raw variance across learners as
the exact-word method (53/4 ≈ 13). This is not to say that it contains 13 times as
much information, however, as other factors enter in.
When subjects are homogeneous in ability for the skill being tested, a small
amount of variance is desirable and would indicate precision in the testing instrument.
However, in a situation where students with widely varying language ability
are tested, the amount of variance (the square root of which is the standard deviation)
must be viewed differently. A low standard deviation in this case might
indicate that the testing instrument does not discriminate levels because it is either
so difficult that the beginning, intermediate, and some of the advanced students all
perform poorly, or so easy that the majority get almost everything correct. When a
wide range of performance is expected, the standard deviation should be generally
higher than with a homogeneous group of subjects.
In the present study, the subjects represented the full range of ESL proficiency
tested at CESL, from those considered beginners to those considered proficient
enough to enter the university. Apparently the exact-word scoring method

does not discriminate among levels to the extent the acceptable-word method
does.
The reliability coefficients obtained for the two scoring procedures suggest
that the acceptable-word scoring method yields more reliable scores. The coefficients
are .61 for the exact-word scoring method and .85 for the acceptable-word
scoring method. The differences in the coefficients are largely a function of the
amount of variability in the test scores (and in the way the reliabilities are
estimated). The difference in the standard deviations given above illustrates the
considerable gain in variance with the acceptable-word scoring method. By the
KR 20 estimate, the estimated difference in reliability between the two scoring
methods is about 35% (.85² − .61² ≈ .35). However, the difference in actuality is
probably not so great, since the correlation between the two scoring methods
(another way of estimating reliability) indicates that only 6% of the variance in the
cloze-acceptable and cloze-exact scores is not common variance. Further, the
observed correlations in Table 10-2 also suggest that the KR 20 estimate of
reliability for cloze-exact is a bit too conservative.
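The arithmetic behind these comparisons, reproduced directly from the reported statistics:

# Reported statistics: SDs of 2.08 (exact) and 7.30 (acceptable);
# KR 20 coefficients of .61 and .85; r = .97 between the two methods.
var_exact = 2.08 ** 2                        # about 4.3
var_acceptable = 7.30 ** 2                   # about 53.3
variance_ratio = var_acceptable / var_exact  # roughly 13 times the raw variance
kr20_difference = .85 ** 2 - .61 ** 2        # about .35, the 35% figure above
noncommon = 1 - .97 ** 2                     # about 6% noncommon variance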
Nevertheless, since the exact-word scoring method does not allow for alternative
answers, the observed variance is somewhat suppressed, and the full range of
ability levels of the students may therefore not surface in the test results as well as
with the acceptable-word method. It would appear, then, on the basis of the
standard deviations and reliability coefficients, that the acceptable-word scoring
method provides more accurate information about ESL proficiency levels.
In an attempt to help clarify the issue, the correlations obtained for the two
scoring methods with the total CESL Placement and the total TOEFL were tested
to see if they were significantly different. The correlations for cloze-exact and
cloze-acceptable with CESL Placement were .80 and .84, respectively. These
correlations did not prove to be significantly different (by a t-test). This indicates
that the scores obtained by both scoring methods provide essentially the same
information about student performance on the CESL Placement examination.
The correlations obtained between cloze-exact and cloze-acceptable with
the total TOEFL were .71 and .79, respectively. These correlations were significantly
different (p < .05). Apparently with regard to performance on the TOEFL,
more information is provided by the acceptable-word scoring method. One
possible explanation for this may be that a high-level test such as the TOEFL
makes a somewhat finer discrimination at the upper end of the proficiency
continuum than CESL placement; so with the cloze procedure the scoring method
that allows for finer discrimination among the more advanced students will
provide more information about the language ability of those students.
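The chapter does not name the specific test used to compare these two correlations, which share the TOEFL as a common variable. One standard choice for that situation is Hotelling's t for dependent correlations, sketched here as an assumption rather than as the procedure actually used:

from math import sqrt

def hotelling_t(r12, r13, r23, n):
    """Compare r12 and r13 when variables 2 and 3 are measured on
    the same subjects and correlate r23 with each other."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    t = (r12 - r13) * sqrt((n - 3) * (1 + r23) / (2 * det))
    return t, n - 3                 # t statistic, degrees of freedom

# Cloze-acceptable (.79) vs. cloze-exact (.71) with the TOEFL,
# r = .97 between the two scorings, n = 52:
t, df = hotelling_t(.79, .71, .97, 52)   # t is about 4, consistent with p < .05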
In conclusion, the high positive correlations between cloze test scores and
scores on established criterion measures suggest that the concurrent validity of the
cloze test used in this study would warrant cautious application of the cloze procedure
for placement purposes. Results indicate that the difficulty level of the test
should be somewhat lower for CESL students at SIU than that of the text selected for this

study. The data show that the acceptable-word method is probably the more
reliable of the two scoring methods studied. The exact-word method is the preferred
grading procedure in practical terms, but the information that it yields about
the student’s ESL ability does not seem to reflect the student’s language competence
as well as the information yielded by the acceptable-word method.

Note

1. This is a revised and slightly expanded version of a paper presented at the annual meeting of the
Linguistic Society of America in Philadelphia, Dec. 30, 1976. The study was part of the author’s
doctoral dissertation at Southern Illinois University in Carbondale. The author is indebted to Richard
Daesch and the staff at the Center for English as a Second Language, Southern Illinois University, for
permission to conduct the research there. Special thanks is expressed to Dr. Dorothy Higginbotham
and Dr. Paula Woehlke for helpful comments and suggestions at every stage in the study.
Chapter 11

An Alternative Cloze Testing Procedure:
Multiple-Choice Format1

Frances Butler Hinofotis and Becky Gerlach Snow

Recent research indicates that open-ended cloze test results correlate
strongly with total scores on established ESL proficiency measures. This
suggests that open-ended cloze tests may possibly be substituted for more
complicated ESL proficiency measures. The question asked here is whether
a multiple-choice (MC) cloze test will work equally well. MC and open-ended
cloze tests over the same text used by Hinofotis (Chap. 10) were administered
to 66 incoming foreign students at the Center for English as a Second
Language (CESL), Southern Illinois University in the fall of 1976. Each
student attempted 25 MC items and 25 open-ended items. Distractors for
the MC items were developed from responses to an open-ended version of
the same cloze test. The criterion measures were the subtests of the CESL
Placement battery. To check the validity of the cloze test, simple correlations
were computed between total scores and subscores of the CESL
Placement battery with the MC and open-ended cloze scores. Results
indicate that an MC cloze test is a promising evaluative tool and may be preferred
over open-ended cloze tests because of ease of scoring.

With the majority of cloze testing research to date, two different scoring
methods have been employed—the exact-word method and the acceptable-word
method. The exact-word method involves counting as correct only the words
which were actually deleted. The acceptable-word method allows any response
that is grammatically and contextually appropriate. Using the acceptable-word
scoring method can be time-consuming, however, and an element of subjectivity is
introduced. Sometimes a word may make sense in the sentence in which it appears
but not in the passage as a whole. While scores in general with the acceptable-word
method tend to be higher, the performance pattern on the test is usually the same.


We reasoned that if the exact-word scoring method provides about as much
information as the more complicated acceptable-word scoring method (see Oiler,
1972, Stubbs and Tucker, 1974, and Chap. 10), then perhaps an MC cloze procedure
would give us equal information while still further simplifying the scoring
procedure. Studies to date (briefly summarized in Chap. 10) have concentrated on
the open-ended cloze technique. Only three studies that we know of have
investigated the possibility of using an MC cloze test to measure ESL proficiency.
Jonz (1976) reported substantial correlations between an MC cloze test for ESL
proficiency and subtests of a more traditional placement examination. Wijnstra
and van Wageningen (manuscript) used an MC cloze test in Dutch with elementary
students of foreign background in Holland. They found a high correlation
between open-ended and MC versions. Cranney (1973) constructed an MC
cloze test for readability for native speakers. He claimed high reliability and
validity.
The main research question asked in this study was: Do the scores on an MC
version of a cloze test correlate as strongly with a widely used measure of ESL
proficiency as the scores on an open-ended version of the same test?

Method
Subjects. Sixty-six incoming foreign students for the Oct. 4, 1976, term at CESL
took the cloze test as a part of their placement examination. Language backgrounds
of subjects included Arabic, Japanese, Farsi, French, Spanish, Chinese,
and Vietnamese.
Testing. The same 427-word passage used by Hinofotis (Chap. 10) was converted
into a 50-item MC test. Each MC item was based largely on responses
observed in the previous administration of the same text in open-ended form to
107 incoming foreign students during the summer of 1976, at the Center for
English as a Second Language, Southern Illinois University (Chap. 10). The most
frequent incorrect responses obtained from the open-ended version of the test
were used as distractors for the MC tests. High-frequency responses which were
grammatical in terms of short-range phrase structure but which were not
contextually acceptable in relation to the full text were not used.
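A sketch of that distractor-selection step follows; the names and data are illustrative, and the exclusion set stands in for the long-range-context judgment described above.

from collections import Counter

def pick_distractors(responses, exact_word, excluded, k=3):
    """Most frequent incorrect responses to a blank, skipping any
    response ruled out because it fit only the short-range phrase
    structure and not the passage as a whole."""
    counts = Counter(r.lower() for r in responses)
    keep = [(w, c) for w, c in counts.most_common()
            if w != exact_word.lower() and w not in excluded]
    return [w for w, _ in keep[:k]]

# e.g., open-ended responses to one blank whose deleted word was "by":
pick_distractors(["by", "with", "from", "with", "in", "at"], "by", {"at"})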
Two test forms were then constructed as follows: on form A, the first 25 items
were MC and the second 25 were open-ended. On form B, the order was reversed.
Tests were distributed so that every other student took form A while the rest were
taking form B. This procedure had three desirable effects: first, it counterbalanced
any order effect of MC versus open-ended format; second, it systematically
randomized the selection of students who took form A or form B; and third, it
tended to reduce the possibility of test compromise.
Because the open-ended and MC items were part of the same passage, we
made the directions for the two parts of the passage as similar as possible. For the
open-ended items the student was asked to follow the standard fill-in-the-blank
procedure. For the MC items, the four alternatives were listed in the right-hand

margin, and the student was instructed to write the best choice in the blank.
Subjects were allowed 30 minutes to complete either form A or B.
The CESL Placement battery (described in Chap. 2) was the criterion against
which the cloze tests were evaluated. Statistics were computed over the entire
group of 66 subjects and separately for the 33 subjects taking form A and the 33
subjects taking form B.

Results and Discussion


The summary test statistics are reported in Table 11-1. The means and standard
deviations were almost the same for forms A and B, as we should expect. The
higher mean scores for the MC tests across the board indicate that the MC task is
easier. The means and standard deviations for the CESL Placement Total were
similar across the two groups who took forms A and B. Therefore, the procedure
for passing out the tests did seem to result in two equivalent subject samples. The
mid-30s range on the CESL Placement exam indicates that the majority of the
students fell at the lower end of the proficiency continuum.
On the basis of previous research using the same cloze test in open-ended
form, the assumption was made that the open-ended cloze test could be
substituted for the CESL Placement test. We expected to find that the open-ended
cloze test would correlate as strongly with the CESL Placement test for this
group of subjects as it did in the earlier study, allowing for subject variability. In
order to test this possibility, it is necessary to examine the correlations between
cloze-exact, cloze-acceptable, and MC scores and the criterion measures (the CESL
Placement test and its subscores). These are shown in Table 11-2. Cloze-exact
correlated with CESL Placement at .71 and cloze-acceptable correlated with
CESL Placement at .74. The correlation between the MC scores and CESL
Placement scores was .63.
Although the differences between the correlations of the open-ended scores
and the MC cloze scores with CESL Placement appear to be fairly large, the differences

Table 11-1 Summary Test Statistics

                     Form A: Multiple      Form B: Open-ended-
                     choice-open-ended     multiple choice         All subjects
                     Mean      SD          Mean      SD            Mean      SD
Cloze-exact          2.73      3.07        2.55      3.01          2.64      3.02
Cloze-acceptable     3.39      3.91        2.97      3.50          3.18      3.69
MC version           8.21      3.43        8.12      3.81          8.17      3.60
CESL Listening       32.24     15.49       34.42     14.90         33.33     15.12
CESL Structure       32.55     18.27       36.45     21.22         34.50     19.75
CESL Reading         35.94     14.92       39.18     14.69         37.56     14.78
CESL Total           33.58     14.53       36.69     14.82         35.13     14.65

Table 11-2 Correlation Matrix for All Subjects (N = 66)

                   Cloze-   Cloze-       MC        CESL        CESL        CESL      CESL
                   exact    acceptable   version   Listening   Structure   Reading   Total
Cloze-exact                 .97          .54       .60         .59         .71       .71
Cloze-acceptable                         .59       .63         .61         .73       .74
MC version                                         .52         .62         .53       .63
CESL Listening                                                 .70         .56       .85
CESL Structure                                                             .73       .94
CESL Reading                                                                         .86
CESL Total

Table 11-3 Correlation Matrix Form A: Multiple Choice-Open-Ended (N = 33)

                   Cloze-   Cloze-       MC        CESL        CESL        CESL      CESL
                   exact    acceptable   version   Listening   Structure   Reading   Total
Cloze-exact                 .97          .70       .77         .71         .75       .83
Cloze-acceptable                         .67       .74         .66         .76       .80
MC version                                         .59         .57         .53       .63
CESL Listening                                                 .69         .65       .87
CESL Structure                                                             .76       .92
CESL Reading                                                                         .89
CESL Total

Table 11-4 Correlation Matrix Form B: Open-Ended-Multiple Choice (N = 33)

                   Cloze-   Cloze-       MC        CESL        CESL        CESL      CESL
                   exact    acceptable   version   Listening   Structure   Reading   Total
Cloze-exact                 .96          .38       .44         .51         .67       .61
Cloze-acceptable                         .52       .51         .59         .72       .69
MC version                                         .47         .66         .53       .65
CESL Listening                                                 .72         .44       .82
CESL Structure                                                             .71       .95
CESL Reading                                                                         .82
CESL Total

are not, in fact, statistically significant at the .05 probability level. This
suggests that the open-ended and MC tests are providing similar information.
In the study by Hinofotis (Chap. 10), the correlations between the cloze
scores and the CESL Placement Total were substantially higher (.80 exact, .84
acceptable) than those shown for cloze-exact and cloze-acceptable with the CESL
Placement Total in Table 11-2 (.71 and .74). However, the corresponding correlations
obtained for the subjects taking form A of the present test were .83 and .80,
as shown in Table 11-3. The corresponding correlations for the subjects taking
form B were considerably lower, as shown in Table 11-4 (.61 and .69). However,
the correlations between MC form A and MC form B with the CESL Placement
Total were nearly identical (.63 and .65, respectively).
The item facility indices (not tabulated here) revealed that in general the MC
items were easier than the open-ended items, as should be expected. Few of the

items on either the open-ended or MC tests, however, approached the 85% facility
level, but a number of them came close to the 15% facility level, indicating they
were quite difficult for this sample of students. This is consistent with the previously
reported findings of Hinofotis (Chap. 10). The discrimination was better
with the open-ended tasks than with the MC items. This may be due to the fact that
in taking an open-ended cloze test the student has to generate his own alternatives
much as he would in normal communication (allowing, as Hisama noted in Chap.
4, less chance for correct guessing).
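The item statistics referred to here are straightforward to compute: facility is the proportion of examinees answering an item correctly, and discrimination is sketched below with the point-biserial correlation between the item and the total score (one common index; the chapter does not say which index was used).

def item_facility(item_scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, totals):
    """Correlation between a 0/1 item and the total score."""
    n = len(totals)
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5
    p = item_facility(item_scores)
    if p in (0.0, 1.0) or sd_t == 0.0:
        return 0.0                      # undefined; item has no variance
    mean_pass = (sum(t for s, t in zip(item_scores, totals) if s)
                 / sum(item_scores))
    return (mean_pass - mean_t) / sd_t * (p / (1 - p)) ** 0.5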
In conclusion, the MC cloze technique seems to have good potential as an
ESL proficiency measure. A caution to keep in mind is that constructing an MC
cloze test is a considerably more complicated procedure than constructing an
open-ended cloze test. An MC version requires pretesting with an open-ended
task (or some other system) to obtain distractors. It is then necessary to pretest the
MC version to check item facility and discrimination. Items which do not discriminate
well should be eliminated or modified. However, once a reliable MC
format is obtained, the time involved in administering and scoring the test should
be less. An MC cloze test can be easily hand checked or computer scored. The
results of this study, though not conclusive, suggest that MC cloze tests have some
promise. To compensate for the noise factor created by the greater likelihood of
correct guesses on MC cloze tests than on open-ended versions, test length will
probably have to be increased in order to achieve the desired reliability levels,
and/or item facility and item discrimination requirements will have to be more
rigorously monitored during test development phases.2

Notes

1. This is a revised version of a paper presented at the Midwestern Regional NAFSA Meeting in
Chicago in November 1976. The authors wish to express their thanks to William Flick for reading and
commenting on an earlier draft of the paper.

2. Further research with a 50-item MC cloze test is presently underway at UCLA and Southern
Illinois University with native and nonnative speakers of English. The depressed variability in the
sample used in this study prevents us from concluding that an MC cloze test cannot be substituted for a
more complicated testing procedure. Indeed, in spite of the generally low overall proficiency in our
sample of subjects, there is substantial reason for optimism concerning MC cloze testing.
Chapter 12

The Effects of Agreement/Disagreement
on Cloze Scores

Naomi Doerr

In order to test the hypothesis that the foreign language learner will tend to
block out or alter material with which he disagrees, passages which represented
positive and negative sides of three controversial issues were
converted into cloze tests and were administered to the same 182 subjects
referred to in Chap. 2 as part of the CESL testing project. About 100
subjects also participated in an oral interview where they were asked to indicate
their views (pro, con, or neutral) on each of the three issues. Subjects
indicating neutrality were not included in the study. These self-expressed
attitudes were used to determine composite agree and disagree scores for
the 88 nonneutral subjects who completed cloze tests over pro and con texts
on each of the three topics. The agree score for each subject was the sum of
scores on texts with which the subject had indirectly indicated agreement,
and the disagree score was a similar sum over the remaining texts. The
composite agree and disagree scores as well as composite pro and con scores
(i.e., the sum of scores on pro and the sum of scores on con texts, respectively)
were contrasted by two-tailed t-tests and were also tested for correlation.
Correlations were .91 for agree and disagree scores and .90 for pro and
con. The contrasts were significant in both cases (p < .03), but the agree
scores, contrary to prediction, were lower rather than higher; and although
no contrast was predicted on pro and con scores, the pro texts proved to be
easier. From these data it was deduced that if attitudes are a contributing
factor in the foreign language learner’s comprehension of a message with
which he disagrees, their effects are probably not significant and almost
certainly not very strong.

Conversation between people who disagree typically involves many instances
of communication breakdown. Consequently, as much linguistic effort may be


required for reiteration and correction of faulty comprehension as for pertinent
debate on the issue at hand. Why is it that redundancy, which, according to Manis
and Dawes (1961), is “a safeguard against errors in communication,” is apparently
less effective in cases of disagreement or controversy? They suggest that the
“increased rate of error may in part be attributable to the recipient’s inability to
capitalize fully upon the potential redundancy of messages that oppose his
beliefs” (p. 79). In fact, their study, which involved native speakers of English,
revealed that a recipient who disagrees with the contents of a controversial
statement is relatively less sensitive to the redundancy of the message and has a
harder time understanding the message than someone who agrees with the views
expressed in the statement.
A number of psychological experiments in perception attest to the effects of
attitudes in communicative situations (Jones and Kohler, 1958). These studies
show that cognitive activities tend to be selective. Stimuli that are supportive of
one’s own values or self-concepts tend to be accentuated, while stimuli that are
contravaluant are often missed or distorted. Jones and Kohler examined the
degree of learning and retention of value-laden controversial material with native
speakers of English. They found that students learn covaluant material (that is,
material that they tend to agree with) more effectively than they do contravaluant
material (that is, material with which they disagree).
If attitudes have significant effects on native speakers when communicating
and when learning controversial material, perhaps they will have similar effects on
students of foreign language. In fact we might expect the effects to be more pronounced
in learners who are not as sensitive to the redundant features of the
language used. Therefore, it is hypothesized that the foreign language learner will
tend to block out or alter contravaluant material. Following Manis and Dawes
(1961), the cloze procedure (Taylor, 1953, 1954, 1956) was used to test this
hypothesis.

Method

Subjects. Initially all 182 subjects enrolled at CESL during the first term in the
spring of 1977 were tested (see Chap. 2 for more descriptive data). However, a
much smaller number completed the interviewing and all six of the tests described
below. In all, 88 subjects (mostly drawn from the higher end of the proficiency
range) completed the interview and the tests, and also indicated nonneutral
feelings on the three controversial topics selected for study.
Testing. For each of three controversial topics (the legalization of marijuana,
the retention of capital punishment, and the morality of abortion), two 100-word
passages, a pro passage (supporting the positive side of the issue) and a con passage
(supporting the negative side), were selected for use in the experiment. An attempt
was made to structure pro and con passages on the same topic to be of equivalent
difficulty. To ensure maximum spread of subjects, each pair of texts became progressively
more difficult: the marijuana texts were easy; the capital punishment

texts somewhat more difficult; and the abortion texts more difficult still (see
Appendix 12A for complete version of the texts).
The first sentence of each passage was left intact; every fifth word was then
deleted until a total of twelve blanks were inserted. Following the twelfth blank,
the remainder of the passage remained unmutilated. The position of each deletion
was signified by a numbered blank of standard length. The subjects were
instructed to fill in the missing words. Cloze scores were determined by counting
the number of words restored exactly as they had appeared in the original text plus
the number of other acceptable words inserted, e.g., synonyms.
Approximately one week before the written testing was done, oral interviews
assessing attitudes toward marijuana, capital punishment, and abortion were
completed. Each subject indicated how much he agreed or disagreed with the
following statements (see Appendix 12B for a complete description of the Oral
Interview Attitude Survey where the following questions appear as items 7a, 7b,
and 7c):

1. “Marijuana should be legalized.”
2. “Capital punishment should be abolished.”
3. “Abortion of unwanted children (when the life of the mother is not endangered)
is a crime and should be punished.”

The interviewer assessed each subject’s response to each statement on a five-point
scale. A score of 1 was assigned to indicate “strongly agree”; 5 was used to
signify “strongly disagree”; 2 and 4 represented the appropriate milder degrees of
sentiment. The mid-point of the scale (3) was marked to indicate neutrality on the
topic, and subjects who so responded were eliminated from the study.
To control a possible order effect and to prevent test compromise as much as
possible, two forms of the test were used. In form A, the pro passage preceded the
con passage for each issue; in form B, the order was reversed. The two forms were
alternated: every other student received form A while the remaining students
received form B. (Only form A is given in Appendix 12A.)
For each subject, four composite scores were computed: pro; con; agree;
disagree. The pro score for each subject was the sum of the scores he achieved on
the three pro texts. The con score was the sum of the scores achieved on the three
con texts. The agree score was the sum of the scores obtained on the three texts
which reflected the subject’s previously determined views, either pro or con, on
each of the respective issues. This score, then, might be a composite of scores on
all pro texts, all con texts, or some combination of pro and con texts. The disagree
score was the sum of the scores achieved on the remaining three texts which were
contravaluant to the subject’s determined views. For instance, if the subject had
indicated disagreement with the statement that marijuana should be legalized, it
was assumed that he agreed with the con marijuana text and disagreed with the pro,
and so forth. As in the Manis and Dawes study, each subject served as his own
control for language ability.1
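The routing of scores into the four composites can be sketched as follows. Field names are illustrative; the per-topic mapping follows the wording of the three interview statements given above, of which only the marijuana statement endorses the pro side of its pair of texts.

# Whether agreeing with the interview statement means siding with the
# *pro* passage; true only for marijuana, given the wording above.
STATEMENT_IS_PRO = {"marijuana": True,
                    "capital punishment": False,  # statement favors abolition
                    "abortion": False}            # statement condemns abortion

def composites(ratings, pro, con):
    """ratings: topic -> 1..5 interview rating; pro/con: topic -> cloze score."""
    agree = disagree = 0
    for topic, is_pro in STATEMENT_IS_PRO.items():
        if ratings[topic] == 3:
            return None                   # neutral subjects were dropped
        endorses = ratings[topic] <= 2    # 1 or 2 = agrees with the statement
        sides_with_pro = endorses == is_pro
        agree += pro[topic] if sides_with_pro else con[topic]
        disagree += con[topic] if sides_with_pro else pro[topic]
    return {"pro": sum(pro.values()), "con": sum(con.values()),
            "agree": agree, "disagree": disagree}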

Table 12-1 Contrasts between Agree and Disagree
Scores and Pro and Con Scores

                                               2-tailed
Score      Mean      SD       t value          probability
agree      7.6818    6.634
                              -2.18            0.032
disagree   8.3750    7.300

pro        9.7273    7.080
                              3.01             0.003
con        8.7159    6.452

Results and Discussion


Table 12-1 presents the contrasts between the agree and disagree scores and the
pro and con scores for the 88 subjects who completed all six of the cloze tests in
addition to the Oral Interview Attitude Survey. The pro and con scores contrasted
significantly; the pro texts were easier by approximately 1.01 points on the
average (t = 3.01, p < .003). Possibly as a result of this contrast, and because
more people tended to agree with the con texts than with the pro texts, the agree
scores were significantly lower than the disagree scores by .69 (t = -2.18,
p < .03). Approximately 56% of the subjects agreed with the con texts, whereas
only about 32% of the subjects agreed with the pro texts (12% being neutral). This
confounding factor was apparently also present in the original design of Manis and
Dawes. To escape this difficulty, it would be necessary for the pro and con texts to
be of exactly equivalent difficulty to begin with and/or for them to contribute
about equally to the agree and disagree scores for subjects of equal ability.
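Since each subject serves as his own control, the contrast in Table 12-1 is a correlated-samples (paired) two-tailed t-test. A minimal sketch:

from math import sqrt

def paired_t(x, y):
    """Paired t-test statistic over matched score lists."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / sqrt(n)), n - 1      # t and degrees of freedom

# Applied to the 88 subjects' agree and disagree composites, this
# yields the t = -2.18 reported above.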
On the basis of these findings, it is not possible to argue strongly that a foreign
language learner’s self-expressed attitude toward a controversial issue has no relationship
to his comprehension of a contravaluant message concerning that issue.
However, the effect is not substantial enough to override the slight (but significant)
difference in difficulty across the pro and con texts in this study.
Furthermore, the correlations between the agree and disagree scores, .91, and the
pro and con scores, .90, reveal the very substantial overlap in variance (about
81%) across the several composite scores. Therefore, the experimental hypothesis
is cautiously rejected.

Note

1. This control could be effective only to the extent that the various pairs of texts (pro and con) were
actually of equivalent difficulty. In fact, they turned out not to be of quite equivalent difficulty. This is a
confounding factor further complicated by the fact that the composite agree and disagree scores could
be constituted by any combination of pro and con scores over the pairs of texts.

Appendix 12A
Name _____  Instructor _____

Course Level _____  Section _____

Form A

DIRECTIONS: You will read six short paragraphs. Some of the words have
been left out. See if you can fill them in. Read all of each paragraph
before you try to fill in the missing words.

EXAMPLE: The cat ran up the _____.

ANSWER: The cat ran up the tree.

You should fill the blank with the word tree. Other words that
also fit are fence, wall, door, street, branch, and so on. It is all
right to guess if you are not sure. Try to use only one word for
each blank.

DO NOT TURN THE PAGE UNTIL YOU ARE TOLD TO DO
SO. YOU WILL HAVE THIRTY MINUTES TO COMPLETE
THIS EXERCISE.

[Pro Marijuana]

Mr. Sam Brown says “millions of people use and enjoy marijuana. Marijuana helps
people forget (1)_____ problems of everyday life. (2)_____ in a
modern society (3)_____ sometimes very difficult. People (4)_____
often unhappy with life. (5)_____ when people smoke marijuana (6)_____
feel very calm and (7)_____. They think that life (8)_____ pleasant.
Marijuana is being (9)_____ by many groups. Some (10)_____ these
groups say that (11)_____ is not dangerous to (12)_____. Marijuana
is becoming more and more common. People of all ages, professions, and nationalities
use marijuana to make their lives a little more enjoyable.”

[Con Marijuana]

Dr. Jane Wilson says “marijuana is a very dangerous drug. Scientists have studied
marijuana (13)_____ many years. Their studies (14)_____ us that
marijuana has (15)_____ effects on people. For (16)_____, marijuana
causes changes in (17)_____ emotions. It can make (18)_____
person feel afraid, or (19)_____, or very sad. Marijuana (20)_____
causes people to forget (21)_____. Because of this, smokers (22)_____
marijuana often have problems (23)_____ their jobs or at (24)_____.

Finally, marijuana makes people feel very tired. They have no energy to do even
the most important jobs.”

[Pro Capital Punishment]


Lawyer Ira Adams says “there are some murders committed out of an anger so
strong that no penalty would be strong enough to stop the murderer. Indeed, men
who are (25)_____ dangerous that they kill (26)_____ they lose their
tempers (27)_____ be executed for the (28)_____ of other people.
Moreover, (29)_____ must remember that all (30)_____ are not committed
under (31)_____ impulse. Because of those (32)_____ in
which men do (33)_____ of committing murder, it (34)_____ necessary
that capital punishment, (35)_____ most powerful and effective
(36)_____, should be retained. When the guilty are not punished, the law
has failed and the safety of innocent people is endangered.”

[Con Capital Punishment]

According to lawyer Tom Jones, “the death penalty is not justice; it is more an act
of hate. Punishment cannot be considered (37)_____ real goal of criminal
(38)_____. Yet simple punishment is (39)_____ of the most often
(40)_____ arguments for the death (41)_____. A responsible society
wants (42)_____, not revenge. It does (43)_____ need to get revenge
(44)_____ itself, nor should it (45)_____ to do so. Punishment (46)_____
punishment’s own sake is (47)_____ justice. The death penalty
(48)_____ not only unnecessary and useless, but it is also crude and brutal.
Capital punishment has no place in a civilized society that does not require the
taking of a human life for its safety and welfare.”

[Pro Abortion]

Dr. George Smith of the Malaga Medical School says that “abortion is one of the
most beneficial developments of modern medicine. A simple and safe (49)_____,
abortion is recognized all (50)_____ the world for its (51)_____
on the lives of (52)_____ of individuals, as well (53)_____
its sociological implications. Abortion (54)_____ freed women from
unwanted (55)_____. People who cannot (56)_____ support a child,
or (57)_____ are not psychologically prepared (58)_____ become
parents can turn (59)_____ abortion as a practical (60)_____ realistic
solution to their dilemma. At the sociological level, abortion is useful as a
population control device.”

[Con Abortion]

Dr. Harry White asks and answers the question: “What is an abortion? It is the killing
of a distinct, irreplaceable, unique human being. At best, it is (61)_____

to killing a person (62)_____ his sleep. And, as (63)_____ all know,
killing a (64)_____ in his sleep, even (65)_____ inflicting any pain, is
(66)_____ more serious offense than (67)_____ him painful injuries
which (68)_____ not fatal. This is (69)_____ the offense lies in (70)_____
the individual of the (71)_____ of his life. The (72)_____
toward casual, unthinking acceptance of abortion is nearly as scandalous as the act
of abortion itself. The people who argue that abortion is just a simple medical procedure
are forgetting the ethical and moral implications involved.”

Appendix 12B
Oral Interview Attitude Survey

Demographic information:
1. Name_
Last (family name), First (given name)

2. Native language_
(the language you spoke most in your home country)

3. Country of origin_
(where you grew up)

4. Date of arrival in the United States_


5. Had you ever visited an English-speaking country before this most recent
trip to the United States? If so, what was the length of stay in days, weeks,
or months (indicate which)?_
If none, interviewer should write “none.”
6. What country did you visit and for what purpose?

(e.g., England, to get a Master’s Degree in biology at Oxford; or the United
States to see the Grand Canyon and other historic and geographical sites)

Agree/Disagree Attitude Scales:

7. Most people would disagree strongly with the statement that “Being
unfriendly is a desirable human quality.” But they would agree strongly

with the statement “Being friendly is a desirable trait.” How do you feel
about the following statements? Indicate how much you agree or disagree
by saying (1) strongly agree, (2) agree mostly, (3) don’t care one way or the
other, (4) disagree mostly, or (5) disagree strongly. (Interviewer should
prompt to try to get the subject’s actual opinion—but the interviewer
should not try to influence the subject’s opinion at all!)

7a. Marijuana should be legalized. _____
(Interviewer indicates a numerical value corresponding
to the subject’s degree of agreement or disagreement with the statement.)
7b. Capital punishment should be abolished. _____
7c. Abortion of unwanted children (when the life of the mother is not
endangered) is a crime and should be punished. _____
(Interviewer should try to elicit subject’s true feelings and not just
get an answer such as nodding of the head. Prompts might include
questions like, “How do you feel about capital punishment?” Avoid
prompts like, “You probably are against capital punishment, aren’t
you?” Also avoid any violent head nodding or shaking or any other
gestures which might make subjects change their position to a
stronger or weaker one. Try to get neutral subjects to commit them¬
selves one way or the other, but without biasing them in a particular
direction.)

Behavior Patterns:
8. Who do you live with? People who speak your native language or English-
speaking people? In other words, what language do you speak most of the
time when you are at your dormitory or apartment?
(Interviewer should place the subject on the following scale.)

4 = Only English
3 = Mostly English
2 = Equal Amounts
1 = Mostly Native Language
0 = Only Native Language

9. About how many pages of text (typewritten pages as a basis for estimating
length—double-spaced) do you write each semester?

0 = less than 1 page     4 = 16 to 20
1 = between 1 and 5      5 = 21 to 25
2 = 6 to 10              6 = 26 to 30
3 = 11 to 15             7 = 30 or more

(Interviewer should indicate category and number of pages.)
Chapter 13

TOEFL Scores in Relation
to Standardized Reading Tests

Kyle Perkins and Keith Pharis

Two standardized reading tests for native speakers were administered to
advanced ESL students. Correlations of .46 and .49 (p < .05) were observed
between the Iowa Silent Reading Test and the McGraw-Hill Reading Test
with the total score of the criterion variable, the TOEFL. The standardized
reading tests indicated grade level equivalences for advanced ESL students
ranging from the 1st to the 27th percentile of the 12th grade. Shared
variance between the TOEFL reading subtest and the standardized reading
tests was low in two cases. The construct validity problem is discussed. The
question why students who apparently can’t read very well succeed in
university study is also considered.

Can a measure of proficiency in English as a second language (ESL) predict
performance of nonnatives on standardized reading tests for native speakers and
vice versa? More specifically, what is the relation between scores on the Test of
English as a Foreign Language (TOEFL) and standardized reading tests?
It has commonly been assumed by teachers and administrators of intensive
English centers that foreign students who successfully complete the prescribed
curricula are ready to compete with native speakers of English in the traditional
language skills—speaking, listening, reading, and writing. It is usually assumed
that an ESL student who gets an 85 or 90 on the Michigan Test of English Proficiency
or a 500 or better on the TOEFL will have the requisite language skills to
perform adequately in a university context. If these assumptions are correct, there
should be moderate to strong correlations between ESL measures and standardized
reading tests.
We chose reading skill as a first area of inquiry because of its assumed importance
to success in university courses and also to just about any sort of test taking.


Certainly, nearly all measures of ESL proficiency require discourse processing in
the form of reading. In brief, if a student can’t read, he can’t even take most tests.
Of course, an oral interview like the Foreign Service Institute procedure (see
Chap. 7) would be an exception, but this does not in any way reduce the importance
of being able to read well.

Method
Subjects. We tested students at levels 5 and 6 in three different batches (Group 1
contained 23 subjects, Group 2 contained 47, and Group 3 contained 40) at the
Center for English as a Second Language (CESL) at Southern Illinois University—Carbondale
in the fall of 1976 and spring of 1977. At CESL, the TOEFL is given to all students at level 3
on up (see Chap. 4 for a description of the levels and how they are set up). On the
basis of TOEFL, Michigan, and CESL Placement scores, students are eventually
recommended for full or part-time university study at the graduate or undergraduate
level. Groups 1, 2, and 3 each took different standardized reading tests, as is
indicated below.
Testing. The procedures for gathering the data were as follows. We administered
each standardized reading test to the foreign students enrolled in levels 5
and 6 at CESL approximately one week before the TOEFL was administered in
each of the three cases.
In the first phase of our data gathering (with Group 1, N = 23), we used the
Nelson-Denny Reading Test, Form D (ND). This test was designed for use in
grades 9 through 16, and it contains subtests aimed at Vocabulary, Comprehension,
and Reading Rate. In our studies, we did not use the Reading Rate data. There
are 100 Vocabulary items and 36 Comprehension items. The Comprehension
score is given double weight, thus giving 172 total points. The normal working
time is 30 minutes. The ND was administered to Group 1 in mid-December, 1976.
In the second phase of our data gathering (with Group 2, N = 47), we used the
Iowa Silent Reading Tests (ISRT). We chose level 2 of the ISRT, which is
intended for grades 9 through 14, with norms differentiated according to post-high-school
plans, and we administered only two of the subtests—Vocabulary and
Comprehension. The Vocabulary section consists of 50 items which are said to
survey the depth, breadth, and precision of the student’s general reading vocabulary.
There are four multiple-choice answers to each question, from which the
student selects the nearest synonym for a given word.
The Comprehension section consists of 50 items which are supposed to
measure the student’s ability to comprehend literal detail, to reason in reading,
and to evaluate what he has read. Part A includes 38 questions based on six short
passages. In this section, the reader is allowed to look back at the passages while
answering the questions. Part B consists of 12 passages which are supposed to test
the student’s short-term recall. The student is required to answer one question
about each passage without looking back. The normal working time is 54 minutes.

The ISRT was administered to Group 2 from February to May 1977. Summary test
statistics for the ISRT and TOEFL are reported below in Table 13-2.
Our colleague, Douglas Flahive (1977), collected the data for the third phase
of our research (with Group 3). He used the McGraw-Hill Basic Skills System
Reading Test. This battery includes a Reading Comprehension section (MHRC)
with 10 short passages followed by a total of 30 questions testing for main ideas
and supporting details. The MHRC allows 40 minutes of working time. He correlated
the MHRC with the TOEFL Reading subscore. In the two earlier phases, the
TOEFL total score was used in addition to the Reading subscore.
What we were interested in throughout was the power of the TOEFL scores as
predictors of whatever is measured in standardized reading tests intended for
native speakers of English. In the single predictor case, such as the three cases we
are interested in here, the regression analysis is a simple correlation problem.
Hence, for all three groups of subjects, this is the statistical method we applied.
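In the single-predictor case, the regression reduces to the Pearson correlation: the standardized slope equals r, and r squared is the proportion of reading-test variance predictable from the TOEFL. A minimal sketch with placeholder data:

def simple_regression(x, y):
    """Least-squares line y = a + b*x plus the correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, my - slope * mx, r

# e.g., TOEFL totals predicting ND scores: r = .49 implies about
# 24% of the ND variance is predictable from the TOEFL (Group 1).
toefl = [410, 455, 500, 390, 540, 470]   # illustrative only
nd = [18, 24, 31, 15, 40, 28]            # illustrative only
b, a, r = simple_regression(toefl, nd)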

Results and Discussion

Summary test statistics for the ND and TOEFL (Group 1) are given in Table 13-1.
The correlation of .49 between the TOEFL and the ND is significant at p < .05.
Oddly, the TOEFL Reading subscore correlates with the ND only at .15 (p > .05).
Thus the overlap in variance between the ND and TOEFL Total score is only about
24% (somewhat less than we might have expected). Perhaps the explanation has to
do with the difficulty of the ND for the nonnative subjects, which tends to depress
the variance in their ND scores. They ranged from the 1st to the 27th percentile
when compared against the norms for English-speaking 12th graders on the ND.
Summary data for the second phase (Group 2) are given in Table 13-2. Again
the correlation between the TOEFL Total score and the reading measure (ISRT in
this case) is positive and significant (p < .01). The variance overlap is slightly less
(21%). As before, however, the total possible variance is restricted by the low
scores obtained by nonnatives. They ranged from the 1st to the 16th percentile
when compared against a population of 12th grade English-speaking natives
bound for four-year colleges. Again the correlation between subscores aimed
specifically at reading comprehension of the two test batteries is somewhat lower
than for the total scores (.23, p > .05). We will return below to the question of why
this anomalous result repeats itself.
Data from the third phase (Group 3) are reported in Table 13-3. Here,
Flahive (our collaborator; see his write-up, 1977) used only the subscore of the
TOEFL aimed at Reading, and the MHRC score. The correlation was surprisingly
high, .91, revealing a variance overlap of 83%. His students seem to have been
more advanced, ranking in the 10th percentile for English-speaking freshmen at
four-year colleges. Correspondingly, his students also achieved substantially
higher scores on the TOEFL than those in Groups 1 and 2.
As expected, the three studies taken together seem to suggest a moderate to
strong relationship between TOEFL scores and scores on standardized reading

Table 13-1 Summary Test Statistics—Nelson-Denny Reading Test, Form D,
and TOEFL (N = 23)

                                       Mean        SD       Range
TOEFL                                  426         40.60    344 - 520
ND                                     26 (172)    10.85    9 - 48
KR 20 Reliability (for the Standardized Reading test)  .87
Pearson product moment correlation
   ND and TOEFL Total                  .49*
   ND and TOEFL Reading                .15
Percentile rank with norm reference group
   Grade 12: 1st to 27th percentile

*p < .05.
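The KR 20 reliabilities reported in Tables 13-1 through 13-3 come from Kuder-Richardson
formula 20, which for a k-item test is

    r_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)

where p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and
\sigma_X^2 is the variance of the total scores.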

Table 13-2 Summary Test Statistics—Iowa Silent Reading Test, Level 2,
and TOEFL (N = 47)

                                       Mean        SD       Range
TOEFL                                  461         43.72    398 - 550
ISRT                                   35 (100)    10.48    9 - 65
KR 20 Reliability (for the Standardized Reading test)  .82
Pearson product moment correlation
   ISRT and TOEFL                      .46*
   ISRT Reading Comprehension and TOEFL Reading  .23
Percentile rank with norm reference group
   Grade 12 bound for 4-year colleges: 1st to 16th percentile

*p < .01.

Table 13-3 Summary Test Statistics—McGraw-Hill Reading Test and TOEFL,
Flahive’s Data (N = 40)

                                       Mean        SD       Range
TOEFL                                  489         43.70    366 - 578
M-H                                    12.5 (40)   4.70     4 - 24
KR 20 Reliability (for the Standardized Reading test)  .87
Spearman rank order correlation
   M-H with TOEFL Reading              .91*
Percentile rank with norm reference group
   Mean score: 10th percentile, four-year college freshmen

*p < .01.

tests intended for native speaking populations. We can conceive of two plausible
explanations for the fact that in two cases the subscores aimed at reading comprehension
from the TOEFL and the corresponding standardized test failed to correlate
significantly. The first explanation has to do with the fact that the shorter test
provides less possibility of variance in scores, hence the lower correlation for the
subtests than for the total scores. The second possibility is that the standardized
reading tests are actually measuring what is termed “power” for natives and
“speed” for nonnatives. However, positing separate constructs does not fit well
with the data from the third phase (Table 13-3), where it is clear that the standardized
reading measure and the TOEFL Reading subtest are measuring substantially
the same thing (nor would this second interpretation fit with many of the earlier
findings in this volume).
Finally, we pose two questions: What can be said about grade level equivalences
for nonnative college populations in comparison with their native English-speaking
competitors? and, How is it that so many foreign students are able to
succeed with such minimal reading ability in English? By all accounts it seems that
our advanced ESL students at CESL/SIU (and probably at centers like ours all
over the United States) are well below average college freshmen in reading ability.
The standardized tests used here could certainly be applied to get an idea of just
how far below our students are (this in spite of the fact that the tests are too difficult
for many of our students). How then do foreign students ever make it through
courses of study that most assuredly require a great deal of reading with comprehension
in English? We can offer two tentative reasons. First, we know that university
students will be taking courses that require reading; therefore, practice may
compensate for the lack of requisite skills at the beginning of university studies.
Second, we believe that although the ESL student’s surface English machinery
(grammar, in the more traditional sense of the term) may not be as well developed
as that of a native speaker, the deep cognitive machinery is probably as well
developed as that of English-speaking competitors. This deep conceptual ability
may help to compensate for the lack of surface skill in English.

Note

1. A version of this paper was presented at the NAFSA meeting in New Orleans, 1977, and also
appeared in B. J. Robinett (ed.) 1976-77 Papers in ESL: Selected Conference Papers of the
Association of Teachers of English as a Second Language. Washington, D.C.: NAFSA, 13-15.
Part IV Discussion Questions

1. Consider examination procedures used in any program with which you may
be familiar. Is multiple measurement with several subtests depended on?
How much time in person-hours is spent in preparing, administering, and
scoring the tests? Discuss the possibility of using a simpler testing procedure.
How could you determine whether there was any significant loss of information
if a simpler procedure were proposed?

2. Discuss the implications of findings in Chaps. 1 and 2 for the questions
addressed by Hinofotis in Chap. 10. Should we expect cloze procedure to be
a suitable alternative to more complex multiple subtests? Try to offer empirical
data rather than theory alone to support whatever position you prefer.

3. On the basis of what sort of theoretical reasoning are multiple tests proposed
as a basis for ESL placement and proficiency testing? What kinds of assumptions
are made concerning the nature of language proficiency? Compare the
CESL Placement battery (see the description of subparts in Chap. 2) with
the TOEFL (see Chap. 1 for a list of the subparts). Better yet, take two such
tests (or more) and compare them point by point to see how much agreement
exists between the experts who make up such tests.

4. Since all estimates of reliability relate to the statistical notion of variance
(differences between individuals as expressed in a particular mathematical
term, namely, the square of the standard deviation), if the variance is depressed,
what is the expected effect on reliability estimates? On the other
hand, other things being equal, if the variance across learners is increased,
what is the expected effect on reliability? Further, what is the expected effect
on the correlation of two measures if the variance in one of them is depressed
(say the test is too hard or too easy)? Now, consider the fact that the cloze test
used by Hinofotis and by Hinofotis and Snow was relatively difficult for the
tested population in both cases (see Tables 10-1 and 11-1). What result
might be expected if the test could be adjusted to make it somewhat easier?
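In connection with question 4, the attenuating effect of depressed variance is easy to
demonstrate by simulation; the sketch below uses invented numbers purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    ability = rng.normal(0, 1, 5000)              # latent proficiency
    test_a = ability + rng.normal(0, 0.5, 5000)   # two noisy measures of it
    test_b = ability + rng.normal(0, 0.5, 5000)

    r_full = np.corrcoef(test_a, test_b)[0, 1]

    # Mimic an overly difficult test by keeping only the bottom quarter of
    # test_a scores, which depresses the variance on that measure.
    low = test_a < np.quantile(test_a, 0.25)
    r_restricted = np.corrcoef(test_a[low], test_b[low])[0, 1]

    print(f"full-range r = {r_full:.2f}; restricted-range r = {r_restricted:.2f}")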

5. Reflect on your own personal experience with statements that you find
particularly agreeable or disagreeable. Is it sometimes hard to understand a
view that you disagree with? Easier to misunderstand a view that is radically
opposed to your own? Discuss these reflections in relation to the findings of
Doerr. What would all this imply for the selection of ESL or FL teaching
materials? Testing?


6. What difference in possible attitudinal effects would you expect in relation
to changes in the level of proficiency of the tested population? For instance,
suppose that the proficiency of the learners were excessively weak as contrasted
with mature native performance, what effect might this have on variance
in performance due to attitudinal factors?

7. Discuss the problem of determining the difficulty levels of texts. Can you
conceive of a way to counterbalance Doerr's design more effectively to
eliminate any possibility of contamination due to differences in difficulty
levels across pro and con texts?

8. Consider the possibility of testing nonnatives on achievement tests
(especially reading tests) that were intended for native speakers. What factors
might make the scores less comparable than one might hope? Does a
reading grade level of 4.5 actually mean that a particular foreign student can
only be expected to understand reading material suitable for fourth or fifth
graders? What factors enter into the foreign student’s reading ability that do
not exist for the average fourth grader who is a native speaker of English?

9. If the variance on standardized reading tests were depressed in the case of
foreign students, what effect might this have on the correlation of those tests
with an examination intended explicitly for nonnative speakers of English
(such as the TOEFL)?

10. What is the usual effect of lengthening a test (say, doubling the number of
items) on its total variance? Its reliability? If the total variance in scores on
the TOEFL is best explained by a single factor (i.e., global language proficiency,
as suggested in Chap. 1), then what would be the effect on test reliability
if the length of the test were cut to one-fifth of the total length of the
test? This is essentially what is done if the Reading subtest is used by itself
instead of the total score in at least one of the correlations discussed in
Tables 13-1, 13-2, and 13-3. What would be the predictable effect on correlations
with standardized reading tests?
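Question 10 can be answered quantitatively with the classical Spearman-Brown prophecy
formula, which predicts the reliability r_kk of a test lengthened or shortened by a factor
k from its present reliability r:

    r_{kk} = \frac{k\,r}{1 + (k - 1)\,r}

For example, cutting a test of reliability .90 to one-fifth of its length (k = 1/5) gives
(.2 x .90)/(1 - .8 x .90) = .18/.28, or about .64.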
Part V

Investigations of Writing Tasks

Are subjective ratings of essays as reliable as scores based on more objective
counting procedures? How do subjective evaluations and objective scoring
techniques correlate with other measures of ESL proficiency? How many
distinct sources of variance can be discerned in a variety of tasks aimed at
assessing writing skills? How much agreement is there across judges or
teams of judges concerning ratings assigned to essays? Is it possible to differentiate
aspects of writing skill (e.g., vocabulary, structure, and organization)?
What is the degree of correlation between scales aimed at supposedly
different aspects of writing proficiency? What about scoring systems that
relate to specific, discrete points of structure? What aspects of structure
provide maximum discrimination between previously established levels of
proficiency? Is it possible to develop an accurate notion of overall writing
ability on the basis of scores derived from occurrences of connectives (e.g.,
conjunctions), anaphoric referring terms (e.g., pronouns), and foregrounding
elements (e.g., articles)? The four chapters in this section deal with these
and other questions.
Chapter 14

Scoring and Rating Essay Tasks

Celeste M. Kaczmarek

There has been much debate over the utility of essays as tests of language
proficiency and even as tests of writing skill per se. This chapter reports a
study of two writing tasks and two scoring methods for each. The results
show that subjective methods of evaluating essays work about as well as
objective scoring techniques and are strongly correlated with other
measures of ESL proficiency which have independent claims to validity.
Furthermore, it is demonstrated that teacher judgments of a subjective sort
have substantial reliability and are strongly correlated with similar judg¬
ments by independent raters and with objective scores computed over the
same essays. It is argued, therefore, that essays may reasonably be used for a
wide variety of assessment purposes.

In the past many critics have opposed the use of composition as a measure of
language proficiency. They have reasoned that (1) students are apt to perform
differently on different occasions and when writing on different topics; (2) the
scoring of essays is highly subjective; and (3) students can easily avoid problems
and mask their weaknesses (see Harris, 1969, p. 69). Furthermore, scoring essays
has been considered to be too time-consuming in large-scale testing situations. It
has been argued that other methods might work just as well and be much more
economical to use.
On the other side, teachers have long felt that essays are reasonable sorts of
tasks to require of students who will have to do a great deal of writing in order to
complete just about any educational program. Some of them have argued that the
scores assigned by instructors on essays written by their own students should be
quite reliable and indeed valid. They argued that there would be considerable


agreement among raters about how well a given essay expresses the author’s
intended meaning.
To assess the reliability and validity of two sorts of essay writing tasks, the
following study was designed. Both subjective rating techniques and objective
scoring methods are examined.

Method
Subjects. One hundred and thirty-seven of the subjects who participated in the
Center for English as a Second Language testing project completed all the tests in
this study. They ranged in ability from beginners to fairly advanced students. (See
Chap. 2 for more details.)
Tests. Two essay tasks were used. The first was a rewriting or recall task.
Subjects viewed a paragraph on a screen at the front of the room via an overhead
projector. The text was displayed for exactly one minute. Then the projector was
turned off and the students were instructed to write down what they had read using
the original words of the text as much as possible, but being sure to convey the
intended meaning. Five minutes were allowed for the rewriting. This procedure
was repeated three times—once for each of three different texts. The texts were
selected from reading materials that were believed to range from easy to difficult
for the test population. (See entry 20 in the Appendix at the end of this volume for
texts used in the Recall tasks.)
The second type of writing task was a more traditional essay. Subjects were
allowed twenty minutes to write about an accident according to the following
instructions: “Imagine you are a witness to an auto accident. Tell who was
involved; tell who was at fault. Describe what happened, where you were, when it
happened, how long before the police arrived, whether or not there were injuries,
and so on.”
Three objective tests aimed at writing skill were also included. (These are
given in their entirety as entry 19 in the Appendix at the end of this volume.) They
consisted of multiple-choice items of the following types:

(1) Choosing an appropriate word, phrase, or clause to form a continuation at
various decision points in a text (Select):

    A farmer’s daughter had been out to milk the cows and was returning home,
    carrying her pail of milk on her head. As she walked along she (1) ____
        (A) started thinking:
        (B) had to
        (C) prepared
        (D) began to be
    “The milk in this pail will provide me with cream, . . .”

(2) Error recognition (Edit):

    Most people have misconceptions (1) for       (A) NO CHANGE
                                                  (B) with
                                                  (C) toward
                                                  (D) through
                                                  (E) about
    gifted children. It seems that no one knows for sure
    (2) which                                     (A) NO CHANGE
                                                  (B) that
                                                  (C) because
                                                  (D) whether
                                                  (E) unless
    these qualities reflect heredity or the environment.

(3) Putting words, phrases, and clauses in an appropriate order (Order):

    A small child has (1) (2) (3) (4) of the future.
        (A) very
        (B) idea
        (C) a
        (D) limited
    The year between one Christmas and the next Christmas (5) (6) (7) (8).
        (A) seems
        (B) eternity
        (C) an
        (D) like

Scoring Procedure. The essay recall task was scored in two ways. First, the
instructors who taught the writing classes and who were most familiar with the
subject’s writing skills were asked to rate each protocol on the following six-point
scale:

Scale 1

1. Nil proficiency—virtually no proficiency.
2. Elementary proficiency—no academic program. Writes simple statements
and questions using vocabulary taught him; makes frequent errors in spelling
and structure, often obscuring meaning.
3. Intermediate proficiency—writes statements and questions on familiar topics
with fair control of basic, but not complex patterns with frequent obscurity of
meaning; has limited ability to organize a narrative or descriptive paragraph.
4. Minimal academic proficiency—limited academic program. Has most sentence
structure under fair control within familiar and academic areas with
occasional obscurity of meaning; under time or test pressures, control weakens;
little understanding of paragraph organization of expository or argumentative
essay.
5. Partial academic proficiency—limited academic program. Writes with some
ease but with occasional errors and misuse of idiom; under time or test pressures
control may weaken. Shows very little understanding of organization of
expository/argumentative essay but has sufficient background for rapid development
of control and self-correction.
6. Full academic proficiency—if you feel that the student is proficient in writing,
put a P on his paper.

Second, five raters were instructed in a method of scoring each of the three
paragraphs objectively. All sequences that were unintelligible or that contained
information not actually stated or clearly implied in the original text were disregarded.
Then a count was made of the error-free words in the correct meaningful
sequences that remained. Thus the total possible score was roughly equal to the
number of words in the original text.
The essay task was scored in three ways. First, each instructor assigned a
point score to each protocol according to the six-point scale defined above.
Second, a separate team of five raters was instructed to score the essay objectively.
They were instructed to read the entire essay trying to understand the
intended meaning, then to rewrite it so that it expressed the apparently intended
meaning clearly and correctly; a score was then computed by counting the error-free
words in the subject’s original rendition minus the number of errors in the
same rendition. Errors included: (1) words that subjects wrote which were not
necessary, and which were therefore deleted by the rater, (2) words that the
subject did not use but which were necessary to convey the apparently intended
meaning, and which had to be added by the rater, and (3) nonidiomatic
sequences which had to be rewritten by the rater. A count of whichever
sequence contained the greater number of words (that is, the rater’s sequence or
the subject’s original wording) determined the number of errors. Phonetically
correct misspellings were not counted as errors (see Chap. 6). An example illustrating
the scoring method follows:

(The subject’s text is given as written; the rater’s corrections are shown in brackets
after the words or sequences they replace.)

The last day I had a colision with another [other] car. The other person involved
was a young boy. The boy was in [at] fault. I was stoped in an [at a] traffic light,
when the other car hited [hit] my own car in the rear part, it happened the last
week. The police arived inmidiately, because there was a high-way patrol stopped
in [on] the right side of the street. Unfortunately, did not there were [Fortunately,
there were not any] injuries, both cars had full Insurance, and [so] the father of
the boy gave me other [another], while mine is fixing in cars [was being fixed at
the service station].

(76 - 21)/76 = .72

where 76 is the number of error-free words and 21 is the number of
errors.
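The arithmetic of the objective score is simple enough to state as a function. The sketch
below, in Python, performs only the final computation; the word and error counts themselves
come from the rater’s rewriting procedure described above.

    def objective_essay_score(error_free_words: int, errors: int) -> float:
        # (error-free words - errors) / error-free words, as in the example.
        return (error_free_words - errors) / error_free_words

    # The worked example above: 76 error-free words and 21 errors.
    print(round(objective_essay_score(76, 21), 2))  # 0.72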

The third scoring method for the essays involved a subjective rating of
content and organization employing a somewhat different scale than the one used
for the teacher’s ratings. The following six-point scale was used by the same team
of five raters who did the objective scoring:

Scale 2

1. Incomprehensible or no attempt
2. Most points are hard to understand and/or there are serious contradictions or
inconsistencies.
3. The meaning is sometimes hard to follow and often awkwardly expressed.
There also may be occasional inconsistencies.
4. Organization and transitions are occasionally difficult to follow, but there are
no glaring inconsistencies or incomprehensible sequences.
5. Meaning is easy to follow throughout and there are no obvious weak transi¬
tions or awkward sequences.
6. Well-written native-like composition with practically no errors.

The question was whether a subjective rating would yield as much meaningful
variance as the objective scoring and whether either method would produce
new information not available in the other.

Results and Discussion


The intercorrelations between scoring methods for the various essay tasks and the
three objective tasks broken down according to subparts are given in Table 14-1.
Correlations range from .21 to .78. Since these correlations are based on very
short subtests—12 items each for the multiple-choice tests and only a few lines of
text for each of the recall tasks—they can be interpreted as indicating substantial
reliability and validity. The correlation of .78 between the teacher’s rating and the
objective score on recall text A suggests that perhaps this text was best aimed in
terms of difficulty level for the test population. This is also apparent in an examination
of the means and standard deviations of the respective tests shown in Table
14-2. The A passage was in fact the easiest of the recall tests, but contrary to
expectations, the B text turned out to be harder than the C text.
In addition, the teacher’s scoring of the essay correlated at .65 with the raters’
objective score and .67 with their evaluation on Scale 2. Once again, this suggests
that the teacher’s subjective score may actually be as consistent as a similar

Table 14-1 Correlations of Essay Tasks and Multiple-Choice Cloze Tasks
(N = 137)

[Correlation matrix among the 19 scores: the teacher’s ratings and the trained
raters’ objective scores on recall passages A, B, and C; the essay evaluations on
Scale 1 and Scale 2 and the objective essay score; and the Select, Edit, and Order
multiple-choice tasks on passages A, B, and C, together with their total
(Total = Select + Edit + Order). As noted in the text, the correlations range
from .21 to .78.]

Table 14-2 Means and Standard Deviations of Essay Tasks and
Multiple-Choice Cloze Tasks

Tests                                  Mean      SD
Recall—Teacher’s Ratings
Passage A 2.00 1.41
Passage B 1.83 0.99
Passage C 1.75 1.02
Recall—Trained Raters
Passage A 13.32 13.58
Passage B 7.88 10.04
Passage C 10.45 11.91
Essay—Teacher’s Ratings
Evaluation on Scale 1 2.67 1.17
Essay—Trained Raters
Evaluation on Scale 2 2.57 1.13
Objective Score of Essay 45.50 29.67
Multiple-Choice Cloze
Select
Passage A 2.81 1.41
Passage B 2.80 1.48
Passage C 2.45 1.75
Edit
Passage A 1.48 1.28
Passage B 1.75 1.29
Passage C 1.31 1.21
Order
Passage A 12.26 4.75
Passage B 7.54 4.58
Passage C 7.51 6.18

evaluation by a small number of trained raters. It is interesting to note that the
trained judges’ rating on Scale 2 and their objective essay score correlated
at .81, which may attest to the reliability and consistency of the raters in
supplying those two scores.
The multiple-choice modified cloze test correlated with the raters’ essay
scores at .66 for the objective essay score and at .65 on the Scale 2 ratings. This
fact seems to suggest that both tasks may be at least weakly tapping the same
source of language ability.
Table 14-3 shows the first principal component from an unrotated factor
solution which accounts for 49% of the total variance in all the tests used in this
study. This factor explains a surprising amount of the total variance in view of the
excessive difficulty and consequent lowered reliability of some of the tests.
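The 49% figure follows directly from the eigenvalue at the bottom of Table 14-3: with
standardized variables, the proportion of total variance explained by a component is its
eigenvalue divided by the number of variables, here the 18 tests listed in the table:

    \frac{\lambda_1}{p} = \frac{8.84}{18} \approx .49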
The varimax rotated solution in Table 14-4 does not seem to present a clearer
patterning of the data than the solution given in Table 14-3. For instance, the
Select tasks scatter over all three factors in Table 14-4, but the highest loadings in
Table 14-4 for those variables are not as high as their loadings on the first principal

Table 14-3 Principal Components Solution for the Essay Tasks and
Multiple-Choice Tasks* (N = 137)

                                    Loadings on
Tests                               g Factor†      h²

Recall—Teacher’s Ratings
Passage A .78 .61
Passage B .69 .48
Passage C .78 .61
Recall—Trained Raters
Passage A .74 .55
Passage B .66 .44
Passage C .62 .38
Essay—Teacher’s Ratings
Evaluation on Scale 1 .80 .64
Essay—Trained Raters
Evaluation on Scale 2 .77 .59
Objective Score of Essay .80 .64
Multiple-Choice Tasks
Select
Passage A .63 .40
Passage B .74 .55
Passage C .71 .50
Edit
Passage A .67 .45
Passage B .62 .38
Passage C .45 .20
Order
Passage A .75 .56
Passage B .69 .48
Passage C .62 .38

Eigenvalue 8.84

*The data in this table and Table 14-4 are also discussed
by Oiler (1979, Appendix).
†Accounts for 49% of the variance in the total factor
matrix.

component displayed in Table 14-3. In fact, of the 19 variables included in the
factor analyses only five loaded higher on one of the factors in Table 14-4 than on
the principal component in Table 14-3. Hence the principal component solution
seems to give a clearer patterning of the data.
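For readers who wish to reproduce the unrotated step, the following rough sketch shows the
principal-component computation; the score matrix is randomly generated with a single shared
factor and merely stands in for the actual 137-by-18 data, which are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(1)
    shared = rng.normal(size=(137, 1))                  # one common factor
    scores = shared + 0.8 * rng.normal(size=(137, 18))  # 18 hypothetical tests

    R = np.corrcoef(scores, rowvar=False)               # 18 x 18 correlations
    eigvals, eigvecs = np.linalg.eigh(R)
    i = np.argmax(eigvals)                              # first principal component
    loadings = eigvecs[:, i] * np.sqrt(eigvals[i])      # loadings (sign arbitrary)

    print(f"first eigenvalue = {eigvals[i]:.2f}, "
          f"share of total variance = {eigvals[i] / 18:.0%}")
    print("communalities h^2 =", np.round(loadings ** 2, 2))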
We are tempted to conclude, therefore, that the factor represented in Table
14-3 is an index of writing skill. However, the strength of the relationship between
the writing and objective tasks in this study with a wide variety of other tasks (see
Chap. 2) precludes this possibility. The factor in question appears rather to be a

Table 14-4 Varimax Rotated Solution for the Essay Tasks and
Multiple-Choice Tasks (N = 137)

Tests                          Factor 1   Factor 2   Factor 3*   h²

Recall—Teacher’s Ratings
Passage A .57 .65 .05 .65
Passage B .74 .40 -.01 .71
Passage C .81 .30 .21 .79
Recall—Trained Raters
Passage A .40 .69 .10 .65
Passage B .64 .25 .24 .53
Passage C .71 .09 .30 .59
Essay—Teacher’s Ratings
Evaluation on Scale 1 .45 .58 .33 .68
Essay—Trained Raters
Evaluation on Scale 2 .34 .70 .29 .68
Objective Score of Essay .24 .77 .25 .71
Multiple-Choice Tasks
Select
Passage A .49 .41 .15 .43
Passage B .18 .70 .37 .66
Passage C .20 .43 .65 .64

Edit
Passage A .45 .25 .52 .53
Passage B .49 .12 .53 .30
Passage C .11 .08 .71 .51
Order
Passage A .18 .68 .40 .65
Passage B .28 .40 .55 .54
Passage C .06 .46 .59 .56

Eigenvalue 10.81

*Factor 1, Factor 2, and Factor 3 account for 19.5%, 24.4%, and 16% of the
total variance, respectively.

robust measure of global language proficiency. Further, Stump (1978) shows this
factor is apparently indistinguishable from factors of intelligence and school
achievement in native speakers.
Chapter 15

Evaluating Writing Proficiency in ESL1

Karen A. Mullen

An investigation was made to determine the equivalence of composition
ratings in assessing second-language writing proficiency. One hundred and
seventeen nonnative speakers of English were asked to write a composition
under controlled conditions. The subjects were arbitrarily divided into eight
groups. Five experienced ESL teachers were paired to form eight sets of
raters. Each group of compositions was read by one pair of evaluators and
rated on five scales of writing proficiency. A two-factor analysis of variance
was performed to test for a significant difference between groups and a
single-factor analysis of variance was performed to test for a significant
difference between raters within a pair. Unbiased reliability coefficients
were derived from the partitioned variance. A stepwise regression analysis
was performed to determine the relative weight given to each scale in assessing
overall proficiency and to determine if all four scales were necessary to
best predict the overall scores. The results of this investigation show no significant
difference among the ratings assigned to the eight groups of subjects
and a significant difference between raters within at least one pair. The
evidence suggests that some rater pairs are able to reach equivalence in their
judgments, other pairs can produce parallel and reliable judgments, and
some pairs produce nonequivalent, nonparallel judgments. The results of
this study also show that although scale ratings correlate highly, all four
scales will do better in predicting the overall score than any three, any two,
or any single scale. Furthermore, the results indicate that the ratings on the
vocabulary usage scale play the heaviest role in determining the overall
quality of the composition and that ratings on compositional organization
play the least.


In the field of second language testing, one of the skills commonly tested is
that of writing ability, either objectively in the context of multiple-choice
questions or productively in the framework of a writing task. Objective writing
tests, most notably of the type included in the TOEFL (see Chap. 1 for description),
have been constructed to meet the requirements of reliability and validity.
Yet one of the major criticisms is that they do not allow one to see how a second
language learner organizes his thoughts on paper or applies his known vocabulary,
nor do they indicate how well a learner uses in extended, unified prose the formal
grammatical rules he has been taught chapter by chapter in his language texts. On
the other hand, productive writing tests have been criticized for their failure to
produce reliable measurements of writing skill. This unreliability has been
attributed to two sources: the topics on which the learner writes and the judges
who evaluate the result. These criticisms have been leveled against such tests
given to native speakers of English. However, for nonnative speakers, the case may
be different, particularly if the criteria to be measured are those of sentence
structure, vocabulary usage, fluency of writing, and coherence of ideas. If the
purpose of having a learner of English write is to test his ability to put sentences
together while appropriately selecting from his store of vocabulary and to apply
within a set period of time the grammatical rules he knows, then the question of
whether the judges of such writing can assess this and produce parallel assessments
is an important one.
The purpose of this chapter is to report the results of a study designed to
determine if experienced ESL teachers, working in pairs, can come to a mutual
agreement concerning the writing proficiency of nonnative speakers of English
and to determine the reliability of such judgments. In addition, the question of
whether different sets of judges rate differently is posed. Finally, the role each
scale plays in the evaluation of overall writing proficiency is examined. Specifically,
the hypotheses are as follows:

1. There is no significant difference between the ratings assigned by judge 1 and
judge 2 to a group of subjects in each of the several scales of writing proficiency.
2. There is no significant difference between ratings assigned by one pair of
judges to one group of speakers and by another pair of judges to another group
of speakers.
3. The information provided by the four scales together can predict the overall
writing proficiency score significantly better than any three, any two, or any
single scale.

Method
To test hypothesis 1, a single-factor experimental design having repeated
measures was chosen.* The F-statistic based upon the mean sum of squares

*Editors’ note: See explanatory notes in Chap. 8.



between judges divided by the mean sum of squares of the residual variance was
computed to test the hypothesis of no significant difference between judges.
Unbiased reliability coefficients were calculated based upon the number of
subjects in the sample, the number of judges within the group, the mean square
between subjects, and the mean square within subjects (Winer, 1971, p. 287). To
test hypothesis 2, a two-factor experimental design having repeated measures on
one factor was chosen. The F-statistic based upon the mean square between
groups divided by the mean square of subjects within groups was computed to test
the hypothesis of no significant difference among groups. To test hypothesis 3, the
F-statistic based upon the increment in the sum of squares due to the addition of a
scale variable divided by the residual variance was calculated from a stepwise
regression analysis.
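Written out, the test statistics just described are ratios of mean squares; the last
expression below is the common intraclass form for the reliability of the average of a
pair of judges’ ratings, given here as an approximation to the unbiased coefficient
computed from Winer (1971):

    F_{judges} = \frac{MS_{between\ judges}}{MS_{residual}}, \qquad
    F_{groups} = \frac{MS_{between\ groups}}{MS_{subjects\ within\ groups}}

    r_k \approx \frac{MS_{between\ subjects} - MS_{within\ subjects}}{MS_{between\ subjects}}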
Procedure. The judges were required to rate the subjects on five scales of
writing proficiency: Control over English Structure, Organization of Material,
Appropriateness of Vocabulary, Quantity of Writing, and Overall Writing Proficiency.
The scales were labeled vertically on a rating form. Each scale was presented
in the form of a double horizontal line equally divided into five contiguous
compartments labeled from left to right: poor, fair, good, above average, and
excellent (see Appendix 15A). The judges were instructed to put an X in the box
best characterizing the learner’s proficiency with regard to each of the five scales
or to put an X on the line between boxes if the evidence warranted. A set of
guidelines for deciding what level of proficiency to assign was explained to the
judges before they read the composition, and it was at hand for reference during
the evaluation (see Appendix 15B). For analysis, judgments were later converted
to a numerical value of 1 = poor, 2 = between poor and fair, 3 = fair, 4 = between
fair and good, 5 = good, 6 = between good and above average, 7 = above
average, 8 = between above average and excellent, and 9 = excellent.
The subjects were given a composition booklet with instructions inside
directing them to choose a topic from one of four choices, to plan their ideas for ten
or fifteen minutes, to develop their ideas using details and examples, and to
consider that their writing would be evaluated for grammar, vocabulary, paragraph
organization, logical development, and quantity of writing. Subjects were allowed
an hour for the task.
Judges. Five judges participated in this study. They were randomly paired to
form eight groups. All judges were graduate students in linguistics. They had completed
courses in phonetics, syntactic and phonological analysis, and TESL
methodology. They had all taught ESL for at least a year. All had been instructed
on how to use the rating form and the guidelines, and they had participated in such
composition evaluation before. None of the judges had had a previous acquaintance
with the subjects whose compositions they read.

Subjects. The 117 subjects in this study had been referred to the University of
Iowa Linguistics Department for a proficiency evaluation by either the foreign
admissions officer, the foreign student adviser, or the student’s academic adviser.

Most of the subjects were new to the university. Most had been referred because
their TOEFL scores were below 550. Some appeared for an evaluation because in
the course of their few days on campus, the foreign student advisers had noted a
lack of facility in English, although the TOEFL scores were not below 550. The
purpose of the evaluation was to determine whether additional instruction in
English and a reduced academic program might be recommended for the student.

Results and Discussion


Table 15-1 shows the results of a two-factor analysis of variance having repeated
measures on one factor. A separate analysis of variance and two F-statistics are

Table 15-1 An Analysis of Variance of Performance Scores of Eight Groups of
Subjects Tested under Different Pairs of Judges on Five Scales of Writing
Proficiency

Source of                               Structure                Organization
variance                    df     SS       MS      F       SS       MS      F
Between groups               7    59.46     8.49    1.74    70.38    10.05   1.89
Between subjects
  within groups (pooled)   109   533.48     4.89            579.24    5.31
Between judges
  within groups (pooled)     8     7.20      .90    1.22    32.64     4.08   4.45†
Residual (pooled)          109    79.80      .73            99.86      .92

Source of                               Quantity                 Vocabulary
variance                    df     SS       MS      F       SS       MS      F
Between groups               7   106.35    15.19    2.38*   30.22     4.32    .83
Between subjects
  within groups (pooled)   109   697.11     6.40            566.39    5.20
Between judges
  within groups (pooled)     8    30.25     3.78    5.29†   17.58     2.20   3.03†
Residual (pooled)          109    78.75      .72            78.92      .72

Source of                               Overall
variance                    df     SS       MS      F
Between groups               7    47.67     6.81    1.35
Between subjects
  within groups (pooled)   109   548.39     5.03
Between judges
  within groups (pooled)     8    14.22     1.78    3.29†
Residual (pooled)          109    58.78      .54

*Significant at p < .05.  †Significant at p < .01.



Table 15-2 Analyses of Variance on Performance Scores (One for Each of
Eight Pairs of Subjects Tested by Different Pairs of Judges) on Five Scales
of Writing Proficiency

[Eight single-factor repeated-measures analyses of variance, one per group,
each reporting sums of squares, mean squares, and F-statistics for the Between
judges and Residual sources on the Structure, Organization, Quantity,
Vocabulary, and Overall scales. The F-statistics for judge differences are
carried over into Table 15-3.]

*Significant at p < .05.  †Significant at p < .01.

reported for each scale. The F-statistic for a difference between groups is not
significant at the .05 level for four of the scales—Structure, Organization,
Vocabulary, and Overall Proficiency. It is not significant at the .01 level for any of
the five scales. The F-statistic for a difference between judges within at least one
pair is significant at the .01 level for every scale except Structure.
In order to ascertain which pairs of judges might be responsible for producing
significant differences, we must consult Table 15-2, showing the results of a
single-factor analysis of variance having repeated measures for each of the five
scales and for each of the eight groups of subjects. The F-statistic for a difference
between judges’ Structure scores is not significant at either the .05 or the .01 level.
It is significant at the .05 level for a difference between judges’ Organization
scores for four pairs of judges (groups 2, 4, 7, and 8), between judges’ Quantity
scores for four pairs (groups 1, 2, 3, and 8), between Vocabulary scores for two
pairs (groups 2 and 3), and for Overall scores for two pairs (groups 1 and 2). For six
out of the eight groups, there is a significant difference in ratings on at least one
scale. At the .01 level, there is a significant difference between judges’ Organization
scores for group 2, between Vocabulary scores for groups 2 and 3, and for
Overall scores for group 2. The F-statistic shows no significant difference between
judges’ Quantity scores at this level. Additionally, there is no significant difference
between judges across rating scales for six out of the eight pairs. Pair 3 shows
a significant difference on one scale (Vocabulary). Pair 2 is the most deviant,
showing a significant difference at the .01 level for three out of the five scales. This
pair performed similarly on an identically designed experiment involving scales of
speaking proficiency on a randomly selected group of subjects from the same pool.
The reliability coefficient is a measure of the degree to which an average of
the raters’ scores on a given scale is a good estimate of the subjects’ true scores and
as such indicates the percentage of the obtained variance in the distribution of
scores which may be regarded as variance not attributable to errors of measurement.
Squaring the reliability coefficient will indicate the accuracy of the prediction
when the score assigned by one judge in a pair is used to predict the score given by
the other judge. Table 15-3 shows the unbiased reliability coefficients for each
pair of judges for all scales of writing proficiency. The coefficients range from a low
of -.34 to a high of .99. Table 15-3 also shows the relationship between the F-statistic
for a difference between judges and the reliability coefficients. Some
cases show a significant difference and a high reliability (Vocabulary scores in
group 3, for example), indicating that the judges are not interpreting the scale in
the same way. The scores they assigned are not equivalent but are closely parallel.
There are also cases in which there is no significant difference between judges’
scores but a rather low reliability (Structure scores for group 5, for example). This
indicates that the judges, though showing no overall difference in assigning scores,
do not interpret the scale in any way consistent with one another across all subjects
in the sample. Where there is no significant difference in judges’ scores and a high
reliability (group 6 for all scales), we may infer that both judges are interpreting the
scales similarly and reach agreement in their evaluation of individual subjects. In

Table 15-3 A Comparison of the F-Statistics from Analyses of Variance (from
Table 15-2) and Reliability Coefficients of Judge Pairs

            Structure       Organization     Quantity        Vocabulary      Overall
Group  n    F       r       F        r       F       r       F        r      F        r
1      23   3.61    .92     4.08     .83     7.87*   .70     .26      .90    5.35*    .87
2      12   .45     .53     22.18†   -.34    9.16*   .45     11.47†   -.32   16.50†   .12
3      19   .32     .87     .03      .87     6.71*   .89     8.61†    .77    2.94     .90
4      20   1.09    .84     5.78*    .75     1.00    .95     .46      .79    .81      .92
5      11   2.22    .13     .55      .26     2.57    .77     .06      .65    .45      .50
6      10   .08     .92     .06      .91     .57     .94     .31      .98    1.00     .99
7      12   1.35    .75     3.87*    .70     2.37    .84     .79      .71    2.38     .65
8      10   .20     .51     5.65*    .10     9.00*   .55     1.00     .64    .10      .73

*Significant at p < .05.  †Significant at p < .01.

Table 15-4 Correlations of Ratings on the Five Scales of Writing Proficiency

Scale                         1     2     3     4     5
1 Structure .79 .67 .84 .89
2 Organization .82 .82 .90
3 Quantity .72 .83
4 Vocabulary .92
5 Overall Proficiency

cases where there is a significant difference between judges and a low reliability
(as for pair 2 on all scales), one may conclude that the judges disagree and that the
error of measurement on the scales for these raters is sufficient to reduce the
extent to which an average of the two scores could be used as an estimate of the
subject’s true score. Reliability coefficients for four of the eight pairs of judges are
within the acceptable limits of .70 or above on all scales. A fifth pair is within
acceptable limits on four out of five scales. Three sets of judges did not produce
acceptable reliability coefficients on the five scales. The quantity scale appears to
provide the most uniform reliability coefficients across all groups of subjects and
raters.
As shown in Table 15-4, the correlation between scale ratings is high.
However, a stepwise multiple regression analysis reveals that as the variables are
added one by one to the equation for predicting the Overall Proficiency rating,
each addition to the equation significantly reduces the amount of unpredicted
variance. The variable accounting for the most variance is added first and subsequent
variables are then added one by one depending on which accounts for the
most remaining variance. Table 15-5 shows these results. The beta coefficients in
the equation indicate that if the Vocabulary rating were increased by one unit and
the other ratings remained constant, the expected change in the overall

Table 15-5 A Stepwise Regression Analysis with the Overall Proficiency Scale
as the Dependent Variable

Source           df     SS        MS        F          R²     Beta   SE (B)   F (for B)
Regression        1     564.874   564.874   1257.85*   .844
  Vocabulary      1     564.874   564.874                     .92    .025     1257.85*
Residual        232     104.185      .449

Regression        2     606.665   303.332   1123.02*   .906
  Vocabulary      1     564.874   564.874                     .66    .028     521.74*
  Quantity        1      41.791    41.791    154.72*          .36    .024     154.72*
Residual        231      62.393      .270

Regression        3     626.911   208.970   1140.33*   .937
  Vocabulary      1     564.874   564.874                     .42    .032     164.65*
  Quantity        1      41.791    41.791                     .31    .020     168.53*
  Structure       1      20.246    20.246    110.47*          .32    .030     110.47*
Residual        230      42.148      .183

Regression        4     631.583   157.895    964.82*   .943
  Vocabulary      1     564.874   564.874                     .36    .032     126.14*
  Quantity        1      41.791    41.791                     .23    .023     70.13*
  Structure       1      20.246    20.246                     .27    .030     84.59*
  Organization    1       4.672     4.672     28.54*          .18    .032     28.54*
Residual        229      37.476      .163

*Significant at p < .01.

proficiency rating would be .36. If the Quantity rating were increased by one (and
the other ratings remained constant), the change in the overall rating would be .23.
Parallel conclusions may be reached for the influence of an increase in the Structure
or Organization rating on the Overall score. A unitary increase in the Vocabulary
rating would produce the most change (by about a third) and such an increase
in the Organization rating would produce the least (by about a fifth). Unitary
increases in the Quantity or Structure scores would change the Overall rating by
about a fourth. If all four ratings were increased by one unit, the change in the
Overall Proficiency rating would be very slightly more than one unit as well. This
indicates that although the scales are unequally weighted, they function together
in such a way that they produce a unitary change in the Overall score. In addition,
the R-square coefficient indicates that the four variables account for .943 of the
variance in the Overall scores and that the relationship between the four scales and
the Overall scale is very nearly linear.
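A compact sketch of the stepwise procedure behind Table 15-5 follows; the data are
simulated (four scale ratings and an Overall score for 234 ratings, matching the table’s
degrees of freedom), so the printed values will not reproduce the ones above.

    import numpy as np

    def rss(X, y):
        # Residual sum of squares from an ordinary least-squares fit with intercept.
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ beta) ** 2))

    rng = np.random.default_rng(2)
    names = ["Structure", "Organization", "Quantity", "Vocabulary"]
    X = rng.normal(size=(234, 4))                 # hypothetical scale ratings
    y = X @ np.array([.27, .18, .23, .36]) + rng.normal(0, .4, size=234)

    total_ss = float(np.sum((y - y.mean()) ** 2))
    chosen = []
    while len(chosen) < 4:
        # At each step, add the scale that most reduces the residual SS.
        best = min((i for i in range(4) if i not in chosen),
                   key=lambda i: rss(X[:, chosen + [i]], y))
        chosen.append(best)
        r2 = 1 - rss(X[:, chosen], y) / total_ss
        print(f"step {len(chosen)}: add {names[best]:<12} R^2 = {r2:.3f}")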
Conclusions. It is apparent from the results of this study that some pairs of
judges achieve fairly high reliability and show no significant difference in scoring
on all scales. It is also apparent that some pairs of judges are highly reliable in their
rating, although one may be calibrating judgments consistently higher than the
other. It is also clear that some pairs of judges cannot produce reliable judgments.

Moreover, each of the scales plays a role in the determination of the Overall score.
In relation to the hypotheses stated at the beginning, we may conclude that

1. There is evidence to reject the claim that there is no significant difference
between the ratings assigned by judges within a pair.
2. There is no evidence to reject the claim that there is no significant difference
in ratings assigned by the eight pairs of judges.
3. There is no evidence to reject the claim that all four scales together can predict
the Overall writing proficiency score better than any three or any two or any
single scale.

Note

1. This is an expanded version of a paper presented at the AILA/TESOL Convention in Miami, Fla.,
on Apr. 27, 1977. Another version of this paper appeared in H. Douglas Brown, Carlos A. Yorio, and
Ruth H. Crymes (eds.) On TESOL '77. Teaching and Learning English as a Second Language: Trends
in Research and Practice. Washington, D.C.: TESOL, 309-320. It is reprinted here by permission.

Appendix 15A
Composition Evaluation

Name_ Date_

Evaluator_

Above
Poor Fair Good Average Excellent

Control over English Structure

Compositional Organization

Quantity of Writing

Appropriateness of Vocabulary

Overall Writing Proficiency



Appendix 15B
Guidelines for Evaluation of Compositions

Control over English Structure

Excellent: Few, if any, noticeable errors of grammar or word order. Frequent
use of complex sentences.
Very good: Occasional grammatical and/or word-order errors. Some use of
complex sentences.
Good: Frequent grammar and word-order errors. General use of simple
sentences.
Fair: Many errors in grammar make comprehension difficult. Use of short
basic sentences.
Poor: Severe errors in grammar and word order. No apparent knowledge of
English.

Compositional Organization
Excellent: Well-developed introduction which engages concern of the reader.
Use of internal divisions and transitions. Substantial paragraphs to
develop ideas. Conclusion suggests larger significance of central
idea.
Very good: Obvious inclusion of an introduction, though not smoothly developed.
Division of central idea into smaller parts, though paragraphs
are lean on detail. Conclusion restates the central idea.
Good: Intent to develop central idea is evidenced, but only a few points are
mentioned. The introduction or conclusion is very simply stated or
may be missing. Occasional wandering from topic.
Fair: Limited organization. Thoughts are written down as they come to
mind. No introduction or conclusion.
Poor: No organization. No focus. No development. No major consideration
of topic.

Quantity of Writing
Excellent: Writing is an easy task. Quantity seems to be no problem.
Very good: Reasonable quantity for the hour. Writing flows without much hesitation.
Good: Enough writing to develop the topic somewhat. Evidence of having
stopped writing at times.
Fair: Much time spent struggling with the task of putting down thoughts on
paper.
Poor: Very little writing during the hour-long assignment.

Appropriateness of Vocabulary
Excellent: Precise and accurate word choice. Obvious knowledge of idioms.
Aware of word connotations. No translation from native language
apparent. May have attempted a metaphoric use of words.
Very good: Occasional misuse of idioms, but little difficulty in choosing appropriate
forms of words. Uses synonyms to avoid repetition. Some
vocabulary problems may be due to translations.
Good: Use of the most frequently occurring words in English. Does not
use synonyms to avoid repetition. Some inappropriate word choices.
Uses circumlocutions or rephrasing when the right word is not available.
Fair: Depends upon a very small vocabulary to convey thoughts. Repetition
of words is frequent. Appears to be translating. Great difficulty
in choosing appropriate word forms.
Poor: Vocabulary is extremely limited.
Chapter 16

Measures of Syntactic Complexity in Evaluating ESL Compositions

Douglas E. Flahive and Becky Gerlach Snow

This chapter reports on the effectiveness of four objective measures of syntactic
maturity/complexity in evaluating compositions written by ESL
students. The four measures used were: T-unit length and clause/T-unit
ratio, originally developed by Hunt (1965) and later modified as a result of
research by O’Donnell, Griffin, and Norris (1967); an Index of Complexity,
loosely based on the work of Endicott (1973); and an Errors-per-T-unit
ratio. Three hundred compositions written by ESL students at six levels of
proficiency from beginning to advanced were analyzed. Discriminant
analysis using direct and stepwise procedures was used to determine how
accurately objective measures could discriminate among different levels
and to determine the relative power of each objective measure. Pearson
product moment correlations were computed between objective measures
and holistic evaluations.

Until recently the ESL skill areas of reading and writing have suffered from
neglect by researchers and teachers alike. The historical reasons behind this past
neglect have been discussed elsewhere by others and will not be reviewed here
(Saville-Troike, 1973; Wilson, 1973). Now that the need has been recognized,
researchers are faced with the task of formulating and testing hypotheses whose
acceptance or rejection they hope will lead to an increased understanding of
second language (L2) reading and writing processes and ultimately to better
methods of teaching.
Conducting empirical research in these two areas is not without problems.
Despite the wide-ranging theorizing and research into first language reading and
writing processes, little is agreed upon as certain. There is as yet no widely
accepted model of the reading process. Empirical studies of writing present


perhaps even more uncertainty, since writing is both a skill and an art. However,
because of the vast amount of research done in these areas with native speakers,
the L2 researcher would probably be missing a good bet not to formulate
hypotheses based on the successful studies of LI researchers. Because of the
relatively crude state of the L2 research art, the hypotheses formulated will be
general in nature.

Method
Objective Measures. The study reported below concerning the use of objective
measures of syntactic complexity in the evaluation of ESL compositions is based,
to a large extent, on the work of Hunt (1965). Hunt, in looking for an objective
measure of syntactic maturity, formulated the T-unit. The T-unit, as defined by
Hunt, is a “minimal terminable unit . . . minimal as to length, and each would be
grammatically capable of being terminated with a capital letter [at one end] and a
period [at the other]” (1965, p. 21). The unit not only preserves subordination but
also all of the coordination between words and phrases and subordinate clauses. In
Hunt’s study it was demonstrated that as students become older and their writing
becomes more complex, the mean length of their T-units as seen in their
compositions also increases.
In addition to length of T-unit, Hunt developed a subordination ratio. The subordination ratio is determined by adding the total number of clauses and dividing by the number of T-units. The subordination ratio, like the length of T-units, increases as the students increase in age and syntactic maturity. Table 16-1 displays the results obtained by Hunt for the three grade
levels which his study examined.
While we recognize the usefulness of Hunt's measures, we also see some limitations on applying them to the writing of L2 learners. Hunt's measures do not take errors into account. Nor do the measures take into account morphological and transformational complexity. As a possible supplement to the T-unit for purposes of analyzing the development of the writing of L2 learners, we developed two additional measures, both based on the T-unit. The first of these is an Errors-per-T-unit measure. The measure is computed by adding the number of errors found in each T-unit and dividing by the number of T-units. The second measure is one of morphological and transformational complexity loosely based on the work of Endicott (1973).

Table 16-1 Results Obtained by Hunt in Compositions Written by Fourth, Eighth, and Twelfth Grade Students (Total of 54 Compositions, 18 in Each Group)

Grade level           4      8      12
T-unit length         8.6    11.5   14.4
Clause/T-unit ratio   1.30   1.42   1.68
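For readers who want to see the arithmetic concretely, here is a minimal sketch in Python of Hunt's two measures. It assumes the compositions have already been hand-segmented into T-units and clauses (the laborious part of the method); the function name and data layout are illustrative, not taken from the chapter.

def hunt_measures(t_units):
    # t_units: list of (word_count, clause_count) pairs, one per T-unit
    total_words = sum(words for words, _ in t_units)
    total_clauses = sum(clauses for _, clauses in t_units)
    mean_length = total_words / len(t_units)       # mean T-unit length
    clause_ratio = total_clauses / len(t_units)    # clause/T-unit ratio
    return mean_length, clause_ratio

# Example: a short composition segmented into three T-units.
length, ratio = hunt_measures([(9, 1), (14, 2), (11, 1)])
print(round(length, 2), round(ratio, 2))           # 11.33 1.33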

In our adaptation of the Endicott model, we assigned points to selected grammatical structures and morphological forms. The grammatical structures were selected on the basis of their frequency in the writing of advanced ESL students. As with the other measures, the basic unit for analysis is the T-unit. If a T-unit contains no embedding or complex morphological forms, it is assigned a value of 1, as in the following example:

John hit the ball over the fence.

Complexity = 0 + 7 = 7;  T-unit length = 7;  Complexity/T-unit length = 7/7 = 1.00

If, however, a T-unit contains embedding and one or several complex, derived morphological forms, the unit is scored in the following manner:

John carelessly hit the red ball which his father bought him over the neighbor's fence.
(Each of the 15 words scores 1; carelessly adds 2 for its two derivational morphemes, red adds 1 as an adjective, the relative clause adds 2, and the possessive adds 2.)

Complexity = 7 + 15 = 22;  T-unit length = 15;  Complexity/T-unit length = 22/15 = 1.47

In the scoring scheme, a weight of 3 is assigned to noun clauses and a weight of 2 is assigned to relative clauses, passive sentences, embedded questions, possessives, and comparatives. Each derivational morpheme and each adjective gets a weight of 1.
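The scoring can be summarized in a few lines of code. The following minimal Python sketch applies the weights just listed; the structure counts are assumed to have been tallied by hand, and the dictionary of weights and the function name are ours, not part of the original study.

WEIGHTS = {"noun_clause": 3, "relative_clause": 2, "passive": 2,
           "embedded_question": 2, "possessive": 2, "comparative": 2,
           "derivational_morpheme": 1, "adjective": 1}

def complexity_ratio(word_count, structure_counts):
    # Every word contributes 1 point; weighted structures add extra points.
    extra = sum(WEIGHTS[name] * n for name, n in structure_counts.items())
    return (extra + word_count) / word_count

# The second worked example above: 15 words, one relative clause, one
# possessive, two derivational morphemes (care-less-ly), and one adjective.
print(round(complexity_ratio(15, {"relative_clause": 1, "possessive": 1,
                                  "derivational_morpheme": 2,
                                  "adjective": 1}), 2))   # 22/15 = 1.47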
Subjects and Elicitation Procedures. Using the four measures—T-unit length, Clause/T-unit ratio, Errors-per-T-unit, and the Index of Complexity described above—300 compositions written by ESL students at six levels of proficiency (Center for English as a Second Language, at Southern Illinois University) were analyzed. There were 50 compositions at each of the levels. All compositions were written under carefully monitored conditions. That is, students were given a list of several expository topics from which they were asked to select one and write for 50 minutes.

Results and Discussion

The means for each of the measures across the six levels are found in Table 16-2. The general developmental trend observed by Hunt for native speakers is also seen in this study in the writing of nonnative speakers.

Table 16-2 Results Obtained from ESL Compositions over Six Levels of Proficiency (Total of 300 Compositions, 50 at Each Level)

Proficiency level     1      2      3      4      5      6
T-unit length         7.00   8.71   10.89  12.24  12.63  13.56
Clause/T-unit ratio   1.07   1.28   1.56   1.60   1.92   1.74
Errors-per-T-unit     1.26   1.12   1.28   1.26   1.21   1.33
Complexity index      1.17   1.33   1.39   1.35   1.40   1.47

Table 16-3 The Four Variables Together with Wilks' Lambdas and Univariate F-Ratios (with 2 and 297 Degrees of Freedom)

Variable              Wilks' lambda   F-ratio
T-unit length         .57             110.40
Clause/T-unit ratio   .76             47.25
Errors-per-T-unit     .98             3.18
Complexity index      .93             11.04

This trend was observed over the two measures developed by Hunt as well as with the Complexity Index developed by us. The only exception to the progression Hunt observed is seen in the clause/T-unit ratio at levels 5 and 6. This is not surprising, since the writers at both these levels are all fairly advanced in their composition skills. Perhaps they are making more precise choices in lexical items and thereby are reducing the length of clauses through greater clause density. Another possibility is that some of the level 5 students were actually more proficient writers than level 6 students.
Using the four measures described above, the authors attempted to determine
(1) how accurately objective measures alone could discriminate among the
different levels of ESL placement and (2) what the relative power of each of the
objective measures is. To answer these questions, discriminant analysis using both
direct and stepwise procedures was employed. Since there are only four measures
and six groups, it was necessary to collapse the six groups to three: levels 1 and 2
became Group 1; 3 and 4 became Group 2; 5 and 6 became Group 3.
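A modern reader could reproduce this step with standard tools. The following minimal sketch uses scikit-learn's linear discriminant analysis on placeholder data (the original analysis long predates such libraries, so this only mirrors its logic, not its exact computation).

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))     # columns: T-unit length, clause/T-unit,
y = np.repeat([0, 1, 2], 100)     # errors/T-unit, complexity index

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print("variance explained by each function:", lda.explained_variance_ratio_)
print("proportion correctly classified:", (lda.predict(X) == y).mean())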
Only two discriminant functions were found—length of T-unit and clause/T-unit ratio. Table 16-3 contains the four variables together with the Wilks' lambdas and univariate F-ratios with 2 and 297 degrees of freedom. Related eigenvalues and canonical correlations for the two discriminant functions are found in Table 16-4. Together the two measures accounted for 56% of the total variance across groups. The Errors-per-T-unit and the Index of Complexity (Endicott index) turned out to be totally lacking in discriminatory power (see Wilks' lambdas in Table 16-3). The summary table (Table 16-5) reveals that 64% of the grouped cases were correctly classified.
While the results of this preliminary study are somewhat encouraging, it is also clear that more precise measures are needed to discriminate better among the intermediate-level students.

Table 16-4 Canonical Correlations, Chi-Square Values, and Eigenvalues for the Two Discriminant Functions*

Function              Eigenvalue   Correlation   Chi-square
T-unit length         .83          .67           205.39
Clause/T-unit ratio   .10          .29           27.35

*The two discriminant functions account for 54% of the variance.

Table 16-5 Summary Table of Correctly and Incorrectly Classified Cases*

Group   No. of cases   Predict 1   Predict 2   Predict 3
1       100            83          12          5
2       100            27          46          27
3       100            6           31          63

*64% of the cases were correctly predicted.

One possibility is to adjust the weightings of the Complexity Index to reflect more accurately the relative complexity of the various structures. However, given the current state of linguistic theory, many questions concerning what is or is not a complex structure have not been resolved. Another possibility is to develop a measure which would more accurately assess the task demands of the writing process, a process which involves the logical chaining of one sentence to another. Such a measure of cohesion could serve as a useful complement to the length-of-T-unit and clause/T-unit measures. Nonetheless, it seems safe to conclude that the sentences of ESL students grow in complexity in ways similar to the sentences of native speakers. Further, the objective measures employed discriminate reasonably well among the various ability levels.
A final question was whether there would be significant correlations between objective measures of complexity and holistic evaluations of compositions. For this purpose, each composition was evaluated by experienced ESL teachers on a 1 to 5 scale: 5 represented outstanding, 4 above average, 3 average, 2 below average, and 1 inferior. The scale was relative to each level. For example, a 3 for Course 3 indicated that the composition was "average" for students writing at that level. To ensure reliability, both of the authors reevaluated the compositions. The interrater reliability exceeded .90. Pearson product moment correlations were computed between the holistic evaluations and the objective measures. Results are presented in Table 16-6. For the lower three levels, the highest correlations were obtained between the clause/T-unit ratio and the holistic evaluation. Progressing up the levels, the correlations between length of T-unit and holistic evaluation continued to increase. At the most advanced level, 50% of the variance in the holistic evaluations could be accounted for on the basis of length of T-unit alone.

Table 16-6 Correlation of Objective Measures with Holistic Evaluations over Six Levels of Proficiency

Level   T-unit length   Clause/T-unit ratio   Errors-per-T-unit   Complexity index
1       .22             .50*                  -.45*               .39*
2       .32             .57*                  -.60*               .25
3       .39*            .48*                  -.35                .08
4       .54*            .24                   -.47*               -.11
5       .64*            .39*                  -.26                .19
6       .74*            .61*                  -.50*               .34

*p < .01.
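The per-level correlations in Table 16-6 are straightforward to compute. A minimal sketch with scipy, on placeholder arrays standing in for one level's 50 essays (the variable names are ours):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
holistic = rng.integers(1, 6, size=50).astype(float)  # 1-5 teacher ratings
t_unit_length = rng.normal(12.0, 2.0, size=50)        # objective measure

r, p = pearsonr(t_unit_length, holistic)
print(f"r = {r:.2f}, p = {p:.3f}, shared variance = {r * r:.0%}")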

While the authors concede that there is far more to writing than length-of-T-unit or clause/T-unit ratios, this study has demonstrated that these measures are relatively useful in determining levels of overall ESL proficiency and in predicting the overall effectiveness of writing ability.
Chapter 17

Discrete Point versus Global Scoring
for Cohesive Devices

Jill Evola, Ellen Mamer, and Becky Lentz

Based on first language acquisition research, hierarchies of difficulty have been associated with conjunctions, pronouns, and articles. Assuming such hierarchies also exist for second language learners, improvement in proficiency should entail improvement in the ability to use more difficult elements in the hierarchy. Essays of 94 Arabic- and Farsi-speaking male students at the Center for English as a Second Language (CESL) at Southern Illinois University were analyzed. There was a significant though weak correlation between development of overall proficiency and ability to use the grammatical items as hierarchically arranged by first language studies. Difference in language background (Arabic versus Farsi speakers) was not consistently a significant variable. When global proficiency measures were used as validating criteria, a scoring system which simply totaled correct usages proved to be a better indicator of language development than systems which took into account the number of words produced, errors, or obligatory contexts.

Coulthard (1975) suggested that techniques of discourse analysis are appropriate to both spoken and written texts and that cohesive devices (in both cases) range across sentence boundaries. This notion prompted us to look at conjunctions, pronouns, and articles as they affect cohesion in essays. Our major hypothesis was that the ability to use cohesive devices (in particular, the skills of connecting ideas logically, using anaphoric referential devices, and attending to foregrounding constraints) would index a student's communicative proficiency.
First language acquisition research has established hierarchies of difficulty for items within the grammatical categories of conjunctions, pronouns, and articles (Lee and Canter, 1971, and Warden, 1976).

Table 17-1 Distribution of Subjects by Native Language and Level at CESL

                         Level at CESL
Native language    1    2    3    4    5    Total
Arabic             2    15   4    11   8    40
Farsi              2    13   23   6    10   54
Total              4    28   27   17   18   94

Assuming that these hierarchies also exist for second language learners, it was hypothesized that correct usage within each of these three categories would increase from elementary to advanced levels, and that there would be no significant difference in usage by speakers from two diverse language groups (Arabic and Farsi). We used three different scoring methods and correlated each with two measures of global proficiency, subjective essay ratings and objective essay scores, to discover to what extent our discrete-point scoring methods were valid as indicators of language proficiency.
Subjects. Our sample was taken from a population of 182 foreign students studying English at CESL. We selected the two largest homogeneous groups according to sex and language. The subjects used for analysis were 94 males: 40 native speakers of Arabic and 54 native speakers of Farsi. They were distributed across the levels at CESL as shown in Table 17-1.

Elicitation Instrument. The data were based on essays written as part of the
spring term testing project (see Chap. 14). The students were to write about an
imaginary accident to which they were witnesses and were asked to report who was
injured, when the police arrived, and other relevant facts about the incident. This
task was selected because it drew on experiences outside the language classroom.
Although none of the instructors who administered the test gave any clues as to
paragraph construction or the use of introductions and conclusions, all but five
teachers helped their students with the vocabulary in the directions.

Scoring. The hierarchies of difficulty used for the analyses of conjunction and
pronoun usages were adapted1 from the scales reported by Lee and Canter (1971)
for language-delayed children acquiring English as their native language. The
hierarchies used in this study are as follows in ascending order of hypothesized
difficulty:
Conjunctions: (1) and; (2) but; (3) because; (4) so, so that, if; (5) or, except, only; (6) where, when, while, why, how, whether (or not), for, till, until, since, before, after, unless, as, as + adjective + as, as if, like, that, than, therefore, however, whenever.
Pronouns: (1) I, me, my, mine, you (subject and object), your, yours (no need for referent); (2) he, him, his (adjective and nominal), she, her (object and possessive), hers; (3) we, us, our, ours; (4) they, them, their, theirs; (5) all reflexive
pronouns.

Units for measurement included both phrases and clauses. Conjunctions


were tabulated as correctly used when they logically joined together two ideas. If
the conjunction did not convey an appropriate relationship between two phrases
or clauses, it was tabulated as an error. “My father saw the man and stopped the
car" exemplifies a correct usage of the conjunction and. “He was hurt and the car
hit him” demonstrates an incorrect usage of and where the use of because would
have more appropriately conveyed the relationship between the two clauses.
A pronoun was tabulated as correctly used when its referent was identifiable
in context and when it was grammatically correct in person, number, gender, and
case. The use of a pronoun which did not meet the above criteria was counted as an
error. For example, “knocked a little girl and killed her" shows a correct use of the
pronoun her. “Last weekend at midnight I standed in downtown. We spoke
together. . .” as the first line of an essay demonstrates an error in the use of we
because its referent is not identifiable in context.
The correctness of an article usage was judged by obligatory context. When
an item or person was introduced into the discourse for the first time (i.e., before it
had been “foregrounded"; see Chafe, 1972), an indefinite article was scored as
obligatory. In instances in which an item had already been mentioned and
reference was made to it again (i.e., after the referent had been foregrounded), a
definite article was scored as obligatory. Credit was given for a correct answer and
half credit was given for the wrong article. If no article at all appeared, no credit
was given. We consistently scored initial reference to “ambulance" as an
obligatory occasion for “an." Also, words like “the street" and “school, home,
work" were treated as generic terms and were scored as obligatorily requiring or
not requiring an article (as they are given above).
Using the tabulations explained above, three scoring systems were used to
analyze the data:

(S1) number of correct usages (for all categories)

(S2) number of correct usages / number of words (for conjunctions and pronouns)
     number of correct usages / number of obligatory contexts (for articles)

(S3) (number of correct usages - number of errors) / number of words produced (for conjunctions and pronouns)
     (number of correct usages - number of errors) / number of obligatory contexts (for articles)
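In code, the three systems reduce to a few ratios. A minimal Python sketch, with tallies assumed to come from the hand scoring described above (the function and parameter names are ours):

def s1(correct):
    return correct                        # raw count of correct usages

def s2(correct, denominator):
    # denominator: words produced (conjunctions, pronouns)
    #              or obligatory contexts (articles)
    return correct / denominator

def s3(correct, errors, denominator):
    return (correct - errors) / denominator

# One essay's article tallies: 9 correct, 3 errors, 14 obligatory contexts.
print(s1(9), round(s2(9, 14), 2), round(s3(9, 3, 14), 2))   # 9 0.64 0.43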

As validating criteria for these discrete-point scoring systems, we used two measures of global proficiency: a subjective essay rating by teachers and an objective essay score. (For a description of these techniques, see Chap. 14.)

Table 17-2 Three-Way Analysis of Variance (with Repeated Measures) over Each of Three Grammatical Categories Using Hierarchical Position, Level at CESL, and Native Language as Independent Factors

Category                    df      F        p      % Variance explained

Conjunctions
  Position in hierarchy     5,504   122.78   .001   .48
  Level at CESL             4,504   11.12    .001   .04
  Native language           1,504   2.95     .082   .00

Pronouns
  Position in hierarchy     4,420   92.81    .001   .38
  Level at CESL             4,420   10.68    .001   .04
  Native language           1,420   .03      .999   .00

Articles
  Position in hierarchy     2,102   38.93    .001   .34
  Level at CESL             4,102   3.23     .015   .03
  Native language           1,102   18.38    .001   .05
  Position in hierarchy
    by language             2,102   9.99     .001   .04

Results and Discussion


A three-way analysis of variance was performed for each of the three grammatical categories, with frequency of usage as the dependent variable and position in hierarchy, level at CESL, and native language as the independent variables. The results (see Table 17-2) indicate a significant effect of position in hierarchy on frequency of usage for conjunctions, F(5,504) = 122.78, p < .001; pronouns, F(4,420) = 92.81, p < .001; and articles, F(2,102) = 38.93, p < .001. The overlapping variance is highest for conjunctions (R² = 48%) and somewhat lower for pronouns (R² = 38%) and articles (R² = 34%). Level at CESL seems to have little relation to the frequency of usage for conjunctions or pronouns (eta² = 4%), or for articles (eta² = 3%). As predicted, there was no significant difference between Arabic and Farsi speakers in the use of conjunctions or pronouns, though there was a significant difference in article usage favoring the Farsi speakers. None of the two- or three-way interactions were significant except for language background with position in hierarchy for articles.
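The factorial structure of the analysis can be sketched with statsmodels. Note that the original design treated position in hierarchy as a repeated measure; the between-subjects approximation below, on placeholder data, only illustrates the three-factor layout, not the original computation.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Two placeholder observations per cell of a 2 x 2 x 2 layout.
df = pd.DataFrame({
    "freq":     [4, 2, 6, 1, 5, 3, 7, 2, 5, 1, 6, 2, 4, 3, 8, 1],
    "position": [1, 2] * 8,
    "level":    [1, 1, 2, 2] * 4,
    "language": ["Arabic"] * 8 + ["Farsi"] * 8,
})
model = smf.ols("freq ~ C(position) * C(level) * C(language)", data=df).fit()
print(anova_lm(model, typ=2))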
The results of a Pearson correlation between each of the discrete-point scoring systems and the global scoring systems are summarized in Table 17-3. For conjunctions and pronouns, only S1 (correct usage) is significant, and the correlations even there are weak. The highest correlation between conjunctions or pronouns and the global scores is .37 (p < .001), which indicates a shared variance of only 14%.

Table 17-3 Correlations between Various Discrete-Point Scores and Global Scoring Methods (n = 94)

                           Conjunctions        Pronouns            Articles
                           S1    S2    S3      S1    S2    S3      S1    S2    S3§   S3¶
Essay rating by teachers   .37*  .01   .04     .36*  -.02  .05     .18†  .10   .26*  .18‡
Objective essay score      .26†  .01   .04     .37*  .06   .13     .26*  .19‡  .28*  .24*

*p < .001. †p < .01. ‡p < .05. §Full credit. ¶Half credit.

For articles, several of the scoring methods are significant, but none of the correlations exceeds .28 (p < .01).
Our findings indicate that skills in the usage of cohesive devices are indeed minimal indicators of overall language proficiency. A student's ability to use conjunctions, pronouns, and articles correctly cannot be expected to reflect his communicative ability, although it must contribute to finer aspects of that skill.
Surprisingly, our second hypothesis was weakened: Level at CESL accounts for
far less variance than does an item’s hierarchical position within any grammatical
category. In other words, a student’s competence in using these three structures
has little or no bearing on his level at CESL. Furthermore, native language seems
to play no role at all in ability to use cohesive devices except possibly articles.
When global proficiency measures are used as validating criteria, the scoring
system which totals correct usages seems to be a better indicator of language
ability than systems which take into account the number of words produced, errors
made, or obligatory contexts. Discrete-point analyses seem to reveal only narrow
descriptions of potential communicative capacity and as a result do not appear to
be comprehensive indices of language proficiency.

Note

1. Pronouns used as referents to persons were the only pronouns scored; those referring to objects or
situations were omitted. First person plural and third person plural pronouns were tabulated
separately. Sentence points and weighted scores (see Lee and Canter, 1971) were not used.
Part V Discussion Questions

1. Would it be reasonable to expect an essay grader to become more skillful with experience? Or less? Or possibly to remain the same? Can you think of contexts where training raters might result in a decrease in overall reliability and validity? Consider the results of Evola et al. (Table 17-3) using discrete-point scoring techniques; also, compare the findings of Kaczmarek (Table 14-1, column 19) with those of Flahive and Snow (especially Table 16-6). What differences in the statistical techniques complicate the comparisons? Also examine Table 15-4.

2. What effect on reliability would be predicted if a series of essay writing tasks were evaluated instead of a single essay?

3. What hypothetical factor do you believe can best explain the loadings on g in Table 14-3? Also consider the loadings on g in Table 2-2. Compare the single-factor solution of Table 14-3 with the three-factor solution of Table 14-4. Try to find a consistent (non-self-contradictory) explanation for the factors in Table 14-4.

4. Contrast the amount of variance explained by the single-factor solution in Table 14-3 with the amount explained in Table 14-4 by the multiple-factor solution. In other words, consider the contrast in relative amounts of variance explained for each test entered into the factor analysis.

5. In Chap. 15, what does a significant difference between judges indicate concerning interrater reliability? Note that contrasts sometimes co-occur with high reliabilities (see Table 15-3).

6. Why do some pairs of judges achieve so much more reliability than others? What factors might enter in? How many of the reliability coefficients displayed in Table 15-3 do you consider to be acceptable? How many, for instance, are above .80? Below .80?

7. Disregarding the extremely unreliable pairs of judges, what is the average reliability for Structure ratings over all eight groups? Organization? Quantity? Vocabulary? Overall?

8. Compare the average reliabilities from Question 7 with the correlations across scales recorded in Table 15-4. How much of the reliable variance in any given scale is likely to be common to all the scales? To the Overall Proficiency rating? (See Questions 8, 9, and 10 of Part III.)


9. Is it possible to write long awkward sentences? What effect would the tendency to do so have on T-unit scores (see Chap. 16)? What effect would such a tendency have on essay ratings? Or, consider sentences that are very short, pithy, and clear.

10. Are the measures used in Chap. 16 sensitive to errors or organizational problems involving constraints that go beyond the level of a single T-unit or clause? What about holistic ratings? Are they sensitive to such constraints?

11. Does it necessarily follow that a better analytic conceptualization of the skills underlying writing will lead to better instruction in writing? Why or why not?

12. Why would it be reasonable to expect correlations across measures to be depressed in Table 16-6, somewhat more than those in Table 15-4 (other things being equal)?

13. Discuss the hierarchical ordering of elements in a given grammatical category (e.g., the conjunctions in Chap. 17). Why might you expect ordering by difficulty to be similar or different for natives and nonnatives? What result did Evola et al. obtain?

14. What factors are overlooked in discrete-point scoring methods but included (at least implicitly) in the more holistic scoring techniques?
Part VI

Native versus Nonnative Performance:
What's the Difference?

Do native speakers of English tend to make errors that are different in type from those made by nonnatives who are learning English as a second language? Are the structures that are difficult for one group also difficult for the other? What is the strength of the similarity (if any)? Do children learning English as their first language and adults learning it as their second language both find direct assertions easier to process than indirectly conveyed meanings (e.g., presuppositions or implications)? Can it be demonstrated that learners of ESL from a particular language background find specific structures in English more difficult than others because of interference from their native language? In other words, will a group of ESL learners from a particular native language background find certain structures in English to be significantly more difficult in relation to other English structures than those same structures might be for native speakers of English or for learners of ESL from other nonnative backgrounds? These questions and others are dealt with in Chaps. 18 to 20.
Chapter 18

We All Make the Same Mistakes:
A Comparative Study of Native and Nonnative
Errors in Taking Dictation

Michelle Fishman

Second language learner errors have often been attributed to the developmental cognitive strategies that the learner uses while learning a second language. This study attempts to find out whether, as is usually supposed, the kinds of errors made by native speakers of English are actually very different from the kinds of errors that second language learners make. The data were gathered from a dictation task administered to native and nonnative speakers. The same passage of prose was recorded twice. The natives listened to a tape with white background noise while the nonnatives heard a tape without noise. The dictation consisted of nine segments with pauses to allow both groups to attempt to write verbatim what they had heard. We hoped to learn whether processing difficulties are distributed similarly over segments of text for natives and nonnatives, and whether the two groups would tend to make the same types of errors and in roughly the same proportions. The results reveal substantial similarities in dictation processing by natives and nonnatives. Although natives made fewer errors in spite of the noise factor, the native and nonnative speakers tended to agree on what they found difficult and what they found easy. The data show that in a Q-type factor analysis all the natives and the more proficient nonnatives loaded on the same factor rank ordering the difficulty of segments, and a second factor analysis of the same type showed that a single component accounted for .86 of the total variance between subjects in both groups. The foregoing would suggest that when pushed to the limits of their ability, both native and nonnative speakers seem to make the same kinds of errors.

Do nonnative speakers make the same errors as native speakers in a discourse processing task? Almost anyone would expect nonnatives to make errors
more frequently, but that is not the issue. The issue is whether or not natives and nonnatives tend to make the same types of errors, and in roughly the same proportions. Put differently, do nonnatives use different processing strategies and therefore make different kinds of errors, or do they use similar processing strategies but simply use them less efficiently? And, does the same hierarchy of difficulty for segments of discourse hold for both natives and nonnatives?
Related to these basic questions is the whole issue of contrastive analysis and the often postulated influences of the learner's native language on target language processing. Naive contrastive theories predicted that types of errors would necessarily be different for natives and nonnatives (also see Chap. 20). Depending on the nonnative's first language, learning strategies would differ. However, Richards (1971) showed that errors made by second language learners are not very different from errors children make while learning their native language. On the basis of such reasoning, learner outputs (especially errors) have come to be regarded as evidence of the strategies the learner uses to test hypotheses about the structure of the language—i.e., the route he follows when developing proficiency in the target language.
It is widely accepted that both first and second language learners must abstract similar rules in order to achieve mature or native-like competence. To test the degree of similarity in native and nonnative processing strategies, a dictation was used as an elicitation device. In many previous studies, tasks have been used in which oral or written responses were elicited from the learner (Selinker, 1972, Dulay and Burt, 1974, Schumann and Stenson, 1974, Johansson, 1975, and Taylor, 1975). All these studies have assumed that it is possible to infer something of the nature of the underlying mental processes from learner performance.
In order to challenge the efficiency of speech perception of the native speakers used in this study (see below) and thus make the dictation task more nearly equivalent in difficulty for natives and nonnatives, white noise was imposed on the signal presented to the natives. A similar technique had been used by Gradman and Spolsky (1975) to measure second language proficiency. Long before that, the technique was used with native speakers by Miller, Heise, and Lichten (1951) in a study of voice communication systems.

Method
Subjects. Thirty-two native speakers of English attending Southern Illinois University and thirty-two nonnatives served as subjects. The former group consisted of randomly selected freshmen, sophomores, juniors, and seniors. Major fields of study were Business Administration (5 students), Teacher Education (4), Marketing (3), Health Education (3), Early Childhood Education (3), Sociology (3), Social Welfare (2), and one each in Biology, Physical Therapy, Public Relations, Clothing and Textiles, Political Science, Law, and Physiology. Two were undecided. The foreign students were among those enrolled in Southern Illinois University's Center for English as a Second Language (CESL) during the fall semester of 1976.
The thirty-two nonnative subjects were chosen from level 5 at CESL. They included native speakers of Farsi, Arabic, Spanish, Japanese, Turkish, and French. According to CESL's placement procedure (see Chap. 4) this group was supposed to be relatively homogeneous in ESL ability.

Testing. The same test was tape-recorded twice. The following passage of prose about Mark Twain's Huckleberry Finn was used:

(1) The river life was very leisurely. / (2) The meals were mostly fish caught while traveling. / (3) The boat he possessed had a raised section for living quarters. / (4) In normal weather he could lie around / (5) and fish or sleep without ever getting wet. / (6) About the only work involved was tying up the boat at night / (7) and fixing meals, neither of which required much work. / (8) Overall, it is easy to see why / (9) this life can be called a relaxing one (Hardison, 1966, pp. 111-113).

The main difference from ordinary speech was that pauses were inserted at phrase
and clause boundaries at the same places in both readings. Pauses (indicated by
slashes) were sufficiently long for the hearer to have ample time to write each
dictated segment verbatim. (The numbers given in parentheses were not dictated
and are used only for the sake of convenient reference to the segments later in this
chapter.)
The dictation contained 76 running words of text broken into 9 segments as shown above. Two types of scores were computed. First, scores were tabulated for each segment by dividing the number of correct words written by each subject on each segment by the actual number of words in that segment. The level of segment difficulty could then be calculated from these scores.
Ten categories of errors were differentiated. A percentage score was calculated for each subject on each category by dividing the number of errors in that category by the total number of errors. Thus the native and nonnative responses could be compared for relative frequency of errors on the various segments and for the relative frequency of occurrence of the various types of errors. The categories and examples of errors of each type are given in Table 18-1.
Spelling errors were not included as part of the scoring process, except those which seemed to indicate difficulties in perception of distinct sounds, as in "rever" for "river," or those which affected the lexical identity of the word, as in "whether" for "weather" or "possede" for "possessed." Therefore, spellings like "travelling," "tieing," and "leasurly" were not counted as incorrect. (See Oiler, 1979, and Chap. 6.)
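Both tabulations are simple proportions. A minimal Python sketch follows, using the actual segment lengths of the passage (6, 8, 11, 7, 8, 12, 9, 7, and 8 words, totaling 76); the per-subject counts shown are invented for illustration, and the function names are ours.

def segment_scores(correct_words, segment_lengths):
    # Proportion of words reproduced correctly in each dictated segment.
    return [c / n for c, n in zip(correct_words, segment_lengths)]

def category_proportions(error_counts):
    # Share of a subject's errors falling in each of the ten categories.
    total = sum(error_counts)
    return [n / total for n in error_counts]

lengths = [6, 8, 11, 7, 8, 12, 9, 7, 8]     # 76 running words in all
correct = [6, 5, 7, 7, 6, 7, 6, 7, 8]       # one (hypothetical) subject
print([round(s, 2) for s in segment_scores(correct, lengths)])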

Results and Discussion

Overall, in spite of the noise factor, native speakers made approximately 2.5 times fewer errors than the nonnatives. However, both the natives and nonnatives tended to rank the segments similarly. That is, what the natives found difficult the nonnatives also found difficult, and what the natives found easy, the nonnatives also found easy. There appeared to be a similar hierarchy of difficulty in segments for the two groups. This can be seen in Table 18-2 by comparing the rank order of segments by the natives with the rank order by nonnatives.

Table 18-1 Examples Illustrating Categories of Errors*

Native responses Nonnative responses

1. Morphological Changes
(1) The river life is (was) very leisurely. (1) The river life is (was) very leasurly.
(4) In normal weather he could lay (lie) (4) In normal weather he could lay (lie)
around . . . around ...
(9) This life could (can) be called a (9) This life may (can) be called a
relaxing one. relaxing one.

2. Multiple Segmentation Errors


(2) The meals were mostly fish: (2) The meals were mostly:
caught in the channels. fixed for travel.
caught by trappers. difficult for travelling.
carp and salmon. (fish caught while traveling.)
(caught while traveling.)

3. Substitution Due to Prominent Vowel


Nuclei
(4) In normal weather he could ride (4) In normal weather he could drive
(lie) around . . . (lie) around . . .
(2) . .. mostly fish caught by (while) (2) . . . mostly fish caught by (while)
traveling. traveling.

4. Distortion or Deletion of Weakly Stressed


Syllables
(5) ... and fish could (or) sleep .. . (5) . . . and fish will (or) sleep . . .
. .. and fish asleep (or sleep) . . . . .. and fish could (or) sleep . . .
(1) The river life was very usual (1) The river life is very easy (leisurely).
(leisurely).

5. Words or Sounds Borrowed from


Neighboring Context
(6) About the only work involved was (6) About the only work involved is by
around [tying up—influence from the work at night [tying up the
(4)] at the boat at night. boat—influence from (6)].
(7) ... neither of which was tiring work (7) . .. neither which must work on
[required much work—influence around [required much work—
from (6), was tying up the boat at influence from (4), in normal
night]. weather he could lie around].
(9) This life can be called a relaxing (9) This life can be called a relaxing
life (one—lexicalizes “one” due to life (one—lexicalizes “one” due to
“this life”). “this life”).

6. Substitution Attributable to
Phonological Similarities
(between [b] and [v]) (between [b] and [v])
(6) ... was time devoted at night (was (6) ... was time enough to vote at
tying up the boat at night). night (was tying up the boat at
night).

(between [r] and [w]) (between [r] and [w])


(1) The winter life was . . . (river) (1) The woman life was . . . (river)

*Parenthesized numbers refer to the specific segments at issue. Errors shown in italics have the correct wording given in the immediately following parentheses.

Table 18-1 (continued)

Native responses Nonnative responses

7. Complete Word Deletions

8. Inflectional Deletions
(7) ... and fixing meals, neither which (7) ... and fixing meal (meals) ...
require (neither of which required)
much work.
... and fixing meals, this require . . . and fixing meals, neither which
(neither of which required) much require (neither of which required)
work. much work.

9. Sensible Substitutions for Stressed


Lexical Items
(8) On the whole (overall) it is easy to (8) Oh well (overall) it is easy to see
see why . . . why . . .
(4) In good (normal) weather he could (4) In normal weather he could haing
lie around . . . (lie) around . . .
(3) ... had a certain (raised) section for (3) ... had the best (a raised) section
living quarters. for living quarters.

10. Miscellaneous Unclassifiable Errors


(9) This life can be called an (a) (9) This life can be called at the reac¬
relaxing one. tion human (a relaxing one).
(2) The mirrors (meals) were mostly (2) The new York fish (meals were
fish . . . mostly fish) . . .

*Parenthesized numbers refer to the specific segments at issue. Errors shown in italics have the correct wording given in the immediately following parentheses.

Table 18-2 Rank Ordering* of Segments by Natives and Nonnatives

                 Natives                       Nonnatives
Segment   Rank   %     Mean   SD        Rank   %     Mean   SD
1         1      .54   3.25   1.64      6      .51   3.06   1.05
2         3      .57   4.59   2.01      3      .39   3.09   1.53
3         2      .55   6.00   2.71      1      .29   3.06   1.85
4         7      .89   6.22   .79       9      .85   5.94   .84
5         6      .86   6.91   1.09      4      .42   3.38   1.85
6         5      .83   9.97   2.06      2      .34   4.12   1.74
7         4      .82   7.38   1.93      5      .43   3.84   2.11
8         8      .91   6.34   .60       8      .80   5.59   1.24
9         9      .96   7.66   .60       7      .64   5.09   1.73

*Rank order is from hardest to easiest.

To assess the degree of agreement among natives and nonnatives in the ranking of segments, a Q-type principal components analysis was computed with the 9 segments as cases and the 62 subjects (31 natives and 31 nonnatives) as variables. (Two subjects were eliminated at random so as not to exceed computer space limitations.)

Table 18-3 Principal Components Analysis of Segment


Difficulties: Loadings for Natives and Nonnatives

Native Nonnative Native Nonnative


Subjects Factor 1 Factor 1 Factor 2 Factor 2

1 .88 .60* -.25 .63


2 .66 -.17 -.47 .41
3 .86 .38 -.39 -.35
4 .96 .02 .08 .83
5 .75 -.01 -.11 .65
6 .77 .60* -.28 .25
7 .84 .38 -.41 .71
8 .53 .43 -.11 .36
9 .74 .58* .18 .51
10 .66 .54* -.46 .39
11 .71 .35 -.41 .51
12 .91 .79* -.24 -.16
13 .82 .41 -.21 .79
14 .84 .54* -.47 .68
15 .65 .37 -.50 .64
16 .79 .26 .39 .59
17 .70 .59* -.32 .44
18 .59 .27 .12 .72
19 .82 .52* -.19 .29
20 .84 .52* -.47 .16
21 .70 .41 -.60 .71
22 .73 .19 -.49 .66
23 .67 .45 .60 .86
24 .72 .35 .00 .71
25 .71 .22 -.11 .83
26 .74 .60* .31 -.06
27 .87 .45 -.26 .61
28 .92 .12 .10 .85
29 .92 .30 .12 .76
30 .93 .69* -.03 .38
31 .63 -.26 -.68 .79

*Nonnatives whose performance conformed substantially to that of


natives (p < .001).

The natives and the more proficient nonnatives tended to load on the first principal component, which accounted for 40% of the total variance. All the natives loaded on that factor at levels above .50. Their mean loading was .78. Eleven of the nonnatives also loaded on that factor at levels above .50. The mean loading for nonnatives on the first factor, however, was .19. The second principal component or factor accounted for .24 of the total variance and received loadings above .50 from the remaining twenty nonnatives. The mean loading of nonnatives on the second factor was .60 and for the natives was .12. The Spearman rank order correlation for the ranking of segments by natives and nonnatives was .60 (Pearson, r = .58, p < .001). From all these data it is possible to conclude that what is difficult for natives tends to be difficult for nonnatives.
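The Q-type analysis inverts the usual data matrix, so that each subject receives a loading describing how closely his difficulty profile follows the shared pattern. A minimal sketch with scikit-learn on random placeholder data (a rough analogue of the original computation, not a reproduction of it):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
scores = rng.random((9, 62))      # rows: 9 segments; columns: 62 subjects

pca = PCA(n_components=2).fit(scores)
print("variance accounted for:", pca.explained_variance_ratio_)
loadings = pca.components_.T      # one row of component weights per subject
print(loadings[:5])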
A similar Q-type factor analysis was also applied to the native and nonnative ranking of error categories, i.e., to the ranking of categories by proportion of errors in each one.

Table 18-4 Rank Ordering* of Error Categories by Natives and Nonnatives

                  Natives                         Nonnatives
Category   Rank   %      Mean     SD       Rank   %      Mean     SD
1          7      .040   .625     .707     7      .055   1.719    .924
2          8      .053   .813     .738     4      .037   1.156    1.051
3          9      .082   1.250    1.047    9      .101   3.156    1.919
4          6      .038   .594     .560     8      .089   2.781    1.453
5          4      .027   .406     .560     2      .021   .656     .745
6          1      .018   .281     .523     3      .028   .875     .793
7          10     .663   10.156   6.994    10     .567   17.781   7.555
8          2      .018   .281     .457     6      .050   1.563    1.343
9          5      .038   .594     .837     5      .047   1.469    1.319
10         3      .020   .313     .738     1      .007   .219     .491

*Rank order is from least frequent to most frequent.

Table 18-5 Principal Components Analysis (Categories) Loadings for Natives and Nonnatives

           Native     Nonnative              Native     Nonnative
Subjects   Factor 1   Factor 1    Subjects   Factor 1   Factor 1
1          .99        .90         17         .98        .99
2          .86*       .99         18         .96        .99
3          .97        .98         19         .99        .97
4          .98        .94         20         .91        .99
5          .98        .99         21         .45*       .99
6          .97        .98         22         .88*       .99
7          .94        .94         23         .94        .94
8          .99        .99         24         .97        .99
9          .99        .76*        25         .92        .99
10         .95        .97         26         .93        .95
11         .93        .81*        27         .41*       .95
12         .98        .81*        28         .99        .99
13         .96        .99         29         .99        .98
14         .68*       .82*        30         .99        .96
15         .82*       .99         31         .65*       .69*
16         .99        .99

*Loadings below .90.

The ranks, percentages, means, and standard deviations for natives
and nonnatives are given in Table 18-4. A single principal component accounted
for .866 of the total available variance, as shown in Table 18-5. The mean loading
for the natives was .84 and for the nonnatives was .90. Only two native speakers
loaded on that factor below .91 and only five nonnatives loaded on the same factor
below .90. No other interpretable factors emerged. The Spearman correlation
between ranks (as displayed in Table 18-4) was .68 (Pearson, r = .71). Clearly,
natives and nonnatives perform very similarly in terms of overall proportions of
errors of the ten types studied.

As a further check and to investigate specific error types, ten t-tests were computed contrasting natives and nonnatives in terms of the proportion of errors of each type made by each group. Only two of the contrasts were significant. Nonnatives made proportionately more errors of type 4, distortion or deletion of weakly stressed syllables (t = 4.27, df = 62, p < .001). Also, nonnatives made proportionately more errors of type 8, inflectional deletions (t = 3.24, df = 62, p < .002). We might have predicted that the natives would make fewer errors of each type, in comparison with the nonnatives. From all the foregoing we can say that when pushed toward the limits of their ability, natives and nonnatives seem to make the same kinds of errors in taking dictation and in roughly the same proportions.
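Each of the ten contrasts is an independent-samples t-test on per-subject error proportions. A minimal scipy sketch on placeholder data (note that two groups of 32 give the chapter's 62 degrees of freedom); the distributions used to generate values are purely illustrative:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
native = rng.beta(2, 30, size=32)      # proportion of type-4 errors,
nonnative = rng.beta(4, 30, size=32)   # one value per subject

t, p = ttest_ind(nonnative, native)
print(f"t = {t:.2f}, df = 62, p = {p:.4f}")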
Chapter 19

Processing of Indirectly Conveyed Meaning:
Assertion versus Presupposition in
First and Second Language Acquisition1

Patricia L. Carrell

The theoretical linguistic distinction between assertion and presupposition was empirically tested with two groups of subjects—young children acquiring English as their first language and adults acquiring English as a second language. The distinction was tested via measurement of the frequency of perceptual errors as a function of differences between asserted information and presupposed information in cleft and pseudo-cleft sentence patterns. Subjects heard a cleft or pseudo-cleft sentence prior to presentation of a slide picture in which the asserted or presupposed noun phrase was misrepresented. The task was to decide if the sentence correctly described the picture. Results indicated that, for both groups of subjects, significantly more errors occurred when the misrepresentation involved the presupposed noun phrase than when it involved the asserted noun phrase.

Much of the meaning conveyed by a sentence is conveyed only indirectly—it is not part of the meaning which is the basic assertion of the sentence. Classic examples are the indirectly conveyed presuppositions (Strawson, 1952) and implications (Austin, 1962) of sentences with certain so-called implicative verbs (Karttunen, 1971). Thus a sentence like:

(1) John managed to find his hat.

not only directly conveys its basic assertion, but also indirectly conveys a presupposition2 and an implication3:

(2) Presupposition: John tried to find his hat.
(3) Implication: John found his hat.

There are several linguistic properties that play a role in the interpretation of indirectly conveyed meaning. One is lexical class; another is grammatical structure.

The example above illustrates the role played by lexical class membership (in
that case, a verb, manage, of the class of implicative verbs) in the interpretation of
indirectly conveyed meaning (in that case, presupposition and implication).
A classic case of lexical presupposition is illustrated by the sentence Max has
stopped beating his wife. In addition to the directly conveyed assertion, this state¬
ment indirectly conveys the presupposition that Max did at one time beat his wife.
The presupposition in this example is lexical because it depends on the lexical
item stop; something cannot be stopped (or not stopped) unless it has been
happening.
Illustrative of the role played by grammatical structure in indirectly
conveyed meanings are the following cleft sentences:

(4) It is a bird that is eating the worm.


(5) It is a worm that the bird is eating.

Sentence (4) conveys a different meaning from sentence (5), owing primarily to differences in the assertions and presuppositions of these sentences (Akmajian, 1969, Muraki, 1970):

(6) a. Presupposition: Something is eating the worm.
    b. Assertion: That something is a bird.
(7) a. Presupposition: The bird is eating something.
    b. Assertion: That something is a worm.

Sentence (4) directly conveys the assertion (6b) and also indirectly conveys the presupposition (6a); sentence (5) directly conveys the assertion (7b) and also indirectly conveys the presupposition (7a). The type of presupposition illustrated here, known as grammatical presupposition, does not appear to be attributable to the presence of any particular lexical item (note that the sentences employ exactly the same lexical items). Rather the presupposition is due to the particular grammatical structures, namely, the cleft constructions. The related pseudo-cleft sentences exhibit the same properties (Muraki, 1970):

(8) What is eating the worm is a bird.
(9) What the bird is eating is a worm.
(10) a. Presupposition: (cf. 6a)
     b. Assertion: (cf. 6b)
(11) a. Presupposition: (cf. 7a)
     b. Assertion: (cf. 7b)

Sentence (8) directly conveys the assertion (6b) and also indirectly conveys the presupposition (6a); sentence (9) directly conveys the assertion (7b) and also indirectly conveys the presupposition (7a).
Although presupposition has been traditionally dealt with by logicians as a
strictly logical relation between statements, several philosophers and linguists
have recently come to regard it as a pragmatic notion relative to the belief
structures of the speaker and the hearer (Sellars, 1954, Hutchinson, 1971,

Karttunen, 1974, 1975). Hutchinson, for example, maintains that presupposition


can be correctly used only when the speaker knows the presupposition to be true
and also believes that his listener knows it to be true. The assertion is something
the speaker believes to be true, but that he believes his listener is unaware of.
Hutchinson describes these pragmatic presuppositions in terms of an "inference schema":

x says "A" to y as an assertion
x believes that B and x believes that y believes that B and x believes that A and x believes that y is ignorant of A. (Hutchinson, 1971, p. 136)

Hutchinson continues:

Under this analysis presupposition involves a belief on the part of the speaker that the addressee is aware of some fact which the speaker believes to be true, and assertion involves a belief on the part of the speaker that the addressee is ignorant of some fact which the speaker believes to be true. In no case are the facts themselves relevant, that is, it doesn't matter for considerations of communication whether A and B are in actuality true. What is relevant is whether x believes them to be true or not and whether y believes them or not, for it is in these terms that we may characterize the appropriateness or inappropriateness of speech acts. The speaker must believe both A and B if he is to make a legitimate assertion, and the addressee must have the beliefs the speaker believes he has if the assertion is to be appropriate to this addressee (1971, p. 136).

Obviously, then, the difference in the appropriate use of sentences like (4) and (5)
or (8) and (9) depends on what the speaker believes and what he believes his
listener believes.
Hutchinson also discusses what may happen if the belief inferences fail to hold. A speaker might, for example, employ a construction involving a presupposition the speaker believes to be false. In this case, the speaker may be said to be intentionally misleading the listener. Since a speaker may also employ a false assertion in order to intentionally mislead the listener, the question arises as to whether a listener is more likely to be deceived by a false presupposition than by a false assertion.
The cleft and pseudo-cleft sentence constructions described above are
particularly well suited for empirically testing that question. First, we may note the
neatly “reversative” relationship which holds between the noun phrases of the
presupposition and assertion of each pair of cleft sentences. In a cleft sentence
like (4) It is a bird that is eating the worm, the presupposition is about something
eating the worm, the assertion is that the bird is the thing doing it. In the related
cleft sentence (5) It is a worm that the bird is eating, the presupposition is about
the bird eating something, the assertion is that the worm is the thing it is eating.
That is, the asserted and presupposed noun phrases are reversed in the two related
types of cleft sentences. In the first type of cleft sentence, (4), we may say that the
agent noun phrase is asserted, the object noun phrase is presupposed. In the
second type of cleft sentence, (5), we may say that the object noun phrase is
asserted, the agent noun phrase is presupposed. The same relationship holds for
the two types of pseudo-cleft sentences, like (8) and (9). (See Chart 19-1.)

Chart 19-1 Summary of Cleft and Pseudo-Cleft Sentence Types—Interrelationships of NP's

                    Cleft                 Pseudo-cleft
Sentence type       Type I     Type II    Type I     Type II

Assertion           (4)        (5)        (8)        (9)
                    bird       worm       bird       worm
                    agent      object     agent      object
                    1st NP     1st NP     2nd NP     2nd NP

Presupposition      (4)        (5)        (8)        (9)
                    worm       bird       worm       bird
                    object     agent      object     agent
                    2nd NP     2nd NP     1st NP     1st NP

Second, we may note the order relationship which holds between the noun phrases of cleft and pseudo-cleft sentences making identical assertions and presuppositions. Sentences (4) and (8) are alike in their assertions and presuppositions; sentences (5) and (9) are alike in their assertions and presuppositions. In the cleft sentences, the first noun phrase is asserted, the second is presupposed. In the pseudo-cleft sentences, the second noun phrase is asserted, the first is presupposed. (See Chart 19-1.)
This study makes the assumption that the theoretical linguistic distinction
between assertion and presupposition described above is psychologically “real”
and is therefore empirically measurable. In particular, by comparing subjects’
responses to these two types of cleft sentences, as well as their pseudo-cleft
counterparts, it should be possible to measure empirically the differences in the
effects of assertion and presupposition. Previous work by Hornby (1974) has
shown this to be a valid assumption for adult native speakers of English. Further,
this study assumes that if the distinction between assertion and presupposition is
present in the competence/performance of adult native speakers of English, it
should be detectable in preadult stages of children acquiring English as their first
language and in the acquisition process of nonnative adults acquiring English as a
second language. This study is an attempt to extend Hornby’s findings to children
acquiring English as their first language and to adults acquiring English as a
second language, and to compare the results of the two in terms of the relationship
between first and second language acquisition.

Method

Subjects. The L2 subjects tested in this experiment were 52 intermediate and


advanced adult students enrolled in the Center for English as a Second Language
(CESL) at Southern Illinois University, Carbondale. These foreign students came
from all parts of the world and from a variety of native language backgrounds; they
were predominantly from the Middle East, the Far East, and South America. At
the time they were tested, they had been in the United States a minimum of six
weeks. The L1 subjects were 20 four- and five-year-old children attending nursery school. Their ages ranged from 4.3 to 5.4 (M = 4.45). Half of these subjects were female, half male.
Procedure. The subjects were presented with a series of prerecorded cleft and pseudo-cleft sentences (see Appendix 19A). Each sentence was followed almost immediately (one second delay) by the presentation of a slide picture (one second duration). The duration of the slide presentation was arrived at through a pilot experiment to be long enough to allow the subjects to form an impression of the picture but also short enough to keep them from making out all the details of the picture. Each picture involved only a simple three-element event: an agent, an object, and a simple relationship between the agent and object; for example, a girl riding a bicycle (see Appendix 19B). The task was for the subject to decide whether the sentence did or did not correctly describe the picture. If the sentence correctly described the picture, they were to respond "true," but if they noted a discrepancy between the picture and the sentence, they were to respond "false." Their answers were later transcribed onto computer-scored answer sheets. Prior to actual testing, each group of subjects was given four example test items to ensure comprehension of the task.
The only differences in the procedures for the two groups were that the L2 adults, who were literate, were tested in groups of 10 to 15 and gave their responses in writing. In other words, they simply had to circle "true" or "false" on a prepared dittoed answer sheet. The L1 children, who were not yet literate, had to give their responses verbally—responding simply "yes" rather than "true," or "no" rather than "false." The necessity of verbal rather than written responses dictated that the children be tested individually.
Half of the test items involved misrepresentation in the picture of the asserted information (the asserted noun phrase) in the related sentence; the other half of the test items involved misrepresentation in the picture of the presupposed information (the presupposed noun phrase) in the related sentence. In no case was more than one of the two noun phrases misrepresented, and in no case was the action or verb relationship between the two noun phrases misrepresented. Additional test items in which the sentence correctly represented the picture were introduced as control items to break up any set toward negative responses that might develop, but these were not scored.
In addition to systematically varying the test items so that half the misrepresentations involved the asserted noun phrase and the other half involved the presupposed noun phrase, it was very important to control and systematically vary two other factors in order to rule out other possible alternative explanations for the expected results. In order to rule out the possibility that any detected differences might be due to a difference between first noun phrase and second noun phrase (rather than due to assertion-presupposition differences), both cleft and pseudo-cleft sentences were included. If only cleft sentences had been included, for example, one might be able to explain any significant differences between assertion and presupposition as due to differences between the first noun phrase and the second noun phrase. (See Chart 19-1.)
the second noun phrase. (See Chart 19-1.) Therefore, half the test items involved
misrepresentations in cleft sentences, the other half in pseudo-cleft. In order to
rule out the possibility that any detected differences might be due to a difference
between agent and object (rather than to assertion-presupposition differences),
both types of cleft and both types of pseudo-cleft sentences were included. (See
Chart 19-1.) Half the test items involved misrepresentations in agent noun
phrases, the other half in object noun phrases. Thus the 28-item test was construc¬
ted as follows: 4 control items involving no misrepresentation between sentence
and picture, not scored; 24 scored items involving misrepresentation between
sentence and picture. (See Fig. 19-1.)

[Figure 19-1 Internal Construction of Test: 4 control items (not scored) and 24 scored items]
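Because the design is fully crossed, the makeup of the scored items can be enumerated directly. The following sketch (in Python, our illustration rather than anything in the original study) reproduces the item counts given above; the three-items-per-cell figure is inferred from the item codes in Appendix 19A.

    # Enumerate the fully crossed design: 2 sentence types x 2
    # information types x 2 noun-phrase roles = 8 cells. With 3
    # items per cell (as in Appendix 19A), that yields the 24
    # scored items; 4 control items bring the test to 28.
    from itertools import product

    cells = list(product(["cleft", "pseudo-cleft"],
                         ["assertion", "presupposition"],
                         ["agent", "object"]))
    items_per_cell = 3
    print(len(cells))                        # 8 cells
    print(len(cells) * items_per_cell + 4)   # 28 items in all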
Null Hypothesis. There is no significant difference in effect between asser-
tion items and presupposition items. In mathematical terms, we would state this
null hypothesis as H0: μa = μp, where μa is the mean of the population on the
assertion items and μp is the mean of the population on the presupposition items.
Research Hypothesis (preferred alternative hypothesis). There is a signifi-
cant difference in effects between assertion items and presupposition items; in
fact, subjects score better on assertion items (making significantly fewer errors)
than on presupposition items (making significantly more errors). In mathematical
terms, we would state this alternative hypothesis as HA: μa > μp.

Results

The principal measure was the number of times each subject correctly reported
that the sentence was not a correct representation of the picture. The important
comparison was whether the correct responses occurred significantly more
frequently when the misrepresentation involved the asserted noun phrase than
when it involved the presupposed noun phrase. Said another way, did the subjects
make significantly greater numbers of errors with the presupposition items than
with the assertion items? The results are presented in Table 19-1.

Table 19-1 Descriptive Statistics—Correct Responses to 12 Assertion
and 12 Presupposition Items

            L1 subjects (children)          L2 subjects (adults)
            Assertion   Presupposition      Assertion   Presupposition
Mean        9.90        9.05                10.038      8.923
SD          1.889       2.114               1.252       1.690
Variance    3.568       4.471               1.567       2.857
Range       6-12        6-12                8-12        5-12
It can readily be seen that both groups of subjects performed better with the
assertion noun phrases than with the presupposition noun phrases. Out of a maxi¬
mum possible score of 12, there is over a full point difference (actually 1.12)
between the mean performance on assertion items (10.04) and the mean perform¬
ance on presupposition items (8.92) for the L2 subjects. For the L1 subjects, there
is almost a full point difference (actually .85) between the two mean scores. When
the presupposed noun phrase was misrepresented, the subjects tended to make
more errors, i.e., failed more often to notice the discrepancy, than when the
asserted noun phrase was misrepresented. For the presupposed noun phrases, the
L2 subjects overlooked the misrepresentation an average of 3.08 times out of 12,
but when the asserted noun phrases were misrepresented, the average number of
errors was only 1.96. The L1 subjects overlooked the misrepresentation an
average of 2.95 times out of 12 for the presupposed noun phrases, but the average
number of errors for the asserted noun phrases was lower at 2.10.
We may also note that for the L2 subjects the range of correct responses was
higher for the assertion items (8 to 12) than it was for the presupposition items (5 to
12). For the L1 subjects the range of correct responses was the same for both
assertion and presupposition items (6 to 12). There was also relatively greater vari-
ability among the responses to the presupposition items than there was to the
assertion items: s² = 2.857 vs. s² = 1.567 for the L2 subjects, and s² = 4.471 vs.
s² = 3.568 for the L1 subjects. In other words, there was relatively greater
homogeneity in the responses to the assertion items than to the presupposition
items.
The relatively high mean scores (10.04, 8.92 and 9.90, 9.05) indicate that
the test was not too difficult for either group of subjects and that they were
generally able to carry out the assigned task;4 they were able to detect both the
directly conveyed assertion and the indirectly conveyed presupposition. It should
also be clear that these mean scores near 9.0 and above are extremely unlikely to
have occurred by chance; that is, it is extremely improbable that these scores
could have resulted from chance guessing on the part of the subjects. The proba¬
bility of a mean score of 9 or above on a 12-item true-false test is only .07. Clearly,
then, the subjects were not merely guessing but were attending as well as they
could to the task at hand. They performed well at detecting both the directly
conveyed assertion and the indirectly conveyed presupposition, better on the
former than on the latter.
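The .07 figure cited above for chance guessing can be verified from the binomial distribution. A minimal sketch (ours, not part of the original report) computes the probability of 9 or more correct out of 12 true-false items when each answer is a coin flip:

    # Probability of scoring 9 or more by guessing on a 12-item
    # true-false test: sum the upper tail of Binomial(12, .5).
    from math import comb

    n = 12
    p_tail = sum(comb(n, k) for k in range(9, n + 1)) / 2 ** n
    print(round(p_tail, 3))  # 0.073, i.e., the "only .07" cited in the text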
The two variables, assertion and presupposition, may be compared to determine
the degree of correspondence or relationship between them. Calculation of a
Pearson product-moment correlation for each group of subjects yielded the
results shown in Table 19-2.

Table 19-2 Pearson Product Moment Correlations of Assertion and
Presupposition Scores

                 L1 subjects (children)   L2 subjects (adults)
Coefficient r    .67*                     .41†
Cases            20                       52
r²               .4533                    .1674

*p < .05 (one-tailed test).
†p < .005 (one-tailed test).

For both groups of subjects, the two variables tend to covary. In the case of the
20 L1 subjects, r = .67, and in the case of the 52 L2 subjects, r = .41. (Squaring r
yields the coefficient of determination r², or variance overlap.) The coefficients
are both sufficiently high so that, taking the sample sizes into account, we may
reject a null hypothesis that the population correlation coefficients are zero,
p < .005.

Table 19-3 t-Tests—Assertion with Presupposition

           L1 subjects (children)   L2 subjects (adults)
t-value    2.33*                    4.90†
df         19                       51

*p < .02 (one-tailed test).
†p < .001 (one-tailed test).
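The rejection of a zero population correlation can be checked with the usual t transformation of r. The chapter does not spell the test out, so the following sketch is our reconstruction, using the values of Table 19-2:

    # Test r against rho = 0: t = r * sqrt(n - 2) / sqrt(1 - r^2),
    # with df = n - 2. Values are taken from Table 19-2.
    from math import sqrt

    for r, n in ((0.67, 20), (0.41, 52)):
        t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
        print(r, n, round(t, 2))  # roughly 3.83 and 3.18, well past the cutoffs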
While we might have expected the two variables to have a positive
correlation—as opposed to a negative correlation (hence the appropriateness of
the one-tailed test of significance) or a zero correlation—the difference in the
magnitude of the two correlations is puzzling. This is discussed further in the last
section of the chapter.
In order to determine the statistical significance of the differences in the
mean scores on the assertion and presupposition items, i.e., in order to test the null
hypothesis that there is no difference between the two types of items, the mean
scores for assertion and presupposition items were statistically compared using a t-
test for two related samples. The results are shown in Table 19-3.
The analysis with the t-tests yielded values of 2.33 for the LI subjects and
4.90 for the L2 subjects, both of which are significant at p < .02.5 These results
allow rejection of the null hypothesis for both groups of subjects. We are left with
our research hypothesis (the preferred alternative) that there is a significant differ¬
ence favoring assertion items over presupposition items.
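For readers who want to reproduce such computations, here is a minimal sketch of the related-samples t-test, with the accompanying Pearson r, applied to per-subject assertion and presupposition scores. The two short score lists are invented for illustration and are not the study's data.

    # Related-samples (paired) t-test and Pearson r over per-subject
    # assertion and presupposition scores (invented sample data).
    from math import sqrt

    assertion      = [10, 12, 9, 11, 10, 8, 12, 9]
    presupposition = [9, 10, 8, 11, 8, 7, 10, 9]
    n = len(assertion)

    # Paired t: mean difference over its standard error, df = n - 1.
    d = [a - b for a, b in zip(assertion, presupposition)]
    d_bar = sum(d) / n
    s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    t = d_bar / (s_d / sqrt(n))

    # Pearson product-moment correlation between the two score sets.
    a_bar, b_bar = sum(assertion) / n, sum(presupposition) / n
    num = sum((a - a_bar) * (b - b_bar)
              for a, b in zip(assertion, presupposition))
    den = sqrt(sum((a - a_bar) ** 2 for a in assertion) *
               sum((b - b_bar) ** 2 for b in presupposition))
    r = num / den
    print(round(t, 2), n - 1, round(r, 2), round(r ** 2, 2))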

Discussion
These results demonstrate a number of things about the competence/performance
with assertions and presuppositions of both adults acquiring English as a second
language and young children acquiring English as their first language. First, they
show that both groups of subjects are able to detect directly conveyed assertions
and indirectly conveyed presuppositions. Both groups performed better than
chance in recognizing either aspect of the meaning of the sentence/picture they
were presented with.
Second, the results show that there is a certain degree of correlation between
the two variables. That is, the processing of asserted information and the process¬
ing of presupposed information are not totally unrelated. Similar mental strate¬
gies appear to be involved. What isn’t clear from these results is the magnitude of
the correlation. The results from the L2 subjects appear to indicate only minimal
overlap—only about 17% of the variance in one variable is accounted for by the
variance in the other variable. The results from the LI subjects appear to suggest
greater overlap—as much as 45% of the variance overlapping. One possible expla¬
nation might be that the correlation coefficient for the LI subjects is artificially
high because of the relatively small sample size. However, this area of correlation
and overlap between the two variables requires further investigation.
Third, these results show that, although the two variables are related, they are
significantly different in the level of competence/performance. Both groups of
subjects attended better to assertions than to presuppositions; they are more likely
to be deceived by misrepresented presuppositions than by misrepresented asser¬
tions. The fact that the subjects in all these studies failed more often to notice the
discrepancy between the presupposed proposition and the picture suggests that
they tended to take for granted that this part of the picture was correct and focused
their attention on those parts of the picture relevant to determining the correct¬
ness of the asserted proposition.
These findings might best be accounted for in terms of Hutchinson’s (1971)
notion of pragmatic presupposition and his so-called “inference schema”
described earlier. Sentences like clefts and pseudo-clefts, with overt grammati¬
cally signaled distinctions between asserted and presupposed information, are
generally used in contexts where the speaker assumes that the listener already
knows the presupposed information. It follows then that the presupposed part of
the sentence is not usually providing the listener with any new information. In
judging whether or not such statements were true, the listener would be con¬
cerned primarily with the new information that the speaker is asserting. Under
time pressures such as those in the present study, the subject would presumably
first try to confirm this aspect of the meaning by rapid visual analysis of the
assertion-related portion of the picture. Based on that confirmation, and failing
additional time to further check the presupposition-related portion of the
picture, he would conclude that the statement was a correct description of the
picture.
Hutchinson describes this phenomenon as one of presupposition “swallow¬
ing” (1971, p. 137). If the hearer has no prior beliefs about the presupposition, as
would be the case in the present study, and does not have time or the inclination to
check the presupposition, as would also be the case in this study, a hearer has two
courses of action open to him:
(i) he can express surprise over the new (putative) fact brought to his attention, or
(ii) he can "swallow" the presupposition and come to believe it on the basis of his
respect for the "expertise" of the speaker (Hutchinson, 1971, p. 137).

In some instances, Hutchinson maintains, the latter course of action is the more
natural of the two. Most listeners, most of the time, do not go around expressing
surprise at the presuppositional beliefs attributed to them by speakers; rather they
tend to go along with the speaker. A convincing example of this is cited by
Hutchinson: If someone said to a listener, “The present shaman of the Chippewa is
a friend of mine," the listener would most likely conclude that there exists such a
person rather than question or express surprise over the existence of such a
person. Hutchinson says:

It is through this propensity to adopt the beliefs of others when we do not hold counter¬
beliefs that one can inform (or misinform) through presuppositions. Presuppositional
lying can be extremely effective (1971, p. 138).

This study has, in fact, demonstrated the effectiveness of presuppositional
lying and has shown a clear difference between the way presuppositions and asser-
tions are attended to by both adults acquiring English as a second language and
young children acquiring English as their first language.
A very intriguing question raised by these studies but not answered by them is
why these same results should have been found for such different groups of
subjects—adults acquiring English as a second language, young children in the
process of acquiring English as a native language, and native English-speaking
adults who have virtually completed the acquisition process. To what extent is the
distinction between assertion and presupposition a linguistic universal which can
be expected to show up in the competence/performance of any group of speakers
of any language? To what extent is the distinction language specific? I find it partic¬
ularly suggestive that the results obtained for the L2 adults and for the LI children
on the same questionnaire, using the same methods and procedures, are so similar.
These remarkably similar results strongly suggest that the assertion-presupposi¬
tion distinction is related to factors common to both first and second language
acquisition.

Notes

1. The author gratefully acknowledges the assistance provided by the following graduate students
who worked on various segments of the project which yielded these studies: Joan Jamieson, Maureen
Garry, Jonas Nartney, and Pamela Benson. Thanks are also due to the Center for English as a Second
Language (CESL) at SIU-C, which kindly granted the author access to the L2 subjects who partici¬
pated in the study, to the Child Study Cooperative Nursery School at SIU-C, which kindly granted the
author access to the LI subjects who participated in the study, and to SIU-C’s Office of Research
Development and Administration, which provided financial assistance for this project. This paper
appeared in Language Learning under the title “Empirical Investigations of Indirectly Conveyed
Meaning: Assertion versus Presupposition in First and Second Language Acquisition” (1977, 27, 353-
369). It is reprinted here, with minor editorial changes, by permission.

2. Strawson (1952) defines presupposition as the relation between two statements, A and B (read "A
presupposes B") when the truth of B is a necessary condition for the truth or falsity of A.
3. Classical logical implication is a relation between two statements, A and B (read “A implies B”),
when the truth of B follows from the truth of A and the falsity of A follows from the falsity of B. I am
following Austin (1962) in using the term "imply" in its ordinary weaker sense: "A implies B" means
only that asserting A commits the speaker to B; asserting ~B need not commit the speaker to ~A.

4. As an indication of the overall reliability of the instrument, the internal consistency reliability
coefficient of the 24-item test was computed to be .55. If the test were lengthened to a total of 100
items, the estimated internal consistency reliability coefficient would be .84.
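The projection in footnote 4 follows the standard Spearman-Brown prophecy formula; the footnote does not name the formula, so the following check is our reconstruction of the arithmetic.

    # Spearman-Brown prophecy formula: reliability of a test
    # lengthened by factor k, r_k = k*r / (1 + (k - 1)*r).
    def spearman_brown(r, k):
        return k * r / (1 + (k - 1) * r)

    # Footnote 4: r = .55 for 24 items, projected to 100 items.
    print(round(spearman_brown(0.55, 100 / 24), 2))  # 0.84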

5. A one-tailed test of significance is appropriate because the research hypothesis claims that asser¬
tion items will be less difficult than presupposition items.

Appendix 19A
Test Items by Groups
The following is the 28-item test used in the reported studies. Items are listed in
the same random order in which they were administered. If the item involved a
misrepresentation between the sentence and the picture, that misrepresented
noun phrase is underlined in the sentence and the object actually pictured is indi¬
cated on the right. Each item is coded as to the type of item it is:

Control: Control Items (no misrepresentation)


Cleft-As-Ag: Cleft-Assertion-Agent
Cleft-As-Obj: Cleft-Assertion-Object
Cleft-Pre-Ag: Cleft-Presupposition-Agent
Cleft-Pre-Obj: Cleft-Presupposition-Object
P Cleft-As-Ag: Pseudo-Cleft-Assertion-Agent
P Cleft-As-Obj: Pseudo-Cleft-Assertion-Object
P Cleft-Pre-Ag: Pseudo-Cleft-Presupposition-Agent
P Cleft-Pre-Obj: Pseudo-Cleft-Presupposition-Object

EXAMPLE:

A. It is a gun that the soldier is holding.


B. What the girl is holding is a doll. book

C. What is in the tree is a bird.


D. It is a man that is holding the telephone. lady

TEST ITEMS:

Control 1. It is a sack that the man is carrying.


P Cleft-Pre-Ag 2. What the man is riding is a bicycle. girl

Cleft-As-Ag 3. It is a boy that is eating the ice cream. girl

P Cleft-As-Obj 4. What the baby is holding is an umbrella. bar bells


Control 5. It is an arm that the boy has hurt.


Control 6. What the man is climbing is a mountain.
P Cleft-As-Obj 7. What the boy is breaking is a window. rock

P Cleft-Pre-Ag 8. What the elephant is holding is a ball. seal

Cleft-As-Ag 9. It is a girl that is behind the bars. man

P Cleft-Pre-Ag 10. What the bird is eating is a tree. beaver


Cleft-Pre-Ag 11. It is a belt that the girl is putting on. man
P Cleft-As-Obj 12. What the man is carrying is a book. umbrella

Cleft-Pre-Obj 13. It is a man that is holding the gun. sword

Cleft-Pre-Obj 14. It is a doctor that is checking the boy. dog

Cleft-As-Obj 15. It is a coat that the girl is sewing. pants

Control 16. What the lady is holding is a telephone.


P Cleft-Pre-Obj 17. What is in the tree is a bird. cage

Cleft-As-Obj 18. It is a shoe that the man is holding. hat

Cleft-Pre-Ag 19. It is a notebook that the man is writing in. woman

Cleft-As-Obj 20. It is a gun that the lady is holding. bow and arrow

P Cleft-Pre-Obj 21. What is leaning on the floor is a ladder. wall

Cleft-As-Ag 22. It is a man that is holding the baby. woman

P Cleft-As-Ag 23. What is in the tree is a cat. bird

Cleft-Pre-Obj 24. It is a lady that is brushing her teeth. hair

P Cleft-As-Ag 25. What is smelling the flower is a boy. ape

Cleft-Pre-Ag 26. It is a bracelet that the lady is looking at. man

P Cleft-As-Ag 27. What is on the tracks is a truck. train

P Cleft-Pre-Obj 28. What is on the floor is a picture. easel


Appendix 19B
Sample Pictures
The following are eight sample pictures used in the reported studies.* Pictures are
numbered to correspond to test-item numbers.

*The pictures are taken from Peabody Picture Vocabulary Test: Series of Plates by Lloyd M. Dunn,
Ph.D., published by American Guidance Service, Inc. They are reproduced here by permission.
Chapter 20

Can ESL Cloze Tests Be Contrastively Biased?—


Vietnamese as a Test Case1

Craig B. Wilson

The notion that a contrastive analysis will facilitate the identification of


syntactic structures with which second language learners will have trouble is
a belief evidently held by many linguists, textbook writers, and language
teachers. The experiment reported here tested a necessary assumption for
this claim, namely, that the set of syntactic difficulties in learning a given
target language (in this case English) is unique to each language background
(in this case Vietnamese). On the basis of a contrastive analysis, English
cloze tests were structurally modified to be biased against speakers of Viet¬
namese. The tests were administered to speakers of a wide range of different
languages in several ESL programs, and to a small group of native speakers
of English. It was predicted that the biasing would result in a significantly
poorer performance by the Vietnamese. An analysis of covariance control¬
ling differences in initial proficiency showed no difference on the contrast¬
ively biased tests between Vietnamese learners of ESL and those with other
language backgrounds. In fact, all groups tended to rank the tests and test
items in the same order of difficulty.

In his landmark formulation of the structuralist contrastive analysis approach,


Lado encouraged the application of contrastive analysis to language testing as well
as teaching:

The view of grammar as grammatical structure opens the way to a comparison of the
grammatical structure of the foreign language with that of the native language to
discover the problems of the student in learning the foreign language. The results of
such a comparison tell us what we should test and what we should not test... (1957, p.

The experiment reported here was intended to measure the degree to which cloze
tests that are deliberately biased on the basis of contrastive analysis would actually
be harder for Vietnamese than for speakers of other languages. It tests the
approach to ESL for Vietnamese described in a guide to teachers of Vietnamese
refugees published by the Center for Applied Linguistics:

The teacher of Vietnamese students can tell in advance which lessons will be difficult
for his students by comparing the structure taught in a lesson with the parallel structure
in Vietnamese... (National Indochinese Clearinghouse, 1975, p. 7).

Method
Subjects. Subjects came from seven ESL programs with Vietnamese students.2
Students were judged to be at the intermediate stage of ESL learning or higher.
The 72 subjects from 12 language backgrounds were grouped according to their
native languages into three main categories: Vietnamese (37), other nonnatives
(26), and native speakers of English (9). For the sake of judging degrees of interfer-
ence, the group designated "other nonnatives" was divided further into Southeast
Asians: Hmong (5), Laotian (4), and Cambodian (1); other Asians: Chinese (5) and
Japanese (1); and Indo-Europeans: Spanish (6) and Farsi, Hindi, Armenian, and
Russian (1 each). In a few cases, respondents indicated two languages as their first.
The language used by the minority in the country in question was chosen.3

Tests. The cloze tests were from four Reader's Digest articles which were
reduced to about 200 words each (see Appendix 20A). A contrastive analysis of
Vietnamese and English grammatical structures served as the basis for biasing
three of the passages. The author of that contrastive analysis shares Lado’s
approach to learning difficulties:

The fundamental principle guiding the writing of this contrastive grammatical analysis
of English and Vietnamese is the conviction held by many linguists and foreign lan¬
guage teaching specialists that one of the major problems in learning a foreign language
is the interference caused by the structural difference between the native language of
the learner and the foreign language to be learned (Nguyen Dang Liem, 1967, p. xii).

Four cloze tests were constructed: (1) Every fifth word was deleted from the
unmodified passage which served as the control test; (2) a selected deletion test was
constructed over the second passage by choosing to place cloze deletions in struc¬
tures predicted by the contrastive analysis to be difficult for Vietnamese; (3) a
third salted test was constructed by loading the text with structures predicted to be
hard for Vietnamese by contrastive analysis and simply deleting every fifth word
on a random basis; (4) the remaining passage was used to construct a double-
biased test by salting and by carefully selecting difficult structures for deletion
points.
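A fixed-ratio deletion of the kind used for the control and salted tests is mechanical enough to sketch in a few lines. The function below is our own illustration in Python, not a tool used in the study; the selected deletion and double-biased tests would instead take a hand-picked list of target positions, but the bookkeeping is the same.

    # Build a fixed-ratio cloze test: replace every nth word with a
    # blank and keep the deleted words as the scoring key (n = 5
    # here, matching the control test's every-fifth-word deletion).
    def make_cloze(text, n=5, blank="____"):
        words = text.split()
        key = []
        for i in range(n - 1, len(words), n):
            key.append(words[i])
            words[i] = blank
        return " ".join(words), key

    passage = "Arthur Mitchell was one of the first black ballet dancers"
    cloze_text, key = make_cloze(passage)
    print(cloze_text)  # Arthur Mitchell was one ____ the first black ballet ____
    print(key)         # ['of', 'dancers']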
The Flesch readability formula showed the modified texts (Appendix 20A) to
be of nearly identical difficulty, and of similar interest levels. However, the Fry
(1968) readability index ranked the control test and the selected deletion test as
appropriate for readers between grades six and seven. The salted test was rated at
seventh grade reading level and the double-biased test at between grades seven
and eight. A similar ranking using the SMOG index (McLaughlin, 1969) rated the
first three tests at eighth grade level and the fourth at ninth grade level.
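The Flesch and SMOG figures can be recomputed from simple counts. The formulas below are the published ones (Flesch's reading ease and McLaughlin's SMOG grade), while the counts passed in are illustrative only, not measured from the test passages; the Fry index, being read off a graph, is not reduced to a formula here.

    # Flesch Reading Ease and SMOG grade from raw counts.
    from math import sqrt

    def flesch_reading_ease(words, sentences, syllables):
        # Flesch (1948): 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

    def smog_grade(polysyllables, sentences):
        # McLaughlin (1969); polysyllables = words of three or more syllables
        return 1.043 * sqrt(polysyllables * (30 / sentences)) + 3.1291

    # Illustrative counts for a roughly 200-word passage:
    print(round(flesch_reading_ease(200, 14, 280), 1))  # about 73.9
    print(round(smog_grade(12, 14), 1))                 # about 8.4 (grade level)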

Results and Discussion


Three hypotheses were tested: (1) the Vietnamese mean scores on the three
biased tests were expected to be lower than on the control; (2) the mean score of
the Vietnamese on the double-biased test was expected to be lower than that on
either the selected deletion test or the salted test; and (3) the Vietnamese scores
on biased tests adjusted for the covariate control test were expected to be lower in
every case than similarly adjusted scores for speakers of other languages.
The predicted ranking of tests (control, followed by single bias, then double-
biased) was found as expected. On the other hand, this was true for all groups as is
shown in Table 20-1, rather than only for the Vietnamese as hypotheses 1 and 2
would predict. Furthermore, as shown in Table 20-2, contrary to the third predic¬
tion, the analysis of covariance indicated no significant variation across groups
once the adjustments were made for initial differences on the control test

Table 20-1 Means and Standard Deviations on the Four Cloze Tests for Each
Language Group

                           Control         Selected deletion   Salted          Double-biased
Language group             Mean    SD      Mean    SD          Mean    SD      Mean    SD
Vietnamese                 15.89   7.85    11.27   6.61        9.76    7.11    7.41    5.14
Other SE Asian languages   12.20   6.51    7.40    5.30        6.00    2.62    5.80    2.82
Chinese and Japanese       23.33   5.92    15.50   7.82        14.83   7.22    11.33   5.28
Indo-European              20.70   7.04    17.30   7.21        14.80   6.34    11.40   6.80
Native English             26.89   5.37    23.44   4.93        24.00   5.77    21.00   7.57

Table 20-2 Analysis of Covariance Contrasting Vietnamese
with Other Nonnatives on Each Biased Test, with the Control
Test as the Covariate

Test                Vietnamese mean   Other nonnatives mean   df     F*
Selected deletion   11.99             12.06                   1,60   .006
Salted              10.43             10.45                   1,60   .001
Double-biased       7.91              8.51                    1,60   .611

*All the contrasts failed to achieve significance, p > .05.
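An analysis of covariance of the kind summarized in Table 20-2 can be reproduced with standard regression tools. The sketch below uses Python's statsmodels with an invented six-subject data frame; the column names and values are ours, purely illustrative.

    # ANCOVA: biased-test score by language group, adjusting for the
    # control test as covariate; the F for C(group) is the adjusted
    # group contrast of the kind reported in Table 20-2.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "group":   ["Vietnamese"] * 3 + ["Other"] * 3,
        "control": [14, 20, 10, 12, 22, 9],
        "salted":  [9, 15, 6, 8, 16, 5],
    })
    model = smf.ols("salted ~ control + C(group)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))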


The mean product moment correlation (not tabled) between scores on the
biased tests and scores on the control test was .88 for the Vietnamese and .85 for
the other nonnatives taken together. The correlations between item ranks on the
several tests also led us to reject the hypothesis that the Vietnamese performance
is in any way due specifically to their language background. The average Spearman
rank-order correlation between the item ranking by Vietnamese and the ranking
by the several other language groups taken together (including the native English
speakers) was .70 on the selected deletion test, .78 on the salted test, and .67 on
the double-biased test. The corresponding Pearson product moment correlations
of the average scores on test items over all four tests were .85 for the Vietnamese
with the other Southeast Asians; .83 for the Vietnamese with the Chinese and
Japanese Asians; .81 for the Vietnamese with the Indo-Europeans; and .64 for the
Vietnamese with the native speakers of English.
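The Spearman rank-order coefficients reported here come from ranking the items by difficulty within each group and correlating the ranks. A minimal sketch (our own, with invented item difficulties and no handling of tied ranks):

    # Spearman rank-order correlation between two item-difficulty
    # orderings: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties.
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    def spearman_rho(x, y):
        rx, ry = ranks(x), ranks(y)
        n = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n * (n ** 2 - 1))

    viet_items  = [.30, .55, .42, .70, .25]   # proportion correct per item (invented)
    other_items = [.35, .60, .40, .75, .20]
    print(spearman_rho(viet_items, other_items))  # 1.0 for identical orderings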
In conclusion, there is no evidence of a difference in performance due to
native language interference for the Vietnamese group. These findings, though
they do not constitute a final answer to the contrastive analysis issue, are nonethe¬
less convincing enough to suggest that native language interference may be a less
important factor than it has often been claimed to be. The author believes,
however, that interference models built on foundations other than structuralist
models also need to be tested. Ross (1976), for example, suggests that contrastive
analysis should be directed to the use of structures rather than to their surface
forms. The question is whether experimentally testable hypotheses can be
formulated.

Notes

1. This is an expanded version of a paper entitled Can Discourse Be Language-Biased:
Vietnamese Performance on "Biased" Cloze Tests, which was presented at the First International
Conference on Frontiers in Language Proficiency and Dominance Testing, at Southern Illinois
University, Carbondale, Ill., on Apr. 22, 1977.
2. The experiment would not have been possible without contributions of completed tests from the
students and staffs of the following ESL programs: Stephen Schumacher and Leo J. Brown, Adult
Basic Education Program Evaluation and Developmental Center, Southern Illinois University,
Carbondale, Ill. Six ESL program administrators were located from a list provided by Anne P. Gonvery
of the National Indochinese Clearinghouse, Center for Applied Linguistics: Michael Jane Buss,
International Institute of Boston, Boston, Mass.; Cynthia J. Groom, Colorado State Indochinese
Refugee Assistance Office, Denver, Colo.; Bienvenido D. Junasa, Office of the Governor, State Com-
mission on Manpower and Full Employment, State Immigrant Service Center, Honolulu, Hawaii;
Elisabeth Price, International Institute, St. Louis, Mo.; David I. Quitugua, Diocesan Resettlement
Office, Tamuning, Guam; John Westbrook, Brookdale Community College Learning Center, Long
Branch, N.J.
3. Tests for over 180 students were received. However, the results of more than half of the
respondents were not included in the experiment because the students had failed to complete one or
more of the tests.
Appendix 20A
Cloze Tests

NAME AGE

ARTHUR MITCHELL (Control Test)

Arthur Mitchell was one of the first black ballet dancers to be in a major
American ballet company. Now, after doing well as a dancer, he is teaching ballet
to black boys and girls.
When he was a boy, Arthur never dreamed of being a famous dancer. But one
day a teacher noticed him dancing at a party and encouraged him to study ballet.
He tried out and gained admission into a high school for the arts. When he
graduated, some people gave him enough money to study ballet all the time.
At the arts school, Arthur had to work very hard. "I'm a fighter. I was
determined to succeed," he says of those days. He began to perform in the
school's troupe and stayed with it for fourteen years. He became one of its best
dancers.
The murder of Martin Luther King in 1968 changed Mitchell's life. With the
death of a man loved by so many people, Arthur asked himself: What can I do in
his memory? He decided to teach ballet to black children, who usually have little
chance to study that art. Mitchell chose to open a school in Harlem, in New York
City. He surprised many people by creating a good dance company. His
students have danced before crowds in many parts of the world.
Arthur is still teaching ballet in Harlem. He continues to give confidence to
shy children and to teach others to work hard at everything they do. Many people
are glad that Arthur chose this way to honor Martin Luther King.

NAME AGE

PELE (Selected Deletion Test)

Edson Arantes do Nascimento, also known as Pele, is famous around the


world as a soccer player. Actually, most people would say that he is the best. Pele
might agree with them.
Pele grew up during the 1940's in the interior of Brazil. His father made
money for the family by playing soccer. When Pele was a boy, he, too, played
soccer with his friends. At first, the ball was only made of rags, and the boys
played in the street. But by the time he turned fifteen, Pele was able to play for
money. As he matured, he became the best player in Brazil. He helped his
country win three world championships. Brazilians even think of him as a
national treasure. He is the model for others who want to play well. Says Pele:
"I was born for soccer, just as Beethoven was born for music."
When he quit playing soccer in Brazil, Pele moved to the United States,
where he joined a team in New York and helped make it great. This created an
interest in soccer in the United States.
Pele loves to work with children. Youngsters in forty countries around the
world have seen him play. He really enjoys teaching soccer. But he also believes
that children who learn this game will learn to work hard and to act fairly in all
that they do.

Pele’s playing might be useful in another way. Once, while the Biafran war
was raging, he went to that area to play. On the two days he was there, the fighting
actually stopped so that people could go to see him on the field.

NAME AGE

JEAN EYMERE (Salted Test)

Jean Eymere is a champion skier who lost his sight to disease when he was
thirty-three. But he has not let blindness defeat him. Instead, he has learned to use
his knowledge to help other sightless people to live fuller lives.
Jean was a skiing teacher until 1969. That was the year when disease blinded
him. Though encouragement was offered by friends, Jean himself felt that his
life had become empty. However, he was finally persuaded by another skier to
try his sport again. When Jean attempted to ski, he found he could hardly even
stand up. Yet, there was the realization that he could learn to ski again, despite
being blind.
It was this experience that gave Eymere the idea for his life's work. Skiing
was dangerous for the sighted. Jean dared to believe that blind people could
learn. Although some did not agree with him, Jean began encouraging the blind
to ski. "I want to make other blind people laugh and feel the wind on their
faces, as I have," he would say.
In 1970, Jean began BOLD, Inc., for the blind. This group was established
in order to teach sightless people new skills and to show them how to enjoy
nature. That it has worked is clear. Many are learning that there are ways for
them to fish, build houses, and even to ski. As a result, more
blind people are living active lives.
Eymere himself is a good example of what some can do. Besides ski racing
and figure skating, he has also climbed a 14,265-foot mountain. Being blind has
not kept him from enjoying life.

NAME AGE

LORETTA LYNN (Double-Biased Test)

Loretta Lynn is the most popular woman country music singer in the United
States. Housewives especially like her. Her songs are full of the wisdom she herself
gained as a wife and mother.
Loretta was born the second of eight children to a poor family living in
Kentucky. Whenever he could, her father mined coal or worked on federal
projects for those without jobs. He was helped by a wise wife. But they remained
poor because saving money was hard in those days. Then, at only thirteen,
Loretta chose to marry, though few would marry at that age. ____ from
Washington State, her husband decided to move his family back there. This
meant that young Loretta had to carry the load of raising a growing family far
away from her parents.
There was always a lot of work for the new wife to do. She found that
singing would make her happy. It was this ____ that showed her husband that
Loretta had much talent.
When Loretta did make a record, in 1960, many people were impressed. She
then began singing on stage. In 1961, an award was given to Loretta by a large
club, calling her the "Most Promising Female Singer" of the year. Her best fans
are other women. She herself explains it this way: "I think about the housewife
who's watching the kids all week. I was like that for a long time myself."
Loretta is a success. She knows what her audiences are like. Housewives
identify her as one of their own. They see her as a friend who understands.
Part VI Discussion Questions

1. Fishman boldly asserts in her title that “we all make the same mistakes.” Dis¬
cuss the evidence that she presents along with that presented in Chaps. 19 and
20 by Carrell and Wilson. To what extent does her observation seem to be true?
How much of the total variability across test items in Fishman’s study, Carrell’s,
and Wilson’s appears to be common to native and nonnative speakers?

2. Discuss the specific errors committed by natives and nonnatives which are
used to illustrate the various error categories employed by Fishman (see
Table 18-1). Consider the errors that seem to be identical for natives and non¬
natives. Also consider any that seem to be different. What strategies seem to
be common to both groups, and what strategies seem to be unique to one or the
other?

3. Is there any reason to expect the performance of nonnatives to be less reliable


than that of native speakers of English on any of the tests used in the papers in
Part VI? What effects would a lowering of reliability of performance have on
the similarity of patterning across the two groups?

4. Would you expect implicative meanings to be equal to assertive meanings, or


more difficult than assertive meanings? How might you expect implications to
relate to presuppositions in terms of difficulty? Discuss your conclusions in
the light of Carrell’s findings.

5. The reliability of Carrell’s 24-item test was .55 according to her footnote 4.
How does this reliability compare with some of those observed in relation to
other testing procedures discussed in earlier chapters? What possible expla¬
nations might be offered to explain the contrast? Can Carrell’s test be classed
as integrative or discrete-point? Is it a pragmatic procedure? Bear in mind the
fact that her test does require the mapping of utterances onto contexts.

6. Consider the claims of contrastive analysis. Is Wilson’s study a fair test of


some of those claims? What results would be expected if the sorts of assump¬
tions that contrastive analysts make were correct in general? How much of the
variability across test items seems to be common to the Vietnamese group and
the other groups as well? Is any of the reliable variance not common to all the
groups? (The analysis of covariance in Table 20-2 answers the latter question.)
What explanation can be given for the failure of the contrastive analysis pre¬
dictions?


7. Note that Wilson remains (see his concluding remarks) unconvinced of the
failure of contrastive analysis to provide a suitable basis for explaining the
performance of the Vietnamese learners on the ESL tests. He seems to believe
that a different sort of contrastive analysis might get better results. Discuss
some of the options for using a functional contrastive analysis rather than a
structural one. (See Ross, 1976.)
Part VII

Measuring Factors Supposed to


Contribute to Success in Second
or Foreign Language Learning

How strongly are measures of aptitude, attitude, and other variables related
to second language proficiency? Is the Modern Language Aptitude Test, for
instance, a good predictor of success in university level foreign language
study? Does motivation really seem to influence the degree of success
attained in learning ESL? Are self-reported differences in amounts of
practice in using the target language significantly related to its acquisition?
Are variables that should be expected to correlate with ESL proficiency
more strongly related to it than variables which have no theoretical relation
to ESL learning? Can a measure of redundancy utilization (e.g., the
tendency to use correctly morphemes which are largely redundant) be
shown to be a good index of the degree of integrativeness of ESL learners?
These questions and related issues are dealt with in Part VII.
Chapter 21

The Correlation between Aptitude Scores


and Achievement Measures
in Japanese and German

Sadako 0. Clarke

Relationships between scores on the Modern Language Aptitude Test and


achievement measures were studied in a sample of 91 college students
enrolled in German and Japanese courses at Southern Illinois University at
Carbondale. Correlations between the MLAT and total scores on the various
tests ranged from .26 to .74 (p < .02). Thus foreign language aptitude scores
are significantly associated with level of achievement in the language
classes. However, the results suggest that achievement in the language
courses is also related about as strongly to factors other than aptitude scores,
and the MLAT is not consistently a good predictor of achievement.

Aptitude tests have been used for a variety of purposes by educators, e.g., for
grouping within classes and determining whether a student has the potential for
future language study (a use by the Foreign Language Institute). This study was
designed to determine the strength of the Modern Language Aptitude Test as a
predictor of foreign language achievement scores. More specifically we wanted to
know whether scores on the MLAT are as highly correlated with language achieve¬
ment for Japanese as for Indo-European languages. German was selected as a
representative of the latter language family. We were also concerned to discover
what effect course status (whether elective or required) had on aptitude-
achievement correlations and whether these correlations were higher for some
language skills than for others. Finally, we attempted to discover whether previous
language training had any effect on aptitude scores and what effect this training had on
achievement.

Method

Subjects. Ninety-one Southern Illinois University students served as subjects.


Sixty-nine were enrolled in elementary German courses in the academic year
1976-1977 and the remaining 22 were students enrolled in elementary Japanese
for at least one year between 1974 and 1977. The students in German classes were
divided into two groups: one group of 25 students (36%) took German as an
elective and the other group of 44 students (64%) took it as a requirement. Almost
all the students in the Japanese classes took it as an elective (91 %). All the subjects
came from English-speaking homes and completed both the fall and spring
semesters of foreign language study.

Materials. Two major types of instruments were used: the Modern Language
Aptitude Test (MLAT) and various achievement tests given by the language
instructors. A questionnaire was also used to elicit information about behavior and
experience outside the classroom.

MLAT. The short form of the MLAT developed by Carroll and Sapon (1959)
was used. It consists of Parts III, IV, and V from the longer version. In Part III,
Spelling Clues, students select the correct meaning of disguised spellings of
English words. In Part IV, Words in Sentences, students respond to various aspects
of English Grammar but without having to use specific grammatical terminology.
This part purports to measure sensitivity to grammatical structure. In Part V,
Paired Associates, students memorize pairs of words. This supposedly measures
their ability to learn rapidly by rote. This short form of the MLAT requires approxi¬
mately 30 minutes of testing time and was given at the beginning of the spring
semester.

Achievement Measures. The measures of attained language proficiency were


first and second semester total points in elementary German and Japanese. They
were obtained at the end of the first year of language study. A total of 1,000 points
was possible in German: 300 for grammar, 50 for vocabulary, 90 for dictations,
175 for compositions, 75 for oral quizzes, 50 for mid-term exam, 60 for labora¬
tory work, 100 for class participation and homework, and 100 for the final exam.
The Japanese points totaling 600 were distributed equally over tests of grammar,
vocabulary, listening, speaking, reading, and writing. The achievement tests used
in Japanese from 1974-1977 were very similar and were based on the same text¬
book, workbook, and mimeographed materials.

Questionnaire. The questionnaire asked the students for information regard¬


ing their purpose in taking either Japanese or German and whether or not they had
received previous language training. Questions included: Why do you want to
study the language? How long have you previously studied the language? Have you
ever been in the foreign country? If yes, how long? Do you have any relatives who
are native speakers? Do your parents use the language at home? If yes, occa¬
sionally or frequently?
Table 21-1 Mean Scores and Standard Deviations for the Short Form of the
MLAT (Total and Parts) and for Fall and Spring Achievement Scores (Total
and Grammar Subscore)

                                  Japanese Total   German Total   German Elective   German Required
                                  (N = 22)         (N = 69)       (N = 25)          (N = 44)
Variables (possible score)        Mean    SD       Mean    SD     Mean    SD        Mean    SD
MLAT Part III (50)                23.32   9.70     19.94   8.12   24.72   7.48      17.23   7.22
Part IV (45)                      21.23   6.74     25.33   7.48   27.00   7.12      24.39   7.59
Part V (24)                       17.45   5.30     18.86   4.52   19.84   4.20      18.30   4.64
MLAT Total (119)                  62.00   18.19    64.13   15.23  71.56   13.62     59.91   14.60
Achievement, fall 1976, %         90.75   5.44     86.09   7.95   87.27   8.89      85.42   7.38
Spring 1977, %                    85.53   8.07     84.61   10.09  87.12   9.49      83.19   10.24
Grammar subscore, spring 1977, %  85.11   8.00     81.61   13.95  84.81   13.03     79.80   14.27

Results and Discussion


First we will compare correlations between aptitude scores and achievement ob¬
tained for Japanese and German classes. Then we will deal with some of the other
variables that are possibly important to foreign language learning. Table 21-1
shows the means and standard deviations for part and total scores on the MLAT,
Achievement scores for the fall and spring semester expressed as percentages of
the total possible points, and spring Grammar scores. The Grammar scores are
reported because they probably reflect most nearly the aspects of foreign language
achievement best predicted by the MLAT. The mean for the elective group in
German (71.6) was considerably higher than for the required group (59.9). The
means (62.0 and 64.1) in this study were actually a little higher than for the
criterion group (61.2) reported in the manual for the MLAT (Carroll and Sapon,
1959). The mean Achievement scores ranged from 83 to 90, with the fall semester
higher than the spring semester in all cases.

Table 21-2 Correlations between the MLAT Scores and Fall and Spring
Achievement Scores in Japanese and German

                 Japanese (N = 22)                  German (N = 69)
                 Fall 1976  Spring 1977  Grammar    Fall 1976  Spring 1977  Grammar
MLAT Part III    .52*       .58*         .66*       .26*       .20*         .17
Part IV          .23        .26          .50*       .48*       .34*         .33*
Part V           .56*       .43*         .70*       .09        .07          .03
Total            .53*       .53*         .74*       .40*       .29*         .26*

*p < .05.

Pearson product moment correlation coefficients were computed for MLAT


part and total scores with Japanese and German Total Achievement scores for both
fall and spring and for spring Grammar subscores. These are reported in Table
21-2. In general, the correlation coefficients for the Japanese group were higher
than for the German group. The correlation coefficient between the MLAT total
scores and fall Achievement scores for the Japanese group was .53; for the German
group it was .40. In the spring the respective correlations were .53 and .29. The
correlation coefficient between the MLAT total score and Grammar subscore for
the Japanese group (.74) was much higher than for the German group (.26). There
was no relationship between Part V of the MLAT score and the fall and spring
Achievement scores for the German group.
Carroll and Sapon (1959) had said that the MLAT is more useful in predict¬
ing success in Indo-European languages using the Roman alphabet than in such
languages as Japanese and Chinese. Thus the correlation coefficient should be
higher for German than for Japanese. The reasons for the opposite result are not
entirely clear. Although the correlations for the German group were lower than for
the Japanese, they were similar to those obtained for original reference groups
(Carroll and Sapon, 1959), as shown in Table 21-3.
In addition to aptitude scores, a number of other variables might be expected
to affect achievement. For instance, students who elect to take a foreign lan-
guage may be expected to outperform students who are required to do so. Sixty-
four percent of the subjects in the German classes were from the College of
Science which requires one year of foreign language study. Chemistry majors are

Table 21-3 Validity Coefficients for the MLAT as a Predictor of
Achievement Measures in College German: SIU Sample Compared
against Original Reference Groups*

Group              N    Achievement measures                 r     Mean   SD
SIU sample         69   Fall semester Achievement scores     .40   64.1   15.2
                        (German only)
                   69   Spring semester Achievement scores   .29   64.1   15.2
                        (German only)
Criterion samples
H                  24   Course grade                         .47   61.0   14.2
K                  22   Course grade                         .29   61.7   13.0
K                  21   Course grade                         .36   70.8   12.6
J                  37   Final exam grade                     .40   76.6   9.8
J                  38   Course grade                         .30   76.6   9.7

*See Carroll and Sapon, 1959.


Clarke: Aptitude scores and achievement measures 223

Table 21-4 Correlations between the MLAT Scores and Fall and Spring
Achievement Scores for German Students Broken Down into Elective and
Required Groups

                 Elective (N = 25)                  Required (N = 44)
                 Fall 1976  Spring 1977  Grammar    Fall 1976  Spring 1977  Grammar
MLAT Part III    .09        .01          -.09       .33*       .20          .20
Part IV          .35*       .30          .29        .55*       .32*         …
Part V           .04        -.07         -.29       .11        .09          .14
Total            .24        .15          .01        .49*       .29*         .31*

*p < .05.

also required to take German. Table 21-1 shows that the German students who
took the courses as electives obtained the highest mean score (71.56) on the
MLAT. German students who took the courses as requirements on the other hand
obtained a mean of 59.91, and the Japanese group obtained a mean of 62.00. How¬
ever, the German elective group did not get the highest score for academic
achievement (87.27); the Japanese group did (90.75). However, as shown in Table
21-4, there were no significant aptitude-achievement correlations for the elective
German students except between Part IV of the MLAT score and the fall Achieve¬
ment score (r = .35). All the other correlations in Table 21-4 for the elective
German group were insignificant. This confirms Cooper's finding (1964) that only
grammar tests seem to predict first semester grades of students in German. Our
study also supports Carroll’s (1958) observation that English grammar tests fre¬
quently have been highly predictive of success in foreign language classes (though
not necessarily of success in actual foreign language acquisition; see Krashen,

1977). . , ,
Table 21-5 shows the correlation between previous language training and
MLAT scores* Carroll and Sapon (1959) note that very little direct evidence has
been obtained concerning the relation between previous language study and the
MLAT scores. But here students in the German group who had studied Latin in the
past showed a high correlation with Part III (r = .89, p < .001, N = 9) and the
total score of the MLAT (r = .73, p < .01, N = 9). Could prior study of Latin have
an effect on linguistic aptitude as measured by the MLAT? Does the MLAT
measure what Latin courses teach? There certainly appears to be a substantial
amount of shared variance between the MLAT total and especially the Part III
subscores and previous study of Latin. Table 21-5 also shows that Latin is the only
language for which prior study is correlated to aptitude. Could this be because the
methods of instruction used in Latin (e.g., teaching a good deal about English word
roots, etc.) contrast markedly with those used in modern languages?

*In spite of the fact that many of the correlations in this table appear to be quite strong (both positive
and negative ones), the numbers of subjects are so small in most cases that the correlations (even the
seemingly large ones) rarely achieve significance.
Table 21-5 The Effect of Previous Language
Training on the MLAT Scores for the German and
Japanese Groups

German group
Language    N    MLAT III   MLAT IV   MLAT V   Total
German      69   .15        -.02      .0015    .07
French      7    .48        .23       .08      .34
Spanish     22   -.10       .04       -.29     -.12
Latin       9    .89*       .06       .46      .74*

Japanese group
Language    N    MLAT III   MLAT IV   MLAT V   Total
Japanese    22   .17        -.004     .08      .12
Spanish     9    .26        -.17      -.03     -.26
Chinese     3    .79        .13       .28      .39
French      4    -.77       -.34      .32      -.57

*p < .01.

Table 21-6 shows the relationship of Achievement scores to factors other


than aptitude: (1) amount of time spent studying the language, (2) parents’ or
spouses’ use of the foreign language, and (3) students’ travel experience or “time
abroad.” These variables were also correlated with scores on the MLAT. Data for
all three variables were obtained for both groups, but so few of the German
students had either traveled abroad or been exposed to the language outside the
classroom that the latter two variables were not meaningful in relation to that
group. Hence no computations are reported on these two variables for the German
students, but see Fig. 21-1 for a discussion of the effect of time spent studying the
language. None of the variables were significantly correlated with the MLAT
scores. However, amount of study was, as should be expected, correlated with
Achievement. Significant effects were apparent in all scores—Reading, Writing,
Grammar, and Vocabulary. Also students from homes where Japanese was spoken
attained greater competence in speaking (r = .71). However, parents’ influence
on achievement was not as strong as travel and time spent studying. Time abroad
also had a potent effect on students’ achievement: Listening, Speaking, Reading,
and especially on Vocabulary (r = .79). (See Table 21-6.)
Figure 21-1 includes two diagrams displaying correlation coefficients for
German and Japanese groups. Amount of time spent in studying the language was
more strongly related to achievement for the Japanese than for the German
students. This variable was not significantly correlated with the MLAT scores,
however, for either group.
Table 21-6 Correlations between Behavior and Experience outside the
Classroom with Achievement in Japanese

                              MLAT    Fall          Spring
                              total   Achievement   Achievement   Listening   Speaking
Amount of time studying
  Japanese                    .12     .47*          .53*          .62*        .63*
Parents' use of Japanese      .21     .40*          .32           .41         .71*
Time abroad                   -.15    .47*          .46*          .65*        .61*

                              Reading   Writing   Grammar   Vocabulary
Amount of time studying
  Japanese                    .64*      .70*      .67*      .75*
Parents' use of Japanese      .53       .47       .49       .33
Time abroad                   .62*      .57*      .60*      .79*

*p < .05.

[Figure 21-1. A simple schematic of the correlations between amount of time
spent studying Japanese or German with the MLAT and achievement measures;
one panel for the Japanese group (N = 22) and one for the German group (N = 69).]
It should be noted that Japanese achievement may have been affected by the
smaller number of students in Japanese classes and by the fact that they were
encouraged to work with native speaking Japanese students on a tutorial basis.
This sort of experience was not available for the German students.
In summary, it would appear that achievement in foreign language study is
significantly (though weakly) related to scores on the MLAT. Contrary to the
expectation of Carroll and Sapon (1959), the MLAT predicted achievement in
Japanese (not an Indo-European language) better than achievement in German (an
Indo-European language with Roman script). Further, for the Japanese subjects
time spent in studying the language, parents’ or spouses’ use of the language, and
time in a country where the language was spoken all had significant impacts on
achievement. The possibility that the MLAT may measure rather well whatever is
taught in Latin courses is fairly strongly suggested.
Further study is needed on all the foregoing issues, but it seems clear that the
MLAT by itself leaves a rather large margin of error as a predictor of foreign
language achievement. Between 45 and 90% of the variance in foreign language
achievement in this study for both the German and the Japanese students was not
predicted by the MLAT. Furthermore, and much more importantly, the results
obtained here are if anything more encouraging than is typical of previous research
with the MLAT (Carroll and Sapon, Manual 1959; also see Carroll, 1967).
Indeed, the predictions for the Japanese group were considerably better here than
those reported for the German reference populations in the MLAT test manual.
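The 45-to-90% band quoted above is simply 1 − r² at the extremes of the observed aptitude-achievement correlations; a brief check (our arithmetic, using the Grammar correlations of Table 21-2):

    # Unpredicted share of achievement variance, 1 - r^2, at the
    # strongest and weakest Grammar correlations in Table 21-2.
    for r in (0.74, 0.26):
        print(r, round(1 - r ** 2, 2))
    # .74 -> 0.45 (45% unpredicted); .26 -> 0.93 (roughly the 90% end)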
Chapter 22

Behavioral and Attitudinal Correlates of Progress


in ESL by Native Speakers of Japanese1

Mitsuhisa Murakami

To investigate possible behavioral and attitudinal correlates of proficiency


in listening and reading for native speakers of Japanese in an ESL situation,
a self-report questionnaire and two types of language proficiency tests were
used—a modified dictation procedure and a cloze test. For the dictation, the
test materials were taped from various radio programs and, for the cloze test,
materials from two rather easy college-level texts were used. Thirty students
out of a total of 38 Japanese students studying at Southern Illinois University
at Carbondale were tested in the fall of 1976. The results show that the
length of the subjects’ stay in the United States and the estimated number of
pages they write in English were correlated significantly with their listening
comprehension skill as measured by the dictation test. Also, the number of
close friends who were native speakers of English was significantly
correlated with the subjects’ reading skill as measured by the cloze test.

It is often claimed that the skill attained by second language learners is


related to a variety of factors including (1) length of exposure to the target
language, (2) amount of study and effort in learning the target language, and
(3) amount of contact with native speakers of the language. This is all plausible,
but to what degree is it true, and how do the posited factors affect different aspects
of language proficiency?

Method

Materials. In order to find out whether or not various predictor variables were
significantly correlated with proficiency scores in English, two types of tests were
constructed—a modified dictation procedure and a cloze test, accompanied by a


self-report questionnaire. Ten items were made for the dictation by taping various
portions of radio broadcasts. Parts of the news, dramas, narratives, reports, and
interviews were used. Test items were arranged so that subjects, first, listened to
an entire recorded selection and then wrote a selected sentence from it. The
selected sentences were copied on the tape and thus were repeated several times,
always in exactly the same way that they were originally said. There were no
unnatural pauses between words or phrases in the criterion sentences. The text of
each item is given in Appendix 22A.
For the cloze test, two passages were taken from college freshman textbooks,
one from Pickett and Laster, Writing and Reading in Technical English, p. 218,
and the other from Anne Free, Social Usage, p. 117. The usual cloze test construction
procedure was followed, with every seventh word suppressed and the first two
sentences in each passage left intact. The total number of blanks was twenty-five.
These tests are reproduced below as Appendix 22B.
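
Since the deletion procedure is mechanical, it can be stated compactly in code. The Python sketch below illustrates the standard every-nth-word format just described; the sample passage and the function name are our own stand-ins, not the original test materials:

    import re

    def make_cloze(text: str, nth: int = 7, intact: int = 2):
        """Build a cloze passage: leave the first `intact` sentences whole,
        then replace every nth word with a numbered blank."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        lead_in = ' '.join(sentences[:intact])
        words = ' '.join(sentences[intact:]).split()
        key = []
        for i in range(nth - 1, len(words), nth):
            key.append(words[i])
            words[i] = f'({len(key)})________'
        return lead_in + ' ' + ' '.join(words), key

    # Hypothetical passage, standing in for the textbook material:
    passage = ("The letter seeking adjustments is difficult to write. "
               "Frequently the writer is angry or annoyed. "
               "But the purpose of the adjustment letter is to bring about "
               "positive action that satisfies the complaint, so above all "
               "be calm, courteous, and businesslike in writing it.")
    cloze_text, answer_key = make_cloze(passage)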
On the questionnaire, the subjects were asked to supply demographic
information, self-ratings of their speaking and listening skills, as well as
information about their social life in the United States.

Subjects. Thirty Japanese students studying in Carbondale during the fall of


1976 participated in the study; 10 of them were enrolled in the Center for English
as a Second Language program, 15 in the undergraduate school, and the remaining
5 in the graduate school.

Results and Discussion


Table 22-1 shows the results of subjects’ performance on the Dictation, the Cloze
test, and the Total (i.e., Dictation score plus Cloze score).2 The scores ranged from
4 to 41 for the Dictation, from 6 to 37 for the Cloze test, and from 11 to 77 for the
Total. The distribution of scores on the Dictation, the Cloze, and the Total was
nearly normal.

Table 22-1 Subjects’ Performance on Dictation, Cloze, and the Total (N = 30)

                Mean        SD

Dictation       21.70/50    10.31
Cloze           21.57/50     8.38
Total           43.27/100   16.91

As can be seen in row (i) of Table 22-2, the subjects’ academic status at the time
of testing (i.e., CESL, undergraduate, or graduate student) at Southern Illinois
University correlated very strongly with their test performance: .82 for the
Dictation, .70 for the Cloze, and .84 for the combined Cloze and Dictation score.
Table 22-2 row (ii) shows that the subjects’ initial status at SIU correlated
somewhat lower: .66 for the Dictation, .57 for the Cloze test, and .68 overall.
Row (iii) indicates that, as is generally held, the dictation as a means of testing
aural comprehension skill is strongly correlated with the subjects’ length of stay
in ESL surroundings. Row (iv), however, affords a curious generalization.

Table 22-2 Correlation Coefficients between Test Performance and Predictor Variables

                                                    Dictation   Cloze   Total

(i)    Present status at SIU                          0.82      0.70    0.84
(ii)   Initial academic status                        0.66      0.57    0.68
(iii)  Length of stay in United States                0.68      0.29*   0.56
(iv)   Number of close English-speaking friends       0.28      0.39    0.37
(v)    Number of pages subjects write in English
       per semester                                   0.64      0.32    0.55
(vi)   Integrative motive                            -0.46     -0.06*  -0.31
(vii)  Self-rating of speaking skill                  0.69      0.51    0.67
(viii) Self-rating of listening skill                 0.38      0.32    0.39

*Not significant (p > .05). All nonasterisked correlations are significant at p < .05 or better.

Namely, the greater the number of English-speaking friends the ESL student had,
the better was his reading skill as measured by the Cloze test. The reason for
this is unclear. Perhaps even more curious are the statistics in row (v). The
number of pages the subjects wrote in English was more strongly related to their
score on the Dictation than to their score on the Cloze task; that is, the number
of pages written in English is more closely related to scores on the listening
comprehension task than to scores on the reading comprehension task. There is no
obvious explanation, but the following speculation is offered for consideration.
Since writing requires intense mental and linguistic work for nonnative
speakers, it may orient ESL students to be attentive to the less salient function
words of speech in addition to the more obviously meaningful content words. Thus
writing might exercise their ability to catch the function words of speech which
otherwise flow too fast and are almost inaudible. In fact, the average subject had
the greatest difficulty with the function words on the Dictation test. In the Cloze
test, in contrast, function words are shown as boldly as content words, which
naturally requires less analysis-by-synthesis guesswork on the part of the
nonnative speakers. Hence, perception of grammatical points learned through
writing might prove to be very helpful in clarifying the fuzzy segments of actual
speech.
Table 22-2 row (vi) shows a negative correlation between the integrative
motive and the subjects’ test performance, which is partly in agreement with the
finding of Spolsky (1969) that integrativeness is not a significant factor for
Japanese learning ESL. The integrative motive, which has been summarized from
the reasons the subjects gave for their stay in the United States, consisted of five
statements as follows:
(1) “I like the United States as a nation.”; (2) “I like Americans.”; (3) “I want to marry
an American.”; (4) “I want to make as many English-speaking friends as possible.”;
(5) “I want to see real American life and people.”

Subjects rated their agreement with each statement on a five-point scale ranging
from “indifferent” to “very much so.” The results show that the more subjects

indicated that they were integratively motivated, the less proficient they were
likely to be.
As is seen in Table 22-2 (vii) and (viii), the correlation between the self-rating
of speaking skill and test performance, especially on dictation, was fairly high,
though not high enough to be taken as an adequate substitute for the proficiency
tests used in this study. The rating of ability to speak English was somewhat more
highly correlated with the ESL test scores than the rating of listening skill.
In conclusion, the fact that Japanese students show rather remarkable
progress in their aural comprehension skills and less marked improvement in their
reading skill according to their length of stay in the United States presumably
reflects the bias of English education in Japan. Their reading skill may have been
developed nearer to its maximum while there was greater room for improvement in
listening comprehension.
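
The significance markers in Table 22-2 rest on the standard t transformation of a Pearson correlation, t = r·sqrt(n - 2)/sqrt(1 - r²) with n - 2 degrees of freedom. A minimal check in Python (scipy supplies the t distribution; whether one- or two-tailed tests were used is not stated, so the two-tailed convention is assumed here):

    from math import sqrt
    from scipy import stats

    def r_p_value(r: float, n: int) -> float:
        """Two-tailed p-value for a Pearson correlation r over n pairs,
        using t = r * sqrt(n - 2) / sqrt(1 - r**2) with n - 2 df."""
        t = r * sqrt(n - 2) / sqrt(1.0 - r ** 2)
        return 2 * stats.t.sf(abs(t), df=n - 2)

    # Two entries from Table 22-2 (N = 30):
    print(r_p_value(0.68, 30))  # length of stay vs. Dictation: p well under .05
    print(r_p_value(0.29, 30))  # length of stay vs. Cloze: p above .05, matching
                                # the table's "not significant" asterisk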

Notes

1. I am grateful for the comments and suggestions of John Oller on an earlier draft of this paper. I
would also like to thank Takeshi Ohara for help with the statistics and computer programming for the
present study. Any errors, of course, are my own.

2. For the 17 cases where the TOEFL test was taken between November 1974 and November 1976,
the Dictation, the Cloze, and the Total scores of the present test given during October and November
1976 correlated with the total TOEFL scores at .59, .68, and .76, respectively.

Appendix 22A
Part I. Dictation.

(1) A Swedish airlines plane crashed into a mountain in Southwest Turkey


tonight, killing all 154 persons aboard. Most of the victims were tourists.
— Radio News

(2) The Democratic Vice Presidential nominee says the Peace Corps repre¬
sents a classic example of the dividends that flow from idealism and that its
spiritual commitment may be more important than what its projects actu¬
ally accomplish. At its peak in the mid-sixties, the Corps had more than
15,000 volunteers in 48 countries; now it has less than 7,000 in 58 coun¬
tries. —News and Commentary

(3) Everyone wants more out of life. Did you ever meet someone totally satis¬
fied? The more we have, the more we want, it seems. But, then, there are
those who resign themselves to mediocrity. They know they will go no
further and settle down to what Henry David Thoreau calls “lives of quiet
desperation.” —Narrative

(4) Consider primitive man who cowered in a cave, who literally trembled at
every falling leaf, who saw himself surrounded on every side by malevolent
spirits. Even the gods to whom he prayed were capricious and fickle. That
was the primitive, ignorant, undeveloped man. But why do we have similar
problems, we sophisticated, modern, civilized beings? --Narrative

(5) “Fred?”
“Yeah, honey, it’s me.”
“Well, what’s the matter?”
“Uh, you better forget about going out to dinner tonight. I just got fired.”
“What?”
“A nice anniversary present for us, uh?” —Drama

(6) “You grabbed Time magazine right out of my hands.”
“I know that, sir. I had to.”
“Why?”
“Because you’re reading it wrong, all out of order.”
“Look! Look! I have my favorite Time magazine departments, so I just look
into them first each week ...”
“In order! In order and by page numbers.”
“All right, but some stories interest me more than others, so I naturally dip
into those stories first.” —Commercial

(7) “Fred! We’ve got to go.”
“What?”
“We’ve got to go to that recital.”
“Me at Carnegie Hall?”
“I wouldn’t miss this for the world. I’ve got to go and see and hear this for
myself.”
“Please go without me.”
“Oh, no. Please, Fred. I want to go with you.”
“Well, O.K. But I’m not going to enjoy it.” —Drama

(8) “Well, lately we’ve been hearing a lot about hyperactive children, Dennis.
These kids are restless, irritable, excitable, and impulsive, and naturally
this behavior pattern causes all kinds of problems both at home and at
school.” —Science Report

(9) “You mentioned they were looking at the brain of the ant. Now let me see.
That’s about the size of a speck. What can they possibly learn from that?”
“... His experiments have shown that the individual ants with bigger
brains are able to perform better in intelligence tests.”
“Hum, that’s fascinating! ...” —Science Report

(10) “Speaking of the sun,... all over the country millions of Americans are
stretched out on beaches, back yards and swimming pools trying to get a
good, healthy sun tan. Only, as most of us ought to know by now, the sun tan
is not too healthy.”
“.. . Every year we pass along the same warnings, and I have a feeling that
people aren't really listening.” —Science Report

Appendix 22B
Part II. Cloze Test

1. The letter seeking adjustments is in some ways the most difficult of all
letters to write. Frequently the writer is angry or annoyed or extremely dissat¬
isfied and his first impulse is to express his feelings in a harsh, angry, sarcastic
letter. But the purpose of the adjustment letter is to bring about positive
action that satisfies the complaint. A rude letter that antagonizes the
reader is not likely to result in such a positive action. Thus, above all
in writing an adjustment letter, be calm, courteous, and businesslike.
Assume that the reader is fair and reasonable. Include only factual
information, not opinions; and keep the focus on the real issue, not on
personalities.
Generally, the adjustment letter includes these three points:
(1) identification of the transaction, (2) statement of the problem, and
(3) desired action.

2. Successful entertaining is the result of a happy interaction among host,


hostess, and guest, no one of whom can be considered without the others. A
sense of appreciation and responsibility of one toward the other is essential.
The chances are that the efforts of the hostess alone will not be enough
unless her husband, the host, gives her full support. And the gracious
host and hostess with a flair for details in entertaining may fail if the
guest does not enter into the entertainment with enthusiasm. Usually
the ideal guest is also an ideal host or hostess—the principles involved
are the same for both roles. Consideration for others and forgetfulness of
self are all-important.
The essence of successful entertaining is warm hospitality, and the
basic rule is that it must be friendly and sincere with no attempt at
pretense.
Chapter 23

Seven Types of Learner Variables


in Relation to ESL Learning1

John W. Oller, Jr., Kyle Perkins, and Mitsuhisa Murakami

What factors are important to learning English as a second language for adult
foreign students in the United States? Seven types of self-reported data were
investigated: (1) descriptive/demographic variables such as length of study
of EFL before coming to the United States; (2) expressed attitudes toward
instruction such as the extent to which learners enjoy or feel they benefit
from ESL classes; (3) reported behavior in using and studying the target
language including whether the learner thinks or dreams in English and the
amount of time spent studying English each day; (4) attitudes toward
Americans including the extent to which they are viewed as truthful, kind,
friendly, rich, powerful, etc.; (5) attitudes toward self including the extent to
which the learner sees himself as reserved, talkative, calm, carefree, etc.;
(6) reasons for studying English including integrative motives such as getting
to understand Americans and instrumental motives such as getting a good
job; and (7) opinions on controversial topics such as the legalization of
marijuana or the abolition of capital punishment. Subjects were tested on a
battery of language proficiency tests including a conventional dictation and
a modified cloze test focused on grammatical functors: modals, conjunctions,
prepositions, and the like. Moderate to low correlations (never above
.46) were observed between the dependent variables (scores on the
dictation and the grammar test) and the predictor variables (the seven types
of self-reported variables) for the 45 to 77 subjects who took the tests and
completed the questionnaires. None of the attitude variables accounted for
more than .16 of the variance in either of the language proficiency measures
used. Further, the variables of type (7), which were originally considered
extraneous to the study, accounted for as much variance as any of the non-
extraneous variables. Three alternative explanations are considered; the
alternative that the attitude questionnaire is a kind of unintentional language
and intelligence test cannot be ruled out.


In the last couple of decades, the hypothesis that certain attitudes and
motivations are apt to lead to higher levels of attainment in second language learn¬
ing than others has grown in popularity. It has come to be widely accepted that a
so-called integrative orientation toward the target language culture—that is, a
desire to become like valued members of that community—is apt to be a more
effective basis for second language learning than an instrumental orientation—the
mere desire to acquire some material or utilitarian advantage (Gardner and
Lambert, 1972). Recently, however, another plausible alternative has been pro¬
posed to explain the possible superiority of the so-called integrative motive.
Savignon (1972) has produced evidence suggesting that an integrative orientation
may be the result rather than the cause of a superior performance in language
learning.
More recently still, a third possibility has been proposed: the popular
measures of attitudes and motivations may themselves be surreptitious measures
of language proficiency (Oller and Perkins, 1978, Chap. 5). If the last alternative
were correct, it would not invalidate the theories, but it would invalidate tests of
them based on the popular questionnaire formats. Further, if the latter alternative
were correct, it should be possible to show that the correlation between attained
target language proficiency and affective variables such as integrative or instru¬
mental reasons for learning ESL should be no stronger than the correlations
between attained language proficiency and attitudes toward extraneous variables
such as the legalization of marijuana, the abolition of capital punishment, and
whether or not abortion is an immoral act.
A disturbing fact about much of the previous research is that the posited rela¬
tionships between motives and learning are sometimes sustained and sometimes
not, yet the popularity of the theories seems to increase in spite of the evidence.
For instance, it is generally assumed that an integrative motive is superior to an
instrumental motive. Gardner and Lambert (1959) found evidence that this is so.
However, Anisfeld and Lambert (1961) found no contrast between the two types
of motivation and neither did Lambert, Gardner, Barik, and Tunstall (1962).
Lukmani (1972), in fact, found the opposite—that the instrumentally motivated
learners tended to outperform integratively motivated learners. It would seem that
the real strength of the theories resides in their intuitive appeal rather than in the
available empirical data. Perhaps, as Oller (1977) has argued, the final arbiter of
questions concerning attitudes and affective variables in general will have to be
subjective judgment rather than empirical tests employing psychometric measures
of affect.
In the meantime, certain empirical questions remain to be answered:
(1) What is the relation between language proficiency and a wide range of self-
reported variables? (2) Can language proficiency be predicted more accurately on
the basis of variables that are expected to be causally related to its attainment than
on the basis of extraneous variables? (3) Can the possibility that affective
measures are surreptitious measures of language proficiency be ruled out? To
obtain answers to the foregoing questions, the following study was designed.

Method
Subjects. In all, 182 foreign students at the Center for English as a Second Lan¬
guage (Southern Illinois University, Carbondale, Ill.) were tested as part of the
spring testing project in 1977. Owing to absenteeism and the voluntary nature of
participation in the attitude part of the study, between 45 and 101 students com¬
pleted relevant portions of the questionnaires, the oral interview, and the language
tests. There was some selectivity favoring the better students because the weaker
ones tended to complete fewer language tests and fewer attitude questions, but all
levels of CESL were represented. Practically all the subjects were males between
the ages of 19 and 30 and the largest language backgrounds represented were
Arabic, Persian, and Spanish.

Dependent Measures. The Dictation score used in this study was the
composite (sum) of the three dictations discussed in greater detail by Bacheller
(Chap. 6, this volume); also see entry 5 in the Appendix to this volume. The
Grammar test was prepared for CESL by its Academic Director, Dr. Charles
Parish. It is given in its entirety as entry 22 in the Appendix. Other scores could
have been used, but the Dictation and Grammar tests were among the best
predictors of global proficiency as defined in the factor analytic studies of Scholz
et al. (Chap. 2, this volume), and they were among the tests completed by the
largest numbers of subjects, thus maximizing the number of valid cases for the
correlations and regression analyses reported below.

Predictor Variables. The independent variables, or predictor variables,


employed included questions of seven different types presented in either written
or oral form. The written questionnaire is given as entry 23 in the Appendix, and
the oral form was given earlier as Appendix 12B. The last nine questions of the
written form in the Appendix, entry 23, are discussed in greater detail by Johnson
and Krug (Chap. 24). They are the only questions which were presented in
languages other than English (Arabic, Persian, Spanish, and Japanese). All the
other questions, written and oral, were given in English.
The seven types of predictor variables can be summarized as follows:

(1) Descriptive/demographic:
i. time in the U.S.;
ii. length of study of EFL before coming to the U.S.;
iii. language spoken by EFL teacher back home;
iv. highest educational level attained by either parent;
v. father’s occupation;
vi. whether the subject had visited the U.S. or Britain before coming to
the U.S. to study;
vii. for how long;
viii. and for what purpose.

(2) Attitudes toward instruction:


i. the language tests were too easy or too hard;
ii. less or more testing would have been preferred;
iii. extent of enjoyment of ESL classes;
iv. amount of learning that takes place in ESL classes.
(3) Reported use of the target language:
i. hours per day spent studying EFL back home (before coming to the
U.S.);
ii. language spoken by the people the subject lives with;
iii. hours per day now spent in studying ESL;
iv. whether or not lessons are reviewed outside of class;
v. whether the subject asks questions in class whenever things are not
clear;
vi. how leisure time is spent;
vii. whether the subject thinks or dreams in English;
viii. how much time is spent with speakers of English as opposed to
speakers of the native language of the subject;
ix. how often the subject tries to phrase things in English;
x. what language the subject speaks most at home;
xi. how many pages the subject writes for courses each semester (in
English).

(4) Attitudes toward speakers of English (Americans):


i. extent to which the subject sees them as critical of the way he speaks
English;
ii. extent to which the subject sees Americans as truthful;
iii. friendly;
iv. undependable;
v. unkind;
vi. culturally primitive;
vii. militarily strong;
viii. technologically retarded;
ix. rich;
x. scientifically retarded;
xi. politically powerful.

(5) Attitudes toward self:


i. extent to which the subject sees himself as reserved (as opposed to
outgoing);
ii. talkative;
iii. calm;
iv. happy-go-lucky;
v. solitary;
vi. carefree;
vii. follower;
viii. observer.

(6) Reasons for studying English:


i. to understand Americans;
ii. to get a good job;
iii. to have English speaking friends;
iv. to achieve social recognition;

v. to think and act like an American;


vi. to be an educated person;
vii. to marry an American;
viii. to complete my educational goals;
ix. to stay in the U.S.

(7) Opinions on controversial topics:


i. agreement or disagreement with the statement that “Marijuana should
be legalized”;
ii. that “Capital punishment should be abolished”;
iii. that “Abortion of unwanted children (when the life of the mother is not
endangered) is a crime and should be punished."

Data Analysis. The question was whether or not the predictor variables
would account for a significant (and/or substantial) portion of the variance in
either of the dependent measures of language proficiency. Hence a multiple re¬
gression technique was used. Each of the seven clusters of predictor variables was
regressed onto the Dictation and Grammar scores separately. Since there was no a
priori basis for positing a particular hierarchy among the predictor variables in any
set, each set was dealt with in a stepwise fashion, selecting the best predictor from
a given set, and then the next best (having partialed out the portion of variance
already accounted for in the dependent variable by the first selected predictor),
and so forth until all variables in a given set had been exhausted. Since this
procedure increases the possibility of chance relations in rough proportion to the
number of variables entered into any given regression equation, we imposed three
stringent statistical constraints: first, to be considered a significant predictor of
the criterion (dependent variable), the predictor had to be significantly correlated
with the criterion at p < .05; second, the predictor had to enter the regression
equation at a significant F-ratio (p < .05); and third, the predictor had to
significantly increase the amount of variance accounted for in the dependent
variable (again at p < .05). The first constraint was suggested by Gardner
(personal communication), and the second and third are taken from Kerlinger and
Pedhazur (1973).
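
A sketch of this selection logic is given below, assuming ordinary least squares at each step. It is an illustration of the three constraints rather than a reconstruction of the original analysis, and the function and variable names are ours:

    import numpy as np
    from scipy import stats

    def stepwise_select(X, y, names, alpha=0.05):
        """Forward stepwise selection under the three constraints described
        above: (1) a candidate must correlate significantly with the
        criterion; (2) it must enter at a significant F-ratio; (3) it must
        significantly increase the variance accounted for."""
        n, chosen, r2_prev = len(y), [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining:
            best = None
            for j in remaining:
                r, p = stats.pearsonr(X[:, j], y)
                if p >= alpha:                        # constraint (1)
                    continue
                cols = chosen + [j]
                A = np.column_stack([np.ones(n), X[:, cols]])
                beta, *_ = np.linalg.lstsq(A, y, rcond=None)
                resid = y - A @ beta
                r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
                # F-ratio for the increment in R-squared, covering
                # constraints (2) and (3):
                df_err = n - len(cols) - 1
                f = (r2 - r2_prev) / ((1.0 - r2) / df_err)
                if stats.f.sf(f, 1, df_err) < alpha and \
                        (best is None or r2 > best[1]):
                    best = (j, r2)
            if best is None:
                break
            chosen.append(best[0])
            remaining.remove(best[0])
            r2_prev = best[1]
        return [names[j] for j in chosen], r2_prev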

Results

We will report the results of the regression analyses in the order of the question
types as presented under Predictor Variables above. Only one of the descriptive/
demographic variables proved to be a significant predictor of the Dictation scores.
It was time spent in the United States. It entered the regression equation at an F-
ratio of 12.42 with 1,47 degrees of freedom. The raw correlation between this
variable and the Dictation score was .46, which proved to be the strongest for any
of the 108 correlations computed between the independent and dependent
variables. None of the predictor variables of the descriptive/demographic type,
however, was significantly correlated with the Grammar score. It is interesting to

note that although the amount of time spent in the United States before the testing
seemed to improve scores on the Dictation, it had no effect on scores on the
Grammar test. Also, it may be worth noting that even though the amount of time
spent in the United States predicted more of the variance in the Dictation than any
other predictor accounted for in either of the dependent variables, the amount of
explained variance is not great (.46² = 21%).
Among the variables classed under Attitudes toward Instruction, only one
accounted for a significant amount of variance in the two dependent variables,
namely, the extent to which subjects considered the language tests to be difficult.
The latter variable entered the regression equation as a predictor of the Dictation
score at an F-ratio of 7.51 with 1,43 degrees of freedom and correlated at -.38
with that score. It entered the regression equation with the Grammar score as
dependent variable at an F-ratio of 12.34 with 1,64 degrees of freedom and corre¬
lated at -.40 with the latter variable. From these facts we may conclude that
subjects who thought the tests were difficult tended to do more poorly than those
who thought that the tests were easy.
Of the variables classed under reported use of the target language, only the
tendency to think or dream in English correlated significantly with either of the
dependent variables. In fact, this variable correlated only with the Dictation score
and inversely at that. It entered the regression equation at an F-ratio of 4.92 with
1,49 degrees of freedom and correlated at .30 with the criterion. This correlation
is problematic, however, because it should be negative if we assume that a greater
tendency to think or dream in English is indicative of greater skill in the language.
In fact, the variable was coded such that a high score indicates a lesser tendency to
think or dream in English. We will return to this problem below.
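
The coding issue is easy to demonstrate: reversing a k-point scale maps each response x to (k + 1) - x, which flips the sign of any correlation with it while leaving its magnitude unchanged. A small simulation with fabricated responses (purely illustrative; these are not the study's data):

    import numpy as np

    rng = np.random.default_rng(0)
    proficiency = rng.normal(size=50)
    # A hypothetical 5-point item coded so that a HIGH value means a
    # LESSER tendency to think or dream in English:
    item = np.clip(np.round(3 + proficiency + rng.normal(size=50)), 1, 5)

    r_as_coded = np.corrcoef(item, proficiency)[0, 1]
    r_recoded = np.corrcoef(6 - item, proficiency)[0, 1]  # (k + 1) - x, k = 5
    print(r_as_coded, r_recoded)   # equal magnitude, opposite signs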
What about attitudes toward Americans? None of the predictor variables was
significantly correlated with the dependent variables. Although the predictor
“truthful” barely failed to achieve a significant positive relation with the dictation
and the variable “technologically retarded” barely failed to achieve a significant
negative relation with the Grammar score, contrary to expectations, no clear
interpretable pattern emerged. Attitudes toward self were also disappointing in
this regard. None of the predictor variables was significantly correlated with either
of the dependent criteria.
Among the nine reasons for studying English, one correlated with the Dicta¬
tion score, and a different one with the Grammar score. The desire to understand
the American people and their way of life was inversely related to the Dictation
score (F = 6.71, df= 1,45). The correlation was .36 but should have been
negative to support the usual interpretation of the integrative motive because the
scale was scored high for disagreement and low for agreement. The variable
intended to measure the subject’s desire to stay in the United States was positively
correlated with the Grammar score at .24 (F = 4.06, df = 1,64).
The last three variables included in the regression analyses were extraneous
ones—presumably unrelated to language proficiency. However, contrary to
popular theories, the extraneous variables proved to be as strongly related to the

dependent variables as any of the theoretically justifiable predictors. Attitude


expressed toward the legalization of marijuana proved to be a significant predictor
of both the Dictation score and the Grammar score. In both cases the correlation
was negative, indicating that desire to legalize marijuana was negatively related to
proficiency as measured by the Dictation or by the Grammar test (F = 14.66,
df = 1,76 for the Dictation with a simple correlation of -.40; F = 10.62,
df = 1,100 for the Grammar score with a simple correlation of -.31).
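
With a single predictor in the equation, the F-ratio and the simple correlation carry the same information, since F = r²·df/(1 - r²) for df error degrees of freedom. The reported figures can be checked with a few lines of Python (a verification sketch, assuming this standard equivalence):

    from math import sqrt

    def r_from_f(f: float, df_error: int) -> float:
        """Magnitude of the simple correlation implied by a one-predictor
        regression F-ratio with df = (1, df_error)."""
        return sqrt(f / (f + df_error))

    # Figures reported above for the marijuana-legalization item:
    print(r_from_f(14.66, 76))   # ~.40, matching the Dictation correlation
    print(r_from_f(10.62, 100))  # ~.31, matching the Grammar correlation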

Discussion
It is immediately apparent that the theoretical views concerning integrative and
instrumental motives for learning second languages do not serve very well as
explanations of the data reviewed above. In no case does an attitudinal or affective
variable account for more than .16 of the variance in a dependent variable, and,
surprisingly, one of the extraneous variables accounts for as much variance as any
other single affective variable or set of variables. An exceedingly difficult finding
for the popular theories to explain is the fact that the degree of integrativeness of
subjects is inconsistently related to scores on the language proficiency tests. In
one case it appears to be negatively related—e.g., the fact that the desire to behave
as Americans do is negatively correlated with the Dictation score.
What possible explanation can be offered that is not inconsistent with some of
the data? To suggest that this group of subjects becomes less integrative as they
become more proficient in the language fails as an explanation of the data. For one
thing, it fails to explain the positive relation between desire to stay in the United
States and the Grammar score.
Of the three hypotheses considered at the outset, only one is consistent with
the data. Unfortunately, it is the least attractive of the three hypotheses and it
encourages far-reaching skepticism about the future of attitude measurement. It is
the view that the nonrandom variance in attitude measures may have little to do
with actual attitudes and motivations. It may be largely the result of extraneous
nonrandom sources such as response set, the approval motive, and self-flattery
(see Oller and Perkins, 1978, Chap. 5). Certainly if a person does not understand
the question, he cannot give the desired response (or the one that he perceives to
be the desired response) or the self-flattering response (the one that will make him
look good in his own eyes and the eyes of others); neither can he give a consistent
response (one that merely keeps him from contradicting himself). Hence the
ability to give consistent, socially appropriate, and self-flattering responses hinges
on the ability both to understand the language of the questions and to infer the
socially appropriate, flattering, and consistent response. Thus modest correla¬
tions with language proficiency measures would be predicted independent of the
content of the questions. This interpretation is the only one that can explain the
fact that the extraneous questions correlated as strongly with the language profi¬
ciency criteria as did any of the other predictor variables.

It would seem that attitude theorists will have to find better measures, or a
different basis for testing their theories. Substantial evidence to date suggests that
the extant questionnaire techniques, regardless of the language of presentation
(whether in the native or target language) and regardless of the format (oral or written),
may be inadequate to test the theories they are constructed to evaluate.

Note

1. This paper was originally presented under the title Four Clusters of Learner Variables in Relation
to Attained Proficiency in ESL at the TESOL Convention in Miami, Fla., Apr. 29, 1977. We are
grateful for the comments of Paul Holtzman and others who reacted to the paper at the meeting. The present
version is much revised and more complete than that report, however.
Chapter 24

Integrative and Instrumental Motivations:


In Search of a Measure

Thomas Ray Johnson and Kathy Krug

This chapter concerns the measurement of integrative and instrumental


motivations. A modified form of the Gardner and Lambert measure of
integrative/instrumental motivation was administered to ESL learners in
Carbondale, Ill. In addition, as an alternative measure of integrativeness, a
redundancy index was developed based on scores obtained from an FSI type
of oral interview. The relationships between attained language proficiency
and the above measures of motivation and redundancy are discussed. Weak
and sometimes contrary-to-prediction relationships are observed. It is
concluded that affective factors are important regardless of the difficulty of
measuring them. The standard theories, however, are not very helpful in
explaining the data.

The affective domain has turned out to be a Pandora’s box for second
language acquisition researchers. The more research that is done, the more
complex the relation between affect and second language acquisition seems to
become. Study follows study, but there is little agreement on how basic constructs
are to be conceptualized or measured. Perhaps a large part of the problem is that
the social sciences may not be amenable to the same methodological procedures as
the hard sciences (Cicourel, 1964). In any case, thoroughly adequate instruments
have yet to be designed and a highly elegant model relating attitudes and
motivations of second language learners to attained language proficiency has yet to
be developed.
The central research issue discussed here is the notion of integrative and
instrumental motivations. The first of these has been variously defined in terms of
the social distance of the second language learner from the target culture, the
learner’s desire to become a member of the target culture, and the ego-


permeability or the willingness to take the social risks believed to be necessary to


second language learning. Spolsky has offered a concise definition of the
integratively motivated second language learner as one who chooses as a reference
group the second language group over the native language group (1969, p. 275).
The existence of important affective variables in second language learning is
widely recognized. As Gardner and Lambert put it, “A really serious student of a
foreign or second language who has an open, inquisitive, and unprejudiced
orientation toward the learning task might very likely find himself becoming an
acculturated member of a new linguistic and cultural community as he develops a
mastery of that other group’s language” (1972, p. 2). We regard the question of
learner orientation and the process of becoming or not becoming acculturated to
the target language culture as central to research concerning factors in second
language acquisition. It is our belief that since language and language learning are
always embedded within social contexts, second language acquisition must always
involve variables of an affective nature whether these variables are explicitly
recognized or not.
This study attempted to measure the degree of integrative or instrumental
orientation of ESL learners in Carbondale, Ill. Among the questions considered
were: (1) Will an integrative orientation prove to be substantially correlated with
attained ESL proficiency? (2) How will integrative motives contrast on the whole
with instrumental motives as predictors of attained ESL proficiency? (3) How well
will measures of integrative and instrumental motivation as derived from measures
of the type devised by Gardner and Lambert correlate statistically with a variety of
tests of ESL proficiency? (4) Can a measure of proficiency be constructed which
is maximally sensitive to integrativeness and which is directly observable as a
feature of the learner’s interlanguage?
As part of our attempt to answer the fourth question, the relationship between
redundancy in discourse and integrative motivation was considered. Schumann
has argued that "the speech of the second language learner will be restricted to the
communicative function if the learner is socially and/or psychologically distant
from the speakers of the target language” (1975b, pp. 68-69). Elsewhere he has
theorized that when the language learner develops a desire to expand his/her
language use beyond the communicative function to the expression of social
identity within the target language group, the learner’s interlanguage will compli¬
cate and expand. This complication and expansion involves the need for
redundancy, alternative forms, obligatory tense markers, and the like (Schumann,
1974, p. 151). Based on these observations we developed a redundancy index
(described below) in order to study the possible relationships between redun¬
dancy in the learner’s interlanguage, integrative motives, and various measures of
ESL proficiency.

Method

Subjects. Seventy-two subjects enrolled at CESL constituted the test population.


There were 23 speakers of Farsi, 18 of Arabic, 15 of Spanish, and 3 of Japanese.

There were also one or two speakers each from a variety of other languages, includ¬
ing Greek, Turkish, and Vietnamese.
Materials. A motivational and attitudinal measure constructed by Gardner
and Lambert (1972, p. 148) is given as Appendix 24A. It contains reasons
(according to Gardner and Lambert) often given by students in the Louisiana and
Maine communities for studying French. The measure actually used in this study,
however, is given in the Appendix to this volume as the last nine questions of entry
23. The questions used have been slightly modified to fit more closely the social
situation of the ESL learners studied. For instance, “French people” and
“French-speaking” were changed to “American people” and “English-speaking”;
“to finish high school” was changed to “to fulfill my educational goals”; and “it
will allow me to meet and converse with more and varied people” was changed to a
maximally integrative type of reason, “it will enable me to marry an American.”
Question 9 concerning desire to stay in the United States was added as a further
indicator of integrative or instrumental motivation (Schumann, 1975b, p. 74).
This modified Gardner and Lambert measure was translated into the native
language of the subjects—Spanish, Persian, Arabic, and Japanese. This was done
in an attempt to minimize variance attributable directly to ESL proficiency (Oller
and Perkins, 1978, Chap. 5). A questionnaire in English was also provided for the
few subjects who spoke languages other than Spanish, Persian, Arabic, and
Japanese.
An oral interview, roughly in the FSI format (see Chap. 7) was used to obtain a
measure of redundancy in the subject’s interlanguage according to the procedure
described below.
Procedure. The modified Gardner and Lambert questionnaire (Appendix
entry 23, last nine questions) was handed out to all students enrolled at CESL
during April 1977 (182 in all). The questionnaire in the five languages was
attached to the back of another attitude questionnaire which was part of a separate
study (see Chap. 23). The students were asked to take the questionnaire home, fill
it out, and return it. Seventy-two students returned the questionnaire. It was
scored according to the scale which appears at the bottom of Table 24-1; a value of
1 was assigned for “strongly agree" and 7 for “strongly disagree."
A redundancy measure was also obtained for each subject by scoring the
taped interviews. Five obligatory functors whose semantic function in English is
largely redundant were counted:
1. the plural morpheme, as in cats, shoes, or boxes (/s/, /z/, or /əz/);
2. the copula, as in He is sick;
3. the copula in progressive constructions, such as He is going (is + -ing);
4. possessive inflection, as in John’s bike; and
5. the third person singular habitual present, as in She sends him money every
month (Brown, 1973, pp. 253-254).

Each interview was played for approximately five minutes. Each time the subject
produced one of the functors, a variable a was incremented by one point. Each

Table 24-1 Means and Standard Deviations of Integrative and Instrumental
Reasons for Studying English and Desire to Stay in the United States and
their Correlations with the Redundancy Index (N = 72)

                                                                 Redundancy
Reasons for studying English*                       Mean   SD    index

1. It will help me to understand the American
   people and their way of life.                    3.2    1.4    .32†
2. I think it will someday be useful in getting
   a job.                                           3.0    1.5    .11
3. It will enable me to gain good friends more
   easily among English-speaking people.            2.8    1.1    .10
4. One needs a good knowledge of at least one
   foreign language to merit social recognition.    3.6    1.9   -.21
5. It should enable me to begin to think and
   behave as Americans do.                          5.1    1.5    .11
6. I feel that no one is really educated unless
   he is fluent in the English language.            5.8    1.4   -.14
7. It will enable me to marry an American.          5.6    1.5    .00
8. I need English in order to fulfill my
   educational goals.                               2.4    1.4    .23
9. If you could stay in the United States for as
   long as you wanted after you finish your
   studies, how long would you stay?
   (A. Leave as soon as my studies were finished;
   B. Stay 3 months; C. Stay 6 months; D. Stay
   1 year; E. Stay 5 years; F. Stay permanently,
   i.e., immigrate.)                                       1.7    .15

Scale: 1 (strongly agree), 2 (agree), 3, 4, 5, 6 (disagree), 7 (strongly disagree).

*Odd-numbered reasons are believed to be integrative, even-numbered reasons instrumental.
†p < .05.

Table 24-2 Correlation Matrix for Integrative/Instrumental Reasons
and Desire to Stay in the United States (N = 72)

Reason      2     3     4     5     6     7     8     9

1         .30   .37   .16   .28   .20   .10   .34  -.11
2               .43   .50   .34   .27   .13   .41  -.13
3                     .25   .11   .09   .25   .39   .00
4                           .13   .33  -.19   .15   .24
5                                 .43   .21   .18  -.20
6                                       .10   .13   .05
7                                            -.08  -.09
8                                                  -.14

time a functor was required but was not produced, a variable p was incremented by
one point. Then RI was computed as follows:

RI = a / (a + p)

Thus the redundancy score for each subject is equal to the number of correct
usages of the above-listed functors divided by the number of obligatory occasions.
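
In code, the index is a one-line proportion. A minimal sketch in Python (the tallying of functors from the tapes was done by hand; the function merely formalizes the final computation, and the example counts are hypothetical):

    def redundancy_index(a: int, p: int) -> float:
        """RI = a / (a + p): correct uses of the five redundant functors (a)
        divided by the number of obligatory occasions (a + p)."""
        if a + p == 0:
            raise ValueError("no obligatory occasions observed")
        return a / (a + p)

    # e.g., a subject who supplied 34 functors and omitted 6 obligatory ones:
    print(redundancy_index(34, 6))   # 0.85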

Results and Discussion


Mean scores, standard deviations, and correlations with the redundancy index are
given in Table 24-1 for each of the nine attitude questions. Contrary to our expec¬
tations, of the four correlations between the redundancy index and the four
reasons for integrative motivation, only one was significant (p < .05), and it was
positive. We would have expected it to be negative on the basis of the way we
scored the scales. Looking at question 9, if desire to stay in the target language
culture is any measure of integrative motivation, then this subject population is not
very integratively motivated. A mean between 3 and 6 months was obtained, and
only a small number of subjects expressed a desire to stay more than a year after
their studies were finished. Only two subjects wanted to immigrate.
Table 24-2 gives the correlation matrix for all nine attitude questions.
Several of the integrative reasons correlated as strongly with the instrumental
reasons as the integrative reasons correlated with each other. The overall correla¬
tion between the sum of scores on instrumental reasons and the sum of scores on
integrative reasons was .39. This indicates a lack of bipolarity in the question¬
naire; i.e., the reasons did not seem to differentiate clearly between the posited
constructs of instrumentality and integrativeness. This lack of bipolarity can also
be seen by examining the factor analysis given in Table 24-3. Here it is apparent
that integrative and instrumental questions load on the same factors, e.g., 1, 2, 3,
and 8 on Factor 1, where 1 and 3 are supposedly integrative reasons and 2 and 8
are instrumental.
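
Both checks (the correlation between the summed integrative and instrumental scales, and the rotated factor solution) can be sketched as follows, assuming a responses array of shape (72, 8) holding the 7-point ratings for reasons 1 through 8. scikit-learn's varimax option stands in here for whatever rotation software produced Table 24-3:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def bipolarity_checks(responses: np.ndarray):
        """responses: (subjects x 8) matrix of 7-point ratings for reasons
        1-8 (odd columns integrative, even columns instrumental)."""
        integrative = responses[:, 0::2].sum(axis=1)   # reasons 1, 3, 5, 7
        instrumental = responses[:, 1::2].sum(axis=1)  # reasons 2, 4, 6, 8
        r = np.corrcoef(integrative, instrumental)[0, 1]

        fa = FactorAnalysis(n_components=4, rotation='varimax')
        fa.fit(responses)
        loadings = fa.components_.T                    # (items x factors)
        return r, loadings

    # Under a bipolar construct, r should be strongly negative and the two
    # item sets should load on distinct factors; the .39 observed here and
    # the mixed loadings in Table 24-3 point the other way.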
One problem that may have contributed to the observed lack of bipolarity
may be the unidirectionality of the response scales. This can encourage what has
been termed “response set” (Liebert and Spiegler, 1974, p. 180). Although none
of our protocols have the same response for each of the eight reasons in question,
the use of unidirectional scales and their tendency to create response set is widely
recognized (Oller and Perkins, 1978, Chap. 5, and Lett, 1977, p. 36). Another
possible contributor might be the tendency for respondents to give answers which
they perceive to be socially desirable rather than answers that indicate their real
feelings. There is a greater tendency for this to occur when it is necessary to ask
subjects to give their names on the questionnaire, as had to be done here in order to
match questionnaires with test scores.
Looking at the correlation between language proficiency tests and the sum of
scores on the integrative and instrumental reasons, no significant correlation is

Table 24-3 Rotated Varimax Factor Solution over Integrative and Instrumental
Reasons for Studying English and Desire to Stay in the United States

Reasons                                     Loading*

1. Understand Americans                       .62
2. Get a job                                  .68
3. Make English-speaking friends              .77
4. Social recognition                         .61
5. Think and behave as Americans do           .79
6. Be an educated person                      .81
7. Marry an American                          .93
8. Fulfill educational goals                  .77
9. Desire to stay in the United States        .86

Percentage of variance accounted for by factor:
Factor 1 = .25, Factor 2 = .19, Factor 3 = .16, Factor 4 = .13

Total explained variance = .73    Eigenvalue = 6.57

*Factor loadings of .50 or less are not listed in the table but are entered into the computation
of percentages of explained variance and the eigenvalue.

Table 24-4 Correlation Matrix of the Sum of Integrative and Instrumental
Reasons with Language Proficiency Tests and Redundancy Index

                                         Sum of scores on          Sum of scores on
Language proficiency tests               integrative reasons   N   instrumental reasons   N

Listening Cloze (Appendix, entry 2)            .10            38          .03            39
Dictation (Appendix, entry 5)                  .18            47          .13            47
Oral Cloze (Appendix, entry 12)               -.05            32         -.12            33
Reading Aloud (Appendix, entry 13)             .01            33          .24            33
MC Reading Match (Appendix, entry 15)          .02            59          .10            59
Standard Cloze (Appendix, entry 16)            .10            59          .00            59
Recall Rating (Appendix, entry 20)            -.09            65         -.11            64
Grammar (Appendix, entry 22)                  -.02            67         -.11            67
Redundancy Index                               .13            34          .00            34

observed (p > .05). Moreover, on five of the integrative correlations and four of the
instrumental correlations, the tendency is opposite to the one expected.
The redundancy index (Table 24-5) is significantly correlated with all the
language tests. In addition, several of the questions from the larger questionnaire
discussed by Oller, Perkins, and Murakami (Chap. 23) correlated significantly
with the redundancy index. As we might have expected, the harder the student
thought the language tests were, the lower was the score on the redundancy index
(r = -.38). The larger the percent of leisure time the students said they spent
speaking or listening to English, the higher was the score on the redundancy index

Table 24-5 Correlation of the Redundancy Index with Selected Language
Proficiency Tests and Questions Taken from the Chap. 23 Attitude Questionnaire

                                                       Redundancy
Language proficiency tests                             index        N

Listening (Cloze task)                                   .53        22
Dictation                                                .64        27
Speaking (Cloze task)                                    .53        19
Reading Aloud                                            .51        19
Reading Match of Synonyms in Context                     .59        30
Cloze (standard format for reading)                      .61        30
Paragraph Recall Task (subjectively rated)               .30        33
Grammar (Cloze type of test)                             .60        32

Questions from Chap. 23 (see entry 23 in Appendix)

1. I thought the tests were                             -.38        29
   A. too easy
   B. a little too easy
   C. okay
   D. a little too hard
   E. much too hard

15. How much of your leisure time do you spend
    speaking English or listening to English spoken
    (TV, radio, conversation, etc.)?                     .34        32
    A. 10%
    B. 20%
    C. 50%
    D. 70%
    E. 100%

(On the next two items, how would you rate Americans?)

52. Technologically advanced ... Technologically        -.46        24
    retarded
    1  2  3  4  5  6  7  8

54. Scientifically advanced ... Scientifically          -.35        24
    retarded
    1  2  3  4  5  6  7  8

(r = .34). The more technologically and scientifically advanced the students felt
Americans to be, the higher the score on the redundancy index (r = -.46 and
-.35, respectively).

Summary and Conclusion

None of the correlations between the integrative and instrumental measures from
the modified Gardner and Lambert questionnaire with the several language tests
achieved significance (p > .05). The redundancy index, however, showed significant
correlations in all instances with the same tests.

Several alternative conclusions are possible from our study: (1) Attitude and
motivation measures of the type developed by Gardner and Lambert may not be
sensitive enough indicators of learners’ feelings toward the target culture to be
good predictors of attained language proficiency. At least for this type of subject
population, this conclusion cannot be excluded by the evidence we have
presented above. (2) The redundancy index appears to be significantly correlated
with attained language proficiency, but it cannot be linked with factors related to
the affective domain on the basis of present results. Further study is needed.
(3) The relationship between affect and cognition may be more complex than such
constructs as integrativeness and instrumentality would indicate; i.e., the con¬
structs themselves may need revision. A possible conceptual revision that we
believe deserves consideration is that there may exist an “intrinsic" kind of
motivation. If this is so, an activity such as second language learning might be
undertaken for its own sake, independent of such “extrinsic" motivations as inte¬
grativeness or instrumentality (Conway, 1975, p. 259). (4) The methods of
empirical science are insufficient to capture the interrelationships that are
displayed by everyday situations. Whereas science assumes discreteness,
permanence, and stability as features of a real world, the reality of everyday life
in which language learning takes place may not be so constituted (Mehan and Wood, 1975,
p. 64). Maureen Concannon-O’Brien (1977) addresses this possibility when she
observes that “psychologists may have to accept what is obvious to the teacher—
that the structures of no two personalities are the same, and that each individual
has a different set of habits, drives, needs, and impulses” (p. 196).
In the final analysis, whatever conclusions our results suggest, we still believe
that an awareness of learners’ feelings is an important factor in effective language
teaching, regardless of whether or not these feelings can be conceptualized and
measured.

Appendix 24A
Gardner and Lambert Motivational Measure
(1972, p. 148)

Endorsement of one of the following reasons was interpreted as indicating an


Integrative Orientation:

a. It will help me to understand better the French people and their way of life.
b. It will enable me to gain good friends more easily among French-speaking
people.
c. It should enable me to begin to think and behave as the French do.
d. It will allow me to meet and converse with more and varied people.

Endorsement of one of the following reasons was interpreted as indicating an


Instrumental Orientation:

a. I think it will some day be useful in getting a good job.


b. One needs a good knowledge of at least one foreign language to merit social
recognition.
c. I feel that no one is really educated unless he is fluent in the French language.
d. I need it in order to finish high school.

The rating scale below each question had the following form:

Not my feeling at all  -  -  -  -  -  -  -  Definitely my feeling
Part VII Discussion Questions

1. Clarke’s study (Chap. 21) seems to indicate that language aptitude is only
one predictor of achievement in language classes. Compile a list of other
factors that you think might be significant predictors of achievement in lan¬
guage classes. How could the factors in your list be measured?

2. Clarke showed (Table 21-1) that the German elective group (N = 25) had a
mean of 71.56 on the MLAT total while the German required group (N = 44)
had a mean of 59.91. There is a significant difference between the means:
t = 3.27 (df = 67), p < .001. Does this finding lend support for the motiva¬
tion hypothesis: students will learn more if they want to take the language
class? What other selective factors might enter in?

3. Clarke obtained higher correlations between MLAT scores and achievement


scores for the Japanese group than for the German group (see Table 21-2).
What explanation can you offer for these data? Consider the claim made by
Carroll and Sapon (1959) concerning Indo-European versus other language
groups in relation to the MLAT.

4. Clarke’s conclusions, if extended to an English language setting, suggest that


an ESL student would have a possibility of greater attained language proficiency
than an EFL student. Has it been your experience that ESL students
achieve higher proficiency levels than EFL students?

5. Clarke suggested that the MLAT by itself would be a rather weak predictor
of foreign language achievement. What do you consider to be some valid,
reliable predictors of attained language proficiency to be used in conjunc¬
tion with the MLAT?

6. Clarke indicated that the MLAT may measure rather well whatever is taught
in Latin courses. What does this imply for the validity of the MLAT? How
would one go about validating a test of this type?

7. According to Clarke’s Table 21-2 for the Japanese students, the total MLAT
score is a better predictor of Grammar scores (r = .74) than Part IV of the
MLAT, which purports to measure sensitivity to grammar (r = .50). For the
spring 1977 Japanese, Spelling Clues, Part III of the MLAT was the best
predictor of Achievement scores (r = .58), yet for the fall 1976 Japanese,
Part V, Paired Associates, was the best predictor (r = .56). For the German
groups, Part IV was the best predictor of Achievement and Grammar (.48,
.34, and .33). What conclusions can you draw about the construct validity of
the MLAT subtests? (cf. Chaps. 1 and 13.)

8. Murakami (Chap. 22) obtained a negative correlation between the integrative


motive and subjects’ test performance. Contrast these data with other atti-
tudinal data reported elsewhere, especially Gardner and Lambert (1972).

9. Why would academic status at the time of testing correlate very highly with
test performance for Murakami’s subjects?

10. Murakami stated that writing causes students to concentrate not only on
content words but also on the less salient free and bound morphemes of
speech; hence his finding that the number of pages of writing is more
closely related to the listening comprehension task than to the reading
task. Can you offer other plausible explanations for these data?

11. Murakami found that length of stay in the United States correlated signifi¬
cantly with listening comprehension. Can we say that length of stay can be
equated with a practice effect?

12. Murakami does not specifically discuss the reliability and/or validity of his
elicitation instruments. Can they be inferred from any of the data he does
present? How? (See other chapters in this volume which use similar tech¬
niques.)

13. Oller, Perkins, and Murakami (Chap. 23) suggested that "the popular measures
of attitudes and motivations may themselves be surreptitious measures
of language proficiency." Discuss the feasibility of developing an elicitation
instrument for attitudes and motivations that is not unintentionally a test of
language proficiency and/or intelligence.

14. What statistical procedures could one use to extract the extraneous
nonrandom variance in self-reported attitude and motivation data? Consider
the quantity of the reliable variance which may be attributable to such
factors.
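One standard procedure, sketched here for concreteness, is partial correlation: with x an attitude or motivation score, y the proficiency criterion, and z a separate measure of language proficiency, the first-order partial

    r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}

estimates the attitude-proficiency relation with the variance shared with z removed. Multiple regression with z entered first, or factor analysis of the combined attitude and proficiency battery, partitions the same variance in other ways.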

15. Given the wide range of discussion and conflicting data on instrumental and
integrative orientation, one can conclude that the role of affective variables
and aptitude factors in language acquisition has not been settled. Consider
some research questions which Chap. 23 suggests for affective and aptitude
factors in language proficiency. What measures need to be developed or
validated? What kinds of research designs need to be utilized to carry out
this research?

16. Oiler, Perkins, and Murakami stated that the ability to give consistent,
socially acceptable, and self-flattering responses hinges on the ability to
understand the language of the questions and to infer the appropriate
response. Can one conclude that these abilities are acquired? If so, are
these abilities examples of rule-governed behavior? Can they logically be
distinguished from language ability?

17. Johnson and Krug (Chap. 24) concluded that attitude and motivation
measures of the type developed by Gardner and Lambert may not be
sensitive enough to be good predictors of attained language proficiency.
What measures would be sensitive enough? How could one test the
sensitivity of proposed instruments?

18. Attitude and motivation studies have implicitly treated the integrative and
instrumental orientation constructs as static entities. Do you think these
constructs might be variable? Is that why they are so elusive and hard to
measure? Along this line, discuss the claim of Cicourel (1964) that this field
may not lend itself to measurement.

19. Johnson and Krug found that none of the correlations between the integrative
and instrumental measures and the language tests achieved significance. On
the other hand, correlations between integrative and instrumental motives
did tend to be significant. What plausible explanation(s) can be offered for
the raw correlation of .43 between reasons 2 and 3 (Tables 24-1 and 24-2)?
What about the correlation of .43 between reasons 5 and 6, another
integrative and instrumental pair? Shouldn’t one expect “integrative” and
“instrumental” motivational factors rather than several factors which are
both “integrative” and “instrumental”?
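To make the last point concrete, the following sketch runs a principal-components pass over a hypothetical inter-item correlation matrix. The values are invented for illustration only; they are not Johnson and Krug's data, beyond echoing cross-orientation correlations of roughly the .43 size reported. With mixed correlations that large, the leading components blend the two orientations instead of separating them.

# Sketch only: hypothetical correlations among six "reasons" for study,
# items 1-3 nominally integrative, items 4-6 nominally instrumental.
import numpy as np

R = np.array([
    [1.00, 0.45, 0.40, 0.20, 0.43, 0.25],
    [0.45, 1.00, 0.43, 0.43, 0.30, 0.20],
    [0.40, 0.43, 1.00, 0.25, 0.35, 0.30],
    [0.20, 0.43, 0.25, 1.00, 0.40, 0.45],
    [0.43, 0.30, 0.35, 0.40, 1.00, 0.43],
    [0.25, 0.20, 0.30, 0.45, 0.43, 1.00],
])

# Principal-components loadings on the two largest components.
eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order
idx = np.argsort(eigvals)[::-1][:2]    # indices of the two largest
loadings = eigvecs[:, idx] * np.sqrt(eigvals[idx])
print(np.round(loadings, 2))
# With cross-orientation correlations this large, the loadings show no
# clean integrative-versus-instrumental split.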
References

Aborn, M., H. Rubenstein, and T. D. Sterling. 1959. Sources of contextual constraints upon words in sentences. Journal of Experimental Psychology 57. 171-180.
Adorno, T. W., Else Frenkel-Brunswick, D. J. Levinson, and R. N. Sanford. 1950. The Authoritarian Personality. New York: Harper.
Aiken, Lewis, Jr. 1971. Psychological and Educational Testing. Boston: Allyn & Bacon.
Akmajian, A. 1969. On deriving cleft sentences from pseudocleft sentences. Unpublished manuscript, MIT.
Alexander, L. G. 1974. A First Book in Comprehension, Precis, and Composition. Rowley, Mass.: Longman-Newbury House.
Allen, J. P. B., and A. Davies (eds.). 1977. Testing and Experimental Methods. London: Oxford University Press.
American Language Institute. 1962. Oral Rating Form. Washington, D.C.: Georgetown University Press.
Anisfeld, M., and W. E. Lambert. 1961. Social and psychological variables on learning Hebrew. Journal of Abnormal and Social Psychology 63. 524-529. Also in Gardner and Lambert (1972), 217-227.
-. 1964. Evaluational reactions of bilingual and monolingual children to spoken languages. Journal of Abnormal and Social Psychology 69. 89-97.
Asher, J. J. 1969. The total physical response approach to second language learning. Modern Language Journal 53. 3-17.
Asher, J. J., J. A. Kusudo, and R. De La Torre. 1974. Learning a second language through commands: the second field test. Modern Language Journal 58. 24-32.
Austin, J. L. 1962. How to Do Things with Words. New York: Oxford University Press.
Baird, A., G. Broughton, D. Cartwright, and G. Roberts. 1972. Success with English: A First Reader.
Baltimore, Md.: Penguin Books.
Bezanson, K. A., and N. Hawkes. 1976. Bilingual reading skills of primary school children in Ghana.
Working Papers on Bilingualism 11. 44-73.
Bondaruk, J., J. Child, and E. Tetrault. 1975. Contextual testing. In Jones and Spolsky (1975), 89-104.
Brennan, E. M., E. B. Ryan, and W. E. Dawson. 1975. Scaling of apparent accentedness by magnitude estimation and sensory modality matching. Journal of Psycholinguistic Research 4. 27-36.
Brodkey, D., and H. Shore. 1976. Student personality and success in an English language program. Language Learning 26. 153-162.
Brown, H. D. 1973. Affective variables in second language acquisition. Language Learning 23. 231-244.
Brown, Roger. 1973. A First Language. Cambridge, Mass.: Harvard University Press.
Burt, M. K. 1975. Error analysis in the adult EFL classroom. TESOL Quarterly 9, 1. 53-56.
Carroll, J. B. 1958. A factor analysis of two foreign language test aptitude batteries. Journal of General Psychology 59. 3-19.


-. 1961. Fundamental considerations in testing for English language proficiency of foreign students. Washington, D.C.: Center for Applied Linguistics. Reprinted in H. B. Allen and R. N. Campbell (eds.), Teaching English as a Second Language. New York: McGraw-Hill, 1972, 313-320.
-. 1967. Foreign language proficiency levels attained by language majors near graduation from college. Foreign Language Annals 1. 131-151.

Carroll, J. B. 1972. Defining language comprehension: some speculations. Paper presented at the research workshop on language comprehension and the acquisition of knowledge, Durham, N.C., Mar. 31-Apr. 3. Also in R. O. Freedle and J. B. Carroll (eds.), Language Comprehension and the Acquisition of Knowledge. New York: Wiley, 1-29.
Carroll, J. B., and S. Sapon. 1959. Modern Language Aptitude Test Manual. New York: The Psychological Corporation.
Cartier, F. A. 1968. Criterion-referenced testing of language skills. TESOL Quarterly 2. Reprinted in
Palmer and Spolsky (1975), 19-24.
Chafe, Wallace. 1972. Discourse structure and human knowledge. In Roy O. Freedle and John B. Carroll (eds.), Language Comprehension and the Acquisition of Knowledge. New York: Winston, 41-69.
Chase, C. I. 1972. Test of English as a foreign language. A review in O. K. Buros (ed.), Seventh Mental Measurements Yearbook. Highland Park, N.J.: The Gryphon Press.
Chastain, Kenneth. 1975. Affective and ability factors in second language acquisition. Language
Learning 25. 153-161.
Christie, R., and F. Geis. 1970. Studies in Machiavellianism. New York: Academic Press.
Cicourel, Aaron. 1964. Method and Measurement in Sociology. New York: Free Press.
Clark, J. L. D. 1972. Foreign Language Testing: Theory and Practice. Philadelphia, Pennsylvania: Center for Curriculum Development.
-. 1975. Theoretical and technical considerations in oral proficiency testing. In Jones and Spolsky (1975), 10-24.
Concannon-O'Brien, M. 1977. Motivation: an historical perspective. In M. K. Burt, H. Dulay, and M. Finocchiaro (eds.), Viewpoints on English as a Second Language. New York: Regents, 185-197.
Conway, Patrick. 1975. Volitional competence and the process curriculum of the ANISA model. In P. Conway (ed.), Development of Volitional Competence. New York: MSS Information Corporation.
Cooper, Carl J. 1964. Some relationships between paired-associates learning and foreign-language
aptitude. Journal of Educational Psychology 55. 132-138.
Coulthard, Malcolm. 1975. Discourse analysis in English: a short review of the literature. Language Teaching and Linguistics Abstracts 8. 73-89.
Cranney, A. Garr. 1973. The construction of two types of cloze reading tests for college students. Journal of Reading Behavior 5. 60-64.
Crowne, D. P., and D. Marlowe. 1964. The Approval Motive. New York: Wiley.
Darnell, D. 1970. The development of an English language proficiency test of foreign students using
the clozentropy procedure. Speech Monographs 37. 36-46.
Davies, A. 1977. The construction of language tests. In J.P.B. Allen and A. Davies (eds.), Testing and
Experimental Methods. London: Oxford University Press.
Davis, F. B. 1944. Fundamental factors of comprehension in reading. Psychometrika 9. 185-197.
Donaldson, Weber D., Jr. 1971. Code-cognition approaches to language learning. In Robert C. Lugton (ed.), Towards a Cognitive Approach to Second Language Acquisition. Philadelphia: Center for Curriculum Development.
Dulay, H. C., and M. K. Burt. 1974. Errors and strategies in child second language acquisition. TESOL Quarterly 8. 129-136.
Endicott, A. L. 1973. A proposed scale of syntactic density. Research in the Teaching of English 7. 5-12.
Enkvist, N. 1973. Should we count errors or measure success? In Jan Svartvik (ed.), Errata. Papers in
Error Analysis. Lund, Sweden: CWK Gleerup.
Ervin-Tripp, Susan M. 1970. Structure and process in language acquisition. Georgetown University
Monograph. 21st Annual Round Table No. 23.
Flahive, Douglas. 1977. The reading proficiency of ESL/EFL learners. Paper presented at the Spring
1977 KICL meeting, Eastern Kentucky University, Richmond, Ky.
Flesch, Rudolf. 1948. A new readability yardstick. Journal of Applied Psychology 32. 221-233.
Frederiksen, C. H. 1975a. Effects of context-induced processing operations on semantic information acquired from discourse. Cognitive Psychology 7. 139-166.
Frederiksen, C. H. 1975b. Representing logical and semantic structure of knowledge acquired from discourse. Cognitive Psychology 7. 431-458.
Fry, Edward. 1968. A readability formula that saves time. Journal of Reading 11. 513-516, 575-578.
Galvan, J., J. A. Pierce, and G. N. Underwood. 1975. The relevance of selected educational variables
of teachers to their attitudes toward Mexican-American English. Paper presented at the 1975
annual meeting of the Linguistics Association of the Southwest in San Antonio, Tex.
Gardner, R. C. 1975. Social factors in second language acquisition and bilinguality. Paper presented at
the invitation of the Canada Council’s Consultative Committee on the Individual, Language,
and Society at a conference in Kingston, Ontario, November, December 1975.
Gardner, R. C., and W. E. Lambert. 1959. Motivational variables in second language acquisition. Canadian Journal of Psychology 13. 24-44.
-. 1972. Attitudes and Motivation in Second Language Learning. Rowley, Mass.: Newbury
House.
Gardner, R. C., P. C. Smythe, R. Clement, and L. Gliksman. 1976. Second language learning: a social psychological perspective. Canadian Modern Language Review 32. 198-213.
Genesee, F. 1976. The role of intelligence in second language learning. Language Learning 26. 267-280.
Giles, H. 1970. Evaluative reactions to accents. Educational Review 22. 211-227.
-. 1972. The effect of stimulus mildness-broadness in the evaluation of accent. Language and Speech 15. 262-266.
Giles, H., and P. F. Powesland. 1975. Speech Style and Social Evaluation. London: Academic Press.
Gorosch, M. 1973. Assessment intervariability in testing oral performance of adult students. In Jan Svartvik (ed.), Errata: Papers in Error Analysis. Lund, Sweden: CWK Gleerup.
Gradman, H. L., and B. Spolsky. 1975. Reduced redundancy testing: a progress report. In Jones and Spolsky (1975), 59-70.
Guiora, A. Z., M. Paluszny, B. Beit-Hallahmi, J. C. Catford, R. E. Cooley, and C. Y. Dull. 1975. Language and person: studies in language behavior. Language Learning 25. 43-62.
Gunnarsson, Bjarni. 1978. A look at the content similarities between intelligence, achievement, personality, and language tests. In Oiler and Perkins (1978), 18-35.
Haggard, L. A., and R. S. Isaacs. 1966. Micro-momentary expressions as indicators of ego mechanisms in psychotherapy. In L. A. Gottschalk and A. H. Auerbach (eds.), Methods of Research in Psychotherapy. New York: Appleton-Century-Crofts, 154-165.
Hardison, O. B., Jr. 1966. Practical Rhetoric. Chapter by M. Twain, The Mississippi River: A Symbol of Freedom. New York: Appleton-Century-Crofts, 111-113.
Harms, L. S. 1963. Status cues in speech: extra-race and extra-region identification. Lingua 12. 300-306.
Harris, D. P. 1969. Testing English as a Second Language. New York: McGraw-Hill.
Harris, D. P., and L. A. Palmer. 1970. CELT Technical Manual: Preliminary Edition. New York: McGraw-Hill.
Heaton, J. B. 1975. Writing English Language Tests. London: Longman.
Hinofotis, F. B. 1976. An investigation of the concurrent validity of cloze testing as a measure of overall proficiency in English as a second language. Unpublished doctoral dissertation. Carbondale, Ill.: Southern Illinois University.
Hisama, K. 1976. Design and empirical validation of the cloze procedure for measuring language proficiency of non-native speakers. Unpublished doctoral dissertation. Carbondale, Ill.: Southern Illinois University.
-. 1977a. A new direction in measuring proficiency in English as a second language. Paper presented at the annual meeting of the American Educational Research Association, New York.
-. 1977b. Predictive validity of short-form placement tests under two scoring systems. Paper presented at the National Council on Measurement in Education, New York.
Hornby, P. A. 1974. Surface structure and presupposition. Journal of Verbal Learning and Verbal Behavior 13. 530-538.
Hunt, K. 1965. Grammatical structures written at three grade levels. National Council of Teachers of English Research Report. Champaign, Ill.: National Council of Teachers of English.
Hutchinson, L. G. 1971. Presupposition and belief-inferences. In Papers from the Seventh Regional Meeting, Chicago Linguistic Society. Chicago: Chicago Linguistic Society, 134-141.
Indochinese Refugee Education Guides, No. 11. 1975. Arlington, Va.: National Indochinese Clearinghouse, Center for Applied Linguistics.
Iowa Silent Reading Tests. Level 2, Form E. 1973. New York: Harcourt Brace Jovanovich.
Irvine, Patricia, Parvin Atai, and John W. Oiler, Jr. 1974. Cloze, dictation, and the Test of English as a Foreign Language. Language Learning 24. 245-252.
Jakobovits, L. A. 1970. Foreign Language Learning: A Psycholinguistic Analysis of the Issues. Rowley, Mass.: Newbury House.
Jensen, Arthur R. 1972. The nature of intelligence. In G. H. Bracht, Kenneth D. Hopkins, and Julian
C. Stanley (eds.), Perspectives in Educational and Psychological Measurement. Englewood Cliffs,
N.J.: Prentice-Hall, 191-213. Reprinted from Harvard Educational Review (1969), 39. 5-28.
Johansson, S. 1973a. An evaluation of the noise test. IRAL 11. 107-133.
-. 1973b. Partial dictation as a test of foreign language proficiency. Swedish-English Contrastive Studies, Report No. 3, Department of English, Lund University, Sweden.
-. 1975. Papers in contrastive linguistics and language testing. Lund Studies in English. Lund
University, Sweden.
Johnson, D. 1977. The TOEFL and domestic students: conclusively inappropriate. TESOL Quarterly
11. 79-86.
Jones, E. E., and F. Kohler. 1958. The effects of plausibility on the learning of controversial statements. Journal of Abnormal and Social Psychology 58. 315-320.
Jones, R. L., and B. Spolsky (eds.). 1975. Testing Language Proficiency. Arlington, Va.: Center for
Applied Linguistics.
Jonz, J. 1976. Improving on the basic egg: the M-C cloze. Language Learning 26. 255-265.
Karttunen, L. 1971. Implicative verbs. Language 47. 340-358.
-. 1974. Presupposition and linguistic context. Theoretical Linguistics 1. 181-194.
-. 1975. On pragmatic and semantic aspects of meaning. Texas Linguistic Forum 1.
Kerlinger, F. N. 1973. Foundations of Behavioral Research, 2d ed. New York: Holt, Rinehart and Winston.
Kerlinger, F. N., and E. J. Pedhazur. 1973. Multiple Regression in Behavioral Research. New York: Holt, Rinehart and Winston.
Krashen, Stephen. 1977. Language testing: current research. Paper presented at the ACTFL meeting
in San Francisco, November.
Krzyzanowski, H. 1976. Cloze tests as indicators of general language proficiency. Studia Anglica
Posnaniensia 7. 29-43.
Labov, W. 1966. The Social Stratification of English in New York City. Washington, D.C.: Center for
Applied Linguistics.
Lado, R. 1957. Linguistics across Cultures: Applied Linguistics for Language Teachers. Ann Arbor: University of Michigan Press.
-. 1961. Language Testing: The Construction and Use of Foreign Language Tests. New York:
McGraw-Hill.
Lambert, W. E., R. C. Hodgson, R. C. Gardner, and S. Fillenbaum. 1960. Evaluational reactions to
spoken languages. Journal of Abnormal and Social Psychology 60. 44-51.
Lambert, W. E., R. C. Gardner, H. Barik, and K. Tunstall. 1962. Attitudinal and cognitive aspects of
intensive study of a second language. Journal of Abnormal and Social Psychology 66. 358-368.
Reprinted in Gardner and Lambert (1972), 288-245.
Lee, Laura L., and Susan M. Canter. 1971. Developmental sentence scoring: a clinical procedure for estimating syntactic development in children's spontaneous speech. Journal of Speech and Hearing Disorders 36. 315-340.
Lett, John. 1977. Assessing attitudinal outcomes. In June K. Phillips (ed.), The Language Connection: From the Classroom to the World. ACTFL Foreign Language Education Series 9.
Levenston, E. A. 1975. Aspects of testing the oral proficiency of adult immigrants to Canada. In Palmer and Spolsky (1975), 67-74.
Liebert, R. M., and M. D. Spiegler. 1974. Personality: Strategies for the Study of Man. Rev. ed. Homewood, Ill.: Dorsey Press.
Lorge, Irving, Robert L. Thorndike, and Elizabeth Hagen. 1964. The Lorge-Thorndike Intelligence
Tests. Boston: Houghton Mifflin.
Lukmani, Yasmeen. 1972. Motivation to learn and language proficiency. Language Learning 22. 261-
274.
Manis, Melvin, and Robyn M. Dawes. 1961. Cloze score as a function of attitude. Psychological Reports
9. 79-84.
McLaughlin, G. H. 1969. SMOG grading: a new readability formula. Journal of Reading 12.639-645.
Mehan, H., and H. Wood. 1975. The Reality of Ethnomethodology. New York: Wiley-Interscience.
Mehrens, William A., and Irvin J. Lehmann. 1969. Standardized Tests in Education. New York: Holt, Rinehart and Winston.
Miller, G. A. 1956. The perception of speech. In M. Halle (ed.), For Roman Jakobson. The Hague: Mouton.
Miller, G. A., G. A. Heise, and W. Lichten. 1951. The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology 41. 81-97.
Miller, G. A., and P. N. Johnson-Laird. 1976. Language and Perception. Cambridge, Mass.: Belknap
Press of Harvard University Press.
Muraki, M. 1970. Presupposition and pseudoclefting. In Papers from the Sixth Regional Meeting,
Chicago Linguistic Society. Chicago: Chicago Linguistic Society, 390-399.
Naiman, N. 1974. The use of elicited imitation in second language acquisition research. Working
Papers on Bilingualism 2. 1-37.
Nelson, M. J., and E. C. Denny. 1973. The Nelson-Denny Reading Test. Boston: Houghton Mifflin.
Nguyen Dang Liem. 1967. A Contrastive Analysis of English and Vietnamese. Canberra: Australian National University.
Nie, N. H., C. H. Hull, J. G. Jenkins, K. Steinbrenner, and D. H. Bent. 1975. SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill.
Nunnally, Jum C. 1964. Educational Measurement and Evaluation. New York: McGraw-Hill.
-. 1967. Psychometric Theory. New York: McGraw-Hill.
-. 1975. Introduction to Statistics for Psychology and Education. New York: McGraw-Hill.
O’Donnell, Roy C., W. J. Griffin, and R. C. Norris. 1967. Syntax of kindergarten and elementary school
children: a transformational analysis. NCTE Research Report No. 8. Champaign-Urbana, Ill.:
National Council of Teachers of English.
Oiler, John W„ Jr. 1972. Contrastive analysis, difficulty, and predictability. Foreign Language Annals
6. 95-106.
-. 1973a. Cloze tests of second language proficiency and what they measure. Language Learning
23.105-118.
-. 1973b. Discrete-point tests versus tests of integrative skills. In Oiler and Richards (1973), 184-199.
-. 1974. Expectancy for successive elements: key ingredient to language use. Foreign Language
Annals 7. 443-452.
-. 1976. Interlanguage and fossilization. Paper presented at Modern Language Association Convention, New York. (See also Rule fossilization: a tentative model, with N. Vigil. Language Learning 26. 281-295.)
-. 1977. Affective variables in second language acquisition: how important are they? Paper presented at the NAFSA meeting in New Orleans, May 27, 1977. In Betty Wallace Robinett (ed.), 1976-77 Papers in ESL: Selected Conference Papers of ATESL. Washington, D.C.: NAFSA.
_. 1979. Language Tests at School: A Pragmatic Approach. London: Longman.

Oiler, John W„ Jr., and J. C. Richards (eds.). 1973. Focus on the Learner: Pragmatic Perspectives for
the Language Teacher. Rowley, Mass.: Newbury House.

Oiler, John W„ Jr., and Virginia Streiff. 1975. Dictation: a test of grammar based expectancies. In
Jones and Spolsky (1975), 71-88. Also in English Language Teaching 30. 1975. 25-36.
Oiler, John W., Jr., and F. B. Hinofotis. 1976. Two mutually exclusive hypotheses about second language ability: factor analytic studies of a variety of language tests. Unpublished paper delivered at the Winter meeting of the Linguistic Society of America, Dec. 30, 1976. Also in this volume as Chap. 1.
Oiler, John W., Jr., Alan J. Hudson, and Phyllis Fei Liu. 1977. Attitudes and attained proficiency in ESL: a sociolinguistic study of native speakers of Chinese in the United States. Language Learning 27. 1-27.
Oiler, John W., Jr., Lori Baca, and Fred Vigil. 1977. Attitudes and attained proficiency in ESL: a sociolinguistic study of Mexican-Americans in the Southwest. TESOL Quarterly 11. 173-183.
Oiler, John W., Jr., and K. Perkins. 1978. Language in Education: Testing the Tests. Rowley, Mass.:
Newbury House.
Olsson, M. 1973. The effect of different types of errors in the communication situation. In Svartvik (1973).
Ortego, P. D. 1970. Some cultural implications of a Mexican-American border dialect of American
English. Studies in Linguistics 21. 77-84.
Palmer, L. A. 1973. A preliminary report on a study of the linguistic correlates of raters’ subjective
judgments of non-native speech. In Shuy and Fasold (1973), 41-59.
Palmer, L. A., and B. Spolsky (eds.). 1975. Papers on Language Testing 1967-1974. Washington,
D.C.: Teachers of English to Speakers of Other Languages.
Paulston, Christina Bratt, and Mary Newton Bruder. 1975. From Substitution to Substance: A Handbook of Structural Pattern Drills. Rowley, Mass.: Newbury House.
Perkins, K. 1976. Hierarchies of syntactic complexity of adult ESL learners. In Robert St. Clair and Beverly Hartford (eds.), LEKTOS: Interdisciplinary Working Papers in Language Sciences. Louisville: University of Louisville.
Perren, G. E. 1968. Testing spoken language: some unsolved problems. In A. Davies (ed.), Language Testing Symposium: A Psycholinguistic Approach. London: Oxford University Press, 107-116.
Postovsky, Valerian A. 1974. Effects of delay in oral practice at the beginning of second language
learning. Modern Language Journal 58. 229-239.
-. 1976. Individual differences in acquisition of receptive and productive language skills. Paper presented at the Kentucky Foreign Language Conference, Lexington, Ky.
Praninskas, J. 1959. Rapid Review of English Grammar. Englewood Cliffs, N.J.: Prentice-Hall.
Raven, J. C. 1960. Guide to the Standard Progressive Matrices. London: H.K. Lewis and Co. Ltd.
Raygor, Alton. 1970. McGraw-Hill Basic Skills System Test Manual. Del Monte Research Park, Monterey: McGraw-Hill.
Richards, Jack C. 1970. A non-contrastive approach to error analysis. English Language Teaching 25. 115-135. Also in Oiler and Richards (1973), 96-113.
-. 1971. Error analysis and second language strategies. Language Sciences 17. 12-22. Also in Oiler and Richards (1973), 114-135.
-. 1974. Error Analysis: Perspectives on Second Language Acquisition. London: Longman.
Roscoe, John T. 1975. Fundamental Research Statistics for the Behavioral Sciences. 2d ed. New York: Holt, Rinehart and Winston.
Ross, Janet. 1976. The habit of perception in foreign language learning: insights into error from contrastive analysis. TESOL Quarterly 10. 169-175.
Ryan, E. B. 1973. Subjective reactions toward accented speech. In Shuy and Fasold (1973), 60-73.
Sattler, J. M. 1974. Assessment of Children's Intelligence. Philadelphia: W.B. Saunders.
Savignon, Sandra. 1972. Communicative Competence: An Experiment in Foreign Language Teaching.
Montreal: Marcel Didier.
Saville-Troike, M. 1973. Reading and the audio-lingual method. TESOL Quarterly 7. 395-405.
Schumann, John. 1974. The implications of interlanguage, pidginization, creolization for the study of
adult second language learning. TESOL Quarterly 8, 2. 145-152.
-. 1975a. Second language acquisition: the pidginization hypothesis. Unpublished doctoral dissertation. Harvard University.
-. 1975b. Affective factors and the problem of age in second language acquisition. Language
Learning 25. 209-235.
-. 1976. Social distance as a factor in second language acquisition. Language Learning 26.135-
143.
Schumann, John, and Nancy Stenson. 1974. New Frontiers in Second Language Learning. Rowley,
Mass.: Newbury House.
Selinker, L. 1972. Interlanguage. International Review of Applied Linguistics 10, 3. 209-231. Also in
Schumann and Stenson (1974), 114-136.
Sellars, W. 1954. Presupposing. The Philosophical Review 63. 197-215.
Shuy, R. W., J. C. Baratz, and W. A. Wolfram. 1969. Sociolinguistic Factors in Speech Identification. Research Project No. MH 1504801. Arlington, Va.: Center for Applied Linguistics.
Shuy, R. W., and R. W. Fasold (eds.). 1973. Language Attitudes: Current Trends and Prospects. Washington, D.C.: Georgetown University Press.
Spolsky, Bernard. 1969. Attitudinal aspects of second language learning. Language Learning 19.
271-283.
Spolsky, Bernard, Penny Murphy, Wayne Holm, and Allen Ferrel. 1975. Three functional tests of oral
proficiency. TESOL Quarterly 1972, 6. 221-235. In Palmer and Spolsky (1975), 76-90.
State University of Iowa. 1964. The Iowa Tests of Basic Skills. Boston: Houghton Mifflin.
Stendahl, Christina. 1972. The relative proficiency in their native language and in English as shown by Swedish students of English at university level. Projektet sprakfardighet: engelska (SPRENG), Rapport 6. Engelska institutionen, Goteborgs Universitet.
Stevenson, D. 1974. Construct validity and the test of English as a foreign language. Unpublished
doctoral dissertation. Albuquerque, N.M.: University of New Mexico.
Sticht, Thomas G. 1972. Learning by listening. In R. Freedle and J. B. Carroll (eds.), Language Comprehension and the Acquisition of Knowledge. Washington, D.C.: V.H. Winston.
Strawson, P. F. 1952. Introduction to Logical Theory. New York: Wiley.
Streiff, Virginia. 1978. Relationships among oral and written cloze scores and achievement test scores in a bilingual setting. In Oiler and Perkins (1978), 65-100.
Strongman, K., and J. Woosley. 1967. Stereotyped reactions to regional accents. British Journal of Social and Clinical Psychology 6. 164-167.
Stubbs, J. B„ and G. R. Tucker. 1974. The cloze test as a measure of English proficiency. Modern
Language Journal 58. 239-241.
Stump, Thomas A. 1978. Cloze and dictation as predictors of intelligence and achievement scores. In
Oiler and Perkins (1978), 36-63.
Svartvik, J. (ed.). 1973. Errata: Papers in Error Analysis. Lund, Sweden: CWK Gleerup.
Swain, M„ G. Dumas, and N. Naiman. 1974. Alternatives to spontaneous speech: elicited imitation and
translation as indicators of second language competence. Working Papers in Bilingualism 3. 68-
79.
Taylor, B. 1975. The use of overgeneralization and transfer learning strategies by elementary and
intermediate students in ESL. Language Learning 25. 73-108.
Taylor, W. L. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly 30. 415-433.
-. 1954. Application of cloze and entropy measures to the study of contextual constraint in samples of continuous prose. Unpublished doctoral dissertation. Champaign-Urbana, Ill.: University of Illinois.
-. 1956. Recent developments in the use of the cloze procedure. Journalism Quarterly 33. 42-48.
Thurstone, T. G. 1959. General Reading for Understanding: Teacher's Handbook. North Carolina:
Science Research Associates.
-. 1963. Reading for Understanding Placement Test. Chicago: Science Research Associates.
Truus, S. 1972. Sentence construction in English and Swedish in the writings of Swedish students of English at university level: a pilot study. Projektet sprakfardighet: engelska (SPRENG), Rapport 7, Engelska institutionen, Goteborgs Universitet.
Tucker, G. R., and W. E. Lambert. 1969. White and Negro listeners' reactions to various American-English dialects. Social Forces 47. 463-468.
Tucker, G. R., E. Hamayan, and F. H. Genesee. 1976. Affective, cognitive, and social factors in second language acquisition. Canadian Modern Language Review 32. 214-226.
Upshur, John A. 1975. Objective evaluation of oral proficiency in the ESOL classroom. TESOL
Quarterly 5. 47-60. In Palmer and Spolsky (1975), 53-65.
Valette, R. M. 1967. Modern Language Testing: A Handbook. New York: Harcourt, Brace and World (2d ed., 1977).
Van Syoc, W. B., and F. S. Van Syoc. 1971. Let's Learn English: Advanced Course, Book 5. New York: American Book Company.
Vineyard, Edwin, and Robert B. Bailey. 1966. Interrelationships of reading ability, listening skill, intelligence, and scholastic achievement. Journal of Developmental Reading 3. 174-178.
Warden, David A. 1976. The influence of context on children's use of identifying expressions and references. British Journal of Psychology 67. 101-112.
Webster, W. G., and E. Kramer. 1968. Attitudes and evaluational reactions to accented English speech. Journal of Social Psychology 75. 231-240.
Whitaker, S. F. 1976. What is the status of dictation? Audio-Visual Journal 14. 87-93.
Whyte, W. F. 1955. Street Corner Society: The Social Structure of an Italian Slum. Chicago: University of Chicago Press.
Wijnstra, Johan M., and Nico van Wageningen. The cloze procedure as a measure of first and second language proficiency. Unpublished manuscript.
Wilds, Claudia P. 1975. The oral interview test. In Jones and Spolsky (1975), 29-44.
Williams, F. 1970. Psychological correlates of speech characteristics: on sounding "disadvantaged." Journal of Speech and Hearing Research 13. 472-488.
Williams, F., J. L. Whitehead, and L. M. Miller. 1971. Ethnic stereotyping and judgments of children's speech. Speech Monographs 38. 166-170.
Wilson, L. I. 1973. Reading in the ESOL classroom: a technique for teaching syntactic meaning. TESOL Quarterly 7. 259-267.
Winer, B. J. 1971. Statistical Principles in Experimental Design, 2d ed. New York: McGraw-Hill.
Appendix

2.* Listening Cloze (open-ended)


DIRECTIONS: You will hear three paragraphs. The first time just listen. The
second time some of the words will be left out. For each
“blank” write down the word you think is missing in the space
provided.**
EXAMPLE: This is an example: Harry and Sarah work at the Bookstore.
Sarah is the manager and Harry is a clerk.
This time try to write down the words to fill in the blanks.
(1) Harry and Sarah work at the ____.
(2) Sarah ____ the manager, (3) and ____ is
a clerk.
1. _
2. _
3. _
The correct answers are (1) bookstore, (2) is, and (3) Harry. If
you are not sure of the answers, just guess.
DO NOT TURN THE PAGE UNTIL YOU ARE TOLD TO
DO SO.

*The numbering system used here corresponds to the numbering used in the main body of the text
referring to these tests. It does not refer to the order of testing, and there are some numbers missing
because they refer to published tests. See the brief descriptions of tests used by Scholz et al., Chap. 2.
**All the directions were presented both in written form and on tape (in English). The written
form and the answer sheet, consisting of numbered blanks, are not given here. Instructors were advised
to make certain students understood thoroughly before the actual testing began. On the whole, the task
proved to be quite difficult for all but the most advanced students.


[texts for Listening Cloze, open-ended]

Passage A
(1) At the airport they saw a helicopter. (2) Neither Tom nor David nor Annabel
had ever seen one close up before. (3) “It looks funny without wings,” Annabel
said. (4) “How long does it take you to fly downtown from here?” (5) Tom asked
the pilot. (6) “Six minutes,” he answered. (7) “Wow! By car it takes almost an
hour,” Tom said. (8) “And it can take even longer if there’s lots of traffic,” (9) his
father added. (10) “I’d like to go for a ride more than anything,” David said. (11)
“I’m sorry,” said the pilot, (12) “I can’t take you downtown now, son. (13) We
have a full load, (14) and everybody is in a hurry." (15) Looking at his watch, he
added, “We have to take off now. So long.”

Passage B

(16) The United States can try to find more oil. (17) Big deposits of oil have
recently been discovered (18) in Alaska, in Mexico, and under the waters of the
North Sea. (19) We already get a lot of oil from underneath the water along the Gulf
Coast and the Pacific Coast of the United States. (20) Scientists believe there may
be large oil deposits off the Atlantic Coast as well.

(21) There is another source of oil, (22) but it will be hard to get. (23) Hundreds of
millions of barrels of oil are mixed with sandy rock called shale. (24) There is
enough shale oil to keep us going for many years. (25) But at present, it is very
costly to remove, (26) and mining it may leave the countryside ugly.

(27) The United States has always been rich in oil. (28) We still produce a great
deal of it. (29) But we use a great deal of oil, (30) so someday we will run out of it.

Passage C

(31) Anybody who has traveled much on any continent (32) knows that there are
many land surfaces (33) that are too flat to be called mountains, (34) and too rough
to be called plains. (35) We can all agree to call such land surfaces hills, (36) with
the understanding that in some regions high hills are difficult to distinguish from
low mountains, (37) and low hills are difficult to distinguish from plains. (38) Hill
country includes regions with an average altitude change of more than 500 to
1,000 feet, (39) and with many more sloping surfaces than flat ones. (40) By this
definition, (41) most of the Illinois mountains are not mountains at all; (42) rather
they are hills. (43) So are many of the so-called mountains of New England. (44)
They may be called mountains by New Englanders, (45) but to a geographer they
are low hills.

3. Listening Cloze (multiple-choice format)


DIRECTIONS: You will hear three short paragraphs. Each one will be
read once; then it will be read a second time with some of
the words left out. Try the example to see if you understand.
EXAMPLE: Harry and Sarah work at the bookstore. Sarah is the
manager and Harry is a clerk.
Taped 1. Harry and Sarah work at the ____.
Instructions Sarah is the ____,
and Harry is a ____.
1. A. library 2. A. friend 3. A. jerk
B. bookstore B. janitor B. fellow
C. drugstore C. manager C. story
D. bar D. owner D. clerk
The correct choices are B for 1, C for 2, and D for 3.
They work at the bookstore. Sarah is the manager, and
Harry is a clerk.
DO NOT TURN THE PAGE UNTIL THE INSTRUC¬
TOR TELLS YOU TO DO SO.

[Listening Cloze, multiple-choice format, student’s answer sheet]

Name ____________ Instructor ____________
Course level ____________ Section ____________

DIRECTIONS: You will hear three short paragraphs. Each one will be read
once; then it will be read a second time with some of the
words left out. Try the example to see if you understand.
EXAMPLE: 1. A. library 2. A. friend 3. A. jerk
B. bookstore B. janitor B. fellow
C. drugstore C. manager C. story
D. bar D. owner D. clerk
The correct choices are B for 1, C for 2, and D for 3. They
work at the bookstore. Sarah is the manager, and Harry is a
clerk.

DO NOT TURN THE PAGE UNTIL THE INSTRUCTOR TELLS YOU TO DO


SO.

[texts for Listening Cloze, multiple-choice format]

Passage A

(1) Two weeks before school started, Henry Higgins was in the kitchen one
evening feeding Fala (2) while Mr. Higgins washed the dinner dishes and Mrs.
Higgins dried them. (3) Henry took some hamburger and half a can of Alpo Dog
Food out of the refrigerator. (4) Henry broke up the hamburger and put it on Fala’s
dish. (5) “Why don’t you chew it?” he asked, (6) when Fala began to gulp it down in
large chunks. (7) Henry spooned the last of the can of Alpo into the plastic dish. (8)
Fala sniffed at the food. (9) Then he wagged his tail and looked hopefully at Henry,
(10) who knew this meant that Fala would eat the dog food (11) only when he was
sure (12) he was not going to get any more hamburger.

Passage A

(1) A. feeding (7) A. some


B. playing with B. the
C. teaching C. up
D. scolding D. all
(2) A. broke (8) A. like
B. hung B. to
C. dried C. for
D. painted D. at
(3) A. Mr. Higgins (9) A. bit
B. Mrs. Higgins B. sat on
C. Fala C. wagged
D. Henry D. bent
(4) A. cup (10) A. was
B. dish B. hoped
C. chain C. be
D. collar D. meant
(5) A. should (11) A. how
B. must B. when
C. haven’t C. that
D. don’t D. ever
(6) A. kinds (12) A. every
B. flavors B. any
C. chunks C. each
D. steaks D. some

Passage B

(13) You know how fast a jet goes. (14) It is hard for the pilot to get out of it (15)
when it is going so fast. (16) It is hard to get away from it too. (17) We had to work
out a way of escape. (18) A pilot must not be caught in a jet when there is trouble.
(19) We made a seat that would shoot down out of the jet. (20) It was automatic.
(21) A pilot could just press a button! (22) Out he would go, seat and all. (23) We
tried out the automatic seat with dummies. It seemed to work. (24) Now I was to be
the first live man to try it.

Passage B
(13) A. foggy (19) A. jet
B. super B. pilot
C. fast C. button
D. far D. seat
(14) A. might (20) A. went
B. should B. was
C. is C. has
D. could D. did
(15) A. like (21) A. pilot
B. actually B. driver
C. also C. rider
D. so D. passenger
(16) A. get (22) A. have
B. walk B. go
C. drive C. be
D. fly D. land
(17) A. it (23) A. wild animals
B. we B. pictures
C. you C. dummies
D. he D. guinea pigs
(18) A. surely (24) A. honest
B. too B. ordinary
C. not C. great
D. try to D. live

Passage C

(25) All Aztec boys had to attend school when they reached fifteen years of age.
(26) They were trained chiefly in making war. (27) That was because Aztecs
valued war. (28) They believed wars made them strong and great, (29) so of course
they taught their children the same values. (30) The Aztec child learned that it was
important to be a brave soldier and (31) that it was an honor to die in battle. (32)
That pleased the gods, who rewarded the soldier with a long and happy life after
death. (33) The Aztecs were also trading people. (34) At their markets they
exchanged food for clay pots, tools, and other things they needed. (35) The Aztecs
traded with other Indians also. (36) Aztec merchants visited villages hundreds of
miles away.
Passage C
(25) A. had (31) A. for
B. did B. like
C. must C. and
D. ought D. to
(26) A. and (32) A. angered
B. to B. worshipped
C. in C. destroyed
D. of D. pleased
(27) A. so (33) A. Amazons
B. because B. Amish
C. however C. Aztecs
D. like D. Asians
(28) A. wanted (34) A. because
B. tried B. since
C. needed C. with
D. believed D. and
(29) A. his (35) A. traded
B. your B. sold
C. their C. bought
D. its D. cashed
(30) A. funny (36) A. villages
B. brave B. miles
C. happy C. others
D. real D. them

4. Multiple-Choice Listening Comprehension


DIRECTIONS:* Again you will hear three texts. You will hear each one only
once. At the end of each text there will be multiple-choice
questions. For example, suppose the text was about a man
and his wife on a long trip. You might be asked to complete a
statement like the following:
EXAMPLE: The man and his wife in the story were
A. going to town.
B. on their way to a neighbor’s house.
C. celebrating a birthday of a relative.
D. on a long trip.

You would mark choice D. on a long trip.

DO NOT TURN THE PAGE UNTIL YOU ARE TOLD TO DO SO.

*These directions were presented both in writing and on tape. The text, questions, and choices
were all taped for each passage. However, the students’ answer sheets only contained printed versions
of the alternatives to each question.
Passage A*
In 1911 people had something new to talk about. A New York newspaper said
it would give 50,000 dollars to the first man to fly across the country. The flight
had to be made within 60 days, however. There weren’t many planes or fliers in
those days. The first American flight had been made only eight years before. Few
people had ever seen a plane flying. The New York newspaper’s story made most
people laugh. No one thought it was possible to fly a plane clear across the country.
Besides, who would be foolish enough to try?
The first man to start was Robert Fowler. He was only able to travel a couple
of hundred miles a day. Everywhere people stopped what they were doing to watch
him. Most of them had never seen a plane before. After 49 days and many troubles
he finally made it from San Francisco to New York. He didn’t collect the 50,000
dollars though because he had waited too long to start his trip and the 60-day time
limit had elapsed.

(1) The time when this story took place was


A. in the early 17th century.
B. in the late 1900s.
C. about 1850.
D. in the early 1900s.
(2) The story is about
A. a man and his dog.
B. a newspaper boy.
C. New York.
D. a trip across the United States.
(3) A New York newspaper offered 50,000 dollars to anyone who would
A. write a prize winning novel.
B. fly across the United States in sixty days.
C. engineer a much needed water project.
D. explore the Grand Canyon in a row boat
(4) People didn’t think anyone would try because
A. it takes a long time to write a book.
B. airplanes were hardly known in those days.
C. no one knew how to build good dams back then.
D. row boats weren’t all that safe in the Grand Canyon.
(5) The first man to start was
A. Robert Fowler.
B. John W. Brown.
C. Henry A. Smith.
D. George Willis.

* Adapted from Science Research Associates Reading Lab Orange 5, Secondary Edition.
(6) He was only able to travel


A. a thousand miles a day.
B. for about an hour at a time.
C. with a permission slip from his doctor.
D. in a horse drawn cart.
(7) Everywhere he went people stopped what they were doing to watch
A. his plane going by.
B. the traffic he created.
C. the horses on the run.
D. the wagon wheels turning.
(8) He finally made it to New York in a little less than
A. two years.
B. six months.
C. three days.
D. two months.
(9) He didn’t win the money though because
A. he started too late.
B. he stopped at various scenic spots along the way.
C. he turned down the offer.
D. he left New York before it arrived.
(10) The sixty-day time limit had
A. delayed.
B. returned.
C. passed.
D. fired.

Passage B*

Little things mean a lot. They may make a lot of money for the man who
invents them. They may save other people a lot of time. Every day you use many of
these little things called “gadgets.” One of these little things made Hyman Lipman
a fortune. He was the man who thought of putting an eraser on a pencil. Ever use
one? He took out a patent on this idea and sold it for $10,000. That may not sound
like much today, but it was ten years’ wages in 1908. Another useful little gadget is
the paper clip. Hortense Jones was responsible for its patent. It netted him
$25,000. Another patent was for screw top jars. It seems simple now, but it was a
real contribution to the average home in the early 19th century. Bottle tops too
were invented by a resourceful man. His name was Pointer. Have you ever thought
of inventing something and taking out a patent on it? Who knows, you might get
rich. While you’re thinking you can scratch off barbed wire invented by Joseph
Glidden in 1827, and zippers made by a man named Judson in 1830. But it is a safe
bet that many gadgets are yet to be invented.

*Adapted from SRA 7 Orange, Secondary Edition.


(11) This passage starts out by talking about


A. great spaces.
B. small objects.
C. little animals.
D. great feats.
(12) According to the paragraph, people who invent things are apt to
A. make a lot of money.
B. become very well known.
C. be honored by their government.
D. be more intelligent than the average man.
(13) This passage does not mention that
A. Glidden invented barbed wire.
B. Pointer invented bottle caps.
C. Judson invented zippers.
D. Edison invented the light bulb.
(14) Little things used every day are sometimes called
A. fingernails.
B. gadgets.
C. resources.
D. furniture.
(15) Hyman Lipman made a fortune by
A. putting caps on bottles.
B. making wire with barbs.
C. inventing the paper clip.
D. putting erasers on pencils.
(16) Most of the inventions referred to in the passage were patented before

A. 1920.
B. 1801.
C. 1800.
D. 1755.
(17) The passage invites the listener to consider
A. becoming a railroad worker.
B. joining the Air Force.
C. flying crop dusters.
D. trying to invent something.
(18) The incentive is that you might
A. become famous.
B. get an education.
C. get rich.
D. become a businessman.
(19) The following gadgets are already patented according to the article
A. pens, glasses, and envelopes.
B. paper clips, zippers, and barbed wire.
C. rubber-soled shoes, hairpins, and watch bands.
D. levis, harnesses, and wagon wheels.
(20) The author concludes by saying that


A. most gadgets that are needed have already been invented.
B. many gadgets are yet to be invented.
C. no more inventors are actually needed.
D. inventors are a thing of the past

Passage C*

One summer day in 1842 a man appeared on Broadway in New York City
with five bricks under his arm. Solemnly, he placed one of the bricks at a busy
corner. Then he walked three blocks down Broadway, placing a brick at each
corner. With the fifth brick under his arm, he walked to the American Museum.
There he presented a ticket for admission, walked through the building, and went
out the rear door.
In a few minutes, he was back on Broadway, where he picked up each brick
and put down in its place the brick he had been carrying. Then he went back to the
Museum and through its doors without a word to the curious people who followed
in his path.
Each time he entered the Museum, a crowd paid the admission charge of 25
cents, hoping to find the answer to the puzzle of the walking bricklayer inside. Half
an hour after this strange business began, 500 people were following the man.
After several days, the police put a stop to the performance because the crowds
were interfering with street traffic. But P. T. Barnum was pleased. He paid off his
walking bricklayer and chuckled as he counted up his increase in profits from the
Museum. It was known as Barnum’s American Museum.

(21) This text is about something that happened


A. in the mid 1800s. C. in the 16th century.
B. in 1942. D. about 1492.
(22) It all took place in
A. Detroit, Michigan. C. New York City.
B. San Francisco. D. Philadelphia.
(23) You might say that the man in the story behaved
A. very strangely. C. without thinking.
B. quite normally. D. as if he were quite drunk.
(24) To begin with he was carrying five bricks
A. in a knapsack. C. in his suitcase.
B. on top of his head. D. under his arm.

*Adapted from SRA, Secondary edition.


(25) As he walked toward the American Museum he laid all the bricks but one
down
A. each on a different corner one at a time.
B. in the gutter at the corner of 25th and Vine.
C. and smashed them one by one.
D. and never picked them up again.
(26) Then he carried the last brick into
A. Madison Square Garden.
B. the Astrodome.
C. the Ford factory.
D. the American Museum.
(27) He paid the admission fee and walked right on
A. into the inside.
B. out the back door.
C. up the stairs.
D. down the street.
(28) A few minutes later he was back on Broadway where he picked up each brick
he laid down and
A. picked up three passengers and drove across town.
B. put the one he was carrying in its place.
C. carried them all back home.
D. put them back in his suitcase.
(29) Then he went back to the
A. Astrodome.
B. movies.
C. museum.
D. fair.
(30) He kept doing this over and over, and after a while people began to follow
him, each time paying admission as they went through the
A. tunnel.
B. museum.
C. theater.
D. school.
(31) P. T. Barnum had paid the man to behave strangely in order to get enough
people to go to the
A. theater.
B. Brooklyn bridge.
C. museum.
D. desk clerk.

5. Dictation
DIRECTIONS: This is a dictation test. You will hear three paragraphs.
Each one will be read three times. The first time, listen.
The second time, write what you hear, during the pauses.
The third time, you may try to correct any mistakes.
Punctuation marks will be given the second time. Do not
spell out the punctuation marks.
Taped If the speaker says, “This is a book, period,” you will
Instructions write: This is a book.
If the speaker says, “He has a pen, comma, a pencil,
comma, and a notebook, period,” you will write: He
has a pen, a pencil, and a notebook.
If the speaker says, “Those classes, dash, English and
history, dash, were interesting, period,” you will
write: Those classes—English and history—were
interesting.
Remember, you will hear each paragraph three times.
The first time listen; the second time, write; and the third
time, correct mistakes.
Are there any questions?
Now we will begin with paragraph 1.

DICTATION A*

Some days / it doesn’t pay to get up. / Some days / you can’t get anything right /
One day I woke up, / and the sun was shining. / Birds were singing. / I got up, /
spilled hot coffee at breakfast, / tripped and fell down the stairs, / and got to class
just as it was ending. / All in all, it was a bad day to get out of bed. /

DICTATION B

Measles is a childhood disease. / It used to strike / about 4 million children a year/


in the United States alone. / In 1963 a vaccine was developed. / Now there is no
longer any need / for so many children to suffer. / Nevertheless this year / an
epidemic of measles threatens. /

DICTATION C

I have always believed / that every college curriculum / should include a course /
on the nonappreciation of literature. / In such a course / the student would decide /
what books he doesn’t like and why. / A well-reasoned term paper / on why you
don’t like Major Barbara / could be a contribution of great merit. / We learn by
what we reject / as well as by what we accept. /

11. Repetition (elicited imitation)


Thank you all for being here today. We are going to ask you to
do some exercises like repeating sentences, reading aloud, and
filling in the blanks. Just relax, and do as well as you naturally can.
We hope you can learn something from this exercise.

*Slashes indicate the points at which pauses were inserted in the text on the second reading.
Part I:
The first exercise is sentence repetition. We will say a
sentence, and you will repeat exactly what you hear. For example, I
will say:
Taped Last summer I saw mountains for the first time.
Instructions Then I stop, and you repeat.
Then I continue:
In the distance the mountains looked very blue.
And again you repeat. Do you understand? I say the sentence, and
you repeat it when I stop. If you have any questions, raise your hand
now. [Pause for the lab instructor to answer questions.]
OK, let’s begin.

Passage A
Once there was a good man and his good wife.
They lived in a beautiful house.
The man and his wife were very happy except that he could never
find things.
He would look for his shoes, and he couldn’t find them.
He would look for a book, and he couldn’t find it.
But it was not his fault.
He couldn’t find anything because his good wife had moved everything.

Passage B

China was one of the oldest and largest kingdoms in the world.
Why then was it so easily divided by foreign imperialists?
There were many reasons.
The government was weak.
The people wanted to keep their old customs and ways of doing
things.
They refused to use western methods of farming and manufacturing.
China in the 19th century was a backward country.

Passage C

The human organism is extremely resourceful in adjusting to


various situations.
It has a way of compensating for losses and attempting recovery.
To the extent that an infant child has been assured of his worth and
lovability,
he begins to adapt, to act and react in ways that will spare him pain
and fill in his painful voids.
Some of his devices will be designed to avoid further pain or to
invite and win love.
12. Oral Cloze Test (with spoken responses)


This is part II. In this exercise we will read a paragraph once,
while you listen. Then we will read it again, but the second time,
some of the words will be missing. When we come to a missing word
we will stop, and wait for you to guess the next word. There are
many possibilities, many correct answers, but try to guess the word
that will come next. For example, listen to these sentences:
Last summer I saw mountains for the first ____.
Taped In the distance the ____ looked very blue.
Instructions Blue is my favorite ____.
When we stop, you fill in a word. After you have given your word
you will hear the correct answer. Please give your answer as quickly
as possible, before you hear our answer. Now listen again to the
example. First just listen to the paragraph. ... Now listen again, but
this time fill in a word when we stop. Do you understand? If you
have any questions, raise your hand now. Remember, when I stop
first you say a word, then you will hear a possible word. OK, let’s
begin.
Passage A
Everyone enjoys stories. Stories are fun to listen to and fun to
tell. They can be about any subject. The more interesting and exciting
your stories are, the more your friends will enjoy them. Sometimes
you tell a story about something that really happened to you.
This kind of story is called a true story. In a true story you may tell
about the day you were lost, how you got your pet, or about some
other exciting or funny adventure.

Passage B

Robert McCloskey was born in Hamilton, Ohio, in 1914. This


was his home until he was a young man. It was fun to grow up in
those days in a small town like Hamilton. Bob knew almost every¬
one there, and they knew him. He played in the parks, visited the
barbershop, the stores and the public library. Whenever anything
exciting happened in Hamilton, he knew about it. Sometimes he
saw it happen. He enjoyed his childhood in his town very much.

Passage C

Since they had moved to America, they were a busy family.


They’d stand outside the back door, usually speaking Japanese,
which Rose didn’t understand very well. But her brother Joe kept
putting in English words because he said their father and mother
ought to use them too, to be really American. They both tried, and
their father could carry on his business conversations now with no
trouble. But for their mother it was harder.
Passage D

Chomsky was an unusual monkey. He had many habits which


made him seem almost human. His smoking was one. He could light
his cigarette with matches or a lighter with equal ease. Then he
would lie down on the ground on his back, one arm under his head
and his legs bent up and crossed, blowing great clouds of smoke
into the sky. He was the only animal I have met that would think of
sharing things with you.

Passage E
Besides changing the environment to suit his own needs, man
also makes it uninhabitable for many wildlife species by polluting
and poisoning it. Chimneys, factories, and exhaust pipes belch
deadly debris into the atmosphere. Many streams and rivers run
thick with human and industrial refuse, making them death traps
for fish and other forms of aquatic life. A number of lakes are dying,
because man uses them as dumping grounds for wastes of every
kind.

13. Reading Aloud

Now we will begin part III. This last exercise is very easy. You
will be given 3 paragraphs to read. All you have to do is read them as
naturally as possible. When you finish the first paragraph, go on to
Taped the rest of the paragraphs. When you finish, turn off your tape
Instructions player and stay in your seats until everyone is finished. You will
have a few minutes to read over the paragraphs first.
Thank you again for coming. We really appreciate your help.

Passage A
Way out at the end of a tiny little town was an old overgrown
garden, and in the garden was an old house. In the house lived Mary
Mullins. She was nine years old, and she lived there all alone. She
had no mother and no father. That was, of course, very nice because
there was no one to tell her to go to bed just when she was having the
most fun.

Passage B
When students and an instructor walk into class the first day of
the semester, they know without thinking what to expect of each
other and how each will behave. The students know that the
instructor will stand in front of the class behind a lectern, probably
call the roll, assign reading, and dismiss them early on the first day.
The instructor knows that, unless the class is required, students will
be shopping around.

Passage C
For long-distance travel, the airplane has replaced the railroad
and the ship as the principal carrier. The airplane has become so
commonplace that we often fail to realize what a recent development
in transportation it really is. The first transatlantic passenger
flights were made only a few years before World War II. Frequent
service came into being only after the war, and it was not until jets
were introduced that passenger capacity began to expand.

15. Multiple-Choice Reading Match Test

DIRECTIONS: You will read three short paragraphs. In each one, some of the
words or phrases are underlined. For each underlined word
or phrase, several choices are given inside square brackets as
shown in the example below. YOU ARE TO SELECT THE
CHOICE THAT MEANS THE SAME OR NEARLY THE
SAME AS THE UNDERLINED WORD OR PHRASE.

EXAMPLE: It was a small [A. large / B. little / C. smart / D. cute] dog.

ANSWER: (B) little.

The best choice is B. little, because the words small and little
mean almost the same thing in this context. So, you would
circle the letter B.

You should read the entire passage before you try to mark the
correct choices. Mark your answers right on the test booklet.

DO NOT TURN THE PAGE UNTIL YOUR INSTRUCTOR
TELLS YOU TO BEGIN. YOU WILL HAVE FIFTEEN
MINUTES TO FINISH PART 1.

Passage A

Joe was a (1) beginning [A. sad / B. new / C. baby / D. single] English student.
He had come to the United States for several (2) reasons [A. hopes / B. wonders /
C. purposes / D. desires]. First of all, he wanted to (3) learn [A. become able
to understand and speak / B. try to forget all about / C. be able to hear /
D. have a chance to use his limited] English and to really be able to talk with
and understand the people who (4) call themselves [A. would like to be /
B. think that they are / C. telephone other / D. refer to themselves as]
Americans. Another thing that Joe wanted to (5) accomplish [A. attempt /
B. get done / C. find out / D. get out of] was a good deal of traveling. Some of
the (6) sights [A. ones / B. others / C. places / D. towns] he wanted to see
included the Grand Canyon and Yosemite National Park. (7) The third item on his
list [A. The number three thing he wanted to do / B. The last thing on his mind /
C. The one thing he didn’t want to do / D. The item second to none] was to
finish his degree in engineering so that (8) he would be able to [A. the
engineers would help / B. he could / C. the degree would / D. it would be able
to] build much needed dams and other water projects in his homeland back in
Africa. There were (9) canals and reservoirs [A. canaries and elephants /
B. channels and highways / C. waterways and manmade lakes / D. government
buildings and reserves], not to mention the many other projects that were
underway. (10) Joe wanted to contribute to all of it [A. Joe wanted to be
involved / B. He wanted to make a gift to it / C. Joe was apart from all of it /
D. He hoped he could get something out of it all] because he knew how important
the projects were to the survival of his people. For all of these reasons,
(11) learning English for Joe [A. teaching Joe to speak English / B. Joe’s
learning of English / C. to learn Joe’s English / D. learning about himself in
English] was not just a school task. In fact, (12) it was a necessary step
toward a goal he had set for himself [A. it was a means to an end / B. it was a
way for it to end / C. English was a step toward the end / D. English was his
way to himself].

Passage B

Nicholas Rizos had come to Athens to work. He was not merely visiting Greece as
a (13) tourist [A. friendly laborer / B. book author / C. sailor / D. casual
traveler]. He left America and (14) arrived in [A. got to / B. traveled in /
C. discovered in / D. went across] Athens after several weeks on a Greek
(15) cargo [A. passenger / B. government / C. tourist / D. transport] ship. His
Uncle Stavros had (16) invited [A. ordered / B. asked / C. begged / D. wanted]
Nicholas to come and help him in his garage. Nicholas had (17) accepted [A. said
to wait for / B. told him not to / C. taken him up on / D. put him off on] the
invitation because he wanted to become a mechanic. He (18) thought [A. believed /
B. wondered if / C. wished that / D. promised] the work would do him good.
(19) At least [A. On the other hand / B. In any case / C. For the best / D. For
the time being] the experience would be good for him. He would have (20) a rare
opportunity [A. a strange experience / B. an unusual chance / C. a foreign
exchange / D. an unknown opposition] to get to know the country (21) of his
parents’ birth [A. where his father and mother were born / B. where his parents
used to live / C. of his own birth and his youth / D. of his heritage and
grandfathers]. For several months he had studied Greek. (22) He wanted to speak
it as well as possible [A. Nicholas hoped he would be able to talk with sailors /
B. He wanted to be able to speak Greek as well as possible / C. Nicholas wanted
to learn English as well as he could / D. He hoped he would be able to
understand English on the way] and to be able to read signs along the streets of
Athens. Now (23) he wished he could have practiced more [A. Nicholas was sorry
he had not spoken more Greek / B. He wished he had practiced his English more /
C. Nicholas wanted to practice sailing more / D. He was sorry he hadn’t
practiced reading signs] with the Greek sailors on board the cargo ship. That
first morning in Athens was some experience! Finally, (24) Nicholas was really
there—in the land where his parents had married and where he himself had been
born! [A. He was leaving the country of his birth and childhood. / B. He had
come to the land where his parents were married and where he was born. /
C. Nicholas had ended his trip and had arrived back in the new home of his
parents—right back in the United States. / D. Nicholas had made it back to the
cargo ship—his stay in Athens was over at last.] How happy he was to be there!



Passage C

This book is about power and the (25) means [A. agencies / B. methods /
C. theories / D. memories] of producing it. The word “energy” (26) has
[A. repairs / B. restores / C. clarifies / D. possesses] more than one meaning
in (27) common use [A. useful practice / B. ordinary usage / C. communication
research / D. actual formation]. Even in science the word is used in
(28) several [A. consistent / B. actual / C. various / D. minimal] ways—for
example, “heat energy,” “electrical energy,” and so on. (29) Usually [A. In the
long run / B. In summary / C. In general / D. In this analysis] energy in these
expressions means the (30) capability [A. fluency / B. desire / C. motive /
D. ability] to do work. But in this book we shall (31) adapt this idea
[A. thoroughly apply this remedy / B. carefully adopt this communication /
C. slightly modify this notion / D. completely disregard this modification] and
use the word “energy” to describe (32) the raw materials of [A. available ideas
of / B. bare considerations of / C. expert conditions for / D. basic resources
for] power production. The word “power” will be (33) used to mean [A. limited
to control / B. asked to reflect / C. established to produce / D. utilized to
signify] energy that has been converted into a form that can (34) readily be
applied [A. vigorously be denied / B. correctly be appointed / C. quickly be
chosen / D. easily be used] for many practical purposes. The first section of
this book deals with (35) the sources of energy that can be turned into power
[A. the products of reserve fuels that can be promoted as real energy / B. the
results of complete power that can be computed as specific energy / C. the use
of scientific power that can be thought of in energy units / D. the origins of
energy that can be converted into actual power]. The second section presents
the scientific principles that lie behind the conversion of energy into power.
This section also shows (36) how and why different kinds of machines have been
manufactured [A. the reasons behind the production of various types of
mechanisms / B. the opportunities for the utilization of every sort of
resource / C. the reports on the organization of many groups of industries /
D. the comments about the distribution of some kinds of energy] to meet
particular needs while using a particular sort of “raw energy.”

NOW STOP AND WAIT FOR FURTHER INSTRUCTIONS.
YOU MAY GO BACK AND CHECK YOUR ANSWERS.
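
Scoring a multiple-choice instrument like the one above reduces to counting matches against an answer key. The Python sketch below (not part of the original materials) illustrates this; the key fragment is hypothetical, since the actual key is not reproduced in this appendix, and the one-point-per-item, no-guessing-penalty convention is an assumption.

# A minimal sketch of multiple-choice scoring. The key fragment below is
# hypothetical (the published key is not reproduced here), and scoring is
# assumed to be one point per item with no penalty for guessing.

def score_multiple_choice(responses, key):
    """Count the items on which the examinee's letter matches the key."""
    return sum(1 for item, answer in key.items()
               if responses.get(item) == answer)

key = {1: "B", 2: "C", 3: "A"}          # hypothetical key fragment
responses = {1: "B", 2: "C", 3: "D"}    # one examinee's answers
print(score_multiple_choice(responses, key))   # prints 2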

16. Standard Cloze

DIRECTIONS: You will read 2 short paragraphs. Some of the words in the
paragraphs have been left out. Try to guess the missing words.
For example, you might read:
EXAMPLE: The cat ran up the _____.
You should fill in the blank with the word that seems to fit the
context best. If you do not know, you should guess. In the
example you might answer
ANSWER: The cat ran up the tree.
Other words that would also fit in the blank are wall, branch,
street, and so on. Remember that when there is more context,
the choices are more limited. Make sure your answers fit. Be
sure to read the whole text before you try to fill in the missing
words. It is all right to guess if you are not sure. Try to use only
one word for each blank. Do not turn the page until the instructor
tells you to begin. You will have 20 minutes to complete
this exercise.

Man has always made music. His (1) (voice) is a natural musical
instrument. From (2) _____ times, music in some form has (3) (always)
been with him. Man made music (4) (with) his voice long before he ever
(5) (created) a musical instrument like a guitar (6) (or) a flute. For
thousands of years man’s (7) (music) was the sound of his own (8) (voice),
the sound of animals and the (9) (singing) of birds. There
was also the (10) (sound) of streams and other natural things (11) (around)
him. Today composers write difficult music (12) (in)
symbols that are learned by other people. (13) (A) performer can sing
or play a (14) (song) that he has never heard before (15) (if) he
has learned to read those symbols.

Modern transportation and communication inventions have (16) (made)
possible the development of the metropolitan community, (17) (and)
we would expect aviation likewise to (18) (have) some
effect on it. Through the (19) (use) of the automobile, particularly, the
customs (20) (and) attitudes of the large city have (21) (been)
extended to the people in the (22) (surrounding) territory. Observers have noted
that small (23) (towns) located near large cities have characteristics (24)
(different) from those of the small towns (25) (located) more distantly.
For instance, small communities (26) (within) a radius of large urban
(27) (centers) have lost a large part of (28) (their) former isolation
and provincialism. We may (29) (expect) this radius to become even wider
(30) (with) the development of aviation, particularly private aircraft.
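
Cloze protocols such as the two above are commonly scored in either of two ways: by the exact-word method (only the deleted word counts) or by an acceptable-word method (any contextually appropriate word counts). The Python sketch below (not part of the original materials) illustrates both; the acceptable-word sets are invented for illustration and are not the lists used in these studies.

# A minimal sketch of exact-word versus acceptable-word cloze scoring.
# The acceptable-word sets below are illustrative assumptions only.

def score_cloze(responses, exact_words, acceptable=None):
    acceptable = acceptable or {}
    exact_score = 0
    acceptable_score = 0
    for item, word in enumerate(exact_words, start=1):
        given = responses.get(item, "").strip().lower()
        if given == word:
            exact_score += 1
        if given == word or given in acceptable.get(item, set()):
            acceptable_score += 1
    return exact_score, acceptable_score

exact_words = ["voice", "always", "with"]          # the deleted words
acceptable = {2: {"constantly", "forever"}}        # invented alternates
responses = {1: "voice", 2: "constantly", 3: "by"}
print(score_cloze(responses, exact_words, acceptable))   # prints (1, 2)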

19. Multiple-Choice Writing Task

NAME: ____________  INSTRUCTOR: ____________

COURSE LEVEL: ____________  SECTION: ____________

[Part 1: Selecting an Appropriate Continuation]


DIRECTIONS: You will read three short passages. The first is about a Milkmaid
and Her Pail; the second is about Farming and Population;
and the third is about The Economics of Development.
In each passage, there are blanks where words or groups of
words have been left out. For each blank there are several
choices. Only one of the choices really fits in the blank. Place
the letter of that choice in the blank.

EXAMPLE: The sky (1) _____ [(A) being / (B) with / (C) is / (D) used to] blue
when the sun is shining.

ANSWER: The sky (1) C blue when the sun is shining.

The correct answer is (C). We say, the sky is blue when the
sun is shining. So you should put a C in the blank where the
word is belongs, as shown in the Answer. Do not turn the page
until the instructor tells you to begin.

Passage A

A Milkmaid and Her Pail

A farmer’s daughter had been out to milk cows and was returning home, carrying
her pail of milk on her head. As she walked along she (1) _____ [(A) started /
(B) had to / (C) prepared / (D) began to] thinking: “The milk in this pail will
provide me with cream, (2) _____ I will make into butter and take to market to
sell. Then I will buy eggs and these will produce chickens. Soon I will have
(3) _____ [(A) a large / (B) a hard / (C) an expensive / (D) a high]
poultry-yard. After that I will sell some of my chickens, and I will buy myself
(4) _____ [(A) these new dresses I am thinking / (B) some of the dressing /
(C) new dresses for herself / (D) the new dresses that I want] which I will wear
when I go shopping. All the young men will love me, but I will toss my head and
have nothing to say to them.”

Forgetting (5) _____ [(A) all about the pail / (B) what the pail was about /
(C) the pail about her head / (D) about all of the pail] she tossed her head
proudly. Down went the pail. (6) _____ [(A) Spilling the milk, / (B) The milk
was spilled, / (C) Some milk will spill / (D) Already spilled all the milk.]
and all her fine castles in the air vanished in a moment!

Passage B

Farming and Population

To raise the food and other farm products people need, we (1) _____ [(A) will /
(B) must / (C) have to / (D) should] have more land. This means arable land, of
which a little more than one and a quarter acres was available per person for
the world as a whole (2) _____ [(A) on / (B) at / (C) in / (D) upon] 1955. If
the world population increases by the year 2000 (3) _____ [(A) for / (B) to /
(C) around / (D) at] about seven billion persons, which some scientists now
predict, the amount of arable land per person will decrease to just over
one-half acre. There will be some new land under cultivation, of course, and
better fertilizer (4) _____ [(A) will be using, / (B) will used to be, /
(C) will be used, / (D) will get used to,] which will increase productivity.
But advances in medicine (5) _____ [(A) should have made / (B) should to make /
(C) will have make / (D) will make] people live longer, and this will offset
the factors that slow down the rate of population growth. As the population
increases, therefore, the size of the piece of land from which each person
(6) _____ [(A) having his food has to decrease. / (B) his food has to
decrease. / (C) has his food to decrease. / (D) gets his food will decrease.]

Passage C

Economics of Development

“Development” meant the development of raw materials, of food supplies or of
trading profits. The colonial power was primarily interested in supplies and
profits, not in the development of the natives, and this (1) _____ [(A) is
meaning / (B) had meant / (C) was meaning / (D) meant] it was primarily
interested in the colony’s exports and not in its (2) _____ [(A) internal
market. / (B) market place. / (C) marketability. / (D) market abroad.] This
outlook has stuck to such an extent that even the Pearson Report considers the
expansion of exports to be the main criterion (3) _____ [(A) for succeeding /
(B) by succeeding / (C) to succeed / (D) of success] for developing countries.
But of course, (4) _____ [(A) someone / (B) no one / (C) people / (D) country]
do not live by exporting, and what they produce for (5) _____ [(A) herself /
(B) itself / (C) themselves / (D) them] and for each other is of infinitely
greater importance to them (6) _____ [(A) than what they produce for
foreigners. / (B) than what if they produced it for foreigners. / (C) if their
produce is for many other foreigners. / (D) because their production is
primarily for foreigners.]

[Part 2: Editing Errors]

DIRECTIONS: You will read three more short passages. The first is about
Gifted Children; the second is about the Migration of Birds;
the third is about the Origins of Language. In each passage
some of the words or groups of words are underlined. Next to
each underlined word or group of words several choices are
given. If the underlined portion is correct, you should mark
choice (A) NO CHANGE. However, if it is not correct, you
must select choice (B), (C), (D), or (E) to replace the underlined
word or group of words. Only one of the choices really
fits the context.

EXAMPLE 1: John (1) goed [(A) NO CHANGE / (B) go / (C) will go / (D) went /
(E) has went] to town yesterday.

ANSWER 1: John (1) goed (D) to town yesterday.

The correct answer is (D), went. The word goed is not correct
and should be changed to went.

Here is another example.

EXAMPLE 2: John had to (2) leave [(A) NO CHANGE / (B) left / (C) leaved /
(D) leaving / (E) gone] early.

ANSWER 2: John had to (2) leave (A) early.

The answer is (A) because NO CHANGE is necessary. Do not
turn the page until the instructor tells you to begin.

Passage A

Gifted Children

Most people have misconceptions (1) for [(A) NO CHANGE / (B) with / (C) toward /
(D) through / (E) about] gifted children. It seems that no one knows for sure
(2) which [(A) NO CHANGE / (B) that / (C) because / (D) whether / (E) unless]
these qualities reflect heredity or environment. Parents and the home
atmosphere (3) count heavily with stimulating [(A) NO CHANGE / (B) count
heavily in stimulating / (C) count heavily for stimulate / (D) count heavy in
stimulating / (E) count heavily for stimulating] giftedness. What’s
(4) needing [(A) NO CHANGE / (B) need / (C) needed / (D) for need / (E) to
need] is a strong mother and father who love, appreciate, respect and trust
their children and a home (5) where there is harmony and sharing of
experiences. [(A) NO CHANGE / (B) that there is harmony and sharing
experiences. / (C) where there are harmony and sharing experiences. / (D) where
there is harmony and to share experiences. / (E) where there is harmony and
share experiences.] Certain environments and activities are especially
conducive (6) of intellectual development. [(A) NO CHANGE / (B) to intellectual
development. / (C) to intellectual develop. / (D) to intellectually
development. / (E) for intellectually developing.]

Passage B

The Migration of Birds

With the coming of autumn, many species of birds living in northern latitudes
migrate (1) toward southward. [(A) NO CHANGE / (B) to south. / (C) southward. /
(D) southern. / (E) to the southward.] They do not migrate (2) as [(A) NO
CHANGE / (B) since / (C) because / (D) on account of / (E) why] they have
“thought out” the situation and have planned ahead for winter. This is a human
way (3) be seeing [(A) NO CHANGE / (B) in seeing / (C) of seeing / (D) in
thinking / (E) to think] the problem. For the migrating birds there is no
“problem.” There is (4) anything [(A) NO CHANGE / (B) nothing / (C) not
everything / (D) nobody / (E) not all of it] left to plan for; they are simply
responding (5) on certain changes in their environment. (6) The gradual
shortening of daylight hours seems to trigger certain glandular changes,
[(A) NO CHANGE / (B) It is seemed that the gradual shortening of daylight hours
triggers certain glands to acting, / (C) The gradual shortening of daylight
hours has to be triggering certain glands into acting, / (D) The gradual
shortening of daylight hours seems to triggering certain glands into action, /
(E) It has been that the gradual shortening of daylight hours has triggered
certain glandular reactions,] which then result in the migratory flight.

Passage C

The Origins of Language

We are profoundly ignorant (1) in the origins of language, and have (2) to
content [(A) NO CHANGE / (B) to be contented / (C) been contented /
(D) contently been / (E) to contently] ourselves with more or less plausible
speculations. We (3) never know [(A) NO CHANGE / (B) even do not know / (C) do
not knowing / (D) do not even know / (E) do even not know] for certain whether
language arose (4) with [(A) NO CHANGE / (B) at / (C) by / (D) on / (E) as] the
same time as tool making and the earliest forms of specifically human
cooperation. In the great Ice Ages of the Pleistocene period people made fire
and cooked their food. (5) They hunted big game, [(A) NO CHANGE / (B) They had
big hunting games, / (C) They hunted with big game, / (D) The games were
largely hunted, / (E) They were hunted by big game.] often by methods that
called for considerable cooperation and coordination. It is difficult to
believe that (6) speech makers lacked the power in this culture. [(A) NO
CHANGE / (B) the power of speech made a lack in the culture. / (C) making
speeches was lacked in this culture but power was not. / (D) the power of the
culture was in the speech makers who lacked it. / (E) the makers of this
culture lacked the power of speech.]

[Part 3: Ordering Words, Phrases, and Clauses]

DIRECTIONS: In this part, again, you will read three short passages. The first
passage is about Growing Up; the second is about Modern
Man; and the third is about Driver Education and Traffic
Safety. In each passage there are blanks where some words or
groups of words have been left out. The words that are
missing are given, but they are not given in the correct order.
You must decide what order is the correct order.

EXAMPLE: George really loves Sarah, (1) _____ (2) _____ (3) _____ (4) _____
[(A) doesn’t / (B) but / (C) Sarah / (D) love] him.

ANSWER: George really loves Sarah, (1) B (2) C (3) A (4) D him.

In this part you use all of the possible choices. Because the
sentence should say “George really loves Sarah, (1) but (2) Sarah
(3) doesn’t (4) love him,” you should write the letter B in blank (1),
C in blank (2), A in blank (3), and D in blank (4). Do not turn the
page until your instructor tells you to begin.

Passage A

Growing Up

A small child has (1) _____ (2) _____ (3) _____ (4) _____ of the future. The
year between one Christmas and the next Christmas (5) _____ (6) _____ (7) _____
(8) _____ [(A) seems / (B) eternity / (C) an / (D) like]. Francie’s idea of
time was like this (9) _____ (10) _____ (11) _____ (12) _____. Between her
eleventh and twelfth birthdays, things changed. The future came along quicker,
(13) _____ (14) _____ (15) _____ (16) _____ [(A) shorter / (B) day / (C) each /
(D) seemed] and each week seemed to have fewer days in it. Things were changing
so fast for Francie that (17) _____ (18) _____ (19) _____ (20) _____ [(A) she /
(B) mixed / (C) got / (D) up] (21) _____ (22) _____ (23) _____ (24) _____
[(A) some of the things she loved so much in her father / (B) and that her
mother was wrong once in a while / (C) during this period of time she
discovered that / (D) were considered very comical to other people].

Passage B

Modern Man

One finds that progress can also have its drawbacks. It is true that today man
moves more swiftly through the world. But in doing so, he often loses (1) _____
(2) _____ (3) _____ (4) _____ [(A) of / (B) and traditions / (C) the roots /
(D) sight] that give substance and meaning to life. (5) _____ (6) _____
(7) _____ (8) _____ [(A) does / (B) that / (C) nor / (D) the fact] he is better
informed through television, radio, newspapers, and books necessarily mean that
he is (9) _____ (10) _____ (11) _____ (12) _____ [(A) men of / (B) than /
(C) wiser / (D) earlier generations]. Instead, the ease (13) _____ (14) _____
(15) _____ (16) _____ [(A) are produced today / (B) with which / (C) written /
(D) and spoken words] sometimes seems to lead to superficiality of thought.
(17) _____ (18) _____ (19) _____ (20) _____ [(A) when he is not working /
(B) although man has been given the gift of leisure and a long life / (C) he
has become more restless / (D) and is often uncomfortable]. Flooded with goods
and gadgets, he finds (21) _____ (22) _____ (23) _____ (24) _____ [(A) for
material things / (B) rather than satisfied / (C) his appetite / (D) increased].

Passage C

Driver Education and Traffic Safety

The “traffic problem” was created by man and perhaps it can progressively be
solved by man. You, the driver, have a big stake (1) _____ (2) _____ (3) _____
(4) _____ [(A) effective / (B) in / (C) control of / (D) the] traffic, which
includes everyone and everything that moves on our streets and highways—
pedestrians, passenger cars, trucks, busses, motorcycles, and bicycles.

In a large sense, (5) _____ (6) _____ (7) _____ (8) _____ [(A) solve /
(B) everyone / (C) to / (D) has] the traffic problem for himself. Streets and
roads are laid down. Signs, signals, and markings are used to direct and
regulate traffic and to (9) _____ (10) _____ (11) _____ (12) _____. Police
supervision is provided to aid and protect the safe drivers and to remove from
the traffic those who abuse their driving privileges. There are (13) _____
(14) _____ (15) _____ (16) _____ [(A) of people / (B) throughout this country /
(C) working / (D) tens of thousands] to help keep you safe.

You are the only one, however, (17) _____ (18) _____ (19) _____ (20) _____
[(A) will pay off / (B) can make certain / (C) who / (D) that their efforts].
You are the one who can do the most to keep yourself and others safe. All of
the safeguards (21) _____ (22) _____ (23) _____ (24) _____ [(A) will have
little effect / (B) which society has developed for your protection / (C) that
you will be a safe and efficient driver / (D) if you do not determine].

20. Rating of Recall Task

NAME: ____________  INSTRUCTOR: ____________

COURSE LEVEL: ____________  SECTION: ____________

DIRECTIONS: You will read three short passages one at a time on the screen
in the front of the room. You must write down everything you
remember after you read the passage. Try to get the meaning.
The exact wording of the passage is less important. You will
have exactly one minute to study the passage. Then, the projector
will be turned off and you will not see the passage
again. When the projector is off, you may begin writing. You
may not take notes while the projector is on. You will have
five minutes to write what you remember of the passage. Here
is a very simple example:
EXAMPLE: Sarah is an older woman who lives alone. She works at a
library. She rarely goes out at night because she is afraid of
the street gangs in New York.

Now write your answer.



ANSWER:

Now you may check your answer. Did you tell where Sarah
lives? Her name? Did you mention her job and where she
works? Did you tell about the fact that she lives alone? Did
you mention her fear and why she is afraid?

Your score depends on how much of the passage you are able
to reconstruct from memory and write down. The facts are
important. The order and manner in which you write them
down may be important, too. Try not to leave out anything, or
to add things that were not in the passage.

Passage A

A snake sheds its skin as it grows. First, the snake breaks the skin on its nose. Then
the snake crawls out, leaving the skin behind. Although snakes have no legs, they
move about very well. Some can crawl as fast as you can walk. Some climb trees.
Most of them can swim.

Passage B

Composition, oral as well as written, is the controlled use of language. The two
forms make up the “expressive” language arts of speaking and writing. But composition
is more than merely talk or writing; it is speech or writing with a plan and a
purpose and a conscious choice of words and ideas.

Passage C

An adaptation is a body structure that makes an animal or plant more likely to
survive in its environment. Adaptations for food finding include special forms of
stomachs and teeth, bills of birds, tongues and mouth parts of reptiles, amphibians,
and insects, and adaptations of certain species of plants for the use of animal rather
than the usual soil minerals.
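
The directions above say the score depends on how much of the passage the examinee can reconstruct. In practice such protocols are usually rated by human judges, but a crude mechanical approximation counts how many of the passage's content words reappear in the protocol. The Python sketch below (not part of the original materials) assumes a simple tokenizer and an invented stop-word list.

# A crude sketch of recall scoring: the proportion of the passage's
# content words that reappear in the examinee's protocol. The stop-word
# list and tokenization are assumptions; published studies typically use
# human judges rating idea units instead.

import re

STOP_WORDS = {"a", "an", "the", "is", "it", "its", "as", "of", "to",
              "and", "on", "in", "some", "can", "most", "them", "you"}

def content_words(text):
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}

def recall_score(passage, protocol):
    target = content_words(passage)
    return len(content_words(protocol) & target) / len(target)

passage = ("A snake sheds its skin as it grows. First, the snake breaks "
           "the skin on its nose.")
protocol = "The snake grows and sheds the skin, starting at the nose."
print(round(recall_score(passage, protocol), 2))   # fraction of content words recalled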

22. Grammar (Parish Test)


NAME: ____________  INSTRUCTOR: ____________

COURSE LEVEL: ____________  SECTION: ____________

DIRECTIONS: You will read three passages. The first is about students practicing
for an English test; the second is about a mother and
her children; the third is about two friends discussing work.
In each passage, there are blanks where a word has been left
out; in a few cases, a contraction (like I’m, isn’t, doesn’t) is
needed to fill the blank. Read the entire passage before filling
in the blanks because you may get help from the next
sentence or from a sentence further along in the passage.
Write in the word (or the contraction) you think goes best in
the blank; if you change your mind, erase and write in a new
word.
EXAMPLE: The doctor said that John _____ swim if he wanted
_____, but that he should _____ careful.
John said, “Well, _____ you think I should
_____ go swimming, I won’t.”
ANSWER: The doctor said that John (could) swim if he wanted
(to), but that he should (be) careful.
John said, “Well, (if) you think I should
(not) go swimming, I won’t.”

In a few cases you may be able to think of more than one word
that can fit. Be sure that each one fits exactly with the words
that come before and after it; then you can write in either one
of the words. REMEMBER: Only one word in each blank
(contractions count as one word).

Passage A

Two students are practicing for an English test. They are asking each other
questions and giving the answers. “(1) (How) many months are there in a
(2) (year)?” “There are twelve, three in (3) (each) season, and of course that
means (4) (there) are four seasons.”
“Yes. Do you know the (5) (days) of the week?”
“Of course I (6) (do). Ask me a hard question!”
“OK. Do you want (7) (to) practice spelling?”
“No. I (8) (can) spell everything without (9) (any) trouble.”
“All right. Do you know (10) (when) summer begins?”
“No, I don’t. Ask me (11) (something) else.”
“All right. (12) (Will) it snow next month?”
“(13) (I) don’t think so, but sometimes (14) (it) does.”
“What do we carry (15) (when) it rains?”
“An umbrella. Now, it’s (16) (my) turn to ask the questions.”
“OK, but I hope you (17) (will) be fair with me, because I (18) (didn’t) ask
you anything very hard.”
“(19) (Who) works in the hospital?”
“Doctors and nurses. There (20) (are) also students in some hospitals.”
“Where (21) (do) housewives work?”
“At (22) (home). That’s a very easy question, (23) (don’t) you think?”
“Yes. How many (24) (times) is it 9 o’clock every day?”
“Twice. That’s (25) (enough) studying. Let’s go to the Center now.”

Passage B

Mrs. Smith was (26) (reading) a book to her children, and they (27) (were)
listening very carefully to her. (28) (Their) faces showed how deeply they were
interested (29) (in) the story, and if one of (30) (them) made any sound, the
other children (31) (would) tell the offender to be silent. The smallest, only
four years (32) (old), was sound asleep on a cushion (33) (next) to his mother,
and the others were becoming (34) (sleepy) also, yawning and rubbing their
(35) (eyes). Mother looked around and said to them, “(36) (Has) everyone heard
enough yet?” “(37) (Please) don’t stop, Mother,” said the oldest child,
“(38) (there) isn’t much more.” “Yes, keep reading. I (39) (am) enjoying it so
much,” said the next oldest. “(40) (Which) one of you wants to read? I simply
(41) (must) wash the dishes.” “Do you really (42) (have) to?” asked the second
youngest. “I think Daddy (43) (should) do the dishes tonight.” “Your father
(44) (has) worked hard all day.” “But so have (45) (you), Mother. You’ve worked
all day too, (46) (isn’t) that so?” asked the oldest. “I agree (47) (with) you,
my dear child, but mothers get (48) (used) to working hard all day, more
(49) (than) fathers.” The second youngest said, “Daddy has (50) (been) sleeping
for an hour, so I think mothers are much stronger than fathers.”

Passage C

John: “Well, hello there, Frank! How are you these days? I haven’t seen you
since last September. How is everything?”
Frank: “Just fine. I’ve really (51) (been) studying hard this semester.”
John: “Listen, (52) (have) you eaten lunch yet? (53) (Would) you like to have
something (54) (with) me?”
Frank: “That sounds fine. I (55) (didn’t) even have coffee this morning or
anything at all (56) (to) eat since I began work. There (57) (were) people
coming in all morning for information.”
John: “(58) (Do) you have to work like (59) (that) every day?”
Frank: “No, but today half (60) (of) the workers were out sick, so (61) (the)
rest of us, those who came in, (62) (had) to do everything.”
John: “Oh, I see. And (63) (do) you enjoy your job? Is it (64) (something) you
can do easily at the (65) (same) time that you’re going to school?”
Frank: “(66) (Why) do you think it’s different from any (67) (other) job here
at school? (68) (Does) it seem like hard work? Actually, it (69) (isn’t) hard
at all: I go (70) (there) at work for three hours (71) (and) leave at noon. I
think that (72) (most) students work at least three hours a day. That
(73) (doesn’t) seem like a lot to me.”
John: “O.K. Now (74) (where) shall we go for lunch?”
Frank: “Well, Superburger (75) (is) good enough and it’s close, that is,
(76) (if) you like hamburgers.”

[Grammar, Parish Test]


Part 2

NAME: ____________  INSTRUCTOR: ____________

COURSE LEVEL: ____________  SECTION: ____________

DIRECTIONS: You will read two passages. The first is about a literature test;
the second is about taking a morning walk. In each passage,
there are blanks where a word has been left out; in a few
cases, a contraction (like I’m, isn’t, doesn’t) is needed to fill
the blank. Read the entire passage before filling in the blanks
because you may get help from the next sentence or from a
sentence further along in the passage. Write in the word (or
the contraction) you think goes best in the blank; if you
change your mind, erase and write in the new word.
EXAMPLE: The doctor said that John _____ swim if he wanted
_____, but that he should _____ careful.
John said, “Well, _____ you think I should
_____ go swimming, I won’t.”
ANSWER: The doctor said that John (could) swim if he wanted
(to), but that he should (be) careful.
John said, “Well, (if) you think I should
(not) go swimming, I won’t.”

In a few cases you may be able to think of more than one word
that can fit. Be sure that each one fits exactly with the words
that come before and after it; then you can write in either one
of the words.

REMEMBER: Only one word in each blank (contractions
count as one word).

Passage A

An interesting discussion arose in our literature class the other day. Since it
was getting near the final examination period, the teacher (77) (told) the
class that there (78) (would) be a test on the entire textbook the following
week. “(79) (Does) that mean that we have (80) (to) know everything?” asked
Tom. “Not quite everything, (81) (but) almost everything,” answered the
teacher. “(82) (Did) you say the whole book?” asked John. “(83) (Are) you
asking because it’s too (84) (much) reading?” said the teacher. “Actually, it
(85) (does) seem like a lot,” said John, “because (86) (it’s) such a big book.”
“Would you (87) (rather) have two short tests on the book, (88) (instead) of a
long test?” “I would, but the (89) (others) might not.” “All right. Let’s see:
how (90) (do) the rest of you students feel (91) (about) the matter? How many
of you (92) (prefer) only one test and how many prefer two?” “(93) (Is) it
possible for us to have (94) (only) one test on only one half (95) (of) the
book?” asked Sam. “I’m (96) (afraid) not,” said Mr. Jackson, “because (97) (if)
I don’t test you on the entire book, I (98) (can’t) grade you on the entire
book.” “But (99) (we) all read the book, Mr. Jackson!” said Helen. “(100) (I’m)
sure of that, but I don’t know (101) (who) understands the meaning of the book,
and (102) (that) is what tests are really for.”

Passage B

I got up very early this morning and went out for a little walk. I think that
it was about 6 a.m., still not very light out. I put on a warm sweater and a
heavy jacket (103) (because) it was quite cold. I wore (104) (gloves) on my
hands, and I put a scarf (105) (around) my neck to keep warm. The weather
(106) (was) fair, and I could see that (107) (it) was going to be a nice day.
There was (108) (nobody) outside but me, and I (109) (thought) that was
unusual; after all, six o’clock (110) (wasn’t) really very early. I asked
(111) (myself) why nobody was in the street: (112) (could) it be that my watch
was wrong? (113) (Was) it really five o’clock, and not six? (114) (After)
walking another block without meeting anyone, I (115) _____ a newsboy
delivering papers on his bicycle. “(116) (Why) are the papers so thick today?”
I wondered. (117) (Like) a bolt of lightning, the reason flashed (118) (through)
my head: It was Sunday! Somehow it (119) (had) slipped my mind, and that was
strange: (120) (After) all, I wouldn’t have gotten (121) (up) so early if it
hadn’t been Sunday, (122) (would) I? When I returned home, everyone was
(123) (still) sleeping, so I prepared a big (124) (breakfast) for myself, sat
and ate it while reading the morning (125) (paper). And still nobody got up!
“Well,” I said to myself after some reflection, “(126) (If) you don’t like to
be alone, you should (127) (not) get up so early.”
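
Blanks in a fill-in test like the Parish test are usually checked against a small set of acceptable one-word answers, since the directions note that more than one word can fit some blanks and that a contraction counts as one word. The Python sketch below (not part of the original materials) illustrates that; the key fragment is invented for illustration.

# A minimal sketch of scoring one-word (or one-contraction) blanks
# against sets of acceptable answers. The key fragment is invented.

def normalize(word):
    # Fold case and unify curly and straight apostrophes, so that a
    # handwritten "don't" matches a keyed "don't" either way.
    return word.strip().lower().replace("\u2019", "'")

def score_blanks(responses, key):
    return sum(1 for item, acceptable in key.items()
               if normalize(responses.get(item, "")) in acceptable)

key = {103: {"because", "since"}, 110: {"wasn't"}, 121: {"up"}}
responses = {103: "Because", 110: "wasn\u2019t", 121: "up"}
print(score_blanks(responses, key))   # prints 3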

23. Attitude Questionnaire

NAME: ____________  COURSE: ____________

Native Language: ____________  Section: ____________

Dear Student:

The CESL staff would like to know your reaction to all the testing that has been
done this term. Would you please give us your reaction in addition to filling out the
other questions below? Thank you.
1. I thought the tests were
a. too easy
b. a little too easy
c. okay
d. a little too hard
e. much too hard
2. I would have preferred
a. much less testing
b. a little less testing
c. about the same
d. a little more testing
e. much more testing
3. How long have you been in the United States?_

4. How long did you study English before coming to the United States?
a. less than one year d. five to six years
b. one to two years e. seven years or more
c. three to four years
5. For how many hours a day did you study English in your own country?
a. less than one hour a day c. three to four hours
b. one to two hours d. more than four hours
6. What languages did your English teacher in your own country speak?
a. English only
b. English and some other language
c. only a little English
7. Do the people you live with here in the United States speak English or your
native language?
a. we only speak in my native language
b. we speak some in English and some in another language
8. What was the highest level of education of either of your parents?
a. Ph.D., doctorate, or the equivalent
b. Masters, graduate study, or the equivalent
c. BA, BS, or college graduate, or the equivalent
d. secondary (high) school graduation or the equivalent
e. eighth grade education, or equivalent
f. went to school for 4 or more years
g. less than 4 years of school
h. no schooling
9. What is your father’s job in your home country?
a. a businessman d. professor or school teacher
b. farmer or laborer e. government official
c. doctor or lawyer f. other
10. Do you enjoy your English classes?
a. always d. rarely
b. usually e. never
c. sometimes
11. Do you feel that you learn from your English instructors?
a. never d. usually
b. rarely e. always
c. sometimes
12. How much time do you spend each day studying your English (CESL) lessons
outside of class?
a. none d. two hours or more
b. an hour e. three hours or more
c. more than an hour
13. Do you review the material covered in each class outside of class on your own?
a. every day d. rarely
b. usually e. never
c. sometimes

14. If you don't understand something, do you ask about it in class?


a. never d. usually
b. rarely e. always
c. sometimes
15. How much of your leisure time do you spend speaking English or listening to
English spoken (TV, radio, conversation, etc.)?
a. 10% d. 70%
b. 20% e. 100%
c. 50%
16. Do you think or dream in English?
a. always d. rarely
b. usually e. never
c. sometimes
17. How much time do you spend with people who speak English?
a. almost none d. a lot of time
b. a little bit e. all the time
c. some time
18. When do you say things in English?
a. every time there is a chance
b. quite frequently
c. once in a while
d. very rarely
e. only when it is absolutely necessary
19. Do you feel people are critical of the way you speak English?
a. very much d. a little
b. a great deal e. not at all
c. quite a bit

Following is a list of words that might be used to describe people. First, indicate
whether these qualities are desirable, neutral, or undesirable. Circle the word
which is desirable. If neither word is desirable, don’t circle anything. For example,
if you think that kind is better than unkind, you would circle kind.

Example: unkind — (kind)

20. outgoing — reserved


21. quiet — talkative
22. nervous — calm
23. cautious — happy-go-lucky
24. sociable — solitary
25. serious — carefree
26. leader — follower
27. initiator — observer

How well do these words describe you? Notice that these words are on a scale. If
you think that you are kind most of the time, but not all of the time, you might
indicate that in the following manner:

Example: kind unkind
         1 (2) 3 4 5 6 7 8

28. outgoing reserved


1 2 3 4 5 6 7 8

29. quiet talkative


1 2 3 4 5 6 7 8
30. nervous calm
1 2 3 4 5 6 7 8
31. cautious happy-go-lucky
1 2 3 4 5 6 7 8

32. sociable solitary


1 2 3 4 5 6 7 8

33. serious carefree


1 2 3 4 5 6 7 8

34. leader follower


1 2 3 4 5 6 7 8

35. initiator observer


1 2 3 4 5 6 7 8

Below is a list of words that might be used to describe people. First indicate
whether these qualities are desirable, neutral, or undesirable. Circle the word
which is desirable. If neither word is desirable, don't circle anything.

36. devious — truthful


37. snobbish — friendly
38. dependable — undependable
39. kind — unkind
40. culturally advanced — culturally primitive
41. militarily weak — militarily strong
42. technologically advanced — technologically retarded
43. economically poor — economically rich
44. scientifically advanced — scientifically retarded
45. politically powerless — politically powerful

Below is a list of words that might be used to describe people. Think of each word
in terms of how well it describes Americans. For example, if you think Americans
are helpful, you would indicate as follows:

Example: helpful unhelpful
         (1) 2 3 4 5 6 7 8

46. devious truthful


1 2 3 4 5 6 7 8

47. snobbish friendly


1 2 3 4 5 6 7 8

48. dependable undependable


1 2 3 4 5 6 7 8

49. kind unkind


1 2 3 4 5 6 7 8

50. culturally advanced culturally primitive


1 2 3 4 5 6 7 8

51. militarily weak militarily strong


1 2 3 4 5 6 7 8

52. technologically advanced technologically retarded


1 2 3 4 5 6 7 8

53. economically poor economically rich


1 2 3 4 5 6 7 8

54. scientifically advanced scientifically retarded


1 2 3 4 5 6 7 8

55. politically powerless politically powerful


1 2 3 4 5 6 7 8

Below is a list of reasons frequently given by students for studying English. After
careful thought please evaluate each statement according to how it reflects your
feelings by placing an X in one of the 7 blanks. Look at the example below.

I am studying English because my parents want me to.

_X_
strongly agree        agree        disagree        strongly disagree

I am studying English because:

1. It will help me to understand the American people and their way of life.

strongly agree        agree        disagree        strongly disagree

2. I think it will some day be useful in getting a good job.

strongly agree        agree        disagree        strongly disagree

3. It will enable me to gain good friends more easily among English-speaking


people.

strongly agree        agree        disagree        strongly disagree

4. One needs a good knowledge of at least one foreign language to merit social
recognition.

strongly agree        agree        disagree        strongly disagree

5. It should enable me to begin to think and behave as Americans do.

strongly agree        agree        disagree        strongly disagree

6. I feel that no one is really educated unless he is fluent in the English language.

strongly agree        agree        disagree        strongly disagree

7. It will enable me to marry an American.

strongly agree        agree        disagree        strongly disagree

8. I need English in order to fulfill my educational goals.

strongly agree        agree        disagree        strongly disagree

9. If you could stay in the United States for as long as you wanted after you finish
your studies, how long would you stay? Please circle your answer. I would
a. leave as soon as my studies were finished.
b. stay 3 months.
c. stay 6 months.
d. stay 1 year.
e. stay 5 years.
f. stay permanently. (Immigrate)
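
When a questionnaire like this one is scored, the 1-8 semantic-differential ratings are usually reverse-keyed wherever the desirable pole is printed on the right, so that higher always means the same direction before items are summed; the agree-disagree items can then be grouped into integrative and instrumental motivation subscores. The Python sketch below (not part of the original materials) illustrates the reverse-keying step only; which items to reverse is an assumption made for illustration.

# A minimal sketch of reverse-keying 1-8 semantic-differential ratings
# before averaging. Which items need reversal depends on which pole is
# desirable; the choice below is an illustrative assumption.

SCALE_MAX = 8

def reverse(rating):
    # On a 1-8 scale, 1 becomes 8, 2 becomes 7, and so on.
    return SCALE_MAX + 1 - rating

def composite(ratings, reversed_items):
    keyed = [reverse(r) if item in reversed_items else r
             for item, r in ratings.items()]
    return sum(keyed) / len(keyed)

ratings = {28: 2, 29: 6, 30: 7}    # item number -> raw rating
reversed_items = {30}              # e.g., nervous-calm, if calm is the desirable pole
print(composite(ratings, reversed_items))   # prints 3.33...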
About the Authors

Franklin I. Bacheller is presently doing post-master’s study in inferential
statistics at Southern Illinois University. He has published several articles and has
presented papers at a number of national and international professional meetings
on language teaching, learning, and testing. Bacheller has taught ESL/EFL in the
United States and in Japan.

Pamela Cohelan Benson is a graduate of Goddard College and has recently
completed requirements for the M.A. in English as a Second Language at Southern
Illinois University. She has taught ESL/EFL at the Center for English as a Second
Language (SIU) as well as abroad in the Philippines, India, Turkey, and Zambia.

Donn R. Callaway holds the B.A. from the University of Santa Clara and the M.A.
in English as a Second Language from SIU. During the 1977-1978 and 1978-1979
academic years he worked as a master teacher in the joint program between the
University of Illinois (Urbana-Champaign) and Arya Mehr University in Isfahan,
Iran. While at Arya Mehr, he coordinated their English language testing program
as well as beginning and advanced courses in technical English.

Patricia L. Carrell is Associate Professor of Linguistics and Chairperson of the


Department of Linguistics and Center for English as a Second Language at
Southern Illinois University, Carbondale, Illinois. She completed a Ph.D. in
theoretical linguistics from the University of Texas at Austin in 1966. She has
published a Transformational Grammar of Igbo (Cambridge University Press,
1970) and has presented papers at meetings of the Linguistic Society of America,
Mid-America Linguistics Conference, and TESOL. Present interests include
pragmatics and communicative competence, theoretical linguistics, and language
acquisition.

Sadako O. Clarke is Instructor of Japanese for the Department of Foreign


Languages and Literatures at Southern Illinois University. She holds the M.A. in
English as a Second Language from SIU; an M.A. in Education from Roosevelt
University in Chicago; and an M.A. in Christian Education from Emory University
in Atlanta. She has published previously on the topics of curriculum and language
testing.

Naomi Doerr, a graduate of Southern Illinois University, has completed
requirements for the M.A. in English as a Second Language there. She served as
Research Assistant to the language testing project at CESL during 1976-1977 and
is an experienced teacher of ESL.

Jill Evola received the B.A. in Latin American Studies from the University of
Michigan in 1975. Recently she too completed requirements for an M.A. in
Linguistics at Southern Illinois University.

Michelle Fishman has an M.A. in ESL from SIU and is currently teaching
English as a second language at the Defense Language Institute at Lackland Air
Force Base in San Antonio, Texas. During the 1976-1977 academic year she served
as a Teaching Assistant in the Center for English as a Second Language at SIU.

Douglas E. Flahive is Assistant Professor of English at Colorado State University,
and also serves as associate director of the intensive English program at Fort
Collins, Colorado. He has published widely and presented papers on language
learning, teaching, and testing at regional and national meetings of several
organizations including LSA, NAFSA, and TESOL. He holds an M.A. in English
from Xavier University in Ohio, an M.A. in Linguistics from SIU, and the Ph.D. in
applied linguistics from SIU.

Debby Hendricks is presently working toward a Ph.D. degree at Stanford


University. She has taught English as a second language to immigrants and plans to
continue her present research into the nature of the second language acquisition
process.

Frances Butler Hinofotis is Assistant Professor of English with the TESL
group at UCLA. Her doctorate was earned at Southern Illinois University where
she had experience in teaching ESL at all levels. She has also taught ESL
overseas in Greece. She has presented papers at meetings of the LSA and TESOL
organizations and has published a number of papers. Her most recent publication
is a book including papers presented at the Mexico TESOL meeting in 1978,
co-edited with Eugene J. Briere.

Kay Hisama holds the Ph.D. in Educational Psychology from Southern Illinois
University. She has presented a number of papers on language testing at national
and international conferences. Her most recent effort includes the preparation of a
book on psychological aspects of second language learning.

Christine Hjelt completed her B.A. at Willamette University and earned her
M.A. in English as a Second Language at Southern Illinois University. She has
taught ESL/EFL at CESL and in Zambia and Botswana.

Marianne Johnson completed a certificate in the Teaching of English as a
Second Language at Southern Illinois University before accepting her recent post
at Shiprock, New Mexico, where she taught English at the secondary level.

Thomas Ray Johnson received his B.A. in Philosophy from Northern Illinois
University and an M.A. in Linguistics at Southern Illinois University. From 1972
to 1975 he served as an EFL Instructor in Safi, Morocco. Presently he is teaching
EFL at Lockhart English Language Academy in Pamplona, Spain.

Celeste M. Kaczmarek completed the M.A. in English as a Second Language at
Southern Illinois University and is now teaching EFL in Boumerdes, Algeria. Her
professional papers include significant research on the distinction between ESL
and ESP (English for Specific Purposes), presented at TESOL 1979 in Boston.

Kathy A. Krug recently finished her M.A. in English as a Second Language at


Southern Illinois University. Now she is teaching ESL to Indo-Chinese refugees in
the Southern Illinois area.

Becky Lentz holds the B.A. in Spanish and Latin American Studies from the
University of Arkansas and the M.A. in English as a Second Language from
Southern Illinois University. She is now employed by the Office of International
Education at the Central YMCA Community College in Chicago.

Ellen Mamer completed her B.A. in English Literature at the University of
Illinois and the M.A. in English as a Second Language at Southern Illinois
University. She co-authored a paper with Franklin Bacheller presented at the
NAFSA Region V Conference in October 1977, comparing the effectiveness of audio
versus video tape-recordings in the language laboratory.

Karen A. Mullen is now Associate Professor of Linguistics at the University of


Louisville in Kentucky. Formerly she was an Assistant Professor of Linguistics at
the University of Iowa. She received the B.A. from Grinnell College and the M.A.
and Ph.D. degrees from the University of Iowa in linguistics. Mullen has had
extensive experience in teaching ESL, linguistics, sociolinguistics, and related
areas, and has published widely on language testing. Recent papers include
presentations at TESOL, NAFSA, and the Interagency Roundtable on language
testing in Georgetown (March 1978).

Mitsuhisa Murakami is Associate Professor in the Department of English
Literature at Kinran Junior College in Osaka, Japan. Formerly, Murakami
directed the Osaka YMCA English School and has co-authored materials for
teaching ESL/EFL in Japan.

John W. Oller, Jr. is Associate Professor of Linguistics and Educational
Foundations at the University of New Mexico. Before coming to UNM, Oller held
an Associate Professor post at UCLA where he was responsible for the testing of
foreign students in ESL. He has served on the TOEFL Examiners Committee at
Educational Testing Service and has published widely.

Kyle Perkins is Assistant Professor of Linguistics at Southern Illinois
University. He earned his Ph.D. in Linguistics from the University of Michigan in
1976 and has published widely on language learning and testing. He co-edited and
co-authored Language in Education: Testing the Tests and co-chaired the 1979
TESOL meeting in Boston. Perkins was also co-organizer (along with James
Redden) of the First International Conference on Frontiers in Language Profi¬
ciency and Dominance Testing at Southern Illinois University in April 1977.

Keith Pharis received an M.A. in English as a Second Language from SIU and
has worked in the Peace Corps training programs in Micronesia. He has also taught
EFL in Saudi Arabia and Japan and is currently an Instructor at the Center for
English as a Second Language at SIU.

George Scholz holds the M.A. in English as a Second Language from SIU and is
currently responsible for the testing program at the Institute for Electronics and
Electricity, English Training Program, at Boumerdes, Algeria, North Africa.
Scholz has presented papers at national TESOL meetings and has recently completed
additional work on the factorial structure of language proficiency.

Becky Gerlach Snow is an Instructor in the ESL program at SIU. Snow has
presented professional papers at TESOL and NAFSA in recent years and has had
wide ranging experience in teaching and testing ESL. She also holds an M.A. in the
Teaching of English as a Second Language.

Randon Spurling completed the M.A. in ESL at SIU before accepting a post at
INELEC in Boumerdes, Algeria, where she is presently an instructor in EFL.

Lela Vandenburg finished her M.A. in ESL at SIU in the spring of 1977 and is
currently teaching ESL in Africa.

Craig B. Wilson is an ESL teacher in the Indo-China Education Program,
Evaluation and Development Center at Southern Illinois University. His
responsibilities include testing and curriculum development. He has served as an
interpreter of Vietnamese for the Department of Health, Education and Welfare at
the Fort Chaffee (Arkansas) refugee camp. He holds the B.A. in Vietnamese
language and the M.A. in Linguistics from SIU.
Index

Aborn, M. 67, 253
achievement tests 1, 223, 224
Adorno, T. 253
Adult Basic Education Program Evaluation and Development Center, SIU 211
affective variables 1, 8, 234, 239
African 48
Aiken, L. 253
Akmajian, A. 196, 253
Alexander, L. 112, 113, 253
Allen, J. P. B. 253
American Field Service 15
American Language Institute Oral Rating Form 103, 253
analysis of covariance 210, 215
Anisfield, M. 103, 234, 253
aptitude tests 7, 8, 219, 222, 223
Arabic 25, 48, 102, 104, 105, 109, 110, 111, 114, 130, 177, 178, 180, 189, 235, 242, 243
Armenian 209
Asher, J. J. 61, 62, 63, 65, 72, 253
Asian 48, 209
assertion 6, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 215
Assessment of Children's Intelligence 15
Atai, P. 64, 256
attained language proficiency 219, 250
attitude (also see affective variables) 135, 239, 248
attitude questionnaire 299-305
audio-lingual method 60
Austin, J. L. 195, 205, 253
Baca, L. 258
Bacheller, Frank 3, 5, 66, 73, 306
Bailey, R. 260
Baird, A. 112, 253
Baratz, J. 103, 259
Barik, H. 234, 256
Beirut 122
Beit-Hallahmi, B. 255
Benson, Pam 3, 59, 72, 204, 306
Bent, D. 33, 107, 257
Bezanson, K. 253
bipolar scales 102, 106, 245
Black Americans 103
Bondaruk, J. 253
Brennan, E. 104, 253
British regionals 103
Brodkey, D. 253
Brookdale Community College Learning Center 211
Broughton, G. 112, 253
Brown, H. Douglas 253
Brown, Leo J. 211
Brown, Roger 243, 253
Bruder, M. 59, 258
Burt, M. 67, 188, 253, 254
Buss, M. J. 211
Callaway, Donn R. 4, 102, 118, 306
Cambodian 209
Canadian bilingual school 63
Canter, S. 177, 256
Carrell, Patricia L. 6, 195, 215, 306
Carroll, J. B. 35, 48, 220, 221, 222, 223, 226, 250, 253, 254
Cartier, F. 103, 110, 254
Cartwright, D. 112, 253
Catford, J. C. 255
Center for Applied Linguistics 209
Center for English as a Second Language, Southern Illinois University 3, 4, 5, 13, 17, 18, 24, 25, 26, 34, 48, 49, 51, 52, 56, 68, 69, 78, 79, 80, 105, 109, 121, 122, 123, 126, 130, 131, 135, 143, 152, 173, 177, 180, 181, 188, 189, 198, 228, 235, 242
CESL Placement Tests 18, 19, 20, 23, 26, 27, 28, 29, 32, 33, 77, 79, 82, 122, 124, 125, 127, 131, 132, 143
Chafe, W. 179, 254
Chase, C. 48, 254
Chastain, K. 254
Child, J. 253
Chinese 48, 130, 209, 211, 222
Christie, R. 254
Cicourel, A. 241, 252, 254
Clark, John L. D. 14, 78, 102, 110, 254
312 Index

Clarke, Sadako 7, 219, 250, 306
clause/T-unit ratio 173, 174, 175, 176
cleft sentence 196, 197, 198, 199, 200, 203
Clement, R. 255
cloze procedure 3, 4, 5, 7, 8, 13, 16, 17, 18, 20, 28, 33, 34, 35, 36, 45-46, 49, 55, 64, 67, 70, 78, 80, 81, 83, 116, 121, 122, 123, 126, 136
mmation 202
cognitive code method 60
cohesion 177, 181
Colorado State Indochinese Refugee Assistance Office 211
composition as a measure of language proficiency 151
Comprehensive English Language Test (CELT) 4, 26, 29, 47, 48, 49, 124
Concannon-O'Brien, M. 248, 254
contrastive analysis 7, 188, 209, 211, 215, 216
Convery, Anne P. 211
Conway, P. 248, 254
Cooley, R. 255
Cooper, C. 223, 254
correlation 3, 4, 14, 50, 64, 81, 117, 118, 155-156, 203, 227, 229, 230, 234, 238, 239, 245, 246-247
Coulthard, M. 177, 254
Cranney, A. 130, 254
Crowne, D. 254
Darnell, D. 254
Davies, A. 110, 253, 254
Davies, B. 253
Davis, F. 35, 254
Dawes, R. 5, 135, 136, 257
Dawson, W. 104, 253
Defense Language Institute, Monterey 61
De la Torre, R. 253
Denny, E. 257
dependent variable 180, 235, 237
dictation 3, 6, 8, 13, 16, 17, 26, 33, 64, 68, 188, 189, 227, 228, 229, 230-232, 235, 238, 239, 271-272
dictation errors 189
discourse analysis 177
discourse processing 1, 3, 143
discrete point approach 6, 7, 48, 77, 78, 181, 182, 183, 215
divisible competence hypothesis 14, 15, 24, 25, 33
Doerr, Naomi 5, 8, 134, 306
Donaldson, W. 60, 254
double-biased test 209, 210
Dulay, H. 188, 254
Dull, C. 255
Dumas, G. 63, 259
Dutch 130
Educational Testing Service 4, 26, 79
elicited oral imitation (also see repetition) 63
Endicott, A. L. 172, 254
England 104
English 7
Enkvist, N. 67, 254
errors 67, 187, 188, 189, 190-191, 215
errors/T-unit measure 172, 173, 174
Ervin-Tripp, S. 60, 72, 254
ESL learners 7
essay scoring 5, 6, 28, 33, 153-155, 168-170, 178-180, 182
Evola, Jill 5, 6, 177, 182, 183, 307
expectancy grammar 59, 64
F-statistic 92, 93, 94, 97, 99, 161, 162, 165, 237, 238
F-test 93
factor analysis 14, 24, 49, 50, 52, 82, 116, 158-159, 182
Far East 198
Farsi 25, 48, 64, 102, 105, 109, 110, 111, 114, 130, 177, 178, 180, 189, 209, 235, 242, 243
Fasold, R. 259
Ferrell, A. 259
Fillenbaum, S. 102, 256
First International Conference on Frontiers in Language Proficiency and Dominance Testing, Southern Illinois University 111, 211
Fishman, Michelle 6, 187, 215, 307
Flahive, Douglas 3, 5, 34, 55, 56, 144, 171, 182, 254, 307
Flesch, R. 254
Flesch readability scale 63, 209
Flick, William Cary 133
foreign language population 2
Foreign Service Institute Oral Interview 4, 18, 19, 20, 23, 26, 27, 31, 32, 33, 77, 78, 79, 80, 81, 84, 85-88, 116, 143, 243
Frederiksen, C. H. 67, 254, 255
Free, Anne 228
French 130, 189, 243
French Canadians 102
Frenkel-Brunswick, E. 253
Fry, E. 255
Fry readability index 209

Galvan, Jose 104, 106, 255
Gardner, R. C. 102, 234, 237, 242, 243, 249, 251, 255, 256
Garry, M. 204
Geis, F. 254
general factor of intelligence, g 15, 16, 17, 18, 19, 20, 21, 22, 28, 30, 32, 34, 38, 52, 54, 82, 83, 84, 182
Genesee, F. 38, 255, 260
Georgetown University 104
German 7, 219, 220, 221, 222, 223, 224, 225, 226, 250
Giles, H. 103, 104, 255
Gliksman, L. 255
global language proficiency 1, 3, 4
goodness of fit 15
Gorosch, M. 104, 255
Gradman, Harry 188, 255
grammar 3, 250
grammar-translation method 60
Greek 243
Griffin, W. 257
Groom, C. J. 211
Guiora, Alexander Z. 255
Gunnarsson, B. 255
Hagen, E. 257
Haggard, L. 255
Hamayan, E. 260
Hardison, O. 255
Harms, L. 103, 255
Harris, David P. 26, 29, 91, 110, 151, 255
Hawkes, N. 253
Heaton, J. B. 110, 255
Heise, G. 188, 257
Hendricks, Debby 3, 4, 24, 77, 116, 307
Higginbotham, D. 128
Hindi 209
Hinofotis, Frances Butler 3, 4, 13, 17, 24, 25, 31, 54, 78, 121, 129, 130, 132, 133, 255, 258, 307
Hisama, Kay 3, 47, 49, 56, 122, 125, 255, 307
Hjelt, Chris 3, 59, 72, 307
Hmong 209
Hodgson, R. 102, 256
holistic evaluations 111, 175, 183
Holland 130
Holm, W. 259
Holtzman, Paul 240
Hornby, P. 198, 255
Huckleberry Finn 189
Hudson, A. 258
Hull, C. 33, 107, 257
Hunt, Kellogg W. 172, 173, 255
Hutchinson, L. 196, 197, 203, 204, 256
implicatives 195, 196
independent variable 180
index of complexity 173, 174
indirectly conveyed meaning 195, 196, 202
Indochinese Refugee Education Guides 256
Indo-European 209, 219, 222, 226, 250
inference 197, 203
instrumental motivation 234, 241, 242, 243, 245, 247, 248, 252
integrative motivation 229, 234, 241, 242, 243, 245, 247, 248, 252
integrative tests 48, 181, 215
intelligence 1, 2, 3, 15, 34, 55, 56
International Institute of Boston 211
International Institute, St. Louis 211
interrater reliability 80, 81, 117, 175
Iowa Silent Reading Test 4, 143, 144, 145, 256
Irvine, P. 64, 256
Isaacs, R. 255
Jakobovits, L. 103, 110, 256
Jamieson, J. 204
Japanese 7, 8, 48, 130, 189, 209, 219, 220, 221, 222, 223, 224, 225, 226, 228, 229, 230, 235, 242, 243
Jenkins, J. 33, 107, 257
Jensen, A. 15, 34, 256
Johansson, Stig 64, 69, 188, 256
Johnson, Dixon 38, 256
Johnson-Laird, P. 67, 257
Johnson, Marianne 3, 4, 24, 77, 307
Johnson, Thomas 6, 9, 235, 241, 251, 252, 308
Jones, E. 135, 256
Jones, R. 256, 259
Jonz, J. 130, 256
Junasa, B. 211
Kaczmarek, Celeste 5, 6, 151, 182, 308
Karttunen, L. 195, 197, 256
Kerlinger, F. 237, 256
Kohler, F. 135, 256
Kramer, E. 103, 260
Krashen, S. 256
Krug, Kathy 6, 9, 235, 241, 251, 252, 308
Krzyzanowski, H. 122, 256
Kuder-Richardson 20 formula 123, 127, 145
Kusudo, J. 253

Labov, W. 102, 256
Lado, Robert 14, 208, 209, 256
Lambert, W. E. 102, 103, 106, 234, 242, 243, 249, 251, 253, 255, 256, 260
language, a factor in educational tests 1
language interference 211
language proficiency 2, 15, 227
language redundancy 78
language skills 2
Laotian 209
Latin 223, 226, 250
Lee, L. 178, 256
Lehmann, I. 257
Lentz, Becky 5, 6, 177, 308
Lett, J. 245, 256
Levenston, E. 256
Levinson, D. 253
Lichten, W. 188, 257
Liebert, R. 245, 256
Likert scale 106
Lingala 104
listening cloze 26, 30, 31, 33, 261-266
listening comprehension 3, 4
Liu, P. 258
Lorge, I. 257
Lukmani, Y. 234, 257
Lund University, Sweden 64
Mainer, Ellen 5, 6, 177, 308
Manis, M. 5, 135, 136, 257
Manual for Peace Corps Language Testers 79
Marlow, D. 254
McGraw-Hill Reading Test 4, 5, 35, 36, 37, 55, 56, 144, 145
McLaughlin, G. 210, 257
mean square 92, 93, 162
Mehan, H. 248, 257
Mehrens, W. 257
Mexican-American bilinguals 104
Michigan Test of English Language Proficiency 38, 47, 48, 142, 143
Middle East 198
Miller, G. 67, 188, 257
Miller, L. 103, 260
Modern Language Aptitude Test (MLAT) 7, 219, 220, 221, 222, 223, 224, 226, 250
morphological and transformational complexity 172
Mullen, Karen 4, 5, 91, 116, 117, 118, 160, 308
multiple-choice cloze tests 129-130, 131, 132, 133
multiple-choice listening comprehension tests 26, 30, 31, 266-271
multiple-choice reading match tests 28, 33, 276-281
multiple-choice writing task 28, 33, 282-293
multiple regression 93, 99, 166, 237, 238
Murakami, M. 8, 227, 233, 246, 251, 308
Muraki, M. 196, 257
Murphy, P. 259
Naiman, N. 63, 257, 259
Nartney, J. 204
National Indochinese Clearinghouse 209, 211
Nelson-Denny Reading Test 4, 143, 144, 145
Nelson, M. 257
New Cloze Test 48, 49, 52, 53
Nguyen Dang Liem 209, 257
Nie, N. 33, 107, 257
non-verbal IQ test 34-35, 37, 126
Norris, R. 257
null hypothesis 200
Nunnally, Jum 15, 31, 257
objective measures of writing 172
obligatory functors 243
O'Donnell, R. 257
Ohara, T. 230
Oller, John W., Jr. 1, 3, 4, 8, 9, 13, 17, 24, 25, 26, 29, 31, 33, 54, 56, 64, 67, 68, 69, 78, 82, 122, 130, 189, 230, 233, 234, 239, 243, 245, 246, 251, 256, 257, 258, 308
Olssen, M. 67, 71, 258
one-tailed test of significance 205
oral cloze 27, 33, 274-275
Oral Interview Attitude Survey 136, 137, 140-141
oral proficiency 102, 104, 110
oral proficiency scales 4, 91, 92
oral skills 3, 4, 78, 85, 116
Ortego, P. 103, 258
Palmer, L. 104, 105, 106, 111, 255, 258
Paluszny, M. 255
paraphrase 34
Parish, Charles 5, 235
Parish test 29, 33, 70, 237-238, 239, 295-299
Paulston, C. 59, 258
Peabody Picture Vocabulary Test 207
Pearson product moment correlation 17, 145, 175, 180, 192, 202, 211, 222, 223, 224, 225
Pedhazur, E. 237, 256
Perkins, Kyle 1, 4, 8, 9, 35, 36, 54, 56, 142, 233, 234, 239, 243, 245, 246, 251, 258, 309

Perkins-Yorio test 35, 36, 37, 38-45, 55
Perren, G. 258
Pharis, Keith 4, 142, 309
Piaget, J. 34
Pierce, J. 104, 255
Pimsleur Spanish Proficiency Tests 62, 63
Poland 122
Postovsky, V. 60, 61, 63, 65, 258
Praninskas, J. 258
predictor variable 7, 8, 37, 227, 235, 237
presupposition 6, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 215
presupposition swallowing 203
Price, E. 211
principal components solution 16, 17, 18, 19, 20, 24, 25, 27, 30, 32, 47, 50, 53, 78, 83, 106, 107, 108, 157, 192, 193
proficiency tests 3, 8
profile method 49
pseudo-cleft sentences 196, 197, 198, 199, 200, 203
Q-type factor analysis 6, 106, 192
Quitugua, D. 211
rating of recall task 293-294
Raven, J. 36, 37, 258
Raven's Progressive Matrices 3, 34, 35, 36, 37, 38, 55, 56
Raygor, A. L. 35, 258
reading 3, 4, 142
reading aloud 27, 33, 78, 80, 81, 275-276
reading comprehension 35
Reading for Understanding Placement Test (RFUPT) 48, 49, 51, 52, 53
recall task 152
Redden, James 309
redundancy index of integrativeness of second language learners 9, 242, 244, 245, 246, 248
regression analysis 37
reliability 47, 71, 78, 93, 108, 111, 116, 123, 124, 126, 127, 151, 152, 161, 162, 165, 167, 182, 205, 215
repetition (elicited imitation) 27, 33, 70, 78, 80, 81, 116, 272-273
research hypothesis 200
Richards, J. 66, 188, 257, 258
Roberts, G. 112, 253
Roscoe, J. 258
Ross, J. 211, 216, 258
Rowesland, P. 255
Rubenstein, H. 67, 253
Russian 61, 209
Ryan, E. 103, 104, 253, 258
salted test 210
Sanford, R. 253
Sapon, S. 220, 221, 222, 223, 226, 250, 254
Sattler, J. 15, 258
Savignon, S. 234, 258
Saville-Troike, M. 171, 258
scale of communicative effectiveness 4, 66, 68, 69, 70, 71
Scholz, George 3, 4, 24, 54, 77, 80, 235, 309
Schumacher, S. 211
Schumann, J. 188, 242, 258, 259
Science Research Associates Reading for Understanding Placement Test 27
second language population 2
self-report questionnaire 228, 230, 235-237, 243, 246, 249
Selinker, L. 81, 188, 259
Sellars, W. 196, 259
Seventh Mental Measurements Yearbook 48
Shore, H. 253
Shuy, R. 103, 259
single-factor analysis of variance 94, 161
single factor experimental design 92
SMOG index 210
Smythe, P. 255
Snow, Becky 4, 5, 129, 171, 182, 309
social distance of language learners 241
Social Usage 228
South America 198
Southeast Asians 209
Southern Illinois University 6, 8, 17, 122, 220
South Wales 104
Spanish 25, 48, 62, 102, 104, 105, 106, 109, 110, 111, 114, 130, 209, 235, 242, 243
Spearman, Charles 15, 54
Spearman rank order correlation 6, 145, 192, 193
speech perception 188
spelling errors 69, 189
Spiegler, M. 245, 256
Spolsky, B. 78, 188, 229, 242, 255, 256, 258, 259
Spurling, Randon 3, 4, 24, 77, 309
standard cloze 281-282
standardized reading tests 142
standard scores 49, 51
State University of Iowa 259
Steinbrenner, K. 33, 107, 257
Stendahl, C. 259
Stenson, N. 188, 259
Sterling, T. 67, 253

Stevenson, D. 259
Sticht, T. 63, 64, 259
Strawson, P. F. 195, 204, 259
Streiff, V. 126, 257, 259
Strongman, K. 103, 259
Stubbs, J. 122, 130, 259
Stump, Thomas 15, 56, 126, 259
subordination ratio 172
Svartvik, J. 259
Swain, M. 63, 259
Sweden 104
syntactic complexity 172
T-scores 50, 51
t-test 194, 202
T-unit 172, 173, 183
T-unit length 173, 174, 175, 176
Taylor, B. 188, 259
Taylor, W. L. 135, 259
Terman-Merrill scale 38
Test of English as a Foreign Language (TOEFL) 4, 5, 13, 15, 16, 17, 19, 20, 23, 34, 35, 36, 37, 47, 48, 55, 64, 93, 121, 123, 125, 127, 142, 143, 144, 161, 163, 230
test variance 13, 14, 17, 19, 20, 24, 30, 32, 34, 52, 54, 101, 106, 117, 126, 144, 180, 182
Tetrault, E. 253
Thai 48
Thorndike, R. 257
three-way analysis of variance 180
Thurstone, T. 27, 259
total physical response method 61
Truus, S. 260
Tucker, R. 103, 122, 130, 259, 260
Tunstall, K. 234, 256
Turkish 189, 243
Twain, Mark 189
two-factor analysis of variance 94, 163
UCLA 69, 122
UCLA ESL Placement Exam 17
Underwood, G. 104, 255
unitary competence hypothesis 14, 15, 17, 23, 24, 25, 33, 54
univariate F-ratio 174
University of Iowa, Department of Linguistics 93, 162
University of New Mexico 122
University of Tehran, Iran 13, 15
Upshur, John A. 260
Valette, R. 110, 260
validity 47, 71, 77, 109, 111, 116, 124, 126, 151, 152, 161, 182
Vandenburg, L. 3, 4, 24, 77, 309
Van Syoc, W. and F. Van Syoc 112, 260
van Wageningen, N. 130, 260
varimax rotated factor solution 19, 20, 23, 25, 32, 78, 84, 85, 157, 246
Vietnamese 7, 48, 104, 130, 209, 210, 211, 215, 216, 243
Vigil, F. 258
Vineyard, E. 260
Warden, E. 178, 260
Webster, W. 103, 260
Westbrook, John 211
Whitaker, S. 69, 260
Whitehead, J. 103, 260
white noise 188
Whyte, W. 114, 260
Wijnstra, J. 130, 260
Wilds, C. 78, 260
Wilks' Lambda 174
Williams, F. 102, 260
Wilson, Craig B. 6, 208, 215, 216, 309
Wilson, L. 171, 260
Winer, B. J. 93, 162, 260
Woehlke, P. 128
Wolfram, W. 103, 259
Wood, H. 248, 257
Woosley, J. 103, 259
writing 3, 4, 5
Yorio, Carlos A. 35
Other Newbury House Books

Adaptation in Language Teaching 0-88377-105-5

Harold S. Madsen and J. Donald Bowen
The need to adapt or modify given textbooks and other language teaching materials to fit the requirements of particular learning situations, and even particular students, is widely recognized. But until now, the guidance available to teachers contemplating such adaptations has been limited and highly specialized. In this volume, Madsen and Bowen have presented a comprehensive, systematic approach to adaptation which is ideal for methods courses in teaching foreign languages or English as a Second Language. As the first principle of effective adaptation, the authors stress maintenance of congruence among a variety of factors: the teaching materials, the methodology and objectives of the course, student characteristics, the character of the language being taught, and the personality and style of the teacher.

Language in Education: Testing the Tests 0-88377-104-7

John W. Oller, Jr. and Kyle Perkins
Some researchers now suspect that almost all tests given to students in all subjects, as well as general tests of intelligence and personality, are essentially language tests. The validity and the implications of this hypothesis, for education and guidance in general and for language teaching in particular, are obvious and enormous. Here, in a rigorous examination of these implications, perceptive researchers discuss the importance of language proficiency in varied testing situations, content similarities among intelligence, achievement, personality and language tests, cloze and dictation tasks as predictors of intelligence and achievement scores, and language proficiency as a source of variance in self-reported affective variables.

Second Languages in Primary Education 0-88377-132-2

Mildred R. Donoghue and John F. Kunkle
A timely guide to the principles and techniques of teaching a second language in the elementary school classroom. This volume will be especially useful for the many expanding bilingual programs and Title VII training courses being established throughout the nation, as well as for teachers, administrators, and supervisors of FLES and ESL programs. Written in clear, non-technical language, the book covers such topics as the rationale, traditional and modern, for studying a second language; applications of psycholinguistics and child development to second language learning; guidelines for teacher preparation; comparison of bilingual, FLES, and language switch programs; comparisons of several methods of teaching second languages; cultural objectives; application of FLES programs to long-term career education; and an overview of ESL.

NEWBURY HOUSE PUBLISHERS, INC.


ROWLEY, MASSACHUSETTS 01969

