Published in Language Testing (SAGE Publications): http://ltj.sagepub.com/content/8/1/67
The present research uses the verbal report methodology to examine how
listening tests work, and how processes not normally accessible through quan-
titative research methods influence test performance. Six introspectees pro-
vided data on four main areas of interest: the influence of the short-answer
test method on the measurement of listening comprehension; whether test
items can measure ’higher-level’ cognitive processes; whether test items can
measure how well listeners monitor the appropriacy of their interpretation;
and how question preview influences comprehension and test performance.
The interview protocols provide a great deal of data relevant to these and
related issues, the presentation and interpretation of which is the main purpose
of this paper. The protocols also indicate a serious dilemma for language
testers in that listening comprehension involves far more than the application
of linguistic knowledge to produce a propositional representation of a text;
rather it is an inferential process in which listeners attempt to construct an
interpretation which is meaningful in the light of their own assessment of the
situation, knowledge and experience. Thus there are often no clear, objective
criteria against which to judge the appropriacy of any one interpretation. The
implications of this are discussed.
I Introduction
Tests of second language listening ability are very common in lan-
guage education, and yet a review of the literature on listening, both
L1 and L2, suggests that there is no generally accepted, explanatory
theory of listening comprehension on which to base these tests (Buck,
1990). It seems that in practice test constructors are obliged to follow
their instincts and just do the best they can when constructing tests of
listening comprehension. While such a hit-or-miss approach would
seem to be largely unavoidable given our lack of knowledge of how
listening comprehension works, it is obviously unsatisfactory, and
suggests that there is an urgent need for research into the listening
process, and the best ways of testing it. However, there do not even
seem to be any sufficiently clearly stated hypotheses about the listen-
ing process which could form the basis for research. In such a case,
verbal reports of testees introspecting while taking listening tests
would seem to be the most productive course. Although this method may
not be very suitable for testing clearly formulated research hypoth-
eses, it does seem likely to provide a broad view of second-language
listening processes and indicate how listening tests work. Hopefully
at some future date, these could then be used to formulate hypotheses
which could be tested by other research methods. This study forms
part of a series of introspective studies into the processes of listening
comprehension, and presents findings related specifically to aspects
of the testing of second language listening comprehension.
The advantages and disadvantages of the verbal-report method-
ology have been examined in detail by Ericsson and Simon (1980;
1984; 1987) who provide a theoretical basis for the method in the
form of an information processing model. The method has been
reasonably widely used in second language education (Cohen and
Hosenfeld, 1981; Mann, 1982; Cohen, 1986a, 1986b; Sarig, 1987;
Faerch and Kasper, 1987) although less so in second language testing
(Cohen, 1984; Alderson, 1988). The methodology has been little used
to examine listening processes, but in a pilot study designed to investi-
gate whether the verbal report methodology could be used to investi-
gate listening comprehension Buck (1990, forthcoming) suggested
that verbal reports on introspection could provide useful data on both
listening processes and the taking of listening-tests. Buck also made
detailed suggestions for the practical implementation of the method-
ology to examine listening, and suggested a number of research topics
which seemed amenable to investigation. These recommendations
are incorporated into the present study.
One research topic of obvious interest is the test method and
its influence on listening-test scores. When discussing the nature of
construct validity, Campbell and Fiske (1959) argued that tests are
a trait-method unit, that is a union of the particular trait to be
measured with a particular measurement procedure. They propose
that construct validity can be examined by comparing the relative
contribution of trait and method variance to the total test variance.
However, the methodology they propose, the multitrait-multimethod
methodology, only examines the test method effect in abstract statis-
tical terms. It seems likely that the verbal report methodology can
offer opportunities to examine exactly how the task demanded of
the testee interacts with the comprehension process to influence both
reliability and validity.
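Campbell and Fiske's comparison can be illustrated with a toy numerical sketch: with two hypothetical traits (listening, reading) crossed with two test methods (short answer, multiple choice), convergent validity requires the same-trait, different-method correlations to exceed the correlations between measures that share only a method. All traits, methods and correlation figures below are invented for illustration only.

```python
# Toy sketch of Campbell and Fiske's (1959) multitrait-multimethod logic.
# Traits, methods and correlations are hypothetical, not data from this study.
measures = [
    ("listening", "short_answer"),
    ("listening", "multiple_choice"),
    ("reading",   "short_answer"),
    ("reading",   "multiple_choice"),
]

# Invented correlations between pairs of measures (upper triangle only).
corr = {
    (0, 1): 0.70,  # same trait, different method (convergent validity)
    (0, 2): 0.45,  # different trait, same method (shared method variance)
    (0, 3): 0.30,  # different trait, different method
    (1, 2): 0.25,
    (1, 3): 0.50,  # different trait, same method
    (2, 3): 0.65,  # same trait, different method
}

def mean(xs):
    return sum(xs) / len(xs)

# Same trait measured by different methods.
mono_trait = [r for (i, j), r in corr.items()
              if measures[i][0] == measures[j][0]]
# Different traits sharing only a measurement method.
mono_method = [r for (i, j), r in corr.items()
               if measures[i][0] != measures[j][0]
               and measures[i][1] == measures[j][1]]

print(round(mean(mono_trait), 3))   # average monotrait-heteromethod: 0.675
print(round(mean(mono_method), 3))  # average heterotrait-monomethod: 0.475
```

On these invented figures the trait variance dominates the method variance, which is the pattern Campbell and Fiske require; the verbal report methodology, by contrast, aims to show *how* such method variance arises rather than merely to quantify it.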
II Method
2 Many thanks are due to Michael Rost for his kind permission to use the Susan story.
each section before seeing the test questions. The three other subjects,
H, I and J did the test with question preview, that is they had time
to look at the test questions before hearing the text for each section
of the story. A transcript of the text, and the test items are given in
the Appendix.
Many of the test items were experimental, included for the purposes
of the research. They were of three different item
types. The first two attempted to operationalize the two hypothesized
levels of listening: (i) lower-level processing, using items which asked
for the reproduction of clearly stated information, and (ii) higher-
level processing, using items which required inferences based on the
information in the text.
The third item-type was included to try to assess listeners’ success
in monitoring their current interpretation of the text and consisted of
repetitions of earlier items. It seems reasonable to assume that if
listeners are monitoring their interpretation and modifying it in the
light of new input, then it may be possible to find evidence of this by
presenting certain test items a second time at a later point in the test.
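The logic of these repeated items can be sketched as follows: a changed answer to the repetition suggests the interpretation was revised in the light of later input, while an unchanged answer suggests it was retained. The item pairings and answers below are invented for illustration; they are not the items of the actual test.

```python
# Minimal sketch (invented item pairings and answers) of scanning repeated
# test items for evidence that listeners monitor and revise interpretations.
REPEATS = {"Q20": "Q5", "Q31": "Q12"}  # hypothetical later item -> earlier item

responses = {  # one testee's invented answers
    "Q5": "a man", "Q20": "her neighbour",  # answer revised on repetition
    "Q12": "angry", "Q31": "angry",         # answer retained on repetition
}

# True where the testee changed the earlier answer at the second presentation.
revised = {later: responses[later] != responses[earlier]
           for later, earlier in REPEATS.items()}
print(revised)  # {'Q20': True, 'Q31': False}
```

A revision flagged this way is only a candidate sign of monitoring; as the discussion below shows, the protocols are needed to decide whether a change reflects genuine reinterpretation or something else entirely.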
After completing the test items for each section testees were then
asked a number of interview questions. All interviewees were asked
the same questions. Firstly, they were asked about the test items:
were they clearly written, was there enough time to answer them, were
any missed, and did any of the questions give a hint to the other
questions? These were then followed by a number of questions intended
to examine comprehension of the text. The first asked interviewees to
estimate their level of comprehension of the section they had just
heard and this was then followed by a request to restate the content
in their L1, Japanese. After these came a number of general
questions relevant to each particular section. There were ques-
tions probing testees’ understanding, their thoughts, inferences,
predictions, and their mental images. A complete transcript of the
interviews is given in Buck (1990).
were normal. F explained that she had to think about this, and
decided that this must have been a Japanese-style window because she
could only conceive of the bar as a piece of wood jammed into the
window to prevent it sliding open. In such a case how should her
response be graded? Should she be penalized for making a reasonable
but wrong assumption about how the bar was used, especially when
she seems otherwise to have understood just as much as H?
While H and F mentioned both the bars and the guard, there were
cases where the testee only mentioned one of these. I said Susan
’employed a guard’ and J that she ’put bars on the window so the
robber can’t get in’. While J did offer a reason for Susan’s efforts,
both answers seem to give insufficient information considering the
fact that Susan tried a number of things to try to stop the burglar
getting in. However, while an ideal answer should contain far more,
can we assume from these answers that the testees were unable to
give a more complete answer, and hence did not understand very
well, or must we allow for the possibility that they did understand
quite well, but felt that the answer they gave was quite adequate?
Examination of the protocols indicates clearly that J gave a shorter
answer because of insufficient comprehension, whereas I seems to
have given a shorter answer because she felt it would be appropriate.
The problem also arises of what to do with answers which indicate
only a partial understanding of the text. Consider the case of E, who
replied, ’she put things by the window to prevent the burglar getting
in’. This is clearly wrong, but her response does indicate that she
understood quite a lot of this section, and it seems a pity that this
information should be wasted.
This would suggest that the responses to Q10 could be arranged
along some sort of continuum of desirability, from the most complete
answer to one indicating only a minimum of comprehension. The
problem with dichotomous scoring is that this continuum is turned
into a dichotomy, a process which seems both arbitrary and wasteful.
Two possible solutions present themselves. First, the test could be
administered to a group of competent listeners and their responses
used as a basis for judging the acceptability of responses. Secondly,
a differential rating of responses according to some assessment of
their desirability may be a far better way of marking such tests. Of
course, such a marking scheme would require time and effort to
develop, and may involve far more marking time than dichotomous
scoring. It is also possible that the extra information might not be so
great and might be more easily obtained by simply increasing the
number of dichotomous items. However, given the fact that the protocols
indicate clearly that comprehension is not simply ’on’ or ’off’,
research is needed to explore this possibility.
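A differential rating of this kind could be operationalized as a simple partial-credit key. The sketch below scores hypothetical responses to an item like Q10 along the continuum of completeness described above; the keyword categories and point values are purely illustrative, not a proposed marking scheme.

```python
# Illustrative partial-credit key for an item like Q10 (Susan's attempts to
# keep the burglar out). Keywords and weights are hypothetical.
RUBRIC = [
    (("bars", "guard"), 2),  # both preventive measures mentioned
    (("bars",), 1),          # only one measure mentioned
    (("guard",), 1),
    ((), 0),                 # nothing relevant: minimum score
]

def score(response: str) -> int:
    """Return the score of the first rubric category whose keywords all appear."""
    text = response.lower()
    for keywords, points in RUBRIC:
        if all(k in text for k in keywords):
            return points
    return 0

print(score("She put bars on the window and employed a guard"))  # 2
print(score("she employed a guard"))                             # 1
print(score("she went shopping"))                                # 0
```

Even this crude key preserves the distinction, lost under dichotomous scoring, between an answer showing full comprehension and one showing only partial comprehension; a real scheme would of course need piloting against competent listeners' responses, as suggested above.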
However, apart from such mixed items, there were 14 other items
which seem to have required inferences for a correct answer to be
made; these were of five different types and are discussed below, with
examples.
Inference type 1: This item type asked listeners to say how they
thought one of the characters felt at some particular point in the
narrative.
There was only one example of this inference type: Q12 in Section
3. After hearing that the robber still broke in, despite all Susan’s
attempts to prevent him, listeners were asked How do you think
Susan felt now? All testees responded with an answer which was
accepted as correct, and consideration of the variety of responses sug-
gests that any negative emotion could not, reasonably, be rejected.
Furthermore, examination of the questions by anyone who had not
heard the text would indicate that Susan had been burgled, and gene-
ral knowledge would indicate that such is not a pleasant experience.
Given this, it is doubtful whether this item was passage dependent
(Preston, 1964; Wendell, Weaver and Bickley, 1967; Connor and
Read, 1978), and such a problem could easily arise with other items
which required this type of inference.
Inference type 2: This item type asked testees to find reasons for
information clearly stated in the text. This was based on the fact that
speakers do not usually state explicitly everything they expect listeners
to understand; frequently the listener is expected to understand rela-
tionships between events and the implications of events from general
world knowledge.
There were three examples of this type of inference question (Q16,
Q25, and Q54), and they all seemed to have worked reasonably
well, although there were some problems. For example, in Section 4,
which says that Susan went to the police and then adds that she
was very frightened, Q 16 asked testees Why do you think Susan was
frightened? The test maker expected the correct response to be
’because she couldn’t understand what was happening,’ and although
none of the testees gave that response, three testees (E, H and I) said
they thought she was frightened because the robberies still continued.
This was obviously accepted as correct, as was G’s response that she
was frightened because the situation was weird.
Inference type 5: This item type asked testees to find reasons for what
seemed like an obvious inference made by the test constructor.
There were three of these (Q8, Q17 and Q26), and they caused
considerable problems. The first came in Section 2 after the descrip-
tion of frequent visits by a thief who only stole small things. The test
constructor had inferred that such a burglar was not a normal profes-
sional thief working for gain, and so testees were asked Q8 How
do you know this was an amateur doing this? However, testees had
difficulty with this item, and in fact only one testee, E, answered
correctly with ’if he wasn’t an amateur he would take valuable things
with only one robbery’. It seems that the other testees did not make
the same inference as the test maker and were at a loss what to answer.
The data suggest that constructing short-answer comprehension
questions to test whether listeners are utilizing the input to make
’reasonable’ inferences is very difficult. Care must be taken to ensure
that items are passage dependent. Of course it is standard practice for
test makers to check that other questions do not give away answers,
but there is a second danger with these inference questions. There are
cases in the data of answers which are literally correct but actually add
no new information to what could be gleaned from reading the ques-
tions. Furthermore, if questions are not carefully worded it is often
possible to answer them literally correctly by simply repeating the
clearly stated information on which it was expected the inference
would be based.
Another serious problem concerns the necessity of ensuring the
possibility of some responses being wrong. The fact is that many
reactions to a text, such as ideas about what Susan was feeling at a
given point in the story, are simply a matter of personal opinion.
While we might feel that being burgled on a regular basis would be
a frightening experience, the listener is quite free to feel that it would
be an exciting and interesting adventure. Similarly, there is no basis
for deciding that any prediction about the future development of
the story is wrong, however outlandish. There is a serious danger of
the test maker insisting on one individualistic interpretation of a
text when all available theory suggests that what listeners actually do
is find an interpretation which is meaningful to them personally
(Sperber and Wilson, 1986; Rost, 1990). In order to avoid this it
is necessary to ensure that responses to inference items are clearly
constrained by the text.
Related to the question of personal interpretation is the fact that
inferences are based on background knowledge, which obviously
differs between testees. A listening comprehension test ought not to
test general knowledge, and it would seem important that the test
maker takes care not to reject responses simply because the testee had
different, or insufficient, background knowledge. However, it can be
very difficult to decide which responses are reasonable and which are
not. In the present test, despite the fact that the testees came from
the same L1 background and passed through the same educational
system in a society frequently characterized as extremely homoge-
neous, there was considerable variation between individual inter-
pretations of the text. It seems reasonable to suppose that when
testees from a variety of cultural backgrounds are taking the same
test, differences in cultural assumptions and background knowledge
could lead to cases where it is not clear whether the testee had simply
not understood and was guessing wildly, or really had understood
but had reached different yet perfectly reasonable conclusions.
Despite the problems constructing inference items, with skilful
item writing and careful piloting two question types seem to offer the
possibility of testing inferential skills: (i) asking testees to find reasons
for information clearly stated in the text; and (ii) asking them to make
deductions which follow logically from information in the text. The
problems are complex, and it is unlikely that one small study has
revealed them all; there is clearly a need for further research on ways
of testing higher-level listening processes.
However, one very worrying fact is the possibility of a strong
response set towards such items. There were 13 occasions when testees
seemed to have understood well enough to have made the inference
necessary to answer a test question but failed to do so. This occurred
once with F, three times with I and nine times with J. While the cases
of F and I may not suggest anything untoward, the case of J seems
to indicate a serious problem. A number of explanations are possible,
but examination of the protocols suggests that in fact J is making
the inferences required but that she is not inclined to use them in
her responses to test questions. This reluctance to give answers which
contain more information than that literally stated in the text seems
to indicate a strong response set (Cronbach, 1946; 1950). Whether
this is a characteristic commonly found among other test-takers is
not clear, but the implications for communicative testing are quite
considerable. More research is clearly needed on this issue.
IV Conclusions
The main aim of the present research has been to use the verbal report
methodology to examine how listening tests work, and how processes
not normally accessible through quantitative research methods influ-
ence testee performance, test reliability and validity. A number of
areas of interest were examined based on recommendations made
in a pilot study into the use of introspective techniques in listening
research (Buck, 1990, forthcoming). It should be stressed that the
topics examined here constitute only a small sample of the topics
which could profitably be examined by means of testee introspection.
Results support earlier findings that the methodology can provide
valuable insights into many aspects of language processing and how
this relates to test performance, and its further use is recommended.
It seems that despite their apparent simplicity, short-answer com-
prehension questions can result in a number of problems which influ-
ence the measurement of comprehension in quite complex ways.
The protocols provide useful data on how test unreliability arises
due to a shortage of time, response evaluation, and implementation
problems. Furthermore, the problems associated with the inference
items show in a very practical way how item behaviour can influence
reliability, and the section on monitoring throws light on how the
information available to testees, and their expectations, influences
their interpretations of what test questions are asking.
This relates largely to reliability. Regarding validity, the protocols
seem to provide no concrete evidence of the short-answer test method
systematically influencing comprehension in such a way that would
call into question the construct validity of a listening test which used
this method. However, the method does seem to have one major
drawback in that it allows testees considerable freedom to decide
what they think a question is asking, or what information they will
give in reply, and consequently it may be difficult to force testees to
address a specific issue if they are trying to avoid doing so. Whether
this systematically excludes certain aspects of comprehension from
being tested is not clear, but there are obvious implications for con-
struct validity and further research is clearly required.
Finally, the protocols highlight a dilemma which arises with all
V References
Alderson, J.C. 1984: Reading in a foreign language: a reading problem or
a language problem? In Alderson, J.C. and Urquhart, A.H., editors,