Published in Language Testing (SAGE Publications): http://ltj.sagepub.com/content/8/1/67
The present research uses the verbal report methodology to examine how
listening tests work, and how processes not normally accessible through quan-
titative research methods influence test performance. Six introspectees pro-
vided data on four main areas of interest: the influence of the short-answer
test method on the measurement of listening comprehension; whether test
items can measure ’higher-level’ cognitive processes; whether test items can
measure how well listeners monitor the appropriacy of their interpretation;
and how question preview influences comprehension and test performance.
The interview protocols provide a great deal of data relevant to these and
related issues, the presentation and interpretation of which is the main purpose
of this paper. The protocols also indicate a serious dilemma for language
testers in that listening comprehension involves far more than the application
of linguistic knowledge to produce a propositional representation of a text;
rather it is an inferential process in which listeners attempt to construct an
interpretation which is meaningful in the light of their own assessment of the
situation, knowledge and experience. Thus there are often no clear, objective
criteria against which to judge the appropriacy of any one interpretation. The
implications of this are discussed.
I Introduction
Tests of second language listening ability are very common in lan-
guage education, and yet a review of the literature on listening, both
L1 and L2, suggests that there is no generally accepted, explanatory
theory of listening comprehension on which to base these tests (Buck,
1990). It seems that in practice test constructors are obliged to follow
their instincts and just do the best they can when constructing tests of
listening comprehension. While such a hit-or-miss approach would
seem to be largely unavoidable given our lack of knowledge of how
listening comprehension works, it is obviously unsatisfactory, and
suggests that there is an urgent need for research into the listening
process, and the best ways of testing it. However, there do not even
seem to be any sufficiently clearly stated hypotheses about the listen-
ing process which could form the basis for research. In such a case,
verbal reports of testees introspecting while taking listening tests
would seem to be the most productive course. Although this method may
not be very suitable for testing clearly formulated research hypoth-
eses, it does seem likely to provide a broad view of second-language
listening processes and indicate how listening tests work. Hopefully
at some future date, these could then be used to formulate hypotheses
which could be tested by other research methods. This study forms
part of a series of introspective studies into the processes of listening
comprehension, and presents findings related specifically to aspects
of the testing of second language listening comprehension.
The advantages and disadvantages of the verbal-report method-
ology have been examined in detail by Ericsson and Simon (1980;
1984; 1987) who provide a theoretical basis for the method in the
form of an information processing model. The method has been
reasonably widely used in second language education (Cohen and
Hosenfeld, 1981; Mann, 1982; Cohen, 1986a, 1986b; Sarig, 1987;
Faerch and Kasper, 1987) although less so in second language testing
(Cohen, 1984; Alderson, 1988). The methodology has been little used
to examine listening processes, but in a pilot study designed to investi-
gate whether the verbal report methodology could be used to investi-
gate listening comprehension Buck (1990, forthcoming) suggested
that verbal reports on introspection could provide useful data on both
listening processes and the taking of listening-tests. Buck also made
detailed suggestions for the practical implementation of the method-
ology to examine listening, and suggested a number of research topics
which seemed amenable to investigation. These recommendations
are incorporated into the present study.
One research topic of obvious interest is the test method and
its influence on listening-test scores. When discussing the nature of
construct validity, Campbell and Fiske (1959) argued that tests are
a trait-method unit, that is a union of the particular trait to be
measured with a particular measurement procedure. They propose
that construct validity can be examined by comparing the relative
contribution of trait and method variance to the total test variance.
However, the methodology they propose, the multitrait-multimethod
methodology, only examines the test method effect in abstract statis-
tical terms. It seems likely that the verbal report methodology can
offer opportunities to examine exactly how the task demanded of
the testee interacts with the comprehension process to influence both
reliability and validity.
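Campbell and Fiske's comparison can be illustrated with a toy numerical sketch: with two hypothetical traits (listening, reading) crossed with two test methods (short answer, multiple choice), convergent validity requires the same-trait, different-method correlations to exceed the correlations between measures that share only a method. All traits, methods and correlation figures below are invented for illustration only.

```python
# Toy sketch of Campbell and Fiske's (1959) multitrait-multimethod logic.
# Traits, methods and correlations are hypothetical, not data from this study.
measures = [
    ("listening", "short_answer"),
    ("listening", "multiple_choice"),
    ("reading",   "short_answer"),
    ("reading",   "multiple_choice"),
]

# Invented correlations between pairs of measures (upper triangle only).
corr = {
    (0, 1): 0.70,  # same trait, different method (convergent validity)
    (0, 2): 0.45,  # different trait, same method (shared method variance)
    (0, 3): 0.30,  # different trait, different method
    (1, 2): 0.25,
    (1, 3): 0.50,  # different trait, same method
    (2, 3): 0.65,  # same trait, different method
}

def mean(xs):
    return sum(xs) / len(xs)

# Same trait measured by different methods.
mono_trait = [r for (i, j), r in corr.items()
              if measures[i][0] == measures[j][0]]
# Different traits sharing only a measurement method.
mono_method = [r for (i, j), r in corr.items()
               if measures[i][0] != measures[j][0]
               and measures[i][1] == measures[j][1]]

print(round(mean(mono_trait), 3))   # average monotrait-heteromethod: 0.675
print(round(mean(mono_method), 3))  # average heterotrait-monomethod: 0.475
```

On these invented figures the trait variance dominates the method variance, which is the pattern Campbell and Fiske require; the verbal report methodology, by contrast, aims to show *how* such method variance arises rather than merely to quantify it.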
II Method
2 Many thanks are due to Michael Rost for his kind permission to use the Susan story.
each section before seeing the test questions. The three other subjects,
H, I and J did the test with question preview, that is they had time
to look at the test questions before hearing the text for each section
of the story. A transcript of the text, and the test items are given in
the Appendix.
Many of the test items were experimental, included for the purposes
of the research. They were of three different item
types. The first two attempted to operationalize the two hypothesized
levels of listening: (i) lower-level processing, using items which asked
for the reproduction of clearly stated information, and (ii) higher-
level processing, using items which required inferences based on the
information in the text.
The third item-type was included to try to assess listeners’ success
in monitoring their current interpretation of the text and consisted of
repetitions of earlier items. It seems reasonable to assume that if
listeners are monitoring their interpretation and modifying it in the
light of new input, then it may be possible to find evidence of this by
presenting certain test items a second time at a later point in the test.
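The logic of these repeated items can be sketched as follows: a changed answer to the repetition suggests the interpretation was revised in the light of later input, while an unchanged answer suggests it was retained. The item pairings and answers below are invented for illustration; they are not the items of the actual test.

```python
# Minimal sketch (invented item pairings and answers) of scanning repeated
# test items for evidence that listeners monitor and revise interpretations.
REPEATS = {"Q20": "Q5", "Q31": "Q12"}  # hypothetical later item -> earlier item

responses = {  # one testee's invented answers
    "Q5": "a man", "Q20": "her neighbour",  # answer revised on repetition
    "Q12": "angry", "Q31": "angry",         # answer retained on repetition
}

# True where the testee changed the earlier answer at the second presentation.
revised = {later: responses[later] != responses[earlier]
           for later, earlier in REPEATS.items()}
print(revised)  # {'Q20': True, 'Q31': False}
```

A revision flagged this way is only a candidate sign of monitoring; as the discussion below shows, the protocols are needed to decide whether a change reflects genuine reinterpretation or something else entirely.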
After completing the test items for each section testees were then
asked a number of interview questions. All interviewees were asked
the same questions. Firstly, they were asked about the test items:
were they clearly written, was there enough time to answer them, were
any missed, and did any of the questions give a hint to the other
questions? These were then followed by a number of questions intended
to examine comprehension of the text. The first asked interviewees to
estimate their level of comprehension of the section they had just
heard and this was then followed by a request to restate the content
in their L1, Japanese. After these came a number of general
questions relevant to each particular section. There were ques-
tions probing testees’ understanding, their thoughts, inferences,
predictions, and their mental images. A complete transcript of the
interviews is given in Buck (1990).
were normal. F explained that she had to think about this, and
decided that this must have been a Japanese-style window because she
could only conceive of the bar as a piece of wood jammed into the
window to prevent it sliding open. In such a case how should her
response be graded? Should she be penalized for making a reasonable
but wrong assumption about how the bar was used, especially when
she seems otherwise to have understood just as much as H?
While H and F mentioned both the bars and the guard, there were
cases where the testee only mentioned one of these. I said Susan
’employed a guard’ and J that she ’put bars on the window so the
robber can’t get in’. While J did offer a reason for Susan’s efforts,
both answers seem to give insufficient information considering the
fact that Susan tried a number of things to try to stop the burglar
getting in. However, while an ideal answer should contain far more,
can we assume from these answers that the testees were unable to
give a more complete answer, and hence did not understand very
well, or must we allow for the possibility that they did understand
quite well, but felt that the answer they gave was quite adequate?
Examination of the protocols indicates clearly that J gave a shorter
answer because of insufficient comprehension, whereas I seems to
have given a shorter answer because she felt it would be appropriate.
The problem also arises of what to do with answers which indicate
only a partial understanding of the text. Consider the case of E, who
replied, ’she put things by the window to prevent the burglar getting
in’. This is clearly wrong, but her response does indicate that she
understood quite a lot of this section, and it seems a pity that this
information should be wasted.
This would suggest that the responses to Q10 could be arranged
along some sort of continuum of desirability, from the most complete
answer to one indicating only a minimum of comprehension. The
problem with dichotomous scoring is that this continuum is turned
into a dichotomy, a process which seems both arbitrary and wasteful.
Two possible solutions present themselves. First, the test could be
administered to a group of competent listeners and their responses
used as a basis for judging the acceptability of responses. Secondly,
a differential rating of responses according to some assessment of
their desirability may be a far better way of marking such tests. Of
course, such a marking scheme would require time and effort to
develop, and may involve far more marking time than dichotomous
scoring. It is also possible that the extra information might not be so
great and might be more easily obtained by simply increasing the
number of dichotomous items. However, given the fact that the protocols
indicate clearly that comprehension is not simply ’on’ or ’off’,
research is needed to explore this possibility.
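A differential rating of this kind could be operationalized as a simple partial-credit key. The sketch below scores hypothetical responses to an item like Q10 along the continuum of completeness described above; the keyword categories and point values are purely illustrative, not a proposed marking scheme.

```python
# Illustrative partial-credit key for an item like Q10 (Susan's attempts to
# keep the burglar out). Keywords and weights are hypothetical.
RUBRIC = [
    (("bars", "guard"), 2),  # both preventive measures mentioned
    (("bars",), 1),          # only one measure mentioned
    (("guard",), 1),
    ((), 0),                 # nothing relevant: minimum score
]

def score(response: str) -> int:
    """Return the score of the first rubric category whose keywords all appear."""
    text = response.lower()
    for keywords, points in RUBRIC:
        if all(k in text for k in keywords):
            return points
    return 0

print(score("She put bars on the window and employed a guard"))  # 2
print(score("she employed a guard"))                             # 1
print(score("she went shopping"))                                # 0
```

Even this crude key preserves the distinction, lost under dichotomous scoring, between an answer showing full comprehension and one showing only partial comprehension; a real scheme would of course need piloting against competent listeners' responses, as suggested above.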
However, apart from such mixed items, there were 14 other items
which seem to have required inferences for a correct answer to be
made; these were of five different types and are discussed below, with
examples.
Inference type 1: This item type asked listeners to say how they
thought one of the characters felt at some particular point in the
narrative.
There was only one example of this inference type: Q12 in Section
3. After hearing that the robber still broke in, despite all Susan’s
attempts to prevent him, listeners were asked How do you think
Susan felt now? All testees responded with an answer which was
accepted as correct, and consideration of the variety of responses sug-
gests that any negative emotion could not, reasonably, be rejected.
Furthermore, examination of the questions by anyone who had not
heard the text would indicate that Susan had been burgled, and gene-
ral knowledge would indicate that such is not a pleasant experience.
Given this, it is doubtful whether this item was passage dependent
(Preston, 1964; Wendell, Weaver and Bickley, 1967; Connor and
Read, 1978), and such a problem could easily arise with other items
which required this type of inference.
Inference type 2: This item type asked testees to find reasons for
information clearly stated in the text. This was based on the fact that
speakers do not usually state explicitly everything they expect listeners
to understand; frequently the listener is expected to understand rela-
tionships between events and the implications of events from general
world knowledge.
There were three examples of this type of inference question (Q16,
Q25, and Q54), and they all seemed to have worked reasonably
well, although there were some problems. For example, in Section 4,
which says that Susan went to the police and then adds that she
was very frightened, Q 16 asked testees Why do you think Susan was
frightened? The test maker expected the correct response to be
’because she couldn’t understand what was happening,’ and although
none of the testees gave that response, three testees (E, H and I) said
they thought she was frightened because the robberies still continued.
This was obviously accepted as correct, as was G’s response that she
was frightened because the situation was weird.
Inference type 5: This item type asked testees to find reasons for what
seemed like an obvious inference made by the test constructor.
There were three of these (Q8, Q17 and Q26), and they caused
considerable problems. The first came in Section 2 after the descrip-
tion of frequent visits by a thief who only stole small things. The test
constructor had inferred that such a burglar was not a normal profes-
sional thief working for gain, and so testees were asked Q8 How
do you know this was an amateur doing this? However, testees had
difficulty with this item, and in fact only one testee, E, answered
correctly with ’if he wasn’t an amateur he would take valuable things
with only one robbery’. It seems that the other testees did not make
the same inference as the test maker and were at a loss what to answer.
The data suggest that constructing short-answer comprehension
questions to test whether listeners are utilizing the input to make
’reasonable’ inferences is very difficult. Care must be taken to ensure
that items are passage dependent. Of course it is standard practice for
test makers to check that other questions do not give away answers,
but there is a second danger with these inference questions. There are
cases in the data of answers which are literally correct but actually add
no new information to what could be gleaned from reading the ques-
tions. Furthermore, if questions are not carefully worded it is often
possible to answer them literally correctly by simply repeating the
clearly stated information on which it was expected the inference
would be based.
Another serious problem concerns the necessity of ensuring the
possibility of some responses being wrong. The fact is that many
reactions to a text, such as ideas about what Susan was feeling at a
given point in the story, are simply a matter of personal opinion.
While we might feel that being burgled on a regular basis would be
a frightening experience, the listener is quite free to feel that it would
be an exciting and interesting adventure. Similarly, there is no basis
for deciding that any prediction about the future development of
the story is wrong, however outlandish. There is a serious danger of
the test maker insisting on one individualistic interpretation of a
text when all available theory suggests that what listeners actually do
is find an interpretation which is meaningful to them personally
(Sperber and Wilson, 1986; Rost, 1990). In order to avoid this it
is necessary to ensure that responses to inference items are clearly
constrained by the text.
Related to the question of personal interpretation is the fact that
inferences are based on background knowledge, which obviously
differs between testees. A listening comprehension test ought not to
test general knowledge, and it would seem important that the test
maker takes care not to reject responses simply because the testee had
different, or insufficient, background knowledge. However, it can be
very difficult to decide which responses are reasonable and which are
not. In the present test, despite the fact that the testees came from
the same L1 background and passed through the same educational
system in a society frequently characterized as extremely homoge-
neous, there was considerable variation between individual inter-
pretations of the text. It seems reasonable to suppose that when
testees from a variety of cultural backgrounds are taking the same
test, differences in cultural assumptions and background knowledge
could lead to cases where it is not clear whether the testee had simply
not understood and was guessing wildly, or really had understood
but had reached different yet perfectly reasonable conclusions.
Despite the problems constructing inference items, with skilful
item writing and careful piloting two question types seem to offer the
possibility of testing inferential skills: (i) asking testees to find reasons
for information clearly stated in the text; and (ii) asking them to make
deductions which follow logically from information in the text. The
problems are complex, and it is unlikely that one small study has
revealed them all; there is clearly a need for further research on ways
of testing higher-level listening processes.
However, one very worrying fact is the possibility of a strong
response set towards such items. There were 13 occasions when testees
seemed to have understood well enough to have made the inference
necessary to answer a test question but failed to do so. This occurred
once with F, three times with I and nine times with J. While the cases
of F and I may not suggest anything untoward, the case of J seems
to indicate a serious problem. A number of explanations are possible,
but examination of the protocols suggests that in fact J is making
the inferences required but that she is not inclined to use them in
her responses to test questions. This reluctance to give answers which
contain more information than that literally stated in the text seems
to indicate a strong response set (Cronbach, 1946; 1950). Whether
this is a characteristic commonly found among other test-takers is
not clear, but the implications for communicative testing are quite
considerable. More research is clearly needed on this issue.
IV Conclusions
The main aim of the present research has been to use the verbal report
methodology to examine how listening tests work, and how processes
not normally accessible through quantitative research methods influ-
ence testee performance, test reliability and validity. A number of
areas of interest were examined based on recommendations made
in a pilot study into the use of introspective techniques in listening
research (Buck, 1990, forthcoming). It should be stressed that the
topics examined here constitute only a small sample of the topics
which could profitably be examined by means of testee introspection.
Results support earlier findings that the methodology can provide
valuable insights into many aspects of language processing and how
this relates to test performance, and its further use is recommended.
It seems that despite their apparent simplicity, short-answer com-
prehension questions can result in a number of problems which influ-
ence the measurement of comprehension in quite complex ways.
The protocols provide useful data on how test unreliability arises
due to a shortage of time, response evaluation, and implementation
problems. Furthermore, the problems associated with the inference
items show in a very practical way how item behaviour can influence
reliability, and the section on monitoring throws light on how the
information available to testees, and their expectations, influences
their interpretations of what test questions are asking.
This relates largely to reliability. Regarding validity, the protocols
seem to provide no concrete evidence of the short-answer test method
systematically influencing comprehension in such a way that would
call into question the construct validity of a listening test which used
this method. However, the method does seem to have one major
drawback in that it allows testees considerable freedom to decide
what they think a question is asking, or what information they will
give in reply, and consequently it may be difficult to force testees to
address a specific issue if they are trying to avoid doing so. Whether
this systematically excludes certain aspects of comprehension from
being tested is not clear, but there are obvious implications for con-
struct validity and further research is clearly required.
Finally, the protocols highlight a dilemma which arises with all
V References
Alderson, J.C. 1984: Reading in a foreign language: a reading problem or
a language problem? In Alderson, J.C. and Urquhart, A.H., editors,