Yufen Hsieh
To cite this article: Yufen Hsieh (2019): Effects of video captioning on EFL vocabulary
learning and listening comprehension, Computer Assisted Language Learning, DOI:
10.1080/09588221.2019.1577898
ABSTRACT
This study investigated how video caption type affected vocabulary learning and listening comprehension of low-intermediate Chinese-speaking learners of English. Each video was presented twice with one of five caption types: (1) no caption (NC), (2) full caption with no audio (FCNA), (3) full caption (FC), (4) full caption with highlighted target-word (FCHTW), and (5) full caption with highlighted target-word and L1 gloss (FCL1), in which the gloss was presented simultaneously with the full caption. The results showed that caption type did affect vocabulary learning. FCL1 facilitated the learning of both word form and meaning in a multimedia listening activity. FCHTW increased attention to word form at the expense of word meaning. Videos with either captions only (FCNA) or audio only (NC) were not helpful for the learning of written words, indicating that presentation of verbal information through two modalities (audio plus text) was superior to single-modality presentation. While caption type had no impact on listening comprehension, concurrent presentation of video, audio, and captions did not overload the learners in the FC condition, suggesting that selective attention might be allocated to different parts of the visual stimuli during the first and second exposures to the videos. Additionally, the presence of highlighted words and glosses in the captioning line might direct learner attention to vocabulary rather than video content.

KEYWORDS
Video captioning; vocabulary learning; listening comprehension; EFL
1. Introduction
Audiovisual material provides contextualized visual images that could aid the
understanding of verbal input (Plass & Jones, 2005). A combination of
imagery and verbal information could make L2 input more comprehensible
and easily retrievable from memory as the activation of both verbal and non-
verbal systems results in better learning (Paivio, 2007). Furthermore, captions
could enhance audiovisual input by allowing learners to visualize what they
hear, especially when the material is slightly beyond their level of proficiency.
During the parsing of continuous acoustic messages, captions may aid the
segmentation of speech streams as well as form-meaning mapping of lexical
items (Yeldham, 2018). Thus, it is easier for learners to identify meaningful
units of speech and to recognize words when viewing videos.
However, the effectiveness of different types of captioning has
remained unclear. For example, some studies reported positive caption-
ing effects on comprehension of video content (e.g. Hsu, Hwang, Chang,
& Chang, 2013; Winke, Gass, & Sydorenko, 2010), but others did not
(e.g. Aldera & Mohsen, 2013; Montero Perez, Peters, Clarebout, &
Desmet, 2014). Other caption types, such as captions with a first-language (L1) gloss, have been less explored in previous research (see Montero Perez, Peters, & Desmet, 2018). It also remains inconclusive to what extent captions improve different aspects of vocabulary learning. For
instance, some studies reported positive captioning effects on both mean-
ing recognition and recall (e.g. Sydorenko, 2010; Winke et al., 2010),
whereas others found no effect on meaning recall (e.g. Montero Perez
et al., 2014). Furthermore, limited research has examined how learners
with L1 Chinese, a language with non-Roman scripts, utilize captions in
an alphabetic language like English (see Hsu et al., 2013). While previous
research has primarily involved learner groups whose L1 and L2 both
use Roman scripts, it is necessary to recognize orthographic impact on
word processing across languages (e.g. Aldera & Mohsen, 2013;
Sydorenko, 2010; Winke, Gass, & Sydorenko, 2010, 2013).
In light of these issues, the present research examined the effects of
caption type on listening comprehension and vocabulary learning of
Chinese-speaking learners of English. Two video clips were presented
with one of the five caption types: (1) no caption (NC), (2) full caption
with no audio (FCNA), (3) full caption (FC), (4) full caption with high-
lighted target-word (FCHTW), and (5) full caption with highlighted tar-
get-word and L1 gloss (FCL1).
2. Methods
2.1. Participants
The participants were 105 undergraduates who were native speakers of
Mandarin Chinese and had learned English at school as a foreign lan-
guage for at least nine years. They were at the low-intermediate level of
English proficiency based on their TOEIC (Test of English for International Communication) scores.
2.2. Material
The videos were two English animations with a single narrator, each
approximately 4–5 minutes or 600–700 words in length. The titles of the
videos were ‘The psychology behind irrational decisions’ and ‘Why do
some people have seasonal allergies?’ made by TED-Ed (Technology
Entertainment Design Education). The video clips were downloaded
from the VoiceTube website, and then captions and glosses were added
to the videos using Corel VideoStudio. Each video was created with five
versions: (1) NC, (2) FCNA, (3) FC, (4) FCHTW, and (5) FCL1. The
screen shots of the five caption types are displayed in Figure 1. In the
FCHTW condition, all the target words were underlined. In the FCL1
condition, L1 (Chinese) translations appeared below the underlined tar-
get words. The target words were the words unknown to the learners
prior to the experiment, and the selection of the target words is detailed
below. The five caption types varied in the amount of text, the salience
of the target words, and the access to word meaning. Compared to
FCNA, FC and FCHTW, FCL1 contained a larger amount of text (i.e.
caption plus L1 gloss) and provided direct access to word meaning. The
target words were more salient in FCHTW and FCL1 relative to the
other captioning conditions.
For the selection of the target words in the clips, the following procedure was adopted to assure that the words were unfamiliar to low-intermediate learners. The scripts of the two videos were also comparable in their difficulty levels (Table 1). Two important aspects of lexical complexity were lexical density, measured as the proportion of content words
in a text (Daller, van Hout, & Treffers-Daller, 2003), and lexical sophistica-
tion, measured by root type-token ratio (Guiraud, 1954, as cited in Daller
& Xue, 2007). Most common indices of syntactic complexity included
average sentence length, namely average words per sentence, and amount
of subordination, namely the percentage of subordinate clauses out of all
clauses (Housen, Kuiken, & Vedder, 2012; Norris & Ortega, 2009).
Experts’ opinions were also sought to confirm that the language use of the
videos was suitable for low-intermediate learners.
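The two lexical complexity measures named above are straightforward to compute. The sketch below implements lexical density (Daller, van Hout, & Treffers-Daller, 2003) and the Guiraud root type-token ratio; the word counts in the usage example are invented for illustration and are not the study's actual figures.

```python
import math

def lexical_density(content_words: int, total_words: int) -> float:
    """Lexical density: proportion of content words in a text."""
    return content_words / total_words

def guiraud_index(types: int, tokens: int) -> float:
    """Lexical sophistication via root type-token ratio
    (Guiraud): distinct word types divided by sqrt(tokens)."""
    return types / math.sqrt(tokens)

# Hypothetical counts for a 650-word video script with
# 320 content words and 280 distinct word types:
print(round(lexical_density(320, 650), 3))
print(round(guiraud_index(280, 650), 3))
```

Average sentence length and the percentage of subordinate clauses, the syntactic measures cited from Housen, Kuiken, and Vedder (2012), are similar ratio computations once clauses have been hand-coded.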
2.3. Instruments
2.3.1. Comprehension tests
There was a comprehension test after each video in order to measure the
comprehension of video content. Each test contained ten multiple-choice
questions about the main ideas and details of the animation.
2.4. Procedures
The study took place in a computer lab. The participants were first given instructions explaining the tasks involved in the experiment. Then, they watched each of the two videos ‘irrational decisions’ and ‘seasonal allergies’
twice. The two videos were presented with one of the five caption types: (1)
NC, (2) FCNA, (3) FC, (4) FCHTW, and (5) FCL1. After the second view-
ing of each video, the participants took a comprehension test and three
vocabulary tests. The whole experiment took around 40–50 minutes.
3. Results
The results of descriptive statistics for the listening comprehension and
vocabulary test scores in the five captioning conditions are presented in
Table 2.
The recall task required retrieval of word meaning from memory, while
the recognition task only involved form-meaning matching. The meaning
recall score was highest in the FCL1 condition (M = 11.38, SD = 2.31), followed by the FC condition (M = 7.19, SD = 2.93), whereas the scores were lowest in the FCHTW (M = 5.00, SD = 2.51), the NC (M = 5.05, SD = 2.01), and the FCNA conditions (M = 4.52, SD = 1.81).
As shown in Table 3, the one-way MANCOVA yielded a significant main effect of caption type on vocabulary learning after controlling for listening comprehension (F(12, 257) = 11.62, p < .001, Wilks' Λ = .32, η² = .32). The effect size statistics revealed a large effect of caption type on the three vocabulary test outcomes (η² = .238 on form recognition, η² = .589 on meaning recognition, η² = .564 on meaning recall).
Table 4 presents the post hoc Tukey test results of vocabulary test
scores across captioning conditions. For form recognition, post hoc comparisons showed that the FCL1, the FCHTW, and the FC groups scored significantly higher than the NC and the FCNA groups (ps < .01). The other pairwise comparisons did not achieve statistical significance (ps > .05). In other words, presenting new words with both audio and captions (two modalities) was more effective for form recognition than with either audio or captions alone (single modality).
As for meaning recognition and recall, post hoc analyses revealed that the FCL1 group significantly outperformed all the other groups (ps < .001). The FC group gained significantly higher scores than the FCHTW, the NC, and the FCNA groups (ps < .05). There was no significant difference among the FCHTW, the NC, and the FCNA groups (ps > .05). The findings demonstrated the effectiveness of FCL1 and FC
for enhancing meaning recognition and recall probably because the two
caption types visualized the auditory input and drew the participants’
attention to word meaning. FCL1 was particularly useful as it provided
direct access to word meaning. In contrast, while highlighting the target-
words facilitated form recognition, it appeared to hinder meaning recog-
nition and recall. The significantly lower scores in the FCHTW relative
to the FC condition suggested that the participants might have paid
more attention to word form than to meaning in the FCHTW condition.
Table 4. Post hoc comparisons of vocabulary test scores between captioning conditions.
Test type Comparison Mean difference Sig.
Form recognition FCL1-FCHTW .200 .803
FCL1-FC .843 .295
FCL1-NC 3.009 .000
FCL1-FCNA 3.374 .000
FCHTW-FC .643 .426
FCHTW-NC 2.810 .001
FCHTW-FCNA 3.174 .000
FC-NC 2.167 .008
FC-FCNA 2.531 .003
NC-FCNA .364 .653
Meaning recognition FCL1-FCHTW 6.205 .000
FCL1-FC 4.745 .000
FCL1-NC 6.395 .000
FCL1-FCNA 6.602 .000
FCHTW-FC –1.460 .029
FCHTW-NC .190 .772
FCHTW-FCNA .397 .550
FC-NC 1.651 .014
FC-FCNA 1.857 .007
NC-FCNA .207 .755
Meaning recall FCL1-FCHTW 6.260 .000
FCL1-FC 4.378 .000
FCL1-NC 6.213 .000
FCL1-FCNA 6.335 .000
FCHTW-FC –1.882 .007
FCHTW-NC –.048 .944
FCHTW-FCNA .074 .913
FC-NC 1.835 .008
FC-FCNA 1.957 .006
NC-FCNA .122 .858
FCL1, full caption with highlighted target-word and L1 gloss; FCHTW, full caption with highlighted target-
word; FC, full caption; NC, no caption; FCNA, full caption with no audio.
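Pairwise contrasts of the kind shown in Table 4 can be sketched with SciPy's Tukey HSD routine. The scores below are simulated: the group means loosely follow the descriptive statistics for meaning recall reported above, while the per-group sample size and standard deviations are assumptions, so the resulting p-values only illustrate the procedure.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
n = 21  # assumed learners per condition (the study had 105 in total)

# Simulated meaning-recall scores per caption condition
fcl1  = rng.normal(11.4, 2.3, n)
fc    = rng.normal(7.2, 2.9, n)
fchtw = rng.normal(5.0, 2.5, n)
nc    = rng.normal(5.1, 2.0, n)
fcna  = rng.normal(4.5, 1.8, n)

res = tukey_hsd(fcl1, fc, fchtw, nc, fcna)
# res.pvalue[i, j] holds the Tukey-adjusted p-value for groups i and j;
# index 0 is FCL1, 1 is FC, and so on in the order passed above.
print(round(res.pvalue[0, 1], 4))  # FCL1 vs. FC
```

`res.statistic` holds the corresponding matrix of mean differences, mirroring the "Mean difference" column of Table 4.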
4. Discussion
The findings revealed significant captioning effects on vocabulary gain,
including form recognition, meaning recognition, and meaning recall.
While caption type did not affect listening comprehension, concurrent presentation of video, audio, and captions did not overload the learners in the FC condition. Selective attention could help to reduce the load on visual working memory, and thus the benefit of verbal redundancy and the risk of cognitive overload could be reconciled.
In Hsu et al. (2013) and Baltova (1999), where captions were found beneficial for video comprehension relative to no captions, the research
design also reduced the overload that might result from split attention
between multiple visual inputs. Hsu et al. (2013) showed that, compared
to the no-caption group, the full-caption and the keyword-caption group
experienced larger learning gains in English listening comprehension
over a period of four weeks. The videos embedded with full English cap-
tions did not seem to cause cognitive overload for the elementary-level
EFL learners. One possible reason was that Hsu et al. allowed the learn-
ers to play, pause, and replay the videos for listening training. Thus, the
cognitive overload could be reduced as the learners had a sufficient
amount of time to process information from multiple sources. Baltova
(1999) provided captions for only about one-half of the script, which
contained key words for the most important content in the video.
Presumably keyword captions were easier to read given the lower textual
density, especially for the non-advanced learners, and thus created less
cognitive load as compared to full captions. The benefit of keyword cap-
tions could be maximized by the design of the ten short-answer compre-
hension questions that all addressed the main points in the video. In
contrast, the multiple-choice comprehension questions in the present
study tested the main ideas as well as specific details of the animations,
thus requiring more detailed understanding of the content.
The present study added further evidence that providing glosses for unknown words improved meaning acquisition of low-intermediate learners.
However, our findings were somewhat different from Montero Perez
et al.’s (2014) regarding captioning effects on meaning recognition.
Montero Perez et al. (2014) reported that the groups receiving full cap-
tioning with and without highlighted keywords gained similar scores on
meaning recognition and both scored higher than the no-captioning
group. In the current study, however, FCHTW led to significantly worse
performance in meaning recognition relative to FC. One possibility was
that there exists a competition between attention to form and attention
to meaning during language processing, especially for early-stage learners
(VanPatten, 1990). Since our participants (low-intermediate level) had
relatively lower proficiency than those in Montero Perez et al. (2014)
(high-intermediate level), they might have had greater difficulty in
attending to both form and meaning of new words. While the partici-
pants’ attention was mostly devoted to meaning extraction in the FC
condition, attention to the highlighted vocabulary in the FCHTW condi-
tion was likely to be at the expense of meaning processing, thus resulting
in low gains on meaning recognition and recall. Another possible factor
that could lead to the different outcomes in Montero Perez et al. (2014)
and the present study was the L1 background of the participants. While
Montero Perez et al.’s participants were Dutch-speaking learners of
English, whose L1 and L2 were both Germanic languages sharing the
same writing system, the present study involved Chinese-speaking learn-
ers of English, who had to read in an L2 with a typologically distinct
orthographic system. Compared to the participants in Montero Perez
et al., our non-advanced Chinese-speaking participants might have allo-
cated more cognitive resources to orthographic processing at the expense
of meaning processing, especially in the FCHTW condition.
In line with Baltova (1999) and Sydorenko (2010), which also involved
non-advanced learners, the present study revealed that for meaning recall
of written words video combined with both captions and audio (the FC
condition) was more beneficial than with either captions (the FCNA con-
dition) or audio (the NC condition). Concurrent presentation of spoken and written verbal input had an advantage over spoken-only and written-only presentations in a video learning context, at least for non-advanced
learners. As reported in the meta-analysis by Adesope and Nesbit (2012),
spoken-written presentation was beneficial especially for system-paced
learning material and low prior-knowledge learners. As the low-inter-
mediate learners in the present study were not allowed to rewind, pause,
or slow down the videos, receiving the same verbal information through
two sensory channels rather than one could help them recover from failure of either channel and thus enhance decoding of unknown words.
Based on the dual-processing theory of multimedia learning (Mayer,
2005), written information is represented in a visual processing system
and the corresponding spoken information in an auditory system. Since
the two processing systems are separate and draw from distinct cognitive
resources, learners can hold two kinds of verbal information in working
memory simultaneously without competition between them.
As for the comparison between the NC (video + audio) and the FCNA group (video + captions), who received verbal information in a single
modality, the NC group performed non-significantly better than the
FCNA group, contrary to the finding in Sydorenko (2010). As discussed
previously, the discrepancy could be attributed to the participants’ profi-
ciency levels. In Sydorenko’s study the beginning learners benefited from
captions more than audio in learning new words as they might have bet-
ter reading than listening skills. When learners progressed on the devel-
opment of an L2, they became less dependent on captions when listening
to multimedia material, as the case of the low-intermediate learners in
the current study (Leveridge & Yang, 2013; Yeldham, 2018).
5. Conclusions
Overall, the vocabulary test results demonstrated that learners could
benefit from multimedia material with a combination of captions, images
and audio as multimodality makes input accessible through different
channels. Concurrent presentations of the three kinds of information did
not lead to cognitive overload probably because the learners could select-
ively attend to different parts of the visual stimuli during the first and
second exposure to the videos.
Particularly, FCL1 enhanced all three aspects of vocabulary learning, suggesting that it could promote attention to formal and semantic
features of a word and reinforce form-meaning connections. Highlighted
target words without glosses (FCHTW), on the other hand, appeared to
inhibit meaning recognition and recall, as more attentional resources
might be allocated to word form than to meaning. Videos with either
captions (FCNA) or audio (NC) did not help vocabulary acquisition, at
least for non-advanced learners, indicating that presentation of verbal
information through two modalities (audio plus text) was superior to
single-modality presentation in a video learning context. Unlike caption-
ing effects on vocabulary learning, caption type had no impact on listen-
ing comprehension, though the FC group performed non-significantly
better than the other groups. It was possible that the highlights and glosses in the captioning line directed learner attention to vocabulary rather than to the video content.
Disclosure statement
No potential conflict of interest was reported by the author.
Notes
1. Millett, Quinn, and Nation (2007) suggested a 70% comprehension accuracy as an
acceptable level in L2 reading.
2. For the meaning recall tests, all of the participants’ responses were either exact
translations/synonyms or incorrect translations. No partial credit was given because
none of the responses were from the same semantic fields as the respective
target words.
Notes on contributor
Yufen Hsieh holds a Ph.D. in Linguistics from the University of Michigan, Ann Arbor.
Currently, she is an Assistant Professor in the Department of Applied Foreign
Languages, National Taiwan University of Science and Technology. Her research inter-
ests include second language learning and teaching as well as reading comprehension.
References
Adesope, O. O., & Nesbit, J. C. (2012). Verbal redundancy in multimedia learning envi-
ronments: A meta-analysis. Journal of Educational Psychology, 104(1), 250–263. doi:
10.1037/a0026147
Aldera, A. S., & Mohsen, M. A. (2013). Annotations in captioned animation: Effects on
vocabulary learning and listening skills. Computers & Education, 68, 60–75. doi:
10.1016/j.compedu.2013.04.018
Baltova, I. (1999). Multisensory language teaching in a multidimensional curriculum:
The use of authentic bimodal video in core French. The Canadian Modern Language
Review, 56(1), 32–48.
Bird, S. A., & Williams, J. N. (2002). The effect of bimodal input on implicit and explicit memory: An investigation into the benefits of within-language subtitling. Applied Psycholinguistics, 23(4), 509–533.
Daller, H., van Hout, R., & Treffers-Daller, J. (2003). Lexical richness in the spontaneous
speech of bilinguals. Applied Linguistics, 24(2), 197–222. doi:10.1093/applin/24.2.197
Daller, H., & Xue, H. (2007). Lexical richness and the oral proficiency of Chinese EFL
students. In H. Daller, J. Milton & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 93–115). Cambridge: Cambridge University Press.
Ellis, N. C. (2016). Salience, cognition, language complexity, and complex adaptive systems. Studies in Second Language Acquisition, 38(2), 341–351. doi:10.1017/S027226311600005X
Housen, A., Kuiken, F., & Vedder, I. (2012). Complexity, accuracy and fluency:
Definitions, measurement and research. In A. Housen, F. Kuiken, & I. Vedder (Eds.),
Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in
SLA (pp. 1–20). (Language Learning & Language Teaching; Vol. 32). Amsterdam:
John Benjamins Publishing Company.
Hsu, C.-K., Hwang, G.-J., Chang, Y.-T., & Chang, C.-K. (2013). Effects of video caption
modes on English listening comprehension and vocabulary acquisition using handheld
devices. Educational Technology & Society, 16(1), 403–414.
Leveridge, A. N., & Yang, J. C. (2013). Testing learner reliance on caption supports in second language listening comprehension multimedia environments. ReCALL, 25(2), 199–214. doi:10.1017/S0958344013000074
Linebarger, D., Piotrowski, J., & Greenwood, C. R. (2010). On-screen print: The role of
captions as a supplemental literacy tool. Journal of Research in Reading, 33(2),
148–167. doi:10.1111/j.1467-9817.2009.01407.x
Mayer, R. E. (2001). Multimedia learning. New York: Cambridge University Press.
Mayer, R. E. (2005). The Cambridge handbook of multimedia learning. New York, NY:
Cambridge University Press.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia
learning. Educational Psychologist, 38(1), 43–52. doi:10.1207/S15326985EP3801_6
COMPUTER ASSISTED LANGUAGE LEARNING 23
Millett, S., Quinn, E., & Nation, P. (2007). Asian and Pacific speed readings for ESL learners. Wellington: English Language Institute Occasional Publication.
Montero Perez, M., Van Den Noortgate, W., & Desmet, P. (2013). Captioned video for
L2 listening and vocabulary learning: A meta-analysis. System, 41(3), 720–739. doi:
10.1016/j.system.2013.07.013
Montero Perez, M., Peters, E., Clarebout, G., & Desmet, P. (2014). Effects of captioning on video comprehension and incidental vocabulary learning. Language Learning & Technology, 18(1), 118–141.
Montero Perez, M., Peters, E., & Desmet, P. (2018). Vocabulary learning through view-
ing video: the effect of two enhancement techniques. Computer Assisted Language
Learning, 31(1-2), 1–26. doi:10.1080/09588221.2017.1375960
Moreno, R., & Mayer, R. (1999). Cognitive principles of multimedia learning: the role of
modality and contiguity. Journal of Educational Psychology, 91(2), 358–368. doi:
10.1037/0022-0663.91.2.358
Moreno, R., & Mayer, R. E. (2002). Verbal redundancy in multimedia learning: When
reading helps listening. Journal of Educational Psychology, 94(1), 156–163. doi:
10.1037//0022-0663.94.1.156
Nation, I. S. P. (1990). Teaching and learning vocabulary. New York: Newbury House.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge
University Press.
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in
instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578. doi:
10.1093/applin/amp044
Paivio, A. (2007). Mind and its evolution: A dual coding theoretical approach. Mahwah,
NJ: Erlbaum.
Plass, J., & Jones, L. (2005). Multimedia learning in second language acquisition. In R.
Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 467–488). New
York: Cambridge University Press.
Schmidt, R. (2001). Attention. In P. Robinson (Ed.), Cognition and second language
instruction (pp. 3–32). Cambridge: Cambridge University Press.
Sydorenko, T. (2010). Modality of input and vocabulary acquisition. Language Learning
& Technology, 14(2), 50–73.
Tabbers, H. K., Martens, R. L., & van Merriënboer, J. J. G. (2004). Multimedia instructions and cognitive load theory: Effects of modality and cueing. British Journal of Educational Psychology, 74(1), 71–81.
VanPatten, B. (1990). Attending to form and content in the input. Studies in Second Language Acquisition, 12(3), 287–301. doi:10.1017/S0272263100009177
Winke, P., Gass, S., & Sydorenko, T. (2010). The effects of captioning videos used for
foreign language listening activities. Language Learning & Technology, 14(1), 65–86.
Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by
foreign language learners: An eye-tracking study. The Modern Language Journal,
97(1), 254–275. doi:10.1111/j.1540-4781.2013.01432.x
Yeldham, M. (2018). Viewing L2 captioned videos: What’s in it for the listener? Computer
Assisted Language Learning, 31(4), 367–389. doi:10.1080/09588221.2017.1406956