Yufen Hsieh
To cite this article: Yufen Hsieh (2019): Effects of video captioning on EFL vocabulary
learning and listening comprehension, Computer Assisted Language Learning, DOI:
10.1080/09588221.2019.1577898
ABSTRACT
This study investigated how video caption type affected vocabulary learning and listening comprehension of low-intermediate Chinese-speaking learners of English. Each video was presented twice with one of five caption types: (1) no caption (NC), (2) full caption with no audio (FCNA), (3) full caption (FC), (4) full caption with highlighted target-word (FCHTW), and (5) full caption with highlighted target-word and L1 gloss (FCL1), in which the gloss was presented simultaneously with the full caption. The results showed that caption type did affect vocabulary learning. FCL1 facilitated the learning of both word form and meaning in a multimedia listening activity. FCHTW increased attention to word form at the expense of word meaning. Videos with either captions only (FCNA) or audio only (NC) were not helpful for the learning of written words, indicating that presentation of verbal information through two modalities (audio plus text) was superior to single-modality presentation. While caption type had no impact on listening comprehension, concurrent presentation of video, audio, and captions did not overload the learners in the FC condition, suggesting that selective attention might be allocated to different parts of the visual stimuli during the first and second exposures to the videos. Additionally, the presence of highlighted words and glosses in the captioning line might direct learner attention to vocabulary rather than video content.

KEYWORDS
Video captioning; vocabulary learning; listening comprehension; EFL
1. Introduction
Audiovisual material provides contextualized visual images that could aid the
understanding of verbal input (Plass & Jones, 2005). A combination of
imagery and verbal information could make L2 input more comprehensible
and easily retrievable from memory as the activation of both verbal and non-
verbal systems results in better learning (Paivio, 2007). Furthermore, captions
could enhance audiovisual input by allowing learners to visualize what they
hear, especially when the material is slightly beyond their level of proficiency.
During the parsing of continuous acoustic messages, captions may aid the
segmentation of speech streams as well as form-meaning mapping of lexical
items (Yeldham, 2018). Thus, it is easier for learners to identify meaningful
units of speech and to recognize words when viewing videos.
However, the effectiveness of different types of captioning has
remained unclear. For example, some studies reported positive caption-
ing effects on comprehension of video content (e.g. Hsu, Hwang, Chang,
& Chang, 2013; Winke, Gass, & Sydorenko, 2010), but others did not
(e.g. Aldera & Mohsen, 2013; Montero Perez, Peters, Clarebout, &
Desmet, 2014). Other caption types, such as captions with a first-language (L1) gloss, have been less explored in previous research (see Montero Perez, Peters, & Desmet, 2018). It also remains inconclusive to what extent captions improve different aspects of vocabulary learning. For
instance, some studies reported positive captioning effects on both mean-
ing recognition and recall (e.g. Sydorenko, 2010; Winke et al., 2010),
whereas others found no effect on meaning recall (e.g. Montero Perez
et al., 2014). Furthermore, limited research has examined how learners
with L1 Chinese, a language with non-Roman scripts, utilize captions in
an alphabetic language like English (see Hsu et al., 2013). While previous
research has primarily involved learner groups whose L1 and L2 both
use Roman scripts, it is necessary to recognize orthographic impact on
word processing across languages (e.g. Aldera & Mohsen, 2013;
Sydorenko, 2010; Winke, Gass, & Sydorenko, 2010, 2013).
In light of these issues, the present research examined the effects of
caption type on listening comprehension and vocabulary learning of
Chinese-speaking learners of English. Two video clips were presented
with one of the five caption types: (1) no caption (NC), (2) full caption
with no audio (FCNA), (3) full caption (FC), (4) full caption with high-
lighted target-word (FCHTW), and (5) full caption with highlighted tar-
get-word and L1 gloss (FCL1).
2. Methods
2.1. Participants
The participants were 105 undergraduates who were native speakers of
Mandarin Chinese and had learned English at school as a foreign lan-
guage for at least nine years. They were at the low-intermediate level of
English proficiency based on their TOEIC (Test of English for International Communication) scores.
2.2. Material
The videos were two English animations with a single narrator, each
approximately 4–5 minutes or 600–700 words in length. The titles of the
videos were ‘The psychology behind irrational decisions’ and ‘Why do
some people have seasonal allergies?’ made by TED-Ed (Technology
Entertainment Design Education). The video clips were downloaded
from the VoiceTube website, and then captions and glosses were added
to the videos using Corel VideoStudio. Each video was created with five
versions: (1) NC, (2) FCNA, (3) FC, (4) FCHTW, and (5) FCL1. The
screen shots of the five caption types are displayed in Figure 1. In the
FCHTW condition, all the target words were underlined. In the FCL1
condition, L1 (Chinese) translations appeared below the underlined tar-
get words. The target words were the words unknown to the learners
prior to the experiment, and the selection of the target words is detailed
below. The five caption types varied in the amount of text, the salience
of the target words, and the access to word meaning. Compared to
FCNA, FC and FCHTW, FCL1 contained a larger amount of text (i.e.
caption plus L1 gloss) and provided direct access to word meaning. The
target words were more salient in FCHTW and FCL1 relative to the
other captioning conditions.
For the selection of the target words in the clips, the following procedure was adopted to assure that the words were unfamiliar to low-intermediate learners. The scripts of the two videos were also comparable in their difficulty levels (Table 1). Two important aspects of lexical complexity were lexical density, measured as the proportion of content words
in a text (Daller, van Hout, & Treffers-Daller, 2003), and lexical sophistica-
tion, measured by root type-token ratio (Guiraud, 1954, as cited in Daller
& Xue, 2007). Most common indices of syntactic complexity included
average sentence length, namely average words per sentence, and amount
of subordination, namely the percentage of subordinate clauses out of all
clauses (Housen, Kuiken, & Vedder, 2012; Norris & Ortega, 2009).
Experts’ opinions were also sought to confirm that the language use of the
videos was suitable for low-intermediate learners.
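The two lexical complexity measures named above are straightforward to compute. The sketch below implements lexical density (Daller, van Hout, & Treffers-Daller, 2003) and the Guiraud root type-token ratio; the word counts in the usage example are invented for illustration and are not the study's actual figures.

```python
import math

def lexical_density(content_words: int, total_words: int) -> float:
    """Lexical density: proportion of content words in a text."""
    return content_words / total_words

def guiraud_index(types: int, tokens: int) -> float:
    """Lexical sophistication via root type-token ratio
    (Guiraud): distinct word types divided by sqrt(tokens)."""
    return types / math.sqrt(tokens)

# Hypothetical counts for a 650-word video script with
# 320 content words and 280 distinct word types:
print(round(lexical_density(320, 650), 3))
print(round(guiraud_index(280, 650), 3))
```

Average sentence length and the percentage of subordinate clauses, the syntactic measures cited from Housen, Kuiken, and Vedder (2012), are similar ratio computations once clauses have been hand-coded.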
2.3. Instruments
2.3.1. Comprehension tests
There was a comprehension test after each video in order to measure the
comprehension of video content. Each test contained ten multiple-choice
questions about the main ideas and details of the animation.
2.4. Procedures
The study took place in a computer lab. The participants were first given instructions explaining the tasks involved in the experiment. Then, they watched each of the two videos ‘irrational decisions’ and ‘seasonal allergies’
twice. The two videos were presented with one of the five caption types: (1)
NC, (2) FCNA, (3) FC, (4) FCHTW, and (5) FCL1. After the second view-
ing of each video, the participants took a comprehension test and three
vocabulary tests. The whole experiment took around 40–50 minutes.
3. Results
The results of descriptive statistics for the listening comprehension and
vocabulary test scores in the five captioning conditions are presented in
Table 2.
The recall task required retrieval of word meaning from memory, while
the recognition task only involved form-meaning matching. The meaning
recall score was highest in the FCL1 condition (M = 11.38, SD = 2.31), followed by the FC condition (M = 7.19, SD = 2.93), whereas the scores were lowest in the FCHTW (M = 5.00, SD = 2.51), the NC (M = 5.05, SD = 2.01), and the FCNA conditions (M = 4.52, SD = 1.81).
As shown in Table 3, the one-way MANCOVA yielded a significant main effect of caption type on vocabulary learning after controlling for listening comprehension (F(12, 257) = 11.62, p < .001, Wilks' Λ = .32, η² = .32). The effect size statistics revealed a large effect of caption type on the three vocabulary test outcomes (η² = .238 on form recognition, η² = .589 on meaning recognition, η² = .564 on meaning recall).
Table 4 presents the post hoc Tukey test results of vocabulary test
scores across captioning conditions. For form recognition, post hoc comparisons showed that the FCL1, the FCHTW, and the FC groups scored significantly higher than the NC and the FCNA groups (ps < .01). The other pairwise comparisons did not achieve statistical significance (ps > .05). In other words, presenting new words with both audio and captions (two modalities) was more effective for form recognition than with either audio or captions alone (single modality).
As for meaning recognition and recall, post hoc analyses revealed that the FCL1 group significantly outperformed all the other groups (ps < .001). The FC group gained significantly higher scores than the FCHTW, the NC, and the FCNA groups (ps < .05). There was no significant difference among the FCHTW, the NC, and the FCNA groups (ps > .05). The findings demonstrated the effectiveness of FCL1 and FC
for enhancing meaning recognition and recall probably because the two
caption types visualized the auditory input and drew the participants’
attention to word meaning. FCL1 was particularly useful as it provided
direct access to word meaning. In contrast, while highlighting the target-
words facilitated form recognition, it appeared to hinder meaning recog-
nition and recall. The significantly lower scores in the FCHTW relative
to the FC condition suggested that the participants might have paid
more attention to word form than to meaning in the FCHTW condition.
Table 4. Post hoc comparisons of vocabulary test scores between captioning conditions.
Test type Comparison Mean difference Sig.
Form recognition FCL1-FCHTW .200 .803
FCL1-FC .843 .295
FCL1-NC 3.009 .000
FCL1-FCNA 3.374 .000
FCHTW-FC .643 .426
FCHTW-NC 2.810 .001
FCHTW-FCNA 3.174 .000
FC-NC 2.167 .008
FC-FCNA 2.531 .003
NC-FCNA .364 .653
Meaning recognition FCL1-FCHTW 6.205 .000
FCL1-FC 4.745 .000
FCL1-NC 6.395 .000
FCL1-FCNA 6.602 .000
FCHTW-FC –1.460 .029
FCHTW-NC .190 .772
FCHTW-FCNA .397 .550
FC-NC 1.651 .014
FC-FCNA 1.857 .007
NC-FCNA .207 .755
Meaning recall FCL1-FCHTW 6.260 .000
FCL1-FC 4.378 .000
FCL1-NC 6.213 .000
FCL1-FCNA 6.335 .000
FCHTW-FC –1.882 .007
FCHTW-NC –.048 .944
FCHTW-FCNA .074 .913
FC-NC 1.835 .008
FC-FCNA 1.957 .006
NC-FCNA .122 .858
FCL1, full caption with highlighted target-word and L1 gloss; FCHTW, full caption with highlighted target-
word; FC, full caption; NC, no caption; FCNA, full caption with no audio.
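Pairwise contrasts of the kind shown in Table 4 can be sketched with SciPy's Tukey HSD routine. The scores below are simulated: the group means loosely follow the descriptive statistics for meaning recall reported above, while the per-group sample size and standard deviations are assumptions, so the resulting p-values only illustrate the procedure.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
n = 21  # assumed learners per condition (the study had 105 in total)

# Simulated meaning-recall scores per caption condition
fcl1  = rng.normal(11.4, 2.3, n)
fc    = rng.normal(7.2, 2.9, n)
fchtw = rng.normal(5.0, 2.5, n)
nc    = rng.normal(5.1, 2.0, n)
fcna  = rng.normal(4.5, 1.8, n)

res = tukey_hsd(fcl1, fc, fchtw, nc, fcna)
# res.pvalue[i, j] holds the Tukey-adjusted p-value for groups i and j;
# index 0 is FCL1, 1 is FC, and so on in the order passed above.
print(round(res.pvalue[0, 1], 4))  # FCL1 vs. FC
```

`res.statistic` holds the corresponding matrix of mean differences, mirroring the "Mean difference" column of Table 4.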
4. Discussion
The findings revealed significant captioning effects on vocabulary gain,
including form recognition, meaning recognition, and meaning recall.
While caption type did not affect listening comprehension, concurrent presentation of video, audio, and captions did not overload the learners in the FC condition. Selective attention could help to reduce the load on visual working memory, and thus the benefit of verbal redundancy and the risk of cognitive overload could be reconciled.
In Hsu et al. (2013) and Baltova (1999), where captions were found beneficial for video comprehension relative to no captions, the research
design also reduced the overload that might result from split attention
between multiple visual inputs. Hsu et al. (2013) showed that, compared
to the no-caption group, the full-caption and the keyword-caption group
experienced larger learning gains in English listening comprehension
over a period of four weeks. The videos embedded with full English cap-
tions did not seem to cause cognitive overload for the elementary-level
EFL learners. One possible reason was that Hsu et al. allowed the learn-
ers to play, pause, and replay the videos for listening training. Thus, the
cognitive overload could be reduced as the learners had a sufficient
amount of time to process information from multiple sources. Baltova
(1999) provided captions for only about one-half of the script, which
contained key words for the most important content in the video.
Presumably keyword captions were easier to read given the lower textual
density, especially for the non-advanced learners, and thus created less
cognitive load as compared to full captions. The benefit of keyword cap-
tions could be maximized by the design of the ten short-answer compre-
hension questions that all addressed the main points in the video. In
contrast, the multiple-choice comprehension questions in the present
study tested the main ideas as well as specific details of the animations,
thus requiring more detailed understanding of the content.
The present study added further evidence that providing glosses for unknown words improved meaning acquisition of low-intermediate learners.
However, our findings were somewhat different from Montero Perez
et al.’s (2014) regarding captioning effects on meaning recognition.
Montero Perez et al. (2014) reported that the groups receiving full cap-
tioning with and without highlighted keywords gained similar scores on
meaning recognition and both scored higher than the no-captioning
group. In the current study, however, FCHTW led to significantly worse
performance in meaning recognition relative to FC. One possibility was
that there exists a competition between attention to form and attention
to meaning during language processing, especially for early-stage learners
(VanPatten, 1990). Since our participants (low-intermediate level) had
relatively lower proficiency than those in Montero Perez et al. (2014)
(high-intermediate level), they might have had greater difficulty in
attending to both form and meaning of new words. While the partici-
pants’ attention was mostly devoted to meaning extraction in the FC
condition, attention to the highlighted vocabulary in the FCHTW condi-
tion was likely to be at the expense of meaning processing, thus resulting
in low gains on meaning recognition and recall. Another possible factor
that could lead to the different outcomes in Montero Perez et al. (2014)
and the present study was the L1 background of the participants. While
Montero Perez et al.’s participants were Dutch-speaking learners of
English, whose L1 and L2 were both Germanic languages sharing the
same writing system, the present study involved Chinese-speaking learn-
ers of English, who had to read in an L2 with a typologically distinct
orthographic system. Compared to the participants in Montero Perez
et al., our non-advanced Chinese-speaking participants might have allo-
cated more cognitive resources to orthographic processing at the expense
of meaning processing, especially in the FCHTW condition.
In line with Baltova (1999) and Sydorenko (2010), which also involved
non-advanced learners, the present study revealed that for meaning recall
of written words video combined with both captions and audio (the FC
condition) was more beneficial than with either captions (the FCNA con-
dition) or audio (the NC condition). Concurrent presentation of spoken and written verbal input had an advantage over spoken-only and written-only presentations in a video learning context, at least for non-advanced
learners. As reported in the meta-analysis by Adesope and Nesbit (2012),
spoken-written presentation was beneficial especially for system-paced
learning material and low prior-knowledge learners. As the low-inter-
mediate learners in the present study were not allowed to rewind, pause,
or slow down the videos, receiving the same verbal information through
two sensory channels rather than one could help them recover from failure of either channel and thus enhance decoding of unknown words.
Based on the dual-processing theory of multimedia learning (Mayer,
2005), written information is represented in a visual processing system
and the corresponding spoken information in an auditory system. Since
the two processing systems are separate and draw from distinct cognitive
resources, learners can hold two kinds of verbal information in working
memory simultaneously without competition between them.
As for the comparison between the NC (video + audio) and the FCNA group (video + captions), who received verbal information in a single
modality, the NC group performed non-significantly better than the
FCNA group, contrary to the finding in Sydorenko (2010). As discussed
previously, the discrepancy could be attributed to the participants’ profi-
ciency levels. In Sydorenko’s study the beginning learners benefited from
captions more than audio in learning new words as they might have bet-
ter reading than listening skills. When learners progressed on the devel-
opment of an L2, they became less dependent on captions when listening
to multimedia material, as the case of the low-intermediate learners in
the current study (Leveridge & Yang, 2013; Yeldham, 2018).
5. Conclusions
Overall, the vocabulary test results demonstrated that learners could
benefit from multimedia material with a combination of captions, images
and audio as multimodality makes input accessible through different
channels. Concurrent presentations of the three kinds of information did
not lead to cognitive overload probably because the learners could select-
ively attend to different parts of the visual stimuli during the first and
second exposure to the videos.
Particularly, FCL1 enhanced all three aspects of vocabulary learning, suggesting that it could promote attention to formal and semantic
features of a word and reinforce form-meaning connections. Highlighted
target words without glosses (FCHTW), on the other hand, appeared to
inhibit meaning recognition and recall, as more attentional resources
might be allocated to word form than to meaning. Videos with either
captions (FCNA) or audio (NC) did not help vocabulary acquisition, at
least for non-advanced learners, indicating that presentation of verbal
information through two modalities (audio plus text) was superior to
single-modality presentation in a video learning context. Unlike caption-
ing effects on vocabulary learning, caption type had no impact on listen-
ing comprehension, though the FC group performed non-significantly
better than the other groups. It was possible that the highlights and glosses in the captioning line directed learner attention to vocabulary rather than to the video content.
Disclosure statement
No potential conflict of interest was reported by the author.
Notes
1. Millett, Quinn, and Nation (2007) suggested a 70% comprehension accuracy as an
acceptable level in L2 reading.
2. For the meaning recall tests, all of the participants’ responses were either exact
translations/synonyms or incorrect translations. No partial credit was given because
none of the responses were from the same semantic fields as the respective
target words.
Notes on contributor
Yufen Hsieh holds a Ph.D. in Linguistics from the University of Michigan, Ann Arbor.
Currently, she is an Assistant Professor in the Department of Applied Foreign
Languages, National Taiwan University of Science and Technology. Her research inter-
ests include second language learning and teaching as well as reading comprehension.
References
Adesope, O. O., & Nesbit, J. C. (2012). Verbal redundancy in multimedia learning envi-
ronments: A meta-analysis. Journal of Educational Psychology, 104(1), 250–263. doi:
10.1037/a0026147
Aldera, A. S., & Mohsen, M. A. (2013). Annotations in captioned animation: Effects on
vocabulary learning and listening skills. Computers & Education, 68, 60–75. doi:
10.1016/j.compedu.2013.04.018
Baltova, I. (1999). Multisensory language teaching in a multidimensional curriculum:
The use of authentic bimodal video in core French. The Canadian Modern Language
Review, 56(1), 32–48.
Bird, S. A., & Williams, J. N. (2002). The effect of bimodal input on implicit and explicit memory: An investigation into the benefits of within-language subtitling. Applied Psycholinguistics, 23(4), 509–533.
Daller, H., van Hout, R., & Treffers-Daller, J. (2003). Lexical richness in the spontaneous
speech of bilinguals. Applied Linguistics, 24(2), 197–222. doi:10.1093/applin/24.2.197
Daller, H., & Xue, H. (2007). Lexical richness and the oral proficiency of Chinese EFL
students. In H. Daller, J. Milton & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 93–115). Cambridge: Cambridge University Press.
Ellis, N. C. (2016). Salience, cognition, language complexity, and complex adaptive systems. Studies in Second Language Acquisition, 38(2), 341–351. doi:10.1017/S027226311600005X
Housen, A., Kuiken, F., & Vedder, I. (2012). Complexity, accuracy and fluency:
Definitions, measurement and research. In A. Housen, F. Kuiken, & I. Vedder (Eds.),
Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in
SLA (pp. 1–20). (Language Learning & Language Teaching; Vol. 32). Amsterdam:
John Benjamins Publishing Company.
Hsu, C.-K., Hwang, G.-J., Chang, Y.-T., & Chang, C.-K. (2013). Effects of video caption
modes on English listening comprehension and vocabulary acquisition using handheld
devices. Educational Technology & Society, 16(1), 403–414.
Leveridge, A. N., & Yang, J. C. (2013). Testing learner reliance on caption supports in second language listening comprehension multimedia environments. ReCALL, 25(2), 199–214. doi:10.1017/S0958344013000074
Linebarger, D., Piotrowski, J., & Greenwood, C. R. (2010). On-screen print: The role of
captions as a supplemental literacy tool. Journal of Research in Reading, 33(2),
148–167. doi:10.1111/j.1467-9817.2009.01407.x
Mayer, R. E. (2001). Multimedia learning. New York: Cambridge University Press.
Mayer, R. E. (2005). The Cambridge handbook of multimedia learning. New York, NY:
Cambridge University Press.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia
learning. Educational Psychologist, 38(1), 43–52. doi:10.1207/S15326985EP3801_6
COMPUTER ASSISTED LANGUAGE LEARNING 23
Millett, S., Quinn, E., & Nation, P. (2007). Asian and Pacific speed readings for ESL learners. Wellington: English Language Institute Occasional Publication.
Montero Perez, M., Van Den Noortgate, W., & Desmet, P. (2013). Captioned video for
L2 listening and vocabulary learning: A meta-analysis. System, 41(3), 720–739. doi:
10.1016/j.system.2013.07.013
Montero Perez, M., Peters, E., Clarebout, G., & Desmet, P. (2014). Effects of captioning on video comprehension and incidental vocabulary learning. Language Learning & Technology, 18(1), 118–141.
Montero Perez, M., Peters, E., & Desmet, P. (2018). Vocabulary learning through view-
ing video: the effect of two enhancement techniques. Computer Assisted Language
Learning, 31(1-2), 1–26. doi:10.1080/09588221.2017.1375960
Moreno, R., & Mayer, R. (1999). Cognitive principles of multimedia learning: the role of
modality and contiguity. Journal of Educational Psychology, 91(2), 358–368. doi:
10.1037/0022-0663.91.2.358
Moreno, R., & Mayer, R. E. (2002). Verbal redundancy in multimedia learning: When
reading helps listening. Journal of Educational Psychology, 94(1), 156–163. doi:
10.1037//0022-0663.94.1.156
Nation, I. S. P. (1990). Teaching and learning vocabulary. New York: Newbury House.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge
University Press.
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in
instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578. doi:
10.1093/applin/amp044
Paivio, A. (2007). Mind and its evolution: A dual coding theoretical approach. Mahwah,
NJ: Erlbaum.
Plass, J., & Jones, L. (2005). Multimedia learning in second language acquisition. In R.
Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 467–488). New
York: Cambridge University Press.
Schmidt, R. (2001). Attention. In P. Robinson (Ed.), Cognition and second language
instruction (pp. 3–32). Cambridge: Cambridge University Press.
Sydorenko, T. (2010). Modality of input and vocabulary acquisition. Language Learning
& Technology, 14(2), 50–73.
Tabbers, H. K., Martens, R. L., & van Merriënboer, J. J. G. (2004). Multimedia instructions and cognitive load theory: Effects of modality and cueing. British Journal of Educational Psychology, 74(1), 71–81.
VanPatten, B. (1990). Attending to form and content in the input. Studies in Second Language Acquisition, 12(3), 287–301. doi:10.1017/S0272263100009177
Winke, P., Gass, S., & Sydorenko, T. (2010). The effects of captioning videos used for
foreign language listening activities. Language Learning & Technology, 14(1), 65–86.
Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by
foreign language learners: An eye-tracking study. The Modern Language Journal,
97(1), 254–275. doi:10.1111/j.1540-4781.2013.01432.x
Yeldham, M. (2018). Viewing L2 captioned videos: What’s in it for the listener? Computer
Assisted Language Learning, 31(4), 367–389. doi:10.1080/09588221.2017.1406956