
The Journal of Experimental Education

ISSN: 0022-0973 (Print) 1940-0683 (Online) Journal homepage: http://www.tandfonline.com/loi/vjxe20

Investigating the Influence of Simultaneous- Versus Sequential-Text-Picture Presentation on Text-Picture Integration

Jana Arndt, Anne Schüler & Katharina Scheiter

To cite this article: Jana Arndt, Anne Schüler & Katharina Scheiter (2018): Investigating the
Influence of Simultaneous- Versus Sequential-Text-Picture Presentation on Text-Picture
Integration, The Journal of Experimental Education, DOI: 10.1080/00220973.2017.1363690

To link to this article: https://doi.org/10.1080/00220973.2017.1363690

Published online: 18 Jun 2018.



Investigating the Influence of Simultaneous- Versus Sequential-Text-Picture Presentation on Text-Picture Integration

Jana Arndt(a), Anne Schüler(a), and Katharina Scheiter(a,b)

(a) Leibniz-Institut für Wissensmedien, Tübingen, Germany; (b) University of Tübingen, Germany

ABSTRACT
This study investigated whether text-picture integration is facilitated when text and pictures are
presented simultaneously instead of sequentially. Participants memorized general and specific
sentences and pictures. It was expected that, due to text-picture integration, participants should
falsely recognize specific versions of the sentences and pictures even after having studied only their
general versions before. Sentences and pictures were presented simultaneously or sequentially with
the picture either preceding or following the corresponding sentence. Contrary to expectations,
text-picture integration was only observed for picture recognition and was not influenced by
temporal contiguity. Findings are explained by the use of simple materials that did not sufficiently
tax working-memory capacity to allow for any benefit of a simultaneous text-picture presentation
to occur.

KEYWORDS
Integration of text and pictures; mental representations; multimedia learning; sequential
presentation; temporal contiguity principle

PLENTY OF RESEARCH has shown that learners who receive text and pictures show a deeper under-
standing than learners who receive text alone (multimedia effect; see Anglin, Vaez, & Cunningham,
2004; Levie & Lentz, 1982; Mayer, 2009, for overviews). Within the cognitive theory of multimedia
learning (CTML; Mayer, 2009) this finding is explained as follows.
According to CTML, one crucial step in learning when text and pictures are presented is the integra-
tion of verbal and pictorial information into a coherent mental representation. In particular, it is
assumed that information selected from text or pictures is organized into respective verbal or pictorial
representations in working memory. In a next step, the verbal and pictorial representations are inte-
grated into a single coherent representation. This means that one-to-one correspondences or mappings
between the verbal and the pictorial representations—that is, between their elements, actions, and
causal relations—are identified and used to store the information in long-term memory (Mayer, 1997).
Studies investigating explicitly whether text-picture integration takes place are rare; rather, integration
is generally assumed to have taken place whenever better learning outcomes have occurred.
CTML further assumes that when text and pictures are presented in such a way that the delay
between text and picture processing is as brief as possible, integration and, hence, learning (temporal
contiguity effect) are fostered. This is because information from text and picture can then be kept and
processed in working memory at the same time, which in turn makes it easier for learners to integrate
the verbal and pictorial information (Mayer, 2009; see Ginns, 2006, for an overview). For example, in a
study from Mayer and Anderson (1991), participants learned from an animation that showed how a
tire pump works. The corresponding narration was presented either simultaneously with the animation
or sequentially (i.e., either before or after the animation). In the simultaneous group, the events
depicted in the animation were synchronized with the events described in the narration. For example,

CONTACT Anne Schüler a.schueler@iwm-tuebingen.de Leibniz-Institut für Wissensmedien, Schleichstraße 6, 72076
Tübingen, Germany.
© 2018 Taylor & Francis Group, LLC

when the narrator said, “The inlet valve opens,” the animation showed the inlet valve opening. Results
showed that learners in the simultaneous group obtained better learning results than learners in the
sequential groups (for similar results, see also Mayer & Anderson, 1992; Mayer & Sims, 1994). As a
possible moderator for the temporal contiguity effect, the length of presented segments has been dis-
cussed. For instance, Mayer, Moreno, Boire, and Vagge (1999) were unable to show a temporal contiguity
effect when small segments were presented successively, entailing many alternations between short
sentences and short animations. In sum, studies by Mayer and colleagues (Mayer & Anderson, 1991,
1992; Mayer & Sims, 1994) provide evidence that simultaneous text-picture presentation can lead to
better learning outcomes than sequential text-picture presentation if the sequential text-picture seg-
ments are not too short in length (Mayer & Moreno, 1999). However, what remains unclear from these
studies is whether it is indeed text-picture integration that causes better learning in the simultaneous
condition.
A more direct way to observe text-picture integration is offered by the eye-tracking method (Hyönä,
2010; van Gog & Scheiter, 2010). To study the process of text-picture integration during learning, look-
froms as well as integrative transitions have been analyzed. Look-froms describe the duration (fixation
times) for rereading text while reinspecting the picture (i.e., looks from text to picture) and reinspecting
the picture while rereading the text (i.e., looks from picture to text, see Mason, Pluchino, & Tornatora,
2013, 2015). Integrative transitions are the number of times participants’ eyes move from text to pic-
ture and vice versa. Indeed, results of some studies showed that look-froms as well as integrative transi-
tions can be interpreted as indicators of learners’ attempts to integrate information from text and
picture (Hannus & Hyönä, 1999; Hegarty & Just, 1993; Holsanova, Holmberg, & Holmqvist, 2009;
Johnson & Mayer, 2012; Mason, Tornatora, & Pluchino, 2013; Schmidt-Weigand, Kohnert, &
Glowalla, 2010). However, there are some limitations regarding this method. As Holsanova, Holmberg,
and Holmqvist (2009) argue, it is not clear whether the position of the eyes always corresponds to the
locus of cognitive processing. Thus, eye tracking is not necessarily a one-to-one reflection of the
cognitive processes and the mental representations resulting from them.
Another possibility to provide more-direct evidence regarding text-picture integration is to use
paradigms from cognitive psychology that distinguish whether performance is due to better memory
and understanding of verbal information, pictorial information, or their integration into one single
mental representation. This, in turn, allows for testing clear predictions with respect to text-picture
integration when learning with text and pictures. For example, using these paradigms should enable
researchers to test whether presenting text and pictures simultaneously instead of successively indeed
facilitates text-picture integration. One might argue that a disadvantage of using paradigms from cog-
nitive psychology is the fact that these paradigms normally use very simplified materials compared to
academic learning contexts (but see Rummer, Schweppe, Fürstenberg, Scheiter, & Zindler, 2011). How-
ever, Mayer (2005) claims that multimedia learning theories should have theoretical plausibility in that
they are “consistent with cognitive science principles of learning” (p. 32). Multimedia learning theories
need to be grounded in cognitive research to allow for valid conclusions regarding questions about
which instructional variants yield better learning and why they do so (Mayer, 2009). This can be done
best by using paradigms from cognitive psychology that allow for testing clear predictions regarding
what these cognitive processes associated with multimedia learning should look like. While theories
like CTML aim at providing guidelines for designing real-life educational materials, the rationale
underlying these guidelines is based on very basic processes including attention, memory, and mental
model construction. Thus, cognitive paradigms validating these processing assumptions are important
building blocks for theory development in education.
For example, Mammarella, Fairfield, and Di Domenico (2013) used one of these cognitive para-
digms to investigate text-picture integration. In their study, participants were instructed to memorize
black-and-white pictures of different objects, combined with written sentences indicating the color of
these objects. For example, a picture showing a plane was combined with the sentence “The plane is
blue.” The sentences and pictures were presented in 32 trials, each containing four sentences with their
associated pictures. After each trial, participants performed a yes-no recognition test in which colored
pictures of the objects (not presented during the memorizing phase) were presented. Participants were
instructed to recognize the information about the object (picture) and its color (sentence). For exam-
ple, when a yellow plane was presented in the recognition test, participants should have rejected the
item, whereas when a blue plane was presented, participants should have accepted the item.
Mammarella et al. assumed that participants had to integrate the information from the sentence and
the picture to correctly perform this task. Additionally, the authors tested whether presenting sentences
and pictures simultaneously led to better recognition performance than presenting text and pictures
sequentially. Results showed indeed that presenting sentences and pictures simultaneously instead of
sequentially led to better performance than presenting sentences and pictures sequentially. However, a
point of criticism is that it is only sufficient to remember the sentence information (e.g., about a yellow
plane) to correctly answer the picture recognition test. Therefore, it could be that participants used a
mental representation of the sentence to answer the picture recognition test and not—as expected by
Mammarella et al.—the integrated mental representation.
Another approach was introduced by Schüler, Arndt, and Scheiter (2015). They used a modified
paradigm from Gentner and Loftus (1979; see also Pezdek, 1977; Wippich, 1987) to investigate text-
picture integration. Participants were instructed to memorize a number of combinations of sentences
and pictures that differed with respect to their degree of specificity (general versus specific). Specific
sentences and pictures provided additional information that general sentences and pictures did not
contain. For example, a general picture showed a tower on a small island while a specific picture
showed a lighthouse on a small island. The corresponding general sentence was “There is only a tower
on the small island,” while the corresponding specific sentence was “There is only a lighthouse on the
small island.” After memorizing all sentence-picture combinations, participants answered a forced-
choice test of sentences in which they decided whether they had seen the general or specific version of
the sentences in the learning phase. Additionally, they answered a forced-choice test of pictures in
which they decided whether they had seen the general or specific version of the pictures. The depen-
dent variable was the frequency of choosing a specific version of the sentences and pictures in the two
forced-choice tests. In cases in which the generality or specificity of the picture and sentence matched,
sentences and pictures provided the same information (e.g., about a tower or a lighthouse), and so it
was expected that participants should have no problem in correctly rejecting or accepting the specific
version. However, in cases wherein sentences and pictures provided information at different levels of
specificity (i.e., general pictures/specific sentences or specific pictures/general sentences) text-picture
integration should become evident when testing in the modality in which the specific information had
not initially been presented (i.e., picture recognition when the sentence contained the specific informa-
tion). For instance, if students had seen the general picture about the tower combined with the specific
sentence about the lighthouse, the integrated representation should contain the information that the
tower is a lighthouse. Therefore, participants should more frequently falsely choose the specific picture
showing the lighthouse instead of the general picture showing the tower. Similarly, participants should
more frequently falsely choose the specific version of a sentence after having seen a specific picture/
general sentence combination. Schüler et al. (2015) were able to show in two studies that text-picture
integration took place at least with regard to the sentence–forced-choice test, when sentences and pic-
ture were presented simultaneously.
To summarize, within the paradigm used by Schüler et al. (2015), it is assumed that text-picture
integration manifests itself as a cross-modal influence because the specific information delivered
through one representation leads to a specific integrated representation that becomes evident in tests
presented in the other modality. Therefore, participants think that they have seen only specific infor-
mation. Text-picture integration was assumed to be observable when participants answer a test in the
modality in which the general but not the specific information had been presented earlier (cross-modal
testing).
The aim of the current study was to provide more-direct evidence for text-picture integration as the
underlying factor of temporal contiguity effects. As a within-subjects factor, we presented sentences
and pictures in four possible combinations (general picture/general sentence, general picture/specific
sentence, specific picture/general sentence, specific picture/specific sentence) resulting in four levels of
sentence-picture correspondence as shown in Table 1. The same sentence– and picture–forced-choice
tests as in the study by Schüler et al. (2015) were used.

Table 1. Correct and wrong answers with respect to sentence and picture recognition as a function of sentence-picture correspondence.

Sentence-picture combination          Condition           Choosing a specific picture   Choosing a specific sentence
general sentence/general picture      general match       wrong                         wrong
general sentence/specific picture     sentence mismatch   correct                       wrong
specific sentence/general picture     picture mismatch    wrong                         correct
specific sentence/specific picture    specific match      correct                       correct

For each forced-choice test, there were only
three conditions of particular interest, which were chosen for analysis. That is, for the sentence forced-
choice test those were the combinations general picture/general sentence (general match), specific
picture/general sentence (sentence-mismatch) and specific picture/specific sentence (specific match),
whereas for the picture forced-choice test general picture/general sentence (general match), general
picture/specific sentence (picture-mismatch) and specific picture/specific sentence (specific match)
were chosen. Thus, one of the mismatch conditions was always excluded from analysis, as it did
not allow for cross-modal testing (e.g., presenting a specific sentence paired with a general picture
did not allow for sentence-based cross-modal testing, because the sentence already contained
specific information). Regarding the forced-choice tests, it should be considered that choosing a
specific sentence or specific picture is correct only in the specific match condition, while it is a
mistake in the general match as well as in the respective mismatch (i.e., sentence-mismatch or
picture-mismatch) condition.
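The scoring logic of Table 1 can be condensed into a small sketch: whether choosing the specific version counts as correct depends only on whether the tested modality had carried the specific information during learning. The following Python helper is illustrative only; the function and variable names are not taken from the original materials.

```python
def is_specific_choice_correct(sentence, picture, tested_modality):
    """sentence/picture: 'general' or 'specific'; tested_modality: 'sentence' or 'picture'.

    Choosing the specific version in a forced-choice test is correct only if the
    tested modality itself contained the specific information during learning.
    """
    studied = sentence if tested_modality == "sentence" else picture
    return studied == "specific"

def condition_label(sentence, picture):
    """Return the condition name used in Table 1 for a sentence-picture pairing."""
    if sentence == picture:
        return f"{sentence} match"
    # the mismatch is named after the modality carrying the general version
    return "sentence mismatch" if sentence == "general" else "picture mismatch"
```

Under this logic, a specific choice in a mismatch condition is only "correct" in the modality that actually showed the specific version; a specific choice in the other modality is the false recognition taken as the signature of integration.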
As a between-subjects factor, we manipulated the presentation sequence in three conditions. In the
first condition, we presented sentence-picture combinations simultaneously. In the second condition,
we presented for each combination the sentence first and then the picture. In the third condition, we
presented for each combination the picture first and then the sentence.
The dependent measure was the frequency of choosing the specific sentence in the sentence–forced-
choice test and choosing the specific picture in the picture–forced-choice test (in percent).
In the following, we will describe the hypotheses in general for both forced-choice tests. Please note
that when we talk about the mismatch condition, we mean the sentence-mismatch condition regarding
the sentence–forced-choice test and the picture-mismatch condition regarding the picture–forced-
choice test. In line with Schüler et al. (2015), we expected that in the general match condition, general
information from the picture and the sentence are integrated with each other, resulting in a general
integrated representation. On the other hand, we expected that in the mismatch condition general
information from one representation and specific information from the other representation would be
integrated with each other. As a consequence, the resulting integrated representations should be more
specific in the mismatch condition compared with the integrated representations in the general match
condition. Consequently, participants should more often falsely choose the specific version in the mis-
match condition than in the general match condition. The comparison of the mismatch and the gen-
eral match conditions should also allow answering the question whether text-picture integration is
facilitated when sentences and pictures are presented simultaneously instead of sequentially. We
expected an interaction between sentence-picture correspondence and presentation format: The
difference between the general match and the mismatch condition should be larger when pictures and
sentences were presented simultaneously—suggesting stronger integration—in comparison with the
two conditions in which they were presented sequentially (picture-sentence and sentence-picture).
Furthermore, as a manipulation check, we compared performance in the specific match condition
(specific sentences/specific pictures) with performance in the general match condition. We expected
participants to choose the specific version more often in the specific match condition than in the gen-
eral match condition, because in the specific match condition participants were only presented with
specific information. This difference between the specific match condition and the general match con-
ditions should be of comparable size in the simultaneous condition and in the two sequential
conditions. This comparison served mainly as a manipulation check to verify that the specific informa-
tion had been perceived by participants.
To sum up, our main hypothesis was an ordinal interaction between presentation sequence (simulta-
neous, picture-text, text-picture) and sentence-picture correspondence (general match, mismatch, specific
match) for choosing specific sentences and for choosing specific pictures. Although we expected a
cross-modal testing effect with all three presentation sequences (i.e., a main effect of sentence-picture
correspondence, because participants will choose the specific version in the mismatch more often than in
the general match condition), this cross-modal testing effect should be larger with simultaneous presenta-
tion than with sequential text-picture presentation, resulting in the expected ordinal interaction.

Method
Participants and design
One hundred and two students with different majors from a German university participated in the study
for either payment or course credit. Due to technical problems, three participants had to be excluded
from data analysis. Thus, we considered the data of 99 participants (74 female, average age
M = 24.07 years, SD = 3.71). The 3 × 3 mixed design contained the between-subjects factor presentation
sequence (simultaneous vs. picture-sentence vs. sentence-picture) and the within-subjects factor sentence-
picture correspondence (general match vs. mismatch vs. specific match). Students were randomly assigned
to the presentation-sequence conditions with 33 participants in the simultaneous condition, 34
participants in the picture-sentence condition, and 32 participants in the sentence-picture condition.

Materials
The material consisted of 50 sentence-picture combinations compiled from the item pool of Schüler et
al. (2015). The pictures were simple line drawings and depicted different scenes or objects. The 50 sen-
tence-picture combinations were unrelated to each other in terms of their contents, meaning that 50
different scenes or objects were visualized and described. Within each combination, however, the
sentence and the picture referred to the same content.

Figure 1. Example of sentence-picture combinations as a function of sentence-picture correspondence and presentation sequence.

Table 2. Number of items in the different sentence-picture combinations occurring in the four sets.

Sentence-picture combination          Set A   Set B   Set C   Set D
general sentence/general picture        14      12      12      12
general sentence/specific picture       12      14      12      12
specific sentence/general picture       12      12      14      12
specific sentence/specific picture      12      12      12      14

Depending on condition, pictures contained either general or specific information. Specific pictures
were generated by adding more detail to the picture. For example, a general picture showed a house,
whereas the corresponding specific picture showed a terraced
house (see Figure 1). The sentences referred to the pictures. In the two sequential conditions, they were
presented in the middle of the screen before or after the pictures were presented. In the simultaneous
condition, they were presented below the pictures. Sentences consisted of four to nine words. General
and specific sentences were almost identical with the exception that the specific sentence contained a
noun that was more specific than the noun used in the general sentence (e.g., terraced house versus
house). Thereby, specific and general nouns described objects from the same category, but the specific
noun always described a more specific one. From the 50 sentences and pictures, four different material
sets were generated. These sets were parallel in that they contained the same sentences and pictures
except for the specification element per sentence/picture. For example, set A contained the general pic-
ture showing a house combined with the sentence “Next to the terraced house are parking places,”
while set B contained the same picture combined with the sentence “Next to the house are parking pla-
ces.” Each set contained all conditions (see Table 2 for the exact number of items in each set). Within
each set, items were presented randomly. Accordingly, every participant saw all sentence-picture com-
binations, but in different conditions.
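One way to realize the counts in Table 2 is a rotation scheme: split the 50 items into four blocks (14 + 12 + 12 + 12) and shift the condition assignment by one block per set. The paper reports only the resulting counts, so this reconstruction is an assumption; names and structure below are illustrative.

```python
CONDITIONS = [
    "general sentence/general picture",
    "general sentence/specific picture",
    "specific sentence/general picture",
    "specific sentence/specific picture",
]

def build_sets(item_ids, n_sets=4, big_block=14, small_block=12):
    """Rotate condition assignments across parallel material sets.

    Items are split into blocks of sizes 14, 12, 12, 12; within set k, block b
    receives condition (b + k) % 4. Each set realizes all four combinations,
    and every item appears in every condition across the four sets.
    Hypothetical reconstruction of the counterbalancing, not the original code.
    """
    sizes = [big_block] + [small_block] * (n_sets - 1)
    blocks, start = [], 0
    for size in sizes:
        blocks.append(item_ids[start:start + size])
        start += size
    sets = {}
    for k in range(n_sets):
        assignment = {}
        for b, block in enumerate(blocks):
            condition = CONDITIONS[(b + k) % n_sets]
            for item in block:
                assignment[item] = condition
        sets[chr(ord("A") + k)] = assignment
    return sets
```

With `build_sets(list(range(50)))`, set A assigns 14 items to the general/general combination and 12 to each of the others, set B shifts the 14-item block to the next combination, and so on, reproducing the diagonal pattern of Table 2.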

Measures
The measures comprised a forced-choice recognition test for pictures as well as a forced-choice recogni-
tion test for sentences. Each forced-choice test contained 50 items.
Picture– and sentence–forced-choice tests were constructed in the same manner. In both tests, for
each picture (or sentence), the general and specific versions were presented next to each other on one
slide. The order of general and specific pictures (or sentences) was randomized. Subjects had to decide
which one of the pictures (or sentences) they had seen during the learning phase.
The dependent variable was the frequency of choosing the specific picture or sentence in the picture
and sentence tests. It was scored in the following manner: Participants received one point for choosing
the specific version of the picture or sentence, resulting in a maximum score of 50 points. As already
mentioned, choosing the specific version is the correct choice only in the specific match condition. It is
a mistake in the general match condition and in the mismatch conditions (see Table 1). The scores
were converted into percentages for easier interpretation.
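The scoring just described amounts to counting specific choices per condition and expressing them as percentages. A minimal sketch, with illustrative names not taken from the original analysis code:

```python
def percent_specific(choices):
    """choices: list of booleans, True = participant chose the specific version.

    Returns the percentage of specific choices, i.e., the dependent variable.
    """
    if not choices:
        raise ValueError("no responses to score")
    return 100.0 * sum(choices) / len(choices)

def percent_specific_by_condition(responses):
    """responses: iterable of (condition, chose_specific) pairs.

    Groups responses by sentence-picture correspondence condition and
    returns the percentage of specific choices per condition.
    """
    totals, hits = {}, {}
    for condition, chose_specific in responses:
        totals[condition] = totals.get(condition, 0) + 1
        hits[condition] = hits.get(condition, 0) + int(chose_specific)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}
```

Note that a high percentage is desirable in the specific match condition but reflects false recognition in the general match and mismatch conditions.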

Procedure
The research was conducted in conformity with the guidelines of the ethics committee of the research
institute with written informed consent from all participants. Participants were tested in groups with a
maximum of six participants. In the beginning, each participant received an information sheet informing
them that text-picture combinations would be presented and that they would afterwards have to
answer some questions. Then, each participant was seated in front of a computer screen. Here, partici-
pants were instructed as follows: “Dear Participant! In the following, the learning phase will be presented.
The presentation will run automatically. The contents will be presented only once and for a limited time.
Please follow the presentation attentively and memorize the contents well. Press the space bar to start the
presentation!” After participants pressed the space bar, the learning phase started. In this phase, students
were assigned randomly to one of the three between-subject conditions (simultaneous, picture-sentence,
sentence-picture). In all three between-subject conditions, the 50 sentence-picture combinations were
presented as a function of sentence-picture correspondence (e.g., in set A: 14 general-picture/general-
sentence combinations, 12 general-picture/specific-sentence combinations, 12 specific-picture/general-sen-
tence combinations, 12 specific-picture/specific-sentence combinations; see Table 2). In the simultaneous
condition, each sentence-picture combination remained visible on the screen for 5 seconds. In the
picture-sentence and the sentence-picture conditions, each sentence and its corresponding picture were
presented sequentially while each of them remained visible on the screen for 2.5 seconds (see also
Mammarella, Fairfield, & Di Domenico, 2013). Thus, in the sequential condition, text and pictures were
presented in direct succession to one another. A 1-second black noise mask followed the presentation of
each sentence-picture pair. After the presentation of all 50 sentence-picture combinations, participants
completed a 30-minute unrelated task in which they all received identical materials about a natural phe-
nomenon. For these materials, participants rated their prior knowledge, the perceived difficulty of the
materials, and the degree of imagery they experienced when reading about the natural phenomenon.
After this task, participants answered the sentence and picture–forced-choice tests. No time restrictions
were given for the posttests. Finally, participants answered demographic questions about their age,
gender, and field of studies. A single experimental session took about one hour.
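The trial timing described above equates total exposure across conditions (5 s of simultaneous presentation versus 2.5 s + 2.5 s sequentially, each followed by a 1 s mask). A hedged sketch of that schedule, with illustrative names:

```python
def trial_schedule(sequence):
    """Return (stimulus, duration_in_seconds) events for one trial.

    'simultaneous': sentence and picture together for 5 s; the two sequential
    conditions show each element for 2.5 s in the stated order. Every trial
    ends with a 1 s noise mask. Illustrative reconstruction of the Procedure,
    not the original E-Prime script.
    """
    if sequence == "simultaneous":
        events = [("sentence+picture", 5.0)]
    elif sequence == "picture-sentence":
        events = [("picture", 2.5), ("sentence", 2.5)]
    elif sequence == "sentence-picture":
        events = [("sentence", 2.5), ("picture", 2.5)]
    else:
        raise ValueError(f"unknown presentation sequence: {sequence}")
    events.append(("mask", 1.0))
    return events
```

Summing the durations shows that every trial lasts 6 s regardless of presentation sequence, so the between-subjects manipulation changes only temporal contiguity, not study time.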

Apparatus
The stimuli and the forced-choice tests were presented on a laptop with a display resolution of
1280 × 800 pixels on a 15-inch screen. The sentence-picture combinations and the forced-choice tests
were presented using E-prime 2.0 Professional from Psychology Software Tools.

Results
For the sentence– and picture–forced-choice tests, two-factor mixed ANOVAs were conducted with
the within-subjects factor sentence-picture correspondence (general match, mismatch, specific match)
and the between-subjects factor presentation sequence (simultaneous versus picture-sentence versus
sentence-picture). The dependent variables were the relative frequency of choosing the specific picture
or specific sentence, respectively. Preliminary analyses showed that the assumption of homogeneity
of variance was met, whereas the normality assumption was violated in some of the cells. However, as
the F test can be assumed to be robust to violations of normality provided group sizes are equal
(Field, 2009), we decided to run the planned ANOVAs.
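The effect sizes d reported below can be computed from condition means and standard deviations. One common formula uses the pooled standard deviation, sketched here; note that several d variants exist for within-subjects contrasts (e.g., correcting for the correlation between measures), and the paper does not state which one was used, so the reported values need not match this formula exactly.

```python
import math

def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d for two conditions using the pooled standard deviation.

    Illustrative only: for repeated-measures contrasts, other d variants
    (e.g., dividing by the SD of the difference scores) may have been used.
    """
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_1 - mean_2) / pooled_sd
```

For example, `cohens_d(72.53, 16.20, 18.61, 15.06)` gives the standardized difference between the specific match and general match means of the sentence test under this pooled-SD convention.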

Figure 2. Means and standard errors for choosing the specific sentences as a function of sentence-picture correspondence and presen-
tation sequence. Please note that differences between the general match and the sentence-mismatch are of special interest, because
they indicate text-picture integration.

Figure 3. Means and standard errors for choosing the specific pictures as a function of sentence-picture correspondence and presenta-
tion sequence. Please note that differences between the general match and the picture-mismatch are of special interest because they
indicate text-picture integration.

With respect to the sentence–forced-choice test, Mauchly’s test indicated that the assumption of
sphericity had been violated, which is why degrees of freedom were corrected using Greenhouse-
Geisser estimates of sphericity. The expected interaction between sentence-picture correspondence
and presentation sequence was not significant, F(3.72, 178.63) = 1.19, MSE = 204.81, p = .32,
ηp² = .02 (see Figure 2), whereas the expected main effect of sentence-picture correspondence was,
F(1.86, 178.63) = 488.50, MSE = 204.81, p < .001, ηp² = .84. As expected, contrasts showed that par-
ticipants chose the specific sentence more often in the specific match condition (M = 72.53%, SD =
16.20) compared to the general match condition (M = 18.61%, SD = 15.06, p < .001, d = 2.46; see
Figure 2). However, contrary to our expectations, the general match and the sentence-mismatch condi-
tions did not differ from each other (p = .30). The main effect of presentation sequence (F < 1) was
not significant.
With respect to the pictorial–forced-choice test, Mauchly’s test indicated that the assumption of
sphericity had been violated, therefore degrees of freedom were corrected using Greenhouse-Geisser
estimates of sphericity. The expected interaction between sentence-picture correspondence and presen-
tation sequence did not reach statistical significance, F(3.64, 174.84) = 1.23, MSE = 255.40, p = .30,
ηp² = .03 (see Figure 3). The ANOVA revealed the expected significant main effect of sentence-picture
correspondence, F(1.82, 174.84) = 251.45, MSE = 255.40, p < .001, ηp² = .72. Contrasts showed that
participants chose the specific picture more often in the specific match condition (M = 68.17%, SD =
16.15) compared to the general match condition (M = 24.33%, SD = 15.81, p < .001, d = 1.78; see
Figure 3). Furthermore, in line with our assumptions, participants chose the specific picture more often
in the picture-mismatch condition than in the general match condition (p = .04, d = 0.21; see Figure 3,
picture-mismatch versus general match condition). This suggests that across all three conditions of
presentation sequence, participants integrated sentences and pictures with each other, leading them to
falsely recognize the specific picture in the picture-mismatch condition, in cases in which general pic-
tures had been presented along with specific sentences. The main effect of presentation sequence was
only marginally significant, F(2, 96) = 2.95, MSE = 317.00, p = .057, ηp² = .06.

Discussion
The purpose of the current study was to investigate whether text-picture integration is facilitated when
sentences and pictures are presented simultaneously instead of sequentially. We used a paradigm introduced by Schüler et al. (2015). In a presentation phase, we crossed general and specific pictures with
general and specific sentences, resulting in three different levels of sentence-picture correspondence for
analyses: general match, (sentence or picture) mismatch, and specific match. Additionally, we
manipulated the presentation sequence (simultaneous, picture-sentence, sentence-picture). Afterwards,
participants answered forced-choice recognition tests for sentences and pictures.
We assumed that text-picture integration should be observable when comparing a condition in
which two general versions were presented (general match) with conditions in which a general and a
specific version were presented (sentence or picture mismatch). This difference should be larger when
sentences and pictures were presented simultaneously rather than sequentially (picture-sentence or sentence-picture).
With regard to choosing the specific picture, results were in line with the assumption that text-
picture integration took place. Participants chose the specific picture more often when general pictures
were paired with specific sentences than when general pictures were paired with general sentences. This
result indicates that participants had integrated the specific sentence information with the general picture
information. Thus, this finding corroborates the central assumption of text-picture integration made by
CTML. However, no difference was observed for choosing the specific sentences—that is, when general
sentences were paired with specific pictures or general pictures. It is unclear why no evidence for text-picture integration was obtained for the sentence–forced-choice test, especially since Schüler et al. (2015)
found the reverse pattern (i.e., text-picture integration evident for the sentence–forced-choice test but not for
the picture–forced-choice test). In general, the evidence in favor of text-picture integration here and in the
studies of Schüler et al. (2015) is associated with small effect sizes, which implies that it can easily be jeopardized by even small variations in item and subject characteristics and situational factors. On the other
hand, the different patterns of results in our study and the study by Schüler et al. (2015) indicate that
humans do not simply recall one of the two modalities (sentences or pictures) better than the other, nor do
they generally recall specific information better than general information. If either of these assumptions
were true, the same pattern of results would be expected across all experiments. For example, if sentences
were generally recalled better than pictures, then participants should choose the specific picture
in the picture-mismatch condition (i.e., a specific sentence paired with a general
picture), which was the case in the reported study but not in the study by Schüler et al. (2015). If, on the
other hand, specific information were recalled better than general information, one would expect participants to always select the specific information, meaning that they should choose specific sentences
when presented with general sentences paired with specific pictures. This was not the case in
our study; thus, this explanation is also rather unlikely. In sum, the observed data patterns of the different
experiments speak in favor of an integrated mental model, which explains cross-modal testing effects for
sentence and picture recognition.
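The effect sizes discussed here are Cohen's d values for within-subjects contrasts. One common variant for paired data is d_z, the mean of the paired differences divided by their standard deviation; the sketch below illustrates only this variant, since the article does not state which formula was used.

```python
from statistics import mean, stdev

def cohens_dz(cond_a, cond_b):
    """Within-subjects effect size d_z: mean of the pairwise differences
    between two conditions divided by the standard deviation of those
    differences (sample standard deviation, n - 1 in the denominator)."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    return mean(diffs) / stdev(diffs)

# Example with hypothetical per-participant recognition scores (percent)
# for a mismatch condition versus a general match condition:
# cohens_dz([70, 65, 72, 68], [20, 25, 18, 30])
```

Because d_z scales the mean difference by the variability of the differences, small or unstable condition differences across participants translate directly into small effect sizes of the kind observed here.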
Furthermore, contrary to our expectation, temporal contiguity (i.e., presentation sequence) did not
influence text-picture integration. This result supports the assumption that text-picture integration can
also occur when sentences and pictures are presented sequentially instead of simultaneously. One pos-
sible explanation for this finding could lie in the fact that our materials differed in several aspects from
typical multimedia materials with which contiguity effects have normally been observed. The materials used in this experiment consisted of very simple line drawings combined with short sentences.
Furthermore, pictures were presented immediately after the sentences (or vice versa). Thus, it could be
that even participants with sequential presentation were able to hold the information from the first representation active in working memory when the second representation was presented. Therefore, participants might not have had any problems integrating the information from both representations even
with sequential presentation. It is possible that with a longer delay between sentence and picture pre-
sentation in the sequential conditions, integration would be hampered because in that case participants
would not be able to hold the information of the first representation active in working memory. In line
with this assumption, Mayer and Moreno (1999), who presented very short text segments alternately with animation segments, did not observe a temporal contiguity effect. Mayer (2009, pp. 164–166) explains
the absence of a temporal contiguity effect for short text segments by the fact that in this case there is
only a very short time lag between text and picture presentation, which allows learners to build connec-
tions between the two representations. On the other hand, with longer text segments learners should
suffer from a lack of temporal contiguity in a sequential presentation, because they have to reconstruct
the respective information from long-term memory.
Another explanation might be that the simplicity of the materials allowed participants to store and
retrieve the first information more easily from long-term memory, allowing them to integrate text-
picture information even with sequential presentation. Indeed, results of other studies have already
shown that one possible moderating variable of contiguity effects may be the complexity of the pre-
sented materials (e.g., Mayer & Moreno, 1999; Moreno & Mayer, 1999). In line with this assumption,
other studies have shown that the complexity of the multimedia learning materials can also influence
the appearance of other multimedia design effects. For example, the so-called modality effect seems to
appear especially with materials consisting of low-complexity sentence-picture combinations
(e.g., Rummer, Schweppe, Fürstenberg, Seufert, & Brünken, 2010) but not necessarily with more-complex learning materials (e.g., Schnotz, 2011; Schüler, Scheiter, Rummer, & Gerjets, 2012; Schüler,
Scheiter, & Schmidt-Weigand, 2011).
Limitations and future research
Based on the above-mentioned limitations, future research should investigate the temporal contiguity
effect with more-complex materials or with a longer time lag between sentences and pictures in the
sequential conditions. Both variations should make it difficult for participants receiving sequential pre-
sentations to hold the information presented first active in working memory and should make it harder
to retrieve the first presented information from long-term memory. Under these circumstances a
temporal contiguity effect would be expected.
Future research might also use eye-tracking technology to study attention allocation to sentences
and pictures within the different conditions. This would make it possible to rule out that sentences or pictures were insufficiently processed in the simultaneous condition. For the present experiment, however, we think this explanation is rather unlikely, because recognition performance did
not differ between sequential and simultaneous conditions and was around 70% (see specific match
condition in Figure 2 and Figure 3), indicating that learners processed sentences and pictures
sufficiently to identify them later.
A more general limitation of our study is that only university students were tested and that
the sample included a high proportion of females. Thus, we do not know whether the results would
hold true for other samples—for example, pupils in school or people without university educa-
tion. On the other hand, based on theory (e.g., CTML; Mayer, 2009), one would not expect differences between these samples, because the described processes are assumed to be a general part of human cognition.
Additionally, although we instructed participants to memorize materials, we do not know whether
they really did so or whether they simply “read” text and pictures. This might indeed be the case,
because learners were not explicitly informed about the posttest beforehand. Again, we think that the performance data speak against this assumption, because participants had a recognition rate of about 70%
in the specific match condition. Nonetheless, future studies might instruct participants more explicitly
beforehand and ask them afterwards what exactly they did when learning materials were presented.
As a last limitation, it is possible that no mental models were constructed due to the simple materials
used. We would argue that this assumption is unlikely for two reasons. The first reason is of a theoretical nature: As the assumptions regarding human cognitive processes made by CTML are based on findings from basic cognitive research (e.g., Baddeley's [1992] working memory model), where typically very
simple materials are used, one would expect that the assumed cognitive processes occur with simple
but also with complex materials. The second argument is of an empirical nature: The observed data
pattern for the pictorial forced-choice test can best be explained by the assumption that a mental model
was constructed. Thus, the fact that the current study and Schüler et al. (2015) observed cross-modal
testing effects underlines the assumption that mental models can also be constructed with simple
materials.
To sum up, the observed cross-modal influence for the picture–forced-choice test indicates that
text-picture integration can take place with simultaneous and with sequential text-picture presentation,
at least if materials allow learners to hold them active in working memory or to retrieve them easily from long-term memory. Thus, if simultaneous presentation of text and pictures is not possible—for example,
because of space limitations—designers may reduce the complexity of the presented materials, allowing
them to present text and pictures in a sequential order without detriment to learning. Further investi-
gation of this assumption should be the aim of future studies. For instance, more-complex materials
should be used to investigate whether the findings remain the same with more real-life academic learning materials.
To conclude, using paradigms from cognitive psychology seems in general to be a suitable method
to investigate cognitive processes in multimedia learning in more detail (see also Schüler et al., 2015;
Rummer, Schweppe, Fürstenberg, Scheiter, & Zindler, 2011). The paradigm deployed in the present study
seems useful, as evidenced by the findings of Schüler et al. (2015); however, it requires further refinement, and a better understanding of the causes of the small effect sizes associated with it should be a goal of future studies.
References
Anglin, G. J., Vaez, H., & Cunningham, K. (2004). Visual representations and learning: The role of static and animated
graphics. In D. Jonassen (Ed.), Handbook of research on educational communications and technology (2nd ed.,
pp. 865–916). Mahwah, NJ: Erlbaum.
Baddeley, A. D. (1992). Working memory. Science, 255, 556–559. doi:10.1126/science.1736359
Field, A. (2009). Discovering statistics using SPSS. London, England: Sage Publications Ltd.
Gentner, D., & Loftus, E. F. (1979). Integration of verbal and visual information as evidenced by distortions in picture
memory. American Journal of Psychology, 92, 363–375. doi:10.2307/1421930
Ginns, P. (2006). Integrating information: A meta-analysis of the spatial contiguity and temporal contiguity effects.
Learning and Instruction, 16, 511–525. doi:10.1016/j.learninstruc.2006.10.001
Hannus, M., & Hyönä, J. (1999). Utilization of illustrations during learning of science textbook passages among low- and
high-ability children. Contemporary Educational Psychology, 24, 95–123. doi:10.1006/ceps.1998.0987
Hegarty, M., & Just, M. A. (1993). Constructing mental models of machines from text and diagrams. Journal of Memory
and Language, 32, 717–742. doi:10.1006/jmla.1993.1036
Holsanova, J., Holmberg, N., & Holmqvist, K. (2009). Reading information graphics: The role of spatial contiguity and
dual attention guidance. Applied Cognitive Psychology, 23, 1215–1226. doi:10.1002/acp.1525
Hyönä, J. (2010). The use of eye movements in the study of multimedia learning. Learning and Instruction, 20, 172–176.
doi:10.1016/j.learninstruc.2009.02.013
Johnson, C. I., & Mayer, R. E. (2012). An eye movement analysis of the spatial contiguity effect in multimedia learning.
Journal of Experimental Psychology: Applied, 18, 178–191. doi:10.1037/a0026923
Levie, W. H., & Lentz, R. (1982). Effects of text illustrations: A review of research. Educational Communication and Tech-
nology Journal, 30, 195–232. doi:10.1007/BF02765184
Mammarella, N., Fairfield, B., & Di Domenico, A. (2013). When spatial and temporal contiguities help the integration in
working memory: “A multimedia learning” approach. Learning and Individual Differences, 24, 139–144. doi:10.1016/
j.lindif.2012.12.016
Mason, L., Pluchino, P., & Tornatora, M. C. (2013). Effects of picture labeling on science text processing and learning:
Evidence from eye movements. Reading Research Quarterly, 48, 199–214. doi:10.1002/rrq.41
Mason, L., Pluchino, P., & Tornatora, M. C. (2015). Eye-movement modeling of integrative reading of an illustrated text:
Effects on processing and learning. Contemporary Educational Psychology, 41, 172–187. doi:10.1016/j.
cedpsych.2015.01.004
Mason, L., Tornatora, M. C., & Pluchino, P. (2013). Do fourth graders integrate text and picture in processing and learn-
ing from an illustrated science text? Evidence from eye-movement patterns. Computers and Education, 60, 95–109.
doi:10.1016/j.compedu.2012.07.011
Mayer, R. E. (1997). Multimedia learning: Are we asking the right questions? Educational Psychologist, 32, 1–19.
doi:10.1207/s15326985ep3201_1
Mayer, R. E. (Ed.), (2005). The Cambridge handbook of multimedia learning. New York, NY: Cambridge University Press.
Mayer, R. E. (2009). Multimedia learning (2nd ed.). Cambridge, England: Cambridge University Press.
Mayer, R. E., & Anderson, R. B. (1991). Animations need narrations: An experimental test of a dual-coding hypothesis.
Journal of Educational Psychology, 83, 484–490. doi:10.1037/0022-0663.83.4.484
Mayer, R. E., & Anderson, R. B. (1992). The instructive animation: Helping students build connections between words
and pictures in multimedia learning. Journal of Educational Psychology, 84, 444–452. doi:10.1037/0022-0663.84.4.444
Mayer, R. E., Moreno, R., Boire, M., & Vagge, S. (1999). Maximizing constructivist learning from multimedia communi-
cations by minimizing cognitive load. Journal of Educational Psychology, 91, 638–643. doi:10.1037/0022-0663.91.4.638
Mayer, R. E., & Sims, V. K. (1994). For whom is a picture worth a thousand words? Extensions of a dual-coding theory of
multimedia learning. Journal of Educational Psychology, 86, 389–401. doi:10.1037/0022-0663.86.3.389
Moreno, R., & Mayer, R. E. (1999). Cognitive principles of multimedia learning: The role of modality and contiguity.
Journal of Educational Psychology, 91, 358–368. doi:10.1037/0022-0663.91.2.358
Pezdek, K. (1977). Cross-modality semantic integration of sentence and picture memory. Journal of Experimental Psy-
chology: Human Learning and Memory, 3, 515–524. doi:10.1037/0278-7393.3.5.515
Rummer, R., Schweppe, J., Fürstenberg, A., Scheiter, K., & Zindler, A. (2011). The perceptual basis of the modality effect
in multimedia learning. Journal of Experimental Psychology: Applied, 17, 159–173. doi:10.1037/a0023588
Rummer, R., Schweppe, J., Fürstenberg, A., Seufert, T., & Brünken, R. (2010). Working memory interference during
processing texts and pictures: Implications for the explanation of the modality effect. Applied Cognitive Psychology, 24,
164–176. doi:10.1002/acp.1546
Schmidt-Weigand, F., Kohnert, A., & Glowalla, U. (2010). Explaining the modality and contiguity effects: New insights
from investigating students’ viewing behaviour. Applied Cognitive Psychology, 24, 226–237. doi:10.1002/acp.1554
Schnotz, W. (2011). Colorful bouquets in multimedia research: A closer look at the modality effect. Zeitschrift für
Pädagogische Psychologie, 25, 269–276. doi:10.1024/1010-0652/a000055
Schüler, A., Arndt, J., & Scheiter, K. (2015). Processing multimedia material: Does integration of text and pictures result
in a single or two interconnected mental representations? Learning and Instruction, 35, 62–72. doi:10.1016/j.
learninstruc.2014.09.005
Schüler, A., Scheiter, K., Rummer, R., & Gerjets, P. (2012). Explaining the modality effect in multimedia learning: Is it due
to a lack of temporal contiguity with written text and pictures? Learning and Instruction, 22, 92–102. doi:10.1016/j.
learninstruc.2011.08.001
Schüler, A., Scheiter, K., & Schmidt-Weigand, F. (2011). Boundary conditions and constraints of the modality effect.
Zeitschrift für Pädagogische Psychologie, 25, 211–220. doi:10.1024/1010-0652/a000046
Van Gog, T., & Scheiter, K. (2010). Eye tracking as a tool to study and enhance multimedia learning. Learning and
Instruction, 20, 95–99. doi:10.1016/j.learninstruc.2009.02.009
Wippich, W. (1987). The integration of pictorial and verbal information (Untersuchungen zur Integration bildlicher und
sprachlicher Information). Sprache & Kognition, 6, 23–35. doi:10.1515/9783111371054.317