Corpus Linguistics and Linguistic Theory 2019; aop

Kyla McConnell* and Alice Blumenthal-Dramé


Effects of task and corpus-derived
association scores on the online
processing of collocations
https://doi.org/10.1515/cllt-2018-0030

Abstract: In the following self-paced reading study, we assess the cognitive
realism of six widely used corpus-derived measures of association strength
between words (collocated modifier–noun combinations like vast majority): MI,
MI3, Dice coefficient, T-score, Z-score, and log-likelihood. The ability of these
collocation metrics to predict reading times is tested against predictors of lexical
processing cost that are widely established in the psycholinguistic and usage-
based literature, respectively: forward/backward transition probability and
bigram frequency. In addition, the experiment includes the treatment variable
of task: it is split into two blocks which only differ in the format of interleaved
comprehension questions (multiple choice vs. typed free response). Results
show that the traditional corpus-linguistic metrics are outperformed by both
backward transition probability and bigram frequency. Moreover, the multiple-
choice condition elicits faster overall reading times than the typed condition,
and the two winning metrics show stronger facilitation on the critical word (i.e.
the noun in the bigrams) in the multiple-choice condition. In the typed condi-
tion, we find an effect that is weaker and, in the case of bigram frequency,
longer lasting, continuing into the first spillover word. We argue that insufficient
attention to task effects might have obscured the cognitive correlates of associa-
tion scores in earlier research.

Keywords: collocations, cognitive realism, association scores, task effects, self-paced reading

*Corresponding author: Kyla McConnell, Department of English, Albert-Ludwigs-Universität Freiburg, Freiburg im Breisgau, Germany; FRIAS (Freiburg Institute for Advanced Studies), Freiburg im Breisgau, Germany, E-mail: kyla.mcconnell@anglistik.uni-freiburg.de
Alice Blumenthal-Dramé, Department of English, Albert-Ludwigs-Universität Freiburg, Freiburg im Breisgau, Germany; FRIAS (Freiburg Institute for Advanced Studies), Freiburg im Breisgau, Germany, E-mail: alice.blumenthal@anglistik.uni-freiburg.de


1 Introduction
In language comprehension, words are not processed in isolation. Rather, it
has been suggested that sentence processing is influenced by prediction.
Among other things, expectations are driven by knowledge of typical patterns
in language use and contextual information (Kuperberg and Jaeger 2016).
Consider, for example, the bigram vast majority: where vast appears, there
is a strong likelihood that majority will follow. Words that co-occur more
frequently in language use than would be expected by mere chance (under
the null hypothesis that their component words are independent) are called
collocations, and their component words are referred to as collocates. In the
corpus-linguistic and psycholinguistic communities, different metrics have
been put forward to quantify association strengths between collocates.
Certain metrics have predominantly served as tools in theoretical linguistics,
lexicography, and applied linguistics, whereas others have mainly been used
to model probabilistic language processing in information-theoretic terms
(Evert 2009; Hale 2016). However, there is no consensus on which of these
metrics are the most cognitively realistic, since, to the best of our knowledge,
they have never been pitted against each other in any online processing
experiment.
In order to test the cognitive realism of multiple available measures, we
present an online behavioral experiment involving collocation reading. One
hundred and ten native speakers of English participated in a self-paced reading
study designed to assess the predictive power of competing collocation metrics
in terms of reading times for the second word in collocated modifier–noun
bigrams like vast majority. Six of the most common corpus-linguistic association
scores – MI, MI3, Dice coefficient, T-score, Z-score, and log-likelihood (presented
in Section 1.1) – were pitted against the three following measures:
1) (log-transformed) forward transition probability is an information-theoretic
metric that can be considered the current gold standard in psycholinguistics
(Smith and Levy 2013);
2) (log-transformed) backward transition probability measures the likelihood
of the first word given the second and thus specifically excludes the possi-
bility of anticipatory processing (McCauley and Christiansen 2017);
3) (log-transformed) bigram frequency is a measure insensitive to collocation
strength but which has emerged as the method of choice in recent cogni-
tive linguistic work of usage-based inspiration (Christiansen and Arnon
2017).


In comparing traditional corpus-based metrics with metrics derived from
transition probabilities, we bring corpus linguistics together with information-
theoretic models. Probabilistic models such as these are becoming increasingly
popular in the cognitive sciences in terms of modeling higher human cognition,
from causal learning to visual perception, motor action, and language processing
(Blumenthal-Dramé and Malaia 2018; Clark 2016; Hohwy 2013; Huang and Rao
2011). Until now, corpus linguistics and the probabilistic cognition framework
have been developing largely in parallel. A direct empirical comparison of their
metrics therefore seems natural and necessary.
By adding bigram frequency to the comparison, we contrast these two
approaches with a third strand of research: In recent experimental work of
usage-based inspiration, multi-word units, or chunks, have been suggested to
constitute the primitive building blocks of language. This approach casts doubt
on the premise shared by most frameworks that collocated bigrams are accessed
as two separate but associated units, rather assuming that the bigram is pro-
cessed as a single unit varying from other complex units in its frequency (Arnon
and Snider 2010; Christiansen and Arnon 2017; Conklin and Schmitt 2012;
Blumenthal-Dramé 2016a). Thus, many usage-based approaches would predict
(log-transformed) bigram frequency to be the most cognitively realistic predictor
of processing cost in our experiment.
In addition to probing the cognitive realism of different (corpus linguistic,
information-theoretic, and usage-based) metrics, the present experiment intro-
duces a task effect. More specifically, it involves two types of interleaved
comprehension questions (multiple-choice vs. typed free response). The compar-
ison between these two conditions in a within-subject counterbalanced design
aims at exploring whether minor alterations in task can affect the relationship
between collocation metrics and online reading.
The motivations behind the inclusion of this task difference are twofold:
First, it has been found that language comprehension can be affected by con-
textual and task-based pressures such as speed and cognitive load (Ito et al.
2018; Payne and Federmeier 2017). Second, we hypothesized that insufficient
attention to task effects might explain the lack of convergence between prior
studies probing the cognitive realism of corpus-derived association scores (for
more detail, see Section 1.1).

1.1 The cognitive realism of corpus-derived association scores


Corpus-derived association scores have first and foremost been developed for
the purposes of computational linguistics, lexicography, and applied linguistics


(e.g. for corpus-based dictionaries, translation resources, or automated language
processing systems) but have also fed into theoretical linguistic research under
the assumption that a word’s distribution in language use can offer an empirical
handle on its meaning (Evert 2009: 1,213). It must be emphasized from the outset
that these collocation metrics have not been designed with psycholinguistic
applications in mind. Evert (2009: 1,215) even explicitly states that they represent
“a statistical attraction between certain events and must not be confused with
psychological association (as e.g. in word association norms, which have no
direct connection to the statistical association between words […])”.
However, corpus data ideally capture both the language input that average
language users are exposed to and the output that a representative sample of
language speakers produce. It therefore stands to reason that quantitative gen-
eralizations across corpora must at least in some indirect way reflect the linguis-
tic knowledge in the minds of average language users (Blumenthal-Dramé 2012,
Blumenthal-Dramé 2016b). Another argument in support of the potential cogni-
tive realism of corpus-based association scores is that they are contingency
based. Contingencies between cues and outcomes have long been known to be
highly salient to humans, and it is well established that contingencies drive
associative learning processes across different domains (Gries and Ellis 2015).
The association scores whose cognitive realism is at issue in this paper
cover a spectrum of diverse yet widely established metrics. We compare the six
association scores that are readily computed on the British National Corpus
(BNC) CQP-Edition (Hoffmann 2008). All measures quantify the strength of
mutual attraction between words on the basis of corpus data, with higher scores
indicating stronger attraction, i.e. higher mutual expectation.
1) Log-likelihood is arguably the most trusted and mathematically sophisti-
cated metric, as it is claimed to be a reliable substitute for computation-
ally demanding measures such as Fisher's exact test (Evert 2009: 30);
2) Dice coefficient is similarly sophisticated, though it cannot show negative
associations between words (Evert 2009: 29);
3) Z-score is a simple measure that tests the null hypothesis that there is no
association between the two words (and thus does not reflect effect size); it is
generally used as a cutoff to identify collocations rather than to rank them;
4) T-score is a modification of Z-score designed to reduce Z-score’s low fre-
quency bias, which is especially noticeable in the case of collocations with
low expected co-occurrence (i.e. two low-frequency component words);
5) MI is a simple measure that compares observed to expected bigram fre-
quency. Due to the nature of its mathematical calculation, this metric also
has a strong low-frequency bias (i.e. bigrams containing two low-frequency
individual words are ranked as highly collocated);


6) MI3 is a variant of MI that cubes the observed frequency of co-occurrence to
combat MI's low-frequency bias.
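
By way of illustration, the following R sketch computes these six scores for a single bigram from its 2 × 2 contingency table, following the standard formulas surveyed in Evert (2009); the counts are invented for illustration, and the exact BNCweb implementation may differ in minor details.

```r
## Sketch of the six association scores for one bigram (w1, w2), from its
## 2 x 2 contingency table (cf. Evert 2009). Counts below are hypothetical.
assoc_scores <- function(f_bigram, f_w1, f_w2, N) {
  # Observed cell counts
  O11 <- f_bigram
  O   <- c(O11, f_w1 - O11, f_w2 - O11, N - f_w1 - f_w2 + O11)
  # Expected cell counts under the null hypothesis of independence
  E11 <- f_w1 * f_w2 / N
  E   <- c(E11, f_w1 * (N - f_w2) / N,
           (N - f_w1) * f_w2 / N, (N - f_w1) * (N - f_w2) / N)
  # Log-likelihood: cells with an observed count of 0 contribute nothing
  ll_terms <- ifelse(O > 0, O * log(O / E), 0)
  list(
    MI      = log2(O11 / E11),
    MI3     = log2(O11^3 / E11),
    T_score = (O11 - E11) / sqrt(O11),
    Z_score = (O11 - E11) / sqrt(E11),
    Dice    = 2 * O11 / (f_w1 + f_w2),
    log_lik = 2 * sum(ll_terms)
  )
}

## Hypothetical counts for a bigram like "vast majority" (illustrative only)
assoc_scores(f_bigram = 2000, f_w1 = 3500, f_w2 = 25000, N = 1e8)
```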

It is worth noting that all of these association scores are bidirectional. In other
words, the scores do not indicate whether the first word is more predictive of the
second word or vice versa. A more in-depth mathematical comparison of the
metrics lies beyond the scope of the present paper, but excellent overviews can
be found in Evert (2009), Wiechmann (2008), Gries and Ellis (2015), and
Levshina (2015: Ch. 10).
Despite the assumed naturalistic nature of corpora, evidence in support of the
cognitive realism of corpus-derived association scores has so far been surprisingly
scarce and inconsistent. For example, Dąbrowska
(2014) showed that popular measures of association strength (Z-score, T-score,
bigram frequency, and MI) did not predict the choices participants made when
presented with sets of five bigrams and asked to pick the bigram that sounded
the most familiar to them. These findings contrast, however, with a series of
reading experiments by Ellis, Simpson‐Vlach, and Maynard (2008) focusing on
different facets of language processing (e.g. recognizing correct forms, accessing
pronunciation, comprehending in context), which found that raw frequency and
MI do correlate with native speakers’ accuracy and fluency. Durrant and Doherty
(2010) similarly report a significant influence of corpus-based association
metrics (T-score and MI) on overtly primed lexical decisions, but only on pairs
categorically binned as strongly collocated based on very high T-score and MI
values. Collocations with moderate or low T-score and MI values did not show
any priming effect. In a masked priming condition, only very strong collocates
that were additionally psychological associates (i.e. received a high score from a
task where participants listed the first three words to come to their mind when
shown the prompt) showed a priming effect.
One potential explanation for the restricted cognitive plausibility of associa-
tion scores is their bidirectionality, which glosses over the fact that mutual
expectations between words need not be symmetric. For example, in the collo-
cation bonsai tree, bonsai arguably attracts tree much more strongly than vice
versa (Levshina 2015: 224). As a result, bidirectional measures such as those
commonly calculated on the BNC might be cognitively unrealistic in light of
incremental language processing, whereby the first word of a collocation
becomes available before the second word. If, as many psycholinguists assume,
lexical processing is strongly supported by anticipatory processes, “forward-
looking” measures should have more bearing on online processing than bidir-
ectional measures. However, lexical processing cost may also (partly or fully) be
modulated by backward integration difficulty (i.e. the difficulty of connecting


words back to the prior discourse) without the need for prediction (Smith and
Levy 2013: 309–311). If this is the case, “backward-looking” measures should be
superior to forward-looking or bidirectional measures (Gries 2013).
Directionality is addressed in the current experiment through the two infor-
mation-theoretic metrics: log-transformed forward transition probability
between the first and the second word of a bigram (logForwardTP), often
referred to as “surprisal”, and log-transformed backward transition probability
(logBackwardTP), which gauges the probability of the first word of the bigram
given the second word. Both measures are relatively simple in computation;
their calculation includes only dividing the raw frequency of the whole bigram
by the raw frequency of its first word (in the case of logForwardTP) or by the raw
frequency of the second word (in the case of logBackwardTP). It may therefore
come as a surprise that over the last few years, logForwardTP has emerged as
the psycho- and neurolinguistic gold standard for measuring processing load in
lexical processing; less predictable words (in terms of logForwardTP) are more
difficult to process and elicit a “surprise” response in the brain, as has been
shown in a range of neurolinguistic (Frank et al. 2015) and behavioral reading
experiments (Boston et al. 2008; Demberg and Keller 2008; Frank and Bod 2011;
Levy 2008; Smith and Levy 2013; Frank 2013; Hale 2016; Lowder et al. 2018).
Although logBackwardTP has not received the same amount of scholarly atten-
tion, both adults and children have also been shown to be sensitive to this
metric in language processing (McCauley and Christiansen 2017).
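
In formula terms, with f(·) denoting raw corpus frequency, the two measures described above reduce to

$$\mathrm{ForwardTP}(w_1 w_2) = \frac{f(w_1 w_2)}{f(w_1)}, \qquad \mathrm{BackwardTP}(w_1 w_2) = \frac{f(w_1 w_2)}{f(w_2)},$$

and both are log-transformed before entering the analyses (see Section 2.1).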
A crucial conceptual similarity between corpus-linguistic association scores
and (forward/backward) transition probabilities is that they treat individual words
as the primitive building blocks between which tightness of association is com-
puted. However, this approach has been cast into doubt by certain cognitive
linguists, who have argued that processing units in the mental lexicon need not
be coextensive with the primitive units posited in theoretical and descriptive
approaches to language. They have put forward the alternative hypothesis
that the primitive building blocks of the mental lexicon are complex, unanalyzed
n-grams (i.e. sequences of n adjacent words or morphemes like don’t have to
worry, I don’t know, a state of emergency, at the same time). This suggestion
follows from the usage-based assumption that every single string encountered
in natural language use, no matter how complex, leaves a trace in memory. In this
view, repeated exposure to a string strengthens its memory trace, thereby reinfor-
cing its status as a primitive cognitive unit, with lexemes and morphemes repre-
senting derivative phenomena arising from partial overlap between different
strings (Abbot-Smith and Tomasello 2006; Bybee and McClelland 2005; O’Grady
2008; Gurevich et al. 2010; Hay and Baayen 2005; Bybee 2010; Blumenthal-Dramé
2016a; Blumenthal-Dramé et al. 2017; Siyanova-Chanturia et al. 2017).


The view that n-grams, rather than words or morphemes, represent the
natural units of language processing has received support from an increasing
number of neuro- and psycholinguistic studies attesting to positive correlations
between processing load and log-transformed n-gram frequency1 (Caldwell-
Harris and Morris 2008; Jiang and Nekrasova 2007; Tremblay and Baayen
2009; Tremblay and Tucker 2011; Siyanova-Chanturia 2015; Blumenthal-Dramé
2016a; Tremblay et al. 2011; Jacobs et al. 2016; Bannard and Lieven 2012). These
studies have exploited different tasks (self-paced reading, decision tasks, mem-
ory tests), technologies (pen-and-paper tasks, eye-tracking, brain mapping
methods such as electroencephalography and functional magnetic resonance
imaging), and dependent variables (response times [RTs], response accuracies,
eye movement measures, measures of neural activity). At least some of these
studies have shown that their results are not attributable to transition probabil-
ities between morphemes or individual lexical frequencies (Arnon and Cohen
Priva 2013; Arnon and Snider 2010; Bannard 2006). Thus, we consider bigram
frequency in addition to the corpus-derived association scores and the direc-
tional transition probability measures.
In summary, the abovementioned failure of corpus-derived association
scores to yield consistent correlations with cognitive tasks might be due to
theoretical premises that are not cognitively realistic, including but not neces-
sarily limited to the assumption of bidirectionality and the idea that individual
words are the building blocks of the mental lexicon.

1.2 Effects of task demands on lexical processing

Another potential explanation for the lack of convergence between psycholinguistic
collocation studies so far is that collocation processing may be modu-
lated by task demands. Effects of task on lexical processing have been
documented in a range of recent experiments. For example, in a self-paced
reading experiment by Hintz et al. (2016), processing advantages for highly
predictable lexical items only emerged in experimental contexts encouraging

1 In psycholinguistic research, raw frequencies are usually log-transformed, for two major
reasons: First, the usage frequencies of linguistic units follow a Zipfian distribution and log-
transformation reduces the risk of overly influential outlier stimuli exerting distorting effects
(see Levshina 2015: 64; Baayen 2008). Second, the relationship between raw frequency and
mental entrenchment is not one-to-one but follows a logarithmic scale. In other words, smaller
frequencies have a higher cognitive impact than higher frequencies, and the frequency differ-
ence between 1 and 10 is cognitively more relevant than the difference between 10,000 and
10,010 (Ellis 2002; Smith and Levy 2013).


prediction: that is, when self-paced reading trials were interleaved with trials in
which participants had to name pictures that were made more or less predictable
from auditorily presented sentence onsets (see also Gollan et al. 2011). By con-
trast, no effects of lexical predictability emerged when the naming task was kept
separate from the self-paced reading task.
An event-related potentials study by Wlotko and Federmeier (2015) showed
that when everything else was held constant, longer exposure times to words in
online reading boosted predictive processing, as suggested by increased seman-
tic integration problems for unexpected words. However, this effect was modu-
lated by the order of experimental conditions: when an experimental block
featuring long word exposure times (500 ms per word) preceded an experimental
block with shorter exposure times (250 ms), effects of semantic predictability
were found in both blocks. By contrast, when the order of experimental blocks
was reversed, no effects of semantic predictability were found in the block
involving speeded presentation times.
It is important to highlight that the two studies named above do not warrant
direct extrapolation to collocation processing, since they gauge lexical predic-
tivity in terms of cloze probability rather than corpus-derived association
strength. However, like studies on collocation processing, they are concerned
with the extent to which associative knowledge tied to lexical items is capita-
lized on in language processing.
We therefore decided to explore whether such task effects extend to the
processing of corpus-extracted collocations. More precisely, we decided to set
the stakes high and to test whether a minimal difference in task – answering
multiple-choice questions versus free response questions – can affect language
users’ online processing of collocations in self-paced reading. Multiple-choice
questions require subjects to choose a response from a set of provided options,
whereas free response formats require subjects to reconstruct knowledge. In L1
reading, free response formats are known to be significantly more difficult than
multiple-choice formats (measured, e.g. in terms of test performance in reading
tasks) (In’nami and Koizumi 2009; Rodriguez 2006). We therefore expected
word-by-word reading times to be higher in the free response condition than
in the multiple-choice condition.
However, we were agnostic as to potential effects of task differences on
collocation processing. On the one hand, increased word-by-word reading
times might be expected to boost associative processing, if they index an
increased involvement of the distributional knowledge stored along with
entries in the mental lexicon (see Wlotko and Federmeier 2015, above). On
the other hand, a stronger focus on individual words may disrupt the


conceptual unity of collocations and therefore elicit weaker effects of association
strength.
However, it must be pointed out that to the best of our knowledge, the
precise cognitive mechanisms underlying the difference between multiple-
choice and free response tasks (e.g. differential attention, memory, strategic
processes) have not been fully elucidated. As a result, we could not entirely
dismiss the possibility that higher reading times might have no effect on collo-
cation processing at all, if they reflect mental processes that have no bearing on
the retrieval of linguistic units from the mental lexicon (e.g. mind wandering,
memorizing, trying to guess an upcoming question, etc.).

2 Experimental design

2.1 Stimuli

Ninety-one critical bigrams were gathered from everyday conversations, personal
intuition, various English language learning websites, and previous studies
from diverse topics in psycholinguistics (Aijmer and Altenberg 2014; Deuter et al.
2002; Howarth 1998; Martyńska 2004; Biskup 1992; Siyanova and Schmitt 2008).
Each bigram was a modifier–noun sequence obeying the following constraints:
The modifier was either an attributive adjective or a (past or present) participle
in attributive adjectival use. Both the modifier and the noun consisted of one
lexical morpheme and up to two derivational or inflectional morphemes. This
means that modifiers like beautiful, incredible, and refreshing were eligible,
whereas modifiers like single-parent or cold-blooded were not. Likewise, nouns
like stranger or reality were eligible, while nouns like headache or sunshine were
not. This morphological constraint was set up because compounds like sunshine
can be argued to represent collocations at the morphological level, and the
interaction between lexical-level and morphological-level collocations was
beyond the scope of the present study. Nouns with regional spellings (splendor,
generalization, etc.) were avoided, since they represent a further potential source
of variability that was not the topic of the present experiment. All modifier–noun
sequences were attested at least once in the BNC, and they exhibited a higher
observed than expected frequency under the null hypothesis that their respec-
tive component words are independent. (This information was retrieved from the
BNCweb, see below.)
It is important to highlight that semantic compositionality was not strictly
controlled for in stimulus design. This empirical decision was taken for the


following reasons: First, meaning is not directly observable in corpora, and we
wanted to proceed in a maximally empirical fashion. In particular, we wanted to
abstain from making any commitments as to issues of ongoing theoretical
debate (e.g. on where to draw the line between compositional and non-compositional
bigrams, how many subtypes of compositionality should be posited, etc. [see
Croft 2001: 15]). Second, we wanted to generate findings that generalize beyond
specific subtypes of collocations, since our research question was a highly
general one: What can quantitative generalizations across corpus data tell us
about language processing? We therefore accepted 12 items that can be categor-
ized as exhibiting established figurative meanings in our sample (e.g. helping
hand, heated debate, crushing defeat).
Once the critical bigrams had been selected, they were inserted into sen-
tences. Sentence onsets were designed to be maximally neutral as to the upcom-
ing lexical material (e.g. Ty commented on the…., Sam recalled the…., Brooke
gossiped about the…., Clarissa observed the….). The bigrams functioned as phrase
constituents playing different syntactic roles across the sentences (e.g. Amber
enjoyed a refreshing beverage…., Tarek’s firm conviction…., Sebastian reminded
them of the harrowing moment….). All bigrams were followed by three words
until the sentence offset. This so-called spillover region was held constant to
explore whether potential effects of collocation strength carry over from the
critical noun to the following words.
All association scores were extracted directly from the BNC, accessed via the
online CQP-Edition (Hoffmann 2008). For each bigram, this was done in the
following way: First, the noun was searched as a lemma, i.e., as a headword
restricted to the part of speech “noun” (e.g. {right/N}) within the whole BNC. The
noun, rather than the modifier, was set as the node word to obtain collocation
information subsuming a maximum of inflectional forms of a bigram (e.g. close
friend, close friends, close friend’s, close friends’). After selecting the “colloca-
tions” option from the drop-down menu, the default BNC collocation settings
were accepted. The collocational window span was then set to one word to the
left of the noun, the modifier was typed into the “specific collocate” window,
and the POS (part-of-speech) tag for adjectives was selected. In this way, all
available association scores (MI, MI3, T-score, Z-score, log-likelihood, and Dice
coefficient) and the raw bigram frequency were extracted from the BNC. In
addition, ForwardTP was computed by dividing the raw frequency of the bigram
by the raw frequency of its modifier in adjectival use (found by searching the
whole BNC, e.g.{cosmetic/ADJ}). Conversely, BackwardTP was computed by
dividing the raw bigram frequency by the raw frequency of the noun. In com-
pliance with common psycholinguistic practice, raw bigram frequencies and


transition probabilities were log-transformed using the natural logarithm. All
experimental sentences and collocation metrics are presented in Table A1.
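
As an illustration of this step, the following R sketch derives the two transition probabilities and the log-transformed bigram frequency from raw counts; the data frame, column names, and counts are invented for illustration and do not reproduce the actual BNC values.

```r
## Illustrative computation of the transition probabilities and log bigram
## frequency from raw counts (all counts here are hypothetical).
bigrams <- data.frame(
  modifier      = c("vast", "close"),
  noun          = c("majority", "friend"),
  freq_bigram   = c(2000, 1500),   # bigram frequency (hypothetical)
  freq_modifier = c(3500, 12000),  # modifier in adjectival use, whole BNC (hypothetical)
  freq_noun     = c(25000, 30000)  # noun lemma frequency (hypothetical)
)

bigrams$logBigramFreq <- log(bigrams$freq_bigram)                          # natural log
bigrams$logForwardTP  <- log(bigrams$freq_bigram / bigrams$freq_modifier)  # P(noun | modifier)
bigrams$logBackwardTP <- log(bigrams$freq_bigram / bigrams$freq_noun)      # P(modifier | noun)
```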
In the following, for the sake of simplicity, all these metrics will be subsumed
under the term "collocation metrics" (the notion of "association strength"
would be too specific to cover bigram frequency, which gauges collocational
entrenchment without quantifying the degree of attraction between lexemes). As
can be seen from Figure 1, strong positive Spearman correlations hold between
almost all collocation metrics, with the strongest correlation being between
T-score and log-transformed bigram frequency (henceforth: logBigramFreq),
which produce an identical ranking.

Figure 1: Spearman correlation matrix between all collocation metrics tested in this study as
well as log-transformed frequencies for modifiers and nouns (henceforth: logModifierFreq and
logNounFreq) displayed using the R package corrplot version 0.84 (Wei and Simko 2017).
logForwardTP: Log-transformed forward transition probability; logBackwardTP: log-transformed
backward transition probability.
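
A matrix of this kind can be produced along the following lines; this is a sketch in which the data frame metrics is a simulated stand-in for the actual item table, not the study's data.

```r
## Sketch: Spearman correlations between all metrics, plotted with corrplot.
library(corrplot)  # version 0.84 (Wei and Simko 2017)

set.seed(1)  # toy stand-in for the item table (91 bigrams, one column per metric)
metrics <- as.data.frame(matrix(rnorm(91 * 4), ncol = 4,
                                dimnames = list(NULL, c("MI", "T_score",
                                                        "logBigramFreq", "logBackwardTP"))))

cor_mat <- cor(metrics, method = "spearman", use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", tl.col = "black", addCoef.col = "black")
```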


2.2 Participants

One hundred and twenty-three adult native speakers of English located in the
United Kingdom, Ireland, the United States, Canada, and Australia were recruited
via Prolific (http://www.prolific.ac/), a crowdsourcing platform, and paid for their
participation in the experiment. Of these, 13 were not considered in the analysis,
for different reasons: 9 participants did not meet the accuracy threshold of 80% in
the comprehension questions, 3 reported being early bilinguals, and 1 participant
attempted to complete the experiment on a mobile device. The remaining 110
participants ranged in age from 18 to 70 (median: 31, mean 34) and were 54%
female. Participants also self-reported education level (35% high school diploma
or less, 54% bachelor’s or associate degree, 11% master’s or higher).

2.3 Experimental design


Participants completed a word-by-word self-paced reading experiment presented
online via IbexFarm, a JavaScript-based host (Drummond 2016). Each version
included 248 sentences, 91 of which contained a critical bigram and 157 of which
were unrelated filler sentences. Sentences were presented in the monospace font
“Courier New”.
Comprehension questions followed 34.5% of all sentences (approximately one
third). These questions were designed in
such a way as to encourage sentence processing and memorization at all linguis-
tic levels. Thus, certain questions relied on superficial orthographic recognition
(e.g. questions requiring name recall), whereas others focused on semantic pro-
cessing (e.g. questions involving synonyms to lexemes in the preceding sentence),
while still others necessitated deep syntactic processing (e.g. questions featuring
argument structure alternations relative to the target sentence).
The experiment was split into two blocks, which differed only in the
required answer format to the comprehension questions: In the multiple-choice
block, questions appeared with three possible answers, from which the participant
chose using the number keys (1, 2, and 3 on the keyboard).
By contrast, in the free response block, participants were asked to type short
answers (responses of one to three words on average).
The experiment came in 16 pseudo-randomized versions, each with 2 orders,
depending on which block came first. Each participant saw each bigram once.
Across participants, each bigram occurred equally often in each block.
Participation in the experiment took around 45 min. Practice sentences and
questions that were unrelated to the critical conditions preceded each block.


Participants were instructed that they would be reading sentences one word at
a time and that pressing the spacebar would advance to the next word. They were
asked to read as normally as possible, not to rush, and to answer comprehension
questions as accurately as possible. They were also informed that they would not
receive feedback on their answers. In the free response block, they were told not to
worry about typing speed, capitalization, punctuation, or other typing errors.

3 Statistical analysis and results

3.1 Statistical method

Linear mixed-effects models were fitted to log-transformed RTs with the package
lme4 in the statistical software R (version 3.5.1) (Bates et al. 2015).
The numerical predictor variables in all models were centered. p-values for fixed
effects were calculated by means of the R-package lmerTest (Kuznetsova et al.
2017). Model comparison between non-nested models (all fitted to the same data
points) was always performed by comparing AIC (Akaike information criterion)
values; model comparison for nested models was done using the anova() func-
tion. For all reported models, model assumptions were checked on the basis of
diagnostic plots. A detailed script containing the entire analysis and the outcome
of all models is available as a supplementary file.
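
The general workflow can be sketched as follows; the simulated data are merely a stand-in for the experimental data and are not meant to reproduce any reported result.

```r
## Sketch of the modelling workflow (simulated stand-in data).
library(lme4)      # mixed-effects models (Bates et al. 2015)
library(lmerTest)  # p-values for fixed effects (Kuznetsova et al. 2017)

set.seed(1)
d <- data.frame(
  participant   = factor(rep(1:20, each = 30)),
  word          = factor(rep(1:30, times = 20)),
  logBigramFreq = rnorm(600),
  log_RT        = rnorm(600, mean = 6, sd = 0.3)
)
d$logBigramFreq_c <- as.numeric(scale(d$logBigramFreq, scale = FALSE))  # centering

m0 <- lmer(log_RT ~ 1               + (1 | participant) + (1 | word), data = d)
m1 <- lmer(log_RT ~ logBigramFreq_c + (1 | participant) + (1 | word), data = d)

summary(m1)    # lmerTest adds Satterthwaite-based p-values for the fixed effects
AIC(m0, m1)    # comparison by AIC (used here for non-nested models on the same data)
anova(m0, m1)  # nested models: likelihood-ratio test via anova()
```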

3.2 Preprocessing
Following standard psycholinguistic practice, raw RTs under 100 ms and above
2,000 ms were excluded as outliers. Moreover, log-transformed RTs falling out-
side of three standard deviations from each subject’s mean were rejected. This
resulted in 1.85% data loss. A baseline model was fitted to the remaining log-
transformed RTs. This model contained SENTENCE NUMBER and WORD LENGTH as
fixed effects and a by-SUBJECT intercept as a random effect. The rationale behind
fitting such a baseline model was to correct the response variable for effects
which are known to modulate RTs in self-paced reading experiments but are
independent of the experimental manipulation (Linzen and Jaeger 2015). The
per-word residual RTs of the baseline model are the corrected RTs that were used
for further analysis.2
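
A sketch of these preprocessing steps, again using simulated data as a stand-in for the raw self-paced reading output:

```r
## Sketch of the preprocessing pipeline (simulated stand-in data).
library(lme4)

set.seed(1)
raw_rts <- data.frame(
  participant     = factor(rep(1:10, each = 60)),
  sentence_number = rep(1:60, times = 10),
  word_length     = sample(2:12, 600, replace = TRUE),
  RT              = round(rlnorm(600, meanlog = 5.9, sdlog = 0.35))  # ms
)

# Absolute cutoffs: drop raw RTs under 100 ms or above 2,000 ms
clean <- subset(raw_rts, RT >= 100 & RT <= 2000)
clean$log_RT <- log(clean$RT)

# Per-subject criterion: drop log RTs beyond 3 SDs from the subject's mean
keep  <- ave(clean$log_RT, clean$participant,
             FUN = function(x) abs(x - mean(x)) < 3 * sd(x))
clean <- clean[keep == 1, ]

# Baseline model: correct RTs for sentence number and word length
baseline <- lmer(log_RT ~ sentence_number + word_length + (1 | participant),
                 data = clean)
clean$corrected_RT <- resid(baseline)  # per-word residuals = corrected RTs
```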

2 To ensure that our results were not an artifact of residualizing (Wurm and Fisicaro 2014), we
conducted a follow-up analysis in which non-residualized log-transformed RTs were used in
both winning models presented in Tables 4 and 5, with the addition of word length and
(centered) trial order as fixed effects. The results from these models showed the same significant
patterns as the models with residualized RTs. See Section 1.6 of the Supplementary Material for
more detail.


3.3 Results

Our experiment aimed to compare the predictive power of nine competing
collocation metrics, as well as to explore effects of the factor CONDITION (multi-
ple choice vs. typed free response) on self-paced reading times.
Model fitting comprised the following steps: First, nine separate regression
models (one per COLLOCATION METRIC) were fitted to the corrected RTs for the
critical words.3 All of these models followed the same template: lmer(corrected_RT ~ COLLOCATIONMETRIC + (1|PARTICIPANT) + (1|WORD)).
Out of these nine models, only those jointly satisfying the following require-
ments were retained for further analysis:
(1) The coefficient for the collocation metric passes the Bonferroni-corrected
significance level of 0.005 at the reference level (i.e. the critical word in the
non-typed condition).
(2) The effect of the collocation metric points in the expected direction (i.e. a
stronger association speeds up RTs), or, in other words, the effect of the
collocation metric is theoretically justifiable.
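
This screening step could be implemented roughly as follows; the sketch uses simulated stand-in data and is not the analysis script in the Supplementary Material.

```r
## Sketch: one mixed model per collocation metric, checked against the
## Bonferroni-corrected alpha of 0.005 (simulated stand-in data).
library(lme4)
library(lmerTest)

metric_names <- c("MI", "MI3", "Dice", "T_score", "Z_score", "logLikelihood",
                  "logForwardTP", "logBackwardTP", "logBigramFreq")

set.seed(1)
critical_words <- data.frame(
  participant  = factor(rep(1:20, each = 91)),
  word         = factor(rep(1:91, times = 20)),
  corrected_RT = rnorm(20 * 91, sd = 0.3)
)
for (m in metric_names) critical_words[[m]] <- rnorm(20 * 91)  # centered metrics

screen <- lapply(metric_names, function(m) {
  f   <- reformulate(c(m, "(1 | participant)", "(1 | word)"),
                     response = "corrected_RT")
  fit <- lmer(f, data = critical_words)
  cf  <- summary(fit)$coefficients
  data.frame(metric = m, estimate = cf[m, "Estimate"], p = cf[m, "Pr(>|t|)"])
})
do.call(rbind, screen)
# Retention criteria: p < 0.005 AND a negative estimate
# (stronger association / higher frequency -> faster reading).
```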

Based on these criteria, two models – those involving the collocation metrics
logBigramFreq and logBackwardTP – were retained for further analysis and
considered “provisional winning models”.4
It is interesting to note that two models satisfying criterion (1) did not meet
criterion (2): those containing logForwardTP and MI. In both of these models,
higher values in the association metrics correlate with significantly higher RTs.
In order to avoid drawing conclusions based on statistical significance that does
not derive from an informed hypothesis, we do not continue with the analysis of
these metrics at this stage. We return to this issue in the discussion.
Next, we tested whether adding an interaction by CONDITION to the two
provisional winning models significantly improves their fit. This turned out to be
the case: In both models, the collocation metric elicited a significantly weaker
facilitation in the typed condition.
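
Continuing the same simulated sketch, the nested comparison can be carried out with a likelihood-ratio test:

```r
## Sketch: does an interaction with CONDITION improve model fit?
critical_words$cond <- factor(rep(c("mc", "typed"), length.out = nrow(critical_words)))

m_main <- lmer(corrected_RT ~ logBigramFreq + cond + (1 | participant) + (1 | word),
               data = critical_words, REML = FALSE)
m_int  <- lmer(corrected_RT ~ logBigramFreq * cond + (1 | participant) + (1 | word),
               data = critical_words, REML = FALSE)
anova(m_main, m_int)  # likelihood-ratio test for the nested comparison
```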

3 Putting several collocation metrics into one model would not have worked for obvious
collinearity reasons (cf. Figure 1). Moreover, it would have defeated the purpose of identifying
which individual collocation metric is most cognitively realistic.
4 T-score, which produces a bigram ranking identical to logBigramFreq (cf. Section 2.1), did not
pass the Bonferroni-corrected threshold (p-value for the T-score coefficient = 0.0214).


This left us with two winning models containing a COLLOCATION METRIC and
CONDITION as fixed effects.5 Table 1 compares these two models in terms of AIC
and marginal R² (computed using the function r.squaredGLMM() from the
MuMIn package) (Barton 2018). As the table shows, the model featuring
logBigramFreq shows slight AIC and R² advantages (i.e. a lower AIC value and
a higher R² value) over the model featuring logBackwardTP.

Table 1: Model comparison for the two models of best fit, all fitted to the same
data points (i.e. the corrected RTs for the critical words).

Model 1: corrected_RT ~ cond*logBigramFreq + (1|participant) + (1|word)
Model 2: corrected_RT ~ cond*logBackwardTP + (1|participant) + (1|word)
Columns reported: model number, model syntax, AIC, marginal R².

Tables 2 and 3 present the fixed-effects coefficients for the two models.

Table 2: Fixed effects of the linear mixed-effect model of best fit involving the
(centered) COLLOCATION METRIC logBigramFreq and the factor CONDITION (mul-
tiple choice vs. typed free response).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), condtyped (+), logBigramFreq (−), condtyped:logBigramFreq (+).
The model is fitted to corrected log-transformed RTs. The model syntax is presented in Table 1 (Model 1).

Finally, we added an interaction by POSITION to each of the two winning
models to explore how the effects of COLLOCATION METRIC and CONDITION
evolve across the spillover region. All models reported below include an
additional random intercept for SENTENCE. In the prior models, which were
only fitted to the critical words, this random intercept was not necessary, since

5 Effects of figurative versus literal meaning (main effects and interactions) were tested for
exploratory reasons but were not found in any winning model.


Table 3: Fixed effects of the linear mixed-effect model of best fit involving the
(centered) COLLOCATION METRIC logBackwardTP and the factor CONDITION
(multiple choice vs. typed free response).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), condtyped (+), logBackwardTP (−), condtyped:logBackwardTP (+).
The model is fitted to corrected log-transformed RTs. The model syntax is presented in Table 1 (Model 2).

SENTENCE was essentially coextensive with WORD. By contrast, in the spillover
region, many words (especially function words) are repeated across different
sentences. Comparing models by AIC confirmed that including an intercept for
SENTENCE significantly improved model fit (see Supplementary Material). The
following tables (Tables 4 and 5) and their respective plots (Figures 2 and 3)
present the outcome of these models.

Table 4: Fixed effects of the linear mixed-effect model of best fit involving the (centered)
COLLOCATION METRIC logBigramFreq, the factor CONDITION (multiple choice vs. typed free
response), and the factor POSITION (critical, spillover1, spillover2).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), logBigramFreq (−), condtyped (+), positionspillover1 (−), positionspillover2 (−), logBigramFreq:condtyped (+), logBigramFreq:positionspillover1 (+), logBigramFreq:positionspillover2 (+), condtyped:positionspillover1 (−), condtyped:positionspillover2 (−), logBigramFreq:condtyped:positionspillover1 (−), logBigramFreq:condtyped:positionspillover2 (−).
Note: The syntax for this model was lmer(corrected_RT ~ logBigramFreq*cond*position + (1|participant) + (1|word) + (1|sentence)). The model is fitted to corrected log-transformed RTs.


Table 5: Fixed effects of the linear mixed-effect model of best fit involving the (centered)
COLLOCATION METRIC logBackwardTP, the factor CONDITION (multiple choice vs. typed free
response), and the factor POSITION (critical, spillover1, spillover2).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), logBackwardTP (−), condtyped (+), positionspillover1 (−), positionspillover2 (−), logBackwardTP:condtyped (+), logBackwardTP:positionspillover1 (+), logBackwardTP:positionspillover2 (+), condtyped:positionspillover1 (−), condtyped:positionspillover2 (−), logBackwardTP:condtyped:positionspillover1 (−), logBackwardTP:condtyped:positionspillover2 (−).
Note: The syntax for this model was lmer(corrected_RT ~ logBackwardTP*cond*position + (1|participant) + (1|word) + (1|sentence)). The model is fitted to corrected log-transformed RTs.

Figure 2: Effects of logBigramFreq on corrected word-by-word RTs from the critical noun until
the second spillover word as a function of required answer format (multiple choice vs. typed
free response). For the model syntax and coefficients, cf. Table 4.


Figure 3: Effects of logBackwardTP on corrected word-by-word RTs from the critical noun until
the second spillover word as a function of required answer format (multiple choice vs. typed
free response). For the model syntax and coefficients, cf. Table 5.

3.4 Interim discussion

3.4.1 Effects of collocation metrics

In terms of identifying the most cognitively realistic collocation metric, we first
see that five of the nine measures do not pass the significance threshold in
predicting reading times to the critical word (the noun in the modifier–noun
bigram): MI3, Dice, T-score, Z-score, and log-likelihood. Therefore, they are not
analyzed further (but see Section 4 for discussion).
In contrast, two metrics arise as significantly predictive of RTs and seem to be
candidates for the most cognitively realistic measures: logBigramFreq and
logBackwardTP. Of these, logBigramFreq is the best fit to the data, although
logBackwardTP has only a slight disadvantage in terms of model fit (AIC) and
proportion of variance accounted for by the fixed effects (marginal R2). For a
scatterplot representing the relation between logBigramFreq and logBackwardTP,
see Figure A1.


Further complicating the picture, two metrics significantly predict RTs, but
in a direction that cannot be theoretically justified: MI and logForwardTP.
Higher values in these two metrics lead to significantly slower reading times
on the critical word. In the case of MI, it is generally well understood that the
metric has a strong low-frequency bias, as very few co-occurrences of two low-
frequency words can lead to a high MI score for the bigram (Evert 2009: 19). For
this reason, bigrams ranking high in MI tend to be highly specialized terms
which are likely to occur in a restricted range of contexts (e.g. epileptic fit,
ulterior motive, cosmetic surgery). By contrast, bigrams ranking low in MI are
made up of frequent words of everyday language (e.g. wild child, free time, tidy
room). The suspicion thus arises that a single-word frequency bias might under-
lie the unexpected result for MI, an intuition that is further supported by the fact
that the cubed variant MI3, which was designed to reduce the low-frequency
bias of MI, was not significant in any direction.
Might a single-word frequency bias also be driving the unexpected result
for logForwardTP? A closer look at Figure 1 confirms that both MI and
logForwardTP are strongly negatively correlated with logModifierFreq. That
is, higher values in these two association scores are correlated with lower
modifier frequencies (Pearson’s correlation between MI and logModifierFreq:
−0.69, p < 0.0001; Pearson’s correlation between logForwardTP and
logModifierFreq: −0.62, p < 0.0001).
As the stimuli consisted of collocated modifier–noun bigrams, the modifier
always directly preceded the critical word (the noun). It has generally been estab-
lished that word recognition effects may be delayed in self-paced reading as the
paradigm relies on a secondary, behavioral response, i.e. pressing the spacebar
(Smith and Levy 2013; Just et al. 1982). Thus, it is not unusual that effects on the first
spillover word reflect processing of the critical word, or effects on the critical word
reflect processing of the precritical word. The finding that both high MI and high
logForwardTP slow down RTs to the critical word, along with the fact that both
metrics correlate with low logModifierFreq, suggests that logModifierFreq might be
underlying the unpredicted effects of MI and logForwardTP.
The assumption that single-word frequency effects show up on the
following word would also have the potential to explain the patterns of
results found for the winning metrics: LogBigramFreq is positively correlated
with both logModifierFreq and logNounFreq (Pearson’s correlation between
logBigramFreq and logModifierFreq: 0.54, p < 0.0001; Pearson's correlation
between logBigramFreq and logNounFreq: 0.42, p < 0.0001). Accordingly,
Figure 2 shows facilitation on both the critical word and the word immedi-
ately following the critical word (henceforth Spillover1). LogBackwardTP
is positively correlated with logModifierFreq (Pearson’s correlation: 0.38,


p = 0.0002) but negatively correlated with logNounFreq (Pearson's correlation:
−0.35, p = 0.0007). Accordingly, logBackwardTP shows lower RTs on
the critical word (possibly reflecting high logModifierFreq) but higher RTs
on Spillover1 (possibly reflecting low logNounFreq).
Thus, it seems theoretically possible that the results for all four of these
measures are at least partly driven by the usage frequency of the previous word
and thus reflect single-word frequencies rather than collocation strength or
bigram frequency. In order to test if the effects of logBigramFreq and
logBackwardTP remain when previous word frequency is controlled for, and to
test if previous-word frequency could be driving the unexpected effects of MI
and logForwardTP, we conducted a follow-up analysis, presented in Section 3.5.

3.4.2 Task effects

The comparison of different collocation metrics is not the only goal of the
current experiment. It also includes the variable of CONDITION: the difference
between multiple-choice and typed free response answer formats. As shown in
Section 3.3, both of the winning metrics (logBigramFreq and logBackwardTP)
provide better fit to the data when an interaction by condition is added. When
we explore the effect of this interaction by position (cf. Figures 2 and 3), we see,
first, significantly faster reading times in the multiple-choice condition across both
models. This is in line with our prediction that typed free response formats are
more cognitively taxing than multiple-choice formats and thus elicit slower
processing. Second, we find that the effects of both collocation metrics on RTs
to critical words are stronger in the multiple-choice condition than in the typed
free response condition. Third, in the typed free response condition, the effect of
logBigramFreq is more sustained than in the multiple-choice condition, since it
carries over more strongly into the first spillover word (see Figure 2). By contrast,
in the multiple-choice condition, the effect of logBigramFreq dissipates more
quickly after the critical word.

3.5 Follow-up analysis

In a follow-up analysis, we assessed whether the unexpected effects of MI and
logForwardTP disappear once single word frequency is factored out. Moreover,
we set out to explore whether the winning metrics from Section 3.3,
logBigramFreq and logBackwardTP, remain significant even after partialling
out previous word frequency.


First, we found that logModifierFreq predicts RTs to the critical word better
than logNounFreq and either of the winning collocation metrics (logBigramFreq
and logBackwardTP). Similarly, logNounFreq predicted RTs on Spillover1 better
than logModifierFreq or either of the winning collocation metrics. See Section 2
of the Supplementary Materials for these (and the following) models and
coefficients.
We suspected that the unexpected effects of MI and logForwardTP (see
Sections 3.3 and 3.4.1) might disappear after factoring out previous word fre-
quency. To assess this, we first ran two separate models to partial out the effects
of previous word frequency from RTs to critical words and Spillover1, respec-
tively. The residuals from these two models were then joined into one data frame
and used as the new dependent variable (“cleanedlogRT”).
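
A sketch of this partialling-out step is given below; the data frame, its column names, and the structure of the two auxiliary models are assumptions, since the text does not spell them out (see the Supplementary Material for the actual models).

```r
## Sketch: regress previous-word frequency out of the corrected RTs, separately
## for the critical word and Spillover1, and keep the residuals ("cleaned" RTs).
## rt_data, its columns, and the auxiliary model structure are assumptions.
crit  <- subset(rt_data, position == "critical")    # previous word = the modifier
spill <- subset(rt_data, position == "spillover1")  # previous word = the noun

m_crit  <- lm(corrected_RT ~ logModifierFreq, data = crit)
m_spill <- lm(corrected_RT ~ logNounFreq,     data = spill)

crit$cleaned_RT  <- resid(m_crit)
spill$cleaned_RT <- resid(m_spill)
cleaned <- rbind(crit, spill)   # the new dependent variable ("cleanedlogRT")
```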
With the effect of previous word frequency removed from the residualized
RTs, we then ran two models testing whether MI and logForwardTP were still
predictive of the cleaned RTs. This was not the case (p = 0.9547 and p =
0.7898, respectively).
By contrast, two analog models involving logBigramFreq and logBackwardTP
showed that the metrics remain significant even after partialling out previous
word frequency (see Tables 6 and 7). Indeed, with previous word frequency
removed, the winning metrics now show an even more pronounced difference
by condition. In the multiple-choice condition, the metrics correlate with lower
RTs, but in the typed condition, this effect is reversed.

Table 6: Fixed effects of the linear mixed-effect model of best fit involving the (centered)
COLLOCATION METRIC logBigramFreq, the factor CONDITION (multiple choice vs. typed free
response), and the factor POSITION (critical, spillover1).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), logBigramFreq (−), positionspillover1 (+), condtyped (+), logBigramFreq:positionspillover1 (+), logBigramFreq:condtyped (+), positionspillover1:condtyped (−), logBigramFreq:positionspillover1:condtyped (−).
Note: The syntax for this model was: lmer(cleaned_RT ~ logBigramFreq*cond*position + (1|participant) + (1|word) + (1|sentence)). The model is fitted to corrected log-transformed RTs, after partialling out the effects of previous word frequency.


Table 7: Fixed effects of the linear mixed-effect model of best fit involving the (centered)
COLLOCATION METRIC logBackwardTP, the factor CONDITION (multiple choice vs. typed free
response), and the factor POSITION (critical, spillover1).

Columns: Estimate, Std. error, t value, Pr(>|t|).
Rows (sign of estimate): (Intercept) (−), logBackwardTP (−), condtyped (+), positionspillover1 (+), logBackwardTP:condtyped (+), logBackwardTP:positionspillover1 (+), condtyped:positionspillover1 (−), logBackwardTP:condtyped:positionspillover1 (−).
Note: The syntax for this model was: lmer(cleaned_RT ~ logBackwardTP*cond*position + (1|participant) + (1|word) + (1|sentence)). The model is fitted to corrected log-transformed RTs, after partialling out the effects of previous word frequency.

4 General discussion
The first and most straightforward conclusion is that traditional corpus-based
association measures were not cognitively realistic in predicting the reading
times of the second word in collocated modifier–noun bigrams. None of
the six bidirectional measures provided by the BNC (MI, MI3, Dice coefficient,
T-score, Z-score, and log-likelihood) correlated with a significant facilitation in
collocation processing. The fact that these metrics did not predict reading times
is in line with previous research such as Dąbrowska (2014), which showed that
native speaker intuitions about collocation status do not correspond to T-score,
Z-Score, or MI values (see Section 1.1).
LogForwardTP, also known as surprisal, did not predict RTs in
the expected direction. This result was unexpected given the support the metric
has received in previous research and the widely shared assumption that
logForwardTP captures forward-looking processes that are at play during online
language processing (see Section 1.1). However, this effect seems to be driven
largely by individual word frequencies, as it disappears when previous word
frequency is partialled out. It would be interesting to explore whether this result
is replicated by follow-up studies.
The two winning metrics identified in the present study include one metric
insensitive to the association strength between component words (logBigramFreq)
and one measure of association strength that is unidirectional from right to left
(logBackwardTP).


The significance of logBigramFreq supports the usage-based suggestion that there is chunk-level access to n-grams beyond the word level during collocation reading (see Section 1.1). Concurrently, the significance of logBackwardTP, as
well as the significance of previous word frequency, suggests parallel access to
individual components (words) and the strength of association between them.
However, the right-to-left directionality of logBackwardTP suggests that in the
experiment at hand, readers relied on backward integration rather than predic-
tion. The significance of previous word frequency provides further support to the
idea that individual component words receive separate activation, above and
beyond holistic n-gram-level access.
The view that chunk-level processing does not preclude concomitant access
to component parts is in line both with recent cognitive linguistic models of
entrenchment and with language-independent models of chunk perception (for
a review, see Blumenthal-Dramé 2016a). These approaches model chunk status
as a gradient feature resulting from the interplay between complex wholes
(sometimes referred to as “configurations”) and their component parts
(Blumenthal-Dramé et al. 2017; Hay and Baayen 2005), with stronger chunk
status resulting from greater ease of access to (or strength of activation of) the
higher level unit relative to its component parts. One diagnostic feature of a
chunk is activation of the whole chunk by partial input (Blumenthal-Dramé
2012). However, this activation is not modelled in terms of syntagmatic predic-
tion from left to right, but in terms of meronomic activation between parts and
wholes: partial sensory input activates a mental representation for the whole
chunk, which then cues top-down activation (or: “top-down prediction”) of
missing sensory parts (Clark 2013). In psycholinguistic models of reading, such
part–whole relationships can be captured by so-called interactive activation
models (Carreiras et al. 2014). In our results, bigram frequency seems to capture
chunk-level activation, while backward transition probability reflects the con-
current activation of the bigram’s component words.
The experiment also tested a task manipulation: one condition had multiple-
choice comprehension questions, whereas the other had free response questions
that required typing between one and three words. We find that the multiple-
choice condition elicited significantly lower RTs than the typed response condi-
tion. This is in keeping with our predictions, which derive from meta-studies
revealing that free response tasks are cognitively more taxing than multiple-
choice tasks (see Section 1.2).
The significant interaction with the two “winning” association measures
reveals that the task effect is more than just a generalized slowing, however:
The effect of both collocation metrics on the critical word is more pronounced in
the multiple-choice condition than in the typed free response condition.


Moreover, in the case of logBigramFreq, the effect of the collocation metric is shorter lived in the multiple-choice condition (i.e. the facilitation is not very
strong on the first spillover word), whereas it carries over into the first spillover
word in the typed free response condition.
This difference might point toward a more “blurred” type of processing in
the typed condition. That is, effects arising at the word level are not restricted to
single words but operate over longer periods, allowing effects from previous
words to “level out” the effects of the current word. In other words, we suggest
that there is a wider spreading of syntagmatic activation in the typed condition,
which overlaps with the reading of the following word(s). If processing is less strictly confined to proceeding word by word in the typed condition, this could also partially explain why partialling out previous word frequency has stronger effects in the typed relative to the non-typed condition.
This levelling out could be related to slower reading. According to recent
proposals (Christiansen and Chater 2016; Chater and Christiansen 2018), online
language comprehension is constrained by the so-called now-or-never bottle-
neck: Input arrives so quickly that it must be recoded immediately before it is
overwritten or interfered with by the next piece of incoming input. “Recoding” in
this context involves compressing the fine-grained sensory information available
in the original input into a format which increasingly abstracts over low-level
detail, thereby gradually shifting it up a linguistic hierarchy whose levels
operate over larger temporal windows (e.g. from single letters via morphemes
and words to multi-word chunks and beyond).
In the multiple-choice condition, faster reading implies that this bottleneck
is comparably “tighter”, meaning that the input is more quickly recoded and
shifted to more abstract levels of representation. According to the now-or-never
bottleneck model, higher level representations supersede lower level informa-
tion, such that distributional statistics tied to individual words might no longer
be available to influence the processing of upcoming words.
In the typed free response condition, reading is slower, meaning that low-
level information tied to individual words is not dismissed as quickly, thereby
potentially exerting stronger effects on neighboring words. Thus, our results are
compatible with the hypothesis, raised in Section 1.2, that higher word-by-word
reading times might correlate with increased involvement of distributional
knowledge stored along with entries in the mental lexicon.
Of course, even if our tentative explanation is on the right track, our findings do not warrant any statement about the direction of causality. Thus,
processing in the typed free response task might be slower because more
attention is devoted to word-level information, or processing might be slower
for some independent reason (e.g. more mental imagery), which, in turn, may


allow more attention to be devoted to word-level information. The proposal that access to word-level statistical information may operate over a longer time span in the typed condition, or perhaps more generally under conditions of slower reading, clearly deserves further empirical and theoretical attention.
However, whatever the exact nature of the task effect, this study has shown
that a minimal difference in task can elicit significant differences in online
reading.

5 Conclusion
Overall, the results of the present study suggest caution when applying tradi-
tional corpus-linguistic collocation metrics to online language processing.
Measures like log-transformed bigram frequency may be more informative than measures that treat words as individual component parts combined by associations of varying strength. On the other hand, individual word frequencies are also critical, as they emerged as the best predictor of reading times.
Access to collocational knowledge also appears to be task-contingent, with
slight modulations in experimental design having the potential to alter the
relationship between collocation metrics and the unfolding of effects across
the spillover region. We suggest that this task contingency may account for
the lack of convergence between earlier studies tracking the cognitive correlates
of corpus-derived collocations.
At the same time, the results highlight that we should refrain from jumping
to conclusions on the basis of single experiments: the predictivity of a given
metric in a certain modality (reading) and under certain task demands (e.g.
answering multiple-choice questions) need not generalize to other tasks or
modalities (e.g. typing free response answers).
Further research is underway to address collocation status independently of
individual word frequency, through a design that allows for a direct comparison of the usage frequency of the whole bigram and that of each of its component parts. We hope this and other research on task pressures will contribute to
elucidating the open questions that remain.

Acknowledgments: We are grateful to Marc Brysbaert and one anonymous reviewer for their thoroughly helpful suggestions. Naturally, we take full responsibility for any errors that may remain in the text.
This research was supported by a Junior Fellowship from the Freiburg
Institute for Advanced Studies to the second author.


References
Abbot-Smith, Kirsten & Michael Tomasello. 2006. Exemplar-learning and schematization in a
usage-based account of syntactic acquisition. The Linguistic Review 23(3). 275–290.
Aijmer, Karin & Bengt Altenberg. 2014. English corpus linguistics. New York & London:
Routledge.
Arnon, Inbal & Uriel Cohen Priva. 2013. More than words: The effect of multi-word frequency
and constituency on phonetic duration. Language and Speech 56(3). 349–371.
doi:10.1177/0023830913484891.
Arnon, Inbal & Neal Snider. 2010. More than words: Frequency effects for multi-word phrases.
Journal of Memory and Language 62(1). 67–82. doi:10.1016/j.jml.2009.09.005.
Baayen, R. Harald. 2008. Analyzing linguistic data: A practical introduction to statistics using R.
New York & Cambridge: Cambridge University Press.
Bannard, Colin. 2006. Acquiring phrasal lexicons from corpora. University of Edinburgh dissertation.
Bannard, Colin & Elena Lieven. 2012. Formulaic language in L1 acquisition. Annual Review of
Applied Linguistics 32. 3–16. doi:10.1017/S0267190512000062.
Barton, Kamil. 2018. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.
Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects
models using lme4. Journal of Statistical Software 67(1). 1–48. doi:10.18637/jss.v067.i01.
Biskup, Danuta. 1992. L1 influence on learners’ renderings of English collocations: A Polish/
German empirical study. In Vocabulary and applied linguistics, 85–93. London: Palgrave
Macmillan. doi:10.1007/978-1-349-12396-4_8.
Blumenthal-Dramé, Alice. 2012. Entrenchment in usage-based theories: What corpus data do and do
not reveal about the mind (Topics in English Linguistics 83). Berlin: de Gruyter Mouton.
Blumenthal-Dramé, Alice. 2016a. Entrenchment from a psycholinguistic and neurolinguistic
perspective. In Entrenchment and the psychology of language learning: How we reorganize
and adapt linguistic knowledge. Berlin, Boston: De Gruyter. doi:10.1515/9783110341423-007.
Blumenthal-Dramé, Alice. 2016b. What corpus-based Cognitive Linguistics can and cannot
expect from neurolinguistics. Cognitive Linguistics 27(4). doi:10.1515/cog-2016-0062
Blumenthal-Dramé, Alice, Volkmar Glauche, Tobias Bormann, Cornelius Weiller, Mariacristina
Musso & Bernd Kortmann. 2017. Frequency and chunking in derived words: A parametric
fMRI study. Journal of Cognitive Neuroscience 29(7). 1162–1177. doi:10.1162/jocn_a_01120.
Blumenthal-Dramé, Alice & Evie Malaia. 2018. Shared neural and cognitive mechanisms in
action and language: The multiscale information transfer framework. Wiley
Interdisciplinary Reviews: Cognitive Science e1484. doi:10.1002/wcs.1484
Boston, Marisa, John Hale, Reinhold Kliegl, Umesh Patil & Shravan Vasishth. 2008. Parsing
costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence
Corpus. Journal of Eye Movement Research 2(1). 1, 1–12.
Bybee, Joan. 2010. Language, usage and cognition. Cambridge; New York: Cambridge
University Press.
Bybee, Joan & James L. McClelland. 2005. Alternatives to the combinatorial paradigm of
linguistic theory based on domain general principles of human cognition. The Linguistic
Review 22(2–4). 381–410.
Caldwell-Harris, Catherine L. & Alison L. Morris. 2008. Fast Pairs: A visual word recognition
paradigm for measuring entrenchment, top-down effects, and subjective phenomenology.
Consciousness and Cognition 17(4). 1063–1081. doi:10.1016/j.concog.2008.09.004.


Carreiras, Manuel, Blair C. Armstrong, Manuel Perea & Ram Frost. 2014. The what, when, where,
and how of visual word recognition. Trends in Cognitive Sciences 18(2). 90–98.
doi:10.1016/j.tics.2013.11.005.
Chater, Nick & Morten H. Christiansen. 2018. Language acquisition as skill learning. Current
Opinion in Behavioral Sciences (The Evolution of Language) 21. 205–208. doi:10.1016/j.
cobeha.2018.04.001.
Christiansen, Morten H. & Inbal Arnon. 2017. More than words: The role of multiword sequences in
language learning and use. Topics in Cognitive Science 9(3). 542–551. doi:10.1111/tops.12274.
Christiansen, Morten H. & Nick Chater. 2016. The Now-or-Never bottleneck: A fundamental con-
straint on language. Behavioral and Brain Sciences 39. doi:10.1017/S0140525X1500031X.
Clark, Andy. 2013. Whatever next? Predictive brains, situated agents, and the future of cognitive
science. Behavioral and Brain Sciences 36(03). 181–204. doi:10.1017/S0140525X12000477.
Clark, Andy. 2016. Surfing uncertainty: Prediction, action, and the embodied mind. New York:
Oxford University Press.
Conklin, Kathy & Norbert Schmitt. 2012. The processing of formulaic language. Annual Review
of Applied Linguistics 32. 45–61. doi:10.1017/S0267190512000074.
Croft, William. 2001. Radical construction grammar: Syntactic theory in typological perspective.
New York: Oxford University Press.
Dąbrowska, Ewa. 2014. Words that go together: Measuring individual differences in native
speakers’ knowledge of collocations. The Mental Lexicon 9(3). 401–418.
doi:10.1075/ml.9.3.02dab.
Demberg, Vera & Frank Keller. 2008. Data from eye-tracking corpora as evidence for
theories of syntactic processing complexity. Cognition 109(2). 193–210.
doi:10.1016/j.cognition.2008.07.008.
Deuter, Margaret, James Greenan, Joseph Noble, Janet Phillips & Diana Lea. 2002. Oxford
collocations dictionary. Oxford: Oxford University Press.
Drummond, Alex. 2016. Ibex Farm. http://spellout.net/ibexfarm/.
Durrant, Philip & Alice Doherty. 2010. Are high-frequency collocations psychologically real?
Investigating the thesis of collocational priming. Corpus Linguistics and Linguistic Theory
6(2). doi:10.1515/cllt.2010.006
Ellis, Nick C. 2002. Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in Second Language
Acquisition 24(2). 143–188. doi:10.1017/S0272263102002024.
Ellis, Nick C., Rita Simpson-Vlach & Carson Maynard. 2008. Formulaic language in native and
second language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL
Quarterly 42(3). 375–396. doi:10.1002/j.1545-7249.2008.tb00137.x.
Evert, Stefan. 2009. Corpora and collocations. In Anke Lüdeling & Merja Kytö (eds.), Corpus
linguistics: An international handbook, vol. 2. 1212–1248. Berlin, New York: Mouton de Gruyter.
Frank, Stefan L. 2013. Uncertainty reduction as a measure of cognitive load in sentence
comprehension. Topics in Cognitive Science 5(3). 475–494. doi:10.1111/tops.12025.
Frank, Stefan L. & Rens Bod. 2011. Insensitivity of the human sentence-processing system to
hierarchical structure. Psychological Science 22(6). 829–834. doi:10.1177/0956797611409589.
Frank, Stefan L., Leun J. Otten, Giulia Galli & Gabriella Vigliocco. 2015. The ERP response to the
amount of information conveyed by words in sentences. Brain and Language 140. 1–11.
doi:10.1016/j.bandl.2014.10.006.
Gollan, Tamar H., Timothy J. Slattery, Diane Goldenberg, Eva Van Assche, Wouter Duyck & Keith
Rayner. 2011. Frequency drives lexical access in reading but not in speaking: The


frequency-lag hypothesis. Journal of Experimental Psychology: General 140(2). 186–209. doi:10.1037/a0022256.
Gries, Stefan Th. 2013. 50-something years of work on collocations: What is or should be next ….
International Journal of Corpus Linguistics 18(1). 137–166. doi:10.1075/ijcl.18.1.09gri.
Gries, Stefan Th. & Nick C. Ellis. 2015. Statistical measures for usage-based linguistics.
Language Learning 65(S1). 228–255. doi:10.1111/lang.12119.
Gurevich, Olga, Matthew A. Johnson & Adele E. Goldberg. 2010. Incidental verbatim memory for
language. Language and Cognition 2(1). 45–78. doi:10.1515/langcog.2010.003.
Hale, John. 2016. Information-theoretical complexity metrics. Language and Linguistics
Compass 10(9). 397–412. doi:10.1111/lnc3.12196.
Hay, J. & R. Baayen. 2005. Shifting paradigms: Gradient structure in morphology. Trends in
Cognitive Sciences 9(7). 342–348. doi:10.1016/j.tics.2005.04.002.
Hintz, Florian, Antje S. Meyer & Falk Huettig. 2016. Encouraging prediction during production
facilitates subsequent comprehension: Evidence from interleaved object naming in sen-
tence context and sentence reading. The Quarterly Journal of Experimental Psychology
69(6). 1056–1063. doi:10.1080/17470218.2015.1131309.
Hoffmann, Sebastian. 2008. Corpus linguistics with BNCweb: A practical guide (English Corpus
Linguistics v. 6). Frankfurt am Main: Peter Lang.
Hohwy, Jakob. 2013. The predictive mind. 1st ed. Oxford, New York: Oxford University Press.
Howarth, Peter. 1998. Phraseology and second language proficiency. Applied Linguistics 19(1).
24–44. doi:10.1093/applin/19.1.24.
Huang, Yanping & Rajesh P. N. Rao. 2011. Predictive coding. Wiley Interdisciplinary Reviews:
Cognitive Science 2(5). 580–593. doi:10.1002/wcs.142.
In’nami, Yo & Rie Koizumi. 2009. A meta-analysis of test format effects on reading and listening
test performance: Focus on multiple-choice and open-ended formats. Language Testing
26(2). 219–244. doi:10.1177/0265532208101006.
Ito, Aine, Martin Corley & Martin J. Pickering. 2018. A cognitive load delays predictive eye
movements similarly during L1 and L2 comprehension. Bilingualism: Language and
Cognition 21(2). 251–264. doi:10.1017/S1366728917000050.
Jacobs, Cassandra L., Gary S. Dell, Aaron S. Benjamin & Colin Bannard. 2016. Part and whole
linguistic experience affect recognition memory for multiword sequences. Journal of
Memory and Language 87. 38–58. doi:10.1016/j.jml.2015.11.001.
Jiang, Nan & Tatiana M. Nekrasova. 2007. The processing of formulaic sequences by second
language speakers. The Modern Language Journal 91(3). 433–445.
Just, Marcel A., Patricia A. Carpenter & Jacqueline D. Woolley. 1982. Paradigms and processes
in reading comprehension. Journal of Experimental Psychology: General 111(2). 228–238.
doi:10.1037/0096-3445.111.2.228.
Kuperberg, Gina R. & T. Florian Jaeger. 2016. What do we mean by prediction in language compre-
hension? Language, Cognition and Neuroscience 31(1). 32–59. doi:10.1080/
23273798.2015.1102299.
Kuznetsova, Alexandra, Per B. Brockhoff & Rune H. B. Christensen. 2017. lmerTest package: Tests in
linear mixed effects models. Journal of Statistical Software 82(13). doi:10.18637/jss.v082.i13
Levshina, Natalia. 2015. How to do linguistics with R: Data exploration and statistical analysis.
Amsterdam: John Benjamins Publishing Company.
Levy, Roger. 2008. Expectation-based syntactic comprehension. Cognition 106(3). 1126–1177.
doi:10.1016/j.cognition.2007.05.006.


Linzen, Tal & T. Florian Jaeger. 2015. Uncertainty and expectation in sentence processing: Evidence
from subcategorization distributions. Cognitive Science 40(6). doi:10.1111/cogs.12274
Lowder, Matthew W., Wonil Choi, Fernanda Ferreira & John M. Henderson. 2018. Lexical
predictability during natural reading: Effects of surprisal and entropy reduction. Cognitive
Science doi:10.1111/cogs.12597.
Martyńska, Małgorzata. 2004. Do English language learners know collocations? Investigationes
Linguisticae 11. 1–12. doi:10.14746/il.2004.11.4.
McCauley, Stewart M. & Morten H. Christiansen. 2017. Computational investigations of multi-
word chunks in language learning. Topics in Cognitive Science 9(3). 637–652.
doi:10.1111/tops.12258.
O’Grady, William. 2008. The emergentist program. Lingua 118(4). 447–464. doi:10.1016/j.
lingua.2006.12.001.
Payne, Brennan R. & Kara D. Federmeier. 2017. Pace yourself: Intraindividual variability in
context use revealed by self-paced event-related brain potentials. Journal of Cognitive
Neuroscience 29(5). 837–854. doi:10.1162/jocn_a_01090.
Rodriguez, Michael C. 2006. Construct equivalence of multiple-choice and constructed-
response items: A random effects synthesis of correlations. Journal of Educational
Measurement 40(2). 163–184. doi:10.1111/j.1745-3984.2003.tb01102.x.
Siyanova, Anna & Norbert Schmitt. 2008. L2 learner production and processing of collocation: A
multi-study perspective. Canadian Modern Language Review doi:10.3138/cmlr.64.3.429.
Siyanova-Chanturia, Anna. 2015. On the ‘holistic’ nature of formulaic language. Corpus
Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2014-0016
Siyanova-Chanturia, Anna, Kathy Conklin, Sendy Caffarra, Edith Kaan & Walter J. B. van Heuven.
2017. Representation and processing of multi-word expressions in the brain. Brain and
Language 175. 111–122. doi:10.1016/j.bandl.2017.10.004.
Smith, Nathaniel J. & Roger Levy. 2013. The effect of word predictability on reading time is
logarithmic. Cognition 128(3). 302–319. doi:10.1016/j.cognition.2013.02.013.
Tremblay, Antoine & Harald Baayen. 2009. Holistic processing of regular four-word sequences.
Perspectives on Formulaic Language in Acquisition and Production. London and New York:
Continuum.
Tremblay, Antoine, Bruce Derwing, Gary Libben & Chris Westbury. 2011. Processing advantages of
lexical bundles: Evidence from self-paced reading and sentence recall tasks: Lexical bundle
processing. Language Learning 61(2). 569–613. doi:10.1111/j.1467-9922.2010.00622.x.
Tremblay, Antoine & Benjamin V. Tucker. 2011. The effects of N-gram probabilistic measures on
the recognition and production of four-word sequences. The Mental Lexicon 6(2). 302–324.
doi:10.1075/ml.6.2.04tre.
Wei, Taiyun & Viliam Simko. 2017. R package “corrplot”: Visualization of a correlation matrix.
https://github.com/taiyun/corrplot.
Wiechmann, Daniel. 2008. On the computation of collostruction strength: Testing measures of
association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory 4(2).
doi:10.1515/CLLT.2008.011
Wlotko, Edward W. & Kara D. Federmeier. 2015. Time for prediction? The effect of presentation
rate on predictive sentence comprehension during word-by-word reading. Cortex 68.
20–32. doi:10.1016/j.cortex.2015.03.014.
Wurm, Lee H. & Sebastiano A. Fisicaro. 2014. What residualizing predictors in regression
analyses does (and what it does not do). Journal of Memory and Language 72. 37–48.
doi:10.1016/j.jml.2013.12.003.


Appendix: Materials used in the experiments


Table A1: Full list of experimental stimuli, along with different association scores between the modifier and the critical word (“Word”) as well as raw frequencies for the modifier (ModFreq), the critical word (NounFreq), and the whole bigram (BigramFreq), all extracted from the BNC (cf. Section 2.1).

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Katy was Accents . . . . . . . . , , 
surrounded by
foreign accents
on the train.
There existed a Argument . . . . . . . . , , 
strong
argument
against the bill.
Emma discerned Attitude . . . . . . . . , , 
the bad attitude
of her client.
Tanner purchased Band . . . . . . . .  , 
Kyla McConnell and Alice Blumenthal-Dramé

the elastic band


that he needed.
Scott Beginning . . . . . . . .  , 
contemplated
the humble
beginning of
the movement.
Amber enjoyed a Beverage . . . . . . . .   
refreshing
beverage under
the stars.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Ava documented Bird . . . . . . . .  , 
the majestic
bird in her
journal.
The moldy bread Bread . . . . . . . .  , 
was thrown
away.
Ryan chose a fast Car . . . . . . . . , , 
car at the
dealership.
Phoebe started a Chat . . . . . . . . ,  
brief chat with
the postman.
Bentley mentioned Child . . . . . . . . , , 
the wild child
and his mother.
This vicious circle Circle . . . . ,. . . .  , 
seems
unbreakable
sometimes.
Due to the Circumstances . . . . . . . .  , 
mitigating
circumstances
Sarah was
released.
Today civilian Clothes . . . . . . . . , , 
clothes are
being washed.
Online collocation reading

(continued )

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
31
Table A1: (continued )
32

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Kyra always had Coffee . . . . . . . .  , 
decaffeinated
coffee with her
toast.
Last year provided Conditions . . . . . . . . , , 
favorable
conditions for
job creation.
Robert listened to Conscience . . . . . . . . , , 
his guilty
conscience
when making
decisions.
Tarek’s firm Conviction . . . . . . . . , , 
conviction
persuaded the
politicians.
Kyla McConnell and Alice Blumenthal-Dramé

Ty commented on Crime . . . . . . . .  , 
the petty crime
plaguing the
city.
Gregory predicted Danger . . . . . . . .  , 
the grave
danger
associated with
lead.
Connor was Deal . . ,. . ,. . . . , , ,
informed about
the great deal

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
on designer
jeans.

(continued )
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Barbara Debate . . . . . . . .  , 
remembered
the heated
debate at the
meeting.
Tris watched the Defeat . . . . . . . .  , 
crushing defeat
unfold on TV.
Trevor was Depths . . . . . . . .  , 
fascinated by
the murky
depths of the
ocean.
Something about Dessert . . . . . . . . ,  
the rich dessert
made David ill.
Brenda prioritized Diet . . . . ,. . . . , , 
a balanced diet
and regular
exercise.
Rosa spoke about Driving . . . . ,. . . .  , 
reckless driving
to the kids.
That is a prime Example . . . . . . . . , , 
example of
Renaissance art.
A characteristic Feature . . . . . . . . , , 
Online collocation reading

feature defined
Larry’s face.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
33
Table A1: (continued )
34

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

It is well known Feet . . . . . . . .  , 
that itchy feet
drive people
crazy.
Ahmad’s son Fit . . . . . . . .  , 
anticipated the
epileptic fit
before it
happened.
Mohammed took Food . . . . ,. . . . , , 
note of the
good food at
the pub.
Kevin envied the Fortune . . . . . . . . , , 
small fortune
his brother
inherited.
Kyla McConnell and Alice Blumenthal-Dramé

Lyssa spotted her Friend . . . . ,. . . . , , 
close friend in
the crowd.
Luca smelled the Fruit . . . . . . . .  , 
rotten fruit on
the counter.
The kid played on Grass . . . . . . . . , , 
the green grass
near the school.
Clarissa observed Growth . . . . . . . .  , 
the stunted
growth of the
plant.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Sage’s tiresome Habit . . . . . . . .  , 
habit quickly
became
annoying.
Katrina discovered Hair . . . . ,. . . .  , 
blonde hair in
the bathroom.
Clara offered a Hand . . . . ,. .  .  , 
helping hand to
the workers.
Bartholomew House . . . . . . . . , , 
praised the
magnificent
house and its
owners.
It was not a bright Idea . . . . . . . . , , 
idea to visit
Crystal.
Chris was aware of Illness . . . . . . . .  , 
the debilitating
illness and its
consequences.
Floyd criticized the Impact . . . . . . . . , , 
direct impact of
the pollution.
The object’s vital Importance . . . . ,. . . . , , 
importance
Online collocation reading

cannot be
overstated.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
35
Table A1: (continued )
36

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Henry grabbed the Instrument . . . . . . . .  , 
blunt
instrument and
appraised it.
Laura knew about Interest . . . . ,. . . .  , 
the
government’s
vested interest
in the change.
Carla was a born Leader . . . . . . . .  , 
leader her
teachers said.
Courtney Living . . . . . . . .  , 
acknowledged
that communal
living had many
benefits.
Kyla McConnell and Alice Blumenthal-Dramé

Brianna Love . . . . . . . .  , 
understood that
unrequited love
could be
painful.
Paul had a light Lunch . . . . . . . . , , 
lunch before
the interview.
Saul realized that Majority . . ,. . ,. . . . , , 
the vast
majority had
voted

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
incorrectly.

(continued )
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Louise Man . . . . ,. . . . , , ,
complimented
the old man in
her
neighborhood.
Nellie pinpointed Motive . . ,. . . . . .  , 
the ulterior
motive of the
banker.
Tess ensured the Music . . . . ,. . . . , , 
classical music
was showcased
correctly.
Achim dismissed Notion . . . . . . . .  , 
the
preconceived
notion with a
sigh.
Christian wrote Occasion . . . . . . . .  , 
about the
auspicious
occasion in his
memoir.
Priya found the Officer . . . . . . . .  , 
commissioned
officer sitting
around outside.
Online collocation reading

(continued )

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
37
Table A1: (continued )
38

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

The senior officials Officials . . . . ,. . . . , , 
ultimately
decided
everything.
Vladimir Pace . . . . . . . .  , 
maintained a
brisk pace
throughout the
walk.
Philippa Pain . . . . . . . .  , 
experienced
excruciating
pain in her legs.
The business was People . . . . . . . . , , 
based on
common people
and their
Kyla McConnell and Alice Blumenthal-Dramé

desires.
Pablo regularly Principles . . . . . . . . , , 
tested the
moral principles
of his
employees.
Clemence was Protest . . . . . . . . , , 
intrigued by the
peaceful protest
in the capital.

(continued )

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Hank noticed the Rain . . . . . . . .  , 
pouring rain
and stayed
inside.
Chandler lost the Reader . . . . . . . .  , 
avid reader in
the library.
Heather adjusted Reality . . . . . . . . , , 
to the harsh
reality after the
war.
The group thought Rights . . . . ,. . . . , , ,
that human
rights were very
important.
Brian examined Room . . . . . . . .  , 
the tidy room
and was
satisfied.
Neveah took in the Scenery . . . . . . . . ,  
incredible
scenery all
around her.
Charlie gave a Service . . . . ,. . . . , , 
speech about
military service
in the eighties.
Online collocation reading

Redmond Shame . . ,. . . . . .  , 
considered it a
crying shame to

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
be poor.
39

(continued )
Table A1: (continued )
40

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Ginny made sure Share . . . . ,. . . . , , 
that a fair share
was allocated
today.
Benny figured a Shower . . . . . . . . , , 
quick shower
would be nice.
Alaina followed the Smell . . . . . . . .  , 
putrid smell to
the kitchen.
Ahmed checked Soil . . . . . . . .  , 
the fertile soil
for invasive
insects.
Gabe is a brave Soul . . . . . . . . , , 
soul for going
skydiving.
Taylor reacted to Speed . . . . ,. . . . , , 
Kyla McConnell and Alice Blumenthal-Dramé

the high speed


of the serve.
Morgan ignored Star . . . . . . . .  , 
the budding
star despite her
persistence.
Brooke gossiped Stranger . . . . . . . . , , 
about the
beautiful
stranger on the
train.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
Table A1: (continued )

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Sherry had Surgery . . . . . . . .  , 
cosmetic
surgery done
too often.
Brendon got a Team . . . . . . . .  , 
glimpse of the
losing team
before they left.
Dominic rejoiced Time . . . . . . . . , , 
about the free
time he now
had.
Carl sensed that Time . . . . . . . . , , 
precious time
was running
out.
Shelly was Times . . . . . . . . , , 
interested in
ancient times
and faraway
lands.
Tom heard the Traffic . . . . . . . . , , 
heavy traffic
from his
window.
Hal questioned the Victory . . . . . . . . , , 
narrow victory
Online collocation reading

of his
opponent.

Authenticated
Download Date | 11/30/19 5:21 PM
Brought to you by | Göteborg University - University of Gothenburg
(continued )
41
Table A1: (continued )
42

Full sentence Word MI MI Z-score T-score Log-likelihood Dice FTP BTP ModFreq NounFreq Bigram-
Freq

Bella’s loud voice Voice . . . . . . . . , , 
carried the
choir.
Matthew felt the Water . . . . . . . .  , 
tepid water with
his toe.
Sam recalled the Winter . . . . . . . . , , 
mild winter
three years ago.
Gertrude passed Youth . . . . . . . .  , 
by a
disillusioned
youth on the
corner.
Katy was Accents . . . . . . . . , , 
surrounded by
foreign accents
on the train.
Kyla McConnell and Alice Blumenthal-Dramé

Range .– .– .– .– .– .– .– .– – – –
. . ,. . ,. . . . , , ,
Mean . . . . ,. . . . ,. ,. .
Standard deviation . . . . ,. . . . ,. ,. .

Note: For greater ease of readability, the five last columns of this table are not log-transformed. FTP: Forward transition probability; BTP: backward
transition probability; ModFreq: modifier frequency; NounFreq: noun frequency; BigramFreq: bigram frequency.


Figure A1: Scatterplot representing the relation between log-transformed bigram frequency (logBigramFreq) and log-transformed backward transition probability (logBackwardTP) in the 91 collocations used in the present study (Pearson’s correlation: 0.7029, p < 0.0001).
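Given a stimulus table in the format of Table A1, the reported correlation can be checked in a few lines of R; the data frame stimuli and its column names are assumed here, not the published dataset.

# Sketch only: 'stimuli' is assumed to hold one row per item with raw
# BigramFreq and NounFreq counts, as in Table A1.
stimuli$logBigramFreq <- log(stimuli$BigramFreq)
stimuli$logBackwardTP <- log(stimuli$BigramFreq / stimuli$NounFreq)

cor.test(stimuli$logBigramFreq, stimuli$logBackwardTP)  # Pearson's r
plot(stimuli$logBigramFreq, stimuli$logBackwardTP,
     xlab = "logBigramFreq", ylab = "logBackwardTP")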

Supplementary Material: The online version of this article offers supplementary material
(https://doi.org/10.1515/cllt-2018-0030).

Bionotes
Kyla McConnell

Kyla McConnell is a Ph.D. candidate in Linguistics at the English Department of the University of
Freiburg (Germany). Her primary research interests center around her ongoing dissertation
“Individual Differences and Task Effects in Predictive Coding”. This research focuses on topics
such as the extent to which quantitative and corpus-derived variables can reflect the cognition
of individual speakers and how speaker- and task-based variables can modulate language
processing. In this work, she uses various psycho- and neurolinguistic experimental
paradigms and statistical methods to align large-scale data with real-time language
comprehension.
She previously studied English Language and Linguistics at the University of Freiburg
(Germany), and Hispanic Linguistics and German Language and Literature at the University of
North Carolina at Chapel Hill (USA).

Alice Blumenthal-Dramé

Dr. Alice Blumenthal-Dramé currently works as an Assistant Professor in English Linguistics at the English Department of the University of Freiburg (Germany). She studied English Philology,
Slavic Philology, Computational Linguistics and General Linguistics at the University of
Manchester (UK), the Lomonosov University of Moscow (Russian Federation), and the University
of Freiburg (Germany), where she received her PhD in 2011.


Her publications exploit behavioral and functional neuroimaging methods to explore the
extent to which statistical generalizations across “big data” (notably, large-scale text corpora
and databases derived from such corpora) have the potential to offer realistic insights into
language users’ cognition. Major motivations behind this research have been: (1) to put to the
test the cognitive reality of cognitive linguistic assumptions, and (2) to gain a better
understanding of the size and nature of the cognitive building blocks that are utilized in natural
language use.
Further research interests include morphological theories, psycholinguistic models, Gestalt
psychology, usage-based linguistics, language typology, and statistical methods.
