Data Science Master Thesis - DV Larasati

Predicting Non-Native English Speakers’ Vocabulary
Knowledge with Native English Speaker Psycholinguistic
Behaviors: A Big Data Approach
Student details
Name: D.V. Larasati

Student number: U252005
THESIS SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY
Thesis committee
Supervisor: dr. B. Nicenboim

Second reader: dr. Emmanuel Keuleers
Word count: 8,967
Tilburg University
School of Humanities & Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
3 December 2021
Abstract
With the growing need for and interest in language learning, as well as the increasing
accessibility of tools and technologies that support second language learning, it is becoming
more relevant to add to the body of research conducted on this topic. One of the most
relevant topics studied over time is how second language (L2) speakers develop lexical
proficiency. An important indicator of a second language speaker’s lexical proficiency is
vocabulary knowledge. Various studies have highlighted the importance of lexical
characteristics in predicting the ease with which L2 speakers acquire new words. These
characteristics include how frequently speakers are exposed to words in the language, how
readily they can associate these words with something concrete, and word length.
However, few studies have modeled the relationship between native language (L1) lexical
proficiency and second language proficiency. This study aims to contribute to the existing
literature by studying the predictive power of native language vocabulary knowledge, as
measured by L1 psycholinguistic measures collected during a word recognition task, on
second language vocabulary knowledge, as measured by receptive word knowledge. The
study compares the implementation of different machine learning algorithms on
regression tasks predicting L2 receptive word knowledge based on data obtained from a
mega study, which is a distinctive contribution within this study domain. Word frequency
outperformed L1 psycholinguistic measures as a predictor of L2 receptive word knowledge for both the
linear regression and neural network models, confirming previous findings. Overall, the
neural network performed better than the linear regression algorithm as
an estimator.
Keywords: lexical proficiency, word recognition, machine learning, neural network,
psycholinguistic measures
Data Source/Code/Ethics Statement
Data collected and processed through the English Crowdsourcing Project (Mandera,
Keuleers, & Brysbaert, 2020) as part of a set of online vocabulary tests organized by Ghent
University were used for the purpose of writing this thesis. All data analyzed within this thesis
are publicly available on the following sources: Our lexicon projects (ugent.be) and OSF |
English Crowdsourcing Project.
Code used to conduct the experimental procedures for the data analysis in this thesis
was obtained from the following publicly available source (Gaurav, 2019). The repurposed
code written by the author for this thesis can be found on the following link.
Table of Contents
Abstract
Data Source/Code/Ethics Statement
Predicting Non-Native English Speakers’ Vocabulary Knowledge with Native English
Speaker Psycholinguistic Behaviors: A Big Data Approach
2. Related Work
2.1 Relevant Work
2.1.1 Measuring Vocabulary Knowledge
2.1.2 Lexical characteristics as Predictors for L2 Vocabulary Knowledge
2.1.3 L1 Psycholinguistic Measures as Predictors for L2 Vocabulary Knowledge
2.1.4 Machine Learning Approach to Lexical Proficiency Predictive Analysis
2.2 The Current Study
3. Methods
3.1 Modeling Approach
3.1.1 Linear Regression
3.1.2 Neural Network
3.1.3 Models
3.2 Predictors
3.2.1 Word Frequency
3.2.2 Lexical characteristics
3.2.3 L1 English Psycholinguistic Behavior
3.3 Target
4. Experimental Setup
4.1 Dataset Description
4.1.1 Data Sources
4.1.2 Data Preprocessing
4.2 Experimental Procedure
4.2.1 Model Evaluation & Hyper Parameters
4.2.2 Tasks
Task 2.
Task 3.
5. Results
5.1 Task 1
5.2 Task 2
5.3 Task 3
5.4 Task 4
5.5 Task 5
5.6 Post-Hoc Analysis
6. Discussion
6.1 Limitations & Future Research
7. Conclusion
Acknowledgements
References
Appendix A
Appendix B
Appendix C
Appendix D
Predicting Non-Native English Speakers’ Vocabulary Knowledge with Native
English Speaker Psycholinguistic Behaviors: A Big Data Approach
With the growing need for and interest in language learning, as well as the increasing
accessibility of tools and technologies that support second language learning, it is becoming
more relevant to add to the body of research conducted on this topic. One of the most
relevant topics studied over time is how second language (L2) speakers develop lexical
proficiency. L2 lexical proficiency can be defined as the depth of understanding (Anderson &
Freebody, 1981) as well as the breadth of lexical knowledge, or vocabulary size (Hazenberg
& Hulstijn, 1996; Nurweni & Read, 1999).
An indicator for someone’s vocabulary size is vocabulary knowledge, or their
knowledge of words (Laufer et al., 2004; Nation, 2001). Previous studies have concluded
that vocabulary knowledge refers to a person’s scope of word usage in writing, reading,
listening, and speaking, which can be further divided into receptive and productive
knowledge (Henriksen, 1999; Laufer, 1998; O’Dell et al., 2000), among other dimensions
(Maskor & Baharudin, 2016). Receptive knowledge of a word involves aspects such as
recognizing a word, knowing what a word looks like, and associating a word with other words
(Nation, 2001). Therefore, one of the most common tasks used to measure receptive word
knowledge is the word recognition task, where participants are asked to indicate
whether they recognize a word and to identify whether the word seen is an actual word or a
pseudoword (Berger et al., 2019). The response times and accuracy of participants’
responses have been found to be important indicators of how L2 speakers process and
access words mentally and how they develop automaticity, as well as of their vocabulary size
(DeKeyser, 2001; Leow et al., 2014; Milton, 2006).
There are many factors which affect the ease of learning words and expanding
vocabulary knowledge for L2 speakers. One of these is lexical characteristics, a set of word
characteristics that include word frequency, word length, orthographic and phonological
distance to other words, number of morphemes, average age of acquisition, and
concreteness (Berger et al., 2019a, 2019b; Brysbaert et al., 2016, 2020; Keuleers et al., 2012,
2015; Mandera et al., 2020; Skalicky et al., 2019). Of these characteristics, word frequency
has been a consistently significant predictor of word knowledge: words that appear
or are used more frequently within a language tend to be learned more quickly by both
second language and native language speakers (Berger et al., 2019; Brysbaert & Cortese,
2011; Ellis, 2002; 2003; Read, 1998; Schmitt et al., 2001; Mandera et al., 2020). However,
there are drawbacks to using word frequency as an indicator of L2 lexical proficiency
(Berger et al., 2019; Brysbaert et al., 2020; He & Godfroid, 2019; Mandera et al., 2020). First,
there are disproportionately more low frequency words than high
frequency words, particularly in the English language, and many of these low frequency words
are, in fact, well known by L2 speakers. Second, word frequency is a measure derived
from a corpus or set of corpora, which means that it depends on the corpus used for its
measurement and is, in that sense, subjective in nature.
Another variable which has been studied in relation to L2 lexical proficiency is lexical
proficiency of native language (L1) speakers. Previous studies have found that word
knowledge of individuals who claimed English to be their native tongue and individuals who
spoke English as a second language exhibited high correlation and similar patterns
(Monaghan et al., 2017; Brysbaert et al., 2020). Meanwhile, others noted a relationship
between the reading, writing, and speaking abilities of bilingual individuals in their native
and second languages (Asfaha et al., 2009; Lee & Schallert, 1997; Pae, 2019; Sparks et al., 2012).
Berger et al. (2019) established that the psycholinguistic behavior of L1 English speakers, as
measured by response time and accuracy during lexical tasks, was able to index the
difficulty level of words produced over time by a group of students who did not speak English
as a native language.
Response time and accuracy rate on lexical decision tasks are prevalent measures of
receptive word knowledge as they provide information on real-time automatic processing of
a language by both L2 and L1 speakers (Berger et al., 2019). This is the underlying
motivation for the focus of this study on how psycholinguistic measures of L1 English
speakers are able to predict L2 vocabulary knowledge. Measures of response time on lexical
tasks indicate how well a language speaker’s mental lexical network is organized as well as
their processing abilities, allowing for conclusions on how automaticity is developed
(DeKeyser, 2001; Hulstijn et al., 2009; Van Gelderen et al., 2004). Accuracy rate, on the
other hand, is a widely accepted indicator for L1 and L2 vocabulary size (Meara & Milton,
2003). Moreover, compared to measures based on corpora and individual interpretation of
word meaning, these psycholinguistic measures bring some advantages. Corpora-based
measures such as word frequency rely on proxies of large reference corpora, which may not
be representative of language speakers’ actual language exposure. Meanwhile, other lexical
characteristics, such as concreteness, rely on the judgement and perception of individuals,
making them more subjective compared to L1-derived measures which are based on L1
speakers’ performance on word knowledge tasks (Berger et al., 2019).
One of the most prevalent methods used to capture psycholinguistic measures is the
online or mega study approach (Balota et al., 2007; Berger et al., 2019; Marinis, 2010;
Mandera et al., 2020). As opposed to the conventional method of language proficiency
testing in a classroom or experimental setup, this approach has several advantages,
including: the ability to include a larger number of stimuli and participants, resulting in more
power; the ability to assess the importance of both existing and new word characteristics;
data availability for repeated use and answering different research questions; avoiding
experimenter bias; random selection of stimuli to be presented to participants; as well as
evaluating the performance of new computational models (Balota et al., 2015; Keuleers &
Balota, 2013; Liben-Nowell et al., 2019; Mandera et al., 2020). The mega study approach
has led to increased availability of large amounts of data over the years (Mandera et al.,
2020), which, together with rapid advancements in computational models, has resulted in
increased interest in the utilization of machine learning models for lexical proficiency
prediction (Santos et al., 2012; Yang et al., 2016). However, studies that measure
the predictive power of L1 psycholinguistic behavior on L2 lexical proficiency,
particularly with measures derived from mega studies, remain limited.

This study aims to add to the body of literature within the domain of L2 lexical
proficiency by focusing on evaluating how well psycholinguistic measures of native English
(L1) speakers can predict receptive word knowledge of non-native (L2) English speakers in
comparison to a more conventionally used set of predictors, namely lexical characteristics. In
doing so, information obtained through the means of a mega study will be utilized and the
performances of a simple and a more complex machine learning algorithm will be compared.
The following main research question and sub-research questions were formulated:
To what extent can lexical characteristics and measures derived from L1 English
psycholinguistic behavior predict L2 English receptive word knowledge?
1. How do the performances of different machine learning algorithms compare in
predicting L2 English receptive word knowledge based on word frequency?
2. How do the performances of different machine learning algorithms compare in
predicting L2 English receptive word knowledge based on a set of commonly
used lexical characteristics?
3. How do the performances of different machine learning algorithms compare in
predicting L2 English receptive word knowledge based on L1 English
psycholinguistic behavior during word recognition and lexical decision tasks?
4. How does word frequency compare with L1 psycholinguistic behavior in
predicting L2 English receptive word knowledge?
L2 English receptive word knowledge is represented by a word difficulty ranking based
on response times and accuracy ratings of L2 speakers on an online word recognition task,
while L1 English psycholinguistic behavior is assessed by response times and accuracy
ratings of native English speakers on the same task. To answer the research questions
presented in this study, one baseline model, five linear regression models and five neural
networks were run.
Findings of this study suggest that word frequency outperforms L1 English
psycholinguistic measures in predicting L2 receptive word knowledge. However, including L1
measures in place of word frequency, together with other lexical characteristics, within the
model yields comparable results, confirming their predictive power not only for productive L2
word knowledge, but also for receptive L2 word knowledge.
2. Related Work
2.1 Relevant Work
2.1.1 Measuring Vocabulary Knowledge
An important indicator of a second language speaker’s lexical proficiency is their
vocabulary knowledge, which forms the foundation for their proficiency in writing, reading,
listening, and speaking (Pulido & Hambrick, 2008). The present study utilizes
receptive word knowledge as a measure of L2 vocabulary knowledge. Receptive word
knowledge includes aspects such as knowing what a word looks and sounds like, knowing
what parts of a word are recognizable, and knowing the meaning of a word. A common tool
used to measure receptive word knowledge is the word recognition task (DeKeyser, 2001;
Leow et al., 2014; Meara & Milton, 2002), where an L2 speaker is typically presented with a
word stimulus and asked to indicate whether they recognize the stimulus as an actual word
within the language or not (Berger et al., 2019). Measures of L2 response times and their
accuracy ratings on this task are recorded and have been found to have high correlations
with more demanding in-depth language tests (Ferré & Brysbaert, 2017; Harrington & Carey,
2009; Lemhöfer & Broersma, 2012; Meara & Buxton, 1987; Zhang et al., 2019). These
measures on the word recognition task have also been used to form lists of word ranking
which were able to indicate word difficulty, how frequently an L2 speaker is exposed to the
word within the language (Brysbaert et al., 2019), as well as how useful the L2 speaker
perceives the word or how motivated they are to learn the word (He & Godfroid, 2019). In
their study, Brysbaert, Keuleers, and Mandera (2019) found that L2 word rank had the
highest correlation with word difficulty measures based on human ratings (r = .60), followed
by log of word frequency (r = -.52) based on the COCA corpus (Gardner & Davies, 2014), as
well as word usefulness based on human ratings (r = -.47). Moreover, the word rank as a
measure for L2 English word knowledge had a correlation of .68 with word knowledge from a
similar study by Hashimoto and Egbert (2019). Thus, this study will utilize word ranks as a
measure for L2 receptive word knowledge.
2.1.2 Lexical characteristics as Predictors for L2 Vocabulary Knowledge
Although larger vocabulary size indicates higher L2 lexical proficiency, not all words
need to be known for successful language understanding. L2 speakers would normally only
acquire between 2,500 and 3,000 of the most frequent word families (i.e., a word and its various
inflections and derivations) of a language (Cobb, 2007; 2016; Webb & Chang,
2012), compared to the 11,100 word families typically known by native (L1) speakers
(Brysbaert et al., 2016). More frequent words tend to be processed more quickly, both for L1
and L2 speakers (Ellis, 2002; 2003), and are typically learned by L2 speakers at an earlier
point in time compared to less frequent words (O’Dell et al., 2000; Schmitt et al., 2001). This
finding is further affirmed by several classic and well-known studies on the effect of lexical
characteristics on lexical proficiency, such as the English Lexicon Project and British Lexicon
Project (Monsell et al., 1989; Yap et al., 2008; Balota et al., 2007; Keuleers et al., 2012;
Mandera et al., 2020). Some of the other important lexical characteristics which have been
found to be significant predictors for L2 word knowledge include word length, number of
syllables of a word, and average age of acquisition of a word (Balota et al., 2007; Mandera et
al., 2020). Despite this, Brysbaert et al. (2016) summarize the drawbacks of word frequency
as a predictor for L2 word knowledge as follows: “word frequency is based on the
assumption that all words encountered by a speaker are of the same weight; lack of
language representation in the corpora from which word frequency measures are taken; and
that receptiveness of a word also depends on the motivation of the learner to know the word”
(pp. 27-28). Keeping this in mind, there is room for further study on variables which could
better predict and serve as more reliable measures for L2 vocabulary knowledge.
2.1.3 L1 Psycholinguistic Measures as Predictors for L2 Vocabulary Knowledge
Various studies have looked into the relationship between L1 and L2 lexical
proficiency. These studies oftentimes focus on the dynamics between L1 and L2 reading
and writing abilities for individuals. For example, Asfaha et al. (2009) revealed that L1
reading comprehension and L2 language proficiency significantly predict L2 reading
proficiency. Meanwhile, Lee & Schallert (1997) found that L2 proficiency is a better predictor
for L2 reading ability compared to L1 reading ability. Studies with similar findings on the
relationship between L1 and L2 writing and reading skills were conducted by Sparks et al.
(2012) and Pae (2019). However, little work has been done on L2 vocabulary knowledge
and how this can be predicted by L1 vocabulary. One such study is Brysbaert, Lagrou, and
Stevens (2017), who were able to confirm the lexical entrenchment hypothesis
for L1 and L2 English speakers, observing that the frequency effect has a stronger impact on
second language speakers compared to measures derived from native language speakers.
Another such study is that of Berger et al. (2019), which attempted to predict L2 English
productive word knowledge with psycholinguistic measures derived from L1 English speaker
performance on word recognition and lexical decision tasks. Findings from this study
suggest that L2 speakers with higher language proficiency (as assessed by human raters)
produce English words that are recognized and named more slowly and less accurately by
L1 speakers. Results from the longitudinal study within this paper conclude that L2 speakers
produce less frequent words over time, as well as words which are recognized or named
less accurately and more slowly by L1 speakers.
2.1.4 Machine Learning Approach to Lexical Proficiency Predictive Analysis
While there has been growing interest in the utilization and exploration of machine
learning approaches to predict lexical proficiency (Petersen & Ostendorf, 2009; Santos et al.,
2012; Yang et al., 2016), most studies adopting this approach have used corpora-based
measures or measures obtained through factorial and classroom designs as data sources.
Moreover, measures of L2 proficiency utilized within these studies were predominantly
measures based on human assessment. For example, Arnold et al. (2018) predicted CEFR
levels of L2 English learners using corpora-based lexical and syntactic metrics. The
performances of the Gradient Boosted Trees and the Neural Network were compared in a
pairwise-classification task of L2 proficiency classes of adjacent CEFR levels. From a

linguistic point of view, the most distinguishable marks in proficiency among second
language learners arise from adjacent classes. Key findings of the study highlighted the
higher performance of the gradient boosted tree compared to the neural network and that
task-based corpora entail strong overfitting. Aryadoust & Baghaei (2016) attempted to
predict L2 English learners’ reading ability on the basis of their lexical and grammatical
knowledge using a neural network. Results from this study showed that the neural network
was able to classify L2 readers based on their grammatical and vocabulary knowledge with
an accuracy level of approximately 78% and confirmed previous findings on the relationship
between these variables. However, limitations were acknowledged regarding the reliability of
the Rasch model utilized to derive proficiency measures. Lastly, Kerz et al. (2021)
highlighted the advantages of automatically generated L2 proficiency metrics produced by a
recurrent neural network (RNN) compared to metrics based on human assessment.
2.2 The Current Study
The presented body of literature leaves room for further studies within the domain of
L2 lexical proficiency. While previous studies have analyzed the relationship between
receptive word knowledge and various lexical characteristics, little research has been done
to model the relationship between L1 English and L2 English lexical proficiency, particularly
for receptive word knowledge. Therefore, this study aims to replicate previous findings
showing the predictive power of lexical characteristics on L2 receptive word knowledge, and
add a unique finding to existing literature by analyzing the predictive power of L1
psycholinguistic measures on L2 receptive word knowledge. Moreover, the previously
mentioned studies utilizing machine learning to predict second language proficiency have
predominantly used information obtained from corpora, conventional language tests, or
experiments. This study will thus add a unique contribution by adopting the machine learning
approach to predict L2 lexical proficiency using data from a mega study.

3. Methods
3.1 Modeling Approach
The current study utilized a machine learning approach to answer the presented
research questions, where the performance of a simpler machine learning algorithm was
compared with that of a more complex one. The simpler model tested was linear
regression, as it is a well-established model that is easy to interpret (Stigler, 1986). The
feed-forward neural network was selected as the more complex machine learning model to
predict lexical proficiency as it may have a few advantages over linear
regression: it is more adaptive, can model non-linear functions, and does not
need assumptions about the data or the relationships being modeled (Haykin, 1998;
Hornik et al., 1989).
3.1.1 Linear Regression
Linear regression is one of the most widely accepted and classical algorithms for
prediction, dating back more than 100 years (Stigler, 1986; Yan & Su, 2009). The
purpose of a linear regression model is to examine the relationship between a set of independent
variables and a dependent variable by estimating the parameters of the regression equation that
minimize its loss function (Yan & Su, 2009). The most prevalent method is to
minimize the sum of squared residuals of the estimated equation. The linear regression
equation can be depicted as follows:
$$Y_i = f(X_i, \beta) + e_i$$
where $Y_i$ is the dependent variable, $X_i$ is the set of corresponding independent
variables, $\beta$ the estimated parameters of the least squares regression equation, and $e_i$ the
residuals, or differences between the observed and fitted values of the dependent variable.
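As a minimal sketch of least-squares estimation, assume a single predictor and the linear form Y = b0 + b1 * X. The function name and the toy values below are hypothetical and are not taken from the thesis code; they only illustrate how the parameters that minimize the sum of squared residuals can be obtained in closed form.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return the least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy data lying exactly on y = 2 + 3x.
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
b0, b1 = fit_simple_ols(x, y)

# Residuals: observed minus fitted values.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

With several predictors the same idea applies, but the estimate is usually obtained with a matrix solver rather than these scalar formulas.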
3.1.2 Neural Network
The goal of the neural network is to approximate some function, such as y = f2(z; u, c) as
depicted in Figure 1, by mapping an input x to a value of y based on the calculations of
activation functions within each layer. This is done by learning the values of parameters of the
network (in this case, w, b and u, c) that result in the best function approximation (Goodfellow
et al., 2016).
Figure 1
Structure of Feed-Forward Neural Network
[Figure: an input layer (x), a hidden layer computing z = f1(x; w, b), and an output layer computing y = f2(z; u, c)]
A loss function is chosen to evaluate how well the model is able to predict true values
of the dataset based on the nature of the problem. The difference between the network’s
output and the expected output, as calculated by the loss function, is minimized by computing
its gradients, which inform the model how to update the parameters. In order to learn the
best parameter values, the derivative of the prediction error with respect to each parameter
is calculated with the backpropagation algorithm. For example, we can express the
backpropagation of the prediction error of the network in Figure 1 with the following
equation:
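$$\frac{\partial loss}{\partial w} = \frac{\partial loss}{\partial y} \times \frac{\partial y}{\partial z} \times \frac{\partial z}{\partial w}$$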
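The chain rule above can be checked numerically on a one-unit version of the network in Figure 1. The sketch below assumes a tanh hidden activation, a linear output, and a squared-error loss; none of these choices is stated in this section, and all parameter values are made up for illustration.

```python
import math

def forward(x, w, b, u, c):
    z = math.tanh(w * x + b)  # hidden layer: z = f1(x; w, b), tanh assumed
    y = u * z + c             # output layer: y = f2(z; u, c), linear assumed
    return z, y

def loss(x, target, w, b, u, c):
    _, y = forward(x, w, b, u, c)
    return (y - target) ** 2  # squared-error loss, assumed

def grad_w(x, target, w, b, u, c):
    """d(loss)/dw as the product of the three chain-rule factors."""
    z, y = forward(x, w, b, u, c)
    dloss_dy = 2.0 * (y - target)
    dy_dz = u
    dz_dw = (1.0 - z ** 2) * x  # derivative of tanh(w * x + b) w.r.t. w
    return dloss_dy * dy_dz * dz_dw

# Hypothetical input, target, and parameter values.
params = dict(x=0.5, target=1.0, w=0.3, b=-0.1, u=0.8, c=0.2)
analytic = grad_w(**params)

# Central finite difference as an independent check of the gradient.
eps = 1e-6
plus = dict(params, w=params["w"] + eps)
minus = dict(params, w=params["w"] - eps)
numeric = (loss(**plus) - loss(**minus)) / (2 * eps)
```

The analytic and finite-difference gradients should agree closely, which is a common sanity check when implementing backpropagation by hand.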
3.1.3 Models
This study applied one baseline model, five linear regression models, and five neural network
models:
1. Baseline model: Zero-rule algorithm
The zero-rule algorithm is one of the most commonly used baseline algorithms in
machine learning that takes information about the targets in a problem in order to
create a rule for predictions (Zhou, 2019). The mean value of the targets, L2 English
receptive word knowledge, was used as the default prediction, as it represents the
central tendency for the output.
2. Models 1a and 1b: The linear regression model and neural network trained on word
frequency
3. Models 2a and 2b: The linear regression model and neural network trained on
common lexical characteristics (excluding word frequency)
4. Models 3a and 3b: The linear regression model and neural network trained on L1
English psycholinguistic behavior measures of response times and accuracy rates on
lexical decision tasks
5. Models 4a and 4b: The linear regression model and neural network trained on word
frequency and other common lexical characteristics
6. Models 5a and 5b: The linear regression model and neural network trained on L1
English psycholinguistic behavior measures and common lexical characteristics
(excluding word frequency)
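The zero-rule baseline in the list above can be sketched in a few lines: every prediction is the mean of the training targets. The root mean squared error used to score it here is an assumption for illustration, since the evaluation metric is not named in this section, and the target values are made up.

```python
import math
from statistics import mean

def zero_rule_predict(train_targets, n_predictions):
    """Predict the mean of the training targets for every test item."""
    baseline = mean(train_targets)
    return [baseline] * n_predictions

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

# Hypothetical word-rank targets, not data from the thesis.
train_targets = [10, 20, 30]
test_targets = [18, 22]
predictions = zero_rule_predict(train_targets, len(test_targets))
baseline_error = rmse(test_targets, predictions)
```

Any trained model is then expected to achieve a lower error than this baseline on the same test items.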
3.2 Predictors
Three groups of predictor variables were used within this study, namely word frequency, lexical
characteristics, and L1 psycholinguistic measures.
3.2.1 Word Frequency
Word frequency is a measure of how often a word is used within a language based
on a specific corpus. This study used word frequency measures from the SUBTLEX-US
database based on 51 million words in American English subtitles (Brysbaert & New, 2012).
The number of times that a word appears within the corpus is expressed as Zipf scores, a
standardized measure of frequency independent of corpus size (van Heuven et al., 2014).
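As a rough illustration, a Zipf score can be computed as the base-10 logarithm of a word's frequency per million words plus 3. This simplified sketch omits any smoothing for rare or unseen words that published Zipf values may include, and the counts below are made up.

```python
import math

def zipf_score(count, corpus_size):
    """Simplified Zipf score: log10(frequency per million words) + 3."""
    per_million = count / (corpus_size / 1_000_000)
    return math.log10(per_million) + 3

# Hypothetical counts in a 51-million-word corpus.
rare = zipf_score(51, 51_000_000)        # 1 occurrence per million words
common = zipf_score(51_000, 51_000_000)  # 1,000 occurrences per million words
```

Because the count is divided by corpus size before taking the logarithm, the same word receives a comparable score in corpora of different sizes.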
3.2.2 Lexical characteristics
Lexical characteristics is a group of predictors consisting of 8 variables. The first
two variables capture word length, measured as the number of letters (length) and the
number of syllables that make up a word (Mandera et al., 2020). Number of morphemes
counts the smallest units of meaning within a word, while number of phonemes
represents the phoneme count of a word (Balota et al., 2007). Orthographic Levenshtein
distance (OLD) illustrates how similarly a word is written compared to other words in the
corpus, while phonological Levenshtein distance (PLD) shows how similar a word sounds
to other words (Balota et al., 2007). Age of acquisition (AoA) indicates the
average age at which a word is typically learned (Brysbaert, 2012). Lastly, concreteness
ratings express the degree to which what a word refers to can be perceived directly by
language speakers (Brysbaert et al., 2016).
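To make the Levenshtein-based measures concrete, the sketch below computes the edit distance between two strings and averages the distances to a word's closest entries in a lexicon. It assumes the common definition in which the measure is the mean distance to the 20 closest words; the toy lexicon and function names are hypothetical. The same logic would apply to PLD if phoneme sequences were used instead of letters.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def mean_distance_to_closest(word, lexicon, n_closest=20):
    """Mean edit distance from word to its n_closest nearest lexicon entries."""
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:n_closest]
    return sum(closest) / len(closest)


# Toy lexicon (hypothetical); real norms use a lexicon of many thousands of words.
toy_lexicon = ["cat", "cot", "coat", "dog", "cart"]
old_cat = mean_distance_to_closest("cat", toy_lexicon, n_closest=2)
```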
3.2.3 L1 English Psycholinguistic Behavior
The L1 English psycholinguistic behavior measures are a set of predictor variables
comprising the response times and accuracy ratings of participants who selected English
as their native language in an online vocabulary test. Participants completed a word
recognition task in which they were presented with a sequence of existing English words
and non-words, and were instructed to respond with ‘yes’ or ‘no’ depending on whether they
knew the word being presented to them (Mandera et al., 2020). The mean and standard
deviation of response times to correctly identified words were recorded and their z-scores
were calculated. These z-score measures of the mean and standard deviation of response
times were used in this study. Accuracy ratings represent the percentage of correct
answers for each word.
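One way to picture the z-scoring step is sketched below. It assumes that response times are standardized within each participant before being averaged per word, which is one reading of the procedure and not a confirmed detail of the source datasets; the trial records and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# (participant, word, response time in ms) for correct responses only; toy values.
trials = [
    ("A", "cat", 500), ("A", "dog", 700), ("A", "owl", 600),
    ("B", "cat", 800), ("B", "dog", 1200), ("B", "owl", 1000),
]


def zscore_within_participant(trials):
    """Express each response time relative to that participant's mean and SD."""
    by_participant = defaultdict(list)
    for participant, _, rt in trials:
        by_participant[participant].append(rt)
    stats = {p: (mean(rts), stdev(rts)) for p, rts in by_participant.items()}
    return [(p, w, (rt - stats[p][0]) / stats[p][1]) for p, w, rt in trials]


def mean_z_per_word(z_trials):
    """Average the standardized response times over participants for each word."""
    by_word = defaultdict(list)
    for _, word, z in z_trials:
        by_word[word].append(z)
    return {word: mean(zs) for word, zs in by_word.items()}


word_z = mean_z_per_word(zscore_within_participant(trials))
```

In this toy example participant B is slower overall, yet both participants yield the same standardized values per word, which is the effect the z-scores are meant to achieve.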
3.3 Target
To measure L2 English receptive word knowledge, the present study utilized a ranking of
English words based on L2 English speakers’ accuracy ratings and response times in the
word recognition task, which has been found to be a good indicator of L2 vocabulary
acquisition (Brysbaert et al., 2020). Words are ranked from 1 to 18,236 by the
percentage of participants who knew each word, with ties broken by the mean word
recognition response time.
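The ranking logic can be sketched as a sort on two keys. The sketch assumes that rank 1 goes to the best-known word and that, among ties, the faster mean response time ranks first; both directions are assumptions, and the words and values are hypothetical.

```python
# Toy per-word summaries (hypothetical values).
words = [
    {"word": "house", "pct_known": 0.99, "mean_rt": 850.0},
    {"word": "table", "pct_known": 0.99, "mean_rt": 900.0},
    {"word": "gauche", "pct_known": 0.41, "mean_rt": 1300.0},
    {"word": "window", "pct_known": 0.97, "mean_rt": 880.0},
]

# Sort by percentage known (descending), then by mean response time (ascending).
ordered = sorted(words, key=lambda w: (-w["pct_known"], w["mean_rt"]))
ranks = {w["word"]: position for position, w in enumerate(ordered, start=1)}
```

Here "house" and "table" are tied on percentage known, so the response time decides their order.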
4. Experimental Setup
4.1 Dataset Description
4.1.1 Data Sources
The present study used a combined dataset from three data sources. The behavioral data
were collected through the English Crowdsourcing Project (ECP) (Mandera et
al., 2020), a megastudy that is part of a set of online vocabulary tests organized by Ghent
University. The English vocabulary test was first conducted in 2014 (Brysbaert et al., 2016).
It is still running and can be accessed at http://vocabulary.ugent.be/. Participants
are presented with a sequence of 100 stimuli, of which 70 are
existing English words and 30 are pseudowords. They are instructed to indicate whether
they recognize each stimulus and are informed that incorrect answers will be
penalized so as to discourage guessing during the test. The complete set of instructions
for the online vocabulary test is described in Brysbaert et al. (2020) and Mandera et al. (2020).
Two different versions of the ECP dataset were used. Psycholinguistic measures for
L1 English speakers were taken from the ECP dataset described in Mandera,
Keuleers, and Brysbaert (2020), which presents word recognition times and accuracy
ratings of native English speakers for 62,000 English words. L2 English receptive
word knowledge, as measured by word rankings, was derived from the accuracy ratings and
recognition times of non-native English speakers for the same 62,000 English words, as
described in Brysbaert, Keuleers, and Mandera (2020).
Lastly, lexical characteristics data in this study were obtained from the Word Features
Analysis of the ELP and ECP for the Open Science Framework (Brysbaert & Mandera,
2019). This dataset provides lexical characteristics for the English words that the ECP
and the English Lexicon Project (ELP) datasets have in common, including the following
information: word length, word frequency, orthographic and phonologic neighbors,
orthographic and phonologic distance, number of morphemes, number of phonemes, number of
syllables, age of acquisition, word prevalence, and concreteness. Intersecting the three
data sources resulted in a dataset of 18,236 English words, which was utilized for this
study.
4.1.2 Data Preprocessing
This study followed the same data cleaning procedures as those applied to the L1
and L2 ECP datasets in Mandera et al. (2020) and Brysbaert et al. (2020). Data from
participants who indicated that their native language was English were included in the L1
ECP dataset, and data from those who indicated that their native language was not English
were included in the L2 ECP dataset. Consequently, only data from respondents who completed
the person-related questions were included. Outliers and irregular values were
removed from the dataset. Only the first 3 sessions from each IP address were included to
avoid undue influence from individual participants. Reaction times of the first 9 trials
of each session were deleted as they were considered training trials.
Additionally, for non-native English speakers, sessions with more than two ‘yes’
responses to non-words were removed to limit guessing by participants
due to word similarity, which corresponds to a maximum guessing rate of 7%, or 2 of the
30 non-words (Brysbaert et al., 2020). Lastly, from the remaining observations in the
dataset, only English words with available lexical characteristics data in the English
Lexicon Project were included in the analysis. These steps pruned the initial dataset of
61,851 English words to 18,236 words.
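The session-level rules described above can be sketched as three small filters. The record layout (fields such as `ip`, `start`, `trials`, `is_word`, `said_yes`) and the function names are hypothetical and serve only to illustrate the rules; they do not describe how the source datasets were actually processed.

```python
from collections import defaultdict

MAX_SESSIONS_PER_IP = 3      # only the first 3 sessions per IP address are kept
TRAINING_TRIALS = 9          # response times of the first 9 trials are discarded
MAX_FALSE_ALARMS = 2         # L2 sessions with more 'yes' responses to non-words are dropped
NONWORDS_PER_SESSION = 30    # each session shows 30 non-words among 100 stimuli


def keep_first_sessions(sessions):
    """Keep the earliest sessions per IP address (hypothetical 'ip'/'start' fields)."""
    kept, seen = [], defaultdict(int)
    for session in sorted(sessions, key=lambda s: s["start"]):
        if seen[session["ip"]] < MAX_SESSIONS_PER_IP:
            kept.append(session)
            seen[session["ip"]] += 1
    return kept


def passes_guessing_check(session):
    """True when the session has at most two 'yes' responses to non-words."""
    false_alarms = sum(
        1 for trial in session["trials"] if not trial["is_word"] and trial["said_yes"]
    )
    return false_alarms <= MAX_FALSE_ALARMS


def trials_with_usable_rt(session):
    """Drop the training trials at the start of a session."""
    return session["trials"][TRAINING_TRIALS:]


# Highest tolerated rate of 'yes' responses to non-words: 2 of 30, roughly 7%.
max_false_alarm_rate = MAX_FALSE_ALARMS / NONWORDS_PER_SESSION
```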
For the purposes of this study, several variables were further transformed or omitted
from the analysis. For word recognition times, z-scores of both L1 and L2 response times
were used instead of their raw measures in order to cancel out differences in average
response times per participant (Mandera et al., 2020; Brysbaert et al., 2020). Word
prevalence was removed from the analysis as it is a measure of how well each word is known
based on the complete ECP dataset (Brysbaert, 2019; Brysbaert, 2016; Keuleers, 2015), and
therefore overlaps with the word ranking being predicted within this study. Furthermore,
orthographic and phonologic neighbors, or the number of words that are written or sound
similar to a word (Balota et al., 2007), were excluded from the analysis as orthographic
and phonologic Levenshtein distances have been found to be better measures of a word’s
similarity to other words. Both OLD and PLD compare a word not only with its neighbors,
but with all other words within the corpus (Yarkoni et al., 2008). All variables were
aggregated per word and standardized to control for differences in the units of
the features, with the exceptions of the z-scores
of L1 recognition times, L1 accuracy ratings, and L2 word rankings. A summary of the
distribution of all predictors included within the models as well as their relationships with the
target can be found in Appendix A.
4.2 Experimental Procedure
Linear regression and artificial neural network algorithms were used to conduct
the analysis within this study. Data analysis was performed in Python 3.8, while data
cleaning and preprocessing were conducted in both Python 3.8 and R 4.10. In R 4.10, the
dplyr library (Wickham & Wickham, 2020) was used for dataset cleaning. In Python 3.8, this
study used the following libraries and modules for data cleaning, preprocessing, and
analysis: pandas (McKinney, 2011), os, numpy (Oliphant, 2006), scikit-learn (Kramer, 2016),
random, math, matplotlib (Ari & Ustazhanov, 2014), and tensorflow (Abadi et al., 2016).
The dataset used for this study takes the form of a data frame with 18,236 rows and
13 columns. In total, there are 12 predictors as input to the models. All variables were
converted to two-dimensional numpy arrays and 30% of the entire dataset was left out as
the test set using the train-test split function from scikit-learn. This resulted in a
train set with a total of 12,765 observations and a test set with 5,471 observations.
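The reported split sizes can be reproduced with a short sketch in plain Python. It assumes that the test-set size is rounded up, which is how scikit-learn treats a fractional test size to the best of our knowledge; the function name and seed are hypothetical, and the shuffle will not reproduce the study's actual partition.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    n_test = math.ceil(n_samples * test_fraction)  # assumed rounding rule
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(18236)
```

With 18,236 rows, 30% corresponds to 5,470.8 rows, which rounds up to 5,471 test observations and leaves 12,765 for training.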
The linear regression model was fitted with the Lasso module from the scikit-learn
library. The neural network was built with the KerasRegressor wrapper, which exposes a
Keras model through the scikit-learn estimator interface, and utilized a custom function
to select the optimal hyper parameters for the
network during training. The network takes as input a 2-dimensional numpy
array corresponding to the predictors included in each model. The last dense layer
of the network outputs a linear value of the target, a word ranking value between 1 and
18,236. For all hidden dense layers, ReLU was used as the activation function. The model is
compiled with an optimization function that is selected during the tuning process.
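The forward pass of such a network can be sketched without any deep learning library. The example below uses ReLU in the hidden layer and a linear output node, as described above; the weights, layer sizes, and function names are hypothetical toy values and do not correspond to the tuned architectures.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row of input weights per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def forward(inputs, hidden_layers, output_layer):
    """Apply ReLU after every hidden layer and no activation after the output layer."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = output_layer
    return dense(activations, out_weights, out_biases)[0]


# Toy network with one hidden layer of two nodes and a single linear output node.
hidden = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output = ([[2.0, 3.0]], [0.5])
prediction = forward([1.0, -2.0], hidden, output)
```

The negative input is set to zero by ReLU in the hidden layer, while the output node returns an unbounded value, which suits a regression target such as a word rank.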
4.2.1 Model Evaluation & Hyper Parameters
Five tasks were conducted to answer the research questions presented in this study.
The mean absolute error (MAE) was used to evaluate the model performances. The MAE
averages the absolute differences between predicted and expected
output over the test sample, so that each individual difference carries equal weight.
Unlike squared-error metrics, it therefore does not penalize large errors from outliers
disproportionately. Since outliers had previously been removed from
the dataset for this study, the MAE is a suitable metric for the regression problem
presented. Moreover, it is more intuitive to interpret than, for example, the mean
squared error, as it is expressed in the same units as the target.
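The difference between the two metrics can be shown with a small worked example. The values below are hypothetical: three predictions are off by 10 rank positions and one is off by 50.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ranks (hypothetical): absolute errors are 10, 10, 10, and 50.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 450]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Share of the total error contributed by the single largest miss.
share_in_mae = 50 / (10 + 10 + 10 + 50)
share_in_mse = 50 ** 2 / (3 * 10 ** 2 + 50 ** 2)
```

The largest miss accounts for 62.5% of the summed absolute error but for roughly 89% of the summed squared error, which illustrates why the MAE is less sensitive to a few large errors.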
For both the linear regression and neural network models to be able to estimate the
best set of parameters, several hyper parameters need to be determined beforehand and
optimized. For the Lasso linear regression, this hyper parameter is the regularization
strength λ (named alpha in scikit-learn), which scales the penalty on the coefficients of
the regression model. For the
neural network, these include hyper parameters related to the structure of the network as
well as the learning of the model (Diaz et al., 2017). To determine the structure of the
neural network, the number of layers and the number of nodes per layer were tuned. To
optimize how the network learns, the optimization function, batch size, and number of
epochs were tuned. The grid search algorithm was used to optimize these hyper parameters
using the mean squared error as a loss function.
Four-fold cross-validation was conducted on the train set to tune the hyper
parameters, and the selected models were then tested on the subset of data which had been
initially left out. For the linear regression, a total of 5 lambda value candidates were
fitted across 4 folds, totaling 20 fits. For the
neural network, a total of 96 parameter candidates were fitted across 4 folds, totaling
384 fits. A summary of the tested hyper parameters for both models and the best-
performing neural network architectures can be found in Appendix B.
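The number of fits follows from multiplying the number of candidates by the number of folds. The sketch below enumerates a grid and builds four folds in plain Python. The grid values are hypothetical and were chosen only so that the candidate counts match the totals reported above; the actual candidates are listed in Appendix B.

```python
from itertools import product

N_FOLDS = 4

# Hypothetical grids; only the number of candidates (5 and 96) follows the text.
lasso_grid = {"alpha": [0.0, 0.01, 0.1, 1.0, 10.0]}
network_grid = {
    "batch_size": [32, 64],
    "epochs": [30, 60],
    "n_hidden_layers": [2, 3],
    "first_layer_nodes": [32, 64],
    "last_layer_nodes": [8, 32, 64],
    "optimizer": ["adam", "sgd"],
}


def candidates(grid):
    """All combinations of the values in a hyper parameter grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for contiguous folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


lasso_fits = len(candidates(lasso_grid)) * N_FOLDS
network_fits = len(candidates(network_grid)) * N_FOLDS
```

Each candidate is trained once per fold on three quarters of the train set and validated on the remaining quarter, so the held-out test set plays no role in the selection.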

4.2.2 Tasks
Task 1. The first task served to provide a baseline performance as comparison with the
predictive power of the variables of interest using the zero rule algorithm. The rule adopted
was to use the mean value of the train targets, L2 English receptive word knowledge, as the
default prediction.
Task 2. The aim of the second task was to evaluate the predictive power of word
frequency on L2 English receptive word knowledge by comparing the performances of two
algorithms with the baseline performance, where Model 1a is the linear regression model
and Model 1b the neural network.
Task 3. The third task aimed to evaluate the predictive power of commonly used lexical
characteristics (excluding word frequency) on L2 English receptive word knowledge by
comparing performances of the linear regression and neural network models with the
baseline performance. The lexical characteristics set of predictors includes word length,
number of syllables, number of morphemes, number of phonemes, orthographic Levenshtein
distance, phonological Levenshtein distance, age of acquisition, and word concreteness.
Task 4. The fourth task aimed to evaluate the predictive power of L1 psycholinguistic
measures on L2 English receptive word knowledge by comparing Models 3a and 3b’s
performances with the baseline performance. L1 psycholinguistic measures comprise
the z-scores of L1 English response times and the accuracy ratings recorded during the word
recognition task.
Task 5. The fifth task aimed to compare the predictive power of word frequency
and L1 psycholinguistic measures on L2 English receptive word knowledge. Two linear
regression models and two neural networks were run to execute this task. In Models 4a and
4b, the linear regression and neural network were trained and tested on word frequency and
lexical characteristics. For Models 5a and 5b, the linear regression and neural network were
trained on L1 psycholinguistic measures and lexical characteristics. The predictive
performances of these two sets of models were compared against the baseline
prediction.
5. Results
A total of 11 models, including the baseline, were trained on various predictors as
stated in the task descriptions in Section 4.2.2. An overview of all model performance scores
is given in Table 1 below. Overall, the regression tasks showed
that the neural network outperformed the linear regression model in predicting L2 receptive
word knowledge for all sets of predictors, with the exception of Model 3b, which takes
L1 psycholinguistic measures as predictors. The best-performing estimator is Model 4b, the
neural network trained on lexical characteristics and word frequency, with an MAE of
2,742.29, an improvement of 1,777.98 units over the baseline model. When comparing
the performances of Models 4 and 5, both the linear regression and neural network indicate
that the models trained on lexical characteristics and word frequency perform better than
those trained on lexical characteristics and L1 psycholinguistic measures, confirming findings
from prior studies on the importance of word frequency as a predictor for L2 proficiency
(Berger et al., 2019; Keuleers et al., 2012; Mandera et al., 2020).
To check the robustness of the neural networks, train and validation losses of each
model were visualized during the cross-validation process to monitor for any indications of
over- or underfitting (Goodfellow et al., 2016). Based on the visualization results, no
indications of overfitting were found. Moreover, a comparison of the performance scores
on the train data and test data for all models did not indicate large differences. All
visualizations and calculations related to the model robustness checks can be found in
Appendix C. In the next section, all results will be presented and elaborated for each task.
Table 1
Results of model predictions
Model     Predictors                                          Algorithm  MAE score  Improvement from baseline
Baseline  Average of target values                            Zero rule  4,520.27   -
1a        Word frequency                                      LR         3,473.90   1,013.32
1b        Word frequency                                      NN         3,413.73   1,106.54
2a        Lexical characteristics                             LR         3,729.27   766.39
2b        Lexical characteristics                             NN         3,678.51   841.76
3a        L1 psycholinguistic measures                        LR         3,633.03   900.12
3b        L1 psycholinguistic measures                        NN         3,854.81   665.46
4a        Lexical characteristics + word frequency            LR         2,973.21   1,509.24
4b        Lexical characteristics + word frequency            NN         2,742.29   1,777.98
5a        Lexical characteristics + L1 psycholinguistic       LR         3,029.18   1,473.77
          measures
5b        Lexical characteristics + L1 psycholinguistic       NN         2,797.58   1,722.69
          measures
Note. LR = linear regression; NN = neural network.
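The improvement column can be read as the baseline MAE minus a model's MAE. The sketch below applies this to the values reported in Table 1 and in Sections 5.2 to 5.5. For the neural network rows the test MAE from the table reproduces the reported improvement; for the linear regression rows the reported improvement appears to correspond to the validation MAE stated in the task sections, which is an observation from the numbers and not something the text states explicitly.

```python
BASELINE_MAE = 4520.27

# Test MAE of the neural networks, as listed in Table 1.
nn_test_mae = {"1b": 3413.73, "2b": 3678.51, "3b": 3854.81, "4b": 2742.29, "5b": 2797.58}

# Validation MAE of the linear regressions, as reported in Sections 5.2 to 5.5.
lr_validation_mae = {"1a": 3506.95, "2a": 3753.88, "3a": 3620.15, "4a": 3011.03, "5a": 3046.50}


def improvement(mae_by_model):
    """Baseline MAE minus model MAE, rounded to two decimals."""
    return {model: round(BASELINE_MAE - mae, 2) for model, mae in mae_by_model.items()}


nn_improvement = improvement(nn_test_mae)
lr_improvement = improvement(lr_validation_mae)
```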
5.1 Task 1
The first task aimed to predict L2 receptive word knowledge based on the mean
value of the train targets as the baseline prediction. This prediction was compared with the
true values of the test data and resulted in a mean absolute error score of 4,520.27.
5.2 Task 2
The second task aimed to evaluate the predictive power of word frequency on L2
English receptive word knowledge by comparing the performances of the linear regression
and neural network models with the baseline performance. The best-performing linear
regression estimator for Model 1a was selected with a validation MAE of 3,506.95 and
no regularization. The best-performing neural network for Model 1b was selected
with a validation MAE of 3,430.98. The following hyper parameters were selected as
optimal: a batch size of 32, 60 epochs, 64 first layer nodes, 8 last layer nodes, 3 hidden
layers, and Adam as the optimization function. Both models performed better than the
baseline prediction, with the neural network showing a slightly larger improvement over the
baseline prediction than the linear regression model. This indicates that word
frequency performs well as a predictor for L2 receptive word knowledge in
comparison to the baseline, confirming previous findings on the importance of word
frequency as a predictor for L2 English vocabulary knowledge acquisition.
5.3 Task 3
In the third task, lexical characteristics (excluding word frequency) were utilized to
predict L2 receptive word knowledge. The best-performing linear regression estimator for
Model 2a was selected with a validation MAE of 3,753.88 and no regularization.
The best-performing neural network for Model 2b was selected with a validation
MAE of 3,696.88. The following hyper parameters were selected as optimal: a batch size of
32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3 hidden layers, and
Adam as the optimization function. Both models performed better than the baseline
prediction, with the neural network showing a slightly larger improvement over the baseline
prediction than the linear regression model. This indicates that lexical characteristics
as a set of predictors perform well in predicting L2 receptive word knowledge
in comparison to the baseline, confirming findings from previous studies.
5.4 Task 4
For the fourth task, both models were trained on L1 psycholinguistic measures to
predict L2 receptive word knowledge. The best-performing linear regression estimator for
Model 3a was selected with a validation MAE of 3,620.15. The best-performing neural
network for Model 3b was selected with a validation MAE of 3,846.76. The following
hyper parameters were selected as optimal: a batch size of 32, 60 epochs, 32 first layer
nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function. Both
models performed better than the baseline prediction, this time with the linear regression
showing a slightly larger improvement over the baseline prediction than the neural
network model. This indicates that L1 psycholinguistic measures as a set of predictors
perform well in predicting L2 receptive word knowledge in comparison to the
baseline, adding a new finding to the existing literature.
5.5 Task 5
For task 5, two sets of linear regression and neural network models were trained in
order to compare the predictive performances of word frequency and L1 psycholinguistic
measures on L2 English receptive word knowledge. Models 4a and 4b were trained on word
frequency and other lexical characteristics, while Models 5a and 5b were trained on L1
psycholinguistic measures and other lexical characteristics. The best-performing linear
regression estimator for Model 4a was selected with a validation MAE of 3,011.03 and
no regularization. The best-performing neural network for Model 4b was selected
with a validation MAE of 2,759.93. The following hyper parameters were selected as
optimal: a batch size of 32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3
hidden layers, and Adam as the optimization function.
The best-performing linear regression estimator for Model 5a was selected with a
validation MAE of 3,046.50 and no regularization. The best-performing
neural network for Model 5b was selected with a validation MAE of 2,961.47 and the
following hyper parameters selected as optimal: a batch size of 32, 60 epochs, 64 first
layer nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function.
For both sets of models, the neural network consistently performed better
than the linear regression model. Moreover, Models 4a and 4b showed larger improvements
in prediction performance than Models 5a and 5b, with the neural network in each pair
improving slightly more than the linear regression model. This also
indicates slightly better performance of word frequency as a predictor for L2 English
receptive word knowledge compared to L1 psycholinguistic measures, as expected based on
findings from previous studies.

5.6 Post-Hoc Analysis
While all models showed better prediction performance compared to the baseline, a
large overall error remains present, which could indicate that the models systematically
under-predict the observed values of the targets. To gain further insight into the
performance of the models, additional analyses were performed on the best-performing set
of models, namely Models 4a and 4b, which utilized lexical characteristics and word
frequency as predictors. Visualizations made for the purpose of the post-hoc analysis can
be found in Appendix D.
Firstly, the distributions of the target values predicted by the linear
regression and neural network models were compared with that of the actual target values.
Both prediction distributions differed considerably from the actual distribution, which has
a skewness of -0.00034. The distribution of the values predicted by the neural network more
closely resembles the distribution of the actual targets, with a skewness of -0.296 compared
to -0.439 for the linear regression. Furthermore, the distribution of the
linear regression predictions indicates higher dispersion and the presence of more outliers.
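For reference, a skewness value of this kind can be computed as the third central moment divided by the second central moment raised to the power 1.5. This is one common estimator; which estimator was used for the values above is not stated, so the sketch below is an assumption, and the example data are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# A complete set of ranks is symmetric around its mean, so its skewness is zero.
symmetric = skewness(list(range(1, 101)))

# Predictions with a long tail toward low values have negative skewness.
left_skewed = skewness([1, 10, 10, 10])
```

A near-zero skewness for the actual ranks is therefore expected, while the negative values for both models indicate that their predictions have a longer tail toward low rank values.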
Secondly, the relationship between the predictions and the residuals of each model
was explored. A visualization of these relationships for the two models indicates that the
residuals of the neural network predictions tend to spread more evenly around
zero compared to those of the linear regression, as depicted in Figure 2 below. Further
investigation reveals a correlation coefficient of r = 0.00084 between the predictions and
residuals of the linear regression, and r = 0.0066 for the neural network.
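The correlation between predictions and residuals can be computed as sketched below. Residuals are assumed to be defined as true minus predicted values, and the toy numbers are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Toy example (hypothetical): residuals alternate in sign regardless of the prediction.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
residuals = [t - p for t, p in zip(y_true, y_pred)]

r_pred_resid = pearson_r(y_pred, residuals)
```

A coefficient close to zero only rules out a linear trend in the residuals; patterns such as a changing spread across the prediction range still need to be checked visually.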
Lastly, the coefficients of all predictors from the linear regression model were
extracted to analyze the importance of each variable. The coefficient of word frequency
had the largest magnitude among the predictors included within the model. This is
consistent with findings from existing literature on the importance of word frequency as a
predictor for L2 lexical proficiency. Additionally, the correlation coefficients between L2
receptive word knowledge and L1 psycholinguistic measures were calculated to further
investigate the exceptional behavior observed in Models 3a and 3b. It was found that L2
receptive word knowledge exhibited a correlation coefficient of r = 0.532 and r = 0.4128 with
mean L1 response time and standard deviation of L1 response time, respectively.
Figure 2
Relationship between residuals and predictions for the linear regression and neural network
models
[Two panels: Prediction vs Residual - Linear Regression; Prediction vs Residual - Neural Network]
6. Discussion
This study aims to answer the following main research question:
To what extent can lexical characteristics and measures derived from L1 English
psycholinguistic behavior predict L2 English receptive word knowledge?

With respect to the sub-research questions concerning the importance of
the sets of predictors included in this study, the results were consistent with
findings from existing studies. All models tested consistently showed higher predictive
performance than the baseline, confirming the validity and importance of the
predictors included within this study, namely word frequency, lexical characteristics, and L1
psycholinguistic measures (Balota et al., 2007; Keuleers et al., 2012; Mandera et al., 2020).
Moreover, based on findings from task 5, it was observed that both the linear regression and
neural network models fitted on lexical characteristics and word frequency performed better
than the models fitted on lexical characteristics and L1
psycholinguistic measures. This is consistent with findings from Brysbaert et al. (2017) and
Berger et al. (2019) that highlighted the stronger impact of word frequency on L2 lexical
proficiency compared to L1-derived measures.
When comparing the performances of the two algorithms, the neural network showed
better predictive performance than the linear regression model, with the exception of
the models fitted on L1 psycholinguistic measures. The dataset used within this study
consists of a large number of observations and features, which adds to its dimensionality.
High-dimensional data is known to negatively impact the performance of
algorithms such as linear regression (Lavergne & Patilea, 2008). Therefore, this result is
not unexpected. Post-hoc analysis was conducted to investigate the exception found in the
behavior of the models which included only L1 psycholinguistic measures as predictors. It
was found that the correlation coefficients of L2 receptive word knowledge with mean L1
response time and standard deviation of L1 response time were r = 0.532 and r = 0.4128,
respectively. These values are much higher than the correlation found between measures of
word knowledge in Hashimoto & Egbert (2019) and L1 ranking in Brysbaert et al. (2017),
namely r = 0.28.
Further post-hoc analysis also found a considerable difference between the distribution
of the true values of L2 receptive word knowledge and the values predicted by the linear
regression and neural network models. Performing classification tasks for L2 proficiency
could better capture the differences in knowledge between learners, as samples of second
language learners tend to exhibit imbalanced classes (Arnold et al., 2018; Aryadoust &
Baghaei, 2016; Sinclair et al., 2021). In this study, L2 receptive word knowledge was
treated as a continuous variable of ranking values from 1 to 18,236.
6.1 Limitations & Future Research
Several limitations arise due to the nature of the study approach adopted as well as
the dataset used. While utilizing the word ranking as a measure of L2 lexical proficiency
allows for the capture of additional dimensions such as learner motivation in learning a
word, it also has limitations. Due to the imbalanced nature of L2 proficiency-related
information, a linear rank of lexical proficiency may lose information relevant to
comparing language proficiency levels.
Utilizing big data collected from mega studies for this thesis presents several
advantages compared to data obtained from factorial design studies, such as improved
model predictive power due to the large number of word stimuli as well as the availability of
the data for use in other studies. However, due to the nature of the word recognition task
used for the dataset collection, the measure of word knowledge in this study is based solely
on yes or no decisions. While this is a valid measure for word knowledge (Ferré & Brysbaert,
2017; Harrington & Carey, 2009; Laufer et al., 2004; Nation, 2001; Zhang et al., 2019), it is
only able to capture a surface-level knowledge of words, and not, for example, whether a
second language speaker indeed understands the meaning of a word (Brysbaert et al.,
2019). Furthermore, the dataset used in the present study only includes words, while the
English language vocabulary is also comprised of multiword expressions (Brysbaert et al.,
2019).
Future studies could address these limitations, firstly by treating L2 word knowledge
as a categorical variable. This would allow for a comparison of L2 proficiency levels that
better reflects the true learning curve of second language learners (Arnold et al., 2018;
Aryadoust & Baghaei, 2016). Secondly, future studies may improve and expand on the
measure of vocabulary knowledge used within this study. Large datasets
with measures representing other dimensions of vocabulary knowledge, such as productive
word knowledge, depth of vocabulary, and knowledge of word meaning (Henriksen, 1999), are
scarce, leaving room for further contributions of such datasets. Lastly, future studies could
also include multi-word expressions in addition to single words as stimuli for large-scale
second language vocabulary testing, more accurately representing the make-up of the
English vocabulary.
7. Conclusion
This study aimed to test the predictive power of lexical characteristics and receptive
word knowledge measures of native English speakers on English second language
speakers’ receptive word knowledge. Results from this study have several scientific
implications. Existing literature has found lexical characteristics to be important predictors
of L2 English vocabulary as measured by both receptive and productive word knowledge.
The results showed that the models trained on word frequency predicted L2 English
receptive word knowledge better than the models trained on a
set of other commonly used lexical characteristics, which confirms previous findings. This
study found that L1 English receptive word knowledge performed better than the
baseline in predicting L2 receptive word knowledge, but performed worse than word
frequency, which confirms previous findings that word frequency is a more important
predictor than L1 psycholinguistic behavior. A comparison of the performance of different
machine learning models in predicting L2 receptive word knowledge based on information
from a megastudy is a novel contribution to existing studies. Overall, the neural network
showed better performance in predicting L2 receptive word knowledge than the
linear regression algorithm. From a practical perspective, findings from this study could
provide input on how to model word difficulty and word acquisition levels for English second
language speakers based on native English vocabulary knowledge and lexical
characteristics. Specifically, this study identified word frequency to be a stronger indicator of
word difficulty and word acquisition levels of L2 English speakers than the other
variables of interest. This provides useful insights for English second language vocabulary
learning as well as English text difficulty estimation.

Acknowledgements
I would like to express my gratitude to my supervisor at Tilburg University for his inputs and
guidance throughout the process of writing this thesis. I would also like to acknowledge my
appreciation to the second reader of this thesis as well as other parties that have contributed
to the completion of this thesis.

References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Zheng, X. (2016).
Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv
preprint arXiv:1603.04467.
Anderson, R. C., & Freebody, P. (1981). Effects of differing proportions and locations of
difficult vocabulary on text comprehension. Center for the Study of Reading Technical
Report; no. 202.
Ari, N., & Ustazhanov, M. (2014, September). Matplotlib in python. In 2014 11th International
Conference on Electronics, Computer and Computation (ICECCO) (pp. 1-6). IEEE.
Arnold, T., Ballier, N., Gaillat, T., & Lissón, P. (2018). Predicting CEFRL levels in learner
English on the basis of metrics and full texts. arXiv preprint arXiv:1806.11099.
Aryadoust, V., & Baghaei, P. (2016). Does EFL readers' lexical and grammatical knowledge
predict their reading ability? Insights from a perceptron artificial neural network study.
Educational Assessment, 21(2), 135-156.
Asfaha, Y. M., Beckman, D., Kurvers, J., & Kroon, S. (2009). L2 reading in multilingual
Eritrea: The influences of L1 reading and English proficiency. Journal of Research in
Reading, 32(4), 351-365.
Balota, D. A., & Chumbley, J. I. (1990). Where are the effects of frequency in visual word
recognition tasks? Right where we said they were! Comment on Monsell, Doyle, and
Haggard (1989).
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... &
Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-
459.
Balota, D. A., Aschenbrenner, A. J., & Yap, M. J. (2013). Additive effects of word frequency
and stimulus quality: the influence of trial history and data transformations. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 39(5), 1563.

Berger, C. M., Crossley, S. A., & Kyle, K. (2019). Using native-speaker psycholinguistic
norms to predict lexical proficiency and development in second-language
production. Applied Linguistics, 40(1), 22-42.
Berger, C., Crossley, S., & Skalicky, S. (2019). Using lexical features to investigate second
language lexical decision performance. Studies in Second Language Acquisition, 41(5), 911-
935.
Brysbaert, M., & Cortese, M. J. (2011). Do the effects of subjective frequency and age of
acquisition survive better word frequency norms?. Quarterly Journal of Experimental
Psychology, 64(3), 545-559.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we
know? Practical estimates of vocabulary size dependent on word definition, the degree of
language input and the participant’s age. Frontiers in psychology, 7, 1116.
Brysbaert, M., Lagrou, E., & Stevens, M. (2017). Visual word recognition in a second
language: A test of the lexical entrenchment hypothesis with lexical decision times.
Bilingualism: Language and Cognition, 20(3), 530-548.
Brysbaert, M., Keuleers, E., & Mandera, P. (2019). Recognition times for 62 thousand
English words: Data from the English Crowdsourcing Project.
Brysbaert, M., Keuleers, E., & Mandera, P. (2020). Which words do English non-native
speakers know? New supernational levels based on yes/no decision. Second Language
Research, 37(2), 207-231.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning &
Technology, 11(3), 38-63.
Cobb, T. (2016). Feeding numbers or numerology? A response to Nation (2014) and
McQuillan (2016).
DeKeyser, R. M. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition
and second language instruction (pp. 125–151).

Diaz, G. I., Fokoue-Nkoutche, A., Nannicini, G., & Samulowitz, H. (2017). An effective
algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and
Development, 61(4/5), 9-1.
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in second language
acquisition, 24(2), 143-188.
Ellis, N. C. (2002). Reflections on frequency effects in language processing. Studies in
second language acquisition, 24(2), 297-339.
Ellis, R. (2003). Task-based language learning and teaching. Oxford university press.
Ferré, P., & Brysbaert, M. (2017). Can Lextale-Esp discriminate between groups of highly
proficient Catalan–Spanish bilinguals with different language dominances?. Behavior
Research Methods, 49(2), 717-723.
Freebody, P., & Anderson, R. C. (1981). Effects of differing proportions and locations of
difficult vocabulary on text comprehension. Center for the Study of Reading Technical
Report; no. 202.
Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool. System,
37(4), 614-626.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied linguistics, 35(3),
305-327.
Gaurav, M. (2019, December 17). How to find the optimum number of hidden layers and
nodes in a neural network model? Datagraphi.Com. Retrieved November 5, 2021, from
https://www.datagraphi.com/blog/post/2019/12/17/how-to-find-the-optimum-number-of-
hidden-layers-and-nodes-in-a-neural-network-model
Grasemann, U., Peñaloza, C., Dekhtyar, M., Miikkulainen, R., & Kiran, S. (2021). Predicting
language treatment response in bilingual aphasia using neural network-based patient
models. Scientific reports, 11(1), 1-11.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
Hashimoto, B. J., & Egbert, J. (2019). More than frequency? Exploring predictors of word
difficulty for second language learners. Language Learning, 69(4), 839-872.
Haykin, S., & Principe, J. (1998). Making sense of a complex world [chaotic events
modeling]. IEEE Signal Processing Magazine, 15(3), 66-81.
Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive second-language
vocabulary for non-native university students: An empirical investigation. Applied
linguistics, 17(2), 145-163.
He, X., & Godfroid, A. (2019). Choosing words to teach: A novel method for vocabulary
selection and its practical application. Tesol Quarterly, 53(2), 348-371.
Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in second
language acquisition, 21(2), 303-317.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2(5), 359-366.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in second language
acquisition: What does the coefficient of variation tell us?. Applied Psycholinguistics, 30(4),
555-582.
Kerz, E., Wiechmann, D., Qiao, Y., Tseng, E., & Ströbel, M. (2021, April). Automated
Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours
and RNNs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building
Educational Applications (pp. 199-209).
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project:
Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior
research methods, 44(1), 287-304.
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in
psycholinguistics: An overview of recent developments. Quarterly Journal of Experimental
Psychology, 68(8), 1457-1468.

Kramer, O. (2016). Scikit-learn. In Machine learning for evolution strategies (pp. 45-53).
Springer, Cham.
Laufer, B. (1998). The development of passive and active vocabulary in a second language:
Same or different?. Applied linguistics, 19(2), 255-271.
Laufer, B., & Paribakht, T. S. (1998). The relationship between passive and active
vocabularies: Effects of language learning context. Language learning, 48(3), 365-391.
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need both to
measure vocabulary knowledge?. Language testing, 21(2), 202-226.
Lavergne, P., & Patilea, V. (2008). Breaking the curse of dimensionality in nonparametric
testing. Journal of Econometrics, 143(1), 103-122.
Lee, J. W., & Schallert, D. L. (1997). The relative contribution of L2 language proficiency and
L1 reading ability to L2 reading performance: A test of the threshold hypothesis in an EFL
context. Tesol Quarterly, 31(4), 713-739.
Leow, R. P., Grey, S., Marijuan, S., & Moorman, C. (2014). Concurrent data elicitation
procedures, processes, and the early stages of L2 learning: A critical overview. Second
Language Research, 30(2), 111-127.
Liben-Nowell, D., Strand, J., Sharp, A., Wexler, T., & Woods, K. (2019). The danger of
testing by selecting controlled subsets, with applications to spoken-word recognition. Journal
of cognition, 2(1).
Maskor, Z. M., & Baharudin, H. (2016). Receptive vocabulary knowledge or productive
vocabulary knowledge in writing skill, which one important. International Journal of Academic
Research in Business and Social Sciences, 6(11), 261-271.
Mandera, P., Keuleers, E., & Brysbaert, M. (2020). Recognition times for 62 thousand
English words: Data from the English Crowdsourcing Project. Behavior Research
Methods, 52(2), 741-760.
McKinney, W. (2011). pandas: a foundational Python library for data analysis and
statistics. Python for high performance and scientific computing, 14(9), 1-9.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language
testing, 4(2), 142-154.
Meara, P. M., & Milton, J. L. (2003). The Swansea vocabulary levels test: The manual.
Newbury, UK: Express Publishing.
Milton, J. (2006). X-Lex: The Swansea vocabulary levels test. In Proceedings of the 7th and
8th Current Trends in English Language testing (CTELT) Conference (Vol. 4, pp. 29-39).
TESOL Arabia, UAE.
Monaghan, P., Chang, Y. N., Welbourne, S., & Brysbaert, M. (2017). Exploring the relations
between word frequency, language exposure, and bilingualism in a computational model of
reading. Journal of Memory and Language, 93, 1-21.
Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word
recognition tasks: Where are they?. Journal of Experimental Psychology: General, 118(1),
43.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. Vocabulary:
Description, acquisition and pedagogy, 14, 6-19.
Nation, I. S. (2001). Learning vocabulary in another language. Ernst Klett Sprachen.
Nurweni, A., & Read, J. (1999). The English vocabulary knowledge of Indonesian university
students. English for Specific Purposes, 18(2), 161-175.
O'Dell, F., Read, J., & McCarthy, M. (2000). Assessing vocabulary. Cambridge university
press.
Oliphant, T. E. (2006). A guide to NumPy (Vol. 1, p. 85). USA: Trelgol Publishing.
Pae, T. I. (2019). A simultaneous analysis of relations between L1 and L2 skills in reading
and writing. Reading Research Quarterly, 54(1), 109-124.
Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level
assessment. Computer speech & language, 23(1), 89-106.
Pulido, D., & Hambrick, D. Z. (2008). The virtuous circle: Modeling individual differences in
L2 reading and vocabulary development.

Read, J. (2013). Validating a test to measure depth of vocabulary knowledge. In Validation in
language assessment (pp. 55-74). Routledge.
Santos, V. D., Verspoor, M., & Nerbonne, J. (2012). Identifying important factors in essay
grading using machine learning. International experiences in language testing and
assessment—Selected papers in memory of Pavlos Pavlou, 295-309.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of
two new versions of the Vocabulary Levels Test. Language testing, 18(1), 55-88.
Skalicky, S., Crossley, S. A., & Berger, C. M. (2019). Predictors of second language English
lexical recognition: Further insights from a large database of second language lexical
decision times. The Mental Lexicon, 14(3), 333-356.
Sparks, R. L., Patton, J., Ganschow, L., & Humbach, N. (2012). Do L1 reading achievement
and L1 print exposure contribute to the prediction of L2 proficiency?. Language
Learning, 62(2), 473-505.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900.
Harvard University Press.
Van Gelderen, A., Schoonen, R., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., &
Stevenson, M. (2004). Linguistic knowledge, processing speed, and metacognitive
knowledge in first- and second-language reading comprehension: a componential analysis.
Journal of educational psychology, 96(1), 19.
Van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A
new and improved word frequency database for British English. Quarterly journal of
experimental psychology, 67(6), 1176-1190.
Webb, S. A., & Chang, A. C. S. (2012). Second language vocabulary growth. RELC journal,
43(1), 113-126.
Wickham, H., & Wickham, M. H. (2020). Package ‘plyr’. A Grammar of Data Manipulation. R
package version, 8.
Yang, Y., Yu, W., & Lim, H. (2016). Predicting second language proficiency level using
linguistic cognitive task and machine learning techniques. Wireless Personal
Communications, 86(1), 271-285.
Yap, M. J., Balota, D. A., Tse, C. S., & Besner, D. (2008). On the additive effects of stimulus
quality and word frequency in lexical decision: evidence for opposing interactive influences
revealed by RT distributional analyses. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 34(3), 495.
Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of
orthographic similarity. Psychonomic bulletin & review, 15(5), 971-979.
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language
representation with informative entities. arXiv preprint arXiv:1905.07129.
Zhou, Z. H. (2019). Ensemble methods: foundations and algorithms. Chapman and
Hall/CRC.
Appendix A
Visualization of Model Features
Figure 3
Distribution of all scaled predictors

Figure 4
Relationship between the predictors and target

Appendix B
Hyper Parameter Tuning & Model Selection
Table 2
Tested Hyper Parameters

Model / Hyper parameter       Tested values
Linear regression
  Lambda                      0, 0.02, 0.06, 0.1, 0.2
Neural network
  Optimization functions      Adam, RMSProp
  Number of layers            2, 3
  First layer nodes           64, 32, 16
  Last layer nodes            64, 8
  Batch size                  32, 100
  Number of epochs            30, 60

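
The candidate values in Table 2 can be expanded into a list of configurations to evaluate. The following is a minimal sketch in Python, assuming an exhaustive grid over the neural network values; the thesis does not state here whether every combination was actually tested, and the names `NN_GRID`, `LINREG_LAMBDAS` and `expand_grid` are illustrative rather than taken from the thesis code.

```python
from itertools import product

# Values copied from Table 2. The dictionary keys are illustrative names,
# not identifiers from the thesis code.
NN_GRID = {
    "optimizer": ["Adam", "RMSProp"],
    "n_layers": [2, 3],
    "first_layer_nodes": [64, 32, 16],
    "last_layer_nodes": [64, 8],
    "batch_size": [32, 100],
    "epochs": [30, 60],
}

# Regularization strengths listed for the linear regression models.
LINREG_LAMBDAS = [0, 0.02, 0.06, 0.1, 0.2]


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Assumption: a full grid search over all listed neural network values.
nn_configs = expand_grid(NN_GRID)

print(len(nn_configs))      # 2 * 2 * 3 * 2 * 2 * 2 = 96 candidate configurations
print(nn_configs[0])        # first candidate: the first value of every hyper parameter
print(len(LINREG_LAMBDAS))  # 5 candidate lambda values for linear regression
```

Each dictionary in `nn_configs` could then be passed to a model-building routine and scored on the validation set, keeping the configuration with the lowest validation loss.
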
Figure 5
Neural network models selected – Model 1b: word frequency as predictor

Figure 6
Neural network models selected – Model 2b: word characteristics as predictors
Figure 7
Neural network models selected – Model 3b: psycholinguistic measures as predictors

Figure 8
Neural network models selected – Model 4b: lexical characteristics & word frequency as
predictors
Figure 9
Neural network models selected – Model 5b: lexical characteristics & L1 psycholinguistic
measures as predictors
Appendix C
Model Robustness Checks
Figure 10
Train & Validation Loss Visualization for Model 1b – Neural network with word frequency
Model 1 Train & Val Loss per Epoch
Figure 11
Train & Validation Loss Visualization for Model 2b – Neural network with word characteristics
Model 2 Train & Val Loss per Epoch
Figure 12
Train & Validation Loss Visualization for Model 3b – Neural network with L1 psycholinguistic
measures
Figure 13
Train & Validation Loss Visualization for Model 4b – Neural network with lexical characteristics
& word frequency

Figure 14
Train & Validation Loss Visualization for Model 5b – Neural network with lexical characteristics
& L1 psycholinguistic measures

Appendix D
Post-Hoc Analysis Visualization
Figure 15
Distribution of test targets – L2 receptive word knowledge rank

Figure 16
Distribution of predictions based on word characteristics & word frequency – linear
regression
Figure 17
Distribution of predictions based on word characteristics & word frequency – neural network
Figure 18
Distribution of residuals for predictions based on word characteristics & word frequency –
linear regression
Figure 19
Distribution of residuals for predictions based on word characteristics & word frequency –
neural network

