Student details
Thesis committee
Tilburg University
School of Humanities & Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
3 December 2021
Abstract
With the growing need for and interest in language learning, as well as the increasing
accessibility of tools and technologies that support second language learning, it is becoming
more relevant to add to the body of research conducted on this topic. One of the most
relevant topics studied over time is how second language (L2) speakers develop lexical
characteristics in predicting the ease with which L2 speakers acquire new words. These
characteristics include how frequently speakers are exposed to words in the language, how
they are able to associate these words to something concrete, as well as word length.
However, few studies have modeled the relationship between native language (L1) lexical
proficiency and second language proficiency. This study aims to contribute to existing
regression tasks predicting L2 receptive word knowledge based on data obtained from a
mega study, which brings a unique contribution within this study domain. It was found that
word frequency performed better as a predictor for L2 receptive word knowledge for both the
linear regression and neural network models, confirming previous findings. Overall, the
neural network showed better performance compared to the linear regression algorithm as
an estimator.
psycholinguistic measures
Data Source/Code/Ethics Statement
Data collected and processed through the English Crowdsourcing Project (Mandera,
Keuleers, & Brysbaert, 2020) as part of a set of online vocabulary tests organized by Ghent
University were used for the purpose of writing this thesis. All data analyzed within this thesis
are publicly available on the following sources: Our lexicon projects (ugent.be) and OSF |
Code used to conduct experimental procedures for the purpose of data analysis in this thesis
was obtained from the following publicly available source (Gaurav, 2019). The repurposed
code written by the author for this thesis can be found at the following link.
Table of Contents
Abstract
3. Methods
4. Experimental Setup
4.1 Dataset Description
Task 2.
Task 3.
5. Results
6. Discussion
7. Conclusion
Acknowledgements
References
Appendix A
Appendix B
Appendix C
Appendix D
1. Introduction
With the growing need for and interest in language learning, as well as the increasing
accessibility of tools and technologies that support second language learning, it is becoming
more relevant to add to the body of research conducted on this topic. One of the most
relevant topics studied over time is how second language (L2) speakers develop lexical
proficiency. L2 lexical proficiency can be defined as the depth of understanding (Anderson &
Freebody, 1981) as well as the breadth of lexical knowledge, or vocabulary size (Hazenberg
knowledge of words (Laufer et al., 2004; Nation, 2001). Previous studies have concluded
that vocabulary knowledge refers to a person’s scope of word usage in writing, reading,
listening, and speaking, which can be further divided into receptive and productive
knowledge (Henriksen, 1999; Laufer, 1998; O’Dell et al., 2000), among other dimensions
(Maskor & Baharudin, 2016). Receptive knowledge of a word involves aspects such as
recognizing a word, knowing what a word looks like, and associating a word to other words
(Nation, 2001). Therefore, one of the most common tasks used to measure receptive word
knowledge includes the word recognition task, where participants are asked to indicate
whether they recognize a word and to identify whether the word seen is an actual word or a
pseudo word (Berger et al, 2019). Response times and level of accuracy of participants’
responses have been found to be important indicators on how L2 speakers process and
access words mentally, how they develop automaticity, as well as vocabulary size
There are many factors which affect the ease of learning words and expanding
vocabulary knowledge for L2 speakers. One of these is lexical characteristics, a set of word
characteristics that include word frequency, word length, orthographic and phonological
concreteness (Berger et al, 2019a, 2019b; Brysbaert et al, 2016, 2020; Keuleers et al., 2012,
2015; Mandera et al, 2020; Skalicky et al, 2019). From these characteristics, word frequency
has been a consistently significant predictor for word knowledge, where words that appear
or are used more frequently within a language tend to be learned more quickly by both
second language and native language speakers (Berger et al, 2019; Brysbaert & Cortese,
2011; Ellis, 2002; 2003; Read, 1998; Schmitt et al, 2001; Mandera et al, 2020). However,
there are drawbacks to using word frequency as an indicator for L2 lexical proficiency
(Berger et al, 2019; Brysbaert et al, 2020; He & Godfroid, 2019; Mandera et al, 2020), such
as there being disproportionately higher numbers of low frequency words compared to high
frequency words, particularly in the English language. Therefore, many low frequency words
from a corpus or set of corpora, which means that it is dependent on the corpus used for its
Another variable which has been studied in relation to L2 lexical proficiency is lexical
proficiency of native language (L1) speakers. Previous studies have found that word
knowledge of individuals who claimed English to be their native tongue and individuals who
spoke English as a second language exhibited high correlation and similar patterns
(Monaghan et al., 2017; Brysbaert et al., 2020). Meanwhile, others noted a relationship
between reading, writing, and speaking abilities of native and second languages of bilingual
individuals (Asfaha et al, 2009; Lee & Schallert, 1997; Pae, 2019; Sparks et al, 2012).
measured by response time and accuracy during lexical tasks, were able to index the
difficulty level of words produced over time by a group of students who did not speak English
as a native language.
Response time and accuracy rate on lexical decision tasks are prevalent measures of
a language by both L2 and L1 speakers (Berger et al., 2019). This is the underlying
motivation for the focus of this study on how psycholinguistic measures of L1 English
speakers are able to predict L2 vocabulary knowledge. Measures of response time on lexical
tasks indicate how well a language speaker’s mental lexical network is organized as well as
(DeKeyser, 2001; Hulstijn et al., 2009; Van Gelderen et al., 2004). Accuracy rate, on the
other hand, is a widely accepted indicator for L1 and L2 vocabulary size (Meara & Milton,
measures such as word frequency rely on proxies of large reference corpora, which may not
making them more subjective compared to L1-derived measures which are based on L1
One of the most prevalent methods used to capture psycholinguistic measures is the
online or mega study approach (Balota et al., 2007; Berger et al., 2019; Marinis, 2010;
testing in a classroom or experimental set up, this approach has several advantages,
including: the ability to include a larger number of stimuli and participants, resulting in more
power; the ability to assess the importance of both existing and new word characteristics;
data availability for repeated use and answering different research questions; avoiding
evaluating the performance of new computational models (Balota et al., 2015; Keuleers &
Balota, 2013; Liben-Nowell et al., 2019; Mandera et al., 2020). The mega study approach
has led to increased availability of large amounts of data over the years (Mandera et al.,
2020), which together with rapid advancements of computational models have resulted in
increased interest in the utilization of machine learning models for lexical proficiency
prediction (Santos et al., 2012; Yang et al., 2016). However, such studies focused on
This study aims to add to the body of literature within the domain of L2 lexical
(L1) speakers can predict receptive word knowledge of non-native (L2) English speakers in
doing so, information obtained through the means of a mega study will be utilized and the
performances of a simple and a more complex machine learning algorithm will be compared.
The following main research question and sub-research questions are formed:
To what extent can lexical characteristics and measures derived from L1 English
on response times and accuracy ratings of L2 speakers on an online word recognition task,
ratings of native English speakers on the same task. To answer the research questions
presented in this study, one baseline model, five linear regression models and five neural
measures in place of word frequency together with other lexical characteristics within the
model yield comparable results, confirming its predictive power not only on productive L2
2. Related Work
vocabulary knowledge, which forms the foundation for their proficiency in writing, reading,
listening, and speaking (Pulido & Hambrick, 2008). The present study will utilize the
receptive word knowledge as a measure for L2 vocabulary knowledge. The receptive word
knowledge includes measures such as knowing what a word looks and sounds like, knowing
what parts of a word are recognizable, and knowing the meaning of a word. A common tool
used to measure receptive word knowledge is the word recognition task (DeKeyser, 2001;
Leow et al., 2014; Meara & Milton, 2002), where an L2 speaker is typically presented with a
word stimulus and asked to identify whether they recognize the stimulus as an actual word
within the language or not (Berger et al., 2019). Measures of L2 response times and their
accuracy ratings on this task are recorded and have been found to have high correlations
with more demanding in-depth language tests (Ferré & Brysbaert, 2017; Harrington & Carey,
2009; Lemhöfer & Broersma, 2012; Meara & Buxton, 1987; Zhang et al., 2019). These
measures on the word recognition task have also been used to form lists of word ranking
which were able to indicate word difficulty, how frequently an L2 speaker is exposed to the
word within the language (Brysbaert et al., 2019), as well as how useful the L2 speaker
perceives the word or how motivated they are to learn the word (He & Godfroid, 2019). In
their study, Brysbaert, Keuleers, and Mandera (2019) found that L2 word rank had the
highest correlation with word difficulty measures based on human ratings (r = .60), followed
by log of word frequency (r = -.52) based on the COCA corpus (Gardner & Davies, 2014), as
well as word usefulness based on human ratings (r = -.47). Moreover, the word rank as a
measure for L2 English word knowledge had a correlation of .68 with word knowledge from a
similar study by Hashimoto and Egbert (2019). Thus, this study will utilize word ranks as a
Although larger vocabulary size indicates higher L2 lexical proficiency, not all words
need to be known for successful language understanding. L2 speakers would normally only
acquire between 2,500 and 3,000 of the most frequent word families (i.e., a word and its
various inflected and derived forms) of a language (Cobb, 2007; 2016; Webb & Chang,
2012), compared to the 11,100 word families typically known by native (L1) speakers
(Brysbaert et al, 2016). More frequent words tend to be processed more quickly, both for L1
and L2 speakers (Ellis, 2002; 2003), and are typically learned by L2 speakers at an earlier
point in time compared to less frequent words (O’Dell et al., 2000; Schmitt et al, 2001). This
finding is further affirmed by several classic and well-known studies on the effect of lexical
characteristics on lexical proficiency, such as the English Lexicon Project and British Lexicon
Project (Monsell et al, 1989; Yap et al, 2008; Balota et al, 2007; Keuleers et al, 2012;
Mandera et al, 2020). Some of the other important lexical characteristics which have been
found to be significant predictors for L2 word knowledge include word length, number of
syllables of a word, and average age of acquisition of a word (Balota et al, 2007; Mandera et
al, 2020). Despite this, Brysbaert et al (2016) summarize the drawbacks of word frequency
assumption that all words encountered by a speaker are of the same weight; lack of
language representation in the corpora from which word frequency measures are taken; and
that receptiveness of a word also depends on the motivation of the learner to know the word”
(pp. 27-28). Keeping this in mind, there is room for further study on variables which could
better predict and serve as more reliable measures for L2 vocabulary knowledge.
Various studies have looked into the relationship between L1 and L2 lexical
proficiency. These studies oftentimes focus on the dynamics between L1 and L2 reading
and writing abilities for individuals. For example, Asfaha et al. (2009) revealed that L1
proficiency. Meanwhile, Lee & Schallert (1997) found that L2 proficiency is a better predictor
for L2 reading ability compared to L1 reading ability. Studies with similar findings on the
relationship between L1 and L2 writing and reading skills were conducted by Sparks et al.
(2012) and Pae (2019). However, little work has been done on L2 vocabulary knowledge
and how this can be predicted by L1 vocabulary. Brysbaert, Lagrou, and Stevens (2017) is
one of these studies, where they were able to confirm the lexical entrenchment hypothesis
for L1 and L2 English speakers, observing that the frequency effect has a stronger impact on
second language speakers compared to measures derived from native language speakers.
Another such study is by Berger et al. (2019), which attempted to predict L2 English
productive word knowledge with psycholinguistic measures derived from L1 English speaker
performance on word recognition and lexical decision tasks. Findings from this study
suggest that L2 speakers with higher language proficiency (as assessed by human raters)
produce English words that are recognized and named more slowly and less accurately by
L1 speakers. Results from the longitudinal study within this paper conclude that L2 speakers
produce less frequent words over time, as well as words which are recognized or named
While there has been growing interest in the utilization and exploration of machine
learning approaches to predict lexical proficiency (Petersen & Ostendorf, 2009; Santos et al.,
2012; Yang et al., 2016), most studies adopting this approach have used corpora-based
measures or measures obtained through factorial and classroom designs as data sources.
measures based on human assessment. For example, Arnold et al. (2018) predicted CEFR
levels of L2 English learners using corpora-based lexical and syntactic metrics. The
performances of the Gradient Boosted Trees and the Neural Network were compared in a
linguistic point of view, the most distinguishable marks in proficiency among second
language learners arise from adjacent classes. Key findings of the study highlighted the
higher performance of the gradient boosted tree compared to the neural network and that
task-based corpora entail strong overfitting. Aryadoust & Baghaei (2016) attempted to
predict L2 English learners’ reading ability on the basis of their lexical and grammatical
knowledge using the neural network. Results from this study showed that the neural network
was able to classify L2 readers based on their grammatical and vocabulary knowledge with
an accuracy level of approximately 78% and confirmed previous findings on the relationship
between these variables. However, there were limitations acknowledged on the reliability of
the Rasch model utilized to derive proficiency measures. Lastly, Kerz et al. (2021)
The presented body of literature leaves room for further studies within the domain of
L2 lexical proficiency. While previous studies have analyzed the relationship between
receptive word knowledge and various lexical characteristics, little research has been done
to model the relationship between L1 English and L2 English lexical proficiency, particularly
for receptive word knowledge. Therefore, this study aims to replicate previous findings
showing the predictive power of lexical characteristics on L2 receptive word knowledge, and
mentioned studies utilizing machine learning to predict second language proficiency have
experiments. This study will thus add a unique contribution by adopting the machine learning
3. Methods
The current study utilized a machine learning approach to answer the presented
research questions, where the performance of a simpler machine learning algorithm was
compared with that of a more complex one. The simpler model tested was the linear
regression as it is a well-established model that is easy to interpret (Stigler, 1986). The feed
forward neural network was selected as the more complex machine learning model to
predict lexical proficiency as it may have a few advantages compared to the linear
regression, such as being more adaptive, being able to model non-linear functions, and not
needing assumptions about the data or the relationships being modeled (Haykin, 1998;
The linear regression is one of the most widely accepted and classical algorithms for
predictions dating back to more than 100 years (Stigler, 1986; Yan & Su, 2009). The
minimizes its loss function (Yan & Su, 2009). The most prevalent method of doing this is by
minimizing the sum of squared residuals of the estimated equation. The linear regression
$$Y = f(X_i, \beta) + e_i$$
variables, 𝛽 the estimated parameters of the least squares regression equation, and e the
residuals or differences between the observed and fitted values of the dependent variable.
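As an illustration of the least-squares principle described above, the following minimal sketch fits a straight line to a handful of made-up values and reports the residuals. The data, the function names, and the restriction to a single predictor are assumptions made for illustration only; the sketch does not reproduce the models fitted in this study.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope that minimizes the squared residuals
    b0 = y_bar - b1 * x_bar     # intercept follows from the means
    return b0, b1


def residuals(x, y, b0, b1):
    """Differences between observed and fitted values."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Toy data (made up): a frequency-like predictor and a rank-like target.
x = [2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.0, 7.5, 5.0, 3.5, 1.0]

b0, b1 = fit_simple_ols(x, y)
e = residuals(x, y, b0, b1)
sse = sum(r ** 2 for r in e)    # the quantity that least squares minimizes
print(f"intercept={b0:.2f}, slope={b1:.2f}, SSE={sse:.2f}")
```

Any other choice of intercept or slope for these five points would yield a larger sum of squared residuals, which is the sense in which the fitted line is optimal.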
The goal of the neural network is to approximate some function, such as y = f2(z; u, c) as
activation functions within each layer. This is done by learning the values of parameters of the
network (in this case, w, b and u, c) that result in the best function approximation (Goodfellow
et al., 2016).
[Figure 1: diagram of a feed-forward network with an input layer, a hidden layer, and an output layer, annotated with z = f1(x; w, b) and y = f2(z; u, c).]
A loss function is chosen to evaluate how well the model is able to predict true values
of the dataset based on the nature of the problem. The difference between the network’s
output and the expected output as calculated by the loss function is minimized by computing
its gradients, which informs the model how to update the parameters. In order to learn the
best parameter values, the derivative of the prediction error with respect to each parameter
is calculated with the backpropagation algorithm. For example, we can express the
backpropagation of the prediction error of the network in Figure 1 with the following
equation:
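$$\frac{\partial \mathit{loss}}{\partial w} = \frac{\partial \mathit{loss}}{\partial y} \times \frac{\partial y}{\partial z} \times \frac{\partial z}{\partial w}$$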
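The chain rule above can be checked numerically on a scalar version of the network. The sketch below is a minimal illustration under stated assumptions: a single ReLU hidden unit for f1, a linear output for f2, a squared-error loss, and made-up parameter values. None of these choices is taken from the models trained in this study.

```python
def forward(x, w, b, u, c):
    """z = f1(x; w, b) with a ReLU unit, y = f2(z; u, c) with a linear unit."""
    z = max(0.0, w * x + b)
    y = u * z + c
    return z, y


def loss_fn(y, target):
    """Squared error for a single observation (assumed loss)."""
    return (y - target) ** 2


def grad_w(x, target, w, b, u, c):
    """dloss/dw = dloss/dy * dy/dz * dz/dw, one factor per layer."""
    _, y = forward(x, w, b, u, c)
    dloss_dy = 2.0 * (y - target)
    dy_dz = u
    dz_dw = x if (w * x + b) > 0 else 0.0   # derivative of ReLU(w * x + b)
    return dloss_dy * dy_dz * dz_dw


# Made-up input, target, and parameter values.
x, target = 1.5, 2.0
w, b, u, c = 0.8, 0.1, -1.2, 0.5

analytic = grad_w(x, target, w, b, u, c)

# Central finite difference as an independent check of the chain rule.
eps = 1e-6
loss_plus = loss_fn(forward(x, w + eps, b, u, c)[1], target)
loss_minus = loss_fn(forward(x, w - eps, b, u, c)[1], target)
numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"analytic={analytic:.4f}, numeric={numeric:.4f}")
```

Agreement between the two values indicates that the product of the three partial derivatives is the gradient an optimizer would use to update w.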
3.1.1 Models
This study applied one baseline model, five linear regression models, and five neural network
models:
The zero-rule algorithm is one of the most commonly used baseline algorithms in
machine learning that takes information about the targets in a problem in order to
create a rule for predictions (Zhou, 2019). The mean value of the targets, L2 English
receptive word knowledge, was used as the default prediction, as it represents the
2. Models 1a and 1b: The linear regression model and neural network trained on word
frequency
3. Models 2a and 2b: The linear regression model and neural network trained on
4. Models 3a and 3b: The linear regression model and neural network trained on L1
5. Models 4a and 4b: The linear regression model and neural network trained on word
6. Models 5a and 5b: The linear regression model and neural network trained on L1
3.2 Predictors
Three predictor variables were used within this study, namely word frequency, lexical
Word frequency is a measure of how often a word is used within a language based
on a specific corpus. This study used word frequency measures from the SUBTLEX-US
database based on 51 million words in American English subtitles (Brysbaert & New, 2012).
The number of times that a word appears within the corpus is expressed as Zipf scores, a
standardized measure of frequency independent of corpus size (van Heuven et al., 2014).
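To make the Zipf scale concrete, the sketch below assumes its basic definition, namely the base-10 logarithm of a word's frequency per million words plus 3. The function name and the example counts are hypothetical, and the smoothing that published norms apply to very rare words is deliberately left out.

```python
import math


def zipf_score(count, corpus_tokens):
    """Basic Zipf value: log10(frequency per million words) + 3.

    Assumes count > 0; smoothing for unseen or very rare words is omitted.
    """
    per_million = count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


# Hypothetical counts in a 51-million-word corpus.
rare = zipf_score(51, 51_000_000)        # 1 occurrence per million words
common = zipf_score(5_100, 51_000_000)   # 100 occurrences per million words

# The same relative frequency in a smaller corpus gives the same score.
rare_small_corpus = zipf_score(10, 10_000_000)

print(rare, common, rare_small_corpus)
```

Because the score depends only on relative frequency, values computed from corpora of different sizes remain comparable.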
variable, is word length as measured by the number of letters (length) and syllables (number
of syllables) that make up a word (Mandera et al., 2020). The variable number of
morphemes counts the smallest units of meaning within a word, while number of phonemes
represents the phoneme count of a word (Balota et al., 2007). Orthographic Levenshtein
distance (OLD) indicates how similar the spelling of a word is to that of other words in the
corpus, while phonological Levenshtein distance (PLD) indicates how similar a word sounds
to other words (Balota et al., 2007). Age of acquisition (AoA) indicates the
average age at which a word is typically learned (Brysbaert, 2012). Lastly, concreteness
ratings are the degree to which an experience with a word can be perceived by language
response times and accuracy ratings of participants selecting English as their native
language to an online vocabulary test. Participants of this vocabulary test completed a word
recognition task where they were presented with a sequence of both existing English words
and non-words, and were instructed to respond with ‘yes’ or ‘no’ depending on whether they
knew the word being presented to them (Mandera et al., 2020). The mean and standard
deviation of response times to correctly identified words were recorded and their z-scores
were calculated. Z-score measures of the mean and standard deviation of response times
were used in this study. Meanwhile, accuracy ratings represented the percentage of correct
3.3. Target
English words based on L2 English speakers’ accuracy ratings and response times to the
word recognition task, which has been found to be a good indicator for L2 vocabulary
acquisition (Brysbaert et al., 2020). Words are ranked from 1 to 18,236 based on the
percentage of participants who knew each word, with mean word recognition response time
used to break ties.
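The ranking procedure can be sketched as a two-key sort. The example below assumes that rank 1 is given to the best-known word and that, among ties, the word with the faster mean response time receives the better rank; the words, values, and field names are made up for illustration.

```python
# Hypothetical word-level summaries for L2 speakers (values are made up).
l2_words = [
    {"word": "house", "pct_known": 0.99, "mean_rt_z": -0.40},
    {"word": "table", "pct_known": 0.99, "mean_rt_z": -0.55},
    {"word": "quaint", "pct_known": 0.62, "mean_rt_z": 0.30},
    {"word": "lithe", "pct_known": 0.35, "mean_rt_z": 0.85},
]


def rank_words(rows):
    """Rank by percentage known (descending), then by response time (ascending)."""
    ordered = sorted(rows, key=lambda r: (-r["pct_known"], r["mean_rt_z"]))
    return {row["word"]: position for position, row in enumerate(ordered, start=1)}


ranks = rank_words(l2_words)
print(ranks)
```

In this toy example "house" and "table" are known equally well, so the response time alone decides which of the two is ranked first.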
4. Experimental Setup
The present study used a combined dataset from three data sources. The data used
in this study was collected through the English Crowdsourcing Project (ECP) (Mandera et
al., 2020), which is a mega study that forms part of a set of online vocabulary tests organized by Ghent
University. The English vocabulary test was first conducted in 2014 (Brysbaert et al., 2016).
of this test are presented with a sequence of 100 stimuli, where 70 of these stimuli are
existing English words and 30 are pseudo words. They are instructed to indicate whether
they recognized the stimuli being presented and are informed that incorrect answers will be
penalized so as to avoid guessing during the test. The complete set of instructions for the
online vocabulary test is described in Brysbaert et al. (2020) and Mandera et al. (2020).
Two different versions of the ECP dataset were used. Psycholinguistic measures for
L1 English speakers were sampled from the ECP dataset as described in Mandera,
Keuleers, and Brysbaert (2020), where word recognition times and accuracy ratings of
62,000 English words for native English speakers were presented. L2 English receptive
word knowledge as measured by word rankings was derived from accuracy ratings and
recognition times of the same 62,000 English words for non-native English speakers as
Lastly, lexical characteristics data in this study were obtained from the Word Features
Analysis of the ELP and ECP for the Open Science Framework (Brysbaert & Mandera,
2019). In this dataset, lexical characteristics calculations for English words in common
between the ECP and ELP datasets are available, which includes the following information:
word length, word frequency, orthographic and phonologic neighbors, orthographic and
of acquisition, word prevalence, and concreteness. An intersection of the three data sources
resulted in a dataset of 18,236 English words, which was utilized for this study.
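Intersecting the three sources amounts to keeping only the words present in all of them. The following sketch illustrates this with three small dictionaries keyed by word; the words, field names, and values are hypothetical stand-ins for the L1 ECP measures, the L2 word ranks, and the lexical characteristics.

```python
# Hypothetical extracts of the three data sources, keyed by word.
l1_measures = {
    "house": {"l1_rt_z": -0.52, "l1_accuracy": 0.99},
    "quaint": {"l1_rt_z": 0.10, "l1_accuracy": 0.93},
    "zyzzyva": {"l1_rt_z": 1.40, "l1_accuracy": 0.12},
}
l2_ranks = {"house": 2, "quaint": 3, "lithe": 4}
lexical = {
    "house": {"length": 5, "zipf": 5.6},
    "quaint": {"length": 6, "zipf": 3.1},
    "lithe": {"length": 5, "zipf": 2.4},
}


def intersect_sources(l1, l2, lex):
    """Keep only words found in all three sources and merge their fields."""
    common = sorted(l1.keys() & l2.keys() & lex.keys())
    return {w: {**l1[w], **lex[w], "l2_rank": l2[w]} for w in common}


dataset = intersect_sources(l1_measures, l2_ranks, lexical)
print(list(dataset))
```

Words missing from any one source, such as "lithe" and "zyzzyva" here, drop out of the combined dataset, which is why an intersection is smaller than each of its sources.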
This study followed the same data cleaning procedures as implemented by the L1
and L2 ECP datasets presented in Mandera et al. (2020) and Brysbaert et al. (2020). Data of
participants who indicated that their native language was English were included in the L1
ECP dataset, and those who indicated that their native language was not English were
included in the L2 ECP dataset. Therefore, only data of respondents who completed the
person-related questions were included in the dataset. Outliers and irregular values were
taken out of the dataset. Only the first 3 sessions from each IP address were included to
avoid any undue influences from individuals. Reaction times of the first 9 trials of each
Additionally, for non-native English speakers, observations with more than two ‘yes’
al, 2020). Lastly, from the remaining observations in the dataset, only English words with
available lexical characteristics data in the English Lexicon Project were included in the
analysis. These steps pruned the initial dataset of 61,851 English words to 18,236 words.
For the purposes of this study, several variables were further omitted from the
analysis. For word recognition times, z-scores of both L1 and L2 response times instead of
their raw measures were used in order to cancel out differences in average response times
per participant (Mandera et al., 2020; Brysbaert et al., 2020). Word prevalence was also
removed from the analysis as it is a measure of how well each word is known based on the
complete ECP dataset (Brysbaert, 2019; Brysbaert, 2016; Keuleers, 2015), which is a
component of word ranking being measured within this study. Furthermore, orthographic and
phonologic neighbors, or the number of words that sound and are written similarly to a word
(Balota et al., 2007), were excluded from the analysis as the variables orthographic and
phonologic Levenshtein distances have been found to be better measures for word similarity
with other words. Both OLD and PLD not only compare word distances with their neighbors,
but with all other words within the corpus (Yarkoni et al., 2008). All variables were
aggregated per word observation and were standardized to control for differences in units of
each feature for every observation, and were standardized, with the exceptions of z-scores
distribution of all predictors included within the models as well as their relationships with the
The linear regression and artificial neural network algorithms were used to conduct
the analysis within this study. Data analysis was performed in Python 3.8, while data
cleaning and preprocessing were conducted in both Python 3.8 and R 4.10. In R 4.10, the
dplyr library (Wickham & Wickham, 2020) was used for dataset cleaning. In Python 3.8, this
study used the following libraries and modules for data cleaning, preprocessing, and
analysis: pandas (McKinney, 2011), os, numpy (Oliphant, 2006), scikit-learn (Kramer, 2016),
random, math, matplotlib (Ari et al., 2014), and tensorflow (Abadi et al., 2016).
The dataset used for this study takes the form of a data frame with 18,236 rows and
13 columns. In total, there are 12 predictors as input to the models. All variables were
converted to two-dimensional numpy arrays and 30% of the entire dataset was left out as
the test set using the train-test split function from scikit-learn. This resulted in a train set with
Testing of the linear regression model utilized the Lasso estimator from the scikit-learn
library. The neural network was built with the KerasRegressor wrapper, which exposes a
Keras model through the scikit-learn estimator interface, and utilized a custom function to
select the optimal hyperparameters for the network during training. The network takes as
input a two-dimensional numpy array corresponding to the predictors included in each
model. The last dense layer of the network outputs a linear value of the target, a word
ranking value between 1 and 18,236.
For all dense layers, ReLU was used as the activation function. The model is compiled with
Five tasks were conducted to answer the research questions presented in this study.
The mean absolute error (MAE) metric was used to evaluate model performance. The MAE
was chosen as it averages the absolute differences between predicted and expected
outputs over the test sample, with each individual difference carrying equal weight. Large
errors caused by outliers are therefore not penalized disproportionately. Since outliers had
previously been removed from the dataset for this study, the MAE is a suitable metric for the
regression problem presented. Moreover, it is more intuitive to interpret than, for example,
the mean squared error.
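The difference between the two metrics can be seen on a small made-up example in which one prediction is far off. The values and function names below are illustrative assumptions and are unrelated to the results reported in this study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference; every error carries equal weight."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; large errors dominate the score."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up word ranks: three close predictions and one large miss.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 800]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(f"MAE={mae}, MSE={mse}")
```

The MAE stays in the units of the target (here, rank positions), whereas the MSE is driven almost entirely by the single large error, which illustrates why the MAE is easier to interpret for this task.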
For both the linear regression and neural network models to be able to estimate the
best set of parameters, several hyperparameters need to be determined beforehand and
optimized. For the Lasso linear regression, this hyperparameter is the penalty term λ (named
alpha in scikit-learn), which scales the regularization term applied to the coefficients of the
regression model. For the
neural network, these include parameters related to the structure of the network as well as
the learning of the model (Diaz et al., 2017). To determine the structure of the neural
network, the number of layers and number of layer nodes were tuned. To optimize how the
network learns, the optimization function, batch size, and number of epochs were tuned. The
grid search algorithm was used to optimize these parameters using the mean squared error
as a loss function.
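The logic of a grid search with cross-validation can be sketched without any library. The example below is a simplified illustration under several assumptions: a one-predictor Lasso without intercept (which has a closed-form solution), four contiguous folds, five hypothetical candidate values for the penalty, and made-up data. It is not the tuning code used in this study.

```python
from itertools import product


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso shrinkage operator)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_1d(x, y, alpha):
    """Minimize (1 / (2n)) * sum((y - b * x)^2) + alpha * |b| for one predictor."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    return soft_threshold(sxy, alpha) / sxx


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val


def grid_search(x, y, param_grid, k=4):
    """Fit every candidate on every fold; keep the lowest mean validation MSE."""
    names = list(param_grid)
    best_params, best_score, n_fits = None, float("inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(x), k):
            b = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(mse([y[i] for i in val], [b * x[i] for i in val]))
            n_fits += 1
        score = sum(fold_scores) / k
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_fits


# Made-up data with an exact linear relationship, so no shrinkage is optimal.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0 * xi for xi in x]
grid = {"alpha": [0.0, 0.01, 0.1, 1.0, 10.0]}  # hypothetical candidates

best_params, best_score, n_fits = grid_search(x, y, grid)
print(best_params, n_fits)
```

The number of fits is the number of candidates multiplied by the number of folds, so five candidates evaluated on four folds require twenty fits.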
Four-fold cross-validation was conducted on the training set to tune the hyperparameters,
and the tuned models were then tested on the hold-out subset of data which had initially
been left out. For the linear regression, a total of 5 lambda value candidates were fitted
across 4 folds, totaling 20 fits. For the neural network, a total of 96 parameter candidates
were fitted across 4 folds, totaling 384 fits. A summary of the tested hyper parameters for
both models and the best-
4.2.2 Tasks
Task 1. The first task served to provide a baseline performance as comparison with the
predictive power of the variables of interest using the zero rule algorithm. The rule adopted
was to use the mean value of the train targets, L2 English receptive word knowledge, as the
default prediction.
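The zero-rule baseline can be illustrated in a few lines. The sketch below uses made-up rank values and hypothetical function names; it only shows the mechanism of predicting the mean of the train targets for every test word and scoring that prediction with the MAE.

```python
from statistics import mean


def zero_rule_fit(train_targets):
    """The 'model' is a single number: the mean of the train targets."""
    return mean(train_targets)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up word ranks for a train and a test split.
train_targets = [1200.0, 4500.0, 9100.0, 15200.0]
test_targets = [3000.0, 8000.0, 17000.0]

baseline = zero_rule_fit(train_targets)
predictions = [baseline] * len(test_targets)   # same prediction for every word
baseline_mae = mean_absolute_error(test_targets, predictions)
print(f"baseline prediction={baseline}, MAE={baseline_mae:.2f}")
```

Any model that uses predictors should achieve a lower MAE than this constant prediction; the size of that improvement is what the subsequent tasks report.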
Task 2. The aim of the second task was to evaluate the predictive power of word
algorithms with the baseline performance, where Model 1a is the linear regression model
Task 3. The third task aimed to evaluate the predictive power of commonly used lexical
comparing performances of the linear regression and neural network models with the
baseline performance. The lexical characteristics set of predictors include word length,
Task 4. The fourth task aimed to evaluate the predictive power of L1 psycholinguistic
of the variables z-scores of L1 English response times and accuracy ratings during the word
recognition task.
Task 5. The fifth task aimed to compare the predictive power of word frequency
regression models and 2 neural networks were run to execute this task. In Models 4a and
4b, the linear regression and neural network were trained and tested on word frequency and
lexical characteristics. For Models 5a and 5b, the linear regression and neural network were
predictive performances of these two algorithms were made based on the baseline
prediction.
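The zero rule baseline described in Task 1 can be sketched as below. The function names and the target values are hypothetical and serve only to illustrate the procedure: the mean of the training targets is used as the prediction for every test item, and this constant prediction is then scored with the MAE.

```python
def zero_rule_fit(train_targets):
    """Zero rule for regression: the 'model' is the mean of the training targets."""
    return sum(train_targets) / len(train_targets)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical target values (assumption), not data from this study.
train_targets = [1000.0, 3000.0, 5000.0, 7000.0]
test_targets = [2000.0, 4000.0, 6000.0, 8000.0]

baseline_value = zero_rule_fit(train_targets)
baseline_predictions = [baseline_value] * len(test_targets)
baseline_mae = mean_absolute_error(test_targets, baseline_predictions)

print(f"baseline prediction = {baseline_value}, baseline MAE = {baseline_mae}")
```

Any model that uses the predictors should reach a lower MAE than this constant prediction; otherwise the predictors add no information beyond the mean of the targets.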
5. Results
stated in the task descriptions in section 4.2.2. An overview of all model performance scores
is summarized in Table 1 below. Overall results of the regression tasks performed showed
that the neural network outperformed the linear regression model in predicting L2 receptive
word knowledge based on all sets of predictors, with the exception of Model 3b which inputs
L1 psycholinguistic measures as predictors. The best performing estimator is Model 4b, the
neural network tested on lexical characteristics and word frequency, with an MAE score of
2742.29, an improvement of 1777.98 units over the baseline model. When comparing
the performances of Models 4 and 5, both the linear regression and neural network indicate
that the models tested on lexical characteristics and word frequency perform better than
from prior studies on the importance of word frequency as a predictor for L2 proficiency
To check for robustness of the neural networks, train and validation losses of each
model were visualized during the cross-validation process to monitor for any indications of
over- or underfitting (Goodfellow et al., 2016). Based on the visualization results, there were
on the train data and test data for all models did not indicate large differences. All
visualizations and calculations related to the model robustness checks can be found in
Appendix C. In the next section, all results will be presented and elaborated for each task.
Table 1
Model   Predictors                     Algorithm   MAE Score   Improvement from Baseline
1a      Word frequency                 LR          3473.90     1013.32
1b      Word frequency                 NN          3413.73     1106.54
2a      Lexical characteristics        LR          3729.27      766.39
2b      Lexical characteristics        NN          3678.51      841.76
3a      L1 psycholinguistic measures   LR          3633.03      900.12
3b      L1 psycholinguistic measures   NN          3854.81      665.46
5.1 Task 1
The first task aimed to predict L2 receptive word knowledge based on the mean
value of the train targets as the baseline prediction. This prediction was compared with the
true values of the test data and resulted in a mean absolute error score of 4,520.27.
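As a worked illustration of how an improvement over this baseline is obtained, the sketch below subtracts a model's test MAE from the baseline MAE. It uses only the neural network test scores reported in this chapter (Models 1b, 2b, 3b, and 4b). The relative improvement is an additional quantity computed here for illustration and is not reported in the thesis.

```python
BASELINE_MAE = 4520.27  # Task 1 baseline reported above

# Neural network test MAE scores as reported in the Results chapter.
nn_test_mae = {
    "1b": 3413.73,
    "2b": 3678.51,
    "3b": 3854.81,
    "4b": 2742.29,
}


def improvement(baseline_mae, model_mae):
    """Absolute reduction in MAE relative to the baseline."""
    return baseline_mae - model_mae


def relative_improvement(baseline_mae, model_mae):
    """Reduction in MAE as a fraction of the baseline MAE (illustrative addition)."""
    return (baseline_mae - model_mae) / baseline_mae


for model, score in nn_test_mae.items():
    gain = improvement(BASELINE_MAE, score)
    share = relative_improvement(BASELINE_MAE, score)
    print(f"Model {model}: improvement = {gain:.2f} ({share:.1%} of baseline MAE)")
```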
5.2 Task 2
The second task aimed to evaluate the predictive power of word frequency on L2
English receptive word knowledge by comparing the performances of the linear regression
and neural network models with the baseline performance. The best-performing linear
regression estimator for Model 1a was selected with a validation MAE score of 3506.95 and
no regularization performed. The best-performing neural network for Model 1b was selected
with a validation MAE score of 3,430.98. The following hyperparameters were selected as
optimal: a batch size of 32, 60 epochs, 64 first layer nodes, 8 last layer nodes, 3 hidden
layers, and Adam as the optimization function. Both models performed better than the
baseline prediction, with the neural network showing slightly higher improvements from the
baseline prediction compared to the linear regression model. This indicates that word
5.3 Task 3
In the third task, lexical characteristics (excluding word frequency) were utilized to
predict L2 receptive word knowledge. The best-performing linear regression estimator for
Model 2a was selected with a validation MAE score of 3753.88 and no regularization
performed. The best-performing neural network for Model 2b was selected with a validation
MAE score of 3,696.88. The following hyperparameters were selected as optimal: a batch
size of 32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3 hidden layers, and
Adam as the optimization function. Both models performed better than the baseline
prediction, with the neural network showing slightly higher improvements from the baseline
prediction compared to the linear regression model. This indicates that lexical characteristics,
as a set of predictors, perform well in predicting L2 receptive word knowledge
5.4 Task 4
For the fourth task, both models were trained on L1 psycholinguistic measures to
predict L2 receptive word knowledge. The best-performing linear regression estimator for
Model 3a was selected with a validation MAE score of 3620.15. The best-performing neural
network for Model 3b was selected with a validation MAE score of 3,846.76. The following
hyperparameters were selected as optimal: a batch size of 32, 60 epochs, 32 first layer
nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function. Both
models performed better than the baseline prediction, this time with the linear regression
showing slightly higher improvements from the baseline prediction compared to the neural
able to perform well as a predictor for L2 word receptive knowledge in comparison to the
5.5 Task 5
For task 5, two sets of linear regression and neural network models were trained in
measures on L2 English receptive word knowledge. Models 4a and 4b were trained on word
frequency and other lexical characteristics, while Models 5a and 5b were trained on L1
regression estimator for Model 4a was selected with a validation MAE score of 3011.03 and
no regularization performed. The best-performing neural network for Model 4b was selected
with a validation MAE score of 2,759.93. The following hyperparameters were selected as
optimal: a batch size of 32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3
The best-performing linear regression estimator for Model 5a was selected with a
neural network for Model 5b was selected with a validation MAE score of 2,961.47, with the
following hyperparameters selected as optimal: a batch size of 32, 60 epochs, 64 first
layer nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function.
For both sets of models, the neural network consistently showed a better performance
compared to the linear regression model. Moreover, Models 4 showed higher improvements
in prediction performance compared to Models 5, where the neural network showed slightly
more improvement in performance compared to the linear regression model. This also
While all models showed better prediction performance compared to the baseline, a
large overall error remains present, which could indicate that the models systematically
under-predict the observed values of the targets. To further gain insights into the
performance of the models, additional analyses were performed on the best performing set
of models, namely Models 4a and 4b which utilized word characteristics and word frequency
as predictors. Visualizations made for the purpose of the post-hoc analysis can be found in
Appendix D.
Firstly, the distributions of the target values predicted by the linear
regression and neural network models were compared with the distribution of the actual
target values. Both prediction distributions differed considerably from the actual
distribution, which had a skewness of -0.00034. The distribution of the neural network
predictions more closely resembled that of the actual targets, with a skewness of -0.296
compared to -0.439 for the linear regression. Furthermore, the distribution of the
linear regression predictions indicates higher dispersion and the presence of more outliers.
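For reference, the skewness of a sample can be computed as sketched below. The thesis does not state which estimator was used, so the moment-based (Fisher-Pearson) coefficient shown here is an assumption, and the sample values are hypothetical. Library implementations that apply a bias correction will return slightly different values on small samples.

```python
def skewness(values):
    """Moment-based (Fisher-Pearson) skewness: m3 / m2 ** 1.5.

    Negative values indicate a longer left tail, positive values a longer
    right tail, and 0 a symmetric sample.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples (assumption), not predictions from this study.
symmetric_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
left_tailed_sample = [1.0, 8.0, 9.0, 10.0, 10.0]

print(skewness(symmetric_sample))
print(skewness(left_tailed_sample))
```

Under this definition, a prediction distribution whose skewness is closer to that of the actual targets reproduces the shape of the target distribution more faithfully.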
Secondly, the relationship between the predictions and the residuals of each model
was explored. A visualization of these relationships for the two models indicates that
residuals for the neural network predictions tend to spread more evenly around the
zero value compared to those of the linear regression, as depicted in Figure 2 below. Further
investigation reveals a correlation coefficient of r = 0.00084 for the predictions and residuals
Lastly, the coefficients of all predictors from the linear regression model were
extracted to analyze the importance of each variable. It was found that word frequency had
the largest magnitude compared to the other predictors included within the model. This is
consistent with findings from existing literature on the importance of word frequency as a
investigate the exceptional behavior observed in Models 3a and 3b. It was found that L2
receptive word knowledge exhibited a correlation coefficient of r = 0.532 and r = 0.4128 with
Figure 2
Relationship between residuals and predictions for the linear regression and neural network
models
Prediction vs Residual - Linear Regression
6. Discussion
To what extent can lexical characteristics and measures derived from L1 English
the set of predictors included in this study, results within this study were consistent with
findings from existing studies. All models tested consistently showed higher predictive
performance compared to the baseline, confirming the validity and importance of the
predictors included within this study, namely word frequency, word characteristics, and L1
psycholinguistic measures (Balota et al., 2007; Keuleers et al., 2012; Mandera et al., 2020).
Moreover, based on findings from task 5, it was observed that both the linear regression and
neural network models fitted on lexical characteristics and word frequency showed a better
psycholinguistic measures. This is consistent with findings from Brysbaert et al. (2017) and
Berger et al. (2019) that highlighted the stronger impact of word frequency on L2 lexical
When comparing the performances of the two algorithms, the neural network showed
better predictive performance compared to the linear regression model with the exception of
the models fitted on L1 psycholinguistic measures. The dataset used within this study
consists of a large number of observations and features, which adds to its dimensionality. High
algorithms such as the linear regression (Lavergne & Patilea, 2008). Therefore, this result is
not unexpected. Post-hoc analysis was conducted to investigate the exception found in the
found that the correlation coefficients of L2 receptive word knowledge with the mean L1
response time and the standard deviation of L1 response time were r = 0.532 and r = 0.4128,
respectively. These values are much higher than the correlation found between measures of word
knowledge in Hashimoto & Egbert (2019) and L1 ranking in Brysbaert et al. (2017), namely
r = 0.28.
between true values of L2 receptive word knowledge and the predicted values by the linear
regression and neural network models. Performing classification tasks for L2 proficiency
could better capture the differences in knowledge between learners as samples of second
language learners tend to exhibit imbalanced classes (Arnold et al., 2018; Aryadoust &
Baghaei, 2016; Sinclair et al., 2021). In this study, L2 receptive word knowledge was
Several limitations arise due to the nature of the study approach adopted as well as
the dataset used. While utilizing the word ranking as a measure of L2 lexical proficiency
allows for the capture of additional dimensions such as learner motivation in learning a word,
information, a linear rank of lexical proficiency may lose information relevant to language
Utilizing big data collected from mega studies for this thesis presents several
advantages compared to data obtained from factorial design studies, such as improved
model predictive power due to the large number of word stimuli as well as the availability of
the data for use in other studies. However, due to the nature of the word recognition task
used for the dataset collection, the measure of word knowledge in this study is based solely
on yes or no decisions. While this is a valid measure for word knowledge (Ferré & Brysbaert,
2017; Harrington & Carey, 2009; Laufer et al., 2004; Nation, 2001; Zhang et al., 2019), it is
only able to capture a surface-level knowledge of words, and not, for example, whether a
second language speaker indeed understands the meaning of a word (Brysbaert et al.,
2019). Furthermore, the dataset used in the present study only includes words, while the
2019).
Future studies could address these limitations, firstly by treating L2 word knowledge
as a categorical variable. This would allow for a better comparison of L2 proficiency which
better reflects the true learning curve of second language learners (Arnold et al., 2018;
Aryadoust & Baghaei, 2016). Secondly, future studies may improve and expand on the
measure of vocabulary knowledge used within this study. The availability of a large dataset
word knowledge, depth of vocabulary, and knowledge of word meaning (Henriksen, 1999) is
limited, leaving room for further contributions of such datasets. Lastly, future studies could
also include multi-word expressions in addition to singular words as stimuli for large-scale
second language vocabulary testing, more accurately representing the make-up of the
English vocabulary.
7. Conclusion
This study aimed to test the predictive power of lexical characteristics and receptive
speakers’ receptive word knowledge. Results from this study bring several scientific
The results showed that the models trained on word frequency showed the better predictive
set of other commonly used lexical characteristics, which confirms previous findings. This
study found that L1 English receptive word knowledge performed better than the
baseline in predicting L2 receptive word knowledge, but performed worse than word
frequency, which confirms previous findings that word frequency is a more important
from mega study brings a novel contribution to existing studies. Overall, the neural network
linear regression algorithm. From a practical perspective, findings from this study could
provide input on how to model word difficulty and word acquisition levels for English second
characteristics. Specifically, this study identified word frequency to be a stronger indicator for
word difficulty and word acquisition levels of L2 English speakers compared to the other
variables of interest. This provides useful insights for English second language vocabulary
Acknowledgements
I would like to express my gratitude to my supervisor at Tilburg University for his inputs and
guidance throughout the process of writing this thesis. I would also like to acknowledge my
appreciation to the second reader of this thesis as well as other parties that have contributed
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Zheng, X. (2016).
preprint arXiv:1603.04467.
Anderson, R. C., & Freebody, P. (1981). Effects of differing proportions and locations of
difficult vocabulary on text comprehension. Center for the Study of Reading Technical
Ari, N., & Ustazhanov, M. (2014, September). Matplotlib in python. In 2014 11th International
Arnold, T., Ballier, N., Gaillat, T., & Lissón, P. (2018). Predicting CEFRL levels in learner
English on the basis of metrics and full texts. arXiv preprint arXiv:1806.11099.
Aryadoust, V., & Baghaei, P. (2016). Does EFL readers' lexical and grammatical knowledge
predict their reading ability? Insights from a perceptron artificial neural network study.
Asfaha, Y. M., Beckman, D., Kurvers, J., & Kroon, S. (2009). L2 reading in multilingual
Balota, D. A., & Chumbley, J. I. (1990). Where are the effects of frequency in visual word
recognition tasks? Right where we said they were! Comment on Monsell, Doyle, and
Haggard (1989).
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... &
Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-
459.
Balota, D. A., Aschenbrenner, A. J., & Yap, M. J. (2013). Additive effects of word frequency
and stimulus quality: the influence of trial history and data transformations. Journal of
Berger, C. M., Crossley, S. A., & Kyle, K. (2019). Using native-speaker psycholinguistic
Berger, C., Crossley, S., & Skalicky, S. (2019). Using lexical features to investigate second
language lexical decision performance. Studies in Second Language Acquisition, 41(5), 911-
935.
Brysbaert, M., & Cortese, M. J. (2011). Do the effects of subjective frequency and age of
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we
know? Practical estimates of vocabulary size dependent on word definition, the degree of
Brysbaert, M., Lagrou, E., & Stevens, M. (2017). Visual word recognition in a second
language: A test of the lexical entrenchment hypothesis with lexical decision times.
Brysbaert, M., Keuleers, E., & Mandera, P. (2019). Recognition times for 62 thousand
Brysbaert, M., Keuleers, E., & Mandera, P. (2020). Which words do English non-native
speakers know? New supernational levels based on yes/no decision. Second Language
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning &
McQuillan (2016).
Diaz, G. I., Fokoue-Nkoutche, A., Nannicini, G., & Samulowitz, H. (2017). An effective
algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for
Ellis, R. (2003). Task-based language learning and teaching. Oxford University Press.
Ferré, P., & Brysbaert, M. (2017). Can Lextale-Esp discriminate between groups of highly
Freebody, P., & Anderson, R. C. (1981). Effects of differing proportions and locations of
difficult vocabulary on text comprehension. Center for the Study of Reading Technical
Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool. System,
37(4), 614-626.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied linguistics, 35(3),
305-327.
Gaurav, M. (2019, December 17). How to find the optimum number of hidden layers and
https://www.datagraphi.com/blog/post/2019/12/17/how-to-find-the-optimum-number-of-
hidden-layers-and-nodes-in-a-neural-network-model
Grasemann, U., Peñaloza, C., Dekhtyar, M., Miikkulainen, R., & Kiran, S. (2021). Predicting
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hashimoto, B. J., & Egbert, J. (2019). More than frequency? Exploring predictors of word
Haykin, S., & Principe, J. (1998). Making sense of a complex world [chaotic events
He, X., & Godfroid, A. (2019). Choosing words to teach: A novel method for vocabulary
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in second language
acquisition: What does the coefficient of variation tell us?. Applied Psycholinguistics, 30(4),
555-582.
Kerz, E., Wiechmann, D., Qiao, Y., Tseng, E., & Ströbel, M. (2021, April). Automated
and RNNs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project:
Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in
Kramer, O. (2016). Scikit-learn. In Machine learning for evolution strategies (pp. 45-53).
Springer, Cham.
Laufer, B. (1998). The development of passive and active vocabulary in a second language:
Laufer, B., & Paribakht, T. S. (1998). The relationship between passive and active
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need both to
Lavergne, P., & Patilea, V. (2008). Breaking the curse of dimensionality in nonparametric
Lee, J. W., & Schallert, D. L. (1997). The relative contribution of L2 language proficiency and
Leow, R. P., Grey, S., Marijuan, S., & Moorman, C. (2014). Concurrent data elicitation
procedures, processes, and the early stages of L2 learning: A critical overview. Second
Liben-Nowell, D., Strand, J., Sharp, A., Wexler, T., & Woods, K. (2019). The danger of
of cognition, 2(1).
vocabulary knowledge in writing skill, which one important. International Journal of Academic
Mandera, P., Keuleers, E., & Brysbaert, M. (2020). Recognition times for 62 thousand
English words: Data from the English Crowdsourcing Project. Behavior Research
McKinney, W. (2011). pandas: a foundational Python library for data analysis and
statistics. Python for high performance and scientific computing, 14(9), 1-9.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language
Meara, P. M., & Milton, J. L. (2003). The Swansea vocabulary levels test: The manual.
Milton, J. (2006). X-Lex: The Swansea vocabulary levels test. In Proceedings of the 7th and
8th Current Trends in English Language testing (CTELT) Conference (Vol. 4, pp. 29-39).
Monaghan, P., Chang, Y. N., Welbourne, S., & Brysbaert, M. (2017). Exploring the relations
Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word
recognition tasks: Where are they?. Journal of Experimental Psychology: General, 118(1),
43.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. Vocabulary:
Nurweni, A., & Read, J. (1999). The English vocabulary knowledge of Indonesian university
O'Dell, F., Read, J., & McCarthy, M. (2000). Assessing vocabulary. Cambridge University
Press.
Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level
Pulido, D., & Hambrick, D. Z. (2008). The virtuous circle: Modeling individual differences in
Santos, V. D., Verspoor, M., & Nerbonne, J. (2012). Identifying important factors in essay
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of
two new versions of the Vocabulary Levels Test. Language testing, 18(1), 55-88.
Skalicky, S., Crossley, S. A., & Berger, C. M. (2019). Predictors of second language English
lexical recognition: Further insights from a large database of second language lexical
Sparks, R. L., Patton, J., Ganschow, L., & Humbach, N. (2012). Do L1 reading achievement
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900.
Van Gelderen, A., Schoonen, R., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., &
Van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A
new and improved word frequency database for British English. Quarterly journal of
Webb, S. A., & Chang, A. C. S. (2012). Second language vocabulary growth. RELC journal,
43(1), 113-126.
Wickham, H., & Wickham, M. H. (2020). Package ‘plyr’. A Grammar of Data Manipulation. R
package version, 8.
Yang, Y., Yu, W., & Lim, H. (2016). Predicting second language proficiency level using
Yap, M. J., Balota, D. A., Tse, C. S., & Besner, D. (2008). On the additive effects of stimulus
quality and word frequency in lexical decision: evidence for opposing interactive influences
Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language
Hall/CRC.
Appendix A
Figure 3
Figure 4
Appendix B
Table 2
Linear regression
Neural Network
Number of layers 2, 3
Figure 5
Figure 6
Figure 7
Figure 8
Neural network models selected – Model 4b: lexical characteristics & word frequency as
predictors
Figure 9
Neural network models selected – Model 5b: lexical characteristics & L1 psycholinguistic
measures as predictors
Appendix C
Figure 10
Train & Validation Loss Visualization for Model 1b – Neural network with word frequency
Figure 11
Train & Validation Loss Visualization for Model 2b – Neural network with word characteristics
Model 2 Train & Val Loss per Epoch
Figure 12
Train & Validation Loss Visualization for Model 3b – Neural network with L1 psycholinguistic
measures
Figure 13
Train & Validation Loss Visualization for Model 4b – Neural network with word characteristics
Figure 14
Train & Validation Loss Visualization for Model 5b – Neural network with word characteristics
Appendix D
Figure 15
Figure 16
regression
Figure 17
Distribution of predictions based on word characteristics & word frequency – neural network
Figure 18
Distribution of residuals for predictions based on word characteristics & word frequency –
linear regression
Figure 19
Distribution of residuals for predictions based on word characteristics & word frequency –
neural network