Predicting Non-Native English Speakers’ Vocabulary

Knowledge with Native English Speaker Psycholinguistic


Behaviors: A Big Data Approach

Student details

Name: D.V. Larasati


Student number: U252005

THESIS SUBMITTED IN PARTIAL FULFILLMENT


OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee

Supervisor: dr. B. Nicenboim


Second reader: dr. Emmanuel Keuleers

Word count: 8,967

Tilburg University
School of Humanities & Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
3 December 2021
Abstract

With the growing need for and interest in language learning, as well as the increasing

accessibility of tools and technologies that support second language learning, it is becoming

more relevant to add to the body of research conducted on this topic. One of the most

relevant topics studied over time is how second language (L2) speakers develop lexical

proficiency. An important indicator for a second language speaker’s lexical proficiency is

vocabulary knowledge. Various studies have highlighted the importance of lexical

characteristics in predicting the ease with which L2 speakers acquire new words. These

characteristics include how frequently speakers are exposed to words in the language, how

readily they can associate these words with something concrete, as well as word length.

However, few studies have modeled the relationship between native language (L1) lexical

proficiency and second language proficiency. This study aims to contribute to existing

literature by studying the predictive power of native language vocabulary knowledge, as

measured by L1 psycholinguistic measures collected during a word recognition task, for

second language vocabulary knowledge as measured by receptive word knowledge. This

study will compare the implementation of different machine learning algorithms on

regression tasks predicting L2 receptive word knowledge based on data obtained from a

mega study, which brings a unique contribution within this study domain. It was found that

word frequency outperformed L1 psycholinguistic measures as a predictor of L2 receptive word

knowledge for both the linear regression and neural network models, confirming previous findings.

Overall, the neural network showed better performance than the linear regression algorithm as

an estimator.

Keywords: lexical proficiency, word recognition, machine learning, neural network,

psycholinguistic measures
Data Source/Code/Ethics Statement

Data collected and processed through the English Crowdsourcing Project (Mandera,

Keuleers, & Brysbaert, 2020) as part of a set of online vocabulary tests organized by Ghent

University were used for the purpose of writing this thesis. All data analyzed within this thesis

are publicly available on the following sources: Our lexicon projects (ugent.be) and OSF |

English Crowdsourcing Project.

Code used to conduct experimental procedures for the purpose of data analysis in this thesis

was obtained from the following publicly available source (Gaurav, 2019). The repurposed

code written by the author for this thesis can be found on the following link.
Table of Contents

Abstract
Data Source/Code/Ethics Statement
Predicting Non-Native English Speakers’ Vocabulary Knowledge with Native English Speaker Psycholinguistic Behaviors: A Big Data Approach
2. Related Work
2.1 Relevant Work
2.1.1 Measuring Vocabulary Knowledge
2.1.2 Lexical characteristics as Predictors for L2 Vocabulary Knowledge
2.1.3 L1 Psycholinguistic Measures as Predictors for L2 Vocabulary Knowledge
2.1.4 Machine Learning Approach to Lexical Proficiency Predictive Analysis
2.2 The Current Study
3. Methods
3.1 Modeling Approach
3.1.1 Linear Regression
3.1.2 Neural Network
3.1.3 Models
3.2 Predictors
3.2.1 Word Frequency
3.2.2 Lexical characteristics
3.2.3 L1 English Psycholinguistic Behavior
3.3 Target
4. Experimental Setup
4.1 Dataset Description
4.1.1 Data Sources
4.1.2 Data Preprocessing
4.2 Experimental Procedure
4.2.1 Model Evaluation & Hyper Parameters
4.2.2 Tasks
Task 2.
Task 3.
5. Results
5.1 Task 1
5.2 Task 2
5.3 Task 3
5.4 Task 4
5.5 Task 5
5.6 Post-Hoc Analysis
6. Discussion
6.1 Limitations & Future Research
7. Conclusion
Acknowledgements
References
Appendix A
Appendix B
Appendix C
Appendix D

Predicting Non-Native English Speakers’ Vocabulary Knowledge with Native

English Speaker Psycholinguistic Behaviors: A Big Data Approach

With the growing need for and interest in language learning, as well as the increasing

accessibility of tools and technologies that support second language learning, it is becoming

more relevant to add to the body of research conducted on this topic. One of the most

relevant topics studied over time is how second language (L2) speakers develop lexical

proficiency. L2 lexical proficiency can be defined as the depth of understanding (Anderson &

Freebody, 1981) as well as the breadth of lexical knowledge, or vocabulary size (Hazenberg

& Hulstijn, 1996; Nurweni & Read, 1999).

An indicator for someone’s vocabulary size is vocabulary knowledge, or their

knowledge of words (Laufer et al., 2004; Nation, 2001). Previous studies have concluded

that vocabulary knowledge refers to a person’s scope of word usage in writing, reading,

listening, and speaking, which can be further divided into receptive and productive

knowledge (Henriksen, 1999; Laufer, 1998; O’Dell et al., 2000), among other dimensions

(Maskor & Baharudin, 2016). Receptive knowledge of a word involves aspects such as

recognizing a word, knowing what a word looks like, and associating a word to other words

(Nation, 2001). Therefore, one of the most common tasks used to measure receptive word

knowledge includes the word recognition task, where participants are asked to indicate

whether they recognize a word and to identify whether the word seen is an actual word or a

pseudo word (Berger et al., 2019). Response times and level of accuracy of participants’

responses have been found to be important indicators of how L2 speakers process and

access words mentally, how they develop automaticity, as well as of their vocabulary size

(DeKeyser, 2001; Leow et al., 2014; Milton, 2006).

There are many factors which affect the ease of learning words and expanding

vocabulary knowledge for L2 speakers. One of these is lexical characteristics, a set of word

characteristics that include word frequency, word length, orthographic and phonological

distance to other words, number of morphemes, average age of acquisition, and



concreteness (Berger et al., 2019a, 2019b; Brysbaert et al., 2016, 2020; Keuleers et al., 2012,

2015; Mandera et al., 2020; Skalicky et al., 2019). Of these characteristics, word frequency

has been a consistently significant predictor for word knowledge, where words that appear

or are used more frequently within a language tend to be learned more quickly by both

second language and native language speakers (Berger et al., 2019; Brysbaert & Cortese,

2011; Ellis, 2002; 2003; Read, 1998; Schmitt et al., 2001; Mandera et al., 2020). However,

there are drawbacks to using word frequency as an indicator for L2 lexical proficiency

(Berger et al., 2019; Brysbaert et al., 2020; He & Godfroid, 2019; Mandera et al., 2020). First,

there are disproportionately more low-frequency words than high-frequency words,

particularly in the English language, and, as a result, many low-frequency words

are, in fact, well known by L2 speakers. Second, word frequency is a measure derived

from a corpus or set of corpora, which means that it is dependent on the corpus used for its

measurement and is subjective in nature.

Another variable which has been studied in relation to L2 lexical proficiency is lexical

proficiency of native language (L1) speakers. Previous studies have found that word

knowledge of individuals who claimed English to be their native tongue and individuals who

spoke English as a second language exhibited high correlation and similar patterns

(Monaghan et al., 2017; Brysbaert et al., 2020). Meanwhile, others noted a relationship

between reading, writing, and speaking abilities of native and second languages of bilingual

individuals (Asfaha et al, 2009; Lee & Schallert, 1997; Pae, 2019; Sparks et al, 2012).

Berger et al. (2019) established that psycholinguistic behavior of L1 English speakers, as

measured by response time and accuracy during lexical tasks, was able to index the

difficulty level of words produced over time by a group of students who did not speak English

as a native language.

Response time and accuracy rate on lexical decision tasks are prevalent measures of

receptive word knowledge as they provide information on real-time automatic processing of

a language by both L2 and L1 speakers (Berger et al., 2019). This is the underlying

motivation for the focus of this study on how psycholinguistic measures of L1 English

speakers are able to predict L2 vocabulary knowledge. Measures of response time on lexical

tasks indicate how well a language speaker’s mental lexical network is organized as well as

their processing abilities, allowing for conclusions on how automaticity is developed

(DeKeyser, 2001; Hulstijn et al., 2009; Van Gelderen et al., 2004). Accuracy rate, on the

other hand, is a widely accepted indicator for L1 and L2 vocabulary size (Meara & Milton,

2003). Moreover, compared to measures based on corpora and individual interpretation of

word meaning, these psycholinguistic measures bring some advantages. Corpora-based

measures such as word frequency rely on proxies of large reference corpora, which may not

be representative of language speakers’ actual language exposure. Meanwhile, other lexical

characteristics, such as concreteness, rely on the judgement and perception of individuals,

making them more subjective compared to L1-derived measures which are based on L1

speakers’ performance on word knowledge tasks (Berger et al., 2019).

One of the most prevalent methods used to capture psycholinguistic measures is the

online or mega study approach (Balota et al., 2007; Berger et al., 2019; Marinis, 2010;

Mandera et al., 2020). As opposed to the conventional method of language proficiency

testing in a classroom or experimental set up, this approach has several advantages,

including: the ability to include a larger number of stimuli and participants, resulting in more

power; the ability to assess the importance of both existing and new word characteristics;

data availability for repeated use and answering different research questions; avoiding

experimenter bias; random selection of stimuli to be presented to participants; as well as

evaluating the performance of new computational models (Balota et al., 2015; Keuleers &

Balota, 2013; Liben-Nowell et al., 2019; Mandera et al., 2020). The mega study approach

has led to increased availability of large amounts of data over the years (Mandera et al.,

2020), which together with rapid advancements of computational models has resulted in

increased interest in the utilization of machine learning models for lexical proficiency

prediction (Santos et al., 2012; Yang et al., 2016). However, studies that measure

the predictive power of L1 psycholinguistic behavior for L2 lexical proficiency,

particularly with measures derived from mega studies, remain limited.



This study aims to add to the body of literature within the domain of L2 lexical

proficiency by focusing on evaluating how well psycholinguistic measures of native English

(L1) speakers can predict receptive word knowledge of non-native (L2) English speakers in

comparison to a more conventionally used set of predictors, namely lexical characteristics. In

doing so, information obtained through the means of a mega study will be utilized and the

performances of a simple and a more complex machine learning algorithm will be compared.

The following main research question and sub-research questions are formed:

To what extent can lexical characteristics and measures derived from L1 English

psycholinguistic behavior predict L2 English receptive word knowledge?

1. How do the performances of different machine learning algorithms compare in

predicting L2 English receptive word knowledge based on word frequency?

2. How do the performances of different machine learning algorithms compare in

predicting L2 English receptive word knowledge based on a set of commonly

used lexical characteristics?

3. How do the performances of different machine learning algorithms compare in

predicting L2 English receptive word knowledge based on L1 English

psycholinguistic behavior during word recognition and lexical decision tasks?

4. How does word frequency compare with L1 psycholinguistic behavior in

predicting L2 English receptive word knowledge?

English L2 receptive word knowledge is represented by word difficulty ranking based

on response times and accuracy ratings of L2 speakers on an online word recognition task,

while L1 English psycholinguistic behavior is assessed by response times and accuracy

ratings of native English speakers on the same task. To answer the research questions

presented in this study, one baseline model, five linear regression models and five neural

networks were run.

Findings of this study suggest that word frequency outperforms L1 English

psycholinguistic measures in predicting L2 receptive word knowledge. However, including L1

measures in place of word frequency together with other lexical characteristics within the

model yields comparable results, confirming their predictive power not only for productive L2

word knowledge, but also for receptive L2 word knowledge.

2. Related Work

2.1 Relevant Work

2.1.1 Measuring Vocabulary Knowledge

An important indicator for a second language speaker’s lexical proficiency is their

vocabulary knowledge, which forms the foundation for their proficiency in writing, reading,

listening, and speaking (Pulido & Hambrick, 2008). The present study will utilize

receptive word knowledge as a measure for L2 vocabulary knowledge. Receptive word

knowledge includes aspects such as knowing what a word looks and sounds like, knowing

what parts of a word are recognizable, and knowing the meaning of a word. A common tool

used to measure receptive word knowledge is the word recognition task (DeKeyser, 2001;

Leow et al., 2014; Meara & Milton, 2002), where an L2 speaker is typically presented with a

word stimulus and asked to identify whether they recognize the stimulus as an actual word

within the language or not (Berger et al., 2019). Measures of L2 response times and their

accuracy ratings on this task are recorded and have been found to have high correlations

with more demanding in-depth language tests (Ferré & Brysbaert, 2017; Harrington & Carey,

2009; Lemhöfer & Broersma, 2012; Meara & Buxton, 1987; Zhang et al., 2019). These

measures on the word recognition task have also been used to form lists of word ranking

which were able to indicate word difficulty, how frequently an L2 speaker is exposed to the

word within the language (Brysbaert et al., 2019), as well as how useful the L2 speaker

perceives the word or how motivated they are to learn the word (He & Godfroid, 2019). In

their study, Brysbaert, Keuleers, and Mandera (2019) found that L2 word rank had the

highest correlation with word difficulty measures based on human ratings (r = .60), followed

by log of word frequency (r = -.52) based on the COCA corpus (Gardner & Davies, 2014), as

well as word usefulness based on human ratings (r = -.47). Moreover, the word rank as a

measure for L2 English word knowledge had a correlation of .68 with word knowledge from a

similar study by Hashimoto and Egbert (2019). Thus, this study will utilize word ranks as a

measure for L2 receptive word knowledge.

2.1.2 Lexical characteristics as Predictors for L2 Vocabulary Knowledge

Although larger vocabulary size indicates higher L2 lexical proficiency, not all words

need to be known for successful language understanding. L2 speakers would normally only

acquire between 2,500 and 3,000 of the most frequent word families (i.e., a word and its various

forms of inflections and derivations) of a language (Cobb, 2007; 2016; Webb & Chang,

2012), compared to the 11,100 word families typically known by native (L1) speakers

(Brysbaert et al., 2016). More frequent words tend to be processed more quickly, both for L1

and L2 speakers (Ellis, 2002; 2003), and are typically learned by L2 speakers at an earlier

point in time compared to less frequent words (O’Dell et al., 2000; Schmitt et al., 2001). This

finding is further affirmed by several classic and well-known studies on the effect of lexical

characteristics on lexical proficiency, such as the English Lexicon Project and British Lexicon

Project (Monsell et al., 1989; Yap et al., 2008; Balota et al., 2007; Keuleers et al., 2012;

Mandera et al., 2020). Some of the other important lexical characteristics which have been

found to be significant predictors for L2 word knowledge include word length, number of

syllables of a word, and average age of acquisition of a word (Balota et al., 2007; Mandera et

al., 2020). Despite this, Brysbaert et al. (2016) summarize the drawbacks of word frequency

as a predictor for L2 word knowledge as follows: “word frequency is based on the

assumption that all words encountered by a speaker are of the same weight; lack of

language representation in the corpora from which word frequency measures are taken; and

that receptiveness of a word also depends on the motivation of the learner to know the word”

(pp. 27-28). Keeping this in mind, there is room for further study on variables which could

better predict and serve as more reliable measures for L2 vocabulary knowledge.

2.1.3 L1 Psycholinguistic Measures as Predictors for L2 Vocabulary Knowledge

Various studies have looked into the relationship between L1 and L2 lexical

proficiency. These studies oftentimes focus on the dynamics between L1 and L2 reading

and writing abilities for individuals. For example, Asfaha et al. (2009) revealed that L1

reading comprehension and L2 language proficiency significantly predict L2 reading

proficiency. Meanwhile, Lee & Schallert (1997) found that L2 proficiency is a better predictor

for L2 reading ability compared to L1 reading ability. Studies with similar findings on the

relationship between L1 and L2 writing and reading skills were conducted by Sparks et al.

(2012) and Pae (2019). However, little work has been done on L2 vocabulary knowledge

and how this can be predicted by L1 vocabulary. One such study is Brysbaert, Lagrou, and

Stevens (2017), which confirmed the lexical entrenchment hypothesis

for L1 and L2 English speakers, observing that the frequency effect is stronger for

second language speakers than for native language speakers.

Another such study is Berger et al. (2019), which attempted to predict L2 English

productive word knowledge with psycholinguistic measures derived from L1 English speaker

performance on word recognition and lexical decision tasks. Findings from this study

suggest that L2 speakers with higher language proficiency (as assessed by human raters)

produce English words that are recognized and named more slowly and less accurately by

L1 speakers. Results from the longitudinal study within this paper conclude that L2 speakers

produce less frequent words over time, as well as words which are recognized or named

less accurately and more slowly by L1 speakers.

2.1.4 Machine Learning Approach to Lexical Proficiency Predictive Analysis

While there has been growing interest in the utilization and exploration of machine

learning approaches to predict lexical proficiency (Petersen & Ostendorf, 2009; Santos et al.,

2012; Yang et al., 2016), most studies adopting this approach have used corpora-based

measures or measures obtained through factorial and classroom designs as data sources.

Moreover, measures of L2 proficiency utilized within these studies were predominantly

measures based on human assessment. For example, Arnold et al. (2018) predicted CEFR

levels of L2 English learners using corpora-based lexical and syntactic metrics. The

performances of gradient boosted trees and a neural network were compared in a

pairwise-classification task of L2 proficiency classes of adjacent CEFR levels. From a



linguistic point of view, the most distinguishable marks in proficiency among second

language learners arise from adjacent classes. Key findings of the study highlighted the

higher performance of the gradient boosted tree compared to the neural network and that

task-based corpora entail strong overfitting. Aryadoust & Baghaei (2016) attempted to

predict L2 English learners’ reading ability on the basis of their lexical and grammatical

knowledge using a neural network. Results from this study showed that the neural network

was able to classify L2 readers based on their grammatical and vocabulary knowledge with

an accuracy level of approximately 78% and confirmed previous findings on the relationship

between these variables. However, there were limitations acknowledged on the reliability of

the Rasch model utilized to derive proficiency measures. Lastly, Kerz et al. (2021)

highlighted the advantages of using automatically generated L2 proficiency metrics with a

recurrent neural network (RNN) compared to metrics based on human assessment.

2.2 The Current Study

The presented body of literature leaves room for further studies within the domain of

L2 lexical proficiency. While previous studies have analyzed the relationship between

receptive word knowledge and various lexical characteristics, little research has been done

to model the relationship between L1 English and L2 English lexical proficiency, particularly

for receptive word knowledge. Therefore, this study aims to replicate previous findings

showing the predictive power of lexical characteristics on L2 receptive word knowledge, and

add a unique finding to existing literature by analyzing the predictive power of L1

psycholinguistic measures on L2 receptive word knowledge. Moreover, the previously

mentioned studies utilizing machine learning to predict second language proficiency have

predominantly used information obtained from corpora, conventional language tests, or

experiments. This study will thus add a unique contribution by adopting the machine learning

approach to predict L2 lexical proficiency using data from a mega study.



3. Methods

3.1 Modeling Approach

The current study utilized a machine learning approach to answer the presented

research questions, where the performance of a simpler machine learning algorithm was

compared with that of a more complex one. The simpler model tested was linear

regression, as it is a well-established model that is easy to interpret (Stigler, 1986). The

feed-forward neural network was selected as the more complex machine learning model to

predict lexical proficiency as it may have a few advantages compared to linear

regression, such as being more adaptive, being able to model non-linear functions, and not

requiring assumptions about the data or relationship of the data being modeled (Haykin, 1998;

Hornik et al., 1989).

3.1.1 Linear Regression

Linear regression is one of the most widely accepted classical algorithms for

prediction, dating back more than 100 years (Stigler, 1986; Yan & Su, 2009). The

purpose of a linear regression model is to examine the relationship between a set of independent

variables and a dependent variable by estimating the parameters of the regression equation that

minimize its loss function (Yan & Su, 2009). The most prevalent method is to

minimize the sum of squared residuals of the estimated equation. The linear regression

equation can be depicted as follows:

$$Y_i = f(X_i, \beta) + e_i$$

where $Y_i$ is the dependent variable, $X_i$ is the set of corresponding independent

variables, $\beta$ the estimated parameters of the least squares regression equation, and $e_i$ the

residuals or differences between the observed and fitted values of the dependent variable.
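
As an illustration of the least squares principle described above, the following minimal sketch fits a regression with a single predictor (as in a model trained on word frequency alone). It is not the code used in this thesis; the function names and the toy data are assumptions made for illustration only.

```python
from statistics import fmean


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals
    for a regression with one predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Loss that ordinary least squares minimizes."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy values, invented for illustration (not thesis data).
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [3.1, 4.9, 7.2, 8.8]
intercept, slope = fit_simple_ols(x_toy, y_toy)
loss_at_fit = sum_squared_residuals(x_toy, y_toy, intercept, slope)
```

Any other choice of intercept or slope gives a larger sum of squared residuals on the same data, which is what makes the fitted parameters the least squares estimates.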

3.1.2 Neural Network

The goal of the neural network is to approximate some function, such as y = f2(z; u, c) as

depicted in Figure 1, by mapping an input x to a value of y based on the calculations of

activation functions within each layer. This is done by learning the values of parameters of the

network (in this case, w, b and u, c) that result in the best function approximation (Goodfellow

et al., 2016).

Figure 1

Structure of Feed-Forward Neural Network

[Figure: input layer (x), hidden layer z = f1(x; w, b), output layer y = f2(z; u, c)]

A loss function is chosen to evaluate how well the model is able to predict true values

of the dataset based on the nature of the problem. The difference between the network’s

output and the expected output as calculated by the loss function is minimized by computing

its gradients, which inform the model how to update the parameters. In order to learn the

best parameter values, the derivative of the prediction error with respect to each parameter

is calculated with the backpropagation algorithm. For example, we can express the

backpropagation of the prediction error of the network in Figure 1 with the following

equation:

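$$\frac{\partial \mathit{loss}}{\partial w} = \frac{\partial \mathit{loss}}{\partial y} \times \frac{\partial y}{\partial z} \times \frac{\partial z}{\partial w}$$
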
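
The chain rule for the network in Figure 1 can be checked numerically. The sketch below is not the thesis implementation: it assumes scalar units, a tanh activation in the hidden layer, a linear output layer, and a squared-error loss, none of which are specified in the text above. It compares the analytic gradient with respect to w against a finite-difference estimate.

```python
import math


def forward(x, w, b, u, c):
    """Forward pass: z = f1(x; w, b), y = f2(z; u, c)."""
    z = math.tanh(w * x + b)  # hidden layer (tanh is an assumption)
    y = u * z + c             # linear output layer (assumption)
    return z, y


def loss_fn(y, target):
    """Squared-error loss (assumption)."""
    return (y - target) ** 2


def grad_w(x, target, w, b, u, c):
    """d loss / d w as the product of the three chain-rule factors."""
    z, y = forward(x, w, b, u, c)
    dloss_dy = 2.0 * (y - target)
    dy_dz = u
    dz_dw = (1.0 - z ** 2) * x  # derivative of tanh(w*x + b) w.r.t. w
    return dloss_dy * dy_dz * dz_dw


def numeric_grad_w(x, target, w, b, u, c, eps=1e-6):
    """Central finite-difference estimate of d loss / d w."""
    _, y_plus = forward(x, w + eps, b, u, c)
    _, y_minus = forward(x, w - eps, b, u, c)
    return (loss_fn(y_plus, target) - loss_fn(y_minus, target)) / (2 * eps)


# Arbitrary illustrative values for the input, target, and parameters.
analytic = grad_w(0.5, 1.0, 0.3, -0.1, 0.8, 0.2)
numeric = numeric_grad_w(0.5, 1.0, 0.3, -0.1, 0.8, 0.2)
```

The two values should agree closely, which shows that multiplying the local derivatives of each layer yields the gradient used to update w.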
3.1.3 Models

This study applied one baseline model, five linear regression models, and five neural network

models:

1. Baseline model: Zero-rule algorithm

The zero-rule algorithm is one of the most commonly used baseline algorithms in

machine learning that takes information about the targets in a problem in order to

create a rule for predictions (Zhou, 2019). The mean value of the targets, L2 English

receptive word knowledge, was used as the default prediction, as it represents the

central tendency for the output.

2. Models 1a and 1b: The linear regression model and neural network trained on word

frequency

3. Models 2a and 2b: The linear regression model and neural network trained on

common lexical characteristics (excluding word frequency)

4. Models 3a and 3b: The linear regression model and neural network trained on L1

English psycholinguistic behavior measures of response times and accuracy rates on

lexical decision tasks

5. Models 4a and 4b: The linear regression model and neural network trained on word

frequency and other common lexical characteristics

6. Models 5a and 5b: The linear regression model and neural network trained on L1

English psycholinguistic behavior measures and common lexical characteristics

(excluding word frequency)
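
The zero-rule baseline in the list above can be sketched as follows. This is a minimal illustration rather than the code used in this thesis; the function names, the toy target values, and the use of root mean squared error as the comparison metric are assumptions.

```python
import math
from statistics import fmean


def fit_zero_rule(train_targets):
    """Zero-rule for regression: ignore all predictors and
    always predict the mean of the training targets."""
    return fmean(train_targets)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


# Toy target values, invented for illustration (not thesis data).
train_targets = [1.0, 2.0, 3.0, 6.0]
test_targets = [2.0, 4.0]

baseline_prediction = fit_zero_rule(train_targets)
baseline_rmse = rmse(test_targets, [baseline_prediction] * len(test_targets))
```

Any trained model is only useful to the extent that its error on held-out words is lower than this baseline error.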

3.2 Predictors

Three groups of predictor variables were used within this study, namely word frequency, lexical

characteristics, and L1 psycholinguistic measures.

3.2.1 Word Frequency

Word frequency is a measure of how often a word is used within a language based

on a specific corpus. This study used word frequency measures from the SUBTLEX-US

database based on 51 million words in American English subtitles (Brysbaert & New, 2012).

The number of times that a word appears within the corpus is expressed as Zipf scores, a

standardized measure of frequency independent of corpus size (van Heuven et al., 2014).
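
As an illustration of the Zipf scale, the sketch below uses the commonly cited simplified definition, namely the base-10 logarithm of the frequency per million words plus 3. The smoothing for unseen words that published norms apply is omitted here, so the function is an assumption-laden approximation rather than the exact SUBTLEX-US computation.

```python
import math


def zipf_score(count, corpus_size_tokens):
    """Zipf value as log10 of the frequency per million words, plus 3.

    Simplified: no smoothing for unseen words is applied, so `count`
    must be greater than zero.
    """
    per_million = count / (corpus_size_tokens / 1_000_000)
    return math.log10(per_million) + 3.0
```

Under this definition, a word occurring once per million words in a 51-million-word corpus receives a Zipf value of 3, and the value increases by 1 for every tenfold increase in frequency, regardless of corpus size.
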

3.2.2 Lexical characteristics

Lexical characteristics is a group of predictors consisting of 8 variables. The first

variable is word length, as measured by the number of letters (length) and syllables (number

of syllables) that make up a word (Mandera et al., 2020). The number of morphemes

counts the smallest units of meaning within a word, while the number of phonemes

represents the phoneme count of a word (Balota et al., 2007). Orthographic Levenshtein

distance (OLD) illustrates how similarly a word is written compared to other words in the

corpus, while phonological Levenshtein distance (PLD) shows how similar a word sounds

to other words (Balota et al., 2007). Age of acquisition (AoA) indicates the

average age at which a word is typically learned (Brysbaert, 2012). Lastly, concreteness

ratings are the degree to which an experience with a word can be perceived by language

speakers (Brysbaert et al., 2016).
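
To make the Levenshtein-based measures concrete, the following sketch computes the edit distance between two strings and averages the distance from a word to its closest entries in a lexicon. Averaging over the 20 closest entries follows the common OLD20 convention and is an assumption here, since the number of neighbors is not stated above. The same function would apply to PLD if phoneme transcriptions were passed instead of spellings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def mean_distance_to_closest(word, lexicon, n_closest=20):
    """Mean Levenshtein distance from `word` to its n closest lexicon entries."""
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:n_closest]
    return sum(closest) / len(closest)
```

A lower mean distance indicates that a word has many close neighbors in the lexicon, while a higher value indicates a more distinctive spelling or pronunciation.
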

3.2.3 L1 English Psycholinguistic Behavior

The L1 English psycholinguistic behavior is a set of predictor variables comprising

response times and accuracy ratings of participants who selected English as their native

language in an online vocabulary test. Participants of this vocabulary test completed a word

recognition task where they were presented with a sequence of both existing English words

and non-words, and were instructed to respond with ‘yes’ or ‘no’ depending on whether they

knew the word being presented to them (Mandera et al., 2020). The mean and standard

deviation of response times to correctly identified words were recorded and their z-scores

were calculated. Z-score measures of the mean and standard deviation of response times

were used in this study. Meanwhile, accuracy ratings represented the percentage of correct

answers for each word.
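
The aggregation of response times described above can be sketched as follows. The sketch assumes that response times are standardized within each session before being summarized per word; the tuple layout and function names are hypothetical, and the original pipeline may differ in detail.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_session(trials):
    """Standardize response times within each session.

    `trials` is a list of (session_id, word, rt_ms) tuples for correct
    responses. Sessions with identical response times would need special
    handling, since their standard deviation is zero.
    """
    by_session = defaultdict(list)
    for session_id, _, rt in trials:
        by_session[session_id].append(rt)
    stats = {s: (mean(rts), pstdev(rts)) for s, rts in by_session.items()}
    return [(word, (rt - stats[s][0]) / stats[s][1]) for s, word, rt in trials]


def aggregate_per_word(z_trials):
    """Mean and standard deviation of z-scored response times per word."""
    by_word = defaultdict(list)
    for word, z in z_trials:
        by_word[word].append(z)
    return {w: (mean(zs), pstdev(zs)) for w, zs in by_word.items()}
```

Standardizing within sessions means that a participant who is generally slow and one who is generally fast contribute comparable values for the same word.
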

3.3 Target

To measure L2 English receptive knowledge, the present study utilized a ranking of

English words based on L2 English speakers’ accuracy ratings and response times to the

word recognition task, which has been found to be a good indicator for L2 vocabulary

acquisition (Brysbaert et al., 2020). This measure ranks words from 1 to 18,236 based on the

percentage of participants who knew each word, with mean word recognition response time used to break ties.
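
A ranking of this kind can be sketched with a two-part sort key. The direction of the ranking (rank 1 for the best-known word) and the direction of the tie-break (faster responses first) are assumptions made for illustration.

```python
def rank_words(word_stats):
    """Rank words from best known (rank 1) to least known.

    `word_stats` maps word -> (percent_known, mean_rt); a higher
    percent_known ranks first, and a faster mean response time breaks ties.
    """
    ordered = sorted(word_stats, key=lambda w: (-word_stats[w][0], word_stats[w][1]))
    return {word: rank for rank, word in enumerate(ordered, start=1)}
```
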

4. Experimental Setup

4.1 Dataset Description

4.1.1 Data Sources

The present study used a combined dataset from three data sources. The data used

in this study was collected through the English Crowdsourcing Project (ECP) (Mandera et

al., 2020), which is a mega study that is part of a set of online vocabulary tests organized by Ghent

University. The English vocabulary test was first conducted in 2014 (Brysbaert et al., 2016).

It is currently still running and available to access at http://vocabulary.ugent.be/. Participants

of this test are presented with a sequence of 100 stimuli, where 70 of these stimuli are

existing English words and 30 are pseudo words. They are instructed to indicate whether

they recognized the stimuli being presented and are informed that incorrect answers will be

penalized so as to avoid guessing during the test. The complete set of instructions for the

online vocabulary test is described in Brysbaert et al. (2020) and Mandera et al. (2020).

Two different versions of the ECP dataset were used. Psycholinguistic measures for

L1 English speakers were sampled from the ECP dataset as described in Mandera,

Keuleers, and Brysbaert (2020), where word recognition times and accuracy ratings of

62,000 English words for native English speakers were presented. L2 English receptive

word knowledge as measured by word rankings was derived from accuracy ratings and

recognition times of the same 62,000 English words for non-native English speakers as

described in Brysbaert, Keuleers, and Mandera (2020).

Lastly, lexical characteristics data in this study were obtained from the Word Features

Analysis of the ELP and ECP for the Open Science Framework (Brysbaert & Mandera,

2019). In this dataset, lexical characteristics calculations for English words in common

between the ECP and ELP datasets are available, which include the following information:

word length, word frequency, orthographic and phonologic neighbors, orthographic and

phonologic distance, number of morphemes, number of phonemes, number of syllables, age


of acquisition, word prevalence, and concreteness. An intersection of the three data sources

resulted in a dataset of 18,236 English words, which was utilized for this study.

4.1.2 Data Preprocessing

This study followed the same data cleaning procedures as implemented for the L1

and L2 ECP datasets presented in Mandera et al. (2020) and Brysbaert et al. (2020). Data of

participants who indicated that their native language was English were included in the L1

ECP dataset, and data of those who indicated that their native language was not English were

included in the L2 ECP dataset. Therefore, only data of respondents who completed the

person-related questions were included in the dataset. Outliers and irregular values were

taken out of the dataset. Only the first 3 sessions from each IP address were included to

avoid any undue influences from individuals. Reaction times of the first 9 trials of each

session were deleted as they were considered training trials.

Additionally, for non-native English speakers, observations with more than two ‘yes’

responses to non-words were removed to minimize the presence of guessing by participants

due to word similarity, which corresponds to a maximum rate of 7% guessing (Brysbaert et

al., 2020). Lastly, from the remaining observations in the dataset, only English words with

available lexical characteristics data in the English Lexicon Project were included in the

analysis. These steps pruned the initial dataset of 61,851 English words to 18,236 words.
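
The session-level cleaning rules described above can be sketched as three small filters. The dictionary keys and function names are hypothetical, and the second filter is a simplification: it drops the first nine trials entirely, whereas the procedure above only states that their reaction times were deleted.

```python
from collections import defaultdict


def keep_first_sessions(sessions, max_per_ip=3):
    """Keep only the first `max_per_ip` sessions of each IP address.

    `sessions` is a list of dicts with keys "ip", "start" (sortable), "trials".
    """
    kept, seen = [], defaultdict(int)
    for session in sorted(sessions, key=lambda s: s["start"]):
        if seen[session["ip"]] < max_per_ip:
            seen[session["ip"]] += 1
            kept.append(session)
    return kept


def drop_training_trials(trials, n_training=9):
    """Discard the first `n_training` trials of a session (simplification)."""
    return trials[n_training:]


def passes_guessing_filter(trials, max_false_alarms=2):
    """True if at most `max_false_alarms` non-words were answered 'yes'."""
    false_alarms = sum(1 for t in trials
                       if not t["is_word"] and t["response"] == "yes")
    return false_alarms <= max_false_alarms
```

With 30 non-words per session, a ceiling of two false alarms corresponds to 2/30, or roughly 6.7%, which is consistent with the maximum guessing rate of 7% mentioned above.
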

For the purposes of this study, several variables were further omitted from the

analysis. For word recognition times, z-scores of both L1 and L2 response times instead of

their raw measures were used in order to cancel out differences in average response times

per participant (Mandera et al., 2020; Brysbaert et al., 2020). Word prevalence was also

removed from the analysis as it is a measure of how well each word is known based on the

complete ECP dataset (Brysbaert, 2019; Brysbaert, 2016; Keuleers, 2015), which is a

component of word ranking being measured within this study. Furthermore, orthographic and

phonologic neighbors, or the number of words that sound and are written similarly to a word

(Balota et al., 2007), were excluded from the analysis as the variables orthographic and

phonologic Levenshtein distances have been found to be better measures for word similarity

with other words. Both OLD and PLD not only compare word distances with their neighbors,

but with all other words within the corpus (Yarkoni et al., 2008). All variables were

aggregated per word observation and standardized to control for differences in the units of

each feature, with the exceptions of z-scores

of L1 recognition times, L1 accuracy ratings, and L2 word rankings. A summary of the

distribution of all predictors included within the models as well as their relationships with the

target can be found in Appendix A.

4.2 Experimental Procedure

The linear regression and artificial neural network algorithms were used to conduct

the analysis within this study. Data analysis was performed on Python 3.8, while data

cleaning and preprocessing were conducted on both Python 3.8 and R 4.10. On R 4.10, the

dplyr library (Wickham & Wickham, 2020) was used for dataset cleaning. On Python 3.8, this

study used the following libraries and modules for data cleaning, preprocessing, and

analysis: pandas (McKinney, 2011), os, numpy (Oliphant, 2006), scikit-learn (Kramer, 2016),

random, math, matplotlib (Ari et al., 2014), and tensorflow (Abadi et al., 2016).

The dataset used for this study takes the form of a data frame with 18,236 rows and

13 columns. In total, there are 12 predictors as input to the models. All variables were

converted to two-dimensional numpy arrays and 30% of the entire dataset was left out as

the test set using the train-test split function from scikit-learn. This resulted in a train set with

a total of 12,765 observations and a test set with 5,471 observations.
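
The reported split sizes can be reproduced with a short standard-library sketch. Rounding the test set size up is an assumption made here because it matches the reported counts; the seed and function name are hypothetical, and the shuffling will not reproduce the exact rows selected in the study.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    n_test = math.ceil(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(18236)
```

Holding out 30% of 18,236 rows in this way gives 5,471 test rows and 12,765 train rows, with no row appearing in both sets.
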

Testing of the linear regression model utilized the Lasso class from the scikit-learn

library. The neural network was built in Keras and wrapped with the KerasRegressor class for use

with scikit-learn, and a custom function was used to select the optimal hyper parameters for the

network during training. The network takes as input to its input layer a 2-dimensional numpy

array corresponding to the different predictors included in each model. The last dense layer

of the network outputs a linear value of the target, a word ranking value between 1 and 18,236.

For all dense layers, ReLU was used as the activation function. The model is compiled with

an optimization function that is selected during the tuning process.
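
The structure of such a network can be sketched without any deep learning library. The example below is a plain-Python forward pass with ReLU hidden layers and a single linear output unit; it is not the Keras implementation used in the study, the weights are random and untrained, and the layer sizes passed in the usage note are assumptions for illustration.

```python
import random


def init_network(n_inputs, hidden_sizes, seed=0):
    """Random weights for ReLU hidden layers followed by one linear output unit."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, 1]
    network = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        biases = [0.0] * n_out
        network.append((weights, biases))
    return network


def dense(inputs, weights, biases):
    """Affine transform of one input vector: inputs @ weights + biases."""
    return [sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
            for j in range(len(biases))]


def predict(network, features):
    """Forward pass for one observation; returns a single unbounded value."""
    activations = features
    for weights, biases in network[:-1]:
        activations = [max(0.0, v) for v in dense(activations, weights, biases)]  # ReLU
    weights, biases = network[-1]
    return dense(activations, weights, biases)[0]  # linear output for regression
```

For instance, `init_network(12, [64, 64, 64])` would describe a network with 12 input predictors and three hidden layers of 64 nodes each, and `predict` would map one observation to one predicted rank value.
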

4.2.1 Model Evaluation & Hyper Parameters

Five tasks were conducted to answer the research questions presented in this study.

The mean absolute error metric was used to evaluate the model performances. The MAE

was chosen as it calculates the absolute differences between prediction and expected

output over the test sample, where each individual difference has equal weight. Therefore,

large errors caused by outliers are not penalized disproportionately. Since outliers have previously been removed from

the dataset for this study, the MAE is a good metric for the regression problem presented.

Moreover, it is also a more intuitive measure to interpret than, for example, the mean

squared error.
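
The difference between the two metrics can be illustrated with a toy example in which two sets of predictions have the same total absolute error, but one of them concentrates the error in a single observation. All values are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference; every error contributes linearly."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; large errors dominate the score."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100, 200, 300, 400]
small_errors = [110, 190, 310, 390]      # every prediction is off by 10
one_large_error = [100, 200, 300, 440]   # a single prediction is off by 40

mae_small = mean_absolute_error(y_true, small_errors)
mae_large = mean_absolute_error(y_true, one_large_error)
mse_small = mean_squared_error(y_true, small_errors)
mse_large = mean_squared_error(y_true, one_large_error)
```

Both prediction sets receive the same MAE, whereas the MSE of the second set is four times that of the first, which shows how squaring amplifies isolated large errors. The MAE is also expressed in the same unit as the target, here rank positions.
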

For both the linear regression and neural network models to be able to estimate the

best set of parameters, several hyper parameters need to be determined beforehand and

optimized. For the Lasso linear regression, this parameter is the regularization strength λ, which

scales the penalty applied to the coefficients of the regression model. For the

neural network, these include parameters related to the structure of the network as well as

the learning of the model (Diaz et al., 2017). To determine the structure of the neural

network, the number of layers and number of layer nodes were tuned. To optimize how the

network learns, the optimization function, batch size, and number of epochs were tuned. The

grid search algorithm was used to optimize these parameters using the mean squared error

as a loss function.

Four-fold cross validation was conducted on the train set to tune the hyper

parameters, and the tuned models were then tested on the hold-out subset of data that had initially been left out. For the

linear regression, a total of 5 lambda value candidates were fitted across 4 folds, totaling 20 fits. For the

neural network, there were a total of 96 parameter candidates fitted across 4 folds, totaling

384 fits. A summary of the tested hyper parameters for both models and the best-performing

neural network architectures can be found in Appendix B.
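
The number of fits follows directly from the size of the grid and the number of folds, as the sketch below illustrates. The grid values are hypothetical placeholders chosen only so that the number of combinations matches the 96 candidates reported above; the values actually tested are listed in Appendix B. The fold generator uses contiguous, unshuffled folds for simplicity.

```python
from itertools import product


def k_fold_indices(n_samples, n_folds=4):
    """Yield (train, validation) index lists for contiguous folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical grid: placeholder values, not the grid used in the study.
hypothetical_grid = {
    "hidden_layers": [1, 2, 3],
    "first_layer_nodes": [32, 64],
    "last_layer_nodes": [8, 64],
    "optimizer": ["adam", "sgd"],
    "batch_size": [32, 64],
    "epochs": [30, 60],
}
candidates = [dict(zip(hypothetical_grid, values))
              for values in product(*hypothetical_grid.values())]
n_fits = len(candidates) * 4  # one fit per candidate per fold
```

Each candidate is trained once per fold on three quarters of the train set and validated on the remaining quarter, and the candidate with the best average validation score is retained.
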


4.2.2 Tasks

Task 1. The first task served to provide a baseline performance as comparison with the

predictive power of the variables of interest using the zero-rule algorithm. The rule adopted

was to use the mean value of the train targets, L2 English receptive word knowledge, as the

default prediction.

Task 2. The aim of the second task was to evaluate the predictive power of word

frequency on L2 English receptive word knowledge by comparing the performances of two

algorithms with the baseline performance, where Model 1a is the linear regression model

and Model 1b the neural network.

Task 3. The third task aimed to evaluate the predictive power of commonly used lexical

characteristics (excluding word frequency) on L2 English receptive word knowledge by

comparing performances of the linear regression and neural network models with the

baseline performance. The lexical characteristics set of predictors includes word length,

number of syllables, number of morphemes, number of phonemes, orthographic Levenshtein

distance, phonological Levenshtein distance, age of acquisition, and word concreteness.

Task 4. The fourth task aimed to evaluate the predictive power of L1 psycholinguistic

measures on L2 English receptive word knowledge by comparing Models 3a and 3b’s

performances with the baseline performance. L1 psycholinguistic measures comprise

z-scores of L1 English response times and accuracy ratings during the word

recognition task.

Task 5. The fifth task aimed to compare the predictive power of word frequency

and L1 psycholinguistic measures on L2 English receptive word knowledge. Two linear

regression models and two neural networks were run to execute this task. In Models 4a and

4b, the linear regression and neural network were trained and tested on word frequency and

lexical characteristics. For Models 5a and 5b, the linear regression and neural network were

trained on L1 psycholinguistic measures and lexical characteristics. A comparison of the

predictive performances of these two algorithms was made based on the baseline

prediction.

5. Results

A total of 11 models including the baseline were trained on various predictors as

stated in the task descriptions in section 4.2.2. An overview of all model performance scores

is summarized in Table 1 below. Overall results of the regression tasks performed showed

that the neural network outperformed the linear regression model in predicting L2 receptive

word knowledge based on all sets of predictors, with the exception of Model 3b which inputs

L1 psycholinguistic measures as predictors. The best performing estimator is Model 4b, the

neural network tested on lexical characteristics and word frequency, with an MAE score of

2742.29, an improvement of 1777.98 units over the baseline model. When comparing

the performances of Models 4 and 5, both the linear regression and neural network indicate

that the models tested on lexical characteristics and word frequency perform better than

those tested on lexical characteristics and L1 psycholinguistic measures, confirming findings

from prior studies on the importance of word frequency as a predictor for L2 proficiency

(Berger et al., 2019; Keuleers et al., 2012; Mandera et al., 2020).

To check for robustness of the neural networks, train and validation losses of each

model were visualized during the cross-validation process to monitor for any indications of

over- or underfitting (Goodfellow et al., 2016). Based on the visualization results, there were

no indications of overfitting to be found. Moreover, a comparison of the performance scores

on the train data and test data for all models did not indicate large differences. All

visualizations and calculations related to the model robustness checks can be found in

Appendix C. In the next section, all results will be presented and elaborated for each task.

Table 1

Results of model predictions

Model     Predictors                                              Algorithm  MAE Score  Improvement from Baseline
Baseline  Average of target values                                Zero rule  4520.27    -
1a        Word frequency                                          LR         3473.90    1013.32
1b        Word frequency                                          NN         3413.73    1106.54
2a        Lexical characteristics                                 LR         3729.27    766.39
2b        Lexical characteristics                                 NN         3678.51    841.76
3a        L1 psycholinguistic measures                            LR         3633.03    900.12
3b        L1 psycholinguistic measures                            NN         3854.81    665.46
4a        Lexical characteristics + word frequency                LR         2973.21    1509.24
4b        Lexical characteristics + word frequency                NN         2742.29    1777.98
5a        Lexical characteristics + L1 psycholinguistic measures  LR         3029.18    1473.77
5b        Lexical characteristics + L1 psycholinguistic measures  NN         2797.58    1722.69

5.1 Task 1

The first task aimed to predict L2 receptive word knowledge based on the mean

value of the train targets as the baseline prediction. This prediction was compared with the

true values of the test data and resulted in a mean absolute error score of 4,520.27.

5.2 Task 2

The second task aimed to evaluate the predictive power of word frequency on L2

English receptive word knowledge by comparing the performances of the linear regression

and neural network models with the baseline performance. The best-performing linear

regression estimator for Model 1a was selected with a validation MAE score of 3506.95 and

no regularization performed. The best-performing neural network for Model 1b was selected

with a validation MAE score of 3,430.98. The following parameters were selected as the

optimal configuration: a batch size of 32, 60 epochs, 64 first layer nodes, 8 last layer nodes, 3 hidden

layers, and Adam as the optimization function. Both models performed better than the

baseline prediction, with the neural network showing slightly higher improvements from the

baseline prediction compared to the linear regression model. This indicates that word

frequency is able to perform well as a predictor for L2 word receptive knowledge in

comparison to the baseline, confirming previous findings on the importance of word

frequency as a predictor for L2 English vocabulary knowledge acquisition.

5.3 Task 3

In the third task, lexical characteristics (excluding word frequency) were utilized to

predict L2 receptive word knowledge. The best-performing linear regression estimator for

Model 2a was selected with a validation MAE score of 3753.88 and no regularization

performed. The best-performing neural network for Model 2b was selected with a validation

MAE score of 3,696.88. The following parameters were selected as the optimal configuration: a batch

size of 32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3 hidden layers, and

Adam as the optimization function. Both models performed better than the baseline

prediction, with the neural network showing slightly higher improvements from the baseline

prediction compared to the linear regression model. This indicates that lexical characteristics

as a set of predictors is able to perform well as a predictor for L2 word receptive knowledge

in comparison to the baseline, confirming findings from previous studies.

5.4 Task 4

For the fourth task, both models were trained on L1 psycholinguistic measures to

predict L2 receptive word knowledge. The best-performing linear regression estimator for

Model 3a was selected with a validation MAE score of 3620.15. The best-performing neural

network for Model 3b was selected with a validation MAE score of 3,846.76. The following

parameters were selected as the optimal configuration: a batch size of 32, 60 epochs, 32 first layer

nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function. Both

models performed better than the baseline prediction, this time with the linear regression

showing slightly higher improvements from the baseline prediction compared to the neural

network model. This indicates that L1 psycholinguistic measures as a set of predictors is

able to perform well as a predictor for L2 word receptive knowledge in comparison to the

baseline, adding a new finding to currently existing literature.

5.5 Task 5

For task 5, two sets of linear regression and neural network models were trained in

order to compare the predictive performances of word frequency and L1 psycholinguistic

measures on L2 English receptive word knowledge. Models 4a and 4b were trained on word

frequency and other lexical characteristics, while Models 5a and 5b were trained on L1

psycholinguistic measures and other lexical characteristics. The best-performing linear

regression estimator for Model 4a was selected with a validation MAE score of 3011.03 and

no regularization performed. The best-performing neural network for Model 4b was selected

with a validation MAE score of 2,759.93. The following parameters were selected as the

optimal configuration: a batch size of 32, 60 epochs, 64 first layer nodes, 64 last layer nodes, 3

hidden layers, and Adam as the optimization function.

The best-performing linear regression estimator for Model 5a was selected with a

validation MAE score of 3046.50 and no regularization performed. The best-performing

neural network for Model 5b was selected with a validation MAE score of 2,961.47 and the

following parameters selected as the optimal configuration: a batch size of 32, 60 epochs, 64 first

layer nodes, 64 last layer nodes, 3 hidden layers, and Adam as the optimization function.

For both sets of models, the neural network consistently showed a better performance

compared to the linear regression model. Moreover, Models 4 showed higher improvements

in prediction performance compared to Models 5, where the neural network showed slightly

more improvement in performance compared to the linear regression model. This also

indicated slightly better performance of word frequency as a predictor for L2 English

receptive word knowledge compared to L1 psycholinguistic measures, as expected based on

findings from previous studies.


5.6 Post-Hoc Analysis

While all models showed better prediction performance compared to the baseline, a

large overall error remains present, which could indicate that the models systematically

under-predict the observed values of the targets. To further gain insights into the

performance of the models, additional analyses were performed on the best performing set

of models, namely Models 4a and 4b, which utilized lexical characteristics and word frequency

as predictors. Visualizations made for the purpose of the post-hoc analysis can be found in

Appendix D.

Firstly, the distributions of the target values predicted by the linear

regression and neural network models were compared with the actual target values. Both

prediction distributions differed considerably from the actual distribution, which had a skewness of

-0.00034. It was found that the distribution of prediction values by the neural network more

closely resembles the distribution of the actual targets with a skewness of -0.296 compared

to that of the linear regression with a skewness of -0.439. Furthermore, the distribution of the

linear regression predictions indicates higher dispersion and the presence of more outliers.
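
Skewness can be computed as in the sketch below, which uses the population form of the Fisher-Pearson coefficient without bias correction. Which estimator was used for the values reported above is not stated, so this choice is an assumption.

```python
def skewness(values):
    """Fisher-Pearson coefficient of skewness (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5
```

A complete ranking from 1 to N is symmetric around its mean and therefore has a skewness of zero, which is in line with the near-zero skewness of the actual targets. Negative values, as observed for both sets of predictions, indicate a longer tail toward low rank values.
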

Secondly, the relationship between the predictions and the residuals of each model

was explored. A visualization of these relationships for the two models indicates that

residuals for the neural network predictions tend to spread more proportionately around the

zero value compared to that of the linear regression, as depicted in Figure 2 below. Further

investigation reveals a correlation coefficient of r = 0.00084 for the predictions and residuals

of the linear regression, and r = 0.0066 for the neural network.
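
The relationship between predictions and residuals can be quantified as sketched below. The residual is defined here as the observed value minus the predicted value, which is an assumed sign convention, and the function names are hypothetical.

```python
from math import sqrt


def residuals(y_true, y_pred):
    """Observed value minus predicted value for each observation."""
    return [t - p for t, p in zip(y_true, y_pred)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Calling `pearson_r(y_pred, residuals(y_true, y_pred))` on the test set gives a coefficient of the kind reported above. A value close to zero suggests that the size and sign of the errors do not vary linearly with the predicted rank, although it does not rule out non-linear patterns in the residual plot.
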

Lastly, the coefficients of all predictors from the linear regression model were

extracted to analyze the importance of each variable. It was found that word frequency had

the largest magnitude compared to the other predictors included within the model. This is

consistent with findings from existing literature on the importance of word frequency as a

predictor for L2 lexical proficiency. Additionally, the correlation coefficients between L2

receptive word knowledge and L1 psycholinguistic measures were calculated to further

investigate the exceptional behavior observed in Models 3a and 3b. It was found that L2

receptive word knowledge exhibited a correlation coefficient of r = 0.532 and r = 0.4128 with

mean L1 response time and standard deviation of L1 response time, respectively.

Figure 2

Relationship between residuals and predictions for the linear regression and neural network

models
[Figure 2 panels: Prediction vs Residual - Linear Regression; Prediction vs Residual - Neural Network]

6. Discussion

This study aims to answer the following main research question:

To what extent can lexical characteristics and measures derived from L1 English

psycholinguistic behavior predict L2 English receptive word knowledge?


With respect to answering the sub-research questions concerning the importance of

the set of predictors included in this study, results within this study were consistent with

findings from existing studies. All models tested consistently showed higher predictive

performance compared to the baseline, confirming the validity and importance of the

predictors included within this study, namely word frequency, lexical characteristics, and L1

psycholinguistic measures (Balota et al., 2007; Keuleers et al., 2012; Mandera et al., 2020).

Moreover, based on findings from task 5, it was observed that both the linear regression and

neural network models fitted on lexical characteristics and word frequency showed a better

performance compared to the models fitted on lexical characteristics and L1

psycholinguistic measures. This is consistent with findings from Brysbaert et al. (2017) and

Berger et al. (2019) that highlighted the stronger impact of word frequency on L2 lexical

proficiency compared to L1-derived measures.

When comparing the performances of the two algorithms, the neural network showed

better predictive performance compared to the linear regression model with the exception of

the models fitted on L1 psycholinguistic measures. The dataset used within this study

consists of a large number of observations and features, which adds to its dimensionality. High

dimensionality data is known to negatively impact the performance of

algorithms such as the linear regression (Lavergne & Patilea, 2008). Therefore, this result is

not unexpected. Post-hoc analysis was conducted to investigate the exception found in the

behavior of the model which included L1 psycholinguistic measures as predictors. It was

found that the correlation coefficient of L2 receptive word knowledge with mean L1 response

time and standard deviation of L1 response time were r = 0.532 and r = 0.4128. These

values are much higher than the correlation levels found between measures of word

knowledge in Hashimoto & Egbert (2019) and L1 ranking in Brysbaert et al. (2017), that

being r = 0.28.

Further post-hoc analysis also found a considerable difference in the distribution

between true values of L2 receptive word knowledge and the predicted values by the linear

regression and neural network models. Performing classification tasks for L2 proficiency

could better capture the differences in knowledge between learners as samples of second

language learners tend to exhibit imbalanced classes (Arnold et al., 2018; Aryadoust &

Baghaei, 2016; Sinclair et al., 2021). In this study, L2 receptive word knowledge was

treated as a linear variable of ranking values from 1 to 18,236.

6.1 Limitations & Future Research

Several limitations arise due to the nature of the study approach adopted as well as

the dataset used. While utilizing the word ranking as a measure of L2 lexical proficiency

allows for the capture of additional dimensions such as learner motivation in learning a word,

it may present certain limitations. Due to the imbalanced nature of L2 proficiency-related

information, a linear rank of lexical proficiency may lose information relevant to the comparison of language

proficiency levels.

Utilizing big data collected from mega studies for this thesis presents several

advantages compared to data obtained from factorial design studies, such as improved

model predictive power due to the large number of word stimuli as well as the availability of

the data for use in other studies. However, due to the nature of the word recognition task

used for the dataset collection, the measure of word knowledge in this study is based solely

on yes or no decisions. While this is a valid measure for word knowledge (Ferré & Brysbaert,

2017; Harrington & Carey, 2009; Laufer et al., 2004; Nation, 2001; Zhang et al., 2019), it is

only able to capture a surface-level knowledge of words, and not, for example, whether a

second language speaker indeed understands the meaning of a word (Brysbaert et al.,

2019). Furthermore, the dataset used in the present study only includes words, while the

English language vocabulary is also comprised of multiword expressions (Brysbaert et al.,

2019).

Future studies could address these limitations, firstly by treating L2 word knowledge

as a categorical variable. This would allow for a better comparison of L2 proficiency which

better reflects the true learning curve of second language learners (Arnold et al., 2018;

Aryadoust & Baghaei, 2016). Secondly, future studies may improve and expand on the

measure of vocabulary knowledge used within this study. The availability of a large dataset

with measures representing other dimensions of vocabulary knowledge, such as productive

word knowledge, depth of vocabulary, and knowledge of word meaning (Henriksen, 1999) is

limited, leaving room for further contributions of such datasets. Lastly, future studies could

also include multi-word expressions in addition to singular words as stimuli for large-scale

second language vocabulary testing, more accurately representing the make-up of the

English vocabulary.

7. Conclusion

This study aimed to test the predictive power of lexical characteristics and receptive

word knowledge measures of native English speakers on English second language

speakers’ receptive word knowledge. Results from this study bring several scientific

implications. Existing literature has found lexical characteristics to be important predictors

of L2 English vocabulary as measured by both receptive and productive word knowledge.

The results showed that the models trained on word frequency showed better predictive

performance on L2 English receptive word knowledge compared to the models trained on a

set of other commonly used lexical characteristics, which confirms previous findings. This

study found that L1 English receptive word knowledge performed better than the

baseline in predicting L2 receptive word knowledge, but performed worse than word

frequency, which confirms previous findings that word frequency is a more important

predictor than L1 psycholinguistic behavior. A comparison of performance for different

machine learning models in predicting L2 receptive word knowledge based on information

from a mega study brings a novel contribution to existing studies. Overall, the neural network

showed better performance in predicting L2 receptive word knowledge compared to the

linear regression algorithm. From a practical perspective, findings from this study could

provide input on how to model word difficulty and word acquisition levels for English second

language speakers based on native English vocabulary knowledge and lexical

characteristics. Specifically, this study identified word frequency to be a stronger indicator for

word difficulty and word acquisition levels of L2 English speakers compared to the other

variables of interest. This provides useful insights for English second language vocabulary

learning as well as English text difficulty estimation.


Acknowledgements

I would like to express my gratitude to my supervisor at Tilburg University for his inputs and

guidance throughout the process of writing this thesis. I would also like to acknowledge my

appreciation to the second reader of this thesis as well as other parties that have contributed

to the completion of this thesis.


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Zheng, X. (2016).

Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv

preprint arXiv:1603.04467.

Anderson, R. C., & Freebody, P. (1981). Effects of differing proportions and locations of

difficult vocabulary on text comprehension. Center for the Study of Reading Technical

Report; no. 202.

Ari, N., & Ustazhanov, M. (2014, September). Matplotlib in python. In 2014 11th International

Conference on Electronics, Computer and Computation (ICECCO) (pp. 1-6). IEEE.

Arnold, T., Ballier, N., Gaillat, T., & Lissón, P. (2018). Predicting CEFRL levels in learner

English on the basis of metrics and full texts. arXiv preprint arXiv:1806.11099.

Aryadoust, V., & Baghaei, P. (2016). Does EFL readers' lexical and grammatical knowledge

predict their reading ability? Insights from a perceptron artificial neural network study.

Educational Assessment, 21(2), 135-156.

Asfaha, Y. M., Beckman, D., Kurvers, J., & Kroon, S. (2009). L2 reading in multilingual

Eritrea: The influences of L1 reading and English proficiency. Journal of Research in

Reading, 32(4), 351-365.

Balota, D. A., & Chumbley, J. I. (1990). Where are the effects of frequency in visual word

recognition tasks? Right where we said they were! Comment on Monsell, Doyle, and

Haggard (1989).

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... &

Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-

459.

Balota, D. A., Aschenbrenner, A. J., & Yap, M. J. (2013). Additive effects of word frequency

and stimulus quality: the influence of trial history and data transformations. Journal of

Experimental Psychology: Learning, Memory, and Cognition, 39(5), 1563.



Berger, C. M., Crossley, S. A., & Kyle, K. (2019). Using native-speaker psycholinguistic

norms to predict lexical proficiency and development in second-language

production. Applied Linguistics, 40(1), 22-42.

Berger, C., Crossley, S., & Skalicky, S. (2019). Using lexical features to investigate second

language lexical decision performance. Studies in Second Language Acquisition, 41(5), 911-

935.

Brysbaert, M., & Cortese, M. J. (2011). Do the effects of subjective frequency and age of

acquisition survive better word frequency norms?. Quarterly Journal of Experimental

Psychology, 64(3), 545-559.

Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we

know? Practical estimates of vocabulary size dependent on word definition, the degree of

language input and the participant’s age. Frontiers in psychology, 7, 1116.

Brysbaert, M., Lagrou, E., & Stevens, M. (2017). Visual word recognition in a second

language: A test of the lexical entrenchment hypothesis with lexical decision times.

Bilingualism: Language and Cognition, 20(3), 530-548.

Brysbaert, M., Keuleers, E., & Mandera, P. (2019). Recognition times for 62 thousand

English words: Data from the English Crowdsourcing Project.

Brysbaert, M., Keuleers, E., & Mandera, P. (2020). Which words do English non-native

speakers know? New supernational levels based on yes/no decision. Second Language

Research, 37(2), 207-231.

Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning &

Technology, 11(3), 38-63.

Cobb, T. (2016). Feeding numbers or numerology? A response to Nation (2014) and

McQuillan (2016).

DeKeyser, R. M. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition

and second language instruction (pp. 125–151).



Diaz, G. I., Fokoue-Nkoutche, A., Nannicini, G., & Samulowitz, H. (2017). An effective

algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and

Development, 61(4/5), 9-1.

Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for

theories of implicit and explicit language acquisition. Studies in second language

acquisition, 24(2), 143-188.

Ellis, N. C. (2002). Reflections on frequency effects in language processing. Studies in

second language acquisition, 24(2), 297-339.

Ellis, R. (2003). Task-based language learning and teaching. Oxford university press.

Ferré, P., & Brysbaert, M. (2017). Can Lextale-Esp discriminate between groups of highly

proficient Catalan–Spanish bilinguals with different language dominances?. Behavior

Research Methods, 49(2), 717-723.

Freebody, P., & Anderson, R. C. (1981). Effects of differing proportions and locations of

difficult vocabulary on text comprehension. Center for the Study of Reading Technical

Report; no. 202.

Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool. System,

37(4), 614-626.

Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied linguistics, 35(3),

305-327.

Gaurav, M. (2019, December 17). How to find the optimum number of hidden layers and

nodes in a neural network model? Datagraphi.Com. Retrieved November 5, 2021, from

https://www.datagraphi.com/blog/post/2019/12/17/how-to-find-the-optimum-number-of-

hidden-layers-and-nodes-in-a-neural-network-model

Grasemann, U., Peñaloza, C., Dekhtyar, M., Miikkulainen, R., & Kiran, S. (2021). Predicting

language treatment response in bilingual aphasia using neural network-based patient

models. Scientific reports, 11(1), 1-11.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

Hashimoto, B. J., & Egbert, J. (2019). More than frequency? Exploring predictors of word

difficulty for second language learners. Language Learning, 69(4), 839-872.

Haykin, S., & Principe, J. (1998). Making sense of a complex world [chaotic events

modeling]. IEEE Signal Processing Magazine, 15(3), 66-81.

Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive second-language

vocabulary for non-native university students: An empirical investigation. Applied

linguistics, 17(2), 145-163.

He, X., & Godfroid, A. (2019). Choosing words to teach: A novel method for vocabulary

selection and its practical application. Tesol Quarterly, 53(2), 348-371.

Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in second

language acquisition, 21(2), 303-317.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are

universal approximators. Neural Networks, 2(5), 359-366.

Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in second language

acquisition: What does the coefficient of variation tell us?. Applied Psycholinguistics, 30(4),

555-582.

Kerz, E., Wiechmann, D., Qiao, Y., Tseng, E., & Ströbel, M. (2021, April). Automated

Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours

and RNNs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building

Educational Applications (pp. 199-209).

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project:

Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior

research methods, 44(1), 287-304.

Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in

psycholinguistics: An overview of recent developments. Quarterly Journal of Experimental

Psychology, 68(8), 1457-1468.



Kramer, O. (2016). Scikit-learn. In Machine learning for evolution strategies (pp. 45-53).

Springer, Cham.

Laufer, B. (1998). The development of passive and active vocabulary in a second language:

Same or different?. Applied linguistics, 19(2), 255-271.

Laufer, B., & Paribakht, T. S. (1998). The relationship between passive and active

vocabularies: Effects of language learning context. Language learning, 48(3), 365-391.

Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need both to

measure vocabulary knowledge?. Language testing, 21(2), 202-226.

Lavergne, P., & Patilea, V. (2008). Breaking the curse of dimensionality in nonparametric

testing. Journal of Econometrics, 143(1), 103-122.

Lee, J. W., & Schallert, D. L. (1997). The relative contribution of L2 language proficiency and

L1 reading ability to L2 reading performance: A test of the threshold hypothesis in an EFL

context. Tesol Quarterly, 31(4), 713-739.

Leow, R. P., Grey, S., Marijuan, S., & Moorman, C. (2014). Concurrent data elicitation

procedures, processes, and the early stages of L2 learning: A critical overview. Second

Language Research, 30(2), 111-127.

Liben-Nowell, D., Strand, J., Sharp, A., Wexler, T., & Woods, K. (2019). The danger of

testing by selecting controlled subsets, with applications to spoken-word recognition. Journal

of cognition, 2(1).

Maskor, Z. M., & Baharudin, H. (2016). Receptive vocabulary knowledge or productive

vocabulary knowledge in writing skill, which one important. International Journal of Academic

Research in Business and Social Sciences, 6(11), 261-271.

Mandera, P., Keuleers, E., & Brysbaert, M. (2020). Recognition times for 62 thousand

English words: Data from the English Crowdsourcing Project. Behavior Research

Methods, 52(2), 741-760.

McKinney, W. (2011). pandas: a foundational Python library for data analysis and

statistics. Python for high performance and scientific computing, 14(9), 1-9.

Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language

testing, 4(2), 142-154.

Meara, P. M., & Milton, J. L. (2003). The Swansea vocabulary levels test: The manual.

Newbury, UK: Express Publishing.

Milton, J. (2006). X-Lex: The Swansea vocabulary levels test. In Proceedings of the 7th and

8th Current Trends in English Language testing (CTELT) Conference (Vol. 4, pp. 29-39).

TESOL Arabia, UAE.

Monaghan, P., Chang, Y. N., Welbourne, S., & Brysbaert, M. (2017). Exploring the relations

between word frequency, language exposure, and bilingualism in a computational model of

reading. Journal of Memory and Language, 93, 1-21.

Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word

recognition tasks: Where are they?. Journal of Experimental Psychology: General, 118(1),

43.

Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. Vocabulary:

Description, acquisition and pedagogy, 14, 6-19.

Nation, I. S. (2001). Learning vocabulary in another language. Ernst Klett Sprachen.

Nurweni, A., & Read, J. (1999). The English vocabulary knowledge of Indonesian university

students. English for Specific Purposes, 18(2), 161-175.

O'Dell, F., Read, J., & McCarthy, M. (2000). Assessing vocabulary. Cambridge university

press.

Oliphant, T. E. (2006). A guide to NumPy (Vol. 1, p. 85). USA: Trelgol Publishing.

Pae, T. I. (2019). A simultaneous analysis of relations between L1 and L2 skills in reading

and writing. Reading Research Quarterly, 54(1), 109-124.

Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level

assessment. Computer speech & language, 23(1), 89-106.

Pulido, D., & Hambrick, D. Z. (2008). The virtuous circle: Modeling individual differences in

L2 reading and vocabulary development.



Read, J. (2013). Validating a test to measure depth of vocabulary knowledge. In Validation in

language assessment (pp. 55-74). Routledge.

Santos, V. D., Verspoor, M., & Nerbonne, J. (2012). Identifying important factors in essay

grading using machine learning. International experiences in language testing and

assessment—Selected papers in memory of Pavlos Pavlou, 295-309.

Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of

two new versions of the Vocabulary Levels Test. Language testing, 18(1), 55-88.

Skalicky, S., Crossley, S. A., & Berger, C. M. (2019). Predictors of second language English

lexical recognition: Further insights from a large database of second language lexical

decision times. The Mental Lexicon, 14(3), 333-356.

Sparks, R. L., Patton, J., Ganschow, L., & Humbach, N. (2012). Do L1 reading achievement

and L1 print exposure contribute to the prediction of L2 proficiency?. Language

Learning, 62(2), 473-505.

Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900.

Harvard University Press.

Van Gelderen, A., Schoonen, R., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., &

Stevenson, M. (2004). Linguistic knowledge, processing speed, and metacognitive

knowledge in first- and second-language reading comprehension: a componential analysis.

Journal of educational psychology, 96(1), 19.

Van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A

new and improved word frequency database for British English. Quarterly journal of

experimental psychology, 67(6), 1176-1190.

Webb, S. A., & Chang, A. C. S. (2012). Second language vocabulary growth. RELC journal,

43(1), 113-126.

Wickham, H., & Wickham, M. H. (2020). Package ‘plyr’. A Grammar of Data Manipulation. R

package version, 8.

Yang, Y., Yu, W., & Lim, H. (2016). Predicting second language proficiency level using

linguistic cognitive task and machine learning techniques. Wireless Personal

Communications, 86(1), 271-285.

Yap, M. J., Balota, D. A., Tse, C. S., & Besner, D. (2008). On the additive effects of stimulus

quality and word frequency in lexical decision: evidence for opposing interactive influences

revealed by RT distributional analyses. Journal of Experimental Psychology: Learning,

Memory, and Cognition, 34(3), 495.

Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of

orthographic similarity. Psychonomic bulletin & review, 15(5), 971-979.

Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language

representation with informative entities. arXiv preprint arXiv:1905.07129.

Zhou, Z. H. (2019). Ensemble methods: foundations and algorithms. Chapman and

Hall/CRC.

Appendix A

Visualization of Model Features

Figure 3

Distribution of all scaled predictors



Figure 4

Relationship between the predictors and target



Appendix B

Hyperparameter Tuning & Model Selection

Table 2

Tested Hyperparameters

Model / hyperparameter        Tested values

Linear regression
  Lambda                      0, 0.02, 0.06, 0.1, 0.2

Neural network
  Optimization functions      Adam, RMSProp
  Number of layers            2, 3
  First layer nodes           64, 32, 16
  Last layer nodes            64, 8
  Batch size                  32, 100
  Number of epochs            30, 60
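The search space in Table 2 can be made concrete with a short sketch. The values below are copied from the table, but the dictionary keys, the `expand_grid` helper, and the assumption that every combination was tried (an exhaustive grid search) are illustrative only; the tuning procedure actually used may have differed.

```python
from itertools import product

# Values copied from Table 2. The key names are hypothetical labels,
# not identifiers taken from the thesis code.
NN_GRID = {
    "optimizer": ["Adam", "RMSProp"],
    "n_layers": [2, 3],
    "first_layer_nodes": [64, 32, 16],
    "last_layer_nodes": [64, 8],
    "batch_size": [32, 100],
    "epochs": [30, 60],
}
LR_LAMBDAS = [0, 0.02, 0.06, 0.1, 0.2]


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


# Assumption: an exhaustive search over all combinations.
nn_configs = expand_grid(NN_GRID)
print(len(nn_configs), "candidate neural network configurations")
print(len(LR_LAMBDAS), "candidate regularization strengths")
print(nn_configs[0])
```

Under the exhaustive-search assumption, each configuration in `nn_configs` would be trained once and compared on validation loss, which is why even a small table of values implies a sizeable number of model fits.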

Figure 5

Neural network models selected – Model 1b: word frequency as predictor



Figure 6

Neural network models selected – Model 2b: word characteristics as predictor

Figure 7

Neural network models selected – Model 3b: psycholinguistic measures as predictor



Figure 8

Neural network models selected – Model 4b: lexical characteristics & word frequency as

predictors

Figure 9

Neural network models selected – Model 5b: lexical characteristics & L1 psycholinguistic

measures as predictors

Appendix C

Model Robustness Checks

Figure 10

Train & Validation Loss Visualization for Model 1b – Neural network with word frequency


Figure 11

Train & Validation Loss Visualization for Model 2b – Neural network with word characteristics

Figure 12

Train & Validation Loss Visualization for Model 3b – Neural network with L1 psycholinguistic

measures

Figure 13

Train & Validation Loss Visualization for Model 4b – Neural network with word characteristics

& word frequency



Figure 14

Train & Validation Loss Visualization for Model 5b – Neural network with word characteristics

& L1 psycholinguistic measures



Appendix D

Post-Hoc Analysis Visualization

Figure 15

Distribution of test targets – L2 receptive word knowledge rank



Figure 16

Distribution of predictions based on word characteristics & word frequency – linear

regression

Figure 17

Distribution of predictions based on word characteristics & word frequency – neural network

Figure 18

Distribution of residuals for predictions based on word characteristics & word frequency –

linear regression

Figure 19

Distribution of residuals for predictions based on word characteristics & word frequency –

neural network
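As a reading aid for the residual plots in this appendix, the sketch below shows how residuals of this kind are typically computed. It assumes the usual definition (observed target minus prediction); the function names and the toy values are hypothetical and are not taken from the thesis data or code.

```python
from statistics import mean


def residuals(y_true, y_pred):
    """Residual per item: observed target minus predicted value."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [t - p for t, p in zip(y_true, y_pred)]


def mean_absolute_error(y_true, y_pred):
    """Average size of the residuals, ignoring their sign."""
    return mean([abs(r) for r in residuals(y_true, y_pred)])


# Hypothetical toy values standing in for scaled L2 receptive word
# knowledge targets and model predictions.
y_true = [0.10, 0.40, 0.55, 0.90]
y_pred = [0.20, 0.35, 0.60, 0.70]

res = residuals(y_true, y_pred)
print("residuals:", [round(r, 2) for r in res])
print("mean residual:", round(mean(res), 3))
print("mean absolute error:", round(mean_absolute_error(y_true, y_pred), 3))
```

A residual distribution centred near zero suggests no systematic over- or under-prediction, while a long tail on one side indicates words whose knowledge level is consistently misestimated.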
