
British Journal of Educational Technology Vol 46 No 5 2015 1097–1101

doi:10.1111/bjet.12338

Classification of word levels with usage frequency, expert opinions and machine learning

Gihad N. Sohsah, Muhammed Esad Ünal and Onur Güzey

Gihad N. Sohsah is pursuing an M.S. degree in the Electronics and Computer Engineering Program at İstanbul Şehir University, Turkey. She received her bachelor's degree in computer and automatic control from Tanta University, Egypt. Her research interests include machine learning, natural language processing and bioinformatics. Muhammed Esad Ünal is a student at the College of Engineering and Natural Sciences at İstanbul Şehir University. He is pursuing Computer Science and Electrical Engineering B.S. degrees as part of the Double Major Program. His research interests are machine learning and big data. Assistant Prof. Dr. Onur Güzey received his B.S. degree in computer engineering from Istanbul Technical University, his MBA degree from the Massachusetts Institute of Technology and his Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara (UCSB) in 2008. He has previously worked at Intel Corporation in Santa Clara, USA, and McKinsey & Company in Istanbul, Turkey. His research interests are machine learning and its applications. Address for correspondence: Dr Onur Guzey, Department of Computer Engineering, Istanbul Şehir University, 34662, Istanbul, Turkey. Email: onurguzey@sehir.edu.tr

Abstract
Educational applications for language teaching can utilize the language levels of words
to target proficiency levels of students. This paper and the accompanying data provide a
methodology for making educational standard-aligned language-level predictions for all
English words. The methodology involves collecting expert opinions on language levels and extending these opinions to other words using machine learning and data from a large corpus. Common European Framework of Reference for Languages (CEFR) level predictions for about 50 000 words, which can be readily used in educational applications, are also provided. For applications where the cost of misclassification varies, machine learning model parameters and algorithm selection must be adjusted. A large set of expert opinions that can be used for this adjustment, collected in a survey of 30 practicing language teachers, is also released. The overall methodology can be applied to low-resource languages,
where CEFR-level classifications may not exist, by adding a comparable survey and
corpus. The data are released with a Creative Commons Attribution license to enable free
mixing, sharing and even use in commercial applications.

Dataset
Location: https://zenodo.org/record/12501
DOI: 10.5281/zenodo.12501
Creator/Publisher: Istanbul Şehir University
Date: October 30, 2014
Format: Comma separated values

Introduction
Students learning a new language interact with words chosen by the educators who design the education plan. These words are often either selected from a frequency list, such as the Oxford 3000 (Oxford University Press, n.d.), or chosen based on the previous experience of the people preparing the education plan. In this paper, we introduce royalty-free, educational-standard-aligned data and accompanying methods that can be used to make word selection more flexible
and robust. The released data can be used in their current form by teachers, as well as being incorporated into automated software tools that require word levels.
Language levels of words, or word levels, are a crucial building block for language teaching
applications. These levels are used in personalized testing (Corbett & Anderson, 1994), automated content generation (Heilman, Collins-Thompson & Callan, 2010) and language-level-aware web search (Collins-Thompson, Bennett, White, de la Chica & Sontag, 2011). Teachers also use word levels to select words to include in a particular lecture or exam.
Determining the language levels of words is not always straightforward. Words can have different levels in different contexts, assume different parts of speech and have multiple meanings. In addition, teachers often have different opinions on the relative importance of words, and therefore on their levels. This results in large disparities between teachers on what the level of a word should be. These disparities can become even more pronounced between teachers from different backgrounds, teaching philosophies and geographies. Such complexities make a “one size fits all” use of static data, such as word lists or simple usage statistics, less useful.
The released teacher survey, which includes data from 30 different teachers, enables researchers and
application developers to adjust their word-level selections to either increase or decrease the
weight of a particular teacher, or a group of teachers, to better fit the needs of their applications
or teaching philosophies. For example, removing a particular teacher’s opinions from data can
change levels of many words. Such changes would not be possible without the teacher survey
data.

The data
Production
The final data were constructed by merging expert survey results, word usage frequencies from the Google Books corpus and machine learning-based classification results. A detailed explanation of an optional preprocessing step, which filters out words that are irrelevant for language learning, was previously given in Sohsah, Akkurt, Safarli, Unal and Guzey (2014).

Survey
The survey contains about 7000 words. Thirty active English as a Second Language teachers were asked to label each word–part-of-speech pair with a Common European Framework of Reference for Languages (CEFR) level, or to choose unknown for typographical errors or words they did not know. We randomly chose words from a frequency list made from the British National Corpus (Kilgarriff, 1995). This randomized selection from the list was only used as a starting point for the expert survey. In the released data, only the frequencies from the much larger Google Books corpus (Lin et al, 2012) were used. Part of speech tags were also changed to match the ones used in the Google Books corpus.
The data labels are based on CEFR levels, a language-level framework that is applicable to all
languages. The framework is ubiquitous in language education and is the most well-recognized
language-level scheme among language teachers. CEFR has six levels, labeled in increasing order of competency as A1, A2, B1, B2, C1 and C2. Most of these levels are meaningful for external parties, as they correspond to the ability level of the student in a language. A1 denotes a beginner, while C2 denotes a proficient user who can accomplish complicated tasks such as delivering coherent spoken presentations.
Three different teachers labeled each word–part-of-speech pair. The word selection for teachers
was randomized to prevent bias. The CEFR-level distributions for 7000 words from the teacher
surveys are given in Figure 1.
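
For readers working directly with the raw survey file, the distribution shown in Figure 1 can be recomputed with a few lines of pandas. This is only a sketch: the file and column names below are assumptions, and the actual field names are documented in the readme.pdf accompanying the dataset.

```python
import pandas as pd

# Raw expert answers; "survey_results.csv" and "cefr_level" are placeholder names.
survey = pd.read_csv("survey_results.csv")
level_counts = (survey["cefr_level"]
                .value_counts()
                .reindex(["A1", "A2", "B1", "B2", "C1", "C2", "unknown"]))
print(level_counts)  # counts per CEFR level, as plotted in Figure 1
```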

Figure 1: CEFR-level frequency distributions from the survey

Figure 2: CEFR-level frequency distributions for different classification methods

Machine learning classification


A statistical approach based on machine learning can be used for extending the labels on 7000
words to all other English words. In the dataset, baseline machine learning results are included for
more than 50 000 words. These results are constructed using three different machine learning
methods:
• Random Forest (Breiman, 2001);
• 2-layer feed-forward neural network;
• support vector machines (Scholkopf & Smola, 2001).
Predicted level distributions according to these algorithms are presented in Figure 2.
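
The paper does not tie these baselines to a particular implementation. As a hedged illustration, the three methods could be instantiated with scikit-learn roughly as follows; the library choice, the hyperparameters and the two-hidden-layer reading of “2-layer” are all assumptions, not the settings behind the released predictions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Three baseline classifiers; hyperparameters are placeholders only.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    # "2-layer feed-forward network" read here as two hidden layers of 32 units
    "neural_network": MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500),
    "support_vector_machine": SVC(kernel="rbf"),
}
```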
For the machine learning methods, a supervised learning (Vapnik, 2000) methodology was followed. The outputs were the CEFR labels assigned by the teachers to the 7000 surveyed words. Three teachers evaluated each word, and the final label was determined by averaging the levels given by these teachers. If a word was labeled unknown by at least two teachers, that word was discarded. All remaining words were used to construct the training set (Vapnik, 2000).
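
As an illustration, the label aggregation just described could be implemented as in the following Python sketch. The numeric A1–C2 scale and the rounding to the nearest level are assumptions; the paper only states that the levels were averaged and that words with at least two unknown labels were discarded.

```python
# Numeric scale for CEFR levels (an assumption, used only for averaging).
LEVELS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def aggregate(labels):
    """Combine the three teacher answers for one word-POS pair."""
    if sum(1 for l in labels if l == "unknown") >= 2:
        return None  # discarded from the training set
    known = [LEVELS[l] for l in labels if l != "unknown"]
    mean = sum(known) / len(known)
    # Map the average back to the nearest CEFR label.
    return min(LEVELS, key=lambda name: abs(LEVELS[name] - mean))

print(aggregate(["B1", "B2", "B2"]))            # -> "B2"
print(aggregate(["A1", "unknown", "unknown"]))  # -> None (discarded)
```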
The inputs to the machine learning algorithms for each entry were the part of speech of the word (a categorical variable) and the word usage frequencies from Google Books Ngram data from 2000 to 2008 (integer variables). The output was the predicted CEFR level of the word (a categorical variable).
Table 1: Classification accuracy for machine learning algorithms

Algorithm                               Average classification error
Random forest                           1.2
Support vector machines                 1.32
2-layer feed-forward neural network     1.14

After each model was trained using the training set, we performed threefold cross-validation (Vapnik, 2000) to obtain a measure of their classification accuracy. This can also be considered a measure of the machine learning algorithm’s ability to capture the regularities in the training data.
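
As a rough illustration of this setup, the following Python sketch assembles the input features described above (a one-hot encoded part-of-speech tag plus yearly Google Books frequencies for 2000 to 2008) and fits one of the baseline models. The file names, column names and scikit-learn usage are assumptions, not the authors’ code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file names; actual names and fields are given in the readme.pdf.
freq = pd.read_csv("word_frequencies.csv")    # word, pos, yearly usage counts
labels = pd.read_csv("training_labels.csv")   # word, pos, averaged CEFR level

data = freq.merge(labels, on=["word", "pos"])            # keep labeled words only
year_cols = [f"freq_{y}" for y in range(2000, 2009)]     # 2000..2008 inclusive
X = pd.concat([pd.get_dummies(data["pos"], prefix="pos"),  # categorical POS
               data[year_cols]],                           # integer frequencies
              axis=1)
y = data["cefr_level"]

model = RandomForestClassifier(n_estimators=100).fit(X, y)
```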
Each algorithm’s cross-validation results are given in Table 1. In this table, a lower error represents higher classification accuracy. The classification error is defined as the absolute difference between the level predicted by the machine learning algorithm and the level assigned by the teacher survey results. For example, if the level predicted by the algorithm is A2 and the teachers assigned the same word the level B2, this results in an error of 2. It should be noted that the levels assigned by the teachers also vary, so a classification error close to zero should not be expected.
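
The error measure and the threefold cross-validation could be reproduced along the following lines. This is an interpretation of the text (absolute distance between levels on a numeric A1–C2 scale), not the released evaluation code, and it reuses X and y from the previous sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

LEVELS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def level_error(y_true, y_pred):
    """Mean absolute distance between levels, e.g. A2 vs. B2 -> 2."""
    true = np.array([LEVELS[l] for l in y_true])
    pred = np.array([LEVELS[l] for l in y_pred])
    return np.mean(np.abs(true - pred))

scorer = make_scorer(level_error, greater_is_better=False)
# X and y come from the feature-construction sketch above.
errors = -cross_val_score(RandomForestClassifier(), X, y, cv=3, scoring=scorer)
print(errors.mean())  # comparable in spirit to the averages in Table 1
```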
Depending on the application, the cost of misclassification errors can be different. For example, a
testing application that targets B1–B2 levels may want to ensure no unknown words are
misclassified as B1 or B2, although this can also remove some relevant words. In contrast, a
language-level-aware web search application may be more tolerant of unknowns. Adjusting the parameters of the machine learning algorithms can satisfy the needs of both of these applications. Parameter selection for adjusting misclassification penalties for random forests is explained in Sohsah et al (2014), and this parameter selection method can be applied using the data provided.
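
The released data do not fix these penalties, and the parameterization from Sohsah et al (2014) is not reproduced here. One common way to approximate such an adjustment with a random forest is through per-class weights, as in the hedged sketch below; the weight values are purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative per-class weights: upweighting C1/C2 makes the model less likely
# to place advanced (possibly unknown-to-the-student) words into B1/B2.
weights = {"A1": 1, "A2": 1, "B1": 1, "B2": 1, "C1": 3, "C2": 3}
clf = RandomForestClassifier(n_estimators=100, class_weight=weights)
clf.fit(X, y)  # X, y as in the earlier sketches
```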

Location and format


The data are available in the Zenodo repository in comma separated values (csv) format. The data are separated into three files to facilitate different use cases: the raw survey results, the word frequencies from the Google Books corpus and the machine learning algorithm results.

For each word–part-of-speech pair, the lemma and usage frequency are provided. All the answers provided by the experts are given in the raw survey results file. Baseline predictions using the three classification methods (random forests, 2-layer neural networks and support vector machines) are included in the machine learning algorithm results file. Details of the machine learning algorithms, including data field descriptions and definitions of the part of speech tags, are included in the readme.pdf file accompanying the data and omitted here due to space constraints.
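
The three files can be loaded with pandas along the following lines. The file names and join columns shown are placeholders, since the actual names and field definitions are given only in the readme.pdf.

```python
import pandas as pd

# Placeholder file names; see the readme.pdf for the real names and fields.
survey = pd.read_csv("survey_results.csv")         # raw expert answers
frequencies = pd.read_csv("word_frequencies.csv")  # Google Books usage counts
predictions = pd.read_csv("ml_predictions.csv")    # baseline model outputs

# Example: join frequencies with predictions on the (assumed) word and
# part-of-speech columns.
merged = frequencies.merge(predictions, on=["word", "pos"])
```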

Limitations
A major issue for this research has been dealing with the unbalanced nature of word usage. A small number of words (around 10 000) are used frequently, while the remaining words in the language are used rarely. This imbalance presents issues when developing machine learning methods that are based on frequency statistics. Because the initial list was chosen in a randomized fashion to cover a large variety of words, some words in the survey are typographical errors or very rare words. Experts were asked to label these words as unknown. Due to this imbalance, the most advanced level, C2, which should contain rare but important words, becomes hard to distinguish from unknown words.

Future work
The same methodology can be applied to other languages by adding a comparable expert survey
and corpus. The resulting data can be especially important for low-resource languages where
even static word lists are not available. We encourage researchers who create such data to contact
us to include their results in a potential multilingual data release.
Acknowledgements
We would like to thank teachers and administrators from Istanbul Şehir University’s School of
Languages for their participation. We also would like to thank Emrah Akkurt for his help in
setting up the survey and recommending CEFR as the language-level standard.
Statements on open data, ethics and conflict of interest
The data can be downloaded from https://zenodo.org/record/12501 and can be used according to the Creative Commons Attribution license.
The data contain no personal or personally identifiable information.
There is no conflict of interest regarding this work.
References
Breiman, L. (2001). Random forests. Machine Learning, 45, 1, 5–32.
Collins-Thompson, K., Bennett, P. N., White, R. W., de la Chica, S. & Sontag, D. (2011). Personalizing web
search results by reading level. In Proceedings of the 20th ACM International Conference on Information
and Knowledge Management (CIKM’11) (pp. 403–412). New York, NY, USA. doi: 10.1145/2063576
.2063639.
Corbett, A. T. & Anderson, J. R. (1994). Knowledge tracing: modeling the acquisition of procedural
knowledge. User Modeling and User-adapted Interaction, 4, 4, 253–278. doi: 10.1007/BF01099821.
Heilman, M., Collins-Thompson, K. & Callan, J. (2010). Personalization of reading passages
improves vocabulary acquisition. International Journal of Artificial Intelligence in Education, 20, 1, 73–98.
doi: 10.3233/JAI-2010-0003.
Kilgarriff, A. (1995). BNC database and word frequency lists. Retrieved 18 October 2014, from
http://www.kilgarriff.co.uk/bnc-readme.html
Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W. & Petrov, S. (2012). Syntactic annotations for the
Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations (pp. 169–174).
Oxford University Press. (n.d.). Oxford 3000. Retrieved September 3, 2014, from http://www.oxfordlearnersdictionaries.com/us/wordlist/english/oxford3000/ox3k_A-B/
Scholkopf, B. & Smola, A. J. (2001). Learning with kernels: support vector machines, regularization,
optimization, and beyond (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press.
Sohsah, G. N., Akkurt, E., Safarli, I., Unal, M. E. & Guzey, O. (2014). Automatically filtering irrelevant
words for applications in language acquisition. In International Conference on Machine Learning and
Applications (pp. 1–5). doi: 10.1109/ICMLA.2014.113.
Vapnik, V. (2000). The nature of statistical learning theory. New York, NY: Springer.
