THE EF CAMBRIDGE OPEN LANGUAGE DATABASE
(EFCAMDAT)
Information for Users
Yan Huang, Jeroen Geertzen, Rachel Baker,
Anna Korhonen and Theodora Alexopoulou
Department of Theoretical and Applied Linguistics, University of Cambridge
and
EF Education First
Second release — September 2017
1 Preface
EFCAMDAT is an open access corpus consisting of writings of learners of English worldwide
submitted to Englishtown, the online school of EF Education First[1]. EFCAMDAT is a
collaborative project between the Department of Theoretical and Applied Linguistics, University
of Cambridge and EF Education First, developed with the support of the Isaac Newton
Trust, Trinity College. The corpus was first released in July 2013. Below we provide
information on the second release of the corpus, which contains significantly more data and
improvements in data pre-processing.
The EF-Cambridge Open Language Database, henceforth EFCAMDAT, is the first large
project of the EF Education First Research Lab for Applied Language Learning at the
Department of Theoretical and Applied Linguistics at the University of Cambridge. The
Lab was launched in 2010 (originally as EF Research Unit) to promote research in second
language learning of English and innovation in language teaching through a systematic
cross-fertilisation between linguistic research and teaching techniques.
Early on it became apparent that we had a unique opportunity to use big educational
data from a real-life language learning environment for research. Thousands of learners
from around the world access English Live, formerly known as Englishtown, daily. English
Live is EF's online language school, where students can follow lessons and submit work.
This provides a unique opportunity for the collection of written data on an unprecedented
scale.
[1] https://englishlive.ef.com/en-us/learn-english-online
Building a corpus of considerable scope and size raises important questions about its
structure, access and usability. Researchers working with large corpora like EFCAMDAT
cannot rely on manual inspection, annotation or extraction of data if the quantitative
power of such resources is to be exploited. Unless linguistic annotation is automated to a
considerable degree to enable automated extraction of patterns, the scope of such corpora
will remain unexploited. We therefore employed Natural Language Processing tools and
technology for automated analysis of learner English, a challenging enterprise in itself
given that NLP technology has only recently been applied to L2 data. Finally, a web-based
interface that supports data export and provides a search tool was built to maximise
the accessibility and usability of the corpus.
A diverse team of linguists and computational linguists was thus put together to
build a corpus of written data from Englishtown. The work has been sponsored by the
Isaac Newton Trust, Trinity College, Cambridge and EF Education First. EF has further
provided vital logistical support in data collection and transfer, obtaining consent from EF
students and additional meta-data where needed. Last, but not least, EF has championed
an open access research resource for the study of learner English.
We would like to thank a number of people who have made this project possible, from
its conception to the first and second releases. From the EF side: Chris McCormick for his
critical ideas in shaping this project at early stages and his co-ordinating work to get this
project off the ground; Erie Azumi and his team for their technical support of the data
collection and transfer; Yerrie Kim for all her co-ordinating work at various stages of the project;
Tamsyn Brownie, Soraya Estrella, Jane Chen, and Franco Papeschi for designing the cover
pages of the web interface. From the Cambridge side: Jeroen Geertzen transformed some
millions of unstructured data into a structured, annotated and accessible corpus. He
oversaw the application and evaluation of NLP tools on learner data for the first release
of July 2013, and has also supported the second release. Yan Huang prepared the
second release, paying meticulous attention to cleaning and formatting the raw data so
as to overcome shortcomings of the first release. Henriëtte Hendriks and John Hawkins
supported the project at its early stages. Colleagues at the Department of Theoretical and
Applied Linguistics have provided advice and feedback on various aspects of the project:
Teresa Parodi, Norbert Vanek, Amy Hsieh, Akira Murakami and Maria Kunevich. Many
thanks also to Ted Krawec for vital advice on providing a suitable user agreement and
overseeing legal elements of the project. Our external consultant Detmar Meurers, from
the University of Tübingen, has generously discussed the project with us, sharing ideas on
the challenges of automated linguistic annotation for learner data and questions of building
large scale databases. Sichu Jiang extended our originally rudimentary web-based
interface with critical functionality during a student internship at DTAL over Summer
2012. In addition, we would like to thank Dimitris Michelioudakis and Toby Hudson for
providing manual annotations to sample parsed data and Caroline Williams for technical
support at early stages of data collection. Last but not least, Anna Korhonen and Brechtje
Post, as co-investigators of this project, have shaped the design and structure of this unique
resource.
Dora Alexopoulou and Rachel Baker
2 Corpus Structure
EFCAMDAT consists of essays submitted to Englishtown, the online school of EF Education
First, by language learners all over the world (Education First, 2012). A full course in
Englishtown spans 16 proficiency levels aligned with common standards such as TOEFL,
IELTS and the Common European Framework of Reference for languages (CEFR), as shown
in Table 1.
Table 1: Englishtown skill levels in relation (indicative) to common standards
Englishtown                  1-3       4-6       7-9       10-12     13-15     16
Cambridge ESOL               -         KET       PET       FCE       CAE       -
IELTS                        -         <3        4-5       6         6-7       >7
TOEFL iBT                    -         -         57-86     87-109    110-120   -
TOEIC Listening & Reading    120-220   225-545   550-780   785-940   945       -
TOEIC Speaking & Writing     40-70     80-110    120-140   150-190   200       -
CEFR                         A1        A2        B1        B2        C1        C2
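When working with exported scripts it is often convenient to group Englishtown levels by the indicative CEFR bands of Table 1. The following is a minimal sketch of such a lookup; the function name and the data layout are illustrative assumptions and are not part of the EFCAMDAT interface.

```python
# Indicative Englishtown level bands, taken from the CEFR row of Table 1.
# The names CEFR_BANDS and englishtown_to_cefr are illustrative assumptions.
CEFR_BANDS = [
    (range(1, 4), "A1"),
    (range(4, 7), "A2"),
    (range(7, 10), "B1"),
    (range(10, 13), "B2"),
    (range(13, 16), "C1"),
    (range(16, 17), "C2"),
]


def englishtown_to_cefr(level: int) -> str:
    """Return the indicative CEFR band for an Englishtown level (1-16)."""
    for levels, band in CEFR_BANDS:
        if level in levels:
            return band
    raise ValueError(f"Englishtown level must be between 1 and 16, got {level}")


if __name__ == "__main__":
    for level in (1, 6, 10, 16):
        print(level, englishtown_to_cefr(level))
```

Since the alignment is only indicative, a band obtained this way is best treated as a coarse grouping variable rather than as a certified proficiency score.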
Learners are allocated to proficiency levels after a placement test when they start at
EF[2] or through successful progression through coursework. Each of the 16 levels
contains eight units, offering a variety of receptive and productive tasks. EFCAMDAT
consists of scripts of writing tasks at the end of each lesson on topics like those listed
in Table 2. Figures 1 and 2 illustrate the interface of the writing tasks.
Table 2: Examples of essay topics at various levels. Level and unit number are separated
by a colon.
ID    Essay topic                               ID     Essay topic
1:1   Introducing yourself by email             7:     Giving instructions to play a game
1:3   Writing an online profile                 8:2    Reviewing a song for a website
2:1   Describing your favourite day             9:7    Writing an apology email
2:6   Telling someone what you're doing         11:1   Writing a movie review
2:8   Describing your family's eating habits    12:1   Turning down an invitation
3:1   Replying to a new penpal                  13:4   Giving advice about budgeting
4:1   Writing about what you do                 15:1   Covering a news story
6:4   Writing a resume                          16:8   Researching a legendary creature
Given 16 proficiency levels and 8 units per level, a learner who started at the first
level and completed all 16 proficiency levels would produce 128 different essays. Essays are
graded by language teachers. Teachers provide feedback to learners using a basic set of
error markup tags or through free comments on students’ writing. Currently, EFCAMDAT
contains teacher feedback for 66% of scripts.
Figure 1: Screenshot of the writing task for Lesson 2 of Unit 2
[2] Starting students are placed at the first level of each stage: 1, 4, 7, 10, 13, or 16.
The data collected for the second release of EFCAMDAT contain 1,180,310 scripts (with
7,126,752 sentences and 83,543,480 word tokens) written by 174,743 learners. As we have
no direct information on the L1 backgrounds of learners, we use information on nationality
as the closest approximation to L1 background. EFCAMDAT contains data from learners
from 198 nationalities. Table 3 shows the spread of scripts across the nationalities with
the largest subcorpora[3].
Few learners complete all of the proficiency levels. The majority of learners only complete
portions of the program. For many, the start or end of their interaction with Englishtown
fell outside the scope of the data collection period.
Characterizing scripts quantitatively is difficult because of the variation across topics
and proficiency levels. Texts range from a list of words or a few short sentences to short
narratives or articles. As learners become more proficient they tend to produce longer scripts.
On average, scripts count 6 sentences (SD=3.2). Sample scripts are shown in Figure 3.
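The corpus-level totals reported for the second release (1,180,310 scripts, 7,126,752 sentences, 83,543,480 word tokens, 174,743 learners) can be turned into per-script and per-learner averages with a few divisions. The sketch below only restates that arithmetic; the variable and function names are illustrative. It gives roughly 6.0 sentences per script, in line with the average quoted above, and on the order of 71 word tokens per script.

```python
# Totals reported for the second release of EFCAMDAT.
N_SCRIPTS = 1_180_310
N_SENTENCES = 7_126_752
N_WORD_TOKENS = 83_543_480
N_LEARNERS = 174_743


def per_unit(total: int, units: int) -> float:
    """Average amount of `total` per unit, e.g. sentences per script."""
    return total / units


sentences_per_script = per_unit(N_SENTENCES, N_SCRIPTS)
tokens_per_script = per_unit(N_WORD_TOKENS, N_SCRIPTS)
tokens_per_sentence = per_unit(N_WORD_TOKENS, N_SENTENCES)
scripts_per_learner = per_unit(N_SCRIPTS, N_LEARNERS)

if __name__ == "__main__":
    print(f"sentences per script: {sentences_per_script:.1f}")
    print(f"tokens per script:    {tokens_per_script:.1f}")
    print(f"tokens per sentence:  {tokens_per_sentence:.1f}")
    print(f"scripts per learner:  {scripts_per_learner:.1f}")
```

These are global means only; as noted above, script length varies considerably with topic and proficiency level, so level-specific averages would have to be computed from the exported data.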
The data have been annotated with part-of-speech tags and information on grammatical
dependencies using the Penn Treebank Tagset (Marcus et al., 1993) and a freely
available state-of-the-art parser (SyntaxNet parser; Andor et al., 2016). Geertzen et al.
(2013) provide a detailed discussion of the performance of the Stanford parser (Klein and
Manning, 2003) on the EFCAMDAT scripts. Huang et al. (2017) provide a comparison of
the performance of various parsers on EFCAMDAT.
[3] Of the 198 nationalities, 4 have over 100 learners, and 68 nationalities over 50 learners.
Figure 2: Screenshot of the writing task for Lesson 5 of Unit 5
1. LEARNER 19345, LEVEL 1, UNIT 1, CHINESE
Hi! Anna,How are you? Thank you to sendmail to me. My name's
Anfeng.I'm 24 years old.Nice to meet you !I think we are friends
already,I hope we can learn english toghter! Bye! Anfeng.
2. LEARNER 44816, LEVEL 2, UNIT 1, ITALIAN
Hi, my name's Xavier. My favorite days is saturday. I get up at
[?] o'clock. I have a breakfast, I have a shower. Then, I goes to
the market. In the afternoon, I play music or go by bicycle. I like
sunday. And you ?
3. LEARNER 160954, LEVEL 8, UNIT 2, BRAZILIAN
Home Improvement is a pleasant protest song sung by Josh Woodward.
It's a simple but realistic song that analyzes how rapid changes in
a town affects the lives of many people in the name of progress. The
high bitter-sweet voice of the singer, the smooth guitar along with
the high pitched resonant drum sound like a moan recalling the past
or an ode to the previous town lifestyle and a protest to the negative
aspects this new prosperous city brought. I really enjoyed this song.
Figure 3: Three scripts, in which learners are asked to introduce themselves (1), describe
their favourite day (2), and review a song for a website (3)
Table 3: Percentage and number of scripts per nationality of learners
Nationality       Percentage of scripts   Number of scripts   Number of words
Brazilians        40.4%                   476,817             31,078,406
Chinese           14.0%                   165,162             11,909,869
Mexicans          7.4%                    87,260              5,707,891
Russians          5.9%                    70,208              454,224
Germans           4.6%                    54,597              4,887,108
Saudi Arabians    4.0%                    47,340              2,724,638
Italians          3.8%                    45,249              3,761,909
French            3.5%                    41,626              3,298,343
Taiwanese         2.5%                    29,569              349,534
[illegible]       1.8%                    21[...]4            [...]602,328
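The percentages in Table 3 follow from dividing each nationality's script count by the total of 1,180,310 scripts in the second release. As a hedged cross-check, the sketch below recomputes them for the rows whose script counts are fully legible; the dictionary and function names are illustrative.

```python
TOTAL_SCRIPTS = 1_180_310  # scripts in the second release

# Script counts per nationality as listed in Table 3 (fully legible rows only).
SCRIPTS_BY_NATIONALITY = {
    "Brazilians": 476_817,
    "Chinese": 165_162,
    "Mexicans": 87_260,
    "Russians": 70_208,
    "Germans": 54_597,
    "Saudi Arabians": 47_340,
    "Italians": 45_249,
    "French": 41_626,
    "Taiwanese": 29_569,
}


def share(count: int, total: int = TOTAL_SCRIPTS) -> float:
    """Percentage of all scripts, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for nationality, count in SCRIPTS_BY_NATIONALITY.items():
        print(f"{nationality:<15}{share(count):>6.1f}%")
```

The same function can be applied to any subcorpus count obtained through the interface, which makes it easy to report how much of the corpus a given selection covers.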
A substantial portion of scripts comes with error corrections that have been provided
by teachers using a list of error labels (see Appendix). The purpose of these corrections
was to provide feedback to learners, and as such they cannot be viewed as error annotation
based on an annotation scheme developed specifically for annotating learner corpora,
as for instance the ones developed by Nicholls (2003) or Lüdeling et al. (2005).
3 The EFCAMDAT web-based interface
The corpus can be accessed through a web-based interface at
http://corpus.mml.cam.ac.uk/efcamdat. Figure 4 shows the introductory page of the interface.
Figure 4: Introduction page of the EFCAMDAT interface
‘Explore’
The ‘Explore’ page is shown in Figure 5. Users need to download and read the End User
License Agreement and accept its terms and conditions. Access to EFCAMDAT is free of
charge but it is restricted to academic, non-commercial research. The user agreement sets
out standard conditions protecting copyright. Users who agree to these conditions can
provide some personal data and obtain access to EFCAMDAT.
Figure 5: Access to EFCAMDAT
‘Select scripts’
Figure 6 shows the page that allows users to select scripts for queries or export. Scripts can
be selected from the 16 different EF levels. In Figure 6 we have selected Level 3. Each level
has 8 units, each involving a unique writing topic. Thus, scripts are spread across different topics,
e.g. Meeting people, Home and family, etc. You can check the description and the interface
snapshot of each topic through the link ‘Description of Script Topic’. In Figure 6 we
have selected scripts from just one topic: Home and family. We have also chosen to see
only scripts written by Egyptian learners. The numbers in brackets indicate the number of
scripts for the specific choice. For instance, there are 190 scripts from Egyptians at Level 3,
of which 28 are on the selected topic, Home and family. The top box provides a summary
of the selection.
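The selection logic described above (filter by level, topic and nationality, then report a count per choice) can be reproduced offline on exported metadata. The following sketch uses a handful of invented records purely for illustration; the record fields and function names are assumptions, and the counts are not real EFCAMDAT figures.

```python
from collections import Counter

# Toy script metadata, invented for illustration; not real EFCAMDAT counts.
TOY_SCRIPTS = [
    {"level": 3, "topic": "Home and family", "nationality": "Egyptian"},
    {"level": 3, "topic": "Home and family", "nationality": "Brazilian"},
    {"level": 3, "topic": "Meeting people", "nationality": "Egyptian"},
    {"level": 1, "topic": "Introducing yourself by email", "nationality": "Egyptian"},
]


def select(scripts, **criteria):
    """Keep the scripts whose metadata equal every given criterion."""
    return [s for s in scripts
            if all(s[key] == value for key, value in criteria.items())]


def counts_by(scripts, field):
    """Number of scripts per value of `field`, like the bracketed counts."""
    return Counter(s[field] for s in scripts)


level3_egyptian = select(TOY_SCRIPTS, level=3, nationality="Egyptian")

if __name__ == "__main__":
    print(len(level3_egyptian))                 # scripts at Level 3 by Egyptian learners
    print(counts_by(level3_egyptian, "topic"))  # their spread across topics
```

Narrowing the selection step by step in this way mirrors how the bracketed numbers in the interface shrink as further criteria are added.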
Figure 6: Selection of EFCAMDAT scripts according to teaching level, lesson and learner
nationality
‘Query corpus’
The corpus can be queried by providing word patterns within the set of scripts that has
been selected. Figure 7 shows the page that allows users to select an example or already
entered query, or construct a novel query and look for sentences that match the pattern.
Figure 7: Query page
Word patterns can be specified as follows:
1. Each word specification is enclosed in brackets.
2. A word can be specified by means of a word token (e.g. [word="cars"]), a lemma
(e.g. [lemma="car"]), a part-of-speech tag (e.g. [pos="NNS"]), its relation to the
head it attaches to (e.g. [dg-rel="nsubj"]), and properties of the head (dg-hword
/ dg-hlemma / dg-hpos).
3. A word specification may contain multiple properties. For instance, a plural noun
that attaches as a direct object to a past-tense verb can be specified as:
[(pos="NNS") & (dg-rel="dobj") & (dg-hpos="VBD")]
4. Properties may also be underspecified, using .*. For instance, any noun that attaches
as a direct object to any verb can be specified as:
[(pos="N.*") & (dg-rel="dobj") & (dg-hpos="V.*")]
5. Word gaps in patterns can be indicated with empty brackets [], such that the pattern
[word="to"] []{1,3} [word="for"] will match the word token "to", followed by
at least one and at most three words, followed by the word token "for".
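To make the semantics of these patterns concrete, the sketch below re-implements them over a small hand-annotated sentence. It is an illustration under stated assumptions, not the query engine behind the interface: tokens are plain dictionaries keyed by the property names used above, each pattern element is a hypothetical (properties, min, max) triple, property values are treated as regular expressions that must match the whole value, and the annotations of the example sentence were written by hand.

```python
import re

# Hand-annotated example sentence (annotations are illustrative only).
TOKENS = [
    {"word": "I", "pos": "PRP", "dg-rel": "nsubj", "dg-hpos": "VBD"},
    {"word": "gave", "pos": "VBD", "dg-rel": "root", "dg-hpos": ""},
    {"word": "the", "pos": "DT", "dg-rel": "det", "dg-hpos": "NNS"},
    {"word": "keys", "pos": "NNS", "dg-rel": "dobj", "dg-hpos": "VBD"},
    {"word": "to", "pos": "TO", "dg-rel": "prep", "dg-hpos": "VBD"},
    {"word": "my", "pos": "PRP$", "dg-rel": "poss", "dg-hpos": "NN"},
    {"word": "brother", "pos": "NN", "dg-rel": "pobj", "dg-hpos": "TO"},
    {"word": "for", "pos": "IN", "dg-rel": "prep", "dg-hpos": "VBD"},
    {"word": "safekeeping", "pos": "NN", "dg-rel": "pobj", "dg-hpos": "IN"},
]


def token_matches(token, spec):
    """True if every property in `spec` matches the token's whole value.
    An empty spec, like the empty brackets [], matches any token."""
    return all(re.fullmatch(rx, token.get(prop, "")) for prop, rx in spec.items())


def match_at(tokens, start, pattern):
    """Return the end index of a match of `pattern` starting at `start`, else None."""
    if not pattern:
        return start
    spec, lo, hi = pattern[0]
    pos = start
    for count in range(hi + 1):
        if count >= lo:  # enough repetitions: try to match the rest of the pattern
            end = match_at(tokens, pos, pattern[1:])
            if end is not None:
                return end
        if count == hi or pos >= len(tokens) or not token_matches(tokens[pos], spec):
            return None
        pos += 1  # consume one more token for this pattern element
    return None


def find_matches(tokens, pattern):
    """Return the matched word sequences as strings, one per start position."""
    found = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None and end > start:
            found.append(" ".join(t["word"] for t in tokens[start:end]))
    return found


# [(pos="N.*") & (dg-rel="dobj") & (dg-hpos="V.*")]
NOUN_OBJECT = [({"pos": "N.*", "dg-rel": "dobj", "dg-hpos": "V.*"}, 1, 1)]
# [word="to"] []{1,3} [word="for"]
TO_GAP_FOR = [({"word": "to"}, 1, 1), ({}, 1, 3), ({"word": "for"}, 1, 1)]

if __name__ == "__main__":
    print(find_matches(TOKENS, NOUN_OBJECT))
    print(find_matches(TOKENS, TO_GAP_FOR))
```

On the example sentence the first pattern selects "keys" and the second selects "to my brother for". The gap is matched lazily here (shortest span first); whether the actual interface prefers shorter or longer gaps is not stated in this document.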
Construction of the patterns is aided by drop-down list boxes for the available
part-of-speech tags and grammatical relations. Results can be visualised in plain sentences or in
part-of-speech tag annotated sentences, and meta-information is provided, such as teaching
level and learner nationality. A dependency tree can be visualised upon request.
‘Export data’
Scripts that have been selected, or sentences that have been queried, can be exported to
XML files. Figure 8 shows the page that allows users to select what unit of interest to
export (scripts or queried sentences), what information should be included (raw script text,
syntactic annotations, or error corrections), and whether to compress the resulting XML
file.
Figure 8: Export
It is generally recommended to download zip-compressed XML unless the
selection is rather small. Zipped XML files, depending on the selection, may range from
tens of KB to about 1 GB at most.
The XML data contains a header with information about the corpus version and the selection
of levels and nationalities, followed by either scripts or sentences, each carrying
the information that was requested for inclusion (see Figure 9). The
XML structure of a full script with all available information is exemplified in Figure 10.
Figure 9: XML.
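Because zipped exports can approach 1 GB, it is usually preferable to stream the XML out of the archive rather than load it whole. The sketch below shows one way to do this with the Python standard library. The export schema is not reproduced in this text, so every element and attribute name used here (export, script, text, level, unit, nationality) and the member name export.xml are placeholders that must be replaced with the names found in an actual export file; the sample data is invented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical export: element and attribute names are assumptions, not the real schema.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<export>
  <script id="1" level="3" unit="2" nationality="eg">
    <text>My family is big.</text>
  </script>
  <script id="2" level="3" unit="2" nationality="eg">
    <text>I live with my parents.</text>
  </script>
</export>
"""


def make_sample_zip():
    """Build a small in-memory zip so the example runs without a real export."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("export.xml", SAMPLE_XML)
    buffer.seek(0)
    return buffer


def iter_scripts(zip_file, member="export.xml", tag="script"):
    """Yield (attributes, text) for each script element, streaming from the zip."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as stream:
            for _event, element in ET.iterparse(stream, events=("end",)):
                if element.tag == tag:
                    text = element.findtext("text", default="").strip()
                    yield dict(element.attrib), text
                    element.clear()  # release the processed element's children


if __name__ == "__main__":
    for attributes, text in iter_scripts(make_sample_zip()):
        print(attributes["level"], attributes["nationality"], text)
```

For a real download, pass the path of the zip file to iter_scripts in place of the in-memory sample, after adjusting the member, tag and child element names to the actual export.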
Frequently Asked Questions
The ‘FAQ’ page provides further information on how to use the corpus interface, and
contains documents with information on EFCAMDAT. We ask users to cite the following
papers when using EFCAMDAT:
Y. Huang, A. Murakami, T. Alexopoulou, A. Korhonen (2017). Dependency
parsing of learner English.
J. Geertzen, T. Alexopoulou, A. Korhonen (2013). Automatic linguistic annotation
of large scale L2 databases: The EF-Cambridge Open Language Database
(EFCAMDAT). In Proceedings of the 31st Second Language Research Forum
(SLRF), Carnegie Mellon, Cascadilla Press.
References
Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and
Collins, M. (2016). Globally normalized transition-based neural networks. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 2442-2452, Berlin, Germany. Association for Computational
Linguistics.
Education First (2012). Englishtown. http://www.englishtown.com/
Geertzen, J., Alexopoulou, T., and Korhonen, A. (2013). Automatic linguistic annotation of
large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT).
In Selected Proceedings of the 2012 Second Language Research Forum, Somerville, MA,
USA. Cascadilla Proceedings Project.
Huang, Y., Murakami, A., Alexopoulou, T., and Korhonen, A. (2017). Dependency parsing
of learner English.
Klein, D. and Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the
41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages
423-430, Stroudsburg, PA, USA. Association for Computational Linguistics.
Lüdeling, A., Walter, M., Kroymann, E., and Adolphs, P. (2005). Multi-level error annotation
in learner corpora. In Proceedings from the Corpus Linguistics Conference Series,
volume 1.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Nicholls, D. (2003). The Cambridge Learner Corpus: Error coding and analysis for
lexicography and ELT. In Proceedings of the Corpus Linguistics Conference, pages 572-581.
Lancaster University: University Centre for Computer Corpus Research on Language.
Appendix: Error codes
Code    Meaning
        change from x to y
        agreement
        article
        add space
        combine sentences
        capitalization
        delete
        expression or idiom
        highlight
        insert
        missing word
        new sentence
        no such word
        phraseology
        plural
        possessive
        preposition
        part of speech
        punctuation
        remove space
        singular
        spelling
        verb tense
WC      word choice
WO      word order