THE EF CAMBRIDGE OPEN LANGUAGE DATABASE (EFCAMDAT)
Information for Users

Yan Huang, Jeroen Geertzen, Rachel Baker, Anna Korhonen and Theodora Alexopoulou
Department of Theoretical and Applied Linguistics, University of Cambridge
and EF Education First

Second release, September 2017

1 Preface

EFCAMDAT is an open access corpus consisting of writings of learners of English worldwide submitted to Englishtown, the online school of EF Education First.¹ EFCAMDAT is a collaborative project between the Department of Theoretical and Applied Linguistics, University of Cambridge, and EF Education First, developed with the support of the Isaac Newton Trust, Trinity College. The corpus was first released in July 2013. Below we provide information on the second release of the corpus, which contains significantly more data and improvements in data pre-processing.

¹ https://englishlive.ef.com/en-us/learn-english-online

The EF-Cambridge Open Language Database, henceforth EFCAMDAT, is the first large project of the EF Education First Research Lab for Applied Language Learning at the Department of Theoretical and Applied Linguistics at the University of Cambridge. The Lab was launched in 2010 (originally as the EF Research Unit) to promote research in second language learning of English and innovation in language teaching through a systematic cross-fertilisation between linguistic research and teaching techniques.

Early on it became apparent that we had a unique opportunity to use big educational data from a real-life language learning environment for research. Thousands of learners from around the world access English Live, formerly known as Englishtown, daily. English Live is EF's online language school, where students can follow lessons and submit work. This provides a unique opportunity for the collection of written data on an unprecedented scale.

Building a corpus of considerable scope and size raises important questions about its structure, access and usability.
Researchers working with large corpora like EFCAMDAT cannot rely on manual inspection, annotation or extraction of data if the quantitative power of such resources is to be exploited. Unless linguistic annotation is automated to a considerable degree to enable automated extraction of patterns, the scope of such corpora will remain unexploited. We therefore employed Natural Language Processing (NLP) tools and technology for automated analysis of learner English, a challenging enterprise in itself given that NLP technology has only recently been applied to L2 data. Finally, a web-based interface that supports data export and provides a search tool was built to maximise the accessibility and usability of the corpus.

A diverse team of linguists and computational linguists was thus put together to build a corpus of written data from Englishtown. The work has been sponsored by the Isaac Newton Trust, Trinity College, Cambridge and EF Education First. EF has further provided vital logistical support in data collection and transfer, obtaining consent from EF students and additional meta-data where needed. Last, but not least, EF has championed an open access research resource for the study of learner English.

We would like to thank a number of people who have made this project possible, from its conception to the first and second releases.

From the EF side: Chris McCormick for his critical ideas in shaping this project at early stages and his co-ordinating work to get this project off the ground; Eric Azumi and his team for their technical support of the data collection and transfer; Yerrie Kim for all her co-ordinating work at various stages of the project; Tamsyn Brownie, Soraya Estrella, Jane Chen, and Franco Papeschi for designing the cover pages of the web interface.

From the Cambridge side: Jeroen Geertzen transformed millions of unstructured data items into a structured, annotated and accessible corpus.
He has overseen the application and evaluation of NLP tools on learner data for the first release of July 2013, and has also supported the second release. Yan Huang has prepared the second release, paying meticulous attention to cleaning and formatting the raw data so as to overcome shortcomings of the first release. Henriëtte Hendriks and John Hawkins supported the project at its early stages. Colleagues at the Department of Theoretical and Applied Linguistics have provided advice and feedback on various aspects of the project: Teresa Parodi, Norbert Vanek, Amy Hsieh, Akira Murakami and Maria Kunevich. Many thanks also to Ted Krawec for vital advice on providing a suitable user agreement and overseeing legal elements of the project. Our external consultant Detmar Meurers, from the University of Tübingen, has generously discussed the project with us, sharing ideas on the challenges of automated linguistic annotation for learner data and questions of building large scale databases. Sichu Jiang extended our originally rudimentary web-based interface with critical functionality during a student internship at DTAL over Summer 2012. In addition, we would like to thank Dimitris Michelioudakis and Toby Hudson for providing manual annotations to sample parsed data and Caroline Williams for technical support at early stages of data collection. Last but not least, Anna Korhonen and Brechtje Post, as co-investigators of this project, have shaped the design and structure of this unique resource.

Dora Alexopoulou and Rachel Baker

2 Corpus Structure

EFCAMDAT consists of essays submitted to Englishtown, the online school of EF Education First, by language learners all over the world (Education First, 2012).
A full course in Englishtown spans 16 proficiency levels aligned with common standards such as TOEFL, IELTS and the Common European Framework of Reference for Languages (CEFR), as shown in Table 1.

Table 1: Englishtown skill levels in relation (indicative) to common standards

Englishtown                 1-3      4-6      7-9      10-12    13-15    16
Cambridge ESOL              -        KET      PET      FCE      CAE      -
IELTS                       -        <3       4-5      5-6      6-7      >7
TOEFL iBT                   -        -        57-86    87-109   110-120  -
TOEIC Listening & Reading   120-220  225-545  550-780  785-940  945      -
TOEIC Speaking & Writing    40-70    80-110   120-140  150-190  200      -
CEFR                        A1       A2       B1       B2       C1       C2

Learners are allocated to proficiency levels after a placement test when they start a course at EF² or through successful progression through coursework. Each of the 16 levels contains eight units, offering a variety of receptive and productive tasks. EFCAMDAT consists of scripts of writing tasks at the end of each lesson on topics like those listed in Table 2. Figures 1 and 2 illustrate the interface of the writing tasks.

² Starting students are placed at the first level of each stage: 1, 4, 7, 10, 13, or 16.

Table 2: Examples of essay topics at various levels. Level and unit number are separated by a colon.

ID     Essay topic                                ID     Essay topic
1:1    Introducing yourself by email              7:     Giving instructions to play a game
1:3    Writing an online profile                  8:2    Reviewing a song for a website
2:1    Describing your favourite day              9:7    Writing an apology email
2:6    Telling someone what you're doing          11:1   Writing a movie review
2:8    Describing your family's eating habits     12:1   Turning down an invitation
3:1    Replying to a new penpal                   13:4   Giving advice about budgeting
4:1    Writing about what you do                  15:1   Covering a news story
6:4    Writing a resume                           16:8   Researching a legendary creature

Given 16 proficiency levels and 8 units per level, a learner who started at the first level and completed all 16 proficiency levels would produce 128 different essays. Essays are graded by language teachers. Teachers provide feedback to learners using a basic set of error markup tags or through free comments on students' writing. Currently, EFCAMDAT contains teacher feedback for 66% of scripts.

Figure 1: Screenshot of the writing task for Lesson 2 of Unit 2

Figure 2: Screenshot of the writing task for Lesson 5 of Unit 5

The data collected for the second release of EFCAMDAT contain 1,180,310 scripts (with 7,126,752 sentences and 83,543,480 word tokens) written by 174,743 learners. As we have no direct information on the L1 backgrounds of learners, we use information on nationality as the closest approximation to L1 background. EFCAMDAT contains data from learners of 198 nationalities. Table 3 shows the spread of scripts across the nationalities with the largest subcorpora.³

³ Of the 198 nationalities, 4 have over 100 learners, and 68 nationalities over 50 learners.

Table 3: Percentage and number of scripts per nationality of learners

Nationality      Percentage of scripts   Number of scripts   Number of words
Brazilians       40.4%                   476,817             31,078,406
Chinese          14.0%                   165,162             11,909,869
Mexicans         7.4%                    87,260              5,707,891
Russians         5.9%                    70,208              454,224
Germans          4.6%                    54,597              4,887,108
Saudi Arabians   4.0%                    47,340              2,724,638
Italians         3.8%                    45,249              3,761,909
French           3.5%                    41,626              3,298,343
Taiwanese        2.5%                    29,569              349,534
[...]            1.8%                    21 4                602,328

Few learners complete all of the proficiency levels. The majority of learners only complete portions of the program. For many, the start or end of their interaction with Englishtown fell outside the scope of the data collection period.

Characterizing scripts quantitatively is difficult because of the variation across topics and proficiency levels. Texts range from a list of words or a few short sentences to short narratives or articles. As learners become more proficient they tend to produce longer scripts. On average, scripts count 6 sentences (SD=3.2). Sample scripts are shown in Figure 3.

The data have been annotated with part-of-speech tags and information on grammatical dependencies using the Penn Treebank Tagset (Marcus et al., 1993) and a freely available state-of-the-art parser (SyntaxNet parser; Andor et al. 2016). Geertzen et al. (2013) provide a detailed discussion of the performance of the Stanford parser (Klein and Manning, 2003) on the EFCAMDAT scripts. Huang et al. (2017) provide a comparison of the performance of various parsers on EFCAMDAT.

1. LEARNER 19345, LEVEL 1, UNIT 1, CHINESE
Hi! Anna,How are you? Thank you to sendmail to me. My name's Anfeng.I'm 24 years old.Nice to meet you !I think we are friends already,I hope we can learn english toghter! Bye! Anfeng.

2. LEARNER 44816, LEVEL 2, UNIT 1, ITALIAN
Hi, my name's Xavier. My favorite days is saturday. I get up at @ o'clock. I have a breakfast, I have a show Then, I goes to the market. In the afternoon, I play music or go by bicycle. I like sunday. And you ?

3. LEARNER 160954, LEVEL 8, UNIT 2, BRAZILIAN
Home Improvement is a pleasant protest song sung by Josh Woodward It's a simple but realistic song that analyzes how rapid changes in a town affects the lives of many people in the name of progress. The high bitter-sweet voice of the singer, the smooth guitar along with the high pitched resonant drum sound like a moan recalling the past or an ode to the previous town lifestyle and a protest to the negative aspects this new prosperous city brought. I really enjoyed this song

Figure 3: Three scripts, in which learners are asked to introduce themselves (1), describe their favourite day (2), and review a song for a website (3)

A substantial portion of scripts comes with error corrections that have been provided by teachers using a list of error labels (see Appendix). The purpose of these corrections was to provide feedback to learners, and as such they cannot be viewed as error annotation based on a scheme developed specifically for annotating learner corpora, such as those developed by Nicholls (2003) or Lüdeling et al. (2005).

3 The EFCAMDAT web-based interface

The corpus can be accessed through a web-based interface at http://corpus.mml.cam.ac.uk/efcamdat. Figure 4 shows the introductory page of the interface.

Figure 4: Introduction page of the EFCAMDAT interface

'Explore'

The 'Explore' page is shown in Figure 5. Users need to download and read the End User License Agreement and accept its terms and conditions. Access to EFCAMDAT is free of charge but is restricted to academic, non-commercial research. The user agreement sets out standard conditions protecting copyright. Users who agree to these conditions can provide some personal data and obtain access to EFCAMDAT.

Figure 5: Access to EFCAMDAT

'Select scripts'

Figure 6 shows the page that allows users to select scripts for queries or export. Scripts can be selected from the 16 different EF levels. In Figure 6 we have selected Level 3.
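The level and unit structure behind this selection can be made concrete with a small helper. The following Python sketch is illustrative only: it encodes the indicative CEFR alignment from Table 1 and the level:unit topic IDs from Table 2. The names `cefr_band`, `parse_topic_id` and `all_topic_ids` are hypothetical and are not part of the EFCAMDAT interface or its export format.

```python
from typing import NamedTuple

LEVELS = 16           # proficiency levels in a full course
UNITS_PER_LEVEL = 8   # units (and writing topics) per level

# Indicative CEFR band for each range of Englishtown levels (Table 1).
CEFR_BY_LEVEL_RANGE = [
    (range(1, 4), "A1"),
    (range(4, 7), "A2"),
    (range(7, 10), "B1"),
    (range(10, 13), "B2"),
    (range(13, 16), "C1"),
    (range(16, 17), "C2"),
]


class TopicId(NamedTuple):
    level: int
    unit: int


def cefr_band(level: int) -> str:
    """Return the indicative CEFR band for an Englishtown level (1-16)."""
    for levels, band in CEFR_BY_LEVEL_RANGE:
        if level in levels:
            return band
    raise ValueError(f"level must be between 1 and {LEVELS}, got {level}")


def parse_topic_id(topic_id: str) -> TopicId:
    """Split a 'level:unit' ID such as '8:2' and check both parts are in range."""
    level_text, unit_text = topic_id.split(":")
    level, unit = int(level_text), int(unit_text)
    if not 1 <= level <= LEVELS:
        raise ValueError(f"level out of range in {topic_id!r}")
    if not 1 <= unit <= UNITS_PER_LEVEL:
        raise ValueError(f"unit out of range in {topic_id!r}")
    return TopicId(level, unit)


def all_topic_ids() -> list[TopicId]:
    """Enumerate every level and unit combination of a full course."""
    return [
        TopicId(level, unit)
        for level in range(1, LEVELS + 1)
        for unit in range(1, UNITS_PER_LEVEL + 1)
    ]


if __name__ == "__main__":
    topic = parse_topic_id("8:2")  # "Reviewing a song for a website"
    print(topic.level, topic.unit, cefr_band(topic.level))  # 8 2 B1
    print(len(all_topic_ids()))  # 128
```

Enumerating every level and unit reproduces the figure of 128 essays for a learner who completes the full course. The CEFR bands are only as precise as the indicative alignment in Table 1.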
Each level has 8 units, each involving a unique writing topic. Thus, scripts are spread across different topics, e.g. Meeting people, Home and family, etc. You can check the description and the interface snapshot of each topic through the link 'Description of Script Topic'. In Figure 6 we have selected scripts from just one topic: Home and family. We have also chosen to see only scripts written by Egyptian learners. The numbers in brackets indicate the number of scripts for the specific choice. For instance, there are 190 scripts from Egyptians at Level 3, of which 28 are on the selected topic, Home and family. The top box provides a summary of the selection.

Figure 6: Selection of EFCAMDAT scripts according to teaching level, lesson and learner nationality

'Query corpus'

The corpus can be queried by providing word patterns within the set of scripts that has been selected. Figure 7 shows the page that allows users to select an example or previously entered query, or construct a novel query and look for sentences that match the pattern. Word patterns can be specified as follows:

1. Each word specification is enclosed in square brackets.
2. A word can be specified by means of a word token (e.g. [word="cars"]), a lemma (e.g. [lemma="car"]), a part-of-speech tag (e.g. [pos="NNS"]), its relation to the head it attaches to (e.g. [dg-rel="nsubj"]), and properties of the head (dg-hword / dg-hlemma / dg-hpos).
3. A word specification may contain multiple properties. For instance, a plural noun that attaches as a direct object to a past-tense verb can be specified as: [(pos="NNS") & (dg-rel="dobj") & (dg-hpos="VBD")]
4. Properties may also be underspecified, using .* . For instance, any noun that attaches as a direct object to any verb can be specified as: [(pos="N.*") & (dg-rel="dobj") & (dg-hpos="V.*")]
5. Word gaps in patterns can be indicated with empty brackets [], such that the pattern [word="to"] []{1,3} [word="for"] will match the word token "to", followed by at least one and at most three words, followed by the word token "for".

Figure 7: Query page

Construction of the patterns is aided by drop-down list boxes for the available part-of-speech tags and grammatical relations. Results can be visualised as plain sentences or as sentences annotated with part-of-speech tags, and meta-information is provided, such as teaching level and learner nationality. A dependency tree can be visualised upon request.

'Export data'

Scripts that have been selected, or sentences that have been queried, can be exported to XML files. Figure 8 shows the page that allows users to select what unit of interest to export (scripts or queried sentences), what information should be included (raw script text, syntactic annotations, or error corrections), and whether to compress the resulting XML file.
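Because queried sentences are one of the exportable units, it helps to be sure what a pattern matches before exporting. The following is a minimal Python sketch of the pattern rules listed under 'Query corpus', not the interface's own implementation. It assumes that each property value is a regular expression that must match the whole attribute, and that tokens are plain dictionaries keyed by the property names; the function names and the sample annotations are hypothetical.

```python
import re

# A bracketed word specification, optionally followed by a {min,max} repeat.
SPEC_RE = re.compile(r'\[([^\]]*)\](?:\{(\d+),(\d+)\})?')
# A single property="value" pair inside a specification.
PROP_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_query(query):
    """Parse a query string into (properties, min_repeat, max_repeat) triples."""
    specs = []
    for body, low, high in SPEC_RE.findall(query):
        properties = dict(PROP_RE.findall(body))
        specs.append((properties, int(low or 1), int(high or 1)))
    return specs


def token_matches(token, properties):
    """Assumption: each value is a regular expression that must match fully."""
    return all(
        re.fullmatch(pattern, token.get(name, "")) is not None
        for name, pattern in properties.items()
    )


def match_at(tokens, start, specs):
    """Return the end index of a match beginning at `start`, or None."""
    if not specs:
        return start
    properties, low, high = specs[0]
    # Count how many consecutive tokens satisfy this specification.
    available = 0
    while (available < high and start + available < len(tokens)
           and token_matches(tokens[start + available], properties)):
        available += 1
    # Try the shortest permitted repeat first, then longer ones.
    for count in range(low, available + 1):
        end = match_at(tokens, start + count, specs[1:])
        if end is not None:
            return end
    return None


def find_matches(tokens, query):
    """Return (start, end) token spans that match the query."""
    specs = parse_query(query)
    spans = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, specs)
        if end is not None:
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    # Hand-made annotations for "She bought cars"; not taken from the corpus.
    sentence = [
        {"word": "She", "lemma": "she", "pos": "PRP", "dg-rel": "nsubj", "dg-hpos": "VBD"},
        {"word": "bought", "lemma": "buy", "pos": "VBD", "dg-rel": "root"},
        {"word": "cars", "lemma": "car", "pos": "NNS", "dg-rel": "dobj", "dg-hpos": "VBD"},
    ]
    query = '[(pos="N.*") & (dg-rel="dobj") & (dg-hpos="V.*")]'
    for start, end in find_matches(sentence, query):
        print([token["word"] for token in sentence[start:end]])  # ['cars']
```

An empty specification has no properties, so it accepts any token, which is how the gap `[]{1,3}` behaves in this sketch. Whether the real interface prefers shorter or longer gaps when both are possible is not stated in this guide; the sketch tries the shortest first.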
Figure 8: Export

It is generally recommended to download zip-compressed XML unless the selection is rather small. Zipped XML files, depending on the selection, may range from tens of KB to about 1 GB at most.

The XML data contains a header with information about the corpus version and the selection of levels and nationalities, followed by either scripts or sentences, each provided with the information that was requested for inclusion (see Figure 9). The XML structure of a full script with all available information is exemplified in Figure 10.

Figure 9: XML

Frequently Asked Questions

The 'FAQ' page provides further information on how to use the corpus interface, and contains documents with information on EFCAMDAT. We ask users to cite the following papers when using EFCAMDAT:

Y. Huang, A. Murakami, T. Alexopoulou, A. Korhonen (2017). Dependency parsing of learner English.

J. Geertzen, T. Alexopoulou, A. Korhonen (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon, Cascadilla Press.

References

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and Collins, M. (2016). Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442-2452, Berlin, Germany. Association for Computational Linguistics.

Education First (2012). Englishtown. http://www.englishtown.com/

Geertzen, J., Alexopoulou, T., and Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Selected Proceedings of the 2012 Second Language Research Forum, Somerville, MA, USA. Cascadilla Proceedings Project.
Huang, Y., Murakami, A., Alexopoulou, T., and Korhonen, A. (2017). Dependency parsing of learner English.

Klein, D. and Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423-430, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lüdeling, A., Walter, M., Kroymann, E., and Adolphs, P. (2005). Multi-level error annotation in learner corpora. In Proceedings from the Corpus Linguistics Conference Series, volume 1.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Nicholls, D. (2003). The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics Conference, pages 572-581. Lancaster University: University Centre for Computer Corpus Research on Language.

Appendix: Error codes

Meaning
change from x to y
agreement
article
add space
combine sentences
capitalization
delete
expression of idiom
highlight
insert
missing word
new sentence
no such word
phraseology
plural
possessive
preposition
part of speech
punctuation
remove space
singular
spelling
verb tense
word choice
word order
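For users working offline with the zip-compressed XML files described under 'Export data', a useful first step is to see which elements an export actually contains. The Python sketch below makes no assumption about EFCAMDAT element names: it streams through every XML file in a zip archive and counts how often each element name occurs. The function names are hypothetical, and the sample archive with its element names (`export`, `item`, `text`) is invented purely for the demonstration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_stream):
    """Stream through an XML file and count how often each element name occurs."""
    counts = Counter()
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        counts[element.tag] += 1
        element.clear()  # release memory held by elements already counted
    return counts


def count_elements_in_zip(zip_source):
    """Count element names across every .xml member of a zip archive."""
    totals = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                with archive.open(name) as member:
                    totals.update(count_elements(member))
    return totals


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory; the element names are invented.
    sample = (b"<export><item><text>Hi!</text></item>"
              b"<item><text>Bye!</text></item></export>")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sample.xml", sample)
    buffer.seek(0)
    for tag, occurrences in sorted(count_elements_in_zip(buffer).items()):
        print(tag, occurrences)
```

Streaming with `iterparse` avoids loading a whole export into memory at once, which matters because zipped exports can approach 1 GB. For a real export, pass the path of the downloaded zip file to `count_elements_in_zip`.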
