
EACL 2017 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.

A Multi-aspect Analysis of Automatic Essay Scoring for Brazilian Portuguese

Anonymous EACL submission
Abstract

Several methods for automatic essay scoring (AES) have been proposed for the English language. However, multi-aspect AES systems for other languages are unusual. We therefore propose a multi-aspect AES system and apply it to a dataset of Brazilian Portuguese essays, which human experts evaluated according to the five aspects defined by the Brazilian Government for the National Exam of High School (ENEM). These aspects are skills that students must master, and every skill is assessed separately. Besides prediction, we also performed a feature analysis for each aspect. The proposed AES system employs several features already used by AES systems for the English language. Our results show that predictions for some aspects performed well with the features we employed, while predictions for other aspects performed poorly. The detailed feature analysis we performed also reveals differences between the five aspects. Beyond these contributions, the eight million enrollments in ENEM every year raise challenging issues for future directions in our research.

1 Introduction

The goal of automatic essay scoring (AES) systems is to score a given essay. AES systems are relevant for educational institutions, since the human effort to evaluate essays is high and students need feedback to improve their writing skills. Besides these issues, almost every senior high school student in Brazil writes an essay for the National Exam of High School (ENEM), which the Brazilian government uses to evaluate the quality of high school teaching.

Although thousands of essays are written for ENEM every year, to the best of our knowledge there is no AES system for the Brazilian Portuguese (BP) language, nor any analysis of features in a multi-aspect essay scoring system for BP. Each aspect is a skill that a student must master as a senior high school student. Several AES systems have nonetheless been proposed for the English language. Attali and Burstein (2006) proposed an AES system, called e-rater, that employs general features of argumentative essays for score prediction. The main features used by e-rater are grouped into the following types: grammar, usage, mechanics, and style; organization and development; lexical complexity; and prompt-specific vocabulary usage. e-rater employs multiple regression and is a task-independent AES system, i.e., its score is independent of the given prompt.

Napoles and Callison-Burch (2015) apply linear regression in an AES system that intends to assign more uniform grades than multiple human evaluators. Similar to our task, Napoles and Callison-Burch propose a multi-aspect classification task with five grading categories. However, the authors leave unexplained how each of their aspects is affected by their features. We think this is an important contribution, since students or professors can use features as feedback to better understand essay writing. Besides that, Napoles and Callison-Burch assume that more than one evaluator is available to train their model, which in the real world is not always the case.

Larkey (1998) proposed three models based on text classification to score essays, applying linear regression. However, Larkey's


strategy is task-dependent. Chen and He (2013) also grouped features into four main types: lexical features; syntactical features; grammar and fluency features; and content and prompt-specific features. The authors then proposed a rank-based algorithm that maximizes the agreement between human and machine scores. Zesch et al. (2015) developed a domain-adaptation technique for AES systems and tested their method on an English dataset and a German dataset. Chali and Hasan (2012) proposed an LSA-based method, which is task-dependent, whose goal is to establish a strategy to understand the inner meaning of texts. Beyond the English language, Kakkonen and Sutinen (2004) developed an AES system for the Finnish language, also based on the LSA algorithm.

Besides assigning a grade, other studies proposed to analyze the argumentation strength of essays (Persing and Ng, 2015), the discourse structure of essays (Song et al., 2015; Stab and Gurevych, 2014), and grammar correction in general (Rozovskaya and Roth, 2014; Lee et al., 2014).

Our research differs from previous work since we aim to answer the following questions:

1. How do objective features behave in a multi-aspect automatic essay scoring system?

2. Which features are more relevant for each aspect?

Besides that, we aim to pose some interesting questions for future research. Our essays present not only grades but also evaluators' comments about the aspects considered in ENEM. While exploring these comments, we observed bias in some evaluations. We speak of bias when a human evaluator seems to agree or disagree with the student's point of view, which can improperly influence the student's grade. The possibility of bias in human evaluations raises some questions:

1. Are some essay topics more prone to biased evaluation?

2. Is it possible to detect whether a human evaluator is biased for or against a given student's point of view?

3. If it is possible to detect the bias of a human evaluator, is it feasible to measure its quantitative effect on grades?

4. Is there any difference between the words in biased evaluations that agree with the student's point of view and those in biased evaluations that disagree with it?

In a nutshell, there are many possible questions, and as far as we know these questions remain unanswered.

The paper is organized as follows. The second section details our dataset and the features we use. The third section explains the experiments we performed and their results. The fourth section presents the main remarks about our research, and the fifth section points to future directions.

2 Methodology

We propose a methodology that, besides the usual features employed by popular AES methodologies (Attali and Burstein, 2006; Chen and He, 2013; Zesch et al., 2015), also takes advantage of domain features. To test our proposed features, we used a dataset of nearly 1840 essays. The next sections describe our dataset and our features.

2.1 Dataset

Our dataset is composed of 1840 essays about 96 topics, which were crawled from the UOL Essay Database website (http://educacao.uol.com.br/bancoderedacoes). The average length in words is 300.51; the longest essay has 1293 words, and the shortest has 49 words. Each essay is evaluated according to the following five aspects:

1. Formal language: Mastery of the formal Portuguese language.

2. Understanding the task: Understanding of the essay prompt and application of concepts from different knowledge fields to develop the theme in an argumentative dissertation format.

3. Organization of information: Selecting, connecting, organizing, and interpreting information, facts, opinions, and arguments to advocate a point of view.


4. Knowing argumentation: Demonstration of knowledge of the linguistic mechanisms required to construct arguments.

5. Solution proposal: Formulation of a proposal for the problem presented, respecting human rights and considering socio-cultural diversity.

Each aspect is scored according to the scale in Table 1, and the final score is the sum of all aspect scores. Table 2 depicts the average score assigned by humans for each aspect and for the final grade in our dataset.

Table 1: Scores and corresponding levels
Score  Level
2.0    Satisfactory
1.5    Good
1.0    Regular
0.5    Weak
0.0    Unsatisfying

Table 2: Average score for each aspect and final grade in the UOL dataset
Aspect                       Average Score
Formal Language              1.1
Understanding the task       0.91
Organization of information  0.93
Knowing argumentation        0.83
Solution proposal            1.08
Final grade                  4.86

Each essay is evaluated by only one human. Although this seems a disadvantage, we think it makes this a real-world dataset, since in most high schools only one teacher scores each essay. Also, as we aim to detect the impact of features on each aspect, one evaluator per essay is enough.

2.2 Features

Features are divided into two main types: domain features, which are related to the ENEM exam or to the Brazilian Portuguese language, and general features, which are based on the research of Attali and Burstein (2006).

1. Domain features: The ENEM exam does not allow the use of first person pronouns and verbs. Therefore, we employ as features the number of first person pronouns and verbs and the number of first person pronouns and verbs per number of tokens. We also propose as features the number of enclises, a Portuguese language structure, and the number of enclises per number of tokens. Enclisis is unusual in spoken BP, so if a student applies this construction in an essay, he or she probably knows how to use formal language better. Also, an excessive number of demonstrative pronouns is condemned in written BP (Martins, 2000); we therefore use the number of demonstrative pronouns and the number of demonstrative pronouns per number of tokens.
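As an illustration, these domain features can be sketched as below. This is our sketch, not the paper's implementation (which likely relies on a POS tagger): the word lists are illustrative and non-exhaustive, and enclisis is approximated by a regex over hyphen-attached clitic pronouns (e.g., "tornou-se").

```python
import re

# Illustrative (non-exhaustive) word lists; the paper does not specify them.
FIRST_PERSON = {"eu", "me", "mim", "comigo", "meu", "minha", "meus", "minhas"}
DEMONSTRATIVES = {"este", "esta", "estes", "estas", "isto",
                  "esse", "essa", "esses", "essas", "isso"}
# Enclitic pronoun attached to a verb with a hyphen, e.g. "tornou-se".
ENCLISIS = re.compile(
    r"\b\w+-(me|te|se|lhe|lhes|nos|vos|o|a|os|as|lo|la|los|las)\b",
    re.IGNORECASE)

def domain_features(essay: str) -> dict:
    # Hyphenated words ("tornou-se") count as a single token.
    tokens = re.findall(r"\w+(?:-\w+)*", essay.lower())
    n_tokens = len(tokens) or 1
    first_person = sum(t in FIRST_PERSON for t in tokens)
    demonstratives = sum(t in DEMONSTRATIVES for t in tokens)
    enclises = len(ENCLISIS.findall(essay))
    return {
        "#first_person": first_person,
        "#first_person/#tokens": first_person / n_tokens,
        "#demonstratives": demonstratives,
        "#demonstratives/#tokens": demonstratives / n_tokens,
        "#enclises": enclises,
        "#enclises/#tokens": enclises / n_tokens,
    }
```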
2. General features: Most of the general features are based on the research of Attali and Burstein (2006). However, due to the lack of tools for Brazilian Portuguese and to time constraints, we implemented the following thirteen features, divided into five categories.

• Grammar and style: Grammar was checked with CoGrOO (Time de desenvolvimento CoGrOO, 2012), a Brazilian grammar checker for Open Office Writer. For spelling mistakes we used a Brazilian spell checker (https://github.com/giullianomorroni/JCorretorOrtografico). Both counts were also divided by the number of tokens in the essay, so we employed four features for grammar and spelling errors. To evaluate style, we applied the LanguageTool rules for Portuguese and added some rules suggested by a Portuguese manual of writing (Martins, 2000). We employed the number of style errors and the number of style errors per sentence as features.

• Syntactical features: According to Martins (2000), sentences in Portuguese longer than 70 characters are considered long and are therefore not recommended. We employ as a feature the number of sentences longer than 70 characters.

• Organization and development: Since there are no tools to evaluate organization and development in the Portuguese language, we collected discourse markers from a Brazilian Portuguese grammar (Jubran


and Koch, 2006). Discourse markers are linguistic units that establish connections between sentences to build coherent and knit discourse. We employed as features the number of discourse markers and the number of discourse markers per sentence.

• Lexical complexity: To evaluate lexical complexity, we used four features. The first is the Portuguese version of the Flesch score (Martins et al., 1996); the second is the average word length, measured in syllables; the third is the number of tokens in the essay; and the fourth is the number of different words in the essay.

• Prompt-specific vocabulary usage: It is desirable to employ concepts from the prompt in the essay; therefore, for each essay we compute the cosine similarity between the prompt and the essay. In this case, the prompt is a frequency vector of words, and the essay is also a frequency vector of words drawn from the prompt vocabulary. We decided on this strategy since, unlike other works, our dataset comprises many different topics, each with few essays; we therefore think that building a vocabulary for each domain would not be helpful.

Table 3: List of features, grouped into domain type and general type
Group    Feature
Domain   #first person singular verbs and pronouns
         #first person singular verbs and pronouns / #tokens
         #demonstrative pronouns
         #demonstrative pronouns / #tokens
         #enclises
         #enclises / #tokens
General  #sentences longer than 70 characters
         #grammar errors
         #grammar errors / #tokens
         #spelling errors
         #spelling errors / #tokens
         #style errors / #sentences
         #discourse markers
         #discourse markers / #sentence
         Flesch score
         Average word length (syllables)
         #tokens
         Similarity with prompt
         #different words
327 377
build a vocabulary for each domain it is uators, and when the value of kappa is closer to 0,
328 not helpful. 378
the lower the agreement between evaluators.
329 379
First, we compute a matrix of weights (Equation
330 380
3 Experiments 1) that are based on the difference between human
331 381
evaluation and machine scoring.
332 We performed two types of experiments: one eval- 382
333 uating the performance of grade prediction for (i − j)2 383
334
wi,j = (1) 384
each aspect and other evaluating the role of each (N − 1)2
335 feature in grade prediction task. Feature analy- 385
336 sis is of particular importance for this task since The second step calculates a histogram matrix 386
337 computer evaluation of an essay is different from called O, where Oi,j is the number of essays that 387
338 a human analysis. Therefore, explore which vari- receive grade i ∈ N by a human evaluator and a 388
able is important for which aspect is crucial for the grade j ∈ N by a machine evaluator. After that,
339 389
development of our research. we built another matrix E of expected ratings,
340 390
which is the outer product between each rater’s
341 391
3.1 Prediction Analysis histogram vector of ratings. Finally, we employ
342 392
Besides ASAP challenge at Kaggle 3 , several O, E, and w to compute the quadratic weighted
343 393
works employ quadratic weighted kappa as the kappa using Equation 2.
344 394
345
evaluation metric (Zesch et al., 2015)(Chen and P 395
i,j wi,j Oi,j
346
He, 2013)(Attali and Burstein, 2006), which aims κ=1− P (2) 396
to measure agreement between human evaluation i,j wi,j Ei,j
347 397
and machine scoring. When the value of kappa is
348 A simple regression is applied to predict the fi- 398
3
349 https://www.kaggle.com/c/asap-aes/details/evaluation nal grade of essays, and each of other five aspects. 399
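Equations 1 and 2 can be computed directly from the two raters' grades. A minimal NumPy sketch follows (function and variable names are ours); it assumes integer grade indices in [0, N).

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, n_grades):
    """Quadratic weighted kappa between two raters (Equations 1 and 2).
    `human` and `machine` are integer grade indices in [0, n_grades)."""
    N = n_grades
    # Equation 1: weights grow with the squared grade disagreement.
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    w = (i - j) ** 2 / (N - 1) ** 2
    # Observed matrix O: O[i, j] counts essays graded i by the human
    # evaluator and j by the machine evaluator.
    O = np.zeros((N, N))
    for h, m in zip(human, machine):
        O[h, m] += 1
    # Expected matrix E: outer product of the two raters' grade histograms,
    # normalized so that E sums to the same total as O.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Equation 2.
    return 1 - (w * O).sum() / (w * E).sum()
```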


Table 4: Kappa values for each grade aspect
Grade Type                   Kappa
Final Grade                  0.3673
Formal Language              0.3147
Understanding the task       0.2678
Organization of Information  0.2305
Knowing argumentation        0.2704
Solution proposal            0.1393

Table 5: Kappa values for each grade aspect after oversampling (full and general feature sets)
Grade Type                   Full    General
Final Grade                  0.4245  0.4131
Formal Language              0.3351  0.3249
Understanding the task       0.1817  0.1822
Organization of Information  0.2728  0.2679
Knowing argumentation        0.2668  0.2484
Solution proposal            0.1542  0.1430

A simple oversampling strategy is also applied, since the grade distribution is unbalanced (Figure 1). The first step of the oversampling strategy finds the class Gmax that holds the largest number of instances. The strategy then randomly selects instances from every class G ≠ Gmax and replicates these instances in the training dataset until the size of every class G ≠ Gmax equals the size of Gmax.
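The oversampling strategy can be sketched as follows (our sketch; the paper gives no implementation, and the seeded random generator is our addition for reproducibility):

```python
import random

def oversample(instances, labels, seed=0):
    """Random oversampling: replicate randomly chosen instances of every
    minority class until each class is as large as the largest one (Gmax)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    max_size = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(xs)
        out_y.extend([y] * len(xs))
        # Replicate class y until it matches the size of Gmax.
        for _ in range(max_size - len(xs)):
            out_x.append(rng.choice(xs))
            out_y.append(y)
    return out_x, out_y
```

Only the training split should be oversampled; replicating instances before cross-validation would leak copies of the same essay into the test folds.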
Table 4 describes the results, using quadratic weighted kappa, before oversampling. We executed cross-validation five times and computed the average kappa over all experiments, for each aspect and for the final grade, to evaluate the oversampling performance. Results after oversampling are described in Table 5.

Considering the lack of tools for processing the Portuguese language, and the limited performance of the few existing tools, the multi-aspect classification performed satisfactorily. However, some aspects performed poorly, probably due to the subjectivity intrinsic to these aspects, which objective variables probably cannot fully capture.

3.2 Feature Analysis

Besides the kappa results, we also performed an experiment that investigates the impact of each feature on each aspect and on the final grade. The experiments were performed by removing each feature and measuring the resulting kappa. If removing a feature f lowers the resulting kappa, then that feature is relevant to the model of that aspect. According to this criterion, the lower the resulting kappa when f is removed from the training model, the more important f is for this model. Table 6 therefore describes the three features that most diminished the kappa value and the three features that most increased it. The full-feature-set rows in the table present the results with the full set of features described earlier.

It is possible to observe that the most relevant features for the final grade are not necessarily a mix of the relevant features of the aspects. For instance, vocabulary level is one of the three most important features for the final grade, but not for the aspects. On the other hand, vocabulary level is not an irrelevant feature for the aspects. To better understand the role of vocabulary level, we computed the average vocabulary level per final grade in our dataset and, as expected, the higher the grade, the higher the number of different words in the essays. Besides vocabulary level, lexical complexity seems to play a significant role in the final grade, since three of the most important features for final grade prediction belong to it.

The aspect Understanding the task presented the lowest kappa value among the aspects. However, we can draw some conclusions from Table 6. For instance, the Flesch score affected the kappa value expressively. We also observe that the current features are not enough for the Understanding the task model; we will therefore implement new features related to this aspect.

Organization of information resulted in the second highest kappa value among the aspects. As similarity to the prompt was the most relevant feature, we believe that similarity between semantic vectors, as proposed by Zesch et al. (2015), can also improve Organization of information prediction. Another observation is the influence of style errors upon the Organization of information aspect. Perhaps this influence exists because the definition of style we used is related to how the writer "presents" the information, which can involve redundancies or nonexistent language expressions.

Regarding the Knowing argumentation aspect, we believe that style errors affected the results for a reason similar to the one mentioned in the analysis of the Organization of information aspect. However, for this aspect we think that perhaps some argument


Figure 1: Distribution of grades in the UOL dataset for each aspect and final grade


features (Stab and Gurevych, 2014; Song et al., 2015) will improve Knowing argumentation score prediction.

4 Conclusion

We proposed a multi-aspect automatic essay scoring system for Brazilian Portuguese. Our primary goal was to evaluate whether classical features from AES systems for the English language perform well in a multi-aspect scenario, and to assess which features are important for which aspect. Indeed, after the experiments, some features performed well for some aspects. Nonetheless, each aspect performed in a different way, which suggests that each aspect needs its own suitable model. Also, more specific features will probably enhance the subjective aspects.

Academic level, represented by the Flesch score, is extremely relevant in most aspects. A probable reason for these results is that a high school student should present advanced skills, such as grammar, spelling, and argumentation, among others. Despite this common feature, each aspect exhibits its singularities: enclisis affects Understanding the task, similarity with the prompt influences Organization of information, and discourse markers change Solution proposal. Therefore, while some of the features enhance results for some aspects, these same features harm prediction for other aspects.

5 Future Directions

The following issues are directions we aim to pursue in our further research.

Analysis of evaluators' comments. Our dataset comprises human evaluators' comments. We intend to analyze these comments, which is of particular importance for argumentative essays, since the opinion of human evaluators about a topic can affect grades. In a sample of 48 essays taken from our dataset, two linguists detected that 11 essays presented biased evaluations. Biased evaluation is an even more serious issue if we consider ENEM and other tests that are a relevant factor for many students. Some works have addressed biased language, but none of them analyzed bias in evaluations. We can also apply the same reasoning to other types of evaluation, like the peer review of papers. Besides that, we would like to research how we can minimize bias in automatic score prediction.

Composite classifier. A classifier that predicts the final grade from the predictions of the five aspects is a natural next step in our research.

Adding new features to Brazilian Portuguese AES. There are more features to add to Brazilian Portuguese AES, such as: POS-tagging ratios; word length in characters; number of commas, quotation marks, or exclamation marks; average sentence length; average depth of syntactic trees; and topical overlap between adjacent sentences. Also, cohesion features like those proposed by Song et al. (2015) can improve aspects like Solution proposal, which probably demands sophisticated features.

References

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2. The Journal of Technology, Learning and Assessment, 4(3).

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In EMNLP, pages 1741–1752.

Time de desenvolvimento CoGrOO. 2012. CoGrOO: Corretor Gramatical acoplável ao LibreOffice e Apache OpenOffice. CCSL IME/USP, São Paulo, Brasil.

Yllias Chali and Sadid A. Hasan. 2012. Automatically assessing free texts. In 24th International Conference on Computational Linguistics, page 9. Citeseer.

Clélia Cândida Abreu Spinardi Jubran and Ingedore Grunfeld Villaça Koch. 2006. Gramática do português culto falado no Brasil: construção do texto falado, volume 1. UNICAMP.

Tuomo Kakkonen and Erkki Sutinen. 2004. Automatic assessment of the content of essays based on course materials. In Information Technology: Research and Education (ITRE 2004), pages 126–130. IEEE.

Leah S. Larkey. 1998. Automatic essay grading using text categorization techniques.

Lung-Hao Lee, Liang-Chih Yu, Kuei-Ching Lee, Yuen-Hsien Tseng, Li-Ping Chang, and Hsin-Hsi Chen. 2014. A sentence judgment system for grammatical error detection. In COLING (Demos), pages 67–70.

Teresa B. F. Martins, Claudete M. Ghiraldelo, Maria das Graças Volpe Nunes, and Osvaldo Novais de Oliveira Junior. 1996. Readability formulas applied to textbooks in Brazilian Portuguese. ICMSC-USP.

E. Martins. 2000. Manual de redação e estilo. O Estado de São Paulo.


Courtney Napoles and Chris Callison-Burch. 2015. Automatically scoring freshman writing: A preliminary investigation. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 254–263.

Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of ACL.

Alla Rozovskaya and Dan Roth. 2014. Building a state-of-the-art grammatical error correction system. Transactions of the Association for Computational Linguistics, 2:419–434.

Wei Song, Ruiji Fu, Lizhen Liu, and Ting Liu. 2015. Discourse element identification in student essays based on global and local cohesion. In EMNLP, pages 2255–2261.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In EMNLP, pages 46–56.

Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-independent features for automated essay grading. In Proceedings of the Building Educational Applications Workshop at NAACL, pages 224–232.


Table 6: Kappa results for the feature analysis

Final Grade
  Most relevant features:   Average word length                   0.3890
                            Flesch score                          0.4010
                            Vocabulary level                      0.4059
  Least relevant features:  Discourse markers per #sentences      0.4259
                            Count of first person                 0.4262
                            Count of first person per #sentences  0.4320
  Full feature set:                                               0.4245

Understanding the Task
  Most relevant features:   Flesch score                          0.1452
                            #enclises / #sentences                0.1655
                            #spelling errors                      0.1655
  Least relevant features:  #grammar errors                       0.1868
                            #style errors / #sentences            0.1878
                            #first person use / #sentences        0.1885
  Full feature set:                                               0.1817

Organization of Information
  Most relevant features:   Similarity with prompt                0.2496
                            Average word length                   0.2581
                            #style errors / #sentences            0.2605
  Least relevant features:  #long sentences                       0.2788
                            #demonstrative pronouns / #sentences  0.2799
                            #first person use / #sentences        0.2817
  Full feature set:                                               0.2728

Knowing Argumentation
  Most relevant features:   #spelling errors / #tokens            0.2438
                            #style errors / #sentences            0.2441
                            Flesch score                          0.2456
  Least relevant features:  #enclises / #sentences                0.2773
                            Average word length                   0.2784
                            #grammar errors                       0.2849
  Full feature set:                                               0.2668

Solution Proposal
  Most relevant features:   Average word length                   0.1048
                            Flesch score                          0.1192
                            #discourse markers                    0.1240
  Least relevant features:  #grammar errors / #tokens             0.1586
                            #tokens                               0.1593
                            #first person use                     0.1655
  Full feature set:                                               0.1655

Formal Language
  Most relevant features:   Flesch score                          0.3060
                            #grammar errors / #tokens             0.3138
                            #spelling mistakes                    0.3248
  Least relevant features:  #long sentences                       0.3396
                            #discourse markers                    0.3396
                            #demonstrative pronouns               0.3429
  Full feature set:                                               0.3351
