A Multi-aspect Analysis of Automatic Essay Scoring for Brazilian Portuguese

Anonymous EACL submission
Abstract

Several methods for automatic essay scoring (AES) have been proposed for the English language; however, multi-aspect AES systems for other languages are rare. We therefore propose a multi-aspect AES system and apply it to a dataset of Brazilian Portuguese essays, which human experts evaluated according to the five aspects defined by the Brazilian government for the National Exam of High School (ENEM). These aspects are skills that students must master, and every skill is assessed separately from the others. Besides prediction, we also performed a feature analysis for each aspect. The proposed AES system employs several features already used by AES systems for the English language. Our results show that predictions for some aspects perform well with the features we employed, while predictions for other aspects perform poorly. The detailed feature analysis we performed also makes the differences between the five aspects visible. Beyond these contributions, the roughly eight million enrollments in ENEM every year raise challenging issues for future directions of our research.

1 Introduction

The goal of automatic essay scoring (AES) systems is to score a given essay. AES systems are relevant for educational institutions, since the human effort needed to evaluate essays is high and students need feedback to improve their writing skills. In addition, almost every senior high school student in Brazil must write an essay for the National Exam of High School (ENEM), which the Brazilian government uses to evaluate the quality of high school teaching.

Although thousands of essays are written for ENEM every year, to the best of our knowledge there is no AES system for the Brazilian Portuguese (BP) language, nor any analysis of the features of a multi-aspect essay scoring system for BP. Each aspect is a skill that a student must master as a senior high school student. In contrast, several AES systems have been proposed for the English language. Attali and Burstein (2006) proposed an AES system, called e-rater, that employs general features of argumentative essays for score prediction. The main features used by e-rater are grouped into the following types: grammar, usage, mechanics, and style; organization and development; lexical complexity; and prompt-specific vocabulary usage. e-rater employs multiple regression and is a task-independent AES system, i.e., its score does not depend on the given prompt.

Napoles and Callison-Burch (2015) apply linear regression in an AES system that aims to assign more uniform grades than multiple human evaluators do. Similar to our task, they address multi-aspect classification with five grading categories. However, the authors leave unexplained how each of their aspects is affected by their features. We consider such an analysis an important contribution, since students and teachers can use features as feedback to better understand essay writing. Moreover, Napoles and Callison-Burch assume that more than one evaluator is available to train their model, which is not always the case in the real world.

Larkey (1998) proposed three models based on text classification that score essays with linear regression.
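The feature groups surveyed above (e.g., e-rater's lexical complexity and prompt-specific vocabulary usage) reduce, in their simplest form, to shallow text statistics. The sketch below uses illustrative proxies of our own, not e-rater's actual feature set:

```python
import re

def shallow_features(essay: str, prompt: str) -> dict:
    """Illustrative proxies for two e-rater-style feature groups:
    lexical complexity (average word length, type-token ratio) and
    prompt-specific vocabulary (overlap with the prompt's words).
    These are simplified stand-ins, not e-rater's actual features."""
    words = re.findall(r"\w+", essay.lower())
    prompt_vocab = set(re.findall(r"\w+", prompt.lower()))
    n = len(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "prompt_overlap": sum(w in prompt_vocab for w in words) / n,
    }

feats = shallow_features(
    "A escola publica precisa de mais investimento em educacao.",
    "Investimento em educacao publica",
)
```

A regression model (multiple regression, in e-rater's case) would then map such feature vectors to a score.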
EACL 2017 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
However, Larkey's strategy is task-dependent. Chen and He (2013) also grouped features into four main types: lexical features; syntactical features; grammar and fluency features; and content and prompt-specific features. The authors then proposed a rank-based algorithm that maximizes the agreement between the human score and the machine score.

Zesch et al. (2015) developed a domain adaptation technique for an AES system, testing their method on an English dataset and a German dataset. Chali and Hasan (2012) proposed an LSA-based method that is task-dependent and whose goal is to establish a strategy for understanding the inner meaning of texts. Beyond the English language, Kakkonen and Sutinen (2004) developed an AES system for the Finnish language, also based on the LSA algorithm.

Besides assigning a grade, other research has analyzed the argumentation strength of essays (Persing and Ng, 2015), the discourse structure of essays (Song et al., 2015; Stab and Gurevych, 2014), and grammar correction in general (Rozovskaya and Roth, 2014; Lee et al., 2014).

Our research differs from previous work in that we aim to answer the following questions:

1. How do objective features behave in a multi-aspect automatic essay scoring system?

2. Which features are more relevant for each aspect?

Besides that, we aim to pose some interesting questions for future research. Our essays carry not only grades but also evaluators' comments about the aspects considered in ENEM. While exploring these comments, we observed bias in some evaluations. We speak of bias when human evaluators seem to agree or disagree with the student's point of view, which can improperly influence the student's grade. The possibility of bias in human evaluations raises some questions:

1. Are some essay topics more prone to result in biased evaluation?

2. Is it possible to detect whether a human evaluator is biased for or against a given student's point of view?

3. If it is possible to detect the bias of a human evaluator, is it feasible to measure its quantitative effect on grades?

4. Is there any difference between the words of biased evaluations that agree with the student's point of view and those of biased evaluations that disagree with it?

In a nutshell, there are many possible questions, and as far as we know these questions remain unanswered.

The paper is organized as follows. The second section details our dataset and the features we use. The third section explains the experiments we performed and their results. The fourth section presents the main remarks about our research, and the fifth section points to future directions.

2 Methodology

We propose a methodology that, besides the usual features employed by popular AES methodologies (Attali and Burstein, 2006; Chen and He, 2013; Zesch et al., 2015), also takes advantage of domain features. To test the proposed features, we used a dataset of 1840 essays. The next sections describe our dataset and our features.

2.1 Dataset

Our dataset is composed of 1840 essays on 96 topics, crawled from the UOL Essay Database website (http://educacao.uol.com.br/bancoderedacoes). The average length is 300.51 words; the longest essay has 1293 words, and the shortest has 49 words. Each essay is evaluated according to the following five aspects:

1. Formal language: mastery of the formal Portuguese language.

2. Understanding the task: understanding the essay prompt and applying concepts from different fields of knowledge in order to develop the theme in an argumentative dissertation format.

3. Organization of information: selecting, connecting, organizing, and interpreting information, facts, opinions, and arguments to advocate a point of view.
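The corpus statistics above (essay count, average length, longest and shortest essay) can be recomputed with a short script. The input below is a toy stand-in; the real loader for the crawled UOL data is not shown in the paper:

```python
def length_stats(essays):
    """Word-length statistics of the kind reported for the UOL corpus:
    essay count, average length, and the longest/shortest essay."""
    lengths = [len(text.split()) for text in essays]
    return {
        "n_essays": len(lengths),
        "avg_words": sum(lengths) / len(lengths),
        "max_words": max(lengths),
        "min_words": min(lengths),
    }

# Toy input standing in for the 1840 crawled essays.
stats = length_stats(["one two three", "a b", "v w x y z"])
```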
[Figure 1: Distribution of grades in the UOL dataset for each aspect and for the final grade.]
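The Flesch score, which recurs as a strong feature in the analysis of Table 6, is in its classic English form the following readability index. This sketch uses a rough vowel-group heuristic for syllables; Brazilian Portuguese properly requires the adapted formula of Martins et al. (1996) and a real syllabifier:

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Classic (English) Flesch reading-ease index:
    206.835 - 1.015 * words/sentences - 84.6 * syllables/words.
    Syllables are approximated by counting vowel groups, which is
    only a rough heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Higher values mean easier text; essays showing a more academic register score lower on this index.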
... features (Stab and Gurevych, 2014; Song et al., 2015) will improve scoring prediction for Knowing Argumentation.

4 Conclusion

We proposed a multi-aspect automatic essay scoring system for Brazilian Portuguese. Our primary goal was to evaluate whether classical features from AES systems for the English language perform well in a multi-aspect scenario, and to assess which features are important for which aspect. After the experiments, some features indeed performed well for some aspects. Nonetheless, each aspect behaved differently, which suggests that each aspect needs its own suitable model. More specific features will also probably enhance the subjective aspects.

Academic level, represented by the Flesch score, is extremely relevant for most aspects. A probable reason is that a high school student is expected to show advanced skills such as grammar, spelling, and argumentation, among others. Despite this common feature, each aspect exhibits its own singularities: enclisis affects Understanding the Task, similarity with the prompt influences Organization of Information, and discourse markers change Solution Proposal. Therefore, while some features enhance results for some aspects, the same features harm prediction for others.

5 Future Directions

The following issues are directions we aim to pursue in further research.

Analysis of evaluators' comments. Our dataset comprises human evaluators' comments. We intend to analyze these comments, which is of particular importance for argumentative essays, since the opinion of human evaluators about a topic can affect grades. In a sample of 48 essays taken from our dataset, two linguists detected that 11 essays had received a biased evaluation. Biased evaluation is an even more serious issue when we consider ENEM and other tests that are a decisive factor for many students. Some work has been done on biased language, but none of it analyzed bias in evaluations. The same reasoning also applies to other types of evaluation, such as the peer review of papers. Besides that, we would like to investigate how to minimize bias in automatic scoring prediction.

Composite classifier. A classifier that predicts final grades from the predictions of the five aspects is a natural next step in our research.

Adding new features to Brazilian Portuguese AES. There are more features to add to Brazilian Portuguese AES, such as POS-tag ratios; word length in characters; numbers of commas, quotation marks, or exclamation marks; average sentence length; average depth of syntactic trees; and topical overlap between adjacent sentences. Cohesion features like those proposed by Song et al. (2015) may also improve aspects like Solution Proposal, which probably demands sophisticated features.

References

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2. The Journal of Technology, Learning and Assessment, 4(3).

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In EMNLP, pages 1741–1752.

Time de desenvolvimento CoGrOO. 2012. CoGrOO: Corretor Gramatical acoplável ao LibreOffice e Apache OpenOffice. CCSL IME/USP, São Paulo, Brasil.

Yllias Chali and Sadid A. Hasan. 2012. Automatically assessing free texts. In 24th International Conference on Computational Linguistics, page 9. Citeseer.

Clélia Cândida Abreu Spinardi Jubran and Ingedore Grunfeld Villaça Koch. 2006. Gramática do português culto falado no Brasil: construção do texto falado, volume 1. UNICAMP.

Tuomo Kakkonen and Erkki Sutinen. 2004. Automatic assessment of the content of essays based on course materials. In 2nd International Conference on Information Technology: Research and Education (ITRE 2004), pages 126–130. IEEE.

Leah S. Larkey. 1998. Automatic essay grading using text categorization techniques.

Lung-Hao Lee, Liang-Chih Yu, Kuei-Ching Lee, Yuen-Hsien Tseng, Li-Ping Chang, and Hsin-Hsi Chen. 2014. A sentence judgment system for grammatical error detection. In COLING (Demos), pages 67–70.

Teresa B. F. Martins, Claudete M. Ghiraldelo, Maria das Graças Volpe Nunes, and Osvaldo Novais de Oliveira Junior. 1996. Readability formulas applied to textbooks in Brazilian Portuguese. ICMSC-USP.

E. Martins. 2000. Manual de redação e estilo. O Estado de São Paulo.
Table 6: Kappa results for the feature analysis (kappa obtained after removing each feature, compared with the full feature set).

Final Grade (full feature set: 0.4245)
  Most relevant features:  average word length (0.3890); Flesch score (0.4010); vocabulary level (0.4059)
  Least relevant features: discourse markers per #sentences (0.4259); count of first person (0.4262); count of first person per #sentences (0.4320)

Understanding the Task (full feature set: 0.1817)
  Most relevant features:  Flesch score (0.1452); #enclisis / #sentences (0.1655); #spelling errors (0.1655)
  Least relevant features: #grammar errors (0.1868); #style errors / #sentences (0.1878); #first person uses / #sentences (0.1885)

Organization of Information (full feature set: 0.2728)
  Most relevant features:  similarity with prompt (0.2496); average word length (0.2581); #style errors / #sentences (0.2605)
  Least relevant features: #long sentences (0.2788); #demonstrative pronouns / #sentences (0.2799); #first person uses / #sentences (0.2817)

Knowing Argumentation (full feature set: 0.2668)
  Most relevant features:  #spelling errors / #tokens (0.2438); #style errors / #sentences (0.2441); Flesch score (0.2456)
  Least relevant features: #enclisis / #sentences (0.2773); average word length (0.2784); #grammar errors (0.2849)

Solution Proposal (full feature set: 0.1655)
  Most relevant features:  average word length (0.1048); Flesch score (0.1192); #discourse markers (0.1240)
  Least relevant features: #grammar errors / #tokens (0.1586); #tokens (0.1593); #first person uses (0.1655)

Formal Language (full feature set: 0.3351)
  Most relevant features:  Flesch score (0.3060); #grammar errors / #tokens (0.3138); #spelling mistakes (0.3248)
  Least relevant features: #long sentences (0.3396); #discourse markers (0.3396); #demonstrative pronouns (0.3429)
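The numbers in Table 6 follow a leave-one-feature-out protocol: drop one feature, rescore, and compare the resulting kappa with the full feature set. A minimal sketch of that protocol; `fit_predict` stands in for whatever scoring model is used, and the unweighted Cohen's kappa is an assumption, since the paper does not state its weighting:

```python
from collections import Counter

def cohen_kappa(human, machine):
    """Unweighted Cohen's kappa (the paper does not say whether its
    kappa is weighted, so the plainest variant is shown)."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, machine)) / n
    ch, cm = Counter(human), Counter(machine)
    expected = sum(ch[g] * cm[g] for g in set(human) | set(machine)) / n ** 2
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def leave_one_feature_out(feature_names, rows, grades, fit_predict):
    """Drop each feature in turn, rescore, and report kappa against the
    human grades, as in Table 6. `rows` is a list of feature dicts and
    `fit_predict` any routine mapping (rows, grades) to predictions."""
    kappas = {}
    for dropped in feature_names:
        kept = [f for f in feature_names if f != dropped]
        reduced = [{f: r[f] for f in kept} for r in rows]
        kappas[dropped] = cohen_kappa(grades, fit_predict(reduced, grades))
    return kappas
```

In the paper's setting, `fit_predict` would wrap the actual classifier with held-out predictions; that detail is elided here.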