Abstract

Assessment of how difficult it is for the reader to understand a text has mostly been performed at the passage level. There has been recent interest in sentence-level assessment, given its applications in downstream tasks such as text simplification and language exercise generation. While neural approaches have been shown to improve assessment accuracy at the passage level, they have yet to be applied at the sentence level. This paper evaluates traditional and neural models on readability assessment at the sentence level, and shows that integrating linguistic features can improve model performance. On a Chinese dataset of 9.8K sentences ranging from Grade 3 to 12, we found that neural models obtained stronger performance than traditional classifiers.

1 Introduction

Text readability is defined as the cognitive load of a reader to comprehend a given text (Martinc et al., 2021). [...] a substantial drop in performance (Kilgarriff et al., 2008; Pilán et al., 2016). It is also useful for many downstream tasks in natural language processing (NLP). It is essential to generation tasks that are sensitive to language difficulty, such as pedagogical material and exercises (Pilán et al., 2014). It also facilitates explainable text simplification (Gârbacea et al., 2021) by identifying which sentences require simplification.

Neural approaches have been shown to excel in a large variety of NLP tasks, including readability assessment (Martinc et al., 2021; Lee et al., 2021). There has been less research effort, however, on their application at the sentence level. To the best of our knowledge, there is only one reported evaluation of sentence-level neural readability assessment, but it did not attempt to integrate hand-crafted features (Schicchi et al., 2020).

This paper evaluates a neural model on readability assessment at the sentence level, and shows that its performance can be further improved by integrating linguistic features. On a dataset of 9.8K Chinese sentences ranging from Grade 3 to 12, we obtained the strongest performance with a model that combines BERT and hand-crafted features (Kaas et al., 2020). The rest of the paper is organized as follows. After a review of previous work (Section 2), we present our dataset (Section 3) and approach (Section 4), and then discuss experimental results (Section 5).

2 Previous Work

[...]

Readability formulas (Kincaid et al., 1975) and traditional approaches for readability assessment have mostly relied on one-hot linguistic features and language models (Collins-Thompson, 2008; Sung et al., 2015). More recent studies have shown that neural approaches can improve assessment performance (Azpiazu and Pera, 2019; Martinc et al., 2021). An active area of research is to investigate how to incorporate linguistic features into neural models. On passage-level assessment, some studies observed no effect (Deutsch et al., 2020) or only marginal improvement (Filighera et al., 2019) from linguistic features, while others reported significant improvement, e.g. by combining Random Forest and RoBERTa (Lee et al., 2021). However, there has not yet been any study on combining linguistic features and neural approaches at the sentence level.
3 Dataset

[...] readability are built with manually simplified sentences, drawn for example from Wikipedia and Simple Wikipedia (Vajjala and Meurers, 2014; Štajner et al., 2017). Due to the cost of manual simplification, such datasets are not widely available for many languages. In contrast, graded passages, for example those in textbooks, are commonly available in many languages.

A possible methodology is to assign the grade of the passage to all sentences in that passage (Pilán et al., 2014).¹ Substantial noise could be introduced to both the training process and evaluation results, [...]

¹ This dataset is unfortunately not publicly available.

[...]

In-grade The sentence has the expected difficulty given the grade of the passage;

Below-grade The sentence is easier than the in-grade sentences;

Above-grade The sentence is harder than the in-grade sentences.

Table 3 (Section B) shows three example sentences taken from a Grade 3 passage. The top sentence was judged to be typical, with multiple subjects from tangmu 'Thomas' to hua 'speech'
and shengyin 'voice'. The middle sentence has a simpler structure, with only the subject shaonian 'youth'. The bottom sentence is more difficult to read due to the complex modifiers for the noun xinren 'trust' and for the noun xiyue 'joy'.

On average, 7.90% of the sentences are labeled as above-grade, and 29.07% below-grade. The number of below-grade sentences exceeds the number of above-grade sentences in all grades with the exception of Grade 3. As shown in Figure 1 (Section B), in general, the higher the grade, the smaller the proportion of above-grade sentences. The proportion of below-grade sentences varies from 41.26% to 12.17%. Statistics on our corpus can be found in Table 2 (Section B).

To create a balanced Chinese dataset for training and evaluation, we selected passages from each grade such that there are about 1,000 sentences representing each grade, from Grade 3 through 12.² As shown in Table 2, this dataset contains 9,803 sentences; its grade distribution is also given there. The agreement percentage between the two annotators is 82.72%. They achieved a Cohen's Kappa of 0.61, corresponding to a substantial level of agreement (Landis and Koch, 1977). They reconciled the label differences through discussion to reach the final annotation.

² All data and code will be publicly released for research purposes upon publication of this paper.
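The agreement statistics of this kind can be reproduced with a short standard-library sketch (the labels below are toy values in {-1, 0, +1} as in Appendix A, not the actual annotations; scikit-learn's `cohen_kappa_score` computes the same quantity):

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's Kappa: agreement corrected for chance."""
    po = percent_agreement(a, b)                  # observed agreement
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n) # expected chance agreement
    return (po - pe) / (1 - pe)

# Toy annotations: 0 = expected difficulty, +1 = harder, -1 = easier.
ann1 = [0, 0, 1, -1, 0, 1]
ann2 = [0, 0, 1,  0, 0, 1]
```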
4 Approach

[...]

4.1 Baseline classifiers

We trained classifiers with Logistic Regression (LR), Random Forests (RF) and Xgboost (XGB), using the implementation of scikit-learn (Pedregosa et al., 2011). [...]
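A rough sketch of this baseline setup, with synthetic data standing in for the hand-crafted feature matrix of Appendix C (XGBoost is omitted here, but its `XGBClassifier` exposes the same scikit-learn fit/predict interface):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix: one row per sentence,
# one column per linguistic feature (Appendix C lists 33 of them).
rng = np.random.default_rng(0)
n_per_class, n_features = 50, 33
grades = np.repeat([3, 4, 5], n_per_class)        # toy three-grade subset
X = rng.normal(loc=grades[:, None] * 2.0, scale=0.5,
               size=(len(grades), n_features))    # grade-correlated features

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Fit each classifier and record its training accuracy.
scores = {name: m.fit(X, grades).score(X, grades) for name, m in models.items()}
```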
4.2 Neural classifier

BERT (Devlin et al., 2019) has achieved state-of-the-art performance in many natural language processing tasks. We used the pre-trained model bert-base-chinese, which contains 12 layers, 768 hidden units and 12 heads, for the Chinese dataset. We fine-tuned the model on our datasets (Section 3) into a classifier for sentence-level readability assessment.³ This approach is similar to the one taken by Tseng et al. (2019) for passage-level readability assessment.

³ We use the Adam algorithm (Kingma and Ba, 2015) for optimization. The number of epochs for each training run is 10, and the maximum input length is set to 128.
4.3 Integration of linguistic features

XGB+BERT. Following Deutsch et al. (2020) and Lee et al. (2021), we obtained the grade predicted by BERT (Section 4.2) and added it as an additional feature to Xgboost (XGB) (Section 4.1), which turned out to be the best-performing baseline classifier for the Chinese dataset (Section 5.4).
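This stacking step can be sketched as follows; a minimal illustration with synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for the xgboost library used in the paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n, n_features = 120, 33
grades = rng.integers(3, 13, size=n)       # gold grades 3..12
X = rng.normal(size=(n, n_features))       # toy hand-crafted feature matrix

# Simulate BERT's predicted grade: correct ~70% of the time.
bert_pred = grades.copy()
wrong = rng.random(n) < 0.3
bert_pred[wrong] = rng.integers(3, 13, size=wrong.sum())

# Append BERT's predicted grade as one extra feature column.
X_aug = np.hstack([X, bert_pred[:, None]])

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_aug, grades)
```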
Augmented BERT. We used the modified BERT model architecture of Kaas et al. (2020), which was shown to be effective for propaganda detection in news articles. In the first component, the linguistic features (Section 4.1) are fed into a linear layer, and then passed to a dense layer and a ReLU layer, with dropout. In the second component, the outputs of the BERT hidden layer are passed to a dense layer and a ReLU layer, also with dropout. Finally, the two vectors are concatenated and passed to a fully-connected layer and a softmax layer to predict the final sentence readability level.
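A framework-agnostic sketch of the forward pass of this two-branch architecture, in NumPy rather than an actual BERT: the hidden width, weight scales and bias-free layers are illustrative assumptions, and dropout is omitted since it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy dimensions: 33 features (Appendix C), 768-d BERT output, 10 grades.
n_feat, d_bert, d_hidden, n_classes = 33, 768, 64, 10

W_lin = rng.normal(scale=0.1, size=(n_feat, d_hidden))   # feature branch: linear
W_d1  = rng.normal(scale=0.1, size=(d_hidden, d_hidden)) # feature branch: dense
W_d2  = rng.normal(scale=0.1, size=(d_bert, d_hidden))   # BERT branch: dense
W_out = rng.normal(scale=0.1, size=(2 * d_hidden, n_classes))

def forward(features, bert_vec):
    h1 = relu((features @ W_lin) @ W_d1)    # linguistic-feature branch
    h2 = relu(bert_vec @ W_d2)              # BERT branch
    h = np.concatenate([h1, h2], axis=-1)   # fuse the two representations
    return softmax(h @ W_out)               # distribution over the 10 grades

probs = forward(rng.normal(size=(4, n_feat)), rng.normal(size=(4, d_bert)))
```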
5 Experiments

5.1 Training

[...]

Nil Assign the grade of the passage to all sentences in that passage in the training set. This method avoids the need to annotate at the sentence level in return for some noise in the training labels;

All [...] Table 3, the below-grade sentence is considered Grade 2 and the above-grade sentence Grade 4. This method is designed to reduce the noise in the training labels;

Exp Use in-grade sentences only, and assign the grade of the passage to these sentences. This method results in a smaller training data size but removes noise in the training labels.
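Assuming each sentence record carries its passage grade and its annotation in {-1, 0, +1} (the coding of Appendix A), the training annotation schemes can be sketched as below; the field names are hypothetical, and the adjusted scheme is named `all_labels` here to match the "All" rows of Table 1:

```python
def nil_labels(sentences):
    """Nil: every sentence inherits its passage's grade."""
    return [(s["text"], s["grade"]) for s in sentences]

def all_labels(sentences):
    """Adjust the passage grade by the sentence annotation
    (-1 below-grade, 0 in-grade, +1 above-grade)."""
    return [(s["text"], s["grade"] + s["annotation"]) for s in sentences]

def exp_labels(sentences):
    """Exp: keep in-grade sentences only, labeled with the passage grade."""
    return [(s["text"], s["grade"]) for s in sentences if s["annotation"] == 0]

# Toy Grade 3 passage, annotated as in Appendix A.
passage = [
    {"text": "s1", "grade": 3, "annotation": 0},   # in-grade
    {"text": "s2", "grade": 3, "annotation": -1},  # below-grade -> Grade 2
    {"text": "s3", "grade": 3, "annotation": +1},  # above-grade -> Grade 4
]
```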
5.2 Testing

To ensure the correctness of the gold label, we tested only those sentences labeled as "expected" in the Chinese dataset, just as we tested only sentences aligned with the grade level in the English dataset. We used stratified ten-fold cross validation in all experiments on the Chinese dataset, with 8 folds as the training set, 1 fold as the dev set and 1 fold as the test set. The hyperparameters (learning rate, dropout and batch size) were tuned on the dev set; the best values were approximately a learning rate of 1e-5, a dropout of 0.2 and a batch size of 64.

All sentences from the same text are placed in the same fold, so that the entities and topics mentioned in the test sentences would not be seen during training.
5.3 Metrics

We report three metrics: 10-way classification accuracy on predicting the grade of the sentence as annotated in the corpus; adjacency accuracy, i.e. allowing the prediction to deviate by one grade from the gold; and 5-way accuracy, obtained by merging Grades 3-4, Grades 5-6, Grades 7-8, Grades 9-10 and Grades 11-12 into five difficulty levels.
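The two non-standard metrics can be computed in a few lines; a minimal sketch:

```python
def adjacency_accuracy(gold, pred):
    """Fraction of predictions within one grade of the gold label."""
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)

def five_way_accuracy(gold, pred):
    """Merge Grades 3-4, 5-6, 7-8, 9-10 and 11-12 into five levels."""
    level = lambda g: (g - 3) // 2   # Grade 3..12 -> level 0..4
    return sum(level(g) == level(p) for g, p in zip(gold, pred)) / len(gold)

# Toy gold grades and predictions for five sentences.
gold = [3, 5, 7, 9, 12]
pred = [4, 5, 9, 8, 12]
```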
5.4 Results

Neural vs. statistical classifiers. As shown in Table 1, among the baseline classifiers, XGB yielded the highest accuracy on all metrics, 10-way, adjacency and 5-way. [...] but slightly worse in terms of adjacency accuracy. The XGB+BERT model outperforms the BERT model only slightly for 10-way classification, and achieved the best result in both adjacency classification (61.42%) and 5-way classification (49.33%). These results suggest that it is beneficial to combine the insights from linguistic features and neural networks in predicting sentence readability.

Model      Annotation   10-way Acc.   Adj. Acc.   5-way Acc.
Majority   n/a          10.69         31.18       20.84
LR         All          21.37         49.40       42.82
           Nil          21.61         46.32       37.24
           Exp          20.57         47.26       38.31
RF         All          26.31         50.42       38.75
           Nil          23.26         37.58       38.44
           Exp          27.73         48.02       38.41
XGB        All          28.56         55.83       41.42
           Nil          33.03         56.04       46.59
           Exp          31.59         55.73       44.71
BERT       All          33.08         58.38       47.27
           Nil          37.45         56.33       46.95
           Exp          36.74         55.74       46.85
XGB+BERT   All          34.97         60.51       49.33
           Nil          37.97         61.42       48.77
           Exp          36.92         60.02       49.08
Aug. BERT  All          31.79         55.07       45.92
           Nil          38.57         54.66       48.02
           Exp          35.66         53.39       45.27

Table 1: Readability assessment performance for various models and annotation methods during training on the Chinese dataset, in percentages.
[...]

7 Limitation

Our experimental results should be interpreted with the following limitations in mind. First, our dataset contains only Chinese text graded on the national curriculum. The performance of the model should also be evaluated on other languages and other difficulty scales. Second, the improvement observed in our best models depends both on the efficacy of the linguistic features and on the strength of the neural model itself. As neural models continue to improve and more effective linguistic features are identified, the best methods for combining them may also need to be updated.

References

[...]

Anders Friis Kaas, Viktor Torp Thomsen, and Barbara Plank. 2020. Team DiSaster at SemEval-2020 Task 11: Combining BERT and Hand-crafted Features for Identifying Propaganda Techniques in News Media. In Proc. 14th International Workshop on Semantic Evaluation.

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. 2008. GDEX: Automatically Finding Good Dictionary Examples in a Corpus. In Proc. EURALEX.

J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas for Navy enlisted personnel. Research Branch Report 8-75. Chief of Naval Technical Training: Naval Air Station Memphis.
[...] Proc. 9th Workshop on Innovative Use of NLP for Building Educational Applications.

Daniele Schicchi, Giovanni Pilato, and Giosué Lo Bosco. 2020. Deep neural attention-based model for the evaluation of Italian sentences complexity. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 253–256. IEEE.

Elliot Schumacher, Maxine Eskenazi, Gwen Frishkoff, and Kevyn Collins-Thompson. 2016. Predicting the relative difficulty of single sentences with and without surrounding context. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1871–1881.

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proc. 15th International Workshop on Semantic Evaluation (SemEval).

Yao-Ting Sung, Ju-Ling Chen, Ji-Her Cha, Hou-Chiang Tseng, Tao-Hsing Chang, and Kuo-En Chang. 2015. Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning. Behavior Research Methods, 47:340–354.

Hou-Chiang Tseng, Hsueh-Chih Chen, Kuo-En Chang, Yao-Ting Sung, and Berlin Chen. 2019. An Innovative BERT-Based Readability Model. In Lecture Notes in Computer Science, vol. 11937.

Sowmya Vajjala and Detmar Meurers. 2014. Assessing the relative reading level of sentence pairs for text simplification. In Proc. 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 288–297.

Sanja Štajner, Simone Paolo Ponzetto, and Heiner Stuckenschmidt. 2017. Automatic Assessment of Absolute Sentence Complexity. In Proc. 26th International Joint Conference on Artificial Intelligence (IJCAI).

A Appendix: Instructions to annotators

We recruited two annotators, who agreed to work on a volunteer basis, under the project Anonymous at XXX University. The project has been approved by an institutional ethics review board. The annotators were given an Excel file with one sentence per row, and were asked to mark each sentence as 0 ("expected" level of difficulty), +1 ("harder") or -1 ("easier").

B Appendix: Distribution of corpora

Grade   # sent   # in-grade      Grade   # sent   # in-grade
3       1231     965             8       1505     939
4       1833     951             9       1523     995
5       1992     982             10      3051     1048
6       1824     997             11      1590     1001
7       2909     917             12      2394     1008

Table 2: Total number of sentences and the number of in-grade sentences at each grade.

C Appendix: List of linguistic features

Raw Sentences
1. Number of words per sentence
2. Number of characters per sentence

Lexical Features
3. Average number of characters per word per sentence
4. Number of two-character words per sentence
5. Percentage of two-character words per sentence
6. Number of three-character words per sentence
7. Percentage of three-character words per sentence
8. Number of four-character words per sentence
9. Percentage of four-character words per sentence
10. Number of five-up-character words per sentence
11. Percentage of five-up-character words per sentence
12. Percentage of level1 tokens per sentence
13. Percentage of level2 tokens per sentence
14. Percentage of level3 tokens per sentence
15. Percentage of level4 tokens per sentence
16. Percentage of level5 tokens per sentence
17. Percentage of level6 tokens per sentence
18. Percentage of level7 tokens per sentence

Syntactic Features
19. Percentage of adjectives per sentence
20. Number of adjectives per sentence
21. Percentage of verbs per sentence
22. Number of verbs per sentence
23. Percentage of nouns per sentence
24. Number of nouns per sentence
25. Percentage of adverbs per sentence
26. Number of adverbs per sentence
27. Total number of words tagged as conjunctions
28. Percentage of conjunctions per sentence
29. Total number of words tagged as pronouns
30. Percentage of pronouns per sentence
31. Total number of prepositions per sentence
32. Percentage of prepositions per sentence

Discourse Features
33. Height of parse tree per sentence
Sentence                                                                          Difficulty

汤姆开头有点吞吞吐吐,渐渐地,话越来越多,声音也越来越大,越来越自然了。                          in-grade
'Thomas stuttered at first, but gradually his speech got longer, his voice grew louder and more natural.'

少年连连摆手,用不太标准的中国话说:"不,不要钱。"                                           below-grade
'The youth waved his hands and said with an accent, "No, I don't want money." '

在那里,我们得到的是人与人之间的信任和被信任的喜悦。                                        above-grade
'There, what we get is the trust between people and the joy of being trusted.'

Table 3: Example sentences taken from a Grade 3 passage, annotated with their difficulty relative to Grade 3.