
Combining Linguistic and Neural Approaches for Sentence Readability Assessment

Anonymous ACL submission

Abstract

Readability assessment, which predicts how difficult it is for a reader to understand a text, has mostly been performed at the passage level. There has been recent interest in sentence-level assessment, given its applications in downstream tasks such as text simplification and language exercise generation. While neural approaches have been shown to improve assessment accuracy at the passage level, they have yet to be applied at the sentence level. This paper evaluates traditional and neural models on readability assessment at the sentence level, and shows that integrating linguistic features can improve model performance. On a Chinese dataset of 9.8K sentences ranging from Grade 3 to 12, we found that neural models obtained stronger performance than traditional classifiers.
1 Introduction

Text readability is defined as the cognitive load of a reader to comprehend a given text (Martinc et al., 2021). Most research on readability assessment has focused on estimating the difficulty of a passage (Azpiazu and Pera, 2019; Lee et al., 2021) or an expression (Shardlow et al., 2021), and has paid less attention to that of an individual sentence (Štajner et al., 2017).

Sentence readability assessment is a task in its own right, since a naive application of text readability assessment models to sentences leads to a substantial drop in performance (Kilgarriff et al., 2008; Pilán et al., 2016). It is also useful for many downstream tasks in natural language processing (NLP). It is essential to generation tasks that are sensitive to language difficulty, such as the creation of pedagogical materials and exercises (Pilán et al., 2014). It also facilitates explainable text simplification (Gârbacea et al., 2021) by identifying which sentences require simplification.

Neural approaches have been shown to excel in a large variety of NLP tasks, including readability assessment of passages (Filighera et al., 2019; Tseng et al., 2019; Deutsch et al., 2020; Martinc et al., 2021; Lee et al., 2021). There has been less research effort, however, on their application at the sentence level. To the best of our knowledge, there is only one reported evaluation of sentence-level neural readability assessment, and it did not attempt to integrate hand-crafted features (Schicchi et al., 2020).

This paper evaluates a neural model on readability assessment at the sentence level, and shows that its performance can be further improved by integrating linguistic features. On a dataset of 9.8K Chinese sentences ranging from Grade 3 to 12, we obtained the strongest performance with a model that combines BERT and hand-crafted features (Kaas et al., 2020). The rest of the paper is organized as follows. After a review of previous work (Section 2), we present our dataset (Section 3) and approach (Section 4), and then discuss experimental results (Section 5).

2 Previous work

We summarize past research on combining linguistic features and neural approaches in readability assessment (Section 2.1), and review the state of the art in sentence-level readability assessment (Section 2.2).

2.1 Linguistic and neural approaches

Readability formulas (Kincaid et al., 1975) and traditional approaches for readability assessment have mostly relied on one-hot linguistic features and language models (Collins-Thompson, 2008; Sung et al., 2015). More recent studies have shown that neural approaches can improve assessment performance (Azpiazu and Pera, 2019; Martinc et al., 2021). An active area of research is how to incorporate linguistic features into neural models. On passage-level assessment, some studies observed no effect (Deutsch et al., 2020) or only marginal improvement (Filighera et al., 2019) from linguistic features, while others reported significant improvement, e.g. by combining Random Forest and RoBERTa (Lee et al., 2021). However, there has not yet been any study on combining linguistic features and neural approaches at the sentence level.

2.2 Sentence readability assessment

Most previous research on sentence readability assessment pursued binary classification of easy vs. difficult, or pairwise difficulty prediction (Ambati et al., 2016; Schumacher et al., 2016). An algorithm combining rule-based and statistical classifiers yielded 71% accuracy on binary classification of texts for learners of Swedish as a foreign language (Pilán et al., 2014). Statistical classifiers achieved 66% accuracy on an English dataset based on Wikipedia and Simple Wikipedia (Vajjala and Meurers, 2014), and between 78.9% and 83.7% on an Italian dataset (Dell'Orletta et al., 2014).

Automatic assessment on multiple levels of difficulty has been attempted on Chinese sentences (10 levels) and English sentences (5 levels), reaching 31.92% accuracy (Lu et al., 2020) and 0.66 weighted F-score (Štajner et al., 2017), respectively, both using traditional classifiers. To the best of our knowledge, Schicchi et al. (2020) reported the only study that applied neural models to sentence-level readability assessment, showing that an RNN-based architecture outperformed Vec2Read (Mikolov et al., 2013), but it did not incorporate any linguistic features. This paper aims to fill this gap by evaluating a neural model and the effect of incorporating linguistic features.

3 Data

3.1 Methodology

Most publicly available datasets for sentence-level readability are built with manually simplified sentences, drawn for example from Wikipedia and Simple Wikipedia (Vajjala and Meurers, 2014; Štajner et al., 2017). Due to the cost of manual simplification, such datasets are not widely available for many languages. In contrast, graded passages, for example those in textbooks, are commonly available in many languages.

A possible methodology is to assign the grade of the passage to all sentences in that passage (Pilán et al., 2014). This could introduce substantial noise into both the training process and the evaluation results, however, as a text can contain easier or more difficult sentences. Another approach is to calculate the grade of a sentence based on the grammar points in the sentence and the grade of the passage (Lu et al., 2020),[1] but this ignores other factors that may influence readability.

[1] This dataset is unfortunately not publicly available.

3.2 Annotation procedure

To address the issues discussed above, we recruited two native speakers of Mandarin to perform readability annotation on graded passages. Covering a variety of genres and topics, these passages were taken from a corpus of Chinese-language textbooks used in Mainland China (Lee et al., 2020). One annotator was a certified Chinese-language teacher with over 26 years of experience, and the other was an undergraduate student majoring in linguistics. Given the often fuzzy boundaries between grades, it can be subjective and challenging to specify the grade level of each sentence. To facilitate higher agreement, we asked the annotators to label sentences in each passage as:

In-grade: The sentence is at the expected level of difficulty given the grade of the passage;

Below-grade: The sentence is easier than the in-grade sentences;

Above-grade: The sentence is harder than the in-grade sentences.

Table 3 (Section B) shows three example sentences taken from a Grade 3 passage. The top sentence was judged to be typical, with multiple subjects from tangmu 'Thomas' to hua 'speech'
and shengyin 'voice'. The middle sentence has a simpler structure, with only the subject shaonian 'youth'. The bottom sentence is more difficult to read due to the complex modifiers for the noun xinren 'trust' and for the noun xiyue 'joy'.

On average, 7.90% of the sentences are labeled as above-grade, and 29.07% as below-grade. The number of below-grade sentences exceeds the number of above-grade sentences in all grades, with the exception of Grade 3. As shown in Figure 1 (Section B), in general, the higher the grade, the smaller the proportion of above-grade sentences. The proportion of below-grade sentences varies from 41.26% to 12.17%. Statistics on our corpus can be found in Table 2 (Section B).

To create a balanced Chinese dataset for training and evaluation, we selected passages from each grade such that there are about 1,000 sentences representing each grade, from Grade 3 through 12.[2] As shown in Table 2, this dataset contains 9,803 sentences; the per-grade distribution is also given there. The agreement percentage between the two annotators is 82.72%. They achieved a Cohen's Kappa of 0.61, corresponding to a substantial level of agreement (Landis and Koch, 1977). They reconciled the label differences through discussion to reach the final annotation.

[2] All data and code will be publicly released for research purposes upon publication of this paper.
4 Approach

We trained a number of classifiers with traditional machine learning, a neural classifier, and two modified neural architectures that combine linguistic features.

4.1 Baseline classifiers

We trained classifiers with Logistic Regression (LR), Random Forests (RF) and XGBoost (XGB), using the implementations in scikit-learn (Pedregosa et al., 2011). Similar to Lu et al. (2020), we trained these classifiers with bag-of-words features as well as a set of 41 linguistic features, covering raw-sentence, lexical, syntactic and discourse features, computed on the Chinese dataset. The complete list of linguistic features is given in Appendix C.

4.2 Neural classifier

BERT (Devlin et al., 2019) has achieved state-of-the-art performance in many natural language processing tasks. We used the pre-trained model bert-base-chinese, which contains 12 layers, 768 hidden units and 12 attention heads. We fine-tuned the model on our dataset (Section 3) into a classifier for sentence-level readability assessment.[3] This approach is similar to the one taken by Tseng et al. (2019) for passage-level readability assessment.

[3] We use the Adam algorithm (Kingma and Ba, 2015) for optimization, train for 10 epochs, and set the maximum input length to 128.

4.3 Integration of linguistic features

XGB+BERT. Following Deutsch et al. (2020) and Lee et al. (2021), we obtained the grade predicted by BERT (Section 4.2) and added it as an additional feature to XGBoost (Section 4.1), which turned out to be the best-performing baseline classifier on the Chinese dataset (Section 5.4).

Augmented BERT. We used the modified BERT architecture of Kaas et al. (2020), which was shown to be effective for identifying propaganda techniques in news articles. In the first component, the linguistic features (Section 4.1) are fed into a linear layer, followed by a dense layer and a ReLU layer with dropout. In the second component, the output of the BERT hidden layer is passed to a dense layer and a ReLU layer, also with dropout. Finally, the two resulting vectors are concatenated and passed to a fully-connected layer and a softmax layer to predict the sentence readability level. Sketches of both combination strategies are given below.
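To make the XGB+BERT combination concrete, here is a minimal sketch, not the authors' released code: it assumes the 41 linguistic features and the grades predicted by the fine-tuned BERT of Section 4.2 have already been computed, and uses the xgboost package's scikit-learn-style API. The data below is random placeholder data.

```python
# Sketch of the XGB+BERT model (Section 4.3): append the grade predicted by
# the fine-tuned BERT to the 41 hand-crafted features, then train XGBoost.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200
X_ling = rng.random((n, 41))         # 41 linguistic features (Appendix C)
bert_grade = rng.integers(0, 10, n)  # BERT-predicted grade, encoded 0-9 for Grades 3-12
y = rng.integers(0, 10, n)           # gold grade labels, same encoding

# The BERT prediction becomes one additional feature column.
X = np.hstack([X_ling, bert_grade.reshape(-1, 1).astype(float)])

clf = XGBClassifier(objective="multi:softprob", num_class=10, n_estimators=100)
clf.fit(X, y)
print(clf.predict(X[:5]))
```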
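The Augmented BERT architecture can likewise be sketched in PyTorch. This is a reconstruction from the description above, not the authors' or Kaas et al.'s code; the hidden size of 128 is an illustrative choice, and the dropout of 0.2 follows the tuned value reported in Section 5.2.

```python
# Sketch of the Augmented BERT architecture (Kaas et al., 2020) described in
# Section 4.3. The softmax is left to the cross-entropy loss during training.
import torch
import torch.nn as nn
from transformers import BertModel

class AugmentedBert(nn.Module):
    def __init__(self, n_feats=41, n_classes=10, hidden=128, dropout=0.2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        # Component 1: linguistic features -> linear -> dense + ReLU + dropout
        self.feat_branch = nn.Sequential(
            nn.Linear(n_feats, hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
        )
        # Component 2: BERT sentence representation -> dense + ReLU + dropout
        self.bert_branch = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden),
            nn.ReLU(), nn.Dropout(dropout),
        )
        # Concatenation -> fully-connected layer over the 10 grade labels
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask, feats):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output
        z = torch.cat([self.feat_branch(feats), self.bert_branch(cls)], dim=-1)
        return self.classifier(z)  # logits; apply softmax for probabilities
```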
5 Experiments

5.1 Annotation methods for training

We evaluated three annotation methods for training; the sketch after the list illustrates all three:

Nil: Assign the grade of the passage to all sentences in that passage in the training set. This method avoids the need to annotate at the sentence level, in return for some noise in the training labels.

All: An "in-grade" sentence is assigned the grade of its passage; a "below-grade" sentence is assigned one grade below the grade of its passage; and an "above-grade" sentence is assigned one grade above. For example, in Table 3, the below-grade sentence is considered Grade 2 and the above-grade sentence Grade 4. This method is designed to reduce the noise in the training labels.

Exp: Use in-grade sentences only, and assign the grade of the passage to these sentences. This method results in a smaller training set, but removes noise from the training labels.
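The three schemes reduce to a simple mapping over (sentence, passage grade, relative label) triples, where the relative label is the annotators' -1/0/+1 mark from Appendix A. A minimal sketch:

```python
# Sketch of the three training-label schemes in Section 5.1.
# Each item is (sentence, passage_grade, rel) with rel in {-1, 0, +1}
# for below-, in- and above-grade, as annotated (Appendix A).
def make_training_labels(items, scheme):
    if scheme == "Nil":   # passage grade for every sentence
        return [(s, g) for s, g, rel in items]
    if scheme == "All":   # shift by the annotated relative difficulty
        return [(s, g + rel) for s, g, rel in items]
    if scheme == "Exp":   # keep in-grade sentences only
        return [(s, g) for s, g, rel in items if rel == 0]
    raise ValueError(f"unknown scheme: {scheme}")

# A below-grade sentence from a Grade 3 passage:
items = [("不,不要钱。", 3, -1)]
print(make_training_labels(items, "All"))  # -> [('不,不要钱。', 2)]
```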
5.2 Testing

To ensure the correctness of the gold labels, we tested only those sentences labeled as in-grade, i.e., those at the expected level of difficulty for their grade. We used stratified ten-fold cross-validation in all experiments, with 8 folds as the training set, 1 fold as the dev set and 1 fold as the test set; the sketch below illustrates the fold construction. The hyperparameters learning rate, dropout and batch size were tuned on the dev set; the best values were approximately a learning rate of 1e-5, a dropout of 0.2 and a batch size of 64.

All sentences from the same text are placed in the same fold, so that the entities and topics mentioned in the test sentences are not seen during training.
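A sketch of this protocol, under the assumption that scikit-learn's StratifiedGroupKFold (available from version 1.0) matches the authors' fold construction; the data here is a random placeholder.

```python
# Sketch of the evaluation splits in Section 5.2: stratified ten-fold
# cross-validation, keeping all sentences of a text in the same fold.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
n = 1000
X = rng.random((n, 41))             # placeholder feature matrix
y = rng.integers(0, 10, n)          # grade labels
text_ids = rng.integers(0, 100, n)  # id of the passage each sentence came from

cv = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=0)
folds = [test for _, test in cv.split(X, y, groups=text_ids)]
for i in range(10):
    test_idx = folds[i]              # 1 fold for testing
    dev_idx = folds[(i + 1) % 10]    # 1 fold for hyperparameter tuning
    train_idx = np.concatenate(      # remaining 8 folds for training
        [folds[k] for k in range(10) if k not in (i, (i + 1) % 10)])
```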
5.3 Metrics

We report three metrics: 10-way classification accuracy on predicting the grade of the sentence as annotated in the corpus; adjacency accuracy, which allows the prediction to deviate by at most one grade from the gold label; and 5-way accuracy, obtained by merging Grades 3-4, Grades 5-6, Grades 7-8, Grades 9-10 and Grades 11-12 into five difficulty levels. The sketch following Table 1 illustrates how each metric is computed.

Model      Annotation   10-way Acc.   10-way Adj. Acc.   5-way Acc.
Majority   n/a          10.69         31.18              20.84
LR         All          21.37         49.40              42.82
LR         Nil          21.61         46.32              37.24
LR         Exp          20.57         47.26              38.31
RF         All          26.31         50.42              38.75
RF         Nil          23.26         37.58              38.44
RF         Exp          27.73         48.02              38.41
XGB        All          28.56         55.83              41.42
XGB        Nil          33.03         56.04              46.59
XGB        Exp          31.59         55.73              44.71
BERT       All          33.08         58.38              47.27
BERT       Nil          37.45         56.33              46.95
BERT       Exp          36.74         55.74              46.85
XGB+BERT   All          34.97         60.51              49.33
XGB+BERT   Nil          37.97         61.42              48.77
XGB+BERT   Exp          36.92         60.02              49.08
Aug. BERT  All          31.79         55.07              45.92
Aug. BERT  Nil          38.57         54.66              48.02
Aug. BERT  Exp          35.66         53.39              45.27

Table 1: Readability assessment performance (in percent) for various models and annotation methods on the Chinese dataset.
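A sketch of the three metrics, reconstructed from the definitions above with grades encoded as the integers 3-12:

```python
# Sketch of the metrics in Section 5.3; gold and pred hold grades 3-12.
import numpy as np

def ten_way_acc(gold, pred):
    return np.mean(np.asarray(gold) == np.asarray(pred))

def adjacency_acc(gold, pred):
    # the prediction may deviate from the gold grade by at most one
    return np.mean(np.abs(np.asarray(gold) - np.asarray(pred)) <= 1)

def five_way_acc(gold, pred):
    # merge Grades 3-4, 5-6, 7-8, 9-10, 11-12 into five levels
    band = lambda g: (np.asarray(g) - 3) // 2
    return np.mean(band(gold) == band(pred))

gold, pred = [3, 7, 12], [4, 7, 10]
print(ten_way_acc(gold, pred),    # 0.33: only the middle prediction is exact
      adjacency_acc(gold, pred),  # 0.67: the first is off by one grade
      five_way_acc(gold, pred))   # 0.67: 3 and 4 fall in the same band
```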
5.4 Results

Neural vs. statistical classifiers. As shown in Table 1, among the baseline classifiers, XGB yielded the highest accuracy on all metrics: 10-way accuracy (33.03%), 10-way adjacency accuracy (56.04%) and 5-way accuracy (46.59%). The BERT model substantially improved upon the performance of XGB on 10-way classification, reaching 37.45%.

Annotation method. The Nil method, i.e., assigning the passage grade to every sentence, led to the best 10-way accuracy for all models except RF. In terms of 5-way classification, however, BERT and XGB+BERT achieved their best performance with the All method. Both methods perform better than the Exp method, suggesting that a larger but noisier training set is beneficial for more coarse-grained tasks.

Integration of linguistic features. The Augmented BERT model achieved the best results on the 10-way metrics, significantly outperforming the BERT model on 10-way classification (38.57%),[4] though it was slightly worse in terms of adjacency accuracy. The XGB+BERT model outperforms the BERT model only slightly on 10-way classification, but achieved the best results in both adjacency accuracy (61.42%) and 5-way classification (49.33%).[5] These results suggest that it is beneficial to combine the insights from linguistic features and neural networks in predicting sentence readability.

[4] At p < 7.1e-2 for 10-way accuracy and p < 7.0e-3 for 10-way adjacency accuracy, according to McNemar's test.
[5] At p < 6.7e-16 for 10-way adjacency accuracy and p < 3.4e-4 for 5-way accuracy, according to McNemar's test.

6 Conclusion

We have presented the first study on integrating linguistic features into a neural model for sentence-level readability assessment. Our contribution is two-fold. First, we contribute a dataset of over 19K Chinese sentences annotated with their difficulty at the sentence level. Second, we evaluated an augmented BERT model that integrates linguistic features, and demonstrated that it outperforms the simple integrated models used in previous studies.
7 Limitations

Our experimental results should be interpreted with the following limitations in mind. First, our dataset contains only Chinese text graded on the national curriculum. The performance of the model should also be evaluated on other languages and other difficulty scales. Second, the improvement observed in our best models depends both on the efficacy of the linguistic features and on the strength of the neural model itself. As neural models continue to improve and effective linguistic features are identified, the best methods for combining them may also need to be updated.

References

Bharat Ram Ambati, Siva Reddy, and Mark Steedman. 2016. Assessing Relative Sentence Complexity using an Incremental CCG Parser. In Proceedings of NAACL-HLT 2016.

Ion Madrazo Azpiazu and Maria Soledad Pera. 2019. Multiattentive recurrent neural network architecture for multilingual readability assessment. Transactions of the Association for Computational Linguistics, 7:421-436.

Kevyn Collins-Thompson. 2008. Computational assessment of text readability: A survey of current and future research. International Journal of Applied Linguistics, 165(2):97-135.

Felice Dell'Orletta, Martijn Wieling, Andrea Cimino, Giulia Venturi, and Simonetta Montemagni. 2014. Assessing the Readability of Sentences: Which Corpora and Features? In Proc. 9th Workshop on Innovative Use of NLP for Building Educational Applications.

Tovly Deutsch, Masoud Jasbi, and Stuart Shieber. 2020. Linguistic Features for Readability Assessment. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Anna Filighera, Tim Steuer, and Christoph Rensing. 2019. Automatic text difficulty estimation using embeddings and neural networks. In European Conference on Technology Enhanced Learning, pages 335-348. Springer.

Cristina Gârbacea, Mengtian Guo, Samuel Carton, and Qiaozhu Mei. 2021. Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification. In Proc. 59th Annual Meeting of the Association for Computational Linguistics (ACL).

Anders Friis Kaas, Viktor Torp Thomsen, and Barbara Plank. 2020. Team DiSaster at SemEval-2020 Task 11: Combining BERT and Hand-crafted Features for Identifying Propaganda Techniques in News Media. In Proc. 14th International Workshop on Semantic Evaluation.

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. 2008. GDEX: Automatically Finding Good Dictionary Examples in a Corpus. In Proc. EURALEX.

J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas for Navy enlisted personnel. In Research Branch Report 8-75. Chief of Naval Technical Training: Naval Air Station Memphis.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. 3rd International Conference on Learning Representations, San Diego.

J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics, 33:159-174.

Bruce W. Lee, Yoo Sung Jang, and Jason Hyung-Jong Lee. 2021. Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

John Lee, Meichun Liu, and Tianyuan Cai. 2020. Using Verb Frames for Text Difficulty Assessment. In Proc. International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet.

Dawei Lu, Xinying Qiu, and Yi Cai. 2020. Sentence-level readability assessment for L2 Chinese learning. CLSW 2019, LNAI, 11831:381-392.

Matej Martinc, Senja Pollak, and Marko Robnik-Šikonja. 2021. Supervised and Unsupervised Neural Approaches to Text Readability. Computational Linguistics, 47(1):141-179.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. International Conference on Learning Representations (ICLR).

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, and Olivier Grisel. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830.

Ildikó Pilán, Sowmya Vajjala, and Elena Volodina. 2016. A Readable Read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity. International Journal of Computational Linguistics and Applications, 7(1):143-159.

Ildikó Pilán, Elena Volodina, and Richard Johansson. 2014. Rule-based and Machine Learning Approaches for Second Language Sentence-level Readability. In Proc. 9th Workshop on Innovative Use of NLP for Building Educational Applications.
Daniele Schicchi, Giovanni Pilato, and Giosué Lo Bosco. 2020. Deep neural attention-based model for the evaluation of Italian sentences complexity. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 253-256. IEEE.

Elliot Schumacher, Maxine Eskenazi, Gwen Frishkoff, and Kevyn Collins-Thompson. 2016. Predicting the relative difficulty of single sentences with and without surrounding context. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1871-1881.

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proc. 15th International Workshop on Semantic Evaluation (SemEval).

Yao-Ting Sung, Ju-Ling Chen, Ji-Her Cha, Hou-Chiang Tseng, Tao-Hsing Chang, and Kuo-En Chang. 2015. Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning. Behavior Research Methods, 47:340-354.

Hou-Chiang Tseng, Hsueh-Chih Chen, Kuo-En Chang, Yao-Ting Sung, and Berlin Chen. 2019. An Innovative BERT-Based Readability Model. In Lecture Notes in Computer Science, vol. 11937.

Sowmya Vajjala and Detmar Meurers. 2014. Assessing the relative reading level of sentence pairs for text simplification. In Proc. 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 288-297.

Sanja Štajner, Simone Paolo Ponzetto, and Heiner Stuckenschmidt. 2017. Automatic Assessment of Absolute Sentence Complexity. In Proc. 26th International Joint Conference on Artificial Intelligence (IJCAI).

A Appendix: Instructions to annotators

We recruited two annotators, who agreed to work on a volunteer basis, under the project Anonymous at XXX University. The project has been approved by an institutional ethics review board. The annotators were given an Excel file with one sentence per row, and were asked to mark each sentence as 0 ("expected" level of difficulty), +1 ("harder") or -1 ("easier").

B Appendix: Distribution of corpora

[Figure 1: Distribution of sentences annotated as in-grade, above-grade and below-grade at each grade.]

Grade   Total sentences   In-grade sentences
3       1231              965
4       1833              951
5       1992              982
6       1824              997
7       2909              917
8       1505              939
9       1523              995
10      3051              1048
11      1590              1001
12      2394              1008

Table 2: Total number of sentences and number of in-grade sentences at each grade.

in-grade: 汤姆开头有点吞吞吐吐,渐渐地,话越来越多,声音也越来越大,越来越自然了。 'Thomas stuttered at first, but gradually his speech got longer, his voice grew louder and more natural.'

below-grade: 少年连连摆手,用不太标准的中国话说:"不,不要钱。" 'The youth waved his hands and said with an accent, "No, I don't want money."'

above-grade: 在那里,我们得到的是人与人之间的信任和被信任的喜悦。 'There, what we get is the trust between people and the joy of being trusted.'

Table 3: Example sentences taken from a Grade 3 passage, annotated with their difficulty relative to Grade 3.

C Appendix: List of linguistic features

Raw Sentence Features
1. Number of words per sentence
2. Number of characters per sentence

Lexical Features
3. Average number of characters per word per sentence
4. Number of two-character words per sentence
5. Percentage of two-character words per sentence
6. Number of three-character words per sentence
7. Percentage of three-character words per sentence
8. Number of four-character words per sentence
9. Percentage of four-character words per sentence
10. Number of words of five or more characters per sentence
11. Percentage of words of five or more characters per sentence
12. Percentage of level-1 tokens per sentence
13. Percentage of level-2 tokens per sentence
14. Percentage of level-3 tokens per sentence
15. Percentage of level-4 tokens per sentence
16. Percentage of level-5 tokens per sentence
17. Percentage of level-6 tokens per sentence
18. Percentage of level-7 tokens per sentence

Syntactic Features
19. Percentage of adjectives per sentence
20. Number of adjectives per sentence
21. Percentage of verbs per sentence
22. Number of verbs per sentence
23. Percentage of nouns per sentence
24. Number of nouns per sentence
25. Percentage of adverbs per sentence
26. Number of adverbs per sentence
27. Total number of words tagged as conjunctions
28. Percentage of conjunctions per sentence
29. Total number of words tagged as pronouns
30. Percentage of pronouns per sentence
31. Total number of prepositions per sentence
32. Percentage of prepositions per sentence

Discourse Features
33. Height of parse tree per sentence
34. Total number of noun phrases per sentence
35. Total number of verbal phrases per sentence
36. Total number of prepositional phrases per sentence
37. Total number of dependency distances per sentence
38. Average dependency distance per sentence
39. Total number of entities per sentence
40. Percentage of entities per sentence
41. Number of non-entity nouns per sentence
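As an illustration of how such features can be computed, here is a sketch for a handful of the lexical features above; the paper does not name its toolchain, so the jieba segmenter is an assumed stand-in for whatever Chinese word segmenter was actually used.

```python
# Illustrative computation of features 1-5 above; jieba is an assumption,
# not the authors' documented tool.
import jieba

def basic_features(sentence):
    words = [w for w in jieba.cut(sentence) if w.strip()]
    n = max(len(words), 1)
    return {
        "n_words": len(words),                                       # feature 1
        "n_chars": len(sentence),                                    # feature 2
        "avg_chars_per_word": len(sentence) / n,                     # feature 3
        "n_two_char_words": sum(len(w) == 2 for w in words),         # feature 4
        "pct_two_char_words": sum(len(w) == 2 for w in words) / n,   # feature 5
    }

print(basic_features("在那里,我们得到的是人与人之间的信任和被信任的喜悦。"))
```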
