
Text Readability Classification of Textbooks of a Low-Resource Language

Zahurul Islam, Alexander Mehler and Rashedur Rahman

AG Texttechnology
Institut für Informatik
Goethe-Universität Frankfurt
{zahurul,mehler}@em.uni-frankfurt.de, kamol.sustcse@gmail.com

Abstract

There are many languages considered to be low-density languages, either because the population speaking the language is not very large, or because insufficient digitized text material is available in the language even though millions of people speak the language. Bangla is one of the latter ones. Readability classification is an important Natural Language Processing (NLP) application that can be used to judge the quality of documents and assist writers to locate possible problems. This paper presents a readability classifier of Bangla textbook documents based on information-theoretic and lexical features. The features proposed in this paper result in an F-score that is 50% higher than that for traditional readability formulas.

1 Introduction

The readability of a text relates to how easily human readers can process and understand a text as the writer of the text intended. There are many text-related factors that influence the readability of a text. These factors include very simple features such as typeface, font size and text vocabulary, as well as complex features like grammatical conciseness, clarity, underlying semantics and lack of ambiguity.

Nowadays, teachers, journalists, editors and other professionals who create text for a specific audience routinely check the readability of their text. Readability classification, then, is the task of mapping text onto a scale of readability levels. We explore the task of automatically classifying documents based on their different readability levels. As input, this function operates on various statistics relating to lexical and other text features.

Automatic readability classification can be useful for many Natural Language Processing (NLP) applications. Automatic essay grading can benefit from readability classification as a guide to how good an essay actually is. Similarly, search engines can use a readability classifier to rank their search results. Automatically generated documents, for example documents generated by text summarization systems or machine translation systems, tend to be error-prone and less readable. In this case, a readability classification system can be used to filter out documents that are less readable. The system can also be used to evaluate machine translation output: a document of higher readability tends to be better than a document that belongs to a lower readability class.

Research in the field of readability classification started in 1920. English is the dominating language in this field, although much research has been done for other languages like German, French and Chinese. These languages are considered high-density languages, as many language resources and tools are available for them. However, many languages are considered to be low-density languages, either because the population speaking the language is not very large or because insufficient digitized text material is available for the language even though it is spoken by millions of people. Bangla is such a language. Bangla, an Indo-Aryan language, is spoken in South Asia, specifically in present-day Bangladesh and the Indian state of West Bengal.

Copyright 2012 by Zahurul Islam, Alexander Mehler, and Rashedur Rahman
26th Pacific Asia Conference on Language, Information and Computation, pages 545–553
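The kind of surface statistics that serve as input to such a classifier (average sentence length, average word length, complex-word ratio, type-token ratio, hapax ratio, and the entropy of the word distribution; cf. Section 4) can be sketched in a few lines of Python. This is an illustrative reconstruction under simplified assumptions (naive whitespace tokenization, a 10-character complex-word threshold, base-2 entropy), not the implementation used in the paper:

```python
import math
from collections import Counter

def lexical_features(text):
    """Toy sketch of the surface statistics described in Section 4.
    Illustrative only; the paper's tokenizer and thresholds differ in detail."""
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in text.replace('?', '.').replace('!', '.').split('.')
                 if s.strip()]
    # Crude whitespace tokenization with punctuation stripping.
    words = [w.strip('.,?!').lower() for w in text.split() if w.strip('.,?!')]
    counts = Counter(words)
    n = len(words)
    # Shannon entropy of the word distribution: H = -sum p(x) log2 p(x)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        'avg_sentence_length': n / len(sentences),
        'avg_word_length': sum(len(w) for w in words) / n,
        'complex_word_ratio': sum(1 for w in words if len(w) >= 10) / n,
        'type_token_ratio': len(counts) / n,
        'hapax_ratio': sum(1 for c in counts.values() if c == 1) / n,
        'word_entropy': entropy,
    }

feats = lexical_features("The cat sat. The cat ran away quickly.")
```

In the paper such statistics are fed to an SVM classifier (WEKA's SMO); any off-the-shelf SVM would play the same role in this sketch.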
With nearly 230 million speakers, Bangla is one of the largest spoken languages in the world, but only a very small number of linguistic tools and resources are available for it. For instance, there is no morphological analyzer, POS tagger or syntax parser available for Bangla.

To create a supervised readability classification, it is important to use a corpus that is already classified for the different levels of readers. In this work, the corpus is collected from textbooks that are used in primary and middle school in Bangladesh. The collected documents are classified according to their readability, so the extracted corpus is ideal for a readability classification task.

In this paper, we present a readability classification based on information-theoretic and lexical features. We evaluate this classifier in comparison with traditional readability formulas that, even though they were proposed in the early stages of readability classification research, are still widely used.

The paper is organized as follows: Section 2 discusses related work, followed by an introduction of the corpus in Section 3. The features used for classification are described in Section 4, and our experiments in Section 5 are followed by a discussion in Section 6. Finally, we present our conclusions in Section 7.

2 Related Work

There is no standard approach to measuring text quality. According to Mullan (2008), a good readable English sentence should contain 14 to 22 words. He also stated that if the average sentence length is more than 22 words then the content is not clear. If the average sentence length is shorter than 14, then it is probable that the presentation of ideas is discontinuous.

Much work was done previously in this field and many different types of features were used. We summarize the related research grouped by type:

Lexical Features: In the early stage of readability research, fairly simple features were used due to the lack of linguistic resources and computational power. Average Sentence Length (ASL) is one of them. The ASL can be used as a measure of grammatical complexity, assuming that a longer sentence has a more complex grammatical structure than a shorter one. Dale and Chall (1948; 1995) showed that reading difficulty is a linear function of the ASL and the percentage of rare words. They listed 3,000 commonly known words for the 4th grade. Gunning (1952) also considered the numbers of sentences and complex words to measure text readability. The formula uses similar lexical features as (Dale and Chall, 1948; Dale and Chall, 1995) with different constants. The Flesch-Kincaid readability index (Kincaid et al., 1975) considers the average number of words per sentence and the average number of syllables per word. They proposed two different formulas, one for measuring how easy a text is to read and the other one for measuring grade level. Senter and Smith (1967) also designed a readability index for the US Air Force that uses the average number of characters in a word and the average number of words in a sentence. Many of the other readability formulas are summarized in (Dubay, 2004).

English has a long history of readability research, but there is very little previous research in Bangla text readability. Das and Roychoudhury (2004; 2006) show that readability formulas proposed by (Kincaid et al., 1975) and (Gunning, 1952) work well for Bangla text. The readability formulas were tested semi-automatically on seven documents, mostly novels. Obviously this data set is small.

Petersen and Ostendorf (2009) and Feng et al. (2009) show that these traditional methods have significant drawbacks. Longer sentences are not always syntactically complex, and the syllable number of a single word does not correlate with its difficulty. With recent advancements of NLP tools, a new class of text features is now available.

Language Model Based Features: Collins-Thompson and Callan (2004), Schwarm and Ostendorf (2005), Aluisio et al. (2010), Kate et al. (2010) and Eickhoff et al. (2011) use statistical language models to classify texts for their readability. They show that trigrams are more informative than bigram and unigram models. Combining information from statistical language models with other features using Support Vector Machines (SVM) outperforms traditional readability measures. Pitler and Nenkova (2008) also used a unigram language model and found that this feature is a strong predictor of readability.

POS-based Features: Parts of Speech (POS)-based grammatical features were shown to be useful in readability classification (Pitler and Nenkova, 2008; Feng et al., 2009; Aluisio et al., 2010; Feng et al., 2010). In the experiment of (Feng et al., 2010), these features outperform language-model-based features.

Syntax-based Features: Text readability is affected by syntactic constructions (Pitler and Nenkova, 2008; Barzilay and Lapata, 2008; Heilman et al., 2007; Heilman et al., 2008). In this line of research, Barzilay and Lapata (2008) show, for example, that multiple noun phrases in a single sentence require the reader to remember more items.

Semantic-based Features: On the semantic level, a paragraph that refers to many entities burdens the reader, since he has to keep track of these entities, their semantic representations and how these entities are related. Texts that refer to many entities are extremely difficult to understand for people with intellectual disabilities (Feng et al., 2009). Noemie and Huenerfauth (2009) show how working memory limits the semantic encoding of new information by readers. Researchers also experimented with semantic features like lexical chains, discourse relations and entity grids (Feng et al., 2010; Barzilay and Lapata, 2008). It has been shown that these features are useful for readability classification.

In this paper, we do not compare our work with any previous work that explores linguistic features. Due to the unavailability of a Bangla syllable identification system, we could not compare our work with readability formulas that use syllable information. We will only compare our proposed features with a baseline system that uses three traditional readability formulas proposed by Gunning (1952), Dale and Chall (1948; 1995) and Senter and Smith (1967). These traditional formulas are widely used in many readability classification tools.

3 Corpus Extraction

The government agency National Curriculum and Textbook Board, Bangladesh (http://nctb.gov.bd/book.php) makes available textbooks that are used in public schools in Bangladesh. The textbooks cover many different subjects, including Bangla Literature, Social Science, General Science and Religious Studies. These textbooks are for students from grade one to grade ten. All of the textbooks are in Portable Document Format (PDF). Some of them are made by scanning textbooks and some of them are converted from typed text. There is a Bangla OCR (Hasnat et al., 2007) available, but it is unable to extract text from the scanned PDF books. Therefore, we only considered textbooks that were converted to PDF from typed text. Apache PDFBox (http://pdfbox.apache.org/) is used to extract text from the PDFs. Note that 24 textbooks were extracted from class two to class eight. After text extraction, it was observed that the text was not written in Unicode Bangla. A non-standard Bangla input method called Bijoy is used to type the textbooks. This is an ASCII-based Bangla input method that was widely used in the 1990s. The next challenge was to convert non-standard text to Bangla Unicode.

The selected textbooks were written using a font called SutonnyMJ that has many different versions, all of which differ slightly in terms of the code points of some consonant conjuncts. The freely available open source CRBLPConverter (http://crblp.bracu.ac.bd/converter.php) is used to convert these non-standard Bangla texts to Unicode. To cope with the font of the text, the CRBLPConverter required some slight modifications. Textbooks not only contain descriptive texts but also contain questions, poems, religious hymns, texts from other languages (e.g., Arabic, Pali) and transcriptions of Arabic texts (e.g., Surah). Manual work was involved to clean these non-descriptive texts and extract each chapter as a document. Class two contains only one
textbook, and classes six, seven and eight contain two textbooks each. To avoid a data sparseness problem, class two is merged with class three, and classes seven and eight are merged with class six. Each document is tokenized using a slightly modified version of the tokenizer that is freely available online (http://statmt.org/wmt09/scripts.tgz). Table 1 shows the details of the corpus. The Average Document Length shows the average number of sentences per document. The Average Sentence Length represents the average number of words in a sentence, and the Average Word Length displays the average number of characters in a word.

Classes  Documents  Avg. Document  Avg. Sentence  Avg. Word
                    Length         Length         Length
three    123        65.21          8.07           4.31
four      88        126.25         8.63           4.37
five      43        196.72         9.34           4.41
six       62        130.13         11.53          4.85

Table 1: The Bangla Readability Corpus

It should be noted that 80% of the corpus is used for training and the remaining 20% is used as a test set.

4 Features

4.1 Lexical Features

In this paper, we compare a lexical and information-theoretic classifier of text readability with a classifier based on traditional readability formulas. The literature explores some of the linguistic indicators of readability. These include the avg. sentence length, avg. word length and the avg. number of difficult words (of more than 9 letters). We develop a classifier of text readability based on lexical and information-theoretic features. We first describe the lexical features used by this classifier.

The Average Sentence Length is a quantitative measure of syntactic complexity. In most cases, the syntax of a longer sentence is more difficult than the syntax of a shorter sentence. However, children of lower grade levels are not aware of syntax. In any event, a longer sentence contains more entities, and children have to remember all of these entities in order to understand the sentence, which makes a longer sentence more difficult for them. As an example, Table 1 shows that the Average Sentence Length rises in the text of higher readability classes. The Average Word Length is another lexical feature that is useful for readability classification. A longer word carries some difficulties for children at a lower grade level. For example, the word biodegradable will be harder to pronounce, spell and understand for children at a lower grade level. This characteristic is reflected in our readability corpus, as shown in Table 1. The Average Word Length will be more useful for agglutinative languages such as German, which allow concatenation of morphemes to build longer words.

The Average Number of Complex Words feature is related to the Average Word Length. The average length of English written words is 5.5 (Nadas, 1984). Table 1 shows that the average word length in our corpus is below 5. Dash (2005) showed that the average word length in the CIIL corpus (http://www.elda.org/catalogue/en/text/W0037.html) is 5.12. Majumder et al. (2006) claimed that the average word length in a Bangla news corpus is 8.62. They have mentioned that the average length is higher due to the presence of many hyphenated words in the news corpus. In this work, any word that contains 10 or more characters is considered a complex word. A complex word will be harder to read for children at a lower grade level. The type-token ratio (TTR), which indicates the lexical density of text, has been considered as a readability feature too. Low lexical densities involve a great deal of repetition.

The term Hapax Legomena is widely used in linguistics, referring to words which occur only once within a context or document. These are mostly content words. Kornai (2008) showed that 40% to 60% of the words in larger corpora are Hapax Legomena. Documents with more Hapax Legomena generally will contain more information. In terms of text readability, the difficulty level will be higher.

4.2 Entropy Based Features

Recently, researchers have independently made the suggestion that the entropy rate plays a role in human communication in general (Genzel and Charniak, 2002; Levy and Jaeger, 2007). The rate of
information transmission per second in a human speech conversation is roughly constant, that is, transmitting a constant number of bits per second or maintaining a constant entropy rate.

Since the most efficient way to send information through a noisy channel is at a constant rate, Plotkin and Nowak (2000) have shown that this principle could be viewed as biological evidence of how human language processing evolved. Communication through a text should satisfy this principle. That is, each sentence of a text, for example, conveys roughly the same amount of information. In order to utilize this information-theoretical notion, we start from random variables and consider their entropy as indicators of readability.

Shannon (1948) introduced entropy as a measure of information. Entropy, the amount of information in a random variable, can be thought of as the average length of the message needed to transmit an outcome of that variable. The entropy of a random variable X is defined as

    H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)    (1)

The more the outcome of X converges towards a uniform distribution, the higher H(X). Our hypothesis is that the higher the entropy, the less readable the text along the feature represented by X. In our experiment, we consider the following random variables: word probability, character probability, word length probability and word frequency probability (or frequency spectrum, respectively). Note that there is a correlation between the probability distribution of words and the corresponding distribution of word frequencies. As we use Support Vector Machines (SVM) for classification, these correlations are taken into consideration.

4.3 Kullback-Leibler Divergence-based Features

The Kullback-Leibler divergence or relative entropy is a non-negative measure of the divergence of two probability distributions. Let p(x) and q(x) be two probability distributions of a random variable X. The relative entropy of these distributions is defined as:

    D(p||q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)}    (2)

D(p||q) is an asymmetric measure that considers the number of additional bits needed to encode p when using an optimal code for q instead of an optimal code for p. In other words, D(p||q) measures how much one probability distribution differs from another distribution. More specifically, if the probability distribution p of a document is closer to q than to q', then the document has a smaller distance to q. The document belongs to the category corresponding to q.

In order to apply this method in our framework, we start from a training corpus where for each target class and each random variable under consideration we compute the distribution q(x). This gives a reference distribution such that for a text T whose class membership is unknown, we can compute the distribution p(x) only for T in order to ask how much information we get about p(x) when knowing q(x). Since q(x) is computed for each of the four target classes (see Table 1), this gives for any random variable X four features of relative entropy.

5 Experiments and Results

5.1 Baseline System

To measure the accuracy of our proposed features, a baseline system is implemented that uses three traditional readability formulas: the Gunning fog readability index (Gunning, 1952), the Dale–Chall readability formula (Dale and Chall, 1948; Dale and Chall, 1995) and the Automated readability index (Senter and Smith, 1967). There are more traditional formulas available that use syllable information; these are not considered for this task due to the unavailability of a Bangla syllable identification system. The Gunning fog readability index and the Dale–Chall readability formula both use complex or difficult words. The definition of these words varies slightly: Gunning (1952) defines a complex word as a word that contains more than three syllables, and Dale and Chall (1948; 1995) introduce 3000 familiar words; any word not in the list of 3000 words is considered difficult. For this work, both types of words are defined in the same way, as described in Section 4.1: we consider any word that has 10 or more letters as a difficult or complex word. Table 2 shows the evaluation of the baseline system. The evaluation shows that these features do not perform well. Among
these formulas, the Automated readability index is the highest performing formula. Das and Roychoudhury (2004; 2006) showed that these traditional features nonetheless work well for Bangla novels. Note that we have used the SMO (Platt, 1998; Keerthi et al., 2001) classifier model in WEKA (Hall et al., 2009) together with the Pearson VII function-based universal kernel PUK (Üstün et al., 2006).

Features                        Accuracy  F-Score
Gunning fog readability index   48.3%     36.5%
Dale–Chall readability formula  48.3%     45.0%
Automated readability index     51.6%     46.2%
All together                    53.3%     49.6%

Table 2: Evaluation of the baseline system with 3 traditional readability formulas.

5.2 System with Lexical Features

Lexical features use the same kind of surface features as the traditional readability formulas used in the baseline system (see Section 5.1). Table 1 shows that the average sentence length and difficulty levels are proportional. That means that sentence length increases for higher readability classes. Average word length exhibits the same characteristics. These characteristics are reflected in the experiment: these two are the best performing features among all of the lexical features. Table 3 shows the evaluation of the system that uses only lexical features. Although the individual accuracy of some of these features is similar to the traditional formulas, the combination of all lexical features outperforms the baseline system.

Features                      Accuracy  F-Score
Average sentence length       51.6%     47.3%
Type token ratio              41.6%     30.6%
Avg. word length              50.3%     46.9%
Avg. number of complex words  46.6%     34.2%
Hapax legomena                40.0%     28.3%
All together                  60.0%     56.5%

Table 3: Evaluation of lexical features.

5.3 System with Entropy Based Features

As noted earlier, entropy measures the amount of information in a document, and the entropy rate is constant in human communication (see Section 4.2). The documents in this work are assumed to be a medium of communication between writers and readers. Consequently, the information flow of a very readable document will differ from that of a less readable document, so the constants for the corresponding entropy rates of the different readability classes will differ. As single features, these entropy based features perform similarly to the lexical features, but collectively this is the best performing feature set. Among all similar features, the random variable with Word Probability works better than the others. Table 4 shows the results of these features. Adding lexical features to the entropy based features improves accuracy and F-score substantially.

Features                         Accuracy  F-Score
Word probability                 53.3%     49.3%
Character probability            48.3%     35.4%
Word length probability          50.0%     36.9%
Word frequency probability       43.3%     32.4%
Character frequency probability  53.3%     47.7%
Entropy features                 61.6%     59.8%
Lexical + entropy features       73.3%     72.1%

Table 4: Evaluation of entropy based features.

5.4 System with Kullback-Leibler Divergence-based Features

Relative entropy-based features represent the distance between the test document and the target classes. The target class with the lowest distance will be the class of the test document. Five different types of random variables are used in this work (see Section 4.3). The random variable based on character probabilities is the best performing individual feature among all features used in this work. However, this feature set performs worse than the lexical and entropy based feature sets. The evaluation is shown in Table 5. The combination of all, i.e., lexical, entropy and relative entropy based features, gives the best result, namely an accuracy of 75% and an F-score of 74.1%.

6 Discussion

Das and Roychoudhury (2004; 2006) found that traditional readability formulas are useful for Bangla readability classification. However, the experimental results in this paper show that these formulas are
not useful for studies like the one presented here. This is probably due to the fact that these formulas were specially designed for English. One reason for the poor performance is that Bangla script is a syllabic script that has glyphs representing clusters and ligatures. It also has to be noted that Bangla is an inflectional language, so that the average word length can be longer than that of many other languages.

The lexical features that are assumed to be good indicators of text difficulty did indeed perform well in classification. The respective feature set performs better than the baseline system. Average sentence length and average word length perform well, as reflected in Table 1, which shows that the average word and sentence lengths are longer in higher readability classes than in lower readability classes.

As an individual feature, each entropy based feature performs similarly to the other features. However, the combination of the entropy based features is the best performing feature set among all. The classification performance even increases when entropy based features are combined with lexical features.

Among all relative entropy based features, the random variable based on character probabilities performs best. This feature performs better than the baseline system, but the performance drops when this feature is added to the other relative entropy based features. Although the relative entropy based feature set performs better than the baseline system, the lexical and entropy based feature set performs even better. The performance surpasses the baseline system by 50% when lexical, entropy based and relative entropy based features are combined.

Features                             Accuracy  F-Score
4 Word probabilities                 50.0%     50.2%
4 Character probabilities            61.6%     61.1%
4 Word length probabilities          48.3%     46.5%
4 Word frequency probabilities       50.0%     45.6%
4 Character frequency probabilities  43.3%     34.2%
20 Relative entropy based features   56.6%     54.0%
Entropy + relative entropy features  68.3%     65.9%
Lexical + entropy +
  relative entropy based features    75.0%     74.1%

Table 5: Evaluation of Kullback-Leibler divergence-based features.

7 Conclusion

In this paper, we have presented features for text readability classification of a low-resource language. Altogether we have proposed 30 quantitative features. Twenty-five of them are information-theoretic features. These features do not require any kind of linguistic processing. Recent work based on advanced NLP tools argues that linguistic features are useful for readability classification. However, our experimental results show that lexical and information-theoretic features perform very well. There are many languages in the Asia-Pacific region that are still considered low-resource languages. These features can be used for readability classification of these languages. As future work, we plan to explore many other information-theoretic features like mutual information, pointwise mutual information and motifs.

8 Acknowledgements

We would like to thank Mr. Munir Hasan from the Bangladesh Open Source Network (BdOSN) and Mr. Murshid Aktar from the National Curriculum & Textbook Board Authority, Bangladesh for their help on corpus collection. We would also like to thank Andy Lücking, Paul Warner and Armin Hoenen for their fruitful suggestions and comments. Finally, we thank three anonymous reviewers. This work is funded by the LOEWE Digital-Humanities project at the Goethe-Universität Frankfurt.

References

Sandra Aluisio, Lucia Specia, Caroline Gasperin, and Carolina Scarton. 2010. Readability assessment for text simplification. In NAACL-HLT 2010: The 5th Workshop on Innovative Use of NLP for Building Educational Applications.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 21(3):285–301.

Kevyn Collins-Thompson and James P. Callan. 2004. A language modeling approach to predicting reading difficulty. In HLT-NAACL.

Edgar Dale and Jeanne S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin, 27(1):11–20, 28.
Edgar Dale and Jeanne S. Chall. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.

Sreerupa Das and Rajkumar Roychoudhury. 2004. Testing level of readability in Bangla novels of Bankim Chandra Chattopadhyay w.r.t. the density of polysyllabic words. Indian Journal of Linguistics, 22:41–51.

Sreerupa Das and Rajkumar Roychoudhury. 2006. Readability modeling and comparison of one and two parametric fit: A case study in Bangla. Journal of Quantitative Linguistics, 13(1).

Niladri Sekher Dash. 2005. Corpus Linguistics and Language Technology: With Reference to Indian Languages. New Delhi: Mittal Publications.

William H. Dubay. 2004. The Principles of Readability. Costa Mesa, CA: Impact Information.

Carsten Eickhoff, Pavel Serdyukov, and Arjen P. de Vries. 2011. A combined topical/non-topical approach to identifying web sites for children. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining.

Lijun Feng, Noemie Elhadad, and Matt Huenerfauth. 2009. Cognitively motivated features for readability assessment. In Proceedings of the 12th Conference of the European Chapter of the ACL.

Lijun Feng, Martin Jansche, Matt Huenerfauth, and Noemie Elhadad. 2010. A comparison of features for automatic readability assessment. In The 23rd International Conference on Computational Linguistics (COLING).

Dmitriy Genzel and Eugene Charniak. 2002. Entropy rate constancy in text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).

Robert Gunning. 1952. The Technique of Clear Writing. McGraw-Hill; Fourth Printing Edition.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations, 11(1):10–18.

Md. Abul Hasnat, S. M. Murtoza Habib, and Mumit Khan. 2007. A high performance domain specific OCR for Bangla script. In International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE).

Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. 2007. Combining lexical and grammatical features to improve readability measures for first and second language text. In Proceedings of the Human Language Technology Conference.

Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. An analysis of statistical models and features for reading difficulty prediction. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (EANL).

Rohit J. Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz, Radu Florian, Raymond J. Mooney, Salim Roukos, and Chris Welty. 2010. Learning to predict readability using diverse linguistic features. In 23rd International Conference on Computational Linguistics (COLING 2010).

S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. 2001. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649.

J. Kincaid, R. Fishburne, R. Rogers, and B. Chissom. 1975. Derivation of new readability formulas for Navy enlisted personnel. Technical report, US Navy, Branch Report 8-75, Chief of Naval Training, Millington, TN.

Andras Kornai. 2008. Mathematical Linguistics. Springer.

Roger Levy and T. Florian Jaeger. 2007. Speakers optimize information density through syntactic reduction. Advances in Neural Information Processing Systems, pages 849–856.

Khair Md. Yeasir Arafat Majumder, Md. Zahurul Islam, and Mumit Khan. 2006. Analysis and observations from a Bangla news corpus. In 9th International Conference on Computer and Information Technology (ICCIT 2006).

W. M. A. Mullan. 2008. Dairy science and food technology: Improving your writing using a readability calculator.

A. Nadas. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(4):859–861.

Sarah E. Petersen and Mari Ostendorf. 2009. A machine learning approach to reading level assessment. Computer Speech and Language, 23(1):89–106.

Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

John C. Platt. 1998. Fast training of support vector machines using sequential minimal optimization. MIT Press.

Joshua B. Plotkin and Martin A. Nowak. 2000. Language evolution and information theory. Journal of Theoretical Biology, 205(1):147–159.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).
R. J. Senter and E. A. Smith. 1967. Automated readability index. Technical report, Wright-Patterson Air Force Base.

Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(1):379–423.

B. Üstün, W. J. Melssen, and L. M. C. Buydens. 2006. Facilitating the application of support vector regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems, 81(1):29–40.

