

Publisher: Routledge
Journal of Quantitative Linguistics

Readability modelling and comparison of one and two parametric fit: A case study in Bangla*
Sreerupa Das (a) & Rajkumar Roychoudhury (b)
(a) Department of Linguistics, University of Calcutta, India
(b) Physics and Applied Mathematics Unit, Indian Statistical Institute, Calcutta, India
Published online: 16 Feb 2007.
To cite this article: Sreerupa Das & Rajkumar Roychoudhury (2006) Readability modelling
and comparison of one and two parametric fit: A case study in Bangla*, Journal of
Quantitative Linguistics, 13:01, 17-34, DOI: 10.1080/09296170500500843
To link to this article: http://dx.doi.org/10.1080/09296170500500843
Journal of Quantitative Linguistics
2006, Volume 13, Number 1, pp. 17 – 34
DOI: 10.1080/09296170500500843
Readability Modelling and Comparison of One and Two Parametric Fit: A Case Study in Bangla*
Sreerupa Das (1) and Rajkumar Roychoudhury (2)
(1) Department of Linguistics, University of Calcutta, India; (2) Physics and Applied Mathematics Unit, Indian Statistical Institute, Calcutta, India
ABSTRACT
This paper deals with an interesting problem in computational linguistics, namely
the “readability of texts”. A piece of text appears to be easy or difficult depending on certain
parameters of the text pattern. Based on these parameters, several readability
indices have been developed, such as the Flesch reading ease, the Fog index, and the Flesch–Kincaid
formula. Our paper deals with the construction of a miniature model, or
readability index, for Bangla documents, using textbooks. We take into consideration
parameters such as average sentence length and number of syllables per 100 words.
INTRODUCTION
Making a document readable is key to producing a clearly-written text.

One problem in public education and mass communication is how to tell
whether a particular piece of writing is likely to be readable to a particular
group of readers. ‘‘Readability’’ refers to ease of comprehension, and
not the artistic merit of the passage; it is reflected by the accuracy with
which the readers answer comprehension tests based on the passage
(Bhattacharya, 1965). Readability (Hou, 1983) tends to measure how
comfortable a reader feels when reading a piece of text. Analysis of
readability is extremely important if the document is to reach a sufficient
number of readers for whom it has been prepared. A readability index
(which provides assistance in producing a readable document) must thus
be developed. A readability index is a measure of the ease (or difficulty) of
reading and understanding a piece of text.
*Address correspondence to: Rajkumar Roychoudhury, Physics and Applied Mathematics Unit, Indian Statistical Institute, 203, B.T. Road, Calcutta 700035, West Bengal, India. E-mail: raj@isical.ac.in
Readability (or ‘‘reading ease’’) formulae are arithmetical functions
that assign a numerical value to a text, indicating the reading proficiency
that will be required to understand that text. They are based on the
assumption that comprehension of a text, difficulty and readability are
directly linked to each other, and are a stochastic function of aspects of
the text such as ratios of sentences, words, and syllables: that is, features
that can be objectively measured (DeVries, 2000).

Several formulae to compute readability indices have been reported
and are in fairly wide use (Flesch, 1948). Readability formulae were
originally mainly developed by educators and reading specialists. Their
primary application was in defining the appropriate reading level for
textbooks for elementary and secondary schools. In other words, these
formulae were originally developed to help schools decide whether
textbooks were appropriate for students at a particular grade level (Farr
et al., 1951; Gunning, 1952).
Notable work on readability analysis started in the United States at
the beginning of the 20th century. Among those studies that predate
the computer age, the Gray – Leary yardstick is worth mentioning
(McCallum & Peterson, 1982). Starting with 289 variables, a formula
was ultimately devised with 64 variables, which seemed too complex
and unattractive to work with. Earlier, Thorndike (Hochhauser, 1997)
showed that words encountered frequently by readers are less difficult to
understand than those that appear rarely. Hence a measure of readability can
be made in terms of the frequency of words in normal use. Consequently, the
McCall-Crabbs lessons (Klare, 1975) and the Dale-Long list (containing
easy words) were published.
Since the 1940s, researchers have devised a number of readability
formulae. Notable among them are the Flesch formula (McCallum &
Peterson, 1982), the Fog index (McCallum & Peterson, 1982), the
Flesch – Kincaid formula (McCallum & Peterson, 1982), and others. All
these formulae were developed and tested on English; there exists to date
no quantitative study of readability on any Indian language. The need for
making a readability index for Bangla is quite clear. Such an index,
applied to a document, would estimate the grade or level for which the
document was prepared. This would naturally be very helpful when
screening texts from huge samples.
The readability formulae for English may not be directly applicable to
Bangla. This is because, while European scripts are pseudo-phonetic,
Bangla is a syllabic script with glyphs representing clusters and ligatures.
That is, there are certain features or parameters in Bangla which need to
be incorporated in the index to give more accurate scores for Bangla texts.
This paper describes an attempt, perhaps for the first time, to bridge
this gap between Bangla and English. We have extracted a set of
parameters from the older Flesch index (Flesch, 1948) and, based on that,
created a miniature readability model for testing on Bangla documents.
Our model, which is based on a small sample (small number of texts) as
well as a small range of data (small number of respondents), does
incorporate some errors (in prediction). But it serves as a miniature or a
test model for Bangla texts, to show the relative efficiency of the old
parameters. This will help in developing a more realistic model in future
work based on a much larger sample.
AIMS OF THE PRESENT WORK
The goal of the present paper is to explore and analyse the “readability”
of a few Bangla texts by a number of authors of high repute.
Since “difficulty level” is a qualitative concept, we must first quantify it,
i.e. transform it from a qualitative into a quantitative measure, before any
concrete inference can be drawn about its significance and interpretation in
various domains of the Bangla language.
There are many factors in language structure which make a text
‘‘easy’’, ‘‘moderate’’ or ‘‘difficult’’ (Mikk, 1995, 1999). From the view-
point of a linguist, two such factors are average sentence length (total
words/total sentences) and number of syllables per 100 words (total
syllables/total words * 100).
Our ultimate aim is then to build a readability index via multiple
regression using those two factors. The steps to be taken towards this
goal are illustrated in Diagram A.
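As a minimal sketch of the two factors just defined, the snippet below computes the average sentence length and the number of syllables per 100 words from raw counts. The function and parameter names are illustrative assumptions, not the authors' programs; the counts in the usage lines are those reported later for sample 1 in Table 6.

```python
def average_sentence_length(total_words, total_sentences):
    """First factor: total words divided by total sentences."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return total_words / total_sentences


def syllables_per_100_words(total_syllables, total_words):
    """Second factor: total syllables divided by total words, times 100."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return total_syllables / total_words * 100


# Counts reported for sample 1 in Table 6: 489 words, 40 sentences, 1318 syllables.
x1 = average_sentence_length(489, 40)
x2 = syllables_per_100_words(1318, 489)
print(round(x1, 3), round(x2, 3))
```

Keeping the two factors as separate functions makes it easy to swap in a different word-difficulty measure later without touching the sentence-length computation.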
Sample Survey
This step involves collecting data, which are of two principal types in our
study. The first set of data consists of the sample texts, which are drawn randomly from the
corpus. These are then given to a group of readers who share
a common educational and cultural background. We then collect the
qualitative responses of the readers (from very easy to very difficult) on a 0–100
scale divided into bands running from 0–20 (very hard) to
80–100 (very easy), and take the frequencies of the readers’ responses to
derive quantitative data. This is the second type of data we work with.
[Diagram A: the steps towards building the readability index]
Parameter Extraction
This step involves studying the various standard readability parameters.
Principally, one needs to investigate the correlation coefficients between
various parameters in the text. In this step we again draw some random
samples from the corpus, which are then given to the readers for their
responses. This step serves a two-fold purpose. First, it smooths out any
irregularities or discrepancies present in the data (more precisely, it
corrects for bias, if any, in the responses). Secondly, from the
responses, we can get a clear picture of which parameters reflect reading
ease or difficulty.
Analysis. This is a very important step in building the
model. In this stage, we draw various graphs to see the behaviour
of the extracted parameters. We also examine the scatter
plot between the parameters and the readers’ responses, in order to capture
any pattern (positive or negative) that may affect readability.
Next, we select from the extracted set of parameters those
which show a strong positive correlation with the responses. This gives
us the most likely set of parameters to be included in our model.
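The correlation check described in this step can be sketched as follows. This is an illustrative implementation of Pearson's product-moment correlation, not the authors' code; the function name is an assumption, and the usage lines take the average sentence lengths and observed scores from the values reported later in Table 6.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Average sentence length (x1) and observed score (y) as printed in Table 6.
asl = [12.225, 8.263, 12.292, 17.833, 9.440, 10.602, 7.473]
score = [58.00, 80.28, 51.14, 55.71, 52.28, 57.20, 58.57]
print(pearson_r(asl, score))
```

A candidate parameter would be kept or dropped according to the size and sign of this coefficient; with only seven samples the estimate should be read with caution.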
Assimilation. In this step we assimilate all the inferences obtained in the
preceding steps. We prepare the data and the set of
parameters which are to be included in the regression model.
Model Building. This is a purely statistical procedure in which we use the
technique of multiple regression (Butler, 1958; Oaks, 1992). Using
the least squares method, we estimate the parameters of the
model. This completes the model building procedure and leaves us
with a test model at hand.
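The model-building step can be sketched as an ordinary least squares fit with two regressors. The code below solves the normal equations directly in plain Python; it is a minimal sketch under the assumption of a linear model with an intercept, the function name is hypothetical, and the toy data are made up for illustration rather than taken from the study.

```python
def fit_two_regressor_model(x1, x2, y):
    """Return (a, b1, b2) minimising sum((y - a - b1*x1 - b2*x2)**2)."""
    n = len(y)
    if len(x1) != n or len(x2) != n or n < 3:
        raise ValueError("need at least three aligned observations")
    rows = [(1.0, float(u), float(v)) for u, v in zip(x1, x2)]
    # Normal equations (X^T X) c = X^T y, stored as an augmented 3x4 matrix.
    m = []
    for i in range(3):
        lhs = [sum(r[i] * r[j] for r in rows) for j in range(3)]
        rhs = sum(r[i] * t for r, t in zip(rows, y))
        m.append(lhs + [rhs])
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, 3):
            factor = m[k][col] / m[col][col]
            for j in range(col, 4):
                m[k][j] -= factor * m[col][j]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (m[i][3] - tail) / m[i][i]
    return tuple(coeffs)


# Hypothetical toy data generated from y = 2 + 3*x1 - x2 (not from the paper).
toy_x1 = [1, 2, 3, 4, 5]
toy_x2 = [2, 1, 4, 3, 6]
toy_y = [3, 7, 7, 11, 11]
a, b1, b2 = fit_two_regressor_model(toy_x1, toy_x2, toy_y)
print(a, b1, b2)
```

Solving the normal equations is adequate for a handful of samples and two regressors; for larger or ill-conditioned problems a library routine based on QR or SVD would usually be preferred.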
Underlying Principles
Defining and selecting a readability formula requires some attention
to the underlying question: What exactly constitutes a readable
document? Specifically, what features of the text play an important role
in determining readability? Many factors can be suggested as having an
influence on readability: the proportion of less frequent words, the type-
token ratio, word length, sentence length, frequency of personal
references, and so on (Bhattacharya, 1965). A survey of the features
that have been used in various readability formulae reveals the following
list of features:
(A) Length of words (in characters).
(B) Number of words of six or more letters.
(C) Number of syllables in terms of words.
(D) Number of words which are monosyllables.
(E) Number of words of three or more syllables.
(F) Number of affixes (prefixes or suffixes).
(G) Number of words per sentence.
(H) Number of sentences.
(I) Number of pronouns.
(J) Number of prepositions.
Experiments have been carried out to study the correlation between such
factors and the readability scores observed in tests of reading
comprehension. Klare (1968) and others (McLaughlin, 1966) have
shown that the two most common variables in a readability formula are:
1. A measure of word difficulty.
2. A measure of sentence difficulty.
Clearly, a sentence with a number of unusual and uncommon words will
be more difficult to understand than a sentence with simpler, more
common words. Similarly, a sentence with contorted and complex
syntactic structure is more difficult to read than a sentence with a simpler
structure.
A preliminary study has been made already on applications of various
readability indices on a few samples of Bangla text (Das & Chaudhuri,
2000a). We found that the Flesch formula, the Fog index and the Fog index
recalculated (Powers–Sumner–Kearl) have the highest correlations with
manual grading of Bangla texts, as measured by Spearman’s rank correlation.
The four different indices and the variables they
utilize are stated in Table 1.
In the formulae for R and G there is no mathematical scale normalization
to ensure that they are limited to values within the prescribed ranges.
Our project here is to build a readability index for Bangla based on the
variables cited above. For this, we have selected the variables used for the
Flesch readability index. We have chosen Flesch because the other two
formulae, Fog index and Fog index recalculated, both involve a
parameter P that is the total number of hard words. For the time being
we are putting this variable aside (see section Extraction of Variables).
The variables used in Flesch have a strong correlation with manual
grading of the Bangla texts. The Flesch index, as well as the Flesch-
Kincaid index, involves the same variables; thus we could have chosen
either of the two. Here we will use Flesch (1948) as it is the earlier of the
two.
It is our perception that these parameters can be useful in formulating
a readability index for Bangla and thus we restrict ourselves to these two
only.
Table 1. Readability indices and their mathematical expressions.
Formula                   Variables   Mathematical expression
Flesch formula            S, W, T     R = 206.835 – 84.6 * S/W – 1.015 * W/T
Fog index                 P, W, T     G = (P/W + W/T) * 0.4
Fog index recalculated    P, W, T     G′ = 3.0680 + 9.84 * P/W + 0.0877 * W/T
Flesch–Kincaid formula    S, W, T     R′ = 11.8 * S/W + 0.39 * W/T – 15.59
S, total number of syllables; P, total number of words with three or more syllables; W,
total number of words; R, reading ease score in the range 0 (hard) to 100 (easy); T, total
number of sentences; G, grade level in the range 0 (very easy) to 12 (very hard).
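The four expressions in Table 1 can be transcribed directly into code. The sketch below copies the formulae exactly as printed in the table; the function names are assumptions, the S, W, T counts in the usage lines are those reported later for sample 1 in Table 6, and the value of P is a hypothetical placeholder because the paper reports no hard-word counts.

```python
def flesch_reading_ease(S, W, T):
    """R as printed in Table 1 (S syllables, W words, T sentences)."""
    return 206.835 - 84.6 * S / W - 1.015 * W / T


def fog_index(P, W, T):
    """G as printed in Table 1 (P words of three or more syllables)."""
    return (P / W + W / T) * 0.4


def fog_index_recalculated(P, W, T):
    """G' as printed in Table 1."""
    return 3.0680 + 9.84 * P / W + 0.0877 * W / T


def flesch_kincaid(S, W, T):
    """R' as printed in Table 1."""
    return 11.8 * S / W + 0.39 * W / T - 15.59


# Sample 1 counts from Table 6; P = 100 is a made-up value for illustration only.
S, W, T, P = 1318, 489, 40, 100
flesch_sample1 = flesch_reading_ease(S, W, T)
print(flesch_sample1, flesch_kincaid(S, W, T))
print(fog_index(P, W, T), fog_index_recalculated(P, W, T))
```

With the sample 1 counts the Flesch expression evaluates to roughly -33.6, well outside the nominal 0 to 100 range, which illustrates the missing scale normalization noted above when English coefficients meet Bangla syllable counts.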
Extraction of Variables
Since Bangla, the Indian language on which we have started our
work, is an inflectional language, certain features of it are worth noting.
Because the language is inflectional, the average word length (in terms of
syllables) can be longer than that of Western European languages like
English. The Bangla word Kariachilam (“I had done”) is five syllables
long, but it would not be classed with the “hard words”, as it is not an
uncommon or difficult word at all. Since this is a general feature of the
language, the length of words in terms of syllables can be ignored as a marker of hard words
at this stage of our work. We have selected the three parameters S, W, T
for our purpose. Here S is the total number of syllables, W the total
number of words, and T the total number of sentences.
EXPERIMENTAL PROCEDURE
As stated above, our objective in this paper is to build a rough, miniature
readability index in the form of a multiple linear regression (MLR)
model whose output y lies in the range 0 (hard) to 100 (easy) and decreases with
the increasing difficulty of the text (Das & Chaudhuri, 2000b; 2004a).
Sampling Scheme
For the present study, we consider a set of seven Bangla documents (a
detailed description of the authors is given in the Appendix). Not taking
into consideration the content of the documents, they are arbitrarily
numbered from 1 to 7. From each sample, a page is randomly selected.
Next we select a paragraph, once again randomly from these pages. Below
we give the list of documents along with the names of the authors. The
names are given according to the numbering of the samples (see Table 2).
Collection of Readers’ Responses

A selected portion from each text is subjected to test by a group of 35
informants coming from similar academic backgrounds and social status
(Mikk, 1997). The informants are asked to read each of the portions of the
text and rate its difficulty on a scale with five ranges: very easy, easy, standard,
difficult, and very difficult. The ranges in the scale are shown in Table 3.
Table 4 shows the scores given by the informants to each
of the samples. Here S refers to the samples and R the rank given by the
informants.
Table 2. The numbering of the samples.
Sample   Name of document                               Author
1        gharaana bajae rekhe banaaner samataabidhan    Syamdulal Kundu
2        kaakaabaabu samaggro                           Sunil Gangopadhyay
3        aattoghaati naayok promothesh boruaa           Ashis Nandi
4        bodh, buddhi ebong kampiutaar                  Somprakash Bandopadhyay
5        maitreo jaatok                                 Bani Basu
6        mahalayar upohar                               Lila Majumdar
7        alik manush                                    Saiyad Mujtaba Siraj
Table 3. The grades and their attributes.
Grade Numerical value Attribute
E 0 – 20 Very difficult
D 20 – 40 Difficult
C 40 – 60 Standard
B 60 – 80 Easy
A 80 – 100 Very easy
For the sake of convenience we denote the first parameter (average
sentence length) as x1 and the second parameter (number of syllables
per 100 words) as x2; y denotes the mean rank given by readers.
Table 5 shows the mean rank (y), variance (σ²), standard deviation
(σ) and coefficient of variation (CV), defined as σ/y.
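As a worked illustration of these summary statistics, the sketch below recomputes them for sample 1 from the ratings in column S1 of Table 4. It assumes the population form of the variance (division by n), which is the form that agrees with the printed row for sample 1 up to truncation in the second decimal; the function name is an assumption.

```python
from statistics import mean, pstdev, pvariance

# Ratings given to sample 1 by informants 1 to 35 (column S1 of Table 4).
s1_ratings = [3, 4, 4, 3, 3, 5, 4, 3, 4, 3, 4, 4, 4, 3, 4, 3, 2, 4,
              3, 3, 4, 4, 4, 2, 3, 3, 4, 2, 3, 3, 2, 3, 4, 4, 4]


def summarize(ratings):
    """Return (mean, population variance, population SD, SD divided by mean)."""
    m = mean(ratings)
    var = pvariance(ratings)
    sd = pstdev(ratings)
    return m, var, sd, sd / m


mean_rank, variance, std_dev, cv = summarize(s1_ratings)
print(mean_rank, variance, std_dev, cv)
```

Using the sample form (division by n - 1) would give a slightly larger variance, so the choice matters when comparing against printed tables.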
The other parameters taken into account (such as average sentence length, i.e.
the total number of words in the sample divided by the total number of
sentences) are computed with the help of simple computer programs.
An algorithm for syllable counting in Bangla was not available, so the syllable
counts for each text had to be calculated manually. Table 6 shows the
results of each computation.
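A simple program of the kind mentioned here might look like the sketch below. It is not the authors' program: it assumes that sentences end with the danda (।), a question mark or an exclamation mark, and that words are separated by whitespace. Syllable counting is deliberately left out, since no algorithm for it is described.

```python
import re

# Assumed sentence terminators: danda (U+0964), question mark, exclamation mark.
SENTENCE_END = re.compile("[\u0964?!]+")


def count_words_and_sentences(text):
    """Return (total words, total sentences) under the assumed tokenisation."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    return len(words), len(sentences)


def average_sentence_length_of(text):
    """Total words divided by total sentences for a piece of text."""
    total_words, total_sentences = count_words_and_sentences(text)
    if total_sentences == 0:
        raise ValueError("text contains no sentences")
    return total_words / total_sentences


# Transliterated toy text with three sentences (illustrative only).
toy_text = "ami bhat khai\u0964 tumi ki jabe? hyan!"
print(count_words_and_sentences(toy_text))
```

Real Bangla text would need extra care with abbreviations, quotations and punctuation inside numbers, which this sketch ignores.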
Model Building
Next, a model is built on the above observed score (y) following multiple
linear regression. Based on the above data, the model fitted becomes:
$$ y = 69.425 - 1.204\,x_1 + 0.014\,x_2 \tag{1} $$
$$ y = 69.425 - 1.204 \cdot \mathrm{ASL} + 0.014 \cdot \mathrm{NOSY} \tag{1a} $$
where ASL is the average sentence length and NOSY the number of syllables per 100 words.
The above model is purely based on the sample score.
Table 4. The scores by the informants.
Student Age Sex S1 S2 S3 S4 S5 S6 S7
1 23 F 3 1 3 3 4 2 5
2 23 F 4 2 5 3 3 2 3
3 23 F 4 1 2 3 3 1 3
4 23 F 3 1 2 4 4 1 5
5 23 M 3 1 2 3 3 2 2
6 23 M 5 1 3 3 4 1 4
7 23 M 4 3 2 3 4 2 3
8 23 M 3 2 4 5 3 1 2
9 23 M 4 1 3 4 3 1 2
10 23 M 3 1 3 5 3 1 2
11 24 F 4 2 3 4 3 1 5
12 24 F 4 1 3 2 3 1 4
13 24 F 4 1 2 2 2 1 3
14 24 F 3 1 4 4 4 2 3
15 24 F 4 1 2 5 3 1 3
16 24 F 3 2 3 4 4 1 5
17 2 M 2 2 3 2 3 2 3
18 2 M 4 1 3 3 3 1 2
19 26 F 3 1 3 2 3 1 3
20 26 M 3 1 2 4 3 1 4
21 29 F 4 3 3 3 3 3 3
22 31 F 4 1 3 2 3 1 3
23 31 M 4 2 3 4 3 2 3
24 32 M 2 2 4 3 2 4 5
25 33 F 3 1 2 2 2 2 4
26 35 F 3 1 4 5 4 2 4
27 36 F 4 1 4 3 3 1 3
28 37 F 2 2 3 3 3 1 4
29 45 F 3 2 3 3 3 1 4
30 46 M 3 2 3 3 3 1 4
31 47 F 2 1 3 1 4 1 4
32 53 F 3 3 3 3 3 2 4
33 55 F 4 1 3 3 2 1 3
34 63 M 4 3 3 4 3 1 3
35 66 F 4 1 4 5 3 2 3
It is to be noted that the coefficient of x2 in Formula (1) is rather small.
This prompted us to attempt a parabolic least squares fit with only one
parameter. We therefore took only x1 and built another
model based on the observed score (y), which is a parabolic curve in x1.
Table 5. Mean rank, variance, standard deviation and coefficient of variation.
Sample number   Mean rank (y)   Variance (σ²)   Standard deviation (σ)   Coefficient of variation (CV)
1               3.4000          0.52            0.72                     0.21
2               1.5140          0.48            0.69                     0.46
3               3.0000          0.51            0.72                     0.24
4               3.2850          1.00            1.00                     0.30
5               3.1142          0.33            0.57                     0.18
6               1.4800          0.48            0.69                     0.46
7               3.4280          0.82            0.90                     0.26
Table 6. The values computed for each of the samples.
Sample number   y       W      T    x1 (ASL)   S      x2 (NOSY)
1               58.00   489    40   12.225     1318   269.529
2               80.28   785    95   8.263      1633   208.025
3               51.14   1094   89   12.292     2683   245.246
4               55.71   856    48   17.833     1939   226.518
5               52.28   793    84   9.440      1761   222.068
6               57.20   774    73   10.602     1508   194.832
7               58.57   568    76   7.473      1348   123.217
y, observed score (mean); W, total words; T, total sentences; x1 = W/T, average sentence
length (ASL); S, total syllables; x2 = S/W * 100, number of syllables per 100 words (NOSY).
Based on the above data, the model fitted becomes
$$ Y = 95.3921 - 7.34667\,x_1 + 0.324841\,x_1^2 \tag{2} $$
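For reference, a parabolic least squares fit in one variable can be written out as below. This is the standard textbook formulation, offered as a sketch; the symbols c0, c1, c2 for the coefficients and n for the number of samples are notation assumed here, and the paper does not state which numerical procedure was used.

```latex
\min_{c_0, c_1, c_2} \sum_{i=1}^{n} \left( y_i - c_0 - c_1 x_{1i} - c_2 x_{1i}^2 \right)^2
\quad\Longrightarrow\quad
\begin{pmatrix}
n & \sum x_{1i} & \sum x_{1i}^2 \\
\sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i}^3 \\
\sum x_{1i}^2 & \sum x_{1i}^3 & \sum x_{1i}^4
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \sum x_{1i}^2 y_i \end{pmatrix}
```

In Equation (2), c0, c1 and c2 correspond to the constant term, the coefficient of x1 and the coefficient of the squared term respectively.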
RESULTS
Significance of the Regression Coefficients

The model is built on the observed score (Oi) following MLR. The
formula of MLR runs as follows:
$$ y = a + b_1 x_1 + b_2 x_2 $$
x1 and x2 are treated as variables and b1 and b2 as parameters. Here
$$ b_1 = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1x_2}}{1 - r_{x_1x_2}^2}, \qquad b_2 = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1x_2}}{1 - r_{x_1x_2}^2}, \qquad a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 $$
where r is Pearson’s product-moment correlation coefficient, $\bar{x}_1$ is the sample
mean of ASL, $\bar{x}_2$ is the sample mean of syllables per 100 words, and $\bar{y}$ is the mean
observed score (Oi).
The geometrical significance of the regression coefficients lies in the
fact that the fitted model $y = a + b_1 x_1 + b_2 x_2$ can be seen as a hyper-plane
in three dimensions, the regressors being x1 and x2 and y being the fitted
mean response. In the light of the geometry behind the model the
following features are presented.
1. a is the intercept which the hyper-plane makes with the response
axis (y axis).
2. b1 is the rate of change of y with respect to x1 when x2 is held
constant, and b2 is the rate of change of y with respect to x2 when x1 is held
constant. Thus b1 and b2 serve as the slopes of the plane with respect to the x1 and
x2 axes respectively.
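A hedged note on the correlation form of the slopes, stated here as the standard two-regressor result rather than as the authors' own derivation: the ratio of correlations gives the standardized coefficient, and the slope in the original units is obtained by rescaling with the standard deviations. The symbols β1, β2 and s_y, s_{x1}, s_{x2} (sample standard deviations of y, x1 and x2) are notation assumed for this sketch.

```latex
\beta_1 = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}, \qquad
\beta_2 = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}

b_1 = \beta_1 \frac{s_y}{s_{x_1}}, \qquad
b_2 = \beta_2 \frac{s_y}{s_{x_2}}, \qquad
a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2
```

If y, x1 and x2 are first standardized to unit variance, the two forms coincide.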
Comparison of Predicted Values of 1 and 2 Parameters

Now we compare the distance between the mean score and the one
parameter predicted value and also the mean score and the two
parameter predicted value. The results are given in Table 7.
Variance can be an important statistical measure for comparing the
performance of our model and the response score, although we need to
look more closely, using a different set of samples collected under
identical conditions, for some meaningful inference (Tabachnik et al.,
2003).
Table 7. The values computed for each of the samples.
Sample number   Y1 (Oi)   Y2 (Ei)    Y3
1               58        55.90698   58.59208
2               80.28     54.89941   62.47238
3               51.14     56.03511   58.16459
4               55.71     67.58832   51.24107
5               52.28     52.88495   61.25923
6               57.20     53.37981   59.47422
7               58.57     58.96056   62.20943
Y1 (Oi), response mean/observed score; Y2 (Ei), one-parameter predicted score;
Y3, two-parameter predicted score.
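The two-parameter column can be cross-checked against Equation (1). The sketch below evaluates the equation with its rounded printed coefficients, taking the x1 term as negative and the x2 term as positive, which is the reading that agrees with Table 7; the function name is an assumption, and the inputs are the x1 and x2 values printed in Table 6.

```python
# (x1, x2) per sample as printed in Table 6.
TABLE6_X = [(12.225, 269.529), (8.263, 208.025), (12.292, 245.246),
            (17.833, 226.518), (9.440, 222.068), (10.602, 194.832),
            (7.473, 123.217)]
# Two-parameter predicted scores (column Y3) as printed in Table 7.
TABLE7_Y3 = [58.59208, 62.47238, 58.16459, 51.24107,
             61.25923, 59.47422, 62.20943]


def predict_two_parameter(x1, x2):
    """Equation (1) with the rounded coefficients as printed."""
    return 69.425 - 1.204 * x1 + 0.014 * x2


gaps = [abs(predict_two_parameter(x1, x2) - y3)
        for (x1, x2), y3 in zip(TABLE6_X, TABLE7_Y3)]
print(max(gaps))
```

The largest gap stays well below 0.2 score points, which is consistent with the table having been computed from unrounded coefficients.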
NOTES AND DISCUSSIONS
A preliminary study has been made in modelling a readability index for
Bangla documents. This is the first work of its kind on Bangla. It thus
bridges what was formerly a gap between English and Bangla, since for
English many indices were computed whereas there were none for
Bangla.
The expected score (Ei) of samples 2, 3, 5 (see Table 7) shows that these
samples are easier than samples 1, 4, 6, 7. (The assessment of easy or
difficult is quantified on a scale of 0–100.) Samples 2 and 5 were taken
from two novels whereas sample 3 was taken from a newspaper article.
Sample 4 was taken from a popular essay about computers (containing
a lot of lengthy technical words). Sample 1 is also from a newspaper article,
but this article was full of long compound and complex sentences.
Sample 7 was a translation from an Urdu novel; as a result it was full of
long words and words with consonant clusters. The only sample which
does not show parity with the expected score is sample 6. It is a sample
taken from a novel for children, but its score shows that it falls within the
“difficult range”. However, more text samples should be tested in order
to come to a more definite conclusion.
We initially selected the first parameter and saw its strong correlation
to the response score. It seemed worthwhile to test a single parameter fit.
However, in spite of the strong correlation, the single parameter fit did
not satisfy us and we considered a two parameter fit to be necessary.
Hence, a second parameter was also chosen. Our attempt to fit a curve
with two parameters gives us a better result than a single parametric fit
(Das & Roychoudhury, 2004b).
Figures 1 and 2 show graphically the one-parameter and two-parameter
fits respectively. The values of $\chi^2$ (defined as $\sum_i (O_i - E_i)^2 / E_i$, where $O_i$ and
$E_i$ are the observed and predicted scores respectively) for the two fits are
12.52 and 7.93 respectively. On this measure the two-parameter fit is better
than the one-parameter fit. However, a larger sample is needed to get a
more definite picture.
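The goodness-of-fit statistic used here can be sketched as follows, applied to the columns of Table 7 as printed. The function and variable names are assumptions made for this illustration.

```python
def chi_square(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over paired observed and predicted scores."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Columns of Table 7 as printed.
observed = [58.00, 80.28, 51.14, 55.71, 52.28, 57.20, 58.57]
one_param = [55.90698, 54.89941, 56.03511, 67.58832,
             52.88495, 53.37981, 58.96056]
two_param = [58.59208, 62.47238, 58.16459, 51.24107,
             61.25923, 59.47422, 62.20943]

chi2_one = chi_square(observed, one_param)
chi2_two = chi_square(observed, two_param)
print(chi2_one, chi2_two)
```

On the printed values the two-parameter column gives about 7.94, in line with the reported 7.93. The one-parameter column gives about 14.6 when all seven rows are included, so the reported 12.52 may rest on slightly different inputs; either way the ordering of the two fits is unchanged.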
For Indian scripts such as the one used for Bangla, there are several
factors that need further study. The visual complexity of the compound
characters of Indian languages has yet to be investigated. Furthermore,
parameters such as length of paragraph (in terms of sentences),
percentage of unknown words, and so on, should be incorporated into
the index to make it a better model.
Fig. 1. One parameter fit according to the model given in Equation (2).
Fig. 2. Two parameter fit according to the model given in Equation (1).
This discrepancy in sample 6 is due to a response bias on the part of the
informants. This sort of error should be eliminated if suitable measures
are adopted in sampling and also in testing reading comprehension. Since
the fitted model is purely based on the sample, any error in sampling may
be reflected in the model, which would thus become biased.
Linguistics has reached an advanced stage of development. It has
studied in great detail the grammatical structure of different
languages and the evolution of languages with time (Jesperson, 1922;
Taraporewala, 1951). Linguistic methods are sometimes, in a loose sense,
“statistical”, in that they involve systematically observing and analysing large
volumes of data (Mosteller et al., 1984). Individual workers have
carried out statistical studies on languages and literary styles from time
to time during the last 150 years or more. The frequency of such
studies has increased under the influence of pioneers like Zipf,
Yule, and Shannon (Shannon, 1948; Bhattacharya, 1965). Linguistics
as a subject is gradually emerging as a distinct branch of applied statistics
as modern statistical methods are used with increasing frequency.
In this paper an attempt has been made to make a quantitative study of
the Bangla language by applying modern statistical methods.
ACKNOWLEDGEMENT
The authors are grateful to the referees for their immensely constructive suggestions.
REFERENCES
Butler, C. (1958). Statistics for Linguistics. Oxford: Blackwell.

Bhattacharya, N. (1965). Some Statistical Studies on Languages. PhD thesis, Indian
Statistical Institute, Calcutta.
Das, S., & Chaudhuri, B. (2000a). On readability evaluation of Bangla text documents by
computational linguistic approach. PILC (Pondicherry Institute of Language and
Culture). Journal of Dravidic Studies, 10(2), 201 – 208.
Das, S., & Chaudhuri, B. (2000b). Readability modeling using statistical regression:
A study in Bangla texts. International Journal of Dravidian Linguistics, 23(1),
59 – 69.
Das, S., & Roychoudhury, R. (2004a). Testing level of readability in Bangla novels of
Bankim Chandra Chattopadhyay w.r.t. the density of polysyllabic words. Indian
Journal of Linguistics, 22, 41 – 51.
Das, S., & Roychoudhury, R. (2004b). Comparison between one parametric and two
parametric fit for testing level of readability of Bangla novels of Bankim Chandra
Chattopadhyay. Communicated to: International Journal Of Dravidian Linguistics.
DeVries, H. (2000). Reading Ease @ WWW. Research Project, Speech and Language
Processing, Macquarie University, Australia.
Flesch, R. (1948). Readability yardstick. Journal of Applied Psychology, 32(3),
221 – 233.
Farr, J., Jenkins, J. J., & Paterson, D. J. (1951). Simplification of Flesch reading ease
formula. Journal of Applied Psychology, 35(5), 333 – 337.
Gunning, R. (1952). The Technique of Clear Writing. New York: McGraw Hill.
Hochhauser, M. (1997). Some overlooked aspects of consent form readability.
Institutional Review Bulletin: A Review of Human Subjects Research, 19(5),
5 – 9.
Hou, H. S. (1983). Digital Document Processing. USA: Wiley-Interscience Inc.
Jesperson, O. (1922). Language, Its Nature, Development and Origin. London: George
Allen and Unwin Ltd (reprint 1947).
Klare, G. R. (1968). The role of word frequency in readability. Elementary English, 45,
12 – 22.
Klare, G. R. (1975). Assessing readability. Reading Research Quarterly, 1, 62 – 02.

McLaughlin, G. H. (1966). What Makes Prose Understandable. PhD thesis, University
College, London.
McCallum, D. R., & Peterson, J. L. (1982). Computer-based readability indices.
Proceedings of the ACM ’82 Conference.
Mikk, J. (1995). Methods of determining optimal readability of texts. Journal of
Quantitative Linguistics, 2, 125 – 132.
Mikk, J. (1997). Parts of speech in predicting reading comprehension. Journal of
Quantitative Linguistics, 4, 156 – 163.
Mikk, J. (1999). A reading comprehension formula of reader and text characteristics.
Journal of Quantitative Linguistics, 6, 214 – 221.
Mosteller, F., & Wallace, D. L. (1984). Applied Bayesian and Classical Inference – The
Case of the Federalist Papers. New York: Springer Verlag.
Oaks, M. P. (1992). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University
Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27, 379–423, 623–656.
Taraporewala, I. J. S. (1951). Elements of Science of Language. University of
Calcutta.
APPENDIX
Brief biographies of the authors whose texts are used in the
sample are given here (in alphabetical order).
Ashish Nandy
Ashish Nandy, a well known Indian intellectual, writes on cultural and
sociological topics. His important books include An Ambiguous Journey
to the City (Oxford University Press) and a biography of Pramathesh
Borua, a legendary figure in the history of Indian films.
Bani Basu
Bani Basu, the prolific Bangla writer, was born on 11 March 1939, in
Calcutta. She is one of the most talented and creative women writers of
Bangla literature. She graduated with English Honours from Scottish
Church College, Calcutta, and obtained her MA in English Literature
from the University of Calcutta. She sets her stories in contemporary
West Bengal and populates them with lifelike characters, providing
an introspective glimpse at modern society.
She also translated works including The Sonnets of Sri Aurobindo
(Srinvantu), the love stories of Somerset Maugham (1984, Rupa) and the
best stories of D. H. Lawrence (1987, Rupa). Her first novel was
Janmobhumi, Matribhumi published in Anandoloke in 1987. Her first
short story of its kind was published in Desh magazine in 1981. Her widely
read novels include Gandharbi, Pancama Purusha and Ashtama garbha.
Her short stories Svetpatharera Thala and Radhanagar reflect her versatility.
She is at present a lecturer in English at Bijoykrishna Girls’ College,
Howrah. She has received many important awards including the Tarashankar
Puroshkar (1991), Sahitya Setu Puraskar (1995), Ananda and Siromoni
Award (1997).
The excerpt in the sample is taken from Maitreyo jatak which reflects
her stylized use of language, her strong sense of history and sociology
and her excellent craftsmanship.
Lila Majumdar
Lila Majumdar, one of our best and best-loved children’s writers in
Bengali, was born in a famous Brahmo milieu in 1908. As a young
woman, she was a stellar student of English literature, topping the
Calcutta University MA. Her restless creativity did not allow her to
settle into the discipline of teaching, but she had distinguished stints of
school and college teaching, having been head-hunted by Rabindranath
Tagore. She wrote bestselling cookbooks and household hint books,
which are benchmarks of excellence in their field, worked successfully
for years in All India Radio (1956 – 1963), and took an active interest
in social welfare activities organized by pioneering civil society
organizations.
Her children’s books, such as Din Dupure, Padipisir Barmi Baksa, and
Halde Pakhir Palak are some of the best fantasy, adventure, and ghost
stories in Bangla; their sensitivity and zany imagination have kept readers
in thrall for decades. Another aspect of her oeuvre is the lesser-known but
beautifully written works she penned in the romantic suspense and
female gothic mode. She also wrote autobiographical works that allow us
to see how her imaginative and creative worlds blossomed. She received
several awards for children’s novels. The excerpts here are taken from a
children’s novel.
Saiyed Mustafa Siraj

Saiyed Mustafa Siraj, one of the better known writers in Bengali, was
born on 14 October 1930 at Khoshbashpur village in Murshidabad
district. He attended school in the village and for higher studies came to
Calcutta, where he settled down. He was long associated with the
Ananda Bazar group as a journalist. After his retirement, he dedicated
himself fully to creative literary writing.
At the beginning of his literary career he wrote poems, but his
versatility in Bengali short stories caught the attention of critics as
well as the general public. He is the author of over 170 books, his first published
work being Nil Gharer Nati. He has received many accolades including
the Ananda, Bibhuti Smriti awards, and the Narsimha Das award of
Delhi University. The sample is taken from Alik Manush, which received
many awards, including the Bhualka award (1990) and the Sahitya Akademy
award and Bankim Purashkar (1994).
Sunil Gangopadhyay
Sunil Gangopadhyay, perhaps the most popular living Bangla author,
was born on 7 September 1934 at Faridpur, now in Bangladesh. He
received his Master’s degree in Economics from the University of
Calcutta in 1954. He is currently associated with the Ananda Bazar group, a
major publishing house in Calcutta.
The author of well over 200 books, Sunil Gangopadhyay has excelled in
different genres. He is also the founder-editor of Krittibaas, a seminal
poetry magazine which became a platform for a new generation of
poets. He is also known for his unique style in prose. Eka Ebong
Koyekjon is one of his well-known works of fiction. Sei Somoy (Those
Days), a work of historical fiction, received the Sahitya Akademy
Award in 1985. He has also written travelogues, children’s fiction,
novels and essays. Among his pen names are Nil Lohit, Sanataan
Pathak and Nil Upadhyay. The sample is taken from a children’s
adventure story.
Shyamdulal Kundu
He is an intellectual who occasionally writes articles about Bangla
grammar. The excerpt here is taken from an article that states the rules of
spelling in Bangla.
Somprokash Bandopadhyay
He is a computer specialist who writes occasionally. The excerpt here is taken
from his book on computer science.
