Yao-Tung Lee
Dept. of Educational Psychology and Counseling
National Taiwan Normal University
Taipei, Taiwan, R.O.C
ddt0320@ntnu.edu.tw
Abstract— In educational research and its applications, evaluating concept difficulty is necessary but hard to carry out. Previous research has taken two major approaches to evaluating the difficulty of concepts. Neither approach takes into account whether learners and readers have acquired the concepts. This paper focuses on constructing a list of basic concepts of domain knowledge with latent semantic analysis (LSA) and uses the age of acquisition of a concept to represent its difficulty. Natural science texts from elementary schools in Taiwan serve as experimental materials to verify the validity of the proposed method for evaluating concept difficulty.

Keywords- concept difficulty; LSA; age-of-acquisition.

I. INTRODUCTION

In educational research and its applications, evaluating concept difficulty is necessary but hard to carry out. For example, students’ reading comprehension is often affected by the difficulty of the concepts in a text. Hence, teachers should use concepts that students have already acquired when designing reading materials for teaching new concepts. In addition, domain-specific concepts are the fundamental elements of domain knowledge, such as social studies and science, and the sequence of concept learning has to take the difficulty of concepts into account for learning to be effective. Therefore, constructing an effective method for evaluating the difficulty of concepts is a crucial issue.

Two major approaches have been employed in previous research to evaluate the difficulty of concepts: an ontology-based approach and a word-based approach. Both approaches aim to discover lists of domain concepts and then evaluate the difficulty of each concept. In the first approach, an ontology is a hierarchical tree structure that describes the relations between domain concepts, and the location of a concept in the ontology determines its difficulty. For instance, Yan, Song & Li [1] treated the Medical Subject Headings (MeSH) hierarchical database of medical descriptors as a conceptual database. After identifying each MeSH concept in a medical text, their method computed the distance from the concept to the bottom of the hierarchical tree. Using this distance, i.e., the depth of the concept in the hierarchical tree, the document scope (DS) of each text could be computed. Furthermore, Zhao & Kan [2] employed the Math World Encyclopedia to establish a list of hierarchical domain concepts and computed the difficulty of each concept with the Flesch-Kincaid Reading Ease formula.

Although the ontology-based method takes the hierarchical relationships among concepts in domain knowledge into account when evaluating concept difficulty, this approach has two limitations. First, some fields have no available hierarchical concept list, which makes it difficult to evaluate the difficulty of their concepts. Second, research based on hierarchical concepts considers only the location of a concept in the hierarchical tree. It does not take into account whether the reader has acquired the domain concepts.

The second method uses word lists to evaluate domain-specific texts. It applies the familiarity and rarity of domain words to evaluate the difficulty of each word in domain-specific texts. The assumption is that the less frequently a word occurs in a language, the more complex it is and the more difficult it is to understand; therefore, the concept the word refers to may also be more difficult to understand. For instance, Borst, Gaudinat, Grabar & Boyer [3] used word frequency and the complexity of the word within the domain as two variables to compute the complexity of words. With these two variables, they were able to evaluate the difficulty of sentences and texts in online medical documents. Word-list methods have two limitations. First, the assumption has several counterexamples. For example, the word “tooth brush” is rare in a large-scale balanced corpus, but it denotes an easy and simple concept. Second, this method also does not take into account whether the reader has acquired the domain concepts.
[Figure: flowchart of the proposed method. Box labels: Domain Documents; Text set of grade k; Word 1 ... Word n; Generating term-document matrix; SVD; VSM; Latent semantic space; Retrieving conceptual words of each grade; Transferring to semantic vector (words and text sets); Semantic similarity between words and text sets; Evaluating the difficulty of conceptual word; Difficulty of concept w.]
frequency of word Ti occurring in the document dj. The
matrix A is the so-called term-document matrix.
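As a minimal sketch of the term-document matrix described above, the following Python builds a matrix whose entry in row i and column j is the raw count of term i in document j. The whitespace tokenization, the toy English documents, and the function name `term_document_matrix` are assumptions made for illustration only; Chinese textbook texts would first need word segmentation, and any term weighting is left out.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, A), where A[i][j] is the count of terms[i] in documents[j]."""
    # Count tokens per document; splitting on whitespace is an assumption.
    counts = [Counter(doc.split()) for doc in documents]
    # Rows follow the sorted vocabulary so the layout is reproducible.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Hypothetical toy documents, not taken from the experimental materials.
docs = ["water boils water", "plants need water", "plants need light"]
terms, A = term_document_matrix(docs)
for term, row in zip(terms, A):
    print(term, row)
```

Each row of the matrix is the occurrence profile of one word across the documents, which is the input that SVD would then reduce to a latent semantic space.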
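$$\operatorname{cosine}(\hat{S}_w, \hat{S}_d) = \frac{\hat{S}_w \cdot \hat{S}_d}{\lVert \hat{S}_w \rVert \, \lVert \hat{S}_d \rVert} \tag{3}$$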
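Equation (3) can be illustrated with a short Python sketch that compares one word's semantic vector with the semantic vector of each grade's text set. The three-dimensional vectors and the names `cosine`, `word_vec`, and `grade_vecs` are hypothetical; in practice the vectors would come from the latent semantic space, and this excerpt does not specify how the similarities are turned into a difficulty class, so the sketch stops at the similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical semantic vectors for one word and four grade-level text sets.
word_vec = [0.2, 0.7, 0.1]
grade_vecs = {
    3: [0.9, 0.1, 0.0],
    4: [0.5, 0.5, 0.1],
    5: [0.2, 0.8, 0.1],
    6: [0.1, 0.4, 0.8],
}
for grade, vec in grade_vecs.items():
    print(grade, round(cosine(word_vec, vec), 3))
```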
each word was identified as class 3, 4, 5 or 6. Class 3 means that the conceptual word can be recognized and learned by third graders, while class 6 means that it can be recognized and learned by sixth graders.

Table 1. Statistical description of the experimental materials
Grade                      3     4     5     6
Number of documents        70    67    64    60
Average number of words    310   397   578   593

To compare the performance of previous methods with that of the proposed method, the difficulty of the 56 conceptual words was also computed with the following word-list method. First, the 1398 words in the textbooks were ranked by their frequency in the Sinica Balanced Corpus, so that word T1 has the highest frequency and word T1398 the lowest. Second, the ranked words were divided into quarters: T1 to T349 fall in the first quarter, T350 to T699 in the second quarter, and so on. Finally, the difficulty of a conceptual word is given by the label of the quarter in which it falls. For instance, word T1 is tagged as class 3 while T1398 is tagged as class 6. Table 2 shows the number of conceptual words classified into each grade by the word-list method and by the proposed method.

Table 2. The number of conceptual words
Grade                  3     4     5     6
Word list              9     18    15    14
The proposed method    12    11    15    18

This study adopts the Spearman rank correlation coefficient to measure the agreement between the difficulties estimated by the expert and by each method. Table 3 compares the evaluations of the word-list method and the proposed method. The results show a significant correlation between the evaluations of the expert and the proposed method, and no significant correlation between the evaluations of the expert and the word-list method.

Table 3. The comparison between the performances of the word-list method and the proposed method
Method                 Spearman    p-value
Word list              -.009       .949
The proposed method    .598        .000

IV. DISCUSSION AND CONCLUSION

Experimental results show that the proposed method can retrieve conceptual words and evaluate their difficulty. It overcomes the limitations of previous methods because it extracts conceptual words from domain knowledge automatically. In addition, the results show that the difficulty determined by the proposed method is closer to the judgment of experts than that of previous methods. This may be because ontology and word-list methods do not take into account the age of acquisition of conceptual words, which leaves a gap between those methods and the judgment of experts.

In the future, further research can build on this work. The proposed method can be applied to other domains of knowledge, such as social studies, to explore the differences in conceptual words between different domains. In addition, this study only explored the properties of fundamental conceptual words in the textbook series for grades 3-6. The proposed method can also be used to evaluate the concept difficulty of the words in grades 7-12 textbooks.

ACKNOWLEDGMENT

This work is particularly supported by the "Aim for the Top University Project" of the National Taiwan Normal University and the Ministry of Education, Taiwan, R.O.C.

REFERENCES

[1] X. Yan, D. Song, and X. Li, “Concept-based document readability in domain specific information retrieval,” Proceedings of the 15th ACM international conference on Information and knowledge management, 2006.
[2] J. Zhao and M. Y. Kan, “Domain-specific iterative readability computation,” Proceedings of the 10th annual joint conference on Digital libraries, 2010.
[3] A. Borst, A. Gaudinat, C. Boyer, and N. Grabar, “Lexically based distinction of readability levels of health documents,” Poster at MIE, 2008.
[4] J. B. Carroll and M. N. White, “Word frequency and age-of-acquisition as determiners of picture-naming latency,” Quarterly Journal of Experimental Psychology, vol. 25, pp. 85-95, 1973.
[5] C. M. Morrison and A. W. Ellis, “The roles of word frequency and age of acquisition in word naming and lexical decision,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 21, pp. 116-133, 1995.
[6] C. M. Morrison, T. D. Chappell, and A. W. Ellis, “Age of acquisition norms for a large set of object names and their relation to adult estimates and other variables,” Quarterly Journal of Experimental Psychology: Human Experimental Psychology, vol. 50A, pp. 528-559, 1997.
[7] A. Biemiller and N. Slonim, “Estimating Root Word Vocabulary Growth in Normative and Advantaged Populations: Evidence for a Common Sequence of Vocabulary Acquisition,” Journal of Educational Psychology, vol. 93, pp. 498-520, 2001.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[9] T. K. Landauer, P. W. Foltz, and D. Laham, “Introduction to latent semantic analysis,” Discourse Processes, vol. 25, pp. 259-284, 1998.
[10] K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, pp. 11-21, 1972.
[11] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, pp. 513-523, 1988.
[12] K. Kireyev and T. K. Landauer, “Word Maturity: Computational Modeling of Word Knowledge,” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 299-308, 2011.
[13] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng, “Text summarization using a trainable summarizer and latent semantic analysis,” Information Processing and Management, vol. 41, pp. 75-95, 2005.
[14] Y. T. Sung, J. L. Chen, Y. T. Lee, Y. S. Lee, C. Y. Peng, H. C. Tseng, and T. H. Chang, “Constructing and Validating a Readability Model with LSA: A Case Study of Chinese and Social Science Textbooks,” Proceedings of the 22nd Annual Meeting of Society for Text and Discourse Process, Montreal, Canada, 2012.
[15] A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai, “Coh-Metrix: Analysis of text on cohesion and language,” Behavior Research Methods, Instruments, & Computers, vol. 36, pp. 193-202, 2004.
[16] T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch, “Handbook of Latent Semantic Analysis,” Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2007.
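As a rough illustration of the word-list baseline and the Spearman comparison used in the experiment, the Python sketch below ranks words by corpus frequency, labels each quarter of the ranking as class 3, 4, 5 or 6, and correlates the labels with a set of ratings. The word frequencies and the "expert" ratings are invented toy values, the function names are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of average ranks; none of this is the authors' code or data.

```python
import math


def quartile_classes(freq_by_word, labels=(3, 4, 5, 6)):
    """Rank words by descending frequency and label each quarter of the ranking."""
    ranked = sorted(freq_by_word, key=freq_by_word.get, reverse=True)
    n = len(ranked)
    classes = {}
    for rank, word in enumerate(ranked, start=1):
        # Quarter k ends at rank floor(k * n / 4); 1398 words split at 349, 699, 1048.
        quarter = next(k for k in range(1, 5) if rank <= k * n // 4)
        classes[word] = labels[quarter - 1]
    return classes


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Toy corpus frequencies and toy expert ratings (both invented for illustration).
toy_freq = {"water": 900, "plant": 700, "magnet": 400, "circuit": 300,
            "erosion": 120, "lever": 90, "photosynthesis": 20, "refraction": 10}
toy_expert = {"water": 3, "plant": 3, "magnet": 4, "circuit": 6,
              "erosion": 5, "lever": 4, "photosynthesis": 5, "refraction": 6}

baseline = quartile_classes(toy_freq)
words = sorted(toy_freq)
rho = spearman([baseline[w] for w in words], [toy_expert[w] for w in words])
print(baseline)
print("Spearman rho:", round(rho, 3))
```

The sketch reports only the coefficient; a significance test such as the p-values in Table 3 would need an additional step that is not shown here.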