
2013 International Conference on Asian Language Processing

Evaluating the Difficulty of Concepts on Domain Knowledge Using Latent Semantic Analysis

Tao-Hsing Chang
Dept. of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C
changth@kuas.edu.tw

Yao-Ting Sung
Dept. of Educational Psychology and Counseling
National Taiwan Normal University
Taipei, Taiwan, R.O.C
sungtc@ntnu.edu.tw

Yao-Tung Lee
Dept. of Educational Psychology and Counseling
National Taiwan Normal University
Taipei, Taiwan, R.O.C
ddt0320@ntnu.edu.tw

Abstract- In educational research and its applications, evaluating the difficulty of concepts is necessary but hard to carry out. Previous research has used two major approaches to evaluate concept difficulty, and neither takes into account whether learners and readers have actually acquired the concepts. This paper focuses on constructing a list of basic concepts of domain knowledge with latent semantic analysis (LSA) and uses the age of acquisition of a concept to represent its difficulty. Natural science texts from elementary schools in Taiwan serve as experimental materials to verify the validity of the proposed method for evaluating concept difficulty.

Keywords- concept difficulty; LSA; age-of-acquisition.

I. INTRODUCTION

In educational research and its applications, evaluating the difficulty of concepts is necessary but hard to carry out. For example, the difficulty of the concepts in a text often affects students' reading comprehension. Hence, teachers should use concepts that students have already acquired when designing reading materials that teach new concepts. In addition, domain-specific concepts are the fundamental elements of domain knowledge in subjects such as social studies and science, and the sequence in which concepts are taught has to take their difficulty into account for learning to be effective. Therefore, constructing an effective method for evaluating the difficulty of concepts is a crucial issue.

Previous research has used two major approaches to evaluate the difficulty of concepts: an ontology-based approach and a word-based approach. Both aim to discover a list of domain concepts and then evaluate the difficulty of each concept. In the first approach, an ontology is a hierarchical tree structure that describes the relations between domain concepts, and the location of a concept in the ontology determines its difficulty. For instance, Yan, Song & Li [1] treated Medical Subject Headings (MeSH), a hierarchical database of medical descriptors, as a conceptual database. After identifying each MeSH concept in a medical text, their method computed the distance from the concept to the bottom of the hierarchical tree. Using this distance, that is, the depth of the concept in the hierarchical tree, the document scope (DS) of each text could be obtained. Furthermore, Zhao & Kan [2] used the Math World Encyclopedia to establish a list of hierarchical domain concepts and computed the difficulty of each concept with the Flesch-Kincaid Reading Ease formula.

Although the ontology-based approach takes into account the hierarchical relationships among concepts in domain knowledge, it has two limitations. First, some fields have no available hierarchical concept list, which makes it difficult to evaluate concept difficulty in those fields. Second, studies based on hierarchical concepts consider only the location of a concept in the hierarchical tree; they do not take into account whether the reader has acquired the domain concepts.

The second approach uses word lists to evaluate domain-specific texts. It applies the familiarity and rarity of domain words to evaluate the difficulty of each word in a domain-specific text. Its assumption is that the less frequently a word occurs in a language, the more complex the word is and the harder it is to understand; the concept that the word refers to may therefore also be harder to understand. For instance, Borst, Gaudinat, Grabar & Boyer [3] used word frequency and the complexity of the word within the domain as two variables to compute the complexity of words, and with these two variables they were able to evaluate the difficulty of sentences and texts in online medical documents. Word-list methods also have two limitations. First, the assumption has counterexamples: the word "tooth brush" is rare in a large-scale balanced corpus, yet it denotes an easy and simple concept. Second, this approach likewise does not take into account whether the reader has acquired the domain concepts.

978-0-7695-5063-3/13 $26.00 © 2013 IEEE
DOI 10.1109/IALP.2013.58

These previous studies indicate that a concept appears in the form of a word. A word introduced to express an abstract concept is called a conceptual word. If a pupil has learned a concept, the pupil understands both the meaning of the concept and the conceptual word. For example, the concept "fruit" covers apple, pineapple, banana and tomato. A pupil might know the common property of these individual fruits without knowing the word that names this property, "fruit". Conversely, a pupil might have heard the word "fruit" without understanding its meaning. In neither case can it be concluded that the pupil has learned the concept "fruit".

The age of acquisition of conceptual words is considered an important component of concept difficulty. Previous studies have shown that the age of acquisition of a word affects how the word is processed. The earlier a word was acquired, the faster it is processed in the brain; conversely, a word learned later is processed more slowly. This is termed the age-of-acquisition effect [4]. Some studies [5][6] showed that the age of acquisition of words is a crucial variable affecting the cognitive processing of words. Moreover, one study [7] showed that the age of acquisition of words has an impact on children's understanding of important concepts in domain texts. In other words, how fast a pupil can process a word is evidence of concept difficulty, so it is reasonable to use the age of acquisition of a conceptual word to represent concept difficulty. Since most learners acquire this knowledge from textbooks, the age of acquisition of a fundamental conceptual word of domain knowledge can be determined as the time at which students learn it from textbooks.

Summarizing the above discussion, this paper focuses on constructing a list of basic concepts of domain knowledge with latent semantic analysis (LSA) and uses the age of acquisition of a concept to represent its difficulty. LSA [8] is an algorithm for semantic retrieval. It first generates a matrix consisting of the frequency of each word in different documents. The algorithm then transforms this matrix into a matrix called the latent semantic space. Words and texts can all be mapped to semantic vectors in the semantic space; the closer two semantic vectors are, the more similar the two words or texts are. Using LSA, this paper collects the words that are semantically related to the domain as domain concepts. This paper also proposes an algorithm that estimates the age of acquisition of conceptual words as the difficulty of the concepts.

The rest of this paper is organized as follows. Section II explains how the list of conceptual words and their age of acquisition can be determined. Section III examines the validity of the methodology. The final section presents the discussion and conclusions.

II. METHODOLOGY

Our framework can be divided into three stages, as shown in Figure 1. The first stage establishes the latent semantic space of the domain knowledge. The second stage uses the latent semantic space to compute the semantic vector of each word occurring in the textbooks and the semantic vector of the textbooks of each grade level. The third stage uses the semantic vectors from the second stage to calculate the semantic similarity between the textbooks of each grade level and each word. The following subsections give the technical details of each stage.

[Figure 1. Stage 1: domain documents, generating the term-document matrix, SVD, latent semantic space. Stage 2: the text set of grade k and words 1 to n are each transferred to semantic vectors. Stage 3: VSM gives the semantic similarity between words and text sets; the conceptual words of each grade are retrieved and the difficulty of each conceptual word is evaluated, yielding the difficulty of concept w.]

Fig. 1. The framework for computing the difficulty of concept

A. Constructing Latent Semantic Space of Domain Knowledge

The first stage of the proposed method constructs the latent semantic space of the domain. It consists of two procedures. The first uses domain texts to generate the term-document matrix. The second uses SVD to transform the term-document matrix into the latent semantic space of the domain. The term-document matrix, a two-dimensional matrix as shown in Figure 2, describes the relation between documents and the words in them. In Figure 2, $d_1$ to $d_n$ represent the first to the $n$-th documents; $T_1$ to $T_m$ denote the $m$ words occurring in all documents; and $e_{i,j}$ is the frequency of word $T_i$ in document $d_j$. The matrix $A$ is called the term-document matrix.
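
A minimal Python sketch of the term-document matrix described above may make the definition concrete. It is not the authors' implementation. It assumes each document has already been segmented into a list of words (the paper does not describe its tokenization), and the function name `build_term_document_matrix` and the toy documents are hypothetical.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] is the frequency of
    terms[i] in documents[j]. Each document is a list of word strings."""
    counts = [Counter(doc) for doc in documents]
    # Rows follow the sorted vocabulary; this ordering is an arbitrary choice.
    terms = sorted(set().union(*documents))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Toy, pre-segmented documents (hypothetical data, for illustration only).
docs = [
    ["plant", "water", "sun", "plant"],
    ["water", "ice", "steam"],
    ["sun", "moon", "plant"],
]
terms, A = build_term_document_matrix(docs)
for term, row in zip(terms, A):
    print(term, row)
```

Each row corresponds to one word and each column to one document, matching the layout of Figure 2.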

$$
A_{m \times n} =
\begin{array}{c|ccc}
       & d_1     & \cdots & d_n     \\ \hline
T_1    & e_{1,1} & \cdots & e_{1,n} \\
\vdots & \vdots  & \ddots & \vdots  \\
T_m    & e_{m,1} & \cdots & e_{m,n}
\end{array}
$$

Fig. 2. Example of Term-Document Matrix

Because the term-document matrix is huge, SVD, a well-known operation in linear algebra, is applied to transform the term-document matrix into a term-semantic matrix. The transformation is shown in formula (1),

$$A_{m \times n} = U_{m \times r} \times \Sigma_{r \times r} \times V^{T}_{r \times n} \qquad (1)$$

where $r$ is the number of semantic dimensions and is usually far less than $n$.

Using formula (1), the term-document matrix $A$ can be divided into three smaller matrices, $U$, $\Sigma$ and $V$. Matrix $U$ represents the relation between the $m$ words and the dimensions. Matrix $\Sigma$, called the singular value matrix, is a diagonal matrix whose diagonal entries are the singular values, which represent the importance of each semantic dimension to the term-document matrix. Matrix $V$ represents the importance of each text with respect to each dimension. The result of the matrix multiplication $U \times \Sigma$ is called the latent semantic space of matrix $A$.

B. Computing Semantic Vectors

Through the latent semantic space, any word, text or set of texts can be transformed into a semantic vector that represents its association within the domain knowledge. The semantic vector of a word, text or set of texts is obtained with formula (2). Assume that the column vector $W$ is the word vector and that entry $w_{i,1}$ of $W$ is the frequency of the $i$-th word in the text set. The semantic vector $\hat{S}$ of the text set is

$$\hat{S}_{1 \times r} = W^{T}_{1 \times m} \times U_{m \times r} \times \Sigma_{r \times r} \qquad (2)$$

For a word that is the $i$-th word in the domain knowledge, the entry $w_{i,1}$ of the word's vector $W$ is the frequency of the word in all documents, and all other entries of $W$ are 0. The semantic vector of a set of texts corresponding to a particular grade is also generated with formula (2); in this case the entry $w_{j,1}$ of the vector $W$ of the set is the frequency of the $j$-th word in the text set.

Assume a semantic vector $\hat{S}_w$ of word $w$ and a semantic vector $\hat{S}_d$ of the text set corresponding to a grade. If both $\hat{S}_w$ and $\hat{S}_d$ have high values on the same dimensions and low values on the other dimensions, the word and the text set may express the same subtopic. In other words, the word is a key term in the texts of that grade, and a high semantic similarity between the word and the set should be measured. Based on this observation, the vector space model (VSM) [16], shown in formula (3), is employed to compute the similarity between the semantic vectors $\hat{S}_w$ and $\hat{S}_d$.

$$\operatorname{cosine}(\hat{S}_w, \hat{S}_d) = \frac{\hat{S}_w \cdot \hat{S}_d}{\lVert \hat{S}_w \rVert \, \lVert \hat{S}_d \rVert} \qquad (3)$$

The cosine value calculated by VSM falls between -1 and 1. A value closer to 1 indicates more similar semantics, while a value closer to 0 indicates less related semantics.

C. Computing the Difficulty of Conceptual Words

Using formula (3), the semantic similarity between each word and the texts of each grade can be measured. The words that are strongly related to the texts of a grade are regarded as the conceptual words of the domain knowledge for that grade. Therefore, given a grade $k$, the proposed method computes the semantic similarity between the text set of grade $k$ and each word. All words are then ranked by similarity, and the top n percent of the words are selected as the conceptual words for grade $k$.

Words in domain texts can be classified into three categories. In the first category, the word is not a conceptual word in the texts of any grade. This suggests that the word is an auxiliary term for learning concepts. It is not a fundamental concept in the domain, even though it may be a significant concept for adults or a basic concept in other domains. Its age of acquisition and difficulty are therefore not discussed here. In the second category, the word is identified as a conceptual word of a single grade. This means that the concept of the word is well formed at that grade even though it is first learned there. Thus, the grade is assigned as the age of acquisition of the concept, which is also the difficulty of the concept.

In the third category, the word is identified as a conceptual word in the texts of several grades. For example, suppose a word is a conceptual word for grade 3 and grade 5. The difficulty of the conceptual word is assigned as grade 3 for two reasons. First, the word would have been acquired at grade 3; at grade 5, the concept is used to explore new concepts. Hence, the age of acquisition of the word should be grade 3, not grade 5. Second, the word may be composed of multiple concepts, so pupils learn different concepts derived from the word at grade 3 and at grade 5. This paper assigns grade 3, the first grade of acquisition of the conceptual word, as the difficulty of the conceptual word.

Summarizing the discussion above, the concept difficulty CpDi of a conceptual word $w$ is determined as follows,

$$\mathrm{CpDi}(w) = \min \{\, i \in G : w \in T_i \,\} \qquad (4)$$

where $G$ is the set of grades and $T_i$ is the set of conceptual words in the texts of grade $i$.

III. EXPERIMENTS

In this section, science textbooks used in elementary schools in Taiwan are employed to analyze the performance of different methods. The textbooks contain 261 documents. The statistical description of the textbooks for the different grades is listed in Table 1. This study randomly selected 56 conceptual words from the textbooks as experimental materials. An expert who had taught science in grades 3 to 6 for several years was asked to evaluate the difficulty of the 56 conceptual words. The difficulty of
each word was identified as class 3, 4, 5 or 6. Class 3 means that the conceptual word can be recognized and learned by third graders, while class 6 means that the conceptual word can be recognized and learned by sixth graders.

Table 1. Statistical description for experimental materials

Grade                       3     4     5     6
Number of documents        70    67    64    60
Average number of words   310   397   578   593

To contrast the performance of previous methods with that of the proposed method, the difficulty of the 56 conceptual words was also computed with the following word-list method. First, the 1398 words in the textbooks are ranked by their frequency in the Sinica Balanced Corpus, so that the frequency of word T1 is the highest and that of word T1398 is the lowest. Second, these words are divided into quarters by rank: T1 to T349 fall in the first quarter, T350 to T699 fall in the second quarter, and so on. Finally, the difficulty of a conceptual word is determined by the label of the quarter in which the word falls. For instance, the difficulty of word T1 is tagged as class 3, while that of T1398 is tagged as class 6. Table 2 shows the number of conceptual words classified into each grade by the word-list method and by the proposed method.

Table 2. The number of conceptual words

Grade                   3     4     5     6
Word list               9    18    15    14
The proposed method    12    11    15    18

This study adopts the Spearman rank correlation coefficient to measure the agreement between the difficulties estimated by the expert and by each method. Table 3 compares the evaluations of the word-list method and the proposed method. The results show a significant correlation between the evaluations of the expert and those of the proposed method. In contrast, there is no significant correlation between the evaluations of the expert and those of the word-list method.

Table 3. The comparison between the performances of word list method and the proposed method

Method                 Spearman    p-value
Word list               -.009       .949
The proposed method      .598      < .001

IV. DISCUSSION AND CONCLUSION

The experimental results show that the proposed method can retrieve conceptual words and evaluate their difficulty. It overcomes the limitations of previous methods because it discovers conceptual words from domain knowledge automatically. In addition, the results show that the difficulty determined by the proposed method is closer to the judgment of the expert than that of the word-list method. This may be because ontology-based and word-list methods do not take into account the age of acquisition of conceptual words, which leaves a gap between those methods and expert judgment.

Much future research can build on this work. The proposed method can be used to analyze other domains of knowledge, such as social studies, and to explore the differences in conceptual words between domains. In addition, this study only explored the properties of fundamental conceptual words in the textbook series for grades 3-6. The proposed method could also be used to evaluate the concept difficulty of the words in textbooks for grades 7-12.

ACKNOWLEDGMENT

This work is partially supported by the "Aim for the Top University Project" of the National Taiwan Normal University and the Ministry of Education, Taiwan, R.O.C.

REFERENCES

[1] X. Yan, D. Song, and X. Li, "Concept-based document readability in domain specific information retrieval," Proceedings of the 15th ACM international conference on Information and knowledge management, 2006.
[2] J. Zhao and M. Y. Kan, "Domain-specific iterative readability computation," Proceedings of the 10th annual joint conference on Digital libraries, 2010.
[3] A. Borst, A. Gaudinat, C. Boyer, and N. Grabar, "Lexically based distinction of readability levels of health documents," Poster at MIE, 2008.
[4] J. B. Carroll and M. N. White, "Word frequency and age-of-acquisition as determiners of picture-naming latency," Quarterly Journal of Experimental Psychology, vol. 25, pp. 85-95, 1973.
[5] C. M. Morrison and A. W. Ellis, "The roles of word frequency and age of acquisition in word naming and lexical decision," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 21, pp. 116-133, 1995.
[6] C. M. Morrison, T. D. Chappell, and A. W. Ellis, "Age of acquisition norms for a large set of object names and their relation to adult estimates and other variables," Quarterly Journal of Experimental Psychology: Human Experimental Psychology, vol. 50A, pp. 528-559, 1997.
[7] A. Biemiller and N. Slonim, "Estimating Root Word Vocabulary Growth in Normative and Advantaged Populations: Evidence for a Common Sequence of Vocabulary Acquisition," Journal of Educational Psychology, vol. 93, pp. 498-520, 2001.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[9] T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to latent semantic analysis," Discourse Processes, vol. 25, pp. 259-284, 1998.
[10] K. S. Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, pp. 11-21, 1972.
[11] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, pp. 513-523, 1988.
[12] K. Kireyev and T. K. Landauer, "Word Maturity: Computational Modeling of Word Knowledge," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 299-308, 2011.
[13] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng, "Text summarization using a trainable summarizer and latent semantic analysis," Information Processing and Management, vol. 41, pp. 75-95, 2005.
[14] Y. T. Sung, J. L. Chen, Y. T. Lee, Y. S. Lee, C. Y. Peng, H. C. Tseng, and T. H. Chang, "Constructing and Validating a Readability Model with LSA: A Case Study of Chinese and Social Science Textbooks," Proceedings of the 22nd Annual Meeting of Society for Text and Discourse Process, Montreal, Canada, 2012.
[15] A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai, "Coh-Metrix: Analysis of text on cohesion and language," Behavior Research Methods, Instruments, & Computers, vol. 36, pp. 193-202, 2004.
[16] T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch, "Handbook of Latent Semantic Analysis," Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2007.
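
As a companion to Section II, the third stage of the method (the cosine similarity of formula (3), the selection of the top n percent of words as conceptual words of a grade, and the rule that a word's difficulty is the first grade in which it is a conceptual word) can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names, the toy two-dimensional semantic vectors, the 50 percent cut-off, and rounding the cut-off up with a ceiling are all assumptions; the paper does not report these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two semantic vectors, as in formula (3)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def conceptual_words(word_vectors, grade_vector, top_percent):
    """Rank words by similarity to a grade's text set and keep the top
    `top_percent` percent (rounded up; the rounding rule is an assumption)."""
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(word_vectors[w], grade_vector),
                    reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    return set(ranked[:keep])


def concept_difficulty(word, conceptual_by_grade):
    """Lowest grade whose conceptual-word set contains `word`, or None if
    the word is not a conceptual word of any grade."""
    grades = [g for g, words in conceptual_by_grade.items() if word in words]
    return min(grades) if grades else None


# Hypothetical semantic vectors; real ones would come from formula (2).
word_vectors = {"plant": [1.0, 0.0], "water": [0.8, 0.6], "magnet": [0.0, 1.0]}
grade_vectors = {3: [1.0, 0.1], 5: [0.1, 1.0]}

by_grade = {g: conceptual_words(word_vectors, v, top_percent=50)
            for g, v in grade_vectors.items()}
for word in ["plant", "water", "magnet", "gravity"]:
    print(word, concept_difficulty(word, by_grade))
```

In this toy data "water" is a conceptual word for both grades, so it receives the earlier grade, which mirrors the grade 3 and grade 5 example in Section II.C. A word that is never selected gets `None`, corresponding to the first category of words, whose difficulty is not discussed.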
