
A New Dynamic Weight Assignment Schema for Index Terms Based on Statistical Approach

1Abad Shah, 2Syed Ahsan, 3Amjad Farooq
1,2 Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
3 Department of Computer Science & Engineering, UET, Lahore

Abstract - During the development of Information Retrieval (IR) systems, weights are assigned to an extracted set of index terms, and these weights are then used for different purposes such as indexing, partial matching, and computing rankings. The weight assignment schemes are usually provided by the IR models that are selected and used in the development of IR systems. Currently available weight assignment schemes are static, which means that once weights are assigned to index terms, their values never change during the whole life-span of an IR system. We have previously proposed a dynamic weight assignment scheme that can be used as part of any IR model; that earlier scheme is empirical and intuitive. In this paper, we propose another dynamic weight assignment scheme based on a statistical approach. In our opinion, this new scheme can give better performance than our previous scheme because it rests on a proper statistical foundation.

Index Terms - Standard deviation, weight assignment scheme, information retrieval, dynamic weight
1. INTRODUCTION
Currently on the Web, the growth rate in the volume of information is tremendous; it is estimated at 300% per annum. If this trend continues, then retrieval of relevant information will become a serious problem in the near future. Many efforts have been made to develop efficient information retrieval (IR) systems on the Web [3, 5, 17, 18]. In spite of all these efforts, the results obtained with currently available indexing schemes are not very encouraging: it has been estimated that, on average, only 30% of the returned documents (or information) are relevant to a user's need, while the remaining 70% of the relevant documents in the collection are never returned [21]. These figures are well below an acceptable level. In the existing indexing schemes, each document in a collection is represented by a set of meaningful terms, called descriptors, index terms or keywords, which are believed to express the content of the document. A set of index terms is extracted from a collection of text documents, and the terms are assigned numerical values, called weights, using a method/scheme provided by an Information Retrieval (IR) model such as the Boolean Model, Vector Model, Probabilistic Model or their extensions. An IR model is selected during the development of an IR system, and the weights are assigned to the set of index terms during that development [6, 8, 9]. There are many sources of irrelevancy, which we have identified and described in [23], but the major drawback of using an index-term-based indexing and retrieval scheme is the low retrieval performance. One of the reasons for low retrieval performance is that index terms capture only partial information about a document. In other words, a set of index terms extracted from a document does not completely represent the contents of the document; therefore, using a set of index terms as the sole basis of indexing and retrieval is inappropriate. There are many other sources of irrelevancy in the currently available IR systems (for details see [23]). To achieve a reasonable level of retrieval performance in an IR system, it is essential to capture more semantic information about documents. Some attempts have already been made to improve the traditional indexing schemes using Natural Language Processing [20], logic [5, 7] and document clustering [4], and they have gained some improvements.
Retrieval performance of an IR system can also be enhanced by improving two development processes of IR systems: i) the extraction of a good representative set of index terms from text documents, and ii) the weight assignment scheme. The weight assignment schemes provided by the currently available IR models are static, which means that once weights are assigned to index terms, they remain unchanged. In this paper, we address the second process, i.e., the weight assignment scheme, and make it dynamic in order to improve the retrieval performance of an IR system.
To make a weight assignment scheme dynamic, we identify the two main independent groups (or entities) which can influence the retrieval performance of the system. The first group is the group of document writers, and the second group is the group of users of the IR system after its development. These two groups differ in their

characteristics, including the times at which they participate. In our opinion, the retrieval performance of an IR system can be improved by incorporating the participation of the second group (the users of the IR system). The first group participates during the development of an IR system, but its nature is semi-static, which means that generally its participation occurs only during the development of the system. In other words, the first group participates once in the life-span of an IR system and cannot influence the retrieval performance of the system after its development. However, there is a possibility to improve the retrieval performance of an IR system after its development by incorporating the participation of the second group in the IR system. This objective can be achieved by translating the participation of the second group and quantifying it as an increment in the weights of the index terms. As a first attempt towards this purpose, we proposed an intuitive dynamic weight assignment scheme in [24]. That scheme works at the level of the individual index terms used in a user need/query and increases the weights of only those index terms that are used and matched in the index file. In this paper, we propose another dynamic weight assignment scheme that works at the query level and increases the weights of those index terms of the query that are matched in the index file. This new scheme is based on the statistics generated by the index terms of a query and the matched/relevant index terms of the query. We expect that this new weight assignment scheme can give better results than our previously proposed scheme (for details see [24]).
The remainder of this paper is organized as follows. In Section 2 we give the related work. We give a summary of our previous dynamic weight assignment technique in Section 3. In Section 4, we propose our new statistics-based dynamic weight assignment scheme. Finally, in Section 5, we give concluding remarks and future directions.

2. RELATED WORK
In this section, we give a summary and analysis of a few available weight assignment schemes. All reported weight assignment schemes are static, which means that once weights are assigned to index terms during the development of an IR system, they remain unchanged over the life-span of the IR system. Note that it is the responsibility of an IR model to provide the weight assignment scheme for the set of index terms.

2.1 Classical Weight Assigning Schemes
The classical Information Retrieval (IR) models (i.e., the Boolean, Vector and Probabilistic IR models) and their extensions provide different weight-assigning schemes [2]. Both the Boolean and Probabilistic models assign Boolean values as weights to the set of index terms, and the Vector IR model assigns real values from the closed interval [0, 1] as weights to the set of index terms [2, 4, 17]. All these classical IR models and their extensions are static weight assignment schemes.

2.2 O'Donnell Weight Assignment Scheme
Mann and Thompson proposed Rhetorical Structure Theory (RST) for a better semantic understanding of English text [10]. This theory splits the sentences of a text into two spans, referred to as the nucleus and the satellite, and then identifies relationships, called the RST relations/relationships, between these two identified spans [10]. O'Donnell used the RST relationships for indexing purposes and proposed a weight assignment scheme for the RST relationships. This scheme first extracts the RST relationships in pairs from a text document [15]. These extracted relationships are then transformed into a tree structure, in which an internal node denotes a pair of the RST relationship and a leaf node denotes the hierarchical structure of the text (for details see [5]). The weight of the relationship pair at the root node is assigned intuitively, and the weights of the nodes at the lower levels are assigned as the product of the weights of the relationship pair at the immediately higher level. The weights of the RST relationships are assigned as real numbers from the closed interval [0, 1]. Note that in this scheme weights are assigned to the RST relationships and not to index terms, because here RST relations rather than index terms are extracted from the text documents. This is also a static weight assignment scheme.
However, there are some cases in which this weight assignment scheme fails, because non-clarity does not always reflect the centrality of information. Sometimes an author writes information at a rhetorically unimportant place in the text, yet that information may be needed later to understand the argument. In such cases, it becomes difficult to locate the RST relationships in those text documents. Other investigators have applied similar schemes for weight assignment to the RST relationships [14, 16]. A small illustrative sketch of this multiplicative propagation is given below.
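To make the propagation concrete, the following minimal sketch is our own illustration of the idea, not O'Donnell's implementation: each node receives the product of the relationship-pair weights on the path from the root. The tree shape, relation names and numeric weights are hypothetical.

```python
# Minimal sketch of multiplicative weight propagation over an RST tree.
# All relation names, numeric weights and the tree itself are hypothetical examples.

class RSTNode:
    def __init__(self, label, pair_weight, children=None):
        self.label = label              # relationship pair (internal) or text span (leaf)
        self.pair_weight = pair_weight  # weight of this node's relationship pair, in [0, 1]
        self.children = children or []
        self.weight = 0.0               # propagated weight, filled in by propagate()

def propagate(node, inherited=1.0):
    """Assign node.weight as the product of pair weights on the path from the root."""
    node.weight = inherited * node.pair_weight
    for child in node.children:
        propagate(child, node.weight)

# Hypothetical two-level tree: the root pair is weighted intuitively at 0.9.
tree = RSTNode("elaboration", 0.9, [
    RSTNode("nucleus span", 1.0),
    RSTNode("evidence", 0.7, [
        RSTNode("nucleus span", 1.0),
        RSTNode("satellite span", 0.5),
    ]),
])

propagate(tree)
print(tree.children[1].children[1].weight)  # deepest satellite span, approx. 0.315
```

With these hypothetical weights, the deepest satellite span receives roughly 0.9 * 0.7 * 0.5 = 0.315, so rhetorically peripheral material ends up with a smaller weight than material near the root.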
2.3 Semantic Vector Space Model
In [8], Liu proposed a Semantic Vector Space Model (SVSM) for text representation and its subsequent searching and retrieving. This model is based on the combination of the mathematical concept of vector space, heuristic syntax parsing and a distributed representation of semantic structures [8]. In this model, both text documents and queries (or user needs) are represented as semantic matrices, and the search mechanism is designed to compute the similarity between the two semantic

matrices (document and query) to predict the level of relevancy. A prototype system has been developed based on this model by modifying the SMART system and using the Xerox Part-of-Speech (P-O-S) tagger as the preprocessor of the indexing process. This prototype system was used in an experimental study to evaluate the retrieval performance of this technique and its effectiveness for relevance ranking. The results of this study show that if both documents and queries are short (typically less than two lines in length), then this technique is less effective. But with large documents and queries, especially when original documents are used as queries, the system gives significantly better performance than the SMART system.
The SMART system is one of the first and best available IR systems. It was developed by Gerard Salton at Cornell University using the vector space model for representing and querying documents. Interestingly, no weight assignment scheme is mentioned.

3. OUR PREVIOUS DYNAMIC WEIGHT ASSIGNMENT SCHEME
In this section, we give a summary of our previously proposed dynamic weight assignment scheme [24]. We refer to this scheme as the dynamic weight assignment scheme (or simply the dynamic scheme).

3.1 Dynamic Scheme
This scheme incorporates the participation of the second group (users) in a developed IR system, as we have already explained in Section 1. The participation is incorporated in an IR system as increments in the existing weights of index terms. Note that only the weights of those index terms of the query that are matched in the index file get increments. The increment in weight is a function of time, which means that the weight of an index term is increased each time the index term is referred to in a user query and matched in the index file.
Suppose that the existing/current weight We of the index term ITi was assigned at the time instance tp, and later, at the time instance tc (tp < tc), a query refers to the index term; its weight is then incremented and a new value Wl (the latest weight) replaces We after adding the increment. To compute the new weight Wl, we propose Equation (1).

Wl = We + increment ........ (1)

In Equation (1), increment is the amount that is added to the weight We. It is defined as a function of the time duration (or time interval) between the two consecutive references and matchings made to the same index term at the two different time instances tp and tc, respectively. The amount of increment in the weight of an index term is computed by Equation (2).

increment = We * 1 / (|tc - tp|)^n, where |tc - tp| ≠ 0 ........ (2)

In Equation (2), we have given an empirical formula to compute the increment; as mentioned earlier, We is the weight of the index term ITi before the increment. If the boundary conditions hold, i.e., |tc - tp| = 0 or 1, then there is no increment in the weight. The time duration |tc - tp| in Equation (2) ensures a linear increment in weight in the case that an index term is being frequently referred to and matched. In other words, the linear increment restricts the weight of an index term from reaching its maximum value too quickly. The real number n is a controlling (or normalizing) power that keeps the increment within its boundaries. The value of the number n is computed by the empirical formula given in Equation (3).

n = floor(We / 2) ........ (3)

In Equation (3), we use the value of the existing weight We of an index term in computing the number n. Note that a minimum increment of 0.1 is allowed in a weight; if the computed increment is less than 0.1, then there is no change in the weight and it keeps its previous value. Also, if the weight of an index term attains its maximum value, then obviously there is no further increment in the weight.
As mentioned in [24], this scheme continuously updates the weights of those index terms of the index file which are referred to and matched by user queries during the life-span of an IR system. This dynamic updating of weights captures the user participation in an IR system. The problems of system stabilization and saturation of the scheme have been pointed out and their reasons are given in [24]. It is also necessary to mention that this scheme has no mathematical rationale. A short sketch of this update rule is given below.
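As a concrete illustration, here is a minimal sketch (ours, not the implementation reported in [24]) of the update rule of Equations (1)-(3) for a single index term. The 0.1 minimum-increment rule and the maximum-weight cap follow the description above, while the time instances are assumed to be simple query counters and the maximum weight value itself is a hypothetical choice.

```python
import math

MAX_WEIGHT = 10.0     # hypothetical maximum weight of an index term (not specified above)
MIN_INCREMENT = 0.1   # increments smaller than this are ignored, as described above

def updated_weight(w_e, t_p, t_c):
    """Previous dynamic scheme: Wl = We + We / |tc - tp|**n with n = floor(We / 2).

    w_e : existing weight of the index term, assigned at time instance t_p
    t_p : time instance of the previous reference/match
    t_c : time instance of the current reference/match (t_p < t_c)
    """
    gap = abs(t_c - t_p)
    if gap in (0, 1) or w_e >= MAX_WEIGHT:   # boundary conditions: no increment
        return w_e
    n = math.floor(w_e / 2)                  # Equation (3)
    increment = w_e / (gap ** n)             # Equation (2)
    if increment < MIN_INCREMENT:            # minimum allowed increment
        return w_e
    return min(w_e + increment, MAX_WEIGHT)  # Equation (1), capped at the maximum

# Hypothetical example: weight 4.0, previously matched at instance 3, matched again at 7.
print(updated_weight(4.0, 3, 7))
```

With these hypothetical values, n = floor(4.0 / 2) = 2 and |tc - tp| = 4, so the increment is 4.0 / 16 = 0.25 and the new weight is 4.25.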

4. NEW DYNAMIC WEIGHT ASSIGNMENT SCHEME
In Section 3, we presented a brief account of our previous dynamic weight assignment scheme and pointed out its shortcomings. The main problem with that scheme is that it has no formal background; it has an intuitive foundation, and its criterion for the increment in weight is based on the time interval between two consecutive references to an index term of the index file. The new weight assignment scheme proposed in this paper takes a different approach and is based on the statistics of the weights of the index terms used in a query and of those index terms of the query that have matched in the index file of the collection.
Now we present our new dynamic weight assignment scheme using statistical concepts. Suppose that the total number of index terms (ITs) in the collection of an IR system is N, the total number of ITs given in a user need (query) is n1, and only n2 ITs of the user need have matched. Then the number of ITs that did not match with the ITs given in the user need is n1 - n2. The four numbers N, n1, n2 and n1 - n2 hold the following two inequalities:

n2 ≤ n1 < N ........ (1)
n1 - n2 ≤ n1 < N ........ (2)

Table 1: ITs of a user need and their weights. [table listing the n1 index terms of the query and their current weights]
Table 2: ITs of the user need that match and their weights. [table listing the n2 matched index terms and their current weights]

In Table 1, a user need consisting of n1 ITs and their corresponding current weights is given. Table 2 shows those ITs and their corresponding current weights (before this user need/query) which have matched against the N ITs of the collection.
We give the increment to the weights of ITs at the query (user need) level instead of the IT level, as we did in our previous dynamic weight assignment scheme. In this new scheme, we take the collective effect of a query and increase the weights of those ITs that have been found and matched in the set of ITs of the collection. In other words, in this new proposed scheme we quantify the effect of the query in terms of an increment in the weights of the matched ITs.
The mean M1 of the weights given in Table 1 is given in Equation (3). Similarly, the mean M2 of the weights of the matched ITs given in Table 2 can be computed using Equation (4). Note that these two means M1 and M2 hold the inequality M2 ≤ M1.

M1 = ( ∑_{i=1}^{n1} Wi ) / n1 ........ (3)
M2 = ( ∑_{i=1}^{n2} Wi ) / n2 ........ (4)

The standard deviations σ1 and σ2 of the two data arrays given in Table 1 and Table 2 can be computed using Equation (5) and Equation (6).

σ1 = sqrt( ( ∑_{i=1}^{n1} (Wi - M1)² ) / n1 ) ........ (5)
σ2 = sqrt( ( ∑_{i=1}^{n2} (Wi - M2)² ) / n2 ) ........ (6)

These two real-valued standard deviations hold the inequality σ2 ≤ σ1 since M2 ≤ M1.
We suggest an equal amount of increment in all the weights of the n2 index terms of a query that have matched (see Table 2). This equal amount of increment is derived from the ratio of the two standard deviations σ1 and σ2 given in Equation (5) and Equation (6). We compute the increment by Equation (7).

Increment = σ2 / σ1 ........ (7)

Note that this increment in the weights is always less than or equal to 1 (i.e., increment ≤ 1). It can be 1, but only in very rare cases; in the majority of cases its value will be less than 1. This property of Equation (7) keeps the increment in the weights under control, and no time interval is used in computing the increment, as was done in our previous proposal reported in [24]. A small sketch of this computation is given below.
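To illustrate the computation end to end, the following sketch (ours; the query weights shown are hypothetical) evaluates Equations (3)-(7) for one query: the means M1 and M2, the standard deviations σ1 and σ2, and the common increment σ2 / σ1 that is added to every matched index term.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation, as in Equations (5) and (6)."""
    m = mean(values)
    return math.sqrt(sum((w - m) ** 2 for w in values) / len(values))

def query_increment(query_weights, matched_weights):
    """Common increment for the matched ITs of a query: sigma2 / sigma1 (Equation (7))."""
    sigma1 = std_dev(query_weights)    # all n1 ITs of the user need (Table 1)
    sigma2 = std_dev(matched_weights)  # the n2 matched ITs (Table 2)
    if sigma1 == 0.0:                  # degenerate case: all query weights equal
        return 0.0
    return sigma2 / sigma1

# Hypothetical query with n1 = 4 index terms; n2 = 2 of them match the index file.
query_weights = [0.2, 0.4, 0.6, 0.8]
matched_weights = [0.4, 0.6]

inc = query_increment(query_weights, matched_weights)
new_matched_weights = [w + inc for w in matched_weights]
print(round(inc, 3), new_matched_weights)
```

For the hypothetical weights shown, σ1 is about 0.224 and σ2 is about 0.1, so each of the two matched terms receives an increment of roughly 0.447, consistent with the property increment ≤ 1 noted above.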
5. CONCLUDING REMARKS AND FUTURE DIRECTIONS
In this paper, we have proposed a new dynamic weight assignment scheme. This scheme has the following salient features:
i) It has a mathematical and statistical foundation instead of empirical and intuitive concepts.
ii) Weights of ITs are increased at the query level instead of the IT level.
iii) The increment in weights is slower than the increment in weights in our previously proposed dynamic weight scheme in [24]. This means that the ITs of an IR system using this new scheme attain their maximum values later and the system gets saturated after a longer time.
Intuitively, it is expected that this scheme can give better results in terms of the retrieval performance and the saturation behaviour of an IR system. We are currently working on a simulation to obtain the above-mentioned results for this new dynamic weight assignment scheme.

REFERENCES
[1] Aristotle, "The Rhetoric," in W. Rhys Roberts (translator), The Rhetoric and Poetics of Aristotle, Random House, New York, 1954.
[2] Baeza-Yates, R., Ribeiro-Neto, B., "Modern Information Retrieval," Addison-Wesley Publishing Company, 1999.
[3] Frants, Valery I., Shapiro, Jacob, Voiskunskii, Vladimir G., "Automated Information Retrieval: Theory and Methods," Academic Press, California, 1997.
[4] Hagen, Eric, "An Information Retrieval System for Performing Hierarchical Document Clustering," Thesis, Dartmouth College, 1997.
[5] Haouam, K., Marir, F., "SEMIR: Semantic Indexing and Retrieving Web Document Using Rhetorical Structure Theory," the Proceedings of the Fourth International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2003), Hong Kong, 2003, pp. 596-604.
[6] Korfhage, R., "Information Storage and Retrieval," John Wiley and Sons, London, 1997.
[7] Lalmas, Mounia, Bruza, Peter D., "The Use of Logic in Information Retrieval Modeling," The Knowledge Engineering Review, 13(3), 1998, pp. 263-295.
[8] Liu, G. Z., "Semantic Vector Space Model: Implementation and Evaluation," Journal of the American Society for Information Science, 48(5), 1997, pp. 395-417.
[9] Losee, R.M., "Comparing Boolean and Probabilistic Information Retrieval Systems Across Queries and Disciplines," Journal of the American Society for Information Science, 48(2), 1997, pp. 143-156.
[10] Mann, W.C., Thompson, S.A., "Rhetorical Structure Theory: Towards a Functional Theory of Text Organization," Text, 8(3), 1988, pp. 243-281.
[11] Marcu, D., "Building up Rhetorical Structure Trees," the Proceedings of the Thirteenth National Conference on Artificial Intelligence, vol. 2, USA, 1996, pp. 1069-1074.
[12] Marcu, D., "The Rhetorical Parsing of Natural Language Texts," the Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), 1997, pp. 96-103.
[13] Marcu, D., "The Theory and Practice of Discourse Parsing and Summarization," MIT Press, Cambridge, MA, 2000.
[14] Marcu, D., Echihabi, A., "An Unsupervised Approach to Recognizing Discourse Relations," the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), USA, 2002, pp. 368-375.
[15] O'Donnell, Michael, "Variable-Length On-Line Document Generation," the Proceedings of the Flexible Hypertext Workshop of the Eighth ACM International Hypertext Conference, UK, 1997.
[16] Ono, Kenji, Sumita, Kazuo, Miike, Seiji, "Abstract Generation Based on Rhetorical Structure Extraction," the Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), Vol. 1, Kyoto, Japan, 1994.
[17] Pollitt, A. S., "Information Storage and Retrieval Systems," Ellis Horwood Ltd., Chichester, UK, 1998.

[18] Salton, Gerard, "Automatic Text Processing," Addison-Wesley, USA, 1989.
[19] Corston-Oliver, Simon H., "Computing Representations of the Structure of Written Discourse," Technical Report MSR-TR-98-15, Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, 1998.
[20] Smeaton, A. F., "Progress in the Application of Natural Language Processing to Information Retrieval," The Computer Journal, 35, 1992, pp. 268-278.
[21] Sparck-Jones, Karen, Willet, Peter, "Readings in Information Retrieval," Morgan Kaufmann, California, USA, 1997.
[22] Vadera, S. and Meziane, F., "From English to Formal Specifications," The Computer Journal, 37(9), 1994.
[23] Shah, A. and Shoaib, M., "Sources of Irrelevancy in Information Retrieval Systems," the International Conference on Software Engineering Research & Practice (SERP'05), June 27-30, 2005, Las Vegas, USA, pp. 877-883.
[24] Shoaib, M., Shah, A., and Vashishta, A., "A Dynamic Indexing Technique for IR Systems," the First International Conference of Information and Communication Technologies, Pakistan, August 27-28, 2005, pp. 272-275.
