than manual indexing can. However, it also remains at a primitive level of development, despite many years of study. (For details on current ways to automatically assign content terms to documents, see the sidebar below.)
INFORMATION RETRIEVAL MODELS
An IR model is characterized by four parameters:
• representations for documents and queries,
• matching strategies for assessing the relevance of documents to a user query,
• methods for ranking query output, and
• mechanisms for acquiring user-relevance feedback.

IR models can be classed into four types: set theoretic, algebraic, probabilistic, and hybrid models. In the following sections, we describe instances of each type in the context of the IR model parameters.
FEATURE
IEEE INTERNET COMPUTING, SEPTEMBER • OCTOBER 1997, p. 60
http://computer.org/internet/
The automatic assigning of content terms to documents can be based on single or multiple terms.
Single-Term Indexing
The term set of the document includes its set of words and their frequencies. Words that perform strictly grammatical functions are compiled into a stop list and removed. The term set can also be refined by stemming to remove word suffixes.

Approaches to assigning weights for single terms may be grouped into the following categories: statistical, information-theoretic, and probabilistic. While the first two categories just use document and collection properties, the probabilistic approaches require user input in terms of relevance judgments.
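The term-set construction described above (word counts, stop-list removal, stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hypothetical sample, `naive_stem` is a crude suffix stripper rather than a real stemming algorithm, and the names `STOP_LIST`, `naive_stem`, and `term_set` are not from the article.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; practical lists are far longer.
STOP_LIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes tried in order, longest first. This is NOT a real stemmer
# (such as Porter's); it only illustrates the idea of suffix removal.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_set(text):
    """Return a Counter mapping each stemmed, non-stop word to its frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(w) for w in words if w not in STOP_LIST)


terms = term_set("The indexing of documents and the ranking of documents")
```

Here `terms` maps `document` to 2 and `index` and `rank` to 1 each; the stop words are dropped before counting.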
Statistical methods. Assume that we have $N$ documents in a collection. Let $tf_{ij}$ denote the term frequency, which is a function of the frequency of the term $T_j$ in document $D_i$.

Indexing based on term frequency fulfills one indexing aim, namely, recall. However, terms that are concentrated in a few documents of a collection can be used to improve precision by distinguishing documents in which they occur from those in which they do not. Let $df_j$ denote the document frequency of the term $T_j$ in a collection of $N$ documents, which is the number of documents in which the term occurs. Then, the inverse document frequency, given by $\log(N/df_j)$, is an appropriate indicator of $T_j$ as a document discriminator.

The term-frequency and inverse-document-frequency components can be combined into a single frequency-based indexing model,[1,2] where the weight of a term $T_j$ in document $D_i$, denoted $w_{ij}$, is given by

$$w_{ij} = tf_{ij} \cdot \log(N/df_j)$$

Another statistical approach to indexing is based on term discrimination. This approach views each document as a point in the document space. As the term sets for two documents become more similar, the corresponding points in the document space become closer (that is, the density of the document space increases) and vice versa.

Under this scheme, we can approximate the value of a term as a document discriminator based on the type of change that occurs in the document space when a term is introduced to the collection. We can quantify this change according to the increase or decrease in the average distance between the documents. A term has a good discrimination value if it increases the average distance between the documents; in other words, terms with good discrimination value decrease the density of the document space. The term-discrimination value of a term $T_j$, denoted $dv_j$, is then computed as the difference of the document space densities before and after the term $T_j$ is introduced. The net effect is that high-frequency terms have negative discrimination values, medium-frequency terms have positive discrimination values, and low-frequency terms tend to have discrimination values close to zero.[1] A term-weighting scheme such as $w_{ij} = tf_{ij} \cdot dv_j$ is used to combine term frequency and discrimination values.
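The two statistical weighting schemes above can be sketched on a toy collection. Several choices here are assumptions rather than anything the article specifies: raw counts stand in for $tf_{ij}$, the logarithm is natural, the density of the document space is taken to be the average pairwise cosine similarity, and all function names and the sample data are hypothetical.

```python
import math
from itertools import combinations


def tf_idf_weights(docs):
    """docs: list of {term: raw count}. Returns w_ij = tf_ij * log(N / df_j)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def density(docs):
    """Assumed density measure: average similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(docs, term):
    """dv_j = density before the term is introduced minus density after."""
    without = [{t: f for t, f in d.items() if t != term} for d in docs]
    return density(without) - density(docs)


# Hypothetical three-document collection.
docs = [
    {"retrieval": 3, "index": 2},
    {"retrieval": 2, "query": 4},
    {"retrieval": 4, "rank": 1},
]
weights = tf_idf_weights(docs)
dv_retrieval = discrimination_value(docs, "retrieval")
dv_index = discrimination_value(docs, "index")
```

In this toy collection, `retrieval` occurs in every document, so its tf-idf weight is zero and its discrimination value is negative: introducing it pulls the documents closer together. A term confined to one document, such as `index`, pushes the documents apart and gets a positive value here; a three-document sample is too small to reproduce the near-zero behavior of rare terms in a large collection.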
Information-theoretic methods. In information theory, the least-predictable terms carry the greatest information value.[3] Least-predictable terms are those that occur with the smallest probabilities. Information theory concepts have been used to derive a measure, called signal-noise ratio, of term usefulness for indexing. This method favors terms that are concentrated in particular documents. Therefore, its properties are similar to those of inverse document frequency.
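The text does not give a formula for the signal-noise ratio, so the sketch below assumes one common textbook formulation: the noise of a term is the entropy of its frequency distribution across documents, and the signal is the log of its total frequency minus that noise. The function names and the base-2 logarithm are likewise assumptions.

```python
import math


def noise(term_freqs):
    """Entropy-style noise of one term.

    term_freqs: the term's frequency in each document (zeros allowed);
    the term must occur at least once in the collection.
    """
    total = sum(term_freqs)
    return sum((f / total) * math.log2(total / f) for f in term_freqs if f > 0)


def signal(term_freqs):
    """Assumed signal: log2 of the total frequency minus the noise."""
    return math.log2(sum(term_freqs)) - noise(term_freqs)


# Same total frequency (8), distributed differently over four documents.
concentrated = signal([8, 0, 0, 0])  # all occurrences in one document
spread_out = signal([2, 2, 2, 2])    # occurrences spread evenly
```

Under this assumed definition the concentrated term scores higher than the evenly spread one, which matches the statement that the measure favors terms concentrated in particular documents.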
Probabilistic methods. Probabilistic approaches require a training set of documents obtained by asking users to provide relevance judgments with respect to query results.[4] The training set is used to compute term weights by estimating conditional probabilities that a term occurs given that a document is relevant (or irrelevant). Assume a collection of $N$ documents, of which $R$ are relevant to the user query; $R_t$ of the relevant documents contain term $t$, and $t$ occurs in $f_t$ documents. Two conditional probabilities are estimated for each term as follows:

$$\Pr[\,t \text{ in document} \mid \text{document is relevant}\,] = \frac{R_t}{R}$$

$$\Pr[\,t \text{ in document} \mid \text{document is irrelevant}\,] = \frac{f_t - R_t}{N - R}$$

From these estimates, Bayes’ theorem is used, under certain assumptions, to derive the weight of term $t$ as
AUTOMATIC INDEXING METHODS