UNIT I
Boolean retrieval:
The Boolean model of information retrieval (BIR)[1] is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is still used by many IR systems to this day.
2) Definitions
The BIR is based on Boolean logic and classical set theory in that both the documents to be
searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not
the documents contain the query terms. Given a finite set
T = {t1, t2, ..., tj, ..., tm}
of elements called index terms (e.g. words or expressions - which may be stemmed - describing
or characterising documents such as keywords given for a journal article), a finite set
D = {D1, ..., Di, ..., Dn}, where Di is an element of the powerset of T
of elements called documents. Given a Boolean expression - in a normal form - Q called a query
as follows:
Q = (Wi OR Wk OR ...) AND ... AND (Wj OR Ws OR ...),
with Wi=ti, Wk=tk, Wj=tj, Ws=ts, or Wi=NON ti, Wk=NON tk, Wj=NON tj, Ws=NON ts
where Wi = ti means that the term ti is present in document Di, whereas Wi = NON ti means that it is not. Equivalently, Q can be given in a disjunctive normal form, too. An operation called retrieval, consisting of two steps, is defined as follows:
1. The sets Sj of documents are obtained that contain or do not contain the term tj (according to whether Wj = tj or Wj = NON tj):
Sj = {Di | Wj element of Di}
2. Those documents are retrieved in response to Q which are the result of the corresponding set operations, i.e. the answer to Q is as follows:
UNION ( INTERSECTION Sj )
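As an illustration of the two-step retrieval operation above, here is a small Python sketch using sets; the toy documents and the query are made-up examples, not part of the original text.

# Sketch of Boolean retrieval over an inverted index built from Python sets.
docs = {
    "D1": {"information", "retrieval", "boolean"},
    "D2": {"information", "ranking"},
    "D3": {"boolean", "logic"},
}

# Inverted index: term -> set of documents containing it (the sets Sj).
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)

all_docs = set(docs)

def S(term, negated=False):
    """Set of documents for Wj = tj, or its complement for Wj = NON tj."""
    s = index.get(term, set())
    return all_docs - s if negated else s

# Query Q = (information OR logic) AND (NOT ranking), in conjunctive normal form:
answer = (S("information") | S("logic")) & S("ranking", negated=True)
print(sorted(answer))   # -> ['D1', 'D3']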
Advantages
Clean formalism
Easy to implement
Intuitive concept
Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it
up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as
punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears
These tokens are often loosely referred to as terms or words, but it is sometimes important to
make a type/token distinction. A token is an instance of a sequence of characters in some
particular document that are grouped together as a useful semantic unit for processing. A type is
the class of all tokens containing the same character sequence. A term is a (perhaps normalized)
type that is included in the IR system's dictionary.
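As an illustration, here is a minimal Python tokenizer; the regular expression used is a simplifying assumption, not the book's tokenizer.

import re

def tokenize(text):
    """Chop a character sequence into tokens, throwing away punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(tokens)   # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']

# Types are the distinct character sequences among the tokens:
types = set(tokens)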
Dropping common terms: stop words
Figure 2.5: A stop list of 25 semantically non-selective words which are common in Reuters-RCV1.
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.
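As an illustration of the strategy just described, here is a small Python sketch that builds a stop list from collection frequency; the cut-off of 25 terms mirrors Figure 2.5 but is otherwise an arbitrary choice.

from collections import Counter

def build_stop_list(tokenized_docs, n=25):
    """Sort terms by collection frequency and take the n most frequent as a stop list."""
    cf = Counter()
    for tokens in tokenized_docs:
        cf.update(t.lower() for t in tokens)
    return {term for term, _ in cf.most_common(n)}

def drop_stop_words(tokens, stop_list):
    """Discard stop-list members during indexing."""
    return [t for t in tokens if t.lower() not in stop_list]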
NORMALIZATION (EQUIVALENCE CLASSING OF TERMS)
Having broken up our documents (and also our query) into tokens, the easy case is if tokens in
the query just match tokens in the token list of the document. However, there are many cases
when two character sequences are not quite the same but you would like a match to occur. For
instance, if you search for USA, you might hope to also match documents containing U.S.A.
Token normalization is the process of canonicalizing tokens so that matches occur despite
superficial differences in the character sequences of the tokens. The most standard way to
normalize is to implicitly create equivalence classes , which are normally named after one
member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both
mapped onto the term antidiscriminatory, in both the document text and queries, then searches
for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the
equivalence classing to be done is implicit, rather than being fully calculated in advance: the
terms that happen to become identical as the result of these rules are the equivalence classes. It is
only easy to write rules of this sort that remove characters. Since the equivalence classes are
implicit, it is not obvious when you might want to add characters. For instance, it would be hard
to know to turn antidiscriminatory into anti-discriminatory.
Figure 2.6: An example of how asymmetric expansion of query terms can usefully model users'
expectations.
Accents and diacritics.
Diacritics on characters in English have a fairly marginal status, and we might well
want cliché and cliche to match, or naïve and naive. This can be done by normalizing tokens to
remove diacritics. In many other languages, diacritics are a regular part of the writing system
and distinguish different sounds. Occasionally words are distinguished only by their accents. For
instance, in Spanish, peña is `a cliff', while pena is `sorrow'. Nevertheless, the important
question is usually not prescriptive or linguistic but is a question of how users are likely to write
queries for these words. In many cases, users will enter queries for words without diacritics,
whether for reasons of speed, laziness, limited software, or habits born of the days when it was
hard to use non-ASCII text on many computer systems. In these cases, it might be best to
equate all words to a form without diacritics.
Capitalization/case-folding.
A common strategy is to do case-folding by reducing all letters to lower case. Often this is a
good idea: it will allow instances of Automobile at the beginning of a sentence to match with a
query of automobile. It will also help on a web search engine when most of your users type
in ferrari when they are interested in a Ferrari car. On the other hand, such case folding can
equate words that might better be kept apart. Many proper nouns are derived from common
nouns and so are distinguished only by case, including companies (General Motors, The
Associated Press), government organizations (the Fed vs. fed) and person names
(Bush, Black). We already mentioned an example of unintended query expansion with
acronyms, which involved not only acronym normalization (C.A.T. → CAT) but also case-folding
(CAT → cat).
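As an illustration of these normalization steps (hyphen removal, diacritic stripping and case-folding), here is a small Python sketch; the mapping rules are simplified assumptions, not the exact rules of any particular system.

import unicodedata

def normalize(token):
    """Map a token onto the name of its equivalence class."""
    t = token.replace("-", "")                       # anti-discriminatory -> antidiscriminatory
    t = unicodedata.normalize("NFKD", t)             # split characters from their diacritics
    t = "".join(c for c in t if not unicodedata.combining(c))   # drop the diacritics
    return t.lower()                                  # case-folding

print(normalize("Anti-discriminatory"))   # antidiscriminatory
print(normalize("cliché"))                # cliche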
4) Index construction:
5) Index compression.
An inverted index for a given text collection can be quite large, especially if it contains full
positional information about all term occurrences in the collection. In a typical collection of
English text there is approximately one token for every 6 bytes of text (including punctuation
and whitespace characters). Hence, if postings are stored as 64-bit integers, we may expect that
an uncompressed positional index for such a collection consumes between 130% and 140% of
the uncompressed size of the original text data. This estimate is confirmed by Table 6.1, which
lists the three example collections used in this book along with their uncompressed size, their
compressed size, and the sizes of an uncompressed and a compressed schema-independent index.
An uncompressed schema-independent index for the TREC45 collection, for instance, consumes
about 331 MB, or 122% of the raw collection size.
General-Purpose Data Compression
In general, a data compression algorithm takes a chunk of data A and transforms it into another
chunk of data B, such that B is (hopefully) smaller than A, that is, it requires fewer bits to
be transmitted over a communication channel or to be stored on a storage medium. Every
compression algorithm consists of two components: the encoder (or compressor) and the decoder
(or decompressor). The encoder takes the original data A and outputs the compressed data B.
The decoder takes B and produces some output C.
Symbolwise Data Compression
When compressing a given chunk of data A, we are usually less concerned with A's actual
appearance as a bit string than with the information contained in A. This information is called
the message, denoted as M. Many data compression techniques treat M as a sequence of symbols
from a symbol set S (called the alphabet):
M = ⟨σ1, σ2, ..., σn⟩, with each σi ∈ S.
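To make the compression discussion concrete, the sketch below shows one common way of compressing a postings list: store the gaps between successive postings and encode each gap with a variable-byte code. This particular scheme is an illustrative assumption, not the book's specific algorithm.

def vb_encode_number(n):
    """Variable-byte encode one non-negative integer (7 payload bits per byte)."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128          # set the continuation bit on the last byte
    return bytes(bytes_out)

def vb_encode_postings(postings):
    """Encode an increasing postings list as variable-byte coded gaps."""
    encoded, prev = bytearray(), 0
    for p in postings:
        encoded += vb_encode_number(p - prev)
        prev = p
    return bytes(encoded)

def vb_decode(data):
    """Decode back into the original postings list."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    postings, total = [], 0
    for gap in numbers:               # undo the gap encoding
        total += gap
        postings.append(total)
    return postings

print(vb_decode(vb_encode_postings([824, 829, 215406])))   # [824, 829, 215406]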
UNIT II
1) Scoring and ranking:
To rank the documents matching a query, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
Weighted zone scoring
Given a Boolean query q and a document d, weighted zone scoring assigns
to the pair (q, d) a score in the interval [0, 1], by computing a linear combination
of zone scores, where each zone of the document contributes a Boolean
value. More specifically, consider a set of documents, each of which has ℓ zones. Let
g1, . . . , gℓ ∈ [0, 1] such that g1 + g2 + · · · + gℓ = 1. For 1 ≤ i ≤ ℓ, let si be the
Boolean score denoting a match (or absence thereof) between q and the ith
zone. For instance, the Boolean score from a zone could be 1 if all the query
term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean
function that maps the presence of query terms in a zone to {0, 1}. Then,
the weighted zone score is defined to be

g1 s1 + g2 s2 + · · · + gℓ sℓ.    (6.1)
Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval.
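As a direct illustration of Equation (6.1), here is a small Python sketch; the zones (title and body) and the weight values are made-up examples.

def weighted_zone_score(query_terms, zones, weights):
    """Compute g1*s1 + ... + gL*sL, where si is 1 if all query terms occur in zone i, else 0.

    zones   : list of token sets, one per zone of the document
    weights : list of gi with sum(weights) == 1
    """
    score = 0.0
    for g, zone in zip(weights, zones):
        s = 1 if all(t in zone for t in query_terms) else 0
        score += g * s
    return score

# Example: zones = (title, body) with weights g = (0.7, 0.3)
doc_zones = [{"shakespeare"}, {"plays", "by", "william", "shakespeare"}]
print(weighted_zone_score({"shakespeare"}, doc_zones, [0.7, 0.3]))   # 1.0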
Learning weights
How do we determine the weights gi for weighted zone scoring? These
weights could be specified by an expert (or, in principle, the user); but increasingly,
these weights are learned using training examples that have
been judged editorially. This latter methodology falls under a general class
of approaches to scoring and ranking in information retrieval, known as
machine-learned relevance. We provide a brief introduction to this topic here because weighted zone scoring presents a clean setting for introducing it.
Vector space model
The definition of term depends on the application. Typically terms are single words, keywords,
or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the
number of words in the vocabulary (the number of distinct words occurring in the corpus).
Applications
Relevance rankings of documents in a keyword search can be calculated, using the assumptions
of document similarities theory, by comparing the deviation of angles between each document
vector and the original query vector where the query is represented as the same kind of vector as
the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the
angle itself:

cos(θ) = (d2 · q) / (||d2|| ||q||)

where d2 · q is the intersection (i.e. the dot product) of the document vector d2 and the query
vector q, ||d2|| is the norm of vector d2, and ||q|| is the norm of vector q.
The norm of a vector is calculated as:

||d|| = sqrt(Σ_i d_i^2)
As all vectors under consideration by this model are elementwise nonnegative, a cosine value of
zero means that the query and document vector are orthogonal and have no match (i.e. the query
term does not exist in the document being considered). See cosine similarity for further
information.
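As a concrete illustration of the cosine computation, here is a small Python sketch using sparse vectors stored as dicts; the term weights are made-up values.

import math

def cosine(q, d):
    """cos(theta) = (q . d) / (||q|| * ||d||) for sparse vectors stored as dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q  = {"gossip": 1.0, "jealous": 1.0}
d2 = {"gossip": 0.3, "jealous": 0.2, "affection": 0.9}
print(round(cosine(q, d2), 3))   # about 0.365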
Advantages
The vector space model has the following advantages over the Standard Boolean model:
1. Simple model based on linear algebra
2. Term weights are not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching
The style of search used by both the electro-mechanical and computer-based IR systems was so-called Boolean retrieval. A query was a logical combination of terms (a synonym of word in IR literature), which resulted in a set of those documents that exactly matched the query. Luhn [19] proposed, and Maron, Kuhns, and Ray [20] tested, an alternative approach, where each document in the collection was assigned a score indicating its relevance to a given query.
Learning to rank
Up to this point, the ranking functions used in search engines were manually devised and tuned by hand through experimentation. Fuhr [40] described work where the retrieval function was learned based on relevant documents identified for an existing set of queries.
Web search
Until the rise of the web, the collections that people searched were selected and edited from authoritative sources by the search service providers. On the web, however, the situation was different. Search engine developers quickly realised that they could use the links between web pages to construct a crawler or robot to traverse and gather most web pages on the internet, thereby automating acquisition of content.
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based
on metadata or on full-text (or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information
overload". Many universities and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines are the most visible IR applications.
An information retrieval process begins when a user enters a query into the system. Queries are
formal statements of information needs, for example search strings in web search engines. In
information retrieval a query does not uniquely identify a single object in the collection.
Performance and correctness measures
Many different measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All common
measures described here assume a ground truth notion of relevancy: every document is known to
be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and
there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the user's information
need.
In binary classification, precision is analogous to positive predictive value. Precision takes all
retrieved documents into account. It can also be evaluated at a given cut-off rank, considering
only the topmost results returned by the system. This measure is called precision at n or P@n.
Note that the meaning and usage of "precision" in the field of Information Retrieval differs from
the definition of accuracy and precision within other branches of science and statistics.
Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved.
In binary classification, recall is often called sensitivity. So it can be looked at as the probability
that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.
Fall-out
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents
available:
In binary classification, fall-out is closely related to specificity and is equal to (1 − specificity). It can be looked at as the probability that a non-relevant document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
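As a concrete illustration of the three set-based measures above, here is a small Python sketch; the document sets and the collection size are made-up.

def evaluate(retrieved, relevant, collection_size):
    """Precision, recall and fall-out for one query (set-based, unranked)."""
    tp = len(retrieved & relevant)                      # relevant and retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall    = tp / len(relevant)  if relevant else 0.0
    nonrelevant = collection_size - len(relevant)
    fallout   = (len(retrieved) - tp) / nonrelevant if nonrelevant else 0.0
    return precision, recall, fallout

retrieved = {"d1", "d2", "d3", "d4"}
relevant  = {"d1", "d3", "d7"}
print(evaluate(retrieved, relevant, collection_size=10))   # (0.5, ~0.67, ~0.29)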
4) Relevance feedback
Relevance feedback is a feature of some information retrieval systems. The idea behind
relevance feedback is to take the results that are initially returned from a given query and to use
information about whether or not those results are relevant to perform a new query. We can
usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and
blind or "pseudo" feedback.
Explicit feedback
Explicit feedback is obtained from assessors of relevance indicating the relevance of a document
retrieved for a query. This type of feedback is defined as explicit only when the assessors (or
other users of a system) know that the feedback provided is interpreted as relevance judgments.
Users may indicate relevance explicitly using a binary or graded relevance system. Binary
relevance feedback indicates that a document is either relevant or irrelevant for a given query.
Graded relevance feedback indicates the relevance of a document to a query on a scale using
numbers, letters, or descriptions (such as "not relevant", "somewhat relevant", "relevant", or
"very relevant"). Graded relevance may also take the form of a cardinal ordering of documents
created by an assessor; that is, the assessor places documents of a result set in order of (usually
descending) relevance. An example of this would be the SearchWiki feature implemented
by Google on their search website.
The relevance feedback information needs to be interpolated with the original query to improve retrieval performance, as in the well-known Rocchio algorithm.
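A sketch of the Rocchio update mentioned above; the α, β, γ parameter values below are common defaults assumed for illustration, and document vectors are represented as dicts from term to weight.

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coeff * w / len(docs)
    # negative weights are usually clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}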
A performance metric which became popular around 2005 to measure the usefulness of a
ranking algorithm based on the explicit relevance feedback is NDCG. Other measures
include precision at k and mean average precision.
Implicit feedback
Implicit feedback is inferred from user behavior, such as noting which documents they do and do
not select for viewing, the duration of time spent viewing a document, or page browsing or
scrolling actions [1].
The key differences of implicit relevance feedback from that of explicit include [2]:
1. the user is not assessing relevance for the benefit of the IR system, but only satisfying their own
needs and
2. the user is not necessarily informed that their behavior (selected documents) will be used as
relevance feedback
An example of this is the Surf Canyon browser extension, which advances search results from
later pages of the result set based on both user interaction (clicking an icon) and time spent
viewing the page linked to in a search result.
Blind feedback
Pseudo relevance feedback, also known as blind relevance feedback, provides a method for
automatic local analysis. It automates the manual part of relevance feedback, so that the user gets
improved retrieval performance without an extended interaction. The method is to do normal
retrieval to find an initial set of most relevant documents, to then assume that the top "k" ranked
documents are relevant, and finally to do relevance feedback as before under this assumption.
The procedure is:
1. Take the results returned by the initial query as relevant results (only the top k, with k typically between 10 and 50 in most experiments).
2. Select the top 20-30 (indicative number) terms from these documents using, for instance, tf-idf weights.
3. Do query expansion, add these terms to the query, then match the returned documents for this query and finally return the most relevant documents. A sketch of this procedure appears below.
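The three-step procedure can be sketched on top of any ranking function; `search` and `tfidf_terms` below are assumed helper functions, not part of the original text, and the values of k and the number of expansion terms are illustrative.

def pseudo_relevance_feedback(query, search, tfidf_terms, k=10, n_expansion=20):
    """Blind relevance feedback: assume the top-k results are relevant,
    pick expansion terms by tf-idf weight, and re-run the expanded query.

    search(query)        -> ranked list of document ids (assumed helper)
    tfidf_terms(doc_ids) -> terms sorted by descending tf-idf weight (assumed helper)
    """
    initial = search(query)
    feedback_docs = initial[:k]                            # step 1: top k taken as relevant
    expansion = tfidf_terms(feedback_docs)[:n_expansion]   # step 2: 20-30 expansion terms
    expanded_query = query + " " + " ".join(expansion)     # step 3: query expansion
    return search(expanded_query)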
Some experiments, such as results from the Cornell SMART system published in (Buckley et al. 1995), show improvement of retrieval system performance using pseudo-relevance feedback in the context of TREC 4 experiments.
This automatic technique mostly works. Evidence suggests that it tends to work better than
global analysis.[1] Through a query expansion, some relevant documents missed in the initial
round can then be retrieved to improve the overall performance. Clearly, the effect of this method
strongly relies on the quality of selected expansion terms. It has been found to improve
performance in the TREC ad hoc task[citation needed]. But it is not without the dangers of an automatic
process. For example, if the query is about copper mines and the top several documents are all
about mines in Chile, then there may be query drift in the direction of documents on Chile. In
addition, if the words added to the original query are unrelated to the query topic, the quality of
the retrieval is likely to be degraded, especially in Web search, where web documents often cover
multiple different topics. To improve the quality of expansion words in pseudo-relevance feedback, a positional relevance feedback model has been proposed to select from feedback documents those words that are focused on the query topic, based on the positions of words in the feedback documents.
UNIT III
1) One natural idea is to estimate the probability of a term t occurring in a relevant document, P(t|R = 1), and to use this as the basis of a classifier that decides whether documents are relevant or not.
Terms      Probability in doc
—          0.1
world      0.2
likes      0.05
we         0.05
share      0.3
...        ...

Terms      Probability in Doc1    Probability in Doc2
—          0.1                    0.3
world      0.2                    0.1
likes      0.05                   0.03
we         0.05                   0.02
share      0.3                    0.2
...        ...                    ...
In information retrieval contexts, unigram language models are often smoothed to avoid
instances where P(term | document model) = 0. A common approach is to generate a maximum-likelihood model for the
entire collection and linearly interpolate the collection model with a maximum-likelihood model
for each document to create a smoothed document model.[2]
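As a concrete illustration of the linear interpolation smoothing just described, here is a small Python sketch of a smoothed query-likelihood score; λ = 0.5 is an arbitrary illustrative choice.

import math

def smoothed_query_likelihood(query_terms, doc_counts, coll_counts, lam=0.5):
    """log P(q|d) under a unigram document model interpolated with the collection model:
    P(t|d) = lam * tf(t,d)/|d| + (1-lam) * cf(t)/|C|, which avoids P(t|d) = 0."""
    doc_len  = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    log_p = 0.0
    for t in query_terms:
        p_doc  = doc_counts.get(t, 0) / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(t, 0) / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")      # term unseen even in the collection
        log_p += math.log(p)
    return log_p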
4) Text classification:
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically.
The intellectual classification of documents has mostly been the province of library science,
while the algorithmic classification of documents is used mainly in information science and
computer science. The problems are overlapping, however, and there is therefore also
interdisciplinary research on document classification.
The documents to be classified may be texts, images, music, etc. Each kind of document
possesses its special classification problems. When not otherwise specified, text classification is
implied.
Documents may be classified according to their subjects or according to other attributes (such
as document type, author, printing year etc.). In the rest of this article only subject classification
is considered. There are two main philosophies of subject classification of documents: The
content based approach and the request based approach.
Lancaster has argued that this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21[3]).
The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Therefore, the act of labeling a document (say, by assigning a term from a controlled vocabulary to a document) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
Contiguity hypothesis. Documents in the same class form a contiguous region and regions of
different classes do not overlap.
There are many classification tasks, in particular the type of text classification that we
encountered in Chapter 13 , where classes can be distinguished by word patterns. For example,
documents in the class China tend to have high values on dimensions like Chinese, Beijing, and
Mao whereas documents in the class UK tend to have high values for London, British and
Queen. Documents of the two classes therefore form distinct contiguous regions as shown in
Figure 14.1 and we can draw boundaries that separate them and classify new documents. How
exactly this is done is the topic of this chapter.
Rocchio classification
Figure 14.1 shows three classes, China, UK and Kenya, in a two-dimensional (2D) space.
Documents are shown as circles, diamonds and X's. The boundaries in the figure, which we
call decision boundaries , are chosen to separate the three classes, but are otherwise arbitrary. To
classify a new document, depicted as a star in the figure, we determine the region it occurs in and
assign it the class of that region - China in this case. Our task in vector space classification is to
devise algorithms that compute good boundaries where ``good'' means high classification
accuracy on data unseen during training.
Figure 14.3: Rocchio classification.
Perhaps the best-known way of computing good class boundaries is Rocchio classification,
which uses centroids to define the boundaries. The centroid of a class c is computed as the
vector average or center of mass of its members:

μ(c) = (1 / |Dc|) Σ_{d in Dc} v(d)

where Dc is the set of documents in class c and v(d) is the vector of document d.
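As an illustration of classifying by nearest centroid, here is a minimal sketch using numpy (which the text does not reference); the toy vectors and class names are made-up.

import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (vector average of members) per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(centroids, x):
    """Assign x to the class of the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# X: documents as row vectors, y: class labels (toy data)
X = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]])
y = np.array(["China", "China", "UK", "UK"])
centroids = train_rocchio(X, y)
print(classify(centroids, np.array([0.8, 0.2])))   # China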
UNIT IV
One way to extend SVMs to multiclass problems is to build a one-versus-one classifier for each pair of classes, and to choose the class that is selected by the most classifiers. While this involves building |C|(|C| − 1)/2 classifiers, the time for training them may actually decrease, since the training data set for each classifier is much smaller.
Nonlinear SVMs
Figure 15.6: Projecting data that is not linearly separable into a higher dimensional space can
make it linearly separable.
2) Flat clustering:
Clustering algorithms group a set of documents into subsets or clusters . The algorithms' goal is
to create clusters that are coherent internally, but clearly different from each other. In other
words, documents within a cluster should be as similar as possible; and documents in one
cluster should be as dissimilar as possible from documents in other clusters.
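The best-known flat clustering algorithm is K-means (contrasted with hierarchical clustering in the next section); as a concrete illustration, here is a minimal sketch using numpy, with K and the iteration count chosen arbitrarily.

import numpy as np

def kmeans(X, k, iterations=20, seed=0):
    """Flat clustering: alternate between assigning documents to the nearest
    centroid and recomputing centroids, so clusters become internally coherent."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids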
3) Hierarchical clustering:
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number
of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters,
require a prespecified number of clusters as input and are nondeterministic. Hierarchical
clustering (or hierarchic clustering ) outputs a hierarchy, a structure that is more informative
than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not
require us to prespecify the number of clusters and most hierarchical algorithms that have been
used in IR are deterministic. These advantages of hierarchical clustering come at the cost of
lower efficiency. The most common hierarchical clustering algorithms have a complexity that is
at least quadratic in the number of documents, compared to the linear complexity of K-means and
EM.
We begin by giving two theorems on the decomposition of a square matrix into the product of
three matrices of a special form. The first of these, Theorem 18.1.1, gives the basic factorization
of a square real-valued matrix into three factors. The second, Theorem 18.1.1, applies to square
symmetric matrices and is the basis of the singular value decomposition described in
Theorem 18.2.
Theorem. (Matrix diagonalization theorem) Let S be a square real-valued M × M matrix with M linearly
independent eigenvectors. Then there exists an eigen decomposition

S = U Λ U^(-1),    (223)

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries
are the eigenvalues of S in decreasing order:

Λ = diag(λ1, λ2, ..., λM), with λi ≥ λi+1.    (224)

If the eigenvalues are distinct, then this decomposition is unique. End theorem.

To understand how Theorem 18.1.1 works, we note that U has the eigenvectors of S as columns:

U = (u1 u2 ... uM).    (225)

Then we have

S U = S (u1 u2 ... uM)    (226)
    = (λ1 u1  λ2 u2  ...  λM uM)    (227)
    = (u1 u2 ... uM) diag(λ1, λ2, ..., λM).    (228)

Thus, we have S U = U Λ, or S = U Λ U^(-1).
We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors. This will pave the way for the development of our main tool for text analysis, the singular value decomposition (Section 18.2).
Theorem. (Symmetric diagonalization theorem) Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal
decomposition

S = Q Λ Q^T,    (229)

where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S,
and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, all entries of Q are real
and we have Q^(-1) = Q^T. End theorem.
We will build on this symmetric diagonal decomposition to build low-rank approximations to term-document matrices.
q_k = Σ_k^(-1) U_k^T q    (244)

Now, we may use cosine similarities as in Section 6.3.1 to compute the similarity
between a query and a document, between two documents, or between two terms. Note
especially that Equation 244 does not in any way depend on being a query; it is simply a vector
in the space of terms. This means that if we have an LSI representation of a collection of
documents, a new document not in the collection can be ``folded in'' to this representation using
Equation 244. This allows us to incrementally add documents to an LSI representation. Of
course, such incremental addition fails to capture the co-occurrences of the newly added
documents (and even ignores any new terms they contain). As such, the quality of the LSI
representation will degrade as more documents are added and will eventually require a
recomputation of the LSI representation.
The fidelity of the approximation of C_k to C leads us to hope that the relative values of cosine
similarities are preserved: if a query is close to a document in the original space, it remains
relatively close in the k-dimensional space. But this in itself is not sufficiently interesting,
especially given that the sparse query vector q turns into a dense query vector q_k in the low-dimensional space. This has a significant computational cost, when compared with the cost of
processing q in its native form.
Worked example. Consider a term-document matrix C whose rows correspond to the terms ship, boat, ocean, voyage, and trip. Its singular value decomposition C = U Σ V^T is the product of three matrices. The first of these, U, has one row per term (ship, boat, ocean, voyage, trip) and one column per singular value.
When applying the SVD to a term-document matrix, U is known as the SVD term matrix. The matrix of singular values is

Σ =
2.16 0.00 0.00 0.00 0.00
0.00 1.59 0.00 0.00 0.00
0.00 0.00 1.28 0.00 0.00
0.00 0.00 0.00 1.00 0.00
0.00 0.00 0.00 0.00 0.39
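As a concrete illustration of truncating the SVD and folding a query into the low-dimensional space (Equation 244), here is a sketch using numpy; the matrix values are a toy example in the spirit of the worked example above, and k = 2 is an arbitrary choice.

import numpy as np

# Toy term-document matrix C (terms x documents): ship, boat, ocean, voyage, trip.
C = np.array([[1., 0., 1., 0., 0., 0.],
              [0., 1., 0., 0., 0., 0.],
              [1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 1., 0.],
              [0., 0., 0., 1., 0., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                    # number of dimensions to keep
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
C_k = U_k @ S_k @ Vt_k                   # rank-k approximation of C

# Fold a query (a vector in term space) into the k-dimensional space:
q = np.array([0., 0., 0., 1., 1.])       # query containing "voyage" and "trip"
q_k = np.linalg.inv(S_k) @ U_k.T @ q     # q_k = Sigma_k^(-1) U_k^T q  (cf. Equation 244)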
UNIT V
The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking.
Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis.
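The best-known example of such a more sophisticated link-analysis signal is PageRank, which is not named in the excerpt above; as an illustration, here is a minimal power-iteration sketch with an assumed damping factor of 0.85 and a made-up three-page graph.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += share
            else:                                    # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))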