
CS6007 – Information Retrieval

UNIT V DOCUMENT TEXT MINING


Information filtering; organization and relevance feedback – Text mining – Text classification and clustering – Categorization algorithms: naive Bayes, decision trees, and nearest neighbor – Clustering algorithms: agglomerative clustering, k-means, expectation maximization (EM).

Information Filtering:
An information filtering system is a system that removes redundant or unwanted information from an information stream using (semi)automated or computerized methods prior to presentation to a human user. Its main goals are to manage information overload and to increase the semantic signal-to-noise ratio. To do this, the user's profile is compared to some reference characteristics. These characteristics may originate from the information item (the content-based approach) or from the user's social environment (the collaborative filtering approach).
Whereas in information transmission signal processing filters are used against syntax-disrupting noise on the bit level, the methods employed in information filtering act on the semantic level.
The range of machine methods employed builds on the same principles as those for
information extraction. A notable application can be found in the field of email spam filters.
Thus, it is not only the information explosion that necessitates some form of filters, but also
inadvertently or maliciously introduced pseudo-information.
On the presentation level, information filtering takes the form of user-preferences-based newsfeeds, etc.
Recommender systems are active information filtering systems that attempt to present
to the user information items (film, television, music, books, news, web pages) the user is
interested in. These systems add information items to the information flowing towards the
user, as opposed to removing information items from the information flow towards the user.
Recommender systems typically use collaborative filtering approaches or a combination of the collaborative filtering and content-based filtering approaches, although content-based recommender systems do exist.
A filtering system of this kind consists of several tools that help people find the most valuable information, so that the limited time available for reading, listening, or viewing is directed to the most interesting and valuable documents. Such filters are also used to organize and structure information in a clear and understandable way, and to group messages in a mail inbox. Filtering is also essential for the results returned by Internet search engines, and filtering functions continue to improve, making the retrieval of Web documents and messages more efficient.

RELEVANCE FEEDBACK:
The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to
improve the final result set. In particular, the user gives feedback on the relevance of
documents in an initial set of results. The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or nonrelevant.
• The system computes a better representation of the information need based on the user
feedback.
• The system displays a revised set of retrieval results.
Relevance feedback can go through one or more iterations of this sort. The process
exploits the idea that it may be difficult to formulate a good query when you don’t know the
collection well, but it is easy to judge particular documents, and so it makes sense to engage in
iterative query refinement of this sort. In such a scenario, relevance feedback can also be
effective in tracking a user’s evolving information need: seeing some documents may lead users
to refine their understanding of the information they are seeking.

The Rocchio algorithm for relevance feedback:


The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It
models a way of incorporating relevance feedback information into the vector space model.

Fig: Relevance feedback searching over images. (a) The user views the initial query results for the query "bike", selects the first, third and fourth results in the top row and the fourth result in the bottom row as relevant, and submits this feedback. (b) The user sees the revised result set. Precision is greatly improved.

Fig: The Rocchio optimal query for separating relevant and nonrelevant documents.

We want to find a query vector, denoted q, that maximizes similarity with the relevant documents while minimizing similarity with the nonrelevant documents. If Cr is the set of relevant documents and Cnr is the set of nonrelevant documents, then we wish to find:

q_{opt} = \arg\max_{q} \, [\, sim(q, C_r) - sim(q, C_{nr}) \,]

where sim is defined as in Equation 6.10. Under cosine similarity, the optimal query vector q_opt for separating the relevant and nonrelevant documents is:

q_{opt} = \frac{1}{|C_r|} \sum_{d_j \in C_r} d_j \; - \; \frac{1}{|C_{nr}|} \sum_{d_j \in C_{nr}} d_j
That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents. However, this observation is not terribly useful, precisely because the full set of relevant documents is not known: it is what we want to find.
The algorithm proposes using the modified query q_m:

q_m = \alpha q_0 + \beta \frac{1}{|D_r|} \sum_{d_j \in D_r} d_j - \gamma \frac{1}{|D_{nr}|} \sum_{d_j \in D_{nr}} d_j
where q0 is the original query vector, Dr and Dnr are the set of known relevant
and nonrelevant documents respectively, and α, β, and γ are weights attached to each
term. These control the balance between trusting the judged document set versus the
query: if we have a lot of judged documents, we would like a higher β and γ. Starting
from q0, the new query moves you some distance toward the centroid of the relevant
documents and some distance away from the centroid of the nonrelevant documents.
This new query can be used for retrieval in the standard vector space model. We can
easily leave the positive quadrant of the vector space by subtracting off a nonrelevant
document’s vector. In the Rocchio algorithm, negative term weights are ignored. That is,
the term weight is set to 0. Figure shows the effect of applying relevance feedback.
Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is
partly because the technique expands the query, but it is also partly an effect of the use
case: when they want high recall, users can be expected to take time to review results
and to iterate on the search. Positive feedback also turns out to be much more valuable
than negative feedback, and so most IR systems set γ < β. Reasonable values might be
α = 1, β = 0.75, and γ = 0.15. In fact, many systems, such as the image search system
in Figure 9.1, allow only positive feedback, which is equivalent to setting γ = 0. Another
alternative is to use only the marked nonrelevant document which received the highest
ranking from the IR system as negative feedback (here, |Dnr| = 1 in Equation (9.3)).
While many of the experimental results comparing various relevance feedback variants
are rather inconclusive, some studies have suggested that this variant, called Ide dec-hi
is the most effective or at least the most consistent performer.
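To make the Rocchio update concrete, here is a minimal Python sketch of the modified-query computation over term-weight vectors; the vector representation, the default weights, and the function name are illustrative assumptions rather than the implementation of any particular IR system.

import numpy as np

def rocchio_update(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector q_m for one round of relevance feedback.

    q0          -- original query vector (1-D numpy array over the vocabulary)
    relevant    -- list of judged-relevant document vectors
    nonrelevant -- list of judged-nonrelevant document vectors
    """
    qm = alpha * q0
    if relevant:                    # move toward the centroid of the relevant documents
        qm = qm + beta * np.mean(relevant, axis=0)
    if nonrelevant:                 # move away from the centroid of the nonrelevant documents
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)      # negative term weights are set to 0

# toy example over a 4-term vocabulary
q0 = np.array([1.0, 0.0, 0.0, 0.0])
relevant = [np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0, 0.0])]
nonrelevant = [np.array([0.0, 0.0, 0.0, 1.0])]
print(rocchio_update(q0, relevant, nonrelevant))

Setting gamma to 0 gives the positive-feedback-only variant mentioned above.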

Classification and Clustering:


Classification, also referred to as categorization, is the task of automatically applying labels to data, such as emails, web pages, or images. People classify items throughout their daily lives. It would be infeasible, however, to manually label every page on the Web according to some criteria, such as "spam" or "not spam."

Clustering is defined as the task of grouping related items together. In classification, each item is assigned a label, such as "spam" or "not spam." In clustering, however, each item is assigned to one or more clusters, where a cluster does not necessarily correspond to a meaningful concept, such as "spam" or "not spam." Instead, as we will describe later in this chapter, items are grouped together according to their similarity. Therefore, rather than mapping items onto a predefined set of labels, clustering allows the data to "speak for itself" by uncovering the implicit structure that relates the items.

These two tasks are classic machine learning problems. In machine learning, the learning
algorithms are typically characterized as supervised or unsupervised. In supervised learning, a
model is learned using a set of fully labeled items, which is often called the training set. Once
a model is learned, it can be applied to a set of unlabeled items, called the test set, in order to
automatically apply labels. Classification is often cast as a supervised learning problem. For
example, given a set of emails that have been labeled as “spam” or “not spam” (the training
set), a classification model can be learned. The model then can be applied to incoming emails in
order to classify them as “spam” or “not spam”.

Unsupervised learning algorithms, on the other hand, learn entirely based on unlabeled data. Unsupervised learning tasks are often posed differently than supervised learning tasks, since the input data is not mapped to a predefined set of labels. Clustering is the most common example of unsupervised learning.

Classification and Categorization:


Applying labels to observations is a very natural task, and something that most of us do,
often without much thought, in our everyday lives. For example, consider a trip to the local
grocery store. We often implicitly assign labels such as “ripe” or “not ripe,” “healthy” or “not
healthy,” and “cheap” or “expensive” to the groceries that we see. These are examples of
binary labels, since there are only two options for each. It is also possible to apply multivalued
labels to foods, such as “starch,” “meat,” “vegetable,” or “fruit.” Another possible labeling
scheme would arrange categories into a hierarchy, in which the “vegetable” category would be
split by color into subcategories, such as “green,” “red,” and “yellow.” Under this scheme, foods
would be labeled according to their position within the hierarchy. These different labeling or
categorization schemes, which include binary, multivalued, and hierarchical, are called
ontologies.

NAÏVE BAYES:

One of the most straightforward yet effective classification techniques is called Naïve Bayes. In the simplest case there are just two classes of interest, such as the relevant class and the nonrelevant class, but in general classification tasks can involve more than two labels or classes. In that situation, Bayes' Rule, which is the basis of a Bayes classifier, states that:

P(C|D) = \frac{P(D|C)\, P(C)}{P(D)}

Where C and D are random variables. Random variables are commonly used when
modeling uncertainty. Such variables do not have a fixed (deterministic) value. Instead, the
value of the variable is random. Every random variable has a set of possible outcomes
associated with it, as well as a probability distribution over the outcomes. As an example, the
outcome of a coin toss can be modeled as a random variable X. The possible outcomes of the random variable are "heads" (h) and "tails" (t). Given a fair coin, the probability associated with
both the heads outcome and the tails outcome is 0.5. Therefore, P(X = h) = P(X = t) = 0.5.

Consider another example, where you have the algebraic expression Y =10 + 2X. If X was
a deterministic variable, then Y would be deterministic as well. That is, for a fixed X, Y would
always evaluate to the same value. However, If X is a random variable, then Y is also a random
variable. Suppose that X had possible outcomes –1 (with probability 0.1), 0 (with probability
0.25), and 1 (with probability 0.65). The possible outcomes for Y would then be 8, 10, and 12,
with P(Y = 8) = 0.1, P(Y = 10) = 0.25, and P(Y = 12) = 0.65.

We denote random variables with capital letters (e.g., C, D) and outcomes of random
variables as lowercase letters (e.g., c, d). Furthermore, we denote the entire set of outcomes
with calligraphic letters (e.g., C, D). Finally, for notational convenience, instead of writing P(X =
x), we write P(x). Similarly for conditional probabilities, rather than writing P(X = x|Y = y), we
write P(x|y).

Bayes' Rule is important because it allows us to write a conditional probability (such as P(C|D)) in terms of the "reverse" conditional (P(D|C)). This is a very powerful theorem, because it is often easy to estimate or compute the conditional probability in one direction but not the other. For example, consider spam classification, where D represents a document's text and C represents the class label (e.g., "spam" or "not spam"). It is not immediately clear how to write a program that detects whether a document is spam; that program is represented by P(C|D).

However, it is easy to find examples of documents that are and are not spam, so it is possible to come up with estimates for P(D|C) from such examples or training data. The magic of Bayes' Rule is that it tells us how to obtain what we want, P(C|D), which we may not know how to estimate directly, from something we do know how to estimate, P(D|C).

It is straightforward to use this rule to classify items if we let C be the random variable associated with observing a class label and let D be the random variable associated with observing a document, as in our spam example. Given a document d (an outcome of random variable D) and a set of classes C = c1, . . . , cN (outcomes of the random variable C), we can use Bayes' Rule to compute P(c1|d), . . . , P(cN|d), which gives the likelihood of observing class label ci given that document d was observed. Document d can then be labeled with the class with the highest probability of being observed given the document. That is, Naïve Bayes classifies a document d as follows:

\arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \frac{P(d|c)\, P(c)}{P(d)}

where arg max_{c∈C} P(c|d) means "return the class c, out of the set of all possible classes
C, that maximizes P(c|d).” This is a mathematical way of saying that we are trying to find the
most likely class c given the document d. Instead of computing P(c|d) directly, we can compute
P(d|c) and P(c) instead and then apply Bayes’ Rule to obtain P(c|d). As we explained before,
one reason for using Bayes’ Rule is when it is easier to estimate the probabilities of one
conditional, but not the other. We now explain how these values are typically estimated in
practice.

We first describe how to estimate the class prior, P(c). The estimation is straightforward. It is estimated according to:

P(c) = \frac{N_c}{N}

where Nc is the number of training instances that have label c, and N is the total number of training instances. Therefore, P(c) is simply the proportion of training instances that have
label c. Estimating P(d|c) is a little more complicated because the same “counting” estimate
that we were able to use for estimating P(c) would not work. In order to make the estimation
feasible, we must impose the simplifying assumption that d can be represented as d = w1, . . . , wn and that wi is independent of wj for every i ≠ j. Simply stated, this says that document d can be factored into a set of elements (terms) and that the elements (terms) are independent of each other. This assumption is the reason for calling the classifier naïve, because it requires documents to be represented in an overly simplified way. In reality, terms are not independent of each other. However, properly modeling term dependencies is possible, although typically more difficult. Despite the independence assumption, the Naïve Bayes classifier has been shown to be robust and highly effective for various classification tasks. The naïve independence assumption allows us to invoke a classic result from probability theory which states that the joint probability of a set of (conditionally) independent random variables can be written as the product of the individual conditional probabilities. That means that P(d|c) can be written as:

P(d|c) = \prod_{i=1}^{n} P(w_i|c)

Therefore, we must estimate P(w|c) for every possible term w in the vocabulary V and class c in the ontology C. It turns out that this is a much easier task than estimating P(d|c), since there are a finite number of terms in the vocabulary and a finite number of classes, but an infinite number of possible documents. The independence assumption allows us to write the probability P(c|d) as:

P(c|d) = \frac{P(c) \prod_{i=1}^{n} P(w_i|c)}{P(d)}

The only thing left to describe is how to estimate P(w|c). Before we can estimate the
probability, we must first decide on what the probability actually means. For example, P(w|c)
could be interpreted as “the probability that term w is related to class c,” “the probability that
w has nothing to do with class c,” or any number of other things. In order to make the meaning
concrete, we must explicitly define the event space that the probability is defined over. An
event space is the set of possible events (or outcomes) from some process. A probability is
assigned to each event in the event space, and the sum of the probabilities over all of the
events in the event space must equal one.


The probability estimates and the resulting classification will vary depending on the
choice of event space. We will now briefly describe two of the more popular event spaces and
show how P(w|c) is estimated in each.

Multiple-Bernoulli model

The first event space that we describe is very simple. Given a class c, we define a binary random variable wi for every term in the vocabulary. The outcome for the binary event is either 0 or 1. The probability P(wi = 1|c) can then be interpreted as "the probability that term wi is generated by class c." Conversely, P(wi = 0|c) can be interpreted as "the probability that term wi is not generated by class c." This is exactly the event space used by the binary independence model and is known as the multiple-Bernoulli event space.

Under this event space, for each term in some class c, we estimate the probability that
the term is generated by the class. For example, in a spam classifier, P(cheap = 1|spam) is likely
to have a high probability, whereas P(dinner = 1|spam) is going to have a much lower
probability.

Fig: Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".

The figure shows how a set of training documents can be represented in this event space. In the example, there are 10 documents, two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the". In this example, P(spam) = 3/10 and P(not spam) = 7/10. Next, we must estimate P(w|c) for every pair of terms and classes. The most straightforward way is to estimate the probabilities using what is called the maximum likelihood estimate, which is:

P(w|c) = \frac{df_{w,c}}{N_c}

where df_{w,c} is the number of training documents with class label c in which term w
occurs, and Nc is the total number of training documents with class label c. As we see, the
maximum likelihood estimate is nothing more than the proportion of documents in class c that
contain term w. Using the maximum likelihood estimate, we can easily compute

P(the|spam) = 1,

P(the|not spam) = 1,

P(dinner|spam) = 0,

P(dinner|not spam) = 1/7 , and so on.

Using the multiple-Bernoulli model, the document likelihood, P(d|c), can be written as:

P(d|c) = \prod_{w \in V} P(w|c)^{\delta(w,d)} \, (1 - P(w|c))^{1 - \delta(w,d)}

where δ(w, d) is 1 if and only if term w occurs in document d. In practice, it is not possible to use the maximum likelihood estimate because of the zero probability problem. In
order to illustrate the zero probability problem, let us return to the spam classification example from the figure above. Suppose that we receive a spam email that happens to contain the term "dinner". No matter what other terms the email does or does not contain, the probability P(d|c) will always be zero because P(dinner|spam) = 0 and the term occurs in the document (i.e., δ(dinner, d) = 1). Therefore, any document that contains the term "dinner" will automatically have zero probability of being spam. This problem is more general, since a zero probability will result whenever a document contains a term that never occurs in one or more classes. The problem here is that the maximum likelihood estimate is based on counting occurrences in the training set. However, the training set is finite, so not every possible event is observed. This is known as data sparseness. Sparseness is often a problem with small training sets, but it can also happen with relatively large data sets. Therefore, we must alter the estimates in such a way that all terms, including those that have not been observed for a given class, are given some probability mass. That is, we must ensure that P(w|c) is nonzero for all terms in V. By doing so, we will avoid all of the problems associated with the zero probability problem.

Smoothing is a useful technique for overcoming the zero probability problem. One popular smoothing technique is often called Bayesian smoothing, which assumes some prior probability over models and uses a maximum a posteriori estimate. The resulting smoothed estimate for the multiple-Bernoulli model has the form:

P(w|c) = \frac{df_{w,c} + \alpha_w}{N_c + \alpha_w + \beta_w}

where αw and βw are parameters that depend on w. Different settings of these parameters result in different estimates. One popular choice is to set αw = 1 and βw = 0 for all w, which results in the following estimate:

P(w|c) = \frac{df_{w,c} + 1}{N_c + 1}


Another choice is to set αw = μNw/N and βw = μ(1 − Nw/N) for all w, where Nw is the total number of training documents in which term w occurs, and μ is a single tunable parameter. This results in the following estimate:

P(w|c) = \frac{df_{w,c} + \mu \frac{N_w}{N}}{N_c + \mu}

This event space only captures whether or not the term is generated; it fails to capture
how many times the term occurs, which can be an important piece of information. We will now
describe an event space that takes term frequency into account.
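The estimates above translate directly into code. Below is a minimal Python sketch of training and applying a multiple-Bernoulli Naïve Bayes classifier using the smoothed estimate with αw = 1 and βw = 0; the toy documents and function names are illustrative assumptions, not the textbook's example data.

import math
from collections import defaultdict

def train_bernoulli_nb(docs):
    """docs: list of (set_of_terms, class_label) pairs. Returns P(c), P(w=1|c), vocabulary."""
    N = len(docs)
    Nc = defaultdict(int)                      # number of documents per class
    df = defaultdict(int)                      # (class, term) -> document frequency
    vocab = set()
    for terms, c in docs:
        Nc[c] += 1
        vocab.update(terms)
        for w in terms:
            df[(c, w)] += 1
    priors = {c: Nc[c] / N for c in Nc}        # P(c) = Nc / N
    cond = {(c, w): (df[(c, w)] + 1) / (Nc[c] + 1)   # smoothed estimate with alpha_w = 1, beta_w = 0
            for c in Nc for w in vocab}
    return priors, cond, vocab

def classify_bernoulli_nb(terms, priors, cond, vocab):
    """Return the class maximizing log P(c) plus the multiple-Bernoulli log-likelihood."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[(c, w)]
            score += math.log(p) if w in terms else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy training set in the spirit of the spam example
docs = [({"cheap", "buy", "banking"}, "spam"),
        ({"cheap", "the"}, "spam"),
        ({"buy", "dinner", "the"}, "spam"),
        ({"dinner", "the"}, "not spam"),
        ({"banking", "dinner"}, "not spam"),
        ({"the", "buy"}, "not spam")]
priors, cond, vocab = train_bernoulli_nb(docs)
print(classify_bernoulli_nb({"cheap", "banking"}, priors, cond, vocab))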

Multinomial model:

The binary event space of the multiple-Bernoulli model is overly simplistic, as it does not
model the number of times that a term occurs in a document. Term frequency has been shown
to be an important feature for retrieval and classification, especially when used on long
documents. When documents are very short, it is unlikely that many terms will occur more than
one time, and therefore the multiple-Bernoulli model will be an accurate model. However,
more often than not, real collections contain documents that are both short and long, and
therefore it is important to take term frequency and, subsequently, document length into
account.

The multinomial event space is very similar to the multiple-Bernoulli event space, except
rather than assuming that term occurrences are binary (“term occurs” or “term does not
occur”), it assumes that terms occur zero or more times (“term occurs zero times”, “term
occurs one time”, etc.).
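To illustrate the difference, the following Python sketch implements a multinomial Naïve Bayes classifier that uses term frequencies rather than binary occurrences; the add-one (Laplace) smoothing, the toy data, and the function names are assumptions made for this example rather than the specific estimates used in the textbook.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (list_of_terms, class_label) pairs. Returns priors and P(w|c)."""
    N = len(docs)
    Nc = defaultdict(int)            # documents per class
    tf = defaultdict(Counter)        # class -> term-frequency counts
    vocab = set()
    for terms, c in docs:
        Nc[c] += 1
        tf[c].update(terms)
        vocab.update(terms)
    priors = {c: Nc[c] / N for c in Nc}
    totals = {c: sum(tf[c].values()) for c in Nc}
    # add-one smoothed term probabilities, one per (class, term)
    cond = {(c, w): (tf[c][w] + 1) / (totals[c] + len(vocab))
            for c in Nc for w in vocab}
    return priors, cond

def classify_multinomial_nb(terms, priors, cond):
    """Score each class by log P(c) plus the sum of log P(w|c) over the document's terms."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(
            math.log(cond[(c, w)]) for w in terms if (c, w) in cond)
    return max(scores, key=scores.get)

docs = [(["cheap", "cheap", "buy", "banking"], "spam"),
        (["dinner", "the", "the"], "not spam")]
priors, cond = train_multinomial_nb(docs)
print(classify_multinomial_nb(["cheap", "buy", "buy"], priors, cond))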


EXPECTATION MAXIMIZATION/MODEL-BASED CLUSTERING:

In K-means, we attempt to find centroids that are good representatives. We can view
the set of K centroids as a model that generates the data. Generating a document in this model
consists of first picking a centroid at random and then adding some noise. If the noise is
normally distributed, this procedure will result in clusters of spherical shape. Model-based
clustering assumes that the data were generated by a model and tries to recover the original
model from the data. The model that we recover from the data then defines clusters and an
assignment of documents to clusters.

A commonly used criterion for estimating the model parameters is maximum likelihood.
In K-means, the quantity exp(−RSS) is proportional to the likelihood that a particular model (i.e.,
a set of centroids) generated the data. For K-means, maximum likelihood and minimal RSS are
equivalent criteria. We denote the model parameters by Θ.


More generally, the maximum likelihood criterion is to select the parameters Θ that maximize the log-likelihood of generating the data D:

\Theta = \arg\max_{\Theta} L(D|\Theta) = \arg\max_{\Theta} \log \prod_{n=1}^{N} P(d_n|\Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \log P(d_n|\Theta)

L(D|Θ) is the objective function that measures the goodness of the clustering. Given two clusterings with the same number of clusters, we prefer the one with higher L(D|Θ).

Once we have Θ, we can compute an assignment probability P(d|ωk; Θ) for each document-cluster pair. This set of assignment probabilities defines a soft clustering.

An example of a soft assignment is that a document about Chinese cars may have a
fractional membership of 0.5 in each of the two clusters China and automobiles, reflecting the
fact that both topics are pertinent. A hard clustering like K-means cannot model this
simultaneous relevance to two topics.

Model-based clustering provides a framework for incorporating our knowledge about a domain. Clusters in K-means are assumed to be spheres. Model-based clustering offers more flexibility: the clustering model can be adapted to what we know about the underlying distribution of the data, be it Bernoulli, Gaussian with non-spherical variance, or a member of a different family.

EXPECTATION MAXIMIZATION ALGORITHM

A commonly used algorithm for model-based clustering is the Expectation-Maximization algorithm, or EM algorithm. EM clustering is an iterative algorithm that maximizes L(D|Θ). EM can be applied to many different types of probabilistic modeling. We will work with a mixture of multivariate Bernoulli distributions here, the distribution:

P(d|\omega_k; \Theta) = \Big(\prod_{t_m \in d} q_{mk}\Big)\Big(\prod_{t_m \notin d} (1 - q_{mk})\Big)

where Θ = {Θ1, . . . , ΘK}, Θk = (αk, q1k, . . . , qMk), and qmk = P(Um = 1|ωk) are the parameters of the model. P(Um = 1|ωk) is the probability that a document from cluster ωk contains term tm. The probability αk is the prior of cluster ωk: the probability that a document d is in ωk if we have no information about d.

The mixture model is then:

P(d|\Theta) = \sum_{k=1}^{K} \alpha_k \Big(\prod_{t_m \in d} q_{mk}\Big)\Big(\prod_{t_m \notin d} (1 - q_{mk})\Big)

In this model, we generate a document by first picking a cluster k with probability αk and then
generating the terms of the document according to the parameters qmk. Recall that the
document representation of the multivariate Bernoulli is a vector of M Boolean values.

How do we use EM to infer the parameters of the clustering from the data? That is, how do we choose parameters Θ that maximize L(D|Θ)? EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids; the parameters of the instance of EM in this section are the αk and qmk.

The maximization step recomputes the conditional parameters qmk and the priors αk as follows:

q_{mk} = \frac{\sum_{n=1}^{N} r_{nk}\, I(t_m \in d_n)}{\sum_{n=1}^{N} r_{nk}} \qquad \alpha_k = \frac{\sum_{n=1}^{N} r_{nk}}{N}

where I(t_m ∈ d_n) = 1 if t_m ∈ d_n and 0 otherwise, and r_nk is the soft assignment of document d_n to cluster k computed in the expectation step. The expectation step computes the soft assignment of documents to clusters given the current parameters qmk and αk:

r_{nk} = \frac{\alpha_k \big(\prod_{t_m \in d_n} q_{mk}\big)\big(\prod_{t_m \notin d_n} (1 - q_{mk})\big)}{\sum_{k'=1}^{K} \alpha_{k'} \big(\prod_{t_m \in d_n} q_{mk'}\big)\big(\prod_{t_m \notin d_n} (1 - q_{mk'})\big)}

The expectation step is nothing else but Bernoulli Naive Bayes classification (including
normalization, i.e. dividing by the denominator, to get a probability distribution over clusters).


Finding good seeds is even more critical for EM than for K-means. EM is prone to get
stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs
in other applications of EM. Therefore, as with K-means, the initial assignment of documents
to clusters is often computed by a different algorithm. For example, a hard K-means clustering
may provide the initial assignment, which EM can then “soften up.”
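As a concrete sketch of the expectation and maximization steps above, here is a small Python implementation of EM for a mixture of multivariate Bernoulli distributions over binary document vectors; the random initialization, the small smoothing constant used for numerical stability, and the toy data are illustrative assumptions.

import numpy as np

def bernoulli_em(X, K, iters=20, eps=1e-6, seed=0):
    """X: (N, M) binary document-term matrix. Returns priors alpha, parameters q,
    and the (N, K) soft assignment matrix r."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    alpha = np.full(K, 1.0 / K)                  # cluster priors
    q = rng.uniform(0.25, 0.75, size=(K, M))     # q[k, m] = P(U_m = 1 | cluster k)
    for _ in range(iters):
        # E-step: soft assignment r[n, k] proportional to alpha_k * P(d_n | cluster k)
        log_r = np.log(alpha + eps) + X @ np.log(q + eps).T \
                + (1 - X) @ np.log(1 - q + eps).T
        log_r -= log_r.max(axis=1, keepdims=True)        # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: recompute q and alpha from the soft counts
        Nk = r.sum(axis=0)
        q = (r.T @ X + eps) / (Nk[:, None] + 2 * eps)
        alpha = Nk / N
    return alpha, q, r

# toy corpus: two obvious groups of binary term vectors
X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]])
alpha, q, r = bernoulli_em(X, K=2)
print(np.round(r, 2))   # soft cluster memberships for each document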

CLUSTERING ALGORITHMS:

Clustering algorithms provide a different approach to organizing data. Clustering algorithms are based on unsupervised learning, which means that they do not require any training data. Clustering algorithms take a set of unlabeled instances and group (cluster) them together. One problem with clustering is that it is often an ill-defined problem. Classification has very clear objectives. However, the notion of a good clustering is often defined very subjectively.

Cluster hypothesis. Documents in the same cluster behave similarly with respect to
relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a
search request, then it is likely that other documents from the same cluster are also relevant.
This is because clustering puts together documents that share many terms.

K-MEANS Clustering:

K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid μ of the documents in a cluster ω:

\mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x


The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally,
the clusters should not overlap. Our desiderata for classes in Rocchio classification were the
same. The difference is that we have no labeled training set in clustering for which we know
which documents should be in the same cluster.

A measure of how well the centroids represent the members of their clusters is the RESIDUAL SUM OF SQUARES or RSS, the squared distance of each vector from its centroid, summed over all vectors:

RSS_k = \sum_{x \in \omega_k} |x - \mu(\omega_k)|^2 \qquad RSS = \sum_{k=1}^{K} RSS_k

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed,
minimizing RSS is equivalent to minimizing the average squared distance, a measure of how
well centroids represent their documents.

Fig: The K-means algorithm.

The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space in order to minimize RSS. This is done iteratively by repeating two steps until a stopping criterion is met: reassigning documents to the cluster with the closest centroid, and recomputing each centroid based on the current members of its cluster. The following figure shows snapshots from nine iterations of the K-means algorithm for a set of points.

We can apply one of the following termination conditions.

 A fixed number of iterations I has been completed. This condition limits the runtime of
the clustering algorithm, but in some cases the quality of the clustering will be poor
because of an insufficient number of iterations.
 Assignment of documents to clusters (the partitioning function γ) does not change
between iterations. Except for cases with a bad local minimum, this produces a good
clustering, but runtimes may be unacceptably long.
 Centroids μk do not change between iterations. This is equivalent to γ not changing.
 Terminate when RSS falls below a threshold. This criterion ensures that the clustering is
of a desired quality after termination.
 Terminate when the decrease in RSS falls below a threshold Θ. For small Θ, this
indicates that we are close to convergence. Again, we need to combine it with a bound
on the number of iterations to prevent very long runtimes.
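A minimal Python sketch of the reassignment and recomputation loop described above, using a fixed number of iterations as the termination condition; the random seed selection and the toy data are illustrative assumptions.

import numpy as np

def kmeans(X, K, iters=10, seed=0):
    """X: (N, M) matrix of document vectors. Returns centroids, assignments, and RSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # K random seeds
    for _ in range(iters):
        # reassignment step: each vector goes to the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # recomputation step: each centroid becomes the mean of its members
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    rss = ((X - centroids[assign]) ** 2).sum()
    return centroids, assign, rss

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, assign, rss = kmeans(X, K=2)
print(assign, round(rss, 3))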


Fig: A K-means example for K = 2 in R^2. The position of the two centroids (the μ's, shown as X's in the top four panels) converges after nine iterations.


We now show that K-means converges by proving that RSS monotonically decreases in each iteration. We use "decrease" to mean "decreases or does not change" in this section. First, RSS decreases in the reassignment step, since each vector is assigned to the closest centroid, so the distance it contributes to RSS decreases. Second, it decreases in the recomputation step because the new centroid is the vector v for which RSSk reaches its minimum:

RSS_k(v) = \sum_{x \in \omega_k} |v - x|^2 = \sum_{x \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2

\frac{\partial RSS_k(v)}{\partial v_m} = \sum_{x \in \omega_k} 2 (v_m - x_m)

where xm and vm are the mth components of their respective vectors. Setting the partial derivative to zero, we get:

v_m = \frac{1}{|\omega_k|} \sum_{x \in \omega_k} x_m
which is the componentwise definition of the centroid. Thus, we minimize RSSk when the old centroid is replaced with the new centroid. RSS, the sum of the RSSk, must then also decrease during recomputation.

Since there is only a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. Take care, however, to break ties consistently, e.g., by assigning a document to the cluster with the lowest index if there are several equidistant centroids. Otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost.

While this proves the convergence of K-means, there is unfortunately no guarantee that a global minimum in the objective function will be reached.

This is a particular problem if a document set contains many outliers, documents that
are far from any other documents and therefore do not fit well into any cluster. Frequently, if
an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent
iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. The following figure shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.

Effective heuristics for seed selection include

1. Excluding outliers from the seed set;


2. Trying out multiple starting points and choosing the clustering with lowest cost;
3. Obtaining seeds from another method such as hierarchical clustering.

Since deterministic hierarchical clustering methods are more predictable than K-means,
a hierarchical clustering of a small random sample of size iK (e.g., for i = 5 or i = 10) often
provides good seeds.

Time complexity of K-means:

Most of the time is spent on computing vector distances. One such operation costs
Θ(M). The reassignment step computes KN distances, so its overall complexity is Θ (KNM). In
the recomputation step, each vector gets added to a centroid once, so the complexity of this
step is Θ (NM). For a fixed number of iterations I, the overall complexity is therefore Θ (IKNM).
Thus, K-means is linear in all relevant factors: iterations, number of clusters, number of vectors
and dimensionality of the space. This means that K-means is more efficient than the
hierarchical algorithms.

(OR)

(Alternative presentation: either of the two treatments of K-means in this unit may be studied; both describe the same algorithm.)


The K-means algorithm is fundamentally different from the class of hierarchical clustering algorithms described later in this unit. For example, with agglomerative clustering, the algorithm begins with N clusters and iteratively combines two (or more) clusters together based on how costly it is to do so. As the algorithm proceeds, the number of clusters decreases. Furthermore, the algorithm has the property that once instances Xi and Xj are in the same cluster as each other, there is no way for them to end up in different clusters as the algorithm proceeds.

With the K-means algorithm, the number of clusters never changes. The algorithm starts with K clusters and ends with K clusters. During each iteration of the K-means algorithm, each instance is either kept in the same cluster or assigned to a different cluster. This process is repeated until some stopping criterion is met.

The goal of the K-means algorithm is to find the cluster assignments, represented by the assignment vector A[1], . . . , A[N], that minimize the following cost function:

COST(A[1], \ldots, A[N]) = \sum_{k=1}^{K} \sum_{i : A[i] = k} dist(X_i, C_k)

where dist(Xi, Ck) is the distance between instance Xi and cluster Ck. As with the various hierarchical clustering costs, this distance measure can be any reasonable measure, although it is typically assumed to be the following:

dist(X_i, C_k) = \|X_i - \mu_{C_k}\|^2
which is the Euclidean distance between Xi and μCk squared. Here, as before, μCk is the
centroid of cluster Ck. Notice that this distance measure is very similar to the cost associated
with Ward’s method for agglomerative clustering. Therefore, the method attempts to find the
clustering that minimizes the intra cluster variance of the instances.

Alternatively, the cosine similarity between Xi and μCk can be used as the distance measure. As described in Chapter 7, the cosine similarity measures the angle between two vectors. For some text applications, the cosine similarity measure has been shown to be more effective than the Euclidean distance. This specific form of K-means is often called spherical K-means.
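As a brief illustration of the cosine-similarity variant, the following sketch shows how the assignment step of spherical K-means can be written; normalizing the document vectors and centroids to unit length is an assumption of this illustration.

import numpy as np

def spherical_assign(X, centroids):
    """Assign each row of X to the centroid with the highest cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)            # unit-length documents
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)    # cosine similarity reduces to a dot product

X = np.array([[3.0, 1.0], [1.0, 4.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
print(spherical_assign(X, centroids))    # assigns each document to the nearer direction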

One of the most naïve ways to solve this optimization problem is to try every possible combination of cluster assignments. However, for large data sets this is computationally intractable, because it requires computing an exponential number of costs. Rather than finding the globally optimal solution, the K-means algorithm finds an approximate, heuristic solution that iteratively tries to minimize the cost. The solution returned by the algorithm is not guaranteed to be the global optimum. In fact, it is not even guaranteed to be locally optimal. Despite its heuristic nature, the algorithm tends to work very well in practice.

The pseudocode for one possible K-means implementation begins by initializing the assignment of instances to clusters. This can be done either randomly or by using some knowledge of the data to make a more informed decision. An iteration of the algorithm then proceeds as follows. Each instance is assigned to the cluster that it is closest to, in terms of the distance measure dist(Xi, Ck). A variable change keeps track of whether any of the instances changed clusters during the current iteration. If some have changed, then the algorithm proceeds. If none have changed, then the algorithm ends. Another reasonable stopping criterion is to run the algorithm for some fixed number of iterations.

In practice, K-means clustering tends to converge very quickly to a solution. Even though
it is not guaranteed to find the optimal solution, the solutions returned are often optimal or
close to optimal. When compared to hierarchical clustering, K-means is more efficient.
Specifically, since KN distance computations are done in every iteration and the number of
iterations is small, implementations of K-means are O(KN) rather than the O(N^2) complexity of hierarchical methods. Although the clusters produced by K-means depend on the starting points
chosen (the initial clusters) and the ordering of the input data, K-means generally produces
clusters of similar quality to hierarchical methods. Therefore, K-means is a good choice for an
all-purpose clustering algorithm for a wide range of search engine–related tasks, especially for
large data sets.


K NEAREST NEIGHBOR CLUSTERING:

Even though hierarchical and K-means clustering are different from an algorithmic point
of view, one thing that they have in common is the fact that both algorithms place every input
into exactly one cluster, which means that clusters do not overlap. Therefore, these algorithms
partition the input instances into K partitions (clusters). However, for certain tasks, it may be
useful to allow clusters to overlap. One very simple way of producing overlapping clusters is
called K nearest neighbor clustering. It is important to note that the K here is very different
from the K in K-means clustering, as will soon become very apparent.

In K nearest neighbor clustering, a cluster is formed around every input instance. For
input instance x, the K points that are nearest to x according to some distance metric and x itself
form a cluster. Figure shows several examples of nearest neighbor clusters with K = 5 formed
for the points A, B, C, and D. Although the figure only shows clusters around four input
instances, in reality there would be one cluster per input instance, resulting in N clusters.

As the figure illustrates, the algorithm often fails to find meaningful clusters. In sparse areas of the input space, such as around D, the points assigned to cluster D are rather far away and probably should not be placed in the same cluster as D. However, in denser areas of the input space, such as around B, the clusters are better defined, even though some related inputs may be missed because K is not large enough. Applications that use K nearest neighbor clustering tend to emphasize finding a small number of closely related instances in the K nearest neighbors (i.e., precision) over finding all the closely related instances (recall).

K nearest neighbor clustering can be rather expensive, since it requires computing distances between every pair of input instances. If we assume that computing the distance between two input instances takes constant time with respect to K and N, then this computation takes O(N^2) time. After all of the distances have been computed, it takes at most O(N^2) time to find the K nearest neighbors for each point. Therefore, the total time complexity for K nearest neighbor clustering is O(N^2), which is the same as hierarchical clustering.

For certain applications, K nearest neighbor clustering is the best choice of clustering
algorithm. The method is especially useful for tasks with very dense input spaces where it is
useful or important to find a number of related items for every input. Examples of these tasks
include language model smoothing, document score smoothing, and pseudo-relevance
feedback. We describe how clustering can be applied to smoothing shortly.
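A short Python sketch of K nearest neighbor clustering as described above, forming one cluster per instance from the instance and its K nearest neighbors; the brute-force O(N^2) distance computation and the toy data are illustrative assumptions.

import numpy as np

def knn_clusters(X, K=5):
    """Return, for each instance, the cluster formed by itself and its K nearest neighbors."""
    # pairwise squared Euclidean distances (O(N^2) work, as discussed in the text)
    diffs = X[:, None, :] - X[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    clusters = []
    for i in range(len(X)):
        order = np.argsort(dists[i])             # instance i itself is at distance 0
        clusters.append(order[:K + 1].tolist())  # i plus its K nearest neighbors
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
print(knn_clusters(X, K=2))

Note that the resulting clusters overlap, since every instance anchors its own cluster.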


Fig: Example of overlapping clustering using nearest neighbor clustering with K = 5. The
overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors
for each black point are shaded gray and labeled accordingly.

AGGLOMERATIVE CLUSTERING / HIERARCHICAL CLUSTERING /

HIERARCHICAL AGGLOMERATIVE CLUSTERING (HAC):

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called HIERARCHICAL AGGLOMERATIVE CLUSTERING or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached.


We first introduce a method for depicting hierarchical clusterings graphically, discuss a few key
properties of HACs and present a simple algorithm for computing an HAC.

An HAC clustering is typically visualized as a dendrogram as shown in Figure 1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd's CEO questioned and Lloyd's chief / U.S. grilling in Figure 1 is ≈ 0.56.

We define the combination similarity of a singleton cluster as its document's self-similarity (which is 1.0 for cosine similarity). By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents. A fundamental assumption in HAC is that the merge operation is monotonic.


Fig 1: A dendrogram of a single-link clustering of 30 documents from Reuters-RCV1

Monotonic means that if s1, s2, . . . , sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ . . . ≥ sK−1 holds. A non-monotonic hierarchical
clustering contains at least one inversion si < si+1 and contradicts the fundamental assumption
that we chose the best merge available at each step. We will see an example of an inversion in
Figure 2.


Fig 2: A simple, naive HAC algorithm (pseudocode).

Hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters just as in flat clustering. In those cases, the hierarchy needs to be cut at some point.

A number of criteria can be used to determine the cutting point:

1. Cut at a prespecified level of similarity. For example, we cut the dendrogram at 0.4 if we
want clusters with a minimum combination similarity of 0.4. In Figure 1, cutting the
diagram at y = 0.4 yields 24 clusters (grouping only documents with high similarity
together) and cutting it at y = 0.1 yields 12 clusters (one large financial news cluster and
11 smaller clusters).
2. Cut the dendrogram where the gap between two successive combination similarities is
largest. Such large gaps arguably indicate “natural” clusterings. Adding one more
cluster decreases the quality of the clustering significantly, so cutting before this steep
decrease occurs is desirable.


Another criterion is to cut the hierarchy so as to balance distortion against the number of clusters, choosing

K = \arg\min_{K'} \big[ RSS(K') + \lambda K' \big]

where K′ refers to the cut of the hierarchy that results in K′ clusters, RSS is the residual sum of squares, and λ is a penalty for each additional cluster. Instead of RSS, another measure of distortion can be used.

3. As in flat clustering, we can also prespecify the number of clusters K and select the
cutting point that produces K clusters.

A simple, naive HAC algorithm is shown in Figure 2 above. We first compute the N × N similarity matrix C. The algorithm then executes N − 1 steps of merging the currently most similar clusters. In each iteration, the two most similar clusters are merged and the rows and columns of the merged cluster i in C are updated. The clustering is stored as a list of merges in A. I indicates which clusters are still available to be merged. The function SIM(i, m, j) computes the similarity of cluster j with the merge of clusters i and m. For some HAC algorithms, SIM(i, m, j) is simply a function of C[j][i] and C[j][m], for example, the maximum of these two values for single-link. This algorithm can then be refined for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering. The merge criteria of these four variants of HAC are shown in Figure 3.
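In the spirit of the naive algorithm above, here is a compact Python sketch of single-link hierarchical agglomerative clustering; using negative Euclidean distance as the similarity and stopping at a target number of clusters are assumptions made for this illustration.

import numpy as np

def single_link_hac(X, target_clusters=1):
    """Naive single-link HAC: repeatedly merge the two most similar clusters."""
    # start with every document as a singleton cluster
    clusters = [[i] for i in range(len(X))]
    # similarity = negative Euclidean distance between documents
    sim = -np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    merges = []
    while len(clusters) > target_clusters:
        best = (None, None, float("-inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link: similarity of the closest pair across the two clusters
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if s > best[2]:
                    best = (a, b, s)
        a, b, s = best
        merges.append((clusters[a][:], clusters[b][:], s))   # record the merge history
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.1], [9.0, 0.0]])
clusters, merges = single_link_hac(X, target_clusters=2)
print(clusters)

Replacing the max in the inner loop with min would give complete-link behavior instead.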


Fig 3: Merge criteria of the four HAC variants: single-link, complete-link, group-average, and centroid clustering.

DECISION TREES:

A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Algorithm: The core algorithm for building decision trees, called ID3 and due to J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.

Entropy: A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.

To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:


a) Entropy using the frequency table of one attribute (the target):

E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

b) Entropy using the frequency table of two attributes (the target T and a splitting attribute X):

E(T, X) = \sum_{c \in X} P(c)\, E(c)

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e.,
the most homogeneous branches).

Step 1: Calculate the entropy of the target.


Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to obtain the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy:

Gain(T, X) = Entropy(T) - Entropy(T, X)

Step 3: Choose the attribute with the largest information gain as the decision node.


Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
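To make Steps 1-3 concrete, here is a small Python sketch that computes entropy and information gain for categorical attributes on a toy weather-style dataset; the dataset, attribute names, and function names are illustrative assumptions, not the textbook's frequency tables.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: sum of -p * log2(p) over the classes."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Gain(T, attr) = Entropy(T) minus the weighted entropy of splitting on attr."""
    labels = [r[target] for r in rows]
    split_entropy = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        split_entropy += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - split_entropy

# toy dataset in the spirit of the Outlook / Play example
rows = [
    {"Outlook": "Sunny",    "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny",    "Windy": "True",  "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "True",  "Play": "No"},
]
for attr in ("Outlook", "Windy"):
    print(attr, round(information_gain(rows, attr, "Play"), 3))

The attribute with the largest printed gain would be chosen as the decision node in Step 3.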

Decision Tree to Decision Rules

A decision tree can easily be transformed into a set of rules by mapping the paths from the root node to the leaf nodes one by one.
