
INFORMATION RETRIEVAL SYSTEMS

By

Ms. Suparna Das, Assistant Professor
Ms. D. Swapna, Associate Professor
Ms. S. Vidyullatha, Assistant Professor

Department of Computer Science and Engineering


Name of the Course: Information Retrieval Systems
Course Code: CS523PE
Year & Semester: B.Tech III Year I Sem
Name of the Faculty: Ms. Suparna Das, Ms. D. Swapna, Ms. S. Vidyullatha
Name of the Topic: Automatic Indexing
Automatic indexing is the process of analyzing an item to extract the
information to be permanently kept in an index.
This process is associated with the generation of the searchable data
structures associated with an item.
Indexing search strategies are classified as:
• Statistical
• Natural language
• Concept
Statistical strategies cover the broadest range of indexing techniques
and are the most prevalent in commercial systems.
The basis for the statistical approach is the frequency of occurrence
of events. The events are usually related to occurrences of processing
tokens (words/phrases) within documents and within the database.
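As a minimal sketch of this idea (the document names and whitespace tokenization here are illustrative, not from the source), the snippet below counts how often processing tokens occur within each document and within the database as a whole:

from collections import Counter

# Toy database: two documents with whitespace tokenization.
docs = {
    "doc1": "computer design contains memory chips",
    "doc2": "memory chips store data",
}

# Frequency of occurrence of processing tokens within each document.
per_document = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Frequency of occurrence within the whole database.
database = sum(per_document.values(), Counter())

print(per_document["doc1"]["memory"])  # 1 occurrence within doc1
print(database["memory"])              # 2 occurrences within the database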
Document and Term Clustering
Outline
• Introduction to Clustering
• Thesaurus Generation
• Item Clustering
• Hierarchy of Clustering
Clustering - Definition

─ Process of grouping similar objects together.
─ Objects within a cluster should be very similar to each other but…
─ …should be very different from the objects of other clusters.
─ We can say that intra-cluster similarity between objects is high and
inter-cluster similarity is low.
Clustering
• Clustering: the process of grouping a set of objects into
classes of similar objects.
– Documents within a cluster should be similar.
– Documents from different clusters should be dissimilar.
– We can say that intra-cluster similarity between objects is high and inter-
cluster similarity is low.
• Clustering also allows linkages between clusters.
• Hard clustering: Each document belongs to
exactly one cluster
• Soft clustering: A document can belong
to more than one cluster.
Clustering vs. Classification

Clustering groups “similar objects” and separates dissimilar objects.

[Figure: a classifier (classification process) assigns objects to the
predefined classes Class 1, Class 2, Class 3 and Class 4.]
Clustering
• Clustering is often called unsupervised learning
because no predefined classes are provided.
-Example : Given a collection of text documents, we want to organize them
according to their content similarities.
• IR model has two types of clustering:
– Term clustering
• Used to create a statistical thesaurus
• Increases recall by expanding searches with related terms (query
expansion)
– Document clustering
• Used to create document clusters
• The search can retrieve items similar to an item of interest, even if the
query would not have retrieved the item (resultant set
expansion)
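As a small illustration of term clustering feeding query expansion (a sketch; the thesaurus entries below are invented for the example):

# Hypothetical statistical thesaurus: each term maps to the other
# members of its term cluster.
thesaurus = {
    "memory": ["storage", "ram"],
    "chips": ["semiconductors"],
}

def expand_query(query_terms):
    """Add the cluster-mates of every query term, trading some
    precision for higher recall."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["memory", "chips"]))
# ['memory', 'chips', 'storage', 'ram', 'semiconductors']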
Process of Clustering
• Define the domain for clustering
– Determine set of items to be clustered.
• Determine the attributes of the objects to be clustered
– determine the specific words in the objects to be used.
– focus on the specific zones within the items that are to be used to
determine similarity.
– reduce erroneous associations.
Process of Clustering
• Determine the relationships between the attributes whose
co-occurrence in objects suggest those objects should be in
the same class
– determine which words are synonyms and the strength of their
relationships
– For documents define a similarity function based on word
co- occurrences.
• Apply some algorithm to determine the class(es) to which
each object will be assigned.
Clustering
Guidelines on the Characteristics of the Classes
• A well-defined semantic definition should exist for each class.
• The size of the classes should be within the same order of
magnitude.
• Within a class, one object should not dominate the class.
• Whether an object can be assigned to multiple classes or just
one must be decided at creation time.
Thesaurus Generation

• Clustering of words originated with thesaurus generation.
• A thesaurus is a dictionary that provides synonyms
and antonyms for each word.
• Thesaurus generation may be manual or automatic.
So there are two types of clustering:
 Manual clustering
 Automatic clustering
Manual Clustering
• This process follows the steps (1 to 4) in the generation of a thesaurus.
• A concordance is used in the determination of useful words.
• A concordance is an alphabetical listing of words from a set of items
along with their frequency of occurrence.
• If a concordance is used, other tools such as KWOC, KWIC or KWAC
may help in determining useful words.
• Key Word Out of Context (KWOC) is another name for a concordance.
• Key Word In Context (KWIC) displays a possible term in its
phrase context.
• Key Word And Context (KWAC) displays the keywords followed by their
context.
• Once the terms are selected, they are clustered based upon the word
relationships and the strength of those relationships.
• Example: The various displays for the sentence
“computer design contains memory chips”
• KWOC
TERM      FREQ  ITEM ID
chips      2    doc2, doc4
computer   3    doc1, doc4, doc10
design     1    doc4
memory     3    doc3, doc4, doc8
• KWIC
chips/ computer design contains memory
computer design contains memory chips/
design contains memory chips/ computer
memory chips/ computer design contains
• KWAC
chips     computer design contains memory chips
computer  computer design contains memory chips
design    computer design contains memory chips
memory    computer design contains memory chips
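These displays are mechanical enough to generate in a few lines; the sketch below (whitespace tokenization, with "contains" treated as a stop word, both assumptions made for this example) reproduces the KWIC and KWAC listings above:

def kwic(phrase, stopwords=()):
    """Key Word In Context: one cyclic rotation of the phrase per
    keyword, with '/' marking the end of the original phrase."""
    words = phrase.split()
    lines = []
    for i, word in enumerate(words):
        if word in stopwords:
            continue
        lines.append((" ".join(words[i:]) + "/ " + " ".join(words[:i])).strip())
    return sorted(lines)

def kwac(phrase, stopwords=()):
    """Key Word And Context: each keyword followed by the full phrase."""
    keywords = set(phrase.split()) - set(stopwords)
    return sorted(f"{word}\t{phrase}" for word in keywords)

sentence = "computer design contains memory chips"
for line in kwic(sentence, stopwords={"contains"}):
    print(line)
for line in kwac(sentence, stopwords={"contains"}):
    print(line)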
Automatic Term Clustering
• Attempt to generate a thesaurus automatically by analyzing
the collection of documents.
• Basic idea for term clustering: the more frequently two terms
co-occur in the same items, the more likely they are about
the same concept.
Complete Term Relation Method
• The documents, terms and weight can be represented in a
matrix called term matrix where rows are items and
columns are terms..
• The similarity between every term pair is calculated as a
basis for determining the clusters.
• Using the vector model for clustering, the similarity between two
terms is calculated with the measure:

SIM(Term_i, Term_j) = Σ_k (Term_i,k × Term_j,k)

where “k” is summed across the set of all items.


Complete Term Relation Method
• The similarity of two columns is computed by multiplying the
corresponding values and accumulating.
• The results can be placed in a resultant "m" by "m" matrix,
called a Term-Term Matrix , where "m" is the number of
columns (terms) in the original matrix.
• The next step is to select a threshold that determines if two
terms are considered similar enough to each other to be in
the same class. This produces a new binary matrix called
Term Relationship Matrix.
• The final step in creating clusters is to determine when two
objects (terms) are in the same class.
Example
• A matrix representation of 5 documents and 8 terms:

        Term1 Term2 Term3 Term4 Term5 Term6 Term7 Term8
Item1     0     4     0     0     0     2     1     3
Item2     3     1     4     3     1     2     0     1
Item3     3     0     0     0     3     0     3     0
Item4     0     1     0     3     0     0     2     0
Item5     2     2     2     3     1     4     0     2

• The similarity between Term1 and Term2, using the previous measure:
SIM(Term1, Term2) = 0×4 + 3×1 + 3×0 + 0×1 + 2×2 = 7
Complete Term Relation Method

Term-Term Matrix (similarity of every term pair):

       T1  T2  T3  T4  T5  T6  T7  T8
T1      -   7  16  15  14  14   9   7
T2      7   -   8  12   3  18   6  17
T3     16   8   -  18   6  16   0   8
T4     15  12  18   -   6  18   6   9
T5     14   3   6   6   -   6   9   3
T6     14  18  16  18   6   -   2  16
T7      9   6   0   6   9   2   -   3
T8      7  17   8   9   3  16   3   -

Term Relationship Matrix (Threshold = 10; 1 where similarity ≥ 10):

       T1  T2  T3  T4  T5  T6  T7  T8
T1      -   0   1   1   1   1   0   0
T2      0   -   0   1   0   1   0   1
T3      1   0   -   1   0   1   0   0
T4      1   1   1   -   0   1   0   0
T5      1   0   0   0   -   0   0   0
T6      1   1   1   1   0   -   0   1
T7      0   0   0   0   0   0   -   0
T8      0   1   0   0   0   1   0   -
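Both matrices can be checked with a short computation (a sketch using NumPy; the document-term matrix and the threshold of 10 are taken from the example above):

import numpy as np

# Document-term matrix: 5 items (rows) x 8 terms (columns).
A = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# Term-Term Matrix: SIM(Term_i, Term_j) summed over all items.
term_term = A.T @ A
np.fill_diagonal(term_term, 0)       # ignore self-similarity

# Term Relationship Matrix: 1 where the similarity meets the threshold.
THRESHOLD = 10
relationship = (term_term >= THRESHOLD).astype(int)

print(term_term[0, 2])     # SIM(Term1, Term3) = 16
print(relationship[0, 2])  # 1 -> Term1 and Term3 are related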
Complete Term Relation Method
• The final step in creating clusters is assigning the
terms to clusters.
• There are many different algorithms available. The following
algorithms are the most common:

 stars
 cliques
 single link
 connected components or strings
Stars
• Select a term and all its related terms, then place them
in a class.
• Terms not yet in classes are selected as new classes until
all terms are assigned to classes.
• Applying this algorithm for creating clusters to the
Term Relationship Matrix, the
following classes are created:
– Class 1 (Term 1, Term 3, Term 4, Term 5, Term 6)
– Class 2 (Term 2, Term 4, Term 8, Term 6)
– Class 3 (Term 7)
Cliques
• Cliques require all terms in a cluster to be within
the threshold of all other terms.
• Classes created :
Class1 = (Term1, Term3, Term4, Term6)
Class2 = (Term1, Term5)
Class3 = (Term2, Term4, Term6)
Class4 = (Term2, Term6, Term8)
Class5 = (Term7)
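Enumerating cliques by hand quickly gets tedious; as a sketch (assuming the third-party networkx library is available), the classes above can be reproduced by treating the Term Relationship Matrix as a graph and listing its maximal cliques:

import networkx as nx

# Edges are the term pairs whose similarity met the threshold,
# read off the Term Relationship Matrix above.
edges = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
         (3, 4), (3, 6), (4, 6), (6, 8)]
G = nx.Graph(edges)
G.add_node(7)  # Term7 is related to no other term

# Every maximal clique is a thesaurus class; a term may appear in several.
for clique in nx.find_cliques(G):
    print(sorted(clique))
# Prints the five classes (in some order):
# [1, 3, 4, 6], [1, 5], [2, 4, 6], [2, 6, 8], [7]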
Single link
• Algorithm :
1. Select a term not yet in a class and place it in a new class
2. Place in that class all other terms that are related to it.
3. For each term entered into the class, perform step 2
4. When no new terms are identified in Step 2, goto Step 1
Classes created:
Class1= (Term 1, Term 3, Term 4, Term 5, Term 6, Term 2)
Class2 = (Term7)
Connected Components or strings
• Connected components require all terms in a cluster (thesaurus
class) to be similar to at least one other term.
• Algorithm:
1. Select a term not yet in a class and place it in a new class ( If all
terms are in classes, stop)
2. Add to this class a term similar to the selected term and not yet in
the class
3. Repeat Step 2 with the new term, until no new terms may be
added
4. When no new terms are identified in Step 2, goto Step 1
• Example: Classes created:
Class1 = (Term1, Term3, Term4, Term2, Term8, Term6)
Class2 = (Term5)
Class3 = (Term7)
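Because each class grows as a chain from the most recently added term, the result depends on which similar term is picked at each step; the sketch below (always picking the lowest-numbered unassigned term, an assumption for this example) reproduces the same class contents as above:

def string_clusters(terms, related):
    """Strings: grow each class one term at a time, always extending
    from the most recently added term. 'related' maps a term to the
    terms whose similarity met the threshold."""
    assigned, classes = set(), []
    for seed in terms:
        if seed in assigned:
            continue
        cls, current = [seed], seed
        assigned.add(seed)
        while True:
            candidates = [t for t in related[current] if t not in assigned]
            if not candidates:
                break
            current = candidates[0]      # lowest-numbered similar term
            cls.append(current)
            assigned.add(current)
        classes.append(cls)
    return classes

# Relationships read off the Term Relationship Matrix above.
related = {1: [3, 4, 5, 6], 2: [4, 6, 8], 3: [1, 4, 6], 4: [1, 2, 3, 6],
           5: [1], 6: [1, 2, 3, 4, 8], 7: [], 8: [2, 6]}
print(string_clusters(range(1, 9), related))
# [[1, 3, 4, 2, 6, 8], [5], [7]] -- the same sets as the example's classes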
Clustering Using Existing Clusters
• Start with a set of existing clusters.
• The initial assignment of terms to the clusters is revised
by revalidating every term assignment to a cluster.
– To minimize calculations, centroids are calculated for each cluster
• Centroid: the average of all of the vectors in a cluster.
– The similarity between all existing terms and the centroids of the
clusters can be calculated.
– The term is reallocated to the cluster that has the highest
similarity.
• The process stops when minimal movement between
clusters is detected.
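A minimal sketch of this reassignment loop (NumPy, dot-product similarity as in the rest of these slides; the function and variable names are illustrative, and the classes it settles on depend on the initial assignment):

import numpy as np

def reassign_terms(term_vectors, classes, max_iter=20):
    """Compute each class centroid, then move every term to the class
    with the most similar centroid; stop when nothing moves."""
    for _ in range(max_iter):
        centroids = [term_vectors[sorted(c)].mean(axis=0) if c
                     else np.zeros(term_vectors.shape[1]) for c in classes]
        new = [set() for _ in classes]
        for t, vec in enumerate(term_vectors):
            best = max(range(len(centroids)), key=lambda i: vec @ centroids[i])
            new[best].add(t)
        if new == classes:               # minimal movement detected
            break
        classes = new
    return classes

# Terms indexed 0-7 correspond to Term1-Term8 of the running example.
terms = np.array([
    [0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
    [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2]])
print(reassign_terms(terms, [{0, 1}, {2, 3}, {4, 5}]))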
Illustration of Centroid Movement
[Figure: initial centroids for the clusters, and the centroids after
reassigning terms.]

Iteration 1
• Initial assignment
– Class 1 = (Term 1, Term 2)
– Class 2 = (Term 3, Term 4)
– Class 3 = (Term 5, Term 6)
• Initial centroids
Centroid1 = (4/2, 4/2, 3/2, 1/2, 4/2)
Centroid2 = (0/2, 7/2, 0/2, 3/2, 5/2)
Centroid3 = (2/2, 3/2, 3/2, 0/2, 5/2)
Example (Cont.)

Apply the simple similarity measure between each of the 8 terms and the
3 centroids.
One technique for breaking ties is to look at the similarity weights of
the other terms in the class and assign the term to the class that has
the most similar weights.
Iteration 2
New centroids and cluster assignments
Note: Term7 moved from Class1 to Class3.
The next iteration will not cause any movements, so the process stops.
Clustering Using Existing Clusters
• The process requires fewer calculations.
• The number of classes is defined at the start of the process
and cannot grow.
– It is possible to end up with fewer classes at the end of the process.
• Since all terms must be assigned to a class, terms are forced into
classes even if their similarity to the class is very weak compared
to that of other terms assigned.
One Pass Assignments
• Minimum overhead: only one pass of all of the terms is used
to assign terms to classes.
• Algorithm
1. Assign the next term to a new class.
2. Compute the centroid of the modified class.
3. Compare the next term to the centroids of all existing
classes.
• If the similarity to all existing centroids is less
than the predetermined threshold then goto Step 1.
• Otherwise, assign this term to the class with the most
similar centroid and goto Step 2.
One Pass Assignments
Example
Term1 = (0,3,3,0,2)
Assign Term1 to new Class1. Centroid1 = (0/1, 3/1, 3/1, 0/1, 2/1)
Term2 = (4,1,0,1,2)
Similarity(Term2, Centroid1) = 7 (below threshold)
Assign Term2 to new Class2. Centroid2 = (4/1, 1/1, 0/1, 1/1, 2/1)
Term3 = (0,4,0,0,2)
Similarity(Term3, Centroid1) = 16 (highest)
Similarity(Term3, Centroid2) = 8
Assign Term3 to Class1. Centroid1 = (0/2, 7/2, 3/2, 0/2, 4/2)
Term4 = (0,3,0,3,3)
Similarity(Term4, Centroid1) = 16.5 (highest)
Similarity(Term4, Centroid2) = 12
Assign Term4 to Class1. Centroid1 = (0/3, 10/3, 3/3, 3/3, 7/3)
One Pass Assignments
Example (cont.)
Term5 = (0,1,3,0,1)
Similarity(Term5, Centroid1) = 8.67 (below threshold)
Similarity(Term5, Centroid2) = 3 (below threshold)
Assign Term5 to new Class3. Centroid3 = (0/1, 1/1, 3/1, 0/1, 1/1)
Term6 = (2,2,0,0,4)
Similarity(Term6, Centroid1) = 13.67
Similarity(Term6, Centroid2) = 17 (highest)
Similarity(Term6, Centroid3) = 6
Assign Term6 to Class2. Centroid2 = (6/2, 3/2, 0/2, 1/2, 6/2)
Term7 = (1,0,3,2,0)
Similarity(Term7, Centroid1) = 5 (below threshold)
Similarity(Term7, Centroid2) = 4 (below threshold)
Similarity(Term7, Centroid3) = 9 (below threshold)
Assign Term7 to new Class4. Centroid4 = (1/1, 0/1, 3/1, 2/1, 0/1)
One Pass Assignments
Example (cont.)
Term8 = (3,1,0,0,2)
Similarity(Term8, Centroid1) = 8
Similarity(Term8, Centroid2) = 16.5 (highest)
Similarity(Term8, Centroid3) = 3
Similarity(Term8, Centroid4) = 3
Assign Term8 to Class2. Centroid2 = (9/3, 4/3, 0/3, 1/3, 8/3)
Final classes:
Class1 = (Term1, Term3, Term4)
Class2 = (Term2, Term6, Term8)
Class3 = (Term5)
Class4 = (Term7)
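The whole walkthrough can be replayed in code; the sketch below (NumPy, threshold 10, dot-product similarity) reproduces the final classes, although a couple of the intermediate similarity values printed in the slides differ slightly:

import numpy as np

def one_pass(term_vectors, threshold=10):
    """Assign each term to the class with the most similar centroid,
    or open a new class if no similarity reaches the threshold."""
    classes, centroids = [], []
    for t, vec in enumerate(term_vectors):
        sims = [vec @ c for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            classes[best].append(t)
            centroids[best] = term_vectors[classes[best]].mean(axis=0)
        else:
            classes.append([t])
            centroids.append(vec.astype(float))
    return classes

terms = np.array([
    [0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
    [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2]])
print(one_pass(terms))
# [[0, 2, 3], [1, 5, 7], [4], [6]] -> Class1..Class4 in the example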
Item Clustering
• Similar to term clustering for thesaurus generation.
• The techniques for term clustering are also applicable to
item clustering.
• In automatic item clustering, the documents, terms and
weights can be represented in a matrix called the Term Matrix.
• Similarity between two documents is based on the terms they
have in common. It is calculated with the measure:

SIM(Item_i, Item_j) = Σ_k (Item_i,k × Item_j,k)

where “k” is summed across the set of all terms.
Item/Item Matrix:

       I1  I2  I3  I4  I5
I1      -  11   3   6  22
I2     11   -  12  10  36
I3      3  12   -   6   9
I4      6  10   6   -  11
I5     22  36   9  11   -

Item Relationship Matrix (Threshold = 10):

       I1  I2  I3  I4  I5
I1      -   1   0   0   1
I2      1   -   1   1   1
I3      0   1   -   0   0
I4      0   1   0   -   1
I5      1   1   0   1   -
Item Clustering
• Using the Stars algorithm for assigning items to classes
produces the following classes:
Class 1 - Item 1, Item 2, Item 5
Class 2 - Item 3, Item 2
Class 3 - Item 4, Item 2, Item 5
Clustering with existing clusters
Class 1: {Item 1, Item 3}
Class 2: {Item 2, Item 4}
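The item-level computation mirrors the term-level one; a sketch (NumPy; items indexed 0-4 correspond to Item1-Item5) that reproduces the Stars classes above:

import numpy as np

A = np.array([                       # same 5 x 8 document-term matrix
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2]])

item_item = A @ A.T                  # SIM(Item_i, Item_j) over all terms
np.fill_diagonal(item_item, 0)
related = item_item >= 10            # Item Relationship Matrix

# Stars: seed each class with the first item not yet in any class,
# then pull in every item related to the seed.
in_class, classes = set(), []
for i in range(len(A)):
    if i in in_class:
        continue
    cls = [i] + [j for j in range(len(A)) if related[i, j]]
    in_class.update(cls)
    classes.append(cls)
print(classes)  # [[0, 1, 4], [2, 1], [3, 1, 4]] -> the classes above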
Hierarchy of Clusters
•Hierarchical clustering is a method of cluster analysis which
seeks to build a hierarchy of clusters.
• Two main types of hierarchical clustering
Agglomerative:
 This is a "bottom up" approach.
 pairs of clusters are merged as one moves up the hierarchy.
Divisive:
 This is a "top down" approach:
 splits are performed recursively as one moves down the hierarchy.
•The results of hierarchical clustering are usually presented
in a dendrogram.
Hierarchy of Clusters
• The Hierarchical Agglomerative Clustering Method (HACM) starts
with un-clustered items and performs pair-wise similarity measures
to determine the clusters.
• We keep merging the most similar pairs of items until we
have one big cluster left.
Hierarchical Clustering
• This produces a binary tree or dendrogram.
• The final cluster is the root and each data item is a leaf.
Objectives of Creating a
Hierarchy of Clusters
• Reduce the overhead of search.
• Provide visual representation of the information space.
• Expand the retrieval of relevant items.
Dendrogram for Visualizing Hierarchical Clusters
Similarity Measure between
Clusters
• Single link clustering
– The similarity between two clusters is computed as the
maximum similarity between any two documents in the two
clusters, each from a separate cluster.
• Complete linkage
– similarity is computed as the minimum of the similarity
between any two documents in the two clusters, each from a
separate cluster.
• Group average (or centroid linkage)
– Similarity is computed as the average of the similarity values
between all pairs of documents, one from each cluster.
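These three choices map directly onto standard library implementations; a sketch using SciPy (note that SciPy works with distances rather than similarities, so cosine distance stands in for the similarity measure used elsewhere in these slides):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Item vectors from the running example (rows of the term matrix).
items = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2]], dtype=float)

# 'single', 'complete' and 'average' correspond to the three
# cluster-similarity measures described above.
for method in ("single", "complete", "average"):
    Z = linkage(items, method=method, metric="cosine")
    print(method, Z[:, 2].round(3))   # distance at each merge step

# The dendrogram visualizes the merge hierarchy (root = final cluster).
dendrogram(linkage(items, method="average", metric="cosine"),
           labels=[f"Item{i}" for i in range(1, 6)])
plt.show()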
Questions

1. Define clustering. Clearly bring out the steps of the process
of clustering.
2. Compare and contrast manual clustering and automatic clustering.
3. List out the various techniques in automatic term clustering
and explain them.
4. Discuss item clustering.
5. Write about the hierarchical agglomerative clustering technique
in detail.
