[Figure: a classifier or classification process assigning items to Class 1, Class 2, Class 3, and Class 4]
Clustering
• Clustering is often called unsupervised learning
because no predefined classes are provided.
– Example: given a collection of text documents, we want to organize them
according to their content similarities.
• IR systems use two types of clustering:
– Term clustering
• Used to create a statistical thesaurus
• Increases recall by expanding searches with related terms (query
expansion)
– Document clustering
• Used to create document clusters
• The search can retrieve items similar to an item of interest, even if the
query would not have retrieved the item (resultant set
expansion)
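A minimal sketch of query expansion with thesaurus classes, assuming the classes have already been produced by term clustering. The class contents and the function name expand_query are hypothetical and only illustrate the idea.

```python
# Hypothetical thesaurus classes, as term clustering might produce them.
THESAURUS_CLASSES = [
    {"car", "automobile", "vehicle"},
    {"film", "movie"},
]


def expand_query(query_terms, classes):
    """Return the query terms plus every term that shares a class with one of them."""
    original = set(query_terms)
    expanded = set(original)
    for cls in classes:
        # Only the original query terms trigger expansion, so classes do not chain.
        if cls & original:
            expanded |= cls
    return expanded


print(sorted(expand_query(["car", "rental"], THESAURUS_CLASSES)))
# ['automobile', 'car', 'rental', 'vehicle']
```

The expanded query can match items that use "automobile" or "vehicle" but never "car", which is how term clustering raises recall.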
Process of Clustering
• Define the domain for clustering
– Determine set of items to be clustered.
• Determine the attributes of the objects to be clustered
– Determine the specific words in the objects to be used.
– Focus on specific zones within the items that are to be used to
determine similarity.
– Reduce erroneous associations.
• Determine the relationships between the attributes whose
co-occurrence in objects suggests those objects should be in
the same class
– Determine which words are synonyms and the strength of their
relationships.
– For documents, define a similarity function based on word
co-occurrences.
• Apply some algorithm to determine the class(es) to which
each object will be assigned.
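The relationship step can be sketched as follows. This is only an illustration: the item-term weights are made up, and summing the products of term weights across items is assumed as the co-occurrence similarity, which is one common choice rather than the only one. The names term_similarity and related_pairs are hypothetical.

```python
# Hypothetical item-term weights: one row per item, one column per term.
ITEM_TERM = [
    [2, 0, 1],
    [1, 3, 0],
    [0, 2, 2],
]


def term_similarity(item_term):
    """Term-term matrix: for each pair of terms, sum the weight products over items."""
    n_terms = len(item_term[0])
    sim = [[0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            if i != j:  # the diagonal is left at 0; self-similarity is not used
                sim[i][j] = sum(row[i] * row[j] for row in item_term)
    return sim


def related_pairs(sim, threshold):
    """Pairs of term indices whose similarity reaches the threshold."""
    n = len(sim)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] >= threshold}


sim = term_similarity(ITEM_TERM)
print(sim)                    # [[0, 3, 2], [3, 0, 4], [2, 4, 0]]
print(related_pairs(sim, 3))  # the pairs (0, 1) and (1, 2)
```

Thresholding the similarity matrix yields a binary term relationship, which is the input the class-forming algorithms work on.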
Clustering
Guidelines on the Characteristics of the Classes
• Stars
• Cliques
• Single link
• Connected components or strings
Stars
• Select a term and all its related terms, then place them
in a class.
• Terms not yet in classes are selected as new classes until
all terms are assigned to classes.
• Applying this algorithm to the Term Relationship Matrix
creates the following classes:
– Class 1 (Term 1, Term 3, Term 4, Term 5, Term 6)
– Class 2 (Term 2, Term 4, Term 8, Term 6)
– Class 3 (Term 7)
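A minimal sketch of the star technique. The Term Relationship Matrix is not reproduced in this section, so the RELATED mapping below is an assumption: a symmetric relation chosen to be consistent with the classes listed above. Seeds are taken in term-number order, which is also an assumption.

```python
# Assumed term relationships (term -> related terms), not the original matrix.
RELATED = {
    1: {3, 4, 5, 6},
    2: {4, 6, 8},
    3: {1, 4, 6},
    4: {1, 2, 3, 6},
    5: {1},
    6: {1, 2, 3, 4, 8},
    7: set(),
    8: {2, 6},
}


def star_classes(related):
    """Each unassigned term seeds a class holding itself and all its related terms."""
    classes = []
    assigned = set()
    for term in sorted(related):
        if term in assigned:
            continue
        cls = {term} | related[term]  # related terms may already sit in another class
        classes.append(cls)
        assigned |= cls
    return classes


for number, cls in enumerate(star_classes(RELATED), start=1):
    print(f"Class {number}: {sorted(cls)}")
# Class 1: [1, 3, 4, 5, 6]
# Class 2: [2, 4, 6, 8]
# Class 3: [7]
```

Terms 4 and 6 appear in two classes: the star technique allows overlap, and the result depends on the order in which seeds are chosen.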
Cliques
• Cliques require all terms in a cluster to be within
the threshold of all other terms in that cluster.
• Classes created:
Class1 = (Term1, Term3, Term4, Term6)
Class2 = (Term1, Term5)
Class3 = (Term2, Term4, Term6)
Class4 = (Term2, Term6, Term8)
Class5 = (Term7)
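A clique class is a maximal set of terms that are all related to each other. The sketch below uses the Bron-Kerbosch procedure for enumerating maximal cliques, which is a swapped-in technique and not necessarily the procedure used to build the classes above. The RELATED mapping is an assumed relation consistent with those classes, not the original matrix.

```python
# Assumed term relationships (term -> related terms), not the original matrix.
RELATED = {
    1: {3, 4, 5, 6},
    2: {4, 6, 8},
    3: {1, 4, 6},
    4: {1, 2, 3, 6},
    5: {1},
    6: {1, 2, 3, 4, 8},
    7: set(),
    8: {2, 6},
}


def maximal_cliques(related):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend it, so it is maximal
            return
        for term in sorted(candidates):
            expand(current | {term},
                   candidates & related[term],
                   excluded & related[term])
            candidates = candidates - {term}
            excluded = excluded | {term}

    expand(set(), set(related), set())
    return cliques


for cls in sorted(sorted(c) for c in maximal_cliques(RELATED)):
    print(cls)
# [1, 3, 4, 6]
# [1, 5]
# [2, 4, 6]
# [2, 6, 8]
# [7]
```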
Single link
• Algorithm :
1. Select a term not yet in a class and place it in a new class
2. Place in that class all other terms that are related to it.
3. For each term entered into the class, perform step 2
4. When no new terms are identified in Step 2, go to Step 1
Classes created:
Class1 = (Term 1, Term 3, Term 4, Term 5, Term 6, Term 2, Term 8)
Class2 = (Term7)
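The four steps amount to collecting everything reachable from a seed term through chains of related terms. A minimal sketch, assuming the RELATED mapping below (a relation consistent with the star and clique classes in this document, not the original matrix):

```python
# Assumed term relationships (term -> related terms), not the original matrix.
RELATED = {
    1: {3, 4, 5, 6},
    2: {4, 6, 8},
    3: {1, 4, 6},
    4: {1, 2, 3, 6},
    5: {1},
    6: {1, 2, 3, 4, 8},
    7: set(),
    8: {2, 6},
}


def single_link_classes(related):
    """Grow each class by repeatedly adding the terms related to any member."""
    classes = []
    assigned = set()
    for seed in sorted(related):          # Step 1: a term not yet in a class
        if seed in assigned:
            continue
        cls = {seed}
        pending = [seed]
        while pending:                    # Step 3: repeat for each term entered
            term = pending.pop()
            for other in related[term]:   # Step 2: add all related terms
                if other not in cls:
                    cls.add(other)
                    pending.append(other)
        classes.append(cls)               # Step 4: no new terms, back to Step 1
        assigned |= cls
    return classes


for number, cls in enumerate(single_link_classes(RELATED), start=1):
    print(f"Class {number}: {sorted(cls)}")
# Class 1: [1, 2, 3, 4, 5, 6, 8]
# Class 2: [7]
```

Single link partitions the terms: every term lands in exactly one class, but two terms in the same class may be connected only through a long chain of intermediaries.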
Connected Components or strings
• Connected components require all terms in a cluster (thesaurus
class) to be similar to at least one other term.
• Algorithm:
1. Select a term not yet in a class and place it in a new class (if all
terms are in classes, stop)
2. Add to this class a term similar to the selected term and not yet in
the class
3. Repeat Step 2 with the new term, until no new terms may be
added
4. When no new terms are identified in Step 2, go to Step 1
• Example: classes created:
Class1 = (Term1, Term3, Term4, Term2, Term8, Term6)
Class2 = (Term5)
Class3 = (Term7)
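A minimal sketch of the string algorithm, under three assumptions: the RELATED mapping is a relation consistent with the classes in this document (not the original matrix), "not yet in the class" is read as "not yet in any class" (which is what leaves Term 5 on its own), and ties are broken by taking the lowest-numbered term.

```python
# Assumed term relationships (term -> related terms), not the original matrix.
RELATED = {
    1: {3, 4, 5, 6},
    2: {4, 6, 8},
    3: {1, 4, 6},
    4: {1, 2, 3, 6},
    5: {1},
    6: {1, 2, 3, 4, 8},
    7: set(),
    8: {2, 6},
}


def string_classes(related):
    """Build each class as a chain: one new term per step, linked to the last one."""
    classes = []
    assigned = set()
    for seed in sorted(related):                       # Step 1
        if seed in assigned:
            continue
        chain = [seed]
        assigned.add(seed)
        current = seed
        while True:
            candidates = sorted(related[current] - assigned)
            if not candidates:                         # Step 4: chain cannot grow
                break
            current = candidates[0]                    # Step 2: one similar term
            chain.append(current)                      # Step 3: continue from it
            assigned.add(current)
        classes.append(chain)
    return classes


print(string_classes(RELATED))
# [[1, 3, 4, 2, 6, 8], [5], [7]]
```

The class memberships match the example above. The order within the first class depends on the tie-breaking rule, so it can differ from the listing.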
Clustering Using Existing Clusters
• Start with a set of existing clusters.
• The initial assignment of terms to the clusters is revised
by revalidating every term assignment to a cluster.
– To minimize calculations, centroids are calculated for each cluster.
• Centroid: the average of all of the vectors in a cluster.
– The similarity between all existing terms and the centroids of the
clusters can be calculated.
– The term is reallocated to the cluster that has the highest
similarity.
• The process stops when minimal movement between
clusters is detected.
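The reallocation loop can be sketched as follows. The vectors and the initial assignment are made up for illustration, cosine similarity is assumed as the similarity measure, and the loop stops when no term moves, a stricter stopping rule than "minimal movement". All function names are hypothetical.

```python
import math

# Hypothetical term vectors and an initial assignment of terms to clusters 0 and 1.
VECTORS = [[4, 0], [3, 1], [0, 5], [1, 4], [5, 1]]
INITIAL = [0, 0, 0, 1, 1]


def centroid(vectors):
    """Average of all vectors in a cluster, component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def reassign(vectors, assignment, n_clusters, max_rounds=20):
    """Move every term to the cluster with the most similar centroid until stable."""
    assignment = list(assignment)
    for _ in range(max_rounds):
        centroids = []
        for cluster in range(n_clusters):
            members = [v for v, a in zip(vectors, assignment) if a == cluster]
            centroids.append(centroid(members) if members else None)
        updated = []
        for vector in vectors:
            # An empty cluster has no centroid and scores below any real similarity.
            scores = [cosine(vector, c) if c is not None else -1.0 for c in centroids]
            updated.append(scores.index(max(scores)))
        if updated == assignment:  # no term moved, so the clusters are stable
            break
        assignment = updated
    return assignment


print(reassign(VECTORS, INITIAL, 2))
# [1, 1, 0, 0, 1]
```

Comparing each term with a handful of centroids instead of with every other term is what keeps the number of similarity calculations small.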
Illustration of Centroid Movement
Apply the simple similarity measure between each of the 8 terms and the 3
centroids.