
Unsupervised learning: (Text) Clustering

Machine Learning for NLP

Magnus Rosell


Contents

- Introduction
- Text Categorization
- Text Clustering


Introduction


Clustering

To divide a set of objects into parts; to partition a set of objects so that . . .
The result of clustering is a clustering, consisting of clusters.


Clustering vs. Categorization

- Categorization (supervised machine learning)
  To group objects into predetermined categories.
  - Needs a representation of the objects, a similarity measure, and a training set.
- Clustering (unsupervised machine learning)
  To divide a set of objects into clusters (parts of the set) so that objects in the same cluster are similar to each other, and/or objects in different clusters are dissimilar.
  - Needs a representation of the objects and a similarity measure.


Representation and Similarity

Representation: objects (obj1, obj2, . . . ) are described by features (F1, F2, F3, . . . ); each object gets a value for every feature.
Similarity: compare the distribution of features between objects. There are many different similarity measures.

What kinds of groups?

Depends on the algorithm, the similarity (and the representation), and the data set.
- Text: contents, genre, readability, . . . ?

Why text clustering?

- Historically: preprocessing for searching (the cluster hypothesis)
- Postprocessing of search results, ex:
  - Clusty, http://clusty.com
  - iBoogie, http://www.iboogie.com
  - Carrot2, http://demo.carrot2.org/demo-stable/main
- Dimensionality reduction
- Multitext summarization
- Exploration tool:
  - Any unknown text set
  - Questionnaires
  - With different parameters and algorithms, several different overviews of the same set of texts may be constructed.

Six Hypothetical Newspaper Headlines

1. Chelsea won the final.
2. Zimbabwe defeated China in the Olympic match.
3. Match-making in Olympic final.
4. Ericsson stock market winner, increased by 50 per cent.
5. Interest rate at 500 per cent in Zimbabwe.
6. Stock traders nervous as interest rate increases.

Term-document matrix

Each document doc1–doc6 is represented by the length-normalized frequencies of its index terms. The rows of the matrix are the terms (Chelsea, final, win, match, make, Olympic, Zimbabwe, defeat, China, Ericsson, stock, market, increase, per cent, interest, rate, trade, nervous); the columns are doc1–doc6; an entry is empty when the term does not occur in the document.
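As a rough illustration of how such a matrix can be built, here is a small Python sketch. The tokenization, the tiny stopword list, and the lack of stemming (win vs. won) are simplifying assumptions; the slides' matrix was produced with proper preprocessing.

```python
from collections import Counter

def term_document_matrix(docs, stopwords=frozenset({"the", "in", "at", "as", "by"})):
    """Length-normalized term frequencies: weight(term, doc) = count(term) / #tokens."""
    matrix = {}  # term -> {doc_id: weight}
    for doc_id, text in enumerate(docs, start=1):
        tokens = [t for t in text.lower().replace(".", "").split() if t not in stopwords]
        for term, count in Counter(tokens).items():
            matrix.setdefault(term, {})[doc_id] = count / len(tokens)
    return matrix

headlines = [
    "Chelsea won the final",
    "Zimbabwe defeated China in the Olympic match",
    "Stock traders nervous as interest rate increases",
]
for term, weights in term_document_matrix(headlines).items():
    print(f"{term:10s} {weights}")
```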

Text Categorization

Term-document matrix with category centroids

The same term-document matrix, with doc1–doc3 grouped under Sports and doc4–doc6 under Economy; two additional columns give the Sports and Economy centroids, the word-wise averages of the document vectors in each category.

Centroid

Representation of a set of texts c. The centroid, the word-wise average of the document vectors:

    c = \frac{1}{|c|} \sum_{d_c \in c} d_c

Similarity between a text d and c: sim(d, c).

If we use normalized text vectors, the dot product as a similarity measure, and do not normalize the centroids:

    sim(d, c) = \frac{1}{|c|} \sum_{d_c \in c} sim(d, d_c),

the average similarity between d and all texts in c.
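A minimal sketch of the centroid and of centroid categorization over dense vectors; the toy three-dimensional vectors and the function names are illustrative, not the slides' actual data.

```python
def centroid(vectors):
    """Word-wise average of the document vectors in a set."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def categorize(doc, centroids):
    """Assign doc to the category with the most similar centroid (dot product).
    With normalized document vectors and unnormalized centroids this equals the
    average similarity between doc and the texts of the category."""
    return max(centroids, key=lambda label: dot(doc, centroids[label]))

sports = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
economy = [[0.1, 0.8, 0.1], [0.0, 0.9, 0.1]]
cents = {"Sports": centroid(sports), "Economy": centroid(economy)}
print(categorize([0.6, 0.3, 0.1], cents))  # -> Sports
```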

Text Categorization (cont.)

Text Categorization Algorithms
- Centroid Categorization
- Naïve Bayes
- . . .

Text Categorization Applications
- Categorization into groups of content, genre, etc.
- Automatic web directories
- Text Filtering, Spam filtering
- . . .

Text Clustering

Representation

- The vector space model of Information Retrieval is common.
- Clusters are often represented by their centroids.
- Similarity: cosine similarity

The goal: clusters of texts similar in content.

Two types of clustering algorithms

There are many clustering algorithms. Some of the most common can be divided into two categories:
- Partitioning algorithms: flat partitions
- Hierarchical algorithms: hierarchy of clusters

2D Clustering Examples

We will use some examples in 2D. We only consider two words: football, beer. In some way several texts are represented using only these two. Use the inverse of the Cartesian (Euclidean) distance as similarity: closer in the plane equals more similar.
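As a small sketch of this 2D similarity (the epsilon that guards against division by zero for identical points is an added assumption):

```python
import math

def inverse_distance_similarity(p, q, eps=1e-9):
    """Similarity of two points in the football-beer plane: the closer
    the points, the larger the similarity."""
    return 1.0 / (math.hypot(p[0] - q[0], p[1] - q[1]) + eps)

print(inverse_distance_similarity((0.2, 0.8), (0.25, 0.75)))  # nearby texts -> large
print(inverse_distance_similarity((0.2, 0.8), (0.9, 0.1)))    # distant texts -> small
```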

2D Clustering Example 1

(Scatter plot of texts in the football–beer plane.)

A Partitioning Algorithm: K-Means

1. Initial partition, for example: pick k documents at random as first cluster centroids.
2. Put each document in the most similar cluster.
3. Calculate new cluster centroids.
4. Repeat 2 and 3 until some condition is fulfilled.
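A minimal K-Means sketch following steps 1–4, using cosine similarity over dense vectors; the fixed iteration cap and the convergence test are just two of the stopping conditions listed on the next slide, and the helper names are illustrative.

```python
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def k_means(docs, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(docs, k)               # 1. k documents as first centroids
    assignment = [None] * len(docs)
    for _ in range(iterations):                      # 4. repeat a fixed number of times ...
        new_assignment = [max(range(k), key=lambda c: cosine(d, centroids[c]))
                          for d in docs]             # 2. most similar cluster
        if new_assignment == assignment:             # ... or until nothing changes
            break
        assignment = new_assignment
        for c in range(k):                           # 3. recalculate cluster centroids
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids
```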

K-Means (cont.)

Initial partition
- Pick k center points at random
- Pick k objects at random
- Construct a random partition

Conditions
- Repeat until (almost) no object changes cluster
- Repeat a predetermined number of times
- Repeat until a predetermined quality is reached

K-Means: example

(Show K-Means example.)

2D Clustering Example 2: two overlapping

(Scatter plot of texts in the football–beer plane.)

2D Clustering Example 3: one tight, one wide

(Scatter plot of texts in the football–beer plane.)

2D Clustering Example 4: random

(Scatter plot of texts in the football–beer plane.)

K-Means (cont.)

- Time complexity: O(knI) similarity calculations, where k is the number of clusters, n the number of objects, and I the number of iterations.
- Decide the number of clusters in advance
  - Try several clusterings!
- Different results depending on the initial partition
  - Try several initial partitions!
  - What is a good partition?
- "Global" – revisits all texts in every iteration: the centroids are kept up-to-date. Takes word co-occurrence into account through the centroids.

Hierarchical Algorithms: Agglomerative

Agglomerative
1. Make one cluster for each document.
2. Join the most similar pair into one cluster.
3. Repeat 2 until some condition is fulfilled.

The result is a cluster hierarchy, often depicted as a dendrogram.

Hierarchical Algorithms: Agglomerative (cont.)

Conditions
- Repeat until one cluster remains
- Repeat until a predetermined number of clusters
- Repeat until a predetermined quality is reached (the quality usually increases with the number of clusters)

Hierarchical Algorithms: Agglomerative (cont.)

Similarity in Agglomerative:
1. Single Link: similarity between the most similar texts in each cluster; "elongated" clusters.
2. Complete Link: similarity between the least similar texts in each cluster; "compact" clusters.
3. Group Average Link: similarity between cluster centroids; "middle ground".
4. Ward's Method: "combination" method.

.1) s(3.2) s(1.Hierarchical Algorithms: Agglomerative (cont..3) . ...1) s(2. doc1 s(1.. .... .2) s(3... doc2 s(2.2) s(2.3) .) The similarity matrix: doc1 doc2 doc3 . Calculation time: O(n2 ) Magnus Rosell 30 / 51 Unsupervised learning: (Text) Clustering ..1) s(1.3) . .. doc3 s(3. . . . .. ..

Hierarchical Algorithms: Agglomerative (cont.)

Agglomerative
- Time complexity: O(n^2) similarity calculations
- Results in a hierarchy that can be browsed.
- Deterministic: same result every time.
- "Local" – in each iteration only the similarities at that level are considered. Bad early decisions cannot be undone.

Hierarchical Algorithms: Divisive

Divisive algorithm
1. Put all documents in one cluster.
2. Split the worst cluster.
3. Repeat 2 until some condition is fulfilled.

Example: Bisecting K-Means
- Splits the biggest cluster into two using K-Means.
- Considers many objects in every iteration.
- Robust – dominating features may be dealt with in the beginning.
- Different results depending on all initial partitions.
- O(n log(k) I)
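A sketch of Bisecting K-Means on top of the k_means sketch from earlier; taking the "biggest" cluster to be the one with the most documents (rather than, say, the worst one) is an assumption.

```python
def bisecting_k_means(docs, k, seed=0):
    """Repeatedly split the biggest cluster into two with K-Means
    until k clusters have been produced. Returns lists of document indices."""
    clusters = [list(range(len(docs)))]            # 1. all documents in one cluster
    while len(clusters) < k:                       # 3. repeat until k clusters
        big = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(big)                # 2. split the biggest cluster ...
        assignment, _ = k_means([docs[i] for i in members], 2, seed=seed)
        clusters.append([i for i, a in zip(members, assignment) if a == 0])
        clusters.append([i for i, a in zip(members, assignment) if a == 1])
    return clusters
```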

Comparison of Algorithms

K-Means
- flat partition
- decide the number of clusters in advance
- different results depending on the initial partition
- "global", considers all objects in every iteration
- fast: O(knI)

Agglomerative
- hierarchy
- may stop at an "optimal" number of clusters
- same result every time (deterministic)
- "local", bad early decisions cannot be changed
- slow: O(n^2)

Algorithm Discussion

- Will always find clusters: 2D Example 4: random.
- Outliers
- Texts on the border between clusters
- Hard and soft (fuzzy) clustering
- Many other clustering algorithms

There is no clear evidence that any clustering algorithm performs better in general. Some algorithms perform better for particular data and applications.

Clustering Result

Number of texts per cluster and category:

              Economy  Culture  Sports  Sweden  World  Total
    Cluster 1     167        4       1      37     23    232
    Cluster 2      18      421      22     176     40    677
    Cluster 3       0       19     452      10     14    495
    Cluster 4     312        8       6      36     10    372
    Cluster 5       3       48      19     241    413    724
    Total         500      500     500     500    500   2500

Each cluster is also described by its most frequent words; among them: per cent, increase, rate, stock, company, index, win, match, game, club, film, tv, time, write, death, wound, police, person, press, day, stockholm, swedish, swede, aftonbl., reut., tt.

Evaluation

It is very hard to evaluate a clustering. What is a good clustering? Relevance?
- External Evaluation
  - Indirect evaluation: how the clustering influences an application that uses it as an intermediate step.
  - External quality measures: compare the result to a known partition or hierarchy (a categorization).
- Internal Evaluation
  - Internal quality measures: use a criterion inherent to the model and/or clustering algorithm, for instance the average similarity of objects in the same cluster.

Evaluation: Definitions

Let
    C = {c_i}          a clustering with γ clusters c_i
    K = {k^{(j)}}      a categorization with κ categories k^{(j)}
    n                  the number of documents
    n_i                the number of documents in cluster c_i
    n^{(j)}            the number of documents in category k^{(j)}
    n_i^{(j)}          the number of documents in cluster c_i and category k^{(j)}
    M = {n_i^{(j)}}    the confusion matrix

Evaluation: Internal Quality Measures

sim(c_i, c_i) is the cluster intra similarity. If the centroids are not normalized it is the average similarity of the texts in the cluster. It is a measure of how "cohesive" the cluster is.

The intra similarity of a clustering:

    \Phi_{intra}(C) = \frac{1}{n} \sum_{c_i \in C} n_i \cdot sim(c_i, c_i),    (1)

which is the average similarity of the texts in the set to all texts in their respective clusters.

Evaluation: Internal Quality Measures (cont.)

Similarly, the average similarity of all texts in each cluster to all the texts in the entire set may be calculated:

    \Phi_{inter}(C) = \frac{1}{n} \sum_{c_i \in C} n_i \cdot sim(c_i, C).    (2)

This measures how separated the clusters are.
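A sketch of equations (1) and (2) for dense, length-normalized document vectors, computing sim(c_i, c_i) and sim(c_i, C) as dot products between unnormalized centroids (which, as above, equal average text-to-text similarities); names and layout are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def intra_inter(docs, clusters):
    """docs: normalized document vectors; clusters: lists of document indices.
    Returns (Phi_intra, Phi_inter) as in equations (1) and (2)."""
    n = len(docs)
    c_all = centroid(docs)                                # centroid of the whole set
    phi_intra = phi_inter = 0.0
    for members in clusters:
        c_i = centroid([docs[m] for m in members])
        phi_intra += len(members) * dot(c_i, c_i) / n     # cohesion of cluster i
        phi_inter += len(members) * dot(c_i, c_all) / n   # separation from the set
    return phi_intra, phi_inter
```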

Evaluation: Internal Quality Measures (cont.)

Internal quality measures depend on the representation and/or the similarity measure.
- Don't use them to evaluate different representations and/or similarity measures.
- Don't use them to evaluate different clustering algorithms, since these may utilize the representation and similarity differently.

Evaluation: External Quality Measures

Precision and Recall. For each cluster and category:
- Precision: p(i, j) = n_i^{(j)} / n_i
- Recall: r(i, j) = n_i^{(j)} / n^{(j)}

Precision, p(i, j), is the probability that a text drawn at random from cluster c_i belongs to category k^{(j)}. In other words: p(k^{(j)} | c_i), the probability that a text picked at random belongs to category k^{(j)}, given that it belongs to cluster c_i.
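These follow directly from the confusion matrix. A small sketch, assuming M[i][j] holds the number of documents in cluster i and category j (the indexing convention is an assumption; the example values are taken from the Clustering Result table above):

```python
def precision(M, i, j):
    """p(i, j) = n_i^(j) / n_i: fraction of cluster i that belongs to category j."""
    return M[i][j] / sum(M[i])

def recall(M, i, j):
    """r(i, j) = n_i^(j) / n^(j): fraction of category j that ended up in cluster i."""
    return M[i][j] / sum(row[j] for row in M)

M = [[167,   4,   1,  37,  23],   # confusion matrix from the Clustering Result table
     [ 18, 421,  22, 176,  40],   # columns: Economy, Culture, Sports, Sweden, World
     [  0,  19, 452,  10,  14],
     [312,   8,   6,  36,  10],
     [  3,  48,  19, 241, 413]]
print(precision(M, 2, 2), recall(M, 2, 2))   # Cluster 3 vs. the Sports category
```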

Evaluation: External Quality Measures (cont.)

Purity
- Cluster Purity:

      \rho(c_i) = \max_j p(i, j) = \max_j \frac{n_i^{(j)}}{n_i}

- Clustering Purity:

      \rho(C) = \sum_i \frac{n_i}{n} \cdot \rho(c_i)

Purity disregards everything but the majority category in each cluster. But a cluster that only contains texts from two categories is not that bad . . .

Evaluation: External Quality Measures (cont.)

Entropy
- Entropy for a cluster: E(c_i) = -\sum_j p(i, j) \log(p(i, j))
- Entropy for the clustering: E(C) = \sum_i \frac{n_i}{n} E(c_i)

Entropy measures the disorganization. A lower value is better.
- min(E(c_i)) = 0, when the cluster contains texts from only one category.
- max(E(c_i)) = log(κ), when there are equal numbers from all categories.

Normalized entropy: NE(c_i) = E(c_i) / log(κ).
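A sketch of purity, entropy, and normalized entropy over the same kind of confusion matrix; using the natural logarithm is an assumption (any base works as long as the log(κ) normalization uses the same one).

```python
import math

def purity(M):
    """rho(C) = sum_i (n_i / n) * max_j p(i, j)."""
    n = sum(sum(row) for row in M)
    return sum(max(row) for row in M) / n

def cluster_entropy(row):
    """E(c_i) = -sum_j p(i, j) log p(i, j); empty cells contribute nothing."""
    n_i = sum(row)
    return -sum((x / n_i) * math.log(x / n_i) for x in row if x > 0)

def entropy(M, normalized=False):
    """E(C) = sum_i (n_i / n) E(c_i); if normalized, divide each E(c_i) by log(kappa)."""
    n = sum(sum(row) for row in M)
    scale = math.log(len(M[0])) if normalized else 1.0
    return sum(sum(row) / n * cluster_entropy(row) / scale for row in M)
```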

Evaluation: External Quality Measures (cont.)

Entropy does not consider the number of clusters. A clustering with n clusters would be considered perfect. Sometimes you might accept a larger number of clusters to get better results, but not always.

We can consider the whole confusion matrix at once: Mutual Information, and normalized mutual information.

Evaluation: External Quality Measures (cont.)

External quality measures depend on the quality of the categorization.
- How do we know a categorization is of high quality?
- What is a good partition? There might be several good partitions.
- None of the external measures takes into account how "difficult" the texts of the actual corpus are. A certain corpus might be especially hard to cluster.

If we do not have the possibility to test in the context of another task or by asking humans, we have to stick to external quality measures.

Evaluation (cont.)

Try to use at least one internal and one external measure. Be careful when discussing the implications! Use baseline methods, for instance random clustering.

(Word Clustering)

One possibility: cluster the columns of the term-document matrix. Clusters of related words: words that have a similar distribution over the texts.
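In other words, each word is represented by its weights over the documents, and those vectors can be fed to any of the algorithms above. A small sketch, assuming a dense documents-by-terms matrix:

```python
def word_vectors(docs_by_terms):
    """Given a documents x terms matrix (one row per document), return one
    vector per term: the term's distribution over the documents. These word
    vectors can then be clustered with k_means or agglomerative as above."""
    n_terms = len(docs_by_terms[0])
    return [[row[t] for row in docs_by_terms] for t in range(n_terms)]

docs_by_terms = [
    [0.33, 0.33, 0.0],    # doc1: weights for (chelsea, final, stock)
    [0.0,  0.25, 0.0],    # doc2
    [0.0,  0.0,  0.33],   # doc3
]
print(word_vectors(docs_by_terms))
# -> [[0.33, 0.0, 0.0], [0.33, 0.25, 0.0], [0.0, 0.0, 0.33]]
```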

Clustering Exploration

Text Clustering Exploration
- Text Mining
- Scatter/Gather
  1. Cluster a set
  2. Choose clusters (words with high weight)
  3. Cluster again
  4. Repeat until satisfied.

Clustering Result Presentation

A human has to interpret the (end) result!
- Search Engine – index
- Clustering – dynamic table of contents

How should the result be presented?
- Textual
- Graphical – Visualization

Textual Presentation

The search engines, the browser pages.
Display: lists of clusters, texts, words, and the confusion matrix.

Cluster Labels/Descriptions: usually single words or phrases based on the word distribution:
- in the cluster (descriptive)
- between clusters (discriminating)

Other:
- Suffix Tree Clustering
- Frequent Term-based Clustering

Visualization

- Similarity
  - SOM – Self-Organizing Maps: WEBSOM
  - Projections
- Representation
  - Infomat