

Czech Technical University in Prague


Faculty of Electrical Engineering
Department of Computer Science and Engineering

BACHELOR’S THESIS:
Email clustering algorithm

Author: Libor Grafnetr


Supervisor: Ing. Pavel Kordík, Ph.D.
Year: 2009
Acknowledgments
I would like to thank my supervisor Ing. Pavel Kordík, Ph.D. for his guidance throughout the work
on this thesis.

Declaration
I hereby declare that I have completed this thesis independently and that I have listed all
the literature and publications used.
I have no objection to the use of this work in compliance with §60 of Act No. 121/2000 Sb.
(the Copyright Act) and with the rights connected with the Copyright Act, including the
changes in the act.

In Prague on June 9, 2009 .............................................................

Abstract

This thesis describes the analysis, design and implementation of a clustering algorithm that
operates on email messages. Its aim is to recognize messages that deal with similar topics
and aggregate them into one cluster. The algorithm makes use of general document clustering
methods combined with analysis of data specific to email messages. Given the nature of
email, it supports online clustering.
The functionality will be integrated into the existing email client software "eM Client".
Implementation of suitable data structures with a database backend, mail client interoperability
code and a user interface is also part of this work.

Abstrakt

This thesis describes the analysis, design and implementation of a clustering algorithm that
will operate on email messages. Its aim is to recognize messages that concern the same topic
and to group them into one cluster. The algorithm uses general document clustering techniques
combined with analysis of information specific to email messages. Given the nature of email,
the algorithm supports incremental clustering.
The clustering functionality will be integrated into the existing email client "eM Client".
The work also includes the implementation of suitable data structures with a database
connection, the code for cooperation with the mail client, and the user interface.

Contents

1 Introduction
  1.1 Motivation

2 Problem description
  2.1 Project breakdown
  2.2 Existing approaches
  2.3 Thesis structure

3 Analysis
  3.1 Project components
    3.1.1 Clustering framework
    3.1.2 Distance function and text analysis
    3.1.3 Data structures
    3.1.4 Interoperability code
    3.1.5 User interface
  3.2 Clustering
    3.2.1 Algorithms
      3.2.1.1 K-means
      3.2.1.2 Hierarchical clustering
      3.2.1.3 Fuzzy C-means
      3.2.1.4 Conclusion
    3.2.2 Data instances representation
    3.2.3 Measuring distance
    3.2.4 Email messages
    3.2.5 Document clustering
      3.2.5.1 Tokenization
      3.2.5.2 Term ranking and TF-IDF
      3.2.5.3 Cosine measure
    3.2.6 eM Client

4 Design
  4.1 Hierarchical agglomerative clustering
  4.2 Email messages
  4.3 Data instance features
  4.4 Text analysis
  4.5 Distance function
  4.6 Database backend
  4.7 User interface
  4.8 Integration layer

5 Implementation
  5.1 Clustering
  5.2 Mail clustering
  5.3 Storage
  5.4 User interface
  5.5 Integration layer

6 Testing
  6.1 Clustering quality
    6.1.1 Subjective evaluation
    6.1.2 Objective evaluation
  6.2 Data structures optimization
  6.3 Performance

7 Conclusion
  7.1 Future work

Bibliography

A SQL Schema

B Clustering Illustrations

C CD directory structure
Chapter 1

Introduction

1.1 Motivation
Email has become one of the most important means of communication among people in
both personal and corporate environments. Its unprecedented features have allowed people
to communicate and cooperate in their activities like no other technology before. Since its
notable spread, the volume of email messages has risen by several orders of magnitude, and
email now represents a core component of every corporation's workflow. However, the way
software presents email messages to people has remained almost unchanged. While a simple
list of emails may be sufficient for a low volume of mostly independent mails, today's
communication frequently consists of interconnected messages, bound by a common topic,
possibly from several different individuals.

The need to change how people work with their email has been known for many years; the
problem is entering general consciousness, and public media are focusing their attention
on it more often. There are two basic (but interconnected) parts of the issue: the first
is the immense volume of emails that many users are subject to - this problem has been
coined "Email Overload"; the second is most users' inability to use email efficiently, e.g.
sending unnecessary messages or overutilizing the reply-all functionality. There have even been
attempts to quantify the costs of inefficient email usage, with estimates as high as $650 billion
[1].

Automatically organizing the user's messages would be a significant remedy to the problem.
It would allow users to distribute their time more efficiently among relevant tasks, speed up
information look-up and generally create a more pleasant email client environment, in which
many people spend a significant part of their work time.

Chapter 2

Problem description

The aim of this work is to create an email client extension that automatically clusters the
user's email messages and presents the result. The clustering process must be incremental,
as new messages continually arrive into the user's inbox, and the framework's data necessary
to perform the clustering must persist between application executions. The resulting data,
consisting of clusters and the emails they contain, must be presented to the user in a
comprehensible and non-intrusive way from within the email client.

2.1 Project breakdown


To achieve the above, we have divided the project into the following tasks:

• Devise a variation on an existing clustering algorithm that will suit the specific needs
of the clustering task.
• Analyze the nature of email messages and find relevant data mining features that will
constitute the basis for measuring similarity between email instances.
• Design a similarity measurement function yielding good results when evaluating emails'
topic similarity.
• Implement the algorithm, the knowledge extraction framework and suitable data structures
that will allow persisting the clustering work data.
• Integrate the implemented module into "eM Client" and create a user interface that
will present the clustering result and allow the user to execute the necessary operations
of the clustering framework.
• Test the project with a testing dataset and point out areas where optimization had to
be added and where more improvement could be made in the future.


2.2 Existing approaches


Despite the urgency of the problem, a successful solution that would gain widespread
acceptance has not been created yet. There are several partial solutions in use, but they
either require continual feedback from the user, offer rather limited help, or were never
finished and made available for public use. Existing partial solutions include the following:

• Threaded view is a method that displays a conversation tree based on information
about relations to previous emails. This method works only for mails that were replied
to or forwarded; a new email concerning the same topic from a user is not connected.
This method is one of the oldest attempts to improve the experience of working with
email, but it is very limited, especially at the present scale of the overload problem.
• Conversation view is an improved variation on the threaded view approach, introduced
in Google's Gmail service. It improves the user experience by flattening the threaded
view tree and by displaying each conversation as a single item in the basic folder view.
The approach provides a clearer view of the user's emails, but still doesn't constitute
a powerful tool to radically improve work with email.
• Rules - most modern desktop email clients have means to define rules consisting of
a condition and an action to be performed if the condition is met. The condition is
usually used to match a specific substring in the email subject or body, and the action
then performs a move or copy operation. While rules are relatively popular among
desktop client users, their filtering capabilities are limited, because the user has to
define every single rule. The common use therefore is separating several basic types of
email entering the user's inbox rather than separating every new topic that has to be
dealt with.
• Automatic foldering is a more sophisticated approach based on filters matching the
message against existing mail folders. Real-world implementations allow the user to
define folders that automatically process incoming mails and filter messages similar to
those already present in the folder. The user needs to add several example messages
when the folder is created to provide some initial data to the folder filter, and is also
allowed to add or remove messages from the folder later on, further improving its
filtering performance. A popular implementation of this method is present in the
browser/email client software "Opera". Although a promising approach, known
implementations still require the user to define a folder for each topic he would like
to separate.
Algorithms used in these filters include, for example, Bayesian filtering, Support Vector
Machines and k-NN. Attempts to integrate new folder recognition were also made, but
haven't had any major influence. An overview of existing works in automatic foldering
and their effectiveness can be found in [2].

Our approach might be viewed as a restatement of automatic foldering extended with
new folder recognition, but seen from the point of view of the clustering branch of data
mining.

The most notable work taking an approach similar to this thesis is the master's thesis of
Gabor Cselle: Organizing Email [3]. Cselle's work has notably influenced the author in his
research. Cselle performed extensive research on email organization and, on the basis of
that work, created a clustering framework and an extension to the email client "Thunderbird"
that performed email clustering. However, the extension itself was never published.

2.3 Thesis structure


This document is structured into five main chapters reflecting the progress of the author's work:

• Chapter 3 describes the project components, the necessary clustering theory that is used
in our implementation and also the software "eM Client" that will integrate the
clustering functionality.
• Chapter 4 provides an in-depth description of the design of the clustering algorithm,
the text analysis algorithm, the similarity function, the utilized email message features,
the realization of the database backend and the design of the user interface.
• Chapter 5 explains the architecture of the project, describes the classes that were
implemented, the way clustering is integrated with the email client and the background
of the user interface.
• Chapter 6 describes how the final product was tested, what steps were taken to improve
performance, and enumerates the existing issues.
Chapter 3

Analysis

This chapter will describe logical components employed in the clustering process, basic data
mining concepts and the target environment that the project will be integrated with.

3.1 Project components

The whole mail clustering process has many steps that can be divided into several blocks.
The first component is a clustering framework containing the clustering algorithm, clusters,
instances, a data source for instances and other necessary infrastructure. Tightly connected
to it are a distance function utilized by the clustering algorithm and a text analysis that
evaluates the textual similarity of the mail content. In addition, there is a database
persistence framework that needs to be provided to the clustering framework in the most
transparent way possible. Finally, a component integrating the clustering with the email
client and a user interface displaying the results of the clustering framework need to be
connected to the above components.

3.1.1 Clustering framework

The core part of the project is the clustering algorithm itself. The algorithm receives
data instances from a data source as its input. These items are first pre-processed by the
distance function for information extraction. Afterwards, instances are processed by the
distance function to determine their similarity to existing instances. Which instance pairs
are processed by the distance function depends on the particular algorithm. Based on the
calculated distance values, the clustering process is executed; during this process, clusters
are formed or modified. Once the algorithm meets its stop criteria, the new mail instances
have been clustered and the updated cluster information may be loaded by other components.
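The component interplay described above can be sketched as a small set of interfaces. All class and method names below are illustrative, not the thesis implementation (which lives inside the email client's environment), and the placement rule in `_place` is a deliberately simplified stand-in for a real clustering algorithm:

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Supplies data instances (e.g. emails) to the framework."""
    @abstractmethod
    def new_instances(self): ...

class DistanceFunction(ABC):
    """Pre-processes instances and measures their pairwise distance."""
    @abstractmethod
    def preprocess(self, instance): ...
    @abstractmethod
    def distance(self, a, b): ...

class ClusteringAlgorithm:
    """Drives the pipeline: pre-process, measure distances, form clusters."""
    def __init__(self, source, dist, threshold=0.5):
        self.source, self.dist, self.threshold = source, dist, threshold
        self.clusters = []

    def run(self):
        for inst in self.source.new_instances():
            self.dist.preprocess(inst)   # information extraction step
            self._place(inst)            # algorithm-specific clustering step
        return self.clusters

    def _place(self, inst):
        # Toy rule: join the nearest cluster within the threshold,
        # otherwise start a new one.
        best = None
        for cluster in self.clusters:
            d = min(self.dist.distance(inst, member) for member in cluster)
            if d < self.threshold and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            self.clusters.append([inst])
        else:
            best[1].append(inst)
```

Concrete data sources and distance functions then plug into `ClusteringAlgorithm` without the framework knowing anything about the instance format.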

3.1.2 Distance function and text analysis

The distance function provides the clustering algorithm with information about the
similarity of individual instances to each other. Every new instance is passed to the
distance function by the clustering algorithm for processing. The distance function further
passes the new instance to the text analysis subcomponent. When the clustering algorithm
calls the distance function, the similarity of selected mail item features is evaluated and a
mail body text similarity is also calculated. These several scores are then merged into a
distance value that is returned to the clustering algorithm. The text analysis subcomponent
processes the mail body text of new instances and updates its internal data structures that
store the relevance of individual words. When the distance function is asked to calculate
similarity, text analysis performs a similarity comparison of the two mail instances' texts
and calculates a text similarity score using known text clustering formulas.
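As a rough illustration of merging several per-feature scores into one distance value, consider the sketch below. The chosen features (subject, sender), the Jaccard word overlap and the weights are assumptions made for the example, not the values used in the thesis:

```python
def combined_distance(mail_a, mail_b, text_distance, weights=(0.4, 0.2, 0.4)):
    """Merge several similarity scores into one distance in [0, 1].

    `text_distance` is assumed to come from the text-analysis component;
    feature choices and weights here are purely illustrative.
    """
    w_subject, w_sender, w_text = weights
    # Subject similarity: Jaccard overlap of lowercase subject words.
    sa = set(mail_a["subject"].lower().split())
    sb = set(mail_b["subject"].lower().split())
    overlap = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    subject_d = 1.0 - overlap
    # Sender feature: 0 if same address, 1 otherwise.
    sender_d = 0.0 if mail_a["sender"] == mail_b["sender"] else 1.0
    # Weighted sum of the individual distances.
    return w_subject * subject_d + w_sender * sender_d + w_text * text_distance
```

The weighted-sum merge keeps each feature's contribution tunable independently, which mirrors the modular structure described above.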

3.1.3 Data structures

Specialized data structures are necessary to store all the data used in the whole clustering
process. Besides simple lists used to store existing clusters and data instances, dictionaries
are used for pre-calculated text analysis data, and two-dimensional dictionaries are used to
store distance values between individual instances and clusters. Although lists and
one-dimensional dictionaries exist in the target environment for in-memory operation, they
had to be implemented as layers over a relational database used to persist the data. The
two-dimensional dictionary had to be implemented for both in-memory and database
operation. As these data structures are heavily utilized during the clustering process, they
need to be highly optimized, as does the underlying database schema.
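A minimal in-memory sketch of such a two-dimensional dictionary might look as follows; the class and method names are illustrative, and the thesis implementation additionally layers this structure over a relational database:

```python
class TwoDimensionalDictionary:
    """In-memory 2-D dictionary keyed by an unordered pair of objects."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _key(a, b):
        # Distances are symmetric, so normalize the key order.
        return (a, b) if a <= b else (b, a)

    def set(self, a, b, value):
        self._data[self._key(a, b)] = value

    def get(self, a, b, default=None):
        return self._data.get(self._key(a, b), default)

    def remove_all(self, a):
        # Drop every entry involving `a`, e.g. after a cluster is merged away.
        self._data = {k: v for k, v in self._data.items() if a not in k}
```

Normalizing the key order halves the storage for symmetric distance values.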

3.1.4 Interoperability code

To integrate the previous components with the email client, an interoperability layer needs
to exist. This layer will govern the initialization of the clustering framework and its
database backend, and provide a custom data source transforming email messages from the
client into the data instances that the clustering process works with. The data source also
needs to monitor additions and removals of email messages and forward these events to the
clustering framework. There are several other functional roles, described in the
implementation chapter, that the layer needs to perform.

3.1.5 User interface

The email client must have a user interface that provides access to the results of the
clustering process. The interface must be non-intrusive; that is, it must let the user use
the client as usual, but easily allow him to take advantage of the clustering output. The
interface will be developed using existing concepts present in the client. In its most
minimal form, the interface will provide a view of existing clusters, display information
about the selected cluster, provide a method to quickly view only messages from the selected
cluster and allow the user to see to which cluster a mail belongs. The user interface will
interact with the clustering framework primarily through the interoperability code.

3.2 Clustering
Clustering is a branch of the data mining discipline which aims at grouping similar data
instances together into sets called clusters. To be able to devise a suitable clustering
process, we need to review the relevant existing knowledge. Our view of the techniques
must also reflect the specifics of the data and of the environment we are about to embed
the technology in. Clustering of email messages can be understood as a special case of
document clustering with a lot of additional metadata taking part in the clustering process.
There are also other specifics, such as the online nature of the clustering, the fact that the
target result is a large number of relatively small clusters with the need to continually
detect new ones, and the need for unsupervised operation of the algorithm. In the following
sections we will describe the algorithms, data representation, distance functions and
knowledge extraction principles that relate to our problem. For each link in the clustering
chain, we will also analyze what features and capabilities it needs to have in our
implementation and why.

3.2.1 Algorithms

There are already many clustering algorithms, and new ones are continually being devised,
mainly because each field has specific features that customized algorithms can take
advantage of. For our clustering task we have researched several popular and relatively
versatile clustering algorithms, as they are known to perform reasonably well on document
clustering tasks and some of them have advantageous features that can address some of our
needs. We will describe these algorithms, including their advantages and disadvantages,
and state whether we have chosen the algorithm for our project.

3.2.1.1 K-means

K-means is one of the most essential clustering algorithms. It dates back to 1956 and has
been gaining popularity ever since. It isn't complicated to implement and performs the
clustering relatively fast - its computational complexity is linear in the number of data
instances. To use K-means, it is necessary to specify the target number of clusters - let k
be the desired number of clusters. The algorithm creates k data points representing cluster
centers, called centroids. An iterative partitioning loop described below is then executed
to carry out the clustering.
The formal definition of K-means [4] is to partition the data instances into k clusters so
as to minimize a sum-of-squared-error objective function defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} |c_i - p_j|^2

where E is the squared-error value, k the desired number of clusters, n the number of data
instances in a cluster, c_i the centroid of the i-th cluster and p_j the j-th data instance.
The subtraction of instances represents the distance between the two instances.
The algorithm below is a common approach to accomplishing the described function
minimization. The approached minimum is, however, local, which is why the initial centroid
distribution is so important.

The data points selected as cluster centers may be generated randomly, or, better, some
form of heuristic may be used to select centroids that are likely to yield a good clustering
result. A popular method is to gradually pick centroids from the clustered data instances so
that their distance from the nearest other centroid is maximal.
The clustering in K-means is performed iteratively by re-calculating centroids and
re-associating data instances with the nearest centroid until stability is reached and no data
instances change their parent cluster. The K-means algorithm in pseudo-code is written below:
select_k_initial_centroids();
do {
    associate_data_instances_with_nearest_centroid();
    // if some data instance changes its parent cluster,
    // the clustering_change variable is set to true
    calculate_new_centroid_positions();
    // the new mean of the data instances of each cluster
    // is calculated and set as the new centroid for the cluster
} while (clustering_change);

Figure 3.1: An example of K-means algorithm on random two-dimensional data
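For concreteness, a plain K-means over numeric tuples following the pseudo-code above can be sketched as below; this is only an illustration, and it uses random initialization instead of the maximal-distance seeding heuristic mentioned earlier:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means over numeric tuples, following the pseudo-code above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids

    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    for _ in range(max_iter):
        # Associate every data instance with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sqdist(p, centroids[i]))
            clusters[nearest].append(p)
        # Calculate new centroid positions: the mean of each cluster's members.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # no change => converged
            break
        centroids = new_centroids
    return clusters, centroids
```

Empty clusters keep their previous centroid here; production implementations usually re-seed them instead.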

Although the standard definition of K-means doesn't deal with incremental clustering, it is
possible to modify the implementation to support it, and some research on this topic and on
the issues arising from the time-dependent nature of the task has already been performed [5].
However, K-means has two major disadvantages. The first is the need to know the number
of clusters in advance, which would require additional estimation algorithms or substantial
algorithm modifications. The second issue arises from the concept of the centroid, which
assumes that a mean of data instances can be calculated. For more complicated data
instances that are not just simple vectors (see subsection 3.2.2), this condition cannot be
met. It can be overcome by avoiding the centroids completely and determining an instance's
distance from a cluster by the average distance from the cluster's members. Although a
viable approach, the computational complexity would rise enormously.

Although K-means is a straightforward, versatile and fast algorithm, we have decided not
to use it in its standard form as the clustering algorithm for our project.

3.2.1.2 Hierarchical clustering

Hierarchical clustering is, together with K-means, the most popular clustering technique.
This method operates by building a hierarchy of clusters based on the similarity of the data
instances they contain. The process of building the hierarchy may be terminated when a
specified condition is met, resulting in clusters of the desired similarity and size.
The metric used to measure similarity between instances can be chosen independently of
the algorithm, but a second measure defines how the distance between two clusters is
calculated from the distances between the instances in these clusters. There are three basic
measures:

• Minimum distance - the distance between two clusters is equal to the smallest distance
between an instance from cluster A and an instance from cluster B.
• Maximum distance - the distance between two clusters is equal to the largest distance
between an instance from cluster A and an instance from cluster B.
• Average distance - the distance between two clusters is equal to the average distance
of instances from cluster A to instances from cluster B.
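The three measures can be expressed compactly as follows; `dist` stands for the instance-level distance function, clusters are plain instance lists, and the function name is illustrative:

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage="min"):
    """Distance between two clusters under the three measures above."""
    # All pairwise instance distances between the two clusters.
    pairs = [dist(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "min":                 # minimum distance (single linkage)
        return min(pairs)
    if linkage == "max":                 # maximum distance (complete linkage)
        return max(pairs)
    return sum(pairs) / len(pairs)       # average distance
```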

An algorithm that employs the minimum distance metric and also uses a minimal
inter-cluster distance as the threshold to stop merging clusters is called a single-linkage
algorithm.
Once the measure of distance between clusters is defined, there are two approaches to the
hierarchy's construction:

• Agglomerative - the algorithm starts with every data instance in a separate cluster
and repeatedly merges the clusters that are nearest to each other.
• Divisive - at the beginning, all instances are in one cluster, and a logic using the
instance distance function repeatedly splits existing clusters.

The agglomerative method is the more frequent choice when using a hierarchical algorithm,
and it also suits our needs better - it allows an easier implementation of incremental
clustering and provides better means to define the criteria that stop the clustering.

Figure 3.2: A dendrogram of clusters created by hierarchical clustering on the well-known
Iris dataset

A pseudo-code showing how agglomerative hierarchical single-linkage clustering operates
follows:

create_cluster_for_each_data_instance();
do {
    calculate_intercluster_distances();
    // the function also stores the minimal distance
    // in the variable minimal_distance
    if (minimal_distance < merge_threshold)
        merge_nearest_clusters();
} while (minimal_distance < merge_threshold);
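A direct, unoptimized sketch of this single-linkage loop in Python could look as follows; it naively recomputes all inter-cluster distances on every pass, and `dist` is any instance-level distance function:

```python
def single_linkage(instances, dist, merge_threshold):
    """Agglomerative single-linkage clustering following the pseudo-code above."""
    clusters = [[x] for x in instances]        # one cluster per instance
    while len(clusters) > 1:
        # Find the closest pair of clusters (minimum pairwise instance distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] >= merge_threshold:         # stop criterion reached
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))    # merge the nearest clusters
    return clusters
```

A real implementation would cache the pairwise distances (e.g. in the two-dimensional dictionary described earlier) instead of recomputing them each iteration.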

Hierarchical clustering doesn't impose any requirements on the data instances or the
distance function. This makes it more suitable for clustering complex data, as in our case.
However, the algorithm does have two issues that must be noted. Once a cluster merge
(or split) is performed, it cannot be undone. That is, if an instance I is merged into a
cluster A and afterwards a cluster B containing instances very similar to instance I enters
the clustering process, there is no way for instance I to be moved to cluster B. This might
cause clustering performance degradation in incremental clustering. Another issue is the
algorithm's computational complexity, which is quadratic, given by the need to calculate
the distances between all pairs of data instances. The complexity may cause problems if the
algorithm is to be employed in our project.

3.2.1.3 Fuzzy C-means

Fuzzy C-means is a variation of the K-means algorithm with the addition of uncertainty.
The previous algorithms understood an instance's membership in a cluster as a boolean
statement, and a single instance could belong to only one cluster. In contrast to these
models, fuzzy C-means allows one data instance to belong to several clusters. The extent to
which an instance belongs to a cluster is denoted by a scalar value, the "degree of
membership" [6]. The fuzzy nature of membership would be a very useful feature for our
application, allowing the algorithm's uncertainty to be propagated to the user interface.
The algorithm operates similarly to K-means, but in each iteration, instead of
re-associating data instances with the nearest clusters, a matrix containing the degree of
membership for each cluster-instance combination is recalculated. The formal definition of
the algorithm is still aimed at optimizing an objective function, but the function takes the
degree of membership into account, so it is now defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} \cdot |c_i - p_j|^2

where u_{ij} is the degree of membership of the j-th data instance to the i-th cluster; the
other variables have the same meaning as in the objective function of K-means.
The formula for an instance's degree of membership to cluster A calculates the value based
on the proximity of cluster A in comparison to the other clusters. The formula also
contains an additional parameter called the fuzzifier, which influences the gradient with
which the membership decreases. If u_{ij} is the degree of membership of the j-th data
instance to the i-th cluster, then

u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{|c_i - p_j|}{|c_l - p_j|} \right)^{2/(m-1)}}

where k is the number of clusters, c_i is the centroid of the i-th cluster, p_j is the j-th
data instance and m is the fuzzifier (the summation index is written as l to avoid clashing
with the fuzzifier m).
The variable degree of membership also has to be reflected in the centroid calculation. An
instance with a lesser membership to a cluster should have a smaller impact on the cluster's
centroid position. This can be achieved by incorporating the u_{ij} value into the
arithmetic mean, in both the numerator and the denominator.
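The two updates can be sketched as follows for one-dimensional data. Note two assumptions: the centroid update uses the standard fuzzy C-means weighting with memberships raised to the fuzzifier m (the thesis text does not fix the exact exponent), and no point may coincide exactly with a centroid (that would divide by zero):

```python
def fcm_memberships(points, centroids, m=2.0):
    """Degree-of-membership matrix u[i][j] from the formula above (1-D data)."""
    u = []
    for c_i in centroids:
        row = []
        for p in points:
            # u_ij = 1 / sum_l (|c_i - p| / |c_l - p|)^(2/(m-1))
            row.append(1.0 / sum(
                (abs(c_i - p) / abs(c_l - p)) ** (2.0 / (m - 1.0))
                for c_l in centroids))
        u.append(row)
    return u

def fcm_centroids(points, u, m=2.0):
    """Membership-weighted centroid update: the weights appear in both
    the numerator and the denominator of the mean."""
    return [
        sum((row[j] ** m) * points[j] for j in range(len(points))) /
        sum(row[j] ** m for j in range(len(points)))
        for row in u
    ]
```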
Fuzzy C-means seems suitable for the incorporation of simple new-cluster detection: if an
instance's maximum degree of membership to any cluster is under a specified threshold, a
new cluster containing this instance may be created. Similarly to K-means, implementing
incremental support in fuzzy C-means seems viable.

While the author perceives this algorithm as promising and would like to focus on its
potential in the future, the research necessary to implement it correctly for our application
would be extensive and beyond the time frame available for this work; therefore it won't be
used in our project.

3.2.1.4 Conclusion

We have described the three most promising candidates for the algorithm used in our
project. All of them could be modified to suit our needs, but to ensure that the results
yielded by the clustering framework will be of decent quality from the beginning, we have
decided to use hierarchical agglomerative clustering in its single-linkage variation.

3.2.2 Data instances representation


The way data instances are represented in the clustering process is highly important and
depends on many factors. Many steps of the process place specific requirements on the form
of the input data; therefore data conversion and preprocessing is a step, often a very
complex and lengthy one, that most data mining efforts start with. This is usually because
the work takes place in existing data mining environments that have a predefined input
format and only a limited ability to alter how the data is processed.
Given the fact that our clustering framework is being built from scratch, we will design
its architecture in a very modular manner to allow as much independence from the data
format as possible. Only the parts that need to understand the meaning of the data, such as
the distance-calculating component, will have to be customized; the rest of the framework
will work only through these custom implementations. Therefore, there will be no need to
convert the data processed by the framework in our project, and possibly in any other use,
should there be any.
This is a correct approach from an engineering standpoint, but also a mandatory
requirement dictated by the context of the project. The data processed by the clustering is
already present in the email client software, and it is not feasible to perform any
large-scale conversion.
While we have described the emphasis we put on keeping the data the way it is, certain
transformations will have to take place to provide the information necessary for text
analysis (see subsection 3.2.5). This transformation and data extraction will not, however,
alter the existing data, but will put the extracted knowledge into a distinct data store. The
target data store will be designed to facilitate the performance and semantic needs of the
text analysis component.

The data our project will work with are email messages. Emails have a rather complicated
structure, but they can be condensed to the two main parts that interest us:

• A decent set of metadata attributes. Many of these attributes will become instance
features, meaning they will be analyzed during the clustering process, but they don’t
need to be transformed or stored in an additional location.
• Content - each email has one or more content parts. One of the parts will be selected
and taken as a text document. This document part will be subject to processing using
document clustering techniques that require further data transformation and storage.

Document clustering is usually modeled by representing documents as points in a vector
space. Each vector in the vector space represents one document. Terms (individual words)
extracted from all documents are used as the vectors' components, each term being one
component. The value of each component in a vector is a score determining the relevance of
the associated term in the document the vector represents (see subsubsection 3.2.5.2).
Having each document represented by a vector with one component for each existing
term from the document set is highly inefficient. The vectors would be mostly empty, as each
document contains only a small subset of all terms in the set. Therefore, the actual structure
of the data store containing document terms will be different. Each document will have a
list of term and value pairs for terms contained in the document. Although this will slightly
complicate the data structures and database backend for persisting the information, it will
save large amounts of space and will have a positive effect on performance.
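A minimal sketch of this sparse term-list representation (illustrative Python, not the thesis's actual C# code; the function names are ours):

```python
# Sparse document representation: each document stores only the terms it
# actually contains, mapped to their relevance scores, instead of one
# vector component for every term in the whole document set.

def make_document(term_scores):
    """term_scores: dict mapping term -> relevance score for this document."""
    return dict(term_scores)

def component(document, term):
    """Value of the vector component for a term; absent terms are 0.0,
    exactly what the corresponding full-vector component would hold."""
    return document.get(term, 0.0)

doc = make_document({"meeting": 0.4, "friday": 0.7})
```

Reading an absent term as 0.0 keeps the sparse store interchangeable with a full vector while paying storage cost only for terms actually present.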

As has been explained, there is no need to store the data instances in a specific form
for the purposes of clustering. The framework should process instances independently of
their form through customized processing components, and the text analysis component will
only store necessary calculated data, but will not alter data instances or duplicate the data
they contain.

3.2.3 Measuring distance


The algorithm performs clustering based on instance distances. The quality of clustering
therefore depends tightly on the quality of the distance calculation. Most clustering algorithms
operate with a pair-wise distance function that takes two instances as input and outputs a
distance value. The distance function may range from a simple formula to an algorithm
combining several other distance functions, which may perform analytical computations.
The methods of a distance calculation depend on types of data features that are to be
reflected in the calculation. When instances are represented by vectors that contain only
numerical values, simple metrics such as Euclidean distance, Manhattan distance or even
linear combination of the values may be used.
One feature type our project will make use of is a set of textual strings. For this type,
an intersection of two sets may be computed and the intersecting set's size then incorporated
into a composite distance function.
Another feature type may be a time stamp. Two time stamp features may be subtracted
and the result normalized into a predefined interval - thus obtaining a valid scalar value that
may be added into the composite distance.
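Hedged sketches of these two feature comparisons (illustrative Python; the window length is an arbitrary example, not a value prescribed in this chapter):

```python
from datetime import datetime

def set_overlap(a, b):
    """Size of the intersection of two sets of strings; this count can
    then be incorporated into a composite distance function."""
    return len(set(a) & set(b))

def time_similarity(t1, t2, window_hours):
    """Subtract two time stamps and normalize the difference into [0, 1]:
    1.0 for identical times, 0.0 at or beyond the window."""
    diff_hours = abs((t1 - t2).total_seconds()) / 3600.0
    return max(0.0, 1.0 - diff_hours / window_hours)
```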
Many data mining processes employ feature weighting to further improve the output
quality. When feature weighting is used, the value of each feature is modified by a weight
quotient. Weight quotients may be determined by many methods, such as training linear
Support Vector Machines, the gradient descent technique, Odds Ratio or Information Gain
training [7][8].
Document clustering also measures the distance of documents, but here the concept of
weighting individual values is realized in the text analysis process by term relevance scoring.
To join the many distance values calculated by various methods, a grouping distance function
must be formulated. Such a function can be viewed simply as a next level of distance function
processing numerical feature values, except that this time the values are already an extraction
of the features' content. Therefore, the same formulas may be used. To reflect the essence
of the grouping distance function - combining partial distance values into one - a weighted
linear combination is a good choice.

3.2.4 Email messages


An Internet Message (commonly termed email) is a very versatile information container from which a
lot of clustering-relevant data can be extracted. An email message consists of an envelope
and contents [9]. The envelope contains fields such as the sender, recipients, the message's subject,
a time stamp defining when the message was sent, and fields containing references to other
email messages. These fields convey valuable information that has a substantial influence on
recognizing relations between emails.
The contents of the message may either be plain text or consist of content parts -
MIME parts [10]. Most of today's email communication is in the format of MIME
parts. These parts allow not only several sequential data blocks, such as the mail text and a
binary attachment, but can form nested structures allowing alternative forms
of one content (e.g. a rich text format and a plain text format of the message), encapsulation of
altered content (e.g. encryption using the S/MIME standard) and many other advantages.
This flexibility slightly complicates the necessary processing logic in a mail client, because
it has to select the best part to display to the user. It is obvious that the displayed part will
also be the one most suitable for the document clustering analysis.

When clustering email messages, both the envelope fields and the contents of the message
must be taken into account. Therefore several metrics for envelope fields must be combined
with the metric of the document clustering component to obtain the final distance function,
as described in the previous section.

3.2.5 Document clustering


Document clustering aims at grouping similar documents based on analysis of their text. It is
a field of text mining, which derives many concepts from information retrieval and statistics.
Many approaches to determining similarity between texts exist, but the process usually has
two parts: document processing and similarity calculation. The document processing takes
place at the beginning of the text clustering process and can be divided into several steps:

• Decompose the text to single tokens - in most cases words. Apply preprocessing to the
words, such as stemming, case conversion or stop words exclusion. Preprocessed words
are now regarded as terms.
• Analyze terms in the context of the document they were extracted from and calculate
the values necessary to determine the term's relevance score later. The relevance score
isn't usually calculated during document processing, as its formula parameters change
when other documents are clustered. Therefore, it is preferable to store intermediate
values that relate only to this document.
• Update global values relating to each term processed in the document’s analysis.

When the clustering itself takes place, pairwise document similarities are calculated.
Ideally, the documents' texts are not processed during this calculation, as all necessary
information - the terms the documents contain, associated with the values necessary for
term relevance calculation - is already known from the initial document analysis.
Document similarity calculation iterates through the documents' terms, determines a
relevance score for each term and uses the selected measure to calculate the similarity value
from all term scores from both documents.
In the following sections, we will describe the relevant parts of this process in more detail.
3.2.5.1 Tokenization

Tokenization is a process of splitting a string of characters into tokens based on a predefined
lexical grammar. In the case of document clustering, a simple grammar defining one token
as consisting of letters and separated by a non-letter character may be sufficient.
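Such a grammar can be sketched with a regular expression (an illustrative stand-in; the thesis's own tokenizer is described in section 4.4):

```python
import re

def tokenize(text):
    """Split a string into tokens: maximal runs of letters, with any
    non-letter character acting as a separator. Lower-casing stands in
    for the preprocessing step applied to the words."""
    # [^\W\d_] matches word characters that are neither digits nor '_',
    # i.e. letters only.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
```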

3.2.5.2 Term ranking and TF-IDF

It has been described that each term needs a relevance score that is used during
similarity calculation. The following reasons require such a score to exist, and the
formula used to calculate it has to address them:

• Documents vary in length. Even if one document has more occurrences of a term than
another document, it does not mean that it is more related to the term. The first document
may just be several times longer and may even deal with a completely unrelated topic.
Therefore the term's number of occurrences must be put in relation to the document's
length.
• Individual terms vary highly in their importance. Some words are very frequent and
may be found in a large portion of a document set, while others have very specific
meaning and are present only in a few documents. The fact that two documents share a
specific, infrequent term thus has much higher weight than if they shared a term that
can be found in most of the other documents. Thus a notion of the global relevance of
a term must be maintained and used appropriately in the similarity calculation.

Both of these issues are addressed in a popular term ranking measure named "TF-IDF"
(Term Frequency - Inverse Document Frequency). It is the product of a term's frequency within
a document and an inverse frequency of the term's presence in all documents. The TF-IDF value
for term i in document j is defined by the following formula:

\[ tfidf_{ij} = \frac{t_{ij}}{\sum_m t_{mj}} \cdot \log \frac{|D|}{m_i} \]

where t_{ij} is the number of occurrences of term i in document j, \sum_m t_{mj} is the sum of
occurrences of all terms in document j, |D| is the number of documents in the set and m_i is the
number of documents where term i appears.
As can be seen from the formula, the score increases when the term is frequent within the
document, but decreases with the number of documents it is present in. This measure has
been proven to provide satisfactory output and we have decided to use it in our clustering
framework.
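The formula translates directly into code (an illustrative sketch, not the thesis implementation; the natural logarithm is chosen here, as the base only scales all scores uniformly):

```python
import math

def tf_idf(term, doc, all_docs):
    """TF-IDF of a term in one document.
    doc: dict mapping term -> number of occurrences in this document.
    all_docs: list of such dicts, one per document in the set."""
    tf = doc[term] / sum(doc.values())          # t_ij / sum_m t_mj
    df = sum(1 for d in all_docs if term in d)  # m_i: documents containing term
    return tf * math.log(len(all_docs) / df)    # tf * log(|D| / m_i)
```

Note that a term occurring in every document scores 0, reflecting that it carries no discriminating information.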

3.2.5.3 Cosine measure

A custom metric is necessary to calculate the similarity of two documents based on term relevance
scores. The metric must neutralize the number of terms the documents contain and it should
be normalized into a defined interval. The Cosine Measure is a frequent measure of choice used
with the TF-IDF measure. For the similarity of documents j and k, computed by summing term
scores, it has the following form:

\[ cosine\_measure_{jk} = \frac{\sum_n t_{nj} \cdot t_{nk}}{\|j\| \cdot \|k\|} \]

where n iterates over the union of all terms from j and k, t_{nj} is the number of occurrences of
term n in document j and \|j\| is the magnitude of document j. The magnitude can be calculated
as if j were a Euclidean vector: \|j\| = \sqrt{\sum_n t_{nj}^2}. When term n is not present in
document j, the number of occurrences t_{nj} shall be 0.
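A sketch of the measure over sparse score dictionaries (illustrative; absent terms contribute 0, per the convention above):

```python
import math

def magnitude(doc):
    """Euclidean magnitude of a sparse score vector (dict term -> score)."""
    return math.sqrt(sum(s * s for s in doc.values()))

def cosine_measure(j, k):
    """Cosine measure of two documents given as dicts term -> score.
    Iterating over j's terms suffices: terms absent from j contribute
    0 to the dot product anyway."""
    dot = sum(score * k.get(term, 0.0) for term, score in j.items())
    return dot / (magnitude(j) * magnitude(k))
```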

3.2.6 eM Client

The clustering framework will be integrated within the software "eM Client" [11]. It is a modern
email client in whose development the author participates. It aims at being a full-featured
collaboration product easily integrated with third-party server messaging and groupware
solutions. Aside from email, the client supports calendaring, task lists and contact management.
All of these items may be synchronized with a server through a variety of open protocols
such as IMAP, POP3, SMTP, CalDAV or CardDAV.

The product's history is rather short, as development began three years ago. The
development team now has five members, although most of the software has been written
by a team of three that expanded several months ago.

The client is built on the Microsoft .NET Framework, which allows fairly dynamic growth of the
application thanks to powerful features of the runtime environment and to a large base class
library that provides solid building blocks for the software's expansion in almost any direction.

Figure 3.3: The main window of eM Client

eM Client does not support any extensibility through plugins yet. Therefore, any new
functionality needs to be integrated directly into the project's source code. However, thanks to
the well-designed architecture of the software, new component integration code needs to be added
only to a few locations. The user interface also allows new controls to be easily incorporated
into the existing layout and connected with other parts of the interface. The integration will not
require any existing functionality to be modified.
For the implementation discussed in this work, we have decided that the clustering framework
will operate only on items from the Inbox folder of the mail client.
Chapter 4

Design

Based on the general principles and facts described in the previous chapter, we will outline
the concrete form of the algorithms and methods used in the clustering framework. The ideas
were formed partially before the implementation, but changed during coding based on arising
issues, and even after the framework had been finished.

4.1 Hierarchical agglomerative clustering

From several candidates for the clustering algorithm, we have chosen hierarchical clustering.
Its principle can be implemented fairly easily, it will allow incremental clustering and
the way it operates will enable us to cache many of the intermediate values it produces.
We have decided on the agglomerative variant of the algorithm. Each instance therefore
starts in its own cluster and individual clustering iterations merge the nearest clusters.
With regard to the purpose of the clustering, we must view the clustering process as
continual and never-ending. It will not be performed again and again, but rather started once
and then slowly proceeding as new data instances are added to the dataset.
When the process is started for the first time, initial processing must be performed. All
instances present in the dataset must be analyzed, necessary values calculated, new clusters
created and distances between the clusters calculated. Afterwards, once all instances have
been processed, clustering on the dataset must be performed. We call this process initial
clustering.
The algorithm maintains several data structures that contain information necessary to perform
the clustering effectively. Some of these structures could be abandoned, but the performance
of the algorithm would then fall below a usable threshold. The basic structures are:

• Instance distance matrix contains a matrix of distances between all instances being
clustered. This matrix is updated every time a new instance is added by inserting
values returned by the distance function. All subsequent operations using a distance
value retrieve the calculated value from this matrix. The matrix uses a two-dimensional
dictionary.


• Cluster distance matrix contains distances of all existing clusters between each
other. Modifications are performed during clustering process when clusters are merged
and when a cluster for newly added instance is created. This matrix also uses a two-
dimensional dictionary.
• Clusters list maintains all clusters that exist in the current state of the clustering
process.
• Instances is the list of all instances in the process. Although most of the time this
list contains the same items as the algorithm's data source, it allows for better clustering
framework separation and resolves issues with newly added instances.

When an instance is processed by the algorithm, either during initial clustering or when
being added to the dataset, three basic operations are performed:
• The instance is passed to the distance function for processing. The distance function
performs the necessary analysis, such as term extraction and ranking (see subsection 3.2.5).
• The instance is added to the algorithm's data structures and the distance function is invoked
for pairs of existing instances and the new instance to add values to the instance
distance matrix.
• A new cluster is created and added to the cluster list.
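The three steps above can be sketched over in-memory structures (illustrative Python; in the actual framework these structures are database-backed dictionaries):

```python
def add_instance(new, instances, inst_dist, clusters, distance_fn):
    """Process one new instance: fill its row and column of the instance
    distance matrix, then create a singleton cluster for it."""
    for other in instances:
        d = distance_fn(new, other)
        inst_dist[(new, other)] = d        # symmetric matrix entries
        inst_dist[(other, new)] = d
    instances.append(new)
    clusters.append({new})                 # every instance starts in its own cluster
```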

For similarity between clusters, the minimum distance measure has been chosen. The
minimum distance measure for two clusters is defined as the minimum of all distances
between instances from the two clusters. This approach is used when filling the cluster
distance matrix. When clusters are merged, the new cluster's distance to a cluster A is the
minimum of the distances of the old clusters to cluster A, and as only two clusters are ever
merged at a time, this simplifies to choosing the smaller of two numbers.
The clustering itself is realized within a loop that iterates as long as any two clusters are closer
than a specified threshold. The value of this threshold is very important to the quality of the
clustering. A single iteration of the loop consists of the following steps:
• Find the nearest two clusters. If their distance is above the threshold, end the loop.
• Create a new cluster containing instances from the two clusters.
• Calculate distances to other clusters. Go to the first step.
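A minimal, unoptimized version of this loop (single linkage computed directly from an instance distance function; the real framework instead reads cached values from its distance matrices):

```python
from itertools import combinations

def single_link(dist, a, b):
    """Minimum distance between instances of clusters a and b."""
    return min(dist(x, y) for x in a for y in b)

def agglomerate(instances, dist, threshold):
    """Repeatedly merge the nearest pair of clusters until no pair
    is closer than the threshold."""
    clusters = [{i} for i in instances]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(dist, *pair))
        if single_link(dist, a, b) > threshold:
            break                          # nearest pair too far apart: stop
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters
```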

Incremental clustering is supported by performing similar instance processing as during
the initial clustering and then executing the clustering loop. If the instance lies close enough
to any existing cluster, merging will take place; otherwise the instance's cluster will continue to
exist.
The client allows email messages to be permanently removed. This event is announced
to the clustering algorithm and clean-up operations are performed. The instance is
removed from the instances list and from the cluster that contains it. The cluster's distance to
other clusters must then be updated. This operation is computationally far more costly
than a cluster merge, as the modified cluster's instances must be compared with
instances from all other clusters.

4.2 Email messages

To be able to work with emails and extract their content, the framework must be able to parse
them. Although basic functionality to do so is relatively easy to implement, a full-featured,
convenient library that provides all necessary functions is a long-term effort.
Thankfully, when integrated, the framework can make use of the library used in eM
Client to work with email messages.
For testing purposes this library might be used independently of eM Client.

4.3 Data instance features

The features used as a base for the distance calculation are of the same, if not higher,
importance as the clustering algorithm itself. A large number of features can be created
from email message properties. During analysis of what is important when deciding whether
items discuss the same topic, several obvious choices came up. We have also drawn from
other research, mainly Cselle's work [3]. Each of the properties also has a specific method of
comparison and calculation of the numeric distance value. We devised the calculation of
each feature to express similarity on the interval from 0 to 1. The following features participate in
the distance calculation:

• Most email clients add headers for unique message identification and, for items that
are replies, a header containing the identification string of the message being replied
to. The unique identification header is named Message-Id and the header in reply
messages is In-Reply-To. The value of these headers has no specific meaning and is usually
randomly generated. Its only use is as an operand in string comparison to determine whether
messages are related. It is probable that related messages will discuss the same topic.
The numeric feature value from these properties is 1 if one mail is a reply to the other (or
vice versa) and 0 if there is no relationship between the messages.
• Based on the same headers as above, we check whether two mails are replies to the
same email. This would be a typical case when multiple users discuss a topic. When
the mails share the In-Reply-To value, this feature is 1, otherwise it is 0.
• The sender also bears informational value. It is stored in the From header of an email.
When the senders of the compared mails are identical, this feature is 1.
• The recipient sets (present in the header field To) of the email messages are compared to
determine how many recipients the emails share. The size of the intersecting set relative to
the size of the union of both recipient sets is calculated.
• Each email also has an origination date in the Date header. Emails relating to the same
topic will often be near each other on a time line. We have decided to normalize the
time distance to an interval of two months. The feature value calculation computes the
time difference in hours divided by the total hours in two months and performs normalization
and inversion, so that a value of 1 represents items not differing in time and a value of 0
items that are further apart than two months.
• Subject similarity is a very important feature. The subjects of both messages are tokenized
and the number of tokens present in both subjects is divided by the total number of
tokens in both texts.
• The final, most important feature is the text similarity. Our distance function
retrieves the text similarity from the text analysis component, which is described
in the following section. The value is normalized to the interval 0 to 1.

Listed features are combined in a distance function as described in section 4.5.
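Illustrative sketches of a few of these feature calculations (Python stand-ins; representing messages as plain dicts, and reading the subject rule as counting shared tokens in both texts, are our assumptions, not the thesis's model):

```python
def reply_relation(a, b):
    """1.0 if one message is a reply to the other, else 0.0.
    Messages are dicts with 'id' and an optional 'in_reply_to' key."""
    if a.get("in_reply_to") == b["id"] or b.get("in_reply_to") == a["id"]:
        return 1.0
    return 0.0

def recipient_similarity(to_a, to_b):
    """Intersection size over union size of the two recipient sets."""
    a, b = set(to_a), set(to_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def subject_similarity(subj_a, subj_b):
    """Tokens shared by both subjects, counted in each text, divided
    by the total number of tokens in both texts (our reading)."""
    ta, tb = subj_a.lower().split(), subj_b.lower().split()
    if not ta and not tb:
        return 0.0
    shared = set(ta) & set(tb)
    hits = sum(1 for t in ta + tb if t in shared)
    return hits / (len(ta) + len(tb))
```

Each function stays in [0, 1], matching the normalization requirement stated above.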

4.4 Text analysis


Performing a quality text analysis and a subsequent similarity calculation should bring the
highest benefit to the information value of the distance function. The text analysis component's
operation can be divided into tokenization, term ranking and similarity calculation, as
described in subsection 3.2.5.
Tokenization in our framework simply splits words separated by any non-letter character.
Regular expressions are used to achieve this and, although not necessary now, this will allow
the tokenizer to be improved in the future. The used regular expression is ”(\S+)\s”.
Before describing the other two steps, it is necessary to list data structures that these
steps use:

• Terms is a dictionary of all terms in all documents; each term has an associated value
defining in how many documents the term is present. This structure is updated when a
new instance is being processed by the distance function. The occurrence count value
is used when calculating the IDF coefficient.
• CachedIDFs dictionary is used to cache Inverse Document Frequency values of terms
to avoid calculating them every time a document similarity is being computed.
• Documents is also a dictionary, this one containing information about the Term Frequency
of terms contained in a data instance - a document. Each document has an associated
dictionary - TermTFs - of terms within that document. This nested dictionary stores
a Term Frequency value for each term.

Term ranking iterates through all terms returned from the tokenization and updates the
necessary items in the structures. The Terms dictionary is checked for the presence of each term -
if it is present, the occurrence count is incremented, otherwise the term is added with a value of 1. The
cached IDF value for the term must be updated in CachedIDFs to reflect that the occurrence
value has changed. The last update concerns Documents, where a new document entry must
be added and every processed term inserted into the inner dictionary.
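The update step can be sketched as follows (illustrative Python mirroring the Terms / CachedIDFs / Documents structures; we invalidate the cached IDF entry rather than update it in place, which is one way to realize the refresh described above):

```python
def process_document(doc_id, tokens, terms, cached_idfs, documents):
    """Register a newly tokenized document in the global structures.
    terms: term -> number of documents containing it.
    cached_idfs: term -> cached IDF value (stale entries are dropped).
    documents: doc_id -> {term: term frequency within the document}."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    for t in counts:
        terms[t] = terms.get(t, 0) + 1     # one more document contains t
        cached_idfs.pop(t, None)           # its IDF must be recomputed
    total = len(tokens)
    documents[doc_id] = {t: c / total for t, c in counts.items()}
```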
As can be seen from the data structures used, the final TF-IDF score is not stored anywhere.
This is because a new document addition would require recalculating the score for
every contained term in every document that contains the same term, which would be highly
inefficient.

When a text similarity has to be calculated, both documents' entries are retrieved from the
Documents dictionary. The term lists from both documents are enumerated in sequence. For
each term, TF-IDF is calculated from the values in CachedIDFs and TermTFs, and variables
holding the values for the cosine measure numerator and denominator are updated with
the TF-IDF value.

4.5 Distance function


The distance function combines all intermediate similarity values to form one resulting
distance value. A linear combination of the intermediate values is used, and its coefficients are
the feature weights. The value then needs to be inverted to become a distance. As each feature
value was normalized to a known interval, we may calculate the maximal achievable value
and subtract the sum of the current values. The final step is to normalize the resulting distance,
which is achieved by dividing by the maximal achievable value. The complete formula has
following form:
\[ distance = \frac{\sum_m m_{max} - \sum_m m_{val}}{\sum_m m_{max}} \]

where m enumerates over all features, m_{max} is the maximal value of feature m and m_{val}
is the value of feature m for the current document.
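The formula translates directly (an illustrative sketch; the feature names and weights below are made-up examples, not the thesis's values):

```python
def combined_distance(similarities, weights):
    """Weighted linear combination of per-feature similarities (each in
    [0, 1]), inverted and normalized into a distance in [0, 1]."""
    max_total = sum(weights.values())                           # sum of m_max
    value = sum(weights[f] * similarities[f] for f in weights)  # sum of m_val
    return (max_total - value) / max_total
```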

4.6 Database backend


The data used by the components from this chapter needs to be stored in a persistent location.
There are high demands on the performance of the persistent store, so a suitable database
engine needs to be used. eM Client uses the relational database engine SQLite, which is very
lightweight and provides good performance. For compatibility reasons and based on previous
positive experience, we have chosen to use SQLite in the clustering framework as well.
eM Client also employs a typical two-layer model for objects, with storage objects directly
representing the database data and application objects operating above the storage objects with
higher-level logic. The storage objects are further managed by repositories. We have
closely mimicked this architecture when designing the persistence classes for this project.
There are several types of application objects and collections for enumeration of these
objects. The objects represent clusters, data instances, document information (containing the list of
a document's terms) and terms. The collections are a list, a dictionary and a two-key dictionary.
The detailed form of the model is described in section 5.3 and a database schema is attached
in Appendix A.

4.7 User interface


The interface we will add to eM Client needs to fit in well with the other parts of the software.
The client contains a panel on the right side of the main window, where the contact list, com-
munication history and attachment history are placed. The side bar is a non-intrusive, yet handy
location, and our functionality is similar to the other components on the panel - it makes the
user's work with email easier. Each component in the side bar is represented by a box
control that needs to be implemented.
The visual layout of the control should be separated into two areas - a list containing existing
clusters and an area containing details about each cluster. The clustering interface should
actively react when the user selects an email and look up the cluster to which the email
belongs. This way the user will always be presented with the clustering context of the email.
When a cluster is selected, either from the list or by being looked up when an email is selected,
it should be very easy to filter just the emails from that cluster. The interface should hide if
the currently selected folder isn't being clustered.

4.8 Integration layer


To interconnect the clustering framework with eM Client, an integration layer needs to be
created. For such relatively standalone functionality, singleton managing classes are used
in eM Client. These classes are initialized at startup and autonomously perform the necessary
operations, based on events from application classes, collections or other sources.
The class will need to carry out the following tasks:

• Manage the background thread in which all clustering operations will be invoked. All
events must be delegated to this thread.
• Act as a data source for the clustering algorithm. This consists of enumerating items
from the client's folder, creating matching data instance objects, monitoring additions
and removals in the clustered folder, and announcing these events in transformed
form to the clustering framework.
• Execute initialization tasks when clustering is launched for the first time and actions
based on user’s choices.
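The event-delegation pattern described in the first task can be sketched as follows (illustrative Python; eM Client's actual manager classes are C# singletons, and all names here are ours):

```python
import queue
import threading

class ClusteringManager:
    """Queues client events and processes them on one background thread,
    so all clustering operations run off the application threads."""
    def __init__(self, handler):
        self._events = queue.Queue()
        self._handler = handler
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, event):
        """Called from application threads; delegates to the worker."""
        self._events.put(event)

    def stop(self):
        self._events.put(None)             # shutdown sentinel
        self._worker.join()

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:
                break
            self._handler(event)
```

A single worker thread keeps the clustering data structures free of concurrent modification without per-structure locking.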
Chapter 5

Implementation

Based on the design from the previous chapter, we have implemented the clustering framework
and the integration code, and integrated the framework into the client.
The platform used for implementation is the Microsoft .NET Framework. To maintain
compatibility with eM Client and its requirements, we use version 2.0 of the .NET Framework.
The development environment has been Microsoft Visual Studio 2008.
When designing the object model of the project, we made heavy use of interfaces to ensure
that individual components do not depend on a specific implementation and any part of the
process may be easily swapped for a different implementation of the interface.
The project's architecture can be divided into several blocks, which we will describe in the
following sections. These sections are concerned only with the general architecture consisting
of interfaces and with the classes that are used by the framework when integrated with eM
Client. There are also many classes that were used only when working with the framework
as a standalone application. These classes either implement the described interfaces in
a different manner (e.g. to access email files from a disk folder) or provide other helper
functionality. These classes are available on the attached CD.

5.1 Clustering
This part contains all the infrastructure related to the clustering itself. The elementary item
interfaces for clusters and data instances are ICluster and IInstance. The clustering algorithm is
represented by IClusterer, which depends on a source of instances defined by IDataSource.
The algorithm also needs a metric that calculates the instance similarity, and this metric is
IDistance.
The ICluster interface contains the property Identifier for textual cluster identification and In-
stances, which is a dictionary of instances paired with a membership degree. In the current
implementation, where hierarchical clustering is used, the degree of membership is always 1.
IInstance has only one string property, Identifier, that provides a string for textual identifica-
tion. No other, more specific members of the generic instance exist. The clustering
algorithm doesn't need to access any instance properties, since it lets IDistance do the
evaluation, and it is expected that IDistance will be implemented to work with a concrete
IInstance implementation.
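The relationships among these interfaces can be sketched with Python stand-ins (the originals are C# interfaces; only the shape is shown here, and the concrete class is a made-up example):

```python
from abc import ABC, abstractmethod

class IInstance(ABC):
    @property
    @abstractmethod
    def identifier(self):
        """Textual identification of the instance."""

class IDistance(ABC):
    @abstractmethod
    def distance(self, a, b):
        """Distance between two instances; implementations are expected
        to know the concrete IInstance type they work with."""

class MailInstance(IInstance):
    """Example concrete instance exposing only an identifier: the
    clusterer itself never inspects instance internals."""
    def __init__(self, ident):
        self._ident = ident

    @property
    def identifier(self):
        return self._ident
```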


Since we support online clustering throughout the framework, but do not necessarily
require the data source to be of an online nature, members specific to an online data source have
been separated into the IDataSource descendant IOnlineDataSource.
Our implementation of IClusterer is the HierarchicalClusterer class. In section 4.1 we
have described several data structures that this algorithm uses. To separate the implementation
of the data structures - we have used in-memory structures during testing, while database-
backed structures are implemented in the final version - an interface IHierarchicalClusterer-
Context is used by the hierarchical clusterer class to access its data.

5.2 Mail clustering


To support clustering of emails several of the above interfaces had to be implemented to
perform specific operations. Several other classes and interfaces had to be added in order to
correctly structure the code.
Interface IMailInstance has been created as a descendant of IInstance to provide access
to the email messages the instances represent. The email is exposed through members Mail.
IDistance was implemented in MailDistance, which performs distance calculations as described earlier. However, since text clustering is a more complex process, MailDistance delegates all operations related to document clustering to MailTextAnalyzer.
The class MailTextAnalyzer performs the processing of new mail documents and the text similarity calculation. This requires a data structure that is filled during analysis and used afterwards for the calculations. The same approach as with hierarchical clustering was used: an interface named IMailTextAnalyzerContext hides the storage implementation from the analyzer.
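A minimal sketch of this delegation, with a toy word-overlap similarity standing in for the real document analysis described earlier (the real MailDistance also combines email-specific signals, which are omitted here):

```python
class MailTextAnalyzer:
    """Stand-in for the text analysis: Jaccard overlap of word sets."""
    def similarity(self, text_a, text_b):
        wa, wb = set(text_a.lower().split()), set(text_b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class MailDistance:
    """Delegates all document-clustering work to the analyzer."""
    def __init__(self, analyzer):
        self._analyzer = analyzer
    def distance(self, mail_a, mail_b):
        # Distance is the complement of the body similarity.
        return 1.0 - self._analyzer.similarity(mail_a["body"], mail_b["body"])
```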

Figure 5.1: Architecture of the core clustering framework and associated classes for mail
clustering.

5.3 Storage
Storage classes provide persistence to other parts of the framework. The storage classes
operate over a relational database, but completely shield the rest of the project from having
to manage the database.
The persistence is exposed through MailHierarchicalClustererContext and MailTextAnalyzerContext, implementations of the context interfaces IHierarchicalClustererContext and IMailTextAnalyzerContext. These implementations then access the application objects described below.
Database storage is separated into two levels:

• The data level cooperates directly with the relational database: it executes the SQL queries for inserting, updating, removing or listing items. Each type of item, such as a cluster or a mail data instance, has its own interface inherited from IRepositoryItem. Data objects implementing this interface are then used for in-memory storage of the item's properties. Each type of item is managed by a central repository class inherited from the base class DbRepository; note that DbRepository has been taken from eM Client's storage layer. The repository takes care of all database interaction and manages a cache of storage objects. A specialized abstract class, DbDoubleKeyRepository, serves as a base for repositories that are encapsulated by two-key dictionary collections at the application level.
• The application level provides an abstraction layer over the database. Item application classes encapsulate the data storage items and trigger modification operations in the repository. Collection classes implement the standard platform operations for enumeration, addition and removal of items. The matrices used in IHierarchicalClustererContext are implemented as descendants of the class DoubleKeyDictionary.
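A two-key dictionary of the kind the matrices descend from might look like this sketch; the method names are illustrative, not the actual eM Client API.

```python
class DoubleKeyDictionary:
    """Two-key dictionary, e.g. a distance matrix keyed by a pair of
    cluster or instance identifiers."""
    def __init__(self):
        self._rows = {}
    def __setitem__(self, keys, value):
        k1, k2 = keys
        self._rows.setdefault(k1, {})[k2] = value
    def __getitem__(self, keys):
        k1, k2 = keys
        return self._rows[k1][k2]
    def __contains__(self, keys):
        k1, k2 = keys
        return k1 in self._rows and k2 in self._rows[k1]
    def remove_key(self, key):
        """Drop the row and every column entry for a key, e.g. when a
        cluster is merged away."""
        self._rows.pop(key, None)
        for row in self._rows.values():
            row.pop(key, None)
```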

5.4 User interface


The user interface in the client was implemented as designed in section 4.7. The class ControlSidebarBoxClusters is a descendant of a control that supports embedding in the sidebar.
The inherited class contains a datagrid control that displays the list of clusters and a panel where information about the current cluster is displayed. These two controls change their visibility when the folder selection changes and are visible only if the current folder is being clustered. The information in the detail panel is refreshed either when the user selects a cluster from the list or when a mail is selected in the main area of eM Client.
The user interface also allows filtering messages by cluster. When the user double-clicks an item in the cluster list or clicks the "Filter cluster items" button in the detail panel, filtering is enabled in the messages area of the application and only messages from the current cluster are visible.
Integration with other user interface components of the client is implemented in several methods of the application's main form, formMain.

Other interaction with the user, such as confirming actions of the clustering framework, is implemented through message boxes in the class that manages clustering, described in the following section.

Figure 5.2: Fragment of eM Client's main form showing ControlSidebarBoxClusters's look.

5.5 Integration layer


To interconnect the clustering framework with the client, a class controlling the clustering and serving as a bridge for data in both directions needs to exist. This class is named ClusteringManager and follows the Singleton design pattern. Upon eM Client startup, an initialization method of this class is executed. The initialization sets up all components of the clustering framework, including the database-backed contexts for the hierarchical clusterer and the text analysis.
A background thread is also created during the initialization. All actions, such as the initial clustering or the addition and removal of mail items, are delegated to this thread and passed to the clusterer class; lengthy clustering operations therefore do not block execution of the threads that originated the events. The thread also runs at background priority to avoid slowing the program down.
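The delegation to a single worker thread can be sketched with a simple task queue. This is a Python sketch; Python threads cannot be given the background priority that .NET threads can, so that part is omitted.

```python
import queue
import threading

class ClusteringWorker:
    """Serializes clustering work on one background thread so that the
    threads originating the events are never blocked."""
    def __init__(self, clusterer):
        self._clusterer = clusterer
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, action, payload):
        # Called from event handlers, e.g. ("add", instance); returns at once.
        self._tasks.put((action, payload))

    def _run(self):
        while True:
            action, payload = self._tasks.get()
            if action == "add":
                self._clusterer.add(payload)
            elif action == "remove":
                self._clusterer.remove(payload)
            self._tasks.task_done()
```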
ClusteringManager implements IOnlineDataSource to provide instances to the clusterer. Mail items in the clustered folder are enumerated and a corresponding object based on IMailInstance is created for each item. The folder's events are monitored and trigger the analogous events of the online data source interface.
To allow retrieving the mails that data instances represent, another interface, IMailLoader, is used by the data instance objects. Whenever an instance object needs to access its email and does not already hold a reference to the email item, this interface looks the mail item up by the identifier stored in the instance object.
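The lazy lookup can be sketched as follows; DictMailLoader is a hypothetical stand-in for a real IMailLoader implementation backed by the client's mail store.

```python
class MailInstance:
    """Data instance that looks its email up on demand through a loader."""
    def __init__(self, mail_identifier, loader):
        self.identifier = mail_identifier
        self._loader = loader
        self._mail = None

    @property
    def mail(self):
        if self._mail is None:  # no reference yet: ask the loader once
            self._mail = self._loader.load(self.identifier)
        return self._mail

class DictMailLoader:
    """Toy loader resolving identifiers from an in-memory dict."""
    def __init__(self, store):
        self._store = store
        self.load_count = 0
    def load(self, identifier):
        self.load_count += 1
        return self._store[identifier]
```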
The last role of ClusteringManager is to expose the list of clusters from the framework. This is achieved by a property that returns the cluster list present in the hierarchical clusterer context interface.
Chapter 6

Testing

During and after the project's implementation we tested the functionality on a dataset representing a portion of the author's inbox. The performance of the clustering was analyzed with regard to memory and time. Quality was analyzed in two ways: by subjective evaluation and using the cophenetic correlation coefficient.

6.1 Clustering quality


6.1.1 Subjective evaluation
First, a subjective evaluation of the clusters' quality was carried out, judging whether the clusters correspond to the real topics of the processed emails and whether the aggregation eases orientation in the sample dataset. The result of this review was positive: emails were grouped into convenient clusters that distinguished the individual topics. To illustrate this, we have extracted a dendrogram of the sample dataset:


Figure 6.1: Dendrogram of the sample dataset. The red line denotes the maximal distance at which clusters are merged; colored boxes encapsulate the multi-item clusters presented to the user.

As can be seen, 8 clusters containing more than one email were created, and each represents a compact topic corresponding to how the author thinks of the contained emails. To let the reader judge for himself, the emails' subjects matched to the leaves of the dendrogram can be found in Appendix B. Emails that were not merged with other items are also represented as clusters in the framework, but will not be shown to the user in the future.

6.1.2 Objective evaluation

To perform an objective evaluation of the clustering, we had to select one of the known quality measures. Since the testing dataset was not manually processed to assign a target cluster to each data instance, we have to restrict our choice to internal quality measures. One such measure, aimed at hierarchical clustering, is the cophenetic correlation coefficient.
This coefficient tells us how well the clustering represents the relations between instances given their distances. Two sets of values are compared: the instance distance matrix and the distances at which the individual clusters merge in the dendrogram. The calculation of the coefficient was done in MathWorks MATLAB, which contains the necessary functions. The coefficient lies in the interval from 0 to 1, where 1 indicates absolute correlation of the clusters' distances with the instances' distances.
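For readers without MATLAB, the same computation can be sketched in plain Python: build the cophenetic distances from a single-linkage agglomeration, then correlate them with the original distances. This mirrors what MATLAB's cophenet does; the three-point dataset in the test is made up for illustration.

```python
from itertools import combinations
from math import sqrt

def single_linkage_cophenetic(dist):
    """dist: {(i, j): d} for i < j over items. Returns the cophenetic
    distances: the merge height at which each pair first joins a cluster."""
    items = sorted({k for pair in dist for k in pair})
    clusters = [{i} for i in items]
    coph = {}
    d = lambda a, b: dist[(a, b) if a < b else (b, a)]
    while len(clusters) > 1:
        # Closest pair of clusters under single linkage (min point distance).
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            h = min(d(a, b) for a in clusters[x] for b in clusters[y])
            if best is None or h < best[0]:
                best = (h, x, y)
        h, x, y = best
        for a in clusters[x]:
            for b in clusters[y]:
                coph[(a, b) if a < b else (b, a)] = h
        clusters[x] |= clusters[y]
        del clusters[y]
    return coph

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```

The cophenetic correlation coefficient is then pearson applied to the original and the cophenetic distances over all pairs.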
We performed several experiments on the same dataset as in the previous section. There were two elementary properties whose impact on the clustering we tested: the distance function formula (section 4.5) and the cluster distance metric of the hierarchical clustering (subsubsection 3.2.1.2).
The function that calculates the final distance between instances may have many forms. Our current implementation uses a weighted linear combination, but there are two other very popular choices: the Euclidean distance and the cosine measure. Since the cosine measure requires the component values of the instance vectors, which is not compatible with the way our clustering works, only the Euclidean distance remains. We modified the distance function and ran the clustering with both the standard and the alternative formula. The resulting distance matrices were extracted and used in the testing described below.
The second property we modified is the cluster distance metric, which has a major impact on the shape of the clusters. The available possibilities are single linkage (which is used in our implementation), average linkage and complete linkage.
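The three linkage criteria reduce to simple formulas over pairwise instance distances; a sketch:

```python
def single_linkage(ca, cb, d):
    """Cluster distance = minimum pairwise instance distance."""
    return min(d(a, b) for a in ca for b in cb)

def complete_linkage(ca, cb, d):
    """Cluster distance = maximum pairwise instance distance."""
    return max(d(a, b) for a in ca for b in cb)

def average_linkage(ca, cb, d):
    """Cluster distance = mean of all pairwise instance distances."""
    return sum(d(a, b) for a in ca for b in cb) / (len(ca) * len(cb))
```

Single linkage tends to produce elongated, chained clusters, while complete linkage favors compact ones; average linkage lies between the two.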
Using the extracted instance distance matrices, we performed a series of cophenetic correlation coefficient calculations in MATLAB for each distance function and cluster metric, and compiled the following comparison table:

           Euclidean   Linear comb.
Single       0.965        0.960
Average      0.974        0.963
Complete     0.958        0.934

The data show that the Euclidean instance distance measure performs slightly better than the linear combination, which was implemented first. We have therefore modified the framework to use the Euclidean distance by default.
The cluster metric results indicate that average and single linkage perform better for our clustering task; single linkage was kept in use for now.
The differences between the values are quite subtle, and therefore further evaluation on several larger datasets will be needed to obtain more confident results.

6.2 Data structures optimization


The data structures used in the context interfaces (see section 5.3) are linked with a database, which means that every operation such as a modification, enumeration or lookup performs an SQL query. While this is not a problem for lists such as Clusters or Instances, where modifications and enumerations are performed at a sustainable rate, it becomes an issue for the two-dimensional dictionaries used to store the instance and cluster distance matrices, which are heavily utilized during the clustering process (their utilization corresponds to the algorithm's O(n²) complexity). We have attempted to improve the situation at least for read operations by implementing a cache. An inner, in-memory two-dimensional dictionary has been added, and once a value for a key pair is modified or retrieved, it is stored in the cached dictionary. All subsequent accesses are served from the cache.
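The read-through cache can be sketched like this; the get/set backend interface is hypothetical, standing in for the database-backed dictionary.

```python
class CachedDistanceMatrix:
    """Read-through cache over a database-backed two-key dictionary: every
    value read or written is mirrored in memory, so repeated reads during
    the O(n^2) clustering loop avoid SQL queries."""
    def __init__(self, backend):
        self._backend = backend  # object exposing get(k1, k2) / set(k1, k2, v)
        self._cache = {}

    def get(self, k1, k2):
        if (k1, k2) not in self._cache:
            self._cache[(k1, k2)] = self._backend.get(k1, k2)
        return self._cache[(k1, k2)]

    def set(self, k1, k2, value):
        self._backend.set(k1, k2, value)  # write-through keeps persistence
        self._cache[(k1, k2)] = value
```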
Another substantial optimization, to be implemented in the future, is taking advantage of the symmetric nature of the distance function, which allows the size of the distance matrices to be reduced by more than half.
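A sketch of this future optimization: canonicalize the key pair so each symmetric entry is stored only once.

```python
class SymmetricMatrix:
    """Stores d(a, b) once by ordering the key pair; since the distance is
    symmetric, this roughly halves the number of persisted entries."""
    def __init__(self):
        self._values = {}

    @staticmethod
    def _key(a, b):
        return (a, b) if a <= b else (b, a)

    def set(self, a, b, value):
        self._values[self._key(a, b)] = value

    def get(self, a, b):
        return self._values[self._key(a, b)]

    def __len__(self):
        return len(self._values)
```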

6.3 Performance
The hierarchical clusterer's greatest disadvantage is its high resource consumption. We have described an optimization that reduces disk access and, as a result, dramatically improves the speed of the clustering. Measurements have shown an improvement of 30
To get a notion of the memory consumption of the whole clustering framework, we used memory allocation monitoring on two versions of eM Client, with and without the clustering implemented. This is not a byte-precise approach, due to the nature of the .NET Framework, but it is sufficient for our needs. We executed the initial clustering in the first version on a dataset with 70 items. The resulting difference was close to 6 megabytes of allocated memory.
The overall speed and memory performance did not meet the author's expectations and will be the subject of further improvements. Fortunately, there is large room for enhancement, so bringing the resource consumption to product-grade standards should be achievable.
Chapter 7

Conclusion

The author decided to realize an email clustering framework as an approach to evolving the three-decades-old model of working with email. The motivation comes from the necessity of coping with the contemporary problem of email overload. This approach is very different from the well-known takes on the issue. It does not require user interaction, yet it is able to continually provide an alternative organization of the inbox that will likely correspond to how the user thinks about the clustered emails. It is also non-intrusive and therefore has a high chance of successful adoption by users.
In this work we went through the relevant concepts, algorithms and formulas that power the clustering framework. After choosing suitable techniques, we described their design and the individual features they exhibit. Afterwards, the implementation of the whole framework was described, along with the reasoning behind the decisions that were made. The implementation was realized with the aim of allowing integration into an existing email client and with an emphasis on the possibility of future expansion and improvement of the functionality. Furthermore, a convenient user interface that suits the target software's environment was devised.
Finally, testing and performance evaluation yielded relevant information, which led to several optimizations and will serve as a guide in future work on improving performance.
We can conclude that we have succeeded in realizing the goals laid out at the beginning of our work.

7.1 Future work


Although a fair amount of effort has gone into the project so far, much more is needed to get it to the state in which the author would like to see it: a mature component performing advanced, high-quality clustering, with many features further enhancing this new element of working with email. Integration with other clients is also a long-term goal. The main tasks for the future evolution of this project are:

• Optimize operations with persistent data structures: batch writing using transaction scopes, intelligent caching making use of weak references, and saving space for symmetric matrices.


• Improve the clustering speed, optimize the algorithm as much as possible, add support
for parallelization.
• Perform large-scale experiments with the model parameters using automated procedures based on optimizing quality measures, both internal and external [12].
• Investigate possibilities of automatic model parameter adjustment, parameter profiles
or user based parameter adjustments.
• Allow user based cluster modification - merge cluster, separate cluster, move items
between clusters.
• Investigate extending the user's perception of a cluster as a category, and a possible interconnection of rules with clusters.
• Work on the Fuzzy C-Means algorithm: compare it with the current algorithm, evaluate the advantages of fuzzy membership from both the user and the programmatic standpoint, add incremental support, explore the effect of external modification of the clustering on the model, and more.
Bibliography

[1] New York Times - Is Information Overload a $650 Billion Drag on the Economy?
http://bits.blogs.nytimes.com/2007/12/20/is-information-overload-a-650-billion-drag-on-the-economy/.

[2] Ron Bekkerman, Andrew McCallum, and Gary Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora.

[3] Gabor Cselle. Organizing email. Master’s thesis, ETH Zurich, 2006.

[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. The
Morgan Kaufmann Publishers, 2nd edition, 2006.

[5] Eduardo J. Spinosa, André Ponce de Leon F. de Carvalho, and João Gama. An online learning technique for coping with novelty detection and concept drift in data streams.

[6] Matteo Matteucci. Clustering - Fuzzy C-Means Clustering.


http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html.

[7] D. S. Yeung and X. Z. Wang. Improving performance of similarity-based clustering by


feature weight learning.

[8] Dunja Mladenić, Janez Brank, Marko Grobelnik, and Natasa Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models.

[9] P. Resnick. RFC 2822: Internet message format.

[10] N. Freed and N. Borenstein. RFC2045: Multipurpose internet mail extensions (mime)
part one.

[11] eM Client.
http://www.emclient.com.

[12] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques.

Appendix A

SQL Schema

ATTACH DATABASE 'clusters.dat' AS clusters;
PRAGMA clusters.auto_vacuum = full;
PRAGMA clusters.page_size = 4096;
CREATE TABLE clusters."Clusters" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "latestMailOriginationDate" TIMESTAMP,
    "color" INTEGER);
CREATE TABLE clusters."ClusterInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_parentId
        REFERENCES "Clusters" ("id") ON DELETE CASCADE,
    "instanceId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_instanceId
        REFERENCES "Instances" ("id") ON DELETE CASCADE,
    "membershipDegree" FLOAT);
CREATE INDEX clusters."idx_Parent" ON "ClusterInstances" ("parentId");
DETACH DATABASE clusters;

ATTACH DATABASE 'instances.dat' AS instances;
PRAGMA instances.auto_vacuum = full;
PRAGMA instances.page_size = 4096;
CREATE TABLE instances."MailInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "mailIdentifier" TEXT);
DETACH DATABASE instances;

ATTACH DATABASE 'instance_distances.dat' AS instance_distances;
PRAGMA instance_distances.auto_vacuum = full;
PRAGMA instance_distances.page_size = 4096;
CREATE TABLE instance_distances."InstanceDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX instance_distances."idx_Keys" ON "InstanceDistanceMatrix" ("key1", "key2");
DETACH DATABASE instance_distances;

ATTACH DATABASE 'cluster_distances.dat' AS cluster_distances;
PRAGMA cluster_distances.auto_vacuum = full;
PRAGMA cluster_distances.page_size = 4096;
CREATE TABLE cluster_distances."ClusterDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX cluster_distances."idx_Keys" ON "ClusterDistanceMatrix" ("key1", "key2");
DETACH DATABASE cluster_distances;

ATTACH DATABASE 'terms.dat' AS terms;
PRAGMA terms.auto_vacuum = full;
PRAGMA terms.page_size = 4096;
CREATE TABLE terms."Terms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "documentOccurrences" INTEGER);
CREATE INDEX terms."idx_Keys" ON "Terms" ("key1", "key2");
DETACH DATABASE terms;

ATTACH DATABASE 'document_infos.dat' AS document_infos;
PRAGMA document_infos.auto_vacuum = full;
PRAGMA document_infos.page_size = 4096;
CREATE TABLE document_infos."DocumentInfo" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "termsCount" INTEGER);
CREATE TABLE document_infos."DocumentTerms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_parentId
        REFERENCES "Clusters" ("id") ON DELETE CASCADE,
    "term" TEXT,
    "tf" FLOAT);
CREATE INDEX document_infos."idx_Parent" ON "DocumentTerms" ("parentId");
DETACH DATABASE document_infos;

ATTACH DATABASE 'idfs.dat' AS idfs;
PRAGMA idfs.auto_vacuum = full;
PRAGMA idfs.page_size = 4096;
CREATE TABLE idfs."CachedIDFs" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX idfs."idx_Keys" ON "CachedIDFs" ("key1", "key2");
DETACH DATABASE idfs;

Appendix B

Clustering Illustrations

Figure B.1: Dendrogram of the sample dataset with the subject of each email.
Appendix C

CD directory structure

The attached CD has the following directory structure:


\bin        - executable copy of eM Client with the clustering
              functionality (executable is eM Client.exe)
\clustering - source code of the clustering framework
              (project file is Clustering.csproj)
\text       - this text (PDF is grafnl.pdf, LaTeX file grafnl1.tex)
