
An Event Detection Algorithm Based on Improved STC

Li-Qing Qiu, Bin Pang, Li-Ping Zhao


State Key Lab. of Software Development Environment, Beihang University, 100083
{qiuliqing, pangbin, zhaolp}@nlsde.buaa.edu.cn

Abstract

To overcome some shortcomings of traditional event detection algorithms, we propose an event detection algorithm based on improved STC (suffix tree clustering), which detects significant events from large volumes of news and presents the main content of the events to the user as summaries. Experimental results on TDT indicate that the new algorithm is an effective document clustering algorithm.

1. Introduction

A new event is defined as a specific thing that happens at a specific time and place [1], and it may be reported consecutively by many news articles over a period. Automated detection of new events from Web documents is an open challenge in text mining.
Event detection is an unsupervised learning task. Document clustering algorithms attempt to group documents together based on their similarities; the documents that are relevant to a certain topic will hopefully be allocated to a single cluster [2]. More specifically, the clustering algorithm used in event detection is incremental. Current event detection systems are mostly based on comparing a new document to the clusters of past documents and thresholding on the similarity scores: if all the similarity scores are below a threshold, the new document is predicted to be the first story of a novel event; otherwise, it belongs to the most similar cluster.
In this paper, we focus on how to use improved STC, an incremental, $O(n)$ time algorithm that produces coherent clusters, to detect new events. Our main work includes:
(1) Improving the STC algorithm.
(2) Proposing a new algorithm based on improved STC for event detection.
The remainder of this paper is organized as follows. We review previous research in section 2. In section 3 we summarize STC and describe its problems. The proposed algorithm is presented in section 4. We then report on experimental methodology and results in section 5. Finally, we conclude the paper and discuss future plans in section 6.

2. Related work

Clustering algorithms are one of the hotspots of event detection. In addition, some researchers have proposed methods to improve performance, such as imposing a time window [2] and re-weighting named entities [3].
Yang et al. [2] proposed GAC (Group Average Clustering), a divide-and-conquer version of a group-average clustering algorithm. GAC performs AHC (Agglomerative Hierarchical Clustering), producing a hierarchically organized document clustering. AHC does not try to find the "best" clusters, but keeps merging the closest pair of objects to form clusters. With a reasonable distance measurement, the best time complexity of a practical AHC algorithm is $O(N^2)$, so AHC is typically slow when applied to large collections of Web documents.
Yang et al. [2] and Allan [4] proposed the single-pass algorithm, which is straightforward. The algorithm processes the input documents sequentially, one at a time, and grows clusters incrementally. A new document is absorbed by the most similar existing cluster if the similarity between the document and the cluster is above a preselected clustering threshold. The single-pass method is very simple and is the most popular clustering algorithm in event

Authorized licensed use limited to: SATHYABAMA UNIVERSITY. Downloaded on November 5, 2008 at 03:37 from IEEE Xplore. Restrictions apply.
detection, but it suffers from being order dependent and from a tendency to produce large clusters [5]. Moreover, it processes the input documents only one at a time, which is limiting for large collections of Web documents.
Lei et al. [6] proposed an improved incremental K-means for detecting events. To select initial cluster centers objectively, the algorithm uses a density function to initialize the cluster centers. The number of clusters is affected little by the order in which the news stories are processed. However, the initialization of cluster centers adds time and space overhead, so the method is unsuitable for on-line detection.
In terms of using the contents of documents, TF-IDF is still the dominant technique for document representation. In this method, each document is represented by a vector of weighted terms that can be either words or phrases, and the sequence order of the words or phrases is ignored [5], thus losing valuable information.

3. Summarization of STC

STC was proposed by Zamir and Etzioni [5] for clustering in their meta-search engine. It first identifies sets of documents that share common phrases by constructing a suffix tree, and then creates clusters according to these phrases. STC does not treat a document as a set of words but rather as a string, making use of proximity information between words. STC relies on a suffix tree to efficiently identify sets of documents that share common phrases, and uses this information to create clusters and to succinctly summarize their contents for users.
STC has three logical steps:
(1) Document "cleaning"
Sentence boundaries are marked and non-word tokens (such as HTML tags) are stripped. The strings of each document are transformed using a stemming algorithm.
(2) Identifying base clusters using a suffix tree
The suffix tree document model considers a document to be a set of suffix substrings; the common prefixes of the suffix substrings are selected as phrases to label the edges of a suffix tree. Figure 1 is an example of the suffix tree of the following set of documents:
Document1: Cat ate cheese.
Document2: Mouse ate cheese too.
Document3: Cat ate mouse too.

Figure 1. An instance of a suffix tree

At the same time, the suffix tree keeps the sequential order of each word in the original documents in order to display the summary in step (3). The structure can be constructed in linear time (linear in the size of the document set), and can be constructed incrementally as the documents are being read [5].
Each node of the suffix tree represents a group of documents and a phrase that is common to all of them. The label of the node represents the common phrase; the set of documents tagging the suffix-nodes that are descendants of the node represents a base cluster.
(3) Merging these base clusters into clusters
The final step is merging base clusters with a high overlap in their document sets, which allows a document to appear in more than one cluster. Clusters are scored and a label is generated for each cluster. Figure 2 is an example of a base cluster graph.

Figure 2. An example of a base cluster graph (each node is a base cluster labeled with a phrase, such as "cat ate", "ate", "cheese", "mouse", and "too", together with its documents)

Based on the above model, we have identified several key features of STC that make it particularly suitable for event detection:
(1) Incrementality: The strings associated with each document can be easily inserted into the suffix tree as soon as the document is received. In contrast,
most clustering algorithms, including AHC, cannot process document sets incrementally.
(2) Linear time: Unlike AHC, STC is a linear-time clustering algorithm (linear in the size of the document set), which is based on identifying phrases that are common to groups of documents.
(3) Overlapping clusters: STC creates overlapping clusters; that is, it allows one document to belong to more than one cluster, which is more reasonable because one document may have multiple topics.
(4) Non-exhaustive: Unlike the single-pass algorithm, which processes the input documents one at a time, STC can easily process a batch of documents by inserting the strings of each document into the suffix tree.
(5) Browsable summaries: One or more phrases are naturally selected to generate a topic summary that labels the corresponding cluster while the clusters are being built in STC. In contrast, most clustering algorithms cannot provide concise and accurate descriptions of their clusters.
However, STC has some shortcomings. Chim et al. [7] pointed out that there was no efficient measure to evaluate the quality of clusters in STC. Branson et al. [8] pointed out that it was unnecessary to present all labels to the users, because labels that were subsets of one another were commonly merged into the same cluster.
In this paper, we improve the measure used to evaluate the quality of clusters in STC. Furthermore, we improve the labels that are presented to the users.

4. Improvement of STC

We implement a modified version of STC. The input to our algorithm is a collection of documents and a set of user-specified parameters. The output is a forest of trees of clusters. The main steps are described in section 3. We make the following changes to the algorithm in section 3.
(1) For merging base clusters, STC defines a binary similarity measure between base clusters based on the overlap of their document sets. Given two base clusters $B_m$ and $B_n$, with sizes $|B_m|$ and $|B_n|$ respectively, let $|B_m \cap B_n|$ denote the number of documents common to both base clusters. The similarity of $B_m$ and $B_n$ is defined to be 1 if:
$|B_m \cap B_n| / |B_m| > 0.5$ and $|B_m \cap B_n| / |B_n| > 0.5$
Otherwise, their similarity is defined to be 0.
We found that the "and" Boolean operator is not suitable when one base cluster is a subset of the other and is less than half its size:
$B_m \subset B_n$ and $|B_m| / |B_n| < 0.5$
or
$B_n \subset B_m$ and $|B_n| / |B_m| < 0.5$
In either case one of the two ratios equals 1 while the other is below 0.5, so the "and" rule gives similarity 0 even though one cluster is contained in the other. So we change the "and" operator to the "or" operator:
$|B_m \cap B_n| / |B_m| > 0.5$ or $|B_m \cap B_n| / |B_n| > 0.5$
This is essentially identical to Yang's [10] improvement, except that we fix the threshold at 0.5 rather than leaving it as an undefined parameter.
(2) The cluster score is very important; however, Zamir et al. [5] do not describe how to score the clusters. We score the clusters using the following function:
$\mathrm{Score}(B_x) = |B_x| \cdot \min(\mathrm{Length}(\mathrm{Label}_x)) \cdot \mathrm{Weight}(doc)$
where $|B_x|$ is the number of documents in cluster $B_x$, $\min(\mathrm{Length}(\mathrm{Label}_x))$ is the minimum length of the labels of cluster $B_x$, and $\mathrm{Weight}(doc)$ is the weight of the documents that belong to $B_x$.
(3) Clusters retain at least half of the nodes, which will contain the main idea of the documents. Therefore any label or combination of labels in the merged cluster should be a good general description of the documents in the cluster. However, some labels are subsets of one another and may be merged into the same cluster. We do not want to double count them. More specifically, we choose the longest label first, and then choose the labels that are not subsets of any selected label.
(4) Chim et al. [7] proposed a novel method that maps all nodes $n$ of the common suffix tree to an $M$-dimensional space of the VSD (Vector Space Document) model ($n = 1, 2, \ldots, M$), so that each document $d$ can be represented as a feature vector of the weights of the $M$ nodes:
$d = \{w(1, d), w(2, d), \ldots, w(M, d)\}$
The document frequency of each node, $df(n)$, is defined as the number of different documents that have traversed node $n$. For example, in Figure 2, $df(a) = 2$.
Chim et al. [7] also pointed out that simply ignoring the stopwords becomes impractical in STC, because
the suffix tree model tries to keep the sequential order of the words in a document, and the same phrase or word might occur in different nodes of the suffix tree. Based on this idea, they proposed the definition of a "stopnode", which applies the idea of stopwords to the suffix tree similarity measure computation. A threshold $idf_{thd}$ of inverse document frequency ($idf$) is given to identify whether a node is a stopnode. Their experiment showed that the design is very efficient.
In our paper, we adopt the same idea of "stopnode" instead of stopwords, which also proves to be efficient.

5. Experimental results

5.1. Data preparation

We use the TDT [1] corpus, a benchmark for event detection, for our experiments. We choose the TDT2 English corpus to run experiments. The TDT2 English corpus contains news data collected daily from 6 news sources over a period of six months. Detection is an unsupervised classification task that does not involve training data, so the whole English corpus is used as evaluation data. More details of our dataset are listed in Table 1.

Table 1. Details of dataset
Time                        January - June, 1998
Number of articles          1930
Number of topics (events)   70
Average articles per event  27

5.2. Evaluation measures

The TDT project has its own evaluation plan: detection performance is characterized in terms of the probability of miss and false alarm errors, and these error probabilities are then combined into a single detection cost by assigning costs to miss and false alarm errors. However, its tasks are not consistent with ours. Thus, we choose the same evaluation metrics as Yang et al. [2]. Table 2 shows the contingency table for a cluster-event pair, where a, b, c, d are document counts in the corresponding cells. Five evaluation measures are defined: Miss, False alarm, Recall, Precision, and F-measure. To measure global performance, two averaging methods are used: the micro-average, obtained by summing the corresponding cells and then computing the five measures, and the macro-average, obtained by averaging the five measures over all events.

Table 2. Contingency table
                 in event   not in event
in cluster       a          b
not in cluster   c          d

In our paper, we use the micro-average and the macro-average of the F-measure as our evaluation measures. We compute both for each clustering result. The F-measure was originally defined by C. J. van Rijsbergen [9] as the harmonic mean of recall and precision. The measure is defined as follows if $a + b + c > 0$, otherwise undefined:
$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2a}{2a + b + c}$
where $\mathrm{Precision} = a / (a + b)$ if $a + b > 0$, otherwise undefined; $\mathrm{Recall} = a / (a + c)$ if $a + c > 0$, otherwise undefined.

5.3. Performance on the data set

To compare our approach with other algorithms, Yang et al.'s augmented GAC and the commonly used KNN algorithm are chosen as baselines. Since GAC is a hierarchical clustering method, we stop when there are k clusters left, and run re-clustering 5 times, following the recommended settings in [2]. For KNN we report the results under the best threshold.
The original STC algorithm selects the 500 highest-scoring base clusters for further cluster merging, but only the top 10 clusters are selected from the merged clusters as the final clustering result. Thus we also allowed GAC and KNN to generate 10 clusters in our experiments, to make the comparison as fair as possible.
We use Java to implement all three algorithms, and use Excel to plot all figures. The comparison of the three methods is shown in Figure 3, which shows
the performance of the three, and Figure 4, which shows the execution time of the three.

Figure 3. The performance of the three methods

Figure 4. The execution time of the three methods (Time/s)

STC clearly obtains the best results. We believe the main reason for STC's best performance is the STC model, which is the most suitable for event detection. Furthermore, STC is the fastest algorithm among the three, which is owing to the direct-inserting policy of STC.

6. Conclusions and future work

The work presented in this paper is mainly focused on improving the effectiveness of document clustering algorithms, which are the hotspot of event detection. Efficiency optimization of the algorithm has been a target of our current work. STC is a suitable algorithm for clustering in event detection, with excellent features such as linear time complexity and incrementality. In this work we have presented a new event detection algorithm based on an improved STC algorithm. We improved STC in the following aspects: an improved method of merging base clusters, a new definition of cluster score, new cluster labels, and the implementation of "stopnode". We also demonstrate substantial results using the improved STC algorithm, which shows the best performance and the shortest execution time compared to KNN and GAC.
In the course of this work, we encountered a number of interesting questions and hope to answer them in our future research. First, we are not satisfied with the threshold, although STC is not very sensitive to it. Second, we will consider improving the STC algorithm by combining it with other algorithms, which may obtain better results.
The present work can be extended in a number of important directions. One is dictated by the multi-lingual nature of TDT: the STC algorithm should be capable of dealing with documents in multiple languages. We are very interested in applying the improved STC algorithm to Chinese.

References
[1] Topic Detection and Tracking (TDT) project. Homepage: http://www.nist.gov/speech/tests/tdt/.
[2] Y. Yang and J. G. Carbonell et al. Learning Approaches for Detecting and Tracking News Events [J]. IEEE Intelligent Systems: Special Issue on Applications of Intelligent Information Retrieval, 1999, 14(4): 32-43.
[3] Y. Yang and J. Z. et al. Topic-conditioned novelty detection. In Proc. of SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[4] J. Allan. Topic Detection and Tracking: Event-based Information Organization [M]. Dordrecht: Kluwer Academic Publishers, 2002.
[5] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of SIGIR'98, University of Washington, Seattle, USA, 1998.
[6] Z. Lei and L. Wu et al. Incremental K-means Method Based on Initialisation of Cluster Centers and Its Application in News Event Detection. Journal of the China Society for Scientific. 25(3): 289-295, 2006.
[7] H. Chim and X. Deng. A New Suffix Tree Similarity Measure for Document Clustering. In Proc. of WWW'2007, Banff, Alberta, Canada, 2007.
[8] S. Branson and A. Greenberg. Clustering Web Search Results Using Suffix Tree Methods. Homepage: http://stanford.edu/lass/archive/cs/cs276a/cs276a
[9] C. J. van Rijsbergen. Information Retrieval [M]. London: Butterworths, 1979.
[10] J. W. Yang. A Chinese Web Page Clustering Algorithm Based on the Suffix Tree [J]. Wuhan University Journal of Natural Sciences. 9(5): 817-822, 2004.
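
As an illustration of step (2) in section 3, the following Python sketch enumerates the base clusters for the three example documents. It is only a naive stand-in, not the authors' Java implementation: instead of building a suffix tree it lists every prefix of every word-level suffix, which takes quadratic rather than linear time. The function name `base_clusters`, the `min_docs` parameter, and the crude cleaning step are assumptions made for this example. A real suffix tree would also fold "cat" and "cat ate" into a single node, because "cat" is always followed by "ate" in these documents.

```python
from collections import defaultdict


def base_clusters(docs, min_docs=2):
    """Map each shared phrase to the set of documents that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        # Crude "cleaning" (assumption): lowercase and drop periods.
        words = text.lower().replace(".", " ").split()
        for start in range(len(words)):                   # each suffix
            for end in range(start + 1, len(words) + 1):  # each prefix of it
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    # A base cluster is a phrase shared by at least min_docs documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


docs = {
    1: "Cat ate cheese.",
    2: "Mouse ate cheese too.",
    3: "Cat ate mouse too.",
}
for phrase, ids in sorted(base_clusters(docs).items()):
    print(phrase, sorted(ids))
```

Each printed line pairs a phrase with the documents that share it, which is the information a base cluster carries into the merging step.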

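
The change from "and" to "or" in the base cluster similarity of section 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: base clusters are modeled as sets of document ids, the names `similar` and `merge_base_clusters` are hypothetical, and merged clusters are taken to be the connected components of the base cluster graph, as in the original STC.

```python
from itertools import combinations


def similar(bm, bn, threshold=0.5, mode="or"):
    """Binary similarity between two base clusters (sets of document ids)."""
    if not bm or not bn:
        return False
    common = len(bm & bn)
    cond_m = common / len(bm) > threshold
    cond_n = common / len(bn) > threshold
    return (cond_m and cond_n) if mode == "and" else (cond_m or cond_n)


def merge_base_clusters(base, threshold=0.5, mode="or"):
    """Merge base clusters linked by similarity 1 (connected components)."""
    parent = list(range(len(base)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(base)), 2):
        if similar(base[i], base[j], threshold, mode):
            parent[find(i)] = find(j)

    merged = {}
    for i, doc_ids in enumerate(base):
        merged.setdefault(find(i), set()).update(doc_ids)
    # Largest clusters first; ties broken by document ids for stable output.
    return sorted(merged.values(), key=lambda s: (-len(s), sorted(s)))


small, large = {1}, {1, 2, 3}
print(similar(small, large, mode="and"))  # subset case rejected by "and"
print(similar(small, large, mode="or"))   # accepted by "or"
print(merge_base_clusters([small, large, {7, 8}], mode="or"))
```

With `small` a subset of `large` and less than half its size, the "and" rule keeps the two apart while the "or" rule merges them, which is the situation the modification targets.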

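
The evaluation measures of section 5.2 can be sketched in Python as below. The contingency counts are made-up numbers for illustration only, not results from the paper, and the function names are hypothetical. Skipping events whose F-measure is undefined when computing the macro-average is an assumption of this sketch; the paper does not say how such events are handled.

```python
def f_measure(a, b, c):
    """F = 2a / (2a + b + c); None when a + b + c == 0 (undefined)."""
    if a + b + c == 0:
        return None
    return 2 * a / (2 * a + b + c)


def micro_macro_f(tables):
    """tables: one (a, b, c, d) contingency tuple per cluster-event pair."""
    # Micro-average: sum the cells over all pairs, then compute F once.
    total_a = sum(t[0] for t in tables)
    total_b = sum(t[1] for t in tables)
    total_c = sum(t[2] for t in tables)
    micro = f_measure(total_a, total_b, total_c)
    # Macro-average: compute F per pair, then average (undefined ones skipped).
    scores = [f_measure(a, b, c) for a, b, c, _ in tables]
    scores = [s for s in scores if s is not None]
    macro = sum(scores) / len(scores) if scores else None
    return micro, macro


# Illustrative counts only (assumption), for two cluster-event pairs.
tables = [(8, 2, 2, 88), (1, 3, 1, 95)]
micro, macro = micro_macro_f(tables)
print(round(micro, 4), round(macro, 4))
```

The micro-average is dominated by large events because cells are summed first, while the macro-average weights every event equally, which is why both are reported.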