
FREQUENT PATTERN BASED

CLUSTERING
FREQUENT PATTERNS
 Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.
 For example, a set of items, such as milk and bread, that appears
frequently together in a transaction data set is a frequent
itemset.
 A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, is a (frequent) sequential pattern if it
occurs frequently in a shopping history database.
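As a rough illustration of what "appearing frequently together" means, the sketch below counts how many transactions contain each itemset and keeps those that reach a minimum support. The transactions, the function name `frequent_itemsets`, and the threshold are hypothetical, and the brute-force counting is for illustration only (it is not an efficient mining algorithm such as Apriori or FP-growth).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not taken from any real data set).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        # Enumerate every subset of the transaction up to max_size items.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(transaction), size):
                counts[frozenset(combo)] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


result = frequent_itemsets(transactions, min_support=3)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

With a minimum support of 3, the pair {milk, bread} qualifies as a frequent itemset, while {bread, eggs} (which occurs in only two transactions) does not.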
FREQUENT PATTERN BASED
CLUSTERING
 Frequent pattern mining can be applied to clustering,
resulting in frequent pattern based cluster analysis.
 Frequent pattern mining can lead to the discovery of
interesting associations and correlations among data
objects.
 The idea behind frequent pattern based cluster analysis is
that the frequent patterns discovered may also indicate
clusters.
 Frequent pattern based cluster analysis is well suited to
high dimensional data.
 Rather than growing the clusters dimension by
dimension, we grow sets of frequent itemsets, which
eventually lead to cluster descriptions.
 An example of frequent pattern based cluster analysis is the
clustering of text documents that contain thousands of
distinct keywords.
EXAMPLE: TEXT CLUSTERING
 Text clustering is the application of cluster analysis to
text-based documents.
Working:
 Descriptors (sets of words that describe topic matter) are
first extracted from the document.
 They are then analyzed for how frequently they occur
in the document compared to other terms.
 After that, clusters of descriptors can be identified and
auto-tagged.
 From there, the information can be used in any number
of ways.
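The first two working steps (extracting descriptors and comparing their frequency with that of other terms) can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word list, the sample sentence, and the function name `descriptors` are hypothetical, and relative term frequency is only one of several ways to score terms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "are"}


def descriptors(text, k=3):
    """Return the k most frequent non-stop-word terms with their relative frequency."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [(term, count / total) for term, count in counts.most_common(k)]


doc = "Genes and gene expression: the microarray measures expression of genes."
top_terms = descriptors(doc, k=2)
print(top_terms)
```

Documents whose top descriptors overlap could then be grouped and tagged with those descriptors.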
EXAMPLE: TEXT CLUSTERING
 Google’s search engine is probably the best and most
widely known example.
 When you search for a term on Google, it pulls up pages
that apply to that term.
 How can Google analyze billions of web pages to deliver
an accurate and fast result?
 Text clustering helps: the unstructured data from web
pages is broken down and turned into a matrix model, and
pages are tagged with keywords that are then used in
search results.
FREQUENT PATTERN BASED
CLUSTERING
 There are two forms of frequent pattern based cluster
analysis:
1. Frequent term based text clustering
2. Clustering by pattern similarity in microarray data
analysis
1. FREQUENT TERM BASED TEXT
CLUSTERING
 In frequent term based text clustering, text documents are
clustered based on the frequent terms they contain. Examples include
processing word documents, HTML tags, etc.
 A stemming algorithm is applied to reduce each term to its basic stem. In
this way, each document can be represented as a set of terms.
 If each term is treated as a dimension, the dimension space can be
referred to as a vector space in which each document is represented by
a term vector.
 A well selected subset of the set of all frequent itemsets can be
considered as a clustering.
 An advantage of frequent term based text clustering is that it
automatically generates a description for the generated clusters in terms
of their frequent term sets.
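The idea that a subset of frequent term sets can serve as a clustering, with each term set doubling as the cluster description, can be sketched as below. The documents and their stemmed terms are hypothetical, and the greedy selection rule (pick the term set that covers the most still-unassigned documents) is a simplifying assumption made for this sketch; it is not the selection criterion of any particular published algorithm.

```python
from itertools import combinations

# Hypothetical documents, each already reduced to a set of stemmed terms.
docs = {
    "d1": {"gene", "express", "cluster"},
    "d2": {"gene", "express", "array"},
    "d3": {"stem", "suffix", "word"},
    "d4": {"stem", "word", "term"},
    "d5": {"gene", "cluster", "array"},
}


def cover(term_set, docs):
    """Documents that contain every term of term_set."""
    return {d for d, terms in docs.items() if term_set <= terms}


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set (up to max_size terms) to its cover."""
    vocab = sorted(set().union(*docs.values()))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            term_set = frozenset(combo)
            covered = cover(term_set, docs)
            if len(covered) >= min_support:
                found[term_set] = covered
    return found


def greedy_clustering(docs, min_support):
    """Greedily pick frequent term sets until every coverable document is assigned."""
    candidates = frequent_term_sets(docs, min_support)
    uncovered = set(docs)
    clusters = []
    while uncovered and candidates:
        # Prefer more newly covered documents, then the longer (more specific) term set.
        best = max(
            candidates,
            key=lambda ts: (len(candidates[ts] & uncovered), len(ts), sorted(ts)),
        )
        newly_covered = candidates.pop(best) & uncovered
        if not newly_covered:
            break
        clusters.append((sorted(best), sorted(newly_covered)))
        uncovered -= newly_covered
    return clusters


clusters = greedy_clustering(docs, min_support=2)
for description, members in clusters:
    print(description, "->", members)
```

Each cluster comes with its own description: the frequent term set that selected it.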
A stemming algorithm is a process of linguistic
normalization, in which the variant forms of a
word are reduced to a common form. For example:
connection, connections, connective, connected, connecting --> connect
It is important to appreciate that we use stemming
with the intention of improving the performance of
information retrieval (IR) systems.
It is not an exercise in etymology or grammar. In fact, from
an etymological (relating to the origin and historical
development of words and their meanings) or
grammatical viewpoint, a stemming algorithm is liable to
make many mistakes.

In addition, stemming algorithms - at least the ones
presented here - are applicable to the written, not the
spoken, form of the language.
For some of the world's languages, Chinese for example, the
concept of stemming is not applicable, but it is certainly
meaningful for the many languages of the Indo-European group.
In these languages words tend to be constant at the front, and to
vary at the end:
connect + -ion, -ions, -ive, -ed, -ing

The variable part is the ending, or suffix. Taking these endings
off is called suffix stripping or stemming, and the residual part is
called the stem.
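Suffix stripping for the "connect" example can be sketched in a few lines. This is a deliberately naive illustration, not a real stemming algorithm: the suffix list, the minimum stem length, and the function name `strip_suffix` are assumptions made for this example, and a production stemmer applies many more rules and conditions.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ["ions", "ion", "ive", "ing", "ed", "s"]


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["Connection", "Connections", "Connective", "Connected", "connecting"]:
    print(w, "->", strip_suffix(w))
```

The minimum stem length keeps short words such as "sing" intact, but a rule this crude still makes mistakes of the kind noted above (for example, it reduces "nation" to "nat").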
2. PCLUSTER
 Another approach for clustering high dimensional data is
based on pattern similarity among the objects on a subset
of dimensions.
 The pCluster method performs clustering by pattern
similarity in microarray data analysis.
 An example is DNA microarray analysis.


DNA MICROARRAY ANALYSIS
 A microarray is a laboratory tool used to detect the
expression of thousands of genes at the same time.
 DNA microarrays are microscope slides that are printed
with thousands of tiny spots in defined positions, with
each spot containing a known DNA sequence or gene.
 Scientists know that a mutation - or alteration - in a
particular gene's DNA may contribute to a certain
disease.
 However, it can be very difficult to develop a test to
detect these mutations, because most large genes have
many regions where mutations can occur.
 For example, researchers believe that mutations in the
genes BRCA1 and BRCA2 cause as many as 60 percent
of all cases of hereditary breast and ovarian cancers.
 But there is not one specific mutation responsible for all
of these cases.
 Researchers have already discovered over 800 different
mutations in BRCA1 alone.
 The DNA microarray is a tool used to determine
whether the DNA from a particular individual contains a
mutation in genes like BRCA1 and BRCA2.
 The chip consists of a small glass plate encased in
plastic. Some companies manufacture microarrays using
methods similar to those used to make computer
microchips.
 On the surface, each chip contains thousands of short,
synthetic, single-stranded DNA sequences, which
together add up to the normal gene in question, and to
variants (mutations) of that gene that have been found in
the human population.
 Under the pCluster model, two objects are similar if they
exhibit a coherent pattern on a subset of dimensions.
 Although the magnitude of their expression levels may
not be close, the pattern they exhibit can be very much
alike.
 The pCluster model, though developed in the study of
microarray data cluster analysis, can be applied to many
other applications that require finding similar or coherent
patterns involving a subset of numerical dimensions in
large high dimensional data sets.
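One way to make "coherent pattern" concrete is the pScore used in the pCluster literature; this definition is not stated in the slides above and is added here as an assumption. For two objects x and y and two dimensions a and b, pScore = |(x_a - x_b) - (y_a - y_b)|, and a set of objects on a set of dimensions is treated as a pCluster when every such 2 x 2 pScore is at most a threshold delta. The expression values and function names below are hypothetical.

```python
from itertools import combinations


def p_score(x_a, x_b, y_a, y_b):
    """pScore of the 2 x 2 submatrix formed by objects x, y on dimensions a, b."""
    return abs((x_a - x_b) - (y_a - y_b))


def is_delta_pcluster(rows, delta):
    """True if every pair of objects on every pair of dimensions has pScore <= delta.

    rows holds the objects already restricted to the chosen subset of dimensions.
    """
    n_dims = len(rows[0])
    for x, y in combinations(rows, 2):
        for a, b in combinations(range(n_dims), 2):
            if p_score(x[a], x[b], y[a], y[b]) > delta:
                return False
    return True


# Hypothetical expression levels on three conditions.
gene_1 = [10.0, 20.0, 15.0]
gene_2 = [110.0, 120.0, 115.0]  # same shape as gene_1, shifted by 100
gene_3 = [10.0, 15.0, 20.0]     # different shape

print(is_delta_pcluster([gene_1, gene_2], delta=0.0))
print(is_delta_pcluster([gene_1, gene_2, gene_3], delta=1.0))
```

Here gene_1 and gene_2 differ greatly in magnitude yet rise and fall together, so they pass the check, while adding gene_3 breaks the pattern.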
