
Topic 4

Data Mining for Business Intelligence

Part 5: Clustering Pattern


ISP642 – BUSINESS INTELLIGENCE
Learning Objectives
 Learn about the clustering pattern
Data Mining Patterns: An Overview

DATA MINING
 PREDICTION — Supervised Learning Method
 ASSOCIATION — Unsupervised Learning Method
 CLUSTERING — Unsupervised Learning Method
Pattern: Clustering
 Video

Data Mining       | Learning Method | Popular Algorithms
Prediction        | Supervised      | Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification    | Supervised      | Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression        | Supervised      | Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association       | Unsupervised    | Apriori, OneR, ZeroR, Eclat
Link analysis     | Unsupervised    | Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis | Unsupervised    | Apriori Algorithm, FP-Growth technique
Clustering        | Unsupervised    | K-means, ANN/SOM
Outlier analysis  | Unsupervised    | K-means, Expectation Maximization (EM)

Pattern: Clusters
 Segmentation
 Identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments
based on their demographics and past purchase behaviour.
 Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to
identify market segments or consumer affinities.
What is clustering?
 Clustering is the classification of objects into
different groups, or
 the partitioning of a data set into subsets (clusters),
so that the data in each subset (ideally) share some
common trait - often according to some defined
distance measure.
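A "defined distance measure" makes the shared trait concrete. A minimal sketch, assuming Euclidean distance over numeric attribute vectors (the measure used by k-means later in this part); the customer attributes in the comment are invented for illustration:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length attribute vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two customers described by, say, (age score, spend score) are
# "similar" when this distance is small.
print(euclidean((1.0, 1.0), (5.0, 7.0)))
```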
Clustering
 Grouping data into categories as suggested by the
data’s own patterns
 Maximize similarity within a category
 Minimize similarity between categories
Clustering
 groups data that share similar trends and patterns.
 large sets of data are grouped into clusters of smaller sets of
similar data.
Methods in Clustering Mining
Partitioning method
• Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
• Typical methods: k-means, k-medoids, CLARANS

Hierarchical method
• Create a hierarchical decomposition of the set of data (or
objects) using some criterion
• Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Density-based method
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue

Grid-based method
• Based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE

Model-based method
• A model is hypothesized for each of the clusters, and the
algorithm tries to find the best fit of that model to each cluster
• Typical methods: EM, SOM, COBWEB
K-Means Clustering
 an algorithm to classify or to group your objects,
based on attributes or features, into k groups
 k is a predetermined number of clusters.
 The grouping is done by minimizing the sum of
squared distances between each data point and its
corresponding cluster centroid.
 Thus, the purpose of k-means clustering is to
classify the data.
K-Means Clustering: Unsupervised
 Each object, represented by a point in attribute space, is an
example to the algorithm; each data point (customer,
event, object) is assigned to the cluster whose centre
(centroid) is nearest.
 called “unsupervised learning”
 because the algorithm classifies the objects
automatically, based only on the criteria that we give.
 The learning process depends on the training
examples fed into the algorithm.
What is a natural grouping among
these objects?

Clustering is subjective

Simpson's Family · School Employees · Females · Males


Algorithm Steps
Initialization step: Choose the number of clusters (the
value of k).
1. Step 1: Randomly generate k points as the
initial cluster centres.
2. Step 2: Assign each point to the nearest cluster
centre.
3. Step 3: Recompute the new cluster centres.
Repetition step: Repeat Step 2 and Step 3 until some
convergence criterion is met and the assignment of
points to clusters becomes stable.
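The steps above can be sketched in plain Python — a minimal illustration, assuming 2-D points and Euclidean distance (`math.dist`, Python 3.8+); the six demo points are invented:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly pick k points as the initial cluster centres
    centres = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest cluster centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        # Step 3: recompute each centre as the mean of its cluster
        new_centres = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        # Repetition step: stop once the centres (and assignments) are stable
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, clusters = kmeans(points, k=2)
print(centres)
```

Because the initial centres are chosen at random, different seeds can yield different final clusters on less clearly separated data.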
K-Means Clustering

Step 1 Step 2 Step 3


K-Means Clustering: Example
Step 1:
 Initialization: We randomly choose the following two centroids
(k=2) for the two clusters.
 In this case the two centroids are m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
 Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
 Their new centroids are recomputed as the means of the
points in each cluster.
Step 3:
 Now, using these centroids, we
compute the Euclidean distance of
each object, as shown in the table.
 Therefore, the new clusters are:
{1,2} and {3,4,5,6,7}
 The next centroids are:
m1=(1.25,1.5) and m2=(3.9,5.1)
Step 4:
 The clusters obtained are:
{1,2} and {3,4,5,6,7}
 Therefore, there is no change
in the clusters.
 Thus, the algorithm comes to
a halt here, and the final result
consists of 2 clusters: {1,2} and
{3,4,5,6,7}.
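The centroid update in each step is just a coordinate-wise mean. A quick arithmetic check — the seven data points themselves are not listed on the slide, so the points below are hypothetical values chosen to be consistent with the final centroids m1=(1.25,1.5) and m2=(3.9,5.1) shown above:

```python
def centroid(points):
    # Coordinate-wise mean of a list of 2-D points
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Hypothetical points: objects 1-2 in one cluster, objects 3-7 in the other
cluster1 = [(1.0, 1.0), (1.5, 2.0)]
cluster2 = [(3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

print(centroid(cluster1))  # (1.25, 1.5)
print(centroid(cluster2))  # (3.9, 5.1)
```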
[Figure: example run with K=3 — plots of Step 1 and Step 2]
Disadvantages of K-Mean Clustering

 When the number of data points is small, the initial
grouping will determine the clusters significantly.
 The number of clusters, k, must be determined
beforehand. K-means also does not yield the same
result with each run, since the resulting clusters depend
on the initial random assignments.
 We never know the real clusters: with the same data,
input in a different order, it may produce different
clusters when the number of data points is small.
Clustering: Video

 Cluster Analysis for Data Mining
 K-Means
Pattern: Outlier Analysis
Data Mining      | Learning Method | Popular Algorithms
Outlier analysis | Unsupervised    | K-means, Expectation Maximization (EM)
Outlier Analysis
 A data point that is far outside the norm for a
variable or population.
 Outlier analysis has numerous applications in a


wide variety of domains such as the financial
industry, quality control, fault diagnosis, intrusion
detection, web analytics, and medical diagnosis.
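A very simple way to flag a point "far outside the norm" is a z-score threshold — a minimal sketch, not one of the EM or k-means approaches named in the table; the 3-standard-deviation cut-off is a common rule of thumb, and the sensor readings are invented:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Invented quality-control readings: eleven normal values and one fault
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2,
            20.1, 19.7, 20.4, 20.0, 19.9, 35.0]
print(zscore_outliers(readings))  # [35.0]
```

Note that a single extreme value also inflates the standard deviation, so on very small samples it can mask itself; robust variants use the median instead of the mean.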
Causes of Outliers
 Poor data quality / contamination
 Low quality measurements, malfunctioning

equipment, manual error


 Correct but exceptional data
Data Errors
 Outliers are often caused by human error, such as
errors in data collection, recording, or entry.
Application Areas
 Quality Control Applications
 Financial Applications
 Web Log Analytics
 Medical Applications
 Text and Social Media Applications
 Earth Science Applications
Text and Social Media
 Text and social media applications are
extremely common because of the ubiquity of
text data in social interactions such as email,
the web, and blogs.

Application (Noisy and Spam Links) - Given a social


network with content at the nodes, determine the noisy
and spam links with the use of structure and content
information.
Earth Science
 Outlier detection is used in numerous weather,
climate, or vegetation cover applications, where
anomalous regions are detected in spatial data either
at a single snapshot or over time. Therefore, many of
these applications are spatial or spatiotemporal in
nature.

 For example, sea surface temperatures are often


tracked in order to determine significant and
anomalous weather patterns.
END OF PART 5
END OF TOPIC

 THANK YOU FOR YOUR ATTENTION

