
Topic 4

Data Mining for Business Intelligence

Part 5: Clustering Pattern


ISP642 – BUSINESS INTELLIGENCE
Learning Objectives
 Learn about the clustering pattern
Data Mining Patterns: An Overview

DATA MINING
 PREDICTION — Supervised Learning Method
 ASSOCIATION — Unsupervised Learning Method
 CLUSTERING — Unsupervised Learning Method
Pattern: Clustering
 Video

Data Mining       | Learning Method | Popular Algorithms
Prediction        | Supervised      | Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification    | Supervised      | Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression        | Supervised      | Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association       | Unsupervised    | Apriori, OneR, ZeroR, Eclat
Link analysis     | Unsupervised    | Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis | Unsupervised    | Apriori Algorithm, FP-Growth technique
Clustering        | Unsupervised    | K-means, ANN/SOM
Outlier analysis  | Unsupervised    | K-means, Expectation Maximization (EM)

Pattern: Clusters
 Segmentation
 Identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments
based on their demographics and past purchase behaviour.
 Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to
identify market segments or consumer affinities.
What is clustering?
 Clustering is the classification of objects into
different groups, or
 the partitioning of a data set into subsets (clusters),
so that the data in each subset (ideally) share some
common trait - often according to some defined
distance measure.
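A "defined distance measure" makes the shared trait concrete. A minimal sketch, assuming Euclidean distance over numeric attribute vectors (the measure used by k-means later in this part); the customer attributes in the comment are invented for illustration:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length attribute vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two customers described by, say, (age score, spend score) are
# "similar" when this distance is small.
print(euclidean((1.0, 1.0), (5.0, 7.0)))
```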
Clustering
 Grouping data into categories as suggested by the
data’s own patterns
 Maximize similarity within a category
 Minimize similarity between categories
Clustering
 groups data that share similar trends and patterns.
 large sets of data are grouped into clusters of smaller sets of
similar data.
Methods in Clustering Mining
Partitioning method
• Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
• Typical methods: k-means, k-medoids, CLARANS

Hierarchical method
• Create a hierarchical decomposition of the set of data (or
objects) using some criterion
• Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Density-based method
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue

Grid-based method
• Based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE

Model-based method
• A model is hypothesized for each of the clusters, and the
algorithm tries to find the best fit of that model to each cluster
• Typical methods: EM, SOM, COBWEB
K-Means Clustering
 an algorithm to classify or to group your objects,
based on attributes or features, into k groups
 k is a predetermined number of clusters.
 The grouping is done by minimizing the sum of
squared distances between each data point and its
corresponding cluster centroid.
 Thus, the purpose of k-means clustering is to
classify the data.
K-Means Clustering: Unsupervised
 Each object, represented by a point in attribute space, is an
example to the algorithm; each data point (customer,
event, object) is assigned to the cluster whose centre
(centroid) is nearest.
 called “unsupervised learning”
 because the algorithm classifies the objects
automatically, based only on the criteria that we give.
 The learning process depends on the training
examples fed into the algorithm.
What is a natural grouping among
these objects?

Clustering is subjective

Simpson's Family · School Employees · Females · Males


Algorithm Steps
Initialization step: Choose the number of clusters (the
value of k).
1. Step 1: Randomly generate k points as the
initial cluster centres.
2. Step 2: Assign each point to the nearest cluster
centre.
3. Step 3: Recompute the new cluster centres.
Repetition step: Repeat Step 2 and Step 3 until some
convergence criterion is met and the assignment of
points to clusters becomes stable.
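The steps above can be sketched in plain Python — a minimal illustration, assuming 2-D points and Euclidean distance (`math.dist`, Python 3.8+); the six demo points are invented:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly pick k points as the initial cluster centres
    centres = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest cluster centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        # Step 3: recompute each centre as the mean of its cluster
        new_centres = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        # Repetition step: stop once the centres (and assignments) are stable
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, clusters = kmeans(points, k=2)
print(centres)
```

Because the initial centres are chosen at random, different seeds can yield different final clusters on less clearly separated data.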
K-Means Clustering

Step 1 Step 2 Step 3


K-Means Clustering: Example
Step 1:
 Initialization: We randomly choose the following two centroids
(k=2) for the two clusters.
 In this case the two centroids are m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
 Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
 Their new centroids are recomputed as the means of the
points in each cluster.
Step 3:
 Now, using these centroids, we
compute the Euclidean distance of
each object, as shown in the table.
 Therefore, the new clusters are:
{1,2} and {3,4,5,6,7}
 The next centroids are:
m1=(1.25,1.5) and m2=(3.9,5.1)
Step 4:
 The clusters obtained are:
{1,2} and {3,4,5,6,7}
 Therefore, there is no change
in the clusters.
 Thus, the algorithm comes to
a halt here, and the final result
consists of 2 clusters: {1,2} and
{3,4,5,6,7}.
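The centroid update in each step is just a coordinate-wise mean. A quick arithmetic check — the seven data points themselves are not listed on the slide, so the points below are hypothetical values chosen to be consistent with the final centroids m1=(1.25,1.5) and m2=(3.9,5.1) shown above:

```python
def centroid(points):
    # Coordinate-wise mean of a list of 2-D points
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Hypothetical points: objects 1-2 in one cluster, objects 3-7 in the other
cluster1 = [(1.0, 1.0), (1.5, 2.0)]
cluster2 = [(3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

print(centroid(cluster1))  # (1.25, 1.5)
print(centroid(cluster2))  # (3.9, 5.1)
```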
[Figure: example run with K=3 — plots of Step 1 and Step 2]
Disadvantages of K-Mean Clustering

 When the number of data points is small, the initial
grouping will determine the clusters significantly.
 The number of clusters, k, must be determined
beforehand. K-means also does not yield the same
result with each run, since the resulting clusters depend
on the initial random assignments.
 We never know the real clusters: with the same data,
input in a different order, it may produce different
clusters when the number of data points is small.
Clustering: Video

 Cluster Analysis for Data Mining
 K-Means
Pattern: Outlier Analysis
Data Mining      | Learning Method | Popular Algorithms
Outlier analysis | Unsupervised    | K-means, Expectation Maximization (EM)
Outlier Analysis
 A data point that is far outside the norm for a
variable or population.
 Outlier analysis has numerous applications in a


wide variety of domains such as the financial
industry, quality control, fault diagnosis, intrusion
detection, web analytics, and medical diagnosis.
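A very simple way to flag a point "far outside the norm" is a z-score threshold — a minimal sketch, not one of the EM or k-means approaches named in the table; the 3-standard-deviation cut-off is a common rule of thumb, and the sensor readings are invented:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Invented quality-control readings: eleven normal values and one fault
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2,
            20.1, 19.7, 20.4, 20.0, 19.9, 35.0]
print(zscore_outliers(readings))  # [35.0]
```

Note that a single extreme value also inflates the standard deviation, so on very small samples it can mask itself; robust variants use the median instead of the mean.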
Causes of Outliers
 Poor data quality / contamination
 Low quality measurements, malfunctioning

equipment, manual error


 Correct but exceptional data
Data Errors
 Outliers are often caused by human error, such as
errors in data collection, recording, or entry.
Application Areas
 Quality Control Applications
 Financial Applications
 Web Log Analytics
 Medical Applications
 Text and Social Media Applications
 Earth Science Applications
Text and Social Media
 Text and social media applications are
extremely common because of the ubiquity of
text data in social interactions such as email,
the web, and blogs.

Application (Noisy and Spam Links) - Given a social


network with content at the nodes, determine the noisy
and spam links with the use of structure and content
information.
Earth Science
 Outlier detection is used in numerous weather,
climate, or vegetation cover applications, where
anomalous regions are detected in spatial data either
at a single snapshot or over time. Therefore, many of
these applications are spatial or spatiotemporal in
nature.

 For example, sea surface temperatures are often


tracked in order to determine significant and
anomalous weather patterns.
END OF PART 5
END OF TOPIC

 THANK YOU FOR YOUR ATTENTION

