
# Quiz

## What is the most natural (non-autonomous, e.g. breathing) thing done by human beings?

How often does the average human do it?

Clustering
With your host, the self-appointed King of Clustering, Kai Larsen

Cluster Analysis

Source: http://www.vias.org/science_cartoons/cluster_analysis.html

http://www.abdn.ac.uk/zoologymuseum/images/kingdoms.jpg

## Can we use this information?

Writing Skills

English Majors

Salary

Unsupervised Classification
Training Data
case 1: inputs, ?
case 2: inputs, ?
case 3: inputs, ?
case 4: inputs, ?
case 5: inputs, ?

new case

Training Data
case 1: inputs, cluster 1
case 2: inputs, cluster 3
case 3: inputs, cluster 2
case 4: inputs, cluster 1
case 5: inputs, cluster 2
new case

What:

## Classification with an unknown target

The number of classes is unknown

## Increase between-class distance, decrease within-class distance

Why:

Description
For example, segmenting existing customers into groups and associating a
distinct profile with each group could help future marketing strategies.
From the Internet: There are three customer types, each of which needs to
be sold to very differently. These are: the Financier, the Techie, and the
User.
From Kai: There are two kinds of students, those with BI experience, and
those without.
Caveat:

There is no guarantee that the resulting clusters will be meaningful or useful. You
have to carefully consider them.

## Two basic types of cluster analysis

K-means (iterative)
Hierarchical (one-shot)

k-means Clustering

Assignment


Reassignment
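
The assignment and reassignment steps can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the slides: the data points, the starting centroids, and the function name `kmeans` are all hypothetical, and distances are assumed to be Euclidean.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points; `centroids` holds the starting guesses."""
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Reassignment: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical cases (not from the slides): two loose groups.
cases = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(cases, centroids=[(0.0, 0.0), (10.0, 10.0)])

# A new case is placed in the cluster whose centroid is nearest.
new_case = (7.5, 8.0)
new_label = min(range(len(centroids)), key=lambda j: math.dist(new_case, centroids[j]))
print(labels, new_label)
```

The loop stops as soon as a reassignment leaves every centroid where it was, which is why k-means is described as iterative.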


## Example K-means Clustering

Andromeda Galaxy
Source: www.freewebs.com/bnip1/andromedakmeans.htm

Euclidean Distance

(U2,V2)
(U1,V1)
$L_2 = \left((U_1 - U_2)^2 + (V_1 - V_2)^2\right)^{1/2}$
(generally leads to spherical clusters)
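
As a small sketch, the Euclidean distance between two points (U1, V1) and (U2, V2) could be computed as follows. The function name `euclidean` and the sample points are illustrative only.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two 2-D points."""
    (u1, v1), (u2, v2) = p, q
    return math.sqrt((u1 - u2) ** 2 + (v1 - v2) ** 2)

print(euclidean((0, 0), (3, 4)))  # 5.0
```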

Hierarchical

## Create a table with all distances between people or cases

| | Red1 | Red2 | Red3 | Red4 |
|---|---|---|---|---|
| Red1 | | 1.12 | 0.5 | 2.7 |
| Red2 | 1.12 | | | |
| Red3 | 0.5 | | | 2.24 |
| Red4 | 2.7 | | 2.24 | |

## Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases
We get the following table of distances:

| | Red1/3 | Red2 | Red4 |
|---|---|---|---|
| Red1/3 | | 1.03 | 2.46 |
| Red2 | 1.03 | | |
| Red4 | 2.46 | | |

## Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases
We get the following table of distances:

| | Red1/2/3 | Red4 |
|---|---|---|
| Red1/2/3 | | 2.28 |
| Red4 | 2.28 | |

## Now, starting with the shortest distances between dots, we cluster items.

Hierarchical
Create a table with all distances between people or cases
We get the following table of distances:

| | Red1/2/3/4 |
|---|---|
| Red1/2/3/4 | |

## Now, starting with the shortest distances between dots, we cluster items.

Result
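
The merge sequence walked through in the hierarchical slides can be sketched in Python. Several things here are assumptions rather than facts from the slides: the coordinates are hypothetical, picked so that the pairwise Euclidean distances come out close to the tabulated ones, and the sketch measures the distance between clusters from their centroids, which the slides do not state explicitly.

```python
import math
from itertools import combinations

# Hypothetical coordinates (not given in the slides).
points = {
    "Red1": (0.0, 0.0),
    "Red2": (0.5, 1.0),
    "Red3": (0.5, 0.0),
    "Red4": (2.5, 1.0),
}

def centroid(members):
    """Mean position of the named points in a cluster."""
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def cluster_distance(a, b):
    """Assumed linkage: Euclidean distance between cluster centroids."""
    return math.dist(centroid(a), centroid(b))

def agglomerate():
    clusters = [(name,) for name in points]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, round(cluster_distance(a, b), 2)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

merges = agglomerate()
for a, b, d in merges:
    print("/".join(a), "+", "/".join(b), "at distance", d)
```

The list of merges, read in order with their distances, is the information a dendrogram draws: each merge becomes a join at the height of its distance.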


Manhattan Distance

(U2,V2)
(U1,V1)
$L_1 = |U_1 - U_2| + |V_1 - V_2|$
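
A minimal sketch of the Manhattan distance, together with a table of all pairwise distances, might look like this. The dot names and coordinates are hypothetical and are not the red dots from the slides.

```python
from itertools import combinations

def manhattan(p, q):
    """City-block (L1) distance between two 2-D points."""
    (u1, v1), (u2, v2) = p, q
    return abs(u1 - u2) + abs(v1 - v2)

# Hypothetical dots, for illustration only.
dots = {"A": (0, 0), "B": (1, 2), "C": (4, 1)}

# One entry per unordered pair of dots.
table = {(a, b): manhattan(dots[a], dots[b]) for a, b in combinations(dots, 2)}
for pair, d in table.items():
    print(pair, d)
```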


In teams of two
1. Using Manhattan Distance, create a table with all distances between red dots
2. Create a dendrogram


Tribe Movement


Tribe Creation


## Flow Clustering Example


Source: http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering

## Ancient Chinese Classification of Animals:

"Animals are divided into:
a) those that belong to the Emperor
b) embalmed ones
c) those that are trained
d) suckling pigs
e) mermaids
f) fabulous ones
g) stray dogs
h) those that are included in this classification
i) those that tremble as if they were mad
j) innumerable ones
k) those drawn with a very fine camel's hair brush
l) others
m) those that have just broken a flower vase
n) those that resemble flies from a distance."
from Other Inquisitions: 1937-1952 by Jorge Luis Borges

## For the Marketing Buffs

(a quick intro)

Association Rules

Transactions:
A B C
A C D
B C D
B C E

| Rule | Support | Confidence |
|---|---|---|
| A ⇒ D | 2/5 (.40) | 2/3 (.67) |
| C ⇒ A | 2/5 (.40) | 2/4 (.50) |
| A ⇒ C | 2/5 (.40) | 2/3 (.67) |
| B & C ⇒ D | 1/5 (.20) | 1/3 (.33) |

Support: probability that two items co-occur.
Support of A ⇒ D = (# transactions with both A and D) / (all transactions)

Confidence: conditional probability that a transaction contains D, given that it contains A.
Confidence of A ⇒ D = (# transactions with both A and D) / (# transactions with A)
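
The support and confidence definitions can be checked with a short Python sketch. Only four baskets are legible in the slide text, so the fifth basket below ("A D E") is an assumption, chosen because it is consistent with the denominators on the slide. The function names `support` and `confidence` are illustrative.

```python
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"B", "C", "E"},
    {"A", "D", "E"},  # ASSUMED basket, not legible in the slide text
]

def support(lhs, rhs):
    """Share of all transactions that contain every item of lhs and rhs."""
    both = set(lhs) | set(rhs)
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Share of the transactions containing lhs that also contain rhs."""
    lhs_items = set(lhs)
    has_lhs = sum(lhs_items <= t for t in transactions)
    has_both = sum((lhs_items | set(rhs)) <= t for t in transactions)
    return has_both / has_lhs

# Each rule is written as (left-hand items, right-hand items).
for lhs, rhs in [("A", "D"), ("C", "A"), ("A", "C"), ("BC", "D")]:
    print(f"{lhs} => {rhs}: support={support(lhs, rhs):.2f}, "
          f"confidence={confidence(lhs, rhs):.2f}")
```

Note that A ⇒ C and C ⇒ A share the same support but differ in confidence, because the denominator changes from the number of transactions with A to the number with C.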


Size of box = transaction counts
Color of link = confidence level of rule
Thickness of link = confidence

Barbie ⇒ Candy
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.
Conclusions
Clustering provides another way to understand data
Its results need to jibe with human understanding
Unless we use the clusters directly for predictive analysis
Market basket analysis is now an industry standard

## Create Kaggle Account

Invite team members
Download train and test files from Kaggle
Save files as .xlsx
Import files into SQL Server
Run prediction with multiple models
Figure out which is best based on cross-validation
Use that model to predict
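
The last three steps (run several models, compare them by cross-validation, predict with the winner) can be illustrated with a small standard-library Python sketch. This is a stand-in, not the SQL Server workflow the steps describe: the toy data, the two candidate models, and all function names are hypothetical.

```python
import math

# Hypothetical toy data standing in for the Kaggle train file: (features, label).
train = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
         ((1.0, 1.1), 1), ((1.2, 0.9), 1), ((0.9, 1.0), 1), ((1.1, 1.2), 1)]

def majority_model(rows):
    """Baseline: always predict the most common label in the fitting rows."""
    labels = [y for _, y in rows]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def nearest_neighbor_model(rows):
    """Predict the label of the closest fitting row (Euclidean distance)."""
    return lambda x: min(rows, key=lambda r: math.dist(r[0], x))[1]

def cross_val_accuracy(make_model, rows, k=4):
    """k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    folds = [rows[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        fit_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = make_model(fit_rows)
        correct += sum(model(x) == y for x, y in held_out)
    return correct / len(rows)

candidates = {"majority": majority_model, "nearest_neighbor": nearest_neighbor_model}
scores = {name: cross_val_accuracy(make, train) for name, make in candidates.items()}

# Refit the best-scoring model on all training rows, then predict a test case.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name](train)
prediction = best_model((1.05, 0.95))
print(scores, best_name, prediction)
```

Scoring each model only on rows it did not see during fitting is what makes the comparison a fair estimate of how it would do on the Kaggle test file.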