Forschungskolleg: Data Analytics Methods and Techniques
Martin Hahmann,
Gunnar Schröder,
Phillip Grosse
Prof. Dr.-Ing. Wolfgang Lehner
Data Analytics
The Big Four
Classification
Association Rules
Prediction
Clustering
23.11.
Clustering
automated grouping of objects into so-called clusters
objects of the same group are similar
different groups are dissimilar
Problems
applicability
versatility
support for non-expert users
Clustering Workflow
>500
algorithm setup
result interpretation
algorithm selection
adjustment
Algorithmics
Feedback
Traditional Clustering
one clustering
partitional
density-based
hierarchical
Partitional Clustering
clusters represented by prototypes
objects are assigned to most similar prototype
similarity via distance function
k-means [Lloyd, 1957]
partitions dataset into k disjoint subsets
minimizes sum-of-squares criterion
parameters: k, seed
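The k-means procedure listed above can be sketched in a few lines. This is a minimal illustration under assumptions, not the lecture's implementation: the function names, the `max_iter` cap, and the choice of k random data points as initial prototypes are all illustrative. The parameters `k` and `seed` correspond to the two parameters named on the slide.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=100):
    """Lloyd-style k-means: returns (assignment, centroids, sum of squares)."""
    rng = random.Random(seed)
    # Assumption: initial prototypes are k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every object to its most similar (closest) prototype.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed its cluster: converged
        assignment = new_assignment
        # Move each prototype to the mean of its assigned objects.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Sum-of-squares criterion that the iteration tries to minimize.
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse
```

Different values of `seed` can lead to different local optima of the sum-of-squares criterion, which is why the seed counts as a parameter alongside `k`.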
[Figure: k-means results for k=4 and k=7]
Density-Based Clustering
clusters are modelled as dense areas separated by less dense areas
density of an object = number of objects satisfying a similarity threshold
DENCLUE [Hinneburg, 1998]
density via gaussian kernels
cluster extraction by hill climbing
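The two notions of density above can be illustrated with a short sketch. It makes several assumptions: all function names and the parameters `eps`, `sigma`, `tol`, and `max_iter` are illustrative, and the hill-climbing step uses a kernel-weighted mean update (mean-shift style) in place of a fixed-step gradient ascent, purely to keep the example short.

```python
import math

def neighborhood_density(points, index, eps):
    """Threshold density: number of objects within distance eps (itself included)."""
    return sum(1 for q in points if math.dist(points[index], q) <= eps)

def gaussian_density(x, points, sigma):
    """Kernel density: sum of Gaussian influences of all objects at position x."""
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points)

def hill_climb(x, points, sigma, tol=1e-6, max_iter=200):
    """Move x uphill on the kernel density until it stops moving.

    Assumption: the update is the kernel-weighted mean of all objects,
    swapped in here for a plain gradient step.
    """
    for _ in range(max_iter):
        weights = [math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points]
        total = sum(weights)
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        if math.dist(x, new_x) < tol:
            return new_x
        x = new_x
    return x
```

Objects whose climbs end at the same local density maximum would then be placed in the same cluster.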
[Figure: density-based clustering results for ε=1.0, minPts=5; ε=0.5, minPts=5; ε=0.5, minPts=20]
Hierarchical Clustering
builds a hierarchy by progressively merging / splitting clusters
clusterings are extracted by cutting the dendrogram
bottom-up, AGNES [Ward, 1963]
top-down, DIANA [Kaufman & Rousseeuw, 1990]
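The bottom-up variant can be sketched as follows. This is a naive illustration under assumptions: single linkage is chosen only for brevity (the slides do not fix a linkage criterion), and the names `agglomerative`, `num_clusters`, and `merges` are illustrative. Stopping once `num_clusters` clusters remain corresponds to cutting the dendrogram at that level.

```python
import math

def agglomerative(points, num_clusters):
    """Bottom-up merging with single linkage until num_clusters remain.

    Returns the final clusters (lists of object indices) and the merge
    history, which is the information a dendrogram would display.
    """
    clusters = [[i] for i in range(len(points))]  # start: every object alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair across clusters.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters], merges
```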
Multiple Clusterings
multi-solution clustering
multiple clusterings
traditional clustering
ensemble clustering
Multi-Solution Clustering
alternative
subspace
Alternative Clustering
generate initial clustering with traditional algorithm
utilize given knowledge to find alternative clusterings
generate a dissimilar clustering
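To judge whether a candidate is really a dissimilar clustering, some measure of disagreement between two clusterings is needed. The slides do not name one; the sketch below assumes a pair-counting measure (one minus the Rand index), and both function names are illustrative.

```python
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of object pairs on which both clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def dissimilarity(labels_a, labels_b):
    """0.0 for identical groupings; larger values mean a more different grouping."""
    return 1.0 - pair_agreement(labels_a, labels_b)
```

Because only pairs are compared, the measure does not depend on how cluster ids are named in either clustering.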
[Figure: initial clustering, alternative 1, alternative 2]
Alternative Clustering: COALA [Bae & Bailey, 2006]
Subspace Clustering
high dimensional data
clusters can be observed in arbitrary attribute combinations (subspaces)
detect multiple clusterings in different subspaces
FIRES [Kriegel et al., 2005]
stepwise merge of base clusters (1D)
max. dimensional subspace clusters
DENSEST [Müller et al., 2009]
estimation of dense subspaces
correlation based (2D histograms)
Ensemble Clustering
multiple base clusterings
one consensus clustering
pairwise similarity
cluster ids
General Idea
generate multiple traditional clusterings
employ different algorithms / parameters to build the ensemble
combine ensemble to single robust consensus clustering
different consensus mechanisms
[Strehl & Ghosh, 2002], [Gionis et al., 2007]
Pairwise Similarity
a pair of objects belongs to: the same or different cluster(s)
pairwise similarities are represented as co-association matrices
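A co-association matrix for a single base clustering can be derived directly from its cluster ids. The sketch below assumes a plain 0/1 encoding (1 = same cluster, 0 = different clusters); the function name is illustrative.

```python
def coassociation(labels):
    """Square matrix with entry 1 if objects i and j share a cluster, else 0."""
    n = len(labels)
    return [
        [1 if labels[i] == labels[j] else 0 for j in range(n)]
        for i in range(n)
    ]

# Three objects, the first two in one cluster and the third alone.
matrix = coassociation([0, 0, 1])
# matrix == [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

The matrix is symmetric, so only one triangle needs to be stored or displayed.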
[Figure: example co-association matrices over objects x1 to x9 with entries 0, 1 and -]
Consensus Mechanism
majority decision
pairs located in the same cluster in at least 50% of the ensemble are also part of the same cluster in the consensus clustering
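The majority decision can be sketched as follows. Two points are assumptions rather than part of the slides: pairs that pass the threshold are merged transitively (connected components via union-find), because pairwise majority votes alone need not form a consistent partition, and the name `consensus_by_majority` and its `threshold` parameter are illustrative.

```python
def consensus_by_majority(ensemble, threshold=0.5):
    """Combine base clusterings (lists of cluster ids) into one consensus labeling."""
    n = len(ensemble[0])
    parent = list(range(n))  # union-find forest over the objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Fraction of base clusterings that place i and j together.
            together = sum(labels[i] == labels[j] for labels in ensemble)
            if together / len(ensemble) >= threshold:
                parent[find(i)] = find(j)  # assumption: merge transitively

    # Relabel the components as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

ensemble = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
]
consensus = consensus_by_majority(ensemble)  # [0, 0, 0, 1, 1]
```

In this example the pair (x2, x3) is together in exactly 2 of 4 base clusterings, so it just meets the 50% threshold.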
[Figure: aggregated co-association matrix over objects x1 to x9 with entries such as 1/4, 2/4, 3/4 and 4/4]
more consensus mechanisms exist
e.g. graph partitioning via METIS [Karypis et al., 1997]
# vector of size 1
# vector of size 3
# vector of size 10
# vector of size 1
Vector output
> x
[1] 1 2 3 4 5 6 7 8 9 10
> sprintf("y is %s", y)
[Figure: R integration architecture with the components Database, R Client, Rserve [Urbanek03], RICE, SHM Manager and SHM; labelled steps include TCP/IP, fork R process, pass R script, write data and access data]
R scripts in databases
-- Create some dummy data
CREATE INSERT ONLY COLUMN TABLE Test1(A INTEGER, B DOUBLE);
INSERT INTO Test1 VALUES(0, 2.5);
INSERT INTO Test1 VALUES(1, 1.5);
INSERT INTO Test1 VALUES(2, 2.0);
[Figure: calcModel with R-Op over Dataset; values shown: 2.5, 0.0, 1.5, 1.5, 2.0, 4.0]
parallel R operators
-- call ROP in parallel
DROP PROCEDURE testPartition1;
CREATE PROCEDURE testPartition1 (OUT result "ResultTable")
AS SQLSCRIPT
BEGIN
input1 = select * from Test1;
p = CE_PARTITION([@input1@],[],'''HASH 3 A''');
CALL ROP(@p$$input1@, r);
s = CE_MERGE([@r@]);
[Figure: calcModel: Dataset, Split-Op, parallel R-Ops (R-Op ... R-Op), Merge]
CALL testPartition1(?);
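The split, parallel operator, merge pattern of the procedure above can be mimicked outside the database. The sketch is only an analogy under stated assumptions: `hash_partition`, `r_op`, and `run_partitioned` are hypothetical names, `r_op` is a stand-in that multiplies the two columns (the actual R script is not shown on the slide), and threads are used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions buckets by the hash of one column."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

def r_op(rows):
    """Hypothetical stand-in for the R operator: adds a column C = A * B."""
    return [{"A": r["A"], "B": r["B"], "C": r["A"] * r["B"]} for r in rows]

def run_partitioned(rows, key, num_partitions, op):
    """Apply op to every partition concurrently, then merge the results."""
    parts = hash_partition(rows, key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(op, parts))  # one operator call per partition
    return [row for part in results for row in part]  # merge step

# Same rows as the Test1 table above, split into 3 hash partitions on A.
test1 = [{"A": 0, "B": 2.5}, {"A": 1, "B": 1.5}, {"A": 2, "B": 2.0}]
merged = run_partitioned(test1, "A", 3, r_op)
```

The operator must work on each partition independently for the merged result to match a single, unpartitioned call.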
Result Interpretation
quality measures
visualization
Quality measures
How good is the obtained result?
Problem: There is no universally valid definition of clustering quality
multiple approaches and metrics
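As one example among the many possible metrics, the sketch below computes the mean silhouette coefficient. The choice of this measure is an assumption made for illustration (the slides do not single out a metric), and the function name is illustrative.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means compact, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from object i to all others, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

A measure like this rewards compact, well-separated groups, which is only one of several possible notions of clustering quality.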
Visualization
Visualization: examples
all objects and dimensions are visualized, which causes scalability issues for large-scale data
display multiple dimensions on a 2D screen
scatterplot matrix
Visualization: Heidi Matrix [Vadapalli et al., 2009]
Feedback
adjustment
Result adjustment
algorithmic platform
visual-interactive interface
Augur
Generalized Process
based on Shneiderman's information-seeking mantra
Visualization, Interaction,
algorithmic platform based on multiple clustering solutions
Challenges
Huge amounts of data
building recommender models is expensive!
Dynamic
models age
user preferences change
new users & items
Parameterization
data and domain dependent
support automatic model maintenance
[Figure: recommender cycle with the steps Monitor User Actions, (Re-)Build Model, Evaluate Model, Optimize Model Parameters, Calculate Recommendations, Personalize User Interface]
Questions?