
Research Colloquium (Forschungskolleg)

Data Analytics Methods and Techniques

Martin Hahmann,
Gunnar Schröder,
Philipp Grosse
Prof. Dr.-Ing. Wolfgang Lehner

Why do we need it?

"We are drowning in data, but starving for knowledge!" (John Naisbitt)

(figure: 23 petabytes/second of raw data; 1 petabyte/year)

Data Analytics
The Big Four
Classification
Association Rules
Prediction
Clustering

Clustering
automated grouping of objects into so-called clusters
objects within the same group are similar
objects in different groups are dissimilar

Problems
applicability
versatility
support for non-expert users

Clustering Workflow
(workflow diagram: >500 available algorithms; steps: algorithm selection, algorithm setup, result interpretation, adjustment)

Clustering Workflow
algorithm selection: Which algorithm fits my data?
algorithm setup: Which parameters fit my data?
result interpretation: How good is the obtained result?
adjustment: How to improve result quality?
(the diagram groups these steps into the areas Algorithmics and Feedback)

Traditional Clustering
(overview: traditional clustering produces one clustering; main families are partitional, density-based and hierarchical approaches)

Traditional Clustering
Partitional Clustering
clusters are represented by prototypes
objects are assigned to the most similar prototype
similarity via a distance function

k-means [Lloyd, 1957]
partitions the dataset into k disjoint subsets
minimizes the sum-of-squares criterion
parameters: k, seed
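As a minimal sketch, this behaviour can be reproduced with base R's kmeans(); the two-blob dataset below is made up for illustration:

# k-means sketch: k and the random seed are the only parameters
set.seed(42)                                        # corresponds to the seed parameter
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(x, centers = 4, algorithm = "Lloyd")   # k = 4, Lloyd's algorithm
km$centers                                          # cluster prototypes
table(km$cluster)                                   # cluster sizes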

(figures: k-means results on the example dataset for k=4 and k=7)

ISODATA [Ball & Hall, 1965]
adjusts the number of clusters
merges, splits, and deletes clusters according to thresholds
5 additional parameters

Traditional Clustering
Density-Based Clustering
clusters are modelled as dense areas separated by less dense areas
density of an object = number of objects satisfying a similarity threshold

DBSCAN [Ester & Kriegel et al., 1996]
dense areas are formed by core objects
core object: object count in its ε-neighbourhood is at least minPts
clusters are connected via overlapping neighbourhoods
parameters: ε, minPts

DENCLUE [Hinneburg, 1998]
density estimated via Gaussian kernels
cluster extraction by hill climbing
(figures: density-based results for ε=1.0 / minPts=5, ε=0.5 / minPts=5, and ε=0.5 / minPts=20)
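A hedged sketch of the ε / minPts behaviour using the dbscan CRAN package (assumed to be installed; not the implementation behind these slides):

library(dbscan)                            # assumes the 'dbscan' package is available
set.seed(1)
x <- rbind(matrix(rnorm(200, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 2, sd = 0.3), ncol = 2))
db <- dbscan(x, eps = 0.5, minPts = 5)     # eps = radius of the ε-neighbourhood
table(db$cluster)                          # cluster 0 collects the noise points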

Traditional Clustering
Hierarchical Clustering
builds a hierarchy by progressively merging / splitting clusters
clusterings are extracted by cutting the dendrogram
bottom-up: AGNES [Ward, 1963]
top-down: DIANA [Kaufman & Rousseeuw, 1990]
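In base R, hclust() builds such a bottom-up hierarchy and cutree() cuts the dendrogram into a flat clustering; a minimal sketch on the built-in iris data:

d  <- dist(iris[, 1:4])                 # pairwise distances
hc <- hclust(d, method = "ward.D2")     # agglomerative (bottom-up) merging, Ward linkage
plot(hc)                                # dendrogram
cl <- cutree(hc, k = 3)                 # cut the dendrogram into 3 clusters
table(cl)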


Multiple Clusterings
(overview: in contrast to traditional clustering, which yields one clustering, multi-solution clustering and ensemble clustering produce multiple clusterings)

Multi-Solution Clustering
(overview: multi-solution clustering comprises alternative clustering and subspace clustering)

Multi-Solution Clustering
Alternative Clustering
generate an initial clustering with a traditional algorithm
utilize this given knowledge to find alternative clusterings
generate a clustering that is dissimilar to the initial one

(figure: initial clustering and two alternative clusterings of the same dataset)

Multi-Solution Clustering
Alternative Clustering: COALA [Bae & Bailey, 2006]
generates an alternative with the same number of clusters
high dissimilarity to the initial clustering (cannot-link constraints)
high-quality alternative (objects within a cluster remain similar)
trade-off between the two goals controlled by parameter w:
if d_quality < w * d_dissimilarity, choose the quality assignment, else the dissimilarity assignment

CAMI [Dang & Bailey, 2010]
models clusters as a Gaussian mixture
quality measured by likelihood
dissimilarity measured by mutual information

Orthogonal Clustering [Davidson & Qi, 2008]
iterative transformation of the data
use of constraints

Multi-Solution Clustering
Subspace Clustering
high-dimensional data
clusters can be observed in arbitrary attribute combinations (subspaces)
detect multiple clusterings in different subspaces

CLIQUE [Agrawal et al., 1998]
based on grid cells
finds dense cells in all subspaces

FIRES [Kriegel et al., 2005]
stepwise merging of 1D base clusters
yields maximal-dimensional subspace clusters

DENSEST [Müller et al., 2009]
estimation of dense subspaces
correlation-based (2D histograms)

Ensemble Clustering
(overview: multiple base clusterings, represented e.g. via pairwise similarities or cluster ids, are combined into one consensus clustering)

Ensemble Clustering
General Idea
generate multiple traditional clusterings
employ different algorithms / parameters to obtain an ensemble
combine the ensemble into a single, robust consensus clustering
different consensus mechanisms exist
[Strehl & Ghosh, 2002], [Gionis et al., 2007]

Ensemble Clustering
Pairwise Similarity
a pair of objects belongs either to the same cluster or to different clusters
pairwise similarities are represented as co-association matrices

(figure: 9×9 co-association matrix over the objects x1 ... x9)

Ensemble Clustering
Pairwise Similarity
simple example based on [Gionis et al., 2007]
goal: generate a consensus clustering with maximal similarity to the ensemble

(figure: four 0/1 co-association matrices over the objects x1 ... x9, one per base clustering)
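A small sketch of how such co-association matrices can be averaged in R; the four label vectors below are made up and only roughly mimic the example:

# four base clusterings of nine objects x1..x9 (illustrative labels, not the slide's)
ensemble <- list(c(1,1,1, 2,2,2, 3,3,3),
                 c(1,1,2, 2,2,3, 3,3,3),
                 c(1,1,1, 1,2,2, 3,3,3),
                 c(1,2,1, 2,2,2, 3,3,3))
# entry [i,j]: fraction of base clusterings that put objects i and j into the same cluster
coassoc <- Reduce(`+`, lapply(ensemble, function(l) 1 * outer(l, l, `==`))) /
  length(ensemble)
round(coassoc, 2)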

Ensemble Clustering
Consensus Mechanism
majority decision: pairs located in the same cluster in at least 50% of the ensemble are also part of the same cluster in the consensus clustering

(figure: averaged co-association matrix with entries such as 1/4, 2/4, 3/4, 4/4; pairs at or above 2/4 end up in the same consensus cluster)
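Continuing the sketch above, the majority decision can be expressed as thresholding the averaged co-association matrix at 50% and taking the connected components as consensus clusters (a simplified stand-in, not one of the cited mechanisms):

same <- coassoc >= 0.5          # together in at least 50% of the ensemble
# connected components of the 'same' graph = consensus clusters;
# single linkage on 0/1 distances merges exactly these components
consensus <- cutree(hclust(as.dist(1 - same), method = "single"), h = 0.5)
consensus                       # consensus cluster id per object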

more consensus mechanisms exist, e.g. graph partitioning via METIS [Karypis et al., 1997]

Our Research Area
Algorithm integration in databases
Dipl.-Inf. Philipp Grosse
philipp.grosse@sap.com

The R Project for Statistical Computing
free software environment for statistical computing and graphics
includes many standard mathematical functions and libraries
functional and object-oriented
flexible / extensible

Our Research Area
Vector: basic data structure in R

> x <- 1          # vector of size 1
> x <- c(1,2,3)   # vector of size 3
> x <- 1:10       # vector of size 10
> "abc" -> y      # vector of size 1

Vector output

> x
[1] 1 2 3 4 5 6 7 8 9 10
> sprintf("y is %s", y)
[1] "y is abc"

Our Research Area
Vector arithmetic
element-wise basic operators (+ - * / ^)
> x <- c(1,2,3,4)
> y <- c(2,3,4,5)
> x*y
[1] 2 6 12 20
vector recycling (short vectors are repeated)
> y <- c(2,7)
> x*y
[1] 2 14 6 28
arithmetic functions:
log, exp, sin, cos, tan, sqrt, min, max, range, length, sum, prod, mean, var

R as HANA operator (R-OP)

(architecture diagram, based on [Urbanek03]: the database and an R client communicate via Rserve (RICE) over TCP/IP and shared memory (SHM); the numbered steps include forking an R process, writing the data to SHM, passing the R script, accessing the data from R, and writing the results back via the SHM manager)

R scripts in databases

## Create some dummy data
CREATE INSERT ONLY COLUMN TABLE Test1(A INTEGER, B DOUBLE);
INSERT INTO Test1 VALUES(0, 2.5);
INSERT INTO Test1 VALUES(1, 1.5);
INSERT INTO Test1 VALUES(2, 2.0);

## Result table with the same layout as Test1 plus a column C
CREATE COLUMN TABLE "ResultTable" AS TABLE "TEST1" WITH NO DATA;
ALTER TABLE "ResultTable" ADD ("C" DOUBLE);

## Create an SQL-script procedure including the R script
CREATE PROCEDURE ROP(IN input1 "Test1", OUT result "ResultTable")
LANGUAGE RLANG AS
BEGIN
A <- input1$A ;
B <- input1$B ;
C <- A * B;
result <- cbind(A, B, C) ;
END;

## Execute the SQL-script procedure and retrieve the result
CALL ROP(Test1, "ResultTable");
SELECT * FROM "ResultTable";

Result (C = A * B):
A  B    C
0  2.5  0.0
1  1.5  1.5
2  2.0  4.0

parallel R operators

## Call ROP in parallel on three hash partitions of Test1
DROP PROCEDURE testPartition1;
CREATE PROCEDURE testPartition1 (OUT result "ResultTable")
LANGUAGE SQLSCRIPT AS
BEGIN
input1 = select * from Test1;
p = CE_PARTITION([@input1@],[],'''HASH 3 A''');  -- split the dataset into 3 partitions by A
CALL ROP(@p$$input1@, r);                        -- run the R operator on each partition
s = CE_MERGE([@r@]);                             -- merge the partial results
result = select * from @s$$r@;
END;

CALL testPartition1(?);

(figure: dataset → Split-Op → parallel R-Ops inside the calcModel → Merge)
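For intuition only, the same split / apply / merge pattern written in plain R with the parallel package; this mimics CE_PARTITION / R-Op / CE_MERGE but is not how HANA executes it:

library(parallel)

input1 <- data.frame(A = 0:2, B = c(2.5, 1.5, 2.0))

parts   <- split(input1, input1$A %% 3)            # "HASH 3 A"-style partitioning
results <- mclapply(parts,                         # one R call per partition
                    function(p) cbind(p, C = p$A * p$B),
                    mc.cores = 2)                  # use lapply() instead on Windows
merged  <- do.call(rbind, results)                 # merge the partial results
merged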

Result Interpretation
(overview: result interpretation, via quality measures and visualization, connects the algorithmic side (traditional, multi-solution and ensemble clustering) with the feedback side)

Result Interpretation
Quality measures
How good is the obtained result?
Problem: there is no universally valid definition of clustering quality
multiple approaches and metrics exist

Rand index [Rand, 1971]
comparison to a known solution

Dunn's Index [Dunn, 1974]
ratio of inter-cluster to intra-cluster distances
higher score is better

Davies-Bouldin Index [Davies & Bouldin, 1979]
ratio of intra-cluster to inter-cluster distances
lower score is better
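As an example of a pair-counting measure, a small base-R sketch of the (unadjusted) Rand index: the share of object pairs on which a clustering and a known reference solution agree. The label vectors are hypothetical:

rand_index <- function(a, b) {
  same_a <- outer(a, a, `==`)           # pair in the same cluster under clustering a?
  same_b <- outer(b, b, `==`)           # ... and under reference b?
  pairs  <- upper.tri(same_a)           # each unordered pair counted once
  mean(same_a[pairs] == same_b[pairs])  # fraction of consistently treated pairs
}

reference <- c(1,1,1, 2,2,2, 3,3,3)
obtained  <- c(1,1,2, 2,2,2, 3,3,3)
rand_index(obtained, reference)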

Result Interpretation
Visualization
use visual representations to evaluate clustering quality
utilize the perception capacity of humans
result interpretation is left to the user: subjective quality evaluation
communicate information about structures
data-driven: visualize the whole dataset

Result Interpretation
Visualization: examples
all objects and dimensions are visualized: scalability issues for large-scale data
display multiple dimensions on a 2D screen

(figures: scatterplot matrix; parallel coordinates [Inselberg, 1985])
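Both example visualizations are available in R: pairs() draws a scatterplot matrix, and parcoord() from the MASS package (typically bundled with R) draws parallel coordinates. A quick sketch on the iris data, coloured by a k-means result:

cl <- kmeans(iris[, 1:4], centers = 3)$cluster

pairs(iris[, 1:4], col = cl)     # scatterplot matrix, one panel per dimension pair

library(MASS)
parcoord(iris[, 1:4], col = cl)  # parallel coordinates [Inselberg, 1985]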

Result Interpretation
Visualization: Heidi Matrix [Vadapalli et al., 2009]
visualize clusters in high-dimensional space
show cluster overlap in subspaces
pixel technique based on k-nearest-neighbour relations
pixel = object pair, colour = subspace

Feedback
(overview: feedback closes the loop; result interpretation via quality measures and visualization leads to adjustment of the algorithmic side, i.e. traditional, multi-solution and ensemble clustering)

Feedback
Result adjustment
best practice: adjustment via re-parameterization or algorithm swap
modifications have to be inferred from the result interpretation
only low-level parameters can be adjusted
control is only indirect

Clustering with interactive feedback [Balcan & Blum, 2008]
theoretical proof of concept
user adjusts via high-level feedback: split and merge
limitations: 1D data, target solution must be available to the user
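A hedged sketch of what high-level split / merge feedback could look like on a flat clustering in R; the two helper functions are illustrative and not the mechanism of [Balcan & Blum, 2008]:

# merge feedback: relabel one cluster into another
merge_clusters <- function(labels, from, into) {
  labels[labels == from] <- into
  labels
}

# split feedback: re-cluster the members of one cluster into two parts (here via 2-means)
split_cluster <- function(labels, data, target) {
  idx  <- which(labels == target)
  part <- kmeans(data[idx, , drop = FALSE], centers = 2)$cluster
  labels[idx[part == 2]] <- max(labels) + 1     # the second part gets a new cluster id
  labels
}

labels <- kmeans(iris[, 1:4], centers = 3)$cluster
labels <- merge_clusters(labels, from = 2, into = 1)      # user: "merge clusters 2 and 1"
labels <- split_cluster(labels, iris[, 1:4], target = 3)  # user: "split cluster 3"
table(labels)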

Our Research Area
(overview: an algorithmic platform coupled with a visual-interactive interface)

Augur
Generalized Process
based on Shneiderman's information-seeking mantra
visualization and interaction
algorithmic platform based on multiple clustering solutions

Our Research Area
A complex use-case scenario:
self-optimising recommender systems

Dipl.-Inf. Gunnar Schröder
gunnar.schroeder@tu-dresden.de

Our Research Area
Predict user ratings
Products for cross-selling & product classification

Our Research Area
Similar items & similar users
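A minimal sketch of one common way to obtain such similarities: cosine similarity between item columns of a user × item rating matrix (the matrix and the zero-for-missing handling below are simplifications for illustration):

# toy user x item rating matrix, NA = not rated
R <- rbind(u1 = c(5, 3, NA, 1),
           u2 = c(4, NA, 4, 1),
           u3 = c(1, 1, 5, 5),
           u4 = c(NA, 1, 4, 4))
colnames(R) <- paste0("item", 1:4)

R0 <- ifelse(is.na(R), 0, R)                   # crude missing-value handling
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# item-item similarity: high values indicate items rated alike by the same users
sim <- outer(seq_len(ncol(R0)), seq_len(ncol(R0)),
             Vectorize(function(i, j) cosine(R0[, i], R0[, j])))
dimnames(sim) <- list(colnames(R0), colnames(R0))
round(sim, 2)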

Challenges
Huge amounts of data
building recommender models is expensive

Dynamics
models age
user preferences change
new users & items appear

Parameterization
data- and domain-dependent
support needed: automatic model maintenance

(cycle diagram: personalize user interface, calculate recommendations, monitor user actions, (re-)build model, evaluate model, optimize model parameters)


Questions?
