methods in data mining
Yousef Saad
Department of Computer Science
and Engineering
University of Minnesota
NASCA09 – 05/20/2009
Introduction: What is data mining?
® Common goal of data mining methods: to extract meaningful
information or patterns from data. Very broad area – includes: data
analysis, machine learning, pattern recognition, information retrieval, ...
® Main tools used: linear algebra; graph theory; approximation theory;
optimization; ...
® In this talk: emphasis on dimension reduction techniques, interrelations
between techniques, and graph theory tools.
Agadir, 05192009 p. 2
Major tool of Data Mining: Dimension reduction
® Goal is not just to reduce computational cost but to:
• Reduce noise and redundancy in data
• Discover ’features’ or ’patterns’ (e.g., supervised learning)
® Techniques depend on application: Preserve angles? Preserve distances? Maximize variance? ..
The problem of Dimension Reduction
® Given d ≪ m, find a mapping

Φ : x ∈ R^m −→ y ∈ R^d

® Mapping may be explicit [typically linear], e.g.:

y = V^T x

® Or implicit (nonlinear)

Practically: Given X ∈ R^{m×n}, we want to find a low-dimensional representation Y ∈ R^{d×n} of X
Linear Dimensionality Reduction
Given: a data set X = [x_1, x_2, . . . , x_n], and d the dimension of the desired reduced space Y = [y_1, y_2, . . . , y_n].

Want: a linear transformation from X to Y

X ∈ R^{m×n} ,  V ∈ R^{m×d} ,  Y = V^T X → Y ∈ R^{d×n}

® m-dimensional objects (x_i) ‘flattened’ to d-dimensional space (y_i)

Constraint: The y_i’s must satisfy certain properties

® Optimization problem
Example 1: The ‘Swiss Roll’ (2000 points in 3-D)
[Figure: original data in 3-D]
2-D ‘reductions’:

[Figures: PCA, LPP, Eigenmaps, and ONPP projections]
Example 2: Digit images (a random sample of 30)
[Figure: grid of 30 sample digit images]
2-D ‘reductions’:

[Figures: PCA, LLE, K-PCA, and ONPP projections of digits 0–4]
Basic linear dimensionality reduction method: PCA
® We are given points in R^n and we want to project them in R^d. Best way to do this?

® i.e.: find the best axes for projecting the data

® Q: “best in what sense”?

® A: maximize variance of new data

® Principal Component Analysis [PCA]
® Recall y_i = V^T x_i, where V is m × d orthogonal

® We need to maximize over all orthogonal m × d matrices V:

Σ_i ‖ y_i − (1/n) Σ_j y_j ‖² = · · · = Tr [ V^T X̄ X̄^T V ]

Where: X̄ = X(I − (1/n) 11^T) == origin-recentered version of X

® Solution: V = { dominant eigenvectors } of the covariance matrix == set of left singular vectors of X̄

® Solution V also minimizes the ‘reconstruction error’:

Σ_i ‖ x_i − V V^T x_i ‖² = Σ_i ‖ x_i − V y_i ‖²

® .. and it also maximizes [Koren and Carmel ’04]

Σ_{i,j} ‖ y_i − y_j ‖²
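The characterization above (V = dominant left singular vectors of the recentered X̄, then y_i = V^T x_i) takes only a few lines of NumPy; a minimal sketch, with variable names of my choosing:

```python
import numpy as np

def pca(X, d):
    """Columns of X are the data points x_i (X is m x n).
    Returns V (m x d, orthonormal) and Y = V^T X (d x n)."""
    Xbar = X - X.mean(axis=1, keepdims=True)       # Xbar = X (I - (1/n) 11^T)
    U, s, _ = np.linalg.svd(Xbar, full_matrices=False)
    V = U[:, :d]                                   # dominant left singular vectors
    return V, V.T @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
V, Y = pca(X, 2)
```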
Unsupervised learning: Clustering
Problem: partition a given set into subsets such that items of the same
subset are most similar and those of two different subsets most dissimilar.
[Figure: example clusters of materials: superhard, photovoltaic, superconductors, catalytic, ferromagnetic, thermo-electric, multi-ferroics]
® Basic technique: K-means algorithm [slow but effective.]
® Example of application: cluster bloggers by ‘social groups’ (anarchists, ecologists, sports fans, liberals-conservatives, ... )
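The K-means step mentioned above is Lloyd's iteration; a minimal plain-NumPy sketch (initial centers are passed in explicitly here, and all names are mine):

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Lloyd iteration: X is n x m (one point per row), centers is k x m."""
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, centers=X[[0, 20]].copy())
```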
Sparse Matrices viewpoint
® Communities modeled by an ‘affinity’ graph [e.g., ‘user A sends frequent emails to user B’]
® Adjacency Graph represented
by a sparse matrix
® Goal: ﬁnd reordering such that
blocks are as dense as possible:
® Advantage of this viewpoint: no need to know number of clusters.
® Use ‘blocking’ techniques for sparse matrices [O’Neil Szyld ’90, YS 03,
...]
Agadir, 05192009 p. 13
Example of application in knowledge discovery. Data set from: http://www-personal.umich.edu/~mejn/netdata/
® Network connecting bloggers of different political orientations [2004 US presidential election]
® ’Communities’: liberal vs. conservative
® Graph: 1,490 vertices (blogs): first 758 liberal, rest conservative.
® Edge: i → j means there is a citation between blogs i and j
® Blocking algorithm (density threshold = 0.4) found 2 subgraphs
® Note: density = |E| / |V|².
® Smaller subgraph: conservative blogs, larger one: liberals
® One observation: the smaller subgraph is denser than the larger one. Concl. agrees with [L. A. Adamic and N. Glance ’05]: ‘right-leaning (conservative) blogs have a denser structure of strong connections than the left (liberal)’
Supervised learning: classiﬁcation
Problem: Given labels (say “A” and “B”) for each item of a given set, find a way to classify an unlabelled item into either the “A” or the “B” class.

[Figure: two labeled classes and two unlabeled test points]

® Many applications.

® Example: distinguish between SPAM and non-SPAM messages

® Can be extended to more than 2 classes.
[Figure: a linear classifier]

Linear classifiers: Find a hyperplane which best separates the data in classes A and B.
A harder case:
[Figure: ‘Spectral Bisection (PDDP)’ – a data set that no hyperplane separates well]

® Use kernels to transform the data
Linear classiﬁers
® Let X = [x_1, · · · , x_n] be the data matrix,

® and L = [l_1, · · · , l_n] the labels, either +1 or −1.

® First solution: Find a vector v such that v^T x_i is close to l_i for all i.

® More common solution: Use the SVD to reduce the dimension of the data [e.g. to 2-D], then do the comparison in this space, e.g.

A: v^T x_i ≥ 0 ,  B: v^T x_i < 0

[Note: for clarity, the principal axis v is drawn below where it should be]
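The ‘first solution’ above is an ordinary least-squares fit of v to the ±1 labels; a hypothetical sketch on synthetic data (the class geometry is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian classes in 2-D with labels +1 (A) and -1 (B)
A = rng.normal([2.0, 2.0], 0.5, (50, 2))
B = rng.normal([-2.0, -2.0], 0.5, (50, 2))
X = np.vstack([A, B])                        # one point per row
l = np.hstack([np.ones(50), -np.ones(50)])   # labels

v, *_ = np.linalg.lstsq(X, l, rcond=None)    # min_v || X v - l ||_2
pred = np.sign(X @ v)                        # A: v^T x_i >= 0, B: v^T x_i < 0
```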
Better solution: Linear Discriminant Analysis (LDA)
Deﬁne “between scatter”: a measure of how well separated two distinct
classes are.
Deﬁne “within scatter”: a measure of how well clustered items of the same
class are.
® Goal: to make “between scatter” measure large, while making “within
scatter” small.
Idea: Project the data onto a low-dimensional space so as to maximize the ratio of the “between scatter” measure over the “within scatter” measure of the classes.
® Let µ = mean of X,

® and µ^(k) = mean of the k-th class (of size n_k).

[Figure: clusters with the global centroid and the cluster centroids marked]
Define

S_B = Σ_{k=1}^{c} n_k (µ^(k) − µ)(µ^(k) − µ)^T ,

S_W = Σ_{k=1}^{c} Σ_{x_i ∈ X_k} (x_i − µ^(k))(x_i − µ^(k))^T .
® Project the set onto a one-dimensional space spanned by a vector a. Then:

a^T S_B a = Σ_{k=1}^{c} n_k [a^T (µ^(k) − µ)]² ,  a^T S_W a = Σ_{k=1}^{c} Σ_{x_i ∈ X_k} [a^T (x_i − µ^(k))]²
® LDA projects the data so as to maximize the ratio of these two numbers:

max_a  (a^T S_B a) / (a^T S_W a)

® Optimal a = eigenvector associated with the largest eigenvalue of:

S_B u_i = λ_i S_W u_i .
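The generalized eigenproblem S_B u = λ S_W u can be handed directly to SciPy's symmetric solver; a small sketch (the two-class geometry and all names are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# two 2-D classes: well separated along x, widely spread along y
X1 = rng.normal([0.0, 0.0], [0.3, 3.0], (100, 2))
X2 = rng.normal([4.0, 0.0], [0.3, 3.0], (100, 2))
X = np.vstack([X1, X2])
mu = X.mean(axis=0)

S_B = np.zeros((2, 2))
S_W = np.zeros((2, 2))
for Xk in (X1, X2):
    mk = Xk.mean(axis=0)
    S_B += len(Xk) * np.outer(mk - mu, mk - mu)   # between scatter
    S_W += (Xk - mk).T @ (Xk - mk)                # within scatter

# eigh(A, B) solves A u = lambda B u, eigenvalues in ascending order
w, V = eigh(S_B, S_W)
a = V[:, -1]          # direction with the largest between/within ratio
```

As expected, the recovered direction is essentially the x-axis, along which the class means differ.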
LDA – Extension to arbitrary dimension
® Would like to project in d dimensions –

® Normally we wish to maximize the ratio of two traces:

Tr [U^T S_B U] / Tr [U^T S_W U]

® U subject to being unitary: U^T U = I (orthogonal projector).

® Reduced-dimension data: Y = U^T X.

Difficulty: Hard to maximize. See Bellalij & YS (in progress).

® Common alternative: Solve instead the (easier) problem:

max_{U^T S_W U = I}  Tr [U^T S_B U]

® Solution: largest eigenvectors of

S_B u_i = λ_i S_W u_i .
Motivating example: Collaborative ﬁltering, Netﬂix Prize
® Important application in commerce ..
® .. but potential for other areas too
® When you order a book on Amazon you will get recommendations → a ‘recommender system’
® A very hot topic at the height of the dot-com era ...
® Some of the tools used: statistics, clustering, kNN graphs, LSI (a form
of PCA), ...
® Visit the GroupLens page: http://www.grouplens.org/

® Netflix: “qui veut gagner un million?” [who wants to win a $ million?]
Face Recognition – background
Problem: We are given a database of images: [arrays of pixel values].
And a test (new) image.
[Figure: database of face images and a test image]

Question: Does this new image correspond to one of those in the database?
Difﬁculty
® Different positions, expressions, lighting, ..., situations :
Common approach: eigenfaces – a Principal Component Analysis technique
® See real-life examples – [international manhunt]

® Poor images or deliberately altered images [‘occlusion’]

® See recent paper by John Wright et al.
Eigenfaces
– Consider each picture as a one-dimensional column of all pixels

– Put together into an array A of size (# pixels) × (# images).

– Do an SVD of A and perform the comparison with any test image in the low-dimensional space

– Similar to LSI in spirit – but the data is not sparse.

Idea: replace the SVD by Lanczos vectors (same as for IR)
Tests: Face Recognition
Tests with 2 wellknown data sets:
ORL: 40 subjects, 10 sample images each – example: # of pixels: 112 × 92; TOT. # images: 400

AR set: 126 subjects – 4 facial expressions selected for each [natural, smiling, angry, screaming] – example: # of pixels: 112 × 92; TOT. # images: 504
Tests: Face Recognition
Recognition accuracy of Lanczos approximation vs SVD
[Figure: two plots (ORL and AR datasets) of average error rate vs. iterations for ‘lanczos’ and ‘tsvd’]

Vertical axis shows average error rate. Horizontal = subspace dimension
GRAPH-BASED TECHNIQUES
Laplacean Eigenmaps (Belkin-Niyogi ’02)
® Not a linear (projection) method but a Nonlinear method
® Starts with a k-nearest-neighbors graph:
kNN graph = graph of k nearest neighbors
® Then deﬁne a graph Laplacean:
L = D −W
® Simplest:

w_ij = { 1 if j ∈ N_i ; 0 else } ,  D = diag( d_ii = Σ_{j≠i} w_ij )

with N_i = neighborhood of i (excl. i)
A few properties of graph Laplacean matrices
® Let L = any matrix s.t. L = D − W, with D = diag(d_i) and

w_ij ≥ 0 ,  d_i = Σ_{j≠i} w_ij

Property 1: for any x ∈ R^n:

x^T L x = (1/2) Σ_{i,j} w_ij (x_i − x_j)²

Property 2: (generalization) for any Y ∈ R^{d×n}:

Tr [Y L Y^T] = (1/2) Σ_{i,j} w_ij ‖ y_i − y_j ‖²
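Property 1 is easy to verify numerically; a quick sketch with a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2                 # symmetric weights, w_ij >= 0
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W    # L = D - W

x = rng.standard_normal(n)
quad = x @ L @ x
pairwise = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))
```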
Property 3: For the particular L = I − (1/n) 11^T:

X L X^T = X̄ X̄^T == n × Covariance matrix

[Proof: 1) L is a projector: L^T L = L² = L, and 2) X L = X̄]

® Consequence 1: PCA is equivalent to maximizing Σ_{i,j} ‖ y_i − y_j ‖²

® Consequence 2: what about replacing the trivial L with something else? [viewpoint in Koren-Carmel ’04]

Property 4: (Graph partitioning) If x is a vector of signs (±1) then

x^T L x = 4 × (number of edge cuts)

edge cut = pair (i, j) with x_i ≠ x_j

® Consequence: Can be used for partitioning graphs, or ‘clustering’ [take p = sign(u_2), where u_2 = eigenvector of the 2nd smallest eigenvalue..]
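Property 4 is the basis of spectral partitioning: take the sign pattern of the eigenvector for the second-smallest eigenvalue of L. A small sketch on a graph made of two triangles joined by one edge (graph and names are illustrative):

```python
import numpy as np

# two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
u2 = vecs[:, 1]                   # eigenvector of the 2nd smallest eigenvalue
p = np.sign(u2)                   # sign pattern = the two clusters
```

The partition recovers the two triangles, and p^T L p = 4 × (number of cut edges) as in Property 4.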
Return to Laplacean eigenmaps approach
Laplacean Eigenmaps *minimizes*

F_EM(Y) = Σ_{i,j=1}^{n} w_ij ‖ y_i − y_j ‖²  subject to  Y D Y^T = I .

Notes:

1. Motivation: if ‖x_i − x_j‖ is small (orig. data), we want ‖y_i − y_j‖ to also be small (low-D data)

2. Note: Min instead of Max as in PCA [counter-intuitive]

3. The above problem uses the original data indirectly, through its graph
® Problem translates to:

min_{ Y ∈ R^{d×n}, Y D Y^T = I }  Tr [ Y (D − W) Y^T ] .

® Solution (sort eigenvalues increasingly):

(D − W) u_i = λ_i D u_i ;  y^i = u_i^T ;  i = 1, · · · , d

® An n × n sparse eigenvalue problem [in ‘sample’ space]

® Note: can assume D = I. Amounts to rescaling the data. Problem becomes

(I − W) u_i = λ_i u_i ;  y^i = u_i^T ;  i = 1, · · · , d
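For a small graph the generalized problem (D − W)u = λ D u can be solved densely; a sketch embedding a 5-node path graph in 1-D, where graph neighbors stay adjacent in the embedding (the toy graph is mine):

```python
import numpy as np
from scipy.linalg import eigh

# path graph 0-1-2-3-4
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))

# (D - W) u = lambda D u; eigenvalues ascending, u_1 is the trivial constant
vals, vecs = eigh(D - W, D)
y = vecs[:, 1]          # 1-D coordinates of the 5 vertices
```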
Why smallest eigenvalues vs largest for PCA?
Intuition:
Graph Laplacean and ‘unit’ Laplacean are very different: one involves a sparse graph (more like a discrete differential operator); the other involves a dense graph (more like a discrete integral operator). They should be thought of as inverses of each other.

® Viewpoint confirmed by what we learn from the Kernel approach
A uniﬁed view
® Most techniques lead to one of two types of problems

First:

® Y obtained from solving an eigenvalue problem

® LLE, Eigenmaps (normalized), ..

min_{ Y ∈ R^{d×n}, Y Y^T = I }  Tr [ Y M Y^T ]

Second:

® Low-dimensional data: Y = V^T X

® G is either the identity matrix or X D X^T or X X^T.

min_{ V ∈ R^{m×d}, V^T G V = I }  Tr [ V^T X M X^T V ] .

Important observation: the 2nd is just a projected version of the 1st.
Graphbased methods in a supervised setting
® Subjects of training set are known (labeled). Q: given a test image (say)
ﬁnd its label.
[Figure: training images of three labeled subjects (H. Sadok, M. Bellalij, K. Jbilou) and a test image]

Question: Find label (best match) for test image.
Methods can be adapted to supervised mode by building the graph to
take into account class labels. Idea is simple: Build G so that nodes
in the same class are neighbors. If c = # classes, G will consist of c
cliques.
® Matrix W will be block-diagonal:

W = diag( W_1, W_2, W_3, W_4, W_5 )
® Easy to see that rank(W) = n − c.

® Can be used for LPP, ONPP, etc..

® Recent improvement: add a repulsion Laplacean [Kokiopoulou, YS ’09]
[Figure: three-class toy example; recognition accuracy on ORL (5 training images per class) vs. subspace dimension for onpp, pca, olpp-R, laplace, fisher, onpp-R, olpp]
TIME FOR A MATLAB DEMO
Supervised learning experiments: digit recognition
® Set of 390 images of digits (39 of each digit)

® Each picture has 20 × 16 = 320 pixels.

® Select any one of the digits and try to recognize it among the 389 remaining images

® Methods: PCA, LPP, ONPP
[Figure: grid of sample digit images]
MULTILEVEL METHODS
Multilevel techniques
® Divide and conquer paradigms as well as multilevel methods in the
sense of ‘domain decomposition’
® Main principle: it is very costly to do an SVD [or Lanczos] on the whole set. Why not find a smaller set on which to do the analysis – without too much loss?

[Figure: coarsening pipeline – the graph of X (m × n) is coarsened repeatedly down to a last level X_L (m × n_L), and the projection is computed there: ‘do the hard work here – not here’]

® Main tool used: graph coarsening
Hypergraphs and Hypergraph coarsening
A hypergraph H = (V, E) is a generalization of a graph:

® V = set of vertices

® E = set of hyperedges. Each e ∈ E is a non-empty subset of V

® Standard undirected graphs: each e consists of two vertices.

® degree of e = |e|

® degree of a vertex v = number of hyperedges e s.t. v ∈ e.

® Two vertices are neighbors if there is a hyperedge connecting them
Example of a hypergraph
[Figure: hypergraph with 9 vertices and 5 nets a–e]

Boolean matrix representation: A has one row per net (a–e) and one column per vertex (1–9), with a 1 in position (e, j) when vertex j belongs to net e [here nets a, b, c contain 4 vertices each; d contains 3; e contains 2].

® Canonical hypergraph representation for sparse data (e.g. text)
Hypergraph Coarsening
® Coarsening a hypergraph H = (V, E) means finding a ‘coarse’ approximation Ĥ = (V̂, Ê) to H, with |V̂| < |V|, which tries to retain as much as possible of the structure of the original hypergraph

® Idea: repeat coarsening recursively.

® Result: a succession of smaller hypergraphs which approximate the original graph.

® Several methods exist. We use ‘matching’, which successively merges pairs of vertices

® Used in most graph partitioning methods: hMETIS, Patoh, Zoltan, ..

® Algorithm successively selects pairs of vertices to merge – based on a measure of similarity of the vertices.
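A single greedy matching pass can be sketched as follows; the similarity used here (number of shared hyperedges) and all names are illustrative, not the exact algorithm of the talk:

```python
def coarsen_one_level(n, hyperedges):
    """Merge each unmatched vertex with the unmatched vertex that
    shares the most hyperedges with it; return the fine->coarse map."""
    # count shared hyperedges for every pair of co-members
    shared = {}
    for e in hyperedges:
        vs = sorted(set(e))
        for i in range(len(vs)):
            for j in range(i + 1, len(vs)):
                p = (vs[i], vs[j])
                shared[p] = shared.get(p, 0) + 1
    matched = [False] * n
    group = [-1] * n          # coarse vertex id of each fine vertex
    next_id = 0
    for v in range(n):
        if matched[v]:
            continue
        # best unmatched partner of v by shared-hyperedge count
        best_cnt, best_u = 0, None
        for (a, b), cnt in shared.items():
            u = b if a == v else (a if b == v else None)
            if u is not None and not matched[u] and cnt > best_cnt:
                best_cnt, best_u = cnt, u
        matched[v] = True
        group[v] = next_id
        if best_u is not None:
            matched[best_u] = True
            group[best_u] = next_id
        next_id += 1
    return group

# toy hypergraph: vertices 0..3, nets {0,1,2} and {2,3}
group = coarsen_one_level(4, [{0, 1, 2}, {2, 3}])
```

On the toy example the pass merges 0 with 1 and 2 with 3, halving the vertex count.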
Application: Multilevel Dimension Reduction
Main Idea: coarsen to a certain level, then use the resulting data set X̂ to find a projector from R^m to R^d. This projector can be used to project the original data or any new data.

[Figure: coarsening pipeline – X (m × n) is coarsened down to X_r (m × n_r); the projection found there gives Y (d × n)]

® Main gain: dimension reduction is done with a much smaller set. Hope: not much loss compared to using the whole data
Application to text mining
® Recall common approach:
1. Scale data [e.g. TF-IDF scaling]

2. Perform a (partial) SVD on the resulting matrix: X ≈ U_d Σ_d V_d^T

3. Process the query by the same scaling (e.g. TF-IDF)

4. Compute similarities in the d-dimensional space: s_i = ⟨q̂, x̂_i⟩ / (‖q̂‖ ‖x̂_i‖)

where [x̂_1, x̂_2, . . . , x̂_n] = V_d^T ∈ R^{d×n} ;  q̂ = Σ_d^{−1} U_d^T q̄ ∈ R^d

® Multilevel approach: replace the SVD (or any other dim. reduction) by dimension reduction on the coarse set. Only difference: TF-IDF is done on the coarse set, not the original set.
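Steps 2–4 above, on a toy term–document matrix (the matrix and the query are made up for illustration; the scaling step is omitted):

```python
import numpy as np

# toy term-document matrix X (terms x documents), already scaled
X = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])
d = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Ud, Sd, xhat = U[:, :d], np.diag(s[:d]), Vt[:d, :]   # X ~ Ud Sd xhat

q = np.array([1., 1., 0., 0., 0.])        # query hitting terms 0 and 1
qhat = np.linalg.inv(Sd) @ Ud.T @ q       # reduced-space query

# cosine similarities between the query and each document, in d dimensions
sims = (qhat @ xhat) / (np.linalg.norm(qhat) * np.linalg.norm(xhat, axis=0))
best = int(np.argmax(sims))
```

Since the query coincides with the first document's column, that document comes out on top.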
Tests

Three public data sets used for experiments: Medline, Cran and NPL (cs.cornell.edu)

® Coarsening to a max. of 4 levels.

Data set            Medline   Cran    NPL
# documents         1033      1398    11429
# terms             7014      3763    7491
sparsity (%)        0.74%     1.41%   0.27%
# queries           30        225     93
avg. # rel./query   23.2      8.2     22.4
Results with NPL
Statistics:

Level   coarsening time   # documents   optimal # dim.   optimal avg. precision
#1      N/A               11429         736              23.5%
#2      3.68              5717          592              23.8%
#3      2.19              2861          516              23.9%
#4      1.50              1434          533              23.3%
PrecisionRecall curves
[Figure: precision vs. recall on NPL for LSI and multilevel-LSI with r = 2, 3, 4]
CPU times for preprocessing (Dim. reduction part)
[Figure: CPU time (secs) for the truncated SVD vs. # dimensions on NPL, for LSI and multilevel-LSI with r = 2, 3, 4]
Conclusion
® Eigenvalue problems of data mining are not cheap to solve..

® .. and the cost issue does not seem to bother practitioners too much for now..

® Ingredients that will become mandatory:

1. Avoid the SVD

2. Fast algorithms that do not sacrifice quality.

3. In particular: multilevel approaches

4. Multilinear algebra [tensors]
Introduction: What is data mining?
® Common goal of data mining methods: to extract meaningful
information or patterns from data. Very broad area – includes: data analysis, machine learning, pattern recognition, information retrieval, ...
® Main tools used: linear algebra; graph theory; approximation theory; optimization; ... ® In this talk: emphasis on dimension reduction techniques, interrelations between techniques, and graph theory tools.
Agadir, 05192009
p.
2
Major tool of Data Mining: Dimension reduction
® Goal is not just to reduce computational cost but to:
• Reduce noise and redundancy in data • Discover ’features’ or ’patterns’ (e.g., supervised learning)
® Techniques depend on application: Preserve angles? Preserve distances? Maximize variance? ..
Agadir, 05192009
p.
3
The problem of Dimension Reduction
® Given d
m ﬁnd a mapping Φ : x ∈ Rm −→ y ∈ Rd
® Mapping may be explicit [typically linear], e.g.:
y = V Tx
® Or implicit (nonlinear)
Practically:
Given X ∈ Rm×n,
we want to ﬁnd a lowdimensional representation Y ∈ Rd×n of X
Agadir, 05192009
p.
4
Linear Dimensionality Reduction
Given: a data set X = [x1, x2, . . . , xn], and d the dimension of the desired reduced space Y = [y1, y2, . . . , yn]. Want: a linear transformation from X to Y
n
m m d
X Y
xi yi
n d
vT
X ∈ Rm×n V ∈ Rm×d Y =V X → Y ∈ Rd×n
® mdimens. objects (xi) ‘ﬂattened’ to ddimens. space (yi) Constraint: The yi’s must satisfy certain properties ® Optimization problem
Agadir, 05192009
p. 5
05192009 p.Example 1: The ‘SwillRoll’ (2000 points in 3D) Original Data in 3−D 15 10 5 0 −5 20 −10 0 −15 −15 −10 −5 0 5 10 15 −20 −10 10 Agadir. 6 .
7 .2D ‘reductions’: PCA 15 10 5 0 −5 −10 −15 −20 −10 0 10 20 15 10 5 0 −5 −10 −15 −15 −10 −5 0 5 10 LPP Eigenmaps ONPP Agadir. 05192009 p.
8 . 05192009 p.Example 2: Digit images (a random sample of 30) 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 10 20 5 10 15 Agadir.
4341 −5.072 0.1 −0.4341 ONPP − digits : 0 −− 4 −5.2 0. 9 .4341 −5.05 3 0 4 −0.1 0 0.05 −0.25 −5.2 −0.05 2 0 3 4 −0.2 0.1 0 1 0.073 0.05 0 −0.15 0.1 0. 05192009 p.2D ’reductions’: PCA − digits : 0 −− 4 6 4 2 0 −2 −4 −6 −10 −5 0 5 10 0 1 0.2 −0.1 0.1 −0.07 0.15 −0.4341 −5.4341 x 10 −3 Agadir.1 −0.05 −0.15 LLE − digits : 0 −− 4 K−PCA − digits : 0 −− 4 0.071 0.1 2 0.074 0.2 −0.15 −0.05 −0.15 0.
10 . 05192009 p.: ﬁnd the best axes for projecting the data ® Q: “best in what sense”? ® A: maximize variance of new data ® Principal Component Analysis [PCA] Agadir.Basic linear dimensionality reduction method: PCA ® We are given points in Rn and we want to project them in Rd. Best way to do this? ® i.e.
.. where V is m × d orthogonal ® We need to maximize over all orthogonal m × d matrices V : i yi − 1 n j yj 2 2 ¯ ¯ = · · · = Tr V X X V 1 ¯ Where: X = X(I − n 11T ) == originrecentered version of X ® Solution V = { dominant eigenvectors } of the covariance matrix == ¯ Set of left singular vectors of X ® Solution V also minimizes ‘reconstruction error’ . 11 .j 2 ® .® Recall yi = V T xi. and it also maximizes [Korel and Carmel 04] yi − yj 2 Agadir. 05192009 p. xi − V V T xi i 2 = i xi − V y i i.
05192009 p. . liberalsconservative. Photovoltaic Superhard Superconductors Ferromagnetic Multi−ferroics Catalytic Thermo−electric ® Basic technique: Kmeans algorithm [slow but effective.] ® Example of application : cluster bloggers by ‘social groups’ (anarchists..Unsupervised learning: Clustering Problem: partition a given set into subsets such that items of the same subset are most similar and those of two different subsets most dissimilar. ecologists. sportsfans. 12 .. ) Agadir.
Sparse Matrices viewpoint ® Communities modeled by an ‘afﬁnity’ graph [e. ® Use ‘blocking’ techniques for sparse matrices [O’Neil Szyld ’90. 05192009 p. ’user A sends frequent emails to user B ’] ® Adjacency Graph represented by a sparse matrix ® Goal: ﬁnd reordering such that blocks are as dense as possible: ® Advantage of this viewpoint: no need to know number of clusters.. .g... 13 . YS 03.] Agadir.
Data set from : http://wwwpersonal.edu/∼mejn/netdata/ ® Network connecting bloggers of different political orientations [2004 US presidentual election] ® ’Communities’: liberal vs.Example of application in knowledge discovery. Concl. conservative ® Graph: 1. A. agrees with [L. rest: conservative. 14 . 490 vertices (blogs) : ﬁrst 758: liberal. Smaller subgraph: conservative blogs. Adamic and N. 05192009 p.4) found 2 subgraphs Note: density = E/V 2. Glance 05] : ‘rightleaning (conservative) blogs have a denser structure of strong connections than the left (liberal)” Agadir.umich. larger one: liberals ® One observation: smaller subgraph is denser than that of the larger one. ® ® ® ® Edge: i → j means there is a citation between blogs i and j Blocking algorithm (Density theshold=0.
? ? ® Many applications. Agadir. 15 . 05192009 p. ﬁnd a way to classify an unlabelled item into either the “A” or the “B” class. ® Example: distinguish between SPAM and nonSPAM messages ® Can be extended to more than 2 classes.Supervised learning: classiﬁcation Problem: Given labels (say “A” and “B”) for each item of a given set.
05192009 p.Linear classifier Linear classiﬁers: Find a hyperplane which best separates the data in classes A and B. Agadir. 16 .
A harder case: Spectral Bisection (PDDP) 4 3 2 1 0 −1 −2 −3 −4 −2 −1 0 1 2 3 4 5 ® Use kernels to transorm Agadir. 05192009 p. 17 .
05192009 p. ® and L = [l1. 2D] then do comparison in this space. · · · . e. ® More common solution: Use SVD to reduce dimension of data [e.g. A: v T xi ≥ 0 .g. ® First Solution: Find a vector v such that v T xi close to li for all i. ln] the labels either +1 or 1. xn] be the data matrix. · · · .Linear classiﬁers ® Let X = [x1. B: v T xi < 0 v [Note: for clarity v principal axis drawn below where it should be] Agadir. 18 .
Idea: Project the data in lowdimensional space so as to maximize the ratio of the “between scatter” measure over “within scatter” measure of the classes. 19 .Better solution: Linear Discriminant Analysis (LDA) Deﬁne “between scatter”: a measure of how well separated two distinct classes are. 05192009 p. Agadir. Deﬁne “within scatter”: a measure of how well clustered items of the same class are. ® Goal: to make “between scatter” measure large. while making “within scatter” small.
® and µ(k) = mean of the kth class (of size nk ). 05192009 p. 20 .® Let µ = mean of X . CLUSTER CENTROIDS # GLOBAL CENTROID # Agadir.
aT SW a = k=1 xi ∈ Xk aT (xi − µ(k))2 ® LDA projects the data so as to maximize the ratio of these two numbers: max a aT SB a aT SW a ® Optimal a = eigenvector associated with the largest eigenvalue of: SB ui = λiSW ui . k=1 xi ∈Xk SW = ® Project set on a onedimensional space spanned by a vector a. Agadir.c SB = Deﬁne k=1 c nk (µ(k) − µ)(µ(k) − µ)T . 05192009 p. (xi − µ(k))(xi − µ(k))T . 21 . Then: c c aT SB a = i=1 nk aT (µ(k) − µ)2.
® Reduced dimension data: Y = U T X . ® Common alternative: Solve instead the (easier) problem: U SW U =I max T Tr [U T SB U ] ® Solution: largest eigenvectors of SB ui = λiSW ui . See Bellalij & YS (in progress). 05192009 p.LDA – Extension to arbitrary dimension ® Would like to project in d dimensions – ® Normally we wish to maximize the ratio of two traces Tr [U T SB U ] Tr [U T SW U ] ® U subject to being unitary U T U = I (orthogonal projector). Agadir. Difﬁculty: Hard to maximize. 22 .
. ® Some of the tools used: statistics. 05192009 p. 23 ... ® Visit the page GoupLens page http://www. but potential for other areas too ® When you order a book in Amazon you will get recommendations® ’Recommender system’ ® A very hot topic at the height of the dotcom era . ® .org/ ® Netﬂix: “qui veut gagner un million?” [who wants to win a $ million?] Agadir. Netﬂix Prize ® Important application in commerce .Motivating example: Collaborative ﬁltering. kNN graphs.grouplens.. LSI (a form of PCA).. . clustering..
.
05192009 p.Face Recognition – background Problem: We are given a database of images: [arrays of pixel values]. And a test (new) image. ÿ ÿ ÿ ↑ ÿ ÿ ÿ Question: Does this new image correspond to one of those in the database? Agadir. 25 .
situations : Common approach: eigenfaces nique – Principal Component Analysis tech Agadir. 26 .. .. 05192009 p..Difﬁculty ® Different positions. expressions. lighting.
® See reallife examples – [international manhunt] ® Poor images or deliberately altered images [‘occlusion’] ® See recent paper by John Wright et al. 27 . Agadir. 05192009 p.
Idea: replace SVD by Lanczos vectors (same as for IR) Agadir..Eigenfaces – Consider each picture as a onedimensional colum of all pixels – Put together into an array A of size # pixels × # images.. 28 .. 05192009 p. . =⇒ .. space – Similar to LSI in spirit – but data is not sparse. A – Do an SVD of A and perform comparison with any test image in lowdim.
29 . 10 sample images each – example: # of pixels : 112 × 92 TOT. # images : 504 Agadir.Tests: Face Recognition Tests with 2 wellknown data sets: ORL 40 subjects. 05192009 p. screaming] – example: # of pixels : 112 × 92 # TOT. angry. smiling. # images : 400 AR set 126 subjects – 4 facial expressions selected for each [natural.
4 0.Tests: Face Recognition Recognition accuracy of Lanczos approximation vs SVD ORL dataset ORL 1 1 AR dataset AR lanczos tsvd lanczos tsvd 0.9 0.2 0.3 0.7 0. 05192009 p.1 0 0 10 20 30 40 50 60 70 80 90 100 0 0 10 20 30 40 50 60 70 80 90 100 iterations iterations Vertical axis shows average error rate.6 0.9 0.4 0.1 0.8 0.3 0.6 0.5 0.2 0.7 average error rate average error rate 0. 30 .5 0. Horizontal = Subspace dimension Agadir.8 0.
GRAPHBASED TECHNIQUES .
i) Agadir.Laplacean Eigenmaps (BelkinNiyogi02) ® Not a linear (projection) method but a Nonlinear method ® Starts with knearestneighbors graph: kNN graph = graph of k nearest neighbors ® Then deﬁne a graph Laplacean: x xj i L=D−W ® Simplest: wij = 1 if j ∈ Ni 0 else D = diag dii = j=i wij with Ni = neighborhood of i (excl. 05192009 p. 32 .
t. with D = diag(di) and wij ≥ 0. Property 1: for any x ∈ Rn : di = j=i wij x Lx = 1 2 i. 05192009 p. L = D − W .A few properties of graph Laplacean matrices ® Let L = any matrix s. 33 .j wij yi − yj 2 Agadir.j wij xi − xj 2 Property 2: (generalization) for any Y ∈ Rd×n : Tr [Y LY ] = 1 2 i.
34 .1 Property 3: For the particular L = I − n 11 XLX ¯ ¯ = XX == n × Covariance matrix ¯ [Proof: 1) L is a projector: L L = L2 = L. 05192009 p. and 2) XL = X ] ® Consequence1: PCA equivalent to maximizing ij yi − yj 2 ® Consequence2: what about replacing trivial L with something else? [viewpoint in KorenCarmel’04] Property 4: (Graph partitioning) If x is a vector of signs (±1) then x Lx = 4 × (’number of edge cuts’) edgecut = pair (i. j) with xi = xj ® Consequence: Can be used for partitioning graphs.] Agadir. or ‘clustering’ [take p = sign(u2).. where u2 = 2nd smallest eigenvector.
j=1 wij yi − yj 2 subject to Y DY =I. data). 35 .Return to Laplacean eigenmaps approach Laplacean Eigenmaps *minimizes* n FEM (Y ) = i. 05192009 p. Notes: 1. Motivation: if xi − xj is small (orig. Above problem uses original data indirectly through its graph xj x i y yi j Agadir. Note: Min instead of Max as in PCA [counterintuitive] 3. we want yi − yj to be also small (lowD data) 2.
i = 1. · · · . d ® An n × n sparse eigenvalue problem [In ‘sample’ space] ® Note: can assume D = I . Problem becomes (I − W )ui = λiui . 05192009 p. i = 1. · · · . 36 . y i = ui .® Problem translates to: Tr Y (D − W )Y min Y ∈ Rd×n YD Y =I . Amounts to rescaling data. d Agadir. ® Solution (sort eigenvalues increasingly): (D − W )ui = λiDui . y i = ui .
05192009 p. 37 . The other involves a dense graph. differential operator). ® Viewpoint conﬁrmed by what we learn from Kernel approach Agadir. They should be treated as the inverses of each other. (More like a discr. integral operator).Why smallest eigenvalues vs largest for PCA? Intuition: Graph Laplacean and ‘unit’ Laplacean are very different: one involves a sparse graph (More like a discr.
Agadir. Second: ® LowDimensional data: Y =V X ® G is either the identity matrix or XDX or XX . .. Important observation: 2nd is just a projected version of the 1st. 05192009 p. Tr Y M Y min Y ∈ Rd×n YY =I Tr V XM X V min V ∈ Rm×d V GV =I .A uniﬁed view ® Most techniques lead to one of two types of problems First : ® Y obtained from solving an eigenvalue problem ® LLE. Eigenmaps (normalized). 38 .
Graph-based methods in a supervised setting
® Subjects of the training set are known (labeled). Q: given a test image (say), find its label.
[Figure: labeled training images (e.g., M. Bellalij, K. Jbilou, H. Sadok) and one unlabeled test image]
↑ Question: Find the label (best match) for the test image.
Agadir, 05192009 p. 39
Methods can be adapted to the supervised mode by building the graph to take class labels into account. The idea is simple: build G so that nodes in the same class are neighbors. If c = # classes, G will consist of c cliques.
® The matrix W will be block-diagonal:
W = diag(W_1, W_2, W_3, W_4, W_5)
® Easy to see that rank(W) = n − c.
® Can be used for LPP, ONPP, etc.
® Recent improvement: add a repulsion Laplacean [Kokiopoulou, YS ’09]
Agadir, 05192009 p. 40
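A related rank fact behind this construction is easy to verify in numpy (the class sizes below are assumed): the Laplacean D − W of c disjoint cliques has rank n − c, since its nullity equals the number of connected components, one per class:

```python
import numpy as np

# Assumed toy class sizes: c = 3 classes.
sizes = [4, 3, 5]
n, c = sum(sizes), len(sizes)

# Supervised graph: nodes of the same class form a clique -> W block-diagonal.
W = np.zeros((n, n))
start = 0
for nk in sizes:
    W[start:start + nk, start:start + nk] = 1.0 - np.eye(nk)
    start += nk

# The Laplacean of c disjoint cliques has nullity c (one component per class).
L = np.diag(W.sum(axis=1)) - W
assert np.linalg.matrix_rank(L) == n - c
print(n - c)
```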
Supervised learning results: ORL
[Figure: ORL, TrainPer-Class = 5 – recognition accuracy (0.80–0.98) vs. reduced dimension (10–100) for onpp, pca, olpp-R, laplace, fisher, onpp-R, olpp]
Agadir, 05192009 p. 41
TIME FOR A MATLAB DEMO .
Supervised learning experiments: digit recognition
® Set of 390 images of digits (39 of each digit)
® Each picture has 20 × 16 = 320 pixels.
® Select any one of the digits and try to recognize it among the 389 remaining images
® Methods: PCA, LPP, ONPP
[Figure: sample 20 × 16 digit images]
Agadir, 05192009 p. 43
MULTILEVEL METHODS .
Multilevel techniques
® Divide and conquer paradigms, as well as multilevel methods in the sense of ‘domain decomposition’
® Main principle: very costly to do an SVD [or Lanczos] on the whole set X. Why not find a smaller set on which to do the analysis – without too much loss?
[Figure: the graph of X (m × n) is repeatedly coarsened down to a last level; the hard work is done there, on the coarse set X_L (m × n_L), which is projected to Y_L (d × n_L)]
® Main tool used: graph coarsening
Agadir, 05192009 p. 45
Hypergraphs and Hypergraph coarsening
A hypergraph H = (V, E) is a generalization of a graph
® V = set of vertices
® E = set of hyperedges. Each e ∈ E is a non-empty subset of V
® Standard undirected graphs: each e consists of two vertices.
® degree of e = |e|
® degree of a vertex v = number of hyperedges e s.t. v ∈ e
® Two vertices are neighbors if there is a hyperedge connecting them
Agadir, 05192009 p. 46
Example of a hypergraph
[Figure: a hypergraph with 9 vertices (1–9) and 5 hyperedges (nets) a, b, c, d, e, together with its Boolean matrix representation A: rows = nets, columns = vertices, A(i, j) = 1 iff vertex j belongs to net i]
® Canonical hypergraph representation for sparse data (e.g., text)
Agadir, 05192009 p. 47
Hypergraph Coarsening
® Coarsening a hypergraph H = (V, E) means finding a ‘coarse’ approximation Ĥ = (V̂, Ê) to H, with |V̂| < |V|, which tries to retain as much as possible of the structure of the original hypergraph
® Idea: repeat the coarsening recursively.
® Result: a succession of smaller hypergraphs which approximate the original graph.
® Several methods exist. We use ‘matching’, which successively merges pairs of vertices
® Used in most graph partitioning methods: hMETIS, PaToH, Zoltan, . . .
® The algorithm successively selects pairs of vertices to merge – based on a measure of similarity of the vertices.
Agadir, 05192009 p. 48
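A minimal greedy-matching sketch in numpy; the toy net-by-vertex matrix, the shared-net similarity measure, and the first-fit tie-breaking are all assumptions for illustration, not necessarily the talk's actual heuristic:

```python
import numpy as np

# Assumed toy net-by-vertex Boolean matrix (rows = hyperedges, cols = vertices).
A = np.array([[1, 1, 0, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [1, 0, 0, 0, 0, 1]], dtype=float)

def coarsen_once(A):
    """One matching pass: greedily merge pairs of similar unmatched vertices."""
    n = A.shape[1]
    sim = A.T @ A                        # shared-net counts between vertices
    matched, groups = set(), []
    for i in range(n):
        if i in matched:
            continue
        best, best_s = None, 0
        for j in range(n):
            if j != i and j not in matched and sim[i, j] > best_s:
                best, best_s = j, sim[i, j]
        g = [i] if best is None else [i, best]
        matched.update(g)
        groups.append(g)
    # A merged vertex belongs to a net if any of its members does.
    Ac = np.stack([A[:, g].max(axis=1) for g in groups], axis=1)
    return Ac, groups

Ac, groups = coarsen_once(A)
print(groups, Ac.shape)   # roughly half as many vertices per pass
```

Applying `coarsen_once` recursively yields the succession of smaller hypergraphs described above.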
Application: Multilevel Dimension Reduction
Main Idea: coarsen to a certain level. Then use the resulting data set X̂ to find a projector from R^m to R^d. This projector can be used to project the original data or any new data.
[Figure: the graph of the data set X (m × n) is repeatedly coarsened down to a last level X_r (m × n_r), which is projected to Y_r (d × n_r)]
® Main gain: the dimension reduction is done with a much smaller set.
® Hope: not much loss compared to using the whole data
Agadir, 05192009 p. 49
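A sketch of this pipeline in numpy, with naive subsampling standing in for the actual graph coarsening (all sizes, data, and the choice of PCA as the projector are assumptions): the projector is computed on the coarse set only, then applied to the full data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, r = 10, 400, 3, 2               # assumed sizes; r coarsening levels
X = rng.standard_normal((m, n))          # full data set (toy)

# Naive subsampling stands in for graph coarsening here.
Xr = X[:, ::2 ** r]                      # coarse set: n_r = n / 2^r samples

# Compute the projector (PCA here) on the coarse set only ...
Xc = Xr - Xr.mean(axis=1, keepdims=True)
U, _, _ = np.linalg.svd(Xc, full_matrices=False)
V = U[:, :d]                             # m x d projector from the coarse set

# ... then apply it to the original data (or to any new data).
Y = V.T @ (X - X.mean(axis=1, keepdims=True))
print(Xr.shape, Y.shape)
```

The expensive factorization sees only n_r = n / 2^r samples; projecting the full set afterwards is a cheap matrix product.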
Application to text mining
® Recall the common approach:
1. Scale the data [e.g., TF-IDF scaling]
2. Perform a (partial) SVD on the resulting matrix: X ≈ U_d Σ_d V_d^T, where [x̂_1, x̂_2, . . . , x̂_n] = V_d^T ∈ R^{d×n}
3. Process the query by the same scaling (e.g., TF-IDF). Then: q̂ = Σ_d^{−1} U_d^T q ∈ R^d
4. Compute similarities in the d-dimensional space: s_i = ⟨q̂, x̂_i⟩ / (‖q̂‖ ‖x̂_i‖)
® Multilevel approach: replace the SVD (or any other dimension reduction) by a dimension reduction on the coarse set. Only difference: the TF-IDF scaling is done on the coarse set, not on the original set.
Agadir, 05192009 p. 50
Tests
Three public data sets were used for the experiments: Medline, Cran and NPL (cs.cornell.edu)
® Coarsening to a max. of 4 levels.

Data set            Medline   Cran   NPL
# documents         1033      1398   11429
# terms             7014      3763   7491
sparsity (%)        0.74      1.41   0.27
# queries           30        225    93
avg. # rel./query   23.2      8.2    22.4
Agadir, 05192009 p. 51
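The LSI baseline used in these experiments (scaling, truncated SVD, reduced query, cosine similarities) can be sketched in numpy; the tiny term-document matrix and query are assumed, and raw counts stand in for the TF-IDF scaling for brevity:

```python
import numpy as np

# Assumed tiny term-by-document matrix (rows = terms, cols = documents)
# and query; raw counts stand in for the TF-IDF scaling.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)
q = np.array([1, 1, 0, 0], dtype=float)

d = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Ud, Sd, xh = U[:, :d], np.diag(s[:d]), Vt[:d, :]   # X ~ Ud Sd Vd^T

qh = np.linalg.inv(Sd) @ Ud.T @ q        # query mapped to the reduced space

# Cosine similarities s_i = <qh, xh_i> / (|qh| |xh_i|) in d dimensions.
sims = (qh @ xh) / (np.linalg.norm(qh) * np.linalg.norm(xh, axis=0))
print(np.round(sims, 3))
```

In the multilevel variant, only the SVD step changes: it is computed on the coarse document set, while the similarities are still evaluated for all documents.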
Results with NPL
Statistics:

Level   # doc.   coarsen. time   optimal # dim.   optimal avg. precision
#1      11429    N/A             736              23.9%
#2      5717     3.68            592              23.5%
#3      2861     2.19            516              23.8%
#4      1434     1.50            533              23.3%
Agadir, 05192009 p. 52
Precision-Recall curves – NPL
[Figure: precision vs. recall on NPL for LSI and multilevel-LSI with r = 2, 3, 4 coarsening levels]
Agadir, 05192009 p. 53
CPU times for pre-processing (dimension reduction part) – NPL
[Figure: CPU time for the truncated SVD (secs) vs. # dimensions, for LSI and multilevel-LSI with r = 2, 3, 4]
Agadir, 05192009 p. 54
Conclusion
® The eigenvalue problems of data mining are not cheap to solve . . .
® . . . and the cost issue does not seem to bother practitioners too much for now.
® Ingredients that will become mandatory:
1. Fast algorithms that do not sacrifice quality
2. Avoid the SVD
3. In particular: Multilevel approaches
4. Multilinear algebra [tensors]
Agadir, 05192009 p. 55