
A Link Analysis Extension of Correspondence

Analysis for Mining Relational Databases


Luh Yen, Marco Saerens, Member, IEEE, and François Fouss
Abstract: This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing
both simple and multiple correspondence analysis. It is based on a random walk model through the database defining a Markov chain
having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements
(or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller, Markov
chain containing only the elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic
complementation [41]. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion map subspace
[42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined,
and to multiple correspondence analysis when the database takes the form of a simple star-schema. On the other hand, a kernel
version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is also introduced and the links
with spectral clustering are discussed. Several data sets are analyzed by using the proposed methodology, showing the usefulness of
the technique for extracting relationships in relational databases or graphs.
Index Terms: Graph mining, link analysis, kernel on a graph, diffusion map, correspondence analysis, dimensionality reduction,
statistical relational learning.

1 INTRODUCTION
Traditional statistical, machine learning, pattern recogni-
tion, and data mining approaches (see, for example, [28])
usually assume a random sample of independent objects
from a single relation. Many of these techniques have gone
through the extraction of knowledge from data (typically
extracted from relational databases), almost always leading,
in the end, to the classical double-entry tabular format,
containing features for a sample of the population. These
features are therefore used in order to learn from the sample,
provided that it is representative of the population as a
whole. However, real-world data coming from many fields
(such as the World Wide Web, marketing, social networks, or
biology; see [16]) are often multirelational and interrelated.
The work recently performed in statistical relational learning
[22], aiming at working with such data sets, incorporates
research topics, such as link analysis [36], [63], web mining
[1], [9], social network analysis [8], [66], or graph mining [11].
All these research fields intend to find and exploit links
between objects (in addition to features, as is also the case in
the field of spatial statistics [13], [53]), which could be of
various types and involved in different kinds of relation-
ships. The focus of the techniques has moved over from the
analysis of the features describing each instance belonging to
the population of interest (attribute value analysis) to the
analysis of the links existing between these instances
(relational analysis), in addition to the features.
This paper proposes a link-analysis-based
technique for discovering relationships existing
between elements of a relational database or, more
generally, a graph. More specifically, this work is based
on a random walk through the database defining a Markov
chain having as many states as elements in the database.
Suppose, for instance, we are interested in analyzing the
relationships between elements contained in two different
tables of a relational database. To this end, a two-step
procedure is developed. First, a much smaller, reduced,
Markov chain, only containing the elements of interest (typically the elements contained in the two tables) and
preserving the main characteristics of the initial chain, is
extracted by stochastic complementation [41]. An efficient
algorithm for extracting the reduced Markov chain from the
large, sparse, Markov chain representing the database is
proposed. Then, the reduced chain is analyzed by, for
instance, projecting the states in the subspace spanned by
the right eigenvectors of the transition matrix ([42], [43],
[46], [47]; called the basic diffusion map in this paper), or
by computing a kernel principal component analysis [54],
[57] on a diffusion map kernel computed from the reduced
graph and visualizing the results. Indeed, a valid graph
kernel based on the diffusion map distance, extending the
basic diffusion map to directed graphs, is introduced.
The motivations for developing this two-step procedure
are twofold. First, the computation would be cumbersome,
if not impossible, when dealing with the complete
database. Second, in many situations, the analyst is not
interested in studying all the relationships between all
elements of the database, but only a subset of them.
Moreover, if the whole set of elements in the database is
analyzed, the resulting mapping would be averaged out by
the numerous relationships and elements we are not
interested in; for instance, the principal axis would be
completely different. It would therefore not exclusively
reflect the relationships between the elements of interest.
Therefore, reducing the Markov chain by stochastic
complementation allows us to focus the analysis on the
elements and relationships we are interested in.
Interestingly enough, when dealing with a bipartite graph
(i.e., the database only contains two tables linked by one
relation), stochastic complementation followed by a basic
diffusion map is exactly equivalent to simple correspon-
dence analysis. On the other hand, when dealing with a star-
schema database (i.e., one central table linked to several
tables by different relations), this two-step procedure
reduces to multiple correspondence analysis. The proposed
methodology therefore extends correspondence analysis to
the analysis of a relational database.
In short, this paper has three main contributions:
. A two-step procedure for analyzing weighted
graphs or relational databases is proposed.
. It is shown that the suggested procedure extends
correspondence analysis.
. A kernel version of the diffusion map distance,
applicable to directed graphs, is introduced.
The paper is organized as follows: Section 2 introduces
the basic diffusion map distance and its natural kernel on a
graph. Section 3 introduces some basic notions of stochastic
complementation of a Markov chain. Section 4 presents the
two-step procedure for analyzing the relationships between
elements of different tables and establishes the equivalence
between the proposed methodology and correspondence
analysis in some special cases. Section 5 presents some
illustrative examples involving several data sets, while
Section 6 gives the conclusion.
2 THE DIFFUSION MAP DISTANCE AND ITS
NATURAL KERNEL MATRIX
In Section 2, the basic diffusion map distance [42], [43], [46],
[47] is briefly reviewed and some of its theoretical
justifications are detailed. Then, a natural kernel matrix is
derived from the diffusion map distance, providing a
meaningful similarity measure between nodes.
2.1 Notations and Definitions
Let us consider that we are given a weighted, directed,
graph G possibly defined from a relational database in the
following, obvious, way: each element of the database is a
node and each relation corresponds to a link (for a detailed
procedure allowing to build a graph from a relational
database, see [20]). The associated adjacency matrix A is
defined in a standard way as $a_{ij} = [\mathbf{A}]_{ij} = w_{ij}$ if node $i$ is connected to node $j$, and $a_{ij} = 0$ otherwise (say $G$ has $n$ nodes in total). The weight $w_{ij} > 0$ of the edge connecting node $i$ and node $j$ is set to have a larger value if the affinity between $i$ and $j$ is important. If no information about the strength of the relationship is available, we simply set $w_{ij} = 1$ (unweighted graph). We further assume that there are no self-loops ($w_{ii} = 0$ for $i = 1, \ldots, n$) and that the graph has a single connected component; that is, any node can be reached from any other node. If the graph is not connected, there is no relationship at all between the different components and the analysis has to be performed separately on each of them. It is therefore to be hoped that the graph modeling the relational database does not contain too many disconnected components; this can be considered as a limitation of our method. Partitioning a graph into connected components from its adjacency matrix can be done in $O(n^2)$ (see, for instance, [56]). Based on the adjacency matrix, the Laplacian matrix $\mathbf{L}$ of the graph is defined in the usual manner: $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D} = \mathbf{Diag}(a_{i.})$ is the generalized outdegree matrix with diagonal entries $d_{ii} = [\mathbf{D}]_{ii} = a_{i.} = \sum_{j=1}^{n} a_{ij}$. The column vector $\mathbf{d} = \mathbf{diag}(a_{i.})$ is simply the vector containing the outdegree of each node. Furthermore, the volume of the graph is defined as $v_g = \mathrm{vol}(G) = \sum_{i=1}^{n} d_{ii} = \sum_{i,j=1}^{n} a_{ij}$. Usually, we are dealing with symmetric adjacency matrices, in which case $\mathbf{L}$ is symmetric and positive semidefinite (see, for instance, [10]).
From this graph, we define a natural random walk
through the graph in the usual way by associating a state to
each node and assigning a transition probability to each
link. Thus, a random walker can jump from element to
element, and each element therefore represents a state of
the Markov chain describing the sequence of visited states.
A random variable $s(t)$ contains the current state of the Markov chain at time step $t$: if the random walker is in state $i$ at time $t$, then $s(t) = i$. The random walk is defined by the following single-step transition probabilities of jumping from any state $i = s(t)$ to an adjacent state $j = s(t+1)$: $\mathrm{P}(s(t+1) = j \,|\, s(t) = i) = a_{ij}/a_{i.} = p_{ij}$. The transition probabilities only depend on the current state and not on the past ones (first-order Markov chain). Since the graph is completely connected, the Markov chain is irreducible, that is, every state can be reached from any other state. If we denote the probability of being in state $i$ at time $t$ by $x_i(t) = \mathrm{P}(s(t) = i)$ and we define $\mathbf{P}$ as the transition matrix with entries $p_{ij}$, the evolution of the Markov chain is characterized by $\mathbf{x}(t+1) = \mathbf{P}^{T}\mathbf{x}(t)$, with $\mathbf{x}(0) = \mathbf{x}_0$ and $T$ being the matrix transpose. This provides the state probability distribution $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^{T}$ at time $t$ once the initial distribution $\mathbf{x}(0)$ is known. Moreover, we will denote as $\mathbf{x}_i(t)$ the column vector containing the probability distribution of finding the random walker in each state at time $t$ when starting from state $i$ at time $t = 0$. That is, the entries of the vector $\mathbf{x}_i(t)$ are $x_{ij}(t) = \mathrm{P}(s(t) = j \,|\, s(0) = i)$, $j = 1, \ldots, n$.

Since the Markov chain represents a random walk on the graph $G$, the transition matrix is simply $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$. Moreover, if the adjacency matrix $\mathbf{A}$ is symmetric, the Markov chain is reversible and the steady-state vector, $\boldsymbol{\pi}$, is simply proportional to the degree of each state [48], $\mathbf{d}$ (which has to be normalized in order to obtain a valid probability distribution). Moreover, this implies that all the eigenvalues (both left and right) of the transition matrix are real.
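As a concrete illustration of these definitions (not part of the original paper), a minimal NumPy sketch on a hypothetical toy graph:

```python
import numpy as np

# Hypothetical small undirected, weighted graph given by its adjacency matrix A.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 2.],
              [0., 0., 2., 0.]])

d = A.sum(axis=1)        # outdegrees a_{i.}
D = np.diag(d)           # generalized outdegree matrix
L = D - A                # Laplacian matrix
vol_G = d.sum()          # volume of the graph
P = A / d[:, None]       # transition matrix P = D^{-1} A; rows sum to 1
pi = d / d.sum()         # steady-state vector, proportional to the degrees (undirected case)
```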
2.2 The Diffusion Map Distance
In our two-step procedure, a diffusion map projection,
based on the so-called diffusion map distance, will be
performed after stochastic complementation. Now, since
the original definition of the diffusion map distance deals
only with undirected, aperiodic, Markov chains, it will first
be assumed in Section 2 that the reduced Markov chain,
obtained after stochastic complementation, is indeed un-
directed, aperiodic, and connected, in which case the
corresponding random walk defines an irreducible rever-
sible Markov chain. Notice, that it is not required that the
original adjacency matrix is irreducible and reversible; these
assumptions are only required for the reduced adjacency
matrix obtained after stochastic complementation (see the
discussion in Section 3.1). Moreover, some of these
assumptions will be relaxed in Section 2.3, when introdu-
cing the diffusion map kernel that is well-defined, even if
the graph is directed.
The original derivation of the diffusion map, introduced
independently by Nadler et al., and Pons and Latapy [42],
[43], [46], [47], is detailed in Section 2, but other interpreta-
tions of this mapping appeared in the literature (see the
discussion at the end of Section 2). Moreover, the basic
diffusion map is closely related to correspondence analysis,
as detailed in Section 4. For an application of the basic
diffusion map to dimensionality reduction, see [35].
Since P is aperiodic, irreducible, and reversible, it is well
known that all the eigenvalues of P are real and the
eigenvectors are also real (see, e.g., [7], p. 202). Moreover, all
its eigenvalues lie in $[-1, 1]$, and the eigenvalue 1 has multiplicity one [7]. With these assumptions, Nadler et al. and Pons and Latapy [42], [43], [46], [47] proposed to use as distance between states $i$ and $j$

$$d^2_{ij}(t) = \sum_{k=1}^{n} \frac{\left(x_{ik}(t) - x_{jk}(t)\right)^2}{\pi(k)} \propto \left(\mathbf{x}_i(t) - \mathbf{x}_j(t)\right)^{T}\mathbf{D}^{-1}\left(\mathbf{x}_i(t) - \mathbf{x}_j(t)\right), \qquad (2)$$
since, for a simple random walk on an undirected graph,
the entries of the steady-state vector are proportional (the
/ sign) to the generalized degree of each node (the total of
the elements of the corresponding row of the adjacency
matrix [48]). This distance, called the diffusion map
distance, corresponds to the sum of the squared differences
between the probability distribution of being in any state
after t transitions when starting (i.e., at time t 0) from two
different states, state i and state ,. In other words, two
nodes are similar when they diffuse through the network
and thus influence the network, in a similar way. This is a
natural definition which quantifies the similarity between
two states based on the evolution of the states' probability
distribution. Of course, when $i = j$, $d_{ij}(t) = 0$.
Nadler et al. [42], [43] showed that this distance measure
has a simple expression in terms of the right eigenvectors
of P:
$$d^2_{ij}(t) = \sum_{k=1}^{n} \lambda_k^{2t}\left(u_{ki} - u_{kj}\right)^2, \qquad (3)$$

where $u_{ki} = [\mathbf{u}_k]_i$ is component $i$ of the $k$th right eigenvector, $\mathbf{u}_k$, of $\mathbf{P}$ and $\lambda_k$ is its corresponding eigenvalue. As usual, the $\lambda_k$ are ordered by decreasing modulus, so that the contributions to the sum in (3) are decreasing with $k$. On the other hand, $\mathbf{x}_i(t)$ can easily be expressed [42], [43] in the space spanned by the left eigenvectors of $\mathbf{P}$, the $\mathbf{v}_k$,

$$\mathbf{x}_i(t) = \left(\mathbf{P}^{T}\right)^{t}\mathbf{e}_i = \sum_{k=1}^{n} \lambda_k^{t}\,\mathbf{v}_k\mathbf{u}_k^{T}\mathbf{e}_i = \sum_{k=1}^{n} \left(\lambda_k^{t} u_{ki}\right)\mathbf{v}_k, \qquad (4)$$

where $\mathbf{e}_i$ is the $i$th column of $\mathbf{I}$, $\mathbf{e}_i = [0, \ldots, 0, 1, 0, \ldots, 0]^{T}$, with the single 1 in position $i$.

The resulting mapping aims to represent each state $i$ in an $n$-dimensional euclidean space with coordinates $(|\lambda_2^t|\,u_{2i}, |\lambda_3^t|\,u_{3i}, \ldots, |\lambda_n^t|\,u_{ni})$, as in (4) (the first right eigenvector is trivial and is therefore discarded). Dimensions are ordered by decreasing modulus, $|\lambda_k^t|$. This original mapping introduced by Nadler and coauthors will be referred to as the basic diffusion map in this paper, in contrast with the diffusion map kernel ($\mathbf{K}_{DM}$) that will be introduced in Section 2.3.
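As an illustrative sketch (our own, assuming an undirected, connected graph so that the spectrum of $\mathbf{P}$ is real; the function name and toy matrix are hypothetical), the basic diffusion map coordinates $|\lambda_k^t|\,u_{ki}$ can be computed directly from the right eigenvectors of $\mathbf{P}$:

```python
import numpy as np

def basic_diffusion_map(P, t=1, dim=2):
    """Coordinates |lambda_k^t| * u_{ki} from the right eigenvectors of P,
    the trivial first eigenvector being discarded."""
    lam, U = np.linalg.eig(P)            # right eigenvectors as columns of U
    order = np.argsort(-np.abs(lam))     # sort by decreasing modulus
    lam, U = lam[order], U[:, order]
    coords = (np.abs(lam[1:dim + 1]) ** t) * U[:, 1:dim + 1]
    return np.real(coords)               # real for a reversible (undirected) chain

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 2.],
              [0., 0., 2., 0.]])
P = A / A.sum(axis=1, keepdims=True)
print(basic_diffusion_map(P, t=1, dim=2))
```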
The weighting factor, $\mathbf{D}^{-1}$, in (2) is necessary to obtain (3), since the $\mathbf{v}_k$ are not orthogonal. Instead, it can easily be shown that we have $\mathbf{v}_i^{T}\mathbf{D}^{-1}\mathbf{v}_j = \delta_{ij}$, which aims to redefine the inner product as $\langle \mathbf{x}, \mathbf{y}\rangle = \mathbf{x}^{T}\mathbf{D}^{-1}\mathbf{y}$, where the metric of the space is $\mathbf{D}^{-1}$ [7].
Notice also that there is a close relationship between spectral clustering (the mapping provided by the normalized Laplacian matrix; see, for instance, [15], [45], [65]) and the basic diffusion map. Indeed, a common embedding of the nodes consists of representing each node by the coordinates of the smallest nontrivial eigenvectors (corresponding to the smallest eigenvalues) of the normalized Laplacian matrix, $\tilde{\mathbf{L}} = \mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}$. More precisely, if $\mathbf{u}_k$ is the $k$th largest right eigenvector of the transition matrix $\mathbf{P}$ and $\tilde{\mathbf{l}}_k$ is the $k$th smallest nontrivial eigenvector of the normalized Laplacian matrix $\tilde{\mathbf{L}}$, we have (see the Appendix for the proof and details)

$$\mathbf{u}_k = \mathbf{D}^{-1/2}\,\tilde{\mathbf{l}}_k, \qquad (5)$$

and $\tilde{\mathbf{l}}_k$ is associated to eigenvalue $1 - \lambda_k$.
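The relation (5) is easy to check numerically; the following toy verification (ours, not from the paper) builds $\tilde{\mathbf{L}}$ for a small undirected graph and confirms that $\mathbf{u}_k = \mathbf{D}^{-1/2}\tilde{\mathbf{l}}_k$ is a right eigenvector of $\mathbf{P}$ associated with eigenvalue $1 - \mu_k$, where $\mu_k$ is the corresponding eigenvalue of $\tilde{\mathbf{L}}$:

```python
import numpy as np

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 2.],
              [0., 0., 2., 0.]])
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_tilde = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt   # normalized Laplacian (symmetric)

mu, l_tilde = np.linalg.eigh(L_tilde)                  # eigenvalues mu_k, eigenvectors as columns
U = D_inv_sqrt @ l_tilde                               # candidate right eigenvectors of P
P = A / d[:, None]
print(np.allclose(P @ U, U * (1.0 - mu)))              # True: P u_k = (1 - mu_k) u_k
```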
A subtle, still important, difference between this mapping and the one provided by the basic diffusion map concerns the order in which the dimensions are sorted. Indeed, for the basic diffusion map, the eigenvalues of the transition matrix $\mathbf{P}$ are ordered by decreasing modulus value. For this spectral clustering model, the eigenvalues are sorted by decreasing value (and not modulus), which can result in a different representation if $\mathbf{P}$ has large negative eigenvalues. This shows that the mappings provided by spectral clustering and by the basic diffusion map are closely related.
Notice that at least three other justifications of this
eigenvector-based mapping appeared before in the litera-
ture, and are briefly reviewed here. 1) It has been shown
that the entries of the subdominant right eigenvector of the
transition matrix P of an aperiodic, irreducible, reversible,
Markov chain can be interpreted as a relative distance to its
stationary distribution (see [60], Section 1.8.1, or [18],
Appendix). This distance may be regarded as an indicator
of the number of iterations required to reach this equili-
brium position, if the system starts in the state from which
the distance is being measured. These quantities are only
relative, but they serve as a means of comparison among the
states [60]. 2) The same embedding can be obtained by
minimizing the criterion $\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\left(z_i - z_j\right)^2 = 2\,\mathbf{z}^{T}\mathbf{L}\mathbf{z}$ [10], [26] subject to $\mathbf{z}^{T}\mathbf{D}\mathbf{z} = 1$, thereby penalizing the nodes having a large outdegree [74]. Here, $z_i$ is the coordinate of node $i$ on the axis and the vector $\mathbf{z}$ contains the $z_i$. The problem sums up in finding the smallest nontrivial eigenvector of $\mathbf{I} - \mathbf{P}$, which is the same as the second largest eigenvector of $\mathbf{P}$, and this is once more similar to the
basic diffusion map. Notice that this mapping has been
rediscovered and reinterpreted by Belkin and Niyogi [2], [3] in the context of nonlinear dimensionality reduction. 3) The last justification of the basic diffusion map, introduced in [15], is based on the concept of two-way partitioning of a graph [58]. Minimizing a normalized cut criterion while imposing that the membership vector is centered with respect to the metric $\mathbf{D}$ leads to exactly the same embedding as in the previous interpretation. Moreover, some authors [72] showed that applying a specific cut criterion to bipartite graphs leads to simple correspondence analysis. Notice that the second justification, 1), leads to exactly the same mapping as the basic diffusion map, while the third and fourth justifications, 2) and 3), lead to the same embedding space, but with a possibly different ordering and rescaling of the axes.
More generally, these mappings are, of course, also
related to graph embedding and nonlinear dimensionality
reduction, which have been highly studied topics in recent
years, especially in the manifold learning community (see,
e.g., [21], [30], [37], [67], for recent surveys or developments).
Experimental comparisons with popular nonlinear dimen-
sionality reduction techniques are presented in Section 5.
2.3 A Kernel View of the Diffusion Map Distance
We now introduce$^1$ a variant of the basic diffusion map
model introduced by Nadler et al. and Pons and Latapy
[42], [43], [46], [47], which is still well-defined when the
original graph is directed. In other words, we do not
assume that the initial adjacency matrix A is symmetric in
this section. This extension presents several advantages in
comparison with the original basic diffusion map:
1. the kernel version of the diffusion map is applicable
to directed graphs while the original model is
restricted to undirected graphs,
2. the extended model induces a valid kernel on a
graph,
3. the resulting matrix has the nice property of being
symmetric positive definitethe spectral decomposi-
tion can thus be computed on a symmetric positive
definite matrix, and finally
4. the resulting mapping is displayed in a euclidean
space in which the coordinate axes are set in the
directions of maximal variance by using (uncentered if
the kernel is not centered) kernel principal compo-
nent analysis [54], [57] or multidimensional scaling
[6], [12].
This kernel-based technique will be referred to as the
diffusion map kernel PCA or the $\mathbf{K}_{DM}$ PCA.
Let us define $\mathbf{W} = \mathbf{Diag}(\boldsymbol{\pi})^{-1}$, where $\boldsymbol{\pi}$ is the stationary distribution of the finite Markov chain. Remember that if the adjacency matrix is symmetric, the stationary distribution of the natural random walk is proportional to the degree of the nodes, $\mathbf{W} \propto \mathbf{D}^{-1}$ [48]. The diffusion map distance is therefore redefined as

$$d^2_{ij}(t) = \left(\mathbf{x}_i(t) - \mathbf{x}_j(t)\right)^{T}\mathbf{W}\left(\mathbf{x}_i(t) - \mathbf{x}_j(t)\right). \qquad (6)$$
Since $\mathbf{x}_i(t) = (\mathbf{P}^{T})^{t}\mathbf{e}_i$, (6) becomes

$$d^2_{ij}(t) = \left(\mathbf{e}_i - \mathbf{e}_j\right)^{T}\mathbf{P}^{t}\mathbf{W}\left(\mathbf{P}^{T}\right)^{t}\left(\mathbf{e}_i - \mathbf{e}_j\right) = \left(\mathbf{e}_i - \mathbf{e}_j\right)^{T}\mathbf{K}_{DM}\left(\mathbf{e}_i - \mathbf{e}_j\right) = [\mathbf{K}_{DM}]_{ii} + [\mathbf{K}_{DM}]_{jj} - [\mathbf{K}_{DM}]_{ij} - [\mathbf{K}_{DM}]_{ji}, \qquad (7)$$

where we defined

$$\mathbf{K}_{DM}(t) = \mathbf{P}^{t}\,\mathbf{W}\left(\mathbf{P}^{T}\right)^{t}, \qquad (8)$$
referred to as the diffusion map kernel. Thus, the matrix $\mathbf{K}_{DM}$ is the natural kernel (inner product matrix) associated to the squared diffusion map distances [6], [12]. It is clear that this matrix is symmetric positive semidefinite and contains inner products in a euclidean space, where the node vectors are exactly separated by $d_{ij}(t)$ (the proof is straightforward and can be found in [17], Appendix D, where the same reasoning was applied to the commute time kernel). It is therefore a valid kernel matrix.
Performing a (uncentered if the kernel is not centered) principal component analysis (PCA) in this embedding space aims to change the coordinate system by putting the new coordinate axes in the directions of maximal variances. From the theory of classical multidimensional scaling [6], [12], it suffices$^2$ to compute the $n$ first eigenvalues/eigenvectors of $\mathbf{K}_{DM}$ and to consider that these eigenvectors multiplied by the square root of the corresponding eigenvalues are the coordinates of the nodes in the principal component space spanned by these eigenvectors (see [6], [12]; for a similar application with the commute time kernel, see [51]; for the general definition of kernel PCA, see [54], [55]). In other words, we compute the $n$ first eigenvalues/eigenvectors of $\mathbf{K}_{DM}$: $\mathbf{K}_{DM}\mathbf{w}_k = \mu_k\mathbf{w}_k$, where the $\mathbf{w}_k$ are orthonormal. Then, we represent each node $i$ in an $n$-dimensional euclidean space with coordinates $(\sqrt{\mu_1}\,w_{1i}, \sqrt{\mu_2}\,w_{2i}, \ldots, \sqrt{\mu_n}\,w_{ni})$, where $w_{ki} = [\mathbf{w}_k]_i$ corresponds to element $i$ of the eigenvector $\mathbf{w}_k$ associated to eigenvalue $\mu_k$; this is the vector representation of state $i$ in the principal component space.
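The whole $\mathbf{K}_{DM}$ PCA procedure can be sketched in a few lines of NumPy (an illustration, not the authors' code; it assumes the chain defined by $\mathbf{A}$ is irreducible so that the stationary distribution exists, and the optional centering with $\mathbf{H} = \mathbf{I} - \mathbf{e}\mathbf{e}^T/n$ is discussed at the end of this section):

```python
import numpy as np

def kdm_pca(A, t=3, dim=2, center=True):
    """Diffusion map kernel PCA: build K_DM(t) = P^t W (P^T)^t and return the
    coordinates sqrt(mu_k) * w_{ki} on the first `dim` principal axes."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)            # transition matrix
    vals, vecs = np.linalg.eig(P.T)                 # left eigenvectors of P
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()                              # stationary distribution
    W = np.diag(1.0 / pi)                           # W = Diag(pi)^{-1}
    Pt = np.linalg.matrix_power(P, t)
    K = Pt @ W @ Pt.T                               # diffusion map kernel, eq. (8)
    if center:
        H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
        K = H @ K @ H
    mu, w = np.linalg.eigh((K + K.T) / 2)           # symmetrize against rounding noise
    order = np.argsort(-mu)
    mu, w = mu[order], w[:, order]
    return w[:, :dim] * np.sqrt(np.maximum(mu[:dim], 0.0))
```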
It can easily be shown that when the initial graph is undirected, this PCA based on the kernel matrix $\mathbf{K}_{DM}$ is similar to the diffusion map introduced in the last section, up to an isometry. Indeed, by the classical theory of multidimensional scaling [6], [12], the eigenvectors of the kernel matrix $\mathbf{K}_{DM}$ multiplied by the square root of the corresponding eigenvalues define coordinates in a euclidean space, where the observations are exactly separated by the distance $d_{ij}(t)$. Since this is exactly the property of the basic diffusion map (3), both representations are similar up to an isometry.
Notice that the resulting kernel matrix can easily be centered [40] by $\mathbf{H}\mathbf{K}_{DM}\mathbf{H}$ with $\mathbf{H} = \mathbf{I} - \mathbf{e}\mathbf{e}^{T}/n$, where $\mathbf{e}$ is a column vector all of whose elements are 1 (i.e., $\mathbf{e} = [1, 1, \ldots, 1]^{T}$). $\mathbf{H}$ is called the centering matrix. This aims to place the origin of the coordinates of the diffusion map at the center of gravity of the node vectors.
1. Part of the material of this section was published in a conference paper
[19] presenting a similar idea.
2. The proof must be slightly adapted in order to account for the fact that
the kernel matrix is not centered, as in [50].
2.4 Links between the Basic Diffusion Map and the
Kernel Diffusion Map
While both methods represent the graph in a euclidean space where the nodes are exactly separated by the distances defined by (2), and thus provide exactly the same embedding, the mappings are, however, different: the coordinate system in the embedding space differs for each method.
In the case of the basic diffusion map, the eigenvector $\mathbf{u}_k$ represents the $k$th coordinate of the nodes in the embed-
ding space. However, in the case of the diffusion map
kernel, since a kernel PCA is performed, the first
coordinate axis corresponds instead to the direction of
maximal variance in terms of diffusion map distance (2).
Therefore, the coordinate system used by the diffusion
map kernel is actually different from the one used by the
basic diffusion map.
Putting the coordinate system in the directions of
maximal variance, and thus computing a kernel PCA, is
probably more natural. We now show that there is a close
relationship between the two representations. Indeed, from
(4), we easily observe that the mapping provided by the
basic diffusion map remains the same whatever the value of the
parameter t, up to a scaling of each coordinate/dimension
(only the scaling changes). This is, in fact, not the case for
the kernel diffusion map. In fact, the mapping provided by
the diffusion map kernel tends to be the same as the one
provided by the basic diffusion map for growing values of t
in the case of an undirected graph. Indeed, it can be shown
that the kernel matrix can be rewritten as $\mathbf{K}_{DM} \propto \mathbf{U}\boldsymbol{\Lambda}^{2t}\mathbf{U}^{T}$, where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues of $\mathbf{P}$ and $\mathbf{U}$ contains the right eigenvectors of $\mathbf{P}$, the $\mathbf{u}_k$, as
columns. In this case, when t is large, every additional
dimension has a very small contribution in comparison
with the previous ones.
This fact will be illustrated in the experimental section
(i.e., Section 5). In practice, we observed that the two
mappings are already almost identical when t is equal to 5
or 6 (see, for instance, Fig. 3 in Section 5).
3 ANALYZING RELATIONS BY STOCHASTIC
COMPLEMENTATION
In Section 3, the concept of stochastic complementation [41]
is briefly reviewed and applied to the analysis of a graph
through the random-walk-on-a-graph model. From the
initial graph, a reduced graph containing only the nodes
of interest, and which is much easier to analyze, is built.
3.1 Computing a Reduced Markov Chain by
Stochastic Complementation
Suppose we are interested in analyzing the relationship
between two sets of nodes of interest. A reduced Markov
chain can be computed from the original chain, in the
following manner: First, the set of states is partitioned into
two subsets, $S_1$, corresponding to the nodes of interest to be analyzed, and $S_2$, corresponding to the remaining nodes, to be hidden. We further denote by $n_1$ and $n_2$ (with $n_1 + n_2 = n$) the number of states in $S_1$ and $S_2$, respectively; usually $n_2 \gg n_1$. Thus, the transition matrix is repartitioned as

$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{11} & \mathbf{P}_{12} \\ \mathbf{P}_{21} & \mathbf{P}_{22} \end{bmatrix}, \qquad (9)$$

where the first block row/column corresponds to $S_1$ and the second to $S_2$.
The idea is to censor the useless elements by masking
them during the random walk. That is, during any random
walk on the original chain, only the states belonging to $S_1$ are recorded; all the other reached states, belonging to subset $S_2$, are censored, and therefore, not recorded. One can show that the resulting reduced Markov chain obtained by censoring the states of $S_2$ is the stochastic complement of the original chain [41]. Thus, performing a stochastic complementation allows us to focus the analysis on the tables and elements representing the factors/features of interest. The reduced chain inherits all the characteristics from the original chain; it simply censors the useless states. The stochastic complement $\mathbf{P}_c$ of the chain, partitioned as in (9), is defined as (see, for instance, [41])

$$\mathbf{P}_c = \mathbf{P}_{11} + \mathbf{P}_{12}\left(\mathbf{I} - \mathbf{P}_{22}\right)^{-1}\mathbf{P}_{21}. \qquad (10)$$
It can be shown that the matrix $\mathbf{P}_c$ is stochastic, that is,
is stochastic, that is,
the sum of the elements of each row is equal to 1 [41]; it
therefore corresponds to a valid transition matrix between
states of interest. We will assume that this resulting
stochastic matrix is aperiodic and irreducible, that is,
primitive [48]. Indeed, Meyer showed in [41] that if the
initial chain is irreducible or aperiodic, so is the reduced
chain. Moreover, even if the initial chain is periodic, the
reduced chain frequently becomes aperiodic by stochastic
complementation [41]. One way to ensure the aperiodicity
of the reduced chain is to introduce a small positive
quantity on the diagonal of the adjacency matrix A, which
does not fundamentally change the model. Then, P has
nonzero diagonal entries and the stochastic complement,
$\mathbf{P}_c$, is primitive (see [41], Theorem 5.1).
Let us show that the reduced chain also represents a
random walk on a reduced graph $G_c$ containing only the nodes of interest. We therefore partition the matrices $\mathbf{A}$, $\mathbf{D}$, $\mathbf{L}$ as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}; \quad \mathbf{D} = \begin{bmatrix} \mathbf{D}_{1} & \mathbf{O} \\ \mathbf{O} & \mathbf{D}_{2} \end{bmatrix}; \quad \mathbf{L} = \begin{bmatrix} \mathbf{L}_{11} & \mathbf{L}_{12} \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix},$$

from which we easily find $\mathbf{P}_c = \mathbf{D}_1^{-1}\left(\mathbf{A}_{11} + \mathbf{A}_{12}\left(\mathbf{D}_2 - \mathbf{A}_{22}\right)^{-1}\mathbf{A}_{21}\right) = \mathbf{D}_1^{-1}\mathbf{A}_c$, where we defined $\mathbf{A}_c = \mathbf{A}_{11} + \mathbf{A}_{12}\left(\mathbf{D}_2 - \mathbf{A}_{22}\right)^{-1}\mathbf{A}_{21}$. Notice that if $\mathbf{A}$ is symmetric, $\mathbf{A}_c$ is symmetric as well (the reduced graph $G_c$ is then undirected). Since $\mathbf{P}_c$ is stochastic, we deduce that the diagonal matrix $\mathbf{D}_1$ contains the row sums of $\mathbf{A}_c$ and that the entries of $\mathbf{A}_c$ are nonnegative. The reduced chain thus corresponds to a random walk on the graph $G_c$ whose adjacency matrix is $\mathbf{A}_c$.
Moreover, the corresponding Laplacian matrix of the
graph $G_c$ can be obtained by

$$\mathbf{L}_c = \mathbf{D}_1 - \mathbf{A}_c = \mathbf{D}_1 - \mathbf{A}_{11} - \mathbf{A}_{12}\left(\mathbf{D}_2 - \mathbf{A}_{22}\right)^{-1}\mathbf{A}_{21} = \mathbf{L}_{11} - \mathbf{L}_{12}\mathbf{L}_{22}^{-1}\mathbf{L}_{21}, \qquad (11)$$

since $\mathbf{L}_{12} = -\mathbf{A}_{12}$ and $\mathbf{L}_{21} = -\mathbf{A}_{21}$. If the adjacency matrix $\mathbf{A}$ is symmetric, $\mathbf{L}_{11}$ ($\mathbf{L}_{22}$) is positive definite, since it is obtained from the positive semidefinite matrix $\mathbf{L}$ by deleting the rows associated to $S_2$ ($S_1$) and the corresponding columns, thereby eliminating the linear relationship. Notice that $\mathbf{L}_c$ is simply the Schur complement of $\mathbf{L}_{22}$ [27]. Thus, for an undirected graph $G$, instead of directly computing $\mathbf{P}_c$, it is more interesting to compute $\mathbf{L}_c$, which is symmetric positive definite, from which $\mathbf{P}_c$ can easily be deduced: $\mathbf{P}_c = \mathbf{I} - \mathbf{D}_1^{-1}\mathbf{L}_c$, directly following from $\mathbf{L}_c = \mathbf{D}_1 - \mathbf{A}_c$; see Section 3.2 for a proposition of iterative computation of $\mathbf{L}_c$.
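For small graphs, the reduction can be carried out directly with dense linear algebra; the sketch below (a hypothetical helper of ours) computes $\mathbf{A}_c$, $\mathbf{L}_c$, and $\mathbf{P}_c$ from the partitioned adjacency matrix, while Section 3.2 describes the sparse iterative alternative:

```python
import numpy as np

def stochastic_complement(A, idx_keep):
    """Reduce the random walk defined by A to the nodes in idx_keep (subset S1),
    censoring the remaining nodes (subset S2). Returns A_c, L_c, and P_c."""
    idx_keep = np.asarray(idx_keep)
    idx_hide = np.setdiff1d(np.arange(A.shape[0]), idx_keep)
    A11 = A[np.ix_(idx_keep, idx_keep)]
    A12 = A[np.ix_(idx_keep, idx_hide)]
    A21 = A[np.ix_(idx_hide, idx_keep)]
    A22 = A[np.ix_(idx_hide, idx_hide)]
    D1 = np.diag(A[idx_keep].sum(axis=1))            # outdegrees of the nodes of interest
    D2 = np.diag(A[idx_hide].sum(axis=1))            # outdegrees of the censored nodes
    Ac = A11 + A12 @ np.linalg.solve(D2 - A22, A21)  # A_c = A11 + A12 (D2 - A22)^{-1} A21
    Lc = D1 - Ac                                     # Schur complement of L22 in L
    Pc = np.linalg.solve(D1, Ac)                     # P_c = D1^{-1} A_c, row stochastic
    return Ac, Lc, Pc
```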
3.2 Iterative Computation of $\mathbf{L}_c$ for Large, Sparse Graphs
In order to compute $\mathbf{L}_c$ from (11), we need to evaluate $\mathbf{L}_{22}^{-1}\mathbf{L}_{21}$. We now show how this matrix can be computed iteratively by using, for instance, a simple Jacobi-like or Gauss-Seidel-like algorithm. In order to simplify the development, let us pick one column, say column $j$, of $\mathbf{L}_{21}$ and denote it as $\mathbf{b}_j$; there are $n_1 \ll n_2$ such columns. The problem is thus to compute $\mathbf{x}_j = \mathbf{L}_{22}^{-1}\mathbf{L}_{21}\mathbf{e}_j$ (column $j$ of $\mathbf{L}_{22}^{-1}\mathbf{L}_{21}$) such that

$$\mathbf{L}_{22}\mathbf{x}_j = \left(\mathbf{D}_2 - \mathbf{A}_{22}\right)\mathbf{x}_j = \mathbf{b}_j. \qquad (12)$$

By transforming this last equation, we obtain $\mathbf{x}_j = \mathbf{D}_2^{-1}\left(\mathbf{A}_{22}\mathbf{x}_j + \mathbf{b}_j\right) = \mathbf{P}_{22}\mathbf{x}_j + \mathbf{D}_2^{-1}\mathbf{b}_j$, which leads to the following iterative updating rule:

$$\hat{\mathbf{x}}_j \leftarrow \mathbf{P}_{22}\hat{\mathbf{x}}_j + \mathbf{D}_2^{-1}\mathbf{b}_j, \qquad (13)$$

where $\hat{\mathbf{x}}_j$ is an approximation of $\mathbf{x}_j$. This procedure converges, since the matrix $\mathbf{L}_{22}$ is irreducible and has weak diagonal dominance [70]. Initially, one can start, for instance, with $\hat{\mathbf{x}}_j = \mathbf{0}$. The computation of (13) fully exploits the sparseness, since only the nonzero entries of $\mathbf{P}_{22}$ contribute to the update. Of course, $\mathbf{D}_2^{-1}\mathbf{b}_j$ is precalculated. Once all the $\mathbf{x}_j$ have been computed in turn (only $n_1$ in total), the matrix $\mathbf{X}$ containing the $\mathbf{x}_j$ as columns is constructed; we thus have $\mathbf{L}_{22}^{-1}\mathbf{L}_{21} = \mathbf{X}$. The final step consists of computing $\mathbf{L}_c = \mathbf{L}_{11} - \mathbf{L}_{12}\mathbf{X}$.

The time complexity of this algorithm is $n_1$ times the complexity of solving one sparse system of $n_2$ linear equations. Now, if the graph is undirected, the matrix $\mathbf{L}_{22}$ is positive definite. Recent numerical analysis techniques have shown that positive definite sparse linear systems in an $n$-by-$n$ matrix with $m$ nonzero entries can be solved in time $O(nm)$, for instance, by using conjugate gradient techniques [59], [64]. Thus, apart from the matrix multiplication $\mathbf{L}_{12}\mathbf{X}$, the complexity is $O(n_1 n_2 m)$, where $m$ is the number of nonzero entries of $\mathbf{L}_{22}$. In practice, the matrix is usually very sparse and $m \simeq c\, n_2$ with $c$ quite small, resulting in $O(c\, n_1 n_2^2)$ with $n_1 \ll n_2$. The second step, i.e., the mapping by basic diffusion map or by diffusion map kernel PCA, has negligible complexity since it is performed on a reduced $n_1 \times n_1$ matrix. Of course, more sophisticated iterative algorithms can be used as well (see, for instance, [25], [49], [59]).
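A possible SciPy implementation of this scheme (a sketch under the assumptions above, with a fixed number of sweeps instead of a convergence test; function and variable names are ours):

```python
import numpy as np
import scipy.sparse as sp

def reduced_laplacian(L, idx_keep, n_sweeps=500):
    """Compute L_c = L11 - L12 (L22^{-1} L21) with the Jacobi-like rule (13),
    exploiting the sparsity of the graph (L is a scipy.sparse matrix)."""
    idx_keep = np.asarray(idx_keep)
    idx_hide = np.setdiff1d(np.arange(L.shape[0]), idx_keep)
    L = sp.csr_matrix(L)
    L11 = L[idx_keep][:, idx_keep]
    L12 = L[idx_keep][:, idx_hide]
    L21 = L[idx_hide][:, idx_keep]
    L22 = L[idx_hide][:, idx_hide]
    d2 = L22.diagonal()                               # outdegrees of the hidden nodes (no self-loops assumed)
    P22 = sp.diags(1.0 / d2) @ (sp.diags(d2) - L22)   # P22 = D2^{-1} A22
    B = L21.toarray() / d2[:, None]                   # precomputed D2^{-1} b_j for all n1 columns at once
    X = np.zeros_like(B)                              # approximations of x_j = column j of L22^{-1} L21
    for _ in range(n_sweeps):
        X = P22 @ X + B                               # updating rule (13)
    return L11.toarray() - L12 @ X                    # L_c = L11 - L12 X
```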
4 ANALYZING THE REDUCED MARKOV CHAIN WITH
THE BASIC DIFFUSION MAP: LINKS WITH
CORRESPONDENCE ANALYSIS
Once a reduced Markov chain containing only the nodes of
interest has been obtained, one may want to visualize the
graph in a low-dimensional space preserving as accurately
as possible the proximity between the nodes. This is the
second step of our procedure. For this purpose, we propose
to use the diffusion maps introduced in Sections 2.2 and 2.3.
Interestingly enough, computing a basic diffusion map on
the reduced Markov chain is equivalent to correspondence
analysis in two special cases of interest: a bipartite graph
and a star-schema database. Therefore, the proposed two-
step procedure can be considered as a generalization of
correspondence analysis.
Correspondence analysis (see, for instance, [23], [24],
[31], [62]) is a widely used multivariate statistical analysis
technique which is still the subject of much research effort
(see, for instance, [5], [29]). As stated, for instance, in [34],
simple correspondence analysis aims to provide insights into
the dependence of two categorical variables. The relation-
ships between the attributes of the two categorical
variables are usually analyzed through a biplot [23], a
2D representation of the attributes of both variables. The
coordinates of the attributes on the biplot are obtained by
computing the eigenvectors of a matrix. Many different
derivations of simple correspondence analysis have been
developed, allowing for different interpretations of the
technique, such as maximizing the correlation between two
discrete variables, reciprocal averaging, categorical discri-
minant analysis, scaling and quantification of categorical
variables, performing a principal component analysis
based on the chi-square distance, optimal scaling, dual
scaling, etc. [34]. Multiple correspondence analysis is the
extension of simple correspondence analysis to a larger
number of categorical variables.
4.1 Simple Correspondence Analysis
As stated before, simple correspondence analysis (see, for
instance, [23], [24], [31], [62]) aims to study the relation-
ships between two random variables $x_1$ and $x_2$ (the features), each having mutually exclusive, categorical, outcomes, denoted as attributes. Suppose the variable $x_1$ has $n_1$ observed attributes and the variable $x_2$ has $n_2$ observed attributes, each attribute being a possible outcome value for the feature. An experimenter makes a series of measurements of the features $x_1$, $x_2$ on a sample of $v_g$ individuals and records the outcomes in a frequency (also called contingency) table, $f_{ij}$, containing the number of individuals having both attribute $x_1 = i$ and attribute $x_2 = j$. In our relational database, this corresponds to two tables, each table corresponding to one variable, and containing the set of observed attributes (outcomes) of the variable. The two tables are linked by a single relation (see Fig. 1 for a simple example).
This situation can be modeled as a bipartite graph, where
each node corresponds to an attribute and links are only
defined between attributes of $x_1$ and attributes of $x_2$. The weight associated to each link is set to $w_{ij} = f_{ij}$, quantifying the strength of the relationship between $i$ and $j$.
Fig. 1. Trivial example of a single relation between two variables,
Document and Word. The Document table contains outcomes of
documents while the Word table contains outcomes of words.
The associated $n \times n$ adjacency matrix and the corresponding transition matrix can be factorized as

$$\mathbf{A} = \begin{bmatrix} \mathbf{O} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{O} \end{bmatrix}, \quad \mathbf{P} = \begin{bmatrix} \mathbf{O} & \mathbf{P}_{12} \\ \mathbf{P}_{21} & \mathbf{O} \end{bmatrix}, \qquad (14)$$
where O is a matrix full of zeroes.
Suppose we are interested in studying the relationships
between the attributes of the first variable $x_1$, which corresponds to the $n_1$ first elements. By stochastic complementation (see (10)), we easily obtain $\mathbf{P}_c = \mathbf{P}_{12}\mathbf{P}_{21} = \mathbf{D}_1^{-1}\mathbf{A}_{12}\mathbf{D}_2^{-1}\mathbf{A}_{21}$. Computing the diffusion map for $t = 1$ aims to extract the subdominant right-hand eigenvectors of $\mathbf{P}_c$, which exactly corresponds to correspondence analysis (see, for instance, [24], (4.3.5)). Moreover, it can easily be shown that $\mathbf{P}_c$ has only real nonnegative eigenvalues, and thus, ordering the eigenvalues by modulus is equivalent to ordering them by value. In correspondence analysis, eigenvalues reflect the relative importance of the dimensions: each eigenvalue is the amount of inertia a given dimension exhibits in the frequency table [31]. The basic diffusion map after stochastic complementation on this bipartite graph therefore leads to the same results as simple correspondence analysis.
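The equivalence can be illustrated in a few lines (a sketch of ours; the function name and the toy contingency table are hypothetical): starting from a frequency table $\mathbf{F}$, we form $\mathbf{P}_c = \mathbf{D}_1^{-1}\mathbf{F}\mathbf{D}_2^{-1}\mathbf{F}^{T}$ and extract its subdominant right eigenvectors, which play the role of the correspondence analysis scores of the row attributes:

```python
import numpy as np

def ca_row_scores(F, dim=2):
    """Stochastic complementation of the bipartite graph defined by the contingency
    table F, followed by a basic diffusion map with t = 1 (simple CA of the rows)."""
    D1_inv = np.diag(1.0 / F.sum(axis=1))
    D2_inv = np.diag(1.0 / F.sum(axis=0))
    Pc = D1_inv @ F @ D2_inv @ F.T              # reduced transition matrix on the row attributes
    lam, U = np.linalg.eig(Pc)
    order = np.argsort(-np.real(lam))           # eigenvalues are real and nonnegative here
    U = np.real(U[:, order])
    return U[:, 1:dim + 1]                      # subdominant right eigenvectors (trivial one dropped)

F = np.array([[10.,  2.,  3.,  1.],             # hypothetical contingency table
              [ 2., 12.,  1.,  2.],
              [ 1.,  1.,  9.,  8.]])
print(ca_row_scores(F))
```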
Relationships between simple correspondence analysis
and link analysis techniques have already been highlighted.
For instance, Zha et al. [72] showed the equivalence of a
normalized cut performed on a bipartite graph and simple
correspondence analysis. On the other hand, Saerens et al.
investigated the relationships between Kleinberg's HITS
algorithm [33] and correspondence analysis [18] or princi-
pal component analysis [50].
4.2 Multiple Correspondence Analysis
Multiple correspondence analysis assigns a numerical
score to each attribute of a set of $p > 2$ categorical variables [23], [62]. Suppose the data are available in the form of a star-schema: the individuals are contained in a main table and the categorical features of these individuals, such as education level, gender, etc., are contained in $p$ auxiliary,
satellite, tables. The corresponding graph is built naturally
by defining one node for each individual and for each
attribute while a link between an individual and an
attribute is defined when the individual possesses this
attribute. This configuration is known as a star-schema
[32] in the data warehouse or relational database fields
(see Fig. 2 for a trivial example).
Let us first renumber the nodes in such a way that the
attribute nodes appear first and the individual nodes last. Thus, the attributes-to-individuals matrix will be denoted by $\mathbf{A}_{12}$; it contains a 1 on entry $(i, j)$ when individual $j$ has attribute $i$, and 0 otherwise. The individuals-to-attributes matrix, the transpose of the attributes-to-individuals matrix, is $\mathbf{A}_{21}$. Thus, the adjacency matrix of the graph is

$$\mathbf{A} = \begin{bmatrix} \mathbf{O} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{O} \end{bmatrix}. \qquad (15)$$
Now, the individuals-to-attributes matrix exactly corre-
sponds to the data matrix $\mathbf{A}_{21} = \mathbf{X}$ containing, as rows, the individuals and, as columns, the attributes. Since the different features are coded as indicator (dummy) variables, a row of the $\mathbf{X}$ matrix contains a 1 if the individual has the corresponding attribute and 0 otherwise. We thus have $\mathbf{A}_{21} = \mathbf{X}$ and $\mathbf{A}_{12} = \mathbf{X}^{T}$.
Assuming binary weights, the matrix $\mathbf{D}_1$ contains on its diagonal the frequencies of each attribute, that is, the number of individuals having this attribute. On the other hand, $\mathbf{D}_2$ contains $p$ on each element of its diagonal, since each individual has exactly one attribute for each of the $p$ features (attributes corresponding to a feature are mutually exclusive). Thus, $\mathbf{D}_2 = p\,\mathbf{I}$ and $\mathbf{P}_{12} = \mathbf{D}_1^{-1}\mathbf{A}_{12}$, $\mathbf{P}_{21} = \mathbf{D}_2^{-1}\mathbf{A}_{21}$.
Suppose we are first interested in the relationships
between attribute nodes, thereby hiding the individual
nodes contained in the main table. By stochastic comple-
mentation (10), the corresponding attribute-attribute transi-
tion matrix is
$$\mathbf{P}_c = \mathbf{D}_1^{-1}\mathbf{A}_{12}\mathbf{D}_2^{-1}\mathbf{A}_{21} = \frac{1}{p}\,\mathbf{D}_1^{-1}\mathbf{A}_{12}\mathbf{A}_{21} = \frac{1}{p}\,\mathbf{D}_1^{-1}\mathbf{X}^{T}\mathbf{X} = \frac{1}{p}\,\mathbf{D}_1^{-1}\mathbf{F}, \qquad (16)$$

where the element $f_{ij}$ of the frequency matrix $\mathbf{F} = \mathbf{X}^{T}\mathbf{X}$, also called the Burt matrix, contains the number of co-occurrences of the two attributes $i$ and $j$, that is, the number of individuals having both attribute $i$ and attribute $j$.
The largest nontrivial right eigenvector of the matrix $\mathbf{P}_c$ represents the scores of the attributes in a multiple correspondence analysis. Thus, computing the eigenvalues and eigenvectors of $\mathbf{P}_c$ and displaying the nodes with coordinates proportional to the eigenvectors, weighted by the corresponding eigenvalue, exactly corresponds to multiple correspondence analysis (see [62], (10)). This is precisely what we obtain when computing the basic diffusion map on $\mathbf{P}_c$ with $t = 1$. Indeed, as for simple correspondence analysis, it can easily be shown that $\mathbf{P}_c$ has real nonnegative eigenvalues, and thus, ordering the eigenvalues by modulus is equivalent to ordering by value.
Fig. 2. Trivial example of a star-schema relation between a main
variable, Individual, and auxiliary variables, Gender, Education level,
etc. Each table contains outcomes of the corresponding random
variable.
If we are interested in the relationships between elements of the main table (the individuals) instead of the attributes, we obtain

$$\mathbf{P}_c = \frac{1}{p}\,\mathbf{A}_{21}\mathbf{D}_1^{-1}\mathbf{A}_{12} = \frac{1}{p}\,\mathbf{X}\mathbf{D}_1^{-1}\mathbf{X}^{T}, \qquad (17)$$

which, once again, is exactly the result obtained by multiple correspondence analysis (see [62], (9)).
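For illustration (a sketch of ours, assuming $\mathbf{X}$ is a 0/1 indicator matrix with exactly one attribute per feature for each individual), the reduced transition matrices of (16) and (17) are obtained as follows:

```python
import numpy as np

def mca_transition_matrices(X, p):
    """Attribute-attribute (16) and individual-individual (17) reduced transition
    matrices for a star-schema coded by the indicator matrix X with p features."""
    F = X.T @ X                              # Burt matrix of co-occurrence counts
    D1_inv = np.diag(1.0 / X.sum(axis=0))    # inverse attribute frequencies
    Pc_attributes = D1_inv @ F / p           # eq. (16)
    Pc_individuals = X @ D1_inv @ X.T / p    # eq. (17)
    return Pc_attributes, Pc_individuals
```

A basic diffusion map of either matrix then yields the multiple correspondence analysis scores, as explained above.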
5 EXPERIMENTS
This experimental section aims to answer four research
questions.
1. How does the graph mapping provided by the kernel PCA based on the diffusion map kernel ($\mathbf{K}_{DM}$ PCA) compare with the basic diffusion map projection?
2. Does the proposed two-step procedure (stochastic complementation + diffusion map) provide realistic subgraph drawings?
3. How does the diffusion map kernel combined with stochastic complementation compare to other popular dimensionality reduction techniques?
4. Does stochastic complementation accurately preserve the structural information?
5.1 Graph Embedding
Two simple graphs are studied in order to illustrate the
visualization of the graph structure by diffusion maps alone
(without stochastic complementation): the Zachary karate
club [71] and the dolphins social network [38].
5.1.1 Zachary Karate Club
Zachary has studied the relations between the 34 members
of a karate club. A disagreement between the club instructor
(node 1) and the administrator (node 34) resulted in the
split of the club into two parts. Each member of the club is
represented by a node of the graph and the weight between
nodes $i$ and $j$ is set to be the number of times member $i$ and member $j$ met outside the club. The Ucinet drawing of the network is shown in Fig. 3a. The friendship network built between members allows us to discover how the club split, but its mapping also allows us to study the proximity between the different members.

It can be observed from the graph embeddings provided by the basic diffusion map (Fig. 3c) that the value of parameter $t$ has no influence on the nodes' position, but only on the axis scaling (remember (4)). On the contrary, the influence of the value of parameter $t$ is clearly visible on the $\mathbf{K}_{DM}$ PCA mapping (Fig. 3d).
However, the projection of a graph on a 2D space usually
leads to a loss of information. The information preservation
ratio can be estimated using $(\lambda_1 + \lambda_2)/\sum_{i}\lambda_i$; it was computed for the 2D mapping of the network using the basic diffusion map and the $\mathbf{K}_{DM}$ PCA, and is shown in Fig. 3b. It can be observed that the ratio increases with $t$ but, overall, the information is better preserved with the $\mathbf{K}_{DM}$ PCA than with the basic diffusion map. This is due to a better choice of the projection axes for the $\mathbf{K}_{DM}$ PCA, which are oriented in the directions of maximal variance. Since a proximity on the mapping can be interpreted as a strong relationship between members, the social community structure is clearly observable visually from Figs. 3c and 3d. For the $\mathbf{K}_{DM}$ PCA, the choice of the value for the parameter $t$ depends on the graph property that the user wishes to highlight. When $t$ is low ($t = 1$), nodes with few relevant connections are considered as outliers and rejected from the community. On the contrary, when $t$ increases, the effect of marginal nodes fades and only the global member structure, similar to the basic diffusion map, is visible.
5.1.2 Dolphins Social Network
This unweighted graph represents the associations between
bottlenose dolphins from the same community, observed
over a period of 7 years. Only the basic diffusion map with $t = 1$ is shown (since, once again, the parameter $t$ has no influence on the mapping) while the $\mathbf{K}_{DM}$ PCA is computed for $t = 1$, 3, and 6. It can be observed (see Fig. 4) from the $\mathbf{K}_{DM}$ PCA that a dolphin (the 61st member, Zig) is rejected far away from the community due to his lack of interaction with other dolphins. His only contact is also poorly connected with the remaining community. As explained in Section 2 and confirmed in Fig. 4, the basic diffusion map and the $\mathbf{K}_{DM}$ PCA mappings become similar when $t$ increases. For instance, when $t = 6$ or more, the outlier is no longer highlighted and only the main structure of the network is visible. Two main subgroups can be identified from the mapping (notice that Newman and Girvan [44] actually identified four subcommunities by clustering).
5.2 Analyzing the Effect of Stochastic
Complementation on a Real-World Data Set
This second experiment aims to illustrate the two-step
mapping procedure, i.e., first applying a stochastic comple-
mentationandthencomputingthe K
DM
PCA, ona real-world
data setthe newsgroups data. However, for illustrative
purposes, the procedure is first applied on a toy example.
5.2.1 Toy Example
The graph (see Fig. 5) is composed of four objects ($o_1$, $o_2$, $o_3$, and $o_4$) belonging to two different classes ($c_1$ and $c_2$). Each object is also connected to one or many of the five attributes ($a_1, \ldots, a_5$). The reduced graph mapping obtained by our two-step procedure allows us to highlight the relations between the attribute values (i.e., the $a$ nodes) and the classes (i.e., the $c$ nodes). To achieve this goal, the object nodes are eliminated by performing a stochastic complementation: only the $a$ and $c$ nodes are kept. The resulting subgraph is displayed on a 2D plane by performing a $\mathbf{K}_{DM}$ PCA. Since the connectivity between nodes $a_1$ and $a_2$ ($a_3$, $a_4$, and $a_5$) is larger than with the remaining nodes, these two (three) nodes are close together on the resulting map. Moreover, node $c_1$ ($c_2$) is highly connected to nodes $a_1$, $a_2$ ($a_3$, $a_4$, and $a_5$) through indirect links and is therefore displayed close to these nodes.
5.2.2 Newsgroups Data Set
The real-world data set studied in this section is the
newsgroups data set. It is composed of about 20,000 un-
structured documents, taken from 20 discussion groups
(newsgroups) of the Usenet diffusion list. For ease of
interpretation, we decided to limit the data set size by
randomly sampling 150 documents out of three slightly
correlated topics (sport/baseball, politics/mideast, and
space/general; 50 documents from each topic). Those
150 documents are preprocessed as described in [68], [69].
The resulting graph is composed of 150 document nodes,
564 term nodes, and three topic nodes representing the
topics of the documents. Each document is connected to its
corresponding topic node with a weight fixed to 1. The
weights between the documents and the terms are set equal
to the tf.idf factor and normalized in order to obtain
normalized weights between 0 and 1 [68], [69].
Thus, the class (or topics) nodes are connected to
document nodes of the corresponding topics, and each
document is also connected to terms contained in
the document. Drawing a parallel with our illustrative
example (see Fig. 5), topic nodes correspond to $c$-nodes, document nodes to $o$-nodes, and terms to $a$-nodes. The goal of this experiment is to study the similarity between the terms and the topics through their connections with the document nodes. The reduced Markov chain is computed by setting $S_1$ to the nodes of the graph corresponding to the terms and the topics. The remaining nodes (documents) are rejected in the subgroup $S_2$. Fig. 6 shows the embedding of the terms used in the 150 documents of the newsgroups subset, as well as the topics of the documents.
Fig. 3. Zachary karate social network: (a) Ucinet projection of the Zachary karate network, (b) the information preservation ratio of the 2D projection as a function of the parameter $t$ for the basic diffusion map (diffusion map) and the $\mathbf{K}_{DM}$ PCA, (c) the graph mapping obtained by the basic diffusion map ($t = 1$ and 4), and (d) the graph mapping obtained by a $\mathbf{K}_{DM}$ PCA ($t = 1$ and 4).
The $\mathbf{K}_{DM}$ PCA quickly provides the same mapping as the basic diffusion map when increasing the value of $t$. However, it can be observed on the embedding with $t = 1$ that several nodes are rejected outside the triangle, far away from the other nodes of the graph.
A new mapping, where the terms corresponding to each node are also displayed (for visualization convenience, only terms cited by 25 documents or more are shown) for the $\mathbf{K}_{DM}$ PCA with $t = 1$, is shown in Fig. 7. It can be observed that a group of terms is stuck near each topic node, denoting their relevance to this topic (i.e., "player," "win," "hit," and "team" for the sport topic). We also observe that terms lying in-between two topics are also commonly used by both topics ("human," "nation," and "European" seem to be words used in discussions on politics as well as on space), or centered in the middle of the three topics (common terms without any specificity, such as "work," "Usa," or "make").
Globally, the projection provides a good representation of
the use of the terms in the different discussion groups for
both the basic diffusion map and the $\mathbf{K}_{DM}$ PCA. Terms
rejected outside the triangle are often only cited by few
documents and seem to have few links with other terms.
They are probably out of topic, as, for instance, the series of terms on printers in the outlier group on the left of the sport topic ("printer," "Packard," or "Hewlett").
5.3 Graph Reduction Influence and Embedding
Comparison
The objective of this experiment is twofold. The first aim is
to study the influence of stochastic complementation on
graph mapping. The second one is to compare five popular
dimensionality reduction methods, namely, the diffusion
map kernel PCA (K
DM
PCA or simply KDM), the Laplacian
Eigenmap (LE) [3], the Curvilinear Component Analysis
(CCA) [14], Sammons nonlinear Mapping (SM) [52], and
the classical Multidimensional Scaling [6], [12], based on
geodesic distances (MDS). For CCA, SM, and MDS, the
distance matrix is given by the shortest path distance
computed on the reduced graph whose weights are set to
the inverse of the entries of the adjacency matrix obtained
by stochastic complementation. Notice that the MDS
method computed from the geodesic distance on a graph
is also known as the ISOMAP method after [61]. Provided
that the resulting reduced Markov chain is usually dense,
the time complexity of each algorithm is as follows: For
$\mathbf{K}_{DM}$ PCA, LE, and MDS, the problem is to compute the $d$ dominant eigenvectors of a square matrix since the graph is mapped on a $d$-dimensional space, which is $O(d\,t\,n_1^2)$, where $n_1$ is the number of nodes of interest being displayed and $t$ is the number of iterations of the power method. For SM and CCA, the complexity is about $O(t\,n_1^2)$, where $t$ is the number of iterations (these algorithms are iterative by nature).
Fig. 5. Toy example illustrating our two-step procedure (stochastic complementation followed by $\mathbf{K}_{DM}$ PCA).
Fig. 4. Dolphins social network: The graph mapping obtained by the basic diffusion map ($t = 1$; upper left figure) and using the $\mathbf{K}_{DM}$ PCA ($t = 1$, 3, and 6).
On the other hand, computing the shortest path distance matrix takes $O(n_1^2 \log n_1)$. Thus, each algorithm has a time complexity between $O(n_1^2)$ and $O(n_1^3)$.
In this experiment, we address the task of classification
of unlabeled nodes in partially labeled graphs, that is,
semisupervised classification on a graph [73]. Notice that
the goal of this experiment is not to design a state-of-the-
art semisupervised classifier; rather it is to study the
performance of the proposed method, in comparison with
other embedding methods.
Three graphs are investigated. The first graph is
constructed from the well-known Iris data set [4]. The
weight (affinity) between nodes representing samples is
provided by $w_{ij} = \exp\!\left[-d_{ij}^2/\sigma^2\right]$, where $d_{ij}$ is the euclidean distance in the feature space and $\sigma^2$ is simply the sample
variance. The classes are the three iris species. The second
graph is extracted from the IMDb movie database [39]. Here,
1,126 movies are linked to form a connected graph: an edge
is added between two movies if they share common
production companies. Each node is classified to be a high-
or low-income movie (two classes). The last graph,
extracted from the CORA data set [39], is composed of
scientific papers from three topics. A citation graph is built
upon the data set, where two papers are linked if the first
paper cites the second one. The tested graph contains
1,410 nodes divided into three classes representing machine
learning research topics.
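The affinity used for the Iris graph can be sketched as follows (an illustration only; how $\sigma^2$ is pooled over the features is our assumption):

```python
import numpy as np

def gaussian_affinity_graph(features):
    """Affinity w_ij = exp(-d_ij^2 / sigma^2) between samples, with sigma^2 taken
    here as the pooled sample variance of the features (assumption)."""
    diff = features[:, None, :] - features[None, :, :]
    d2 = (diff ** 2).sum(axis=2)          # squared euclidean distances
    W = np.exp(-d2 / features.var())
    np.fill_diagonal(W, 0.0)              # no self-loops
    return W
```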
For each of these three graphs, extra nodes are added to
represent the class labels (called the class nodes). Each class
node is connected to the graph nodes of the corresponding
class. Moreover, in order to define cross-validation folds,
these graph nodes are randomly split into training sets and
test sets (called the training nodes and the test nodes,
respectively), the edges between the test nodes and the class
nodes being removed. The graph is then reduced to the test
nodes and to the class nodes by stochastic complementation
(the training nodes are rejected in the $S_2$ subset, and thus, censored), and projected into a 2D space by applying one of the projection algorithms described before.
Fig. 7. Newsgroups data set: Example of stochastic complementation followed by a $\mathbf{K}_{DM}$ PCA ($t = 1$). Only terms cited by at least 25 documents and peripheral terms are displayed.
Fig. 6. Newsgroups data set: Example of stochastic complementation followed by a basic diffusion map (upper left figure) and by a $\mathbf{K}_{DM}$ PCA with $t = 1$ (upper right), $t = 3$ (lower left), and $t = 5$ (lower right). Terms and topic nodes are displayed jointly.
If the relationship between the test nodes and the class nodes is accurately reconstructed in the reduced graph, the nodes from the test set should be projected close to the class node of their corresponding class. We report the classification accuracy for several labeling rates, i.e., portions of unlabeled nodes which constitute the test set. The proportion of test nodes varies between 50 percent of the graph nodes (twofold cross-validation) and 10 percent (10-fold cross-validation). This means that the proportion of training nodes left apart (censored) by stochastic complementation increases with the number of folds. The whole cross-validation procedure is repeated 10 times (10 runs) and the classification accuracy averaged over these 10 runs is reported, as well as the 95 percent confidence interval.
For classification, the assigned label of each test node is
simply the label provided by the nearest class node, in terms
of euclidean distance in the 2D embedding space. This will allow us to assess whether the class information is correctly preserved during stochastic complementation and 2D dimensionality reduction. The parameter $t$ of the $\mathbf{K}_{DM}$ PCA is set to 5, in view of our preliminary experiments.
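The decision rule is straightforward; a sketch (function and argument names are ours) of the nearest-class-node assignment in the 2D embedding:

```python
import numpy as np

def nearest_class_node_labels(embedding, class_node_idx, test_node_idx, class_labels):
    """Assign to each test node the label of the closest class node
    (euclidean distance in the 2D embedding)."""
    class_coords = embedding[class_node_idx]          # (n_classes, 2)
    test_coords = embedding[test_node_idx]            # (n_test, 2)
    dists = np.linalg.norm(test_coords[:, None, :] - class_coords[None, :, :], axis=2)
    return np.asarray(class_labels)[np.argmin(dists, axis=1)]
```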
Figs. 8a, 8b, and 8c show the classification accuracy, as
well as the 95 percent confidence interval, obtained on the
three investigated graphs for different training/test set
partitioning (folds). The x-axis represents the number of
folds, and thus, an increasing number of nodes left apart
(censored) by stochastic complementation (from 0, 50, . . . , up
to 90 percent). As a baseline, the whole original graph
(corresponding to one single fold and referred to as 1-fold) is
also projected without removing any class link and without
performing a stochastic complementation; this situation
represents the ideal case, since all the class information is
kept. All the methods should obtain a good accuracy score in
this setting; this is indeed what is observed.
First, we observe that, although obtaining very good
performance when projecting the original graph (1-fold),
CCA and SM perform poorly when the number of folds,
and thus, the amount of censored nodes, increases. On the
other hand, LE is quite unstable, performing poorly on the
CORA data set. This means that stochastic complementa-
tion combined with CCA, SM, or LE does not work
properly. On the contrary, the performance of $K_{DM}$ PCA
and MDS remains fairly stable; for instance, the average
decrease in performance of $K_{DM}$ PCA is around 10 percent,
in comparison with the mapping of the original graph
(from 1-fold to 2-fold, where 50 percent of the nodes are
censored), which remains reasonable. MDS offers a good
alternative to $K_{DM}$ PCA, showing competitive performance;
however, it involves the computation of the all-pairs
shortest path distance.
These results are confirmed when displaying the map-
pings. Figs. 8d, 8e, 8f, 8g, and 8h show a mapping example of
the test nodes, as well as the class nodes (the white markers)
of the CORA graph, for the 10-fold cross-validation setting.
Thus, only 10 percent of the graph nodes are unlabeled and
projected after stochastic complementation of the 90 percent
remaining nodes. It can be observed that the Laplacian
Eigenmap (Fig. 8e) and the MDS (Fig. 8h) managed to
separate the different classes, but mostly in terms of angular
similarity. On the $K_{DM}$ PCA mapping (Fig. 8d), the class
nodes are well-located, at the center of the set of nodes
belonging to the class. On the other hand, the mappings
provided by CCA and SM after stochastic complementation
do not accurately preserve the class information.
5.4 Discussion of the Results
Let us now come back to our research questions. As a first
observation, we can say that the two-step procedure
(stochastic complementation followed by a diffusion map
projection) provides an embedding in a low-dimensional
subspace from which useful information can be extracted.
Indeed, the experiments show that highly related elements
are displayed close together while poorly related elements
tend to be drawn far apart. This is quite similar to
correspondence analysis to which the procedure is closely
related. Second, it seems that stochastic complementation
reasonably preserves proximity information, when com-
bined with a diffusion map ($K_{DM}$ PCA) or an ISOMAP
projection (MDS). For the diffusion map, this is to be expected,
since both stochastic complementation and the diffusion
map distance are based on a Markov chain model; sto-
chastic complementation is the natural technique for
censoring states of a Markov chain. On the contrary,
stochastic complementation should not be combined with
a Laplacian Eigenmap, a curvilinear component analysis, or
a Sammon nonlinear mapping: the resulting mapping is
not accurate. Finally, the $K_{DM}$ PCA provides exactly the
same results as the basic diffusion map when t is large.
However, when the parameter t is low, the resulting
projection tends to highlight the outlier nodes and to
magnify the relative differences between nodes. It is
therefore recommended to display a whole range of
mappings for several different values of t.
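As an illustration of this recommendation, the following sketch (our own simplified reimplementation of the standard diffusion-map construction of Nadler et al. [42], not the authors' exact code) produces one embedding of a reduced transition matrix Pc per value of t; the names Pc and basic_diffusion_map, and the choice of a 2D embedding, are assumptions for illustration.

```python
import numpy as np

def basic_diffusion_map(P, t, dim=2):
    """Embed the nodes of a Markov chain P with the basic diffusion map at time t."""
    lam, U = np.linalg.eig(P)
    # For a chain derived from an undirected graph the spectrum is real;
    # the real part is taken here only to discard numerical noise.
    lam, U = np.real(lam), np.real(U)
    order = np.argsort(-np.abs(lam))              # decreasing |eigenvalue|
    lam, U = lam[order], U[:, order]
    # Skip the trivial eigenvector (eigenvalue 1); scale the next ones by lambda^t.
    return U[:, 1:dim + 1] * (lam[1:dim + 1] ** t)

# for t in (1, 3, 5):
#     Z = basic_diffusion_map(Pc, t)              # one 2D mapping per value of t
```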
6 CONCLUSIONS AND FURTHER WORK
This work introduced a link-analysis-based technique
for analyzing the relationships existing in relational
databases. The database is viewed as a graph, where the
nodes correspond to the elements contained in the tables
and the links correspond to the relations between the tables.
A two-step procedure is defined for analyzing the relation-
ships between elements of interest contained in a table, or a
subset of tables. More precisely, this work 1) proposes to
use stochastic complementation for extracting a subgraph
containing the elements of interest from the original graph
and 2) introduces a kernel-based extension of the basic
diffusion map for displaying and analyzing the reduced
subgraph. It is shown that the resulting method is closely
related to correspondence analysis.
Several data sets are analyzed by using this procedure,
showing that it seems to be well-suited for analyzing
relationships between elements. Indeed, stochastic comple-
mentation considerably reduces the original graph and
allows the analysis to focus on the elements of interest,
without having to define a state of the Markov chain for
each element of the relational database. However, one
fundamental limitation of this method is that the relational
database could contain too many disconnected components,
in which case our link analysis approach is almost useless.
Moreover, it is clearly not always an easy task to extract a
graph from a relational database, especially when the
database is huge. These are the two main drawbacks of
the proposed two-step procedure.
Further work will be devoted to the application of this
methodology to fuzzy SQL queries or fuzzy information
retrieval. The objective is to retrieve not only the elements
strictly complying with the constraints of the SQL query,
but also the elements that almost comply with these
constraints and are therefore close to the target elements.
We will also evaluate the proposed methodology on real
relational databases.
Fig. 8. (a)-(c) Classification accuracy obtained by the five compared projection methods for the Iris ((a), three classes), IMDb ((b), two classes), and
Cora ((c), three classes) data sets, respectively, as a function of training/test set partitioning (number of folds). (d)-(h) The mapping of 10 percent of the
Cora graph (10-fold setting) obtained by the five projection methods. The compared methods are the diffusion map kernel ((d), $K_{DM}$ PCA, or KDM),
the Laplacian Eigenmap ((e), LE), the Curvilinear Component Analysis ((f), CCA), the Sammon nonlinear Mapping ((g), SM), and the
Multidimensional Scaling or ISOMAP ((h), MDS). The class label nodes are represented by white markers.
APPENDIX
SOME LINKS BETWEEN THE BASIC DIFFUSION MAP
AND SPECTRAL CLUSTERING
Suppose the right eigenvectors and eigenvalues of matrix $\mathbf{P}$
are $\mathbf{u}_k$ and $\lambda_k$. Then, the matrix $\mathbf{I} - \mathbf{P}$ has the same
eigenvectors as $\mathbf{P}$ and eigenvalues given by $1 - \lambda_k$. There-
fore, the largest eigenvectors of $\mathbf{P}$ correspond to the smallest
eigenvectors of $\mathbf{I} - \mathbf{P}$.
Assuming a connected undirected graph, we will now
express the eigenvectors of $\mathbf{P}$ in terms of the eigenvectors
$\tilde{\mathbf{l}}_k$ of the normalized Laplacian matrix, $\tilde{\mathbf{L}} = \mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}$. We
thus have

$$(\mathbf{I} - \mathbf{P})\,\mathbf{u}_k = (1 - \lambda_k)\,\mathbf{u}_k. \tag{18}$$

But

$$\mathbf{I} - \mathbf{P} = \mathbf{D}^{-1}(\mathbf{D} - \mathbf{A}) = \mathbf{D}^{-1}\mathbf{L} = \mathbf{D}^{-1/2}\bigl(\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}\bigr)\mathbf{D}^{1/2} = \mathbf{D}^{-1/2}\tilde{\mathbf{L}}\mathbf{D}^{1/2}. \tag{19}$$

Inserting this result in (18) provides $\tilde{\mathbf{L}}\mathbf{D}^{1/2}\mathbf{u}_k = (1 - \lambda_k)\,\mathbf{D}^{1/2}\mathbf{u}_k$. Thus, the (unnormalized) eigenvectors of
$\tilde{\mathbf{L}}$ are $\tilde{\mathbf{l}}_k = \mathbf{D}^{1/2}\mathbf{u}_k$, and are associated to the eigenvalues
$1 - \lambda_k$.
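A small numerical check of this relation on a toy undirected graph (our own illustration, not part of the original paper):

```python
import numpy as np

# Five-node undirected connected graph: the cycle 0-1-2-3-4 plus the chord 1-4.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)                          # degrees
P = A / d[:, None]                         # transition matrix P = D^{-1} A
L = np.diag(d) - A                         # unnormalized Laplacian
D_sqrt = np.diag(np.sqrt(d))
D_isqrt = np.diag(1.0 / np.sqrt(d))
Lt = D_isqrt @ L @ D_isqrt                 # normalized Laplacian

lam, U = np.linalg.eig(P)                  # right eigenvectors/eigenvalues of P
for k in range(len(lam)):
    lhs = Lt @ (D_sqrt @ U[:, k])
    rhs = (1.0 - lam[k]) * (D_sqrt @ U[:, k])
    assert np.allclose(lhs, rhs)           # L~ D^{1/2} u_k = (1 - lambda_k) D^{1/2} u_k
print("Relation verified for all eigenpairs.")
```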
ACKNOWLEDGMENTS
Part of this work has been funded by projects with the
Région Wallonne and the Belgian Politique Scientifique
Fédérale. We thank these institutions for giving us the
opportunity to conduct both fundamental and applied
research. We also thank Professor Jef Wijsen, from the
Université de Mons, Belgium, and Dr John Lee, from the
Université Catholique de Louvain, Belgium, for the inter-
esting discussions and suggestions about this work.
REFERENCES
[1] P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the
Web: Probabilistic Methods and Algorithms. John Wiley & Sons, 2003.
[2] M. Belkin and P. Niyogi, Laplacian Eigenmaps and Spectral
Techniques for Embedding and Clustering, Advances in Neural
Information Processing Systems, vol. 14, pp. 585-591, MIT Press, 2001.
[3] M. Belkin and P. Niyogi, Laplacian Eigenmaps for Dimension-
ality Reduction and Data Representation, Neural Computation,
vol. 15, pp. 1373-1396, 2003.
[4] C. Blake, E. Keogh, and C. Merz, UCI Repository of Machine
Learning Databases, Univ. California, Dept. of Information and
Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] J. Blasius, M. Greenacre, P. Groenen, and M. van de Velden,
Special Issue on Correspondence Analysis and Related Meth-
ods, Computational Statistics and Data Analysis, vol. 53, no. 8,
pp. 3103-3106, 2009.
[6] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory
and Applications. Springer, 1997.
[7] P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation,
and Queues. Springer-Verlag, 1999.
[8] P. Carrington, J. Scott, and S. Wasserman, Models and Methods in
Social Network Analysis. Cambridge Univ. Press, 2006.
[9] S. Chakrabarti, Mining the Web: Discovering Knowledge from
Hypertext Data. Elsevier Science, 2003.
[10] F.R. Chung, Spectral Graph Theory. Am. Math. Soc., 1997.
[11] D.J. Cook and L.B. Holder, Mining Graph Data. Wiley and Sons,
2006.
[12] T. Cox and M. Cox, Multidimensional Scaling, second ed. Chapman
and Hall, 2001.
[13] N. Cressie, Statistics for Spatial Data. Wiley, 1991.
[14] P. Demartines and J. Herault, Curvilinear Component Analysis:
A Self-Organizing Neural Network for Nonlinear Mapping of
Data Sets, IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 148-154,
Jan. 1997.
[15] C. Ding, Spectral Clustering, Tutorial presented at the 16th
European Conf. Machine Learning (ECML 05), 2005.
[16] P. Domingos, Prospects and Challenges for Multi-Relational Data
Mining, ACM SIGKDD Explorations Newsletter, vol. 5, no. 1,
pp. 80-83, 2003.
[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, Random-
Walk Computation of Similarities between Nodes of a Graph,
with Application to Collaborative Recommendation, IEEE Trans.
Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, Mar. 2007.
[18] F. Fouss, J.-M. Renders, and M. Saerens, Links between
Kleinberg's Hubs and Authorities, Correspondence Analysis and
Markov Chains, Proc. Third IEEE Intl Conf. Data Mining (ICDM),
pp. 521-524, 2003.
[19] F. Fouss, L. Yen, A. Pirotte, and M. Saerens, An Experimental
Investigation of Graph Kernels on a Collaborative Recommenda-
tion Task, Proc. Sixth Intl Conf. Data Mining (ICDM 06), pp. 863-
868, 2006.
[20] F. Geerts, H. Mannila, and E. Terzi, Relational Link-Based
Ranking, Proc. 30th Very Large Data Bases Conf. (VLDB), pp. 552-
563, 2004.
[21] X. Geng, D.-C. Zhan, and Z.-H. Zhou, Supervised Nonlinear
Dimensionality Reduction for Visualization and Classification,
IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics,
vol. 35, no. 6, pp. 1098-1107, Dec. 2005.
[22] Introduction to Statistical Relational Learning, L. Getoor and
B. Taskar, eds. MIT Press, 2007.
[23] J. Gower and D. Hand, Biplots. Chapman & Hall, 1995.
[24] M.J. Greenacre, Theory and Applications of Correspondence Analysis.
Academic Press, 1984.
[25] A. Greenbaum, Iterative Methods for Solving Linear Systems. Soc. for
Industrial and Applied Math., 1997.
[26] K.M. Hall, An R-Dimensional Quadratic Placement Algorithm,
Management Science, vol. 17, no. 8, pp. 219-229, 1970.
[27] D.A. Harville, Matrix Algebra from a Statistician's Perspective.
Springer-Verlag, 1997.
[28] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction,
second ed. Springer-Verlag, 2009.
[29] H. Hwang, W. Dhillon, and Y. Takane, An Extension of Multiple
Correspondence Analysis for Identifying Heterogeneous Sub-
groups of Respondents, Psychometrika, vol. 71, no. 1, pp. 161-171,
2006.
[30] A. Izenman, Modern Multivariate Statistical Techniques: Regression,
Classification, and Manifold Learning. Springer, 2008.
[31] R. Johnson and D. Wichern, Applied Multivariate Statistical
Analysis, sixth ed. Prentice Hall, 2007.
[32] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete
Guide to Dimensional Modeling. John Wiley & Sons, 2002.
[33] J.M. Kleinberg, Authoritative Sources in a Hyperlinked Environ-
ment, J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[34] P. Kroonenberg and M. Greenacre, Correspondence Analysis,
Encyclopedia of Statistical Sciences, S. Kotz, ed., second ed., pp. 1394-
1403, John Wiley & Sons, 2006.
[35] S. Lafon and A.B. Lee, Diffusion Maps and Coarse-Graining: A
Unified Framework for Dimensionality Reduction, Graph Parti-
tioning, and Data Set Parameterization, IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, Sept.
2006.
[36] A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The
Science of Search Engine Rankings. Princeton Univ. Press, 2006.
[37] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction.
Springer, 2007.
[38] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and
S. Dawson, The Bottlenose Dolphin Community of Doubtful
Sound Features a Large Proportion of Long-Lasting Associa-
tions. Can Geographic Isolation Explain This Unique Trait?
Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396-405,
2003.
494 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 4, APRIL 2011
[39] S.A. Macskassy and F. Provost, Classification in Networked Data:
A Toolkit and a Univariate Case Study, J. Machine Learning
Research, vol. 8, pp. 935-983, 2007.
[40] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis.
Academic Press, 1979.
[41] C.D. Meyer, Stochastic Complementation, Uncoupling Markov
Chains, and the Theory of Nearly Reducible Systems, SIAM Rev.,
vol. 31, no. 2, pp. 240-272, 1989.
[42] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, Diffusion
Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck
Operators, Advances in Neural Information Processing Systems,
vol. 18, pp. 955-962, MIT Press, 2005.
[43] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, Diffusion
Maps, Spectral Clustering and Reaction Coordinate of Dynamical
Systems, Applied and Computational Harmonic Analysis, vol. 21,
pp. 113-127, 2006.
[44] M. Newman and M. Girvan, Finding and Evaluating Community
Structure in Networks, Physical Rev. E, 69, p. 026113, 2004.
[45] A.Y. Ng, M.I. Jordan, and Y. Weiss, On Spectral Clustering:
Analysis and an Algorithm, Advances in Neural Information
Processing Systems, vol. 14, pp. 849-856, MIT Press, 2001.
[46] P. Pons and M. Latapy, Computing Communities in Large
Networks Using Random Walks, Proc. Intl Symp. Computer and
Information Sciences (ISCIS 05), pp. 284-293, 2005.
[47] P. Pons and M. Latapy, Computing Communities in Large
Networks Using Random Walks, J. Graph Algorithms and
Applications, vol. 10, no. 2, pp. 191-218, 2006.
[48] S. Ross, Stochastic Processes, second ed. Wiley, 1996.
[49] Y. Saad, Iterative Methods for Sparse Linear Systems, second ed. Soc.
for Industrial and Applied Math., 2003.
[50] M. Saerens and F. Fouss, HITS Is Principal Component
Analysis, Proc. 2005 IEEE/WIC/ACM Intl Joint Conf. Web
Intelligence, pp. 782-785, 2005.
[51] M. Saerens, F. Fouss, L. Yen, and P. Dupont, The Principal
Components Analysis of a Graph, and Its Relationships to Spectral
Clustering, Proc. 15th European Conf. Machine Learning (ECML 04),
pp. 371-383, 2004.
[52] J.W. Sammon, A Nonlinear Mapping for Data Structure
Analysis, IEEE Trans. Computers, vol. C-18, no. 5, pp. 401-409,
May 1969.
[53] O. Schabenberger and C. Gotway, Statistical Methods for Spatial
Data Analysis. Chapman & Hall, 2005.
[54] B. Scholkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
[55] B. Scholkopf, A. Smola, and K. Muller, Nonlinear Component
Analysis as a Kernel Eigenvalue Problem, Neural Computation,
vol. 10, no. 5, pp. 1299-1319, 1998.
[56] R. Sedgewick, Algorithms in C. Addison-Wesley, 1990.
[57] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern
Analysis. Cambridge Univ. Press, 2004.
[58] J. Shi and J. Malik, Normalized Cuts and Image Segmentation,
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
pp. 888-905, Aug. 2000.
[59] D. Spielman and S.-H. Teng, Nearly-Linear Time Algorithms for
Preconditioning and Solving Symmetric, Diagonally Dominant
Linear Systems, arXiv, http://arxiv.org/abs/cs/0607105, 2007.
[60] W.J. Stewart, Introduction to the Numerical Solution of Markov
Chains. Princeton Univ. Press, 1994.
[61] J.B. Tenenbaum, V. de Silva, and J.C. Langford, A Global
Geometric Framework for Nonlinear Dimensionality Reduction,
Science, vol. 290, pp. 2319-2323, 2000.
[62] M. Tenenhaus and F. Young, An Analysis and Synthesis of
Multiple Correspondence Analysis, Optimal Scaling, Dual Scal-
ing, Homogeneity Analysis and Other Methods for Quantifying
Categorical Multivariate Data, Psychometrika, vol. 50, no. 1,
pp. 91-119, 1985.
[63] M. Thelwall, Link Analysis: An Information Science Approach.
Elsevier, 2004.
[64] L. Trefethen and D. Bau, Numerical Linear Algebra. Soc. for
Industrial and Applied Math., 1997.
[65] U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and
Computing, vol. 17, no. 4, pp. 395-416, 2007.
[66] S. Wasserman and K. Faust, Social Network Analysis: Methods and
Applications. Cambridge Univ. Press, 1994.
[67] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, Graph
Embedding and Extensions: A General Framework for Dimen-
sionality Reduction, IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. 29, no. 1, pp. 40-51, Jan. 2007.
[68] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens,
Graph Nodes Clustering Based on the Commute-Time Kernel,
Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining
(PAKDD 07), 2007.
[69] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens,
Graph Nodes Clustering with the Sigmoid Commute-Time
Kernel: A Comparative Study, Data and Knowledge Eng., vol. 68,
pp. 338-361, 2009.
[70] D.M. Young and R.T. Gregory, A Survey of Numerical Mathematics.
Dover Publications, 1988.
[71] W.W. Zachary, An Information Flow Model for Conflict and
Fission in Small Groups, J. Anthropological Research, vol. 33,
pp. 452-473, 1977.
[72] H. Zha, X. He, C.H.Q. Ding, M. Gu, and H.D. Simon, Bipartite
Graph Partitioning and Data Clustering, Proc. ACM 10th Intl
Conf. Information and Knowledge Management (CIKM 01), pp. 25-32,
2001.
[73] X. Zhu and A. Goldberg, Introduction to Semi-Supervised Learning.
Morgan & Claypool Publishers, 2009.
[74] J.Y. Zien, M.D. Schlag, and P.K. Chan, Multilevel Spectral
Hypergraph Partitioning with Arbitrary Vertex Sizes, IEEE
Trans. Computer-Aided Design of Integrated Circuits and Systems,
vol. 18, no. 9, pp. 1389-1399, 1999.
Luh Yen received the MSc degree in electrical
engineering from the Université Catholique de
Louvain (UCL), Belgium, in 2002. She com-
pleted her PhD degree at the ISYS Laboratory
and Machine Learning Group of the Université
Catholique de Louvain. Her research interests
include graph mining, clustering, and multivari-
ate statistical analysis.
Marco Saerens received the BSc degree in
physics engineering and the MSc degree in
theoretical physics from the Université Libre
de Bruxelles (ULB). After graduation, he joined
the IRIDIA Laboratory (the Artificial Intelligence
Laboratory, the Université Libre de Bruxelles
(ULB), Belgium) as a research assistant and
completed the PhD degree in applied sciences.
While remaining a part-time researcher at
IRIDIA, he also worked as a senior researcher
in the R&D Department of various companies, mainly in the fields of
speech recognition, data mining, and artificial intelligence. In 2002, he
joined the Université Catholique de Louvain (UCL) as a professor in
computer sciences. His main research interests include artificial
intelligence, machine learning, pattern recognition, data mining, and
speech/language processing. He is a member of the IEEE and the ACM.
François Fouss received the MS degree in
information systems and the PhD degree in
management science from the Université Cath-
olique de Louvain (UCL), Belgium, in 2002 and
2007, respectively. In 2007, he joined the
Facultés Universitaires Catholiques de Mons
(FUCaM) as a professor of computing science.
His main research areas include data mining,
machine learning, graph mining, classification,
and collaborative recommendation.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.