You are on page 1of 19

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO.

4, JULY/AUGUST 2002 731

Finding Patterns in
Three-Dimensional Graphs: Algorithms and
Applications to Scientific Data Mining
Xiong Wang, Member, IEEE, Jason T.L. Wang, Member, IEEE,
Dennis Shasha, Bruce A. Shapiro, Isidore Rigoutsos, Member, IEEE, and Kaizhong Zhang

AbstractÐThis paper presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable or atomic unit
and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for
an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in
the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance
ªapproximate occurrence.º) The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method
is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-
triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we
apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA Polymerase and
Thymidylate Synthase and use the motifs to classify the proteins. Then, we apply the method to clustering chemical compounds
pertaining to aromatic, bicyclicalkanes, and photosynthesis. Experimental results indicate the good performance of our algorithms and
high recall and precision rates for both classification and clustering.

Index TermsÐKDD, classification and clustering, data mining, geometric hashing, structural pattern discovery, biochemistry,
medicine.

1 INTRODUCTION

S TRUCTURAL pattern discovery finds many applications in


natural sciences, computer-aided design, and image
processing [8], [33]. For instance, detecting repeatedly
the data mining field, where automated discovery of
patterns, classification and clustering rules is one of the
main tasks. We establish a framework for structural pattern
occurring structures in molecules can help biologists to discovery in the graphs and apply our approach to
understand functions of the molecules. In these domains, classifying proteins and clustering compounds. While the
molecules are often represented by 3D graphs. The tertiary domains chosen here focus on biochemistry, our approach
structures of proteins, for example, are 3D graphs [5], [9], can be generalized to other applications where graph data
[18]. As another example, chemical compounds are also occur commonly.
3D graphs [21].
In this paper, we study a pattern discovery problem for 1.1 3D Graphs
graph data. Specifically, we propose a geometric hashing Each node of the graphs we are concerned with is an
technique to find frequently occurring substructures in a set undecomposable or atomic unit and has a 3D coordinate.1
of 3D graphs. Our study is motivated by recent advances in Each node has a label, which is not necessarily unique in a
graph. Node labels are chosen from a domain-dependent
alphabet . In chemical compounds, for example, the
. X. Wang is with the Department of Computer Science, California State alphabet includes the names of all atoms. A node can be
University, Fullerton, CA 92834. E-mail: wang@ecs.fullerton.edu.
. J.T.L. Wang is with the Department of Computer and Information Science, identified by a unique, user-assigned number in the graph.
New Jersey Institute of Technology, University Heights, Newark, NJ Edges in the graph are links between the atomic units. In
07102. E-mail: jason@cis.njit.edu. the paper, we will consider 3D graphs that are connected.
. D. Shasha is with the Courant Institute of Mathematical Sciences, New
York University, 251 Mercer St., New York, NY 10012. For disconnected graphs, we consider their connected
E-mail: shasha@cs.nyu.edu. components [16].
. B.A. Shapiro is with the Laboratory of Experimental and Computational A graph can be divided into one or more rigid
Biology, Division of Basic Sciences, National Cancer Institutes, Frederick,
MD 21702. E-mail: bshapiro@ncifcrf.gov. substructures. A rigid substructure is a subgraph in which
. I. Rigoutsos is with the IBM T.J. Watson Research Center, Yorktown the relative positions of the nodes in the substructure are
Heights, NY 10598. E-mail: rigoutso@us.ibm.com. fixed, under some set of conditions of interest. Note that the
. K. Zhang is with the Department of Computer Science, The University of
Western Ontario, London, Ontario, Canada N6A 5B7. rigid substructure as a whole can be rotated (we refer to this
E-mail: kzhang@csd.uwo.ca. as a ªwhole-structureº rotation or simply a rotation when
Manuscript received 19 May 1999; revised 18 Oct. 2000; accepted 18 Jan. the context is clear). Thus, the relative position of a node in
2001; posted to Digital Library 7 Sept. 2001.
For information on obtaining reprints of this article, please send e-mail to: 1. More precisely, the 3D coordinate indicates the location of the center of
tkde@computer.org, and reference IEEECS Log Number 109849. the atomic unit.

1041-4347/02/$17.00 ß 2002 IEEE


732 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

TABLE 1
Identifiers, Labels and Global Coordinates of the
Nodes of the Graph in Fig. 1

Fig. 1. A graph G.

the substructure and a node outside the substructure can be


changed under the rotation. The precise definition of a
ªsubstructureº is application dependent. For example, in
chemical compounds, a ring is often a rigid substructure.
Example 1. To illustrate rigid substructures in a graph,
1.2 Patterns in 3D Graphs
consider the graph G in Fig. 1. Each node is associated
with a unique number, with its label being enclosed in We consider a pattern to be a rigid substructure that may
parentheses. Table 1 shows the 3D coordinates of the occur in a graph after allowing for an arbitrary number of
nodes in the graph with respect to the Global Coordinate rotations and translations as well as a small number
Frame. We divide the graph into two rigid substructures: (specified by the user) of edit operations in the pattern or
Str0 and Str1 . Str0 consists of nodes numbered 0, 1, 2, 3, in the graph. We allow three types of edit operations:
relabeling a node, deleting a node and inserting a node.
4, and 5 as, well as, edges connecting the nodes (Fig. 2a).
Relabeling a node v means to change the label of v to any
Str1 consists of nodes numbered 6, 7, 8, 9, and 10 as, well
valid label that differs from its original label. Deleting a
as, edges connecting them (Fig. 2b). Edges in the rigid
node v from a graph means to remove the corresponding
substructures are represented by boldface links. The
atomic unit from the 3D Euclidean space and make the
edge f5; 6g connecting the two rigid substructures is
edges touching v connect with one of its neighbors v0 .
represented by a lightface link, meaning that the two
Inserting a node v into a graph means to add the
substructures are rotatable with respect to each other
corresponding atomic unit to the 3D Euclidean space and
around the edge.2 Note that a rigid substructure is not
make a node v0 and a subset of its neighbors become the
necessarily complete. For example, in Fig. 2a, there is no
neighbors of v.3
edge connecting the node numbered 1 and the node
We say graph G matches graph G0 with n mutations if by
numbered 3.
applying an arbitrary number of rotations and translations
We attach a local coordinate frame SF0 (SF1 , respec- as well as n node insert, delete or relabeling operations, 1) G
tively) to substructure Str0 (Str1 , respectively). For instance, and G0 have the same size,4 2) the nodes in G geometrically
let us focus on the substructure Str0 in Fig. 2a. We attach a match those in G0 , i.e., they have coinciding 3D coordinates,
local coordinate frame to Str0 whose origin is the node and 3) for each pair of geometrically matching nodes, they
numbered 0. This local coordinate frame is represented by have the same label. A substructure P approximately occurs
three basis points Pb1 , Pb2 , and Pb3 , with coordinates in a graph G (or G approximately contains P ) within
Pb1 …x0 ; y0 ; z0 †, Pb2 …x0 ‡ 1; y0 ; z0 †, and Pb3 …x0 ; y0 ‡ 1; z0 †, re- n mutations if P matches some subgraph of G with
spectively. The origin is Pb1 and the three basis vectors are n mutations or fewer where n is chosen by the user.
V~b1 ;b2 , V~b1 ;b3 , and V~b1 ;b2  V~b1 ;b3 . Here, V~b1 ;b2 represents the Example 2. To illustrate patterns in graphs, consider the set
vector starting at point Pb1 and ending at point Pb2 . V~b1 ;b2  S of three graphs in Fig. 3a. Suppose only exactly
V~b1 ;b3 stands for the cross product of the two corresponding coinciding substructures (without mutations) occurring
vectors. We refer to this coordinate frame as Substructure
Frame 0, or SF0 . Note that the basis vectors of SF0 are 3. Here, exactly which subset of the neighbors is chosen is unimportant,
as the proposed geometric hashing technique hashes nodes only, ignoring
orthonormal. That is, the length of each vector is 1 and the edges among the nodes. Notice that when a node v is inserted or deleted,
angle between any two basis vectors has 90 degrees. Also the nodes surrounding v do not move, i.e., their coordinates remain the
same. Note also that we do not allow multiple edit operations to be applied
note that, for any node numbered i in the substructure Str0 to the same node. Thus, for example, inserting a node with label m followed
with global coordinate Pi …xi ; yi ; zi †, we can find a local by relabeling it to n is considered as inserting a node with label n. The three
edit operations are extensions of the edit operations on sequences; they arise
coordinate of the node i with respect to SF0 , denoted Pi0 , naturally in graph editing [10] and molecule evolution [25]. As shown in
where Section 4, based on these edit operations, our algorithm finds useful
patterns that can be used to classify and cluster 3D molecules effectively.
Pi0 ˆ V~b1 ;i ˆ …xi ÿ x0 ; yi ÿ y0 ; zi ÿ z0 †: …1† 4. The size of a graph is defined to be the number of nodes in the graph,
since our geometric hashing technique considers nodes only. Furthermore,
in our target applications, e.g., chemistry, nodes are atomic units and
2. Such an edge is referred to as a rotatable edge. If two rigid determine the size of a compound. Edges are links between the nodes and
substructures cannot be rotated with respect to each other around the edge have a different meaning from the atomic units; as a consequence, we
connecting them, that edge is a nonrotatable edge. exclude the edges from the size definition.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 733

quality of the patterns by using them to classify the


compounds. Here, we extend the work in [35] by

1. considering more general edit operations including


node insert, delete, and relabeling,
2. presenting the theoretical foundation and evaluating
the performance and efficiency of our pattern-
finding algorithm,
3. applying the discovered patterns to classifying
3D proteins, which are much larger and more
complicated in topology than chemical compounds,
Fig. 2. The rigid substructures of the graph in Fig. 1.
and
4. presenting a technique to cluster 3D graphs based on
the patterns occurring in them.
in at least two graphs and having size greater than three
Specifically, we conducted two experiments. In the first
are considered as ªpatterns.º Then, S contains one
experiment, we applied the proposed method to locating
pattern shown in Fig. 3b. If substructures having size
frequently occurring motifs (substructures) in two families
greater than four and approximately occurring in all the of proteins pertaining to RNA-directed DNA Polymerase
three graphs within one mutation (i.e., one node delete, and Thymidylate Synthase, and used the motifs to classify
insert, or relabeling is allowed in matching a substruc- the proteins. Experimental results showed that our method
ture with a graph) are considered as ªpatterns,º then S achieved a 96.4 percent precision rate. In the second
contains one pattern shown in Fig. 3c. experiment, we applied our pattern-finding algorithm to
discovering frequently occurring patterns in chemical
Our strategy to find the patterns in a set of 3D graphs is
compounds chosen from the Merck Index pertaining to
to decompose the graphs into rigid substructures and, then,
aromatic, bicyclicalkanes and photosynthesis. We then used
use geometric hashing [14] to organize the substructures the patterns to cluster the compounds. Experimental results
and, then, to find the frequently occurring ones. In [35], we showed that our method achieved 99 percent recall and
applied the approach to the discovery of patterns in precision rates.
chemical compounds under a restricted set of edit opera- The rest of the paper is organized as follows: Section 2
tions including node insert and node delete, and tested the presents the theoretical framework of our approach and

Fig. 3. (a) The set S of three graphs, (b) the pattern exactly occurring in two graphs in S, and (c) the pattern approximately occurring, within one
mutation, in all the three graphs.
734 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
 
describes the pattern-finding algorithm in detail. Section 3 4
5 ˆ 20 node-triplets:
evaluates the performance and efficiency of the pattern- 3
finding algorithm. Section 4 describes the applications of There are several alternative ways to decompose
our approach to classifying proteins and clustering com- 3D graphs into rigid substructures, depending on the
pounds. Section 5 discusses related work. Section 6 application at hand and the nature of the graphs. For the
concludes the paper. purposes of exposition, we describe our pattern-finding
algorithm based on a partitioning strategy. Our approach
assumes a notion of atomic unit which is the lowest level of
2 PATTERN-FINDING ALGORITHM description in the case of interest. Intuitively, atomic units
2.1 Terminology are fundamental building elements, e.g., atoms in a
Let S be a set of 3D graphs. The occurrence number of a molecule. Edges arise as bonds between atomic units. We
pattern P is the number of graphs in S that approxi- break a graph into maximal size rigid substructures (recall
that a rigid substructure is a subgraph in which the relative
mately contain P within the allowed number of muta-
positions of nodes in the substructure are fixed). To
tions. Formally, the occurrence number of a pattern P
accomplish this, we use an approach similar to [16] that
with respect to mutation d and set S, denoted employs a depth-first search algorithm, referred to as DFB,
occur nodS …P †, is k if there are k graphs in S that contain to find blocks in a graph.5 The DFB works by traversing the
P within d mutations. For example, consider Fig. 3 again. graph in a depth-first order and collecting nodes belonging
Let S contain the three graphs in Fig. 3a. Then, to a block during the traversal, as illustrated in the example
occur no0S …P1 † = 2; occur no1S …P2 † = 3. below. Each block is a rigid substructure. We merge two
Given a set S of 3D graphs, our algorithm finds all the rigid substructures B1 and B2 if they are not rotatable with
patterns P where P approximately occurs in at least respect to each other, that is, the relative position of a node
Occur graphs in S within the allowed number of mutations n1 2 B1 and a node n2 2 B2 is fixed. The algorithm
Mut and jP j  Size, where jP j represents the size, i.e., the maintains a stack, denoted ST K, which keeps the rigid
number of nodes, of the pattern P . (Mut, Occur, and Size substructures being merged. Fig. 4 shows the algorithm,
which outputs a set of rigid substructures of a graph G. We
are user-specified parameters.) One can use the patterns in
then throw away the substructures P where jP j < Size. The
several ways. For example, biologists or chemists may
remaining substructures constitute the candidate patterns
evaluate whether the patterns are significant; computer generated from G. This pattern-generation algorithm runs
scientists may use the patterns to classify or cluster in time linearly proportional to the number of edges in G.
molecules as demonstrated in Section 4.
Our algorithm proceeds in two phases to search for the Example 3. We use the graph in Fig. 5 to illustrate how the
Find_Rigid_Substructures algorithm in Fig. 4 works.
patterns: 1) find candidate patterns from the graphs in S;
Rotatable edges in the graph are represented by lightface
and 2) calculate the occurrence numbers of the candidate
links; nonrotatable edges are represented by boldface
patterns to determine which of them satisfy the user-
links. Initially, the stack ST K is empty. We invoke DFB
specified requirements. We describe each phase in turn
to locate the first block (Step 4). DFB begins by visiting
below. the node numbered 0. Following the depth-first search,
2.2 Phase 1 of the Algorithm DFB then visits the nodes numbered 1, 2, and 5. Next,
DFB may visit the node numbered 6 or 4. Without loss of
In phase 1 of the algorithm, we decompose the graphs into
generality, assume DFB visits the node numbered 6 and,
rigid substructures. Dividing a graph into substructures is
then, the nodes numbered 10, 11, 13, 14, 15, 16, 18, and
necessary for two reasons. First, in dealing with some 17, in that order. Then, DFB visits the node numbered 14,
molecules such as chemical compounds in which there may and realizes that this node has been visited before. Thus,
exist two substructures that are rotatable with respect to DFB goes back to the node numbered 17, 18, etc., until it
each other, any graph containing the two substructures is returns to the node numbered 14. At this point, DFB
not rigid. As a result, we decompose the graph into identifies the first block, B1 , which includes nodes
substructures having no rotatable components and consider numbered 14, 15, 16, 17, and 18. Since the stack ST K is
the substructures separately. Second, our algorithm hashes empty now, we push B1 into ST K (Step 7).
node-triplets within rigid substructures into a 3D table. In iteration 2, we call DFB again to find the next block
When a graph as a whole is too large, as in the case of (Step 4). DFB returns the nonrotatable edge f13; 14g as a
block, denoted B2 .6 The block is pushed into ST K
proteins, considering all combinations of three nodes in the
(Step 7). In iteration 3, DFB locates the block B3 , which
graph may become prohibitive. Consequently, decompos-
includes nodes numbered 10, 11, 12, and 13 (Step 4).
ing the graph into substructures and hashing node-triplets Since B3 and the top entry of the stack, B2 are not
of the substructures can increase efficiency. For example, rotatable with respect to each other, we push B3 into
consider a graph of 20 nodes. There are
  5. A block is a maximal subgraph that has no cut-vertices. A cut-vertex of
20 a graph is one whose removal results in dividing the graph into multiple,
ˆ 1140 node-triplets: disjointed subgraphs [16].
3
6. In practice, in graph representations for molecules, one uses different
notation to distinguish nonrotatable edges from rotatable edges. For
On the other hand, if we decompose the graph into five example, in chemical compounds, a double bond is nonrotatable. The
substructures, each having four nodes, then there are only different notation helps the algorithm to determine the types of edges.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 735

Fig. 4. Algorithm for finding rigid substructures in a graph.

ST K (Step 7). In iteration 4, we continue to call DFB and is rotatable with respect to B9 , we pop out f5; 6g to form
get the single edge f6; 10g as the next block, B4 (Step 4). a rigid substructure (Step 9) and push B9 into ST K
Since f6; 10g is rotatable and the stack ST K is nonempty, (Step 10). Finally, since there is no block left in the graph,
we pop out all nodes in ST K, merge them and output we pop out all nodes in B9 to form a rigid substructure
the rigid substructure containing nodes numbered 10, 11, and terminate (Step 13).
12, 13, 14, 15, 16, 17, 18 (Step 9). We then push B4 into
ST K (Step 10). 2.3 Phase 2 of the Algorithm
Phase 2 of our pattern-finding algorithm consists of two
In iteration 5, DFB visits the node numbered 7 and,
subphases. In subphase A of phase 2, we hash and store the
then, 8 from which DFB goes back to the node numbered
candidate patterns generated from the graphs in phase 1 in
7. It returns the single edge f7; 8g as the next block, B5
a 3D table H. In subphase B, we rehash each candidate
(Step 4). Since f7; 8g is connected to the current top entry pattern into H and calculate its occurrence number. Notice
of ST K, f6; 10g, via a rotatable edge f6; 7g, we pop out that in the subphase B, one does not need to store the
f6; 10g, which itself becomes a rigid substructure (Step 9). candidate patterns in H again.
We then push f7; 8g into ST K (Step 10). In iteration 6, In processing a rigid substructure (pattern) of a 3D graph,
DFB returns the single edge f7; 9g as the next block, B6 we choose all three-node combinations, referred to as node-
(Step 4). Since B6 and the current top entry of ST K, B5 = triplets, in the substructure and hash the node-triplets. We
f7; 8g, are not rotatable with respect to each other, we hash three-node combinations, because to fix a rigid
push B6 into ST K (Step 7). In iteration 7, DFB goes back substructure in the 3D Euclidean space one needs at least
from the node numbered 7 to the node numbered 6 and three nodes from the substructure and three nodes are
returns the single edge f6; 7g as the next block, B7 sufficient provided they are not collinear. Notice that the
(Step 4). Since f6; 7g is rotatable, we pop out all nodes in proper order of choosing the nodes i; j; k, in a triplet is
ST K, merge them and output the resulting rigid significant and has an impact on the accuracy of our
substructure containing nodes numbered 7, 8, and 9 approach, as we will show later in the paper. We determine
(Step 9). We then push B7 = f6; 7g into ST K (Step 10). the order of the three nodes by considering the triangle
In iteration 8, DFB returns the block B8 = f5; 6g formed by them. The first node chosen always opposes the
(Step 4). Since f5; 6g and f6; 7g are both rotatable, we pop longest edge of the triangle and the third node chosen
out f6; 7g to form a rigid substructure (Step 9). We then opposes the shortest edge. For example, in the triangle in
push B8 into ST K (Step 10). In iteration 9, DFB returns Fig. 6, we choose i; j; k, in that order. Thus, the order is
the block B9 containing nodes numbered 0, 1, 2, 3, 4, and unique if the triangle is not isosceles or equilateral, which
5 (Step 4). Since the current top entry of ST K, B8 = f5; 6g, usually holds when the coordinates are floating point

Fig. 5. The graph used for illustrating how the Find_Rigid_Substructures


algorithm works. Fig. 6. The triangle formed by the three nodes i; j; k.
736 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

TABLE 2
The Node Labels of the Graph in Fig. 1 and
Their Indices in the Array A

numbers. On the other hand, when the triangle is isosceles,


we hash and store the two node-triplets ‰i; j; kŠ and ‰i; k; jŠ,
assuming that node i opposes the longest edge and edges
fi; jg, fi; kg have the same length. When the triangle is
equilateral, we hash and store the six node-triplets ‰i; j; kŠ,
Fig. 7. Calculation of the coordinates of the basis points Pb1 , Pb2 , Pb3 of
‰i; k; jŠ, ‰j; i; kŠ, ‰j; k; iŠ, ‰k; i; jŠ, and ‰k; j; iŠ.
Substructure Frame 0 (SF0 ) with respect to the local coordinate frame
The labels of the nodes in a triplet form a label-triplet, LF ‰i; j; kŠ.
which is encoded as follows: Suppose the three nodes
chosen are v1 , v2 , v3 , in that order. We maintain all node Let
labels in the alphabet  in an array A. The code for the
labels is an unsigned long integer, defined as d1 ˆ …l1 ‡ l2 † mod P rime1 mod Nrow
d2 ˆ …l2 ‡ l3 † mod P rime2 mod Nrow :
……L1  P rime ‡ L2 †  P rime† ‡ L3 ;
d3 ˆ …l3 ‡ l1 † mod P rime3 mod Nrow
where P rime > jj is a prime number, L1 , L2 , and L3 are
the indices for the node labels of v1 , v2 , and v3 , respectively, P rime1 , P rime2 , and P rime3 are three prime numbers and
in the array A. Thus, the code of a label-triplet is unique. Nrow is the cardinality of the hash table in each dimension.
This simple encoding scheme reduces three label compar- We use three different prime numbers in the hope that the
isons into one integer comparison.
distribution of the hash function values is not skewed even
Example 4. To illustrate how we encode label-triplets, if pairs of l1 , l2 , l3 are correlated. The node-triplet ‰i; j; kŠ is
consider again the graph G in Fig. 1. Suppose the node
labels are stored in the array A, as shown in Table 2. hashed to the 3D bin with the address h‰d1 Š‰d2 Š‰d3 Š.
Suppose P rime is 1,009. Then, for example, for the three Intuitively, we use the squares of the lengths of the three
nodes numbered 2, 0, and 1 in Fig. 1, the code for the edges connecting the three chosen nodes to determine the
corresponding label-triplet is ……2  1; 009 ‡ 0†  1; 009† ‡ hash bin address. Stored in that bin are the graph
1 ˆ 2; 036; 163: identification number, the substructure identification num-
ber, and the label-triplet code. In addition, we store the
2.3.1 Subphase A of Phase 2 coordinates of the basis points Pb1 , Pb2 , Pb3 of Substructure
In this subphase, we hash the candidate patterns generated Frame 0 (SF0 ) with respect to the three chosen nodes.
in phase 1 of the pattern-finding algorithm into a 3D table. Specifically, suppose the chosen nodes i, j, k are not
For the purposes of exposition, consider the example collinear. We can construct another local coordinate frame,
substructure Str0 in Fig. 2a, which is assumed to be a
denoted LF ‰i; j; kŠ, using V~i;j , V~i;k and V~i;j  V~i;k as basis
candidate pattern. We choose any three nodes in Str0 and
calculate their 3D hash function values as follows: Suppose vectors. The coordinates of Pb1 , Pb2 , Pb3 with respect to the
the chosen nodes are numbered i, j, k, and have global local coordinate frame LF ‰i; j; kŠ, denoted SF0 ‰i; j; kŠ, form a
coordinates Pi …xi ; yi ; zi †, Pj …xj ; yj ; zj †, and Pk …xk ; yk ; zk †, 3  3 matrix, which is calculated as follows (see Fig. 7):
respectively. Let l1 , l2 , l3 be three integers where 0 1
V~i;b1
l1 ˆ ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  Scale B C
SF0 ‰i; j; kŠ ˆ @ V~i;b2 A  Aÿ1 ; …3†
l2 ˆ ……xi ÿ xk †2 ‡ …yi ÿ yk †2 ‡ …zi ÿ zk †2 †  Scale : …2† ~
Vi;b3
2 2 2
l3 ˆ ……xk ÿ xj † ‡ …yk ÿ yj † ‡ …zk ÿ zj † †  Scale where
p 0 1
Here, Scale ˆ 10 is a multiplier. Intuitively, we round to the V~i;j
nearest pth position following the decimal point (here p is the B C
Aˆ@ V~i;k A: …4†
last accurate position) and, then, multiply the numbers by 10p . V~i;j  V~i;k
The reason for using the multiplier is that we want some
digits following the decimal point to contribute to the Thus, suppose the graph in Fig. 1 has identification number
distribution of the hash function values. We ignore the digits 12. The hash bin entry for the three chosen nodes i, j, k is
after the position p because they are inaccurate. (The …12; 0; Lcode; SF0 ‰i; j; kŠ†, where Lcode is the label-triplet
appropriate value for the multiplier can be calculated once code. Since there are 6 nodes in the substructure Str0 , we
data are given, as we will show in Section 3.) have
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 737
 
6 Recall that we choose the three nodes i, j, k based on the
ˆ 20 node-triplets
3
triangle formed by themÐthe first node chosen always
generated from the substructure and, therefore, 20 entries in opposes the longest edge of the triangle and the third node
the hash table for the substructure.7 chosen opposes the shortest edge. Without loss of generality,
let us assume that the nodes i; j; k are chosen in that order.
Example 5. To illustrate how we hash and store patterns in
Thus, V~i;j has the shortest length, V~i;k is the second shortest
the hash table, let us consider Table 1 again. The basis
and V~j;k is the longest. We use node i as the origin, V~i;j as the X-
points of SF0 of Fig. 2a have the following global
axis and V~i;k as the Y-axis. Then, construct the local coordinate
coordinates:
frame LF ‰i; j; kŠ using V~i;j , V~i;k , and V~i;j  V~i;k as basis vectors.
Pb1 …1:0178; 1:0048; 2:5101†; Thus, we exclude the longest vector V~j;k when constructing
Pb2 …2:0178; 1:0048; 2:5101†; LF ‰i; j; kŠ. Here is why.
The coordinates …x; y; z† of each node in a 3D graph have
Pb3 …1:0178; 2:0048; 2:5101†:
an error due to rounding. Thus, the real coordinates for the
The local coordinates, with respect to SF0 , of the node should be …x; y; z†, where x ˆ x ‡ 1 , y ˆ y ‡ 2 , z ˆ
nodes numbered 0, 1, 2, 3, and 4 in substructure Str0 of z ‡ 3 for three small decimal fractions 1 ; 2 ; 3 . After
Fig. 2a are as follows: constructing LF ‰i; j; kŠ and when calculating SF0 ‰i; j; kŠ,
one may add or multiply the coordinates of the 3D vectors.
P00 …0:0000; 0:0000; 0:0000†;
We define the accumulative error induced by a calculation C,
P10 …0:1843; 1:0362; ÿ0:5081†; denoted …C†, as
P20 …0:3782; 1:9816; ÿ0:5095†;
P30 …ÿ0:3052; 1:0442; 0:6820†; …C† ˆ jf ÿ fj;
P40 …ÿ0:2568; 1:7077; 0:5023†: where f is the result obtained from C with the real
coordinates and f is the result obtained from C with
Now, suppose Scale, P rime1 , P rime2 , P rime3 are 10, rounding errors.
1,009, 1,033, and 1,051, respectively, and Nrow is 31. Recall that in calculating SF0 ‰i; j; kŠ, the three basis vectors
Thus, for example, for the nodes numbered 1, 2, and 3, of LF ‰i; j; kŠ all appear in the matrix A defined in (4). Let
the hash bin address is h‰25Š‰12Š‰21Š and,
0 1 jV~i;j j ˆ jV~i;j j ‡ 1 ; jV~i;k j ˆ jV~i;k j ‡ 2 ;
ÿ1:056731 0:357816 0:173981
SF0 ‰1; 2; 3Š ˆ @ ÿ0:875868 0:071949 0:907227 A: …5† and jV~j;k j ˆ jV~j;k j ‡ 3 . Let  ˆ maxfj1 j; j2 j; j3 jg. Notice
ÿ0:035973 0:417517 0:024041
jV~i;j  V~i;k j ˆ jV~i;j jjV~i;k j sin ;
As another example, for the nodes numbered 1, 4, and 2,
the hash bin address is h‰24Š‰0Š‰9Š and where jV~i;j j is the length of V~i;j , and  is the angle between
0 1 V~i;j and V~i;k . Thus,
0:369457 ÿ1:308266 ÿ0:242990
SF0 ‰1; 4; 2Š ˆ @ ÿ0:043584 ÿ0:857105 ÿ1:006793 A: …6† …jV~i;j  V~i;k j†
0:455285 ÿ0:343703 ÿ0:086982
ˆ jjV~i;j jjV~i;k j sin  ÿ jV~i;j jjV~i;k j sin j
Similarly, for the substructure Str1 , we attach a local
ˆ j…jV~i;j j ‡ 1 †…jV~i;k j ‡ 2 † sin  ÿ jV~i;j jjV~i;k j sin j
coordinate frame SF1 to the node numbered 6, as shown
in Fig. 2b. There are 10 hash table entries for Str1 , each ˆ j…jV~i;j j2 ‡ jV~i;k j1 ‡ 1 2 †jj sin j :
having the form …12; 1; Lcode; SF1 ‰l; m; nŠ† where l; m; n  …jV~i;j jj2 j ‡ jV~i;k jj1 j ‡ j1 jj2 j†j sin j
are any three nodes in Str1 .
 …jV~i;j j ‡ jV~i;k j ‡ †
ˆ U1
7. One interesting question here is regarding the size limitation of
patterns (rigid substructures) found by our algorithm. Suppose the graph Likewise,
identification number and the substructure identification number are both
…jV~i;j  V~j;k j†  …jV~i;j j ‡ jV~j;k j ‡ † ˆ U2
short integers of 2 bytes. Lcode is a long integer of 4 bytes. SF0 ‰i; j; kŠ is a
3  3 matrix of floating point numbers, each requiring 4 bytes. Thus, each
and
node-triplet requires 44 bytes. For a set of M patterns (substructures),
  each …jV~i;k  V~j;k j†  …jV~i;k j ‡ jV~j;k j ‡ † ˆ U3 :
T
having T nodes on average, the hash table requires M   44 bytes of
3
disk space. Suppose there are 10,000 patterns, i.e., M ˆ 10; 000. If T ˆ 20,
Among the three upperbounds U1 , U2 , U3 , U1 is the smallest.
It's likely that the accumulative error induced by calculating
the hash table requires about 502 Megabytes. If T ˆ 100, the hash table
the length of the cross product of the two corresponding
requires over 71 Gigabytes. This gives an estimate of how large patterns can
vectors is also the smallest. Therefore, we choose V~i;j , V~i;k ,
be in practice. In our case, the pattern sizes are between 5 and 13, so the size and exclude the longest vector V~j;k in constructing the local
demands are modest. Note that the time and memory space needed also coordinate frame LF ‰i; j; kŠ, so as to minimize the accumu-
increase dramatically as patterns become large. lative error induced by calculating SF0 ‰i; j; kŠ.
738 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

2.3.2 Subphase B of Phase 2 nodes u, v, and w geometrically match the nodes i, j, and k,
Let H be the resulting hash table obtained in subphase A of respectively.
phase 2 of the pattern-finding algorithm. In subphase B, we Proof. (If) Let A be as defined in (4) and let
calculate the occurrence number of each candidate pattern 0 1
P by rehashing the node-triplets of P into H. This way, we V~u;v
B C
are able to match a node-triplet tri of P with a node-triplet Bˆ@ V~u;w A: …8†
tri0 of another substructure (candidate pattern) P 0 stored in ~ ~
Vu;v  Vu;w
subphase A where tri and tri0 have the same hash bin
address. By counting the node-triplet matches, one can infer Note that, if u, v, and w geometrically match i, j, and k,
whether P matches P 0 and, therefore, whether P occurs in respectively, then jBj ˆ jAj, where jBj (jAj, respectively)
the graph from which P 0 is generated. is the determinant of the matrix B (matrix A, respec-
We associate each substructure with several counters, tively). That is to say, jAÿ1 jjBj ˆ 1.
which are created and updated as illustrated by the From (3) and by the definition of the SFP in (7),
we have
following example. Suppose the two substructures (pat-
0 1 0 1
terns) of graph G with identification number 12 in Fig. 1
V~i;b1 Pu
have already been stored in the hash table H in subphase A. B C
SFP ˆ @ V~i;b2 A  Aÿ1  B ‡ @ Pu A: …9†
Suppose i; j; k, are three nodes in the substructure Str0 of G.
V~i;b3 Pu
Thus, for this node-triplet, its entry in the hash table is
…12; 0; Lcode; SF0 ‰i; j; kŠ†. Now, in subphase B, consider Thus, the SFP basically transforms Pb1 , Pb2 , and Pb3 via
another pattern P ; we hash the node-triplets of P using two translations and one rotation, where Pb1 , Pb2 , and Pb3
the same hash function. Let u; v; w, be three nodes in P that are the basis points of the Substructure Frame 0 (SF0 ).
have the same hash bin address as i; j; k; that is, the node- Since V~b1 ;b2 , V~b1 ;b3 , and V~b1 ;b2  V~b1 ;b3 are orthonormal
triplet ‰u; v; wŠ ªmatchesº the node-triplet ‰i; j; kŠ. If the vectors, and translations and rotations do not change
nodes u; v; w, geometrically match the nodes i; j; k respec- this property [27], we know that V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2 
tively, i.e., they have coinciding 3D coordinates after V~c1 ;c3 are orthonormal vectors.
rotations and translations, we call the node-triplet match a (Only if) If u, v, and w do not match i, j and k
true match; otherwise it is a false match. For a true match, let geometrically while having the same hash bin address,
0 1 0 1 then there would be distortion in the aforementioned
V~u;v Pu transformation. Consequently, V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2 
B C @ A V~c1 ;c3 would no longer be orthonormal vectors.
SFP ˆ SF0 ‰i; j; kŠ  @ V~u;w A ‡ Pu : …7† u
t
~ ~
Vu;v  Vu;w Pu

Theorem 2. If two true node-triplet matches yield the same SFP


This SFP contains the coordinates of the three basis points
and the codes of the corresponding label-triplets are the same,
of the Substructure Frame 0 (SF0 ) with respect to the global then the two node-triplet matches are augmentable, i.e., they
coordinate frame in which the pattern P is given. We can be combined to form a larger substructure match between
compare the SFP with those already associated with the P and Str0 .
substructure Str0 (initially none is associated with Str0 ). If
Proof. Since three nodes are enough to set the SFP at a fixed
the SFP differs from the existing ones, a new counter is
position and direction, all the other nodes in P will have
created, whose value is initialized to 1, and the new counter
is assigned to the SFP . If the SFP is the ªsameº as an definite coordinates under this SFP . When another node-
existing one with counter value Cnt,8 and the code of the triplet match yielding the same SFP occurs, it means that
label-triplet of nodes i, j, k; equals the code of the label- geometrically there is at least one more node match
triplet of nodes u, v, w, then Cnt is incremented by one. In between Str0 and P . If the codes of the corresponding
general, a substructure may be associated with several label-triplets are the same, it means that the labels of the
different SFP s, each having a counter. corresponding nodes are the same. Therefore, the two
We now present the theory supporting this algorithm. node-triplet matches are augmentable. u
t
Below, Theorem 1 establishes a criterion based on which Fig. 8 illustrates how two node-triplet matches are
one can detect and eliminate a false match. Below, augmented. Suppose the node-triplet ‰3; 4; 2Š yields the
Theorem 2 justifies the procedure of incrementing the
SFP shown in the figure. Further, suppose that the node-
counter values.
triplet ‰1; 2; 3Š yields the same SFP , as shown in Fig. 8. Since
Theorem 1. Let Pc1 , Pc2 , and Pc3 be the three basis points forming the labels of the corresponding nodes numbered 3 and 2 are
the SFP defined in (7), where Pc1 is the origin. V~c1 ;c2 , V~c1 ;c3 , the same, we can augment the two node-triplets to form a
and V~c1 ;c2  V~c1 ;c3 are orthonormal vectors if and only if the
larger match containing nodes numbered 1, 2, 3, 4.
8. By saying SFP is the same as an existing SFP0 , we mean that for each Thus, by incrementing the counter associated with the
entry ei;j ; 1  i; j  3, at the ith row and the jth column in SFP and its SFP , we record how many true node-triplet matches are
corresponding entry e0i;j in SFP0 , jei;j ÿ e0i;j j  , where  is an adjustable
parameter depending on the data. In the examples presented in the paper, augmentable under this SFP . Notice that in cases where two
 ˆ 0:01. node-triplet matches occur due to reflections, the directions
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 739

Fig. 8. Illustration of two augmentable node-triplet matches.

of the corresponding local coordinate systems are different,


Fig. 9. A substructure (pattern) P .
so are the SFP s. As a result, these node-triplet matches are
 
not augmentable. 5
ˆ 10;
Example 6. To illustrate how we update counter values for 3
augmentable node-triplet matches, let us consider the
since all matching node-triplets have the same SFP as
pattern P in Fig. 9. In P , the nodes numbered 0, 1, 2, 3, 4
in (10) and the labels of the corresponding nodes are
match, after rotation, the nodes numbered 5, 4, 3, 1, 2 in
the same.
the substructure Str0 in Fig. 2a. The node numbered 0 in
Str0 does not appear in P (i.e., it is to be deleted). The Now, consider again the SFP defined in (7) and the three
labels of the corresponding nodes are identical. Thus, basis points Pc1 , Pc2 , Pc3 forming the SFP , where Pc1 is the
P matches Str0 with 1 mutation, i.e., one node is deleted. origin. We note that for any node i in the pattern P with
Now, suppose in P , the global coordinates of the global coordinate Pi …xi ; yi ; zi †, it has a local coordinate with
nodes numbered 1, 2, 3, and 4 are respect to the SFP , denoted Pi0 , where
P1 …ÿ0:269000; 4:153153; 2:911494†; Pi0 ˆ V~c1 ;i  E ÿ1 : …12†
P2 …ÿ0:317400; 4:749386; 3:253592†;
Here, E is the base matrix for the SFP , defined as
P3 …0:172100; 3:913515; 4:100777†;
0 1
P4 …0:366000; 3:244026; 3:433268†: V~c1 ;c2
B C
Eˆ@ V~c1 ;c3 A …13†
Refer to Example 5. For the nodes numbered 3, 4, and
~ ~
Vc1 ;c2  Vc1 ;c3
2 of P , the hash bin address is h‰25Š‰12Š‰21Š, which is the
same as that of nodes numbered 1, 2, 3 of Str0 , and andV~c1 ;i is the vector starting at Pc1 and ending at Pi .
0 1
ÿ0:012200 5:005500 4:474200 Remark. If V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2  V~c1 ;c3 are orthonormal
SFP ˆ @ 0:987800 5:005500 4:474200 A: …10† vectors, then jEj ˆ 1. Thus, a practically useful criterion
ÿ0:012200 4:298393 3:767093 for detecting false matches is to check whether or not
The three basis vectors forming this SFP are jEj ˆ 1. If jEj 6ˆ 1, then V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2  V~c1 ;c3 are
not orthonormal vectors and, therefore, the nodes u, v,
V~c1 ;c2 ˆ …1:000000; 0:000000; 0:000000†; and w do not match the nodes i, j and k geometrically
V~c1 ;c3 ˆ …0:000000; ÿ0:707107; ÿ0:707107†; (see Theorem 1).
Intuitively, our scheme is to hash node-triplets and
V~c1 ;c2  V~c ;c ˆ …0:000000; 0:707107; ÿ0:707107†;
1 3 match the triplets. Only if one triplet tri matches another
which are orthonormal. tri0 do we see how the substructure containing tri matches
For the nodes numbered 3, 1, and 4 of P , the hash bin the pattern containing tri0 . Using Theorem 1, we detect and
address is h‰24Š‰0Š‰9Š, which is the same as that of nodes eliminate false node-triplet matches. Using Theorem 2, we
numbered 1, 4, 2 of Str0 , and record in a counter the number of augmentable true node-
0 1 triplet matches. The following theorem says that the
ÿ0:012200 5:005500 4:474200 counter value needs to be large (i.e., there are a sufficient
SFP ˆ @ 0:987800 5:005500 4:474200 A: …11† number of augmentable true node-triplet matches) in
ÿ0:012200 4:298393 3:767093 order to infer that there is a match between the
corresponding pattern and substructure. The larger the
These two true node-triplet matches have the same
Mut, the fewer node-triplet matches are needed.
SFP and, therefore, the corresponding counter associated
with the substructure Str0 of the graph 12 in Fig. 1 is Theorem 3. Let Str be a substructure in the hash table H and let
updated to 2. After hashing all node-triplets of P , the G be the graph from which Str is generated. Let P be a pattern
counter value will be where jP j  Mut ‡ 3. After rehashing the node-triplets of P ,
740 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

suppose there is an SFP associated with Str whose counter qualified patterns. In Section 3, we will show experimen-
value Cnt > P , where tally that the missed patterns are few compared with those
  found by exhaustive search.
N ÿ1 …N ÿ 1†  …N ÿ 2†  …N ÿ 3†
P ˆ ˆ …14† Theorem 4. Let the set S contain K graphs, each having at most
3 6
N nodes. The time complexity of the proposed pattern-finding
and N ˆ jP j ÿ Mut. Then, P matches Str within Mut algorithm is O…KN 3 †.
mutations (i.e., P approximately occurs in G, or G
Proof. For each graph in S, phase 1 of the algorithm
approximately contains P , within Mut mutations).
requires O…N 2 † time to decompose the graph into
Proof. By Theorem 2, we increase the counter value only substructures. Thus, the time needed for phase 1 is
when there are true node-triplet matches that are O…KN 2 †. In subphase A of phase 2, we hash each
augmentable under the SFP . Thus, the counter value candidate pattern P by considering the combinations of
shows currently how many node-triplets from the any three nodes in P , which requires time
pattern P are found to match with the node-triplets  
from the substructure Str in the hash table. Note that P jP j
ˆ O…jP j3 †:
represents the number of distinct node-triplets that can 3
be generated from a substructure of size N ÿ 1. If there
are N ÿ 1 node matches between P and Str, then Thus, the time needed to hash all candidate patterns is
Cnt  P .9 Therefore, when Cnt > P , there are at least O…KN 3 †. In subphase B of phase 2, we rehash each
N node matches between Str and P . This completes the candidate pattern, which requires the same time
proof. u
t O…KN 3 †. u
t

Notice that our algorithm cannot handle patterns with


size smaller than 3, because the basic processing unit here is 3 PERFORMANCE EVALUATION
a node-triplet of size 3. This is reflected in the condition
jP j  Mut ‡ 3 stated in Theorem 3. Since Mut can only be We carried out a series of experiments to evaluate the
zero or greater than zero, this condition implies that the size performance and the speed of our approach. The programs
of a pattern must be greater than or equal to 3. were written in the C programming language and run on a
SunSPARC 20 workstation under the Solaris operating
Example 7. To illustrate how Theorem 3 works, let us refer
system version 2.4. Parameters used in the experiments can
to Example 6. Suppose the user-specified mutation
be classified into two categories: those related to data and
number Mut is 1. The candidate pattern P in Fig. 9 has
those related to the pattern-finding algorithm. In the first
size jP j ˆ 5. After rehashing the node-triplets of P , there
category, we considered the size (in number of nodes) of a
is only one counter associated with the substructure Str0
in Fig. 2a; this counter corresponds to the SFP in (10) and graph and the total number of graphs in a data set. In the
the value of the counter, Cnt, is 10. Thus, Cnt is greater second category, we considered all the parameters de-
than P ˆ …5 ÿ 2†…5 ÿ 3†…5 ÿ 4†=6 ˆ 1. By Theorem 3, P scribed in Section 2, which are summarized in Table 3
should match the substructure Str0 within 1 mutation. together with the base values used in the experiments.
This means that there are at least four node matches Two files were maintained: One recording the hash bin
between P and Str0 . addresses and the other containing the entries stored in the
hash bins. To evaluate the performance of the pattern-
Thus, after rehashing the node-triplets of each candidate finding algorithm, we applied the algorithm to two sets of
pattern P into the 3D table H, we check the values of the data: 1,000 synthetic graphs and 226 chemical compounds
counters associated with the substructures in H. By obtained from a drug database maintained in the National
Theorem 3, P approximately occurs in a graph G within Cancer Institute. When generating the artificial graphs, we
Mut mutations if G contains a substructure Str and there is randomly generated the 3D coordinates for each node. The
at least one counter associated with Str whose value node labels were drawn randomly from the range A to E.
Cnt > P . If there are less than Occur graphs in which P The size of the rigid substructures in an artificial graph
approximately occurs within Mut mutations, then we ranged from 4 to 10 and the size of the graphs ranged from
discard P . The remaining candidates are qualified patterns. 10 to 50. The size of the compounds ranged from 5 to 51.
Notice that Theorem 3 provides only the ªsufficientº (but In this section, we present experimental results to answer
not the ªnecessaryº) condition for finding the qualified questions concerning the performance of the pattern-
patterns. Due to the accumulative errors arising in the finding algorithm. For example, are all approximate
calculations, some node-triplets may be hashed to a wrong patterns found, i.e., is the recall high? Are any uninteresting
bin. As a result, the pattern-finding algorithm may miss patterns found, i.e., is the precision high? In Section 4, we
some node-triplet matches and, therefore, miss some study the applications of the algorithm and intend to
answer questions such as whether graphs having some
9. Notice that we cannot use Cnt ˆ P here. The reason is that we
augment node-triplet matches only when they yield the same SFP . In common phenomenological activity (e.g., they are proteins
practice, it's likely that two node-triplet matches could be augmented, but with the same function) share structural patterns in
yield different SFP s due to the errors accumulated during the calculation of
the SFP s. As a consequence, our algorithm fails to detect those node-triplet common and whether these patterns can characterize the
matches and update the counter values appropriately. graphs as a whole.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 741

TABLE 3
Parameters in the Pattern-Finding Algorithm and Their Base Values Used in the Experiments

3.1 Effect of Data-Related Parameters UP F ound


PR ˆ  100%;
P F ound
To evaluate the performance of the proposed pattern-
finding algorithm, we compared it with exhaustive search. where P F ound is the number of patterns found by the
The exhaustive search procedure works by generating all proposed algorithm, UP F ound is the number of patterns
candidate patterns as in phase 1 of the pattern-finding found that satisfy the user-specified parameter values, and
algorithm. Then, the procedure examines if a pattern P T otalP is the number of qualified patterns found by
approximately matches a substructure Str in a graph by exhaustive search. One would like both RE and P R to be
permuting the node labels of P and checking if they match as high as possible. Fig. 10 shows the running times of the
the node labels of Str. If so, the procedure performs algorithms as a function of the number of graphs and Fig. 11
translation and rotation on P and checks if P can shows the recall. The parameters used in the proposed
geometrically match Str. pattern-finding algorithm had the values shown in Table 3.
The speed of the algorithms was measured by the As can be seen from the figures, the proposed algorithm is
running time. The performance was evaluated using three 10,000 times faster than the exhaustive search method when
measures: recall (RE), precision (P R), and the number of the data set has more than 600 graphs while achieving a
false matches, Nfm , arising during the hashing process. very high (> 97%) recall. Due to the accumulative errors
(Recall that a false match arises, if a node-triplet ‰u; v; wŠ arising in the calculations, some node-triplets may be
from a pattern P has the same hash bin address as a node- hashed to a wrong bin. As a result, the proposed algorithm
triplet ‰i; j; kŠ from a substructure of graph G, though the may miss some node-triplet matches in subphase B of
nodes u; v; w do not match the nodes i; j; k geometrically, phase 2 and, therefore, cannot achieve a 100 percent recall.
see Section 2.3.2.) Recall is defined as In these experiments, precision was 100 percent.
Fig. 12 shows the number of false matches introduced by
UP F ound
RE ˆ  100%: the proposed algorithm as a function of the number of
T otalP
graphs. For the chemical compounds, Nfm is small. For the
Precision is defined as synthetic graphs, Nfm increases as the number of graphs

Fig. 10. Running times as a function of the number of graphs. Fig. 11. Recall as a function of the number of graphs.
742 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

Fig. 12. Number of false matches as a function of the number of graphs. Fig. 14. Effect of Size.

becomes large. Similar results were obtained when testing the fewer false matches. On the other hand, when Nrow
the size of graphs for both types of data. is too large, the running times increase substantially, since
one needs to spend a lot of time in reading the 3D table
3.2 Effect of Algorithm-Related Parameters
containing the hash bin addresses.
The purpose of this section is to analyze the effect of
Examining Fig. 13, we see how Scale affects Nfm . Taking
varying the algorithm-related parameter values on the
an extreme case, when Scale ˆ 1, a node-triplet with the
performance of the proposed pattern-finding algorithm.
squares of the lengths of the three edges connecting the
To avoid the mutual influence of parameters, the analysis
nodes being 12.4567 is hashed to the same bin as a node-
was carried out by fixing the parameter values related to triplet with those values being 12.0000, although the two
data graphsÐthe 1,000 synthetic graphs and 226 com- node-triplets do not match geometrically, see Section 2.3.1.
pounds described above were used, respectively. In each On the other hand, when Scale is large (e.g., Scale = 10,000),
experiment, only one algorithm-related parameter value the distribution of hash function values is less skewed,
was varied; the other parameters had the values shown in which reduces the number of false matches. It was observed
Table 3. that Nfm was largest when Scale was 100. This happens
We first examined the effect of varying the parameter because with this Scale value, inaccuracy was being
values on generating false matches. Since few false introduced in calculating hash bin addresses. A node-triplet
matches were found for chemical compounds, the being hashed to a bin might generate k false matches where
experiments focused on synthetic graphs. It was observed k is the total number of node-triplets already stored in that
that only Nrow and Scale affected the number of false binÐk would be large if the distribution of hash function
matches. Fig. 13 shows Nfm as a function of Scale for values is skewed.
Nrow = 101, 131, 167, 199, respectively. The larger the Figs. 14, 15, and 16 show the recall as a function of Size,
Nrow, the fewer entries in a hash bin, and consequently Mut, and Scale, respectively. In all three figures, precision

Fig. 13. Number of false matches as a function of Scale. Fig. 15. Effect of Mut.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 743

to the nth digit after the decimal point, i.e., to 10ÿn , the error
caused by eliminating the digits after the …n ‡ 1†th position is
no larger than 0:5  10ÿn . Namely, for any coordinates
…x; y; z†, we have x  0:5  10ÿn , y  0:5  10ÿn , and
z  0:5  10ÿn . Thus,

l1  2  …jxi ÿ xj jjxi ÿ xj j ‡ jyi ÿ yj jjyi ÿ yj j


‡ jzi ÿ zj jjzi ÿ zj j†  Scale
‡ …jxi ÿ xj j2 ‡ jyi ÿ yj j2 ‡ jzi ÿ zj j2 †  Scale
 2  …jxi ÿ xj j…jxi j ‡ jxj j† ‡ jyi ÿ yj j…jyi j
‡ jyj j† ‡ jzi ÿ zj j…jzi j ‡ jzj j††  Scale
‡ ……jxi j ‡ jxj j†2 ‡ …jyi j ‡ jyj j†2 ‡ …jzi j ‡ jzj j†2 †  Scale
 2  …jxi ÿ xj j ‡ jyi ÿ yj j ‡ jzi ÿ zj j†
 2  0:5  10ÿn  Scale
Fig. 16. Impact of Scale.
‡ 3  …2  0:5  10ÿn †2  Scale
is 100 percent. From Fig. 14 and Fig. 15, we see that Size and ˆ …jxi ÿ xj j ‡ jyi ÿ yj j ‡ jzi ÿ zj j†
Mut affect recall slightly. It was also observed that the  2  Scale  10ÿn ‡ 3  Scale  10ÿ2n :
number of interesting patterns drops (increases, respec-
Assume that the range of the coordinates is ‰ÿM; MŠ. We
tively) significantly as Size (Mut, respectively) becomes have jxi ÿ xj j  2M, jyi ÿ yj j  2M, and jzi ÿ zj j  2M.
large. Fig. 16 shows that the pattern-finding algorithm Thus,
yields a poor performance when Scale is large. With the
data we tested, we found that setting Scale to 10 is the best l1  12  M  Scale  10ÿn ‡ 3  Scale  10ÿ2n :
overall for both recall and precision. The second term is obviously negligible in comparison
In sum, the value of Scale has a significant impact on the with the first term. In order to keep the calculation of hash
performance of our algorithm. The choice of Scale is bin addresses accurate, the first term should be smaller than
domain and data dependent. In practice, how would one 1. That is, Scale should be chosen to be the largest possible
select a value of Scale for a particular database? Recall that number such that
the coordinates …x; y; z† of each node in a 3D graph have an
error due to rounding, see Section 2.3.1. The real coordi- 12  M  Scale  10ÿn < 1
nates for the node at …xi ; yi ; zi † should be …xi ; yi ; zi †, where or
xi ˆ xi ‡ xi , yi ˆ yi ‡ yi , zi ˆ zi ‡ zi . Similarly, the real
coordinates for the node at …xj ; yj ; zj † should be …xj ; yj ; zj †, 1
Scale <  10n :
where xj ˆ xj ‡ xj , yj ˆ yj ‡ yj , zj ˆ zj ‡ zj . The real 12  M
value of l1 in (2) should be In our case, the coordinates of the chemical compounds
2 2 2 used were accurate to the 4th digit after the decimal point and
l1 ˆ ……xi ÿ xj † ‡ …yi ÿ yj † ‡ …zi ÿ zj † †  Scale
the range of the coordinates was [-10, 10]. Consequently,
ˆ ……xi ‡ xi ÿ xj ÿ xj †2 ‡ …yi ‡ yi ÿ yj ÿ yj †2
104 500
‡ …zi ‡ zi ÿ zj ÿ zj †2 †  Scale Scale < ˆ
12  10 6
ˆ ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  Scale
Thus, choosing Scale ˆ 10 is good, as validated by our
‡ 2  ……xi ÿ xj †…xi ÿ xj † ‡ …yi ÿ yj †…yi ÿ yj †
experimental results. Notice that Scale can be determined
‡ …zi ÿ zj †…zi ÿ zj ††  Scale once the data are given. All we need to know is how
‡ ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  Scale accurate the data are, and what the ranges of their
coordinates are, both of which are often clearly associated
ˆ l1 ‡ l1 ;
with the data.
where l1 is an accumulative error, Figs. 17 and 18 show the recall and precision as a
function of . It can be seen that when  is 0.01, precision is
l1 ˆ 2  ……xi ÿ xj †…xi ÿ xj † ‡ …yi ÿ yj †…yi ÿ yj † 100 percent and recall is greater than 97 percent. When 
‡ …zi ÿ zj †…zi ÿ zj ††  Scale : becomes smaller (e.g.,  ˆ 0:0001 ), precision remains the
2 2 2
‡ ……xi ÿ xj † ‡ …yi ÿ yj † ‡ …zi ÿ zj † †  Scale same while recall drops. When  becomes larger (e.g.,
 ˆ 10), recall increases slightly while precision drops. This
…15†
happens because some irrelevant node-triplet matches were
Since we used l1 to calculate the hash bin addresses, it is included, rendering unqualified patterns returned as an
critical that the accumulative error would not mislead us to a answer. We also tested different values for Occur, Nrow,
wrong hash bin. Assuming that the coordinates are accurate P rime1 , P rime2 , and P rime3 . It was found that varying
744 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

Fig. 17. Recall as a function of . Fig. 18. Precision as a function of .

these parameter values had little impact on the performance the original substructure P . Supposing the pattern P 0 has
of the proposed algorithm. N nodes, the total number of node-triplets that include the
Finally, we examined the effect of sensor errors, which node v is
are caused by the measuring device (e.g., X-ray crystal-  
N ÿ1
lography), and are proportional to the values being :
2
measured. Suppose that the sensor error is 10ÿn of the
real value for any coordinates …x; y; z†, i.e., x ˆ x  10ÿn , Thus, we would miss the same number of node-triplet
y ˆ y  10ÿn , and z ˆ z  10ÿn . We have matches. However, if all other node-triplets that do not
include the node v are matched successfully, we would
x ˆ x  …1 ‡ 10ÿn †; y ˆ y  …1 ‡ 10ÿn †; still have
and z ˆ z  …1 ‡ 10ÿn †. Let l1 represent the sensor error      
N N ÿ1 N ÿ1
induced in calculating l1 in (2). In a way similar to the ÿ ˆ
3 2 3
calculation of the accumulative error l1 in (15), we obtain
node-triplet matches. According to Theorem 3, there would
l1 ˆ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  2  10ÿn  Scale be N ÿ 1 node matches between the pattern P 0 and the
‡ ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  10ÿ2n  Scale: original substructure P (i.e., P 0 matches P with 1 mutation).
This example reveals that, in general, if we allow mutations
Let us focus on the first term since it is more significant. to occur in searching for patterns and the number of nodes
Clearly, the proposed approach is not feasible in the with sensor errors equals the allowed number of mutations,
environment with large sensor errors. Taking an example, the recall will be the same as the case in which there are no
suppose that the sensor error is 10ÿ2 of the real value. sensor errors and we search for exactly matched patterns
Again, in order to keep the calculation of hash bin without mutations. On the other hand, if no mutations are
addresses accurate, the accumulative error should be allowed in searching for patterns and sensor errors occur,
smaller than 1. That is, the recall will be zero.

……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  2  10ÿ2  Scale < 1


4 DATA MINING APPLICATIONS
or
One important application of pattern finding involves the
……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †  Scale < 50: ability to perform classification and clustering as well as
other data mining tasks. In this section, we present two data
Even when Scale ˆ 1, the Euclidean distance between mining applications of the proposed algorithm in scientific
any two nodes must be smaller than 7, since the square of domains: classifying proteins and clustering compounds.10
the distance, viz. ……xi ÿ xj †2 ‡ …yi ÿ yj †2 ‡ …zi ÿ zj †2 †, must
4.1 Classifying Proteins
be less than 50. Most data sets, including all the chemical
Proteins are large molecules, comprising hundreds of
compounds we used, do not satisfy this restriction.
To illustrate how this would affect recall, consider a amino acids (residues) [18], [33]. In each residue the C ,
C , and N atoms form a backbone of the residue [17].
substructure P already stored in the hash table. We add a
Following [28], we represent each residue by the three
10ÿ2 sensor error to the coordinates of a node v in P . Call
atoms. Thus, if we consider a protein as a 3D graph, each
the resulting substructure P 0 . Then, we use P 0 as a pattern
node of the graph is an atom. Each node has a label, which
to match the original substructure P in the hash table. Those
node-triplets generated from the pattern P 0 that include the 10. Another application of the proposed algorithm to flexible chemical
node v would not match the corresponding node-triplets in searching in 3D structural databases can be found in [34].
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 745

motifs. That is, there were 2,784 times in which a node-


triplet ‰u; v; wŠ from a pattern was found to have the same
hash bin address as a node-triplet ‰i; j; kŠ from a sub-
structure of a protein, though the nodes u; v; w did not
match the nodes i; j; k geometrically, see Section 2.3.2.
To evaluate the quality of the discovered motifs, we
applied them to classifying the proteins using the 10-way
cross-validation scheme. That is, each family was divided
into 10 groups of roughly equal size. Specifically, the
RNA-directed DNA Polymerase family, referred to as
family 1, contained five groups each having five proteins
and five groups each having four proteins. The Thymi-
dylate Synthase family, referred to as family 2, contained
seven groups each having four proteins and three groups
each having three proteins, and 10 tests were conducted.
In each test, a group was taken from a family and used
as test data; the other nine groups were used as training
data for that family. We applied our pattern-finding
algorithm to each training data set to find motifs (the
parameter values used were as shown in Table 3). Each
Fig. 19. (a) A 3D protein. (b) The three substructures of the protein in (a).
motif M found in family i was associated with a weight d
where
is the name of the atom and is not unique in the protein. We
assign a unique number to identify a node in the protein, d ˆ ri ÿ rj 1  i; j  2; i 6ˆ j:
where the order of numbering is obtained from the Protein
Data Bank (PDB) [1], [2], accessible at http://www.rcsb.org. Here ri is M's occurrence number in the training data set
In the experiments, we examined two families of of family i. Intuitively, the more frequently a motif occurs in
proteins chosen from PDB pertaining to RNA-directed its own family and the less frequently it occurs in the other
DNA Polymerase and Thymidylate Synthase. Each family family, the higher its weight is. In each family we collected
contains proteins having the same functionality in various all the motifs having a weight greater than one and used
organisms. We decompose each protein into consecutive them as the characteristic motifs of that family.
substructures, each substructure containing six nodes. When classifying a test protein Q, we first decomposed Q
Two adjacent substructures overlap by sharing the two into consecutive substructures as described above. The
neighboring nodes on the boundary of the two substruc- result was a set of substructures, say, Q1 ; . . . ; Qp . Let
tures (see Fig. 19). Thus, each substructure is a portion of nki ; 1  i  2; 1  k  p, denote the number of characteristic
the polypeptide chain backbone of a protein where the motifs in family i that matched Qk within one mutation.
polypeptide chain is made up of residues linked together Each family i obtained a score Ni where
by peptide bonds. The peptide bonds have strong covalent Pp
nk
bonding forces that make the polypeptide chain rigid. As Ni ˆ kˆ1 i
mi
a consequence, the substructures used by our algorithm
are rigid. Notice that in the proteins there are other atoms and mi is the total number of characteristic motifs in family
such as O and H (not shown in Fig. 19) lying between two i. Intuitively, the score Ni here was determined by the
residues. Since these atoms are not as important as C , C , number of the characteristic motifs in family i that occurred
and N in determining the structure of a protein, we do in Q, divided by the total number of characteristic motifs in
not consider them here. Table 4 summarizes the number family i. The protein Q was classified into the family i with
of proteins in each family, their sizes and the frequently maximum Ni . If the scores were 0 for both families (i.e., the
occurring patterns (or motifs) discovered from the test protein did not have any substructure that matched any
proteins. The parameter values used were as shown in characteristic motif), then the ªno-opinionº verdict was
Table 3. In the experiments, 2,784 false matches were given. This algorithm is similar to those used in [30], [35] to
detected and eliminated during the process of finding the classify chemical compounds and sequences.

TABLE 4
Statistics Concerning the Proteins and Motifs Found in Them
746 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

As in Section 3, we use recall (REc ) and precision (P Rc ) average-group method [13] to cluster the graphs in S, which
to evaluate the effectiveness of our classification algorithm. works as follows.
Recall is defined as Initially, every graph is a cluster. The algorithm merges
P two nearest clusters to form a new cluster, until there are
T otalNum ÿ 2iˆ1 NumLossic only K clusters left where K is a user-specified parameter.
REc ˆ  100%;
T otalNum The distance between two clusters C1 and C2 is given by
where T otalNum is the total number of test proteins and 1 X
NumLossic is the number of test proteins that belong to jd…Gx ; Gy †j; …16†
jC1 jjC2 j G 2C ;G 2C
family i but are not assigned to family i by our algorithm x 1 y 2

(they are either assigned to family j, j 6ˆ i, or they receive where jCi j, i ˆ 1; 2, is the size of cluster Ci . The algorithm
the ªno-opinionº verdict). Precision is defined as requires O…N 2 † distance calculations where N is the total
P number of graphs in S.
T otalNum ÿ 2iˆ1 NumGainic
P Rc ˆ  100%; We applied this algorithm to clustering chemical
T otalNum
compounds. Ninety eight compounds were chosen from
where NumGainic is the number of test proteins that do not the Merck Index that belonged to three groups pertaining to
belong to family i but are assigned by our algorithm to aromatic, bicyclicalkanes and photosynthesis. The data was
family i. With the 10-way cross validation scheme, the created by the CORINA program that converted 2D data
average REc over the 10 tests was 92.7 percent and the (represented in SMILES string) to 3D data (represented in
average P Rc was 96.4 percent. It was found that 3.7 percent PDB format) [22]. Table 5 lists the number of compounds in
test proteins on average received the ªno-opinionº verdict each group, their sizes and the patterns discovered from
during the classification. We repeated the same experiments them. The parameter values used were Size ˆ 5, Occur ˆ 1,
using other parameter values and obtained similar results, Mut ˆ 2; the other parameters had the values shown in
except that larger Mut values (e.g., 3) generally yielded Table 3.
lower REc . To evaluate the effectiveness of our clustering algorithm,
The binary classification problem studied here is con- we applied it to finding clusters in the compounds. The
cerned with assigning a protein to one of two families. This parameter value K was set to 3, as there were three groups.
problem arises frequently in protein homology detection As in the previous sections, we use recall (REr ) and
[31]. The performance of our technique degrades when precision (P Rr ) to evaluate the effectiveness of the cluster-
applied to the n-ary classification problem, which is ing algorithm. Recall is defined as
concerned with assigning a protein to one of many families P
[32]. For example, we applied the technique to classifying T otalNum ÿ K iˆ1 NumLossr
i
REr ˆ  100%;
three families of proteins and the REc and P Rc dropped to T otalNum
80 percent.
where NumLossir is the number of compounds that belong
4.2 Clustering Compounds to group Gi , but are assigned by our algorithm to group Gj ,
In addition to classifying proteins, we have developed an i 6ˆ j, and T otalNum is the total number of compounds
algorithm for clustering 3D graphs based on the patterns tested. Precision is defined as
occurring in the graphs and have applied the algorithm P
T otalNum ÿ K iˆ1 NumGainr
i
to grouping compounds. Given a collection S of 3D P Rr ˆ  100%;
T otalNum
graphs, the algorithm first uses the procedure depicted in
Section 2.2 to decompose the graphs into rigid substruc- where NumGainir is the number of compounds that do
tures. Let fStrp jp ˆ 0; 1; :::; N ÿ 1g be the set of substruc- not belong to group Gi , but are assigned by our
tures found in the graphs in S where jStrp j  Size. algorithm to group Gi . Our experimental results indicated
Using the proposed pattern-finding algorithm, we exam- that REr ˆ P Rr ˆ 99%. Out of the 98 compounds, only
ine each graph Gq in S and determine whether each one compound in the photosynthesis group was assigned
substructure Strp approximately occurs in Gq within incorrectly to the bicyclicalkanes group. We experimented
Mut mutations. Each graph Gq is represented as a bit with other parameter values and obtained similar
string of length N, i.e., Gq ˆ …b0q ; b1q ; :::; bNÿ1
q †, where results.11

1 if Strp occurs in Gq within Mut mutations
bpq ˆ 5 RELATED WORK
0 otherwise:
There are several groups working on pattern finding (or
For example, consider the two patterns P1 and P2 in
knowledge discovery) in molecules and graphs. Conklin et al.
Fig. 3 and the three graphs in Fig. 3a. Suppose the
[4], [5], [6], for example, represented a molecular structure as
allowed number of mutations is 0. Then, G1 is repre-
an image, which comprised a set of parts with their 3D
sented as 10, G2 as 01, and G3 as 10. On the other hand,
coordinates and a set of relations that were preserved for the
suppose the allowed number of mutations is 1. Then, G1
image. The authors used an incremental, divisive approach to
and G3 are represented as 11 and G2 as 01.
discover the ªknowledgeº from a data set, that is, to build a
The distance between two graphs Gx and Gy , denoted
d…Gx ; Gy †, is defined as the Hamming distance [12] between 11. The Occur value was fixed at 1 in these experiments because of the
their bit strings. The algorithm then uses the well-known fact that all the compounds were represented as binary bit strings.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 747

TABLE 5
Statistics Concerning the Chemical Compounds and Patterns Found in Them

subsumption hierarchy that summarized and classified the point or a specific point chosen based on chemical
data set. The algorithm relied on a measure of similarity considerations. The 3D substructure can be rotated around
among molecular images that was defined in terms of their the hinge point. The authors then developed schemes to
largest common subimages. In [8], Djoko et al. developed a generate node-triplets and hash table indices. However,
system, called SUBDUE, that utilized the minimum descrip- based on those schemes, false matches cannot be detected in
tion length principle to find repeatedly occurring substruc- the hash table and must be processed in a subsequent,
tures in a graph. Once a substructure was found, the additional verification phase. Thus, even if a pattern
substructure was used to simplify the graph by replacing appears to match several substructures during the search-
instances of the substructure with a pointer to the
ing phase, one has to compare the pattern with those
discovered substructure. In [7], Dehaspe et al. used
substructures one by one to make sure they are true
DATALOG to represent compounds and applied data
matches during the verification phase. When mutations are
mining techniques to predicting chemical carcinogenicity.
Their techniques were based on Mannila and Toivonen's allowed, which were not considered in [23], [24], [29], a
algorithm [15] for finding interesting patterns from a class brute-force verification test would be very time-consuming.
of sentences in a database. For example, in comparing our approach with their
In contrast to the above work, we use the geometric approach in performing protein classification as described
hashing technique to find approximately common pat- in Section 4, we found that both approaches yield the same
terns in a set of 3D graphs without prior knowledge of recall and precision, though our approach is 10 times faster
their structures, positions, or occurrence frequency. The than their approach.
geometric hashing technique used here originated from
the work of Lamdan and Wolfson for model based
6 CONCLUSION
recognition in computer vision [14]. Several researchers
attempted to parallelize the technique based on various In this paper, we have presented an algorithm for finding
architectures, such as the Hypercube and the Connection patterns in 3D graphs. A pattern here is a rigid substructure
Machine [3], [19], [20]. It was observed that the distribu- that may occur in a graph after allowing for an arbitrary
tion of the hash table entries might be skewed. To balance number of rotations and translations as well as a small
the distribution of the hash function values, delicate number of edit operations in the pattern or in the graph. We
rehash functions were designed [20]. There were also used the algorithm to find approximately common patterns
efforts exploring the uncertainty existing in the geometric in a set of synthetic graphs, chemical compounds and
hashing algorithms [11], [26]. proteins. This yields a kind of alphabet of patterns. Our
In 1996, Rigoutsos et al. employed geometric hashing experimental results demonstrated the good performance of
and magic vectors for substructure matching in a database the proposed algorithm and its usefulness for pattern
of chemical compounds [21]. The magic vectors were bonds discovery. We then developed classification and clustering
among atoms; the choice of them was domain dependent algorithms using the patterns found in the graphs, and
and was based on the type of each individual graph. We applied them to classifying proteins and clustering com-
extend the work in [21] by providing a framework for pounds. Empirical study showed high recall and precision
discovering approximately common substructures in a set rates for both classification and clustering, indicating the
significance of the patterns.
of 3D graphs, and applying our techniques to both
We have implemented the techniques presented in the
compounds and proteins. Our approach differs from [21]
paper and are combining them with the algorithms for
in that instead of using the magic vectors, we store a
acyclic graph matching [36] into a toolkit. We use the toolkit
coordinate system in a hash table entry. Furthermore, we
to find patterns in various types of graphs arising in
establish a theory for detecting and eliminating false
different domains and as part of our tree and graph search
matches occurring in the hashing. engine project. The toolkit can be obtained from the authors
More recently, Wolfson and his colleagues also applied and is also accessible at http://www.cis.njit.edu/~jason/
geometric hashing algorithms to protein docking and sigmod.html on the Web.
recognition [23], [24], [29]. They attached a reference frame In the future, we plan to study topological graph
to each substructure, where the reference frame is an matching problems. One weakness of the proposed geo-
orthonormal coordinate frame of arbitrary orientations, metric hashing approach is its sensitivity to errors as we
established at a ªhingeº point. This hinge point can be any discussed in Section 3.2. One has to tune the Scale
748 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002

parameter when the accuracy of data changes. Another [10] H.N. Gabow, Z. Galil, and T.H. Spencer, ªEfficient Implementa-
tion of Graph Algorithms Using Contraction,º Proc. 25th Ann.
weakness is that the geometric hashing approach doesn't IEEE Symp. Foundations of Computer Science, pp. 347-357, 1984.
take edges into consideration. In some domains, edges are [11] W.E.L. Grimson, D.P. Huttenlocher, and D.W. Jacobs, ªAffine
important despite the nodes matching. For example, in Matching with Bounded Sensor Error: Study of Geometric
computer vision, edges may represent some kind of Hashing and Alignment,º Technical Memo AIM-1250, Artificial
Intelligence Laboratory, Massachusetts Inst. of Technology, 1991.
boundaries and have to be considered in object recognition. [12] R.W. Hamming ªError Detecting and Error Correcting Codes,º
(Notice that when applying the proposed geometric The Bell System Technical J., vol. 29, no. 2, pp. 147-160, 1950,
hashing approach to this domain, the node label alphabet Reprinted in E.E. Swartzlander, Computer Arithmetic. vol. 2, Los
Alamitos, Calif.: IEEE Computer Soc. Press Tutorial, 1990.
could be extended to contain semantic information, such as [13] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An
color, shape, etc.) We have done preliminary work in Introduction to Cluster Analysis. New York: John Wiley & Sons,
topological graph matching, in which our algorithm 1990.
[14] Y. Lamdan and H. Wolfson, ªGeometric Hashing: A General and
matches edge distances and the software is accessible at
Efficient Model-Based Recognition Scheme,º Proc. Int'l Conf.
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.html. Computer Vision, pp. 237-249, 1988.
We plan to combine geometric matching and topological [15] H. Mannila and H. Toivonen, ªLevelwise Search and Borders of
matching approaches to finding patterns in more general Theories in Knowledge Discovery,º Data Mining and Knowledge
Discovery, vol. 1, no. 3, pp. 241-258, 1997.
graphs. [16] J.A. McHugh, Algorithmic Graph Theory. Englewood Cliffs, N.J.:
Prentice Hall, 1990.
[17] X. Pennec and N. Ayache, ªAn O…n2 † Algorithm for 3D
ACKNOWLEDGMENTS Substructure Matching of Proteins,º Proc. First Int'l Workshop
Shape and Pattern Matching in Computational Biology, pp. 25-40,
The authors would like to thank the anonymous reviewers 1994.
for their constructive suggestions that helped improve both [18] C. Pu, K.P. Sheka, L. Chang, J. Ong, A. Chang, E. Alessio,
the quality and the presentation of this paper. We also I.N. Shindyalov, W. Chang, and P.E. Bourne, ªPDBtool: A
Prototype Object Oriented Toolkit for Protein Structure
thank Dr. Carol Venanzi, Bo Chen, and Song Peng for useful Verification,º Technical Report CUCS-048-92, Dept. Computer
discussions and for providing chemical compounds used in Science, Columbia Univ., 1992.
the paper. This work was supported in part by US National [19] I. Rigoutsos and R. Hummel, ªScalable Parallel Geometric
Science Foundation grants IRI-9531548, IRI-9531554, IIS- Hashing for Hypercube (SIMD) Architectures,º Technical Report
TR-553, Dept. Computer Science, New York Univ., 1991.
9988345, IIS-9988636, and by the Natural Sciences and [20] I. Rigoutsos and R. Hummel, ªOn a Parallel Implementation of
Engineering Research Council of Canada under Grant No. Geometric Hashing on the Connection Machine,º Technical
OGP0046373. Part of this work was done while X. Wang Report TR-554, Dept. Computer Science, New York Univ., 1991.
[21] I. Rigoutsos, D. Platt, and A. Califano, ªFlexible Substructure
was with the Department of Computer and Information Matching in Very Large Databases of 3D-Molecular Information,º
Science, New Jersey Institute of Technology. Part of this research report, IBM T.J. Watson Research Center, Yorktown
work was done while J.T.L. Wang was visiting Courant Heights, N.Y., 1996.
Institute of Mathematical Sciences, New York University. [22] J. Sadowski and J. Gasteiger, ªFrom Atoms and Bonds to Three-
Dimensional Atomic Coordinates: Automatic Model Builders,º
Chemical Rev., pp. 2567-2581, vol. 93, 1993.
[23] B. Sandak, R. Nussinov, and H.J. Wolfson, ªA Method for
REFERENCES Biomolecular Structural Recognition and Docking Allowing
[1] E.E. Abola, F.C. Bernstein, S.H. Bryant, T.F. Koetzle, and J. Weng, Conformational Flexibility,º J. Computational Biology, vol. 5, no. 4,
ªProtein Data Bank,º Data Comm. Int'l Union of Crystallography, pp. 631-654, 1998.
pp. 107-132, 1987. [24] B. Sandak, H.J. Wolfson, and R. Nussinov, ªFlexible Docking
[2] F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D. Allowing Induced Fit in Proteins: Insights from an Open to Closed
Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, Conformational Isomers,º Proteins, vol. 32, pp. 159-74, 1998.
ªThe Protein Data Bank: A Computer-Based Archival File for [25] Time Warps, String Edits, and Macromolecules: The Theory and
Practice of Sequence Comparison. D. Sankoff and J.B. Kruskal, eds.,
Macromolecular Structures,º J. Molecular Biology, vol. 112, 1977.
Reading, Mass.: Addison-Wesley, 1983.
[3] O. Bourdon and G. Medioni, ªObject Recognition Using Geo-
[26] K.B. Sarachik, ªLimitations of Geometric Hashing in the Presence
metric Hashing on the Connection Machine,º Proc. 10th Int'l Conf. of Gaussian Noise,º Technical Memo AIM-1395, Artificial Intelli-
Pattern Recognition, pp. 596-600, 1990. gence Laboratory, Massachusetts Inst. of Technology, 1992.
[4] D. Conklin, ªKnowledge Discovery in Molecular Structure [27] R.R. Stoll, Linear Algebra and Matrix Theory, New York: McGraw-
Databases,º doctoral dissertation, Dept. Computing and Informa- Hill, 1952.
tion Science, Queen's Univ., Canada, 1995. [28] I.I. Vaisman, A. Tropsha, and W. Zheng, ªCompositional
[5] D. Conklin, ªMachine Discovery of Protein Motifs,º Machine Preferences in Quadruplets of Nearest Neighbor Residues in
Learning, vol. 21, pp. 125-150, 1995. Protein Structures: Statistical Geometry Analysis,º Proc. IEEE Int'l
[6] D. Conklin, S. Fortier, and J. Glasgow, ªKnowledge Discovery in Join Symp. Intelligence and Systems, pp. 163-168, 1998.
Molecular Databases,º IEEE Trans. Knowledge and Data Eng., vol. 5, [29] G. Verbitsky, R. Nussinov, and H.J. Wolfson, ªFlexible Structural
no. 6, pp. 985-987, 1993. Comparison Allowing Hinge Bending and Swiveling Motions,º
[7] L. Dehaspe, H. Toivonen, and R.D. King, ªFinding Frequent Proteins, vol. 34, pp. 232-254, 1999.
Substructures in Chemical Compounds,º Proc. Fourth Int'l Conf. [30] J.T.L. Wang, G.-W. Chirn, T.G. Marr, B.A. Shapiro, D. Shasha, and
Knowledge Discovery and Data Mining, pp. 30-36, 1998. K. Zhang, ªCombinatorial Pattern Discovery for Scientific Data:
[8] S. Djoko, D.J. Cook, and L.B. Holder, ªAn Empirical Study of Some Preliminary Results,º Proc. 1994 ACM SIGMOD Int'l Conf.
Domain Knowledge and Its Benefits to Substructure Discovery,º Management of Data, pp. 115-125, 1994.
IEEE Trans. Knowledge and Data Eng., vol. 9, no. 4, pp. 575-586, [31] J.T.L. Wang, Q. Ma, D. Shasha, and C.H. Wu, ªApplication of
1997. Neural Networks to Biological Data Mining: A Case Study in
[9] D. Fischer, O. Bachar, R. Nussinov, and H.J. Wolfson, ªAn Protein Sequence Classification,º Proc. Sixth ACM SIGKDD Int'l
Efficient Automated Computer Vision Based Technique for Conf. Knowledge Discovery and Data Mining, pp. 305±309, 2000.
Detection of Three Dimensional Structural Motifs in Proteins,º [32] J.T.L. Wang, T.G. Marr, D. Shasha, B. A. Shapiro, G.-W. Chirn, and
J. Biomolecular and Structural Dynamics, vol. 9, no. 4, pp. 769-789, T.Y. Lee, ªComplementary Classification Approaches for Protein
1992. Sequences,º Protein Eng., vol. 9, no. 5, pp. 381-386, 1996.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 749

[33] J.T.L. Wang, B.A. Shapiro, and D. Shasha, eds., Pattern Discovery in Bruce A. Shapiro received the BS degree from
Biomolecular Data: Tools, Techniques and Applications, New York: Brooklyn College studying mathematics and
Oxford Univ. Press, 1999. physics, the MS and PhD degrees in computer
[34] X. Wang and J.T.L. Wang, ªFast Similarity Search in Three- science from the University of Maryland, College
Dimensional Structure Databases,º J. Chemical Information and Park. He is a principal investigator and computer
Computer Sciences, vol. 40, no. 2, pp. 442-451, 2000. specialist in the Laboratory of Experimental and
[35] X. Wang, J.T.L. Wang, D. Shasha, B.A. Shapiro, S. Dikshitulu, Computational Biology in the National Cancer
I. Rigoutsos, and K. Zhang, ªAutomated Discovery of Active Institute, National Institutes of Health.
Motifs in Three Dimensional Molecules,º Proc. Third Int'l Conf. Dr. Shapiro directs research on computational
Knowledge Discovery and Data Mining, pp. 89-95, 1997. nucleic acid structure prediction and analysis
[36] K. Zhang, J.T.L. Wang, and D. Shasha, ªOn the Editing Distance and has developed several algorithms and computer systems related to
Between Undirected Acyclic Graphs,º Int'l J. Foundations of this area. His interests include massively parallel genetic algorithms,
Computer Science, special issue on Computational Biology, vol. 7, molecular dynamics, and data mining for molecular structure/function
no. 1, pp. 43-57, 1996. relationships. He is the author of numerous papers on nucleic acid
morphology, parallel computing, image processing, and graphics. He is
an editor and author of the book Pattern Discovery in Biomolecular Data,
Xiong Wang received the MS degree in published by Oxford University Press in 1999. In addition, he is a lecturer
computer science from Fudan University, China, in the Computer Science Department at the University of Maryland,
and the PhD degree in computer and information College Park.
science from the New Jersey Institute of
Technology, in 2000 (thesis advisor: Isidore Rigoutsos received the BS degree in physics from the
Jason T.L. Wang). He is currently an assistant University of Athens, Greece, the MS and PhD degrees in computer
professor of computer science in California State science from the Courant Institute of Mathematical Sciences of New
University, Fullerton. His research interests York University. He manages the Bioinformatics and Pattern Discovery
include databases, knowledge discovery and group in the Computational Biology Center of the Thomas J. Watson
data mining, pattern matching, and bioinfor- Research Center. Currently, he is visiting lecturer in the Department of
matics. He is a member of ACM, ACM SIGMOD, IEEE, and IEEE Chemical Engineering of the Massachusetts Institute of Technology. He
Computer Society. has been the recipient of a Fulbright Foundation Fellowship and has
held an adjunct professor appointment in the Computer Science
Department of New York University. He is the author or co-author of
eight patents and numerous technical papers. Dr. Rigoutsos is a
Jason T.L. Wang received the BS degree in member of the IEEE, the IEEE Computer Society, the International
mathematics from National Taiwan University, Society for Computational Biology, and the American Association for the
and the PhD degree in computer science from Advancement of Science.
the Courant Institute of Mathematical Sciences,
New York University, in 1991 (thesis advisor: Kaizhong Zhang received the MS degree in mathematics from Peking
Dennis Shasha). He is a full professor in the University, Beijing, People's Republic of China, in 1981, and the MS and
Computer and Information Science Department PhD degrees in computer science from the Courant Institute of
at the New Jersey Institute of Technology and Mathematical Sciences, New York University, New York, in 1986 and
director of the University's Data and Knowledge 1989, respectively. Currently, he is an associate professor in the
Engineering Laboratory. Dr. Wang's research Department of Computer Science, University of Western Ontario,
interests include data mining and databases, pattern recognition, London, Ontario, Canada. His research interests include pattern
bioinformatics, and Web information retrieval. He has published more recognition, computational biology, image processing, sequential, and
than 100 papers, including 30 journal articles, in these areas. He is an parallel algorithms.
editor and author of the book Pattern Discovery in Biomolecular Data
(Oxford University Press), and co-author of the book Mining the World
Wide Web (Kluwer). In addition, he is program cochair of the 2001
Atlantic Symposium on Computational Biology, Genome Information
Systems & Technology held at Duke University. He is a member of the . For more information on this or any computing topic, please visit
IEEE. our Digital Library at http://computer.org/publications/dlib.

Dennis Shasha received the PhD degree from


Harvard in 1984 (thesis advisor: Nat Goodman).
He is a professor at Courant Institute, New York
University, where he does research on biological
pattern discovery, combinatorial pattern match-
ing on trees and graphs, software support for the
exploration of unfamiliar databases, database
tuning, and database design for time series.
After graduating from Yale in 1977, he worked
for IBM designing circuits and microcode for the
3090. He has written a professional reference book Database Tuning: A
Principled Approach, (1992, Prentice-Hall), two books about a mathe-
matical detective named Dr. Ecco entitled The Puzzling Adventures of
Dr. Ecco, (1988, Freeman, and republished in 1998 by Dover) and
Codes, Puzzles, and Conspiracy, (1992, Freeman) and a book of
biographies about great computer scientists called Out of Their Minds:
The Lives and Discoveries of 15 Great Computer Scientists, (1995,
Copernicus/Springer-Verlag). Finally, he is a co-author of Pattern
Discovery in Biomolecular Data: Tools, Techniques, and Applications,
published in 1999 by Oxford University Press. In addition, he has co-
authored more than 30 journal papers and 40 conference papers and
spends most of his time in building data mining and pattern recognition
software these days. He writes a monthly puzzle column for Scientific
American.

You might also like