Professional Documents
Culture Documents
Finding Patterns in
Three-Dimensional Graphs: Algorithms and
Applications to Scientific Data Mining
Xiong Wang, Member, IEEE, Jason T.L. Wang, Member, IEEE,
Dennis Shasha, Bruce A. Shapiro, Isidore Rigoutsos, Member, IEEE, and Kaizhong Zhang
AbstractÐThis paper presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable or atomic unit
and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for
an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in
the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance
ªapproximate occurrence.º) The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method
is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-
triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we
apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA Polymerase and
Thymidylate Synthase and use the motifs to classify the proteins. Then, we apply the method to clustering chemical compounds
pertaining to aromatic, bicyclicalkanes, and photosynthesis. Experimental results indicate the good performance of our algorithms and
high recall and precision rates for both classification and clustering.
Index TermsÐKDD, classification and clustering, data mining, geometric hashing, structural pattern discovery, biochemistry,
medicine.
1 INTRODUCTION
TABLE 1
Identifiers, Labels and Global Coordinates of the
Nodes of the Graph in Fig. 1
Fig. 1. A graph G.
Fig. 3. (a) The set S of three graphs, (b) the pattern exactly occurring in two graphs in S, and (c) the pattern approximately occurring, within one
mutation, in all the three graphs.
734 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
describes the pattern-finding algorithm in detail. Section 3 4
5 20 node-triplets:
evaluates the performance and efficiency of the pattern- 3
finding algorithm. Section 4 describes the applications of There are several alternative ways to decompose
our approach to classifying proteins and clustering com- 3D graphs into rigid substructures, depending on the
pounds. Section 5 discusses related work. Section 6 application at hand and the nature of the graphs. For the
concludes the paper. purposes of exposition, we describe our pattern-finding
algorithm based on a partitioning strategy. Our approach
assumes a notion of atomic unit which is the lowest level of
2 PATTERN-FINDING ALGORITHM description in the case of interest. Intuitively, atomic units
2.1 Terminology are fundamental building elements, e.g., atoms in a
Let S be a set of 3D graphs. The occurrence number of a molecule. Edges arise as bonds between atomic units. We
pattern P is the number of graphs in S that approxi- break a graph into maximal size rigid substructures (recall
that a rigid substructure is a subgraph in which the relative
mately contain P within the allowed number of muta-
positions of nodes in the substructure are fixed). To
tions. Formally, the occurrence number of a pattern P
accomplish this, we use an approach similar to [16] that
with respect to mutation d and set S, denoted employs a depth-first search algorithm, referred to as DFB,
occur nodS
P , is k if there are k graphs in S that contain to find blocks in a graph.5 The DFB works by traversing the
P within d mutations. For example, consider Fig. 3 again. graph in a depth-first order and collecting nodes belonging
Let S contain the three graphs in Fig. 3a. Then, to a block during the traversal, as illustrated in the example
occur no0S
P1 = 2; occur no1S
P2 = 3. below. Each block is a rigid substructure. We merge two
Given a set S of 3D graphs, our algorithm finds all the rigid substructures B1 and B2 if they are not rotatable with
patterns P where P approximately occurs in at least respect to each other, that is, the relative position of a node
Occur graphs in S within the allowed number of mutations n1 2 B1 and a node n2 2 B2 is fixed. The algorithm
Mut and jP j Size, where jP j represents the size, i.e., the maintains a stack, denoted ST K, which keeps the rigid
number of nodes, of the pattern P . (Mut, Occur, and Size substructures being merged. Fig. 4 shows the algorithm,
which outputs a set of rigid substructures of a graph G. We
are user-specified parameters.) One can use the patterns in
then throw away the substructures P where jP j < Size. The
several ways. For example, biologists or chemists may
remaining substructures constitute the candidate patterns
evaluate whether the patterns are significant; computer generated from G. This pattern-generation algorithm runs
scientists may use the patterns to classify or cluster in time linearly proportional to the number of edges in G.
molecules as demonstrated in Section 4.
Our algorithm proceeds in two phases to search for the Example 3. We use the graph in Fig. 5 to illustrate how the
Find_Rigid_Substructures algorithm in Fig. 4 works.
patterns: 1) find candidate patterns from the graphs in S;
Rotatable edges in the graph are represented by lightface
and 2) calculate the occurrence numbers of the candidate
links; nonrotatable edges are represented by boldface
patterns to determine which of them satisfy the user-
links. Initially, the stack ST K is empty. We invoke DFB
specified requirements. We describe each phase in turn
to locate the first block (Step 4). DFB begins by visiting
below. the node numbered 0. Following the depth-first search,
2.2 Phase 1 of the Algorithm DFB then visits the nodes numbered 1, 2, and 5. Next,
DFB may visit the node numbered 6 or 4. Without loss of
In phase 1 of the algorithm, we decompose the graphs into
generality, assume DFB visits the node numbered 6 and,
rigid substructures. Dividing a graph into substructures is
then, the nodes numbered 10, 11, 13, 14, 15, 16, 18, and
necessary for two reasons. First, in dealing with some 17, in that order. Then, DFB visits the node numbered 14,
molecules such as chemical compounds in which there may and realizes that this node has been visited before. Thus,
exist two substructures that are rotatable with respect to DFB goes back to the node numbered 17, 18, etc., until it
each other, any graph containing the two substructures is returns to the node numbered 14. At this point, DFB
not rigid. As a result, we decompose the graph into identifies the first block, B1 , which includes nodes
substructures having no rotatable components and consider numbered 14, 15, 16, 17, and 18. Since the stack ST K is
the substructures separately. Second, our algorithm hashes empty now, we push B1 into ST K (Step 7).
node-triplets within rigid substructures into a 3D table. In iteration 2, we call DFB again to find the next block
When a graph as a whole is too large, as in the case of (Step 4). DFB returns the nonrotatable edge f13; 14g as a
block, denoted B2 .6 The block is pushed into ST K
proteins, considering all combinations of three nodes in the
(Step 7). In iteration 3, DFB locates the block B3 , which
graph may become prohibitive. Consequently, decompos-
includes nodes numbered 10, 11, 12, and 13 (Step 4).
ing the graph into substructures and hashing node-triplets Since B3 and the top entry of the stack, B2 are not
of the substructures can increase efficiency. For example, rotatable with respect to each other, we push B3 into
consider a graph of 20 nodes. There are
5. A block is a maximal subgraph that has no cut-vertices. A cut-vertex of
20 a graph is one whose removal results in dividing the graph into multiple,
1140 node-triplets: disjointed subgraphs [16].
3
6. In practice, in graph representations for molecules, one uses different
notation to distinguish nonrotatable edges from rotatable edges. For
On the other hand, if we decompose the graph into five example, in chemical compounds, a double bond is nonrotatable. The
substructures, each having four nodes, then there are only different notation helps the algorithm to determine the types of edges.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 735
ST K (Step 7). In iteration 4, we continue to call DFB and is rotatable with respect to B9 , we pop out f5; 6g to form
get the single edge f6; 10g as the next block, B4 (Step 4). a rigid substructure (Step 9) and push B9 into ST K
Since f6; 10g is rotatable and the stack ST K is nonempty, (Step 10). Finally, since there is no block left in the graph,
we pop out all nodes in ST K, merge them and output we pop out all nodes in B9 to form a rigid substructure
the rigid substructure containing nodes numbered 10, 11, and terminate (Step 13).
12, 13, 14, 15, 16, 17, 18 (Step 9). We then push B4 into
ST K (Step 10). 2.3 Phase 2 of the Algorithm
Phase 2 of our pattern-finding algorithm consists of two
In iteration 5, DFB visits the node numbered 7 and,
subphases. In subphase A of phase 2, we hash and store the
then, 8 from which DFB goes back to the node numbered
candidate patterns generated from the graphs in phase 1 in
7. It returns the single edge f7; 8g as the next block, B5
a 3D table H. In subphase B, we rehash each candidate
(Step 4). Since f7; 8g is connected to the current top entry pattern into H and calculate its occurrence number. Notice
of ST K, f6; 10g, via a rotatable edge f6; 7g, we pop out that in the subphase B, one does not need to store the
f6; 10g, which itself becomes a rigid substructure (Step 9). candidate patterns in H again.
We then push f7; 8g into ST K (Step 10). In iteration 6, In processing a rigid substructure (pattern) of a 3D graph,
DFB returns the single edge f7; 9g as the next block, B6 we choose all three-node combinations, referred to as node-
(Step 4). Since B6 and the current top entry of ST K, B5 = triplets, in the substructure and hash the node-triplets. We
f7; 8g, are not rotatable with respect to each other, we hash three-node combinations, because to fix a rigid
push B6 into ST K (Step 7). In iteration 7, DFB goes back substructure in the 3D Euclidean space one needs at least
from the node numbered 7 to the node numbered 6 and three nodes from the substructure and three nodes are
returns the single edge f6; 7g as the next block, B7 sufficient provided they are not collinear. Notice that the
(Step 4). Since f6; 7g is rotatable, we pop out all nodes in proper order of choosing the nodes i; j; k, in a triplet is
ST K, merge them and output the resulting rigid significant and has an impact on the accuracy of our
substructure containing nodes numbered 7, 8, and 9 approach, as we will show later in the paper. We determine
(Step 9). We then push B7 = f6; 7g into ST K (Step 10). the order of the three nodes by considering the triangle
In iteration 8, DFB returns the block B8 = f5; 6g formed by them. The first node chosen always opposes the
(Step 4). Since f5; 6g and f6; 7g are both rotatable, we pop longest edge of the triangle and the third node chosen
out f6; 7g to form a rigid substructure (Step 9). We then opposes the shortest edge. For example, in the triangle in
push B8 into ST K (Step 10). In iteration 9, DFB returns Fig. 6, we choose i; j; k, in that order. Thus, the order is
the block B9 containing nodes numbered 0, 1, 2, 3, 4, and unique if the triangle is not isosceles or equilateral, which
5 (Step 4). Since the current top entry of ST K, B8 = f5; 6g, usually holds when the coordinates are floating point
TABLE 2
The Node Labels of the Graph in Fig. 1 and
Their Indices in the Array A
2.3.2 Subphase B of Phase 2 nodes u, v, and w geometrically match the nodes i, j, and k,
Let H be the resulting hash table obtained in subphase A of respectively.
phase 2 of the pattern-finding algorithm. In subphase B, we Proof. (If) Let A be as defined in (4) and let
calculate the occurrence number of each candidate pattern 0 1
P by rehashing the node-triplets of P into H. This way, we V~u;v
B C
are able to match a node-triplet tri of P with a node-triplet B@ V~u;w A:
8
tri0 of another substructure (candidate pattern) P 0 stored in ~ ~
Vu;v Vu;w
subphase A where tri and tri0 have the same hash bin
address. By counting the node-triplet matches, one can infer Note that, if u, v, and w geometrically match i, j, and k,
whether P matches P 0 and, therefore, whether P occurs in respectively, then jBj jAj, where jBj (jAj, respectively)
the graph from which P 0 is generated. is the determinant of the matrix B (matrix A, respec-
We associate each substructure with several counters, tively). That is to say, jAÿ1 jjBj 1.
which are created and updated as illustrated by the From (3) and by the definition of the SFP in (7),
we have
following example. Suppose the two substructures (pat-
0 1 0 1
terns) of graph G with identification number 12 in Fig. 1
V~i;b1 Pu
have already been stored in the hash table H in subphase A. B C
SFP @ V~i;b2 A Aÿ1 B @ Pu A:
9
Suppose i; j; k, are three nodes in the substructure Str0 of G.
V~i;b3 Pu
Thus, for this node-triplet, its entry in the hash table is
12; 0; Lcode; SF0 i; j; k. Now, in subphase B, consider Thus, the SFP basically transforms Pb1 , Pb2 , and Pb3 via
another pattern P ; we hash the node-triplets of P using two translations and one rotation, where Pb1 , Pb2 , and Pb3
the same hash function. Let u; v; w, be three nodes in P that are the basis points of the Substructure Frame 0 (SF0 ).
have the same hash bin address as i; j; k; that is, the node- Since V~b1 ;b2 , V~b1 ;b3 , and V~b1 ;b2 V~b1 ;b3 are orthonormal
triplet u; v; w ªmatchesº the node-triplet i; j; k. If the vectors, and translations and rotations do not change
nodes u; v; w, geometrically match the nodes i; j; k respec- this property [27], we know that V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2
tively, i.e., they have coinciding 3D coordinates after V~c1 ;c3 are orthonormal vectors.
rotations and translations, we call the node-triplet match a (Only if) If u, v, and w do not match i, j and k
true match; otherwise it is a false match. For a true match, let geometrically while having the same hash bin address,
0 1 0 1 then there would be distortion in the aforementioned
V~u;v Pu transformation. Consequently, V~c1 ;c2 , V~c1 ;c3 , and V~c1 ;c2
B C @ A V~c1 ;c3 would no longer be orthonormal vectors.
SFP SF0 i; j; k @ V~u;w A Pu :
7 u
t
~ ~
Vu;v Vu;w Pu
suppose there is an SFP associated with Str whose counter qualified patterns. In Section 3, we will show experimen-
value Cnt > P , where tally that the missed patterns are few compared with those
found by exhaustive search.
N ÿ1
N ÿ 1
N ÿ 2
N ÿ 3
P
14 Theorem 4. Let the set S contain K graphs, each having at most
3 6
N nodes. The time complexity of the proposed pattern-finding
and N jP j ÿ Mut. Then, P matches Str within Mut algorithm is O
KN 3 .
mutations (i.e., P approximately occurs in G, or G
Proof. For each graph in S, phase 1 of the algorithm
approximately contains P , within Mut mutations).
requires O
N 2 time to decompose the graph into
Proof. By Theorem 2, we increase the counter value only substructures. Thus, the time needed for phase 1 is
when there are true node-triplet matches that are O
KN 2 . In subphase A of phase 2, we hash each
augmentable under the SFP . Thus, the counter value candidate pattern P by considering the combinations of
shows currently how many node-triplets from the any three nodes in P , which requires time
pattern P are found to match with the node-triplets
from the substructure Str in the hash table. Note that P jP j
O
jP j3 :
represents the number of distinct node-triplets that can 3
be generated from a substructure of size N ÿ 1. If there
are N ÿ 1 node matches between P and Str, then Thus, the time needed to hash all candidate patterns is
Cnt P .9 Therefore, when Cnt > P , there are at least O
KN 3 . In subphase B of phase 2, we rehash each
N node matches between Str and P . This completes the candidate pattern, which requires the same time
proof. u
t O
KN 3 . u
t
TABLE 3
Parameters in the Pattern-Finding Algorithm and Their Base Values Used in the Experiments
Fig. 10. Running times as a function of the number of graphs. Fig. 11. Recall as a function of the number of graphs.
742 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
Fig. 12. Number of false matches as a function of the number of graphs. Fig. 14. Effect of Size.
becomes large. Similar results were obtained when testing the fewer false matches. On the other hand, when Nrow
the size of graphs for both types of data. is too large, the running times increase substantially, since
one needs to spend a lot of time in reading the 3D table
3.2 Effect of Algorithm-Related Parameters
containing the hash bin addresses.
The purpose of this section is to analyze the effect of
Examining Fig. 13, we see how Scale affects Nfm . Taking
varying the algorithm-related parameter values on the
an extreme case, when Scale 1, a node-triplet with the
performance of the proposed pattern-finding algorithm.
squares of the lengths of the three edges connecting the
To avoid the mutual influence of parameters, the analysis
nodes being 12.4567 is hashed to the same bin as a node-
was carried out by fixing the parameter values related to triplet with those values being 12.0000, although the two
data graphsÐthe 1,000 synthetic graphs and 226 com- node-triplets do not match geometrically, see Section 2.3.1.
pounds described above were used, respectively. In each On the other hand, when Scale is large (e.g., Scale = 10,000),
experiment, only one algorithm-related parameter value the distribution of hash function values is less skewed,
was varied; the other parameters had the values shown in which reduces the number of false matches. It was observed
Table 3. that Nfm was largest when Scale was 100. This happens
We first examined the effect of varying the parameter because with this Scale value, inaccuracy was being
values on generating false matches. Since few false introduced in calculating hash bin addresses. A node-triplet
matches were found for chemical compounds, the being hashed to a bin might generate k false matches where
experiments focused on synthetic graphs. It was observed k is the total number of node-triplets already stored in that
that only Nrow and Scale affected the number of false binÐk would be large if the distribution of hash function
matches. Fig. 13 shows Nfm as a function of Scale for values is skewed.
Nrow = 101, 131, 167, 199, respectively. The larger the Figs. 14, 15, and 16 show the recall as a function of Size,
Nrow, the fewer entries in a hash bin, and consequently Mut, and Scale, respectively. In all three figures, precision
Fig. 13. Number of false matches as a function of Scale. Fig. 15. Effect of Mut.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 743
to the nth digit after the decimal point, i.e., to 10ÿn , the error
caused by eliminating the digits after the
n 1th position is
no larger than 0:5 10ÿn . Namely, for any coordinates
x; y; z, we have x 0:5 10ÿn , y 0:5 10ÿn , and
z 0:5 10ÿn . Thus,
these parameter values had little impact on the performance the original substructure P . Supposing the pattern P 0 has
of the proposed algorithm. N nodes, the total number of node-triplets that include the
Finally, we examined the effect of sensor errors, which node v is
are caused by the measuring device (e.g., X-ray crystal-
N ÿ1
lography), and are proportional to the values being :
2
measured. Suppose that the sensor error is 10ÿn of the
real value for any coordinates
x; y; z, i.e., x x 10ÿn , Thus, we would miss the same number of node-triplet
y y 10ÿn , and z z 10ÿn . We have matches. However, if all other node-triplets that do not
include the node v are matched successfully, we would
x x
1 10ÿn ; y y
1 10ÿn ; still have
and z z
1 10ÿn . Let
l1 represent the sensor error
N N ÿ1 N ÿ1
induced in calculating l1 in (2). In a way similar to the ÿ
3 2 3
calculation of the accumulative error l1 in (15), we obtain
node-triplet matches. According to Theorem 3, there would
l1
xi ÿ xj 2
yi ÿ yj 2
zi ÿ zj 2 2 10ÿn Scale be N ÿ 1 node matches between the pattern P 0 and the
xi ÿ xj 2
yi ÿ yj 2
zi ÿ zj 2 10ÿ2n Scale: original substructure P (i.e., P 0 matches P with 1 mutation).
This example reveals that, in general, if we allow mutations
Let us focus on the first term since it is more significant. to occur in searching for patterns and the number of nodes
Clearly, the proposed approach is not feasible in the with sensor errors equals the allowed number of mutations,
environment with large sensor errors. Taking an example, the recall will be the same as the case in which there are no
suppose that the sensor error is 10ÿ2 of the real value. sensor errors and we search for exactly matched patterns
Again, in order to keep the calculation of hash bin without mutations. On the other hand, if no mutations are
addresses accurate, the accumulative error should be allowed in searching for patterns and sensor errors occur,
smaller than 1. That is, the recall will be zero.
TABLE 4
Statistics Concerning the Proteins and Motifs Found in Them
746 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
As in Section 3, we use recall (REc ) and precision (P Rc ) average-group method [13] to cluster the graphs in S, which
to evaluate the effectiveness of our classification algorithm. works as follows.
Recall is defined as Initially, every graph is a cluster. The algorithm merges
P two nearest clusters to form a new cluster, until there are
T otalNum ÿ 2i1 NumLossic only K clusters left where K is a user-specified parameter.
REc 100%;
T otalNum The distance between two clusters C1 and C2 is given by
where T otalNum is the total number of test proteins and 1 X
NumLossic is the number of test proteins that belong to jd
Gx ; Gy j;
16
jC1 jjC2 j G 2C ;G 2C
family i but are not assigned to family i by our algorithm x 1 y 2
(they are either assigned to family j, j 6 i, or they receive where jCi j, i 1; 2, is the size of cluster Ci . The algorithm
the ªno-opinionº verdict). Precision is defined as requires O
N 2 distance calculations where N is the total
P number of graphs in S.
T otalNum ÿ 2i1 NumGainic
P Rc 100%; We applied this algorithm to clustering chemical
T otalNum
compounds. Ninety eight compounds were chosen from
where NumGainic is the number of test proteins that do not the Merck Index that belonged to three groups pertaining to
belong to family i but are assigned by our algorithm to aromatic, bicyclicalkanes and photosynthesis. The data was
family i. With the 10-way cross validation scheme, the created by the CORINA program that converted 2D data
average REc over the 10 tests was 92.7 percent and the (represented in SMILES string) to 3D data (represented in
average P Rc was 96.4 percent. It was found that 3.7 percent PDB format) [22]. Table 5 lists the number of compounds in
test proteins on average received the ªno-opinionº verdict each group, their sizes and the patterns discovered from
during the classification. We repeated the same experiments them. The parameter values used were Size 5, Occur 1,
using other parameter values and obtained similar results, Mut 2; the other parameters had the values shown in
except that larger Mut values (e.g., 3) generally yielded Table 3.
lower REc . To evaluate the effectiveness of our clustering algorithm,
The binary classification problem studied here is con- we applied it to finding clusters in the compounds. The
cerned with assigning a protein to one of two families. This parameter value K was set to 3, as there were three groups.
problem arises frequently in protein homology detection As in the previous sections, we use recall (REr ) and
[31]. The performance of our technique degrades when precision (P Rr ) to evaluate the effectiveness of the cluster-
applied to the n-ary classification problem, which is ing algorithm. Recall is defined as
concerned with assigning a protein to one of many families P
[32]. For example, we applied the technique to classifying T otalNum ÿ K i1 NumLossr
i
REr 100%;
three families of proteins and the REc and P Rc dropped to T otalNum
80 percent.
where NumLossir is the number of compounds that belong
4.2 Clustering Compounds to group Gi , but are assigned by our algorithm to group Gj ,
In addition to classifying proteins, we have developed an i 6 j, and T otalNum is the total number of compounds
algorithm for clustering 3D graphs based on the patterns tested. Precision is defined as
occurring in the graphs and have applied the algorithm P
T otalNum ÿ K i1 NumGainr
i
to grouping compounds. Given a collection S of 3D P Rr 100%;
T otalNum
graphs, the algorithm first uses the procedure depicted in
Section 2.2 to decompose the graphs into rigid substruc- where NumGainir is the number of compounds that do
tures. Let fStrp jp 0; 1; :::; N ÿ 1g be the set of substruc- not belong to group Gi , but are assigned by our
tures found in the graphs in S where jStrp j Size. algorithm to group Gi . Our experimental results indicated
Using the proposed pattern-finding algorithm, we exam- that REr P Rr 99%. Out of the 98 compounds, only
ine each graph Gq in S and determine whether each one compound in the photosynthesis group was assigned
substructure Strp approximately occurs in Gq within incorrectly to the bicyclicalkanes group. We experimented
Mut mutations. Each graph Gq is represented as a bit with other parameter values and obtained similar
string of length N, i.e., Gq
b0q ; b1q ; :::; bNÿ1
q , where results.11
1 if Strp occurs in Gq within Mut mutations
bpq 5 RELATED WORK
0 otherwise:
There are several groups working on pattern finding (or
For example, consider the two patterns P1 and P2 in
knowledge discovery) in molecules and graphs. Conklin et al.
Fig. 3 and the three graphs in Fig. 3a. Suppose the
[4], [5], [6], for example, represented a molecular structure as
allowed number of mutations is 0. Then, G1 is repre-
an image, which comprised a set of parts with their 3D
sented as 10, G2 as 01, and G3 as 10. On the other hand,
coordinates and a set of relations that were preserved for the
suppose the allowed number of mutations is 1. Then, G1
image. The authors used an incremental, divisive approach to
and G3 are represented as 11 and G2 as 01.
discover the ªknowledgeº from a data set, that is, to build a
The distance between two graphs Gx and Gy , denoted
d
Gx ; Gy , is defined as the Hamming distance [12] between 11. The Occur value was fixed at 1 in these experiments because of the
their bit strings. The algorithm then uses the well-known fact that all the compounds were represented as binary bit strings.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 747
TABLE 5
Statistics Concerning the Chemical Compounds and Patterns Found in Them
subsumption hierarchy that summarized and classified the point or a specific point chosen based on chemical
data set. The algorithm relied on a measure of similarity considerations. The 3D substructure can be rotated around
among molecular images that was defined in terms of their the hinge point. The authors then developed schemes to
largest common subimages. In [8], Djoko et al. developed a generate node-triplets and hash table indices. However,
system, called SUBDUE, that utilized the minimum descrip- based on those schemes, false matches cannot be detected in
tion length principle to find repeatedly occurring substruc- the hash table and must be processed in a subsequent,
tures in a graph. Once a substructure was found, the additional verification phase. Thus, even if a pattern
substructure was used to simplify the graph by replacing appears to match several substructures during the search-
instances of the substructure with a pointer to the
ing phase, one has to compare the pattern with those
discovered substructure. In [7], Dehaspe et al. used
substructures one by one to make sure they are true
DATALOG to represent compounds and applied data
matches during the verification phase. When mutations are
mining techniques to predicting chemical carcinogenicity.
Their techniques were based on Mannila and Toivonen's allowed, which were not considered in [23], [24], [29], a
algorithm [15] for finding interesting patterns from a class brute-force verification test would be very time-consuming.
of sentences in a database. For example, in comparing our approach with their
In contrast to the above work, we use the geometric approach in performing protein classification as described
hashing technique to find approximately common pat- in Section 4, we found that both approaches yield the same
terns in a set of 3D graphs without prior knowledge of recall and precision, though our approach is 10 times faster
their structures, positions, or occurrence frequency. The than their approach.
geometric hashing technique used here originated from
the work of Lamdan and Wolfson for model based
6 CONCLUSION
recognition in computer vision [14]. Several researchers
attempted to parallelize the technique based on various In this paper, we have presented an algorithm for finding
architectures, such as the Hypercube and the Connection patterns in 3D graphs. A pattern here is a rigid substructure
Machine [3], [19], [20]. It was observed that the distribu- that may occur in a graph after allowing for an arbitrary
tion of the hash table entries might be skewed. To balance number of rotations and translations as well as a small
the distribution of the hash function values, delicate number of edit operations in the pattern or in the graph. We
rehash functions were designed [20]. There were also used the algorithm to find approximately common patterns
efforts exploring the uncertainty existing in the geometric in a set of synthetic graphs, chemical compounds and
hashing algorithms [11], [26]. proteins. This yields a kind of alphabet of patterns. Our
In 1996, Rigoutsos et al. employed geometric hashing experimental results demonstrated the good performance of
and magic vectors for substructure matching in a database the proposed algorithm and its usefulness for pattern
of chemical compounds [21]. The magic vectors were bonds discovery. We then developed classification and clustering
among atoms; the choice of them was domain dependent algorithms using the patterns found in the graphs, and
and was based on the type of each individual graph. We applied them to classifying proteins and clustering com-
extend the work in [21] by providing a framework for pounds. Empirical study showed high recall and precision
discovering approximately common substructures in a set rates for both classification and clustering, indicating the
significance of the patterns.
of 3D graphs, and applying our techniques to both
We have implemented the techniques presented in the
compounds and proteins. Our approach differs from [21]
paper and are combining them with the algorithms for
in that instead of using the magic vectors, we store a
acyclic graph matching [36] into a toolkit. We use the toolkit
coordinate system in a hash table entry. Furthermore, we
to find patterns in various types of graphs arising in
establish a theory for detecting and eliminating false
different domains and as part of our tree and graph search
matches occurring in the hashing. engine project. The toolkit can be obtained from the authors
More recently, Wolfson and his colleagues also applied and is also accessible at http://www.cis.njit.edu/~jason/
geometric hashing algorithms to protein docking and sigmod.html on the Web.
recognition [23], [24], [29]. They attached a reference frame In the future, we plan to study topological graph
to each substructure, where the reference frame is an matching problems. One weakness of the proposed geo-
orthonormal coordinate frame of arbitrary orientations, metric hashing approach is its sensitivity to errors as we
established at a ªhingeº point. This hinge point can be any discussed in Section 3.2. One has to tune the Scale
748 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
parameter when the accuracy of data changes. Another [10] H.N. Gabow, Z. Galil, and T.H. Spencer, ªEfficient Implementa-
tion of Graph Algorithms Using Contraction,º Proc. 25th Ann.
weakness is that the geometric hashing approach doesn't IEEE Symp. Foundations of Computer Science, pp. 347-357, 1984.
take edges into consideration. In some domains, edges are [11] W.E.L. Grimson, D.P. Huttenlocher, and D.W. Jacobs, ªAffine
important despite the nodes matching. For example, in Matching with Bounded Sensor Error: Study of Geometric
computer vision, edges may represent some kind of Hashing and Alignment,º Technical Memo AIM-1250, Artificial
Intelligence Laboratory, Massachusetts Inst. of Technology, 1991.
boundaries and have to be considered in object recognition. [12] R.W. Hamming ªError Detecting and Error Correcting Codes,º
(Notice that when applying the proposed geometric The Bell System Technical J., vol. 29, no. 2, pp. 147-160, 1950,
hashing approach to this domain, the node label alphabet Reprinted in E.E. Swartzlander, Computer Arithmetic. vol. 2, Los
Alamitos, Calif.: IEEE Computer Soc. Press Tutorial, 1990.
could be extended to contain semantic information, such as [13] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An
color, shape, etc.) We have done preliminary work in Introduction to Cluster Analysis. New York: John Wiley & Sons,
topological graph matching, in which our algorithm 1990.
[14] Y. Lamdan and H. Wolfson, ªGeometric Hashing: A General and
matches edge distances and the software is accessible at
Efficient Model-Based Recognition Scheme,º Proc. Int'l Conf.
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.html. Computer Vision, pp. 237-249, 1988.
We plan to combine geometric matching and topological [15] H. Mannila and H. Toivonen, ªLevelwise Search and Borders of
matching approaches to finding patterns in more general Theories in Knowledge Discovery,º Data Mining and Knowledge
Discovery, vol. 1, no. 3, pp. 241-258, 1997.
graphs. [16] J.A. McHugh, Algorithmic Graph Theory. Englewood Cliffs, N.J.:
Prentice Hall, 1990.
[17] X. Pennec and N. Ayache, ªAn O
n2 Algorithm for 3D
ACKNOWLEDGMENTS Substructure Matching of Proteins,º Proc. First Int'l Workshop
Shape and Pattern Matching in Computational Biology, pp. 25-40,
The authors would like to thank the anonymous reviewers 1994.
for their constructive suggestions that helped improve both [18] C. Pu, K.P. Sheka, L. Chang, J. Ong, A. Chang, E. Alessio,
the quality and the presentation of this paper. We also I.N. Shindyalov, W. Chang, and P.E. Bourne, ªPDBtool: A
Prototype Object Oriented Toolkit for Protein Structure
thank Dr. Carol Venanzi, Bo Chen, and Song Peng for useful Verification,º Technical Report CUCS-048-92, Dept. Computer
discussions and for providing chemical compounds used in Science, Columbia Univ., 1992.
the paper. This work was supported in part by US National [19] I. Rigoutsos and R. Hummel, ªScalable Parallel Geometric
Science Foundation grants IRI-9531548, IRI-9531554, IIS- Hashing for Hypercube (SIMD) Architectures,º Technical Report
TR-553, Dept. Computer Science, New York Univ., 1991.
9988345, IIS-9988636, and by the Natural Sciences and [20] I. Rigoutsos and R. Hummel, ªOn a Parallel Implementation of
Engineering Research Council of Canada under Grant No. Geometric Hashing on the Connection Machine,º Technical
OGP0046373. Part of this work was done while X. Wang Report TR-554, Dept. Computer Science, New York Univ., 1991.
[21] I. Rigoutsos, D. Platt, and A. Califano, ªFlexible Substructure
was with the Department of Computer and Information Matching in Very Large Databases of 3D-Molecular Information,º
Science, New Jersey Institute of Technology. Part of this research report, IBM T.J. Watson Research Center, Yorktown
work was done while J.T.L. Wang was visiting Courant Heights, N.Y., 1996.
Institute of Mathematical Sciences, New York University. [22] J. Sadowski and J. Gasteiger, ªFrom Atoms and Bonds to Three-
Dimensional Atomic Coordinates: Automatic Model Builders,º
Chemical Rev., pp. 2567-2581, vol. 93, 1993.
[23] B. Sandak, R. Nussinov, and H.J. Wolfson, ªA Method for
REFERENCES Biomolecular Structural Recognition and Docking Allowing
[1] E.E. Abola, F.C. Bernstein, S.H. Bryant, T.F. Koetzle, and J. Weng, Conformational Flexibility,º J. Computational Biology, vol. 5, no. 4,
ªProtein Data Bank,º Data Comm. Int'l Union of Crystallography, pp. 631-654, 1998.
pp. 107-132, 1987. [24] B. Sandak, H.J. Wolfson, and R. Nussinov, ªFlexible Docking
[2] F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D. Allowing Induced Fit in Proteins: Insights from an Open to Closed
Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, Conformational Isomers,º Proteins, vol. 32, pp. 159-74, 1998.
ªThe Protein Data Bank: A Computer-Based Archival File for [25] Time Warps, String Edits, and Macromolecules: The Theory and
Practice of Sequence Comparison. D. Sankoff and J.B. Kruskal, eds.,
Macromolecular Structures,º J. Molecular Biology, vol. 112, 1977.
Reading, Mass.: Addison-Wesley, 1983.
[3] O. Bourdon and G. Medioni, ªObject Recognition Using Geo-
[26] K.B. Sarachik, ªLimitations of Geometric Hashing in the Presence
metric Hashing on the Connection Machine,º Proc. 10th Int'l Conf. of Gaussian Noise,º Technical Memo AIM-1395, Artificial Intelli-
Pattern Recognition, pp. 596-600, 1990. gence Laboratory, Massachusetts Inst. of Technology, 1992.
[4] D. Conklin, ªKnowledge Discovery in Molecular Structure [27] R.R. Stoll, Linear Algebra and Matrix Theory, New York: McGraw-
Databases,º doctoral dissertation, Dept. Computing and Informa- Hill, 1952.
tion Science, Queen's Univ., Canada, 1995. [28] I.I. Vaisman, A. Tropsha, and W. Zheng, ªCompositional
[5] D. Conklin, ªMachine Discovery of Protein Motifs,º Machine Preferences in Quadruplets of Nearest Neighbor Residues in
Learning, vol. 21, pp. 125-150, 1995. Protein Structures: Statistical Geometry Analysis,º Proc. IEEE Int'l
[6] D. Conklin, S. Fortier, and J. Glasgow, ªKnowledge Discovery in Join Symp. Intelligence and Systems, pp. 163-168, 1998.
Molecular Databases,º IEEE Trans. Knowledge and Data Eng., vol. 5, [29] G. Verbitsky, R. Nussinov, and H.J. Wolfson, ªFlexible Structural
no. 6, pp. 985-987, 1993. Comparison Allowing Hinge Bending and Swiveling Motions,º
[7] L. Dehaspe, H. Toivonen, and R.D. King, ªFinding Frequent Proteins, vol. 34, pp. 232-254, 1999.
Substructures in Chemical Compounds,º Proc. Fourth Int'l Conf. [30] J.T.L. Wang, G.-W. Chirn, T.G. Marr, B.A. Shapiro, D. Shasha, and
Knowledge Discovery and Data Mining, pp. 30-36, 1998. K. Zhang, ªCombinatorial Pattern Discovery for Scientific Data:
[8] S. Djoko, D.J. Cook, and L.B. Holder, ªAn Empirical Study of Some Preliminary Results,º Proc. 1994 ACM SIGMOD Int'l Conf.
Domain Knowledge and Its Benefits to Substructure Discovery,º Management of Data, pp. 115-125, 1994.
IEEE Trans. Knowledge and Data Eng., vol. 9, no. 4, pp. 575-586, [31] J.T.L. Wang, Q. Ma, D. Shasha, and C.H. Wu, ªApplication of
1997. Neural Networks to Biological Data Mining: A Case Study in
[9] D. Fischer, O. Bachar, R. Nussinov, and H.J. Wolfson, ªAn Protein Sequence Classification,º Proc. Sixth ACM SIGKDD Int'l
Efficient Automated Computer Vision Based Technique for Conf. Knowledge Discovery and Data Mining, pp. 305±309, 2000.
Detection of Three Dimensional Structural Motifs in Proteins,º [32] J.T.L. Wang, T.G. Marr, D. Shasha, B. A. Shapiro, G.-W. Chirn, and
J. Biomolecular and Structural Dynamics, vol. 9, no. 4, pp. 769-789, T.Y. Lee, ªComplementary Classification Approaches for Protein
1992. Sequences,º Protein Eng., vol. 9, no. 5, pp. 381-386, 1996.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 749
[33] J.T.L. Wang, B.A. Shapiro, and D. Shasha, eds., Pattern Discovery in Bruce A. Shapiro received the BS degree from
Biomolecular Data: Tools, Techniques and Applications, New York: Brooklyn College studying mathematics and
Oxford Univ. Press, 1999. physics, the MS and PhD degrees in computer
[34] X. Wang and J.T.L. Wang, ªFast Similarity Search in Three- science from the University of Maryland, College
Dimensional Structure Databases,º J. Chemical Information and Park. He is a principal investigator and computer
Computer Sciences, vol. 40, no. 2, pp. 442-451, 2000. specialist in the Laboratory of Experimental and
[35] X. Wang, J.T.L. Wang, D. Shasha, B.A. Shapiro, S. Dikshitulu, Computational Biology in the National Cancer
I. Rigoutsos, and K. Zhang, ªAutomated Discovery of Active Institute, National Institutes of Health.
Motifs in Three Dimensional Molecules,º Proc. Third Int'l Conf. Dr. Shapiro directs research on computational
Knowledge Discovery and Data Mining, pp. 89-95, 1997. nucleic acid structure prediction and analysis
[36] K. Zhang, J.T.L. Wang, and D. Shasha, ªOn the Editing Distance and has developed several algorithms and computer systems related to
Between Undirected Acyclic Graphs,º Int'l J. Foundations of this area. His interests include massively parallel genetic algorithms,
Computer Science, special issue on Computational Biology, vol. 7, molecular dynamics, and data mining for molecular structure/function
no. 1, pp. 43-57, 1996. relationships. He is the author of numerous papers on nucleic acid
morphology, parallel computing, image processing, and graphics. He is
an editor and author of the book Pattern Discovery in Biomolecular Data,
Xiong Wang received the MS degree in published by Oxford University Press in 1999. In addition, he is a lecturer
computer science from Fudan University, China, in the Computer Science Department at the University of Maryland,
and the PhD degree in computer and information College Park.
science from the New Jersey Institute of
Technology, in 2000 (thesis advisor: Isidore Rigoutsos received the BS degree in physics from the
Jason T.L. Wang). He is currently an assistant University of Athens, Greece, the MS and PhD degrees in computer
professor of computer science in California State science from the Courant Institute of Mathematical Sciences of New
University, Fullerton. His research interests York University. He manages the Bioinformatics and Pattern Discovery
include databases, knowledge discovery and group in the Computational Biology Center of the Thomas J. Watson
data mining, pattern matching, and bioinfor- Research Center. Currently, he is visiting lecturer in the Department of
matics. He is a member of ACM, ACM SIGMOD, IEEE, and IEEE Chemical Engineering of the Massachusetts Institute of Technology. He
Computer Society. has been the recipient of a Fulbright Foundation Fellowship and has
held an adjunct professor appointment in the Computer Science
Department of New York University. He is the author or co-author of
eight patents and numerous technical papers. Dr. Rigoutsos is a
Jason T.L. Wang received the BS degree in member of the IEEE, the IEEE Computer Society, the International
mathematics from National Taiwan University, Society for Computational Biology, and the American Association for the
and the PhD degree in computer science from Advancement of Science.
the Courant Institute of Mathematical Sciences,
New York University, in 1991 (thesis advisor: Kaizhong Zhang received the MS degree in mathematics from Peking
Dennis Shasha). He is a full professor in the University, Beijing, People's Republic of China, in 1981, and the MS and
Computer and Information Science Department PhD degrees in computer science from the Courant Institute of
at the New Jersey Institute of Technology and Mathematical Sciences, New York University, New York, in 1986 and
director of the University's Data and Knowledge 1989, respectively. Currently, he is an associate professor in the
Engineering Laboratory. Dr. Wang's research Department of Computer Science, University of Western Ontario,
interests include data mining and databases, pattern recognition, London, Ontario, Canada. His research interests include pattern
bioinformatics, and Web information retrieval. He has published more recognition, computational biology, image processing, sequential, and
than 100 papers, including 30 journal articles, in these areas. He is an parallel algorithms.
editor and author of the book Pattern Discovery in Biomolecular Data
(Oxford University Press), and co-author of the book Mining the World
Wide Web (Kluwer). In addition, he is program cochair of the 2001
Atlantic Symposium on Computational Biology, Genome Information
Systems & Technology held at Duke University. He is a member of the . For more information on this or any computing topic, please visit
IEEE. our Digital Library at http://computer.org/publications/dlib.