Article · September 1999 · Source: CiteSeer
DBMiner: A System for Data Mining in
Relational Databases and Data Warehouses
Jiawei Han Jenny Y. Chiang Sonny Chee Jianping Chen Qing Chen
Shan Cheng Wan Gong Micheline Kamber Krzysztof Koperski
Gang Liu Yijun Lu Nebojsa Stefanovic Lara Winstone
Betty B. Xia Osmar R. Zaiane Shuhua Zhang Hua Zhu
Data Mining Research Group, Intelligent Database Systems Research Laboratory
School of Computing Science, Simon Fraser University, British Columbia, Canada V5A 1S6
URL: http://db.cs.sfu.ca/ (for research group) http://db.cs.sfu.ca/DBMiner (for system)

Abstract

A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses. The system implements a wide spectrum of data mining functions, including characterization, comparison, association, classification, prediction, and clustering. By incorporating several interesting data mining techniques, including OLAP and attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level knowledge, and meta-rule guided mining, the system provides a user-friendly, interactive data mining environment with good performance.

1 Introduction

With an enormous amount of data stored in databases and data warehouses, it is increasingly important to develop powerful data warehousing and data mining tools for analysis of such data and mining interesting knowledge from it [14, 4].*

* Research was supported in part by a research grant and a CRD grant from the Natural Sciences and Engineering Research Council of Canada, a grant NCE:IRIS/Precarn from the Networks of Centres of Excellence of Canada, and grants from B.C. Advanced Systems Institute, MPR Teltech Ltd., National Research Council of Canada, and Hughes Research Laboratories.

With our years of research and development on data mining and knowledge discovery in databases, a data mining system, DBMiner, has been developed by integration of database, OLAP and data mining technologies [6, 8, 9, 10]. The system mines various kinds of knowledge at multiple levels of abstraction from large relational databases and data warehouses efficiently and effectively, with the following distinct features:

1. It incorporates several interesting data mining techniques, including data cube and OLAP technology [3], attribute-oriented induction [6, 9], statistical analysis, progressive deepening for mining multiple-level rules [8, 9, 12], and meta-rule guided knowledge mining [5, 11]. It also implements a wide spectrum of data mining functions including characterization, comparison, association, classification, prediction, and clustering.

2. It performs interactive data mining at multiple levels of abstraction on any user-specified set of data in a database or a data warehouse using an SQL-like Data Mining Query Language, DMQL, or a graphical user interface. Users may interactively set and adjust various thresholds, control a data mining process, perform roll-up or drill-down at multiple levels of abstraction, and generate different forms of outputs, including crosstabs, bar/pie charts, curves, classification trees, multiple forms of generalized rules, visual presentation of rules, etc.

3. Efficient implementation techniques have been explored using different data structures, including multiple-dimensional data cubes and generalized relations. The implementations have been integrated smoothly with relational database systems and data warehouses.

4. The data mining process may utilize user- or expert-defined set-grouping or schema-level concept hierarchies which can be specified flexibly, adjusted dynamically based on data distribution, and generated automatically for numerical attributes. Concept hierarchies are taken as an integrated component of the system and are stored as a relation in the database.

5. The system adopts a client/server architecture and is running on the Windows/NT system. It communicates with various commercial database systems for data mining using the ODBC technology.

The system has been tested on several medium to large relational databases, including the NSERC (Natural Science and Engineering Research Council of Canada) research grant information system and the U.S. City-County Data Book, with satisfactory performance. Additional data mining modules are being designed and will be added incrementally to the system along with the progress of our research.

2 Architecture and Functionalities

The general architecture of DBMiner, shown in Figure 1, tightly integrates a relational database system, such as a Microsoft SQL/Server, with a concept hierarchy module and a set of knowledge discovery modules. The discovery modules of DBMiner, shown in Figure 2, include characterizer, comparator, classifier, associator, meta-pattern guided miner, predictor, cluster analyzer, time-series analyzer, and some planned future modules.

Figure 1: General architecture of DBMiner

Figure 2: Knowledge discovery modules of DBMiner

The functionalities of the knowledge discovery modules are briefly described as follows:

• The characterizer generalizes a set of task-relevant data into a generalized data cube which can then be used for extraction of different kinds of rules or be viewed at multiple levels of abstraction from different angles. In particular, it derives a set of characteristic rules which summarizes the general characteristics of a set of user-specified data (called the target class). For example, the symptoms of a specific disease can be summarized by a characteristic rule.

• A comparator mines a set of discriminant rules which summarize the features that distinguish the class being examined (the target class) from other classes (called contrasting classes). For example, to distinguish one disease from others, a discriminant rule summarizes the symptoms that discriminate this disease from others.

• A classifier analyzes a set of training data
(i.e., a set of objects whose class label is known) and constructs a model for each class based on the features in the data. A set of classification rules is generated by such a classification process, which can be used to classify future data and develop a better understanding of each class in the database. For example, one may classify diseases and provide the symptoms which describe each class or subclass.

• An associator discovers a set of association rules (in the form of "A1 ∧ … ∧ Ai → B1 ∧ … ∧ Bj") at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.

• A meta-pattern guided miner is a data mining mechanism which takes a user-specified meta-rule form, such as "P(x,y) ∧ Q(y,z) → R(x,z)", as a pattern to confine the search for desired rules. For example, one may specify the discovered rules to be in the form of "major(s:student, x) ∧ P(s,y) → gpa(s,z)" in order to find the relationships between a student's major and his/her gpa in a university database.

• A predictor predicts the possible values of some missing data or the value distribution of certain attributes in a set of objects. This involves finding the set of attributes relevant to the attribute of interest (by some statistical analysis) and predicting the value distribution based on the set of data similar to the selected object(s). For example, an employee's potential salary can be predicted based on the salary distribution of similar employees in the company.

• A cluster analyzer groups a selected set of data in the database or data warehouse into a set of clusters to ensure the interclass similarity is low and the intraclass similarity is high. For example, one may cluster the houses in the Vancouver area according to their house type, value, and geographical location.

• A time-series analyzer performs several kinds of data analyses for time-related data in the database or data warehouse, including similarity analysis, periodicity analysis, sequential pattern analysis, and trend and deviation analysis. For example, one may find the general characteristics of the companies whose stock price has gone up over 20% last year or evaluate the trend or particular growth patterns of certain stocks.

Another important function module of DBMiner is concept hierarchy, which provides essential background knowledge for data generalization and multiple-level data mining. Concept hierarchies can be specified based on the relationships among database attributes (called schema-level hierarchy) or by set groupings (called set-grouping hierarchy) and be stored in the form of relations in the same database. Moreover, they can be adjusted dynamically based on the distribution of the set of data relevant to the data mining task. Also, hierarchies for numerical attributes can be constructed automatically based on data distribution analysis [7].

3 DMQL and Interactive Data Mining

DBMiner offers both an SQL-like data mining query language, DMQL, and a graphical user interface for interactive mining of multiple-level knowledge.

Example 1. To characterize CS grants in the NSERC96 database related to discipline code and amount category in terms of count% and amount%, the query is expressed in DMQL as follows,

    use NSERC96
    mine characteristic rules as "CS Discipline Grants"
    with respect to disc_code, amount, count%, sum(amount)%
    from award A, grant_type G
    where A.grant_code = G.grant_code
    and A.disc_code = "Computer Science"

The query is processed as follows: The system collects the relevant set of data by processing a transformed relational query, constructs a multi-dimensional data cube, generalizes the data, and then presents the outputs in different forms, including generalized crosstabs, multiple (including visual) forms of generalized rules, pie/bar charts, curves, etc.
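The processing steps just described (collect the task-relevant data, generalize it through a concept hierarchy, and accumulate count and sum measures from which count% and sum(amount)% are derived) can be sketched in a few lines of Python. This is a toy illustration, not DBMiner code: the award rows and the one-step amount_category() hierarchy are invented for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for task-relevant NSERC96 award tuples
# (the values are invented, not taken from the actual database).
awards = [
    {"disc_code": "AI",        "amount": 25000},
    {"disc_code": "AI",        "amount": 32000},
    {"disc_code": "Databases", "amount": 61000},
    {"disc_code": "Databases", "amount": 18000},
]

def amount_category(amount):
    """Generalize a raw dollar amount to a range label (one concept-hierarchy step)."""
    if amount < 20000:
        return "0-20k"
    elif amount < 40000:
        return "20-40k"
    return "40k+"

# Generalize each tuple and merge identical generalized tuples,
# accumulating the count and sum measures.
cube = defaultdict(lambda: {"count": 0, "sum": 0})
for row in awards:
    key = (row["disc_code"], amount_category(row["amount"]))
    cube[key]["count"] += 1
    cube[key]["sum"] += row["amount"]

# Derive the count% and sum(amount)% measures used in the crosstab output.
total_count = sum(c["count"] for c in cube.values())
total_sum = sum(c["sum"] for c in cube.values())
for (disc, cat), c in sorted(cube.items()):
    print(f"{disc:10s} {cat:7s} "
          f"count%={100 * c['count'] / total_count:.1f} "
          f"sum%={100 * c['sum'] / total_sum:.1f}")
```

A real implementation would push the collection and merging steps into the database (e.g., a GROUP BY over the generalized attributes) rather than scanning rows in application code.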
A user may interactively set and adjust various kinds of thresholds to control the data mining process. For example, one may adjust the generalization threshold for an attribute to allow more or less distinct values in this attribute. A user may also roll-up or drill-down the generalized data at multiple levels of abstraction. □

A data mining query language such as DMQL facilitates the standardization of data mining functions, systematic development of data mining systems, and integration with commercial relational database systems. Various kinds of graphical user interfaces can be developed based on such a data mining query language. A graphical user interface facilitates interactive specification and modification of data mining queries, concept hierarchies, and various kinds of thresholds, selection and change of output forms, roll-up or drill-down, and dynamic control of a data mining process. Such interfaces have been implemented in DBMiner.

4 Implementation of DBMiner

4.1 Data structures: Generalized relation vs. multi-dimensional data cube

Data generalization is a core function of DBMiner. Two data structures, generalized relation and multi-dimensional data cube, are considered in the implementation of data generalization.

A generalized relation is a relation which consists of a set of (generalized) attributes (storing generalized values of the corresponding attributes in the original relation) and a set of "aggregate" (measure) attributes (storing the values resulting from executing aggregate functions, such as count, sum, etc.), and in which each tuple is the result of generalization of a set of tuples in the original data relation. For example, a generalized relation award may store a set of tuples, such as "award(AI, 20-40k, 37, 835900)", which represents the generalized data where the discipline code is "AI", the amount category is "20-40k", and such kind of data takes 37 in count and $835,900 in (total) amount.

Figure 3: A multi-dimensional data cube

A data cube can be viewed as a multi-dimensional array structure, as shown in Figure 3, in which each dimension represents a generalized attribute and each cell stores the value of some aggregate attribute, such as count, sum, etc. For example, a cube award may have two dimensions: "discipline code" and "amount category". The value "AI" in the "discipline code" dimension and "20-40k" in the "amount category" dimension locate the corresponding values in the two aggregate attributes, count and sum, in the cube. Then the values, count% and sum(amount)%, can be derived easily.

In comparison with the generalized relation structure, a multi-dimensional data cube structure has the following advantages: First, it may often save storage space since only the measurement attribute values need to be stored in the cube and the generalized (dimensional) attribute values serve only as dimensional indices into the cube; second, it leads to fast access to particular cells (or slices) of the cube using indexing structures; third, it usually costs less to produce a cube than a generalized relation in the process of generalization since the right cell in the cube can be located easily. When a cube structure is quite sparse, sparse cube technology [1, 18] should be applied to compute and store the sparse cube efficiently.

Both data structures have been implemented in the evolution of the DBMiner system: the
generalized relation structure is adopted in version 1.0, and a multi-dimensional data cube structure without using sparse cube technology in version 2.0. Version 3.0 adopts sparse cube technology in the cube construction and switches between array-based and relation-based cubes according to cube size to ensure good performance in databases and data warehouses of different sizes.

Besides designing good data structures, efficient implementation of each discovery module has been explored, as discussed below.

4.2 Multiple-level characterization

As shown in Figure 4, data characterization summarizes and characterizes a set of task-relevant data, usually based on generalization. For mining multiple-level knowledge, progressive deepening (drill-down) and progressive generalization (roll-up) techniques can be applied.

Figure 4: A snapshot of the output of DBMiner Characterizer

Progressive generalization starts with a conservative generalization process which first generalizes the data to slightly higher abstraction levels than the primitive data in the relation or data cube. Further generalizations can be performed on it progressively by selecting appropriate attributes for step-by-step generalization.

Progressive deepening starts with a relatively high-level generalized cube/relation and selectively and progressively specializes some of the generalized tuples or attributes to lower abstraction levels.

Conceptually, a top-down, progressive deepening process is preferable since it is natural to first find general data characteristics at a high abstraction level and then follow certain interesting paths to step down to specialized cases. However, from the implementation point of view, it is easier to perform generalization than specialization because generalization replaces low-level tuples by high-level ones through ascension of a concept hierarchy. Since generalized tuples do not register the detailed original information, it is difficult to get such information back when specialization is required later.

Our technique which facilitates specializations on generalized relations is to save a "minimally generalized relation/cube" in the early stage of generalization. That is, each attribute in the relevant set of data is generalized to minimally generalized concepts (which can be done in one scan of the data relation) and then identical tuples in such a generalized relation/cube are merged together, which derives the minimally generalized relation. After that, both progressive deepening and interactive up-and-down can be performed with reasonable efficiency: if the data at the current abstraction level is to be generalized further, generalization can be performed on it directly; on the other hand, if it is to be specialized, the desired result can be derived by generalizing the minimally generalized relation/cube to the appropriate level(s).

4.3 Discovery of discriminant rules

The comparator of DBMiner finds a set of discriminant rules which distinguishes the general features of a target class from those of contrasting class(es) specified by a user. It is implemented as follows.

First, the set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting class(es). Second, attribute-oriented induction is performed on the target class to extract a prime target relation/cube, where a prime target relation is a generalized relation in which each attribute contains no more than but close to the threshold value of the corresponding attribute. Then the
concepts in the contrasting class(es) are generalized to the same level as those in the prime target relation/cube, forming the prime contrasting relation/cube. Finally, the information in these two classes is used to generate qualitative or quantitative discriminant rules.

Moreover, interactive drill-down and roll-up can be performed synchronously in both the target class and the contrasting class(es) in a similar way as in characterization. These functions have been implemented in the discriminator.

Figure 5: Graphic output of the Associator of DBMiner

4.4 Multiple-level association

Based on many studies on efficient mining of association rules [2, 17, 8], a multiple-level association rule miner (called "associator") has been implemented in DBMiner. An output of the associator is shown in Figure 5.

Different from mining association rules in transaction databases, a relational associator may find two kinds of associations: inter-attribute association and intra-attribute association. The former is an association among different attributes, whereas the latter is an association within one or a set of attributes formed by grouping of another set of attributes. This is illustrated in the following example.

Example 2. Suppose the "course_taken" relation in a university database has the following schema:

    course_taken = (student_id, course, semester, grade).

Intra-attribute association is the association among one or a set of attributes formed by grouping another set of attributes in a relation. For example, the association between each student and his/her course performance is an intra-attribute association because one or a set of attributes, "course, semester, grade", are grouped according to student_id, for mining associations among the courses taken by each student. From a relational database point of view, a relation so formed is a nested relation obtained by nesting the attributes "(course, semester, grade)" with the same student_id. Therefore, an intra-attribute association is an association among the nested items in a nested relation.

Inter-attribute association is the association among a set of attributes in a flat relation. For example, the following is an inter-attribute association: the association between course and grade, such as "the courses in computing science tend to have good grades", etc.

The two kinds of associations require different data mining techniques.

For mining intra-attribute associations, a data relation can be transformed into a nested relation in which the tuples which share the same values in the nesting attributes are merged into one. For example, the course_taken relation can be folded into a nested relation with the schema,

    course_taken = (student_id, course_history)
    course_history = (course, semester, grade).

With such a transformation, it is easy to derive association rules like "90% of senior CS students tend to take at least three CS courses at the 300-level or up in each semester". Since the nested tuples (or values) can be viewed as data items in the same transaction, the methods for mining association rules in transaction databases, such as [2, 8], can be applied to such transformed relations in relational databases.

The multi-dimensional data cube structure facilitates efficient mining of multi-level, inter-attribute association rules. A count cell of a cube stores the number of occurrences of the corresponding multi-dimensional data values, whereas a dimension count cell stores the sum of counts of the whole dimension. With this structure, it is straightforward to calculate the measurements, such as support and confidence, of association rules based on the values in these summary cells. A set of such cubes, ranging from the least generalized cube to rather high-level cubes, facilitates mining of association rules at multiple levels of abstraction. □
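As a sketch of how support and confidence fall out of such summary cells, the toy example below builds the count cells of a two-dimensional cube over a handful of hypothetical course_taken tuples and evaluates one inter-attribute rule. The data values and the rule "CS ⇒ grade A" are invented for illustration, not taken from the paper.

```python
from collections import Counter

# Hypothetical course_taken tuples (student_id, course_area, grade);
# the values are invented for this example.
course_taken = [
    ("s1", "CS",   "A"), ("s1", "Math", "B"),
    ("s2", "CS",   "A"), ("s2", "CS",   "B"),
    ("s3", "Math", "A"), ("s3", "CS",   "A"),
]

# Count cells of a two-dimensional cube: one count per (course_area, grade) pair.
cell = Counter((area, grade) for _, area, grade in course_taken)

# Dimension count cells: sums of counts along the course_area dimension.
area_count = Counter(area for _, area, _ in course_taken)
total = sum(cell.values())

# Support and confidence of the inter-attribute rule "CS => grade A",
# computed directly from the summary cells, without rescanning base data.
support = cell[("CS", "A")] / total
confidence = cell[("CS", "A")] / area_count["CS"]
print(f"support={support:.2f} confidence={confidence:.2f}")
# prints: support=0.50 confidence=0.75
```

In DBMiner these quantities would be read from precomputed cube cells at the chosen abstraction level rather than recounted from base tuples, which is what makes multi-level association mining over the cube efficient.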

4.5 Meta-rule guided mining

Since there are many ways to derive association rules in relational databases, it is preferable to have users specify some interesting constraints to guide the data mining process. Such constraints can be specified in a meta-rule (or meta-pattern) form [16], which confines the search to specific forms of rules. For example, a meta-rule "P(x,y) → Q(x,y,z)", where P and Q are predicate variables matching different properties in a database, can be used as a rule-form constraint in the search.

In principle, a meta-rule can be used to guide the mining of many kinds of rules. Since association rules are in a form similar to logic rules, we have first studied meta-rule guided mining of association rules in relational databases [5]. Different from the study by [16], where a meta-predicate may match any relation predicates, deductive predicates, attributes, etc., we confine the match to only those predicates corresponding to the attributes in a relation. One such example is illustrated as follows.

Example 3. A meta-rule guided data mining query can be specified in DMQL as follows for mining a specific form of rules related to a set of attributes, "major, gpa, status, birth_place, address", in relation student for those born in Canada in a university database.

    mine associations
    with respect to major, gpa, status, birth_place, address
    from student
    where birth_place = "Canada"
    set rule template
    major(s:student, x) ∧ Q(s, y) → R(s, z)

Multi-level association rules can be discovered in such a database, as illustrated below:

    major(s, "Science") ∧ gpa(s, "Excellent") → status(s, "Graduate") (60%)
    major(s, "Physics") ∧ status(s, "M.Sc") → gpa(s, "3.8-4.0") (76%)

The mining of such multi-level rules can be implemented in a similar way as mining multiple-level association rules in a multi-dimensional data cube [11]. □

Figure 6: Classifier of DBMiner

4.6 Classification

Data classification is to develop a description or model for each class in a database, based on the features present in a set of class-labeled training data.

There have been many data classification methods studied, including decision-tree methods such as ID-3 and C4.5 [15], statistical methods, neural networks, rough sets, etc. Recently, some database-oriented classification methods have also been investigated [13].

Our classification method consists of four steps: (1) collection of the relevant set of data and partitioning of the data into training and testing data, (2) analysis of the relevance of the attributes, (3) construction of the classification (decision) tree, and (4) testing of the effectiveness of the classification using the test data set.

Attribute relevance analysis is performed based on the analysis of an uncertainty measurement, a measurement which determines how much an attribute is in relevance to the class attribute. Several top-most relevant attributes
are retained for classification analysis, whereas the weakly relevant or irrelevant attributes are not considered in the subsequent classification process.

In the classification process, our classifier adopts a generalization-based decision-tree induction method which integrates attribute-oriented induction and OLAP data cube technology with a decision-tree induction technique, by first performing minimal generalization on the set of training data to generalize attribute values in the training set, and then performing decision-tree induction on the generalized data.

Since a generalized tuple comes from the generalization of a number of original tuples, the count information is associated with each generalized tuple and plays an important role in classification. To handle noise and exceptional data and facilitate statistical analysis, two thresholds, classification threshold and exception threshold, are introduced. The former helps justify the classification at a node when a significant set of the examples belong to the same class, whereas the latter helps ignore a node in classification if it contains only a negligible number of examples.

There are several alternatives for doing generalization before classification: a data set can be generalized to either a minimally generalized abstraction level, an intermediate abstraction level, or a rather high abstraction level. Too low an abstraction level may result in scattered classes, bushy classification trees, and difficulty in concise semantic interpretation, whereas too high a level may result in the loss of classification accuracy.

Currently, we are testing several alternatives for the integration of generalization and classification in databases, such as (1) generalize the data to some medium abstraction levels; (2) generalize the data to intermediate abstraction level(s), and then perform node merge and split for better class representation and classification accuracy; and (3) perform multi-level classification and select a desired level by a comparison of the classification quality at different levels. Since all three classification processes are performed on relatively small, compressed, generalized relations, they are expected to result in efficient classification algorithms for large databases.

The generalization-based multi-level classification process has been implemented in the DBMiner system. An output of the DBMiner classifier is shown in Figure 6.

4.7 Prediction

A predictor predicts data values or value distributions on the attributes of interest based on similar groups of data in the database. For example, one may predict the amount of research grant that an applicant may receive based on data about similar groups of researchers.

The power of data prediction should be confined to the ranges of numerical data or the nominal data generalizable to only a small number of categories. It is unlikely to give a reasonable prediction of one's name or social insurance number based on other persons' data.

For successful prediction, the factors (or attributes) which strongly influence the values of the attributes of interest should be identified first. This can be done by analysis of data relevance or correlations by statistical methods, decision-tree classification techniques, or simply based on expert judgement. To analyze attribute relevance, an uncertainty measurement similar to the one used in our classifier is applied. This process ranks the relevance of all the attributes selected, and only the highly ranked attributes will be used in the prediction process.

After the selection of highly relevant attributes, a generalized linear model is constructed which can be used to predict the value or value distribution of the predicted attribute. If the predicting attribute is numerical, a set of curves is generated, each indicating the trend of likely changes of the value distribution of the predicted attribute. If the predicting attribute is categorical, a set of pie charts is generated, each indicating the distribution of the value ranges of the predicted attribute.

When a query probe is submitted, the corresponding value distribution of the predicted attribute can be plotted based on the curves or pie charts generated above. Therefore, the values in the set of highly relevant predicting attributes
can be used for trustable prediction.

Figure 7 shows the prediction output when the predicting attribute is a numeric one, whereas Figure 8 shows the prediction output when the predicting attribute is a categorical one.

Figure 7: Prediction output when the predicting attribute is a numeric one

Figure 8: Prediction output when the predicting attribute is a categorical one

4.8 Clustering

Data clustering, also viewed as "unsupervised learning", is a process of partitioning a set of data into a set of classes, called clusters, with the members of each cluster sharing some interesting common properties. A good clustering method will produce high-quality clusters, in which the intra-class (i.e., intra-cluster) similarity is high and the inter-class similarity is low.

Clustering has many interesting applications. For example, it can be used to help marketers discover distinct groups in their customer bases and develop targeted marketing programs.

Data clustering has been studied in statistics, machine learning and data mining with different methods and emphases. Many clustering methods have been developed and applied to various domains, such as data classification and image processing.

Data mining applications deal with large, high-dimensional data and frequently involve categorical domains with concept hierarchies. However, most of the existing data clustering methods can only handle numeric data, or cannot produce good-quality results when categorical domains are present.

Our cluster analyzer is based on the well-known k-means paradigm. Compared with other clustering methods, k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. To adequately reflect categorical domains, we have developed a method of encoding concept hierarchies. This enables us to define a dissimilarity measure that takes into account both numeric and categorical attributes, at multiple levels. Due to these modifications, our cluster analyzer can cluster large data sets with mixed numeric and categorical attributes in a way similar to k-means. It can also perform multi-level clustering and select a desired level by a comparison of the clustering quality at different levels. On the other hand, the user or the analyst can direct the clustering process by selecting a set of relevant attributes for the requested clustering query, or assigning a weight factor to each attribute, or both, so that increasing the weight of an attribute increases the likelihood that the algorithm will cluster according to that attribute.

5 Further Development of DBMiner

The DBMiner system is currently being extended in several directions, as illustrated below.

• Further enhancement of the power and efficiency of data mining in relational database
systems and data warehouses, including the improvement of system performance and rule discovery quality for the existing functional modules, and the development of techniques for mining new kinds of rules, especially on time-related data.

- Further enhancement of the performance of data mining in large databases and data warehouses by exploring parallel processing using NT clusters. Some algorithm design and experimental work has been performed in this direction.

- Integration, maintenance and application of discovered knowledge, including incremental update of discovered rules, removal of redundant or less interesting rules, merging of discovered rules into a knowledge base, intelligent query answering using discovered knowledge, and the construction of multiple-layered databases.

- Extension of data mining techniques towards advanced and/or special-purpose database systems, including extended-relational, object-oriented, textual, spatial, temporal, multimedia, and heterogeneous databases and Internet information systems. Currently, three such data mining systems, GeoMiner, LibMiner, and WebMiner, for mining knowledge in spatial databases, library databases, and the Internet information base, respectively, are under design and construction.

About the Authors

- Jiawei Han (Ph.D., Univ. of Wisconsin at Madison, 1985) is Professor in the School of Computing Science and Director of the Intelligent Database Systems Research Laboratory, Simon Fraser University, Canada. He has conducted research in the areas of data mining and data warehousing, deductive and object-oriented databases, spatial databases, multimedia databases, and logic programming, with over 100 journal and conference publications. He is a project leader of the Canada NCE/IRIS:HMI-5 project ("data mining and knowledge discovery in large databases"), and has served or is currently serving on the program committees of over 30 international conferences and workshops, including ICDE'95 (PC vice-chair), DOOD'95, ACM-SIGMOD'96, VLDB'96, KDD'96 (PC co-chair), CIKM'97, SSD'97, KDD'97, and ICDE'98. He has also been serving as an editor for IEEE Transactions on Knowledge and Data Engineering, Journal of Intelligent Information Systems, and Data Mining and Knowledge Discovery.

- Jenny Y. Chiang received her B.Sc. degree in Computing Science at Simon Fraser University. Currently, she is working on the design and implementation of the DBMiner system. Her research interests are data cube construction, OLAP implementation, and association rule mining.

- Sonny Chee received his M.A.Sc. degree in Aerospace from the University of Toronto. Currently, he is a Ph.D. candidate in the School of Computing Science, Simon Fraser University. He is working on the design and implementation of the DBMiner system. His research interests are active database mining, OLAP implementation, and time-series data mining.

- Jianping Chen received his B.Sc. in Mathematics from the University of Science and Technology of China in 1987, and his M.Sc. degree in Mathematics from Simon Fraser University in 1996. He worked in the Institute of Applied Mathematics, Chinese Academy of Sciences, Beijing, China, from 1987 to 1994. Currently, he is an M.Sc. student in the School of Computing Science, Simon Fraser University. He is working on parallel database mining and data classification techniques.

- Qing Chen received her B.Sc. degree in Computer Science from the University of Science and Technology of China in 1993. She is currently working on her M.Sc. degree at the School of Computing Science, Simon Fraser University. Her research interests include spatial data mining.
- Shan Cheng received a B.Sc. degree in statistics from Fudan University, China, and an M.Sc. degree in statistics from Simon Fraser University. Currently, she is involved in the research and implementation of the DBMiner system. Her research interests are data mining, data warehousing, and data analysis.

- Wan Gong received her B.Sc. degree in Math/Computer Science at Mount Allison University, N.B. Currently, she is working on an M.Sc. degree in Computing Science at Simon Fraser University, and is involved in the research and development of the DBMiner system. Her research interest is time-related data mining.

- Micheline Kamber is a doctoral candidate in Computing Science at Simon Fraser University. Her research interests in data mining include data cube-based mining of association rules, metarule-guided mining, pattern interestingness, and the development of a data mining query language.

- Krzysztof Koperski received the M.Sc. degree in Electrical Engineering from Warsaw University of Technology, Poland. He is currently a Ph.D. student in Computer Science at Simon Fraser University. His research interests are spatial data mining, spatial reasoning, and data warehousing.

- Gang Liu is a Ph.D. student in the School of Computing Science at Simon Fraser University. His research interests include visual data mining. Gang received an M.Sc. in Computer Science from the National University of Singapore.

- Yijun Lu received his B.Sc. degree in applied mathematics from Huazhong Univ. of Sci. and Tech., China, in 1985, and an M.Sc. degree in applied and computational mathematics from Simon Fraser University in 1995. He is currently an M.Sc. student in the School of Computing Science. His research interests include knowledge discovery, data mining, and scientific computing.

- Nebojsa Stefanovic is an M.Sc. student in the School of Computing Science at Simon Fraser University. His major interests are in spatial data warehousing and spatial data mining. He has implemented several visualization tools for DBMiner, and the OLAP component for the GeoMiner system. Stefanovic received a B.Sc. in computer engineering from the University of Belgrade, Yugoslavia.

- Lara Winstone is currently an M.Sc. student in Computing Science at Simon Fraser University. She received her B.Sc. (Honours Co-op) from the University of Manitoba. Her research interests include classification, domain analysis, and case-based reasoning.

- Betty Bin Xia received her M.Sc. degree in Computer & Application from Jilin University, China, in 1993. She is currently working on an M.Sc. degree at the School of Computing Science, Simon Fraser University. Her research interests include data mining, user interface design and implementation, and similarity mining in time-related databases.

- Osmar R. Zaiane received the M.Sc. degree in Computer Science from Laval University, Canada. He is currently a Ph.D. candidate in Computer Science at Simon Fraser University. His research interests are WWW technology, data mining, multimedia databases, and Web-based data mining systems.

- Shuhua Zhang received his Ph.D. degree in mathematics from Simon Fraser University in 1992. He is currently pursuing an M.Sc. degree in computing science at Simon Fraser University. His research interests include data mining and universal algebra.

- Hua Zhu is currently an M.Sc. student in the School of Computing Science at Simon Fraser University. Her research interests include database mining and user interfaces, especially web mining. Hua received a B.S. in Computing Science from the University of Sci. & Tech. of China.
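To make the mixed-attribute clustering of Section 4.8 concrete, the following sketch shows one possible way to encode a concept hierarchy into a dissimilarity measure over numeric and categorical attributes, and to use it in a weighted k-means-style loop. This is an illustrative reconstruction, not DBMiner's actual code: the sample hierarchy, the attribute values, and the mode-based update for categorical centers (as in k-prototypes) are all assumptions.

```python
import random

# Hypothetical concept hierarchy for one categorical attribute, written as
# root-to-leaf paths; the names are invented for illustration.
HIERARCHY = {
    "pop":      ("entertainment", "music", "pop"),
    "rock":     ("entertainment", "music", "rock"),
    "comedy":   ("entertainment", "movie", "comedy"),
    "textbook": ("education", "book", "textbook"),
}

def cat_dissim(a, b):
    """Dissimilarity of two categorical values: 0 if equal, and larger the
    higher in the concept hierarchy their paths diverge (result in [0, 1])."""
    pa, pb = HIERARCHY[a], HIERARCHY[b]
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    depth = max(len(pa), len(pb))
    return (depth - shared) / depth

def dissim(p, q, weights):
    """Weighted mixed-attribute dissimilarity: numeric attributes use absolute
    difference (assumed pre-normalized to [0, 1]); categorical attributes use
    the hierarchy-based measure above. Larger weight => more influence."""
    d = 0.0
    for x, y, w in zip(p, q, weights):
        d += w * (cat_dissim(x, y) if isinstance(x, str) else abs(x - y))
    return d

def k_means_mixed(points, k, weights, iters=20, seed=0):
    """k-means-style loop: numeric center components are means; categorical
    'centers' are modes (ties broken arbitrarily), as in k-prototypes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            i = min(range(k), key=lambda i: dissim(p, centers[i], weights))
            clusters[i].append(p)
        new_centers = []
        for i, members in enumerate(clusters):  # recompute centers
            if not members:
                new_centers.append(centers[i])
                continue
            center = []
            for j, v in enumerate(members[0]):
                col = [m[j] for m in members]
                if isinstance(v, str):
                    center.append(max(set(col), key=col.count))  # mode
                else:
                    center.append(sum(col) / len(col))           # mean
            new_centers.append(tuple(center))
        centers = new_centers
    return clusters

if __name__ == "__main__":
    pts = [(0.1, "pop"), (0.2, "rock"), (0.9, "textbook"), (0.8, "textbook")]
    for c in k_means_mixed(pts, 2, (1.0, 1.0)):
        print(c)
```

Per-attribute weights steer the partition as described above (e.g., `weights=(0.1, 1.0)` favors the categorical attribute), and clustering at a coarser level of abstraction can be approximated by truncating the hierarchy paths before computing `cat_dissim`, which is one way to realize the multi-level behavior mentioned in Section 4.8.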
References

[1] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 506-521, Bombay, India, Sept. 1996.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, Sept. 1994.

[3] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.

[4] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[5] Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. In Proc. 1st Int'l Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases (KDOOD'95), pages 39-46, Singapore, Dec. 1995.

[6] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40, 1993.

[7] J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), pages 157-168, Seattle, WA, July 1994.

[8] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 420-431, Zurich, Switzerland, Sept. 1995.

[9] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.

[10] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, and O. R. Zaiane. DBMiner: A system for mining knowledge in large relational databases. In Proc. 1996 Int'l Conf. on Data Mining and Knowledge Discovery (KDD'96), pages 250-255, Portland, Oregon, Aug. 1996.

[11] M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), pages 207-210, Newport Beach, California, Aug. 1997.

[12] M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

[13] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. on Extending Database Technology (EDBT'96), Avignon, France, March 1996.

[14] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[15] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[16] W. Shen, K. Ong, B. Mitbander, and C. Zaniolo. Metaqueries for data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 375-398. AAAI/MIT Press, 1996.

[17] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 407-419, Zurich, Switzerland, Sept. 1995.

[18] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170, Tucson, Arizona, May 1997.