
Knowledge Discovery in Databases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific, and personal information.

The KDD Process for Extracting Useful Knowledge from Volumes of Data

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Communications of the ACM, November 1996/Vol. 39, No. 11

As we march into the age of digital information, the problem of data overload looms ominously ahead. Our ability to analyze and understand massive datasets lags far behind our ability to gather and store the data. A new generation of computational techniques and tools is required to support the extraction of useful knowledge from the rapidly growing volumes of data. These techniques and tools are the subject of the emerging field of knowledge discovery in databases (KDD) and data mining.

Large databases of digital information are ubiquitous. Data from the neighborhood store's checkout register, your bank's credit card authorization device, records in your doctor's office, patterns in your telephone calls, and many more applications generate streams of digital records archived in huge databases, sometimes in so-called data warehouses.

Current hardware and database technology allow efficient, inexpensive, and reliable data storage and access. However, whether the context is business, medicine, science, or government, the datasets themselves (in raw form) are of little direct value. What is of value is the knowledge that can be inferred from the data and put to use. For example, the marketing database of a consumer
goods company may yield knowledge of correlations between sales of certain items and certain demographic groupings. This knowledge can be used to introduce new targeted marketing campaigns with predictable financial return relative to unfocused campaigns. Databases are often a dormant potential resource that, tapped, can yield substantial benefits.

This article gives an overview of the emerging field of KDD and data mining, including links with related fields, a definition of the knowledge discovery process, dissection of basic data mining algorithms, and an analysis of the challenges facing practitioners.

Impractical Manual Data Analysis

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to analyze current trends and changes in health-care data on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; the report is then used as the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging geologic objects of interest, such as impact craters.

For these (and many other) applications, such manual probing of a dataset is slow, expensive, and highly subjective. In fact, such manual data analysis is becoming impractical in many domains as data volumes grow exponentially. Databases are increasing in size in two ways: the number N of records, or objects, in the database, and the number d of fields, or attributes, per object. Databases containing on the order of N = 10⁹ objects are increasingly common in, for example, the astronomical sciences. The number d of fields can easily be on the order of 10² or even 10³ in medical diagnostic applications. Who could be expected to digest billions of records, each with tens or hundreds of fields? Yet the true value of such data lies in the users' ability to extract useful reports, spot interesting events and trends, support decisions and policy based on statistical analysis and inference, and exploit the data to achieve business, operational, or scientific goals.

When the scale of data manipulation, exploration, and inference grows beyond human capacities, people look to computer technology to automate the bookkeeping. The problem of knowledge extraction from large databases involves many steps, ranging from data manipulation and retrieval to fundamental mathematical and statistical inference, search, and reasoning. Researchers and practitioners interested in these problems have been meeting since the first KDD Workshop in 1989. Although the problem of extracting knowledge from data (or observations) is not new, automation in the context of large databases opens up many new unsolved problems.

Definitions

Finding useful patterns in data is known by different names (including data mining) in different communities (e.g., knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing). The term "data mining" is used most by statisticians, database researchers, and more recently by the MIS and business communities. Here we use the term "KDD" to refer to the overall process of discovering useful knowledge from data. Data mining is a particular step in this process—application of specific algorithms for extracting patterns (models) from data. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, ensure that useful knowledge is derived from the data. Blind application of data mining methods



(rightly criticized as data dredging in the statistical literature) can be a dangerous activity leading to discovery of meaningless patterns.

KDD has evolved, and continues to evolve, from the intersection of research in such fields as databases, machine learning, pattern recognition, statistics, artificial intelligence and reasoning with uncertainty, knowledge acquisition for expert systems, data visualization, machine discovery [7], scientific discovery, information retrieval, and high-performance computing. KDD software systems incorporate theories, algorithms, and methods from all of these fields.

Database theories and tools provide the necessary infrastructure to store, access, and manipulate data. Data warehousing, a recently popularized term, refers to the current business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. A popular approach for analysis of data warehouses is called online analytical processing (OLAP).¹ OLAP tools focus on providing multidimensional data analysis, which is superior to SQL (a standard data manipulation language) in computing summaries and breakdowns along many dimensions. While current OLAP tools target interactive data analysis, we expect they will also include more automated discovery components in the near future.

¹ See Providing OLAP to User Analysts: An IT Mandate by E.F. Codd and Associates (1993).

Fields concerned with inferring models from data—including statistical pattern recognition, applied statistics, machine learning, and neural networks—were the impetus for much early KDD work. KDD largely relies on methods from these fields to find patterns from data in the data mining step of the KDD process. A natural question is: How is KDD different from these other fields? KDD focuses on the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can be modeled and supported. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Scaling and robustness properties of modeling algorithms for large noisy datasets are also of fundamental interest.

Statistics has much in common with KDD. Inference of knowledge from data has a fundamental statistical component (see [2] and the article by Glymour on statistical inference in this special section for more detailed discussions of the relationship between KDD and statistics). Statistics provides a language and framework for quantifying the uncertainty resulting when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s, when computer-based data analysis techniques were first introduced. The concern arose over the fact that if one searches long enough in any dataset (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. This issue is of fundamental importance to KDD. There has been substantial progress in understanding such issues in statistics in recent years, much directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly. KDD can also be viewed as encompassing a broader view of modeling than statistics, aiming to provide tools to automate (to the degree possible) the entire process of data analysis, including the statistician's art of hypothesis selection.

The KDD Process

Here we present our (necessarily subjective) perspective of a unifying process-centric framework for KDD. The goal is to provide an overview of the variety of activities in this multidisciplinary field and how they fit together. We define the KDD process [4] as:

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Throughout this article, the term pattern goes beyond its traditional sense to include models or structure in data. In this definition, data comprises a set of facts (e.g., cases in a database), and pattern is an expression in some language describing a subset of the data (or a model applicable to that subset). The term process implies there are many steps involving data preparation, search for patterns, knowledge evaluation, and refinement—all repeated in multiple iterations. The process is assumed to be nontrivial in that it goes beyond computing closed-form quantities; that is, it must involve search for structure, models, patterns, or parameters. The discovered patterns should be valid for new data with some degree of certainty. We also want patterns to be novel (at least to the system, and preferably to the user) and potentially useful for the user or task. Finally, the patterns should be understandable—if not immediately, then after some postprocessing.

This definition implies we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (e.g., estimated classification accuracy) or utility (e.g., gain, perhaps in dollars saved due to better predictions or speed-up in a system's response time). Such notions as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated through simplicity (e.g., number of bits needed to describe a pattern). An important notion, called interestingness, is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be explicitly defined or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Data mining is a step in the KDD process consisting of an enumeration of patterns (or models) over the data, subject to some acceptable computational-efficiency limitations. Since the patterns enumerable over any finite dataset are potentially infinite, and because the enumeration of patterns involves some form of search in a large space, computational constraints place severe limits on the subspace that can be explored by a data mining algorithm.

Figure 1. Overview of the steps constituting the KDD process (Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation, taking raw Data through Target Data, Preprocessed Data, Transformed Data, and Patterns to Knowledge)

The KDD process is outlined in Figure 1. (We did not show all the possible arrows to indicate that loops can, and do, occur between any two steps in the process; also not shown is the system's performance element, which uses knowledge to make decisions or take actions.) The KDD process is interactive and iterative (with many decisions made by the user), involving numerous steps, summarized as:

1. Learning the application domain: includes relevant prior knowledge and the goals of the application
2. Creating a target dataset: includes selecting a dataset or focusing on a subset of variables or data samples on which discovery is to be performed
3. Data cleaning and preprocessing: includes basic operations, such as removing noise or outliers if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time sequence information and known changes, as well as deciding DBMS issues, such as data types, schema, and mapping of missing and unknown values
4. Data reduction and projection: includes finding useful features to represent the data, depending on the goal of the task, and using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data
5. Choosing the function of data mining: includes deciding the purpose of the model derived by the data mining algorithm (e.g., summarization, classification, regression, and clustering)
6. Choosing the data mining algorithm(s): includes selecting method(s) to be used for searching for patterns in the data, such as deciding which models and parameters may be appropriate (e.g., models for categorical data are different from models on vectors over reals) and matching a particular data mining method with the overall criteria of the KDD process (e.g., the user may be more interested in understanding the model than in its predictive capabilities)
7. Data mining: includes searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency, and link analysis
8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possible visualization of the extracted patterns, removing redundant or irrelevant patterns, and translating the useful ones into terms understandable by users
9. Using discovered knowledge: includes incorporating this knowledge into the performance system, taking actions based on the knowledge, or simply documenting it and reporting it to interested parties, as well as checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

Most previous work on KDD focused primarily on the data mining step. However, the other steps are equally if not more important for the successful application of KDD in practice. We now focus on the data mining component, which has received by far the most attention in the literature.

Data Mining

Data mining involves fitting models to or determining patterns from observed data. The fitted models play the role of inferred knowledge. Deciding whether or not the models reflect useful knowledge is a part of the overall interactive KDD process for which subjective human judgment is usually required. A wide variety and number of data mining algorithms are described in the literature—from the fields of statistics, pattern recognition, machine learning, and databases. Thus, an overview discussion can often consist of long lists of seemingly unrelated and highly specific algorithms. Here we take a somewhat reductionist viewpoint. Most data mining algorithms can be viewed as compositions of a few basic techniques and principles. In particular, data mining algorithms consist largely of some specific mix of three components:

• The model. There are two relevant factors: the function of the model (e.g., classification and clustering) and the representational form of the model (e.g., a linear function of multiple variables and a Gaussian probability density function). A model contains parameters that are to be determined from the data.
• The preference criterion. A basis for preference of one model or set of parameters over another, depending on the given data. The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid overfitting, or generating a model with too many degrees of freedom to be constrained by the given data.
• The search algorithm. The specification of an algorithm for finding particular models and parameters, given data, a model (or family of models), and a preference criterion.

A particular data mining algorithm is usually an instantiation of the model/preference/search components (e.g., a classification model based on a decision-tree representation, model preference based on data likelihood, determined by greedy search using a particular heuristic). Algorithms often differ largely in terms of the model representation (e.g., linear and hierarchical), and model preference or search methods are often similar across different algorithms. The literature on learning algorithms frequently does not state clearly the model representation, preference criterion, or search method used; these are often mixed up in a description of a particular algorithm. The reductionist view clarifies the independent contributions of each component.
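As a concrete illustration of this decomposition, the sketch below separates the three components for a toy one-parameter linear model. The function names, the grid of candidate parameters, and the data points are all invented for the example, not taken from the article.

```python
# A minimal sketch of the model/preference/search decomposition,
# using the one-parameter linear model y = w * x.

def model(w, x):
    """Model component: a linear function of one variable."""
    return w * x

def preference(w, data):
    """Preference criterion: negative sum of squared errors (higher is better)."""
    return -sum((y - model(w, x)) ** 2 for x, y in data)

def search(data, candidates):
    """Search component: pick the candidate parameter the criterion prefers."""
    return max(candidates, key=lambda w: preference(w, data))

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]              # roughly y = 2x
best_w = search(data, candidates=[w / 10 for w in range(0, 51)])
```

Swapping any one component, such as a different model family, a penalized goodness-of-fit score, or a gradient-based search, changes the algorithm without touching the other two, which is the point of the reductionist view.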

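The nine-step process summarized earlier can also be sketched as a small pipeline. This is a hypothetical sketch: the step functions, the outlier threshold, and the records below are invented stand-ins for the much richer operations the steps describe.

```python
# Hypothetical sketch of KDD steps 2-8 as a pipeline:
# selection -> cleaning -> transformation -> data mining -> interpretation.

raw_data = [("a", 10.0), ("b", None), ("a", 12.0), ("b", 11.0), ("a", 1000.0)]

def select(records, category):
    """Step 2: focus on a subset of the data (one category)."""
    return [value for cat, value in records if cat == category]

def clean(values):
    """Step 3: drop missing values and gross outliers."""
    present = [v for v in values if v is not None]
    return [v for v in present if v < 100.0]     # crude, invented threshold

def transform(values):
    """Step 4: project to a reduced representation (here, center the data)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def mine(values):
    """Step 7: extract a trivial 'pattern' -- the spread of the cleaned data."""
    return max(values) - min(values)

def interpret(pattern):
    """Step 8: translate the pattern into user terms."""
    return f"values for this category vary over a range of {pattern:.1f}"

report = interpret(mine(transform(clean(select(raw_data, "a")))))
```

In a real application each stage would loop back to earlier ones under user control, as the article emphasizes; a straight-line composition is only the simplest case.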


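To make the earlier OLAP discussion concrete, the sketch below computes the kind of summary-along-dimensions breakdown that OLAP tools provide interactively. The sales records and field names are invented for the example.

```python
# Illustrative OLAP-style breakdown: summing a measure along any chosen
# combination of dimensions (records invented for the example).
from collections import defaultdict

sales = [
    {"region": "east", "product": "tea",    "units": 30},
    {"region": "east", "product": "coffee", "units": 20},
    {"region": "west", "product": "tea",    "units": 25},
    {"region": "west", "product": "tea",    "units": 15},
]

def breakdown(records, dimensions, measure):
    """Sum `measure` grouped by the given list of dimension fields."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key] += rec[measure]
    return dict(totals)

by_region = breakdown(sales, ["region"], "units")
by_region_product = breakdown(sales, ["region", "product"], "units")
```

A single SQL GROUP BY expresses one such breakdown; the appeal of OLAP tools is letting the analyst pivot among many dimension combinations interactively.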
Model Functions

The more common model functions in current data mining practice include:

• Classification: maps (or classifies) a data item into one of several predefined categorical classes.
• Regression: maps a data item to a real-value prediction variable.
• Clustering: maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data—unlike classification, in which the classes are predefined. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models.
• Summarization: provides a compact description for a subset of data. A simple example would be the mean and standard deviations for all fields. More sophisticated functions involve summary rules, multivariate visualization techniques, and functional relationships between variables. Summarization functions are often used in interactive exploratory data analysis and automated report generation.
• Dependency modeling: describes significant dependencies among variables. Dependency models exist at two levels: structural and quantitative. The structural level of the model specifies (often in graphical form) which variables are locally dependent; the quantitative level specifies the strengths of the dependencies using some numerical scale.
• Link analysis: determines relations between fields in the database (e.g., association rules [1] to describe which items are commonly purchased with other items in grocery stores). The focus is on deriving multifield correlations satisfying support and confidence thresholds.
• Sequence analysis: models sequential patterns (e.g., in data with time dependence, such as time-series analysis). The goal is to model the states of the process generating the sequence or to extract and report deviation and trends over time.

Model Representation

Popular model representations include decision trees and rules, linear models, nonlinear models (e.g., neural networks), example-based methods (e.g., nearest-neighbor and case-based reasoning methods), probabilistic graphical dependency models (e.g., Bayesian networks [6]), and relational attribute models. Model representation determines both the flexibility of the model in representing the data and the interpretability of the model in human terms. Typically, the more complex models may fit the data better but may also be more difficult to understand and to fit reliably. While researchers tend to advocate complex models, practitioners involved in successful applications often use simpler models due to their robustness and interpretability [3, 5].

Model preference criteria determine how well a particular model and its parameters meet the criteria of the KDD process. Typically, there is an explicit quantitative criterion embedded in the search algorithm (e.g., the maximum likelihood criterion of finding the parameters that maximize the probability of the observed data). Also, an implicit criterion (reflecting the subjective bias of the analyst in terms of which models are initially chosen for consideration) is often used in the outer loops of the KDD process.

Search algorithms are of two types: parameter search, given a model, and model search over model space. Finding the best parameters is often reduced to an optimization problem (e.g., finding the global maximum of a nonlinear function in parameter space). Data mining algorithms tend to rely on relatively simple optimization techniques (e.g., gradient descent), although in principle more sophisticated optimization techniques are also used. Problems with local minima are common and dealt with in the usual manner (e.g., multiple random restarts and searching for multiple models). Search over model space is usually carried out in a greedy fashion.

A brief review of specific popular data mining algorithms can be found in [4, 5]. An important point is that each technique typically suits some problems better than others. For example, decision-tree classifiers can be very useful for finding structure in high-dimensional spaces and are also useful in problems with mixed continuous and categorical data (since tree methods do not require distance metrics). However, classification trees with univariate threshold decision boundaries may not be suitable for problems where the true decision boundaries are nonlinear multivariate functions. Thus, there is no universally best data mining method; choosing a particular algorithm for a particular application is something of an art.
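The support and confidence thresholds mentioned under link analysis can be illustrated with a small sketch. The market baskets below are invented, and real association-rule miners such as the algorithms in [1] search the space of candidate rules far more efficiently than this direct computation.

```python
# Illustrative support/confidence computation for one association rule
# (basket data invented for the example).

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both sides of the rule."""
    both = antecedent | consequent
    return sum(1 for b in baskets if both <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    with_ante = [b for b in baskets if antecedent <= b]
    return sum(1 for b in with_ante if consequent <= b) / len(with_ante)

s = support({"bread"}, {"milk"}, baskets)       # 2 of 4 baskets
c = confidence({"bread"}, {"milk"}, baskets)    # 2 of the 3 bread baskets
```

A mining system would report the rule "bread → milk" only if both values clear user-set thresholds, which is exactly the filtering role the thresholds play above.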

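Parameter search by gradient descent, mentioned above as a typical simple optimization technique, can be sketched for the toy model y = w * x. The data points, step size, and iteration count are invented for the example.

```python
# Minimal sketch of parameter search by gradient descent, minimizing the
# sum of squared errors for the one-parameter model y = w * x.

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # invented, close to y = 2x

def sse_gradient(w):
    """d/dw of sum_i (y_i - w*x_i)^2 = -2 * sum_i x_i * (y_i - w*x_i)."""
    return -2.0 * sum(x * (y - w * x) for x, y in data)

w = 0.0                      # initial guess
for _ in range(200):         # fixed number of small steps
    w -= 0.01 * sse_gradient(w)
# w converges toward the least-squares solution sum(x*y) / sum(x*x)
```

For this quadratic error surface there is a single global minimum; the multiple random restarts mentioned above matter when the surface has local minima, as with nonlinear models.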


In practice, a large portion of the applications effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data mining method.

The high-level goals of data mining tend to be predictive, descriptive, or a combination of predictive and descriptive. A purely predictive goal focuses on accuracy in predictive ability. A purely descriptive goal focuses on understanding the underlying data-generating process—a subtle but important distinction. In prediction, a user may not care whether the model reflects reality as long as it has predictive power (e.g., a model combining current financial indicators in some nonlinear manner to predict future dollar-to-deutsche-mark exchange rates). A descriptive model, on the other hand, is interpreted as a reflection of reality (e.g., a model relating economic and demographic variables to educational achievements used as the basis for social policy recommendations to cause change). In practice, most KDD applications demand some degree of both predictive and descriptive modeling.

Research Issues and Challenges

Current primary research and application challenges for KDD [4, 5] include:

• Massive datasets and high dimensionality. Multigigabyte databases with millions of records and large numbers of fields (attributes and variables) are commonplace. These datasets create combinatorially explosive search spaces for model induction and increase the chances that a data mining algorithm will find spurious patterns that are not generally valid. Possible solutions include very efficient algorithms, sampling, approximation methods, massively parallel processing, dimensionality reduction techniques, and incorporation of prior knowledge.
• User interaction and prior knowledge. An analyst is usually not a KDD expert but a person responsible for making sense of the data using available KDD techniques. Since the KDD process is by definition interactive and iterative, it is a challenge to provide a high-performance, rapid-response environment that also assists users in the proper selection and matching of appropriate tools and techniques to achieve their goals. There needs to be more emphasis on human-computer interaction and less emphasis on total automation—with the aim of supporting both expert and novice users. Many current KDD methods and tools are not truly interactive and do not easily incorporate prior knowledge about a problem except in simple ways. Use of domain knowledge is important in all steps of the KDD process. For example, Bayesian approaches use prior probabilities over data and distributions as one way of encoding prior knowledge (see [6] and Glymour's article on statistical inference in this special section). Others employ deductive database capabilities to discover knowledge that is then used to guide the data mining search.
• Overfitting and assessing statistical significance. When an algorithm searches for the best parameters for one particular model using a limited set of data, it may overfit the data, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies. Proper assessment of statistical significance is often missed when the system searches many possible models. Simple methods to handle this problem include adjusting the test statistic as a function of the search (e.g., Bonferroni adjustments for independent tests) and randomization testing, although this area is largely unexplored.
• Missing data. This problem is especially acute in business databases. Important attributes may be missing if the database was not designed with discovery in mind. Missing data can result from operator error, actual system and measurement failures, or from a revision of the data collection process over time (e.g., new variables are measured, but they were considered unimportant a few months before). Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies.
• Understandability of patterns. In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphical representations, rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule refinement strategies can also help address a related problem: Discovered knowledge may be implicitly or explicitly redundant.
• Managing changing data and knowledge. Rapidly changing (nonstationary) data may make previously discovered patterns invalid. In addition, the variables measured in a given application database may be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change.
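The danger of searching many models without adjusting significance, and the effect of a Bonferroni-style correction, can be illustrated with a small simulation. All data here are randomly generated, so any "significant" association is spurious by construction; the sample sizes and thresholds are invented for the example.

```python
# Illustrative simulation: testing many random candidate patterns against a
# random target. At the usual 0.05 level some candidates look "significant"
# by luck alone; the Bonferroni-adjusted threshold (alpha / number of tests)
# filters most of these false discoveries out.
import math
import random

random.seed(1)
n_records, n_candidates, alpha = 20, 1000, 0.05
target = [random.randint(0, 1) for _ in range(n_records)]

def p_value(matches, n):
    """One-sided binomial tail: chance of >= `matches` agreements by luck."""
    return sum(math.comb(n, i) for i in range(matches, n + 1)) / 2 ** n

p_values = []
for _ in range(n_candidates):
    candidate = [random.randint(0, 1) for _ in range(n_records)]
    matches = sum(c == t for c, t in zip(candidate, target))
    p_values.append(p_value(matches, n_records))

naive_hits = sum(p < alpha for p in p_values)               # spurious "finds"
bonferroni_hits = sum(p < alpha / n_candidates for p in p_values)
```

The same mechanism underlies the earlier observation that searching long enough in randomly generated data yields patterns that appear statistically significant but are not.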



• Integration. A standalone discovery system may not be very useful. Typical integration issues include integration with a DBMS (e.g., via a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Highly interactive human-computer environments as outlined by the KDD process permit both human-assisted computer discovery and computer-assisted human discovery. Development of tools for visualization, interpretation, and analysis of discovered patterns is of paramount importance. Such interactive environments can enable practical solutions to many real-world problems far more rapidly than humans or computers operating independently. There is both an opportunity and a challenge in developing techniques to integrate the OLAP tools of the database community and the data mining tools of the machine learning and statistical communities.
• Nonstandard, multimedia, and object-oriented data. A significant trend is that databases contain not just numeric data but large quantities of nonstandard and multimedia data. Nonstandard data types include nonnumeric, nontextual, geometric, and graphical data, as well as nonstationary, temporal, spatial, and relational data, and a mixture of categorical and numeric fields in the data. Multimedia data include free-form multilingual text as well as digitized images, video, and speech and audio data. These data types are largely beyond the scope of current KDD technology.

Conclusions

Despite its rapid growth, the KDD field is still in its infancy. There are many challenges to overcome, but some successes have been achieved (see the articles by Brachman on business applications and by Fayyad on science applications in this special section). Because the potential payoffs of KDD applications are high, there has been a rush to offer products and services in the market. A great challenge facing the field is how to avoid the kind of false expectations plaguing other nascent (and related) technologies (e.g., artificial intelligence and neural networks). It is the responsibility of researchers and practitioners in this field to ensure that the potential contributions of KDD are not overstated and that users understand the true nature of the contributions along with their limitations.

Fundamental problems at the heart of the field remain unsolved. For example, the basic problems of statistical inference and discovery remain as difficult and challenging as they always have been. Capturing the art of analysis and the ability of the human brain to synthesize new knowledge from data is still unsurpassed by any machine. However, the volumes of data to be analyzed make machines a necessity. This niche for using machines as an aid to analysis and the hope that the massive datasets contain nuggets of valuable knowledge drive interest and research in the field. Bringing together a set of varied fields, KDD creates fertile ground for the growth of new tools for managing, analyzing, and eventually gaining the upper hand over the flood of data facing modern society. The fact that the field is driven by strong social and economic needs is the impetus to its continued growth. The reality check of real applications will act as a filter to sift the good theories and techniques from those less useful.

References
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, I. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
2. Elder, J., and Pregibon, D. A statistical perspective on KDD. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
3. Fayyad, U., and Uthurusamy, R., Eds. Proceedings of KDD-95: The First International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif., 1995.
4. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
5. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Cambridge, Mass., 1996.
6. Heckerman, D. Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
7. Langley, P., and Simon, H.A. Applications of machine learning and rule induction. Commun. ACM 38, 11 (Nov. 1995), 55–64.

Additional references for this article can be found at http://www.research.microsoft.com/research/datamine/CACM-DM-refs/.

USAMA FAYYAD is a senior researcher at Microsoft Research and a distinguished visiting scientist at the Jet Propulsion Laboratory of the California Institute of Technology. He can be reached at fayyad@microsoft.com.

GREGORY PIATETSKY-SHAPIRO is a principal member of the technical staff at GTE Laboratories. He can be reached at gps@gte.com.

PADHRAIC SMYTH is an assistant professor of computer science at the University of California, Irvine, and technical group leader at the Jet Propulsion Laboratory of the California Institute of Technology. He can be reached at smyth@ics.uci.edu.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© ACM 0002-0782/96/1100 $3.50

