Data Mining: A First View
Chapter Objectives
• Define data mining and understand how data mining can be used to solve problems.
• Understand that computers are best at learning concept definitions.
• Know when data mining should be considered as a possible problem-solving strategy.
• Understand that expert systems and data mining use different means to accomplish similar goals.
• Understand that supervised learning builds models by forming concept definitions from data containing predefined classes.
• Understand that unsupervised clustering builds models from data without the aid of predefined classes.
• Recognize that classification models are able to classify new data of unknown origin.
• Realize that data mining has been successfully applied to solve problems in several domains.
Chapter 1 • Data Mining: A First View
This chapter offers an introduction to the fascinating world of data mining and knowledge discovery. You will learn about the basics of data mining and how data mining has been applied to solve real-world problems. In Section 1.1 we provide a definition of data mining. Section 1.2 discusses how computers are best at learning concept definitions and how concept definitions can be transformed into useful patterns. Section 1.3 offers guidance in understanding the types of problems that may be appropriate for data mining. In Section 1.4 we introduce expert systems and explain how expert systems build models by extracting problem-solving knowledge from one or more human experts. In Section 1.5 we offer a simple process model for data mining. Section 1.6 explores why using a simple table search may not be an appropriate data mining strategy. In Section 1.7 we detail several data mining applications. We conclude this chapter, as well as all chapters in this book, with a short summary, key term definitions, and a set of exercises. Let's get started!
... credit card company of a possible stolen card as soon as someone attempts a transaction that does not fit your general purchasing profile.
These three examples make it easy for us to see why data mining is fast becoming a preferred technique for extracting useful knowledge from data. Later in this chapter and throughout the text you will see additional examples of how data mining has been applied to solve real-world problems.
Knowledge Discovery in Databases (KDD) is a term frequently used interchangeably with data mining. Technically, KDD is the application of the scientific method to data mining. In addition to performing data mining, a typical KDD process model includes a methodology for extracting and preparing data as well as making decisions about actions to be taken once data mining has taken place. When a particular application involves the analysis of large volumes of data stored in several locations, data extraction and preparation become the most time-consuming parts of the discovery process. As data mining has become a popular name for the broader term, we do not concern ourselves with clearly discriminating between data mining and KDD. However, we do recognize the distinction and have devoted Chapter 5 to detailing the steps of two popular KDD process models.
1.2 What Can Computers Learn?
As the definition implies, data mining is about learning. Learning is a complex process. Four levels of learning can be differentiated (Merril and Tennyson, 1977):
• Facts. A fact is a simple statement of truth.
• Concepts. ...
• Procedures. ... We use procedures in our everyday functioning as well as in the solution of difficult problems.
• Principles. Principles represent the highest level of learning. Principles are general truths or laws that are basic to other truths.
Computers are good at learning concepts. Concepts are the output of a data mining session. The data mining tool dictates the form of learned concepts. Common concept structures include trees, rules, networks, and mathematical equations. Tree structures and production rules are easy for humans to interpret and understand. Networks and mathematical equations are black-box concept structures in that the knowledge they contain is not easily understood. We will examine these and other data mining structures throughout the text. First, we take a look at three common concept views.
Three Concept Views
Concepts can be viewed from different perspectives. An understanding of each view will help you categorize the data mining techniques discussed in this text. Let's take a moment to define and illustrate each view.
The classical view attests that all concepts have definite defining properties. These properties determine if an individual item is an example of a particular concept. The classical view definition of a concept is crisp and leaves no room for misinterpretation. This view supports all examples of a particular concept as being equally representative of the concept. Here is a rule that employs a classical view definition of a good credit risk for an unsecured loan:
IF Annual Income >= 30,000
& Years at Current Position >= 5
& Owns Home = True
THEN Good Credit Risk = True
The classical view states that all three rule conditions must be met for the applicant to be considered a good credit risk.
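The crisp, all-or-nothing character of this rule can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions; only the three thresholds come from the rule above.

```python
def is_good_credit_risk(annual_income, years_at_position, owns_home):
    """Classical-view rule: every one of the three conditions must hold."""
    return bool(
        annual_income >= 30_000
        and years_at_position >= 5
        and owns_home
    )


# An applicant who meets all three conditions is a good credit risk.
print(is_good_credit_risk(32_000, 6, True))
# Failing any single condition (here, home ownership) rejects the applicant.
print(is_good_credit_risk(52_000, 16, False))
```

Because the conditions are joined with a logical AND, there is no notion of an applicant being "almost" a good credit risk under this view.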
The probabilistic and exemplar views are similar in that neither requires concept representations to have defining properties. The probabilistic view holds that concepts are represented by properties that are probable of concept members. The assumption is that people store and recall concepts as generalizations created from individual exemplar (instance) observations. A probabilistic view definition of a good credit risk might look like this:
• The mean annual income for individuals who consistently make loan payments on time is $30,000.
• Most individuals who are good credit risks have been working for the same company for at least five years.
• The majority of good credit risks own their own home.
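One way to picture the difference from the classical view is to score an applicant by how many of these typical properties he or she shows, rather than requiring all of them. The sketch below is only an illustration: the equal weighting of the three properties and the use of the stated mean income as a cutoff are assumptions made here, not part of the text's definition.

```python
def typical_property_score(annual_income, years_at_position, owns_home):
    """Return the fraction of typical good-credit-risk properties present.

    Assumption: each property counts equally, and the stated mean income
    of $30,000 is used as a simple cutoff.
    """
    matches = [
        annual_income >= 30_000,   # at or above the stated mean income
        years_at_position >= 5,    # at least five years with one company
        bool(owns_home),           # owns a home
    ]
    return sum(matches) / len(matches)


# A renter with a high income and long tenure shows two of three properties.
score = typical_property_score(52_000, 16, False)
print(round(score, 2))
```

Unlike the crisp rule, an applicant who misses one property still receives a graded degree of membership instead of an outright rejection.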
Exemplar #1:
Annual Income = 32,000
Number of Years at Current Position = 6
Homeowner
Exemplar #2:
Annual Income = 52,000
Number of Years at Current Position = 16
Renter
Exemplar #3:
Annual Income = 28,000
Number of Years at Current Position = 72
Homeowner
As with the probabilistic view, the exemplar view can associate a probability of concept membership with each classification.
As we have seen, concepts can be studied from at least three points of view. In addition, concept definitions can be formed in several ways. Supervised learning is probably the best understood concept learning method and the most widely used technique for data mining. We introduce supervised learning in the next section.
Supervised Learning
When we are young, we use induction to form basic concept definitions. We see instances of concepts representing animals, plants, building structures, and the like. We hear the labels given to individual instances and choose what we believe to be the defining concept features (attributes) and form our own classification models. Later, we use the models we have developed to help us identify objects of similar structure.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion  Headache  Diagnosis
1        Yes      Yes     Yes       Yes         Yes       Strep throat
2        No       No      No        Yes         Yes       Allergy
3        Yes      Yes     No        Yes         No        Cold
4        Yes      No      Yes       No          No        Strep throat
5        No       Yes     No        Yes         No        Cold
6        No       No      No        Yes         No        Allergy
7        No       No      Yes       No          No        Strep throat
8        Yes      No      No        Yes         Yes       Allergy
9        No       Yes     No        Yes         Yes       Cold
10       Yes      Yes     No        Yes         Yes       Cold
Supervised Learning: A Decision Tree Example
We presented the data in Table 1.1 to C4.5 (Quinlan, 1993), a supervised learning program that generalizes a set of input instances by building a decision tree. A decision tree is a simple structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes. Decision trees have several advantages in that they are easy for us to understand, can be transformed into rules, and have been shown to work well experimentally. A supervised algorithm for creating a decision tree will be detailed in Chapter 3.
Figure 1.1 shows the decision tree created from the data in Table 1.1. The decision tree generalizes the table data. Specifically, ...
Figure 1.1 • A decision tree for the data in Table 1.1
[Figure: the root node tests Swollen Glands. Yes leads to Diagnosis = Strep Throat. No leads to a test on Fever, where No gives Diagnosis = Allergy and Yes gives Diagnosis = Cold.]
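To make the structure of the tree concrete, here is a minimal Python sketch that stores it as nested dictionaries and classifies an instance by following one path from the root to a leaf. The dictionary layout and the names `TREE` and `classify` are assumptions chosen for illustration, and the sketch assumes that the No branch of Swollen Glands tests Fever. It is a hand-coded copy of the finished tree, not the C4.5 algorithm that builds it.

```python
# Each internal node names the attribute it tests and maps every attribute
# value to either a subtree (dict) or a leaf (the diagnosis string).
TREE = {
    "attribute": "Swollen Glands",
    "branches": {
        "Yes": "Strep Throat",
        "No": {
            "attribute": "Fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}


def classify(tree, instance):
    """Follow one root-to-leaf path and return the diagnosis at the leaf."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


# Patients 1, 2, and 3 of Table 1.1, restricted to the attributes tested.
training_sample = [
    ({"Swollen Glands": "Yes", "Fever": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
]
for attributes, diagnosis in training_sample:
    print(classify(TREE, attributes), "| expected:", diagnosis)
```

Note that sore throat, congestion, and headache never appear in the sketch: the tree reaches a diagnosis without testing them.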
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion  Headache  Diagnosis
11       No       No      Yes       Yes         Yes       ?
12       Yes      Yes     No        No          Yes       ?
13       No       No      No        No          Yes       ?
We can map a decision tree to a set of production rules. Production rules take the form
IF antecedent conditions
THEN consequent conditions
The antecedent conditions detail values or value ranges for one or more input attributes. The consequent conditions specify the values or value ranges for the output attributes. The technique for mapping a decision tree to a set of production rules is simple. A rule is created by starting at the root node and following one path of the tree to a leaf node. The antecedent of a rule is given by the attribute value combinations seen along the path. The consequent of the corresponding rule is the value at the leaf node. Here are the three production rules for the decision tree shown in Fig. 1.1:
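1. IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
2. IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
3. IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy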
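The path-following technique just described can be sketched as a short recursive function. The nested-dictionary tree layout and the name `tree_to_rules` are assumptions made for this illustration, and the tree is assumed to test Fever on the No branch of Swollen Glands.

```python
# Hand-coded copy of the tree in Fig. 1.1 (layout is an assumption).
TREE = {
    "attribute": "Swollen Glands",
    "branches": {
        "Yes": "Strep Throat",
        "No": {
            "attribute": "Fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}


def tree_to_rules(node, conditions=()):
    """Return one (antecedent, consequent) pair per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf reached: the tests collected so far form the antecedent.
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


for antecedent, diagnosis in tree_to_rules(TREE):
    tests = " & ".join(f"{attr} = {value}" for attr, value in antecedent)
    print(f"IF {tests} THEN Diagnosis = {diagnosis}")
```

Each leaf yields exactly one rule, so a tree with three leaves always maps to three production rules.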
Let's use the production rules to classify the table instance with patient ID = 13. Because swollen glands equals No, we pass over the first rule. Likewise, because fever equals No, the second rule does not apply. Finally, both antecedent conditions for the third rule are satisfied. Therefore we are able to apply the third rule and diagnose the patient as having an allergy.
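The same walk through the rules can be written as a small Python sketch. The rule encoding and the name `apply_rules` are illustrative assumptions; the rules themselves and the attribute values for patient 13 are those discussed above.

```python
# Each rule pairs an antecedent (attribute -> required value) with a diagnosis.
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]


def apply_rules(rules, instance):
    """Return the consequent of the first rule whose antecedent is satisfied."""
    for antecedent, diagnosis in rules:
        if all(instance.get(attr) == value for attr, value in antecedent.items()):
            return diagnosis
    return None  # no rule applies


patient_13 = {"Swollen Glands": "No", "Fever": "No"}
print(apply_rules(RULES, patient_13))  # the third rule fires: Allergy
```

The first two rules are skipped for the reasons given in the text, and only the third has every antecedent condition satisfied.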
Unsupervised Clustering
Unlike supervised learning, unsupervised clustering builds models from data without predefined classes. Data instances are grouped together based on a similarity scheme defined by the clustering system. With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.
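As a rough illustration of grouping by similarity without class labels, the sketch below uses a very simple scheme: each instance joins the first cluster whose first member it resembles closely enough, and otherwise starts a new cluster. The similarity measure, the threshold, and the data are all assumptions invented for this example; they do not describe any particular clustering system discussed in the text.

```python
def similarity(a, b):
    """Fraction of attributes on which two instances have the same value."""
    return sum(a[key] == b[key] for key in a) / len(a)


def cluster(instances, threshold):
    """Group instances by similarity to the first member of each cluster."""
    clusters = []
    for instance in instances:
        for members in clusters:
            if similarity(instance, members[0]) >= threshold:
                members.append(instance)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([instance])
    return clusters


# Hypothetical symptom records; no diagnosis (class) column is supplied.
data = [
    {"Fever": "Yes", "Congestion": "Yes", "Headache": "No"},
    {"Fever": "Yes", "Congestion": "Yes", "Headache": "Yes"},
    {"Fever": "No", "Congestion": "No", "Headache": "Yes"},
    {"Fever": "No", "Congestion": "No", "Headache": "No"},
]
groups = cluster(data, threshold=0.6)
print([len(group) for group in groups])
```

The procedure returns groups but attaches no meaning to them. Deciding what each group represents is left to the analyst, as the paragraph above notes.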
To further distinguish between supervised learning and unsupervised clustering, consider the hypothetical data in Table 1.3. The table provides a sampling of information about five customers maintaining a brokerage account with Acme Investors Incorporated. The attributes customer ID, sex, age, favorite recreation, and income are self-explanatory. Account type indicates whether the account is held by a single person