
Chapter 1

Data Mining: A First View

Chapter Objectives

•	Define data mining and understand how data mining can be used to solve problems.

•	Understand that computers are best at learning concept definitions.

•	Know when data mining should be considered as a possible problem-solving strategy.

•	Understand that expert systems and data mining use different means to accomplish similar goals.

•	Understand that supervised learning builds models by forming concept definitions from data containing predefined classes.

•	Understand that unsupervised clustering builds models from data without the aid of predefined classes.

•	Recognize that classification models are able to classify new data of unknown origin.

•	Realize that data mining has been successfully applied to solve problems in several domains.


This chapter offers an introduction to the fascinating world of data mining and knowledge discovery. You will learn about the basics of data mining and how data mining has been applied to solve real-world problems. In Section 1.1 we provide a definition of data mining. Section 1.2 discusses how computers are best at learning concept definitions and how concept definitions can be transformed into useful patterns. Section 1.3 offers guidance in understanding the types of problems that may be appropriate for data mining. In Section 1.4 we introduce expert systems and explain how expert systems build models by extracting problem-solving knowledge from one or more human experts. In Section 1.5 we offer a simple process model for data mining. Section 1.6 explores why using a simple table search may not be an appropriate data mining strategy. In Section 1.7 we detail several data mining applications. We conclude this chapter, as well as all chapters in this book, with a short summary, key term definitions, and a set of exercises. Let's get started!

1.1 Data Mining: A Definition


We define data mining as the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database. The purpose of a data mining session is to identify trends and patterns in data. Ray Kurzweil, "the father of voice-recognition software" and designer of the Kurzweil keyboard, recently stated that 98% of all human learning is "pattern recognition."

The knowledge gained from a data mining session is given as a model or generalization of the data. Several data mining techniques exist. However, all data mining methods use induction-based learning. Induction-based learning is the process of forming general concept definitions by observing specific examples of concepts to be learned. Here are three examples of knowledge gained through the process of induction-based learning:

•	Did you ever wonder why so many televised golf tournaments are sponsored by online brokerage firms such as Charles Schwab and TD Waterhouse? A primary reason is that over 70% of all online investors are males over the age of 40 who play golf. In addition, 60% of all stock investors are golfers.

•	Does it make sense for a music company to advertise rap music in magazines for senior citizens? It does when you learn that senior citizens often purchase rap music for their teenage grandchildren.

•	Did you know that credit card companies can often suspect a stolen credit card, even if the card holder is unaware of the theft? Many credit card companies store a generalized model of your credit card purchasing habits. The model alerts the


credit card company of a possible stolen card as soon as someone attempts a transaction that does not fit your general purchasing profile.

These three examples make it easy for us to see why data mining is fast becoming a preferred technique for extracting useful knowledge from data. Later in this chapter and throughout the text you will see additional examples of how data mining has been applied to solve real-world problems.
Knowledge discovery in databases (KDD) is a term frequently used interchangeably with data mining. Technically, KDD is the application of the scientific method to data mining. In addition to performing data mining, a typical KDD process model includes a methodology for extracting and preparing data as well as making decisions about actions to be taken once data mining has taken place. When a particular application involves the analysis of large volumes of data stored in several locations, data extraction and preparation become the most time-consuming parts of the discovery process. As data mining has become a popular name for the broader term, we do not concern ourselves with clearly discriminating between data mining and KDD. However, we do recognize the distinction and have devoted Chapter 5 to detailing the steps of two popular KDD process models.

1.2 What Can Computers Learn?
As the definition implies, data mining is about learning. Learning is a complex process. Four levels of learning can be differentiated (Merril and Tennyson, 1977):

•	Facts. A fact is a simple statement of truth.

•	Concepts. A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

•	Procedures. A procedure is a step-by-step course of action to achieve a goal. We use procedures in our everyday functioning as well as in the solution of difficult problems.

•	Principles. Principles represent the highest level of learning. Principles are general truths or laws that are basic to other truths.
Computers are good at learning concepts. Concepts are the output of a data mining session. The data mining tool dictates the form of learned concepts. Common concept structures include trees, rules, networks, and mathematical equations. Tree structures and production rules are easy for humans to interpret and understand. Networks and mathematical equations are black-box concept structures in that the knowledge they contain is not easily understood. We will examine these and other data mining structures throughout the text. First, we take a look at three common concept views.

Three Concept Views

Concepts can be viewed from different perspectives. An understanding of each view will help you categorize the data mining techniques discussed in this text. Let's take a moment to define and illustrate each view.

The classical view attests that all concepts have definite defining properties. These properties determine if an individual item is an example of a particular concept. The classical view definition of a concept is crisp and leaves no room for misinterpretation. This view supports all examples of a particular concept as being equally representative of the concept. Here is a rule that employs a classical view definition of a good credit risk for an unsecured loan:

IF Annual Income >= 30,000
   & Years at Current Position >= 5
   & Owns Home = True
THEN Good Credit Risk = True

The classical view states that all three rule conditions must be met for the applicant to be considered a good credit risk.
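A classical-view definition translates directly into code as a conjunction of crisp tests. The sketch below (hypothetical function and parameter names; the thresholds come from the rule above) returns True only when every defining property holds:

```python
def good_credit_risk(annual_income, years_at_position, owns_home):
    """Classical view: every defining property must hold (a crisp AND)."""
    return (annual_income >= 30000
            and years_at_position >= 5
            and owns_home)

# All three conditions met -> a good credit risk
print(good_credit_risk(32000, 6, True))   # True
# One condition fails -> not a good credit risk; no partial membership
print(good_credit_risk(32000, 4, True))   # False
```

Because the definition is crisp, an applicant one year short on job tenure is rejected outright, no matter how strong the other attributes are.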
The probabilistic and exemplar views are similar in that neither requires concept representations to have defining properties. The probabilistic view holds that concepts are represented by properties that are probable of concept members. The assumption is that people store and recall concepts as generalizations created from individual exemplar (instance) observations. A probabilistic view definition of a good credit risk might look like this:

•	The mean annual income for individuals who consistently make loan payments on time is $30,000.

•	Most individuals who are good credit risks have been working for the same company for at least five years.

•	The majority of good credit risks own their own home.

This definition offers general guidelines about the characteristics representative of a good credit risk. Unlike the classical view definition, this definition cannot be directly applied to achieve an answer about whether a specific person should be given an unsecured loan. However, the definition can be used to help with the decision-making process. The probabilistic view may also associate a probability of membership


with a specific classification. For example, a homeowner with an annual income of $27,000 employed at the same position for four years might be classified as a good credit risk with a probability of 0.85.
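One way to picture the probabilistic view in code is to let each probable property contribute partial evidence toward membership. The scoring scheme and weights below are illustrative assumptions, not a method prescribed by the text; they simply show how a near-miss applicant can still receive a high membership probability:

```python
def credit_risk_probability(annual_income, years_at_position, owns_home):
    """Probabilistic view sketch: each probable property contributes
    partial evidence. The weights are illustrative, not from the text."""
    score = 0.0
    # Income near or above the $30,000 guideline earns up to 0.4
    score += 0.4 if annual_income >= 30000 else 0.4 * annual_income / 30000
    # Job tenure near or above five years earns up to 0.3
    score += 0.3 if years_at_position >= 5 else 0.3 * years_at_position / 5
    # Home ownership earns 0.3
    score += 0.3 if owns_home else 0.0
    return round(score, 2)

# The near-miss homeowner from the text still scores high
print(credit_risk_probability(27000, 4, True))  # 0.9
```

Unlike the crisp classical rule, partial satisfaction of the guidelines yields a graded degree of membership rather than an outright rejection.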
The exemplar view states that a given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept. The view attests that people store and recall likely concept exemplars that are then used to classify new instances. Consider the loan applicant described in the previous paragraph. The applicant would be classified as a good credit risk if the applicant were similar enough to one or more of the stored instances representing good credit risk candidates. Here is a possible list of exemplars considered to be good credit risks:

•	Exemplar #1:
	Annual Income = 32,000
	Number of Years at Current Position = 6
	Homeowner

•	Exemplar #2:
	Annual Income = 52,000
	Number of Years at Current Position = 16
	Renter

•	Exemplar #3:
	Annual Income = 28,000
	Number of Years at Current Position = 12
	Homeowner

As with the probabilistic view, the exemplar view can associate a probability of concept membership with each classification.
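The exemplar view can be sketched as a similarity search against the stored exemplars. The exemplars below are the three listed above; the similarity measure, normalization constants, and threshold are illustrative assumptions, not values given in the text:

```python
# Exemplar view sketch: classify by similarity to stored exemplars.
EXEMPLARS = [
    {"income": 32000, "years": 6,  "homeowner": True},
    {"income": 52000, "years": 16, "homeowner": False},
    {"income": 28000, "years": 12, "homeowner": True},
]

def similarity(a, b):
    """Crude similarity: average normalized closeness on each attribute."""
    s  = 1 - min(abs(a["income"] - b["income"]) / 50000, 1)
    s += 1 - min(abs(a["years"] - b["years"]) / 20, 1)
    s += 1 if a["homeowner"] == b["homeowner"] else 0
    return s / 3

def is_good_credit_risk(applicant, threshold=0.8):
    """Good risk if similar enough to at least one stored exemplar."""
    return max(similarity(applicant, e) for e in EXEMPLARS) >= threshold

applicant = {"income": 30000, "years": 5, "homeowner": True}
print(is_good_credit_risk(applicant))  # True (close to Exemplar #1)
```

The applicant is accepted because one stored exemplar is close enough; the other two exemplars need not match at all.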
As we have seen, concepts can be studied from at least three points of view. In addition, concept definitions can be formed in several ways. Supervised learning is probably the best understood concept learning method and the most widely used technique for data mining. We introduce supervised learning in the next section.

Supervised Learning

When we are young, we use induction to form basic concept definitions. We see instances of concepts representing animals, plants, building structures, and the like. We hear the labels given to individual instances and choose what we believe to be the defining concept features (attributes) and form our own classification models. Later, we use the models we have developed to help us identify objects of similar structure.


The name for this type of learning is induction-based supervised concept learning, or just supervised learning.

The purpose of supervised learning is two-fold. First, we use supervised learning to build classification models from sets of data containing examples and nonexamples of the concepts to be learned. Each example or nonexample is known as an instance of data. Second, once a classification model has been constructed, the model is used to determine the classification of newly presented instances of unknown origin. It is worth noting that, although model creation is inductive, applying the model to classify new instances of unknown origin is a deductive process.

To more clearly illustrate the idea of supervised learning, consider the hypothetical dataset shown in Table 1.1. The dataset is very small and is relevant for illustrative purposes only. The table data is displayed in attribute-value format where the first row shows names for the attributes whose values are contained in the table. The attributes sore throat, fever, swollen glands, congestion, and headache are possible symptoms experienced by individuals who have a particular affliction (a strep throat, a cold, or an allergy). These attributes are known as input attributes and are used to create a model to represent the data. Diagnosis is the attribute whose value we wish to predict. Diagnosis is known as the class or output attribute.

Starting with the second row of the table, each remaining row is an instance of data. An individual row shows the symptoms and affliction of a single patient. For example, the patient with ID = 1 has a sore throat, fever, swollen glands, congestion, and a headache. The patient has been diagnosed as having strep throat.

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
1             Yes           Yes     Yes              Yes          Yes        Strep throat
2             No            No      No               Yes          Yes        Allergy
3             Yes           Yes     No               Yes          No         Cold
4             Yes           No      Yes              No           No         Strep throat
5             No            Yes     No               Yes          No         Cold
6             No            No      No               Yes          No         Allergy
7             No            No      Yes              No           No         Strep throat
8             Yes           No      No               Yes          Yes        Allergy
9             No            Yes     No               Yes          Yes        Cold
10            Yes           Yes     No               Yes          Yes        Cold


Suppose we wish to develop a generalized model to represent the data shown in Table 1.1. Even though this dataset is small, it would be difficult for us to develop a general representation unless we knew something about the relative importance of the individual attributes and possible relationships among the attributes. Fortunately, an appropriate supervised learning algorithm can do the work for us.

Supervised Learning: A Decision Tree Example

We presented the data in Table 1.1 to C4.5 (Quinlan, 1993), a supervised learning program that generalizes a set of input instances by building a decision tree. A decision tree is a simple structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes. Decision trees have several advantages in that they are easy for us to understand, can be transformed into rules, and have been shown to work well experimentally. A supervised algorithm for creating a decision tree will be detailed in Chapter 3.
Figure 1.1 shows the decision tree created from the data in Table 1.1. The decision tree generalizes the table data. Specifically,

Figure 1.1 • A decision tree for the data in Table 1.1

                  Swollen Glands
                  /            \
                No              Yes
                /                \
             Fever       Diagnosis = Strep Throat
             /    \
           No      Yes
           /         \
Diagnosis = Allergy   Diagnosis = Cold


If a patient has swollen glands, the diagnosis is strep throat.

If a patient does not have swollen glands and has a fever, the diagnosis is a cold.

If a patient does not have swollen glands and does not have a fever, the diagnosis is an allergy.

The decision tree tells us that we can accurately diagnose a patient in this dataset by concerning ourselves only with whether the patient has swollen glands and a fever. The attributes sore throat, congestion, and headache do not play a role in determining a diagnosis. As we can see, the decision tree has generalized the data and provided us with a summary of those attributes and attribute relationships important for an accurate diagnosis.
The instances used to create the decision tree model are known as training data. At this point, the training instances are the only instances known to be correctly classified by the model. However, our model is useful to the extent that it can correctly classify new instances whose classification is not known. To determine how well the model is able to be of general use, we test the accuracy of the model using a test set. The instances of the test set have a known classification. Therefore we can compare the test set instance classifications determined by the model with the correct classification values. Test set classification correctness gives us some indication about the future performance of the model.
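Test-set evaluation reduces to a short computation: accuracy is the fraction of test-set instances whose model-assigned class matches the known class. A minimal sketch, using hypothetical predictions rather than output from any particular model:

```python
def accuracy(predicted, actual):
    """Fraction of test-set instances the model classifies correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Model output vs. known classifications for a hypothetical test set
predicted = ["Strep throat", "Cold", "Allergy", "Cold"]
actual    = ["Strep throat", "Cold", "Allergy", "Allergy"]
print(accuracy(predicted, actual))  # 0.75
```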
Let's use the decision tree to classify the first two instances shown in Table 1.2. Since the patient with ID = 11 has a value of Yes for swollen glands, we follow the right link from the root node of the decision tree. The right link leads to a terminal node, indicating the patient has strep throat.

The patient with ID = 12 has a value of No for swollen glands. We follow the left link and check the value of the attribute fever. Since fever equals Yes, we diagnose the patient with a cold.
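The walk from the root of the decision tree to a leaf can be mirrored directly in code. This sketch hard-codes the two tests the tree actually uses (swollen glands, then fever) and applies them to patients 11 and 12:

```python
def diagnose(swollen_glands, fever):
    """Follow the decision tree of Figure 1.1 from root to leaf."""
    if swollen_glands == "Yes":
        return "Strep throat"   # right link: terminal node
    if fever == "Yes":
        return "Cold"           # left link, then the fever test
    return "Allergy"

print(diagnose(swollen_glands="Yes", fever="No"))   # patient 11: Strep throat
print(diagnose(swollen_glands="No",  fever="Yes"))  # patient 12: Cold
```

Note that sore throat, congestion, and headache never appear: the generalized model ignores them, just as the tree does.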

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
11            No            No      Yes              Yes          Yes        ?
12            Yes           Yes     No               No           Yes        ?
13            No            No      No               No           Yes        ?

We can translate any decision tree into a set of production rules. Production rules are rules of the form:

IF antecedent conditions
THEN consequent conditions

The antecedent conditions detail values or value ranges for one or more input attributes. The consequent conditions specify the values or value ranges for the output attributes. The technique for mapping a decision tree to a set of production rules is simple. A rule is created by starting at the root node and following one path of the tree to a leaf node. The antecedent of a rule is given by the attribute value combinations seen along the path. The consequent of the corresponding rule is the value at the leaf node. Here are the three production rules for the decision tree shown in Figure 1.1:

1.	IF Swollen Glands = Yes
	THEN Diagnosis = Strep Throat

2.	IF Swollen Glands = No & Fever = Yes
	THEN Diagnosis = Cold

3.	IF Swollen Glands = No & Fever = No
	THEN Diagnosis = Allergy

Let's use the production rules to classify the table instance with patient ID = 13. Because swollen glands equals No, we pass over the first rule. Likewise, because fever equals No, the second rule does not apply. Finally, both antecedent conditions for the third rule are satisfied. Therefore we are able to apply the third rule and diagnose the patient as having an allergy.
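Production rules like these are naturally represented as an ordered list of (antecedent, consequent) pairs, where the first rule whose antecedent is satisfied fires. A minimal sketch of that rule-matching process:

```python
# Production-rule sketch: each rule is (antecedent, consequent).
# Rules are tried in order; the first satisfied antecedent fires.
RULES = [
    (lambda p: p["swollen_glands"] == "Yes",                        "Strep throat"),
    (lambda p: p["swollen_glands"] == "No" and p["fever"] == "Yes", "Cold"),
    (lambda p: p["swollen_glands"] == "No" and p["fever"] == "No",  "Allergy"),
]

def classify(patient):
    """Apply the production rules to a patient record."""
    for antecedent, consequent in RULES:
        if antecedent(patient):
            return consequent
    return None  # no rule fired

# Patient ID = 13: swollen glands = No, fever = No
print(classify({"swollen_glands": "No", "fever": "No"}))  # Allergy
```

The first two rules are passed over exactly as in the walkthrough above, and the third rule fires.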

Unsupervised Clustering

Unlike supervised learning, unsupervised clustering builds models from data without predefined classes. Data instances are grouped together based on a similarity scheme defined by the clustering system. With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.
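As a concrete illustration of grouping by similarity alone, here is a minimal k-means sketch (a common clustering technique, though not one named in this section) applied to a hypothetical one-dimensional attribute, customer age. The algorithm returns groups, not class labels; deciding that the clusters mean "younger" and "older" customers is left to us:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means sketch: group numeric instances purely by
    distance to cluster centers; no predefined classes are used."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each instance to its nearest center
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Hypothetical customer ages: the algorithm finds two groups itself
ages = [23, 25, 27, 24, 61, 63, 65, 60]
for cluster in kmeans(ages, k=2):
    print(sorted(cluster))
```

Running the sketch separates the younger ages from the older ones, even though no age was ever labeled.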
To further distinguish between supervised learning and unsupervised clustering, consider the hypothetical data in Table 1.3. The table provides a sampling of information about five customers maintaining a brokerage account with Acme Investors Incorporated. The attributes customer ID, sex, age, favorite recreation, and income are self-explanatory. Account type indicates whether the account is held by a single person
