
CLASSIFICATION

IN DATA MINING
SUSHIL KULKARNI

Classification
What is classification?
Model Construction
ID3
Information Theory
Naïve Bayesian Classifier

CLASSIFICATION
PROBLEM

CLASSIFICATION PROBLEM
Given a database D = {t1, t2, …, tn} and a set
of classes C = {C1, …, Cm}, the Classification
Problem is to define a mapping f: D → C
where each ti is assigned to one class.

The problem is to build a model that classifies
data into these classes with the help of a given
set of data, called the training set.

CLASSIFICATION EXAMPLES

o Teachers classify students’ grades as A, B, C, D, or F.
o Identify mushrooms as poisonous or edible.
o Identify individuals with credit risks.

Why Classification? A motivating
application
Credit approval
o A bank wants to classify its customers
based on whether they are expected to
pay back their approved loans
The history of past customers is used to
train the classifier
The classifier provides rules, which
identify potentially reliable future
customers
Why Classification? A motivating
application
Credit approval
o Classification rule:
If age = “31...40” and income = high then
credit_rating = excellent
o Future customers
Suhas : age = 35, income = high ⇒ excellent
credit rating
Heena : age = 20, income = medium ⇒ fair
credit rating

Classification — A Two-Step
Process
Model construction: describing a set of
predetermined classes (here, Excellent and Fair)
using the training set.

The model is represented using classification
rules.

Model usage: classifying future or unseen data
with the constructed model.

Supervised Learning

Supervised learning (classification)


o Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
o New data is classified based on the training set

Classification Process (1):
Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training Data:
NAME    RANK            YEARS  TEACH
Henna   Assistant Prof  3      no
Leena   Assistant Prof  7      yes
Meena   Professor       2      yes
Dinesh  Associate Prof  7      yes
Dinu    Assistant Prof  6      no
Amar    Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN teach = ‘yes’

Classification Process (2): Use
the Model in Prediction

The classifier is applied to the testing data and to unseen data.

Testing Data:
NAME    RANK            YEARS  TEACH
Swati   Assistant Prof  2      no
Malika  Associate Prof  7      no
Tina    Professor       5      yes
June    Assistant Prof  7      yes

Unseen Data:
(Dina, Professor, 4) → Teach?

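The prediction step above can be sketched in a few lines of Python. This is only an illustration: the function name `predict_teach` and the tuple layout are assumptions, while the rule and the testing rows are taken from the slides.

```python
# Rule produced in the model construction step (from the slide).
def predict_teach(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label).
test_data = [
    ("Swati", "Assistant Prof", 2, "no"),
    ("Malika", "Associate Prof", 7, "no"),
    ("Tina", "Professor", 5, "yes"),
    ("June", "Assistant Prof", 7, "yes"),
]

# Count how many test rows the rule labels correctly.
correct = sum(
    predict_teach(rank, years) == actual
    for _, rank, years, actual in test_data
)
accuracy = correct / len(test_data)

# Unseen sample (Dina, Professor, 4).
unseen = predict_teach("Professor", 4)

print(f"test accuracy = {accuracy:.2f}, Dina -> {unseen}")
```

On these four rows the rule misclassifies only Malika (7 years, but labeled "no"), which shows why accuracy is estimated on testing data rather than on the training set.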
Model Construction: Example
Sr. Gender Age BP Drug
1 M 20 Normal A
2 F 73 Normal B
3 M 37 High A
4 M 33 Low B
5 F 48 High A
6 M 29 Normal A
7 F 52 Normal B
8 M 42 Low B
9 M 61 Normal B
10 F 30 Normal A
11 F 26 Low B
12 M 54 High A
Model Construction: Example
Directed Tree

Blood Pressure?
  Normal → Age?
    ≤ 40 → Drug A
    > 40 → Drug B
  High → Drug A
  Low → Drug B

Model Construction: Example

Tree summarizes the following:

o If BP = High, prescribe Drug A
o If BP = Low, prescribe Drug B
o If BP = Normal and age ≤ 40, prescribe Drug A; else prescribe Drug B

Two classes ‘Drug A’ and ‘Drug B’ are created.
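As a quick check, the tree can be written as a function and run against the twelve training rows. The sketch below assumes a function name `prescribe` and drops the Gender column, since the tree never tests it.

```python
def prescribe(bp: str, age: int) -> str:
    """Decision tree from the slide, written as nested tests."""
    if bp == "High":
        return "A"
    if bp == "Low":
        return "B"
    # bp == "Normal": split on age
    return "A" if age <= 40 else "B"

# Training rows from the slide: (age, blood pressure, drug).
training = [
    (20, "Normal", "A"), (73, "Normal", "B"), (37, "High", "A"),
    (33, "Low", "B"), (48, "High", "A"), (29, "Normal", "A"),
    (52, "Normal", "B"), (42, "Low", "B"), (61, "Normal", "B"),
    (30, "Normal", "A"), (26, "Low", "B"), (54, "High", "A"),
]

# Rows where the tree disagrees with the recorded drug.
errors = [row for row in training if prescribe(row[1], row[0]) != row[2]]
print(f"training errors: {len(errors)} of {len(training)}")
```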

Model Construction: Example
The tree is constructed with training data and
there is no training error.

All rules that we made are 100% correct
according to the training data.

In practical field data, it is unlikely that we get
rules with 100% accuracy and with high support.
Model Construction: Example
Accuracy and Support:
o Accuracy is 100% for each of the given rules.
o If BP = High, prescribe Drug A (Support = 3/12)
o If BP = Low, prescribe Drug B (Support = 3/12)
o If BP = Normal and age ≤ 40, prescribe Drug A (Support = 3/12)
o If BP = Normal and age > 40, prescribe Drug B (Support = 3/12)

Error and Support
Let t = number of data points in the data set, r = number of
data points in a node, max = number of data points of the
majority class in the node, and min = number of data points
of the minority class in the node.
o Accuracy = max / r
o Error = min / r
o Support = max / t

Accuracy and Error are calculated within a node, while
Support is calculated with respect to the total number
of data points in the given set.

Rules with different
accuracy & support

180 data points, split on X:

Node    Rule    Class counts  Error (E)  Accuracy (A)  Support (S)
Node P  X < 60  115 A, 5 B    5/120      115/120       115/180
Node Q  X > 60  58 A, 2 B     2/60       58/60         58/180
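The three measures can be computed directly from the class counts in a node. The helper below is a minimal sketch; the name `node_stats` is an assumption, and the error formula follows the slide's two-class definition (min / r).

```python
def node_stats(class_counts: dict, total_points: int) -> dict:
    """Accuracy, error and support of a node, as defined on the slide."""
    r = sum(class_counts.values())        # data points in the node
    majority = max(class_counts.values())
    minority = min(class_counts.values())
    return {
        "accuracy": majority / r,
        "error": minority / r,
        "support": majority / total_points,
    }

# Nodes P and Q from the slide, out of 180 data points.
node_p = node_stats({"A": 115, "B": 5}, total_points=180)
node_q = node_stats({"A": 58, "B": 2}, total_points=180)

for name, stats in (("P", node_p), ("Q", node_q)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Node Q is slightly more accurate than node P (58/60 against 115/120) but has about half the support, which is the trade-off the slide illustrates.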
Criteria to grow the tree

If the target attribute is categorical, the
tree is called a classification tree.
[E.g., drug prescribed]

If the target attribute is continuous, the
tree is called a regression tree.
[E.g., income]

CLASSIFICATION
TREES FOR
CATEGORICAL
ATTRIBUTES
DECISION TREE INDUCTION [ID3]
Decision tree generation consists of two phases
o Tree construction
  • At start, all the training examples are at the root
  • Partition examples recursively based on selected attributes
o Tree pruning
  • Identify and remove branches that reflect noise or outliers

Use of decision tree: classifying an unknown sample
o Test the attribute values of the sample against the
decision tree
Training Dataset
This follows an example from Quinlan’s ID3
No  age     income  student  credit_rating  buys_computer
1   <=30    high    no       fair           no
2   <=30    high    no       excellent      no
3   31…40   high    no       fair           yes
4   >40     medium  no       fair           yes
5   >40     low     yes      fair           yes
6   >40     low     yes      excellent      no
7   31…40   low     yes      excellent      yes
8   <=30    medium  no       fair           no
9   <=30    low     yes      fair           yes
10  >40     medium  yes      fair           yes
11  <=30    medium  yes      excellent      yes
12  31…40   medium  no       excellent      yes
13  31…40   high    yes      fair           yes
14  >40     medium  no       excellent      no
Output: ID 3 for “buys_computer”
‘no’ and ‘yes’ are the two classes created

age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes

ANOTHER EXAMPLE:
MARKS
o If x >= 90 then grade = A.
o If 80 <= x < 90 then grade = B.
o If 70 <= x < 80 then grade = C.
o If 60 <= x < 70 then grade = D.
o If x < 60 then grade = F.

x?
  >= 90 → A
  < 90 → x?
    >= 80 → B
    < 80 → x?
      >= 70 → C
      < 70 → x?
        >= 60 → D
        < 60 → F
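The marks tree is a chain of threshold tests, so it maps naturally onto a sequence of `if` statements. The sketch below assumes that the F cutoff is 60, so that every mark receives exactly one grade; the function name `grade` is illustrative.

```python
def grade(x: float) -> str:
    """Walk the marks tree from the root: each test peels off one grade."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

for marks in (95, 85, 72, 60, 59):
    print(marks, grade(marks))
```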
ALGORITHM FOR ID 3
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-
and-conquer manner
At start, all the training examples are at the root
Attributes are categorical
Samples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g., information
gain)

ALGORITHM FOR ID 3

Conditions for stopping partitioning


All samples for a given node belong to the
same class
There are no remaining attributes for
further partitioning – majority voting is
employed for classifying the leaf
There are no samples left

ID 3 : ADVANTAGES

Easy to understand.

Easy to generate rules

ID 3 :
DISADVANTAGES

May suffer from overfitting.

Does not easily handle continuous (numeric) data.

Can be quite large – pruning is necessary.

INFORMATION
THEORY

INFORMATION THEORY
When all the marbles in the bowl are
mixed up, little information is given.

When the marbles in the bowl are
separated into different classes, more
information is given.

ENTROPY

Entropy measures how mixed the class labels
(‘yes’ or ‘no’ in our example) are in a set of
samples, and so gives an idea of which attribute
to split on when growing the tree.

BUILDING THE
TREE

Information Gain ID3
Select the attribute with the highest information
gain
Assume there are two classes, P and N
Let the set S contain p elements of class P and n
elements of class N
The amount of information, needed to decide if an
arbitrary object in S belongs to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
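The information measure I(p, n) can be evaluated with a few lines of Python. This is a minimal sketch; the function name `info` is an assumption, and a class with zero elements is treated as contributing zero (the usual convention 0 · log 0 = 0).

```python
from math import log2

def info(p: int, n: int) -> float:
    """I(p, n): bits needed to decide whether an object of S is in P or N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # an empty class contributes nothing
            frac = count / total
            result -= frac * log2(frac)
    return result

print(round(info(9, 5), 3))  # mixed set
print(round(info(4, 0), 3))  # pure set
print(round(info(7, 7), 3))  # evenly mixed set
```

A pure set needs no information (value 0), while an evenly mixed two-class set needs a full bit (value 1).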

Information Gain in Decision
Tree Induction
Assume that using attribute A, a set S will be
partitioned into sets {S1, S2 , …, Sv}
If Si contains pi elements of P and ni elements of N,
the entropy, or the expected information needed to
classify objects in all sub trees Si is
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The encoding information that would be gained
by branching on A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$
Training Dataset
This follows an example from Quinlan’s ID3
No  age     income  student  credit_rating  buys_computer
1   <=30    high    no       fair           no
2   <=30    high    no       excellent      no
3   31…40   high    no       fair           yes
4   >40     medium  no       fair           yes
5   >40     low     yes      fair           yes
6   >40     low     yes      excellent      no
7   31…40   low     yes      excellent      yes
8   <=30    medium  no       fair           no
9   <=30    low     yes      fair           yes
10  >40     medium  yes      fair           yes
11  <=30    medium  yes      excellent      yes
12  31…40   medium  no       excellent      yes
13  31…40   high    yes      fair           yes
14  >40     medium  no       excellent      no
Attribute Selection by
Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31..40  4   0   0
>40     3   2   0.971

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Hence

$$\mathrm{Gain}(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246$$

Similarly

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Age has the maximum gain, so it is selected as the splitting attribute.
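The gains above can be recomputed from the training data. The sketch below is illustrative: the names `ROWS`, `entropy` and `gain` are assumptions, and the data set uses the 14 samples implied by I(9, 5) and by the age partitions (the two rows beyond the twelve printed in the training table are taken from the split tables on the next slide).

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Expected information of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attribute, target="buys_computer"):
    """Gain(A) = I(p, n) - E(A) for one attribute."""
    a, t = COLUMNS.index(attribute), COLUMNS.index(target)
    groups = defaultdict(list)
    for row in rows:
        groups[row[a]].append(row[t])  # class labels per attribute value
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[t] for row in rows]) - expected

gains = {name: gain(ROWS, name) for name in COLUMNS[:-1]}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
print("split on:", best)
```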
Splitting the samples using age

age? <=30
income  student  credit_rating  buys_computer
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

age? 31...40 (all samples labeled yes)
income  student  credit_rating  buys_computer
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

age? >40
income  student  credit_rating  buys_computer
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no

Output: ID 3 for “buys_computer”

age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
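The whole construction can be sketched as a short recursive function. This is a simplified illustration rather than Quinlan's full algorithm: there is no pruning, the names `DATA`, `gain` and `id3` are assumptions, and the 14-row data set is the one implied by the split tables. Because the sketch only branches on attribute values that actually occur, the "no samples left" stopping condition never triggers here.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
DATA = [dict(zip(ATTRS + ["buys_computer"], line.split())) for line in """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr, target):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target="buys_computer"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # all samples belong to the same class
        return labels[0]
    if not attrs:                  # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest, target)
                   for value in sorted({r[best] for r in rows})}}

tree = id3(DATA, ATTRS)
print(tree)
```

The result is a nested dictionary: each inner node maps an attribute name to its branches, and each leaf is a class label.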

CART

CART [CLASSIFICATION AND
REGRESSION TREE]
The algorithm is similar to ID 3 but uses the Gini
index, an impurity measure, to select
variables.
If the target variable is nominal and has more
than two categories, the option of merging the
target categories into two super-categories
may be considered. This process is called
twoing.

Gini Index (IBM Intelligent Miner)
If a data set T contains examples from n
classes, the gini index gini(T) is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.
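A minimal sketch of the gini index in Python follows; the function name `gini` and the example label lists are illustrative assumptions.

```python
from collections import Counter

def gini(labels) -> float:
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

mixed = ["yes"] * 9 + ["no"] * 5   # 9 'yes' and 5 'no' samples
pure = ["yes"] * 4                 # a single class

print(round(gini(mixed), 3))
print(gini(pure))
```

Like entropy, the index is 0 for a pure set and grows as the classes become more evenly mixed.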

Extracting Classification Rules
from Trees
Represent the knowledge in the form of IF-THEN rules

One rule is created for each path from the root to a leaf

Each attribute-value pair along a path forms a conjunction

Extracting Classification Rules
from Trees
The leaf node holds the class prediction
Rules are easy for humans to understand
Example
IF age = “<=30” AND student = “no” THEN
buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN
buys_computer = “yes”
IF age = “31…40” THEN
buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent”
THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN
buys_computer = “yes”
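Rule extraction amounts to enumerating the root-to-leaf paths of the tree. The sketch below assumes the tree is stored as a nested dictionary (an illustrative representation, not something the slides prescribe), with the leaves under age “>40” set according to the training data, where excellent credit rating goes with “no” and fair with “yes”.

```python
# buys_computer tree as a nested dict: {attribute: {value: subtree or label}}.
TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(tree, conditions=()):
    """Return one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(tree, dict):  # leaf: holds the class prediction
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN buys_computer = "{tree}"']
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            # each attribute-value pair on the path joins the conjunction
            rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```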
BAYESIAN
CLASSIFICATION

Classification and
regression
What is classification? What is regression?
Issues regarding classification and
regression
Classification by decision tree induction
Bayesian Classification
Other Classification Methods
regression

What is Bayesian
Classification?
Bayesian classifiers are statistical
classifiers

For each new sample they provide a


probability that the sample belongs to a
class (for all classes)

What is Bayesian
Classification?
Example:

o Sample John (age = 27, income = high,
student = no, credit_rating = fair)

o P(buys_computer = yes | John) = 20%

o P(buys_computer = no | John) = 80%

Naive Bayesian Classifier
Example: play tennis?
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N

Naive Bayesian Classifier
Example
Class P (9 samples):
Outlook Temperature Humidity Windy Class
overcast hot high false P
rain mild high false P
rain cool normal false P
overcast cool normal true P
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P

Class N (5 samples):
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
rain cool normal true N
sunny mild high false N
rain mild high true N
Naive Bayesian Classifier
Example
Given the training set, we compute the
probabilities:

Outlook      P    N
sunny        2/9  3/5
overcast     4/9  0
rain         3/9  2/5

Temperature  P    N
hot          2/9  2/5
mild         4/9  2/5
cool         3/9  1/5

Humidity     P    N
high         3/9  4/5
normal       6/9  1/5

Windy        P    N
true         3/9  3/5
false        6/9  2/5

We also have the class probabilities
P(P) = 9/14 and P(N) = 5/14
Naive Bayesian Classifier
We use the notation P(A) for the probability of an
event A, and P(A|B) for the probability of A
conditional on another event B.
If H is the hypothesis and E is the evidence (the
combination of attribute values), then

$$p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)}$$

• Example: Let H be ‘yes’ and let E be the combination
of attribute values for a new day: outlook = sunny,
temperature = cool, humidity = high, windy = false. Call these
four pieces E1, E2, E3 and E4. If they are independent
given H, then

$$p(H \mid E) = \frac{p(E_1 \mid H)\, p(E_2 \mid H)\, p(E_3 \mid H)\, p(E_4 \mid H)\, p(H)}{p(E)}$$
Naive Bayesian Classifier
The denominator p(E) is the same for every class, so it can be
dropped and recovered in a final normalizing step that makes the
probabilities of the different classes sum to 1. Thus

$$p(H \mid E) \propto p(E_1 \mid H)\, p(E_2 \mid H)\, p(E_3 \mid H)\, p(E_4 \mid H)\, p(H)$$

Naive Bayesian Classifier
Example
To classify a new day E:
outlook = sunny, temperature = cool
humidity = high, windy = false

Prob(P|E) ∝ Prob(P) * Prob(sunny|P) * Prob(cool|P)
* Prob(high|P) * Prob(false|P)
= 9/14 * 2/9 * 3/9 * 3/9 * 6/9 = 0.0106
Prob(N|E) ∝ Prob(N) * Prob(sunny|N) *
Prob(cool|N) * Prob(high|N) *
Prob(false|N)
= 5/14 * 3/5 * 1/5 * 4/5 * 2/5 = 0.0137
Naive Bayesian Classifier
Example
Probability of ‘Playing’
= 0.0106 / (0.0106 + 0.0137) ≈ 44 %

Probability of ‘Not Playing’
= 0.0137 / (0.0106 + 0.0137) ≈ 56 %

Therefore E takes class label N
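The calculation above can be reproduced with exact fractions. The sketch below is a bare-bones naive Bayesian classifier for the play-tennis table; the names `DATA`, `score` and `posterior` are assumptions, and no smoothing is applied, so an attribute value never seen with a class (such as overcast with N) drives that class's score to zero.

```python
from fractions import Fraction

# Play-tennis training set: outlook, temperature, humidity, windy, class.
DATA = [tuple(line.split()) for line in """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
""".strip().splitlines()]

def score(sample, label):
    """P(class) times the product of P(attribute value | class)."""
    rows = [r for r in DATA if r[-1] == label]
    result = Fraction(len(rows), len(DATA))          # prior P(class)
    for i, value in enumerate(sample):
        matches = sum(1 for r in rows if r[i] == value)
        result *= Fraction(matches, len(rows))       # P(value | class)
    return result

def posterior(sample):
    """Normalize the two scores so that they sum to 1."""
    scores = {label: score(sample, label) for label in ("P", "N")}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

E = ("sunny", "cool", "high", "false")
post = posterior(E)
print({label: round(float(p), 3) for label, p in post.items()})
print("class label:", max(post, key=post.get))
```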
Naive Bayesian Classifier
Example
Second example: X = <rain, hot, high, false>

P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p)
= 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n)
= 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Sample X is classified in class N (don’t play)
Naive Bayesian Classifier
Example
Probability of ‘Playing’
= 0.010582 / (0.010582 + 0.018286) = 37 %

Probability of ‘Not Playing’
= 0.018286 / (0.010582 + 0.018286) = 63 %

Therefore X takes class label N
REGRESSION

What Is regression?
Regression is similar to classification
o First, construct a model
o Second, use the model to predict an unknown value
o The major method for prediction is regression
  • Linear and multiple regression
  • Non-linear regression
Regression is different from classification
o Classification predicts a categorical
class label
o Regression models continuous-valued
functions
Predictive Modeling in
Databases
Predictive modeling: Predict data values or
construct generalized linear models based
on the database data.
One can only predict value ranges or
category distributions
Determine the major factors which influence
the regression
o Data relevance analysis: uncertainty
measurement, entropy analysis, expert
judgement, etc.
Regression Analysis and Log-Linear
Models in Regression
Linear regression: Y = α + β X

The two parameters, α and β, specify the line and are
estimated from the data at hand by applying the least
squares criterion to the known values
(x1, y1), (x2, y2), ..., (xs, ys):

$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}, \qquad \alpha = \bar{y} - \beta \bar{x}$$
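The least squares estimates can be computed directly from these formulas. The sketch below is illustrative: the function name `fit_line` and the sample points are assumptions (the points are made up to lie exactly on Y = 1 + 2X, so the recovered parameters are easy to check by hand).

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for Y = alpha + beta * X."""
    s = len(xs)
    x_bar = sum(xs) / s
    y_bar = sum(ys) / s
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx               # slope
    alpha = y_bar - beta * x_bar   # intercept
    return alpha, beta

# Hypothetical data lying exactly on Y = 1 + 2X.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"alpha = {alpha}, beta = {beta}")
```

The sketch assumes at least two distinct x values; otherwise the denominator is zero and the slope is undefined.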

Regression Analysis and Log-Linear
Models in Regression
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into
the above.
E.g., Y = b0 + b1 X + b2 X^2 + b3 X^3 with X1 = X, X2 = X^2, X3 = X^3

Log-linear models:
The multi-way table of joint probabilities is
approximated by a product of lower-order tables.
Probability: $$p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}$$

T H A N K S !
