
What is Classification

 Classification is the task of assigning objects to one of several predefined categories.

 Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
Classification

[Figure: Attribute set (x) → Classification Model → Class label (y)]

Classification as the task of mapping an input attribute set x into its class label y.

 Classification model is useful for:


⚫ Descriptive Modeling
⚫ Predictive Modeling
Classification Examples
 Teachers classify students’ grades as A, B, C, D, or F.

 Identify mushrooms as poisonous or edible.

 Predict when a river will flood.

 Identify individuals with credit risks.

 Speech recognition

 Pattern recognition
Classification Ex: Grading

 If x >= 90 then grade = A.
 If 80 <= x < 90 then grade = B.
 If 70 <= x < 80 then grade = C.
 If 60 <= x < 70 then grade = D.
 If x < 60 then grade = F.

[Figure: the same rules drawn as a decision tree, splitting on x at 90, 80, 70, and 60.]

Classify the following marks: 78, 56, 99
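The grading rules above are just a small decision procedure. A minimal Python sketch (the function name and the printed format are only illustrative) that classifies the three marks:

def grade(x):
    # Walk the grading thresholds from highest to lowest, exactly as in the rules above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

for mark in (78, 56, 99):
    print(mark, grade(mark))   # 78 -> C, 56 -> F, 99 -> A
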
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
⚫ Statistical Based
• Bayesian Classification
⚫ Distance Based
• KNN
⚫ Decision Tree Based
• ID3
⚫ Neural Network Based
⚫ Rule Based
General approach to Classification

 Two step process:


⚫ Learning step
• Where a classification algorithm builds the classifier
by analyzing or “learning from” a training set made
up of database tuples and their associated class
labels.
⚫ Classification step
• The model is used to predict class labels for given
data.
 Classes must be predefined
 Most common techniques use DTs, NNs, or are
based on distances or statistical methods.
Model Construction

Training data is fed to a classification algorithm, which outputs the classifier (model).

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Use the Model in Prediction

The classifier is first evaluated on testing data and then applied to unseen data.

Unseen data: (Jeff, Professor, 4) → Tenured?

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Defining Classes
Issues in Classification

 Missing Data
⚫ Ignore missing value
⚫ Replace with assumed value

 Measuring Performance
⚫ Classification accuracy on test data
⚫ Confusion matrix
• provides the information needed to determine how
well a classification model performs
Confusion Matrix

                        Predicted Class
                        Class = 1    Class = 0
Actual    Class = 1     f11          f10
Class     Class = 0     f01          f00
• Each entry fij in this table denotes the number of records from
class i predicted to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1.
• The total number of correct predictions: (f11+ f00)
• The total number of incorrect predictions: (f01 + f10)
Classification Performance

 Definition of the Terms:


⚫ Positive (P) : Observation is positive (for example: is an apple).
⚫ Negative (N) : Observation is not positive (for example: is not an apple).
⚫ True Positive (TP) : Observation is positive, and is predicted to be
positive.
⚫ False Negative (FN) : Observation is positive, but is predicted negative.
⚫ True Negative (TN) : Observation is negative, and is predicted to be
negative.
⚫ False Positive (FP) : Observation is negative, but is predicted positive.
Class Statistics Measures

 Accuracy: Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
 Error Rate: Overall, how often is it wrong?
(FP+FN)/(TP+TN+FP+FN)
equivalent to 1 minus Accuracy
 Specificity: measures how well the negative class is recognized (the true negative rate)
TN/(FP+TN)
 Sensitivity/Recall: the ratio of correctly classified positive examples to the total number of positive examples.
TP/(TP+FN)
High Recall indicates the class is correctly recognized.
Class Statistics Measures
 Precision: is a measure of how accurate a model’s positive predictions are.
TP/(TP+FP)
⚫ High Precision indicates an example labeled as positive is indeed positive.

 F-measure: The F measure (F1 score or F score) is used to evaluate the overall performance of a classification model and is defined as the weighted harmonic mean of the precision and recall of the test.

F Score = 2 * (Precision * Recall) / (Precision + Recall)

 High recall, low precision: most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
 Low recall, high precision: we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
Example to interpret Confusion Matrix

(Confusion matrix counts used in this example: TP = 100, FN = 5, FP = 10, TN = 50.)

Classification Rate/Accuracy: (TP + TN) / (TP + TN + FP + FN) =

(100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91


Example to interpret Confusion Matrix

Recall = TP / (TP + FN)

= 100 / (100 + 5) = 0.95

Precision = TP / (TP + FP)

= 100 / (100 + 10) = 0.91

F-measure = (2 * Recall * Precision) / (Recall + Precision)

= (2 * 0.95 * 0.91) / (0.91 + 0.95) ≈ 0.93
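A minimal Python sketch that recomputes these metrics from the four confusion-matrix counts (the variable names are only illustrative):

# Counts from the worked example: TP = 100, FN = 5, FP = 10, TN = 50
TP, FN, FP, TN = 100, 5, 10, 50

accuracy    = (TP + TN) / (TP + TN + FP + FN)                  # ~0.91
error_rate  = (FP + FN) / (TP + TN + FP + FN)                  # ~0.09
recall      = TP / (TP + FN)                                   # ~0.95
precision   = TP / (TP + FP)                                   # ~0.91
specificity = TN / (TN + FP)                                   # ~0.83
f_measure   = 2 * recall * precision / (recall + precision)    # ~0.93

print(accuracy, error_rate, recall, precision, specificity, f_measure)
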


Example
 We have a total of 20 cats and dogs and our model
predicts whether it is a cat or not.
 Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,
‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,
‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]

 Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,


‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’,
‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
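Assuming scikit-learn is available, a short sketch that builds the confusion matrix for these two lists, with ‘cat’ treated as the positive class; the commented values are what the counts from the lists above should work out to:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']

# Rows = actual, columns = predicted, ordered as ['cat', 'dog'].
print(confusion_matrix(actual, predicted, labels=['cat', 'dog']))
print(accuracy_score(actual, predicted))                     # 17/20 = 0.85
print(precision_score(actual, predicted, pos_label='cat'))   # 6/8  = 0.75
print(recall_score(actual, predicted, pos_label='cat'))      # 6/7  ≈ 0.86
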
Example
Accuracy
Precision
 Ex 1: In spam detection we need to focus on precision.

 Suppose a mail is not spam, but the model predicts it as spam: that is a FP (False Positive). We always try to reduce FP.

 Ex 2: Precision is also important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall

 Ex 1: Suppose we predict whether a person has cancer or not. The person is suffering from cancer, but the model predicts them as not suffering from cancer: that is a FN (False Negative).

 Ex 2: Recall is important in medical cases, where it does not matter if we raise a false alarm, but the actual positive cases should not go undetected!
Confusion Matrix for Multi-class Classification

 For a 5-class problem with classes A, B, C, D, E, the confusion matrix is a 5 x 5 table: each row is an actual class and each column a predicted class.

For more detail:
https://www.youtube.com/watch?v=FAr2GmWNbT0

Example
Height Example Data
Name Gender Height Output1(Correct) Output2(Actual Assignment)
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Confusion Matrix Example

 Using height data example with Output1 correct and


Output2 actual assignment.
 Best solution will have only zeroes outside the diagonal.

Actual        Predicted Assignment
Membership    Short   Medium   Tall
Short         0       4        0
Medium        0       5        3
Tall          0       1        2

[A good classifier's output would have non-zero counts only on the diagonal.]
When to use Accuracy / Precision /
Recall / F1-Score?

 Accuracy is used when the True Positives and True Negatives are
more important. Accuracy is a better metric for Balanced Data.

 Whenever False Positive is much more important use Precision.

 Whenever False Negative is much more important use Recall.

 F1-Score is used when the False Negatives and False Positives


are important. F1-Score is a better metric for Imbalanced Data.
Statistical Based Algorithms -
Bayesian Classification
 Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities such as the
probability that a given tuple belongs to a particular
class.
 Based on Bayes rule of conditional probability.
 Assumes that the contributions of all attributes are independent and that each of them contributes equally (hence the name naïve)
 Classification is made by combining the impact that the
different attributes have on the prediction to be made.
Bayes Theorem
 Bayes’ Theorem is a way of finding a probability when we know
certain other probabilities.
 The formula is:

P(c|x) = P(x|c) * P(c) / P(x)

•P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
•P(c) is the prior probability of class.
•P(x|c) is the likelihood which is the probability of the predictor given class.
•P(x) is the prior probability of the predictor.
Bayes Theorem
 Data tuple (A): 35-year-old customer with an income of $40,000
 Hypothesis (B): customer will buy a computer
 Likelihood: P(A|B),
⚫ P(A|B) is the likelihood. It represents the probability of observing the data
(35-year-old with an income of $40,000) given that the hypothesis
(customer will buy a computer) is true.
 Prior Probability: P(A),
⚫ is the prior probability of a customer being a 35-year-old with an income
of $40,000.
 Posterior Probability: P(B|A),
⚫ probability of the customer buying a computer given their age and income.

 Prior Probability: P(B),


⚫ probability of the customer buying a computer (regardless of age and
income).
Naïve Bayes Classifier

 Naive Bayes is a kind of classifier which uses the


Bayes Theorem.
 It predicts membership probabilities for each class
such as the probability that given record or data point
belongs to a particular class.
 The class with the highest probability is considered as
the most likely class.
 This is also known as Maximum A Posteriori (MAP).
 Naive Bayes classifier assumes that all the features
are unrelated to each other.
Naïve Bayes Classifier
 In real datasets, we test a class y against a whole vector of attribute values X = (x1, x2, …, xn).

 By substituting for X and expanding using the chain rule (with the naïve independence assumption):

P(y | x1, …, xn) = P(x1|y) P(x2|y) … P(xn|y) P(y) / [P(x1) P(x2) … P(xn)]

 For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:

P(y | x1, …, xn) ∝ P(y) * Π P(xi|y)   (product over i = 1 … n)

 For multivariate classification, the predicted class is the one that maximizes this quantity:

y = argmax over y of P(y) * Π P(xi|y)

Weather dataset

(The weather attributes are the independent variables; the target "play" is the dependent variable.)

The posterior probability can be calculated by first constructing a frequency table for each attribute against the target. Then the frequency tables are transformed into likelihood tables, and finally the Naive Bayesian equation is used to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Test data

[Figure: the posterior for a test instance is computed from the likelihood-table values.]

Play-Tennis example

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

An unseen sample: X = <rain, hot, high, false>

P(X|p) and P(X|n): conditional probabilities. Posterior scores:
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
Play-Tennis example

From the 14 training tuples above (9 labelled P, 5 labelled N), the prior probabilities are:

P(p) = 9/14
P(n) = 5/14

The derived (conditional) probabilities P(attribute value | class) are obtained the same way, by counting within each class.
Play-Tennis example: classifying X

 An unseen sample X = <rain, hot, high, false>


 P(X|p) and P(X|n) : Conditional Probabilities

 Posterior Probabilities ,
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

 Sample X is classified in class n (don’t play)
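A small, self-contained Python sketch of this calculation (simple frequency counts with no smoothing; the names are only illustrative) that reproduces the two scores for X = <rain, hot, high, false>:

from collections import Counter

# Play-Tennis training data from the slides: (outlook, temperature, humidity, windy, class)
data = [
    ('sunny','hot','high','false','N'), ('sunny','hot','high','true','N'),
    ('overcast','hot','high','false','P'), ('rain','mild','high','false','P'),
    ('rain','cool','normal','false','P'), ('rain','cool','normal','true','N'),
    ('overcast','cool','normal','true','P'), ('sunny','mild','high','false','N'),
    ('sunny','cool','normal','false','P'), ('rain','mild','normal','false','P'),
    ('sunny','mild','normal','true','P'), ('overcast','mild','high','true','P'),
    ('overcast','hot','normal','false','P'), ('rain','mild','high','true','N'),
]

class_counts = Counter(row[-1] for row in data)     # {'P': 9, 'N': 5}

def likelihood(value, attr_index, label):
    # P(attribute value | class), estimated by frequency counts (no smoothing).
    rows = [r for r in data if r[-1] == label]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

def score(x, label):
    # P(X | class) * P(class), proportional to the posterior P(class | X).
    s = class_counts[label] / len(data)
    for i, value in enumerate(x):
        s *= likelihood(value, i, label)
    return s

x = ('rain', 'hot', 'high', 'false')
for label in ('P', 'N'):
    print(label, round(score(x, label), 6))   # P: 0.010582, N: 0.018286 -> classify as N
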


Naïve Bayes Classifier: Training Dataset

• Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

• Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example

 P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

 Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
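Assuming scikit-learn is available, the same example can be run with its categorical naive Bayes estimator. Note that the library's default Laplace smoothing (alpha = 1.0) would shift the numbers slightly, so alpha is set very small here to mirror the unsmoothed hand calculation. A sketch:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# buys_computer training data from the slides: age, income, student, credit_rating
X = [['<=30','high','no','fair'],        ['<=30','high','no','excellent'],
     ['31…40','high','no','fair'],       ['>40','medium','no','fair'],
     ['>40','low','yes','fair'],         ['>40','low','yes','excellent'],
     ['31…40','low','yes','excellent'],  ['<=30','medium','no','fair'],
     ['<=30','low','yes','fair'],        ['>40','medium','yes','fair'],
     ['<=30','medium','yes','excellent'],['31…40','medium','no','excellent'],
     ['31…40','high','yes','fair'],      ['>40','medium','no','excellent']]
y = ['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']

encoder = OrdinalEncoder()                      # map category strings to integer codes
X_enc = encoder.fit_transform(X)

clf = CategoricalNB(alpha=1e-9).fit(X_enc, y)   # tiny alpha, effectively no smoothing

x_new = encoder.transform([['<=30', 'medium', 'yes', 'fair']])
print(clf.predict(x_new))         # expected: ['yes']
print(clf.predict_proba(x_new))   # roughly [0.20, 0.80] for classes ['no', 'yes']
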
Bayes Theorem Example
Color Type Origin Stolen
Red Sports Domestic Y
Red Sports Domestic N
Red Sports Domestic Y
Yellow Sports Domestic N
Yellow Sports Imported Y
Yellow SUV Imported N
Yellow SUV Imported Y
Yellow SUV Domestic N
Red SUV Imported N
Red Sports Imported Y
Bayes Example (cont’d)

Classify t = <Color = Red, Type = SUV, Origin = Domestic>

P(Yes) = 5/10, P(No) = 5/10
P(Red|Yes) = 3/5, P(SUV|Yes) = 1/5, P(Domestic|Yes) = 2/5
P(Red|No) = 2/5, P(SUV|No) = 3/5, P(Domestic|No) = 3/5

P(X|Yes) * P(Yes) = 3/5 * 1/5 * 2/5 * 1/2 = 0.024
P(X|No) * P(No) = 2/5 * 3/5 * 3/5 * 1/2 = 0.072

As 0.072 > 0.024, the new tuple is classified as No.
Example: Height Classification

Classify tuple:
t = (Adam, M, 1.95 m)

Example: Height Classification

There are four tuples classified as short, eight as medium, and three as tall.

P(short) = 4/15 = 0.267
P(medium) = 8/15 = 0.533
P(tall) = 3/15 = 0.2

To facilitate classification, we divide the height attribute into six ranges:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Example: Height Classification

Probabilities associated with the attributes are estimated by counting within each class. For the test tuple we will need, among others:
P(M | short) = 1/4, P(M | medium) = 2/8, P(M | tall) = 3/3
P((1.9, 2.0] | short) = 0, P((1.9, 2.0] | medium) = 1/8, P((1.9, 2.0] | tall) = 1/3

Example: Height Classification

 To classify t = (Adam, M, 1.95 m)

 By using the values and associated probabilities of gender and height, we obtain the following estimates:

P(t|short) = 1/4 * 0 = 0
P(t|medium) = 2/8 * 1/8 = 0.031
P(t|tall) = 3/3 * 1/3 = 0.333

Prior probabilities: P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, P(tall) = 3/15 = 0.2

Combining these, we get:
Likelihood of being short = 0 * 0.267 = 0
Likelihood of being medium = 0.031 * 0.533 = 0.0166
Likelihood of being tall = 0.333 * 0.2 = 0.066

 We estimate P(t) by summing up the individual likelihood values:

P(t) = 0 + 0.0166 + 0.066 = 0.0826
Example: Height Classification

 Finally, we obtain the actual probabilities of each event (Using Bayes Theorem):

 P(short | t) = P(t | short) x P(short) / P(t) = 0*0.267/0.0826 = 0

 P(medium | t) = P(t | medium) x P(medium) / P(t) = 0.031*0.533/0.0826 = 0.2

 P(tall | t) = P(t | tall) x P(tall) / P(t) = 0.333*0.2/0.0826 = 0.799

 Therefore, based on these probabilities, we classify the new tuple as tall because
it has the highest probability.
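A small Python sketch of this calculation, binning the continuous height attribute into the six ranges above and using unsmoothed frequency counts (the helper names are only illustrative):

import bisect
from collections import Counter

# Height training data (gender, height, Output1 label) from the earlier table.
data = [
    ('F', 1.60, 'short'),  ('M', 2.00, 'tall'),   ('F', 1.90, 'medium'), ('F', 1.88, 'medium'),
    ('F', 1.70, 'short'),  ('M', 1.85, 'medium'), ('F', 1.60, 'short'),  ('M', 1.70, 'short'),
    ('M', 2.20, 'tall'),   ('M', 2.10, 'tall'),   ('F', 1.80, 'medium'), ('M', 1.95, 'medium'),
    ('F', 1.90, 'medium'), ('F', 1.80, 'medium'), ('F', 1.75, 'medium'),
]

edges = [1.6, 1.7, 1.8, 1.9, 2.0]           # the ranges (0,1.6], (1.6,1.7], ..., (2.0, inf)
def height_bin(h):
    return bisect.bisect_left(edges, h)      # index 0..5 of the range h falls into

classes = Counter(label for _, _, label in data)   # short: 4, medium: 8, tall: 3

def score(gender, height, label):
    rows = [(g, h) for g, h, c in data if c == label]
    p_gender = sum(1 for g, _ in rows if g == gender) / len(rows)
    p_height = sum(1 for _, h in rows if height_bin(h) == height_bin(height)) / len(rows)
    return (classes[label] / len(data)) * p_gender * p_height

scores = {c: score('M', 1.95, c) for c in classes}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))   # short 0.0, medium ~0.2, tall ~0.8 -> classify as tall
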
Advantages of naïve bayes

 It is easy to use.
 Unlike other classification approaches, only one
scan of the training data is required.
 The naive Bayes approach can easily handle missing
values by simply omitting that probability when
calculating the likelihoods of membership in each
class.
 In cases where there are simple relationships, the
technique often does yield good results.
Disadvantages of naïve bayes

 Although the naive Bayes approach is straightforward


to use, it does not always yield satisfactory results.
 First, the attributes usually are not independent. We
could use a subset of the attributes by ignoring any
that are dependent on others.
 The technique does not handle continuous data.
 Dividing the continuous values into ranges could be
used to solve this problem, but the division of the
domain into ranges is not an easy task, and how this is
done can certainly impact the results.
Decision Tree based Algorithms

 In Decision tree approach, a tree is constructed to


model the classification process.
 Once the tree is built, it is applied to each tuple in the
database and results in a classification for that tuple.
 There are two basic steps in the technique:
⚫ building the tree
⚫ and applying the tree to the database.

Most research has focused on how to build effective trees


as the application process is straightforward.
Decision Tree

Training Data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes):
Root: Home Owner? (Yes → NO; No → MarSt?)
MarSt? (Married → NO; Single or Divorced → Income?)
Income? (< 80K → NO; > 80K → YES)
Another Example of Decision Tree

The same training data as above also fits this tree:
Root: MarSt? (Married → NO; Single or Divorced → Home Owner?)
Home Owner? (Yes → NO; No → Income?)
Income? (< 80K → NO; > 80K → YES)

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree and follow the branches that match the record:
Home Owner = No → go to MarSt; MarSt = Married → leaf NO.
The model therefore predicts Defaulted Borrower = No.
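A minimal Python sketch of this traversal, written as the nested conditionals that the tree above encodes (the function and argument names are only illustrative):

def defaulted(home_owner, marital_status, annual_income):
    # Follow the decision tree: Home Owner -> MarSt -> Annual Income.
    if home_owner == 'Yes':
        return 'No'
    if marital_status == 'Married':
        return 'No'
    # Single or Divorced: split on Annual Income at 80K
    return 'Yes' if annual_income > 80_000 else 'No'

# Test record from the slide: Home Owner = No, Married, Annual Income = 80K
print(defaulted('No', 'Married', 80_000))   # -> 'No'
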
Parts of a Decision Tree
Decision Tree
Given:
⚫ D = {t1, …, tn} where ti=<ti1, …, tih>
⚫ Attributes {A1, A2, …, Ah}
⚫ Classes C={C1, …., Cm}
Decision or Classification Tree is a tree associated with
D such that
⚫ Each internal node is labeled with attribute, Ai
⚫ Each arc is labeled with predicate which can be
applied to attribute at parent
⚫ Each leaf node is labeled with a class, Cj
Decision Tree based Algorithms

 Solving the classification problem using decision trees


is a two-step process:
⚫ Decision tree induction: Construct a DT using training
data.
⚫ For each ti ∈ D, apply the DT to determine its class.

 DT approaches differ in how the tree is built.

 Algorithms: ID3, C4.5, CART


DT Induction
DT Induction
 The recursive algorithm builds the tree in a top-down fashion.
 Using the initial training data, the "best" splitting attribute is
chosen first. [Algorithms differ in how they determine the "best
attribute" and its "best predicates" to use for splitting. ]
 Once this has been determined, the node and its arcs are
created and added to the created tree.
 The algorithm continues recursively by adding new subtrees to
each branching arc.
 The algorithm terminates when some "stopping criteria" is
reached. [Again, each algorithm determines when to stop the tree
differently. One simple approach would be to stop when the tuples in the
reduced training set all belong to the same class. This class is then used to
label the leaf node created.]
DT Induction

 Splitting attributes: Attributes in the database


schema that will be used to label nodes in the tree
and around which the divisions will take place.

 Splitting predicates: The predicates by which


the arcs in the tree are labeled.
DT Issues

 Choosing Splitting Attributes


 Ordering of Splitting Attributes
 Splits
 Tree Structure
 Stopping Criteria
 Training Data
 Pruning
DT Issues
 Choosing Splitting Attributes
Name Gender Height Output1(Correct) Output2(Actual Assignment)
Kristina F 1.6m medium Medium
Jim M 2m Tall Short
Maggie F 1.9m Medium Short
Martha F 1.88m Short medium
Stephanie F 1.7m Medium Tall
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Short
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Short
Debbie F 1.8m Tall Medium
Todd M 1.95m Medium Tall
Kim F 1.9m Short Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Short
DT Issues
 Ordering of Splitting Attributes
 The order in which the attributes are chosen is also
important.
DT Issues
 Splits
⚫ With some attributes, the domain is small, so the number of
splits is obvious based on the domain (as with the gender
attribute).
⚫ However, if the domain is continuous or has a large number
of values, the number of splits to use is not easily
determined.

[Figure: (i) a binary split on “Annual Income > 80K?” (Yes / No); (ii) a multi-way split of Annual Income into the ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.]


DT Issues
 Tree Structure
⚫ a balanced tree with the fewest levels is desirable.
⚫ However, in this case, more complicated comparisons
with multiway branching may be needed.
⚫ Some algorithms build only binary trees.
DT Issues
 Stopping Criteria
⚫ when the training data are perfectly classified.
⚫ when stopping earlier would be desirable to prevent the
creation of larger trees. This is a trade-off between accuracy of
classification and performance.
 Training Data
⚫ The structure of the DT created depends on the training data.
⚫ If the training data set is too small, then the generated tree
might not be specific enough to work properly with the more
general data.
⚫ If the training data set is too large, then the created tree may
overfit.
DT Issues

 Pruning
⚫ Once a tree is constructed, some modifications to the tree
might be needed to improve the performance of the tree
during the classification phase.
⚫ The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Comparing Decision Trees
ID3
 ID3 stands for Iterative Dichotomiser 3
 Creates the tree using information theory concepts and tries to reduce the expected number of comparisons.
 ID3 chooses split attribute with the highest information
gain:
 Information gain=(Entropy of distribution before the
split)–(entropy of distribution after it)
Entropy
 Entropy
⚫ Is used to measure the amount of uncertainty, surprise, or randomness in a set of data.
⚫ When all data belongs to a single class, entropy is zero, as there is no uncertainty.
⚫ An equally divided sample has an entropy of 1.
⚫ The mathematical formula for entropy is:

E(S) = - Σ pi log2(pi)   (sum over the classes i)

where pi is simply the frequentist probability of an element/class i in our data.
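A short Python sketch of this formula (the helper name is only illustrative), checked against the cases mentioned above:

from math import log2
from collections import Counter

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the class proportions in the sample.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

print(entropy(['yes'] * 8))               # 0.0   -> a pure node, no uncertainty
print(entropy(['yes'] * 4 + ['no'] * 4))  # 1.0   -> an equally divided sample
print(entropy(['yes'] * 9 + ['no'] * 5))  # 0.940 -> the 9-positive / 5-negative set used later
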
How do Decision Trees use Entropy?

 Entropy basically measures the impurity of a node.


 Impurity is the degree of randomness; it tells how random our data is.
 A pure sub-split means that either you should be getting “yes”, or you
should be getting “no”.

(In the illustrating figure, the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater share of “yes” examples and it is easy to decide there.)
Information Gain

 The goal of machine learning is to decrease the uncertainty


or impurity in the dataset, here by using the entropy we are
getting the impurity of a particular node, we don’t know if
the parent entropy or the entropy of a particular node has
decreased or not.

 For this, we bring a new metric called “Information gain”


which tells us how much the parent entropy has decreased
after splitting it with some feature.
Information Gain

 Suppose our entire population has a total of 30 instances. The


dataset is to predict whether the person will go to the gym or not.
Let’s say 16 people go to the gym and 14 people don’t
 Two features to predict whether he/she will go to the gym or not.
⚫ Feature 1 is “Energy” which takes two values “high” and “low”
⚫ Feature 2 is “Motivation” which takes 3 values “No motivation”,
“Neutral” and “Highly motivated”.

Let’s see how our decision tree will be made using these 2 features.
We’ll use information gain to decide which feature should be the
root node and which feature should be placed after the split.
Information Gain

To get the weighted average entropy of the child nodes, each child's entropy is weighted by the fraction of the parent's examples it holds:

E(Parent | Energy) = Σ (number of examples in child / number of examples in parent) * E(child)

Now that we have the values of E(Parent) and E(Parent | Energy), the information gain is:

Information Gain = E(Parent) - E(Parent | Energy)

Our parent entropy was near 0.99, and after looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make “Energy” our root node.
Information Gain

The “Energy” feature gives a larger reduction (0.37) than the “Motivation” feature. Hence we select the feature with the highest information gain and then split the node based on that feature.
Information Gain

 Gain is defined as the difference between how


much information is needed to make a correct
classification before the split versus how much
information is needed after the split.
Information Gain Example
 Let S = 14 examples, 9 positive and 5 negative
 Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 The attribute is Wind; its values are Weak and Strong
 8 occurrences of weak winds; 6 occurrences of strong winds
 For the weak winds, 6 are positive and 2 are negative
 For the strong winds, 3 are positive and 3 are negative
 Gain(S, Wind) =
Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
 Entropy(Weak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
 Entropy(Strong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
 So… 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048
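A small Python sketch that reproduces this calculation from the counts (binary entropy from positive/negative counts; the names are only illustrative):

from math import log2

def entropy(pos, neg):
    # Binary entropy computed from positive/negative counts.
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

entropy_S      = entropy(9, 5)                 # 0.940
entropy_weak   = entropy(6, 2)                 # 0.811
entropy_strong = entropy(3, 3)                 # 1.000

gain = entropy_S - (8/14) * entropy_weak - (6/14) * entropy_strong
print(round(gain, 3))                          # ~0.048
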
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
ID3 Example (Output1)
 The beginning state of the training data in the table (with the Output1 classification) is that
(4/15) are short, (8/15) are medium, and (3/15) are tall.

 Starting state entropy (base-10 logarithms are used throughout this example):

4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

 Gain using gender:

⚫ Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
⚫ Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
⚫ Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
⚫ Gain: 0.4384 – 0.34152 = 0.09688
ID3 Example (Output1)
 Gain using height:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Entropy calculation:
There are 2 tuples in the first division with entropy (2/2(0) + 0 + 0) = 0,
2 in (1.6, 1.7] with entropy (2/2(0) + 0 + 0) = 0,
3 in (1.7, 1.8] with entropy (0 + 3/3(0) + 0) = 0,
4 in (1.8, 1.9] with entropy (0 + 4/4(0) + 0) = 0,
2 in (1.9, 2.0] with entropy (0 + 1/2(0.301) + 1/2(0.301)) = 0.301, and
2 in the last with entropy (0 + 0 + 2/2(0)) = 0.
All of these states are completely ordered and thus have an entropy of 0, except for the (1.9, 2.0] state.
 The gain in entropy by using the height attribute is
0.4384 – (2/15)(0.301) = 0.3983
 Choose height as the first splitting attribute
Advantages of ID3

 Understandable prediction rules are


created from the training data.
 Builds the fastest tree.

 Builds a short tree.

 Only need to test enough attributes until


all data is classified.
 Finding leaf nodes enables test data to be
pruned, reducing number of tests.
Disadvantages of ID3

 Data may be over-fitted or over classified,


if a small sample is tested.
 Only one attribute at a time is tested for
making a decision.
 Classifying continuous data may be
computationally expensive, as many trees
must be generated to see where to break
the continuum.
Example: Triangles and Squares

Data Set: a set of classified objects

#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle
Entropy
• 5 triangles
• 9 squares
• class probabilities: p(triangle) = 5/14, p(square) = 9/14
• entropy: E = - (5/14) log2(5/14) - (9/14) log2(9/14) = 0.940
Entropy reduction by data set partitioning

[Figure: splitting the data set on Color (red / yellow / green) partitions it into three subsets, each with lower entropy than the full set.]

Information Gain

[Figure: the information gain of Color is the entropy of the full set minus the weighted average entropy of the red, yellow, and green subsets.]
Information Gain of The
Attribute
 Attributes
⚫ Gain(Color) = 0.246
⚫ Gain(Outline) = 0.151
⚫ Gain(Dot) = 0.048
 Heuristics: attribute with the highest gain is chosen
 So, color is chosen as the root node
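A short Python sketch that recomputes these three gains directly from the data set above (attribute ordering follows the table; the helper names are only illustrative; the slide's values differ only by rounding):

from math import log2
from collections import Counter

# Triangles-and-squares data: (color, outline, dot, shape)
data = [
    ('green','dashed','no','triangle'),  ('green','dashed','yes','triangle'),
    ('yellow','dashed','no','square'),   ('red','dashed','no','square'),
    ('red','solid','no','square'),       ('red','solid','yes','triangle'),
    ('green','solid','no','square'),     ('green','dashed','no','triangle'),
    ('yellow','solid','yes','square'),   ('red','solid','no','square'),
    ('green','solid','yes','square'),    ('yellow','dashed','yes','square'),
    ('yellow','solid','no','square'),    ('red','dashed','yes','triangle'),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    # Information gain of splitting `rows` on the attribute at `attr_index`.
    weighted = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r for r in rows if r[attr_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - weighted

for index, name in enumerate(['Color', 'Outline', 'Dot']):
    print(name, round(gain(data, index), 3))   # Color ~0.247, Outline ~0.152, Dot ~0.048
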
Splitting the subsets

For the red branch (objects 4, 5, 6, 10, 14: 2 triangles, 3 squares; entropy 0.971):
Gain(Outline) = 0.971 – 0.951 = 0.020 bits
Gain(Dot) = 0.971 – 0 = 0.971 bits
→ the red branch is split on Dot.

For the green branch (objects 1, 2, 7, 8, 11: 3 triangles, 2 squares; entropy 0.971):
Gain(Outline): P(dashed) = 3/5, P(solid) = 2/5
I(dashed) = -(3/3)log2(3/3) – 0 = 0
I(solid) = -0 – (2/2)log2(2/2) = 0
I(Outline) = (3/5)(0) + (2/5)(0) = 0
Gain(Outline) = 0.971 – 0 = 0.971 bits
Gain(Dot): P(yes) = 2/5, P(no) = 3/5
I(yes) = -(1/2)log2(1/2) – (1/2)log2(1/2) = 1
I(no) = -(1/3)log2(1/3) – (2/3)log2(2/3) = 0.917
I(Dot) = (2/5)(1) + (3/5)(0.917) = 0.9502
Gain(Dot) = 0.971 – 0.9502 = 0.020 bits
→ the green branch is split on Outline.

The yellow branch contains only squares, so it becomes a leaf.
Decision Tree

The final tree:

Color = red → Dot? (yes → triangle, no → square)
Color = yellow → square
Color = green → Outline? (dashed → triangle, solid → square)
