
What is Classification

 Classification is the task of assigning objects to one of several predefined categories.

 Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
Classification

[Figure: Attribute set (x) → Classification Model → Class label (y)]

Classification as the task of mapping an input attribute set x into its class label y.

 Classification model is useful for:


⚫ Descriptive Modeling
⚫ Predictive Modeling
Classification Examples
 Teachers classify students’ grades as A, B, C, D, or F.

 Identify mushrooms as poisonous or edible.

 Predict when a river will flood.

 Identify individuals with credit risks.

 Speech recognition

 Pattern recognition
Classification Ex: Grading

 If x >= 90 then grade = A.
 If 80 <= x < 90 then grade = B.
 If 70 <= x < 80 then grade = C.
 If 60 <= x < 70 then grade = D.
 If x < 60 then grade = F.

[Figure: the same rules drawn as a decision tree, splitting on x at 90, 80, 70, and 60.]

Classify the following marks: 78, 56, 99
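The grading rules above are just a small decision procedure. A minimal Python sketch (the function name and the printed format are only illustrative) that classifies the three marks:

def grade(x):
    # Walk the grading thresholds from highest to lowest, exactly as in the rules above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

for mark in (78, 56, 99):
    print(mark, grade(mark))   # 78 -> C, 56 -> F, 99 -> A
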
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
⚫ Statistical Based
• Bayesian Classification
⚫ Distance Based
• KNN
⚫ Decision Tree Based
• ID3
⚫ Neural Network Based
⚫ Rule Based
General approach to Classification

 Two step process:


⚫ Learning step
• Where a classification algorithm builds the classifier
by analyzing or “learning from” a training set made
up of database tuples and their associated class
labels.
⚫ Classification step
• The model is used to predict class labels for given
data.
 Classes must be predefined
 Most common techniques use DTs, NNs, or are
based on distances or statistical methods.
Model Construction

Training data is fed to a classification algorithm, which outputs the classifier (model).

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Use the Model in Prediction

The classifier is first evaluated on testing data and then applied to unseen data.

Unseen data: (Jeff, Professor, 4) → Tenured?

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Defining Classes
Issues in Classification

 Missing Data
⚫ Ignore missing value
⚫ Replace with assumed value

 Measuring Performance
⚫ Classification accuracy on test data
⚫ Confusion matrix
• provides the information needed to determine how
well a classification model performs
Confusion Matrix

                        Predicted Class
                        Class = 1    Class = 0
Actual    Class = 1     f11          f10
Class     Class = 0     f01          f00
• Each entry fij in this table denotes the number of records from
class i predicted to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1.
• The total number of correct predictions: (f11+ f00)
• The total number of incorrect predictions: (f01 + f10)
Classification Performance

 Definition of the Terms:


⚫ Positive (P) : Observation is positive (for example: is an apple).
⚫ Negative (N) : Observation is not positive (for example: is not an apple).
⚫ True Positive (TP) : Observation is positive, and is predicted to be
positive.
⚫ False Negative (FN) : Observation is positive, but is predicted negative.
⚫ True Negative (TN) : Observation is negative, and is predicted to be
negative.
⚫ False Positive (FP) : Observation is negative, but is predicted positive.
Class Statistics Measures

 Accuracy: Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
 Error Rate: Overall, how often is it wrong?
(FP+FN)/(TP+TN+FP+FN)
equivalent to 1 minus Accuracy
 Specificity: measures how well the negative class is recognized (the true negative rate)
TN/(FP+TN)
 Sensitivity/Recall: the ratio of correctly classified positive examples to the total number of positive examples.
TP/(TP+FN)
High Recall indicates the class is correctly recognized.
Class Statistics Measures
 Precision: is a measure of how accurate a model’s positive predictions are.
TP/(TP+FP)
⚫ High Precision indicates an example labeled as positive is indeed positive.

 F-measure: The F measure (F1 score or F score) is used to evaluate the overall performance of a classification model and is defined as the weighted harmonic mean of the precision and recall of the test.

F Score = 2 * (Precision * Recall) / (Precision + Recall)

 High recall, low precision: most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
 Low recall, high precision: we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
Example to interpret Confusion Matrix

(Confusion matrix counts used in this example: TP = 100, FN = 5, FP = 10, TN = 50.)

Classification Rate/Accuracy: (TP + TN) / (TP + TN + FP + FN) =

(100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91


Example to interpret Confusion Matrix

Recall = TP / (TP + FN)

= 100 / (100 + 5) = 0.95

Precision = TP / (TP + FP)

= 100 / (100 + 10) = 0.91

F-measure = (2 * Recall * Precision) / (Recall + Precision)

= (2 * 0.95 * 0.91) / (0.91 + 0.95) ≈ 0.93
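A minimal Python sketch that recomputes these metrics from the four confusion-matrix counts (the variable names are only illustrative):

# Counts from the worked example: TP = 100, FN = 5, FP = 10, TN = 50
TP, FN, FP, TN = 100, 5, 10, 50

accuracy    = (TP + TN) / (TP + TN + FP + FN)                  # ~0.91
error_rate  = (FP + FN) / (TP + TN + FP + FN)                  # ~0.09
recall      = TP / (TP + FN)                                   # ~0.95
precision   = TP / (TP + FP)                                   # ~0.91
specificity = TN / (TN + FP)                                   # ~0.83
f_measure   = 2 * recall * precision / (recall + precision)    # ~0.93

print(accuracy, error_rate, recall, precision, specificity, f_measure)
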


Example
 We have a total of 20 cats and dogs and our model
predicts whether it is a cat or not.
 Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,
‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,
‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]

 Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,


‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’,
‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
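Assuming scikit-learn is available, a short sketch that builds the confusion matrix for these two lists, with ‘cat’ treated as the positive class; the commented values are what the counts from the lists above should work out to:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']

# Rows = actual, columns = predicted, ordered as ['cat', 'dog'].
print(confusion_matrix(actual, predicted, labels=['cat', 'dog']))
print(accuracy_score(actual, predicted))                     # 17/20 = 0.85
print(precision_score(actual, predicted, pos_label='cat'))   # 6/8  = 0.75
print(recall_score(actual, predicted, pos_label='cat'))      # 6/7  ≈ 0.86
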
Example
Accuracy
Precision
 Ex 1: In spam detection we need to focus on precision.

 Suppose a mail is not spam, but the model predicts it as spam: that is a FP (False Positive). We always try to reduce FP.

 Ex 2: Precision is also important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall

 Ex 1: Suppose we predict whether a person has cancer or not. The person is suffering from cancer, but the model predicts them as not suffering from cancer: that is a FN (False Negative).

 Ex 2: Recall is important in medical cases, where it does not matter if we raise a false alarm, but the actual positive cases should not go undetected!
Confusion Matrix for Multi-class Classification

 For a 5-class problem with classes A, B, C, D, E, the confusion matrix is a 5 x 5 table: each row is an actual class and each column a predicted class.

For more detail:
https://www.youtube.com/watch?v=FAr2GmWNbT0

Example
Height Example Data
Name Gender Height Output1(Correct) Output2(Actual Assignment)
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Confusion Matrix Example

 Using height data example with Output1 correct and


Output2 actual assignment.
 Best solution will have only zeroes outside the diagonal.

Actual        Predicted Assignment
Membership    Short   Medium   Tall
Short         0       4        0
Medium        0       5        3
Tall          0       1        2

[A good classifier's output would have non-zero counts only on the diagonal.]
When to use Accuracy / Precision /
Recall / F1-Score?

 Accuracy is used when the True Positives and True Negatives are
more important. Accuracy is a better metric for Balanced Data.

 Whenever False Positive is much more important use Precision.

 Whenever False Negative is much more important use Recall.

 F1-Score is used when the False Negatives and False Positives


are important. F1-Score is a better metric for Imbalanced Data.
Statistical Based Algorithms -
Bayesian Classification
 Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities such as the
probability that a given tuple belongs to a particular
class.
 Based on Bayes rule of conditional probability.
 Assumes that the contributions of all attributes are independent and that each of them contributes equally (hence the name naïve)
 Classification is made by combining the impact that the
different attributes have on the prediction to be made.
Bayes Theorem
 Bayes’ Theorem is a way of finding a probability when we know
certain other probabilities.
 The formula is:

P(c|x) = P(x|c) * P(c) / P(x)

•P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
•P(c) is the prior probability of class.
•P(x|c) is the likelihood which is the probability of the predictor given class.
•P(x) is the prior probability of the predictor.
Bayes Theorem
 Data tuple (A): 35-year-old customer with an income of $40,000
 Hypothesis (B): customer will buy a computer
 Likelihood: P(A|B),
⚫ P(A|B) is the likelihood. It represents the probability of observing the data
(35-year-old with an income of $40,000) given that the hypothesis
(customer will buy a computer) is true.
 Prior Probability: P(A),
⚫ is the prior probability of a customer being a 35-year-old with an income
of $40,000.
 Posterior Probability: P(B|A),
⚫ probability of the customer buying a computer given their age and income.

 Prior Probability: P(B),


⚫ probability of the customer buying a computer (regardless of age and
income).
Naïve Bayes Classifier

 Naive Bayes is a kind of classifier which uses the


Bayes Theorem.
 It predicts membership probabilities for each class
such as the probability that given record or data point
belongs to a particular class.
 The class with the highest probability is considered as
the most likely class.
 This is also known as Maximum A Posteriori (MAP).
 Naive Bayes classifier assumes that all the features
are unrelated to each other.
Naïve Bayes Classifier
 In real datasets, we test a class y against a whole vector of attribute values X = (x1, x2, …, xn).

 By substituting for X and expanding using the chain rule (with the naïve independence assumption):

P(y | x1, …, xn) = P(x1|y) P(x2|y) … P(xn|y) P(y) / [P(x1) P(x2) … P(xn)]

 For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:

P(y | x1, …, xn) ∝ P(y) * Π P(xi|y)   (product over i = 1 … n)

 For multivariate classification, the predicted class is the one that maximizes this quantity:

y = argmax over y of P(y) * Π P(xi|y)

Weather dataset

(The weather attributes are the independent variables; the target "play" is the dependent variable.)

The posterior probability can be calculated by first constructing a frequency table for each attribute against the target. Then the frequency tables are transformed into likelihood tables, and finally the Naive Bayesian equation is used to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Test data

[Figure: the posterior for a test instance is computed from the likelihood-table values.]

Play-Tennis example

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

An unseen sample: X = <rain, hot, high, false>

P(X|p) and P(X|n): conditional probabilities. Posterior scores:
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
Play-Tennis example

From the 14 training tuples above (9 labelled P, 5 labelled N), the prior probabilities are:

P(p) = 9/14
P(n) = 5/14

The derived (conditional) probabilities P(attribute value | class) are obtained the same way, by counting within each class.
Play-Tennis example: classifying X

 An unseen sample X = <rain, hot, high, false>


 P(X|p) and P(X|n) : Conditional Probabilities

 Posterior Probabilities ,
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

 Sample X is classified in class n (don’t play)
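A small, self-contained Python sketch of this calculation (simple frequency counts with no smoothing; the names are only illustrative) that reproduces the two scores for X = <rain, hot, high, false>:

from collections import Counter

# Play-Tennis training data from the slides: (outlook, temperature, humidity, windy, class)
data = [
    ('sunny','hot','high','false','N'), ('sunny','hot','high','true','N'),
    ('overcast','hot','high','false','P'), ('rain','mild','high','false','P'),
    ('rain','cool','normal','false','P'), ('rain','cool','normal','true','N'),
    ('overcast','cool','normal','true','P'), ('sunny','mild','high','false','N'),
    ('sunny','cool','normal','false','P'), ('rain','mild','normal','false','P'),
    ('sunny','mild','normal','true','P'), ('overcast','mild','high','true','P'),
    ('overcast','hot','normal','false','P'), ('rain','mild','high','true','N'),
]

class_counts = Counter(row[-1] for row in data)     # {'P': 9, 'N': 5}

def likelihood(value, attr_index, label):
    # P(attribute value | class), estimated by frequency counts (no smoothing).
    rows = [r for r in data if r[-1] == label]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

def score(x, label):
    # P(X | class) * P(class), proportional to the posterior P(class | X).
    s = class_counts[label] / len(data)
    for i, value in enumerate(x):
        s *= likelihood(value, i, label)
    return s

x = ('rain', 'hot', 'high', 'false')
for label in ('P', 'N'):
    print(label, round(score(x, label), 6))   # P: 0.010582, N: 0.018286 -> classify as N
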


Naïve Bayes Classifier: Training Dataset

• Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

• Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example

 P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

 Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
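Assuming scikit-learn is available, the same example can be run with its categorical naive Bayes estimator. Note that the library's default Laplace smoothing (alpha = 1.0) would shift the numbers slightly, so alpha is set very small here to mirror the unsmoothed hand calculation. A sketch:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# buys_computer training data from the slides: age, income, student, credit_rating
X = [['<=30','high','no','fair'],        ['<=30','high','no','excellent'],
     ['31…40','high','no','fair'],       ['>40','medium','no','fair'],
     ['>40','low','yes','fair'],         ['>40','low','yes','excellent'],
     ['31…40','low','yes','excellent'],  ['<=30','medium','no','fair'],
     ['<=30','low','yes','fair'],        ['>40','medium','yes','fair'],
     ['<=30','medium','yes','excellent'],['31…40','medium','no','excellent'],
     ['31…40','high','yes','fair'],      ['>40','medium','no','excellent']]
y = ['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']

encoder = OrdinalEncoder()                      # map category strings to integer codes
X_enc = encoder.fit_transform(X)

clf = CategoricalNB(alpha=1e-9).fit(X_enc, y)   # tiny alpha, effectively no smoothing

x_new = encoder.transform([['<=30', 'medium', 'yes', 'fair']])
print(clf.predict(x_new))         # expected: ['yes']
print(clf.predict_proba(x_new))   # roughly [0.20, 0.80] for classes ['no', 'yes']
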
Bayes Theorem Example
Color Type Origin Stolen
Red Sports Domestic Y
Red Sports Domestic N
Red Sports Domestic Y
Yellow Sports Domestic N
Yellow Sports Imported Y
Yellow SUV Imported N
Yellow SUV Imported Y
Yellow SUV Domestic N
Red SUV Imported N
Red Sports Imported Y
Bayes Example (cont’d)

Classify t = <Color = Red, Type = SUV, Origin = Domestic>

P(Yes) = 5/10, P(No) = 5/10
P(Red|Yes) = 3/5, P(SUV|Yes) = 1/5, P(Domestic|Yes) = 2/5
P(Red|No) = 2/5, P(SUV|No) = 3/5, P(Domestic|No) = 3/5

P(X|Yes) * P(Yes) = 3/5 * 1/5 * 2/5 * 1/2 = 0.024
P(X|No) * P(No) = 2/5 * 3/5 * 3/5 * 1/2 = 0.072

As 0.072 > 0.024, the new tuple is classified as No.
Example: Height Classification

Classify tuple:
t = (Adam, M, 1.95 m)

Example: Height Classification

There are four tuples classified as short, eight as medium, and three as tall.

P(short) = 4/15 = 0.267
P(medium) = 8/15 = 0.533
P(tall) = 3/15 = 0.2

To facilitate classification, we divide the height attribute into six ranges:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Example: Height Classification

Probabilities associated with the attributes are estimated by counting within each class. For the test tuple we will need, among others:
P(M | short) = 1/4, P(M | medium) = 2/8, P(M | tall) = 3/3
P((1.9, 2.0] | short) = 0, P((1.9, 2.0] | medium) = 1/8, P((1.9, 2.0] | tall) = 1/3

Example: Height Classification

 To classify t = (Adam, M, 1.95 m)

 By using the values and associated probabilities of gender and height, we obtain the following estimates:

P(t|short) = 1/4 * 0 = 0
P(t|medium) = 2/8 * 1/8 = 0.031
P(t|tall) = 3/3 * 1/3 = 0.333

Prior probabilities: P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, P(tall) = 3/15 = 0.2

Combining these, we get:
Likelihood of being short = 0 * 0.267 = 0
Likelihood of being medium = 0.031 * 0.533 = 0.0166
Likelihood of being tall = 0.333 * 0.2 = 0.066

 We estimate P(t) by summing up the individual likelihood values:

P(t) = 0 + 0.0166 + 0.066 = 0.0826
Example: Height Classification

 Finally, we obtain the actual probabilities of each event (Using Bayes Theorem):

 P(short | t) = P(t | short) x P(short) / P(t) = 0*0.267/0.0826 = 0

 P(medium | t) = P(t | medium) x P(medium) / P(t) = 0.031*0.533/0.0826 = 0.2

 P(tall | t) = P(t | tall) x P(tall) / P(t) = 0.333*0.2/0.0826 = 0.799

 Therefore, based on these probabilities, we classify the new tuple as tall because
it has the highest probability.
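A small Python sketch of this calculation, binning the continuous height attribute into the six ranges above and using unsmoothed frequency counts (the helper names are only illustrative):

import bisect
from collections import Counter

# Height training data (gender, height, Output1 label) from the earlier table.
data = [
    ('F', 1.60, 'short'),  ('M', 2.00, 'tall'),   ('F', 1.90, 'medium'), ('F', 1.88, 'medium'),
    ('F', 1.70, 'short'),  ('M', 1.85, 'medium'), ('F', 1.60, 'short'),  ('M', 1.70, 'short'),
    ('M', 2.20, 'tall'),   ('M', 2.10, 'tall'),   ('F', 1.80, 'medium'), ('M', 1.95, 'medium'),
    ('F', 1.90, 'medium'), ('F', 1.80, 'medium'), ('F', 1.75, 'medium'),
]

edges = [1.6, 1.7, 1.8, 1.9, 2.0]           # the ranges (0,1.6], (1.6,1.7], ..., (2.0, inf)
def height_bin(h):
    return bisect.bisect_left(edges, h)      # index 0..5 of the range h falls into

classes = Counter(label for _, _, label in data)   # short: 4, medium: 8, tall: 3

def score(gender, height, label):
    rows = [(g, h) for g, h, c in data if c == label]
    p_gender = sum(1 for g, _ in rows if g == gender) / len(rows)
    p_height = sum(1 for _, h in rows if height_bin(h) == height_bin(height)) / len(rows)
    return (classes[label] / len(data)) * p_gender * p_height

scores = {c: score('M', 1.95, c) for c in classes}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))   # short 0.0, medium ~0.2, tall ~0.8 -> classify as tall
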
Advantages of naïve bayes

 It is easy to use.
 Unlike other classification approaches, only one
scan of the training data is required.
 The naive Bayes approach can easily handle missing
values by simply omitting that probability when
calculating the likelihoods of membership in each
class.
 In cases where there are simple relationships, the
technique often does yield good results.
Disadvantages of naïve bayes

 Although the naive Bayes approach is straightforward


to use, it does not always yield satisfactory results.
 First, the attributes usually are not independent. We
could use a subset of the attributes by ignoring any
that are dependent on others.
 The technique does not handle continuous data.
 Dividing the continuous values into ranges could be
used to solve this problem, but the division of the
domain into ranges is not an easy task, and how this is
done can certainly impact the results.
Decision Tree based Algorithms

 In Decision tree approach, a tree is constructed to


model the classification process.
 Once the tree is built, it is applied to each tuple in the
database and results in a classification for that tuple.
 There are two basic steps in the technique:
⚫ building the tree
⚫ and applying the tree to the database.

Most research has focused on how to build effective trees


as the application process is straightforward.
Decision Tree

Training Data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes):
Root: Home Owner? (Yes → NO; No → MarSt?)
MarSt? (Married → NO; Single or Divorced → Income?)
Income? (< 80K → NO; > 80K → YES)
Another Example of Decision Tree

The same training data as above also fits this tree:
Root: MarSt? (Married → NO; Single or Divorced → Home Owner?)
Home Owner? (Yes → NO; No → Income?)
Income? (< 80K → NO; > 80K → YES)

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree and follow the branches that match the record:
Home Owner = No → go to MarSt; MarSt = Married → leaf NO.
The model therefore predicts Defaulted Borrower = No.
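A minimal Python sketch of this traversal, written as the nested conditionals that the tree above encodes (the function and argument names are only illustrative):

def defaulted(home_owner, marital_status, annual_income):
    # Follow the decision tree: Home Owner -> MarSt -> Annual Income.
    if home_owner == 'Yes':
        return 'No'
    if marital_status == 'Married':
        return 'No'
    # Single or Divorced: split on Annual Income at 80K
    return 'Yes' if annual_income > 80_000 else 'No'

# Test record from the slide: Home Owner = No, Married, Annual Income = 80K
print(defaulted('No', 'Married', 80_000))   # -> 'No'
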
Parts of a Decision Tree
Decision Tree
Given:
⚫ D = {t1, …, tn} where ti=<ti1, …, tih>
⚫ Attributes {A1, A2, …, Ah}
⚫ Classes C={C1, …., Cm}
Decision or Classification Tree is a tree associated with
D such that
⚫ Each internal node is labeled with attribute, Ai
⚫ Each arc is labeled with predicate which can be
applied to attribute at parent
⚫ Each leaf node is labeled with a class, Cj
Decision Tree based Algorithms

 Solving the classification problem using decision trees


is a two-step process:
⚫ Decision tree induction: Construct a DT using training
data.
⚫ For each ti ∈ D, apply the DT to determine its class.

 DT approaches differ in how the tree is built.

 Algorithms: ID3, C4.5, CART


DT Induction
DT Induction
 The recursive algorithm builds the tree in a top-down fashion.
 Using the initial training data, the "best" splitting attribute is
chosen first. [Algorithms differ in how they determine the "best
attribute" and its "best predicates" to use for splitting. ]
 Once this has been determined, the node and its arcs are
created and added to the created tree.
 The algorithm continues recursively by adding new subtrees to
each branching arc.
 The algorithm terminates when some "stopping criteria" is
reached. [Again, each algorithm determines when to stop the tree
differently. One simple approach would be to stop when the tuples in the
reduced training set all belong to the same class. This class is then used to
label the leaf node created.]
DT Induction

 Splitting attributes: Attributes in the database


schema that will be used to label nodes in the tree
and around which the divisions will take place.

 Splitting predicates: The predicates by which


the arcs in the tree are labeled.
DT Issues

 Choosing Splitting Attributes


 Ordering of Splitting Attributes
 Splits
 Tree Structure
 Stopping Criteria
 Training Data
 Pruning
DT Issues
 Choosing Splitting Attributes
Name Gender Height Output1(Correct) Output2(Actual Assignment)
Kristina F 1.6m medium Medium
Jim M 2m Tall Short
Maggie F 1.9m Medium Short
Martha F 1.88m Short medium
Stephanie F 1.7m Medium Tall
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Short
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Short
Debbie F 1.8m Tall Medium
Todd M 1.95m Medium Tall
Kim F 1.9m Short Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Short
DT Issues
 Ordering of Splitting Attributes
 The order in which the attributes are chosen is also
important.
DT Issues
 Splits
⚫ With some attributes, the domain is small, so the number of
splits is obvious based on the domain (as with the gender
attribute).
⚫ However, if the domain is continuous or has a large number
of values, the number of splits to use is not easily
determined.

[Figure: (i) a binary split on “Annual Income > 80K?” (Yes / No); (ii) a multi-way split of Annual Income into the ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.]


DT Issues
 Tree Structure
⚫ a balanced tree with the fewest levels is desirable.
⚫ However, in this case, more complicated comparisons
with multiway branching may be needed.
⚫ Some algorithms build only binary trees.
DT Issues
 Stopping Criteria
⚫ when the training data are perfectly classified.
⚫ when stopping earlier would be desirable to prevent the
creation of larger trees. This is a trade-off between accuracy of
classification and performance.
 Training Data
⚫ The structure of the DT created depends on the training data.
⚫ If the training data set is too small, then the generated tree
might not be specific enough to work properly with the more
general data.
⚫ If the training data set is too large, then the created tree may
overfit.
DT Issues

 Pruning
⚫ Once a tree is constructed, some modifications to the tree
might be needed to improve the performance of the tree
during the classification phase.
⚫ The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Comparing Decision Trees
ID3
 ID3 stands for Iterative Dichotomiser 3
 Creates the tree using information theory concepts and tries to reduce the expected number of comparisons.
 ID3 chooses split attribute with the highest information
gain:
 Information gain=(Entropy of distribution before the
split)–(entropy of distribution after it)
Entropy
 Entropy
⚫ Is used to measure the amount of uncertainty, surprise, or randomness in a set of data.
⚫ When all data belongs to a single class, entropy is zero, as there is no uncertainty.
⚫ An equally divided sample has an entropy of 1.
⚫ The mathematical formula for entropy is:

E(S) = - Σ pi log2(pi)   (sum over the classes i)

where pi is simply the frequentist probability of an element/class i in our data.
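A short Python sketch of this formula (the helper name is only illustrative), checked against the cases mentioned above:

from math import log2
from collections import Counter

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the class proportions in the sample.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

print(entropy(['yes'] * 8))               # 0.0   -> a pure node, no uncertainty
print(entropy(['yes'] * 4 + ['no'] * 4))  # 1.0   -> an equally divided sample
print(entropy(['yes'] * 9 + ['no'] * 5))  # 0.940 -> the 9-positive / 5-negative set used later
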
How do Decision Trees use Entropy?

 Entropy basically measures the impurity of a node.


 Impurity is the degree of randomness; it tells how random our data is.
 A pure sub-split means that either you should be getting “yes”, or you
should be getting “no”.

(In the illustrating figure, the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater share of “yes” examples and it is easy to decide there.)
Information Gain

 The goal of machine learning is to decrease the uncertainty


or impurity in the dataset, here by using the entropy we are
getting the impurity of a particular node, we don’t know if
the parent entropy or the entropy of a particular node has
decreased or not.

 For this, we bring a new metric called “Information gain”


which tells us how much the parent entropy has decreased
after splitting it with some feature.
Information Gain

 Suppose our entire population has a total of 30 instances. The


dataset is to predict whether the person will go to the gym or not.
Let’s say 16 people go to the gym and 14 people don’t
 Two features to predict whether he/she will go to the gym or not.
⚫ Feature 1 is “Energy” which takes two values “high” and “low”
⚫ Feature 2 is “Motivation” which takes 3 values “No motivation”,
“Neutral” and “Highly motivated”.

Let’s see how our decision tree will be made using these 2 features.
We’ll use information gain to decide which feature should be the
root node and which feature should be placed after the split.
Information Gain

To get the weighted average entropy of the child nodes, each child's entropy is weighted by the fraction of the parent's examples it holds:

E(Parent | Energy) = Σ (number of examples in child / number of examples in parent) * E(child)

Now that we have the values of E(Parent) and E(Parent | Energy), the information gain is:

Information Gain = E(Parent) - E(Parent | Energy)

Our parent entropy was near 0.99, and after looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make “Energy” our root node.
Information Gain

The “Energy” feature gives a larger reduction (0.37) than the “Motivation” feature. Hence we select the feature with the highest information gain and then split the node based on that feature.
Information Gain

 Gain is defined as the difference between how


much information is needed to make a correct
classification before the split versus how much
information is needed after the split.
Information Gain Example
 Let S = 14 examples, 9 positive and 5 negative
 Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 The attribute is Wind; its values are Weak and Strong
 8 occurrences of weak winds; 6 occurrences of strong winds
 For the weak winds, 6 are positive and 2 are negative
 For the strong winds, 3 are positive and 3 are negative
 Gain(S, Wind) =
Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
 Entropy(Weak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
 Entropy(Strong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
 So… 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048
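A small Python sketch that reproduces this calculation from the counts (binary entropy from positive/negative counts; the names are only illustrative):

from math import log2

def entropy(pos, neg):
    # Binary entropy computed from positive/negative counts.
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

entropy_S      = entropy(9, 5)                 # 0.940
entropy_weak   = entropy(6, 2)                 # 0.811
entropy_strong = entropy(3, 3)                 # 1.000

gain = entropy_S - (8/14) * entropy_weak - (6/14) * entropy_strong
print(round(gain, 3))                          # ~0.048
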
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
ID3 Example (Output1)
 The beginning state of the training data in the table (with the Output1 classification) is that
(4/15) are short, (8/15) are medium, and (3/15) are tall.

 Starting state entropy (base-10 logarithms are used throughout this example):

4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

 Gain using gender:

⚫ Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
⚫ Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
⚫ Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
⚫ Gain: 0.4384 – 0.34152 = 0.09688
ID3 Example (Output1)
 Gain using height:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Entropy calculation:
There are 2 tuples in the first division with entropy (2/2(0) + 0 + 0) = 0,
2 in (1.6, 1.7] with entropy (2/2(0) + 0 + 0) = 0,
3 in (1.7, 1.8] with entropy (0 + 3/3(0) + 0) = 0,
4 in (1.8, 1.9] with entropy (0 + 4/4(0) + 0) = 0,
2 in (1.9, 2.0] with entropy (0 + 1/2(0.301) + 1/2(0.301)) = 0.301, and
2 in the last with entropy (0 + 0 + 2/2(0)) = 0.
All of these states are completely ordered and thus have an entropy of 0, except for the (1.9, 2.0] state.
 The gain in entropy by using the height attribute is
0.4384 – (2/15)(0.301) = 0.3983
 Choose height as the first splitting attribute
Advantages of ID3

 Understandable prediction rules are


created from the training data.
 Builds the fastest tree.

 Builds a short tree.

 Only need to test enough attributes until


all data is classified.
 Finding leaf nodes enables test data to be
pruned, reducing number of tests.
Disadvantages of ID3

 Data may be over-fitted or over classified,


if a small sample is tested.
 Only one attribute at a time is tested for
making a decision.
 Classifying continuous data may be
computationally expensive, as many trees
must be generated to see where to break
the continuum.
Example: Triangles and Squares

Data Set: a set of classified objects

#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle
Entropy
• 5 triangles
• 9 squares
• class probabilities: p(triangle) = 5/14, p(square) = 9/14
• entropy: E = - (5/14) log2(5/14) - (9/14) log2(9/14) = 0.940
Entropy reduction by data set partitioning

[Figure: splitting the data set on Color (red / yellow / green) partitions it into three subsets, each with lower entropy than the full set.]

Information Gain

[Figure: the information gain of Color is the entropy of the full set minus the weighted average entropy of the red, yellow, and green subsets.]
Information Gain of The
Attribute
 Attributes
⚫ Gain(Color) = 0.246
⚫ Gain(Outline) = 0.151
⚫ Gain(Dot) = 0.048
 Heuristics: attribute with the highest gain is chosen
 So, color is chosen as the root node
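A short Python sketch that recomputes these three gains directly from the data set above (attribute ordering follows the table; the helper names are only illustrative; the slide's values differ only by rounding):

from math import log2
from collections import Counter

# Triangles-and-squares data: (color, outline, dot, shape)
data = [
    ('green','dashed','no','triangle'),  ('green','dashed','yes','triangle'),
    ('yellow','dashed','no','square'),   ('red','dashed','no','square'),
    ('red','solid','no','square'),       ('red','solid','yes','triangle'),
    ('green','solid','no','square'),     ('green','dashed','no','triangle'),
    ('yellow','solid','yes','square'),   ('red','solid','no','square'),
    ('green','solid','yes','square'),    ('yellow','dashed','yes','square'),
    ('yellow','solid','no','square'),    ('red','dashed','yes','triangle'),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    # Information gain of splitting `rows` on the attribute at `attr_index`.
    weighted = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r for r in rows if r[attr_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - weighted

for index, name in enumerate(['Color', 'Outline', 'Dot']):
    print(name, round(gain(data, index), 3))   # Color ~0.247, Outline ~0.152, Dot ~0.048
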
Splitting the subsets

For the red branch (objects 4, 5, 6, 10, 14: 2 triangles, 3 squares; entropy 0.971):
Gain(Outline) = 0.971 – 0.951 = 0.020 bits
Gain(Dot) = 0.971 – 0 = 0.971 bits
→ the red branch is split on Dot.

For the green branch (objects 1, 2, 7, 8, 11: 3 triangles, 2 squares; entropy 0.971):
Gain(Outline): P(dashed) = 3/5, P(solid) = 2/5
I(dashed) = -(3/3)log2(3/3) – 0 = 0
I(solid) = -0 – (2/2)log2(2/2) = 0
I(Outline) = (3/5)(0) + (2/5)(0) = 0
Gain(Outline) = 0.971 – 0 = 0.971 bits
Gain(Dot): P(yes) = 2/5, P(no) = 3/5
I(yes) = -(1/2)log2(1/2) – (1/2)log2(1/2) = 1
I(no) = -(1/3)log2(1/3) – (2/3)log2(2/3) = 0.917
I(Dot) = (2/5)(1) + (3/5)(0.917) = 0.9502
Gain(Dot) = 0.971 – 0.9502 = 0.020 bits
→ the green branch is split on Outline.

The yellow branch contains only squares, so it becomes a leaf.
Decision Tree

The final tree:

Color = red → Dot? (yes → triangle, no → square)
Color = yellow → square
Color = green → Outline? (dashed → triangle, solid → square)
