
Chapter 4 Classification and Prediction.

 Content:
1. Decision tree learning
 Construction, performance, attribute selection
 Issues: Over-fitting, tree pruning methods, missing values,
 continuous classes
 Classification and Regression Trees (CART)
2. Bayesian Classification:
 Bayes Theorem, Naïve Bayes classifier,
 Bayesian Networks
 Inference
 Parameter and structure learning
• Linear classifiers.
• Least squares, logistic, perceptron and SVM classifiers
3. Prediction
 Linear regression
 Non-linear regression

 Decision tree learning


Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules, and
each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any
further branches.
o The decisions or tests are performed on the basis of the features of the given
dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are two reasons for using a Decision tree:

o Decision Trees usually mimic the human way of thinking while making a decision, so
they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated
further after a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub-Tree: A subtree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of
those sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from
the root node of the tree. It compares the value of the root attribute with the
corresponding attribute of the record (from the real dataset) and, based on the comparison,
follows the branch and jumps to the next node.

At the next node, the algorithm again compares the record's attribute value with the sub-
nodes and moves further. It continues this process until it reaches a leaf node of the
tree. The complete process can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the nodes
cannot be classified further; such a final node is called a leaf node. (A minimal
code sketch of this recursion is given below.)
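
The recursion in Steps 1-5 can be sketched in a few lines of Python. This is only a
minimal illustration, not the full ID3/CART algorithm: the parameter best_attribute is an
assumed placeholder for whichever attribute selection measure is used (for example, an
information-gain based chooser like the one sketched later in this chapter), and the data is
assumed to be a list of (feature-dict, label) pairs.

from collections import Counter

def build_tree(rows, attributes, best_attribute):
    """Minimal sketch of recursive decision-tree construction.

    rows           : list of (feature_dict, label) pairs
    attributes     : attribute names still available for splitting
    best_attribute : function(rows, attributes) -> attribute name,
                     standing in for an Attribute Selection Measure (ASM)
    """
    labels = [label for _, label in rows]

    # Stop if the node is pure or no attributes are left: return a leaf label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step-2: pick the best attribute with the ASM.
    attr = best_attribute(rows, attributes)

    # Step-3/4: create a decision node and split the rows on each attribute value.
    node = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    values = {features[attr] for features, _ in rows}
    for value in values:
        subset = [(f, y) for f, y in rows if f[attr] == value]
        # Step-5: recurse on each subset.
        node[attr][value] = build_tree(subset, remaining, best_attribute)
    return node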

Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or not. To solve this problem, the decision tree starts with
the root node (the Salary attribute, chosen by ASM). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding
labels. The next decision node further splits into one decision node (cab facility) and
one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and
Declined offer). Consider the diagram below:

Attribute Selection Measures


While implementing a Decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. Using this
measure, we can easily select the best attribute for the nodes of the tree. There are
two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and the node/attribute having the highest information gain is split first. It can be
calculated using the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log₂ P(yes) − P(no)·log₂ P(no)

Where,

o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
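
As a quick illustration of these formulas, the sketch below computes entropy and
information gain for a yes/no split in plain Python. The function names and the example
counts are only illustrative; the counts in the usage lines come from the play-cricket table
used later in this chapter.

import math

def entropy(p, n):
    """Entropy of a node with p 'yes' and n 'no' samples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # 0 * log2(0) is taken as 0
            prob = count / total
            result -= prob * math.log2(prob)
    return result

def information_gain(parent, children):
    """parent = (p, n); children = list of (p, n) counts per attribute value."""
    p, n = parent
    weighted = sum((cp + cn) / (p + n) * entropy(cp, cn) for cp, cn in children)
    return entropy(p, n) - weighted

# Example: the play-cricket table (9 yes / 5 no) split on Outlook.
print(round(entropy(9, 5), 3))                                       # ≈ 0.94
print(round(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))  # ≈ 0.247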

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
those binary splits.
o The Gini index can be calculated using the formula below:

Gini Index = 1 − Σⱼ Pⱼ²
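
A corresponding sketch for the Gini index is shown below. Again, this is only an
illustration in plain Python, assuming the class counts at a node are already known.

def gini_index(class_counts):
    """Gini index of a node given a list of class counts, e.g. [9, 5]."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Example: a node with 9 'yes' and 5 'no' samples.
print(round(gini_index([9, 5]), 3))   # ≈ 0.459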

Pruning: Getting an Optimal Decision tree


Pruning is the process of deleting unnecessary nodes from a tree in order to obtain the
optimal decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the
learning tree without reducing accuracy is therefore known as pruning. There are mainly two
types of tree-pruning techniques used (a short code sketch of the first follows the list):

o Cost Complexity Pruning
o Reduced Error Pruning
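
As an example of the first technique, scikit-learn's DecisionTreeClassifier exposes
cost-complexity pruning through the ccp_alpha parameter. The sketch below is only a
minimal illustration on the built-in iris dataset, not a full tuning procedure; in practice
the best alpha would be chosen by cross-validation.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the candidate alpha values for cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per alpha; larger alphas give smaller (more heavily pruned) trees.
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")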

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows
while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o A decision tree may contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For a larger number of class labels, the computational complexity of the decision
tree may increase.

Python Implementation of Decision Tree


Now we will implement the Decision tree using Python. For this, we will use the dataset
"user_data.csv", which we have used in previous classification models. By using the
same dataset, we can compare the Decision tree classifier with other classification
models such as KNN, SVM, Logistic Regression, etc.

The steps will also remain the same, as given below (a condensed code sketch follows the list):

o Data pre-processing step
o Fitting a Decision Tree algorithm to the Training set
o Predicting the test result
o Testing the accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result.
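
A condensed sketch of these steps is given below. It assumes, as in the earlier chapters,
that user_data.csv has the feature columns Age and EstimatedSalary and the target column
Purchased; those column names are an assumption about that file, not something fixed by
scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# 1. Data pre-processing step (column names assumed from the earlier chapters).
data = pd.read_csv("user_data.csv")
X = data[["Age", "EstimatedSalary"]].values
y = data["Purchased"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 2. Fitting a Decision Tree classifier (criterion="entropy" uses information gain).
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(X_train, y_train)

# 3. Predicting the test result.
y_pred = classifier.predict(X_test)

# 4. Test accuracy of the result (confusion matrix).
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))

# 5. Visualizing the test-set result would follow the same matplotlib-based
#    approach used for the earlier classifiers.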
Example: Using the following table, build a decision tree.
Day Outlook Temperature Humidity Wind Play Cricket
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

We need to remember the following formulas to create a decision tree:

1. Information Gain: I(p, n) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))

2. Entropy: E(A) = Σᵢ₌₁ᵛ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)

3. Gain: Gain(A) = I(p, n) − E(A)

Step 1:
Here we have to find the information gain for the whole table.
In this table we have:
p: yes = 9
n: no = 5
So we calculate the information gain:

I(p, n) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
I(9, 5) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14)
        = −(9/14)·log₂(0.643) − (5/14)·log₂(0.357)

[Note: On a scientific calculator the default log base is 10, so you have to change the
mode from log₁₀ to log₂. If you are not able to change the mode, then use
log₂(value) = log(value) / log(2).]

        = −(9/14)·(−0.637) − (5/14)·(−1.485)
        = 0.940
I(9, 5) = 0.940
Using Information Gain we have to find Entropy and Gain.
1. We have to find the Entropy of the Outlook column:
we have p: yes = 9
n: no = 5
I(9, 5) = 0.940
Entropy of Outlook = E(Outlook):
We create a table of the Outlook attribute from the main table:
Outlook Pi ni I(pi, ni)
Sunny 2 3 0.970
Overcast 4 0 0
Rain 3 2 0.970

[Note: 1. If one of the values pᵢ or nᵢ is 0 (zero), then the information gain I(pᵢ, nᵢ) = 0.
2. If the values of pᵢ and nᵢ for an attribute are the same, or the same in reverse order,
then the information gain I(pᵢ, nᵢ) is the same.]

Calculate the information gain for Sunny:

I(p, n) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
I(2, 3) = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5)
I(2, 3) = 0.528 + 0.442
I(2, 3) = 0.970
* Then calculate Entropy(Outlook):

E(Outlook) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
           = (5/14)·(0.970) + (4/14)·(0) + (5/14)·(0.970)

E(Outlook) = 0.692
* Then calculate Gain(Outlook):

Gain(Outlook) = I(p, n) − E(A)
Gain(Outlook) = I(9, 5) − E(Outlook)
              = 0.940 − 0.692
Gain(Outlook) = 0.248

 Using the same technique or method we have to find Gain(Temperature),
Gain(Humidity), and Gain(Wind).
So we get:
 Gain(T, Outlook) = 0.248
 Gain(T, Temperature) = 0.029
 Gain(T, Humidity) = 0.151
 Gain(T, Wind) = 0.048
We have to select the maximum gain for the decision tree, so we select Gain(T, Outlook),
i.e. Outlook, as the root of the decision tree. (A short code check of these gains is given
below.)
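
These hand calculations can be cross-checked with a short script. The sketch below
re-uses the entropy and information-gain helpers from earlier and encodes only the class
counts per attribute value taken from the table above; the printed values should agree with
the hand-computed gains up to rounding.

import math

def entropy(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def gain(parent, children):
    p, n = parent
    return entropy(p, n) - sum((cp + cn) / (p + n) * entropy(cp, cn) for cp, cn in children)

parent = (9, 5)  # 9 yes / 5 no over the whole table
splits = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],   # Hot, Mild, Cool
    "Humidity":    [(3, 4), (6, 1)],           # High, Normal
    "Wind":        [(6, 2), (3, 3)],           # Weak, Strong
}
for attribute, children in splits.items():
    print(attribute, round(gain(parent, children), 3))
# Outlook (≈ 0.247) has the largest gain, so it becomes the root.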

After that we have to check which conditions occur under the Sunny attribute.
For that we have to separate the Sunny rows from the main table.
So we create the Sunny table:

Day Outlook Temperature Humidity Wind Play Cricket

1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

We have to find the Entropy and Gain of Temperature, Humidity and Wind:


1. Entropy of Temperature:
P: yes = 2, n: no = 3

Temperature pᵢ nᵢ I(pᵢ, nᵢ)

Hot 0 2 0
Mild 1 1 1
Cool 1 0 0

E(Temperature) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
               = ((0+2)/5)·(0) + ((1+1)/5)·(1) + ((1+0)/5)·(0)
E(Temperature) = 0.4
* Gain(Tsunny, Temperature) = I(p, n) − E(Temperature)
                            = 0.970 − 0.4
Gain(Tsunny, Temperature) = 0.570
2. Entropy of Humidity:
P: yes = 2, n: no = 3

Humidity Pi ni I (pi, ni)


High 0 3 0
Normal 2 0 0

E (Humidity) = 0
Gain (Tsunny, Humidity) = I (p, n) – E(Humidity)
= 0.970 - 0
Gain (Tsunny, Humidity) = 0.970
3. Entropy of Wind:
p: yes = 2, n: no = 3

Wind pᵢ nᵢ I(pᵢ, nᵢ)

Weak 1 2 0.918
Strong 1 1 1

E(Wind) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
        = ((1+2)/5)·(0.918) + ((1+1)/5)·(1)
        = 0.950
Gain(Tsunny, Wind) = I(2, 3) − E(Wind)
                   = 0.970 − 0.950
Gain(Tsunny, Wind) = 0.020
 So we get:
G(Tsunny, Temperature) = 0.570
G(Tsunny, Humidity) = 0.970
G(Tsunny, Wind) = 0.020
The maximum gain is G(Tsunny, Humidity) = 0.970, so we select the Humidity attribute
for the decision tree under the Sunny branch.
 Using the same method we calculate the Gain for the Overcast attribute.
For that we create the Overcast table from the main table:

Day Outlook Temperature Humidity Wind Play Cricket


3 Overcast Hot High Weak Yes
7 Overcast Cool Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes

In this table all outcomes of Play Cricket are Yes, so we do not need to calculate
the Gain for Overcast; the Overcast branch directly becomes a leaf node (Yes).
 Using the same method we calculate the Gain for the Rain attribute.
For that we create the Rain table from the main table:

Day Outlook Temperature Humidity Wind Play Cricket

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

10 Rain Mild Normal Weak Yes

14 Rain Mild High Strong No

Note: Again we have to find the Gain for Wind and Temperature.
Humidity was already used as the test in the Sunny branch; for the Rain rows its gain
works out to only 0.019, lower than Wind's, so it would not be selected in any case.
 For the Rain table:
p: yes = 3
n: no = 2
I(TRain) = I(3, 2) = 0.970

 Entropy for Wind:

Wind Pi ni I (pi, ni)


Weak 3 0 0
Strong 0 2 0
E (Wind) = 0
Gain (TRain, Wind) = 0.970

 Entropy for Temperature:

Temperature pᵢ nᵢ I(pᵢ, nᵢ)

Hot 0 0 0
Mild 2 1 0.918
Cool 1 1 1

E(Temperature) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
               = 0 + ((2+1)/5)·(0.918) + ((1+1)/5)·(1)
E(Temperature) = 0.951
Gain(TRain, Temperature) = I(TRain) − E(Temperature)
                         = 0.970 − 0.951
Gain(TRain, Temperature) = 0.019
So we get:
Gain(TRain, Wind) = 0.970
Gain(TRain, Temperature) = 0.019
So we select the maximum gain, i.e. Gain(TRain, Wind) = 0.970, for the decision tree.
So the final decision tree is: Outlook at the root; under Sunny, a Humidity node
(High → No, Normal → Yes); under Overcast, a leaf (Yes); and under Rain, a Wind node
(Weak → Yes, Strong → No).
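
The final tree can also be written down directly as a small data structure. The nested
dictionary below encodes exactly the tree derived above (this representation is just one
convenient choice, not a standard format), together with a tiny classify function.

# Final tree from the worked example: Outlook at the root,
# Humidity under Sunny, leaf 'Yes' under Overcast, Wind under Rain.
decision_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, example):
    """Walk the nested-dict tree until a leaf label (a string) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))            # the attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

# Day 14 from the table: Rain, Mild, High, Strong -> expected 'No'.
print(classify(decision_tree, {"Outlook": "Rain", "Temperature": "Mild",
                               "Humidity": "High", "Wind": "Strong"}))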
Bayesian Classifiers
Bayesian classification uses Bayes' theorem to predict the probability of an event.
Bayesian classifiers are statistical classifiers built on the Bayesian understanding of
probability. The theorem expresses how a degree of belief, expressed as a probability,
should be updated to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:

P(X|Y) = P(Y|X) · P(X) / P(Y)

Where X and Y are the events and P (Y) ≠ 0


P(X|Y) is a conditional probability: the probability of event X occurring given
that Y is true.
P(Y|X) is a conditional probability: the probability of event Y occurring given
that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other;
these are known as the marginal probabilities.
Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief." Bayes'
theorem connects the degree of belief in a hypothesis before and after accounting for
evidence. For example, let us consider a coin. If we toss a coin, we get either heads or
tails, and the probability of getting either heads or tails is 50%. If the coin is flipped a
number of times and the outcomes are observed, the degree of belief may rise, fall, or
remain the same depending on the outcomes.
For proposition X and evidence Y,
o P(X), the prior, is the initial degree of belief in X.
o P(X|Y), the posterior, is the degree of belief after accounting for Y.

o The quotient P(Y|X)/P(Y) represents the support Y provides for X.


Bayes' theorem can be derived from the definition of conditional probability:

P(X|Y) = P(X∩Y) / P(Y)   and   P(Y|X) = P(X∩Y) / P(X),

where P(X∩Y) is the joint probability of both X and Y being true. Solving both equations
for P(X∩Y) and equating them gives Bayes' theorem.

Bayesian network:
A Bayesian Network falls under the category of Probabilistic Graphical Models (PGMs),
which are used to compute uncertainties using the concept of probability. Also known as
Belief Networks, Bayesian Networks represent uncertainty using Directed Acyclic
Graphs (DAGs).
A Directed Acyclic Graph is used to represent a Bayesian Network and, like any other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connections between the nodes.

The nodes represent random variables, and the edges define the relationships between
these variables.
A DAG models the uncertainty of an event taking place based on the Conditional
Probability Distribution (CPD) of each random variable. A Conditional Probability Table
(CPT) is used to represent the CPD of each variable in the network. (A small worked
sketch is given below.)
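
As a small, self-contained illustration of a CPT-based DAG (the variables and numbers
below are made up for illustration, not taken from this chapter), consider a two-node
network Outlook → Play, where each CPT is stored as a plain dictionary and joint
probabilities follow the chain rule P(Outlook, Play) = P(Outlook) · P(Play | Outlook).

# Hypothetical two-node Bayesian network: Outlook -> Play.
# P(Outlook) is a prior; P(Play | Outlook) is a conditional probability table (CPT).
p_outlook = {"Sunny": 0.4, "Rain": 0.6}
p_play_given_outlook = {
    "Sunny": {"Yes": 0.8, "No": 0.2},
    "Rain":  {"Yes": 0.3, "No": 0.7},
}

def joint(outlook, play):
    """Chain rule on the DAG: P(Outlook, Play) = P(Outlook) * P(Play | Outlook)."""
    return p_outlook[outlook] * p_play_given_outlook[outlook][play]

# Marginal P(Play = Yes): sum the joint probability over the parent's values.
p_play_yes = sum(joint(o, "Yes") for o in p_outlook)
print(p_play_yes)                          # 0.4*0.8 + 0.6*0.3 = 0.50

# Inference with Bayes' theorem: P(Outlook = Sunny | Play = Yes).
print(joint("Sunny", "Yes") / p_play_yes)  # 0.32 / 0.50 = 0.64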
Example:
1. For the given dataset, apply the Naïve Bayes algorithm and predict the outcome for a
car = {Red, Domestic, SUV}.
Color Type Origin Stolen
Red Sports Domestic Yes
Red Sports Domestic No
Red Sports Domestic Yes
Yellow Sports Domestic No
Yellow Sports Imported Yes
Yellow SUV Imported No
Yellow SUV Imported Yes
Yellow SUV Domestic No
Red SUV Imported No
Red Sports Imported Yes

In this problem we are given a new instance, i.e.
A = {Red, Domestic, SUV}.
Using this instance, we have to check whether the car is stolen or not.
For that we use Bayes' theorem:

P(X|Y) = P(Y|X) · P(X) / P(Y)

Step 1: We have to find the probability that a Red, Domestic, SUV car is stolen (class Yes).

i. P(Red | Yes) = P(Yes | Red) · P(Red) / P(Yes)
               = (3/5 · 5/10) / (5/10)
               = 3/5

ii. P(Domestic | Yes) = P(Yes | Domestic) · P(Domestic) / P(Yes)
                      = (2/5 · 5/10) / (5/10)
                      = 2/5

iii. P(SUV | Yes) = P(Yes | SUV) · P(SUV) / P(Yes)
                  = (1/4 · 4/10) / (5/10)
                  = 1/5

P(A | Yes) = 3/5 · 2/5 · 1/5 = 6/125

P(A | Yes) = 0.048

Step 2: We have to find the probability that a Red, Domestic, SUV car is not stolen (class No).

i. P(Red | No) = P(No | Red) · P(Red) / P(No)
              = (2/5 · 5/10) / (5/10)
              = 2/5

ii. P(Domestic | No) = P(No | Domestic) · P(Domestic) / P(No)
                     = (3/5 · 5/10) / (5/10)
                     = 3/5

iii. P(SUV | No) = P(No | SUV) · P(SUV) / P(No)
                 = (3/4 · 4/10) / (5/10)
                 = 3/5

P(A | No) = 2/5 · 3/5 · 3/5 = 18/125

P(A | No) = 0.144

From Step 1 and Step 2, the probability for No (0.144) is greater than the probability for
Yes (0.048). Since the priors P(Yes) = P(No) = 5/10 are equal, the posterior for No is also
larger. So the new instance car = {Red, Domestic, SUV} is classified as No, meaning the
car is not stolen.
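
The same calculation can be verified with a few lines of Python that count directly from
the table above; the dataset is encoded as tuples and the class-conditional probabilities
are multiplied exactly as in Steps 1 and 2.

# (Color, Type, Origin, Stolen) rows from the table above.
rows = [
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]
instance = ("Red", "SUV", "Domestic")   # the new car A = {Red, Domestic, SUV}

for label in ("Yes", "No"):
    in_class = [r for r in rows if r[3] == label]
    # Naive Bayes: multiply P(feature value | class) over the three features.
    likelihood = 1.0
    for i, value in enumerate(instance):
        likelihood *= sum(r[i] == value for r in in_class) / len(in_class)
    print(label, round(likelihood, 3))
# Prints Yes 0.048 and No 0.144, so the car is classified as 'No' (not stolen).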

Example 2:
Consider the given dataset, apply the Naïve Bayes algorithm and predict which type of
fruit it is if a fruit has the following properties.
New instance: Fruit = {Yellow, Sweet, Long}
Frequency Table
Fruit Yellow Sweet Long Total
Mango 350 450 0 650
Banana 400 300 350 400
Others 50 100 50 150
Total 800 850 400 1200
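
Example 2 can be solved directly from the frequency table. The sketch below treats each
count as the number of fruits of that class having the property (so, for example,
P(Yellow | Banana) = 400/400) and multiplies the class-conditional probabilities by the
class prior; the code is only an illustration of the calculation, and the counts come from
the table above.

# Frequency table: property counts per fruit class, plus class totals.
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}
grand_total = 1200
new_instance = ["Yellow", "Sweet", "Long"]

scores = {}
for fruit, c in counts.items():
    prior = c["Total"] / grand_total              # P(fruit)
    likelihood = 1.0
    for prop in new_instance:
        likelihood *= c[prop] / c["Total"]        # P(property | fruit)
    scores[fruit] = prior * likelihood            # proportional to the posterior

for fruit, score in scores.items():
    print(fruit, round(score, 4))
print("Prediction:", max(scores, key=scores.get))  # Banana scores highest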
