
Module 2: Classification

• Logistic Regression
• Decision Tree
• Naïve Bayes
• Random Forest
• SVM Classifier

A Classification Problem
• VITEE score vs. admission in VIT
• Admitted (1)
• Not admitted (0)
• Binary classification: y ∈ {0, 1}
Logistic Regression

• For example, suppose that we are studying whether a student applicant
receives a scholarship (y = 1) or not (y = 0). Here, p is the probability that
an applicant receives aid, and possible explanatory variables include
(a) the financial support of the parents,
(b) the income and savings of the applicant, and
(c) whether the applicant has received financial aid before.

Estimated Regression Equation

p(x) = e^(β0 + β1·x) / (1 + e^(β0 + β1·x)) = 1 / (1 + e^−(β0 + β1·x))

Example:
Consider the following training examples:
Marks scored: X = [81 42 61 59 78 49]
Grade (Pass/Fail): Y = [Pass Fail Pass Fail Pass Fail]
Assume we want to model the probability of Y, parameterized by (β0, β1).

(i) With the chosen parameters β0 = -120 and β1 = 2, what should be the
minimum mark to ensure the student gets a 'Pass' grade with 95% probability?
(ii) With the chosen parameters, find the probability of declaring a mark of
61.5 as a 'Pass' grade.

Solution:
• Use the given values β0 = -120 and β1 = 2 to model p(x).
• (i) Substituting p(x) = 0.95, β0 = -120 and β1 = 2 into the model gives
xmin ≈ 61.47.
• (ii) Substituting x = 61.5, β0 = -120 and β1 = 2 gives p ≈ 0.95.
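The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function and variable names are my own, not from the slides.

```python
import math

beta0, beta1 = -120.0, 2.0           # parameters chosen in the example

def p_pass(x):
    """Logistic model: probability of 'Pass' for mark x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# (i) minimum mark for a 95% probability of 'Pass':
# solving p = 1/(1+e^-(b0+b1*x)) for x gives x = (ln(p/(1-p)) - b0) / b1
p = 0.95
x_min = (math.log(p / (1 - p)) - beta0) / beta1
print(f"x_min = {x_min:.2f}")        # ~ 61.47

# (ii) probability that a mark of 61.5 is declared 'Pass':
print(f"p(61.5) = {p_pass(61.5):.3f}")   # ~ 0.95
```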
Decision Tree

Decision Trees
A decision tree is a non-parametric supervised learning algorithm, which is
utilized for both classification and regression tasks. It has a hierarchical, tree
structure, which consists of a root node, branches, internal nodes and leaf
nodes.

A tree can be "learned" by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is completed
with leaf nodes.

As you can see from the diagram on the previous page, a decision tree starts
with a root node, which does not have any incoming branches. The outgoing
branches from the root node then feed into the internal nodes, also known as
decision nodes. The leaf nodes represent all the possible outcomes within the
dataset.

Decision trees used in data analytics are of two main types −


•Classification tree − when the response is a nominal variable, for example if
an email is spam or not.
•Regression tree − when the predicted outcome can be considered a real
number (e.g. the salary of a worker).
Example of a Decision Tree

Training Data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1   Yes   Single     125K   No
 2   No    Married    100K   No
 3   No    Single     70K    No
 4   Yes   Married    120K   No
 5   No    Divorced   95K    Yes
 6   No    Married    60K    No
 7   Yes   Divorced   220K   No
 8   No    Single     85K    Yes
 9   No    Married    75K    No
 10  No    Single     90K    Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):
• Refund = Yes → NO
• Refund = No → MarSt
   • MarSt = Married → NO
   • MarSt = Single or Divorced → TaxInc
      • TaxInc < 80K → NO
      • TaxInc > 80K → YES

Apply Model to Test Data
Start from the root of the tree.
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Refund = No → go to MarSt
• MarSt = Married → leaf node NO
• Assign Cheat to "No"

CLASSIFICATION METHODS

Challenges
• How to represent the entire information in the dataset using the minimum
number of rules?
• How to develop the smallest tree?

Solution
• Select the variable with maximum information for the first split
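Returning to the cheat example above, the tree from the diagram can be applied to the test record with a few lines of plain Python. This is a minimal sketch that hard-codes the depicted splits as if/else rules; the function and argument names are hypothetical.

```python
def predict_cheat(refund, marital_status, taxable_income_k):
    """Decision tree from the slide: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"                       # leaf: NO
    # Refund == "No"
    if marital_status == "Married":
        return "No"                       # leaf: NO
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income_k > 80 else "No"

# Test record from the slide: Refund = No, Married, 80K
print(predict_cheat("No", "Married", 80))   # -> "No" (assign Cheat = No)
```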
What is a decision tree?
• A model in the form of a tree structure, made up of decision nodes and leaf nodes.

What strategy is followed?
• Starts with the whole data set and recursively partitions the data set into
smaller subsets (a divide-and-conquer strategy).
• Begin at the root node, which represents the entire dataset.
• Choose the feature that is the most predictive of the target class.
• The examples are then partitioned into groups of distinct values of this feature.
• The algorithm continues to choose the best feature for partitioning until a
stopping criterion is reached:
   • All (or nearly all) of the examples at the node have the same class
   • There are no remaining features to distinguish among examples
   • The tree has grown to a predefined size limit

Problem
• Imagine that you are working for a Hollywood film studio, and your desk is
piled high with screenplays.
• You decide to develop a decision tree algorithm to predict whether a potential
movie would fall into one of three categories:
   • mainstream hit
   • critic's choice
   • box office bust

Model building
• Assume that you have gathered data from studio archives to examine the
previous ten years of movie releases.
• After reviewing the data for 30 different movie scripts, there seems to be a
relationship between
   • the film's proposed shooting budget,
   • the number of A-list celebrities lined up for starring roles, and
   • the categories of success.
Choosing the Best Split
• If the partitions contain only a single class, they are considered pure.
• Different measurements of purity are used for identifying splitting criteria:
   • Entropy & Information Gain (C5.0)
   • Gini Index (CART)

Entropy
• Measure of impurity.
• The minimum value of 0 indicates that the sample is completely homogeneous,
while 1 indicates the maximum amount of disorder.

Information Gain
• Measures the change in homogeneity resulting from a split on each possible feature.
• Difference between the entropy in the segment before the split (S1) and the
partitions resulting from the split (S2):
   InfoGain(F) = Entropy(S1) − Entropy(S2)
• The higher the information gain, the better a feature is at creating homogeneous
groups after a split on that feature.
• Refer to the manual workout:
https://medium.datadriveninvestor.com/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38
Gini Index
• Measures the degree or probability of a particular variable being wrongly
classified when it is randomly chosen.
• The lower the Gini index, the better a feature is at creating homogeneous
groups after a split on that feature.

Pros and Cons of Decision Trees
Pros
• Results in a model that can be interpreted without a mathematical background.
• Highly automatic learning process can handle numeric or nominal features and
missing data.
• Uses only the most important features.
• Can be used on data with relatively few training examples or a very large number.
Cons
• Decision tree models are often biased toward splits on features having a large
number of levels.
• Overfitting occurs easily.
• Small changes in training data can result in large changes to decision logic.
• Large trees can be difficult to interpret and the decisions they make may seem
counterintuitive.

Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain.
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci,D| / |D|.
• Expected information (entropy) needed to classify a tuple in D:
   Info(D) = − Σ_{i=1..m} pi · log2(pi)
• Information needed (after using A to split D into v partitions) to classify D:
   InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)
• Information gained by branching on attribute A:
   Gain(A) = Info(D) − InfoA(D)

Training data (age, income, student, credit_rating, buys_computer):
 <=30    high     no    fair        no
 <=30    high     no    excellent   no
 31…40   high     no    fair        yes
 >40     medium   no    fair        yes
 >40     low      yes   fair        yes
 >40     low      yes   excellent   no
 31…40   low      yes   excellent   yes
 <=30    medium   no    fair        no
 <=30    low      yes   fair        yes
 >40     medium   yes   fair        yes
 <=30    medium   yes   excellent   yes
 31…40   medium   no    excellent   yes
 31…40   high     yes   fair        yes
 >40     medium   no    excellent   no

Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Split on "age":
   age      pi   ni   I(pi, ni)
   <=30     2    3    0.971
   31…40    4    0    0
   >40      3    2    0.971

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means "age <=30" has 5 out of the 14 samples, with 2 yes's
and 3 no's.


Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246

Similarly, find the following:
   Gain(income) = ?
   Gain(student) = ?
   Gain(credit_rating) = ?
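To verify the worked example (Info(D) = 0.940, Infoage(D) = 0.694, Gain(age) = 0.246) and to answer the "similarly find" exercise, here is a small Python sketch; the dataset is typed in from the training table above and the helper names are my own.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer) from the training table
data = [
    ("<=30","high","no","fair","no"),   ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"),(">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),   (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"),("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),  (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"),(">40","medium","no","excellent","no"),
]

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at attr_index."""
    labels = [row[-1] for row in data]
    total = entropy(labels)                                   # Info(D)
    weighted = 0.0
    for value in set(row[attr_index] for row in data):
        subset = [row[-1] for row in data if row[attr_index] == value]
        weighted += len(subset) / len(data) * entropy(subset)  # Info_A(D)
    return total - weighted

print(round(entropy([r[-1] for r in data]), 3))   # 0.940
for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(info_gain(i), 3))           # age ~ 0.246
```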
Gini Index
The other way of splitting a decision tree is via the Gini Index, also known as
Gini impurity. It is a measure of how mixed or impure a dataset is. The Gini
impurity ranges between 0 and 1, where 0 represents a pure dataset and 1
represents a completely impure dataset. (In a pure dataset, all the samples belong
to the same class or category; an impure dataset contains a mixture of different
classes or categories.)

   Gini = 1 − Σ p(i)²

where p(i) is the probability of a specific class and the summation is done over
all classes present in the dataset.

Example: consider a toy dataset with two classes "Yes" and "No" and the
following class probabilities:
   p(Yes) = 0.3 and p(No) = 0.7
Find the Gini index.

   Gini impurity = 1 − (0.3)² − (0.7)² = 0.42

The Gini impurity value of 0.42 represents a 42% chance of misclassifying a
sample if we were to randomly assign a label from the dataset to that sample.
This means that the dataset is not completely pure, and there is some degree of
disorder in it.
Let's consider a toy dataset with the following features and class labels.
Find the root node using the Gini index.

The Gini formula requires us to calculate the Gini Index for each sub-node,
and then take a weighted average to calculate the overall Gini Index for the
node (for example, for Student Background).

   Working (9): 6 Pass and 3 Fail
   Not Working (6): 5 Pass and 1 Fail
   Online (8): 5 Pass and 3 Fail
   Not Online (7): 3 Pass and 4 Fail

The Gini Index is lowest for the Student Background variable. Hence, similar to
the Entropy and Information Gain criteria, we pick this variable for the root
node. In a similar fashion we would again proceed to move down the tree,
carrying out splits where node purity is lower.

Note: CART uses Gini; ID3 and C4.5 use Entropy.

Exercise: Find the root node using the Gini index for the buys_computer
training data shown earlier (attributes age, income, student, credit_rating).

Weighted Gini index for each attribute of the buys_computer data:
   Age = 0.3428
   Income = 0.438
   Student = 0.3673
   Credit = 0.428
Age has the lowest Gini index, so it is chosen for the root node.

Reference: https://blog.quantinsti.com/gini-index/
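A short Python sketch can reproduce the weighted Gini value reported for Age (≈ 0.343) from the class counts in the age table shown earlier, along with the toy example above; the helper name `gini` is my own.

```python
def gini(counts):
    """Gini impurity from a list of class counts: 1 - sum(p_i^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Toy example: p(Yes) = 0.3, p(No) = 0.7  ->  1 - 0.3^2 - 0.7^2 = 0.42
print(round(1 - 0.3**2 - 0.7**2, 2))

# Weighted Gini for splitting buys_computer on "age"
# partitions (yes, no): <=30 -> (2,3), 31..40 -> (4,0), >40 -> (3,2)
partitions = [(2, 3), (4, 0), (3, 2)]
total = sum(sum(p) for p in partitions)
gini_age = sum(sum(p) / total * gini(p) for p in partitions)
print(round(gini_age, 4))   # ~ 0.3429, matching the value given for Age
```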

Find the root node using the Gini index for the given dataset, with target
"Buys_insurance" (values "Yes" or "No").

Gender:
   Male (3): 2 Yes, 1 No
   Female (3): 1 Yes, 2 No

GiniMale = 1 − (2/3)² − (1/3)² = 0.444
GiniFemale = 1 − (1/3)² − (2/3)² = 0.444
GiniGender = (3/6)·0.444 + (3/6)·0.444 = 0.444
Gini impurity for feature "Gender"
   p(Gender=Male) = 3/6 = 0.5
   p(Gender=Female) = 3/6 = 0.5
   Gini impurity for feature "Gender" = 1 − (0.5)² − (0.5)² = 0.5

Gini impurity for feature "Income"
   p(Income=High) = 3/6 = 0.5
   p(Income=Medium) = 1/6 = 0.17
   p(Income=Low) = 2/6 = 0.33
   Gini impurity for feature "Income" = 1 − (0.5)² − (0.17)² − (0.33)² = 0.44

Gini impurity for feature "Credit Score"
   p(Credit Score=Excellent) = 3/6 = 0.5
   p(Credit Score=Fair) = 2/6 = 0.33
   p(Credit Score=Poor) = 1/6 = 0.17
   Gini impurity for feature "Credit Score" = 1 − (0.5)² − (0.33)² − (0.17)² = 0.44

From the above calculations, we can see that the features "Income" and
"Credit Score" have the lowest Gini impurity of 0.44. So, we can select either
one of them as the root node for the decision tree.

Decision Tree Learning

Training data (Day, Outlook, Temperature, Humidity, Wind, PlayTennis):
   D1   Sunny     Hot    High     Weak     No
   D2   Sunny     Hot    High     Strong   No
   D3   Overcast  Hot    High     Weak     Yes
   D4   Rain      Mild   High     Weak     Yes
   D5   Rain      Cool   Normal   Weak     Yes
   D6   Rain      Cool   Normal   Strong   No
   D7   Overcast  Cool   Normal   Strong   Yes
   D8   Sunny     Mild   High     Weak     No
   D9   Sunny     Cool   Normal   Weak     Yes
   D10  Rain      Mild   Normal   Weak     Yes
   D11  Sunny     Mild   Normal   Strong   Yes
   D12  Overcast  Mild   High     Strong   Yes
   D13  Overcast  Hot    Normal   Weak     Yes
   D14  Rain      Mild   High     Strong   No

[See: Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997]
Decision Tree Learning
• Building a Decision Tree
1. First test all attributes and select the one that would function as the best root;
2. Break up the training set into subsets based on the branches of the root node;
3. Test the remaining attributes to see which ones fit best underneath the
branches of the root node;
4. Continue this process for all other branches until
   a. all examples of a subset are of one type
   b. there are no examples left (return majority classification of the parent)
   c. there are no more attributes left (default value should be majority
   classification)

The tree learned from the PlayTennis data corresponds to the rule:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Decision Tree Learning: A Simple Example
• Let E([X+, Y-]) represent a set with X positive training examples and Y
negative examples.
• Let's start off by calculating the entropy of the training set. The entropy of
the training data, E(S), can be represented as E([9+,5-]) because, of the 14
training examples, 9 of them are yes and 5 of them are no.
• E(S) = E([9+,5-]) = (-9/14 log2 9/14) + (-5/14 log2 5/14) = 0.94
• Next we need to calculate the information gain G(S,A) for each attribute A,
where A is taken from the set {Outlook, Temperature, Humidity, Wind}.
• The information gain for Outlook is:
   G(S,Outlook) = E(S) − [5/14·E(Outlook=sunny) + 4/14·E(Outlook=overcast) + 5/14·E(Outlook=rain)]
   G(S,Outlook) = E([9+,5-]) − [5/14·E([2+,3-]) + 4/14·E([4+,0-]) + 5/14·E([3+,2-])]
   G(S,Outlook) = 0.94 − [5/14·0.971 + 4/14·0.0 + 5/14·0.971]
   G(S,Outlook) = 0.246

• G(S,Temperature) = 0.94 − [4/14·E(Temperature=hot) + 6/14·E(Temperature=mild) + 4/14·E(Temperature=cool)]
   G(S,Temperature) = 0.94 − [4/14·E([2+,2-]) + 6/14·E([4+,2-]) + 4/14·E([3+,1-])]
   G(S,Temperature) = 0.94 − [4/14·1.0 + 6/14·0.918 + 4/14·0.811]
   G(S,Temperature) = 0.029

• G(S,Humidity) = 0.94 − [7/14·E(Humidity=high) + 7/14·E(Humidity=normal)]
   G(S,Humidity) = 0.94 − [7/14·E([3+,4-]) + 7/14·E([6+,1-])]
   G(S,Humidity) = 0.94 − [7/14·0.985 + 7/14·0.592]
   G(S,Humidity) = 0.1515
• G(S,Wind) = 0.94 − [8/14·0.811 + 6/14·1.00]
   G(S,Wind) = 0.048

• Outlook has the highest information gain (0.246), so Outlook is our winner!

• Now that we have discovered the root of our decision tree, we must
recursively find the nodes that should go below Sunny, Overcast, and Rain.
• For the Rain branch:
   G(Outlook=Rain, Humidity) = 0.971 − [2/5·E(Outlook=Rain ∧ Humidity=high) + 3/5·E(Outlook=Rain ∧ Humidity=normal)]
   G(Outlook=Rain, Humidity) = 0.02
   G(Outlook=Rain, Wind) = 0.971 − [3/5·0 + 2/5·0]
   G(Outlook=Rain, Wind) = 0.971
• Now our decision tree looks like: Outlook at the root; under Sunny, split on
Humidity (High → No, Normal → Yes); Overcast → Yes; under Rain, split on
Wind (Weak → Yes, Strong → No).
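For comparison with the manual calculation, the sketch below trains a tree on the same PlayTennis table with scikit-learn (assumed to be installed), using one-hot encoding for the categorical attributes and the entropy criterion; the exact splits and thresholds printed are the library's own, not taken from the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
df = pd.DataFrame(rows, columns=["Outlook","Temperature","Humidity","Wind","PlayTennis"])
X = pd.get_dummies(df.drop(columns="PlayTennis"))   # one-hot encode the attributes
y = df["PlayTennis"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text dump of the learned tree
```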
Naïve Bayes Classifier

Probability Theory – Random Experiments
• A random experiment is an observational process whose results cannot be
known in advance.
• The sample space to describe rolling a die has six outcomes: S = {1, 2, 3, 4, 5, 6}.

Probability
• A probability model is a mathematical representation of a random phenomenon.
• An event A is a subset of the sample space S.
• A probability is a numerical value assigned to a given event A. The probability
of an event is written P(A).
• The first two basic rules of probability are the following:
   Rule 1: Any probability P(A) is a number between 0 and 1 (0 ≤ P(A) ≤ 1).
   Rule 2: The probability of the sample space S is equal to 1 (P(S) = 1).
Probability Theory – Random Experiments
• Sample Space: When two dice are rolled, the sample space consists of 36
outcomes, each of which is a pair (first die, second die).

Probability Theory – Probability
• The probability of an event is a number that measures the relative likelihood
that the event will occur.
• The probability of an event A, denoted P(A), must lie within the interval from
0 to 1: 0 ≤ P(A) ≤ 1.
• In a discrete sample space, the probabilities of all simple events must sum to 1,
since it is certain that one of them will occur:
   P(S) = P(E1) + P(E2) + . . . + P(En) = 1

Probability Theory – Rules of Probability
• Complement of an Event
   • The complement of an event A is denoted A' and consists of everything in
   the sample space S except event A.
   • Since A and A' together comprise the sample space, their probabilities sum to 1:
      P(A) + P(A') = 1
      P(A') = 1 − P(A)

Probability Theory – Rules of Probability
• General Law of Addition
      P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Probability Theory – Rules of Probability
• Mutually Exclusive Events
   • Events A and B are mutually exclusive (or disjoint) if their intersection is
   the empty set (a set that contains no elements). In other words, one event
   precludes the other from occurring.
   • The null set is denoted ϕ.

Here are examples of events that are mutually exclusive (cannot be in both categories):
• Customer age: A = under 21, B = over 65
• Purebred dog breed: A = border collie, B = golden retriever

Here are examples of events that are not mutually exclusive (can be in both categories):
• Student's major: A = marketing major, B = economics major
• Credit card held: A = Visa, B = MasterCard, C = American Express

Probability Theory – Rules of Probability
• Special Law of Addition (Mutually Exclusive Events)
   • If A and B are mutually exclusive events, then P(A ∩ B) = 0 and the general
   addition law can be simplified to the sum of the individual probabilities for A
   and B (the special law of addition): P(A ∪ B) = P(A) + P(B).
   • For example, if we look at a person's age, then P(under 21) = 0.28 and
   P(over 65) = 0.12, so P(under 21 or over 65) = 0.28 + 0.12 = 0.40, since these
   events do not overlap.

• Conditional Probability
   • The probability of event A given that event B has occurred is a conditional
   probability, denoted P(A | B), which is read "the probability of A given B."
   The vertical line is read as "given."
   • The conditional probability is the joint probability of A and B divided by
   the probability of B.
Probability Theory – Rules of Probability
• Conditional Probability
   • The sample space is restricted to B, an event that we know has occurred.
   The intersection, (A ∩ B), is the part of B that is also in A.
   • The ratio of the relative size of set (A ∩ B) to set B is the conditional
   probability P(A | B).

Prior probability
Consider the random variables
   cavity = {true, false}
   weather = {sunny, rain, cloudy, snow}
Prior or unconditional probability:
   P(cavity = true) = 0.1
   P(weather = sunny) = 0.72
A probability distribution gives values for all possible assignments:
   P(weather) = {0.72, 0.1, 0.08, 0.1} (normalized, i.e., sums to 1)

Bayes' Rule
The conditional probability of the occurrence of A if event B occurs:
   P(A|B) = P(A ∩ B) / P(B)
This can also be written as:
   P(A ∩ B) = P(A|B) · P(B)
   P(A ∩ B) = P(B|A) · P(A)
Hence
   P(B|A) = P(A|B) · P(B) / P(A)
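As a quick check of the rule, here is a tiny sketch; the numbers (a disease/test scenario) are purely illustrative and not from the slides.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Illustrative values: P(positive|disease)=0.9, P(disease)=0.01, P(positive)=0.05
print(bayes(0.9, 0.01, 0.05))   # P(disease|positive) = 0.18
```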
Naïve Bayesian Classifier
Using Bayes' theorem, we can find the probability of A happening, given that B
has occurred. Here, B is the evidence and A is the hypothesis. The assumption
made here is that the predictors/features are independent, that is, the presence of
one particular feature does not affect the others. Hence it is called naïve.

Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels, and each
tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
• This can be derived from Bayes' theorem:
   P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
• Since P(X) is constant for all classes, only
   P(Ci|X) ∝ P(X|Ci) · P(Ci)
needs to be maximized.

Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
   P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: only the class distribution needs
to be counted.
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for
Ak, divided by |Ci,D| (the number of tuples of Ci in D).
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
   g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))
   and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

Problem 1: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the weather dataset and its per-attribute
frequency counts (table shown on the original slide).


Let us test it on a new set of features (call it "today"):
   today = (Sunny, Hot, Normal, False)

P(Yes|today) = ((3/9)·(2/9)·(6/9)·(6/9)·(9/14)) / P(today) = 0.0141 / P(today)

P(No|today) = ((2/5)·(2/5)·(1/5)·(2/5)·(5/14)) / P(today) = 0.00457 / P(today)

P(today) = P(sunny)·P(hot)·P(normal)·P(no wind)
         = (5/14)·(4/14)·(7/14)·(8/14)

Since P(Yes|today) > P(No|today), the prediction for "today" is Play = Yes.

Problem 2

Naïve Bayesian Classifier: Training Dataset

Class:
   C1: buys_computer = 'yes'
   C2: buys_computer = 'no'

Data sample to classify:
   X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data (age, income, student, credit_rating, buys_computer):
   <=30    high     no    fair        no
   <=30    high     no    excellent   no
   31…40   high     no    fair        yes
   >40     medium   no    fair        yes
   >40     low      yes   fair        yes
   >40     low      yes   excellent   no
   31…40   low      yes   excellent   yes
   <=30    medium   no    fair        no
   <=30    low      yes   fair        yes
   >40     medium   yes   fair        yes
   <=30    medium   yes   excellent   yes
   31…40   medium   no    excellent   yes
   31…40   high     yes   fair        yes
   >40     medium   no    excellent   no
Naïve Bayesian Classifier: An Example
• P(Ci):
   P(buys_computer = "yes") = 9/14 = 0.643
   P(buys_computer = "no") = 5/14 = 0.357

• Compute P(X|Ci) for each class:
   P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
   P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
   P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
   P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
   P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
   P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
   P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
   P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

• X = (age <= 30, income = medium, student = yes, credit_rating = fair)

• P(X|Ci):
   P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
   P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

• P(X|Ci) · P(Ci):
   P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
   P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007

• Therefore, X belongs to class ("buys_computer = yes").
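The same counting can be scripted. The sketch below recomputes the priors, likelihoods and products for X from the training table (the helper names are mine) and reproduces 0.044/0.028 for "yes" and 0.019/0.007 for "no".

```python
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30","high","no","fair","no"),   ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"),(">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),   (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"),("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),  (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"),(">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")          # unseen sample to classify

for cls in ("yes", "no"):
    rows = [r for r in data if r[-1] == cls]
    prior = len(rows) / len(data)              # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(X):              # product of P(xk | Ci)
        count = sum(1 for r in rows if r[i] == value)
        likelihood *= count / len(rows)
    print(cls, round(likelihood, 3), round(likelihood * prior, 3))
# yes: P(X|yes)=0.044, P(X|yes)P(yes)=0.028  ->  predict buys_computer = yes
# no : P(X|no)=0.019,  P(X|no)P(no)=0.007
```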

Car Theft Example
• Attributes are Color, Type, and Origin; the target, Stolen, can be either yes or no.
• Task: classify a Red Domestic SUV.

RANDOM FOREST CLASSIFIER

What is Random Forest?
• Supervised learning algorithm.
• Forest: an ensemble of decision trees, usually trained with the "bagging" method.
• Builds multiple decision trees and merges them together to get a more accurate
and stable prediction.

Ensemble learning
• What? Ensemble models in machine learning combine the decisions from
multiple models to improve the overall performance.
• How?
   • Taking the mode of the results (majority voting)
   • Taking a weighted average of the results

Bagging?
• Bootstrap AGGregatING.
• Create random samples of the training data set with replacement (subsets of
the training data set).
• Build a model (classifier or decision tree) for each sample.
• Combine the results of these multiple models using averaging or majority voting.

Random Forest Classifier
• Random forest adds additional randomness to the model while growing the trees.
• Only a random subset of the features is taken into consideration by the
algorithm for splitting a node.
• Randomly selects observations and features to build several decision trees and
then averages the results.
• This results in a wide diversity that generally leads to a better model.
Random Forest Classifier
Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both classification and
regression problems in ML. It is based on the concept of ensemble learning,
which is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one
decision tree, the random forest takes the prediction from each tree and, based
on the majority votes of predictions, predicts the final output.

The greater the number of trees in the forest, the higher the accuracy and the
lower the risk of overfitting.

Working of Random Forest Algorithm
Ensemble simply means combining multiple models; a collection of models is
used to make predictions rather than an individual model.

Ensemble uses two types of methods:
1. Bagging – It creates different training subsets from the sample training data
with replacement, and the final output is based on majority voting. For example,
Random Forest.
2. Boosting – It combines weak learners into strong learners by creating
sequential models such that the final model has the highest accuracy. For
example, AdaBoost, XGBoost.
Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used
by random forest. Bagging chooses a random sample/random subset from the
entire data set. Hence each model is generated from the samples (bootstrap
samples) provided by the original data with replacement, known as row
sampling. This step of row sampling with replacement is called bootstrap. Each
model is then trained independently, which generates results. The final output
is based on majority voting after combining the results of all models. This step,
which involves combining all the results and generating the output based on
majority voting, is known as aggregation.

Steps Involved in Random Forest Algorithm
Step 1: In the Random Forest model, a subset of data points and a subset of
features is selected for constructing each decision tree. Simply put, n random
records and m features are taken from the data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting or averaging, for
classification and regression respectively.
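These steps can be sketched with scikit-learn (assumed to be installed); the synthetic dataset and parameter values below are illustrative and not from the slides. `n_estimators` and `max_features` correspond to the two random-forest parameters mentioned later in the Decision Tree vs. Random Forest comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_features = how many features to try per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)                     # bagged trees, majority voting
print("test accuracy:", forest.score(X_test, y_test))
```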

Random Forest Algorithm: Example

Important Features of Random Forest
• Diversity: Not all attributes/variables/features are considered while making an
individual tree; each tree is different.
• Immune to the curse of dimensionality: Since each tree does not consider all
the features, the feature space is reduced.
• Parallelization: Each tree is created independently out of different data and
attributes. This means we can fully use the CPU to build random forests.
• Stability: Stability arises because the result is based on majority voting/averaging.
Decision Tree vs. Random Forest
• Decision Tree + Bagging → Random Forest
• Random Forest has 2 parameters:
   • A parameter to specify the number of trees
   • A parameter that controls how many features to try when finding the best split

Random Forest: Pros & Cons
Pros
• Versatility: used for both regression and classification models.
• The default hyperparameters it uses often produce a good prediction result.
• With enough trees in the forest, the classifier is unlikely to overfit the model.
Cons
• A large number of trees can make the algorithm too slow and ineffective for
real-time predictions.

SUPPORT VECTOR MACHINE

SVM
• Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be used for both classification and regression challenges. However, it
is mostly used in classification problems. In the SVM algorithm, we plot each
data item as a point in n-dimensional space (where n is the number of features),
with the value of each feature being the value of a particular coordinate. Then,
we perform classification by finding the hyperplane that differentiates the two
classes very well.
• It is a supervised machine learning problem where we try to find a hyperplane
that best separates the two classes.
• Support vectors are simply the coordinates of individual observations. The
SVM classifier is a frontier that best segregates the two classes (hyperplane/line).

Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means if
a dataset can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is called a
Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means if a dataset cannot be classified by using a straight line, then such
data is termed non-linear data and the classifier used is called a Non-linear
SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm
• Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM. The dimension of the hyperplane depends upon the number
of features. If the number of input features is 2, then the hyperplane is just a
line. If the number of input features is 3, then the hyperplane becomes a
two-dimensional plane. It becomes difficult to imagine when the number of
features exceeds 3.

Support Vector Machines
• The line that maximizes the minimum margin is a good bet.
• This maximum-margin separator is determined by a subset of the datapoints.
• Datapoints in this subset are called "support vectors"; in plots they are usually
indicated by circles around them.
• It is useful computationally if only a small fraction of the datapoints are
support vectors, because we use the support vectors to decide which side of the
separator a test case is on.
Mathematical Intuition
• The dot product can be defined as the projection of one vector along another,
multiplied by the magnitude of the other vector.
• Consider a random point X; we want to know whether it lies on the right side
of the plane or the left side of the plane (positive or negative).
• To find this, we first treat this point as a vector (X) and then construct a
vector (w) which is perpendicular to the hyperplane. Let's say the distance from
the origin to the decision boundary along w is 'c'. Now we take the projection
of the vector X on w.
• If the dot product is greater than 'c', then we can say that the point lies on the
right side. If the dot product is less than 'c', then the point is on the left side,
and if the dot product is equal to 'c', then the point lies on the decision boundary.
• The equation of a hyperplane is w·x + b = 0, where w is a vector normal to the
hyperplane and b is an offset.
• To classify a point as negative or positive we need to define a decision rule:
predict the positive class if w·x + b ≥ 0, and the negative class otherwise.
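The decision rule can be written in a couple of lines; the weight vector, offset and test points below are purely illustrative, not from the slides.

```python
import numpy as np

def svm_predict(w, b, x):
    """Decision rule: classify by the sign of w.x + b."""
    return +1 if np.dot(w, x) + b >= 0 else -1

# Illustrative values only
w = np.array([2.0, -1.0])   # normal vector to the hyperplane
b = -3.0                    # offset
print(svm_predict(w, b, np.array([4.0, 1.0])))   # 2*4 - 1 - 3 = 4 >= 0 -> +1
print(svm_predict(w, b, np.array([0.0, 2.0])))   # -2 - 3 = -5 < 0      -> -1
```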

Support Vector Classifier

Cardinality of the weight vector

Kernel Trick
• Non-linear transformations: map the data with a function ϕ so that it becomes
linearly separable, e.g. ϕ(x) = x².

Kernels
• Linear kernel
• Polynomial kernel
• Gaussian kernel or Radial basis function (RBF) kernel
• Sigmoid kernel: mostly preferred for neural networks. This kernel function is
similar to a two-layer perceptron model of the neural network, which works as
an activation function for neurons.

Multiclass classification - SVM
• In its most simple type, SVM doesn't support multiclass classification
natively. It supports binary classification and separating data points into two
classes. For multiclass classification, the same principle is utilized after
breaking down the multiclass problem into multiple binary classification
problems.
• The idea is to map data points to a high-dimensional space to gain mutual
linear separation between every two classes. This is called the One-to-One
approach, which breaks down the multiclass problem into multiple binary
classification problems: a binary classifier per each pair of classes. For m
classes it needs m(m-1)/2 SVMs.
• Another approach is One-to-Rest, in which the breakdown is set to a binary
classifier per each class. For m classes it needs m SVMs.

Let's take an example of a 3-class classification problem: green, red, and blue.

One-to-One approach
• We need a hyperplane to separate between every two classes, neglecting the
points of the third class. This means the separation takes into account only the
points of the two classes in the current split. For example, the red-blue line
tries to maximize the separation only between blue and red points; it has
nothing to do with the green points.

One-to-Rest approach
• In the One-to-Rest approach, we need a hyperplane to separate between a
class and all others at once. This means the separation takes all points into
account, dividing them into two groups: a group for the class points and a
group for all other points. For example, the green line tries to maximize the
separation between green points and all other points at once.

Advantages of SVM
• Support vector machine works comparably well when there is an understandable
margin of dissociation between classes.
• It is more productive in high-dimensional spaces.
• It is effective in instances where the number of dimensions is larger than the
number of specimens.
• Support vector machine is comparably memory efficient.
Support Vector Machine (SVM) is a powerful supervised machine learning
algorithm with several advantages. Some of the main advantages of SVM include:
• Handling high-dimensional data: SVMs are effective in handling
high-dimensional data, which is common in many applications such as image
and text classification.
• Handling small datasets: SVMs can perform well with small datasets, as they
only require a small number of support vectors to define the boundary.
• Modeling non-linear decision boundaries: SVMs can model non-linear decision
boundaries by using the kernel trick, which maps the data into a
higher-dimensional space where the data becomes linearly separable.
• Robustness to noise: SVMs are robust to noise in the data, as the decision
boundary is determined by the support vectors, which are the closest data points
to the boundary.
• Generalization: SVMs have good generalization performance, which means
that they are able to classify new, unseen data well.
• Versatility: SVMs can be used for both classification and regression tasks, and
can be applied to a wide range of applications such as natural language
processing, computer vision and bioinformatics.
• Sparse solution: SVMs have sparse solutions, which means that they only use
a subset of the training data to make predictions. This makes the algorithm more
efficient and less prone to overfitting.
• Regularization: SVMs can be regularized, which means that the algorithm can
be modified to avoid overfitting.

Disadvantages of SVM
• Computationally expensive: SVMs can be computationally expensive for large
datasets, as the algorithm requires solving a quadratic optimization problem.
• Choice of kernel: The choice of kernel can greatly affect the performance of an
SVM, and it can be difficult to determine the best kernel for a given dataset.
• Sensitivity to the choice of parameters: SVMs can be sensitive to the choice of
parameters, such as the regularization parameter, and it can be difficult to
determine the optimal parameter values for a given dataset.
• Memory-intensive: SVMs can be memory-intensive, as the algorithm requires
storing the kernel matrix, which can be large for large datasets.
• Limited to two-class problems: SVMs are primarily used for two-class
problems, although multi-class problems can be solved by using one-versus-one
or one-versus-all strategies.
• Lack of probabilistic interpretation: SVMs do not provide a probabilistic
interpretation of the decision boundary, which can be a disadvantage in some
applications.
• Not suitable for large datasets with many features: SVMs can be very slow and
can consume a lot of memory when the dataset has many features.
