• Logistic Regression
• Decision Tree
• Naïve Bayes
• Random Forest
• SVM Classifier

LOGISTIC REGRESSION
A Classification Problem
• VITEE score vs. admission in VIT
• Admitted (1)
• Not admitted (0)
• 𝑦 ∈ {0,1}
Binary classification
Logistic Regression
Substituting p(x) = 0.95, β0 = −120 and β1 = 2, we get
xmin = 61.47
iii) With the chosen parameters, find the probability of declaring 61.5 as pass
grade.
Substituting x1 = 61.5, β0 = −120 and β1 = 2, we get p ≈ 95%.
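A minimal Python sketch of the two computations above, assuming the standard logistic model p(x) = 1 / (1 + e^−(β0 + β1·x)); the helper names are illustrative:

```python
import math

def logistic(x, b0=-120.0, b1=2.0):
    """Logistic model p(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def x_for_probability(p, b0=-120.0, b1=2.0):
    """Invert the model: solve p = logistic(x) for x via the log-odds."""
    return (math.log(p / (1.0 - p)) - b0) / b1

print(x_for_probability(0.95))  # ~61.47, the minimum pass score
print(logistic(61.5))           # ~0.95, probability that 61.5 is a pass
```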
Decision Trees
A decision tree is a non-parametric supervised learning algorithm, which is
utilized for both classification and regression tasks. It has a hierarchical tree
structure, which consists of a root node, branches, internal nodes and leaf
nodes.

A tree can be “learned” by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is completed
with leaf nodes.

A decision tree starts with a root node, which does not have any incoming
branches. The outgoing branches from the root node then feed into the
internal nodes, also known as decision nodes. The leaf nodes represent all
the possible outcomes within the dataset.
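To make the node terminology concrete, here is a minimal sketch of a tree-node type in Python; the field names are illustrative assumptions, not from the slides:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One node of a decision tree.

    Internal (decision) nodes test an attribute and hold one child per
    attribute value; leaf nodes hold only a class label.
    """
    feature: Optional[str] = None                  # attribute tested here
    children: dict = field(default_factory=dict)   # attribute value -> Node
    label: Optional[str] = None                    # set only at leaf nodes

    def is_leaf(self) -> bool:
        return self.label is not None
```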
[Decision-tree fragment: a split on TaxInc, with < 80K leading to NO and > 80K leading to YES.]
What is a decision tree?
• A model in the form of a tree structure, made up of decision nodes

What strategy is followed?
• Starts with the whole data set and recursively partitions the data set into
smaller subsets
• A divide-and-conquer strategy
Problem
• Imagine that you are working for a Hollywood film studio, and your desk is
piled high with screenplays.
• You decide to develop a decision tree algorithm to predict whether a
potential movie would fall into one of three categories:
• mainstream hit
• critic's choice
• box office bust

Model building
• Assume that you have gathered data from studio archives to examine the
previous ten years of movie releases.
• After reviewing the data for 30 different movie scripts, there seems to be a
relationship between:
• the film's proposed shooting budget,
• the number of A-list celebrities lined up for starring roles, and
• the category of success.
Choosing the Best Split
• If the partitions contain only a single class, they are considered pure.
• Different measurements of purity are used to identify the splitting
criterion, as sketched below:
• Entropy & Information Gain (C5.0)
• Gini Index (CART)
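Here is a minimal Python sketch of these purity measures, assuming class labels come as a plain list; the function names are illustrative:

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p(i) of each class among the labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Entropy = -sum_i p(i) * log2 p(i); zero for a pure partition."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def gini(labels):
    """Gini impurity = 1 - sum_i p(i)^2; zero for a pure partition."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def information_gain(parent_labels, partitions):
    """Entropy(parent) minus the size-weighted entropy of the partitions."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(p) / n * entropy(p) for p in partitions)

print(entropy(["yes", "no"]), gini(["yes", "no"]))   # 1.0 0.5 (maximally impure)
print(entropy(["yes"] * 5), gini(["yes"] * 5))       # both zero (pure partition)
```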
Training
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Gini impurity = 1 − Σi p(i)², where p(i) is the probability of a specific class
and the summation is done for all classes present in the dataset.
p(Yes) = 0.3 and p(No) = 0.7
Find the Gini index.
Gini impurity = 1 − (0.3)² − (0.7)² = 0.42
In the example, the Gini impurity value of 0.42 represents that there is a
42% chance of misclassifying a sample if we were to randomly assign a
label from the dataset to that sample. This means that the dataset is not
completely pure, and there is some degree of disorder in it.
Let’s consider a toy dataset with the following features and class labels. Find
the root node using the Gini index.

Student Background: The Gini formula requires us to calculate the Gini index
for each sub-node, then take a weighted average to calculate the overall Gini
index for the node.
• Gini(age) = 0.3428
• Gini(income) = 0.438
• Gini(student) = 0.3673
• Gini(credit_rating) = 0.428
Age has the smallest Gini index, so it is selected as the root node.
Let’s consider a toy dataset with the following features and class labels. Find
the root node using the Gini index. (https://blog.quantinsti.com/gini-index/)
The target class label is “Buys_insurance” and it can take two values, “Yes”
or “No”.
Find the root node using the Gini index for the given dataset with target
Buys_insurance.

Gini impurity for feature “Gender”:
• Male (3): 2 Yes, 1 No
• Female (3): 1 Yes, 2 No
Gini(Male) = 1 − (2/3)² − (1/3)² = 0.444
Gini(Female) = 1 − (1/3)² − (2/3)² = 0.444
Gini(Gender) = (3/6)·0.444 + (3/6)·0.444 = 0.444
The Gini impurity for feature “Income” is computed in the same way (see the
sketch below).
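A small Python sketch of this weighted-average computation, using the six (Gender, Buys_insurance) pairs above; the helper names are illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one partition: 1 - sum_i p(i)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_for_feature(rows, feature_idx, label_idx=-1):
    """Weighted average of sub-node Gini values, weighted by sub-node size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_idx], []).append(row[label_idx])
    n = len(rows)
    return sum(len(lbls) / n * gini(lbls) for lbls in groups.values())

data = [("Male", "Yes"), ("Male", "Yes"), ("Male", "No"),
        ("Female", "Yes"), ("Female", "No"), ("Female", "No")]
print(round(gini_for_feature(data, 0), 3))  # 0.444, matching the slide
```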
1. First test all attributes and select the one that would function as the best
root;
2. Break up the training set into subsets based on the branches of the root
node;
3. Test the remaining attributes to see which ones fit best underneath the
branches of the root node;
4. Continue this process for all other branches until
a. all examples of a subset are of one type,
b. there are no examples left (return the majority classification of the parent), or
c. there are no more attributes left (the default value should be the majority
classification).
This procedure is sketched below.
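A compact sketch of this recursive procedure in Python, reusing the gini and gini_for_feature helpers sketched earlier; it is an illustrative implementation of the steps above, not the exact C5.0 or CART algorithm:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, features, label_idx=-1):
    """Recursively partition rows; returns a nested dict or a class label."""
    labels = [row[label_idx] for row in rows]
    if len(set(labels)) == 1:          # case 4a: all examples of one type
        return labels[0]
    if not features:                   # case 4c: no attributes left
        return majority(labels)
    # Steps 1/3: pick the attribute with the lowest weighted Gini impurity
    best = min(features, key=lambda f: gini_for_feature(rows, f, label_idx))
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    # Step 2: split the set on each value of the chosen attribute
    # (case 4b cannot occur here, since we only split on values present)
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, label_idx)
    return tree
```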
The resulting tree corresponds to the rule:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Probability
• A probability model is a mathematical representation of a random phenomenon.
• A random experiment is an observational process whose results cannot be
known in advance.
• The sample space to describe rolling a die has six outcomes:
S = {1, 2, 3, 4, 5, 6}
• An event A is a subset of the sample space S.
• Rule 1: Any probability P(A) is a number between 0 and 1 (0 ≤ P(A) ≤ 1).
Probability Theory – Random Experiments
• Sample Space
• When two dice are rolled, the sample space consists of 36 outcomes, each of
which is a pair.

Probability Theory – Probability
• The probability of an event is a number that measures the relative likelihood
that the event will occur.
• The probability of an event A, denoted P(A), must lie within the interval from
0 to 1: 0 ≤ P(A) ≤ 1.
• In a discrete sample space, the probabilities of all simple events must sum to
1, since it is certain that one of them will occur:
P(S) = P(E1) + P(E2) + . . . + P(En) = 1
Probability Theory – Complement of an Event
• The complement of an event A is denoted A’ and consists of everything in the
sample space S except event A.
• Since A and A’ together comprise the sample space, their probabilities sum
to 1:
P(A) + P(A’) = 1
P(A’) = 1 – P(A)
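A short Python check of these rules on the two-dice sample space; the event chosen (a total of 7) is an illustrative assumption:

```python
from itertools import product

# Sample space for rolling two dice: 36 equally likely pairs
S = list(product(range(1, 7), repeat=2))

# Event A: the total is 7; its complement A' is "the total is not 7"
A = [pair for pair in S if sum(pair) == 7]
p_A = len(A) / len(S)

print(p_A)               # 6/36 ≈ 0.1667, which lies between 0 and 1
print(p_A + (1 - p_A))   # P(A) + P(A') = 1
```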
Probability Theory – Rules of Probability
Here are examples of events that are not mutually exclusive (can be in both
categories):
• Student’s major: A = marketing major, B = economics major
• Credit card held: A = Visa, B = MasterCard, C = American Express
Probability Theory – Rules of Probability
• Conditional Probability
• The sample space is restricted to B, an event that we know has occurred.
The intersection (A ∩ B) is the part of B that is also in A.
• The ratio of the relative size of set (A ∩ B) to set B is the conditional
probability P(A | B).
Prior probability
Consider the random variables:
cavity = {true, false}
weather = {sunny, rain, cloudy, snow}
Prior or unconditional probability:
P(cavity = true) = 0.1
P(weather = sunny) = 0.72
A probability distribution gives the values of all possible assignments:
P(weather) = {0.72, 0.1, 0.08, 0.1} (normalized, i.e., sums to 1)

Bayes’ Rule
The conditional probability of the occurrence of A if event B occurs:
P(A|B) = P(A ∩ B) / P(B)
This can also be written as:
P(A ∩ B) = P(A|B) · P(B)
P(A ∩ B) = P(B|A) · P(A)
Equating the two expressions for P(A ∩ B) gives Bayes’ rule:
P(A|B) = P(B|A) · P(A) / P(B)
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e.,
no dependence relation between attributes):
P(X|Ci) = ∏ (k = 1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
• This greatly reduces the computation cost: only counts the class
distribution.
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak
divided by |Ci,D| (# of tuples of Ci in D).
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with a mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)

Problem 1: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the below dataset:
[Weather dataset: 14 records with attributes outlook, temperature, humidity
and wind, and the class label play.]
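For the continuous case, a minimal Python sketch of the Gaussian density g above:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = (1 / (sqrt(2*pi)*sigma)) * exp(-(x-mu)**2 / (2*sigma**2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Example: likelihood of x = 1.0 under a standard normal (mu = 0, sigma = 1)
print(gaussian(1.0, 0.0, 1.0))  # ≈ 0.2420
```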
P(Yes|today) = (P(sunny|Yes) · P(hot|Yes) · P(normal|Yes) · P(no wind|Yes) · P(Yes)) / P(today)
= ((2/9) · (2/9) · (6/9) · (6/9) · (9/14)) / P(today)
= 0.0141 / P(today)
P(No|today) = ((2/5) · (2/5) · (1/5) · (2/5) · (5/14)) / P(today)
= 0.00457 / P(today)
P(today) = P(sunny) · P(hot) · P(normal) · P(no wind)
= (5/14) · (4/14) · (7/14) · (8/14)
Since 0.0141 > 0.00457, the classifier predicts Yes: the player should play.
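A short Python sketch reproducing this comparison; the fractions are the ones used above (9 Yes and 5 No records out of 14), and P(today) is omitted since it cancels in the comparison:

```python
# Class priors from the 14-record dataset
p_yes, p_no = 9 / 14, 5 / 14

# Likelihoods of today's feature values (sunny, hot, normal, no wind) per class
likelihood_yes = (2 / 9) * (2 / 9) * (6 / 9) * (6 / 9)
likelihood_no = (2 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

score_yes = likelihood_yes * p_yes   # ≈ 0.0141
score_no = likelihood_no * p_no      # ≈ 0.00457

# P(today) divides both scores, so they can be compared directly
print("play" if score_yes > score_no else "don't play")  # play
```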
RANDOM FOREST CLASSIFIER
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used
by random forest. Bagging chooses a random sample/random subset from the
entire data set. Hence each model is generated from the samples (Bootstrap
Samples) provided by the original data with replacement, known as row
sampling. This step of row sampling with replacement is called bootstrap.
Now each model is trained independently, which generates results. The final
output is based on majority voting after combining the results of all models.
This step, which involves combining all the results and generating output
based on majority voting, is known as aggregation.

Steps Involved in Random Forest Algorithm

Step 1: In the random forest model, a subset of data points and a subset of
features is selected for constructing each decision tree. Simply put, n random
records and m features are taken from the data set having k number of records.

Step 2: Individual decision trees are constructed for each sample.

Step 3: Each decision tree will generate an output.

Step 4: The final output is obtained by combining the outputs of all the trees
by majority voting, as described under aggregation above.
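A minimal sketch of these steps in Python, using scikit-learn's DecisionTreeClassifier as the per-sample model; the toy data set and the ensemble sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # k = 100 records, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy binary target

trees, n_trees, n_feats = [], 25, 2
for _ in range(n_trees):
    # Step 1: bootstrap sample of rows (with replacement) + random feature subset
    rows = rng.integers(0, len(X), size=len(X))
    feats = rng.choice(X.shape[1], size=n_feats, replace=False)
    # Step 2: fit one decision tree per bootstrap sample
    tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
    trees.append((tree, feats))

# Steps 3-4: each tree votes; the majority vote is the final prediction
votes = np.array([t.predict(X[:, f]) for t, f in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print((pred == y).mean())  # training accuracy of the ensemble
```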
Pros & Cons
Pros
• Versatility – used for both regression and classification models
• The default hyperparameters it uses often produce a good prediction result
• With enough trees in the forest, the classifier is unlikely to overfit the model

Cons
• A large number of trees can make the algorithm too slow and ineffective for
real-time predictions

SUPPORT VECTOR MACHINE
SVM
• Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be used for both classification and regression challenges. However,
it is mostly used in classification problems. In the SVM algorithm, we plot
each data item as a point in n-dimensional space (where n is the number of
features), with the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the hyperplane that
differentiates the two classes very well.
• It is a supervised machine learning problem where we try to find a
hyperplane that best separates the two classes.
• Support vectors are simply the coordinates of individual observations. The
SVM classifier is a frontier that best segregates the two classes
(hyperplane/line).
Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find the best decision
boundary that helps to classify the data points. This best boundary is known
as the hyperplane of SVM. The dimension of the hyperplane depends upon the
number of features. If the number of input features is 2, then the hyperplane
is just a line. If the number of input features is 3, then the hyperplane
becomes a two-dimensional plane. It becomes difficult to imagine when the
number of features exceeds 3.

Types of SVM

• Linear SVM: Linear SVM is used for linearly separable data, which means
that if a dataset can be classified into two classes by using a single straight
line, then such data is termed linearly separable data, and the classifier used
is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line,
then such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
• Example of a feature mapping used to make non-linear data separable:
ϕ(x) = x²
Kernels
• Linear kernel
• Polynomial kernel
• Gaussian kernel or Radial basis function (RBF) kernel
• Sigmoid kernel
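A brief scikit-learn sketch showing how these kernels are selected; the toy data set is an illustrative assumption:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data that is not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Each kernel listed above corresponds to a `kernel` argument of SVC
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy per kernel
```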
Advantages of SVM
• Support vector machine works comparably well when there is a clear margin
of separation between classes.
• It is more effective in high-dimensional spaces.
• It is effective in instances where the number of dimensions is larger than
the number of samples.
• Support vector machine is comparably memory efficient.

Support Vector Machine (SVM) is a powerful supervised machine learning
algorithm with several advantages. Some of the main advantages of SVM include:
• Handling high-dimensional data: SVMs are effective in handling
high-dimensional data, which is common in many applications such as image
and text classification.
• Handling small datasets: SVMs can perform well with small datasets, as they
only require a small number of support vectors to define the boundary.
• Modeling non-linear decision boundaries: SVMs can model non-linear decision
boundaries by using the kernel trick, which maps the data into a
higher-dimensional space where the data becomes linearly separable.
• Robustness to noise: SVMs are robust to noise in the data, as the decision
boundary is determined by the support vectors, which are the closest data
points to the boundary.
• Generalization: SVMs have good generalization performance, which means that
they are able to classify new, unseen data well.
• Versatility: SVMs can be used for both classification and regression tasks,
and they can be applied to a wide range of applications such as natural
language processing, computer vision and bioinformatics.
• Sparse solution: SVMs have sparse solutions, which means that they only use
a subset of the training data to make predictions. This makes the algorithm
more efficient and less prone to overfitting.
• Regularization: SVMs can be regularized, which means that the algorithm can
be modified to avoid overfitting.

Disadvantages of support vector machine:
• Computationally expensive: SVMs can be computationally expensive for large
datasets, as the algorithm requires solving a quadratic optimization problem.
• Choice of kernel: The choice of kernel can greatly affect the performance of
an SVM, and it can be difficult to determine the best kernel for a given
dataset.
• Sensitivity to the choice of parameters: SVMs can be sensitive to the choice
of parameters, such as the regularization parameter, and it can be difficult
to determine the optimal parameter values for a given dataset.
• Memory-intensive: SVMs can be memory-intensive, as the algorithm requires
storing the kernel matrix, which can be large for large datasets.
• Limited to two-class problems: SVMs are primarily used for two-class
problems, although multi-class problems can be solved by using one-versus-one
or one-versus-all strategies.
• Lack of probabilistic interpretation: SVMs do not provide a probabilistic
interpretation of the decision boundary, which can be a disadvantage in some
applications.
• Not suitable for large datasets with many features: SVMs can be very slow
and can consume a lot of memory when the dataset has many features.