Chapter Two
Supervised Learning
For example, if you want to predict the speed of a car given the distance, it is a regression problem. Sorting emails into two classes, spam and non-spam, is a classification problem.
How Linear Regression Works
[Worked example table: for the sample data, the mean of x is 3 and the mean of y is 3.6.]
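To make this concrete, here is a minimal least-squares sketch in Python. The five (x, y) pairs are assumed for illustration and are chosen so that their means match the worked example above (mean of x = 3, mean of y = 3.6).

```python
# Minimal least-squares sketch; the sample data is assumed for illustration.
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]

x_mean = sum(x) / len(x)          # 3.0
y_mean = sum(y) / len(y)          # 3.6

# slope b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
     / sum((xi - x_mean) ** 2 for xi in x)
b0 = y_mean - b1 * x_mean         # intercept b0 = y_mean - b1 * x_mean
print(f"y = {b0:.2f} + {b1:.2f}x")   # y = 2.40 + 0.40x
```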
• A higher degree polynomial can fit the data better, but it can also lead to
overfitting.
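The sketch below illustrates this point on an assumed noisy linear dataset: as the polynomial degree grows, the training error shrinks toward zero, even though the high-degree fit would generalize poorly to new points.

```python
import numpy as np
from numpy.polynomial import Polynomial

# A small overfitting sketch; the noisy linear data is assumed.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 9):
    p = Polynomial.fit(x, y, degree)      # least-squares polynomial fit
    sse = np.sum((y - p(x)) ** 2)         # training error only
    print(f"degree {degree}: training SSE = {sse:.4f}")
# The degree-9 curve almost interpolates all 10 points (SSE near 0),
# yet it would generalize poorly to unseen x values.
```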
• The decision tree algorithm can be used for solving both regression and classification problems.
• In a decision tree, each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label, i.e., the decision made after computing all of the attributes.
o Root node – has no incoming edges and zero or more outgoing edges.
o Internal nodes – each of which has exactly one incoming edge and two or more outgoing
edges.
o Leaf nodes or terminal nodes – each of which has one incoming edge and no outgoing edge.
• The classification rules are represented by the paths from the root to the leaf nodes.
Training set for predicting borrowers who will default on loan payments
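As an illustration of rules-as-paths, here is a hypothetical decision tree for this loan-default setting written as nested if/else statements; the attribute names (home_owner, income) and the threshold are invented for the sketch, not taken from the actual training set.

```python
# Hypothetical sketch: each root-to-leaf path below is one classification rule.
# The attributes (home_owner, income) and the threshold are invented here.
def classify(borrower: dict) -> str:
    if borrower["home_owner"] == "yes":      # root node: test on an attribute
        return "no default"                  # leaf node: class label
    if borrower["income"] > 80_000:          # internal node: second test
        return "no default"
    return "default"                         # rule: home_owner=no AND income<=80k

print(classify({"home_owner": "no", "income": 50_000}))   # -> 'default'
```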
• Just randomly selecting any node to be the root can't solve the issue: a random approach may give bad results with low accuracy.
• To solve this attribute selection problem, researchers devised criteria such as information gain, gain ratio, and Gini index.
Before going further, I will explain some important terms related to decision trees.
• Entropy: In machine learning, entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
E(S) = -P(yes)·log2(P(yes)) - P(no)·log2(P(no))
     = -[P(yes)·log2(P(yes)) + P(no)·log2(P(no))]
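A minimal sketch of this formula in Python (treating 0·log2(0) as 0); the 9-yes/5-no counts are the standard 14-day play-tennis example assumed here.

```python
from math import log2

# Entropy E(S) as defined above; terms with p = 0 contribute nothing.
def entropy(*probs: float) -> float:
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed standard play-tennis counts: 9 yes, 5 no out of 14 days.
print(round(entropy(9/14, 5/14), 3))   # 0.94
```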
• Information Gain: it measures the reduction in entropy after the dataset is split on an attribute, and it decides which attribute should be selected as a decision node.
Information Gain = E(S) − Σ [(weighted avg.) × E(each feature value)]
• Similarly, I calculated the gain for the remaining three attributes [Temp, Humidity, and Windy].
Accordingly,
• IG(Outlook) = 0.247
• IG(Humidity) = 0.152
• IG(Temp) = 0.029
• IG(Windy) = 0.048
Select the attribute with the maximum gain as the root node. So, Outlook is our ROOT node.
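The sketch below reproduces IG(Outlook) = 0.247, assuming the standard play-tennis counts per Outlook value (Sunny 2 yes / 3 no, Overcast 4 / 0, Rainy 3 / 2 out of 14 days).

```python
from math import log2

def entropy(*probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed (yes, no) counts per Outlook value from the standard play-tennis data.
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
total = 14
e_s = entropy(9/14, 5/14)                       # E(S) ~ 0.940

# Weighted average of the entropy of each Outlook value.
weighted = sum((y + n) / total * entropy(y/(y+n), n/(y+n))
               for y, n in outlook.values())
print(round(e_s - weighted, 3))                 # 0.247 = IG(Outlook)
```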
• To identify the attribute that comes under SUNNY, you again compute the new information gain for Temperature, Humidity, and Windy.
Accordingly
• IG(Temp) = 0.571
• IG(Humidity) = 0.971
• IG(Windy) = 0.020
• Here, IG(Humidity) is the largest value. So Humidity is the node that comes
under sunny.
Humidity   Play = Yes   Play = No
High            0            3
Normal          2            0
For Humidity, from the above table we can say that play will occur if humidity is normal and will not occur if it is high. Similarly, we can find the nodes under rainy.
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
• The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
o Bayes: it is based on the principle of Bayes' theorem,
P(A|B) = P(B|A) · P(A) / P(B)
Where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the prior probability, and P(B) is the marginal probability (evidence).
• The working of the Naïve Bayes classifier can be understood with the help of the example below:
• Assume the features are Outlook, Temperature, Humidity, and Wind, and we want to predict whether we can play outside when weather = (sunny, cool, high, and strong).
• Problem: if the weather is sunny, should the player play or not?
Frequency table for the Outlook feature:

Outlook     Yes   No
Overcast      5    0
Rainy         2    2
Sunny         3    2
Total        10    4

Likelihood table for the Outlook feature:

Outlook     No   Yes
Overcast     0    5    5/14 = 0.36
Rainy        2    2    4/14 = 0.29
Sunny        2    3    5/14 = 0.35
• P(Yes|Sunny) = P(Sunny|Yes) · P(Yes) / P(Sunny)
• P(Sunny|Yes) = 3/10 = 0.30, P(Yes) = 10/14 = 0.71, and P(Sunny) = 5/14 = 0.35
• So P(Yes|Sunny) = 0.30 × 0.71 / 0.35 ≈ 0.60
• P(No|Sunny) = P(Sunny|No) · P(No) / P(Sunny)
• P(Sunny|No) = 2/4 = 0.50, P(No) = 4/14 = 0.29, and P(Sunny) = 5/14 = 0.35
• So P(No|Sunny) = 0.50 × 0.29 / 0.35 ≈ 0.41
• Finally, since the probability of Yes is greater than that of No, the player can play on a sunny day.
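A short sketch of the same calculation; all the probabilities below come from the frequency table above.

```python
# Naive Bayes posterior comparison; probabilities taken from the table above.
p_yes, p_no = 10/14, 4/14          # class priors
p_sunny = 5/14                     # evidence
p_sunny_given_yes = 3/10           # likelihoods
p_sunny_given_no = 2/4

post_yes = p_sunny_given_yes * p_yes / p_sunny   # ~ 0.60
post_no  = p_sunny_given_no  * p_no  / p_sunny   # ~ 0.41
print("Play" if post_yes > post_no else "Don't play")   # -> Play
```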
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time.
• At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure.
• The KNN model will find the features of the new data set that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm (a small sketch follows the steps):
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance to the neighbors.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
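A minimal pure-Python sketch of these steps; the five labeled training points are assumed for illustration.

```python
from collections import Counter
from math import dist

# Assumed toy training data: (point, label) pairs.
train = [((1.0, 2.0), "A"), ((2.0, 3.0), "A"), ((6.0, 5.0), "B"),
         ((7.0, 7.0), "B"), ((5.0, 6.0), "B")]

def knn_predict(query, k=3):
    # Steps 1-3: pick K, compute Euclidean distances, keep the K nearest.
    neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
    # Steps 4-5: count labels among the K neighbors, return the majority.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((6.0, 6.0)))   # -> 'B'
```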
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
• There are three properties that make SVMs attractive:
o SVMs construct a maximum margin separator – a decision boundary with the largest possible distance to example points.
o SVMs create a linear separating hyperplane, but they have the ability to embed the data into a higher-dimensional space, using the so-called kernel trick.
o SVMs are a nonparametric method – they retain training examples and potentially need to store them all.
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
• The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line; if there are 3 features, the hyperplane is a two-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the data points of the classes.
• Support Vectors: the data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Suppose we have a dataset that has two
tags (green and blue), and the dataset has
two features x1 and x2. We want a
classifier that can classify the pair (x1, x2)
of coordinates in either green or blue.
• Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the closest points of the lines from both classes.
• These points are called support vectors.
• The distance between the support vectors and the hyperplane is called the margin.
• The goal of SVM is to maximize this margin. The hyperplane with
maximum margin is called the optimal hyperplane.
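A hedged sketch with scikit-learn (assuming it is available); the four toy (x1, x2) points stand in for the green/blue data above.

```python
from sklearn import svm

# Assumed toy 2-D data: two "blue" and two "green" points.
X = [[1, 2], [2, 3], [6, 5], [7, 8]]     # features (x1, x2)
y = ["blue", "blue", "green", "green"]   # tags

clf = svm.SVC(kernel="linear")           # maximum-margin linear separator
clf.fit(X, y)
print(clf.support_vectors_)              # the points closest to the hyperplane
print(clf.predict([[5, 5]]))             # classify a new (x1, x2) pair
```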
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
• Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not. But together, all the trees predict the correct output.
Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
• It predicts output with high accuracy; even for a large dataset, it runs efficiently.
• Random forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
• The working process can be explained in the below steps and diagram:
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data points (subsets).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Steps 1 and 2.
• Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.
• Suppose there is a dataset that contains multiple fruit images. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result; when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results.
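A short scikit-learn sketch of the majority-vote idea (assuming scikit-learn is available); the iris dataset stands in for the fruit images, and n_estimators plays the role of N.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris stands in for the fruit-image example above.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)              # each tree is trained on a bootstrap subset
print(forest.score(X_te, y_te))     # accuracy of the majority-vote prediction
```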