ML Unit IV
Bayesian Learning
Bayes Optimal Classifier
The Bayes Optimal Classifier is a probabilistic model that predicts the most likely classification
for a new instance.
It is described using Bayes' theorem, which provides a systematic means of computing a conditional
probability: P(h|D) = P(D|h) P(h) / P(D). It is also related to Maximum a Posteriori (MAP), a
probabilistic framework for determining the most probable hypothesis given a training dataset.
Consider an example for Bayes Optimal Classification,
Take a hypothesis space that has 3 hypotheses h1, h2, and h3.
The posterior probabilities of the hypotheses are as follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP hypothesis. (MAP => max posterior)
Suppose a new instance x is encountered, which is classified negative by h2 and h3 but
positive by h1.
Taking all hypotheses into account, the probability that x is positive is 0.4, and the probability
that it is negative is therefore 0.6.
The classification generated by the MAP hypothesis (positive) therefore differs from the most
probable classification, which in this case is negative.
The most probable classification of the new instance is obtained by combining the predictions
of all hypotheses, weighted by their posterior probabilities.
hi    P(hi|D)    P(positive|hi)    P(negative|hi)
h1    0.4        1                 0
h2    0.3        0                 1
h3    0.3        0                 1

∑(hi ∈ H) P(positive|hi) P(hi|D) = 0.4
∑(hi ∈ H) P(negative|hi) P(hi|D) = 0.6

Thus, the Bayes optimal procedure classifies the new instance as negative.
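As a minimal Python sketch (hypothesis names and values as in the example above), the weighted vote can be computed like this:

```python
# Bayes optimal classification: weight each hypothesis's prediction
# by its posterior probability and pick the most probable class.

# Posteriors P(hi|D) from the example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Each hypothesis's prediction for the new instance x.
predictions = {"h1": "positive", "h2": "negative", "h3": "negative"}

# Sum posterior mass per class: P(v|D) = sum over hi of P(v|hi) * P(hi|D).
class_prob = {}
for h, p in posteriors.items():
    v = predictions[h]
    class_prob[v] = class_prob.get(v, 0.0) + p

print(class_prob)                           # {'positive': 0.4, 'negative': 0.6}
print(max(class_prob, key=class_prob.get))  # 'negative'
```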
Gibbs Algorithm
The Bayes optimal classifier provides the best classification result achievable; however, it can
be computationally intensive, as it computes the posterior probability for every hypothesis h
∈ H and the prediction of each hypothesis for each new instance, then combines the two to
classify each new instance.
The Gibbs algorithm instead chooses one hypothesis h from H at random, according to the
posterior probability distribution P(h|D) over H, and uses h to predict the classification of
the next instance.
Under certain conditions the expected misclassification error for Gibbs algorithm is at most
twice the expected error of the Bayes optimal classifier.
If we assume the target concepts are drawn at random from H according to the priors on H, then
the expected error of the Gibbs algorithm is at most twice the expected error of the optimal
Bayes classifier. The Gibbs algorithm is seldom used in practice, however.
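A minimal sketch of the Gibbs step, reusing the posteriors and predictions from the Bayes optimal example above; random.choices with its weights argument is standard Python:

```python
import random

# Posterior distribution P(h|D) over the hypothesis space H.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "positive", "h2": "negative", "h3": "negative"}

# Gibbs algorithm: sample ONE hypothesis according to P(h|D) ...
h = random.choices(list(posteriors), weights=list(posteriors.values()), k=1)[0]

# ... and use it alone to classify the next instance.
print(h, "->", predictions[h])
```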
Naive Bayes Classifier
Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training
dataset.
The Naive Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can make
quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Working of Naive Bayes' Classifier:
Consider an example for Naïve Bayes Classification,
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we need to
follow the below steps:
Convert the given dataset into frequency tables.
Generate Likelihood table by finding the probabilities of given features.
Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
Rainy Yes
Sunny Yes
Overcast Yes
Overcast Yes
Sunny No
Rainy Yes
Sunny Yes
Overcast Yes
Rainy No
Sunny No
Sunny Yes
Rainy No
Overcast Yes
Overcast Yes
Frequency table for the weather conditions:
Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table:
Weather     No             Yes             P(Weather)
Overcast    0              5               5/14 = 0.36
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.36
Total       4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny) = (3/10 * 10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny) = (2/4 * 4/14) / (5/14) = 0.40
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
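The same numbers fall out of a few lines of Python; this minimal sketch (variable and function names are illustrative) counts the frequencies from the dataset above and applies Bayes' theorem:

```python
from collections import Counter

# The Outlook/Play dataset from the table above.
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"),
        ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

play = Counter(label for _, label in data)  # {'Yes': 10, 'No': 4}
n = len(data)

def posterior(outlook, label):
    """P(label | outlook) via Bayes' theorem with counted frequencies."""
    likelihood = sum(1 for o, l in data if o == outlook and l == label) / play[label]
    prior = play[label] / n
    evidence = sum(1 for o, _ in data if o == outlook) / n
    return likelihood * prior / evidence

print(posterior("Sunny", "Yes"))  # 0.6 -> play
print(posterior("Sunny", "No"))   # 0.4
```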
Bayesian Belief Networks
[Figure: a Bayesian network in which Burglary (B) and Fire (F) are the parent nodes of
Alarm (A), and Alarm is the parent node of 'P1 calls' (P1) and 'P2 calls' (P2).]
In the above example, an alarm 'A' (a node) is installed in the house of a person
'gfg'. It rings upon two events, burglary 'B' and fire 'F', which are the parent
nodes of the alarm node. The alarm node is in turn the parent of the two person nodes
'P1 calls' ('P1') and 'P2 calls' ('P2').
Upon hearing the alarm, 'P1' and 'P2' call the person 'gfg'.
But there are a few drawbacks in this case: sometimes 'P1' may forget to call the
person 'gfg' even after hearing the alarm, as he has a tendency to forget things
quickly. Similarly, 'P2' sometimes fails to call the person 'gfg', as he is only able to
hear the alarm from a certain distance.
Problem: Find the probability that 'P1' is true (P1 has called 'gfg') and 'P2' is true (P2 has
called 'gfg') when the alarm 'A' rang, but no burglary 'B' and no fire 'F' has occurred.
P(P1, P2, A, ~B, ~F) [where P1, P2 and A are 'true' events and '~B' and '~F' are 'false' events]
Burglary ‘B’ –
P (B=T) = 0.001 (‘B’ is true i.e., burglary has occurred)
P (B=F) = 0.999 (‘B’ is false i.e., burglary has not occurred)
Fire ‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e., fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e., fire has not occurred)
Alarm ‘A’ –
B    F    P(A=T)    P(A=F)
T    T    0.95      0.05
T    F    0.94      0.06
F    T    0.29      0.71
F    F    0.001     0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e., may have rung or may not have
rung). It has two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’
(i.e., may have occurred or may not have occurred) depending upon different
conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e., may have called the person ‘gfg’ or
not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e., may have
rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or false’ (i.e., may have called the person
‘gfg’ or not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’
(i.e., may have rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
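Using the network structure, the joint probability factorizes as P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B, ~F) * P(~B) * P(~F). A minimal Python sketch of this calculation, plugging in only the CPT values from the tables above:

```python
# CPT entries taken from the tables above.
p_not_B = 0.999   # P(B=F): no burglary occurred
p_not_F = 0.998   # P(F=F): no fire occurred
p_A = 0.001       # P(A=T | B=F, F=F): alarm rings with neither cause
p_P1 = 0.95       # P(P1=T | A=T): P1 calls given the alarm rang
p_P2 = 0.80       # P(P2=T | A=T): P2 calls given the alarm rang

# Factorization along the network structure (each node given its parents):
# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = p_P1 * p_P2 * p_A * p_not_B * p_not_F
print(joint)  # ~0.00075, i.e. this joint event is extremely unlikely
```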
The EM algorithm
The Expectation-Maximization (EM) algorithm is an iterative method used to determine local
maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for the parameters
of statistical models that depend on unobservable variables. In other words, it is a technique
for finding maximum likelihood estimates when latent variables are present, and it is the basis
of many latent variable models.
The EM algorithm underlies several unsupervised ML algorithms, such as the k-means clustering
algorithm. Being an iterative approach, it alternates between two steps. In the first step, we
estimate the missing or latent variables; hence it is referred to as the
expectation/estimation step (E-step). The other step optimizes the parameters of the model so
that it can explain the data more clearly; it is known as the maximization step (M-step).
[Flowchart: Expectation step (E-step) → Maximization step (M-step) → "Converged?"; if no,
return to the E-step, if yes, STOP.]
o Step 1: The very first step is to initialize the parameter values. The system is
provided with incomplete observed data, with the assumption that the data is obtained
from a specific model.
o Step 2 (E-step): Use the observed data and the current parameter values to estimate the
values of the missing or latent variables.
o Step 3 (M-step): Use the complete data generated in the E-step to update the parameter
values.
o Step 4: Check whether the values are converging; if yes, stop, otherwise repeat from
Step 2. A minimal sketch of these steps follows.
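As a concrete illustration, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture; the toy data and starting values are made up for the example, and only NumPy is assumed:

```python
import numpy as np

# Toy 1-D data with two rough clusters (made up for illustration).
x = np.array([0.8, 1.0, 1.2, 0.9, 5.0, 5.2, 4.8, 5.1])

# Step 1: initialize component means, variances, and mixing weights.
mu = np.array([0.0, 4.0])
var = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

def gauss(x, mu, var):
    # Gaussian density, evaluated elementwise.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: estimate the latent variables (responsibilities), i.e. the
    # posterior probability that each point came from each component.
    r = w * gauss(x[:, None], mu, var)      # shape (n, 2)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters to best explain the weighted data.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / len(x)

print(mu)  # the means should move toward roughly 1.0 and 5.0
```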
Introduction
An Artificial Neural Network is a model in the field of Artificial Intelligence that attempts
to mimic the network of neurons that makes up the human brain, so that computers have an option
to understand things and make decisions in a human-like manner. The artificial neural network
is designed by programming computers to behave simply like interconnected brain cells.
Artificial neural networks have neurons that are interconnected to one another in various
layers; these neurons are known as nodes.
Consider the example of a digital logic gate that takes an input and gives an output: an "OR"
gate, which takes two inputs. If one or both inputs are "On," then the output is "On." If both
inputs are "Off," then the output is "Off." Here the output depends directly upon the input.
Our brain does not perform the same task: the output-to-input relationship keeps changing,
because the neurons in our brain are "learning."
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations
to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.
Artificial neural networks can perform more than one task simultaneously, giving them a
parallel processing capability.
Unlike traditional programming, the data used by the network is stored on the whole network,
not in a database. The disappearance of a couple of pieces of data in one place doesn't prevent
the network from working.
After ANN training, the network may produce output even with inadequate data. The loss of
performance here depends on the significance of the missing data.
For an ANN to be able to adapt, it is important to determine the examples and to train the
network according to the desired output by demonstrating these examples to it. The success of
the network is directly proportional to the chosen instances, and if the event can't be shown
to the network in all its aspects, it can produce false output.
Corruption of one or more cells of an ANN does not prevent it from generating output, and this
feature makes the network fault-tolerant.
Afterward, each input is multiplied by its corresponding weight (these weights are the details
utilized by the artificial neural network to solve a specific problem). In general terms, these
weights represent the strength of the interconnection between neurons inside the artificial
neural network. All the weighted inputs are summed inside the computing unit.
If the weighted sum is zero, a bias is added to make the output non-zero, or to otherwise scale
up the system's response. The bias has an input of 1 with its own weight. Here the total of
weighted inputs can lie anywhere in the range from 0 to positive infinity. To keep the response
within the limits of the desired value, a certain maximum value is benchmarked, and the total
of weighted inputs is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve the desired
output. There are different kinds of activation functions, primarily linear and non-linear
sets of functions. Some of the commonly used activation functions are the binary, linear, and
tan hyperbolic sigmoidal activation functions, illustrated in the sketch below.
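As a minimal illustration of the neuron computation just described (weighted sum plus bias passed through an activation function), here is a short sketch; all input, weight, and bias values are made up for the example:

```python
import math

def binary_step(s):
    # Binary activation: fire (1) if the summed signal is non-negative.
    return 1 if s >= 0 else 0

def tanh_sigmoid(s):
    # Tan hyperbolic sigmoidal activation, squashing s into (-1, 1).
    return math.tanh(s)

inputs  = [1.0, 0.0, 1.0]    # made-up input values
weights = [0.7, 0.6, 0.5]    # made-up connection weights
bias_in, bias_w = 1.0, -0.5  # bias: fixed input of 1 with its own weight

# Weighted sum of the inputs plus the bias term ...
s = sum(x * w for x, w in zip(inputs, weights)) + bias_in * bias_w

# ... passed through an activation function to produce the output.
print(binary_step(s), tanh_sigmoid(s))  # 1, ~0.60
```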
The Perceptron
A Perceptron is an Artificial Neuron
It is the simplest possible Neural Network
Neural Networks are the building blocks of Machine Learning.
The original Perceptron was designed to take a number of binary inputs, and produce one
binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that the
sum of the values should be greater than a threshold value before making a decision like true
or false (0 or 1).
Perceptron Example
Imagine a perceptron (in your brain).
The perceptron tries to decide if you should go to a concert.
Is the artist good? Is the weather good?
What weights should these facts have?
Threshold = 1.5
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
Sum: 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6
Return true if the sum > 1.5 ("Yes I will go to the Concert"); here 1.6 > 1.5, so the
output is true.
Note
If the weather weight is 0.6 for you, it might be different for someone else. A higher weight
means that the weather is more important to them.
If the threshold value is 1.5 for you, it might be different for someone else. A lower
threshold means they are more willing to go to the concert.
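A minimal Python sketch of this perceptron decision, using the inputs, weights, and threshold from the example above:

```python
def perceptron(inputs, weights, threshold):
    # Weighted sum of the binary inputs; fire if it exceeds the threshold.
    s = sum(x * w for x, w in zip(inputs, weights))
    return s > threshold

inputs  = [1, 0, 1, 0, 1]                # the binary facts from the example
weights = [0.7, 0.6, 0.5, 0.3, 0.4]      # importance of each fact
print(perceptron(inputs, weights, 1.5))  # True: 1.6 > 1.5, go to the concert
```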
Multi-Layer Perceptron
In a multi-layer perceptron, after a forward pass the Mean Squared Error is computed across all
input and output pairs. Then, to propagate it back, the weights of the first hidden layer are
updated with the value of the gradient.
This process keeps going until the gradient for each input-output pair has converged, meaning
the newly computed gradient hasn't changed by more than a specified convergence threshold
compared to the previous iteration.
Backpropagation Algorithm
The backpropagation algorithm in a neural network computes the gradient of the loss function
with respect to a single weight by the chain rule. It efficiently computes one layer at a time,
unlike a naive direct computation. It computes the gradient, but it does not define how the
gradient is used. It generalizes the computation in the delta rule.
Consider the following backpropagation steps to understand:
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers,
to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers to adjust the weights such that
the error is decreased.
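As a minimal illustration of these five steps, here is a NumPy sketch that trains a tiny 2-4-1 network on XOR; the architecture, learning rate, iteration count, and data are illustrative choices, not part of the notes above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: inputs X arrive (XOR data), with target outputs y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 2: weights are randomly selected (2 inputs -> 4 hidden -> 1 output).
W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0  # learning rate (illustrative choice)
for _ in range(5000):
    # Step 3: calculate the output of every neuron, layer by layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Step 4: calculate the error in the outputs (MSE gradient
    # multiplied by the sigmoid derivative, via the chain rule).
    d_out = (out - y) * out * (1 - out)

    # Step 5: travel back from the output layer to the hidden layer
    # and adjust the weights so that the error decreases.
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())  # typically approaches [0, 1, 1, 0]
```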