
UNIT-IV

Bayesian Learning
Bayes Optimal Classifier
The Bayes Optimal Classifier is a probabilistic model that predicts the most likely outcome
for a new situation.
The Bayes Theorem, which provides a systematic means of computing a conditional
probability, is used to describe it. It’s also related to Maximum a Posteriori (MAP), a
probabilistic framework for determining the most likely hypothesis for a training dataset.
Consider an example for Bayes Optimal Classification,
Take a hypothesis space that has 3 hypotheses h1, h2, and h3.
The posterior probabilities of the hypotheses are as follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP hypothesis. (MAP => max posterior)
Suppose a new instance x is encountered, which is classified negative by h2 and h3 but
positive by h1.
Taking all hypotheses into account, the probability that x is positive is 0.4 and the probability
that it is negative is therefore 0.6.
The classification generated by the MAP hypothesis therefore differs from the most probable
classification, which in this case is negative.
The most probable classification of the new instance is obtained by combining the predictions
of all hypotheses, weighted by their posterior probabilities.
hi P(hi|D) Positive Negative
h1 0.4 1 0
h2 0.3 0 1
h3 0.3 0 1

∑ P(+|hi) P(hi|D) = 0.4
hi ∈ H

∑ P(−|hi) P(hi|D) = 0.6
hi ∈ H

The most probable classification in this case is therefore negative.


A Bayes optimal classifier is a system that classifies new instances according to the above
equation. This strategy maximizes the probability that the new instance is classified correctly.
Consider another example for Bayes Optimal Classification: a robot must choose one of three
actions, forward (F), left (L), or right (R).
Let there be 5 hypotheses h1 through h5.

hi   P(hi|D)   P(F|hi)   P(L|hi)   P(R|hi)
h1   0.4       1         0         0
h2   0.2       0         1         0
h3   0.1       0         0         1
h4   0.1       0         1         0
h5   0.2       0         1         0
The MAP hypothesis, therefore, suggests that the robot should proceed forward (F). Let's see
what the Bayes optimal procedure suggests.

∑ P(F|hi) P(hi|D) = 0.4
hi ∈ H

∑ P(L|hi) P(hi|D) = 0.2 + 0.1 + 0.2 = 0.5
hi ∈ H

∑ P(R|hi) P(hi|D) = 0.1
hi ∈ H

Thus, the Bayes optimal procedure recommends the robot turn left.
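As a concrete illustration, the following short Python sketch reproduces the weighted-vote computation above for the robot example; the dictionaries and function names are illustrative assumptions, not a prescribed implementation.

posteriors = {"h1": 0.4, "h2": 0.2, "h3": 0.1, "h4": 0.1, "h5": 0.2}   # P(hi|D)
predictions = {                                                        # P(action|hi)
    "h1": {"F": 1, "L": 0, "R": 0},
    "h2": {"F": 0, "L": 1, "R": 0},
    "h3": {"F": 0, "L": 0, "R": 1},
    "h4": {"F": 0, "L": 1, "R": 0},
    "h5": {"F": 0, "L": 1, "R": 0},
}

def bayes_optimal(posteriors, predictions, actions=("F", "L", "R")):
    # Weight each hypothesis's prediction by its posterior and pick the best action.
    scores = {a: sum(predictions[h][a] * posteriors[h] for h in posteriors) for a in actions}
    return max(scores, key=scores.get), scores

action, scores = bayes_optimal(posteriors, predictions)
print(scores)   # {'F': 0.4, 'L': 0.5, 'R': 0.1}
print(action)   # 'L', i.e. turn left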

Gibbs Algorithm
The Bayes optimal classifier provides the best classification result achievable; however, it can
be computationally intensive, as it computes the posterior probability for every hypothesis
h ∈ H, the prediction of each hypothesis for each new instance, and the combination of these
two to classify each new instance.
The Gibbs algorithm instead chooses one hypothesis h from H at random, according to the
posterior probability distribution P(h|D) over H.
It then uses h to predict the classification of the next instance.
Under certain conditions, the expected misclassification error of the Gibbs algorithm is at most
twice the expected error of the Bayes optimal classifier.
If we assume the target concepts are drawn at random from H according to the prior
distribution over H, then the error of Gibbs is less than twice the error of optimal Bayes. Gibbs
is seldom used, however.
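For contrast, here is a minimal Python sketch of the Gibbs procedure on the earlier three-hypothesis example, with illustrative names; it assumes the posterior P(h|D) is already known and simply samples one hypothesis from it.

import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(hi|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # label each hi assigns to x

def gibbs_classify(posteriors, predictions):
    # 1. Draw one hypothesis at random according to P(h|D).
    h = random.choices(list(posteriors), weights=list(posteriors.values()), k=1)[0]
    # 2. Use that single hypothesis to classify the new instance.
    return predictions[h]

print(gibbs_classify(posteriors, predictions))   # '+' with probability 0.4, '-' with 0.6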
Naive Bayes Classifier
 Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
 It is mainly used in text classification that includes a high-dimensional training
dataset.
 Naive Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make quick
predictions.
 It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
 Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Working of Naive Bayes' Classifier:
Consider an example for Naïve Bayes Classification,
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we need to
follow the steps below:
Convert the given dataset into frequency tables.
Generate Likelihood table by finding the probabilities of given features.
Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:

Outlook Play
Rainy Yes
Sunny Yes
Overcast Yes
Overcast Yes
Sunny No
Rainy Yes
Sunny Yes
Overcast Yes
Rainy No
Sunny No
Sunny Yes
Rainy No
Overcast Yes
Overcast Yes
Frequency table for the Weather Conditions:

Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4

Likelihood table for the weather conditions:

Weather    No            Yes            P(Weather)
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
From the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
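The whole calculation can also be reproduced programmatically. The following Python sketch counts frequencies from the Outlook/Play table above and applies Bayes' theorem; the function name and data layout are illustrative assumptions.

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

def posterior(outlook, label):
    n = len(data)
    label_count = sum(1 for _, y in data if y == label)                   # e.g. count of Yes
    joint_count = sum(1 for x, y in data if x == outlook and y == label)
    p_outlook_given_label = joint_count / label_count                     # P(Sunny|Yes)
    p_label = label_count / n                                             # P(Yes)
    p_outlook = sum(1 for x, _ in data if x == outlook) / n               # P(Sunny)
    return p_outlook_given_label * p_label / p_outlook                    # Bayes' theorem

print(round(posterior("Sunny", "Yes"), 2))   # 0.6
print(round(posterior("Sunny", "No"), 2))    # 0.4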
Advantages of Naive Bayes Classifier:
 Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
 It can be used for Binary as well as Multi-class Classifications.
 It performs well in multi-class predictions as compared to the other Algorithms.
 It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes Classifier:
 Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Applications of Naive Bayes Classifier:
 It is used for Credit Scoring.
 It is used in medical data classification.
 It can be used in real-time predictions because Naive Bayes Classifier is an eager
learner.
 It is used in Text classification such as Spam filtering and Sentiment analysis.
Bayesian Belief Network
Bayesian Belief Network is a graphical representation of the probabilistic relationships among
the random variables in a particular set. Each variable is assumed to be conditionally
independent of its non-descendants given its parents. The joint probability in a Bayesian
Belief Network is therefore built from conditional probabilities of the form P(attribute|parent),
i.e., the probability of an attribute given its parent attribute(s).
Consider the following example network: Burglary (B) and Fire (F) are the parent nodes of
Alarm (A), and Alarm (A) is the parent node of P1 calls (P1) and P2 calls (P2).

 In the above example, an alarm 'A' is a node, say installed in the house of a person
'gfg', which rings upon either of two events, burglary 'B' or fire 'F'; these are the
parent nodes of the alarm node. The alarm node, in turn, is the parent of the two person
nodes, P1 calls 'P1' and P2 calls 'P2'.
 Upon hearing the alarm, 'P1' and 'P2' call the person 'gfg'. There are, however, a few
caveats: sometimes 'P1' may forget to call 'gfg' even after hearing the alarm, as he has
a tendency to forget things quickly. Similarly, 'P2' sometimes fails to call 'gfg', as he
can hear the alarm only from a certain distance.
Problem: Find the probability that ‘P1’ is true (P1 has called ‘gfg’), ‘P2’ is true (P2 has
called ‘gfg’) when the alarm ‘A’ rang, but no burglary ‘B’ and fire ‘F’ has occurred.
P(P1,P2,A,~B,~F) [ where- P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’ events]
Burglary ‘B’ –
 P (B=T) = 0.001 (‘B’ is true i.e., burglary has occurred)
 P (B=F) = 0.999 (‘B’ is false i.e., burglary has not occurred)
Fire ‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e., fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e., fire has not occurred)
Alarm ‘A’ –
B    F    P(A=T)    P(A=F)
T    T    0.95      0.05
T    F    0.94      0.06
F    T    0.29      0.71
F    F    0.001     0.999

 The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e., may have rung or may not have
rung). It has two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’
(i.e., may have occurred or may not have occurred) depending upon different
conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95

 The person ‘P1’ node can be ‘true’ or ‘false’ (i.e., may have called the person ‘gfg’ or
not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e., may have
rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99

 The person ‘P2’ node can be ‘true’ or false’ (i.e., may have called the person
‘gfg’ or not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’
(i.e., may have rung or may not have rung, upon burglary ‘B’ or fire ‘F’).

Solution: Considering the conditional probability tables given above –


 With respect to the question — P ( P1, P2, A, ~B, ~F) , we need to get the
probability of ‘P1’. We find it with regard to its parent node – alarm ‘A’. To get the
probability of ‘P2’, we find it with regard to its parent node — alarm ‘A’.
 We find the probability of alarm ‘A’ node with regard to ‘~B’ & ‘~F’ since burglary
‘B’ and fire ‘F’ are parent nodes of alarm ‘A’.
 From the tables above, we can deduce –
P ( P1, P2, A, ~B, ~F)
= P (P1/A) * P (P2/A) * P (A/~B~F) * P (~B) * P (~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
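The same joint probability can be computed directly from the conditional probability tables. The Python sketch below hard-codes the CPT values from this example; the variable names are illustrative.

p_b = {True: 0.001, False: 0.999}                       # P(B)
p_f = {True: 0.002, False: 0.998}                       # P(F)
p_a_true = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | B, F)
            (False, True): 0.29, (False, False): 0.001}
p_p1_true = {True: 0.95, False: 0.05}                   # P(P1=T | A)
p_p2_true = {True: 0.80, False: 0.01}                   # P(P2=T | A)

# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = p_p1_true[True] * p_p2_true[True] * p_a_true[(False, False)] * p_b[False] * p_f[False]
print(round(joint, 5))   # 0.00076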

The EM algorithm
The Expectation-Maximization (EM) algorithm is an iterative method used to determine local
maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for the
parameters of statistical models that contain unobservable variables. In other words, it is a
technique for finding maximum likelihood estimates when latent variables are present; such
models are referred to as latent variable models.
EM-style estimation underlies various unsupervised ML algorithms, such as the k-means
clustering algorithm. Being an iterative approach, it alternates between two modes. In the
first mode, we estimate the missing or latent variables; this is referred to as the
Expectation/estimation step (E-step). The other mode optimizes the parameters of the model
so that it can explain the data more clearly; this is known as the maximization step (M-step).

E-step: update the variables. M-step: update the hypothesis.

 Expectation step (E - step): It involves the estimation (guess) of all missing


values in the dataset so that after completing this step, there should not be any
missing value.
 Maximization step (M - step): This step involves the use of estimated data in
the E-step and updating the parameters.
 Repeat E-step and M-step until the convergence of the values occurs.

Flow of the EM algorithm: start with initial values → Expectation step → Maximization step
→ check convergence; if converged, stop, otherwise return to the Expectation step.
o Step 1: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained
from a specific model.

o Step 2: This step is known as Expectation or E-Step, which is used to estimate or


guess the values of the missing or incomplete data using the observed data. Further,
E-step primarily updates the variables.
o Step 3: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o Step 4: The last step is to check whether the values of the latent variables are converging.
If they are, stop the process; otherwise, repeat from step 2 until convergence occurs.
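As a concrete (and deliberately simplified) illustration of the E-step/M-step loop, here is a small NumPy sketch of EM for a two-component 1-D Gaussian mixture; the data, the fixed component widths, and the starting values are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])   # unlabeled data

mu = np.array([1.0, 4.0])      # Step 1: initial guesses for the component means
sigma = np.array([1.0, 1.0])   # component widths kept fixed to keep the sketch short
pi = np.array([0.5, 0.5])      # mixing weights

for _ in range(50):
    # E-step (Step 2): estimate the latent variables, i.e. the responsibility
    # of each component for each data point, under the current parameters.
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step (Step 3): re-estimate the parameters from the filled-in data.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)
    # Step 4 (convergence check) is omitted here; a fixed number of iterations is used.

print(mu.round(2))   # the means should move toward roughly [0, 5]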
Artificial Neural Networks

Introduction

An Artificial Neural Network is a model in the field of Artificial Intelligence that attempts to
mimic the network of neurons that makes up the human brain, so that computers can
understand things and make decisions in a human-like manner. The artificial neural network
is designed by programming computers to behave simply like interconnected brain cells.
Artificial neural networks have neurons that are interconnected to one another in various
layers; these neurons are known as nodes.


Consider an example of a digital logic gate that takes an input and gives an output. An "OR"
gate takes two inputs: if one or both inputs are "On," the output is "On"; if both inputs are
"Off," the output is "Off." Here the output depends only on the input. Our brain does not
perform the same task: the relationship between outputs and inputs keeps changing because
the neurons in our brain are "learning."

The architecture of an artificial neural network:


A neural network consists of a large number of artificial neurons, termed units, arranged in a
sequence of layers. Let us look at the various types of layers available in an artificial neural
network.
An Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations
needed to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as input to an activation function, which determines
whether a node should fire or not. Only the nodes that fire contribute to the output layer.
There are various activation functions available, chosen according to the type of task being
performed.

Advantages of Artificial Neural Networks


Parallel processing capability:

Artificial neural networks process information numerically and can perform more than one
task simultaneously.

Storing data on the entire network:

Unlike traditional programming, where data is stored in a database, the data here is stored on
the whole network. The disappearance of a couple of pieces of data in one place does not
prevent the network from working.

Capability to work with incomplete knowledge:

After training, an ANN may produce output even from incomplete data. The loss of
performance here depends upon the significance of the missing data.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine suitable examples and to train
the network toward the desired output by presenting these examples to it. The success of the
network is directly proportional to the chosen instances; if an event cannot be shown to the
network in all its aspects, the network may produce false output.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prevent it from generating output, and
this feature makes the network fault tolerant.

Representation of Artificial Neural Networks


An Artificial Neural Network can be best represented as a weighted directed graph, where the
artificial neurons form the nodes and the connections between neuron outputs and neuron
inputs form the directed edges with weights. The Artificial Neural Network receives the input
signal from an external source in the form of a pattern or image, represented as a vector.
These inputs are then mathematically denoted by the notation x(n) for each of the n inputs.

Afterward, each input is multiplied by its corresponding weight (these weights are the details
utilized by the artificial neural network to solve a specific problem). In general terms, these
weights represent the strength of the interconnections between neurons inside the artificial
neural network. All the weighted inputs are summed inside the computing unit.
If the weighted sum is equal to zero, a bias is added to make the output non-zero, or to scale
up the system's response. The bias can be viewed as an extra input with value 1 and its own
weight. The total of the weighted inputs can lie anywhere in the range from 0 to positive
infinity. To keep the response within the limits of the desired value, a certain maximum value
is benchmarked, and the total of the weighted inputs is passed through the activation function.

The activation function refers to the set of transfer functions used to achieve the desired
output. There are different kinds of activation functions, primarily either linear or non-linear
sets of functions. Some of the commonly used activation functions are the binary (step),
linear, and tan-hyperbolic sigmoidal activation functions.
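To make the weighted sum, bias, and activation concrete, here is a minimal single-neuron sketch in Python using a sigmoid activation; the input values, weights, and bias are illustrative assumptions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias   # weighted sum plus bias
    return sigmoid(z)                                        # activation function

print(neuron([0.5, 0.2], [0.4, 0.9], bias=0.1))   # an output between 0 and 1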

Appropriate Problems for NN Learning

o Instances are represented by many attribute-value pairs.


o The target function output may be discrete-valued, real-valued, or a vector of
several real-valued or discrete-valued attributes.
o The training examples may contain errors.
o Long training times are acceptable.
o Fast evaluation of the learned target function may be required.
o The ability of humans to understand the learned target function is not
important.

The Perceptron
 A Perceptron is an Artificial Neuron
 It is the simplest possible Neural Network
 Neural Networks are the building blocks of Machine Learning.
The original Perceptron was designed to take a number of binary inputs, and produce one
binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that the
sum of the values should be greater than a threshold value before making a decision like true
or false (0 or 1).

Perceptron Example
Imagine a perceptron (in your brain).
The perceptron tries to decide if you should go to a concert.
Is the artist good? Is the weather good?
What weights should these facts have?

Criteria Input Weight


Artist is Good x1 = 0 or 1 w1 = 0.7
Weather is Good x2 = 0 or 1 w2 = 0.6
Friend will Come x3 = 0 or 1 w3 = 0.5
Food is Served x4 = 0 or 1 w4 = 0.3
Alcohol is Served x5 = 0 or 1 w5 = 0.4
The Perceptron Algorithm

1. Set a threshold value


2. Multiply all inputs by their weights
3. Sum all the results
4. Activate the output

1. Set a threshold value:

 Threshold = 1.5

2. Multiply all inputs by their weights:

 x1 * w1 = 1 * 0.7 = 0.7
 x2 * w2 = 0 * 0.6 = 0
 x3 * w3 = 1 * 0.5 = 0.5
 x4 * w4 = 0 * 0.3 = 0
 x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)

4. Activate the Output:

 Return true if the sum > 1.5 ("Yes I will go to the Concert")

Here, 1.6 is greater than 1.5 (true)
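The four steps above translate almost directly into code. The following Python sketch uses the example's inputs, weights, and threshold; nothing beyond this section's values is assumed.

inputs = [1, 0, 1, 0, 1]                    # x1..x5
weights = [0.7, 0.6, 0.5, 0.3, 0.4]         # w1..w5
threshold = 1.5                             # step 1: set a threshold value

# steps 2 and 3: multiply inputs by their weights and sum the results
weighted_sum = sum(x * w for x, w in zip(inputs, weights))   # 1.6

# step 4: activate the output
decision = weighted_sum > threshold
print(weighted_sum, decision)   # 1.6 True -> "Yes I will go to the Concert"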

Note
If the weather weight is 0.6 for you, it might be different for someone else. A higher weight
means that the weather is more important to them.
If the threshold value is 1.5 for you, it might be different for someone else. A lower threshold
means they are more eager to go to the concert.

Multi-Layer Perceptron

A multi-layer perceptron (MLP) is an extension of the feedforward neural network: inputs are
combined with the initial weights in a weighted sum and subjected to the activation function,
just like in the perceptron. The difference is that each linear combination is propagated to the
next layer.

It consists of three types of layers—


 the input layer,
 output layer and
 hidden layer
The input layer receives the input signal to be processed. The required task, such as prediction
or classification, is performed by the output layer. An arbitrary number of hidden layers placed
between the input and output layers are the true computational engine of the MLP. As in a
feedforward network, in an MLP the data flows in the forward direction from the input layer
to the output layer. The neurons in the MLP are trained with the backpropagation learning
algorithm. MLPs can approximate any continuous function and can solve problems which are
not linearly separable. The major use cases of MLP are pattern classification, recognition,
prediction and approximation.
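A forward pass through such a network can be sketched in a few lines of NumPy. The layer sizes, random weights, and ReLU activation below are illustrative assumptions, not part of the text's example.

import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0, z)

x = rng.random(4)                              # input layer: 4 features
W1, b1 = rng.random((5, 4)), rng.random(5)     # input -> hidden layer (5 units)
W2, b2 = rng.random((2, 5)), rng.random(2)     # hidden -> output layer (2 units)

hidden = relu(W1 @ x + b1)   # each hidden unit: weighted sum plus bias, then activation
output = W2 @ hidden + b2    # the output layer produces the prediction
print(output)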
Backpropagation
Backpropagation is the learning mechanism that allows the Multilayer Perceptron to
iteratively adjust the weights in the network, with the goal of minimizing the cost function.
There is one hard requirement for backpropagation to work properly: the function that
combines inputs and weights in a neuron (for instance, the weighted sum) and the activation
function (for instance, ReLU) must be differentiable, with a bounded derivative, because
gradient descent is typically the optimization method used in the multilayer perceptron.
In each iteration, after the weighted sums are forwarded through all layers, the gradient of

the Mean Squared Error is computed across all input and output pairs. Then, to propagate it

back, the weights of the first hidden layer are updated with the value of the gradient.

This process keeps going until the gradient for each input-output pair has converged, meaning
the newly computed gradient has not changed by more than a specified convergence threshold
compared to the previous iteration.

Backpropagation Algorithm
The backpropagation algorithm in a neural network computes the gradient of the loss function
with respect to a single weight by the chain rule. It efficiently computes one layer at a time,
unlike a naive direct computation. It computes the gradient, but it does not define how the
gradient is used. It generalizes the computation in the delta rule.
The backpropagation procedure can be summarized in the following steps:
1. Inputs X, arrive through the preconnected path
2. Input is modelled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
4. Calculate the error in the outputs

Error = Actual Output – Desired Output

5. Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.

Keep repeating the process until the desired output is achieved
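As a compact illustration of these steps, the following sketch trains a tiny 2-2-1 network on the XOR problem (which is not linearly separable) using sigmoid units and plain gradient descent; the architecture, learning rate, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # step 1: inputs arrive
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))            # step 2: random weights
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    # step 3: forward pass through hidden and output layers
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # step 4: error at the outputs, then its gradient via the chain rule
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # step 5: travel back and adjust the weights to decrease the error
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0] for most initializations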

Advantages of Backpropagation:

 Backpropagation is fast, simple and easy to program


 It has no parameters to tune apart from the number of inputs
 It is a flexible method as it does not require prior knowledge about the network
 It is a standard method that generally works well
 It does not need any special mention of the features of the function to be learned.
