
Day & Time: Monday (10am-11am & 3pm-4pm)

Tuesday (10am-11am)
Wednesday (10am-11am & 3pm-4pm)
Friday (9am-10am, 11am-12pm, 2pm-3pm)
Dr. Srinivasa L. Chakravarthy
&
Smt. Jyotsna Rani Thota
Department of CSE
GITAM Institute of Technology (GIT)
Visakhapatnam – 530045
Email: slade@gitam.edu & jthota@gitam.edu
20 August 2020
EID 403: Machine Learning
Course objectives

● Explore the various disciplines connected with ML.

● Explore the efficiency of learning with inductive bias.
● Explore the identification of ML algorithms such as decision
tree learning.
● Explore algorithms such as artificial neural networks,
genetic programming, Bayesian learning, the nearest-neighbour
algorithm, and hidden Markov models.



Learning Outcomes

● Identify the various applications connected with ML.


● Classify the efficiency of ML algorithms using the inductive bias
technique.
● Distinguish the purpose of each ML algorithm.
● Analyze an application and correlate it with the available ML
algorithms.
● Choose an ML algorithm to develop a project.



Syllabus




Reference book 1: Title - Machine Learning
Author - Tom M. Mitchell

Reference book 2: Title - Introduction to Machine Learning
Author - Ethem Alpaydin



Module 3 (Chapter 6)
It includes:

Chapter 6: Bayesian Learning

● Bayesian Theorem & Concept Learning


● Maximum Likelihood & Minimum Description Length
Principle.
● Naive Bayes Classifier
● Bayesian Belief network
● The EM algorithm

&

Chapter 9: Computational Learning Theory


Introduction
Bayesian learning methods are relevant to ML for two reasons-

1. Bayesian learning algorithms calculate explicit probabilities for hypotheses, e.g., the

naive Bayes classifier.

2. Bayesian methods provide a useful perspective for understanding and

characterizing the operation of ML algorithms that do not manipulate
probabilities explicitly.
With Bayesian learning methods-

1. Each observed training example can incrementally decrease or increase the

estimated probability that a hypothesis is correct.

2. This provides a more flexible approach to learning than algorithms that completely

eliminate a hypothesis found to be inconsistent with any single example.

3. Prior knowledge can be combined with the observed data to determine the final

probability of a hypothesis.

4. Bayesian methods can accommodate hypotheses that make probabilistic predictions.

5. New instances can be classified by combining the predictions of multiple

hypotheses, weighted by their probabilities.
Difficulties with Bayesian methods-

● They require initial knowledge of many probabilities; when these probabilities are

not known, they are estimated from background knowledge, i.e.,
previously available data.

● Determining the Bayes optimal hypothesis can require significant computational cost,

although in special cases this cost can be reduced.
Bayesian Theorem-
It provides a way to calculate the probability of a hypothesis based on its prior
probability and the observed data.

Notation-

P(h) - prior probability of hypothesis h.

P(D) - prior probability that the training data D will be observed.

P(D|h) - probability of observing data D given some world in which hypothesis h holds.

P(h|D) - probability of h given the observed training data D. This is called the posterior
probability of h, reflecting our confidence that h holds after we have seen the training data D.

The posterior probability, in contrast to the prior probability, reflects the influence of the training data D.

Bayes theorem states-

P(h|D) = P(D|h) P(h) / P(D)


Bayesian Theorem-(cont.)

In a learning scenario, the learner seeks the most probable hypothesis given the

observed data D.

Such a maximally probable hypothesis is called the

"maximum a posteriori (MAP) hypothesis".

We can determine the MAP hypotheses by using Bayes theorem to calculate the
posterior probability of each hypothesis. hMAP is defined as-

hMAP = argmax over h in H of P(h|D)
     = argmax over h in H of P(D|h) P(h) / P(D)
     = argmax over h in H of P(D|h) P(h)

(P(D) can be dropped because it is a constant independent of h.)
Bayesian Theorem-(cont.)
As an Example-
Consider a medical diagnosis problem with two alternative hypotheses

1. The patient has cancer.

2. The patient does not have cancer.

A patient takes a lab test, and the result comes back positive.

The test returns a correct positive result in 98% of the cases in which the disease is present,

and a correct negative result in 97% of the cases in which the disease is not present.

Prior knowledge is that, over the entire population of people, only .008 have this
disease.
As an Example-(cont.)

The probabilities for the above situation are as follows-

P(cancer) = .008          P(¬cancer) = .992
P(+|cancer) = .98         P(−|cancer) = .02
P(+|¬cancer) = .03        P(−|¬cancer) = .97
As an Example-(cont.)

Suppose we observe a new patient for whom the lab test returns a positive result;

then the MAP hypothesis tells us whether to diagnose the patient as having cancer or not:

P(+|cancer) P(cancer) = .98 * .008 = .0078

P(+|¬cancer) P(¬cancer) = .03 * .992 = .0298

Thus, hMAP = ¬cancer.

The exact posterior probabilities are obtained by normalizing these quantities, e.g.,
P(cancer|+) = .0078 / (.0078 + .0298) = .21
Similarly, P(¬cancer|+) = .0298 / (.0078 + .0298) = .79.
Note-
Although the posterior probability of cancer is significantly higher than its prior
probability, the most probable hypothesis is still that the patient does not have cancer.
As more data is observed, the posterior probabilities are revised accordingly.
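
As a quick check, the calculation above can be reproduced in a few lines of Python (a minimal sketch; the numbers are those from the example):

# Posterior calculation for the cancer-test example (numbers from the slides above).
p_cancer = 0.008            # prior P(cancer)
p_not_cancer = 0.992        # prior P(not cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_not = 0.03      # P(+ | not cancer)

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer        # 0.00784
score_not_cancer = p_pos_given_not * p_not_cancer   # 0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "no cancer"
print("hMAP =", h_map)

# Normalizing gives the exact posterior probability.
p_cancer_given_pos = score_cancer / (score_cancer + score_not_cancer)
print("P(cancer | +) = %.2f" % p_cancer_given_pos)   # about 0.21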
Basic formulas for calculating probabilities-

● Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
● Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
● Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ P(Ai) = 1, then P(B) = Σ over i of P(B|Ai) P(Ai)
Brute-Force Bayes Concept Learning-

Based on Bayes theorem, we can design a straightforward concept learning algorithm that

outputs the maximum a posteriori hypothesis, as follows-

Brute-Force MAP Learning algorithm

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability,
   hMAP = argmax over h in H of P(h|D)


To specify a learning problem for the Brute-Force
MAP Learning algorithm-
We have to specify values for P(h) and P(D|h), under the following assumptions-

1. The training data D is noise-free.

2. The target concept c is contained in the hypothesis space H.
3. There is no prior reason to believe that any hypothesis is more probable than
any other.
Given these assumptions, we choose P(h) to be the uniform distribution, P(h) = 1/|H| for all h in H, and P(D|h) = 1 if h is consistent with D, and 0 otherwise.
To specify a learning problem for the Brute-Force
MAP Learning algorithm-(cont.)

The posterior probability P(h|D) is then

P(h|D) = 1/|VS H,D|  if h is consistent with D
P(h|D) = 0           otherwise

where VS H,D is the version space of H with respect to D, i.e., the subset of hypotheses in H that are consistent with D.

So every consistent hypothesis has a posterior probability of 1/|VS H,D|, and

every inconsistent hypothesis has a posterior probability of 0.

Therefore, every consistent hypothesis is a MAP hypothesis.
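
A minimal Python sketch of brute-force MAP learning over a toy hypothesis space; the threshold hypotheses and the three training examples are invented purely for illustration:

# Brute-force MAP learning sketch: uniform prior, noise-free data,
# P(D|h) = 1 if h is consistent with every training example, else 0.

# Toy hypothesis space: threshold classifiers h_t(x) = (x >= t) for a few thresholds.
hypotheses = {f"x>={t}": (lambda x, t=t: x >= t) for t in (1, 2, 3, 4)}

# Toy training data: (x, label) pairs.
data = [(1, False), (2, True), (4, True)]

prior = 1.0 / len(hypotheses)                      # uniform P(h)

def likelihood(h, data):
    """P(D|h): 1 if h classifies every example correctly, else 0."""
    return 1.0 if all(h(x) == y for x, y in data) else 0.0

# Unnormalized posteriors P(D|h) * P(h); normalizing by P(D) does not change the argmax.
scores = {name: likelihood(h, data) * prior for name, h in hypotheses.items()}
h_map = max(scores, key=scores.get)
print("posteriors (unnormalized):", scores)
print("hMAP =", h_map)   # here only "x>=2" is consistent, so it is the MAP hypothesis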


Evolution of posterior probabilities P(h|D) with
increasing training data-

(a) Uniform priors assign equal probability to each hypothesis.
(b) As training data increases, first to D1,
(c) then to D1 ∧ D2.

Note- The posterior probability of each inconsistent hypothesis becomes zero, while the

posterior probabilities increase for the hypotheses remaining in the version space.
Maximum Likelihood-
A problem faced by many learning approaches, such as neural
network learning, linear regression, and polynomial curve fitting, is

"learning a continuous-valued target function".

A straightforward Bayesian analysis shows that, under certain assumptions,

any learning algorithm that minimizes the squared error between the hypothesis predictions
and the training data will output a maximum likelihood hypothesis.
Maximum Likelihood-(cont.)

The maximum likelihood hypothesis is hML = argmax over h in H of P(D|h). Assuming the training values di = f(xi) + ei are corrupted by independent, normally distributed noise ei with zero mean, this reduces to

hML = argmin over h in H of Σ over i of (di − h(xi))²

i.e., the hypothesis that minimizes the sum of squared errors.
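
A small Python sketch, using invented noisy data for a hypothetical target f(x) = 2x + 1, showing that maximizing the Gaussian log-likelihood and minimizing the squared error select the same hypothesis from a grid of candidate lines:

import math
import random

random.seed(0)

# Hypothetical target f(x) = 2x + 1 with Gaussian noise on the observed values.
xs = [x / 10 for x in range(20)]
ds = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]

# Candidate hypotheses: lines h(x) = a*x + b over a small grid of (a, b).
candidates = [(a / 10, b / 10) for a in range(0, 41) for b in range(0, 21)]

def squared_error(a, b):
    return sum((d - (a * x + b)) ** 2 for x, d in zip(xs, ds))

def gaussian_log_likelihood(a, b, sigma=0.3):
    # log P(D|h) for independent Gaussian noise with standard deviation sigma.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (a * x + b)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

h_lse = min(candidates, key=lambda h: squared_error(*h))
h_ml = max(candidates, key=lambda h: gaussian_log_likelihood(*h))
print("least-squares hypothesis:", h_lse)
print("max-likelihood hypothesis:", h_ml)   # identical to the least-squares choice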
Minimum description length principle-
This principle is motivated by interpreting the definition of hMAP in the light of basic concepts from information theory.

hMAP can be expressed in terms of maximizing the log2 probabilities-

hMAP = argmax over h in H of [ log2 P(D|h) + log2 P(h) ]

Or, alternatively, minimizing the negative of this quantity-

hMAP = argmin over h in H of [ −log2 P(D|h) − log2 P(h) ]

Minimum description length principle-(cont.)
Let's take a basic result from information theory-

Consider the problem of designing a code to transmit messages drawn at

random, where the probability of encountering message i is pi.

We are interested in the code that minimizes the expected code length; such an
optimal code assigns −log2 pi bits to encode message i.

We refer to the number of bits required to encode message i using code C as

the description length of message i with respect to C, denoted LC(i).
Minimum description length principle-(cont.)

● −log2 P(h) is the description length of h under the optimal encoding for the
hypothesis space H.
○ We denote it LCH(h) = −log2 P(h), where CH is the optimal code for
hypothesis space H.
● −log2 P(D|h) is the description length of the training data D given
hypothesis h.
○ We denote it LCD|H(D|h) = −log2 P(D|h), where CD|H is the optimal code
for describing data D assuming both sender and receiver know hypothesis h.

With this, hMAP can be rewritten as-

hMAP = argmin over h in H of [ LCH(h) + LCD|H(D|h) ]


Minimum description length principle-(cont.)

The MDL principle recommends choosing the hypothesis that minimizes the
sum of these two description lengths.

More generally, let C1 and C2 be the codes used to represent the hypothesis and the data given the
hypothesis, respectively.

Minimum Description Length principle:

Choose hMDL where hMDL = argmin over h in H of [ LC1(h) + LC2(D|h) ]

Minimum description length principle-(cont.)

The MDL principle provides-

- A way to trade off hypothesis complexity against the number of errors

committed by the hypothesis.

- It might therefore prefer a shorter hypothesis that makes a few errors over a longer

hypothesis that perfectly classifies the training data.

- A method for dealing with the issue of overfitting the data.
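
A minimal numeric sketch of this trade-off in Python; the bit counts are hypothetical values chosen only to illustrate the comparison, not derived from real codes:

# MDL comparison sketch: pick the hypothesis minimizing L(h) + L(D|h).
# The bit counts are hypothetical illustration values.

hypotheses = {
    # name: (bits to encode the hypothesis, bits to encode the data given the hypothesis)
    "short tree, 3 training errors": (20, 18),   # simple model, pays bits for its errors
    "large tree, 0 training errors": (55, 0),    # complex model, data is free given it
}

def total_description_length(name):
    l_h, l_d_given_h = hypotheses[name]
    return l_h + l_d_given_h

h_mdl = min(hypotheses, key=total_description_length)
for name in hypotheses:
    print(name, "->", total_description_length(name), "bits")
print("hMDL =", h_mdl)   # the shorter, slightly inaccurate hypothesis wins here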


Bayes optimal classifier-

In general, the most probable classification of a new instance is obtained by

combining the predictions of all hypotheses, weighted by their posterior
probabilities.

● If the possible classification of the new instance can take on any value vj
from some set V,
● then the probability P(vj|D) that the correct classification for the new
instance is vj is

P(vj|D) = Σ over hi in H of P(vj|hi) P(hi|D)
Bayes optimal classifier-(cont.)

The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:

argmax over vj in V of Σ over hi in H of P(vj|hi) P(hi|D)

Any system that classifies new instances according to the above equation is called
a Bayes optimal classifier.

Note- On average, no other classification method using the same hypothesis space and the
same prior knowledge can outperform this method.
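
A minimal Python sketch of Bayes optimal classification over a tiny hypothesis space; the posterior values and per-hypothesis predictions are illustrative assumptions:

# Bayes optimal classification: v* = argmax_v sum over h of P(v|h) * P(h|D).
# Posteriors and per-hypothesis predictions are illustrative values.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h|D)

# P(v|h): each hypothesis here predicts a single class deterministically.
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def class_probability(v):
    return sum(predictions[h][v] * posteriors[h] for h in posteriors)

values = ["+", "-"]
for v in values:
    print("P(%s|D) = %.2f" % (v, class_probability(v)))
print("Bayes optimal classification:", max(values, key=class_probability))   # "-"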
GIBBS Algorithm-

Although the Bayes optimal classifier obtains-

the best possible performance, it is expensive to apply.

An alternative, less optimal method is the Gibbs algorithm, defined as follows-

1. Choose a hypothesis h from H at random, according to the posterior probability

distribution over H.
2. Use h to predict the classification of the new instance.
GIBBS Algorithm-(cont.)

Expected error- Under certain conditions (when target concepts are drawn at random according to the prior probability distribution over H), the expected misclassification error of the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.
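
A minimal Python sketch of the Gibbs algorithm, reusing the same illustrative posteriors and predictions as the Bayes optimal sketch above:

import random

# Gibbs algorithm: sample a single hypothesis according to P(h|D), then classify with it.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h|D), illustrative values
predicted_class = {"h1": "+", "h2": "-", "h3": "-"}     # each hypothesis's prediction

def gibbs_classify():
    h = random.choices(list(posteriors), weights=list(posteriors.values()), k=1)[0]
    return predicted_class[h]

print("Gibbs classification of the new instance:", gibbs_classify())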
Naive Bayes Classifier-
Along with decision trees, neural networks, and nearest-neighbour methods, it is one of
the most practical learning methods.

It can be used when

● a large training set is available, and

● the attributes that describe the instances are conditionally independent given the
classification.

Successful applications include-

● Medical diagnosis
● Classifying text documents
Naive Bayes Classifier-(cont.)
Assume a target function f: X → V,

where each instance x is described by attributes <a1, a2, ..., an>.

The Bayesian approach to classifying a new instance is to assign it the most probable
target value vMAP, defined as-

vMAP = argmax over vj in V of P(vj | a1, a2, ..., an)

We can use Bayes theorem to rewrite this expression as-

vMAP = argmax over vj in V of P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = argmax over vj in V of P(a1, a2, ..., an | vj) P(vj)

Naive Bayes Classifier-(cont.)
The naive Bayes classifier is based on the

simplifying assumption that the attribute values are conditionally independent

given the target value.

In other words, the assumption is that, given the target value of the instance,
the probability of observing the conjunction a1, a2, ..., an is just the product of the
probabilities of the individual attributes:

P(a1, a2, ..., an | vj) = Π over i of P(ai | vj)

which gives the naive Bayes classifier-

vNB = argmax over vj in V of P(vj) Π over i of P(ai | vj)
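
A minimal naive Bayes sketch in Python; the attributes, examples, and labels are invented for illustration, and no smoothing is applied to zero counts:

from collections import Counter, defaultdict

# Hypothetical training data: each row is (attribute dict, target value).
train = [
    ({"outlook": "sunny", "wind": "weak"},    "no"),
    ({"outlook": "sunny", "wind": "strong"},  "no"),
    ({"outlook": "rain",  "wind": "weak"},    "yes"),
    ({"outlook": "rain",  "wind": "strong"},  "no"),
    ({"outlook": "overcast", "wind": "weak"}, "yes"),
]

# Estimate P(v) and P(ai | v) by relative frequencies.
class_counts = Counter(v for _, v in train)
attr_counts = defaultdict(Counter)          # (attribute, class) -> Counter of attribute values
for attrs, v in train:
    for a, val in attrs.items():
        attr_counts[(a, v)][val] += 1

def classify(attrs):
    # v_NB = argmax over v of P(v) * product over i of P(ai | v)
    def score(v):
        p = class_counts[v] / len(train)
        for a, val in attrs.items():
            p *= attr_counts[(a, v)][val] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(classify({"outlook": "rain", "wind": "weak"}))   # prints "yes" for this toy data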
Bayesian belief networks-

● The naive Bayes classifier relies on the assumption that all attribute values are

conditionally independent given the target value, which is often too restrictive.

● Bayesian belief networks, in contrast, describe conditional independence

among subsets of variables.

They are an active focus of research and are also called Bayes nets.


Bayesian belief networks-(cont.)

Consider an arbitrary set of random variables Y1, ..., Yn, where each variable Yi

can take on a set of possible values V(Yi).

We define the joint space of the set of variables Y to be the cross product
V(Y1) x V(Y2) x ... x V(Yn).

The probability distribution over this joint space is called the joint probability
distribution.

● A Bayesian belief network describes the joint probability distribution for a set
of variables.
Conditional Independence
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, i.e., P(X | Y, Z) = P(X | Z).

Representation
A Bayesian belief network represents the joint probability distribution by a directed acyclic graph, whose arcs encode conditional-independence assumptions, together with a conditional probability table for each variable giving its distribution given its immediate parents. The joint probability over the variables factors as

P(y1, ..., yn) = Π over i of P(yi | Parents(Yi))
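
A small Python sketch of computing a joint probability from a network's conditional probability tables; the two-variable network (Storm → Lightning) and its numbers are hypothetical:

# Joint probability from a tiny hypothetical belief network: Storm -> Lightning.
# P(y1, ..., yn) = product over i of P(yi | Parents(Yi)).

p_storm = {True: 0.1, False: 0.9}                      # P(Storm), no parents

p_lightning_given_storm = {                            # P(Lightning | Storm)
    True:  {True: 0.8,  False: 0.2},
    False: {True: 0.05, False: 0.95},
}

def joint(storm, lightning):
    return p_storm[storm] * p_lightning_given_storm[storm][lightning]

# P(Storm = True, Lightning = True) = 0.1 * 0.8 = 0.08
print(joint(True, True))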
Learning Bayesian Belief Networks-
The learning task for Bayesian networks varies along two dimensions-

● The network structure might be known or unknown.

● The training examples might provide values of all the network variables, or only some of them.

If the structure is known and all variables are observed-

Then the task is as easy as training a Naive Bayes Classifier.

If the structure is known and the variables are only partially observable-

Then the task is harder; the conditional probability tables can be learned by gradient-ascent search (analogous to training a neural network with hidden units) or by the EM algorithm.

If structure is unknown-

Algorithms use greedy search to add / subtract edges and nodes.


Expectation Maximization (EM) Algorithm-

This algorithm can be used when-

● The data is only partially observable.

● The target value is unobservable.
● Some of the instance attributes are unobservable.

This algorithm has been used to-

● Train Bayesian belief networks

● Perform unsupervised clustering
● Learn Hidden Markov Models
The easiest way to introduce the EM algorithm-

Estimating the Means of K Gaussians-

Consider a problem in which the data D is a set of instances generated by a

probability distribution that is a mixture of k distinct normal
distributions (i.e., Gaussians).

Each instance is generated by a two-step process-

1. Choose one of the k Gaussians with uniform probability.

2. Generate an instance at random according to that Gaussian.
EM Algorithm-
For this problem the hypothesis is h = <μ1, ..., μk>, the means of the k Gaussians, and the hidden variables zij indicate which Gaussian generated each instance xi. The EM algorithm repeats two steps until convergence-

Step 1 (Estimation): calculate the expected value E[zij] of each hidden variable, assuming the current hypothesis holds:

E[zij] = exp(−(xi − μj)² / 2σ²) / Σ over n of exp(−(xi − μn)² / 2σ²)

Step 2 (Maximization): calculate a new maximum likelihood hypothesis, replacing each mean by the E[zij]-weighted sample mean:

μj ← Σ over i of E[zij] xi / Σ over i of E[zij]
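
A minimal Python sketch of these two steps for k = 2 Gaussians with known, equal variance; the data and the initialization are invented for illustration:

import math
import random

random.seed(1)

# Hypothetical data drawn from a mixture of two Gaussians with means 0 and 5 (sigma = 1).
data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(5, 1) for _ in range(50)]

k, sigma = 2, 1.0
means = [min(data), max(data)]   # crude but distinct initial hypothesis h = <mu1, mu2>

for _ in range(30):
    # Step 1 (E-step): expected values E[z_ij] under the current means.
    e = []
    for x in data:
        w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
        total = sum(w)
        e.append([wj / total for wj in w])
    # Step 2 (M-step): re-estimate each mean as the E[z]-weighted average of the data.
    means = [sum(e[i][j] * data[i] for i in range(len(data))) /
             sum(e[i][j] for i in range(len(data)))
             for j in range(k)]

print("estimated means:", [round(m, 2) for m in means])   # converges toward roughly 0 and 5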
END OF CHAPTER-6
