
UNIT IV LEARNING

Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and
Approximate Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning -
Supervised Learning - Learning Decision Trees – Regression and Classification with Linear
Models - Artificial Neural Networks – Nonparametric Models - Support Vector Machines -
Statistical Learning - Learning with Complete Data - Learning with Hidden Variables- The EM
Algorithm – Reinforcement Learning

1. Probability basics
• Random Variables
• Joint and Marginal Distributions
• Conditional Distribution
• Product Rule, Chain Rule

Probabilistic assertions summarize effects of


– laziness: failure to enumerate exceptions, qualifications, etc.
– Theoretical and Practical ignorance: lack of relevant facts, initial conditions, etc.
Subjective probability:
• Probabilities relate propositions to agent's own state of knowledge
e.g., P(A25 | no reported accidents) = 0.06
These are not assertions about the world
Probabilities of propositions change with new evidence:
e.g., P(A25 | no reported accidents, 5 a.m.) = 0.15
Making decisions under uncertainty
– Utility theory is used to represent and infer preferences
– Decision theory = probability theory + utility theory
Uncertainty
▪ Observed variables (evidence): Agent knows certain things about the state of the world
(e.g., sensor readings or symptoms)
▪ Unobserved variables: Agent needs to reason about other aspects (e.g. where an object is or
what disease is present)
▪ Model: Agent knows something about how the known variables relate to the unknown
variables
Probabilistic reasoning gives us a framework for managing our beliefs and knowledge.

Random Variables
The basic element of probabilistic reasoning is the random variable.
• Similar to propositional logic: possible worlds defined by assignment of values to random
variables.
• Boolean random variables
e.g., Cavity (do I have a cavity?)
• Discrete random variables
e.g., Weather is one of <sunny,rainy,cloudy,snow>
• Domain values must be exhaustive and mutually exclusive
• Elementary proposition constructed by assignment of a value to a
random variable: e.g., Weather = sunny, Cavity = false (abbreviated as cavity)



• Complex propositions formed from elementary propositions and standard logical connectives
e.g., Weather = sunny ∨ Cavity = false

A random variable is some aspect of the world about which we (may) have uncertainty
R = Is it raining?
T = Is it hot or cold?
D = How long will it take to drive to work?
L = Where is the ghost?

Random variables with capital letters and have domains


R in {true, false} (often write as {+r, -r})
T in {hot, cold}
D in [0, )
L in possible locations, maybe {(0,0), (0,1), …}

Axioms of probability
For any propositions A, B
– 0 ≤ P(A) ≤ 1
– P(true) = 1 and P(false) = 0
– P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

Probability Distributions
Associate a probability with each value
Temperature T : P(T)

Weather W: P(W)

Shorthand notation:
P(hot)=P(T=hot)
P(cold)=P(T=cold)
P(rain)=P(W=rain)
A probability (lower case value) is a single number
P(rain)=P(W=rain)=0.1

Prior probability



Prior or unconditional probabilities of propositions
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond to belief prior to
arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
P(Weather) = <0.72,0.1,0.08,0.1> (normalized, i.e., sums to 1)

Joint probability distribution


Joint probability distribution for a set of random variables gives the probability of every
atomic event on those random variables
P(Weather,Cavity) = a 4 × 2 matrix of values:
Weather = sunny rainy cloudy snow
Cavity = true 0.144 0.02 0.016 0.02
Cavity = false 0.576 0.08 0.064 0.08

Joint Distributions
A joint distribution over a set of random variables: X1, X2, X3,…Xn specifies a real number for
each assignment (or outcome):
P(X1 = x1, X2 = x2, X3 = x3, …, Xn = xn), abbreviated as P(x1, x2, x3, …, xn)
It must obey
P(x1, x2, x3, …, xn) ≥ 0, and the probabilities over all assignments must sum to 1

P(T, W)

A probabilistic model is a joint distribution over a set of random variables


Probabilistic models:
(Random) variables with domains
Assignments are called outcomes
Joint distributions: say whether assignments (outcomes) are likely
Normalized: sum to 1.0
Ideally: only certain variables directly interact

Events
Atomic event: A complete specification of the state of the world about which the agent is uncertain
E.g., if the world consists of only two Boolean variables Cavity and Toothache, then there are 4
distinct atomic events:
Cavity = false Toothache = false
Cavity = false  Toothache = true
Cavity = true  Toothache = false
Cavity = true  Toothache = true
An event is a set E of outcomes

▪ From a joint distribution, we can calculate the probability of any event


▪ Probability that it’s hot AND sunny?



▪ Probability that it’s hot?
▪ Probability that it’s hot OR sunny?
▪ Typically, the events we care about are partial assignments, like P(T=hot)
▪ Atomic events are mutually exclusive and exhaustive

Marginal Distributions
▪ Marginal distributions are sub-tables which eliminate variables
▪ Marginalization (summing out): Combine collapsed rows by adding

P (T, W)

Conditional Probabilities
Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
P(Cavity | Toothache) = a 2-element vector of 2-element vectors
If we know more, e.g., cavity is also given, then we have
P(cavity | toothache, cavity) = 1
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
▪ A simple relation between joint and conditional probabilities



▪ In fact, this is taken as the definition of a conditional probability
P(a | b) = P(a ∧ b) / P(b) if P(b) > 0

P(b | a) = P(a ∧ b) / P(a) if P(a) > 0

Product rule gives an alternative formulation:


P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
Example:
P(Weather,Cavity) = P(Weather | Cavity) P(Cavity)

P (T, W)

Conditional Distributions
▪ Conditional distributions are probability distributions over some variables given fixed values
of others



Normalization



Normalize
Procedure:
Step 1: Compute Z = sum over all entries
Step 2: Divide every entry by Z
Example 1:

Example 2:

Conditional distribution at once:


▪ Select the joint probabilities matching the evidence
▪ Normalize the selection (make it sum to one)
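To make the procedure concrete, here is a minimal Python sketch; the joint values for P(T, W) are made-up placeholders, since the original tables are not reproduced above:

# Normalize a selection of joint probabilities so that it sums to one.
def normalize(entries):
    z = sum(entries.values())                         # Step 1: Z = sum over all entries
    return {k: v / z for k, v in entries.items()}     # Step 2: divide every entry by Z

# A made-up joint distribution P(T, W), used only for illustration.
joint = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
         ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

# Conditional distribution P(T | W = rain):
# select the joint entries matching the evidence, then normalize.
selected = {t: p for (t, w), p in joint.items() if w == 'rain'}
print(normalize(selected))    # {'hot': 0.25, 'cold': 0.75}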

Probabilistic Inference
▪ compute a desired probability from other known probabilities (e.g. conditional from joint)
Compute conditional probabilities
▪ P(on time | no reported accidents) = 0.90
▪ These represent the agent’s beliefs given the evidence
Probabilities change with new evidence:
▪ P(on time | no accidents, 5 a.m.) = 0.95
▪ P(on time | no accidents, 5 a.m., raining) = 0.80
▪ Observing new evidence causes beliefs to be updated



The Chain Rule
More generally, can always write any joint distribution as an incremental product of conditional
distributions
• Chain rule is derived by successive application of product rule:
P(X1, …,Xn) = P(X1,...,Xn-1) P(Xn | X1,...,Xn-1)
= P(X1,...,Xn-2) P(Xn-1 | X1,...,Xn-2) P(Xn | X1,...,Xn-1)
=…
= ∏ i=1..n P(Xi | X1, … , Xi-1)

Conditional Independence
▪ Conditional independence is our most basic and robust form of knowledge about uncertain
environments.
▪ X is conditionally independent of Y given Z
if and only if P(X | Y, Z) = P(X | Z), or,
equivalently, if and only if P(X, Y | Z) = P(X | Z) P(Y | Z)

Example 1:
▪ P(Toothache, Cavity, Catch)
▪ If I have a cavity, the probability that the probe catches in it doesn't depend on whether I
have a toothache:
▪ P(+catch | +toothache, +cavity) = P(+catch | +cavity)
▪ The same independence holds if I don’t have a cavity:
▪ P(+catch | +toothache, -cavity) = P(+catch| -cavity)
▪ Catch is conditionally independent of Toothache given Cavity:
▪ P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
▪ P(Toothache | Catch , Cavity) = P(Toothache | Cavity)
▪ P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
▪ One can be derived from the other easily



Joint probability distribution for Toothache, Cavity, Catch (dental example):
                  toothache                 ¬toothache
              catch     ¬catch          catch     ¬catch
cavity        0.108     0.012           0.072     0.008
¬cavity       0.016     0.064           0.144     0.576

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

P(cavity | toothache) = P(cavity  toothache)/ P(toothache)


= 0.016+0.064 / 0.108 + 0.012 + 0.016 + 0.064
= 0.4
1. Worst-case time complexity O(d^n), where d is the largest arity
2. Space complexity O(d^n) to store the joint distribution

Example 2:
▪ P(Traffic, Umbrella, Raining)
Example 3:
▪ P(Fire, Smoke, Alarm)

2. Bayes Rule and its Applications


Two ways to factor a joint distribution over two variables:
P(x, y) = P(x | y) P(y) = P(y | x) P(x)
Dividing through by P(y) gives Bayes' rule:
P(x | y) = P(y | x) P(x) / P(y)

Inference with Bayes' Rule
▪ Diagnostic probability from causal probability:
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
Example:
M: meningitis, S: stiff neck
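A minimal Python sketch of this diagnostic inference; the numeric values below are illustrative assumptions, not the figures from the original example:

# Bayes' rule: P(cause | effect) = P(effect | cause) P(cause) / P(effect)
def bayes(likelihood, prior, evidence):
    return likelihood * prior / evidence

# Assumed values: P(s | m), P(m), P(s) are placeholders for illustration only.
p_s_given_m = 0.7        # P(stiff neck | meningitis)
p_m = 1 / 50000          # prior probability of meningitis
p_s = 0.01               # probability of a stiff neck
print(bayes(p_s_given_m, p_m, p_s))    # P(m | s) = 0.0014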

3. Bayesian networks
A simple, graphical notation for conditional independence assertions and hence for compact
specification of full joint distributions
Syntax
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
P (Xi | Parents (Xi))
In the simplest case, conditional distribution represented as a conditional probability table (CPT)
giving the distribution over Xi for each combination of parent values
• Topology of network encodes conditional independence assertions:

• Weather is independent of the other variables


• Toothache and Catch are conditionally independent given Cavity
Example



I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call.
Sometimes it's set off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
• Network topology reflects "causal" knowledge:
• A burglar can set the alarm off
• An earthquake can set the alarm off
• The alarm can cause Mary to call
• The alarm can cause John to call

Compactness

• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent
values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1 - p)

• If each variable has no more than k parents, the complete network requires O(n · 2^k)
numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution
• For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 - 1 = 31)
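To make the compactness point concrete, the sketch below stores the ten numbers for the burglary network and evaluates one atomic event as a product of node-given-parents factors. The CPT values shown are the ones commonly quoted with this textbook example; the figure with the tables is not reproduced above, so treat them as assumed placeholders.

# CPTs for the burglary network (assumed textbook-style values).
P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(Alarm | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as a product of node-given-parents factors."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

# e.g. alarm sounds and both neighbors call, with no burglary or earthquake
print(joint(False, False, True, True, True))    # about 0.00063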

Constructing Bayesian networks


Let variables X1, … ,Xn
For i = 1 to n
– add Xi to the network
– select parents from X1, … ,Xi-1 such that
P (Xi | Parents(Xi)) = P (Xi | X1, ... Xi-1)
This choice of parents guarantees:
P (X1, … ,Xn) = ∏ i=1..n P (Xi | X1, … , Xi-1) (chain rule)
             = ∏ i=1..n P (Xi | Parents(Xi)) (by construction)

Suppose we choose the ordering M, J, A, B, E



P(J | M) = P(J)?
No
P(A | J, M) = P(A | J)? No    P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A ,J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes



4. Markov Models
Reasoning over Time or Space
▪ Speech recognition
▪ Robot localization
▪ User attention
▪ Medical monitoring

Value of X at a given time is called the state

▪ Parameters: called transition probabilities or dynamics, specify how the state evolves over
time (also, initial state probabilities)
▪ Stationarity assumption: transition probabilities the same at all times
▪ Same as MDP transition model, but no choice of action

Joint Distribution of a Markov Model
P(X1, X2, …, XT) = P(X1) P(X2 | X1) P(X3 | X2) … P(XT | XT-1) = P(X1) ∏ t=2..T P(Xt | Xt-1)

Chain Rule and Markov Models
▪ From the chain rule, every joint distribution over X1, X2, …, XT can be written as:
P(X1, X2, …, XT) = ∏ t=1..T P(Xt | X1, …, Xt-1)
▪ Assuming that, for all t, P(Xt | X1, …, Xt-1) = P(Xt | Xt-1), the joint simplifies to the
Markov-model form above.

Implied Conditional Independencies
▪ The Markov assumption implies that past variables are independent of future variables given
the present: if t1 < t2 < t3, then Xt1 and Xt3 are conditionally independent given Xt2.
▪ The transition model P(Xt | Xt-1) is the same for all t (stationarity).

Conditional Independence
▪ Basic conditional independence:
▪ Past and future are independent given the present
▪ Each time step only depends on the previous
▪ This is called the (first order) Markov property

Example Markov Chain: Weather


▪ States: X = {rain, sun}
▪ Initial distribution: 1.0 sun

▪ CPT P(Xt | Xt-1):



Two new ways of representing the same CPT

▪ Initial distribution: 1.0 sun


▪ What is the probability distribution after one step?
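A small Python sketch of the one-step update P(Xt+1) = Σx P(Xt+1 | Xt = x) P(Xt = x); the transition probabilities below are assumed values, since the CPT table is not reproduced above:

# Assumed transition model P(X_t | X_{t-1}); the slide's actual values are not shown above.
T = {'sun':  {'sun': 0.9, 'rain': 0.1},
     'rain': {'sun': 0.3, 'rain': 0.7}}

def step(belief):
    """One Markov-chain step: P(X_{t+1}) = sum_x P(X_{t+1} | x) * P(X_t = x)."""
    return {s2: sum(T[s1][s2] * p for s1, p in belief.items()) for s2 in T}

belief = {'sun': 1.0, 'rain': 0.0}    # initial distribution: 1.0 sun
print(step(belief))                   # distribution after one step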

5. Forms of Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.

Any component of an agent can be improved by learning from data. The improvements, and
the techniques depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.

Components to be learned
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent
can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent’s utility.

Representation and prior knowledge


Propositional and first-order logical sentences for the components in a logical agent;
Bayesian networks for the inferential components of a decision-theoretic agent, and so on.
Effective learning algorithms have been devised for all of these representations.



Inductive learning: learning a general function or rule from specific input–output pairs.
Analytical or deductive learning: going from a known general rule to a new rule that is logically
entailed, but is useful because it allows more efficient processing.

Feedback to learn from


In unsupervised learning the agent learns patterns in the input even though no explicit
feedback is supplied. The most common unsupervised learning task is clustering: detecting
potentially useful clusters of input examples.
In reinforcement learning the agent learns from a series of reinforcements—rewards or
punishments.
In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output.
In semi-supervised learning we are given a few labeled examples and must make what we can
of a large collection of unlabeled examples.

6. Supervised Learning
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1), (x2, y2), . . . (xN, yN) ,
where each yj was generated by an unknown function y = f(x),
discover a function h that approximates the true function f.
Here x and y can be any value; they need not be numbers. The function h is a hypothesis.

Learning is a search through the space of possible hypotheses for one that will perform well,
even on new examples beyond the training set. To measure the accuracy of a hypothesis we
give it a test set of examples that are distinct from the training set. We say a hypothesis
generalizes well if it correctly predicts the value of y for novel examples.

The examples are points in the (x, y) plane, where y = f(x). We don’t know what f is, but we will
approximate it with a function h selected from a hypothesis space, H, which for this example we will
take to be the set of polynomials, such as x5+3x2+2. The above figure shows some data with an exact
fit by a straight line (the polynomial 0.4x + 3). The line is called a consistent hypothesis because it



agrees with all the data. Figure b) shows a high degree polynomial that is also consistent with the same
data.
Ockham’s razor
Prefer the simplest hypothesis consistent with the data. This principle is called Ockham’s razor.

A learning problem is realizable if the hypothesis space contains the true function.

Classification
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning
problem is called classification, and is called Boolean or binary classification if there are only two values.

Regression
When y is a number (such as tomorrow’s temperature), the learning problem is called
regression.

Supervised learning can be done by choosing the hypothesis h∗ that is most probable given the
data:
h∗ = argmaxh∈H P(h|data) .
By Bayes’ rule this is equivalent to
h∗ = argmaxh∈H P(data|h) P(h) .
with the prior probability P(h)

7. Learning Decision Trees


Decision tree induction is one of the simplest and yet most successful forms of machine learning.
The decision tree representation
A decision tree represents a function that takes as input a vector of attribute values and
returns a “decision”—a single output value. The input and output values can be discrete or continuous.
A decision tree reaches its decision by performing a sequence of tests.
Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai,
and the branches from the node are labeled with the possible values of the attribute, Ai =vik.
Each leaf node in the tree specifies a value to be returned by the function.



Decision trees analysis are of two main types:
- Classification tree analysis when the leaves are the classes
- Regression tree analysis when the leaves are real numbers or intervals
Several approaches
- use information theoretic measures to guide the selection and order of features (e.g
Information gain and Gini impurity)
- prune the tree at a later stage
- generate several decision trees in parallel (e.g. random forest).
- always prefer the simplest of equivalent trees (Occam´s razor).
Attribute Selection Measures
 While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem there is a
technique called an attribute selection measure, or ASM. With this measure,
we can easily select the best attribute for the nodes of the tree.
Two popular techniques for ASM
 Information Gain
 Gini Index
Information Gain
IG(S, A) = Entropy(S) − ∑ i=1..n (|Si| / |S|) Entropy(Si)

Entropy(S) = ∑ i −P(xi) log2 P(xi)

Algorithm
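The algorithm figure is not reproduced above. The following Python sketch of a standard ID3-style decision-tree learner (recursive splitting on the attribute with the highest information gain) is offered as an illustration; the helper names are my own, not from the original notes.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: sum of -p log2 p."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG(S, A) = Entropy(S) - sum_i (|S_i|/|S|) Entropy(S_i)."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def learn_tree(rows, labels, attrs):
    """Recursive ID3-style learner; returns a nested dict or a class label."""
    if len(set(labels)) == 1:            # all examples have the same class
        return labels[0]
    if not attrs:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = learn_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree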



Example1:

Four Attributes / Variables / Feature


Input : Outlook, Temperature, Humidity, Wind
Output : Play Golf

Information Gain
IG(S, Outlook) = Entropy(S) − ∑ i=1..n (|Si| / |S|) Entropy(Si)
Outlook Attribute
Values(Outlook) = {Rainy, Overcast, Sunny}

Outlook (9 Yes, 5 No) Entropy (S Outlook)= - 9/14 log2 (9/14) - 5/14 log2 (5/14) =0.94
Rainy (2 Yes, 3 No) Entropy (SRainy)= - 2/5 log2 (2/5) - 3/5 log2 (3/5) =0.971
Overcast (4 Yes, 0 No) Entropy (SOvercast)= - 4/4 log2 (4/4) - 0/4 log2 (0/4) = 0
Sunny (3 Yes, 2 No) Entropy (SSunny)= - 3/5 log2 (3/5) - 2/5 log2 (2/5) =0.971

Information Gain (Outlook)


= Entropy (S) – 5/14 Entropy (SRainy) – 4/14 Entropy (SOvercast) – 5/14 Entropy (SSunny)
= 0.94- (5/14) * 0.971 – (4/14)* 0 –(5/14)* 0.971
IG(S, Outlook) = 0.2464

Temperature Attribute
Values(Temp )={Hot, Mild, Cool}
Temp (9 Yes, 5 No) Entropy (STemp)= - 9/14 log2 (9/14) - 5/14 log2 (5/14) = 0.94
Hot (2 Yes, 2 No) Entropy (SHot)= - 2/4 log2 (2/4) - 2/4 log2 (2/4) = 1.0
Mild (4 Yes, 2 No) Entropy (SMild)= - 4/6 log2 (4/6) - 2/6 log2 (2/6) = 0.918
Cool (3 Yes, 1 No) Entropy (SCool)= - 3/4 log2 (3/4) - 1/4 log2 (1/4) = 0.8113
Gain IG(S, Temp) = Entropy(S) - Sum (|Si|/|S| Entropy(Si))
= Entropy (S) – 4/14 Entropy (SHot) – 6/14 Entropy (SMild) – 4/14 Entropy (SCool)
= 0.94 - (4/14) * 1.0 – (6/14) * 0.918 – (4/14) * 0.8113
= 0.0289

Humidity Attribute
Values(Humidity) = {High, Normal}
Humidity (9 Yes, 5 No) Entropy (SHumidity)= - 9/14 log2 (9/14) - 5/14 log2 (5/14)
= - (0.6428 * -0.6374) - (0.357 * -1.4854)
= 0.4097 + 0.5303 = 0.94
High (3 Yes, 4 No) Entropy (SHigh)= - 3/7 log2 (3/7) - 4/7 log2 (4/7)
= -(0.4286 * -1.2224) - (0.5714 * -0.8074)
= 0.5239 + 0.4613 = 0.985
Normal (6 Yes, 1 No) Entropy (SNormal)= - 6/7 log2 (6/7) - 1/7 log2 (1/7)
= -(0.8571 * -0.2224) - (0.1429 * -2.8074)
= 0.1906 + 0.4012 = 0.592

Gain IG(S, Humidity) = Entropy(S) - Sum (|Si|/|S| Entropy(Si))

= Entropy (S) – 7/14 Entropy (SHigh) – 7/14 Entropy (SNormal)
= 0.94 - (7/14) * 0.985 – (7/14) * 0.592
= 0.94 - 0.4925 - 0.2960
= 0.1516

Wind Attribute
Values(Wind)={True, False}
Wind (9 Yes, 5 No) Entropy (S Wind)= - 9/14 log2 (9/14) - 5/14 log2 (5/14) =0.94
True (3 Yes, 3 No) Entropy (Strue)= - 3/6 log2 (3/6) - 3/6 log2 (3/6)
= -(0.5 * -1) – (0.5 * -1)
= 0.5 + 0.5 = 1
False (6 Yes, 2 No) Entropy (Sfalse)= - 6/8 log2 (6/8) - 2/8 log2 (2/8)
= -(0.75 * -0.415) - (0.25 * -2)
= 0.31125 + 0.5 = 0.81125
Gain IG(S, Wind) = Entropy(S) - Sum (|Si|/|S| Entropy(Si))
= Entropy (S) – 6/14 Entropy (Strue) – 8/14 Entropy (Sfalse)
= 0.94 - (6/14) * 1.0 – (8/14) * 0.81125
= 0.94 - 0.4286 - 0.4636
= 0.0478

IG(S, Outlook) = 0.2464


IG(S, Temp) = 0.0289 IG(SSunny, Temp) = 0.57 IG(SRainy, Temp) = 0.0192
IG(S, Humidity) = 0.1516 IG(SSunny, Humidity) = 0.97 IG(SRainy, Humidity) = 0.0192
IG(S, Wind) = 0.0478 IG(SSunny, Wind) = 0.0192 IG(SRainy, Wind) = 0.97
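The hand computation above can be checked with a few lines of Python, using the per-attribute (Yes, No) counts from the worked example; small differences from the figures above come from rounding intermediate values:

import math

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(total_counts, splits):
    """IG = Entropy(S) - sum_i (|S_i|/|S|) * Entropy(S_i)."""
    n = sum(total_counts)
    rem = sum(((p + q) / n) * entropy(p, q) for p, q in splits)
    return entropy(*total_counts) - rem

S = (9, 5)  # 9 Yes, 5 No overall
print(round(gain(S, [(2, 3), (4, 0), (3, 2)]), 3))   # Outlook  -> about 0.247
print(round(gain(S, [(2, 2), (4, 2), (3, 1)]), 3))   # Temp     -> about 0.029
print(round(gain(S, [(3, 4), (6, 1)]), 3))           # Humidity -> about 0.152
print(round(gain(S, [(3, 3), (6, 2)]), 3))           # Wind     -> about 0.048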



Example2:
Build a decision tree to decide whether to wait for a table at a restaurant. The aim here is to learn a
definition for the goal predicate WillWait

List the attributes for the input:


1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar : whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full ).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60, or >60).

Expressiveness of decision trees


A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true
if and only if the input attributes satisfy one of the paths leading to a leaf with value true
Goal ⇔ (Path1 ∨ Path2 ∨ ・ ・ ・) ,
Path is a conjunction of attribute-value tests, whole expression is equivalent to disjunctive normal form
Path = (Patrons =Full ∧ WaitEstimate =0–10) .



A truth table over n attributes has 2^n rows, one for each combination of values of the attributes. That
means there are 2^(2^n) different Boolean functions. With the ten Boolean attributes of our restaurant
problem there are 2^1024, or about 10^308, different functions.

Inducing decision trees from examples


An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector of values for the
input attributes and y is a single Boolean output value.

Fig:A decision tree for deciding whether to wait for a table.

Examples for the restaurant domain.


Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:



Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all
negative".

Patrons? is a better choice: it gives information about the classification.


Information Gain
Information in an answer when the prior is <P1, …, Pn> is
I(<P1, …, Pn>) = ∑ i −Pi log2 Pi
(also called the entropy of the prior)

An attribute splits the examples E into subsets Ei, each of which (we hope) needs less information to
complete the classification.
Let Ei have pi positive and ni negative examples; then
I(<pi/(pi+ni), ni/(pi+ni)>) bits are needed to classify a new example in that branch, and the expected
number of bits per example over all branches is
Remainder = ∑ i ((pi + ni)/(p + n)) I(<pi/(pi+ni), ni/(pi+ni)>)

For Patrons?, this is 0.459 bits; for Type this is (still) 1 bit. Choose the attribute that minimizes the
remaining information needed.



8. Regression and Classification with Linear Models
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or
more independent variables (x); hence the name linear regression.

Linear regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.

y= a0+a1x+ ε
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.



Types of Linear Regression
Simple Linear Regression
y= a0+a1x+ ε
Multiple Linear regression
y= a0+a1x1+ a2x2+…. ε
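A minimal sketch of fitting a simple linear regression by least squares; the data points are made up for illustration:

# Fit y = a0 + a1*x by ordinary least squares (closed-form solution).
xs = [1, 2, 3, 4, 5]            # illustrative data, not from the notes
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# a1 = covariance(x, y) / variance(x);  a0 = mean_y - a1 * mean_x
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print(a0, a1)                   # roughly a0 = 0.15, a1 = 1.97 for these points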

Classification Algorithm
The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.

Types of Classifications
Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.

9. Artificial Neural Networks


An artificial neural network is a machine learning approach that models or mimics the human brain.



Step_t(x) = 1 if x ≥ t, else 0
Sign(x) = +1 if x ≥ 0, else −1
Sigmoid(x) = 1 / (1 + e^−x)
Identity(x) = x

Output from Neuron, O:
O = f( ∑ j Wj Ij ), the output value from the neuron, where f is the activation function
Ij : Inputs being presented to the neuron
Wj : Weight from input neuron (Ij) to the output neuron
LR : The learning rate. This dictates how quickly the network converges. It is set by
experimentation; it is typically 0.1
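A sketch of a single neuron and a perceptron-style weight update along the lines described above; the step threshold and learning rate are assumed values:

# A single artificial neuron: weighted sum of inputs passed through a step function.
def neuron_output(weights, inputs, threshold=0.5):
    s = sum(w * i for w, i in zip(weights, inputs))
    return 1 if s >= threshold else 0

def train_step(weights, inputs, target, lr=0.1):
    """Perceptron-style update: w_j <- w_j + LR * (target - output) * I_j."""
    out = neuron_output(weights, inputs)
    return [w + lr * (target - out) * i for w, i in zip(weights, inputs)]

w = [0.2, -0.1]
print(neuron_output(w, [1, 1]))         # output before training
w = train_step(w, [1, 1], target=1)     # nudge weights toward the desired output
print(w)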

AI enables computers to think; ML learns from data without being explicitly programmed;
DL (deep learning) uses multi-layer neural networks.

Typical workflow: read the training data (with class labels) → preprocessing → build and train
an ANN model → evaluate the model on test data (with class labels) → use the trained model to
predict outputs for test data without class labels.



A neural network consists of inter connected processing elements called neurons that work
together to produce an output function. The output of a neural network relies on the cooperation
of the individual neurons within the network to operate.
Well-designed neural networks are trainable systems that can often “learn” to solve
complex problems from a set of exemplars and generalize the “acquired knowledge” to solve
unforeseen problems, i.e. they are self-adaptive systems.
A neural network is used to refer to a network of biological neurons. A neural network
consists of a set of highly interconnected entities called nodes or units. Each unit accepts a
weighted set of inputs and responds with an output.
Basics of Neural Network
• Biological approach to AI, developed in 1943
• Comprised of one or more layers of neurons
• Several types; we'll focus on feed-forward and feedback networks

• Models of the brain and nervous system


• Highly parallel
– Process information much more like the brain than a serial computer
• Learning
• Very simple principles
• Very complex behaviours
• Applications
– As powerful problem solver
– As biological models

Biological Neuron
Biologically, we can also define a neuron. The human body is made up of a vast array of
living cells. Certain cells are interconnected in a way that allows them to communicate pain or to
actuate fibres or tissues. Some cells control the opening and closing of minuscule valves in the
veins and arteries. These specialized communication cells are called neurons. Neurons are
equipped with long tentacle like structures that stretch out from the cell body, permitting them to
communicate with other neurons. The tentacles that take in signals from other cells and the
environment itself are called dendrites, while the tentacles that carry signals from the neuron to
other cells are called axons.



Fig: Correspondence between a biological neuron and an artificial neuron.

Artificial Neuron
A neural network is a graph, with patterns represented in terms of numerical values
attached to the nodes of the graph and transformations between patterns achieved via simple
message-passing algorithms.
The graph contains a number of units and weighted unidirectional connections between
them. The output of one unit typically becomes an input for another. There may also be units with
external inputs and outputs. The nodes in the graph are generally distinguished as being input
nodes or output nodes and the graph as a whole can be viewed as a representation of a multivariate
function linking inputs to outputs.
Numerical values (weights) are attached to the links of the graphs, parameterizing the
input/ output function and allowing it to be adjusted via a learning algorithm. A broader view of
neural network architecture involves treating the network as a statistical processor characterized
by making particular probabilistic assumptions about data. Figure illustrates one example of a
possible neural network structure.



Support Vector Machine
 A Support Vector Machine is a kind of machine learning algorithm which:
 Operates in a supervised mode with pre-classified examples
 Can also operate in an incremental mode
 Is an instance of a non-probabilistic binary linear classifier system, where binary means
that it classifies instances into two classes and linear means that the instance space has
to be linearly separable
 Can be used to handle non-linear problems if the original instance space is transformed
into a linearly separable one
 Manages a flexible representation of the class boundaries
 Contains mechanisms to handle overfitting
 Has a single global minimum which can be found in polynomial time.
 It is popular because:
 it is easy to use
 it often has good generalization performance the same algorithm solves a variety of
problems with little tuning.

Hyperplane
A hyperplane in an n-dimensional Euclidean space is a n-1 dimensional subset of that space that
divides the space into two disconnected parts. Examples show hyperplanes in 2 and 3 dimensions

A Support Vector Machine (SVM) performs binary linear classification by finding the optimal
hyperplane that separates the two classes of instances

 Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.



 Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
Support Vectors:
 The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support the
hyperplane, hence called a Support vector.
 The SVM algorithm aims at maximizing the margins around the separating hyperplane,
i.e. maximizing distances between the target hyperplane and the chosen closest support
vectors.
 The optimal (maximum margin) hyperplane is fully specified by the chosen support
vectors.
 The optimization problem can be expressed as a Quadratic programming problem that
can be solved by standard methods.
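As a usage sketch, assuming scikit-learn is available, a maximum-margin linear SVM can be trained and queried as follows (the toy points are made up):

from sklearn.svm import SVC

# Toy 2-D data: two linearly separable classes (illustrative values).
X = [[1, 1], [2, 1], [1, 2],        # class 0
     [4, 4], [5, 4], [4, 5]]        # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear', C=1.0)   # linear kernel -> maximum-margin hyperplane
clf.fit(X, y)

print(clf.support_vectors_)         # the support vectors chosen by the optimizer
print(clf.predict([[3, 3]]))        # classify a new point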



SVM for Non-linear and Non-Feature-Vector Instance Spaces
As the SVM only handles linearly separable instance spaces with instances expressed in fixed-length,
real-valued feature vector form, all problems that we want to approach with SVM have to
be transformed into such linearly separable spaces, typically of higher dimensionality
than the original.
We have two important cases:
1. Non linearly separable spaces of feature vector instances
2. Instance spaces with the raw instances expressed in other forms

10. Nonparametric Models


Linear regression and neural networks use the training data to estimate a fixed set of parameters w.
Parametric model
A learning model that summarizes data with a set of parameters of fixed size (independent of the
number of training examples) is called a parametric model.
When data sets are small, it makes sense to have a strong restriction on the allowable hypotheses,
to avoid overfitting.
When we have thousands or millions or billions of examples, it seems like a better idea to let the data
speak for themselves rather than forcing them to speak through a tiny vector of parameters.

Nonparametric model
A nonparametric model is one that cannot be characterized by a bounded set of parameters.
Example: suppose that each hypothesis retains within itself all of the training examples and uses
all of them to predict the next example.
Instance-based learning or memory-based learning.
The effective number of parameters is unbounded—it grows with the number of
examples.
Table lookup is the simplest instance-based learning method : take all the training examples, put
them in a lookup table, and then when asked for h(x), see if x is in the table; if it is, return the
corresponding y.

Nearest neighbor models


Given a query xq, find the k examples that are nearest to xq. This is called k-nearest-neighbors
lookup.
We write NN(k, xq) to denote the set of k nearest neighbors. To do classification, find NN(k, xq), then
take the majority vote. To avoid ties, k is always chosen to be an odd number. To do regression, we can
take the mean or median of the k neighbors.



Fig: (a) A k-nearest-neighbor model showing the extent of the explosion class for the data, with k=1. Overfitting is
apparent. (b) With k=5, the overfitting problem goes away for this data set.

"Nearest" implies a distance metric. Distances are measured with a Minkowski distance, or L^p norm:
L^p(xj, xq) = ( ∑ i |xj,i − xq,i|^p )^(1/p)
When p = 2 this is Euclidean distance and when p = 1 it is Manhattan distance.

The number of attributes on which the two points differs with boolean attribute values is called the
Hamming distance.

To keep all dimensions on the same scale, we normalize the measurements in each dimension:
rescale xj,i to (xj,i − μi)/σi.
A more complex metric known as the Mahalanobis distance takes into account the covariance between
dimensions.
Nearest neighbors works very well with low-dimensional spaces with plenty of data.

let k=10 and N =1, 000, 000. In two dimensions (n=2; a unit square), the average neighborhood has
_=0.003, a small fraction of the unit square, and in 3 dimensions _ is just 2% of the edge length of the
unit cube. But by the time we get to 17 dimensions, _ is half the edge length of the unit hypercube, and in
200 dimensions it is 94%. This problem has been called the curse of dimensionality.
poor nearest-neighbors fit on outliers, O(N) execution time
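A minimal k-nearest-neighbors classifier in plain Python, using Euclidean distance and a majority vote over the k nearest examples (the sample points are made up):

import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples is a list of (point, label); return the majority label of the k nearest."""
    by_distance = sorted(examples,
                         key=lambda ex: math.dist(query, ex[0]))   # Euclidean (L2) distance
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

data = [((1, 1), 'A'), ((2, 1), 'A'), ((1, 2), 'A'),
        ((6, 6), 'B'), ((7, 6), 'B'), ((6, 7), 'B')]
print(knn_classify((2, 2), data, k=3))    # 'A'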

Finding nearest neighbors with k-d trees


A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for
k-dimensional tree. Exact lookup from a k-d tree is just like lookup from a binary tree
.
But nearest neighbor lookup is more complicated. As we go down the branches, splitting the
examples in half, in some cases we can discard the other half of the examples. But not always.
Sometimes the point we are querying for falls very close to the dividing boundary. The query point
itself might be on the left hand side of the boundary, but one or more of the k nearest neighbors
might actually be on the right-hand side. We have to test for this possibility by computing the
distance of the query point to the dividing boundary, and then searching both sides if we can’t find
k examples on the left that are closer than this distance.
Because of this problem, k-d trees are appropriate only when there are many more examples than
dimensions, preferably at least 2^n examples. Thus, k-d trees work well with up to 10 dimensions
with thousands of examples or up to 20 dimensions with millions of examples. If we don’t have
enough examples, lookup is no faster than a linear scan of the entire data set.

Locality-sensitive hashing
Hash tables have the potential to provide even faster lookup than binary trees. But how
can we find nearest neighbors using a hash table, when hash codes rely on an exact match? Hash
codes randomly distribute values among the bins, but we want to have near points grouped
together in the same bin; we want a locality-sensitive hash (LSH).

Approximate near-neighbors problem: given a data set of example points and a query point xq, find,
with high probability, an example point (or points) that is near xq.

Nonparametric regression
In (a), we have perhaps the simplest method of all, known informally as “connect-the-dots,” and
superciliously as “piecewise linear nonparametric regression.” This model creates a function h(x) that,
when given a query xq, solves the ordinary linear regression problem with just two points: the training
examples immediately to the left and right of xq. When noise is low, this trivial method is
actually not too bad, which is why it is a standard feature of charting software in spreadsheets. But when
the data are noisy, the resulting function is spiky, and does not generalize well. k-nearest-neighbors
regression (Figure 18.28(b)) improves on connect-the-dots. Instead of using just the two examples to the
left and right of a query point xq, we use the k nearest neighbors (here 3). A larger value of k tends to
smooth out the magnitude of the spikes, although the resulting function has discontinuities. In (b), we have
the k-nearest-neighbors average: h(x) is the mean value of the k points, ∑ j yj / k. Notice that at the outlying
points, near x=0 and x=14, the estimates are poor because all the evidence comes from one
side (the interior), and ignores the trend. In (c), we have k-nearest-neighbor linear regression, which finds
the best line through the k examples. This does a better job of capturing trends at the outliers, but is still
discontinuous. In both (b) and (c), we’re left with the question of how to choose a good value for k. The
answer, as usual, is cross-validation.
Locally weighted regression gives us the advantages of nearest neighbors, without the discontinuities. To
avoid discontinuities in h(x), we need to avoid discontinuities in the set of examples we use to estimate
h(x). The idea of locally weighted regression is that at each query point xq, the examples that are close to
xq are weighted heavily, and the examples that are farther away are weighted less heavily or not at all. The
decrease in weight over distance is always gradual, not sudden. A kernel function looks like a bump; in
Figure 18.29 we see the specific kernel used to generate

Figure 18.28(d). We can see that the weight provided by this kernel is highest in the center and reaches zero
at a distance of ±5. Can we choose just any function for a kernel? No. First, note that we invoke a kernel
function K with K(Distance(xj , xq)), where xq is a query point that is a given distance from xj , and we
want to know how much to weight that distance. So K should be symmetric around 0 and have a maximum
at 0. The area under the kernel must remain bounded as we go to ±∞. Other shapes, such as Gaussians,
have been used for kernels, but the latest research suggests that the choice of shape doesn’t matter much.
We



do have to be careful about the width of the kernel. Again, this is a parameter of the model
that is best chosen by cross-validation. Just as in choosing the k for nearest neighbors, if the
kernels are too wide we’ll get underfitting and if they are too narrow we’ll get overfitting. In
Figure 18.29(d), the value of k =10 gives a smooth curve that looks about right—but maybe
it does not pay enough attention to the outlier at x=6; a narrower kernel width would be
more responsive to individual points.
Doing locally weighted regression with kernels is now straightforward. For a given
query point xq we solve the following weighted regression problem:
w* = argmin_w ∑ j K(Distance(xq, xj)) (yj − w · xj)²
where Distance is any of the distance metrics discussed for nearest neighbors. Then the
answer is h(xq) = w* · xq.
Note that we need to solve a new regression problem for every query point—that’s what
it means to be local. (In ordinary linear regression, we solved the regression problem once,
globally, and then used the same hw for any query point.) Mitigating against this extra work is the fact that
each regression problem will be easier to solve, because it involves only the
examples with nonzero weight—the examples whose kernels overlap the query point. When
kernel widths are small, this may be just a few points.
Most nonparametric models have the advantage that it is easy to do leave-one-out cross-validation
without having to recompute everything. With a k-nearest-neighbors model, for
instance, when given a test example (x, y) we retrieve the k nearest neighbors once, compute
the per-example loss L(y, h(x)) from them, and record that as the leave-one-out result for
every example that is not one of the neighbors. Then we retrieve the k + 1 nearest neighbors
and record distinct results for leaving out each of the k neighbors. With N examples the
whole process is O(k), not O(kN).
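A sketch of locally weighted regression with a Gaussian kernel, solving a small weighted least-squares problem at each query point (one input dimension plus an intercept; the data and kernel width are illustrative assumptions):

import numpy as np

def locally_weighted_predict(xq, xs, ys, width=1.0):
    """h(xq) = w* . xq, where w* minimizes sum_j K(d(xq, xj)) * (yj - w . xj)^2."""
    X = np.column_stack([np.ones_like(xs), xs])          # add an intercept feature
    k = np.exp(-((xs - xq) ** 2) / (2 * width ** 2))     # Gaussian kernel weights
    W = np.diag(k)
    # Weighted normal equations: (X^T W X) w = X^T W y
    w = np.linalg.solve(X.T @ W @ X, X.T @ W @ ys)
    return w[0] + w[1] * xq

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative data
ys = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
print(locally_weighted_predict(2.5, xs, ys, width=1.0))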

11. Statistical Learning

Agents can handle uncertainty by using the methods of probability and decision theory.

Data
Data are evidence that is, instantiations of some or all of the random variables describing
the domain.
Hypotheses
Hypotheses are probabilistic theories of how the domain works,

Example: Candy flavour cherry (yum) and lime (ugh)


The candy is sold in very large bags, of which there are known to be five kinds
h1: 100% cherry,
h2: 75% cherry + 25% lime,
h3: 50% cherry + 50% lime,
h4: 25% cherry + 75% lime,
h5: 100% lime

The random variable H (for hypothesis) denotes the type of the bag, with possible values
h1 through h5.
Di is a random variable with possible values cherry and lime. Pieces of candy are opened
and inspected, data are revealed—D1, D2, . . ., DN
Task is to predict the flavor of the next piece of candy.
11.1 Bayesian learning
It simply calculates the probability of each hypothesis, given the data, and makes
predictions on that basis.
Let D represent all the data, with observed value d; then the probability of each hypothesis
is obtained by Bayes’ rule:
P(hi | d) = αP(d | hi)P(hi) .
To make a prediction about an unknown quantity X.
P(X | d) = ∑ i P(X | d, hi) P(hi | d) = ∑ i P(X | hi) P(hi | d)

P(hi)→ hypothesis prior,


P(d | hi)→ likelihood of the data under each hypothesis,
For candy example, prior distribution over h1, . . . , h5 is given by
<0.1, 0.2, 0.4, 0.2, 0.1>
P(d | hi) = ∏ j P(dj | hi)
For example, suppose the bag is really an all-lime bag (h5) and the first 10 candies are all
lime; then P(d | h3) is 0.5^10, because half the candies in an h3 bag are lime.
The Bayesian prediction is optimal, whether the data set be small or large.
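A short sketch of this update for the candy example: starting from the prior <0.1, 0.2, 0.4, 0.2, 0.1>, the posterior over h1..h5 is renormalized after each observed lime candy, and the prediction for the next candy follows from the posterior:

# Proportion of lime candies under each hypothesis h1..h5.
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]

posterior = prior[:]
for _ in range(10):                       # observe 10 lime candies in a row
    posterior = [p * q for p, q in zip(posterior, p_lime)]    # P(d | hi) P(hi)
    z = sum(posterior)
    posterior = [p / z for p in posterior]                    # normalize

print(posterior)                          # h5 dominates after 10 limes
print(sum(p * q for p, q in zip(posterior, p_lime)))   # P(next candy is lime | d)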

11.2 Maximum a posteriori or MAP


To make predictions based on a single most probable hypothesis—that is, an hi that
maximizes P(hi | d).
Predictions made according to an MAP hypothesis hMAP are approximately Bayesian to the
extent that P(X | d) ≈ P(X | hMAP)
MAP hypotheses is often much easier than Bayesian learning, because it requires solving
an optimization problem instead of a large summation (or integration) problem.
In both Bayesian learning and MAP learning, the hypothesis prior P(hi) plays an important
role.
overfitting can occur when the hypothesis space is too expressive, so that it contains many
hypotheses that fit the data set well.
Bayesian and MAP learning methods use the prior to penalize complexity. More complex
hypotheses have a lower prior probability. Hence, the hypothesis prior embodies a tradeoff
between the complexity of a hypothesis and its degree of fit to the data.
For deterministic (logical) hypotheses, P(d | hi) is 1 if hi is consistent with the data and 0 otherwise.
The tradeoff between complexity and degree of fit is obtained by taking the logarithm
hMAP to maximize P(d | hi)P(hi) is equivalent to minimizing
−log2 P(d | hi) − log2 P(hi)
− log2 P(hi) →term equals the number of bits required to specify the hypothesis h
− log2 P(d | hi) → the additional number of bits required to specify the data, given the
hypothesis.
MAP learning is choosing the hypothesis that provides maximum compression of the data.
The same task is addressed by the minimum description length, or MDL, learning method.



MAP learning expresses simplicity by assigning higher probabilities to simpler hypotheses,
MDL expresses by counting the bits in a binary encoding of the hypotheses and data.

11.3 Maximum-likelihood (ML) hypothesis, hML


With a uniform prior over the hypothesis space, MAP learning reduces to choosing an hi that
maximizes P(d | hi); this is the maximum-likelihood hypothesis hML.
Maximum-likelihood learning is very common in statistics.
It provides a good approximation to Bayesian and MAP learning when the data set is large,
but it has problems with small data sets.

14. Learning with Complete Data


Density Estimation
Task of learning a probability model, given data that are assumed to be generated from that
model.
Complete data
Data are complete when each data point contains values for every variable in the probability
model being learned.
Parameter learning—finding the numerical parameters for a probability model whose structure
is fixed.
14.1 Maximum-likelihood parameter learning: Discrete models
Example : candy flavour
The parameter θ, is the proportion of cherry candies, and the hypothesis is hθ. The
proportion of limes is 1 − θ.
If we model with a Bayesian network, we need one random variable, Flavor. It has values
cherry and lime, where the probability of cherry is θ
suppose we unwrap N candies, of which c are cherries and l=N − c are limes.
The likelihood of this particular data set is
P(d | hθ) = ∏ j=1..N P(dj | hθ) = θ^c · (1 − θ)^l
The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression.

The same value is obtained by maximizing the log likelihood,


L(d | hθ) = log P(d | hθ) = ∑ j=1..N log P(dj | hθ) = c log θ + l log(1 − θ)
To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the
resulting expression to zero
dL(d | hθ)/dθ = c/θ − l/(1 − θ) = 0   ⟹   θ = c/(c + l) = c/N

the maximum-likelihood hypothesis hML asserts that the actual proportion of cherries in
the bag is equal to the observed proportion in the candies unwrapped so far



(a) Bayesian network model for the case of candies with an unknown proportion of cherries and
limes. (b) Model for the case where the wrapper color depends (probabilistically) on the candy
flavor.

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

Another example. Suppose this new candy manufacturer wants to give a little hint to the
consumer and uses candy wrappers colored red and green. The Wrapper for each candy is selected
probabilistically, according to some unknown conditional distribution, depending on the flavor.
three parameters: θ, θ1, and θ2.
P(Flavor =cherry,Wrapper =green | hθ,θ1,θ2 )
= P(Flavor =cherry | hθ,θ1,θ2)P(Wrapper =green | Flavor =cherry, hθ,θ1,θ2)
= θ ・ (1 − θ1)
rc of the cherries have red wrappers and
gc have green,
while rl of the limes have red
and gl have green.
The likelihood of the data is given by
P(d | hθ,θ1,θ2) = θ^c (1 − θ)^l · θ1^rc (1 − θ1)^gc · θ2^rl (1 − θ2)^gl
Taking logarithms,
L = [c log θ + l log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rl log θ2 + gl log(1 − θ2)] .
Setting the derivatives with respect to θ, θ1, and θ2 to zero gives
θ = c/(c + l), θ1 = rc/(rc + gc), and θ2 = rl/(rl + gl).

With complete data, the maximum-likelihood parameter learning problem for a Bayesian
network decomposes into separate learning problems, one for each parameter

14.2 Naive Bayes models


The most common Bayesian network model used in machine learning is the naïve Bayes
model. In this model, the “class” variable C (which is to be predicted) is the root and the “attribute”
variables Xi are the leaves. The model is “naive” because it assumes that the attributes are
conditionally independent of each other, given the class. (The candy model with wrappers is a
naive Bayes model with class Flavor.)
θ =P(C =true), θi1 =P(Xi =true |C =true), θi2 =P(Xi =true |C =false).
Once the model has been trained in this way, it can be used to classify new examples for which
the class variable C is unobserved. With observed attribute values x1, . . . , xn, the probability of
each class is given by
P(C | x1, . . . , xn) = α P(C) ∏ i P(xi | C)
Naive Bayes learning scales well to very large problems:
with n Boolean attributes, there are just 2n + 1 parameters, and no search is required
to find hML, the maximum-likelihood naive Bayes hypothesis.
naive Bayes learning systems have no difficulty with noisy or missing data and can give
probabilistic predictions when appropriate.
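A sketch of maximum-likelihood naive Bayes with Boolean attributes: the parameters are observed frequencies, and classification multiplies the class prior by ∏ P(xi | C) and normalizes. The tiny data set is made up for illustration:

def train_naive_bayes(X, y):
    """X: list of Boolean attribute vectors, y: list of Boolean classes.
    Returns P(C=true) and, for each class value, P(Xi=true | C)."""
    n = len(y)
    p_c = sum(y) / n
    theta = {}
    for c in (True, False):
        rows = [x for x, label in zip(X, y) if label == c]
        theta[c] = [sum(r[i] for r in rows) / len(rows) for i in range(len(X[0]))]
    return p_c, theta

def predict(x, p_c, theta):
    scores = {}
    for c, prior in ((True, p_c), (False, 1 - p_c)):
        s = prior
        for xi, t in zip(x, theta[c]):
            s *= t if xi else (1 - t)      # P(xi | c)
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}   # alpha-normalized posterior

X = [[True, True], [True, False], [False, True], [False, False]]
y = [True, True, False, False]
p_c, theta = train_naive_bayes(X, y)
print(predict([True, True], p_c, theta))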

14.3 Maximum-likelihood parameter learning: Continuous models


The linear Gaussian model is a common continuous probability model.
The principles of maximum-likelihood learning are identical in the continuous and discrete cases.
Consider first learning the parameters of a Gaussian density function on a single variable; that is, the
data are generated as follows:

The parameters of this model are the mean μ and the standard deviation σ.
Let the observed values be x1, . . . , xN. Then the log likelihood is

Setting the derivatives to zero as usual, we obtain

the maximum-likelihood value of the mean is the sample average, μ = (1/N) ∑ j xj, and

the maximum-likelihood value of the standard deviation is the square root of the sample
variance, σ² = (1/N) ∑ j (xj − μ)².
consider a linear Gaussian model with one continuous parent X and a continuous child Y
Y has a Gaussian distribution whose mean depends linearly on the value of X and whose
standard deviation is fixed. To learn the conditional distribution P(Y |X), we can maximize
the conditional likelihood

the parameters are θ1, θ2, and σ.


minimizing the sum of squared errors gives the maximum-likelihood straight-line model,
provided that the data are generated with Gaussian noise of fixed variance.
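A sketch of both continuous cases, assuming numpy is available; the data are made up for illustration:

import numpy as np

# ML estimates for a single Gaussian: sample mean and sample standard deviation.
x = np.array([2.1, 2.9, 3.2, 3.8, 4.0])
mu = x.mean()
sigma = x.std()             # ddof=0: the maximum-likelihood (not unbiased) estimate
print(mu, sigma)

# Linear Gaussian model y = theta1 * x + theta2 + noise:
# maximizing the conditional likelihood = minimizing the sum of squared errors.
y = np.array([4.0, 6.1, 6.8, 7.9, 8.3])
theta1, theta2 = np.polyfit(x, y, 1)    # least-squares straight line
print(theta1, theta2)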

14.4 Bayesian parameter learning


The Bayesian approach to parameter learning starts by defining a prior probability
distribution over the possible hypotheses. As data arrives, the posterior probability distribution is
updated.
The candy example has one parameter, θ: the probability that a randomly selected piece of
candy is cherry-flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ
that defines the hypothesis space; the hypothesis prior is just the prior distribution P(Θ).
If the parameter θ can be any value between 0 and 1, then P(Θ) must be a continuous distribution
that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) =
Uniform[0, 1](θ) is one candidate.
The uniform density is a member of the family of beta distributions. Each beta distribution is
defined by two hyperparameters a and b such that
beta[a, b](θ) = α θ^(a−1) (1 − θ)^(b−1)
for θ in the range [0, 1], where α is a normalizing constant.
the beta family has a property: if Θ has a prior beta[a, b], then, after a data point is observed, the
posterior distribution for Θ is also a beta distribution. In other words, beta is closed under update.
The beta family is called the conjugate prior for the family of distributions for a Boolean variable



A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter
variables Θ, Θ1, and Θ2 can be inferred from their prior distributions and the evidence in the Flavor i and Wrapperi
variables.
Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior; similarly,
after seeing a lime candy, we increment the b parameter. Thus, we can view the a and b hyperparameters
as virtual counts, in the sense that a prior beta[a, b] behaves exactly as if we had started out with a
uniform prior beta[1, 1] and seen a − 1 actual cherry candies and b − 1 actual lime candies.
The Bayesian hypothesis prior must cover all three parameters—that is, we need to specify P(Θ,Θ1,Θ2)
P(Θ,Θ1,Θ2) = P(Θ)P(Θ1)P(Θ2) .
P(Flavor i =cherry |Θ=θ) = θ .
add a node Wrapper i, which is dependent on Θ1 and Θ2:
P(Wrapper i =red | Flavor i =cherry,Θ1 =θ1) = θ1
P(Wrapper i =red | Flavor i =lime,Θ2 =θ2) = θ2 .
the entire Bayesian learning process can be formulated as an inference problem
This formulation of learning and prediction makes it clear that Bayesian learning requires no extra
“principles of learning.”
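A sketch of the virtual-count view: starting from beta[a, b], each observed cherry increments a and each lime increments b, and a / (a + b) gives the posterior mean estimate of θ:

def update_beta(a, b, observation):
    """Bayesian update of beta[a, b] after seeing one candy."""
    return (a + 1, b) if observation == 'cherry' else (a, b + 1)

a, b = 1, 1                       # uniform prior beta[1, 1]
for candy in ['cherry', 'lime', 'cherry', 'cherry']:
    a, b = update_beta(a, b, candy)

print(a, b)                       # beta[4, 2]: as if 3 cherries and 1 lime had been seen
print(a / (a + b))                # posterior mean estimate of theta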

14.5 Learning Bayes net structures


So far we have assumed that the structure of the Bayes net is given and we are just trying to learn the parameters.
We can start with a model containing no links and begin adding parents for each node, fitting the
parameters with the methods we have just covered and measuring the accuracy of the resulting model.
Modifications can include reversing, adding, or deleting links. We must not introduce cycles in the
process, so many algorithms assume that an ordering is given for the variables, and that a node can have
parents only among those nodes that come earlier in the ordering
There are two alternative methods for deciding a good structure
The first is to test whether the conditional independence assertions implicit in the structure are
actually satisfied in the data.
P(Fri/Sat, Bar | WillWait) = P(Fri/Sat | WillWait)P(Bar | WillWait)
The second is to assess the degree to which the proposed model explains the data (in a probabilistic sense).
We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis,



we will end up with a fully connected network, because adding more parents to a node cannot decrease
the likelihood
14.6 Density estimation with nonparametric models
The task of nonparametric density estimation is typically done in continuous domains.
k-nearest-neighbors models.
Given a sample of data points, to estimate the unknown probability density at a query point
x we can simply measure the density of the data points in the neighborhood of x. For each query
point we have drawn the smallest circle that encloses 10 neighbors—the 10-nearest-neighborhood.
Another possibility is to use kernel functions, as we did for locally weighted regression.
To apply a kernel model to density estimation, each data point generates its own little density
function, using a Gaussian kernel. The estimated density at a query point x is then the average
density as given by each kernel function:
P(x) = (1/N) ∑ j K(x, xj)
Assuming spherical Gaussians with standard deviation w along each axis:
K(x, xj) = (1 / (w^d (2π)^(d/2))) e^(−D(x, xj)² / (2w²))
where d is the number of dimensions in x and D is the Euclidean distance function. A good value
of w can be chosen by using cross-validation.

15. Learning with Hidden Variables: The EM Algorithm


15.1 Hidden variables /latent variables
It is not observable in the data that are available for learning. Latent variables can
dramatically reduce the number of parameters required to specify a Bayesian network.
Example, it is not obvious how to learn the conditional distribution for HeartDisease, given
its parents, because we do not know the value of HeartDisease in each case; the same problem
arises in learning the distributions for the symptoms.
An algorithm called expectation–maximization, or EM, solves this problem in a very general
way.

(a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each
variable has three possible values and is labeled with the number of independent parameters in its
conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease
removed. Note that the symptom variables are no
longer conditionally independent given their parents. This network requires 708 parameters.



15.2 Unsupervised clustering: Learning mixtures of Gaussians
Unsupervised clustering is the problem of discerning multiple categories in a collection
of objects. Here the category labels are not given.
Example: categorizing different types of stars, or identifying species, genera, orders, etc.
Unsupervised clustering begins with data; we then ask what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution,
P. Such a distribution has k components, each of which is a distribution in its own right. A data
point is generated by first choosing a component and then generating a sample from that
component. Let the random variable C denote the component, with values 1, . . . , k; then the
mixture distribution is given by
P(x) = Σ_{i=1..k} P(C = i) P(x | C = i)
x refers to the values of the attributes for a data point
For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the mixture-of-Gaussians family of distributions.
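The generative story just described (pick a component, then sample from it) can be written directly. This is only a sketch with illustrative names, assuming NumPy.

import numpy as np

def sample_mixture(weights, means, covs, n, seed=0):
    """Draw n points from a Gaussian mixture: C ~ weights, then x ~ N(mean_C, cov_C)."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n, p=weights)      # choose a component for each point
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

# Example: two well-separated 2-D components with weights 0.3 and 0.7.
X = sample_mixture([0.3, 0.7],
                   [np.zeros(2), np.full(2, 4.0)],
                   [np.eye(2), np.eye(2)],
                   n=500)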

(a) A Gaussian mixture model with three components; the weights (left-to-right) are 0.2, 0.3, and
0.5. (b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from
the data in (b).
The parameters of a mixture of Gaussians are:
wi = P(C = i) (the weight of each component)
μi (the mean of each component)
Σi (the covariance of each component)

The basic idea of EM in this context is to pretend that we know the parameters of the model
and then to infer the probability that each data point belongs to each component. After that, we
refit the components to the data, where each component is fitted to the entire data set with each
point weighted by the probability that it belongs to that component. The process iterates until
convergence.
For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and
then iterate the following two steps:
E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was
generated by component i. By Bayes' rule, we have pij = α P(xj | C = i) P(C = i). Also let
ni = Σj pij, the effective number of data points currently assigned to component i.
M-step: Compute the new means, covariances, and component weights from the current pij values:
μi ← Σj pij xj / ni
Σi ← Σj pij (xj − μi)(xj − μi)⊤ / ni
wi ← ni / N
where N is the total number of data points.
The E-step, or expectation step, can be viewed as computing the expected values pij of the
hidden indicator variables Zij , where Zij is 1 if datum xj was generated by the ith component
and 0 otherwise.
The M-step, or maximization step, finds the new values of the parameters that maximize
the log likelihood of the data, given the expected values of the hidden indicator variables.
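Putting the two steps together, here is a compact sketch in Python (assuming NumPy and SciPy; the function em_gmm and its arguments are illustrative names, not the textbook's code).

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k Gaussians on data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Arbitrary initialisation: equal weights, random data points as means, shared covariance.
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(N, size=k, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: p[i, j] = P(C = i | x_j), via Bayes' rule followed by normalisation.
        p = np.array([w[i] * multivariate_normal.pdf(X, mu[i], sigma[i]) for i in range(k)])
        p /= p.sum(axis=0, keepdims=True)
        n = p.sum(axis=1)                              # effective counts n_i

        # M-step: re-fit each component to the data, weighted by p.
        w = n / N
        mu = (p @ X) / n[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (p[i, None] * diff.T) @ diff / n[i] + 1e-6 * np.eye(d)
    return w, mu, sigma

# Example: recover two clusters from synthetic data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(5.0, 1.0, size=(200, 2))])
weights, means, covs = em_gmm(X, k=2)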

There are two points to notice. First, the log likelihood of the final learned model slightly exceeds that of the original model from which the data were generated; this is not a problem, because the data themselves were generated randomly and need not reflect the generating model exactly. The second point is that EM increases the log likelihood of the data at every iteration.

15.3 Learning Bayesian networks with hidden variables


As an example, suppose there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not.
The distribution of candies in each bag is described by a naive Bayes model: the features are independent,
given the bag, but the conditional probability distribution for each feature depends on the bag.
The parameters are as follows:
θ is the prior probability that a candy comes from Bag 1;
θF1 and θF2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 or Bag
2 respectively;
θW1 and θW2 give the probabilities that the wrapper is red; and θH1 and θH2 give the probabilities that
the candy has a hole.
The bag is a hidden variable: for any given candy, we cannot observe which bag it came from.

(a) A mixture model for candy. The proportions of the different flavors, wrappers, and holes depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.

To generate the data, 1000 samples are drawn from a model whose true parameters are:
θ = 0.5, θF1 = θW1 = θH1 = 0.8, θF2 = θW2 = θH2 = 0.3

EM begins with arbitrary initial parameters, for example:
θ(0) = 0.6, θF1(0) = θW1(0) = θH1(0) = 0.6, θF2(0) = θW2(0) = θH2(0) = 0.4
In the E-step we compute expected counts. The expected count N̂(Bag = 1) is the sum, over all candies, of the probability that the candy came from bag 1:
N̂(Bag = 1) = Σj P(Bag = 1 | flavorj, wrapperj, holesj)
Using Bayes' rule and applying conditional independence, each term equals
P(flavorj | Bag = 1) P(wrapperj | Bag = 1) P(holesj | Bag = 1) P(Bag = 1) / Σi P(flavorj | Bag = i) P(wrapperj | Bag = i) P(holesj | Bag = i) P(Bag = i)
For instance, each of the 273 red-wrapped cherry candies with holes contributes
θ(0) θF1(0) θW1(0) θH1(0) / ( θ(0) θF1(0) θW1(0) θH1(0) + (1 − θ(0)) θF2(0) θW2(0) θH2(0) )
to this sum.
In general, the parameter updates for Bayesian network learning with hidden variables are directly available from the results of inference on each example. Moreover, only local posterior probabilities are needed for each parameter.
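To make the E-step concrete, here is an illustrative sketch (hypothetical names, not from the text) of computing the expected count N̂(Bag = 1), assuming the data set is summarized as counts of each of the eight candy descriptions.

def expected_count_bag1(counts, theta, tF, tW, tH):
    """counts maps (cherry?, red wrapper?, hole?) -> number of such candies.
    theta = P(Bag = 1); tF, tW, tH are pairs of per-bag probabilities
    (value for Bag 1, value for Bag 2)."""
    total = 0.0
    for (f, w, h), n in counts.items():
        def lik(bag):                                   # P(description | Bag = bag)
            pf, pw, ph = tF[bag], tW[bag], tH[bag]
            return ((pf if f else 1 - pf) *
                    (pw if w else 1 - pw) *
                    (ph if h else 1 - ph))
        p1 = theta * lik(0)                             # joint probability with Bag = 1
        p2 = (1 - theta) * lik(1)                       # joint probability with Bag = 2
        total += n * p1 / (p1 + p2)                     # Bayes' rule: P(Bag = 1 | candy)
    return total

# Contribution of the 273 red-wrapped cherry candies with holes, under the initial parameters:
print(expected_count_bag1({(True, True, True): 273},
                          theta=0.6, tF=(0.6, 0.4), tW=(0.6, 0.4), tH=(0.6, 0.4)))

The updated estimate of θ is then this expected count, taken over all 1000 candies, divided by 1000; conditional parameters such as θF1 are updated analogously from expected counts restricted to the relevant feature values.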

15.4 Learning hidden Markov models


Another application of EM involves learning the transition probabilities in hidden Markov models (HMMs). In a Bayes net, each parameter is distinct; in a hidden Markov model, on the other hand, the individual transition probabilities from state i to state j at time t, θijt = P(Xt+1 = j | Xt = i), are repeated across time—that is, θijt = θij for all t.
The transition probability from state i to state j is estimated as the expected proportion of transitions out of state i that go to state j:
θij ← Σt N̂(Xt+1 = j, Xt = i) / Σt N̂(Xt = i)
The expected counts are computed by an HMM inference algorithm. The forward–backward algorithm can be modified very easily to compute the necessary probabilities. The probabilities required are obtained by smoothing rather than filtering; that is, we need to take later evidence into account when estimating the probability that a particular transition occurred.
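As an illustration only (a simplified, unscaled sketch; the argument names are assumptions, and a practical implementation would work in log space), the transition-matrix re-estimation can be written with a forward-backward pass over one observation sequence.

import numpy as np

def baum_welch_transitions(obs, T, B, pi, n_iter=20):
    """Re-estimate transition probabilities T[i, j] = P(X_{t+1}=j | X_t=i) by EM.
    obs: observation indices; B[i, o] = P(E_t = o | X_t = i); pi: initial distribution.
    Unscaled forward-backward, so suitable only for short sequences."""
    n, L = T.shape[0], len(obs)
    for _ in range(n_iter):
        alpha, beta = np.zeros((L, n)), np.zeros((L, n))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, L):                       # forward (filtering) pass
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ T)
        beta[-1] = 1.0
        for t in range(L - 2, -1, -1):              # backward pass
            beta[t] = T @ (B[:, obs[t + 1]] * beta[t + 1])
        xi = np.zeros((n, n))                       # expected transition counts
        for t in range(L - 1):
            x = alpha[t][:, None] * T * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += x / x.sum()                       # smoothed P(X_t=i, X_{t+1}=j | obs)
        T = xi / xi.sum(axis=1, keepdims=True)      # normalise each row of expected counts
    return T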

An unrolled dynamic Bayesian network that represents a hidden Markov model
The general form of the EM algorithm
Let x be all the observed values in all the examples, let Z denote all the hidden variables for all the examples, and let θ be all the parameters for the probability model. Then the EM algorithm is
θ(i+1) = argmaxθ Σz P(Z = z | x, θ(i)) L(x, Z = z | θ)
The E-step computes the summation, which is the expectation of the log likelihood of the "completed" data with respect to the distribution P(Z = z | x, θ(i)); the M-step maximizes this expected log likelihood with respect to the parameters.
16. Reinforcement Learning


An optimal policy is a policy that maximizes the expected total reward. The task of
reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy
for the environment. In many complex domains, reinforcement learning is the only feasible way
to train a program to perform at high levels.
• Agent takes actions
• Agent occasionally receives reward
– Maybe just at the end of the process, e.g., chess: the agent has to decide on individual moves, but the reward comes only at the end (win/lose)
– Maybe more frequently, e.g., Scrabble: points for each word played; ping pong: any point scored; a baby learning to crawl: any forward movement

Markov Decision Process

● States s ∈ S, actions a ∈ A
● Model T(s, a, s′) ≡ P(s′ | s, a) = probability that action a in state s leads to s′

● Reward function R(s) (or R(s, a), R(s, a, s′)); for example, R(s) = −0.04 (a small penalty) for nonterminal states and ±1 for terminal states

Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein. The agent does not know how the environment works or what its actions do, and we will allow for probabilistic action outcomes. Thus, the agent faces an unknown Markov decision process.
Agent Designs
• A utility-based agent learns a utility function on states and uses it to select actions that
maximize the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected
utility of taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.

Regular MDP
– Given:
• Transition model P(s’ | s, a)
• Reward function R(s)
– Find:
• Policy π(s)
• Reinforcement learning
– Transition model and reward function initially unknown
– Still need to find the right policy
– “Learn by doing”
Reinforcement learning strategies
• Model-based
– Learn the model of the MDP (transition probabilities and rewards) and try to
solve the MDP concurrently
• Model-free
– Learn how to act without explicitly learning the transition probabilities P(s’ | s, a)
– Q-learning: learn an action-utility function Q(s,a) that tells us the value of doing
action a in state s
Model-based reinforcement learning
• Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and
learn how to act (solve the MDP) simultaneously
• Learning the model:
– Keep track of how many times state s’ follows state s when you take action a and
update the transition probability P(s’ | s, a) according to the relative frequencies
– Keep track of the rewards R(s)
• Learning how to act:
– Estimate the utilities U(s) using Bellman’s equations
– Choose the action that maximizes expected future utility:
π*(s) = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
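A rough sketch of such an agent follows (class and method names are illustrative; utilities are estimated here with plain value-iteration sweeps on the learned model).

from collections import defaultdict

class ModelBasedAgent:
    """Sketch of a model-based learner: estimate T and R from experience,
    then solve the estimated MDP with Bellman updates."""
    def __init__(self, states, actions, gamma=0.9):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.R = defaultdict(float)                            # estimated reward R(s)
        self.U = defaultdict(float)                            # estimated utilities U(s)

    def observe(self, s, a, reward, s_next):
        """Record one experienced transition."""
        self.counts[(s, a)][s_next] += 1
        self.R[s] = reward

    def T(self, s, a):
        """Estimated P(s' | s, a) as relative frequencies of observed outcomes."""
        c = self.counts[(s, a)]
        total = sum(c.values())
        return {s2: k / total for s2, k in c.items()} if total else {}

    def solve(self, sweeps=50):
        """Estimate utilities with repeated Bellman updates on the learned model."""
        for _ in range(sweeps):
            for s in self.states:
                q = [sum(p * self.U[s2] for s2, p in self.T(s, a).items())
                     for a in self.actions]
                self.U[s] = self.R[s] + self.gamma * max(q)

    def policy(self, s):
        """Greedy action with respect to the current utility estimates."""
        return max(self.actions,
                   key=lambda a: sum(p * self.U[s2] for s2, p in self.T(s, a).items()))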
Exploration vs. exploitation
• Exploration: take a new action with unknown consequences. A common scheme is to explore more in the beginning and become more and more greedy over time.
Standard ("greedy") selection of the optimal action:
a = arg max_{a' ∈ A(s)} Σ_{s'} P(s' | s, a') U(s')
Modified strategy, using an exploration function f (a small code sketch follows the pros and cons below):
a = arg max_{a' ∈ A(s)} f( Σ_{s'} P(s' | s, a') U(s'), N(s, a') )
where N(s, a') is the number of times action a' has been tried in state s, and
f(u, n) = R⁺ if n < Ne, and u otherwise
with R⁺ an optimistic estimate of the best possible reward and Ne a fixed parameter.
– Pros:
• Get a more accurate model of the environment
• Discover higher-reward states than the ones found so far
– Cons:
• When you’re exploring, you’re not maximizing your utility
• Something bad might happen
• Exploitation: go with the best strategy found so far
– Pros:
• Maximize reward as reflected in the current utility estimates
• Avoid bad stuff
– Cons:
• Might also prevent you from discovering the true optimal strategy
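A tiny sketch of the exploration function f referred to above; the constants R_PLUS and N_E are assumed values, not from the text.

R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed value)
N_E = 5        # try every state-action pair at least N_E times (assumed value)

def exploration(u, n):
    """f(u, n): return an optimistic utility while (s, a) has been tried fewer than N_E times."""
    return R_PLUS if n < N_E else u

Actions that have not been tried often enough look artificially attractive, so the agent is driven to try them; once n ≥ N_E the real utility estimate takes over.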

Model-free reinforcement learning


• Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a)
• Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s
• Relationship between Q-values and utilities: U(s) = max_a Q(s, a)
• Equilibrium constraint on Q values: Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

Temporal difference (TD) learning


• Equilibrium constraint on Q values:
Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
• Pretend that the currently observed transition (s, a, s') is the only possible outcome and adjust the Q values towards the "local equilibrium"


• At each time step t:
– From the current state s, select an action a:
a = arg max_{a'} f( Q(s, a'), N(s, a') )
– Get the successor state s'
– Perform the TD update:
Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )
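A compact sketch of the per-step update (names such as q_learning_step are illustrative, not a fixed API).

from collections import defaultdict

Q = defaultdict(float)   # Q-values; unseen (state, action) pairs default to 0
N = defaultdict(int)     # visit counts for the exploration function

def q_learning_step(s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference update of Q(s, a) after observing (s, a, reward, s_next);
    `reward` is the reward R(s) received in state s."""
    N[(s, a)] += 1
    best_next = max(Q[(s_next, a2)] for a2 in actions)            # max_a' Q(s', a')
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def select_action(s, actions, r_plus=2.0, n_e=5):
    """Pick a = arg max_a' f(Q(s, a'), N(s, a')) with the simple exploration
    function f(u, n) = r_plus if n < n_e else u (constants are assumed values)."""
    return max(actions, key=lambda a: r_plus if N[(s, a)] < n_e else Q[(s, a)])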

Passive reinforcement learning

● We know which state we are in (= fully observable environment)


● We know which actions we can take
● But only after taking an action
→ new state becomes known
→ reward becomes known

Active reinforcement learning


● A passive agent follows a prescribed policy
● Now: an active agent decides which action to take, either following the (currently estimated) optimal policy or exploring
● Goal: optimize rewards over a given time frame

Q-learning (TD) update: Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )