
10-601 Introduction to Machine Learning

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Bayesian Networks
(Part I)

Graphical Model Readings:
Murphy 10 – 10.2.1
Bishop 8.1, 8.2.2
HTF --
Mitchell 6.11

Matt Gormley
Lecture 22
April 10, 2017

1
Reminders
• Peer Tutoring
• Homework 7: Deep Learning
  – Release: Wed, Apr. 05
  – Part I due Wed, Apr. 12 (start early!)
  – Part II due Mon, Apr. 17

2
CONVOLUTIONAL  NEURAL  NETS

3
Deep Learning Outline
• Background: Computer Vision
  – Image Classification
  – ILSVRC 2010 – 2016
  – Traditional Feature Extraction Methods
  – Convolution as Feature Extraction
• Convolutional Neural Networks (CNNs)
  – Learning Feature Abstractions
  – Common CNN Layers:
    • Convolutional Layer
    • Max-Pooling Layer
    • Fully-connected Layer (w/ tensor input)
    • Softmax Layer
    • ReLU Layer
  – Background: Subgradient
  – Architecture: LeNet
  – Architecture: AlexNet
• Training a CNN
  – SGD for CNNs
  – Backpropagation for CNNs

4
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies (see the code sketch below)

Architecture #1: LeNet-5

5
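As a rough illustration only (not from the slides), the following minimal sketch stacks these layer types into a LeNet-5-style topology. It assumes PyTorch and 1x32x32 grayscale inputs, and it substitutes max-pooling and ReLU where the original LeNet-5 used average-based subsampling and sigmoidal activations; a softmax (or a cross-entropy loss) would be applied to the final class scores.

import torch
import torch.nn as nn

# A LeNet-5-style stack: conv -> pool -> conv -> pool -> fully-connected layers.
lenet5_style = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5),   # 1x32x32 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                     # 6x28x28 -> 6x14x14
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),  # 6x14x14 -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                     # 16x10x10 -> 16x5x5
    nn.Flatten(),                                              # 16x5x5 -> 400
    nn.Linear(400, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                                         # 10 class scores
)

x = torch.randn(8, 1, 32, 32)      # a batch of 8 fake grayscale images
print(lenet5_style(x).shape)       # torch.Size([8, 10])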
Convolutional Layer

CNN key idea:
Treat the convolution matrix as parameters and learn them!

Input Image (7x7):
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Learned Convolution (3x3):
θ11 θ12 θ13
θ21 θ22 θ23
θ31 θ32 θ33

Convolved Image (5x5):
.4 .5 .5 .5 .4
.4 .2 .3 .6 .3
.5 .4 .4 .2 .1
.5 .6 .2 .1 0
.4 .3 .1 0 0

6
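For concreteness, here is a minimal sketch (assuming NumPy; not part of the slides) of the sliding-window operation above: a 3x3 kernel applied at every position of a 7x7 input yields a 5x5 output. As in most deep learning libraries, no kernel flip is performed, and in a CNN the nine kernel entries θ would be learned.

import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image (no padding, stride 1) and
    return the resulting feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.array([[0,0,0,0,0,0,0],
                  [0,1,1,1,1,1,0],
                  [0,1,0,0,1,0,0],
                  [0,1,0,1,0,0,0],
                  [0,1,1,0,0,0,0],
                  [0,1,0,0,0,0,0],
                  [0,0,0,0,0,0,0]], dtype=float)   # the 7x7 input from the slide
theta = np.random.rand(3, 3)                        # in a CNN these 9 weights are learned
print(conv2d_valid(image, theta).shape)             # (5, 5)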
Downsampling by Averaging
• Downsampling by averaging used to be a common approach
• This is a special case of convolution where the weights are fixed to a uniform distribution
• The example below uses a stride of 2 (see the code sketch below)

Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Convolution (2x2, uniform weights):
1/4 1/4
1/4 1/4

Convolved Image (3x3):
3/4 3/4 1/4
3/4 1/4 0
1/4 0 0

7
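A small sketch (NumPy, not from the slides) that reproduces the example above; averaging over 2x2 windows with stride 2 is the same as convolving with a fixed uniform 2x2 kernel.

import numpy as np

def downsample_by_averaging(image, size=2, stride=2):
    """Average-pool: a convolution with fixed uniform weights (1/size^2) and a stride."""
    H, W = image.shape
    out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.mean()   # same as a dot product with the uniform kernel
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)
print(downsample_by_averaging(image))   # [[0.75 0.75 0.25] [0.75 0.25 0.] [0.25 0. 0.]]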
Max-Pooling
• Max-pooling is another (common) form of downsampling
• Instead of averaging, we take the max value within the same range as the equivalently-sized convolution
• The example below uses a stride of 2 (see the code sketch below)

Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Max-pooling (2x2 window):
max{ x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1} }

Max-Pooled Image (3x3):
1 1 1
1 1 0
1 0 0

8
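The same sketch with max in place of the average (NumPy, not from the slides) reproduces the max-pooled image above.

import numpy as np

def max_pool(image, size=2, stride=2):
    """Max-pool: take the max over each size x size window, moving by `stride`."""
    H, W = image.shape
    out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)
print(max_pool(image))   # [[1. 1. 1.] [1. 1. 0.] [1. 0. 0.]]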
Multi-Class Output

[Figure: feed-forward network with an Input layer, a Hidden Layer, and a multi-class Output layer]

10
Multi-Class Output

Softmax Layer:
y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)

(F) Loss:               J = −Σ_{k=1}^K y*_k log(y_k), where y* is the true (one-hot) output
(E) Output (softmax):   y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(D) Output (linear):    b_k = Σ_{j=0}^D β_{kj} z_j,   ∀k
(C) Hidden (nonlinear): z_j = σ(a_j),   ∀j
(B) Hidden (linear):    a_j = Σ_{i=0}^M α_{ji} x_i,   ∀j
(A) Input:              Given x_i,   ∀i

11
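As an illustrative sketch only (NumPy, not from the slides), the forward computation (A)-(E) and the loss (F) can be written as follows; bias terms (the i = 0 and j = 0 entries of the sums) are omitted for brevity, and the weight matrices alpha and beta are hypothetical placeholders.

import numpy as np

def forward(x, alpha, beta, y_star):
    a = alpha @ x                       # (B) hidden, linear
    z = 1.0 / (1.0 + np.exp(-a))        # (C) hidden, nonlinear (sigmoid)
    b = beta @ z                        # (D) output, linear
    y = np.exp(b) / np.sum(np.exp(b))   # (E) output, softmax
    J = -np.sum(y_star * np.log(y))     # (F) cross-entropy loss
    return y, J

M, D, K = 4, 3, 5                        # input, hidden, output sizes
alpha = np.random.randn(D, M)            # hidden-layer weights (made up)
beta = np.random.randn(K, D)             # output-layer weights (made up)
x = np.random.randn(M)                   # one input example
y_star = np.eye(K)[2]                    # one-hot true label (class 2)
print(forward(x, alpha, beta, y_star))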
Training  a  CNN
Whiteboard
– SGD  for  CNNs
– Backpropagation  for  CNNs

12
Common CNN Layers
Whiteboard
– ReLU Layer
– Background: Subgradient
– Fully-connected Layer (w/ tensor input)
– Softmax Layer
– Convolutional Layer
– Max-Pooling Layer

13
Convolutional Layer

14
Convolutional Layer

15
Max-Pooling Layer

16
Max-Pooling Layer

17
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies

Architecture #1: LeNet-5

18
Architecture #2: AlexNet
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2012)
15.3% error on the ImageNet LSVRC-2012 contest

Input image (pixels) → five convolutional layers (w/ max-pooling) → three fully connected layers → 1000-way softmax

[Figure 2 from Krizhevsky et al. (2012): an illustration of the architecture of their CNN]

19

CNNs for Image Recognition

[Figure: slide from Kaiming He's recent presentation, via Fei-Fei Li, Andrej Karpathy & Justin Johnson (CS231N)]

20
Mini-Batch SGD
• Gradient Descent:
  Compute the true gradient exactly from all N examples
• Mini-Batch SGD:
  Approximate the true gradient by the average gradient of K randomly chosen examples
• Stochastic Gradient Descent (SGD):
  Approximate the true gradient by the gradient of one randomly chosen example

21
Mini-Batch SGD

Three variants of first-order optimization (see the code sketch below):

22
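A rough sketch (NumPy, not from the slides) of the three update rules; it assumes a hypothetical per-example gradient function grad(theta, x, y) and a dataset X, Y.

import numpy as np

def batch_gd_step(theta, grad, X, Y, lr=0.1):
    """Gradient descent: exact gradient computed from all N examples."""
    g = np.mean([grad(theta, x, y) for x, y in zip(X, Y)], axis=0)
    return theta - lr * g

def minibatch_sgd_step(theta, grad, X, Y, K=32, lr=0.1):
    """Mini-batch SGD: average gradient of K randomly chosen examples (K <= N)."""
    idx = np.random.choice(len(X), size=K, replace=False)
    g = np.mean([grad(theta, X[i], Y[i]) for i in idx], axis=0)
    return theta - lr * g

def sgd_step(theta, grad, X, Y, lr=0.1):
    """SGD: gradient of one randomly chosen example."""
    i = np.random.randint(len(X))
    return theta - lr * grad(theta, X[i], Y[i])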
CNN  VISUALIZATIONS

23
3D  Visualization  of  CNN
http://scs.ryerson.ca/~aharley/vis/conv/
Convolution of a Color Image
• Color images consist of 3 floats per pixel for RGB (red, green, blue) color values
• Convolution must also be 3-dimensional

A closer look at spatial dimensions: convolving (sliding) a 5x5x3 filter over all spatial locations of a 32x32x3 image produces a 28x28x1 activation map.

25
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
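A minimal sketch (NumPy, not from the slides) of this picture: one 5x5x3 filter slid over a 32x32x3 image yields a single 28x28 activation map, since 32 − 5 + 1 = 28 and the depth dimension is summed out; a convolutional layer with many such filters stacks one map per filter.

import numpy as np

def conv3d_single_filter(image, filt):
    """Slide a (kH, kW, C) filter over a (H, W, C) image; each position
    yields one number (a dot product over all kH*kW*C values)."""
    H, W, C = image.shape
    kH, kW, _ = filt.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW, :] * filt)
    return out

image = np.random.rand(32, 32, 3)   # 32x32 RGB image
filt = np.random.rand(5, 5, 3)      # one 5x5x3 filter
print(conv3d_single_filter(image, filt).shape)   # (28, 28)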
Animation of 3D Convolution
http://cs231n.github.io/convolutional-networks/

26
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
MNIST  Digit  Recognition  with  CNNs  
(in  your  browser)
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

27
Figure  from  Andrej  Karpathy
CNN Summary
CNNs
– Are used for all aspects of computer vision, and have won numerous pattern recognition competitions
– Are able to learn interpretable features at different levels of abstraction
– Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers

Other Resources:
– Readings on course website
– Andrej Karpathy, CS231n Notes
  http://cs231n.github.io/convolutional-networks/

28
BAYESIAN  NETWORKS

29
Bayes Nets Outline
• Motivation
  – Structured Prediction
• Background
  – Conditional Independence
  – Chain Rule of Probability
• Directed Graphical Models
  – Writing Joint Distributions
  – Definition: Bayesian Network
  – Qualitative Specification
  – Quantitative Specification
  – Familiar Models as Bayes Nets
• Conditional Independence in Bayes Nets
  – Three case studies
  – D-separation
  – Markov blanket
• Learning
  – Fully Observed Bayes Net
  – (Partially Observed Bayes Net)
• Inference
  – Sampling directly from the joint distribution
  – Gibbs Sampling

31
MOTIVATION: STRUCTURED PREDICTION

32
Structured Prediction
• Most of the models we've seen so far were for classification
  – Given observations: x = (x1, x2, …, xK)
  – Predict a (binary) label: y
• Many real-world problems require structured prediction
  – Given observations: x = (x1, x2, …, xK)
  – Predict a structure: y = (y1, y2, …, yJ)
• Some classification problems benefit from latent structure

33
Structured Prediction Examples
• Examples of structured prediction
  – Part-of-speech (POS) tagging
  – Handwriting recognition
  – Speech recognition
  – Word alignment
  – Congressional voting
• Examples of latent structure
  – Object recognition

34
Dataset for Supervised Part-of-Speech (POS) Tagging

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  n  v  p  d  n
            x^(1):  time flies like an arrow

Sample 2:   y^(2):  n  n  v  d  n
            x^(2):  time flies like an arrow

Sample 3:   y^(3):  n  v  p  n  n
            x^(3):  flies fly with their wings

Sample 4:   y^(4):  p  n  n  v  v
            x^(4):  with time you will see

35
Dataset for Supervised Handwriting Recognition

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  u n e x p e c t e d
            x^(1):  [handwritten word image]

Sample 2:   y^(2):  v o l c a n i c
            x^(2):  [handwritten word image]

Sample 3:   y^(3):  e m b r a c e s
            x^(3):  [handwritten word image]

[Fig. 5 from (Chatzis & Demiris, 2013): handwriting recognition, example words from the dataset used]

36
Figures from (Chatzis & Demiris, 2013)
Dataset for Supervised Phoneme (Speech) Recognition

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  h# dh ih s w uh z iy z iy
            x^(1):  [spectral representation of the utterance]

Sample 2:   y^(2):  f ao r ah s s h#
            x^(2):  [spectral representation of the utterance]

[Fig. 5 from (Jansen & Niyogi, 2013): extrinsic (top) and intrinsic (bottom) spectral representations for the utterance "This was easy for us."]

37
Figures from (Jansen & Niyogi, 2013)
Application:
Word Alignment / Phrase Extraction
• Variables (boolean):
  – For each (Chinese phrase, English phrase) pair, are they linked?
• Interactions:
  – Word fertilities
  – Few "jumps" (discontinuities)
  – Syntactic reorderings
  – "ITG constraint" on alignment
  – Phrases are disjoint (?)

(Burkett & Klein, 2012) 38
Application:
Congressional Voting
• Variables:
  – Text of all speeches of a representative
  – Local contexts of references between two representatives
• Interactions:
  – Words used by a representative and their vote
  – Pairs of representatives and their local context

(Stoyanov & Eisner, 2012) 39
Structured Prediction Examples
• Examples of structured prediction
  – Part-of-speech (POS) tagging
  – Handwriting recognition
  – Speech recognition
  – Word alignment
  – Congressional voting
• Examples of latent structure
  – Object recognition

40
Case Study: Object Recognition
Data consists of images x and labels y.

x^(1): [image]  y^(1): pigeon        x^(2): [image]  y^(2): rhinoceros
x^(3): [image]  y^(3): leopard       x^(4): [image]  y^(4): llama

41
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Image: leopard]

42
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Figure: leopard image with patches X2, …, X7, latent part labels Z2, …, Z7, and image label Y]

43
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Figure: the same model with factors ψ1, ψ2, ψ3, ψ4 connecting the patches Xi, the part labels Zi, and the image label Y]

44
Structured Prediction
Preview of challenges to come…
• Consider the task of finding the most probable assignment to the output

Classification:
ŷ = argmax_y p(y|x),  where y ∈ {+1, −1}

Structured Prediction:
ŷ = argmax_y p(y|x),  where y ∈ Y and |Y| is very large

45
Machine Learning

• The data inspires the structures we want to predict (Domain Knowledge)
• Our model defines a score for each structure; it also tells us what to optimize (Mathematical Modeling)
• Inference finds {best structure, marginals, partition function} for a new observation (Combinatorial Optimization)
• Learning tunes the parameters of the model (Optimization)

(Inference is usually called as a subroutine in learning)

46
Machine Learning

[Figure: the same pipeline instantiated for the sentence "time flies like an arrow": Data (labeled sentences such as "Alice saw Bob on a hill with a telescope"), a Model over variables X1, …, X5, an Objective, Inference, and Learning]

(Inference is usually called as a subroutine in learning)

47
BACKGROUND

48
Background
Whiteboard
– Chain  Rule  of  Probability
– Conditional  Independence

49
Background: Chain Rule of Probability

For random variables A and B:

P(A, B) = P(A|B) P(B)

For random variables X1, X2, X3, X4:

P(X1, X2, X3, X4) = P(X1|X2, X3, X4) P(X2|X3, X4) P(X3|X4) P(X4)

50
Background: Conditional Independence

Random variables A and B are conditionally independent given C if:

P(A, B|C) = P(A|C) P(B|C)    (1)

or equivalently:

P(A|B, C) = P(A|C)    (2)

We write this as:

A ⊥ B | C    (3)

Later we will also write: I<A, {C}, B>

51
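As an illustrative sketch (NumPy, not from the slides), definition (1) can be checked numerically for a small joint distribution over binary A, B, C that is built, by construction, so that A and B are conditionally independent given C; all the tables below are made up.

import numpy as np

# Hypothetical CPTs: P(C), P(A|C), P(B|C) for binary variables.
P_C = np.array([0.6, 0.4])
P_A_given_C = np.array([[0.9, 0.1],    # row c: P(A=0|c), P(A=1|c)
                        [0.3, 0.7]])
P_B_given_C = np.array([[0.8, 0.2],    # row c: P(B=0|c), P(B=1|c)
                        [0.5, 0.5]])

# Build the joint P(A, B, C) = P(A|C) P(B|C) P(C).
P_ABC = np.einsum('ca,cb,c->abc', P_A_given_C, P_B_given_C, P_C)

# Check (1): P(A, B | C) == P(A | C) P(B | C) for every value of C.
P_AB_given_C = P_ABC / P_ABC.sum(axis=(0, 1), keepdims=True)
for c in range(2):
    lhs = P_AB_given_C[:, :, c]
    rhs = np.outer(P_A_given_C[c], P_B_given_C[c])
    assert np.allclose(lhs, rhs)
print("A is conditionally independent of B given C in this joint")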
Bayesian  Networks

DIRECTED  GRAPHICAL  MODELS

52
Example: Tornado Alarms
1. Imagine that you work at the 911 call center in Dallas
2. You receive six calls informing you that the Emergency Weather Sirens are going off
3. What do you conclude?

53
Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html
Example: Tornado Alarms
1. Imagine that you work at the 911 call center in Dallas
2. You receive six calls informing you that the Emergency Weather Sirens are going off
3. What do you conclude?

54
Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html
Directed  Graphical  Models  
(Bayes  Nets)
Whiteboard
– Example:  Tornado  Alarms
– Writing  Joint  Distributions
• Idea  #1:  Giant  Table
• Idea  #2:  Rewrite  using  chain  rule
• Idea  #3:  Assume  full  independence
• Idea  #4:  Drop  variables  from  RHS  of  conditionals
– Definition:  Bayesian  Network
– Observed  Variables  in  Graphical  Models

55
Bayesian Network

[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)

56
Bayesian Network
Definition:

P(X1, …, Xn) = ∏_{i=1}^n P(Xi | parents(Xi))

[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]

• A Bayesian Network is a directed graphical model
• It consists of a graph G and the conditional probabilities P
• These two parts fully specify the distribution:
  – Qualitative Specification: G
  – Quantitative Specification: P

57
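As a sketch of the quantitative specification (plain Python, not from the slides), the conditional probability tables below are made-up numbers for binary variables on the graph above; the joint is evaluated via the factored form, and summing over all 2^5 assignments confirms it is a proper distribution.

from itertools import product

# Hypothetical CPTs for binary X1..X5 on the graph
# X1 -> X2, X2 -> X4, X3 -> X4, X3 -> X5.
# Each entry gives P(child = 1 | parents).
P_X1 = 0.6
P_X3 = 0.5
P_X2_given_X1 = {0: 0.2, 1: 0.7}                    # keyed by x1
P_X4_given_X2_X3 = {(0, 0): 0.1, (0, 1): 0.5,
                    (1, 0): 0.4, (1, 1): 0.9}       # keyed by (x2, x3)
P_X5_given_X3 = {0: 0.3, 1: 0.8}                    # keyed by x3

def bern(p_one, x):
    """P(X = x) for a binary variable with P(X = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """p(x1,...,x5) = p(x5|x3) p(x4|x2,x3) p(x3) p(x2|x1) p(x1)."""
    return (bern(P_X5_given_X3[x3], x5)
            * bern(P_X4_given_X2_X3[(x2, x3)], x4)
            * bern(P_X3, x3)
            * bern(P_X2_given_X1[x1], x2)
            * bern(P_X1, x1))

print(sum(joint(*xs) for xs in product([0, 1], repeat=5)))   # 1.0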
