
10-601 Introduction to Machine Learning

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Bayesian Networks
(Part I)

Graphical Model Readings:
Murphy 10 – 10.2.1
Bishop 8.1, 8.2.2
HTF --
Mitchell 6.11

Matt Gormley
Lecture 22
April 10, 2017

1
Reminders
• Peer Tutoring
• Homework 7: Deep Learning
  – Release: Wed, Apr. 05
  – Part I due Wed, Apr. 12 (start early!)
  – Part II due Mon, Apr. 17

2
CONVOLUTIONAL  NEURAL  NETS

3
Deep Learning Outline
• Background: Computer Vision
  – Image Classification
  – ILSVRC 2010 – 2016
  – Traditional Feature Extraction Methods
  – Convolution as Feature Extraction
• Convolutional Neural Networks (CNNs)
  – Learning Feature Abstractions
  – Common CNN Layers:
    • Convolutional Layer
    • Max-Pooling Layer
    • Fully-connected Layer (w/ tensor input)
    • Softmax Layer
    • ReLU Layer
  – Background: Subgradient
  – Architecture: LeNet
  – Architecture: AlexNet
• Training a CNN
  – SGD for CNNs
  – Backpropagation for CNNs

4
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies (see the code sketch below)

Architecture #1: LeNet-5

5
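As a rough illustration only (not from the slides), the following minimal sketch stacks these layer types into a LeNet-5-style topology. It assumes PyTorch and 1x32x32 grayscale inputs, and it substitutes max-pooling and ReLU where the original LeNet-5 used average-based subsampling and sigmoidal activations; a softmax (or a cross-entropy loss) would be applied to the final class scores.

import torch
import torch.nn as nn

# A LeNet-5-style stack: conv -> pool -> conv -> pool -> fully-connected layers.
lenet5_style = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5),   # 1x32x32 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                     # 6x28x28 -> 6x14x14
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),  # 6x14x14 -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                     # 16x10x10 -> 16x5x5
    nn.Flatten(),                                              # 16x5x5 -> 400
    nn.Linear(400, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                                         # 10 class scores
)

x = torch.randn(8, 1, 32, 32)      # a batch of 8 fake grayscale images
print(lenet5_style(x).shape)       # torch.Size([8, 10])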
Convolutional Layer

CNN key idea:
Treat the convolution matrix as parameters and learn them!

Input Image (7x7):
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Learned Convolution (3x3):
θ11 θ12 θ13
θ21 θ22 θ23
θ31 θ32 θ33

Convolved Image (5x5):
.4 .5 .5 .5 .4
.4 .2 .3 .6 .3
.5 .4 .4 .2 .1
.5 .6 .2 .1 0
.4 .3 .1 0 0

6
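For concreteness, here is a minimal sketch (assuming NumPy; not part of the slides) of the sliding-window operation above: a 3x3 kernel applied at every position of a 7x7 input yields a 5x5 output. As in most deep learning libraries, no kernel flip is performed, and in a CNN the nine kernel entries θ would be learned.

import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image (no padding, stride 1) and
    return the resulting feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.array([[0,0,0,0,0,0,0],
                  [0,1,1,1,1,1,0],
                  [0,1,0,0,1,0,0],
                  [0,1,0,1,0,0,0],
                  [0,1,1,0,0,0,0],
                  [0,1,0,0,0,0,0],
                  [0,0,0,0,0,0,0]], dtype=float)   # the 7x7 input from the slide
theta = np.random.rand(3, 3)                        # in a CNN these 9 weights are learned
print(conv2d_valid(image, theta).shape)             # (5, 5)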
Downsampling by Averaging
• Downsampling by averaging used to be a common approach
• This is a special case of convolution where the weights are fixed to a uniform distribution
• The example below uses a stride of 2 (see the code sketch below)

Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Convolution (2x2, uniform weights):
1/4 1/4
1/4 1/4

Convolved Image (3x3):
3/4 3/4 1/4
3/4 1/4 0
1/4 0 0

7
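A small sketch (NumPy, not from the slides) that reproduces the example above; averaging over 2x2 windows with stride 2 is the same as convolving with a fixed uniform 2x2 kernel.

import numpy as np

def downsample_by_averaging(image, size=2, stride=2):
    """Average-pool: a convolution with fixed uniform weights (1/size^2) and a stride."""
    H, W = image.shape
    out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.mean()   # same as a dot product with the uniform kernel
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)
print(downsample_by_averaging(image))   # [[0.75 0.75 0.25] [0.75 0.25 0.] [0.25 0. 0.]]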
Max-Pooling
• Max-pooling is another (common) form of downsampling
• Instead of averaging, we take the max value within the same range as the equivalently-sized convolution
• The example below uses a stride of 2 (see the code sketch below)

Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Max-pooling (2x2 window):
max{ x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1} }

Max-Pooled Image (3x3):
1 1 1
1 1 0
1 0 0

8
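The same sketch with max in place of the average (NumPy, not from the slides) reproduces the max-pooled image above.

import numpy as np

def max_pool(image, size=2, stride=2):
    """Max-pool: take the max over each size x size window, moving by `stride`."""
    H, W = image.shape
    out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)
print(max_pool(image))   # [[1. 1. 1.] [1. 1. 0.] [1. 0. 0.]]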
Multi-Class Output

[Figure: feed-forward network with an Input layer, a Hidden Layer, and a multi-class Output layer]

10
Multi-Class Output

Softmax Layer:
y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)

(F) Loss:               J = −Σ_{k=1}^K y*_k log(y_k), where y* is the true (one-hot) output
(E) Output (softmax):   y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(D) Output (linear):    b_k = Σ_{j=0}^D β_{kj} z_j,   ∀k
(C) Hidden (nonlinear): z_j = σ(a_j),   ∀j
(B) Hidden (linear):    a_j = Σ_{i=0}^M α_{ji} x_i,   ∀j
(A) Input:              Given x_i,   ∀i

11
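As an illustrative sketch only (NumPy, not from the slides), the forward computation (A)-(E) and the loss (F) can be written as follows; bias terms (the i = 0 and j = 0 entries of the sums) are omitted for brevity, and the weight matrices alpha and beta are hypothetical placeholders.

import numpy as np

def forward(x, alpha, beta, y_star):
    a = alpha @ x                       # (B) hidden, linear
    z = 1.0 / (1.0 + np.exp(-a))        # (C) hidden, nonlinear (sigmoid)
    b = beta @ z                        # (D) output, linear
    y = np.exp(b) / np.sum(np.exp(b))   # (E) output, softmax
    J = -np.sum(y_star * np.log(y))     # (F) cross-entropy loss
    return y, J

M, D, K = 4, 3, 5                        # input, hidden, output sizes
alpha = np.random.randn(D, M)            # hidden-layer weights (made up)
beta = np.random.randn(K, D)             # output-layer weights (made up)
x = np.random.randn(M)                   # one input example
y_star = np.eye(K)[2]                    # one-hot true label (class 2)
print(forward(x, alpha, beta, y_star))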
Training  a  CNN
Whiteboard
– SGD  for  CNNs
– Backpropagation  for  CNNs

12
Common CNN Layers
Whiteboard
– ReLU Layer
– Background: Subgradient
– Fully-connected Layer (w/ tensor input)
– Softmax Layer
– Convolutional Layer
– Max-Pooling Layer

13
Convolutional Layer

14
Convolutional Layer

15
Max-Pooling Layer

16
Max-Pooling Layer

17
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies

Architecture #1: LeNet-5

18
Architecture #2: AlexNet
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2012)
15.3% error on the ImageNet LSVRC-2012 contest

Input image (pixels) → five convolutional layers (w/ max-pooling) → three fully connected layers → 1000-way softmax

[Figure 2 from Krizhevsky et al. (2012): an illustration of the architecture of their CNN]

19

CNNs for Image Recognition

[Figure: slide from Kaiming He's recent presentation, via Fei-Fei Li, Andrej Karpathy & Justin Johnson (CS231N)]

20
Mini-Batch SGD
• Gradient Descent:
  Compute the true gradient exactly from all N examples
• Mini-Batch SGD:
  Approximate the true gradient by the average gradient of K randomly chosen examples
• Stochastic Gradient Descent (SGD):
  Approximate the true gradient by the gradient of one randomly chosen example

21
Mini-Batch SGD

Three variants of first-order optimization (see the code sketch below):

22
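A rough sketch (NumPy, not from the slides) of the three update rules; it assumes a hypothetical per-example gradient function grad(theta, x, y) and a dataset X, Y.

import numpy as np

def batch_gd_step(theta, grad, X, Y, lr=0.1):
    """Gradient descent: exact gradient computed from all N examples."""
    g = np.mean([grad(theta, x, y) for x, y in zip(X, Y)], axis=0)
    return theta - lr * g

def minibatch_sgd_step(theta, grad, X, Y, K=32, lr=0.1):
    """Mini-batch SGD: average gradient of K randomly chosen examples (K <= N)."""
    idx = np.random.choice(len(X), size=K, replace=False)
    g = np.mean([grad(theta, X[i], Y[i]) for i in idx], axis=0)
    return theta - lr * g

def sgd_step(theta, grad, X, Y, lr=0.1):
    """SGD: gradient of one randomly chosen example."""
    i = np.random.randint(len(X))
    return theta - lr * grad(theta, X[i], Y[i])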
CNN  VISUALIZATIONS

23
3D  Visualization  of  CNN
http://scs.ryerson.ca/~aharley/vis/conv/
Convolution of a Color Image
• Color images consist of 3 floats per pixel for RGB (red, green, blue) color values
• Convolution must also be 3-dimensional

A closer look at spatial dimensions: convolving (sliding) a 5x5x3 filter over all spatial locations of a 32x32x3 image produces a 28x28x1 activation map.

25
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
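A minimal sketch (NumPy, not from the slides) of this picture: one 5x5x3 filter slid over a 32x32x3 image yields a single 28x28 activation map, since 32 − 5 + 1 = 28 and the depth dimension is summed out; a convolutional layer with many such filters stacks one map per filter.

import numpy as np

def conv3d_single_filter(image, filt):
    """Slide a (kH, kW, C) filter over a (H, W, C) image; each position
    yields one number (a dot product over all kH*kW*C values)."""
    H, W, C = image.shape
    kH, kW, _ = filt.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW, :] * filt)
    return out

image = np.random.rand(32, 32, 3)   # 32x32 RGB image
filt = np.random.rand(5, 5, 3)      # one 5x5x3 filter
print(conv3d_single_filter(image, filt).shape)   # (28, 28)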
Animation of 3D Convolution
http://cs231n.github.io/convolutional-networks/

26
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
MNIST  Digit  Recognition  with  CNNs  
(in  your  browser)
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

27
Figure  from  Andrej  Karpathy
CNN Summary
CNNs
– Are used for all aspects of computer vision, and have won numerous pattern recognition competitions
– Are able to learn interpretable features at different levels of abstraction
– Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers

Other Resources:
– Readings on course website
– Andrej Karpathy, CS231n Notes
  http://cs231n.github.io/convolutional-networks/

28
BAYESIAN  NETWORKS

29
Bayes Nets Outline
• Motivation
  – Structured Prediction
• Background
  – Conditional Independence
  – Chain Rule of Probability
• Directed Graphical Models
  – Writing Joint Distributions
  – Definition: Bayesian Network
  – Qualitative Specification
  – Quantitative Specification
  – Familiar Models as Bayes Nets
• Conditional Independence in Bayes Nets
  – Three case studies
  – D-separation
  – Markov blanket
• Learning
  – Fully Observed Bayes Net
  – (Partially Observed Bayes Net)
• Inference
  – Sampling directly from the joint distribution
  – Gibbs Sampling

31
MOTIVATION: STRUCTURED PREDICTION

32
Structured Prediction
• Most of the models we've seen so far were for classification
  – Given observations: x = (x1, x2, …, xK)
  – Predict a (binary) label: y
• Many real-world problems require structured prediction
  – Given observations: x = (x1, x2, …, xK)
  – Predict a structure: y = (y1, y2, …, yJ)
• Some classification problems benefit from latent structure

33
Structured Prediction Examples
• Examples of structured prediction
  – Part-of-speech (POS) tagging
  – Handwriting recognition
  – Speech recognition
  – Word alignment
  – Congressional voting
• Examples of latent structure
  – Object recognition

34
Dataset for Supervised Part-of-Speech (POS) Tagging

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  n  v  p  d  n
            x^(1):  time flies like an arrow

Sample 2:   y^(2):  n  n  v  d  n
            x^(2):  time flies like an arrow

Sample 3:   y^(3):  n  v  p  n  n
            x^(3):  flies fly with their wings

Sample 4:   y^(4):  p  n  n  v  v
            x^(4):  with time you will see

35
Dataset for Supervised Handwriting Recognition

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  u n e x p e c t e d
            x^(1):  [handwritten word image]

Sample 2:   y^(2):  v o l c a n i c
            x^(2):  [handwritten word image]

Sample 3:   y^(3):  e m b r a c e s
            x^(3):  [handwritten word image]

[Fig. 5 from (Chatzis & Demiris, 2013): handwriting recognition, example words from the dataset used]

36
Figures from (Chatzis & Demiris, 2013)
Dataset for Supervised Phoneme (Speech) Recognition

Data: D = {x^(n), y^(n)}_{n=1}^N

Sample 1:   y^(1):  h# dh ih s w uh z iy z iy
            x^(1):  [spectral representation of the utterance]

Sample 2:   y^(2):  f ao r ah s s h#
            x^(2):  [spectral representation of the utterance]

[Fig. 5 from (Jansen & Niyogi, 2013): extrinsic (top) and intrinsic (bottom) spectral representations for the utterance "This was easy for us."]

37
Figures from (Jansen & Niyogi, 2013)
Application:
Word Alignment / Phrase Extraction
• Variables (boolean):
  – For each (Chinese phrase, English phrase) pair, are they linked?
• Interactions:
  – Word fertilities
  – Few "jumps" (discontinuities)
  – Syntactic reorderings
  – "ITG constraint" on alignment
  – Phrases are disjoint (?)

(Burkett & Klein, 2012) 38
Application:
Congressional Voting
• Variables:
  – Text of all speeches of a representative
  – Local contexts of references between two representatives
• Interactions:
  – Words used by a representative and their vote
  – Pairs of representatives and their local context

(Stoyanov & Eisner, 2012) 39
Structured Prediction Examples
• Examples of structured prediction
  – Part-of-speech (POS) tagging
  – Handwriting recognition
  – Speech recognition
  – Word alignment
  – Congressional voting
• Examples of latent structure
  – Object recognition

40
Case Study: Object Recognition
Data consists of images x and labels y.

x^(1): [image]  y^(1): pigeon        x^(2): [image]  y^(2): rhinoceros
x^(3): [image]  y^(3): leopard       x^(4): [image]  y^(4): llama

41
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Image: leopard]

42
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Figure: leopard image with patches X2, …, X7, latent part labels Z2, …, Z7, and image label Y]

43
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time

[Figure: the same model with factors ψ1, ψ2, ψ3, ψ4 connecting the patches Xi, the part labels Zi, and the image label Y]

44
Structured Prediction
Preview of challenges to come…
• Consider the task of finding the most probable assignment to the output

Classification:
ŷ = argmax_y p(y|x),  where y ∈ {+1, −1}

Structured Prediction:
ŷ = argmax_y p(y|x),  where y ∈ Y and |Y| is very large

45
Machine Learning

• The data inspires the structures we want to predict (Domain Knowledge)
• Our model defines a score for each structure; it also tells us what to optimize (Mathematical Modeling)
• Inference finds {best structure, marginals, partition function} for a new observation (Combinatorial Optimization)
• Learning tunes the parameters of the model (Optimization)

(Inference is usually called as a subroutine in learning)

46
Machine Learning

[Figure: the same pipeline instantiated for the sentence "time flies like an arrow": Data (labeled sentences such as "Alice saw Bob on a hill with a telescope"), a Model over variables X1, …, X5, an Objective, Inference, and Learning]

(Inference is usually called as a subroutine in learning)

47
BACKGROUND

48
Background
Whiteboard
– Chain  Rule  of  Probability
– Conditional  Independence

49
Background: Chain Rule of Probability

For random variables A and B:

P(A, B) = P(A|B) P(B)

For random variables X1, X2, X3, X4:

P(X1, X2, X3, X4) = P(X1|X2, X3, X4) P(X2|X3, X4) P(X3|X4) P(X4)

50
Background: Conditional Independence

Random variables A and B are conditionally independent given C if:

P(A, B|C) = P(A|C) P(B|C)    (1)

or equivalently:

P(A|B, C) = P(A|C)    (2)

We write this as:

A ⊥ B | C    (3)

Later we will also write: I<A, {C}, B>

51
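As an illustrative sketch (NumPy, not from the slides), definition (1) can be checked numerically for a small joint distribution over binary A, B, C that is built, by construction, so that A and B are conditionally independent given C; all the tables below are made up.

import numpy as np

# Hypothetical CPTs: P(C), P(A|C), P(B|C) for binary variables.
P_C = np.array([0.6, 0.4])
P_A_given_C = np.array([[0.9, 0.1],    # row c: P(A=0|c), P(A=1|c)
                        [0.3, 0.7]])
P_B_given_C = np.array([[0.8, 0.2],    # row c: P(B=0|c), P(B=1|c)
                        [0.5, 0.5]])

# Build the joint P(A, B, C) = P(A|C) P(B|C) P(C).
P_ABC = np.einsum('ca,cb,c->abc', P_A_given_C, P_B_given_C, P_C)

# Check (1): P(A, B | C) == P(A | C) P(B | C) for every value of C.
P_AB_given_C = P_ABC / P_ABC.sum(axis=(0, 1), keepdims=True)
for c in range(2):
    lhs = P_AB_given_C[:, :, c]
    rhs = np.outer(P_A_given_C[c], P_B_given_C[c])
    assert np.allclose(lhs, rhs)
print("A is conditionally independent of B given C in this joint")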
Bayesian  Networks

DIRECTED  GRAPHICAL  MODELS

52
Example: Tornado Alarms
1. Imagine that you work at the 911 call center in Dallas
2. You receive six calls informing you that the Emergency Weather Sirens are going off
3. What do you conclude?

53
Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html
Example: Tornado Alarms
1. Imagine that you work at the 911 call center in Dallas
2. You receive six calls informing you that the Emergency Weather Sirens are going off
3. What do you conclude?

54
Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html
Directed  Graphical  Models  
(Bayes  Nets)
Whiteboard
– Example:  Tornado  Alarms
– Writing  Joint  Distributions
• Idea  #1:  Giant  Table
• Idea  #2:  Rewrite  using  chain  rule
• Idea  #3:  Assume  full  independence
• Idea  #4:  Drop  variables  from  RHS  of  conditionals
– Definition:  Bayesian  Network
– Observed  Variables  in  Graphical  Models

55
Bayesian Network

[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)

56
Bayesian Network
Definition:

P(X1, …, Xn) = ∏_{i=1}^n P(Xi | parents(Xi))

[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]

• A Bayesian Network is a directed graphical model
• It consists of a graph G and the conditional probabilities P
• These two parts fully specify the distribution:
  – Qualitative Specification: G
  – Quantitative Specification: P

57
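As a sketch of the quantitative specification (plain Python, not from the slides), the conditional probability tables below are made-up numbers for binary variables on the graph above; the joint is evaluated via the factored form, and summing over all 2^5 assignments confirms it is a proper distribution.

from itertools import product

# Hypothetical CPTs for binary X1..X5 on the graph
# X1 -> X2, X2 -> X4, X3 -> X4, X3 -> X5.
# Each entry gives P(child = 1 | parents).
P_X1 = 0.6
P_X3 = 0.5
P_X2_given_X1 = {0: 0.2, 1: 0.7}                    # keyed by x1
P_X4_given_X2_X3 = {(0, 0): 0.1, (0, 1): 0.5,
                    (1, 0): 0.4, (1, 1): 0.9}       # keyed by (x2, x3)
P_X5_given_X3 = {0: 0.3, 1: 0.8}                    # keyed by x3

def bern(p_one, x):
    """P(X = x) for a binary variable with P(X = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """p(x1,...,x5) = p(x5|x3) p(x4|x2,x3) p(x3) p(x2|x1) p(x1)."""
    return (bern(P_X5_given_X3[x3], x5)
            * bern(P_X4_given_X2_X3[(x2, x3)], x4)
            * bern(P_X3, x3)
            * bern(P_X2_given_X1[x1], x2)
            * bern(P_X1, x1))

print(sum(joint(*xs) for xs in product([0, 1], repeat=5)))   # 1.0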
