
Convex Optimization
Lecture 16 - Softmax Regression and Neural Networks

Instructor: Yuanzhang Xiao

University of Hawaii at Manoa

Fall 2017


Today’s Lecture

1 Softmax Regression

2 Neural Networks


Outline

1 Softmax Regression

2 Neural Networks


Softmax Regression
extend logistic regression to multi-class classification

training data $(a^{(i)}, b^{(i)})$, $i = 1, \ldots, m$

labels with $K$ values: $b^{(i)} \in \{1, \ldots, K\}$

hypothesis:
$$
h_x(a) =
\begin{bmatrix}
P(b = 1 \mid a; x) \\
\vdots \\
P(b = K \mid a; x)
\end{bmatrix}
= \frac{1}{\sum_{k=1}^{K} \exp(x^{(k)T} a)}
\begin{bmatrix}
\exp(x^{(1)T} a) \\
\vdots \\
\exp(x^{(K)T} a)
\end{bmatrix}
$$

parameters to learn:
$$
x = [x^{(1)}, \ldots, x^{(K)}] \in \mathbf{R}^{n \times K}
$$
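As a concrete illustration, a minimal NumPy sketch of this hypothesis (the function name softmax_hypothesis and the max-shift for numerical stability are additions, not from the slides):

import numpy as np

def softmax_hypothesis(x, a):
    """Return the K class probabilities h_x(a) for one example.

    x : (n, K) parameter matrix [x^(1), ..., x^(K)]
    a : (n,)   feature vector
    """
    scores = x.T @ a           # the K inner products x^(k)T a
    scores -= scores.max()     # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()   # divide by sum_k exp(x^(k)T a)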

Maximum Log-Likelihood Estimator


given training data $(a^{(i)}, b^{(i)})_{i=1}^{m}$, the probability of this sequence is
$$
\prod_{i=1}^{m} \prod_{k=1}^{K}
\left[ \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})} \right]^{1\{b^{(i)} = k\}}
$$

log-likelihood is
$$
\ell(x) = \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{b^{(i)} = k\} \cdot
\log \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})}
$$

gradient is
$$
\frac{\partial \ell(x)}{\partial x^{(k)}} = \sum_{i=1}^{m} a^{(i)} \cdot
\left( 1\{b^{(i)} = k\} - \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})} \right)
$$
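A sketch of evaluating $\ell(x)$ and its gradient, reusing the softmax_hypothesis sketch above (A is an $m \times n$ matrix whose rows are the feature vectors $a^{(i)}$, b holds the labels $b^{(i)} \in \{1, \ldots, K\}$; these names are illustrative):

import numpy as np

def log_likelihood_and_grad(x, A, b):
    """ell(x) and d ell / d x^(k) for parameters x (n x K),
    features A (m x n), labels b (m,) with values in {1, ..., K}."""
    m, n = A.shape
    K = x.shape[1]
    ll = 0.0
    grad = np.zeros_like(x, dtype=float)
    for i in range(m):
        p = softmax_hypothesis(x, A[i])         # K class probabilities for a^(i)
        k = b[i] - 1                            # 0-based index of the true class
        ll += np.log(p[k])                      # only the k = b^(i) term is nonzero
        indicator = np.zeros(K)
        indicator[k] = 1.0
        grad += np.outer(A[i], indicator - p)   # column k of grad is d ell / d x^(k)
    return ll, grad

Gradient ascent on $\ell(x)$ (equivalently, descent on $-\ell(x)$) then updates the columns $x^{(k)}$ directly.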


Outline

1 Softmax Regression

2 Neural Networks


Neural Networks – A Single Neuron


fit a training example $(x, y)$ with a neuron:

• input: $x$ and an intercept (bias) term $+1$
• output: $h_{w,b}(x) = f(w^T x + b)$
• $f$ is the activation function

some choices of activation function:

• sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$ (as in logistic regression)
• tanh: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• rectified linear: $f(z) = \max\{0, z\}$ (deep neural networks)
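In NumPy these three choices are one-liners (a sketch; the names sigmoid, tanh_act, and relu are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps R into (0, 1), as in logistic regression

def tanh_act(z):
    return np.tanh(z)                 # rescaled sigmoid, maps R into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # rectified linear, unbounded above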

Neural Networks – Activation Functions


illustration of activation functions

• tanh: rescaled sigmoid


• rectified linear: unbounded

Neural Networks – Basic Architecture


neural network: network of neurons
a three-layer neural network:

• input layer: the leftmost layer


• output layer: the rightmost layer
• hidden layers: the layers in the middle

Neural Networks – Parameters to Learn


a three-layer neural network:

number of layers: $n_\ell = 3$

weight of the link between unit $j$ in layer $\ell$ and unit $i$ in layer $\ell + 1$: $W_{ij}^{(\ell)}$

parameters to learn: $(W^{(1)}, b^{(1)}, \ldots, W^{(n_\ell - 1)}, b^{(n_\ell - 1)})$



Neural Networks – Example

the activation (i.e., output) of unit $i$ at layer $\ell$: $a_i^{(\ell)}$

computation:
$$
\begin{aligned}
a_1^{(2)} &= f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\
a_2^{(2)} &= f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\
a_3^{(2)} &= f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{aligned}
$$
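A literal transcription of these four equations, assuming the 3-3-1 example network with weights W1 (3x3), b1 (3,), W2 (1x3), b2 (1,) and activation f (the names are illustrative):

import numpy as np

def forward_example(x, W1, b1, W2, b2, f=np.tanh):
    """Unrolled forward pass of the three-layer example."""
    a2_1 = f(W1[0, 0]*x[0] + W1[0, 1]*x[1] + W1[0, 2]*x[2] + b1[0])   # a_1^(2)
    a2_2 = f(W1[1, 0]*x[0] + W1[1, 1]*x[1] + W1[1, 2]*x[2] + b1[1])   # a_2^(2)
    a2_3 = f(W1[2, 0]*x[0] + W1[2, 1]*x[1] + W1[2, 2]*x[2] + b1[2])   # a_3^(2)
    return f(W2[0, 0]*a2_1 + W2[0, 1]*a2_2 + W2[0, 2]*a2_3 + b2[0])   # h_{W,b}(x) = a_1^(3)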

Neural Networks – Compact Representation


a three-layer neural network:

define the weighted sum of inputs to unit $i$ in layer $\ell$ as
$$
z_i^{(\ell)} = \sum_{j=1}^{n} W_{ij}^{(\ell - 1)} a_j^{(\ell - 1)} + b_i^{(\ell - 1)},
$$
where
$$
a_j^{(\ell - 1)} = f\bigl( z_j^{(\ell - 1)} \bigr)
$$

Neural Networks – Compact Representation

a three-layer neural network:

compact representation (forward propagation):
$$
\begin{aligned}
z^{(\ell+1)} &= W^{(\ell)} a^{(\ell)} + b^{(\ell)} \\
a^{(\ell+1)} &= f\bigl( z^{(\ell+1)} \bigr)
\end{aligned}
$$
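The two recursions translate directly into a loop over layers. A sketch, assuming the parameters are stored as lists W = [W^(1), ..., W^(n_l - 1)] and b = [b^(1), ..., b^(n_l - 1)]:

import numpy as np

def forward_propagation(x, W, b, f=np.tanh):
    """Forward propagation: a^(1) = x, then z^(l+1) = W^(l) a^(l) + b^(l)
    and a^(l+1) = f(z^(l+1)). Returns all activations and weighted sums."""
    a = x
    activations = [a]      # a^(1), a^(2), ..., a^(n_l)
    weighted_sums = []     # z^(2), ..., z^(n_l)
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = f(z)
        weighted_sums.append(z)
        activations.append(a)
    return activations, weighted_sums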


Neural Networks – Extensions

may have different architectures (i.e., network topology)


• different numbers $s_\ell$ of units in each layer $\ell$
• different connectivity

may have loops

may have multiple output units


Neural Networks – Optimization

minimize the prediction error plus a weight-decay penalty that keeps the weights small:
$$
J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right]
+ \frac{\lambda}{2} \sum_{\ell=1}^{n_\ell - 1} \sum_{j=1}^{s_\ell} \sum_{i=1}^{s_{\ell+1}} \left( W_{ij}^{(\ell)} \right)^2
$$
where $J(W, b; x^{(i)}, y^{(i)})$ is the prediction error of sample $i$:
$$
J(W, b; x^{(i)}, y^{(i)}) = \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2
$$

characteristics:
• nonconvex – gradient descent used in practice
• initialization: small random values near 0 (but not all zeros)
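A sketch of evaluating the objective $J(W, b)$ above, reusing the forward_propagation sketch (X and Y are lists of training inputs and targets, lam is $\lambda$; these names are assumptions):

import numpy as np

def cost(W, b, X, Y, lam, f=np.tanh):
    """J(W, b): average squared prediction error plus the weight-decay term."""
    m = len(X)
    err = 0.0
    for x, y in zip(X, Y):
        h = forward_propagation(x, W, b, f)[0][-1]   # h_{W,b}(x), the last activation
        err += 0.5 * np.sum((h - y) ** 2)            # J(W, b; x, y)
    decay = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)   # (lambda/2) * sum of W_ij^2
    return err / m + decay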


Neural Networks – Calculating Gradients

need to compute gradients:
$$
\begin{aligned}
\frac{\partial J(W, b)}{\partial W_{ij}^{(\ell)}}
&= \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial W_{ij}^{(\ell)}} \right] + \lambda W_{ij}^{(\ell)} \\
\frac{\partial J(W, b)}{\partial b_i^{(\ell)}}
&= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial b_i^{(\ell)}}
\end{aligned}
$$

use backpropagation to compute $\dfrac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial W_{ij}^{(\ell)}}$ and $\dfrac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial b_i^{(\ell)}}$
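Once backpropagation (next slides) returns the per-sample gradients, combining them is just an average plus the weight-decay term. A sketch (per_sample_dW and per_sample_db are lists over samples of per-layer gradients; the names are illustrative):

def total_gradients(per_sample_dW, per_sample_db, W, lam):
    """Average the per-sample gradients and add lam * W^(l) to the weight
    gradients; the bias gradients get no weight-decay term."""
    m = len(per_sample_dW)
    dW = [sum(g[l] for g in per_sample_dW) / m + lam * W[l] for l in range(len(W))]
    db = [sum(g[l] for g in per_sample_db) / m for l in range(len(W))]
    return dW, db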


Neural Networks – Backpropagation


for the output layer (the superscript of the sample index removed):
$$
\begin{aligned}
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(n_\ell - 1)}}
&= \frac{\partial}{\partial W_{ij}^{(n_\ell - 1)}}
\frac{1}{2} \Biggl( y_i - \underbrace{f\Bigl( \sum_{j=1}^{n} W_{ij}^{(n_\ell - 1)} a_j^{(n_\ell - 1)} + b_i^{(n_\ell - 1)} \Bigr)}_{= f(z_i^{(n_\ell)}) = a_i^{(n_\ell)}} \Biggr)^2 \\
&= \bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot \Bigl( -f'\bigl( z_i^{(n_\ell)} \bigr) \Bigr) \cdot \frac{\partial z_i^{(n_\ell)}}{\partial W_{ij}^{(n_\ell - 1)}} \\
&= \underbrace{ -\bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot f'\bigl( z_i^{(n_\ell)} \bigr) }_{\triangleq\, \delta_i^{(n_\ell)}} \cdot\, a_j^{(n_\ell - 1)}
\end{aligned}
$$

Neural Networks – Backpropagation


for the middle layer $n_\ell - 1$ (the superscript of the sample index removed):
$$
\begin{aligned}
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(n_\ell - 2)}}
&= \frac{\partial}{\partial W_{ij}^{(n_\ell - 2)}}
\left\{ \frac{1}{2} \sum_{k=1}^{s_{n_\ell}} \Bigl[ y_k - f\bigl( z_k^{(n_\ell)} \bigr) \Bigr]^2 \right\} \\
&= \sum_{k=1}^{s_{n_\ell}} \bigl( y_k - a_k^{(n_\ell)} \bigr) \cdot \Bigl( -f'\bigl( z_k^{(n_\ell)} \bigr) \Bigr)
\cdot \frac{\partial z_k^{(n_\ell)}}{\partial a_i^{(n_\ell - 1)}}
\cdot \frac{\partial a_i^{(n_\ell - 1)}}{\partial z_i^{(n_\ell - 1)}}
\cdot \frac{\partial z_i^{(n_\ell - 1)}}{\partial W_{ij}^{(n_\ell - 2)}} \\
&= \underbrace{ \sum_{k=1}^{s_{n_\ell}} \delta_k^{(n_\ell)} \cdot W_{ki}^{(n_\ell - 1)} \cdot f'\bigl( z_i^{(n_\ell - 1)} \bigr) }_{\triangleq\, \delta_i^{(n_\ell - 1)}} \cdot\, a_j^{(n_\ell - 2)}
\end{aligned}
$$


Neural Networks – Backpropagation


backpropagation:
• a forward propagation to determine all the $a_i^{(\ell)}$, $z_i^{(\ell)}$
• for the output layer, set
$$
\delta_i^{(n_\ell)} = -\bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot f'\bigl( z_i^{(n_\ell)} \bigr)
$$
• for middle layers $\ell = n_\ell - 1, \ldots, 2$ and each node $i$ in layer $\ell$, set
$$
\delta_i^{(\ell)} = \left( \sum_{j=1}^{s_{\ell+1}} W_{ji}^{(\ell)} \delta_j^{(\ell+1)} \right) f'\bigl( z_i^{(\ell)} \bigr)
$$
• compute gradients
$$
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(\ell)}} = a_j^{(\ell)} \delta_i^{(\ell+1)},
\qquad
\frac{\partial J(W, b; x, y)}{\partial b_i^{(\ell)}} = \delta_i^{(\ell+1)}
$$
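A sketch of these four steps for one training example, reusing the forward_propagation sketch above with $f = \tanh$ (so $f'(z) = 1 - \tanh(z)^2$); the function name and the list-based return convention are assumptions:

import numpy as np

def backprop_single(x, y, W, b):
    """Backpropagation for one sample (x, y) with f = tanh; returns the
    per-layer gradients of J(W, b; x, y) = 0.5 * ||h_{W,b}(x) - y||^2."""
    f = np.tanh
    fprime = lambda v: 1.0 - np.tanh(v) ** 2       # derivative of tanh
    a, z = forward_propagation(x, W, b, f)         # a^(1)..a^(n_l) and z^(2)..z^(n_l)
    L = len(W)                                     # number of weight layers, n_l - 1
    delta = -(y - a[-1]) * fprime(z[-1])           # output layer delta^(n_l)
    dW, db = [None] * L, [None] * L
    for l in range(L - 1, -1, -1):
        dW[l] = np.outer(delta, a[l])              # dJ/dW = delta (next layer) times a (this layer)
        db[l] = delta                              # dJ/db = delta of the next layer
        if l > 0:
            delta = (W[l].T @ delta) * fprime(z[l - 1])   # push delta one layer back
    return dW, db

With these per-sample gradients, gradient descent on $J(W, b)$ proceeds as on the previous slides.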