
Convex Optimization
Lecture 16 - Softmax Regression and Neural Networks

Instructor: Yuanzhang Xiao

University of Hawaii at Manoa

Fall 2017


Today’s Lecture

1 Softmax Regression

2 Neural Networks


Outline

1 Softmax Regression

2 Neural Networks


Softmax Regression
extend logistic regression to multi-class classification

training data $(a^{(i)}, b^{(i)})$, $i = 1, \ldots, m$

labels with $K$ values: $b^{(i)} \in \{1, \ldots, K\}$

hypothesis:
$$
h_x(a) =
\begin{bmatrix}
P(b = 1 \mid a; x) \\
\vdots \\
P(b = K \mid a; x)
\end{bmatrix}
= \frac{1}{\sum_{k=1}^{K} \exp(x^{(k)T} a)}
\begin{bmatrix}
\exp(x^{(1)T} a) \\
\vdots \\
\exp(x^{(K)T} a)
\end{bmatrix}
$$

parameters to learn:
$$
x = [x^{(1)}, \ldots, x^{(K)}] \in \mathbf{R}^{n \times K}
$$
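As a concrete illustration, a minimal NumPy sketch of this hypothesis (the function name softmax_hypothesis and the max-shift for numerical stability are additions, not from the slides):

import numpy as np

def softmax_hypothesis(x, a):
    """Return the K class probabilities h_x(a) for one example.

    x : (n, K) parameter matrix [x^(1), ..., x^(K)]
    a : (n,)   feature vector
    """
    scores = x.T @ a           # the K inner products x^(k)T a
    scores -= scores.max()     # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()   # divide by sum_k exp(x^(k)T a)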

Maximum Log-Likelihood Estimator


given training data $(a^{(i)}, b^{(i)})_{i=1}^{m}$, the probability of this sequence is
$$
\prod_{i=1}^{m} \prod_{k=1}^{K}
\left[ \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})} \right]^{1\{b^{(i)} = k\}}
$$

log-likelihood is
$$
\ell(x) = \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{b^{(i)} = k\} \cdot
\log \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})}
$$

gradient is
$$
\frac{\partial \ell(x)}{\partial x^{(k)}} = \sum_{i=1}^{m} a^{(i)} \cdot
\left( 1\{b^{(i)} = k\} - \frac{\exp(x^{(k)T} a^{(i)})}{\sum_{j=1}^{K} \exp(x^{(j)T} a^{(i)})} \right)
$$
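A sketch of evaluating $\ell(x)$ and its gradient, reusing the softmax_hypothesis sketch above (A is an $m \times n$ matrix whose rows are the feature vectors $a^{(i)}$, b holds the labels $b^{(i)} \in \{1, \ldots, K\}$; these names are illustrative):

import numpy as np

def log_likelihood_and_grad(x, A, b):
    """ell(x) and d ell / d x^(k) for parameters x (n x K),
    features A (m x n), labels b (m,) with values in {1, ..., K}."""
    m, n = A.shape
    K = x.shape[1]
    ll = 0.0
    grad = np.zeros_like(x, dtype=float)
    for i in range(m):
        p = softmax_hypothesis(x, A[i])         # K class probabilities for a^(i)
        k = b[i] - 1                            # 0-based index of the true class
        ll += np.log(p[k])                      # only the k = b^(i) term is nonzero
        indicator = np.zeros(K)
        indicator[k] = 1.0
        grad += np.outer(A[i], indicator - p)   # column k of grad is d ell / d x^(k)
    return ll, grad

Gradient ascent on $\ell(x)$ (equivalently, descent on $-\ell(x)$) then updates the columns $x^{(k)}$ directly.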


Outline

1 Softmax Regression

2 Neural Networks


Neural Networks – A Single Neuron


fit a training example $(x, y)$ with a neuron:

• input: $x$ and an intercept (bias) term $+1$
• output: $h_{w,b}(x) = f(w^T x + b)$
• $f$ is the activation function

some choices of activation function:

• sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$ (as in logistic regression)
• tanh: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• rectified linear: $f(z) = \max\{0, z\}$ (deep neural networks)
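In NumPy these three choices are one-liners (a sketch; the names sigmoid, tanh_act, and relu are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps R into (0, 1), as in logistic regression

def tanh_act(z):
    return np.tanh(z)                 # rescaled sigmoid, maps R into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # rectified linear, unbounded above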

Neural Networks – Activation Functions


illustration of activation functions

• tanh: rescaled sigmoid


• rectified linear: unbounded

Neural Networks – Basic Architecture


neural network: network of neurons
a three-layer neural network:

• input layer: the leftmost layer


• output layer: the rightmost layer
• hidden layers: the layers in the middle

Neural Networks – Parameters to Learn


a three-layer neural network:

number of layers: $n_\ell = 3$

weight of the link between unit $j$ in layer $\ell$ and unit $i$ in layer $\ell + 1$: $W_{ij}^{(\ell)}$

parameters to learn: $(W^{(1)}, b^{(1)}, \ldots, W^{(n_\ell - 1)}, b^{(n_\ell - 1)})$



Neural Networks – Example

the activation (i.e., output) of unit $i$ at layer $\ell$: $a_i^{(\ell)}$

computation:
$$
\begin{aligned}
a_1^{(2)} &= f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\
a_2^{(2)} &= f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\
a_3^{(2)} &= f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{aligned}
$$
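A literal transcription of these four equations, assuming the 3-3-1 example network with weights W1 (3x3), b1 (3,), W2 (1x3), b2 (1,) and activation f (the names are illustrative):

import numpy as np

def forward_example(x, W1, b1, W2, b2, f=np.tanh):
    """Unrolled forward pass of the three-layer example."""
    a2_1 = f(W1[0, 0]*x[0] + W1[0, 1]*x[1] + W1[0, 2]*x[2] + b1[0])   # a_1^(2)
    a2_2 = f(W1[1, 0]*x[0] + W1[1, 1]*x[1] + W1[1, 2]*x[2] + b1[1])   # a_2^(2)
    a2_3 = f(W1[2, 0]*x[0] + W1[2, 1]*x[1] + W1[2, 2]*x[2] + b1[2])   # a_3^(2)
    return f(W2[0, 0]*a2_1 + W2[0, 1]*a2_2 + W2[0, 2]*a2_3 + b2[0])   # h_{W,b}(x) = a_1^(3)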

Neural Networks – Compact Representation


a three-layer neural network:

define the weighted sum of inputs to unit $i$ in layer $\ell$ as
$$
z_i^{(\ell)} = \sum_{j=1}^{n} W_{ij}^{(\ell - 1)} a_j^{(\ell - 1)} + b_i^{(\ell - 1)},
$$
where
$$
a_j^{(\ell - 1)} = f\bigl( z_j^{(\ell - 1)} \bigr)
$$

Neural Networks – Compact Representation

a three-layer neural network:

compact representation (forward propagation):
$$
\begin{aligned}
z^{(\ell+1)} &= W^{(\ell)} a^{(\ell)} + b^{(\ell)} \\
a^{(\ell+1)} &= f\bigl( z^{(\ell+1)} \bigr)
\end{aligned}
$$
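The two recursions translate directly into a loop over layers. A sketch, assuming the parameters are stored as lists W = [W^(1), ..., W^(n_l - 1)] and b = [b^(1), ..., b^(n_l - 1)]:

import numpy as np

def forward_propagation(x, W, b, f=np.tanh):
    """Forward propagation: a^(1) = x, then z^(l+1) = W^(l) a^(l) + b^(l)
    and a^(l+1) = f(z^(l+1)). Returns all activations and weighted sums."""
    a = x
    activations = [a]      # a^(1), a^(2), ..., a^(n_l)
    weighted_sums = []     # z^(2), ..., z^(n_l)
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = f(z)
        weighted_sums.append(z)
        activations.append(a)
    return activations, weighted_sums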


Neural Networks – Extensions

may have different architectures (i.e., network topology)


• different numbers $s_\ell$ of units in each layer $\ell$
• different connectivity

may have loops

may have multiple output units


Neural Networks – Optimization

minimize the prediction error plus a weight-decay penalty that keeps the weights small:
$$
J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right]
+ \frac{\lambda}{2} \sum_{\ell=1}^{n_\ell - 1} \sum_{j=1}^{s_\ell} \sum_{i=1}^{s_{\ell+1}} \left( W_{ij}^{(\ell)} \right)^2
$$
where $J(W, b; x^{(i)}, y^{(i)})$ is the prediction error of sample $i$:
$$
J(W, b; x^{(i)}, y^{(i)}) = \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2
$$

characteristics:
• nonconvex – gradient descent used in practice
• initialization: small random values near 0 (but not all zeros)
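A sketch of evaluating the objective $J(W, b)$ above, reusing the forward_propagation sketch (X and Y are lists of training inputs and targets, lam is $\lambda$; these names are assumptions):

import numpy as np

def cost(W, b, X, Y, lam, f=np.tanh):
    """J(W, b): average squared prediction error plus the weight-decay term."""
    m = len(X)
    err = 0.0
    for x, y in zip(X, Y):
        h = forward_propagation(x, W, b, f)[0][-1]   # h_{W,b}(x), the last activation
        err += 0.5 * np.sum((h - y) ** 2)            # J(W, b; x, y)
    decay = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)   # (lambda/2) * sum of W_ij^2
    return err / m + decay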


Neural Networks – Calculating Gradients

need to compute gradients:
$$
\begin{aligned}
\frac{\partial J(W, b)}{\partial W_{ij}^{(\ell)}}
&= \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial W_{ij}^{(\ell)}} \right] + \lambda W_{ij}^{(\ell)} \\
\frac{\partial J(W, b)}{\partial b_i^{(\ell)}}
&= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial b_i^{(\ell)}}
\end{aligned}
$$

use backpropagation to compute $\dfrac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial W_{ij}^{(\ell)}}$ and $\dfrac{\partial J(W, b; x^{(i)}, y^{(i)})}{\partial b_i^{(\ell)}}$
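Once backpropagation (next slides) returns the per-sample gradients, combining them is just an average plus the weight-decay term. A sketch (per_sample_dW and per_sample_db are lists over samples of per-layer gradients; the names are illustrative):

def total_gradients(per_sample_dW, per_sample_db, W, lam):
    """Average the per-sample gradients and add lam * W^(l) to the weight
    gradients; the bias gradients get no weight-decay term."""
    m = len(per_sample_dW)
    dW = [sum(g[l] for g in per_sample_dW) / m + lam * W[l] for l in range(len(W))]
    db = [sum(g[l] for g in per_sample_db) / m for l in range(len(W))]
    return dW, db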


Neural Networks – Backpropagation


for the output layer (the superscript of the sample index removed):
$$
\begin{aligned}
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(n_\ell - 1)}}
&= \frac{\partial}{\partial W_{ij}^{(n_\ell - 1)}}
\frac{1}{2} \Biggl( y_i - \underbrace{f\Bigl( \sum_{j=1}^{n} W_{ij}^{(n_\ell - 1)} a_j^{(n_\ell - 1)} + b_i^{(n_\ell - 1)} \Bigr)}_{= f(z_i^{(n_\ell)}) = a_i^{(n_\ell)}} \Biggr)^2 \\
&= \bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot \Bigl( -f'\bigl( z_i^{(n_\ell)} \bigr) \Bigr) \cdot \frac{\partial z_i^{(n_\ell)}}{\partial W_{ij}^{(n_\ell - 1)}} \\
&= \underbrace{ -\bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot f'\bigl( z_i^{(n_\ell)} \bigr) }_{\triangleq\, \delta_i^{(n_\ell)}} \cdot\, a_j^{(n_\ell - 1)}
\end{aligned}
$$

Neural Networks – Backpropagation


for the middle layer $n_\ell - 1$ (the superscript of the sample index removed):
$$
\begin{aligned}
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(n_\ell - 2)}}
&= \frac{\partial}{\partial W_{ij}^{(n_\ell - 2)}}
\left\{ \frac{1}{2} \sum_{k=1}^{s_{n_\ell}} \Bigl[ y_k - f\bigl( z_k^{(n_\ell)} \bigr) \Bigr]^2 \right\} \\
&= \sum_{k=1}^{s_{n_\ell}} \bigl( y_k - a_k^{(n_\ell)} \bigr) \cdot \Bigl( -f'\bigl( z_k^{(n_\ell)} \bigr) \Bigr)
\cdot \frac{\partial z_k^{(n_\ell)}}{\partial a_i^{(n_\ell - 1)}}
\cdot \frac{\partial a_i^{(n_\ell - 1)}}{\partial z_i^{(n_\ell - 1)}}
\cdot \frac{\partial z_i^{(n_\ell - 1)}}{\partial W_{ij}^{(n_\ell - 2)}} \\
&= \underbrace{ \sum_{k=1}^{s_{n_\ell}} \delta_k^{(n_\ell)} \cdot W_{ki}^{(n_\ell - 1)} \cdot f'\bigl( z_i^{(n_\ell - 1)} \bigr) }_{\triangleq\, \delta_i^{(n_\ell - 1)}} \cdot\, a_j^{(n_\ell - 2)}
\end{aligned}
$$


Neural Networks – Backpropagation


backpropagation:
• a forward propagation to determine all the $a_i^{(\ell)}$, $z_i^{(\ell)}$
• for the output layer, set
$$
\delta_i^{(n_\ell)} = -\bigl( y_i - a_i^{(n_\ell)} \bigr) \cdot f'\bigl( z_i^{(n_\ell)} \bigr)
$$
• for middle layers $\ell = n_\ell - 1, \ldots, 2$ and each node $i$ in layer $\ell$, set
$$
\delta_i^{(\ell)} = \left( \sum_{j=1}^{s_{\ell+1}} W_{ji}^{(\ell)} \delta_j^{(\ell+1)} \right) f'\bigl( z_i^{(\ell)} \bigr)
$$
• compute gradients
$$
\frac{\partial J(W, b; x, y)}{\partial W_{ij}^{(\ell)}} = a_j^{(\ell)} \delta_i^{(\ell+1)},
\qquad
\frac{\partial J(W, b; x, y)}{\partial b_i^{(\ell)}} = \delta_i^{(\ell+1)}
$$
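A sketch of these four steps for one training example, reusing the forward_propagation sketch above with $f = \tanh$ (so $f'(z) = 1 - \tanh(z)^2$); the function name and the list-based return convention are assumptions:

import numpy as np

def backprop_single(x, y, W, b):
    """Backpropagation for one sample (x, y) with f = tanh; returns the
    per-layer gradients of J(W, b; x, y) = 0.5 * ||h_{W,b}(x) - y||^2."""
    f = np.tanh
    fprime = lambda v: 1.0 - np.tanh(v) ** 2       # derivative of tanh
    a, z = forward_propagation(x, W, b, f)         # a^(1)..a^(n_l) and z^(2)..z^(n_l)
    L = len(W)                                     # number of weight layers, n_l - 1
    delta = -(y - a[-1]) * fprime(z[-1])           # output layer delta^(n_l)
    dW, db = [None] * L, [None] * L
    for l in range(L - 1, -1, -1):
        dW[l] = np.outer(delta, a[l])              # dJ/dW = delta (next layer) times a (this layer)
        db[l] = delta                              # dJ/db = delta of the next layer
        if l > 0:
            delta = (W[l].T @ delta) * fprime(z[l - 1])   # push delta one layer back
    return dW, db

With these per-sample gradients, gradient descent on $J(W, b)$ proceeds as on the previous slides.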