
Neural networks

COMS 4771 Fall 2019


Overview

▶ Structure and power of neural networks
▶ Backpropagation
▶ Practical issues
▶ Convolutions

Parametric featurizations I

▶ So far: features (x or ϕ(x)) are fixed during training
  ▶ Consider a (small) collection of feature transformations ϕ
  ▶ Select ϕ via cross-validation – outside of normal training
▶ "Deep learning" approach:
  ▶ Use ϕ with many tunable parameters
  ▶ Optimize parameters of ϕ during normal training process


Parametric featurizations II

▶ Neural network: parameterization for function f : R^d → R
  ▶ f(x) = ϕ(x)^T w
  ▶ Parameters include both w and parameters of ϕ
▶ Varying parameters of ϕ allows f to be essentially any function!
▶ Major challenge: optimization (a lot of tricks to make it work)

Figure 1: Neural network
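To make the "tunable featurization" idea concrete, here is a minimal numpy sketch; the shapes and the choice ϕ(x) = tanh(W1 x) are illustrative assumptions, not from the slides. Both the parameters of ϕ and the final weight vector w would be adjusted during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 10                      # input dimension, number of learned features

# Parameters of the featurization ϕ and of the final linear map.
W1 = rng.normal(size=(k, d))      # parameters of ϕ (tunable)
w  = rng.normal(size=k)           # linear weights on top of ϕ(x) (tunable)

def phi(x):
    # A parametric feature map; tanh is just one illustrative choice.
    return np.tanh(W1 @ x)

def f(x):
    # f(x) = ϕ(x)^T w, as on the slide; training adjusts both W1 and w.
    return phi(x) @ w

x = rng.normal(size=d)
print(f(x))
```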
Feedforward neural network

▶ Architecture of a feedforward neural network:
  ▶ Directed acyclic graph G = (V, E)
  ▶ One source node (vertex) per input, one sink node per output
  ▶ Other nodes are hidden units
▶ Each edge (u, v) ∈ E has a weight parameter w_{u,v} ∈ R
▶ Value h_v of node v given values of its parents is

    h_v := σ_v(z_v),   z_v := Σ_{u ∈ V : (u,v) ∈ E} w_{u,v} · h_u.

▶ σ_v : R → R is the activation function (e.g., sigmoid)

Figure 2: Feedforward neural network architecture (an edge (u, v) with weight w_{u,v})


Standard layered architectures

▶ Standard architecture arranges nodes into a sequence of L layers
  ▶ Edges only go from one layer to the next
▶ Can write the function using matrices of weight parameters:

    f(x) = σ_L(W^L σ_{L−1}(··· σ_1(W^1 x) ···))

  ▶ d_ℓ nodes in layer ℓ; W^ℓ ∈ R^{d_ℓ × d_{ℓ−1}} are weight parameters
  ▶ Activation function σ_ℓ : R → R is applied coordinate-wise to its input
▶ Often also include "bias" parameters b^ℓ ∈ R^{d_ℓ}:

    f(x) = σ_L(b^L + W^L σ_{L−1}(··· σ_1(b^1 + W^1 x) ···))

▶ Tunable parameters: θ = (W^1, b^1, ..., W^L, b^L)

Figure 3: Standard feedforward architecture (weight matrices W^1, W^2, W^3; input, hidden units, output)
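A minimal numpy sketch of the layered forward pass f(x) = σ_L(b^L + W^L σ_{L−1}(··· σ_1(b^1 + W^1 x) ···)); the layer widths, ReLU hidden activations, and identity output activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 1]             # d_0 (input), d_1, d_2, d_3 (output); illustrative

# Weight matrices W^l in R^{d_l x d_{l-1}} and biases b^l in R^{d_l}.
Ws = [rng.normal(size=(widths[l], widths[l - 1])) for l in range(1, len(widths))]
bs = [np.zeros(widths[l]) for l in range(1, len(widths))]

def forward(x, Ws, bs):
    """Compute f(x) = sigma_L(b^L + W^L sigma_{L-1}(... sigma_1(b^1 + W^1 x) ...))."""
    h = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = b + W @ h
        # ReLU on hidden layers, identity on the output layer (an assumption).
        h = np.maximum(z, 0.0) if l < len(Ws) - 1 else z
    return h

x = rng.normal(size=widths[0])
print(forward(x, Ws, bs))
```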

Well-known activation functions

▶ Heaviside: σ(z) = 1{z ≥ 0}
  ▶ Popular in the 1940s; also called the step function
▶ Sigmoid (from logistic regression): σ(z) = 1/(1 + e^{−z})
  ▶ Popular since the 1970s
▶ Hyperbolic tangent: σ(z) = tanh(z)
  ▶ Similar to sigmoid, but range is (−1, 1) rather than (0, 1)
▶ Rectified Linear Unit (ReLU): σ(z) = max{0, z}
  ▶ Popular since 2012
▶ Identity: σ(z) = z
  ▶ Popular with luddites
▶ Softmax: σ(v)_i = exp(v_i) / Σ_j exp(v_j)
  ▶ Special vector-valued activation function
  ▶ Popular for the final layer in multi-class classification


Power of non-linear activations

▶ What happens if every activation function is linear/affine?
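The activation functions listed above are easy to write down directly; a minimal numpy sketch (the softmax subtracts the maximum before exponentiating, a standard numerical-stability trick not mentioned on the slide).

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(float)          # step function, 1{z >= 0}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1)

def relu(z):
    return np.maximum(z, 0.0)              # max{0, z}

def identity(z):
    return z

def softmax(v):
    # Vector-valued: softmax(v)_i = exp(v_i) / sum_j exp(v_j).
    e = np.exp(v - np.max(v))              # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(heaviside(z), sigmoid(z), np.tanh(z), relu(z), softmax(z), sep="\n")
```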
Necessity of multiple layers

▶ Suppose we only have input and output layers, so the function f is

    f(x) = σ(b + w^T x)

  where b ∈ R and w ∈ R^d (so w^T ∈ R^{1×d})
▶ If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU, identity), then f has the same limitations as a linear/affine classifier

Figure 4: XOR problem


Neural network approximation theorems

▶ Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ_1 be any continuous non-linear activation function from above. For any continuous function f : R^d → R and any ε > 0, there is a two-layer neural network (with parameters θ = (W^1, b^1, w^2)) s.t.

    max_{x ∈ [0,1]^d} |f(x) − (w^2)^T σ_1(b^1 + W^1 x)| < ε.

▶ Many caveats:
  ▶ "Width" (number of hidden units) may need to be very large
  ▶ Does not tell us how to find the network
  ▶ Does not justify deeper networks
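As a concrete illustration of why a hidden layer helps with the XOR problem (Figure 4), here is a small two-layer ReLU network with hand-picked weights that computes XOR exactly on {0,1}^2. The particular weights are my own choice for illustration, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Output: h1 - 2*h2, which equals XOR(x1, x2) on {0,1}^2.
w2 = np.array([1.0, -2.0])

def f(x):
    return w2 @ relu(b1 + W1 @ x)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float)))   # prints 0, 1, 1, 0
```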

Fitting neural networks to data

▶ Training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y
▶ Fix architecture: G = (V, E) and activation functions
▶ Plug-in principle: find parameters θ of the neural network f_θ to minimize empirical risk (possibly with a surrogate loss):

    R̂(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2           (regression)

    R̂(θ) = (1/n) Σ_{i=1}^n ℓ_log(−y_i f_θ(x_i))          (binary classification)

    R̂(θ) = (1/n) Σ_{i=1}^n ℓ_ce(ỹ_i, f_θ(x_i))           (multi-class classification)

  (Could use other surrogate loss functions ...)
▶ Typically the objective is not convex in the parameters θ
▶ Nevertheless, local search (e.g., SGD) often works well!


Backpropagation

▶ Backpropagation (backprop): algorithm for computing partial derivatives wrt the weights in a feedforward neural network
  ▶ Clever organization of partial derivative computations with the chain rule
  ▶ Use in combination with gradient descent, SGD, etc.
▶ Consider the loss on a single example (x, y), written J := ℓ(y, f_θ(x))
▶ Goal: compute ∂J/∂w_{u,v} for every edge (u, v) ∈ E
▶ Initial step of backprop: forward propagation
  ▶ Compute z_v's and h_v's for every node v ∈ V
  ▶ Running time: linear in size of network
▶ Rest of backprop also requires only time linear in size of network!
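A minimal numpy sketch of the three empirical risks above, given the network's predictions on the training data. Here ℓ_log(z) = log(1 + e^z) and ℓ_ce is cross-entropy against the true class index; both readings of the notation are my assumptions.

```python
import numpy as np

def risk_squared(preds, ys):
    # (1/n) sum_i (f(x_i) - y_i)^2, for regression.
    return np.mean((preds - ys) ** 2)

def risk_logistic(preds, ys):
    # (1/n) sum_i l_log(-y_i f(x_i)) with l_log(z) = log(1 + e^z), labels y_i in {-1, +1}.
    return np.mean(np.log1p(np.exp(-ys * preds)))

def risk_cross_entropy(probs, ys):
    # (1/n) sum_i l_ce(y_i, f(x_i)): probs[i] is a softmax output, ys[i] the true class index.
    n = len(ys)
    return -np.mean(np.log(probs[np.arange(n), ys]))

preds = np.array([0.9, -1.2, 2.0])
print(risk_squared(preds, np.array([1.0, -1.0, 1.5])))
print(risk_logistic(preds, np.array([1, -1, 1])))
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(risk_cross_entropy(probs, np.array([0, 1])))
```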
Derivative of loss with respect to weights

▶ Let ŷ_1, ŷ_2, ... denote the values at the output nodes.
▶ Then by chain rule,

    ∂J/∂w_{u,v} = Σ_i (∂J/∂ŷ_i) · (∂ŷ_i/∂w_{u,v}).


Derivative of output with respect to weights

▶ Assume for simplicity there is just a single output, ŷ
▶ Chain rule, again:

    ∂ŷ/∂w_{u,v} = (∂ŷ/∂h_v) · (∂h_v/∂w_{u,v})

▶ First term: trickier; we'll handle it later
▶ Second term: from the definitions h_v = σ_v(z_v) and z_v = Σ_u w_{u,v} h_u,

    ∂h_v/∂w_{u,v} = σ_v'(z_v) · h_u

(Figure: an edge (u, v) with weight w_{u,v})
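Chain-rule bookkeeping like this is easy to get wrong, so a common sanity check is to compare a hand-derived partial derivative against a finite-difference approximation. A minimal sketch of such a checker; the example function, weights, and step size are arbitrary choices of mine.

```python
import numpy as np

def numerical_grad(J, w, eps=1e-6):
    """Central finite-difference approximation of dJ/dw_k for each parameter w_k."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (J(w + e) - J(w - e)) / (2 * eps)
    return g

# Tiny example: J(w) = (sigma(w1*x1 + w2*x2) - y)^2 with sigma = tanh.
x, y = np.array([1.0, -2.0]), 0.5

def J(w):
    return (np.tanh(w @ x) - y) ** 2

w = np.array([0.3, 0.1])
# Hand-derived gradient via the chain rule: 2*(sigma(z) - y) * sigma'(z) * x, z = w.x.
z = w @ x
analytic = 2 * (np.tanh(z) - y) * (1 - np.tanh(z) ** 2) * x
print(np.allclose(analytic, numerical_grad(J, w)))   # should print True
```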

Derivative of output with respect to hidden units

▶ Compute ∂ŷ/∂h_v for all vertices in decreasing order of layer number
▶ If v is not the output node, then by chain rule (yet again),

    ∂ŷ/∂h_v = Σ_{v' : (v,v') ∈ E} (∂ŷ/∂h_{v'}) · (∂h_{v'}/∂h_v)

(Figure: node v with out-neighbors v'_1, ..., v'_k and edge weights w_{v,v'_1}, ..., w_{v,v'_k})


Example: chain graph I

Figure 5: Chain graph with nodes 0 (input), 1, ..., L (output) and edge weights w_{0,1}, w_{1,2}, ..., w_{L−1,L}; assume the same activation σ in every layer

▶ Parameters θ = (w_{0,1}, w_{1,2}, ..., w_{L−1,L})
▶ Fix input value x ∈ R; what is ∂h_L/∂w_{i−1,i} for i = 1, ..., L?
▶ Forward propagation:
  ▶ h_0 := x
  ▶ For i = 1, 2, ..., L:

      z_i := w_{i−1,i} h_{i−1}
      h_i := σ(z_i)
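A minimal numpy sketch of forward propagation on the chain graph; the choice σ = tanh and the random weights are just for illustration.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)               # illustrative choice of activation

def forward_chain(x, w):
    """Forward propagation: h_0 = x; z_i = w[i]*h_{i-1}, h_i = sigma(z_i) for i = 1..L.
    w[i] plays the role of w_{i-1,i}; index 0 of w is unused, for readability."""
    L = len(w) - 1
    z, h = np.zeros(L + 1), np.zeros(L + 1)
    h[0] = x
    for i in range(1, L + 1):
        z[i] = w[i] * h[i - 1]
        h[i] = sigma(z[i])
    return z, h

rng = np.random.default_rng(0)
L = 4
w = np.concatenate(([0.0], rng.normal(size=L)))   # w[1..L] = w_{0,1}, ..., w_{L-1,L}
z, h = forward_chain(1.5, w)
print(h[L])                                        # network output h_L
```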
Example: chain graph II

Figure 6: Chain graph with nodes 0 (input), 1, ..., L (output) and edge weights w_{0,1}, w_{1,2}, ..., w_{L−1,L}; assume the same activation σ in every layer

▶ Backprop:
  ▶ For i = L, L−1, ..., 1:

      ∂h_L/∂h_i := 1                                       if i = L
      ∂h_L/∂h_i := (∂h_L/∂h_{i+1}) · σ'(z_{i+1}) w_{i,i+1}   if i < L

      ∂h_L/∂w_{i−1,i} := (∂h_L/∂h_i) · σ'(z_i) h_{i−1}


Practical issues I: Initialization

▶ Ensure inputs are standardized: every feature has zero mean and unit variance (wrt training data)
▶ Initialize weights randomly for gradient descent / SGD
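Returning to the chain-graph example: a minimal numpy sketch of the backward pass from the "Example: chain graph II" slide, with the forward pass repeated so the block is self-contained (σ = tanh, so σ'(z) = 1 − tanh(z)^2, an illustrative choice). The last lines check one derivative against finite differences.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2    # derivative of tanh

def forward_chain(x, w):
    L = len(w) - 1
    z, h = np.zeros(L + 1), np.zeros(L + 1)
    h[0] = x
    for i in range(1, L + 1):
        z[i] = w[i] * h[i - 1]      # z_i = w_{i-1,i} * h_{i-1}
        h[i] = sigma(z[i])
    return z, h

def backprop_chain(x, w):
    """Compute dh_L/dw_{i-1,i} for i = 1..L via the backward recursion."""
    L = len(w) - 1
    z, h = forward_chain(x, w)
    dh = np.zeros(L + 1)            # dh[i] = dh_L/dh_i
    dw = np.zeros(L + 1)            # dw[i] = dh_L/dw_{i-1,i}
    dh[L] = 1.0
    for i in range(L, 0, -1):
        if i < L:
            dh[i] = dh[i + 1] * sigma_prime(z[i + 1]) * w[i + 1]
        dw[i] = dh[i] * sigma_prime(z[i]) * h[i - 1]
    return dw

rng = np.random.default_rng(0)
w = np.concatenate(([0.0], rng.normal(size=4)))
x = 1.5
dw = backprop_chain(x, w)

# Finite-difference check of dh_L/dw_{0,1}.
eps = 1e-6
wp, wm = w.copy(), w.copy()
wp[1] += eps; wm[1] -= eps
fd = (forward_chain(x, wp)[1][-1] - forward_chain(x, wm)[1][-1]) / (2 * eps)
print(np.isclose(dw[1], fd))        # should print True
```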

Practical issues II: Architecture choice

▶ Architecture can be regarded as a "hyperparameter"
▶ Optimization-inspired architecture choice:
  ▶ With a wide enough network, can get training error rate zero
  ▶ Use the smallest network that lets you get zero training error rate
  ▶ Then add a regularization term to the objective (e.g., sum of squares of weights), and optimize the regularized ERM objective
▶ Entire research communities are trying to figure out good architectures for their problems


Convolutional nets

▶ Neural networks with convolutional layers
▶ Useful when inputs have locality structure
  ▶ Sequential structure (e.g., speech waveform)
  ▶ 2D grid structure (e.g., image)
  ▶ ...
▶ Weight matrix W^ℓ is highly structured
  ▶ Determined by a small filter
  ▶ Time to compute W^ℓ h^{ℓ−1} is typically ≪ d_ℓ × d_{ℓ−1} (e.g., closer to max{d_ℓ, d_{ℓ−1}})
Convolutions I

▶ Convolution of two continuous functions: h := f ∗ g

    h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy

▶ If f(x) = 0 for x ∉ [−w, +w], then

    h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy

▶ Replaces g(x) with a weighted combination of g at nearby points
▶ For functions on a discrete domain, replace the integral with a sum:

    h(i) = Σ_{j=−∞}^{∞} f(j) g(i − j)


Convolutions II

▶ E.g., suppose only f(0), f(1), f(2) are non-zero, and g is zero-padded (in this case, g(i) = 0 for i < 1 or i > 5). Then:

    [ h(1) ]   [ f(0)   0     0     0     0   ]
    [ h(2) ]   [ f(1)  f(0)   0     0     0   ]   [ g(1) ]
    [ h(3) ]   [ f(2)  f(1)  f(0)   0     0   ]   [ g(2) ]
    [ h(4) ] = [  0    f(2)  f(1)  f(0)   0   ] · [ g(3) ]
    [ h(5) ]   [  0     0    f(2)  f(1)  f(0) ]   [ g(4) ]
    [ h(6) ]   [  0     0     0    f(2)  f(1) ]   [ g(5) ]
    [ h(7) ]   [  0     0     0     0    f(2) ]

Figure 7: Convolutional layer
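A minimal numpy sketch checking the picture above: build the 7×5 structured matrix whose entries are values of f, and verify that multiplying it by (g(1), ..., g(5)) matches a direct discrete convolution (np.convolve with its default 'full' mode). The particular values of f and g are arbitrary.

```python
import numpy as np

f = np.array([2.0, -1.0, 0.5])                # f(0), f(1), f(2); zero elsewhere
g = np.array([1.0, 3.0, 0.0, -2.0, 4.0])      # g(1), ..., g(5); zero-padded elsewhere

# Matrix from the slide: row i, column m holds f(i - m) (zero if out of range),
# so that (M @ g)[i] = sum_j f(j) g(i - j) = h(i + 1).
n_out, n_in = len(f) + len(g) - 1, len(g)
M = np.zeros((n_out, n_in))
for i in range(n_out):
    for m in range(n_in):
        if 0 <= i - m < len(f):
            M[i, m] = f[i - m]

h_matrix = M @ g                       # h(1), ..., h(7) via the structured matrix
h_direct = np.convolve(f, g)           # the same convolution computed directly
print(np.allclose(h_matrix, h_direct)) # should print True
```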

2D convolutions I–IV

▶ Similar for 2D inputs (e.g., images), except now sum over two indices
  ▶ g is the input image
  ▶ f is the filter
▶ Lots of variations (e.g., padding, strides, multiple "channels")

Figures 8–11: 2D convolution, with padding, no stride

▶ Use additional layers/activations to down-sample after convolution
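A minimal sketch of a 2D convolution with zero padding and no stride, written with explicit loops so the "sum over two indices" is visible; the image, filter size, and filter values are arbitrary. (Like the 1D formula above, this flips the filter; many libraries implement the unflipped cross-correlation and still call it "convolution".)

```python
import numpy as np

def conv2d_same(g, f):
    """2D convolution h(i,k) = sum_{j,m} f(j,m) g(i-j, k-m), zero padding, no stride.
    Output has the same height/width as the input image g (assumes odd filter size)."""
    H, W = g.shape
    kH, kW = f.shape
    pH, pW = kH // 2, kW // 2
    gp = np.pad(g, ((pH, pH), (pW, pW)))          # zero padding around the image
    h = np.zeros_like(g, dtype=float)
    for i in range(H):
        for k in range(W):
            for j in range(kH):
                for m in range(kW):
                    # g(i - j, k - m) looked up in padded coordinates
                    h[i, k] += f[j, m] * gp[i - j + 2 * pH, k - m + 2 * pW]
    return h

g = np.arange(25, dtype=float).reshape(5, 5)      # a toy 5x5 "image"
f = np.array([[0.0,  1.0, 0.0],
              [1.0, -4.0, 1.0],
              [0.0,  1.0, 0.0]])                  # a 3x3 filter (discrete Laplacian)
print(conv2d_same(g, f))
```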

Postscript: Tangent model

▶ Let f_θ : R^d → R be a neural network function with parameters θ ∈ R^p (p is the total number of parameters)
▶ Fix x, and consider the first-order approximation of f_θ(x) around θ = θ^(0):

    f_θ(x) ≈ f_{θ^(0)}(x) + ∇f_{θ^(0)}(x)^T (θ − θ^(0))

  Here, ∇ is the gradient wrt the parameters θ, not wrt the input x
▶ Consider the feature transformation ϕ(x) := ∇f_{θ^(0)}(x), determined entirely by the initial parameters θ^(0)
▶ If the first-order approximation is accurate (i.e., we never consider θ too far from θ^(0)), then we are back to a linear model (over p features ϕ(x) ∈ R^p)
  ▶ Called the tangent model
▶ Upshot: To really exploit the power of "deep learning", we must be changing the parameters a lot during training!
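A minimal numpy sketch of the tangent model for a tiny one-hidden-layer network: ϕ(x) = ∇_θ f_{θ^(0)}(x) is approximated by finite differences (an autodiff library would normally be used), and the linear model f_{θ^(0)}(x) + ϕ(x)^T(θ − θ^(0)) is compared with the true f_θ(x) for a small parameter perturbation. All shapes, values, and the tanh activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 5                                  # input dim, hidden width (illustrative)

def f(theta, x):
    # A tiny one-hidden-layer network; theta packs W1 (k*d entries) and w2 (k entries).
    W1 = theta[:k * d].reshape(k, d)
    w2 = theta[k * d:]
    return w2 @ np.tanh(W1 @ x)

def grad_theta(theta, x, eps=1e-6):
    # phi(x) = gradient of f wrt theta at theta, by central finite differences.
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

theta0 = rng.normal(size=k * d + k)          # initial parameters theta^(0)
x = rng.normal(size=d)
phi = grad_theta(theta0, x)                  # feature vector phi(x) in R^p

# Small parameter change: the tangent (linear) model should track the network closely.
theta = theta0 + 1e-3 * rng.normal(size=theta0.size)
tangent = f(theta0, x) + phi @ (theta - theta0)
print(f(theta, x), tangent)                  # nearly equal for small theta - theta^(0)
```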
