
Neural networks

COMS 4771 Fall 2019


Overview

▶ Structure and power of neural networks
▶ Backpropagation
▶ Practical issues
▶ Convolutions

Parametric featurizations I

▶ So far: features (x or ϕ(x)) are fixed during training
  ▶ Consider a (small) collection of feature transformations ϕ
  ▶ Select ϕ via cross-validation – outside of normal training
▶ "Deep learning" approach:
  ▶ Use ϕ with many tunable parameters
  ▶ Optimize parameters of ϕ during normal training process


Parametric featurizations II

▶ Neural network: parameterization for function f : R^d → R
  ▶ f(x) = ϕ(x)^T w
  ▶ Parameters include both w and parameters of ϕ
▶ Varying parameters of ϕ allows f to be essentially any function!
▶ Major challenge: optimization (a lot of tricks to make it work)

Figure 1: Neural network
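To make the "tunable featurization" idea concrete, here is a minimal numpy sketch; the shapes and the choice ϕ(x) = tanh(W1 x) are illustrative assumptions, not from the slides. Both the parameters of ϕ and the final weight vector w would be adjusted during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 10                      # input dimension, number of learned features

# Parameters of the featurization ϕ and of the final linear map.
W1 = rng.normal(size=(k, d))      # parameters of ϕ (tunable)
w  = rng.normal(size=k)           # linear weights on top of ϕ(x) (tunable)

def phi(x):
    # A parametric feature map; tanh is just one illustrative choice.
    return np.tanh(W1 @ x)

def f(x):
    # f(x) = ϕ(x)^T w, as on the slide; training adjusts both W1 and w.
    return phi(x) @ w

x = rng.normal(size=d)
print(f(x))
```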
Feedforward neural network

▶ Architecture of a feedforward neural network:
  ▶ Directed acyclic graph G = (V, E)
  ▶ One source node (vertex) per input, one sink node per output
  ▶ Other nodes are hidden units
▶ Each edge (u, v) ∈ E has a weight parameter w_{u,v} ∈ R
▶ Value h_v of node v given values of its parents is

    h_v := σ_v(z_v),   z_v := Σ_{u ∈ V : (u,v) ∈ E} w_{u,v} · h_u.

▶ σ_v : R → R is the activation function (e.g., sigmoid)

Figure 2: Feedforward neural network architecture (an edge (u, v) with weight w_{u,v})


Standard layered architectures

▶ Standard architecture arranges nodes into a sequence of L layers
  ▶ Edges only go from one layer to the next
▶ Can write the function using matrices of weight parameters:

    f(x) = σ_L(W^L σ_{L−1}(··· σ_1(W^1 x) ···))

  ▶ d_ℓ nodes in layer ℓ; W^ℓ ∈ R^{d_ℓ × d_{ℓ−1}} are weight parameters
  ▶ Activation function σ_ℓ : R → R is applied coordinate-wise to its input
▶ Often also include "bias" parameters b^ℓ ∈ R^{d_ℓ}:

    f(x) = σ_L(b^L + W^L σ_{L−1}(··· σ_1(b^1 + W^1 x) ···))

▶ Tunable parameters: θ = (W^1, b^1, ..., W^L, b^L)

Figure 3: Standard feedforward architecture (weight matrices W^1, W^2, W^3; input, hidden units, output)
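A minimal numpy sketch of the layered forward pass f(x) = σ_L(b^L + W^L σ_{L−1}(··· σ_1(b^1 + W^1 x) ···)); the layer widths, ReLU hidden activations, and identity output activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 1]             # d_0 (input), d_1, d_2, d_3 (output); illustrative

# Weight matrices W^l in R^{d_l x d_{l-1}} and biases b^l in R^{d_l}.
Ws = [rng.normal(size=(widths[l], widths[l - 1])) for l in range(1, len(widths))]
bs = [np.zeros(widths[l]) for l in range(1, len(widths))]

def forward(x, Ws, bs):
    """Compute f(x) = sigma_L(b^L + W^L sigma_{L-1}(... sigma_1(b^1 + W^1 x) ...))."""
    h = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = b + W @ h
        # ReLU on hidden layers, identity on the output layer (an assumption).
        h = np.maximum(z, 0.0) if l < len(Ws) - 1 else z
    return h

x = rng.normal(size=widths[0])
print(forward(x, Ws, bs))
```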

Well-known activation functions

▶ Heaviside: σ(z) = 1{z ≥ 0}
  ▶ Popular in the 1940s; also called the step function
▶ Sigmoid (from logistic regression): σ(z) = 1/(1 + e^{−z})
  ▶ Popular since the 1970s
▶ Hyperbolic tangent: σ(z) = tanh(z)
  ▶ Similar to sigmoid, but range is (−1, 1) rather than (0, 1)
▶ Rectified Linear Unit (ReLU): σ(z) = max{0, z}
  ▶ Popular since 2012
▶ Identity: σ(z) = z
  ▶ Popular with luddites
▶ Softmax: σ(v)_i = exp(v_i) / Σ_j exp(v_j)
  ▶ Special vector-valued activation function
  ▶ Popular for the final layer in multi-class classification


Power of non-linear activations

▶ What happens if every activation function is linear/affine?
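The activation functions listed above are easy to write down directly; a minimal numpy sketch (the softmax subtracts the maximum before exponentiating, a standard numerical-stability trick not mentioned on the slide).

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(float)          # step function, 1{z >= 0}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1)

def relu(z):
    return np.maximum(z, 0.0)              # max{0, z}

def identity(z):
    return z

def softmax(v):
    # Vector-valued: softmax(v)_i = exp(v_i) / sum_j exp(v_j).
    e = np.exp(v - np.max(v))              # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(heaviside(z), sigmoid(z), np.tanh(z), relu(z), softmax(z), sep="\n")
```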
Necessity of multiple layers

▶ Suppose we only have input and output layers, so the function f is

    f(x) = σ(b + w^T x)

  where b ∈ R and w ∈ R^d (so w^T ∈ R^{1×d})
▶ If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU, identity), then f has the same limitations as a linear/affine classifier

Figure 4: XOR problem


Neural network approximation theorems

▶ Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ_1 be any continuous non-linear activation function from above. For any continuous function f : R^d → R and any ε > 0, there is a two-layer neural network (with parameters θ = (W^1, b^1, w^2)) s.t.

    max_{x ∈ [0,1]^d} |f(x) − (w^2)^T σ_1(b^1 + W^1 x)| < ε.

▶ Many caveats:
  ▶ "Width" (number of hidden units) may need to be very large
  ▶ Does not tell us how to find the network
  ▶ Does not justify deeper networks
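As a concrete illustration of why a hidden layer helps with the XOR problem (Figure 4), here is a small two-layer ReLU network with hand-picked weights that computes XOR exactly on {0,1}^2. The particular weights are my own choice for illustration, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Output: h1 - 2*h2, which equals XOR(x1, x2) on {0,1}^2.
w2 = np.array([1.0, -2.0])

def f(x):
    return w2 @ relu(b1 + W1 @ x)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float)))   # prints 0, 1, 1, 0
```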

Fitting neural networks to data

▶ Training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y
▶ Fix architecture: G = (V, E) and activation functions
▶ Plug-in principle: find parameters θ of the neural network f_θ to minimize empirical risk (possibly with a surrogate loss):

    R̂(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2           (regression)

    R̂(θ) = (1/n) Σ_{i=1}^n ℓ_log(−y_i f_θ(x_i))          (binary classification)

    R̂(θ) = (1/n) Σ_{i=1}^n ℓ_ce(ỹ_i, f_θ(x_i))           (multi-class classification)

  (Could use other surrogate loss functions ...)
▶ Typically the objective is not convex in the parameters θ
▶ Nevertheless, local search (e.g., SGD) often works well!


Backpropagation

▶ Backpropagation (backprop): algorithm for computing partial derivatives wrt the weights in a feedforward neural network
  ▶ Clever organization of partial derivative computations with the chain rule
  ▶ Use in combination with gradient descent, SGD, etc.
▶ Consider the loss on a single example (x, y), written J := ℓ(y, f_θ(x))
▶ Goal: compute ∂J/∂w_{u,v} for every edge (u, v) ∈ E
▶ Initial step of backprop: forward propagation
  ▶ Compute z_v's and h_v's for every node v ∈ V
  ▶ Running time: linear in size of network
▶ Rest of backprop also requires only time linear in size of network!
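A minimal numpy sketch of the three empirical risks above, given the network's predictions on the training data. Here ℓ_log(z) = log(1 + e^z) and ℓ_ce is cross-entropy against the true class index; both readings of the notation are my assumptions.

```python
import numpy as np

def risk_squared(preds, ys):
    # (1/n) sum_i (f(x_i) - y_i)^2, for regression.
    return np.mean((preds - ys) ** 2)

def risk_logistic(preds, ys):
    # (1/n) sum_i l_log(-y_i f(x_i)) with l_log(z) = log(1 + e^z), labels y_i in {-1, +1}.
    return np.mean(np.log1p(np.exp(-ys * preds)))

def risk_cross_entropy(probs, ys):
    # (1/n) sum_i l_ce(y_i, f(x_i)): probs[i] is a softmax output, ys[i] the true class index.
    n = len(ys)
    return -np.mean(np.log(probs[np.arange(n), ys]))

preds = np.array([0.9, -1.2, 2.0])
print(risk_squared(preds, np.array([1.0, -1.0, 1.5])))
print(risk_logistic(preds, np.array([1, -1, 1])))
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(risk_cross_entropy(probs, np.array([0, 1])))
```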
Derivative of loss with respect to weights

▶ Let ŷ_1, ŷ_2, ... denote the values at the output nodes.
▶ Then by chain rule,

    ∂J/∂w_{u,v} = Σ_i (∂J/∂ŷ_i) · (∂ŷ_i/∂w_{u,v}).


Derivative of output with respect to weights

▶ Assume for simplicity there is just a single output, ŷ
▶ Chain rule, again:

    ∂ŷ/∂w_{u,v} = (∂ŷ/∂h_v) · (∂h_v/∂w_{u,v})

▶ First term: trickier; we'll handle it later
▶ Second term: from the definitions h_v = σ_v(z_v) and z_v = Σ_u w_{u,v} h_u,

    ∂h_v/∂w_{u,v} = σ_v'(z_v) · h_u

(Figure: an edge (u, v) with weight w_{u,v})
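Chain-rule bookkeeping like this is easy to get wrong, so a common sanity check is to compare a hand-derived partial derivative against a finite-difference approximation. A minimal sketch of such a checker; the example function, weights, and step size are arbitrary choices of mine.

```python
import numpy as np

def numerical_grad(J, w, eps=1e-6):
    """Central finite-difference approximation of dJ/dw_k for each parameter w_k."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (J(w + e) - J(w - e)) / (2 * eps)
    return g

# Tiny example: J(w) = (sigma(w1*x1 + w2*x2) - y)^2 with sigma = tanh.
x, y = np.array([1.0, -2.0]), 0.5

def J(w):
    return (np.tanh(w @ x) - y) ** 2

w = np.array([0.3, 0.1])
# Hand-derived gradient via the chain rule: 2*(sigma(z) - y) * sigma'(z) * x, z = w.x.
z = w @ x
analytic = 2 * (np.tanh(z) - y) * (1 - np.tanh(z) ** 2) * x
print(np.allclose(analytic, numerical_grad(J, w)))   # should print True
```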

Derivative of output with respect to hidden units

▶ Compute ∂ŷ/∂h_v for all vertices in decreasing order of layer number
▶ If v is not the output node, then by chain rule (yet again),

    ∂ŷ/∂h_v = Σ_{v' : (v,v') ∈ E} (∂ŷ/∂h_{v'}) · (∂h_{v'}/∂h_v)

(Figure: node v with out-neighbors v'_1, ..., v'_k and edge weights w_{v,v'_1}, ..., w_{v,v'_k})


Example: chain graph I

Figure 5: Chain graph with nodes 0 (input), 1, ..., L (output) and edge weights w_{0,1}, w_{1,2}, ..., w_{L−1,L}; assume the same activation σ in every layer

▶ Parameters θ = (w_{0,1}, w_{1,2}, ..., w_{L−1,L})
▶ Fix input value x ∈ R; what is ∂h_L/∂w_{i−1,i} for i = 1, ..., L?
▶ Forward propagation:
  ▶ h_0 := x
  ▶ For i = 1, 2, ..., L:

      z_i := w_{i−1,i} h_{i−1}
      h_i := σ(z_i)
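A minimal numpy sketch of forward propagation on the chain graph; the choice σ = tanh and the random weights are just for illustration.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)               # illustrative choice of activation

def forward_chain(x, w):
    """Forward propagation: h_0 = x; z_i = w[i]*h_{i-1}, h_i = sigma(z_i) for i = 1..L.
    w[i] plays the role of w_{i-1,i}; index 0 of w is unused, for readability."""
    L = len(w) - 1
    z, h = np.zeros(L + 1), np.zeros(L + 1)
    h[0] = x
    for i in range(1, L + 1):
        z[i] = w[i] * h[i - 1]
        h[i] = sigma(z[i])
    return z, h

rng = np.random.default_rng(0)
L = 4
w = np.concatenate(([0.0], rng.normal(size=L)))   # w[1..L] = w_{0,1}, ..., w_{L-1,L}
z, h = forward_chain(1.5, w)
print(h[L])                                        # network output h_L
```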
Example: chain graph II

Figure 6: Chain graph with nodes 0 (input), 1, ..., L (output) and edge weights w_{0,1}, w_{1,2}, ..., w_{L−1,L}; assume the same activation σ in every layer

▶ Backprop:
  ▶ For i = L, L−1, ..., 1:

      ∂h_L/∂h_i := 1                                       if i = L
      ∂h_L/∂h_i := (∂h_L/∂h_{i+1}) · σ'(z_{i+1}) w_{i,i+1}   if i < L

      ∂h_L/∂w_{i−1,i} := (∂h_L/∂h_i) · σ'(z_i) h_{i−1}


Practical issues I: Initialization

▶ Ensure inputs are standardized: every feature has zero mean and unit variance (wrt training data)
▶ Initialize weights randomly for gradient descent / SGD
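Returning to the chain-graph example: a minimal numpy sketch of the backward pass from the "Example: chain graph II" slide, with the forward pass repeated so the block is self-contained (σ = tanh, so σ'(z) = 1 − tanh(z)^2, an illustrative choice). The last lines check one derivative against finite differences.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2    # derivative of tanh

def forward_chain(x, w):
    L = len(w) - 1
    z, h = np.zeros(L + 1), np.zeros(L + 1)
    h[0] = x
    for i in range(1, L + 1):
        z[i] = w[i] * h[i - 1]      # z_i = w_{i-1,i} * h_{i-1}
        h[i] = sigma(z[i])
    return z, h

def backprop_chain(x, w):
    """Compute dh_L/dw_{i-1,i} for i = 1..L via the backward recursion."""
    L = len(w) - 1
    z, h = forward_chain(x, w)
    dh = np.zeros(L + 1)            # dh[i] = dh_L/dh_i
    dw = np.zeros(L + 1)            # dw[i] = dh_L/dw_{i-1,i}
    dh[L] = 1.0
    for i in range(L, 0, -1):
        if i < L:
            dh[i] = dh[i + 1] * sigma_prime(z[i + 1]) * w[i + 1]
        dw[i] = dh[i] * sigma_prime(z[i]) * h[i - 1]
    return dw

rng = np.random.default_rng(0)
w = np.concatenate(([0.0], rng.normal(size=4)))
x = 1.5
dw = backprop_chain(x, w)

# Finite-difference check of dh_L/dw_{0,1}.
eps = 1e-6
wp, wm = w.copy(), w.copy()
wp[1] += eps; wm[1] -= eps
fd = (forward_chain(x, wp)[1][-1] - forward_chain(x, wm)[1][-1]) / (2 * eps)
print(np.isclose(dw[1], fd))        # should print True
```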

Practical issues II: Architecture choice

▶ Architecture can be regarded as a "hyperparameter"
▶ Optimization-inspired architecture choice:
  ▶ With a wide enough network, can get training error rate zero
  ▶ Use the smallest network that lets you get zero training error rate
  ▶ Then add a regularization term to the objective (e.g., sum of squares of weights), and optimize the regularized ERM objective
▶ Entire research communities are trying to figure out good architectures for their problems


Convolutional nets

▶ Neural networks with convolutional layers
▶ Useful when inputs have locality structure
  ▶ Sequential structure (e.g., speech waveform)
  ▶ 2D grid structure (e.g., image)
  ▶ ...
▶ Weight matrix W^ℓ is highly structured
  ▶ Determined by a small filter
  ▶ Time to compute W^ℓ h^{ℓ−1} is typically ≪ d_ℓ × d_{ℓ−1} (e.g., closer to max{d_ℓ, d_{ℓ−1}})
Convolutions I

▶ Convolution of two continuous functions: h := f ∗ g

    h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy

▶ If f(x) = 0 for x ∉ [−w, +w], then

    h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy

▶ Replaces g(x) with a weighted combination of g at nearby points
▶ For functions on a discrete domain, replace the integral with a sum:

    h(i) = Σ_{j=−∞}^{∞} f(j) g(i − j)


Convolutions II

▶ E.g., suppose only f(0), f(1), f(2) are non-zero, and g is zero-padded (in this case, g(i) = 0 for i < 1 or i > 5). Then:

    [ h(1) ]   [ f(0)   0     0     0     0   ]
    [ h(2) ]   [ f(1)  f(0)   0     0     0   ]   [ g(1) ]
    [ h(3) ]   [ f(2)  f(1)  f(0)   0     0   ]   [ g(2) ]
    [ h(4) ] = [  0    f(2)  f(1)  f(0)   0   ] · [ g(3) ]
    [ h(5) ]   [  0     0    f(2)  f(1)  f(0) ]   [ g(4) ]
    [ h(6) ]   [  0     0     0    f(2)  f(1) ]   [ g(5) ]
    [ h(7) ]   [  0     0     0     0    f(2) ]

Figure 7: Convolutional layer
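A minimal numpy sketch checking the picture above: build the 7×5 structured matrix whose entries are values of f, and verify that multiplying it by (g(1), ..., g(5)) matches a direct discrete convolution (np.convolve with its default 'full' mode). The particular values of f and g are arbitrary.

```python
import numpy as np

f = np.array([2.0, -1.0, 0.5])                # f(0), f(1), f(2); zero elsewhere
g = np.array([1.0, 3.0, 0.0, -2.0, 4.0])      # g(1), ..., g(5); zero-padded elsewhere

# Matrix from the slide: row i, column m holds f(i - m) (zero if out of range),
# so that (M @ g)[i] = sum_j f(j) g(i - j) = h(i + 1).
n_out, n_in = len(f) + len(g) - 1, len(g)
M = np.zeros((n_out, n_in))
for i in range(n_out):
    for m in range(n_in):
        if 0 <= i - m < len(f):
            M[i, m] = f[i - m]

h_matrix = M @ g                       # h(1), ..., h(7) via the structured matrix
h_direct = np.convolve(f, g)           # the same convolution computed directly
print(np.allclose(h_matrix, h_direct)) # should print True
```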

2D convolutions I–IV

▶ Similar for 2D inputs (e.g., images), except now sum over two indices
  ▶ g is the input image
  ▶ f is the filter
▶ Lots of variations (e.g., padding, strides, multiple "channels")

Figures 8–11: 2D convolution, with padding, no stride

▶ Use additional layers/activations to down-sample after convolution
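A minimal sketch of a 2D convolution with zero padding and no stride, written with explicit loops so the "sum over two indices" is visible; the image, filter size, and filter values are arbitrary. (Like the 1D formula above, this flips the filter; many libraries implement the unflipped cross-correlation and still call it "convolution".)

```python
import numpy as np

def conv2d_same(g, f):
    """2D convolution h(i,k) = sum_{j,m} f(j,m) g(i-j, k-m), zero padding, no stride.
    Output has the same height/width as the input image g (assumes odd filter size)."""
    H, W = g.shape
    kH, kW = f.shape
    pH, pW = kH // 2, kW // 2
    gp = np.pad(g, ((pH, pH), (pW, pW)))          # zero padding around the image
    h = np.zeros_like(g, dtype=float)
    for i in range(H):
        for k in range(W):
            for j in range(kH):
                for m in range(kW):
                    # g(i - j, k - m) looked up in padded coordinates
                    h[i, k] += f[j, m] * gp[i - j + 2 * pH, k - m + 2 * pW]
    return h

g = np.arange(25, dtype=float).reshape(5, 5)      # a toy 5x5 "image"
f = np.array([[0.0,  1.0, 0.0],
              [1.0, -4.0, 1.0],
              [0.0,  1.0, 0.0]])                  # a 3x3 filter (discrete Laplacian)
print(conv2d_same(g, f))
```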

Postscript: Tangent model

▶ Let f_θ : R^d → R be a neural network function with parameters θ ∈ R^p (p is the total number of parameters)
▶ Fix x, and consider the first-order approximation of f_θ(x) around θ = θ^(0):

    f_θ(x) ≈ f_{θ^(0)}(x) + ∇f_{θ^(0)}(x)^T (θ − θ^(0))

  Here, ∇ is the gradient wrt the parameters θ, not wrt the input x
▶ Consider the feature transformation ϕ(x) := ∇f_{θ^(0)}(x), determined entirely by the initial parameters θ^(0)
▶ If the first-order approximation is accurate (i.e., we never consider θ too far from θ^(0)), then we are back to a linear model (over p features ϕ(x) ∈ R^p)
  ▶ Called the tangent model
▶ Upshot: To really exploit the power of "deep learning", we must be changing the parameters a lot during training!
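A minimal numpy sketch of the tangent model for a tiny one-hidden-layer network: ϕ(x) = ∇_θ f_{θ^(0)}(x) is approximated by finite differences (an autodiff library would normally be used), and the linear model f_{θ^(0)}(x) + ϕ(x)^T(θ − θ^(0)) is compared with the true f_θ(x) for a small parameter perturbation. All shapes, values, and the tanh activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 5                                  # input dim, hidden width (illustrative)

def f(theta, x):
    # A tiny one-hidden-layer network; theta packs W1 (k*d entries) and w2 (k entries).
    W1 = theta[:k * d].reshape(k, d)
    w2 = theta[k * d:]
    return w2 @ np.tanh(W1 @ x)

def grad_theta(theta, x, eps=1e-6):
    # phi(x) = gradient of f wrt theta at theta, by central finite differences.
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

theta0 = rng.normal(size=k * d + k)          # initial parameters theta^(0)
x = rng.normal(size=d)
phi = grad_theta(theta0, x)                  # feature vector phi(x) in R^p

# Small parameter change: the tangent (linear) model should track the network closely.
theta = theta0 + 1e-3 * rng.normal(size=theta0.size)
tangent = f(theta0, x) + phi @ (theta - theta0)
print(f(theta, x), tangent)                  # nearly equal for small theta - theta^(0)
```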
