Biplab Banerjee
- Reducing the number of layers below the minimum requires an exponentially larger network to express the same function
- A network with fewer than the minimum required number of neurons cannot model the function at all
Linear Softmax
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c\ f_d\ f_b]$, $b = [b_c\ b_d\ b_b]$
$f_c = e^{g_c}/(e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d}/(e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b}/(e^{g_c} + e^{g_d} + e^{g_b})$
Compactly: $f = \mathrm{softmax}(g) = \mathrm{softmax}(w x^T + b^T)$
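As a sketch (not from the slides), the linear-softmax classifier above fits in a few lines of NumPy; the input values and weight initialization here are made up for illustration:

```python
import numpy as np

def softmax(g):
    # subtract the max score for numerical stability
    e = np.exp(g - g.max())
    return e / e.sum()

x = np.array([0.2, 1.0, -0.5, 0.3])          # x_i: 4 input features
w = np.random.default_rng(0).normal(0, 0.1, (3, 4))  # one weight row per class
b = np.zeros(3)                              # b = [b_c, b_d, b_b]

g = w @ x + b   # class scores g_c, g_d, g_b
f = softmax(g)  # predicted probabilities [f_c, f_d, f_b], sums to 1
```

The exponentials guarantee positive outputs, and the shared denominator normalizes them into a probability distribution.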
Two-layer MLP + Softmax
$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$f = \mathrm{softmax}(w_{[2]} a_1^T + b_{[2]}^T)$
N-layer MLP + Softmax
$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1^T + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1}^T + b_{[k]}^T)$
$f = \mathrm{softmax}(w_{[n]} a_{n-1}^T + b_{[n]}^T)$
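The layer-by-layer recurrence above is just a loop. A minimal NumPy sketch (the layer sizes and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

def forward(x, weights, biases):
    # hidden layers apply sigmoid; the final layer applies softmax
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(w @ a + b)
    return softmax(weights[-1] @ a + biases[-1])

# hypothetical 4 -> 5 -> 3 network
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (5, 4)), rng.normal(0, 0.1, (3, 5))]
biases = [np.zeros(5), np.zeros(3)]
f = forward(np.array([0.2, 1.0, -0.5, 0.3]), weights, biases)
```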
Why is non-linearity important?
How to train the parameters?
$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1^T + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1}^T + b_{[k]}^T)$
$f = \mathrm{softmax}(w_{[n]} a_{n-1}^T + b_{[n]}^T)$
$l = \mathrm{loss}(f, y)$
We can still use SGD. We need the gradients $\partial l / \partial w_{[k]ij}$ and $\partial l / \partial b_{[k]i}$.
Backpropagation – repeated application of the chain rule
Two-layer Neural Network – Forward Pass
Two-layer Neural Network – Backward Pass
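A minimal sketch of both passes for a two-layer network with a sigmoid hidden layer, softmax output, and cross-entropy loss (the sizes and inputs below are made up; the combined softmax-plus-cross-entropy gradient simplifies to $f - y$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

def forward_backward(x, y, w1, b1, w2, b2):
    # forward pass
    a1 = sigmoid(w1 @ x + b1)
    f = softmax(w2 @ a1 + b2)
    loss = -np.sum(y * np.log(f))      # cross-entropy
    # backward pass: chain rule, layer by layer
    dg2 = f - y                        # dloss/dscores for softmax + CE
    dw2, db2 = np.outer(dg2, a1), dg2
    da1 = w2.T @ dg2
    dz1 = da1 * a1 * (1 - a1)          # sigmoid derivative a1(1-a1)
    dw1, db1 = np.outer(dz1, x), dz1
    return loss, (dw1, db1, dw2, db2)

# hypothetical small network: 4 inputs -> 5 hidden -> 3 classes
rng = np.random.default_rng(0)
w1, b1 = rng.normal(0, 0.5, (5, 4)), np.zeros(5)
w2, b2 = rng.normal(0, 0.5, (3, 5)), np.zeros(3)
x = np.array([0.5, -0.2, 0.1, 0.8])
y = np.array([1.0, 0.0, 0.0])
loss, grads = forward_backward(x, y, w1, b1, w2, b2)
```

An SGD step then subtracts a learning rate times each gradient from the corresponding parameter.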
Basic building blocks of the CNN architecture
• Input layer
• Convolutional layer
  • Convolutional kernel
  • Pooling layer
  • Non-linearity
• Fully connected layer
• Loss layer
Convolution operation
The same pattern appears in different places, so its detectors can be shared ("compressed"): instead of training many such "small" detectors, one detector is trained and "moved around" the image.
(Figure: an "upper-left beak" detector and a "middle beak" detector responding to the same pattern at different locations.)
Convolution v.s. Fully Connected

6×6 binary image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:        Filter 2:
 1 -1 -1         -1  1 -1
-1  1 -1         -1  1 -1
-1 -1  1         -1  1 -1

Convolution slides each 3×3 filter over the image. A fully-connected layer would instead flatten the image into inputs x1, x2, …, x36 and connect every pixel to every neuron.
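Under the hood, convolution is a sliding dot product. A NumPy sketch using the 6×6 image and the diagonal-detecting Filter 1 from this slide:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # sliding dot product
    return out

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])  # responds to a main-diagonal pattern
out = conv2d(image, filter1)  # 4x4 map; high values where the diagonal appears
```

The top-left and bottom-left windows both contain the diagonal pattern, so both score the maximum value 3, while the same 9 filter weights are reused at every position.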
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224 → Output: 4×224×224 (with zero padding, stride = 1)
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224 → Output: 4×112×112 (with zero padding, stride = 2)
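The output sizes on these two slides follow the standard formula out = ⌊(in + 2·pad − k) / stride⌋ + 1; for the 9×9 kernel, "zero padding" here amounts to pad = 4:

```python
def conv_output_size(n, k, pad, stride):
    # standard convolution output-size formula
    return (n + 2 * pad - k) // stride + 1

s1 = conv_output_size(224, 9, 4, 1)  # stride 1 preserves the 224 resolution
s2 = conv_output_size(224, 9, 4, 2)  # stride 2 halves it to 112
```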
Color image: RGB 3 channels – convolution over depth
Each filter now spans all 3 channels, i.e. Filter 1, Filter 2, Filter 3, … are each 3×3×3 tensors, producing one output value per position by summing over width, height, and depth.
(Figure: the 6×6 image repeated across the R, G, and B channels, with 3-channel versions of Filter 1 and Filter 2.)
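A sketch of convolution over depth: one filter spans all input channels and the products are summed over width, height, and channels, so each filter still produces a single-channel output map (the array shapes below are illustrative):

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    # image: (C, H, W); kernel: (C, kh, kw)
    # each output position sums over the full C x kh x kw block
    c, kh, kw = kernel.shape
    oh = image.shape[1] - kh + 1
    ow = image.shape[2] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * kernel)
    return out

# all-ones sanity input: every 3x3x3 window sums to 27
out = conv2d_multichannel(np.ones((3, 6, 6)), np.ones((3, 3, 3)))
```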
Different types of convolution
Parameters:
✓ Kernel size
✓ Stride
✓ Padding
Convolution along depth
Depthwise separable convolution – step 2
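The point of the two steps — a depthwise convolution per channel (step 1) followed by a 1×1 pointwise convolution mixing channels (step 2) — is a much smaller parameter count. A back-of-the-envelope comparison (ignoring biases; the channel and kernel sizes are illustrative):

```python
def standard_conv_params(c_in, c_out, k):
    # every output channel has its own c_in x k x k filter
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # step 1: one k x k filter per input channel (depthwise)
    # step 2: 1x1 pointwise convolution mixing c_in channels into c_out
    return c_in * k * k + c_in * c_out

std = standard_conv_params(3, 64, 3)       # 1728 weights
sep = depthwise_separable_params(3, 64, 3)  # 219 weights
```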
Convolution → Max Pooling (can repeat many times) → Flatten → Fully Connected Feedforward network
Pooling
• Down-samples the feature maps, controlling the number of parameters of the CNN model
Why Pooling
• Subsampling pixels will not change the object
(Figure: a bird image and its subsampled version — both still show a bird.)
Max Pooling
(Figure: each 2×2 block of the convolved feature map is replaced by its maximum value; the pooled output is then flattened and fed to a fully connected feedforward network.)
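A sketch of 2×2 max pooling in NumPy, using a reshape trick instead of explicit loops (the input values are arbitrary):

```python
import numpy as np

def max_pool2d(x, size=2):
    # crop so height/width divide evenly, then reshape into
    # (blocks, size, blocks, size) and take the maximum of each block
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]
    return x.reshape(x.shape[0] // size, size,
                     x.shape[1] // size, size).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
pooled = max_pool2d(x)  # 2x2 map holding each block's maximum
```

Each output value keeps only the strongest filter response in its region, which is why subsampling preserves the detected pattern while shrinking the map.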
Conv Net Topology
• 5 convolutional layers
• 3 fully connected layers + soft-max
• 650K neurons, 60 million weights
Why do we need a deep CNN?
Courtesy: ICRI
Suggested reading