
Neural networks and CNN

Biplab Banerjee

Thanks to towardsdatascience, Vicente Ordonez


Perceptron Model
Frank Rosenblatt (1957) - Cornell University
Activation Functions
• Step(x)
• Sigmoid(x)
• Tanh(x)
• ReLU(x) = max(0, x)
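As a minimal NumPy sketch of these activation functions (the function names and implementations below are illustrative, not taken from the slides):

```python
import numpy as np

def step(x):
    # Heaviside step: 1 where x >= 0, else 0
    return (x >= 0).astype(float)

def sigmoid(x):
    # logistic function, squashes inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # hyperbolic tangent, squashes inputs to (-1, 1)
    return np.tanh(x)

def relu(x):
    # rectified linear unit: max(0, x) element-wise
    return np.maximum(0.0, x)
```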


Two-layer Multi-layer Perceptron (MLP)
• one "hidden" layer
• Loss / Criterion

[Diagram: inputs x1, x2, x3, x4 feed hidden activations a1, a2, a3, a4, which are summed into the prediction ŷ1 and compared against the target y1 by the loss]

- Reducing the number of layers below the minimum requires an exponentially larger network to express the function fully
- A network with fewer than the minimum required number of neurons cannot model the function
Linear Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$

$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Linear Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$

$w = \begin{bmatrix} w_{c1} & w_{c2} & w_{c3} & w_{c4} \\ w_{d1} & w_{d2} & w_{d3} & w_{d4} \\ w_{b1} & w_{b2} & w_{b3} & w_{b4} \end{bmatrix}$,  $b = [b_c\ b_d\ b_b]$

$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Linear Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$g = w x^T + b^T$, where
$w = \begin{bmatrix} w_{c1} & w_{c2} & w_{c3} & w_{c4} \\ w_{d1} & w_{d2} & w_{d3} & w_{d4} \\ w_{b1} & w_{b2} & w_{b3} & w_{b4} \end{bmatrix}$,  $b = [b_c\ b_d\ b_b]$

$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Linear Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$g = w x^T + b^T$, where
$w = \begin{bmatrix} w_{c1} & w_{c2} & w_{c3} & w_{c4} \\ w_{d1} & w_{d2} & w_{d3} & w_{d4} \\ w_{b1} & w_{b2} & w_{b3} & w_{b4} \end{bmatrix}$,  $b = [b_c\ b_d\ b_b]$

$f = \mathrm{softmax}(g)$
Linear Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$f = \mathrm{softmax}(w x^T + b^T)$
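As a compact illustration, here is a NumPy sketch of this linear-softmax prediction; the shapes and variable names are assumptions chosen to match the slide's 4 inputs and 3 classes (c, d, b):

```python
import numpy as np

x = np.array([0.2, 1.0, -0.5, 0.3])   # x_i, shape (4,)
w = np.random.randn(3, 4) * 0.01      # weight matrix, one row per class
b = np.zeros(3)                       # bias vector [b_c, b_d, b_b]

def softmax(g):
    g = g - g.max()                   # subtract the max for numerical stability
    e = np.exp(g)
    return e / e.sum()

g = w @ x + b                         # g = w x^T + b^T
f = softmax(g)                        # predicted class probabilities, sums to 1
```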
Two-layer MLP + Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$f = \mathrm{softmax}(w_{[2]} a_1^T + b_{[2]}^T)$
N-layer MLP + Softmax

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1^T + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1}^T + b_{[k]}^T)$
…
$f = \mathrm{softmax}(w_{[n]} a_{n-1}^T + b_{[n]}^T)$
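A minimal NumPy sketch of this N-layer forward pass, assuming per-layer parameter lists (the helper names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, weights, biases):
    # weights, biases: lists of per-layer parameters w_[1..n], b_[1..n]
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(w @ a + b)                    # hidden layers use sigmoid
    return softmax(weights[-1] @ a + biases[-1])  # last layer feeds the softmax
```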
Why is non-linearity important?
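Without a non-linearity, a stack of linear layers collapses into a single linear map, so extra depth adds no expressive power. A small NumPy check of this fact (the matrices here are arbitrary illustrations):

```python
import numpy as np

w1 = np.random.randn(5, 4)
w2 = np.random.randn(3, 5)
x = np.random.randn(4)

two_linear_layers = w2 @ (w1 @ x)    # two linear layers, no activation in between
single_layer = (w2 @ w1) @ x         # one equivalent linear layer
print(np.allclose(two_linear_layers, single_layer))  # True
```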
How to train the parameters?

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1^T + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1}^T + b_{[k]}^T)$
…
$f = \mathrm{softmax}(w_{[n]} a_{n-1}^T + b_{[n]}^T)$
How to train the parameters?

$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$,  $y_i = [1\ 0\ 0]$,  $\hat{y}_i = [f_c\ f_d\ f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1^T + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1}^T + b_{[k]}^T)$
…
$f = \mathrm{softmax}(w_{[n]} a_{n-1}^T + b_{[n]}^T)$

$l = \mathrm{loss}(f, y)$

We can still use SGD. We need the gradients $\dfrac{\partial l}{\partial w_{[k]ij}}$ and $\dfrac{\partial l}{\partial b_{[k]i}}$.
Backpropagation – repeated application of the chain rule
Two-layer Neural Network – Forward Pass
Two-layer Neural Network – Backward Pass
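The slides illustrate these passes with diagrams. As a rough companion, here is a NumPy sketch of the forward and backward pass for a two-layer network with a sigmoid hidden layer, softmax output, and cross-entropy loss; all names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, w1, b1, w2, b2):
    # ---- forward pass ----
    a1 = sigmoid(w1 @ x + b1)              # hidden activations
    g = w2 @ a1 + b2                       # class scores
    e = np.exp(g - g.max())
    f = e / e.sum()                        # softmax probabilities
    loss = -np.sum(y * np.log(f + 1e-12))  # cross-entropy with one-hot y

    # ---- backward pass (chain rule) ----
    dg = f - y                             # dl/dg for softmax + cross-entropy
    dw2 = np.outer(dg, a1)                 # dl/dw2
    db2 = dg
    da1 = w2.T @ dg                        # propagate back to the hidden layer
    dz1 = da1 * a1 * (1 - a1)              # multiply by the sigmoid derivative
    dw1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dw1, db1, dw2, db2)
```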
Basic building blocks of the CNN architecture
• Input layer
• Convolutional layer
• Fully connected layer
• Loss layer

• Convolutional layer
• Convolutional kernel
• Pooling layer
• Non-linearity
Convolution operation
The same pattern appears in different places, so its detectors can be compressed into shared parameters.
Rather than training a lot of such "small" detectors separately, one detector "moves around" the image.

[Figure: an "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters]
Convolution vs. Fully Connected

6×6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:        Filter 2:
 1 -1 -1         -1  1 -1
-1  1 -1         -1  1 -1
-1 -1  1         -1  1 -1

Convolution slides these small filters over the image; a fully-connected layer instead flattens the image into x1 … x36 and connects every input to every output.
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224  →  Output: 4×224×224
(with zero padding and stride = 1)
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224  →  Output: 4×112×112
(with zero padding but stride = 2)
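A quick way to check these shapes is with PyTorch; the use of nn.Conv2d here is my own illustration, with padding=4 keeping a 9×9 kernel "same"-sized at stride 1:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 224, 224)   # one 1x224x224 input

conv_s1 = nn.Conv2d(1, 4, kernel_size=9, stride=1, padding=4)
conv_s2 = nn.Conv2d(1, 4, kernel_size=9, stride=2, padding=4)

print(conv_s1(x).shape)   # torch.Size([1, 4, 224, 224])
print(conv_s2(x).shape)   # torch.Size([1, 4, 112, 112])
```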
Color image: RGB 3 channels – convolution over depth

[Figure: Filter 1, Filter 2, Filter 3, … each have 3 channel slices; every filter is convolved over the full depth (R, G, B planes) of the color image]
Different types of convolution

Parameters:
✓ Kernel stride
✓ Size
✓ Padding

Normal vs. dilated convolution (dilation width = 2)
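A hedged PyTorch illustration of a dilated convolution: dilation=2 inserts one gap between kernel taps, enlarging the receptive field without adding weights (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

normal = nn.Conv2d(1, 1, kernel_size=3, padding=1)                # 3x3 receptive field
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)   # covers a 5x5 area with 9 weights

print(normal(x).shape, dilated(x).shape)   # both torch.Size([1, 1, 32, 32])
```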
Is ReLU helpful? https://github.com/bhattbhavesh91/why-is-relu-non-linear/
Spatially Separable convolution
Depthwise separable convolution

Convolving 256 5×5 kernels over the input volume


Depthwise separable convolution – step1

Along depth
Depthwise separable convolution – step2

Pointwise 1x1 conv
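A sketch of the two steps in PyTorch; the channel counts are illustrative assumptions, with `groups=in_channels` giving the per-channel depthwise step followed by the 1×1 pointwise step:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)   # example input volume

# step 1: depthwise – one 5x5 filter per input channel (groups = in_channels)
depthwise = nn.Conv2d(64, 64, kernel_size=5, padding=2, groups=64)

# step 2: pointwise – 1x1 convolution mixes channels to the desired output depth
pointwise = nn.Conv2d(64, 256, kernel_size=1)

out = pointwise(depthwise(x))
print(out.shape)   # torch.Size([1, 256, 56, 56])
```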


Transpose convolution
Convolution as a matrix multiplication
Many to one mapping – 9 values to 1 value
One to many mapping
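To make the many-to-one vs. one-to-many contrast concrete, a small PyTorch sketch where a strided convolution shrinks a feature map and a transpose convolution maps it back up (the shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 14, 14)

# ordinary convolution: many-to-one, here halving the spatial size
down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
# transpose convolution: one-to-many, doubling it back
up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1, output_padding=1)

print(down(x).shape)       # torch.Size([1, 16, 7, 7])
print(up(down(x)).shape)   # torch.Size([1, 16, 14, 14])
```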
The whole CNN

[Pipeline: image → Convolution → Max Pooling → (Convolution and Max Pooling can repeat many times) → Flattened → Fully Connected Feedforward network → outputs such as "cat", "dog", …]
Pooling
• Down-sample the image – this keeps the number of parameters of the CNN model under control
Why Pooling
• Subsampling pixels will not change the object (a subsampled bird is still a bird)

✓ We can subsample the pixels to make the image smaller
✓ Fewer parameters are then needed to characterize the image
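A minimal PyTorch illustration of this down-sampling, where 2×2 max pooling halves each spatial dimension (shapes chosen to match the earlier convolutional-layer example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 224, 224)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)   # torch.Size([1, 4, 112, 112])
```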
Pooling or strided convolution?
Unpool
The whole CNN

[Pipeline: image → Convolution → Max Pooling → a new (smaller) image → Convolution → Max Pooling → a new image → Flattened → Fully Connected Feedforward network → outputs such as "cat", "dog", …]
Flattening

[Figure: the small output feature maps are flattened into a single vector that feeds the Fully Connected Feedforward network]
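Putting the pieces together, a hedged PyTorch sketch of such a pipeline (conv → ReLU → pool repeated, then flatten and a fully connected classifier; all layer sizes and the two-class output are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 16x16 -> 8x8
    nn.Flatten(),                    # 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 2),        # two classes, e.g. cat vs. dog
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)   # torch.Size([1, 2])
```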
Conv Net Topology
• 5 convolutional layers
• 3 fully connected layers + soft-max
• 650K neurons, 60 million weights
Why do we need a deep CNN?

Courtesy: ICRI
Suggested reading
