
Describe what visual attributes of an object are and how they can be computed.

Features are important:

Key to recent progress in recognition: hand-designed features are used (the features are not learned).
Hand-crafted visual features include the Scale-Invariant Feature Transform (SIFT), Histograms of Oriented Gradients (HOG), Local Binary Patterns (LBP), and Textons (vector-quantized responses of a linear filter bank).
Traditional object recognition model:
Input data → feature representation (hand-crafted) → learning algorithm
Image → low-level vision features (SIFT, edges) → object detection / classification
Attributes are visual qualities of objects, such as ‘red’, ‘striped’, or ‘spotted’. Visual attributes are both semantic (human-understandable) and visual (machine-detectable). The model sees attributes as patterns of image segments that repeatedly share some characteristic properties. These can be any combination of appearance, shape, or the layout of segments within the pattern. Attributes with general appearance are also taken into account, such as the pattern of alternation of any two colors that is characteristic of stripes. To enable learning from unsegmented training images, the model is learnt discriminatively, by optimizing a likelihood ratio.
Attributes can be described as mid-level descriptions of classes or instances:
- Encoding domain knowledge (red, stripy, tall, adult)
- Semantic attribute space: project images onto a basis of human-nameable concepts used as projection axes (e.g. people in suits)
- Not a statistically orthogonal basis such as PCA
- Semantically meaningful to human interpretation
- Visual (machine-detectable)
Applications

✓ Used in person-identification
✓ Understanding and predicting memorability and aesthetics of photographs
✓ Image editing – adjusting attributes such as ‘snowy’ or ‘sunset’
A texton classifier consists of four main stages (a small sketch of stage (1) is given below):
(1) Construction of a texton codebook (convolve the image with a filter bank, construct a response vector for each pixel, and K-means cluster the responses to obtain a dictionary for each class)
(2) Construction of a training model for the texture classes: computation of a texton frequency histogram
(3) Training of the classifier based on the texton frequency histograms
(4) Classification of an image (split the image into blocks, construct a model for each, and assign a texture class based on the minimum Euclidean distance)
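A minimal Python sketch of stage (1), assuming a small Gaussian/derivative filter bank and scikit-learn's K-means; the specific filters and cluster count are illustrative assumptions, not the exact bank used in the texton literature.

# Texton codebook construction (stage 1): convolve with a filter bank, collect
# per-pixel response vectors, and cluster them with K-means.
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def texton_codebook(images, n_textons=32):
    responses = []
    for img in images:                      # img: 2-D grayscale array
        bank = [
            ndimage.gaussian_filter(img, sigma=1),
            ndimage.gaussian_filter(img, sigma=2),
            ndimage.gaussian_laplace(img, sigma=1),
            ndimage.sobel(img, axis=0),
            ndimage.sobel(img, axis=1),
        ]
        # One response vector per pixel: shape (n_pixels, n_filters)
        responses.append(np.stack([b.ravel() for b in bank], axis=1))
    all_responses = np.vstack(responses)
    # The cluster centres are the textons (the codebook / dictionary)
    return KMeans(n_clusters=n_textons, n_init=10).fit(all_responses).cluster_centers_

codebook = texton_codebook([np.random.rand(64, 64)], n_textons=8)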

Attributes can be computed (hand-crafted) using different models. Bottom-up generative models can be used, such as clustering (e.g. random forests) and feature mapping (e.g. deep learning), but neither is very discriminative. There are also top-down discriminative approaches, e.g. asking experts or using intuition, but these cannot easily guarantee that the resulting attributes are classifiable or detectable.
Given examples of person images from two different views shown below, what are the
advantages of attributes over low-level imagery feature vectors for people recognition?
Discuss the advantages of using attributes over conventional low-level features for object
recognition.

Limitations of low-level features:

1. Not all features are equal: it is difficult to determine which features are more important under which circumstances.
2. On-the-fly selection: how to selectively distribute weights to informative features given different appearance attributes.
3. Features are inherently noisy and ambiguous: how to filter out noisy and redundant features.
4. How to obtain robust features, and quantify them, so that they are transferable across domains.
Attributes advantages
A. More robust and data independent (good for sparse data)
B. Enable more “transfer learning” if learned from large independent data pool
(domain independent)
C. Readily interpretable for human interaction (query and search)

Describe a conventional deep learning process for model domain transfer. Traditional deep learning approaches to domain adaptation (transfer learning):
1. Pre-train a source-domain convolutional neural network deep model using a large set of source-domain data.
2. Re-train the last layer for the new class labels. This assumes that all previous feature layers can be re-used across domains, which may not be optimal, and it cannot handle prediction of attribute labels for which training samples are available only in the source domain.
3. Fine-tune through back-propagation at a lower learning rate. This requires a larger amount of target-domain samples. Note that after adaptation, the learned features are biased to the target domain and may not be consistent with the source domain.
These approaches rely on 2 separate stages: 1. the source domain is modelled; 2. the model is adapted to the target domain (a minimal sketch of the re-training and fine-tuning options is given below).
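A hedged PyTorch sketch of the two conventional options, assuming torchvision's ResNet-18 as the pre-trained source model and an arbitrary 10-class target task (both are assumptions for illustration only).

# Conventional transfer learning: (A) re-train only the last layer,
# (B) fine-tune the whole network at a lower learning rate.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained source-domain model

# Option A: freeze the re-used feature layers and re-train only a new last layer.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)      # new target-domain classifier
opt_last = torch.optim.SGD(model.fc.parameters(), lr=1e-2)

# Option B: fine-tune every layer at a lower learning rate
# (needs more target-domain samples; features drift towards the target domain).
for p in model.parameters():
    p.requires_grad = True
opt_finetune = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)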
Given the deep learning model for domain adaptation in the figure shown below, explain
the model design choices and differences as compared to a conventional CNN based
transfer learning model.
https://nlpers.blogspot.com/2007/11/domain-adaptation-vs-transfer-learning.html
(i) Describe how the model shown below is used for attribute domain transfer learning,
and (ii) what the roles of the “Alignment cost layer” and the “Merging layer” are. In
comparison, (iii) explain what the two options for conventional deep learning domain
transfer and their limitations are.
We have two networks:
Source domain: refers to labelled data (supervised learning is possible); a large pool of data.
Target domain: the environment where you want to apply your model (no or sparsely labelled data); a small pool.
We want the two domains to borrow information from one another and to be tied together, in order to learn the best shared representations.

We address the problem of describing people based on fine-grained clothing attributes. This is an important problem for many practical applications, such as identifying target suspects or finding missing people based on detailed clothing descriptions in surveillance videos or consumer photos. We approach this problem by first mining clothing images with fine-grained attribute labels from online shopping stores. A large-scale dataset is built with about one million images and fine-detailed attribute sub-categories, such as various shades of color (e.g., watermelon red, rosy red, purplish red), clothing types (e.g., down jacket, denim jacket), and patterns (e.g., thin horizontal stripes, houndstooth). As these images are taken in ideal pose/lighting/background conditions, it is unreliable to directly use them as training data for attribute prediction in the domain of unconstrained images captured, for example, by mobile phones or surveillance cameras. In order to bridge this gap, we propose a novel double-path deep domain adaptation network to model the data from the two domains jointly. Several alignment cost layers placed in between the two columns ensure the consistency of the two domain features and the feasibility to predict unseen attribute categories in one of the domains. Finally, to achieve a working system with automatic human body alignment, we trained an enhanced RCNN-based detector to localize human bodies in images. Our extensive experimental evaluation demonstrates the effectiveness of the proposed approach for describing people based on fine-grained clothing attributes.
The Approach:
A single model learns both the source and target domains jointly through a double-path DNN architecture (e.g. a Siamese-like network), with joint learning of domain-invariant hierarchical features. The domain information is transferred at multiple levels within intermediate layers.
The role of the alignment cost layers:
As domain adaptation is not learned at the classifier level, the domain-invariant feature is learnt directly through the hierarchical feature learning model. Alignment cost layers connect the two paths to ensure both feature alignment and attribute-label consistency of the representation across the domains.

Merging Layer:
The input of the merging layer comes from the two network paths, which are merged and share parameters in the subsequent layers. This design is used to deploy the model after the co-training. The merging operation is a simple element-wise max (not max pooling), i.e. f(Xs, Xt) = max(Xs, Xt). This layer is dropped at testing time.
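A tiny PyTorch illustration of that merging operation; the batch and feature sizes here are arbitrary placeholders.

# Element-wise max over the source-path and target-path features (not max pooling).
import torch

x_source = torch.randn(4, 256)   # assumed feature batch from the source path
x_target = torch.randn(4, 256)   # assumed feature batch from the target path
merged = torch.maximum(x_source, x_target)   # f(Xs, Xt) = max(Xs, Xt)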
Option 1: Comparison with a Siamese network: both networks have two paths and a merging step, but the functions of these two key modules are quite different. The Siamese network calculates the difference between the two input channels, and there is no back-propagation channel to constrain the mid-level representation.
Option 2: Comparison with the deep learning fine-tuning framework:
The fine-tuning framework performs fine-tuning on the whole original network without dropping the original objective function. It usually takes the output of the last layer of the network as a feature and performs additional training for the new tasks. We do not have enough diverse training samples to re-train the target-domain classifier in DDAN. As a solution, DDAN puts an additional regularization term on the adaptation process, and the learned generator has output consistent with the source-domain features, so that new attribute classifiers learned from the source domain can be applied directly to unseen labels without additional cost.
Given the AlexNet architecture shown below, state (i) which are convolutional layers, (ii) which are fully-connected layers, and (iii) what the last layer represents.
Figure 3 shows a Convolutional Neural Network (CNN) known as the AlexNet. (i) Give an overview of how this model is designed to perform the ImageNet 1,000-object classification task, and (ii) describe details of the network layers and the final network output.
Describe the AlexNet (Figure 3) design choices and the reasons for these choices on: the activation function, local response normalisation, max pooling, and dropout; and where these considerations are deployed in the network.
Describe the design of feature maps and the filtering kernels for the 1st, 2nd and 3rd convolutional layers, and the fully-connected 6th, 7th and 8th layers in the AlexNet (see figure above).
Explain what the overall purpose is for the design of the AlexNet as shown below. Describe its key characteristics on its activation function, pooling, normalisation, and avoiding overfitting, and where these considerations are deployed in the network.

The first two convolutional layers use both max pooling and local response normalisation.

AlexNet contains eight layers. Input: 224×224×3 images; about 60M parameters to be trained.

1st convolutional layer: 96 kernels of size 11×11×3 (stride 4, pad 0) → 55×55×96 feature maps; then 3×3 overlapping max pooling (stride 2) → 27×27×96; then local response normalisation → 27×27×96.
2nd convolutional layer: 256 kernels of size 5×5×48 (stride 1, pad 2) → 27×27×256 feature maps; then 3×3 overlapping max pooling (stride 2) → 13×13×256; then local response normalisation → 13×13×256.
3rd convolutional layer: 384 kernels of size 3×3×256 (stride 1, pad 1) → 13×13×384 feature maps.
4th convolutional layer: 384 kernels of size 3×3×192 (stride 1, pad 1) → 13×13×384 feature maps.
5th convolutional layer: 256 kernels of size 3×3×192 (stride 1, pad 1) → 13×13×256 feature maps; then 3×3 overlapping max pooling (stride 2) → 6×6×256.
6th layer: fully connected (dense), 4096 neurons.
7th layer: fully connected (dense), 4096 neurons.
8th layer: fully connected (dense) output, 1000 neurons (one per class).

Softmax is used for calculating the loss. (A quick check of the feature-map sizes is given below.)
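A quick sanity check of the feature-map sizes above using the standard output-size formula; note the widely discussed detail that a 224×224 input with an 11×11, stride-4, pad-0 kernel gives 54, so a 227×227 effective input (or equivalent padding) is assumed here to obtain the quoted 55×55 maps.

# out = floor((in + 2*pad - kernel) / stride) + 1
def out_size(inp, kernel, stride, pad=0):
    return (inp + 2 * pad - kernel) // stride + 1

print(out_size(227, 11, 4))       # 55  (conv1)
print(out_size(55, 3, 2))         # 27  (overlapping max pooling)
print(out_size(27, 5, 1, pad=2))  # 27  (conv2)
print(out_size(27, 3, 2))         # 13  (pooling)
print(out_size(13, 3, 1, pad=1))  # 13  (conv3-5)
print(out_size(13, 3, 2))         # 6   (pooling after conv5)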

1. ReLU is introduced in AlexNet, and ReLU is six times faster than tanh to reach a 25% training error rate. ReLU is a so-called non-saturating activation: the gradient is never close to zero for a positive activation, and as a result training is faster. By contrast, sigmoid activations are saturating, which makes the gradient close to zero for large absolute values of the activation. A very small gradient makes the network train slower or even stop, because the step size during the gradient-descent weight update will be small or zero (the so-called vanishing gradient problem).

2. At that time, a maximum of only 3GB of GPU memory was available. The architecture is therefore split into two paths that use 2 GPUs for the convolutions, with inter-communication occurring only at one specific convolutional layer. Using 2 GPUs was due to the memory limit, NOT for speeding up the training process. Compared with a network with only half the number of kernels, the whole network reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively.
3. In AlexNet, local response normalisation is used; normalisation helps to speed up convergence. Local response normalisation is used only after layers 1 and 2 (before activation). Overlapping max pooling is used after layers 1, 2 and 5. Dropout is used only after layers 6 and 7. Nowadays, batch normalisation is used instead of local response normalisation. With local response normalisation, top-1 and top-5 error rates are reduced by 1.4% and 1.2% respectively.
4. Overlapping pooling is used, where the stride is smaller than the kernel size. With overlapping pooling, top-1 and top-5 error rates are reduced by 0.4% and 0.3% respectively.
5. Two forms of data augmentation are used: first, image translation and horizontal reflection (mirroring); second, altering the intensities using PCA. By increasing the size of the training set with data augmentation, the top-1 error rate is reduced by over 1%.
6. Dropout: in a layer using dropout, during training each neuron has a probability of not contributing to the feed-forward pass and not participating in back-propagation. Thus each neuron has a larger chance of being trained and does not depend so much on a few very "strong" neurons. There is no dropout during testing. In AlexNet, a probability of 0.5 is used at the first two fully-connected layers. Dropout is a regularization technique to reduce overfitting.
Other learning parameters:

✓ Batch size: 128
✓ Momentum v: 0.9
✓ Weight decay: 0.0005
✓ Learning rate: 0.01, divided by 10 manually whenever the validation error rate stopped improving; this was done three times.
✓ Training set of 1.2 million images.
✓ The network is trained for roughly 90 cycles (epochs) through the training set.
✓ Five to six days on two NVIDIA GTX 580 3GB GPUs.

CNN vs R-CNN
A convolutional neural network (CNN) is mainly for image classification, while an R-CNN, with the R standing for region, is for object detection.
A typical CNN can only tell you the class of the objects but not where they are located. It
is actually possible to regress bounding boxes directly from a CNN but that can only
happen for one object at a time. If multiple objects are in the visual field then the CNN
bounding box regression cannot work well due to interference.
In R-CNN the CNN is forced to focus on a single region at a time because that way
interference is minimized because it is expected that only a single object of interest will
dominate in a given region. The regions in the R-CNN are detected by selective search
algorithm followed by resizing so that the regions are of equal size before they are fed to
a CNN for classification and bounding box regression.

Describe what is a R-CNN network and its essential components. (ii) its three key
components and their functionality

Object detection is the process of finding and classifying objects in an image. One deep
learning approach, regions with convolutional neural networks (R-CNN), combines
rectangular region proposals with convolutional neural network features. R-CNN is a two-
stage detection algorithm. The first stage identifies a subset of regions in an image that
might contain an object. The second stage classifies the object in each region.
Applications for R-CNN object detectors include:
• Autonomous driving
• Smart surveillance systems
• Facial recognition
Object Detection Using R-CNN Algorithms
Models for object detection using regions with CNNs are based on the following three
processes:
• Find regions in the image that might contain an object. These regions are called
region proposals.
• Extract CNN features from the region proposals.
• Classify the objects using the extracted features.
There are three variants of an R-CNN. Each variant attempts to optimize, speed up, or
enhance the results of one or more of these processes.
R-CNN
The R-CNN detector first generates region proposals using an algorithm such as Edge Boxes or selective search. The proposal regions are cropped out of the image and resized. Then the cropped and resized regions are classified using class-specific SVMs trained on CNN features. Finally, the region-proposal bounding boxes are refined by a regression model trained on the same CNN features.

Another explanation
The RCNN algorithm proposes a set of boxes in the image and checks whether any of these boxes contain an object. RCNN uses selective search to extract these boxes from an image (these boxes are called regions).
1. Takes image as input
2. Then, it generates initial sub-segmentations so that we have multiple regions from this
image:
3. The technique then combines the similar regions to form a larger region (based on color
similarity, texture similarity, size similarity, and shape compatibility):
4. Finally, these regions then produce the final object locations (Region of Interest).
1. We first take a pre-trained convolutional neural network.
2. Then this model is retrained: the last layer of the network is re-trained based on the number of classes that need to be detected.
3. The third step is to get the Regions of Interest for each image. We then reshape all these regions so that they match the CNN input size.
4. After getting the regions, we train SVMs to classify object vs. background; for each class, we train one binary SVM.
5. Finally, we train a linear regression model to generate tighter bounding boxes for each identified object in the image (a rough sketch of steps 4-5 is given below).
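A rough scikit-learn sketch of steps 4-5, assuming the CNN features of the region proposals have already been computed; the arrays here are random placeholders, and Ridge regression stands in for the least-squares bounding-box regressor.

# Per-class binary SVMs (object vs. background) plus a linear box regressor.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4096))      # CNN features of region proposals (placeholder)
labels = rng.integers(0, 3, size=200)     # 0 = background, 1..2 = object classes
box_deltas = rng.normal(size=(200, 4))    # regression targets (dx, dy, dw, dh)

svms = {c: LinearSVC().fit(feats, (labels == c).astype(int)) for c in (1, 2)}
bbox_reg = Ridge(alpha=1.0).fit(feats, box_deltas)

scores = {c: svm.decision_function(feats[:5]) for c, svm in svms.items()}
refined = bbox_reg.predict(feats[:5])     # tighter boxes for the top proposals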

Problems
All these processes combine to make RCNN very slow. It takes around 40-50 seconds to
make predictions for each new image, which essentially makes the model cumbersome
and practically impossible to build when faced with a gigantic dataset.

VGGNET
Two layers of 3×3 filters already cover a 5×5 area.
By using 2 layers of 3×3 filters, the network has already covered a 5×5 receptive field; by using 3 layers of 3×3 filters, it covers a 7×7 effective area. Thus, large filters such as the 11×11 in AlexNet and the 7×7 in ZFNet are not really needed.

VGGNet (VGG-16) consists of 16 weight layers (13 convolutional plus 3 fully connected) and is very appealing because of its very uniform architecture. Similar to AlexNet it uses only 3x3 convolutions, but with many filters. It was trained on 4 GPUs for 2–3 weeks. It has been a very popular choice in the community for extracting features from images. The weight configuration of the VGGNet is publicly available and has been used in many other applications and challenges as a baseline feature extractor. However, VGGNet consists of 138 million parameters, which can be a bit challenging to handle.
Architecture considerations
• Pre-processing: fixed-size image inputs (224x224) and mean subtraction
• Stacks of filters with small receptive fields (3x3, and some 1x1) with a 1-pixel convolution stride
• Spatially preserving padding
• 5 max-pooling layers carried out over 2x2 windows with stride 2
• Max pooling only applied after some of the conv layers
• Observation: a drastic change from previous shallower nets with larger receptive fields and strides

Compared with GoogLeNet using an ensemble of 7 nets, which has a 6.7% error rate, VGGNet using 2 nets plus multi-scale training, multi-scale testing, multi-crop and dense evaluation has a 6.8% error rate, which is competitive. With only 1 net, VGGNet has a 7.0% error rate, which is better than GoogLeNet's single-model 7.9% error rate. However, at the submission of ILSVRC 2014, VGGNet had only a 7.3% error rate, which made it the 1st runner-up at the time.

Explain what is a Softmax function in a Convolutional Neural Network for image classification. Give its formulation and describe the terms. A Softmax function is a normalized exponential function defined as follows:
normalized exponential function defined as follows
w is the weight vector, x is the vector of 1 training sample, and w_0 is the bias unit.
Describe the characteristics of the Softmax function in Eqn.(1) above; Explain the
meanings of symbols in Eqn.(1).

Net input: θ = wᵀx = w_0 x_0 + w_1 x_1 + … = Σ_i w_i x_i
Softmax (Eqn. 1): P(y = j | x; w) = exp(θ_j) / Σ_{k=1..n} exp(θ_k)


The softmax computes the probability that a training example x belongs to class j, given the weights and the net input θ. That is, we compute the probability p(y = j | x; w) for each class label j = 1 to n. The denominator is the normalization term, which causes these probabilities to sum to 1.
1. Given an input vector x, our objective is to predict whether the trained set of features x belongs to class j. The x vector here is one-hot encoded: it consists of binary values, with a 1 representing the element in the iᵗʰ position and 0s elsewhere. Our output from the Softmax function is the ratio of the exponential of the parameter to the sum of the exponentials of all parameters (a small implementation follows).
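A minimal NumPy implementation of Eqn.(1), with the usual max-subtraction for numerical stability; the weights and sample are random placeholders.

import numpy as np

def softmax(theta):
    theta = theta - np.max(theta)   # subtract max for numerical stability
    e = np.exp(theta)
    return e / e.sum()

W = np.random.randn(3, 4)           # one weight vector per class (3 classes)
x = np.random.randn(4)              # one training sample
theta = W @ x                       # net inputs theta_j = w_j^T x
p = softmax(theta)
print(p, p.sum())                   # class probabilities, summing to 1.0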

Explain what the typical role is for a Softmax in a Convolutional Neural Network (CNN) for
image classification.
The softmax activation is normally applied to the very last layer of a neural net, instead of ReLU, sigmoid, tanh, or another activation function. It is a generalization of logistic regression that can be used for multi-class classification. The output of the classifier's last layer (the logits) is converted into a probability distribution whose values lie in the range [0, 1] and sum to 1. Logits are the raw scores output by the last layer of a neural network, before the activation takes place.

Explain what is the aim of minimizing a loss function in the training of a CNN model.
Explain (i) the role of a loss function in a CNN, and (ii) how it is used.
A deep learning neural network learns to map a set of inputs to a set of outputs from
training data. We cannot calculate the perfect weights for a neural network; there are too
many unknowns. Instead, the problem of learning is cast as a search or optimization
problem and an algorithm is used to navigate the space of possible sets of weights the
model may use in order to make good or good enough predictions. Typically, a neural
network model is trained using the stochastic gradient descent optimization algorithm and
weights are updated using the backpropagation of error algorithm. The “gradient” in
gradient descent refers to an error gradient. The model with a given set of weights is used
to make predictions and the error for those predictions is calculated. The gradient descent
algorithm seeks to change the weights so that the next evaluation reduces the error,
meaning the optimization algorithm is navigating down the gradient (or slope) of error. In
the context of an optimization algorithm, the function used to evaluate a candidate solution (i.e. a set of weights) is referred to as the objective function, and we seek to minimize it. The objective function is often referred to as a cost function or a loss function, and the value calculated by the loss function is referred to simply as the “loss”.
A loss function is used to measure the inconsistency between the predicted value and the actual label. It is a non-negative value, and the robustness of the model increases as the value of the loss function decreases.
Explain how to optimise a loss function in a CNN.
By using gradient descent. Gradient descent is an optimization algorithm used to minimize
some function by iteratively moving in the direction of steepest descent as defined by the
negative of the gradient. In machine learning, we use gradient descent to update the
parameters of our model.
We take our first step downhill in the direction specified by the negative gradient. Next, we recalculate the negative gradient and take another step in the direction it specifies. We continue this process iteratively until we reach a (local) minimum. The size of these steps is called the learning rate. With a high learning rate we can cover more ground in each step, but we risk overshooting the lowest point, which results in oscillations. A low learning rate is more precise, but since each step is tiny it will take a very long time to reach the minimum.
A loss function tells us “how good” our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients; the slope of this curve tells us how to update our parameters to make the model more accurate.

There are two parameters in our cost function we can control: m (weight) and b (bias).

Given the cost function, the gradient can be calculated. To solve for the gradient, we
iterate through our data points using our new m and b values and compute the partial
derivatives. This new gradient tells us the slope of our cost function at our current
position (current parameter values) and the direction we should move to update our
parameters. The size of our update is controlled by the learning rate.
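A small NumPy sketch of exactly this loop for a line y = m·x + b fitted with a mean-squared-error cost; the data and learning rate are toy assumptions.

import numpy as np

x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + 0.05 * np.random.randn(50)   # toy data around y = 2x + 1

m, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    pred = m * x + b
    err = pred - y
    dm = 2 * np.mean(err * x)   # partial derivative of the MSE cost w.r.t. m
    db = 2 * np.mean(err)       # partial derivative of the MSE cost w.r.t. b
    m -= lr * dm                # step in the direction of the negative gradient
    b -= lr * db
print(m, b)                     # should approach 2 and 1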

Explain what is a “Mini-batch” in the training of a CNN model.


First, gradient descent aims to find model parameters that minimize the error of the model on the training dataset.
What is Stochastic Gradient Descent?
Stochastic gradient descent calculates the error and updates the model for each example in the training dataset (an online machine-learning algorithm).
Upsides
• The frequent updates immediately give an insight into the performance of the
model and the rate of improvement.
• This variant of gradient descent may be the simplest to understand and implement,
especially for beginners.
• The increased model update frequency can result in faster learning on some
problems.
• The noisy update process can allow the model to avoid local minima (e.g.
premature convergence).
Downsides
• Updating the model so frequently is more computationally expensive than other
configurations of gradient descent, taking significantly longer to train models on
large datasets.
• The frequent updates can result in a noisy gradient signal, which may cause the
model parameters and in turn the model error to jump around (have a higher
variance over training epochs).
• The noisy learning process down the error gradient can also make it hard for the
algorithm to settle on an error minimum for the model.
What is Batch Gradient Descent?
Batch gradient calculates the error for each example in the training dataset, but only
updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is
often said that batch gradient descent performs model updates at the end of each training
epoch.
Upsides
• Fewer updates to the model means this variant of gradient descent is more
computationally efficient than stochastic gradient descent.
• The decreased update frequency results in a more stable error gradient and may
result in a more stable convergence on some problems.
• The separation of the calculation of prediction errors and the model update lends
the algorithm to parallel processing based implementations.
Downsides
• The more stable error gradient may result in premature convergence of the model
to a less optimal set of parameters.
• The updates at the end of the training epoch require the additional complexity of
accumulating prediction errors across all training examples.
• Commonly, batch gradient descent is implemented in such a way that it requires
the entire training dataset in memory and available to the algorithm.
• Model updates, and in turn training speed, may become very slow for large
datasets.
What is Mini-Batch Gradient Descent?
Mini-batch gradient descent splits the training dataset into small batches that are used to
calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch or take the average
of the gradient which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning (a short sketch follows the lists below).
Upsides
• The model update frequency is higher than batch gradient descent which allows for
a more robust convergence, avoiding local minima.
• The batched updates provide a computationally more efficient process than
stochastic gradient descent.
• The batching allows both the efficiency of not having all training data in memory and the efficiency of vectorised algorithm implementations.
Downsides
• Mini-batch requires the configuration of an additional “mini-batch size”
hyperparameter for the learning algorithm.
• Error information must be accumulated across mini-batches of training examples
like batch gradient descent.
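A rough NumPy sketch of the mini-batch variant on a toy linear-regression problem; the batch size and learning rate are illustrative assumptions.

import numpy as np

X = np.random.randn(1000, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * np.random.randn(1000)

w, lr, B = np.zeros(5), 0.05, 64     # B is the extra "mini-batch size" hyperparameter
for epoch in range(20):
    idx = np.random.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        w -= lr * grad                            # one model update per mini-batch
print(w)                                          # close to true_w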

Explain what “Data Augmentation” and “Dropout” are used for in CNN model learning.

Data augmentation is the creation of altered copies of each instance within a training
dataset. When we feed image data into a neural network, there are some features of the
images that we would like the neural network to condense or summarize into a set of
numbers or weights. In the case of image classification, these features or signals are the
pixels which make up the object in the picture. On the other hand, there are features of
the images that we would not like the neural network to incorporate in its summary of the
images (the summary is the set of weights). In the case of image classification, these
features or noise are the pixels which form the background in the picture.
So, how do we ensure that the neural network can differentiate signal from noise? A very
simple solution is to create multiple alterations of each image, where the signal or the
object in the picture is kept invariant, whilst the noise or the background is distorted.
These distortions include cropping, scaling and rotating the image, among others.
Therefore, the network of neurons observes the invariance in the images and encodes this
information or signal in the set of weights which summarize the training data.
Data augmentation is an essential part of training discriminative Convolutional Neural
Networks (CNNs). A variety of augmentation strategies, including horizontal flips, random
crops, and principal component analysis (PCA), have been proposed and shown to capture
important characteristics of natural images.
Dropout
Large neural nets trained on relatively small datasets can overfit the training data. This
has the effect of the model learning the statistical noise in the training data, which results
in poor performance when the model is evaluated on new data, e.g. a test dataset.
Generalization error increases due to overfitting.
Dropout is a regularization technique for reducing overfitting in neural networks by
preventing complex co-adaptations on training data. The term "dropout" refers to dropping
out units (both hidden and visible) in a neural network. Dropout has the effect of making
the training process noisy, forcing nodes within a layer to probabilistically take on more or
less responsibility for the inputs
A typical dropout setting for hidden layers retains units with probability 0.5 to 0.8 (in this retention-probability convention, 1.0 means no dropout and 0.0 means no outputs from the layer).
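A minimal PyTorch usage sketch; note that PyTorch's nn.Dropout takes the probability of dropping a unit (AlexNet's 0.5), whereas the 0.5-0.8 figures above are retention probabilities.

import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(8, 256)
layer.train()
y_train = layer(x)   # roughly half of the activations are zeroed during training
layer.eval()
y_test = layer(x)    # no dropout at test time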

Describe two common random repeat techniques for minimising overfitting in CNN model
training.
1. Dropout as above.
2. Add Gaussian noise to the input variables.
One approach to improving generalization error and to improving the structure of the
mapping problem is to add random noise. The addition of noise during the training of a
neural network model has a regularization effect and, in turn, improves the robustness of
the model. It has been shown to have a similar impact on the loss function as the addition
of a penalty term, as in the case of weight regularization methods. In effect, adding noise
expands the size of the training dataset. Each time a training sample is exposed to the
model, random noise is added to the input variables making them different every time it is
exposed to the model. In this way, adding noise to input samples is a simple form of data
augmentation. Adding noise means that the network is less able to memorize training
samples because they are changing all of the time, resulting in smaller network weights
and a more robust network that has lower generalization error.
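A minimal sketch of the Gaussian-noise idea, assuming flattened image vectors and an illustrative noise level.

import numpy as np

def noisy_batch(x, sigma=0.05):
    # Add fresh Gaussian noise to the input variables each time a sample is presented
    return x + np.random.normal(0.0, sigma, size=x.shape)

x = np.random.rand(32, 784)           # a batch of flattened images (placeholder)
for epoch in range(3):
    x_aug = noisy_batch(x)            # a different perturbation every epoch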

Define the logistic regression function and explain why it is used as the activation function
of a neural network unit.

The logistic (sigmoid) function is σ(z) = 1 / (1 + e^(−z)), which maps any real-valued net input z to the range (0, 1). It is very common to use logistic sigmoid functions as activation functions in the hidden layers of a network. Pros: the logistic cost function is convex, so for a single unit (logistic regression) you are guaranteed to find the global minimum. However, once you stack such units in a neural network, you lose this convexity. Back-propagation still works quite well for 1- or 2-layer neural networks (autoencoders can help with deeper architectures). Even though you may well converge to a local minimum, you often still end up with a powerful predictive model.
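A one-line NumPy definition of the logistic (sigmoid) function, for reference.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any real z into (0, 1)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]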

Given a neuron unit as illustrated in Figure 2 below, describe the meaning of all the
symbols.
1 – the bias input
x – the input vector (weighted by w)
wᵀ – the weight vector, giving the net input z = wᵀx plus bias
f(z) – the output of the unit, which lies in the range [0, z] (i.e. a ReLU-like activation, f(z) = max(0, z))
Draw a visual illustration of the output response from the neuron unit given above.
Describe the characteristics of this output response and its activation function.

???????

Draw a visual illustration that compares the output responses from the Logistic Regression
(LR) and the Rectified Linear Unit (ReLU) activation functions. Explain the advantage of
ReLU over Logistic Regression.
Two additional major benefits of ReLUs are sparsity and a reduced likelihood of vanishing
gradient. But first recall the definition of a ReLU is h=max(0,a) where a=Wx+b.
One major benefit is the reduced likelihood of the gradient to vanish. This arises when
a>0. In this regime the gradient has a constant value. In contrast, the gradient of sigmoids
becomes increasingly small as the absolute value of x increases. The constant gradient of
ReLUs results in faster learning.
The other benefit of ReLUs is sparsity. Sparsity arises when a ≤ 0. The more such units exist in a layer, the sparser the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations. Sparse representations seem to be more beneficial than dense representations.
ReLU does have the disadvantage of dying units, which limits the capacity of the network. To overcome this, use a variant of ReLU such as leaky ReLU, ELU, etc. if you notice this problem. ReLU can also tend to blow up activations, since its output is unbounded above.
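A small numerical illustration of the two benefits, comparing the sigmoid derivative σ(z)(1−σ(z)) with the ReLU derivative at a few net-input values.

import numpy as np

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sig = 1.0 / (1.0 + np.exp(-z))
print(sig * (1.0 - sig))           # [~0.00005, 0.197, 0.25, 0.197, ~0.00005]: saturates
print(np.where(z > 0, 1.0, 0.0))   # [0, 0, 0, 1, 1]: ReLU gradient is constant for z > 0,
                                   # exactly 0 otherwise (which gives sparsity)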
Explain what are the computational operations for the three stages of a Super-Resolution Convolutional Neural Network (SRCNN) as shown in Figure 1 below.

The three stages are implemented as three convolutions: (1) patch extraction and representation, which extracts overlapping patches from the (bicubic-upscaled) low-resolution input and represents each as a high-dimensional feature vector; (2) non-linear mapping, which maps each feature vector to the representation of a high-resolution patch; and (3) reconstruction, which aggregates the patch-wise representations to produce the final high-resolution image. The non-linear mapping stage is kept shallow, as making it deeper increases complexity and training time.
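A hedged PyTorch sketch of the three stages as three convolutions; the 9-1-5 filter sizes and 64/32 widths are the commonly quoted SRCNN settings and should be treated as assumptions here.

import torch
import torch.nn as nn

srcnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=9, padding=4), nn.ReLU(),  # patch extraction & representation
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),            # non-linear mapping
    nn.Conv2d(32, 1, kernel_size=5, padding=2),             # reconstruction
)
lr_upscaled = torch.randn(1, 1, 33, 33)   # bicubic-upscaled low-resolution input (placeholder)
hr_estimate = srcnn(lr_upscaled)          # same spatial size, with restored detail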

Describe the main components of a classical Sparse-Coding super-resolution (SR) algorithm.
The sparse-coding-based method is an external example-based SR method. It involves several steps in its solution pipeline. First, overlapping patches are densely cropped from the input image and pre-processed (by subtracting the mean and normalizing). These
patches are then encoded by a low-resolution dictionary. The sparse coefficients are
passed into a high-resolution dictionary for reconstructing high-resolution patches. The
overlapping reconstructed patches are aggregated (e.g., by weighted averaging) to
produce the final output.

State the key weakness of a classical Sparse-Coding based SR model when compared to the
SRCNN model.
Sparse coding can be viewed as a convolutional network, but not all of its operations are considered in the optimization of sparse-coding-based SR methods. In SRCNN, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized. So the SRCNN method optimizes an end-to-end mapping that consists of all operations, and learning is done by back-propagation.

Describe the key principles for the design of the GoogLeNet deep learning model shown
below.
The network architecture in this paper is quite different from VGGNet, ZFNet, and
AlexNet. It contains 1×1 Convolution at the middle of the network. And global average
pooling is used at the end of the network instead of using fully connected layers. These
two techniques are from another paper “Network In Network” (NIN). Another technique,
called inception module, is to have different sizes/types of convolutions for the same
input and stacking all the outputs.
1. The 1×1 Convolution
The 1×1 convolution was introduced by NIN and is used together with ReLU. Originally, NIN used it to introduce more non-linearity and increase the representational power of the network. In GoogLeNet, the 1×1 convolution is used as a dimension-reduction module to reduce computation. By reducing the computational bottleneck, depth and width can be increased.
A simple example illustrates this. Suppose we need to perform a 5×5 convolution without the use of a 1×1 convolution, as below:

Number of operations = (14×14×48)×(5×5×480) = 112.9M

With the use of 1×1 convolution:

Number of operations for 1×1 = (14×14×16)×(1×1×480) = 1.5M



Number of operations for 5×5 = (14×14×48)×(5×5×16) = 3.8M

Total number of operations = 1.5M + 3.8M = 5.3M

which is much smaller than 112.9M.
Indeed, the above example is the calculation of the 5×5 convolution at inception (4a). Thus, the inception module can be built without greatly increasing the number of operations compared with the version without the 1×1 convolution. The 1×1 convolution also helps reduce the model size and reduce overfitting. (The arithmetic is reproduced below.)
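The operation counts above can be reproduced directly.

# Reproducing the quoted operation counts for the 5x5 branch at inception (4a).
direct_5x5 = (14 * 14 * 48) * (5 * 5 * 480)   # no 1x1 reduction
reduce_1x1 = (14 * 14 * 16) * (1 * 1 * 480)   # 1x1 bottleneck down to 16 channels
then_5x5 = (14 * 14 * 48) * (5 * 5 * 16)      # 5x5 on the reduced input
print(direct_5x5 / 1e6)                       # ~112.9 M
print((reduce_1x1 + then_5x5) / 1e6)          # ~5.3 M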

Describe the key principles for the design of GoogLeNet shown in Figures (a), (b), (c)
below.
It has 22 layers in total.
There are numerous inception modules connected together to go deeper. There are some intermediate softmax branches in the middle, which are used for training only. These branches are auxiliary classifiers, which consist of:
5×5 Average Pooling (Stride 3)

1×1 Conv (128 filters)

1024 FC

1000 FC

Softmax
The loss is added to the total loss, with weight 0.3.
The authors claim the auxiliary classifiers help to combat the vanishing gradient problem and also provide regularization.
They are NOT used at testing or inference time.
During testing:
7 GoogLeNets are used for ensemble prediction. This is an ensemble approach already used in LeNet, AlexNet, ZFNet and VGGNet.
Multi-scale testing is used just like VGGNet, with shorter dimension of 256, 288, 320, 352.
(4 scales)
Multi-crop testing is used, same idea but a bit different from and more complicated than
AlexNet.
First, for each scale, it takes left, center and right, or top, middle and bottom squares (3
squares). Then, for each square, 4 corners and center as well as the resized square (6
crops) are cropped as well as their corresponding flips (2 versions) are generated.
The total is 4 scales×3 squares×6 crops×2 versions=144 crops/image
Softmax probabilities are averaged over all crops.
Explain the key principles for the design of the GoogLeNet Inception architecture?
Explanation 2
What is the problem?
• Salient parts in the image can have extremely large variation in size. For example, a dog can occupy a small or a large area in an image depending on the camera distance, etc.
• Because of this huge variation in the location of the information, choosing the right
kernel size for the convolution operation becomes tough. A larger kernel is
preferred for information that is distributed more globally, and a smaller kernel is
preferred for information that is distributed more locally.
• Very deep networks are prone to overfitting. It is also hard to pass gradient updates through the entire network.
• Naively stacking large convolution operations is computationally expensive.

The Solution:
Why not have filters with multiple sizes operate on the same level? The network essentially
would get a bit “wider” rather than “deeper”. The authors designed the inception module to
reflect the same.
The below image is the “naive” inception module. It performs convolution on an input, with
3 different sizes of filters (1x1, 3x3, 5x5). Additionally, max pooling is also performed. The
outputs are concatenated and sent to the next inception module.

As stated before, deep neural networks are computationally expensive. To make them cheaper, the authors limit the number of input channels by adding an extra 1x1 convolution before the 3x3 and 5x5 convolutions. Though adding an extra operation may seem counterintuitive, 1x1 convolutions are far cheaper than 5x5 convolutions, and the reduced number of input channels also helps. Do note, however, that for the pooling branch the 1x1 convolution is introduced after the max pooling layer, rather than before.
Using the dimension reduced inception module, a neural network architecture was built.
This was popularly known as GoogLeNet (Inception v1).

GoogLeNet has 9 such inception modules stacked linearly. It is 22 layers deep (27,
including the pooling layers). It uses global average pooling at the end of the last
inception module.
Needless to say, it is a pretty deep classifier. As with any very deep network, it is subject
to the vanishing gradient problem.
To prevent the middle part of the network from “dying out”, the authors introduced two
auxiliary classifiers (The purple boxes in the image). They essentially applied softmax to
the outputs of two of the inception modules, and computed an auxiliary loss over the
same labels. The total loss function is a weighted sum of the auxiliary loss and the real
loss. The weight value used in the paper was 0.3 for each auxiliary loss. The auxiliary loss is purely used for training purposes and is ignored during inference.
Inception V2 and V3 also presented in that paper:
The Premise:
• Reduce representational bottleneck. The intuition was that, neural networks
perform better when convolutions didn’t alter the dimensions of the input
drastically. Reducing the dimensions too much may cause loss of information,
known as a “representational bottleneck”
• Using smart factorization methods, convolutions can be made more efficient in
terms of computational complexity.
The Solution:
• Factorize each 5x5 convolution into two 3x3 convolution operations to improve computational speed. Although this may seem counterintuitive, a 5x5 convolution is 2.78 times more expensive than a 3x3 convolution, so stacking two 3x3 convolutions in fact leads to a boost in performance. This is illustrated in the image below and in the quick parameter count after this list.
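A quick parameter count behind the 2.78 figure (per input/output channel pair, ignoring the intermediate channel count).

print(5 * 5)             # 25 weights for one 5x5 filter
print(2 * 3 * 3)         # 18 weights for two stacked 3x3 filters
print(round(25 / 9, 2))  # 2.78: cost ratio of a single 5x5 vs a single 3x3 convolution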

Global Average Pooling in GoogLeNet

Previously, fully connected (FC) layers were used at the end of the network, as in AlexNet, where all inputs are connected to each output.
In GoogLeNet, global average pooling is used near the end of the network by averaging each feature map from 7×7 down to 1×1, as in the figure above.
The authors found that the move from FC layers to average pooling improved the top-1 accuracy by about 0.6%.
This idea comes from NIN and is less prone to overfitting (a minimal usage sketch is given below).
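A minimal PyTorch sketch of global average pooling; the feature-map sizes follow the 7×7 description above.

import torch
import torch.nn as nn

feature_maps = torch.randn(1, 1024, 7, 7)     # final conv feature maps (placeholder sizes)
gap = nn.AdaptiveAvgPool2d(1)                 # average each 7x7 map down to 1x1
vector = gap(feature_maps).flatten(1)         # shape (1, 1024), no extra weights to learn
print(vector.shape)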

Describe what interpretations can be drawn from the visualisation of CNN layer output
below. Explain how to determine which CNN layers are suitable for fine-tuning given
different target training data sizes in relation to the original auxiliary data size.
http://cs231n.github.io/convolutional-networks/

https://www.researchgate.net/figure/Multifaceted-visualization-of-example-neuron-feature-detectors-from-all-eight-layers-of-a_fig4_301845946
https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

Given a CNN network intermediate layers 1, 3, 5 outputs typically have the characteristics
as shown in Figure 4, give six aspects why a CNN model is good for object detection.

https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
CNN
A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural network, designed to recognize visual patterns directly from pixel images with minimal preprocessing. The ImageNet project is a large visual database designed for use in visual object recognition software research.
CNNs learn useful higher-level features from images: they learn a hierarchy from pixels to classifier, where each layer extracts features from the output of the previous layer. This leads to better performance. Feature computation time and the size of the dataset are important.
Pixels → 1st layer = edges → 2nd layer = object parts → 3rd layer = objects

Explain the main differences between the R-CNN and the AlexNet (use the schemata given
in figures above and below).
What is the vanishing gradient problem?
The vanishing gradient problem arises while training an artificial neural network. It mainly occurs when the network parameters and hyperparameters are not properly set. Parameters are the weights and biases, while hyperparameters include the learning rate, the number of epochs and the batch size. Because of this, your deep learning model may take a longer time to train and learn from the data, and sometimes may not train at all, resulting in little or no convergence of the neural network. With vanishing gradients, the slope of the loss with respect to the early-layer weights becomes too small and gradually shrinks towards zero.
This leads to poor performance of the model and the accuracy is very low. The model
may fail to predict or classify what it is supposed to do.
The solutions to Vanishing Gradient problems are:
• LSTMs: Long Short-Term Memory networks are generally used to tackle the vanishing gradient problem when working with RNNs. LSTMs help to handle long-term dependencies and can memorize previous data easily.
• Faster hardware: switching from CPUs to GPUs with much faster computation has made standard back-propagation feasible at relatively low cost.
• Other activation functions: rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.
• Residual networks: residual neural networks (ResNets) avoid the vanishing gradient problem by effectively constructing ensembles of many shorter networks through skip connections.

Examples of visualized weights for the first layer of a neural network. Left: noisy features could be a symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty. Right: nice, smooth, clean and diverse features are a good indication that training is proceeding well.
Describe the two major bottlenecks in constructing deeper convolutional neural networks.
Perception
