Deep Learning
Department of Computer Science and Engineering
Pranveer Singh Institute of Technology, Kanpur
Dimensionality Reduction
Dimensionality reduction refers to reducing the number of input variables (features) while
preserving as much important information (variance or class separability) as possible. It helps:
● Reduce computational cost
● Avoid overfitting
● Improve visualization
● Remove redundant/noisy features
There are two broad types:
● Linear methods (e.g., PCA, LDA)
● Non-linear (manifold) methods (e.g., t-SNE, Isomap)
Principal Component Analysis (PCA)
🧠 Intuition:
Imagine a cloud of points in 3D space that lies roughly along a line.
Although it’s 3D, most variation is along that line — so we can represent it using just one
variable instead of three.
⚙ Steps:
1. Standardize data: Subtract mean, divide by standard deviation.
2. Compute Covariance Matrix: Measures how features vary together.
3. Find Eigenvectors & Eigenvalues:
○ Eigenvectors = principal component directions
○ Eigenvalues = amount of variance captured
4. Sort and Select: Keep top-k eigenvectors with largest eigenvalues.
5. Transform: Project the original data onto these axes.
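A minimal NumPy sketch of these five steps, assuming a data matrix X of shape (samples, features); the height/weight numbers below are only illustrative:

```python
import numpy as np

def pca(X, k):
    """Project data X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize: subtract mean, divide by standard deviation
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh works because the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by decreasing eigenvalue and keep the top k
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    # 5. Transform: project the standardized data onto the selected axes
    return X_std @ components

# Example: 1000 people, correlated height and weight, reduced from 2D to 1D
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 1000)
weight = 0.9 * height + rng.normal(0, 5, 1000)
X = np.column_stack([height, weight])
Z = pca(X, k=1)   # shape (1000, 1): one "body size" component
```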
Principal Component Analysis (PCA)
📘 Example:
Suppose we have height and weight data for 1000 people.
● Height and weight are correlated.
● PCA may find that one component (body size) explains 95% of the variation.
● We can reduce from 2D → 1D without losing much information.
📊 Application:
● Image compression (e.g., compressing 1024-pixel images to 100 features).
● Noise removal.
● Visualizing high-dimensional data in 2D (for clustering).
Linear Discriminant Analysis (LDA)
Type: Linear, Supervised
Goal: Reduce dimensions while maximizing class separability.
🧠 Intuition:
While PCA looks for directions of maximum variance,
LDA looks for directions that best separate classes.
⚙ Steps:
1. Compute mean of each class.
2. Compute within-class scatter (how spread out data is within a class).
3. Compute between-class scatter (how far apart class means are).
4. Find the projection matrix W that maximizes the ratio of between-class to within-class scatter (the Fisher criterion): J(W) = |Wᵀ S_B W| / |Wᵀ S_W W|
5. Project the data into the lower dimension using W.
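A short sketch using scikit-learn's LinearDiscriminantAnalysis on the Iris data; the dataset choice is only for illustration, and with C classes LDA yields at most C − 1 components:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 4 features, 3 classes -> at most (classes - 1) = 2 discriminant axes
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # supervised: uses the class labels y

print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # share of between-class variance per axis
```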
Linear Discriminant Analysis (LDA)
📘 Example:
In a face recognition dataset with multiple people:
● LDA projects each face image into a lower-dimensional space
where faces of the same person are close together, and
different people are far apart.
Manifold Learning
Type: Non-linear
Goal: When data lies on a curved surface (manifold) in high-dimensional space, manifold
learning “unfolds” it.
🧠 Intuition:
Think of a “Swiss roll” — a 2D sheet rolled in 3D space.
PCA would fail because it’s linear, but manifold learning can “unwrap” it into a flat 2D
surface.
📚 Common Techniques:
● Isomap: Preserves geodesic (curved) distances using nearest neighbors.
● t-SNE: Preserves local neighborhoods; great for 2D visualization of clusters.
● LLE (Locally Linear Embedding): Keeps local linear relationships among data points.
📘 Example:
In image datasets, all images of digits "2" may lie on one curved manifold, and "3" on
another — manifold learning helps visualize these relationships.
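A short scikit-learn sketch of t-SNE and Isomap on the 8×8 digits dataset; parameter values such as perplexity=30 and n_neighbors=10 are illustrative defaults, not tuned settings:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap

# 8x8 digit images -> 64-dimensional vectors
X, y = load_digits(return_X_y=True)

# t-SNE: preserves local neighborhoods, good for 2-D visualization of clusters
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Isomap: preserves geodesic distances estimated from a nearest-neighbor graph
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X_tsne.shape, X_iso.shape)   # (1797, 2) (1797, 2)
```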
Metric Learning
Goal: Learn a distance function that reflects semantic similarity.
🧠 Intuition:
Euclidean distance may not always reflect similarity.
Metric learning teaches the model what “similar” means in context.
📘 Example:
● Face Recognition: Siamese networks learn embeddings such that:
○ Same person → small distance.
○ Different persons → large distance.
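A minimal PyTorch sketch of this Siamese idea using a contrastive loss; the EmbeddingNet layer sizes and the margin value are illustrative assumptions, not a specific published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Illustrative encoder mapping 784-dim inputs to 32-dim embeddings."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """same = 1 for pairs of the same person, 0 otherwise."""
    d = F.pairwise_distance(z1, z2)
    # Same person -> pull embeddings together; different -> push apart beyond the margin
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy usage with random pairs
net = EmbeddingNet()
x1, x2 = torch.randn(8, 784), torch.randn(8, 784)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(net(x1), net(x2), same)
```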
Autoencoders and Dimensionality Reduction in Neural Networks
An Autoencoder is a neural network that learns to compress and then reconstruct
input data.
🔹 Architecture:
Input → Encoder → Bottleneck → Decoder → Output
● Encoder: Reduces dimension to a compressed representation.
● Bottleneck: The low-dimensional “code.”
● Decoder: Reconstructs original data from code.
🧠 Intuition:
Like PCA, but nonlinear — can capture complex patterns.
Autoencoders and Dimensionality Reduction in Neural Networks
📘 Example:
For an image of 28×28 pixels (784 inputs):
● Encoder compresses it to 64 features.
● Decoder reconstructs the 784-pixel image from those 64 features.
● The 64-feature vector is a non-linear reduced representation.
⚙ Variants:
● Denoising Autoencoder: Learns to remove noise (input: noisy image → output: clean
image).
● Sparse Autoencoder: Forces most hidden units to be inactive → compact encoding.
● Variational Autoencoder (VAE): Learns probability distributions in the latent space; used for generative modeling.
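A minimal PyTorch sketch matching the 784 → 64 → 784 example above; the intermediate layer sizes and the MSE reconstruction loss are illustrative choices:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """784 -> 64 -> 784, matching the 28x28 image example (layer sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
        self.decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                                     nn.Linear(256, 784), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)      # 64-dim bottleneck: the reduced representation
        return self.decoder(code)   # reconstruction of the 784-dim input

model = Autoencoder()
x = torch.rand(32, 784)             # a batch of flattened 28x28 images
loss = nn.MSELoss()(model(x), x)    # reconstruction error drives training
loss.backward()
```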
Introduction to Convolutional Neural Networks (ConvNets)
CNNs are designed to process data with spatial structure, like images (height × width × color
channels).
🔹 Key Idea:
Instead of connecting every input neuron to every output neuron (as in dense layers),
CNNs use local connections called filters (kernels) to detect patterns.
Layers in CNN:
Convolution Layer
○ Applies filters to extract features like edges or corners.
○ Example: A 3×3 kernel slides over an image, computing weighted sums.
Introduction to Convolutional Neural Networks (ConvNets)
Activation (ReLU)
○ Adds non-linearity: f(x) = max(0, x).
○ Helps model complex patterns.
Pooling Layer
○ Reduces spatial size (e.g., 2×2 max pooling).
○ Makes features more translation invariant.
Fully Connected Layer
○ Flattens feature maps and performs classification.
Example: For a 28×28 grayscale image:
● Conv1 → detects edges
● Conv2 → detects corners or shapes
● Conv3 → detects object parts
● Fully Connected → outputs “This is a cat.”
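A minimal PyTorch sketch of this Conv → ReLU → Pool → Fully Connected pipeline for 28×28 grayscale inputs; the filter counts and num_classes=10 are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv -> ReLU -> Pool blocks followed by a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),                   # deeper features
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)          # flatten feature maps for the dense layer
        return self.classifier(x)

model = SimpleCNN()
logits = model(torch.randn(8, 1, 28, 28))   # batch of 8 grayscale 28x28 images
print(logits.shape)                          # torch.Size([8, 10])
```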
CNN Architecture
a) AlexNet (2012)
● 8 layers (5 conv + 3 FC).
● ReLU activations (faster than sigmoid/tanh).
● Used Dropout to avoid overfitting.
● Trained on GPU — a breakthrough for deep learning.
Example: Classified 1.2 million ImageNet images into 1000 categories.
Impact: Sparked the modern deep learning revolution.
AlexNet (2012)
This was the first architecture that used GPU to boost the training performance. AlexNet consists of 5 convolution
layers, 3 max-pooling layers, 2 Normalized layers, 2 fully connected layers and 1 SoftMax layer. Each convolution
layer consists of a convolution filter and a non-linear activation function called “ReLU”. The pooling layers are used
to perform the max-pooling function and the input size is fixed due to the presence of fully connected layers. The input
size is mentioned at most of the places as 224x224x3 but due to some padding which happens it works out to be
227x227x3. Above all this AlexNet has over 60 million parameters.
CNN Architecture
b) VGGNet (2014)
● Simple, uniform architecture.
● Only 3×3 convolution filters used repeatedly.
● 16–19 layers deep.
● Great performance but huge parameters (~138M).
Example: Used widely as a feature extractor in computer vision.
VGGNet
● Inputs: VGGNet accepts 224×224-pixel images as input. To maintain a consistent input size for the ImageNet competition, the model's developers cropped out the central 224×224 patch of each image.
● Convolutional Layers: The VGG convolutional layers use the smallest feasible receptive field, 3×3, to capture left-to-right and up-to-down patterns. In addition, 1×1 convolution filters are used to transform the input linearly. Each convolution is followed by a ReLU unit, which shortens training time compared with saturating activations such as sigmoid or tanh. ReLU (rectified linear unit) is a piecewise linear function that outputs the input if the input is positive and zero otherwise. The convolution stride is fixed at 1 pixel so that spatial resolution is preserved after convolution (the stride is the number of pixels the filter shifts over the input matrix).
VGGNet
● Hidden Layers: The VGG network's hidden layers all use ReLU. Local Response Normalization (LRN) is typically not used with VGG, as it increases memory usage and training time without improving overall accuracy.
● Fully Connected Layers: VGGNet contains three fully connected layers. The first two have 4096 channels each, while the third has 1000 channels, one for each class.
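A minimal PyTorch sketch of VGG's repeated 3×3-convolution pattern; the vgg_block helper and the channel counts shown are illustrative, following the 64, 128, ... progression of VGG-16:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Stack of 3x3 convolutions (stride 1, padding 1) + ReLU, ended by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # halves height and width
    return nn.Sequential(*layers)

# First two stages of a VGG-16-style network
stem = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2))
out = stem(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 128, 56, 56]) -- 224 halved twice by the two pools
```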
CNN Architecture
Inception (GoogLeNet)
● Introduced Inception modules combining multiple filter sizes (1×1, 3×3, 5×5) in
parallel.
● 1×1 convolutions used for dimensionality reduction to reduce computation.
● Deeper but more efficient than VGG.
Inception (GoogLeNet)
Inception Module (GoogLeNet Style)
The Inception module runs multiple convolution paths in parallel, then concatenates the outputs
depth-wise.
Inception (GoogLeNet)
Purpose of each branch
● 1×1 Conv branch: keeps local details, cheap computation
● 1×1 → 3×3 Conv: medium-size receptive field
● 1×1 → 5×5 Conv: larger receptive field
● Pool → 1×1 Conv: captures background + reduces spatial variance
Why 1×1 conv?
● Reduces number of channels
● Adds non-linearity
● Makes 3×3 and 5×5 branches much cheaper
Inception (GoogLeNet)
Example: One Inception Module with Channel Sizes
Suppose the input feature map has 256 channels. Because the branch outputs are concatenated depth-wise, the output depth equals the sum of the channels produced by the four branches, as the sketch below shows.
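A minimal PyTorch sketch of such a module; the branch widths used below (64 + 128 + 32 + 32) are illustrative assumptions rather than the exact GoogLeNet values:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # Output depth = c1 + c3 + c5 + pool_proj
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# 256 input channels; branch widths 64 + 128 + 32 + 32 -> output depth 256
module = InceptionModule(256, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
print(module(torch.randn(1, 256, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```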
CNN Architecture
ResNet (2016)
● Introduced skip connections (residual blocks).
● Solved vanishing gradient problem by allowing gradients to flow directly through identity
connections.
● Enabled training of extremely deep networks (up to 152 layers).
Example: Became standard for most vision tasks (object detection, segmentation, etc.).
ResNet (2016)
Key Innovation → Skip Connections (Residual Learning)
The core idea is:
y = F(x) + x
where
● x = input to the residual block
● F(x) = output of the stacked convolutional layers
● x is added directly to F(x) through an identity (skip) connection
This solves the vanishing gradient problem and allows very deep networks (50, 101, 152 layers).
ResNet (2016)
1. Residual Block (Basic Block: for ResNet-18/34)
Block Structure
Input → Conv3×3 → BN → ReLU → Conv3×3 → BN → Add(Input) → ReLU → Output
2. Bottleneck Residual Block (for ResNet-50/101/152)
Uses 1×1 → 3×3 → 1×1 convolutions for efficiency.
Block Structure
Input → 1×1 → 3×3 → 1×1 → Add(Input) → ReLU
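A minimal PyTorch sketch of the basic residual block, assuming the input and output shapes match (no downsampling or projection shortcut):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-18/34 style block: Conv3x3 -> BN -> ReLU -> Conv3x3 -> BN -> add input -> ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: F(x) + x

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56])
```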
ResNet (2016)
Full ResNet Architecture Construction:
ResNet-50
Input: 224×224 RGB
Stage 1: Conv 7×7, 64 filters, stride 2
MaxPool 3×3, stride 2
Stage 2: [1×1, 64; 3×3, 64; 1×1, 256] × 3 blocks
Stage 3: [1×1,128; 3×3,128; 1×1,512] × 4 blocks
Stage 4: [1×1,256; 3×3,256; 1×1,1024] × 6 blocks
Stage 5: [1×1,512; 3×3,512; 1×1,2048] × 3 blocks
Global Average Pool
FC 1000 (Softmax)
ResNet (2016)
ResNet-101
Same as ResNet-50, except:
Stage 4: × 23 blocks (instead of 6)
ResNet-152
Stage 2: × 3
Stage 3: × 8
Stage 4: × 36
Stage 5: × 3
Why ResNet Is Powerful
✔ Skip connections allow gradient flow → prevents vanishing gradients
✔ Enables extremely deep models (152+ layers)
✔ Became the backbone for:
● Object detection (Faster R-CNN, Mask R-CNN)
● Image segmentation (DeepLab)
● Feature extraction in many CV tasks
✔ Very stable training even at large depth
Training a ConvNet
a) Weight Initialization
Proper initialization ensures stable gradients.
Example: Poor initialization → exploding or vanishing gradients → model doesn’t converge.
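A short PyTorch sketch of one common choice, He (Kaiming) initialization for ReLU networks; the small model it is applied to is only illustrative:

```python
import torch.nn as nn

def init_weights(module):
    """He (Kaiming) initialization, well suited to ReLU networks."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
model.apply(init_weights)   # applies the function to every submodule
```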
Training a ConvNet
b) Batch Normalization
● Normalizes outputs of each layer to have mean=0, variance=1.
● Reduces internal covariate shift.
● Allows higher learning rates and faster convergence.
● Acts like a regularizer.
Example: Without BatchNorm, training deep CNNs may oscillate or diverge.
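A small PyTorch sketch showing BatchNorm2d after a convolution; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)  # bias is redundant before BN
bn = nn.BatchNorm2d(16)   # normalizes each of the 16 channels across the batch

x = torch.randn(8, 3, 32, 32)
h = bn(conv(x))
# In training mode, each channel of h has approximately zero mean and unit variance
print(h.mean(dim=(0, 2, 3)))   # per-channel means, all close to 0
```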
Training a ConvNet
c) Hyperparameter Optimization
Hyperparameters = settings not learned by the network (e.g., learning rate, batch size, optimizer, number of layers).
Training a ConvNet
Optimization Methods:
1. Grid Search: Try all combinations (costly).
2. Random Search: Randomly sample combinations.
3. Bayesian Optimization: Smartly explore promising regions.
4. AutoML / HyperOpt / Optuna: Automated tuning.
✅ Example Workflow:
Building a CNN for dog–cat classification:
1. Use pretrained ResNet50.
2. Fine-tune with He initialization.
3. Apply BatchNorm + Dropout.
4. Tune learning rate, batch size, and optimizer.
5. Evaluate on test data.
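A hedged PyTorch/torchvision sketch of this workflow, assuming torchvision's pretrained ResNet-50 weights are available; the dropout rate, learning rate, and random batch are illustrative choices:

```python
import torch
import torch.nn as nn
from torchvision import models

# 1. Start from a pretrained ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# 2-3. Replace the 1000-class head with a 2-class dog/cat head, add Dropout,
#      and apply He initialization to the new layer
model.fc = nn.Sequential(nn.Dropout(0.5), nn.Linear(model.fc.in_features, 2))
nn.init.kaiming_normal_(model.fc[1].weight, nonlinearity='relu')

# 4. Hyperparameters to tune: learning rate, batch size, optimizer choice
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch (replace with real dog/cat data)
x, y = torch.randn(16, 3, 224, 224), torch.randint(0, 2, (16,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```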