
Assignment 1

1. Different types of Gradient Descent and their advantages and disadvantages.


a. Stochastic Gradient Descent
b. Batch Gradient Descent
c. Mini-batch Gradient Descent

a. Stochastic Gradient Descent:


Stochastic gradient descent (SGD) updates the model parameters using one training
example at a time, iterating through the examples in the dataset (see the code sketch
after the advantages and disadvantages below).

Advantages:
1. SGD is highly efficient when dealing with large datasets because it computes the gradient and updates the model parameters using only a single training example (rather than the full dataset) in each iteration.
2. The stochastic nature of SGD introduces noise into the gradient updates, which can help the algorithm escape local minima or saddle points and find better solutions faster.
3. SGD can be used for online learning, where the model is updated continuously as new data becomes available.
4. SGD is amenable to parallelization, allowing multiple mini-batches to be processed simultaneously on multiple processors or GPUs, leading to faster training times.

Disadvantages:
1. The stochastic nature of SGD can lead to noisy parameter updates, which may make convergence less stable.
2. Poorly chosen hyperparameters can lead to slow convergence or divergence.
3. SGD can be sensitive to the scaling of the input features and may require feature scaling (e.g., normalization) to perform well.
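As a rough, hedged illustration (not part of the original assignment), the sketch below applies the per-example SGD update to a simple linear-regression loss; the data arrays X and y, the learning rate, and the epoch count are assumptions chosen only for the example.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=10, seed=0):
    """SGD for linear regression: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):   # visit examples in shuffled order
            error = X[i] @ w + b - y[i]        # prediction error for one example
            w -= lr * error * X[i]             # gradient of the squared error w.r.t. w
            b -= lr * error                    # gradient w.r.t. the bias
    return w, b
```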

b. Batch Gradient Descent:


In Batch Gradient Descent, all of the training data is used to take a single step. We
compute the average of the gradients over all training examples and use that mean
gradient to update the parameters, so there is just one gradient-descent step per
epoch (see the code sketch after the advantages and disadvantages below).

Advantages:
1. BGD typically converges to a minimum of the loss function in a smooth, deterministic manner; it does not exhibit the oscillations or randomness seen in stochastic gradient descent (SGD).
2. Using the entire dataset in each iteration provides more accurate and predictable updates to the model parameters.
3. BGD often allows a larger learning rate than SGD because its stable convergence behavior makes it less prone to diverging. This can speed up convergence.
4. BGD is guaranteed to converge to a local minimum (or the global minimum if the loss function is convex) when the learning rate is sufficiently small.

Disadvantages:
1. BGD is less suitable for big-data scenarios or when memory resources are limited.
2. Because BGD computes the gradient using all data points, each iteration is typically slower than in SGD, which uses only a small amount of data per update.
3. BGD requires storing the entire dataset in memory for each iteration, which can be a limitation for large datasets that cannot fit in memory.
4. BGD is not well suited to online learning scenarios, where the model is continuously updated as new data arrives, because it requires processing the entire dataset in each iteration.
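Under the same illustrative assumptions as the SGD sketch, a batch-gradient-descent version performs one update per epoch using the gradient averaged over every training example.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """BGD for linear regression: one update per epoch from the mean gradient."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        error = X @ w + b - y                  # prediction errors for the whole dataset
        w -= lr * (X.T @ error) / n_samples    # mean gradient w.r.t. w
        b -= lr * error.mean()                 # mean gradient w.r.t. the bias
    return w, b
```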

c. Mini-batch Gradient Descent:


Mini-batch Gradient Descent is a cross-over between Batch Gradient Descent and SGD.
Instead of iterating through the entire dataset or a single observation at a time, we
split the dataset into small subsets (batches) and compute the gradient for each batch.
The batch size, i.e. the number of examples in a single batch, is a hyperparameter
(see the code sketch after the advantages and disadvantages below).

Advantages:
1. MBGD strikes a balance between the efficiency of BGD and the faster convergence of SGD. It can handle larger datasets efficiently because it computes gradients on a smaller, random subset of the data.
2. Compared to BGD, MBGD often converges faster because it benefits from some of the noise introduced by mini-batch selection, which can help escape local minima and saddle points.
3. Like SGD, MBGD introduces noise into the parameter updates, which acts as a form of implicit regularization. This can help prevent overfitting and improve generalization.

Disadvantages:
1. The randomness introduced by mini-batch selection can make the training process less deterministic and reproducible than BGD. It may also lead to noisy updates.
2. While MBGD is more memory-efficient than BGD since it does not require storing the entire dataset, the mini-batch itself still consumes memory, and larger mini-batches may require substantial memory resources.
3. Like SGD, MBGD is not guaranteed to converge to a minimum of the loss function, as it uses a noisy gradient estimate in each iteration.
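Again as an illustrative sketch with assumed hyperparameters, the mini-batch version shuffles the data each epoch and performs one update per batch of batch_size examples.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """MBGD for linear regression: one update per mini-batch of examples."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # reshuffle each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # indices of one mini-batch
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb
            w -= lr * (Xb.T @ error) / len(idx)        # mean gradient over the batch
            b -= lr * error.mean()
    return w, b
```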

2. Describe different types of Regularization techniques along with their advantages and disadvantages.

Regularization: Regularization is a way of controlling the complexity of a linear
regression model, by penalizing the coefficients that are not important or relevant for
the prediction. By doing so, regularization can reduce the variance of the model, as it
prevents overfitting and makes the model more robust to noise and outliers.
a. L2 Regularization (Ridge Regularization): Ridge regression adds the "squared
magnitude" of the coefficients as a penalty term to the loss function:

Loss = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

Here the second term, \lambda \sum_{j} \beta_j^2, is the L2 regularization element.

If lambda is zero, we get back ordinary least squares (OLS). However, if lambda is
very large, the penalty dominates and the model will under-fit. It is therefore
important how lambda is chosen. This technique works very well to avoid over-fitting.
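As a hedged illustration of how this is used in practice, the sketch below fits Ridge regression with scikit-learn on synthetic data; the alpha value (scikit-learn's name for lambda), the synthetic dataset, and the coefficient vector are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 5 features, only 3 of which actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

X_scaled = StandardScaler().fit_transform(X)   # scaling lets the penalty treat features equally
ridge = Ridge(alpha=1.0)                       # alpha plays the role of lambda
ridge.fit(X_scaled, y)
print(ridge.coef_)                             # coefficients shrunk toward, but not exactly to, zero
```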
Advantages:
1. Ridge regression helps prevent overfitting by adding a penalty term that shrinks the coefficients toward zero.
2. Ridge regression is particularly useful when dealing with multicollinearity, which occurs when independent variables (features) in the dataset are highly correlated.
3. Ridge regression strikes a balance between bias and variance, making it suitable when you want to prevent overfitting without necessarily performing feature selection.

Disadvantages:
1. While retaining all features can be an advantage in some cases, it can be a disadvantage when the dataset contains many irrelevant or noisy features: Ridge regression does not perform feature selection and keeps all features in the model.
2. If you have strong reasons to believe that many features are irrelevant, Ridge regression may not be the best choice, as it does not force coefficients to zero the way Lasso regression does.
b. L1 Regularization (Lasso Regression - Least Absolute Shrinkage and Selection
Operator): Lasso adds the "absolute value of magnitude" of the coefficients as a
penalty term to the loss function:

Loss = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

Again, if lambda is zero we get back OLS, whereas a very large value will drive the
coefficients to zero and the model will under-fit.
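A matching sketch for Lasso, on the same kind of assumed synthetic data, shows the feature-selection effect: coefficients of irrelevant features are typically driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 5 features, only 3 of which actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

X_scaled = StandardScaler().fit_transform(X)   # Lasso is sensitive to feature scale
lasso = Lasso(alpha=0.1)                       # alpha plays the role of lambda
lasso.fit(X_scaled, y)
print(lasso.coef_)                             # irrelevant features tend to get exactly-zero coefficients
```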

Advantages:
1. Lasso is known for its feature-selection property. It encourages sparsity in the model by driving some of the coefficients to exactly zero.
2. Lasso helps prevent overfitting by shrinking the coefficients toward zero. This regularization makes the model less sensitive to noise and outliers in the training data, improving its generalization performance on unseen data.
3. Lasso is particularly useful when dealing with high-dimensional datasets with many features.

Disadvantages:
1. Lasso may struggle when dealing with highly correlated features because it tends to arbitrarily select one feature over another.
2. Lasso forces coefficients to exactly zero, so it cannot perform partial feature selection: if two features have similar importance, Lasso will select one while driving the other to zero.
3. Lasso is sensitive to the scale of the features. Feature scaling, such as normalization or standardization, may be required to ensure that all features are treated equally.
c. Elastic net Regression:
The elastic net method overcomes the limitations of the LASSO (least absolute
shrinkage and selection operator) method, whose L1 penalty has several limitations.
For example, in the "large p, small n" case (high-dimensional data with few
examples), the LASSO selects at most n variables before it saturates. Also, if there
is a group of highly correlated variables, the LASSO tends to select one variable
from the group and ignore the others.

The elastic net adds a quadratic (L2) penalty on top of the L1 penalty. The quadratic
penalty term makes the loss function strongly convex, so it has a unique minimum.
The elastic net method includes the LASSO and ridge regression as special cases.
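As a final hedged sketch, scikit-learn's ElasticNet combines both penalties; the alpha and l1_ratio values (the mix between the L1 and L2 penalties) and the synthetic data are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data, as in the Ridge and Lasso sketches above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)
X_scaled = StandardScaler().fit_transform(X)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)     # l1_ratio=1.0 is pure Lasso, 0.0 is pure Ridge
enet.fit(X_scaled, y)
print(enet.coef_)                              # correlated features tend to be kept together
```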
