
• Differentiation

o Single variable differentiation


[Figure: curve Y = f(X), X a scalar, with a secant through (x1, y1) and (x2, y2); the horizontal leg of the triangle is Δx, the vertical leg is Δy, and θ is the angle the secant makes with the x axis]
o Assuming x1 & x2 are very close, differentiation is the rate of change of y w.r.t. x in the vicinity of x1.
o Geometrically,
  (y2 − y1)/(x2 − x1) = Δy/Δx = tan θ
o Mathematically, dy/dx = lim(Δx→0) Δy/Δx
o The derivative of y w.r.t. x is the rate of change of y w.r.t. x as Δx → 0. Only as x2 comes closer to x1 does Δx → 0.
o As Δx → 0, i.e., the difference between x1 & x2 tends to 0, the triangle becomes so small that the hypotenuse becomes the tangent touching the curve at x1. We calculate tan θ to get the derivative, where θ is the angle between the tangent & the x axis.
[Figure: the same curve Y = f(X) with Δx → 0, so the secant becomes the tangent at x1; θ is the angle between the tangent and the x axis]
  dy/dx = lim(Δx→0) Δy/Δx = tan θ = slope of tangent to f(x)

  [dy/dx] at x1 = slope of tangent to f(x) at x = x1
o Similar to x1 we can have x2, x3, x4. We calculate tan θ of the tangent to the curve with the x axis at each of these points.

[Figure: tangents drawn to the curve at x1, x2, x3, x4, each making its own angle (θ, θ′, …) with the x axis]
Rohan Sawant 3.8 Solving Optimization Problems 1


o Basic mathematical functions and their derivatives

  function        derivative
  xⁿ              n·xⁿ⁻¹
  constant        0
  c·xⁿ            c·n·xⁿ⁻¹
  log x           1/x
  eˣ              eˣ
  f(x) + g(x)     (d/dx)f(x) + (d/dx)g(x)
  f{g(x)}         (df/dg) · (dg/dx)
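The rules in the table above can be checked numerically. Below is a minimal sketch using a central finite difference; the function choices and step size `h` are illustrative.

```python
# Sketch: checking the derivative table numerically with a central
# finite difference; h = 1e-6 is an illustrative step size.
import math

def num_deriv(f, x, h=1e-6):
    """Central-difference approximation of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
# d/dx x^3 = 3x^2 (power rule)
assert abs(num_deriv(lambda t: t**3, x) - 3 * x**2) < 1e-4
# d/dx log x = 1/x
assert abs(num_deriv(math.log, x) - 1 / x) < 1e-6
# d/dx e^x = e^x
assert abs(num_deriv(math.exp, x) - math.exp(x)) < 1e-4
# chain rule: d/dx log(x^2) = (1/x^2) * 2x = 2/x
assert abs(num_deriv(lambda t: math.log(t**2), x) - 2 / x) < 1e-6
```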
o Interpretation of tan θ

  θ             tan θ
  0°            0
  0° – 90°      > 0
  90°           undefined (→ ∞)
  90° – 180°    < 0

• Maxima and Minima


[Figure: curve Y = f(x) with a maximum and a minimum; at both extrema the tangent is horizontal, so tan θ = tan 0 = 0]

o The maxima and minima of a function, known as extrema, are the largest and smallest values of the function.
  ▪ f(x) = x² − 3x + 2
  ▪ df/dx = 2x − 3
  ▪ Setting df/dx = 0 gives x = 3/2
  ▪ f(x = 1.5) = 1.5² − 3 · 1.5 + 2 = −0.25
  ▪ f(x = 1) = 1² − 3 · 1 + 2 = 0
o So we have found that at x = 1.5 the slope of the tangent is 0. Now x = 1.5 can be either a maximum or a minimum.
o But f(x = 1) > f(x = 1.5), so x = 1.5 is a minimum.
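The worked example above can be verified in a few lines; the comparison points 1.0 and 2.0 are illustrative neighbours of the critical point.

```python
# Sketch of the worked example: f(x) = x^2 - 3x + 2 has f'(x) = 2x - 3,
# so the critical point is x = 3/2; comparing nearby values confirms a minimum.
def f(x):
    return x**2 - 3 * x + 2

def df(x):
    return 2 * x - 3

x_crit = 3 / 2              # solve f'(x) = 0
assert df(x_crit) == 0
assert f(x_crit) == -0.25
# f is larger just left and right of 1.5, so 1.5 is a minimum
assert f(1.0) > f(x_crit) and f(2.0) > f(x_crit)
```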
• Special cases
o Not every function has both maxima and minima.
o f(x) = x² doesn’t have a maximum and f(x) = −x² doesn’t have a minimum



o No Maxima or Minima

o Multiple Minima & Maxima

• Vector calculus: Grad or Del (∇)

o If x is a high-dimensional vector,

  f(x) = y = aᵀx = Σ(i=1..d) aᵢxᵢ | a is a constant vector

o The derivative of f(x) where x is a vector is the gradient, the column vector of partial derivatives:

  df/dx = ∇ₓf = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xd]ᵀ

o Expanding the sum: Σ(i=1..d) aᵢxᵢ = a1·x1 + ⋯ + ad·xd

o Each partial derivative keeps only its own term: ∂/∂x1 (a1·x1) = a1

o Therefore ∇ₓf = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xd]ᵀ = [a1, a2, …, ad]ᵀ = a

o If x is a scalar,
  (d/dx)(ax) = a
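The result ∇ₓf = a for f(x) = aᵀx can be checked numerically, dimension by dimension. A minimal pure-Python sketch; the vectors `a` and `x` are illustrative.

```python
# Sketch: for f(x) = a^T x, each partial ∂f/∂x_i is a_i, so ∇f = a.
# Verified here with central finite differences (pure Python, no libraries).
def f(x, a):
    return sum(ai * xi for ai, xi in zip(a, x))

def grad(f, x, a, h=1e-6):
    """Finite-difference gradient of f at x, one coordinate at a time."""
    g = []
    for i in range(len(x)):
        xp = x[:]; xp[i] += h
        xm = x[:]; xm[i] -= h
        g.append((f(xp, a) - f(xm, a)) / (2 * h))
    return g

a = [2.0, -1.0, 0.5]
x = [1.0, 4.0, -3.0]
g = grad(f, x, a)
assert all(abs(gi - ai) < 1e-6 for gi, ai in zip(g, a))  # ∇f = a
```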
• Gradient descent
o To solve an optimization problem on f(x), we want to find the optimum value of x where the derivative is 0. But such equations are often non-trivial to solve analytically:

  df/dx = 0 or ∇ₓf = 0
o Gradient descent is a first-order iterative optimization algorithm for finding a local minimum x* of a differentiable function.

o Our main goal is to find x* = argminₓ f(x). It’s easy to find a maximum if we know how to find a minimum:
  argminₓ(f(x)) = argmaxₓ(−f(x))
  argmaxₓ(f(x)) = argminₓ(−f(x))
o We pick a point randomly and call it x0
▪ after 1st iteration, we get a value 𝑥1
▪ after 2nd iteration, it will be 𝑥2
▪ Finally, we reach 𝑥𝑘 which is very close to 𝑥 ∗
o After each iteration we get closer to 𝑥 ∗

[Figure: convex curve with the starting point x0 either left of the minimum (θ ≥ 90°, tan θ < 0) or right of it (0 < θ < 90°, tan θ > 0); at the minimum x*, θ = 0 and tan θ = 0]

o On the left side of the minimum, tan θ < 0 & on the right side of it, tan θ > 0. At the minimum point, the slope changes its sign.

[Figure: the same curve showing the iterates x1 approaching x* from both sides; the slope magnitude shrinks toward tan θ = 0 as the points approach the minimum]

o On the left side of the minimum, as the point approaches the minimum, the slope ↑ (rises from negative toward 0)
o On the right side of the minimum, as the point approaches the minimum, the slope ↓ (falls from positive toward 0)
o In gradient descent, we pick the first point x0 randomly
o Case 1: x0 is on the right side of the minimum.

[Figure: x0 to the right of x*; the update moves it left to x1, closer to the minimum, where θ = 0 and tan θ = 0]

o Now x1 has to be calculated such that it will be closer to x*:

  x1 = x0 − η [df/dx] at x0 | η = step size (> 0)

o Let’s say η = 1 and [df/dx] at x0 = tan θ > 0
o As η [df/dx] at x0 > 0, we get x1 < x0
o E.g., say x0 = 5 & [df/dx] at x0 = 0.5, then x1 = 5 − 1 · 0.5 = 4.5

o Case 2: x0 is on the left side of the minimum.

  [Figure: x0 to the left of x*; the update moves it right to x1, closer to the minimum, where θ = 0 and tan θ = 0]

o Let’s say η = 1 and [df/dx] at x0 = tan θ < 0
o As η [df/dx] at x0 < 0, we get x1 > x0
o E.g., say x0 = −5 & [df/dx] at x0 = −0.5, then x1 = −5 − 1 · (−0.5) = −4.5

o In the same iterative way, we keep on finding x2, x3, …, xk:

  xk = xk−1 − η [df/dx] at xk−1

o If |xk − xk−1| is very small, then terminate and we conclude

  x* ≈ xk
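The loop above can be written out directly. A minimal sketch on the earlier example f(x) = x² − 3x + 2 (minimum at x* = 1.5); the start point, step size, and tolerance are illustrative.

```python
# Minimal gradient-descent sketch on f(x) = x^2 - 3x + 2 (minimum at x* = 1.5),
# using the update x_k = x_{k-1} - eta * f'(x_{k-1}) and the small-jump stopping rule.
def df(x):
    return 2 * x - 3

x, eta = 5.0, 0.1               # illustrative start x0 and fixed step size
for _ in range(1000):
    x_new = x - eta * df(x)
    if abs(x_new - x) < 1e-8:   # terminate when the jump is very small
        break
    x = x_new

assert abs(x - 1.5) < 1e-6      # converged close to x*
```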

o To conclude the observations,

  |[df/dx] at x0| ≥ |[df/dx] at x1| ≥ |[df/dx] at x2| ≥ … ≥ |[df/dx] at xk−1| ≥ |[df/dx] at xk|

o The jump between two consecutive points reduces as we move ahead



• Learning rate or step-size
[Figure: f(x) = x² with iterates oscillating across the minimum x*: x0, x2, x4, x6, x8 on one side and x1, x3, x5, x7, x9 on the other; at x*, θ = 0 and tan θ = 0]

▪ Let’s say xᵢ = 0.5, f(x) = x², η = 1
▪ xᵢ₊₁ = xᵢ − η [df/dx] at xᵢ = 0.5 − 1 · (2 · 0.5) = −0.5
▪ xᵢ₊₂ = xᵢ₊₁ − η [df/dx] at xᵢ₊₁ = −0.5 − 1 · (2 · (−0.5)) = 0.5

o So we are not converging towards x*: the iterates oscillate between 0.5 & −0.5.
o The solution to this problem is to change the learning rate (step size) at each iteration.
o We need to find a function h(i) such that
  η = h(i) | i = iteration number, s.t. as i ↑, η ↓
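Both behaviours are easy to reproduce. A sketch on f(x) = x² with f′(x) = 2x; the schedule η = 1/(i+3) is just one illustrative choice of a decreasing h(i).

```python
# Sketch of the oscillation problem: on f(x) = x^2 (f'(x) = 2x), a fixed
# eta = 1 bounces between +0.5 and -0.5 forever, while a decaying
# schedule eta = h(i) converges toward x* = 0.
def df(x):
    return 2 * x

# fixed step size: oscillation (x -> x - 2x = -x every step)
x = 0.5
for i in range(10):
    x = x - 1.0 * df(x)
assert x == 0.5                   # back where it started after an even number of steps

# decaying step size: convergence
x = 0.5
for i in range(100):
    eta = 1.0 / (i + 3)           # h(i) decreasing in i (illustrative schedule)
    x = x - eta * df(x)
assert abs(x) < 0.05              # iterates now approach x* = 0
```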
• Gradient descent for linear regression

  (W*, W0*) = argmin(w, w0) Σ(i=1..n) {yᵢ − (Wᵀxᵢ)}²

o Here we are trying to find the minimizing W; we will derive the formula first without considering the intercept W0 or a regularization term.
o Here xᵢ & yᵢ are constants, and we need to find the W minimizing our loss function L(W) for linear regression:

  L(W) = Σ(i=1..n) {yᵢ − (Wᵀxᵢ)}² , W* = argmin(w) L(W)
o Since W is a vector, we need vector calculus:

  ∇w L = Σ(i=1..n) 2{yᵢ − (Wᵀxᵢ)} · (−xᵢ)
o Applying gradient descent, we pick a random W0:

  W1 = W0 − η · Σ(i=1..n) {−2xᵢ}{yᵢ − (W0ᵀxᵢ)}
  W2 = W1 − η · Σ(i=1..n) {−2xᵢ}{yᵢ − (W1ᵀxᵢ)}
  ⋮
  Wk = Wk−1 − η · Σ(i=1..n) {−2xᵢ}{yᵢ − (Wk−1ᵀxᵢ)}

o We compute this until |Wk − Wk−1| is a very small value
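The update above can be sketched on 1-D toy data; the data y = 3x, start point, and step size are illustrative, so the fitted weight should approach 3.

```python
# Sketch of the update W_k = W_{k-1} - eta * sum_i (-2 x_i)(y_i - W^T x_i)
# on 1-D toy data generated as y = 3x (so W should approach 3).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w, eta = 0.0, 0.01              # illustrative W0 and step size
for _ in range(500):
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
    w_new = w - eta * grad
    if abs(w_new - w) < 1e-10:  # terminate when the jump is very small
        break
    w = w_new

assert abs(w - 3.0) < 1e-4
```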


o But if n (the number of data points) is large, computing the sum is very expensive. The solution for this is the stochastic gradient descent algorithm.
• Stochastic (randomly determined) Gradient Descent

  Wk = Wk−1 − η · Σ(i ∈ batch) {−2xᵢ}{yᵢ − (Wk−1ᵀxᵢ)}

o In SGD, instead of taking the sum over all n points, we sum over a random batch of k points, where 1 ≤ k ≤ n. k is the batch size.
o At each iteration, pick a random set of k points and perform a gradient-descent step on them.
o As we pick k random points, we call this gradient descent Stochastic Gradient Descent.
o As k reduces from n towards 1, the number of iterations required to reach the optimum increases. At each iteration the batch of k points is picked randomly, and η is reduced progressively. The optimum values reached by both methods will be close:

  w*(GD) ≈ w*(SGD)
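A minimal SGD sketch on the same kind of toy data (y = 3x); the batch size k = 2 and the decaying schedule for η are illustrative choices.

```python
# SGD sketch: each iteration uses a random batch of k points instead of all n.
# Toy data y = 3x, so the weight should still approach 3.
import random

random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3 * x for x in xs]

w, k = 0.0, 2                              # batch size k, with 1 <= k <= n
for i in range(2000):
    eta = 0.01 / (1 + 0.01 * i)            # eta decays with the iteration number
    batch = random.sample(range(len(xs)), k)   # pick k random points
    grad = sum(-2 * xs[j] * (ys[j] - w * xs[j]) for j in batch)
    w = w - eta * grad

assert abs(w - 3.0) < 0.1                  # close to the full-batch optimum
```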
• Constrained Optimization & PCA
o The general constrained optimization problem is

  x* = argmaxₓ f(x) [objective function]
  such that g(x) = c [equality constraint], h(x) ≥ d [inequality constraint]

o We solve such problems by using Lagrange multipliers λ, μ (both ≥ 0):

  ℒ(x, λ, μ) = f(x) − λ{g(x) − c} − μ{d − h(x)}

o Now we solve the above equation via ∂ℒ/∂x = 0, ∂ℒ/∂λ = 0, ∂ℒ/∂μ = 0
o By solving them we get the values x̃, λ̃, μ̃
o Then x* = argmaxₓ f(x) = x̃
o The PCA objective is maxᵤ (1/n) Σ(i=1..n) (uᵀxᵢ)² s.t. uᵀu = 1; here uᵀu = 1 is an equality constraint
o The above equation can also be written as

  maxᵤ uᵀSu such that uᵀu = 1, where S = Cov(X) = XᵀX/n

o S is the covariance matrix of X (assuming X is mean-centered)
o Solving this problem using Lagrange multipliers:

  ℒ(u, λ) = uᵀSu − λ{uᵀu − 1}
  ∂ℒ/∂u = 2Su − 2λu = 0
  ⟹ Su = λu
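The stationarity condition Su = λu can be checked on a tiny example. A pure-Python sketch with an illustrative 2×2 symmetric matrix whose top eigenvector is known to be (1, 1)/√2.

```python
# Sketch: for a 2x2 covariance-like matrix S, the unit vector u maximizing
# u^T S u is the top eigenvector, and S u = lambda u holds at the optimum.
import math

S = [[2.0, 1.0],
     [1.0, 2.0]]              # illustrative symmetric matrix

# largest eigenvalue of [[a, b], [b, c]] from the characteristic quadratic
a, b, c = S[0][0], S[0][1], S[1][1]
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

u = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # known top eigenvector for this S

Su = [S[0][0] * u[0] + S[0][1] * u[1],
      S[1][0] * u[0] + S[1][1] * u[1]]
assert all(abs(Su[i] - lam * u[i]) < 1e-12 for i in range(2))  # S u = lambda u
assert abs(u[0] ** 2 + u[1] ** 2 - 1) < 1e-12                  # u^T u = 1
```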
o Here S is a covariance matrix, u is an eigenvector of S, & λ is the corresponding eigenvalue of S. Since uᵀSu = λ at a solution, the maximizer is the eigenvector with the largest eigenvalue.
o This is because covariance matrix accounts for variability in the dataset, and variability of the dataset is a way to summarize how much
information we have in the data (Imagine a variable with all same values as its observations, then the variance is 0, and intuitively
speaking, there’s not too much information from this variable because every observation is the same). The diagonal elements of the
covariance matrix stand for variability of each variable itself, and off-diagonal elements in covariance matrix represents how variables are
correlated with each other.

o Ultimately, we want our transformed variables to contain as much as information (or equivalently, account for as much variability as
possible). PCA performs such orthogonal transformation to convert correlated variables into a new set of uncorrelated variables called
principal components. PCA makes sure that the 1st principal component has the largest possible variance, which means the 1st
component accounts for the most variability in the data. And that each succeeding component has the highest variance possible under the
condition that it is orthogonal to the preceding components.
o Of course, to account for 100% of the variability, we usually need the same number of components as the original number of variables.
But usually, with a much smaller number of new components, we can explain almost 90% of the variability. Thus, PCA helps us reduce our
focus from a possibly large number of variables to a relatively small number of new components, while preserving most of the information
in the data.
• Logistic regression formulation using Lagrange multipliers

  W* = argmin(w) (logistic loss) + λwᵀw

o The above formula is actually a constrained optimization problem:

  W* = argmin(w) (logistic loss) such that wᵀw = 1
  ℒ(w, λ) = logistic loss − λ{1 − wᵀw}

o For the equality constraint, 1 − wᵀw or wᵀw − 1 — both are fine.

  ℒ(w, λ) = logistic loss − λ + λwᵀw

o In the above formula we can ignore −λ as it is a constant w.r.t. w. The rest of the equation is the same as the original one.
o Hence, one way to understand regularization is as imposing an equality constraint.



• Lagrange multipliers example
o Find max(h, s) 200·h^(2/3)·s^(1/3) such that 20h + 170s = 20000
o f(x) [objective function] = 200·h^(2/3)·s^(1/3)
o g(x) = c [equality constraint]: 20h + 170s = 20000
o ℒ(x, λ) = f(x) − λ{g(x) − c}
o ℒ(h, s, λ) = 200·h^(2/3)·s^(1/3) − λ(20h + 170s − 20000)
o ℒ(h, s, λ) = 200·h^(2/3)·s^(1/3) − 20hλ − 170sλ + 20000λ
o ∂ℒ/∂h = (2/3)·200·h^(−1/3)·s^(1/3) − 20λ = 0
o ∂ℒ/∂s = (1/3)·200·h^(2/3)·s^(−2/3) − 170λ = 0
o ∂ℒ/∂λ = −(20h + 170s − 20000) = 0
o Solving the above 3 equations gives h ≈ 666.67, s ≈ 39.22, λ ≈ 2.59
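The solution can be verified by plugging it back into the three equations. For this Cobb-Douglas-style objective, stationarity puts fraction 2/3 of the budget on h and 1/3 on s, which is how the closed forms below are obtained.

```python
# Sketch verifying the worked example: maximize 200 h^(2/3) s^(1/3)
# subject to 20h + 170s = 20000. Stationarity allocates 2/3 of the
# budget to h and 1/3 to s.
h = 20000 * (2 / 3) / 20      # ~666.67
s = 20000 * (1 / 3) / 170     # ~39.22

# budget constraint holds
assert abs(20 * h + 170 * s - 20000) < 1e-9

# stationarity: both partials yield the same multiplier lambda
lam1 = (400 / 3) * h ** (-1 / 3) * s ** (1 / 3) / 20
lam2 = (200 / 3) * h ** (2 / 3) * s ** (-2 / 3) / 170
assert abs(lam1 - lam2) < 1e-9
assert abs(lam1 - 2.59) < 0.01
```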



• Why does L1 regularization create sparsity?
o L2 Regularization
  ▪ W* = argmin(w) (logistic loss) + λ||W||₂², W = [W1, W2, ⋯, Wd]
  ▪ Considering the regularizer alone: min(w1,…,wd) (W1² + W2² + ⋯ + Wd²)
  ▪ For a single weight: ℒ2(w1) = W1², so ∂ℒ2/∂w1 = 2W1
  ▪ By gradient descent, let (W1)ⱼ be positive:
    (W1)ⱼ₊₁ = (W1)ⱼ − η [∂ℒ2/∂w1] at (W1)ⱼ = (W1)ⱼ − η · 2(W1)ⱼ
  ▪ Let (W1)ⱼ = 0.1 and η = 0.01: (W1)ⱼ₊₁ = 0.1 − 0.01 · (2 · 0.1) = 0.098
  ▪ As we come closer to the minimum, the slope decreases and the derivative term becomes smaller; the rate of change keeps shrinking.
  ▪ By ℒ2 reg., the value of (W1)ⱼ₊₁ doesn’t change much compared to (W1)ⱼ, so there is little chance of W1 becoming exactly 0 after k iterations.
o L1 Regularization
  ▪ W* = argmin(w) (logistic loss) + λ||W||₁, W = [W1, W2, ⋯, Wd]
  ▪ Considering the regularizer alone: min(w1,…,wd) (|W1| + |W2| + ⋯ + |Wd|)
  ▪ For a single weight: ℒ1(w1) = |W1|, so ∂ℒ1/∂w1 = +1 if W1 > 0, −1 if W1 < 0
  ▪ By gradient descent, let (W1)ⱼ be positive:
    (W1)ⱼ₊₁ = (W1)ⱼ − η [∂ℒ1/∂w1] at (W1)ⱼ = (W1)ⱼ − η · 1
  ▪ Let (W1)ⱼ = 0.1 and η = 0.01: (W1)ⱼ₊₁ = 0.1 − 0.01 · 1 = 0.09
  ▪ As we come closer to the minimum, the slope stays constant, so the derivative term (and the rate of change) remains constant.
  ▪ ℒ1 reg. keeps reducing W1 by a constant step towards W* = 0, so there is a much higher chance of W1 becoming exactly 0 after k iterations.
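The two columns above can be contrasted in a few lines of code; the start value, step size, and iteration count are illustrative. The clamp at 0 for L1 plays the role of stopping on the subgradient once the constant step overshoots.

```python
# Sketch of the sparsity argument: run the same gradient-descent steps on
# L2(w) = w^2 and L1(w) = |w|, starting from w = 0.1 with eta = 0.01.
eta, steps = 0.01, 200

# L2: derivative 2w shrinks with w, so updates fade and w never hits 0
w2 = 0.1
for _ in range(steps):
    w2 = w2 - eta * (2 * w2)          # multiplicative decay: w *= 0.98

# L1: derivative is +1 for w > 0, a constant push toward 0
w1 = 0.1
for _ in range(steps):
    if w1 > 0:
        w1 = max(0.0, w1 - eta * 1)   # clamp at 0 once the constant step overshoots

assert w1 == 0.0                       # L1 drives the weight exactly to zero
assert w2 > 0.0                        # L2 only shrinks it (0.1 * 0.98**200, small but nonzero)
```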
