[Figure: curve 𝑦 = 𝑓(𝑥) with points (𝑥1, 𝑦1) and (𝑥2, 𝑦2), showing Δ𝑥, Δ𝑦 and the angle 𝜃 of the secant; 𝑥 is a scalar]
o Assuming 𝑥1 & 𝑥2 are very close, differentiation is the rate of change of 𝑦 w.r.t. 𝑥 in the vicinity of 𝑥1.
o Geometrically, (𝑦2 − 𝑦1)/(𝑥2 − 𝑥1) = Δ𝑦/Δ𝑥 = tan 𝜃
o Mathematically, d𝑦/d𝑥 = lim_{Δ𝑥→0} Δ𝑦/Δ𝑥
o The derivative of 𝑦 w.r.t. 𝑥 is the rate of change of 𝑦 w.r.t. 𝑥 as Δ𝑥 → 0. Only as 𝑥2 comes closer towards 𝑥1 does Δ𝑥 → 0.
o As Δ𝑥 → 0, i.e., the difference between 𝑥1 & 𝑥2 tends to 0, the triangle becomes so small that the hypotenuse becomes a tangent touching the curve at 𝑥1. We calculate tan 𝜃 to get the derivative, where 𝜃 is the angle between the tangent & the x-axis.
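The limit above can be checked numerically; a minimal sketch, using 𝑓(𝑥) = 𝑥² as an assumed example (its true derivative at 𝑥1 = 2 is 4):

```python
# Secant slope Δy/Δx approaches the derivative dy/dx as Δx → 0.
# Assumed example: f(x) = x^2, whose derivative at x1 = 2 is 4.
f = lambda x: x**2
x1 = 2.0
for dx in (1.0, 0.1, 0.001):
    slope = (f(x1 + dx) - f(x1)) / dx  # Δy / Δx for this Δx
    print(dx, slope)  # slope tends towards 4.0 as dx shrinks
```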
[Figure: 𝑌 = 𝑓(𝑋) with the tangent at 𝑥1 making angle 𝜃 with the x-axis as 𝑥2 → 𝑥1; Δ𝑥, Δ𝑦 shown, 𝑋 a scalar]
d𝑦/d𝑥 = lim_{Δ𝑥→0} Δ𝑦/Δ𝑥 = tan 𝜃 = slope of tangent to 𝑓(𝑥)
[d𝑦/d𝑥]_{𝑥1} = slope of tangent to 𝑓(𝑥) at 𝑥 = 𝑥1
o Similar to 𝑥1, we can have 𝑥2, 𝑥3, 𝑥4. We calculate tan 𝜃 of the tangent to the curve with the 𝑥-axis at each of these points.
[Figure: tangents to the curve at points 𝑥1, 𝑥2, 𝑥3, 𝑥4, with angles 𝜃 and 𝜃′]
𝜃               tan 𝜃
𝜃 = 0°          0
0° < 𝜃 < 90°    > 0
𝜃 = 90°         ∞
90° < 𝜃 < 180°  < 0
[Figure: minima of a curve; at the minima the tangent is horizontal, so tan 𝜃 = tan 0 = 0]
o The maxima and minima of a function, known as extrema, are the largest and smallest values of the function.
▪ 𝑓(𝑥) = 𝑥² − 3𝑥 + 2
▪ d𝑓/d𝑥 = 2𝑥 − 3
▪ Setting d𝑓/d𝑥 = 0 gives 𝑥 = 3/2 = 1.5
▪ 𝑓(𝑥 = 1.5) = 1.5² − 3 ∗ 1.5 + 2 = −0.25
▪ 𝑓(𝑥 = 1) = 1² − 3 ∗ 1 + 2 = 0
o So, we have found that at 𝑥 = 1.5 the slope of the tangent is 0. Now 1.5 can be either a maxima or a minima.
o But 𝑓(𝑥 = 1) > 𝑓(𝑥 = 1.5), so 𝑥 = 1.5 is a minima.
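The worked example above can be verified directly; a minimal sketch:

```python
# f(x) = x^2 - 3x + 2 from the worked example above
f = lambda x: x**2 - 3*x + 2
df = lambda x: 2*x - 3          # its derivative

assert df(1.5) == 0             # slope of tangent is 0 at x = 3/2
assert f(1.5) == -0.25          # candidate extremum value
assert f(1.0) == 0.0            # nearby point has a larger value,
                                # so x = 1.5 is a minima
```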
• Special cases
o Not every function has both a maxima and a minima.
o 𝑓(𝑥) = 𝑥² doesn’t have a maxima and 𝑓(𝑥) = −𝑥² doesn’t have a minima
𝑓(𝑥) = 𝑦 = 𝑎^𝑇𝑥 = ∑_{𝑖=1}^{𝑑} 𝑎𝑖𝑥𝑖 | 𝑎 is a constant vector
∑_{𝑖=1}^{𝑑} 𝑎𝑖𝑥𝑖 = 𝑎1𝑥1 + ⋯ + 𝑎𝑑𝑥𝑑
∂/∂𝑥1 (𝑎1𝑥1) = 𝑎1 ∂/∂𝑥1 (𝑥1) = 𝑎1
𝛻𝑥𝑓 = [∂𝑓/∂𝑥1, ∂𝑓/∂𝑥2, …, ∂𝑓/∂𝑥𝑑]^𝑇 = [𝑎1, 𝑎2, …, 𝑎𝑑]^𝑇 = 𝑎
o If 𝑥 is a scalar, d(𝑎𝑥)/d𝑥 = 𝑎
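The result 𝛻𝑥(𝑎^𝑇𝑥) = 𝑎 can be checked with a finite-difference sketch (the vector 𝑎 and point 𝑥0 below are arbitrary assumed examples):

```python
import numpy as np

a = np.array([2.0, -1.0, 3.0])       # constant vector
f = lambda x: a @ x                  # f(x) = a^T x

x0 = np.array([1.0, 0.5, -2.0])      # arbitrary evaluation point
eps = 1e-6
# central-difference estimate of each partial derivative of f
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
# grad matches a, confirming the gradient of a^T x is a
```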
Rohan Sawant 3.8 Solving Optimization Problems 3
• Gradient descent
o To solve an optimization problem on 𝑓(𝑥), we want to find the optimum value 𝑥 where the derivative is 0. But such equations are often non-trivial to solve analytically.
d𝑓/d𝑥 = 0 or 𝛻𝑥𝑓 = 0
o Gradient descent is a first-order iterative optimization algorithm for finding a local minimum 𝑥* of a differentiable function.
o Our main goal is to find 𝑥* = argmin_𝑥 𝑓(𝑥). It’s easy to find the maxima if we know the minima:
𝑚𝑖𝑛𝑖𝑚𝑎(𝑓(𝑥)) = 𝑚𝑎𝑥𝑖𝑚𝑎(−𝑓(𝑥))
𝑚𝑎𝑥𝑖𝑚𝑎(𝑓(𝑥)) = 𝑚𝑖𝑛𝑖𝑚𝑎(−𝑓(𝑥))
o We pick a starting point randomly and call it 𝑥0
▪ after 1st iteration, we get a value 𝑥1
▪ after 2nd iteration, it will be 𝑥2
▪ Finally, we reach 𝑥𝑘 which is very close to 𝑥 ∗
o After each iteration we get closer to 𝑥 ∗
[Figure: two starting points 𝑥0 on either side of the minima 𝑥*: one with 𝜃 ≥ 90°, the other with 0 < 𝜃 < 90°; at 𝑥*, 𝜃 = 0 and tan(𝜃) = 0]
o On the left side of the minima, tan 𝜃 < 0 & on the right side of it tan 𝜃 > 0. At the minima point, the slope changes its sign
o On the left side of the minima, as the point starts reaching the minima, the slope ↑ (towards 0)
o On the right side of the minima, as the point starts reaching the minima, the slope ↓ (towards 0)
o In Gradient descent, we pick a first point randomly 𝑥0
o Case 1: 𝑥0 is on the left side of the minima, where the slope is negative.
[Figure: starting from 𝑥0, one update gives 𝑥1 closer to 𝑥*, where 𝜃 = 0 and tan(𝜃) = 0]
o Let’s say 𝜂 = 1 and [d𝑓/d𝑥]_{𝑥0} = tan 𝜃 < 0
o The update is 𝑥1 = 𝑥0 − 𝜂 [d𝑓/d𝑥]_{𝑥0}. As 𝜂 [d𝑓/d𝑥]_{𝑥0} < 0, we get 𝑥1 > 𝑥0
o E.g., say 𝑥0 = −5 & [d𝑓/d𝑥]_{𝑥0} = −0.5; then 𝑥1 = −5 − 1 ∗ (−0.5) = −4.5
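The iteration described above can be sketched as a small gradient-descent loop, reusing 𝑓(𝑥) = 𝑥² − 3𝑥 + 2 from the earlier example (the fixed 𝜂 = 0.1 is an assumed choice):

```python
def gradient_descent(df, x0, eta=0.1, iters=100):
    """Iterate x_{k+1} = x_k - eta * [df/dx at x_k]."""
    x = x0
    for _ in range(iters):
        x = x - eta * df(x)
    return x

df = lambda x: 2*x - 3               # derivative of x^2 - 3x + 2
x_star = gradient_descent(df, x0=-5.0)
# x_star converges close to the true minima at x = 1.5
```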
[Figure: with a fixed step size, the iterates 𝑥0, 𝑥2, 𝑥4, 𝑥6, 𝑥8 and 𝑥1, 𝑥3, 𝑥5, 𝑥7, 𝑥9 oscillate on either side of 𝑥* = 𝑥𝑘, where tan(𝜃) = 0]
o So, we are not converging towards 𝑥*, as the iterates keep oscillating between 0.5 & −0.5.
o The solution to this problem is to reduce the learning rate (step size) at each iteration.
o We need to find a function ℎ(𝑖) such that 𝜂 = ℎ(𝑖), where 𝑖 is the iteration number, s.t. as 𝑖 ↑, 𝜂 ↓
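A minimal sketch of the decaying step size, using 𝑓(𝑥) = |𝑥| as an assumed example where a fixed 𝜂 oscillates forever but ℎ(𝑖) = 1/(𝑖 + 1) converges:

```python
def gd_with_schedule(df, x0, h, iters=1000):
    """Gradient descent where the learning rate at iteration i is h(i)."""
    x = x0
    for i in range(iters):
        x = x - h(i) * df(x)
    return x

df = lambda x: (x > 0) - (x < 0)     # derivative of |x| (sign function)
x_fixed = gd_with_schedule(df, 0.9, h=lambda i: 1.0)            # oscillates
x_decay = gd_with_schedule(df, 0.9, h=lambda i: 1.0 / (i + 1))  # converges
```

With the fixed rate the iterate jumps back and forth around 𝑥* = 0; with the decaying rate the jumps shrink and the iterate settles near 0.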
• Gradient descent for linear regression
(𝑊*, 𝑊0*) = argmin_{𝑤,𝑤0} ∑_{𝑖=1}^{𝑛} {𝑦𝑖 − (𝑊^𝑇𝑥𝑖)}²
o Here, we are trying to find the minimizing 𝑤, and we will derive the formula first without considering the regularization term.
o Here 𝑥𝑖 & 𝑦𝑖 are constants, and we need to find the 𝑊 that minimizes our loss function 𝐿(𝑊) for linear regression.
o In SGD, instead of taking the sum over all 𝑛 points, we calculate the sum over 𝑘 points, where 1 ≤ 𝑘 ≤ 𝑛; 𝑘 is the batch size.
o It says: pick a random set of 𝑘 points at each iteration and perform gradient descent.
o As we pick the 𝑘 points randomly, we call this gradient descent Stochastic Gradient Descent.
o As 𝑘 reduces from 𝑛 to 1, the number of iterations required to reach the optimum increases. At each iteration the 𝑘 sampled points and 𝜂 change: the 𝑘 points (the batch) are randomly picked from the 𝑛 points, and 𝜂 is reduced progressively at each iteration. The optimum values reached by both methods will be close.
𝑤*_{GD} ≈ 𝑤*_{SGD}
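A mini-batch SGD sketch for the linear-regression loss above; the data, the batch size 𝑘 = 16, and the decaying 𝜂 schedule are all assumed toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noise-free targets for a clean check

def sgd_linreg(X, y, k=16, iters=2000):
    """Minimize sum_i (y_i - W^T x_i)^2 with mini-batches of k points."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(iters):
        idx = rng.choice(n, size=k, replace=False)   # k random points
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ w)             # gradient on the batch
        eta = 0.1 / (1 + 0.01 * i)                   # eta decays as i grows
        w = w - eta * grad / k
    return w

w_hat = sgd_linreg(X, y)             # w_hat ends up close to w_true
```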
• Constrained Optimization & PCA
o General constrained optimization is
𝑥* = argmax_𝑥 𝑓(𝑥) [objective function] such that 𝑔(𝑥) = 𝑐 [equality constraint], ℎ(𝑥) ≥ 𝑑 [inequality constraint]
o We solve such problems by using Lagrangian multipliers 𝜆, 𝜇 (both ≥ 0)
ℒ(𝑥, 𝜆, 𝜇) = 𝑓(𝑥) − 𝜆{𝑔(𝑥) − 𝑐} − 𝜇{𝑑 − ℎ(𝑥)}
o Now we will solve the above equation by setting ∂ℒ/∂𝑥 = 0, ∂ℒ/∂𝜆 = 0, ∂ℒ/∂𝜇 = 0
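A quick check of the stationarity conditions on an assumed toy problem: maximize 𝑓(𝑥, 𝑦) = 𝑥 + 𝑦 subject to 𝑔(𝑥, 𝑦) = 𝑥² + 𝑦² = 1:

```python
import math

# Lagrangian: L = (x + y) - lam * (x**2 + y**2 - 1)
# Stationarity: dL/dx = 1 - 2*lam*x = 0, dL/dy = 1 - 2*lam*y = 0,
# dL/dlam = -(x**2 + y**2 - 1) = 0
x = y = 1 / math.sqrt(2)     # candidate maximizer
lam = 1 / (2 * x)            # multiplier recovered from dL/dx = 0

assert abs(1 - 2 * lam * x) < 1e-12   # dL/dx = 0 holds
assert abs(1 - 2 * lam * y) < 1e-12   # dL/dy = 0 holds
assert abs(x**2 + y**2 - 1) < 1e-12   # constraint is satisfied
```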
o Ultimately, we want our transformed variables to contain as much information as possible (or equivalently, account for as much variability as possible). PCA performs such an orthogonal transformation to convert correlated variables into a new set of uncorrelated variables called principal components. PCA makes sure that the 1st principal component has the largest possible variance, which means the 1st component accounts for the most variability in the data, and that each succeeding component has the highest variance possible under the condition that it is orthogonal to the preceding components.
o Of course, to account for 100% of the variability, we usually need the same number of components as the original number of variables. But usually with a much smaller number of new components we can explain almost 90% of the variability. Thus, PCA helps us reduce our focus from a possibly large number of variables to a relatively small number of new components, while preserving most of the information in the data.
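A small numerical sketch of the variance-explained idea; the correlated 2-D data below is an assumed toy example, and PCA is computed via SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=500)
# two strongly correlated variables
X = np.column_stack([z, 0.5 * z + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # variance ratio per component
# rows of Vt are the principal components (orthogonal directions);
# explained[0] is close to 1 here: one component captures most variability
```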
• Logistic regression formulation by using Lagrangian multipliers
𝑊* = argmin_𝑊 (logistic loss) + 𝜆𝑤^𝑇𝑤
o Written as a Lagrangian, this is (logistic loss) + 𝜆(𝑤^𝑇𝑤 − 𝑐); we can ignore the −𝜆𝑐 term as it is a constant. The rest of the equation is the same as the original one.
o Hence, one way to understand regularization is as imposing an equality constraint 𝑤^𝑇𝑤 = 𝑐.
ℒ₂(𝑤1) = 𝑊1²    vs    ℒ₁(𝑤1) = |𝑊1| = 𝑊1 if 𝑊1 > 0; −𝑊1 if 𝑊1 ≤ 0
By gradient descent, let (𝑊1)𝑗 be positive:
ℒ₂: (𝑊1)𝑗+1 = (𝑊1)𝑗 − 𝜂 [∂ℒ₂/∂𝑤1]_{(𝑊1)𝑗} = (𝑊1)𝑗 − 𝜂(2 ∗ (𝑊1)𝑗)
ℒ₁: (𝑊1)𝑗+1 = (𝑊1)𝑗 − 𝜂 [∂ℒ₁/∂𝑤1]_{(𝑊1)𝑗} = (𝑊1)𝑗 − 𝜂 ∗ 1
Let (𝑊1)𝑗 = 0.1 and 𝜂 = 0.01:
ℒ₂: (𝑊1)𝑗+1 = 0.1 − 0.01(2 ∗ 0.1) = 0.098
ℒ₁: (𝑊1)𝑗+1 = 0.1 − 0.01(1) = 0.09
ℒ₂ regularization:
• Here, as we come closer and closer towards the minima, the slope decreases and the derivative term becomes smaller
• The rate of change becomes smaller
• The value of (𝑊1)𝑗+1 doesn’t change much compared to (𝑊1)𝑗; ℒ₂ reg. does not change the values much in each iteration
• Here there are fewer chances of 𝑊1 becoming 0 after 𝑘 iterations

ℒ₁ regularization:
• Here, as we come closer and closer towards the minima, the slope remains constant, so the derivative term remains constant
• The rate of change remains constant
• The value of (𝑊1)𝑗+1 changes a lot compared to (𝑊1)𝑗; ℒ₁ reg. continues to constantly reduce 𝑊1 towards 𝑊* = 0
• Here there are more chances of 𝑊1 becoming 0 after 𝑘 iterations