
Optimization

Mark A. Magumba
Linear Regression
• Equation of the line takes the form:
  $y = w_0 + w_1 x$, where $w_0$ is the intercept and $w_1$ is the slope (the coefficient of the independent variable $x$)
Multiple regression: 2 independent variables
• Equation takes the form:
  $y = w_0 + w_1 x_1 + w_2 x_2$, where $w_0$ is the intercept and $w_1$, $w_2$ are the coefficients of the independent variables $x_1$ and $x_2$ (see the sketch below)
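As a quick illustration of the two equations above, here is a minimal Python sketch; the function names, coefficient values, and data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def predict_simple(x, w0, w1):
    """Simple linear regression: y = w0 + w1*x."""
    return w0 + w1 * x

def predict_multiple(x1, x2, w0, w1, w2):
    """Multiple regression with two independent variables: y = w0 + w1*x1 + w2*x2."""
    return w0 + w1 * x1 + w2 * x2

# Example usage with made-up coefficients and feature values
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.5, 2.5])
print(predict_multiple(x1, x2, w0=0.1, w1=0.8, w2=-0.3))
```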
Optimization Techniques
• Ways of determining the best parameters for parametric models
• Analytical solutions may not exist
• For big data it isn’t computationally feasible to obtain exact solutions
• Common technique is gradient descent
• Requires defining a cost/loss function
• It is convenient to express the loss function in a differentiable form, e.g. using
the mean squared error, since gradient descent involves computing partial derivatives
with respect to the different parameters (see the sketch below)
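As a concrete example of such a differentiable loss, here is a minimal sketch of the mean squared error; the names and numbers are illustrative, not from the slides.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Example usage with made-up targets and predictions
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.1])
print(mse_loss(y_true, y_pred))  # 0.0375
```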
Gradient Descent
[Figure: loss surface plotted over two parameters X1 and X2, with the global minimum of the loss marked]
Gradient descent
• Gradient descent is about obtaining the optimal coefficients for your terms; the output
is assumed to be a linear combination of terms, as with linear/multiple regression
• Assume this is our data, where x1, x2, …, xn are the features and y is the target
column to be predicted:
[Table: example data with feature columns x1, x2, x3, x4, …, xn and a binary target column Y (values 1, 0, 1, 0)]
Gradient descent
• The following equations may be formulated: for each example, the prediction is
  $\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, and the loss over all $m$ examples is
  $L = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
Partial gradients
• To compute the optimal values of the coefficients/weights, an analytical solution
may be found, but where you have a large number of terms, imperfect data, and a
large number of examples, this is often computationally intractable
• However, if we know the influence of each term on the loss we can obtain the
optimal coefficients by minimizing the loss
• The influence of each term on the loss/cost function is the partial derivative of the
loss/cost with respect to that term's coefficient/weight
• From our equations on the previous slides, the following are the partial derivatives of
the loss with respect to the coefficients of x1 and x2 for multiple linear regression (see the sketch below):
  $\frac{\partial L}{\partial w_1} = -\frac{2}{m}\sum_{i=1}^{m} x_{1i}\,(y_i - \hat{y}_i)$
  $\frac{\partial L}{\partial w_2} = -\frac{2}{m}\sum_{i=1}^{m} x_{2i}\,(y_i - \hat{y}_i)$
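A minimal NumPy sketch of these partial gradients, assuming the squared loss and a two-column feature matrix; the names (X, y, w, b) and the toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def partial_gradients(X, y, w, b):
    """Gradients of the mean squared loss for a linear model y_hat = X @ w + b."""
    m = len(y)
    y_hat = X @ w + b                  # predictions for every example
    error = y_hat - y                  # residuals
    dw = (2.0 / m) * (X.T @ error)     # dL/dw1, dL/dw2
    db = (2.0 / m) * np.sum(error)     # dL/db (intercept term)
    return dw, db

# Example usage with toy data and zero-initialised parameters
X = np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
dw, db = partial_gradients(X, y, w=np.zeros(2), b=0.0)
print(dw, db)
```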
Update rule
• With each iteration, we can adjust the coefficients/weights of each
term using this simple weight update rule:
  $w \leftarrow w - \eta\,\frac{\partial L}{\partial w}$
• Where $w$ is the weight, $\eta$ is the learning rate (a small number used to
regulate the speed of learning), and $L$ is the loss/cost function. It is
common to use the squared loss, which is a mathematical convenience since
we want the loss/cost function to be differentiable; a full sketch of this update loop is given below
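Putting the partial gradients and the update rule together, here is a minimal sketch of vanilla (full-batch) gradient descent for linear regression; the learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Vanilla gradient descent on the mean squared loss of a linear model."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        error = X @ w + b - y               # residuals over all examples
        dw = (2.0 / m) * (X.T @ error)      # dL/dw
        db = (2.0 / m) * np.sum(error)      # dL/db
        w -= eta * dw                        # weight update rule: w <- w - eta * dL/dw
        b -= eta * db
    return w, b

# Example usage: fitting toy data generated as y = 2*x1 + 1*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 0.5]])
y = 2 * X[:, 0] + 1 * X[:, 1]
print(gradient_descent(X, y, eta=0.05, n_iters=2000))
```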
Weaknesses
• Local minima for certain data shapes
• Computational inefficiency for big data
• Solutions:
• Online gradient descent / stochastic gradient descent: with each
iteration, weights are updated based on the loss of a single randomly
selected example instead of the mean loss over all examples
• Mini-batch gradient descent: instead of updating the weights based on a
single example (stochastic gradient descent) or all examples (vanilla gradient
descent), we use the mean loss over a small number of examples
(a mini-batch). Determining the ideal batch size is a matter of trial and error; a sketch of this variant is given below
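A minimal sketch of the mini-batch variant described above (stochastic gradient descent is the special case of batch size 1); the batch size, learning rate, epoch count, and toy data are illustrative assumptions.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, n_epochs=100, batch_size=2):
    """Mini-batch gradient descent: update weights from the mean loss of small random batches."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(m)                  # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]   # indices of one mini-batch
            error = X[idx] @ w + b - y[idx]
            dw = (2.0 / len(idx)) * (X[idx].T @ error)
            db = (2.0 / len(idx)) * np.sum(error)
            w -= eta * dw                            # same update rule, batch-sized gradient
            b -= eta * db
    return w, b

# Example usage on the same toy data as the previous sketch
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 0.5]])
y = 2 * X[:, 0] + 1 * X[:, 1]
print(minibatch_gradient_descent(X, y, eta=0.02, n_epochs=2000, batch_size=2))
```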
