
Artificial intelligence

AI enables computers and machines to mimic the perception, learning, problem-solving, and decision-making capabilities of the human mind.

Machine Learning
❖ The study of computer algorithms that can automatically learn and improve from experience without being explicitly programmed.
❖ AI is heavily dependent on ML.

Deep Learning
❖ A subset of machine learning that uses multi-layer artificial neural networks.

______________________________________________________________________________

In mathematics, some problems can be solved either analytically or numerically. What is the
difference?
• An Analytical Solution involves framing the problem in a well-understood form and
calculating the exact solution. (Can be done by hand)

• A Numerical Solution means making guesses at the solution and testing whether the
problem is solved well enough to stop. (Must use the computer)
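To make the distinction concrete, here is a minimal sketch (my own illustration, not from the notes) solving x² = 2 both ways:

```python
import math

# Analytical: rearrange the problem into a known form and compute exactly.
exact = math.sqrt(2)

# Numerical: guess and test (bisection) until the answer is good enough.
lo, hi = 1.0, 2.0
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if mid * mid < 2:
        lo = mid          # solution lies in the upper half
    else:
        hi = mid          # solution lies in the lower half

print(exact, (lo + hi) / 2)  # both ~ 1.4142135623...
```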

________________________________________________________________________

Optimization
• Finding the values of the input parameters (independent variables) that
minimize/maximize the function value (dependent variable).

• E.g., in artificial neural networks (deep learning), finding w and b to minimize J(w, b).
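As a minimal illustration (my own, not from the notes), an optimizer searches for the input value that minimizes the function's output; here using SciPy's general-purpose scalar minimizer:

```python
from scipy.optimize import minimize_scalar

# f has its minimum value 1 at the input x = 3.
result = minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(result.x, result.fun)  # ~3.0 (best input), ~1.0 (minimal output)
```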
Numerical Optimization
• Using numerical algorithms to solve an optimization problem.
• Numerical optimization is at the heart of almost all ML algorithms, which are really a
search for a set of terms with unknown values needed to fill an equation.
• Each ML algorithm (e.g., linear and logistic regression) has a different "equation" and
"terms", using this terminology loosely.
• The equation is easy to evaluate to make a prediction for a given set of terms, but we
don't know which terms give a "good" or even "best" set of predictions on a given
dataset. This is the numerical optimization problem that we always seek to solve.
• It's numerical because we are trying to solve the optimization problem with noisy,
incomplete, error-prone, and limited samples of observations from our domain.
• The model tries to interpret the data and build a map between the inputs and the
outputs of these observations.

Two major reasons for studying optimization:

• Different algorithms can perform (sometimes drastically) better or worse in
different scenarios, and understanding why this happens requires an
understanding of optimization.
• Often, understanding a problem from the optimization perspective can
contribute to our statistical understanding of the problem as well.
______________________________________________________________________________

Gradient
• Gradient is the slope:

$$\text{Gradient} = \frac{\text{Change in } Y}{\text{Change in } X}$$

• The derivative of a function of a real variable measures the sensitivity to change of the
function value (output value) with respect to a change in its argument (input value).
$$\text{Gradient at } x = \text{derivative at } x = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x}$$

Gradient of Multivariable Function


• A partial derivative is the rate of change of a multivariable function when all but one
variable are held fixed (the gradient of a multivariable function is a vector).

• E.g., for $f(x, y) = x^2 + y^2$: $\frac{\partial f}{\partial x} = 2x$ ; $\frac{\partial f}{\partial y} = 2y$

• $\text{Gradient} = \nabla f = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix}$; the generalization to vector-valued functions is called the Jacobian.
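To make this concrete, a minimal sketch (my own illustration; `numerical_gradient` is a hypothetical helper) approximating the gradient of f(x, y) = x² + y² by central finite differences:

```python
import numpy as np

def numerical_gradient(f, x, dx=1e-6):
    """Approximate the gradient of f at point x by central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = dx
        # rate of change along dimension i, all other variables held fixed
        grad[i] = (f(x + step) - f(x - step)) / (2 * dx)
    return grad

f = lambda v: v[0] ** 2 + v[1] ** 2          # f(x, y) = x^2 + y^2
print(numerical_gradient(f, [3.0, 4.0]))     # ~ [6. 8.], i.e. (2x, 2y)
```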
Local vs. Global Optimum
• A local minimum (or optimum) of a function is a point
where the function value is smaller than at nearby points,
but possibly greater than at a distant point.

• A global minimum (or optimum) is a point where the function value is smaller than at
all other feasible points.

Convex Problem
In a convex problem, any local minimum is also the global minimum;
convex problems are well understood and easier to solve.
Knowing whether the cost function is convex lets us correctly decide whether to use a
simple optimization algorithm, such as gradient descent, or more complex ones based on
momentum.
______________________________________________________________________________
Numerical Optimization for Data Science
Linear Regression
• Regression analysis is one of the most important fields in statistics and machine learning.
There are many regression methods available. Linear regression is one of them.

• Linear regression is usually the first machine learning algorithm that every data scientist
comes across. It is a simple model, but everyone needs to master it, as it lays the
foundation for other machine learning algorithms.

• Regression searches for relationships among variables.

• For example, you can observe several employees of some company and try to
understand how their salaries depend on features such as experience, level of
education, role, the city they work in, and so on.

Salary y = f(education x₀, role x₁, city x₂, …)

• education x₀, role x₁, city x₂, … are called the independent variables.

• Salary y is the dependent variable.

• The data for each employee represent one observation.

• h(x) = θ₀ + θ₁x is called the hypothesis.


Vector Norm
• Evaluation is a crucial step in all modeling and machine learning problems. Since we are
often making predictions on entire datasets, providing a single number that summarizes
the performance of our model is both simple and effective.

• There are a number of situations where we need to compress information about a
dataset into a single number. For instance:

• Determining the magnitude of a data point in multiple dimensions.

• Calculating the loss of a machine learning model.

• Computing the error of a predictive model.
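For instance, vector norms compress a multi-dimensional point into one number; a minimal sketch (my own illustration) using NumPy:

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.linalg.norm(v, ord=1)  # |3| + |-4| = 7.0  (Manhattan norm)
l2 = np.linalg.norm(v)         # sqrt(3^2 + 4^2) = 5.0  (Euclidean norm)

print(l1, l2)  # the "magnitude" of the data point as a single number
```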

Loss (Cost) Function


• For simplicity, let θ₀ = 0, i.e., h(x) = θ₁x.

• We can plot the loss as a function of θ₁ for different fits.

J(θ₁) = predicted value − actual value

LR & Loss Function

Mean Squared Error (MSE)

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$$

(The 1/2 factor is a common convention that simplifies the gradient; plain MSE omits it.)

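A minimal sketch (my own illustration) computing this cost for the one-parameter hypothesis above:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum of squared prediction errors."""
    errors = theta0 + theta1 * x - y
    return (errors ** 2).sum() / (2 * len(x))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))  # 0.0: theta1 = 2 fits y = 2x perfectly
print(cost(0.0, 1.0, x, y))  # positive cost for a worse fit
```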

Gradient Descent Algorithm
• Gradient descent (GD) is an optimization algorithm used when training a machine
learning model. It tweaks the model's parameters (coefficients) iteratively to minimize a
given function as far as possible, and it works best on convex functions.

• This algorithm and its variants have proven effective at solving data-related
problems, especially in the domain of neural networks. It is neither the only algorithm
nor the best, but it is seen as the "hello world" of data science.

• Intuition: a person (blindfolded) trying to walk down a hill when it is foggy. The idea is to
take a single step at a time, in the direction of the steepest descent.

• GD does the same thing to find the minimum of a loss function.

• Gradient descent is a first-order iterative optimization algorithm for finding a local
minimum of a differentiable function.

• The idea is to take repeated steps in the opposite direction of the gradient (or
approximate gradient) of the function at the current point, because this is the direction
of steepest descent.
Implementation Steps:
• Step 1: Initialize the parameters (θ₀ & θ₁) with random values or simply zeros. Also choose the
learning rate.

Note: these parameters can be the weights and biases of a deep learning ANN.

• Step 2: Use (θ₀ & θ₁) to predict the output h(x) = θ₀ + θ₁x.

• Step 3: Calculate the cost function J(θ₀, θ₁).

• Step 4: Calculate the gradient.

• Step 5: Update the parameters (simultaneously).

• Step 6: Repeat steps 2 to 5 until converging to the minimum or reaching the maximum
number of iterations (see the sketch below).
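A minimal sketch of these steps (my own illustration; `gradient_descent` is a hypothetical helper name) for fitting h(x) = θ₀ + θ₁x, assuming the 1/2m cost given earlier:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, epochs=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0             # Step 1: initialize parameters
    m = len(x)
    for _ in range(epochs):               # Step 6: repeat until max iterations
        h = theta0 + theta1 * x           # Step 2: predict
        error = h - y                     # feeds the cost J (Step 3)
        grad0 = error.sum() / m           # Step 4: dJ/dtheta0 for J = (1/2m)*sum(error^2)
        grad1 = (error * x).sum() / m     # Step 4: dJ/dtheta1
        # Step 5: simultaneous update
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        # (a convergence check on J could stop the loop early)
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x                         # data generated with theta0=3, theta1=2
print(gradient_descent(x, y))             # ~ (3.0, 2.0)
```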

Implementation Notes:
• Parameters should be updated simultaneously.

• The learning step decreases as you get closer to the minimum, even with a
fixed learning rate, because the gradient itself shrinks.

• Do not use a very large learning rate, or you may overshoot the minimum.

• Do not use a very small learning rate, or convergence will be very slow.

______________________________________________________________________________
To find a good value, you have to test several values and pick the best.
Advice for choosing the learning rate:
• Plot the cost function against epochs (iterations) and check that it is decreasing.
• Convergence check: cost(i − 1) − cost(i) < 0.001.
• Try a range of α, e.g., 0.001, 0.01, 0.1, 1, then plot cost vs. epochs and check
for rapid and smooth convergence. You can then select another α close to the best
value in that range (see the sketch below).
E.g., if 0.001 is fine and 0.01 is bad, you can try values in between, such as
0.005.
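A minimal sketch of such a sweep (my own illustration, on toy data):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

for alpha in [0.001, 0.01, 0.1]:            # candidate learning rates
    theta0, theta1, costs = 0.0, 0.0, []
    for _ in range(100):                    # epochs
        error = theta0 + theta1 * x - y
        costs.append((error ** 2).mean())   # track cost per epoch
        theta0 -= alpha * error.mean()
        theta1 -= alpha * (error * x).mean()
    plt.plot(costs, label=f"alpha={alpha}")

plt.xlabel("epoch"); plt.ylabel("cost"); plt.legend(); plt.show()
```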
GD and backpropagation are the algorithms used to train artificial neural networks
(ANNs), i.e., to update the weights and biases.
They are key algorithms in deep learning.

Batch/Vanilla GD

• Definition: Standard gradient descent updates the parameters only once per epoch, i.e., it
calculates the derivatives for all the observations and then updates the parameters.

• Advantages:
  • We can use a fixed learning rate during training without worrying about learning rate decay.
  • It has a straight trajectory towards the minimum, and in theory it is guaranteed to converge
    to the global minimum if the loss function is convex, and to a local minimum if it is not.
  • It has an unbiased estimate of the gradients: the more examples, the lower the standard error.

• Disadvantages:
  • It can be very slow for very large datasets because there is only one update per epoch; a
    large number of epochs is required to get a substantial number of updates.
  • For large datasets, the vectorized data may not fit into memory.
  • For non-convex surfaces, it may only find a local minimum.

Stochastic GD (SGD)

• Definition: Stochastic gradient descent updates the parameters after each observation, which
leads to a greater number of updates.

• Advantages:
  • It can converge faster than batch gradient descent since it updates the parameters after
    each training example.
  • This makes it a popular choice for large-scale machine learning problems. Additionally, SGD
    can avoid getting stuck in local minima due to its random nature.

• Disadvantages:
  • Due to frequent fluctuations, it keeps overshooting near the desired exact minimum.
  • It adds noise to the learning process, i.e., the variance becomes large since we only use one
    example for each learning step.
  • We can't utilize vectorization over a single example.

Mini-Batch GD

• Definition: Mini-batch gradient descent sums over a smaller number of examples based on the
batch size. Note: the batch size is something we can tune; it is usually chosen as a power of 2,
such as 32, 64, 128, 256, 512, etc.

• Advantages:
  • Updates are less noisy compared to SGD, which leads to better convergence.
  • There is a high number of updates in a single epoch compared to batch GD, so fewer epochs
    are required for large datasets.
  • Each batch fits well into processor memory, which makes computing faster.

• Disadvantages:
  • It can occasionally get stuck in local minima rather than finding the global minimum.
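The three variants differ only in how many examples feed each parameter update; a minimal sketch (my own illustration; `minibatch_gd` is a hypothetical name), where batch_size = len(x) gives batch GD and batch_size = 1 gives SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(x, y, alpha=0.1, epochs=500, batch_size=2):
    """Mini-batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            error = theta0 + theta1 * x[batch] - y[batch]
            # one update per batch: more updates per epoch than batch GD
            theta0 -= alpha * error.mean()
            theta1 -= alpha * (error * x[batch]).mean()
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x
print(minibatch_gd(x, y))  # ~ (3.0, 2.0)
```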
How to solve the vanishing gradient problem?
• There are two frequent problems in deep learning: the exploding gradient and the
vanishing gradient.
• In the first case, it's similar to having a learning rate that is too big. The algorithm is
unstable and never converges.
• With deep learning, it can happen when your network is too deep. Since the
gradients from each layer get multiplied with each other, you quickly obtain a
gradient that explodes exponentially (see the sketch below).
• For the vanishing gradient, it's the opposite.
• The gradient becomes so small that the skier (our hill-descending algorithm) barely moves anymore.
• It can happen if the learning rate is too small.
• But it can also happen if the skier (the algorithm) is stuck on a flat region (a plateau).
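A one-line illustration (my own) of why multiplying many per-layer gradient factors makes the result explode or vanish:

```python
# 50 layers, each contributing a multiplicative gradient factor:
print(0.5 ** 50)   # ~8.9e-16 -> vanishing gradient
print(1.5 ** 50)   # ~6.4e+08 -> exploding gradient
```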
Classification
• Classification techniques are an essential part of machine learning.

• Approximately 70% of data science problems are classification problems.

• Some regression algorithms are used for classification.

Logistic Regression
• A supervised statistical method used for binary classification problems, where the goal
is to predict whether an observation belongs to a particular category or not.

• It is a generalized linear model that uses the logistic function to model the relationship
between input and output features.

• Logistic regression uses the logit function, which helps derive a relationship between
the dependent variable and the independent variables by predicting the probabilities
or chances of occurrence.

• The logistic function (also known as the sigmoid function) converts a real-valued score
into a probability, which can then be thresholded into binary values for predictions.

• Logistic regression computes a weighted sum of the input features, but instead of
outputting the result directly like linear regression models, it passes the result through
the sigmoid function and outputs the logistic of the result.

Logistic regression does not output continuous values, as it aims at classification: it converts
continuous scores into discrete/binary (0, 1) values.
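A minimal sketch (my own illustration; `predict` is a hypothetical helper) of that pipeline, weighted sum → sigmoid → probability → binary class:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Weighted sum -> sigmoid -> probability -> binary class label."""
    p = sigmoid(np.dot(x, w) + b)       # probability of the positive class
    return (p >= threshold).astype(int), p

x = np.array([[0.5, 1.2], [2.0, 0.3]])  # two observations, two features
w = np.array([2.0, -1.0])
labels, probs = predict(x, w, b=0.0)
print(labels, probs)                    # [0 1] with their probabilities
```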
Why can't we use Linear Regression?
• Linear regression predicts continuous variables, like the price of a house, and its
output can range from negative infinity to positive infinity.

• Since the predicted value is not a probability but a continuous value for the
classes, it is very hard to find the right threshold that helps distinguish between
the classes.

• In a multiclass problem there can be n classes, each labelled from 0 to n−1.
Suppose we have a 5-class problem with classes 0, 1, 2, 3, and 4: these labels don't
carry any meaningful order, yet linear regression would be forced to establish some
kind of ordered relationship between the dependent and independent features.

Decision boundary
• A decision boundary is a line or margin that separates the classes.

• A classification algorithm is all about finding the decision boundary that distinguishes
between the classes perfectly or close to perfectly.

• Logistic regression fits a proper decision boundary so that we can predict which class a
new data point will correspond to.
Cost function
• A function that measures the performance of a machine learning model for given
data.
• It is basically the calculation of the error between predicted and expected values,
presented in the form of a single real number.

Difference between cost and loss functions

The cost function is the average error over the n samples in the data, while the loss function is
the error for an individual data point. In other words, the loss function is for one training
example; the cost function is for the entire training set.
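A minimal sketch (my own illustration; the binary cross-entropy formula is my assumption, as the notes don't give one) contrasting the per-example loss with the dataset-level cost:

```python
import numpy as np

def loss(y, p):
    """Loss for ONE training example (binary cross-entropy)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cost(ys, ps):
    """Cost for the ENTIRE training set: average of per-example losses."""
    return np.mean([loss(y, p) for y, p in zip(ys, ps)])

print(loss(1, 0.9))                       # small loss: confident and correct
print(loss(1, 0.2))                       # large loss: confident and wrong
print(cost([1, 0, 1], [0.9, 0.1, 0.8]))   # one number for the whole dataset
```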
Advantages
• It is efficient and straightforward: it doesn't require high computation power, is easy to
implement and easily interpretable, and is widely used by data analysts and scientists.
Also, it doesn't require feature scaling.

• Logistic regression provides a probability score for observations.

Disadvantages
• Logistic regression is not able to handle a large number of categorical features/variables.

• It is vulnerable to overfitting. Also, it can't solve non-linear problems, which is why it
requires a transformation of non-linear features.

• Logistic regression will not perform well with independent variables that are not
correlated with the target variable, or that are very similar or correlated to each other.
