
BRAHMDEVDADA MANE INSTITUTE OF TECHNOLOGY, BELATI, SOLAPUR

Unit: 3 Basic mathematics for Machine Learning


Getting Started with The Math Basics, working with Data, Exploring the World of
Probabilities, Describing the Use of Statistics, Interpreting Learning As Optimization,
Exploring Cost Functions, Descending the Error Curve, Updating by Mini-Batch and Online.

Getting Started with The Math Basics:


If you want to implement existing machine learning algorithms from scratch or
you need to devise new ones, you require a profound knowledge of probability, linear
algebra, linear programming, and multivariable calculus. You also need to know how
to translate math into working code, which means having sophisticated computing
skills. This chapter begins by helping you understand the mechanics of machine
learning math and describes how to translate math basics into usable code.

1. Working with Data:


Machine learning is so appealing because it allows machines to learn from
real world examples (such as sales records, signals from sensors, and textual data
streaming from the Internet) and determine what such data implies. Common outputs
from a machine learning algorithm are predictions of the future.
Many useful applications have already become a reality by leveraging these results:
» Diagnosing hard-to-find diseases
» Discovering criminal behavior and detecting criminals in action
» Recommending the right product to the right person
» Filtering and classifying data from the Internet at an enormous scale
» Driving a car autonomously

The mathematical and statistical basis of machine learning makes outputting such
useful results possible. Using math and statistics in this way enables the algorithms to
understand anything with a numerical basis. To begin the process, you represent the
solution to the problem as a number. For example, if you want to diagnose a disease
using a machine learning algorithm, you can make the response a 1 or a 0 (a binary
response) to indicate whether the person is ill, with a 1 stating simply that the person
is ill. Alternatively, you can use a number between 0 and 1 to convey a less
definite answer. The value can represent the probability that the person is ill, with 0 indicating
that the person isn’t ill and 1 indicating that the person definitely has the disease.

A machine learning algorithm can provide an answer (predictions) when supported by the required information (sample data) and an associated response
(examples of the predictions that you want to be able to guess). Information can
include facts, events, observations, counts, measurements, and so on. Any
information used as input is a feature or variable (a term taken from statistics).
Effective features describe the values that relate to the response and help the
algorithm guess a response using the function it creates given similar information in
other circumstances.

There are two types of features: quantitative and qualitative.


Quantitative features are perfect for machine learning because they define
values as numbers (integers, floats, counts, rankings, or other measures).
Qualitative features are usually labels or symbols that convey useful information
in a nonnumeric way, a way that you can define as more human-like (words,
descriptions, or concepts).
Quinlan describes a set of information useful for deciding whether to play tennis
outside or not, something a machine can learn using the proper technique.
The set of features described by Quinlan is as follows:
» Outlook: Sunny, overcast, or rain
» Temperature: Cool, mild, hot
» Humidity: High or normal
» Windy: True or false

A machine learning algorithm cannot really digest such information. You must
first transform the information into numbers. Many ways are available to do so, but
the simplest is one-hot encoding, which turns every feature into a new set of binary
(values 0 or 1) features for all its symbolic values. For instance, consider the outlook
variable, which becomes three new features, as follows: outlook: sunny, outlook:
overcast, and outlook: rain. Each one will have a numeric value of 1 or 0 depending
on whether the implied condition is present. So when the day is sunny, outlook: sunny
has a value of 1 and outlook: overcast and outlook: rain have a value of 0.
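The one-hot transformation described above can be sketched in a few lines of Python; the sample days and the category order are illustrative choices, not part of the original example:

```python
# One-hot encoding of the symbolic "outlook" feature by hand.
days = ["sunny", "overcast", "rain", "sunny"]
categories = ["sunny", "overcast", "rain"]

# Each day becomes a row of 0/1 flags, one column per category.
encoded = [[1 if day == c else 0 for c in categories] for day in days]

print(encoded)
# A sunny day maps to [1, 0, 0]: outlook:sunny is 1, the other two are 0.
```

Exactly one column is 1 in each row, which is where the name "one-hot" comes from.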
Creating a matrix
A matrix is a collection of numbers, arranged in rows and columns, much like the squares in a
chessboard. However, unlike a chessboard, which is always square, matrices can have a
different number of rows and columns
the machine learning algorithm requires that you turn the individual features into a matrix of
features and the individual responses into a vector or a matrix
a matrix used for machine learning relies on rows to represent examples and columns to
represent features. So, as in the example for learning the best weather conditions to play tennis,
you would construct a matrix that uses a new row for each day and columns containing the
different values for outlook, temperature, humidity, and wind. Typically, you represent a matrix
as a series of numbers enclosed by square brackets, as shown here:
X = [ 1 3 4 1
      4 5 3 2
      3 4 1 2 ]
In this example, the matrix X contains three rows and four columns, so you can
say that the matrix has dimensions of 3 by 4 (also written as 3 x 4).
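The matrix X above can be held as a plain list of rows in pure Python (a minimal sketch; in practice a library such as NumPy would normally be used):

```python
# The 3-by-4 matrix X from the text as a list of rows.
X = [
    [1, 3, 4, 1],
    [4, 5, 3, 2],
    [3, 4, 1, 2],
]

rows = len(X)      # number of examples (rows)
cols = len(X[0])   # number of features (columns)
print(rows, cols)  # the 3 x 4 dimensions described in the text
```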

Understanding basic operations


The basic matrix operations are addition, subtraction, and scalar multiplication. Addition and subtraction are possible only when you have two matrices of the same size, and the result is a new matrix of the same dimensions; you simply apply the operation to each corresponding position in the two matrices. Scalar multiplication instead multiplies every element of a single matrix by the same number.
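A quick sketch of these element-wise operations on two small same-shaped matrices (the values are made up for illustration):

```python
# Element-wise operations on two 2-by-2 matrices of the same shape.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

add = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
sub = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
scaled = [[2 * A[i][j] for j in range(2)] for i in range(2)]  # scalar multiplication

print(add)  # [[6, 8], [10, 12]]
```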

2. Exploring the World of Probabilities:

Probability:


 Probability tells you the likelihood of an event, and you express it as a number.
 The probability of an event is measured in the range from 0 (no probability that
an event occurs) to 1 (certainty that an event occurs).
 Intermediate values, such as 0.25, 0.5, and 0.75, say that the event will happen
with a certain frequency when tried enough times. If you multiply the
probability by an integer representing the number of trials, you get an estimate
of how many times the event should happen on average across those trials.
 For instance, if you have an event occurring with probability p=0.25 and
you try 100 times, you’re likely to witness that event happen 0.25 * 100 = 25
times.
 Probabilities are between 0 and 1; no probability can exceed such boundaries.
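The expected-count estimate above can be checked with a small simulation; the seed and the trial count are arbitrary choices for illustration:

```python
import random

# Estimating how often an event with probability p = 0.25 occurs in 100 trials.
random.seed(42)
p, trials = 0.25, 100

expected = p * trials  # analytic expectation: 0.25 * 100 = 25
observed = sum(1 for _ in range(trials) if random.random() < p)

print(expected, observed)  # the observed count fluctuates around 25
```

A single run rarely hits the expectation exactly; the observed count merely scatters around it, which is the "on average" in the text.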
Operating on probabilities:
Example: you can now compute the probability of getting at least one six from two thrown dice,
which is a summation of mutually exclusive events:
» The probability of two sixes: p = 1/6 * 1/6
» The probability of a six on the first die and something other than a six on the
second: p = 1/6 * (1 - 1/6)
» The probability of a six on the second die and something other than a six on
the first: p = 1/6 * (1 - 1/6)
Your probability of getting at least one six from two thrown dice is p = 1/6 * 1/6 + 2 *
1/6 * (1 - 1/6) = 11/36 ≈ 0.306
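The dice calculation can be verified exactly with Python's fractions module, summing the same three mutually exclusive cases:

```python
from fractions import Fraction

# P(at least one six from two dice) as the sum of three mutually exclusive events.
six = Fraction(1, 6)
p = six * six + 2 * six * (1 - six)  # two sixes + a six on exactly one die

print(p, round(float(p), 3))  # 11/36, about 0.306
```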
Conditioning chance by Bayes’ theorem:
The formula used by Thomas Bayes is quite useful:

P(B|E) = P(E|B)*P(B) / P(E)


For example, the a priori probability of a person’s being female is roughly 50
percent. However, the probability may differ drastically if you consider only specific
age ranges, because females tend to live longer, and after a certain age there are more
females than males.
» P(B|E): The probability of a belief (B) given a set of evidence (E) (posterior
probability). In this case, the hypothesis is that a person is a female and the evidence
is long hair. Knowing the probability of such a belief given evidence can help to
predict the person’s gender with some confidence.
» P(E|B): The probability of having long hair when the person is a female. This term
refers to the probability of the evidence in the subgroup, which is itself a conditional
probability. In this case, the figure is 60 percent, which translates to a value of 0.6 in
the formula (the likelihood).

» P(B): The general probability of being a female; that is, the a priori probability of
the belief. In this case, the probability is 50 percent, or a value of 0.5 (the prior).
» P(E): The general probability of having long hair. Here it is another a priori
probability, this time related to the observed evidence. In this formula, it is a 35
percent probability, which is a value of 0.35 (evidence).
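Plugging the numbers above into Bayes' theorem gives the posterior directly (a minimal sketch using the text's own values):

```python
# Bayes' theorem with the text's numbers: P(B|E) = P(E|B) * P(B) / P(E).
p_e_given_b = 0.6  # P(long hair | female), the likelihood
p_b = 0.5          # P(female), the prior
p_e = 0.35         # P(long hair), the evidence

p_b_given_e = p_e_given_b * p_b / p_e  # the posterior
print(round(p_b_given_e, 3))  # about 0.857
```

So, under these figures, observing long hair raises the belief that the person is female from 50 percent to roughly 86 percent.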

3. Describing the Use of Statistics:

In statistics, population refers to all the events and objects you want to measure,
and a sample is a part of it chosen by certain criteria. Using random sampling, which
is picking the events or objects to represent randomly, helps create a set of examples
for machine learning to learn from as it would from all the possible examples. The
sample works because the value distributions in the sample are similar to those in the
population, and that’s enough.
Random sampling isn’t the only possible approach. You can also apply stratified
sampling, through which you can control some aspects of the random sample in order
to avoid picking too many or too few events of a certain kind. After all, random is
random, and you have no absolute assurance of always replicating the exact
distribution of the population.
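A toy sketch of the difference between simple random and stratified sampling; the population of 10 "ill" and 90 "healthy" examples and the sample sizes are made up for illustration:

```python
import random

# A labeled toy population: 10% "ill", 90% "healthy".
random.seed(0)
population = [("ill", i) for i in range(10)] + [("healthy", i) for i in range(90)]

# Simple random sample: class proportions can drift by chance.
simple = random.sample(population, 10)

# Stratified sample: draw from each class in proportion to its population share.
ill = [x for x in population if x[0] == "ill"]
healthy = [x for x in population if x[0] == "healthy"]
stratified = random.sample(ill, 1) + random.sample(healthy, 9)

n_ill = sum(1 for label, _ in stratified if label == "ill")
print(n_ill)  # exactly 1 ill example, matching the 10 percent share
```

The stratified sample replicates the population's class balance by construction, which is the control over "too many or too few events of a certain kind" described above.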

4. Interpreting Learning As Optimization:

Learning comes in many different flavors, depending on the algorithm and its
objectives. You can divide machine learning algorithms into three main groups based
on their purpose:

» Supervised learning
» Unsupervised learning
» Reinforcement learning

a. Supervised learning:

Supervised (Predictive) Learning: the goal is to learn a mapping from inputs x to outputs y,
given a labeled set of input-output pairs

D = {(xi, yi)}, i = 1, ..., N

Here D is called the training set, and N is the number of training examples.

How it works

Figure: flow of supervised learning



⚫ Suppose we have a dataset of different types of shapes which includes square,


rectangle, triangle, and Polygon. Now the first step is that we need to train the model
for each shape.
⚫ If the given shape has four sides, and all the sides are equal, then it will be labelled
as a Square.
⚫ If the given shape has three sides, then it will be labelled as a triangle.
⚫ If the given shape has six equal sides, then it will be labelled as a hexagon.
⚫ Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
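The labelling rules above can be written directly as a tiny rule-based classifier. This is only a sketch; describing each shape as a pair (number of sides, all sides equal) is an assumption made for illustration:

```python
# A rule-based stand-in for the trained model in the shapes example.
def classify(sides: int, all_sides_equal: bool) -> str:
    """Label a shape from its side count and whether all sides are equal."""
    if sides == 4 and all_sides_equal:
        return "square"
    if sides == 3:
        return "triangle"
    if sides == 6 and all_sides_equal:
        return "hexagon"
    return "unknown"

print(classify(4, True))  # square
```

A real supervised learner would infer rules like these from labelled examples rather than having them written by hand.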

b. Unsupervised learning:
Unsupervised (Descriptive) Learning: the goal is to find “interesting patterns” in the data.
This is sometimes called knowledge discovery. Here D = {xi}, i = 1, ..., N. This is a much less
well-defined problem, since we are not told what kinds of patterns to look for, and there is no
obvious error metric to use.
Example: grouping customers by purchasing behavior.
Why use unsupervised learning?
⚫ Unsupervised learning is helpful for finding useful insights from the data.
⚫ Unsupervised learning is similar to how a human learns to think through their own
experiences, which makes it closer to real AI.
⚫ Unsupervised learning works on unlabeled and uncategorized data, which makes
it all the more important.
⚫ In the real world, we do not always have input data with the corresponding output, so to
solve such cases, we need unsupervised learning.

Figure: flow of unsupervised learning


Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to
the machine learning model in order to train it. First, the model interprets the raw data
to find the hidden patterns, and then a suitable algorithm, such as k-means clustering,
is applied.
Once it applies the suitable algorithm, the algorithm divides the data objects
into groups according to the similarities and differences between the objects.
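A minimal sketch of this grouping idea, using a hand-rolled one-dimensional k-means with k = 2; the data points and the naive initialization are illustrative choices:

```python
# Tiny 1-D k-means (k = 2): alternate assignment and update steps.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]  # naive initialization at the two extremes

for _ in range(10):  # a few refinement iterations are enough for this data
    # Assignment step: each point joins its nearest center.
    clusters = [[], []]
    for p in points:
        idx = 0 if abs(p - centers[0]) < abs(p - centers[1]) else 1
        clusters[idx].append(p)
    # Update step: move each center to the mean of its cluster
    # (with this data neither cluster ever empties).
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))  # two centers settle near 1.5 and 10.5
```

No labels were ever given; the algorithm discovers the two groups purely from the similarities between the values.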

c. Reinforcement learning:

⚫ Reinforcement learning is a type of machine learning method where an intelligent
agent (computer program) interacts with the environment and learns to act within it.
⚫ It is a core part of Artificial Intelligence, and many AI agents work on the concept of
reinforcement learning.
⚫ Reinforcement Learning is a feedback-based machine learning technique in which
an agent learns to behave in an environment by performing actions and seeing
the results of those actions. For each good action, the agent gets positive feedback, and for
each bad action, the agent gets negative feedback or a penalty.

Working of Reinforcement Learning


Suppose there is an AI agent present within a maze environment, and its goal is to
find the diamond. The agent interacts with the environment by performing some
actions, and based on those actions, the state of the agent changes, and it also
receives a reward or penalty as feedback.

Figure: Reinforcement learning


⚫ Agent(): An entity that can perceive/explore the environment and act upon it.
⚫ Environment(): The situation in which an agent is present or by which it is surrounded. In RL, we
assume a stochastic environment, which means it is random in nature.
⚫ Action(): Actions are the moves taken by an agent within the environment.
⚫ State(): State is a situation returned by the environment after each action taken by
the agent.
⚫ Reward(): A feedback returned to the agent from the environment to evaluate the
action of the agent.
⚫ Policy(): Policy is a strategy applied by the agent for the next action based on the
current state.
⚫ Value(): The expected long-term return with the discount factor, as opposed to the
short-term reward.

Advantages
⚫ Reinforcement Learning is used to solve complex problems that cannot be solved by
conventional techniques.
⚫ This technique is preferred to achieve long-term results which are very difficult to
achieve.

⚫ This learning model is very similar to the learning of human beings. Hence, it is
close to achieving perfection.
Disadvantages
⚫ Too much reinforcement learning can lead to an overload of states which can
diminish the results.
⚫ This algorithm is not preferable for solving simple problems.
⚫ This algorithm needs a lot of data and a lot of computation.
⚫ The curse of dimensionality limits reinforcement learning for real physical systems.

Learning Process
Even though supervised learning is the most popular and
frequently used of the three types, all machine learning algorithms respond to the same
logic. The central idea is that you can represent reality using a mathematical function
that the algorithm doesn’t know in advance but can guess after having seen some data.
You can express reality and all its challenging complexity in terms of unknown
mathematical functions that machine learning algorithms find and make advantageous.

5. Exploring Cost Functions:


 A cost function is an important parameter that determines how well a
machine learning model performs for a given dataset.
 It calculates the difference between the expected value and the predicted value
and represents it as a single real number.
Regression cost function
⚫ Regression models are used to make a prediction for the continuous variables such as
the price of houses, weather prediction, loan predictions, etc.
Error = Actual Output - Predicted Output
1. Mean Error
⚫ In this cost function, the error for each training data is calculated and then the mean
value of all these errors is derived.
⚫ Calculating the mean of the errors is the simplest and most intuitive way possible.
⚫ The errors can be both negative and positive. So they can cancel each other out during
summation giving zero mean error for the model.
2. Mean Squared Error (MSE):

⚫ This addresses the drawback we encountered in Mean Error above. Here the square of the
difference between the actual and predicted value is calculated to avoid any possibility
of negative error.
MSE is measured as the average of the squared differences between predictions and
actual observations.

MSE = (sum of squared errors)/n


3. Mean Absolute Error (MAE):


⚫ This cost function also addresses the shortcoming of mean error differently. Here an
absolute difference between the actual and predicted value is calculated to avoid any
possibility of negative error.
⚫ So in this cost function, MAE is measured as the average of the sum of absolute
differences between predictions and actual observations.

MAE = (sum of absolute errors)/n
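The three cost functions above can be computed side by side on toy values (made up for illustration) to see how the signed errors cancel in the plain mean error:

```python
# Mean error, MSE, and MAE on a toy set of predictions.
actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.5, 2.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]  # [0.5, -0.5, 0.0, -1.0]
n = len(errors)

mean_error = sum(errors) / n             # signed errors partly cancel out
mse = sum(e ** 2 for e in errors) / n    # squaring keeps every error positive
mae = sum(abs(e) for e in errors) / n    # so does taking absolute values

print(mean_error, mse, mae)  # -0.25 0.375 0.5
```

Note how the +0.5 and -0.5 errors cancel in the mean error while both MSE and MAE count them fully, which is exactly the drawback the squared and absolute variants fix.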



6. Descending the Error Curve:

 Gradient descent is an iterative optimization algorithm for finding the local minimum of
a function.
⚫ Let’s say you are playing a game where the players are at the top of a mountain, and
they are asked to reach the lowest point of the mountain. Additionally, they are
blindfolded. So, what approach do you think would let you reach the lowest point?
⚫ To find a local minimum of a function using gradient descent, we take steps
proportional to the negative of the gradient (moving opposite the gradient) of the
function at the current point. If we instead take steps proportional to the positive of the
gradient (moving along the gradient), we approach a local maximum of the function, and
the procedure is called Gradient Ascent.
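The downhill-stepping idea can be sketched on a one-variable function; the choice of f(x) = (x - 3)^2, the starting point, and the learning rate are all illustrative:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x = 10.0   # starting point, far from the minimum
lr = 0.1   # learning rate: how large each downhill step is

for _ in range(100):
    grad = 2 * (x - 3)   # gradient at the current point
    x = x - lr * grad    # step against the gradient, i.e. downhill

print(round(x, 4))  # converges to the minimum at x = 3
```

Each update moves x a fraction of the way toward the minimum; too large a learning rate would overshoot and could diverge, too small a rate would need many more iterations.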

Batch Gradient Descent: This is a type of gradient descent which processes all the training
examples for each iteration of gradient descent. But if the number of training examples is large,
then batch gradient descent is computationally very expensive. Hence if the number of training
examples is large, then batch gradient descent is not preferred.

Stochastic Gradient Descent: This is a type of gradient descent which processes 1 training
example per iteration. Hence, the parameters are updated after every single example is
processed, which makes each iteration much faster than in batch gradient descent. But when
the number of training examples is large, processing only one example at a time adds overhead
for the system, because the number of iterations becomes quite large.

Mini Batch gradient descent: This is a type of gradient descent which often works faster than
both batch gradient descent and stochastic gradient descent. Here b examples, where b < m
(with m the total number of training examples), are processed per iteration. So even if the
number of training examples is large, it is processed in batches of b training examples in one
go. Thus, it scales to large training sets while needing fewer parameter updates than stochastic
gradient descent.
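The three variants can be contrasted with a mini-batch sketch that fits a one-parameter line y = w * x; the data (points on y = 2x), batch size b = 2, epochs, and learning rate are all made up for illustration:

```python
import random

# Mini-batch gradient descent fitting y = w * x to points on the line y = 2x.
random.seed(1)
xs = [float(i) for i in range(1, 9)]      # m = 8 training examples
ys = [2.0 * x for x in xs]                # true slope is 2.0
data = list(zip(xs, ys))

w, lr, b = 0.0, 0.01, 2                   # initial weight, learning rate, batch size
for _ in range(200):                      # epochs over the training set
    random.shuffle(data)                  # visit examples in a new order each epoch
    for i in range(0, len(data), b):      # one parameter update per mini-batch
        batch = data[i:i + b]
        # Gradient of the batch MSE w.r.t. w: the mean of 2 * (w*x - y) * x.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                    # step against the gradient

print(round(w, 3))  # very close to the true slope 2.0
```

Setting b = m here would give batch gradient descent (one update per epoch) and b = 1 would give stochastic gradient descent (one update per example); mini-batch sits between the two extremes.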
