The mathematical and statistical basis of machine learning makes outputting such
useful results possible. Using math and statistics in this way enables the algorithms to
understand anything with a numerical basis. To begin the process, you represent the
solution to the problem as a number. For example, if you want to diagnose a disease
using a machine learning algorithm, you can make the response a 1 or a 0 (a binary
response) to indicate whether the person is ill, with a 1 stating simply that the person
is ill. Alternatively, you can use a number between 0 and 1 to convey a less
BRAHMDEVDADA MANE INSTITUTE OF TECHNOLOGY, BELATI, SOLAPUR
definite
BRAHMDEVDADA MANE INSTITUTE OF TECHNOLOGY, BELATI, SOLAPUR
answer. The value can represent the probability that the person is ill, with 0 indicating
that the person isn’t ill and 1 indicating that the person definitely has the disease.
A machine learning algorithm cannot really digest such information. You must
first transform the information into numbers. Many ways are available to do so, but
the simplest is one-hot encoding, which turns every feature into a new set of binary
(values 0 or 1) features for all its symbolic values. For instance, consider the outlook
variable, which becomes three new features, as follows: outlook: sunny, outlook:
overcast, and outlook: rain. Each one will have a numeric value of 1 or 0 depending
on whether the implied condition is present. So when the day is sunny, outlook: sunny
has a value of 1 and outlook: overcast and outlook: rain have a value of 0.
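The encoding described above can be sketched in a few lines by hand; libraries such as pandas (get_dummies) or scikit-learn (OneHotEncoder) offer ready-made equivalents. The list of days below is made up for illustration.

```python
# A hand-rolled sketch of one-hot encoding the "outlook" feature: each
# symbolic value becomes its own binary (0 or 1) column.
def one_hot(values, categories):
    """Turn a list of symbolic values into rows of 0/1 indicator features."""
    return [[1 if v == c else 0 for c in categories] for v in values]

categories = ["sunny", "overcast", "rain"]
days = ["sunny", "rain", "overcast", "sunny"]  # made-up observations

encoded = one_hot(days, categories)
# Each row has a single 1 in the column of the observed outlook, e.g.
# "sunny" becomes [1, 0, 0] and "rain" becomes [0, 0, 1].
print(encoded)
```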
Creating a matrix
A matrix is a collection of numbers arranged in rows and columns, much like the squares on a
chessboard. However, unlike a chessboard, which is always square, a matrix can have a
different number of rows and columns.
The machine learning algorithm requires that you turn the individual features into a matrix of
features and the individual responses into a vector or a matrix.
A matrix used for machine learning relies on rows to represent examples and columns to
represent features. So, as in the example of learning the best weather conditions to play tennis,
you would construct a matrix that uses a new row for each day and columns containing the
different values for outlook, temperature, humidity, and wind. Typically, you represent a matrix
as a series of numbers enclosed by square brackets, as shown here:
X = [ 1 3 4 1
      4 5 3 2
      3 4 1 2 ]
In this example, the matrix X contains three rows and four columns, so you can
say that the matrix has dimensions of 3 by 4 (also written as 3 x 4).
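The matrix above can be written directly as nested Python lists (NumPy's np.array would represent it the same way):

```python
# The 3-by-4 matrix X above: each row is one example (a day) and each
# column is one feature (outlook, temperature, humidity, wind).
X = [
    [1, 3, 4, 1],
    [4, 5, 3, 2],
    [3, 4, 1, 2],
]

rows = len(X)      # 3 examples
cols = len(X[0])   # 4 features per example
print(rows, cols)  # dimensions 3 x 4
```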
If you multiply the probability of an event by the number of trials you are going to run,
you get an estimate of how many times that event should happen on average across all the trials.
For instance, if you have an event occurring with probability p=0.25 and
you try 100 times, you’re likely to witness that event happen 0.25 * 100 = 25
times.
Probabilities are between 0 and 1; no probability can exceed such boundaries.
Operating on probabilities:
For example, you can now compute the probability of getting at least one six from two thrown dice,
which is a summation of mutually exclusive events:
» The probability of having two sixes: p = 1/6 * 1/6
» The probability of having a six on the first die and something other than a six on the
second one: p = 1/6 * (1 - 1/6)
» The probability of having a six on the second die and something other than a six on
the first one: p = 1/6 * (1 - 1/6)
Your probability of getting at least one six from two thrown dice is p = 1/6 * 1/6 + 2 *
1/6 * (1 - 1/6) = 11/36 ≈ 0.306
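The dice computation above can be verified with exact fractions:

```python
from fractions import Fraction

# Summing the three mutually exclusive events from the text.
six = Fraction(1, 6)

both_sixes = six * six            # a six on both dice
six_then_not = six * (1 - six)    # a six on the first die only
not_then_six = (1 - six) * six    # a six on the second die only

p_at_least_one = both_sixes + six_then_not + not_then_six
print(p_at_least_one, float(p_at_least_one))  # 11/36, about 0.306
```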
Conditioning chance by Bayes’ theorem:
The formula used by Thomas Bayes is quite useful: P(B|E) = P(E|B) * P(B) / P(E), where B is the
belief and E is the evidence. For the example of guessing whether a person is female (B) given
that the person has long hair (E):
» P(B): The general probability of being a female; that is, the a priori probability of
the belief. In this case, the probability is 50 percent, or a value of 0.5 (the prior).
» P(E): The general probability of having long hair. Here it is another a priori
probability, this time related to the observed evidence. In this formula, it is a 35
percent probability, which is a value of 0.35 (evidence).
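As a sketch, the formula can be wrapped in a small function. Note that the likelihood P(E|B) = 0.6 used below is a made-up illustrative value; the text above supplies only P(B) and P(E).

```python
# Bayes' theorem as a function: P(B|E) = P(E|B) * P(B) / P(E).
# P(B) = 0.5 and P(E) = 0.35 come from the text; the likelihood
# P(E|B) = 0.6 is an invented value for illustration only.
def bayes_posterior(p_e_given_b, p_b, p_e):
    """Posterior probability of belief B after seeing evidence E."""
    return p_e_given_b * p_b / p_e

posterior = bayes_posterior(p_e_given_b=0.6, p_b=0.5, p_e=0.35)
print(round(posterior, 3))  # probability of B given E under these numbers
```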
In statistics, population refers to all the events and objects you want to measure,
and a sample is a part of it chosen by certain criteria. Using random sampling, which
is picking the events or objects to represent randomly, helps create a set of examples
that machine learning can learn from as it would learn from all the possible examples.
The sample works because the value distributions in the sample are similar to those in
the population, and that's enough.
Random sampling isn't the only possible approach. You can also apply stratified
sampling, through which you can control some aspects of the random sample in order
to avoid picking too many or too few events of a certain kind. After all, random is
random, and you have no absolute assurance of always replicating the exact
distribution of the population.
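The difference between the two sampling schemes can be sketched on a toy population; the "ill"/"healthy" labels and sizes below are made up for illustration.

```python
import random

random.seed(0)
population = [("ill", i) for i in range(10)] + [("healthy", i) for i in range(90)]

# Simple random sampling: no guarantee on how many "ill" cases are drawn.
simple = random.sample(population, 10)

# Stratified sampling: draw from each stratum in proportion to its size,
# so the 10 percent "ill" share of the population is preserved exactly.
def stratified_sample(pop, labels, n):
    out = []
    for label in labels:
        stratum = [item for item in pop if item[0] == label]
        k = round(n * len(stratum) / len(pop))
        out.extend(random.sample(stratum, k))
    return out

strat = stratified_sample(population, ["ill", "healthy"], 10)
print(sum(1 for label, _ in strat if label == "ill"))  # exactly 1 of the 10
```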
Learning comes in many different flavors, depending on the algorithm and its
objectives. You can divide machine learning algorithms into three main groups based
on their purpose:
» Supervised learning
» Unsupervised learning
» Reinforcement learning
a. Supervised learning:
In supervised (predictive) learning, the algorithm learns from a training set of labeled
examples, pairs of inputs and known outputs, so that it can predict the correct output for
new, unseen inputs.
b. Unsupervised learning:
In unsupervised (descriptive) learning, the goal is to find "interesting patterns" in the data.
This is sometimes called knowledge discovery. The data consists only of inputs, D = {x_i},
i = 1, ..., N. This is a much less well-defined problem, since we are not told what kinds of
patterns to look for, and there is no obvious error metric to use.
Example: grouping customers by purchasing behavior.
Why use unsupervised learning
⚫ Unsupervised learning is helpful for finding useful insights in the data.
⚫ Unsupervised learning resembles how a human learns to think through personal
experience, which makes it closer to real AI.
⚫ Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
⚫ In the real world, we do not always have input data with the corresponding output, so to
solve such cases, we need unsupervised learning.
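As a sketch of the customer-grouping example, here is a tiny hand-rolled k-means that finds the purchasing-behavior groups without any labels. The customer data (visits per month, average spend) is made up; real code would typically use scikit-learn's KMeans.

```python
import random

random.seed(1)
customers = [(1, 10), (2, 12), (1, 11),      # infrequent, low-spend shoppers
             (9, 95), (10, 100), (8, 90)]    # frequent, high-spend shoppers

def kmeans(points, k, iters=10):
    centers = random.sample(points, k)
    clusters = []
    for _ in range(iters):
        # assignment step: put each point with its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = []
        for i, c in enumerate(clusters):
            if c:
                new_centers.append(tuple(sum(dim) / len(c) for dim in zip(*c)))
            else:
                new_centers.append(centers[i])   # keep old center if cluster empty
        centers = new_centers
    return clusters

clusters = kmeans(customers, k=2)
print([sorted(c) for c in clusters])  # the two behavior groups, found without labels
```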
c. Reinforcement learning:
In reinforcement learning, an agent learns by trial and error: it takes actions in an
environment and receives rewards or penalties, gradually learning a policy that maximizes
long-term reward.
Advantages
⚫ Reinforcement Learning is used to solve complex problems that cannot be solved by
conventional techniques.
⚫ This technique is preferred to achieve long-term results which are very difficult to
achieve.
⚫ This learning model is very similar to the learning of human beings. Hence, it is
close to achieving perfection
Disadvantages
⚫ Too much reinforcement learning can lead to an overload of states which can
diminish the results.
⚫ This algorithm is not preferable for solving simple problems.
⚫ This algorithm needs a lot of data and a lot of computation.
⚫ The curse of dimensionality limits reinforcement learning for real physical systems.
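As an illustrative sketch of the trial-and-error idea, here is minimal tabular Q-learning on a made-up five-cell corridor. The environment, rewards, and hyperparameters are all invented for illustration.

```python
import random

# The agent starts at cell 0 and earns a reward only on reaching cell 4.
random.seed(0)

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # step left or step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

for _ in range(200):                   # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy: usually exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)          # walls clamp the move
        r = 1.0 if s2 == GOAL else 0.0                 # reward only at the goal
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        # Q-learning update: nudge Q toward reward plus discounted future value
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy should step right in every non-goal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```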
Learning Process
Even though supervised learning is the most popular and frequently used of the three
types, all machine learning algorithms respond to the same logic. The central idea is
that you can represent reality using a mathematical function that the algorithm doesn't
know in advance but can guess after having seen some data. You can express reality and
all its challenging complexity in terms of unknown mathematical functions that machine
learning algorithms find and take advantage of.
⚫ This improves on the drawback we encountered in Mean Error above. Here the square of the
difference between the actual and predicted value is calculated, so positive and negative
errors cannot cancel each other out.
It is measured as the average of the sum of squared differences between predictions and
actual observations.
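The measure described above can be sketched as (the prediction and observation values below are made up):

```python
# Mean squared error: the average of the squared differences between
# predictions and actual observations.
def mse(predictions, actuals):
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

predicted = [2.5, 0.0, 2.0, 8.0]   # made-up predictions
actual = [3.0, -0.5, 2.0, 7.0]     # made-up observations
print(mse(predicted, actual))      # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```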
Gradient descent is an iterative optimization algorithm for finding the local minimum of
a function.
⚫ Let's say you are playing a game where the players are at the top of a mountain and are
asked to reach its lowest point, a lake at the base, while blindfolded. So, what approach
do you think would make you reach the lake?
⚫ To find a local minimum of a function using gradient descent, we take steps
proportional to the negative of the gradient (moving in the direction opposite to the
gradient) of the function at the current point. If we instead take steps proportional to
the positive of the gradient (moving in the direction of the gradient), we approach a
local maximum of the function; that procedure is called gradient ascent.
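The step rule described above can be sketched on a simple one-dimensional function; the function, starting point, and learning rate below are chosen purely for illustration.

```python
# Gradient descent on f(x) = (x - 3) ** 2, whose gradient is
# f'(x) = 2 * (x - 3) and whose minimum sits at x = 3.
def gradient(x):
    return 2 * (x - 3)

x = 0.0               # starting point: the "top of the mountain"
learning_rate = 0.1   # how big each blindfolded step is

for _ in range(100):
    x -= learning_rate * gradient(x)   # step against the gradient

print(round(x, 4))  # converges very close to 3.0
```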
Batch Gradient Descent: This type of gradient descent processes all the training
examples for each iteration. If the number of training examples is large, batch gradient
descent becomes computationally very expensive, so it is not preferred in that case.
Stochastic Gradient Descent: This type of gradient descent processes one training
example per iteration, so the parameters are updated after every single example. This makes
each update much faster than in batch gradient descent. However, when the number of training
examples is large, processing only one example at a time means the number of iterations
becomes quite large, which adds overhead for the system.
Mini-Batch Gradient Descent: This type of gradient descent often works faster than both
batch gradient descent and stochastic gradient descent. Here b examples, where b < m, are
processed per iteration. So even if the number of training examples is large, they are
processed in batches of b training examples at a time. Thus, it works for larger training
sets and with a smaller number of iterations.
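The three variants can be sketched as one routine whose batch size b is the only thing that changes. The model (a single weight fitting y = w * x), the toy data, and the hyperparameters below are made up for illustration.

```python
import random

random.seed(0)
examples = [(x, 2 * x) for x in range(1, 21)]  # m = 20 training examples, true w = 2

def fit(data, batch_size, epochs=200, lr=0.001):
    data = list(data)
    w = 0.0                                    # single weight for y = w * x
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of the squared error, averaged over this batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w_batch = fit(examples, batch_size=20)  # batch: all m examples per update
w_sgd = fit(examples, batch_size=1)     # stochastic: 1 example per update
w_mini = fit(examples, batch_size=4)    # mini-batch: b = 4 < m per update
print(round(w_batch, 3), round(w_sgd, 3), round(w_mini, 3))  # all near 2.0
```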