COURSE OBJECTIVES:
Machine Learning is an application of Artificial Intelligence that enables systems to learn from
vast volumes of data and solve specific problems. It uses computer algorithms that improve
their efficiency automatically through experience.
We can train machine learning algorithms by providing them with huge amounts of data
and letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured by a cost function. With the help of
machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions by Facebook, and more. Top companies such
as Netflix and Amazon have built machine learning models that use vast
amounts of data to analyse user interests and recommend products accordingly.
History
A few decades ago (about 40-50 years), machine learning was science fiction, but today
it is part of our daily life. Machine learning is making our day-to-day life easier,
from self-driving cars to Amazon's virtual assistant "Alexa". The idea
behind machine learning, however, is quite old and has a long history. Some milestones
from the history of machine learning are given below:
From this point on, "intelligent" machine learning algorithms and computer
programs started to appear, doing everything from planning travel routes for
salespeople, to playing board games with humans such as checkers and tic-
tac-toe.
Intelligent machines went on to do everything from using speech recognition,
to learning to pronounce words the way a baby would, to defeating a
world chess champion at his own game. Over this period, machine learning grew
from mathematical models into sophisticated technology.
Definitions
Machine learning is a subset of artificial intelligence that is mainly concerned with
the development of algorithms that allow a computer to learn from data and past
experience on its own. The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:
With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine learning brings computer
science and statistics together to create predictive models. Machine learning
constructs or uses algorithms that learn from historical data. The more
information we provide, the higher the performance.
A machine has the ability to learn if it can improve its performance by gaining
more data.
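As a rough illustration of this definition, here is a minimal sketch of a machine building a model from historical data and predicting an unseen case. The library (scikit-learn) and the numbers are our own choices, not part of the text:

```python
# A minimal sketch: a model learns a pattern from historical (training) data
# and predicts an unseen case. The data values are made up for illustration.
from sklearn.linear_model import LinearRegression

# Historical data: advertising spend (input) vs. sales (output)
X_train = [[10], [20], [30], [40]]   # spend, in $1000s
y_train = [25, 45, 65, 85]           # sales, in units

model = LinearRegression().fit(X_train, y_train)  # the "learning" step
print(model.predict([[50]]))         # prediction for unseen spend: ~105
```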
Applications
Machine learning is a buzzword for today's technology, and it is growing very rapidly day by
day. We are using machine learning in our daily life even without knowing it such as Google
Maps, Google assistant, Alexa, etc. Below are some most trending real-world applications of
Machine Learning:
1. Automation
Machine learning is one of the driving forces behind automation, and it is cutting
down time and human workload. Automation can now be seen everywhere, and
complex algorithms do the hard work for the user. Automation is more reliable,
efficient, and quick. With the help of machine learning, advanced computers are
now being designed that can handle several machine learning models and complex
algorithms at once. However, although automation is spreading fast across industry,
a lot of research and innovation is still required in this field.
2. Scope of Improvement
Machine Learning is a field where things keep evolving. It gives many opportunities
for improvement and can become the leading technology in the future. A lot of
research and innovation is happening in this technology, which helps improve software
and hardware.
Machine learning is going to be used extensively in the education sector, and it will
enhance the quality of education and the student experience. In China, for example,
machine learning has been used to improve student focus. In the e-commerce field,
machine learning studies your search feed and gives suggestions based on it.
Depending upon search and browsing history, it pushes targeted advertisements and
notifications to users.
This technology has a very wide range of applications. Machine learning plays a role
in almost every field, like hospitality, ed-tech, medicine, science, banking, and
business. It creates more opportunities.
Disadvantages
Nothing in the world is perfect. Machine learning has some serious limitations, which
can produce errors bigger than human errors.
1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The outcome
will be incorrect if a credible data source is not provided. The quality of the data is also
significant: if the user or institution needs higher-quality data, it has to wait for it, which
causes delays in providing the output. So, machine learning significantly depends on the
data and its quality.
2. Time and Resources
The data that machines process remains huge in quantity and differs greatly. Machines
require time for their algorithm to adjust to the environment and learn from it. Trial
runs are held to check the accuracy and reliability of the machine. Setting up that
quality of infrastructure requires massive and expensive resources and high-quality
expertise. Trial runs are also costly, in terms of both time and expense.
3. Results Interpretations
One of the biggest limitations of machine learning is that the interpreted results we
get from it cannot be one hundred percent accurate; they will have some degree of
inaccuracy. For a high degree of accuracy, algorithms must be developed so that
they give reliable results.
An error committed during the initial stages is huge, and if it is not corrected at that
time, it creates havoc. Bias and incorrectness have to be dealt with separately; they are
not interconnected. Machine learning depends on two factors, i.e., data and algorithm,
and all the errors are dependent on these two variables. Any incorrectness in either
variable has huge repercussions on the output.
5. Social Changes
Machine learning is bringing numerous social changes to society. The role of machine
learning-based technology in society has increased manifold. It is influencing the
thought process of society and creating unwanted problems. Character
assassination and the leaking of sensitive details are disturbing the social fabric.
With the advancement of machine learning, the nature of jobs is changing. Much of
the work that humans used to do is now done by machines, eating up jobs that were
earlier done by people. It is difficult for those without technical education to adjust to
these changes.
8. Highly Expensive
This software is highly expensive, and not everybody can own it. Government agencies,
big private firms, and enterprises mostly own it. It needs to be made accessible to
everybody for wide use.
9. Privacy Concern
Data is one of the pillars of machine learning, and its collection has raised the
fundamental question of privacy. The way data is collected and used for
commercial purposes has always been a contentious issue. In India, the Supreme
Court has declared privacy a fundamental right of Indians: without the user's
permission, data cannot be collected, used, or stored. However, many cases have come
up in which big firms collected data without the user's knowledge and used it for their
commercial gains.
Machine learning is an evolving concept. The area has not yet seen any major
development that fully revolutionized an economic sector, and it requires continuous
research and innovation.
Challenges
Although machine learning is being used in every industry and helps organizations
make more informed and data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some
common issues in machine learning that professionals face while building ML skills and
creating an application from scratch.
The major issue that arises while using machine learning algorithms is a lack of
quality as well as quantity of data. Although data plays a vital role in the processing of
machine learning algorithms, many data scientists report that inadequate data, noisy
data, and unclean data severely degrade machine learning algorithms. For
example, a simple task may require thousands of samples, while an advanced task such
as speech or image recognition needs millions of sample data examples. Further, data
quality is also important for the algorithms to work ideally, yet poor data quality is
common in machine learning applications. Data quality can be affected by factors
such as the following:
o Noisy data - It is responsible for inaccurate predictions that affect decisions as
well as accuracy in classification tasks.
o Incorrect data - It is responsible for faulty results from machine learning models
and hence may also affect the accuracy of the results.
o Generalizing of output data - Sometimes generalizing output data becomes
complex, which results in comparatively poor future actions.
As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to
less accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning algorithms.
To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases we need to generalize to. The training
data must cover all cases that have already occurred as well as those that are occurring.
Further, if we use non-representative training data in the model, it results in less accurate
predictions. A machine learning model is said to be ideal if it predicts well for generalized
cases and provides accurate decisions. If there is too little training data, there will be
sampling noise in the model, which is then called a non-representative training set; it won't
be accurate in its predictions and will be biased against one class or group.
Hence, we should use representative data in training to protect against bias and make
accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by machine learning engineers
and data scientists. When a machine learning model is trained with a huge amount
of data, it starts capturing the noise and inaccurate values in the training data set,
which negatively affects the performance of the model. Consider a simple
example where the training data contains 1,000 mangoes, 1,000 apples,
1,000 bananas, and 5,000 papayas. There is then a considerable probability of
identifying an apple as a papaya, because we have a massive amount of biased data
in the training data set, and prediction is negatively affected. A common cause of
overfitting is the use of highly flexible non-linear methods in machine learning
algorithms, as they can build unrealistic data models. One way to reduce overfitting
is to prefer linear and parametric algorithms in the machine learning models.
Underfitting:
Underfitting occurs when our model is too simple to capture the underlying structure of
the data, just like an undersized pant. This generally happens when we have limited
data in the data set and try to build a linear model with non-linear data. In such
scenarios, the model's complexity collapses, its rules become too simple to apply to
this data set, and it starts making wrong predictions.
5. Monitoring and Maintenance
As we know, generalized output data is mandatory for any machine learning
model; hence, regular monitoring and maintenance become compulsory. Different
results for different actions require data changes; editing the code, as well as
the resources for monitoring it, also becomes necessary.
6. Getting bad recommendations
A machine learning model operates within a specific context, which can result in bad
recommendations and concept drift in the model. For example, at a specific
time a customer is looking for some gadgets; the customer's requirements change
over time, but the machine learning model keeps showing the same
recommendations even though the customer's expectations have changed. This
incident is called data drift. It generally occurs when new data is introduced or the
interpretation of data changes. However, we can overcome this by regularly updating
and monitoring the data according to expectations.
7. Lack of Skilled Resources
Although machine learning and artificial intelligence are continuously growing in the
market, these industries are still younger than most others. The absence of skilled
resources in the form of manpower is also an issue: we need people with
in-depth knowledge of mathematics, science, and technology to develop and
manage the scientific substance of machine learning.
8. Process Complexity of Machine Learning
The machine learning process is very complex, which is another major issue faced
by machine learning engineers and data scientists. Machine learning and
artificial intelligence are very new technologies, still in an experimental phase
and continuously changing over time. Development relies largely on hit-and-trial
experiments, so the probability of error is higher than expected. Further, the process
includes analysing the data, removing data bias, training the data, applying complex
mathematical calculations, and more, which makes the procedure complicated and quite
tedious.
Data biasing is also a big challenge in machine learning. These errors exist when
certain elements of the dataset are heavily weighted or given more importance than
others. Biased data leads to inaccurate results, skewed outcomes, and other analytical
errors. We can resolve this by determining where the data is actually biased
in the dataset and then taking the necessary steps to reduce the bias.
Slowness is also a very common issue in machine learning models. Machine
learning models are highly efficient at producing accurate results but are time-
consuming: slow programs, excessive requirements, and overloaded data take
more time than expected to provide accurate results. This demands continuous
maintenance and monitoring of the model to deliver accurate results.
Although machine learning models are intended to give the best possible outcome, if
we feed garbage data as input, then the result will also be garbage. Hence, we should
use relevant features in our training sample. A machine learning model is said to be
good if the training data has a good set of features and few or no irrelevant features.
Based on the methods and way of learning, machine learning is divided into mainly
four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
The main goal of the supervised learning technique is to map the input
variable (x) with the output variable (y). Some real-world applications of supervised
learning are Risk Assessment, Fraud Detection, Spam Filtering, etc.
Supervised machine learning can be classified into two types of problems, which are
given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems in which the
output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc.
The classification algorithms predict the categories present in the dataset. Some real-
world examples of classification algorithms are Spam Detection, Email Filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous
output variables, such as market trends, weather prediction, etc.
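To make the contrast concrete, here is a small sketch of the two problem types side by side. The library (scikit-learn) and the toy data are our own choices, not taken from the text:

```python
# A sketch contrasting regression (continuous output) with
# classification (categorical output) on invented toy data.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]

# Regression: continuous target (e.g., a price or a temperature)
y_cont = [1.9, 4.1, 6.0, 8.2, 9.9, 12.1]
print(LinearRegression().fit(X, y_cont).predict([[7]]))   # roughly 14

# Classification: categorical target (e.g., "No"/"Yes" encoded as 0/1)
y_cat = [0, 0, 0, 1, 1, 1]
print(LogisticRegression().fit(X, y_cat).predict([[7]]))  # predicts class 1
```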
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o These algorithms cannot handle very complex tasks, and they may predict wrong
outputs if the test data differs from the training data.
Applications of Supervised Learning:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is
done by using medical images and past data labelled with disease conditions.
With such a process, the machine can identify diseases for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can be
done using the same, such as voice-activated passwords, voice commands, etc.
2. Unsupervised Machine Learning
In unsupervised learning, the models are trained with data that is neither classified
nor labelled, and the model acts on that data without any supervision.
So, the machine discovers patterns and differences on its own, such as colour
differences and shape differences, and predicts the output when it is tested with the
test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by
their purchasing behaviour.
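As a sketch of that customer-grouping example, here is a minimal k-means clustering run. The library (scikit-learn) and the two-feature data are invented for illustration:

```python
# A sketch of clustering customers by purchasing behaviour with k-means.
# The rows are unlabelled: the algorithm finds the groups on its own.
from sklearn.cluster import KMeans

# Each row: [monthly visits, average basket value] -- invented numbers
customers = [[2, 10], [3, 12], [2, 11],      # a low-spend group
             [20, 95], [22, 100], [21, 98]]  # a high-spend group

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)  # e.g., [0 0 0 1 1 1]: two inherent groups discovered
```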
2) Association
Association rule learning is an unsupervised technique that finds interesting
relationships among variables in a large dataset, such as items that are frequently
bought together. Some popular association rule learning algorithms are the Apriori
algorithm, Eclat, and the FP-growth algorithm.
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output beforehand.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.
3. Semi-Supervised Machine Learning
Semi-supervised learning sits between supervised and unsupervised learning: the
model is trained on a small amount of labelled data together with a large amount of
unlabelled data. Its main advantage is that it reduces the cost of labelling; its main
disadvantage is that the results can be less reliable than with fully labelled data.
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning, and agents
learn from their experiences only.
The reinforcement learning process is similar to how a human being learns; for
example, a child learns various things through experience in day-to-day life. An example
of reinforcement learning is playing a game, where the game is the environment, the
moves of the agent at each step define states, and the goal of the agent is to get a
high score. The agent receives feedback in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such
as game theory, operations research, information theory, and multi-agent systems.
Advantages
Disadvantage
The curse of dimensionality limits reinforcement learning for real physical systems.
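As a rough sketch of learning from rewards and punishments, here is a tiny tabular Q-learning loop. The toy environment (a 1-D track with a reward at the end), the hyperparameters, and the code structure are all our own assumptions, not from the text:

```python
# A minimal tabular Q-learning sketch: an agent on a 5-cell track learns,
# purely from trial-and-error experience, to walk right toward the reward.
import random

n_states = 5                                 # cells 0..4; cell 4 holds the reward
Q = [[0.0, 0.0] for _ in range(n_states)]    # Q[state][action]; 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1        # learning rate, discount, exploration

def pick(s):
    # epsilon-greedy: explore randomly, or exploit the best-known action
    if random.random() < epsilon or Q[s][0] == Q[s][1]:
        return random.randint(0, 1)
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(300):                         # episodes = repeated experience
    s = 0
    while s != n_states - 1:
        a = pick(s)
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0    # reward only at the goal
        # Q-learning update: nudge Q toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([0 if Q[s][0] > Q[s][1] else 1 for s in range(n_states - 1)])
# expected learned policy: always move right -> [1, 1, 1, 1]
```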
Mathematical Foundations
• You could develop a customized model that fits your own problem
by knowing the mathematics behind the machine learning model.
Six math subjects form the foundation for machine learning. Each subject
intertwines with the others to develop our machine learning model and reach the
"best" model for generalizing the dataset.
Linear Algebra
Linear algebra is a branch of mathematics that deals with linear equations and their
representations in the vector space using matrices. In other words, linear algebra is the study
of linear functions and vectors. It is one of the most central topics of mathematics. Most modern
geometrical concepts are based on linear algebra.
Linear algebra facilitates the modeling of many natural phenomena and hence, is an integral
part of engineering and physics. Linear equations, matrices, and vector spaces are the most
important components of this subject. In this article, we will learn more about linear algebra
and the various associated topics.
Linear algebra can be defined as a branch of mathematics that deals with the study of
linear functions in vector spaces. When information related to linear functions is
presented in an organized form then it results in a matrix. Thus, linear algebra is
concerned with vector spaces, vectors, linear functions, the system of linear equations,
and matrices. These concepts are a prerequisite for sister topics such as geometry
and functional analysis.
The branch of mathematics that deals with vectors, matrices, and finite or infinite
dimensions, as well as linear mappings between such spaces, is defined as linear
algebra. It is used in both pure and applied mathematics, along with different
technical fields such as physics, engineering, and the natural sciences.
Linear algebra can be categorized into three branches depending upon the level of difficulty
and the kind of topics that are encompassed within each. These are elementary, advanced, and
applied linear algebra. Each branch covers different aspects of matrices, vectors, and linear
functions.
Elementary linear algebra introduces students to the basics of linear algebra. This includes
simple matrix operations, various computations that can be done on a system of linear
equations, and certain aspects of vectors. Some important terms associated with elementary
linear algebra are given below:
Scalars - A scalar is a quantity that only has magnitude and not direction. It is an element that
is used to define a vector space. In linear algebra, scalars are usually real numbers.
Vectors - A vector is an element in a vector space. It is a quantity that can describe both the
direction and magnitude of an element.
Vector Space - The vector space consists of vectors that may be added together and multiplied
by scalars.
Matrix - A matrix is a rectangular array wherein the information is organized in the form of
rows and columns. Most linear algebra properties can be expressed in terms of a matrix.
Matrix Operations - These are simple arithmetic operations such as addition, subtraction,
and multiplication that can be conducted on matrices.
Once the basics of linear algebra have been introduced to students the focus shifts on more
advanced concepts related to linear equations, vectors, and matrices. Certain important terms
that are used in advanced linear algebra are as follows:
Linear Transformations - The transformation of a function from one vector space to another
by preserving the linear structure of each vector space.
Inverse of a Matrix - When the inverse of a matrix is multiplied with the given original
matrix, the result is the identity matrix. Thus, A⁻¹A = I.
Linear Map - It is a type of mapping that preserves vector addition and vector multiplication.
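A quick sketch of these terms in code may help. The library (NumPy) and the matrix values are our own choices; the point is just to show matrix operations, a linear map, and the inverse property A⁻¹A = I:

```python
# A sketch of basic linear algebra operations with NumPy.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # a 2x2 matrix
v = np.array([1.0, 2.0])       # a vector in R^2

print(A + A)                   # matrix addition
print(A @ v)                   # matrix-vector multiplication (a linear map)
A_inv = np.linalg.inv(A)
print(A_inv @ A)               # approximately the identity matrix I
```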
Linear algebra is used in almost every field. Simple algorithms also make
use of linear algebra topics such as matrices. Some of the applications of
linear algebra are given as follows:
• Signal Processing - Linear algebra is used in encoding and manipulating
signals such as audio and video signals. Furthermore, it is required in the
analysis of such signals.
• Linear Programming - It is an optimizing technique that is used to determine
the best outcome of a linear function.
• Computer Science - Data scientists use several linear algebra algorithms to
solve complicated problems.
• Prediction Algorithms - Prediction algorithms use linear models that are
developed using concepts of linear algebra.
• Distance Function
• Inner Product
In finance, Bayes' Theorem can be used to rate the risk of lending money to
potential borrowers. The theorem is also called Bayes' Rule or Bayes' Law
and is the foundation of the field of Bayesian statistics.
The theorem can be written as:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the conditional probability of event A occurring given that event B
has already occurred.
The Bayes theorem states that the probability of an event is based on prior knowledge
of the conditions that might be related to the event. It is also used to examine the case
of conditional probability. If we are aware of conditional probability, we can use the
Bayes formula to calculate reverse probabilities. The probability of A occurring given
that event B has taken place is equal to the product of the probability of event A
occurring at all and the probability of event B taking place given that event A has taken
place, divided by the probability of event B taking place at all.
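As a sketch of the lending example mentioned above, here is that reverse-probability calculation in code. All the numbers are invented for illustration only:

```python
# A sketch of Bayes' Theorem applied to rating lending risk.
P_default = 0.05             # P(A): prior probability a borrower defaults
P_flag_given_default = 0.90  # P(B|A): risk model flags a defaulting borrower
P_flag = 0.15                # P(B): overall probability of being flagged

# Reverse probability: P(A|B) = P(B|A) * P(A) / P(B)
P_default_given_flag = P_flag_given_default * P_default / P_flag
print(P_default_given_flag)  # 0.30: default risk once a borrower is flagged
```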
Vector Calculus
Calculus is the mathematical study of continuous change, and it
mainly consists of functions and limits. Vector calculus itself is
concerned with the differentiation and integration of vector fields.
Vector calculus is often called multivariate calculus, although multivariate
calculus has a slightly different study case: it deals with the application of
calculus to functions of multiple independent variables.
Derivative Equation
• Partial Derivative
• Gradient
The gradient is related to the derivative, or rate of change, of a
function; you might consider "gradient" a fancy word for the derivative.
The term gradient is typically used for functions with several inputs and a
single (scalar) output. The gradient has a direction to move from
the current location, e.g., up, down, right, or left.
Optimization
Gradient Descent
There are a few terms to know as a starting point when learning optimization:
The point at which a function takes its minimum value is called
the global minimum. However, when the goal is to minimize a function
and it is solved using optimization algorithms such as gradient descent,
the function could have minimum values at several points. The points that
appear to be minima but are not where the function actually takes its
minimum value are called local minima.
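As a sketch of gradient descent in action, here is a minimal loop minimizing a simple one-variable function. The function, starting point, and learning rate are arbitrary choices of ours:

```python
# A sketch of gradient descent finding the minimum of f(x) = (x - 3)^2 + 1.
def grad(x):
    return 2 * (x - 3)        # the derivative (gradient) of f

x, lr = -5.0, 0.1             # initial location and learning rate
for _ in range(100):
    x -= lr * grad(x)         # step against the gradient (downhill)

print(round(x, 4))            # approaches 3.0, the global minimum
```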
Decision Theory
Decision theory is a study of an agent's rational choices that supports all kinds of
progress in technology such as work on machine learning and artificial intelligence.
Decision theory looks at how decisions are made, how multiple decisions influence
one another, and how decision-making parties deal with uncertainty. Decision
theory is also known as theory of choice.
Decision theory, combined with probability theory, allows us to make optimal decisions
in situations involving uncertainty such as those encountered in pattern recognition.
Classification problems can be broken down into two separate stages, the inference
stage and the decision stage. The inference stage involves using the training data to
learn a model for the joint distribution, which gives us the most complete
probabilistic description of the situation. In the end, we must decide on the optimal
choice based on our situation. This decision stage is generally very simple, even
trivial, once we have solved the inference problem.
Information Theory
The greater the degree of surprise in a statement, the greater the
information contained in it. For example, say commuting from place
A to B takes 3 hours on average and this is known to everyone. If somebody makes
this statement, it provides no information at all, as it is already known to
everyone. Now, if someone says that it takes 2 hours to go from place A to B provided
a specific route is taken, then this statement carries a good number of bits of
information, as there is an element of surprise in it.
The extent of information required to describe an event depends upon the possibility of
occurrence of that event. If the event is a common event, not much information is
required to describe the event. However, for unusual events, a good amount of
information will be needed to describe such events. Unusual events have a higher
degree of surprises and hence greater associated information.
The amount of information associated with event outcomes depends upon the probability
distribution associated with that event. In other words, the amount of information is
related to the probability distribution of event outcomes. Recall that the event and its
outcomes can be represented as the different values of the random variable, X from the
given sample space. The random variable has an associated probability distribution,
with a probability attached to each outcome; common outcomes carry little
information, and rare outcomes carry a lot of information.
The higher the probability of an event outcome, the less
information is contained if that outcome happens. The smaller the probability of an
event outcome, the greater the information contained if that lower-probability
outcome happens.
• Information (or degree of surprise) associated with a single discrete event: The
information associated with a single discrete event can be measured in terms of the
number of bits. Shannon introduced the term bits as the unit of information. This
information is also called self-information.
• Information (or degree of surprise) associated with the random variable whose values
represent different event outcomes where the values can be discrete or continuous.
Information associated with the random variable is related to probability distribution
as described in the previous section. The amount of information associated with the
random variable is measured using Entropy (or Shannon Entropy).
• The entropy of the event representing the random variable equals the average self-
information from observing each outcome of the event.
What is Entropy?
Entropy represents the amount of information associated with a random variable as a
function of the probability distribution for that random variable, whether the
distribution is a probability density function (PDF) or a probability mass function
(PMF). The following is the formula for the entropy of a discrete random variable:

H(X) = −Σᵢ Pᵢ log Pᵢ

where Pᵢ represents the probability of a specific value of the random variable X. The
entropy of a continuous random variable, also termed differential entropy, is:

h(X) = −∫ p(x) log p(x) dx
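As a small worked example of the discrete formula, here is entropy computed in code (in bits, i.e., using log base 2; the two distributions are our own examples):

```python
# A sketch computing Shannon entropy H(X) = -sum(p_i * log2(p_i)) in bits.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin, maximum surprise
print(entropy([0.99, 0.01]))  # ~0.08 bits: a predictable event, little surprise
```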
Introduction
Supervised learning is the type of machine learning in which machines are trained
using well-"labelled" training data, and on the basis of that data, machines predict the
output. Labelled data means that some input data is already tagged with the correct
output.
In supervised learning, the training data provided to the machines works as a
supervisor that teaches the machines to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find
a mapping function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
The working of Supervised learning can be easily understood by the below example
and diagram:
Suppose we have a dataset of different types of shapes, which includes squares,
rectangles, triangles, and polygons. The first step is to train the model
for each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape,
it classifies the shape on the basis of its number of sides and predicts the output.
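As a sketch of that shape example in code, here is a small classifier trained on labelled shapes. The feature encoding (number of sides, whether all sides are equal) and the library (scikit-learn) are our own choices:

```python
# A sketch of the shapes example: train on labelled shapes, test on a new one.
from sklearn.tree import DecisionTreeClassifier

# Each row: [number of sides, all sides equal (1/0)]
X_train = [[4, 1], [4, 1],   # four equal sides  -> square
           [3, 0], [3, 1],   # three sides       -> triangle
           [6, 1], [6, 1]]   # six equal sides   -> hexagon
y_train = ["square", "square", "triangle", "triangle", "hexagon", "hexagon"]

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[4, 1]]))   # a new 4-equal-sided shape -> ['square']
```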
1. Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. They are used for the prediction of continuous variables, such
as weather forecasting, market trends, etc. Below are some popular regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, meaning
there are classes such as Yes-No, Male-Female, True-False, etc.
Example: Spam Filtering. Below are some popular classification algorithms which
come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Machine learning models can be classified into two types: Discriminative and
Generative. In simple words, a discriminative model makes predictions on
unseen data based on conditional probability and can be used either for
classification or regression problem statements. On the contrary, a
generative model focuses on the distribution of a dataset to return a
probability for a given example.
We, as humans, can adopt either of two different approaches to machine learning
models while learning an artificial language. These two models have not previously
been explored in human learning, but they are related to known effects of causal
direction, classification vs. inference learning, and observational vs. feedback
learning. So, in this article, our focus is on two types of machine learning models,
generative and discriminative, and we will also see the importance, comparisons,
and differences of these two models.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide if
an email is spam or not spam based on the words present in a particular email. To
solve this problem, we have a joint model over the label Y and the features X:

p(Y, X) = p(y, x₁, x₂, …, xₙ)
Now, our goal is to estimate the probability of spam email i.e., P(Y=1|X). Both
generative and discriminative models can solve this problem but in different ways.
If we have some outliers present in the dataset, discriminative models work better
compared to generative models i.e., discriminative models are more robust to
outliers. However, one major drawback of these models is the misclassification
problem, i.e., wrongly classifying a data point.
The Mathematics of Discriminative Models
Training discriminative classifiers (discriminant analysis) involves
estimating a function f: X → Y, or the conditional probability P(Y|X).
Examples of discriminative models include:
• Logistic regression
• Support vector machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
Since these models often rely on the Bayes theorem to find the joint
probability, generative models can tackle a more complex task than
analogous discriminative models.
So, the generative approach focuses on the distribution of the individual classes
in a dataset, and the learning algorithms tend to model the underlying
patterns or distribution of the data points (e.g., Gaussian). These models use
the concept of joint probability and create instances where a given feature
(x), or input, and the desired output, or label (y), exist simultaneously.
• Assume some functional form for the probabilities such as P(Y), P(X|Y)
• With the help of training data, we estimate the parameters of P(X|Y), P(Y)
• Use the Bayes theorem to calculate the posterior probability P(Y |X)
Examples of Generative Models
• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
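To make the contrast concrete, here is a sketch of the two approaches on the same toy data. The library (scikit-learn) and the data are our own choices: naive Bayes models P(X|Y) and P(Y) and applies Bayes' theorem (generative), while logistic regression models P(Y|X) directly (discriminative):

```python
# A sketch of a generative vs. a discriminative model on toy data.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X = [[1.0], [1.2], [0.8], [3.0], [3.2], [2.9]]   # one feature, two classes
y = [0, 0, 0, 1, 1, 1]

generative = GaussianNB().fit(X, y)              # models P(X|Y) and P(Y)
discriminative = LogisticRegression().fit(X, y)  # models P(Y|X) directly

print(generative.predict_proba([[2.0]]))         # P(Y|X) via Bayes' theorem
print(discriminative.predict_proba([[2.0]]))     # P(Y|X) modelled directly
```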
Let’s see some of the differences between the Discriminative and Generative
Models.
Core Idea
Mathematical Intuition
Applications
Discriminative models recognize existing data: discriminative modelling identifies
tags and sorts data, and can be used to classify data. Generative modelling, by
contrast, produces new data instances.
Since these models use different approaches to machine learning, both are
suited for specific tasks i.e., Generative models are useful for unsupervised
learning tasks. In contrast, discriminative models are useful for supervised
learning tasks. GANs (Generative adversarial networks) can be thought of
as a competition between the generator, which is a component of the
generative model, and the discriminator, so basically, it is generative vs.
discriminative model.
Further points of comparison are robustness to outliers and computational cost.
Let's see some comparisons between discriminative and generative models based
on performance and applications:
Discriminative models are called “discriminative” since they are useful for
discriminating Y’s label, i.e., target outcome, so they can only solve
classification problems. In contrast, Generative models have more
applications besides classification, such as samplings, Bayes learning, MAP
inference, etc.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age,
product price, etc.
A linear regression algorithm shows a linear relationship between a dependent variable
(y) and one or more independent variables (x), hence it is called linear regression. Since
linear regression shows a linear relationship, it finds how the value of the dependent
variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables.
Mathematically, we can represent a linear regression as:

y = a₀ + a₁x + ε

Here, a₀ is the intercept of the line, a₁ is the linear regression coefficient (the slope of
the line), and ε is the random error. The values for the x and y variables are training
datasets for the linear regression model representation.
Different values for the weights or coefficients of the line (a₀, a₁) give different
regression lines, so we need to calculate the best values for a₀ and a₁ to find the
best-fit line; to calculate this, we use a cost function.
Cost function-
o Different values for the weights or coefficients of the line (a₀, a₁) give different
regression lines, and the cost function is used to estimate the values of the
coefficients for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
It can be written as:

MSE = (1/N) Σᵢ (yᵢ − (a₀ + a₁xᵢ))²

where N is the total number of observations, yᵢ is the actual value of the i-th
observation, and a₀ + a₁xᵢ is the predicted value.
Residuals: The distance between an actual value and the predicted value is called the
residual. If the observed points are far from the regression line, the residuals will be
high, and so will the cost function. If the scatter points are close to the regression
line, the residuals will be small, and hence so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts with randomly selected coefficient values and then iteratively updates them
to reach the minimum of the cost function.
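As a sketch of those three bullets in code, here is gradient descent updating a₀ and a₁ by reducing the MSE cost. The library (NumPy), the toy data, and the hyperparameters are our own choices:

```python
# A sketch of gradient descent fitting y = a0 + a1*x by minimizing MSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])      # roughly y = 1 + 2x, invented data

a0, a1, lr = 0.0, 0.0, 0.05             # initial coefficients, learning rate
for _ in range(2000):
    pred = a0 + a1 * x
    error = pred - y
    a0 -= lr * 2 * error.mean()         # gradient of MSE w.r.t. a0
    a1 -= lr * 2 * (error * x).mean()   # gradient of MSE w.r.t. a1

print(round(a0, 2), round(a1, 2))       # close to the underlying 1 and 2
```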
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations.
The process of finding the best model out of various models is called optimization. It
can be achieved by the method below:
1. R-squared method:
KEY TAKEAWAYS
• The least squares method is a statistical procedure to find the best fit
for a set of data points by minimizing the sum of the offsets or
residuals of points from the plotted curve.
• Least squares regression is used to predict the behaviour of
dependent variables.
• The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.
Let us look at a simple example. Ms. Dolma told her class, "Students
who spend more time on their assignments get better grades." A
student wants to estimate his grade for spending 2.3 hours on an
assignment. Through the magic of the least-squares method, it is possible to
determine the predictive model that will help him estimate the grade far
more accurately. The method is simple because it requires nothing
more than some data and maybe a calculator.
• This method exhibits only the relationship between the two variables. All other causes and
effects are not taken into consideration.
• This method is unreliable when data is not evenly distributed.
• This method is very sensitive to outliers. In fact, this can skew the results of the least-squares
analysis.
Least Square Method Graph
In a typical least-squares graph, a straight line shows the potential relationship
between the independent variable and the dependent variable. The ultimate
goal of this method is to reduce the difference between the observed
responses and the responses predicted by the regression line; less residual
means that the model fits better. The method minimizes the residual of each
data point from the line. There are vertical residuals and perpendicular
residuals: vertical residuals are mostly used in polynomial and hyperplane
problems, while perpendicular residuals are used in general.
Important Notes
• The least-squares method is used to predict the behavior of the dependent variable
with respect to the independent variable.
• The sum of the squares of the errors is closely related to the variance of the residuals.
• The main aim of the least-squares method is to minimize the sum of the squared
errors.
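As a sketch of Ms. Dolma's example worked out in code, here is a least-squares fit. The library (NumPy) and the hours/grades numbers are hypothetical, invented for illustration:

```python
# A sketch of the least squares method applied to hours studied vs. grade.
import numpy as np

hours  = np.array([1.0, 1.5, 2.0, 2.5, 3.0])   # hypothetical data
grades = np.array([60, 65, 72, 77, 83])

# polyfit with deg=1 finds the slope and intercept minimizing the
# sum of squared residuals -- exactly the least squares criterion.
a1, a0 = np.polyfit(hours, grades, deg=1)
print(round(a0 + a1 * 2.3))   # estimated grade for 2.3 hours: ~75
```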
What is Underfitting?
When a model has not learned the patterns in the training data well and is unable to generalize well
on the new data, it is known as underfitting. An underfit model has poor performance on the training
data and will result in unreliable predictions. Underfitting occurs due to high bias and low variance.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this,
the model starts capturing the noise and inaccurate values present in the dataset, and
all these factors reduce the efficiency and accuracy of the model. The overfitted model
has low bias and high variance.
Example: The concept of overfitting can be understood from the output of a linear
regression model whose curve passes through every training point. Such a model
tries to cover all the data points present in the scatter plot. It may look efficient, but
in reality it is not, because the goal of the regression model is to find the best-fit line;
a curve that chases every point instead will generate prediction errors on new data.
Ways to avoid overfitting include:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting in a model, the feeding of training
data can be stopped at an early stage, but then the model may not learn enough
from the training data and may fail to find the best fit of the dominant trend
in the data.
In the case of underfitting, the model is not able to learn enough from the training
data, which reduces accuracy and produces unreliable predictions.
Example: Underfitting can be understood from the output of a linear regression
model fitted to clearly non-linear data: the model, a straight line, is unable to
capture the data points in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally,
it makes predictions with 0 errors, but in practice, it is difficult to achieve it.
As we train our model for a while, the errors on the training data go down, and
the same happens with the test data. But if we train the model for too long, the
performance of the model may decrease due to overfitting, as the model begins to
learn the noise present in the dataset. The errors on the test dataset then start
increasing, so the point just before the errors start rising is the good point, and we
can stop there to achieve a good model.
Cross-Validation
Cross-validation is a technique for validating model efficiency by training the model
on a subset of the input data and testing it on a previously unseen subset of the
input data. We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model; we
can't judge the model based only on the training dataset. For this purpose, we
reserve a particular sample of the dataset which was not part of the training
dataset. After that, we test our model on that sample before deployment, and this
complete process comes under cross-validation. This is different from the general
train-test split.
The simplest approach, the validation set approach, divides the dataset into two
equal halves, training on one half and validating on the other. But it has one big
disadvantage: we use only 50% of the dataset to train our model, so the model may
miss important information in the data. It also tends to give an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. If there are
n total data points in the original input dataset, then n − p data points are used as
the training dataset and the p data points as the validation set. This complete process
is repeated for all possible samples, and the average error is calculated to know the
effectiveness of the model.
This technique has a disadvantage: it can be computationally difficult for large p.
Leave-One-Out Cross-Validation
This method is the special case p = 1: we train on n − 1 data points and test on the
single remaining point, repeating the process for every point.
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we
iteratively check against a single data point.
K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into k groups of
samples of equal size, called folds. For each learning run, the prediction function
uses k − 1 folds for training, and the remaining fold is used as the test set. This
approach is a very popular CV approach because it is easy to understand, and the
output is less biased than with other methods.
Let's take an example of 5-fold cross-validation: the dataset is grouped into 5
folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest
are used to train it. In the 2nd iteration, the second fold is used to test the model,
and the rest are used to train it. This process continues until each fold has been used
as the test fold.
Stratified K-Fold Cross-Validation
The need for this variant can be understood with an example of housing prices, where
the price of some houses can be much higher than that of other houses. To tackle such
situations, a stratified k-fold cross-validation technique, which preserves the overall
distribution of the target in each fold, is useful.
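As a sketch of k-fold cross-validation in practice, here is a 5-fold run. The library (scikit-learn), the model, and the toy data are our own choices:

```python
# A sketch of 5-fold cross-validation: the model is trained and scored
# five times, each time holding out a different fold as the test set.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.arange(20).reshape(-1, 1)     # toy feature
y = np.array([0] * 10 + [1] * 10)    # toy binary labels

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())         # one accuracy per fold, plus the average
```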
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we
remove a subset of the training data and use it to get prediction results from a model
trained on the rest of the dataset.
The error that occurs in this process tells how well our model will perform with an
unknown dataset. Although this approach is simple to perform, it still faces the issue
of high variance, and it sometimes produces misleading results.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides optimal output. But for inconsistent data, it may
produce drastically poor results. This is one of the big disadvantages of cross-
validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over a period of time, which can create
differences between the training and validation sets. For example, if we create a
model for the prediction of stock market values and it is trained on the previous 5
years of stock values, the realistic future values for the next 5 years may be
drastically different, so it is difficult to expect correct output in such situations.
Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive
modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by data
scientists in the field of medical statistics.
Regularization
Sometimes a machine learning model performs well with the training data but does
not perform well with the test data: the model cannot predict the output when it
deals with unseen data because noise has been introduced into the output, and the
model is then called overfitted. This problem can be dealt with using a regularization
technique.
This technique can be applied in such a way that it allows us to maintain all the
variables or features in the model while reducing the magnitude of the variables.
Hence, it maintains accuracy as well as the generalization of the model.
It mainly regularizes or reduces the coefficient of features toward zero. In simple words,
"In regularization technique, we reduce the magnitude of the features by keeping the
same number of features."
Consider a general linear regression equation:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₙxₙ + b

Here, β₀, β₁, ….., βₙ are the weights or magnitudes attached to the features, and b
represents the intercept (the bias of the model). Linear regression models try to
optimize the β values and b to minimize the cost function. Now, we will add a loss
function and an optimized parameter to make a model that can predict accurate
values of Y. The loss function for linear regression is called the RSS, or residual sum
of squares:

RSS = Σᵢ (yᵢ − ŷᵢ)²

where ŷᵢ is the value predicted by the linear equation above.
Lasso Regression
LASSO regression, also known as L1 regularization, is a popular technique used in
statistical modelling and machine learning to estimate the relationships between
variables and make predictions. LASSO stands for Least Absolute Shrinkage and
Selection Operator.
The primary goal of LASSO regression is to find a balance between model simplicity
and accuracy. It achieves this by adding a penalty term to the traditional linear
regression model, which encourages sparse solutions where some coefficients are
forced to be exactly zero. This feature makes LASSO particularly useful for feature
selection, as it can automatically identify and discard irrelevant or redundant variables.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of
bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called as L2 regularization.
o In this technique, the cost function is altered by adding the penalty term to it. The
amount of bias added to the model is called Ridge Regression penalty. We can
calculate it by multiplying with the lambda to the squared weight of each individual
feature.
o The equation for the cost function in ridge regression will be:

Cost = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²

o In the above equation, the penalty term regularizes the coefficients of the model;
hence ridge regression reduces the amplitudes of the coefficients, which decreases
the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum
value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between
the independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression except that the penalty term contains only the
absolute weights instead of the square of the weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of lasso
regression will be:

Cost = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well
as the feature selection.
Key Difference between Ridge Regression and Lasso
Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes
all the features present in the model. It reduces the complexity of the model by
shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature
selection.
The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.
L1 Regularization
Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of
the magnitude of the coefficients. This type of regularization can result in sparse models with few
coefficients; some coefficients can become zero and be eliminated from the model. Larger penalties
result in coefficient values closer to zero, which is ideal for producing simpler models. On the
other hand, L2 regularization (e.g. ridge regression) doesn't result in the elimination of coefficients
or in sparse models. This makes the lasso far easier to interpret than the ridge.
This is the same as minimizing the sum of squares subject to the constraint Σ|βⱼ| ≤ s. Some of
the βs are shrunk to exactly zero, resulting in a regression model that's easier to interpret.
A tuning parameter, λ controls the strength of the L1 penalty. λ is basically the amount of
shrinkage:
• When λ = 0, no parameters are eliminated. The estimate is equal to the one found with
linear regression.
• As λ increases, more and more coefficients are set to zero and eliminated (theoretically,
when λ = ∞, all coefficients are eliminated).
• As λ increases, bias increases.
• As λ decreases, variance increases.
If an intercept is included in the model, it is usually left unchanged.
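To make the ridge/lasso contrast concrete, here is a sketch comparing the two penalties on the same data. The library (scikit-learn), the synthetic data, and the λ (called alpha in scikit-learn) values are our own choices:

```python
# A sketch contrasting L2 (ridge) and L1 (lasso) penalties: lasso can
# shrink a coefficient to exactly zero, ridge only toward zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# The target depends on the first two features; the third is irrelevant.
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # small but nonzero 3rd coefficient
print(Lasso(alpha=0.1).fit(X, y).coef_)  # 3rd coefficient driven to 0.0
```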
Classification
As we know, supervised machine learning algorithms can be broadly classified into
regression and classification algorithms. Regression algorithms predict the output
for continuous values, but to predict categorical values, we need classification
algorithms.
Unlike regression, the output variable of Classification is a category, not a value, such
as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a
Supervised learning technique, hence it takes labelled input data, which means it
contains input with the corresponding output.
Classification can be illustrated with two classes, class A and class B. Within each
class, the items have features that are similar to each other and dissimilar to those
of the other class.
o Binary Classifier: If the classification problem has only two possible outcomes, then it
is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it
is called a Multi-class Classifier.
Example: Classification of types of crops, classification of types of music.
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time
for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes
more time in learning, and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
1. Log Loss or Cross-Entropy Loss: For a binary classification, the loss for a single
prediction is −(y log(p) + (1 − y) log(1 − p)), where y is the actual label (0 or 1) and p is
the predicted probability of class 1.
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix consists of the prediction results in a summarized form, giving the total
number of correct predictions and incorrect predictions. For two classes it looks like
the below table:
+---------------------+---------------------+---------------------+
|                     | Actual Positive     | Actual Negative     |
+---------------------+---------------------+---------------------+
| Predicted Positive  | True Positive (TP)  | False Positive (FP) |
+---------------------+---------------------+---------------------+
| Predicted Negative  | False Negative (FN) | True Negative (TN)  |
+---------------------+---------------------+---------------------+
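A minimal sketch of computing such a matrix, assuming Python with scikit-learn and made-up labels and predictions:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

# For labels [0, 1] the result is laid out as [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))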
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristic curve, and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.
Now that we have an idea of the different types of classification models, it is crucial to
choose the right evaluation metrics for those models. In this section, we will cover the
most commonly used metrics: accuracy, precision, recall, F1 score, and area under
the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the
Curve).
Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modelling as regression;
therefore, it is called logistic regression, but it is used to classify samples, so
it falls under the classification algorithms.
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.
o But we need a range between −∞ and +∞; taking the logarithm of the odds gives the
logit form of the equation:
log(y / (1 − y)) = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
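A quick numeric sketch of the sigmoid function in Python (the sample inputs are illustrative); note how any real-valued score is squashed into the (0, 1) range:

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"sigmoid({z:+}) = {sigmoid(z):.3f}")  # equals 0.5 at z = 0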
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
• It does not require too many computational resources and is highly interpretable
• It does not require the input features to be scaled, nor much tuning
• It gives a measure of how relevant a predictor (coefficient size) is, and its direction of
association (positive or negative)
Some use cases of logistic regression include:
• To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)
• Ecommerce companies can identify buyers if they are likely to purchase a certain
product
• Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance
To minimize this cost function, the model needs to have the best values of
θ1 and θ2 (for a univariate linear regression problem). Initially, the model selects
θ1 and θ2 values randomly and then iteratively updates these values in order to
minimize the cost function until it reaches the minimum. By the time the model
achieves the minimum cost function, it will have the best θ1 and θ2 values.
Using these updated values of θ1 and θ2 in the hypothesis equation of linear
regression, our model will predict the output value y.
Gradient descent works by moving downward toward the pits or valleys in the
graph to find the minimum value. This is achieved by taking the derivative of
the cost function, as illustrated in the figure below. During each iteration,
gradient descent step-downs the cost function in the direction of the steepest
descent. By adjusting the parameters in this direction, it seeks to reach the
minimum of the cost function and find the best-fit values for the parameters.
The size of each step is determined by parameter α known as Learning
Rate.
The choice of a correct learning rate is very important, as it ensures that Gradient
Descent converges in a reasonable time: if we choose α to be very large, Gradient
Descent can overshoot the minimum and may fail to converge, or even diverge; if we
choose α to be very small, it will take tiny steps and need many more iterations to converge.
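To make the update loop concrete, here is a minimal from-scratch sketch in Python for the univariate case h(x) = θ1 + θ2·x; the data, learning rate and iteration count are illustrative choices:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])   # roughly y = 1 + 2x

theta1, theta2 = 0.0, 0.0   # initial parameter values
alpha = 0.02                # learning rate
for _ in range(2000):
    error = (theta1 + theta2 * X) - y
    # Partial derivatives of the mean squared error cost
    grad1 = error.mean()
    grad2 = (error * X).mean()
    # Step down the cost surface in the direction of steepest descent
    theta1 -= alpha * grad1
    theta2 -= alpha * grad2

print(f"theta1 = {theta1:.2f}, theta2 = {theta2:.2f}")  # close to 1 and 2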
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called as support vectors, and hence algorithm is termed as Support
Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we want
a model that can accurately identify whether it is a cat or dog, so such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn about different features of cats and dogs, and then we
test it with this strange creature. Since the support vector machine creates a decision
boundary between these two classes (cat and dog) and chooses the extreme cases
(support vectors), it will consider the extreme cases of cats and dogs. On the basis of
the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called a Non-linear SVM classifier.
We always create a hyperplane that has a maximum margin, i.e. the maximum
distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support the
hyperplane, hence called a Support vector.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image:
So, as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called as margin. And the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third-dimension
z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we
convert it back into 2-D space with z = 1, it becomes:
Hence, we get a circle of radius 1 in the case of non-linear data.
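A minimal sketch of this lift in Python with scikit-learn (the circular toy data is generated for illustration): adding the z = x² + y² feature lets a purely linear classifier separate the two rings.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 0.8, 100),   # inner class
                        rng.uniform(1.2, 2.0, 100)])  # outer class
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.array([0] * 100 + [1] * 100)

z = (X ** 2).sum(axis=1, keepdims=True)  # the new third dimension
X3 = np.hstack([X, z])                   # lift the data into 3-D

clf = LinearSVC().fit(X3, labels)
print("accuracy in 3-D space:", clf.score(X3, labels))  # close to 1.0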
What makes the linear SVM algorithm better than some of the other
algorithms, like k-nearest neighbors, is that it chooses the best line to
classify your data points: the line that separates the data and is as far
away from the closest data points as possible.
A 2-D example helps to make sense of all the machine learning jargon.
Basically, you have some data points on a grid. You're trying to
separate these data points by the category they should fit in, but you
don't want to have any data in the wrong category. That means you're
trying to find the line between the two closest points that keeps the
other data points separated.
So, the two closest data points give you the support vectors you'll use
to find that line. That line is called the decision boundary.
linear SVM
The decision boundary doesn't have to be a line. It's also referred to
as a hyperplane because you can find the decision boundary with any
number of features, not just two.
Types of SVMs
There are two different types of SVMs, each used for different things:
• Simple SVM: Typically used for linear regression and classification problems.
• Kernel SVM: Has more flexibility for non-linear data because you can add more
features to fit a hyperplane instead of a two-dimensional space.
SVMs are used in applications like handwriting recognition, intrusion detection, face detection,
email classification, gene classification, and in web pages. This is one of the reasons we use
SVMs in machine learning. It can handle both classification and regression on linear and non-
linear data.
Another reason we use SVMs is because they can find complex relationships between your
data without you needing to do a lot of transformations on your own. It's a great option when
you are working with smaller datasets that have tens to hundreds of thousands of features. They
typically find more accurate results when compared to other algorithms because of their ability
to handle small, complex datasets.
Here are some of the pros and cons for using SVMs.
Pros
Cons
• If the number of features is a lot bigger than the number of data points, avoiding over-
fitting when choosing kernel functions and regularization term is crucial.
• SVMs don't directly provide probability estimates. Those are calculated using an
expensive five-fold cross-validation.
• Works best on small sample sets because of its high training time.
Kernel Methods
Kernels or kernel methods (also called kernel functions) are sets of different types of algorithms
used for pattern analysis. They are used to solve a non-linear problem by using a linear
classifier. Kernel methods are employed in SVMs (Support Vector Machines), which are used in
classification and regression problems. The SVM uses what is called a "kernel trick", where the
data is transformed and an optimal boundary is found for the possible outputs.
In other words, a kernel is a term used to describe applying linear classifiers to non-linear problems
by mapping non-linear data onto a higher-dimensional space without having to visit or understand
that higher-dimensional region.
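A minimal sketch of the idea, assuming Python with NumPy and scikit-learn (the toy points are illustrative): the RBF kernel scores the similarity of two points as if they had been mapped into a much higher-dimensional space, without ever building that space, and SVC applies the same trick internally.

import numpy as np
from sklearn.svm import SVC

def rbf_kernel(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))

# scikit-learn's SVC uses the kernel trick internally
X = np.array([[0, 0], [1, 1], [2, 0], [0, 2]])
y = [0, 1, 1, 1]
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.predict([[0.1, 0.1]]))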
Instance-Based Methods
• Instance-based learning is a family of learning algorithms that, instead
of performing explicit generalization, compares new problem instances
with instances seen in training, which have been stored in memory.
• They are sometimes referred to as lazy learning methods because they
delay processing until a new instance must be classified. The nearest
neighbours of an instance are defined in terms of Euclidean distance.
• No model is learned
• The stored training instances themselves represent the knowledge
• Training instances are searched for instance that most closely resembles
new instance
Instance-based learning representation
• It has the ability to adapt to previously unseen data, which means that
one can store a new instance or drop the old instance.
Disadvantages of instance-based learning:
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors; here we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as:
o By calculating the Euclidean distance, we got the nearest neighbors, as three
nearest neighbors in category A and two nearest neighbors in category B. Consider
the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K are good, but they may bring some difficulties.
o Always needs to determine the value of K which may be complex some time.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.
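A minimal sketch of the procedure above, assuming Python with scikit-learn; the two small groups of points are made up for illustration:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]]  # two loose groups
y = ["A", "A", "A", "B", "B", "B"]

# k = 5 neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[2, 1]]))  # 3 of its 5 nearest neighbours are "A"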
Tree-Based Models
Tree-based models use a decision tree to represent how different input variables can be
used to predict a target value. Machine learning uses tree-based models for both
classification and regression problems, such as the type of animal or value of a home. The
input variables are repeatedly segmented into subsets to build the decision tree, and each
branch is tested for prediction accuracy and evaluated for efficiency and effectiveness.
Splitting the variables in a different order can reduce the number of layers and
calculations required to produce an accurate prediction. Generating a successful decision
tree results in the most important variables (most influential on the prediction) being at
the top of the tree hierarchy, while irrelevant features get dropped from the hierarchy.
Tree-based models use a series of if-then rules to generate predictions from one
or more decision trees. All tree-based models can be used for either regression
(predicting numerical values) or classification (predicting categorical values). For
example
1. Decision tree models, which are the foundation of all tree-based models.
2. Random forest models, an “ensemble” method which builds many
decision trees in parallel.
3. Gradient boosting models, an “ensemble” method which builds many
decision trees sequentially.
Decision Trees
First, let’s start with a simple decision tree model. A decision tree model can be used to
visually represent the “decisions”, or if-then rules, that are used to generate predictions.
Here is an example of a very basic decision tree model:
We’ll go through each yes or no question, or decision node, in the tree and will move
down the tree accordingly, until we reach our final predictions. Our first question, which
is referred to as our root node, is whether George is above 40 and, since he is, we will
then proceed onto the “Has Kids” node. Because the answer is yes, we’ll predict that he
will be a high spender at Willy Wonka Candy this week.
One other note to add — here, we’re trying to predict whether George will be a high
spender, so this is an example of a classification tree, but we could easily convert this into
a regression tree by predicting George’s actual dollar spend. The process would remain
the same, but the final nodes would be numerical predictions rather than categorical
ones.
Glad you asked. There are essentially two key components to building a decision tree
model: determining which features to split on and then deciding when to stop splitting.
When determining which features to split on, the goal is to select the feature that will
produce the most homogenous resulting datasets. The simplest and most commonly used
method of doing this is by minimizing entropy, a measure of the randomness within a
dataset, and maximizing information gain, the reduction in entropy that results from
splitting on a given feature.
We’ll split on the feature that results in the highest information gain, and then recompute
entropy and information gain for the resulting output datasets. In the Willy Wonka
example, we may have first split on age because the greater than 40 and less than (or
equal to) 40 datasets were each relatively homogenous. Homogeneity in this sense refers
to the diversity of classes, so one dataset was filled with primarily low spenders and the
other with primarily high spenders.
You may be wondering how we decided to use a threshold of 40 for age. That’s a good
question! For numerical features, we first sort the feature values in ascending order, and
then test each value as the threshold point and calculate the information gain of that split.
The value with the highest information gain — in this case, age 40 — will then be
compared with other potential splits, and whichever has the highest information gain will
be used at that node. A tree can split on any numerical feature multiple times at different
value thresholds, which enables decision tree models to handle non-linear relationships
quite well.
The second decision we need to make is when to stop splitting the tree. We can split until
each final node has very few data points, but that will likely result in overfitting, or
building a model that is too specific to the dataset it was trained on. This is problematic
because, while it may make good predictions for that one dataset, it may not generalize
well to new data, which is really our larger goal.
To combat this, we can remove sections that have little predictive power, a technique
referred to as pruning. Some of the most common pruning methods include setting a
maximum tree depth or minimum number of samples per leaf, or final node.
Advantages:
• Straightforward interpretation
• Good at handling complex, non-linear relationships
Disadvantages:
What are Decision Trees?
In simple words, a decision tree is a structure that contains nodes
(rectangular boxes) and edges(arrows) and is built from a dataset
(table of columns representing features/attributes and rows
corresponds to records). Each node is either used to make a
decision (known as decision node) or represent an
outcome (known as leaf node).
ID3 in brief
ID3 stands for Iterative Dichotomiser 3 and is named such because
the algorithm iteratively (repeatedly) dichotomizes(divides) features
into two or more groups at each step.
Dataset description
In this article, we’ll be using a sample dataset of COVID-19 infection.
A preview of the entire dataset is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1 | NO | NO | NO | NO |
+----+-------+-------+------------------+----------+
| 2 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 3 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 4 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 5 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 6 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 7 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 8 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 9 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 10 | YES | YES | NO | YES |
+----+-------+-------+------------------+----------+
| 11 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 12 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 13 | NO | YES | YES | NO |
+----+-------+-------+------------------+----------+
| 14 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature
at each step while building a Decision tree.
Before you ask, the answer to the question: ‘How does ID3 select the
best feature?’ is that ID3 uses Information Gain or just Gain to
find the best feature.
Information Gain calculates the reduction in the entropy and
measures how well a given feature separates or classifies the target
classes. The feature with the highest Information Gain is
selected as the best one.
Entropy is calculated as:
Entropy(S) = − Σᵢ pᵢ log₂(pᵢ), summed over the n classes
where,
n is the total number of classes in the target column (in our case n =
2, i.e. YES and NO)
pᵢ is the probability of class 'i', i.e. the ratio of the "number of rows
with class i in the target column" to the "total number of rows" in
the dataset.
Information Gain for a feature column A is calculated as:
IG(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) * Entropy(Sᵥ)
where Sᵥ is the set of rows in S for which the feature column A has
value v, |Sᵥ| is the number of rows in Sᵥ and likewise |S| is the
number of rows in S.
ID3 Steps
1. Calculate the Information Gain of each feature.
From the total of 14 rows in our dataset S, there are 8 rows with the
target value YES and 6 rows with the target value NO. The entropy
of S is calculated as:
Entropy(S) = −(8/14) * log₂(8/14) − (6/14) * log₂(6/14) ≈ 0.99
Note: If all the values in our target column are the same, the
entropy will be zero (meaning that it has no or zero
randomness).
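The same numbers can be reproduced with a short Python sketch (the dataset is the 14-row table above, written out as tuples):

from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

data = [  # (Fever, Cough, Breathing issues, Infected), per the table above
    ("NO","NO","NO","NO"), ("YES","YES","YES","YES"), ("YES","YES","NO","NO"),
    ("YES","NO","YES","YES"), ("YES","YES","YES","YES"), ("NO","YES","NO","NO"),
    ("YES","NO","YES","YES"), ("YES","NO","YES","YES"), ("NO","YES","YES","YES"),
    ("YES","YES","NO","YES"), ("NO","YES","NO","NO"), ("NO","YES","YES","YES"),
    ("NO","YES","YES","NO"), ("YES","YES","NO","NO"),
]

print(round(entropy([row[-1] for row in data]), 2))  # 0.99

def information_gain(rows, col):
    gain = entropy([r[-1] for r in rows])
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for col, name in enumerate(["Fever", "Cough", "Breathing issues"]):
    print(name, round(information_gain(data, col), 3))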
We now calculate the Information Gain for each feature:
As shown below, of the 6 rows with Fever = NO, there are 2 rows having
target value YES and 4 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | NO | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the
subset Sʙʏ (the set of rows where Breathing Issues = YES):
There are no more unused features, so we stop here and jump to the
final step of creating the leaf nodes.
For the left leaf node of Fever, we see the subset of rows from the
original data set that has Breathing Issues and Fever both values
as YES.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
Since all the values in the target column are YES, we label the left
leaf node as YES, but to make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from
the original data set that have Breathing Issues value
as YES and Fever as NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+
Here not all but most of the values are NO, hence NO or Not
Infected becomes our right leaf node.
Our tree, now, looks like this:
We repeat the same process for the node Cough, however here both
left and right leaves turn out to be the same i.e. NO or Not
Infected as shown below:
Looks Strange, doesn’t it?
I know! The right node of Breathing issues is as good as just a leaf
node with class ‘Not infected’. This is one of the Drawbacks of ID3, it
doesn’t do pruning.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It
does that by searching for the best homogeneity in the sub-nodes, with the
help of the Gini index criterion.
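A minimal Python sketch of the Gini impurity, Gini = 1 − Σ pᵢ², on illustrative label counts:

def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["YES"] * 8 + ["NO"] * 6))  # mixed node, about 0.49
print(gini(["YES"] * 5))               # pure node, exactly 0.0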
Classification tree
A classification tree is an algorithm where the target variable is categorical.
The algorithm is then used to identify the “Class” within which the target
variable is most likely to fall. Classification trees are used when the dataset
needs to be split into classes that belong to the response variable(like yes or
no)
Regression tree
A Regression tree is an algorithm where the target variable is continuous and
the tree is used to predict its value. Regression trees are used when the
response variable is continuous. For example, if the response variable is the
temperature of the day.
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + ... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, ..., 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i+1) = -1
                Node(2i+2) = -1
            end if
        end do
    end if
    d = d + 1
end while
CART model representation
CART models are formed by picking input variables and evaluating split
points on those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm (a short code sketch
follows this list):
• Greedy algorithm: The input space is divided using a greedy method
known as recursive binary splitting. This is a numerical procedure in
which all the values are lined up and several split points are tried and
assessed using a cost function.
• Stopping Criterion: As it works its way down the tree with the
training data, the recursive binary splitting method described above
must know when to stop splitting. The most frequent halting
method is to utilize a minimum amount of training data allocated to
every leaf node. If the count is smaller than the specified threshold,
the split is rejected and also the node is considered the last leaf
node.
• Tree pruning: Decision tree’s complexity is defined as the number
of splits in the tree. Trees with fewer branches are recommended as
they are simple to grasp and less prone to cluster the data. Working
through each leaf node in the tree and evaluating the effect of
deleting it using a hold-out test set is the quickest and simplest
pruning approach.
• Data preparation for the CART: No special data preparation is
required for the CART algorithm.
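As a minimal sketch of these steps (assuming Python with scikit-learn, whose DecisionTreeClassifier implements a CART-style tree; the iris dataset and parameter values are illustrative), the Gini criterion drives the recursive binary splitting, while max_depth and min_samples_leaf act as the stopping and pruning controls described above:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="gini",      # Gini impurity splits
                              max_depth=3,           # stopping criterion
                              min_samples_leaf=5,    # minimum data per leaf
                              random_state=0)
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))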
Advantages of CART
• Results are simplistic.
• Classification and regression trees are Nonparametric and
Nonlinear.
• Classification and regression trees implicitly perform feature
selection.
• Outliers have no meaningful effect on CART.
• It requires minimal supervision and produces easy-to-understand
models.
Limitations of CART
• Overfitting.
• High variance.
• Low bias.
• The tree structure may be unstable.
Applications of the CART algorithm
• For quick Data insights.
• In Blood Donors Classification.
• For environmental and ecological data.
• In the financial sectors.
Ensemble Methods
The idea is for many weak guesses to come together to generate one strong guess. You
can think of ensembling as asking the audience on “Who Wants to Be a Millionaire?” If the
question is really hard, the contestant might prefer to aggregate many guesses, rather
than go with their own guess alone.
To get deeper into that metaphor, one decision tree model would be the contestant. One
individual tree might not be a great predictor, but if we build many trees and combine all
predictions, we get a pretty good model! Two of the most popular ensemble algorithms
are random forest and gradient boosting, which are quite powerful and commonly used
for advanced machine learning applications.
Before we discuss the random forest model, let’s take a quick step back and discuss its
foundation, bootstrap aggregating, or bagging. Bagging is a technique of building many
decision tree models at a time by randomly sampling with replacement, or bootstrapping,
from the original dataset. This ensures variety in the trees, which helps to reduce the
amount of overfitting.
Random forest models take this concept one step further. On top of building many
trees from sampled datasets, each node is only allowed to split on a random selection of
the model’s features.
For example, imagine that each node can split from a different, random selection of three
features from our feature set. Looking at the above, you may notice that the two trees
start with different features — the first starts with age and the second starts with dollars
spent. That’s because even though age may be the most significant feature in the dataset,
it wasn’t selected in the group of three features for the second tree, so that model had to
use the next most significant feature, dollars spent, to start.
Each subsequent node will also split on a random selection of three features. Let's say
that the next group of features in the "less than $1 spent last week" dataset included age,
and this time the age-30 threshold resulted in the highest information gain among all
features; age greater or less than 30 would then be the next split.
We’ll build our two trees separately and get the majority vote. Note that if it were a
regression problem, we would get the average.
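A minimal sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative); max_features limits how many randomly chosen features each split may consider, mirroring the "group of three features" idea above:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 bootstrapped trees, each split limited to 3 random features
forest = RandomForestClassifier(n_estimators=100, max_features=3,
                                random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))  # majority vote of the trees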
Advantages:
Disadvantages:
Boosting is an ensemble tree method that builds consecutive small trees — often only
one node — with each tree focused on correcting the net error from the previous tree. So,
we’ll split our first tree on the most predictive feature and then we’ll update weights to
ensure that the subsequent tree splits on whichever feature allows it to correctly classify
the data points that were misclassified in the initial tree. The next tree will then focus on
correctly classifying errors from that tree, and so on. The final prediction is a weighted
sum of all individual predictions.
Gradient boosting is the most popular extension of boosting, and uses the gradient
descent algorithm for optimization.
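A minimal sketch with scikit-learn's GradientBoostingClassifier (the dataset and parameter values are illustrative); max_depth=1 builds the small, stump-like trees described above, each correcting the ensemble built so far:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 200 sequential one-split trees, each fitted to the previous errors
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=1,
                                 learning_rate=0.1, random_state=0)
gbm.fit(X_tr, y_tr)
print("test accuracy:", gbm.score(X_te, y_te))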
Advantages:
• They are powerful and accurate, in many cases even more so than random forest
• Good at handling complex, non-linear relationships
• They are good at dealing with imbalanced data
Disadvantages:
For instance, an algorithm can learn to predict whether a given email is spam or ham
(not spam).
Introduction
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled and allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will
be given to the machine. Therefore, the machine is restricted to finding the
hidden structure in unlabeled data by itself.
For instance, suppose it is given an image having both dogs and cats which
it has never seen.
Thus the machine has no idea about the features of dogs and cats, so we can't
categorize it as 'dogs and cats'. But it can categorize them according to their
similarities, patterns, and differences, i.e., we can easily categorize the above
picture into two parts. The first may contain all pics having dogs in them and
the second part may contain all pics having cats in them. Here you didn’t
learn anything before, which means no training data or examples.
It allows the model to work on its own to discover patterns and information
that was previously undetected. It mainly deals with unlabeled data.
Clustering Algorithms
Data points are clustered using the basic idea that each data point lies
within a given distance of its cluster's center. Various distance
measures and techniques are used for the calculation of outliers.
Why Clustering?
Clustering is very much important as it determines the intrinsic grouping
among the unlabeled data present. There are no criteria for good clustering.
It depends on the user, and what criteria they may use which satisfy their
need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful
and suitable groupings (“useful” data classes) or in finding unusual data
objects (outlier detection). This algorithm must make some assumptions that
constitute the similarity of points, and each assumption makes different and
equally valid clusters.
Clustering Methods:
• Density-Based Methods: These methods consider the clusters as
the dense region having some similarities and differences from the
lower dense region of the space. These methods have good
accuracy and the ability to merge two clusters. Example DBSCAN
(Density-Based Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points to Identify Clustering Structure),
etc.
• Hierarchical Based Methods: The clusters formed in this method
form a tree-type structure based on the hierarchy. New clusters are
formed using the previously formed ones. It is divided into two
categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Examples CURE (Clustering Using Representatives), BIRCH (Balanced
Iterative Reducing Clustering and using Hierarchies), etc.
• Partitioning Methods: These methods partition the objects into k
clusters, and each partition forms one cluster. An objective criterion
(a similarity function, such as distance) is optimised, for example in
K-means and CLARANS (Clustering Large Applications based upon
Randomized Search).
• Grid-based Methods: In this method, the data space is formulated
into a finite number of cells that form a grid-like structure. All the
clustering operations done on these grids are fast and independent
of the number of data objects example STING (Statistical
Information Grid), wave cluster, CLIQUE (CLustering In Quest), etc.
Clustering Algorithms: The K-means clustering algorithm is the simplest
unsupervised learning algorithm that solves the clustering problem. The K-means
algorithm partitions n observations into k clusters, where each observation
belongs to the cluster with the nearest mean serving as a prototype of the
cluster.
K – Means
K-Means Clustering is an Unsupervised Machine Learning algorithm, which
groups the unlabeled dataset into different clusters.
K means Clustering
Unsupervised machine learning is the process of teaching a
computer to use unlabeled, unclassified data and enabling the algorithm to
operate on that data without supervision. Without any previous data training,
the machine’s job in this case is to organize unsorted data according to
parallels, patterns, and variations.
The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more
comparable to one another and different from the data points within the other
groups. It is essentially a grouping of things based on how similar and
different they are to one another.
We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the K-means algorithm; an unsupervised learning
algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The
algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity, we will use the Euclidean distance as a
measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster
centroids.
2. We categorize each item to its closest mean and we update the
mean’s coordinates, which are the averages of the items categorized
in that cluster so far.
3. We repeat the process for a given number of iterations and at the
end, we have our clusters.
The “points” mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have a
lot of options. An intuitive method is to initialize the means at random items
in the data set. Another method is to initialize the means at random values
between the boundaries of the data set (if for a feature x, the items have
values in [0,3], we will initialize the means with values for x at [0,3]).
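A minimal from-scratch sketch of this loop in Python with NumPy (the six toy points are illustrative), initialising the means at k random items as described above:

import numpy as np

def k_means(X, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise the means at k random items from the data set
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every item to every mean, then closest-mean assignment
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update each mean to the average of the items assigned to it
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.2]])
means, assign = k_means(X, k=2)
print(means)   # one centroid near (1.2, 1.3), one near (8.1, 7.9)
print(assign)  # cluster index of each item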
Advantages of k-means
Disadvantages of K-Means:
Applications of K-Means:
• Academic performance
• Diagnostic systems
• Search engines
• Wireless sensor networks
Hierarchical Clustering
Cluster Validity
• Proximity Matrix
• Ideal Similarity Matrix
o One row and one column for each data point
o An entry is 1 if the associated pair of points belong to the same
cluster
o An entry is 0 if the associated pair of points belongs to different
clusters
• Since the matrices are symmetric, only the correlation between n(n-1)
/ 2 entries needs to be calculated.
High correlation indicates that points that belong to the same cluster are
close to each other.
Dimensionality Reduction
Feature Selection:
Feature selection involves selecting a subset of the original features that are
most relevant to the problem at hand. The goal is to reduce the
dimensionality of the dataset while retaining the most important features.
There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features
based on their relevance to the target variable, wrapper methods use the
model performance as the criteria for selecting features, and embedded
methods combine feature selection with the model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or
transforming the original features. The goal is to create a set of features that
captures the essence of the original data in a lower-dimensional space. There
are several methods for feature extraction, including principal component
analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbor embedding (t-SNE). PCA is a popular technique that
projects the original features onto a lower-dimensional space while
preserving as much of the variance as possible.
Overall, PCA is a powerful tool for data analysis and can help to simplify
complex datasets, making them easier to understand and work with.
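A minimal sketch of PCA as feature extraction, assuming Python with scikit-learn; the iris dataset is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance kept by each component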
Recommendation Systems
A recommendation system (or recommender system) is a class of machine
learning that uses data to help predict, narrow down, and find what people
are looking for among an exponentially growing number of options.
Imagine that a user has already purchased a scarf. Why not offer a matching
hat so the look will be complete? This feature is often implemented by means
of AI-based algorithms as “Complete the look” or “You might also like”
sections in e-commerce platforms like Amazon, Walmart, Target, and many
others.
Personalized Banking
EM Algorithm:
Usage of EM algorithm –
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
Disadvantages of EM algorithm –
Reinforcement Learning
Reinforcement learning uses algorithms that learn from outcomes and decide
which action to take next. After each action, the algorithm receives feedback
that helps it determine whether the choice it made was correct, neutral or
incorrect. It is a good technique to use for automated systems that have to
make a lot of small decisions without human guidance.
Example:
The above image shows a robot, a diamond, and fire. The goal of the robot is
to get the reward, which is the diamond, and avoid the hurdles, which are fire.
The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the fewest hurdles. Each right step gives
the robot a reward and each wrong step subtracts from the robot's reward.
The total reward is calculated when it reaches the final reward, the diamond.
• Input: The input should be an initial state from which the model will
start
• Output: There are many possible outputs as there are a variety of
solutions to a particular problem
• Training: The training is based upon the input; the model will return a
state, and the user will decide to reward or punish the model based on
its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
Example: Chess game, text summarization
Example: Object recognition, spam detection
Types of Reinforcement:
1. Positive: Positive Reinforcement is defined as strengthening of behavior
because an event that occurs due to that behavior increases its strength
and frequency.
• Maximizes performance
• Sustains change for a long period of time
• Too much reinforcement can lead to an overload of states which
can diminish the results
2. Negative: Negative Reinforcement is defined as strengthening of
behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage of negative reinforcement:
• It only provides enough to meet up the minimum behavior
Elements
1. Policy
2. Reward function
3. Value function
4. Model of the environment
Policy: Policy defines the learning agent's behavior for a given time period. It is
a mapping from perceived states of the environment to actions to be taken
when in those states.
Value function: Value functions specify what is good in the long run. The
value of a state is the total amount of reward an agent can expect to
accumulate over the future, starting from that state.
The agent has sensors to decide on its state in the environment and takes
action that modifies its state.
The reinforcement learning problem model is an agent continuously
interacting with an environment. The agent and the environment interact in a
sequence of time steps. At each time step t, the agent receives the state of
the environment and a scalar numerical reward for the previous action, and
then selects an action.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with
the environment
• Factor graphs
• Bayesian perspective,
• Probabilistic Programming
The essential principle is that all assumptions about the problem domain are
made explicit in the form of a model. In model-based machine learning, a
model is just a collection of assumptions stated in a graphical manner.
Factor Graphs
They are a form of PGM with round nodes representing variables, square
nodes representing probability distributions (factors), and edges expressing
the conditional relationships between nodes. They offer a broad framework
for modeling the joint distribution of a set of random variables.
The first essential concept allowing this new machine learning architecture
is Bayesian inference/learning. Latent/hidden parameters are represented in
MBML as random variables with probability distributions. This provides for a
consistent and rational approach to quantifying uncertainty in model
parameters. Then, when the observed variables in the model are fixed to
their observed values, Bayes' theorem is used to update the previously
assumed probability distributions.
Probabilistic Programming
• Describe the Model: Using factor graphs, describe the process that
created the data.
• Condition on Reported Data: Make the observed variables equal to
their known values.
• Perform Inference: Backward reasoning is used to update the prior
distribution over the latent constructs or parameters; estimate the Bayesian
probability distributions of the latent constructs based on the observed variables.
Temporal Based Learning
The trick is that rather than attempting to calculate the total future reward,
temporal difference learning just attempts to predict the combination of
immediate reward and its own reward prediction at the next moment in time.
Now when the next moment comes and brings fresh information with it, the
new prediction is compared with the expected prediction. If these two
predictions are different from each other, the Temporal Difference Learning
algorithm will calculate how different the predictions are from each other and
make use of this temporal difference to adjust the old prediction toward the
new prediction.
Temporal difference learning in machine learning got its name from the way
it uses changes, or differences, in predictions over successive time steps for
the purpose of driving the learning process.
The prediction at any particular time step gets updated to bring it nearer to
the prediction of the same quantity at the next time step.
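A minimal sketch of the TD(0) value update in Python (the states, rewards, α and γ are made-up illustrative values): each estimate is nudged toward the immediate reward plus the discounted prediction at the next step.

V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}   # value estimates per state
alpha, gamma = 0.1, 0.9                  # learning rate, discount factor

# One observed episode as (state, reward, next_state) transitions
episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, None)]

for state, reward, nxt in episode:
    target = reward + (gamma * V[nxt] if nxt is not None else 0.0)
    td_error = target - V[state]   # the "temporal difference"
    V[state] += alpha * td_error   # adjust old prediction toward the new one
print(V)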
Introduction
Probabilistic Models are one of the most important segments in Machine Learning,
which is based on the application of statistical codes to data analysis. This dates back
to one of the first approaches of machine learning and continues to be widely used
today. Unobserved variables are seen as stochastic in probabilistic models, and
interdependence between variables is recorded in a joint probability distribution. It
provides a foundation for embracing learning for what it is. The probabilistic
framework outlines how to represent and manage the uncertainty in models and predictions.
In scientific data analysis, predictions play a dominating role. Their contribution is also
critical in machine learning, cognitive computing, automation, and artificial
intelligence.
These probabilistic models have many admirable characteristics and are quite useful
in statistical analysis. They make it quite simple to reason about the inconsistencies
present across most data. In fact, they may be built hierarchically to create
complicated models from basic elements. One of the main reasons why probabilistic
modeling is so popular nowadays is that it provides natural protection against
overfitting and allows for completely coherent inferences over complex forms from
data.
Weather and traffic are two everyday phenomena that are both unpredictable and
appear to have a link with one another. You are all aware that if the weather is cold
and snow is falling, traffic will be quite difficult and you will be detained for an
extended period of time. We could even go so far as to predict a substantial
association between snowy weather and higher traffic mishaps.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can
be described as:
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability:
P(A|B) = P(B|A) * P(A) / P(B)
where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability
of a hypothesis is true.
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
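Since the worked example itself is not reproduced here, the following is a minimal stand-in sketch, assuming Python with scikit-learn's CategoricalNB and hand-encoded toy features (0/1 integer codes for made-up weather values):

from sklearn.naive_bayes import CategoricalNB

# Features: [Outlook, Temperature] as integer codes; target: Play (1) or not (0)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y = [1, 1, 0, 0, 1, 0]

nb = CategoricalNB().fit(X, y)
print(nb.predict([[0, 1]]))        # most probable class
print(nb.predict_proba([[0, 1]]))  # posterior probability of each class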
Maximum Likelihood
Development:
Goal:
Examples:
Maximum Apriori
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
This algorithm uses a breadth-first search and Hash Tree to calculate the itemset
associations efficiently. It is the iterative process for finding the frequent itemsets from
the large dataset.
This algorithm was given by R. Agrawal and R. Srikant in the year 1994. It is mainly
used for market basket analysis and helps to find those products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.
Frequent itemsets are those items whose support is greater than the threshold value
or user-specified minimum support. It means if A & B are the frequent itemsets
together, then individually A and B should also be the frequent itemset.
Suppose there are two transactions: A = {1, 2, 3, 4, 5} and B = {2, 3, 7}; in these two
transactions, 2 and 3 are the frequent itemsets.
Step-1: Determine the support of itemsets in the transactional database, and select
the minimum support and confidence.
Step-2: Take all supports in the transaction with higher support value than the
minimum or selected support value.
Step-3: Find all the rules of these subsets that have higher confidence value than the
threshold or minimum confidence.
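A minimal from-scratch sketch of the first two Apriori passes in Python (the transactions and the minimum support of 0.5 are illustrative):

from itertools import combinations

transactions = [{1, 2, 3, 4, 5}, {2, 3, 7}, {1, 2, 3}, {2, 5, 7}]
min_support = 0.5

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
frequent_1 = [frozenset([i]) for i in items if support({i}) >= min_support]
print("frequent 1-itemsets:", frequent_1)

# Candidate 2-itemsets are built only from frequent 1-itemsets
candidates = [a | b for a, b in combinations(frequent_1, 2)]
frequent_2 = [c for c in candidates if support(c) >= min_support]
print("frequent 2-itemsets:", frequent_2)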
Bayesian belief networks (BBNs) are probabilistic graphical models that are used to
represent uncertain knowledge and make decisions based on that knowledge. They
are a type of Bayesian network, which is a graphical model that represents probabilistic
relationships between variables.
Together, the DAG and the conditional probability tables allow us to perform
probabilistic inference in the network, such as computing the probability of a particular
variable given the values of other variables in the network. Bayesian networks have
many applications in machine learning, artificial intelligence, and decision analysis.
This is a graphical representation of the variables in the network and the causal
relationships between them. The nodes in the DAG represent variables, and the edges
represent the dependencies between the variables. The arrows in the graph indicate
the direction of causality.
For each node in the DAG, there is a corresponding table of conditional probabilities
that specifies the probability of each possible value of the node given the values of its
parents in the DAG. These tables encode the probabilistic relationships between the
variables in the network.
• The nodes of the network graph in the preceding diagram stand in for the
random variables A, B, C, and D, respectively.
• Node A is referred to as the parent of Node B if we are thinking about node B,
which is linked to node A by a directed arrow.
• Node C is independent of node A.
The Bayesian network's semantics can be understood in one of two ways, as follows:
1. To understand the network as the representation of the joint probability distribution.
2. To understand the network as an encoding of a collection of conditional
independence statements.
Probabilistic models allow for the expression of uncertainty, making them particularly
well-suited for real-world applications where data is often noisy or incomplete.
Additionally, these models can often be updated as new data becomes available, which
is useful in many dynamic and evolving systems.
For better understanding, we will implement the probabilistic model on the OSIC
Pulmonary Fibrosis problem on Kaggle.
Problem Statement: "In this competition, you'll predict a patient's severity of decline
in lung function based on a CT scan of their lungs. You'll determine lung function based
on output from a spirometer, which measures the volume of air inhaled and exhaled.
The challenge is to use machine learning techniques to make a prediction with the
image, metadata, and baseline FVC as input."
There are three main categories of probabilistic models:
• Generative models
• Discriminative models
• Graphical models
Generative models:
Generative models aim to model the joint distribution of the input and output
variables. These models generate new data based on the probability distribution of the
original dataset. Generative models are powerful because they can generate new data
that resembles the training data. They can be used for tasks such as image and speech
synthesis, language translation, and text generation.
Discriminative models
The discriminative model aims to model the conditional distribution of the output
variable given the input variable. They learn a decision boundary that separates the
different classes of the output variable. Discriminative models are useful when the
focus is on making accurate predictions rather than generating new data. They can be
used for tasks such as image recognition, speech recognition, and sentiment analysis.
Graphical models
Graphical models use graphs to represent the probabilistic relationships among a set of
variables; the nodes are random variables and the edges encode dependencies, as in
the Bayesian networks discussed above.
Inference
The first is simply evaluating the joint probability of a particular assignment of values
for each variable (or a subset) in the network. For this, we already have a factorized
form of the joint distribution, so we simply evaluate that product using the provided
conditional probabilities. If we only care about a subset of variables, we will need to
marginalize out the ones we are not interested in. In many cases, this may result in
underflow, so it is common to take the logarithm of that product, which is equivalent
to adding up the individual logarithms of each term in the product.
The second, more interesting inference task, is to find P(x|e), or, to find the probability
of some assignment of a subset of the variables (x) given assignments of other
variables (our evidence, e). In the above example, an example of this could be to find
P(Sprinkler, WetGrass | Cloudy), where {Sprinkler, WetGrass} is our x, and {Cloudy} is
our e. In order to calculate this, we use the fact that P(x|e) = P(x, e) / P(e) = αP(x, e),
where α is a normalization constant that we will calculate at the end such that P(x|e) +
P(¬x | e) = 1. In order to calculate P(x, e), we must marginalize the joint probability
distribution over the variables that do not appear in x or e, which we will denote as Y.
Note that in larger networks, Y will most likely be quite large, since most inference
tasks will only directly use a small subset of the variables. In cases like these, exact
inference as shown above is very computationally intensive, so methods must be used
to reduce the amount of computation. One more efficient method of exact inference
is through variable elimination, which takes advantage of the fact that each factor only
involves a small number of variables. This means that the summations can be
rearranged such that only factors involving a given variable are used in the
marginalization of that variable. Alternatively, many networks are too large even for
this method, so approximate inference methods such as MCMC are instead used;
these provide probability estimations that require significantly less computation than
exact inference methods.
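To make the enumeration concrete, here is a hedged sketch that computes P(Sprinkler, WetGrass | Cloudy = T) for the network above by marginalizing out Y = {Rain} and normalizing with α; it reuses the illustrative CPT dictionaries from the previous snippet.

```python
from itertools import product as assignments

def joint(c, s, r, w):
    """Joint probability of one full assignment, using the CPTs above."""
    p = P_c if c else 1 - P_c
    p *= P_s_given_c[c] if s else 1 - P_s_given_c[c]
    p *= P_r_given_c[c] if r else 1 - P_r_given_c[c]
    p *= P_w_given_sr[(s, r)] if w else 1 - P_w_given_sr[(s, r)]
    return p

evidence_c = True                      # our evidence e: Cloudy = T
unnormalized = {}
for s, w in assignments([True, False], repeat=2):
    # Marginalize out Rain, the only variable in Y.
    unnormalized[(s, w)] = sum(joint(evidence_c, s, r, w)
                               for r in (True, False))

alpha = 1.0 / sum(unnormalized.values())            # normalization constant
posterior = {sw: alpha * p for sw, p in unnormalized.items()}
print(posterior[(True, True)])   # P(Sprinkler=T, WetGrass=T | Cloudy=T)
```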
The problem is that we don’t always know the full probability distribution for a random
variable. This is because we only use a small subset of observations to derive the
outcome. This problem is referred to as Probability Density Estimation as we use
only a random sample of observations to find the general density of the whole sample
space.
A PDF is a function that gives the probability of a random variable falling within a
particular range of values, rather than taking just one exact value. Estimated from a
sub-sample, it tells us how likely it is that the density of that sample matches the
density of the whole sample space.
By definition, if X is any continuous random variable, then the function f(x) is called a
probability density function if:

$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
Density Estimation: It is the process of finding the density of the whole
population by examining a random sample of data from that population. One of the
best ways to achieve a density estimate is by using a histogram plot.
Steps Involved:
Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.
After fitting, the histogram of each random sample should closely match the
histogram plot of the whole population.
A normal distribution has two parameters: the mean and the standard deviation. We
calculate the sample mean and standard deviation of the random sample taken from
the population to estimate the density of that sample. The method is termed
'parametric' because the relation between the observations and their probability
depends on the values of these two parameters.
Now, it is important to understand that the mean and standard deviation of this
random sample will not exactly equal those of the whole population, due to the
sample's small size. A sample plot for parametric density estimation is shown below.
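A minimal sketch of this procedure, assuming NumPy and SciPy are available: we draw a small sample from a hypothetical population, estimate the normal parameters from the sample, and compare a histogram-based estimate with the fitted parametric density.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Hypothetical population parameters, assumed for illustration.
population_mean, population_std = 50.0, 5.0
sample = rng.normal(population_mean, population_std, size=100)

# Parametric estimation: sample mean and sample standard deviation.
mu_hat = sample.mean()
sigma_hat = sample.std(ddof=1)
print(mu_hat, sigma_hat)        # close to, but not equal to, 50 and 5

# Histogram-based density estimate of the same sample.
counts, edges = np.histogram(sample, bins=15, density=True)

# Fitted parametric density evaluated at the histogram bin centers.
centers = (edges[:-1] + edges[1:]) / 2
fitted = norm.pdf(centers, loc=mu_hat, scale=sigma_hat)
print(np.round(counts, 3))
print(np.round(fitted, 3))      # the two estimates should roughly agree
```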
Problems with Probability Distribution Estimation
Probability Distribution Estimation relies on finding the best-fitting PDF and
determining its parameters accurately. But the random data sample we consider is
very small, so it becomes very difficult to decide which parameters, and which
probability distribution function, to use. Maximum Likelihood Estimation is used to
tackle this problem.
Sequence Models
Sequence models are machine learning models that input or output sequences of
data. Sequential data includes text streams, audio clips, video clips, and time-series
data. Recurrent Neural Networks (RNNs) are a popular algorithm used in sequence
models.
3. Video Activity Recognition: In video activity recognition, the model needs to identify
the activity in a video clip. A video clip is a sequence of video frames, so in video
activity recognition the input is a sequence of data.
These examples show that there are different applications of sequence models.
Sometimes both the input and output are sequences; in others, only the input or only
the output is a sequence. The recurrent neural network (RNN) is a popular sequence
model that has shown efficient performance on sequential data.
Sequence models have been motivated by the analysis of sequential data such as text
sentences, time series, and other discrete sequence data. These models are especially
designed to handle sequential information, while Convolutional Neural Networks are
better suited to processing spatial information.
The key point for sequence models is that the data being processed are no longer
independently and identically distributed (i.i.d.) samples; the data carry dependencies
due to their sequential order.
Sequence models are very popular for speech recognition, voice recognition, time-
series prediction, and natural language processing.
Markov Models
A Markov model is a stochastic method for randomly changing systems that possess
the Markov property. This means that, at any given time, the next state is only
dependent on the current state and is independent of anything in the past. Two
commonly applied types of Markov model are used when the system being
represented is autonomous -- that is, when the system isn't influenced by an external
agent. These are as follows:
1. Markov chains. These are the simplest type of Markov model and are used to
represent systems where all states are observable. Markov chains show all
possible states, and between states, they show the transition rate, which is
the probability of moving from one state to another per unit of time.
Applications of this type of model include prediction of market crashes, speech
recognition and search engine algorithms.
2. Hidden Markov models. These are used to represent systems with some
unobservable states. In addition to showing states and transition rates, hidden
Markov models also represent observations and observation likelihoods for
each state. Hidden Markov models are used for a range of applications,
including thermodynamics, finance and pattern recognition.
Another two commonly applied types of Markov model are used when the system
being represented is controlled -- that is, when the system is influenced by a decision-
making agent. These are as follows:
1. Markov decision processes. These are used to model decision-making in systems
where the outcomes are partly random and partly under the control of an agent,
with each action yielding a reward; they underpin reinforcement learning.
2. Partially observable Markov decision processes. These extend Markov decision
processes to systems whose state is only partially observable, so the agent must
act on observations rather than on the true state.
Markov analysis is a probabilistic technique that uses Markov models to predict the
future behavior of some variable based on the current state. Markov analysis is used
in many domains, including the following:
• Markov chains are used for several business applications, including predicting
customer brand switching for marketing, predicting how long people will
remain in their jobs for human resources, predicting time to failure of a machine
in manufacturing, and forecasting the future price of a stock in finance.
• Markov analysis is also used in natural language processing (NLP) and in
machine learning. For NLP, a Markov chain can be used to generate a sequence
of words that form a complete sentence, or a hidden Markov model can be used
for named-entity recognition and tagging parts of speech. For machine
learning, Markov decision processes are used to represent reward in
reinforcement learning.
• A recent example of the use of Markov analysis in healthcare was in Kuwait.
A continuous-time Markov chain model was used to determine the optimal
timing and duration of a full COVID-19 lockdown in the country, minimizing
both new infections and hospitalizations. The model suggested that a 90-day
lockdown beginning 10 days before the epidemic peak was optimal.
The simplest Markov model is a Markov chain, which can be expressed in equations,
as a transition matrix or as a graph. A transition matrix is used to indicate the
probability of moving from each state to each other state. Generally, the current states
are listed in rows, and the next states are represented as columns. Each cell then
contains the probability of moving from the current state to the next state. For any
given row, all the cell values must then add up to one.
A graph consists of circles, each of which represents a state, and directional arrows to
indicate possible transitions between states. The directional arrows are labeled with
the transition probability. The transition probabilities on the directional arrows coming
out of any given circle must add up to one.
Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.
The transition matrix below represents shifting gears in a car with a manual
transmission. Six states are possible, and a transition from any given state to any other
state depends only on the current state -- that is, where the car goes from second gear
isn't influenced by where it was before second gear. Such a transition matrix might be
built from empirical observations that show, for example, that the most probable
transitions from first gear are to second or neutral.
This transition matrix represents shifting gears in a car with a manual transmission and the six states
that are possible.
The image below represents the toss of a coin. Two states are possible: heads and tails.
The transition from heads to heads or heads to tails is equally probable (.5) and is
independent of all preceding coin tosses.
The circles represent the two possible states -- heads or tails -- and the arrows show the possible
states the system could transition to in the next step. The number .5 represents the probability of that
transition occurring.
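The coin-toss chain above can be written down directly as a transition matrix and simulated; this is a minimal sketch, with the 0.5 probabilities taken from the figure.

```python
import numpy as np

states = ["heads", "tails"]
T = np.array([[0.5, 0.5],      # row: from heads
              [0.5, 0.5]])     # row: from tails
# Each row of a transition matrix must sum to one.
assert np.allclose(T.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
state = 0                      # start in "heads"
chain = [states[state]]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    state = rng.choice(len(states), p=T[state])
    chain.append(states[state])
print(chain)
```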
Hidden Markov Model (HMM) is a statistical model that is used to describe the
probabilistic relationship between a sequence of observations and a sequence of
hidden states. It is often used in situations where the underlying system or process
that generates the observations is unknown or hidden, hence it got the name “Hidden
Markov Model.”
• The hidden states are the underlying variables that generate the observed
data, but they are not directly observable.
• The observations are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using a
probability distribution. The Hidden Markov Model (HMM) captures this relationship
with two sets of probabilities: the transition probabilities and the emission
probabilities.
The Hidden Markov Model (HMM) algorithm can be implemented using the following
steps:
1. Define the state space and the observation space. The state space is the set of all
possible hidden states, and the observation space is the set of all possible
observations.
2. Define the transition probabilities. These are the probabilities of transitioning from
one state to another. They form the transition matrix, which describes the
probability of moving from one state to another.
3. Define the emission probabilities. These are the probabilities of generating each
observation from each state. They form the emission matrix, which describes the
probability of generating each observation from each state.
4. Train the model. The parameters of the state transition probabilities and the
observation likelihoods are estimated using the Baum-Welch algorithm, or the
forward-backward algorithm, by iteratively updating the parameters until
convergence.
5. Decode the most likely sequence of hidden states. Given the observed data, the
Viterbi algorithm is used to compute the most likely sequence of hidden states.
This can be used to predict future observations, classify sequences, or detect
patterns in sequential data.
6. Evaluate the model. The performance of the HMM can be evaluated using various
metrics, such as accuracy, precision, recall, or F1 score.
To summarize, the HMM algorithm involves defining the state space, observation
space, and the parameters of the state transition probabilities and observation
likelihoods, training the model using the Baum-Welch algorithm or the forward-
backward algorithm, decoding the most likely sequence of hidden states using the
Viterbi algorithm, and evaluating the performance of the model.
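As a concrete sketch of the decoding step, here is a small Viterbi implementation in log space; the two hidden states, three observation symbols, and all probability values are assumptions chosen for illustration.

```python
import numpy as np

start = np.log([0.6, 0.4])                    # initial state probabilities
trans = np.log([[0.7, 0.3],                   # transition matrix
                [0.4, 0.6]])
emit = np.log([[0.5, 0.4, 0.1],               # emission matrix
               [0.1, 0.3, 0.6]])

obs = [0, 1, 2, 1]                            # an observed sequence

# delta[s]: log-probability of the best path ending in state s so far.
delta = start + emit[:, obs[0]]
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] + trans           # scores[prev, next]
    backpointers.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + emit[:, o]

# Trace back the most likely hidden-state sequence.
path = [int(delta.argmax())]
for ptr in reversed(backpointers):
    path.append(int(ptr[path[-1]]))
path.reverse()
print(path)                                   # -> [0, 0, 1, 1]
```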
HMMs are widely used in a variety of applications such as speech recognition, natural
language processing, computational biology, and finance. In speech recognition, for
example, an HMM can be used to model the underlying sounds or phonemes that
generate the speech signal, and the observations could be the features extracted from
the speech signal. In computational biology, an HMM can be used to model the
evolution of a protein or DNA sequence, and the observations could be the sequence
of amino acids or nucleotides.
Unit – 5. Neural Networks and Deep Learning
Neural Networks
Neural networks can help computers make intelligent decisions with limited
human assistance. This is because they can learn and model the relationships
between input and output data that are nonlinear and complex. For instance,
they can make sense of natural language: given two sentences that are worded
differently but mean the same thing, a neural network would know that both
sentences mean the same thing. It would also be able to broadly recognize that
Baxter Road is a place, but Baxter Smith is a person’s name.
Neural networks have several use cases across many industries, such as the
following:
Speech recognition
Neural networks can analyze human speech despite varying speech patterns,
pitch, tone, language, and accent. Virtual assistants like Amazon Alexa and
automatic transcription software rely on this kind of speech recognition.
Recommendation engines
Neural networks can track user activity to develop personalized
recommendations, suggesting products and services a user is likely to want.
The human brain is the inspiration behind neural network architecture. Human
brain cells, called neurons, form a complex, highly interconnected network and
send electrical signals to each other to help humans process information.
Similarly, an artificial neural network is made of artificial neurons that work
together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that,
at their core, use computing systems to solve mathematical calculations.
Input Layer
Information from the outside world enters the artificial neural network from
the input layer. Input nodes process the data, analyze or categorize it, and
pass it on to the next layer.
Hidden Layer
Hidden layers take their input from the input layer or other hidden layers.
Artificial neural networks can have a large number of hidden layers. Each
hidden layer analyzes the output from the previous layer, processes it further,
and passes it on to the next layer.
Output Layer
The output layer gives the final result of all the data processing by the artificial
neural network. It can have single or multiple nodes. For instance, if we have
a binary (yes/no) classification problem, the output layer will have one output
node, which will give the result as 1 or 0. However, if we have a multi-class
classification problem, the output layer might consist of more than one output
node.
Deep neural networks, or deep learning networks, have several hidden layers
with millions of artificial neurons linked together. A number, called weight,
represents the connections between one node and another. The weight is a
positive number if one node excites another, or negative if one node
suppresses the other. Nodes with higher weight values have more influence
on the other nodes.
Theoretically, deep neural networks can map any input type to any output
type. However, they also need much more training as compared to other
machine learning methods. They need millions of examples of training data
rather than perhaps the hundreds or thousands that a simpler network might
need.
Artificial neural networks can be categorized by how the data flows from the
input node to the output node. Below are some examples:
Feedforward neural networks process data in one direction, from the input
node to the output node. Every node in one layer is connected to every node
in the next layer. During training, a feedforward network uses a feedback
process (backpropagation) to improve its predictions over time.
Backpropagation algorithm
1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to
paths that lead to more correct guesses and lower weight values to node
paths that lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the
higher weight paths and then repeat Step 1.
Biological Motivation
The motivation behind neural networks is the human brain. The human brain is
often called the best processor, even though it works more slowly than other
computers. Many researchers therefore sought to build a machine that works in
the manner of the human brain.
The human brain contains billions of neurons, connected to many other
neurons to form a network, so that when it sees an image, it recognizes the
image and processes the output.
Perceptron
We can model the decision boundary and the classification output with the
Heaviside step function, as follows:

$\text{output} = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$
To produce the net input to the activation function (here, the Heaviside step
function) we take the dot product of the input and the connection weights. We
see this summation in the left half of Figure 2-3 as the input to the summation
function.
Table 2-1 provides an explanation of how the summation function is performed
as well as notes about the parameters involved in the summation function. The
output of the step function (activation function) is the output for the perceptron
and gives us a classification of the input values.
If the bias value is negative, it forces the learned weights sum to be a much
greater value to get a 1 classification output. The bias term in this capacity
moves the decision boundary around for the model. Input values do not affect
the bias term, but the bias term is learned through the perceptron learning
algorithm.
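A minimal sketch of this perceptron in Python: the net input is the dot product of the inputs and the connection weights plus the bias, and the Heaviside step function turns it into a 0/1 classification. The weight and bias values are illustrative assumptions, not learned values.

```python
import numpy as np

def heaviside(z):
    """Heaviside step activation: 1 if the net input is positive, else 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b):
    net_input = np.dot(w, x) + b     # summation function
    return heaviside(net_input)      # activation function

w = np.array([0.5, -0.6])            # connection weights (illustrative)
b = -0.2                             # bias shifts the decision boundary
print(perceptron(np.array([1.0, 0.3]), w, b))   # -> 1
print(perceptron(np.array([0.1, 0.9]), w, b))   # -> 0
```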
Multi-layer Perceptron
• Forward Stage: Activation functions start from the input layer in the
forward stage and terminate on the output layer.
• Backward Stage: In the backward stage, weight and bias values are
modified according to the model’s error. The error between the actual output
and the predicted output is propagated backward, starting from the output
layer.
If you notice, we pass the constant value one as an input at the start, with
w0 in the weights section. Bias is an element that adjusts the decision
boundary away from the origin, moving the activation function left, right, up,
or down. Since we want this shift to be independent of the input features, we
add the constant one as an extra input so that the features are not affected;
this value is known as the bias.
• Sign function
• Sigmoid function
Based on the type of value we need as output, we can change the activation
function. The step function can be used when a simple binary output is
required. The sigmoid and sign functions can be used for values between 0 and
1, and between 1 and -1, respectively. The tanh (hyperbolic tangent) function
is a zero-centered function, which makes training multi-layer neural networks
easier. Rectified Linear Unit (ReLU) is another popular activation function
that is computationally cheap: it outputs zero for values less than zero, and
the value itself otherwise.
The data scientist chooses the activation function, based on the particular
problem statement, to shape the desired outputs. Activation functions may
differ (e.g., sign, step, and sigmoid) across perceptron models, depending on
whether the learning process is slow or suffers from vanishing or exploding
gradients.
The neural network can compare the outputs of its nodes with the
desired values using a property known as the delta rule, allowing the
network to alter its weights through training to create more accurate
output values. This training and learning procedure is carried out using
gradient descent. The technique for updating weights in multi-layered
perceptrons is virtually the same; however, the process is referred to as
back-propagation. In such cases, the output values provided by the final
layer are used to adjust each hidden layer inside the network.
Back Propagation
1. Inputs X arrive through the preconnected path.
2. The inputs are modeled using real weights W, typically initialized at
random.
3. Calculate the output of each neuron, from the input layer through the
hidden layers to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
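To make the five steps concrete, here is a hedged sketch of backpropagation on a tiny 2-4-1 sigmoid network trained on XOR; the layer sizes, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

# Step 2: model the inputs with real weights, initialized at random.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Steps 1 and 3: forward pass from inputs through hidden to output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 4: error in the outputs.
    err = out - y
    # Step 5: travel back, adjusting the weights to decrease the error.
    d_out = err * out * (1 - out)                 # sigmoid derivative
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())    # should approach [0, 1, 1, 0]
```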
Activation Function
The activation function of a neuron defines its output given its inputs. The
activation function activates the neuron as required for the desired output,
converting linear input into non-linear output. In neural networks, activation
functions, also known as transfer functions, define how the weighted sum of
the input is transformed into an output by the nodes in a layer of the network.
They are treated as a crucial part of neural network design.
1. Sigmoid Function:
Cons: The gradient values are significant in the range -3 to 3, but become much
closer to zero beyond this range, which almost kills the impact of the neuron
on the final output. Also, sigmoid outputs are not zero-centered (they are
centered around 0.5), which leads to undesirable zig-zagging dynamics in the
gradient updates for the weights.
2. Tanh Function:
Pros: The derivatives of tanh are larger than the derivatives of the sigmoid,
which helps minimize the cost function faster.
Cons: Similar to sigmoid, the gradient values become close to zero over a wide
range of values (this is known as the vanishing gradient problem). Thus, the
network refuses to learn, or keeps learning at a very small rate.
3. Softmax Function:
Pros: Can handle multiple classes and give the probability of belonging to each
class
4. ReLU Function:
Pros: Although ReLU looks and acts like a linear function, it is a nonlinear
function that allows complex relationships to be learned, and it allows
learning through all the hidden layers of a deep network by having large
derivatives.
Cons: It should not be used as the final output layer for either
classification or regression tasks.
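The activation functions discussed above, written out explicitly in NumPy as a quick reference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # output in (0, 1), centered at 0.5

def tanh(z):
    return np.tanh(z)                # zero-centered, output in (-1, 1)

def relu(z):
    return np.maximum(0, z)          # 0 for z < 0, z otherwise

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract max for numerical stability
    return e / e.sum()               # class probabilities summing to 1

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")
```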
Loss Functions
The other key aspect in setting up the neural network infrastructure is selecting
the right loss functions. With neural networks, we seek to minimize the error
(difference between actual and predicted value) which is calculated by the loss
function.
Description: MSE loss is used for regression tasks. As the name suggests, this
loss is calculated by taking the mean of the squared differences between the
actual (target) and predicted values. Range: [0, inf)
Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Description: BCE loss is the default loss function used for binary
classification tasks. It requires one output node to classify the data into
two classes, and the output range is (0-1), i.e., the sigmoid function should
be used. Range: [0, inf)
Formula: $\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right]$
Pros: The continuous nature of the loss function helps the training process
converge well.
Cons: It can only be used with the sigmoid activation function. Other loss
functions, like Hinge or Squared Hinge loss, can work with the tanh activation
function.
Description: Categorical Cross Entropy (CCE) loss is used for multi-class
classification tasks, with one output node per class and, typically, a softmax
output layer.
Formula: $\text{CCE} = -\sum_{j} y_j \log(p_j)$
where y is the actual (one-hot) label and p_j is the classifier’s predicted
probability for class j.
Pros: Similar to Binary Cross Entropy, the continuous nature of the loss
function helps the training process converge well.
Cons: It may require a one-hot encoded vector with many zero values if there
are many classes, requiring significant memory (Sparse Categorical Cross
Entropy should be used in that case).
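A short sketch of the three losses in NumPy; the example labels and predictions are made up for illustration, and predictions are clipped to avoid log(0).

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)             # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))
print(binary_cross_entropy(np.array([1, 0, 1]),
                           np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.eye(3)[[0, 2]],       # one-hot labels
                                np.array([[0.8, 0.1, 0.1],
                                          [0.2, 0.2, 0.6]])))
```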
Nothing is perfect in the world. Machine Learning has some serious limitations,
which are bigger than human errors.
1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The
outcome will be incorrect if a credible data source is not provided. The
quality of the data is also significant: if the user or institution needs
higher-quality data, they must wait for it, which causes delays in producing
the output. So, machine learning depends significantly on the data and its
quality.
2. Time and Resources
The data that machines process remains huge in quantity and differs greatly.
Machines require time for their algorithms to adjust to the environment and
learn from it. Trial runs are held to check the accuracy and reliability of
the machine. It requires massive and expensive resources and high-quality
expertise to set up infrastructure of that quality, and trial runs are costly
in terms of both time and expense.
3. Results Interpretations
One of the major limitations of machine learning is that the interpreted
results we get from the data cannot be guaranteed to be one hundred percent
accurate; they will have some degree of inaccuracy. For a high degree of
accuracy, algorithms should be developed so that they give reliable results.
4. High Error-Susceptibility
The error committed during the initial stages is huge, and if not corrected at
that time, it creates havoc. Bias and wrongness have to be dealt with
separately; they are not interconnected. Machine learning depends on two
factors, i.e., data and algorithm. All the errors are dependent on these two
variables, and any incorrectness in either would have huge repercussions on
the output.
5. Social Changes
With the advancement of machine learning, the nature of jobs is changing.
Now much of the work is done by machines, eating up the jobs that humans
used to do. It is difficult for those without technical education to adjust to
these changes.
8. Highly Expensive
This software is highly expensive, and not everybody can own it. Government
agencies, big private firms, and enterprises mostly own it. It needs to be made
accessible to everybody for wide use.
9. Privacy Concern
As we know, data is one of the pillars of machine learning, and the collection
of data has raised fundamental questions of privacy. The way data is
collected and used for commercial purposes has always been a contentious
issue. In India, the Supreme Court has declared privacy a fundamental
right of Indians: without the user's permission, data cannot be collected,
used, or stored. However, many cases have come up of big firms collecting
data without the user's knowledge and using it for their commercial gains.
Machine learning is an evolving concept. This area has not yet seen any major
development that fully revolutionizes an economic sector, and it requires
continuous research and innovation.
Deep Learning
The deep in deep learning isn’t a reference to any kind of deeper understanding
achieved by the approach; rather, it stands for this idea of successive layers
of representations.
Depth of the model - the number of layers contributing to a model of the data
is called the depth of the model.
As you can see in figure 1.6, the network transforms the digit image into
representations that are increasingly different from the original image and
increasingly informative about the final result.
You can think of a deep network as a multistage information-distillation
operation, where information goes through successive filters and comes out
increasingly purified (that is, useful with regard to some task). So that’s
what deep learning is, technically: a multistage way to learn data
representations.
The specification of what a layer does to its input data is stored in the layer’s
weights, which in essence are a bunch of numbers. In technical terms, we’d
say that the transformation implemented by a layer is parameterized by its
weights (see figure 1.7). (Weights are also sometimes called the parameters
of a layer.)
In this context, learning means finding a set of values for the weights of all
layers in a network, such that the network will correctly map example inputs
to their associated targets. But here’s the thing: a deep neural network can
contain tens of millions of parameters. Finding the correct value for all of them
may seem like a daunting task, especially given that modifying the value of
one parameter will affect the behaviour of all the others!
To control something, first you need to be able to observe it. To control the
output of a neural network, you need to be able to measure how far this output
is from what you expected. This is the job of the loss function of the network,
also called the objective function. The loss function takes the predictions of the
network and the true target (what you wanted the network to output) and
computes a distance score, capturing how well the network has done on this
specific example (see figure 1.8).
Optimizer
The fundamental trick in deep learning is to use this score as a feedback signal
to adjust the value of the weights a little, in a direction that will lower the loss
score for the current example (see figure 1.9). This adjustment is the job of
the optimizer, which implements what’s called the Backpropagation algorithm:
the central algorithm in deep learning.
Trained Network
Initially, the weights of the network are assigned random values, so the
network merely implements a series of random transformations. Naturally, its
output is far from what it should ideally be, and the loss score is accordingly
very high. But with every example the network processes, the weights are
adjusted a little in the correct direction, and the loss score decreases. This is
the training loop, which, repeated a sufficient number of times (typically tens
of iterations over thousands of examples), yields weight values that minimize
the loss function. A network with a minimal loss is one for which the outputs
are as close as they can be to the targets: a trained network.
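A minimal sketch of that training loop on a deliberately simple model (linear regression with a mean-squared-error loss); the synthetic data and learning rate are assumptions. The same loop structure, a loss score as feedback and small weight adjustments, underlies deep network training.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # synthetic inputs
true_w = np.array([2.0, -1.0, 0.5])          # hidden "target" weights
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy targets

w = rng.normal(size=3)                 # initial weights: random values
lr = 0.1
for step in range(200):
    pred = X @ w                       # forward pass
    loss = np.mean((pred - y) ** 2)    # loss score: how far off we are
    grad = 2 * X.T @ (pred - y) / len(y)
    w -= lr * grad                     # optimizer: adjust weights slightly
print(w.round(2), loss)                # w approaches [2.0, -1.0, 0.5]
```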
What has DL achieved so far?
Definition of CNN
They are specifically designed to process pixel data and are used in image
recognition and processing. CNN specializes in processing data that has a grid-
like topology, such as an image. A digital image is a binary representation of
visual data. It contains a series of pixels arranged in a grid-like fashion that
contains pixel values to denote how bright and what color each pixel should
be.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features,
the Pooling layer down samples the image to reduce computation, and the fully
connected layer makes the final prediction. The network learns the optimal
filters through backpropagation and gradient descent.
Convolutional Neural Networks, or covnets, are neural networks that share
their parameters. Imagine you have an image. It can be represented as a cuboid
having a length and width (the dimensions of the image) and a height (the
channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural
network on it, called a filter or kernel, with, say, K outputs, represented
vertically. Now slide that neural network across the whole image; as a result,
we will get another image, with different width, height, and depth. Instead of
just the R, G, and B channels, we now have more channels, but less width and
height. This operation is called convolution. If the patch size were the same
as that of the image, it would be a regular neural network. Because of this
small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
• Input Layers: This is the layer in which we give input to our model. In a
CNN, the input will generally be an image or a sequence of images. This
layer holds the raw input image, with width 32, height 32, and
depth 3.
• Convolutional Layers: This is the layer used to extract features from the
input dataset. It applies a set of learnable filters, known as kernels, to
the input images. The filters/kernels are small matrices, usually of 2×2,
3×3, or 5×5 shape. Each filter slides over the input image data and
computes the dot product between the kernel weights and the
corresponding input image patch. The output of this layer is referred to as
feature maps. Suppose we use a total of 12 filters for this layer; we’ll get
an output volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the
preceding layer, activation layers add nonlinearity to the network. This
layer applies an element-wise activation function to the output of the
convolution layer. Some common activation functions are ReLU: max(0,
x), tanh, Leaky ReLU, etc. The volume remains unchanged, hence the
output volume will have dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the covnets, and its
main function is to reduce the size of the volume, which makes
computation fast, reduces memory use, and also prevents overfitting. Two
common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16 x 16 x 12.
• Flattening: The resulting feature maps are flattened into a one-
dimensional vector after the convolution and pooling layers so they can
be passed into a fully connected layer for classification or regression.
• Fully Connected Layers: It takes the input from the previous layer and
computes the final classification or regression task.
• Output Layer: The output from the fully connected layers is then fed
into a logistic function for classification tasks, such as sigmoid or
softmax, which converts the output for each class into its probability
score.
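The layer stack walked through above can be written in a few lines with the Keras API (this sketch assumes TensorFlow is installed; the dense-layer width and the 10-class output are illustrative choices):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),           # input: 32x32 RGB image
    layers.Conv2D(12, (3, 3), padding="same",  # conv layer -> 32x32x12
                  activation="relu"),          # activation (ReLU)
    layers.MaxPooling2D((2, 2), strides=2),    # pooling -> 16x16x12
    layers.Flatten(),                          # flatten to a 1-D vector
    layers.Dense(64, activation="relu"),       # fully connected layer
    layers.Dense(10, activation="softmax"),    # output: class probabilities
])
model.summary()
```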
Recurrent Neural Network (RNN) is a type of Neural Network where the output
from the previous step is fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent of each other, but in
cases when it is required to predict the next word of a sentence, the previous
words are required and hence there is a need to remember the previous words.
Thus, RNN came into existence, which solved this issue with the help of a
Hidden Layer. The main and most important feature of RNN is its Hidden state,
which remembers some information about a sequence. The state is also
referred to as Memory State since it remembers the previous input to the
network. It uses the same parameters for each input as it performs the same
task on all the inputs or hidden layers to produce the output. This reduces the
complexity of parameters, unlike other neural networks.
The hidden-state update and output of a simple RNN can be written as:

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$

where:
h_t -> current hidden state
h_{t-1} -> previous hidden state
x_t -> input at time step t
y_t -> output
W_hh, W_xh, W_hy -> weight matrices shared across all time steps
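A minimal sketch of this recurrence in NumPy, with small illustrative dimensions; note that the same three weight matrices are reused at every time step, which is what keeps the parameter count independent of sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2   # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

h = np.zeros(hidden_size)                     # initial hidden (memory) state
sequence = rng.normal(size=(5, input_size))   # 5 time steps of input
for x_t in sequence:
    h = np.tanh(W_xh @ x_t + W_hh @ h)        # h_t depends on x_t and h_{t-1}
    y_t = W_hy @ h                            # output at time t
    print(y_t.round(3))
```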
Use Cases
1. Machine Translation:
RNN can be used to build a deep learning model that can translate text from
one language to another without the need for human intervention. You can,
for example, translate a text from your native language to English.
2. Text Creation:
RNNs can also be used to build a deep learning model for text generation.
Based on the previous sequence of words/characters used in the text, a trained
model learns the likelihood of occurrence of a word/character. A model can be
trained at the character, n-gram, sentence, or paragraph level.
3. Captioning of images:
The process of creating text that describes the content of an image is known
as image captioning. The image's content can depict the object as well as the
action of the object on the image. In the image below, for example, the trained
deep learning model using RNN can describe the image as "A lady in a green
coat is reading a book under a tree.”
4. Recognition of Speech:
This is also known as Automatic Speech Recognition (ASR), and it is capable
of converting human speech into written or text format. Don't mix up speech
recognition and voice recognition; speech recognition primarily focuses on
converting voice data into text, whereas voice recognition identifies the user's
voice.
Speech recognition technologies that are used on a daily basis by various users
include Alexa, Cortana, Google Assistant, and Siri.
5. Time Series Prediction:
You can use stock market data to build a machine learning model that can
forecast future stock prices based on what the model learns from historical
data. This can assist investors in making data-driven investment decisions.