
CP4252 MACHINE LEARNING L T P C 3 0 2 4

COURSE OBJECTIVES:

• To understand the concepts and mathematical foundations of machine learning and the types of problems tackled by machine learning
• To explore the different supervised learning techniques including ensemble methods
• To learn different aspects of unsupervised learning and reinforcement learning
• To learn the role of probabilistic methods for machine learning
• To understand the basic concepts of neural networks and deep learning

UNIT I INTRODUCTION AND MATHEMATICAL FOUNDATIONS


What is Machine Learning? – Need – History – Definitions – Applications – Advantages, Disadvantages & Challenges – Types of Machine Learning Problems – Mathematical Foundations – Linear Algebra & Analytical Geometry – Probability and Statistics – Bayesian Conditional Probability – Vector Calculus & Optimization – Decision Theory – Information Theory

UNIT II SUPERVISED LEARNING


Introduction – Discriminative and Generative Models – Linear Regression – Least Squares – Under-fitting / Overfitting – Cross-Validation – Lasso Regression – Classification – Logistic Regression – Gradient Linear Models – Support Vector Machines – Kernel Methods – Instance based Methods – K-Nearest Neighbors – Tree based Methods – Decision Trees – ID3 – CART – Ensemble Methods – Random Forest – Evaluation of Classification Algorithms

UNIT III UNSUPERVISED LEARNING AND REINFORCEMENT LEARNING


Introduction – Clustering Algorithms – K-Means – Hierarchical Clustering – Cluster Validity – Dimensionality Reduction – Principal Component Analysis – Recommendation Systems – EM Algorithm. Reinforcement Learning – Elements – Model based Learning – Temporal Difference Learning

UNIT IV PROBABILISTIC METHODS FOR LEARNING


Introduction – Naïve Bayes Algorithm – Maximum Likelihood – Maximum Apriori – Bayesian Belief Networks – Probabilistic Modelling of Problems – Inference in Bayesian Belief Networks – Probability Density Estimation – Sequence Models – Markov Models – Hidden Markov Models

UNIT V NEURAL NETWORKS AND DEEP LEARNING


Neural Networks – Biological Motivation – Perceptron – Multi-layer Perceptron – Feed Forward Network – Back Propagation – Activation and Loss Functions – Limitations of Machine Learning – Deep Learning – Convolution Neural Networks – Recurrent Neural Networks – Use cases
Unit – 1. Introduction and Mathematical Foundations

What is Machine Learning?


Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way humans learn, gradually improving in accuracy. It is a growing technology that enables computers to learn automatically from past data. Machine learning uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is used for tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

Machine Learning is an application of Artificial Intelligence that enables systems to learn from
vast volumes of data and solve specific problems. It uses computer algorithms that improve
their efficiency automatically through experience.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason is that machine learning can handle tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually process the huge amounts of data available, so we need computer systems, and machine learning makes this easy for us.

We can train machine learning algorithms by providing them huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured with a cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyse user interests and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

History
A few decades ago (about 40-50 years), machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". The idea behind machine learning, however, is quite old and has a long history. Below are some milestones in the history of machine learning:

Machine learning was first conceived from the mathematical modelling of neural networks. A paper by logician Walter Pitts and neuroscientist Warren McCulloch, published in 1943, attempted to mathematically map out thought processes and decision making in human cognition.
In 1950, Alan Turing proposed the Turing Test, which became the litmus test for whether machines were deemed "intelligent" or "unintelligent". The criterion for a machine to receive the status of an "intelligent" machine was its ability to convince a human being that it, the machine, was also a human being. Soon after, a summer research program at Dartmouth College became the official birthplace of AI.

From this point on, "intelligent" machine learning algorithms and computer programs started to appear, doing everything from planning travel routes for salespeople to playing board games such as checkers and tic-tac-toe with humans.
Intelligent machines went on to do everything from using speech recognition, to learning to pronounce words the way a baby would, to defeating a world chess champion at his own game. In this way machine learning grew from simple mathematical models into a sophisticated technology.

Definitions
Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as follows:

With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance.

A machine has the ability to learn if it can improve its performance by gaining
more data.

Applications
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily life even without knowing it, for example in Google Maps, Google Assistant, Alexa, and so on. Some real-world applications of machine learning are:

o Human Resource Information Systems: Also called an HRIS system, it identifies the best candidates for an open position by using machine learning models to filter the applications.
o Business Intelligence (BI): Machine learning is used by BI vendors in their software to identify potentially important anomalies and patterns in data points.
o Customer Relationship Management: CRM software uses machine learning models to analyse email and prompt sales team members to respond to the most important messages first.
o Virtual Assistants: Smart assistants usually combine supervised and unsupervised machine learning models to decipher natural speech and supply context.
o Self-Driving cars: Algorithms based on the Machine Learning model are used
to drive the car.
Advantages
1. Automation

Machine learning is one of the driving forces behind automation, cutting down time and human workload. Automation can now be seen everywhere, with complex algorithms doing the hard work for the user. Automation is more reliable, efficient, and quick. With the help of machine learning, advanced computers are being designed that can handle several machine learning models and complex algorithms. Although automation is spreading quickly across industries, a lot of research and innovation is still required in this field.

2. Scope of Improvement

Machine Learning is a field where things keep evolving. It gives many opportunities
for improvement and can become the leading technology in the future. A lot of
research and innovation is happening in this technology, which helps improve software
and hardware.

3. Enhanced Experience in Online Shopping and Quality Education

Machine learning is going to be used extensively in the education sector, and it will enhance the quality of education and the student experience. This has already emerged in China, where machine learning has been used to improve student focus. In the e-commerce field, machine learning studies your search feed and gives suggestions based on it. Depending on search and browsing history, it pushes targeted advertisements and notifications to users.

4. Wide Range of Applicability

This technology has a very wide range of applications. Machine learning plays a role
in almost every field, like hospitality, ed-tech, medicine, science, banking, and
business. It creates more opportunities.

Disadvantages
Nothing is perfect in the world. Machine Learning has some serious limitations, which
are bigger than human errors.

1. Data Acquisition

The whole concept of machine learning is about identifying useful data. The outcome will be incorrect if a credible data source is not provided. The quality of the data is also significant: if higher-quality data is needed, the user or institution must wait for it, which causes delays in producing the output. So machine learning depends significantly on the data and its quality.

2. Time and Resources

The data that machines process is huge in quantity and varies greatly. Machines require time for their algorithms to adjust to the environment and learn from it. Trial runs are held to check the accuracy and reliability of the machine. Setting up that quality of infrastructure requires massive and expensive resources and high-quality expertise. Trial runs are costly in terms of both time and expense.

3. Results Interpretations

One of the biggest limitations of machine learning is that the interpreted results we get from it cannot be one hundred percent accurate; they will have some degree of inaccuracy. For a high degree of accuracy, algorithms must be developed so that they give reliable results.

4. High Error Chances

The errors committed during the initial stages are large, and if not corrected at that time, they create havoc. Bias and incorrectness have to be dealt with separately; they are not interconnected. Machine learning depends on two factors, data and algorithm, and all errors depend on these two variables. Any incorrectness in either variable has large repercussions on the output.

5. Social Changes

Machine learning is bringing numerous social changes. The role of machine learning-based technology in society has increased many times over. It is influencing the thought processes of society and creating unwanted problems, such as character assassination and the misuse of sensitive details, which disturb the social fabric.

6. Elimination of Human Interface

Automation, artificial intelligence, and machine learning have eliminated the human interface from some work, and with it some employment opportunities. Now, such work is conducted with the help of artificial intelligence and machine learning.

7. Changing Nature of Jobs

With the advancement of machine learning, the nature of jobs is changing. Much work is now done by machines, which is eating up jobs that humans used to do. It is difficult for those without technical education to adjust to these changes.
8. Highly Expensive

This software is highly expensive, and not everybody can own it. Government agencies,
big private firms, and enterprises mostly own it. It needs to be made accessible to
everybody for wide use.

9. Privacy Concern

One of the pillars of machine learning is data, and the collection of data has raised fundamental questions about privacy. The way data is collected and used for commercial purposes has always been a contentious issue. In India, the Supreme Court has declared privacy a fundamental right of Indians; without the user's permission, data cannot be collected, used, or stored. However, many cases have come up in which big firms collect data without the user's knowledge and use it for their commercial gain.

10. Research and Innovations

Machine learning is an evolving concept. This area has not yet seen any major development that fully revolutionizes an economic sector, and it requires continuous research and innovation.

Challenges
Although machine learning is used in every industry and helps organizations make more informed and data-driven choices that are more effective than classical methodologies, it still has many problems that cannot be ignored. Here are some common issues that professionals face when building ML skills and creating applications from scratch.

1. Inadequate Training Data

The major issue that arises when using machine learning algorithms is a lack of both quality and quantity of data. Although data plays a vital role in machine learning, many data scientists report that inadequate, noisy, and unclean data make machine learning algorithms extremely hard to train well. For example, a simple task may require thousands of samples, while an advanced task such as speech or image recognition may need millions of examples. Further, data quality is also important for the algorithms to work ideally, yet poor data quality is common in machine learning applications. Data quality can be affected by factors such as the following:

o Noisy data - It is responsible for inaccurate predictions that affect decisions as well as accuracy in classification tasks.
o Incorrect data - It is also responsible for faulty behaviour and results in machine learning models; hence, incorrect data may reduce the accuracy of the results.
o Generalizing output data - Sometimes generalizing the output data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to
less accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning algorithms.

3. Non-representative training data

To make sure our trained model generalizes well, we have to ensure that the training data is representative of the new cases to which we want to generalize. The training data must cover all cases that have already occurred as well as those that are occurring.

If we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for general cases and provides accurate decisions. If there is too little training data, there will be sampling noise, resulting in a non-representative training set; such a model will not be accurate in its predictions and will be biased towards one class or group.

Hence, we should use representative data in training to protect against bias and make accurate predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by machine learning engineers and data scientists. When a machine learning model is trained on noisy or skewed data, it starts capturing the noise and inaccuracies of the training data set, which negatively affects the performance of the model. Let's understand this with a simple example: suppose the training data contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. Then there is a considerable probability of identifying an apple as a papaya, because we have a massive amount of biased data in the training set; hence the predictions are negatively affected. A common reason behind overfitting is the use of highly flexible non-linear methods, which can build unrealistic models of the data. Overfitting can be reduced by using simpler linear and parametric algorithms, among the techniques listed below.

Methods to reduce overfitting (a brief regularization sketch follows this list):

o Increase the amount of training data in the dataset.
o Reduce model complexity by selecting a simpler model with fewer parameters.
o Apply Ridge regularization or Lasso regularization.
o Use early stopping during the training phase.
o Reduce the noise in the data.
o Reduce the number of attributes in the training data.
o Constrain the model.
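As a concrete illustration of the regularization idea, here is a minimal, hypothetical sketch (assuming scikit-learn and NumPy are available, with synthetic data invented for this example) that compares plain linear regression with Ridge and Lasso:

```python
# Illustrative sketch (not from the syllabus): reducing overfitting with
# Ridge/Lasso regularization, assuming scikit-learn and NumPy are available.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # many features relative to samples
y = X[:, 0] * 3.0 + rng.normal(size=200)  # only the first feature actually matters

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("plain", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    # A large gap between training and test R^2 is the usual sign of overfitting.
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```

A noticeably smaller gap between the training and test scores for the regularized models is the typical indication that overfitting has been reduced.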

Underfitting:

Underfitting is just the opposite of overfitting. When a machine learning model is trained with too little data, or is too simple, it produces incomplete and inaccurate predictions and the accuracy of the model suffers.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pair of pants. This generally happens when we have limited data in the data set and we try to build a linear model from non-linear data. In such scenarios, the model lacks the complexity it needs, its rules are too simple to fit the data set, and it starts making wrong predictions.

Methods to reduce underfitting:

o Increase model complexity.
o Remove noise from the data.
o Train on more and better features.
o Reduce the constraints on the model.
o Increase the number of training epochs to get better results.

5. Monitoring and maintenance

Since generalized output is mandatory for any machine learning model, regular monitoring and maintenance are compulsory. As the data changes, the results change too; hence editing the code, and allocating resources to monitor the model, also become necessary.
6. Getting bad recommendations

A machine learning model operates in a specific context, which can result in bad recommendations and concept drift. For example, at a specific time a customer may be looking for some gadgets; the customer's requirements change over time, but the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This situation is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome this by regularly updating and monitoring the data according to expectations.

7. Lack of skilled resources

Although machine learning and artificial intelligence are continuously growing in the market, these fields are still young in comparison to others. The absence of skilled manpower is also an issue. We need people with in-depth knowledge of mathematics, science, and technology to develop and manage the scientific substance of machine learning.

8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning algorithm: identifying which customers act on the recommendations shown by the model and which do not even look at them. Hence, an algorithm is necessary to recognize customer behaviour and trigger relevant recommendations for the user based on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine learning and artificial intelligence are still relatively new technologies in an experimental phase and are continuously changing over time. Much of the work proceeds by trial and error, so the probability of error is higher than expected. Further, the process includes analysing the data, removing data bias, training the model, applying complex mathematical calculations, and so on, making the procedure more complicated and quite tedious.

10. Data Bias

Data bias is another big challenge in machine learning. These errors exist when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. We can resolve this by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce the bias.

Methods to remove Data Bias:

o Research more for customer segmentation.


o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyse data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.
o Use multi-pass annotation such as sentiment analysis, content moderation, and intent
recognition.

11. Lack of Explainability

This means that the outputs of a model cannot be easily understood, because the model is built in ways that are specific to the conditions it was trained for. Hence, a lack of explainability is also found in machine learning algorithms, which reduces the credibility of the algorithms.

12. Slow implementations and results

This issue is also very common in machine learning models. Although machine learning models can be highly efficient at producing accurate results, they are time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This demands continuous maintenance and monitoring of the model to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, then the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is said to be good if the training data has a good set of features with few to no irrelevant features.

Types of Machine Learning Problems


Machine learning is a subset of AI, which enables the machine to automatically
learn from data, improve performance from past experiences, and make
predictions. Machine learning contains a set of algorithms that work on a huge
amount of data. Data is fed to these algorithms to train them, and on the basis of
training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression,
Classification, Forecasting, Clustering, and Associations, etc.

Based on the methods and way of learning, machine learning is divided into mainly
four types, which are:

1. Supervised Machine Learning


2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to outputs. More precisely, we first train the machine with the inputs and corresponding outputs, and then we ask the machine to predict the output for a test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), and so on. After training, we input a picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat. So it will put it in the cat category. This is how the machine identifies objects in supervised learning.

The main goal of the supervised learning technique is to map the input
variable(x) with the output variable(y). Some real-world applications of supervised
learning are Risk Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are
given below:

o Classification
o Regression

a) Classification

Classification algorithms are used to solve the classification problems in which the
output variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc.
The classification algorithms predict the categories present in the dataset. Some real-
world examples of classification algorithms are Spam Detection, Email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm


o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
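As an illustration of a classification workflow (a hypothetical sketch, not part of the syllabus text), the following assumes scikit-learn is available and uses its built-in Iris dataset with logistic regression; any of the classifiers listed above could be substituted:

```python
# Hypothetical sketch of a classification workflow, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)             # a labelled dataset (categorical output)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)       # any classifier from the list could be used
clf.fit(X_train, y_train)                     # learn the mapping x -> y
y_pred = clf.predict(X_test)                  # predict categories for unseen data
print("accuracy:", accuracy_score(y_test, y_pred))
```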

b) Regression

Regression algorithms are used to solve regression problems, in which the output variable is continuous and there is a relationship (often assumed linear) between the input and output variables. They are used to predict continuous output variables such as market trends, weather, and so on.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
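For comparison with the classification sketch above, here is a minimal, hypothetical regression sketch (assuming scikit-learn and NumPy, with synthetic data invented for illustration):

```python
# Hypothetical sketch of a regression workflow, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is (roughly) a linear function of x plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction at x=4:", reg.predict([[4.0]])[0])   # continuous output
```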
Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful for predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning

Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can be
done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning,
the machine is trained using the unlabelled dataset, and the machine predicts the
output without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified
nor labelled, and the model acts on that data without any supervision.

The main aim of an unsupervised learning algorithm is to group or categorize an unsorted dataset according to similarities, patterns, and differences. The machine is instructed to find the hidden patterns in the input dataset.

Let's take an example to understand this more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.

The machine will discover the patterns and differences on its own, such as colour differences and shape differences, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by
their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
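To make the clustering idea concrete, here is a minimal, hypothetical K-Means sketch (assuming scikit-learn and NumPy; the two "blobs" of points are invented for illustration):

```python
# Hypothetical sketch: grouping unlabelled points with K-Means, assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabelled "blobs" of points, e.g. two groups of customers.
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("first five labels:", kmeans.labels_[:5])   # group assignment per point
```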
2) Association

Association rule learning is an unsupervised learning technique, which finds interesting


relations among variables within a large dataset. The main aim of this learning
algorithm is to find the dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum profit. This algorithm is
mainly applied in Market Basket analysis, Web usage mining, continuous
production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
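To make the idea of support and confidence concrete, here is a small from-scratch sketch on invented toy shopping baskets (the Apriori, Eclat, and FP-growth algorithms named above compute the same quantities far more efficiently on large datasets):

```python
# Illustrative from-scratch sketch of association rule mining on toy baskets.
from itertools import combinations

baskets = [{"bread", "milk"},
           {"bread", "butter", "milk"},
           {"bread", "butter"},
           {"milk", "butter"},
           {"bread", "milk", "butter"}]

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

items = sorted(set().union(*baskets))
for a, b in combinations(items, 2):
    sup = support({a, b})
    if sup >= 0.4:                                   # minimum support threshold
        conf = sup / support({a})                    # confidence of the rule a -> b
        print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```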

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with unlabelled data that is not mapped to any output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques for building recommendation applications for different web
applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to discover
fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to
extract particular information from the database. For example, extracting information
of each user located at a particular location.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning. It represents the
intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabelled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, the data mostly consists of unlabelled examples. Labels are costly to obtain, so for practical purposes a company may have only a few of them. This setting differs from both supervised and unsupervised learning, which are based on the presence and absence of labels respectively.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. The main aim of semi-supervised learning is to make effective use of all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, and this clustering then helps to turn the unlabelled data into labelled data. This is done because labelled data is comparatively more expensive to acquire than unlabelled data.

We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and at college. If that student analyses the same concept on their own without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student revises the concept on their own after first analysing it under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o It is simple and easy to understand the algorithm.


o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o Iteration results may not be stable.


o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning there is no labelled data as in supervised learning; agents learn from their experience only.

The reinforcement learning process is similar to a human being; for example, a child
learns various things by experiences in his day-to-day life. An example of
reinforcement learning is to play a game, where the Game is the environment, moves
of an agent at each step define states, and the goal of the agent is to get a high score.
Agent receives feedback in terms of punishment and rewards.

Due to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using Markov Decision Process


(MDP). In MDP, the agent constantly interacts with the environment and performs
actions; at each action, the environment responds and generates a new state.
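To make the interaction loop concrete, here is a minimal, hypothetical sketch of an agent acting in a toy MDP (the environment, rewards, and random policy are invented for illustration; a real RL algorithm would learn a policy from the rewards rather than act randomly):

```python
# Minimal illustrative sketch of the MDP interaction loop (not a full RL algorithm):
# an agent on a 1-D line gets +1 reward for reaching state 5, -0.1 per step otherwise.
import random

def step(state, action):                 # toy environment dynamics (assumed for illustration)
    next_state = max(0, min(5, state + action))
    reward = 1.0 if next_state == 5 else -0.1
    done = next_state == 5
    return next_state, reward, done

state, total_reward = 0, 0.0
for t in range(100):
    action = random.choice([-1, +1])     # a learning agent would improve this policy
    state, reward, done = step(state, action)
    total_reward += reward               # the agent's goal is to maximize this
    if done:
        break
print("episode finished at step", t, "with total reward", round(total_reward, 2))
```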

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies


increasing the tendency that the required behaviour would occur again by adding
something. It enhances the strength of the behaviour of the agent and positively
impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour would
occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how RL can be used in computer systems to automatically learn to schedule resources across waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely used in robotics applications. Robots are used in industry and manufacturing, and these robots are made more capable with reinforcement learning. Different industries have their own vision of building intelligent robots using AI and machine learning technology.
o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by Salesforce.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems that are difficult to solve with general techniques.
o The learning model of RL is similar to human learning; hence very accurate results can be obtained.
o It helps in achieving long-term results.

Disadvantage

o RL algorithms are not preferred for simple problems.


o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken
the results.

The curse of dimensionality limits reinforcement learning for real physical systems.

Mathematical Foundations

• Math helps you select the correct machine learning algorithm.


Understanding math gives you insight into how the model works,
including choosing the right model parameter and the
validation strategies.

• Estimating how confident we are with the model result by producing


the right confidence interval and uncertainty
measurements needs an understanding of math.

• Choosing the right model means weighing many aspects, such as metrics, training time, model complexity, number of parameters, and number of features, and math is needed to understand all of these aspects.

• You could develop a customized model that fits your own problem
by knowing the machine learning model’s math.

Six math subjects form the foundation for machine learning. These subjects are intertwined in developing a machine learning model and reaching the "best" model for generalizing the dataset.

Linear Algebra
Linear algebra is a branch of mathematics that deals with linear equations and their
representations in the vector space using matrices. In other words, linear algebra is the study
of linear functions and vectors. It is one of the most central topics of mathematics. Most modern
geometrical concepts are based on linear algebra.

Linear algebra facilitates the modeling of many natural phenomena and hence is an integral part of engineering and physics. Linear equations, matrices, and vector spaces are the most important components of this subject. In this section, we will learn more about linear algebra and the various associated topics.

Linear algebra can be defined as a branch of mathematics that deals with the study of
linear functions in vector spaces. When information related to linear functions is
presented in an organized form then it results in a matrix. Thus, linear algebra is
concerned with vector spaces, vectors, linear functions, the system of linear equations,
and matrices. These concepts are a prerequisite for sister topics such as geometry
and functional analysis.

The branch of mathematics that deals with vectors, matrices, finite or infinite dimensions, as well as linear mappings between such spaces, is defined as linear algebra. It is used in both pure and applied mathematics along with different technical fields such as physics, engineering, and the natural sciences.

Branches of Linear Algebra

Linear algebra can be categorized into three branches depending upon the level of difficulty
and the kind of topics that are encompassed within each. These are elementary, advanced, and
applied linear algebra. Each branch covers different aspects of matrices, vectors, and linear
functions.

Elementary Linear Algebra

Elementary linear algebra introduces students to the basics of linear algebra. This includes
simple matrix operations, various computations that can be done on a system of linear
equations, and certain aspects of vectors. Some important terms associated with elementary
linear algebra are given below:

Scalars - A scalar is a quantity that only has magnitude and not direction. It is an element that
is used to define a vector space. In linear algebra, scalars are usually real numbers.

Vectors - A vector is an element in a vector space. It is a quantity that can describe both the
direction and magnitude of an element.

Vector Space - The vector space consists of vectors that may be added together and multiplied
by scalars.

Matrix - A matrix is a rectangular array wherein the information is organized in the form of
rows and columns. Most linear algebra properties can be expressed in terms of a matrix.

Matrix Operations - These are simple arithmetic operations such as addition, subtraction,
and multiplication that can be conducted on matrices.
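As an aside, the terms above can be illustrated with a short sketch, assuming NumPy is available (the numbers are made up for illustration):

```python
# Illustrative sketch of scalars, vectors, matrices, and matrix operations with NumPy.
import numpy as np

scalar = 2.0                      # a scalar: magnitude only
v = np.array([1.0, 2.0, 3.0])     # a vector in R^3
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # a 2x2 matrix (rows and columns)

print(scalar * v)                 # scalar multiplication of a vector
print(A + A)                      # matrix addition
print(A @ A)                      # matrix multiplication
```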

Advanced Linear Algebra

Once the basics of linear algebra have been introduced to students, the focus shifts to more advanced concepts related to linear equations, vectors, and matrices. Certain important terms that are used in advanced linear algebra are as follows:
Linear Transformations - The transformation of a function from one vector space to another
by preserving the linear structure of each vector space.

Inverse of a Matrix - When the inverse of a matrix is multiplied with the given original matrix, the result is the identity matrix. Thus, A⁻¹A = I.

Eigenvector - An eigenvector is a non-zero vector that changes by a scalar factor (eigenvalue)


when a linear transformation is applied to it.

Linear Map - It is a type of mapping that preserves vector addition and vector multiplication.
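These definitions can be checked numerically; the following is a small illustrative sketch (assuming NumPy, with a made-up matrix) that verifies A⁻¹A = I and Av = λv:

```python
# Illustrative sketch checking the inverse and eigenvector definitions with NumPy.
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))   # True: the inverse times A is the identity

eigvals, eigvecs = np.linalg.eig(A)
v = eigvecs[:, 0]                          # an eigenvector of A (a column of eigvecs)
lam = eigvals[0]                           # its eigenvalue
print(np.allclose(A @ v, lam * v))         # True: A v = lambda v (scaled, not rotated)
```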

Linear Algebra and its Applications

Linear algebra is used in almost every field. Simple algorithms also make
use of linear algebra topics such as matrices. Some of the applications of
linear algebra are given as follows:
• Signal Processing - Linear algebra is used in encoding and manipulating
signals such as audio and video signals. Furthermore, it is required in the
analysis of such signals.
• Linear Programming - It is an optimizing technique that is used to determine
the best outcome of a linear function.
• Computer Science - Data scientists use several linear algebra algorithms to
solve complicated problems.
• Prediction Algorithms - Prediction algorithms use linear models that are
developed using concepts of linear algebra.

Important Notes on Linear Algebra


• Linear algebra is concerned with the study of three broad subtopics - linear
functions, vectors, and matrices
• Linear algebra can be classified into 3 categories. These are elementary,
advanced, and applied linear algebra.
• Elementary linear algebra is concerned with the introduction to linear algebra.
Advanced linear algebra builds on these concepts. Applied linear algebra
applies these concepts to real-life situations.

Analytic Geometry (Coordinate Geometry)


Analytic geometry is a study in which we learn the position of data (points) using an ordered pair of coordinates. This study is concerned with defining and representing geometrical shapes numerically and extracting numerical information from the shapes' numerical definitions and representations. In simpler terms, we project the data onto a plane and obtain numerical information from there.

For example, by projecting a dataset onto the plane we can read off numerical information about each data point. How we acquire information from this representation is the heart of analytic geometry. Some important terms:

• Distance Function

A distance function is a function that provides numerical information about the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise, they are different from each other.
An example of a distance function is the Euclidean distance, which calculates the straight-line distance between two data points.

• Inner Product

The inner product is a concept that introduces intuitive geometrical


concepts, such as the length of a vector and the angle or
distance between two vectors. It is often denoted as ⟨x,y⟩ (or
occasionally (x,y) or ⟨x|y⟩).
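Both terms can be illustrated with a few lines of NumPy (a hypothetical sketch with made-up points):

```python
# Illustrative sketch: Euclidean distance and inner product with NumPy.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

distance = np.linalg.norm(x - y)          # Euclidean distance between the two points
inner = np.dot(x, y)                      # inner (dot) product <x, y>
cos_angle = inner / (np.linalg.norm(x) * np.linalg.norm(y))  # angle between the vectors

print(distance, inner, cos_angle)
```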

Probability and Statistics


Probability is, loosely speaking, the study of uncertainty. The probability of an event can be thought of as the fraction of times the event occurs, or as a degree of belief about the event's occurrence. A probability distribution is a function that measures the probability of a particular outcome (or a set of outcomes) associated with a random variable. A common example is the normal (Gaussian) distribution and its probability density function.
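As an illustration, the normal density p(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)) can be evaluated with a few lines of Python (a sketch added here for illustration, not part of the original notes):

```python
# Illustrative sketch: evaluating the normal (Gaussian) probability density function.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(0.0))   # density at the mean, about 0.3989
print(normal_pdf(2.0))   # density two standard deviations away, about 0.0540
```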

Probability theory and statistics are often associated with each other, but they concern different aspects of uncertainty:

• In mathematics, we define probability as a model of some process, where random variables capture the underlying uncertainty, and we use the rules of probability to summarize what happens.

• In statistics, we observe something that has happened and try to figure out the underlying process that explains the observations.

When we talk about machine learning, it is close to statistics because its


goal is to construct a model that adequately represents the process that
generated the data.

Bayesian Conditional Probability


Bayes' Theorem, named after 18th-century British mathematician Thomas
Bayes, is a mathematical formula for determining conditional probability.
Conditional probability is the likelihood of an outcome occurring, based on
a previous outcome having occurred in similar circumstances. Bayes'
theorem provides a way to revise existing predictions or theories (update
probabilities) given new or additional evidence.

In finance, Bayes' Theorem can be used to rate the risk of lending money to
potential borrowers. The theorem is also called Bayes' Rule or Bayes' Law
and is the foundation of the field of Bayesian statistics.

Bayes Theorem Formula


If A and B are two events, then the formula for the Bayes theorem is given by:

P(A|B) = [P(B|A) × P(A)] / P(B)

where P(A|B) is the probability of event A occurring given that event B has already occurred.

The Bayes theorem states that the probability of an event is based on prior knowledge
of the conditions that might be related to the event. It is also used to examine the case
of conditional probability. If we are aware of conditional probability, we can use the
Bayes formula to calculate reverse probabilities. The probability of A occurring given
that event B has taken place is equal to the product of the probability of event A
occurring at all and the probability of event B taking place given that event A has taken
place, divided by the probability of event B taking place at all.
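The following is a small illustrative sketch with made-up numbers (a hypothetical disease-testing scenario, not from the source text) showing how the formula reverses a conditional probability:

```python
# Illustrative sketch (hypothetical numbers): Bayes' theorem for a medical test.
# Assume 1% of people have a disease, the test detects it 95% of the time,
# and it gives a false positive 5% of the time.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test, P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161: a positive test is far from certain
```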

Vector Calculus
Calculus is a mathematical study concerned with continuous change, built mainly on functions and limits. Vector calculus is concerned with the differentiation and integration of vector fields. Vector calculus is often called multivariate calculus, although multivariate calculus has a slightly different scope: it deals with the application of calculus to functions of multiple independent variables.

• Derivative and Differentiation

The derivative of a function of real numbers measures the change of the function value (output) with respect to a change in its argument (input). Differentiation is the action of computing a derivative.

Derivative equation: f'(x) = lim (h → 0) [f(x + h) − f(x)] / h

• Partial Derivative

The partial derivative of a function of several variables is its derivative with respect to one of those variables, with the other variables held constant (as opposed to the total derivative, in which all variables are allowed to vary).
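A minimal numerical sketch of this idea, using finite differences on a made-up function f(x, y) = x²y (added here purely for illustration):

```python
# Illustrative sketch: numerical partial derivatives of f(x, y) = x^2 * y
# using finite differences (one variable varied, the other held constant).
def f(x, y):
    return x ** 2 * y

def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x, y)) / h   # vary x, hold y constant

def partial_y(f, x, y, h=1e-6):
    return (f(x, y + h) - f(x, y)) / h   # vary y, hold x constant

print(partial_x(f, 2.0, 3.0))   # about 12  (analytically 2*x*y = 12)
print(partial_y(f, 2.0, 3.0))   # about 4   (analytically x^2 = 4)
```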

• Gradient
The gradient is a term related to the derivative, or the rate of change of a function; you can consider the gradient a fancy word for derivative. The term gradient is typically used for functions with several inputs and a single (scalar) output. The gradient has a direction, indicating which way to move from the current location, e.g., up, down, right, or left.

Optimization

In the learning objective, training a machine learning model is all about


finding a good set of parameters. What we consider “good” is determined
by the objective function or the probabilistic models. This is
what optimization algorithms are for; given an objective function, we
try to find the best value.

Commonly, the objective functions in machine learning are functions we try to minimize, meaning the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function, where the gradient points uphill. That is why we move downhill (opposite to the gradient) and hope to find the lowest (deepest) point. This is the concept of gradient descent.
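A minimal illustrative sketch of gradient descent on a simple one-dimensional objective (the function and step size are chosen only for illustration):

```python
# Illustrative sketch: gradient descent on the objective f(x) = (x - 3)^2,
# whose gradient is f'(x) = 2 * (x - 3) and whose minimum is at x = 3.
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0                    # starting point
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * grad(x)   # move opposite to the gradient (downhill)

print(round(x, 4))         # close to 3.0, the minimum of the objective
```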

Gradient Descent
There are a few terms to know as a starting point when learning optimization. They are:

• Local Minima and Global Minima

The point at which a function takes its minimum value is called the global minimum. However, when the goal is to minimize the function and we solve it using optimization algorithms such as gradient descent, the function may have low values at several different points. Points that appear to be minima but are not where the function actually takes its minimum value are called local minima.

Local and Global Minima

• Unconstrained Optimization and Constrained Optimization

Unconstrained Optimization is an optimization function where we


find a minimum of a function under the assumption that the parameters
can take any possible value (no parameter limitation). Constrained
Optimization simply limits the possible value by introducing a set of
constraints.
Gradient descent is an unconstrained optimization method if there is no limit on the parameters. If we set some limit, for example x > 1, it becomes a constrained optimization.

Decision Theory
Decision theory is a study of an agent's rational choices that supports all kinds of
progress in technology such as work on machine learning and artificial intelligence.
Decision theory looks at how decisions are made, how multiple decisions influence
one another, and how decision-making parties deal with uncertainty. Decision
theory is also known as theory of choice.

Decision theory involves normative or prescriptive decision theory, which


provides models for optimal decision-making. It also includes descriptive
decision theory that follows from observation. Either of these types of
theory can be applied to different types of technologies – for instance, many
of the enterprise software systems offered by vendors are described as
decision support tools – and so logically, their engineers would benefit from
a study of decision theory.

Similarly, in constructing machine learning tools and artificial intelligence


technologies, scientists are studying decision theory closely. One way to
think about this is that a close study of decision theory can reveal how
human and computer decisions are similar, and how they are different,
which leads researchers and engineers to close the gap between human
cognitive capacity and the capacity of artificial intelligence entities.

Decision theory, combined with probability theory, allows us to make optimal decisions
in situations involving uncertainty such as those encountered in pattern recognition.

Classification problems can be broken down into two separate stages: the inference stage and the decision stage. The inference stage involves using the training data to learn a model for the joint distribution of inputs and classes, which gives us the most complete probabilistic description of the situation. In the end, we must decide on the optimal choice for our situation. This decision stage is generally very simple, even trivial, once we have solved the inference problem.

Bayesian Decision Theory is a simple but fundamental approach to a


variety of problems like pattern classification. The entire purpose of the
Bayes Decision Theory is to help us select decisions that will cost us the
least ‘risk’. There is always some sort of risk attached to any decision
we choose.
Information Theory
Information theory is the study of how much information is present in the signals or data
we receive from our environment. AI / Machine learning (ML) is about extracting
interesting representations/information from data which are then used for building the
models. Thus, information theory fundamentals are key to processing information while
building machine learning models.

What is information theory and what are its key concepts?


Information theory is the study of encoding, decoding, transmitting, and manipulating
information. Information theory provides tools & techniques to compare and measure the
information present in a signal. In simpler words, how much information is present in one
or more statements is a field of study called information theory.

The greater the degree of surprise in the statements, the greater the
information contained in the statements. For example, let’s say commuting from place
A to B takes 3 hours on average and is known to everyone. If somebody makes this
statement, the statement provides no information at all as this is already known to
everyone. Now, if someone says that it takes 2 hours to go from place A to B provided a
specific route is taken, then this statement consists of good bits of information as there is
an element of surprise in the statement.

The extent of information required to describe an event depends on the probability of that event occurring. If the event is a common event, not much information is required to describe it. For unusual events, however, a good amount of information is needed. Unusual events have a higher degree of surprise and hence greater associated information.

The amount of information associated with event outcomes depends upon the probability
distribution associated with that event. In other words, the amount of information is
related to the probability distribution of event outcomes. Recall that the event and its
outcomes can be represented as the different values of the random variable, X from the
given sample space. And, the random variable has an associated probability distribution
with a probability associated with each outcome including the common outcomes
consisting of less information and rare outcomes consisting of a lot of information.
The higher the probability of an event outcome, the lesser the
information contained if that outcome happens. The smaller the probability of an
event outcome, the greater the information contained if that outcome with lesser
probability happens.

How do we measure the information?


There are the following requirements for measuring the information associated with
events:

• Information (or degree of surprise) associated with a single discrete event: The
information associated with a single discrete event can be measured in terms of the
number of bits. Shannon introduced the term bits as the unit of information. This
information is also called self-information.
• Information (or degree of surprise) associated with the random variable whose values
represent different event outcomes where the values can be discrete or continuous.
Information associated with the random variable is related to probability distribution
as described in the previous section. The amount of information associated with the
random variable is measured using Entropy (or Shannon Entropy).
• The entropy of the event representing the random variable equals the average self-
information from observing each outcome of the event.

What is Entropy?
Entropy represents the amount of information associated with a random variable as a
function of its probability distribution, whether that distribution is a probability mass
function (PMF) for a discrete variable or a probability density function (PDF) for a
continuous one. For a discrete random variable X, the entropy is

H(X) = - Σi P(xi) log P(xi)

where P(xi) is the probability of a specific value xi of the random variable X. For a
continuous random variable, the analogous quantity is

h(X) = - ∫ p(x) log p(x) dx

which is also termed differential entropy.
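As a quick check on the discrete formula above, the following hedged sketch computes Shannon entropy in bits with NumPy; the probability distributions used are made up purely for illustration.

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy H(X) = -sum(p * log2(p)), measured in bits.

    Outcomes with zero probability contribute nothing to the sum.
    """
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

# Hypothetical distributions, for illustration only:
print(shannon_entropy([0.5, 0.5]))   # fair coin -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin -> ~0.47 bits (less surprise on average)
print(shannon_entropy([0.25] * 4))   # uniform over 4 outcomes -> 2.0 bits
```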

How are information theory, entropy, and machine learning related?


Information theory, entropy, and machine learning are all related to each other.
Information theory deals with quantifying and extracting information from data or signals,
and entropy is a measure of the information contained in the data or signal. Machine
learning models for classification are trained to minimize the difference between the
estimated and the true probability distributions, measured by the cross-entropy loss;
minimizing this loss minimizes the information lost when the estimated distribution is
used in place of the true one.
Unit – 2. Supervised Learning

Introduction
Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict the
output. Labelled data means input data that is already tagged with the correct
output.

In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find
a mapping function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, where the model
learns about each type of data. Once the training process is completed, the model is
tested on the basis of test data (a held-out subset of the labelled dataset), and then it
predicts the output.

The working of Supervised learning can be easily understood by the below example
and diagram:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the model
for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape,
it classifies the shape on the basis of the number of sides and predicts the output.

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we also need a validation
set, a subset of the data held out from training, to tune control parameters.
o Evaluate the accuracy of the model on the test set; if the model predicts the correct
outputs, the model is accurate. A minimal sketch of these steps is shown below.
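A minimal sketch of these steps, assuming scikit-learn, a bundled toy dataset, and a decision tree as the chosen algorithm (all of which are illustrative choices, not requirements), might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Collect a labelled dataset (the bundled Iris data stands in for real labelled data)
X, y = load_iris(return_X_y=True)

# Split into training and test sets (a validation set could be held out the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Choose a suitable algorithm and execute it on the training dataset
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate the accuracy of the model by providing the test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```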

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:
1. Regression

Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means
the output falls into classes such as Yes-No, Male-Female, True-False, etc. A typical
example is spam filtering. Below are some popular Classification algorithms which come
under supervised learning:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not well suited to handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data differs
significantly from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need sufficient knowledge about the classes of objects.

Discriminative and Generative Models

Understanding Machine Learning Models

Machine learning models can be classified into two types: Discriminative and
Generative. In simple words, a discriminative model makes predictions on
unseen data based on conditional probability and can be used either for
classification or regression problem statements. On the contrary, a
generative model focuses on the distribution of a dataset to return a
probability for a given example.

Humans can adopt either of these two approaches when learning new categories, and the
distinction is related to known effects of causal direction, classification vs. inference
learning, and observational vs. feedback learning. In this section, our focus is on these
two types of machine learning models, generative and discriminative, and on their
importance, comparisons, and differences.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide if
an email is spam or not spam based on the words present in a particular email. To
solve this problem, we consider a joint model over:

• Labels: Y = y, and

• Features: X = {x1, x2, …, xn}

Therefore, the joint distribution of the model can be represented as

P(Y, X) = P(y, x1, x2, …, xn)

Now, our goal is to estimate the probability of spam email i.e., P(Y=1|X). Both
generative and discriminative models can solve this problem but in different ways.

Let’s see why and how they are different!

The Approach of Generative Models


In the case of generative models, to find the conditional probability P(Y|X), they
estimate the prior probability P(Y) and the likelihood P(X|Y) with the help
of the training data and use Bayes' theorem to calculate the posterior
probability P(Y|X):

P(Y|X) = P(X|Y) P(Y) / P(X)

The Approach of Discriminative Models


In the case of discriminative models, to find the probability, they directly assume
some functional form for P(Y|X) and then estimate the parameters of P(Y|X) with
the help of the training data.

What Are Discriminative Models?


The discriminative model refers to a class of models used in Statistical
Classification, mainly used for supervised machine learning. These types of models
are also known as conditional models since they learn the boundaries between
classes or labels in a dataset.

Discriminative models focus on modelling the decision boundary between classes in


a classification problem. The goal is to learn a function that maps inputs to binary
outputs, indicating the class label of the input. Maximum likelihood estimation is
often used to estimate the parameters of the discriminative model, such as the
coefficients of a logistic regression model or the weights of a neural network.

Discriminative models (just as the literal meaning suggests) separate classes instead of
modelling how the data was generated, and they make few assumptions about the
distribution of the data points. However, these models are not capable of generating new
data points. Therefore, the ultimate objective of discriminative models is to separate one
class from another.

If we have some outliers present in the dataset, discriminative models work better
compared to generative models i.e., discriminative models are more robust to
outliers. However, one major drawback of these models is the misclassification
problem, i.e., wrongly classifying a data point.
The Mathematics of Discriminative Models
Training discriminative classifiers or discriminant analysis involves
estimating a function f: X -> Y, or probability P(Y|X)

• Assume some functional form for the probability, such as P(Y|X)


• With the help of training data, we estimate the parameters of P(Y|X)

Examples of Discriminative Models

• Logistic regression
• Support vector machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest

What Are Generative Models?

Generative models are considered a class of statistical models that can


generate new data instances. These models are used in unsupervised
machine learning as a means to perform tasks such as

• Probability and Likelihood estimation,


• Modelling data points
• To describe the phenomenon in data,
• To distinguish between classes based on these probabilities.

Since these models often rely on the Bayes theorem to find the joint
probability, generative models can tackle a more complex task than
analogous discriminative models.
So, the Generative approach focuses on the distribution of individual classes
in a dataset, and the learning algorithms tend to model the underlying
patterns or distribution of the data points (e.g., Gaussian). These models use
the concept of joint probability and create instances where a given feature
(x) or input and the desired output or label (y) exist simultaneously.

These models use probability estimates and likelihood to model data


points and differentiate between different class labels present in a dataset.
Unlike discriminative models, these models can also generate new data
points.

However, they also have a major drawback: if outliers are present in the dataset,
they affect these types of models to a significant extent.

The Mathematics of Generative Models


Training generative classifiers involves estimating a function f: X -> Y, or
probability P(Y|X):

• Assume some functional form for the probabilities such as P(Y), P(X|Y)
• With the help of training data, we estimate the parameters of P(X|Y), P(Y)
• Use the Bayes theorem to calculate the posterior probability P(Y |X)
Examples of Generative Models

• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
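To make the contrast concrete, the hedged sketch below trains one generative classifier (Gaussian Naïve Bayes, which models P(X|Y) and P(Y)) and one discriminative classifier (logistic regression, which models P(Y|X) directly) on the same synthetic data; the dataset and settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB           # generative: models P(X|Y) and P(Y)
from sklearn.linear_model import LogisticRegression  # discriminative: models P(Y|X) directly

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Generative (Naive Bayes) accuracy:    ", generative.score(X_test, y_test))
print("Discriminative (LogisticReg) accuracy:", discriminative.score(X_test, y_test))

# Both expose P(Y|X) for a new example, but arrive at it differently:
print(generative.predict_proba(X_test[:1]))      # via Bayes' theorem on P(X|Y)P(Y)
print(discriminative.predict_proba(X_test[:1]))  # via a learned sigmoid of a linear score
```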

Difference Between Discriminative and Generative


Models

Let’s see some of the differences between the Discriminative and Generative
Models.

Core Idea

Discriminative models draw boundaries in the data space, while generative


models try to model how data is placed throughout the space. A generative
model explains how the data was generated, while a discriminative model
focuses on predicting the labels of the data.

Mathematical Intuition

In mathematical terms, discriminative machine learning trains a model, which


is done by learning parameters that maximize the conditional
probability P(Y|X). On the other hand, a generative model learns parameters
by maximizing the joint probability of P(X, Y).

Applications
Discriminative models recognize existing data, i.e., discriminative modelling
identifies tags and sorts data and can be used to classify data, while
generative modelling produces new data instances.

Since these models use different approaches to machine learning, both are
suited for specific tasks i.e., Generative models are useful for unsupervised
learning tasks. In contrast, discriminative models are useful for supervised
learning tasks. GANs (Generative Adversarial Networks) can be thought of as a
competition between a generator, which is the generative component, and a
discriminator, so a GAN essentially pits a generative model against a
discriminative one.

Outliers

Outliers have more impact on generative models than on discriminative models.

Computational Cost

Discriminative models are computationally cheap as compared to generative


models.

Comparison Between Discriminative and Generative


Models

Let’s see some of the comparisons based on the following criteria between
Discriminative and Generative Models:

Based on Performance

Generative models need less data to train compared with discriminative
models, since generative models are more biased as they make stronger
assumptions (e.g., the assumption of conditional independence in Naïve Bayes).
Based on Missing Data

In general, if we have missing data in our dataset, then Generative models


can work with these missing data, while discriminative models can’t. This is
because, in generative models, we can still estimate the posterior by
marginalizing the unseen variables. However, discriminative models usually
require all the features X to be observed.

Based on the Accuracy Score

If the assumption of conditional independence is violated, generative
models tend to be less accurate than discriminative models.

Based on Applications

Discriminative models are called "discriminative" since they are useful for
discriminating the label Y, i.e., the target outcome, so they are used mainly for
classification problems. In contrast, generative models have more
applications besides classification, such as sampling, Bayes learning, MAP
inference, etc.

Conclusion

In conclusion, discriminative and generative models are two basic


approaches to machine learning that have been used to solve various tasks.
The discriminative approach focuses on learning the decision boundary
between classes, while generative models are used to model the underlying
data distribution. Understanding the difference between discriminative and
generative models helps us to make better decisions about which approach
to use for a particular task to build a more accurate machine-learning
solution.
Key Takeaways

• Discriminative models learn the decision boundary between classes, while


generative models aim to model the underlying data distribution.
• Discriminative models are often simpler and faster to train than generative
models but may not perform as well on tasks where the underlying data
distribution is complex or uncertain.
• Generative models can be used for a wider range of tasks, including image
and text generation, but may require more training data and computational
resources.

Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age,
product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) and
one or more independent (x) variables, hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables. Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.
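As a quick illustration of the model y = a0 + a1x + ε, the sketch below fits a simple linear regression to a small synthetic dataset; the data, noise level, and library choice are assumptions made only for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)            # independent variable
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)   # y = a0 + a1*x + random error

model = LinearRegression().fit(x.reshape(-1, 1), y)
print("Estimated intercept a0:  ", model.intercept_)   # should be close to 3.0
print("Estimated coefficient a1:", model.coef_[0])     # should be close to 2.0
print("Prediction at x = 4:     ", model.predict([[4.0]])[0])
```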

Types of Linear Regression


Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.

Linear Regression Line


A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable
increases on X-axis, then such a relationship is termed as a Positive linear
relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent variable
increases on the X-axis, then such a relationship is called a negative linear
relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, which
means the error between the predicted values and the actual values should be minimized.
The best fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find the best fit
line; to calculate this, we use a cost function.

Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for
the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σi (yi - (a1xi + a0))²

Where,

N = total number of observations
yi = actual value
(a1xi + a0) = predicted value.

Residuals: The distance between the actual value and the predicted value is called a
residual. If the observed points are far from the regression line, the residuals will be
large and so the cost function will be high. If the scatter points are close to the regression
line, the residuals will be small and hence the cost function will be low.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts with randomly selected coefficient values and then iteratively updates them to
reach the minimum of the cost function.

Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations.
The process of finding the best model out of various models is called optimization. It
can be achieved by below method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.


o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted values
and actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the formula:

R² = Explained variation / Total variation = 1 - (Sum of squared residuals / Total sum of squares)
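Both the MSE cost and the R-squared score can be computed directly from a set of actual and predicted values, as in the sketch below; the numbers are hypothetical and used only to show the arithmetic.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred   = np.array([2.8, 5.4, 7.0, 9.3, 10.5])

mse = np.mean((y_actual - y_pred) ** 2)              # MSE = (1/N) * sum((yi - yhat_i)^2)

ss_res = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot                      # R^2 = 1 - SS_res / SS_tot

print("MSE:", mse)
print("R-squared:", r_squared)
```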

Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are formal
checks to make while building a Linear Regression model, which ensure we get the best
possible result from the given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors
and the target variable, or, in other words, difficult to determine which predictor variable
is affecting the target variable and which is not. So, the model assumes either little or
no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients.
It can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, the error terms are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the
model. Autocorrelation usually occurs if there is a dependency between residual errors.

What Is the Least Squares Method?


The least squares method is a form of mathematical regression analysis
used to determine the line of best fit for a set of data, providing a visual
demonstration of the relationship between the data points. Each point of
data represents the relationship between a known independent variable
and an unknown dependent variable.

KEY TAKEAWAYS

• The least squares method is a statistical procedure to find the best fit
for a set of data points by minimizing the sum of the offsets or
residuals of points from the plotted curve.
• Least squares regression is used to predict the behaviour of
dependent variables.
• The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.

Least Square Method

The least squares method is the process of finding a regression line, or best-fitted
line described by an equation, for any data set. The method works by
minimizing the sum of the squares of the residuals of the points from the
curve or line, so that the trend of the outcomes is found quantitatively. This kind of
curve fitting appears throughout regression analysis, and fitting equations to
derive the curve in this way is the least squares method.

Let us look at a simple example, Ms. Dolma said in the class "Hey students
who spend more time on their assignments are getting better grades". A
student wants to estimate his grade for spending 2.3 hours on an
assignment. Through the magic of the least-squares method, it is possible to
determine the predictive model that will help him estimate the grades far
more accurately. This method is much simpler because it requires nothing
more than some data and maybe a calculator.

In this section, we’re going to explore least squares, understand what it


means, learn the general formula, steps to plot it on a graph, know what are
its limitations, and see what tricks we can use with least squares.
Least Square Method Definition
The least-squares method is a statistical method used to find the line of best
fit of the form of an equation such as y = mx + b to the given data. The curve
of the equation is called the regression line. Our main objective in this
method is to reduce the sum of the squares of errors as much as possible.
This is the reason this method is called the least-squares method. This
method is often used in data fitting, where the best fit is taken to be the one that
minimizes the sum of squared errors, i.e., the squared differences between the
observed values and the corresponding fitted values. The sum of squared errors helps
in finding the variation in the observed data.

The two basic categories of least-square problems are ordinary or linear


least squares and nonlinear least squares.

Limitations for Least Square Method


Even though the least-squares method is considered the best method to find
the line of best fit, it has a few limitations. They are:

• This method exhibits only the relationship between the two variables. All other causes and
effects are not taken into consideration.
• This method is unreliable when data is not evenly distributed.
• This method is very sensitive to outliers. In fact, this can skew the results of the least-squares
analysis.
Least Square Method Graph
In a least-squares fit, the fitted straight line shows the potential relationship
between the independent variable and the dependent variable. The ultimate
goal of this method is to reduce the difference between the observed
responses and the responses predicted by the regression line; smaller residuals
mean that the model fits better. What the method reduces is the residual of
each point from the line. Residuals can be measured vertically or
perpendicularly to the line: vertical residuals are mostly used in polynomial and
hyperplane problems, while perpendicular residuals are used more generally.

Important Notes

• The least-squares method is used to predict the behavior of the dependent variable
with respect to the independent variable.
• The sum of the squares of errors is closely related to the variance of the residuals.
• The main aim of the least-squares method is to minimize the sum of the squared
errors.
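For a line y = mx + b, the least-squares slope and intercept have a simple closed form. The sketch below applies it to a made-up "hours spent vs. grade" dataset in the spirit of the example above; the numbers are assumptions for illustration only.

```python
import numpy as np

# Illustrative data: hours spent on assignments vs. grade obtained
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([55, 61, 68, 74, 79, 86])

# Closed-form least-squares estimates for slope m and intercept b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print("Slope m:", m, " Intercept b:", b)
print("Predicted grade for 2.3 hours:", m * 2.3 + b)

# np.polyfit with degree 1 returns the same [slope, intercept]
print(np.polyfit(x, y, deg=1))
```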

Overfitting and Underfitting


Overfitting and underfitting are two crucial concepts in machine learning and are prevalent causes
of poor performance in a machine learning model. This section explores overfitting and
underfitting in machine learning and helps you understand how to avoid them.
What is Overfitting?
When a model performs very well for training data but has poor performance with test data (new
data), it is known as overfitting. In this case, the machine learning model learns the details and noise
in the training data such that it negatively affects the performance of the model on test data.
Overfitting can happen due to low bias and high variance.

Reasons for Overfitting


o Data used for training is not cleaned and contains noise (garbage values) in it
o The model has a high variance
o The size of the training dataset used is not enough
o The model is too complex

Ways to Tackle Overfitting

o Using K-fold cross-validation
o Using Regularization techniques such as Lasso and Ridge
o Training model with sufficient data
o Adopting ensembling techniques

What is Underfitting?
When a model has not learned the patterns in the training data well and is unable to generalize well
on the new data, it is known as underfitting. An underfit model has poor performance on the training
data and will result in unreliable predictions. Underfitting occurs due to high bias and low variance.

Reasons for Underfitting


o Data used for training is not cleaned and contains noise (garbage values) in it
o The model has a high bias
o The size of the training dataset used is not enough
o The model is too simple

Ways to Tackle Underfitting

o Increase the number of features in the dataset
o Increase model complexity
o Reduce noise in the data
o Increase the duration of training
Now that you have understood what overfitting and underfitting are, let's see what a good fit model is.

What Is a Good Fit in Machine Learning?


To find the good fit model, you need to look at the performance of a machine learning
model over time with the training data. As the algorithm learns over time, the error for
the model on the training data reduces, as well as the error on the test dataset. If you
train the model for too long, the model may learn the unnecessary details and the
noise in the training set and hence lead to overfitting. In order to achieve a good fit,
you need to stop training at a point where the error starts to increase.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than required, in the given dataset. Because of this,
the model starts capturing the noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model. The overfitted model
has low bias and high variance.

The chances of overfitting increase the more training we give our
model: the more we train our model, the higher the chance of producing an
overfitted model.
Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood by considering a linear
regression output whose fitted curve passes through every data point in the
scatter plot. Such a model may look efficient, but in reality it is not, because the
goal of the regression model is to find the best fit line; by chasing every point we have
not found a general best fit, so the model will generate prediction errors on new data.

How to avoid the Overfitting in Model


Both overfitting and underfitting cause the degraded performance of the machine
learning model. But the main cause is overfitting, so there are some ways by which we
can reduce the occurrence of overfitting in our model.

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training
data may be stopped at an early stage, due to which the model may not learn enough
from the training data. As a result, it may fail to find the best fit for the dominant trend
in the data.

In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting by considering the output of a linear
regression model fitted to clearly non-linear data: the fitted line is unable to capture
the pattern of the data points in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit
The term "goodness of fit" is taken from statistics, and the goal of a machine
learning model is to achieve a good fit. In statistical modelling, it describes how
closely the predicted values match the true values of the dataset.

The model with a good fit lies between the underfitted and overfitted models; ideally,
it makes predictions with zero error, but in practice this is difficult to achieve.
When we train our model for a while, the errors on the training data go down, and
the same initially happens with the test data. But if we train the model for too long,
its performance may decrease due to overfitting, as the model also learns the noise
present in the dataset and the errors on the test dataset start increasing. The point
just before the errors start rising is the good point, and we can stop training there to
achieve a good model.
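Underfitting, a good fit, and overfitting can be seen by varying model complexity and comparing training and test error. The hedged sketch below uses polynomial regression on synthetic data; the polynomial degrees, noise level, and dataset are assumptions chosen only to illustrate the pattern.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 120)   # noisy non-linear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for degree in (1, 4, 15):   # roughly: underfit, good fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    te_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={tr_err:.3f}  test MSE={te_err:.3f}")
```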

Cross-Validation
Cross-validation is a technique for validating the model efficiency by training it on the
subset of input data and testing on previously unseen subset of the input data. We
can also say that it is a technique to check how a statistical model generalizes to
an independent dataset.

In machine learning, there is always a need to test the stability of the model; we
cannot judge it based only on the training dataset. For this purpose, we reserve a
particular sample of the dataset which was not part of the training data. After that,
we test our model on that sample before deployment, and this complete process
comes under cross-validation. This is somewhat different from a plain train-test
split.

Hence the basic steps of cross-validations are:

o Reserve a subset of the dataset as a validation set.


o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well
with the validation set, perform the further step, else check for the issues.

Methods used for Cross-Validation


There are some common methods that are used for cross-validation. These methods
are given below:

1. Validation Set Approach


2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach
We divide our input dataset into a training set and test or validation set in the
validation set approach. Both the subsets are given 50% of the dataset.

But it has one big disadvantage: we are using only 50% of the dataset to train
our model, so the model may fail to capture important information in the dataset.
It also tends to give an underfitted model.

Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means that if there
are n data points in total in the original input dataset, then n-p data points will be used as
the training dataset and the p data points as the validation set. This complete process
is repeated for all possible combinations, and the average error is calculated to know the
effectiveness of the model.

There is a disadvantage of this technique: it can be computationally expensive for
large p.

Leave one out cross-validation


This method is similar to leave-p-out cross-validation, but instead of p, we leave
out only 1 data point. It means that in this approach, for each learning set, only
one data point is reserved, and the remaining dataset is used to train the model. This
process repeats for each data point. Hence for n samples, we get n different training
sets and n test sets. It has the following features:

o In this approach, the bias is minimum as all the data points are used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the model as we
iteratively check against one data point.

K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of
equal sizes. These samples are called folds. For each learning set, the prediction
function uses k-1 folds, and the remaining fold is used as the test set. This approach
is a very popular CV approach because it is easy to understand, and the output is less
biased than other methods.

The steps for k-fold cross-validation are:


o Split the input dataset into K groups
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the model
using the test set.

Let's take an example of 5-fold cross-validation. The dataset is grouped into 5
folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest
are used to train the model. In the 2nd iteration, the second fold is used to test the
model, and the rest are used to train the model. This process continues until each fold
has been used as the test fold.
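This 5-fold procedure maps directly onto scikit-learn's KFold and cross_val_score utilities, as in the hedged sketch below; the toy dataset and model are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is used exactly once as the test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:    ", scores.mean())
```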

Stratified k-fold cross-validation


This technique is similar to k-fold cross-validation with a few changes. The
approach works on the concept of stratification, which is the process of rearranging the
data to ensure that each fold or group is a good representative of the complete dataset.
It is one of the best approaches for dealing with bias and variance.

It can be understood with an example of housing prices, where the prices of some
houses can be much higher than those of other houses. To tackle such situations, a
stratified k-fold cross-validation technique is useful.

Holdout Method
This method is the simplest cross-validation technique of all. In this method, we
remove a subset of the data, train the model on the remaining part of the dataset, and
use the held-out subset to get prediction results.
The error that occurs in this process tells how well our model will perform on
unknown data. Although this approach is simple to perform, it still faces the issue
of high variance, and it sometimes produces misleading results.

Comparison of Cross-validation to train/test split in


Machine Learning
o Train/test split: The input data is divided into two parts, namely a training set and a test
set, in a ratio such as 70:30 or 80:20. It provides high variance, which is one of the biggest
disadvantages.
o Training Data: The training data is used to train the model, and the dependent
variable is known.
o Test Data: The test data is used to make predictions from the model that
is already trained on the training data. It has the same features as the training
data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of train/test split
by splitting the dataset into groups of train/test splits, and averaging the result. It can
be used if we want to optimize our model that has been trained on the training dataset
for the best performance. It is more efficient than a plain train/test split because every
observation is used for both training and testing.

Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:

o Under ideal conditions, it provides the optimum output. But for inconsistent data,
it may produce drastically different results. This is one of the big disadvantages of
cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modelling, the data evolves over a period of time, due to which there may
be differences between the training set and the validation sets. For example, if we create
a model for the prediction of stock market values and the model is trained on the previous
5 years of stock values, the realistic values for the next 5 years may be drastically
different, so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive
modeling methods.
o It has great scope in the medical research field.
o It can also be used for the meta-analysis, as it is already being used by the data
scientists in the field of medical statistics.

Regularization in Machine Learning


What is Regularization?
Regularization is one of the most important concepts of machine learning. It is a
technique to prevent the model from overfitting by adding extra information to it.

Sometimes the machine learning model performs well with the training data but does
not perform well with the test data. It means the model is not able to predict the output
for unseen data because it has learned the noise in the training data, and hence the model
is called overfitted. This problem can be addressed with the help of a regularization
technique.

This technique allows us to keep all the variables or features in the model while reducing
their magnitude. Hence, it maintains accuracy as well as the generalization of the model.

It mainly regularizes or reduces the coefficient of features toward zero. In simple words,
"In regularization technique, we reduce the magnitude of the features by keeping the
same number of features."

How does Regularization Work?


Regularization works by adding a penalty or complexity term to the complex model.
Let's consider the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted,

x1, x2, …, xn are the features for y,

β0, β1, …, βn are the weights or magnitudes attached to the features, where β0
represents the bias of the model and b represents the intercept.

Linear regression models try to optimize the coefficients, bias, and intercept to minimize
the cost function. The cost (loss) function for linear regression is called RSS, the Residual
Sum of Squares:

RSS = Σi (yi - ŷi)², where ŷi = β0 + β1xi1 + ⋯ + βnxin + b is the predicted value.

Regularization then adds a penalty term to this loss, and the parameters are optimized so
that the model can predict accurate values of y.

Lasso Regression
LASSO regression, also known as L1 regularization, is a popular technique used in
statistical modelling and machine learning to estimate the relationships between
variables and make predictions. LASSO stands for Least Absolute Shrinkage and
Selection Operator.

The primary goal of LASSO regression is to find a balance between model simplicity
and accuracy. It achieves this by adding a penalty term to the traditional linear
regression model, which encourages sparse solutions where some coefficients are
forced to be exactly zero. This feature makes LASSO particularly useful for feature
selection, as it can automatically identify and discard irrelevant or redundant variables.

Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of
bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The
amount of bias added to the model is called the Ridge Regression penalty. We calculate
it by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:

Cost = Σi (yi - ŷi)² + λ Σj βj²

o In the above equation, the penalty term regularizes the coefficients of the model, and
hence ridge regression reduces the amplitudes of the coefficients, which decreases the
complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation
becomes the cost function of the linear regression model. Hence, for a very small
value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between
the independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model. It stands for Least Absolute and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
o It is also called as L1 regularization. The equation for the cost function of Lasso
regression will be:

o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well
as the feature selection.
Key Difference between Ridge Regression and Lasso
Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes
all the features present in the model. It reduces the complexity of the model by
shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature
selection.

What is Lasso Regression?


Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data
values are shrunk towards a central point, like the mean. The lasso procedure encourages simple,
sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited
for models showing high levels of multicollinearity or when you want to automate certain parts of
model selection, like variable selection/parameter elimination.

The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.

L1 Regularization
Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of
the magnitude of coefficients. This type of regularization can result in sparse models with few
coefficients; some coefficients can become zero and be eliminated from the model. Larger penalties
result in coefficient values closer to zero, which is ideal for producing simpler models. On the
other hand, L2 regularization (e.g. Ridge regression) doesn’t result in elimination of coefficients or
sparse models. This makes the Lasso far easier to interpret than the Ridge.

Performing the Regression


Lasso solutions are quadratic programming problems, which are best solved with software
(like Matlab). The goal of the algorithm is to minimize:

Σi (yi - ŷi)² + λ Σj |βj|

which is the same as minimizing the sum of squares subject to the constraint Σ |βj| ≤ s
(Σ = summation notation). Some of the βs are shrunk to exactly zero, resulting in a
regression model that's easier to interpret.

A tuning parameter, λ controls the strength of the L1 penalty. λ is basically the amount of
shrinkage:

• When λ = 0, no parameters are eliminated. The estimate is equal to the one found with
linear regression.
• As λ increases, more and more coefficients are set to zero and eliminated (theoretically,
when λ = ∞, all coefficients are eliminated).
• As λ increases, bias increases.
• As λ decreases, variance increases.
If an intercept is included in the model, it is usually left unchanged.
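The effect of the L2 and L1 penalties can be seen by comparing fitted coefficients, as in the hedged sketch below; the synthetic dataset and the penalty strengths are illustrative assumptions (scikit-learn calls the tuning parameter λ alpha).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data in which only a few of the 10 features are truly informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty: can set some coefficients exactly to zero

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Features zeroed out by Lasso:", int(np.sum(lasso.coef_ == 0)))
```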

Classification
As we know, Supervised Machine Learning algorithms can be broadly classified into
Regression and Classification algorithms. Regression algorithms predict continuous
output values, but to predict categorical values, we need Classification algorithms.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In classification, a
program learns from the given dataset or observations and then classifies new
observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam
or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such
as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a
Supervised learning technique, hence it takes labelled input data, which means it
contains input with the corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):

y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.


The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the categorical
data.

Classification can be pictured with a diagram containing two classes, class A and class B,
where the points of each class have features that are similar to each other and dissimilar
to the other class.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it
is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it
is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner's case, classification is done on the basis of the most
closely related data stored in the training dataset. It takes less time in training but more time
for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes
more time in learning, and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.

Types of ML Classification Algorithms:


Classification algorithms can be further divided into two main categories:

o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance, whether it is a
classification or a regression model. For evaluating a classification model, we have
the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier, whose output is a probability


value between the 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

Where y = actual output, p = predicted probability.

2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix consists of predictions result in a summarized form, which has a total
number of correct predictions and incorrect predictions. The matrix looks like as below
table:

Actual Positive Actual Negative

Predicted Positive True Positive False Positive

Predicted Negative False Negative True Negative

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of a classification model across different thresholds (and,
with one-vs-rest extensions, of multi-class models), we use the AUC-ROC curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.

Use cases of Classification Algorithms


Classification algorithms can be used in different places. Below are some popular use
cases of Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identifications of Cancer tumour cells.
o Drugs Classification
Metrics to Evaluate Machine Learning Classification Algorithms

Now that we have an idea of the different types of classification models, it is crucial to
choose the right evaluation metrics for those models. In this section, we will cover the
most commonly used metrics: accuracy, precision, recall, F1 score, and area under
the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the
Curve).
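All of these metrics are available in scikit-learn; the sketch below computes them for a small set of hypothetical true labels, predicted labels, and predicted probabilities (the numbers are made up for illustration).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Hypothetical ground truth, predicted labels, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))   # uses probabilities, not hard labels
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```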
Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification. The
logistic function it uses is described below.
Note: Logistic regression uses the same predictive modelling machinery as regression,
which is why it is called logistic regression; however, since it is used to classify samples,
it falls under the classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" shape. The S-shaped curve is called the sigmoid
function or the logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic Regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get the Logistic Regression equation are given below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of
the equation, and it becomes:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.
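Equivalently, solving the final equation for y gives the sigmoid form y = 1 / (1 + e^-(b0 + b1x1 + … + bnxn)). The sketch below evaluates this sigmoid for some assumed coefficients (b0 and b1 are made up for illustration) to show how a linear score is mapped into a probability and then thresholded.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed coefficients, purely for illustration
b0, b1 = -4.0, 1.5

for x in [0.0, 2.0, 3.0, 5.0]:
    p = sigmoid(b0 + b1 * x)          # probability that y = 1 given x
    label = 1 if p >= 0.5 else 0      # apply the threshold value of 0.5
    print(f"x={x:4.1f}  P(y=1|x)={p:.3f}  predicted class={label}")
```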

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Advantages of the Logistic Regression Algorithm

• Logistic regression performs better when the data is linearly separable

• It does not require too many computational resources and is highly interpretable

• Input features do not strictly require scaling, and the model needs little hyperparameter tuning

• It is easy to implement and train a model using logistic regression

• It gives a measure of how relevant a predictor (coefficient size) is, and its direction of
association (positive or negative)
Linear Regression                                   Logistic Regression

Used to solve regression problems                   Used to solve classification problems

The response variable is continuous in nature       The response variable is categorical in nature

It helps estimate the dependent variable when       It helps to calculate the possibility of a
there is a change in the independent variable       particular event taking place

It is a straight line                               It is an S-curve (S = Sigmoid)

Applications of Logistic Regression


• Using the logistic regression algorithm, banks can predict whether a customer
would default on loans or not

• To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)

• Ecommerce companies can identify buyers if they are likely to purchase a certain
product

• Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance

• To classify objects based on their features and attributes

Gradient Descent in Linear Regression


In any machine learning project, our main concern is how good our model's accuracy is,
or how much our model's predictions differ from the actual data points. Based on the
difference between the model predictions and the actual data points, we try to find the
parameters of the model that give better accuracy on our dataset. In order to find these
parameters, we apply gradient descent on the cost function of the machine learning model.

What is Gradient Descent


Gradient Descent is an iterative optimization algorithm that tries to find the
optimum value (Minimum/Maximum) of an objective function. It is one of the
most used optimization techniques in machine learning projects for updating
the parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the parameters of a model that give the
highest accuracy on the training as well as the testing datasets. In gradient descent, the
gradient is a vector that points in the direction of the steepest increase of the function at
a specific point. Moving in the opposite direction of the gradient allows the algorithm to
gradually descend towards lower values of the function, eventually reaching the minimum
of the function.

Steps Required in Gradient Descent Algorithm

• Step 1: Initialize the parameters of the model randomly.

• Step 2: Compute the gradient of the cost function with respect to each
parameter. This involves taking the partial derivative of the cost function with
respect to each parameter.
• Step 3: Update the parameters of the model by taking steps in the
opposite direction of the gradient. Here we choose a hyperparameter called the learning
rate, denoted by alpha, which decides the step size of each update.
• Step 4: Repeat steps 2 and 3 iteratively to get the best parameters for the
defined model.

To apply gradient descent to data in any programming language, we can write four small
functions that together update the parameters and use them to make predictions. We will
look at each function one by one; a runnable sketch of all four follows the list below.
1. gradient_descent – In the gradient descent function we will make the
prediction on a dataset and compute the difference between the predicted
and actual target value and accordingly we will update the parameter and
hence it will return the updated parameter.
2. compute_predictions – In this function, we will compute the prediction
using the parameters at each iteration.
3. compute_gradient – In this function we will compute the error which is
the difference between the actual and predicted target value and then
compute the gradient using this error and training data.
4. update_parameters – In this separate function we will update the
parameter using learning rate and gradient that we got from the
compute_gradient function.
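
Here is one possible NumPy sketch of the four functions described above, applied to simple
linear regression with a mean-squared-error cost; the dictionary-based parameter layout, the
toy data, and the learning-rate and iteration values are illustrative assumptions.

import numpy as np

def compute_predictions(X, params):
    # prediction = X @ weights + bias
    return X @ params["w"] + params["b"]

def compute_gradient(X, y, predictions):
    # error between the predicted and actual target values,
    # then the gradient of the mean squared error cost
    error = predictions - y
    grad_w = (2.0 / len(y)) * (X.T @ error)
    grad_b = (2.0 / len(y)) * error.sum()
    return {"w": grad_w, "b": grad_b}

def update_parameters(params, grads, learning_rate):
    # step in the opposite direction of the gradient
    return {"w": params["w"] - learning_rate * grads["w"],
            "b": params["b"] - learning_rate * grads["b"]}

def gradient_descent(X, y, learning_rate=0.1, iterations=1000):
    params = {"w": np.zeros(X.shape[1]), "b": 0.0}   # initial parameter values
    for _ in range(iterations):
        preds = compute_predictions(X, params)
        grads = compute_gradient(X, y, preds)
        params = update_parameters(params, grads, learning_rate)
    return params

# toy dataset generated from y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.05, size=100)
print(gradient_descent(X, y))   # weight close to 3, bias close to 2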

Mathematics Behind Gradient Descent


In a machine learning regression problem, our model aims to find the best-fit regression
line to predict the value y for a given input value (x). While training, the model computes
a cost function, such as the mean squared error between the predicted value (pred) and
the true value (y). Our model aims to minimize this cost function.

To minimize this cost function, the model needs to have the best values of
θ₁ and θ₂ (for a univariate linear regression problem). Initially the model selects
θ₁ and θ₂ randomly and then iteratively updates these values in order to
minimize the cost function until it reaches the minimum. By the time the model
achieves the minimum cost function, it will have the best θ₁ and θ₂ values.
Using these updated values of θ₁ and θ₂ in the hypothesis equation of linear
regression, our model will predict the output value y.
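
For reference, a standard way to write this down, assuming the usual mean-squared-error
cost with hypothesis hθ(x) = θ₁ + θ₂x, is:

J(θ₁, θ₂) = (1 / 2m) * ∑ (hθ(xᵢ) - yᵢ)² ; i = 1 to m

and each gradient descent step updates both parameters simultaneously as:

θⱼ := θⱼ - α * ∂J(θ₁, θ₂) / ∂θⱼ ; j = 1, 2

where m is the number of training examples and α is the learning rate.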

Gradient Descent Algorithm for Linear Regression


How Does Gradient Descent Work

Gradient descent works by moving downward toward the pits or valleys in the
graph to find the minimum value. This is achieved by taking the derivative of
the cost function, as illustrated in the figure below. During each iteration,
gradient descent steps down the cost function in the direction of steepest
descent. By adjusting the parameters in this direction, it seeks to reach the
minimum of the cost function and find the best-fit values for the parameters.
The size of each step is determined by the parameter α, known as the learning
rate.

In the Gradient Descent algorithm, one can infer two points:

• If the slope is +ve: θⱼ = θⱼ - (+ve value). Hence the value of θⱼ decreases.

• If the slope is -ve: θⱼ = θⱼ - (-ve value). Hence the value of θⱼ increases.


How to Choose Learning Rate

The choice of a correct learning rate is very important, as it ensures that gradient
descent converges in a reasonable time:

If we choose α to be very large, gradient descent can overshoot the minimum. It may fail
to converge or even diverge.

If we choose α to be very small, gradient descent will take very small steps and will
therefore take a much longer time to reach the minimum.

Advantages of Gradient Descent


• Flexibility: Gradient descent can be used with various cost functions and
can handle non-linear regression problems.
• Scalability: Gradient descent scales to large datasets, especially in its
stochastic variant, which updates the parameters one training example at a time.
• Convergence: Gradient descent can converge to the global minimum of
the cost function, provided that the cost function is convex and the learning
rate is set appropriately.

Disadvantages of Gradient Descent


• Sensitivity to Learning Rate: The choice of learning rate can be critical
in Gradient Descent since using a high learning rate can cause the
algorithm to overshoot the minimum, while a low learning rate can make
the algorithm converge slowly.
• Slow Convergence: Gradient descent may require many iterations to
converge to the minimum, particularly when the learning rate is small.
• Local Minima: Gradient descent can get stuck in local minima if the cost
function has multiple local minima.
• Noisy updates: In the stochastic variant, the updates are noisy and have
high variance, which can make the optimization process less stable and
lead to oscillations around the minimum.
Overall, Gradient Descent is a useful optimization algorithm for linear
regression, but it has some limitations and requires careful tuning of the
learning rate to ensure convergence.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram, in which two different categories
are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs. If we want
a model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn the different features of cats and dogs, and then we
test it with this strange creature. Because SVM creates a decision boundary
between these two classes (cat and dog) and chooses the extreme cases (support vectors),
it will consider the extreme cases of cat and dog. On the basis of the support vectors, it
will classify it as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), then the hyperplane will be a straight line,
and if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image:

Since this is a 2-D space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
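
A minimal scikit-learn sketch of this linearly separable case follows; the two-blob synthetic
data, the C value, and the query point are illustrative assumptions.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated groups of (x1, x2) points standing in for the green/blue tags
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)        # the extreme points that fix the hyperplane
print(clf.coef_, clf.intercept_)   # w and b of the separating line w.x + b = 0
print(clf.predict([[0.0, 2.0]]))   # classify a new (x1, x2) point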

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third-dimension
z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we
convert it in 2d space with z=1, then it will become as:
Hence, we get a circumference of radius 1 in case of non-linear data.

How an SVM works


A simple linear SVM classifier works by making a straight line between
two classes. That means all of the data points on one side of the line
will represent a category and the data points on the other side of the
line will be put into a different category. This means there can be an
infinite number of lines to choose from.

What makes the linear SVM algorithm better than some of the other
algorithms, like k-nearest neighbors, is that it chooses the best line to
classify your data points. It chooses the line that separates the data
and is as far away from the closest data points as possible.

A 2-D example helps to make sense of all the machine learning jargon.
Basically, you have some data points on a grid. You're trying to
separate these data points by the category they should fit in, but you
don't want to have any data in the wrong category. That means you're
trying to find the line between the two closest points that keeps the
other data points separated.

So, the two closest data points give you the support vectors you'll use
to find that line. That line is called the decision boundary.

[Figure: linear SVM]
The decision boundary doesn't have to be a line. It's also referred to
as a hyperplane because you can find the decision boundary with any
number of features, not just two.

[Figure: non-linear SVM using an RBF kernel]

Types of SVMs

There are two different types of SVMs, each used for different things:

• Simple SVM: Typically used for linear regression and classification problems.
• Kernel SVM: Has more flexibility for non-linear data because you can add more
features to fit a hyperplane instead of a two-dimensional space.

Why SVMs are used in machine learning

SVMs are used in applications like handwriting recognition, intrusion detection, face detection,
email classification, gene classification, and in web pages. This is one of the reasons we use
SVMs in machine learning. It can handle both classification and regression on linear and non-
linear data.

Another reason we use SVMs is because they can find complex relationships between your
data without you needing to do a lot of transformations on your own. It's a great option when
you are working with smaller datasets that have tens to hundreds of thousands of features. They
typically find more accurate results when compared to other algorithms because of their ability
to handle small, complex datasets.

Here are some of the pros and cons for using SVMs.

Pros

• Effective on datasets with multiple features, like financial or medical data.


• Effective in cases where number of features is greater than the number of data points.
• Uses a subset of training points in the decision function called support vectors which
makes it memory efficient.
• Different kernel functions can be specified for the decision function. You can use
common kernels, but it's also possible to specify custom kernels.

Cons

• If the number of features is a lot bigger than the number of data points, avoiding over-
fitting when choosing kernel functions and regularization term is crucial.
• SVMs don't directly provide probability estimates. Those are calculated using an
expensive five-fold cross-validation.
• Works best on small sample sets because of its high training time.

Kernel Methods
Kernels or kernel methods (also called Kernel functions) are sets of different types of algorithms that
are being used for pattern analysis. They are used to solve a non-linear problem by using a linear
classifier. Kernels Methods are employed in SVM (Support Vector Machines) which are used in
classification and regression problems. The SVM uses what is called a “Kernel Trick” where the data is
transformed and an optimal boundary is found for the possible outputs.

Types of Kernel and methods in SVM


1. Linear Kernel
2. Polynomial Kernel
3. Gaussian Kernel
4. Exponential Kernel
5. Laplacian Kernel
6. Hyperbolic or the Sigmoid Kernel
7. Anova radial basis kernel

What are kernels?


Kernels, also known as kernel techniques or kernel functions, are a collection of distinct forms of
pattern analysis algorithms that solve a non-linear problem using a linear classifier. SVM
(Support Vector Machines) uses kernel methods in ML to solve classification and regression issues.
The SVM employs the "Kernel Trick", where the data is transformed and an optimal
boundary for the various outputs is determined.

In other words, a kernel is a term used to describe applying linear classifiers to non-linear problems
by mapping non-linear data onto a higher-dimensional space without having to visit or understand
that higher-dimensional region.
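
The sketch below illustrates the kernel trick in practice with scikit-learn's SVC: the same
interface is fitted with several kernel functions on a synthetic circular dataset
(make_circles); the dataset, the train/test split, and the kernel list are illustrative
assumptions.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two classes separated by a circle, so no straight line can split them
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    # the RBF (Gaussian) kernel should clearly beat the linear kernel here
    print(kernel, round(clf.score(X_test, y_test), 3))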

Kernel methods in machine learning


These are some of the many techniques of the kernel:

• Support Vector Machine (SVM)


• Adaptive Filter
• Kernel Perceptron
• Principal Component Analysis
• Spectral Clustering

Instance-Based Methods
• Instance-based learning is a family of learning algorithms that, instead
of performing explicit generalization, compares new problem instances
with instances seen in training, which have been stored in memory.
• They are sometimes referred to as lazy learning methods because they
delay processing until a new instance must be classified. The nearest
neighbours of an instance are defined in terms of Euclidean distance.
• No model is learned
• The stored training instances themselves represent the knowledge
• Training instances are searched for instance that most closely resembles
new instance
Instance-based learning representation

Instance-based learning: It generates classification predictions using only


specific instances. Instance-based learning algorithms do not maintain a set of
abstractions derived from specific instances. This approach extends the nearest
neighbor algorithm, which has large storage requirements.

Performance dimensions of instance-based learning algorithms

The time complexity of instance-based learning algorithms depends on the size of the
training data. The time complexity in the worst case is O(n), where n is the number
of training items used to classify a single new instance.

Functions of instance-based learning


Instance-based learning refers to a family of techniques for classification
and regression, which produce a class label/prediction based on the similarity
of the query to its nearest neighbor(s) in the training set.
Functions are as follows:

1. Similarity: Similarity is a machine learning method that uses a nearest


neighbour approach to identify the similarity of two or more objects to
each other based on algorithmic distance functions.
2. Classification: The process of categorizing a given set of data into classes.
It can be performed on both structured and unstructured data. The
process starts with predicting the class of given data points. The
classes are often referred to as targets, labels or categories.
3. Concept Description: Much of human learning involves acquiring
general concepts from past experiences. This description can then be
used to predict the class labels of unlabelled cases.

Some of the instance-based learning algorithms are:


1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning

Advantages of instance-based learning:

• It has the ability to adapt to previously unseen data, which means that
one can store a new instance or drop the old instance.
Disadvantages of instance-based learning:

• Classification costs are high.


• Large amount of memory required to store the data, and each query
involves starting the identification of a local model from scratch.

K-Nearest Neighbor (KNN)


o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
o The K-NN algorithm stores all the available data and classifies a new data point based
on similarity. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset, and at the time of
classification it performs an action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new
data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a
dog, but we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will
find the features of the new image that are similar to the cat and dog images and, based
on the most similar features, it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data
point x1. In which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as d = √((x₂ - x₁)² + (y₂ - y₁)²).
o By calculating the Euclidean distances, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B. Consider
the below image:

o As we can see, the three nearest neighbors are from category A; hence this new data point
must belong to category A. A small code sketch of this procedure is given below.
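
Here is a minimal scikit-learn sketch of these steps, with k = 5 and Euclidean distance; the
two-category blob data and the new query point are synthetic stand-ins for the example above.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# two groups of points standing in for Category A and Category B
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)                      # "training" just stores the dataset

new_point = [[1.0, 4.0]]
print(knn.kneighbors(new_point))   # distances and indices of the 5 nearest neighbors
print(knn.predict(new_point))      # the majority category among those neighbors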

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and makes the model sensitive
to outliers.
o Large values for K smooth out noise, but they can blur the class boundaries and increase
the computation required.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.

Tree-Based Models

What are Tree-Based Models?

Tree-based models use a decision tree to represent how different input variables can be
used to predict a target value. Machine learning uses tree-based models for both
classification and regression problems, such as the type of animal or value of a home. The
input variables are repeatedly segmented into subsets to build the decision tree, and each
branch is tested for prediction accuracy and evaluated for efficiency and effectiveness.
Splitting the variables in a different order can reduce the number of layers and
calculations required to produce an accurate prediction. Generating a successful decision
tree results in the most important variables (most influential on the prediction) being at
the top of the tree hierarchy, while irrelevant features get dropped from the hierarchy.
Tree-based models use a series of if-then rules to generate predictions from one
or more decision trees. All tree-based models can be used for either regression
(predicting numerical values) or classification (predicting categorical values). For
example

1. Decision tree models, which are the foundation of all tree-based models.
2. Random forest models, an “ensemble” method which builds many
decision trees in parallel.
3. Gradient boosting models, an “ensemble” method which builds many
decision trees sequentially.
Decision Trees

First, let’s start with a simple decision tree model. A decision tree model can be used to
visually represent the “decisions”, or if-then rules, that are used to generate predictions.
Here is an example of a very basic decision tree model:
We’ll go through each yes or no question, or decision node, in the tree and will move
down the tree accordingly, until we reach our final predictions. Our first question, which
is referred to as our root node, is whether George is above 40 and, since he is, we will
then proceed onto the “Has Kids” node. Because the answer is yes, we’ll predict that he
will be a high spender at Willy Wonka Candy this week.

One other note to add — here, we’re trying to predict whether George will be a high
spender, so this is an example of a classification tree, but we could easily convert this into
a regression tree by predicting George’s actual dollar spend. The process would remain
the same, but the final nodes would be numerical predictions rather than categorical
ones.

How Do We Actually Create These Decision Tree Models?

Glad you asked. There are essentially two key components to building a decision tree
model: determining which features to split on and then deciding when to stop splitting.

When determining which features to split on, the goal is to select the feature that will
produce the most homogenous resulting datasets. The simplest and most commonly used
method of doing this is by minimizing entropy, a measure of the randomness within a
dataset, and maximizing information gain, the reduction in entropy that results from
splitting on a given feature.

We’ll split on the feature that results in the highest information gain, and then recompute
entropy and information gain for the resulting output datasets. In the Willy Wonka
example, we may have first split on age because the greater than 40 and less than (or
equal to) 40 datasets were each relatively homogenous. Homogeneity in this sense refers
to the diversity of classes, so one dataset was filled with primarily low spenders and the
other with primarily high spenders.

You may be wondering how we decided to use a threshold of 40 for age. That’s a good
question! For numerical features, we first sort the feature values in ascending order, and
then test each value as the threshold point and calculate the information gain of that split.

The value with the highest information gain — in this case, age 40 — will then be
compared with other potential splits, and whichever has the highest information gain will
be used at that node. A tree can split on any numerical feature multiple times at different
value thresholds, which enables decision tree models to handle non-linear relationships
quite well.

The second decision we need to make is when to stop splitting the tree. We can split until
each final node has very few data points, but that will likely result in overfitting, or
building a model that is too specific to the dataset it was trained on. This is problematic
because, while it may make good predictions for that one dataset, it may not generalize
well to new data, which is really our larger goal.

To combat this, we can remove sections that have little predictive power, a technique
referred to as pruning. Some of the most common pruning methods include setting a
maximum tree depth or minimum number of samples per leaf, or final node.
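
A brief scikit-learn sketch of building and pruning a classification tree follows; the Iris
dataset and the specific max_depth and min_samples_leaf values are illustrative choices,
not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",     # split on information gain
    max_depth=3,             # pruning: cap the depth of the tree
    min_samples_leaf=5,      # pruning: minimum samples per leaf (final node)
    random_state=0,
)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on data the tree has not seen
print(export_text(tree))            # the learned if-then rules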

Here’s a high-level recap of decision tree models:

Advantages:

• Straightforward interpretation
• Good at handling complex, non-linear relationships

Disadvantages:

• Predictions tend to be weak, as singular decision tree models are prone to


overfitting
• Unstable, as a slight change in the input dataset can greatly impact the final results

What are Decision Trees?
In simple words, a decision tree is a structure that contains nodes
(rectangular boxes) and edges (arrows) and is built from a dataset
(a table whose columns represent features/attributes and whose rows
correspond to records). Each node is either used to make a
decision (known as a decision node) or to represent an
outcome (known as a leaf node).

Decision tree Example

The picture above depicts a decision tree that is used to classify


whether a person is Fit or Unfit.
The decision nodes here are questions like 'Is the person less than
30 years of age?', 'Does the person eat junk food?', etc., and the leaves are
one of the two possible outcomes, viz. Fit and Unfit.
Looking at the decision tree, we can make the following
decisions:
if a person is less than 30 years of age and doesn't eat junk food then
he is Fit; if a person is less than 30 years of age and eats junk food
then he is Unfit; and so on.
The initial node is called the root node (colored in blue), the final
nodes are called the leaf nodes (colored in green) and the rest of
the nodes are called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the
leaf nodes represent the outcomes.

ID3 in brief
ID3 stands for Iterative Dichotomiser 3 and is named such because
the algorithm iteratively (repeatedly) dichotomizes(divides) features
into two or more groups at each step.

Invented by Ross Quinlan, ID3 uses a top-down greedy approach


to build a decision tree. In simple words, the top-down approach
means that we start building the tree from the top and
the greedy approach means that at each iteration we select the best
feature at the present moment to create a node.

Generally, ID3 is used only for classification problems
with nominal (categorical) features.

Dataset description
In this article, we’ll be using a sample dataset of COVID-19 infection.
A preview of the entire dataset is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1 | NO | NO | NO | NO |
+----+-------+-------+------------------+----------+
| 2 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 3 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 4 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 5 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 6 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 7 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 8 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 9 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 10 | YES | YES | NO | YES |
+----+-------+-------+------------------+----------+
| 11 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 12 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 13 | NO | YES | YES | NO |
+----+-------+-------+------------------+----------+
| 14 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+

The columns are self-explanatory. YES and NO stand for yes and no
respectively. In the Infected column, YES and NO
represent Infected and Not Infected respectively.

The columns used to make decision nodes viz. ‘Breathing Issues’,


‘Cough’ and ‘Fever’ are called feature columns or just features and
the column used for leaf nodes i.e. ‘Infected’ is called the target
column.

Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature
at each step while building a Decision tree.
Before you ask, the answer to the question: ‘How does ID3 select the
best feature?’ is that ID3 uses Information Gain or just Gain to
find the best feature.
Information Gain calculates the reduction in the entropy and
measures how well a given feature separates or classifies the target
classes. The feature with the highest Information Gain is
selected as the best one.

In simple words, Entropy is the measure of disorder and the


Entropy of a dataset is the measure of disorder in the target feature
of the dataset.
In the case of binary classification (where the target column has only
two types of classes), entropy is 0 if all values in the target column
are homogenous (similar) and is 1 if the target column has an equal
number of values for both classes.

We denote our dataset as S, entropy is calculated as:


Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n

where,
n is the total number of classes in the target column (in our case n =
2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows
with class i in the target column” to the “total number of rows” in
the dataset.

Information Gain for a feature column A is calculated as:


IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))

where Sᵥ is the set of rows in S for which the feature column A has
value v, |Sᵥ| is the number of rows in Sᵥ and likewise |S| is the
number of rows in S.
ID3 Steps
1. Calculate the Information Gain of each feature.

2. Considering that all rows don’t belong to the same class,


split the dataset S into subsets using the feature for
which the Information Gain is maximum.

3. Make a decision tree node using the feature with the


maximum Information gain.

4. If all rows belong to the same class, make the current


node as a leaf node with the class as its label.

5. Repeat for the remaining features until we run out of all


features, or the decision tree has all leaf nodes.

Implementation on our Dataset


As stated in the previous section the first step is to find the best
feature i.e. the one that has the maximum Information Gain(IG).
We’ll calculate the IG for each of the features now, but for that, we
first need to calculate the entropy of S

From the total of 14 rows in our dataset S, there are 8 rows with the
target value YES and 6 rows with the target value NO. The entropy
of S is calculated as:
Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99

Note: If all the values in our target column are same the
entropy will be zero (meaning that it has no or zero
randomness).
We now calculate the Information Gain for each feature:

IG calculation for Fever:


In this(Fever) feature there are 8 rows having value YES and 6 rows
having value NO.
As shown below, in the 8 rows with YES for Fever, there are 6 rows
having target value YES and 2 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | NO |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | NO |
+-------+-------+------------------+----------+

As shown below, in the 6 rows with NO, there are 2 rows having
target value YES and 4 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | NO | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+

The block, below, demonstrates the calculation of Information Gain


for Fever.
# total rows
|S| = 14

For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81

For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91

# Expanding the summation in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)

∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13
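
The short Python sketch below reproduces these numbers directly from the 14-row table,
using the Entropy(S) and IG(S, A) formulas given earlier; the column lists are just the
dataset transcribed by hand.

from math import log2

# the 14 rows of the dataset, column by column
fever    = ["NO","YES","YES","YES","YES","NO","YES","YES","NO","YES","NO","NO","NO","YES"]
cough    = ["NO","YES","YES","NO","YES","YES","NO","NO","YES","YES","YES","YES","YES","YES"]
breath   = ["NO","YES","NO","YES","YES","NO","YES","YES","YES","NO","NO","YES","YES","NO"]
infected = ["NO","YES","NO","YES","YES","NO","YES","YES","YES","YES","NO","YES","NO","NO"]

def entropy(labels):
    # Entropy(S) = - sum(p_i * log2(p_i)) over the classes in the target column
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(feature, target):
    # IG(S, A) = Entropy(S) - sum((|S_v| / |S|) * Entropy(S_v)) over the values v of A
    ig = entropy(target)
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        ig -= (len(subset) / len(target)) * entropy(subset)
    return ig

print(f"Entropy(S)             = {entropy(infected):.2f}")                   # 0.99
print(f"IG(S, Fever)           = {information_gain(fever, infected):.2f}")   # 0.13
print(f"IG(S, Cough)           = {information_gain(cough, infected):.2f}")   # 0.04
print(f"IG(S, BreathingIssues) = {information_gain(breath, infected):.2f}")  # 0.40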

Next, we calculate the IG for the


features “Cough” and “Breathing issues”.
IG(S, Cough) = 0.04
IG(S, BreathingIssues) = 0.40

Since the feature Breathing Issues has the highest Information
Gain, it is used to create the root node.
Hence, after this initial step our tree looks like this:

Next, from the remaining two unused features,


namely, Fever and Cough, we decide which one is the best for the
left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will
work with the subset of the original data i.e the set of rows
having YES as the value in the Breathing Issues column. These 8
rows are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+

Next, we calculate the IG for the features Fever and Cough using the
subset Sʙʏ (Set Breathing Issues Yes) which is shown above :

Note: For IG calculation the Entropy will be calculated from


the subset Sʙʏ and not the original dataset S.
IG(Sʙʏ, Fever) = 0.20
IG(Sʙʏ, Cough) = 0.09

IG of Fever is greater than that of Cough, so we select Fever as the


left branch of Breathing Issues:
Our tree now looks like this:
Next, we find the feature with the maximum IG for the right branch
of Breathing Issues. But, since there is only one unused feature
left we have no other choice but to make it the right branch of the
root node.
So our tree now looks like this:

There are no more unused features, so we stop here and jump to the
final step of creating the leaf nodes.
For the left leaf node of Fever, we see the subset of rows from the
original data set that has Breathing Issues and Fever both values
as YES.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+

Since all the values in the target column are YES, we label the left
leaf node as YES, but to make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from
the original data set that have Breathing Issues value
as YES and Fever as NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+

Here not all but most of the values are YES (two of the three rows), hence YES or
Infected becomes our right leaf node.
Our tree, now, looks like this:

We repeat the same process for the node Cough, however here both
left and right leaves turn out to be the same i.e. NO or Not
Infected as shown below:
Looks strange, doesn't it?
I know! The right node of Breathing Issues is as good as just a leaf
node with class 'Not infected'. This is one of the drawbacks of ID3: it
does not perform pruning.

Pruning is a mechanism that reduces the size and complexity of a


Decision tree by removing unnecessary nodes. More about pruning
can be found here.

Another drawback of ID3 is overfitting, or high variance: it learns

the training dataset so well that it fails to generalize to new data.

CART (Classification And Regression Trees) is a variation of the

decision tree algorithm. It can handle both classification and
regression tasks. Scikit-Learn uses the Classification And Regression Tree
(CART) algorithm to train Decision Trees (also called "growing" trees). CART
was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and
Charles Stone in 1984.
CART Algorithm
CART is a predictive algorithm used in Machine learning and it explains how
the target variable’s values can be predicted based on other matters. It is a
decision tree where each fork is split into a predictor variable and each node
has a prediction for the target variable at the end.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold
value of an attribute. The root node is taken as the training set and is split
into two by considering the best attribute and threshold value. Further, the
subsets are also split using the same logic. This continues until the last pure
subset is found or the maximum number of possible leaves in the growing tree
is reached.
The CART algorithm works via the following process:
• The best split point of each input is obtained.
• Based on the best split points of each input in Step 1, the new
“best” split point is identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further
desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It
does that by searching for the best homogeneity of the sub-nodes, with the
help of the Gini index criterion.

Gini index/Gini impurity


The Gini index is a metric for classification tasks in CART. It is based on the
sum of squared probabilities of each class. It measures the probability of a
specific variable being wrongly classified when chosen randomly, and it is a
variation of the Gini coefficient. It works on categorical variables, provides
outcomes of either "success" or "failure", and hence conducts binary splitting only.
The degree of the Gini index varies from 0 to 1:
• A value of 0 means that all the elements belong to a single class,
or only one class exists there (a pure node).
• Values close to 1 mean that the elements are randomly distributed
across many classes.
• A value of 0.5 means the elements are evenly distributed between
two classes.
Mathematically, we can write Gini impurity as:

Gini(S) = 1 - ∑ pᵢ² ; i = 1 to n

where pᵢ is the probability of an object being classified to a particular class and n
is the number of classes.
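
A tiny Python sketch of this formula follows; the example label lists are made up to show
the pure, evenly mixed two-class, and many-class cases.

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2) over the classes present in the node
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["A", "A", "A", "A"]))   # 0.0  -> pure node, one class only
print(gini(["A", "A", "B", "B"]))   # 0.5  -> two classes, evenly mixed
print(gini(["A", "B", "C", "D"]))   # 0.75 -> four classes, maximally mixed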

Classification tree
A classification tree is an algorithm where the target variable is categorical.
The algorithm is then used to identify the “Class” within which the target
variable is most likely to fall. Classification trees are used when the dataset
needs to be split into classes that belong to the response variable(like yes or
no)

Regression tree
A Regression tree is an algorithm where the target variable is continuous and
the tree is used to predict its value. Regression trees are used when the
response variable is continuous. For example, if the response variable is the
temperature of the day.
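
As an illustration of a regression tree, here is a minimal scikit-learn sketch; the noisy
sine-curve data stands in for a continuous response such as daily temperature, and the
max_depth value is an arbitrary choice.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# a continuous response generated from a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5], [7.5]]))   # piecewise-constant numeric predictions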
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + .... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, .... , 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i+1) = -1
                Node(2i+2) = -1
            end if
        end do
    end if
    d = d + 1
end while
CART model representation
CART models are formed by picking input variables and evaluating split
points on those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
• Greedy algorithm: In this step, the input space is divided using
the greedy method, which is known as recursive binary
splitting. This is a numerical method in which all of the values
are lined up and several split points are tried and assessed
using a cost function.
• Stopping Criterion: As it works its way down the tree with the
training data, the recursive binary splitting method described above
must know when to stop splitting. The most frequent halting
method is to utilize a minimum amount of training data allocated to
every leaf node. If the count is smaller than the specified threshold,
the split is rejected and also the node is considered the last leaf
node.
• Tree pruning: A decision tree's complexity is defined as the number
of splits in the tree. Trees with fewer branches are recommended, as
they are simple to grasp and less prone to overfit the data. Working
through each leaf node in the tree and evaluating the effect of
deleting it using a hold-out test set is the quickest and simplest
pruning approach.
• Data preparation for the CART: No special data preparation is
required for the CART algorithm.
Advantages of CART
• Results are simple to understand and interpret.
• Classification and regression trees are non-parametric and
non-linear.
• Classification and regression trees implicitly perform feature
selection.
• Outliers have no meaningful effect on CART.
• It requires minimal supervision and produces easy-to-understand
models.
Limitations of CART
• Overfitting.
• High variance (and correspondingly low bias).
• The tree structure may be unstable: a small change in the data can
change the tree considerably.
Applications of the CART algorithm
• For quick Data insights.
• In Blood Donors Classification.
• For environmental and ecological data.
• In the financial sectors.
Ensemble Methods

While pruning is a good method of improving the predictive performance of a decision


tree model, a single decision tree model will not generally produce strong predictions
alone. To improve our model’s predictive power, we can build many trees and combine
the predictions, which is called ensembling. Ensembling actually refers to any
combination of models, but is most frequently used to refer to tree-based models.

The idea is for many weak guesses to come together to generate one strong guess. You
can think of ensembling as asking the audience on “Who Wants to Be a Millionaire?” If the
question is really hard, the contestant might prefer to aggregate many guesses, rather
than go with their own guess alone.

To get deeper into that metaphor, one decision tree model would be the contestant. One
individual tree might not be a great predictor, but if we build many trees and combine all
predictions, we get a pretty good model! Two of the most popular ensemble algorithms
are random forest and gradient boosting, which are quite powerful and commonly used
for advanced machine learning applications.

Bagging and Random Forest Models

Before we discuss the random forest model, let’s take a quick step back and discuss its
foundation, bootstrap aggregating, or bagging. Bagging is a technique of building many
decision tree models at a time by randomly sampling with replacement, or bootstrapping,
from the original dataset. This ensures variety in the trees, which helps to reduce the
amount of overfitting.
Random forest models take this concept one step further. On top of building many
trees from sampled datasets, each node is only allowed to split on a random selection of
the model’s features.

For example, imagine that each node can split from a different, random selection of three
features from our feature set. Looking at the above, you may notice that the two trees
start with different features — the first starts with age and the second starts with dollars
spent. That’s because even though age may be the most significant feature in the dataset,
it wasn’t selected in the group of three features for the second tree, so that model had to
use the next most significant feature, dollars spent, to start.

Each subsequent node will also split on a random selection of three features. Let’s say
that the next group of features in the “less than $1 spent last week” dataset included age,
and this time, the age 30 threshold resulted in the highest information gain among all
features, age greater or less than 30 would be the next split.

We’ll build our two trees separately and get the majority vote. Note that if it were a
regression problem, we would get the average.
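
A short scikit-learn sketch of this bagging-plus-random-features idea follows; the built-in
breast cancer dataset and the hyperparameter values are illustrative stand-ins for the
Willy Wonka example.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees built from bootstrapped samples
    max_features="sqrt",   # each split sees only a random subset of features
    n_jobs=-1,             # trees are independent, so they can be built in parallel
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # majority-vote accuracy on held-out data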

Here’s a high-level recap of random forests:

Advantages:

• Good at handling complex, non-linear relationships


• Handle datasets with high dimensionality (many features) well
• Handle missing data well
• They are powerful and accurate
• They can be trained quickly. Since trees do not rely on one another, they can be
trained in parallel.

Disadvantages:

• Less accurate for regression problems as they tend to overfit

Boosting and Gradient Boosting

Boosting is an ensemble tree method that builds consecutive small trees — often only
one node — with each tree focused on correcting the net error from the previous tree. So,
we’ll split our first tree on the most predictive feature and then we’ll update weights to
ensure that the subsequent tree splits on whichever feature allows it to correctly classify
the data points that were misclassified in the initial tree. The next tree will then focus on
correctly classifying errors from that tree, and so on. The final prediction is a weighted
sum of all individual predictions.

Gradient boosting is the most popular extension of boosting, and uses the gradient
descent algorithm for optimization.
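
A minimal scikit-learn sketch of sequential boosting follows; the dataset and the
n_estimators, learning_rate, and max_depth values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,    # small trees added one after another
    learning_rate=0.05,  # how strongly each new tree corrects the previous errors
    max_depth=2,         # each individual tree stays a weak learner
    random_state=0,
)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))   # weighted-sum prediction accuracy on held-out data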

Here’s a high-level recap of gradient boosting:

Advantages:

• They are powerful and accurate, in many cases even more so than random forest
• Good at handling complex, non-linear relationships
• They are good at dealing with imbalanced data

Disadvantages:

• Slower to train, since trees must be built sequentially


• Prone to overfitting if the data is noisy
• Harder to tune hyperparameters

Why are Tree-Based Models Important?

Tree-based models are a popular approach in machine learning because of a number of


benefits. Decision trees are easy to understand and interpret, and outcomes can be easily
explained. They accommodate both categorical and numerical data and can be used for
both classification and regression models. Computationally, they perform well even for
large data sets and require less data preparation than other techniques. Tree-based
models are very popular in machine learning. The decision tree model, the foundation of
tree-based models, is quite straightforward to interpret, but generally a weak predictor.
Ensemble models can be used to generate stronger predictions from many trees, with
random forest and gradient boosting as two of the most popular. All tree-based models
can be used for regression or classification and can handle non-linear relationships quite
well.

Evaluation of Classification Algorithms

What is Classification in Machine Learning?


Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data. In classification, the model is fully
trained using the training data, and then it is evaluated on test data before being
used to perform prediction on new unseen data.

For instance, an algorithm can learn to predict whether a given email is spam or ham
(not spam).

Metrics to Evaluate Machine Learning Classification Algorithms

Now that we have an idea of the different types of classification models, it is crucial
to choose the right evaluation metrics for those models. In this section, we will cover
the most commonly used metrics: accuracy, precision, recall, F1 score, and the area
under the ROC (Receiver Operating Characteristic) curve, known as AUC (Area Under
the Curve).
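
A small scikit-learn sketch of these metrics on a toy spam/ham problem follows; the true
labels, predicted labels, and predicted probabilities are made up purely for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]    # 1 = spam, 0 = ham
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]    # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))   # area under the ROC curve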
Unit – 3. Unsupervised Learning and Reinforcement
Learning

Introduction
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled and allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training labels will
be given to the machine. Therefore the machine must find the
hidden structure in unlabeled data by itself.
For instance, suppose it is given an image having both dogs and cats which
it has never seen.

Thus the machine has no idea about the features of dogs and cats, so we can't
categorize the image as "dogs and cats". But the machine can categorize them according
to their similarities, patterns, and differences; i.e., we can easily split the above
picture into two parts. The first part may contain all the pictures having dogs in them and
the second part may contain all the pictures having cats in them. Here the machine didn't
learn anything beforehand, which means there was no training data or examples.

It allows the model to work on its own to discover patterns and information
that was previously undetected. It mainly deals with unlabeled data.

Unsupervised learning is classified into two categories of algorithms:


• Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
• Association: An association rule learning problem is where you want
to discover rules that describe large portions of your data, such as
people that buy X also tend to buy Y.

Types of Unsupervised Learning: -


Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types: -
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis

Advantages of unsupervised learning:

• It does not require training data to be labeled.


• Dimensionality reduction can be easily accomplished using
unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Flexibility: Unsupervised learning is flexible in that it can be applied to
a wide variety of problems, including clustering, anomaly detection,
and association rule mining.
• Exploration: Unsupervised learning allows for the exploration of data
and the discovery of novel and potentially useful patterns that may not
be apparent from the outset.
• Low cost: Unsupervised learning is often less expensive than
supervised learning because it doesn’t require labeled data, which can
be time-consuming and costly to obtain.
Disadvantages of unsupervised learning:

• Difficult to measure accuracy or effectiveness due to lack of predefined


answers during training.
• The results often have lesser accuracy.
• The user needs to spend time interpreting and labeling the classes that
result from the clustering.
• Lack of guidance: Unsupervised learning lacks the guidance and
feedback provided by labeled data, which can make it difficult to know
whether the discovered patterns are relevant or useful.
• Sensitivity to data quality: Unsupervised learning can be sensitive to
data quality, including missing values, outliers, and noisy data.
• Scalability: Unsupervised learning can be computationally expensive,
particularly for large datasets or complex algorithms, which can limit
its scalability.

Clustering Algorithms

Introduction to Clustering: It is basically a type of unsupervised learning


method. An unsupervised learning method is a method in which we draw
references from datasets consisting of input data without labeled responses.
Generally, it is used as a process to find meaningful structure, explanatory
underlying processes, generative features, and groupings inherent in a set of
examples.
Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same groups are more similar to other
data points in the same group and dissimilar to the data points in other
groups. It is basically a collection of objects on the basis of similarity and
dissimilarity between them.
For example, the data points in the graph below that are clustered together can be
classified into one single group. We can distinguish the clusters, and we can
identify that there are 3 clusters in the below picture.

It is not necessary for clusters to be spherical as depicted below:

DBSCAN: Density-based Spatial Clustering of Applications with Noise

In DBSCAN, data points are clustered on the basis of density: a point joins a
cluster if enough neighbouring points lie within a given distance of it, and
points that do not satisfy this constraint are treated as noise or outliers.
Why Clustering?
Clustering is very much important as it determines the intrinsic grouping
among the unlabeled data present. There are no criteria for good clustering.
It depends on the user, and what criteria they may use which satisfy their
need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful
and suitable groupings (“useful” data classes) or in finding unusual data
objects (outlier detection). Any clustering algorithm must make some
assumptions about what constitutes the similarity of points, and different
assumptions yield different, and equally valid, clusterings.
Clustering Methods:
• Density-Based Methods: These methods consider clusters as dense
regions of the space that differ from the surrounding, less dense
regions. They have good accuracy and the ability to merge two
clusters. Examples: DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), OPTICS (Ordering Points To Identify the
Clustering Structure), etc.
• Hierarchical Based Methods: The clusters formed in this method
form a tree-type structure based on the hierarchy. New clusters are
formed using the previously formed ones. It is divided into two
categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Examples: CURE (Clustering Using REpresentatives), BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies), etc.
• Partitioning Methods: These methods partition the objects into k
clusters, and each partition forms one cluster. The partitioning is
chosen to optimize an objective criterion (similarity function), for
example one in which distance is the major parameter. Examples:
K-means, CLARANS (Clustering Large Applications based upon
RANdomized Search), etc.
• Grid-based Methods: In this method, the data space is divided
into a finite number of cells that form a grid-like structure. All the
clustering operations done on these grids are fast and independent
of the number of data objects. Examples: STING (STatistical
INformation Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Clustering Algorithms: The K-means clustering algorithm is one of the simplest
unsupervised learning algorithms that solves the clustering problem. The
K-means algorithm partitions n observations into k clusters, where each
observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster.

Applications of Clustering in different fields:

1. Marketing: It can be used to characterize & discover customer


segments for marketing purposes.
2. Biology: It can be used for classification among different species of
plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics
and information.
4. Insurance: It is used to understand customers and their policies and to
identify fraud.
5. City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can
determine the dangerous zones.
7. Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in
image data.
8. Genetics: Clustering is used to group genes that have similar
expression patterns and identify gene networks that work together in
biological processes.
9. Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
10. Customer Service: Clustering is used to group customer
inquiries and complaints into categories, identify common issues, and
develop targeted solutions.
11. Manufacturing: Clustering is used to group similar products
together, optimize production processes, and identify defects in
manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients with
similar symptoms or diseases, which helps in making accurate
diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify suspicious
patterns or anomalies in financial transactions, which can help in
detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar patterns of
traffic data, such as peak hours, routes, and speeds, which can help in
improving transportation planning and infrastructure.
15. Social network analysis: Clustering is used to identify
communities or groups within social networks, which can help in
understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar patterns of
network traffic or system behavior, which can help in detecting and
preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of
climate data, such as temperature, precipitation, and wind, which can
help in understanding climate change and its impact on the
environment.
18. Sports analysis: Clustering is used to group similar patterns of
player or team performance data, which can help in analyzing player or
team strengths and weaknesses and making strategic decisions.
19. Crime analysis: Clustering is used to group similar patterns of
crime data, such as location, time, and type, which can help in
identifying crime hotspots, predicting future crime trends, and
improving crime prevention strategies.

K – Means
K-Means Clustering is an Unsupervised Machine Learning algorithm, which
groups the unlabeled dataset into different clusters.

K means Clustering
Unsupervised machine learning is the process of teaching a
computer to use unlabeled, unclassified data and enabling the algorithm to
operate on that data without supervision. Without any prior training on the
data, the machine's job in this case is to organize unsorted data according to
similarities, patterns, and differences.
The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more
comparable to one another and different from the data points within the other
groups. It is essentially a grouping of things based on how similar and
different they are to one another.
We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the K-means algorithm; an unsupervised learning
algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
(It helps to think of the items as points in an n-dimensional space.) The
algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity, we will use the Euclidean distance as a
measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster
centroids.
2. We categorize each item to its closest mean and we update the
mean’s coordinates, which are the averages of the items categorized
in that cluster so far.
3. We repeat the process for a given number of iterations and at the
end, we have our clusters.
The “points” mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have a
lot of options. An intuitive method is to initialize the means at random items
in the data set. Another method is to initialize the means at random values
between the boundaries of the data set (if for a feature x the items have
values in [0,3], we will initialize the means with values for x in [0,3]).
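As a minimal sketch of the steps above in Python/NumPy (the toy data, k = 3, the fixed iteration count, and the initialization at random items are assumptions made purely for illustration):

import numpy as np

def k_means(X, k=3, iterations=100, seed=0):
    # Minimal K-means sketch: assign each item to the nearest mean, then recompute the means.
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # initialize at random items
    for _ in range(iterations):
        # Euclidean distance from every item to every mean.
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                  # index of the closest mean
        # Update each mean to the average of the items currently assigned to it.
        # (A fuller implementation would also handle clusters that become empty.)
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, means

# Assumed toy data: three loose groups of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, means = k_means(X, k=3)
print(means)          # roughly [0, 0], [5, 5] and [0, 5], in some order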
Advantages of k-means

1. Simple and easy to implement: The k-means algorithm is easy to


understand and implement, making it a popular choice for clustering
tasks.
2. Fast and efficient: K-means is computationally efficient and can handle
large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of
data points and can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and
can be used with different distance metrics and initialization methods.

Disadvantages of K-Means:

1. Sensitivity to initial centroids: K-means is sensitive to the initial


selection of centroids and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k
needs to be specified before running the algorithm, which can be
challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a
significant impact on the resulting clusters.

Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real


life, like:

• Academic performance
• Diagnostic systems
• Search engines
• Wireless sensor networks

Hierarchical Clustering

Hierarchical clustering is a method of clustering that builds a hierarchy of


clusters. There are two types of this method.

• Agglomerative: This is a bottom-up approach where each observation


is treated as its own cluster in the beginning and as we move from
bottom to top, each observation is merged into pairs, and pairs are
merged into clusters.
• Divisive: This is a "top-down" approach: all observations start in one
cluster, and splits are performed recursively as we move from top to
bottom.

When it comes to analyzing data from social networks, hierarchical clustering


is by far the most common and popular method of clustering. The nodes
(branches) in the graph are compared to each other depending on the degree
of similarity that exists between them. By linking together smaller groups of
nodes that are related to one another, larger groupings may be created.

The biggest advantage of hierarchical clustering is that it is easy to
understand and implement. Usually, the output of this clustering method is
analyzed visually in a tree diagram called a dendrogram, which records the
order of the merges (or splits) and the distance at which each one occurred.
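A dendrogram of this kind can be produced with SciPy's agglomerative clustering utilities. The sketch below is only an illustration; the toy data and the choice of Ward linkage are assumptions, not part of the text above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

# Agglomerative (bottom-up) clustering: Ward's method merges, at each step,
# the pair of clusters giving the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

dendrogram(Z)                    # draw the merge hierarchy as a dendrogram
plt.xlabel("observation index")
plt.ylabel("merge distance")
plt.show()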

Cluster Validity

For supervised classification we have a variety of measures to evaluate how


good our model is

• Accuracy, precision, recall

For cluster analysis, the analogous question is how to evaluate the


“goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!


Then why do we want to evaluate them?

• To avoid finding patterns in noise


• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing


whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results,
e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data
without reference to external information, i.e., using only the data itself.
4. Comparing the results of two different sets of cluster analyses to
determine which is better.
5. Determining the ‘correct’ number of clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster


validity, are classified into the following three types.

• External Index: Used to measure the extent to which cluster labels
match externally supplied class labels, e.g., entropy.
• Internal Index: Used to measure the goodness of a clustering structure
without respect to external information, e.g., the Sum of Squared
Errors (SSE).
• Relative Index: Used to compare two different clusterings or clusters.
Often an external or internal index is used for this purpose, e.g., SSE or
entropy.

Sometimes these are referred to as criteria instead of indices

• However, sometimes criterion is the general strategy and index is the


numerical measure that implements the criterion.

Measuring cluster validity via correlation


Two matrices

• Proximity Matrix
• Ideal Similarity Matrix
o One row and one column for each data point
o An entry is 1 if the associated pair of points belong to the same
cluster
o An entry is 0 if the associated pair of points belongs to different
clusters

Compute the correlation between the two matrices

• Since the matrices are symmetric, only the correlation between n(n-1)
/ 2 entries needs to be calculated.

High correlation indicates that points that belong to the same cluster are
close to each other.

Not a good measure for some density or contiguity-based clusters.

As an illustration, consider the correlation of the ideal similarity and proximity
matrices for the K-means clustering of two data sets, one with well-separated
clusters and one consisting of random points. If we order the similarity matrix
with respect to the cluster labels and inspect it visually, the block structure for
the clustered data is crisp, while for the random data it is not.
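As a hedged sketch of the correlation idea in Python (the toy data and labels are assumed; a real study would compare a well-clustered data set against a random one):

import numpy as np

def cluster_label_correlation(X, labels):
    # Correlate a pairwise similarity matrix with the ideal (same-cluster = 1) matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    similarity = -dist                                    # proximity turned into a similarity
    ideal = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)                     # only n(n-1)/2 entries are needed
    return np.corrcoef(similarity[iu], ideal[iu])[0, 1]

# Assumed toy example: two well-separated groups give a high correlation,
# while randomly scattered points with arbitrary labels would give a low one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(cluster_label_correlation(X, labels))               # high (close to 1) for this data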

Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of


features in a dataset while retaining as much of the important information as
possible. In other words, it is a process of transforming high-dimensional data
into a lower-dimensional space that still preserves the essence of the original
data.

In machine learning, high-dimensional data refers to data with a large


number of features or variables. The curse of dimensionality is a common
problem in machine learning, where the performance of the model
deteriorates as the number of features increases. This is because the
complexity of the model increases with the number of features, and it
becomes more difficult to find a good solution. In addition, high-dimensional
data can also lead to overfitting, where the model fits the training data too
closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the
complexity of the model and improving its generalization performance. There
are two main approaches to dimensionality reduction: feature selection and
feature extraction.

Feature Selection:
Feature selection involves selecting a subset of the original features that are
most relevant to the problem at hand. The goal is to reduce the
dimensionality of the dataset while retaining the most important features.
There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features
based on their relevance to the target variable, wrapper methods use the
model performance as the criteria for selecting features, and embedded
methods combine feature selection with the model training process.

Feature Extraction:
Feature extraction involves creating new features by combining or
transforming the original features. The goal is to create a set of features that
captures the essence of the original data in a lower-dimensional space. There
are several methods for feature extraction, including principal component
analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbor embedding (t-SNE). PCA is a popular technique that
projects the original features onto a lower-dimensional space while
preserving as much of the variance as possible.
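The contrast between the two approaches can be sketched with scikit-learn; the Iris data set and the choice of keeping two features/components are assumptions made only for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 4 original features

# Feature selection (filter method): keep the 2 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of the originals (PCA).
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # both (150, 2)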

Why is Dimensionality Reduction important in Machine Learning and


Predictive Modeling?

An intuitive example of dimensionality reduction can be discussed through a


simple e-mail classification problem, where we need to classify whether the
e-mail is spam or not. This can involve a large number of features, such as
whether or not the e-mail has a generic title, the content of the e-mail,
whether the e-mail uses a template, etc. However, some of these features
may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since
both of the aforementioned are correlated to a high degree. Hence, we can
reduce the number of features in such problems. A 3-D classification problem
can be hard to visualize, whereas a 2-D one can be mapped to a simple two-
dimensional space, and a 1-D problem to a simple line. Conceptually, a 3-D
feature space can first be split into two 2-D feature spaces, and later, if the
features are found to be correlated, the number of features can be reduced
even further.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:

• Feature selection: In this, we try to find a subset of the original set of


variables, or features, to get a smaller subset which can be used to
model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space
to a lower-dimensional space, i.e., a space with a smaller number of dimensions.

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)


• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Advantages of Dimensionality Reduction

• It helps in data compression, and hence reduces the required storage space.


• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved Visualization: High dimensional data is difficult to visualize,
and dimensionality reduction techniques can help in visualizing the
data in 2D or 3D, which can help in better understanding and analysis.
• Overfitting Prevention: High dimensional data may lead to overfitting
in machine learning models, which can lead to poor generalization
performance. Dimensionality reduction can help in reducing the
complexity of the data, and hence prevent overfitting.
• Feature Extraction: Dimensionality reduction can help in extracting
important features from high dimensional data, which can be useful in
feature selection for machine learning models.
• Data Preprocessing: Dimensionality reduction can be used as a
preprocessing step before applying machine learning algorithms to
reduce the dimensionality of the data and hence improve the
performance of the model.
• Improved Performance: Dimensionality reduction can help in improving
the performance of machine learning models by reducing the
complexity of the data, and hence reducing the noise and irrelevant
information in the data.

Disadvantages of Dimensionality Reduction

• It may lead to some amount of data loss.


• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define
datasets.
• We may not know how many principal components to keep; in practice,
some rules of thumb are applied.
• Interpretability: The reduced dimensions may not be easily
interpretable, and it may be difficult to understand the relationship
between the original features and the reduced dimensions.
• Overfitting: In some cases, dimensionality reduction may lead to
overfitting, especially when the number of components is chosen
based on the training data.
• Sensitivity to outliers: Some dimensionality reduction techniques are
sensitive to outliers, which can result in a biased representation of the
data.
• Computational complexity: Some dimensionality reduction techniques,
such as manifold learning, can be computationally intensive, especially
when dealing with large datasets.

Principal Component Analysis

The Principal Component Analysis (PCA) technique was introduced by the
mathematician Karl Pearson in 1901. It works on the condition that while the
data in a higher-dimensional space is mapped to data in a lower-dimensional
space, the variance of the data in the lower-dimensional space should be
maximized.

• Principal Component Analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of correlated
variables into a set of uncorrelated variables. PCA is the most widely
used tool in exploratory data analysis and in machine learning for
predictive models.
• PCA is an unsupervised learning technique used to examine the
interrelations among a set of variables. It is also known as general
factor analysis, where regression determines a line of best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important
patterns or relationships between the variables without any prior
knowledge of the target variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of


a data set by finding a new set of variables, smaller than the original set of
variables, retaining most of the sample’s information, and useful for
the regression and classification of data.
1. PCA is a technique for dimensionality reduction that identifies a set of
orthogonal axes, called principal components, that capture the
maximum variance in the data. The principal components are linear
combinations of the original variables in the dataset and are ordered in
decreasing order of importance. The total variance captured by all the
principal components is equal to the total variance in the original
dataset.
2. The first principal component captures the most variation in the data,
while the second principal component captures the maximum variance
that is orthogonal to the first principal component, and so on.
3. PCA can be used for a variety of purposes, including data visualization,
feature selection, and data compression. In data visualization, PCA can
be used to plot high-dimensional data in two or three dimensions,
making it easier to interpret. In feature selection, PCA can be used to
identify the most important variables in a dataset. In data compression,
PCA can be used to reduce the size of a dataset without losing
important information.
4. In PCA, it is assumed that the information is carried in the variance of
the features, that is, the higher the variation in a feature, the more
information that features carries.

Overall, PCA is a powerful tool for data analysis and can help to simplify
complex datasets, making them easier to understand and work with.
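For illustration, a minimal NumPy sketch of the computation behind PCA (mean-centering, covariance, eigendecomposition, projection); the toy data and the choice of two components are assumptions, and in practice a library routine such as scikit-learn's PCA would normally be used instead.

import numpy as np

def pca(X, n_components=2):
    # Project X onto the directions of maximum variance (the principal components).
    X_centered = X - X.mean(axis=0)              # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order[:n_components]] / eigvals.sum()
    return X_centered @ components, explained

# Assumed toy data: 5 correlated features built from 2 underlying factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))])

X_reduced, explained_ratio = pca(X, n_components=2)
print(X_reduced.shape, explained_ratio)          # (100, 2) and a large share of the variance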

Recommendation Systems
A recommendation system (or recommender system) is a class of machine
learning system that uses data to help predict, narrow down, and find what
people are looking for among an exponentially growing number of options.

A recommendation system is an artificial intelligence or AI algorithm, usually


associated with machine learning, that uses Big Data to suggest or
recommend additional products to consumers. These can be based on various
criteria, including past purchases, search history, demographic information,
and other factors. Recommender systems are highly useful as they help users
discover products and services they might otherwise have not found on their
own.

Recommender systems are trained to understand the preferences, previous


decisions, and characteristics of people and products using data gathered
about their interactions. These include impressions, clicks, likes, and
purchases. Because of their capability to predict consumer interests and
desires on a highly personalized level, recommender systems are a favorite
with content and product providers. They can drive consumers to just about
any product or service that interests them, from books to videos to health
classes to clothing.

Types of Recommendation Systems

While there are a vast number of recommender algorithms and techniques,


most fall into these broad categories: collaborative filtering, content filtering
and context filtering.
Collaborative filtering algorithms recommend items (this is the filtering part)
based on preference information from many users (this is the collaborative
part). This approach uses the similarity of user preference behavior: given
previous interactions between users and items, recommender algorithms
learn to predict future interactions. These recommender systems build a model
from a user’s past behavior, such as items purchased previously or ratings
given to those items and similar decisions by other users. The idea is that if
some people have made similar decisions and purchases in the past, like a
movie choice, then there is a high probability they will agree on additional
future selections. For example, if a collaborative filtering recommender
knows you and another user share similar tastes in movies, it might
recommend a movie to you that it knows this other user already likes.
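A hedged sketch of this user-based collaborative filtering idea on a tiny, invented ratings matrix: users are compared by cosine similarity of their ratings, and an unseen item's rating is predicted from similar users.

import numpy as np

# Rows = users, columns = items; 0 means "not rated yet" (assumed toy data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(user, item):
    # Predict a rating as a similarity-weighted average of other users' ratings for the item.
    others = [u for u in range(len(ratings)) if u != user and ratings[u, item] > 0]
    sims = np.array([cosine_similarity(ratings[user], ratings[u]) for u in others])
    return sims @ ratings[others, item] / sims.sum()

# A fairly low predicted score: user 0 most resembles user 1, who rated item 2 poorly.
print(predict(user=0, item=2))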

Content filtering, by contrast, uses the attributes or features of an item (this


is the content part) to recommend other items similar to the user’s
preferences. This approach is based on the similarity of item and user features:
given information about a user and the items they have interacted with (e.g., a
user's age, the category of a restaurant's cuisine, the average review for a
movie), it models the likelihood of a new interaction. For example, if a content
filtering recommender sees you liked the movies You’ve Got Mail and
Sleepless in Seattle, it might recommend another movie to you with the same
genres and/or cast such as Joe Versus the Volcano.
Hybrid recommender systems combine the advantages of the types above
to create a more comprehensive recommending system.

Context filtering includes users’ contextual information in the


recommendation process. Netflix spoke at NVIDIA GTC about making better
recommendations by framing a recommendation as a contextual sequence
prediction. This approach uses a sequence of contextual user actions, plus
the current context, to predict the probability of the next action. In the Netflix
example, given one sequence for each user—the country, device, date, and
time when they watched a movie—they trained a model to predict what to
watch next.
Use Cases and Applications

E-Commerce & Retail: Personalized Merchandising

Imagine that a user has already purchased a scarf. Why not offer a matching
hat so the look will be complete? This feature is often implemented by means
of AI-based algorithms as “Complete the look” or “You might also like”
sections in e-commerce platforms like Amazon, Walmart, Target, and many
others.

On average, an intelligent recommender system delivers a 22.66% lift in
conversion rates for web products.

Media & Entertainment: Personalized Content

AI-based recommender engines can analyze an individual’s purchase


behavior and detect patterns that will help provide them with the content
suggestions that will most likely match his or her interests. This is what
Google and Facebook actively apply when recommending ads, or what
Netflix does behind the scenes when recommending movies and TV shows.

Personalized Banking

A mass-market product that is consumed digitally by millions, banking is
prime for recommendations. Knowing a customer's detailed financial
situation and their past preferences, coupled with data from thousands of
similar users, is quite powerful.
Benefits of Recommendation Systems

Recommender systems are a critical component driving personalized user


experiences, deeper engagement with customers, and powerful decision
support tools in retail, entertainment, healthcare, finance, and other
industries. On some of the largest commercial platforms, recommendations
account for as much as 30% of the revenue. A 1% improvement in the quality
of recommendations can translate into billions of dollars in revenue.

Companies implement recommender systems for a variety of reasons,


including:

• Improving retention. By continuously catering to the preferences of


users and customers, businesses are more likely to retain them as loyal
subscribers or shoppers. When a customer senses that they’re truly
understood by a brand and not just having information randomly
thrown at them, they’re far more likely to remain loyal and continue
shopping at your site.
• Increasing sales. Various research studies show increases in upselling
revenue from 10-50% resulting from accurate ‘you might also like’
product recommendations. Sales can be increased with
recommendation system strategies as simple as adding matching
product recommendations to a purchase confirmation; collecting
information from abandoned electronic shopping carts; sharing
information on ‘what customers are buying now’; and sharing other
buyers’ purchases and comments.
• Helping to form customer habits and trends. Consistently serving up
accurate and relevant content can trigger cues that build strong habits
and influence usage patterns in customers.
• Speeding up the pace of work. Analysts and researchers can save as
much as 80% of their time when served tailored suggestions for
resources and other materials necessary for further research.
• Boosting cart value. Companies with tens of thousands of items for
sale would be challenged to hard code product suggestions for such
an inventory. By using various means of filtering, these ecommerce
titans can find just the right time to suggest new products customers
are likely to buy, either on their site or through email or other means.

Expectation-Maximization (EM) Algorithm


In real-world applications of machine learning, it is very common that
there are many relevant features available for learning but only a small
subset of them is observable. So, for a variable which is sometimes
observable and sometimes not, we can use the instances when that
variable is observed for the purpose of learning and then predict its
value in the instances when it is not observable.

On the other hand, Expectation-Maximization algorithm can be used for the


latent variables (variables that are not directly observable and are actually
inferred from the values of the other observed variables) too in order to
predict their values with the condition that the general form of probability
distribution governing those latent variables is known to us. This algorithm is
actually at the base of many unsupervised clustering algorithms in the field
of machine learning.
It was explained, proposed and given its name in a paper published in 1977
by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local
maximum likelihood parameters of a statistical model in the cases where
latent variables are involved and the data is missing or incomplete.

Algorithm:

1. Given a set of incomplete data, consider a set of starting parameters.


2. Expectation step (E – step): Using the observed available data of the
dataset, estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the
expectation (E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of Expectation-Maximization algorithm is to use the available


observed data of the dataset to estimate the missing data and then using that
data to update the values of the parameters. Let us understand the EM
algorithm in detail.

• Initially, a set of initial values of the parameters are considered. A set


of incomplete observed data is given to the system with the
assumption that the observed data comes from a specific model.
• The next step is known as “Expectation” – step or E-step. In this step,
we use the observed data in order to estimate or guess the values of
the missing or incomplete data. It is basically used to update the
variables.
• The next step is known as “Maximization”-step or M-step. In this step,
we use the complete data generated in the preceding “Expectation” –
step in order to update the values of the parameters. It is basically used
to update the hypothesis.
• Now, in the fourth step, it is checked whether the values are converging
or not; if yes, we stop, otherwise we repeat step 2 and step 3, i.e. the
"Expectation" step and the "Maximization" step, until convergence
occurs.
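As a compact illustration, the sketch below runs the E and M steps for a two-component one-dimensional Gaussian mixture, one of the classic uses of EM; the data, the number of components, and the iteration count are assumptions chosen for brevity.

import numpy as np
from scipy.stats import norm

# Assumed toy data: a mixture of two 1-D Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# Starting parameters: means, standard deviations, and mixing weights.
mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibility of each component for each point (soft "missing" labels).
    dens = np.vstack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the parameters from the responsibility-weighted data.
    Nk = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    pi = Nk / len(x)

print(mu, sigma, pi)   # should approach roughly (-2, 3), (1, 1.5), (0.3, 0.7)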

Flow chart for EM algorithm –

Usage of EM algorithm –
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM).
• It can be used for discovering the values of latent variables.

Advantages of EM algorithm –

• It is guaranteed that the likelihood will never decrease with each iteration.


• The E-step and M-step are often pretty easy for many problems in
terms of implementation.
• Solutions to the M-steps often exist in the closed form.

Disadvantages of EM algorithm –

• It has slow convergence.


• It converges to a local optimum only.
• It requires both forward and backward probabilities (numerical
optimization requires only the forward probability).

Reinforcement Learning

Reinforcement learning is an area of Machine Learning. It is about taking


suitable action to maximize reward in a particular situation. It is employed by
various software and machines to find the best possible behavior or path it
should take in a specific situation. Reinforcement learning differs from
supervised learning: in supervised learning the training data comes with
the answer key, so the model is trained with the correct answer itself,
whereas in reinforcement learning there is no answer and the reinforcement
agent decides what to do to perform the given task. In the absence of a
training dataset, it is bound to learn from its experience.

Reinforcement Learning (RL) is the science of decision making. It is about


learning the optimal behavior in an environment to obtain maximum reward.
In RL, the data is accumulated from machine learning systems that use a trial-
and-error method. Data is not part of the input that we would find in
supervised or unsupervised machine learning.

Reinforcement learning uses algorithms that learn from outcomes and decide
which action to take next. After each action, the algorithm receives feedback
that helps it determine whether the choice it made was correct, neutral or
incorrect. It is a good technique to use for automated systems that have to
make a lot of small decisions without human guidance.

Reinforcement learning is an autonomous, self-teaching system that


essentially learns by trial and error. It performs actions with the aim of
maximizing rewards, or in other words, it is learning by doing in order to
achieve the best outcomes.

Example:

The problem is as follows: We have an agent and a reward, with many


hurdles in between. The agent is supposed to find the best possible path to
reach the reward. Consider, for instance, a robot, a diamond, and fire: the
goal of the robot is to get the reward, which is the diamond, while avoiding
the hurdles, which are the fire. The robot learns by trying all the possible
paths and then choosing the path which gives it the reward with the fewest
hurdles. Each right step gives the robot a reward and each wrong step
subtracts from the robot's reward. The total reward is calculated when it
reaches the final reward, that is, the diamond.
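A hedged sketch of such trial-and-error learning, using tabular Q-learning on a tiny one-dimensional corridor where one end is the "fire" penalty and the other end is the "diamond" reward; the layout, rewards, and hyperparameters are all invented for illustration and are not taken from the example above.

import numpy as np

# A tiny corridor of 6 cells: cell 5 holds the diamond (+10), cell 0 is fire (-10).
N_STATES, ACTIONS = 6, [-1, +1]              # actions: 0 = step left, 1 = step right
rewards = np.zeros(N_STATES); rewards[5], rewards[0] = 10.0, -10.0
TERMINAL = {0, 5}
Q = np.zeros((N_STATES, len(ACTIONS)))       # state-action value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(500):
    s = 2                                    # the agent always starts in cell 2
    while s not in TERMINAL:
        # Explore with probability epsilon, otherwise exploit the current best action.
        a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = s + ACTIONS[a]
        r = rewards[s_next]
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[1:5].argmax(axis=1))   # expected: all 1s (step right, away from the fire, toward the diamond)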

Main points in Reinforcement learning –

• Input: The input should be an initial state from which the model will
start
• Output: There are many possible outputs as there are a variety of
solutions to a particular problem
• Training: The training is based upon the input; the model will return a
state and the user will decide to reward or punish the model based on
its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.

Difference between Reinforcement learning and Supervised learning:

Reinforcement learning:
• Reinforcement learning is all about making decisions sequentially. In
simple words, the output depends on the state of the current input and
the next input depends on the output of the previous input.
• Decisions are dependent, so we give labels to sequences of dependent
decisions.
• Examples: chess game, text summarization.

Supervised learning:
• In supervised learning, the decision is made on the initial input or the
input given at the start.
• Decisions are independent of each other, so labels are given to each
decision.
• Examples: object recognition, spam detection.

Types of Reinforcement:

There are two types of Reinforcement:

1. Positive: Positive reinforcement occurs when an event, occurring as a
result of a particular behavior, increases the strength and the frequency
of that behavior. In other words, it has a positive effect on behavior.

Advantages of positive reinforcement:

• Maximizes performance
• Sustains change for a long period of time
Drawback: too much reinforcement can lead to an overload of states,
which can diminish the results.
2. Negative: Negative reinforcement is defined as the strengthening of
behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Helps maintain a minimum standard of performance
Drawback: it only provides enough to meet the minimum behavior.

Elements

Elements of Reinforcement Learning

Reinforcement learning elements are as follows:

1. Policy
2. Reward function
3. Value function
4. Model of the environment

Policy: A policy defines the learning agent's way of behaving at a given time.
It is a mapping from perceived states of the environment to actions to be
taken when in those states.

Reward function: Reward function is used to define a goal in a reinforcement


learning problem. A reward function is a function that provides a numerical
score based on the state of the environment

Value function: Value functions specify what is good in the long run. The
value of a state is the total amount of reward an agent can expect to
accumulate over the future, starting from that state.

Model of the environment: A model mimics the behavior of the environment
and is used for planning, i.e., deciding on a course of action by considering
possible future situations before they are actually experienced.

Credit assignment problem: Reinforcement learning algorithms learn to


generate an internal value for the intermediate states as to how good they
are in leading to the goal. The learning decision maker is called the agent.
The agent interacts with the environment that includes everything outside
the agent.

The agent has sensors to decide on its state in the environment and takes
action that modifies its state.
The reinforcement learning problem is modeled as an agent continuously
interacting with an environment. The agent and the environment interact in a
sequence of time steps. At each time step t, the agent receives the state of
the environment and a scalar numerical reward for the previous action, and
then selects an action.

Reinforcement learning is a technique for solving Markov decision problems.

Reinforcement learning uses a formal framework defining the interaction


between a learning agent and its environment in terms of states, actions, and
rewards. This framework is intended to be a simple way of representing
essential features of the artificial intelligence problem.

Various Practical Applications of Reinforcement Learning –

• RL can be used in robotics for industrial automation.


• RL can be used in machine learning and data processing
• RL can be used to create training systems that provide custom
instruction and materials according to the requirement of students.

Application of Reinforcement Learnings

1. Robotics: Robots with pre-programmed behavior are useful in structured


environments, such as the assembly line of an automobile manufacturing
plant, where the task is repetitive in nature.

2. A master chess player makes a move. The choice is informed by
planning: anticipating possible replies and counter-replies.

3. An adaptive controller adjusts parameters of a petroleum refinery’s


operation in real time.

RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not


available;
2. Only a simulation model of the environment is given (the subject of
simulation-based optimization)
3. The only way to collect information about the environment is to interact
with it.

Advantages and Disadvantages of Reinforcement Learning

Advantages of Reinforcement learning

1. Reinforcement learning can be used to solve very complex problems that


cannot be solved by conventional techniques.

2. The model can correct the errors that occurred during the training process.

3. In RL, training data is obtained via the direct interaction of the agent with
the environment

4. Reinforcement learning can handle environments that are non-


deterministic, meaning that the outcomes of actions are not always
predictable. This is useful in real-world applications where the environment
may change over time or is uncertain.

5. Reinforcement learning can be used to solve a wide range of problems,


including those that involve decision making, control, and optimization.

6. Reinforcement learning is a flexible approach that can be combined with


other machine learning techniques, such as deep learning, to improve
performance.

Disadvantages of Reinforcement learning

1. Reinforcement learning is not preferable to use for solving simple


problems.

2. Reinforcement learning needs a lot of data and a lot of computation

3. Reinforcement learning is highly dependent on the quality of the reward


function. If the reward function is poorly designed, the agent may not learn
the desired behavior.

4. Reinforcement learning can be difficult to debug and interpret. It is not


always clear why the agent is behaving in a certain way, which can make it
difficult to diagnose and fix problems.
Model Based Learning

Hundreds of learning algorithms have been developed in the field of machine
learning. Scientists typically select from among these algorithms to solve
specific problems. Their options are frequently restricted by their familiarity
with these algorithms. In this classical/traditional machine learning framework,
scientists are forced to make some assumptions in order to employ an existing
algorithm.

• Model-based machine learning (MBML) is a technique that tries to
generate a custom solution for each new problem.

MBML's purpose is to offer a single development framework that
facilitates the building of a diverse variety of custom models. This
paradigm evolved as a result of a confluence of three main ideas:

• Factor graphs
• Bayesian perspective,
• Probabilistic Programming

The essential principle is that all assumptions about the problem domain are
made explicit in the form of a model; the model is just the collection of
assumptions expressed in a graphical manner.

Factor Graphs

The usage of Probabilistic Graphical Models (PGMs), particularly factor graphs,
is the pillar of MBML. A PGM is a graph-based diagrammatic representation
of the joint probability distribution across all random variables in a model.

Factor graphs are a form of PGM in which round nodes represent random
variables, square nodes represent factors (local probability distributions or
constraints over those variables), and edges express the dependencies
between them. They offer a general framework for modeling the joint
distribution of a set of random variables.

In factor graphs, we treat latent parameters as random variables and
discover their probability distributions throughout the network using
Bayesian inference techniques. Inference/learning amounts to taking
products of factors over subsets of the graph's variables, which makes it
simple to develop local message-passing algorithms.
Bayesian Methods

The first essential concept allowing this new machine learning architecture
is Bayesian inference/learning. Latent/hidden parameters are represented in
MBML as random variables with probability distributions. This provides for a
consistent and rational approach to quantifying uncertainty in model
parameters. When the observed variables in the model are fixed to their
observed values, Bayes' theorem is used to update the previously assumed
(prior) probability distributions.

In contrast, the classical ML framework assigns model parameters to point
values derived by maximizing an objective function. Bayesian inference on big
models with millions of variables is performed in the same spirit, but in a more
complicated way, because applying Bayes' theorem exactly becomes
intractable on huge datasets. The rise in the processing capacity of computers
over the last decade has enabled the research and innovation of approximate
inference algorithms that can scale to enormous data sets.
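To make the Bayesian update concrete, the sketch below applies Bayes' theorem on a simple grid of candidate values for a coin's bias; the coin example, the flat prior, and the grid resolution are assumptions chosen only to illustrate the prior-to-posterior update.

import numpy as np

# Latent parameter: the coin's probability of heads, treated as a random variable.
theta = np.linspace(0.01, 0.99, 99)        # grid of candidate values
prior = np.ones_like(theta) / len(theta)   # flat prior: all biases equally plausible

# Observed data: 7 heads in 10 tosses (assumed).
heads, tosses = 7, 10
likelihood = theta**heads * (1 - theta)**(tosses - heads)

# Bayes' theorem: posterior is proportional to likelihood x prior (then normalize).
posterior = likelihood * prior
posterior /= posterior.sum()

print(theta[posterior.argmax()])           # posterior mode, close to 0.7
print((theta * posterior).sum())           # posterior mean, pulled slightly toward 0.5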

Probabilistic Programming

Probabilistic programming (PP) is a breakthrough in computer science in


which programming languages are now created to compute with uncertainty
in addition to logic. Current programming languages can already handle
random variables, variable constraints, and inference packages. You may now
express a custom model of your problem concisely with a few lines of code
using a PP language; an inference engine is then invoked to generate
inference procedures that solve the problem automatically.

Model-Based ML Developmental Stages

Model-based ML development consists of three stages:

• Describe the model: using factor graphs, describe the process that
generated the data.
• Condition on the reported (observed) data: make the observed variables
equal to their known values.
• Perform inference: backward reasoning is used to update the prior
distributions over the latent constructs or parameters, i.e., estimate the
Bayesian (posterior) probability distributions of the latent constructs
given the observed variables.
Temporal Difference Learning

Temporal Difference Learning is an unsupervised learning technique that is


very commonly used in reinforcement learning for the purpose of predicting
the total reward expected over the future. It can, however, be used to
predict other quantities as well. It is essentially a way to learn how to predict
a quantity that is dependent on the future values of a given signal. It is a
method that is used to compute the long-term utility of a pattern of behaviour
from a series of intermediate rewards.

Essentially, Temporal Difference Learning (TD Learning) focuses on


predicting a variable's future value in a sequence of states. Temporal
difference learning was a major breakthrough in solving the problem of
reward prediction. You could say that it employs a mathematical trick that
allows it to replace complicated reasoning with a simple learning procedure
that can be used to generate the very same results.

The trick is that rather than attempting to calculate the total future reward,
temporal difference learning just attempts to predict the combination of
immediate reward and its own reward prediction at the next moment in time.
Now when the next moment comes and brings fresh information with it, the
new prediction is compared with the expected prediction. If these two
predictions are different from each other, the Temporal Difference Learning
algorithm will calculate how different the predictions are from each other and
make use of this temporal difference to adjust the old prediction toward the
new prediction.

The temporal difference algorithm always aims to bring the expected


prediction and the new prediction together, thus matching expectations with
reality and gradually increasing the accuracy of the entire chain of prediction.

Temporal Difference Learning aims to predict a combination of the immediate


reward and its own reward prediction at the next moment in time.

In TD Learning, the training signal for a prediction is a future prediction. This


method is a combination of the Monte Carlo (MC) method and the Dynamic
Programming (DP) method. Monte Carlo methods adjust their estimates only
after the final outcome is known, but temporal difference methods tend to
adjust predictions to match later, more accurate predictions about the future,
well before the final outcome is clear and known. This is essentially a type of
bootstrapping.

Temporal difference learning in machine learning got its name from the way
it uses changes, or differences, in predictions over successive time steps for
the purpose of driving the learning process.

The prediction at any particular time step gets updated to bring it nearer to
the prediction of the same quantity at the next time step.

Parameters used in temporal difference learning

• Alpha (α): learning rate


It shows how much our estimates should be adjusted, based on the error.
This rate varies between 0 and 1.
• Gamma (γ): the discount rate
This indicates how much future rewards are valued. A larger discount rate
signifies that future rewards are valued to a greater extent. The discount
rate also varies between 0 and 1.
• Epsilon (ε): the exploration vs. exploitation trade-off parameter.
The agent explores new options with probability ε and stays with the
current best-known action with probability 1−ε. A larger ε signifies that
more exploration is carried out during training.
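These parameters appear directly in the TD(0) update rule V(s) ← V(s) + α[r + γV(s') − V(s)], sketched below on a small chain of states; the chain, the reward placement, and the parameter values are assumptions made only for illustration.

import numpy as np

# A chain of 5 states; moving right from the last state ends the episode with reward 1.
N_STATES = 5
V = np.zeros(N_STATES + 1)          # value estimate per state (+1 dummy terminal state)
alpha, gamma = 0.1, 0.9             # learning rate and discount rate

for episode in range(1000):
    s = 0
    while s < N_STATES:
        s_next = s + 1
        r = 1.0 if s_next == N_STATES else 0.0
        # TD(0): nudge the old prediction toward (immediate reward + discounted next prediction).
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V[:N_STATES])   # approaches gamma**(steps to the reward): about [0.66, 0.73, 0.81, 0.9, 1.0]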

The advantages of temporal difference learning in machine learning are:


✓ TD learning methods are able to learn in each step, online or offline.
✓ These methods are capable of learning from incomplete sequences,
which means that they can also be used in continuous problems.
✓ Temporal difference learning can function in non-terminating
environments.
✓ TD Learning has less variance than the Monte Carlo method, because
each update depends on only one random action, transition, and reward.
✓ It tends to be more efficient than the Monte Carlo method.
✓ Temporal Difference Learning exploits the Markov property, which
makes it more effective in Markov environments.

There are two main disadvantages:

• It has greater sensitivity towards the initial value.


• It is a biased estimation.
Unit – 4. Probabilistic Methods for Learning

Introduction
Probabilistic Models are one of the most important segments in Machine Learning,
which is based on the application of statistical techniques to data analysis. This dates back
to one of the first approaches of machine learning and continues to be widely used
today. In probabilistic models, unobserved variables are treated as stochastic, and the
interdependence between variables is captured in a joint probability distribution. This
provides a principled foundation for learning. The probabilistic framework outlines the
approach for representing and handling uncertainty about models.
In scientific data analysis, predictions play a dominating role. Their contribution is also
critical in machine learning, cognitive computing, automation, and artificial
intelligence.

These probabilistic models have many admirable characteristics and are quite useful
in statistical analysis. They make it quite simple to reason about the uncertainties
present in most data. In fact, they may be built hierarchically to create complicated
models from basic elements. One of the main reasons why probabilistic modeling is
so popular nowadays is that it provides natural protection against overfitting and
allows for fully coherent inferences about complex structures from data.

Examples of Probabilistic Models

Generalised Linear Models

One of the better applications of probabilistic modeling is generalised linear models.


This vastly generalises linear regression using exponential families. Ordinary linear
regression predicts the expected value of an unknown quantity (the response
variable, a random variable) as a linear combination of a collection of observed
values (the predictors).
This means that a constant change in a predictor leads to a constant change in the
response variable (i.e. a linear response model). This is appropriate when the
response variable can vary, to a good approximation, indefinitely in either direction,
or when any quantity only varies by a relatively small amount compared with the
variation in the predictive factors, e.g. human heights. These assumptions, however,
are incorrect for several types of response variables.

Straight Line Modeling

A straight-line probabilistic model is sometimes known as a linear regression model


or a best-fit straight line. It's a best-fit line since it tries to reduce the size of all the
different error components. A linear regression model may be computed using any
basic spreadsheet or statistical software application. However, the basic computation
is just dependent on a few variables. This is another implementation that is based on
probabilistic modeling.

Weather and Traffic

Weather and traffic are two everyday phenomena that are both unpredictable and
appear to have a link with one another. You are all aware that if the weather is cold
and snow is falling, traffic will be quite difficult and you will be detained for an
extended period of time. We could even go so far as to predict a substantial
association between snowy weather and higher traffic mishaps.

Based on available data, we can develop a basic mathematical model of traffic


accidents as a function of snowy weather to aid in the analysis of our hypothesis. All
of these models are based on probabilistic modeling. It is one of the most effective
approaches for assessing weather and traffic relationships.

Naïve Bayes Algorithm

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based


on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.

o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can
make quick predictions.

o It is a probabilistic classifier, which means it predicts on the basis of the


probability of an object.

o Some popular examples of Naïve Bayes Algorithm are spam filtration,


Sentimental analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises the two words Naïve and Bayes, which can
be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red, spherical,
and sweet fruit is recognized as an apple. Hence each feature individually
contributes to identifying it as an apple, without depending on the others.

o Bayes: It is called Bayes because it depends on the principle of Bayes'


Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.

o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability
of a hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether or not we should play on a
particular day according to the weather conditions. To solve this problem, we need to
follow the steps below:

1. Convert the given dataset into frequency tables.

2. Generate Likelihood table by finding the probabilities of given features.

3. Now, use Bayes theorem to calculate the posterior probability.
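A minimal sketch of those three steps on an invented outlook/play table, using a single feature for brevity (the data rows are assumed; with several features, the naïve assumption would multiply one likelihood term per feature):

from collections import Counter

# Assumed toy dataset: (Outlook, Play) pairs.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

# Step 1: frequency tables.
class_counts = Counter(play for _, play in data)                      # counts for P(Play)
feature_counts = Counter((outlook, play) for outlook, play in data)   # counts for P(Outlook | Play)

def posterior(outlook, play):
    # Steps 2 and 3: likelihood x prior (the shared evidence term cancels when comparing classes).
    prior = class_counts[play] / len(data)
    likelihood = feature_counts[(outlook, play)] / class_counts[play]
    return likelihood * prior

# Should we play when the outlook is Sunny?  The class with the higher posterior wins.
scores = {play: posterior("Sunny", play) for play in ("Yes", "No")}
print(scores, "->", max(scores, key=scores.get))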

Maximum Likelihood

Maximum likelihood estimation (MLE) is an estimation method that


allows us to use a sample to estimate the parameters of the probability
distribution that generated the sample. The maximum likelihood
estimation is a method that determines values for parameters of the
model. It is the statistical method of estimating the parameters of the
probability distribution by maximizing the likelihood function. The
parameter value that maximizes the likelihood function is called the
maximum likelihood estimate.

Development:

This principle was originally developed by Ronald Fisher, in the 1920s. He


stated that the desired probability distribution is the one that makes the observed
data "most likely", which means we choose the parameter vector that
maximizes the likelihood function.

Goal:

The goal of maximum likelihood estimation is to make inference about the


population, which is most likely to have generated the sample i.e., the joint
probability distribution of the random variables.

Major Steps in MLE:

1. Perform a certain experiment to collect the data.
2. Choose a parametric model of the data, with certain modifiable parameters.
3. Formulate the likelihood as an objective function to be maximized.
4. Maximize the objective function and derive the parameters of the model.

Examples:

• Toss a coin – to find the probabilities of heads and tails
• Throw a dart – to find the PDF of the distance to the bullseye
• Sample a group of animals – to estimate the quantity of animals

Maximum likelihood estimation (MLE) is a technique used for estimating the parameters of a given distribution, using some observed data. For example,
if a population is known to follow a normal distribution but the mean and
variance are unknown, MLE can be used to estimate them using a limited
sample of the population, by finding particular values of the mean and
variance so that the observation is the most likely result to have occurred.
MLE is useful in a variety of contexts, ranging from econometrics to MRIs to
satellite imaging. It is also related to Bayesian statistics.
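For the normal-distribution case just described, the MLE has a closed form: the sample mean and the (biased) sample variance. A minimal sketch, assuming a synthetic normally distributed sample:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic data for illustration

mu_hat = sample.mean()                      # value of the mean that maximizes the likelihood
sigma2_hat = np.mean((sample - mu_hat)**2)  # MLE of the variance (divides by n, not n-1)
print(mu_hat, np.sqrt(sigma2_hat))          # estimates should be close to 5.0 and 2.0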

Maximum Apriori

The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or weakly two objects are connected.
This algorithm uses a breadth-first search and Hash Tree to calculate the itemset
associations efficiently. It is the iterative process for finding the frequent itemsets from
the large dataset.

This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly
used for market basket analysis and helps to find those products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.

What is Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the threshold value, or the user-specified minimum support. This means that if {A, B} is a frequent itemset, then A and B individually must also be frequent itemsets.

Suppose there are two transactions: A = {1, 2, 3, 4, 5} and B = {2, 3, 7}. In these two transactions, 2 and 3 are the frequent items.

Steps for Apriori Algorithm


Below are the steps for the apriori algorithm:

Step-1: Determine the support of itemsets in the transactional database, and select
the minimum support and confidence.

Step-2: Keep all the itemsets whose support value is higher than the minimum (selected) support value.

Step-3: Find all the rules of these subsets that have higher confidence value than the
threshold or minimum confidence.

Step-4: Sort the rules in decreasing order of lift.
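A minimal sketch of Steps 1-3 on a handful of hypothetical transactions (support is counted by brute force here rather than with the full join/prune machinery of Apriori):

from itertools import combinations

transactions = [{1, 2, 3, 4, 5}, {2, 3, 7}, {1, 2, 3}, {2, 5, 7}]   # toy data
min_support = 0.5

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
frequent = {}
for k in (1, 2):                                   # 1-itemsets and 2-itemsets only, for brevity
    for candidate in combinations(sorted(items), k):
        s = support(set(candidate))
        if s >= min_support:                       # Step 2: keep itemsets above minimum support
            frequent[candidate] = s

# Step 3: confidence of the rule {2} -> {3}
confidence = support({2, 3}) / support({2})
print(frequent, confidence)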

Advantages of Apriori Algorithm

o This is an easy-to-understand algorithm.

o The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm

o The apriori algorithm works slowly compared to other algorithms.

o The overall performance can be reduced as it scans the database multiple times.

o The time and space complexity of the apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width present in the database.

Bayesian Belief Networks

Bayesian belief networks (BBNs) are probabilistic graphical models that are used to
represent uncertain knowledge and make decisions based on that knowledge. They
are a type of Bayesian network, which is a graphical model that represents probabilistic
relationships between variables.

In the field of artificial intelligence and decision-making, Bayesian Belief Networks (BBNs) have emerged as a powerful tool for probabilistic reasoning and inference.
BBNs provide a framework for representing and analysing complex systems by
explicitly modelling the relationships between uncertain variables. With their ability to
reason under uncertainty, BBNs have found wide-ranging applications in areas such as
healthcare, finance, environmental management, and more. In this technical article, we
will explore the fundamentals of Bayesian Belief Networks, their construction, inference
algorithms, and real-world applications. Whether you are a researcher, practitioner or
enthusiast in the field of AI, this article will provide you with a comprehensive
understanding of BBNs and their potential for solving real-world problems.

Bayesian Network Consists of Two Parts

Together, the DAG and the conditional probability tables allow us to perform
probabilistic inference in the network, such as computing the probability of a particular
variable given the values of other variables in the network. Bayesian networks have
many applications in machine learning, artificial intelligence, and decision analysis.

Directed Acyclic Graph

This is a graphical representation of the variables in the network and the causal
relationships between them. The nodes in the DAG represent variables, and the edges
represent the dependencies between the variables. The arrows in the graph indicate
the direction of causality.

Table of Conditional Probabilities

For each node in the DAG, there is a corresponding table of conditional probabilities
that specifies the probability of each possible value of the node given the values of its
parents in the DAG. These tables encode the probabilistic relationships between the
variables in the network.

• In an example network graph, the nodes stand in for the random variables A, B, C, and D, respectively.
• Node A is referred to as the parent of node B if node B is linked to node A by a directed arrow.
• Node C is independent of node A.

The Semantics of Bayesian Network

The Bayesian network's semantics can be understood in one of two ways, as follows:

• To understand the network as the representation of the joint probability distribution: It is important because it allows us to model complex systems
using a graph structure. By representing the joint distribution as a graph, we
can easily identify the dependencies and independence relations between
variables, which can be useful in making predictions or inferences about the
system. Moreover, it can help us to identify the most probable causes or effects
of an observed event.
• To understand the network as an encoding of a collection of conditional
independence statements: It is crucial for designing efficient inference
procedures. By exploiting the conditional independence relations encoded in
the network, we can significantly reduce the computational complexity of
inference tasks. This is because we can often factorize the joint distribution into
smaller, more manageable conditional probability distributions, which can be
updated efficiently using the observed evidence. This approach is particularly
useful in probabilistic reasoning, where we need to infer the probability
distribution of some unobserved variables given some observed evidence.

Applications of Bayesian Networks in AI

Some of the most common applications of Bayesian networks in AI include:

• Prediction and classification: Bayesian belief networks can be used to predict the probability of an event or classify data into different categories based on a
set of inputs. This is useful in areas such as fraud detection, medical diagnosis,
and image recognition.
• Decision making: Bayesian networks can be used to make decisions based on
uncertain or incomplete information. For example, they can be used to
determine the optimal route for a delivery truck based on traffic conditions and
delivery schedules.
• Risk analysis: Bayesian belief networks can be used to analyze the risks
associated with different actions or events. This is useful in areas such as
financial planning, insurance, and safety analysis.
• Anomaly detection: Bayesian networks can be used to detect anomalies in
data, such as outliers or unusual patterns. This is useful in areas such as
cybersecurity, where unusual network traffic may indicate a security breach.
• Natural language processing: Bayesian belief networks can be used to model
the probabilistic relationships between words and phrases in natural language,
which is useful in applications such as language translation and sentiment
analysis.

Probabilistic Modelling of Problems

A probabilistic model in machine learning is a mathematical representation of a real-world process that incorporates uncertain or random variables. The goal of
probabilistic modeling is to estimate the probabilities of the possible outcomes of a
system based on data or prior knowledge.

Probabilistic models are used in a variety of machine learning tasks such as classification, regression, clustering, and dimensionality reduction. Some popular
probabilistic models include:

o Gaussian Mixture Models (GMMs)
o Hidden Markov Models (HMMs)
o Bayesian Networks
o Markov Random Fields (MRFs)

Probabilistic models allow for the expression of uncertainty, making them particularly
well-suited for real-world applications where data is often noisy or incomplete.
Additionally, these models can often be updated as new data becomes available, which
is useful in many dynamic and evolving systems.

For better understanding, consider the OSIC Pulmonary Fibrosis problem on Kaggle as an example of probabilistic modelling.

Problem Statement: "In this competition, you'll predict a patient's severity of decline
in lung function based on a CT scan of their lungs. You'll determine lung function based
on output from a spirometer, which measures the volume of air inhaled and exhaled.
The challenge is to use machine learning techniques to make a prediction with the
image, metadata, and baseline FVC as input."

Categories Of Probabilistic Models

These models can be classified into the following categories:

• Generative models
• Discriminative models
• Graphical models

Generative models:

Generative models aim to model the joint distribution of the input and output
variables. These models generate new data based on the probability distribution of the
original dataset. Generative models are powerful because they can generate new data
that resembles the training data. They can be used for tasks such as image and speech
synthesis, language translation, and text generation.

Discriminative models

The discriminative model aims to model the conditional distribution of the output
variable given the input variable. They learn a decision boundary that separates the
different classes of the output variable. Discriminative models are useful when the
focus is on making accurate predictions rather than generating new data. They can be
used for tasks such as image recognition, speech recognition, and sentiment analysis.
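The distinction can be made concrete with a small sketch contrasting a generative classifier (Gaussian Naive Bayes, which models P(x | y) and P(y)) with a discriminative one (logistic regression, which models P(y | x) directly); it assumes scikit-learn is available and uses a synthetic dataset.

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # synthetic data

generative = GaussianNB().fit(X, y)                # learns class-conditional densities and priors
discriminative = LogisticRegression().fit(X, y)    # learns the decision boundary directly

print(generative.score(X, y), discriminative.score(X, y))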

Graphical models

These models use graphical representations to show the conditional dependence between variables. They are commonly used for tasks such as image recognition,
natural language processing, and causal inference.
Probabilistic Models in Deep Learning

Deep learning, a subset of machine learning, also relies on probabilistic models. Probabilistic models are used to optimize complex models with many parameters, such
as neural networks. By incorporating uncertainty into the model training process, deep
learning algorithms can provide higher accuracy and generalization capabilities. One
popular technique is variational inference, which allows for efficient estimation of
posterior distributions.

Importance of Probabilistic Models

• Probabilistic models play a crucial role in the field of machine learning, providing a framework for understanding the underlying patterns and
complexities in massive datasets.
• Probabilistic models provide a natural way to reason about the likelihood of
different outcomes and can help us understand the underlying structure of the
data.
• Probabilistic models help enable researchers and practitioners to make
informed decisions when faced with uncertainty.
• Probabilistic models allow us to perform Bayesian inference, which is a powerful
method for updating our beliefs about a hypothesis based on new data. This
can be particularly useful in situations where we need to make decisions under
uncertainty.

Inference in Bayesian Belief Networks

A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency, and each node corresponds to a unique random variable. Formally, if an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the joint probability distribution, so we must know P(B|A) for all values of B and A in order to conduct inference. In the classic sprinkler example (with nodes Cloudy, Sprinkler, Rain, and WetGrass), since Rain has an edge going into WetGrass, P(WetGrass|Rain) will be a factor, whose probability values are specified in a conditional probability table attached to the WetGrass node.

Inference

Inference over a Bayesian network can come in two forms.

The first is simply evaluating the joint probability of a particular assignment of values
for each variable (or a subset) in the network. For this, we already have a factorized
form of the joint distribution, so we simply evaluate that product using the provided
conditional probabilities. If we only care about a subset of variables, we will need to
marginalize out the ones we are not interested in. In many cases, this may result in
underflow, so it is common to take the logarithm of that product, which is equivalent
to adding up the individual logarithms of each term in the product.

The second, more interesting inference task, is to find P(x|e), or, to find the probability
of some assignment of a subset of the variables (x) given assignments of other
variables (our evidence, e). In the above example, an example of this could be to find
P(Sprinkler, WetGrass | Cloudy), where {Sprinkler, WetGrass} is our x, and {Cloudy} is
our e. In order to calculate this, we use the fact that P(x|e) = P(x, e) / P(e) = αP(x, e),
where α is a normalization constant that we will calculate at the end such that P(x|e) +
P(¬x | e) = 1. In order to calculate P(x, e), we must marginalize the joint probability
distribution over the variables that do not appear in x or e, which we will denote as Y.

Note that in larger networks, Y will most likely be quite large, since most inference
tasks will only directly use a small subset of the variables. In cases like these, exact
inference as shown above is very computationally intensive, so methods must be used
to reduce the amount of computation. One more efficient method of exact inference
is through variable elimination, which takes advantage of the fact that each factor only
involves a small number of variables. This means that the summations can be
rearranged such that only factors involving a given variable are used in the
marginalization of that variable. Alternatively, many networks are too large even for
this method, so approximate inference methods such as MCMC are instead used;
these provide probability estimations that require significantly less computation than
exact inference methods.
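A minimal sketch of exact inference by enumeration on the sprinkler network mentioned above; the conditional probability values below are the commonly quoted textbook numbers and should be treated as illustrative assumptions.

# Exact inference by enumeration for P(Sprinkler=True, WetGrass=True | Cloudy=True).
P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}   # P(S=s | C=c) as P_S[c][s]
P_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}   # P(R=r | C=c) as P_R[c][r]
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}                        # P(W=True | S, R)

def joint(c, s, r, w):
    pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw        # factorized joint distribution

# Marginalize out Rain (the variable not in x or e), then normalize by the evidence P(Cloudy=True).
numerator = sum(joint(True, True, r, True) for r in (True, False))
evidence = sum(joint(True, s, r, w) for s in (True, False) for r in (True, False) for w in (True, False))
print(numerator / evidence)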

Probability Density Estimation

Probability Density: Assume a random variable x that has a probability distribution p(x). The relationship between the outcomes of a random variable and its probability
is referred to as the probability density.

The problem is that we don’t always know the full probability distribution for a random
variable. This is because we only use a small subset of observations to derive the
outcome. This problem is referred to as Probability Density Estimation as we use
only a random sample of observations to find the general density of the whole sample
space.

Probability Density Function (PDF)

A PDF is a function that tells the probability of the random variable from a sub-sample
space falling within a particular range of values and not just one value. It tells the
likelihood of the range of values in the random variable sub-space being the same as
that of the whole sample.
By definition, if X is any continuous random variable, then the function f(x) is called a probability density function if:

P(a ≤ X ≤ b) = ∫ f(x) dx   (integrated from x = a to x = b)

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
Steps Involved:

Step 1 - Create a histogram for the random set of observations to understand the density of the random sample.

Step 2 - Create the probability density function and fit it on the random sample. Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:
3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.

After fitting, the histogram of the random sample should closely match the histogram of the whole population.
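A minimal sketch of these steps, assuming the sample comes from a normal distribution; scipy's norm.fit returns the maximum likelihood estimates of the mean and standard deviation.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.5, size=500)            # synthetic observations

hist, edges = np.histogram(sample, bins=30, density=True)    # Step 1: histogram of the sample
mu_hat, sigma_hat = norm.fit(sample)                         # Step 3.1: estimate the parameters
centers = (edges[:-1] + edges[1:]) / 2
pdf = norm.pdf(centers, mu_hat, sigma_hat)                   # Step 2/3.2: fitted PDF at bin centres
print(mu_hat, sigma_hat, np.abs(hist - pdf).mean())          # Step 3.3: compare PDF against the data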

Density Estimation: It is the process of finding out the density of the whole
population by examining a random sample of data from that population. One of the
best ways to achieve a density estimate is by using a histogram plot.

Parametric Density Estimation

A normal distribution has two parameters: the mean and the standard deviation. We calculate the sample mean and standard deviation of the random sample taken from this population to estimate the density of the random sample. It is termed 'parametric' because the relation between the observations and their probability depends on the values of these two parameters.

Now, it is important to understand that the mean and standard deviation of this random sample will not be exactly the same as those of the whole population, due to its small size.
Problems with Probability Distribution Estimation

Probability distribution estimation relies on finding the best PDF and determining its parameters accurately. But the random data sample that we consider is often very small, so it becomes very difficult to determine which parameters and which probability distribution function to use. To tackle this problem, Maximum Likelihood Estimation is used.

Maximum Likelihood Estimation

It is a method of determining the parameters (mean, standard deviation, etc.) of normally distributed random sample data, or, more generally, of finding the best-fitting PDF over the random sample data. This is done by maximizing the likelihood function so that the PDF best fits the random sample. Another way to look at it is that MLE finds the mean and standard deviation under which the random sample is most likely to have been drawn from the whole population.

Sequence Models

Sequence models are machine learning models whose inputs or outputs are sequences of data. Sequential data includes text streams, audio clips, video clips, time-series data, and so on. Recurrent Neural Networks (RNNs) are a popular algorithm used in sequence models.

Applications of Sequence Models


1. Speech recognition: In speech recognition, an audio clip is given as an input and
then the model has to generate its text transcript. Here both the input and output are
sequences of data.

2. Sentiment Classification: In sentiment classification, opinions expressed in a piece of text are categorized. Here the input is a sequence of words.

3. Video Activity Recognition: In video activity recognition, the model needs to identify
the activity in a video clip. A video clip is a sequence of video frames, therefore in case
of video activity recognition input is a sequence of data.

These examples show that there are different applications of sequence models.
Sometimes both the input and output are sequences, in some either the input or the
output is a sequence. Recurrent neural network (RNN) is a popular sequence model
that has shown efficient performance for sequential data.

Sequence models have been motivated by the analysis of sequential data such as text sentences, time series, and other discrete sequence data. These models are especially designed to handle sequential information, while Convolutional Neural Networks are better suited to processing spatial information.

The key point for sequence models is that the data we are processing are no longer independently and identically distributed (i.i.d.) samples; the data carry dependencies due to their sequential order.

Sequence models are very popular for speech recognition, voice recognition, time-series prediction, and natural language processing.

Markov Models

What is a Markov model?

A Markov model is a stochastic method for randomly changing systems that possess
the Markov property. This means that, at any given time, the next state is only
dependent on the current state and is independent of anything in the past. Two
commonly applied types of Markov model are used when the system being
represented is autonomous -- that is, when the system isn't influenced by an external
agent. These are as follows:

1. Markov chains. These are the simplest type of Markov model and are used to
represent systems where all states are observable. Markov chains show all
possible states, and between states, they show the transition rate, which is
the probability of moving from one state to another per unit of time.
Applications of this type of model include prediction of market crashes, speech
recognition and search engine algorithms.
2. Hidden Markov models. These are used to represent systems with some
unobservable states. In addition to showing states and transition rates, hidden
Markov models also represent observations and observation likelihoods for
each state. Hidden Markov models are used for a range of applications,
including thermodynamics, finance and pattern recognition.

Another two commonly applied types of Markov model are used when the system
being represented is controlled -- that is, when the system is influenced by a decision-
making agent. These are as follows:

1. Markov decision processes. These are used to model decision-making in discrete, stochastic, sequential environments. In these processes, an agent
makes decisions based on reliable information. These models are applied to
problems in artificial intelligence (AI), economics and behavioral sciences.
2. Partially observable Markov decision processes. These are used in cases like
Markov decision processes but with the assumption that the agent doesn't
always have reliable information. Applications of these models include robotics,
where it isn't always possible to know the location. Another application is
machine maintenance, where reliable information on machine parts can't be
obtained because it's too costly to shut down the machine to get the
information.

How is Markov analysis applied?

Markov analysis is a probabilistic technique that uses Markov models to predict the
future behavior of some variable based on the current state. Markov analysis is used
in many domains, including the following:

• Markov chains are used for several business applications, including predicting
customer brand switching for marketing, predicting how long people will
remain in their jobs for human resources, predicting time to failure of a machine
in manufacturing, and forecasting the future price of a stock in finance.
• Markov analysis is also used in natural language processing (NLP) and in
machine learning. For NLP, a Markov chain can be used to generate a sequence
of words that form a complete sentence, or a hidden Markov model can be used
for named-entity recognition and tagging parts of speech. For machine
learning, Markov decision processes are used to represent reward in
reinforcement learning.
• A recent example of the use of Markov analysis in healthcare was in Kuwait.
A continuous-time Markov chain model was used to determine the optimal
timing and duration of a full COVID-19 lockdown in the country, minimizing
both new infections and hospitalizations. The model suggested that a 90-day
lockdown beginning 10 days before the epidemic peak was optimal.

How are Markov models represented?

The simplest Markov model is a Markov chain, which can be expressed in equations,
as a transition matrix or as a graph. A transition matrix is used to indicate the
probability of moving from each state to each other state. Generally, the current states
are listed in rows, and the next states are represented as columns. Each cell then
contains the probability of moving from the current state to the next state. For any
given row, all the cell values must then add up to one.

A graph consists of circles, each of which represents a state, and directional arrows to
indicate possible transitions between states. The directional arrows are labeled with
the transition probability. The transition probabilities on the directional arrows coming
out of any given circle must add up to one.

Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.
As an example, a transition matrix can represent shifting gears in a car with a manual transmission. Six states are possible, and a transition from any given state to any other state depends only on the current state -- that is, where the car goes from second gear isn't influenced by where it was before second gear. Such a transition matrix might be built from empirical observations that show, for example, that the most probable transitions from first gear are to second or neutral.


Another example is the toss of a coin. Two states are possible: heads and tails. The transition from heads to heads or heads to tails is equally probable (.5) and is independent of all preceding coin tosses.

In the corresponding graph, the circles represent the two possible states -- heads or tails -- and the arrows show the possible states the system could transition to in the next step, each labeled with the transition probability of .5.
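A minimal sketch of this two-state chain as a transition matrix; each row sums to one, and the next state is sampled from the row of the current state.

import numpy as np

states = ["Heads", "Tails"]
T = np.array([[0.5, 0.5],        # from Heads: P(Heads), P(Tails)
              [0.5, 0.5]])       # from Tails: P(Heads), P(Tails)
assert np.allclose(T.sum(axis=1), 1.0)   # each row of a transition matrix sums to one

rng = np.random.default_rng(2)
state = 0                        # start at Heads
for _ in range(5):
    state = rng.choice(2, p=T[state])    # the next state depends only on the current state
    print(states[state])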

History of the Markov chain


Markov chains are named after their creator, Andrey Andreyevich Markov, a Russian
mathematician who founded a new branch of probability theory around stochastic
processes in the early 1900s. Markov was greatly influenced by his teacher and mentor,
Pafnuty Chebyshev, whose work also broke new ground in probability theory.

Hidden Markov Model

Hidden Markov Model (HMM) is a statistical model that is used to describe the
probabilistic relationship between a sequence of observations and a sequence of
hidden states. It is often used in situations where the underlying system or process
that generates the observations is unknown or hidden, hence it got the name “Hidden
Markov Model.”

It is used to predict future observations or classify sequences, based on the underlying hidden process that generates the data.

An HMM consists of two types of variables: hidden states and observations.

• The hidden states are the underlying variables that generate the observed
data, but they are not directly observable.
• The observations are the variables that are measured and observed.

The relationship between the hidden states and the observations is modeled using a probability distribution. The Hidden Markov Model (HMM) captures this relationship between the hidden states and the observations using two sets of probabilities: the transition probabilities and the emission probabilities.

• The transition probabilities describe the probability of transitioning from one hidden state to another.
• The emission probabilities describe the probability of observing an output
given a hidden state.

Hidden Markov Model Algorithm

The Hidden Markov Model (HMM) algorithm can be implemented using the following
steps:

Step 1: Define the state space and observation space

The state space is the set of all possible hidden states, and the observation space is
the set of all possible observations.

Step 2: Define the initial state distribution

This is the probability distribution over the initial state.


Step 3: Define the state transition probabilities

These are the probabilities of transitioning from one state to another. This forms the
transition matrix, which describes the probability of moving from one state to another.

Step 4: Define the observation likelihoods:

These are the probabilities of generating each observation from each state. This forms
the emission matrix, which describes the probability of generating each observation
from each state.

Step 5: Train the model

The parameters of the state transition probabilities and the observation likelihoods are
estimated using the Baum-Welch algorithm, or the forward-backward algorithm. This
is done by iteratively updating the parameters until convergence.

Step 6: Decode the most likely sequence of hidden states

Given the observed data, the Viterbi algorithm is used to compute the most likely
sequence of hidden states. This can be used to predict future observations, classify
sequences, or detect patterns in sequential data.

Step 7: Evaluate the model

The performance of the HMM can be evaluated using various metrics, such as accuracy,
precision, recall, or F1 score.

To summarize, the HMM algorithm involves defining the state space, observation
space, and the parameters of the state transition probabilities and observation
likelihoods, training the model using the Baum-Welch algorithm or the forward-
backward algorithm, decoding the most likely sequence of hidden states using the
Viterbi algorithm, and evaluating the performance of the model.
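Step 6 can be illustrated with a minimal Viterbi sketch for a toy HMM with two hidden states and two observation symbols; all of the probabilities below are hypothetical and chosen only for illustration.

import numpy as np

start = np.array([0.6, 0.4])                 # initial state distribution
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix: P(next state | current state)
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission matrix: P(observation | state)
obs = [0, 1, 1, 0]                           # observed sequence

n_states, n_steps = 2, len(obs)
delta = np.zeros((n_steps, n_states))        # best path probability ending in each state
psi = np.zeros((n_steps, n_states), dtype=int)   # backpointers
delta[0] = start * emit[:, obs[0]]
for t in range(1, n_steps):
    for j in range(n_states):
        scores = delta[t - 1] * trans[:, j]
        psi[t, j] = scores.argmax()
        delta[t, j] = scores.max() * emit[j, obs[t]]

# Backtrack the most likely sequence of hidden states
path = [delta[-1].argmax()]
for t in range(n_steps - 1, 0, -1):
    path.append(psi[t, path[-1]])
print(path[::-1])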

HMMs are widely used in a variety of applications such as speech recognition, natural
language processing, computational biology, and finance. In speech recognition, for
example, an HMM can be used to model the underlying sounds or phonemes that
generate the speech signal, and the observations could be the features extracted from
the speech signal. In computational biology, an HMM can be used to model the
evolution of a protein or DNA sequence, and the observations could be the sequence
of amino acids or nucleotides.
Unit – 5. Neural Networks and Deep Learning
Neural Networks

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of
machine learning process, called deep learning, that uses interconnected
nodes or neurons in a layered structure that resembles the human brain. It
creates an adaptive system that computers use to learn from their mistakes
and improve continuously. Thus, artificial neural networks attempt to solve
complicated problems, like summarizing documents or recognizing faces, with
greater accuracy.

Why are neural networks important?

Neural networks can help computers make intelligent decisions with limited
human assistance. This is because they can learn and model the relationships
between input and output data that are nonlinear and complex. For instance,
they can do the following tasks.

Make generalizations and inferences

Neural networks can comprehend unstructured data and make general observations without explicit training. For instance, they can recognize that
two different input sentences have a similar meaning:

• Can you tell me how to make the payment?
• How do I transfer money?

A neural network would know that both sentences mean the same thing. Or it
would be able to broadly recognize that Baxter Road is a place, but Baxter
Smith is a person’s name.

What are neural networks used for?

Neural networks have several use cases across many industries, such as the
following:

• Medical diagnosis by medical image classification


• Targeted marketing by social network filtering and behavioral data
analysis
• Financial predictions by processing historical data of financial
instruments
• Electrical load and energy demand forecasting
• Process and quality control
• Chemical compound identification

We give four of the important applications of neural networks below.


Computer vision

Computer vision is the ability of computers to extract information and insights from images and videos. With neural networks, computers can distinguish and
recognize images similar to humans. Computer vision has several applications,
such as the following:

• Visual recognition in self-driving cars so they can recognize road signs and other road users
• Content moderation to automatically remove unsafe or inappropriate
content from image and video archives
• Facial recognition to identify faces and recognize attributes like open
eyes, glasses, and facial hair
• Image labeling to identify brand logos, clothing, safety gear, and other
image details

Speech recognition

Neural networks can analyze human speech despite varying speech patterns,
pitch, tone, language, and accent. Virtual assistants like Amazon Alexa and
automatic transcription software use speech recognition to do tasks like these:

• Assist call center agents and automatically classify calls


• Convert clinical conversations into documentation in real time
• Accurately subtitle videos and meeting recordings for wider content
reach

Natural language processing

Natural language processing (NLP) is the ability to process natural, human-created text. Neural networks help computers gather insights and meaning
from text data and documents. NLP has several use cases, including in these
functions:

• Automated virtual agents and chatbots


• Automatic organization and classification of written data
• Business intelligence analysis of long-form documents like emails and
forms
• Indexing of key phrases that indicate sentiment, like positive and
negative comments on social media
• Document summarization and article generation for a given topic

Recommendation engines

Neural networks can track user activity to develop personalized recommendations. They can also analyze all user behavior and discover new
products or services that interest a specific user. For example, Curalate, a
Philadelphia-based startup, helps brands convert social media posts into sales.
Brands use Curalate’s intelligent product tagging (IPT) service to automate the
collection and curation of user-generated social content. IPT uses neural
networks to automatically find and recommend products relevant to the user’s
social media activity. Consumers don't have to hunt through online catalogs to
find a specific product from a social media image. Instead, they can use
Curalate’s auto product tagging to purchase the product with ease.

How do neural networks work?

The human brain is the inspiration behind neural network architecture. Human
brain cells, called neurons, form a complex, highly interconnected network and
send electrical signals to each other to help humans process information.
Similarly, an artificial neural network is made of artificial neurons that work
together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that,
at their core, use computing systems to solve mathematical calculations.

Simple neural network architecture

A basic neural network has interconnected artificial neurons in three layers:

Input Layer

Information from the outside world enters the artificial neural network from
the input layer. Input nodes process the data, analyze or categorize it, and
pass it on to the next layer.

Hidden Layer

Hidden layers take their input from the input layer or other hidden layers.
Artificial neural networks can have a large number of hidden layers. Each
hidden layer analyzes the output from the previous layer, processes it further,
and passes it on to the next layer.

Output Layer

The output layer gives the final result of all the data processing by the artificial
neural network. It can have single or multiple nodes. For instance, if we have
a binary (yes/no) classification problem, the output layer will have one output
node, which will give the result as 1 or 0. However, if we have a multi-class
classification problem, the output layer might consist of more than one output
node.

Deep neural network architecture

Deep neural networks, or deep learning networks, have several hidden layers
with millions of artificial neurons linked together. A number, called weight,
represents the connections between one node and another. The weight is a
positive number if one node excites another, or negative if one node
suppresses the other. Nodes with higher weight values have more influence
on the other nodes.
Theoretically, deep neural networks can map any input type to any output
type. However, they also need much more training as compared to other
machine learning methods. They need millions of examples of training data
rather than perhaps the hundreds or thousands that a simpler network might
need.

What are the types of neural networks?

Artificial neural networks can be categorized by how the data flows from the
input node to the output node. Below are some examples:

Feedforward neural networks

Feedforward neural networks process data in one direction, from the input node to the output node. Every node in one layer is connected to every node in the next layer. During training, a feedforward network uses a feedback process (backpropagation) to improve its predictions over time.

Backpropagation algorithm

Artificial neural networks learn continuously by using corrective feedback loops to improve their predictive analytics. In simple terms, you can think of the
data flowing from the input node to the output node through many different
paths in the neural network. Only one path is the correct one that maps the
input node to the correct output node. To find this path, the neural network
uses a feedback loop, which works as follows:

1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to
paths that lead to more correct guesses and lower weight values to node
paths that lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the
higher weight paths and then repeat Step 1.

Convolutional neural networks

The hidden layers in convolutional neural networks perform specific mathematical functions, like summarizing or filtering, called convolutions.
They are very useful for image classification because they can extract relevant
features from images that are useful for image recognition and classification.
The new form is easier to process without losing features that are critical for
making a good prediction. Each hidden layer extracts and processes different
image features, like edges, color, and depth.

Biological Motivation

The motivation behind neural networks is the human brain. The human brain is often called the best processor, even though it works more slowly than computers. Many researchers wanted to build a machine that works on the principles of the human brain.

The human brain contains billions of neurons, each connected to many other neurons to form a network, so that when it sees an image it recognizes the image and produces an output.

• Dendrites receive signals from other neurons.
• The cell body sums the incoming signals to generate the input.
• When the sum reaches a threshold value, the neuron fires and the signal travels down the axon to other neurons.
• The amount of signal transmitted depends upon the strength of the connections.
• Connections can be inhibitory, i.e., decreasing strength, or excitatory, i.e., increasing strength.

In a similar manner, the idea arose to build artificial interconnected neurons, like biological neurons, making up an Artificial Neural Network (ANN). Each neuron is capable of taking a number of inputs and producing an output. Neurons in the human brain can make very complex decisions because they run many parallel processes for a particular task. One motivation for ANNs is to perform a particular task through many such parallel processes.

Perceptron

The Perceptron - Simple Model of Neural Networks

The Perceptron is a linear model used for binary classification. The perceptron is more widely known as a “single-layer perceptron” in neural
network research to distinguish it from its successor the “multilayer
perceptron.” As a basic linear classifier, we consider the single-layer
perceptron to be the simplest form of the family of feed-forward neural
networks.
Definition of the Perceptron

The perceptron is a linear-model binary classifier with a simple input–output relationship: we sum n inputs times their associated weights and then send this “net input” to a step function with a defined threshold. Typically, with the perceptron, this is a Heaviside step function with a threshold value of 0.5. This function outputs a single binary value (0 or 1), depending on the input.

We can model the decision boundary and the classification output with the Heaviside step function, as follows:

output = 1 if (w · x + bias) ≥ threshold, otherwise output = 0

To produce the net input to the activation function (here, the Heaviside step function) we take the dot product of the input and the connection weights. The output of the step function (activation function) is the output for the perceptron and gives us a classification of the input values.

If the bias value is negative, it forces the learned weighted sum to be a much greater value to get a classification output of 1. The bias term in this capacity moves the decision boundary around for the model. Input values do not affect the bias term, but the bias term is learned through the perceptron learning algorithm.
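A minimal sketch of a single perceptron with a step activation and the classic perceptron learning rule; here the threshold is folded into a learned bias (so the step fires at 0), and the AND-gate data is purely illustrative.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND-gate targets
w = np.zeros(2)                                  # connection weights
b = 0.0                                          # bias, learned like a weight
lr = 0.1                                         # learning rate

def step(z):
    return 1 if z >= 0 else 0                    # Heaviside step activation

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = step(np.dot(w, xi) + b)           # net input = dot(weights, inputs) + bias
        w += lr * (target - pred) * xi           # perceptron learning rule
        b += lr * (target - pred)

print([step(np.dot(w, xi) + b) for xi in X])     # should reproduce the AND targets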

History of the Perceptron

The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. Early versions were intended to be
implemented as a physical machine rather than a software program. The first
software implementation was for the IBM 704, and then later it was
implemented in the Mark I Perceptron machine. It also should be noted that
McCulloch and Pitts introduced the basic concept of analysing neural activity
in 1943 based on thresholds and weighted sums. These concepts were key in
developing a model for later variations like the perceptron.

Multi-layer Perceptron

It is similar to a single-layer perceptron model but has additional hidden layers. In a multilayer perceptron, a group of neurons is organized in multiple layers. Every neuron in the first layer takes the input signal and sends a response to the neurons in the second layer, and so on. A multilayer perceptron model has greater processing power and can process linear and non-linear patterns.
• Forward Stage: Activations flow from the input layer, through the activation functions, and terminate at the output layer.

• Backward Stage: In the backward stage, weight and bias values are modified according to the model's requirement: the error between the actual output and the desired output is propagated backward from the output layer.

Primary Components of a Perceptron

1. Neurons - A neural network is made up of a collection of units or nodes which are called neurons.
2. Synapse - A neuron can send information or signals through the synapse to another adjacent neuron, which processes them and signals the subsequent one. This process in the perceptron algorithm continues until an output signal is generated.
3. Input Nodes or Input Layer - This is the primary component of
Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical
value. All the features on which we want to train the neural network are taken as inputs in the perceptron algorithm. Inputs are denoted as x1, x2, x3, ..., xn, where 'x' indicates the feature value and 'n' the total number of features.
4. Weight - The weight parameter represents the strength of the connection between units. This is another important parameter of the perceptron's components. Weights are values calculated during the training of the model: initially we pass some random values as the weights, and these values are updated automatically after each training error. In some cases, weights are also called weight coefficients, used in the hidden layers and denoted w1, w2, w3, w4, ... wn. Weight is directly proportional to the strength of the associated input neuron in deciding the output.
5. Bias – Bias is a special input type which allows the classifier to
move the decision boundary around from its original position. The
objective of the bias is to shift each point in a particular direction
for a specified distance. Bias allows for higher quality and faster
model training.

If you notice, we pass the constant value one as an extra input at the start, with W0 as its weight. Bias is an element that adjusts the boundary away from the origin to move the activation function left, right, up or down. Since we want this to be independent of the input features, we add the constant one so that the features will not affect it; this value is known as the bias.

6. Weighted Summation - The multiplication of every feature or input value (xi) with its corresponding weight value (wi) gives us a sum of values that is called the weighted summation. This weighted sum is passed on to the so-called activation function.

Weights sum = ∑Wi * Xi (from i=1 to i=n) + (W0 * 1)


7. Activation/Step Function – The activation function applies a step rule which converts the numerical value to 0 or 1 so that it is easy to classify the data. This is the process of reaching a result or outcome that helps determine whether the neuron will fire or not.

Activation Function can be considered primarily as a step function.

Types of Activation functions:

• Sign function

• Step function, and

• Sigmoid function

Based on the type of value we need as output, we can change the activation function. The step function can be used depending on the value required. The sigmoid function and the sign function can be used for values between 0 and 1, and between -1 and 1, respectively. The hyperbolic tangent is a zero-centred function, which makes training multi-layer neural networks easier. The Rectified Linear Unit (ReLU) is another commonly used activation function that is computationally efficient; it outputs zero for values less than zero and the input itself for values greater than zero.

The data scientist chooses the activation function based on the particular problem statement and the desired outputs. The activation function used in perceptron models may differ (e.g., sign, step, or sigmoid) depending on whether the learning process is slow or suffers from vanishing or exploding gradients.

Feed Forward Network

A Feed Forward Neural Network is an artificial neural network in which the connections between nodes do not form cycles. A recurrent neural network, in which some routes are cycled, is the polar opposite of a feed-forward neural network. The feed-forward model is the basic type of neural network because the input is only processed in one direction. The data always flows in one direction and never backwards.

In its most basic form, a feed-forward neural network is a single-layer perceptron. A sequence of inputs enters the layer and is multiplied by the weights in this model. The weighted input values are then summed together to form a total. If the sum of the values is above a predetermined threshold, which is normally set at zero, the output value is usually 1, and if the sum is below the threshold, the output value is usually -1. The single-layer perceptron is a popular feed-forward neural network model that is frequently used for classification. Single-layer perceptrons can also contain machine learning features.

The neural network can compare the outputs of its nodes with the desired values using a property known as the delta rule, allowing the network to alter its weights through training to create more accurate output values. This training and learning procedure is a form of gradient descent. The technique of updating weights in multi-layered perceptrons is virtually the same; however, the process is referred to as back-propagation. In such circumstances, the output values provided by the final layer are used to alter the weights of each hidden layer inside the network.

Back Propagation

Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained
in the previous epoch (i.e., iteration). Proper tuning of the weights allows you
to reduce error rates and make the model reliable by increasing its
generalization.

Backpropagation in neural networks is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. This
method helps calculate the gradient of a loss function with respect to all the
weights in the network.

How Backpropagation Algorithm Works

The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.

Consider the following steps to understand how the backpropagation algorithm works:

1. Inputs X arrive through the preconnected path.
2. Input is modeled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
4. Calculate the error in the outputs

ErrorB= Actual Output – Desired Output

5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.

Keep repeating the process until the desired output is achieved.
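A minimal backpropagation sketch: a tiny two-layer network with sigmoid units trained on the XOR problem; the hidden-layer size, learning rate, and number of epochs are illustrative choices, not prescribed values.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)       # XOR targets

W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))   # input -> hidden (4 hidden units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))   # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # Steps 1-3: forward pass through the hidden and output layers
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 4: error at the output; Step 5: gradients via the chain rule, propagated backwards
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # outputs should move towards the XOR targets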

Why We Need Backpropagation?

Most prominent advantages of Backpropagation are:

• Backpropagation is fast, simple and easy to program
• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the
network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to
be learned.

Activation and Loss Functions

Activation Function

The activation function of a neuron defines its output given its inputs. The activation function activates the neuron as required for the desired output and converts linear input into non-linear output. In neural networks, activation functions, also known as transfer functions, define how the weighted sum of the input is transformed into an output via the nodes in a layer of the network. They are treated as a crucial part of neural network design.

In hidden layers, the selection of the activation function controls how well a network model learns the training dataset, while in the output layer it determines the types of predictions a model can generate.

1. Sigmoid Function:

Description: Takes a real-valued number and scales it between 0 and 1. Large negative numbers become 0 and large positive numbers become 1. Range: (0, 1)
Pros: As its range is between 0 and 1, it is ideal for situations where we need
to predict the probability of an event as an output.

Cons: The gradient values are significant in the range -3 to 3 but become much closer to zero beyond this range, which almost kills the impact of the neuron on the final output. Also, sigmoid outputs are not zero-centred (they are centred around 0.5), which leads to undesirable zig-zagging dynamics in the gradient updates for the weights.


2. Tanh Function:

Description: Similar to sigmoid but takes a real-valued number and scales it between -1 and 1. It is better than sigmoid as it is centred around 0, which leads to better convergence. Range: (-1, 1)

Pros: The derivatives of the tanh are larger than the derivatives of the sigmoid
which help us minimize the cost function faster

Cons: Similar to sigmoid, the gradient values become close to zero for wide
range of values (this is known as vanishing gradient problem). Thus, the
network refuses to learn or keeps learning at a very small rate.

3. Softmax Function:

Description: The softmax function can be imagined as a combination of multiple sigmoids which returns the probability of a data point belonging to each individual class in a multiclass classification problem. Range: (0, 1), sum of outputs = 1

Pros: Can handle multiple classes and give the probability of belonging to each
class

Cons: Should not be used in hidden layers as we want the neurons to be independent. If we apply it then they will be linearly dependent.

4. ReLU Function:

Description: The rectified linear activation function, or ReLU for short, is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero. This is the default formulation, but modifying the default parameters allows us to use non-zero thresholds and to use a non-zero multiple of the input for values below the threshold (called Leaky ReLU). Range: (0, inf)

Pros: Although ReLU looks and acts like a linear function, it is a nonlinear function allowing complex relationships to be learned, and it allows learning through all the hidden layers in a deep network by having large derivatives.

Cons: It should not be used as the final output layer for either
classification/regression tasks

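The four functions described above can be written in a few lines; a minimal numpy sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # range (0, 1)

def tanh(z):
    return np.tanh(z)                    # range (-1, 1), zero-centred

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()                   # outputs are positive and sum to 1

def relu(z):
    return np.maximum(0.0, z)            # zero for negative inputs, identity otherwise

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z))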
Loss Functions

The other key aspect in setting up the neural network infrastructure is selecting
the right loss functions. With neural networks, we seek to minimize the error
(difference between actual and predicted value) which is calculated by the loss
function.

1. Mean Squared Error, L2 Loss

Description: MSE loss is used for regression tasks. As the name suggests, this
loss is calculated by taking the mean of squared differences between
actual(target) and predicted values. Range: (0, inf)

Formula: MSE = (1/m) * Σ (yi - ŷi)^2, where yi is the actual value, ŷi is the predicted value, and m is the number of records.

Pros: Preferred loss function if the distribution of the target variable is Gaussian, as it has good derivatives and helps the model converge quickly.
Cons: It is not robust to outliers in the data (unlike loss functions such as Mean Absolute Error) and penalizes large over- and under-predictions heavily (unlike Mean Squared Logarithmic Error Loss).

2. Binary Cross Entropy

Description: BCE loss is the default loss function used for binary classification tasks. It requires one output node to classify the data into two classes, with the output in the range (0-1), i.e., the sigmoid activation function should be used. Range: (0, inf)

Formula: BCE = -(1/m) * Σ [ y * log(ŷ) + (1 - y) * log(1 - ŷ) ]

where y is the actual label, ŷ is the classifier's predicted probability for the positive class, and m is the number of records.

Pros: The continuous nature of the loss function helps the training process converge well.

Cons: Can only be used with sigmoid activation function. Other loss functions
like Hinge or Squared Hinge Loss can work with tanh activation function

3. Categorical Cross Entropy

Description: It is the default loss function when we have a multi-class classification task. It requires the same number of output nodes as the classes, with the final layer going through a softmax activation so that each output node has a probability value between (0-1). Range: (0, inf)

Formula: CCE = -(1/m) * Σi Σj yij * log(pij)
where yij indicates whether record i belongs to class j, pij is the classifier's predicted probability that record i belongs to class j, and m is the number of records.

Pros: Similar to Binary Cross Entropy, the continuous nature of the loss function helps the training process converge well.

Cons: May require a one-hot encoded vector with many zero values if there are many classes, requiring significant memory (Sparse Categorical Cross Entropy should be used in this case).
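A minimal numpy sketch of the three loss functions above, evaluated on hypothetical targets and predictions:

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)            # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    return -np.mean(np.sum(y_true_onehot * np.log(p_pred + eps), axis=1))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 4.0])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.eye(3)[[0, 2]], np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])))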

Limitations of Machine Learning

Nothing is perfect in the world. Machine learning has some serious limitations, which are described below.

1. Data Acquisition

The whole concept of machine learning is about identifying useful data. The outcome will be incorrect if a credible data source is not provided. The quality of the data is also significant: if the user or institution needs higher-quality data, they must wait for it, which causes delays in providing the output. So, machine learning significantly depends on the data and its quality.

2. Time and Resources

The data that machines process remains huge in quantity and differs greatly. Machines require time so that their algorithms can adjust to the environment and learn from it. Trial runs are held to check the accuracy and reliability of the machine. Setting up that quality of infrastructure requires massive and expensive resources and high-quality expertise. Trial runs are costly in terms of both time and expense.

3. Results Interpretations

A key limitation of machine learning is that the results we interpret from it
cannot be one hundred percent accurate; they always carry some degree
of inaccuracy. For a high degree of accuracy, algorithms should be developed
so that they give reliable results.

4. High Error Chances

Errors committed during the initial stages can be large, and if not corrected at
that time, they create havoc later. Bias and incorrectness have to be dealt with
separately, as they are not interconnected. Machine learning depends on two
factors, the data and the algorithm, and all errors trace back to these two
variables; any flaw in either one has huge repercussions on the output.

5. Social Changes

Machine learning is bringing numerous changes to society, and the role of
machine learning-based technology has increased many-fold. It influences how
people think and can create unwanted problems; misuse of the technology for
character assassination and the exposure of sensitive details disturb the social
fabric of society.

6. Elimination of Human Interface

Automation, artificial intelligence, and machine learning have removed the
human interface from some work, eliminating employment opportunities in the
process. Much of that work is now carried out with the help of artificial
intelligence and machine learning.

7. Changing Nature of Jobs

With the advancement of machine learning, the nature of jobs is changing.
Work that was earlier done by humans is increasingly done by machines,
displacing those jobs, and it is difficult for people without a technical
education to adjust to these changes.

8. Highly Expensive
Machine learning software and infrastructure are highly expensive, so not
everybody can own them; they are mostly owned by government agencies, big
private firms, and enterprises. They need to become more accessible for wide use.

9. Privacy Concern

One of the pillars of machine learning is data, and the collection of data raises
a fundamental question of privacy. The way data is collected and used for
commercial purposes has always been a contentious issue. In India, the Supreme
Court has declared privacy a fundamental right, so data cannot be collected,
used, or stored without the user's permission. Nevertheless, many cases have
come to light of big firms collecting data without the user's knowledge and
using it for commercial gain.

10. Research and Innovations

Machine learning is an evolving concept. The area has not yet seen developments
that fully revolutionize any economic sector, and it requires continuous
research and innovation.

Deep Learning

What is Deep Learning (DL)?

Deep learning is a specific subfield of machine learning: a new take on learning
representations from data that puts an emphasis on learning successive layers
of increasingly meaningful representations. Other appropriate names for the
field could have been layered representations learning or hierarchical
representations learning.

"Deep" in Deep Learning

The deep in deep learning isn’t a reference to any kind of deeper understanding
achieved by the approach; rather, it stands for this idea of successive layers
of representations.

Depth of the model - How many layers contribute to a model of the data is
called the depth of the model.

No. of layers

Modern deep learning often involves tens or even hundreds of successive
layers of representations, and they're all learned automatically from exposure
to training data. Meanwhile, other approaches to machine learning tend to
focus on learning only one or two layers of representations of the data; hence,
they're sometimes called shallow learning.

How do layers learn?

In deep learning, these layered representations are (almost always) learned
via models called neural networks, structured in literal layers stacked on top
of each other. The term neural network is a reference to neurobiology, but
although some of the central concepts in deep learning were developed in part
by drawing inspiration from our understanding of the brain, deep-learning
models are not models of the brain. There's no evidence that the brain
implements anything like the learning mechanisms used in modern deep-
learning models. For our purposes, deep learning is a mathematical framework
for learning representations from data.

Working of Neural networks with Example - Digit Classification

What do the representations learned by a deep-learning algorithm look like?


Let's examine how a network several layers deep (see figure 1.5) transforms
an image of a digit in order to recognize what digit it is.

As you can see in figure 1.6, the network transforms the digit image into
representations that are increasingly different from the original image and
increasingly informative about the final result.
You can think of a deep network as a multistage information-distillation operation, where
information goes through successive filters and comes out increasingly purified
(that is, increasingly useful with regard to some task). So that's what deep learning is,
technically: a multistage way to learn data representations.

How does this learning happen?

Machine learning is about mapping inputs (such as images) to
targets (such as the label "cat"), which is done by observing many examples
of inputs and targets. Deep neural networks do this input-to-target mapping via
a deep sequence of simple data transformations (layers), and these data
transformations are learned by exposure to examples.

Weights AKA Parameters

The specification of what a layer does to its input data is stored in the layer’s
weights, which in essence are a bunch of numbers. In technical terms, we’d
say that the transformation implemented by a layer is parameterized by its
weights (see figure 1.7). (Weights are also sometimes called the parameters
of a layer.)
In this context, learning means finding a set of values for the weights of all
layers in a network, such that the network will correctly map example inputs
to their associated targets. But here’s the thing: a deep neural network can
contain tens of millions of parameters. Finding the correct value for all of them
may seem like a daunting task, especially given that modifying the value of
one parameter will affect the behaviour of all the others!
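As an illustration of what "parameterized by its weights" means, here is a minimal NumPy sketch of a single dense layer, where the weight matrix W and bias b are the numbers that training must find (the sizes chosen here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# A layer's transformation is fully determined by its weights (W, b).
W = rng.normal(size=(4, 3))   # 3 inputs -> 4 outputs
b = np.zeros(4)

def dense_relu_layer(x):
    # output = relu(W . x + b); changing W or b changes what the layer computes.
    return np.maximum(0.0, W @ x + b)

x = np.array([0.5, -1.2, 3.0])
print(dense_relu_layer(x))
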

Loss function AKA Objective function

To control something, first you need to be able to observe it. To control the
output of a neural network, you need to be able to measure how far this output
is from what you expected. This is the job of the loss function of the network,
also called the objective function. The loss function takes the predictions of the
network and the true target (what you wanted the network to output) and
computes a distance score, capturing how well the network has done on this
specific example (see figure 1.8).

Optimizer

The fundamental trick in deep learning is to use this score as a feedback signal
to adjust the value of the weights a little, in a direction that will lower the loss
score for the current example (see figure 1.9). This adjustment is the job of
the optimizer, which implements what’s called the Backpropagation algorithm:
the central algorithm in deep learning.
Trained Network

Initially, the weights of the network are assigned random values, so the
network merely implements a series of random transformations. Naturally, its
output is far from what it should ideally be, and the loss score is accordingly
very high. But with every example the network processes, the weights are
adjusted a little in the correct direction, and the loss score decreases. This is
the training loop, which, repeated a sufficient number of times (typically tens
of iterations over thousands of examples), yields weight values that minimize
the loss function. A network with a minimal loss is one for which the outputs
are as close as they can be to the targets: a trained network.
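The following is a minimal NumPy sketch of this training loop for a single linear layer with MSE loss, assuming a tiny synthetic dataset; the gradient step captures the idea of adjusting the weights a little in the direction that lowers the loss, not the full backpropagation machinery of a deep network:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features (synthetic)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # targets with a little noise

w = np.zeros(3)          # weights start at arbitrary values
lr = 0.1                 # learning rate: how "little" each adjustment is

for step in range(50):
    y_pred = X @ w                            # forward pass through the (single) layer
    loss = np.mean((y_pred - y) ** 2)         # loss score: the feedback signal
    grad = 2 * X.T @ (y_pred - y) / len(y)    # gradient of the loss w.r.t. the weights
    w -= lr * grad                            # optimizer step: nudge weights to lower the loss

print(w)       # ends up close to true_w
print(loss)    # the loss score decreases as training proceeds
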

What has DL achieved so far?

Although deep learning is a fairly old subfield of machine learning, it
only rose to prominence in the early 2010s. In the few years since, it has
achieved nothing short of a revolution in the field, with remarkable results on
perceptual problems such as seeing and hearing: problems involving skills that
seem natural and intuitive to humans but have long been elusive for machines.

In particular, deep learning has achieved the following
breakthroughs, all in historically difficult areas of machine learning:
✓ Near-human-level image classification
✓ Near-human-level speech recognition
✓ Near-human-level handwriting transcription
✓ Improved machine translation
✓ Improved text-to-speech conversion
✓ Digital assistants such as Google Now and Amazon Alexa
✓ Near-human-level autonomous driving
✓ Improved ad targeting, as used by Google, Baidu, and Bing
✓ Improved search results on the web
✓ Ability to answer natural-language questions
✓ Superhuman Go playing

Convolution Neural Networks

Definition of CNN

Convolutional neural networks, also known as CNNs or ConvNets, are a specialized
type of artificial neural network that uses a mathematical operation called
convolution in place of general matrix multiplication in at least one of its
layers.

They are specifically designed to process pixel data and are used in image
recognition and processing. CNN specializes in processing data that has a grid-
like topology, such as an image. A digital image is a binary representation of
visual data. It contains a series of pixels arranged in a grid-like fashion that
contains pixel values to denote how bright and what color each pixel should
be.

A Convolutional Neural Network (CNN) is an extended version of the artificial neural
network (ANN) that is predominantly used to extract features from grid-like
matrix datasets, for example visual datasets such as images or videos, where
spatial patterns play an extensive role.

CNN architecture

A Convolutional Neural Network consists of multiple layers: the input layer,
convolutional layers, pooling layers, and fully connected layers.

Simple CNN architecture

The convolutional layer applies filters to the input image to extract features,
the pooling layer downsamples the image to reduce computation, and the fully
connected layer makes the final prediction. The network learns the optimal
filters through backpropagation and gradient descent.

How do Convolutional Layers work?

Convolutional Neural Networks, or convnets, are neural networks that share their
parameters. Imagine you have an image. It can be represented as a cuboid
with a length and width (the dimensions of the image) and a height (the channels;
images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural
network, called a filter or kernel, on it, producing, say, K outputs and stacking
them vertically. Now slide that small network across the whole image; as a
result, we get another image with a different width, height, and depth.
Instead of just the R, G, and B channels, we now have more channels but a smaller
width and height. This operation is called convolution. If the patch size were the
same as that of the image, it would be a regular neural network; because the
patch is small, we have far fewer weights.

Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.

• Convolution layers consist of a set of learnable filters (or kernels) having
small widths and heights and the same depth as that of the input volume (3
if the input layer is an image input).
• For example, if we have to run convolution on an image with dimensions
34x34x3, the possible size of the filters can be a x a x 3, where 'a' can be
3, 5, or 7, but smaller than the image dimensions.
• During the forward pass, we slide each filter across the whole input
volume step by step, where each step is called a stride (which can have a
value of 2, 3, or even 4 for high-dimensional images), and compute the
dot product between the kernel weights and the corresponding patch of the input volume.
• As we slide our filters, we get a 2-D output for each filter; stacking them
together gives an output volume with a depth equal to the number of filters.
The network will learn all the filters. (A small sketch of this sliding dot
product appears below.)
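The following is a minimal NumPy sketch of that sliding dot product for a single filter on a single-channel input, with stride 1 and no padding (a real convolutional layer applies many filters over all channels and learns their weights):

import numpy as np

def conv2d_single(image, kernel, stride=1):
    # Slides the kernel over the image and takes a dot product at each position.
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kH, j * stride:j * stride + kW]
            out[i, j] = np.sum(patch * kernel)   # dot product of kernel and patch
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 single-channel "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 filter
print(conv2d_single(image, kernel).shape)          # (4, 4)
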

Layers used to build ConvNets

A complete Convolutional Neural Network architecture is also known as
a convnet. A convnet is a sequence of layers, and every layer transforms one
volume into another through a differentiable function.

Types of layers:

Let's take an example by running a convnet on an image of dimension 32 x 32 x 3.

• Input Layers: This is the layer in which we give input to our model. In
a CNN, the input is generally an image or a sequence of images. This
layer holds the raw input image with width 32, height 32, and depth 3.
• Convolutional Layers: This is the layer used to extract features from the
input dataset. It applies a set of learnable filters, known as kernels, to the
input images. The filters/kernels are small matrices, usually of 2×2, 3×3,
or 5×5 shape. Each filter slides over the input image data and computes
the dot product between the kernel weights and the corresponding input
image patch. The output of this layer is referred to as feature maps.
Suppose we use a total of 12 filters for this layer; we'll get an output
volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the
preceding layer, activation layers add nonlinearity to the network. An
element-wise activation function is applied to the output of the
convolution layer. Some common activation functions are ReLU: max(0, x),
Tanh, Leaky ReLU, etc. The volume remains unchanged, so the output
volume will have dimensions 32 x 32 x 12.
• Pooling layer: This layer is periodically inserted in the convnet, and its
main function is to reduce the size of the volume, which makes computation
faster, reduces memory use, and also helps prevent overfitting. Two
common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12.
• Flattening: The resulting feature maps are flattened into a one-
dimensional vector after the convolution and pooling layers so they can
be passed into a fully connected layer for classification or regression.
• Fully Connected Layers: It takes the input from the previous layer and
computes the final classification or regression task.

• Output Layer: The output from the fully connected layers is then fed
into a logistic function for classification tasks, such as sigmoid or SoftMax,
which converts the output for each class into a probability score for that
class. (A minimal code sketch of this whole stack follows the list.)
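As a rough illustration of the stack described above, here is a minimal sketch using the Keras API (assuming TensorFlow is installed; the 12 filters and 3x3 kernel mirror the running example, while the 10 output classes are an illustrative assumption):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                               # input layer: 32 x 32 x 3 image
    layers.Conv2D(12, (3, 3), padding="same", activation="relu"),  # conv + activation -> 32 x 32 x 12
    layers.MaxPooling2D((2, 2)),                                   # pooling -> 16 x 16 x 12
    layers.Flatten(),                                              # flattening to a 1-D vector
    layers.Dense(10, activation="softmax"),                        # fully connected output with class probabilities
])

model.summary()
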

Recurrent Neural Networks

Recurrent Neural Network (RNN) is a type of Neural Network where the output
from the previous step is fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent of each other, but in
cases when it is required to predict the next word of a sentence, the previous
words are required and hence there is a need to remember the previous words.
Thus, RNN came into existence, which solved this issue with the help of a
Hidden Layer. The main and most important feature of RNN is its Hidden state,
which remembers some information about a sequence. The state is also
referred to as Memory State since it remembers the previous input to the
network. It uses the same parameters for each input as it performs the same
task on all the inputs or hidden layers to produce the output. This reduces the
complexity of parameters, unlike other neural networks.

The Recurrent Neural Network consists of multiple fixed activation function
units, one for each time step. Each unit has an internal state which is called
the hidden state of the unit. This hidden state signifies the past knowledge
that the network currently holds at a given time step. The hidden state is
updated at every time step to signify the change in the knowledge of the
network about the past. It is updated using the following recurrence relation:

The formula for calculating the current state:

ht = f(ht-1, xt)

where:

ht -> current state

ht-1 -> previous state

xt -> input state

Formula for applying the activation function (tanh):

ht = tanh(whh · ht-1 + wxh · xt)

where:

whh -> weight at the recurrent neuron

wxh -> weight at the input neuron

The formula for calculating the output:

yt = why · ht

where:

yt -> output

why -> weight at the output layer

These parameters are updated using backpropagation. However, since an RNN
works on sequential data, we use a modified form of backpropagation known as
Backpropagation Through Time.

Training through RNN

1. A single time step of the input is provided to the network.
2. Its current state is then calculated using the current input and the
previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go through as many time steps as the problem requires and join
the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to
calculate the output.
6. The output is then compared to the actual output, i.e. the target output,
and the error is generated.
7. The error is then back-propagated through the network to update the
weights, and hence the network (RNN) is trained using Backpropagation
Through Time.
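A minimal NumPy sketch of the forward pass described by the recurrence above, for a toy sequence (the sizes and random weights are illustrative; training these weights would require Backpropagation Through Time, which is not shown):

import numpy as np

rng = np.random.default_rng(2)

input_size, hidden_size, output_size = 3, 5, 2
Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # weight at the input neuron
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # weight at the recurrent neuron
Why = rng.normal(scale=0.1, size=(output_size, hidden_size))  # weight at the output layer

xs = [rng.normal(size=input_size) for _ in range(4)]   # a toy sequence of 4 time steps
h = np.zeros(hidden_size)                              # initial hidden state

for x_t in xs:
    h = np.tanh(Whh @ h + Wxh @ x_t)   # ht = tanh(whh . ht-1 + wxh . xt)
y = Why @ h                            # yt = why . ht, computed from the final state
print(y)
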

Advantages of Recurrent Neural Network


1. An RNN remembers information through time, which makes it useful for
time-series prediction, where previous inputs matter. Variants such as
Long Short-Term Memory (LSTM) networks strengthen this ability to retain
information over long sequences.
2. Recurrent neural networks can even be combined with convolutional layers
to extend the effective pixel neighborhood.

Disadvantages of Recurrent Neural Network

1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences when using tanh or ReLU as the
activation function.

Applications of Recurrent Neural Network

1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face Detection
5. Time Series Forecasting

Variations of Recurrent Neural Networks (RNN)

To overcome problems like vanishing and exploding gradients, several advanced
versions of RNNs have been developed. Some of these are:

o Bidirectional Neural Network (BiNN)

o Long Short-Term Memory (LSTM)

Bidirectional Neural Network (BiNN)

A BiNN is a variation of a Recurrent Neural Network in which the input
information flows in both directions, and the outputs of both directions are
then combined to produce the output. BiNN is useful in situations where the
context of the input is important, such as NLP tasks and time-series analysis
problems.

Long Short-Term Memory (LSTM)

Long Short-Term Memory works on the read-write-and-forget principle: given
the input information, the network reads and writes the most useful
information from the data and forgets the information that is not important
for predicting the output. To do this, three new gates are introduced into the
RNN. In this way, only the selected information is passed through the network.
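As a rough sketch of how an LSTM is used in practice with the Keras API (assuming TensorFlow is installed; the sequence length, feature count, and output size are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20, 8)),   # sequences of 20 time steps with 8 features each
    layers.LSTM(32),               # gated recurrent layer that keeps or forgets information
    layers.Dense(1),               # e.g. a single regression output per sequence
])
model.compile(optimizer="adam", loss="mse")
model.summary()
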

Use Cases

1. Machine Translation:

RNN can be used to build a deep learning model that can translate text from
one language to another without the need for human intervention. You can,
for example, translate a text from your native language to English.

2. Text Creation:

RNNs can also be used to build a deep learning model for text generation.
Based on the previous sequence of words/characters used in the text, a trained
model learns the likelihood of occurrence of a word/character. A model can be
trained at the character, n-gram, sentence, or paragraph level.

3. Captioning of images:

The process of creating text that describes the content of an image is known
as image captioning. The image's content can depict an object as well as the
action of the object in the image. For example, a trained deep learning model
using an RNN can describe an image as "A lady in a green coat is reading a
book under a tree."

4. Recognition of Speech:
This is also known as Automatic Speech Recognition (ASR), and it is capable
of converting human speech into written or text format. Don't mix up speech
recognition and voice recognition; speech recognition primarily focuses on
converting voice data into text, whereas voice recognition identifies the user's
voice.

Speech recognition technologies that are used on a daily basis by various users
include Alexa, Cortana, Google Assistant, and Siri.

5. Forecasting of Time Series:

After being trained on historical time-stamped data, an RNN can be used to
create a time series prediction model that predicts the future outcome. The
stock market is a good example.

You can use stock market data to build a machine learning model that can
forecast future stock prices based on what the model learns from historical
data. This can assist investors in making data-driven investment decisions.
