
Forms of learning in Artificial Intelligence

There are four main types of machine learning:

1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
 
1. Supervised learning
Supervised machine learning can take what it
has learned in the past and apply that to new
data using labelled examples to predict future
patterns and events. It learns by explicit
example.
Supervised learning requires that the algorithm’s
possible outputs are already known and that the
data used to train the algorithm is already
labelled with correct answers. It’s like teaching a
child that 2+2=4 or showing an image of a dog
and teaching the child that it is called a dog. The
approach to supervised machine learning is
essentially the same – it is presented with all the
information it needs to reach pre-determined
conclusions. It learns how to reach the
conclusion, just like a child would learn how to
reach the total of ‘5’ and the few, pre-determined
ways to get there, for example, 2+3 and 1+4. If
you were to present 6+3 as a way to get to 5,
that would be determined as incorrect. Errors
would be found and adjusted.

The algorithm learns by comparing its actual output with the correct outputs to find errors. It then modifies the model accordingly.
Supervised learning is commonly used in
applications where historical
data predicts likely future events. Using the
previous example, if 6+3 is the most common erroneous way of reaching 5, the machine can predict that when someone inputs 6+3, the second most likely intended result (after the correct answer of 9) is 5. We can also
consider an everyday example – it can foresee
when credit card transactions are likely to be
fraudulent or which insurance customer is more
likely to put forward a claim.
Supervised learning is further divided into:
1. Classification
2. Regression
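To make this concrete, the following is a minimal sketch of supervised classification in Python, assuming scikit-learn is installed; the labelled examples are invented toy data, not a real dataset.

from sklearn.linear_model import LogisticRegression

# Labelled training examples: each input has a known "correct answer".
X_train = [[2, 3], [1, 4], [6, 3], [5, 5]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)           # learn from the labelled examples
print(model.predict([[4, 4]]))        # predict the label of a new, unseen example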
 
2. Unsupervised learning
Supervised learning tasks find patterns where
we have a dataset of “right answers” to learn
from. Unsupervised learning tasks find
patterns where we don’t. This may be because
the “right answers” are unobservable, or
infeasible to obtain, or maybe for a given
problem, there isn’t even a “right answer” per
se.
Unsupervised learning is used against data
without any historical labels. The system is not
given a pre-determined set of outputs or
correlations between inputs and outputs or a
“correct answer.” The algorithm must figure out what it is seeing by itself; it has no store of reference points. The goal is to explore the data and find some sort of pattern or structure.
Unsupervised learning works well when the data
is transactional. For example, identifying
pockets of customers with similar characteristics
who can then be targeted in marketing
campaigns.
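As an illustration of the customer-segmentation idea above, here is a minimal clustering sketch in Python, assuming scikit-learn; the [spend, visits] values are invented and no labels are given to the algorithm.

from sklearn.cluster import KMeans

# Unlabelled "transactional" data: [monthly spend, visits per month] per customer.
customers = [[20, 1], [22, 2], [80, 9], [85, 10], [50, 5]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)   # the algorithm finds the groups on its own
print(labels)                            # cluster assignment for each customer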
Unsupervised machine learning is a more
complex process and has been used far fewer
times than supervised machine learning. But it’s
exactly for this reason that there is so much
buzz around the future of AI. Advances in
unsupervised ML are seen as the future of AI
because it moves away from narrow AI and
closer to AGI (‘artificial general intelligence’ that
we discussed a few paragraphs earlier). If
you’ve ever heard someone talking about
computers teaching themselves, this is
essentially what they are referring to.
In unsupervised learning, neither a training data
set nor a list of outcomes is provided. The AI
enters the problem blind – with only its
faultless logical operations to guide
it. Imagine yourself as a person that has never
heard of or seen any sport being played. You
get taken to a football game and left to figure out
what it is that you are observing. You can’t refer
to your knowledge of other sports and try to
draw up similarities and differences that will
eventually boil down to an understanding of
football. You have nothing but your cognitive
ability. Unsupervised learning places the AI in
an equivalent of this situation and leaves it to
learn using only its on/off logic mechanisms that
are used in all computer systems.

3. Semi-supervised learning (SSL)


Semi-supervised learning falls somewhere in
the middle of supervised and unsupervised
learning. It is used because many of the problems that AI is applied to require a balance of both approaches.
In many cases the reference data needed for
solving the problem is available, but it is either
incomplete or somehow inaccurate. This is
when semi-supervised learning is summoned for
help since it is able to access the available
reference data and then use unsupervised
learning techniques to do its best to fill the gaps.
Unlike supervised learning which uses labelled
data and unsupervised which is given no
labelled data at all, SSL uses both. More often
than not the scales tip in favour of unlabelled
data since it is cheaper and easier to acquire,
leaving the volume of available labelled data in
the minority. The AI learns from the labelled
data to then make a judgement on the
unlabelled data and find patterns, relationships
and structures.
SSL is also useful for reducing human bias in the process. In fully supervised learning, every example has been labelled by a human, which poses the risk of results being skewed by improper labelling. With SSL, including a lot of unlabelled data in the training process often improves the precision of the end result while reducing time and cost. It enables data scientists to use large amounts of unlabelled data without facing the insurmountable task of assigning a label to every example.
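A minimal semi-supervised sketch in Python, assuming scikit-learn: unlabelled examples are marked with -1, and a self-training wrapper uses the labelled examples to label the rest. The numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[1.0], [1.2], [3.8], [4.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1, -1, -1])        # -1 marks the unlabelled examples

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)                           # labelled data guides labelling of the rest
print(model.predict([[1.1], [3.9]]))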
 
4. Reinforcement learning
Reinforcement learning is a type of dynamic
programming that trains algorithms using a
system of reward and punishment.
A reinforcement learning algorithm, or agent,
learns by interacting with its environment. It
receives rewards for performing correctly and penalties for performing incorrectly. Therefore, it
learns without having to be directly taught by a
human – it learns by seeking the greatest
reward and minimising penalty. This learning
is tied to a context because what may lead to
maximum reward in one situation may be
directly associated with a penalty in another.
This type of learning consists of three
components: the agent (the AI learner/decision
maker), the environment (everything the agent
has interaction with) and actions (what the agent
can do). The agent reaches the goal fastest by finding the best course of action, and that is precisely its objective: maximising the reward, minimising the penalty and figuring out the best way to do so.

Machines and software agents learn to determine the ideal behaviour within a specific context in order to maximise their performance and reward. Learning occurs via reward feedback,
which is known as the reinforcement signal. An
everyday example of training pets to relieve
themselves outside is a simple way to illustrate
this. The goal is getting the pet into the habit of
going outside rather than in the house. The
training then involves rewards and punishments
intended for the pet’s learning. It gets a treat for
going outside or has its nose rubbed in its mess
if it fails to do so.
Reinforcement learning tends to be used for
gaming, robotics and navigation. The algorithm
discovers which steps lead to the maximum
rewards through a process of trial and error.
The problem of learning from such repeated trial and error is typically modelled as a Markov Decision Process.
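The reward-and-penalty loop can be sketched with tabular Q-learning on a toy one-dimensional corridor; the environment, rewards and hyperparameters below are illustrative choices, not a standard benchmark.

import numpy as np

n_states, n_actions = 5, 2              # a 1-D corridor: states 0..4, actions 0=left, 1=right
Q = np.zeros((n_states, n_actions))     # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.3   # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(300):
    s = 0
    while s != n_states - 1:
        explore = rng.random() < epsilon or Q[s].max() == Q[s].min()
        a = int(rng.integers(n_actions)) if explore else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0                      # reward signal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])      # Q-learning update
        s = s_next

print(Q)   # the learned values favour moving right, toward the rewarded state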
Facebook’s News Feed is an example most of
us will be able to understand. Facebook uses
machine learning to personalise people’s feeds.
If you frequently read or “like” a particular
friend’s activity, the News Feed will begin to
bring up more of that friend’s activity more often
and nearer to the top. Should you stop
interacting with this friend’s activity in the same
way, the data set will be updated and the News
Feed will consequently adjust.

Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning
technique that can be used for both
classification and Regression problems, but
mostly it is preferred for solving
Classification problems. It is a
tree-structured classifier, where internal
nodes represent the features of a
dataset, branches represent the
decision rules and each leaf node
represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make
any decision and have multiple branches,
whereas Leaf nodes are the output of
those decisions and do not contain any
further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for
getting all the possible solutions to a
problem/decision based on given
conditions.
o It is called a decision tree because, similar
to a tree, it starts with the root node,
which expands on further branches and
constructs a tree-like structure.
o In order to build a tree, we use the CART
algorithm, which stands
for Classification and Regression Tree
algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the algorithm best suited to the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using the Decision tree:
o Decision Trees usually mimic the human thinking process when making a decision, so they are easy to understand.
o The logic behind the decision tree can be
easily understood because it shows a
tree-like structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision
tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous
sets.
 Leaf Node: Leaf nodes are the final output node,
and the tree cannot be segregated further after getting
a leaf node.
 Splitting: Splitting is the process of dividing the
decision node/root node into sub-nodes according to
the given conditions.
 Branch/Sub Tree: A tree formed by splitting the
tree.
 Pruning: Pruning is the process of removing the
unwanted branches from the tree.
 Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
How does the Decision Tree algorithm
Work?
In a decision tree, for predicting the class of
the given dataset, the algorithm starts from
the root node of the tree. This algorithm
compares the values of the root attribute with
the record (real dataset) attribute and, based
on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the
tree. The complete process can be better
understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes.
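A minimal sketch of these steps in Python, assuming scikit-learn, whose DecisionTreeClassifier implements a CART-style tree; the [salary, distance] toy data and labels are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[50, 10], [80, 2], [30, 25], [90, 1]]        # toy [salary, distance] features
y = ["Decline", "Accept", "Decline", "Accept"]

tree = DecisionTreeClassifier(criterion="entropy")  # entropy-based attribute selection
tree.fit(X, y)
print(export_text(tree, feature_names=["salary", "distance"]))   # the learned splits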

Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accept offer and Decline offer). Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:
o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
o It calculates how much information a
feature provides us about a class.
o According to the value of information gain,
we split the node and build the decision
tree.
o A decision tree algorithm always tries to
maximize the value of information gain,
and a node/attribute having the highest
information gain is split first. It can be
calculated using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg.) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the
impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated
as:
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where,
o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
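The entropy and information gain formulas above can be computed directly; the small Python helpers below and the yes/no counts are purely illustrative.

from math import log2

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(parent, subsets):
    # Entropy of the parent minus the weighted average entropy of the subsets.
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(information_gain(parent, split))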

2. Gini Index:
o Gini index is a measure of impurity or
purity used while creating a decision tree
in the CART(Classification and Regression
Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART
algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the
below formula:
Gini Index = 1 − ∑j (Pj)²
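The Gini index can be computed in the same way; this tiny Python helper is illustrative only.

def gini_index(labels):
    # Gini = 1 - sum over classes of p_j squared
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini_index(["yes"] * 9 + ["no"] * 5))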
Pruning: Getting an Optimal
Decision tree
Pruning is a process of deleting the
unnecessary nodes from a tree in order to get
the optimal decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the tree without reducing accuracy is therefore known as pruning. There are two main pruning techniques:
o Cost Complexity Pruning
o Reduced Error Pruning.
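A minimal sketch of cost complexity pruning in Python, assuming scikit-learn: a larger ccp_alpha prunes more aggressively, producing a smaller tree. The bundled iris dataset is used only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree has fewer nodes than the fully grown tree.
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)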

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process a human follows when making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think through all the possible outcomes for a problem.
o It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree
o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o With more class labels, the computational complexity of the decision tree may increase.

What Is a Hypothesis?
A hypothesis is an explanation for something.
It is a provisional idea, an educated guess that
requires some evaluation.
A good hypothesis is testable; it can be either
true or false.
In science, a hypothesis must be falsifiable,
meaning that there exists a test whose outcome
could mean that the hypothesis is not true. The
hypothesis must also be framed before the
outcome of the test is known.
A good hypothesis fits the evidence and can be
used to make predictions about new
observations or new situations.
The hypothesis that best fits the evidence and
can be used to make predictions is called a
theory, or is part of a theory.
● Hypothesis in Science: Provisional explanation
that fits the evidence and can be confirmed or
disproved.
What Is a Hypothesis in Statistics?
Much of statistics is concerned with the
relationship between observations.
Statistical hypothesis tests are techniques used
to calculate a critical value called an “effect.” The
critical value can then be interpreted in order to
determine how likely it is to observe the effect if
a relationship does not exist.
If the likelihood is very small, then it suggests
that the effect is probably real. If the likelihood is
large, then we may have observed a statistical
fluctuation, and the effect is probably not real.
For example, we may be interested in evaluating
the relationship between the means of two
samples, e.g. whether the samples were drawn
from the same distribution or not, whether there
is a difference between them.
One hypothesis is that there is no difference
between the population means, based on the
data samples.
This is a hypothesis of no effect and is called the
null hypothesis and we can use the statistical
hypothesis test to either reject this hypothesis, or
fail to reject (retain) it. We don’t say “accept”
because the outcome is probabilistic and could
still be wrong, just with a very low probability.
If the null hypothesis is rejected, then we assume
the alternative hypothesis that there exists some
difference between the means.
● Null Hypothesis (H0): Suggests no effect.
● Alternate Hypothesis (H1): Suggests some
effect.
Statistical hypothesis tests don’t comment on
the size of the effect, only the likelihood of the
presence or absence of the effect in the
population, based on the observed samples of
data.

● Hypothesis in Statistics: Probabilistic explanation about the presence of a relationship between observations.
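A minimal sketch of such a test in Python, assuming SciPy: a two-sample t-test where the null hypothesis H0 is that the two population means do not differ. The samples are invented toy data.

from scipy.stats import ttest_ind

sample1 = [5.1, 4.9, 5.3, 5.0, 5.2]
sample2 = [5.8, 6.0, 5.9, 6.1, 5.7]

stat, p_value = ttest_ind(sample1, sample2)
if p_value < 0.05:
    print("Reject H0: the sample means likely differ")   # the effect is probably real
else:
    print("Fail to reject H0")                           # could be a statistical fluctuation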

Regression vs. Classification in Machine Learning
Regression and Classification algorithms are Supervised Learning algorithms. Both are used for prediction in Machine Learning and work with labeled datasets; the difference lies in how they are applied to different machine learning problems.
The main difference is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Consider the below diagram:
Classification:
Classification is a process of finding a function which helps in
dividing the dataset into classes based on different parameters.
In Classification, a computer program is trained on the training
dataset and based on that training, it categorizes the data into
different classes.

The task of the classification algorithm is to find the mapping function to map the input (x) to the discrete output (y).
Example: The best example to understand the Classification
problem is Email Spam Detection. The model is trained on the
basis of millions of emails on different parameters, and
whenever it receives a new email, it identifies whether the email
is spam or not. If the email is spam, then it is moved to the
Spam folder.
Types of ML Classification Algorithms:
Classification Algorithms can be further divided into the
following types:

● Logistic Regression

● K-Nearest Neighbours

● Support Vector Machines

● Kernel SVM

● Naïve Bayes

● Decision Tree Classification

● Random Forest Classification

Regression:
Regression is a process of finding the correlations between
dependent and independent variables. It helps in predicting the
continuous variables such as prediction of Market Trends,
prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping
function to map the input variable(x) to the continuous output
variable(y).
Example: Suppose we want to do weather forecasting, so for
this, we will use the Regression algorithm. In weather
prediction, the model is trained on the past data, and once the
training is completed, it can easily predict the weather for future
days.
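A minimal regression sketch in Python, assuming scikit-learn; the "temperature history" below is invented, and the model predicts a continuous value for the next day.

from sklearn.linear_model import LinearRegression

days = [[1], [2], [3], [4], [5]]
temps = [20.1, 21.0, 21.8, 23.1, 23.9]

reg = LinearRegression().fit(days, temps)   # fit a line to the past data
print(reg.predict([[6]]))                   # predicted (continuous) value for day 6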
Types of Regression Algorithm:

● Simple Linear Regression


● Multiple Linear Regression

● Polynomial Regression

● Support Vector Regression

● Decision Tree Regression

● Random Forest Regression

Difference between Regression and Classification

Regression Algorithm vs. Classification Algorithm

● In Regression, the output variable must be of a continuous nature or a real value; in Classification, the output variable must be a discrete value.
● The task of the regression algorithm is to map the input value (x) to a continuous output variable (y); the task of the classification algorithm is to map the input value (x) to a discrete output variable (y).
● Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
● In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
● Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction; Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition and identification of cancer cells.
● Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
Artificial Neural Network Tutorial

The term "Artificial neural network" refers to a biologically


inspired sub-field of artificial intelligence modeled after the
brain. An Artificial neural network is usually a computational
network based on biological neural networks that construct the
structure of the human brain. Similar to how a human brain has
neurons interconnected to each other, artificial neural networks
also have neurons that are linked to each other in various layers
of the networks. These neurons are known as nodes.
Artificial neural network tutorial covers all the aspects related
to the artificial neural network. In this tutorial, we will discuss
ANNs, Adaptive resonance theory, Kohonen self-organizing
map, Building blocks, unsupervised learning, Genetic algorithm,
etc.

What is Artificial Neural Network?


The term "Artificial Neural Network" is derived from Biological
neural networks that develop the structure of a human brain.
Similar to the human brain that has neurons interconnected to
one another, artificial neural networks also have neurons that
are interconnected to one another in various layers of the
networks. These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological
Neural Network.
The typical Artificial Neural Network looks something like the
given figure.

Dendrites from a Biological Neural Network represent inputs in Artificial Neural Networks, the cell nucleus represents Nodes,
synapse represents Weights, and Axon represents Output.
Relationship between Biological neural network and artificial
neural network:
Biological Neural Network → Artificial Neural Network

Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output

An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up the human brain, so that computers can understand things and make decisions in a human-like manner.
The artificial neural network is designed by programming
computers to behave simply like interconnected brain cells.
There are roughly 100 billion neurons in the human brain, and each neuron is connected to somewhere between 1,000 and 100,000 others. In the human brain, data is stored in a distributed manner, and we can retrieve more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors.
We can understand the artificial neural network with the example of a digital logic gate that takes inputs and gives an output. Consider an "OR" gate, which takes two inputs: if one or both of the inputs are "On", the output is "On"; if both inputs are "Off", the output is "Off". Here the output depends entirely on the input. Our brain does not work this way: the output-to-input relationship keeps changing because the neurons in our brain are "learning".

The architecture of an artificial neural network:
To understand the architecture of an artificial neural network, we have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.
Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations needed to find hidden features and patterns.

Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.

The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not; only the nodes that fire pass their signal on to the output layer. There are distinct activation functions available, chosen according to the sort of task we are performing.
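The computation just described can be sketched in a few lines of Python with NumPy; the input values, weights, bias and the choice of a sigmoid activation are all illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8, 0.2])      # inputs
w = np.array([0.4, -0.6, 0.9])     # one weight per input
b = 0.1                            # bias

weighted_sum = np.dot(w, x) + b    # the transfer function: weighted sum plus bias
output = sigmoid(weighted_sum)     # the activation function produces the node's output
print(output)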

Advantages of Artificial Neural Network (ANN)


Parallel processing capability:

Artificial neural networks can perform more than one task simultaneously.

Storing data on the entire network:

Unlike traditional programming, where data is stored in a database, the information in an ANN is stored across the whole network. The disappearance of a few pieces of data in one place doesn't prevent the network from working.

Capability to work with incomplete knowledge:

After training, an ANN can produce output even with incomplete data. How much performance is lost depends on how important the missing data is.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine suitable examples and to train the network towards the desired output by demonstrating these examples to it. The success of the network is directly proportional to the chosen instances; if an event is not shown to the network in all its aspects, the network can produce false output.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


Assurance of proper network structure:

There is no particular guideline for determining the structure of artificial neural networks. The appropriate network structure is accomplished through experience, trial and error.

Unrecognized behavior of the network:

This is the most significant issue with ANNs. When an ANN produces a solution, it does not provide insight into why and how, which decreases trust in the network.

Hardware dependence:

By their structure, artificial neural networks need processors with parallel processing power, so their realization is hardware-dependent.

Difficulty of showing the issue to the network:

ANNs can only work with numerical data, so problems must be converted into numerical values before being introduced to the ANN. The representation chosen here directly impacts the performance of the network, and it relies on the user's abilities.

How do artificial neural networks work?


An Artificial Neural Network can best be represented as a weighted directed graph, where the artificial neurons form the nodes and the connections between neuron outputs and neuron inputs can be viewed as directed edges with weights. The Artificial Neural Network receives the input signal from an external source in the form of a pattern or an image, represented as a vector. These inputs are denoted by the notation x(n) for each of the n inputs. Afterward, each input is multiplied by its corresponding weight (these weights are the details the artificial neural network uses to solve a specific problem). In general terms, these weights represent the strength of the interconnection between neurons inside the artificial neural network. All the weighted inputs are summed inside the computing unit.

If the weighted sum is equal to zero, a bias is added to make the output non-zero, or to otherwise scale up the system's response. The bias has a fixed input of 1 with its own weight. The total of the weighted inputs can range from 0 to positive infinity, so to keep the response within the limits of the desired value, the total of the weighted inputs is passed through an activation function.

The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, primarily either linear or non-linear. Some commonly used activation functions are the binary (step), linear and tan-hyperbolic sigmoidal activation functions. Let us take a look at the binary activation function in detail:

Binary:

In the binary activation function, the output is either a 1 or a 0. To accomplish this, a threshold value is set up. If the net weighted input of the neuron exceeds the threshold, the final output of the activation function is returned as 1; otherwise the output is returned as 0.
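A minimal sketch of this binary (step) activation in Python, using a threshold of 1 as described above; the sample inputs are illustrative.

def binary_activation(weighted_input, threshold=1.0):
    # Output 1 if the net weighted input exceeds the threshold, otherwise 0.
    return 1 if weighted_input > threshold else 0

print(binary_activation(1.4), binary_activation(0.6))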
Types of Artificial Neural Network:
There are various types of Artificial Neural Networks (ANN), which, depending on how the neurons and network functions of the human brain are modelled, perform tasks in a similar way. Most artificial neural networks have some similarities with their more complex biological counterpart and are very effective at their intended tasks, for example segmentation or classification.

Feedback ANN:

In this type of ANN, the output is fed back into the network to internally reach the best-evolved result. As per the University of Massachusetts Lowell Centre for Atmospheric Research, feedback networks feed information back into themselves and are well suited to solving optimization problems. Internal system error correction utilizes feedback ANNs.

Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer, an output layer, and at least one hidden layer of neurons. By assessing its output in relation to its input, the strength of the network can be judged from the group behaviour of the connected neurons, and the output is decided. The primary advantage of this network is that it learns to evaluate and recognize input patterns.
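A minimal feed-forward network sketch in Python, assuming scikit-learn's MLPClassifier: one hidden layer sits between the input and output layers. The XOR-style toy data and the solver settings are illustrative choices.

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)             # training adjusts the weights of the hidden and output layers
print(mlp.predict(X))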

Parametric and Nonparametric Machine Learning Algorithms
Learning a Function

Machine learning can be summarized as learning a function (f) that maps input variables
(X) to output variables (Y).

Y = f(x)

An algorithm learns this target mapping function from training data.

The form of the function is unknown, so our job as machine learning practitioners is to
evaluate different machine learning algorithms and see which is better at approximating
the underlying function.
Different algorithms make different assumptions or biases about the form of the
function and how it can be learned.

Parametric Machine Learning Algorithms

Assumptions can greatly simplify the learning process, but can also limit what can be
learned. Algorithms that simplify the function to a known form are called parametric
machine learning algorithms.

A learning model that summarizes data with a set of parameters of fixed size
(independent of the number of training examples) is called a parametric model. No
matter how much data you throw at a parametric model, it won’t change its mind about
how many parameters it needs.

The algorithms involve two steps:

1. Select a form for the function.


2. Learn the coefficients for the function from the training data.

An easy to understand functional form for the mapping function is a line, as is used in
linear regression:

y = b0 + b1*x1 + b2*x2

Where b0, b1 and b2 are the coefficients of the line that control the intercept and slope,
and x1 and x2 are two input variables.

Assuming the functional form of a line greatly simplifies the learning process. Now, all
we need to do is estimate the coefficients of the line equation and we have a predictive
model for the problem.

Often the assumed functional form is a linear combination of the input variables and as
such parametric machine learning algorithms are often also called “linear machine
learning algorithms“.
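A minimal parametric sketch in Python, assuming scikit-learn: however much training data is supplied, the fitted model is summarised by the fixed set of coefficients b0, b1 and b2. The toy data is invented.

from sklearn.linear_model import LinearRegression

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]   # two input variables x1, x2
y = [8, 7, 17, 16, 23]

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0 and (b1, b2): the fixed-size parameter set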

Some more examples of parametric machine learning algorithms include:


● Logistic Regression
● Linear Discriminant Analysis
● Perceptron
● Naive Bayes
● Simple Neural Networks

Benefits of Parametric Machine Learning Algorithms:

● Simpler: These methods are easier to understand and interpret results.


● Speed: Parametric models are very fast to learn from data.
● Less Data: They do not require as much training data and can work well even if
the fit to the data is not perfect.

Limitations of Parametric Machine Learning Algorithms:

● Constrained: By choosing a functional form these methods are highly constrained to the specified form.
● Limited Complexity: The methods are more suited to simpler problems.
● Poor Fit: In practice the methods are unlikely to match the underlying mapping
function.

Nonparametric Machine Learning Algorithms

Algorithms that do not make strong assumptions about the form of the mapping
function are called nonparametric machine learning algorithms. By not making
assumptions, they are free to learn any functional form from the training data.

Nonparametric methods seek to best fit the training data in constructing the mapping
function, whilst maintaining some ability to generalize to unseen data. As such, they are
able to fit a large number of functional forms.

An easy to understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions based on the k most similar training patterns for a new data
instance. The method does not assume anything about the form of the mapping
function other than patterns that are close are likely to have a similar output variable.
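A minimal nonparametric sketch in Python, assuming scikit-learn: k-nearest neighbours keeps the training examples themselves and predicts from the k closest points. The toy data is invented.

from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                        # "training" essentially stores the examples
print(knn.predict([[2.5], [10.5]]))  # each prediction uses the 3 nearest neighbours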

Some more examples of popular nonparametric machine learning algorithms are:

● k-Nearest Neighbors
● Decision Trees like CART and C4.5
● Support Vector Machines

Benefits of Nonparametric Machine Learning Algorithms:

● Flexibility: Capable of fitting a large number of functional forms.


● Power: No assumptions (or weak assumptions) about the underlying function.
● Performance: Can result in higher performance models for prediction.

Limitations of Nonparametric Machine Learning Algorithms:

● More data: Require a lot more training data to estimate the mapping function.
● Slower: A lot slower to train as they often have far more parameters to train.
● Overfitting: More of a risk to overfit the training data and it is harder to explain
why specific predictions are made.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) using the extreme cases (support vectors), and on the basis of these support vectors it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

● Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary
that helps to classify the data points. This best boundary is known as the hyperplane
of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that lie closest to the hyperplane and which affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair(x1, x2) of
coordinates in either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate the two classes just by using a straight line, but there can be multiple lines that separate them. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points from both classes that lie closest to the line; these points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
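A minimal linear SVM sketch in Python, assuming scikit-learn: a maximum-margin line is fitted on invented 2-D data, and the support vectors that define it can be inspected.

from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = ["blue", "blue", "blue", "green", "green", "green"]

svm = SVC(kernel="linear")
svm.fit(X, y)
print(svm.support_vectors_)          # the extreme points closest to the hyperplane
print(svm.predict([[3, 3], [7, 6]])) # classify new points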
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are now in 3-D space, the separating boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-D space with z = 1, it becomes a circle of radius 1, so we obtain a circular decision boundary for the non-linear data.
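A minimal non-linear SVM sketch in Python, assuming scikit-learn: points inside a circle cannot be separated from points outside by a straight line, but an RBF kernel (which implicitly adds dimensions, much like the z = x² + y² trick above) can separate them. The data is generated for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # 1 = inside the unit circle

svm = SVC(kernel="rbf").fit(X, y)
print(svm.score(X, y))   # training accuracy on the circular pattern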


Ensemble Classifier
Ensemble learning helps improve machine learning results by combining
several models. This approach allows the production of better predictive
performance compared to a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage : Improvement in predictive accuracy.

Disadvantage : It is difficult to understand an ensemble of classifiers.

Why do ensembles work?


Dietterich(2002) showed that ensembles overcome three problems –
● Statistical Problem –
The Statistical Problem arises when the hypothesis space is too
large for the amount of available data. Hence, there are many
hypotheses with the same accuracy on the data and the learning
algorithm chooses only one of them! There is a risk that the accuracy
of the chosen hypothesis is low on unseen data!
● Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
● Representational Problem –
The Representational Problem arises when the hypothesis space
does not contain any good approximation of the target class(es).

Main Challenge for Developing Ensemble Models?


The main challenge is not to obtain highly accurate base models, but rather to
obtain base models which make different kinds of errors. For example, if
ensembles are used for classification, high accuracies can be accomplished if
different base models misclassify different training examples, even if the base
classifier accuracy is low.
Methods for Independently Constructing Ensembles –

● Majority Vote
● Bagging and Random Forest
● Randomness Injection
● Feature-Selection Ensembles
● Error-Correcting Output Coding

Types of Ensemble Classifier –


Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Suppose a set D of d tuples; at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). A classifier model Mi is then learned from each training set Di. Each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal
tuples, selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and
independent of each other.
4. The final predictions are determined by combining the predictions
from all the models.

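A minimal bagging sketch in Python, assuming scikit-learn, following the steps above: many trees, each trained on a bootstrap sample, with predictions combined by voting. The bundled iris dataset is used for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bagging.fit(X, y)              # each tree sees its own bootstrap sample of the data
print(bagging.predict(X[:3]))  # class chosen by majority vote of the trees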
Random Forest:
Random Forest is an extension over bagging. Each classifier in the
ensemble is a decision tree classifier and is generated using a
random selection of attributes at each node to determine the split.
During classification, each tree votes and the most popular class is
returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set,
selecting observations with replacement.
2. A subset of features is selected randomly and whichever
feature gives the best split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. Repeat the above steps; the final prediction is based on the aggregation of predictions from n trees.
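A minimal random forest sketch in Python, assuming scikit-learn: an ensemble of trees where each split considers a random subset of features, and the most popular class is returned. The bundled iris dataset is used for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))   # majority vote across the 100 trees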
