Supervised learning:
Supervised learning, also known as supervised machine learning, is a subcategory of machine
learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms
to classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its
weights until it has been fitted appropriately, a process that typically involves cross-validation.
Supervised learning helps organizations solve a variety of real-world problems at scale,
such as filtering spam into a separate folder from your inbox.

How supervised learning works


Supervised learning uses a training set to teach models to yield the desired output. This training
dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm
measures its accuracy through the loss function, adjusting until the error has been sufficiently
minimized.
Supervised learning can be separated into two types of problems when data mining—classification
and regression:
Classification uses an algorithm to accurately assign test data into specific categories. It recognizes
specific entities within the dataset and attempts to draw some conclusions on how those entities
should be labeled or defined. Common classification algorithms are linear classifiers, support vector
machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more
detail below.
Regression is used to understand the relationship between dependent and independent variables. It is
commonly used to make projections, such as for sales revenue for a given business. Linear
regression, logistic regression, and polynomial regression are popular regression algorithms.
Supervised learning algorithms
Various algorithms and computational techniques are used in supervised machine learning processes.
Below are brief explanations of some of the most commonly used learning methods, typically
calculated through use of programs like R or Python:
Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training
data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is
made up of inputs, weights, a bias (or threshold), and an output. If that output value exceeds a given
threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural
networks learn this mapping function through supervised learning, adjusting based on the loss
function through the process of gradient descent. When the cost function is at or near zero, we can
be confident in the model’s accuracy to yield the correct answer.
Naive Bayes: Naive Bayes is a classification approach that adopts the principle of class conditional
independence from the Bayes Theorem. This means that the presence of one feature does not impact
the presence of another in the probability of a given outcome, and each predictor has an equal effect
on that result. There are three types of Naïve Bayes classifiers: Multinomial Naïve Bayes, Bernoulli
Naïve Bayes, and Gaussian Naïve Bayes. This technique is primarily used in text classification, spam
identification, and recommendation systems.
Linear regression: Linear regression is used to identify the relationship between a dependent variable
and one or more independent variables and is typically leveraged to make predictions about future
outcomes. When there is only one independent variable and one dependent variable, it is known as
simple linear regression. As the number of independent variables increases, it is referred to as
multiple linear regression. Each type of linear regression seeks to plot a line of best fit, which is
calculated through the method of least squares. However, unlike other regression models, this line is
straight when plotted on a graph.
Logistic regression: While linear regression is leveraged when dependent variables are continuous,
logistic regression is selected when the dependent variable is categorical, meaning it has binary
outputs, such as "true" and "false" or "yes" and "no." While both regression models seek to understand
relationships between data inputs, logistic regression is mainly used to solve binary classification
problems, such as spam identification.
Support vector machines (SVM): A support vector machine is a popular supervised learning model
developed by Vladimir Vapnik, used for both data classification and regression. That said, it is
typically leveraged for classification problems, constructing a hyperplane where the distance between
two classes of data points is at its maximum. This hyperplane is known as the decision boundary,
separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.
K-nearest neighbor: K-nearest neighbor, also known as the KNN algorithm, is a non-parametric
algorithm that classifies data points based on their proximity and association to other available data.
This algorithm assumes that similar data points can be found near each other. As a result, it seeks to
calculate the distance between data points, usually through Euclidean distance, and then it assigns a
category based on the most frequent category or average. Its ease of use and low calculation time
make it a preferred algorithm by data scientists, but as the test dataset grows, the processing time
lengthens, making it less appealing for classification tasks. KNN is typically used for recommendation
engines and image recognition.
Random forest: Random forest is another flexible supervised machine learning algorithm used for
both classification and regression purposes. The "forest" references a collection of uncorrelated decision
trees, which are then merged together to reduce variance and create more accurate data predictions.
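To make these descriptions concrete, here is a minimal sketch in Python that tries several of the classifiers described above on a small labeled dataset. It assumes scikit-learn is installed; the iris dataset and all parameter values are chosen only for illustration, not as tuned recommendations.

# Fit and score several supervised classifiers on a toy labeled dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)               # learn from the labeled training set
    print(name, model.score(X_test, y_test))  # accuracy on the held-out test set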

Distance-based methods:
Distance-based algorithms are machine learning algorithms that classify queries by computing
distances between these queries and a number of internally stored exemplars. Exemplars that are
closest to the query have the largest influence on the classification assigned to the query. Here are some
distance-based methods in machine learning:

Cosine similarity: A commonly used similarity measure that computes the cosine of the angle between
two vectors or data points to quantify how similar they are.
Hamming distance: An alternative measure to Euclidean distance that calculates how separate two
values are. Hamming distances are often used in Coding Theory for error detection.
Distance metric learning: Automatically constructs task-specific distance metrics from supervised
data. The learned distance metric can be used for tasks like k-NN classification, clustering, and
information retrieval.

KNN: A distance-based machine learning algorithm that looks for similar observations to answer
classification or regression problems. KNN can be used for simple missing value imputation and
complex facial recognition systems.

Euclidean distance: A common distance measure used in machine learning. It provides the foundation
for many popular and effective ML algorithms like k-nearest neighbors for supervised learning and k-
means clustering for unsupervised learning.

Bayes classifiers: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are
distance-based models that use Mahalanobis distance to make predictions. Mahalanobis distance is a
measure of the distance between a point and a distribution.

Feature extraction: Some amount of pre-processing is required to ensure input parameters can be
arranged as data points to support distance calculation to enable clustering.
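A minimal sketch of three of the measures listed above, using only NumPy; the helper function names are illustrative and not taken from any particular library.

import numpy as np

def euclidean(a, b):
    # straight-line distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # cosine of the angle between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return int(np.sum(a != b))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0])
print(euclidean(x, y), cosine_similarity(x, y))
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))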

K-Nearest Neighbor (KNN) Algorithm for Machine Learning:


K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the
new case into the category that is most similar to the available categories.
K-NN algorithm stores all the available data and classifies a new data point based on the similarity.
This means that when new data appears, it can be easily classified into a well-suited category by using the
K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead, it stores the dataset and performs an action on it at the time of classification.
At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to
know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works
on a similarity measure. Our KNN model will compare the features of the new image with those of the cat
and dog images and, based on the most similar features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1;
in which of these categories will this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance between the new data point and the stored data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

Firstly, we will choose the number of neighbors, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as:

d = √((x2 − x1)² + (y2 − y1)²)
By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:

As we can see, three of the five nearest neighbors are from category A; hence this new data point belongs
to category A.
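The steps above can be written directly in code. Below is a minimal from-scratch sketch in Python; the points, labels, and the query point are made up for illustration and mirror the 3-vs-2 example just described.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Steps 2-3: Euclidean distance to every stored point, keep the k nearest
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the k nearest neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 3]), k=5))  # -> "A" (3 neighbors in A, 2 in B)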
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
There is no particular way to determine the best value for "K", so we need to try some values to find
the best out of them. The most preferred value for K is 5.
A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
Large values for K reduce the effect of noise, but they can blur the boundaries between categories and
increase computation; a small sketch for comparing candidate values is shown below.
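A small sketch for picking K by cross-validation, assuming scikit-learn is installed; the iris dataset and the range of K values tried here are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())   # pick the K with the best average accuracy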
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
The value of K always needs to be determined, which can sometimes be complex.
The computation cost is high because of calculating the distance between the data points for all the
training samples.

Decision Tree:
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes
are used to make any decision and have multiple branches, whereas leaf nodes are the outputs of
those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree
into subtrees.
Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
· Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
· Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after
reaching a leaf node.
· Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
· Branch/Sub Tree: A tree formed by splitting the tree.
· Pruning: Pruning is the process of removing the unwanted branches from the tree.
· Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; the final
node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to
accept the offer or not. To solve this problem, the decision tree starts with the root node (Salary
attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further splits into
one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:
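As a concrete illustration of the procedure above, here is a minimal sketch of fitting a CART-style decision tree with scikit-learn (assumed installed); the iris dataset and depth limit are illustrative. The "entropy" criterion corresponds to information gain and "gini" to the Gini index, both described in the next section.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # text view of the root node, decision nodes, and leaves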

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems, there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular ASM techniques:
Information Gain
Gini Index
1. Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision tree.
A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the below
formula:

Information Gain = Entropy(S) − [Weighted Avg × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
S= Total number of samples
P(yes)= probability of yes
P(no)= probability of no
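A from-scratch sketch of the entropy and information-gain formulas above for a binary (yes/no) label; the example split is made up for illustration.

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    values, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    # Entropy(S) minus the weighted average entropy of the child subsets
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = np.array(["yes"] * 9 + ["no"] * 5)
left   = np.array(["yes"] * 6 + ["no"] * 2)   # one branch of a candidate split
right  = np.array(["yes"] * 3 + ["no"] * 3)   # the other branch
print(information_gain(parent, [left, right]))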
2. Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − Σⱼ Pⱼ²
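A small from-scratch sketch of the Gini index formula above; the example labels are made up.

import numpy as np

def gini_index(labels):
    # 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array(["yes", "yes", "yes", "no"])))   # 1 - (0.75² + 0.25²) = 0.375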
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learned tree without
reducing accuracy is known as Pruning. There are mainly two types of tree pruning techniques:
Cost Complexity Pruning
Reduced Error Pruning.
Advantages of the Decision Tree
It is simple to understand as it follows the same process that a human follows while making any
decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes for a problem.
It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
The decision tree contains lots of layers, which makes it complex.
It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
For more class labels, the computational complexity of the decision tree may increase.

Linear regression
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more
independent (x) variables, hence the name linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values of the x and y variables form the training dataset used to fit the linear regression model.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.
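A minimal sketch contrasting simple and multiple linear regression with scikit-learn (assumed installed); the data are synthetic and the true coefficients are chosen only for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: one independent variable
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x[:, 0] + 5.0 + rng.normal(0, 1, size=50)
simple = LinearRegression().fit(x, y)
print(simple.coef_, simple.intercept_)     # estimates of a1 and a0 from y = a0 + a1*x

# Multiple linear regression: several independent variables
X = rng.uniform(0, 10, size=(50, 3))
y2 = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(0, 1, size=50)
multiple = LinearRegression().fit(X, y2)
print(multiple.coef_, multiple.intercept_)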
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then
such a relationship is termed as a Positive linear relationship.

Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis,
then such a relationship is called a negative linear relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, which means the error
between the predicted values and the actual values should be minimized. The best fit line will have the least
error.
Different values of the weights or line coefficients (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate these, we
use a cost function.
Cost function-
The cost function is used to estimate the values of the coefficients for the best fit line.
Cost function optimizes the regression coefficients or weights. It measures how a linear regression
model is performing.
We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the
squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σᵢ (yᵢ − (a1xᵢ + a0))²

Where,
N = Total number of observations
yᵢ = Actual value
(a1xᵢ + a0) = Predicted value
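A small sketch of the MSE cost function above, using only NumPy; the sample values and coefficients are made up for illustration.

import numpy as np

def mse(y_actual, x, a0, a1):
    # mean of squared differences between actual and predicted values
    y_pred = a1 * x + a0
    return np.mean((y_actual - y_pred) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(mse(y, x, a0=1.0, a1=2.0))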
Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, the residuals will be high, and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small, and
hence the cost function will be small.
Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
This is done by randomly selecting initial coefficient values and then iteratively updating them to
reach the minimum of the cost function.
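A from-scratch sketch of gradient descent on the MSE cost for y = a0 + a1*x; the learning rate, iteration count, and data are arbitrary illustrative choices.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])

a0, a1 = 0.0, 0.0          # start from arbitrary coefficients
lr = 0.01                  # learning rate
for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    a0 -= lr * 2 * error.mean()           # gradient of MSE with respect to a0
    a1 -= lr * 2 * (error * x).mean()     # gradient of MSE with respect to a1
print(a0, a1)              # should approach the least-squares intercept and slope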
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below
method:
1. R-squared method:
R-squared is a statistical method that determines the goodness of fit.
It measures the strength of the relationship between the dependent and independent variables on a
scale of 0-100%.
A high value of R-squared indicates less difference between the predicted and actual
values and hence represents a good model.
It is also called a coefficient of determination, or coefficient of multiple determination for multiple
regression.
It can be calculated from the below formula:

R-squared = Explained variation / Total variation
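A small sketch of the R-squared calculation in its equivalent 1 − SS_res/SS_tot form, using only NumPy; the actual and predicted values here are made up.

import numpy as np

def r_squared(y_actual, y_pred):
    ss_res = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.8, 5.2, 6.9, 9.3])
print(r_squared(y_actual, y_pred))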

Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are formal checks to perform while
building a linear regression model, which help ensure the best possible result from the given
dataset.
Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent variables.
Small or no multicollinearity between the features:


Multicollinearity means high correlation between the independent variables. Due to multicollinearity,
it may be difficult to find the true relationship between the predictors and the target variable; in
other words, it is difficult to determine which predictor variable is affecting the target variable and
which is not. So, the model assumes either little or no multicollinearity between the features or
independent variables.
Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of independent
variables. With homoscedasticity, there should be no clear pattern distribution of data in the scatter
plot.
Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If error
terms are not normally distributed, then confidence intervals will become either too wide or too
narrow, which may cause difficulties in finding coefficients.
This can be checked using a Q-Q plot. If the plot shows a straight line without any deviation, the
error terms are normally distributed.
No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation
in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually
occurs if there is a dependency between residual errors.
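A rough sketch of two of the checks above: a correlation matrix between features (multicollinearity) and the lag-1 autocorrelation of residuals. It uses only NumPy, and the data are synthetic, so the printed values are only illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # stand-in feature matrix
residuals = rng.normal(size=100)       # stand-in model residuals

# Off-diagonal values near 1 or -1 suggest multicollinearity between features
print(np.corrcoef(X, rowvar=False))

# A lag-1 correlation near 0 suggests no autocorrelation in the residuals
print(np.corrcoef(residuals[:-1], residuals[1:])[0, 1])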

Kernel methods:
A kernel method is a technique used in machine learning to estimate the value of a function at
a given point. It generalizes the idea behind the support vector machine (SVM). Kernel methods
are used in a variety of machine learning tasks, including regression, classification, and clustering.
What are the benefits of using a kernel method?
There are many benefits to using kernel methods in AI. Kernel methods can help to improve the
accuracy of predictions, and they can also help to reduce the amount of data that needs to be
processed. Kernel methods can also help to improve the efficiency of learning algorithms, and they
can help to improve the interpretability of results.
What are some common kernel functions?
There are many common kernel functions in AI, but the most popular ones are the RBF (Radial Basis
Function) and the polynomial kernel. The RBF kernel is used in many different applications, such as
support vector machines, and is a very popular choice. The polynomial kernel is also used in many
applications, such as regression and classification.
How do you choose the best kernel function for a given problem?
When it comes to choosing the best kernel function for a given problem in AI, there are a few things
to consider. First, you need to think about what type of data you are working with. If you are
working with linear data, then a linear kernel function is likely to be the best choice. If you are
working with nonlinear data, then a nonlinear kernel function is likely to be the best choice. There
are a variety of kernel functions to choose from, so it is important to select the one that will work
best for your data.
Another thing to consider is the size of your data. If you have a large dataset, then you may want to
choose a kernel function that is computationally efficient. If you have a small dataset, then you may
be able to get away with using a more complex kernel function.
Finally, you need to think about what type of problem you are trying to solve. If you are trying to
solve a classification problem, then you will want to use a kernel function that is able to separate the
data into classes. If you are trying to solve a regression problem, then you will want to use a kernel
function that is able to fit a line to the data.
There is no one perfect kernel function for all problems, so it is important to select the one that is
best suited for your data and your problem. With a little trial and error, you should be able to find
the kernel function that works best for you.
What are some common issues that can arise when using kernel methods?
There are a few common issues that can arise when using kernel methods in AI. One issue is that the
kernels can be very sensitive to hyperparameters, which can lead to overfitting. Another issue is that
some kernels can be computationally expensive, which can make training time prohibitive. Finally,
some kernels can be unstable, which can lead to numerical issues during training.
The kernel method is a mathematical technique used in machine learning for analyzing data.
It uses a kernel function that maps data from one space to another.
It is generally used in Support Vector Machines (SVMs), where the algorithm classifies data by finding
the hyperplane that separates the data points of different classes.
The most important benefit of the kernel method is that it can work with non-linearly separable data,
and it supports multiple kernel functions, depending on the type of data.
Because the linear classifier can solve a very limited class of problems, the kernel trick is employed to
empower the linear classifier, enabling the SVM to solve a larger class of problems.

What are the types of Kernel methods in SVM models?


Support vector machines use various kinds of kernel methods in machine learning. Here are a few of
them:
1. Linear Kernel
It is used when the data is linearly separable.
K(x1, x2) = x1 . x2
2. Polynomial Kernel
It is used when the data is not linearly separable.
K(x1, x2) = (x1 · x2 + 1)^d
3. Gaussian Kernel
The Gaussian kernel is an example of a radial basis function kernel. It can be represented with this
equation:
k(xi, xj) = exp(−γ ||xi − xj||²)
4. Exponential Kernel
Similar to RBF kernel, but it decays much more quickly.
k(x, y) = exp(−||x − y|| / (2σ²))
5. Laplacian Kernel
Similar to RBF Kernel, it has a sharper peak and faster decay.
k(x, y) = exp(−||x − y|| / σ)
6. Hyperbolic or the Sigmoid Kernel
It is used for non-linear classification problems. It transforms the input data into a higher-
dimensional space using the Sigmoid kernel.
k(x, y) = tanh(xᵀy + c)
7. Anova radial basis kernel
It is a multiple-input kernel function that can be used for feature selection.
k(x, y) = ∑ₖ₌₁ⁿ exp(−(xₖ − yₖ)²)^d
8. Radial-basis function kernel
It maps the input data to an infinite-dimensional space.
K(x, y) = exp(-γ ||x - y||^2)
9. Wavelet kernel
It is a non-stationary kernel function that can be used for time-series analysis.
K(x, y) = ∑φ(i,j) Ψ(x(i),y(j))
10. Spectral kernel
This function is based on the eigenvalues and eigenvectors of a similarity matrix.
K(x, y) = ∑λi φi(x) φi(y)
11. Mahalanobis kernel
This function takes into account the covariance structure of the data.
K(x, y) = exp(−½ (x − y)ᵀ S⁻¹ (x − y))
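A from-scratch sketch of a few of the kernel functions listed above, using only NumPy; the values of gamma, sigma, c, and d are illustrative hyperparameters, not recommendations. In practice, scikit-learn's SVC accepts "linear", "poly", "rbf", and "sigmoid" as built-in kernels.

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def laplacian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def sigmoid_kernel(x, y, c=1.0):
    return np.tanh(np.dot(x, y) + c)

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, laplacian_kernel, sigmoid_kernel):
    print(k.__name__, k(a, b))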

Beyond Binary Classification: Multi-class/Structured Outputs, Ranking


Classification
In machine learning, classification, as the name suggests, classifies data into different
parts/classes/groups. It is used to predict which class the input data belongs to.

For example, if we are taking a dataset of scores of a cricketer in the past few matches, along with
average, strike rate, not outs etc, we can classify him as “in form” or “out of form”.
Classification is the process of assigning new input variables (X) to the class they most likely belong
to, based on a classification model, as constructed from previously labeled training data.
Data with labels is used to train a classifier so that it can perform well on data without labels (not
yet labeled). Repeatedly classifying data into previously known classes is what trains the model.
Classification applies to discrete classes; when the target is continuous, the task is regression instead.
Types of Classification
There are two types of classifications;
Binary classification
Multi-class classification
Binary Classification
It is a process or task of classification in which the given data is classified into two classes. It's
basically a kind of prediction about which of two groups the thing belongs to.
Let us suppose, two emails are sent to you, one is sent by an insurance company that keeps sending
their ads, and the other is from your bank regarding your credit card bill. The email service provider
will classify the two emails, the first one will be sent to the spam folder and the second one will be
kept in the primary one.
This process is known as binary classification, as there are two discrete classes, one is spam and the
other is primary. So, this is a problem of binary classification.
Binary classification uses various algorithms to do the task; some of the most commonly used ones are:
Logistic Regression
k-Nearest Neighbors
Decision Trees
Support Vector Machine
Naive Bayes
Multiclass Classification
Multi-class classification is the task of classifying elements into more than two classes. Unlike binary
classification, it is not restricted to two classes.
Examples of multi-class classification are
classification of news in different categories,
classifying books according to the subject,
classifying students according to their streams etc.
In these, there are different classes for the response variable to be classified in and thus according to
the name, it is a Multi-class classification.
Can a classification problem be either binary or multi-class?
Let us suppose we have to do sentiment analysis of a person: if the classes are just "positive" and
"negative", then it is a binary classification problem. But if the classes are "sadness", "happiness",
"disgust", and "depressed", then it is a multi-class classification problem.
Binary vs Multiclass Classification

No. of classes:
· Binary classification: classifies objects into at most two classes.
· Multi-class classification: there can be any number of classes; it classifies objects into more than two classes.

Algorithms used:
· Binary classification: Logistic Regression, k-Nearest Neighbors, Decision Trees, Support Vector Machine, Naive Bayes.
· Multi-class classification: k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, Gradient Boosting.

Examples:
· Binary classification: email spam detection (spam or not), churn prediction (churn or not), conversion prediction (buy or not).
· Multi-class classification: face classification, plant species classification, optical character recognition.
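A minimal sketch contrasting the two settings with scikit-learn (assumed installed); the datasets and models are chosen only for illustration.

from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Binary classification: two classes (malignant vs. benign)
Xb, yb = load_breast_cancer(return_X_y=True)
print(cross_val_score(LogisticRegression(max_iter=5000), Xb, yb, cv=5).mean())

# Multi-class classification: three classes (iris species)
Xm, ym = load_iris(return_X_y=True)
print(cross_val_score(RandomForestClassifier(random_state=0), Xm, ym, cv=5).mean())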
