

Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit II

Syllabus: Supervised Learning Algorithms: Learning a Class from Examples; Linear, Non-linear, Multi-class and Multi-label classification; Decision Trees: ID3, Classification and Regression Trees (CART); Regression: Linear Regression, Multiple Linear Regression, Logistic Regression; Neural Networks: Introduction, Perceptron, Multilayer Perceptron; Support Vector Machines: Linear and Non-linear, Kernel Functions; K-Nearest Neighbors.

Course Objective: To familiarize students with the knowledge of machine learning and enable them to apply suitable machine learning techniques for data handling and to gain knowledge from it; to evaluate the performance of algorithms and to provide solutions for various real-world applications.
_____________________________________________________________________________
Course Outcome (CO2): Apply various supervised learning methods to appropriate
problems.

SUPERVISED LEARNING ALGORITHMS

In supervised learning, you train the machine using data that is well "labeled," meaning some of the data is already tagged with the correct answer. It can be compared to learning that takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights given remain true as the data changes.

• Supervised learning allows you to collect data or produce a data output from previous experience.

• It helps to optimize performance criteria using experience.

• Supervised machine learning helps to solve various types of real-world computation problems (a minimal workflow sketch follows this list).
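As a concrete illustration, the following is a minimal sketch of this workflow in Python using scikit-learn (the library installed later in these notes); the iris dataset and the choice of classifier are illustrative assumptions, not part of the syllabus example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: X holds the input features, y holds the correct answers.
X, y = load_iris(return_X_y=True)

# Hold out part of the data to check predictions on unforeseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)  # any supervised classifier fits here
model.fit(X_train, y_train)                  # learn from the labeled training data
print(model.score(X_test, y_test))           # accuracy on unseen data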

LEARNING A CLASS FROM EXAMPLES

Suppose we want to learn the class, C, of a "family car." We have a set of examples of cars, and we have a group of people whom we survey and to whom we show these cars. The people look at the cars and label them; the cars that they believe are family cars are positive examples, and the others are negative examples. Class learning is finding a description that is shared by all the positive examples and none of the negative examples.

The features that separate a family car from other types of cars are the price and engine power. These two attributes are the inputs to the class recognizer.

Class C of a "family car":

• Prediction: Is car x a family car?

• Knowledge extraction: What do people expect from a family car?

Output: Positive (+) and negative (–) examples

Input representation:

x1: price, x2: engine power

Figure 2.1: Training set for the class of a "family car"

Class C has price as the first input attribute x1 and engine power as the second attribute x2.


Figure 2.2: Example of a hypothesis class. The class of "family car" is a rectangle

Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t (see Figure 2.1). After further discussions with the expert and analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each be in a certain range:

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)

Figure 2.3: The error of hypothesis h given the training set X

The aim is to find h ∈ H that is as similar as possible to C, so that the hypothesis h makes a correct prediction for an instance x. All we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X.
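Written out (a standard formulation, assuming the training set X contains N labeled instances (x^t, r^t)), the empirical error of h is:

E(h | X) = (1/N) Σ_{t=1..N} 1(h(x^t) ≠ r^t)

where 1(·) equals 1 when its argument is true and 0 otherwise, so E(h | X) counts the fraction of mismatches.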


LINEAR CLASSIFICATION

Linear classifiers classify data into labels based on a linear combination of the input features. Therefore, these classifiers separate data using a line, a plane, or a hyperplane (a plane in more than 2 dimensions). On their own they can only be used to classify data that is linearly separable, although they can be modified to classify non-linearly separable data. The major algorithms in linear classification are:

Perceptron: In the perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of wᵀx tells us which side of the plane wᵀx = 0 the point x lies on. Thus, by taking the threshold as 0, the perceptron classifies data based on which side of the plane a new point lies on.

The task during training is to arrive at the plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.
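A minimal sketch of this training rule in Python with NumPy (the toy dataset and the epoch count are illustrative assumptions):

import numpy as np

# Toy linearly separable data: two features per point, labels in {0, 1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

w = np.zeros(2)   # weight vector defining the plane w'x = 0
b = 0.0           # bias shifts the plane away from the origin

for epoch in range(10):                              # converges if data is separable
    for x_i, y_i in zip(X, y):
        pred = 1 if np.dot(w, x_i) + b > 0 else 0    # threshold at 0
        w += (y_i - pred) * x_i                      # update only on mistakes
        b += (y_i - pred)

print(w, b)  # the sign of w.x + b classifies new points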

Logistic Regression: In logistic regression, we take a weighted linear combination of the input features and pass it through a sigmoid function, which outputs a number between 0 and 1. Unlike the perceptron, which just tells us which side of the plane the point lies on, logistic regression gives the probability of a point lying on a particular side of the plane. The probability of classification will be very close to 1 or 0 as the point moves far away from the plane, while the probability of classification of points very close to the plane is close to 0.5.

SVM: There can be multiple hyperplanes that separate linearly separable data. SVM calculates the optimal separating hyperplane using concepts of geometry. SVM can only be used to separate linearly separable data, but we can modify the data by projecting it into higher dimensions to make it linearly separable.

NONLINEAR CLASSIFICATION

Nonlinear functions can be used to separate instances that are not linearly separable. There are many nonlinear classifiers:

K-nearest-neighbors (kNN): The KNN algorithm assumes that similar things exist in close
proximity. In other words, similar things are near to each other. The KNN algorithm hinges
on this assumption being true enough for the algorithm to be useful. KNN captures the idea
of similarity.

Kernel SVM: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.

Decision Tree: The decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Multilayer Perceptron: The perceptron consists of an input layer and an output layer which are fully connected. MLPs have the same input and output layers but may have multiple hidden layers in between. The input layer is the first set of perceptrons, which output positive/negative based on the observed features in the data. A hidden layer is a set of perceptrons that uses the outputs of the previous layer as inputs, instead of using the original data. There can be multiple hidden layers, and the final layer is called the output layer.

MULTICLASS CLASSIFICATION

Multiclass classification means a classification task with more than two classes, e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multiclass classification
makes the assumption that each sample is assigned to one and only one label: a fruit can be
either an apple or a pear but not both at the same time.

Figure 2.4: Multiclass classification, where one column = one class

MULTILABEL CLASSIFICATION

Multilabel classification assigns to each sample a set of target labels. This can be thought of
as predicting properties of a data-point that are not mutually exclusive, such as topics that
are relevant for a document. A text might be about any of religion, politics, finance or
education at the same time or none of these.
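To make the label representation concrete, here is a minimal sketch (the topic labels are illustrative assumptions) using scikit-learn's MultiLabelBinarizer to turn label sets into the one-column-per-class matrix of Figure 2.5:

from sklearn.preprocessing import MultiLabelBinarizer

# Each document may carry several topic labels at once, or none.
docs_labels = [["politics", "finance"], ["religion"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(docs_labels)

print(mlb.classes_)  # ['finance' 'politics' 'religion']
print(Y)             # [[1 1 0]
                     #  [0 0 1]
                     #  [0 0 0]]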


Figure 2.5: Multilabel classification, where one column = one class

DECISION TREES

A decision tree is a structure that contains nodes and edges and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (known as a decision node) or to represent an outcome (known as a leaf node).

ID3

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.

ID3 uses Information Gain or just Gain to find the best feature. Information Gain calculates
the reduction in the entropy and measures how well a given feature separates or classifies
the target classes. The feature with the highest Information Gain is selected as the best one.

ID3 Steps

• Calculate the Information Gain of each feature.

• Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.


• Make a decision tree node using the feature with the maximum Information Gain.

• If all rows belong to the same class, make the current node a leaf node with the class as its label.

• Repeat for the remaining features until we run out of features, or the decision tree has all leaf nodes (a sketch of the entropy and Information Gain computation follows this list).
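As referenced above, here is a minimal sketch of computing entropy and Information Gain in Python (the toy dataset and its column names are hypothetical, chosen to echo the family-car example):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature, target):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# Hypothetical toy dataset: does the 'price' band predict 'family_car'?
rows = [{"price": "mid", "family_car": "yes"},
        {"price": "mid", "family_car": "yes"},
        {"price": "high", "family_car": "no"},
        {"price": "low", "family_car": "no"}]

print(information_gain(rows, "price", "family_car"))  # 1.0: a perfect split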

CLASSIFICATION AND REGRESSION TREES (CART)

Classification and Regression Trees, or CART for short, is a term introduced to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems. The CART algorithm provides the foundation for important algorithms like bagged decision trees, random forests, and boosted decision trees.

CART Model Representation

The representation for the CART model is a binary tree.

This is a binary tree in which each internal node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

For a dataset with two inputs (x) of height in centimeters and weight in kilograms and an output of gender as male or female, below is an example of a binary decision tree.

Figure 2.6: Decision Tree

The tree can be stored to file as a graph or a set of rules. For example, below is the above
decision tree as a set of rules.


• If Height > 180 cm Then Male

• If Height <= 180 cm AND Weight > 80 kg Then Male

• If Height <= 180 cm AND Weight <= 80 kg Then Female

Making Predictions with CART Models

With the binary tree representation of the CART model, making predictions is relatively straightforward. The tree is traversed by evaluating the specific input, starting at the root node of the tree.

A learned binary tree is actually a partitioning of the input space, with each input variable treated as a dimension of a p-dimensional space. The decision tree splits this space up into rectangles (when p = 2 input variables) or hyper-rectangles (with more inputs). New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model. This gives some feeling for the type of decisions that a CART model is capable of making, e.g., boxy decision boundaries.
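A minimal sketch of this traversal for the height/weight rules listed above (the function and variable names are hypothetical):

def predict_gender(height_cm, weight_kg):
    # Traverse from the root: each comparison is one internal node.
    if height_cm > 180:
        return "Male"
    if weight_kg > 80:
        return "Male"
    return "Female"

print(predict_gender(185, 75))  # Male (the first rule fires)
print(predict_gender(170, 70))  # Female (falls through both splits)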

REGRESSION

Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Regression analysis is a predictive modeling technique that analyzes the relation between the target (dependent) variable and the independent variables in a dataset. The different types of regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship with each other and the target variable contains continuous values. The regression technique is used mainly to determine predictor strength, to forecast trends and time series, and to examine cause-and-effect relationships.

LINEAR REGRESSION

Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. In case the data involves more than one independent variable, linear regression is called a multiple linear regression model.

The equation below denotes the linear regression model:

y = mx + c + e

where m is the slope of the line, c is the intercept, and e represents the error in the model.


Figure 2.7: Linear regression

The best-fit line is determined by varying the values of m and c. The predictor error is the difference between the observed values and the predicted values. The values of m and c are selected so as to give the minimum predictor error. It is important to note that a simple linear regression model is susceptible to outliers; therefore, it should be used with caution on large datasets.
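A minimal sketch of fitting m and c by least squares with NumPy (the data points are illustrative assumptions):

import numpy as np

# Toy data roughly following y = 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# polyfit with degree 1 returns the slope m and intercept c
# that minimize the squared predictor error.
m, c = np.polyfit(x, y, 1)
print(m, c)          # close to 2 and 1

y_pred = m * x + c   # predictions of the fitted line
print(y - y_pred)    # residuals (the predictor error)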

MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

The formula for multiple linear regression, in standard notation, is:

y = b0 + b1x1 + b2x2 + … + bkxk + e

where y is the response variable, x1 through xk are the explanatory variables, b0 is the intercept, b1 through bk are the coefficients, and e is the error term.
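A minimal sketch of fitting such a model with scikit-learn (the two-feature dataset is an illustrative assumption, constructed to satisfy y = 1 + 2·x1 + 3·x2 exactly):

import numpy as np
from sklearn.linear_model import LinearRegression

# Two explanatory variables per row.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])  # y = 1 + 2*x1 + 3*x2

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)             # b0, close to 1
print(model.coef_)                  # [b1, b2], close to [2, 3]
print(model.predict([[5.0, 5.0]]))  # 1 + 10 + 15 = 26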

LOGISTIC REGRESSION

Logistic regression is a type of regression analysis technique which is used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variable.

The logit function is used in logistic regression to measure the relationship between the target variable and the independent variables. Below is the equation that denotes logistic regression:

logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk

where p is the probability of occurrence of the feature.

Figure 2.8: Logistic regression

When selecting logistic regression as the regression analysis technique, it should be noted that the size of the data should be large, with an almost equal occurrence of the two values of the target variable. Also, there should be no multicollinearity, which means that there should be no correlation between the independent variables in the dataset.
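A minimal sketch with scikit-learn, showing probabilities near 0.5 close to the decision boundary and near 0 or 1 far from it (the one-feature dataset is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature; the class flips from 0 to 1 around x = 0.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid to b0 + b1*x.
print(model.predict_proba([[0.0]]))  # near [0.5, 0.5]: on the boundary
print(model.predict_proba([[3.0]]))  # close to [0, 1]: far from the plane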

NEURAL NETWORKS: Introduction

A neural network is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.


Figure 2.9: Neural network layers

Neural networks are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the human cerebral cortex, but on much smaller scales. A large neural network might have hundreds or thousands of processor units, whereas a human brain has billions of neurons, with a corresponding increase in the magnitude of their overall interaction and emergent behavior. Neural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output.

PERCEPTRON

A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class. In short, a perceptron is a single-layer neural network. It consists of four main parts: input values, weights and bias, net sum, and an activation function.

The perceptron's computation begins by taking all the input values and multiplying them by their weights. Then, all of these multiplied values are added together to create the weighted sum. The weighted sum is then applied to the activation function, producing the perceptron's output. The activation function plays the integral role of ensuring the output is mapped between required values such as (0, 1) or (-1, 1). It is important to note that the weight of an input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.


Figure 2.10: The process of Perceptron

As a simplified form of a neural network, specifically a single-layer neural network, perceptrons play an important role in binary classification. This means the perceptron is used to classify data into two parts, hence "binary." Sometimes, perceptrons are also referred to as linear binary classifiers for this reason.

MULTILAYER PERCEPTRONS

The perceptron is very useful for classifying data sets that are linearly separable. It encounters serious limitations with data sets that do not conform to this pattern, as discovered with the XOR problem: there exist labelings of four points, XOR being the classic example, that are not linearly separable. The multilayer perceptron (MLP) breaks this restriction and classifies datasets which are not linearly separable. It does this by using a more robust and complex architecture to learn regression and classification models for difficult datasets.

The perceptron consists of an input layer and an output layer which are fully connected. MLPs have the same input and output layers but may have multiple hidden layers in between the aforementioned layers.

The algorithm for the MLP is as follows (a minimal forward-pass sketch follows the list):

• Just as with the perceptron, the inputs are pushed forward through the MLP by taking the dot product of the input with the weights that exist between the input layer and the hidden layer. This dot product yields a value at the hidden layer. We do not push this value forward as we would with a perceptron, though.


• MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function, tanh.
Push the calculated output at the current layer through any of these activation
functions.

• Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot product
with the corresponding weights.

• Repeat steps two and three until the output layer is reached.

• At the output layer, the calculations will either be used for a backpropagation
algorithm that corresponds to the activation function that was selected for the MLP
(in the case of training) or a decision will be made based on the output (in the case
of testing).
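As referenced above, a minimal sketch of the MLP forward pass in NumPy (the layer sizes, random weights, and the choice of ReLU and sigmoid activations are illustrative assumptions):

import numpy as np

def relu(z):
    return np.maximum(0, z)          # activation at the hidden layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # activation at the output layer

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])       # one input with three features

W1 = rng.normal(size=(3, 4))         # weights: input layer -> hidden layer
W2 = rng.normal(size=(4, 1))         # weights: hidden layer -> output layer

h = relu(x @ W1)                     # steps 1-2: dot product, then activation
y = sigmoid(h @ W2)                  # step 3: push to the next layer
print(y)                             # output used for a decision or for backpropagation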

MLPs form the basis for all neural networks and have greatly improved the power of
computers when applied to classification and regression problems. Computers are no
longer limited by XOR cases and can learn rich and complex models thanks to the multilayer
perceptron.

SUPPORT VECTOR MACHINES: LINEAR AND NONLINEAR

Support Vector Machine (SVM) is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes.

SVM is an algorithm that takes the data as input and outputs a line that separates those classes if possible. Suppose we have a dataset as shown below and we need to classify the red rectangles from the blue ellipses (the positives from the negatives). So, our task is to find an ideal line that separates this dataset into two classes (red and blue).


Figure 2.11: Data set to classify Blue and Red

There are infinitely many lines that can separate these two classes.

Figure 2.12: Classes separated by many lines

It's visually quite intuitive in this case that the yellow line classifies better. The green line in the image above is quite close to the red class. Though it classifies the current dataset, it is not a generalized line, and in machine learning our goal is to find a more generalized separator.

The SVM algorithm finds the points closest to the line from both classes. These points are called support vectors. Now, we compute the distance between the line and the support vectors. This distance is called the margin. The goal is to maximize the margin. The hyperplane for which the margin is maximum is called the optimal hyperplane.


Figure 2.13: Computing the Hyperplane

SVM tries to make a decision boundary in such a way that the separation between the two classes (that street) is as wide as possible.

For a more complex dataset which is not linearly separable, we use a non-linear SVM.

Figure 2.14: Dataset for non-linear SVM

This data is clearly not linearly separable; we cannot draw a straight line that can classify it. But this data can be converted to linearly separable data in a higher dimension. Let's add one more dimension and call it the z-axis. Let the co-ordinates on the z-axis be governed by the constraint

z = x² + y²

So, basically, the z co-ordinate is the square of the distance of the point from the origin. Let's plot the data on the z-axis.


Figure 2.15: Dataset in the higher dimension

Now the data is clearly linearly separable. Let the purple line separating the data in the higher dimension be z = k, where k is a constant. Since z = x² + y², we get x² + y² = k, which is the equation of a circle. So, we can project this linear separator in the higher dimension back to the original dimensions using this transformation.

Figure 2.16: Decision boundary in original dimensions

Thus, we can classify data by adding an extra dimension to it so that it becomes linearly separable, and then projecting the decision boundary back to the original dimensions using a mathematical transformation.
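A minimal sketch of this trick, adding the z = x² + y² feature by hand and fitting a linear SVM in the lifted space (the circular toy data and the use of scikit-learn's LinearSVC are illustrative assumptions):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# An inner cluster (class 0) and an outer ring (class 1): not linearly
# separable in the original (x, y) plane.
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# Add the extra dimension z = x^2 + y^2; the classes now separate with a
# plane z = k, which maps back to the circle x^2 + y^2 = k.
z = (X ** 2).sum(axis=1, keepdims=True)
X3 = np.hstack([X, z])

clf = LinearSVC().fit(X3, y)
print(clf.score(X3, y))  # ~1.0: linearly separable in three dimensions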

KERNEL FUNCTIONS

A kernel function is a method used to take data as input and transform it into the required form for processing. The word "kernel" is used because the set of mathematical functions used in Support Vector Machines provides the window to manipulate the data. So, a kernel function generally transforms the training set of data so that a non-linear decision surface is able to be transformed into a linear equation in a higher-dimensional space. Basically, it returns the inner product between two points in a standard feature dimension.

Standard kernel function equation: K(x, y) = ⟨φ(x), φ(y)⟩, i.e., the inner product of the two points after the mapping φ into the feature space.

Major kernel functions (a usage sketch in scikit-learn follows this list):

For implementing kernel functions, first of all we have to install the "scikit-learn" library from the command prompt/terminal: pip install scikit-learn

Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about the data.

Gaussian Kernel Radial Basis Function (RBF): The same as the kernel function above, adding the radial basis method to improve the transformation.

Figure 2.17: Gaussian Kernel Radial Basis Function

Sigmoid Kernel: This function is equivalent to a two-layer perceptron model of a neural network, which is used as an activation function for artificial neurons.


Figure 2.18: Sigmoid Kernel

Polynomial Kernel: It represents the similarity of vectors in training set of data in a feature
space over polynomials of the original variables used in kernel.

Figure 2.19: Polynomial Kernel

Linear Kernel: Used when data is linearly separable.
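As referenced above, a minimal sketch of selecting these kernels via scikit-learn's SVC (the toy data is an illustrative assumption; the formulas in the comments are the standard forms of each kernel):

import numpy as np
from sklearn.svm import SVC

# A tiny toy problem, just to show the kernel switch.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

kernels = {
    "linear": SVC(kernel="linear"),        # K(x, y) = x . y
    "rbf": SVC(kernel="rbf", gamma=0.5),   # K(x, y) = exp(-gamma * ||x - y||^2)
    "poly": SVC(kernel="poly", degree=3),  # K(x, y) = (gamma * x . y + r)^3
    "sigmoid": SVC(kernel="sigmoid"),      # K(x, y) = tanh(gamma * x . y + r)
}

for name, clf in kernels.items():
    clf.fit(X, y)
    print(name, clf.predict([[0.5, 0.5], [3.5, 3.5]]))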

K-NEAREST NEIGHBORS

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near to each other. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).


In both classification and regression, choosing the right K for the data is done by trying several values of K and picking the one that works best.

The KNN Algorithm

1. Load the data

2. Initialize K to your chosen number of neighbors

3. For each example in the data

3.1 Calculate the distance between the query example and the current example from the
data.

3.2 Add the distance and the index of the example to an ordered collection

4. Sort the ordered collection of distances and indices from smallest to largest (in ascending
order) by the distances

5. Pick the first K entries from the sorted collection

6. Get the labels of the selected K entries

7. If regression, return the mean of the K labels.

8. If classification, return the mode of the K labels. (A Python sketch of these steps follows.)
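As referenced above, a minimal sketch of these steps in Python (the Euclidean distance metric and the toy data are illustrative assumptions):

import math
from collections import Counter

def knn_classify(data, query, k):
    # data: list of (features, label) pairs; query: a feature tuple (steps 1-2)
    distances = []
    for features, label in data:                      # step 3
        d = math.dist(features, query)                # step 3.1: Euclidean distance
        distances.append((d, label))                  # step 3.2
    distances.sort(key=lambda pair: pair[0])          # step 4: ascending by distance
    k_labels = [label for _, label in distances[:k]]  # steps 5-6
    return Counter(k_labels).most_common(1)[0][0]     # step 8: mode of the K labels

data = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
        ((4.0, 4.0), "blue"), ((4.2, 3.9), "blue")]

print(knn_classify(data, (1.1, 1.0), k=3))  # 'red'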

Advantages

• The algorithm is simple and easy to implement.

• There's no need to build a model, tune several parameters, or make additional assumptions.

• The algorithm is versatile: it can be used for classification, regression, and search.

Disadvantages

• The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
