

Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit II

Syllabus: Supervised Learning Algorithms: Learning a Class from Examples; Linear, Non-linear, Multi-class and Multi-label classification; Decision Trees: ID3, Classification and Regression Trees (CART); Regression: Linear Regression, Multiple Linear Regression, Logistic Regression; Neural Networks: Introduction, Perceptron, Multilayer Perceptron; Support Vector Machines: Linear and Non-linear, Kernel Functions; K-Nearest Neighbors.

Course Objective: To familiarize students with the knowledge of machine learning and enable them to apply suitable machine learning techniques for data handling and to gain knowledge from it; to evaluate the performance of algorithms and to provide solutions for various real-world applications.
_____________________________________________________________________________
Course Outcome (CO2): Apply various supervised learning methods to appropriate
problems.

SUPERVISED LEARNING ALGORITHMS

In supervised learning, you train the machine using data that is well "labeled," meaning some of the data is already tagged with the correct answer. It can be compared to learning that takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights given remain true as the data changes.

• Supervised learning allows you to collect data or produce a data output from previous experience.

• It helps to optimize performance criteria using experience.

• Supervised machine learning helps to solve various types of real-world computation problems (a minimal workflow sketch follows this list).
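As a concrete illustration, the following is a minimal sketch of this workflow in Python using scikit-learn (the library installed later in these notes); the iris dataset and the choice of classifier are illustrative assumptions, not part of the syllabus example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: X holds the input features, y holds the correct answers.
X, y = load_iris(return_X_y=True)

# Hold out part of the data to check predictions on unforeseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)  # any supervised classifier fits here
model.fit(X_train, y_train)                  # learn from the labeled training data
print(model.score(X_test, y_test))           # accuracy on unseen data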

LEARNING A CLASS FROM EXAMPLES

Suppose we want to learn the class, C, of a "family car." We have a set of examples of cars, and we have a group of people whom we survey and to whom we show these cars. The people look at the cars and label them; the cars that they believe are family cars are positive examples, and the others are negative examples. Class learning is finding a description that is shared by all the positive examples and none of the negative examples.

The features that separate a family car from other types of cars are the price and engine power. These two attributes are the inputs to the class recognizer.

Class C of a "family car":

• Prediction: Is car x a family car?

• Knowledge extraction: What do people expect from a family car?

Output: Positive (+) and negative (–) examples

Input representation:

x1: price, x2: engine power

Figure 2.1: Training set for the class of a "family car"

Class C has price as the first input attribute x1 and engine power as the second attribute x2.


Figure 2.2: Example of a hypothesis class. The class of "family car" is a rectangle

Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t (see Figure 2.1). After further discussions with the expert and analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each be in a certain range:

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)

Figure 2.3: The error of hypothesis h given the training set X

The aim is to find h ∈ H that is as similar as possible to C, so that the hypothesis h makes a correct prediction for an instance x. All we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X.
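Written out (a standard formulation, assuming the training set X contains N labeled instances (x^t, r^t)), the empirical error of h is:

E(h | X) = (1/N) Σ_{t=1..N} 1(h(x^t) ≠ r^t)

where 1(·) equals 1 when its argument is true and 0 otherwise, so E(h | X) counts the fraction of mismatches.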


LINEAR CLASSIFICATION

Linear classifiers classify data into labels based on a linear combination of the input features. Therefore, these classifiers separate data using a line, a plane, or a hyperplane (a plane in more than 2 dimensions). On their own they can only be used to classify data that is linearly separable, although they can be modified to classify non-linearly separable data. The major algorithms in linear classification are:

Perceptron: In the perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of wᵀx tells us which side of the plane wᵀx = 0 the point x lies on. Thus, by taking the threshold as 0, the perceptron classifies data based on which side of the plane a new point lies on.

The task during training is to arrive at the plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.
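A minimal sketch of this training rule in Python with NumPy (the toy dataset and the epoch count are illustrative assumptions):

import numpy as np

# Toy linearly separable data: two features per point, labels in {0, 1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

w = np.zeros(2)   # weight vector defining the plane w'x = 0
b = 0.0           # bias shifts the plane away from the origin

for epoch in range(10):                              # converges if data is separable
    for x_i, y_i in zip(X, y):
        pred = 1 if np.dot(w, x_i) + b > 0 else 0    # threshold at 0
        w += (y_i - pred) * x_i                      # update only on mistakes
        b += (y_i - pred)

print(w, b)  # the sign of w.x + b classifies new points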

Logistic Regression: In logistic regression, we take a weighted linear combination of the input features and pass it through a sigmoid function, which outputs a number between 0 and 1. Unlike the perceptron, which just tells us which side of the plane the point lies on, logistic regression gives the probability of a point lying on a particular side of the plane. The probability of classification will be very close to 1 or 0 as the point moves far away from the plane, while the probability of classification of points very close to the plane is close to 0.5.

SVM: There can be multiple hyperplanes that separate linearly separable data. SVM calculates the optimal separating hyperplane using concepts of geometry. SVM can only be used to separate linearly separable data, but we can modify the data by projecting it into higher dimensions to make it linearly separable.

NONLINEAR CLASSIFICATION

Nonlinear functions can be used to separate instances that are not linearly separable. There are many nonlinear classifiers:

K-nearest-neighbors (kNN): The KNN algorithm assumes that similar things exist in close
proximity. In other words, similar things are near to each other. The KNN algorithm hinges
on this assumption being true enough for the algorithm to be useful. KNN captures the idea
of similarity.

Kernel SVM: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.

Decision Tree: The decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Multilayer Perceptron: The perceptron consists of an input layer and an output layer which are fully connected. MLPs have the same input and output layers but may have multiple hidden layers in between. The input layer is the first set of perceptrons, which output positive/negative based on the observed features in the data. A hidden layer is a set of perceptrons that uses the outputs of the previous layer as inputs, instead of using the original data. There can be multiple hidden layers, and the final layer is called the output layer.

MULTICLASS CLASSIFICATION

Multiclass classification means a classification task with more than two classes, e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multiclass classification
makes the assumption that each sample is assigned to one and only one label: a fruit can be
either an apple or a pear but not both at the same time.

Figure 2.4: Multiclass classification, where one column = one class

MULTILABEL CLASSIFICATION

Multilabel classification assigns to each sample a set of target labels. This can be thought of
as predicting properties of a data-point that are not mutually exclusive, such as topics that
are relevant for a document. A text might be about any of religion, politics, finance or
education at the same time or none of these.
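To make the label representation concrete, here is a minimal sketch (the topic labels are illustrative assumptions) using scikit-learn's MultiLabelBinarizer to turn label sets into the one-column-per-class matrix of Figure 2.5:

from sklearn.preprocessing import MultiLabelBinarizer

# Each document may carry several topic labels at once, or none.
docs_labels = [["politics", "finance"], ["religion"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(docs_labels)

print(mlb.classes_)  # ['finance' 'politics' 'religion']
print(Y)             # [[1 1 0]
                     #  [0 0 1]
                     #  [0 0 0]]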


Figure 2.5: Multilabel classification, where one column = one class

DECISION TREES

A decision tree is a structure that contains nodes and edges and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (known as a decision node) or to represent an outcome (known as a leaf node).

ID3

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.

ID3 uses Information Gain or just Gain to find the best feature. Information Gain calculates
the reduction in the entropy and measures how well a given feature separates or classifies
the target classes. The feature with the highest Information Gain is selected as the best one.

ID3 Steps

• Calculate the Information Gain of each feature.

• Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.


• Make a decision tree node using the feature with the maximum Information Gain.

• If all rows belong to the same class, make the current node a leaf node with the class as its label.

• Repeat for the remaining features until we run out of features, or the decision tree has all leaf nodes (a sketch of the entropy and Information Gain computation follows this list).
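As referenced above, here is a minimal sketch of computing entropy and Information Gain in Python (the toy dataset and its column names are hypothetical, chosen to echo the family-car example):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature, target):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# Hypothetical toy dataset: does the 'price' band predict 'family_car'?
rows = [{"price": "mid", "family_car": "yes"},
        {"price": "mid", "family_car": "yes"},
        {"price": "high", "family_car": "no"},
        {"price": "low", "family_car": "no"}]

print(information_gain(rows, "price", "family_car"))  # 1.0: a perfect split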

CLASSIFICATION AND REGRESSION TREES (CART)

Classification and Regression Trees, or CART for short, is a term introduced to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems. The CART algorithm provides the foundation for important algorithms like bagged decision trees, random forests, and boosted decision trees.

CART Model Representation

The representation for the CART model is a binary tree.

This is a binary tree in which each internal node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

For a dataset with two inputs (x) of height in centimeters and weight in kilograms and an output of gender as male or female, below is an example of a binary decision tree.

Figure 2.6: Decision Tree

The tree can be stored to file as a graph or a set of rules. For example, below is the above
decision tree as a set of rules.


• If Height > 180 cm Then Male

• If Height <= 180 cm AND Weight > 80 kg Then Male

• If Height <= 180 cm AND Weight <= 80 kg Then Female

Making Predictions with CART Models

With the binary tree representation of the CART model, making predictions is relatively straightforward. The tree is traversed by evaluating the specific input, starting at the root node of the tree.

A learned binary tree is actually a partitioning of the input space, with each input variable treated as a dimension of a p-dimensional space. The decision tree splits this space up into rectangles (when p = 2 input variables) or hyper-rectangles (with more inputs). New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model. This gives some feeling for the type of decisions that a CART model is capable of making, e.g., boxy decision boundaries.
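A minimal sketch of this traversal for the height/weight rules listed above (the function and variable names are hypothetical):

def predict_gender(height_cm, weight_kg):
    # Traverse from the root: each comparison is one internal node.
    if height_cm > 180:
        return "Male"
    if weight_kg > 80:
        return "Male"
    return "Female"

print(predict_gender(185, 75))  # Male (the first rule fires)
print(predict_gender(170, 70))  # Female (falls through both splits)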

REGRESSION

Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Regression analysis is a predictive modeling technique that analyzes the relation between the target (dependent) variable and the independent variables in a dataset. The different types of regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship with each other and the target variable contains continuous values. The regression technique is used mainly to determine predictor strength, to forecast trends and time series, and to examine cause-and-effect relationships.

LINEAR REGRESSION

Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. In case the data involves more than one independent variable, linear regression is called a multiple linear regression model.

The equation below denotes the linear regression model:

y = mx + c + e

where m is the slope of the line, c is the intercept, and e represents the error in the model.


Figure 2.7: Linear regression

The best-fit line is determined by varying the values of m and c. The predictor error is the difference between the observed values and the predicted values. The values of m and c are selected so as to give the minimum predictor error. It is important to note that a simple linear regression model is susceptible to outliers; therefore, it should be used with caution on large datasets.
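A minimal sketch of fitting m and c by least squares with NumPy (the data points are illustrative assumptions):

import numpy as np

# Toy data roughly following y = 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# polyfit with degree 1 returns the slope m and intercept c
# that minimize the squared predictor error.
m, c = np.polyfit(x, y, 1)
print(m, c)          # close to 2 and 1

y_pred = m * x + c   # predictions of the fitted line
print(y - y_pred)    # residuals (the predictor error)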

MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

The formula for multiple linear regression, in standard notation, is:

y = b0 + b1x1 + b2x2 + … + bkxk + e

where y is the response variable, x1 through xk are the explanatory variables, b0 is the intercept, b1 through bk are the coefficients, and e is the error term.
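A minimal sketch of fitting such a model with scikit-learn (the two-feature dataset is an illustrative assumption, constructed to satisfy y = 1 + 2·x1 + 3·x2 exactly):

import numpy as np
from sklearn.linear_model import LinearRegression

# Two explanatory variables per row.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])  # y = 1 + 2*x1 + 3*x2

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)             # b0, close to 1
print(model.coef_)                  # [b1, b2], close to [2, 3]
print(model.predict([[5.0, 5.0]]))  # 1 + 10 + 15 = 26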

LOGISTIC REGRESSION

Logistic regression is a type of regression analysis technique which is used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variable.

The logit function is used in logistic regression to measure the relationship between the target variable and the independent variables. Below is the equation that denotes logistic regression:

logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk

where p is the probability of occurrence of the feature.

Figure 2.8: Logistic regression

When selecting logistic regression as the regression analysis technique, it should be noted that the size of the data should be large, with an almost equal occurrence of the two values of the target variable. Also, there should be no multicollinearity, which means that there should be no correlation between the independent variables in the dataset.
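A minimal sketch with scikit-learn, showing probabilities near 0.5 close to the decision boundary and near 0 or 1 far from it (the one-feature dataset is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature; the class flips from 0 to 1 around x = 0.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid to b0 + b1*x.
print(model.predict_proba([[0.0]]))  # near [0.5, 0.5]: on the boundary
print(model.predict_proba([[3.0]]))  # close to [0, 1]: far from the plane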

NEURAL NETWORKS: Introduction

A neural network is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.


Figure 2.9: Neural network layers

Neural networks are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the human cerebral cortex, but on much smaller scales. A large neural network might have hundreds or thousands of processor units, whereas a human brain has billions of neurons, with a corresponding increase in the magnitude of their overall interaction and emergent behavior. Neural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output.

PERCEPTRON

A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class. In short, a perceptron is a single-layer neural network. It consists of four main parts: input values, weights and bias, net sum, and an activation function.

The perceptron's computation begins by taking all the input values and multiplying them by their weights. Then, all of these multiplied values are added together to create the weighted sum. The weighted sum is then applied to the activation function, producing the perceptron's output. The activation function plays the integral role of ensuring the output is mapped between required values such as (0, 1) or (-1, 1). It is important to note that the weight of an input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.


Figure 2.10: The process of Perceptron

As a simplified form of a neural network, specifically a single-layer neural network, perceptrons play an important role in binary classification. This means the perceptron is used to classify data into two parts, hence "binary." Sometimes, perceptrons are also referred to as linear binary classifiers for this reason.

MULTILAYER PERCEPTRONS

The perceptron is very useful for classifying data sets that are linearly separable. It encounters serious limitations with data sets that do not conform to this pattern, as discovered with the XOR problem: there exist labelings of four points, XOR being the classic example, that are not linearly separable. The multilayer perceptron (MLP) breaks this restriction and classifies datasets which are not linearly separable. It does this by using a more robust and complex architecture to learn regression and classification models for difficult datasets.

The perceptron consists of an input layer and an output layer which are fully connected. MLPs have the same input and output layers but may have multiple hidden layers in between the aforementioned layers.

The algorithm for the MLP is as follows (a minimal forward-pass sketch follows the list):

• Just as with the perceptron, the inputs are pushed forward through the MLP by taking the dot product of the input with the weights that exist between the input layer and the hidden layer. This dot product yields a value at the hidden layer. We do not push this value forward as we would with a perceptron, though.


• MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function, tanh.
Push the calculated output at the current layer through any of these activation
functions.

• Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot product
with the corresponding weights.

• Repeat steps two and three until the output layer is reached.

• At the output layer, the calculations will either be used for a backpropagation
algorithm that corresponds to the activation function that was selected for the MLP
(in the case of training) or a decision will be made based on the output (in the case
of testing).
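As referenced above, a minimal sketch of the MLP forward pass in NumPy (the layer sizes, random weights, and the choice of ReLU and sigmoid activations are illustrative assumptions):

import numpy as np

def relu(z):
    return np.maximum(0, z)          # activation at the hidden layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # activation at the output layer

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])       # one input with three features

W1 = rng.normal(size=(3, 4))         # weights: input layer -> hidden layer
W2 = rng.normal(size=(4, 1))         # weights: hidden layer -> output layer

h = relu(x @ W1)                     # steps 1-2: dot product, then activation
y = sigmoid(h @ W2)                  # step 3: push to the next layer
print(y)                             # output used for a decision or for backpropagation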

MLPs form the basis for all neural networks and have greatly improved the power of
computers when applied to classification and regression problems. Computers are no
longer limited by XOR cases and can learn rich and complex models thanks to the multilayer
perceptron.

SUPPORT VECTOR MACHINES: LINEAR AND NONLINEAR

Support Vector Machine (SVM) is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes.

SVM is an algorithm that takes the data as input and outputs a line that separates those classes if possible. Suppose we have a dataset as shown below and we need to classify the red rectangles from the blue ellipses (the positives from the negatives). So, our task is to find an ideal line that separates this dataset into two classes (red and blue).


Figure 2.11: Data set to classify Blue and Red

There are infinitely many lines that can separate these two classes.

Figure 2.12: Classes separated by many lines

It's visually quite intuitive in this case that the yellow line classifies better. The green line in the image above is quite close to the red class. Though it classifies the current dataset, it is not a generalized line, and in machine learning our goal is to find a more generalized separator.

The SVM algorithm finds the points closest to the line from both classes. These points are called support vectors. Now, we compute the distance between the line and the support vectors. This distance is called the margin. The goal is to maximize the margin. The hyperplane for which the margin is maximum is called the optimal hyperplane.


Figure 2.13: Computing the Hyperplane

SVM tries to make a decision boundary in such a way that the separation between the two classes (that street) is as wide as possible.

For a more complex dataset which is not linearly separable, we use a non-linear SVM.

Figure 2.14: Dataset for non-linear SVM

This data is clearly not linearly separable; we cannot draw a straight line that can classify it. But this data can be converted to linearly separable data in a higher dimension. Let's add one more dimension and call it the z-axis. Let the co-ordinates on the z-axis be governed by the constraint

z = x² + y²

So, basically, the z co-ordinate is the square of the distance of the point from the origin. Let's plot the data on the z-axis.


Figure 2.15: Dataset in the higher dimension

Now the data is clearly linearly separable. Let the purple line separating the data in the higher dimension be z = k, where k is a constant. Since z = x² + y², we get x² + y² = k, which is the equation of a circle. So, we can project this linear separator in the higher dimension back to the original dimensions using this transformation.

Figure 2.16: Decision boundary in original dimensions

Thus, we can classify data by adding an extra dimension to it so that it becomes linearly separable, and then projecting the decision boundary back to the original dimensions using a mathematical transformation.
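A minimal sketch of this trick, adding the z = x² + y² feature by hand and fitting a linear SVM in the lifted space (the circular toy data and the use of scikit-learn's LinearSVC are illustrative assumptions):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# An inner cluster (class 0) and an outer ring (class 1): not linearly
# separable in the original (x, y) plane.
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# Add the extra dimension z = x^2 + y^2; the classes now separate with a
# plane z = k, which maps back to the circle x^2 + y^2 = k.
z = (X ** 2).sum(axis=1, keepdims=True)
X3 = np.hstack([X, z])

clf = LinearSVC().fit(X3, y)
print(clf.score(X3, y))  # ~1.0: linearly separable in three dimensions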

KERNEL FUNCTIONS

A kernel function is a method used to take data as input and transform it into the required form for processing. The word "kernel" is used because the set of mathematical functions used in Support Vector Machines provides the window to manipulate the data. So, a kernel function generally transforms the training set of data so that a non-linear decision surface is able to be transformed into a linear equation in a higher-dimensional space. Basically, it returns the inner product between two points in a standard feature dimension.

Standard kernel function equation: K(x, y) = ⟨φ(x), φ(y)⟩, i.e., the inner product of the two points after the mapping φ into the feature space.

Major kernel functions (a usage sketch in scikit-learn follows this list):

For implementing kernel functions, first of all we have to install the "scikit-learn" library from the command prompt/terminal: pip install scikit-learn

Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about the data.

Gaussian Kernel Radial Basis Function (RBF): The same as the kernel function above, adding the radial basis method to improve the transformation.

Figure 2.17: Gaussian Kernel Radial Basis Function

Sigmoid Kernel: This function is equivalent to a two-layer perceptron model of a neural network, which is used as an activation function for artificial neurons.


Figure 2.18: Sigmoid Kernel

Polynomial Kernel: It represents the similarity of vectors in training set of data in a feature
space over polynomials of the original variables used in kernel.

Figure 2.19: Polynomial Kernel

Linear Kernel: Used when data is linearly separable.
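As referenced above, a minimal sketch of selecting these kernels via scikit-learn's SVC (the toy data is an illustrative assumption; the formulas in the comments are the standard forms of each kernel):

import numpy as np
from sklearn.svm import SVC

# A tiny toy problem, just to show the kernel switch.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

kernels = {
    "linear": SVC(kernel="linear"),        # K(x, y) = x . y
    "rbf": SVC(kernel="rbf", gamma=0.5),   # K(x, y) = exp(-gamma * ||x - y||^2)
    "poly": SVC(kernel="poly", degree=3),  # K(x, y) = (gamma * x . y + r)^3
    "sigmoid": SVC(kernel="sigmoid"),      # K(x, y) = tanh(gamma * x . y + r)
}

for name, clf in kernels.items():
    clf.fit(X, y)
    print(name, clf.predict([[0.5, 0.5], [3.5, 3.5]]))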

K-NEAREST NEIGHBORS

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near to each other. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).


In both classification and regression, choosing the right K for the data is done by trying several values of K and picking the one that works best.

The KNN Algorithm

1. Load the data

2. Initialize K to your chosen number of neighbors

3. For each example in the data

3.1 Calculate the distance between the query example and the current example from the
data.

3.2 Add the distance and the index of the example to an ordered collection

4. Sort the ordered collection of distances and indices from smallest to largest (in ascending
order) by the distances

5. Pick the first K entries from the sorted collection

6. Get the labels of the selected K entries

7. If regression, return the mean of the K labels.

8. If classification, return the mode of the K labels. (A Python sketch of these steps follows.)
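As referenced above, a minimal sketch of these steps in Python (the Euclidean distance metric and the toy data are illustrative assumptions):

import math
from collections import Counter

def knn_classify(data, query, k):
    # data: list of (features, label) pairs; query: a feature tuple (steps 1-2)
    distances = []
    for features, label in data:                      # step 3
        d = math.dist(features, query)                # step 3.1: Euclidean distance
        distances.append((d, label))                  # step 3.2
    distances.sort(key=lambda pair: pair[0])          # step 4: ascending by distance
    k_labels = [label for _, label in distances[:k]]  # steps 5-6
    return Counter(k_labels).most_common(1)[0][0]     # step 8: mode of the K labels

data = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
        ((4.0, 4.0), "blue"), ((4.2, 3.9), "blue")]

print(knn_classify(data, (1.1, 1.0), k=3))  # 'red'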

Advantages

• The algorithm is simple and easy to implement.

• There's no need to build a model, tune several parameters, or make additional assumptions.

• The algorithm is versatile: it can be used for classification, regression, and search.

Disadvantages

• The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
