
REGRESSION ANALYSIS

Regression analysis is a form of predictive modelling technique which investigates the relationship
between a dependent variable (target) and independent variable(s) (predictors). This technique is used for
forecasting, time-series modelling, and finding the causal-effect relationship between variables.
For example, the relationship between rash driving and the number of road accidents caused by a driver is best
studied through regression.

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve /
line to the data points in such a manner that the distances of the data points
from the curve or line are minimized. I'll explain this in more detail in the coming sections.

Why do we use Regression Analysis?


As mentioned above, regression analysis estimates the relationship between two or more variables.
Let's understand this with an easy example:
Let's say you want to estimate the growth in sales of a company based on current economic
conditions. You have recent company data which indicates that the growth in sales is around
two and a half times the growth in the economy. Using this insight, we can predict the future sales of
the company based on current and past information.
There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between the dependent variable and the independent
variables.
2. It indicates the strength of the impact of multiple independent variables on the dependent
variable.
Regression analysis also allows us to compare the effects of variables measured on different scales,
such as the effect of price changes and the number of promotional activities. These benefits help
market researchers / data analysts / data scientists to evaluate and select the best set of variables
to be used for building predictive models.

How many types of regression techniques do we have?


There are various kinds of regression techniques available to make predictions. These techniques
are mostly driven by three metrics (the number of independent variables, the type of dependent variable,
and the shape of the regression line). We'll discuss them in detail in the following sections.

For the creative ones, you can even cook up new regressions if you need a combination of the
parameters above that people haven't used before. But before you start that,
let us understand the most commonly used regressions:

1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the
first few topics people pick up while learning predictive modeling. In this technique, the
dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature
of the regression line is linear.
Linear regression establishes a relationship between the dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is
the error term. This equation can be used to predict the value of the target variable based on the given
predictor variable(s).

The difference between simple linear regression and multiple linear regression is that multiple
linear regression has more than one independent variable, whereas simple linear regression has only one
independent variable. Now, the question is: how do we obtain the best-fit line?
How do we obtain the best-fit line (values of a and b)?
This task can be easily accomplished by the least squares method. It is the most common method used
for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the
sum of the squares of the vertical deviations from each data point to the line. Because the deviations
are squared before being added, there is no cancelling out between positive and negative values.
We can evaluate the model performance using the R-squared metric.
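
To make the least squares idea concrete, here is a minimal NumPy sketch that estimates a and b by hand and computes R-squared; the economic-growth and sales-growth figures are made up purely for illustration.

import numpy as np

# Made-up data: growth in the economy (%) vs. growth in sales (%)
X = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])   # independent variable
Y = np.array([2.6, 3.9, 4.8, 6.4, 7.3, 8.9])   # dependent variable

# Least squares estimates:
# b = sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)^2),  a = y_mean - b*x_mean
x_mean, y_mean = X.mean(), Y.mean()
b = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
a = y_mean - b * x_mean

# R-squared: proportion of the variance in Y explained by the fitted line
Y_pred = a + b * X
r_squared = 1 - np.sum((Y - Y_pred) ** 2) / np.sum((Y - y_mean) ** 2)

print(f"Y = {a:.3f} + {b:.3f}*X, R-squared = {r_squared:.3f}")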
Important Points:
 There must be a linear relationship between the independent and dependent variables.
 Multiple regression can suffer from multicollinearity, autocorrelation, and heteroskedasticity.
 Linear regression is very sensitive to outliers. They can severely affect the regression line
and, eventually, the forecasted values.
 Multicollinearity can increase the variance of the coefficient estimates and make the
estimates very sensitive to minor changes in the model. The result is that the coefficient
estimates are unstable.
 In the case of multiple independent variables, we can use forward selection, backward
elimination, or a stepwise approach to select the most significant independent variables.

2. Logistic Regression
Logistic regression is used to find the probability of event = Success and event = Failure. We
should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No)
in nature. Here the value of Y ranges from 0 to 1, and it can be represented by the following equations.

odds = p / (1-p) = probability of the event occurring / probability of the event not occurring

ln(odds) = ln(p / (1-p))
logit(p) = ln(p / (1-p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

Above, p is the probability of the presence of the characteristic of interest. A question you should
ask here is: why have we used the log in the equation?
Since we are working with a binomial distribution (dependent variable), we need to choose a
link function that is best suited for this distribution, and that is the logit function. In the equation
above, the parameters are chosen to maximize the likelihood of observing the sample values rather
than to minimize the sum of squared errors (as in ordinary regression).
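
As a quick illustration of the logit link and maximum-likelihood fitting, here is a small sketch using scikit-learn's LogisticRegression; the hours-studied / pass-fail data is invented for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented binary-outcome data: hours studied (X) vs. pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Coefficients are found by maximum likelihood, not by minimizing squared error
model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# logit(p) = b0 + b1*X, so p = 1 / (1 + exp(-(b0 + b1*X)))
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * 2.2)))
print(f"P(event | X = 2.2) = {p:.3f}")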
Important Points:
 Logistic regression is widely used for classification problems.
 Logistic regression doesn't require a linear relationship between the dependent and independent
variables. It can handle various types of relationships because it applies a non-linear log
transformation to the predicted odds ratio.
 To avoid overfitting and underfitting, we should include all significant variables. A good
approach to ensure this practice is to use a stepwise method to estimate the logistic
regression.
 It requires large sample sizes because maximum likelihood estimates are less powerful at
low sample sizes than ordinary least squares.
 The independent variables should not be correlated with each other, i.e. there should be no
multicollinearity. However, we have the option to include interaction effects of categorical
variables in the analysis and in the model.
 If the dependent variable is ordinal, it is called ordinal logistic regression.
 If the dependent variable has more than two classes, it is known as multinomial logistic regression.
SIMPLE PERCEPTRON ALGORITHM
“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.”
-Bill Gates

We humans are so enthusiastic that we look at different things in nature and try to replicate them in our
own way. Humans saw birds flying and wanted to invent something so that they could fly too. Many
efforts were made, many inventions followed, and eventually aeroplanes came into existence,
enabling us to fly from one place to another. The source of all this motivation was Mother
Nature. Similarly, efforts were made to replicate the human brain. A number of researchers
tried to understand the working of the human brain. After many years of research, Artificial Neural
Networks were invented, vaguely inspired by the biological neural networks inside our brain.

The human brain is really an amazing thing. It can identify objects, recognize patterns, classify things,
and much more. What if a machine could do all this? Wouldn't that be cool? Today, as of
2018, we have come a long way in Artificial Intelligence. Many AI models have been invented that can
classify things, predict the future, play games better than humans, and even communicate with us.

A neural network is a collection of neurons/nodes interconnected with each other through synaptic
connections. An artificial neural network looks something like this.
The inputs to the neural network are fed to the input layer (the nodes in red). Each node in a
neural network has some function associated with it, and each connection/edge has some weight value.
The inputs are propagated from the input layer to the hidden layer (nodes in blue). Finally,
the outputs are received at the output layer (nodes in green).

Today, let’s build a perceptron model, which is nothing but a single node of a neural network.

Perceptron

Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, a
perceptron is the simplest neural network possible: a computational model of a single neuron. A
perceptron consists of one or more inputs, a processor, and a single output.

Let's understand the perceptron model with a simple classification problem.

Say, we have the input and output data,

Input:

x1 = Height of the person

x2 = Weight of the person

Output:

y = Gender(Male/Female)
Red markers indicate Males and the markers in Magenta indicate Females
Our motive is to fit a decision boundary (a line) that separates all the male samples from the female
samples. Well, we could do it by hand and try to find the equation of a line that separates both classes.
But this is just toy data; in real-life applications the data is humongous, and we humans are too lazy
to sit and go through each and every data point to find the equation of the decision boundary. Hence,
we'll use the perceptron model, which will find the equation of the decision boundary for us. All we have
to do is feed it the input and output data to train on.

We need to find the equation of the blue line

The general equation of a straight line is,

ax+by+c = 0 — — — eqn(1)
When we substitute the point P(x, y) into the expression ax + by + c, it gives a value of 0 (since P lies
on the line).

Similarly, when we substitute the point Q(x, y) into ax + by + c, it gives a value
greater than 0 (since Q lies above the line).

When we substitute the point R(x, y) into ax + by + c, it gives a value less than
0 (since R lies below the line).

Using this intuition, we can classify any point by substituting its value into the line equation:

If the resultant value is positive, the sample belongs to class Male (Y = 1);

if negative, the sample is a Female sample (Y = -1).

Plotting the property discussed above, we get a function called the sign function. This is the
activation function that we are going to use. There are many more activation functions out there,
but for now, let's stick with the sign function.
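
In code, the sign activation and the classification rule above might look like the following sketch; the line coefficients are arbitrary values chosen only to illustrate the idea.

def sign(value):
    # Sign activation: +1 for positive values (Male), -1 otherwise (Female)
    return 1 if value > 0 else -1

# Classify a point by substituting it into ax + by + c
a, b, c = 2.0, -1.0, 3.0        # arbitrary line coefficients, for illustration only
x, y = 1.0, 4.0                 # a sample point
print(sign(a * x + b * y + c))  # prints +1 or -1 depending on the side of the line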

Let me rephrase eqn(1) as follows:


w0 + w1 * x1 + w2 * x2 = 0 — — — eqn(2)
Where,
w0 = c; w1 = a; w2 = b; x1 = x; x2 = y.
We have the values of x1 and x2. We need the values of w0, w1, w2.
For mathematical convenience, let's vectorize eqn(2) as follows:
eqn(2):
w0 * 1 + w1 * x1 + w2 * x2 = 0
(Or)
w0 * x0 + w1 * x1 + w2 * x2 = 0 … (where, x0 = 1)
Vector X = [x0, x1, x2] = [1, x1, x2]

Vector W = [w0, w1, w2]

We can define eqn (2) as the dot product of the vectors W and X:

X.W = w0 * 1 + w1 * x1 + w2 * x2 = 0 — — — eqn (3)

If we successfully train our model and obtain optimum values of vector W, then eqn (3) should
make classifications as follows…

If sample is a Male(Y = 1), then,

X.W > 0 — — — eqn (4)

If sample is a Female(Y = -1), then,

X.W < 0 — — — eqn (5)

Damn, now we got 2 constraints to satisfy (eqn 4 and 5),

Let's combine eqns (4) and (5) as follows,


Y*(X.W) > 0 — — — eqn (6)

Where, Y = {1,-1}

If Y = 1;

1*(X.W) > 0

X.W > 0

If Y = -1;

-1*(X.W) >0

X.W < 0

Alright, so we can conclude that our model correctly classifies the sample X if,

Y*(X.W) > 0 … (positive)

The sample is said to be misclassified if,

Y*(X.W) < 0 … (negative)

(Or)

-Y*(X.W) > 0 — — — eqn (7) . . . (Misclassification Condition)

Now, to start off, we’ll randomly initialize the Weight vector W and for each misclassification we’ll
update the weights as follows,

W = W + ΔW — — — eqn (8)

Where ΔW is a small change that we will make in W.

Let’s examine each misclassification case,

If Y = 1, and we got
X.W < 0, then

We need to update the Weights in such a way that,

X. (W+ ΔW) > X.W

Here, a good choice for ΔW would be η*X (positive value), i.e.,

ΔW = η*X — — — eqn (9)

If Y = -1, and we got

X.W > 0, then

We need to update the Weights in such a way that,

X.(W + ΔW) < X.W

Here, a good choice for ΔW would be -η*X (negative value), i.e.,

ΔW = -η*X— — — eqn (10)

We can combine eqns(9 & 10) as,

ΔW = Y *(η*X) — — — eqn (11)

Therefore, eqn (8) implies,

W = W+Y *(η*X) — — — eqn (8)

Note: η is called the learning rate (usually greater than 0)


How did we get ΔW = Y*(η*X)? Keep reading to find out.

There's an optimization algorithm called Gradient Descent. It is used to update the weights in
case of misclassification. It does this by using a cost/loss function that tells us the loss in
case of misclassification. Gradient descent minimizes the cost function by gradually updating the
weight values.

So, on to defining our cost function;

From eqn (7), we have the misclassification condition,


-Y*(X.W) > 0

Which means that “-Y*(X.W)” gives us a positive value for misclassification of input X.

So, let us assume our cost function (J) as,

Cost, J = -Y (X.W)

There's one problem with this cost function: when the output is correctly classified,

Cost, J = -Y (X.W) = "some negative value"…

But the cost function can't be negative, so we'll define our cost function as follows,

If, -Y (X.W) > 0,

Cost, J = -Y (X.W) — — — eqn (12-a)

if, -Y(X.W) < 0 ,

Cost, J = 0 — — — eqn (12-b)

Gradient Descent Algorithm:

Repeat: W = W - η * (∂J/∂W)

Gradient descent updates the weights as shown above,

Note that we need to calculate the partial derivative of the cost function (J), with respect to weights
W.
Partial derivatives:

If, -Y (X.W) > 0,

∂J/ ∂W = -Y*X

if, -Y(X.W) < 0 ,

∂J/ ∂W = 0

Substituting the partial derivatives in gradient descent algorithm,

W = W - η*(∂J/ ∂W)

If, -Y(X.W) > 0 , (Misclassification)

W = W - η * (-Y*X) = W + η * (Y*X)

if, -Y(X.W) < 0 , (Correct Classification)

W = W - η * 0 = W … (no update in weights)

Hence, that’s how we got “W = W + η * (Y*X)” for cases of misclassification.

Finally, let's summarize the perceptron training algorithm.
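
Here is a minimal NumPy sketch of that training loop: initialize W randomly, and for every misclassified sample (Y*(X.W) <= 0) apply the update W = W + η*Y*X. The height/weight samples are made up for illustration.

import numpy as np

def train_perceptron(X, Y, eta=0.1, epochs=1000):
    # Prepend x0 = 1 to every sample so that w0 acts as the bias/intercept
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.random.randn(X.shape[1]) * 0.01      # random initialization of the weights
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * np.dot(x, W) <= 0:           # misclassification condition, eqn (7)
                W = W + eta * y * x             # update rule: W = W + η*Y*X
    return W

# Made-up samples: [height (cm), weight (kg)]; labels +1 = Male, -1 = Female
X = np.array([[180, 80], [175, 78], [170, 75],
              [160, 55], [155, 50], [165, 58]], dtype=float)
Y = np.array([1, 1, 1, -1, -1, -1])

W = train_perceptron(X, Y)
print("Decision boundary: %.2f + %.2f*x1 + %.2f*x2 = 0" % (W[0], W[1], W[2]))

Because the data above is linearly separable, the loop stops making updates once a separating line has been found.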

Perceptron models (with slight modifications), when connected with each other, form a neural
network. Perceptron models can only learn on linearly separable data. If we want our model to train
on non-linear data sets too, it’s better to go with neural networks.
CLASSIFICATION TECHNIQUES

What is Classification?

We use the training dataset to obtain boundary conditions which can be used to determine
each target class. Once the boundary conditions are determined, the next task is to predict the target
class. The whole process is known as classification.

Target class examples:

 Analysis of customer data to predict whether a customer will buy computer accessories (Target
class: Yes or No)

 Classifying fruits from features like color, taste, size, weight (Target classes: Apple,
Orange, Cherry, Banana)

 Gender classification from hair length (Target classes: Male or Female)

Let's understand the concept of classification algorithms with gender classification using hair
length. To classify gender (the target class) using hair length as the feature, we could train a
model using any classification algorithm to come up with a set of boundary conditions which
can be used to differentiate male and female using hair length as the training feature.
In the gender classification case, the boundary condition could be a particular hair length value. Suppose
the differentiating boundary hair length value is 15.0 cm; then we can say that if the hair length is less
than 15.0 cm, the gender could be male, or else female.
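
A minimal sketch of that boundary rule, taking the 15.0 cm threshold from the example above as given:

def classify_gender(hair_length_cm, boundary=15.0):
    # Boundary condition assumed from the example: below 15.0 cm -> Male, otherwise Female
    return "Male" if hair_length_cm < boundary else "Female"

print(classify_gender(5.0))    # Male
print(classify_gender(40.0))   # Female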

Basic Terminology in Classification Algorithms

 Classifier: An algorithm that maps the input data to a specific category.

 Classification model: A classification model tries to draw some conclusion from the input
values given for training. It will predict the class labels/categories for the new data.

 Feature: A feature is an individual measurable property of a phenomenon being observed.


 Binary Classification: Classification task with two possible outcomes. Eg: Gender
classification (Male / Female)

 Multi-class classification: Classification with more than two classes. In multi-class
classification, each sample is assigned to one and only one target label. Eg: An animal can
be a cat or a dog, but not both at the same time.

 Multi-label classification: Classification task where each sample is mapped to a set of
target labels (more than one class). Eg: A news article can be about sports, a person, and a
location at the same time.

Applications of Classification Algorithms

 Email spam classification
 Predicting whether bank customers will repay their loans
 Identifying cancer tumor cells
 Sentiment analysis
 Drug classification
 Facial key-point detection
 Pedestrian detection in autonomous driving

Types of Classification Algorithms

Classification Algorithms could be broadly classified as the following:

 Support vector machines


 Prediction trees
o Decision tree
o Regression tree
SVM (SUPPORT VECTOR MACHINE ALGORITHM)

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put a new data point in the correct category
in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider
the diagram below, in which two different categories are classified using a decision
boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created using the SVM
algorithm. We first train our model with lots of images of cats and dogs so that it can learn
about the different features of cats and dogs, and then we test it with this strange creature. The
SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases
(support vectors), so it will see the extreme cases of cat and dog. On the basis of the support vectors,
it will classify the creature as a cat. Consider the diagram below:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes using a single straight line, such data is termed
linearly separable, and the classifier used is called a linear SVM classifier.

o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified using a straight line, such data is termed non-
linear, and the classifier used is called a non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset, which means
that if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3
features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a
classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the
image below:

Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
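
As a concrete sketch, scikit-learn's SVC with a linear kernel finds exactly this maximum-margin line; the 2-D points and their tags below are made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Made-up 2-D samples with two tags: 0 = "blue", 1 = "green"
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # linear SVM: maximum-margin straight-line boundary
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)   # the points closest to the hyperplane
print("Prediction for (3, 2):", clf.predict([[3.0, 2.0]]))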
Non-Linear SVM:

If the data is linearly arranged, we can separate it using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the image below:

To separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions, x and y, so for non-linear data, we will add a third dimension z. It can be calculated
as:

z = x² + y²

By adding the third dimension, the sample space will look like the image below:

Now, SVM will divide the dataset into classes in the following way. Consider the image below:

Since we are now in 3-D space, the separating surface looks like a plane parallel to the x-axis. If we convert it
back to 2-D space at z = 1, it becomes:
Hence we get a circumference of radius 1 in the case of non-linear data.
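
The same trick can be sketched in code: either add the z = x² + y² dimension by hand and fit a linear SVM, or let a kernel (here RBF, as one possible choice) do the lifting implicitly. The circular toy data is generated only for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy non-linear data: class 1 lies inside a circle of radius 1, class 0 outside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Option 1: add the third dimension z = x^2 + y^2 explicitly and use a linear SVM
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
linear_in_3d = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# Option 2: let a kernel do the lifting implicitly
rbf = SVC(kernel="rbf").fit(X, y)

print(linear_in_3d.predict([[0.1, 0.2, 0.1**2 + 0.2**2]]))  # point inside the circle -> 1
print(rbf.predict([[1.8, 1.9]]))                            # point outside the circle -> 0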
DECISION TREE

A decision tree has many analogies in real life and, as it turns out, has influenced a wide area
of Machine Learning, covering both Classification and Regression. In decision analysis, a decision
tree can be used to visually and explicitly represent decisions and decision making.

What is a Decision Tree?

A decision tree is a map of the possible outcomes of a series of related choices. It allows an
individual or organization to weigh possible actions against one another based on their costs,
probabilities, and benefits. As the name goes, it uses a tree-like model of decisions. They can be
used either to drive informal discussion or to map out an algorithm that predicts the best choice
mathematically. A decision tree typically starts with a single node, which branches into possible
outcomes. Each of those outcomes leads to additional nodes, which branch off into other
possibilities. This gives it a tree-like shape.

Advantages

 Decision trees generate understandable rules.

 Decision trees perform classification without requiring much computation.

 Decision trees are capable of handling both continuous and categorical variables.

 Decision trees provide a clear indication of which fields are most important for prediction
or classification.

Disadvantages

 Decision trees are less appropriate for estimation tasks where the goal is to predict the value
of a continuous attribute.

 Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.

 Decision trees can be computationally expensive to train. At each node, each candidate splitting field
must be sorted before its best split can be found. In some algorithms, combinations of fields
are used and a search must be made for optimal combining weights.

Creating a Decision Tree


Let us consider a scenario where a new planet is discovered by a group of astronomers. Now the
question is whether it could be ‘the next earth?’ The answer to this question will revolutionize the
way people live. Well, literally!

There are a number of deciding factors which need to be thoroughly researched to make an intelligent
decision. These factors can be whether water is present on the planet, what the temperature is,
whether the surface is prone to continuous storms, whether flora and fauna can survive the climate, and so on.

Let us create a decision tree to find out whether we have discovered a new habitat.

The habitable temperature falls into the range 0 to 100 Celsius.

Is water present?


Do flora and fauna flourish?
Thus, we have a decision tree with us.

Classification Rules:
Classification rules are the cases in which all the scenarios are taken into consideration and a class
variable is assigned to each.

Class Variable:
Each leaf node is assigned a class-variable. A class-variable is the final output which leads to our
decision.

Let us derive the classification rules from the Decision Tree created:

1. If the temperature is not between 273 and 373 K -> Survival Difficult

2. If the temperature is between 273 and 373 K, and water is not present -> Survival Difficult

3. If the temperature is between 273 and 373 K, water is present, and flora and fauna are not present ->
Survival Difficult
4. If the temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy
surface is not present -> Survival Probable

5. If the temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy
surface is present -> Survival Difficult

Decision Tree
A decision tree has the following constituents:

 Root Node: The factor of ‘temperature’ is considered as the root in this case.

 Internal Node: The nodes with one incoming edge and 2 or more outgoing edges.

 Leaf Node: This is the terminal node with no out-going edge.

As the decision tree is now constructed, starting from the root-node we check the test condition
and assign the control to one of the outgoing edges, and so the condition is again tested and a node
is assigned. The decision tree is said to be complete when all the test conditions lead to a leaf node.
The leaf node contains the class-labels, which vote in favor or against the decision.

Now, you might wonder: why did we start with the 'temperature' attribute at the root? If you choose
any other attribute, the decision tree constructed will be different.

Correct. For a particular set of attributes, there can be numerous different trees created. We need
to choose the optimal tree which is done by following an algorithmic approach.

The Greedy Approach


“Greedy Approach is based on the concept of Heuristic Problem Solving by making an optimal
local choice at each node. By making these local optimal choices, we reach the approximate
optimal solution globally.”

The algorithm can be summarized as:

1. At each stage (node), pick out the best feature as the test condition.

2. Now split the node into the possible outcomes (internal nodes).

3. Repeat the above steps till all the test conditions have been exhausted into leaf nodes.

When you start to implement the algorithm, the first question is: ‘How to pick the starting test
condition?’
The answer to this question lies in the values of 'Entropy' and 'Information Gain'. Let us see what
they are and how they impact the creation of our decision tree.

Entropy: Entropy in a decision tree stands for homogeneity. If the data is completely homogeneous,
the entropy is 0; if the data is divided 50-50 between the classes, the entropy is 1.

Information Gain: Information gain is the decrease in entropy when the node is
split on an attribute.

The attribute with the highest information gain is selected for splitting. Based on the
computed values of entropy and information gain, we choose the best attribute at any particular
step.
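
For reference, scikit-learn's DecisionTreeClassifier can grow such a tree with criterion="entropy", which selects splits by information gain; the few encoded rows below are invented only to show the call, not the actual 'buys computer' table.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows in the spirit of the "buys computer" data, for illustration only
# Features: age_group (0: <30, 1: 31-40, 2: >40), student (0/1), credit_rating (0: fair, 1: excellent)
X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 0, 0],
     [2, 0, 1], [1, 1, 1], [0, 0, 1], [2, 1, 0]]
y = [0, 1, 1, 1, 0, 1, 0, 1]   # 1 = buys computer, 0 = does not

# criterion="entropy" makes the tree choose splits by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age_group", "student", "credit_rating"]))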

Let us consider the following data:

There are many different decision trees that can be formulated from this set of attributes.
Tree Creation Trial-1:
Here we take up the attribute ‘Student’ as the initial test condition.

Tree Creation Trial-2:


Similarly, why choose 'Student'? We could equally choose 'Income' as the test condition.
Creating the Perfect Decision Tree with Greedy Approach

Let us follow the ‘Greedy Approach’ and construct the optimal decision tree.

There are two classes involved: 'Yes', i.e. the person buys a computer, and 'No', i.e. he does
not. To calculate entropy and information gain, we compute the probability of
each of these two classes.

»Positive: For 'buys computer = yes', the probability comes out to be:

P(buys = yes) = 9/14

»Negative: For 'buys computer = no', the probability comes out to be:

P(buys = no) = 5/14

Entropy of D: We now calculate the entropy by putting these probability values into the entropy formula:

Entropy(D) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940

We have already classified the values of entropy:

Entropy = 0: data is completely homogeneous (pure)
Entropy = 1: data is divided 50-50 (impure)
Our value of entropy is 0.940, which means our set is almost impure.
Let's dig deeper to find the most suitable attribute and calculate the information gain.
What is information gain if we split on “Age”?
This data represents how many people in each age bracket buy and do not buy the
product.
For example, for people aged 30 or less, 2 people buy (Yes) and 3 people do not buy (No) the
product; Info(D) is calculated for each of these 3 categories of people and shown in the last
column.
Info(D) for the Age attribute is computed as the weighted total over these 3 ranges of age values. Now,
the question is: what is the information gain if we split on the 'Age' attribute?
The difference between the total information value (0.940) and the information computed for the Age
attribute (0.694) gives the 'information gain'.
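
The same arithmetic can be checked with a few lines of Python. The 2/3 split for the youngest bracket comes from the text above; the counts for the other two brackets (4/0 and 3/2) are the usual ones for this example and are assumed here.

import math

def entropy(pos, neg):
    # Entropy of a node containing `pos` positive and `neg` negative samples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

info_d = entropy(9, 5)                    # whole dataset: 9 yes, 5 no -> ~0.940

# Split on Age: (<=30: 2 yes / 3 no), (31-40: 4 yes / 0 no), (>40: 3 yes / 2 no)
splits = [(2, 3), (4, 0), (3, 2)]
info_age = sum((p + n) / 14 * entropy(p, n) for p, n in splits)   # -> ~0.694

gain_age = info_d - info_age              # -> ~0.246
print(f"Info(D) = {info_d:.3f}, Info_Age(D) = {info_age:.3f}, Gain(Age) = {gain_age:.3f}")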

This is the deciding factor for whether we should split at ‘Age’ or any other attribute. Similarly,
we calculate the ‘information gain’ for the rest of the attributes:
Information Gain (Age) =0.246
Information Gain (Income) =0.029
Information Gain (Student) = 0.151
Information Gain (credit rating) =0.048
On comparing these values of gain for all the attributes, we find out that the ‘information gain’ for
‘Age’ is the highest. Thus, splitting at ‘age’ is a good decision.
Similarly, at each split, we compare the information gain to find out whether that attribute should
be chosen for split or not.
Thus, the optimal tree created looks like:

The classification rules for this tree can be jotted down as:
If a person’s age is less than 30 and he is not a student, he will not buy the product.
Age (<30) ^ student (no) = NO
If a person’s age is less than 30 and he is a student, he will buy the product.
Age (<30) ^ student (yes) = YES

If a person’s age is between 31 and 40, he is most likely to buy.

Age (31…40) = YES

If a person’s age is greater than 40 and has an excellent credit rating, he will not buy.

Age (>40) ^ credit_rating(excellent) = NO

If a person’s age is greater than 40, with a fair credit rating, he will probably buy.
Age (>40) ^ credit_rating(fair) = YES

Thus, we achieve the perfect Decision Tree!!
