
Decision Tree Introduction with example

• Decision tree algorithms fall under the category of supervised learning. They can be used to
solve both regression and classification problems.
• A decision tree uses a tree representation to solve the problem, in which each leaf node
corresponds to a class label and attributes are represented on the internal nodes of the tree.
• We can represent any boolean function on discrete attributes using a decision tree.

Below are some assumptions we make while using a decision tree:

• At the beginning, we consider the whole training set as the root.
• Feature values are preferred to be categorical. If the values are continuous, they are
discretized prior to building the model.
• Records are distributed recursively on the basis of attribute values.
• We use statistical measures to decide which attribute to place at the root or at an internal node.

A decision tree works on the Sum of Product (SOP) form, which is also known as Disjunctive
Normal Form. In the illustrated example, the tree predicts whether people use a computer in
their daily life.
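To make the SOP idea concrete, here is a minimal sketch; the boolean function and attribute names below are made up for illustration, not taken from the original figure:

```python
# A tiny decision tree over two boolean attributes, written as nested conditionals.
# Hypothetical function for illustration: it equals the DNF expression
# (works_with_computers) OR (NOT works_with_computers AND owns_laptop).
def uses_computer_daily(works_with_computers: bool, owns_laptop: bool) -> bool:
    if works_with_computers:      # internal node testing the first attribute
        return True               # leaf: class "yes"
    if owns_laptop:               # internal node testing the second attribute
        return True               # leaf: class "yes"
    return False                  # leaf: class "no"

# Each root-to-"yes" path is one product (AND) term; OR-ing the paths gives the SOP/DNF form.
```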
In a decision tree, the major challenge is the identification of the attribute for the root node at
each level. This process is known as attribute selection. We have two popular attribute selection
measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the
entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and
Values(A) is the set of all possible values of A, then

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv), where the sum is over all v in Values(A).

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an
arbitrary collection of examples. The higher the entropy, the higher the information content.
Definition: Suppose S is a set of instances and p(c) is the proportion of instances in S belonging
to class c, then

Entropy(S) = - Σ p(c) * log2 p(c), where the sum is over all classes c.

Example:

For the set X = {a, a, a, b, b, b, b, b}:

Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy(X) = -[(3/8) * log2(3/8) + (5/8) * log2(5/8)]

= -[0.375 * (-1.415) + 0.625 * (-0.678)]

= -(-0.53 - 0.424)

= 0.954
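The same arithmetic can be checked with a few lines of Python (the entropy helper below is illustrative, not code from the original text):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

X = list("aaabbbbb")            # 3 instances of 'a', 5 instances of 'b'
print(round(entropy(X), 3))     # prints 0.954
```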
Building Decision Tree using Information Gain
The essentials:

• Start with all training instances associated with the root node
• Use info gain to choose which attribute to label each node with
• Note: No root-to-leaf path should contain the same discrete attribute twice
• Recursively construct each subtree on the subset of training instances that would be classified
down that path in the tree.
• If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
• If no attributes remain, label with a majority vote of training instances left at that node
• If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, let's draw a decision tree for the following data using information gain.
Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes.
To build a decision tree using information gain, we take each feature and calculate the
information gain for that feature.

The full set S has 2 instances of class I and 2 of class II, so Entropy(S) = 1.

Split on feature X: Gain(S, X) = 1 - (3/4)(0.918) - (1/4)(0) = 0.311

Split on feature Y: Gain(S, Y) = 1 - (2/4)(0) - (2/4)(0) = 1

Split on feature Z: Gain(S, Z) = 1 - (2/4)(1) - (2/4)(1) = 0
From these calculations we can see that the information gain is maximum when we split on
feature Y. So feature Y is the best-suited feature for the root node. When we split the dataset on
feature Y, each child node contains a pure subset of the target variable, so we do not need to split
the dataset any further.
The final tree for the above dataset therefore has Y at the root: Y = 1 leads to class I and Y = 0
leads to class II.
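The three candidate splits can be reproduced with a short script (the helper and variable names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, feature_index):
    """Information gain of splitting the (features, label) rows on one feature."""
    labels = [label for _, label in rows]
    gain = entropy(labels)
    for value in {features[feature_index] for features, _ in rows}:
        subset = [label for features, label in rows if features[feature_index] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

rows = [((1, 1, 1), 'I'), ((1, 1, 0), 'I'), ((0, 0, 1), 'II'), ((1, 0, 0), 'II')]
for index, name in enumerate(['X', 'Y', 'Z']):
    print(name, round(info_gain(rows, index), 3))   # X 0.311, Y 1.0, Z 0.0
```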

2. Gini Index

• The Gini index is a metric that measures how often a randomly chosen element would be
incorrectly identified.
• This means an attribute with a lower Gini index should be preferred.
• Scikit-learn supports the "gini" criterion for the Gini index, and it is the default value of the
criterion parameter.
• The formula for calculating the Gini index is given below:

Gini = 1 - Σ (p_i)^2, where p_i is the proportion of instances in the node belonging to class i.
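A quick numeric sketch of the Gini formula (the helper below is illustrative; the class labels reuse the toy example from the information gain section):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(['I', 'I', 'II', 'II']))   # 0.5  (maximally impure two-class node)
print(gini(['I', 'I']))               # 0.0  (pure node)

# In scikit-learn this is the default criterion: DecisionTreeClassifier(criterion="gini")
```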
ID3

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross
Quinlan[1] used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5
algorithm, and is typically used in the machine learning and natural language
processing domains.

The ID3 algorithm begins with the original set S as the root node. On each iteration of the
algorithm, it iterates through every unused attribute of the set S and calculates the entropy H(S)
or the information gain IG(S) of that attribute. It then selects the attribute which has the smallest
entropy (or largest information gain) value. The set S is then split or partitioned by the selected
attribute to produce subsets of the data. (For example, a node can be split into child nodes based
upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater
than 100.) The algorithm continues to recurse on each subset, considering only attributes never
selected before.
Recursion on a subset may stop in one of these cases:

• every element in the subset belongs to the same class; in which case the node is turned into
a leaf node and labelled with the class of the examples.
• there are no more attributes to be selected, but the examples still do not belong to the same
class. In this case, the node is made a leaf node and labelled with the most common class of
the examples in the subset.
• there are no examples in the subset, which happens when no example in the parent set was
found to match a specific value of the selected attribute. An example could be the absence of
a person among the population with age over 100 years. Then a leaf node is created and
labelled with the most common class of the examples in the parent node's set.
Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal
node) representing the selected attribute on which the data was split, and terminal nodes (leaf
nodes) representing the class label of the final subset of this branch.
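The recursion and the three stopping cases above can be summarised in a compact sketch, assuming categorical attributes (this is an illustrative outline, not Quinlan's original implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """rows: list of (dict attribute -> value, class label); gain of splitting on attr."""
    labels = [c for _, c in rows]
    gain = entropy(labels)
    for v in {x[attr] for x, _ in rows}:
        subset = [c for x, c in rows if x[attr] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(rows, attributes, parent_labels=None):
    labels = [c for _, c in rows]
    if not rows:                      # case 3: no examples -> parent's most common class
        return majority(parent_labels)
    if len(set(labels)) == 1:         # case 1: all examples share one class -> leaf
        return labels[0]
    if not attributes:                # case 2: no attributes left -> most common class
        return majority(labels)
    best = max(attributes, key=lambda a: info_gain(rows, a))
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for v in {x[best] for x, _ in rows}:
        subset = [(x, c) for x, c in rows if x[best] == v]
        tree[best][v] = id3(subset, rest, labels)
    return tree

# Example on the small X/Y/Z table from earlier:
data = [({'X': 1, 'Y': 1, 'Z': 1}, 'I'), ({'X': 1, 'Y': 1, 'Z': 0}, 'I'),
        ({'X': 0, 'Y': 0, 'Z': 1}, 'II'), ({'X': 1, 'Y': 0, 'Z': 0}, 'II')]
print(id3(data, ['X', 'Y', 'Z']))   # {'Y': {0: 'II', 1: 'I'}} (key order may vary)
```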

AdaBoost
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated
by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used
in conjunction with many other types of learning algorithms to improve performance. The output
of the other learning algorithms ('weak learners') is combined into a weighted sum that represents
the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak
learners are tweaked in favor of those instances misclassified by previous classifiers. In some
problems it can be less susceptible to the overfitting problem than other learning algorithms. The
individual learners can be weak, but as long as the performance of each one is slightly better than
random guessing, the final model can be proven to converge to a strong learner.
Every learning algorithm tends to suit some problem types better than others, and typically has
many different parameters and configurations to adjust before it achieves optimal performance on
a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-
of-the-box classifier. When used with decision tree learning, information gathered at each stage of
the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree
growing algorithm such that later trees tend to focus on harder-to-classify examples.
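A minimal scikit-learn sketch of this idea, boosting depth-1 decision trees ("stumps") on a synthetic dataset; the data and parameter choices are for illustration only, and older scikit-learn versions use base_estimator instead of estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learner: a decision stump (depth-1 tree); AdaBoost combines 50 of them.
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```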
In boosting, the first model is built and the errors from the first model are noted by the
algorithm. The records that are incorrectly classified are used as input for the next model. This
process is repeated until the specified condition is met, so there are 'n' models, each made by
taking the errors from the previous model into account. This is how boosting works. The models
1, 2, 3, …, N are individual models, which can be decision trees. All types of boosting models
work on the same principle.

Since we now know the boosting principle, it will be easy to understand the AdaBoost
algorithm. Let's dive into how AdaBoost works. When a random forest is used, the algorithm
makes an 'n' number of trees. It makes proper trees that consist of a root node with several leaf
nodes. Some trees might be bigger than others, but there is no fixed depth in a random forest.
With AdaBoost, however, the algorithm only makes a tree with one node and two leaves, known
as a stump. These stumps are weak learners, which is what boosting techniques prefer. The order
of stumps is very important in AdaBoost: the error of the first stump influences how the other
stumps are made. Let's understand this with an example.

Here's a sample dataset consisting of only three features, where the output is in categorical
form. As the output is in binary/categorical form, it becomes a classification problem. In real
life, a dataset can have any number of records and features in it. Let us consider 5 records for
explanation purposes. The output is in categorical form, here in the form of Yes or No. All these
records will be assigned a sample weight. The formula used for this is W = 1/N, where N is the
number of records. In this dataset, there are only 5 records, so the sample weight becomes 1/5
initially. Every record gets the same weight; in this case, it's 1/5.

Step 1 – Creating the First Base Learner


To create the first learner, the algorithm takes the first feature, i.e., feature 1, and creates the
first stump, f1. It will create the same number of stumps as the number of features. In the case
below, it will create 3 stumps, as there are only 3 features in this dataset. From these stumps, it
will create three decision trees. This process can be called the stumps-base-learner model. Out
of these 3 models, the algorithm selects only one. Two properties are considered while selecting
a base learner: Gini and entropy. We must calculate Gini or entropy the same way it is calculated
for decision trees. The stump with the least value will be the first base learner. All 3 stumps can
be made from the 3 features, and the correctly and incorrectly classified records under each
stump are counted. Using these counts, the Gini or entropy index is calculated. The stump that
has the least entropy or Gini will be selected as the base learner. Let's assume that the entropy is
the least for stump 1. So, let's take stump 1, i.e., feature 1, as our first base learner.
Here, feature 1 (f1) has classified 2 records correctly and 1 incorrectly. For the incorrectly
classified record, we will calculate the total error.

Step 2 – Calculating the Total Error (TE)


The total error is the sum of the sample weights of the incorrectly classified records. In our
case, there is only 1 misclassified record, so Total Error (TE) = 1/5.

Step 3 – Calculating Performance of the Stump


The formula for calculating the performance of the stump is:

Performance of the stump = (1/2) * ln((1 - TE) / TE)

where ln is the natural log and TE is the Total Error.

In our case, TE is 1/5. By substituting the value of the total error in the above formula and
solving it, we get the value for the performance of the stump as 0.693. Why is it necessary to
calculate the TE and the performance of a stump? The answer is that we must update the sample
weights before proceeding to the next model or stage, because if the same weights are applied,
the output received will be the same as that of the first model. The wrongly classified records
get more preference than the correctly classified records. In plain boosting, only the wrong
records from the decision tree/stump are passed on to the next stump, whereas in AdaBoost both
kinds of records are allowed to pass and the wrong records are repeated more than the correct
ones. We must increase the weight for the wrongly classified records and decrease the weight
for the correctly classified records. In the next step, we will update the weights based on the
performance of the stump.
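A small numeric sketch of Steps 2 and 3, together with the usual AdaBoost weight update (new weight = old weight * e^(±performance), followed by normalization); the update rule is stated here as an assumption, and the variable names are illustrative:

```python
import math

n_records = 5
weights = [1 / n_records] * n_records                 # initial sample weights, W = 1/N
misclassified = [False, True, False, False, False]    # the stump got one record wrong

total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)   # TE = 1/5
performance = 0.5 * math.log((1 - total_error) / total_error)               # ~0.693

# Usual AdaBoost update: raise weights of wrong records, lower weights of correct ones,
# then normalize so the weights sum to 1 again.
updated = [w * math.exp(performance if wrong else -performance)
           for w, wrong in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]
print(round(performance, 3), [round(w, 3) for w in updated])
# 0.693 [0.125, 0.5, 0.125, 0.125, 0.125]
```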

Step 4 – Updating Weights

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is
used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, it performs an
action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
but we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new data set that are similar to the cat and dog images, and based on the most similar features
it will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1. Which of these categories will this data point lie in? To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point.

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. For
two points (x1, y1) and (x2, y2) it can be calculated as:

d = √((x2 - x1)^2 + (y2 - y1)^2)

o By calculating the Euclidean distance, we get the nearest neighbors: say, three nearest
neighbors in category A and two nearest neighbors in category B.
o As the 3 nearest neighbors are from category A, this new data point must belong to
category A.
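The whole procedure fits in a few lines of Python; the points below are made up purely to illustrate three A-neighbours outvoting two B-neighbours:

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=5):
    """train: list of ((x, y), label). Classify new_point by majority vote of the k nearest."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points: category A clustered near (1, 1), category B near (5, 5).
train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((5, 5), 'B'), ((6, 5), 'B'), ((5, 6), 'B')]
print(knn_predict(train, (2, 2), k=5))   # 'A': 3 of the 5 nearest neighbours are category A
```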

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for K, so we need to try several values
to find the best one. The most commonly preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and leads to the effects of outliers
in the model.
o Large values for K are good, but a value that is too large may cause the model to overlook
smaller patterns in the data.
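One common way to pick K in practice is to compare cross-validation scores over a range of candidate values; a short scikit-learn sketch with synthetic data (the value range and dataset are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Odd values of K avoid ties between the two classes.
for k in [1, 3, 5, 7, 9, 11]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))
```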

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs to determine the value of K, which can be complex at times.
o The computation cost is high because of calculating the distance between the new data point
and all the training samples.

Python implementation of the KNN algorithm

For the Python implementation of the K-NN algorithm, we will use the same problem and
dataset that we used for Logistic Regression, but here we will improve the performance of the
model. Below is the problem description:
Problem for the K-NN Algorithm: A car manufacturer company has manufactured a new SUV
car. The company wants to show ads to the users who are interested in buying that SUV. For
this problem, we have a dataset that contains information about multiple users, collected through
a social network. The dataset contains a lot of information, but we will consider Estimated
Salary and Age as the independent variables and Purchased as the dependent variable.

Steps to implement the K-NN algorithm:

o Data Pre-processing step


o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Testing the accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.
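A sketch of these steps with scikit-learn, assuming a Social_Network_Ads-style CSV with Age, EstimatedSalary and Purchased columns (the file name and column names are assumptions; the visualization step is omitted here):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Data pre-processing (file and column names are assumed for illustration).
data = pd.read_csv('Social_Network_Ads.csv')
X = data[['Age', 'EstimatedSalary']].values
y = data['Purchased'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 2. Fitting the K-NN classifier to the training set (K = 5, Euclidean distance).
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

# 3. Predicting the test set results.
y_pred = classifier.predict(X_test)

# 4. Evaluating the result with a confusion matrix.
print(confusion_matrix(y_test, y_pred))
```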
