

Machine Learning by Building Decision Trees

Submitted in partial fulfilment of the requirements

For the award of degree of

Bachelor of Technology


Computer Science Engineering


Mrs Silica Kole
Ashish Kumar(00711502711)
Pranav Bhatia(03911502711)
Anshul (0??11502711)


To test and implement the ID3 algorithm on a set of chosen example problems, to study the improvements made over it in the C4.5 algorithm, and to compare ID3 with the version space algorithm.

Machine learning is a subfield of computer science (CS) and artificial
intelligence (AI) that deals with the construction and study of systems that
can learn from data, rather than follow only explicitly programmed
instructions. Besides CS and AI, it has strong ties to statistics and optimization,
which deliver both methods and theory to the field. Machine learning is
employed in a range of computing tasks where designing and programming
explicit, rule-based algorithms is infeasible. Example applications
include spam filtering, optical character recognition (OCR), search
engines and computer vision.
One of the main uses of machine learning is to predict the outcomes of
events by training on a dataset. There are many ways to represent what is
learned, with the decision tree being one of the most widely used.
A decision tree is a decision-support tool that uses a tree-like graph or
model of decisions and their possible consequences, including chance event
outcomes, resource costs, and utility. It is one way to display an algorithm.
Decision trees are commonly used in operations research, specifically
in decision analysis, to help identify a strategy most likely to reach a goal.

It is a flowchart-like structure in which each internal node represents a
"test" on an attribute (e.g. whether a coin flip comes up heads or tails),
each branch represents an outcome of the test, and each leaf node represents
a class label (the decision taken after evaluating all attributes along the
path). The paths from root to leaf represent classification rules.
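This flowchart-like structure can be sketched in code. The following is a minimal illustration (the tree, attribute names and values are hypothetical, chosen only to show how a root-to-leaf path yields a class label):

```python
# A minimal sketch of tree-based classification: each internal node tests one
# attribute, each branch is a test outcome, each leaf is a class label.

def predict(node, sample):
    """Follow branches from root to leaf; the leaf holds the class label."""
    while isinstance(node, dict):           # internal node: a test on an attribute
        branches = node["branches"]
        node = branches[sample[node["test"]]]  # follow the branch for this outcome
    return node                              # leaf node: the class label (decision)

# Hypothetical tree: test "outlook" at the root, then "windy" on one branch.
tree = {"test": "outlook",
        "branches": {"sunny": "NO",
                     "overcast": "YES",
                     "rain": {"test": "windy",
                              "branches": {True: "NO", False: "YES"}}}}

print(predict(tree, {"outlook": "rain", "windy": False}))  # YES
```

Each call walks exactly one root-to-leaf path, which corresponds to one classification rule.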

In decision analysis, a decision tree and the closely related influence
diagram are used as visual and analytical decision-support tools in which
the expected values (or expected utilities) of competing alternatives are
calculated.
A decision tree consists of 3 types of nodes:
1. Decision nodes
2. Chance nodes
3. End nodes
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data
normalisation, creation of dummy variables and removal of blank values.
Note, however, that many implementations do not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the
number of data points used to train the tree.
- Able to handle both numerical and categorical data. Other techniques are
usually specialised in analysing datasets that have only one type of
variable.
- Able to handle multi-output problems.
- Uses a white-box model. If a given situation is observable in a model,
the explanation for the condition is easily expressed in boolean logic. By
contrast, in a black-box model (e.g., an artificial neural network),
results may be more difficult to interpret.
- Possible to validate a model using statistical tests, which makes it
possible to account for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true
model from which the data were generated.
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not
generalise well from the data. This is called overfitting. Mechanisms such
as pruning, setting the minimum number of samples required at a leaf node,
or setting the maximum depth of the tree are necessary to avoid this
problem.
- Decision trees can be unstable because small variations in the data
might result in a completely different tree being generated. This problem
is mitigated by using decision trees within an ensemble.
- The problem of learning an optimal decision tree is known to be
NP-complete under several aspects of optimality and even for simple
concepts. Consequently, practical decision-tree learning algorithms are
based on heuristics such as the greedy algorithm, where locally optimal
decisions are made at each node. Such algorithms cannot guarantee to
return the globally optimal decision tree. This can be mitigated by
training multiple trees in an ensemble learner, where the features and
samples are randomly sampled with replacement.
- There are concepts that are hard to learn because decision trees do not
express them easily, such as the XOR, parity or multiplexer problems.
- Decision-tree learners create biased trees if some classes dominate. It
is therefore recommended to balance the dataset prior to fitting the
decision tree.

ID3 Algorithm: In decision tree learning, ID3 (Iterative Dichotomiser 3) is
an algorithm invented by Ross Quinlan and is used to generate a decision
tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically
used in the machine learning and natural language processing domains.
ID3 is based on the Concept Learning System (CLS) algorithm. The basic CLS
algorithm, applied to a set of training instances C, is:
Step 1: If all instances in C are positive, create a YES node and halt.
If all instances in C are negative, create a NO node and halt.
Otherwise select a feature F with values v1, ..., vn and create a decision node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn
according to the values of F.
Step 3: Apply the algorithm recursively to each of the sets Ci.
Note that in CLS the trainer (the expert) decides which feature to select.
ID3 improves on CLS by adding a feature-selection heuristic. ID3 searches
through the attributes of the training instances and extracts the attribute
that best separates the given examples. If the attribute perfectly
classifies the training set then ID3 stops; otherwise it recursively
operates on the n (where n = number of possible values of the attribute)
partitioned subsets to get their "best" attribute. The algorithm uses a
greedy search: it picks the best attribute and never looks back to
reconsider earlier choices.
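The greedy, recursive procedure described above can be sketched in Python. This is a toy illustration for categorical attributes only (the dataset and attribute names below are hypothetical, and real implementations add tie-breaking and pruning):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def id3(examples, attributes, target="class"):
    """Greedy recursive tree construction: pick the attribute that best
    separates the examples, partition on its values, recurse on each subset."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all one class: leaf node
        return labels[0]
    if not attributes:                        # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]

    def remainder(attr):                      # weighted entropy after splitting on attr
        return sum(
            entropy([e[target] for e in examples if e[attr] == v])
            * sum(1 for e in examples if e[attr] == v) / len(examples)
            for v in set(e[attr] for e in examples))

    best = min(attributes, key=remainder)     # max information gain
    rest = [a for a in attributes if a != best]
    return {"test": best,
            "branches": {v: id3([e for e in examples if e[best] == v], rest, target)
                         for v in set(e[best] for e in examples)}}

# Hypothetical toy data (attribute names made up for illustration)
weather = [
    {"outlook": "sunny",    "class": "NO"},
    {"outlook": "sunny",    "class": "NO"},
    {"outlook": "overcast", "class": "YES"},
    {"outlook": "rain",     "class": "YES"},
]
print(id3(weather, ["outlook"]))
```

Note the greedy structure: once `best` is chosen at a node, the recursion never revisits that choice.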

Given a collection S of examples over c classes,
Entropy(S) = Σ_I − p(I) log2 p(I)
where p(I) is the proportion of examples in S belonging to class I, and the
sum runs over all c classes.
If S is a collection of 14 examples with 9 YES and 5 NO examples, then
Entropy(S) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Entropy is 0 if all members of S belong to the same class (the data is
perfectly classified). For two classes, the range of entropy is 0
("perfectly classified") to 1 ("totally random").

C4.5 Algorithm: C4.5 is an algorithm used to generate a decision
tree developed by Ross Quinlan. It is an extension of Quinlan's earlier ID3
algorithm. The decision trees generated by C4.5 can be used for classification,
and for this reason, C4.5 is often referred to as a statistical classifier.
C4.5 builds decision trees from a set of training data in the same way as
ID3, using the concept of information entropy. The training data is a set
of already classified samples. Each sample consists of a p-dimensional
vector (x1, x2, ..., xp), where the xi represent attribute values or
features of the sample, together with the class in which the sample falls.
At each node of the tree, C4.5 chooses the attribute of the data that most
effectively splits its set of samples into subsets enriched in one class or the
other. The splitting criterion is the normalized information gain (difference in
entropy). The attribute with the highest normalized information gain is chosen
to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
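The normalized splitting criterion can be sketched as follows. This is a simplified illustration for categorical attributes with a made-up dataset; C4.5 proper also handles continuous attributes and missing values:

```python
import math

def entropy(labels):
    total = len(labels)
    return sum(-(labels.count(v) / total) * math.log2(labels.count(v) / total)
               for v in set(labels))

def gain_ratio(examples, attribute, target):
    """C4.5's criterion: information gain divided by split information,
    which penalises attributes with many distinct values."""
    gain = entropy([e[target] for e in examples])
    split_info = 0.0
    for v in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == v]
        weight = len(subset) / len(examples)
        gain -= weight * entropy(subset)            # subtract remaining entropy
        split_info -= weight * math.log2(weight)    # entropy of the split itself
    return gain / split_info if split_info else 0.0

# Hypothetical data in which "windy" separates the classes perfectly.
data = [{"windy": True,  "play": "no"},  {"windy": True,  "play": "no"},
        {"windy": False, "play": "yes"}, {"windy": False, "play": "yes"}]
print(gain_ratio(data, "windy", "play"))  # 1.0: gain of 1 bit, split info of 1 bit
```

At each node, C4.5 evaluates this ratio for every remaining attribute and splits on the one with the highest value.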

This algorithm has a few base cases:
- All the samples in the list belong to the same class. When this happens,
the algorithm simply creates a leaf node for the decision tree saying to
choose that class.
- None of the features provides any information gain. In this case, C4.5
creates a decision node higher up the tree using the expected value of the
class.
- An instance of a previously unseen class is encountered. Again, C4.5
creates a decision node higher up the tree using the expected value.

Example: a zoo dataset is taken and classified using C4.5 (implemented as J48 in Weka).

Performance Parameters:
(1) Accuracy: Accuracy is the degree to which the measured (predicted)
value of a quantity agrees with its factual (true) value.
Accuracy = (no. of true positives + no. of true negatives) / (no. of true
positives + false positives + false negatives + true negatives)
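The formula can be computed directly; the confusion-matrix counts below are hypothetical, chosen only to illustrate the calculation:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all test samples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical confusion-matrix counts from a 100-sample test set
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```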

(2) Memory Used: The amount of memory a particular program uses to build
the model and execute it successfully under different conditions.

(3) Model Build Time: The time taken to extract a data model from the
dataset. It depends on the number of training samples used; in general,
the larger the training dataset, the higher the accuracy of the resulting
model.

(4) Search Time: The time the system takes to answer a query once the
model has been built.

(5) Error Rate: The error rate is the difference between the actual and
desired outcomes. In decision making it may be regarded as the probability
of making a wrong decision, which has a different value for each type of
error.

