
Vietnam National University Ho Chi Minh City

University of Science
Faculty of Information Technology

LAB 2:
Decision Tree

Course: Introduction to AI
Class: 22CLC02
Teachers: Nguyễn Ngọc Thảo, Hồ Thị Thanh Tuyến

Author: 22127357 − Phạm Trần Yến Quyên

HCMC, 2024
Table of Contents

1 Check list

2 Source Code
2.1 Library Usages
2.2 Decision Tree
2.2.1 Preparing the data sets
2.2.2 Building the decision tree classifiers
2.2.3 Performance Metrics
2.3 Evaluating the decision tree classifiers
2.3.1 Classification Report and Confusion Matrix
2.3.2 Comments

3 Statistics Report
3.1 The depth and accuracy of a decision tree
3.1.1 Decision Tree Visualization
3.1.2 Accuracy to Max Depth

4 References

1 Check list
1. Preparing the data sets.

2. Building the decision tree classifiers.

3. Evaluating the decision tree classifiers.

(a) Classification report and confusion matrix.


(b) Comments.

4. The depth and accuracy of a decision tree.

(a) Trees, tables, and charts.


(b) Comments.

2 Source Code
2.1 Library Usages:
− sklearn: A powerful machine learning library in Python. It provides a wide range of algo-
rithms for classification, regression, clustering, etc. It also offers tools for data preprocessing,
model evaluation, and parameter tuning.

− pandas: A library for data manipulation and analysis in Python. It is used to load, manipu-
late, filter, and analyze data from CSV files.

− graphviz: A tool for visualizing graphs and networks.

− matplotlib: Used for creating static, interactive, and animated visualizations in Python. It
offers a wide range of plotting functions to create line plots, bar plots, histograms, heatmaps,
etc.

− seaborn: A statistical data visualization library built on top of matplotlib.

− pydotplus: A Python interface to Graphviz's Dot language. It allows users to create, manipulate,
and visualize graphs using the Dot language, and it is often used in combination with other li-
braries to generate and visualize complex graphs (e.g., Decision Trees and Neural Networks).
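
A minimal sketch of how these libraries might be imported for this lab; the exact import list in the source code is an assumption:

import pandas as pd                    # loading and manipulating the CSV dataset
import matplotlib.pyplot as plt        # line plots and general figures
import seaborn as sns                  # heatmaps for the confusion matrices
import pydotplus                       # rendering Graphviz Dot output to image files

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix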

2.2 Decision Tree:


2.2.1 Preparing the data sets
The given dataset 'nursery.data.csv' contains data originally developed to rank applications
for nursery schools. It has 8 columns representing 8 attributes (attribute names taken from
'nursery.names'):

parents usual , pretentious , great_pret
has_nurs proper , less_proper , improper , critical , very_crit
form complete , completed , incomplete , foster
children 1 , 2 , 3 , more
housing convenient , less_conv , critical
finance convenient , inconv
social non - prob , slightly_prob , problematic
health recommended , priority , not_recom

The data are preprocessed with the following steps:

1. Shuffled: The dataset is shuffled once at the start and then shuffled again during the split-
ting process (with default random_state setting).

2. Split: The dataset is split into a Training set and a Testing set in these ratios (train/test): 40/60,
60/40, 80/20, and 90/10. The purpose of this is to compare the performance of the models
with different training/test ratios.
*Note: stratify = y ensures that the distribution of the labels in the train and test sets
is the same, i.e., an even class distribution.

3. Encoded: Because the dataset consists of categories with an inherent order or rank
(e.g., proper, less_proper, etc.), it must be encoded before it can be processed to build a
Decision Tree. The most suitable encoding style here is Label Encoding (encoding categorical
variables as non-negative integers: 0, 1, 2, ...). After encoding, the dataset looks like this:
parents : usual (2) , pretentious (1) , great_pret (0)
has_nurs : proper (3) , less_proper (2) , improper (1) , critical (0) ,
very_crit (4)
form : complete (0) , completed (1) , incomplete (2) , foster (3)
children : 1 (0) , 2 (1) , 3 (2) , more (3)
housing : convenient (0) , less_conv (1) , critical (2)
finance : convenient (0) , inconv (1)
social : non - prob (0) , slightly_prob (1) , problematic (2)
health : recommended (2) , priority (1) , not_recom (0)

*Note: For the Encoding step, other types of Encoding that could be used here are Ordinal
Encoding (used for categorical data that involves ranking or ordering) and One-Hot Encoding
(which creates dummy variables based on the number of unique values in the categorical
feature), with similar results. A minimal sketch of the preprocessing steps above follows.
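
The sketch assumes the CSV has no header row and names the class column target for illustration; variable names follow those used later in the report (feature_train, feature_test, label_train, label_test):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Attribute names taken from 'nursery.names'; the class column name 'target' is an assumption.
COLUMNS = ['parents', 'has_nurs', 'form', 'children',
           'housing', 'finance', 'social', 'health', 'target']

data = pd.read_csv('nursery.data.csv', names=COLUMNS)

# 1. Shuffle the whole dataset once at the start.
data = data.sample(frac=1).reset_index(drop=True)

# 3. Label-encode the eight categorical attributes as integers 0, 1, 2, ...
#    (done before the split here for brevity; the mapping is deterministic either way).
for col in COLUMNS[:-1]:
    data[col] = LabelEncoder().fit_transform(data[col])

features = data.drop(columns='target')
labels = data['target']

# 2. Split into train/test sets; stratify=labels keeps the label distribution identical
#    in both sets. test_size=0.6 gives the 40/60 (train/test) ratio; the others are analogous.
feature_train, feature_test, label_train, label_test = train_test_split(
    features, labels, test_size=0.6, stratify=labels)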

2.2.2 Building the decision tree classifiers


− To enhance the convenience of setting the Maximum depth (Levels) parameter, a subclass of
the DecisionTreeClassifier was defined as a custom class DecisionTreeClassifierInfoGain
that sets the criterion to ’entropy’ for information gain.
− The classifier is then trained using the training dataset (feature_train, label_train) and
then used to predict the labels for the test dataset (feature_test).

from sklearn.tree import DecisionTreeClassifier

MAX_DEPTH = [None, 2, 3, 4, 5, 6, 7]

class DecisionTreeClassifierInfoGain(DecisionTreeClassifier):
    def __init__(self, max_depth=None):
        # Fix the criterion to 'entropy' so splits are chosen by information gain.
        super().__init__(criterion='entropy', max_depth=max_depth)

# Create Decision Tree classifier object (MAX_DEPTH[0] is None, i.e. no depth limit)
clf = DecisionTreeClassifierInfoGain(MAX_DEPTH[0])

# Train on the training set, then predict labels for the test set
clf.fit(feature_train, label_train)
label_pred = clf.predict(feature_test)

2.2.3 Performance Metrics


Upon finishing training the Decision Tree Classifier and predicting the responses for the test dataset,
performance metrics are calculated to evaluate the Decision Tree (a sketch of how they are computed
follows the list below):
− Accuracy Score: This metric quantifies the overall accuracy of the classifier by comparing the
predicted labels with the actual labels in the test dataset.
− Classification Report: The classification report provides a comprehensive summary of various
metrics such as precision, recall, F1-score, and support for each class in the dataset. It gives
insights into the classifier’s performance for individual classes.
− Confusion Matrix: The confusion matrix is a table that visualizes the performance of a clas-
sification algorithm. It presents a summary of the predictions made by the classifier against
the actual labels.
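
A minimal sketch of how these metrics can be obtained with scikit-learn, assuming label_test holds the true test labels and label_pred the predictions produced in Section 2.2.2:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Overall fraction of correctly predicted labels.
print('Accuracy:', accuracy_score(label_test, label_pred))

# Per-class precision, recall, F1-score and support.
print('Classification Report:\n', classification_report(label_test, label_pred))

# Counts of actual (rows) vs. predicted (columns) labels for every pair of classes.
print('Confusion Matrix:\n', confusion_matrix(label_test, label_pred))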

2.3 Evaluating the decision tree classifiers:


DISCLAIMER: This evaluation uses only the results of classifiers trained with no depth limit
(max_depth = None); for the evaluation of different depths, refer to Section 3.1.

2.3.1 Classification Report and Confusion Matrix


NOTATIONS:
• True Positives (TP): These are instances where the model correctly predicts the positive
class (e.g., presence of a disease) when it is indeed present in the actual data.

• True Negatives (TN): True Negatives occur when the model correctly predicts the negative
class (e.g., absence of a disease) when it is indeed not present in the actual data.

• False Positives (FP): These occur when the model incorrectly predicts the positive class when
it is not present in the actual data.

• False Negatives (FN): False Negatives happen when the model incorrectly predicts the nega-
tive class when it is, in fact, the positive class.

Before commenting on the results of the Classification Report and Confusion Matrix, it is important
to understand the meanings of the evaluation metrics listed below.

+ Precision: Measures the accuracy of positive predictions. It is the ratio of correctly predicted
positive observations to the total predicted positives. A model which produces no false posi-
tives has a precision of 1.

Precision = TP / (TP + FP)
+ Recall (Sensitivity): It measures the ability of the classifier to find all positive instances. It is
the ratio of correctly predicted positive observations to all observations in the actual class. A
model which produces no false negatives has a recall of 1.

Recall = TP / (TP + FN)
*Note: To fully evaluate the effectiveness of a model, both precision and recall must be ex-
amined. Unfortunately, precision and recall are often in tension (as their definitions and
formulas suggest). That is, improving precision typically reduces recall and vice versa.

+ F1-score: It is the harmonic mean of Precision and Recall. It is a good way to show that
a classifier has a good value for both recall and precision (the closer to 1, the better the
model).

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

+ Support: It is the number of actual occurrences of the class in the specified dataset.

+ Accuracy: The ratio of correctly predicted observations to the total number of observations.
Perfect accuracy is equal to 1.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
+ Macro Avg: Short for macro average. In macro average, the metric (precision, recall, or F1-
score) for each class is calculated independently, and then the average of these metrics is
taken without considering class imbalance. This means that each class contributes equally
to the final average, regardless of the number of instances in each class, so this metric
deserves particular attention when there are class imbalances. A higher macro average
indicates better overall performance across all classes.

+ Weighted Avg: In weighted average, the metric for each class is calculated independently,
and then the average of these metrics is taken with each class weighted by its support (i.e.,
the number of true instances in each class). This means that class imbalance is taken into
account: classes with more instances receive more weight and thus have a greater influence
on the final average than classes with fewer instances. A higher weighted average indicates
better overall performance, with more weight given to classes with larger support.
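
A short, purely hypothetical two-class example (not taken from the nursery dataset) illustrates how these metrics and the two averages are computed in practice; note how the macro and weighted F1-scores diverge when one class is under-represented:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Hypothetical labels: class 'b' is heavily under-represented.
y_true = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b']
y_pred = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'a']

print(accuracy_score(y_true, y_pred))                   # 6 correct out of 8 = 0.75
print(precision_score(y_true, y_pred, average=None))    # per-class precision: [0.833, 0.5]
print(recall_score(y_true, y_pred, average=None))       # per-class recall:    [0.833, 0.5]
print(f1_score(y_true, y_pred, average='macro'))        # classes weighted equally   -> ~0.67
print(f1_score(y_true, y_pred, average='weighted'))     # classes weighted by support -> 0.75
print(classification_report(y_true, y_pred))            # full per-class summary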

1. Classification Report:

− Training 40% - Test 60%


Classification Report :
precision recall f1 - score support

not_recom 1.00 1.00 1.00 2592


priority 0.98 0.98 0.98 2560
recommend 0.00 0.00 0.00 1
spec_prior 0.98 0.98 0.98 2426
very_recom 0.90 0.95 0.93 197

accuracy 0.98 7776


macro avg 0.77 0.78 0.78 7776
weighted avg 0.98 0.98 0.98 7776

− Training 60% - Test 40%


Classification Report :
precision recall f1 - score support

not_recom 1.00 1.00 1.00 1728


priority 0.99 0.99 0.99 1706
recommend 0.00 0.00 0.00 1
spec_prior 0.99 0.99 0.99 1618
very_recom 0.95 0.95 0.95 131

accuracy 0.99 5184


macro avg 0.79 0.79 0.79 5184
weighted avg 0.99 0.99 0.99 5184

− Training 80% - Test 20%


Classification Report :
precision recall f1 - score support

not_recom 1.00 1.00 1.00 864


priority 1.00 0.99 0.99 853
recommend 0.00 0.00 0.00 0
spec_prior 0.99 1.00 1.00 809
very_recom 0.94 0.95 0.95 66

accuracy 0.99 2592


macro avg 0.79 0.79 0.79 2592

weighted avg 1.00 0.99 0.99 2592

− Training 90% - Test 10%


Classification Report :
precision recall f1 - score support

not_recom 1.00 1.00 1.00 432


priority 1.00 0.99 0.99 427
spec_prior 1.00 1.00 1.00 404
very_recom 0.94 1.00 0.97 33

accuracy 1.00 1296


macro avg 0.98 1.00 0.99 1296
weighted avg 1.00 1.00 1.00 1296

*Note: Because there are 0 instances of class recommend in this split's test set, it does not
appear in this report.

2. Confusion Matrix: A confusion matrix is a table that summarizes the performance of a
classification model by comparing its predicted labels to the true labels. It displays the num-
ber of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives
(FN) of the model's predictions.
How to read a Confusion Matrix:

+ Each row of the matrix represents the instances of an actual class, while each column
represents the instances of a predicted class (following scikit-learn's convention). (Ex: The
first row represents instances of the first class not_recom, the second priority, and so on...)
+ The diagonal elements (from top-left to bottom-right) represent the correctly classified
instances for each class. (Ex: In the 40-60 ratio, class not_recom (top-left) has 2592 in-
stances correctly classified.)
+ The off-diagonal elements represent misclassifications. If there are no misclassifications,
then all the off-diagonal elements are 0. (Ex: In the 40-60 ratio, the element at row 3,
column 1 (rows and columns numbered from 0 to 4) has a value of 58, indicating that 58
instances of class spec_prior were misclassified as class priority.)
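
The matrices shown below were most likely plotted as heatmaps; a minimal sketch of one way to produce such a figure with seaborn (the styling choices are assumptions):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(label_test, label_pred)

# Annotated heatmap: rows = actual classes, columns = predicted classes.
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.title('Confusion Matrix')
plt.show()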

− Training 40% - Test 60%

− Training 60% - Test 40%

− Training 80% - Test 20%

− Training 90% - Test 10%

2.3.2 Comments:

For the 40-60 ratio, a couple of things are noteworthy.

+ The model has a weighted precision of 0.98; in other words, when it predicts the target
(the rank of a nursery school application), it is correct about 98% of the time (an almost
perfect prediction rate).
+ The model performs exceptionally well for classes not_recom, priority, and spec_prior,
with precision, recall, and F1-score all at 0.98 or above. For class very_recom, recall is
high at 0.95, but precision is lower at 0.90, resulting in a lower F1-score of 0.93.
This is because:
• All three of these classes (unlike very_recom, which has roughly 12.5 times fewer
instances) have a balanced distribution of instances (around 2500 each, as seen in the
support column), making it easier for the model to learn and generalize patterns
effectively.
• Sufficient training data for these classes might have been available, enabling the
model to learn robust representations of these classes during training.
• The classifier's maximum depth hyperparameter is set to None; without a depth limit,
the tree can grow as far as it needs, leading to improved performance for these classes.
• These classes also have distinct and easily separable features from other classes
(clearly named and ranked), allowing the model to make accurate predictions.
+ In contrast to not_recom's perfect scores across the board, the recommend class has a
score of 0.00 for all of the evaluation metrics (precision, recall, F1-score), indicating
that the model fails to correctly classify any instances of this class. This is due to class
imbalance and insufficient data: recommend accounts for only about 0.015% of the whole
dataset (12960 instances) and close to 0% of the split sub-datasets (ONLY 1 instance, as
seen from the support column and illustrated in Section 2.2.1).
+ Because of this class imbalance, the macro average, which calculates the metric for each
class and then takes an unweighted mean, shows precision, recall, and F1-score of only
around 0.78. Meanwhile, the weighted average is about 0.98, much higher because the
recommend class (with a support of only 1) barely affects the weighted metrics.

Conclusion: Overall, the model has a high accuracy of 98% and high F1-scores, but its per-
formance varies significantly across classes. Further investigation and possibly model
improvement are necessary, especially regarding the recommend class.

A SIMILAR TREND IS SEEN FOR ALL OF THE OTHER SPLITS, BUT
WITH FEWER SUPPORT INSTANCES AS THE TEST SIZE DECREASES

Final Comment:

+ Consistency in Performance: Across all training-test splits (60%-40%, 80%-20%, and
90%-10%), the precision, recall, and F1-score for classes not_recom and priority are
consistently high, indicating that the model performs well in correctly identifying in-
stances belonging to these classes.
+ Imbalance in the Dataset: The recommend class has very low support (only 1 instance
in the 40%-60% and 60%-40% test sets, and none in the smaller ones), leading to precision,
recall, and F1-score values of 0 for this class wherever it appears in the reports. This
suggests that the dataset is imbalanced, with very few instances of the recommend class,
which makes it challenging for the model to learn patterns effectively for this class.
+ Improvement with Larger Training Sets: Generally, as the size of the training set in-
creases (from 60% to 90%), there is an improvement in the performance metrics (preci-
sion, recall, and F1-score) for most classes. This improvement indicates that providing
the model with more data for training leads to better generalization and performance on
the test set.
+ Stability of Weighted Average: The weighted average for precision, recall, and F1-score
remains consistent across different training-test splits, indicating that the overall perfor-
mance of the model is stable regardless of the size of the training set.
+ Effect of Support on Macro Average: The macro average reflects the overall performance
of the model across all classes, giving each class equal weight regardless of its support.
The macro average for precision, recall, and F1-score is similar across the first three splits
and rises sharply in the 90%-10% split, where the recommend class (which always scores
0) is absent.

Final Conclusion: In summary, these observations highlight the importance of dataset bal-
ance, the impact of training set size on model performance, and the stability of overall per-
formance metrics across different training-test splits. Additionally, the consistent high perfor-
mance for classes with sufficient support indicates that the model effectively learns patterns
for these classes.

3 Statistics Report
3.1 The depth and accuracy of a decision tree:
3.1.1 Decision Tree Visualization
NOTE: Due to some trees being too large to display here, please refer to the attached '.png' files
in the '[REPORT] Data Assets' folder. A sketch of how these images can be exported is given
after the list below.

1. max_depth = None

2. max_depth = 2

3. max_depth = 3

4. max_depth = 4

5. max_depth = 5

6. max_depth = 6

7. max_depth = 7
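
A minimal sketch of how each tree image can be exported with export_graphviz and pydotplus, assuming clf is a trained classifier from Section 2.2.2 and that the output file name is arbitrary:

from sklearn.tree import export_graphviz
import pydotplus

FEATURE_NAMES = ['parents', 'has_nurs', 'form', 'children',
                 'housing', 'finance', 'social', 'health']

# Export the trained tree to Graphviz Dot format and render it as a PNG file.
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=FEATURE_NAMES,
                           class_names=list(clf.classes_),
                           filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('decision_tree_depth_None.png')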

3.1.2 Accuracy to Max Depth


This table was made using the data gathered from the 80/20 (train/test) split.

Accuracy scores with different max_depth values.

Line graph representation, where the x-axis value 1 represents max_depth = None. A sketch of how this chart can be produced is shown below.
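
A minimal sketch of how the accuracy values behind this table and chart can be computed, assuming the 80/20 split variables from Section 2.2.1 and the DecisionTreeClassifierInfoGain class from Section 2.2.2:

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

MAX_DEPTH = [None, 2, 3, 4, 5, 6, 7]
accuracies = []

for depth in MAX_DEPTH:
    clf = DecisionTreeClassifierInfoGain(max_depth=depth)
    clf.fit(feature_train, label_train)
    accuracies.append(accuracy_score(label_test, clf.predict(feature_test)))

# Plot accuracy against max_depth; x = 1 corresponds to max_depth = None,
# matching the line graph described above.
plt.plot(range(1, len(MAX_DEPTH) + 1), accuracies, marker='o')
plt.xticks(range(1, len(MAX_DEPTH) + 1), [str(d) for d in MAX_DEPTH])
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. max_depth (80/20 split)')
plt.show()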

Comments:

− It is clear that the accuracy scores increase with the max_depth values, with no limit
(None) resulting in the highest accuracy score (extremely close to 100%).

− While it is true that the accuracy score continues to improve with increasing max_depth,
the rate of improvement slows down. This is evident in the smaller increases in accuracy ob-
served as max_depth increases beyond 4. In contrast, there is a significant jump in accu-
racy from depth 2 (76.23%) to depth 3 (80.05%), indicating that allowing the tree to grow
beyond a depth of 2 substantially improves performance.
=⇒ Therefore, this improvement is not linear.

− This is because, while deeper trees tend to yield higher accuracy on the training data,
there is a risk of Overfitting to the training data, leading to poor generalization on unseen
data. Conversely, trees that are too shallow risk Underfitting.

Conclusion: The Decision Tree’s depth significantly impacts its accuracy. While deeper trees gen-
erally lead to better performance, there’s a trade-off between accuracy and model complexity. It’s
essential to find the optimal depth that maximizes accuracy without Overfitting or Underfitting
the training data.

4 References
[1] scikit-learn library.

[2] Datacamp - Decision Tree Classification in Python Tutorial.

[3] GeeksforGeeks - How Decision Tree Depth Impact on the Accuracy.

[4] 105 Evaluating A Classification Model 6 Classification Report | Creating Machine Learning
Models

[5] How to interpret a confusion matrix for a machine learning model

