
CART (Classification and Regression Tree)

A classification and regression tree (CART) is a type of machine learning algorithm that can perform both classification and regression tasks.

It uses a binary tree structure to split the data into subsets based on the
values of the input variables.
Each internal node in the tree represents a test on an input variable, and each leaf
node represents a prediction for the output variable.

CART can handle both numerical and categorical variables, and can also
deal with missing values.
CART works by recursively finding the split (an input variable and a split point) that
minimizes the impurity of the resulting child nodes. Impurity is a measure of how mixed
the classes are in a node. For classification tasks, CART uses the Gini index as the
impurity measure, which is defined as:

G = \sum_{i=1}^{k} p_i (1 - p_i)
where k is the number of classes and p_i is the proportion of instances in
the node that belong to class i.

The Gini index ranges from 0, when the node is pure (all instances belong to the
same class), up to 1 − 1/k, when the node is completely mixed (equal proportions
of each class); for two classes this maximum is 0.5.
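As a minimal sketch (plain Python, assuming nothing beyond the formula above), the Gini index of a node can be computed directly from its class labels:

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: G = sum over classes of p_i * (1 - p_i)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes"]))        # 0.0 (pure node)
print(gini(["Yes", "No", "Yes", "No"]))   # 0.5 (completely mixed, two classes)
```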
For regression tasks, CART uses the mean squared error (MSE) as the
impurity measure, which is defined as:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
where n is the number of instances in the node, y_i is the actual value of the
output variable for instance i, and \bar{y} is the mean value of the output variable
in the node. The MSE measures how much the predictions deviate from the
actual values; the lower the MSE, the more accurate the predictions.
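Likewise, a minimal sketch of the MSE impurity for a regression node:

```python
def mse(values):
    """MSE impurity of a regression node: mean squared deviation from the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values) / len(values)

print(mse([3.0, 3.0, 3.0]))   # 0.0   (every instance equals the node mean)
print(mse([1.0, 2.0, 3.0]))   # 0.667 (values deviate from the mean)
```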
Applications and Use Cases of the CART Algorithm
• Healthcare: Disease Diagnosis and Risk Assessment

• Finance: Credit Scoring and Fraud Detection

• E-commerce: Customer Segmentation and Product Recommendations

• Energy: Consumption Forecasting and Equipment Maintenance


Advantages of the CART Algorithm

1. Versatility: CART's ability to handle both classification and regression
tasks sets it apart, allowing it to tackle a wide variety of problems.

2. Interpretability: Decision trees, the outcome of the CART algorithm,
are visually intuitive and easy to understand. This transparency is
invaluable in sectors like finance and healthcare, where interpretability
is crucial.
3. Non-parametric: CART doesn't make any underlying assumptions about
the distribution of the data, making it adaptable to diverse datasets.

4. Handles Mixed Data Types: The algorithm can easily manage datasets
containing both categorical and numerical variables.

5. Automatic Feature Selection: By design, CART naturally gives the most
informative features the greatest importance, largely removing the need
for manual feature selection.
Limitations of the CART Algorithm

1. Overfitting: Without proper pruning, CART can create complex trees that
fit the training data too closely, leading to poor generalization on unseen data.

2. Sensitivity to Data Changes: Small variations in the data can result in
vastly different trees. This can be addressed with ensemble techniques like
bagging and boosting.

3. Binary Splits: CART produces binary trees, meaning each node splits into
exactly two child nodes. This might not always be the most efficient
representation, especially with categorical data that has multiple levels.
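To see CART in practice before the worked example below, here is a hedged sketch using scikit-learn, whose DecisionTreeClassifier is documented as an optimized version of CART; the toy age/income data is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (invented): [age, income] -> credit risk class.
X = [[25, 30_000], [40, 80_000], [35, 40_000],
     [50, 120_000], [23, 20_000], [45, 90_000]]
y = ["high", "low", "high", "low", "high", "low"]

# criterion="gini" selects the Gini index; max_depth limits tree growth
# as a simple guard against overfitting.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

print(export_text(clf, feature_names=["age", "income"]))
print(clf.predict([[30, 50_000]]))  # classify a new applicant
```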
Decision Tree using CART algorithm Solved Example 1
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Instance to classify:

Outlook  Temp  Humidity  Windy  Play
Sunny    Hot   Normal    True   ?
In this example, there are four candidate questions, one for each input
variable:

Start with one variable, in this case Outlook. It can take three values:
Sunny, Overcast, and Rainy.

Consider the Sunny value of Outlook first. There are five instances where
the outlook is Sunny.

In two of the five instances the play decision was Yes, and in the other
three it was No. Thus, if the decision rule were Outlook: Sunny → No, three
out of five decisions would be correct and two out of five would be
incorrect, giving two errors out of five. This is recorded in row 1 of the
error table below.

Similarly, we write the rules for the remaining values of the Outlook attribute.

Counts for the Outlook attribute:

Outlook   Count  Yes  No
Overcast  4      4    0
Sunny     5      2    3
Rainy     5      3    2
Rules, individual errors, and total error for the Outlook attribute:

Attribute  Rule            Error  Total Error
Outlook    Sunny → No      2/5    4/14
           Overcast → Yes  0/4
           Rainy → Yes     2/5

Counts for the Temp attribute:

Temp  Count  Yes  No
Hot   4      2    2
Mild  6      4    2
Cool  4      3    1
Rules, individual errors, and total error for the Temp attribute:

Attribute  Rule        Error  Total Error
Temp       Hot → No    2/4    5/14
           Mild → Yes  2/6
           Cool → Yes  1/4

In the same way, error tables are built for Humidity and Windy; the attribute
with the lowest total error (here Outlook, at 4/14) is chosen as the root
split. The sketch below verifies these counts.
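A minimal sketch (plain Python) that recomputes the error tables above from the 14 training rows:

```python
from collections import Counter

# The 14 training rows transcribed from the table above.
rows = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def error_table(attr_index, name):
    """For each attribute value, print the majority-class rule and its error rate."""
    total_errors = 0
    for value in sorted({r[attr_index] for r in rows}):
        labels = [r[-1] for r in rows if r[attr_index] == value]
        majority, count = Counter(labels).most_common(1)[0]
        errors = len(labels) - count
        total_errors += errors
        print(f"{name}: {value} -> {majority}  error {errors}/{len(labels)}")
    print(f"{name} total error: {total_errors}/{len(rows)}")

error_table(0, "Outlook")  # Sunny 2/5, Overcast 0/4, Rainy 2/5 -> total 4/14
error_table(1, "Temp")     # Hot 2/4, Mild 2/6, Cool 1/4 -> total 5/14
```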
C5.0

C5.0 is a type of decision tree algorithm used in supervised learning. It
takes a dataset of labeled examples, where each example has a set of
features and a label, and creates a decision tree that can be used to make
predictions on new, unlabeled examples.

The main goal of the C5.0 algorithm is to create a decision tree that is as
small as possible while still accurately predicting outcomes.
This is important because a smaller tree is easier to understand and use
in practice. C5.0 accomplishes this by pruning the tree and utilizing a
range of techniques to prevent overfitting, which occurs when the tree is
too complex and fits the training data too closely, resulting in poor
performance on new data.
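C5.0 itself ships as the See5/C5.0 tool and the R C50 package, and has no canonical Python implementation. As a hedged illustration of the pruning idea only, here is a sketch using scikit-learn's cost-complexity pruning, a different scheme from C5.0's error-based pruning but with the same goal of a smaller, better-generalizing tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare an unpruned tree with a cost-complexity-pruned one
# (larger ccp_alpha prunes more aggressively).
for alpha in (0.0, 0.01):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha}: {tree.get_n_leaves()} leaves, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```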
Using C5.0, we can make predictions about a wide range of real-world
problems, such as predicting customer churn, diagnosing medical
conditions, and identifying fraudulent activity.

It is a powerful tool for decision-making and is widely used in industry
and academia.
Advantages of C5.0

C5.0 has several advantages, including:

• High accuracy in classification tasks

• Fast processing speed due to its efficient implementation

• Handling of both continuous and discrete data

• Automatic pruning to prevent overfitting


Limitations of C5.0

Some limitations of C5.0 include:

• It may not perform well on datasets with a large number of attributes or classes

• It may not handle missing data well

• It may not be suitable for regression tasks


Applications of C5.0

C5.0 is widely used in various applications, such as:

• Medical diagnosis

• Customer segmentation

• Financial analysis

• Image classification
CHAID (Chi-squared Automatic Interaction Detection)
CHAID, or Chi-squared Automatic Interaction Detection, is a classification
method for building decision trees by using chi-square statistics to identify
optimal splits.

It is used for discovering relationships between a categorical response variable
and other categorical predictor variables. It is useful when looking for
patterns in datasets with many categorical variables and is a convenient way
of summarizing the data, since the relationships can be easily visualized.
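There is no CHAID implementation in the standard Python stack; as a hedged sketch of its core step only, the chi-square test used to score a candidate split can be run with scipy (the contingency counts below are invented for illustration):

```python
from scipy.stats import chi2_contingency

# Invented contingency table: rows are categories of one predictor
# (e.g., region A/B/C), columns are the response categories (Yes/No).
observed = [
    [30, 70],  # region A: 30 responded, 70 did not
    [55, 45],  # region B
    [20, 80],  # region C
]

# CHAID splits on the predictor whose table gives the most significant
# chi-square statistic (smallest p-value).
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```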
Decision tree components in CHAID analysis:

In CHAID analysis, the decision tree has the following components:

Root node: The root node contains the dependent, or target, variable.

For example, CHAID is appropriate if a bank wants to predict credit card
risk based on information such as age, income, and number of credit cards.
In this example, credit card risk is the target variable and the remaining
factors are the predictor variables.
Parent node: The algorithm splits the target variable into two or more
categories, called parent nodes or initial nodes. In the bank example, the
high, medium, and low risk categories are the parent nodes.

Child node: Categories of the independent variables that appear below the
parent categories in the CHAID tree are called child nodes.

Terminal node: The last categories of the CHAID tree are called terminal
nodes. In the tree, the category with the greatest influence on the
dependent variable comes first and the least important comes last; hence
the name terminal node.
In practice, CHAID is often used in direct marketing to understand how
different groups of customers might respond to a campaign based on
their characteristics.

Suppose, for example, that we run a marketing campaign and want to
understand which customer characteristics (e.g., gender, socio-economic
status, geographic location) are associated with the response rate
achieved. We build a CHAID tree showing the effects of different customer
characteristics on the likelihood of response.
