
CART (Classification and Regression Tree)

A classification and regression tree (CART) is a type of machine learning algorithm that can perform both classification and regression tasks.

It uses a binary tree structure to split the data into subsets based on the
values of the input variables.
Each internal node in the tree represents a test on an input variable, and each leaf
node represents a prediction for the output variable.

CART can handle both numerical and categorical variables, and can also
deal with missing values.
CART works by recursively finding the split (an input variable and a split point) that
minimizes the impurity of the resulting child nodes. Impurity is a measure of how mixed
the classes are in a node. For classification tasks, CART uses the Gini index as the
impurity measure, which is defined as:

G = \sum_{i=1}^{k} p_i (1 - p_i)
where k is the number of classes and p_i is the proportion of instances in
the node that belong to class i.

The Gini index ranges from 0, when the node is pure (all instances belong to the
same class), up to 1 − 1/k, when the node is completely mixed (equal proportions
of each class); for two classes this maximum is 0.5.
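As a minimal sketch (plain Python, assuming nothing beyond the formula above), the Gini index of a node can be computed directly from its class labels:

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: G = sum over classes of p_i * (1 - p_i)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes"]))        # 0.0 (pure node)
print(gini(["Yes", "No", "Yes", "No"]))   # 0.5 (completely mixed, two classes)
```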
For regression tasks, CART uses the mean squared error (MSE) as the
impurity measure, which is defined as:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
where n is the number of instances in the node, y_i is the actual value of the
output variable for instance i, and \bar{y} is the mean value of the output variable
in the node. The MSE measures how much the predictions deviate from the
actual values; the lower the MSE, the more accurate the predictions.
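Likewise, a minimal sketch of the MSE impurity for a regression node:

```python
def mse(values):
    """MSE impurity of a regression node: mean squared deviation from the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values) / len(values)

print(mse([3.0, 3.0, 3.0]))   # 0.0   (every instance equals the node mean)
print(mse([1.0, 2.0, 3.0]))   # 0.667 (values deviate from the mean)
```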
Applications and Use Cases of the CART Algorithm
• Healthcare: Disease Diagnosis and Risk Assessment

• Finance: Credit Scoring and Fraud Detection

• E-commerce: Customer Segmentation and Product Recommendations

• Energy: Consumption Forecasting and Equipment Maintenance


Advantages of the CART Algorithm

1. Versatility: CART's ability to handle both classification and regression
tasks sets it apart, allowing it to tackle a wide variety of problems.

2. Interpretability: Decision trees, the outcome of the CART algorithm,
are visually intuitive and easy to understand. This transparency is
invaluable in sectors like finance and healthcare, where interpretability
is crucial.
3. Non-parametric: CART doesn't make any underlying assumptions about
the distribution of the data, making it adaptable to diverse datasets.

4. Handles Mixed Data Types: The algorithm can easily manage datasets
containing both categorical and numerical variables.

5. Automatic Feature Selection: By design, CART naturally gives the most
informative features the greatest importance, largely removing the need
for manual feature selection.
Limitations of the CART Algorithm

1. Overfitting: Without proper pruning, CART can create complex trees that
fit the training data too closely, leading to poor generalization on unseen data.

2. Sensitivity to Data Changes: Small variations in the data can result in
vastly different trees. This can be addressed with ensemble techniques like
bagging and boosting.

3. Binary Splits: CART produces binary trees, meaning each node splits into
exactly two child nodes. This might not always be the most efficient
representation, especially with categorical data that has multiple levels.
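To see CART in practice before the worked example below, here is a hedged sketch using scikit-learn, whose DecisionTreeClassifier is documented as an optimized version of CART; the toy age/income data is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (invented): [age, income] -> credit risk class.
X = [[25, 30_000], [40, 80_000], [35, 40_000],
     [50, 120_000], [23, 20_000], [45, 90_000]]
y = ["high", "low", "high", "low", "high", "low"]

# criterion="gini" selects the Gini index; max_depth limits tree growth
# as a simple guard against overfitting.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

print(export_text(clf, feature_names=["age", "income"]))
print(clf.predict([[30, 50_000]]))  # classify a new applicant
```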
Decision Tree using CART algorithm Solved Example 1
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Instance to classify:

Outlook  Temp  Humidity  Windy  Play
Sunny    Hot   Normal    True   ?
In this example, there are four candidate questions, one for each input
variable:

Start with one variable, in this case Outlook. It can take three values:
Sunny, Overcast, and Rainy.

Consider the Sunny value of Outlook first. There are five instances where
the outlook is Sunny.

In two of the five instances the play decision was Yes, and in the other
three it was No. Thus, if the decision rule were Outlook: Sunny → No, three
out of five decisions would be correct and two out of five would be
incorrect, giving two errors out of five. This is recorded in row 1 of the
error table below.

Similarly, we write the rules for the remaining values of the Outlook attribute.

Counts for the Outlook attribute:

Outlook   Count  Yes  No
Overcast  4      4    0
Sunny     5      2    3
Rainy     5      3    2
Rules, individual errors, and total error for the Outlook attribute:

Attribute  Rule            Error  Total Error
Outlook    Sunny → No      2/5    4/14
           Overcast → Yes  0/4
           Rainy → Yes     2/5

Counts for the Temp attribute:

Temp  Count  Yes  No
Hot   4      2    2
Mild  6      4    2
Cool  4      3    1
Rules, individual errors, and total error for the Temp attribute:

Attribute  Rule        Error  Total Error
Temp       Hot → No    2/4    5/14
           Mild → Yes  2/6
           Cool → Yes  1/4

In the same way, error tables are built for Humidity and Windy; the attribute
with the lowest total error (here Outlook, at 4/14) is chosen as the root
split. The sketch below verifies these counts.
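A minimal sketch (plain Python) that recomputes the error tables above from the 14 training rows:

```python
from collections import Counter

# The 14 training rows transcribed from the table above.
rows = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def error_table(attr_index, name):
    """For each attribute value, print the majority-class rule and its error rate."""
    total_errors = 0
    for value in sorted({r[attr_index] for r in rows}):
        labels = [r[-1] for r in rows if r[attr_index] == value]
        majority, count = Counter(labels).most_common(1)[0]
        errors = len(labels) - count
        total_errors += errors
        print(f"{name}: {value} -> {majority}  error {errors}/{len(labels)}")
    print(f"{name} total error: {total_errors}/{len(rows)}")

error_table(0, "Outlook")  # Sunny 2/5, Overcast 0/4, Rainy 2/5 -> total 4/14
error_table(1, "Temp")     # Hot 2/4, Mild 2/6, Cool 1/4 -> total 5/14
```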
C5.0

C5.0 is a type of decision tree algorithm used in supervised learning. It
takes a dataset of labeled examples, where each example has a set of
features and a label, and creates a decision tree that can be used to make
predictions on new, unlabeled examples.

The main goal of the C5.0 algorithm is to create a decision tree that is as
small as possible while still accurately predicting outcomes.
This is important because a smaller tree is easier to understand and use
in practice. C5.0 accomplishes this by pruning the tree and utilizing a
range of techniques to prevent overfitting, which occurs when the tree is
too complex and fits the training data too closely, resulting in poor
performance on new data.
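C5.0 itself ships as the See5/C5.0 tool and the R C50 package, and has no canonical Python implementation. As a hedged illustration of the pruning idea only, here is a sketch using scikit-learn's cost-complexity pruning, a different scheme from C5.0's error-based pruning but with the same goal of a smaller, better-generalizing tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare an unpruned tree with a cost-complexity-pruned one
# (larger ccp_alpha prunes more aggressively).
for alpha in (0.0, 0.01):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha}: {tree.get_n_leaves()} leaves, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```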
Using C5.0, we can make predictions about a wide range of real-world
problems, such as predicting customer churn, diagnosing medical
conditions, and identifying fraudulent activity.

It is a powerful tool for decision-making and is widely used in industry
and academia.
Advantages of C5.0

C5.0 has several advantages, including:

• High accuracy in classification tasks

• Fast processing speed due to its efficient implementation

• Handling of both continuous and discrete data

• Automatic pruning to prevent overfitting


Limitations of C5.0

Some limitations of C5.0 include:

• It may not perform well on datasets with a large number of attributes or classes

• It may not handle missing data well

• It may not be suitable for regression tasks


Applications of C5.0

C5.0 is widely used in various applications, such as:

• Medical diagnosis

• Customer segmentation

• Financial analysis

• Image classification
CHAID (Chi-squared Automatic Interaction Detection)
CHAID, or Chi-squared Automatic Interaction Detection, is a classification
method for building decision trees by using chi-square statistics to identify
optimal splits.

It is used for discovering relationships between a categorical response variable
and other categorical predictor variables. It is useful when looking for
patterns in datasets with many categorical variables and is a convenient way
of summarizing the data, since the relationships can be easily visualized.
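There is no CHAID implementation in the standard Python stack; as a hedged sketch of its core step only, the chi-square test used to score a candidate split can be run with scipy (the contingency counts below are invented for illustration):

```python
from scipy.stats import chi2_contingency

# Invented contingency table: rows are categories of one predictor
# (e.g., region A/B/C), columns are the response categories (Yes/No).
observed = [
    [30, 70],  # region A: 30 responded, 70 did not
    [55, 45],  # region B
    [20, 80],  # region C
]

# CHAID splits on the predictor whose table gives the most significant
# chi-square statistic (smallest p-value).
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```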
Decision tree components in CHAID analysis:

In CHAID analysis, the decision tree has the following components:

Root node: The root node contains the dependent, or target, variable.

For example, CHAID is appropriate if a bank wants to predict credit card
risk based on information such as age, income, and number of credit cards.
In this example, credit card risk is the target variable and the remaining
factors are the predictor variables.
Parent node: The algorithm splits the target variable into two or more
categories, called parent nodes or initial nodes. In the bank example, the
high, medium, and low risk categories are the parent nodes.

Child node: Categories of the independent variables that appear below the
parent categories in the CHAID tree are called child nodes.

Terminal node: The last categories of the CHAID tree are called terminal
nodes. In the tree, the category with the greatest influence on the
dependent variable comes first and the least important comes last; hence
the name terminal node.
In practice, CHAID is often used in direct marketing to understand how
different groups of customers might respond to a campaign based on
their characteristics.

Suppose, for example, that we run a marketing campaign and want to
understand which customer characteristics (e.g., gender, socio-economic
status, geographic location) are associated with the response rate
achieved. We build a CHAID tree showing the effects of different customer
characteristics on the likelihood of response.
