CART (Classification and Regression Trees) uses a binary tree structure to split the data into subsets based on the values of the input variables.
Each internal node in the tree represents a test on an input variable, and each leaf node represents a prediction for the output variable.
CART can handle both numerical and categorical variables, and can also deal with missing values.
CART works by recursively finding the best split point for each input variable that
minimizes the impurity of the child nodes. Impurity is a measure of how mixed the
classes are in a node. For classification tasks, CART uses the Gini index as the
impurity measure, which is defined as:

G = Σ_{i=1}^{k} p_i (1 − p_i)
where k is the number of classes and p_i is the proportion of instances in the node that belong to class i.
The Gini index ranges from 0 to 1 − 1/k, where 0 means the node is pure (all instances belong to the same class) and the maximum is reached when the node is completely mixed (equal proportion of instances for each class). For two classes, that maximum is 0.5.
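As a sketch of the formula above, a small Python helper (illustrative only, not part of any library) computes the Gini index from the class counts in a node:

```python
def gini(counts):
    """Gini index G = sum over classes of p_i * (1 - p_i),
    computed from the class counts in a node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

print(gini([4, 0]))  # pure node -> 0.0
print(gini([5, 5]))  # evenly mixed binary node -> 0.5
```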
For regression tasks, CART uses the mean squared error (MSE) as the impurity measure, which is defined as:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ȳ)²

where n is the number of instances in the node, y_i is the actual value of the output variable for instance i, and ȳ is the mean value of the output variable in the node. The MSE measures how much the predictions deviate from the actual values; the lower the MSE, the more accurate the predictions are.
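The same idea in code: a minimal stdlib-only sketch (the helper name is ours, not a library function) of the node MSE used as the regression impurity:

```python
def node_mse(values):
    """Mean squared error of a node: the average squared deviation
    of each value from the node mean, (1/n) * sum (y_i - ybar)^2."""
    n = len(values)
    ybar = sum(values) / n
    return sum((y - ybar) ** 2 for y in values) / n

print(node_mse([2.0, 4.0, 6.0]))  # mean is 4.0, so MSE = (4 + 0 + 4) / 3
```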
Applications and Use Cases of the CART Algorithm
• Healthcare: Disease Diagnosis and Risk Assessment
Advantages of the CART Algorithm
• Handles mixed data types: the algorithm can easily manage datasets containing both categorical and numerical variables.
Limitations of the CART Algorithm
• Overfitting: without proper pruning, CART can create complex trees that fit the training data too closely, leading to poor generalization on unseen data.
• Binary splits: CART produces binary trees, meaning each node splits into exactly two child nodes. This is not always the most efficient representation, especially for categorical data with many levels.
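The recursive split search described earlier can be sketched in a few lines of Python. This toy `best_threshold` helper (an illustration, not CART's full algorithm) scans the candidate binary splits on one numeric feature and keeps the threshold that minimizes the weighted Gini impurity of the two child nodes:

```python
def gini_impurity(labels):
    """Gini impurity of a node, 1 - sum p_i^2 (equivalent to sum p_i(1 - p_i))."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(xs, ys):
    """Return (threshold, weighted impurity) of the best binary split x <= t."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:          # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        w = (len(left) * gini_impurity(left)
             + len(right) * gini_impurity(right)) / len(ys)
        if w < best[1]:
            best = (t, w)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(xs, ys))  # splits at 3: both children are pure
```

CART applies this search to every input variable, picks the variable and threshold with the lowest weighted impurity, and then recurses into each child node.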
Decision Tree using CART algorithm Solved Example 1 (VTUPulse)
The training data has the attributes Outlook, Temp, Humidity, and Windy, and the target variable Play.
Start with any variable, in this case, outlook. It can take three values:
sunny, overcast, and rainy.
Start with the sunny value of outlook. There are five instances where the
outlook is sunny.
In two of the five instances, the play decision was yes, and in the other
three, the decision was no.
Thus, if the decision rule were outlook: sunny → no, three of the five decisions would be correct and two would be incorrect, giving two errors out of five. This can be recorded in Row 1.
Similarly, we will write all rules for the Outlook attribute.
Attribute  Value     Yes  No  Rule            Error  Total Error
Outlook    Sunny      2    3  sunny → no       2/5
           Overcast   4    0  overcast → yes   0/4
           Rainy      3    2  rainy → yes      2/5    4/14
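The error counts above can be reproduced with a short script. The counts come from the example; the rule for each value simply predicts the majority class, so its error count is the minority count:

```python
# value -> (yes, no) counts for the Outlook attribute in the example
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

errors, total = 0, 0
for value, (yes, no) in outlook.items():
    rule = "yes" if yes >= no else "no"
    errors += min(yes, no)       # the majority-class rule misses the minority
    total += yes + no
    print(f"{value} -> {rule}: {min(yes, no)}/{yes + no} errors")

print(f"total error for Outlook: {errors}/{total}")  # 4/14
```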
The C5.0 Algorithm
The main goal of the C5.0 algorithm is to create a decision tree that is as small as possible while still accurately predicting outcomes.
This is important because a smaller tree is easier to understand and use
in practice. C5.0 accomplishes this by pruning the tree and applying a range of techniques to prevent overfitting, which occurs when the tree is too complex, fits the training data too closely, and therefore performs poorly on new data.
Using C5.0, we can make predictions about a wide range of real-world
problems, such as predicting customer churn, diagnosing medical
conditions, and identifying fraudulent activity.
• Medical diagnosis
• Customer segmentation
• Financial analysis
• Image classification
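C5.0 is the commercial successor of C4.5 and, like it, chooses splits using entropy-based information gain rather than the Gini index. A minimal stdlib sketch of the entropy of a node (illustrative, not the full gain-ratio computation C5.0 uses):

```python
import math

def entropy(counts):
    """Shannon entropy of a node from class counts:
    H = -sum p_i * log2(p_i); empty classes contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# 9 "yes" and 5 "no" instances, as in the weather example above
print(round(entropy([9, 5]), 4))  # ~0.9403 bits
```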
CHAID (Chi-squared Automatic Interaction Detection)
CHAID, or Chi-squared Automatic Interaction Detection, is a classification
method for building decision trees by using chi-square statistics to identify
optimal splits.
Child node: the independent-variable categories that appear below a parent's categories in the CHAID analysis tree are called child nodes.
Terminal node: the last categories of the CHAID analysis tree are called terminal nodes. In the tree, the category with the greatest influence on the dependent variable comes first and the less important categories come later; the categories at the end of the tree are therefore the terminal nodes.
In practice, CHAID is often used in direct marketing to understand how
different groups of customers might respond to a campaign based on
their characteristics.
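As a sketch of the statistic CHAID relies on, the Pearson chi-square for a contingency table can be computed with the stdlib alone. The customer groups and response counts below are made up purely for illustration:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows):
    sum over cells of (observed - expected)^2 / expected."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical campaign data: rows = customer groups, cols = (responded, did not)
print(chi_square([[30, 70], [10, 90]]))  # 12.5
```

CHAID computes this statistic for each candidate predictor against the dependent variable and splits on the predictor with the most significant association.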