
Unit 4. Classification (1):
Decision Trees and Rule Induction

Assoc. Prof. Nguyen Manh Tuan
AGENDA

 Models
 Decision Tree
 Rule Induction



Models
 A model is a simplified representation of reality created to serve a purpose.
 It is simplified based on some assumptions about what is and is not important for
the specific purpose, or sometimes based on constraints on information or
tractability.
 For example, a map is a model of the physical world.
 It abstracts away a tremendous amount of information that the mapmaker deemed
irrelevant for its purpose. It preserves, and sometimes further simplifies, the relevant
information.
 Various professions have well-known model types: an architectural blueprint,
an engineering prototype, etc.
 Each of these abstracts away details that are not relevant to their main
purpose and keeps those that are.



Models
Terminology: Prediction
 In common usage, prediction means to forecast a future event.
 In data science, prediction more generally means to estimate an
unknown value. This value could be something in the future (in
common usage, true prediction), but it could also be something in the
present or in the past.
 Indeed, since data mining usually deals with historical data, models
very often are built and tested using events from the past.
 The key is that the model is intended to be used to estimate an
unknown value.



Models
 Supervised learning is model creation in which the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable.
 The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
 So, for our churn-prediction problem we would like to build a model of the propensity to churn as a function of customer account attributes, such as age, income, length of time with the company, number of calls to customer service, customer demographics, data usage, and others.
 The creation of models from data is known as model induction.
 The procedure that creates the model from the data is called the induction algorithm or
learner. Most inductive procedures have variants that induce models both for
classification and for regression.



Models



Models



Models - Classification

Target variable is categorical. Predictors could be of any data type.

Algorithms:
 Decision Trees
 Rule Induction
 k-Nearest Neighbors (kNN)
 Ensemble / Meta Models
 Naïve Bayesian
 Neural Networks
 Support Vector Machines



Models - Classification

 A predictive model focuses on estimating the value of some particular target variable of interest.
 The goal is to segment the population into subgroups that have different values for the target variable.
 Segmentation may at the same time provide a human-understandable set of
segmentation patterns.
 We might like to rank the variables by how good they are at predicting the
value of the target.
 In our example, what variable gives us the most information about the future churn rate of
the population? being a professional? age? place of residence? income? number of
complaints to customer service?
 We now will look carefully into one useful way to select informative variables.



Models - Selecting Informative Attributes

 The label over each head represents the value of the target variable (write-off or not).
 Colors and shapes represent different predictor attributes.
 Attributes:
head-shape: square, circular; body-shape: rectangular, oval; body-color: gray, white
 Target variable:
write-off: Yes, No



Models - Selecting Informative Attributes
 Question:
 Which of the attributes would be best to segment these people into groups, in a way
that will distinguish write-offs from non-write-offs?
 Technically, we would like the resulting groups to be as pure as possible.
 Pure: homogeneous with respect to the target variable.
 If every member of a group has the same value for the target, then the group is pure. If there is at least
one member of the group that has a different value for the target variable than the rest of the group, then
the group is impure.
 Unfortunately, in real data we seldom expect to find a variable that will make the
segments pure.
 Purity measure
 The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
 Both concepts were introduced by Claude Shannon, one of the pioneers of information theory (Shannon, 1948).



Models - Selecting Informative Attributes
 Entropy is a measure of disorder that can be applied to a set, such as one of
our individual segments.
 Disorder corresponds to how mixed (impure) the segment is with respect to
these properties of interest.
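The defining formula appears only as a figure in the original slide; restated here in LaTeX:

$$ \mathrm{entropy}(S) \;=\; -\,p(c_1)\log_2 p(c_1) \;-\; p(c_2)\log_2 p(c_2) \;-\; \cdots \;=\; -\sum_{i} p(c_i)\log_2 p(c_i) $$

where the $c_i$ are the classes present in the set S and $p(c_i)$ is the proportion of instances of S belonging to class $c_i$. Entropy is 0 for a pure set and maximal (1 bit for two classes) when the classes are perfectly mixed.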



Models - Selecting Informative Attributes
Another example:
p(non-write-off) = 7 / 10 = 0.7
p(write-off) = 3 / 10 = 0.3

entropy(S)
= - 0.7 × log2(0.7) - 0.3 × log2(0.3)
≈ - 0.7 × (-0.51) - 0.3 × (-1.74)
≈ 0.88

Figure: a set containing 10 instances of 2 classes/properties (+, -), where p+ = 1 - p-.
At maximum disorder (p+ = p- = 0.5): entropy = -0.5 × log2(0.5) - 0.5 × log2(0.5) = 1.
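A minimal Python sketch (an illustration added here, not part of the original slides) that reproduces the values above:

from math import log2

def entropy(proportions):
    # Shannon entropy of a class distribution, in bits
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([0.7, 0.3]))   # ~0.88 for the 7/3 write-off example
print(entropy([0.5, 0.5]))   # 1.0: maximum disorder for two classes
print(entropy([1.0]))        # 0.0: a pure set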
Models - Selecting Informative Attributes

 Information Gain (IG)
 measures how much an attribute improves (decreases) entropy over the whole segmentation it creates (see the formula below).
 Notably, the entropy for each child (ci) is weighted by the proportion of instances belonging to that child, p(ci).
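The formula itself appears only as a figure in the original slide; restated here in LaTeX from the description above:

$$ IG(\text{parent}, \text{children}) \;=\; \mathrm{entropy}(\text{parent}) \;-\; \big[\, p(c_1)\,\mathrm{entropy}(c_1) + p(c_2)\,\mathrm{entropy}(c_2) + \cdots \,\big] $$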



Models - Selecting Informative Attributes

entropy(parent)
= - p(•) × log2 p(•) - p(☆) × log2 p(☆)
≈ - 0.53 × (-0.9) - 0.47 × (-1.1)
≈ 0.99 (very impure)

Child weights for the Balance split: left (Balance < 50K): 13/30 ≈ 0.43; right (Balance ≥ 50K): 17/30 ≈ 0.57


Models - Selecting Informative Attributes

 The entropy of the left child is:
entropy(Balance < 50K) = - p(•) × log2 p(•) - p(☆) × log2 p(☆)
≈ - 0.92 × (-0.12) - 0.08 × (-3.7) ≈ 0.39

 The entropy of the right child is:
entropy(Balance ≥ 50K) = - p(•) × log2 p(•) - p(☆) × log2 p(☆)
≈ - 0.24 × (-2.1) - 0.76 × (-0.39) ≈ 0.79
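Combining these child entropies with the weights noted on the previous slide (13/30 ≈ 0.43 and 17/30 ≈ 0.57) completes the example:

$$ IG \;=\; 0.99 - \big[\,0.43 \times 0.39 + 0.57 \times 0.79\,\big] \;\approx\; 0.99 - 0.62 \;\approx\; 0.37 $$

so splitting on Balance removes a substantial amount of entropy.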



Models - Selecting Informative Attributes



Models - Selecting Informative Attributes

entropy(parent) ≈ 0.99
entropy(Residence=OWN) ≈ 0.54
entropy(Residence=RENT) ≈ 0.97
entropy(Residence=OTHER) ≈ 0.98

Information Gain ≈ 0.13

 The Residence variable is less informative than Balance.



Models - Classification with numeric attributes

 Numeric variables can be “discretized” by choosing one or more split points.
 For example, income could be divided into two or more ranges.
Information gain can be applied to evaluate the segmentation created
by this discretization of the numeric attribute.
 We still are left with the question of how to choose the split point(s) for the numeric
attribute.
 An alternative: the candidate split points to examine are essentially the averages (midpoints) of adjacent available values.
 Conceptually, we can try all reasonable split points, and choose the
one that gives the highest information gain.
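A small Python sketch of this idea (all function names and the toy income/churn data below are invented for illustration):

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(values, labels, split):
    left = [y for x, y in zip(values, labels) if x < split]
    right = [y for x, y in zip(values, labels) if x >= split]
    weighted = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

def best_split(values, labels):
    # Candidate split points: midpoints (averages) of adjacent sorted values
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda s: information_gain(values, labels, s))

# Hypothetical toy data: income (in thousands) versus a churn label
income = [20, 35, 50, 62, 80, 95]
churn = ["yes", "yes", "yes", "no", "no", "no"]
print(best_split(income, churn))   # 56.0: the midpoint between 50 and 62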



Models - Tree Split with Entropy
 Two questions need to be answered at each step of the tree building process:
 Where to split the data
 When to stop splitting
 The algorithm computes the information gain for each attribute, and the attribute that yields the highest information gain is used for the split.
 Tree induction takes a divide-and-conquer approach:
 start with the whole dataset and apply variable selection to create the “purest” possible subgroups using the attributes
 (not all of the final partitions will be 100% homogeneous)
 then take each data subset and recursively apply attribute selection to find the best attribute to partition it.
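As a rough illustration of this recursion (not the algorithm used in the course software; the helper names and the toy dataset are invented for the example), a compact Python sketch of entropy-based divide-and-conquer tree induction on categorical attributes:

from math import log2
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    groups = Counter(r[attr] for r in rows)
    remainder = sum((cnt / len(rows)) *
                    entropy([r for r in rows if r[attr] == v], target)
                    for v, cnt in groups.items())
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    labels = {r[target] for r in rows}
    if len(labels) == 1 or not attrs:          # pure node, or no attributes left
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    branches = {v: build_tree([r for r in rows if r[best] == v],
                              [a for a in attrs if a != best], target)
                for v in {r[best] for r in rows}}
    return {best: branches}

# Hypothetical toy data echoing the head-shape / body-shape example
people = [
    {"body": "rectangular", "head": "square",   "writeoff": "yes"},
    {"body": "rectangular", "head": "circular", "writeoff": "yes"},
    {"body": "rectangular", "head": "square",   "writeoff": "yes"},
    {"body": "oval",        "head": "circular", "writeoff": "no"},
    {"body": "oval",        "head": "square",   "writeoff": "no"},
    {"body": "oval",        "head": "circular", "writeoff": "yes"},
]
print(build_tree(people, ["body", "head"], "writeoff"))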



Models - Tree Split with Entropy
 In principle, the process stops when the nodes are pure or all attributes have been used to split on.
 In practice, it is very unlikely to get terminal nodes that are 100% homogeneous.
 Several situations where the process can be terminated early:
 No attribute satisfies a minimum information gain threshold.
 A maximal depth is reached: as the tree grows larger, not only does interpretation get harder, but a situation called “overfitting” is induced.
 There are fewer than a certain number of examples in the current subtree: again, a mechanism to prevent overfitting.
• Overfitting occurs when a model tries to memorize the training data instead of generalizing
the relationship between inputs and output variables.
• Overfitting often has the effect of performing well on the training dataset but performing
poorly on any new data previously unseen by the model.
• Overfitting by a decision tree not only makes the model difficult to interpret, but also leaves a model that is of little use for unseen data.
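These stopping rules correspond directly to hyperparameters in common implementations. A scikit-learn illustration (the parameter values here are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",          # split using entropy / information gain
    min_impurity_decrease=0.01,   # stop: no attribute reaches a minimum gain threshold
    max_depth=5,                  # stop: a maximal depth is reached
    min_samples_leaf=20,          # stop: too few examples left in the current subtree
)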



Decision Trees

 Example: Attributes:
head-shape: square, circular;
body-shape: rectangular, oval;
body-color: gray, white
Target variable:
write-off: Yes, No



Decision Trees

First partitioning: splitting on body shape (rectangular versus oval).



Decision Trees

Second partitioning: the oval-body people sub-grouped by head type.



Decision Trees

Third partitioning: the rectangular-body people sub-grouped by body color.
Decision Trees

The resulting classification tree.
Decision Trees - Visualizing Segmentations

 It is instructive to visualize exactly how a classification tree partitions the instance space.
 The instance space is simply the space described by the data features.
 A common form of instance space visualization is a scatterplot on some pair
of features, used to compare one variable against another to detect
correlations and relationships.
 Though data may contain dozens or hundreds of variables, it is only really possible to visualize segmentations in two or three dimensions at once.
 Visualizing models in instance space in a few dimensions is useful for understanding the different types of models, because it provides insights that apply to higher-dimensional spaces as well.
Decision Trees - Visualizing Segmentations

* The black dots correspond to instances of the class Write-off.
* The plus signs correspond to instances of the class non-Write-off.
Decision Trees (To Play Golf or Not)

Table columns: Predictors / Attributes; Target / Response



Decision Trees (To Play Golf or Not)

• 14 examples, with 4 attributes (Outlook, Temperature, Humidity, Windy)
• The target attribute to be predicted is Play, with 2 classes (Yes and No)



Decision Trees (To Play Golf or Not)
Start by partitioning the data on Outlook (3 categories - overcast, rain,
sunny)

Splitting data on Outlook



Decision Trees (To Play Golf or Not)

Recalculate the IG above?
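A quick way to check the answer, assuming the standard 14-example golf data (9 Yes / 5 No overall; Outlook = sunny 2/3, overcast 4/0, rain 3/2 — verify these counts against the slide's table):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

parent = entropy([9, 5])                                   # ~0.940 bits
children = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
remainder = sum((sum(c) / 14) * entropy(c) for c in children.values())
print(parent - remainder)                                  # ~0.247 bits of information gain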



Decision Trees (To Play Golf or Not)

Training data: //Samples/data/Golf


Test data: //Samples/data/Golf-Testset



Decision Trees (To Play Golf or Not)



Decision Trees (To Play Golf or Not)

9/14 class predictions correct = 64.29%



Decision Trees (Prospect Filtering)
 Credit scoring is a fairly common data science problem.
 Some types of situations where credit scoring could be applied are:
 Prospect filtering: Identify which prospects to extend credit to and determine how
much credit would be an acceptable risk.
 Default risk detection: Decide if a particular customer is likely to default on a loan.
 Bad debt collection: Sort out those debtors who will yield a good cost
(of collection) to benefit (of receiving payment) performance
 The German Credit dataset from the University of California-Irvine Machine
Learning (UCI-ML) data repository (http://archive.ics.uci.edu/ml/datasets/)
 1000 samples/examples/cases
 Total of 20 attributes and 1 label/target attribute
 Target variable/label: Credit Rating
Hands-on problem:
Using a decision tree, solve the prospect-filtering task with the German Credit dataset.
Decision Trees – Summary (1)
Application of the decision tree algorithm (a simple 5-step process):
 Using Shannon entropy, sort the dataset into homogeneous (by class) and non-homogeneous variables. Homogeneous variables have low information entropy and non-homogeneous variables have high information entropy (Ex: I(Outlook, no partition)).
 Weight the influence of each independent variable on the target variable using the entropy
weighted averages (Ex: I (outlook)).
 Compute the information gain, which is essentially the reduction in the entropy of the target
variable due to its relationship with each independent variable (Ex: I (outlook, no partition) - I (outlook)).
 The independent variable with the highest information gain becomes the root, or the first node on which the dataset is divided. This is done using the calculation of the information gain table.
 Repeat this process for each variable for which the Shannon entropy is nonzero. If the
entropy of a variable is zero, then that variable becomes a “leaf” node.



Decision Trees – Summary (2)
There are 4 main steps in setting up any supervised learning algorithm for a
predictive modeling exercise:
 Read in the cleaned and prepared data typically from a database or a
spreadsheet, but the data can be from any source.
 Split data into training and testing samples.
 Train the decision tree using the training portion of the dataset.
 Apply the model on the testing portion of the dataset to evaluate the
performance of the model.
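A minimal scikit-learn sketch of these four steps, assuming the German Credit data has been exported to a local CSV file (the file name german_credit.csv and the exact label column name are placeholders to adapt to your own copy):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Read in the cleaned and prepared data (path is a placeholder)
data = pd.read_csv("german_credit.csv")
X = pd.get_dummies(data.drop(columns=["Credit Rating"]))   # one-hot encode categorical attributes
y = data["Credit Rating"]

# 2. Split data into training and testing samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train the decision tree using the training portion
model = DecisionTreeClassifier(criterion="entropy", max_depth=5)
model.fit(X_train, y_train)

# 4. Apply the model to the testing portion and evaluate performance
print(accuracy_score(y_test, model.predict(X_test)))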



Decision Trees – Summary (3)
 Decision trees are one of the most commonly used predictive modeling algorithms in
practice.
 There are several distinct advantages to using decision trees in many classification and prediction applications:
 Easy to interpret and explain to non-technical users
 Decision trees require relatively little effort from users for data preparation (no need for scale normalization; tolerant of missing values and outliers)
 Nonlinear relationships between parameters do not affect tree performance
 Decision trees implicitly perform variable screening or feature selection (top few nodes are
important; feature selection automatically completed)
 Key disadvantage of decision trees:
 Without proper pruning or limiting tree growth, they tend to overfit the training data, making them
somewhat poor predictors.



Rule Induction



Rule Induction - Indirect approach: Tree to Rules

Trees as Sets of Rules

 You classify a new unseen instance by starting at the root node and
following the attribute tests downward until you reach a leaf node,
which specifies the instance’s predicted class.
 If we trace down a single path from the root node to a leaf, collecting
the conditions as we go, we generate a rule.
 Each rule consists of the attribute tests along the path connected
with AND.



Rule Induction - Indirect approach: Tree to Rules

Rule 1: if (Outlook = overcast) then yes


Rule 2: if (Outlook = rain) and (Wind = false) then yes
Rule 3: if (Outlook = rain) and (Wind = true) then no
Rule 4: if (Outlook = sunny) and (Humidity > 77.5) then no
Rule 5: if (Outlook = sunny) and (Humidity ≤ 77.5) then yes
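These five rules translate directly into code. A tiny illustrative sketch (the function name and argument types are invented for the example):

def predict_play(outlook, wind, humidity):
    # Rules read off the golf decision tree above
    if outlook == "overcast":
        return "yes"                                # Rule 1
    if outlook == "rain":
        return "yes" if not wind else "no"          # Rules 2 and 3
    if outlook == "sunny":
        return "no" if humidity > 77.5 else "yes"   # Rules 4 and 5
    return None

print(predict_play("sunny", wind=False, humidity=70))   # yes (Rule 5)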



Rule Induction - Direct approach: Rules

R = { r1 ∨ r2 ∨ r3 ∨ … ∨ rk }
 Where k is the number of disjuncts in a rule set R.
 Individual rule ri (called disjunct/classification rule) can be represented as
ri = if (antecedent or condition) then (consequent)
 Each antecedent/condition can have many attributes and values each
separated by a logical AND operator.
 Each antecedent/condition test is called a conjunct of the rule
 Each conjunct is a node in the equivalent decision tree

Rule 2 as r2: if (Outlook = rain) and (Wind = false) then Play = yes
- (Outlook = rain) AND (Wind = false): antecedent/condition
- (Outlook = rain): conjunct



Rule Induction - Direct approach: Rules

 Sequential covering: an iterative procedure for extracting rules from a dataset
 It attempts to find all the rules in the dataset, class by class
 One specific implementation: RIPPER (Cohen, 1995)
The procedure (class selection → rule development → Learn-One-Rule → next rule → development of the rule set):
1) Class-oriented: all the rules of one class are developed before moving on to the next class; usually the least-frequent class label is chosen first.
2) Cover all ‘+’ points using a rectilinear box containing zero (or as few as possible) ‘-’ points.
3) Each rule starts out empty (if {} then class = ‘+’) and conjuncts are added one by one to increase the rule accuracy; the algorithm greedily adds conjuncts until the accuracy reaches 100%.
4) After a rule is developed, all points covered by the rule are eliminated.
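A highly simplified Python sketch of the outer sequential-covering loop. Here Learn-One-Rule is reduced to picking a single best conjunct, so this is only a toy stand-in for RIPPER, whose real rule-growing and pruning steps are considerably more involved; all names and the tiny dataset are invented for illustration:

def learn_one_rule(rows, target_class, attrs):
    # Greedy stub: starting from the empty rule {}, add the single
    # (attribute, value) conjunct that best improves accuracy on the target class.
    best, best_acc = {}, 0.0
    for a in attrs:
        for v in {r[a] for r in rows}:
            covered = [r for r in rows if r[a] == v]
            acc = sum(r["class"] == target_class for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = {a: v}, acc
    return best

def sequential_covering(rows, target_class, attrs):
    rules, remaining = [], list(rows)
    while any(r["class"] == target_class for r in remaining):
        rule = learn_one_rule(remaining, target_class, attrs)
        if not rule:
            break
        rules.append(rule)
        # Eliminate every example covered by the new rule, then repeat
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in rule.items())]
    return rules

data = [
    {"outlook": "sunny", "windy": "false", "class": "+"},
    {"outlook": "sunny", "windy": "true",  "class": "-"},
    {"outlook": "rain",  "windy": "false", "class": "+"},
    {"outlook": "rain",  "windy": "true",  "class": "-"},
]
print(sequential_covering(data, "+", ["outlook", "windy"]))   # [{'windy': 'false'}]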
Rule Induction - Sequential covering

A(r0) = 7/19=36.84%
A(r1) = 4/7=57.14%
A(r2) = 3/3=100%
Hands-on problem:
Employing rule induction with the IRIS dataset



Rule Induction - Direct approach: Rules



Rule Induction - Direct approach: Rules



Rule induction – Summary
 Classification using rules provides a simple framework to identify a relationship between attributes and the class label that can be used not only as a predictive model, but also as a descriptive model.
 Rules are closely related to decision trees. They split the data space in a rectilinear fashion and generate a mutually exclusive and exhaustive rule set.
 When the rule set is not mutually exclusive, then the data space can be divided by
complex and curved decision boundaries.
 Since rule induction is a greedy algorithm, the result may not be the most globally
optimal solution and like decision trees, rules can overlearn the example set.
 This scenario can be mitigated by pruning.
 Given the wide reach of rules, rule induction is commonly used as a tool to express the
results of data science, even if other data science algorithms are used to create the
model



THE END

