Introduction
Models and Patterns
Model = abstract representation of a given set of training data
e.g., a very simple linear model structure:
Y = aX + b
where a and b are parameters determined from the data
Y = aX + b is the model structure
Y = 0.9X + 0.3 is a particular model

Example training data:
x       f(x)
1       1
2       4
3       9
4       16
1259    ?
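As a minimal illustration (not part of the original slides), the parameters a and b of the model structure Y = aX + b can be fitted to the (x, f(x)) pairs above with a least-squares routine; note that the table actually follows f(x) = x², so the linear structure is only an approximation of the data:

```python
import numpy as np

# Observed (x, f(x)) pairs from the table above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 9.0, 16.0])

# Fit the model structure Y = aX + b by least squares
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"particular model: Y = {a:.2f}X + {b:.2f}")
print(f"prediction for x = 1259: {a * 1259 + b:.1f}")
```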
Credit approval
A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
The history of past customers is used to train the classifier
DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the properties of
the underlying observations
– e.g., a model that could simulate the data if needed
[Figure omitted: plot of the data over Red Blood Cell Volume (roughly 3.3–4.0)]
Classification
Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
Given a collection of records (the training set), where each record
contains a set of attributes and one of the attributes is the class,
find a model for the class attribute as a function of the values of the
other attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the accuracy
of the model.
Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
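A minimal sketch of this workflow, assuming scikit-learn and using the bundled Iris data purely as a stand-in for a labeled record collection:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A labeled dataset: records with attributes plus a class attribute
X, y = load_iris(return_X_y=True)

# Divide the data into a training set (to build the model)
# and a test set (to validate it)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build a model of the class attribute from the other attributes
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Assign classes to previously unseen records and measure accuracy
y_pred = clf.predict(X_test)
print("test-set accuracy:", accuracy_score(y_test, y_pred))
```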
CLASSIFICATION: A TWO-STEP PROCESS
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Classification methods
Goal: Predict class Ci = f(x1, x2, .. xn)
There are various classification methods.
Popular classification techniques include the following:
Decision tree classifier: divide decision space into
piecewise constant regions.
K-nearest neighbor
Rule-based
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Support vector machine
Decision tree classifier
Decision tree performs classification by constructing a
tree based on training instances with leaves having class
labels.
The tree is traversed for each test instance to find a leaf,
and the class of the leaf is the predicted class. This is a
directed knowledge discovery in the sense that there is a
specific field whose value we want to predict.
Widely used learning method; it has been applied to:
classifying medical patients by disease,
equipment malfunctions by cause,
loan applicants by likelihood of payment.
Choosing the Splitting Attribute
At each node, the best attribute is selected for splitting the
training examples using a Goodness function
The best attribute is the one that separates the classes of the
training examples fastest, so that it results in the smallest tree
Typical goodness functions:
information gain, information gain ratio, and GINI index
Information Gain
Select the attribute with the highest information gain, i.e., the one
that creates the smallest average disorder
First, compute the disorder using Entropy: the expected
information needed to classify objects into classes
Second, measure the Information Gain: by how much the disorder of a
set would be reduced by knowing the value of a particular attribute.
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

Example: for D containing 3 tuples of one class and 5 of the other,

    Info(D) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
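A small sketch (using only Python's standard library; the function names are illustrative, not from the slides) of how this entropy and the resulting information gain could be computed:

```python
from math import log2

def info(class_counts):
    """Info(D): expected information needed to classify a tuple in D."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - weighted Info of the partitions induced by A."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - info_a

print(round(info([3, 5]), 3))                     # 0.954, as above
print(round(gain([3, 5], [[3, 1], [0, 4]]), 3))   # gain of a hypothetical split
```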
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A)/SplitInfo(A)
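A brief sketch (names illustrative, continuing the entropy helpers above) of how SplitInfo and the gain ratio could be computed for a hypothetical attribute that splits 8 tuples into partitions of sizes 4, 2, and 2:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

gain_a = 0.55                    # assume Gain(A) was computed as in the sketch above
si = split_info([4, 2, 2])       # = 1.5 for this hypothetical split
print("GainRatio(A) =", round(gain_a / si, 3))   # 0.367
```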
Comparing Attribute Selection
Measures
The three measures, in general, return good results but
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is
much smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and
purity in both partitions
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise
or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early; do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
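A rough sketch of both ideas using scikit-learn (an assumption; the slides do not name a library): prepruning via depth/impurity thresholds, and postpruning via cost-complexity pruning with a held-out set choosing the best pruned tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: halt construction early via thresholds on depth / impurity gain
pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01,
                             random_state=0).fit(X_train, y_train)

# Postpruning: grow a full tree, then evaluate progressively pruned trees
# on data not used for training and keep the best one
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
pruned_trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                .fit(X_train, y_train)
                for a in path.ccp_alphas if a >= 0.0]
best = max(pruned_trees, key=lambda t: t.score(X_val, y_val))

print("prepruned accuracy:", pre.score(X_val, y_val))
print("best postpruned accuracy:", best.score(X_val, y_val))
```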
K-Nearest Neighbors
K-nearest neighbor is a supervised learning algorithm in which a new
query instance is classified based on the majority category of its
K nearest neighbors.
The purpose of this algorithm is to classify a new object
based on attributes and training samples: (xi, f(xi)), i=1..N.
Given a query point, we find K number of objects or
(training points) closest to the query point.
The classification is decided by majority vote among the classes of
these K objects; the K-nearest-neighbor algorithm uses this neighborhood
classification as the prediction value for the new query instance.
K nearest neighbor algorithm is very simple. It works based
on minimum distance from the query instance to the
training samples to determine the K-nearest neighbors.
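A minimal sketch, assuming scikit-learn (not named in the slides) and the Iris data as a stand-in training set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classify each query instance by majority vote among its K nearest
# training samples (Euclidean distance is the default metric)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("predicted class of first query:", knn.predict(X_test[:1])[0])
print("test accuracy:", knn.score(X_test, y_test))
```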
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
Setting the variable K (number of nearest neighbors)
The number of nearest neighbors (K) should be chosen by cross-
validation over a range of K settings.
K=1 is a good baseline model to benchmark against.
A good rule-of-thumb is that K should be less than or equal to the
square root of the total number of training patterns.
Setting the type of distance metric
We need a measure of distance in order to know who are the
neighbours
Assume that we have T attributes for the learning problem. Then
one example point x has elements xt , t=1,…T.
The distance between two points x_i and x_j is often defined as the
Euclidean distance:

    Dist(X, Y) = sqrt( Σ_{i=1}^{D} (X_i − Y_i)² )
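A short sketch (helper names are illustrative) of the Euclidean distance and of choosing K by cross-validation up to roughly the square root of the number of training patterns:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def euclidean(x, y):
    """Dist(X, Y) = sqrt(sum_i (X_i - Y_i)^2)."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

X, y = load_iris(return_X_y=True)

# Cross-validate over a number of K settings, up to about sqrt(N)
for k in range(1, int(np.sqrt(len(X))) + 1, 2):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k}: mean CV accuracy = {acc:.3f}")
```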
KNN: Advantages & Disadvantages
Advantage
Simple
Powerful
Requires no training time
Nonparametric architecture
Disadvantage: Difficulties with k-nearest neighbour
algorithms
Memory intensive: all training examples must be stored, and
when a test example is given the closest matches must be found
Classification/estimation is slow
Have to calculate the distance of the test case from all training
cases
There may be irrelevant attributes amongst the attributes –
curse of dimensionality
Rule-Based Classification
The learned model is represented as a set of IF-THEN
rules.
A rule-based classifier uses a set of IF-THEN rules for
classification
Rule: IF Condition (LHS) THEN Conclusion (RHS)
where
Condition is a conjunction of attribute tests
Conclusion is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ^ (Lay Eggs=Yes) => Birds
(age=youth) ^ (student=yes)=>(buys-computer=yes).
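A toy sketch of a rule-based classifier over such IF-THEN rules; the record fields and the rule list are illustrative only:

```python
# Each rule: (antecedent predicate over a record, consequent class label)
rules = [
    (lambda r: r.get("blood_type") == "warm" and r.get("lay_eggs") == "yes",
     "bird"),
    (lambda r: r.get("age") == "youth" and r.get("student") == "yes",
     "buys_computer=yes"),
]

def classify(record, rules, default="unknown"):
    """Return the consequent of the first rule whose antecedent (LHS) holds."""
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default  # fall back to a default class if no rule is triggered

print(classify({"blood_type": "warm", "lay_eggs": "yes"}, rules))  # bird
```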
Rule Coverage and Accuracy
Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
Fraction of records that satisfy the antecedent that also satisfy
the consequent of a rule
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
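As an illustration (the specific rule is an assumption, not from the slides), the coverage and accuracy of the rule (Marital Status = Married) => No over the ten records above:

```python
# (Refund, Marital Status, Taxable Income in K, Class) per the table above
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

# Rule R: (Marital Status = Married) => No
covered = [r for r in records if r[1] == "Married"]   # n_covers
correct = [r for r in covered if r[3] == "No"]        # n_correct

print("coverage =", len(covered) / len(records))      # 0.4
print("accuracy =", len(correct) / len(covered))      # 1.0
```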
Exhaustive rules
Classifier has exhaustive coverage if it accounts
for every possible combination of attribute
values
Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
Rules are not mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are not exhaustive
A record may not trigger any rules
Solution?
Use a default class
Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
Indirect Method:
Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
Examples: C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is
met
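A skeletal sketch of that loop; learn_one_rule, good_enough, and rule.covers are assumed helpers, not defined in the slides:

```python
def sequential_covering(records, target_class, learn_one_rule, good_enough):
    """Greedily learn rules one at a time, removing covered training records."""
    rules, remaining = [], list(records)
    while remaining:                                    # repeat steps 2 and 3
        rule = learn_one_rule(remaining, target_class)  # step 2 (assumed helper)
        if rule is None or not good_enough(rule):       # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.covers(r)]  # step 3
    return rules
```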
Direct Method: RIPPER
For 2-class problem, choose one of the classes as positive
class, and the other as negative class
Learn rules for positive class
Negative class will be default class
For multi-class problem
Order the classes according to increasing class
prevalence (fraction of instances that belong to a
particular class)
Learn the rule set for smallest class first, treat the rest
as negative class
Repeat with next smallest class as positive class
Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error
pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
Pruning method: delete any final sequence of conditions that
maximizes v
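A small sketch of this pruning step; count_pn is an assumed helper returning (p, n) for a rule's conjuncts on the validation set:

```python
def v_metric(p, n):
    """Pruning measure v = (p - n) / (p + n) on the validation set."""
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

def prune_rule(conjuncts, count_pn):
    """Delete the final sequence of conditions that maximizes v."""
    best, best_v = conjuncts, v_metric(*count_pn(conjuncts))
    for cut in range(len(conjuncts) - 1, 0, -1):        # drop trailing conjuncts
        v = v_metric(*count_pn(conjuncts[:cut]))
        if v >= best_v:
            best, best_v = conjuncts[:cut], v
    return best
```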
Direct Method: RIPPER
Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of positive
examples
Eliminate both positive and negative examples covered by
the rule
Each time a rule is added to the rule set, compute the new
description length
Stop adding new rules when the new description length is d
bits longer than the smallest description length obtained so
far
Direct Method: RIPPER
Optimize the rule set:
For each rule r in the rule set R
Consider 2 alternative rules:
– Replacement rule (r*): grow new rule from scratch
– Revised rule(r′): add conjuncts to extend the rule r
Compare the rule set for r against the rule set for r*
and r′
Choose the rule set that minimizes description length (MDL principle)
Repeat rule generation and rule optimization for the remaining
positive examples
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree
For each rule r: A → y,
consider an alternative rule r′: A′ → y, where A′ is obtained by
removing one of the conjuncts in A
Compare the pessimistic error rate for r against all r′
Prune if one of the alternative rules has lower pessimistic error
rate
Repeat until we can no longer improve generalization error
Indirect Method: C4.5rules
Instead of ordering the rules, order subsets of rules (class ordering)
Each subset is a collection of rules with the same rule
consequent (class)
Compute description length of each subset
Description length = L(error) + g × L(model)
g is a parameter that takes into account the presence of
redundant attributes in a rule set
(default value = 0.5)
Advantages of Rule-Based Classifiers
Has characteristics quite similar to decision trees
As highly expressive as decision trees
Easy to interpret
Performance comparable to decision trees
Can handle redundant attributes
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time > 10^-3 secs
• Face Recognition ~ 0.1 secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Die off frequently (never replaced)
• Compensates for problems by massive parallelism
Neural Network classifier
It is represented as a layered set of interconnected processors. These
processor nodes are loosely modeled on the neurons of the brain.
Each node has a weighted connection to several other nodes in
adjacent layers. Individual nodes take the input received from
connected nodes and use the weights together to compute output
values.
The inputs are fed simultaneously into the input layer.
The weighted outputs of these units are fed into hidden
layer.
The weighted outputs of the last hidden layer are inputs to
units making up the output layer.
Artificial Neural Networks (ANN)
[Figure: perceptron shown as a “black box” with input nodes X1, X2, X3,
weighted links w1, w2, w3, a threshold t, and an output node Y]
The model is an assembly of inter-connected nodes and weighted links.
The output node sums up each of its input values according to the
weights of its links and compares the weighted sum against the
threshold t.
Perceptron Model:

    Y = sign( Σ_{i=1}^{d} w_i X_i − t )
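A tiny sketch of that perceptron output rule; the weights and threshold are made-up values for illustration:

```python
import numpy as np

def perceptron_output(x, w, t):
    """Y = sign(sum_i w_i * X_i - t)."""
    return np.sign(np.dot(w, x) - t)

w = np.array([0.3, 0.3, 0.3])    # illustrative link weights
t = 0.4                          # illustrative threshold

print(perceptron_output(np.array([1, 1, 0]), w, t))   # 1.0  (0.6 - 0.4 > 0)
print(perceptron_output(np.array([1, 0, 0]), w, t))   # -1.0 (0.3 - 0.4 < 0)
```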
SVM—History and Applications
SVM—Linearly Separable
A separating hyperplane can be written as
W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem:
a quadratic objective function with linear constraints, solved by
Quadratic Programming (QP) using Lagrangian multipliers
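A brief sketch, assuming scikit-learn's linear SVC and a made-up linearly separable 2-D dataset, that recovers W and b of the separating hyperplane along with the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-D data (purely illustrative)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# The linear SVM solves the constrained quadratic optimization problem
# and yields the separating hyperplane W . X + b = 0
svm = SVC(kernel="linear", C=1e3).fit(X, y)

print("W =", svm.coef_[0], " b =", svm.intercept_[0])
print("support vectors:\n", svm.support_vectors_)
```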
Why Is SVM Effective on High Dimensional Data?
CB-SVM Algorithm: Outline
Construct two CF-trees from positive and negative data sets
independently
Need one scan of the data set
Train an SVM from the centroids of the root entries
De-cluster the entries near the boundary into the next level
The children entries de-clustered from the parent entries are
accumulated into the training set with the non-declustered
parent entries
Train an SVM again from the centroids of the entries in the
training set
Repeat until nothing is accumulated