
Techniques to Improve Classification Accuracy

Ensemble Methods

• Bagging
• Boosting
• AdaBoost
• Random Forest

Faculty of Information Science


Ensemble Methods: Increasing the Accuracy

[Figure: an ensemble. Samples of the data D train classifiers C1, C2, …, CT; for a new data sample, the classifiers' votes are combined into a single class prediction.]

Ensemble methods

▪ Use a combination of models to increase accuracy
▪ Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*



Ensemble Methods: Increasing the Accuracy

Popular ensemble methods

▪ Bagging: averaging the prediction over a collection of classifiers
▪ Boosting: weighted vote with a collection of classifiers
▪ Ensemble: combining a set of heterogeneous classifiers



Bagging: Bootstrap Aggregation

Analogy

❖ Diagnosis based on multiple doctors’ majority vote

Training

❖ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
❖ A classifier model Mi is learned for each training set Di

Classification

❖ Classify an unknown sample X
❖ Each classifier Mi returns its class prediction
❖ The bagged classifier M* counts the votes and assigns the class with the most votes to X
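The training and classification steps above can be sketched in a few lines of Python. `train_stump` is a hypothetical one-dimensional base learner added for illustration; it is not part of the slides, and any weak classifier could take its place:

```python
import random
from collections import Counter

def train_stump(data):
    # data: list of (x, y) pairs with numeric x and a class label y.
    # Pick the threshold (and side labels) minimizing training error.
    labels = sorted({y for _, y in data})
    best = None
    for t in sorted({x for x, _ in data}):
        for left in labels:
            for right in labels:
                err = sum(1 for x, y in data
                          if (left if x <= t else right) != y)
                if best is None or err < best[0]:
                    best = (err, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(data, learn, k, seed=0):
    # Learn k models M1..Mk, each from a bootstrap sample Di of
    # d tuples drawn with replacement from D.
    rng = random.Random(seed)
    d = len(data)
    models = [learn([rng.choice(data) for _ in range(d)]) for _ in range(k)]
    def predict(x):
        # M* counts the classifiers' votes and returns the majority class.
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
M = bagging(data, train_stump, k=11)
print(M(2), M(7))
```

With well-separated toy data like this, essentially every bootstrapped stump finds the same split, so the majority vote recovers the correct class on both sides of the boundary.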
Bagging: Bootstrap Aggregation (Cont'd)

Prediction
• Predicts continuous values by taking the average of each classifier's prediction for a given test tuple

Accuracy
• Often significantly better than a single classifier derived from D
• Not considerably worse, and more robust to noisy data
• Proven to improve prediction accuracy
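For numeric prediction, the vote-counting step becomes a simple average. A minimal sketch with stand-in regressors (the linear lambdas are illustrative placeholders, not from the slides):

```python
def bagged_predict(models, x):
    # For a continuous target, average each model's prediction for x.
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Three stand-in regressors whose individual errors cancel out on average.
models = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: 2 * x - 1]
print(bagged_predict(models, 5))  # 10.0
```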



Boosting

Analogy
▪ Consult several doctors, combining their diagnoses by weighted vote
➢ each weight is assigned based on that doctor's previous diagnosis accuracy



Boosting (Cont'd)

Q: How does boosting work?

▪ Weights are assigned to each training tuple
▪ A series of k classifiers is iteratively learned
▪ After a classifier Mi is learned, the weights are
updated to allow the subsequent classifier, Mi+1, to
pay more attention to the training tuples that
were misclassified by Mi
▪ The final M* combines the votes of each
individual classifier, where the weight of each
classifier's vote is a function of its accuracy



Boosting (Cont'd)

❑ The boosting algorithm can be extended for numeric prediction

❑ Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data



AdaBoost

• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all tuple weights are set to the same value (1/d)
• Generate k classifiers in k rounds. At round i:
✓ Tuples from D are sampled (with replacement) to form a training set Di of the same size
✓ Each tuple's chance of being selected is based on its weight
✓ A classification model Mi is derived from Di
✓ Its error rate is calculated using Di as a test set
✓ If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

      error(Mi) = Σ (j = 1 to d) wj · err(Xj)

• Weight of classifier Mi's vote:

      log( (1 − error(Mi)) / error(Mi) )
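The rounds and formulas above can be sketched as follows. For brevity this version computes the weighted error on D directly instead of resampling Di, and `weighted_stump` is a hypothetical one-dimensional base learner; both are simplifying assumptions, not part of the slides:

```python
import math

def weighted_stump(data, w):
    # Hypothetical base learner: weighted decision stump on 1-D inputs.
    labels = sorted({y for _, y in data})
    best = None
    for t in sorted({x for x, _ in data}):
        for left in labels:
            for right in labels:
                err = sum(wi for (x, y), wi in zip(data, w)
                          if (left if x <= t else right) != y)
                if best is None or err < best[0]:
                    best = (err, t, left, right)
    err, t, left, right = best
    return err, (lambda x: left if x <= t else right)

def adaboost(data, learn, k):
    d = len(data)
    w = [1.0 / d] * d                 # all tuple weights start at 1/d
    ensemble = []                     # (vote weight, Mi) pairs
    for _ in range(k):
        err, M = learn(data, w)       # error(Mi) = sum of misclassified weights
        if err >= 0.5:                # no better than chance: stop early
            break
        err = max(err, 1e-10)         # guard against division by zero
        beta = err / (1 - err)
        # Decrease the weights of correctly classified tuples and renormalize,
        # so the next classifier pays more attention to the misclassified ones.
        w = [wi * (beta if M(x) == y else 1.0) for (x, y), wi in zip(data, w)]
        s = sum(w)
        w = [wi / s for wi in w]
        ensemble.append((math.log(1 / beta), M))  # vote weight log((1-err)/err)
    def predict(x):
        votes = {}
        for a, M in ensemble:
            votes[M(x)] = votes.get(M(x), 0.0) + a
        return max(votes, key=votes.get)
    return predict

data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
M = adaboost(data, weighted_stump, k=3)
print(M(2), M(7))  # 0 1
```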
Random Forest

▪ Each classifier in the ensemble is a decision tree classifier
▪ Each tree is generated using a random selection of attributes at each node to determine the split
▪ During classification, each tree votes and the most popular class is returned
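A compact sketch of these three points. To keep the example short, each "tree" here is a single split node that considers only a random subset of attributes (the Forest-RI idea from the next slide); a real forest grows full trees the same way at every node:

```python
import random
from collections import Counter

def train_tree(data, n_feats, rng):
    # Depth-1 "tree": at the node, consider only a random subset of
    # n_feats attributes when searching for the best split.
    feats = rng.sample(range(len(data[0][0])), n_feats)
    best = None
    for f in feats:
        for t in sorted({x[f] for x, _ in data}):
            left = [y for x, y in data if x[f] <= t]
            right = [y for x, y in data if x[f] > t]
            if not left or not right:
                continue
            lc, ln = Counter(left).most_common(1)[0]
            rc, rn = Counter(right).most_common(1)[0]
            err = (len(left) - ln) + (len(right) - rn)
            if best is None or err < best[0]:
                best = (err, f, t, lc, rc)
    if best is None:                  # no usable split: predict majority class
        maj = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: maj
    _, f, t, lc, rc = best
    return lambda x: lc if x[f] <= t else rc

def random_forest(data, n_trees, n_feats, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Each tree is grown on its own bootstrap sample of the data.
        boot = [rng.choice(data) for _ in range(len(data))]
        trees.append(train_tree(boot, n_feats, rng))
    # Each tree votes; the most popular class is returned.
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

# Toy data: attributes 0 and 1 separate the classes; attribute 2 is constant.
data = [((1, 1, 5), 0), ((2, 2, 5), 0), ((3, 3, 5), 0),
        ((7, 6, 5), 1), ((8, 7, 5), 1), ((9, 8, 5), 1)]
forest = random_forest(data, n_trees=25, n_feats=2)
print(forest((2, 2, 5)), forest((8, 7, 5)))
```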



Random Forest (Cont'd)

Two methods to construct a random forest

▪ Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node
▪ Forest-RC (random linear combinations): creates new attributes that are a linear combination of the existing attributes



Random Forest (Cont'd)

❑ Comparable in accuracy to AdaBoost, but more robust to errors and outliers

❑ Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting



Model Evaluation and Selection

Let's summarize what we learned today

o Evaluation metrics include accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure.
o Stratified k-fold cross-validation is recommended for accuracy estimation.
o Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.

o Significance tests and ROC curves are useful for model selection.
o There have been numerous comparisons of the different
classification methods.

