
CSC 578 Neural Networks and Machine Learning


Homework #1
Due: January 20 (Wed)
Do all questions below.
1. Textbook Exercise 3.1 (p. 77).
In case you don't have the textbook yet, the question is: "Give decision trees to represent the following
Boolean functions:
a. A and (not B)
b. A or [B and C]
c. A xor B
d. [A and B] or [C and D]"
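For reference (this is not one of the assigned functions), a decision tree representing the simple function
"A and B" could look like the following: each internal node tests one variable, each branch is labeled with
one of its values, and each leaf gives the value of the function. Your answers to (a)-(d) should take the
same form.

              A
            /   \
        false    true
          |        |
        false      B
                 /   \
             false    true
               |        |
             false     true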

2. The following data gives the conditions under which an optician might want to prescribe soft contact
lenses, hard contact lenses, or no contact lenses for a patient. Show the decision tree that would be
learned by ID3. The target attribute is 'Contact-lenses'.
Show all your work, including the calculations of information gain (IG) (REQUIRED). Do NOT use any
decision-tree induction tools such as Weka. You may use tools/software for numeric calculation (including
Excel), but NOT those that produce a decision tree. (A small arithmetic sketch appears after the data table
below.)
Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Contact-lenses
young           myope               no           normal          soft
young           myope               yes          reduced         none
young           myope               yes          normal          hard
young           hypermetrope        no           reduced         none
young           hypermetrope        no           normal          soft
young           hypermetrope        yes          reduced         none
pre-presbyopic  myope               no           reduced         none
pre-presbyopic  myope               no           normal          soft
pre-presbyopic  myope               yes          normal          hard
pre-presbyopic  hypermetrope        no           reduced         none
pre-presbyopic  hypermetrope        no           normal          soft
pre-presbyopic  hypermetrope        yes          reduced         none
pre-presbyopic  hypermetrope        yes          normal          none
presbyopic      myope               no           normal          none
presbyopic      myope               yes          reduced         none
presbyopic      myope               yes          normal          hard
presbyopic      hypermetrope        no           reduced         none
presbyopic      hypermetrope        no           normal          soft
presbyopic      hypermetrope        yes          reduced         none
presbyopic      hypermetrope        yes          normal          none

3. Why is a decision tree that fits the training data very well not necessarily better than one that doesn't fit
it so well? [Assume the whole dataset fits in the computer's memory.]
Write at least 3 sentences.

4. Download WEKA and install it on your system. Then conduct the following experiment.
Setup:
If your system already has Java 1.5 or newer, the second download choice, "a self-extracting executable
without the Java VM (weka-3-6-1.exe)", will do.
In case you encounter problems with the Weka site, here is a local ZIP file of the self-extracting
executable (weka-3-6-1.zip, 18 MB).
J48 in Weka:
For this and the next question (Questions 4 and 5), you experiment with the effect of pruning in decision
trees. Weka's 'weka.classifiers.trees.J48' lets you generate pruned as well as unpruned trees. The J48
classifier provides two methods for pruning a decision tree:
a. By using a "pessimistic estimate" function (described in Mitchell's textbook, p. 71, the 9th line
from the bottom: "Another method, used by C4.5, ..."); and
b. By using a validation set to test whether the pruning will improve accuracy -- the
'reducedErrorPruning' scheme.
[FYI, J48 does NOT convert trees to rules. Both pruning schemes alter a tree after it is fully grown (thus
post-pruning).]

For question 4, we experiment with the former scheme.


The pessimistic estimate function has a parameter: the confidence level. By setting this parameter to
various values, we can experiment with the degree of pruning -- from minimal to aggressive -- and its
effect on classification accuracy. In J48, this confidence level is set by the 'confidenceFactor' parameter,
in the pop-up window that appears after clicking the text box to the right of the "Choose" button. It is set
to 0.25 by default. A smaller value gives more aggressive pruning, while a larger value gives only
minimal pruning.


Specifics:
The purpose of the experiment is to derive the (sub-)optimal confidence factor value. To do so, we try
various confidence factor values on several datasets. The datasets are as follows. A zip file which
includes all of them is also available.

vote.arff (40 kb)
    16 attributes (nominal), 2 classes, 435 instances.
    This dataset contains the party affiliation of the 435 members of the 1984 US House of
    Representatives, as well as their voting records on 16 different bills.

tic-tac-toe.arff (31 kb)
    9 attributes (nominal), 2 classes, 958 instances.
    This database encodes the complete set of possible board configurations at the end of
    tic-tac-toe games.

splice2.arff (393 kb)
    62 attributes (nominal), 3 classes, 3190 instances.
    Primate splice-junction gene sequences (DNA). Given a sequence of DNA, recognize the
    boundaries between exons (the parts of the DNA sequence retained after splicing) and
    introns (the parts of the DNA sequence that are spliced out).

breast-cancer.arff (30 kb)
    9 attributes (nominal), 2 classes, 286 instances.
    Breast cancer data; classify into recurrent/non-recurrent events.

(*) halloffame.arff (140 kb)
    17 attributes (nominal/numeric mixed), 3 classes, 1338 instances.
    Records of baseball players inducted into the Baseball Hall of Fame.

NOTE: (*) When you run the Hall of Fame data, remove the 'Player' attribute (the first
attribute). To do so, after you open the file (in the "Preprocess" step), select the attribute
and hit "Remove".

For each dataset,


Run J48 with the confidence factor set to 0.50, 0.45, 0.40, ..., down to 0.05 (i.e., decrements of
0.05, so you'll do a total of 10 runs). Also be sure to set 'minNumObj' to 1, and make sure
'reducedErrorPruning' is False and 'unpruned' is False, along with all other parameters as
indicated in the previous figure. (A scripted sketch of these runs appears after this list.)
For each run, do the evaluation by 10-fold cross-validation. In the "Weka Explorer" window,
under "Test Options", select "Cross-validation" and set the number of folds to 10 (which is the
default in Weka).


After each run, record the confidence factor, the size of the tree and the classification
accuracy.
Do the same procedure for all datasets.
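
If you would rather script the runs than repeat them by hand in the Explorer, a minimal sketch along the
following lines should produce the same numbers (the file name is an assumption; repeat for each
dataset):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ConfidenceFactorRuns {
        public static void main(String[] args) throws Exception {
            // Load one dataset (file name is an assumption).
            Instances data = DataSource.read("vote.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Confidence factor from 0.50 down to 0.05 in steps of 0.05 (10 runs).
            for (int i = 10; i >= 1; i--) {
                float cf = 0.05f * i;

                J48 tree = new J48();
                tree.setConfidenceFactor(cf);
                tree.setMinNumObj(1);
                tree.setReducedErrorPruning(false);
                tree.setUnpruned(false);

                // Accuracy estimate by 10-fold cross-validation (as in the Explorer).
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));

                // Tree size is reported for the model built on the full training data,
                // which is what the Explorer output shows.
                tree.buildClassifier(data);

                System.out.printf("cf=%.2f  tree size=%.0f  accuracy=%.2f%%%n",
                        cf, tree.measureTreeSize(), eval.pctCorrect());
            }
        }
    }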
To Answer:
Answer the following questions. In addition to running the experiments, I strongly recommend you read
the description of each dataset (written at the top of each file) in order to learn its domain.
a. Show a table which tabulates the values obtained for all runs for each dataset (confidence factor,
size of the tree, classification accuracy).
b. Your results probably indicate that pruning improved the accuracy greatly for some datasets but
only marginally, if at all, for others. In some cases, pruning might even have hurt the accuracy.
Based on the results, discuss in detail what factor or factors you think influenced the effect of
pruning. Write at least 3 sentences.
c. Weka uses 0.25 as the default confidence value. Do you think it is a good value to use? Explain
why or why not.

5. For this question, you experiment with the other pruning scheme (reduced error pruning), using the
five datasets from the previous question.
Specifics:

The reduced error pruning in J48 has a parameter: 'numFolds'. It specifies the number of subsets into
which the training data is divided: one fold is reserved as a validation set (used only for testing the
effect of pruning a particular subtree), and the remaining folds are used for training/building the tree. By
changing the number of folds, you essentially control the portion of the data used for training -- a
small number of folds makes the validation set larger, thus leaving the training set smaller,
while a large number of folds makes the validation set smaller, thus leaving the training set
larger (although it is still a subset of the original training data).
For each dataset,
Run J48 with 'numFolds' = 2, 5 and 10 (so you'll do a total of 3 runs). Also, set
'reducedErrorPruning' to True and 'minNumObj' to 1. Note that you can leave the 'seed'
as 1. You can ignore the other parameters (because J48 does too). (A scripted sketch of
these runs appears after this list.)
As for the overall evaluation, as in the previous question, do the evaluation by 10-fold cross-validation.

After each run, record the number of folds, the size of the tree and the classification accuracy.
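
As with Question 4, these runs could be scripted instead of done through the Explorer; a minimal sketch,
assuming the same file names as before:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ReducedErrorPruningRuns {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("vote.arff");   // file name is an assumption
            data.setClassIndex(data.numAttributes() - 1);

            // numFolds controls how much of the training data is held out for pruning.
            int[] numFoldsValues = {2, 5, 10};
            for (int n : numFoldsValues) {
                J48 tree = new J48();
                tree.setReducedErrorPruning(true);
                tree.setNumFolds(n);       // folds used internally by reduced error pruning
                tree.setMinNumObj(1);
                tree.setSeed(1);

                // Overall evaluation by 10-fold cross-validation, as in Question 4.
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));

                tree.buildClassifier(data);  // tree size for the model built on all the data

                System.out.printf("numFolds=%d  tree size=%.0f  accuracy=%.2f%%%n",
                        n, tree.measureTreeSize(), eval.pctCorrect());
            }
        }
    }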
To Answer:
Answer the following questions.
a. Show a table which tabulates the values obtained for all runs for each dataset (number of folds, size
of the tree, classification accuracy).
b. Describe your observations on the effect of the size of the training set (i.e., of the number of folds).
Write at least 3 sentences.

c. How did this pruning scheme compare with the pessimistic estimate function? Were there large
differences in the accuracy or tree size between the two schemes? Which pruning scheme "works
better" or "is preferred" in your opinion?

Submission
Type all your answers in an electronic file (in doc, txt, or pdf), and submit the file on COL (under the 'Submit
Homework' and 'HW#1' bin) before 11:59 pm on the due date.
If it's difficult for you to draw figures (trees, in this homework) using software, you can alternatively
hand-draw them on paper, scan the paper, and insert/paste the scanned images into the file. No
matter how you create the figures, make ONE file which contains ALL answers and submit that file.
Be sure to WRITE YOUR NAME at the beginning of the file. As stated on the syllabus, "Assignments
with NO NAME may be penalized by some points."
