Homework #1
2. The following data gives the conditions under which an optician might want to prescribe soft contact
lenses, hard contact lenses, or no contact lenses for a patient. Show the decision tree that would be
learned by ID3. The target attribute is 'Contact-lenses'.
Show all your work including the calculations of IG (REQUIRED). Do NOT use any decision-tree
induction tools such as Weka. You may use tools/software for numeric calculation (including Excel), but
NOT those that produce a decision tree.
Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Contact-lenses
young           myope               no           normal          soft
young           myope               yes          reduced         none
young           myope               yes          normal          hard
young           hypermetrope        no           reduced         none
young           hypermetrope        no           normal          soft
young           hypermetrope        yes          reduced         none
pre-presbyopic  myope               no           reduced         none
pre-presbyopic  myope               no           normal          soft
pre-presbyopic  myope               yes          normal          hard
pre-presbyopic  hypermetrope        no           reduced         none
pre-presbyopic  hypermetrope        no           normal          soft
pre-presbyopic  hypermetrope        yes          reduced         none
pre-presbyopic  hypermetrope        yes          normal          none
presbyopic      myope               no           normal          none
presbyopic      myope               yes          reduced         none
presbyopic      myope               yes          normal          hard
presbyopic      hypermetrope        no           reduced         none
presbyopic      hypermetrope        no           normal          soft
presbyopic      hypermetrope        yes          reduced         none
presbyopic      hypermetrope        yes          normal          none
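Since the assignment allows software for numeric calculation (just not decision-tree induction tools), the entropy and information-gain formulas can be checked with a short Python sketch. The example numbers below come from Mitchell's PlayTennis data (9 positive, 5 negative instances; Humidity splits them [3+,4-] / [6+,1-]), not from the contact-lens data above:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """IG(S, A) = Entropy(S) minus the instance-weighted average
    entropy of the subsets produced by splitting on attribute A."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - remainder

# Mitchell's PlayTennis example: S = [9+, 5-], split on Humidity.
print(round(entropy([9, 5]), 3))                      # ~0.940 bits
print(round(info_gain([9, 5], [[3, 4], [6, 1]]), 3))  # ~0.152
```

Applying these two helpers attribute by attribute to the 20 instances above gives every IG value this question asks you to show.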
3. Why is a decision tree that fits the data really well not necessarily better than another that doesn't fit it so
well? [Assume the whole data can fit in the computer memory.]
Write at least 3 sentences.
condor.depaul.edu//hw1.html
4/3/2010
4. Download WEKA and install on your system. Then conduct the following experiment.
Setup:
If your system already has Java 1.5 or newer, the second choice "a self-extracting executable
without the Java VM (weka-3-6-1.exe)" will do.
If you encounter problems with the Weka site, a local ZIP file of the self-extracting
executable is available here (weka-3-6-1.zip, 18 MB).
J48 in Weka:
In this question and the next (4 & 5), you will experiment with the effect of pruning on decision
trees. Weka's 'weka.classifiers.trees.J48' lets you generate pruned as well as unpruned trees. The
J48 classifier provides two methods for pruning a decision tree:
a. By using a "pessimistic estimate" function (described in Mitchell's textbook p. 71, the 9th line
from the bottom "Another method, used by C4.5,.."); and
b. By using a validation set to test if the pruning will improve accuracy -- by 'reducedErrorPruning'
scheme.
[FYI, J48 does NOT convert trees to rules. Both pruning schemes alter a tree after it is fully grown (thus
post-pruning).]
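As a schematic illustration of scheme (b) — this is not Weka's code, just the core test that reduced-error pruning applies at each node, assuming the validation-set error counts for the subtree and for a candidate replacement leaf are already known:

```python
def should_prune(subtree_val_errors, leaf_val_errors):
    """Reduced-error pruning: replace a subtree with a single leaf whenever
    the leaf misclassifies no more validation-set instances than the subtree
    does (ties favor the simpler tree)."""
    return leaf_val_errors <= subtree_val_errors

print(should_prune(subtree_val_errors=7, leaf_val_errors=5))  # True -> prune
print(should_prune(subtree_val_errors=3, leaf_val_errors=6))  # False -> keep
```

Scheme (a) differs in that it needs no validation data: it inflates the training-set error of each node with a statistical "pessimistic" correction and prunes when the corrected estimate does not get worse.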
Specifics:
The purpose of the experiment is to derive a (sub-)optimal confidence factor value. To do so, we try
various confidence factor values on several datasets. The datasets are as follows; a zip file that
includes all of them is also available.
    (file name not given)            16 attributes (nominal), 2 classes, 435 instances
    (file name not given)            9 attributes (nominal), 2 classes, 958 instances
    breast-cancer.arff (30 kb)       9 attributes (nominal), 2 classes, 286 instances
    (*) halloffame.arff (140 kb)     17 attributes (nominal/numeric mixed), 3 classes, 1338 instances
                                     Records of baseball players inducted to the Baseball Hall of Fame
NOTE: (*) When you run the Hall of Fame data, remove the 'Player' attribute (the first
attribute). To do so, after you open the file (in the "Preprocess" step), select the attribute
and hit "Remove".
After each run, record the confidence factor, the size of the tree and the classification
accuracy.
Do the same procedure for all datasets.
To Answer:
Answer the following questions. In addition to running the experiments, I strongly recommend you read
the description of each dataset (written at the top of each file) in order to learn its domain.
a. Show a table which tabulates the values obtained for all runs for each dataset (confidence factor,
size of the tree, classification accuracy).
b. Your results probably indicated that pruning improved the accuracy greatly for some datasets
but only marginally, if at all, for others; in some cases, pruning might even have hurt the accuracy.
Based on the results, discuss in detail what factor or factors you think influenced the effect of
pruning. Write at least 3 sentences.
c. Weka uses 0.25 as the default confidence value. Do you think it is a good value to use? Explain
why or why not.
5. For this question, you will experiment with the other pruning scheme (reduced error pruning), using
the five datasets from the previous question.
Specifics:
The reduced error pruning in J48 has a parameter: 'numFolds'. It specifies the number of subsets into
which the training data is divided -- one fold is reserved as a validation set (used only for testing the
effect of pruning a particular subtree), and the remaining folds are used for training/building the tree. By
changing the number of folds, you essentially control the portion of the data used for training -- a
small number of folds makes the validation set larger, thus leaving the training set smaller,
while a large number of folds makes the validation set smaller, thus leaving the training set
larger (although it is still a subset of the original training data).
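The arithmetic behind this trade-off can be sketched as follows; the counts are illustrative (shown for the 435-instance dataset above), and Weka's actual fold boundaries may differ by an instance or two due to rounding:

```python
# With reducedErrorPruning and numFolds = n, one fold (~1/n of the data)
# becomes the validation set used for pruning decisions, and the other
# n - 1 folds grow the tree.
n_instances = 435
for n in (2, 5, 10):
    n_validation = n_instances // n          # one fold held out for pruning
    n_training = n_instances - n_validation  # remaining folds grow the tree
    print(f"numFolds={n:2}: validation ~{n_validation}, training ~{n_training}")
```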
For each dataset,
Run J48 with 'numFolds' = 2, 5 and 10 (so you'll do a total of 3 runs per dataset). Also, set the
'reducedErrorPruning' to True and the 'minNumObj' to 1. Note that you can leave the 'seed'
as 1. You can ignore the other parameters (because J48 does too).
As for the overall evaluation, as with the previous question, do the evaluation by 10-fold cross-validation.
After each run, record the number of folds, the size of the tree and the classification accuracy.
To Answer:
Answer the following questions.
a. Show a table which tabulates the values obtained for all runs for each dataset (number of folds, size
of the tree, classification accuracy).
b. Describe your observation on the effect of the size of the training set (i.e., the number of folds).
Write at least 3 sentences.
c. How did this pruning scheme compare with the pessimistic estimate function? Were there large
differences in accuracy or tree size between the two schemes? Which pruning scheme "works
better" or "is preferred", in your opinion?
Submission
Type all your answers in an electronic file (in doc, txt, or pdf), and submit the file on COL (under 'Submit
Homework' and 'HW#1' bin) before 11:59 pm on the due date.
If it's difficult for you to draw figures (trees, in this homework) using software, you can alternatively
hand-draw them on paper, scan the paper, and insert/paste the scanned image into the file. No
matter how you create the figures, make ONE file which contains ALL answers and submit that file.
Be sure to WRITE YOUR NAME at the beginning of the file. As stated on the syllabus, "Assignments
with NO NAME may be penalized by some points."