Cegelski - Week 1 Homework
Karen Cegelski
Homework Week 1
CSIS 5420
June 3, 2005
Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios, decide if
a solution would best be addressed with supervised learning, unsupervised
clustering, or database query. As appropriate, state any initial hypothesis you
would like to test. If you decide that supervised learning or unsupervised
clustering is the best answer, list several input attributes you believe to be
relevant for solving the problem.
c. When customers visit my web site, what products are they most
likely to buy together?
Unsupervised clustering would best address this scenario. If
the web site sold clothing, clustering might show that a woman
who wants a blue dress is likely to buy accessories such as shoes
and jewelry at the same time. Some relevant attributes would
be: Customer ID; type of clothing; accessories (shoes, necklace,
earrings); hose. Good
Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and
surgery. Explain how medical doctors use induction to help develop their skills.
Score 10 out of 10
3. (Question #6, page 31) What happens when you try to build a decision tree for
the data in Table 1.1 without employing the attributes Swollen Glands and Fever?
Without the attributes fever and swollen glands, the diagnosis could
be misleading. The diagnosis could be allergy or a cold without factoring in
these two attributes. The attributes sore throat, congestion, and
headache are not important in determining the diagnosis. Good
Let's pick sore throat as the top-level node. The only possible values are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. The no path contains instances
2, 5, 6, 7, and 9. The path for sore throat = yes has representatives from all three
classes, as does sore throat = no.
Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4 (strep
throat).
Next we follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree is unable to further differentiate
them. A similar problem is seen by following headache = no. Therefore, the
path following sore throat = yes is unable to differentiate any of the five instances. The
problem repeats itself for the path sore throat = no. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
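The dead end reached on the headache = yes branch can be checked mechanically. The following is a minimal Python sketch, not the textbook's tree-building algorithm. It uses only the three instances whose attribute values are stated in the walk-through above (1, 8, and 10); the dictionary layout and the helper name `classes_by_path` are assumptions made for illustration.

```python
from collections import defaultdict

# Instances 1, 8, and 10 as described in the walk-through:
# each has sore throat = yes, headache = yes, congestion = yes.
instances = {
    1: {"sore_throat": "yes", "headache": "yes", "congestion": "yes",
        "diagnosis": "strep throat"},
    8: {"sore_throat": "yes", "headache": "yes", "congestion": "yes",
        "diagnosis": "allergy"},
    10: {"sore_throat": "yes", "headache": "yes", "congestion": "yes",
         "diagnosis": "cold"},
}


def classes_by_path(rows, attributes):
    """Group instances by their values on `attributes` (hypothetical helper)
    and collect the set of diagnoses that reach each path."""
    groups = defaultdict(set)
    for row in rows.values():
        path = tuple(row[a] for a in attributes)
        groups[path].add(row["diagnosis"])
    return dict(groups)


paths = classes_by_path(instances, ["sore_throat", "headache", "congestion"])
for path, classes in paths.items():
    status = "pure" if len(classes) == 1 else "mixed"
    print(path, sorted(classes), status)
```

A path that still holds more than one diagnosis after every available attribute has been used cannot be split further, which is the situation the walk-through describes.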
Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop two
alternative models designed to accept or reject home mortgage applications.
Both models show an 85% test set classification correctness. The majority of
errors made by model A are false accepts whereas the majority of errors made
by model B are false rejects. Which model should you choose? Justify your
answer.
Model B should be chosen because its errors are mostly false rejects, so it is
less likely to erroneously offer a home loan to an individual who may be
likely to default. The test set error rate is a useful measure for model
evaluation, but other factors such as costs incurred for false inclusion as
well as losses resulting from false omission must be considered. Good,
but consider this perspective: since a mortgage is secured credit, is
there much risk in false accepts?
Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop two
alternative models designed to decide whether or not to drill for oil. Both models
show an 85% test set classification correctness. The majority of errors made by
model A are false accepts whereas the majority of errors made by model B are
false rejects. Which model should you choose? Justify your answer.
Model A should be chosen because its errors are mostly false accepts, so it is
less likely than Model B to pass over a site that actually contains oil.
The test set error rate is a useful measure for model evaluation, but other
factors such as costs incurred for false inclusion as well as losses
resulting from false omission must be considered. OK, but consider that if the
cost of drilling for oil is very high, Model B is the better choice.
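The point that the better model depends on what each kind of error costs can be made concrete with a small calculation. The following Python sketch is illustrative only: the error counts and the per-error costs are hypothetical values (the source gives only the 85% correctness figure and which error type dominates in each model), and the function names are assumptions.

```python
def expected_cost(false_accepts, false_rejects, cost_fa, cost_fr):
    """Total cost of a model's errors, given a cost per error type."""
    return false_accepts * cost_fa + false_rejects * cost_fr


# Hypothetical test set of 100 cases with 15 errors per model (85% correct).
# Model A errs mostly with false accepts, Model B mostly with false rejects.
model_a = {"false_accepts": 12, "false_rejects": 3}
model_b = {"false_accepts": 3, "false_rejects": 12}


def cheaper_model(cost_fa, cost_fr):
    """Return 'A', 'B', or 'tie' for the given (hypothetical) error costs."""
    cost_a = expected_cost(**model_a, cost_fa=cost_fa, cost_fr=cost_fr)
    cost_b = expected_cost(**model_b, cost_fa=cost_fa, cost_fr=cost_fr)
    if cost_a == cost_b:
        return "tie"
    return "A" if cost_a < cost_b else "B"


# A false accept is expensive (for example, drilling a dry well).
print(cheaper_model(cost_fa=10, cost_fr=1))
# A false reject is expensive (for example, passing over a productive site).
print(cheaper_model(cost_fa=1, cost_fr=10))
```

With equal accuracy, the choice flips as the cost ratio flips, which is why the error rate alone does not settle the question.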
Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used to
evaluate the likely success of a supervised learner model.
If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.
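One way to measure whether clusters "differentiate the values of the output attribute" is cluster purity: the fraction of instances that share the majority output value of their cluster. The sketch below is a minimal Python illustration under that assumption. The cluster assignments and labels are made-up data, and `purity` is a name chosen here rather than one taken from the text.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical output of a clustering run on eight instances,
# alongside the known output attribute for the same instances.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
labels = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]

print(purity(clusters, labels))
```

A purity near 1 suggests the input attributes separate the output classes well, so a supervised model built on them has a reasonable chance of succeeding; a low purity suggests the attributes may not define the classes.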
Score 10 out of 10
7. (Question #97, page 63) Explain how supervised learning can be used to help
evaluate the results of an unsupervised clustering model.
Score 7 out of 10
Computed Decision
86% Good
b. According to the confusion matrix, how many
Democrats are in the Senate? How many
Republicans? How many Independents?
Score 7 out of 10
80 individuals 40 instances
80 instances 10 instances
Score 10 out of 10
Model Y     Computed Accept     Computed Reject
Accept                   45                  55
Reject                1,955               7,945
Score 8 out of 10
Lift = (c11 / Sum(ComputedSend)) / (Sum(Send) / Sum(Total))
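As a worked illustration, the lift formula can be applied to the Model Y confusion matrix shown above. This is a sketch under one stated assumption: the formula is written for a "Send" class, and here it is applied to the "Accept" class of Model Y, treating rows as actual classes and columns as computed classes. The function name `lift` and the nested-dictionary layout are illustrative choices.

```python
# Confusion matrix for Model Y: outer key = actual class, inner key = computed class.
model_y = {
    "accept": {"accept": 45, "reject": 55},
    "reject": {"accept": 1955, "reject": 7945},
}


def lift(matrix, target):
    """Lift for `target`: (c11 / computed target) / (actual target / total)."""
    total = sum(sum(row.values()) for row in matrix.values())
    computed_target = sum(row[target] for row in matrix.values())
    actual_target = sum(matrix[target].values())
    sample_rate = matrix[target][target] / computed_target   # 45 / 2,000
    population_rate = actual_target / total                   # 100 / 10,000
    return sample_rate / population_rate


print(round(lift(model_y, "accept"), 2))
```

Under these assumptions the rate of true accepts among computed accepts is 45/2,000, against a population rate of 100/10,000, giving a lift of about 2.25.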