Cegelski - Week 1 Homework

Total Score 102 out of 110
Karen Cegelski
Homework Week 1
CSIS 5420
June 3, 2005
Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios, decide if
a solution would best be addressed with supervised learning, unsupervised
clustering, or database query. As appropriate, state any initial hypothesis you
would like to test. If you decide that supervised learning or unsupervised
clustering is the best answer, list several input attributes you believe to be
relevant for solving the problem.
a. What characteristics differentiate people who have had back
surgery and have returned to work from those who have had back
surgery and have not returned to their jobs?
I would choose supervised learning to address this problem, as the
model could be built using data instances whose outcome (returned to
work or not) is already known. One hypothesis to test concerns
physician specialty: did patients who were operated on by a
neurosurgeon return to work faster than those whose surgery was
performed by an orthopedic surgeon? Attributes that could be relevant
to the solution are physician, physician specialty, type of
employment, age, and overall health.
Very good
b. A major automotive manufacturer recently initiated a tire recall for
one of their top-selling vehicles. The automotive company blames
the tires for the unusually high accident rate seen with their
top-seller. The company producing the tires claims the high accident
rate only occurs when their tires are on the vehicle in question.
Who is to blame?
To solve this problem, I would do a database query. In querying the
database, I would want to know the make/model of the vehicle,
model year, and type of tires, and I would filter for vehicles that
have been involved in accidents. A correlation between tire type and
the number of accidents would indicate who is responsible.
Good
c. When customers visit my web site, what products are they most
likely to buy together?
Unsupervised clustering would be used to decide this scenario. If
this web site were a clothing site, it might be found that a
woman who buys a blue dress is likely to buy accessories such as
shoes and jewelry at the same time. Some attributes would
be: customer ID, type of clothing, and accessories (shoes, necklace,
earrings, hose).
Good
d. What percent of my employees miss one or more days of work
per month?
A database query would be used to answer this question. I would
ask the database for the employee number, employee name, and length of
employment, and filter for employees who have missed one or more
days in the month.
Good
e. What relationships can I find between an individual's height,
weight, age, and favorite spectator sport?
I would use unsupervised clustering to find the answer to this question.
There are no predefined classes; instead, data instances can be grouped
together based on a similarity scheme defined by the clustering model.
The hypothesis would be that a relationship exists between
individual demographics and favorite spectator sport. It is very possible
that no relationship can be found. Attributes could be:
person name or ID number, height, weight, age, and favorite sport.
Good
Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and
surgery. Explain how medical doctors use induction to help develop their skills.
Induction, or inductive reasoning, is the process of reasoning in which the
conclusion of an argument is very likely to be true based on the given
observations, here a patient's symptoms. A doctor bases a diagnosis on
observations of particular patterns and formulates treatment based on
these recurring patterns. In determining whether a patient who is having
chest pains is having a heart attack or just acid reflux, the physician
will order certain tests that confirm or rule out what the symptoms suggest.
Good
Score 10 out of 10
3. (Question #6, page 31) What happens when you try to build a decision tree for
the data in Table 1.1 without employing the attributes Swollen Glands and Fever?
Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold

Without using the symptoms fever or swollen glands, the diagnosis could
be misleading. The diagnosis could be allergy or a cold without factoring in
these other two attributes. The attributes sore throat, congestion, and
headache are not important in determining the diagnosis.
Good
Let's pick sore throat as the top-level node. The only possible values are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. The no path contains instances
2, 5, 6, 7, and 9. The path for sore throat = yes has representatives from all three
classes, as does sore throat = no.
Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4 (strep
throat).
Next follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree is unable to further differentiate
them. The headache = no branch fares better, since congestion separates instance 3
(congestion = yes, cold) from instance 4 (congestion = no, strep throat), but the
path following sore throat = yes still cannot differentiate instances 1, 8, and 10. The
problem repeats itself for the path sore throat = no, where instances 2 and 9 share
the same attribute values, as do instances 5 and 6. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
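
The walk-through above can be checked mechanically. The following is a minimal sketch, not part of the original answer: it groups the ten instances of Table 1.1 by their sore throat, congestion, and headache values and reports every combination that maps to more than one diagnosis. The names `rows`, `groups`, and `conflicts` are illustrative only.

```python
from collections import defaultdict

# Table 1.1 restricted to the three attributes that remain once
# Swollen Glands and Fever are dropped:
# patient id -> (sore throat, congestion, headache, diagnosis)
rows = {
    1: ("Yes", "Yes", "Yes", "Strep throat"),
    2: ("No", "Yes", "Yes", "Allergy"),
    3: ("Yes", "Yes", "No", "Cold"),
    4: ("Yes", "No", "No", "Strep throat"),
    5: ("No", "Yes", "No", "Cold"),
    6: ("No", "Yes", "No", "Allergy"),
    7: ("No", "No", "No", "Strep throat"),
    8: ("Yes", "Yes", "Yes", "Allergy"),
    9: ("No", "Yes", "Yes", "Cold"),
    10: ("Yes", "Yes", "Yes", "Cold"),
}

# Group patients that are indistinguishable on the three attributes.
groups = defaultdict(list)
for pid, (sore, cong, head, diag) in rows.items():
    groups[(sore, cong, head)].append((pid, diag))

# A group is a conflict when its members carry different diagnoses;
# no decision tree over these attributes can separate them.
conflicts = {
    key: members
    for key, members in groups.items()
    if len({diag for _, diag in members}) > 1
}

for key, members in sorted(conflicts.items()):
    print(key, members)
```

Three attribute combinations are ambiguous, covering patients 1, 8, 10, then 2, 9, then 5, 6, so no tree built on these attributes alone can classify all ten training instances correctly.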
Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop two
alternative models designed to accept or reject home mortgage applications.
Both models show an 85% test set classification correctness. The majority of
errors made by model A are false accepts whereas the majority of errors made
by model B are false rejects. Which model should you choose? Justify your
answer.
Model B should be chosen because most of its errors are false rejects, so it is
less likely to erroneously offer a home loan to an individual who may be
likely to default. The test set error rate is a useful measure for model
evaluation, but other factors such as costs incurred for false inclusion as
well as losses resulting from false omission must be considered.
Good, but consider this perspective: since a mortgage is secured credit, is
there much risk in false accepts?
Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop two
alternative models designed to decide whether or not to drill for oil. Both models
show an 85% test set classification correctness. The majority of errors made by
model A are false accepts whereas the majority of errors made by model B are
false rejects. Which model should you choose? Justify your answer.
Model A should be chosen because most of its errors are false accepts rather
than false rejects, so it is less likely than Model B to reject a site that
would have produced oil. The test set error rate is a useful measure for model
evaluation, but other factors such as costs incurred for false inclusion as well
as losses resulting from false omission must be considered.
OK, but consider that if the cost of drilling for oil is very high, Model B is
the better choice.
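
The point made in questions 4 and 5, that two models with the same accuracy can differ in cost, can be illustrated with a small sketch. All numbers here are hypothetical: the error splits for the two models and the per-error costs are invented for illustration and do not come from the textbook. The helper names `total_error_cost` and `cheaper_model` are likewise assumptions.

```python
# Hypothetical test set of 100 instances; both models make 15 errors (85% correct).
MODEL_A = {"false_accepts": 12, "false_rejects": 3}   # mostly false accepts
MODEL_B = {"false_accepts": 3, "false_rejects": 12}   # mostly false rejects


def total_error_cost(model, cost_false_accept, cost_false_reject):
    """Total cost of a model's errors under the given per-error costs."""
    return (model["false_accepts"] * cost_false_accept
            + model["false_rejects"] * cost_false_reject)


def cheaper_model(cost_false_accept, cost_false_reject):
    """Return 'A', 'B', or 'tie' depending on which model's errors cost less."""
    cost_a = total_error_cost(MODEL_A, cost_false_accept, cost_false_reject)
    cost_b = total_error_cost(MODEL_B, cost_false_accept, cost_false_reject)
    if cost_a == cost_b:
        return "tie"
    return "A" if cost_a < cost_b else "B"


# A missed oil field (false reject) costing far more than a dry well favors A.
print(cheaper_model(cost_false_accept=1, cost_false_reject=10))
# Very expensive drilling (false accept) favors B instead.
print(cheaper_model(cost_false_accept=10, cost_false_reject=1))
```

With these assumed error splits the choice depends only on which error type is more expensive, which is the reasoning behind the grader's comments.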
Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used to
evaluate the likely success of a supervised learner model.
Unsupervised clustering can be used to evaluate the likely success of a
supervised learner model using:
• A confusion matrix, to compute model accuracy by adding the values found
on the main diagonal and dividing this number by the total number of test
set instances.
• Two-class error analysis, to denote false accepts and false rejects.
• Mean absolute error and mean squared error, to evaluate supervised models
having numeric output.
OK, but let me suggest a simpler answer.
In a supervised learner model, we predetermine which attributes will be
used to classify our data and what specific clusters we will accept. In other
words, we assume that a chosen set of attributes will classify our data
under a chosen output attribute.
If our unsupervised learner determines that the same input attributes
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.
Score 10 out of 10
7. (Question #9, page 63) Explain how supervised learning can be used to help
evaluate the results of an unsupervised clustering model.
Supervised learning can be used to help evaluate the results of an unsupervised
clustering model using the following technique:
• Perform an unsupervised clustering. Designate each cluster as a class
and assign each an arbitrary name such as C1, C2, and C3.
• Take a random sample of instances from each of the classes formed by
the instance clustering. Each class should be represented in the random
sample in the same ratio as it is represented in the dataset.
• Construct a supervised learner model with the class name as the output
attribute, using the randomly sampled instances as training data. Use the
remaining instances to test the supervised model for classification
correctness.
Very good
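
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the textbook's procedure: the clustering is taken as already done, the toy two-dimensional points are invented, and a nearest-centroid classifier stands in for whatever supervised learner would actually be used. All function names are hypothetical.

```python
import random


def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def stratified_split(clusters, fraction, rng):
    """Sample the same fraction from every cluster (step 2)."""
    train, test = {}, {}
    for name, points in clusters.items():
        shuffled = points[:]
        rng.shuffle(shuffled)
        k = max(1, round(len(shuffled) * fraction))
        train[name], test[name] = shuffled[:k], shuffled[k:]
    return train, test


def classify(point, centroids):
    """Assign the class whose training centroid is nearest (squared distance)."""
    return min(
        centroids,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(point, centroids[name])),
    )


def classification_correctness(clusters, fraction=0.5, seed=0):
    """Train on the sample, test on the remaining instances (step 3)."""
    rng = random.Random(seed)
    train, test = stratified_split(clusters, fraction, rng)
    centroids = {name: centroid(points) for name, points in train.items()}
    total = sum(len(points) for points in test.values())
    hits = sum(
        classify(p, centroids) == name
        for name, points in test.items()
        for p in points
    )
    return hits / total


# Step 1 is assumed done: these toy clusters are already named C1, C2, C3.
clusters = {
    "C1": [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8)],
    "C2": [(10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5), (10.2, 10.8)],
    "C3": [(0, 10), (1, 10), (0, 11), (1, 11), (0.5, 10.5), (0.2, 10.8)],
}
print(classification_correctness(clusters))
```

High test set correctness suggests the clusters are well separated by the input attributes; low correctness suggests the clustering found little structure.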
Score 7 out of 10
8. (Computational Question #1, page 63) Consider the following three-class
confusion matrix. The matrix shows the classification results of a supervised
model that uses previous voting records to determine the political party affiliation
(Republican, Democrat, or Independent) of members of the United States
Senate.

        Computed Decision
        Rep   Dem   Ind
Rep      42     2     1
Dem       5    40     3
Ind       0     3     4

a. What percent of the instances were correctly
classified?
86%
Good
b. According to the confusion matrix, how many
Democrats are in the Senate? How many
Republicans? How many Independents?
Democrats: 40 (should be 48; add across the row)
Republicans: 42 (should be 45; add across the row)
Independents: 4 (should be 7; add across the row)
There are 100 senators total.
c. How many Republicans were classified as
belonging to the Democratic Party?
2 Republicans were classified as belonging to the
Democratic Party.
Good
d. How many Independents were classified as
Republicans?
0 Independents were classified as Republicans.
Good
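
The four answers to question 8 can be recomputed directly from the matrix. The sketch below assumes, as the grader's corrections do, that rows are actual party and columns are the computed decision; the variable names are illustrative.

```python
labels = ["Rep", "Dem", "Ind"]

# Rows: actual party. Columns: computed decision (Rep, Dem, Ind).
matrix = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]

total = sum(sum(row) for row in matrix)

# a. Correct classifications lie on the main diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# b. Actual membership of each party is the row sum.
actual_counts = {label: sum(row) for label, row in zip(labels, matrix)}

# c. Actual Republicans computed as Democrats: row Rep, column Dem.
rep_as_dem = matrix[labels.index("Rep")][labels.index("Dem")]

# d. Actual Independents computed as Republicans: row Ind, column Rep.
ind_as_rep = matrix[labels.index("Ind")][labels.index("Rep")]

print(accuracy, actual_counts, rep_as_dem, ind_as_rep)
```

The row sums give 45 Republicans, 48 Democrats, and 7 Independents, which is why the diagonal values alone undercount each party.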
Score 7 out of 10
9. (Computational Question #2, page 64) Suppose we have two
classes each with 100 instances. The instances in one class
contain information about individuals who currently have credit card
insurance. The instances in the second class include information
about individuals who have at least one credit card but are without
credit card insurance. Use the following to answer the questions
below:
IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%

a. How many individuals represented by the instances
in the class of credit card insurance holders have life
insurance and make more than $50,000 per year?
80 individuals (should be 40 instances)
b. How many instances representing individuals who
do not have credit card insurance have life insurance
and make more than $50,000 per year?
80 instances (should be 10 instances)
Score 10 out of 10
10. (Computational Question #3, page 64) Consider the confusion
matrices shown below.
a. Compute the lift for Model X.
Lift = 2.00785
Very good
b. Compute the lift for Model Y.
Lift = 2.25
Very good

Model X   Computed Accept   Computed Reject
Accept                 46                54
Reject              2,245             7,655

Model Y   Computed Accept   Computed Reject
Accept                 45                55
Reject              1,955             7,945
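
Both lift values can be reproduced from the matrices. This sketch assumes lift is the accept rate among instances the model selects divided by the accept rate in the whole population; the function name `lift` and its parameter names are illustrative.

```python
def lift(true_accept, false_reject, false_accept, true_reject):
    """Lift from a two-class confusion matrix.

    Arguments are given row by row: the actual-accept row first
    (computed accept, computed reject), then the actual-reject row.
    """
    population = true_accept + false_reject + false_accept + true_reject
    # Fraction of actual accepts among the instances the model accepts.
    p_sample = true_accept / (true_accept + false_accept)
    # Fraction of actual accepts in the whole population.
    p_population = (true_accept + false_reject) / population
    return p_sample / p_population


lift_x = lift(46, 54, 2245, 7655)
lift_y = lift(45, 55, 1955, 7945)
print(round(lift_x, 4), round(lift_y, 4))
```

Model X comes out at roughly 2.0079 and Model Y at 2.25, in line with the answers given.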
Score 8 out of 10
11. (Computational Question #4, page 65) A certain mailing list
consists of P names. Suppose a model has been built to determine
a select group of individuals from the list who will receive a special
flyer. As a second option, the flyer can be sent to all individuals on
the list. Use the notation given in the confusion matrix below to
show that the lift for choosing the model over sending out the flyer
to the entire population can be computed with the equation:

Send Flyer?   Computed Send   Computed Don't Send
Send          C11             C12
Don't Send    C21             C22

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?   Computed Send        Computed Don't Send
Send          c11                  c12                        Sum(Send)
Don't Send    c21                  c22                        Sum(Don't Send)
              Sum(Computed Send)   Sum(Computed Don't Send)   Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )
So Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / (C11 + C12 + C21 + C22) )
and we know that (C11 + C12 + C21 + C22) = P, the total number of names.
Therefore, using substitution:
Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / P )
Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )
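
As a sanity check on the algebra, the ratio form and the closed form can be compared on concrete counts. The sketch below uses exact fractions so the comparison is not affected by rounding. The function names and the sample counts are hypothetical.

```python
from fractions import Fraction


def lift_as_ratio(c11, c12, c21, c22):
    """P(C11 | Sample) / P(C11 | Population), the starting definition."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """(C11 * P) / ((C11 + C12) * (C11 + C21)), the derived equation."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


# Hypothetical confusion matrices, given as (c11, c12, c21, c22).
samples = [(30, 20, 70, 880), (5, 5, 5, 5), (1, 99, 9, 891)]
for counts in samples:
    print(counts, lift_as_ratio(*counts), lift_closed_form(*counts))
```

The two functions agree on every set of counts tried, as the derivation predicts.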
