Data Mining A Tutorial Based Primer 2nd Roiger Solution Manual

ANSWERS TO CHAPTER 7 EXERCISES
Review Questions

1. Differentiate between the following terms:


a. Validation data is used to choose among several models built with the same training data. Test set
data is used to estimate the accuracy of the selected model.
b. A Type I error rejects a null hypothesis that is true. A Type II error accepts a
null hypothesis that is false.
c. The experimental group receives the treatment being measured, whereas the control group receives
a placebo.
d. The mean squared error is the average of the squared differences between actual and computed
output values. The mean absolute error is the average of the absolute differences between actual
and computed output values.
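
The two error measures in part d can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are hypothetical and do not come from the text.

```python
def mean_squared_error(actual, computed):
    """Average of the squared differences between actual and computed output."""
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)


def mean_absolute_error(actual, computed):
    """Average of the absolute differences between actual and computed output."""
    return sum(abs(a - c) for a, c in zip(actual, computed)) / len(actual)


# Hypothetical values chosen only for illustration.
actual = [1.0, 0.0, 1.0, 1.0]
computed = [0.5, 0.0, 1.0, 0.0]

mse = mean_squared_error(actual, computed)   # (0.25 + 0 + 0 + 1) / 4
mae = mean_absolute_error(actual, computed)  # (0.5 + 0 + 0 + 1) / 4
print(mse, mae)
```

Because the differences are squared, the single large miss (1.0) accounts for most of the mean squared error, while the mean absolute error weights every miss in proportion to its size.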

2. State the type I and type II error for each scenario.


a. The null hypothesis states that it will not snow. A type 1 error is predicting snow when it does not
snow. A type 2 error is predicting no snow when it does snow. A type 2 error is more serious, as
motorists will travel in bad weather. A model that commits fewer type 2 errors is the better choice.
b. The null hypothesis states that a customer will not purchase a television. A type 1 error concludes
that a customer will purchase a television when the customer will not. A type 2 error concludes
that a customer will not purchase a television when the customer will make the purchase. If the
model is used to target candidates for marketing purposes, a model that commits fewer type 2
errors is the better choice.
c. The null hypothesis states that an individual is not a likely telemarketing candidate. A type 1 error
concludes that individuals who will not accept a promotional offering are good telemarketing
candidates. A type 2 error rejects likely telemarketing candidates. A model committing fewer type
2 errors is the better choice.
d. The null hypothesis states that an individual should not have back surgery. A type 1 error results
in an unwarranted back surgery. A type 2 error occurs when an individual needing back surgery
does not get the surgery. A model committing fewer type 1 errors is the better choice, as the model
will recommend fewer unwarranted surgeries.
e. The null hypothesis states that the return is valid. A type 1 error falsely flags a tax return as
fraudulent. A type 2 error accepts a return as valid when it is fraudulent. A model committing
fewer type 2 errors is the better choice.
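
In every scenario above the null hypothesis is the "nothing happens" outcome, so a type 1 error is a false positive and a type 2 error is a false negative. A minimal sketch of counting both from a model's predictions follows. The function name and the example labels are hypothetical; `True` is assumed to mean the null hypothesis is false for that instance (for example, it snowed).

```python
def count_errors(actual, predicted):
    """Return (type_1, type_2) error counts.

    True means the null hypothesis is false for the instance (e.g. it snowed).
    Type 1: the model predicts True but the actual outcome is False.
    Type 2: the model predicts False but the actual outcome is True.
    """
    type_1 = sum(1 for a, p in zip(actual, predicted) if p and not a)
    type_2 = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return type_1, type_2


# Hypothetical outcomes and forecasts for six days.
actual = [True, False, False, True, True, False]
predicted = [True, True, False, False, False, False]

type_1, type_2 = count_errors(actual, predicted)
print(type_1, type_2)  # one false alarm, two missed events
```

Comparing candidate models on the count that matters for the application (type 2 for the snow forecast, type 1 for back surgery) is one way to apply the reasoning in the answers.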

3. A major advantage is that both measures of attribute significance are included in the statistic. A major
disadvantage is that a highly predictive attribute value will appear insignificant if its corresponding
predictability score is low.

4. 99.62%

5. RapidMiner’s T-Test alpha parameter.


a. The value of alpha sets the degree of risk we are willing to take to commit a type 1 error (reject a true null
hypothesis).

b. Because we are dealing with a sample of all possible data, we can never be 100% certain that our
results pertain to the entire instance population. Therefore, with statistical testing, we never prove
anything; we can only provide levels of confidence in the conclusions of our experiments.

6. Write a short description of each.


Sample operator – This provides a random sample of the data. The operator's sample
parameter can be set to absolute (specify the exact number of instances to be included), relative
(the sample is a specified fraction of the total number of instances), or probability (a sample
probability is specified for each instance).
Sample (bootstrap) operator – This is sampling with replacement. The sample parameter can be set
to relative or absolute.
Sample (stratified) operator – This sampling operator ensures that each class within the data is
represented in the sample in the same proportion as in the entire dataset. This is only useful if a class
attribute is present. The sample parameter can be set to relative or absolute.
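
The sampling schemes described above can be sketched with Python's standard `random` module. This is an illustration of the ideas only, not of how RapidMiner implements its operators. The function names and the 300-row dataset are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_absolute(data, n, rng):
    """Exactly n distinct instances, drawn without replacement."""
    return rng.sample(data, n)


def sample_relative(data, fraction, rng):
    """A fixed fraction of the instances, drawn without replacement."""
    return rng.sample(data, round(fraction * len(data)))


def sample_probability(data, p, rng):
    """Each instance is kept independently with probability p, so the size varies."""
    return [row for row in data if rng.random() < p]


def sample_bootstrap(data, n, rng):
    """n draws with replacement, so an instance can appear more than once."""
    return rng.choices(data, k=n)


def sample_stratified(data, fraction, rng, label=lambda row: row[1]):
    """Sample each class separately so class proportions match the full data."""
    by_class = defaultdict(list)
    for row in data:
        by_class[label(row)].append(row)
    sample = []
    for rows in by_class.values():
        sample.extend(rng.sample(rows, round(fraction * len(rows))))
    return sample


# Hypothetical data: 300 (id, class) rows, 180 of class "a" and 120 of class "b".
data = [(i, "a" if i < 180 else "b") for i in range(300)]
rng = random.Random(42)  # fixed seed so the run is repeatable

print(len(sample_absolute(data, 10, rng)))       # always 10
print(len(sample_relative(data, 0.05, rng)))     # always 15
print(len(sample_probability(data, 0.05, rng)))  # varies from run to run
print(len(set(sample_bootstrap(data, 300, rng))))  # distinct rows in a bootstrap sample
print(Counter(c for _, c in sample_stratified(data, 0.10, rng)))
```

The probability option is the only one whose sample size is not fixed in advance, because each instance is accepted or rejected independently.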

Data Mining Questions - RapidMiner


1. Create a process using a dataset of your choice to illustrate the three options available with RapidMiner’s
Sample operator. Attach a note to each operator that clearly describes what your example illustrates. Run
your process and provide screen shots of your output.

• Answers will vary. Here is one solution:


Absolute with balanced data and 10 instances:

Relative with 5% of instances = 15 instances:


Probability with each instance having a probability of 0.05 of selection. Here 20 instances have been
selected.
2. Use the cardiology patient dataset to illustrate the Bootstrap Sampling operator and the Stratified Sampling
operator. Attach a note to each operator that clearly describes what your example illustrates. Run your
process and provide screen shots of your output.

• Answers will vary. Here is one possibility.

Bootstrap

Stratified: 11 healthy, 9 sick


3. Implement the process model described in Section 7.7 but set the maximum depth for the decision trees at
2, 4, and 10.
a. Does the t-test show any significant differences in model performance?
b. Does the ANOVA help confirm the results of the t-tests?
c. Use the mikro value within each performance vector to compute the 95% confidence error (or
accuracy) interval for each model.

• Part a: None of the three t-tests shows a significant difference. The screen showing the t-test results follows:
Part b: The ANOVA confirms the results in Part a (Prob = 0.120). The screen shot of the ANOVA follows:

Part c:
The Mikro scores are to be used as the values for model accuracy. Recall the Mikro score is model
accuracy when the model is applied to all instances.

Model with depth = 2


Mikro = 0.6931
Confidence interval = Upper: 0.7461 Lower: 0.6401
Computed as follows:
Var = 0.6931 (1 – 0.6931) = 0.2127
SE = sqrt(Var / 303) = 0.0265
Upper = Mikro + 2SE = 0.6931 + 0.0530
Lower = Mikro – 2SE = 0.6931 – 0.0530
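
The interval computation above can be sketched as a small Python function. The function name is hypothetical. It follows the same steps as the text: variance of a proportion, standard error for 303 instances, and two standard errors on either side as the approximate 95% interval.

```python
import math


def accuracy_interval(accuracy, n, z=2.0):
    """Approximate confidence interval for a model accuracy.

    accuracy: the Mikro score, n: number of instances,
    z: number of standard errors (the text uses 2 for roughly 95%).
    """
    variance = accuracy * (1.0 - accuracy)
    standard_error = math.sqrt(variance / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Mikro scores reported for the three tree depths in this exercise.
for depth, mikro in [(2, 0.6931), (4, 0.7030), (10, 0.7624)]:
    lower, upper = accuracy_interval(mikro, 303)
    print(f"depth {depth}: lower {lower:.4f}, upper {upper:.4f}")
```

The three intervals overlap, which is consistent with the t-tests finding no significant difference among the models.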
Model with depth = 4
Mikro = 0.7030
Confidence interval = Upper: 0.7555 Lower: 0.6505

Model with depth = 10


Mikro = 0.7624
Confidence interval = Upper: 0.8113 Lower: 0.7135

4. Implement the process model given in Section 7.9 illustrating the Pareto lift chart but change the target
class parameter to loyal. Take a screen shot of the chart and summarize what the chart is telling you.

• The Pareto chart tells us that if we use 0.81 as the confidence cutoff for classifying customers as loyal, we will be
correct 136 out of 156 times. If we use 0.72 as the cutoff value, we will be correct 164 out of 197
times. If we choose 0.18 as the cutoff value, our correctness in identifying loyal customers drops
significantly to 183 out of 456.
• The right vertical axis is used to interpret the four connected points. Consider the point located above
the middle of the 0.72 to 0.81 confidence interval. The position of this point tells us that if
we include the instances down through this confidence interval, we select approximately 90% of all loyal
customers.
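
The counts quoted above can be turned into rates with a short Python sketch. The function name is hypothetical. One assumption is made that the text does not state: the 183 loyal customers found at the 0.18 cutoff are taken to be all loyal customers in the data. Under that assumption, the coverage at the 0.72 cutoff comes out close to the 90% read from the right axis.

```python
def lift_summary(points, total_loyal):
    """For each (cutoff, selected, correct) return (cutoff, precision, coverage).

    precision: fraction of selected customers who are loyal.
    coverage: fraction of all loyal customers captured at this cutoff.
    """
    return [
        (cutoff, correct / selected, correct / total_loyal)
        for cutoff, selected, correct in points
    ]


# (cutoff, customers classified loyal, of which actually loyal), from the answer above.
points = [(0.81, 156, 136), (0.72, 197, 164), (0.18, 456, 183)]
total_loyal = 183  # ASSUMPTION: not stated in the text

for cutoff, precision, coverage in lift_summary(points, total_loyal):
    print(f"cutoff {cutoff:.2f}: precision {precision:.3f}, coverage {coverage:.3f}")
```

Lowering the cutoff from 0.72 to 0.18 adds few loyal customers but more than doubles the number selected, which is why precision falls so sharply.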

5. Implement the process model in section 7.7 using the credit screening dataset.
a. Does the t-test show any significant differences in model performance?
b. Does the ANOVA help confirm the results of the t-tests?
c. Use the mikro value within each performance vector to compute the 95% confidence error (or
accuracy) interval for each model.

• Part a: The accuracy of the decision tree of maximum depth 1 is significantly lower than the accuracy
of the other models. Here is a screen shot of the t-test computations.
Part b: The ANOVA confirms the results in Part a (Prob = 0.000). The screen shot of the ANOVA follows:

Part c:
The Mikro scores are to be used as the values for model accuracy. The Mikro score is model accuracy
when the model is applied to all instances.

Model with depth = 1


Mikro = 0.5551
Confidence interval = Upper: 0.5929 Lower: 0.5173

Model with depth = 5


Mikro = 0.8478
Confidence interval = Upper: 0.8752 Lower: 0.8204
Model with depth = 20


Mikro = 0.8420
Confidence interval = Upper: 0.8698 Lower: 0.8142
