
Solution Manual for Essentials of Business Analytics, 1st Edition




Chapter 6
Data Mining

Solutions:

1. We specify # Iterations = 50 and # Starts = 10. We use the default fixed seed of 12345.

a. We see that the size of the clusters varies widely. In particular, three schools (Air Force, Arizona
State, and Stanford) are placed in their own respective cluster. Air Force has a unique combination
of small enrollment, small endowment, relatively geographically remote, and yet a stadium
capacity of about the average over all schools. Arizona State is in its own cluster because it has the
largest enrollment by a fair margin. Stanford is in its own cluster because it has the largest
endowment by a wide margin. The least dense cluster is the 17-school cluster 2, which includes
schools stretching geographically from Hawai’i to Florida, with stadium capacities ranging from
20,000 to 95,000, endowments ranging from $0.08 million to $3.5 million, and enrollments
ranging from 20,000 to 60,000.

b. Cluster sizes range dramatically. A one-team conference or a 27-team conference is not practical.

c. The non-normalized clustering is quite different from the normalized clustering. When variables
are not normalized, the formation of clusters is dominated by the variable on the largest scale,
which in this case is endowment. This can be confirmed by clustering the schools solely on the
basis of endowment and then noting the similarity of the resulting clusters to the clusters based on
all (non-normalized) variables.

To exemplify how the non-normalized clustering differs, consider cluster 2 which contains 17
schools that are diverse in many dimensions. Geographically, this cluster stretches from BYU in
the West, to Florida in the South, to Syracuse in the Northeast. Enrollment in this cluster ranges
from Tulsa with 4,092 students to Florida with 49,589 students (1112% larger than Tulsa).
Stadium capacities vary from Wake Forest’s 31,500 to Tennessee’s 102,455 (225% larger than
Wake Forest). However, with regard to endowment, there is relatively little deviation as it ranges
from Louisville’s $772,157 to Florida’s $1,295,313 (68% larger than Louisville).
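The scale-dominance effect described in part (c) can be sketched with a few lines of Python. The enrollment and endowment figures below are hypothetical (not the chapter's data); the point is only that, before normalization, ranking schools by Euclidean distance reproduces the ranking by endowment alone.

```python
import statistics

# Hypothetical (enrollment, endowment in $1,000s) for four schools; the
# endowment column is on a much larger scale than enrollment.
schools = [(20000, 700000), (21000, 1300000), (60000, 720000), (45000, 705000)]

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def zscore(points):
    # Normalize each column to mean 0, standard deviation 1.
    cols = list(zip(*points))
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(p, mu, sd)) for p in points]

norm = zscore(schools)

# Rank schools 1-3 by distance from school 0, before and after normalizing.
order_raw = sorted(range(1, 4), key=lambda i: euclidean(schools[0], schools[i]))
order_norm = sorted(range(1, 4), key=lambda i: euclidean(norm[0], norm[i]))

# Without normalization the ranking is driven entirely by endowment:
print(order_raw)   # [3, 2, 1]
# After normalization, enrollment differences matter again:
print(order_norm)  # [3, 1, 2]
```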

2.

a.


b. Cluster 3 has the largest average football stadium capacity (87,616). This cluster is a collection of
15 southern schools whose cluster center indicates the largest average football stadium capacity,
an average endowment over $1 billion, and the largest average enrollment.

c. Cluster 10 is a single school (Stanford) which is extreme due to its $16 billion endowment. Cluster
9 is also a single school (Hawai’i) which is extreme due to its geographical location.

d. Ten clusters appear to be a natural fit for this data: when there are more than 10 clusters, mergers
result in a small marginal increase in distance, but when there are fewer than 10 clusters, mergers
lead to large marginal increases in distance. There is a jump of 5.32 distance units when going
from 10 to 9 clusters.
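The elbow logic in part (d) can be sketched as follows. The merge distances below are hypothetical except that the 6.82 − 1.50 = 5.32 jump mirrors the figure quoted above.

```python
# Hypothetical distances at which k clusters are merged into k - 1 clusters;
# only the 6.82 - 1.50 = 5.32 jump matches the figure quoted in the text.
merge_distance = {12: 1.10, 11: 1.30, 10: 1.50, 9: 6.82, 8: 8.90, 7: 11.20}

# Marginal increase in distance when going from k clusters to k - 1.
jumps = {k: merge_distance[k - 1] - merge_distance[k]
         for k in sorted(merge_distance)[1:]}

# The largest jump suggests a natural number of clusters.
natural_k = max(jumps, key=jumps.get)
print(natural_k, round(jumps[natural_k], 2))  # 10 5.32
```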


3. We specify # Iterations = 50 and # Starts = 10. We use the default fixed seed of 12345.

a. Single linkage results in clusters with extreme sizes. There are four single-school clusters (Boise
State, Hawai’i, Minnesota, and New Mexico); of these, only Hawai’i is extremely geographically
remote. There is one 98-school cluster that stretches from Illinois in the central U.S. to
Massachusetts in the northeast, to Miami in the southeast, and to San Antonio in the southwest.
Schools are relatively dense in this region so that the nearest school is not far away. The other
clusters also have the characteristic that each school in the cluster is close to at least one other
school in the same cluster. These other clusters are limited in size due to the collective remoteness
of the schools from others. That is, the schools are close to each other, but not to any others.


b. Average linkage results in two clusters that have a single school (Hawai’i, which is clearly remote,
and Minnesota, which is not as remote but has no schools in close proximity to the north or west
and only a few close schools to the east and south). Many of the other average linkage clusters are
closely related to the single linkage clusters. For example, Cluster 1 in the average linkage is the
merger of Clusters 1 and 5 from the single linkage. Similarly, Cluster 4 in the average linkage is
the merger of Clusters 3, 6, and 10 from the single linkage. Clusters 2, 3, 5, and 6 in the average
linkage are a geographical decomposition of the mega-cluster 2 from the single linkage.


c. Ward’s method results in clusters very similar to the average linkage clusters. It results in only one
cluster with a single school (Hawai’i). There are two clusters different from the average linkage.
Cluster 1 from Ward’s method combines average linkage’s Cluster 1 and the five north-most
schools from average linkage’s Cluster 4. Cluster 8 from Ward’s method combines the 11 most
northwest schools from average linkage’s Cluster 2 and adds Nebraska and Minnesota.


d. Five of the complete linkage clusters are identical to clusters from Ward’s method. Cluster 1 and
Cluster 7 in the complete linkage result from splitting Ward’s Cluster 1. The other differences
result from slight repartitioning of other schools. Cluster 3 in the complete linkage is a southern
sub-cluster of the single linkage’s mega-cluster 2. Cluster 5 in the complete linkage is composed
of the 20 most northeast of Ward’s Cluster 3 of southern states. Cluster 7 in the complete linkage
is the same cluster of northern California and western Nevada schools as single linkage’s Cluster
6. Cluster 8 in the complete linkage combines Ward’s Cluster 8 of central states with Kansas,
Missouri, and Kansas State.


e. Average group linkage results in clusters similar to average linkage, Ward’s method, and complete
linkage. Key (unique) differences involve the placement of Boise State, Virginia, New Mexico,
UNLV, Bowling Green, and Toledo.


4. 11th – Missouri, 12th – Kentucky, 13th – Pittsburgh, 14th – North Carolina, 15th – Nebraska, 16th –
Virginia. The figure below shows the stages in which other schools or clusters of schools have been
combined with the original Big 10 schools.


5.

a. Western schools separated into “small” and “big” schools with respect to stadium capacity,
endowment, and enrollment.

b. Eastern schools separated into a cluster of 22 schools with small stadium capacity, small
endowments, and small enrollments. The other two clusters differ primarily in terms of stadium
capacity and endowment, but have similar enrollments.

c. Separating Southern schools into four clusters results in disparate clusters. There is a three-school
cluster with high endowments and low enrollments; a 17-school cluster distinguished by large
enrollments and large stadiums; a nine-school cluster with small stadiums, small endowments, and
relatively large enrollments; and a 24-school cluster with moderate stadium capacities, moderate
endowments, and relatively low enrollments.

d. The sizes of the clusters vary substantially and do not correspond to feasible conferences.
Developing clusters with size constraints is a very difficult optimization problem. As a method to
address this, for each region, clusters could be aggregated until reaching a maximum size limit.
Upon reaching the maximum size, a cluster would be removed from the data and then the rest of
the data would be considered, repeating the aggregation of clusters until a cluster met the
maximum size limit again (prompting its removal and repetition of the process).

6. For each cluster, comparing the average distance in the cluster to the average distance between cluster
centers provides a measure of cluster strength. As displayed below there is no dramatic difference between
any of the clusterings, except for the clustering with three clusters which does not seem appropriate. If two
clusters provide enough differentiation for the bonus allocation, this seems to be acceptable from a cluster
perspective. If more clusters are required, four clusters would be a reasonable option. If the bonus
allocation was refined, the seven cluster approach results in a very nice decomposition.

For two clusters, the ratio of inter-cluster distance to intra-cluster distance was 2.00 / 1.73 = 1.16 for
one cluster and 2.00 / 0.29 = 6.90 for the other. This suggests that the members of each cluster are more
similar to each other than they are to members of the other cluster.


For three clusters, Cluster 2 and 3 are distinct from the others. However, the ratio of distance between
Cluster 1 and Cluster 3 and the average distance in Cluster 1 is 1.51 / 2.10 = 0.72. This suggests that some
Cluster 1 members are more dissimilar to each other than they are to members of Cluster 3.

For four clusters, Cluster 3 is the least dense and also is the least distinct from Cluster 2. The ratio of
distance between Cluster 2 and Cluster 3 and the average distance in Cluster 3 is 1.52 / 1.34 = 1.13. Cluster
1 is the second-least dense, but is quite distinct from the other clusters.


For five clusters, Cluster 1 is the least dense cluster. The minimum ratio of distance between cluster centers
and the average distance in a cluster is 1.52 / 1.34 = 1.13, comparing the relative difference between
Cluster 1 and Cluster 2 to the density of Cluster 1. Cluster 4 is the second-least dense cluster, but is quite
distinct from the other clusters.

For six clusters, Cluster 5 is the least dense, but is quite distinct from the other clusters. Cluster 4 is the
second-least dense and its member observations bear some similarity to Cluster 3. Comparing the relative
difference between Cluster 3 and Cluster 4 to the density of Cluster 4, we observe the ratio 1.177 / 1.171 =
1.01, which is the minimum ratio of distance between cluster centers and the average distance in a cluster.
On average, Cluster 4 members are approximately as similar to each other as Cluster 3 members.


For seven clusters, Cluster 6 is the least dense cluster and the minimum ratio of distance between cluster
centers and the average distance in a cluster is 1.18 / 0.90 = 1.31, comparing the relative difference between
Cluster 5 and Cluster 6 to the density of Cluster 6. Considering the large number of clusters, this is a very
distinct clustering.


7.

a. Antecedent: RetinaDisplay, Stand; Consequent: Speakers

b. If an iPad with retina display is purchased with a stand, then speakers are also purchased.

c. The support count of this item set is 204, which means that these three items were purchased
together 204 times.

d. The confidence of this rule is 62.77%, which means that of the 325 times that the retina display
and a stand were purchased together, speakers were also purchased 204 times.

e. The lift ratio = 2.27, which means that a customer who has purchased a retina display and a stand
is 127% more likely than a randomly selected customer to buy speakers.

f. The top 15 rules have many features in common that allow Apple to focus on these collections.
For instance, an iPad with retina display is often purchased along with other accessories, including
a stand, speakers, cellular service, and a case. An iPad with retina display often is paired with a
memory upgrade. A memory upgrade to 32 GB is commonly associated with speakers and/or
retina display. Commonly associated accessories include stand and speakers or case and speakers.
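The measures in parts (c) through (e) are related by simple arithmetic. The sketch below recomputes the rule's confidence from the quoted support counts and backs out the implied overall probability of buying speakers from the quoted lift ratio.

```python
# Support counts quoted in parts (c) and (d).
support_antecedent = 325   # transactions with RetinaDisplay and Stand
support_rule = 204         # of those, transactions also containing Speakers

# Confidence = P(consequent | antecedent).
confidence = support_rule / support_antecedent
print(round(confidence * 100, 2))   # 62.77

# Lift = confidence / P(consequent); with the quoted lift of 2.27 this
# implies the overall probability that a transaction contains speakers.
lift = 2.27
p_speakers = confidence / lift
print(round(p_speakers, 3))         # 0.277
```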

8. Facebook, Twitter, and YouTube are web sites often visited together. These are social media-related web
sites.

SheKnows and TheEveryGirl are web sites often visited during the same browser session. These web sites are
oriented towards women.

If an individual visits BuzzFeed, they are 521% more likely than a randomly selected individual to visit
HuffingtonPost.

If an individual visits Deadspin, they are 481% more likely than a randomly selected individual to visit
NBA.com.

If an individual visits Pinterest, they are 473% more likely than a randomly selected individual to visit
Amazon.com.

If an individual visits BleacherReport, they are 469% more likely than a randomly selected individual to
visit NFL.com.

If an individual visits CNN.com, they are 448% more likely than a randomly selected individual to visit
WeatherChannel.

9.

a. 581 rules have a support count of at least 100 and confidence of 50%.

b. 10 rules have a support count of at least 250 and confidence of 50%. Increasing the minimum
support required removes spurious rules resulting from coincidence. The risk of raising the
minimum support required is that we cull out meaningful rules that involve uncommon items.


c. Antecedent: hummus; Consequent: peroni. If a customer purchases hummus, then they also
purchase peroni.

d. The support count of this rule is 252, meaning that hummus and peroni have been purchased
together 252 times.

e. The confidence of this rule is 82.62% which means that of the 305 times that hummus was
purchased, peroni was also purchased 252 times.

f. The lift ratio of this rule is 1.38, which means that a customer purchasing hummus is 38% more
likely than a randomly selected customer to purchase peroni.

g. Customers who enjoy snacking on hummus apparently also drink more peroni than do other
customers. Perhaps the two products could be displayed near each other or tied together in a
promotion or couponing campaign.

10.

a. The overall error rate for the training set is computed by comparing the true classification of each
observation in the training set to the classification of the majority of the k most similar
observations from the training set. Thus, for an observation (from the training set) and k = 1, the
most similar observation in the training set is the observation itself, which leads to a correct
classification.

The overall error rate for the validation set is computed by comparing the true classification of
each observation in the validation set to the classification of the majority of the k most similar
observations from the training set. Thus, for k = 1, an observation (from the validation set) may
not have the same classification as the most similar observation in the training set, thereby leading to
a misclassification.

b. The overall error rate is minimized at k = 5. The overall error rate is the lowest on the training data
since a training set observation’s set of k nearest neighbors will always include itself, artificially
lowering the error rate.


For k = 5, the overall error rate on the validation data is biased since this overall error rate is the
lowest error rate over all values of k. Thus, applying k = 5 on the test data will typically result in a
larger (and more representative) overall error rate since we are not using the test data to find the
best value of k.

c. The first decile lift is 2.36. For this test data set of 2000 observations and 838 actual undecided
voters, if we randomly selected 200 voters, on average 83.8 of them would be undecided.
However, if we use k-NN with k = 5 to identify the top 200 voters most likely to be undecided,
then (83.8)(2.36) = 198 of them would be undecided.
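The arithmetic behind part (c) is a one-liner; nothing here is assumed beyond the figures quoted above.

```python
# First-decile lift: undecided voters found in the top-scoring 10% of the
# test set versus a random 10% sample.
n_test, n_undecided, lift = 2000, 838, 2.36

decile = n_test // 10                         # 200 voters
expected_random = n_undecided * decile / n_test
expected_model = expected_random * lift

print(expected_random)        # 83.8
print(round(expected_model))  # 198
```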

d. A cutoff value of 0.3 appears to provide a good tradeoff between the class 1 and class 0 error rates.

Cutoff Value Class 1 Error Rate Class 0 Error Rate

0.5 8.26% 4.00%

0.4 4.94% 9.53%

0.3 4.94% 9.53%

0.2 2.72% 26.25%
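The cutoff mechanism behind this table can be sketched as follows, using hypothetical class 1 probabilities (not the voter data).

```python
# A cutoff converts each predicted probability of class 1 into a label.
probs = [0.15, 0.28, 0.33, 0.47, 0.62, 0.81]  # hypothetical predictions

labels_05 = [1 if p >= 0.5 else 0 for p in probs]
labels_03 = [1 if p >= 0.3 else 0 for p in probs]

print(labels_05)  # [0, 0, 0, 0, 1, 1]
print(labels_03)  # [0, 0, 1, 1, 1, 1]
# Lowering the cutoff flags more observations as class 1, which reduces the
# class 1 error rate at the cost of a higher class 0 error rate.
```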


11.

a. The rules of the best pruned tree can be distilled to characterize an undecided voter as

i. 12.5 years < education < 16.5 years, attends church, and income < $174,500

OR

ii. Education > 12.5 years, attends church, female, and income < $174,500

OR

iii. Education > 17 years, attends church, female, and income between $174,500 and
$255,500

OR

iv. 12.5 years < education < 17 years, male, and income < $174,500

b. The full tree has 0% overall error rate on the training data because it continues branching until
every leaf node consists wholly of a single class. As this tree is very likely to be overfit to the
training data, the best pruned tree is formed by iteratively removing branches from the full tree
and applying the progressively smaller trees to the validation data. The best pruned tree is the
smallest tree that achieves an overall error rate within a standard error of the tree with the
minimum overall error rate.


c. For the default cutoff value of 0.5 on the best pruned tree on the test data, the overall error rate is
1.70%, the class 1 error rate is 3.82%, and the class 0 error rate is 0.17%.

d. The first decile lift of the best pruned tree on the test data is 2.37. For this test data set of 2000
observations and 838 actual undecided voters, if we randomly selected 200 voters, on average 83.8
of them would be undecided. However, if we use the best pruned tree to identify the top 200 voters
most likely to be undecided, then (83.8)(2.37) = 199 of them would be undecided.

12.

a. Using Mallow’s Cp statistic to guide the selection, we see that the models using 7 or 8 independent
variables seem to be viable candidates. We will select the full model with 8 variables (9
coefficients including the intercept).

The resulting model is: log odds of being undecided = -3.82 – 0.01*Age + 0.57*HomeOwner +
1.9*Female + 0.19*Married + 0.17*HouseholdSize – 0.006*Income + 0.21*Education –
1.66*Church.
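The fitted equation above can be turned into a probability with the logistic function. The voter profile below is hypothetical, and Income is assumed to be coded in $1,000s (the coding is not restated here).

```python
import math

# Coefficients from the fitted model above.
coef = {"Intercept": -3.82, "Age": -0.01, "HomeOwner": 0.57, "Female": 1.9,
        "Married": 0.19, "HouseholdSize": 0.17, "Income": -0.006,
        "Education": 0.21, "Church": -1.66}

# Hypothetical voter: age 40, homeowner, female, married, household of 3,
# income 50 (assumed $1,000s), 16 years of education, attends church.
x = {"Age": 40, "HomeOwner": 1, "Female": 1, "Married": 1,
     "HouseholdSize": 3, "Income": 50, "Education": 16, "Church": 1}

log_odds = coef["Intercept"] + sum(coef[k] * v for k, v in x.items())
prob_undecided = 1 / (1 + math.exp(-log_odds))  # logistic transform

print(round(log_odds, 2))        # 0.35
print(round(prob_undecided, 3))  # 0.587
```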


b. Increases in HouseholdSize and Education, and being a HomeOwner, Female, or Married,
increase the likelihood of a voter being undecided. Increases in Age and Income, and Church
attendance, decrease the likelihood of a voter being undecided.

c.

d. The first decile lift is 2.37. For this test data set of 2000 observations and 838 actual undecided
voters, if we randomly selected 200 voters, on average 83.8 of them would be undecided.
However, if we use the logistic regression model to identify the top 200 voters most likely to be
undecided, then (83.8)(2.37) = 199 of them would be undecided.

13.

a. Churn observations only make up 14.49% of the data set. By oversampling the churn observations
in the training set, a data mining algorithm can better learn how to classify them.

b. A value of k = 19 minimizes the overall error rate on the validation set.

c. The overall error rate is 14.39%.


d. The class 1 error rate is 31.96% and the class 0 error rate is 11.40% for the test data.

e. Sensitivity = 1 – class 1 error rate = 68.04%. This means that the model correctly identifies
68.04% of the churners in the test data. Specificity = 1 – class 0 error rate = 88.60%. This means
that the model correctly identifies 88.60% of the non-churners in the test data.

f. There were 65 false positives (non-churners classified as churners). There were 31 false negatives
(churners classified as non-churners). Of the observations predicted to be churners, 65 / (65 + 66)
= 49.6% were false positives. Of the observations predicted to be non-churners, 31 / (31 + 505) =
5.8% were false negatives.
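Parts (d) through (f) all derive from one confusion matrix; the counts below come straight from the text (97 actual churners in total).

```python
# Test-set confusion counts from part (f).
tp, fn = 66, 31     # churners classified correctly / as non-churners
fp, tn = 65, 505    # non-churners classified as churners / correctly

class1_error = fn / (tp + fn)       # share of churners missed
class0_error = fp / (fp + tn)       # share of non-churners flagged

print(round(class1_error * 100, 2))       # 31.96
print(round(class0_error * 100, 2))       # 11.4
print(round(tp / (tp + fn) * 100, 2))     # sensitivity: 68.04
print(round(tn / (tn + fp) * 100, 2))     # specificity: 88.6
print(round(fp / (fp + tp) * 100, 1))     # false positives among predicted churners: 49.6
print(round(fn / (fn + tn) * 100, 1))     # false negatives among predicted non-churners: 5.8
```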

g. The first decile lift is 4.38. For this test data set of 667 customers and 97 actual churners, if we
randomly selected 67 customers, on average 9.7 of them would be churners. However, if we use k-
NN with k=19 to identify the top 67 customers most likely to be churners, then (9.7)(4.38) = 42 of
them would be churners.

14.

a. Churn observations only make up 14.49% of the data set. By oversampling the churn observations
in the training set, a data mining algorithm can better learn how to classify them.

b. Churners are categorized by:

i. Daytime minutes < 221.45, fewer than 3.5 customer service calls, and no recent contract
renewal

OR

ii. Daytime minutes < 221.45, more than 3.5 customer service calls, and monthly charge <
$50.15

OR


iii. Daytime minutes < 221.45, more than 3.5 customer service calls, monthly charge >
$50.15, and data usage > 1.63 GB

OR

iv. Daytime minutes > 221.45, monthly charge < $61.80, and no recent contract renewal

OR

v. Daytime minutes > 221.45, $57.65 < monthly charge < $61.80, recent contract renewal,
and roaming minutes > 13.1

OR

vi. Daytime minutes > 221.45, monthly charge > $61.80, and no data plan

OR

vii. Daytime minutes > 221.45, monthly charge > $61.80, data plan, and no recent contract
renewal

c. The full tree has 0% overall error rate on the training data because it continues branching until
every leaf node consists wholly of a single class. As this tree is very likely to be overfit to the
training data, the best pruned tree is formed by iteratively removing branches from the full tree
and applying the progressively smaller tree to the validation data. The best pruned tree is the
smallest tree that achieves an overall error rate within a standard error of the tree with the
minimum overall error rate.

d. The overall error rate is 11.84%. The class 1 error rate is 19.59% and the class 0 error rate is
10.53%.


e. The first decile lift is 4.81. For this test data set of 667 customers and 97 actual churners, if we
randomly selected 67 customers, on average 9.7 of them would be churners. However, if we use the
classification tree to identify the top 67 customers most likely to be churners, then (9.7)(4.81) = 47
of them would be churners.

15.

a. Churn observations only make up 14.49% of the data set. By oversampling the churn observations
in the training set, a data mining algorithm can better learn how to classify them.

b. Using Mallow’s Cp statistic to guide the selection, we see that there are several viable models for
consideration. After comparing several of the candidate models based on classification error on the
validation set, we suggest that the model that includes 5 independent variables is a strong candidate.

The resulting model is: log odds of churning = -3.69 – 2.03*ContractRenewal – 0.93*DataUsage +
0.56*CustServCalls + 0.07*MonthlyCharge + 0.08*RoamMins.

If a customer has recently renewed her/his contract, the customer is less likely to churn. If a
customer uses a lot of data on her/his plan, the customer is less likely to churn. If a customer
makes many customer service calls, the customer is more likely to churn (these calls suggest the
customer is unhappy with the cell service). If a customer’s monthly charge is high, the customer is
more likely to churn. An increase in roaming minutes also increases the likelihood of churning.


c. The overall error rate is 21.74%.

d. The first decile lift is 2.71. For this test data set of 667 customers and 97 actual churners, if we
randomly selected 67 customers, on average 9.7 of them would be churners. However, if we use
the logistic regression model to identify the top 67 customers most likely to be churners, then
(9.7)(2.71) = 26 of them would be churners.

16.

a. The RMSE is 6.04 for k = 1 on the training data. Using k = 1 on the training set means that each
observation’s credit score should be predicted to be the average of the k = 1 most similar
observations with respect to the other variables. At first, one may expect an RMSE of 0 since a
training set observation’s most similar observation is itself, but a positive RMSE will occur if
there are observations in the training set which are identical with respect to all variables except
credit score, because k-NN with k = 1 will use the average credit score of the identical observation
to predict the credit score (thus guaranteeing an error).

b. The value of k = 20 minimizes the RMSE on the validation data.

c. The RMSE = 53.87 on the validation set and RMSE = 54.69 on the test set. This is an encouraging
sign that the RMSE of k = 20 on new data should be in the range of 53 to 55.
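RMSE and average error are computed as below; the (actual, predicted) credit-score pairs are hypothetical, chosen only to show the mechanics.

```python
# Hypothetical (actual, predicted) credit scores.
pairs = [(640, 655), (700, 690), (590, 650), (720, 715), (660, 668)]

errors = [actual - predicted for actual, predicted in pairs]
rmse = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5
avg_error = sum(errors) / len(errors)

print(round(rmse, 1))       # 28.3
# A negative average error indicates systematic over-estimation, as with
# the low-credit-score observations discussed in part (d).
print(round(avg_error, 1))  # -13.6
```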

d. The average error on the test data is -6.45 suggesting a slight bias towards over-estimating the
credit scores in the test data. In particular, it appears that k-NN with k = 20 overestimates the
credit score of individuals with low actual credit scores (<350). Plotting the histogram of credit
score values reveals that there are very few observations with credit scores below 400 (only 32).
The large negative error of k-NN suggests that these few low credit score observations are
relatively dissimilar and the prediction is being based on observations with much higher credit
scores that are more similar with respect to the other variables. Improvement may be possible by
obtaining a data set that is more balanced with respect to the credit scores of the individuals. Also,
using a different set or smaller set of variables on which to construct the k-NN model may result in
improved predictions.

17.

a. The best pruned tree achieves an RMSE of 60.60 on the validation data and 60.80 on the test data.
The RMSE of the best pruned tree on the validation data may be biased to underestimate the
RMSE on new data because the best pruned tree is the smallest tree that results in RMSE within
one standard error of the minimum RMSE on the validation data. It is encouraging that the RMSE
on the validation data and test data are very similar. This suggests that RMSE on new data will
reliably be close to 60.

b. An individual with two or more missed payments has a predicted credit score of 595.5.

An individual with less than two missed payments, less than 9.5 years of continuous employment,
and using less than 26.75% of their credit has a predicted credit score of 687.7.


An individual with less than two missed payments, less than 9.5 years of continuous employment,
and using more than 26.75% of their credit has a predicted credit score of 654.9.

An individual with less than two missed payments, between 9.5 and 13.5 years of continuous
employment has a predicted credit score of 704.5.

An individual with less than two missed payments and more than 13.5 years of continuous
employment has a predicted credit score of 723.8.
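The five leaf rules listed above can be written as a single prediction function; the split values and leaf predictions are exactly those stated in part (b).

```python
# Best pruned regression tree from part (b) expressed as nested rules.
def predicted_credit_score(missed_payments, years_employed, credit_used_pct):
    if missed_payments >= 2:
        return 595.5
    if years_employed < 9.5:
        return 687.7 if credit_used_pct < 26.75 else 654.9
    if years_employed < 13.5:
        return 704.5
    return 723.8

print(predicted_credit_score(3, 5.0, 10.0))    # 595.5
print(predicted_credit_score(0, 4.0, 30.0))    # 654.9
print(predicted_credit_score(1, 20.0, 50.0))   # 723.8
```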

c. Although there are credit scores as small as 300, the lowest possible predicted credit score from
the regression tree is 595.5. Plotting the histogram of credit score values reveals that only 5% of the
observations have credit scores below 550. Improvement may be possible by obtaining a data set
that is more balanced with respect to the credit scores of the individuals so that the regression tree
can develop rules that address the entire range of credit scores.


d. Setting the minimum number of records in a terminal node to 1 results in a best pruned tree with
an RMSE of 57.45 on the test data. This is a 5.2% reduction in RMSE resulting from the best pruned
tree from part a (using 100 as the minimum number of records in a terminal node). To achieve the
RMSE of 57.45, the best pruned tree uses 26 rules. The RMSE of 60.60 in part (a) is achieved
with only 4 rules.

18.

a. Using Mallow’s Cp statistic to guide the selection, we see that one of the models using 2
independent variables is a strong candidate.

The resulting model is: log odds of winning = -8.21 + 0.57*OscarNominations + 1.03*GoldenGlobeWins.

If a movie has more total Oscar nominations across all categories, it is more likely to win the Best
Picture Oscar. If a movie has won more Golden Globe awards, it is more likely to win the Best
Picture Oscar.

b. The overall error rate for the model from part a on the validation data is 15%, or 6 mis-predicted
movies out of 40.

c. By using a cutoff value to classify a movie as a winner or not, in some years the model may not
classify any Best Picture Oscar winner or multiple Best Picture Oscar winners. We know that each
of these scenarios is not possible.

d. For each year, identify the movie with the highest probability of being the winner and make that
the model’s “pick” for winner. Doing this results in identifying three correct winners in the six
years of validation data, i.e., a 50% accuracy rate.
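The per-year rule in part (d) can be sketched as follows; the probabilities are hypothetical (only "The Artist" as the 2011 pick is stated in the text).

```python
# Hypothetical predicted Best Picture win probabilities by year.
predictions = [
    ("2010", "MovieA", 0.31), ("2010", "MovieB", 0.55),
    ("2011", "The Artist", 0.62), ("2011", "MovieC", 0.40),
]

# For each year, pick the movie with the highest predicted probability.
picks = {}
for year, movie, prob in predictions:
    if year not in picks or prob > picks[year][1]:
        picks[year] = (movie, prob)

winners = {year: movie for year, (movie, _) in picks.items()}
print(winners)  # {'2010': 'MovieB', '2011': 'The Artist'}
```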

e. The model predicted that “The Artist” was the movie most likely to win the 2011 Best Picture
Award. It was correct.

19.

a.


i. A value of k = 10 minimizes the RMSE on the pre-crisis validation data.

ii. The RMSE on the validation set is $25,481 and the RMSE on the test data is $20,774.

iii. The average error on the validation set is $4,574 and the average error on the test data is
$3,337. This suggests that the k-NN model tends to under-estimate the price of a home.
This is likely due to the fact that there are very few expensive homes in the pre-crisis data
set so the predicted prices of these homes are vastly under-estimated.


b.

i. A value of k = 8 minimizes the RMSE on the post-crisis validation data.

ii. The RMSE on the validation set is $24,344 and the RMSE on the test data is $23,358.


iii. The average error on the validation set is $3,276 and the average error on the test data is
$2,162. This suggests that the k-NN model tends to under-estimate the price of a home.
This is likely due to the fact that there are very few expensive homes in the post-crisis
data set so the predicted prices of these homes are vastly under-estimated.

c. The average percentage change = -1.9%.

20.

a.

i. There are 973 decision nodes in the full tree and 22 decision nodes in the best pruned tree.


ii. The RMSE on the validation set is $21,265 and the RMSE on the test data is $19,043.

iii. The average error on the validation set is $1,641 and the average error on the test data is
$1,301. There is slight evidence of systematic under-estimation of home price.

iv. The best pruned tree for the pre-crisis data contains decision nodes on BuildingValue,
LandValue, PoorCondition, Age, and AboveSpace.


b.

i. There are 816 decision nodes in the full tree and 19 decision nodes in the best pruned tree.

ii. The RMSE on the validation set is $21,355 and the RMSE on the test data is $21,835.

iii. The average error on the validation set is $816 and the average error on the test data is
$757. The regression tree seems to estimate price of post-crisis homes without much bias.

iv. The best pruned tree for the post-crisis data contains decision nodes on BuildingValue,
LandValue, Deck, and AboveSpace.

c. The average percentage change = -4.1%.


21.

a. Using goodness-of-fit measures such as Mallow’s Cp statistic and adjusted R2, we see that there
are several viable models for consideration. After comparing several of the candidate models
based on prediction error on the validation set, we suggest that the model that includes 8
independent variables is a strong candidate.

i. Price = 4970 + 1.24*LandValue + 0.73*BuildingValue – 9316*Acres + 14.98*AboveSpace +
3.88*Basement – 53.31*Age – 20,953*PoorCondition + 6060*GoodCondition.

ii. The RMSE on the validation set is $16,240 and the RMSE on the test data is $16,593.

iii. The average error on the validation set is $1,064 and the average error on the test data is
$1,325. There is slight evidence of systematic under-estimation of home price.

b. Using goodness-of-fit measures such as Mallow’s Cp statistic and adjusted R2, we see that there
are several viable models for consideration. After comparing several of the candidate models
based on prediction error on the validation set, we suggest that the model that includes 10
independent variables is a strong candidate.

i. Price = -12,070 + 1.17*LandValue + 0.83*BuildingValue – 12,745*Acres + 2911*Baths +
1985*Fireplaces + 3868*Beds + 1018*Rooms + 3656*AC – 4470*PoorCondition +
4266*GoodCondition.


ii. The RMSE on the validation set is $17,962 and the RMSE on the test data is $17,161.

iii. The average error on the validation set is $713 and the average error on the test data is
$516. Estimates of home price using the post-crisis data appear to have no major bias.

c. The average percentage change = -4.3%.


