CHAPTER 17
Data Mining

HOTTEST NEW JOBS: STATISTICS AND MATHEMATICS
Much of this book, as the title implies, is about data analysis. The term data
analysis has long been synonymous with the term statistics, but in today’s
world, with massive amounts of data available in business and many other
fields such as health and science, data analysis goes beyond the more narrowly
focused area of traditional statistics. But regardless of what it is called, data anal-
ysis is currently a hot topic and promises to get even hotter in the future. The
data analysis skills you learn here, and possibly in follow-up quantitative courses,
might just land you a very interesting and lucrative job.
This is exactly the message in a recent New York Times article, “For
Today’s Graduate, Just One Word: Statistics,” by Steve Lohr. (A similar arti-
cle, “Math Will Rock Your World,” by Stephen Baker, was the cover story
for Business Week. Both articles are available online by searching for their
titles.) The statistics article begins by chronicling a Harvard anthropology and
archaeology graduate, Carrie Grimes, who began her career by mapping the
locations of Mayan artifacts in places like Honduras. As she states, “People
think of field archaeology as Indiana Jones, but much of what you really do
is data analysis.” Since then, Grimes has leveraged her data analysis skills to
get a job with Google, where she and many other people with a quantitative
background are analyzing huge amounts of data to improve the company’s
search engine. As the chief economist at Google, Hal Varian, states, “I keep
saying that the sexy job in the next 10 years will be statisticians. And I’m not
kidding.” The salaries for statisticians with doctoral degrees currently start at
$125,000, and they will probably continue to increase. (The math article indi-
cates that mathematicians are also in great demand.)
Why is this trend occurring? The reason is the explosion of digital
data—data from sensor signals, surveillance tapes, Web clicks, bar scans,
public records, financial transactions, and more. In years past, statisticians typically ana-
lyzed relatively small data sets, such as opinion polls with about 1000 responses. Today’s
massive data sets require new statistical methods, new computer software, and most
importantly for you, more young people trained in these methods and the correspond-
ing software. Several particular areas mentioned in the articles include (1) improving
Internet search and online advertising, (2) unraveling gene sequencing informa-
tion for cancer research, (3) analyzing sensor and location data for optimal handling
of food shipments, and (4) the recent Netflix contest for improving the company’s
recommendation system.
The statistics article mentions three specific organizations in need of data analysts.
The first is government, where there is an increasing need to sift through mounds of
data as a first step toward dealing with long-term economic needs and key policy priori-
ties. The second is IBM, which created a Business Analytics and Optimization Services
group in April 2009. This group will use the more than 200 mathematicians, statisticians,
and data analysts already employed by the company, but IBM intends to retrain or hire
4000 more analysts to meet its needs. The third is Google, which needs more data ana-
lysts to improve its search engine. You may think that today’s search engines are unbeliev-
ably efficient, but Google knows they can be improved. As Ms. Grimes states, “Even an
improvement of a percent or two can be huge, when you do things over the millions and
billions of times we do things at Google.”
Of course, these three organizations are not the only organizations that need to hire
more skilled people to perform data analysis and other analytical procedures. It is a need
faced by all large organizations. Various recent technologies, the most prominent by far
being the Web, have given organizations the ability to gather massive amounts of data easily.
Now they need people to make sense of it all and use it to their competitive advantage. ■
17-1 INTRODUCTION
The types of data analysis discussed throughout this book are crucial to the success of
most companies in today’s data-driven business world. However, the sheer volume of
available data often defies traditional methods of data analysis. Therefore, new methods—
and accompanying software—have been developed under the name of data mining. Data
mining attempts to discover patterns, trends, and relationships among data, especially non-
obvious and unexpected patterns.1 For example, an analysis might discover that people
who purchase skim milk also tend to purchase whole wheat bread, or that cars built on
Mondays before 10 a.m. on production line #5 using parts from supplier ABC have signifi-
cantly more defects than average. This new knowledge can then be used for more effective
management of a business.
The place to start is with a data warehouse. Typically, a data warehouse is a huge
database that is designed specifically to study patterns in data. A data warehouse is not the
same as the databases companies use for their day-to-day operations. A data warehouse
should (1) combine data from multiple sources to discover as many relationships as
possible, (2) contain accurate and consistent data, (3) be structured to enable quick and
accurate responses to a variety of queries, and (4) allow follow-up responses to specific
relevant questions. In short, a data warehouse represents a type of database that is specifi-
cally structured to enable data mining. Another term you might hear is data mart. A data
mart is essentially a scaled-down data warehouse, or part of an overall data warehouse,
that is structured specifically for one part of an organization, such as sales. Virtually all
large organizations now maintain data warehouses or data marts of this kind.
1 The topics in this chapter are evolving rapidly, as is the terminology. Data mining is sometimes used as a synonym for business analytics or data analytics, although these latter terms are broader and encompass most of the material in this book. Another term gaining popularity is big data. This term is used to indicate the huge data sets often analyzed in data mining.
Not too long ago, data mining was considered a topic only for the experts. In fact,
most people had never heard of data mining. Also, the required software was expensive and
difficult to learn. Fortunately, this is changing. Many people in organizations, not just the
quantitative experts, have access to large amounts of data, and they have to make sense of
it right away, not a year from now. Therefore, they must have some understanding of tech-
niques used in data mining, and they must have software to implement these techniques.
Data mining is a huge topic. A thorough discussion, which would fill a large book or two,
would cover the role of data mining in real business problems, data warehousing, the many
data mining techniques that now exist, and the software packages that have been developed to
implement these techniques. There is not nearly enough room to cover all of this here, so the
goal of this chapter is more modest. We begin with a discussion of powerful tools for exploring
and visualizing data. Not everyone considers these tools to be data mining tools—they are often
considered preliminary steps to “real” data mining—but they are too important not to discuss
here. Next, we discuss classification, one of the most important types of problems tackled by
data mining. Finally, the chapter concludes with a discussion of clustering.
It is not really possible, or at least not as interesting, to discuss data mining without
using software for illustration. There is no attempt here to cover any data mining software
2 Fortunately, Excel now uses the term table in exactly the same way as it has been used in database packages for years. Also, when talking about databases, it is more common to refer to observations (rows) as records and variables (columns) as fields.
3 You will soon see that the Excel Data Model can now play the role of a relational database, that is, a set of related tables, but stored entirely within Excel.
Figure 17.1 Star Schema for Foodmart Database
Most data warehouses are built according to these basic ideas. By structuring cor-
porate databases in this way, facts can easily be broken down by dimensions, and—you
guessed it—the methodology for doing this is pivot tables. However, these pivot tables are
not just the “standard” Excel pivot tables. You might think of them as pivot tables on ster-
oids. The OLAP methodology and corresponding pivot tables have the following features
that distinguish them from standard Excel pivot tables.
■ The OLAP methodology does not belong to Microsoft or any other software company.
It has been developed by many analysts, and it has been implemented in a variety of
software packages. Of course, Microsoft is included in this group. Its OLAP tools are
located in the Analysis Services section of its SQL Server database software.
Figure 17.2 Drilling Down a Hierarchy in the Foodmart Database

Row Labels                      Revenue
− Drink                         $142,578.37
    + Alcoholic Beverages       $41,137.07
    − Beverages                 $80,152.27
        + Carbonated Beverages  $17,754.68
        + Drinks                $17,028.38
        − Hot Beverages         $26,227.46
            + Chocolate         $4,085.95
            + Coffee            $22,141.51
        + Pure Juice Beverages  $19,141.75
    + Dairy                     $21,289.03
+ Food                          $1,187,171.39
+ Non-Consumable                $314,635.84
Grand Total                     $1,644,385.60
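To see how the drill-down works, note that each revenue figure in Figure 17.2 is the sum of the figures one level below it. For example, Hot Beverages revenue is $4,085.95 (Chocolate) + $22,141.51 (Coffee) = $26,227.46, and Drink revenue is $41,137.07 + $80,152.27 + $21,289.03 = $142,578.37, the sum of its three subcategories.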
■ OLAP databases are typically huge, so it can take a while to get the results for a
particular pivot table. For this reason, the data are often “preprocessed” in such a way
that the results for any desired breakdown are already available and can be obtained
immediately. Specifically, the data are preprocessed into files that are referred to as
OLAP cubes. (The analogy is to a Rubik’s cube, where each little sub-cube contains
the result of a particular breakdown.) In Excel 2003, Microsoft let you build your own
OLAP cubes, but this feature was removed in subsequent versions of Excel. Now you
need Analysis Services in SQL Server (or some other company’s software) to build
cubes. We don’t assume that you have access to such software, so we won’t pursue
this approach here. Instead, we will discuss similar tools that are available in Excel in
the next subsection.
Figure 17.4 Power Pivot Ribbon in Excel 2016
The rest of this subsection leads you through a typical example using Power Pivot.
The data set for this example is stored in three separate, currently unrelated, files. The bulk
of the data are in the Access file ContosoSales.accdb. Related data are in the Excel file
Stores.xlsx and the comma-delimited text file Geography.csv. The ContosoSales database
has five related tables, Dates, Sales, Products, ProductSubcategories, and ProductCategories.
Each row in the Sales table is for a sale of some product on some date. The Access tables are
related through primary and foreign keys as indicated in Figure 17.5.
Figure 17.7 Power Pivot Diagram View Window
An easier way is to drag from the GeographyKey field in the Stores table (the “many” side)
to the GeographyKey field in the Geography table (the “one” side) in Diagram View.
Figure 17.9 Diagram View of Completed Relationships
Then do this a second time with the following formula, and rename this column Product
Subcategory. Then the Products table should appear as in Figure 17.11.
=RELATED(ProductSubcategories[ProductSubcategoryName])
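For reference, the first calculated column, renamed Product Category, is created the same way. Assuming the table and field names follow the same pattern, its formula would presumably be

=RELATED(ProductCategories[ProductCategoryName])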
3. As indicated in step 2, you can hide fields, or even entire tables, from users if you
think these fields or tables would never be used in a pivot table. This provides a less
cluttered look when the user eventually creates pivot tables from the Data Model.
To do this, right-click a column or a table tab and select Hide from Client Tools.
Figure 17.10 Total Net Revenue Measure
Try this with the ProductCategories and ProductSubcategories table tabs or with
any other fields not useful for pivot tables. For example, primary and foreign key
fields are good candidates for hiding. They are necessary for relating tables, but they
aren’t likely to be used in pivot tables.
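Although Figure 17.10 shows the finished Total Net Revenue measure, the exact DAX formula depends on the fields in the Sales table. As a hypothetical sketch only (the column name NetRevenue is made up for illustration), a measure of this type is an aggregation formula such as

Total Net Revenue := SUM(Sales[NetRevenue])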
Step 4: Build One or More Pivot Tables from the Data Model
This is the easy step. From the Home ribbon in the Power Pivot window, select the first
item, PivotTable, from the PivotTable dropdown list. (You can experiment with the other
items.) Now you can drag any of the (non-hidden) fields from any of the (non-hidden)
tables to a pivot table in the usual way. In particular, you will see the Total Net Revenue
measure in the Sales table, a candidate for the Values area, and the two calculated columns
in the Products table, candidates for the Rows and Columns areas or for filtering.
Two example pivot tables appear in Figures 17.12 and 17.13. Assuming you have
followed along to this point, you should try to reproduce them.
There are two things to note here. First, if you return to the Power Pivot window and make any
changes, such as hiding more fields or creating more calculated columns, the changes will
be reflected automatically when you return to the pivot table. Second, suppose you forget to
relate tables, such as Stores and Geography. Then depending on the items you drag to the
pivot table, you might get a warning about missing relationships. In this case, you can create
the required relationships as described earlier to make the pivot table calculate correctly.
Arguably, Power Pivot is the most useful member of the “Power” add-ins for Excel
2016. However, you might also want to experiment with Power View for creating quick,
insightful reports, and Power Map for creating insightful 3D maps. You can view the accom-
panying Power View and Power Map videos to get started. You’ll find that it is quite easy.
Figure 17.13 Second Example Pivot Table
17-2d Visualization Software
You can gain a lot of insight by using charts to view your data in imaginative ways. This
trend toward powerful charting software for data visualization is the wave of the future and
will certainly continue. Excel’s built-in tools, including the new Power View and Power
Map add-ins, can be used for visualization. In addition, many other companies are devel-
oping visualization software. To get a glimpse of what is currently possible, you can watch
the accompanying video about a free software package, Tableau Public, developed by
Tableau Software. Perhaps you will find other visualization software packages, free or
otherwise, that rival Tableau or Power View. Alternatively, you might see blogs with data
visualizations from ordinary users. In any case, the purpose of all visualization software is
to portray data graphically so that otherwise hidden trends or patterns can emerge clearly.
The logistic regression model relates p, the probability of being in category 1, to the explanatory variables x1 through xk by the equation

ln(p/(1 − p)) = b0 + b1x1 + b2x2 + ... + bkxk

where the ratio p/(1 − p) is called the odds ratio.
The odds ratio is a term frequently used in everyday language. Suppose, for example,
that the probability p of a company going bankrupt is 0.25. Then the odds that the company
will go bankrupt are p/(1 − p) = 0.25/0.75 = 1/3, or “1 to 3.” Odds ratios are probably
most common in sports. If you read that the odds against Duke winning the NCAA basket-
ball championship are 4 to 1, this means that the probability of Duke winning the champi-
onship is 1/5. Or if you read that the odds against Purdue winning the championship are 99
to 1, then the probability that Purdue will win is only 1/100.
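You can verify these conversions with simple Excel formulas:

=0.25/(1-0.25)   returns 0.3333, the “1 to 3” odds of going bankrupt
=1/(4+1)         returns 0.20, the probability when the odds against are 4 to 1
=1/(99+1)        returns 0.01, the probability when the odds against are 99 to 1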
The logarithm of the odds ratio, the quantity on the left side of the above equation, is
called the logit (or log odds). Therefore, the logistic regression model states that the logit
is a linear function of the explanatory variables. Although this is probably a bit mysterious
and there is no easy way to justify it intuitively, logistic regression has produced useful
results in many applications.
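Solving the logit equation algebraically for p shows why the model always produces probabilities between 0 and 1. If L denotes the logit, the linear function of the explanatory variables, then

p = e^L/(1 + e^L) = 1/(1 + e^(−L))

which is the S-shaped logistic function of L.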
The numerical algorithm used to estimate the regression coefficients is complex, but
the important goal for our purposes is to interpret the regression coefficients correctly.
First, if a coefficient b is positive, then as its X increases, the log odds increases, so the
probability of being in category 1 increases. The opposite is true for a negative b. So just
by looking at the signs of the coefficients, you can see which explanatory variables are
positively correlated with being in category 1 (the positive b’s) and which are positively
correlated with being in group 0 (the negative b’s).
You can also look at the magnitudes of the b’s to try to see which of the X’s are “most
important” in explaining category membership. Unfortunately, you run into the same prob-
lem as in regular regression. Some X’s are typically of completely different magnitudes than
others, which makes comparisons of the b’s difficult. For example, if one X is income, with
values in the thousands, and another X is number of children, with values like 0, 1, and 2, the
coefficient of income will probably be much smaller than the coefficient of children, even
though these two variables can be equally important in explaining category membership. We
won’t say more about the interpretation of the regression coefficients here, but you can find
comments about them in the finished version of the Lasagna Triers Logistic Regression.xlsx file discussed next.
In many situations, especially in data mining, the primary objective of logistic regression
is to “score” members, given their X values. The score for any member is the estimated value
of p, found by plugging into the logistic regression equation to get the logit and then solving
algebraically to get p. (This is typically done automatically by the software package.) Those
members who score highest are the most likely to be in category 1; those who score lowest
are most likely to be in category 0. For example, if category 1 represents the responders to
some direct mail campaign, a company might mail brochures to the top 10% of all scorers.
These scores can also be used to classify members. Here, a cutoff probability is
required. All members who score below the cutoff are classified as category 0, and the rest
are classified as category 1.
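In Excel terms, a sketch of this scoring and classification might look as follows, where the cell references are hypothetical: if a person’s logit has been computed in cell N2, then

=1/(1+EXP(-N2))    computes the score, the estimated probability p
=IF(O2>=0.5,1,0)   classifies the person, assuming the score is in cell O2 and the cutoff is 0.5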
The Lasagna Triers Logistic Regression.xlsx file contains the same data set from Chapter 3
on 856 people who have either tried or not tried a company’s new frozen lasagna product.
The categorical dependent variable, Have Tried, and several of the potential explanatory vari-
ables contain text, as shown in Figure 17.15. Some logistic regression software packages allow
such text variables and implicitly create dummies for them, but StatTools requires all numeric
variables. Therefore, the StatTools Dummy utility was used to create dummy variables for all
text variables. (You could also do this with IF formulas.) Using the numeric variables, including
dummies, how well is logistic regression able to classify the triers and nontriers?
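For example, with the data laid out as in Figure 17.15, where Pay Type is in column E and the data start in row 2, a dummy variable for the Salaried category can be created with a formula like the following, copied down the column:

=IF(E2="Salaried",1,0)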
Figure 17.15 The Lasagna Data (first eight people)

Person  Age  Weight  Income  Pay Type  Car Value  CC Debt  Gender  Live Alone  Dwell Type  Mall Trips  Nbhd  Have Tried
1       48   175     65500   Hourly    2190       3510     Male    No          Home        7           East  No
2       33   202     29100   Hourly    2110       740      Female  No          Condo       4           East  Yes
3       51   188     32200   Salaried  5140       910      Male    No          Condo       1           East  No
4       56   244     19000   Hourly    700        1620     Female  No          Home        3           West  No
5       28   218     81400   Salaried  26620      600      Male    No          Apt         3           West  Yes
6       51   173     73000   Salaried  24520      950      Female  No          Condo       2           East  No
7       44   182     66400   Salaried  10130      3500     Female  Yes         Condo       6           West  Yes
8       29   189     46200   Salaried  10250      2860     Male    No          Condo       5           West  Yes
Objective To use logistic regression to classify users as triers or nontriers, and to inter-
pret the resulting output.
Solution
A StatTools data set already exists (in the unfinished version of the file). It was used to cre-
ate the dummy variables. To run the logistic regression, you select Logistic Regression from
the StatTools Regression and Classification dropdown list. Then you fill out the usual StatTools
dialog box as shown in Figure 17.16. At the top, you see two options: “with no Count Variable”
or “with Count Variable.” The former is appropriate here. (The latter is used only when there is
a count of the 1’s for each joint category, such as males who live alone.) The dependent variable
is the dummy variable Have Tried Yes, and the explanatory variables are the original numeric
variables (Age, Weight, Income, Car Value, CC Debt, and Mall Trips) and the dummy variables
(Pay Type Salaried, Gender Male, Live Alone Yes, Dwell Type Condo, and Dwell Type Home).
As in regular regression, one dummy variable for each categorical variable should be omitted.
The logistic regression output is much like regular regression output. There is a summary
section and a list of coefficients, shown in Figure 17.17. The summary section is analogous
to the ANOVA table in a regression output. The Improvement value indicates how much bet-
ter the logistic regression classification is than a classification with no explanatory variables
at all. The corresponding p-value indicates that this improvement is statistically significant at
any of the usual significance levels, exactly like a small p-value in an ANOVA table.
The coefficient section is also analogous to regular regression output. The Wald value
is like the t-value, and each corresponding p-value indicates whether that variable could
be excluded from the equation. In this case, Income, Car Value, CC Debt, Gender Male,
and the two Dwell Type dummies could possibly be excluded. (You can check that if these
variables are indeed excluded and the logistic regression is run again, very little changes.)

Figure 17.17 Summary and Coefficients in StatTools Logistic Regression Output

Logistic Regression for Have Tried Yes

Summary Measures
Null Deviance    1165.604813
Model Deviance    687.9428839
Improvement       477.6619292
p-Value           <0.0001
(Note in output: This is the output from using all explanatory variables. You can run the logistic regression again, deleting variables with high p-values, but the basic results don’t change substantially.)

Regression Coefficients  Coefficient    Standard Error  Wald Value     p-Value  Lower Limit    Upper Limit    Exp(Coef)
Constant                 -2.540587689   0.909698289     -2.79278055    0.0052   -4.323596336   -0.757579042   0.078820065
Age                      -0.069688555   0.010808445     -6.447602252   <0.0001  -0.090873108   -0.048504003   0.932684254
Weight                   0.007033414    0.003849631     1.82703581     0.0677   -0.000511863   0.014578691    1.007058206
Income                   4.76283E-06    3.77935E-06     1.260222403    0.2076   -2.64471E-06   1.21704E-05    1.000004763
Car Value                -2.66917E-05   2.04171E-05     -1.307318278   0.1911   -6.67092E-05   1.33259E-05    0.999973309
CC Debt                  7.78774E-05    9.14027E-05     0.852024709    0.3942   -0.000101272   0.000257027    1.00007788
Mall Trips               0.687005598    0.059764316     11.49524745    <0.0001  0.569867539    0.804143656    1.987754476
Pay Type Salaried        1.332747327    0.220912727     6.032913283    <0.0001  0.899758382    1.765736273    3.791445433
Gender Male              0.255542473    0.191544317     1.334116706    0.1822   -0.119884388   0.630969333    1.291161851
Live Alone Yes           1.322630127    0.283886309     4.659013441    <0.0001  0.766212962    1.879047292    3.75328001
Dwell Type Condo         -0.080928114   0.275087202     -0.294190764   0.7686   -0.620099029   0.458242801    0.922259987
Dwell Type Home          0.176721835    0.248863714     0.710114914    0.4776   -0.311051044   0.664494713    1.193299112
The signs of the remaining coefficients indicate whether the probability of being a trier
increases or decreases when these variables increase. For example, this probability decreases
as Age increases (a minus sign), and it increases as Weight increases (a plus sign). Again,
however, you have to use caution when interpreting the magnitudes of the coefficients. For
example, the coefficient of Weight is small because Weight has values in the hundreds, and
the coefficient of Live Alone Yes is much larger because this variable is either 0 or 1.
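One way around this problem is the Exp(Coef) column in Figure 17.17, which is the factor by which the estimated odds of being a trier are multiplied when the corresponding variable increases by one unit. For example, you can check the Mall Trips value with

=EXP(0.687005598)   returns 1.987754, the Exp(Coef) value for Mall Trips

so each extra mall trip roughly doubles a person’s estimated odds of being a trier.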
Figure 17.18 Classification Matrix Summary

Classification Matrix
         1      0      Percent Correct
1      422     73      85.25%
0       81    280      77.56%

Summary        Percent
Correct        82.01%
Base           57.83%
Improvement    57.34%
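The summary percentages follow directly from the counts in the classification matrix. Of the 856 people, 422 + 280 = 702 are classified correctly, so Correct is 702/856, or 82.01%. Base is 495/856, or 57.83%, the percentage that would be classified correctly by simply calling everyone a trier (the majority category). Assuming Improvement measures how much of the possible gain over the base rate is actually achieved, it can be checked with

=(0.8201-0.5783)/(1-0.5783)   returns 0.5734, or 57.34%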
The last part of the logistic regression output, a small part of which is shown in
Figure 17.19, lists all of the original data and the scores discussed earlier. For example, the
first person’s score is 75.28%. This is the probability estimated from the logistic regression
equation that this person is a trier. Because it is greater than 0.5, this person is classified
as a trier. However, this is one of the relatively few misclassifications. The first person is
actually a nontrier. In the same way, explanatory values for new people, those whose trier
status is unknown, could be fed into the logistic regression equation to score them.
Figure 17.19 Scores for the First Few People

Analysis                  Original
Probability    Class      Class
75.28%         1          0
35.15%         0          1
7.65%          0          0
9.18%          0          0
60.22%         1          1
7.69%          0          0
Before leaving this subsection, you have probably noticed that StatTools includes
another classification procedure called discriminant analysis. This is a classical technique
developed decades ago that is still in use. It is somewhat similar to logistic regression and
has the same basic goals. However, it is not as prominent in data mining discussions as
logistic regression. Therefore, discriminant analysis is not discussed here.
Figure 17.20 Neural Net with Two Inputs and Two Hidden Nodes
(The diagram shows two input nodes connected to two hidden nodes, which in turn feed an output node, with weights such as W10, W11, and W12 labeling the arrows; there could also be a few additional “bias” arrows.)
Logistic regression provided reasonably accurate classifications for the lasagna triers
data set. Can a neural net, as implemented in Palisade’s NeuralTools add-in, provide
comparable results?
Objective To learn how the NeuralTools add-in works, and to compare its results to
those from logistic regression.
Solution
The data for this version of the example are in the file Lasagna Triers NeuralTools.xlsx.
There are two differences from the file used for logistic regression. First, no dummy vari-
ables are necessary. The NeuralTools add-in is capable of dealing directly with text vari-
ables. Second, there is a Prediction Data sheet with a second data set of size 250 to be used
for prediction. Its values of the dependent Have Tried variable are unknown.
You launch NeuralTools just like StatTools, @RISK, or any of the other Palisade add-
ins. This produces a NeuralTools tab and ribbon, as shown in Figure 17.21. As you can see,
NeuralTools uses a Data Set Manager, just like StatTools. The only difference is that when
you specify the data set, you must indicate the role of each variable in the neural net. The
possible roles are Independent Numeric, Independent Categorical, Dependent Numeric,
Dependent Categorical, Tag, and Unused. Except for Tag, which isn’t used here, these
have the obvious meanings. So the first step is to create two data sets, one for each sheet,
with Have Tried as Dependent Categorical, Person as Unused, and the other variables as
Independent Numeric or Independent Categorical as appropriate. (NeuralTools usually
guesses the roles correctly.) We call these data sets Train/Test Data and Prediction Data,
respectively.
Figure 17.21 NeuralTools Ribbon
Figure 17.22 Train Tab of Training Dialog Box
The Net Configuration tab shown in Figure 17.23 lets you select one of three options
for the training algorithm. The PN/GRN (probabilistic neural net) algorithm is relatively
new. It is fairly quick and it usually gives good results, so it is a good option to try, as is
done here.6 The MLF option (multi-layer feedforward) algorithm is more traditional, but
it is considerably slower. The Best Net Search tries both PN/GRN and various versions of
MLF to see which is best, but it is quite slow.
The Runtime tab (not shown here) specifies stopping conditions for the algorithm. You
can accept the defaults, and you can always stop the training prematurely if it doesn’t seem
to be making any improvement.
Once you click Next on any of the tabs, you will see a summary (not shown here) of
the model setup. Then you can click its Train button to start the algorithm. You will see a
progress monitor, and eventually you will see results on a new sheet, the most important
of which are shown in Figure 17.24. (As in other Palisade add-ins, the results are stored
by default in a new workbook. You can change this behavior from the Application Settings
dialog box, available from the Utilities dropdown list.)

6 The abbreviation PN/GRN is a bit confusing. For classification problems, the algorithm is called probabilistic neural net (PNN). However, if the dependent variable is continuous, the same basic algorithm is called generalized regression neural net, which explains the GRN abbreviation.
Figure 17.24 Selected Training Results

Classification Matrix (for training cases)
        No     Yes     Bad(%)
No     260      30     10.3448%
Yes     26     369     6.5823%

Classification Matrix (for testing cases)
        No     Yes     Bad(%)
No      54      17     23.9437%
Yes     16      84     16.0000%
The top part shows classification results for the 80%, or 685, cases used for training.
About 10% of the actual No cases were classified incorrectly, and about 6.5% of the
actual Yes cases were classified incorrectly. The bottom part shows similar results
for the 20%, or 171, cases used for testing. The incorrect percentages, about 24% and
16%, are not as good as for the training set, but this is not unusual. Also, these results are
slightly better than those from logistic regression, where about 18% of the classifications
were incorrect. (Remember, however, that the data set wasn’t partitioned into training
and testing subsets for logistic regression.) Note that these results are from an 80–20
random split of the original data. The results you get from a different random split will
probably be different.
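For intuition, you could mimic this 80–20 random split manually in Excel. A sketch, with hypothetical cell references: enter =RAND() in a spare column, say M2:M857 next to the 856 rows of data, and tag each row with

=IF(RANK(M2,$M$2:$M$857)<=685,"Train","Test")

so that a random 685 rows (80%) form the training set and the remaining 171 form the testing set. NeuralTools performs this partitioning automatically.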
Now that the model has been trained, it can be used to predict the unknown values
of the dependent variable in the Prediction Data set. To do so, you activate the Prediction