
CHAPTER 17

Data Mining

HOTTEST NEW JOBS: STATISTICS AND MATHEMATICS

Much of this book, as the title implies, is about data analysis. The term data
analysis has long been synonymous with the term statistics, but in today’s
world, with massive amounts of data available in business and many other
fields such as health and science, data analysis goes beyond the more narrowly
focused area of traditional statistics. But regardless of what it is called, data anal-
ysis is currently a hot topic and promises to get even hotter in the future. The
data analysis skills you learn here, and possibly in follow-up quantitative courses,
might just land you a very interesting and lucrative job.
This is exactly the message in a recent New York Times article, "For Today's
Graduate, Just One Word: Statistics," by Steve Lohr. (A similar arti-
cle, “Math Will Rock Your World,” by Stephen Baker, was the cover story
for Business Week. Both articles are available online by searching for their
titles.) The statistics article begins by chronicling a Harvard anthropology and
archaeology graduate, Carrie Grimes, who began her career by mapping the
locations of Mayan artifacts in places like Honduras. As she states, “People
think of field archaeology as Indiana Jones, but much of what you really do
is data analysis.” Since then, Grimes has leveraged her data analysis skills to
get a job with Google, where she and many other people with a quantitative
background are analyzing huge amounts of data to improve the company’s
search engine. As the chief economist at Google, Hal Varian, states, “I keep
saying that the sexy job in the next 10 years will be statisticians. And I’m not
kidding.” The salaries for statisticians with doctoral degrees currently start at
$125,000, and they will probably continue to increase. (The math article indi-
cates that mathematicians are also in great demand.)
Why is this trend occurring? The reason is the explosion of digital
data—data from sensor signals, surveillance tapes, Web clicks, bar scans,
public records, financial transactions, and more. In years past, statisticians typically ana-
lyzed relatively small data sets, such as opinion polls with about 1000 responses. Today’s
massive data sets require new statistical methods, new computer software, and most
importantly for you, more young people trained in these methods and the correspond-
ing software. Several particular areas mentioned in the articles include (1) improving
Internet search and online advertising, (2) unraveling gene sequencing informa-
tion for cancer research, (3) analyzing sensor and location data for optimal handling
of food shipments, and (4) the recent Netflix contest for improving the company’s
recommendation system.
The statistics article mentions three specific organizations in need of data analysts.
The first is government, where there is an increasing need to sift through mounds of
data as a first step toward dealing with long-term economic needs and key policy priori-
ties. The second is IBM, which created a Business Analytics and Optimization Services
group in April 2009. This group will use the more than 200 mathematicians, statisticians,
and data analysts already employed by the company, but IBM intends to retrain or hire
4000 more analysts to meet its needs. The third is Google, which needs more data ana-
lysts to improve its search engine. You may think that today's search engines are unbeliev-
ably efficient, but Google knows they can be improved. As Ms. Grimes states, “Even an
improvement of a percent or two can be huge, when you do things over the millions and
billions of times we do things at Google.”
Of course, these three organizations are not the only organizations that need to hire
more skilled people to perform data analysis and other analytical procedures. It is a need
faced by all large organizations. Various recent technologies, the most prominent by far
being the Web, have given organizations the ability to gather massive amounts of data easily.
Now they need people to make sense of it all and use it to their competitive advantage. ■

17-1 INTRODUCTION
The types of data analysis discussed throughout this book are crucial to the success of
most companies in today’s data-driven business world. However, the sheer volume of
available data often defies traditional methods of data analysis. Therefore, new methods—
and accompanying software—have been developed under the name of data mining. Data
mining attempts to discover patterns, trends, and relationships among data, especially non-
obvious and unexpected patterns.1 For example, an analysis might discover that people
who purchase skim milk also tend to purchase whole wheat bread, or that cars built on
Mondays before 10 a.m. on production line #5 using parts from supplier ABC have signifi-
cantly more defects than average. This new knowledge can then be used for more effective
management of a business.
The place to start is with a data warehouse. Typically, a data warehouse is a huge
database that is designed specifically to study patterns in data. A data warehouse is not the
same as the databases companies use for their day-to-day operations. A data warehouse
should (1) combine data from multiple sources to discover as many relationships as
possible, (2) contain accurate and consistent data, (3) be structured to enable quick and
accurate responses to a variety of queries, and (4) allow follow-up responses to specific
relevant questions. In short, a data warehouse represents a type of database that is specifi-
cally structured to enable data mining. Another term you might hear is data mart. A data
mart is essentially a scaled-down data warehouse, or part of an overall data warehouse,
that is structured specifically for one part of an organization, such as sales. Virtually all
large organizations, and many smaller ones, have developed data warehouses or data marts
in the past decade or two to enable them to better understand their business—their customers,
their suppliers, and their processes.

1 The topics in this chapter are evolving rapidly, as is the terminology. Data mining is sometimes used as a synonym for business analytics or data analytics, although these latter terms are broader and encompass most of the material in this book. Another term gaining popularity is big data. This term is used to indicate the huge data sets often analyzed in data mining.
Once a data warehouse is in place, analysts can begin to mine the data with a collection
of methodologies and accompanying software. Some of the primary methodologies are
classification analysis, prediction, cluster analysis, market basket analysis, and forecasting.
Each of these is a large topic in itself, but some brief explanations follow.

■ Classification analysis attempts to find variables that are related to a categorical


(often binary) variable. For example, credit card customers can be categorized as
those who pay their balances in a reasonable amount of time and those who don’t.
Classification analysis would attempt to find explanatory variables that help predict
which of these two categories a customer is in. Some variables, such as salary, are
natural candidates for explanatory variables, but an analysis might uncover others
that are less obvious.
■ Prediction is similar to classification analysis, except that it tries to find variables
that help explain a continuous variable, such as credit card balance, rather than a
categorical variable. Regression, the topic of Chapters 10 and 11, is one of the most
popular prediction tools, but there are others not covered in this book.
■ Cluster analysis tries to group observations into clusters so that observations within
a cluster are alike, and observations in different clusters are not alike. For example,
one cluster for an automobile dealer’s customers might be middle-aged men who
are not married, earn over $150,000, and favor high-priced sports cars. Once natural
clusters are found, a company can then tailor its marketing to the individual clusters.
■ Market basket analysis tries to find products that customers purchase together
in the same “market basket.” In a supermarket setting, this knowledge can help a
manager position or price various products in the store. In banking and other settings,
it can help managers to cross-sell (sell a product to a customer already purchasing a
related product) or up-sell (sell a more expensive product than a customer originally
intended to purchase).
■ Forecasting is used to predict values of a time series variable by extrapolating patterns
seen in historical data into the future. (This topic is covered in Chapter 12.) This is clearly
an important problem in all areas of business, including the forecasting of future demand
for products, forecasting future stock prices and commodity prices, and many others.

Not too long ago, data mining was considered a topic only for the experts. In fact,
most people had never heard of data mining. Also, the required software was expensive and
difficult to learn. Fortunately, this is changing. Many people in organizations, not just the
quantitative experts, have access to large amounts of data, and they have to make sense of
it right away, not a year from now. Therefore, they must have some understanding of tech-
niques used in data mining, and they must have software to implement these techniques.
Data mining is a huge topic. A thorough discussion, which would fill a large book or two,
would cover the role of data mining in real business problems, data warehousing, the many
data mining techniques that now exist, and the software packages that have been developed to
implement these techniques. There is not nearly enough room to cover all of this here, so the
goal of this chapter is more modest. We begin with a discussion of powerful tools for exploring
and visualizing data. Not everyone considers these tools to be data mining tools—they are often
considered preliminary steps to “real” data mining—but they are too important not to discuss
here. Next, we discuss classification, one of the most important types of problems tackled by
data mining. Finally, the chapter concludes with a discussion of clustering.
It is not really possible, or at least not as interesting, to discuss data mining without
using software for illustration. There is no attempt here to cover any data mining software
package in detail. Instead, we highlight a few different packages for illustration. In some
cases, you already have the software. For example, the NeuralTools add-in in the Palisade
DecisionTools® Suite, available with this book, can be used to estimate neural nets for
classification. In other cases, we illustrate popular software that can be downloaded for free
from the Web. We also illustrate how some data mining methods can be implemented with
Excel-only formulas. However, you should be aware that there are numerous other software
packages that perform various data mining procedures, and many of them are quite expen-
sive with a fairly steep learning curve. You might end up using one of these in your job, and
you will then have to learn how it works.

17-2 DATA EXPLORATION AND VISUALIZATION


Data mining is a relatively new field—or at least a new term—and not everyone agrees with
its definition. To many people, data mining is a collection of advanced algorithms that can
be used to find useful information and patterns in large data sets. Data mining does indeed
include a number of advanced algorithms, but we broaden its definition in this chapter to
include relatively simple methods for exploring and visualizing data. This section discusses
some of the possibilities. They are basically extensions of methods discussed in Chapters 2
and 3, and the key ideas—tables, pivot tables, and charts—are not new. However, advances
in software now enable you to analyze large data sets quickly and easily.

17-2a Introduction to Relational Databases


First, we present some general concepts about database structure. The Excel data sets we
have discussed so far in this book are often called flat files or, more simply, tables. They are
also called single-table databases, where table is the database term for a rectangular range
of data, with rows corresponding to records and columns corresponding to fields.2 Flat files
are fine for relatively simple database applications, but they are not powerful enough for more
complex applications. These require a relational database, a set of related tables, where each
table is a rectangular arrangement of fields and records, and the tables are related explicitly.
As a simple example, suppose you would like to keep track of information on all of
the books you own. Specifically, you would like to keep track of data on each book (title,
author, copyright date, whether you have read it, when you bought it, and so on), as well
as data on each author (name, birthdate, awards won, number of books written, and so on).
Now suppose you store all of these data in a flat file. Then if you own 10 books by Danielle
Steele, you must fill in the identical personal information on Ms. Steele for each of the
10 records associated with her books. This is not only a waste of time, but it increases the
chance of introducing errors as you enter the same information multiple times.
A better solution is to create a Books table and an Authors table. The Authors table
will include a record for each author, with fields for name, gender, date of birth, and so
on. It will also contain a unique identifier field, probably named AuthorID. For example,
Danielle Steele might have AuthorID 1, John Grisham might have AuthorID 2, and so on.
The values of AuthorID are not important, but they must be unique. The Books table will
have a record for each book, with fields like title, copyright date, genre, and so on. It will
also contain an AuthorID field that lists the same AuthorID as in the Authors table for the
author of this book. It can also contain a unique identifier field, such as ISBN. (For the
purpose of this discussion, we are assuming that each book has a single author. If multiple
authors are allowed, the situation is slightly more complex.)
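If you are curious how this looks outside of Excel or Access, the following minimal sketch, written in Python with the pandas library purely as an illustration (the ISBNs and book titles are made up), stores each author's information only once in an Authors table and then relates the Books table to it through the AuthorID field, just as described above.

import pandas as pd

# Authors table: one record per author; AuthorID is a unique identifier.
authors = pd.DataFrame({
    "AuthorID": [1, 2],
    "AuthorName": ["Danielle Steele", "John Grisham"],
    "Gender": ["Female", "Male"],
})

# Books table: one record per book; AuthorID repeats for authors with many books.
books = pd.DataFrame({
    "ISBN": ["978-0-000-00001", "978-0-000-00002", "978-0-000-00003"],  # made-up ISBNs
    "Title": ["Book One", "Book Two", "Book Three"],                    # made-up titles
    "AuthorID": [1, 1, 2],
})

# Relating the tables: each book picks up its author's information by AuthorID,
# so the personal data is never entered more than once.
print(books.merge(authors, on="AuthorID", how="left"))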

2 Fortunately, Excel now uses the term table in exactly the same way as it has been used in database packages for years. Also, when talking about databases, it is more common to refer to observations (rows) as records and variables (columns) as fields.

The key to relating these two tables is the AuthorID field. In database software such as
Access, there is a link between the AuthorID fields in the two tables. This link allows you
to find data from the two tables easily. For example, suppose you see in the Authors table
that John Updike’s ID is 35. Then you can search through the Books table for all records
with AuthorID 35. These correspond to the books you own by John Updike. Going the
other way, if you see in the Books table that you own The World According to Garp by
John Irving, who happens to have AuthorID 21, you can look up the (unique) record in the
Authors table with AuthorID 21 to find personal information about John Irving.

Primary keys, which are unique, and foreign keys, which are usually not unique, relate the
tables in a relational database.

The linked fields are called keys. Specifically, the AuthorID field in the Authors table is
called a primary key, and the AuthorID field in the Books table is called a foreign key.
A primary key must contain unique values, whereas a foreign key can contain duplicate
values. For example, there is only one Danielle Steele, but she has written many books.

Fundamental Insight
Spreadsheets versus Databases
As is illustrated throughout this book, Excel is an excellent tool for analyzing data.
However, Excel is not the best place to store complex data. In contrast, database packages
such as Access, SQL Server, Oracle, and others have been developed explicitly to store
data, and much of corporate data is stored in such database packages. These packages
typically have some tools for analyzing data, but these tools are neither as well known nor
as easy to use as Excel. Therefore, it is useful to know how to import data from a database
into Excel for analysis.

The theory and implementation of relational databases is both lengthy and complex.
Indeed, many books have been written about the topic. However, this brief introduction
suffices for our purposes. As you will see in examples, an Access database file (recognizable
by the .mdb extension, or the .accdb extension introduced in Access 2007) typically
contains several related tables. They are related in the same basic way as the Books and
Authors were related in the previous paragraphs—through links between primary and
foreign keys.3

17-2b Online Analytical Processing (OLAP)


We introduced pivot tables in Chapter 3 as an amazingly easy and powerful way to break
data down by category in Excel®. However, the pivot table methodology is not limited
to Excel or even to Microsoft. This methodology is usually called online analytical
processing, or OLAP. This name was initially used to distinguish this type of data analy-
sis from online transactional processing, or OLTP. The latter has been used for years by
companies to answer specific day-to-day questions: Why was there a shipment delay in
this customer’s order? Why doesn’t the invoice amount for this customer’s order match the
customer’s payment? Is this customer’s complaint about a defective product justified? In
fact, database systems have been developed to answer such “one-off” questions quickly.
In contrast, OLAP is used to answer broader questions: Are sales of a particular product
decreasing over time? Is a particular product selling equally well in different stores? Do
customers who pay with our credit card tend to spend more?
When analysts began to realize that the typical OLTP databases are not well equipped
to answer these broader types of questions, OLAP was born. This led to much research into
the most appropriate database structure for answering OLAP questions. The consensus is
that the best structure is a star schema. In a star schema, there is at least one Facts table with
many rows and only a few columns. For example, in a supermarket database, a Facts table
might have a row for each line item purchased, including the number of items of the product
purchased, the total amount paid for the product, and possibly the discount. Each row of the
Facts table would also list lookup values (or foreign keys, in database terminology) about
the purchase: the date, the store, the product, the customer, any promotion in effect, and
possibly others. Finally, the database would include a dimension table for each of these.
For example, there would be a Products table. Each row of this table would contain multiple
pieces of information about a particular product. Then if a customer purchases product 15,
for example, information about product 15 could be looked up in the Products table.

3 You will soon see that the Excel Data Model can now play the role of a relational database, that is, a set of related tables, but stored entirely within Excel.
One particular star schema, for the Foodmart Access database created by Microsoft for
illustration, appears in Figure 17.1. (This database is available in the Foodmart.mdb file
if you want to view it in Access.) The Facts table in the middle contains only two “facts”
about each line item purchased: Revenue and UnitsSold. (There are over 250,000 rows
in the Facts table, but even this is extremely small in comparison to many corporate facts
tables.) The other columns in the Facts table are foreign keys that let you look up informa-
tion about the product, the date, the store, and the customer in the respective dimensions
tables. You can see why the term “star schema” is used. The dimension tables surround the
central Facts table like stars.

Figure 17.1 Star Schema for Foodmart Database
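To see the star schema idea in miniature, here is a small sketch in Python with pandas. It is an illustration only; the column names, product names, and numbers are invented rather than taken from the Foodmart database. It builds a tiny Facts table containing Revenue and UnitsSold plus a ProductKey, joins it to a Products dimension table, and then breaks Revenue down by ProductFamily, which is exactly the kind of breakdown an OLAP pivot table performs.

import pandas as pd

# Facts table: one row per line item, with a foreign key into the Products dimension.
facts = pd.DataFrame({
    "ProductKey": [15, 15, 7, 7, 7],
    "Revenue":    [3.50, 4.10, 2.25, 2.25, 1.80],
    "UnitsSold":  [1, 2, 1, 1, 1],
})

# Products dimension table: one row per product; ProductKey is unique.
products = pd.DataFrame({
    "ProductKey":    [7, 15],
    "ProductFamily": ["Drink", "Food"],
    "ProductName":   ["Cola", "Cheddar Cheese"],  # invented names for illustration
})

# Join the facts to the dimension, then break the facts down by a dimension attribute.
joined = facts.merge(products, on="ProductKey", how="left")
print(joined.groupby("ProductFamily")[["Revenue", "UnitsSold"]].sum())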

Most data warehouses are built according to these basic ideas. By structuring cor-
porate databases in this way, facts can easily be broken down by dimensions, and—you
guessed it—the methodology for doing this is pivot tables. However, these pivot tables are
not just the “standard” Excel pivot tables. You might think of them as pivot tables on ster-
oids. The OLAP methodology and corresponding pivot tables have the following features
that distinguish them from standard Excel pivot tables.
■ The OLAP methodology does not belong to Microsoft or any other software company.
It has been developed by many analysts, and it has been implemented in a variety of
software packages. Of course, Microsoft is included in this group. Its OLAP tools are
located in the Analysis Services section of its SQL Server database software.

■ In OLAP pivot tables, you aren’t allowed to drag any field to any area of the pivot
table, as you can in Excel. Only facts are allowed in the Values area, and only
dimensions are allowed in the Rows, Columns, and Filters areas. But this is not much
of a limitation. The whole purpose of these pivot tables is to break down facts, such
as Revenue, by dimensions such as Date and Product.
■ Some dimensions have natural hierarchies. For example, the Products dimen-
sion in Figure 17.1 has the natural hierarchy ProductFamily, ProductDepartment,
ProductCategory, and ProductSubcategory. Similarly, the Stores and Customers
dimensions have geographical hierarchies, and the Date dimension always has
hierarchies such as Year, Quarter, Month, and Day. OLAP software lets you specify
such hierarchies. Then when you create a pivot table, you can drag a hierarchy to
an area and “drill down” through it. For example, looking at Revenue totals, you
can start at the ProductFamily level (Drink, Food, or Non-Consumable). Then you
can drill down to the ProductDepartment level for any of these, such as Beverages,
Dairy, and Alcoholic for the Drink family. Then you can drill down further to the
ProductCategory level and so on. Figure 17.2 shows what a resulting pivot table
might look like.

Figure 17.2 Drilling Down a Hierarchy in the Foodmart Database

Row Labels                    Revenue
– Drink                       $142,578.37
  + Alcoholic Beverages       $41,137.07
  – Beverages                 $80,152.27
    + Carbonated Beverages    $17,754.68
    + Drinks                  $17,028.38
    – Hot Beverages           $26,227.46
      + Chocolate             $4,085.95
      + Coffee                $22,141.51
    + Pure Juice Beverages    $19,141.75
  + Dairy                     $21,289.03
+ Food                        $1,187,171.39
+ Non-Consumable              $314,635.84
Grand Total                   $1,644,385.60

■ OLAP databases are typically huge, so it can take a while to get the results for a
particular pivot table. For this reason, the data are often “preprocessed” in such a way
that the results for any desired breakdown are already available and can be obtained
immediately. Specifically, the data are preprocessed into files that are referred to as
OLAP cubes. (The analogy is to a Rubik’s cube, where each little sub-cube contains
the result of a particular breakdown.) In Excel 2003, Microsoft let you build your own
OLAP cubes, but this feature was removed in subsequent versions of Excel. Now you
need Analysis Services in SQL Server (or some other company’s software) to build
cubes. We don’t assume that you have access to such software, so we won’t pursue
this approach here. Instead, we will discuss similar tools that are available in Excel in
the next subsection.

17-2c Power Pivot and Self-Service BI Tools in Excel
This section is potentially one of the most important sections in the book because this is
the area that Microsoft, and other companies, see as the future of data analysis, and we
tend to agree.
It helps to have some background. Starting in Excel 2010, Microsoft introduced an
optional add-in called PowerPivot. This add-in is not part of Excel 2010, but you can down-
load it for free from a Microsoft website; just search for PowerPivot 2010. Once it is added
in, it provides two main advantages, plus some others not discussed here, over regular Excel
pivot tables. First, PowerPivot allows you to work with large data sets (over a million rows,
say), without bringing your computer to a standstill. This is due to a data-compression scheme
that lets you fit huge amounts of data into memory. Second, PowerPivot provides tools for
relating separate tables of data, all within Excel. These tables could be from different sources,
such as a combination of Access tables, Excel tables, text files, and others. As long as these
tables have appropriate columns for relating the tables, essentially primary and foreign keys,
you can relate them with PowerPivot. The tools previously required to relate tables, including
VLOOKUP functions, Microsoft Access, and/or Microsoft Query, are no longer necessary.
Next, in Excel 2013, Microsoft made PowerPivot (with a modified user interface) part
of Excel. More precisely, it introduced a Data Model for storing and relating large tables
of data. Then PowerPivot can be used to manipulate this Data Model.4 Excel 2013 also
includes another add-in called Power View, an easy-to-use add-in for producing reports
based on a Data Model. Finally, Microsoft developed two other Excel add-ins for Excel
2013, Power Query and Power Map. As their names imply, Power Query is used to create
queries for importing data into Excel, and Power Map is used to show data on geographical
maps. These two add-ins are not part of Excel 2013, but you can download them for free
from a Microsoft website.
These four add-ins, PowerPivot, Power View, Power Query, and Power Map, have
become known as Excel’s Self-Service Business Intelligence (or Self-Service BI) tools.
Actually, self-service BI is not exclusively a Microsoft term, but Microsoft has embraced
it. The idea is that before self-service BI tools were available, employees usually had to
request the reports they needed from their corporate IT departments. This could take days
or weeks, and the reports were often out of date, or not quite what the employees needed,
by the time they were received. Therefore, the purpose of self-service BI tools is to let
employees generate their own reports, quickly and easily, with familiar tools such as Excel.
All of these self-service BI add-ins are built into Excel 2016.5 It seems safe to say that
these Excel self-service BI tools will continue to evolve, getting easier to use and more
powerful, as Excel evolves. It is most important at this point that you become aware of
these tools and have some familiarity with what they can do. We illustrate some of the pos-
sibilities next, using the tools and user interface from Excel 2016. (Unfortunately, even if
you have the free add-ins for Excel 2010 or 2013, the user interface is somewhat different
than in Excel 2016, and it would be too complex and confusing to provide instructions and
screenshots for each of them. Therefore, we focus only on the latest, Excel 2016.)
To get started, you must load the add-ins. First, make sure the Developer tab is visible.
(If it isn’t, right-click any ribbon, select Customize the Ribbon, and check the Developer
item in the right pane of the resulting Customize dialog box.) Then from the Developer
ribbon, click the COM Add-Ins button to see the list in Figure 17.3. You should check the
three “Power” add-ins in this list. (If any of them aren’t in your list, then your version of
Excel 2016 doesn’t include the add-ins.) Among other things, you should then see a Power
Pivot tab with the associated ribbon shown in Figure 17.4.
4 All versions of Excel 2013 include the Data Model. However, only owners of Office Professional Plus or higher have the PowerPivot add-in.
5 Again, not all versions of Office include all of these tools; you evidently require Office Professional Plus or higher. Also, Power Pivot is now (thankfully) spelled as two words, just like the other "power" add-ins.

Figure 17.3 COM Add-Ins List (Excel 2016)

Figure 17.4 Power Pivot Ribbon in Excel 2016

The rest of this subsection leads you through a typical example using Power Pivot. The data set
for this example is stored in three separate, currently unrelated, files. The bulk
of the data are in the Access file ContosoSales.accdb. Related data are in the Excel file
Stores.xlsx and the comma-delimited text file Geography.csv. The ContosoSales database
has five related tables, Dates, Sales, Products, ProductSubcategories, and ProductCategories.
Each row in the Sales table is for a sale of some product on some date. The Access tables are
related through primary and foreign keys as indicated in Figure 17.5.

Figure 17.5 Relationships Diagram for ContosoSales Database


Similarly, the Stores and Geography files each contain a single table of data that will
eventually be related to the ContosoSales data. The Stores file contains data about the stores
where the products are sold, which will be related to the Sales table through StoreKey
fields. The Geography file has information about the locations of the stores, which will
eventually be related to the Stores data through GeographyKey fields.
Here is an overview of the entire process. Once again, we remind you that these instruc-
tions apply to a version of Excel 2016 that contains Power Pivot. Also, you might want to
watch the accompanying Data Model and Power Pivot videos that lead you through most
of these same steps.
1. Import the data from the three sources into an Excel Data Model, not into physical
Excel tables.
2. Create relationships between the tables in the Data Model.
3. Modify the Data Model to enable useful pivot tables.
4. Build one or more pivot tables from the Data Model.

Step 1: Import the Data into an Excel Data Model


1. Open a new workbook and save it as Power Pivot Tutorial.xlsx.
2. Click the Manage button on the Power Pivot ribbon. This opens a Power Pivot win-
dow, which is basically a backstage view for managing your (currently empty) Data
Model. Note that the Excel window remains open so that you can go back and forth
between the Power Pivot and Excel windows.
3. From the Home ribbon of the Power Pivot window, select From Access under the
From Database dropdown list, locate the ContosoSales.accdb file, and click Next.
Accept the option to select from a list of tables and click Next. Select all five tables
and click Finish. The data are loaded into five spreadsheet-like tabs in the Power Pivot
window, where you can view the data. Because of the relationships in the Access file,
the five tables in the Data Model are automatically related. You can see the relation-
ships by clicking the Diagram View button on the Home ribbon of the Power Pivot
window. This is your Data Model, at least so far. See Figures 17.6 and 17.7.
4. From the Home ribbon of the Power Pivot window, select Excel File from the From Other
Sources list and go through the obvious steps to import the single Stores table from the
Stores.xlsx file. (Remember to check that the first row should be used as column head-
ers.) Now the Stores table is part of the Data Model, although it isn’t yet related to the other
tables.
5. Repeat the previous step to import the Geography table from the Geography.csv
comma-delimited text file. (This time, choose Text File from the From Other Sources
list.) This adds one more table to the Data Model, again not related to other tables.

Step 2: Create Relationships between the Data Sources


One of the great features of the Excel Data Model is that you can relate “relatable” tables
(those with appropriate key fields) inside Excel, without the need for complex VLOOKUPs
or other software such as Access. In the current example, you can do this directly within the
Power Pivot window. There are at least two ways to do this, as discussed next.
1. Relate the Stores table to the Sales table through the StoreKey fields. To do this, select
Create Relationship from the Design ribbon in the Power Pivot window, fill out the
resulting dialog box as shown in Figure 17.8, and click OK. (The “many” side of the
relationship should be on the left, and the “one” side should be on the right.)
2. Relate the Geography table to the Stores table through the GeographyKey fields.
To do this, you could again go through the Create Relationships dialog box, but an
easier way is to drag from the GeographyKey field in the Stores table (the "many" side)
to the GeographyKey field in the Geography table (the "one" side) in Diagram View.

Figure 17.6 Power Pivot Data View Window

Figure 17.7 Power Pivot Diagram View Window

Now all seven tables are related, as shown in Figure 17.9.

Step 3: Modify the Data Model


The data you see in the Power Pivot window, that is, the data in the Data Model, can’t be
manipulated like in regular Excel worksheets, but there are two useful things you can do:
(1) create measures and (2) create calculated columns.

Figure 17.8 Create Relationship Dialog Box

Figure 17.9 Diagram View of Completed Relationships

Both of these require Power Pivot's
DAX language (short for Data Analysis Expressions). This language is very powerful and
can become quite complex, so only a taste of it is presented here. In any case, the whole
purpose of measures, calculated columns, and other modifications of the Data Model is for
their eventual use in pivot tables (or in Power View reports).
1. A measure is any summarization of fields that could be placed in the Values area of
a pivot table. For this example, create a measure called Total Net Revenue. To do so,
return to the Power Pivot ribbon in the Excel window, select New Measure from the
Measures dropdown list, and fill out the resulting dialog box as shown in Figure 17.10.
If you then click OK and return to Data View in the Power Pivot window, you will see
the Total Net Revenue measure below the data in the Sales tab.
2. A calculated column is like a new column added to an Excel table. Here, two cal-
culated columns will be added to the Products table so that the ProductCategories
and ProductSubcategories tables can be hidden, resulting in a less cluttered view for
the user. To do this, return to the Products table in Data View, click the first blank
column, type the following formula in the formula bar, and press Enter. (You’ll get
help as you type.) Then right-click this column, select Rename Column, and rename
it Product Category. The DAX function RELATED brings a field from a related
table into the current table.
=RELATED(ProductCategories[ProductCategoryName])

Then do this a second time with the following formula, and rename this column Product
Subcategory. Then the Products table should appear as in Figure 17.11.
=RELATED(ProductSubcategories[ProductSubcategoryName])

3. As indicated in step 2, you can hide fields, or even entire tables, from users if you
think these fields or tables would never be used in a pivot table. This provides a less
cluttered look when the user eventually creates pivot tables from the Data Model.
To do this, right-click a column or a table tab and select Hide from Client Tools.

Figure 17.10 Total Net Revenue Measure

Figure 17.11 Products Table with Calculated Columns

Try this with the ProductCategories and ProductSubcategories table tabs or with
any other fields not useful for pivot tables. For example, primary and foreign key
fields are good candidates for hiding. They are necessary for relating tables, but they
aren’t likely to be used in pivot tables.

Step 4: Build One or More Pivot Tables from the Data Model
This is the easy step. From the Home ribbon in the Power Pivot window, select the first
item, PivotTable, from the PivotTable dropdown list. (You can experiment with the other
items.) Now you can drag any of the (non-hidden) fields from any of the (non-hidden)
tables to a pivot table in the usual way. In particular, you will see the Total Net Revenue
measure in the Sales table, a candidate for the Values area, and the two calculated columns
in the Products table, candidates for the Rows and Columns areas or for filtering.
Two example pivot tables appear in Figures 17.12 and 17.13. Assuming you have
followed along to this point, you should try to reproduce them.
There are two things to note here. First, if you return to the Power Pivot window and make any
changes, such as hiding more fields or creating more calculated columns, the changes will
be reflected automatically when you return to the pivot table. Second, suppose you forget to
relate tables, such as Stores and Geography. Then depending on the items you drag to the
pivot table, you might get a warning about missing relationships. In this case, you can create
the required relationships as described earlier to make the pivot table calculate correctly.
Arguably, Power Pivot is the most useful member of the “Power” add-ins for Excel
2016. However, you might also want to experiment with Power View for creating quick,
insightful reports, and Power Map for creating insightful 3D maps. You can view the accom-
panying Power View and Power Map videos to get started. You’ll find that it is quite easy.

Figure 17.12 First Example Pivot Table

Figure 17.13 Second Example Pivot Table

17-2d Visualization Software
You can gain a lot of insight by using charts to view your data in imaginative ways. This
trend toward powerful charting software for data visualization is the wave of the future and
will certainly continue. Excel’s built-in tools, including the new Power View and Power
Map add-ins, can be used for visualization. In addition, many other companies are devel-
oping visualization software. To get a glimpse of what is currently possible, you can watch
the accompanying video about a free software package, Tableau Public, developed by
Tableau Software. Perhaps you will find other visualization software packages, free or
otherwise, that rival Tableau or Power View. Alternatively, you might see blogs with data
visualizations from ordinary users. In any case, the purpose of all visualization software is
to portray data graphically so that otherwise hidden trends or patterns can emerge clearly.

PROBLEMS
Note: Student solutions for problems whose numbers appear within a colored box are
available for purchase at www.cengagebrain.com.

Level A

Problems 1–3 require Excel 2016 (or 2013). It is possible to do them in Excel 2010,
assuming the Power Pivot add-in has been installed and loaded, but there is more work
involved.

1. The Access database file Foodmart.mdb mentioned earlier has the tables and
   relationships shown in Figure 17.1. Import the tables into an Excel Data Model. Then
   create a pivot table that shows, for each product family, the percentage of the total
   revenue from stores in each of the three countries (Canada, Mexico, and USA). For
   example, you should find that 26.20% of revenue for the Food family came from stores
   in Mexico.

2. Proceed as in the previous problem, but now create a pivot table that shows, for stores
   in each country, the percentage of units sold in each of the three product families
   (Drink, Food, and Non-Consumable). For example, you should find that 70.83% of all
   units sold in Canada were in the Food family.

3. Proceed as in Problem 1, but now create a pivot table that has two fields from the
   Products table, ProductFamily and ProductDepartment, in the Rows area and two
   fields from the Stores table, Country and Region, in the Columns area. Use this to
   find the percentage of revenue from any product family or any product department
   that comes from any country or any region of any country. That is, a single pivot table
   should answer all such questions. For example, you should find that 17.48% of all
   revenue from the Seafood department comes from the Central region of Mexico.

Level B

4. The file Adventure Works.cub contains sales data on biking and related products. It
   was created in Microsoft SQL Server software and stored as an offline cube. There are
   two dimension hierarchies, Product Model Categories and Product Model Lines, that
   categorize the products in slightly different ways. Create a pivot table that shows
   Internet Sales Amount for all products in the Mountain line in the Rows area, broken
   down by all product model categories in the Columns area. Then do whatever is
   necessary to find the percentage of all Internet Sales Amount in the Mountain line due
   to Tires and Tubes Accessories. (Hint: To do this, you'll have to use Microsoft Query,
   part of Microsoft Office and discussed in detail in online Chapter 18. The required
   steps are listed below.)
   a. Select From Microsoft Query from the From Other Sources dropdown list in the
      Get External Data group on the Data ribbon.
   b. Click the OLAP Cubes tab, then <New Data Source>, then OK.
   c. Give the source a name such as Ad Works Cube, select the latest OLAP provider,
      and click Connect.
   d. Click the Cube file option and browse for the Adventure Works.cub file.
   e. Click Finish, then OK, and then OK to back out to an Import Data dialog.
   f. Select PivotTable Report and a location such as cell A1 on a worksheet, and you'll
      be all set to go.

17-3 CLASSIFICATION METHODS


This section discusses one of the most important problems studied in data mining: classi-
fication. This is similar to the problem attacked by regression analysis—using explanatory
variables to predict a dependent variable—but now the dependent variable is categorical. It
usually has two categories, such as Yes and No, but it can have more than two categories,
such as Republican, Democrat, and Independent. This problem has been analyzed with
very different types of algorithms, some regression-like and others very different from
regression, and this section discusses four of the most popular classification methods. Each
of the methods has the same objective: to use data from the explanatory variables to clas-
sify each record (person, company, or whatever) into one of the known categories.
Before proceeding, it is important to discuss the role of data partitioning in classification
and in data mining in general. Data mining is usually used to explore very large data sets,
with many thousands or even millions of records. Therefore, it is very possible, and also very
useful, to partition the data set into two or even three distinct subsets before the algorithms are
applied. Each subset has a specified percentage of all records, and these subsets are typically
chosen randomly. The first subset, usually with about 70% to 80% of the records, is called
the training data set. The second subset, called the testing data set, usually contains the rest
of the data. Each of these data sets should have known values of the dependent variable. Then
the algorithm is trained with the data in the training data set. This results in a model that can
be used for classification. The next step is to test this model on the testing data set. It is very
possible that the model will work quite well on the training data set because this is, after all,
the data set that was used to create the model. The real question is whether the model is flex-
ible enough to make accurate classifications in the testing data set.
Most data mining software packages have utilities for partitioning the data. (In the
following subsections, you will see that the logistic regression procedure in StatTools does not
yet have partitioning utilities, but the Palisade NeuralTools add-in for neural networks does
have them.) The various software packages might use slightly different terms for the subsets,
but the overall purpose is always the same, as just described. They might also let you specify a
third subset, often called a prediction data set, where the values of the dependent variable are
unknown. Then you can use the model to classify these unknown values. Of course, you won’t
know whether the classifications are accurate until you learn the actual values of the dependent
variable in the prediction data set.
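As a concrete illustration of partitioning, the following Python sketch splits a data set randomly into a 70% training subset and a 30% testing subset. It is not part of the StatTools or NeuralTools workflow discussed below, and the function name and column names are only stand-ins for whatever your data contain.

import numpy as np
import pandas as pd

def partition(data, train_frac=0.7, seed=1):
    """Randomly assign each record to the training or the testing subset."""
    rng = np.random.default_rng(seed)
    in_training = rng.random(len(data)) < train_frac
    return data[in_training], data[~in_training]

# A tiny illustrative data set with a binary dependent variable.
data = pd.DataFrame({
    "Age":       [48, 33, 51, 56, 28, 51, 44, 29],
    "MallTrips": [7, 4, 1, 3, 3, 2, 6, 5],
    "HaveTried": [0, 1, 0, 0, 1, 0, 1, 1],
})

train, test = partition(data)
# A classification model is fit ("trained") on train and then evaluated on test,
# whose known HaveTried values reveal how accurate the classifications are.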

17-3a Logistic Regression


Logistic regression is a popular method for classifying individuals, given the values of a set of
explanatory variables. It estimates the probability that an individual is in a particular category.
As its name implies, logistic regression is somewhat similar to the usual regression analysis,
but its approach is quite different. It uses a nonlinear function of the explanatory variables for
classification.
Logistic regression is essentially regression with a binary (0–1) dependent variable. For
the two-category problem (the only version of logistic regression discussed here), the binary
variable indicates whether an observation is in category 0 or category 1. One approach to the
classification problem, an approach that is sometimes actually used, is to run the usual mul-
tiple regression on the data, using the binary variable as the dependent variable. However,
this approach has two serious drawbacks. First, it violates the regression assumption that the
error terms should be normally distributed. Second, the predicted values of the dependent
variable are not constrained to lie between 0 and 1; they can be less than 0 or greater than 1. If you want a predicted value
to estimate a probability, then values less than 0 or greater than 1 make no sense.
Therefore, logistic regression takes a slightly different approach. Let X1 through Xk be
the potential explanatory variables, and create the linear function b0 + b1X1 + ⋯ + bkXk.
Unfortunately, there is no guarantee that this linear function will be between 0 and 1, and
hence that it will qualify as a probability. But the nonlinear function
1/(1 + e^−(b0 + b1X1 + ⋯ + bkXk))
is always between 0 and 1. In fact, the function f(x) = 1/(1 + e^−x) is an "S-shaped logistic"
curve, as shown in Figure 17.14. For large negative values of x, the function approaches 0,
and for large positive values of x, it approaches 1.
The logistic regression model uses this function to estimate the probability that any obser-
vation is in category 1. Specifically, if p is the probability of being in category 1, the model

p = 1/(1 + e^−(b0 + b1X1 + ⋯ + bkXk))

is estimated. This equation can be manipulated algebraically to obtain an equivalent form:


ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bkXk
This equation says that the natural logarithm of p/(1 − p) is a linear function of the explan-
atory variables. The ratio p/(1 − p) is called the odds ratio.
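The algebra behind this equivalence is short. Writing L = b0 + b1X1 + ⋯ + bkXk for the linear function, the steps are (in standard notation, not spelled out in the chapter):

p = \frac{1}{1 + e^{-L}}
\quad\Rightarrow\quad
1 - p = \frac{e^{-L}}{1 + e^{-L}}
\quad\Rightarrow\quad
\frac{p}{1 - p} = e^{L}
\quad\Rightarrow\quad
\ln\!\left(\frac{p}{1 - p}\right) = L = b_0 + b_1 X_1 + \cdots + b_k X_k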

Figure 17.14 S-shaped Logistic Curve: the logistic function 1/(1 + e^−x), plotted for x from −7 to 7

The odds ratio is a term frequently used in everyday language. Suppose, for example,
that the probability p of a company going bankrupt is 0.25. Then the odds that the company
will go bankrupt are p/(1 − p) = 0.25/0.75 = 1/3, or “1 to 3.” Odds ratios are probably
most common in sports. If you read that the odds against Duke winning the NCAA basket-
ball championship are 4 to 1, this means that the probability of Duke winning the champi-
onship is 1/5. Or if you read that the odds against Purdue winning the championship are 99
to 1, then the probability that Purdue will win is only 1/100.
The logarithm of the odds ratio, the quantity on the left side of the above equation, is
called the logit (or log odds). Therefore, the logistic regression model states that the logit
is a linear function of the explanatory variables. Although this is probably a bit mysterious
and there is no easy way to justify it intuitively, logistic regression has produced useful
results in many applications.
The numerical algorithm used to estimate the regression coefficients is complex, but
the important goal for our purposes is to interpret the regression coefficients correctly.
First, if a coefficient b is positive, then if its X increases, the log odds increases, so the
probability of being in category 1 increases. The opposite is true for a negative b. So just
by looking at the signs of the coefficients, you can see which explanatory variables are
positively correlated with being in category 1 (the positive b’s) and which are positively
correlated with being in group 0 (the negative b’s).
You can also look at the magnitudes of the b’s to try to see which of the X’s are “most
important” in explaining category membership. Unfortunately, you run into the same prob-
lem as in regular regression. Some X’s are typically of completely different magnitudes than
others, which makes comparisons of the b’s difficult. For example, if one X is income, with
values in the thousands, and another X is number of children, with values like 0, 1, and 2, the
coefficient of income will probably be much smaller than the coefficient of children, even
though these two variables can be equally important in explaining category membership. We
won’t say more about the interpretation of the regression coefficients here, but you can find
comments about them in the finished version of the lasagna triers file discussed next.
In many situations, especially in data mining, the primary objective of logistic regression
is to “score” members, given their X values. The score for any member is the estimated value
of p, found by plugging into the logistic regression equation to get the logit and then solving
algebraically to get p. (This is typically done automatically by the software package.) Those
members who score highest are the most likely to be in category 1; those who score lowest
are most likely to be in category 0. For example, if category 1 represents the responders to
some direct mail campaign, a company might mail brochures to the top 10% of all scorers.
These scores can also be used to classify members. Here, a cutoff probability is
required. All members who score below the cutoff are classified as category 0, and the rest
are classified as category 1. This cutoff value is often 0.5, but any value can be used. This
issue is discussed in more detail later in this section.
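The scoring and cutoff logic is simple enough to sketch in a few lines of Python. The function names, coefficients, and member values below are hypothetical, not estimates from any data set in this chapter; the point is only the mechanics of computing the logit, converting it to a probability, and applying a 0.5 cutoff.

import math

def score(x, b0, b):
    """Estimated probability of category 1 from a fitted logistic regression."""
    logit = b0 + sum(bj * xj for bj, xj in zip(b, x))  # b0 + b1*X1 + ... + bk*Xk
    return 1.0 / (1.0 + math.exp(-logit))              # logistic function of the logit

def classify(p, cutoff=0.5):
    """Classify as category 1 when the score reaches the cutoff, else category 0."""
    return 1 if p >= cutoff else 0

# Hypothetical coefficients (intercept, then one coefficient per explanatory variable)
# and one member's explanatory values.
b0, b = -2.5, [-0.07, 0.69]
member = [35, 4]   # for example, Age = 35 and MallTrips = 4

p = score(member, b0, b)
print(round(p, 3), classify(p))   # the score and the resulting classification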
Fortunately, StatTools has a logistic regression procedure, as illustrated in the following
example.

EXAMPLE 17.1 CLASSIFYING LASAGNA TRIERS WITH LOGISTIC REGRESSION

The Lasagna Triers Logistic Regression.xlsx file contains the same data set from Chapter 3
on 856 people who have either tried or not tried a company’s new frozen lasagna product.
The categorical dependent variable, Have Tried, and several of the potential explanatory vari-
ables contain text, as shown in Figure 17.15. Some logistic regression software packages allow
such text variables and implicitly create dummies for them, but StatTools requires all numeric
variables. Therefore, the StatTools Dummy utility was used to create dummy variables for all
text variables. (You could also do this with IF formulas.) Using the numeric variables, including
dummies, how well is logistic regression able to classify the triers and nontriers?

Figure 17.15 Lasagna Data Set with Text Variables

Person  Age  Weight  Income  Pay Type  Car Value  CC Debt  Gender  Live Alone  Dwell Type  Mall Trips  Nbhd  Have Tried
1       48   175     65500   Hourly    2190       3510     Male    No          Home        7           East  No
2       33   202     29100   Hourly    2110       740      Female  No          Condo       4           East  Yes
3       51   188     32200   Salaried  5140       910      Male    No          Condo       1           East  No
4       56   244     19000   Hourly    700        1620     Female  No          Home        3           West  No
5       28   218     81400   Salaried  26620      600      Male    No          Apt         3           West  Yes
6       51   173     73000   Salaried  24520      950      Female  No          Condo       2           East  No
7       44   182     66400   Salaried  10130      3500     Female  Yes         Condo       6           West  Yes
8       29   189     46200   Salaried  10250      2860     Male    No          Condo       5           West  Yes

Objective To use logistic regression to classify users as triers or nontriers, and to inter-
pret the resulting output.

Solution
A StatTools data set already exists (in the unfinished version of the file). It was used to cre-
ate the dummy variables. To run the logistic regression, you select Logistic Regression from
the StatTools Regression and Classification dropdown list. Then you fill out the usual StatTools
dialog box as shown in Figure 17.16. At the top, you see two options: “with no Count Variable”
or “with Count Variable.” The former is appropriate here. (The latter is used only when there is
a count of the 1’s for each joint category, such as males who live alone.) The dependent variable
is the dummy variable Have Tried Yes, and the explanatory variables are the original numeric
variables (Age, Weight, Income, Car Value, CC Debt, and Mall Trips) and the dummy variables
(Pay Type Salaried, Gender Male, Live Alone Yes, Dwell Type Condo, and Dwell Type Home).
As in regular regression, one dummy variable for each categorical variable should be omitted.
The logistic regression output is much like regular regression output. There is a summary
section and a list of coefficients, shown in Figure 17.17. The summary section is analogous
to the ANOVA table in a regression output. The Improvement value indicates how much bet-
ter the logistic regression classification is than a classification with no explanatory variables
at all. The corresponding p-value indicates that this improvement is statistically significant at
any of the usual significance levels, exactly like a small p-value in an ANOVA table.
The coefficient section is also analogous to regular regression output. The Wald value
is like the t-value, and each corresponding p-value indicates whether that variable could
be excluded from the equation. In this case, Income, Car Value, CC Debt, Gender Male,

17-3 Classification Methods 915


Copyright 2017 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. WCN 02-200-203
Figure 17.16 StatTools Logistic Regression Dialog Box
Figure 17.17 Summary and Coefficients in StatTools Logistic Regression Output

Logistic Regression for Have Tried Yes

Summary Measures
  Null Deviance    1165.604813
  Model Deviance    687.9428839
  Improvement       477.6619292
  p-Value           <0.0001

(Note in output: This is the output from using all explanatory variables. You can run the logistic regression again, deleting variables with high p-values, but the basic results don't change substantially.)

Regression Coefficients
                     Coefficient     Standard Error   Wald Value     p-Value   Lower Limit     Upper Limit     Exp(Coef)
Constant             -2.540587689    0.909698289      -2.79278055    0.0052    -4.323596336    -0.757579042    0.078820065
Age                  -0.069688555    0.010808445      -6.447602252   <0.0001   -0.090873108    -0.048504003    0.932684254
Weight                0.007033414    0.003849631       1.82703581    0.0677    -0.000511863     0.014578691    1.007058206
Income                4.76283E-06    3.77935E-06       1.260222403   0.2076    -2.64471E-06     1.21704E-05    1.000004763
Car Value            -2.66917E-05    2.04171E-05      -1.307318278   0.1911    -6.67092E-05     1.33259E-05    0.999973309
CC Debt               7.78774E-05    9.14027E-05       0.852024709   0.3942    -0.000101272     0.000257027    1.00007788
Mall Trips            0.687005598    0.059764316      11.49524745    <0.0001    0.569867539     0.804143656    1.987754476
Pay Type Salaried     1.332747327    0.220912727       6.032913283   <0.0001    0.899758382     1.765736273    3.791445433
Gender Male           0.255542473    0.191544317       1.334116706   0.1822    -0.119884388     0.630969333    1.291161851
Live Alone Yes        1.322630127    0.283886309       4.659013441   <0.0001    0.766212962     1.879047292    3.75328001
Dwell Type Condo     -0.080928114    0.275087202      -0.294190764   0.7686    -0.620099029     0.458242801    0.922259987
Dwell Type Home       0.176721835    0.248863714       0.710114914   0.4776    -0.311051044     0.664494713    1.193299112

and the two Dwell Type dummies could possibly be excluded. (You can check that if these
variables are indeed excluded and the logistic regression is run again, very little changes.)
The signs of the remaining coefficients indicate whether the probability of being a trier
increases or decreases when these variables increase. For example, this probability decreases
as Age increases (a minus sign), and it increases as Weight increases (a plus sign). Again,
however, you have to use caution when interpreting the magnitudes of the coefficients. For
example, the coefficient of Weight is small because Weight has values in the hundreds, and
the coefficient of Live Alone Yes is much larger because this variable is either 0 or 1.

The Exp(Coef) column is more interpretable. For example, as explained in the finished
version of the file and in StatTools cell comments, if Live Alone Yes increases from 0 to 1—
that is, a person who doesn’t live alone is compared to a person who does live alone—the odds
of being a trier increase by a factor of about 3.75. In other words, the people who live alone
are much more likely to try the product. The other values in this column can be interpreted
in a similar way, and you should be on the lookout for values well above 1 or well below 1.
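
If you want to verify an Exp(Coef) value yourself, it is simply the exponential of the corresponding coefficient. The following small Python sketch uses two coefficients from Figure 17.17.

    import math

    print(math.exp(1.322630127))    # Live Alone Yes: odds multiply by about 3.75
    print(math.exp(-0.069688555))   # Age: odds multiply by about 0.93 per extra year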
Below the coefficient output, you see the classification summary shown in
Figure 17.18. To create these results, the explanatory values in each row are plugged into
the logistic regression equation, which results in an estimate of the probability that the per-
son is a trier. If this probability is greater than 0.5, the person is classified as a trier; if it is
less than 0.5, the person is classified as a nontrier. The results show the number of correct
and incorrect classifications. For example, 422 of the 495 triers, or 85.25%, are classified
correctly as triers. The bottom summary indicates that 82.01% of all classifications are
correct. However, how good is this really? It turns out that 57.83% of all observations are
triers, so a naïve classification rule that classifies everyone as a trier would get 57.83% cor-
rect. The last number, 57.34%, represents the improvement of logistic regression over this
naïve rule. Specifically, logistic regression is 57.34% of the way from the naïve 57.83% to
a perfect 100%.

Figure 17.18 Classification Matrix Summary

Classification Matrix        1        0      Percent Correct
  1                        422       73      85.25%
  0                         81      280      77.56%

Summary                Percent Classification
  Correct              82.01%
  Base                 57.83%
  Improvement          57.34%
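
The percentages in Figure 17.18 are easy to reproduce once you have the estimated probabilities. Continuing the illustrative Python sketch from earlier in this example (so model, X, and y refer to that fit, which is our construction, not StatTools output), the calculations look something like this.

    p_hat = model.predict()                  # estimated probability of being a trier
    classified = (p_hat > 0.5).astype(int)   # 1 = classified as a trier

    percent_correct = (classified == y).mean()            # about 0.82
    base = max(y.mean(), 1 - y.mean())                     # 0.5783, the naive rule
    improvement = (percent_correct - base) / (1 - base)    # about 0.57
    print(percent_correct, base, improvement)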

The last part of the logistic regression output, a small part of which is shown in
Figure 17.19, lists all of the original data and the scores discussed earlier. For example, the
first person’s score is 75.28%. This is the probability estimated from the logistic regression
equation that this person is a trier. Because it is greater than 0.5, this person is classified
as a trier. However, this is one of the relatively few misclassifications. The first person is
actually a nontrier. In the same way, explanatory values for new people, those whose trier
status is unknown, could be fed into the logistic regression equation to score them.

Figure 17.19 Scores for the First Few People

Analysis Probability    Analysis Class    Original Class
  75.28%                      1                 0
  35.15%                      0                 1
   7.65%                      0                 0
   9.18%                      0                 0
  60.22%                      1                 1
   7.69%                      0                 0

Then perhaps some incentives could be sent to the top scorers to increase their chances of trying
the product. The point is that logistic regression is then being used as a tool to identify the
people most likely to be the triers. ■
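
As a rough sketch of that scoring step, again continuing the illustrative Python model from above (new_df is a hypothetical data frame holding the new people's explanatory values, with the same columns as the original data):

    new_X = pd.get_dummies(new_df, drop_first=True)
    new_X = new_X.reindex(columns=X.columns, fill_value=0).astype(float)
    scores = model.predict(sm.add_constant(new_X))        # estimated trier probabilities
    ranked = new_df.assign(Score=scores).sort_values("Score", ascending=False)
    print(ranked.head(10))                                # the ten most promising prospects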

Before we leave this subsection, you might have noticed that StatTools includes another classification procedure called discriminant analysis. This is a classical technique
developed decades ago that is still in use. It is somewhat similar to logistic regression and
has the same basic goals. However, it is not as prominent in data mining discussions as
logistic regression. Therefore, discriminant analysis is not discussed here.

17-3b Neural Networks


The neural network (or simply, neural net) methodology attempts to mimic the complex
behavior of the human brain. It sends inputs (the values of explanatory variables) through
a complex nonlinear network to produce one or more outputs (the values of the dependent
variable). Methods for doing this have been studied by researchers in artificial intelligence
and other fields for decades, and there are now many software packages that implement
versions of neural net algorithms. Some people seem to believe that data mining is syn-
onymous with neural nets. Although this is definitely not true—data mining employs many
algorithms that bear no resemblance to neural nets—the neural net methodology is cer-
tainly one of the most popular methodologies in data mining. It can be used to predict a
categorical dependent variable, as in this section on classification, and it can also be used
to predict a numeric dependent variable, as in multiple regression.
The biggest advantage of neural nets is that they often provide more accurate predictions
than any other methodology, especially when relationships are highly nonlinear. They also
have a downside. Unlike methodologies like multiple regression and logistic regression,
neural nets do not provide easily interpretable equations where you can see the contribu-
tions of the individual explanatory variables. For this reason, they are often called a “black
box” methodology. If you want good predictions, neural nets often provide an attractive
method, but you shouldn’t expect to understand exactly how the predictions are made.
A brief explanation of how neural nets work helps to clarify this black box behav-
ior. Each neural net has an associated network diagram, something like the one shown in
Figure 17.20. This figure assumes two inputs and one output. The network also includes a
“hidden layer” in the middle with two hidden nodes. Scaled values of the inputs enter the
network at the left, they are weighted by the W values and summed, and these sums are sent
to the hidden nodes. At the hidden nodes, the sums are “squished” by an S-shaped logistic-
type function. These squished values are then weighted and summed, and the sum is sent to
the output node, where it is squished again and rescaled. Although the details of this pro-
cess are best left to software, small illustrative examples are available in the file Neural Net
Explanation.xlsm. (The file is an .xlsm file because the logistic function is implemented

Figure 17.20 Neural Net with Two Inputs and Two Hidden Nodes
(The two inputs feed the two hidden nodes, which in turn feed the output node; the arrows are labeled with weights W10, W11, W12, W20, W21, W22. There could be a few additional "bias" arrows, essentially like the constant term in regression.)

with a macro, so make sure you enable macros.) There is one sheet for a one-input neural
net and another for a two-input neural net. You can see how everything works by studying
the cell formulas. However, the main insight provided by this file is that you can see how
different sets of weights lead to very different nonlinear behaviors.
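
To make the figure concrete, here is a tiny Python sketch of the same two-input, two-hidden-node network. The weight values are invented purely for illustration; the point is only to show inputs being weighted, summed, and "squished" on their way to the output.

    import math

    def logistic(z):
        # the S-shaped "squishing" function used at the hidden and output nodes
        return 1.0 / (1.0 + math.exp(-z))

    def two_input_net(x1, x2):
        # invented weights; the first number in each sum acts as a "bias" term,
        # like the constant in a regression equation
        hidden1 = logistic(0.3 + 1.5 * x1 - 2.0 * x2)
        hidden2 = logistic(-0.4 + 0.7 * x1 + 1.1 * x2)
        return logistic(0.1 + 2.3 * hidden1 - 1.8 * hidden2)

    print(two_input_net(0.5, 0.8))   # always a number between 0 and 1

Changing the weights changes the shape of the function relating the inputs to the output, which is exactly the kind of experimentation the Neural Net Explanation.xlsm file encourages.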
A neural net can have any number of hidden layers and hidden nodes, and the choices
for these are far from obvious. Many software packages make these choices for you, based
on rules of thumb discovered by researchers. Once the structure of the network is chosen,
the neural net is “trained” by sending many sets of inputs—including the same inputs mul-
tiple times—through the network and comparing the outputs from the net with the known
output values. Based on many such comparisons, the weights are repeatedly adjusted. This
process continues until the weights stabilize or some other stopping criterion is reached.
Depending on the size of the data set, this iterative process can take some time.
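
The following toy sketch illustrates the training idea only; it is emphatically not the algorithm used by commercial software. It reuses the same small network structure, nudges one weight at a time in whichever direction reduces the total squared error on some made-up training cases, and stops when no single nudge helps any more.

    import math
    import random

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    def net_output(x1, x2, w):
        h1 = logistic(w[0] + w[1] * x1 + w[2] * x2)
        h2 = logistic(w[3] + w[4] * x1 + w[5] * x2)
        return logistic(w[6] + w[7] * h1 + w[8] * h2)

    def total_error(w, cases):
        return sum((y - net_output(x1, x2, w)) ** 2 for x1, x2, y in cases)

    # made-up training cases: (input 1, input 2, known 0/1 output)
    cases = [(0.2, 0.9, 1), (0.8, 0.1, 0), (0.5, 0.7, 1), (0.9, 0.4, 0)]
    w = [random.uniform(-1, 1) for _ in range(9)]   # random starting weights
    step = 0.05

    for sweep in range(5000):                 # many passes through the data
        changed = False
        for i in range(len(w)):               # adjust each weight in turn
            for delta in (step, -step):
                trial = w[:i] + [w[i] + delta] + w[i + 1:]
                if total_error(trial, cases) < total_error(w, cases):
                    w[i] += delta
                    changed = True
                    break
        if not changed:                       # the weights have stabilized
            break

    print(total_error(w, cases))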
As research continues, the algorithms implemented with neural net software continue
to change. The ideas remain basically the same, but the way these ideas are implemented,
and even the results, can vary from one implementation to another.
StatTools does not implement neural nets, but another add-in in the Palisade suite,
NeuralTools, does. The following continuation of the lasagna triers example illustrates its use.

EXAMPLE 17.2 CLASSIFYING LASAGNA TRIERS WITH NEURAL NETS

Logistic regression provided reasonably accurate classifications for the lasagna triers
data set. Can a neural net, as implemented in Palisade’s NeuralTools add-in, provide
comparable results?
Objective To learn how the NeuralTools add-in works, and to compare its results to
those from logistic regression.

Solution
The data for this version of the example are in the file Lasagna Triers NeuralTools.xlsx.
There are two differences from the file used for logistic regression. First, no dummy vari-
ables are necessary. The NeuralTools add-in is capable of dealing directly with text vari-
ables. Second, there is a Prediction Data sheet with a second data set of size 250 to be used
for prediction. Its values of the dependent Have Tried variable are unknown.
You launch NeuralTools just like StatTools, @RISK, or any of the other Palisade add-
ins. This produces a NeuralTools tab and ribbon, as shown in Figure 17.21. As you can see,
NeuralTools uses a Data Set Manager, just like StatTools. The only difference is that when
you specify the data set, you must indicate the role of each variable in the neural net. The
possible roles are Independent Numeric, Independent Categorical, Dependent Numeric,
Dependent Categorical, Tag, and Unused. Except for Tag, which isn’t used here, these
have the obvious meanings. So the first step is to create two data sets, one for each sheet,
with Have Tried as Dependent Categorical, Person as Unused, and the other variables as
Independent Numeric or Independent Categorical as appropriate. (NeuralTools usually
guesses the roles correctly.) We call these data sets Train/Test Data and Prediction Data,
respectively.
Figure 17.21 NeuralTools Ribbon

To train a neural net on the Train/Test Data set, you activate the Data sheet and click Train
on the NeuralTools ribbon to get a Training dialog box with three tabs. The Train tab
shown in Figure 17.22 provides three basic options. The first option allows you to parti-
tion the data set into training and testing subsets. The default shown here is to set aside a
random 20% of cases for testing. The second option is for predicting cases with missing
values of the dependent variable. This is not relevant here because there are no such cases
in the Data sheet. Besides, prediction will be performed later on the Prediction Data set.
The third option is to calculate variable impacts. This is useful when you have a large
number of potential explanatory variables. It lets you screen out the ones that seem to be
least useful. You can check this option if you like. However, its output doesn’t tell you, at
least not directly, how the different explanatory variables affect the dependent variable.

Figure 17.22 Train Tab of Training Dialog Box
The Net Configuration tab shown in Figure 17.23 lets you select one of three options
for the training algorithm. The PN/GRN (probabilistic neural net) algorithm is relatively
new. It is fairly quick and it usually gives good results, so it is a good option to try, as is
done here.6 The MLF (multi-layer feedforward) algorithm is more traditional, but
it is considerably slower. The Best Net Search tries both PN/GRN and various versions of
MLF to see which is best, but it is quite slow.
The Runtime tab (not shown here) specifies stopping conditions for the algorithm. You
can accept the defaults, and you can always stop the training prematurely if it doesn’t seem
to be making any improvement.
Once you click Next on any of the tabs, you will see a summary (not shown here) of
the model setup. Then you can click its Train button to start the algorithm. You will see a

6 The abbreviation PN/GRN is a bit confusing. For classification problems, the algorithm is called probabilistic neural net (PNN). However, if the dependent variable is continuous, the same basic algorithm is called generalized regression neural net, which explains the GRN abbreviation.

Figure 17.23 Net Configuration Tab of Training Dialog Box
progress monitor, and eventually you will see results on a new sheet, the most important
of which are shown in Figure 17.24. (As in other Palisade add-ins, the results are stored
by default in a new workbook. You can change this behavior from the Application Settings
dialog box, available from the Utilities dropdown list.)

Figure 17.24 Selected Training Results

Classification Matrix (for training cases)
            No      Yes     Bad(%)
  No       260       30     10.3448%
  Yes       26      369      6.5823%

Classification Matrix (for testing cases)
            No      Yes     Bad(%)
  No        54       17     23.9437%
  Yes       16       84     16.0000%

The top part shows classification results for the 80%, or 685, cases used for training.
About 10% of the No values were classified incorrectly, and about 6.5% of the
Yes values were classified incorrectly. The bottom part shows similar results
for the 20%, or 171, cases used for testing. The incorrect percentages, about 24% and
16%, are not as good as for the training set, but this is not unusual. Also, these results are
slightly better than those from logistic regression, where about 18% of the classifications
were incorrect. (Remember, however, that the data set wasn’t partitioned into training
and testing subsets for logistic regression.) Note that these results are from an 80–20
random split of the original data. The results you get from a different random split will
probably be different.
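
NeuralTools is a commercial add-in, but you can mimic the same train-and-test workflow with open-source software. The following Python sketch uses the scikit-learn library with a small multilayer feedforward net (so it resembles the MLF option more than PN/GRN), again assuming the data have been exported to lasagna.csv with the column names in Figure 17.15. Its error rates will not match Figure 17.24 exactly, both because the algorithm differs and because the random 80-20 split differs.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import confusion_matrix

    df = pd.read_csv("lasagna.csv")                       # assumed export of the data
    X = pd.get_dummies(df.drop(columns=["Person", "Have Tried"]), drop_first=True)
    y = (df["Have Tried"] == "Yes").astype(int)

    # hold out a random 20% of cases for testing, train on the other 80%
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0)

    # scale the inputs, then fit a small feedforward net with one hidden layer
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000,
                                      random_state=0))
    net.fit(X_train, y_train)

    print(confusion_matrix(y_train, net.predict(X_train)))   # training cases
    print(confusion_matrix(y_test, net.predict(X_test)))     # testing cases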
Now that the model has been trained, it can be used to predict the unknown values
of the dependent variable in the Prediction Data set. To do so, you activate the Prediction

