Chapter 4 • Predictive Analytics I: Data Mining Process, Methods, and Algorithms
TABLE 4.5 Tabulated Prediction Results for Individual and Ensemble Models

                         Individual Models                      Ensemble Models
Performance Measure      SVM          ANN          CART         Random Forest  Boosted Tree  Fusion (Average)
Count (Bingo)            [illegible]  182          140          [illegible]    [illegible]   194
Count (1-Away)           104          [illegible]  126          [illegible]    104           120
Accuracy (% Bingo)       [illegible]  52.60%       [illegible]  [illegible]    [illegible]   [illegible]
Accuracy (% 1-Away)      [illegible]  87.28%       76.88%       [illegible]    [illegible]   [illegible]
Standard deviation       [illegible]  0.87         1.05         [illegible]    [illegible]   0.63
for the prediction problem. The upper-left side of the process map shows the model development process, and the lower-right corner of the process map shows the model assessment (i.e., testing or scoring) process (more details on the IBM SPSS Modeler tool and its usage can be found on the book's Web site).
Results
Table 4.5 provides the prediction results of all three data mining methods as well as the results of the three different ensembles. The first performance measure is the percent correct classification rate, which is called Bingo. Also reported in the table is the 1-Away correct classification rate (i.e., within one category). The results indicate that SVM performed the best among the individual prediction models, followed by ANN; the worst of the three was the CART decision tree algorithm. In general, the ensemble models performed better than the individual prediction models, and among the ensembles the fusion algorithm performed the best. What is probably more important to decision makers, and what stands out in the results table, is the significantly lower standard deviation obtained from the ensembles compared to the individual models.
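The two performance measures, and fusion by averaging, can be illustrated with a minimal Python sketch. It assumes the outcome classes are ordinal and coded as consecutive integers. The function names and the six-case data set are hypothetical and are not taken from the study. Averaging the class codes is only one plausible reading of the "Fusion (Average)" column.

```python
from statistics import mean


def bingo_rate(actual, predicted):
    """Share of cases whose predicted class equals the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def one_away_rate(actual, predicted):
    """Share of cases predicted within one ordinal category of the actual class."""
    near = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= 1)
    return near / len(actual)


def fuse_by_average(*model_predictions):
    """Average the class codes predicted by several models, case by case,
    and round each average to the nearest class code."""
    return [round(mean(votes)) for votes in zip(*model_predictions)]


# Made-up class codes for six cases (illustrative only, not the study's data).
actual = [1, 3, 5, 7, 9, 2]
svm = [1, 4, 5, 6, 9, 4]
ann = [2, 3, 5, 7, 8, 4]
cart = [1, 3, 3, 7, 9, 1]

fused = fuse_by_average(svm, ann, cart)
for name, predicted in [("SVM", svm), ("ANN", ann), ("CART", cart), ("Fusion", fused)]:
    print(f"{name:7s} Bingo={bingo_rate(actual, predicted):.2f} "
          f"1-Away={one_away_rate(actual, predicted):.2f}")
```

With three models, the average of integer class codes never lands exactly halfway between two classes, so the rounding step is unambiguous here. A different number of models would need an explicit tie-breaking rule.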
Conclusion
The researchers claim that these prediction results are better than any reported in the published literature for this problem domain. Beyond the attractive accuracy of their prediction results of the box-office receipts, these models could also be used to further analyze (and potentially optimize) the decision variables to maximize the financial return. Specifically, the parameters used for modeling could be altered using the already trained prediction models to better understand the impact of different parameters on the end results. During this process, which is commonly referred to as sensitivity analysis, the decision maker of a given entertainment firm could find out, with a fairly high accuracy level, how much value a specific actor (or a specific release date, or the addition of more technical effects, etc.) brings to the financial success of a film, making the underlying system an invaluable decision aid.
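The idea of sensitivity analysis described above can be sketched in a few lines of Python. The function `predict_gross` is a hypothetical stand-in for an already trained model. Its variable names and coefficients are invented for illustration and do not come from the study. The sketch changes one decision variable at a time while holding the others at their baseline values.

```python
def predict_gross(features):
    """Hypothetical stand-in for an already trained prediction model.
    The coefficients are invented; a real model would be loaded, not hand-written."""
    return (20.0
            + 15.0 * features["star_power"]
            + 8.0 * features["technical_effects"]
            + 12.0 * features["holiday_release"])


def sensitivity(model, baseline, variable, values):
    """Re-score the baseline case with one variable set to each candidate value
    and report the change in the prediction relative to the baseline."""
    base_prediction = model(baseline)
    effects = {}
    for value in values:
        scenario = dict(baseline, **{variable: value})  # copy; baseline stays untouched
        effects[value] = model(scenario) - base_prediction
    return effects


baseline = {"star_power": 1, "technical_effects": 2, "holiday_release": 0}
effects = sensitivity(predict_gross, baseline, "star_power", [0, 1, 2])
print(effects)
```

Because only one variable moves at a time, the reported differences isolate that variable's contribution under the model. They say nothing about interactions unless several variables are varied jointly.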
Questions for Discussion
1. Why is it important for many Hollywood professionals to predict the financial success of movies?
2. How can data mining be used for predicting financial success of movies before the start of their production process?
3. How do you think Hollywood performed, and perhaps still is performing, this task without the help of data mining tools and techniques?
Sources: Sharda, R., & Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 30, 243-254; Delen, D., Sharda, R., & Kumar, P. (2007). Movie Forecast Guru: A Web-based DSS for Hollywood managers. Decision Support Systems, 43, 1151-1170.
SECTION 4.6 REVIEW QUESTIONS
1. What are the most popular commercial data mining tools?
2. Why do you think the most popular tools are developed by statistics-based companies?
3. What are the most popular free data mining tools? Why are they gaining overwhelming popularity (especially R)?
4. What are the main differences between commercial and free data mining software tools?
5. What would be your top five selection criteria for a data mining tool? Explain.
Data Mining Privacy Issues, Myths, and Blunders
Data that is collected, stored, and analyzed in data mining often contains information about real people. Such information may include identification data (name, address, Social Security number, driver's license number, employee number, etc.), demographic data (e.g., age, sex, ethnicity, marital status, number of children), financial data (e.g., salary, gross family income, checking or savings account balance, home ownership, mortgage or loan account specifics, credit card limits and balances, investment account specifics), purchase history (i.e., what is bought from where and when, either from a vendor's transaction records or from credit card transaction specifics), and other personal data (e.g., anniversary, pregnancy, illness, loss in the family, bankruptcy filings). Most of these data can be accessed through some third-party data providers. The main question here is the privacy of the person to whom the data belongs. To maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations. One way to accomplish this is the process of de-identification of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual. Many publicly available data sources (e.g., CDC data, SEER data, UNOS data) are already de-identified. Prior to accessing these data sources, users are often asked to consent that, under no circumstances, will they try to identify the individuals behind those figures.
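As a rough illustration of de-identification, the Python sketch below drops direct identifiers from a customer record and replaces the customer key with a keyed-hash pseudonym, so that records for the same customer can still be linked during analysis. The field names, the sample record, and the `deidentify` helper are assumptions made for this example. They do not describe any particular organization's procedure.

```python
import hashlib
import hmac

# Hypothetical field names for the identification data listed in the text.
DIRECT_IDENTIFIERS = {"name", "address", "ssn", "drivers_license", "employee_number"}


def deidentify(record, secret_key, id_field="customer_id"):
    """Return a copy of the record without direct identifiers.
    The customer key is replaced by an HMAC-SHA256 pseudonym, which is stable
    for a given secret key but cannot be reversed without that key."""
    pseudonym = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != id_field
    }
    cleaned["pseudonym"] = pseudonym
    return cleaned


record = {
    "customer_id": 1001,
    "name": "Jane Doe",          # made-up sample values
    "ssn": "000-00-0000",
    "age": 34,
    "purchase_total": 182.50,
}
print(deidentify(record, secret_key=b"keep-this-key-outside-the-dataset"))
```

Removing direct identifiers is only a first step. Combinations of the remaining attributes (for example, age together with location) may still allow re-identification, which is one reason data providers also ask users to agree not to attempt it.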
There have been a number of instances in the recent past where companies shared their customer data with others without seeking the explicit consent of their customers. For instance, as most of you might recall, in 2003, JetBlue Airlines provided more than 2 million passenger records of their customers to Torch Concepts, a U.S. government contractor. Torch subsequently augmented the passenger data with additional information such as family sizes and Social Security numbers, information purchased from a data broker called Acxiom. The consolidated personal database was intended to be used for a data mining project to develop potential terrorist profiles. All of this was done without notification or consent of passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed against JetBlue, Torch, and Acxiom, and several U.S. senators called for an investigation into the incident (Wald, 2004). Similar, but not as dramatic, privacy-related news came out in the recent past about the popular social network companies, which allegedly were selling customer-specific data to other companies for personalized target marketing.
There was another peculiar story about privacy concerns that made it to the headlines in 2012. In this instance, the company did not even use any private and/or personal data. Legally speaking, there was no violation of any laws. It was about Target and is summarized in Application Case 4.7.
Application Case 4.7
Predicting Customer Buying Patterns—The Target Story
In early 2012, an infamous story appeared concerning Target's practice of predictive analytics. The story was about a teenage girl who was being sent advertising flyers and coupons by Target for the kinds of things that a new mother-to-be would buy from a store like Target. The story goes like this: An angry man went into a Target outside of Minneapolis, demanding to talk to a manager: “My daughter got this in the mail!” he said. “She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn't have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man's daughter and contained advertisements for maternity clothing, nursery furniture, and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there's been some activities in my house I haven't been completely aware of. She's due in August. I owe you an apology.”
As it turns out, Target figured out a teen girl was pregnant before her father did! Here is how they did it. Target assigns every customer a Guest ID number (tied to their credit card, name, or e-mail address) that becomes a placeholder that keeps a history of everything they have bought. Target augments this data with any demographic information that it has collected from them or bought from other information sources. Using this information, Target looked at historical buying data for all the females who had signed up for Target baby registries in the past. They analyzed the data from all directions, and soon enough some useful patterns emerged. For example, lotions and special vitamins were among the products with interesting purchase patterns. Lots of people buy lotion, but what they noticed was that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-large bags of cotton balls, in addition to hand sanitizers and washcloths, it signals that they could be getting close to their delivery date. In the end, they were able to identify about 25 products that, when analyzed together, allowed them to assign each shopper a “pregnancy prediction” score. More important, they could also estimate a woman's due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
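To make the notion of a score built from several products "analyzed together" concrete, the following Python sketch combines purchase counts for a handful of products into a single value between 0 and 1 using a weighted sum passed through a logistic function. The source does not describe the actual model form, products list, or weights, so everything here (the four product keys, the weights, the bias, and the cap on counts) is a hypothetical illustration.

```python
import math

# Hypothetical weights for a few of the product types mentioned in the text.
WEIGHTS = {
    "unscented_lotion": 1.2,
    "mineral_supplements": 1.0,
    "scent_free_soap": 0.8,
    "cotton_balls_xl": 0.7,
}
BIAS = -3.0          # hypothetical intercept: an empty basket yields a low score
MAX_COUNT = 3        # cap each product count so one bulk purchase cannot dominate


def pattern_score(basket_counts):
    """Weighted sum of capped purchase counts, squashed to (0, 1) by a logistic function."""
    z = BIAS + sum(
        weight * min(basket_counts.get(product, 0), MAX_COUNT)
        for product, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


print(round(pattern_score({}), 3))
print(round(pattern_score({"unscented_lotion": 2, "mineral_supplements": 1}), 3))
```

In practice such weights would be estimated from historical data (for example, purchases of registry members versus other shoppers) rather than set by hand, and the ethical questions raised in this case apply regardless of how the score is computed.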
If you look at this practice from a legal perspective, you would conclude that Target did not use any information that violates customer privacy; rather, they used transactional data that almost every other retail chain is collecting and storing (and perhaps analyzing) about their customers. What was disturbing in this scenario was perhaps the targeted concept: pregnancy. There are certain events or concepts that should be off limits or treated extremely cautiously, such as terminal disease, divorce, and bankruptcy.
Questions for Discussion
1. What do you think about data mining and its implication for privacy? What is the threshold between discovery of knowledge and infringement of privacy?
2. Did Target go too far? Did it do anything illegal? What do you think Target should have done? What do you think Target should do next (quit these types of practices)?
Sources: Hill, K. (2012, February 16). How Target figured out a teen girl was pregnant before her father did. Forbes; Nolan, R. (2012, February 21). Behind the cover story: How much does Target know? NYTimes.com.
Data Mining Myths and Blunders
Data mining is a powerful analytical tool that enables business executives to advance from describing the nature of the past (looking at a rearview mirror) to predicting the future (looking ahead) to better manage their business operations (making accurate and timely decisions). Data mining helps marketers find patterns that unlock the mysteries of customer behavior. The results of data mining can be used to increase revenue and reduce cost by identifying fraud and discovering business opportunities, offering a whole new realm of competitive advantage. As an evolving and maturing field, data mining is often associated with a number of myths, including those listed in Table 4.6 (Delen, 2014; Zaima, 2003).

TABLE 4.6

Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.

Myth: Data mining is not yet viable for mainstream business applications.
Reality: The current state of the art is ready to go for almost any business type and/or size.

Myth: Data mining requires a separate, dedicated database.
Reality: Because of the advances in database technology, a dedicated database is not required.

Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.

Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, any company can use data mining.

Data mining visionaries have gained enormous competitive advantage by under-
standing that these myths are just that: myths.
Although the value proposition of data mining, and therefore its necessity, is obvious to anyone, those who carry out data mining projects (from novice to seasoned data scientist) sometimes make mistakes that result in projects with less-than-desirable outcomes. The following 16 data mining mistakes (also called blunders, pitfalls, or bloopers) are often made in practice (Nesbit et al., 2009; Shultz, 2004; Skalak, 2001), and data scientists should be aware of them and, to the extent possible, do their best to avoid them:
1. Selecting the wrong problem for data mining. Not every business problem can be solved with data mining (i.e., the magic bullet syndrome). When there is no representative data (large and feature rich), there cannot be a practicable data mining project.
2. Ignoring what your sponsor thinks data mining is and what it really can and cannot do. Expectation management is the key for successful data mining projects.
3. Beginning without the end in mind. Although data mining is a process of knowledge discovery, one should have a goal/objective (a stated business problem) in mind to succeed because, as the saying goes, “if you don't know where you are going, you will never get there.”
4. Defining the project around a foundation that your data can't support. Data mining is all about data; that is, the biggest constraint that you have in a data mining project is the richness of the data. Knowing what the limitations of the data are helps you craft feasible projects that deliver results and meet expectations.
5. Leaving insufficient time for data preparation. It takes more effort than is generally understood. The common knowledge suggests that up to a third of the total project time is spent on data acquisition, understanding, and preparation tasks. To succeed, avoid proceeding into modeling until after your data is properly processed (aggregated, cleaned, and transformed).
6. Looking only at aggregated results and not at individual records. Data mining is at its best when the data is at a granular representation. Try to avoid unnecessarily aggregating and overly simplifying data to help data mining algorithms; they don't really need your help and are more than capable of figuring it out themselves.
7. Being sloppy about keeping track of the data mining procedure and results. Because it is a discovery process that involves many iterations and experimentations, it is easy to lose track of the findings. Success requires systematic and orderly planning, execution, and tracking/recording of all data mining tasks.
8. Using data from the future to predict the future. Because of the lack of description and understanding of the data, oftentimes analysts include variables that are unknown at the time when the prediction is supposed to be made. By doing so, their prediction models produce unbelievably accurate results (a phenomenon that is often called “fool's gold”). If your prediction results are too good to be true, they usually are; in that case, the first thing that you need to look for is the incorrect use of a variable from the future.
9. Ignoring suspicious findings and quickly moving on. The unexpected findings are often the indicators of real novelties in data mining projects. Proper investigation of such oddities can lead to surprisingly pleasing discoveries.
10. Starting with a high-profile complex project that will make you a superstar. Data mining projects often fail if they are not thought out carefully from start to end. Success often comes with a systematic and orderly progression of projects from smaller/simpler to larger/complex ones. The goal should be to show incremental and continuous value added, as opposed to taking on a large project that will consume resources without producing any valuable outcomes.
11. Running data mining algorithms repeatedly and blindly. Although today's data mining tools are capable of consuming data and setting algorithmic parameters to produce results, one should know how to transform the data and set the proper parameter values to obtain the best possible results. Each algorithm has its own unique way of processing data, and knowing that is necessary to get the most out of each model type.
12. Ignoring the subject matter experts. Understanding the problem domain and the related data requires a highly involved collaboration between the data mining expert and the domain experts. Working together helps the data mining expert to go beyond the syntactic representation and also obtain the semantic nature (i.e., the true meaning of the variables) of the data.
13. Believing everything you are told about the data. Although it is necessary to talk to domain experts to better understand the data and the business problem, the data scientist should not take anything for granted. Validation and verification through a critical analysis is the key to intimate understanding and processing of the data.
14. Assuming that the keepers of the data will be fully on board with cooperation. Many data mining projects fail because the data mining expert did not know/understand the organizational politics. One of the biggest obstacles in data mining projects can be the people who own and control the data. Understanding and managing the politics is key to identifying, accessing, and properly understanding the data to produce a successful data mining project.
15. Measuring your results differently from the way your sponsor measures them. The results should talk/appeal to the end user (manager/decision maker) who will be using them. Therefore, producing the results in a measure and format that appeals to the end user tremendously increases the likelihood of true understanding and proper use of the data mining outcomes.
16. If you build it, they will come: don't worry about how to serve it up. Usually, data
mining experts think they are done once they build models that meet and hopefully