Big Data 4 (7)

Chapter 4 • Predictive Analytics I: Data Mining Process, Methods, and Algorithms

TABLE 4.5 Tabulated Prediction Results for Individual and Ensemble Models

                                  Individual Models                      Ensemble Models
Performance Measure    SVM          ANN          CART         Random Forest  Boosted Tree  Fusion (Average)
Count (Bingo)          [illegible]  182          140          [illegible]    [illegible]   194
Count (1-Away)         104          [illegible]  126          [illegible]    104           120
Accuracy (% Bingo)     [illegible]  52.60%       [illegible]  [illegible]    [illegible]   [illegible]
Accuracy (% 1-Away)    [illegible]  [illegible]  [illegible]  [illegible]    [illegible]   [illegible]
Standard deviation     [illegible]  0.87         1.05         [illegible]    [illegible]   0.63

for the prediction problem. The upper-left side of the process map shows the model development process, and the lower-right corner of the process map shows the model assessment (i.e., testing or scoring) process (more details on the IBM SPSS Modeler tool and its usage can be found on the book's Web site).

Results
Table 4.5 provides the prediction results of all three data mining methods as well as the results of the three different ensembles. The first performance measure is the percent correct classification rate, which is called Bingo. Also reported in the table is the 1-Away correct classification rate (i.e., within one category). The results indicate that SVM performed the best among the individual prediction models, followed by ANN; the worst of the three was the CART decision tree algorithm.
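The Bingo and 1-Away measures described above can be made concrete with a short sketch. It assumes the outcome classes are ordered categories encoded as consecutive integers, and it treats 1-Away as the share of predictions that land within one category of the actual class (exact hits included). The function name `bingo_and_one_away` and the sample labels are hypothetical and are not taken from the study.

```python
def bingo_and_one_away(actual, predicted):
    """Return (bingo_rate, one_away_rate) for ordinal class labels.

    Assumption: classes are consecutive integers, so "within one
    category" means an absolute difference of at most 1.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    # Bingo: the predicted category equals the actual category.
    bingo = sum(1 for a, p in zip(actual, predicted) if a == p)
    # 1-Away: the prediction is exact or off by a single category.
    one_away = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= 1)
    return bingo / n, one_away / n


# Hypothetical labels for nine test cases (not data from the study).
actual = [1, 2, 3, 4, 5, 6, 7, 8, 9]
predicted = [1, 3, 3, 6, 5, 5, 9, 8, 9]
bingo_rate, one_away_rate = bingo_and_one_away(actual, predicted)
print(f"Bingo: {bingo_rate:.2%}, 1-Away: {one_away_rate:.2%}")
```

Because every exact hit is also within one category, the 1-Away rate computed this way can never be lower than the Bingo rate.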
In general, the ensemble models performed better than the individual prediction models, and among the ensembles the fusion algorithm performed the best. What is probably more important to decision makers, and what stands out in the results table, is the significantly lower standard deviation obtained from the ensembles compared to the individual models.

Conclusion
The researchers claim that these prediction results are better than any reported in the published literature for this problem domain. Beyond the attractive accuracy of their prediction results of the box-office receipts, these models could also be used to further analyze (and potentially optimize) the decision variables to maximize the financial return. Specifically, the parameters used for modeling could be altered using the already trained prediction models to better understand the impact of different parameters on the end results. During this process, which is commonly referred to as sensitivity analysis, the decision maker of a given entertainment firm could find out, with a fairly high accuracy level, how much value a specific actor (or a specific release date, or the addition of more technical effects, etc.) brings to the financial success of a film, making the underlying system an invaluable decision aid.

Questions for Discussion
1. Why is it important for many Hollywood professionals to predict the financial success of movies?
2. How can data mining be used for predicting financial success of movies before the start of their production process?
3. How do you think Hollywood did, and perhaps still is performing, this task without the help of data mining tools and techniques?

Sources: Sharda, R., & Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 30, 243-254; Delen, D., Sharda, R., & Kumar, P. (2007). Movie Forecast Guru: A Web-based DSS for Hollywood managers.
Decision Support Systems, 43, 1151-1170.

SECTION 4.6 REVIEW QUESTIONS
1. What are the most popular commercial data mining tools?
2. Why do you think the most popular tools are developed by statistics-based companies?
3. What are the most popular free data mining tools? Why are they gaining overwhelming popularity (especially R)?
4. What are the main differences between commercial and free data mining software tools?
5. What would be your top five selection criteria for a data mining tool? Explain.

Data Mining Privacy Issues, Myths, and Blunders
Data that is collected, stored, and analyzed in data mining often contains information about real people. Such information may include identification data (name, address, Social Security number, driver's license number, employee number, etc.), demographic data (e.g., age, sex, ethnicity, marital status, number of children), financial data (e.g., salary, gross family income, checking or savings account balance, home ownership, mortgage or loan account specifics, credit card limits and balances, investment account specifics), purchase history (i.e., what is bought from where and when, either from the vendor's transaction records or from credit card transaction specifics), and other personal data (e.g., anniversary, pregnancy, illness, loss in the family, bankruptcy filings). Most of these data can be accessed through some third-party data providers. The main question here is the privacy of the person to whom the data belongs.

To maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations. One way to accomplish this is the process of de-identification of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual. Many publicly available data sources (e.g., CDC data, SEER data, UNOS data) are already de-identified.
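As a rough illustration of the de-identification step described above, the sketch below drops direct identifiers, replaces the customer key with a keyed hash, and coarsens age into ten-year bands. The field names, the `deidentify` function, and the choice of HMAC-SHA256 are assumptions made for this example only; keyed hashing is a form of pseudonymization and does not by itself guarantee that a record can never be traced back to a person.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names for direct identifiers (an assumption of this sketch).
DIRECT_IDENTIFIERS = {"name", "address", "ssn", "drivers_license", "employee_number"}


def deidentify(record, secret_key):
    """Return a copy of `record` with identifying fields removed or coarsened."""
    # Drop direct identifiers and the raw customer key.
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "customer_id"
    }
    # Replace the customer key with a keyed hash so that records belonging to
    # the same customer can still be linked without exposing the original key.
    digest = hmac.new(
        secret_key, str(record["customer_id"]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned["pseudonym"] = digest[:16]
    # Generalize exact age into a ten-year band.
    if "age" in cleaned:
        low = (cleaned.pop("age") // 10) * 10
        cleaned["age_band"] = f"{low}-{low + 9}"
    return cleaned


# Usage with a made-up record; the key must be stored separately from the data.
key = secrets.token_bytes(32)
raw = {"customer_id": 1001, "name": "Jane Doe", "ssn": "000-00-0000",
       "age": 34, "basket_total": 58.2}
print(deidentify(raw, key))
```

Remaining attributes such as purchase history can still act as quasi-identifiers when combined, so a step like this is a starting point and not a complete privacy safeguard.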
Prior to accessing these data sources, users are often asked to consent that, under no circumstances, will they try to identify the individuals behind those figures.

There have been a number of instances in the recent past where companies shared their customer data with others without seeking the explicit consent of their customers. For instance, as most of you might recall, in 2003, JetBlue Airlines provided more than 2 million passenger records of their customers to Torch Concepts, a U.S. government contractor. Torch subsequently augmented the passenger data with additional information such as family sizes and Social Security numbers, information purchased from a data broker called Acxiom. The consolidated personal database was intended to be used for a data mining project to develop potential terrorist profiles. All of this was done without notification or consent of passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed against JetBlue, Torch, and Acxiom, and several U.S. senators called for an investigation into the incident (Wald, 2004). Similar, but not as dramatic, privacy-related news came out in the recent past about the popular social network companies, which allegedly were selling customer-specific data to other companies for personalized target marketing.

There was another peculiar story about privacy concerns that made it to the headlines in 2012. In this instance, the company did not even use any private and/or personal data. Legally speaking, there was no violation of any laws. It was about Target and is summarized in Application Case 4.7.

Predicting Customer Buying Patterns: The Target Story
In early 2012, an infamous story appeared concerning Target's practice of predictive analytics.
The story was about a teenage girl who was being sent advertising flyers and coupons by Target for the kinds of things that a new mother-to-be would buy from a store like Target. The story goes like this: An angry man went into a Target outside of Minneapolis, demanding to talk to a manager: “My daughter got this in the mail!” he said. “She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn't have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man's daughter and contained advertisements for maternity clothing, nursery furniture, and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there's been some activities in my house I haven't been completely aware of. She's due in August. I owe you an apology.”

As it turns out, Target figured out a teen girl was pregnant before her father did! Here is how they did it. Target assigns every customer a Guest ID number (tied to their credit card, name, or e-mail address) that becomes a placeholder that keeps a history of everything they have bought. Target augments this data with any demographic information that it has collected from them or bought from other information sources. Using this information, Target looked at historical buying data for all the females who had signed up for Target baby registries in the past. They analyzed the data from all directions, and soon enough some useful patterns emerged. For example, lotions and special vitamins were among the products with interesting purchase patterns.
Lots of people buy lotion, but what they noticed was that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-large bags of cotton balls, in addition to hand sanitizers and washcloths, it signals that they could be getting close to their delivery date. In the end, they were able to identify about 25 products that, when analyzed together, allowed them to assign each shopper a “pregnancy prediction” score. More important, they could also estimate a woman's due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.

If you look at this practice from a legal perspective, you would conclude that Target did not use any information that violates customer privacy; rather, they used transactional data that almost every other retail chain is collecting and storing (and perhaps analyzing) about their customers. What was disturbing in this scenario was perhaps the targeted concept: pregnancy. There are certain events or concepts that should be off limits or treated extremely cautiously, such as terminal disease, divorce, and bankruptcy.

Questions for Discussion
1. What do you think about data mining and its implication for privacy? What is the threshold between discovery of knowledge and infringement of privacy?
2. Did Target go too far? Did it do anything illegal? What do you think Target should have done? What do you think Target should do next (quit these types of practices)?

Sources: Hill, K. (2012, February 16). How Target figured out a teen girl was pregnant before her father did. Forbes; Nolan, R. (2012, February 21). Behind the cover story: How much does Target know? NYTimes.com.
Data Mining Myths and Blunders
Data mining is a powerful analytical tool that enables business executives to advance from describing the nature of the past (looking at a rearview mirror) to predicting the future (looking ahead) to better manage their business operations (making accurate and timely decisions). Data mining helps marketers find patterns that unlock the mysteries of customer behavior. The results of data mining can be used to increase revenue and reduce cost by identifying fraud and discovering business opportunities, offering a whole new realm of competitive advantage.

As an evolving and maturing field, data mining is often associated with a number of myths, including those listed in Table 4.6 (Delen, 2014; Zaima, 2003). Data mining visionaries have gained enormous competitive advantage by understanding that these myths are just that: myths.

TABLE 4.6

Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.

Myth: Data mining is not yet viable for mainstream business applications.
Reality: The current state of the art is ready to go for almost any business type and/or size.

Myth: Data mining requires a separate, dedicated database.
Reality: Because of the advances in database technology, a dedicated database is not required.

Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.

Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, any company can use data mining.

Although the value proposition of data mining, and therefore its necessity, is obvious to anyone, those who carry out data mining projects (from novice to seasoned data scientist) sometimes make mistakes that result in projects with less-than-desirable outcomes.
The following 16 data mining mistakes (also called blunders, pitfalls, or bloopers) are often made in practice (Nesbit et al., 2009; Shultz, 2004; Skalak, 2001); data scientists should be aware of them and, to the extent possible, do their best to avoid them:

1. Selecting the wrong problem for data mining. Not every business problem can be solved with data mining (i.e., the magic bullet syndrome). When there is no representative data (large and feature rich), there cannot be a practicable data mining project.
2. Ignoring what your sponsor thinks data mining is and what it really can and cannot do. Expectation management is the key for successful data mining projects.
3. Beginning without the end in mind. Although data mining is a process of knowledge discovery, one should have a goal/objective (a stated business problem) in mind to succeed because, as the saying goes, “if you don't know where you are going, you will never get there.”
4. Defining the project around a foundation that your data can't support. Data mining is all about data; that is, the biggest constraint that you have in a data mining project is the richness of the data. Knowing what the limitations of the data are helps you craft feasible projects that deliver results and meet expectations.
5. Leaving insufficient time for data preparation. It takes more effort than is generally understood. The common knowledge suggests that up to a third of the total project time is spent on data acquisition, understanding, and preparation tasks. To succeed, avoid proceeding into modeling until after your data is properly processed (aggregated, cleaned, and transformed).
6. Looking only at aggregated results and not at individual records. Data mining is at its best when the data is at a granular representation. Try to avoid unnecessarily aggregating and overly simplifying data to help data mining algorithms; they don't really need your help and are more than capable of figuring it out themselves.
7. Being sloppy about keeping track of the data mining procedure and results. Because it is a discovery process that involves many iterations and experimentations, it is highly likely that one will lose track of the findings. Success requires systematic and orderly planning, execution, and tracking/recording of all data mining tasks.
8. Using data from the future to predict the future. Because of the lack of description and understanding of the data, oftentimes analysts include variables that are unknown at the time when the prediction is supposed to be made. By doing so, their prediction models produce unbelievably accurate results (a phenomenon that is often called “fool's gold”). If your prediction results are too good to be true, they usually are; in that case, the first thing that you need to look for is the incorrect use of a variable from the future.
9. Ignoring suspicious findings and quickly moving on. The unexpected findings are often the indicators of real novelties in data mining projects. Proper investigation of such oddities can lead to surprisingly pleasing discoveries.
10. Starting with a high-profile complex project that will make you a superstar. Data mining projects often fail if they are not thought out carefully from start to end. Success often comes with a systematic and orderly progression of projects from smaller/simpler to larger/complex ones. The goal should be to show incremental and continuous value added, as opposed to taking on a large project that will consume resources without producing any valuable outcomes.
11. Running data mining algorithms repeatedly and blindly. Although today's data mining tools are capable of consuming data and setting algorithmic parameters to produce results, one should know how to transform the data and set the proper parameter values to obtain the best possible results. Each algorithm has its own unique way of processing data, and knowing that is necessary to get the most out of each model type.
12. Ignoring the subject matter experts. Understanding the problem domain and the related data requires a highly involved collaboration between the data mining and the domain experts. Working together helps the data mining expert to go beyond the syntactic representation and also obtain the semantic nature (i.e., the true meaning of the variables) of the data.
13. Believing everything you are told about the data. Although it is necessary to talk to domain experts to better understand the data and the business problem, the data scientist should not take anything for granted. Validation and verification through critical analysis are the key to an intimate understanding and processing of the data.
14. Assuming that the keepers of the data will be fully on board with cooperation. Many data mining projects fail because the data mining expert did not know/understand the organizational politics. One of the biggest obstacles in data mining projects can be the people who own and control the data. Understanding and managing the politics is key to identifying, accessing, and properly understanding the data to produce a successful data mining project.
15. Measuring your results differently from the way your sponsor measures them. The results should talk/appeal to the end user (manager/decision maker) who will be using them. Therefore, producing the results in a measure and format that appeals to the end user tremendously increases the likelihood of true understanding and proper use of the data mining outcomes.
16. If you build it, they will come: don't worry about how to serve it up. Usually, data mining experts think they are done once they build models that meet and hopefully
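The blunder listed above as "using data from the future to predict the future" can be guarded against with a simple availability check before modeling. The sketch below is a minimal illustration and not a method from the cited sources: the `Feature` class, the `split_leaky_features` function, the day offsets, and the movie-related variable names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    # Day on which the value becomes known, relative to the event being
    # predicted (negative = before the event, positive = after it).
    known_at_day: int


def split_leaky_features(features, prediction_day):
    """Separate features that are known at prediction time from those that are not."""
    usable = [f.name for f in features if f.known_at_day <= prediction_day]
    leaky = [f.name for f in features if f.known_at_day > prediction_day]
    return usable, leaky


# Hypothetical candidate variables for a forecast made 180 days before release.
candidates = [
    Feature("genre", -365),
    Feature("star_power", -365),
    Feature("number_of_screens", -14),
    Feature("opening_weekend_gross", 3),
]
usable, leaky = split_leaky_features(candidates, prediction_day=-180)
print("usable:", usable)
print("leaky: ", leaky)
```

Recording when each variable becomes known, as the `known_at_day` field does here, turns the "too good to be true" warning into something that can be checked mechanically before any model is trained.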
