Fundamentals of Predictive Analytics with JMP, Second Edition
Ron Klimberg and B. D. McCullough

Chapter 1: Introduction

Historical Perspective

Two Questions Organizations Need to Ask

Return on Investment

Cultural Change

Business Intelligence and Business Analytics

Introductory Statistics Courses

The Problem of Dirty Data

Added Complexities in Multivariate Analysis

Practical Statistical Study

Obtaining and Cleaning the Data

Understanding the Statistical Study as a Story

The Plan-Perform-Analyze-Reflect Cycle

Using Powerful Software

Framework and Chapter Sequence

Historical Perspective

In 1981, Bill Gates made his infamous statement that 640KB ought to be enough for anybody (Lai, 2008).

Looking back even further, about 10 to 15 years before Bill Gates’s statement, we were in the middle of the Vietnam War era. State-of-the-art computer technology for both commercial and scientific areas at that time was the mainframe computer. A typical mainframe computer weighed tons, took an entire floor of a building, had to be air-conditioned, and cost about $3 million. Mainframe memory was approximately 512 KB with disk space of about 352 MB and speed up to 1 MIPS (million instructions per second).

In 2016, only 45 years later, an iPhone 6 with 32 GB of memory has about 9,300% more memory than the mainframe and can fit in a hand. A laptop with the Intel Core i7 processor has speeds up to 238,310 MIPS, about 240,000 times faster than the old mainframe, and weighs less than 4 pounds. Further, an iPhone or a laptop costs significantly less than $3 million. As Ray Kurzweil, an author, inventor, and futurist, has stated (Lomas, 2008): "The computer in your cell phone today is a million times cheaper and a thousand times more powerful and about a hundred thousand times smaller (than the one computer at MIT in 1965), and so that's a billion-fold increase in capability per dollar or per euro that we've actually seen in the last 40 years."

Technology has certainly changed!

Two Questions Organizations Need to Ask

Many organizations have realized or are just now starting to realize the importance of using analytics. One of the first strides an organization should take toward becoming an analytical competitor is to ask itself the following two questions.

Return on Investment

With this new and ever-improving technology, most organizations (and even small organizations) are collecting an enormous amount of data. Each department has one or more computer systems. Many organizations are now integrating these department-level systems with organization systems, such as an enterprise resource planning (ERP) system. Newer systems are being deployed that store all these historical enterprise data in what is called a data warehouse. The IT budget for most organizations is a significant percentage of the organization’s overall budget and is growing. The question is as follows:

With the huge investment in collecting this data, do organizations get a decent return on investment (ROI)?

The answer: mixed. No matter if the organization is large or small, only a limited number of organizations (yet growing in number) are using their data extensively. Meanwhile, most organizations are drowning in their data and struggling to gain some knowledge from it.

Cultural Change

How would managers respond to this question: What are your organization’s two most important assets?

Most managers would answer with their employees and the product or service that the organization provides (they might alternate which is first or second).

The follow-up question is more challenging: Given the first two most important assets of most organizations, what is the third most important asset of most organizations?

The actual answer is the organization’s data! But to most managers, regardless of the size of their organizations, this answer would be a surprise. However, consider the vast amount of knowledge that’s contained in customer or internal data. For many organizations, realizing and accepting that their data is the third most important asset would require a significant cultural change.

Rushing to the rescue in many organizations is the development of business intelligence (BI) and business analytics (BA) departments and initiatives. What is BI? What is BA? The answers seem to vary greatly depending on your background.

Business Intelligence and Business Analytics

Business intelligence (BI) and business analytics (BA) are considered by most people as providing information technology systems, such as dashboards and online analytical processing (OLAP) reports, to improve business decision-making. An expanded definition of BI is that it is a "broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining" (Rahman, 2009).

Figure 1.1: A Framework of Business Analytics

The scope of BI and its growing applications have revitalized an old term: business analytics (BA). Davenport (Davenport and Harris, 2007) views BA as the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions. Davenport further elaborates that organizations should develop an analytics competency as a distinctive business capability that would provide the organization with a competitive advantage.

In 2007, BA was viewed as a subset of BI. However, in recent years, this view has changed. Today, BA is viewed as including BI’s core functions of reporting, OLAP and descriptive statistics, as well as the advanced analytics of data mining, forecasting, simulation, and optimization. Figure 1.1 presents a framework (adapted from Klimberg and Miori, 2010) that embraces this expanded definition of BA (or simply analytics) and shows the relationship of its three disciplines (Information Systems/Business Intelligence, Statistics, and Operations Research) (Gorman and Klimberg, 2014). The Institute of Operations Research and Management Science (INFORMS), one of the largest professional and academic organizations in the field of analytics, breaks analytics into three categories:

●   Descriptive analytics: provides insight into the past by using tools such as queries, reports, and descriptive statistics.

●   Predictive analytics: provides an understanding of the future by using predictive modeling, forecasting, and simulation.

●   Prescriptive analytics: provides advice on future decisions by using optimization.

The buzzword in this area of analytics for about the last 25 years has been data mining. Data mining is the process of finding patterns in data, usually using some advanced statistical techniques. The current buzzwords are predictive analytics and predictive modeling. What is the difference among these three terms? As with the many and evolving definitions of business intelligence, these terms seem to have many different yet quite similar definitions. Chapter 17 briefly discusses their different definitions. This text, however, generally will not distinguish between data mining, predictive analytics, and predictive modeling and will use the terms interchangeably.

Most of the terms mentioned here include the adjective business (as in business intelligence and business analytics). Even so, these techniques and tools can be applied outside the business world and are used in the public and social sectors. In general, wherever data is collected, these tools and techniques can be applied.

Introductory Statistics Courses

Most introductory statistics courses (outside the mathematics department) cover the following topics:

●   descriptive statistics

●   probability

●   probability distributions (discrete and continuous)

●   sampling distribution of the mean

●   confidence intervals

●   one-sample hypothesis testing

They might also cover the following:

●   two-sample hypothesis testing

●   simple linear regression

●   multiple linear regression

●   analysis of variance (ANOVA)

Yes, multiple linear regression and ANOVA are multivariate techniques. But the complexity of their multivariate nature is, for the most part, not addressed in the introductory statistics course. One main reason: not enough time!

Nearly all the topics, problems, and examples in the course are directed toward univariate (one variable) or bivariate (two variables) analysis. Univariate analysis includes techniques to summarize the variable and make statistical inferences from the data to a population parameter. Bivariate analysis examines the relationship between two variables (for example, the relationship between age and weight).

A typical student’s understanding of the components of a statistical study is shown in Figure 1.2. If the data are not available, a survey is performed or the data are purchased. Once the data are obtained, all at one time, the statistical analyses are done—using Excel or a statistical package, drawing the appropriate graphs and tables, performing all the necessary statistical tests, and writing up or otherwise presenting the results. And then you are done. With such a perspective, many students simply look at this statistics course as another math course and might not realize the importance and consequences of the material.

Figure 1.2: A Student’s View of a Statistical Study from a Basic Statistics Course

The Problem of Dirty Data

Although these first statistics courses provide a good foundation in introductory statistics, they provide a rather weak foundation for performing practical statistical studies. First, most real-world data are dirty. Dirty data are erroneous data, missing values, incomplete records, and the like. For example, suppose a data field or variable that represents gender is supposed to be coded as either M or F. If you find the letter N in the field or even a blank instead, then you have dirty data. Learning to identify dirty data and to determine corrective action are fundamental skills needed to analyze real-world data. Chapter 3 will discuss dirty data in detail.
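As a minimal illustration of such a check (a sketch only, using a hypothetical pandas DataFrame with a Gender column; the book itself works in JMP and Excel), the following flags any value outside the expected codes:

```python
import pandas as pd

# Hypothetical records; the Gender field should contain only "M" or "F"
df = pd.DataFrame({"Gender": ["M", "F", "N", None, "F", "M", ""]})

valid_codes = {"M", "F"}

# Rows whose Gender is missing or not a valid code are dirty data
dirty_rows = df[~df["Gender"].isin(valid_codes)]

print(dirty_rows)                        # inspect the offending records
print(len(dirty_rows), "of", len(df), "records need corrective action")
```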

Added Complexities in Multivariate Analysis

Second, most practical statistical studies have data sets that include more than two variables, called multivariate data. Multivariate analysis uses some of the same techniques and tools used in univariate and bivariate analysis as covered in the introductory statistics courses, but in an expanded and much more complex manner. Also, when performing multivariate analysis, you are exploring the relationships among several variables. There are several multivariate statistical techniques and tools to consider that are not covered in a basic applied statistics course.

Before jumping into multivariate techniques and tools, students need to learn the univariate and bivariate techniques and tools taught in the basic first statistics course. However, in some programs this basic introductory statistics class might be the last data analysis course required or offered. In many other programs that do offer or require a second statistics course, that course is just a continuation of the first, which might or might not cover ANOVA and multiple linear regression. (Although ANOVA and multiple linear regression are multivariate topics, the reference here is to a second statistics course that goes beyond them.) In either case, students are ill-prepared to apply statistical tools to real-world multivariate data. Perhaps, with some minor adjustments, real-world statistical analysis can be introduced into these programs.

On the other hand, with the growing interest in BI, BA, and predictive analytics, more programs are offering and sometimes even requiring a subsequent statistics course in predictive analytics. So, most students jump from univariate/bivariate statistical analysis to statistical predictive analytics techniques, which include numerous variables and records. These statistical predictive analytics techniques require the student to understand the fundamental principles of multivariate statistical analysis and, more so, to understand the process of a statistical study. In this situation, many students are lost, which simply reinforces the students’ view that the course is just another math course.

Practical Statistical Study

Even with these shortcomings in multivariate preparation, there is still a more significant concern to address: the idea that most students view statistical analysis as a straightforward exercise in which you sit down once in front of your computer and simply perform the necessary statistical techniques and tools, as in Figure 1.2. How boring! With such a viewpoint, this would be like claiming that you can read a book simply by reading its cover. The practical statistical study process of uncovering the story behind the data is what makes the work exciting.

Obtaining and Cleaning the Data

The prologue to a practical statistical study is determining the proper data needed, obtaining the data, and, if necessary, cleaning the data (the dotted area in Figure 1.3). Answering the questions Who is it for? and How will it be used? identifies the suitable variables required and the appropriate level of detail: who will use the results, and how they will use them, determine which variables are necessary and the level of granularity. If there is enough time and the essential data are not available, then the data might have to be obtained through a survey, a purchase, an experiment, compilation from different systems or databases, or other possible sources. Once the data are available, they will most likely first have to be cleaned, in essence eliminating erroneous data as much as possible. Various data manipulations will be needed to prepare the data for analysis, such as creating new derived variables, transforming data, and changing the units of measure. The data might also need to be aggregated or compiled in various ways. These preliminary steps can account for about 75% of the time of a statistical study and are discussed further in Chapter 17.
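A brief sketch of such manipulations (hypothetical column names, shown in pandas purely for illustration; the book carries out these steps in JMP or Excel) might look like this:

```python
import pandas as pd

# Hypothetical sales records; the column names are illustrative only
sales = pd.DataFrame({
    "region":    ["East", "East", "West", "West"],
    "revenue":   [1200.0, 950.0, 1800.0, 600.0],   # dollars
    "cost":      [800.0, 700.0, 1100.0, 450.0],    # dollars
    "weight_lb": [25.0, 18.0, 40.0, 12.0],
})

# Derived variable: profit margin
sales["margin"] = (sales["revenue"] - sales["cost"]) / sales["revenue"]

# Change of units: pounds to kilograms
sales["weight_kg"] = sales["weight_lb"] * 0.453592

# Aggregate to the level of detail the decision-makers need
by_region = sales.groupby("region")[["revenue", "margin"]].mean()
print(by_region)
```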

As shown in Figure 1.3, the importance placed on the statistical study by the decision-makers/users and the amount of time allotted for the study will determine whether the study will be only a statistical data discovery or a more complete statistical analysis. Statistical data discovery is the discovery of significant and insignificant relationships among the variables and the observations in the data set.

Figure 1.3: The Flow of a Real-World Statistical Study

Understanding the Statistical Study as a Story

The statistical analysis (the enclosed dashed-line area in Figure 1.3) should be read like a book—the data should tell a story. The first part of the story and continuing throughout the study is the statistical data discovery.

The story develops further as many different statistical techniques and tools are tried. Some will be helpful, some will not. With each iteration of applying the statistical techniques and tools, the story develops and is substantially further advanced when you relate the statistical results to the actual problem situation. As a result, your understanding of the problem and how it relates to the organization is improved. By doing the statistical analysis, you will make better decisions (most of the time). Furthermore, these decisions will be more informed so that you will be more confident in your decision. Finally, uncovering and telling this statistical story is fun!

The Plan-Perform-Analyze-Reflect Cycle

The development of the statistical story follows a process that is called here the plan-perform-analyze-reflect (PPAR) cycle, as shown in Figure 1.4. The PPAR cycle is an iterative progression.

Figure 1.4: The PPAR Cycle

The first step is to plan which statistical techniques or tools to apply. Here you combine your statistical knowledge with your understanding of the business problem being addressed, asking pointed, directed questions and identifying a particular statistical tool or technique to help answer the business question.

The second step is to perform the statistical analysis, using statistical software such as JMP.

The third step is to analyze the results, using appropriate statistical tests and other relevant criteria to evaluate them. The fourth step is to reflect on the statistical results. Ask questions such as: What do the statistical results mean in terms of the problem situation? What insights have I gained? Can I draw any conclusions? Sometimes the results are extremely useful, sometimes meaningless, and sometimes in the middle: a potentially significant relationship.

Then, it is back to the first step to plan what to do next. Each progressive iteration provides a little more to the story of the problem situation. This cycle continues until you feel you have exhausted all possible statistical techniques or tools (visualization, univariate, bivariate, and multivariate statistical techniques) to apply, or you have results sufficient to consider the story completed.

Using Powerful Software

The software used in many initial statistics courses is Microsoft Excel, which is easily accessible and provides some basic statistical capabilities. However, as you advance through the course, because of Excel’s statistical limitations, you might also use some nonprofessional, textbook-specific statistical software or perhaps some professional statistical software. Excel is not a professional statistics software application; it is a spreadsheet.

The statistical software application used in this book is SAS JMP. JMP provides the advanced statistical techniques covered in this book, along with the associated, professionally proven, high-quality algorithms. Nonetheless, some of the early examples in the textbook use Excel. The main reasons for using Excel are twofold: (1) to give you a good foundation before you move on to more advanced statistical topics, and (2) JMP can be easily accessed through Excel as an Excel add-in, which is an approach many will take.

Framework and Chapter Sequence

In this book, you first review basic statistics in Chapter 2 and expand some of these concepts into statistical data discovery techniques in Chapter 4. Because most data sets in the real world are dirty, Chapter 3 discusses ways of cleaning data. Subsequently, you examine several multivariate techniques:

●   regression and ANOVA (Chapter 5)

●   logistic regression (Chapter 6)

●   principal components (Chapter 7)

●   cluster analysis (Chapter 9)

The framework of statistical and visual methods in this book is shown in Figure 1.5. Each technique is introduced with a basic statistical foundation to help you understand when to use the technique and how to evaluate and interpret the results. Also, step-by-step directions are provided to guide you through an analysis using the technique.

Figure 1.5: A Framework for Multivariate Analysis

The second half of the book introduces several more multivariate/predictive techniques and provides an introduction to the predictive analytics process:

●   LASSO and Elastic Net (Chapter 8)

●   decision trees (Chapter 10)

●   k-nearest neighbor (Chapter 11)

●   neural networks (Chapter 12)

●   bootstrap forests and boosted trees (Chapter 13)

●   model comparison (Chapter 14)

●   text mining (Chapter 15)

●   association rules (Chapter 16)

●   data mining process (Chapter 17)

The discussion of these predictive analytics techniques uses the same approach as with the multivariate techniques—understand when to use it, evaluate and interpret the results, and follow step-by-step instructions.

When you are performing predictive analytics, you will most likely find that more than one model will be applicable. Chapter 14 examines procedures to compare these different models.

The overall objectives of the book are not only to introduce you to multivariate techniques and predictive analytics, but also to provide a bridge from univariate statistics to practical statistical analysis by instilling the PPAR cycle.

Chapter 2: Statistics Review

Introduction

Fundamental Concepts 1 and 2

FC1: Always Take a Random and Representative Sample

FC2: Remember That Statistics Is Not an Exact Science

Fundamental Concept 3: Understand a Z-Score

Fundamental Concept 4

FC4: Understand the Central Limit Theorem

Learn from an Example

Fundamental Concept 5

Understand One-Sample Hypothesis Testing

Consider p-Values

Fundamental Concept 6

Understand That Few Approaches/Techniques Are Correct—Many Are Wrong

Three Possible Outcomes When You Choose a Technique

Introduction

Regardless of the academic field of study—business, psychology, or sociology—the first applied statistics course introduces the following statistical foundation topics:

●   descriptive statistics

●   probability

●   probability distributions (discrete and continuous)

●   sampling distribution of the mean

●   confidence intervals

●   one-sample hypothesis testing and perhaps two-sample hypothesis testing

●   simple linear regression

●   multiple linear regression

●   ANOVA

Not considering the mechanics or processes of performing these statistical techniques, what fundamental concepts should you remember? We believe there are six fundamental concepts:

●   FC1: Always take a random and representative sample.

●   FC2: Statistics is not an exact science.

●   FC3: Understand a z-score.

●   FC4: Understand the central limit theorem (not every distribution has to be bell-shaped).

●   FC5: Understand one-sample hypothesis testing and p-values.

●   FC6: Understand that few approaches/techniques are correct and many are wrong.

Let’s examine each concept further.

Fundamental Concepts 1 and 2

The first two fundamental concepts explain why we take a random and representative sample and that the sample statistics are estimates that vary from sample to sample.

FC1: Always Take a Random and Representative Sample

What is a random and representative sample (called a 2R sample)? Here, representative means representative of the population of interest. A good example is state election polling. You do not want to sample everyone in the state. First, an individual must be old enough and registered to vote; you cannot vote if you are not registered. Next, not everyone who is registered actually votes, so does a given registered voter plan to vote? You are not interested in individuals who do not plan to vote. You don't care about their voting preferences because they will not affect the election. Thus, the population of interest is those individuals who are registered to vote and plan to vote.

From this representative population of registered voters who plan to vote, you want to choose a random sample. Random means that each individual has an equal chance of being selected. So suppose that there is a huge container with one ball for each individual who is identified as registered and planning to vote. From this container, you draw the required number of balls, without replacing any ball. In such a case, each individual has an equal chance of being drawn.
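In code, this container-of-balls idea is simply random sampling without replacement. A minimal sketch, with a hypothetical list of voter IDs standing in for the balls:

```python
import random

# Hypothetical population: IDs of individuals who are registered and plan to vote
population = list(range(1, 10001))          # 10,000 eligible individuals

random.seed(42)                             # fixed seed for a reproducible illustration
sample = random.sample(population, k=500)   # equal chance for each ID; no replacement

print(len(sample), "individuals drawn;", len(set(sample)), "unique (no ball is replaced)")
```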

You want the sample to be a 2R sample, but why? For two related reasons. First, if the sample is a 2R sample, then the sample distribution of observations will follow a pattern resembling that of the population. Suppose that the population distribution of interest is the weights of sumo wrestlers and horse jockeys (sort of a ridiculous distribution of interest, but that should help you remember why it is important). What does the shape of the population distribution of weights of sumo wrestlers and jockeys look like? Probably somewhat like the distribution in Figure 2.1. That is, it’s bimodal, or two-humped.

Figure 2.1: Population Distribution of the Weights of Sumo Wrestlers and Jockeys

If you take a 2R sample, the distribution of sampled weights will look somewhat like the population distribution in Figure 2.2, where the solid line is the population distribution and the dashed line is the sample distribution.

Figure 2.2: Population and a Sample Distribution of the Weights of Sumo Wrestlers and Jockeys

Why not exactly the same? Because it is a sample, not the entire population. It can differ, but only slightly. If the sample were the entire population, then it would look exactly the same. Again, so what? Why is this so important?

The population parameters (such as the population mean, µ, the population variance, σ², or the population standard deviation, σ) are the true values of the population. These are the values that you are interested in knowing. In most situations, you would know these values exactly only if you were to sample the entire population of interest (that is, take a census). In most real-world situations, this would be prohibitive, costing too much and taking too much time.

Because the sample is a 2R sample, the sample distribution of observations is very similar to the population distribution of observations. Therefore, the sample statistics, calculated from the sample, are good estimates of their corresponding population parameters. That is, statistically they will be relatively close to their population parameters because you took a 2R sample. For these reasons, you take a 2R sample.

FC2: Remember That Statistics Is Not an Exact Science

The sample statistics (such as the sample mean, sample variance, and sample standard deviation) are estimates of their corresponding population parameters. It is highly unlikely that they will equal their corresponding population parameter. It is more likely that they will be slightly below or slightly above the actual population parameter, as shown in Figure 2.2.

Further, if another 2R sample is taken, most likely the sample statistics from the second sample will be different from the first sample. They will be slightly less or more than the actual population parameter.

For example, suppose that a company's union is on the verge of striking. You take a 2R sample of 2,000 union workers. Assume that this sample size is statistically large. Out of the 2,000, 1,040 of them say that they are going to strike. First, 1,040 out of 2,000 is 52%, which is greater than 50%. Can you therefore conclude that they will go on strike? Given that 52% is an estimate of the percentage of all union workers who are willing to strike, you know that another 2R sample would provide another percentage, perhaps higher, perhaps lower, and perhaps even below 50%. By using statistical techniques, you can assess the likelihood that the population parameter is greater than 50%. (You can construct a confidence interval; if the lower confidence limit is greater than 50%, you can be highly confident that the true population proportion is greater than 50%. Or you can conduct a hypothesis test to measure the likelihood that the proportion is greater than 50%.)
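A sketch of both checks for this example (n = 2,000 and 1,040 in favor), using the common normal approximation for a proportion rather than any formula specific to this book:

```python
import math

n, x = 2000, 1040
p_hat = x / n                                # sample proportion = 0.52

# Standard error of the sample proportion (normal approximation)
se = math.sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval
z_crit = 1.96
lower, upper = p_hat - z_crit * se, p_hat + z_crit * se
print(f"95% CI for the true proportion: ({lower:.3f}, {upper:.3f})")

# One-sided test of H0: p = 0.50 versus Ha: p > 0.50
p0 = 0.50
z_stat = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 0.5 * math.erfc(z_stat / math.sqrt(2))   # upper-tail area of the standard normal
print(f"z = {z_stat:.2f}, one-sided p-value = {p_value:.3f}")
```

With these particular numbers the two checks nearly disagree: the one-sided test is significant at the 5% level, while the lower limit of the 95% confidence interval sits just below 50%. That borderline behavior is a useful reminder of FC2: statistics is not an exact science.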

Bottom line: When you take a 2R sample, your sample statistics will be good (statistically relatively close, that is, not too far away) estimates of their corresponding population parameters. And you must realize that these sample statistics are estimates, in that, if other 2R samples are taken, they will produce different estimates.

Fundamental Concept 3: Understand a Z-Score

Suppose that you are sitting in on a marketing meeting. The marketing manager is presenting the past performance of one product over the past several years. Some of the statistical information that the manager provides is the average monthly sales and the standard deviation. (More than likely, the manager would not present the standard deviation, but a quick conservative estimate of it is (Max − Min)/4; the manager most likely would give the minimum and maximum values.)

Suppose that the average monthly sales are $500 million, and the standard deviation is $10 million. The marketing manager starts to present a new advertising campaign which he or she claims would increase sales to $570 million per month. And suppose that the new advertising looks promising. What is the likelihood of this happening? Calculate the z-score as follows:

z = (x − μ) / σ = (570 − 500) / 10 = 7

The z-score (and the t-score) is not just a number. The z-score measures how many standard deviations a value, such as the 570, is away from the mean of 500. The z-score can provide some guidance, regardless of the shape of the distribution. A z-score greater than 3 in absolute value is considered an outlier and highly unlikely. In the example, if the new marketing campaign is as effective as suggested, the likelihood of increasing monthly sales by 7 standard deviations is extremely low.

On the other hand, what if you calculated the standard deviation and it was $50 million? The z-score is now 1.4 standard deviations. As you might expect, such an increase could occur, and depending on how much faith you have in the new advertising campaign, you might believe it will. So the value of $570 million can be far from, or close to, the mean of $500 million. It depends on the spread of the data, which is measured by the standard deviation.

In general, the z-score is like a traffic light. If it is greater than the absolute value of 3 (denoted |3|), the light is red; this is an extreme value. If the z-score is between |1.65| and |3|, the light is yellow; this value is borderline. If the z-score is less than |1.65|, the light is green, and the value is just considered random variation. (The cutpoints of 3 and 1.65 might vary slightly depending on the situation.)
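A minimal sketch of the calculation and the traffic-light reading, using the chapter's numbers (a $500 million mean, a claimed $570 million, and the two standard deviations discussed):

```python
def z_score(value, mean, std_dev):
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev

def traffic_light(z, red=3.0, yellow=1.65):
    """Classify |z| using the chapter's rough cutpoints."""
    if abs(z) > red:
        return "red: extreme value"
    if abs(z) > yellow:
        return "yellow: borderline"
    return "green: ordinary random variation"

# Monthly sales example from the text (all figures in $ millions)
mean, claimed = 500, 570

for std_dev in (10, 50):                     # the two standard deviations discussed
    z = z_score(claimed, mean, std_dev)
    print(f"sigma = {std_dev}: z = {z:.1f} -> {traffic_light(z)}")

# If only the minimum and maximum were reported, a quick conservative estimate
# of the standard deviation is (Max - Min) / 4.
```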

Fundamental Concept 4

This concept is where most students become lost in their first statistics class. They complete their statistics course thinking every distribution is normal or bell-shaped, but that is not true. However, if the FC1 assumption is not violated and the central limit theorem holds, then something called the sampling distribution of the sample means will be bell-shaped. And this sampling distribution is used for inferential statistics; that is, it is applied in constructing confidence intervals and performing hypothesis tests.

FC4: Understand the Central Limit Theorem

If you take a 2R sample, the histogram of the sample distribution of observations will be close to the histogram of the population distribution of observations (FC1). You also know that the sample mean from sample to sample will vary (FC2).

Suppose that you actually know the value of the population mean, and you took every possible sample of size n (where n is any number greater than 30) and calculated the sample mean for each sample. Given all these sample means, you then produce a frequency distribution and a corresponding histogram of the sample means. You call this distribution the sampling distribution of sample means. Many of the sample means will be slightly below or slightly above the population mean, with fewer falling further away in either direction, and with an equal chance of being greater than or less than the population mean. If you try to visualize this, the distribution of all these sample means would be bell-shaped, as in Figure 2.3. This should make intuitive sense.

Figure 2.3: Population Distribution and Sample Distribution of Observations and Sampling Distribution of the Means for the Weights of Sumo Wrestlers and Jockeys

Nevertheless, there is one major problem. To get this distribution of sample means, you said that every combination of sample size n needs to be collected and analyzed. That, in most cases, is an enormous number of samples and would be prohibitive. Also, in the real world, you only take one 2R sample.

This is where the central limit theorem (CLT) comes to our rescue. The CLT will hold regardless of the shape of the population distribution of observations, whether it is normal, bimodal (like the sumo wrestlers and jockeys), or any other shape, as long as a 2R sample is taken and the sample size is greater than 30. Then the sampling distribution of sample means will be approximately normal, with a mean of x̄ and a standard deviation of s/√n (which is called the standard error).
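A small simulation sketch (with made-up weights for the bimodal sumo-and-jockey population) illustrates the theorem: however oddly shaped the population, the means of repeated 2R samples of size n > 30 pile up in an approximately bell-shaped distribution whose spread is close to the population standard deviation divided by the square root of n.

```python
import random
import statistics

random.seed(1)

# Hypothetical bimodal population: jockeys near 115 lb, sumo wrestlers near 350 lb
population = ([random.gauss(115, 8) for _ in range(5000)] +
              [random.gauss(350, 30) for _ in range(5000)])

n = 40                                   # sample size greater than 30
sample_means = []
for _ in range(2000):                    # many repeated 2R samples (illustration only)
    sample = random.sample(population, n)
    sample_means.append(statistics.mean(sample))

print("population mean:              ", round(statistics.mean(population), 1))
print("mean of the sample means:     ", round(statistics.mean(sample_means), 1))
print("std dev of the sample means:  ", round(statistics.stdev(sample_means), 2))
print("population std dev / sqrt(n): ", round(statistics.pstdev(population) / n ** 0.5, 2))
```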

What does this mean in terms of performing statistical inferences of the population? You do not have to take an enormous number of samples. You need to take only one 2R sample with a sample size greater than 30. In most situations, this will not be a problem. (If it is an issue, you should use nonparametric statistical techniques.) If you have a 2R sample greater than 30, you can approximate the sampling distribution of sample means by using