
Data Mining Methods and Models

By Daniel T. Larose, Ph.D.

Chapter Summaries and Keywords
Preface. The preface begins by discussing why Data Mining Methods and Models is needed. Because of the powerful data mining software platforms currently available, a strong caveat is given against glib application of data mining methods and techniques. In other words, data mining is easy to do badly. The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. Data Mining Methods and Models applies this white-box approach by (1) walking the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm, (2) providing examples of the application of the various algorithms on actual large data sets, (3) supplying chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data, and (4) providing the reader with hands-on analysis problems, representing an opportunity for the reader to apply his or her newly-acquired data mining expertise to solving real problems using large data sets. Data mining is presented as a well-structured standard process, namely, the Cross-Industry Standard Process for Data Mining (CRISP-DM). A graphical approach to data analysis is emphasized, stressing in particular exploratory data analysis. Data Mining Methods and Models naturally fits the role of textbook for an introductory course in data mining. 
Instructors may appreciate (1) the presentation of data mining as a process, (2) the “white-box” approach, emphasizing an understanding of the underlying algorithmic structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the logical presentation, flowing naturally from the CRISP-DM standard process and the set of data mining tasks. Particularly useful for the instructor is the companion website, which provides ancillary materials for teaching a course using Data Mining Methods and Models, including PowerPoint® presentations, answer keys, and sample projects. The book is appropriate for advanced undergraduate or graduate-level courses. No computer programming or database expertise is required. The software used in the book includes Clementine, Minitab, SPSS, and WEKA. Free trial versions of Minitab and SPSS are available for download from their company websites. WEKA is open-source data mining software freely available for download. Keywords: Algorithm walk-throughs, data mining as a process, “white-box” approach, graphical and exploratory approach, chapter exercises, hands-on analysis problems, companion website, Clementine, Minitab, SPSS, WEKA.

Data Mining Methods and Models
Copyright © by Daniel T. Larose, Ph.D. Chapter Summary and Keywords

Chapter 1: Dimension Reduction Methods
Chapter One begins with an assessment of the need for dimension reduction in data mining. Principal components analysis is demonstrated in the context of a real-world example using the Houses data set. Various criteria are compared for determining how many components should be extracted. Emphasis is given to profiling the principal components for the end user, along with the importance of validating the principal components using the usual hold-out methods in data mining. Next, factor analysis is introduced and demonstrated using the real-world Adult data set. The need for factor rotation, which clarifies the definition of the factors, is discussed. Finally, user-defined composites are briefly discussed. Key Words: Principal components, eigenvalues, component weights, commonality, scree plot, factor analysis, factor loadings, factor rotation, user-defined composite.

Chapter 2: Regression Modeling
Chapter Two begins by using an example to introduce simple linear regression and the concept of least squares. The correlation coefficient r is discussed. The usefulness of the regression is then measured by the coefficient of determination r², along with the ANOVA table for succinct display of results, and the typical prediction error is estimated using the standard error of the estimate s. Outliers, high leverage points, and influential observations are discussed in detail. Moving from descriptive methods to inference, the t-test for the relationship between x and y is shown, along with the confidence interval for the slope of the regression line, the confidence interval for the mean value of y given x, and the prediction interval for a randomly chosen value of y given x. Methods are shown for verifying the assumptions underlying the regression model, and methods of applying transformations to achieve linearity are provided. Detailed examples are provided using the Baseball and California data sets. Key Words: Simple linear regression, least squares, outlier, high leverage point, influential observation, variation, confidence interval, prediction interval, prediction error, transformations.
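As a rough illustration of the central idea behind principal components analysis, the sketch below extracts the components of two standardized predictors in plain Python. For two standardized variables the correlation matrix is 2×2, so its eigenvalues have the closed form 1 + r and 1 − r and no linear-algebra library is needed. The house-size data are invented for illustration and are not the book's Houses data set.

```python
import math

def standardize(xs):
    """Convert raw values to z-scores (mean 0, sample standard deviation 1)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / sd for x in xs]

def pca_two_vars(x1, x2):
    """PCA for exactly two predictors, via the closed-form 2x2 eigenvalues."""
    z1, z2 = standardize(x1), standardize(x2)
    n = len(z1)
    r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)   # sample correlation
    eigenvalues = sorted([1 + r, 1 - r], reverse=True)
    # Each eigenvalue divided by the number of variables is the proportion
    # of total variance explained -- the quantity plotted in a scree plot.
    proportions = [ev / 2 for ev in eigenvalues]
    return eigenvalues, proportions

# Two strongly correlated predictors (invented): most of the variance
# collapses onto a single component, which is the point of the method.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
rooms = [5, 6, 6, 7, 4, 6, 9, 9, 5, 6]
evs, props = pca_two_vars(sqft, rooms)
```

With highly correlated inputs, the first eigenvalue approaches 2 and the first component alone explains nearly all of the variance, which is why dimension reduction works at all.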

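The least-squares quantities of Chapter 2 (the fitted slope and intercept, the coefficient of determination r², and the standard error of the estimate s) can be computed directly from their defining formulas. The following is a minimal sketch on invented data, not the book's Baseball or California examples.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit y = b0 + b1*x, with r^2 and standard error s."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                 # least-squares slope
    b0 = ybar - b1 * xbar          # least-squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst             # coefficient of determination
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    return b0, b1, r2, s

# Invented data lying close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1, r2, s = simple_linear_regression(x, y)
```

Here r² near 1 indicates that the regression explains almost all the variation in y, and s estimates the typical prediction error in the units of y.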
Chapter 3: Multiple Regression and Model Building
Multiple regression, where more than one predictor variable is used to estimate a response variable, is introduced by way of an example. The multiple regression model is defined, with both model and inferential methods representing extensions of the simple linear regression case. Then, regression with categorical predictors (indicator variables) is explained, using an example. Methods are discussed for introducing higher-order terms to handle non-linearity. The problems of multicollinearity are examined; multicollinearity represents an unstable response surface due to overly correlated predictors. The variance inflation factor is defined, as an aid in identifying multicollinear predictors, and methods for using the principal components as predictors in multiple regression are discussed. Variable selection methods are then provided, including forward selection, backward elimination, stepwise regression, and best-subsets regression. Mallows' Cp statistic is defined, as an aid in variable selection. Key Words: Categorical predictors, indicator variables, multicollinearity, variance inflation factor, model selection methods, forward selection, backward elimination, stepwise regression, best-subsets regression.

Chapter 4: Logistic Regression
Logistic regression is introduced by way of a simple example for predicting the presence of disease based on age. The maximum likelihood estimation methods for logistic regression are outlined. Inference within the framework of the logistic regression model is discussed, including determining whether the predictors are significant. Methods for interpreting the logistic regression model are examined, including for dichotomous, polychotomous, and continuous predictors; odds ratios are discussed. The assumption of linearity is discussed, as well as methods for tackling the zero-cell problem. We then turn to multiple logistic regression, where more than one predictor is used to classify a response. Emphasis is placed on interpreting logistic regression output. As usual, the logistic regression model must be validated. Finally, the application of logistic regression using the freely available software WEKA is demonstrated. Key Words: Maximum likelihood estimation, odds ratio, multiple logistic regression, the zero-cell problem, categorical response, classification, WEKA.

Chapter 5: Naïve Bayes and Bayesian Networks
Chapter Five begins by contrasting the Bayesian approach with the usual (frequentist) approach to probability. The maximum a posteriori (MAP) classification is defined, including the posterior odds ratio, which is used to select the preferred response classification. The importance of balancing the data is discussed. Naïve Bayes classification is derived, using a simplifying assumption which greatly reduces the search space. Methods for handling numeric predictors for Naïve Bayes classification are demonstrated. An example of using WEKA for Naïve Bayes is provided. Bayesian Belief Networks (Bayes nets) are introduced and defined.
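Chapter 4's logistic regression model can be sketched in a few lines. The coefficients below are hypothetical (chosen for illustration, not fitted to any real disease data), but the code demonstrates the defining interpretation: the odds ratio for a one-unit increase in the predictor equals exp(b1), regardless of the starting value of x.

```python
import math

def logistic_prob(b0, b1, x):
    """P(response = 1 | x) under the logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients for a disease-versus-age model (illustrative only).
b0, b1 = -4.372, 0.06696

p50 = logistic_prob(b0, b1, 50)      # estimated probability at age 50
p51 = logistic_prob(b0, b1, 51)      # estimated probability at age 51
odds50 = p50 / (1 - p50)
odds51 = p51 / (1 - p51)
# The odds ratio for a one-year increase in age is exp(b1); the ratio of
# fitted odds at any two consecutive ages reproduces it exactly.
odds_ratio = math.exp(b1)
```

This identity is what makes logistic regression output interpretable: exponentiating a coefficient gives the multiplicative change in the odds of the response per unit increase in that predictor.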

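Chapter 5's maximum a posteriori classification under the naive conditional-independence assumption can be sketched as follows. The toy records and the `respond`/`ignore` labels are invented for illustration, and Laplace smoothing is used here as one standard guard against zero counts.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)   # (feature, label) -> Counter(value)
    feature_values = defaultdict(set)     # feature -> observed values
    for features, label in records:
        for f, v in features.items():
            value_counts[(f, label)][v] += 1
            feature_values[f].add(v)
    return class_counts, value_counts, feature_values

def map_classify(model, features, smoothing=1.0):
    """MAP classification: maximize log P(class) + sum log P(value | class),
    treating the features as conditionally independent given the class."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)            # log prior
        for f, v in features.items():
            num = value_counts[(f, label)][v] + smoothing
            den = n_label + smoothing * len(feature_values[f])
            score += math.log(num / den)             # smoothed log likelihood
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training records: two categorical predictors, binary response.
data = [
    ({"income": "high", "owns_home": "yes"}, "respond"),
    ({"income": "high", "owns_home": "no"},  "respond"),
    ({"income": "low",  "owns_home": "yes"}, "respond"),
    ({"income": "low",  "owns_home": "no"},  "ignore"),
    ({"income": "low",  "owns_home": "no"},  "ignore"),
    ({"income": "high", "owns_home": "no"},  "ignore"),
]
model = train_naive_bayes(data)
pred = map_classify(model, {"income": "high", "owns_home": "yes"})
```

The conditional-independence assumption is what collapses the search space: instead of estimating a probability for every joint combination of feature values, the classifier needs only one small table per feature.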
Methods for using the Bayesian network to find probabilities are discussed. Finally, an example of using Bayes nets in WEKA is provided. Key Words: Bayesian approach, maximum a posteriori classification, posterior odds ratio, Naïve Bayes classification, balancing the data, Bayesian belief networks, WEKA.

Chapter 6: Genetic Algorithms
Chapter Six begins by introducing genetic algorithms by way of analogy with the biological processes at work in the evolution of organisms. The basic framework of a genetic algorithm is provided, including the three basic operators: selection, crossover, and mutation. A simple example of a genetic algorithm at work is examined, with each step explained and demonstrated. Next, modifications and enhancements from the literature are discussed, especially for the selection and crossover operators. Genetic algorithms for real-valued variables are discussed. The use of genetic algorithms as optimizers within a neural network is demonstrated, where the genetic algorithm replaces the usual backpropagation algorithm. Finally, an example of the use of WEKA for genetic algorithms is provided. Key Words: Selection, crossover, mutation, fitness, selection pressure, crowding, optimization, global optimum, WEKA.

Chapter 7: Case Study: Modeling Response to Direct Mail Marketing
The case study begins with an overview of the Cross-Industry Standard Process for Data Mining: CRISP-DM. For the business understanding phase, the direct mail marketing response problem is defined, with particular emphasis on the construction of an accurate cost/benefit table. In the data understanding and data preparation phases, the Clothing Store data set is explored. Transformations to achieve normality or symmetry are applied, as is standardization and the construction of flag variables. Useful new variables are derived. The relationships between the predictors and the response are explored, and the correlation structure among the predictors is investigated. Using principal components analysis, two principal components are derived. Clustering analysis is performed, using the BIRCH clustering algorithm. Next comes the modeling phase. The baseline model performance is established, which will be used to assess the usefulness of all later models. Emphasis is laid on the effects of balancing (and over-balancing) the training data set, and the technique of using over-balancing as a surrogate for misclassification costs is applied. Two sets of models are examined: Collection A, which uses the principal components, and Collection B, which does not. The method of combining models via voting is demonstrated, as is the method of combining models using the mean response probabilities.
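The basic genetic algorithm framework of Chapter 6 (selection, crossover, and mutation acting on a population) can be sketched as below. This is an illustrative implementation using tournament selection, single-point crossover, and bit-flip mutation, maximizing the toy "count the one-bits" fitness function; it is not the book's WEKA-based example.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      crossover_rate=0.7, mutation_rate=0.01, seed=42):
    """Basic GA loop over a population of bitstrings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select(pop):
        # Tournament selection of size 2: fitter individuals are more
        # likely to reproduce (this is the source of selection pressure).
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select(pop), select(pop)
            if rng.random() < crossover_rate:       # single-point crossover
                point = rng.randrange(1, n_bits)
                c1 = p1[:point] + p2[point:]
                c2 = p2[:point] + p1[point:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):                  # bit-flip mutation
                for i in range(n_bits):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                next_pop.append(child)
        pop = next_pop[:pop_size]
    return max(pop, key=fitness)

# Fitness = number of 1-bits; the global optimum is the all-ones string.
best = genetic_algorithm(fitness=sum)
```

Mutation keeps the population from converging prematurely on a local optimum, while selection and crossover drive it toward regions of higher fitness.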

Key Words: CRISP-DM standard process for data mining, cost/benefit analysis, misclassification costs, balancing the data, over-balancing, BIRCH clustering algorithm, model combination, voting, mean response probabilities.
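The two model-combination schemes used in the case study, voting and averaging the mean response probabilities, each reduce to a few lines of code. The per-model outputs below are invented for illustration.

```python
def combine_by_voting(predictions, threshold):
    """Classify a record as a responder (1) when at least `threshold`
    of the individual models vote 1."""
    return [1 if sum(votes) >= threshold else 0 for votes in zip(*predictions)]

def combine_by_mean_probability(prob_lists, cutoff=0.5):
    """Average each record's response probabilities across the models,
    then classify against a probability cutoff."""
    means = [sum(ps) / len(ps) for ps in zip(*prob_lists)]
    return [1 if m >= cutoff else 0 for m in means], means

# Three hypothetical models, each scoring the same four records.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.3, 0.3, 0.5],
]
# Voting: first turn each model's probabilities into 0/1 votes.
votes = [[1 if p >= 0.5 else 0 for p in probs] for probs in model_probs]
majority = combine_by_voting(votes, threshold=2)
labels, means = combine_by_mean_probability(model_probs)
```

Voting discards each model's confidence, while the mean-probability scheme retains it; lowering the voting threshold (e.g. classifying as a responder if any single model votes 1) is one way to trade false negatives for false positives when misclassification costs are asymmetric.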