Data Mining in Market Research

• What is data mining?
– Methods for finding interesting structure in large databases
• E.g. patterns, prediction rules, unusual cases

– Focus on efficient, scalable algorithms
• Contrasts with emphasis on correct inference in statistics

– Related to data warehousing, machine learning

• Why is data mining important?
– Well marketed; now a large industry; pays well
– Handles large databases directly
– Can make data analysis more accessible to end users
• Semi-automation of analysis
• Results can be easier to interpret than e.g. regression models
• Strong focus on decisions and their implementation

CRISP-DM Process Model

Data Mining Software
• Many providers of data mining software
– SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, …
– See for a list
– Good algorithms important, but also need good facilities for handling data and meta-data

• We’ll use:
– WEKA (Waikato Environment for Knowledge Analysis)
• Free (GPLed) Java package with GUI
• Online at
• Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.

– R packages
• E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect

Data Mining Terms
• Different names for familiar statistical concepts, from database and AI communities
– Observation = case, record, instance
– Variable = field, attribute
– Analysis of dependence vs interdependence = supervised vs unsupervised learning
– Relationship = association, concept
– Dependent variable = response, output
– Independent variable = predictor, input

Common Data Mining Techniques
• Predictive modeling
– Classification
• Derive classification rules
• Decision trees

– Numeric prediction
• Regression trees, model trees

• Association rules
• Meta-learning methods
– Cross-validation, bagging, boosting

• Other data mining methods include:
– artificial neural networks, genetic algorithms, density estimation, clustering, abstraction, discretisation, visualisation, detecting changes in data or models

Classification
• Methods for predicting a discrete response
– One kind of supervised learning
– Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis
• Applications include:
– Identifying good prospects for specific marketing or sales efforts
• Cross-selling, up-selling – when to offer products
• Customers likely to be especially profitable
• Customers likely to defect
– Identifying poor credit risks
– Diagnosing customer problems

Weather/Game-Playing Data
• Small dataset
– 14 instances
– 5 attributes
• Outlook – nominal
• Temperature – numeric
• Humidity – numeric
• Windy – nominal
• Play – nominal
– Whether or not a certain game would be played
– This is what we want to understand and predict

ARFF file for the weather data

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny,    85, 85, false, no
sunny,    80, 90, true,  no
overcast, 83, 86, false, yes
rainy,    70, 96, false, yes
rainy,    68, 80, false, yes
rainy,    65, 70, true,  no
overcast, 64, 65, true,  yes
sunny,    72, 95, false, no
sunny,    69, 70, false, yes
rainy,    75, 80, false, yes
sunny,    75, 70, true,  yes
overcast, 72, 90, true,  yes
overcast, 81, 75, false, yes
rainy,    71, 91, true,  no

German Credit Risk Dataset
• 1000 instances (people), 21 attributes
– “class” attribute describes people as good or bad credit risks
– Other attributes include financial information and demographics
• E.g. checking_status, duration, credit_history, purpose, savings_status, employment, housing, job, num_dependents, Age, foreign_worker
• Want to predict credit risk
• Data available at UCI machine learning data repository
– http://www.ics.uci.edu/~mlearn/MLRepository.html
– and on 747 web page
• http://www.stat.auckland.ac.nz/~reilly/credit-g.arff

Classification Algorithms
• Many methods available in WEKA
– 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, C4.5 (J48), PART, Instance-based learner (IB1, IBk), Support vector machine (SMO)
• Usually train on part of the data, test on the rest
• Simple method – zero-rule, or 0R
– Predict the most common category
– Class ZeroR in WEKA
– Too simple for practical use, but a useful baseline for evaluating performance of more complex methods
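The zero-rule baseline is simple enough to state in a few lines. A minimal Python sketch (not WEKA's ZeroR itself; the example labels are illustrative):

```python
from collections import Counter

def zero_rule(labels):
    """0R: ignore all attributes and predict the most common class."""
    return Counter(labels).most_common(1)[0][0]

# Class distribution of the weather data: 9 "yes", 5 "no"
labels = ["yes"] * 9 + ["no"] * 5
baseline = zero_rule(labels)
print(baseline)  # yes -- right 9 times out of 14 on the training data
```

Any more complex classifier should beat this baseline accuracy before it is worth using.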

1-Rule (1R) Algorithm
• Based on single predictor
– Predict mode within each value of that predictor
• Look at error rate for each predictor on training dataset, and choose best predictor
• Called OneR in WEKA
• Must group numerical predictor values for this method
– Common method is to split at each change in the response
– Collapse buckets until each contains at least 6 instances
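For nominal attributes, the 1R procedure can be sketched directly; this is an illustration rather than WEKA's OneR, and the grouping of numerical predictors is omitted. The two-attribute subset of the weather data below is just for demonstration:

```python
from collections import Counter, defaultdict

def one_rule(rows, labels):
    """1R: for each attribute, predict the modal class within each of its
    values; return the attribute (and rule) with the fewest training errors."""
    best = None
    for attr in rows[0]:
        by_value = defaultdict(list)
        for row, y in zip(rows, labels):
            by_value[row[attr]].append(y)
        rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in by_value.items()}
        errors = sum(rule[row[attr]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
windy = [False, True, False, False, False, True, True,
         False, False, False, True, True, False, True]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
rows = [{"outlook": o, "windy": w} for o, w in zip(outlook, windy)]
attr, rule, errors = one_rule(rows, play)
print(attr, errors)  # outlook 4 -- outlook wins with 4/14 training errors
```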

1R Algorithm (continued)
• Biased towards predictors with more categories
– These can result in over-fitting to the training data
• But found to perform surprisingly well
– Study on 16 widely used datasets
• Holte (1993), Machine Learning 11, 63-91
– Often error rate only a few percentage points higher than more sophisticated methods (e.g. decision trees)
– Produced rules that were much simpler and more easily understood

Naïve Bayes Method
• Calculates probabilities of each response value, assuming independence of attribute effects:

  P(X | A, B, C) = P(A | X) P(B | X) P(C | X) P(X) / P(A, B, C)

• Response value with highest probability is predicted
• Numeric attributes are assumed to follow a normal distribution within each response value
– Contribution to probability calculated from normal density function
– Instead can use kernel density estimate, or simply discretise the numerical attributes

Naïve Bayes Calculations
• Observed counts and probabilities above
– Temperature and humidity have been discretised
• Consider new day
– Outlook=sunny, temperature=cool, humidity=high, windy=true
– Probability(play=yes) α 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
– Probability(play=no) α 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
– Probability(play=no) = 0.0206/(0.0053+0.0206) = 79.5%
• “no” four times more likely than “yes”
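The arithmetic on this slide is easy to check in a few lines of Python, using the same counts from the discretised weather data:

```python
# New day: outlook=sunny, temperature=cool, humidity=high, windy=true.
# Each factor is P(attribute value | class); the last is the class prior.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(p_yes, 4))                  # 0.0053
print(round(p_no, 4))                   # 0.0206
print(round(p_no / (p_yes + p_no), 3))  # 0.795, i.e. 79.5% for "no"
print(round(p_no / p_yes, 1))           # 3.9 -- "no" about four times as likely
```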

Naïve Bayes Method
• If any of the component probabilities are zero, the whole probability is zero
– Effectively a veto on that response value
– Add one to each cell’s count to get around this problem
• Corresponds to weak positive prior information
• Naïve Bayes effectively assumes that attributes are equally important
– Several highly correlated attributes could drown out an important variable that would add new information
• However this method often works well in practice

Decision Trees
• Classification rules can be expressed in a tree structure
– Move from the top of the tree, down through various nodes, to the leaves
– At each node, a decision is made using a simple test based on attribute values
– The leaf you reach holds the appropriate predicted value
• Decision trees are appealing and easily used
– However they can be verbose
– Depending on the tests being used, they may obscure rather than reveal the true pattern
• More info online at http://recursive-partitioning.com/

Decision tree with a replicated subtree:
If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b

Problems with Univariate Splits

Constructing Decision Trees
• Develop tree recursively
– Start with all data in one root node
– Need to choose attribute that defines first split
• For now, we assume univariate splits are used
– For accurate predictions, want leaf nodes to be as pure as possible
– Choose the attribute that maximises the average purity of the daughter nodes
• The measure of purity used is the entropy of the node
• This is the amount of information needed to specify the value of an instance in that node, measured in bits:

  entropy(p1, p2, …, pn) = − Σ_{i=1}^{n} p_i log2 p_i

Tree stumps for the weather data (a) (b) (c) (d) .

Weather Example
• First node from outlook split is for “sunny”, with entropy
– 2/5 x log2(2/5) – 3/5 x log2(3/5) = 0.971
• Entropy of root node is 0.940 bits
• Average entropy of nodes from outlook split is 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693
• Gain of 0.940 – 0.693 = 0.247 bits
• Other splits yield:
– Gain(temperature) = 0.029 bits
– Gain(humidity) = 0.152 bits
– Gain(windy) = 0.048 bits
• So “outlook” is the best attribute to split on
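The entropy and gain figures above can be reproduced with a short calculation. A sketch, using the yes/no counts from the weather data:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node with the given class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

root = entropy([9, 5])                 # whole dataset: 9 yes, 5 no
# Children of the outlook split: sunny (2 yes, 3 no), overcast (4, 0), rainy (3, 2)
children = [[2, 3], [4, 0], [3, 2]]
average = sum(sum(c) / 14 * entropy(c) for c in children)
gain = root - average
print(round(root, 3), round(gain, 3))  # 0.94 0.247
```

Repeating the loop with the counts for temperature, humidity and windy reproduces the other gains on the slide.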

Expanded tree stumps for weather data (a) (b) (c) .

Decision tree for the weather data .

Decision Tree Algorithms
• The algorithm described in the preceding slides is known as ID3
– Due to Quinlan (1986)
• Tends to choose attributes with many values
– Using information gain ratio helps solve this problem
• Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning)
– Resulting algorithm known as C4.5
– Described by Quinlan (1993)
– Widely used (as is the commercial version C5.0)
– WEKA has a version called J4.8

Classification Trees
• Described (along with regression trees) in:
– L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees.
• More sophisticated method than ID3
– However Quinlan’s (1993) C4.5 method caught up with CART in most areas
• CART also incorporates methods for pruning, missing values and numeric attributes
– Multivariate splits are possible, as well as univariate
• Split on linear combination Σ cj xj > d
– CART typically uses Gini measure of node purity to determine best splits
• This is of the form Σ p(1−p)
– But information/entropy measure also available

Regression Trees
• Trees can also be used to predict numeric attributes
– Predict using average value of the response in the appropriate node
• Implemented in CART and C4.5 frameworks
– Can use a model at each node instead
• Implemented in Weka’s M5’ algorithm
• Harder to interpret than regression trees
• Classification and regression trees are implemented in R’s rpart package
– See Ch 10 in Venables and Ripley, MASS 3rd Ed.

Problems with Trees
• Can be unnecessarily verbose
• Structure often unstable
– “Greedy” hierarchical algorithm
• Small variations can change chosen splits at high level nodes, which then changes subtree below
• Conclusions about attribute importance can be unreliable
• Direct methods tend to overfit training dataset
– This problem can be reduced by pruning the tree
• Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset
– Typically gives a smaller tree
– This usually works almost as well on the training data
– But generalises better, e.g. works better on test data
• Bagging the tree algorithm also gives more stable results
– Will discuss bagging later

Classification Tree Example
• Use Weka’s J4.8 algorithm on German credit data (with default options)
– 1000 instances, 21 attributes
• Produces a pruned tree with 140 nodes, 103 leaves

3241 Root mean squared error 0.596 0.134 % Root relative squared error 100.• • • • • • • • • • • • • • • • • • • • • • • • • === Run information === Scheme: weka.9 Incorrectly Classified Instances 261 26.25 -M 2 Relation: german_credit Instances: 1000 Attributes: 21 Number of Leaves : Size of the tree : 103 140 === Stratified cross-validation === === Summary === Correctly Classified Instances 739 73.j48.4589 % Total Number of Instances 1000 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure Class 0.1 Kappa statistic 0.classifiers.J48 -C 0.4604 Relative absolute error 77.3153 Mean absolute error 0.403 0.classified as 618 82 | a = good 179 121 | b = bad % % .826 good 0.597 0.883 0.481 bad === Confusion Matrix === a b <-.883 0.775 0.403 0.117 0.

Cross-Validation
• Due to over-fitting, cannot estimate prediction error directly on the training dataset
• Cross-validation is a simple and widely used method for estimating prediction error
• Simple approach:
– Set aside a test dataset
– Train learner on the remainder (the training dataset)
– Estimate prediction error by using the resulting prediction model on the test dataset
• This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm

k-fold Cross-Validation
• For smaller datasets, use k-fold cross-validation
– Split dataset into k roughly equal parts:
  Tr | Tr | Test | Tr | Tr | Tr | Tr | Tr | Tr
– For each part, train on the other k-1 parts and use this part as the test dataset
– Do this for each of the k parts, and average the resulting prediction errors
• This method measures the prediction error when training the learner on a fraction (k-1)/k of the data
• If k is small, this will overestimate the prediction error
– k=10 is usually enough
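The procedure is straightforward to implement from scratch. This sketch uses the zero-rule predictor purely as a stand-in learner (any fit-then-predict function would do), and interleaved rather than contiguous folds for simplicity:

```python
from collections import Counter

def zero_r(train_x, train_y):
    """Learner: fit 0R and return a predict function."""
    mode = Counter(train_y).most_common(1)[0][0]
    return lambda x: mode

def kfold_error(xs, ys, k, learner):
    """Estimate prediction error by k-fold cross-validation."""
    n = len(ys)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved folds
    errors = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        predict = learner([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors += sum(predict(xs[i]) != ys[i] for i in test_idx)
    return errors / n

xs = list(range(10))
ys = ["a"] * 8 + ["b"] * 2
print(kfold_error(xs, ys, 5, zero_r))  # 0.2 -- 0R always predicts "a" here
```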

Regression Tree Example
• data(car.test.frame)
• z.auto <- rpart(Mileage ~ Weight, car.test.frame)
• post(z.auto, file="")
• summary(z.auto)

Regression tree for the car.test.frame data (text rendering):

24.58 | n=60
├─ Weight >= 2568: 22.47, n=45
│   ├─ Weight >= 3088: 20.41, n=22
│   └─ Weight < 3088: 24.43, n=23
│       ├─ Weight >= 2748: 23.8, n=15
│       └─ Weight < 2748: 25.63, n=8
└─ Weight < 2568: 30.93, n=15

Call:
rpart(formula = Mileage ~ Weight, data = car.test.frame)
  n= 60

          CP nsplit rel error    xerror       xstd
1 0.59534912      0 1.0000000 1.0322233 0.17981796
2 0.13452819      1 0.4046509 0.6081645 0.11371656
3 0.01282843      2 0.2701227 0.4557341 0.09178782
4 0.01000000      3 0.2572943 0.4659556 0.09134201

Node number 1: 60 observations,   complexity param=0.5953491
  mean=24.58333, MSE=22.57639
  left son=2 (45 obs) right son=3 (15 obs)
  Primary splits:
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

Node number 2: 45 observations,   complexity param=0.1345282
  mean=22.46667, MSE=8.026667
  left son=4 (22 obs) right son=5 (23 obs)
  Primary splits:
      Weight < 3087.5 to the right, improve=0.5045118, (0 missing)

…(continued on next page)…

Node number 3: 15 observations
  mean=30.93333, MSE=12.46222

Node number 4: 22 observations
  mean=20.40909, MSE=2.78719

Node number 5: 23 observations,   complexity param=0.01282843
  mean=24.43478, MSE=5.115312
  left son=10 (15 obs) right son=11 (8 obs)
  Primary splits:
      Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

Node number 10: 15 observations
  mean=23.8, MSE=4.026667

Node number 11: 8 observations
  mean=25.625, MSE=4.984375

Regression Tree Example (continued)
• plotcp(z.auto)
• z2.auto <- prune(z.auto, cp=0.1)
• post(z2.auto, file="", cex=1)

Complexity Parameter Plot
[Plot of X-val relative error (about 0.2–1.0) against cp (Inf, 0.28, 0.042, 0.011) and size of tree (1–4)]

Full regression tree (text rendering):

24.58 | n=60
├─ Weight >= 2568: 22.47, n=45
│   ├─ Weight >= 3088: 20.41, n=22
│   └─ Weight < 3088: 24.43, n=23
│       ├─ Weight >= 2748: 23.8, n=15
│       └─ Weight < 2748: 25.63, n=8
└─ Weight < 2568: 30.93, n=15

Pruned Regression Tree
Endpoint = Mileage

24.58 | n=60
├─ Weight >= 2568: 22.47, n=45
│   ├─ Weight >= 3088: 20.41, n=22
│   └─ Weight < 3088: 24.43, n=23
└─ Weight < 2568: 30.93, n=15

Classification Methods
• Project the attribute space into decision regions
– Decision trees: piecewise constant approximation
– Logistic regression: linear log-odds approximation
– Discriminant analysis and neural nets: linear & non-linear separators
• Density estimation coupled with a decision rule
– E.g. Naïve Bayes
• Define a metric space and decide based on proximity
– One type of instance-based learning
– K-nearest neighbour methods
• IBk algorithm in Weka
– Would like to drop noisy and unnecessary points
• Simple algorithm based on success rate confidence intervals available in Weka
– Compares naïve prediction with predictions using that instance
– Must choose suitable acceptance and rejection confidence levels
• Many of these approaches can produce probability distributions as well as predictions
– Depending on the application, this information may be useful
• Such as when results reported to expert (e.g. loan officer) as input to their decision

Numeric Prediction Methods
• Linear regression
• Splines, including smoothing splines and multivariate adaptive regression splines (MARS)
• Generalised additive models (GAM)
• Locally weighted regression (lowess, loess)
• Regression and model trees
– CART, C4.5, M5’
• Artificial neural networks (ANNs)

Artificial Neural Networks (ANNs)
• An ANN is a network of many simple processors (or units) that are connected by communication channels that carry numeric data
• ANNs are very flexible, encompassing nonlinear regression models, discriminant models, and data reduction models
– They do require some expertise to set up
– An appropriate architecture needs to be selected and tuned for each application
• They can be useful tools for learning from examples to find patterns in data and predict outputs
– However on their own, they tend to overfit the training data
– Meta-learning tools are needed to choose the best fit
• Various network architectures in common use
– Multilayer perceptron (MLP)
– Radial basis functions (RBF)
– Self-organising maps (SOM)
• ANNs have been applied to data editing and imputation, but not widely

Meta-Learning Methods – Bagging
• General methods for improving the performance of most learning algorithms
• Bootstrap aggregation, bagging for short
– Select B bootstrap samples from the data
• Selected with replacement, same # of instances
• Can use parametric or non-parametric bootstrap
– Fit the model/learner on each bootstrap sample
– The bagged estimate is the average prediction from all these B models
• E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees
– Note that this is not a tree
• In general, bagging a model or learner does not produce a model or learner of the same form
• Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy
– However it does not always work for poor 0-1 predictors
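The resampling loop itself is simple. A minimal sketch of the mechanics (the `fit_mean` learner, which just predicts the training mean, is a made-up stand-in for a real tree learner):

```python
import random

def fit_mean(train_x, train_y):
    """Stand-in learner: ignore x and predict the mean response."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

def bagged_predict(xs, ys, B, fit, x_new, seed=1):
    """Average the predictions of B models, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / B

ys = [20.0, 22.0, 24.0, 26.0, 28.0]
estimate = bagged_predict(list(range(5)), ys, B=200, fit=fit_mean, x_new=2)
# estimate lands close to the overall mean (24.0) but not exactly on it,
# because each bootstrap sample gives a slightly different fitted model
```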

Meta-Learning Methods – Boosting
• Boosting is a powerful technique for improving accuracy
• The “AdaBoost.M1” method (for classifiers):
– Give each instance an initial weight of 1/n
– For m=1 to M:
• Fit model using the current weights, & store resulting model m
• If prediction error rate “err” is zero or >= 0.5, terminate loop
• Otherwise calculate αm = log((1−err)/err)
– This is the log odds of success
• Then adjust weights for incorrectly classified cases by multiplying them by exp(αm), and repeat
– Predict using a weighted majority vote: Σ αm Gm(x), where Gm(x) is the prediction from model m
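The weight-update loop above can be sketched as follows. The `fit` argument stands for any weighted weak learner; this mirrors the slide's steps rather than any particular library's implementation:

```python
import math

def adaboost_m1(X, y, fit, M):
    """AdaBoost.M1: returns a list of (alpha_m, model_m) pairs.
    Assumes fit(X, y, w) returns a model, and model(x) returns a class."""
    n = len(y)
    w = [1.0 / n] * n                       # initial weight 1/n per instance
    ensemble = []
    for _ in range(M):
        model = fit(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if model(xi) != yi)
        if err == 0 or err >= 0.5:          # terminate as on the slide
            break
        alpha = math.log((1 - err) / err)   # log odds of success
        ensemble.append((alpha, model))
        # multiply weights of misclassified cases by exp(alpha), renormalise
        w = [wi * math.exp(alpha) if model(xi) != yi else wi
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def vote(ensemble, x):
    """Weighted majority vote over the stored models."""
    votes = {}
    for alpha, model in ensemble:
        votes[model(x)] = votes.get(model(x), 0.0) + alpha
    return max(votes, key=votes.get)
```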

Meta-Learning Methods – Boosting
• For example, for the German credit dataset:
– using 100 iterations of AdaBoost.M1 with the DecisionStump algorithm
– 10-fold cross-validation gives an error rate of 24.9% (compared to 26.1% for J4.8)

Association Rules
• Data on n purchase baskets in form (id, item1, item2, …, itemk)
– For example, purchases from a supermarket
• Association rules are statements of the form:
– “When people buy tea, they also often buy coffee.”
• May be useful for product placement decisions or cross-selling recommendations
• We say there is an association rule i1 -> i2 if:
– i1 and i2 occur together in at least s% of the n baskets (the support)
– And at least c% of the baskets containing item i1 also contain i2 (the confidence)
• The confidence criterion ensures that “often” is a large enough proportion of the antecedent cases to be interesting
• The support criterion should be large enough that the resulting rules have practical importance
– Also helps to ensure reliability of the conclusions
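Support and confidence (and the lift measure discussed on the next slide) for a single candidate rule can be computed directly from the baskets. This is a sketch of the definitions only; Apriori's efficient candidate generation over all item sets is not shown, and the baskets are made up:

```python
def rule_stats(baskets, i1, i2):
    """Support, confidence and lift for the candidate rule i1 -> i2."""
    n = len(baskets)
    n1 = sum(1 for b in baskets if i1 in b)                # baskets with i1
    n12 = sum(1 for b in baskets if i1 in b and i2 in b)   # with both
    n2 = sum(1 for b in baskets if i2 in b)                # baskets with i2
    support = n12 / n
    confidence = n12 / n1 if n1 else 0.0
    lift = confidence / (n2 / n) if n2 else 0.0
    return support, confidence, lift

baskets = [{"tea", "coffee"}, {"tea", "coffee", "milk"},
           {"coffee"}, {"tea"}, {"milk"}]
s, c, l = rule_stats(baskets, "tea", "coffee")
print(round(s, 2), round(c, 2), round(l, 2))  # 0.4 0.67 1.11
```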

Association Rules
• The support/confidence approach is widely used
– Efficiently implemented in the Apriori algorithm
• First identify item sets with sufficient support
• Then turn each item set into sets of rules with sufficient confidence
• This method was originally developed in the database community, so there has been a focus on efficient methods for large databases
– “Large” means up to around 100 million instances, and about ten thousand binary attributes
• However this approach can find a vast number of rules, and it can be difficult to make sense of these
• One useful extension is to identify only the rules with high enough lift (or odds ratio)

Classification vs Association Rules
• Classification rules predict the value of a pre-specified attribute, e.g.
– If outlook=sunny and humidity=high then play=no
• Association rules predict the value of an arbitrary attribute (or combination of attributes), e.g.
– If temperature=cool then humidity=normal
– If humidity=normal and play=no then windy=true
– If temperature=high and humidity=high then play=no

Clustering – EM Algorithm
• Assume that the data is from a mixture of normal distributions
– I.e. one normal component for each cluster
• For simplicity, consider one attribute x and two components or clusters
– Model has five parameters: (p, μ1, σ1, μ2, σ2) = θ
• Log-likelihood:

  l(θ | X) = Σ_{i=1}^{n} log[ p φ1(xi) + (1−p) φ2(xi) ]

  where φj(xi) is the normal density for the jth component
• This is hard to maximise directly
– Use the expectation-maximisation (EM) algorithm instead

Clustering – EM Algorithm
• Think of data as being augmented by a latent 0/1 variable di indicating membership of cluster 1
• If the values of this variable were known, the log-likelihood would be:

  l(θ | X, D) = Σ_{i=1}^{n} log[ di φ1(xi) + (1−di) φ2(xi) ]

• Starting with initial values for the parameters, calculate the expected value of di
• Then substitute this into the above log-likelihood and maximise to obtain new parameter values
– This will have increased the log-likelihood
• Repeat until the log-likelihood converges
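For the two-component, one-attribute normal mixture described above, the E- and M-steps can be sketched in plain Python. The starting values and the small dataset below are illustrative only:

```python
import math

def norm_pdf(x, mu, sigma):
    """Normal density φ(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_normals(xs, iters=50):
    """Fit (p, mu1, sigma1, mu2, sigma2) by EM for a two-component mixture."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Crude starting values: lower and upper quartiles, common spread
    p, mu1, mu2 = 0.5, xs_sorted[n // 4], xs_sorted[3 * n // 4]
    s1 = s2 = max((max(xs) - min(xs)) / 4, 1e-3)
    for _ in range(iters):
        # E-step: expected cluster-1 membership d_i for each case
        d = []
        for x in xs:
            a = p * norm_pdf(x, mu1, s1)
            b = (1 - p) * norm_pdf(x, mu2, s2)
            d.append(a / (a + b))
        # M-step: weighted re-estimates of the five parameters
        n1 = sum(d)
        n2 = n - n1
        p = n1 / n
        mu1 = sum(di * x for di, x in zip(d, xs)) / n1
        mu2 = sum((1 - di) * x for di, x in zip(d, xs)) / n2
        s1 = max(math.sqrt(sum(di * (x - mu1) ** 2 for di, x in zip(d, xs)) / n1), 1e-3)
        s2 = max(math.sqrt(sum((1 - di) * (x - mu2) ** 2 for di, x in zip(d, xs)) / n2), 1e-3)
    return p, mu1, s1, mu2, s2

xs = [-0.2, 0.0, 0.1, 0.2, 4.8, 5.0, 5.1, 5.2]
p, mu1, s1, mu2, s2 = em_two_normals(xs)
# The components settle near the two visible clusters (around 0 and 5)
```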

Clustering – EM Algorithm
• Resulting estimates may only be a local maximum
– Run several times with different starting points to find global maximum (hopefully)
• With parameter estimates, can calculate segment membership probabilities for each case:

  P(Di = 1 | xi) = φ1(xi) / [ φ1(xi) + φ2(xi) ]

Clustering – EM Algorithm
• Extending to more latent classes is easy
– Information criteria such as AIC and BIC are often used to decide how many are appropriate
• Extending to multiple attributes is easy if we assume they are independent, at least conditioning on segment membership
– It is possible to introduce associations, but this can rapidly increase the number of parameters required
• Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes
• Can extend this approach to handle joint clustering and prediction models, as mentioned in the MVA lectures

Clustering – Scalability Issues
• k-means algorithm is also widely used
• However this and the EM-algorithm are slow on large databases
• So is hierarchical clustering – requires O(n2) time
• Iterative clustering methods require full DB scan at each iteration
• Scalable clustering algorithms are an area of active research
• A few recent algorithms:
– Distance-based/k-Means
• Multi-Resolution kd-Tree for K-Means [PM99]
• CLIQUE [AGGR98]
• Scalable K-Means [BFR98a]
• CLARANS [NH94]
– Probabilistic/EM
• Multi-Resolution kd-Tree for EM [Moore99]
• Scalable EM [BRF98b]
• CF Kernel Density Estimation [ZRL99]

Ethics of Data Mining
• Data mining and data warehousing raise ethical and legal issues
• Combining information via data warehousing could violate the Privacy Act
– Must tell people how their information will be used when the data is obtained
• Data mining raises ethical issues mainly during application of results
– E.g. using ethnicity as a factor in loan approval decisions
– E.g. screening job applications based on age or sex (where not directly relevant)
– E.g. declining insurance coverage based on neighbourhood if this is related to race (“red-lining” is illegal in much of the US)
• Whether something is ethical depends on the application
– E.g. probably ethical to use ethnicity to diagnose and choose treatments for a medical problem, but not to decline medical insurance