
Data Mining in Excel Using

XLMiner
Nitin R. Patel
Cytel Software and M.I.T. Sloan

Contact Info
XLMiner is distributed by Resampling
Stats, Inc.
www.xlminer.net
Contact Peter Bruce: pbruce@resample.com
703-522-2713

What is XLMiner?
XLMiner is an affordable, easy-to-use tool for
business analysts, consultants, and business
students to:
learn the strengths and weaknesses of data mining methods,
prototype large-scale data mining applications,
implement medium-scale data mining applications.

More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally intensive techniques.

Available Data Mining Software
Application-specific: aimed at providing
solutions to end-users for common tasks
(e.g. Unica for Customer Relationship
Management, Urban Science for location
and distribution)
Technique-specific: focused on a few data
mining methods (e.g. CART from Salford
Systems, Neural Nets from HNC Software)

TECHNIQUE-SPECIFIC PRODUCTS

[Table: algorithms offered by technique-specific products (CART from
Salford, See5, NeuroShell, WizWhy, Cognos). Algorithm columns: Class. &
Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest
Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic
Regression, Time Series, Sequential Rules, K-Means, Association Rules,
Kohonen Net. Each product is marked only for the few algorithms it
implements.]

Source: Elder Research

Available Data Mining Software
Horizontal products: designed for data
mining analysts (e.g. SAS Enterprise
Miner, SPSS Clementine, IBM Intelligent
Miner, NCR Teraminer, S-Plus Insightful
Miner, Darwin/Oracle)
Powerful, comprehensive, easy to use; but they
require substantial learning effort
and are expensive

HORIZONTAL PRODUCTS

[Table: algorithms offered by horizontal products (Enterprise Miner
from SAS, Clementine from SPSS, Intelligent Miner from IBM, MineSet
from SGI, Darwin from Oracle, PRW from Unica). Algorithm columns as on
the previous table: Class. & Regr. Trees, Linear Regression, Multilayer
Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule
Induction, Logistic Regression, Time Series, Sequential Rules, K-Means,
Association Rules, Kohonen Net. Each product is marked for many of
these algorithms.]

Source: Elder Research

Desiderata for Data Mining and
Modern Data Analysis Software
Easy to use:
Data import (e.g. cross-platform, various databases)
Data handling (e.g. data partitioning, scoring)
Invoking and experimenting with procedures
Comprehensive range of procedures:
Statistics (e.g. regression, multivariate procedures)
Machine learning (e.g. neural nets, classification trees)
Database (e.g. association rules)

XLMiner is Unique
Low cost
Comprehensive set of data mining models and
algorithms that includes statistical, machine
learning, and database methods
Based on a prototype used in three years of MBA
courses on data mining at the Sloan School, M.I.T.
Focus on business applications: a book of lecture
notes and cases is in preparation (first draft
available for examination)

Why Data Mining in Excel?
Leverage the familiarity of MBA students,
managers, and business analysts with the
interface and functionality of Excel to
provide them with hands-on experience in
data mining.

Advantages
Low learning hurdle
Promotes understanding of the strengths and
weaknesses of different data mining techniques
and processes
Enables interactive analysis of data (important in
early stages of model building)
Facilitates incorporation of domain knowledge
(often key to successful applications) by
empowering end-users to participate actively in
data mining projects
Enables pre-processing of data and post-processing
of results using Excel functions, reporting in
Word, and presentations in PowerPoint

Advantages (cont.)
Supports communication between data miners and
end-users
Supports smooth transition from prototyping to
custom solution development (VB and VBA)
Emphasizes openness:
enables integration with other analytic software for
optimization (Solver), simulation (Crystal Ball), and
numerical methods;
interface modifications (e.g. custom forms and outputs);
solution-specific routines (VBA)

Examples:
Boston Celtics analysis of player statistics
Clustering for improving forecasts and optimizing price
markdowns

Size Limitations
An Excel worksheet cannot exceed 65,536 rows.
If data records are stored as rows in a single
worksheet, this is the largest data set that can be
accommodated. The number of variables cannot
exceed 256 (the number of columns).
These limits do not apply to deployment of a model
to score large databases.
If Excel is used as a view-port into a database such
as Access, MS SQL Server, Oracle, or SAS, these
limits do not apply.

Sampling
Practical data mining methodologies such
as SEMMA (SAS) and CRISP-DM (SPSS
and a European industry standard)
recommend working with a sample
(typically 10,000 random cases) in the
model and algorithm selection phase. This
facilitates interactive development of data
mining models.

XLMiner
Free 30-day trial version: limit is 200 records per
partition.
Education version: limit is 2,000 records per
partition, so the maximum size for a data set is 6,000
records.
Standard version (currently in beta test; will be
available by end of August):
Up to 60,000 records obtained by drawing samples
from large databases in accordance with SAS's
SEMMA (Sample, Explore, Modify, Model, Assess)
methodology. Training data restricted to 10,000 records.
Sampling from and scoring to Access databases (later
SQL Server, Oracle, SAS).

Data Mining Procedures in XLMiner
Partitioning data sets (into Training, Validation,
and Test data sets)
Scoring of training, validation, test, and other data
Prediction (of a continuous variable)
Classification
Data reduction and exploration
Affinity
Utilities: Sampling, graphics, missing data,
binning, creation of dummy variables

Prediction
Multiple Linear Regression with subset
selection, residual analysis, and collinearity
diagnostics.
K-Nearest Neighbors
Regression Tree
Neural Net


Classification
Logistic Regression with subset selection,
residual analysis, and collinearity diagnostics
Discriminant Analysis
K-Nearest Neighbors
Classification Tree
Naïve Bayes
Neural Networks

Data Reduction and Exploration


Principal Components
K-Means Clustering
Hierarchical Clustering


Affinity
Association Rules (Market Basket Analysis)


Partitioning
Aim: To construct training,
validation, and test data sets from the
Boston Housing data
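The random split can be sketched in plain Python (a hypothetical `partition` helper, not XLMiner's code; the seed 81801 is borrowed from the output sheet shown later, and XLMiner's own rounding yields 253/152/101 rather than this sketch's 253/151/102):

```python
import random

def partition(rows, seed=81801, fractions=(0.5, 0.3, 0.2)):
    """Shuffle the rows with a fixed seed, then cut them into
    training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# 506 census tracts, as in the Boston Housing data
train, valid, test = partition(list(range(506)))
print(len(train), len(valid), len(test))  # 253 151 102
```

Fixing the seed makes the partition reproducible, so later model runs score exactly the same validation and test records.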


Boston Housing Data (first 14 rows)

CRIM     ZN  INDUS CHAS NOX   RM    AGE  DIS  RAD TAX PTRATIO B   LSTAT MEDV
0.00632  18  2.31  0    0.538 6.575 65.2 4.09 1   296 15.3    397 4.98  24
0.02731  0   7.07  0    0.469 6.421 78.9 4.97 2   242 17.8    397 9.14  21.6
0.02729  0   7.07  0    0.469 7.185 61.1 4.97 2   242 17.8    393 4.03  34.7
0.03237  0   2.18  0    0.458 6.998 45.8 6.06 3   222 18.7    395 2.94  33.4
0.06905  0   2.18  0    0.458 7.147 54.2 6.06 3   222 18.7    397 5.33  36.2
0.02985  0   2.18  0    0.458 6.43  58.7 6.06 3   222 18.7    394 5.21  28.7
0.08829  13  7.87  0    0.524 6.012 66.6 5.56 5   311 15.2    396 12.43 22.9
0.14455  13  7.87  0    0.524 6.172 96.1 5.95 5   311 15.2    397 19.15 27.1
0.21124  13  7.87  0    0.524 5.631 100  6.08 5   311 15.2    387 29.93 16.5
0.17004  13  7.87  0    0.524 6.004 85.9 6.59 5   311 15.2    387 17.1  18.9
0.22489  13  7.87  0    0.524 6.377 94.3 6.35 5   311 15.2    393 20.45 15
0.11747  13  7.87  0    0.524 6.009 82.9 6.23 5   311 15.2    397 13.27 18.9
0.09378  13  7.87  0    0.524 5.889 39   5.45 5   311 15.2    391 15.71 21.7
0.62976  0   8.14  0    0.538 5.949 61.8 4.71 4   307 21      397 8.26  20.4

XLMiner : Data Partition Sheet
Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)

Data
Data source:          housing!$A$2:$O$507
Selected variables:   CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT
Partitioning method:  Randomly chosen
Random seed:          81801
# training rows:      253
# validation rows:    152
# test rows:          101

[Table: the partitioned records (Row Id. plus the selected variables),
listed in three blocks: rows 1, 2, 5, 6, 7, 8, 10, 12, 14; rows 3, 9,
13; and rows 4, 11, 17.]

Prediction
Multiple Linear Regression using subset selection

Aim: To estimate median residential
property value for a census tract

The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term    32.677       7.444       0.000    128852
CRIM             -0.094       0.049       0.054    3566
ZN               0.055        0.020       0.007    2550
INDUS            0.030        0.091       0.742    1529
CHAS             2.836        1.199       0.019    645
NOX              -15.889      5.463       0.004    143
RM               3.872        0.597       0.000    4697
AGE              0.007        0.019       0.728    0
DIS              -1.405       0.292       0.000    938
RAD              0.358        0.097       0.000    1
TAX              -0.013       0.005       0.019    174
PTRATIO          -0.934       0.208       0.000    620
B                0.014        0.004       0.000    502
LSTAT            -0.582       0.073       0.000    1623

Residual df:         239
Multiple R-squared:  0.738
Std. Dev. estimate:  5.025
Residual SS:         6036

Training Data scoring - Summary Report
Total sum of squared errors: 6036   RMS Error: 4.884   Average Error: 0.000

Validation Data scoring - Summary Report
Total sum of squared errors: 2848   RMS Error: 4.329   Average Error: 0.066

Test Data scoring - Summary Report
Total sum of squared errors: 2392   RMS Error: 4.866   Average Error: -1.019

# Records training:   253
# Records validation: 152
# Records test:       101
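The three "scoring - Summary Report" tables all reduce to the same arithmetic on residuals (e.g. RMS error 4.884 is the square root of 6036/253 for the training data). A minimal sketch, on illustrative numbers rather than the actual Boston Housing residuals:

```python
import math

def scoring_summary(actual, predicted):
    """Total squared error, RMS error, and average error, as in the
    Training/Validation/Test 'scoring - Summary Report' tables."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    rms = math.sqrt(sse / len(residuals))
    avg = sum(residuals) / len(residuals)
    return sse, rms, avg

sse, rms, avg = scoring_summary([24, 21.6, 34.7], [23.0, 22.0, 33.7])
print(round(sse, 2), round(rms, 3), round(avg, 3))  # 2.16 0.849 0.533
```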

Subset selection (exhaustive enumeration)

Subset size  RSS         Cp        R-Squared  Adj. R-Squared  Prob    Model (constant present in all models)
2            19472.3789  362.7529  0.5441     0.5432          0.0000  Constant, LSTAT
3            15439.3086  185.6474  0.6386     0.6371          0.0000  Constant, RM, LSTAT
4            13727.9863  111.6489  0.6786     0.6767          0.0000  Constant, RM, PTRATIO, LSTAT
5            13228.9072  91.4852   0.6903     0.6878          0.0000  Constant, RM, DIS, PTRATIO, LSTAT
6            12469.3447  59.7537   0.7081     0.7052          0.0000  Constant, NOX, RM, DIS, PTRATIO, LSTAT
7            12141.0723  47.1754   0.7158     0.7123          0.0000  Constant, CHAS, NOX, RM, DIS, PTRATIO, LSTAT

The Regression Model

Predictor (Indep. Var.)  Coefficient  Std. Error  p-value  SS
Constant                 42.8367      7.1766      0.0000   126430.6016
NOX                      -21.7852     4.6042      0.0000   3404.4565
RM                       3.7503       0.6177      0.0000   6583.3579
DIS                      -1.4072      0.2535      0.0000   211.6853
PTRATIO                  -1.0086      0.1747      0.0000   1453.9551
LSTAT                    -0.5907      0.0696      0.0000   2060.2676

Residual df:         247
Multiple R-squared:  0.6601
Std. Dev. estimate:  5.3467
Residual SS:         7061.1646

XLMiner : Multiple Linear Regression - Prediction of Validation Data

Data range: Data_Partition1!$C$273:$P$424
MaxAbsErr = 20.33   AvAbsErr = 3.57    %AvAbsErr = 15.6%
RMSErr = 4.9355     AvMEDV = 22.9645   %RMSErr = 21.5%

Predicted  Actual  NOX     RM      DIS     PTRATIO  LSTAT    AbsErr  SqErr
22.0187    21.1    0.4640  5.8560  4.4290  18.6000  13.0000  0.92    0.8439637
32.8687    32.4    0.4470  6.7580  4.0776  17.6000  3.5300   0.47    0.2196854
25.4623    25      0.4890  6.1820  3.9454  18.6000  9.4700   0.46    0.2137043
31.0814    28.5    0.4110  6.8610  5.1167  19.2000  3.3300   2.58    6.6637521
22.4236    20.4    0.5470  5.8720  2.4775  17.8000  15.3700  2.02    4.0947798
24.5690    20.3    0.5440  5.9720  3.1025  18.4000  9.9700   4.27    18.224484
23.4704    22.9    0.5240  6.0120  5.5605  15.2000  12.4300  0.57    0.3253246
14.6983    21.9    0.7180  4.9630  1.7523  20.2000  14.0000  7.20    51.86411

Frequency in Validation Dataset

AbsErr  Freq
0       0
2       61
4       40
6       25
8       10
10      9
12      2
14      3
16      0
18      0
20      1
22      1

[Figure: histogram of AbsErr for the validation data set.]

Prediction
K-Nearest Neighbors

Aim: To estimate median residential
property value for a census tract

XLMiner : K-Nearest Neighbors Prediction

Data
Source data worksheet:                     Data_Partition1
Training data used for building the model: Data_Partition1!$C$19:$Q$322
Validation data:                           Data_Partition1!$C$323:$Q$524
# cases in the training data set:          304
# cases in the validation data set:        202
Normalization:                             TRUE

Variables
Input variables:  NOX, RM, DIS, PTRATIO, LSTAT
Output variable:  MEDV

Parameters/Options
# Nearest neighbors: 1

Training Data scoring - Summary Report
Total sum of squared errors: 0      RMS Error: 0       Average Error: 0

Validation Data scoring - Summary Report
Total sum of squared errors: 3314   RMS Error: 4.669   Average Error: 0.805

Test Data scoring - Summary Report
Total sum of squared errors: 3895   RMS Error: 6.210   Average Error: -0.450

# Records training:   253
# Records validation: 152
# Records test:       101

Timings
Overall (secs): 3.00

Validation Data prediction details

Row Id.  Predicted  Actual  Residual  #Nearest   CRIM     ZN    INDUS  CHAS  NOX
                                      Neighbors
3        28.70      34.70   6.00      1          0.02729  0     7.07   0     0.469
9        14.40      16.50   2.10      1          0.21124  12.5  7.87   0     0.524
13       22.90      21.70   -1.20     1          0.09378  12.5  7.87   0     0.524
15       19.60      18.20   -1.40     1          0.63796  0     8.14   0     0.538
16       20.40      19.90   -0.50     1          0.62739  0     8.14   0     0.538
20       20.40      18.20   -2.20     1          0.7258   0     8.14   0     0.538
25       16.60      15.60   -1.00     1          0.75026  0     8.14   0     0.538
29       19.60      18.40   -1.20     1          0.77299  0     8.14   0     0.538
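A one-nearest-neighbor prediction like the run above can be sketched in a few lines; `knn_predict` is a hypothetical helper and the records are illustrative, not actual normalized Boston Housing rows:

```python
import math

def knn_predict(train, query, k=1):
    """Predict with k-nearest neighbors: average the target values of
    the k training records closest (Euclidean distance) to the query.
    Features should be normalized first, as in the XLMiner run."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))
    return sum(target for _, target in nearest[:k]) / k

# (features assumed already normalized; values are illustrative)
training = [([0.2, 0.8], 28.7), ([0.9, 0.1], 14.4), ([0.3, 0.7], 22.9)]
print(knn_predict(training, [0.22, 0.78], k=1))  # 28.7
```

With k = 1 every training record is its own nearest neighbor, which is why the training scoring report above shows zero error.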

Classification
Classification Tree

Aim: To classify census tracts into
high and low residential property
value classes

Boston Housing Data (with HIGHCLASS, first 13 rows)

CRIM     ZN  INDUS CHAS NOX  RM    AGE  DIS  RAD TAX PTRATIO B      LSTAT MEDV HIGHCLASS
0.00632  18  2.31  0    0.54 6.575 65.2 4.09 1   296 15.3    396.9  4.98  24   0
0.02731  0   7.07  0    0.47 6.421 78.9 4.97 2   242 17.8    396.9  9.14  21.6 0
0.02729  0   7.07  0    0.47 7.185 61.1 4.97 2   242 17.8    392.83 4.03  34.7 1
0.03237  0   2.18  0    0.46 6.998 45.8 6.06 3   222 18.7    394.63 2.94  33.4 1
0.06905  0   2.18  0    0.46 7.147 54.2 6.06 3   222 18.7    396.9  5.33  36.2 1
0.02985  0   2.18  0    0.46 6.43  58.7 6.06 3   222 18.7    394.12 5.21  28.7 0
0.08829  13  7.87  0    0.52 6.012 66.6 5.56 5   311 15.2    395.6  12.43 22.9 0
0.14455  13  7.87  0    0.52 6.172 96.1 5.95 5   311 15.2    396.9  19.15 27.1 0
0.21124  13  7.87  0    0.52 5.631 100  6.08 5   311 15.2    386.63 29.93 16.5 0
0.17004  13  7.87  0    0.52 6.004 85.9 6.59 5   311 15.2    386.71 17.1  18.9 0
0.22489  13  7.87  0    0.52 6.377 94.3 6.35 5   311 15.2    392.52 20.45 15   0
0.11747  13  7.87  0    0.52 6.009 82.9 6.23 5   311 15.2    396.9  13.27 18.9 0
0.09378  13  7.87  0    0.52 5.889 39   5.45 5   311 15.2    390.5  15.71 21.7 0

Training Log: Growing the Tree

#Nodes  Error
1       13.82
2       3.45
3       2.97
4       0.67
5       0.65
6       0.56
7       0.2
8       0.14
9       0.06
10      0.05
11      0.05
12      0.04
13      0.02
14      0.01
15      0.01

Validation Misclassification Summary

Classification Confusion Matrix
                Predicted Class
Actual Class    0      1
0               152    6
1               8      36

Error Report
Class    # Cases  # Errors  % Error
0        158      6         3.80
1        44       8         18.18
Overall  202      14        6.93
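The error report follows mechanically from the confusion matrix; a sketch using the counts from this slide (the per-class error counts, 6 and 8, are implied by the case totals and % errors):

```python
def error_report(confusion):
    """Per-class and overall error rates from a confusion matrix
    given as {actual_class: {predicted_class: count}}."""
    rows = {}
    total_cases = total_errors = 0
    for actual, preds in confusion.items():
        cases = sum(preds.values())
        errors = cases - preds.get(actual, 0)  # off-diagonal counts
        rows[actual] = (cases, errors, 100.0 * errors / cases)
        total_cases += cases
        total_errors += errors
    rows["overall"] = (total_cases, total_errors,
                       100.0 * total_errors / total_cases)
    return rows

report = error_report({0: {0: 152, 1: 6}, 1: {0: 8, 1: 36}})
print(report["overall"])  # (202, 14, 6.93...)
```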

XLMiner : Classification Tree - Prune Log

[Table: error by # decision nodes as the full tree is pruned back from
15 nodes. The pruning-side error runs 0.0792 at 15 nodes, then 0.0644
from 14 down to 10 nodes, with 10 nodes marked "<-- Minimum Error
Prune". The validation error runs 0.0743, 0.0743, 0.0743, 0.0693,
0.0693, 0.0693, then 0.0693 marked "<-- Best Prune", rising to 0.099
and 0.2079 for the smallest trees. Std. Err. 0.0172708.]

Classification Tree : Full Tree

[Figure: full classification tree. The root splits on RM at 6.5505
(228 vs. 76 cases); deeper splits use DIS, CRIM, LSTAT, PTRATIO, ZN,
TAX, and RM again, with leaf percentages shown in the original
diagram.]

Classification Tree : Best Pruned Tree

[Figure: best pruned tree. The root splits on RM at 6.5505 (136 vs. 66
cases, 67.3%); the subtree splits again on RM at 6.791 and on PTRATIO
at 19.45.]

Classification Tree : Minimum Error Tree

[Figure: minimum error tree (10 decision nodes). The root splits on RM
at 6.5505; further splits use DIS, CRIM, LSTAT, PTRATIO, TAX, and RM
at 6.791 and 7.635.]

Classification
Neural Network

Aim: To classify census tracts into
high and low residential property
value classes

XLMiner : Neural Network Classification

Epochs Information
Number of Epochs:   30
Accumulated Trials: 9120
Trials by class:    7860 and 1260

Architecture
Number of hidden layers:        1
Hidden layer # nodes:           25
Step size for gradient descent: 0.1000
Weight change momentum:         0.6000
Weight decay:                   0.0000
Cost Function:                  Squared Error
Hidden layer sigmoid:           Standard
Output layer sigmoid:           Standard
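The "step size" and "momentum" options control how each weight moves during training: every update combines the current gradient with a fraction of the previous update. A one-weight sketch of this rule (illustrative gradients, not XLMiner's training loop):

```python
def gd_momentum_updates(grads, step=0.1, momentum=0.6, w=0.0):
    """Apply gradient-descent updates with momentum to one weight:
    dw = -step * grad + momentum * previous_dw."""
    dw = 0.0
    history = []
    for g in grads:
        dw = -step * g + momentum * dw
        w += dw
        history.append(round(w, 4))
    return history

print(gd_momentum_updates([1.0, 1.0, 1.0]))  # [-0.1, -0.26, -0.456]
```

With momentum 0.6 the repeated gradient of 1.0 produces growing steps, which is the point: momentum accelerates movement along consistent gradient directions.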

Training Data scoring - Summary Report
Cutoff Prob. Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1     0
1               40    11
0               4     249

Error Report
Class    # Cases  # Errors  % Error
1        51       11        21.57
0        253      4         1.58
Overall  304      15        4.93

Validation Data scoring - Summary Report
Cutoff Prob. Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1     0
1               26    7
0               1     168

Error Report
Class    # Cases  # Errors  % Error
1        33       7         21.21
0        169      1         0.59
Overall  202      8         3.96

Lift chart (validation dataset)

[Figure: cumulative HIGHV when cases are sorted by predicted value vs.
cumulative HIGHV using the average, plotted against # cases.]

Decile-wise lift chart (validation dataset)

[Figure: decile mean / global mean for deciles 1 through 10.]
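A decile-wise lift chart sorts cases by predicted value, cuts them into tenths, and compares each decile's mean outcome to the global mean. A sketch on synthetic scores (a hypothetical `decile_lift` helper, not XLMiner's charting code):

```python
def decile_lift(scores, outcomes):
    """Decile-wise lift: sort cases by predicted score (descending),
    then divide each decile's mean outcome by the global mean."""
    pairs = sorted(zip(scores, outcomes), reverse=True)
    n = len(pairs)
    global_mean = sum(outcomes) / n
    size = n // 10
    lifts = []
    for d in range(10):
        chunk = [y for _, y in pairs[d * size:(d + 1) * size]]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

scores = [i / 100 for i in range(100)]            # synthetic predictions
outcomes = [1 if s > 0.7 else 0 for s in scores]  # 29 "high value" cases
lifts = decile_lift(scores, outcomes)
print([round(x, 2) for x in lifts[:3]])  # top deciles lift well above 1
```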

Data Reduction and Exploration
Hierarchical Clustering

Aim: To cluster electric utilities into
similar groups

Utilities Data

          seq#  x1    x2    x3   x4    x5    x6     x7    x8
Arizona   1     1.06  9.2   151  54.4  1.6   9077   0     0.628
Boston    2     0.89  10.3  202  57.9  2.2   5088   25.3  1.555
Central   3     1.43  15.4  113  53    3.4   9212   0     1.058
Common    4     1.02  11.2  168  56    0.3   6423   34.3  0.7
Consolid  5     1.49  8.8   192  51.2  1     3300   15.6  2.044
Florida   6     1.32  13.5  111  60    -2.2  11127  22.5  1.241
Hawaiian  7     1.22  12.2  175  67.6  2.2   7642   0     1.652
Idaho     8     1.1   9.2   245  57    3.3   13082  0     0.309
Kentucky  9     1.34  13    168  60.4  7.2   8406   0     0.862
Madison   10    1.12  12.4  197  53    2.7   6455   39.2  0.623
Nevada    11    0.75  7.5   173  51.5  6.5   17441  0     0.768
NewEngla  12    1.13  10.9  178  62    3.7   6154   0     1.897
Northern  13    1.15  12.7  199  53.7  6.4   7179   50.2  0.527
Oklahoma  14    1.09  12    96   49.8  1.4   9673   0     0.588
Pacific   15    0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
Puget     16    1.16  9.9   252  56    9.2   15991  0     0.62
SanDiego  17    0.76  6.4   136  61.9  9     5714   8.3   1.92
Southern  18    1.05  12.6  150  56.7  2.7   10140  0     1.108
Texas     19    1.16  11.7  104  54    -2.1  13507  0     0.636
Wisconsi  20    1.2   11.8  148  59.9  3.5   7287   41.1  0.702
United    21    1.04  8.6   204  61    3.5   6650   0     2.116
Virginia  22    1.07  9.3   174  54.3  5.9   10093  26.6  1.306

Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Single linkage)

[Figure: single-linkage dendrogram; distance axis from about 0.5 to 4,
leaves ordered 18, 14, 19, ..., 10, 13, 20, ..., 12, 21, 15, 22, ...,
16, 17, 11.]
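Single linkage defines the distance between two clusters as the distance between their closest members. A naive agglomerative sketch on one-dimensional points (XLMiner merges the 8-variable utility records, typically after normalization; `single_linkage` is a hypothetical helper):

```python
def single_linkage(points, n_clusters):
    """Naive agglomerative clustering with single linkage: repeatedly
    merge the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum pairwise distance
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

The sequence of merge distances is exactly what the dendrogram's vertical axis records.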

Predicted Clusters

[Table: Cluster id. and x1-x8 for each of the 22 utilities under
single-linkage clustering.]

Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Complete linkage)

[Figure: complete-linkage dendrogram; distance axis up to about 7,
leaves ordered 18, 14, 19, 22, 20, 10, 13, 12, 21, 15, 17, 16, 11,
among others.]

Predicted Clusters

[Table: Cluster id. and x1-x8 for each of the 22 utilities under
complete-linkage clustering.]

Predicted Clusters (sorted)

[Table: the 22 utilities sorted by cluster assignment, with x1-x8.]

Means
           x1    x2    x3   x4    x5   x6     x7    x8
Cluster 1  1.21  12.5  128  55.5  1.7  10163  3.2   0.874
Cluster 2  1.13  10.9  183  55.1  3.1  6546   33.2  1.065
Cluster 3  1.02  9.1   171  62.9  3.7  6526   1.8   1.797
Cluster 4  1.00  8.9   223  54.8  6.3  15505  0.0   0.566

Affinity
Association Rules
(Market Basket Analysis)

Aim: To identify types of books that
are likely to be bought by customers,
based on past purchases of books

First 20 of 2,000 customers (1 = purchased):

ChildBks YouthBks CookBks DoItYBks RefBks ArtBks GeogBks ItalCook ItalAtlas ItalArt Florence
0        1        0       1        0      0      1       0        0         0       0
1        0        0       0        0      0      0       0        0         0       0
0        0        0       0        0      0      0       0        0         0       0
1        1        1       0        1      0      1       0        0         0       0
0        0        1       0        0      0      1       0        0         0       0
1        0        0       0        0      1      0       0        0         0       1
0        1        0       0        0      0      0       0        0         0       0
0        1        0       0        1      0      0       0        0         0       0
1        0        0       1        0      0      0       0        0         0       0
1        1        1       0        0      0      1       0        0         0       0
0        0        0       0        0      0      0       0        0         0       0
0        0        1       0        0      0      1       0        0         0       0
1        0        0       0        0      1      0       0        0         0       1
1        1        0       1        1      1      0       0        1         1       0
1        1        1       0        0      0      0       0        0         0       0
1        1        1       0        0      0      1       0        0         0       0
0        0        1       0        0      0      0       0        0         0       0
0        0        1       0        0      0      0       0        0         0       0
1        1        1       1        1      1      1       0        0         0       0
1        1        1       0        0      1      0       0        0         0       1

XLMiner : Association Rules

Data
Input Data:    Sheet1!$A$1:$K$2001
Data Format:   Binary Matrix
Min. Support:  200
Min. Conf. %:  70
# Rules:       19

Rule #  Conf. %  Antecedent (a)         Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
1       100      ItalCook =>            CookBks         227         862         227             2.32
2       82.19    DoItYBks, ArtBks =>    CookBks         247         862         203             1.91
3       81.89    DoItYBks, GeogBks =>   CookBks         265         862         217             1.90
4       80.33    CookBks, RefBks =>     ChildBks        305         846         245             1.90
5       80       ArtBks, GeogBks =>     ChildBks        255         846         204             1.89
6       81.18    ArtBks, GeogBks =>     CookBks         255         862         207             1.88
7       79.63    YouthBks, CookBks =>   ChildBks        324         846         258             1.88
8       80.86    ChildBks, RefBks =>    CookBks         303         862         245             1.88
9       78.87    DoItYBks, GeogBks =>   ChildBks        265         846         209             1.86
10      79.35    ChildBks, DoItYBks =>  CookBks         368         862         292             1.84
11      77.87    CookBks, DoItYBks =>   ChildBks        375         846         292             1.84
12      77.66    CookBks, GeogBks =>    ChildBks        385         846         299             1.84
13      78.18    ChildBks, YouthBks =>  CookBks         330         862         258             1.81
14      77.85    ChildBks, ArtBks =>    CookBks         325         862         253             1.81
15      75.75    CookBks, ArtBks =>     ChildBks        334         846         253             1.79
16      76.67    ChildBks, GeogBks =>   CookBks         390         862         299             1.78
17      70.65    GeogBks =>             ChildBks        552         846         390             1.67
18      70.63    RefBks =>              ChildBks        429         846         303             1.67
19      71.1     RefBks =>              CookBks         429         862         305             1.65
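Each row of the rule table reduces to three counts over the 2,000 baskets. A sketch with a hypothetical `rule_metrics` helper on a four-basket toy matrix; the same formulas reproduce the slide's numbers (e.g. rule 1: confidence 227/227 = 100%, lift 1.0 / (862/2000) = 2.32):

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and lift for antecedent => consequent over
    a binary basket matrix (one dict of item -> 0/1 per basket)."""
    n = len(baskets)
    sup_a = sum(all(b[i] for i in antecedent) for b in baskets)
    sup_c = sum(all(b[i] for i in consequent) for b in baskets)
    sup_ac = sum(all(b[i] for i in antecedent + consequent)
                 for b in baskets)
    confidence = sup_ac / sup_a
    lift = confidence / (sup_c / n)
    return sup_a, sup_c, sup_ac, confidence, lift

# toy matrix, not the actual 2,000-customer data
baskets = [
    {"ItalCook": 1, "CookBks": 1},
    {"ItalCook": 0, "CookBks": 1},
    {"ItalCook": 1, "CookBks": 1},
    {"ItalCook": 0, "CookBks": 0},
]
print(rule_metrics(baskets, ["ItalCook"], ["CookBks"]))
```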

Some Utilities
Sampling from worksheets and databases
Database scoring
Graphics
Binning

Simple Random Sampling

Stratified Random Sampling
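Stratified sampling draws a quota from each stratum so that rare groups are represented; a sketch with a hypothetical `stratified_sample` helper:

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=1):
    """Stratified random sampling: group records by stratum, then draw
    a fixed number of records from each group."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for r in records:
        by_stratum[stratum_of(r)].append(r)
    sample = []
    for group in by_stratum.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# 90 "low" and 10 "high" records; each stratum contributes equally
data = [("low", i) for i in range(90)] + [("high", i) for i in range(10)]
s = stratified_sample(data, lambda r: r[0], per_stratum=5)
print(len(s))  # 10
```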

Scoring to databases and worksheets

Binning continuous variables
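One common scheme is equal-width binning; a sketch (hypothetical helper, illustrative TAX values):

```python
def equal_width_bins(values, n_bins):
    """Bin a continuous variable into n_bins equal-width intervals,
    returning a bin index (0 .. n_bins-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

tax = [296, 242, 222, 311, 307, 666]
print(equal_width_bins(tax, 3))  # [0, 0, 0, 0, 0, 2]
```

A single outlier (here 666) can pull most values into one bin, which is why equal-frequency binning is sometimes preferred for skewed variables.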

Missing Data


Graphics: Boston Housing data

[Figure: histogram of AGE (frequency vs. AGE, 0 to 100) and box plot
of AGE.]

[Figure: histogram of RM (bins from 3.6 to 9.6) and box plot of RM.]

[Figure: matrix scatter plot of RM, AGE, and TAX. Annotation: "High
tax towns have fewer rooms on average?"]

[Figure: box plot of RM by Binned_TAX.]

Future Extensions

Cross Validation
Bootstrap, Bagging and Boosting
Error-based clustering
Time Series and Sequences
Support Vector Machines
Collaborative Filtering

In Conclusion
XLMiner is a modern tool-belt for data mining. It
is an affordable, easy-to-use tool for consultants,
MBAs, and business analysts to learn, create, and
deploy data mining methods.
More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally intensive techniques.