Preparation
Install the Citrix client on your laptop: Go to http://www.mcmaster.ca/uts/vip/citrix/ins and follow the instructions
Access XLMiner
Access the Citrix log-in page at https://macapps.mcmaster.ca and use your Mac email and password to log in. Click on the RJC > Microsoft Office folder, then click on Excel Miner 3_3. The XLMiner menu appears in the Add-Ins tab in Excel.
The folder that appears by default is My Documents. This folder is not the one on your local computer, so you have to browse using the Look in option at the top of the dialog box to find your files. Your local drives will appear with a format similar to C$ on Client (V:). When you click on your drive, a new dialog box will appear asking What access do you want to grant?. Select the option Full access; it gives you the ability to modify and save the files on your computer.
Sampling
Data mining typically deals with huge databases. Algorithms and models are typically applied to a sample from a database to produce statistically valid results. XLMiner, for example, limits the training partition to 10,000 records. Once you develop and select a final model, you use it to score the observations in the larger database.
Often the event of interest is rare (examples: response to a mailing, tax fraud). Sampling may then yield too few interesting cases to effectively train a model. A popular solution is to oversample the rare cases to obtain a more balanced training set; later, the results must be adjusted for the oversampling.
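A minimal sketch of oversampling in Python rather than XLMiner; the data and the "response" column are synthetic stand-ins for a rare event:

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for a database with a rare event (about 5% responders)
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x": rng.normal(size=1000),
                       "response": rng.binomial(1, 0.05, size=1000)})

    rare = df[df["response"] == 1]
    common = df[df["response"] == 0]

    # Oversample the rare cases (sampling with replacement) to balance the classes
    rare_up = rare.sample(n=len(common), replace=True, random_state=1)
    balanced = pd.concat([common, rare_up], ignore_index=True)
    print(balanced["response"].value_counts())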
Pre-processing Data
Types of Variables
The type of a variable determines the pre-processing needed and the algorithms that can be used. The main distinction is categorical vs. numeric.
Numeric: continuous or integer
Categorical: ordered (low, medium, high) or unordered (male, female)
Variable handling
Numeric
Most algorithms in XLMiner can handle numeric data as-is; you may occasionally need to bin numeric values into categories.
Categorical
Naïve Bayes can use categorical variables as-is. For most other algorithms, you must create binary dummies (number of dummies = number of categories - 1), as sketched below.
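A minimal sketch of dummy creation in Python with pandas (the "size" column is a hypothetical example); drop_first=True yields m - 1 dummies for m categories:

    import pandas as pd

    # A categorical variable with m = 3 categories -> m - 1 = 2 dummies
    df = pd.DataFrame({"size": ["low", "medium", "high", "low"]})
    dummies = pd.get_dummies(df["size"], prefix="size", drop_first=True)
    print(dummies)  # two 0/1 columns; the dropped category is the baseline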
Detecting Outliers
An outlier is an observation that is extreme, being distant from the rest of the data (the definition of distant is deliberately vague). Outliers can have a disproportionate influence on models (a problem if the outlier is spurious). An important step in data pre-processing is detecting outliers; once detected, domain knowledge is needed to decide how to handle them.
In some contexts, finding outliers is the very purpose of the DM exercise (e.g., airport security screening). This is called anomaly detection.
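One simple detection rule, sketched in Python on synthetic data (the 3-standard-deviation threshold is a common convention, not XLMiner's):

    import numpy as np

    rng = np.random.default_rng(0)
    values = np.append(rng.normal(10, 0.5, 50), 25.0)  # one planted outlier

    # Flag observations more than 3 standard deviations from the mean
    z = (values - values.mean()) / values.std()
    print(values[np.abs(z) > 3])  # -> [25.]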
Handling Missing Data
Most algorithms will not process records with missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, they can be omitted. If many records are missing values on a small set of variables, those variables can be dropped (or proxies used). If many records have missing values, omission is not practical.
Solution 2: Imputation
Replace missing values with reasonable substitutes (e.g., a variable's mean or median).
Normalizing (Standardizing) Data
Normalization is used in some techniques where the variables with the largest scales would otherwise dominate and skew the results; it puts all variables on the same scale. Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner). Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range.
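Both scaling functions, sketched in Python with NumPy on a toy vector:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])

    # Standardize: subtract the mean, divide by the standard deviation
    z = (x - x.mean()) / x.std()

    # Alternative: scale to 0-1 using the minimum and the range
    scaled = (x - x.min()) / (x.max() - x.min())
    print(z, scaled)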
Data Partitioning
Problem: How well will our model perform with new data? Solution: Separate the data into two parts:
Training partition: used to develop the model
Validation partition: used to implement the model and evaluate its performance on new data
Test Partition
When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data). Assessing multiple models on the same validation data can overfit the validation data. Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data. Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data.
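A three-way split sketched in Python with scikit-learn (the 50/30/20 proportions are illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split

    data = np.arange(1000).reshape(-1, 1)  # stand-in for a data set

    # Carve off the training partition, then split the rest into validation and test
    train, rest = train_test_split(data, train_size=0.5, random_state=1)
    valid, test = train_test_split(rest, train_size=0.6, random_state=1)
    print(len(train), len(valid), len(test))  # 500 300 200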
Excel is limited in data capacity. However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner. Models can then be used to score larger databases. XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model).
XLMiner online help: http://www.resample.com/xlminer/help/Index.htm
Using XLMiner for Multiple Linear Regression: Specify the input and output variables
Specifying Output
RMS error
Error = actual - predicted. RMS error = root-mean-squared error = the square root of the average squared error. In the previous example, the sizes of the training and validation sets differ, so only the RMS error and the average error are comparable.
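Both measures, sketched in Python with NumPy (the actual/predicted values are illustrative):

    import numpy as np

    actual = np.array([24.0, 21.6, 34.7, 33.4])
    predicted = np.array([25.0, 22.0, 33.0, 35.0])

    errors = actual - predicted            # error = actual - predicted
    rmse = np.sqrt(np.mean(errors ** 2))   # square root of the average squared error
    avg_error = errors.mean()
    print(rmse, avg_error)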
Summary
Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization). Before algorithms can be applied, data need to be characterized and pre-processed. To evaluate performance and to avoid overfitting, data partitioning is used. Data mining methods are usually applied to a sample from a large database; the final model is then used to score the full database.
Summary Statistics
Average (sample mean), median (the middle value), minimum, maximum, standard deviation, counts & percentages
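These summaries in Python with pandas (the MEDV series is a synthetic stand-in for the Boston Housing outcome introduced below):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    medv = pd.Series(rng.gamma(5, 4.5, 500), name="MEDV")  # synthetic stand-in

    # count, mean, std, min, quartiles (including the median), max
    print(medv.describe())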
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000's
In the Boston Housing example: compare average home values in neighborhoods that border the Charles River (CHAS = 1) and those that do not (CHAS = 0).
Graphs
Histograms
Boston Housing example:
[Figure: histogram of MEDV; x-axis MEDV (5 to 50), y-axis Frequency (0 to 180)]
The histogram shows the distribution of the outcome variable (median house value).
Mode: the value with the highest frequency.
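A comparable histogram sketched in Python with matplotlib (the MEDV values are synthetic stand-ins):

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    medv = rng.gamma(5, 4.5, 500)  # synthetic stand-in for MEDV

    plt.hist(medv, bins=10)
    plt.xlabel("MEDV")
    plt.ylabel("Frequency")
    plt.title("Histogram")
    plt.show()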
Box Plot
Top outliers are defined as those above Q3 + 1.5(Q3 - Q1); the whisker's max is the maximum of the non-outliers. Analogous definitions hold for bottom outliers and for min. Details may differ across implementations.
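Computing these fences in Python with NumPy (synthetic data with two planted outliers):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(20, 5, 100), [60.0, 65.0])  # two planted outliers

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    top_fence = q3 + 1.5 * iqr       # values above this are top outliers
    bottom_fence = q1 - 1.5 * iqr
    print(x[(x > top_fence) | (x < bottom_fence)])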
Boxplots
[Figure: side-by-side box plots of MEDV (y-axis, 0 to 60) by CHAS (x-axis: 0, 1)]
Boston Housing example: display the distribution of the outcome variable (MEDV) for neighborhoods on the Charles River (1) and not on the Charles River (0).
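A sketch of such grouped box plots in Python with matplotlib (the two MEDV samples are synthetic stand-ins):

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    medv_off_river = rng.normal(22, 8, 400)  # stand-in for CHAS = 0
    medv_on_river = rng.normal(28, 10, 35)   # stand-in for CHAS = 1

    plt.boxplot([medv_off_river, medv_on_river])
    plt.xticks([1, 2], ["0", "1"])
    plt.xlabel("CHAS")
    plt.ylabel("MEDV")
    plt.show()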
Correlation Analysis
Below: a correlation matrix for a portion of the Boston Housing data, showing the correlation between each pair of variables.
       CRIM      ZN        INDUS     CHAS      NOX       RM
CRIM   1
ZN     -0.20047  1
INDUS  0.406583  -0.53383  1
CHAS   -0.05589  -0.0427   0.062938  1
NOX    0.420972  -0.5166   0.763651  0.091203  1
RM     -0.21925  0.311991  -0.39168  0.091251  -0.30219  1
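Such a matrix can be produced in Python with pandas (the DataFrame here is synthetic; with the real file you would load it first, e.g. via pd.read_csv):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["CRIM", "NOX", "RM"])

    print(df.corr())  # pairwise correlations between all numeric columns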
Matrix Plot
Shows scatterplots for variable pairs. Example: scatterplots for three Boston Housing variables.
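A matrix plot sketched in Python with pandas (synthetic columns standing in for three Boston Housing variables):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["CRIM", "INDUS", "MEDV"])

    scatter_matrix(df)  # one scatterplot per variable pair
    plt.show()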
Reducing Categories
A single categorical variable with m categories is typically transformed into m - 1 dummy variables. Each dummy variable takes the values 0 or 1.
Problem: you can end up with too many variables. Solution: reduce their number by combining categories that are close to each other. Use pivot tables to assess the sensitivity of the outcome variable to the dummies (see the sketch below). Exception: Naïve Bayes can handle categorical variables without transforming them into dummies.
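A pivot-table check sketched in Python with pandas (the "zip" categories and outcomes are hypothetical); categories with similar outcome rates are candidates for merging:

    import pandas as pd

    df = pd.DataFrame({"zip": ["A", "A", "B", "B", "C", "C"],
                       "outcome": [1, 0, 1, 1, 0, 0]})

    # Average outcome per category; rows with similar rates can be combined
    print(df.pivot_table(values="outcome", index="zip", aggfunc="mean"))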
Description of Variables
name: name of cereal
mfr: manufacturer
type: cold or hot
calories: calories per serving
protein: grams
fat: grams
sodium: milligrams
fiber: grams
carbo: grams of complex carbohydrates
sugars: grams
potass: milligrams
vitamins: % of FDA recommendation
shelf: display shelf
weight: ounces in one serving
cups: cups in one serving
Total variance (= information) is the sum of the individual variances: 379.63 + 197.32 = 576.95. Calories accounts for 379.63/576.95 = 66% of the total variance.
Top: the weights used to project the original data onto z1 and z2; e.g., the z1 scores are computed using the weights (-0.847, 0.532). Bottom: the variance reallocated to the new variables.
Component  calories     rating      Variance     Variance%    Cum%         P-value
1          -0.84705347  0.53150767  498.0244751  86.31913757  86.31913757  0
2          0.53150767   0.84705347  78.932724    13.68086338  100          1
The new variances are 498 (for z1) and 79 (for z2). The sum of these variances equals the sum of the variances of the original variables, calories and rating. The new variable z1 has most of the total variance and might be used as a proxy for both calories and rating. z1 and z2 have a correlation of zero (no information overlap).
Generalization
X1, X2, X3, ..., Xp are the original p variables. Z1, Z2, Z3, ..., Zp are weighted averages of the original variables. All pairs of Z variables have 0 correlation. Order the Zs by variance (Z1 largest, Zp smallest). Usually the first few Z variables contain most of the information, so the rest can be dropped.
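PCA on two variables, sketched in Python with scikit-learn (the calories/rating data are synthetic stand-ins for the cereal example):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    calories = rng.normal(107, 19, 77)                    # synthetic stand-in
    rating = 100 - 0.5 * calories + rng.normal(0, 8, 77)  # synthetic stand-in
    X = np.column_stack([calories, rating])

    pca = PCA()
    scores = pca.fit_transform(X)         # the z1, z2 scores
    print(pca.components_)                # weights on the original variables
    print(pca.explained_variance_ratio_)  # share of total variance per component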
Normalizing data
In these results, sodium dominates the first PC simply because of the way it is measured: its scale (mg) is greater than that of almost all the other variables, so its variance is a dominant component of the total variance. Normalize each variable to remove this scale effect.
After normalizing, the first component accounts for a smaller part of the variance, and more components are needed to capture the same amount of information.
PCA in Classification/Prediction
Apply PCA to the training data. Decide how many PCs to use. Use the variable weights from those PCs with validation/new data. This creates a new, reduced set of predictors for the validation/new data.
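This workflow sketched in Python with scikit-learn (the train/validation arrays are synthetic; in practice you would normalize first, as noted above):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    train = rng.normal(size=(100, 5))  # synthetic training partition
    valid = rng.normal(size=(40, 5))   # synthetic validation partition

    # Fit PCA on the training partition only; keep the first 2 components
    pca = PCA(n_components=2).fit(train)

    # Apply the same weights to validation / new data -> reduced predictor set
    train_pcs = pca.transform(train)
    valid_pcs = pca.transform(valid)
    print(valid_pcs.shape)  # (40, 2)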
Summary
Data summarization is important for data exploration. Data summaries include numerical metrics (average, median, etc.) and graphical summaries. Data reduction is useful for compressing the information in the data into a smaller subset. Categorical variables can be reduced by combining similar categories. Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages that contain most of the original information in fewer variables.