Question 1 (30 marks, max. 4 pages): DATA MINING METHODOLOGY PROBLEM
A) Introduction:
This paper compares two data mining methods for analysing a dataset and producing predictive results on customer retention patterns in the mobile phone industry. The given dataset is first pre-processed and then used as the input data source for the analysis. The pre-processing, performed in Microsoft Excel and in SAS, includes methodologies such as Transformation of Variables, Filtration of Outliers, Replacement of Rare Data, Data Partition and Stratified Sampling. The data mining techniques used are Logistic Regression and Decision Tree, both performed in SAS Enterprise Miner 4.3.
B) Data Pre-processing
The given dataset, churn.xls, comprises 21 variables with 5000 entries for each variable. It is used to analyse the behaviour of a customer, expressed by means of a probabilistic expression, ‘churn’, which indicates whether or not a customer would leave the organisation. The data pre-processing procedure followed is described below:
• Randomising of the dataset with Microsoft Excel™ by introducing a new column called RANDOM in the dataset and filling its 5000 values using the RAND() function. The dataset is then sorted in ascending order of RANDOM, the column is deleted, and the dataset is saved as the input data source to be analysed.
• Input Data Source is created in SAS Enterprise Miner (SAS-EM) and the churn dataset is imported as the source in the Work folder. The Variable Selection node is then added and linked to the Input Data Source. In the Input Data Source, the variable churn is selected as the target.
• Variable selection is done on the basis of Chi-Square to get rid of irrelevant variables (a χ² value of 3.92 is set as the cut-off, which is low enough for SAS-EM to reject a variable).
• The output from the Variable Selection node is linked to a Transform Variables node. Here the variables are transformed into standardised form, so that outliers can be easily replaced in the subsequent Replacement step.
• The data is then divided into Training and Test sets through the Data Partition node in the ratio of 2:1. Further, a Validation set is split off from the Training set, equal in size to approximately 34% of the Training set that remains (about a quarter of the initial two-thirds). Hence, the final ratio of Training : Validation : Testing is 50:17:33. This node is linked to a Filter Outliers node.
• The Replacement node, which follows next, replaces the extreme values which might affect the central tendency of the data. The data has already been standardised, so by replacing values less than -3 with -3 and those greater than 3 with 3, the dataset is winsorised to within three standard deviations either side of the mean (a total span of six standard deviations).
• The data output from the Replacement node is then used for further analysis in the Decision Tree node, and in Interactive Grouping for coarse classification ahead of the Regression analysis.
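
The pre-processing steps above were carried out in Excel and SAS-EM, but their logic can be sketched in a few lines of Python for illustration. This is a minimal sketch under stated assumptions, not a reproduction of the SAS-EM nodes: all function names are hypothetical, the chi-square function is a plain Pearson statistic on two categorical columns (SAS-EM's own variable selection procedure may differ in detail), the standardisation assumes the sample standard deviation, and the partition simply slices an already shuffled list at 50% and 17%.

```python
import random
import statistics


def shuffle_rows(rows, seed=0):
    """Randomise row order, mimicking a sort on a RAND() helper column."""
    rng = random.Random(seed)
    # Attach a random key to each row; the index breaks ties deterministically.
    keyed = [(rng.random(), i, row) for i, row in enumerate(rows)]
    keyed.sort(key=lambda item: item[:2])
    # Dropping the key corresponds to deleting the RANDOM column.
    return [row for _, _, row in keyed]


def chi_square(feature, target):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(feature)
    observed, f_tot, t_tot = {}, {}, {}
    for f, t in zip(feature, target):
        observed[(f, t)] = observed.get((f, t), 0) + 1
        f_tot[f] = f_tot.get(f, 0) + 1
        t_tot[t] = t_tot.get(t, 0) + 1
    stat = 0.0
    for f in f_tot:
        for t in t_tot:
            expected = f_tot[f] * t_tot[t] / n
            stat += (observed.get((f, t), 0) - expected) ** 2 / expected
    return stat


def standardise(values):
    """Convert values to z-scores (assumes a non-constant column)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


def winsorise(z_scores, limit=3.0):
    """Clip standardised values to the interval [-limit, +limit]."""
    return [max(-limit, min(limit, z)) for z in z_scores]


def partition(rows, train=0.50, valid=0.17):
    """Slice shuffled rows into training, validation and test sets."""
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


if __name__ == "__main__":
    rows = shuffle_rows(list(range(5000)), seed=42)  # stand-in for 5000 records
    train, valid, test = partition(rows)
    print(len(train), len(valid), len(test))  # 2500 850 1650
    print(winsorise([-4.2, 0.5, 3.7]))  # [-3.0, 0.5, 3.0]
```

In this sketch a variable would be kept only if `chi_square(column, churn)` exceeds the chosen cut-off (3.92 in the text above), and `winsorise` is applied to the output of `standardise`, matching the order of the Transform Variables and Replacement steps.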