CREDIT SCORING AND DATA MINING
MANG 6054 (Individual Course-Work)
13-05-2009

Sujoy Singha, M.Sc. Management Science, University of Southampton, 2008/09
Student ID: 22924299
ss21g08@soton.ac.uk

TUTOR:
Dr. Bart Baesens, School of Management, University of Southampton
Bart.Baesens@econ.kuleuven.ac.be
 
Question 1 (30 marks, max. 4 pages): DATA MINING METHODOLOGY PROBLEM

A) Introduction:
This paper compares two data mining methods for analysing a dataset and producing predictive results on customer retention patterns in the mobile phone industry. The given dataset is first pre-processed and then used as the input data source for the analysis. The pre-processing, performed in Microsoft Excel and in SAS, includes Transformation of Variables, Filtration of Outliers, Replacement of Rare Data, Data Partition and Stratified Sampling. The data mining techniques used are Logistic Regression and Decision Tree, both performed in SAS Enterprise Miner 4.3.
B) Data Pre-processing
The given dataset, churn.xls, comprises 21 variables with 5000 entries for each variable. It is used to analyse customer behaviour, expressed by means of the probabilistic variable 'churn', which indicates whether or not a customer will leave the organisation. The data pre-processing procedure followed is described below:
Randomising of the dataset with Microsoft Excel™ by introducing a new column called RANDOM in the dataset and filling its 5000 values using the RAND() function. The dataset is then sorted in ascending order of RANDOM, the column is deleted, and the dataset is saved as the input data source to be analysed.
Input Data Source is created in SAS Enterprise Miner (SAS-EM), and the churn dataset is imported as the source in the Work folder. A Variable Selection node is then added and linked to the Input Data Source. In the Input Data Source, the variable churn is selected as the target.
Variable selection is done on the basis of Chi-Square to get rid of irrelevant variables (a χ² value of 3.92 is set as the cut-off; a variable whose χ² falls below this low threshold is rejected by SAS-EM).
The output from the Variable Selection node is linked to the Transform Variables node. Here the variables are transformed into standardised form, so that outliers can be easily replaced in the subsequent Replacement step.
The data is then divided into Training and Test sets through the Data Partition node in the ratio of 2:1. Further, approximately 34% of the Training set is taken as the Validation set. Hence, the final ratio of Training : Validation : Testing is 50:17:33. This node is linked to a Filter Outliers node.
The Replacement node, which follows next, replaces the extreme values that might distort the central tendency of the data. Since the data has already been standardised, replacing values less than -3 with -3 and values greater than 3 with 3 winsorises the dataset to within three standard deviations either side of the mean (a total range of six standard deviations).
The data output from the Replacement node is then used for further analysis in the Decision Tree node, and in the Interactive Grouping node, which performs the coarse classification for the Regression analysis.
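The standardisation, replacement and partition steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS-EM implementation: the function names, the sample column of usage minutes and the random seed are all hypothetical, and the split uses the final 50:17:33 ratio quoted in the text.

```python
import random
import statistics


def standardise(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def winsorise(z_scores, limit=3.0):
    """Clamp standardised values to the interval [-limit, +limit]."""
    return [max(-limit, min(limit, z)) for z in z_scores]


def partition(rows, fractions=(0.50, 0.17, 0.33), seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is the test set
    return train, valid, test


# Hypothetical usage-minutes column containing one extreme value (900.0).
day_mins = [180.0 + (i % 7) * 5.0 for i in range(49)] + [900.0]
clipped = winsorise(standardise(day_mins))

# Row indices stand in for the 5000 records of the churn dataset.
train, valid, test = partition(range(5000))
print(len(train), len(valid), len(test))
```

Shuffling inside `partition` plays the same role as the RANDOM column in Excel: it removes any ordering in the file before the rows are cut into subsets.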
 
C) Results from Data pre-processing:
Based upon the Chi-Square criterion of Variable Selection, the selected variables were:
STATE, INTL_PLAN, MAIL_PLAN, DAY_MINS, DAY_CHARGE, EVE_MINS, EVE_CHARGE, NIGHT_MINS, INTL_MINS and INTL_CALLS.
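To make the selection criterion concrete, the sketch below computes a Pearson chi-square statistic for one candidate variable against the churn target and compares it with the 3.92 cut-off. The counts are invented for illustration, the function name is hypothetical, and the rule "keep the variable when the statistic reaches the cut-off" is an assumption about how the threshold is applied; the actual SAS-EM procedure involves additional binning steps not reproduced here.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of variable and target.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


CUTOFF = 3.92  # cut-off quoted in the text

# Hypothetical counts: rows = plan (no, yes), columns = churn (no, yes).
counts = [[4000, 500], [300, 200]]
stat = chi_square(counts)
keep_variable = stat >= CUTOFF
print(round(stat, 2), keep_variable)
```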
An Insight node is linked to the Input Data Source to detect the presence of any univariate outliers. Vmail_Message shows a histogram characteristic of univariate outliers, but the variable is eliminated in the selection process, and hence no treatment is required.

Fig 1: Network diagram for Churn analysis.
D) LOGISTIC REGRESSION:
The coarse classification of variables is handled by the Interactive Grouping node. Here, the Commit criterion was set to Information Value and the Commit value was set to 0.1. For the regression analysis, logistic regression was used with a logit link function, and the selection method was Stepwise. No initial parameter estimates were used and no interactions between variables were modelled. The network diagram for the entire analysis is shown in Fig 1.
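As an illustration of the Information Value criterion, the sketch below computes the information value of one coarse-classified variable from per-group counts and compares it with the commit value of 0.1. The group counts and function name are hypothetical, and treating 0.1 as a minimum for keeping the variable is an assumption about how the node applies the commit value.

```python
import math


def information_value(groups):
    """Information value of a grouped variable.

    groups: list of (non_churners, churners) counts, one pair per group.
    Every count must be positive, otherwise the logarithm is undefined.
    """
    total_good = sum(good for good, _ in groups)
    total_bad = sum(bad for _, bad in groups)
    iv = 0.0
    for good, bad in groups:
        p_good = good / total_good
        p_bad = bad / total_bad
        woe = math.log(p_good / p_bad)  # weight of evidence of the group
        iv += (p_good - p_bad) * woe
    return iv


COMMIT_VALUE = 0.1  # threshold quoted in the text

# Hypothetical grouping of one variable into three coarse classes.
groups = [(1500, 100), (2000, 300), (800, 300)]
iv = information_value(groups)
keep_variable = iv >= COMMIT_VALUE
print(round(iv, 3), keep_variable)
```

A variable whose groups all have the same churn rate yields an information value of zero, which is why a positive threshold screens out uninformative inputs.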
RESULTS OBTAINED FROM LOGISTIC REGRESSION:
i) Estimated Parameters: These are obtained from the Maximum Likelihood coefficient estimates in the Regression output.
Parameter                  Estimate   P-Value
Intercept Value            -1.4697    <.0001
Intl_Plan No               -1.2144    <.0001
Vmail_Plan No               0.6108    <.0001
Evening Usage (Mins)        0.297     <.0001
Night Usage (Mins)          0.2405    0.0005
Intl Usage (Mins)           0.2603    0.0004
Intl Calls                 -0.2771    0.0008
Customer Service Calls     -1.285     <.0001
Day Charge                 -1.2362    <.0001
Table 1: Analysis of Maximum Likelihood co-efficient estimates
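With a logit link, the estimates in Table 1 combine into a linear score that is mapped to a probability by the logistic function. The minimal sketch below shows only that mapping, using the intercept from the table. The text does not state which outcome the model treats as the event or how the grouped inputs are coded, so the result should be read as the probability implied by the intercept alone, not as a churn rate.

```python
import math


def logistic(score):
    """Inverse of the logit link: map a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


INTERCEPT = -1.4697  # intercept estimate from Table 1

# Probability implied by the intercept alone, i.e. when every other
# term of the linear predictor is zero.
baseline = logistic(INTERCEPT)
print(round(baseline, 3))
```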