Professional Documents
Culture Documents
1
NESUG 2007 Statistics and Data Analysis
Any value which lies beyond the extended curve is treated datasets were used to derive the results. The results
by an appropriate function, which maintains the presented are average across the datasets. Wherever
monotonicity of the values but brings them to an appropriate, results were calculated for validation
acceptable range. datasets.
2.4 Mahalanobis Distance Approach For logistic regressions the performance metrics used
were Lift, percentage concordance and c-value. For the
Mahalanobis distance is a multivariate approach and is linear regressions, Lift, R-square, percentage error bands
calculated for every observation in the dataset. Then every and coefficient of variation for the process were used for
observation is given a weight as inverse of the the evaluations.
Mahalanobis distance. The observations with extreme
values get lower weights.. Finally a weighted regression is The metrics used to evaluate the performance of the
run on to minimize the effect of outliers. various techniques are fairly standard. A brief description
of these metrics follows.
Mahalanobis distance is different from Euclidian distance
in that: Lift at x is defined as the percentage of total responders
• It is based on correlations between variables by which captured (for logistic regression) or the percentage of total
different patterns can be identified and analyzed volume captured (for linear regression) in the top x
• is scale-invariant, i.e. not dependent on the scale of percentile of the model output. In our experimental study,
th
measurements we chose lift at the 10 percentile as it is a very common
• it takes into account the correlations of the data set cutoff used in different marketing exercises. The results
th and
reported broadly hold for lifts at the 20 30th deciles as
Here is how it is derived: well
D2 = y’*y = y * i2 = (x – x avg)’*C-1*(x – x avg)
Where D (= +√D2) equals the Percentage concordance provides the percentage of total
Mahalanobis-distance. population which has come into concordance according to
the model. A pair of observations with different observed
y = C-1/2(x – x avg); Where x = original (log-transformed) responses is said to be concordant if the observation with
data-vector the lower ordered response value has a lower predicted
C = estimated covariance matrix mean score than the observation with the higher ordered
y = transformed data-vector response value. For more on concordance see [6].
x avg = estimated vector of averages
C-value is equivalent to the well known measure ROC. It
For more on Mahalanobis technique see [2], [3]. ranges from 0.5 to 1, where 0.5 corresponds to the model
randomly predicting the response, and a 1 corresponds to
2.5 Robust-Reg Approach the model perfectly discriminating the response. For more
on C-value see [6].
Robust-Reg is also a multivariate approach. It runs
regression on the dataset and then calculates the
R-square is the proportion of variability in a data set that is
residuals. It then computes the median of residual after
accounted for by a statistical model. The R-square value is
which it takes difference of median and actual residual and
an indicator of how well the model fits the data. Higher is
calculates MAD (mean absolute deviation) as:
its value better will be the method used to build the model.
It is one of the most common metrics used to evaluate a
Median/0.6745 (where 0.6745 is value
very broad class of models.
of sigma)
Now it calculates absolute value as residual/mad and
Coefficient of variation is a measure of dispersion of a
gives weight with respect to absolute value. This process
probability distribution. Lower its value better will be the
is run iteratively and gives weight to each observation.
method used to build the model. For more on coefficient of
Outliers are given 0 weight and are deleted from final
variation see [7].
regression. For more on robust regressions, see [4], [5].
Percentage error bands are formed by breaking the
3. THE EVALUATION PROCESS percentage of error in prediction into bands. The
percentages of population present in different error bands
Our focus of application is in the direct marketing and determine the efficiency of the model. It is desirable to
consumer finance areas. In these areas as well as in a lot have more of the observations fall into the lower error
of other industry and research settings, the problems bands.
analyzed involve either a binary or a continuous
dependent variable. Therefore, we have evaluated the
outlier treatment approaches for the linear and logistic
regression problems. Due to the difference in the nature of
these problems, we have used separate but appropriate
performance metrics for each of them. Four different
2
NESUG 2007 Statistics and Data Analysis
4. RESULTS 46
45 44.6
Percentage Concordance
43
regression model, followed by the linear regression 42 41.6
41.2
results. 41 40.6
40
38
Lift Curve: 37
36
The sigma approach gave the highest lift at the top Capping and
Flooring Method
Exponential
Smoothing
Sigma Approach
Method
Mahalabonis
Distance Method
90
C-value:
80
Capping and Flooring
70
Method
When looking at the c-values, again sigma approach came
Cumulative Lift
60
Exponential Smoothing
50 Method out to be the best approach.
40 Sigma Approach
Method
30
Mahalabonis Distance
20
Method
10
C-value of Model for different methods
0
10 20 30 40 50
0.69
Cum ulative Percentage of
Population 0.685 0.683
0.68
0.675
Figure 1: Lift comparison curves for various methods 0.671
c-value
0.669
using logistic regression 0.67
0.664
0.665
0.66
Concordance:
0.655
results.
Lift Curve:
3
Robust regression is more appropriate for linear regressions
and was not considered for logistic regressions.
3
NESUG 2007 Statistics and Data Analysis
Lift Comparison Curves of Model for different methods Coefficient of Variation values of Model for different
methods
60.00
120.00
80.00
Method
63.58
Mahalabonis Distance
30.00 60.00
Method
Sigma Approach Method
20.00 40.00
Robustreg Method
10.00 20.00
0.00 0.00
10 20 30 40 50 Capping and Exponential Mahalabonis Sigma Robustreg
Flooring Smoothing Distance Approach Method
Cumulative Percentage of Population Method Method Method Method
Figure 4: Lift comparison curves for various methods Figure 6: Coefficient of variation values comparison
using linear regression chart for various methods using linear regression
R-square:
Error Bands:
When looking at the R-square values, again Mahalabonis
distance approach came out to be slightly better than the Mahalanobis distance method outperforms all other
other methods (see figure 5). methods on the error band criteria, as it captures a
substantially larger amount of population in the lower error
R-square values of Model for different methods bands (see figure 7).
0.28
0.27 Error Bands Comparison of different methods
0.27
100.00
0.26
Cumulative Percentage of population
0.26
0.25 90.00
R-square values
0.21 10.00
Capping and Exponential Mahalabonis Sigma Robustreg 0.00
Flooring Method Smoothing Distance Approach Method
10 0 %
15 5%
20 0%
25 5%
30 0%
40 0%
50 0%
% %
%
5% %
%
75 -7 5
<5
00
00
Method Method Method
-1
-1
-2
-2
-3
-4
-5
>1
-1
%
5. CONCLUSIONS
.
Our analysis provides evidence that the type of technique
used for outlier treatment does influence the model
performance.
4
NESUG 2007 Statistics and Data Analysis
6. REFERENCES
7. ACKNOWLEDGEMENTS
8. CONTACT INFORMATION