
Understanding the Outliers in Healthcare Expenditure Data

Gandhi R. Bhattarai, PhD, OptumHealth, Golden Valley MN

ABSTRACT
When data are distributed normally, distance- and ranking-based outlier detection methods based on interquartile
ranges, standard deviations, etc. can be simple yet powerful tools to understand variation. Data related to
healthcare expenditures generally have skewed distributions, however, and may include many extreme values.
Often these data are used in case-control or other studies to estimate the impact of interventions such as
wellness or disease management programs that are designed to improve health. To produce more accurate and
useful program impact estimates, the skewed healthcare expenditure data need special handling. For example,
the presence of outliers in the program intervention (treatment) or non-intervention (comparison) groups can
influence the magnitude and variance of the program impact estimate. In this paper, an effort is made to
understand the effect of various outlier detection methods on the performance of statistical models that are used
to estimate program impact. Multivariate regression models are used with simulated data to illustrate the impact of
outliers on program impact estimates. A new approach to adjust for outliers is applied, by equalizing the ranges of
the values of healthcare expenditures across the intervention and comparison groups. This approach is applied in
a two-stage process. The first stage regression model creates propensity weights to balance the measured
demographic, health status, and other characteristics between the intervention and comparison groups. A
subsequent second-stage regression model then applies generalized estimating equation (GEE) techniques
several times, each time with a different way of adjusting for outliers. Quasi-likelihood under the Independence
model Criterion (QIC) goodness-of-fit values across these models are then compared to help find a preferred
outlier adjustment approach.

INTRODUCTION
Outlier detection is often an important part of data processing and analysis. The methods can be as simple as
graphical visualization via histograms and box-plot diagrams, or simple descriptive tools like interquartile ranges
and standard deviations. More complex approaches apply multivariate tools that look for outliers through various
density- or distance-based algorithms. These approaches are based on the assumption that the data are normally
distributed. Data related to healthcare expenditure, however, generally follow skewed distributions, such as a
Gamma distribution, and may include many extreme values. Sometimes outliers can be handled by a natural log
transformation of the variables, but this works only if the transformation yields an approximately normal (i.e.,
log-normal) distribution. Other approaches to finding and adjusting for outliers hold value, and this can be
illustrated by comparing results from statistical analyses that use different outlier detection and removal methods.
To help ground the discussion here, consider that many wellness and disease management program
interventions are used to help improve the health of a population. These interventions are often evaluated on
whether participation in a program is associated with lower health care expenditures than would have been
observed had no intervention taken place (Goetzel et al., 2005). However, the presence of outliers in either or
both of the program participant and non-participant groups can influence the estimated savings or return on
investment metrics that are produced in the evaluation. In this paper an effort is made to illustrate the effects of
various outlier detection and removal methods when program impact estimates are generated. Multivariate
regression models are used with simulated data to illustrate the impact of outliers on program impact estimates. A
new approach to adjust for outliers is applied, by equalizing the ranges of healthcare expenditure values across
the participant and non-participant groups. Several versions of this process are generated, and QIC values across
these versions are compared to help find the best outlier adjustment approach.

DATA
A set of simulated data was prepared by random methods suitable for Gamma distributions. Sample code used
to create part of the data is included in Appendix 1. A few extreme values were created using different alpha and
scale parameters. The datasets included several extreme values of the kind normally found in healthcare
expenditure data (e.g., Nyman et al., 2009; Tian and Huang, 2007). The data were generated for pre- and post-
intervention periods for a program participant group and a comparison group of non-participants. Healthcare
expenditures are expressed in terms of per member per month values measured over a simulated one-year study
period.
Table 1 shows the distribution of expenditures per member per month for both pre- and post-intervention periods.
The values jump sharply from the third quartile to the 90th percentile. A graphical presentation of pre-period
cost shows a positively skewed right tail (Figure 1a), illustrating the non-normality of these distributions. A
log-transformation of the cost variable yields a distribution much closer to normal (Figure 1b).
Table 1 Distribution of pre-period and post-period cost per member per month

Variable        Pre-Intervention    Post-Intervention
                Expenditures        Expenditures
Sample Size     11,871              11,871
Mean            $2,010              $1,643
Std Dev         $2,635              $2,011
Minimum         $44                 $27
1st Pctl        $173                $160
5th Pctl        $323                $286
10th Pctl       $439                $381
25th Pctl       $677                $597
50th Pctl       $1,092              $965
75th Pctl       $1,824              $1,583
90th Pctl       $5,381              $4,137
95th Pctl       $8,005              $6,336
99th Pctl       $13,146             $9,865
Maximum         $30,117             $19,509
Figure 1 Graphical presentation of pre-cost distribution: (a) raw cost and (b) log-transformed values

[Histograms omitted. Panel (a), raw pre-period cost (precost): N 11,871, mean 2,010, std dev 2,635, skewness 2.984, with a long right tail. Panel (b), log-transformed pre-period cost (lnprecost): N 11,871, mean 7.108, std dev 0.928, skewness 0.554, approximately normal.]
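As a minimal sketch of this transformation (the dataset and variable names are assumptions carried over from the appendices):

/* Log-transform the skewed cost variables and inspect both distributions. */
data outldat;
   set outldat;
   lnprecost = log(precost);
   lnpstcost = log(pstcost);
run;

proc univariate data=outldat;
   var precost lnprecost;
   histogram precost lnprecost;   /* compare raw vs. log-transformed shapes */
run;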

OUTLIER DETECTION METHODS


Once the analytic framework was set up, the robustness of the model and program impact estimates was
compared after dropping outliers based on various detection methods. Outlier detection was done separately for
pre-period expenditure and post-period expenditure; if an observation met either of these two thresholds, it was
flagged as an outlier. SAS® code used to identify outliers under some of the methods is given in Appendix 2 and
Appendix 3.

Outlier detection was done in the following ways:
• Interquartile range: As seen in Table 1 and Figure 1a, the univariate distribution of the data did not support
treating the cost variables as normally distributed. Both the pre-period and post-period costs were therefore
converted to natural logarithm values and assumed to be distributed as log-normal. For the interquartile-based
method, the cutoff point was set at 1.5 times the interquartile range: observations with values larger than the
third quartile (Q3) plus 1.5*IQR were dropped.
• Standard deviation from mean: Both the raw and log-transformed costs were used separately to create
two sets of outlier flags. It should be noted that standard deviation based methods are prone to a
masking effect when a few extremely large values cluster at the upper end. A mean plus 3*STD
threshold was used to identify outliers in both the raw and log-transformed expenditures.
• Robust regression: Two robust regressions were run, one each for pre- and post-period expenditures,
using the log-transformed variables. The procedure flags observations with standardized robust residuals
greater than 3.0 as outliers (for the theory behind the ROBUSTREG procedure see Chen, 2002).
• Ranking: In this approach, observations in the top 2.5 percent of the ranked values were removed
as outliers.
• Range equalized at the top: In this approach, outlier thresholds were set so that the top retained values
in the two cohorts differed as little as possible at the highest possible level, provided that no more than
two percent of the observations would be removed. This was done separately for pre- and post-period
expenditures, so the cumulative drop could be as much as four percent.
• Extreme values in Gamma distribution: This approach is based on recent progress in identifying outliers
in the exponential family of distributions, which includes the Gamma distribution. More specifically, a test
statistic was calculated based on the Tk statistic proposed by Nooghabi et al. (2010):

Tk = ( X(n-k) - X(1) ) / SUM[j = n-k+1 to n]( X(j) - X(1) )

where X(j) denotes the j-th smallest value (the j-th order statistic), n is the sample size, and k indexes the k-th
observation from the top. The statistic divides the range up to the (n-k)-th observation by the cumulative sum of
the ranges of all values lying above the k-th position. A larger value of Tk therefore indicates that the k-th
observation from the top has a relatively large influence on the data even in the presence of subsequent larger
values. For this analysis, observations with calculated Tk >= 0.05 were considered upper outliers (a sketch of this
computation follows this list). Further reading on this topic can be found in Kimber (1983), Zerbet and Nikulin
(2003) and Kumar and Lalitha (2012).
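A minimal sketch of this computation for one period follows (the dataset and variable names are assumed from the appendices; this is an illustration, not the author's exact code):

/* Compute Tk for the largest values of pre-period cost. */
proc sort data=outldat out=srt; by precost; run;

data tkstats(keep=k position costval tk);
   set srt nobs=n end=eof;
   array x{25000} _temporary_;           /* holds sorted costs; dimension is illustrative */
   x{_n_} = precost;
   if eof then do;
      xmin   = x{1};
      cumsum = 0;
      do k = 1 to 50;                    /* scan the 50 largest values */
         cumsum + (x{n-k+1} - xmin);     /* running sum of (X(j) - X(1)) for j > n-k */
         position = n - k;
         costval  = x{n-k};
         tk = (x{n-k} - xmin) / cumsum;  /* Tk = (X(n-k) - X(1)) / cumulative sum */
         output;
      end;
   end;
run;

Positions with Tk >= 0.05 would then be flagged as upper outliers, and the same pass repeated for post-period cost.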

MODELING FRAMEWORK
Differences in demographic, health status or other characteristics prior to intervention often make the comparison
of program participants and non-participants difficult. There are a number of ways to adjust for at least the
measurable differences in these characteristics. Examples include standard multiple regression techniques and
propensity score-based approaches. Either can be used to help illustrate the impact of outlier detection and
removal processes.
Given the popularity of the propensity score approaches in the literature (reviews by Sturmer, 2006; Guiping et al.,
2007; Austin, 2008; Austin, 2011a), I proceed by first using a propensity score approach to adjust for measurable
differences in the characteristics of program participants and non-participants. Then I use the results of the
propensity score analyses in subsequent regressions where alternative outlier detection and removal processes
are applied.
Many propensity score approaches could serve; here a propensity-weighted cohort balancing technique is used
to make program participants and non-participants look more like each other when applying the outlier
adjustment processes.
The propensity weighted cohort balancing technique is also known as inverse probability of treatment weighting,
or IPTW. This usually involves a logistic regression model (e.g., as explained in Rubin, 2001; Kurth et al., 2006;
Austin et al., 2007; Ahmeda et al., 2008; or Schneeweiss et al., 2009) that is used to estimate the probability of
participating in a health improvement program of interest, for both participants and non-participants in that
program. Here the predicted probability of participation in the program would be based on the aforementioned
demographic, health status, and other measurable characteristics. The predicted probabilities of participation that
are obtained from the logistic regressions would then be used to construct case weights for analyses where
different outlier detection and removal processes are applied. Applying the case weights in these analyses helps
make the program participants and non-participants look more like each other prior to intervention, and therefore
leads to more believable results that are obtained from the outlier analyses.
The propensity score adjustment approach can be illustrated as follows:
Logit(p) = a + b1x1 + b2x2 + ... + bixi    (i)

where p is the predicted probability of participation (pred), and x1, x2, ..., xi are the explanatory variables that
influence that predicted probability.

Using the results obtained from the logistic regression, the IPTW propensity score weights are calculated as:
ps_weight = (1 / pred) for participants, who were coded as 1 in the logistic regression, and
ps_weight = 1 / (1 - pred) for non-participants, who were coded as 0 in the regression.
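For example, a participant with a predicted participation probability of 0.40 receives a weight of 1/0.40 = 2.50,
while a non-participant with the same predicted probability receives a weight of 1/(1 - 0.40), or about 1.67.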
The utility of the propensity score weighting can be shown by investigating the values of the standardized
difference for each variable included in the logistic regression model (for details see Tritchler, 1995; Yang and
Dalton, 2012; Austin, 2011b). Standardized difference values less than or equal to 0.10 are considered
evidence of successfully balancing the participant and non-participant groups (Austin, 2011b), leading to more
accurate estimates of the impact of program participation on healthcare expenditures.
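For a binary variable, the standardized difference takes the form implemented in the Appendix 5 macro:

Std Diff = |p1 - p0| / sqrt( (p1*(1 - p1) + p0*(1 - p0)) / 2 )

where p1 and p0 are the proportions of the characteristic in the participant and non-participant groups. As a
worked check against Table 3, the unweighted Comorbid row gives |0.413 - 0.308| / sqrt((0.413*0.587 +
0.308*0.692)/2), approximately 0.105/0.477 = 0.22, in line with the reported value of 0.221.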
Most healthcare expenditure data are right-skewed, so models based on the assumption of a normal expenditure
distribution need to be modified or avoided. More appropriate generalized linear regression models can be used
with skewed data; the SAS PROC GENMOD approach can be used with a gamma distribution and log-link
function. Many recent studies have been published using this family of generalized linear models (see
for example Manning, 2002; Titler et al., 2008; Vlahiotis et al., 2011; Batscheider et al., 2012).
In the analysis used for this paper, a panel data set containing expenditure values obtained before and after
program participation periods was constructed. Other types of variables were also included. Thus, the statistical
model looks like this:
EXPENDITUREt0,t1 = EXP (TIME, ENGAGED, TIME*ENGAGED, X) (ii)
where,
EXPENDITURE = healthcare expenditures per month in the pre- and post-intervention periods (denoted
t0 and t1),
EXP = an exponential relationship is assumed between healthcare expenditures, participation
status, and the demographic, location, health status, and other measurable variables of sample members,
X = a set of demographic, health status, location, and other variables (age, gender, urban residence,
income group, education group, racial group, and measures of comorbid disease conditions),
TIME = an indicator for whether healthcare expenditures were measured before or after the start of the
program intervention (post-intervention = 1, pre-intervention = 0),
ENGAGED = a binary indicator denoting program participants (coded 1) and non-participants (coded
0), and
TIME*ENGAGED = an interaction term based on time period and participation status.
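As a sketch of how model (ii) maps to SAS code (a minimal version of the full macro in Appendix 6; the dataset, weight, and covariate names are assumptions carried over from the appendices):

proc genmod data=panel;
   class indv_id;
   model costx = post engaged eng_post           /* TIME, ENGAGED, TIME*ENGAGED */
                 female race_hsp race_blk age1835 age3550 urban
                 edu_low edu_med incm_low incm_med comorbid
                 / dist=gamma link=log;          /* gamma errors, log link */
   repeated subject=indv_id / type=unstr corrw;  /* GEE over pre/post records */
   weight psadjn_wgt;                            /* normalized IPTW weight (Appendix 4) */
run;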

RESULTS
Cut-off values for each outlier method and the associated drop in sample size are shown in Table 2. The largest
drop in sample (7.69%) came from the interquartile range method (Row I: Q3+1.5*IQR was used; all observations
were within Q3+3*IQR). In contrast, the mean+3*Std Dev approach dropped 5.07% of observations when raw
values were used and only 0.14% when log-transformed variables were used (Rows II and III). Equalizing the
range at the top across cohorts dropped only 2.06% of observations (Row V). Robust regression with
log-transformed expenditures dropped only 0.99% of observations (Row VI), while the method that drops the n
largest values in a gamma-distributed sample dropped only 0.30% of observations (Row VII).

Table 2 Upper end cut-off values and sample drop using alternative outlier detection methods

Sample size = 11,871                      Pre-Intervention Period      Post-Intervention Period     Overall
Outlier Detection Method Applied          Cut-off   Sample   Percent   Cut-off   Sample   Percent   Sample   Percent
                                          Value     Dropped  Dropped   Value     Dropped  Dropped   Dropped  Dropped
I.   Q3+1.5*IQR (log of cost)             $8,060    584      4.92%     $6,828    487      4.10%     913      7.69%
II.  Mean+3*STD (raw cost)                $9,910    333      2.81%     $7,665    338      2.85%     602      5.07%
III. Mean+3*STD (log of cost)             $19,113   6        0.05%     $15,017   11       0.09%     17       0.14%
IV.  Top 2.5% (ranked, raw cost)          $10,290   297      2.50%     $8,016    296      2.49%     548      4.62%
V.   Range equalized (raw cost)           $13,317   111      0.94%     $9,362    143      1.20%     244      2.06%
VI.  Robust regression (log of cost)      $17,239   22       0.19%     $13,384   20       0.17%     118      0.99%
VII. Gamma sample (Tk>=0.05, raw cost)    $17,988   19       0.16%     $13,838   18       0.15%     36       0.30%

The SAS macro used to run the logistic regression and calculate inverse propensity weights is given in Appendix 4.
The actual logistic output is omitted from this paper for brevity but is available upon request. Once the normalized
inverse propensity weights were calculated, they were used in further descriptive statistics and in the second-stage
regression models. Appendix 5 presents another macro, which calculates standardized differences for each
binary variable across the two cohorts. These variables pertain to measures of disease comorbidity, gender,
location, income, education, and race categories. The SAS code was developed based on the methodological
discussion in Yang and Dalton (2012).
Table 3 Standardized difference of variables before and after propensity weighting (original data)
                              Unweighted                              Weighted
Description                   Ctrl_pct  Engd_pct  Diff.   Std Diff    Ctrl_pct  Engd_pct  Diff.   Std Diff
Age        18 to 35           0.249     0.329     0.080   0.177       0.280     0.281     0.000   0.001
           35 to 50           0.445     0.444     0.001   0.002       0.446     0.448     0.002   0.004
           50 to 65           0.306     0.227     0.079   0.179       0.274     0.271     0.003   0.006
Gender     Female             0.527     0.457     0.070   0.140       0.498     0.496     0.002   0.004
Disease    Comorbid           0.308     0.413     0.105   0.221       0.351     0.351     0.000   0.000
Education  High               0.319     0.292     0.027   0.059       0.310     0.313     0.003   0.006
           Medium             0.387     0.344     0.044   0.091       0.370     0.368     0.001   0.003
           Low                0.294     0.364     0.071   0.151       0.321     0.319     0.001   0.002
Income     High               0.250     0.213     0.037   0.087       0.234     0.235     0.000   0.001
           Medium             0.484     0.450     0.034   0.067       0.471     0.470     0.000   0.000
           Low                0.266     0.336     0.070   0.154       0.295     0.295     0.000   0.000
Race       Black              0.150     0.191     0.040   0.107       0.166     0.166     0.000   0.001
           Hispanic           0.356     0.408     0.052   0.107       0.376     0.375     0.001   0.001
           White              0.494     0.402     0.092   0.186       0.458     0.458     0.001   0.002
Location   Urban              0.717     0.793     0.076   0.177       0.748     0.749     0.001   0.003
Table 3 above shows how applying the propensity weights removes the differences in the demographic and
other variables that were used in the propensity score weighting logistic regression model. The Std Diff column
before weighting shows many values higher than 0.10, which suggests large differences in these variables
between program participants and non-participants prior to weighting. After weighting, these standardized
difference values fall well below 0.10, suggesting that the propensity score approach worked well to balance
program participants and non-participants before the subsequent outlier removal processes were applied.
The second stage (i.e., propensity score weighted) regression results are shown in the two panels of Table 4.
While these panels show the regression coefficient estimates for each variable included in the healthcare
expenditure analysis, it is perhaps most instructive to focus on the row labeled "Simulated Savings". This row
shows the estimated savings associated with program participation and their standard deviation. Simulated
savings were obtained from the regression intercepts, coefficient values, and raw data from sample members, as
given in the code in Appendix 6. Notice that the savings estimates vary quite a bit based on the methods used to
detect and remove outliers: across Table 4 the savings vary from a high of $246.18 to a low of $109.54 per
program participant per month. Interestingly, the QIC values become larger with every drop in sample size,
whichever outlier method is used, consistent with accurate modeling using GENMOD with a gamma distribution
and a log link (recall that the extreme values were randomly generated from gamma distributions). Notice also
that there is wide variation in the QIC values across models, even though every model used the same set of
independent variables. The QIC (Quasi-likelihood under the Independence model Criterion) statistic can be used
to identify the best model, as suggested by Pan (2001); a slight variant, denoted QICu, can also be used. When
using either of these values to compare regression models, the model with the smallest statistic is preferred
(SAS documentation). Considering Table 4, the best model appears to be the first one, in which no outliers were
excluded, followed by the models where outliers were removed using Mean+3*Std Dev on log-transformed costs
(Outlier Model III) and the Gamma Tk statistic (Outlier Model VII). Overall, in this particular analysis, the best
model is the one in which no outliers were removed; it has the lowest QIC and QICu statistics.
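The QIC and QICu values for each run can be captured directly; a minimal sketch (the ODS table name GEEFitCriteria is my reading of the SAS documentation cited above):

ods output GEEFitCriteria=qicout;  /* GEE fit criteria table: QIC and QICu (assumed name) */
proc genmod data=panel;
   class indv_id;
   model costx = post engaged eng_post / dist=gamma link=log;
   repeated subject=indv_id / type=unstr;
   weight psadjn_wgt;
run;
proc print data=qicout noobs; run;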

Table 4 Summary of second stage regression results

Description           Outliers Included     Outliers I:             Outliers II:            Outliers III:
                                            Q3+1.5*IQR (log cost)   Mean+3*STD (raw cost)   Mean+3*STD (log cost)
Estimates             Coeff.  Std Err       Coeff.  Std Err         Coeff.  Std Err         Coeff.  Std Err
Intercept 7.615 0.042 ** 7.375 0.032 ** 7.491 0.035 ** 7.623 0.042 **
Engaged 0.181 0.026 ** 0.032 0.020 0.087 0.022 ** 0.173 0.026 **
Time (post=1) -0.140 0.013 ** -0.119 0.010 ** -0.142 0.010 ** -0.146 0.012 **
Engaged*Time -0.149 0.020 ** -0.089 0.017 ** -0.100 0.017 ** -0.140 0.020 **
Female -0.168 0.022 ** -0.124 0.017 ** -0.151 0.019 ** -0.166 0.022 **
Race (White: Base)
Hispanic 0.119 0.024 ** 0.081 0.019 ** 0.063 0.021 * 0.119 0.024 **
Black 0.055 0.032 0.019 0.024 -0.017 0.026 0.052 0.031
Age (50-70: Base)
18 to 35 0.068 0.029 * 0.019 0.024 0.024 0.025 0.066 0.029 *
35 to 50 -0.023 0.026 -0.058 0.021 * -0.047 0.023 * -0.030 0.026
Education (High: Base)
Low -0.091 0.027 * -0.076 0.022 * -0.074 0.024 * -0.095 0.027 *
Medium -0.136 0.027 ** -0.113 0.021 ** -0.105 0.023 ** -0.133 0.027 **
Income (High: Base)
Low 0.074 0.030 * 0.080 0.024 * 0.075 0.027 * 0.065 0.030 *
Medium -0.038 0.028 -0.026 0.022 -0.034 0.024 -0.046 0.028
Comorbid 0.045 0.023 * 0.025 0.018 0.012 0.020 0.048 0.023 *
Urban 0.005 0.025 -0.023 0.020 -0.021 0.022 0.002 0.025
Simulated Savings $246.18 $42.55 $109.54 $11.91 $134.55 $16.53 $228.89 $38.73
Model Performance
Working Correlation Matrix 0.6587 0.6509 0.6832 0.6689
GEE Fit Criteria
QIC 260222.77 404577.05 347020.77 266796.75
QICu 260203.69 404557.39 347000.53 266777.23
Sample size (N) 11,871 100% 10,958 92.3% 11,269 94.9% 11,854 99.9%
** Significant at <.0001 level; * Significant at 0.05 level

Continued .....

Table 4 (Continued): Summary of second stage regression results

Description           Outliers IV:          Outliers V:             Outliers VI:            Outliers VII:
                      Ranked Top 2.5%       Top Range Equalized     Robustreg (log cost)    Gamma sample (Tk>=0.05)
Estimates             Coeff.  Std Err       Coeff.  Std Err         Coeff.  Std Err         Coeff.  Std Err
Intercept 7.502 0.035 ** 7.563 0.040 ** 7.622 0.041 ** 7.621 0.042 **
Engaged 0.091 0.023 ** 0.143 0.024 ** 0.155 0.025 0.170 0.025 **
Time (post=1) -0.144 0.010 ** -0.151 0.011 ** -0.136 0.012 ** -0.147 0.012 **
Engaged*Time -0.097 0.017 ** -0.122 0.018 ** -0.146 0.019 ** -0.134 0.020 **
Female -0.150 0.019 ** -0.164 0.021 ** -0.177 0.021 * -0.161 0.022 **
Race (White: Base)
Hispanic 0.066 0.021 * 0.101 0.023 ** 0.120 0.023 0.116 0.024 **
Black -0.007 0.027 0.008 0.029 0.038 0.030 ** 0.040 0.031
Age (50-70: Base)
18 to 35 0.026 0.026 0.059 0.028 * 0.054 0.028 0.056 0.029
35 to 50 -0.049 0.023 * -0.034 0.025 -0.046 0.026 * -0.036 0.026
Education (High: Base)
Low -0.076 0.024 * -0.085 0.025 * -0.106 0.026 * -0.095 0.027 **
Medium -0.110 0.023 ** -0.109 0.025 ** -0.144 0.026 * -0.134 0.026 **
Income (High: Base)
Low 0.075 0.027 * 0.084 0.029 * 0.082 0.029 0.072 0.030 *
Medium -0.033 0.025 -0.027 0.027 -0.033 0.027 -0.037 0.028
Comorbid 0.012 0.020 0.036 0.021 0.054 0.022 0.044 0.022
Urban -0.015 0.022 -0.024 0.024 -0.016 0.025 -0.003 0.025
Simulated Savings $132.69 $16.47 $183.28 $27.93 $232.85 $39.70 $217.96 $36.02
Model Performance
Working Correlation Matrix 0.6835 0.6869 0.6774 0.6715
GEE Fit Criteria
QIC 339507.39 301218.35 282809.47 270912.45
QICu 339487.18 301197.77 282789.51 270892.86
Sample size (N) 11,323 95.4% 11,627 97.9% 10,254 86.4% 11,835 99.7%

** Significant at <.0001 level; * Significant at 0.05 level

To understand whether the model performance was confounded by the joint effect of the propensity weights
and the sample size adjustment, a set of models was run without propensity weighting. These models performed
in the same pattern as the weighted models (Table 5).

Table 5 Model performance without propensity weighting


Model                                    Number of    Working Corr.   QIC          QICu         Simulated Saving
                                         Clusters     Matrix                                    Mean     Std. Dev.
Outliers Included 11871 0.657 259281.55 259266.98 233.57 38.74
Model I: Q3+1.5*IQR, Log Cost 10958 0.649 407418.72 407403.92 109.35 11.27
Model II: Mean+3*STD, Raw Cost 11269 0.681 348927.73 348912.46 130.74 15.45
Model III: Mean+3*STD, Log Cost 11854 0.669 266368.97 266354.12 218.41 35.49
Model IV: Top 2.5% rank, Raw Cost 11323 0.681 340977.67 340962.44 129.59 15.46
Model V: Range Equalized at top 11627 0.685 301665.38 301650.20 170.58 24.64
Model VI: Robust Regression, Log Cost 11753 0.677 282601.14 282586.03 218.02 35.83
Model VII: Gamma Sample (tk>=0.05) 11835 0.672 270659.43 270644.53 206.49 32.73

DISCUSSION
Although the estimation and other results are explained in detail above, the primary purpose of this analysis is to
compare models with and without outliers included. Different methods of outlier detection and removal were
used so that model goodness of fit could be compared via the QIC statistics.
The results varied widely across the subsamples produced by the different outlier filters. The interquartile range
method performed poorly even after log transformation. The models based on ranking and on standard deviation
without log transformation also performed poorly. Model performance was more stable for methods based on the
standard deviation after log transformation, range equalization, or the upper extremes of a Gamma expenditure
distribution. The results from the standard-deviation-after-log-transformation and Gamma upper outlier
approaches were close to each other.

LIMITATIONS OF THE STUDY


Healthcare expenditure datasets often contain many values pertaining to study subjects who do not use the
healthcare system in the period of interest, resulting in zero values (e.g., Nyman et al., 2009). Zero-inflated
two-step regression models are generally used in such situations (for example, see Diehr et al., 1999; Tian and
Huang, 2007). However, an explanation of zero-inflation models is not required to illustrate outlier detection and
adjustment methods, so it is beyond the scope of this paper.
Since the objective of this paper is to review how alternative outlier detection and removal methods can affect
program impact estimates, it is sufficient to avoid the zero-value problem by removing observations with zero
expenditure values from the study, even if that detracts a bit from observable reality. An alternative approach
would be to leave the zero-dollar values in the study and use them as the low values in the expenditure ranges for
the participant and non-participant groups, but doing so would needlessly complicate the estimation by requiring
zero-inflated models that many readers may not be familiar with. Moreover, dropping the zeros may leave
differences in the low-end expenditure values. If there are big differences in these low-end expenditure values
across the participant and non-participant groups, one can equalize the ranges accordingly. If only the highest
values of the expenditure distribution differ much between the participant and non-participant groups, which is
often the case in healthcare expenditure datasets, one can focus on just the upper-end values for outlier
detection and removal.

CONCLUSIONS
No single approach to outlier removal works for every situation. The distribution of the data, its relative
skewness, and the presence of extreme values all warrant thorough investigation. In general, the healthcare
expenditure data used here follow a Gamma distribution with a long right tail. The program impact estimates
obtained here varied considerably based on the methods used to find and remove outliers. The best model here
was the one that did not remove any outliers. If there is a strong desire for an analysis that removes at least some
outliers, the best model here would be Model VII, based on the Gamma expenditure distribution. It is strongly
recommended that analysts try multiple models, guided by economic theory and the underlying distribution of the
expenditure data, to see which model serves them best in their situation.

REFERENCES
Ahmeda, Ali, James B. Young, Thomas E. Love, Raynald Levesqued and Bertram Pitt (2008): A propensity-
matched study of the effects of chronic diuretic therapy on mortality and hospitalization in older adults with heart
failure at University of Alabama at Birmingham. International Journal of Cardiology, 125(2): 246–253.
Austin, Peter C., Paul Grootendorst and M. Anderson (2007): A comparison of the ability of different propensity
score models to balance measured variables between treated and untreated subjects: a Monte Carlo study
Statistics in Medicine, 26:734-753.
Austin, Peter C. (2008): A critical appraisal of propensity-score matching in the medical literature between 1996
and 2003, Statistics in Medicine, 27: 2037-2049.

Austin, Peter C. (2011a): An introduction to propensity score methods for reducing the effects of confounding in
observational studies, Multivariate Behavioral Research, 46:399-424.
Austin, Peter C. (2011b): A tutorial and case study in propensity score analysis: an application to estimating the
effect of in-hospital smoking cessation counseling on mortality, Multivariate Behavioral Research, 46:119-151.
Batscheider A., Zakrzewska S., Heinrich J., Teuner C. M., Menn P., Bauer C.P., Hoffmann U., Koletzko S., Lehmann I.,
Herbarth O., von Berg A., Berdel D., Krämer U., Schaaf B., Wichmann H.E. and Leidl R. (2012): Exposure to second-hand
smoke and direct healthcare costs in children – results from two German birth cohorts, GINIplus and LISAplus. BMC Health
Services Research,12:344
Chen, Colin (2002): Robust regression and outlier detection with the ROBUSTREG procedure. Paper presented
at the 27th SAS User Group International conference, Orlando, FL, April 14-17, 2002.
Diehr, P., D. Yanez, A. Ash, M. Hornbrook and D.Y. Lin (1999): Methods for analyzing healthcare utilization and
costs, Annual Review of Public Health, 20: 125-44.
Goetzel, Ron Z., Ronald J. Ozminkowski, Victor G. Villagra and Jennifer Duffy (2005): Return on investment in
disease management: a review, Health Care Financing Review, Vol 26 (4).
Griswold, Michael, Giovanni Parmigiani, Arnie Potosky and Joseph Lipscomb (2004): Analyzing health care costs:
a comparisons of statistical methods motivated by Medicare colorectal cancer charges, Biostatistics, 1(1): 1-23.
Guiping, Yang, Stephen Stemkowki and William Saunders (2007): A review of propensity score application in
healthcare outcome and epidemiology. Paper presented at the Pharmaceutical SUG Conference, Denver, CO,
June 3-6, 2007.
Gardiner, Joseph C. (2012): Modeling heavy-tailed distributions in healthcare utilization by parametric and
Bayesian methods. Paper presented at the SAS Global Forum 2012, Orlando, FL.
Kimber, A. C. (1983): Discordancy Testing in Gamma Samples with Both Parameters Unknown, Applied
Statistics, 32(3): 304-310.
Kumar, Nirpeksh and S. Lalitha (2012): Testing for upper outliers in gamma sample, Communications in
Statistics – Theory and Methods, 41(5): 820-828.
Kurth, Tobias, Alexander M. Walkera, Robert J. Glynn, K. Arnold Chan, J. Michael Gaziano, Klaus Bergers and
James M. Robins (2006): Results of multivariable logistic regression, propensity matching, propensity adjustment,
and propensity-based weighting under conditions of nonuniform effect. American Journal of Epidemiology; 163:
262-270
Manning, W. G., A. Basu and J. Mullahy (2002): Modeling costs with generalized gamma
regression. Available at http://citeseerx.ist.psu.edu/viewdoc/summary [Accessed June 6, 2013]
Nooghabi, M. Jabbari, H. Jabbari Nooghabi and P. Nasiri (2010): Detecting outliers in gamma distribution,
Communications in Statistics – Theory and Methods, 39(4): 698-706.
Nyman, John A., Nathan A. Barleen and Bryan E. Dowd (2009): A return-on-investment analysis of the health
promotion program at the University of Minnesota. JOEM, 51(1): 54-65.
Pan, Wei (2001): Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 57(1):120-125
Rubin, Donald B. (2001): Using propensity scores to help design observational studies: application to the tobacco
litigation. Health Services & Outcomes Research Methodology, 2:169–188.
SAS Documentation. QIC goodness of fit statistic for GEE models. http://support.sas.com/kb/26/100.html
[Accessed on 6/6/2013]
Schneeweiss, Sebastian, Jeremy A. Rassen, Robert J. Glynn, Jerry Avorn, Helen Mogun and M. Alan Brookhart
(2009): High-dimensional propensity score adjustment in studies of treatment effects using health care claims
data. Epidemiology, 20(4): 512-522.
Schonlau, Matthias, Arthur van Soest, Arie Kapteyn and Mick Couper (2009): Selection bias in web surveys and
the use of propensity scores. Sociological Methods and Research, 37(3) 291-318.

Simpson, Pippa, Robert Hamer, Chan Hee Jo, B Emma Haung, Rajiv Goel, Eric Siegel and Richard Dennis
(2004): Assessing model fit and finding a fit model. Paper presented at the 29th SAS User Group International
conference, Montreal, Canada, May 9-12, 2004.
Sturmer, Til, Manisha Joshi, Robert J Glynn, Jerry Avorn, Kenneth J Rothman and Sebastian Schneeweiss
(2006): A review of the application of propensity score methods yielded increasing use, advantages in specific
settings, but not substantially different estimates compared with conventional multivariate methods, Journal of
Clinical Epidemiology, 59(5): 437-447.
Tian, Lu and Jie Huang (2007): A two-part model for censored medical cost data, Statistics in Medicine, 26:4273-
4292
Titler, Marita G., Gwenneth A. Jensen, Joanne McCloskey Dochterman, Xian-Jin Xie, Mary Kanak, David Reed,
and Leah L. Shever (2008): Cost of hospital care for older adults with heart failure: medical, pharmaceutical, and
nursing costs. Health Service Research, 43(2): 635–655 (doi: 10.1111/j.1475-6773.2007.00789.x)
Tritchler, David (1995): Interpreting the standardized difference. Biometrics, 51: 351-353.
Vlahiotis, Anna, Scott T. Devine, Jeff Eichholz, and Adam Kautzner (2011): Discontinuation rates and health care
costs in adult patients starting generic versus brand SSRI or SNRI antidepressants in commercial health plans.
Journal of Managed Care Pharmacy, 17(2).
Yang, Dongshen and Jarrod E Dalton (2012): A unified approach to measuring the effect size between two
groups using SAS®. Paper presented at the SAS Global Forum 2012, Orlando, FL.
Zerbet, A. and M. Nikulin (2003): A new statistics for detecting outliers in exponential case. Communications in
Statistics – Theory and Methods, 32(3): 573-583.

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Gandhi R Bhattarai
OptumHealth Care Solutions
PO Box 9472
Minneapolis, MN 55440-9472
Work Phone: 860-221-0355
Fax: 866-287-3122
Email: gandhi.bhattarai@optum.com
Web: www.optum.com

A SAS macro for the range equalization between cohorts, using pre-period/post-period as well as top/bottom
values, is available from the author upon request.

Appendix 1 Sample SAS® code used in creating random data


data randata1;
do i=1 to 10000;
indv_id = round(1000+((70000-10000)*ranuni(367887)));
engaged = ranbin(45268,1,.40);
if engaged=0 then do;
female = ranbin(83548,1,.55);
urban = ranbin(23567,1,.70);
comorbid = ranbin(57892,1,.30);
minority = rantbl(73566,.35,.50,.15);
edugroup = rantbl(63548,.30,.40,.30);
incmgroup= rantbl(36978,.25,.50,.25);
agegroup = rantbl(23578,.25,.45,.30);
precost = round(1000*(.34*rangam(39765,3.1))) ;
pstcost = round(1000*(.30*rangam(11345,3.2))) ;
end;
else if engaged=1 then do;
female = ranbin(32548,1,.45);
urban = ranbin(63567,1,.80);
comorbid = ranbin(32892,1,.40);
minority = rantbl(21566,.40,.40,.20);
edugroup = rantbl(34548,.35,.35,.30);
incmgroup= rantbl(34978,.35,.45,.30);
agegroup = rantbl(62578,.30,.45,.25);
precost = round(1000*(.35*rangam(83765,3.0))) ;
pstcost = round(1000*(.29*rangam(25345,3.1))) ;
end; drop i;
output;
end; run;
/* Create different set with different gamma distribution for larger values and
append them together */

Appendix 2 SAS code for creating outlier flags (rank, IQR, STD based)
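/* Note: assumes outldat already contains the log-cost variables, e.g.
   lnprecost = log(precost) and lnpstcost = log(pstcost), alongside the raw costs. */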

/* Create ranking */
proc rank data=outldat
out=outldat groups=1000;
var precost pstcost;
ranks rprecost rpstcost; run;
/* Save mean, std, q1, q3 in memory */
proc means data = outldat noprint;
output out=outldatms
mean = mprecost mpstcost mlnprecost mlnpstcost
std = sprecost spstcost slnprecost slnpstcost
q1 = q1precost q1pstcost q1lnprecost q1lnpstcost
q3 = q3precost q3pstcost q3lnprecost q3lnpstcost ;
var precost pstcost lnprecost lnpstcost ; run;
data _null_; set outldatms;
call symput ('q1precost' , q1precost);
call symput ('q1pstcost' , q1pstcost);
call symput ('q3precost' , q3precost);
call symput ('q3pstcost' , q3pstcost);
call symput ('mprecost' , mprecost);
call symput ('mpstcost' , mpstcost);
call symput ('sprecost' , sprecost);
call symput ('spstcost' , spstcost);
call symput ('q1lnprecost', q1lnprecost);
call symput ('q1lnpstcost', q1lnpstcost);
call symput ('q3lnprecost', q3lnprecost);
call symput ('q3lnpstcost', q3lnpstcost);
call symput ('mlnprecost' , mlnprecost);
call symput ('mlnpstcost' , mlnpstcost);
call symput ('slnprecost' , slnprecost);
call symput ('slnpstcost' , slnpstcost); run;
/* Create all the flags */
data outldat;
set outldat;
/* Raw cost, top 2.5% rank, 974th position out of 999*/
outl_rank=1;
if rprecost<=974 and rpstcost<=974 then outl_rank=0;
/* LogCost, IQR based outliers */
outl_iqr=1;
if lnprecost<=(&q3lnprecost.+1.5*(&q3lnprecost.-&q1lnprecost.))
and lnpstcost<=(&q3lnpstcost.+1.5*(&q3lnpstcost.-&q1lnpstcost.))
then outl_iqr=0;
/* Raw Cost, STD based outliers */
outl_stda=1;
if precost<=(&mprecost.+3*&sprecost.)
and pstcost<=(&mpstcost.+3*&spstcost.)
then outl_stda=0;
/* LogCost, STD based outliers */
outl_stdb=1;
if lnprecost<=(&mlnprecost.+3*&slnprecost.)
and lnpstcost<=(&mlnpstcost.+3*&slnpstcost.)
then outl_stdb=0;
run;

Appendix 3 SAS code to create outliers based on robust regression (example only)

%let regvars = engaged female race_hsp race_blk age1835 age3550
               urban edu_low edu_med incm_low incm_med comorbid ;
%macro runrobust (depvar, dout);
proc robustreg
method=mm /* other options: m mm s lts */
data=outldat
plots=ddplot(label=none);
ods output diagnostics=&dout.;
model &depvar. = &regvars. /
diagnostics leverage;
output out=robout
r=resid sr=stdres;
run;
proc sort data=&dout.; by obs; run;
%mend;
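/* Illustrative calls, one per period (output dataset names are assumptions): */
%runrobust(lnprecost, robdiag_pre);
%runrobust(lnpstcost, robdiag_pst);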
/* Merge pre & post cost output data to create single flag */

Appendix 4 SAS macro for propensity score weighting

%let psvars = female race_hsp race_blk urban age1835 age3550
              edu_low edu_med incm_low incm_med comorbid;
%let outlflag = outlflag10;
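%let grpvar = engaged; /* assumed: the logistic outcome is the participation indicator */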
%macro runpsm (outlflag);
/* Run logistic model */
proc logistic data=outldat;
model &grpvar. (event='1')= &psvars.
/link=logit rsquare lackfit;
output out = ps&outlflag. pred = pred;
where &outlflag.=0; run;
/* Create weights */
data ps&outlflag.; set ps&outlflag. ;
if &grpvar. = 1 then ps_wgt=1/pred;
else ps_wgt=1/(1-pred); run;
/* Normalize weights */
proc summary nway
data=ps&outlflag.;
class &grpvar.;
var ps_wgt ;
output out=tt&outlflag. sum=ttwt; run;
proc sort data=tt&outlflag.; by &grpvar.; run;
proc sort data=ps&outlflag.; by &grpvar.; run;
data ps&outlflag.;
merge ps&outlflag.
tt&outlflag. (keep=&grpvar. ttwt _freq_);
by &grpvar.;
psadj1_wgt =(ps_wgt/ttwt);
psadjn_wgt =(ps_wgt/ttwt)*(_freq_);
drop ttwt _freq_; run;
%mend;
%runpsm(&outlflag.);

Appendix 5 SAS Macro for measuring standardized difference (binary variables)


%let varlisty = age1835 age3550 age5065 female urban comorbid
edu_low edu_med edu_high incm_low incm_med
incm_high race_hsp race_wht race_blk;
%let outlflag = outlflag10;

%macro stddiffy(outlflag);
ods output "Statistics" = statsy
"T-Tests" = ttesty
"Equality of Variances" = varnsy;
proc ttest data=ps&outlflag.;
class engaged;
var &varlisty. ;
where &outlflag.=0;
run;
proc sort data=statsy; by variable class; run;
data statsy; set statsy;
by variable;
retain n0 n1 p0 p1 pdiff
std0 std1 pstd sd ;
if first.variable then do;
n0=.; n1=.; p0=.; p1=.; pdiff=.;
std0=.; std1=.; pstd=.; sd=.;

end;
if class='0' then do;
n0=n; p0=mean; std0=stddev;
end;
if class='1' then do;
n1=n; p1=mean; std1=stddev;
end;
pdiff=abs(p0-p1);
pstd=sqrt((p0*(1-p0)+p1*(1-p1))/2);
sd=pdiff/pstd;
if class='Diff (1-2)' then output;
keep variable n0 n1 p0 p1 std0 std1 pdiff pstd sd;
run;
data stdy&outlflag; set statsy;
sigmad=0; sd_ll=0; sd_ul=0;
sigmad=sqrt(((n0+n1)/(n0*n1))+ (sd**2)/(2*(n0+n1)));
sd_ll= sd - 1.96 * sigmad;
sd_ul= sd + 1.96 * sigmad;
run;
%mend;

Appendix 6 SAS Macro for second stage regression model (GENMOD)


(Including code for simulating expenditures)

%let glmvars = engaged post eng_post ;


%let simvars = col1 + col2*engaged + col3*post + col4*eng_post ;
%let varcnt = 4;

%macro rungenmod (glmvars, simvars, varcnt, outlflag);


* 1. Create panel data for GENMOD;
proc sql;
create table predat as
select *, precost as costx,
case when indv_id>0 then 0 else 1 end as post
from ps&outlflag.; quit;
proc sql;
create table pstdat as
select *, pstcost as costx,
case when indv_id>0 then 1 else 0 end as post
from ps&outlflag.; quit;
data panel&outlflag.;
set predat pstdat ;
eng_post=0;
if engaged=1 and post=1 then eng_post=1;
run;
data predat pstdat;
set panel&outlflag.;
if post=0 then output predat;
else if post=1 then output pstdat;
run;
* 2. Run GENMOD;
proc genmod data=panel&outlflag.;
ods output GEEEmpPEst=gammapse;
class indv_id ;
model costx=&glmvars./d=gamma link=log;
repeated subject=indv_id/ corrw covb type=unstr;

weight psadjn_wgt;
run; quit;
* 3. Transpose the beta values;
proc transpose data=gammapse
prefix=col out=glmest;
var Estimate;
run;
* 4. Simulate four scenarios;
data simy1pre(keep=indv_id y1_pre py1pre);
if _n_=1 then set glmest(keep=col1-col&varcnt.);
set predat;
group=1; post=0; eng_post=0;
y1_pre=&simvars.;
py1pre=exp(y1_pre);
run;
data simy0pre(keep=indv_id y0_pre py0pre);
if _n_=1 then set glmest(keep=col1-col&varcnt.);
set predat;
group=0; post=0; eng_post=0;
y0_pre=&simvars.;
py0pre=exp(y0_pre);
run;
data simy1pst(keep=indv_id y1_pst py1pst);
if _n_=1 then set glmest(keep=col1-col&varcnt.);
set pstdat;
group=1; post=1; eng_post=1;
y1_pst=&simvars.;
py1pst=exp(y1_pst);
run;
data simy0pst(keep=indv_id y0_pst py0pst);
if _n_=1 then set glmest(keep=col1-col&varcnt.);
set pstdat;
group=0; post=1; eng_post=0;
y0_pst=&simvars.;
py0pst=exp(y0_pst);
run;
* 5. Merge scenarios and calculate difference-of-difference;
proc sql;
create table xbeta&outlflag as
select *
from simy1pst as a, simy1pre as b,
simy0pst as c, simy0pre as d
where a.indv_id=b.indv_id and b.indv_id=c.indv_id
and c.indv_id=d.indv_id; /* explicit pairwise join conditions */
quit;
data xbeta&outlflag.;
set xbeta&outlflag.;
dif_ctrl=py0pre-py0pst;
dif_engd=py1pre-py1pst;
dif0dif=dif_engd-dif_ctrl;
run;
title;
%mend;
