You are on page 1of 17

Statistica & Applicazioni - Vol. XI, n. 2, 2013, pp.

177-192

STATISTICAL ANALYSIS OF CUSTOMER LIFETIME VALUE:


A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS

Maurizio Toccu*
Alessandro Fassò*

SUMMARY
Customers are important intangible assets of a company, but they are not equally remunerative.
Therefore, to identify profitable customers, it is necessary to use a disaggregated metric called cus-
tomer lifetime value (CLV). This metric represents the present value of all future profits generated
by a customer. Using a known formula, it is possible to calculate customer lifetime value taking into
account the survival probability of each customer. Therefore, in this paper we develop a statistical
analysis of CLV using SAS. To illustrate this method we present a simple case study related to a tel-
ecommunications company. In particular, we start the study by analyzing the distribution of custo-
mers and revenues (a component of CLV) with respect to geographical area and size of customer.
We continue with an analysis of the distribution of CLV with respect to the above factors, and this
allows us to make a comparison, on average, with the distribution of revenues. Finally, we estimate
a regression model to detect the factors affecting CLV, thereby obtaining interesting results.

Keywords: Churn Prevention, Customer Retention, Telecommunications Sector, SAS.


q 2013 Vita e Pensiero / Pubblicazioni dell’Università Cattolica del Sacro Cuore

1. INTRODUCTION

Over the past decade, customer management has rapidly emerged as an important
area of marketing research. In particular, the transition from the traditional product-
centric approach to a customer-centric approach reflects the fact that customers are
not only critical assets (Blattberg and Deighton, 1991) but are actually the most im-
portant assets of a company. For this reason, over time, different economic and fi-
nancial metrics have been created to evaluate customers. Indeed, it may be advanta-
geous to select certain profitable customers or assign different resources to different
groups of customers. For example, Peppers and Rogers (1993) argue that selling
more goods to fewer people is not only more efficient but also more profitable.
However, such controls are not possible using aggregate financial measures. Custo-
mer lifetime value (CLV), in contrast, is a disaggregate metric that enables valuable
customers to be identified and resources to be allocated accordingly.
Specifically, CLV can be used for several purposes. The first of these is customer
acquisition. Indeed, CLV defines the upper bound of what one should spend to ac-
quire a customer (Berger and Nasr, 1998). The second is customer selection, that is,
the focus on customers with high CLV (Venkatesan and Kumar, 2004). Thirdly it

* Dipartimento di Ingegneria - Università di Bergamo - via G. Marconi, 5 - 24044 BERGAMO


(e-mail: maurizio.toccu@unibg.it; alessandro.fasso@unibg.it).

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 177


178 M. TOCCU - A. FASSÒ

can be used for the purpose of resource allocation, with respect to more profitable
customers (Reinartz and Kumar, 2003).
As customers are not all equally valuable, there is growing interest on the part of
companies in performing quantitative marketing and statistical analysis based on
CLV. As a result, in recent years there has been an increase in the literature dedi-
cated to CLV analysis. Below we summarise significant recent literature on CLV.
Gupta, Hanssens, Hardie, Kahn, Kumar, Lin, Ravishanker and Sriram (2006)
have developed a new method for calculating CLV which uses survival methods to
estimate customer churn probability. We discuss these methods in Section 2.
Aeron, Bhaskar, Sundararajan, Kumar and Moorthy (2008) discuss the credit card
market. In particular, they argue that CLV estimation can help a credit card issuer in
making crucial decisions. The authors present a conceptual model for credit card rev-
enue and a metric for CLV designed for credit card customers.
Jen, Chou and Allenby (2009) write about the analysis of customer value in direct
marketing. Typically this analysis combines customer timing and quantity data into a
single statistic. This statistic is used to compute lifetime values in order to rank cus-
tomers for different actions and to identify prospects for cross-selling. However, cur-
rent models assume that purchase timing and quantity decisions are independent (i.e.
uncorrelated) over time, given individual-level parameters. The authors show that in
these models customer value calculations can be severely biased when timing and
quantity are dependently related. They thus propose some alternative models.
Kumar, Aksoy, Donkers, Venkatesan, Wiesel and Tillmanns (2010) argue that
customers can interact and create value for firms in different ways. Their article pro-
poses that assessing the value of customers based solely upon their transactions with
a firm may not be sufficient. Therefore they propose four components of customer
engagement value (CEV) for a firm: customer lifetime value, customer referral value,
value of customer influence and customer knowledge value.
With respect to CLV for telecommunications companies (TLCs), we cite Lu
(2003), who estimates CLV using the equation (1) proposed by Gupta et al. (2006).
Compared to previous article, in this work we introduce a statistical analysis of CLV
which examines the distribution with respect to geographical area and size of custo-
mer. Moreover, we use regression analysis to verify what factors affect CLV, obtain-
ing interesting results. Our analyses use SAS because it makes it possible to manage
large datasets and performs advanced statistical analyses.
The rest of this paper consists of the following sections. Section 2 introduces sur-
vival analysis techniques and CLV; Section 3 contains a statistical analysis of CLV
based on the dataset of a TLC company and relevant pieces of SAS code. Finally, in
Section 4 we present our conclusions.

2. STATISTICAL METHODS

As mentioned previously, the purpose of this paper is to develop a statistical ana-


lysis of CLV. Specifically, we have chosen the CLV model developed by Gupta et

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 178


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 179

al. (2006) as it allows to weigh CLV with customers survival probabilities of our
TLC company.
Therefore, in the sections which follow we provide a brief introduction to the
concept of customer lifetime value and survival analysis techniques.

2.1 Customer Lifetime Value

CLV is defined as the net present value of the cash flows, attributed to the relation-
ship with a customer over time. CLV resembles the discounted cash flow approach
used in finance. However, it differs from it in two ways. Firstly, CLV is defined
and estimated for each customer, it is a disaggregate metric. Secondly, in contrast to
the discounted cash flow approach, according to Gupta et al. (2006) CLV explicitly
incorporates the possibility that a customer may defect to competitors in the future.
Specifically, Gupta et al. (2006) define CLV as follows:

XT
ðpt  ct Þrt
CLV ¼  AC ð1Þ
t¼0 ð1 þ iÞt

where:
– pt is the price paid by a customer at time t
– ct is the direct cost of servicing the customer at time t;
– i is the discount rate or cost of capital for the firm;
– tt is the probability of customer to be survived at time t;
– AC is the acquisition cost;
– T is the time horizon for estimating CLV.
When the above formula includes the acquisition cost AC, it represents the CLV
of a potential customer. Without AC, it represents the expected residual lifetime value
of an existing customer.

2.2 Survival Analysis

Survival data have two important features that are difficult to manage with alterna-
tive statistical methods: duration data and censored data. There are three kinds of
censoring: right, left and interval. In the dataset for above TLC company only the
first of these is detected. Survival analysis is characterized by two important func-
tions: the survival function and the hazard function. These are used to estimate sur-
vival probabilities for customers and risk of customer churn for during the study
period.
The main methods used in analyzing survival data are nonparametric, semipara-
metric and parametric. In this paper only the first two are used.
Non-parametric methods are useful in implementing exploratory data analysis. In

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 179


180 M. TOCCU - A. FASSÒ

particular, Section 3 contains graphs illustrating the survival and hazard functions for
customers of above TLC company. Parametric methods, on the other hand, are more
efficient when the shape of the hazard function follows a known theoretical distribu-
tion. However, we prefer to use Cox’s semiparametric model because, thanks to the
corrections made by Breslow (1974) and Efron (1977) to the Cox partial likelihood
function, it makes it possible to handle tied survival times that are present in the da-
taset of above TLC company. Finally, the SAS procedure proc phreg, which we use
to estimate the Cox model, makes it possible to handle time-dependent covariates
(Allison, 2004).
A property of the Cox model is that different individuals have hazard functions
that are proportional. For example, hðtjx1 Þ=hðtjx2 Þ is the ratio ofthe hazard functions

for two individuals with prognostic factors or covariates x1 ¼ x11 ; x21 ; :::; xp1 and
x2 ¼ ðx12 ; x22 ; :::; xp2 Þ. This ratio is constant. Therefore the ratio of the risk of death
(in this case churn) of two individuals is the same, no matter how long they survive.
The Cox model is characterized by the partial likelihood function:
" #
Pp
exp bj xjðiÞ
Yk
j¼1
LðbÞ ¼ " #
i¼1 P P p
exp bj xjl
l2RðtðiÞ Þ j¼1

where:
– k is the
 number of distinct failure times;
– R tðiÞ is the risk set that contains all persons at risk at time ti ;
– xjl are the covariates observed from individual l;
– bj are the unknown coefficients of covariates.
However, in practice, the covariates may be observed more than once during the
study and their values change with time (Lee and Wang, 2003). Therefore, in this pa-
per, we use an extension of the Cox model  as the dataset of studied TLC company
contains time-dependent covariates xjl tðiÞ that are observed from individual l at the
ordered uncensored event time tðiÞ .
The hazard function states that the hazard for an individual at time t is the product
of two factors:
– a baseline hazard function h0 ðt Þ, i.e. the hazard function for an individual whose
covariates all have value of zero;
– a linear function of a set p of exponentiated covariates.
!
Xp
hðtjxÞ ¼ h0 ðt Þ exp bj xj ¼ h0 ðt Þ expðbx Þ ð2Þ
j¼1

where the left-hand side of (2) is a function of hazard ratio (or relative risk) and the
right-hand side is a linear function of the covariates and their respective coeffi-
cients.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 180


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 181

Therefore, by dividing both sides of (2) by h0 ðt Þ and taking its logarithm, we ob-
tain the Cox regression model, with reference to the ith case (1, 2, ..., n):
hi ðtÞ
log ¼ b1 x1i þ b2 x2i þ ::: þ bp xpi
h0 ðtÞ
An important statistic provided by the Cox model is the hazard ratio (HR), which
is defined as eb . In Section 3.1 we show the interpretation of HR and explain the
method for handling time-dependent covariates using SAS.

3. CASE STUDY

Churn of profitable customers represents a significant problem for TLC companies,


who employ retention strategies to retain them. In addition, TLC companies need to
identify highly-profitable and non-profitable customers.
Therefore, in this section, we calculate the customer lifetime value (CLV) for
each customer of above TLC company using equation (1). We then propose a statisti-
cal analysis of CLV. The results of this study may be useful for TLC companies and
their decision makers in developing customer loyalty and maximizing CLV.
The dataset of our TLC company contains 30,379 active Italian business custo-
mers. The period of study is six months long (July 2010-December 2010). The re-
sponse variable, duration, represents the durations, or survival times, of each custo-
mer during the period of study. In this period these survival times may be effective,
if a customer leaves the TLC company, or censored, if a customer does not churn.
To indicate censoring, we use the dummy variable event. If event = 1 the survival
time is actual or uncensored, while if event = 0 the survival time is censored.
Our dataset contains two types of variables:
– time-independent, such as customer information variables (pre-activation, size of
customer, geographical area, type of contract);
– time-dependent, such as customer behaviour variables (duration of calls, number
of calls and value of calls).
In case studies where customers are not companies, it would be interesting to
have variables such as gender and geo-referenced data with latitude and longitude. In
this way it would be possible to segment the market, to understand the behaviour of
customers in particular places and thus to design more effective marketing strategies.
The purpose of this section is to analyse the two components of CLV represented
in (1): the revenues from the TLC company’s customers and customer survival. Spe-
cifically, we describe customer and revenue distribution and we show customer sur-
vival rates with respect to geographical area and size of customer. Subsequently, in
Section 3.2, we perform a CLV analysis and specifically verify whether there are dif-
ferences in the CLV with respect to geographical area and size of customer.
First, using the SAS procedure proc freq, we classify the customers of above
TLC company according to the above factors, and report total and average revenues
from customers.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 181


182 M. TOCCU - A. FASSÒ

Table 1 reports in detail the above variables that are recorded for each customer:

TABLE 1. - Variables recorded for each customer


Variable name Description
geo_zone identifier of geographical areas: North, Centre, South
cust_size dummy for large customers or small customers
flag_preact dummy for contract with pre-activation or not
flag_mobile dummy for customer with mobile telephone or not
flag_bundle dummy for customer with full package or not
dur_fixed number of minutes from fixed telephone
dur_mobile number of minutes from mobile telephone
num_fixed number of calls from fixed telephone
num_mobile number of calls from mobile telephone
val_fixed revenue of calls from fixed telephone
val_mobile revenue of calls from mobile telephone

TABLE 2. - Customer and revenue distribution by geographical area


Geographical area % Total revenues (E) Average revenues (E)
Northern Italy 56.49 8,955,486 521.79
Central Italy 27.51 4,020,880 481.02
Southern Italy 16.00 1,537,092 316.47

Table 2 shows that Northern Italy accounts for more than 50% of customers. It is
also significant that customers in Northern Italy generate about twice the total reven-
ues as the rest of the customers. Finally, a Kruskal-Wallis test allowed us to ascertain
that there are statistically significant differences (p-value<.0001) in average revenues
with respect to geographical area.

TABLE 3. - Customers and revenues distribution by size of customer


Size of customer % Total revenues (E) Average revenues (E)
Small 80.57 6,176,783 252.36
Large 19.43 8,336,676 1412.28

Table 3 shows that most of the customers of are small customers and we note that
these proportions are maintained across the three geographical areas. Large customers
generate, on average, approximately six times the revenues of small customers.
After checking the distribution of customers with respect to the above factors, we
introduced a nonparametric survival analysis, using proc lifetest, in order to verify
the survival rate of customers. Specifically, proc lifetest is an SAS procedure which
estimates survival and hazard functions using two methods: Kaplan-Meier and life-ta-

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 182


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 183

ble. The former is suitable for small datasets with precisely measured event times,
whereas the latter is suitable for large datasets or when the measurements of event
times are imprecise. As our TLC dataset is large, we chose the second method.
Below we show the SAS code for proc lifetest. The s parameter produces the sur-
vival plot, the h parameter creates the hazard plot, and the life parameter indicates
the life-table method.
proc lifetest data=tel;
method=life plot=(s,h) width=1 graphics;
time duration*event(0);
strata geo_zone;
run;
Using this procedure, with the strata statement we produce survival and hazard
graphs stratifying with respect to geographical area (respectively Figures 1 and 2)
and then size of customer. These graphs are helpful in highlighting the difference in
the behaviour of customer survival and hazard with respect to the above factors.

FIGURE 1. - Survival plot by Italian geographical areas

The survival plot, displayed in Figure 1, shows that the customer survival prob-
ability behaves in the same way across the three geographical areas. To verify this
statement we used the Log-Rank test, which verifies the null hypothesis that the sur-
vival functions are the same in the three geographical areas. This test confirms our
initial impression and thus we do not reject the null hypothesis (p-value=0.055). We
recall that the number of customers is about 30,000 therefore we can say that the p-
value is not small.
The hazard plot, displayed in Figure 2, shows that in the first month of the study
period the risk of customers churning increases. The risk then decreases sharply and

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 183


184 M. TOCCU - A. FASSÒ

FIGURE 2. - Hazard plot by geographical area

in the central part of the study period remains stable until the fifth month. Finally, in
the last part of the study, the risk increases again.
We also performed nonparametric analysis with respect to size of customer. How-
ever, the results are similar to those obtained with respect to geographical area.
In conclusion, the nonparametric analysis shows that the above factors do not af-
fect the survival probability of customers of above TLC company. It is interesting to
note that this statement is also confirmed by the Cox model which we illustrate in
Table 4.
Below we show how we verified the proportional assumption. We start by saying
that the proportional hazard model assumes that the hazard ratio of two individuals is
time-independent. This requires that covariates not be time-dependent and, if some
covariates vary with time, this assumption is violated (Lee and Wang, 2003).
In this section, we show two methods for verifying the above assumption: a sta-
tistical test method and a graphical method. Another graphical method consists in
creating a scatter plot of partial residuals of the covariate as y variable and time-to-
event as the x variable. Therefore, below, we show the above methods and in particu-
lar we report the results for the geographical area (geo_zone) factor.
In order to implement the graphical method we use the previous SAS code. How-
ever, in this case we substitute the lls parameter for the s and h parameters.
proc lifetest data=tel;
method=life plot=(lls) width=1 graphics;
time duration*event(0);
strata geo_zone;
run;

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 184


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 185

This procedure produces the plot displayed in Figure 3, which shows the log-log
survivor functions for each of the three geographical areas. Since the curves are ap-
proximately parallel, we can say, with respect to geographical area, that the propor-
tional assumption is satisfied.

FIGURE 3. - Log-Log Survival plot by Italian geographical areas

To confirm this result it is possible to use a statistical test method. In particular, it


is necessary to create an interaction variable between geo_zone and duration. This
interaction variable is then included in the model and, using proc phreg, it may be
observed that its coefficient is not statistically significant (p-value=0.96). Therefore,
there is no evidence that the effect of geographical area changes over time and it is
possible to conclude that the PH assumption is not violated for the geo_zone vari-
able.
The PH assumption is respected for the other time-independent covariate but it is
violated for the time-dependent covariates. Therefore, because it is common for stu-
dies to include both time-independent and time-dependent covariates, and the latter
may be observed more than once during the study (Lee and Wang, 2003), in the next
section we present a method for handling this case.

3.1 Cox Model Estimation

Survival datasets typically take the form of one record per subject. Therefore, for
each subject, there is one measurement for each variable. Then, each of these re-
cords is a vector which assumes a typical format (duration, event, covariates).
However, as mentioned at the beginning of Section 3, the dataset of our TLC com-

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 185


186 M. TOCCU - A. FASSÒ

pany contains two types of variables: time-independent variables and time-depen-


dent variables. In particular, time dependent variables are collected at regular inter-
vals of time (months); there are this many measurements for each variable. Thus it
is necessary to organize the data according to a counting process format which con-
tains more than one record per subject, because each variable is measured more than
once. Each record has the format (t1, t2, event, covariates), where t1 represents the
start of the interval, t2 represents the end of the interval, and event is an indicator
censoring. According to Carpenter and Ake (2003), the counting process format can
easily accommodate a number of special features in the data, including multiple
events of the same type and time-dependent covariates. Therefore, because our data-
set has a typical format, we transform it using SAS into a counting process format.
Now, it is possible to obtain the coefficient estimates and the hazard ratios of the
Cox model using an SAS procedure, proc phreg, as follows.
proc phreg data=tel;
model (t1, t2)*event(0)= &varlist / ties=efron selection=s;
run;
In particular, the efron command makes it possible to handle tied survival time
(Efron, 1977), while t1 and t2 are two support variables which enable proc phreg to
pinpoint the risk period for each customer.
Table 4 shows the results of proc phreg, which are the significant covariates and
the hazard ratios.

TABLE 4. - Significant Covariates and Hazard Ratios


Variable Parameter Std err HR
flag_preact -0.26 0.03 0.77
flag_bundle 0.22 0.08 1.24
num_fixed -0.03 0.003 0.97
val_mobile -0.3 0.03 0.74
cust_size -0.04 0.02 0.96

The hazard ratio (HR) represents the risk increase of customer churn per unit in-
crease in the covariates. For example, the estimated risk ratio for the dummy variable
flag_preact is 0.77. This means that the hazard of churn for customers who stipulated
a contract with pre-activation is only 77% of the hazard for those who stipulated a con-
tract without pre-activation. For the quantitative covariate val_mobile the risk ratio is
0.74, yielding a decrease of 26%. Therefore, for each one-euro increase in the con-
sumption of mobile telephone traffic, customer churn decreases by an estimated 26%.
It may be observed that the selection of variables excludes the following covariates:
flag_fixed, flag_mobile, geo_zone, dur_fixed, dur_mobile, num_mobile, val_fixed.
Furthermore, Table 4 shows that the Cox model confirms the results of the non-
parametric analysis. In particular, geographical area is not significant, while customer
size is slightly significant (the churn risk for large customers is lower than that for
small customers), whereas the hazard ratio of customer size is very close to one.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 186


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 187

Therefore, we can say that, with the available data, these factors do not affect the sur-
vival probability of customers.
Below, we assess the goodness of fit of the estimated Cox model. The graph in
Figure 4 shows the deviance residuals (Therneau et al., 1990) against duration.

FIGURE 4. - Goodness of fit

The residuals are distributed fairly symmetrically around zero, between -3 and 3,
and in no particular pattern. Therefore we conclude that the model fits the data suffi-
ciently well.

FIGURE 5. - Survival function for an average customer

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 187


188 M. TOCCU - A. FASSÒ

Finally, we evaluate the model for an average customer. Allison (2004) proposes,
after estimating the parameters of Cox model, to estimate baseline survival function
S0 ðt Þ. Thus, it is possible generate, by proc phreg, the estimated survival function
S ðtÞ for average values of covariates by substitution in the equation (3).

S ðt Þ ¼ ½S0 ðt Þexpð xÞ ð3Þ

Figure 5 represents survival curve of an average customer with average covariates


values. It notes that, survival probability falls sharply in the last part of the period of
study. Therefore, decision makers should implement an appropriate strategies of cus-
tomer retention to delay churn of customers.

FIGURE 6. - CLV concentration

3.2 Statistical Analysis of Customer Lifetime Value

The aim of this subsection is not to obtain a definitive result concerning the specific
case in hand but rather to illustrate a method of CLV analysis. We discuss CLV in
terms of market distribution and show some potential applications of CLV in market
strategy analysis. In doing so, we commence a simple, example analysis and discuss
the preliminary results. A more in-depth analysis, in terms of multivariate analysis
and clustering, is beyond the scope of this paper.
In this section, we present a statistical analysis of customer lifetime value based
on two variables: geographical area (geo_zone) and size of customer (cust_size). The
goal of this analysis is to help decision-makers in above TLC company to develop a
more objective marketing strategy. To this end, first we examine a CLV concentra-
tion showing the Lorenz curve and CLV divided into percentiles. Secondly, we ana-
lyse CLV distribution and verify average differences with respect to the above fac-
tors. Finally, we estimate a regression model in order to detect the covariates which
affect CLV. This has produced some interesting results.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 188


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 189

Table 5 shows that CLV seems to be highly concentrated in a small 10% of the
customers. In fact, the last 10% of the distribution accounts for approximately 60%
of CLV. Therefore, a small fraction of customers generate a large proportion of rev-
enues for studied TLC company. Specifically, 1% of customers gives a CLV be-
tween A 3,700 and A 81,375.

TABLE 5. - CLV distribution


CLV per Cumulative
Percentile Estimated CLV (E) N
percentile band CLV (E)
5% 4 1,518 2,058 2,058
10% 10 1,520 9,891 11,949
25%Q1 43 4,487 113,171 125,120
50%Median 111 7,664 641,729 766,849
75%Q3 323 7,595 1,742,314 2,509,163
90% 750 4,558 2,435,218 4,944,381
95% 1,250 1,518 1,529,752 6,474,133
99% 3,700 1,216 2,414,587 8,888,720
100% 81,375 303 2,521,243 11,409,963

Figure 6 shows a Lorenz Curve with customers (P axis) and CLV (Q axis).
The concentration area is approximately 70%, confirming that most of the CLV is
highly concentrated among a small share of its customers.
We shall now present the study of CLV distribution with respect to geographical
area and size of customer.

TABLE 6. - CLV distribution by Italian geographical area and size of customer


Geographical area (%)
Size of customer (%) Northern Italy Central Italy Southern Italy Total
Small 35 17 6 58
Large 27 11 4 42
Total 62 28 10 100

Table 6 shows that Northern Italy generates a large part of CLV. In particular, it
may be noted that, with respect to size of customer, on average, approximately 60%
of CLV is generated by small customers, of whom there are approximately 4 times
as many as there are large customers as we have seen in Table 3. This pattern is es-
sentially the same across the three geographical areas.
It is interesting to verify which customers are the most profitable with respect to
the above two factors. Therefore in Table 7 we report mean CLV with respect to size
of customer.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 189


190 M. TOCCU - A. FASSÒ

TABLE 7. - Average CLV with respect to size of customer


Size of customer % Mean (E)
Small 80.57 265.26
Large 19.43 833.02

In particular, it may be observed that the per-capita CLV of large customers is


about 3 times higher than the per-capita CLV of small customers.
Finally, in Table 8 we evaluate if there are, on average, differences of CLV in the
three geographical areas.

TABLE 8. - Average CLV with respect to geographical area


Geographical area % Mean (E)
Northern Italy 56.49 409.14
Central Italy 27.51 381.02
Southern Italy 16.00 247.65

The Kruskal-Wallis test shows that there is, on average, a statistically significant
difference (p-value<.0001) in CLV generated by customers in the three geographical
areas. In particular, a single customer of Northern Italy on average, generate highest
CLV.
Therefore, recalling the differences in average revenues shown in Tables 2 and 3,
it may be noted that CLV also exhibits differences with respect to the above factors.
Moreover, thanks to the higher costs incurred for large customers, it is interesting to
note that the ratio between the average CLV of large and small customers (Table 7)
is about half the ratio between the average revenues of large and small customers
(Table 3).
To conclude our CLV analysis, we evaluate the impact of the covariates on CLV.
Therefore, we assess how the standardized variables of the dataset of above TLC
company influence CLV. In particular, we are interested in ascertaining how fixed
and mobile telephone traffic contribute to CLV. Table 9 shows the result of our re-
gression analysis. The data are adjusted so well by the model (R2 ¼ 0:92) and the re-
sidual analysis respects the regression assumptions.

TABLE 9. - Covariates influencing CLV


Variable Estimated Parameter Std err
intercept 345.06 1.82
val_mobile 1060.21 1.82
val_fixed 527.33 2.45
dur_fixed 75.21 1.96

It may be observed that the truly significant result is the contribution made by va-
lue generated by mobile and fixed-line telephone traffic. Specifically, the contribu-

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 190


STATISTICAL ANALYSIS : A CASE STUDY ON TELECOMMUNICATIONS DATA USING SAS 191

tion to CLV of calls made from mobiles is about twice the contribution of calls made
from fixed-line telephones. In fact, as covariates are standardized, for each sigma unit
of mobile or fixed value the CLV increases by approximately A 1,000 and A 500 re-
spectively.

4. CONCLUSIONS

In this paper we have introduced a simple statistical analysis of customer lifetime


value (CLV) using SAS. In particular, we calculated CLV for each customer, taking
into account their survival probability. We then analyzed CLV with respect to geo-
graphical area and size of customer.
The results of the statistical analysis of CLV show that, on average, customers in
Northern Italy generate most of our TLC company’s revenues. Moreover, with re-
spect to size of customer, large customers generate approximately three times more
CLV, on average, than small customers.
A remarkable result of this paper is provided by our regression analysis, which
shows that mobile calls generate around twice as much CLV as fixed-line calls. In
particular, since covariates are standardized, for each sigma unit of mobile or fixed
phone call value, the CLV increases by approximately A 1,000 and A 500 respec-
tively.
From the point of view of customer survival probability, our analysis produced
an interesting result. In fact, using an observation period of six months, we find that
geographical area and size of customer are not significant and that therefore these
factors do not affect CLV. However, small customers are important as they represent
most (80%) of above TLC company’s customers. This is important as small custo-
mers can be an excellent vehicle for word-of-mouth marketing. Therefore, one poten-
tial future development of this paper is to measure the impact of relationships be-
tween a TLC company’s customers on CLV, using social network analysis techni-
ques.

ACKNOWLEDGEMENTS
We would like to thank Nunatac Srl for supporting our research by providing data for the TLC
company. We are also grateful to several anonymous reviewers for helpful comments.

REFERENCES

Aeron H., Bhaskar T., Sundararajan R., Kumar A., Moorthy J. (2008). A metric for customer
lifetime value of credit card customers. Journal of Database Marketing & Customer Strategy
Management, 15, 153-168.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 191


192 M. TOCCU - A. FASSÒ

Allison P. (2004). Survival analysis using SAS. A practical guide. SAS Publishing, Cary.
Berger P., Nasr N. (1998). Customer lifetime value: marketing models and applications. Jour-
nal of Interactive Marketing, 12(1), 17-30.
Blattberg R., Deighton J. (1991). Interactive marketing: exploiting the age of addressability.
Fall, 33(1).
Breslow N. (1974). Analysis of censored survival data. Biometrics, 30, 89-99.
Carpenter A., Ake C. (2003). Extending the use of proc phreg in survival analysis. Proceed-
ings of the 11th Annual Western Users of SAS Software, Inc. Users Group Conference. SAS
Institute Inc. Cary.
Efron B. (1977). The efficiency of the Cox’s likelihood function for censored data. Journal of
the American Statistical Association, 72, 557-565.
Gupta S., Hanssens D., Hardie B., Kahn W., Kumar V., Lin N., Ravishanker N., Sriram S.
(2006). Modelling customer lifetime value. Journal of Service Research, 9(2), 139-155.
Jen L., Chou C., Allenby G. (2009). The importance of modeling temporal dependence of tim-
ing and quantity in direct marketing. Journal of Marketing Research, 46(4), 482-493.
Kumar V., Aksoy L., Donkers B., Venkatesan R., Wiesel T., Tillmanns S. (2010). Underva-
lued or overvalued customers: capturing total customer engagement value. Journal of Service
Research, 13 (3), 297-310.
Lee E., Wang J. (2003). Statistical methods for survival data analysis, Wiley, New Jersey.
Peppers D., Rogers M. (1993). The one to one future: building relationships one customer at a
time. Doubleday Business, New York.
Lu J. (2003). Modeling customer lifetime value using survival analysis - An application in the
telecommunications industry. Proceedings of the 28th Annual SAS Users Group International
Conference. SAS Institute Inc., Cary.
Malthouse E., Blattbeg R. (2005). Can we predict customer lifetime value? Journal of Interac-
tive Marketing, 19(1), 1-16.
Reinartz W., Kumar V. (2003). The impact of customer relationship characteristics on profit-
able lifetime duration. Journal of Marketing, 67(1), 77-99.
Therneau T., Grambsch P., Fleming T. (1990). Martingale-based residuals and survival models.
Biometrica, 77(4), 147-160.
Venkatesan R., Kumar V. (2004). A customer lifetime value framework for customer selection
and resource allocation strategy. Journal of Marketing, 68(4), 106-125.

p:/3b2 job/vita-pensiero/Statistica-Applicazioni/2013-02/05-Toccu.3d – 18/4/14 – 192


Copyright of Statistica & Applicazioni is the property of Statistica & Applicazioni and its
content may not be copied or emailed to multiple sites or posted to a listserv without the
copyright holder's express written permission. However, users may print, download, or email
articles for individual use.

You might also like