
Analysis and Prediction Server with Column Store Database – A Case Study in Telecom Churn


Suresh L
Professor, Department of CSE
Cambridge Institute of Technology
Bangalore, India
e-mail: suriakls@gmail.com

Dr. Jay B. Simha
Chief Technology Officer, Analytics Consultant
Abiba Systems
Bangalore, India
e-mail: jbsimha@abibasystems.com

Dr. Rajappa Velur
Dean, Academics
Cambridge Institute of Technology
Bangalore, India
e-mail: rajuvelur@rediffmail.com

Abstract- One of the major concepts in business analytics is to identify the anomalies
over time, also called trend analysis. This can be easily done in pivot tables by using
time as one of the dimensions, usually across columns. However, the trending information
itself is insufficient to make any quick and insightful observations. Ranking the time
series to identify the similar units of information can accelerate the analysis process.
Similarly, projection of the time series to the future will help the decision maker to
proactively build alternate plans for differing scenarios. For such analysis and modeling
it becomes necessary to have aggregated data on demand. The current breed of row store
databases has limited capability to provide the response time required for such analysis
and modeling. Hence, column store databases are expected to be better alternatives for
these types of problems. In this work the attempts made by the authors to develop such a
system, named 'rePivot', are presented. The proposed framework consists of three modules:
a column store database to provide quick access to data, a time series ranking module and
a probabilistic forecasting module. A case study of the proposed framework in churn
analysis and modeling in telecom has been carried out to test the suitability of the
framework for industrial applications. Application of the framework has shown promising
results. Work is in progress to develop additional modules for survival analytics of
individual entities in the database.

Keywords- Business Analytics, Telecom Churn, multidimensional data, SQL, prediction model.

I. INTRODUCTION

Today's businesses have sophisticated data analysis requirements. The metrics or analyses
of a business's data can be difficult to obtain. To calculate a meaningful metric,
business analysts often use spreadsheets to manually analyze data. Manual analysis, of
course, is a tedious and time-consuming process. Most applications fail to deliver useful
metrics that provide unique insights into an organization's performance. Useful metrics
highlight significant performance measures of the business. Typically, business analysts
must execute multiple queries and other time-consuming manual interventions to produce
these metrics. Then, despite the time-consuming effort, analysts must start the process
again from the beginning to obtain follow-up information such as an explanation of a
particular anomaly in a metric.

Typically, a business's data is stored on a database or on databases. These databases are
operated with associated database servers, which manage the storage and retrieval of
records from the databases. Analytical servers have additionally been provided to format
database queries or information requests sent from a client user interface to the
database server for handling. The analytical servers can be used to improve the
efficiency of the database accesses and to provide metrics of interest to the user from
the records retrieved from the database.

One of the major approaches to business analytics is to identify the impending change in
trend before it accelerates. Early warning systems of this kind are very important and
are useful in various scenarios such as trading financial securities, predicting sales
performance and analyzing churn. However, using only a scalar value to compare multiple
series, rank them and project the series to the future is not appropriate, even though it
is practically possible.

In this research work an attempt has been made to develop a system for analyzing the
multidimensional data, ranking the time series and predicting the time series to the
future. The framework consists of a crosstab query to provide the transposed time series
data for selected dimensions, a similarity search module which ranks the time series data
and a prediction module based on a probabilistic forecasting model.

II. ARCHITECTURE

The proposed framework consists of three major components. They are (i) an analytical
query system based on an advanced column store database system, which provides the
aggregated time series data from a star schema data mart, (ii) a time series ranking
module which ranks the time series with an adaptive algorithm and (iii) a prediction
module, which provides simple but effective parametric model building capabilities. A
simplified architecture diagram of the proposed system is shown in figure 1. These
components are integrated on a

978-1-4244-4547-9/09/$26.00 ©2009 IEEE TENCON 2009


common platform of technologies – SQL. The integration of the components is done through
Java/Python.

Figure 1. Architecture of the proposed system

A. Analytical Query System

There has been a significant amount of excitement and recent work on column-oriented
database systems (“column-stores”). These database systems have been shown to perform
more than an order of magnitude better than traditional row-oriented database systems
(“row-stores”) on analytical workloads such as those found in data warehouses, decision
support, and business intelligence applications [1]. In fact there are arguments against
the multidimensional data cube approach used by some vendors, favoring column store
databases. One of the arguments is that column store databases provide near full support
for SQL. In this research work an open source column store database called MonetDB [5]
is used as the analytical server database.

It is assumed that the data model for the analytical database follows the star schema,
for several reasons stated by Kimball et al. [7]. This type of schema de-normalizes the
dimension tables to facilitate faster joins in the queries, reducing latency. Time is
defined as a permanent dimension in the schema to facilitate creation of time series data
against a user-selected dimension list.

The major function in the query system is the SQL query model for a cross tab. This
function is executed by the DBMS and provides the result set for further analysis.

A typical cross tab provides aggregated factual values for selected dimensions in the
required format. Generally the cross tab is a tuple (Xi, Yj, Fk) where Xi are the
dimensional values plotted on rows of a spreadsheet, Yj are the dimensional values
plotted on columns of a spreadsheet and Fk are the aggregated / functional factual values
constituting the row-column interaction, i.e. the cells. Several commercial systems
provide some means of generating cross tabs from the data [6, 9, 10].

In this research Xi is allowed to take a theoretically unlimited number of dimensions
(practically limited to six) and the time dimension is fixed as Yj. User-selected facts
are aggregated as Fk values.

B. Ranking Time Series

Once the time series data for the user-defined criteria is extracted from the query
system, it is processed by the time series ranking module.

The trend, i.e. the time series, can have some prominent patterns which are of interest
to business analysts. Some of them are [11]:
1. Vary considerably over the past few periods
2. Increase greatly
3. Drop drastically
4. Increase greatly and then drop drastically
5. Perform differently than the total trend

A typical comparison of the time series is given in figure 2.

Figure 2. Typical patterns in time series trends

Points 1 through 4 basically translate into the statistical spread of a time series'
values around the mean of the respective time series. Point 5 needs comparative or
reference data.

A simple pattern based approach has been developed in this research work to compare the
time series data. The ranking of time series is done through automated sorting of
patterns.

In order to sort the time series values, the spread of each series is computed and
compared with the spread of all the series. Large variances suggest a very different
development, while small variances indicate a similar development pattern.

Since the values for each series are very different, it is not possible to compare the
series values directly. In order to make the series comparable, each series is
normalized by dividing its individual values by the series mean.

Once the data is normalized, the square of the sum of differences between the individual
values in the time series and the corresponding values of the overall mean vector is
computed, which yields a scalar value for each series. Ranking these scalars provides
statistically valid ranks for the time series. The algorithm for computing the ranking of
time series is shown in figure 3.

Figure 3. Algorithm for computing the time series rank

A typical visual presentation of the above processing steps is shown in figure 4.

Figure 4. Typical time series comparison

It can be observed that comparison of each time series with the reference time series
indicates that the proposed ranking is very reliable in calculating similarities. Since
this benchmark automatically adapts to the underlying data, it also delivers trustworthy
conclusions for a completely different set of time series.

C. Prediction with a Probabilistic Projection Model

Customer defection is a prominent issue in subscription-based industries such as mobile
phones, credit cards and internet services.

The major characteristics of customer behavior prediction are the contractual agreement
between the company and its customers, the acquisition cost for new customers, and the
availability of large datasets of behavior at the customer level. This can provide an
ability to predict the defection point of an individual customer.

A standard approach for modeling defection/churn is to use survival analysis, such as the
Kaplan-Meier approach or Cox regression, or to use neural networks. However, several
problems limit their usefulness for ad-hoc analysis in a practical situation. Some of
them are:

1. Irrational projection from standard statistical parametric models like multiple linear
regression (for example, the expected value can become negative, which in reality is
absurd, when projected beyond the limits) [3].

2. Some fixed time models like logistic regression [8] and decision trees [2] can provide
good diagnostic information but fail to provide the time to defect/churn.

3. Non-parametric models like Kaplan-Meier tables [2], though simple, require human
evaluation, which limits their usage in a dynamic setting like interactive analytics.

4. Though semi-parametric and non-parametric methods like Cox regression [9] and neural
networks [8] show promising results, their computational complexity is prohibitive.

The prediction of customer defection, also called churn, is an important function in
customer retention. However, there is very little work in aggregate projection [5] and,
as per our survey, almost no published work on integrating this with a decision support
system.

This necessitates the use of a simple but robust parametric model for churn/defection
projection. Hardie et al. [4] have used a shifted geometric negative binomial
distribution to predict customer churn. The model is simple yet extremely effective. It
is based on computing the gamma value in the equation shown in figure 5. The actual
implementation of this algorithm is given in figure 6.
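
As an illustration of this kind of parametric projection, the following is a minimal sketch only. It assumes a shifted-beta-geometric (sBG) retention curve, a common parametric model for contractual churn whose survival function is a ratio of gamma functions. Whether figures 5 and 6 implement exactly this form cannot be recovered from the text. The function names and the parameter values alpha and beta are hypothetical, and the parameters are assumed to have been estimated already.

```python
from math import exp, lgamma


def sbg_survival(alpha: float, beta: float, t: int) -> float:
    """Probability that a customer is still active after t periods.

    Assumed sBG form: S(t) = B(alpha, beta + t) / B(alpha, beta), which
    reduces to a ratio of gamma functions. Log-gamma is used so that
    large arguments do not overflow.
    """
    if alpha <= 0 or beta <= 0 or t < 0:
        raise ValueError("alpha and beta must be positive and t non-negative")
    log_s = (lgamma(beta + t) + lgamma(alpha + beta)
             - lgamma(beta) - lgamma(alpha + beta + t))
    return exp(log_s)


def project_retained(n0: float, alpha: float, beta: float, periods: int) -> list:
    """Expected number of retained customers for periods 1..periods."""
    return [n0 * sbg_survival(alpha, beta, t) for t in range(1, periods + 1)]


# Hypothetical cohort of 1000 customers with alpha = beta = 1,
# for which S(t) = 1 / (t + 1): roughly [500.0, 333.3, 250.0].
curve = project_retained(1000, 1.0, 1.0, 3)
```

Because the survival probability always lies between 0 and 1, a projection of this form cannot become negative, which is the problem noted above for multiple linear regression.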

This analysis can be done in two modes, manual and automatic. In both modes the process
remains the same; only the space in which the analysis is carried out differs. In manual
mode, the user selects the dimensions of interest. In automatic mode, a predefined
structure for hierarchical analysis is followed.
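
In manual mode, the selected dimensions determine the cross tab that the query system of Section II.A has to produce: dimensions on rows, time across columns, an aggregated fact in the cells. A minimal sketch of such a query builder follows. It uses Python's built-in sqlite3 module as a stand-in for the column store, and the table name churn_fact and its columns are hypothetical, not taken from the case study.

```python
import sqlite3


def crosstab(conn, table, dims, time_col, fact):
    """Return (header, rows) with dims on rows (Xi), periods on columns (Yj)
    and SUM(fact) in the cells (Fk)."""
    if not table.isidentifier() or not dims:
        raise ValueError("invalid table name or empty dimension list")
    # Only accept column names that really exist in the table.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name in [*dims, time_col, fact]:
        if name not in columns:
            raise ValueError(f"unknown column: {name}")
    # The distinct periods become the output columns.
    periods = [r[0] for r in conn.execute(
        f"SELECT DISTINCT {time_col} FROM {table} ORDER BY {time_col}")]
    # One conditional aggregate per period produces the cell values.
    cells = ", ".join(
        f"SUM(CASE WHEN {time_col} = ? THEN {fact} ELSE 0 END)" for _ in periods)
    dim_list = ", ".join(dims)
    sql = (f"SELECT {dim_list}, {cells} FROM {table} "
           f"GROUP BY {dim_list} ORDER BY {dim_list}")
    return [*dims, *periods], conn.execute(sql, periods).fetchall()


# Hypothetical fact table: churned customers per region, contract and month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn_fact "
             "(region TEXT, contract TEXT, month TEXT, churned INTEGER)")
conn.executemany("INSERT INTO churn_fact VALUES (?, ?, ?, ?)", [
    ("North", "prepaid", "2009-01", 12),
    ("North", "prepaid", "2009-02", 15),
    ("North", "postpaid", "2009-01", 4),
    ("South", "prepaid", "2009-01", 9),
    ("South", "prepaid", "2009-02", 7),
    ("South", "prepaid", "2009-02", 1),
])
header, rows = crosstab(conn, "churn_fact", ["region"], "month", "churned")
```

Each returned row is one time series, which is the shape the ranking and prediction modules consume. Passing ["region", "contract"] as the dimension list yields one series per region and contract combination.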

The actual process consists of:

1) Computing SPG/NBD parameters for the selected dimensional values.
2) Predicting the churn/retention for subsequent periods of interest.
3) Ranking the time series of predicted values.
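
Step 3 applies the spread-based ranking of Section II.B to the projected series. The sketch below is one possible reading of that description, not a reproduction of figure 3. It normalizes each series by its own mean, builds the overall mean vector across series, and scores each series against it. The text speaks of the square of a sum of differences; this sketch assumes a sum of squared differences instead, because plain differences from a mean can cancel out. The function name and the sample data are hypothetical.

```python
def rank_series(series):
    """Rank time series by their spread around the overall mean vector.

    series maps a name to a list of values; all lists must have the same
    length. Returns (name, score) pairs, most typical series first.
    """
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all series must be non-empty and of equal length")

    # Normalize each series by its own mean so that scales are comparable.
    normalized = {}
    for name, values in series.items():
        mean = sum(values) / len(values)
        if mean == 0:
            raise ValueError(f"series {name!r} has zero mean")
        normalized[name] = [v / mean for v in values]

    # Overall mean vector: the average normalized value in each period.
    n_periods = lengths.pop()
    overall = [sum(vals[t] for vals in normalized.values()) / len(normalized)
               for t in range(n_periods)]

    # Assumed score: sum of squared differences from the overall mean vector.
    scores = {name: sum((v - m) ** 2 for v, m in zip(vals, overall))
              for name, vals in normalized.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical projected churn counts for three segments.
ranked = rank_series({
    "flat": [10, 10, 10, 10],
    "flat_big": [100, 100, 100, 100],
    "spike": [10, 10, 40, 10],
})
```

The two flat series get the same score despite their different scales, and the series with the spike is ranked last, that is, as the one that behaves least like the overall trend.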

A case study on real world data and a decision support requirement is discussed in the
next section.

Figure 5. Algorithm for prediction

III. CASE STUDY

The proposed approach has been tested on a real world telecom dataset. Actual details of
the dimensions or facts used and the identity of the operator are kept confidential due
to the sensitivity of the issue.

In total, about 17 dimensions (7 demographic, 3 product related and 7 behavioral) with 7
facts for usage and revenue have been used.

A customer base of about 100,000 has been used in testing the approach. The time related
data of all these customers on all the dimensions for a period of 24 months is selected
for churn projection analysis. The data has been sanitized to protect the business
interests.

The data is drawn from the data warehouse into the churn data mart. Both manual and
automated ranking analyses of the projection are carried out.

A sample analysis report for the case study is shown in figure 7. It shows the churn
trends for different regions. In manual mode, the decision maker can choose, based on
interest, the graphs for which further analysis is to be done. In automatic mode, all the
analysis is done at the back end.

Figure 6. Prediction with a probabilistic projection model

Figure 7. Sample churn in different regions

Figures 8, 9 and 10 show typical results from the proposed approach. It can be observed
that the results from the proposed approach are very similar to the actual results.

Figure 8. Projections for contract type with actual values

Figure 9. Projections for contract type by region with actual values

Figure 10. Projections for contract type by region with product type

In order to evaluate the suitability of the prediction algorithm, it has been compared
with the results of Multiple Linear Regression (MLR). Table 1 shows the results for
different queries executed in manual mode. It can be observed that the mean absolute
error (MAE) and root mean squared error (RMSE) of the proposed system are lower than
those of regression. The statistical test for change in the average error has confirmed
that the proposed approach significantly outperforms other methods on aggregate
projection.

Further, data for the Cox regression has to be modified to suit the software
requirements [R]. Since our approach uses an integrated database dependent model, it is
an order of magnitude faster than other approaches. However, automated analysis could
not be tested on all the algorithms due to computational requirements. This again
indicates the effectiveness of the proposed approach in automated analysis.

TABLE 1. PREDICTION ACCURACY COMPARISON

         Proposed method      MLR
Query    MAE     RMSE         MAE      RMSE
1        1529    3534         10000    18000
2        155     1707         1200     2300
3        51      186          400      1000
4        38      123          450      1127
5        37      73           480      1400
6        162     267          1500     4000
7        9       37           85       320
8        29      75           120      350
9        98      103          400      500
10       15      59           75       405

IV. CONCLUSIONS

Trend analysis with time series ranking and prediction is one of the important analytical
functions in business analytics. This requires a new architecture with state of the art
components to provide information on demand. In this research, an attempt has been made
to develop an analytical server to provide these features.

The proposed architecture is built using a column store database, with time series
ranking using pattern recognition and time series projection using a probabilistic
prediction model. A case study on telecom churn has been carried out and the results are
promising. Research is in progress to provide survival analytics capabilities at the
individual level within the proposed framework.

REFERENCES
[1] Abadi D.J., Madden S.R., Hachem N., “Column-Stores vs. Row-Stores: How Different Are
They Really?”, SIGMOD 2008, June 2008, Vancouver, Canada.
[2] Berry, Michael J. A., Linoff, Gordon S., Data Mining Techniques: For Marketing,
Sales, and Customer Relationship Management, O’Reilly, 2004.
[3] Fader, Peter S., Hardie, Bruce G. S., “Probability models for customer base
analysis”, Journal of Interactive Marketing, 23, 2009, 61-69.
[4] Fader, Peter S., Hardie, Bruce G. S., “How to project Customer retention”, May 2006,
available at SSRN.
[5] Ivanova M., Nes N., Goncalves R., Kersten M.L., “MonetDB/SQL Meets SkyServer: the
Challenges of a Scientific Database”, In Proceedings of the International Conference on
Scientific and Statistical Database Management (SSDBM), July 2007.
[6] Celko J., Joe Celko’s SQL for Smarties: Advanced SQL Programming, Morgan Kaufmann,
2005.
[7] Kimball, Ralph, The Data Warehouse Toolkit, Second Edition, Wiley, 2004.
[8] Parr Rud, Olivia, Data mining: modeling data for marketing, risk and CRM,
Wiley-Dreamtech, 2003.
[9] R Manual, Cran.r-project.org/manual.html
[10] Tien P.L., Lin T., McGranaghan M., “Some tips and examples for using
SAS@PROCTABULATE”, Proceedings of the SUGI22.
[11] http://blog.bissantz.com/imetrics-1, last accessed on 29 August, 2009

About the authors

Suresh L is working as Professor & Head, Department of Computer Science and Engineering
at Cambridge Institute of Technology, Bangalore, INDIA. He received his B.E. and M.E. in
Computer Science & Engineering and is pursuing his Ph.D. at Dr. M.G.R. University,
Chennai, INDIA. His research areas are Data Mining and Distributed Computing.

Dr. Jay B. Simha is working as Chief Technology Officer at Abiba Systems, Bengaluru,
INDIA. He received his Ph.D. degree in Intelligent Decision Support and Data Mining from
Bangalore University in 2003. His research areas are Business Intelligence, Visual Data
Mining, Predictive Analytics and Large Scale Data Analysis.

Dr. Rajappa Velur is working as Professor and Dean Academics at Cambridge Institute of
Technology, Bangalore, INDIA. He received Bachelor and Master degrees in the Electronics
discipline from Gulbarga University and a Ph.D. degree in Graph Theory and its Computer
Applications from Magadh University. His main research interests are in Graph Theory,
Data Mining, Ad-Hoc Networks, and Signals and Systems.
