
International Conference on Computing, Control, Networking, Electronics and Embedded Systems Engineering, 2015

CDR Analysis using Big Data Technology


Sara B. Elagib1, Aisha-Hassan A. Hashim2, R. F. Olanrewaju3
Department of Electrical & Computer Engineering, Faculty of Engineering,
International Islamic University Malaysia
Kuala Lumpur, Malaysia
1sarelagib@hotmail.com; 2aisha@iium.edu.my; 3frasidah@iium.edu.my

Abstract— Call Detail Records (CDRs) are a valuable source of information; they open new opportunities for the telecom industry to maximize its revenues, and they help the community to raise its standard of living in many different ways. However, CDRs must be analyzed in order to extract this big value, and CDRs have huge volume, a variety of data, and high data rates, while current telecom systems were designed without these issues in mind. CDRs can be seen as a Big Data source, and hence it is applicable to use Big Data technologies (storage, processing, and analysis) in CDR analytics. There are considerable research efforts to address the CDR analysis challenges. This paper presents the use of Big Data technology in CDR analysis by giving some examples of CDR-analytics-based applications, highlighting their architecture, the utilized Big Data tools and techniques, and the CDR use case scenarios.

Index Terms— CDR, Big Data, Telecom, analysis, Parallel DBMS, MapReduce.

I. INTRODUCTION

Telecom, as a center for communication, telephony, and internet, has an extreme variety of data, such as CDRs, network data, and subscribers' personal and billing data.

A CDR is a record that contains detailed information about a telecom transaction, such as the call start time, end time, duration, call parties, cell ID, and requested websites. The CDR lifecycle generally begins with CDR generation for a call; the record is updated according to the events that occur in the call (call end, call join, etc.), and is then collected by different network elements. After that it goes through the mediation system; mediation systems format the raw CDRs into a predefined format that is compatible with the other telecom system modules. Finally, each CDR is written to the file system for later use.

The Call Detail Record (CDR) is telecom's most valuable source of data; it is used in telecom's fundamental processes (charging, settlement, billing, network efficiency, fraud detection, revenue assurance, churn detection, value added services, business intelligence, etc.). Moreover, CDRs may help improve many existing services and processes in areas such as business intelligence, marketing, and transportation [1, 2]. There are main use case scenarios among the heterogeneous CDR based applications: for example, subscriber based applications, which characterize subscribers based on their activities (number or duration of calls, number of SMSs, location, used data volume, etc.); location or route based applications (most visited places, crowded streets, and suitable places for restaurants); time frame based applications (surge times of calls, SMS, and internet use, and the best time for certain offers); and behavioral applications, which can be used in social studies. TABLE I shows some examples of CDR analytics based applications inside and outside the telecom industry.

However, we need to analyze CDRs in order to extract this valuable information, yet the telecom industry was built without the high CDR analytics capabilities needed for the above use case scenarios.

Big Data has many definitions, but generally it is data with the 3Vs (Volume, Variety, Velocity) characteristics [27]. Most of the recent research on Big Data analysis has moved towards parallel and distributed computing, which provides cost effective and scalable solutions capable of processing Big Data with the required performance measures, unlike centralized computing, which faces scalability issues when it comes to Big Data volume, variety, and velocity.

This paper is organized as follows: section II illustrates how CDR is considered a Big Data source, section III introduces two Big Data technologies, and section IV gives some CDR analytics based applications that utilize Big Data technologies. Finally, the paper ends with a conclusion in section V.

TABLE I
EXAMPLES OF CDR ANALYTICS BASED APPLICATIONS INSIDE AND OUTSIDE THE TELECOM INDUSTRY

Inside Telecom:
- Real time analysis and decision making (dynamic network monitoring, location based services).
- Precise marketing (offer optimization, churn identification, churn prediction & SNA, personalized telecom and 3rd-party advertisements).
- Operation efficiency (preemptive customer care using IVR, cell-site optimization).
- Customer experience enhancement.
- Analyzing the profitability of both customers and services.

Outside Telecom:
- Smart marketing and political campaigns.
- Social studies (e.g. constructing call graphs from SMS CDRs, visualizing the regional flows of people, comparing rural and urban social and economic behavior in particular places).
- Emergency response (subscribers that live in or frequently go to the emergency location, or were there at the time of the emergency).

978-1-4673-7869-7/15/$31.00 ©2015 IEEE
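As a concrete illustration of the record structure described in the introduction, a single mediated CDR can be modeled as a typed record. The Python sketch below is illustrative only: the field names, the comma-separated layout, and the timestamp format are assumptions, since real CDR formats are operator-specific and typically carry over 100 fields.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative CDR layout; real mediated CDRs are operator-specific
# and carry many more fields than shown here.
@dataclass
class CDR:
    caller: str
    callee: str
    start: datetime
    end: datetime
    cell_id: str
    call_type: str  # e.g. "voice", "sms", "data"

    @property
    def duration_s(self) -> float:
        return (self.end - self.start).total_seconds()

def parse_cdr(line: str) -> CDR:
    """Parse one comma-separated CDR line (assumed mediation output format)."""
    caller, callee, start, end, cell_id, call_type = line.strip().split(",")
    fmt = "%Y-%m-%d %H:%M:%S"
    return CDR(caller, callee,
               datetime.strptime(start, fmt),
               datetime.strptime(end, fmt),
               cell_id, call_type)

record = parse_cdr("60123456789,60198765432,"
                   "2015-01-01 09:00:00,2015-01-01 09:02:30,KL-0042,voice")
print(record.duration_s)  # 150.0
```

Derived fields such as the duration above are what subscriber, location, and time-frame based applications aggregate over at Big Data scale.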


II. CDR IS BIG DATA

CDRs have huge volume; a network with 10 million subscribers can generate billions of records monthly. Moreover, CDRs have a variety of data (subscriber data, location data, time data, behavioral data, network usage data, etc.), with over 100 fields per record and different call types; a CDR is not only a record of voice call details but also of SMSs, MMSs, conference and video calls, internet access, etc. In addition, CDRs have high velocity, with a data rate growing to the point that much research focuses on CDR stream processing [8, 9, 10].

CDRs thus have the 3Vs characteristics of Big Data (Volume, Variety, and Velocity); therefore they can be seen as a Big Data source, and hence it is applicable to use Big Data technologies (storage, processing, and analysis) in CDR analytics.

III. BIG DATA STORAGE AND ANALYSIS TECHNOLOGIES

There are two main approaches to Big Data analysis: parallel DBMS and MapReduce [4].

A. Parallel DBMS
A parallel DBMS is software that manages a database which is physically distributed over a number of nodes as if it were a centralized database system [5]. The database is distributed in a shared-nothing architecture, which spreads the storage and processing among the machines in a cluster. It is widely used for large-scale data processing because it is mature and has a stable structure that has developed over more than two decades [6]. In addition, it has the usual relational database features, such as schemas, indexing, and query optimization, which improve analysis and processing performance.

However, a parallel DBMS deals with structured and semi-structured data only, and has limited scalability and fault tolerance capabilities, especially when the machines have a high failure rate. Moreover, it uses the SQL language for programming, and many analysis algorithms would be difficult to implement in SQL.

Many systems were implemented on the parallel and distributed DBMS structure, such as Vertica [7] (a column-oriented DBMS) and Teradata [8] (which can scale up to 400 nodes).

B. MapReduce
MapReduce is a simple programming model used to process large-scale data in a parallel and distributed architecture. It is a processing framework that can deal with any type of data store (relational databases, files, NoSQL). It has high scalability; it can manage data processing on thousands of nodes with optimized fault tolerance. Moreover, it processes structured, semi-structured, and unstructured datasets [9]. However, it has been criticized for its poor processing time and high energy consumption [10] when compared to two parallel DBMS models [11]. For implementation, users express their program functionality in the form of two functions, map() and reduce().

MapReduce was introduced by Google [12] as a programming model for large-scale data analysis in 2004. Google evaluated its performance using large datasets on large clusters (up to 2,000 machines) with different programming tasks, and it showed superior performance measures. But when it was evaluated again by [11] on smaller clusters (up to 100 nodes), the startup time overhead was almost 30% of the total processing time. However, when [13] utilized indexing techniques in the I/O operations, the processing time improved by 2.5 to 10 times at the 100-node scale, depending on the implemented Map/Reduce jobs.

Many systems were implemented on the MapReduce framework. The Apache Hadoop [14] platform, for example, with its modules and related projects, provides systems for the data storage layer (Hadoop Distributed File System, HDFS), the key-value storage layer (Cassandra [15], HBase [16]), the parallel processing layer (Hadoop), the high-level language layer that deals with the lower layers (Hive [17], Pig [18]), and streaming processing (Apache Storm [19]). These different systems have different specifications and fit various purposes.

IV. EXAMPLES OF CDR ANALYSIS SOLUTIONS BASED ON BIG DATA TECHNOLOGIES

A considerable part of the recent CDR-analytics-based applications and research has utilized one of the Big Data technologies: parallel DBMS, MapReduce, or others.

A. Parallel DBMS based solutions
The commercial software industry has done considerable work on CDR analysis and processing utilizing parallel DBMSs. For example, Vertica (a column-oriented DBMS) in 2007 [25] conducted benchmark testing of CDR analysis on 1.2 TB using only a 3-node cluster (each node: 2 CPUs, 16 GB RAM, 1 TB disk); the benchmark results (Fig. 1) show a 0.5 minute average query response time. In 2013, Microsoft and Redknee presented a solution with better benchmark results: Redknee's TCB (Turnkey Converged Billing) [28] solution running on SQL Server 2012 with NEC servers, Intel Xeon processors, and X-IO storage. The solution is used for invoice report generation [26]. The benchmark results proved the solution's capability for large-volume CDR analysis (100-250 million subscribers) with lower processing, storage, and licensing costs, along with good and scalable performance (a peak of 1,250 invoices per second) (Fig. 2).

Fig. 1 Vertica 1.2 TB CDR benchmark results [25]
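The map()/reduce() programming model of section III.B can be illustrated with a single-process sketch. The Python code below mimics the map, shuffle, and reduce phases to total airtime per subscriber; the toy records and the `map_fn`/`reduce_fn` names are illustrative, not the Hadoop API, and a real job would distribute each phase across the cluster.

```python
from collections import defaultdict
from itertools import chain

# Toy CDR extracts: (caller, duration_seconds). A real job would parse
# full CDRs out of HDFS splits instead.
cdrs = [("alice", 120), ("bob", 30), ("alice", 45), ("carol", 300), ("bob", 15)]

def map_fn(record):
    caller, duration = record
    yield caller, duration  # emit (key, value) pairs

def reduce_fn(key, values):
    return key, sum(values)  # total airtime per subscriber

# Shuffle phase: group mapper output by key, as the framework would
# do between the map and reduce stages.
groups = defaultdict(list)
for k, v in chain.from_iterable(map_fn(r) for r in cdrs):
    groups[k].append(v)

result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)  # {'alice': 165, 'bob': 45, 'carol': 300}
```

Because map() is stateless per record and reduce() sees one key at a time, the framework can split both phases across thousands of nodes, which is the scalability property the section describes.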

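One of the classic features of the streaming solution surveyed below ([23]) is a bloom filter that de-duplicates CDRs in memory. The following is a minimal sketch of that idea only; the bit-array size, hash count, and MD5-based hashing are arbitrary illustrative choices, not the parameters used by [23].

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1 << 16, n_hashes=4):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by salting one hash function.
        for i in range(self.n_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
cdr_ids = ["cdr-001", "cdr-002", "cdr-001"]  # third record is a duplicate
unique = []
for cid in cdr_ids:
    if cid not in seen:  # duplicates are always caught; a rare false
        seen.add(cid)    # positive would wrongly drop a new record
        unique.append(cid)
print(unique)  # ['cdr-001', 'cdr-002']
```

A bloom filter never misses a true duplicate; the trade-off is a small, tunable probability that a genuinely new CDR is wrongly dropped, which is why a production system would size the bit array for the expected stream volume.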
Fig. 2 Microsoft and Redknee benchmark testing results for invoicing [26]

B. MapReduce based solutions
Projects based on the MapReduce framework (Hadoop and its related projects) are widely used because most of them are open-source.

The first example is a criminal investigation solution developed by [20], which uses Hive (for querying) with HDFS (as the file system). The system was implemented on the cloud to reduce the solution cost. It has better overall efficiency measures and cost reduction compared to the original platform.

Another example is a Social Network Analysis (SNA) system [21], which uses CDRs to classify influential subscribers. The system was built on Hadoop, and the SNA algorithms were re-implemented on the MapReduce programming model. The process of classifying the influential subscribers includes CDR filtering, SNA algorithm processing, and specific machine learning processing that adjusts weights to determine the most influential subscribers on the network. The SNA processing stage was implemented on Hadoop, while the machine learning part is done using the Weka toolbox in a centralized architecture. The developed solution showed acceptable processing durations for all the stages; however, the iterative part of the SNA algorithms consumed more time because Hadoop treats each SNA iteration as a separate Map/Reduce job.

Some researchers have moved towards designing and developing a special platform for telecom Big Data analysis that comes along with a high level language to simplify the writing of complicated analysis scripts (complex queries) for business users. Such a platform [22] has been developed on the MapReduce framework and uses HBase as a NoSQL data store. The high level language is called a DSL (Domain Specific Language); it hides the complexity (for business users) of writing analysis queries on MapReduce, Pig, or Hive. Hence users have to learn only one programming language (the DSL) and can perform all the required analysis processes on this platform utilizing only this language. However, the platform is not generic enough to deal directly with SQL queries, so the user has to write a DSL script each time, even for the simplest and most straightforward SQL queries.

DisTec [3] is another SNA solution, built for marketing purposes. It is based on a parallel and distributed computing architecture and uses the Hadoop platform along with HDFS. The SNA analysis algorithms were re-implemented on the MapReduce programming model. The system has three layers (Fig. 3): a data layer (the current database system, HDFS for CDRs, and virtualization and access tools for both database systems); a basic service layer, which gives end users access to the data warehouse and supports the higher layer services; and a third layer that provides core services (for the knowledge discovery process) along with three different interfaces (WS, RMI, and API).

Fig. 3 DisTec overview [3]

Real time and near real time analysis of telecom Big Data requires streaming analysis instead of batch computation to decrease the processing latency. For example, [23] is a system for processing CDRs in real time; the system was designed to be scalable and configurable, and to process up to 220,000 CDRs/second with high accuracy and proper fault tolerance. Many classic features were utilized to obtain the required performance measures despite the high velocity and big size challenges of the CDR stream: in-memory processing for high system throughput, a bloom filter to de-duplicate CDRs, record checkpoints for fault tolerance, parallel database insertion, and CDR analysis on the fly. The system was implemented on IBM InfoSphere Streams [24]. It ingests CDRs directly from the source, parses them into the required format, processes them with pre-specified rules from configuration files, records the checkpoints, and de-duplicates the CDRs; the CDR data is then ready for real time aggregation and analysis, as well as for storing in the repository.

This solution addressed important telecom Big Data system requirements and challenges, and it can be utilized successfully for many applications that require real time and/or batch CDR analysis. However, since the system takes CDRs directly from the network elements, it either replaces part of the already implemented system or duplicates its CDR collection and processing functionality, which can be seen as a change to the current system or an extra cost. Moreover, system scalability was not addressed, despite its high importance, especially in CDR processing.

The Kanthaka [6] system analyzes CDRs using predefined rules to select the customers who are eligible to get promotions from
specific commercial agents. The rules for promotions depend on the number of calls or SMSs to a specific commercial agent over a period of time. CDRs are provided to the solution periodically as a CSV file. This solution faces processing time constraints, as each CSV file has to be processed before the next CSV file is sent. The Cassandra system (NoSQL) was used as the parallel and distributed platform because it is scalable enough for this application (only 2 nodes are used), and hence avoids the extra networking and data loading overhead that comes with MapReduce (Hadoop specifically) and with distributed and parallel DBMSs.

The front end module of Kanthaka is a web based user interface that is responsible for adding, removing, or updating the commercial agents' promotion rules, while the backend module consists of two main parts. The first part uses the promotion details to analyze the CDRs in the CSV file and update each rule's counter hash map, utilizing a centralized architecture. The second part is the Cassandra cluster, which is updated periodically with the data in the hash maps (in the system cache); unlike the first part, this part uses the parallel and distributed architecture of the Cassandra cluster. The bottleneck of the Kanthaka solution is the centralized part of the backend module; its performance is dramatically affected by the size of the CSV file and by the number of promotion rules.

Table II below shows the comparison between the previous CDR analysis solutions, highlighting the advantages and the limitations of each.

V. CONCLUSIONS

From the above mentioned CDR analysis solutions, it is clearly seen that Big Data storage, processing, and analysis technologies fit CDR analysis. They have shown better performance measures and cost efficient solutions in batch CDR processing as well as in real time and stream processing.

Parallel DBMS benchmarks have shown better performance measures for querying and predefined processes (invoice report generation); however, for other time consuming processes and complicated analysis algorithms, MapReduce has offered better flexibility (algorithms can be re-implemented in MapReduce).
TABLE II
COMPARISON BETWEEN THE PREVIOUS CDR ANALYSIS SOLUTIONS, HIGHLIGHTING THE ADVANTAGES AND THE LIMITATIONS OF EACH

Vertica [25] (2007)
  Advantages: linear scalability; analyzes up to 21 TB of CDRs; low hardware cost.
  Limitations: long average response time.

Redknee TCB running on SQL Server 2012 [28] (2012)
  Advantages: processes stored CDRs at more than 100,000 CDRs/second and up to 540 invoices/sec; linear scalability.
  Limitations: addresses only one CDR use case (invoice generation).

DisTec [3] (2009)
  Advantages: considers the system's quality of service, security, and data privacy.
  Limitations: cannot be extended to support real time processing.

Processing 6 billion CDRs/day [23] (2012)
  Advantages: processes CDR streams in real time, up to 220,000 CDRs/second.
  Limitations: scalability was not addressed.

Big Data storage techniques implemented for criminal investigation for telecom [20] (2013)
  Advantages: scalable solution; overall performance improvement and cost reduction compared to the original platform.
  Limitations: addresses only use cases related to the criminal investigation application.

Kanthaka: Big Data Call Detail Record (CDR) analyzer for near real time telecom promotions [6] (2013)
  Advantages: processes batch CDRs at an average rate of 25,000 CDRs/second; provides a user interface for configuring promotion rules.
  Limitations: low scalability; considers offers and promotions use cases only.

Big Data platform development with a Domain Specific Language for telecom industries [22] (2013)
  Advantages: analyzes many kinds of telecom data, such as billing, subscriber profiles, and network data usage; performance up to 30 MB/sec.
  Limitations: no APIs for external applications.

REFERENCES

[1] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, and U. Dayal, "Challenges and Opportunities with Big Data," a community white paper developed by leading researchers across the United States, Cyber Center Technical Reports, 2011, http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1000&context=cctech.
[2] R. Cumbley and P. Church, "Is 'Big Data' creepy?," Linklaters LLP, Computer Law & Security Review, pp. 601-609, 2013.
[3] S. Yang, B. Wang, H. Zhao, Y. Gao, and B. Wu, "DisTec: Towards a distributed system for telecom computing," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom), Beijing, China, Vol. 5931 LNCS, pp. 212-223, 2009.
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM, pp. 165-178, June 2009.
[5] M. T. Özsu and P. Valduriez, "Distributed and parallel database systems," ACM Computing Surveys, Vol. 28, No. 1, pp. 125-128, March 1996, http://dl.acm.org/citation.cfm?id=234368.
[6] P. Valduriez, "Parallel database systems: Open problems and new issues," Distributed and Parallel Databases, pp. 137-165, April 1993.
[7] HP, Vertica web page, http://www.vertica.com/, [April 15, 2014].
[8] Teradata, Teradata web page, http://www.teradata.com/, [April 15, 2014].
[9] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
[10] J. Leverich and C. Kozyrakis, "On the energy (in)efficiency of Hadoop clusters," ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 61-65, 2010.
[11] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 35th SIGMOD International Conference on Management of Data, ACM Press, New York, pp. 165-178, 2009.

[12] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, USENIX Association, pp. 137-150, 2004.
[13] D. Jiang, B. C. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: An in-depth study," VLDB Endowment, Vol. 3, No. 1, pp. 472-483, 2010.
[14] Apache Foundation, Hadoop web page, http://hadoop.apache.org/, [April 15, 2014].
[15] Apache Foundation, Cassandra web page, http://cassandra.apache.org/, [April 15, 2014].
[16] Apache Foundation, HBase web page, http://hbase.apache.org/, [April 15, 2014].
[17] Apache Foundation, Hive web page, http://hive.apache.org/, [April 15, 2014].
[18] Apache Foundation, Pig web page, http://pig.apache.org/, [April 15, 2014].
[19] Apache Foundation, Storm web page, http://storm-project.net/, [April 15, 2014].
[20] J.-C. Tseng, H.-C. Tseng, C.-W. Liu, C.-C. Shih, K.-Y. Tseng, C.-Y. Chou, C.-H. Yu, and F.-S. Lu, "A successful application of big data storage techniques implemented to criminal investigation for telecom," in Proceedings of the 15th Asia-Pacific Network Operations and Management Symposium (APNOMS), Hiroshima, Japan, pp. 1-3, 25-27 Sept. 2013.
[21] J. Magnusson and T. Kvernvik, "Subscriber classification within telecom networks utilizing big data technologies and machine learning," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine'12), Beijing, China, pp. 77-84, August 12, 2012, http://dl.acm.org/citation.cfm?doid=2351316.2351327.
[22] Senbalci, S. Altuntas, Z. Bozkus, and T. Arsan, "Big Data platform development with a Domain Specific Language for telecom industries," in Proceedings of the 10th International Conference on High Capacity Optical Networks and Emerging/Enabling Technologies (HONET-CNS), Famagusta, Cyprus, pp. 116-120, 11-13 December 2013.
[23] E. Bouillet, R. Kothari, V. Kumar, L. Mignet, S. Nathan, A. Ranganathan, D. S. Turaga, O. Udrea, and O. Verscheure, "Processing 6 billion CDRs/day: from research to production. Experience report," in Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems (DEBS'12), ACM, New York, pp. 264-267, 2012.
[24] IBM, InfoSphere Streams web page, http://www-03.ibm.com/software/products/en/infosphere-streams/, [April 15, 2014].
[25] Vertica, "The Vertica Analytic Database – Introducing a New Era in DBMS Performance and Efficiency," Technical White Paper, 2007.
[26] Microsoft and Redknee, "Benchmark Testing Results: Redknee TCB Running on SQL Server 2012," Technical White Paper, 2012.
[27] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, and U. Dayal, "Challenges and Opportunities with Big Data," a community white paper developed by leading researchers across the United States, Cyber Center Technical Reports, 2011.
[28] Redknee, Redknee TCB, http://www.redknee.com/products/tcb/, [June 30, 2015].
BigMine’12, in Proceedings of the 1st International Workshop on Big

