
Warming Up

This document is an attempt at summarizing the technologies that play a part in the (big data) analytics ecosystem. I put the words "big data" in parentheses because I will be focusing on the analytics aspect. Big data is a relatively new (industry) term, but not necessarily a new concept. If we limit our understanding of big data to volume alone, we may be missing the bigger picture, because big is relative: our civilization has been dealing with ever-increasing volumes of data, and what was big 10 years ago is quite likely not so big today. Just a heads-up: from this point on, the term big data sometimes refers to the techniques & tools for dealing with big data.

People coined the Vs (Volume, Velocity, Variety) as a way to characterize big data. I think talking about Variety and Velocity is a good / easier way to start a discussion about big data (especially with people coming from a strong database administration / data warehousing background). At times I need to convince (or doubt) myself about anything, including big data, and I found it is easier to convince myself about big data if I start from Variety and Velocity.

Why Variety? Because people (as users of technologies) can easily appreciate the fact that data are coming from more and more varied sources, thanks to cheaper sensors, mobility, and hyperconnectivity. Everybody with a cellphone generates data now. Every single activity they do, on various online services, through their cellphone generates the so-called data exhaust. Consequently, we are dealing with a variety of data formats, from structured to unstructured1. At this point I'm skeptical, wondering: what unique role does big data play in this situation? We already have and perform Extract Transform Load (ETL; familiar to data warehousing practitioners) to address that challenge. Right, so maybe we should look at the other aspect to support this argument: velocity. Why?
Because an ETL process implies non-realtime analytics; the data is not processed just-in-time; it has to go through the transformation stage before it is loaded into the data warehouse, where the data is eventually picked up to be analyzed. There are actually several articles about data warehousing in the advent of big data. I was trying to understand whether they are competing, or whether they complement each other. I admit I haven't read through those articles, so I'll just put their links here for now: one from O'Reilly2, and the other from Teradata3.

People are talking about soft real-time analysis of data (or events). There's a phrase for that: Complex Event Processing (CEP), which basically is a platform that enables us to observe a moving window of events (taken from a continuous stream of data) and do time-series analysis, such as pattern matching, on that window4. One open-source product for CEP that I know of is JBoss Drools Fusion5.

There's a strong link between Velocity and Variety, provided by Value6. Fusing data from a broad range of sources (variety) opens up the risk of having low-value data: data with a low signal-to-noise ratio.
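To make the moving-window idea concrete, here is a minimal sketch in Python. Drools Fusion itself uses its own rule language; the event kinds, the 60-second window, and the "3 failed logins" rule below are all made up for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    kind: str         # e.g. "login_failed"
    timestamp: float  # seconds

class SlidingWindow:
    """Keep only the events that fall inside the last `span_seconds`."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()

    def push(self, event):
        self.events.append(event)
        # Evict events that have slid out of the time window.
        while self.events and event.timestamp - self.events[0].timestamp > self.span:
            self.events.popleft()

    def count(self, kind):
        return sum(1 for e in self.events if e.kind == kind)

# A CEP-style rule: alert if 3+ failed logins occur within 60 seconds.
def suspicious(window):
    return window.count("login_failed") >= 3

w = SlidingWindow(span_seconds=60)
stream = [
    Event("login_failed", 0),
    Event("login_failed", 20),
    Event("login_ok", 30),
    Event("login_failed", 45),  # third failure inside the window
]
alerts = []
for ev in stream:
    w.push(ev)
    if suspicious(w):
        alerts.append(ev.timestamp)
print(alerts)  # [45]
```

A real CEP engine adds much more (temporal operators, out-of-order events, many concurrent rules), but the core is this: a window over a stream, and rules evaluated as events arrive.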
1 http://www.finextra.com/community/FullBlog.aspx?blogid=6129
2 http://strata.oreilly.com/2011/01/data-warehouse-big-data.html
3 http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP
4 http://bit.ly/xrLyV1
5 http://www.jboss.org/drools/drools-fusion.html
6 http://www.finextra.com/community/fullblog.aspx?blogid=6222

Example: Facebook / Twitter. The need for speed here is to find high-value data among piles of low-value data. At this point maybe I should stop for a while trying to convince myself about big data. The practical insights I will gather along the way will either clear things up or provide me with better / bigger / more fundamental questions. Either of them is beneficial. For now I'll just take it as fact that big data is important, it's here, and it's part of the continuum of tools we need to employ to bring up intelligence. I'll switch back to analytics, but before that I would like to do a round-up of the big data tools I've found so far. The following is a summary of the tools I'm currently learning to use. The descriptions are in my own words, based on my current understanding, so they are not comprehensive and may contain inaccuracies.

Hadoop: Provides a programming framework that implements MapReduce and a platform for its execution. It also provides a distributed filesystem smart enough to ensure minimum data motion7 during the parallel processing of a (big) dataset spread across a cluster of machines. It splits the dataset such that the task assigned a subset of the data executes on the machine where that subset is located (locally or nearby); this is the Map phase. Then it coordinates the aggregation of the results of computation collected from the task nodes (during the Reduce phase). It is still not clear to me how this process affinity is achieved8, but it looks to me like it's done through configuration of the Hadoop cluster, specifically optimized for the application we're working on (knowing the nature of the data distribution, the pipeline of data processing, etc.).

Apache Pig: While Hadoop provides a programming library for implementing the map and reduce tasks, the Apache Pig project raises the level of abstraction, allowing people to specify those operations in an SQL-like syntax. It brings productivity / efficiency to the project. Yahoo is said to have 40% to 60% of its Hadoop workloads implemented in Pig scripts9.

Apache Mahout: A collection of machine-learning algorithms (e.g. clustering, naive Bayes, covering, linear modeling, etc.), ready to use, for execution over a Hadoop cluster. A big relief: thanks to Mahout we can save the project time that would otherwise go into making those standard machine-learning algorithms parallelizable, using the MapReduce approach, specifically for execution on a Hadoop cluster.

Drools Fusion: A platform & programming libraries for complex event processing. Basically we define rules where we specify the way we correlate an event with past event(s), and make a conclusion based on that. Based on that conclusion, an action is taken (which we also have to specify / program). Obviously the rules will have to be discovered beforehand (e.g. by applying machine-learning techniques over historical records).

MongoDB: One of many NoSQL databases available in the market. NoSQL is a catch-all term for non-relational databases (which do not use SQL as the query language). My understanding at this moment is that NoSQL databases excel at scalability, by sacrificing the ACID (Atomicity, Consistency, Isolation, and Durability) properties we can expect from an RDBMS. MongoDB is a document-oriented database10, meaning (according to Wikipedia): designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. So, I guess if what we're capturing & storing looks like documents (e.g. forms, receipts, articles, etc.), we should probably consider MongoDB. The articles / book I've read on MongoDB gave me the impression that data modeling for MongoDB does not put much emphasis on normalizing our data (unlike data modeling for an RDBMS, where normalization is the cornerstone of achieving consistency). Document-oriented databases seem to focus on speed of retrieval and on horizontal scaling (a denormalized database is easier to spread across machines).

Others in the NoSQL landscape: Existing approaches in the NoSQL landscape can be categorized as:
Key/value stores
Column-oriented stores
Document-oriented stores
Object databases
Graph databases

7 http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
8 http://www.plexxi.com/wp-content/uploads/2013/03/Plexxi-Use-Case-Hadoop-March-2013.pdf
9 http://www.ibm.com/developerworks/library/l-apachepigdataquery/
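The Map / shuffle / Reduce flow described for Hadoop can be simulated locally in a few lines of Python. This only illustrates the programming model, not Hadoop's distributed execution or data locality; the word-count example and its inputs are the classic textbook illustration, made up here.

```python
from collections import defaultdict

# Map phase: each "task" turns its input split into (key, value) pairs.
def map_task(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle: group intermediate pairs by key, as Hadoop does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values collected for each key.
def reduce_task(key, values):
    return key, sum(values)

# Simulate two input splits that would live on two different nodes.
splits = ["big data is big", "data about data"]
intermediate = [pair for split in splits for pair in map_task(split)]
result = dict(reduce_task(k, vs) for k, vs in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

In Hadoop the map tasks run where their split of the data lives, and the framework handles the shuffle across the network; the programmer supplies only the two functions.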

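As a toy illustration of the document-oriented, denormalized modeling described above for MongoDB (the collection shape and field names are invented for the example), a document is essentially a nested structure, like this Python dict:

```python
# A denormalized, document-shaped record: the order embeds its line items
# and shipping address instead of referencing separate normalized tables.
order_document = {
    "_id": "order-1001",
    "customer": {"name": "Ana", "email": "ana@example.com"},
    "items": [
        {"sku": "A-1", "qty": 2, "unit_price": 9.50},
        {"sku": "B-7", "qty": 1, "unit_price": 30.00},
    ],
    "shipping_address": {"city": "Mexico City", "country": "MX"},
}

# One read returns everything needed to render the order -- no joins.
total = sum(item["qty"] * item["unit_price"] for item in order_document["items"])
print(total)  # 49.0
```

The trade-off is the one mentioned above: retrieval is fast and sharding is easy, but the same customer data may be duplicated across many documents, which an RDBMS would normalize away.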
I believe our choice should be based on the structure (or lack thereof) of the data we're dealing with, the kind of questions we want answered, and obviously the application requirements. A common cliché: a NoSQL database is not the right choice for a financial system. That is a sweeping statement, as if a financial system were one big single application. Maybe it should be rephrased as: NoSQL, for its lack of support for transactions, should not be used in applications where we have to support use-cases like transferring money between accounts. For other applications, such as capturing tick data (deals, transfer events, stock prices, etc.), a NoSQL database has its place in a financial system11. Finally, I'd like to point to a book, which I might recommend later, that can help us better understand the nature of several popular NoSQL databases:
10 http://en.wikipedia.org/wiki/Document-oriented_database
11 http://www.10gen.com/post/45116404296/how-banks-use-mongodb-as-a-tick-database

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement12. I think that's quite a solid toolset for big data analytics. I won't spend much more time philosophizing; not until I get more insight from using those tools to crack some problems. What problems? Well, one good place to start is Kaggle.com, where we can get problems (accompanied by datasets) and compete.

12 http://www.amazon.com/Seven-Databases-Weeks-Movement-ebook/dp/B00AYQNR50/ref=tmm_kin_title_0?ie=UTF8&qid=1366862920&sr=8-1

Wandering
Among the many challenges that CSPs are facing these days, we will have a look at two of them, namely customer retention and dynamic pricing. How did they end up there? Several factors:

1. Saturated market.
2. Number portability.
3. Inability of CSPs to come up with killer app(s).

Factors 1 and 2 lead to churn prevention. Factor 3 leads to dynamic pricing. It's only logical: with not much room left for expansion, maintaining the customer base becomes a basic necessity for survival. With number portability, the situation is even worse for the CSP; a subscriber can switch to another CSP without worrying about losing her current number. On to point #3, no killer app. CSPs have been complaining about decreasing ARPU (Average Revenue Per User). They can't just hike prices now (after they sacrificed them for the sake of expansion). So they look for revenue generation from VAS (Value-Added Services). The following screenshots show the VAS offered by Telcel.

I personally don't find those services valuable enough to pay for; I never use them. In the last picture there is Identificador de Llamadas (description: know the number of the person calling you, and / or the name if that person is registered in the contact list of your device). I'm not sure how that is a VAS. Are they charging me for having that service? I certainly hope not! I can accept Banca Móvil as a valuable service. But then again, banks (such as Bancomer) already offer an app for Android & iOS that does just that, and I'm pretty sure it has nothing to do with Telcel, because the app uses internet technologies all the way (meaning: no income for Telcel from usage of the Bancomer-provided app). I guess the contribution of VAS to the revenue of a CSP is not significant enough to compensate for decreasing ARPU.

Here is another common complaint from CSPs: they act only as a pipe, without benefiting enough from their investment in building the infrastructure. The one that benefits from the traffic generated by video browsing on YouTube is... mainly Google. For photo uploads on Instagram... mainly Instagram. The same goes for Facebook, Twitter, etc13. Therefore, they have started experimenting with online-charging schemes that take traffic into account. Something nicer than a simple bytes-to-cents function, in order to avoid a backlash from subscribers. To me, being nice here means: at the least there would be a notification when a certain browsing activity can incur additional charges, letting the user decide the next action (proceed with extra payment, or cancel); minimizing surprises (and complaints). More on that after the box below.

A clarification about the contribution and trend of VAS in the revenue generation of CSPs: this article is not scientific; it's only a summary of what I gathered from various sources. I tried to find something to back up my hunch that VAS will die out.
It turns out the situation differs from market to market, and even from one segment to another in the same market. In Telcel's case, for example, I don't know how they label me in their system :) but certainly not as low-end. I have a smartphone with an unlimited 3G plan. Obviously I use the internet for everything. I have no idea what percentage of Telcel subscribers still use low-end cellphones, and what percentage of them actually use the VAS displayed above. So I googled for VAS revenue decline and VAS revenue increase (and similar queries). I even googled for Telcel earnings report (found nothing). I found conflicting results. In India, more than 33% of the total revenue of Tata Docomo comes from VAS14. I'd like to think this result draws on the skills and experience of NTT in VAS (43% of whose revenue reportedly came from VAS in 201015). So, NTT knows how to do it very well. Part of the strategy, I guess, is providing incentives for innovation16.

13 Although that can change if CSPs open their platforms, giving software developers access to build / enhance applications using telecommunication services provided by the CSP; for example, by providing an API like Twilio (http://www.twilio.com) or BlueVia (http://www.bluevia.com).
14 http://www.wirelessduniya.com/2011/11/11/tata-docomo-vas-revenue-surpasses-33-of-its-entire-revenue/
15 http://www.voicendata.com/voice-data/news/166226/data-services-revenue-exceed-usd330-bn-2013
16 http://www.thehindubusinessline.com/todays-paper/tp-info-tech/ntt-docomo-laments-limited-valueadd-in-india/article1060116.ece

It's very similar to the revenue share developers get from the Android store & the Apple App Store17.

Back on Track
Now I will steer the essay back to (big data) analytics. To recap, the situation provides motives for the following use-cases:

1. Churn prevention.
2. Dynamic pricing.

These are only two of a dozen use-cases listed in the following diagram18.

The line of thinking I used in this essay is the usual one: start from the goal. Specifically, the approach will be in this order:

1. What questions need / want to be answered? (one end)
2. What data do we have? (the other end)
3. Techniques? (fill in the blank in between)

In the case of churn modeling: what are the questions normally / possibly asked by a CSP (the ones that lead them to building their churn model)? My lame attempt at mimicking Shakespeare: What are the questions? That is the question. I googled around, found nothing concrete enough to my liking, so I
17 http://www.techrepublic.com/blog/app-builder/app-store-fees-percentages-and-payouts-what-developers-need-to-know/1205
18 http://www.intracom-svyaz.com/download/eng_pdf/bigdata/BigStreamer.pdf

have to resort to writing something that sounds logical to me. So... basically, based (mainly) on historical data, we want to find out how certain actions, events, and / or conditions lead up to customer churn. Actions can be a price change or a marketing campaign; events can be dropped calls or congestion; conditions are basically attributes related to the customers. All this information is gathered from sources like the OSS (Operational Support System) and BSS (Business Support System).

The diagram is copied from http://ossline.typepad.com/.a/6a0105359f53d8970c0147e06562af970b-pi

From the OSS we get network-related information (things like dropped calls are gathered there). From the BSS we get business-related information (customer details, billed amounts, and contracts are all in that area). For more information about OSS/BSS, this page is a good start: http://www.quora.com/How-would-you-explain-OSS-and-BSS-to-a-layman

Now, let's get really practical. Oracle, for example, has an offering called Oracle Communications Data Model19 (part of its BSS offerings). As the name implies, it models the entities needed for the business operation of a CSP. One of the components of OCDM is a data-mining model. The following screenshot of Oracle's online documentation can give you an idea of what it is:

19 http://www.oracle.com/us/products/applications/communications/industry-analytics/data-model/overview/index.html

From that we can glimpse features relevant for churn modeling20. The general idea can be transferred to situations where we don't use Oracle's product and have to build the model by hand. Here are some of them:

Customer id
Target column of churn model
Number of future contract count in last 3 months
Subscription count in last 3 months
Suspension count in last 3 months
Contract count in last 3 months
Complaint count in last 3 months
Complaint call count to call center in last 3 months
Complaint call count to call center in the lifetime in last 3 months
Contract left days in last 3 months
Account left value in last 3 months
Remaining contract sum in last 3 months
Debt total in last 3 months
Loyalty program balance in last 3 months
Total payment revenue in last 3 months
Monthly revenue (ARPU) in last 3 months
Contract ARPU amount in last 3 months
Party type code, individual or organizational, in last 3 months
Business legal status
Marital status for individual user
Household size
Job code
Nationality code
....
For how long billing address has been in effect, in days
....

Of course, those attributes in OCDM are tied to the algorithm used in the product for generating prediction models. Some of the dimensions might not be relevant for our specific case. As in any data analysis activity, we have to apply some algorithm to select the relevant dimensions to base our analysis on (http://en.wikipedia.org/wiki/Feature_selection). But at least those features in OCDM give us a starting point. From there we can work backward to the parts of the system where those data can be obtained (e.g. the billing and charging system, CRM, etc.).

Let's look at another product, BSCS iX (a billing and charging solution now offered by Ericsson), just to get a glimpse of what is out there. The screenshot shows a table where invoices are stored.
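As a rough sketch of how attributes like these could feed a crude, filter-style feature selection: compare how strongly each feature separates churners from non-churners. The customer records and values below are entirely made up, and real churn modeling would use proper statistics (information gain, chi-square, etc.) rather than this naive mean-gap measure.

```python
# Toy customer records with a few OCDM-style attributes (invented values)
# and a churn label (1 = churned, 0 = stayed).
customers = [
    {"complaints_3m": 5, "contract_left_days": 10,  "arpu": 12.0, "churned": 1},
    {"complaints_3m": 4, "contract_left_days": 30,  "arpu": 15.0, "churned": 1},
    {"complaints_3m": 0, "contract_left_days": 300, "arpu": 35.0, "churned": 0},
    {"complaints_3m": 1, "contract_left_days": 250, "arpu": 30.0, "churned": 0},
]

def mean(xs):
    return sum(xs) / len(xs)

def separation(feature):
    churn = [c[feature] for c in customers if c["churned"]]
    stay = [c[feature] for c in customers if not c["churned"]]
    # Normalize the mean gap by the overall mean so features on
    # different scales can be compared.
    overall = mean(churn + stay)
    return abs(mean(churn) - mean(stay)) / overall

features = ["complaints_3m", "contract_left_days", "arpu"]
ranked = sorted(features, key=separation, reverse=True)
print(ranked)  # ['contract_left_days', 'complaints_3m', 'arpu']
```

The point is only the workflow: start from candidate features such as the OCDM list, measure relevance against the churn label, then keep the dimensions that actually carry signal.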

20 http://docs.oracle.com/cd/E11882_01/doc.112/e15886/data_mining_cdm.htm#autoId4

Churn modeling itself is quite a feat. There's a good book on the matter by Rob Mattison, given away for free, and also available in Google Books: http://books.google.com.mx/books/about/The_Telco_Churn_Management_Handbook.html?id=M_uuQx7vMngC&redir_esc=y

And of course, exercise. Teradata, in collaboration with Duke University and several CSPs, held a contest in 2003 on churn modeling. From the tournament's site at http://www.fuqua.duke.edu/centers/ccrm/ we can download the problem description and dataset, among other materials.

Wrapping up, a few words about dynamic pricing. I was wondering how it is related to big data analytics. It turns out dynamic pricing is a matter of yield management21, which can benefit from analytics, specifically forecasting techniques.
21 http://en.wikipedia.org/wiki/Yield_management

Here are some scenarios of dynamic pricing, which is basically an attempt at offering bandwidth to the customer at an attractive price, especially when network conditions allow it. This has something to do with giving flexibility to the subscriber: not forcing them to pick between subscribing to unlimited capacity (at a higher cost) or staying with limited bandwidth (barring them from using more services when needed). All in all, this can draw more people into subscribing to / staying with the CSP. These diagrams are copied from a whitepaper by Tango Telecom, Beyond Policy, a New Era in Real-Time Charging22.

This is an example of a case for real-time analytics over a stream of data coming in from the network elements (OSS), fed into a forecasting model (built from historical data), to figure out, for example, whether the traffic is low (so that offering a turbo boost to a customer wouldn't affect the rest), etc. For more use-cases in this area, the following whitepaper from Telcordia is a good source: Applying Yield Management in the Mobile Broadband Market23.
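The "offer a boost only when traffic is low" idea can be sketched as a tiny stream monitor. This is a minimal sketch under invented assumptions: the utilization samples, the 4-sample smoothing window, and the 0.4 threshold are all made up; a real system would use a forecasting model built from historical data, as noted above.

```python
from collections import deque

class LoadMonitor:
    """Moving average over the last N traffic samples from network elements."""
    def __init__(self, window=4, low_threshold=0.4):
        self.samples = deque(maxlen=window)
        self.low_threshold = low_threshold  # utilization below this = idle capacity

    def observe(self, utilization):
        self.samples.append(utilization)

    def offer_boost(self):
        # Offer a discounted "turbo boost" only when the smoothed load is low,
        # so the extra traffic will not degrade service for other subscribers.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) < self.low_threshold

monitor = LoadMonitor()
offers = []
# Simulated cell-utilization stream: busy at first, then quiet.
for t, load in enumerate([0.9, 0.8, 0.5, 0.3, 0.2, 0.2, 0.2]):
    monitor.observe(load)
    if monitor.offer_boost():
        offers.append(t)
print(offers)  # [5, 6]
```

Replacing the moving average with a proper forecast is what turns this from a reactive rule into yield management: the CSP can price the idle capacity it expects to have, not just the capacity it happens to have right now.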

22 http://www.tango.ie/dynamic-marketplace.html
23 http://bit.ly/ZXdXPV (www.telecomtv.com/DocSend.aspx?fileid...9f55...yield-management...)