
Contextualized Monitoring and Root Cause Discovery in IPTV Systems Using Data Visualization

Urban Sedlar, Mojca Volk, Janez Sterle, and Andrej Kos, University of Ljubljana
Radovan Sernec, Telekom Slovenije, d.d.

Abstract
This article describes the architecture and design of an IPTV network monitoring system and some of the use cases it enables. The system is based on distributed agents within IPTV terminal equipment (set-top box), which collect and send the data to a server where it is analyzed and visualized. In the article we explore how large amounts of collected data can be utilized for monitoring the quality of service and user experience in real time, as well as for discovering trends and anomalies over longer periods of time. Furthermore, the data can be enriched using external data sources, providing a deeper understanding of the system by discovering correlations with events outside of the monitored domain. Four supported use cases are described, among them using weather information for explaining away the IPTV quality degradation. The system has been successfully deployed and is in operation at the Slovenian IPTV provider Telekom Slovenije.

Internet Protocol television (IPTV) has become a vital part of modern triple play offerings and is being deployed worldwide [1]. However, the complexity of such systems is much greater than that of classical broadcast systems, where there was nothing but the air medium and an occasional relay node between the broadcaster and the subscriber. Modern IPTV solutions are a complex chain of systems where the path from source to destination of a video stream must cross multiple devices and multiple levels of network hierarchy. Thus, the potential for introduction of network and application-level errors is much greater. In addition, due to the maturity of traditional broadcast systems, users have cultivated high expectations of video quality as well as the quality of the overall service consumption experience; hence, it is imperative for any modern IPTV provider to focus on the assurance of quality of service (QoS) and quality of (user) experience (QoE) [2], as well as the required level of network and service monitoring to achieve such a goal.

The state of the art in IPTV network and service monitoring has significantly advanced in the last decade [3] and encompasses solutions ranging from advanced network probes to client-side probes. The former capture and analyze network traffic at key points in the network and provide detailed reporting, but lack the granularity associated with hundreds of thousands of subscribers [4]. On the other hand, the client-side solutions collect data at user premises and provide a high-resolution snapshot of an entire network, but are typically very rigid and limited to a set of predefined use cases [5]. The high cost of setting up and maintaining such systems plays an important role as well.

Attempting to break out of the constraints of commercially available systems and to address the issues of flexibility, ease of deployment, price, and the possibilities that the collected data provides, we have designed and implemented a scalable system for monitoring the state of an IPTV network. We have deployed it in the network of a medium-sized provider. This article presents the lessons we have learned along the way as well as the use cases we have identified by working with the provider and its support personnel.

The proposed solution is by nature a highly distributed system of probes, deployed at the end users' equipment (STB: set-top box). As such it provides a possibility of 100 percent network coverage, but the production coverage has been limited to approximately 5 percent of the terminal network nodes to limit the amount of collected data. By its nature, a video quality probe located as close to the user as possible can provide a wide variety of network and application-level metrics, which would be difficult to obtain by any other means. The software agent implemented as a part of the STB operating system is attached to key subsystems of the STB, including video decoding and network stacks, which can, in addition to network-level monitoring, provide a way for application-level monitoring. This makes it possible to also take into account any hypothetical decoder errors, which could, for example, arise due to a faulty decoder implementation. In that respect, the monitoring is end-to-end in the true sense of the term, and we strongly believe such an approach is applicable to many other systems as well, ranging from smart home automation [6] to Internet of Things (IoT) applications [7], as well as for increasing situational awareness by providing a network common operating picture (COP).

System Architecture
The presented end-to-end solution is designed as a loosely coupled system, comprising a server side capable of receiving, storing and processing messages in real time, and a large number of dedicated software agents running on distributed IPTV STBs across the IPTV provider's network. The data collection was initially implemented using a simple batch processing model, but the message-oriented approach presented here was later chosen, mainly driven by the needs of the IPTV provider to be able to monitor the health of the system in real time.

Figure 1 outlines the high-level system architecture. Each IPTV STB acts as an IPTV quality sensor node in a distributed network, capturing quality and telemetry-related data and sending it to the server side of the system. The data source (STB) generates both periodic messages and messages triggered by user activity (i.e., every time the user changes channel). At the server side, all generated messages are collected by a SNMP trap server and injected into a message queue, which broadcasts them to two subscribers.

The first subscribed process handles the long-term archival storage of the data in raw form, which serves many purposes: firstly, it provides a reference point to observe the actual system input before the data has been tampered with. From this archival database, messages can be replayed at any time, which simplifies the development and testing of different event-processing approaches. Additionally, the horizontally scalable key-value store for data archival is well suited to map-reduce analytics, which represents a promising approach for analysis of large quantities of data.

The second subscribed process performs message preprocessing; it operates as a parsing engine that interprets the binary values of the originating messages and converts them to structured objects with numeric and textual data values. In the process it also filters the data and discards the messages with invalid or erroneous combinations of values, which indicate that the message has been corrupted. Finally, the parsing module injects the message with structured data into the parsed message queue.

The parsed message queue also has two subscribers to which the messages are broadcast simultaneously. The first one is a real-time event processing subsystem, which can perform complex event processing and displays key performance indicators (KPIs) and metrics in the form of a dashboard, while the second subscriber stores the structured data in a relational database, where it can be queried by different analytics tools.

Figure 1. System architecture.

0890-8044/12/$25.00 2012 IEEE

IEEE Network November/December 2012

System Implementation
The described system has been developed and implemented in cooperation with the Slovenian IPTV provider Telekom Slovenije and an STB manufacturer. An agent was deployed on all STBs in the IPTV provider's network during an automated system-wide firmware upgrade. This allows us to achieve unprecedented user coverage for gathering telemetry information and different metrics of the video decoding process, which correlate tightly with the QoE. To accommodate such a system, and allow real-time collection, analytics, and visualization of data, an event-driven data analysis platform was assembled using readily available open source components, as described below.

Data Source
The source of the data is a distributed network of software agents implemented within the IPTV STBs. The agent is hooked into the video decoding and networking processes, which allows it to collect both network-level events (buffer underruns, Ethernet-level errors) and application-level events (MPEG transport stream discontinuities, channel change times, etc.). The agent can be controlled remotely by means of Simple Network Management Protocol (SNMP) messaging (enabling and disabling the reporting functionality, querying a limited set of parameters, and setting the reporting period). Once activated, the agent gathers information in real time and periodically reports the summary of video decoding and network-level events to the server side using standard SNMP trap messages. Reporting is performed either periodically or at each zapping event (i.e., each time the user changes the channel).

The following information is collected with each message, representing a set of IPTV telemetry metrics:
- Originating IP
- Date and time of message reception (used as the timestamp of an event)
- Duration of the interval being reported (either equal to the predefined reporting period or smaller, indicating channel change)
- Number of transport stream discontinuities in the reported time period
- Number of seconds with at least one transport stream discontinuity
- Number of buffer underruns
- Zapping time required to tune into the reported channel
- Multicast IP of the current channel
- Additional proprietary fields

Data Volume
All of the above information is transmitted either periodically or when a channel change occurs. The relevant information, together with UDP, IP, and Ethernet frame overhead yields 180 bytes on wire per SNMP message. Assuming the reporting period of 30 s and messages being evenly distributed throughout the day, a medium-sized provider with 100,000 subscribers would yield the required bandwidth of 4.8 Mb/s. An increase in peak hours has to be taken into account, raising by 30 percent the peak number of messages and required bandwidth due to zapping and an increased audience. However, this is still a modest data rate, which represents no problems for the event-driven modules of the system and leaves some room for growth. Nonetheless, the number of messages per day under such conditions would reach 288 million and would require a net size of 48 Gbytes to store. Additionally, to get any reasonable historical trending and analysis, more than one day of data would have to be considered.
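The figures quoted in the Data Volume section can be checked with a short calculation. The sketch below only restates that arithmetic; the function and constant names are illustrative, and reading "48 Gbytes" as binary gigabytes (2^30 bytes) applied to the on-wire message size is an assumption that happens to match the quoted number.

```python
# Back-of-the-envelope check of the data volume figures quoted in the text.
BYTES_ON_WIRE = 180        # per SNMP message, incl. UDP/IP/Ethernet overhead
SUBSCRIBERS = 100_000      # medium-sized provider
REPORTING_PERIOD_S = 30    # seconds between periodic reports
PEAK_FACTOR = 1.30         # 30 percent more messages in peak hours
SECONDS_PER_DAY = 86_400


def bandwidth_mbps(subscribers: int, period_s: int, bytes_per_msg: int) -> float:
    """Average bandwidth in Mb/s, assuming evenly distributed messages."""
    messages_per_second = subscribers / period_s
    return messages_per_second * bytes_per_msg * 8 / 1e6


def messages_per_day(subscribers: int, period_s: int) -> int:
    """Number of periodic messages generated per day."""
    return subscribers * SECONDS_PER_DAY // period_s


def storage_gib_per_day(messages: int, bytes_per_msg: int) -> float:
    """Daily storage in binary gigabytes (assumption: 1 Gbyte = 2**30 bytes)."""
    return messages * bytes_per_msg / 2**30


average = bandwidth_mbps(SUBSCRIBERS, REPORTING_PERIOD_S, BYTES_ON_WIRE)
peak = average * PEAK_FACTOR
daily_messages = messages_per_day(SUBSCRIBERS, REPORTING_PERIOD_S)
daily_storage = storage_gib_per_day(daily_messages, BYTES_ON_WIRE)

print(f"average bandwidth: {average:.2f} Mb/s, peak: {peak:.2f} Mb/s")
print(f"messages per day: {daily_messages:,}")
print(f"storage per day: {daily_storage:.1f} GiB")
```

Under these assumptions the average rate, the daily message count, and the daily storage agree with the 4.8 Mb/s, 288 million, and 48 Gbytes stated in the text.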

To keep the volume of data down, we increased the reporting period to 300 s, which reduces the numbers by a factor of 10; additionally, only 5 percent network coverage was chosen, further reducing the storage requirements to 370 Mbytes/day. It is important to note that data summarization cannot be used to reduce the storage requirements because some of the use cases below require temporal resolution of less than an hour.

Figure 2. Web-based dashboard for interactive visualization of different error types.

Server Side
The SNMP trap messages generated by the STB agents are collected by a standard SNMP trap server (Linux-based snmptrapd), extended by a lightweight trap handler script to push the messages in raw format into an inbound AMQP message queue (RabbitMQ). The trap handler also prepends some necessary metadata to the message: the originating IP address and SNMP trap reception timestamp. The inbound message queue delivers the message to the raw data archival module and the message parsing and filtering module.

The raw data archival module stores the messages to a key-value (NoSQL) database (Apache Cassandra) for archival purposes. Such an implementation allows high horizontal scalability and provides a future upgrade path for big-data analytics systems.

The message parsing and filtering module first deserializes the binary payload of the message and interprets the values of individual data fields. Next, each message is inspected to ensure its values satisfy a predefined set of constraints (i.e., that it does not contain erroneous combinations of values or enumerated fields with unsupported values). If the originating IP is whitelisted in the customer support database, the preprocessing module also performs a lookup in a DHCP log database to resolve the originating IP address into the MAC address of the STB, which serves to identify the user later in the process. Pre-processed events are pushed to the parsed message AMQP queue (RabbitMQ), which again has two subscribers: the event processing module and the parsed data storage module. The parsed data storage module stores the preprocessed data to a relational database (MySQL) where it is available for offline analysis by a set of external tools, while the event processing module performs a set of basic data summarizations in real time.

Data Analysis
As mentioned, two types of analytics are employed to accommodate both real-time needs and long-term historical trending. The traditional approach to long-term data analysis is based on performing queries against the data. The amount of data is large and persistent, which allows drilling down and refining the queries until the desired hypothesis is confirmed or rejected. However, large amounts of data also require long processing times, and the user is required to define the exact steps to be performed during the analysis [8]. This limits the potential user base to data scientists, statisticians, or anyone sufficiently familiar with the domain-specific knowledge of the system being analyzed. We have performed most of our offline analyses with the aim of extracting as much valuable information as possible from the dataset; some of our use cases are described in the next section. As our tools, we have used the Matlab package and Tableau for high-level analyses. Additionally, we have created a set of predefined visualization templates available through a web-based user interface (Fig. 2). Such charts, as well as the Matlab/Tableau analyses, are currently static and generated on demand.

The second approach to data analysis, on the other hand, is based on real-time event processing and is less common in traditional analytics tools; here, instead of running queries against the data, the data is sent through a set of queries. Once the data passes through, it is discarded; only the results remain, and the process cannot be repeated over the same data by simply readjusting the query. This allows us to execute complex event detection, analysis, and visualizations in real time by using a predefined set of rules within the event-driven subsystem. We have created a simple real-time dashboard displaying system-wide metrics and use Esper for complex event processing.

Figure 3. IPTV network topology map obtained by working back from the end nodes (tree leaves) and matching them to common ancestors. The first graph shows the entire network from the source (SRC) to MSANs with the BNGs numbered from 1 to 65; the second picture shows a zoomed-in section of a larger graph with an additional hierarchical level: the end users. Node size in both graphs is a function of logical distance from the source. Node color indicates percentage of transport stream errors (green is low, red is high; the color scale runs from 0.00% to 3.44%). BNGs are marked with numbers, while MSANs in the second graph are marked with a circle. End-node labels in both graphs were omitted for clarity. Created using the open source Gephi software.

Use Cases
In this section the scenario-specific aspects of the IPTV metrics, data collection, analysis, and visualization are presented. The described system has been deployed in cooperation with the Slovenian IPTV provider Telekom Slovenije. A consenting initial user base of 100 participants was added to the system in the testing phase; later, 6500 anonymous probes were activated in a manner that ensured all of the 65 broadband network gateways (BNGs) had a sample of N = 100 probes evenly distributed over all multiservice access nodes (MSANs). In addition to that, users with stream quality issues who give consent are added to the system when needed by the helpdesk personnel. In the timespan of 10 months, we have thus collected 200 million events and are currently receiving about 1 million events/day.

Application-Level IPTV Quality Monitoring
First and foremost, our goal was to estimate the quality of user experience and the degradation thereof, which can be achieved by simply monitoring the described application-level metrics through time. By establishing a baseline level of application-level metrics (e.g., transport stream discontinuities) and network-related metrics (e.g., Ethernet errors, buffer underruns) per user, per channel, per MSAN or per BNG, any significant increase in errors can be detected, suggesting a sub-par experience for the users. All metrics are displayed as time series on live charts (similar to Fig. 2), together with the predefined thresholds, which serve both as a guide for the operator and for triggering the alarms. Thus collected and visualized data has also confirmed large spikes of errors during different network maintenance procedures and can in the future help detect potentially unknown causes that affect the degradation of end users' experience.

Integration with Customer Support
There is a variety of other use cases for such data; by integrating a comprehensive reporting system with the operations support system (OSS)/business support system (BSS), the provider's helpdesk personnel can streamline the user complaint management: instead of opening a ticket and forwarding the problem to the field team, the data collection can be enabled on-the-fly if the user opts in over the phone. Thus collected data allows more than just confirmation of the subpar experience and can guide further decisions that need to be undertaken to remediate the problem.

By working with the helpdesk personnel, we have created a special group of users, which is not anonymized, but undergoes an additional address resolution process; a web-based interface is then used to add new users or manage existing ones. Once the reporting is enabled on the STB, the data starts flowing in and is received by the incoming message queue. It is stored in the raw message archival database in a similar (anonymized) way as described before; however, if the source IP of the message is matched in a helpdesk whitelist, it is sent to a medium access control (MAC) resolution process, which performs a lookup in the Dynamic Host Configuration Protocol (DHCP) logs to determine the MAC address of the user; this step is necessary, since the terminals are assigned new IP addresses when their DHCP lease expires. The data is then displayed in a near-real-time manner in a web interface, where the status of the network and application-level errors can be monitored for a specific user; this shortens the delay between a decision and its results from days down to hours and allows the errors to be caught as they happen.
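The MAC resolution step described under Integration with Customer Support can be illustrated with a minimal sketch. It assumes the DHCP log has already been parsed into lease records with a start and an expiry time; the `Lease` record, the message dictionary keys, and the function names are hypothetical and do not reflect the deployed system's schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Lease:
    """One parsed DHCP log entry (hypothetical layout)."""
    ip: str
    mac: str
    start: float  # lease start, seconds since epoch
    end: float    # lease expiry, seconds since epoch


def resolve_mac(leases: Iterable[Lease], ip: str, timestamp: float) -> Optional[str]:
    """Return the MAC that held `ip` at `timestamp`, or None if no lease matches."""
    best: Optional[Lease] = None
    for lease in leases:
        if lease.ip == ip and lease.start <= timestamp < lease.end:
            # If log entries overlap, prefer the most recently started lease.
            if best is None or lease.start > best.start:
                best = lease
    return best.mac if best else None


def annotate(message: dict, whitelist: Set[str], leases: Iterable[Lease]) -> dict:
    """Return a copy of the message; add 'mac' only for whitelisted source IPs."""
    out = dict(message)
    if message["ip"] in whitelist:
        out["mac"] = resolve_mac(leases, message["ip"], message["timestamp"])
    return out
```

Matching on the reception timestamp rather than on the latest lease is what keeps the lookup correct after an address has been reassigned to a different terminal; messages from sources outside the whitelist pass through without a MAC and therefore stay anonymous.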

Network Topology Mapping
Since a precise network topology map was unavailable to us due to the significant administrative investment required, we have recreated the network graph by exploiting the knowledge of IP addressing hierarchy. We obtained the network masks and names of BNGs, which allowed us to map the clusters of users to geography. We integrated the map-based view into our dashboards as seen in Fig. 2. Additionally, since the network is hierarchically subdivided, knowing the MSAN and BNG netmasks allowed us to reverse engineer and visually represent the entire network topology and present it in the form of a graph (Fig. 3), where the source is in the center, and each leaf node is one of the 6729 active users. Such a representation allows visual exploration of the network hierarchy and quick identification of problematic nodes by their color.

Error Localization
By having available a network topology map, it is further possible to localize errors in the network hierarchy. In Fig. 4 we have visualized the percentage of errors (normalized to the entire monitoring duration) per BNG over time. We have chosen the time resolution of 1 day and the topology resolution of the BNG level; by compacting all 65 BNG time series, we have created a long-term visualization of transport stream discontinuities (application-level errors on the STB) in the time span of 115 days, as shown in Fig. 4. This representation is well suited for visual analytics [8, 9] and allows the patterns to be discovered quickly. The first thing that becomes obvious is the horizontal streaks, which indicate long-running underperformance of an individual BNG. Additionally, some clustering is observable in the form of vertical streaks; this either implies there was a connection between geographically independent BNGs, which means the errors must have originated upstream, or that there was a similar usage pattern and the errors are a manifestation of similarly underprovisioned resources. The resolution of such visualizations can be increased both on the temporal axis (from days to hours or minutes) and on the network topology axis by expanding each BNG to MSANs or even individual users.

Figure 4. A heatmap of error severity (ratio of errored seconds over the monitored duration) for all channels; each pixel represents a single day on a single BNG. Minimum relative error in the picture is 0.005% (dark blue), maximum relative error is 8.20% (dark red); vertical axis represents broadband network gateways numbered from 1 to 65, horizontal axis represents days from Dec 1 2011 to Mar 23 2012. Bright horizontal streaks can be observed that indicate BNGs with worst performance (see example: BNG-17). Additionally, some vertical streaks are clearly observable (see example: day 103), which imply a channel-wide degradation felt over a large part of the network.

Figure 5. Error localization within the multicast tree. [Diagram labels: multicast stream; error propagation; monitored domain; overload in output queue; error detected; chain of STB, CPE, MSAN, aggregation, BRAS/BNG, core router, content provider.]

Additionally, since the errors in the multicast video distribution architecture propagate in the direction from the root of the hierarchical multicast tree to the STBs, it is possible to pinpoint the source of the error by correlating the reports from the terminal nodes and localizing the affected user sites in the network topology map. The concept is presented in Fig. 5, showing that a synchronous error occurrence in the monitored domain (STBs) can be used to pinpoint the common ancestor in the multicast chain (BNG) where the error likely originated. Such detection can be further contextualized by using additional data sources (e.g., network management system logs), which can provide a deeper understanding of the nature of detected events.
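The idea of pinpointing a common ancestor in the multicast tree can be sketched as follows. This is an illustration only: it assumes the topology is available as a child-to-parent map forming a tree, that the given STBs all reported errors in the same time window, and the node names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Return [node, parent, ..., root] by following the child-to-parent map."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def common_ancestor(nodes: Iterable[str], parent: Dict[str, str]) -> Optional[str]:
    """Deepest node that lies on the root path of every given node."""
    paths = [path_to_root(n, parent)[::-1] for n in nodes]  # root first
    ancestor = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # the paths diverge below the previous level
        ancestor = level[0]
    return ancestor


# Hypothetical topology: SRC -> BNG -> MSAN -> STB
parent = {
    "BNG-1": "SRC", "BNG-2": "SRC",
    "MSAN-1a": "BNG-1", "MSAN-1b": "BNG-1", "MSAN-2a": "BNG-2",
    "STB-1": "MSAN-1a", "STB-2": "MSAN-1a",
    "STB-3": "MSAN-1b", "STB-4": "MSAN-2a",
}

print(common_ancestor(["STB-1", "STB-2"], parent))
print(common_ancestor(["STB-1", "STB-3"], parent))
print(common_ancestor(["STB-1", "STB-4"], parent))
```

A practical detector would also check that STBs outside the candidate subtree are error-free in the same window before attributing the fault to that node; this sketch leaves that step out.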

Figure 6. Explaining away the video quality degradation by comparing IPTV decoding errors with weather radar map; red areas indicate high rainfall rate, which often coincides with lightning strikes. Blue dots represent BNGs with low error rates; magenta dot (indicated by the arrow) represents a BNG with high error rate. Visualization is sparse due to a small number of volunteers at the time of pilot deployment and the hour of report (noon). The errors were reported by the user as well and confirmed that they coincide with lightning. Weather radar map courtesy of Slovenian Environment Agency (ARSO).

Correlation with Weather Phenomena


It is well known that lightning strikes create a large amount of impulse noise, which manifests itself as an observable and/or audible disturbance in various communication systems. The effect is felt in analog wireless and wired communications as well as in digital systems (e.g., xDSL). IPTV systems without forward error correction (FEC) are especially susceptible to such disturbances, since a data corruption as small as a bit flip happening at the exactly right time (i.e., inside an intra-frame, which represents a starting point for decoding multiple seconds of subsequent video) can have significant effects. Such errors are highly localized and little can be done to eliminate them altogether. Therefore, recognizing the errors, which happen due to natural causes, is a vital step in the process of explaining away the unavoidable and focusing the attention on the preventable. A visualization of such an occurrence is shown in Fig. 6.
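The comparison in the article is visual, but the same check could in principle be automated. The sketch below is an assumption-laden illustration: it supposes that weather events (for example, radar cells with a high rainfall rate) are available as latitude, longitude, and time tuples, and that an error report counts as weather-related when such an event lies within a chosen radius and time window. The tuple layout and both thresholds are hypothetical.

```python
import math
from typing import Iterable, Tuple

WeatherEvent = Tuple[float, float, float]  # (latitude, longitude, unix time); hypothetical


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def coincides_with_weather(
    lat: float,
    lon: float,
    timestamp: float,
    events: Iterable[WeatherEvent],
    radius_km: float = 10.0,   # assumed spatial tolerance
    window_s: float = 300.0,   # assumed temporal tolerance
) -> bool:
    """True if any weather event is close to the error report in space and time."""
    return any(
        abs(timestamp - event_time) <= window_s
        and haversine_km(lat, lon, event_lat, event_lon) <= radius_km
        for event_lat, event_lon, event_time in events
    )
```

Error reports flagged this way could be set aside as unavoidable, leaving the remaining ones for closer inspection.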

Other Use Cases and Further Work
Since many of the described use cases rely on visual analytics, automation would present an important improvement of such a system. For that, the signature characteristics of individual events would have to be captured and suitable algorithms developed to enable automated event discovery.

In addition to the described cases, the collected data also conveys information about how the subscribers use and interact with the IPTV system. Such information can be used anonymously in concert with EPG data to provide ratings for individual TV shows. Furthermore, since every channel zapping is reported as well, a large sample of users can be used to detect near-synchronous channel changes, which would imply that undesirable content (e.g., ads) was being broadcast. Lastly, subject to users opting in, the personal TV activity data could in the future be stored without anonymization and mined, which would pave the way for different context-aware services. However, this scenario, although commonly employed on the web to serve better content to the users, is commonly shunned due to the sensitive nature of the data and raises red flags with the providers and regulators alike.

Conclusions
IPTV systems have been widely deployed around the world for years, but have yet to live up to their true potential. The return communication channel in such systems already serves as a basis for many innovative services, but it is often neglected as a means for sensing the state of the entire system from the end users' point of view. In this article, we have presented an architecture and implementation of a scalable real-time IPTV monitoring system, which has been deployed in cooperation with the Slovenian IPTV provider Telekom Slovenije. The system uses the existing STB terminals, upgraded with a data reporting agent to collect a variety of application and network-level metrics. To process large amounts of messages generated by the terminals, we employed a message-based event-driven model. Furthermore, we explored some of the use cases that can be supported by analyzing and visualizing large amounts of collected data. However, use cases described in this article represent just the tip of the iceberg when compared to the true potential of combining such data with a variety of external data sources.

Acknowledgments
The authors would like to thank the company Telekom Slovenije for excellent cooperation on the research and development project "Automated system for triple-play QoE measurements." Part of the work was supported by the Ministry of Higher Education, Science and Technology of Slovenia, the Slovenian Research Agency, and the Open Communication Platform Competence Center (OpComm).


References
[1] M. N. O. Sadiku and S. R. Nelatury, "IPTV: An Alternative to Traditional Cable and Satellite Television," IEEE Potentials, vol. 30, no. 4, July-Aug. 2011, pp. 44-46.
[2] M. Volk et al., "An Approach to Modeling and Control of QoE in Next Generation Networks [Next Generation Telco IT Architectures]," IEEE Commun. Mag., vol. 48, no. 8, Aug. 2010, pp. 126-35.
[3] P. Gupta, P. Londhe, and A. Bhosale, "IPTV End-to-End Performance Monitoring," Advances in Computing and Communication, Communications in Computer and Information Science, vol. 193, part 5, 2011, pp. 512-23.
[4] J. R. Goodall et al., "Preserving the Big Picture: Visual Network Traffic Analysis with TNV," IEEE Wksp. Visualization for Computer Security, 26 Oct. 2005, pp. 47-54.
[5] J. Valerdi, A. Gonzalez, and F. J. Garrido, "Automatic Testing and Measurement of QoE in IPTV Using Image and Video Comparison," 4th Int'l. Conf. Digital Telecommunications, 20-25 July 2009, pp. 75-81.
[6] M. Umberger, S. Lumbar, and I. Humar, "Modeling the Influence of Network Delay on the User Experience in Distributed Home-Automation Networks," Information Systems Frontiers, vol. 14, no. 3, July 2012, pp. 571-84.
[7] G. M. Lee and N. Crespi, "Shaping Future Service Environments with the Cloud and Internet of Things: Networking Challenges and Service Evolution," Leveraging Applications of Formal Methods, Verification, and Validation, LNCS, vol. 6415, 2010, pp. 399-410.
[8] D. A. Keim et al., "Visual Analytics: Scope and Challenges," Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, LNCS, Springer, 2008.
[9] L. Xiao, J. Gerth, and P. Hanrahan, "Enhancing Visual Analysis of Network Traffic Using a Knowledge Representation," 2006 IEEE Symp. Visual Analytics Science And Technology, Oct. 31-Nov. 2 2006, pp. 107-14.

Biographies
URBAN SEDLAR (urban.sedlar@fe.uni-lj.si) was awarded his Ph.D. from the Faculty of Electrical Engineering, University of Ljubljana, in 2010. He is currently working as a researcher at the Laboratory for Telecommunications of the Faculty of Electrical Engineering. His research focuses on Internet systems and web technologies, QoE in fixed and wireless networks, converged multimedia service architectures, and applications in IoT systems.

MOJCA VOLK (mojca.volk@fe.uni-lj.si) was awarded her Ph.D. from the Faculty of Electrical Engineering, University of Ljubljana, in 2010. She is currently with the Laboratory for Telecommunications as a researcher. Her main research interests include advanced fixed-mobile communications systems and services, converged contextualized and IoT solutions, and analysis, visualization, admission control, and quality assurance areas in the next-generation multimedia systems and services.

JANEZ STERLE (janez.sterle@fe.uni-lj.si) graduated in 2003 from the Faculty for Electrical Engineering, University of Ljubljana, where he is currently pursuing a postgraduate degree. His educational, research, and development work is oriented toward design and development of next-generation networks and services. Current research areas include next-generation Internet protocol, network security, traffic analyses, QoE modeling, QoS measuring, and development and deployment of new integrated services into fixed and mobile networks.

RADOVAN SERNEC (radovan.sernec@telekom.si) was awarded his Ph.D. from the Faculty of Electrical Engineering, University of Ljubljana, in 2000. He is working for Telekom Slovenia's R&D department as a senior researcher and strategist. His research interests include network architectures and topologies of interconnection networks, also for data centers, sustainable renewable energy models for telco operators, and innovation management within enterprises.

ANDREJ KOS (andrej.kos@fe.uni-lj.si) is an assistant professor at the Faculty of Electrical Engineering, University of Ljubljana. He has extensive research and industrial experience in analysis, modeling, and design of advanced telecommunications elements, systems, and services. His current work focuses on managed broadband packet switching and next-generation intelligent converged services.